# Lecture 18: Image Processing with Machine Learning

In this notebook, we begin exploring image processing and machine learning. We'll first look at how images, both grayscale and color, are represented as data.

Then, in a hands-on activity, we will use the MNIST dataset, a collection of handwritten digits, to train a Random Forest classifier. The exercise illustrates how machine learning models can interpret and classify image data.

By the end of this notebook, you should have an introduction to:
- How digital images are structured as data.
- The process of converting images into a format suitable for machine learning.
- Training a machine learning model using image data.
- Evaluating the model's performance in classifying images.
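
As a rough illustration of the first two points, a grayscale image can be treated as a 2D array of intensities and a color image as a 3D array with one layer per channel. The sketch below uses tiny made-up arrays (not a real image), and the names `gray`, `color`, and `features` are hypothetical.

```python
import numpy as np

# A made-up 2x3 grayscale image: one brightness value per pixel (0 = black, 255 = white)
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)
print(gray.shape)   # (rows, columns)

# A made-up 2x3 color image: three values (red, green, blue) per pixel
color = np.zeros((2, 3, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]   # top-left pixel is pure red
print(color.shape)  # (rows, columns, channels)

# Flattening turns the pixel grid into one row of features for a model
features = gray.ravel()
print(features)
```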


### Set up imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from skimage import data
from skimage.color import rgb2gray
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### Load built-in image from skimage

In [None]:
# load a microscope image of immunohistochemistry (IHC) staining from skimage.data


# display the image



### Grayscale images
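
One common way to convert color to grayscale is a weighted sum of the red, green, and blue channels. The sketch below applies the luminance weights documented for `skimage.color.rgb2gray` (0.2125, 0.7154, 0.0721) to a small made-up image with float values in [0, 1]. The names `rgb`, `weights`, and `gray_manual` are illustrative only.

```python
import numpy as np

# Made-up 1x3 RGB image with float values in [0, 1]: pure red, pure green, pure blue
rgb = np.array([[[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]]])

# Luminance weights (assumed to match those documented for skimage's rgb2gray)
weights = np.array([0.2125, 0.7154, 0.0721])

# Weighted sum over the last (channel) axis collapses 3 channels into 1
gray_manual = rgb @ weights
print(gray_manual.shape)
print(gray_manual)
```

With these weights, a pure green pixel comes out much brighter than a pure blue one, even though both have a single channel at full intensity.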

In [None]:
# convert the image to grayscale


# display the image


In [None]:
# look at the data of the image


# look at the shape of the image data


In [None]:
# convert the image to a dataframe


# inspect the grayscale dataframe


### Plot the first column as a line plot
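
A sketch of the idea, assuming a grayscale image stored in a DataFrame: each DataFrame column holds one vertical strip of pixels, so the first column traces brightness from top to bottom along the left edge. The array below is made up, and `gray_df` and `first_column` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up 4x3 grayscale image whose left edge fades from dark (top) to bright (bottom)
gray = np.array([[0.0, 0.5, 0.5],
                 [0.2, 0.5, 0.5],
                 [0.6, 0.5, 0.5],
                 [1.0, 0.5, 0.5]])

# Rows of the DataFrame are pixel rows; columns are pixel columns
gray_df = pd.DataFrame(gray)

# The first column is the left-most strip of pixels, indexed by row number
first_column = gray_df[0]
print(first_column)
```

Calling `first_column.plot()` would then draw these values as a line, with the row number on the x-axis and the intensity on the y-axis.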

In [None]:
# Plot the first column as a line plot


### Color images
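
As a sketch of what the cells in this section work toward, the example below slices a made-up color array into channels, flattens one with `ravel`, and counts pixel values with `np.histogram`. It assumes the channel order is red, green, blue along the last axis, as for images returned by `skimage.data`.

```python
import numpy as np

# Made-up 2x2 RGB image (uint8), channel order assumed to be R, G, B
color = np.array([[[255, 0, 0], [0, 255, 0]],
                  [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Index the last axis to isolate each channel as a 2D array
red = color[:, :, 0]
green = color[:, :, 1]
blue = color[:, :, 2]

# ravel flattens the 2D channel into a 1D array, row by row
red_flat = red.ravel()
print(red.shape, red_flat.shape)

# Count how many red values fall in the lower and upper half of the 0-255 range
counts, bin_edges = np.histogram(red_flat, bins=2, range=(0, 256))
print(counts)
```

Flattening matters for histograms because `plt.hist` treats each column of a 2D input as a separate dataset, which is usually not what is wanted for a single channel.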

In [None]:
# look at color image data


In [None]:
# look at the shape of the color image data


In [None]:
# isolate the red, green, and blue channels


# look at the red channel


In [None]:
# inspect what the ravel function does


In [None]:
# plot a histogram of the red, green, and blue channels


In [None]:
# plot red image


In [None]:
# plot green image


In [None]:
# plot blue image


## Machine Learning Model Training Activity

**Objective**: In this activity, our goal is to apply what we've learned about image data representation to a practical machine learning task. We will use the MNIST dataset, which consists of thousands of handwritten digits, as our data source. Our challenge is to train a Random Forest classifier to accurately predict the digits based on their image data.

### Steps:
1. **Load the MNIST dataset**: We'll start by loading the dataset, which has been pre-split into features (`X`) and labels (`y`).
2. **Inspect the dataset**: It's always a good idea to visually inspect the first image in the dataset and understand its structure.
3. **Prepare the data**: We will perform a train-test split, reserving 20% of the dataset for testing our model.
4. **Train a Random Forest classifier**: Using the training data, we'll train a Random Forest model with specified hyperparameters.
5. **Evaluate the model's performance**: After training, we'll test the model on the unseen test data to assess its accuracy.
6. **Analyze misclassifications**: Finally, we'll dive deeper into the model's predictions, identifying the most commonly misclassified digits and analyzing possible reasons for these errors.
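
The steps above can be sketched end to end on a much smaller stand-in dataset. The example below swaps MNIST for scikit-learn's bundled `load_digits` (1,797 images of 8x8 pixels) so it runs quickly without a download. The split seed `random_state=0` is an arbitrary choice for illustration, and results on MNIST will differ.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Steps 1-2: load a small digits dataset (stand-in for MNIST) and check its shape
digits_X, digits_y = load_digits(return_X_y=True)
print(digits_X.shape)  # each row is one flattened 8x8 image

# Step 3: hold out 20% of the rows for testing (seed chosen arbitrarily)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    digits_X, digits_y, test_size=0.2, random_state=0
)

# Step 4: train a Random Forest with the hyperparameters named in this activity
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(Xd_train, yd_train)

# Step 5: score predictions on the held-out rows
digits_accuracy = accuracy_score(yd_test, forest.predict(Xd_test))
print(round(digits_accuracy, 3))
```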


### Load the MNIST dataset

In [None]:
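# load MNIST as a pandas DataFrame (features) and Series (labels)
# as_frame=True is set explicitly because later cells use X.iloc
mnist = fetch_openml('mnist_784', version=1, as_frame=True)

X, y = mnist["data"], mnist["target"]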

In [None]:
# display the X dataframe


In [None]:
# display the y series


In [None]:
assert X.shape == (70000, 784)
assert y.shape == (70000,)

### Inspect first image
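
Each row of `X` holds 784 pixel values, which is a 28x28 image flattened row by row. The sketch below shows that round trip on a made-up array (not MNIST data); `flat` and `grid` are illustrative names.

```python
import numpy as np

# Made-up stand-in for one row of X: 784 values numbered 0..783
flat = np.arange(784)

# reshape fills the 28x28 grid row by row, so the first 28 values form the top row
grid = flat.reshape(28, 28)
print(grid.shape)
print(grid[0, :3], grid[1, :3])

# ravel undoes the reshape and recovers the original 784-value vector
assert np.array_equal(grid.ravel(), flat)
```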

In [None]:
# take the 784 pixel values of the first row and reshape them into a 28x28 image
first_image_data = X.iloc[0].to_numpy()
first_image = first_image_data.reshape(28, 28)

plt.imshow(first_image, cmap="gray")

In [None]:
# print the label of the first image


In [None]:
assert first_label == '5'

### Run train-test split
Save 20% of the data for testing.
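
A minimal sketch of a 20% hold-out with `train_test_split`, using a made-up array of 10 rows rather than MNIST. The names `toy_X` and `toy_y` and the seed `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 4 features each, and one label per sample
toy_X = np.arange(40).reshape(10, 4)
toy_y = np.arange(10)

# test_size=0.2 reserves 20% of the rows (2 of 10) for testing
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)
print(toy_X_train.shape, toy_X_test.shape)

# Rows and labels are shuffled together, so each label still matches its row
assert all(toy_X_test[i, 0] == 4 * toy_y_test[i] for i in range(len(toy_y_test)))
```

Applied to the 70,000 MNIST rows, the same 20% fraction gives 56,000 training rows and 14,000 test rows.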

In [None]:
# Run train-test split, and save 20% of data for testing


In [None]:
assert X_train.shape == (56000, 784)
assert X_test.shape == (14000, 784)
assert y_train.shape == (56000,)
assert y_test.shape == (14000,)

### Train a Random Forest classifier
Use the hyperparameters `n_estimators=100` and `random_state=42`.

In [None]:
# Train a random forest classifier on the training data


### Evaluate the model's performance
Make predictions for the test data, and use the scikit-learn function `accuracy_score` to evaluate the model.
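
Accuracy is the fraction of predictions that match the true labels. The sketch below checks this on made-up string labels (the MNIST targets in this notebook are strings such as `'5'`); `true_labels` and `predicted_labels` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 3 of the 4 predictions are correct
true_labels = np.array(['5', '0', '4', '1'])
predicted_labels = np.array(['5', '0', '9', '1'])

# accuracy_score returns correct predictions divided by total predictions
toy_accuracy = accuracy_score(true_labels, predicted_labels)
print(toy_accuracy)

# The same number computed by hand: the mean of an element-wise comparison
manual_accuracy = np.mean(true_labels == predicted_labels)
assert toy_accuracy == manual_accuracy
```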

In [None]:
# Predict the labels of the test data and calculate the accuracy

In [None]:
# compare after rounding, since the raw accuracy is a float with more digits
assert round(accuracy, 3) == 0.967

### Determine the most commonly misclassified digit
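
One way to tally errors per digit, sketched on made-up labels. This assumes mismatches are counted under the true digit (the label from `y_test`) rather than the predicted one; the activity text does not specify which, so treat that as an assumption. All names here are illustrative.

```python
# Made-up true and predicted labels (strings, like the MNIST targets)
toy_true = ['9', '4', '9', '0', '7', '9']
toy_pred = ['4', '4', '7', '0', '1', '9']

# Start every digit at zero so digits with no errors still appear
toy_mismatches = {str(digit): 0 for digit in range(10)}

# Walk both lists in step; count an error under the true digit when they differ
for true_label, pred_label in zip(toy_true, toy_pred):
    if true_label != pred_label:
        toy_mismatches[true_label] += 1

# The digit with the highest count is the most commonly misclassified
most_missed = max(toy_mismatches, key=toy_mismatches.get)
print(toy_mismatches)
print(most_missed)
```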

In [None]:
# loop over y_test and y_pred and print the first 10 mismatches


In [None]:
# now make a dictionary where keys are digits, and values are the number of mismatches for that digit


In [None]:
assert mismatch_dictionary['0'] == 20
assert mismatch_dictionary['9'] == 73

### End of Activity