<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>


---

<center>
    <h1><font color="red">Image Classification with Scikit-Learn</font></h1>
</center>

## Useful Links

- <a href="https://www.dataquest.io/blog/sci-kit-learn-tutorial/">Scikit-learn Tutorial: Machine Learning in Python</a>
- <a href="https://debuggercafe.com/image-classification-with-mnist-dataset/">Image Classification with MNIST Dataset</a>
- <a href="https://davidburn.github.io/notebooks/mnist-numbers/MNIST%20Handwrititten%20numbers/">MNIST handwritten number identification</a>

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
%matplotlib inline
import numpy as np
import scipy.stats as stats

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd

In [None]:
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import metrics

In [None]:
print(f"Numpy version:        {np.__version__}")
print(f"Pandas version:       {pd.__version__}")
print(f"Seaborn version:      {sns.__version__}")
print(f"Scikit-Learn version: {sklearn.__version__}")

## <font color="red"> MNIST Dataset</font>

- The <A HREF="https://en.wikipedia.org/wiki/MNIST_database"> MNIST database</A> (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.
- The database is also widely used for training and testing in the field of machine learning.
- The dataset we will be using contains 70000 images of handwritten digits among which 10000 are reserved for testing.
- This dataset is suitable for anyone who wants to get started with image classification using Scikit-Learn.

### Obtain the Dataset

In [None]:
from sklearn.datasets import fetch_openml
mnist_data = fetch_openml('mnist_784', version=1)

In [None]:
print(mnist_data.DESCR)

### Features of the Dataset

In [None]:
print("Keys: ", mnist_data.keys())

**Note that the `data` and `target` are already separated.**

In [None]:
print(f"Shape of Data: {mnist_data.data.shape}")

In [None]:
print(f"Datatype of Data: {type(mnist_data.data)}")

In [None]:
print(f"Shape of the Target Data: {mnist_data.target.shape}")

In [None]:
print(f"Datatype of Target Data: {type(mnist_data.target)}")

In [None]:
print(f"Feature Names: {mnist_data.feature_names}")

In [None]:
print(f"Url: {mnist_data.url}")

Extract the feature and target arrays:

In [None]:
np_data, np_target = mnist_data['data'], mnist_data['target']

In [None]:
print(f' Shape of data:   {np_data.shape}')
print(f' Shape of target: {np_target.shape}')

**Checking the Data**

In [None]:
len(np.unique(np_data))

In [None]:
np_data.values[0]

In [None]:
len(np.unique(np_data.values[0]))

**Checking the Target**

In [None]:
print(f"Datatype of the target values: {np_target.dtype}")

In [None]:
np_target[0]

In [None]:
type(np_target[0])

Print a few values:

In [None]:
print(np_target[0:5])

Change the labels from strings to integers:

In [None]:
np_target = np_target.astype(np.uint8)

In [None]:
print(np_target[0:5])

Print the unique labels:

In [None]:
np.unique(np_target)

In [None]:
np_target.value_counts()

In [None]:
np_target.value_counts().sum()

In [None]:
total = np_target.value_counts().sort_values(ascending=False)
percent = (np_target.value_counts()/np_target.count()).sort_values(ascending=False)*100
percent_data = pd.concat([total, percent], axis=1, 
                         keys=['Total', 'Percent'])
percent_data

<font color="blue"> 
There are 70000 images, each stored as an array of 784 numbers giving the intensity of each pixel. An image can be displayed by reshaping its array into a 28x28 array and plotting it with matplotlib.
</font>
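The reshaping step can be checked on synthetic data before touching the real images. The sketch below uses a randomly generated 784-element vector (a stand-in for one MNIST row, not actual data) and assumes the usual row-major layout, in which pixel `(row, col)` of the 28x28 image is element `row * 28 + col` of the flat vector.

```python
import numpy as np

# Synthetic stand-in for one MNIST row: 784 pixel intensities in [0, 255].
rng = np.random.default_rng(0)
flat_pixels = rng.integers(0, 256, size=784)

# Reshape the flat vector (row-major order) into a 28x28 image.
toy_image = flat_pixels.reshape(28, 28)

# Pixel (row, col) of the image is element row*28 + col of the flat vector.
row, col = 5, 17
same_pixel = bool(toy_image[row, col] == flat_pixels[row * 28 + col])

# Flattening the image recovers the original vector.
restored = toy_image.ravel()
print(toy_image.shape, same_pixel, np.array_equal(restored, flat_pixels))
```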

In [None]:
some_index = 15657
some_digit = np_data.values[some_index]
some_digit_image = some_digit.reshape(28,28)

plt.imshow(some_digit_image, 
           cmap = matplotlib.cm.binary, 
           interpolation='nearest')
plt.axis('off')  # call the function; assigning to plt.axis would overwrite it

Let us find the target for row `some_index`:

In [None]:
np_target[some_index]

In [None]:
np_data.values.shape

**Display a few images**

In [None]:
import random

def display_digits(X, y):
    """
      Given an array of images of digits X and 
      the corresponding values of the digit y,
      this function plots 96 unique randomly selected images 
      and their values.
    """
    # Figure size (width, height) in inches
    fig = plt.figure(figsize=(8, 6))

    # Adjust the subplots 
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, 
                        hspace=0.05, wspace=0.05)

    num_images = X.shape[0]
    
    num_selected_images = 96
    row_indices = random.sample(range(num_images), num_selected_images)
    
    i = 0
    for idx in row_indices:
        # Initialize the subplots: 
        # Add a subplot in the grid of 8 by 12, at the i+1-th position
        ax = fig.add_subplot(8, 12, i + 1, xticks=[], yticks=[])
        
        # Display an image at the i-th position
        ax.imshow(X[idx].reshape(28, 28), cmap=plt.cm.binary, 
                  interpolation='nearest')
       
        # label the image with the target value
        ax.text(0, 7, str(y[idx]))
        i += 1

    # Show the plot
    plt.show()

In [None]:
display_digits(np_data.values, np_target)

### <font color="red">Model Selection Process</font>

![fig_skl](https://miro.medium.com/max/1400/1*LixatBxkewppAhv1Mm5H2w.jpeg)
Image Source: Christophe Bourguignat

- A Machine Learning algorithm needs to be trained on a set of data to learn the relationships between different features and how these features affect the target variable. 
- We need to divide the entire data set into two sets:
    + Training set on which we are going to train our algorithm to build a model. 
    + Testing set on which we will test our model to see how accurate its predictions are.
    
Before we create the two sets, we need to identify the algorithm we will use for our model.
We can use the machine learning map (shown at the top of this section) as a cheat sheet to shortlist the algorithms that we can try out to build our prediction model. 

### <font color="red">Separating the Training and Testing Set</font>

- The first 60000 (among the 70000) images are used for training.
- The remaining 10000 images are used for testing.

In [None]:
num_train = 60000

X_train = np_data.values[:num_train]
X_test  = np_data.values[num_train:]
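# Convert the labels to NumPy arrays so that later integer-array indexing
# (shuffling, cross-validation folds) is positional, as it is for X_train.
y_train = np_target[:num_train].to_numpy()
y_test  = np_target[num_train:].to_numpy()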

In [None]:
print(f' Train Data:  {X_train.shape}')
print(f' Test Data:   {X_test.shape}')
print(f' Train label: {y_train.shape}')
print(f' Test Label:  {y_test.shape}')

**Shuffle the training set:**

In [None]:
nn = X_train.shape[0]
shuffle_index = np.random.permutation(nn)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
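Shuffling is only safe if the images and their labels are reordered with the same permutation. The toy sketch below (hypothetical arrays, assuming both are NumPy arrays) encodes the label inside the feature so that the alignment can be checked after shuffling.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 6 samples whose single feature encodes the label (feature = 10 * label).
toy_labels = np.array([0, 1, 2, 3, 4, 5])
toy_features = (10 * toy_labels).reshape(-1, 1)

# One permutation, applied to both arrays.
order = rng.permutation(len(toy_labels))
shuffled_features = toy_features[order]
shuffled_labels = toy_labels[order]

# The rows moved, but each feature row still sits next to its own label.
still_aligned = np.array_equal(shuffled_features[:, 0], 10 * shuffled_labels)
print(order, still_aligned)
```

Note that indexing a pandas `Series` with an integer array selects by index label rather than by position, which is why this sketch sticks to NumPy arrays.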

###  <font color="red">Training a Binary Classifier</font>

- Binary classification means there are two classes to work with that relate to one another as `true` and `false`.
- Here, we want to identify a single digit: the `9`s.
- The classification will tell us if we have a `9` (true) or not (false).

**Set the target arrays as boolean arrays:** true if 9 otherwise false.

In [None]:
y_train_9 = (y_train == 9)
y_test_9 = (y_test == 9)

**Create and train the model:**

- We use `SGDClassifier`, which fits a regularized linear model using Stochastic Gradient Descent (SGD) learning.
- SGD is a general optimization method: scikit-learn uses it in estimators for both classification (`SGDClassifier`) and regression (`SGDRegressor`).

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_9)

**Make an initial prediction:**

In [None]:
some_index = 15657
some_digit = np_data.values[some_index]
print(np_target[some_index])

In [None]:
some_digit_predict = sgd_clf.predict([some_digit])

In [None]:
some_digit_predict

**Measuring accuracy using cross validation**

- The `StratifiedKFold` class performs stratified sampling to produce folds that contain a representative ratio of each class. 
- At each iteration the code creates a clone of the classifier, trains that clone on the training fold and then makes predictions on the test fold. 
- It then counts the number of correct predictions and outputs the ratio of correct predictions.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
n_splits = 3
skfolds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

kfold_scores = list()
idx = 0
for train_index, test_index in skfolds.split(X_train, y_train_9):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_9[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_9[test_index]
    
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    scr = n_correct/len(y_pred)
    kfold_scores.append(scr)
    print(f"Test: {idx} -- Score: {scr}")
    idx += 1

In [None]:
print(f"{n_splits}-fold average score: {np.mean(np.array(kfold_scores))}")
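To see what "stratified" means, the folds can be built by hand on toy labels. The function below is a simplified illustration, not scikit-learn's actual algorithm: `stratified_fold_ids` is a hypothetical helper that deals the samples of each class out to the folds in turn, so that every fold keeps the overall class ratio.

```python
import numpy as np

def stratified_fold_ids(labels, n_splits, seed=0):
    """Assign each sample a fold id so every fold keeps the class ratio of labels."""
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        # Shuffle the positions of this class, then deal them out round-robin.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_ids[idx] = np.arange(len(idx)) % n_splits
    return fold_ids

# Toy labels with 10% positives, similar to the "9 vs. not 9" imbalance.
toy_labels = np.array([False] * 270 + [True] * 30)
toy_fold_ids = stratified_fold_ids(toy_labels, n_splits=3)

fold_sizes = [int(np.sum(toy_fold_ids == k)) for k in range(3)]
positive_rates = [float(toy_labels[toy_fold_ids == k].mean()) for k in range(3)]
print(fold_sizes)      # [100, 100, 100]
print(positive_rates)  # [0.1, 0.1, 0.1]
```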

In [None]:
from sklearn.model_selection import cross_val_score

validation = cross_val_score(sgd_clf, X_train, y_train_9, 
                             cv=3, 
                             scoring='accuracy', 
                             verbose=1)

print(validation)

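The scikit-learn `cross_val_score` function performs the same procedure in a single call. The scores should be close to those of the manual loop, though not necessarily identical, because the fold assignments differ (the loop above shuffles the data before splitting).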

In [None]:
percent_nines = (sum(np_target==9)/len(np_target))*100
print(f'{np.max(validation)*100:.2f}% accuracy might not be as impressive as it sounds \n when only {percent_nines:.2f}% of the images in the dataset are 9s')
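The reason accuracy is misleading here can be shown with a toy baseline. The sketch below uses made-up labels with one positive in ten and a "classifier" that always answers "not a 9": it reaches 90% accuracy without finding a single 9.

```python
import numpy as np

# Toy labels with the same kind of imbalance as "9 vs. not 9": 1 positive in 10.
toy_true = np.array([False] * 900 + [True] * 100)

# A "classifier" that never predicts the positive class.
toy_pred = np.zeros_like(toy_true, dtype=bool)

baseline_accuracy = float(np.mean(toy_pred == toy_true))
nines_found = int(np.sum(toy_pred & toy_true))
print(f"accuracy = {baseline_accuracy:.2f}, 9s found = {nines_found}")
# accuracy = 0.90, 9s found = 0
```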

**Confusion matrix**

- A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. 
- It can be used to evaluate the performance of a classification model through the calculation of performance metrics like accuracy, precision, recall, and F1-score.
- The confusion matrix is a much better way to evaluate the performance of a classifier, especially on a skewed dataset such as this one, where only about 10% of the images belong to the target class.
- Each row represents an actual class and each column a predicted class:
   * The first row holds the negative cases (non-9s): the top left contains the correctly classified non-9s (True Negatives), the top right the non-9s incorrectly classified as 9s (False Positives).
   * The second row holds the positive class (9s): the bottom left contains the 9s incorrectly classified as non-9s (False Negatives), the bottom right the correctly classified 9s (True Positives).


|     | Predicted non-9 | Predicted 9 |
| --- |:---   |:--- |
| **Actual non-9** | True Negative  | False Positive |
| **Actual 9**     | False Negative | True Positive |

We first need a set of predictions to compare to the actual targets:

In [None]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3)

In [None]:
cf_matrix = metrics.confusion_matrix(y_train_9, y_train_pred)

print(f"Confusion Matrix: \n{cf_matrix}")
print(f"\n Number of images: {np.sum(cf_matrix)}")
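The four cells can also be counted by hand, which makes the layout explicit. The helper below, `binary_confusion_matrix`, is a hypothetical function written for illustration on toy labels. It follows scikit-learn's convention for `confusion_matrix`: rows are actual classes, columns are predicted classes, and the negative class comes first.

```python
import numpy as np

def binary_confusion_matrix(actual, predicted):
    """Return [[TN, FP], [FN, TP]] with rows = actual, columns = predicted."""
    actual = np.asarray(actual, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    tn = np.sum(~actual & ~predicted)   # non-9 correctly rejected
    fp = np.sum(~actual & predicted)    # non-9 wrongly called a 9
    fn = np.sum(actual & ~predicted)    # 9 that was missed
    tp = np.sum(actual & predicted)     # 9 correctly found
    return np.array([[tn, fp], [fn, tp]])

toy_actual    = [False, False, False, False, True, True,  True, False]
toy_predicted = [False, True,  False, False, True, False, True, False]

toy_cm = binary_confusion_matrix(toy_actual, toy_predicted)
print(toy_cm)
# [[4 1]
#  [1 2]]
```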

We can visualize the confusion matrix:

In [None]:
cm = metrics.confusion_matrix(y_train_9, y_train_pred, 
                              labels=sgd_clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=sgd_clf.classes_)
disp.plot() 

#### Precision/Recall
- Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced.
- Precision measures the number of true positives (correctly classified 9s) as a ratio of the total samples classified as a 9: $\frac{T_P}{T_P + F_P}$
- Recall measures the number of true positives as a ratio of the total number of positives: $\frac{T_P}{T_P + F_N}$.
- The precision-recall curve shows the tradeoff between precision and recall for different thresholds. 
  - A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. 
  - A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. 
  - An ideal system with high precision and high recall will return many results, with all results labeled correctly.
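The two formulas above are easy to evaluate directly. The sketch below uses hypothetical counts (not results from this notebook) for two imaginary classifiers: a "cautious" one with high precision and low recall, and an "eager" one with the opposite behavior. Returning `0.0` when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(tp, fp, fn):
    """Compute precision = TP/(TP+FP) and recall = TP/(TP+FN) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for a dataset that contains 100 true 9s.
cautious = precision_recall(tp=20, fp=0, fn=80)    # few predictions, all correct
eager = precision_recall(tp=90, fp=210, fn=10)     # many predictions, most wrong

print(f"cautious: precision={cautious[0]:.2f}, recall={cautious[1]:.2f}")
print(f"eager:    precision={eager[0]:.2f}, recall={eager[1]:.2f}")
# cautious: precision=1.00, recall=0.20
# eager:    precision=0.30, recall=0.90
```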

In [None]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3, method='decision_function')

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_9, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(12,8))
    plt.title('Precision and recall vs decision threshold')
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0,1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
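The tradeoff in the plot can be reproduced on a handful of made-up decision scores. In the sketch below, `toy_scores` and `toy_is_nine` are hypothetical values, and `precision_recall_at` is a helper written for this example. An image is predicted to be a 9 when its score is at least the threshold. In this toy case, raising the threshold increases precision and decreases recall.

```python
import numpy as np

# Hypothetical decision scores and true labels for 8 images.
toy_scores = np.array([-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0])
toy_is_nine = np.array([False, False, False, True, False, True, True, True])

def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for scores >= threshold."""
    predicted = scores >= threshold
    tp = np.sum(predicted & labels)
    fp = np.sum(predicted & ~labels)
    fn = np.sum(~predicted & labels)
    # Convention for this sketch: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

toy_results = []
for threshold in (-1.0, 0.0, 1.5):
    p, r = precision_recall_at(threshold, toy_scores, toy_is_nine)
    toy_results.append((p, r))
    print(f"threshold={threshold:+.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=-1.0  precision=0.67  recall=1.00
# threshold=+0.0  precision=0.75  recall=0.75
# threshold=+1.5  precision=1.00  recall=0.50
```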

### <font color="red">Training and Prediction on the Entire Dataset</font>

- We will use the Stochastic Gradient Descent classifier (SGD). 
- Scikit-Learn’s SGDClassifier is a good starting point for linear classifiers. 
- Using the loss parameter we will see how Support Vector Machine (Linear SVM) and Logistic Regression perform for the same dataset.


#### Using Linear Support Vector Machine (SVM)
- We use linear SVM with stochastic gradient descent (SGD) learning.
- The gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (that is, a decreasing learning rate).
- To use the Linear SVM Classifier, we need to set the loss parameter to `hinge`. 
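The idea of per-sample updates with a decreasing step size can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic two-class data, not scikit-learn's implementation: the regularization strength `alpha`, the initial rate `eta0`, and the `eta0 / (1 + 0.01 * step)` schedule are all assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: two well-separated blobs with labels -1 and +1.
n_per_class = 100
X_toy = np.vstack([rng.normal(-2.0, 0.5, size=(n_per_class, 2)),
                   rng.normal(+2.0, 0.5, size=(n_per_class, 2))])
y_toy = np.array([-1] * n_per_class + [1] * n_per_class)

w = np.zeros(2)   # weights of the linear model
b = 0.0           # intercept
alpha = 1e-3      # L2 regularization strength (assumed value)
eta0 = 0.1        # initial learning rate (assumed value)
step = 0

for epoch in range(5):
    for i in rng.permutation(len(y_toy)):
        step += 1
        eta = eta0 / (1.0 + 0.01 * step)          # decreasing step size
        margin = y_toy[i] * (X_toy[i] @ w + b)
        # Subgradient of alpha/2 * ||w||^2 + max(0, 1 - margin) for ONE sample.
        grad_w = alpha * w
        grad_b = 0.0
        if margin < 1:                            # hinge loss is active
            grad_w = grad_w - y_toy[i] * X_toy[i]
            grad_b = -float(y_toy[i])
        w = w - eta * grad_w
        b = b - eta * grad_b

train_accuracy = float(np.mean(np.sign(X_toy @ w + b) == y_toy))
print(f"weights={w}, intercept={b:.3f}, training accuracy={train_accuracy:.2f}")
```

Each update looks at a single sample, which is what keeps the method cheap on large datasets such as the 60000 training images used here.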

In [None]:
from sklearn.linear_model import SGDClassifier
 
sgd_clf = SGDClassifier(loss='hinge', random_state=42)
sgd_clf.fit(X_train, y_train)

- Before testing the model, it is a good practice to first see the cross-validation scores on the training data. 
- That will give you a good estimate of how the model will perform on unseen data.

In [None]:
valid = cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring='accuracy')

In [None]:
min_val = np.min(valid)*100
max_val = np.max(valid)*100
print(f'For three-fold Cross-Validation you are getting around: {min_val}%-{max_val}%')

We can now compute the actual test scores:

In [None]:
scoreSVM = sgd_clf.score(X_test, y_test)
print("Test score of the Linear SVM: ", scoreSVM)

In [None]:
y_predict = sgd_clf.predict(X_test)

cm = metrics.confusion_matrix(y_test, y_predict, 
                              labels=sgd_clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=sgd_clf.classes_)
disp.plot()

#### Using Logistic Regression

In [None]:
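# The logistic loss is named 'log_loss' in scikit-learn >= 1.1;
# older releases use loss='log' instead.
sgd_clf = SGDClassifier(loss='log_loss', random_state=42)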
sgd_clf.fit(X_train, y_train)
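The linear SVM and logistic regression variants differ only in how they penalize the margin `y * f(x)` of a sample. The sketch below compares the two standard loss formulas on a few hand-picked margins. The function names are hypothetical, and the formulas are the textbook definitions rather than code taken from scikit-learn.

```python
import numpy as np

def hinge_loss(margin):
    """Linear SVM loss: zero once the margin reaches 1."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss: log(1 + exp(-margin)), computed stably."""
    return np.logaddexp(0.0, -margin)

toy_margins = np.array([-2.0, 0.0, 1.0, 3.0])
hinge_values = hinge_loss(toy_margins)
logistic_values = logistic_loss(toy_margins)

for m, h, l in zip(toy_margins, hinge_values, logistic_values):
    print(f"margin={m:+.1f}  hinge={h:.4f}  logistic={l:.4f}")
# margin=-2.0  hinge=3.0000  logistic=2.1269
# margin=+0.0  hinge=1.0000  logistic=0.6931
# margin=+1.0  hinge=0.0000  logistic=0.3133
# margin=+3.0  hinge=0.0000  logistic=0.0486
```

The hinge loss ignores samples that are classified correctly with a margin of at least 1, while the logistic loss keeps a small penalty on every sample.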

In [None]:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring='accuracy')

In [None]:
scoreLR = sgd_clf.score(X_test, y_test)
print("Test score of the Logistic Regression: ", scoreLR)

In [None]:
y_predict = sgd_clf.predict(X_test)

cm = metrics.confusion_matrix(y_test, y_predict, 
                              labels=sgd_clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=sgd_clf.classes_)
disp.plot();

### Random Forest Classifier

- Random forest is a supervised learning algorithm. 
- A forest is made up of decision trees. 
- In general, the more trees a forest has, the more robust it is. 
- A random forest builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the final answer by voting. 
- It also provides a good indicator of feature importance.
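The voting step can be illustrated with a few made-up tree predictions. The sketch below implements plain majority voting with a hypothetical `majority_vote` helper. Note that scikit-learn's `RandomForestClassifier` actually averages the class probabilities predicted by the trees rather than counting hard votes, so this shows the idea, not the library's exact procedure.

```python
import numpy as np

# Hypothetical predictions from 5 trees (rows) for 4 images (columns).
tree_votes = np.array([
    [3, 7, 9, 1],
    [3, 1, 9, 1],
    [5, 7, 4, 1],
    [3, 7, 9, 7],
    [8, 7, 4, 1],
])

def majority_vote(votes, n_classes=10):
    """For each column (image), return the class predicted by the most trees."""
    return np.array([np.bincount(column, minlength=n_classes).argmax()
                     for column in votes.T])

forest_vote = majority_vote(tree_votes)
print(forest_vote)  # [3 7 9 1]
```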

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=500)
forest.fit(X_train, y_train)

In [None]:
forest_output = forest.predict(X_test)

Calculate accuracy on the prediction:

In [None]:
print("Random Forest with n_estimators:500")
print(accuracy_score(y_test, forest_output))

Display a few test images labeled with their predictions:

In [None]:
display_digits(X_test, forest_output)

In [None]:
cm = metrics.confusion_matrix(y_test, forest_output, 
                              labels=forest.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=forest.classes_)
disp.plot()

### Gradient Boosting Classifier



In [None]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=10, learning_rate=1.0, 
                                 max_depth=1, random_state=0).fit(X_train,y_train)

In [None]:
gbc_output = clf.predict(X_test) 

Calculate accuracy on the prediction:

In [None]:
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, gbc_output)}")

Display a few test images labeled with their predictions:

In [None]:
display_digits(X_test, gbc_output)

In [None]:
cm = metrics.confusion_matrix(y_test, gbc_output, 
                              labels=clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=clf.classes_)
disp.plot()

### MLP Classifier

- The Multi-layer Perceptron classifier relies on an underlying Neural Network to perform the task of classification.
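To make the "underlying Neural Network" concrete, the sketch below runs the forward pass of a network with one hidden layer of 10 units, the same layer size used in the cells that follow. The weights are random and untrained, and the inputs are synthetic, so the predictions are meaningless. The point is only to show the shapes and the two steps: a ReLU hidden layer followed by a softmax output. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_hidden, n_classes = 784, 10, 10

# Randomly initialized (untrained) weights and zero biases.
W1 = rng.normal(0.0, 0.01, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.01, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(images):
    """Map a batch of flat images to one probability per digit class."""
    hidden = np.maximum(0.0, images @ W1 + b1)            # ReLU activation
    logits = hidden @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)  # softmax

# Three synthetic "images" with pixel values in [0, 255].
toy_images = rng.integers(0, 256, size=(3, n_features)).astype(float)
toy_probs = forward(toy_images)
toy_digits = toy_probs.argmax(axis=1)

print(toy_probs.shape)        # (3, 10)
print(toy_probs.sum(axis=1))  # each row sums to 1 (up to rounding)
```

Training consists of adjusting `W1`, `b1`, `W2`, and `b2` to reduce the classification loss, which is the job of the solver selected in `MLPClassifier`.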

In [None]:
from sklearn.neural_network import MLPClassifier

#### With the Stochastic Gradient Descent (`sgd`) Solver

In [None]:
clf = MLPClassifier(solver='sgd', hidden_layer_sizes=(10,), 
                    random_state=1)
clf.fit(X_train, y_train)   
neural_output = clf.predict(X_test)

Calculate accuracy on the prediction:

In [None]:
print(f"MLP sgd Accuracy: {accuracy_score(y_test, neural_output)}")

Display a few test images labeled with their predictions:

In [None]:
display_digits(X_test, neural_output)

In [None]:
cm = metrics.confusion_matrix(y_test, neural_output, 
                              labels=clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=clf.classes_)
disp.plot()

#### With the Quasi-Newton (`lbfgs`) Solver

In [None]:
clf = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(10,), 
                    random_state=1)
clf.fit(X_train, y_train)   
neural_output = clf.predict(X_test)

Calculate accuracy on the prediction:

In [None]:
print(f"MLP lbfgs Accuracy: {accuracy_score(y_test, neural_output)}")

Display a few test images labeled with their predictions:

In [None]:
display_digits(X_test, neural_output)

In [None]:
cm = metrics.confusion_matrix(y_test, neural_output, 
                              labels=clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=clf.classes_)
disp.plot()