## Example of ML on Images:  Classifying Handwritten Digits

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.model_selection
import sklearn.metrics
import dtreeviz
import ipywidgets

We use the toy digit dataset provided by scikit-learn.  

(We will also find it fun later to try our hand at the full MNIST dataset, one of the classic initial problems for budding machine-learning practitioners.)

In [None]:
d = sklearn.datasets.load_digits()

In [None]:
print(d.DESCR)

In [None]:
x = d.data
y = d.target

In [None]:
x.shape

In [None]:
y.shape

In [None]:
x[0]

In [None]:
y[0]

The samples consist of 64 features, one for each pixel value of an 8x8 image array.  We can reshape the sample into an 8x8 array in order to visualize it.
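The mapping between a flat feature index and a pixel position can be sketched with NumPy. This is a minimal illustration that assumes the row-major layout `reshape(8, 8)` uses by default; the array `flat` is a stand-in, not a real digit.

```python
import numpy as np

# Stand-in for one sample: 64 values, each equal to its own feature index
flat = np.arange(64)
image = flat.reshape(8, 8)      # row-major: the first 8 features form the top row

# Feature index -> (row, column) in the 8x8 image
row, col = np.unravel_index(36, (8, 8))
print(row, col)                 # 4 4
print(image[row, col])          # 36

# Flattening the image recovers the original feature vector
flat_again = image.ravel()
```

This kind of lookup can be handy later when a model refers to a pixel by its feature number.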

In [None]:
sample = x[4].reshape(8,8)
plt.imshow(sample, cmap='binary')

In [None]:
for i in range(100):
    plt.subplot(10,10,i+1)
    sample = x[i].reshape(8,8)
    plt.imshow(sample, cmap='binary')

In [None]:
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
        x, y, test_size=0.2, random_state=42)

One catch to watch out for in splitting up your data into training and test sets:  stratification.

Let's say you have a dataset that has 90% cat images and 10% dog images.  If you split your data and end up with 99% cats and only 1% dogs in your training data, you'll be training your model on an unrepresentative sample.  (Sampling issues like this can be much more consequential and damaging than distinguishing cats from dogs!)
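To make the cat/dog scenario concrete, here is a rough NumPy sketch comparing a plain random split with a hand-rolled stratified one. The labels, the 80/20 split, and the helper `class_fractions` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 10)      # hypothetical data: 0 = cat, 1 = dog

def class_fractions(y, n_classes=2):
    """Fraction of samples belonging to each class."""
    return np.bincount(y, minlength=n_classes) / len(y)

# Plain random split: class fractions in the training set can drift
perm = rng.permutation(len(labels))
random_train = labels[perm[:80]]

# Stratified split: take 80% of each class separately
train_idx = []
for c in np.unique(labels):
    idx = rng.permutation(np.flatnonzero(labels == c))
    n_train = int(round(0.8 * len(idx)))
    train_idx.extend(idx[:n_train])
stratified_train = labels[np.array(train_idx)]

print("all data:  ", class_fractions(labels))
print("random:    ", class_fractions(random_train))
print("stratified:", class_fractions(stratified_train))
```

The stratified training set keeps the 90/10 ratio of the full dataset by construction, whereas the random split only matches it approximately.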

In [None]:
plt.hist([y, y_train], density=True, label=['all data', 'training set'])
plt.legend();

Here the difference in percentages is noticeable but not too significant by eye.  Nevertheless, we can stratify our split properly by including the "stratify" parameter and assigning it our target variable.

In [None]:
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
        x, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
plt.hist([y, y_train], density=True, label=['all data', 'training set'])
plt.legend();

## Logistic Regression

In [None]:
import sklearn.linear_model
lr_classifier = sklearn.linear_model.LogisticRegression()

In [None]:
lr_classifier.fit(x_train, y_train)

It is not uncommon to encounter warnings or errors when trying to train models.

Such cases can be fruitful opportunities to consult the documentation and learn more about the various training options.

Here, the message gives us clues about potentially insightful documentation.

To fast-forward, logistic regression will train more readily here if we rescale our sample data.  The pixel values are integers in the range [0, 16]; the sklearn StandardScaler rescales each feature to have zero mean and unit variance.  (This standardizes the features; it does not turn them into normally distributed values.)
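The arithmetic behind standardization can be sketched in a few lines of NumPy. The toy array below is hypothetical; its constant last column mimics a pixel that is blank in every image, which needs a guard against division by zero (StandardScaler handles such columns in a similar spirit).

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 "pixels"; the last column is constant
x_toy = np.array([[ 0.,  4., 0.],
                  [ 8.,  6., 0.],
                  [16.,  8., 0.],
                  [ 4., 10., 0.],
                  [12., 12., 0.]])

mean = x_toy.mean(axis=0)
std = x_toy.std(axis=0)                     # population standard deviation (ddof=0)
std_safe = np.where(std == 0, 1.0, std)     # leave constant columns unscaled

x_std = (x_toy - mean) / std_safe

print(x_std.mean(axis=0))   # approximately 0 for every column
print(x_std.std(axis=0))    # approximately 1, except 0 for the constant column
```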

In [None]:
import sklearn.preprocessing

In [None]:
scaler = sklearn.preprocessing.StandardScaler()

In [None]:
x_scaled = scaler.fit_transform(x_train)

In [None]:
x_scaled[0]

Here's how an image differs between the original and rescaled data.

In [None]:
sample1 = x_train[7].reshape(8,8)
sample2 = x_scaled[7].reshape(8,8)

fig,ax = plt.subplots(1,2)
ax[0].imshow(sample1, cmap='binary')
ax[1].imshow(sample2, cmap='binary')

In [None]:
for i in range(100):
    plt.subplot(10,10,i+1)
    sample = x_scaled[i].reshape(8,8)
    plt.imshow(sample, cmap='binary')

In [None]:
lr_classifier.fit(x_scaled, y_train)

In [None]:
lr_classifier.predict(x_scaled[[7]])

In [None]:
y_train[7]

Our classifier was trained on scaled data, so we must scale any new data in the same way (though now we only need the transform, not the fit, so that the test data are scaled with the mean and variance learned from the training set).
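One way to avoid tracking the fitted scaler by hand is to chain the scaler and the classifier in a scikit-learn pipeline. The sketch below is self-contained and uses its own variable names; `max_iter=1000` is an assumption chosen to give the solver room to converge, not a setting taken from this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
x_tr, x_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42,
    stratify=digits.target)

# fit() fits the scaler on the training data, then the classifier on the scaled result.
# score()/predict() only apply the already-fitted scaler before classifying.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(x_tr, y_tr)

pipe_accuracy = pipe.score(x_te, y_te)
print(f"Accuracy: {pipe_accuracy:.2%}")
```

With a pipeline, the test data cannot accidentally be scaled with statistics computed from the test set itself.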

In [None]:
x_test_scaled = scaler.transform(x_test)

In [None]:
y_pred = lr_classifier.predict(x_test_scaled)

In [None]:
print(f"Accuracy: {sklearn.metrics.accuracy_score(y_test, y_pred):.2%}")

In [None]:
cm = sklearn.metrics.confusion_matrix(y_test, y_pred)
cm

In contrast with binary classification, precision, recall, and related metrics for multi-class classification problems can be computed in slightly different ways depending on how the averaging is done.

Micro-average:  equal importance to each instance.  Pools the counts over all classes, giving a global perspective that suits cases where overall performance is more important than class-specific performance.

Macro-average:  equal importance to each class.  Computes the metric independently for each class and then takes the average (hence treating all classes equally). Can be useful if you don't want the performance metric to be dominated by the performance of the majority class.

Weighted average:  computes the metric for each class and then takes a weighted average.  In scikit-learn, `average='weighted'` weights each class by its support (the number of true instances of that class), which can be more reflective of real-world class frequencies.
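The three averaging schemes can be worked through by hand from a confusion matrix. The matrix `cm_toy` below is hypothetical; it follows the scikit-learn layout (rows are true classes, columns are predicted classes), and the example computes precision only.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class)
cm_toy = np.array([[5, 1, 0],
                   [2, 3, 0],
                   [0, 0, 9]])

tp = np.diag(cm_toy)               # correct predictions per class
predicted = cm_toy.sum(axis=0)     # how often each class was predicted
support = cm_toy.sum(axis=1)       # number of true instances per class

per_class_precision = tp / predicted

micro = tp.sum() / predicted.sum()                             # pool all instances
macro = per_class_precision.mean()                             # every class counts equally
weighted = np.average(per_class_precision, weights=support)    # weight by support

print(f"micro={micro:.3f} macro={macro:.3f} weighted={weighted:.3f}")
# micro=0.850 macro=0.821 weighted=0.852
```

Note that for single-label multi-class problems like this one, micro-averaged precision and recall both reduce to the overall accuracy, since every misclassification counts once as a false positive and once as a false negative.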

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
print(f"Accuracy: {sklearn.metrics.accuracy_score(y_test, y_pred):.2%}")
print(f"Precision: {sklearn.metrics.precision_score(y_test, y_pred, average='micro'):.2%}")
print(f"Recall: {sklearn.metrics.recall_score(y_test, y_pred, average='micro'):.2%}")

In [None]:
print(f"Accuracy: {sklearn.metrics.accuracy_score(y_test, y_pred):.2%}")
print(f"Precision: {sklearn.metrics.precision_score(y_test, y_pred, average='macro'):.2%}")
print(f"Recall: {sklearn.metrics.recall_score(y_test, y_pred, average='macro'):.2%}")

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()

In [None]:
tree_clf.fit(x_train, y_train)

In [None]:
tree_clf.classes_

In [None]:
y_pred = tree_clf.predict(x_test)

In [None]:
cm = sklearn.metrics.confusion_matrix(y_test, y_pred)
cm

In [None]:
print(f"Accuracy: {sklearn.metrics.accuracy_score(y_test, y_pred):.2%}")

### Interpretation?

In [None]:
text_representation = sklearn.tree.export_text(tree_clf)
print(text_representation)

In [None]:
tree_clf.classes_

In [None]:
plt.figure(figsize=(12,8))
sklearn.tree.plot_tree(tree_clf, 
               feature_names=range(64),  
               class_names=[str(i) for i in tree_clf.classes_],
               filled=True);

In [None]:
%%capture --no-display

vizmodel = dtreeviz.model(tree_clf, 
         x,
         y,
         feature_names=range(64),
         class_names=[i for i in tree_clf.classes_],
         target_name="y")

vizmodel.view()

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()

In [None]:
rf_clf.fit(x_train, y_train)

In [None]:
rf_clf.classes_

In [None]:
y_pred = rf_clf.predict(x_test)

In [None]:
cm = sklearn.metrics.confusion_matrix(y_test, y_pred)
cm

In [None]:
print(f"Logistic regression accuracy: {sklearn.metrics.accuracy_score(y_test, lr_classifier.predict(x_test_scaled)):.2%}")
print(f"Decision tree accuracy: {sklearn.metrics.accuracy_score(y_test, tree_clf.predict(x_test)):.2%}")
print(f"Random forest accuracy: {sklearn.metrics.accuracy_score(y_test, rf_clf.predict(x_test)):.2%}")

Can we improve the Random Forest accuracy?

Actually, what parameters does it currently have?

In [None]:
rf_clf.get_params()

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
cv_grid = GridSearchCV(RandomForestClassifier(n_jobs=-1,random_state=42),
                       param_grid = {
                           'max_depth' : [None,10,20],
                           'n_estimators' : [50,100,200],
                           'max_leaf_nodes' : [None,5,10]
                       })
cv_grid.fit(x_train, y_train)
cv_grid.best_params_
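As a rough guide to the cost of a grid search like this one, the number of model fits can be counted up front. The sketch below uses only the standard library; the 5-fold figure is an assumption based on scikit-learn's default cross-validation setting when `cv` is not given.

```python
from itertools import product

param_grid = {
    'max_depth': [None, 10, 20],
    'n_estimators': [50, 100, 200],
    'max_leaf_nodes': [None, 5, 10],
}

# Every combination of one value per parameter is a candidate model
combinations = list(product(*param_grid.values()))

n_folds = 5   # assumption: the default number of cross-validation folds
n_fits = len(combinations) * n_folds

print(len(combinations), n_fits)   # 27 135
```

By default there is one further fit at the end, when the best combination is refit on the whole training set. Adding another parameter with three values would triple the count, which is why grids are usually kept small.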

In [None]:
y_pred = cv_grid.predict(x_test)

In [None]:
cm = sklearn.metrics.confusion_matrix(y_test, y_pred)
cm

In [None]:
print(f"Accuracy: {sklearn.metrics.accuracy_score(y_test, y_pred):.2%}")

### Interpretation?

In [None]:
rf_clf.estimators_[0]

In [None]:
def plttrees(t=0):
    plt.figure(figsize=(12,8))
    sklearn.tree.plot_tree(rf_clf.estimators_[t], 
               feature_names=range(64),  
               class_names=[str(i) for i in rf_clf.classes_],
               filled=True);

ipywidgets.interact(plttrees, t=range(len(rf_clf.estimators_)));

In [None]:
rf_clf.feature_importances_

In [None]:
plt.imshow(rf_clf.feature_importances_.reshape(8,8),
           cmap='binary')

## Interpretability for Logistic Regression

For logistic regression, we have a model and we've learned its coefficients:

In [None]:
lr_classifier.coef_.shape

What are the 640 coefficients telling us?
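One hedged way to start reading the coefficients is to treat each class's 64 weights as an 8x8 "weight image". The sketch below is self-contained: it trains its own model on all of the standardized digits (with `max_iter=1000` as an assumed setting) rather than reusing the classifier above, so the numbers will not match exactly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

digits = load_digits()
x_std = StandardScaler().fit_transform(digits.data)
clf = LogisticRegression(max_iter=1000).fit(x_std, digits.target)

# One row of 64 coefficients per class -> one 8x8 weight image per class
weight_images = clf.coef_.reshape(10, 8, 8)

for digit, w in enumerate(weight_images):
    row, col = np.unravel_index(np.argmax(w), w.shape)
    print(f"digit {digit}: largest positive weight at pixel (row={row}, col={col})")
```

Passing `weight_images[k]` to `plt.imshow` with a diverging colormap such as `'bwr'` would show which pixels push toward or away from class `k`.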

We've learned before that logistic regression was finding the coefficients for the logistic curve:
$$ f(x) = \frac{1}{1 + e^{-\theta^T x}} $$

There are a few ways to tackle multi-category classification.  The default for logistic regression is to learn coefficients for the softmax function, which gives the probability of class $j$ out of $K$ classes:
$$ P(y = j | x) = \frac{e^{\theta_j^T x}}{\sum_{k=1}^{K}{e^{\theta_k^T x}}} $$

The basic gist for this problem is that there are:
* 64 features (the pixels)
* 10 classes (K = 10 for the softmax equation)

Softmax is very nice because all probabilities sum to 1 and the class with the largest probability is the predicted class.
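The softmax function itself is short enough to sketch directly in NumPy. The scores below are made-up numbers standing in for the per-class values of $\theta_k^T x$; subtracting the maximum score is a common numerical safeguard and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Convert a vector of K class scores into K probabilities."""
    shifted = scores - np.max(scores)    # guards against overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1])       # hypothetical scores for 3 classes
probs = softmax(scores)

print(probs)
print(probs.sum())          # sums to 1, up to floating-point rounding
print(np.argmax(probs))     # 0
```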

In [None]:
# The classifier was trained on scaled data, so we use the scaled test sample here.
denominator = 0
for i in range(10):
    denominator += np.exp((lr_classifier.intercept_[i] +
                           np.dot(x_test_scaled[0], lr_classifier.coef_[i])))
for i in range(10):
    prob = np.exp((lr_classifier.intercept_[i] +
                   np.dot(x_test_scaled[0], lr_classifier.coef_[i]))) / denominator
    print("P(y={:d}|x_test_scaled[0]) = {:.2e} = {:.2f}".format(i,prob,prob))

In [None]:
lr_classifier.predict_proba([x_test_scaled[0]])

In [None]:
np.sum(lr_classifier.predict_proba([x_test_scaled[0]]))

In [None]:
np.argmax(lr_classifier.predict_proba([x_test_scaled[0]]))

In [None]:
lr_classifier.predict([x_test_scaled[0]])

In [None]:
plt.bar(range(10),lr_classifier.predict_proba([x_test_scaled[0]])[0])

It's good to know the equation, but that's a complex equation with a lot of coefficients to interpret.

Can we get insights into why a number got misclassified?
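A first step is simply locating the misclassified samples. The sketch below uses small made-up label arrays (the `_toy` names are hypothetical); the same comparison would apply to `y_test` and an array of predictions.

```python
import numpy as np

# Hypothetical true and predicted labels
y_true_toy = np.array([3, 8, 1, 9, 4, 8])
y_pred_toy = np.array([3, 3, 1, 9, 9, 8])

# Indices where the prediction disagrees with the true label
wrong = np.flatnonzero(y_true_toy != y_pred_toy)
print(wrong)   # [1 4]

for i in wrong:
    print(f"sample {i}: true={y_true_toy[i]}, predicted={y_pred_toy[i]}")
```

Any index in `wrong` is a candidate for closer inspection, for example by plotting the image and its predicted class probabilities.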

In [None]:
elem = 0
plt.imshow(x_test[elem].reshape(8,8), cmap='binary');
print('Real value:',y_test[elem])
# Predict from the scaled sample, since the classifier was trained on scaled data
print('Predicted value:',lr_classifier.predict([x_test_scaled[elem]]))

We will revisit the general topic of interpretation + ML.