In [0]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Get data

The MNIST dataset consists of 70,000 images of handwritten digits.  
It is often called the "Hello World" of ML.


In [0]:
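import numpy as np
from sklearn.datasets import fetch_openml

# fetch_mldata was removed from scikit-learn; fetch_openml is its replacement
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
mnist["target"] = mnist["target"].astype(np.uint8)  # OpenML returns the labels as strings
mnist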




DESCR : Description of dataset  
data : Features (pixel intensity ranging from 0 to 255)  
target: Label (digits)

In [0]:
X, y = mnist["data"], mnist["target"]
X.shape

In [0]:
y.shape

## Displaying a sample image

In [0]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

# Picking one image (index 60000 is the first image after the training portion)
img = X[60000]

# Reshaping image into a 28x28 matrix
img = img.reshape(28,28)

# Displaying image using matplotlib
plt.imshow(img, cmap= matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()

In [0]:
y[60000] # Checking label of image above

# Creating training and test set

In [0]:
X_train , X_test = X[:60000], X[60000:]
y_train , y_test = y[:60000], y[60000:]

# Shuffling training set 
Shuffling ensures that every cross-validation fold contains every digit. Some algorithms also perform worse when they see many similar instances in a row.

In [0]:
import numpy as np

# Random permutation of the indices 0 to 59999
shuffle_index = np.random.permutation(60000)

X_train , y_train = X_train[shuffle_index], y_train[shuffle_index]
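
Indexing both arrays with the same permutation is what keeps every image paired with its own label. A minimal sketch on made-up arrays (the names `X_demo`, `y_demo` and `idx` are illustrative, not part of the notebook):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: row i of X_demo is [2*i, 2*i + 1] and its label is i
X_demo = np.arange(10).reshape(5, 2)
y_demo = np.arange(5)

# One permutation, applied to both arrays
idx = rng.permutation(5)
X_shuf, y_shuf = X_demo[idx], y_demo[idx]

# Every shuffled row still sits next to its original label
print(X_shuf[:, 0] == 2 * y_shuf)
```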

# Binary classifier
Now we will create a **5 detector** using a Stochastic Gradient Descent classifier, which handles large datasets efficiently.  
To use the classifier, we import **SGDClassifier** from **sklearn.linear_model**.

In [0]:
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(random_state=42) # Seeding random for reproducible results

# Converting labels to booleans
y_train5 = (y_train==5)
y_test5 = (y_test==5)

sgd = sgd.fit(X_train, y_train5)

# Predicting label of digit visualized above
print(sgd.predict([X[60000]]))

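If the prediction matches the label checked above, the binary classifier got this instance right. A single example says little about overall performance, though.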

# Performance Measures

## Cross validation on Stratified data

In [0]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# Creates 3 train/validation splits (random_state only takes effect with shuffle=True)
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index,test_index in skfolds.split(X_train,y_train5):
  
  # Creates clone of SGD classifier
  clone_clf = clone(sgd)
  
  # Training set (within CV)
  X_train_folds = X_train[train_index]
  y_train_folds = y_train5[train_index]
  
  # Validation set
  X_test_fold = X_train[test_index]
  y_test_fold = y_train5[test_index]
  
  # Training SGD classifier
  clone_clf.fit(X_train_folds,y_train_folds)
  
  # Predictions
  y_pred=clone_clf.predict(X_test_fold)
  
  # Number of Correct Predictions
  n_correct=sum(y_pred==y_test_fold)
  
  # Accuracy = Number of Correct Predictions/ Total Predictions
  print(n_correct/len(y_pred))

Accuracy is pretty high for SGD classifier
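
What "stratified" means can be checked on synthetic labels: each validation fold should keep the class ratio of the full set. The sketch below uses made-up data (90 negatives, 30 positives) rather than MNIST, and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives and 30 positives (25% positive)
y_demo = np.array([False] * 90 + [True] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

ratios = []
for train_index, test_index in skf.split(X_demo, y_demo):
    # Share of positives in this validation fold
    ratios.append(y_demo[test_index].mean())

print(ratios)
```

Each fold should report the same share of positives as the whole set.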

## Normal Cross validation

In [0]:
from sklearn.model_selection import cross_val_score 

cv = cross_val_score(sgd, X_train, y_train5, cv=10, scoring="accuracy")

print(cv)
print("Mean:{}".format(np.mean(cv)))
print("STD:{}".format(np.std(cv)))

The result looks too good to be true. To check, let's create a classifier that labels every instance as "not 5".

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin

class Never5(BaseEstimator, TransformerMixin):
  
  def __init__ (self):
    pass
  
  def fit(self, X, y=None):
    return self
  
  def predict(self, X):
    
    # Returns a boolean matrix of Zeros 
    return np.zeros((len(X),1), dtype=bool)
  
nev5 = Never5()


cv = cross_val_score(nev5, X_train, y_train5, cv=10, scoring="accuracy")

print(cv)
print("Mean:{}".format(np.mean(cv)))
print("STD:{}".format(np.std(cv)))
    

Even a dumb predictor reaches about 90% accuracy, because only about 10% of the instances are actually 5s.  
That is why accuracy is a poor performance measure for classifiers when the dataset is skewed, i.e. when one class is much more frequent than the others.
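
The effect is easy to reproduce on synthetic labels. The sketch below uses made-up labels (not MNIST) in which exactly 10% of the instances are positive and scores a predictor that always answers "not 5". The array names are illustrative.

```python
import numpy as np

# Hypothetical labels: 1000 instances, exactly 10% positive ("is a 5")
y_true = np.zeros(1000, dtype=bool)
y_true[:100] = True

# A predictor that always answers "not 5"
y_pred = np.zeros(1000, dtype=bool)

accuracy = np.mean(y_pred == y_true)               # fraction of correct predictions
recall = np.sum(y_pred & y_true) / np.sum(y_true)  # fraction of 5s that were detected

print(accuracy)  # 0.9
print(recall)    # 0.0
```

The accuracy looks respectable even though not a single 5 is detected.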

## Other performance measures

### Confusion matrix  

A confusion matrix counts how many times instances of one class are classified as another class.  
Each row represents an actual class and each column represents a predicted class.

Scikit-learn's **cross_val_predict()** returns predictions rather than scores.

In [0]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd, X_train, y_train5, cv=3)

These predictions are *clean*: each one is made by a model that never saw that instance during training. To build a confusion matrix, we call **confusion_matrix()** from sklearn.metrics.

In [0]:
from sklearn.metrics import confusion_matrix

conf = confusion_matrix(y_train5, 
                        y_train_pred)

conf

scikit-learn sorts the classes, so for boolean labels the negative class (not 5) comes first and the confusion matrix is laid out as 
$$
\begin{bmatrix} 
Not5\ classified\ as\ Not5 & Not5\ classified\ as\ 5 \\
5\ classified\ as\ Not5 & 5\ classified\ as\ 5 
\end{bmatrix}
$$

The first row is the negative class (actual not-5s). Its first entry counts not-5s correctly classified as not 5. Hence, **True Negatives**.
Its second entry counts not-5s misclassified as 5s. Hence, **False Positives**.

Similarly, the first and second entries of the positive class (second row) are **False Negatives** and **True Positives**.
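
The layout can be confirmed on a handful of made-up labels. This sketch relies on `confusion_matrix` sorting the classes (so `False` comes before `True`) and on `ravel()` flattening the 2x2 matrix row by row. The labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for ten instances (True means "is a 5")
y_true = np.array([True, True, True, True, False, False, False, False, False, False])
y_pred = np.array([True, True, True, False, True, True, False, False, False, False])

cm = confusion_matrix(y_true, y_pred)

# Row-by-row flattening gives TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)  # 4 2 1 3
```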

The two metrics we can calculate from confusion matrix are **Precision** and **Recall**.

The accuracy of the positive predictions is called **Precision**.

$$Precision = \frac{True\ Positives}{True\ Positives + False\ Positives}$$

The True Positive Rate, i.e. the fraction of actual positive instances that the classifier detects, is called **Recall**.

$$Recall = \frac{True\ Positives}{True\ Positives + False\  Negatives}$$

It is also known as *sensitivity*.

On combining precision and recall we get **$F_1$ score**. It is the harmonic mean of precision and recall.

$$F_{1} = \frac{2}{\frac{1}{precision}+\frac{1}{recall}}$$

$$ $$

$$F_{1} = 2 \times \frac{precision \times recall}{precision + recall}$$

The $F_1$ score is high only when both precision and recall are high, but most of the time that is not the case. Increasing precision usually reduces recall, and increasing recall usually reduces precision. This is called the *precision/recall tradeoff*.
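
As a worked example of the formulas above, the following sketch computes all three metrics from hypothetical counts (3 true positives, 2 false positives, 1 false negative) and checks that the two forms of the $F_1$ formula agree.

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 3, 2, 1

precision = tp / (tp + fp)  # 3 / 5
recall = tp / (tp + fn)     # 3 / 4

# Both forms of the F1 formula give the same value
f1_harmonic = 2 / (1 / precision + 1 / recall)
f1_product = 2 * precision * recall / (precision + recall)

print(precision, recall)
print(f1_harmonic, f1_product)
```

Because it is a harmonic mean, the $F_1$ score (about 0.67 here) sits closer to the lower of the two values than their arithmetic mean would.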

In [0]:
from sklearn.metrics import precision_score, recall_score

print(precision_score(y_train5, y_train_pred))

print(recall_score(y_train5, y_train_pred))

In [0]:
from sklearn.metrics import f1_score

print(f1_score(y_train5, y_train_pred))

The **SGDClassifier** computes a score for each instance using a decision function. If the score is above some threshold, the instance is assigned to the positive class; otherwise it is assigned to the negative class.

As the threshold is increased, precision generally increases but recall decreases.

```classifier.decision_function(value)``` returns the decision score for that value.

In [0]:
some_digit = X[20000]

dec_score = sgd.decision_function([some_digit])
threshold = 0
dec_class = dec_score > threshold
print("When threshold is {} then it is a 5 : {}".format(threshold, dec_class))

threshold = 20000
dec_class = dec_score > threshold
print("When threshold is {} then it is a 5 : {}".format(threshold, dec_class))
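
The same tradeoff can be seen on a toy set of scores. The scores, labels and the helper `precision_recall_at` below are made up for illustration; they are not outputs of the SGD classifier.

```python
import numpy as np

# Hypothetical decision scores and true labels (True means "is a 5")
scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
labels = np.array([False, False, True, False, True, True])

def precision_recall_at(threshold):
    """Precision and recall when every score above the threshold is called positive."""
    pred = scores > threshold
    tp = np.sum(pred & labels)
    precision = tp / np.sum(pred)   # assumes at least one positive prediction
    recall = tp / np.sum(labels)
    return precision, recall

print(precision_recall_at(0.0))  # lower threshold: precision 0.75, recall 1.0
print(precision_recall_at(1.5))  # higher threshold: precision 1.0, recall about 0.67
```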

To decide on the value of the threshold, we have to look at the precision/recall curve.

In [0]:
from sklearn.model_selection import cross_val_predict

# To get the decision scores of all the instances
y_scores = cross_val_predict(sgd, X_train, y_train5, cv=3, 
                             method="decision_function")

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train5, y_scores)

plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
plt.plot(thresholds, recalls[:-1], "g--", label="Recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0,1])
plt.show()

Another way to choose the threshold is to plot precision directly against recall.

In [0]:
%matplotlib inline 
plt.plot(recalls,precisions,"b--")
plt.ylabel("Precision")
plt.xlabel("Recall")
plt.show()

We should select a threshold just before the point where precision drops sharply.

In [0]:
# Guessing a threshold for a 90% precision classifier
threshold = 70000
y_train_pred90 = (y_scores > threshold)

# It is actually a 85% precision classifier
print(precision_score(y_train5, y_train_pred90))

# 58% recall
print(recall_score(y_train5, y_train_pred90))
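
Instead of guessing, the threshold could also be read off the arrays returned by `precision_recall_curve`. The sketch below is one possible approach, shown on made-up scores rather than the notebook's `y_scores`; the 0.90 target and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Hypothetical decision scores and true labels (True means "is a 5")
scores = np.array([-3.0, -2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0])
labels = np.array([False, False, False, False, True, False, True, True, True, True])

precisions, recalls, thresholds = precision_recall_curve(labels, scores)

target = 0.90
# precisions has one more entry than thresholds, so the last entry is dropped
reached = precisions[:-1] >= target
if not reached.any():
    raise ValueError("no threshold reaches the target precision")

# thresholds are sorted in increasing order, so argmax picks the lowest one that qualifies
chosen = thresholds[np.argmax(reached)]

# precision_recall_curve treats scores >= threshold as positive
pred = scores >= chosen
print(chosen, precision_score(labels, pred), recall_score(labels, pred))
```

Picking the lowest qualifying threshold keeps recall as high as possible for the requested precision.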

### Receiver Operating Characteristic (ROC) Curve

The **ROC curve** plots the **True Positive Rate** (recall) against the **False Positive Rate**.

The **False Positive Rate** can be calculated as **1 - True Negative Rate**. The True Negative Rate is also known as *specificity*.
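
The identity follows from the definitions: the False Positive Rate is FP / (FP + TN) and the True Negative Rate is TN / (TN + FP), so the two sum to 1. A quick check with hypothetical counts:

```python
# Hypothetical counts for the actual negative instances
tn, fp = 4, 2

tnr = tn / (tn + fp)  # True Negative Rate (specificity)
fpr = fp / (fp + tn)  # False Positive Rate

print(fpr, 1 - tnr)
```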


In [0]:
from sklearn.metrics import roc_curve

# The ROC curve needs decision scores, not hard True/False predictions
fpr, tpr, thresholds = roc_curve(y_train5, y_scores)

plt.plot(fpr, tpr, "r--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()

To compare classifiers, you can compare the **Area Under the Curve** (AUC) of their ROC curves. A perfect classifier has AUC = 1.

In [0]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train5, y_scores)
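
To get a feel for the scale, the sketch below evaluates two made-up scorers: one that ranks every positive above every negative, and one that gives every instance the same score. The labels and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([False, False, False, True, True, True])

# Every positive outranks every negative
perfect = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])

# The same score for every instance carries no ranking information
constant = np.full(6, 0.5)

print(roc_auc_score(labels, perfect))   # 1.0
print(roc_auc_score(labels, constant))  # 0.5
```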


**RULE OF THUMB:** Use the Precision/Recall curve whenever the positive class is rare (as in this case) and you care more about False Positives than False Negatives. Use the ROC curve otherwise.

### Training Random Forest and calculating ROC curve and ROC AUC score for the classifier

**RandomForestClassifier** has no decision_function. Instead, its predict_proba method returns, for each instance, the probability that it belongs to each class.


In [0]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(random_state=42)

forest_scores = cross_val_predict(forest, X_train, y_train5, cv=3, 
                                  method="predict_proba")
# Probability of positive class is score
forest_score_pos = forest_scores[:,1]

fpr_forest, tpr_forest, thresholds = roc_curve(y_train5, forest_score_pos)

plt.plot(fpr_forest, tpr_forest, "r--", label="Random Forest")
plt.plot(fpr, tpr, "b--", label="Stochastic Gradient Descent")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.show()

The classifier whose curve is closer to the upper-left corner (**Random Forest**) is the better classifier.

In [0]:
roc_auc_score(y_train5, forest_score_pos)

That is an almost perfect score for the Random Forest classifier.

# Multiclass Classification

Multinomial (multiclass) classifiers such as K-Nearest Neighbors can distinguish between more than two classes.  
You can also combine multiple binary classifiers for multiclass classification, but some algorithms, such as Support Vector Machines, scale poorly with the size of the training set.  
There are multiple strategies for using binary classifiers as multiclass classifiers.

**One Vs All** (OvA): train one binary classifier per class (10 binary classifiers for 10 digits).  
**One Vs One** (OvO): train one binary classifier for each pair of classes (45 binary classifiers for 10 digits).  
The advantage of OvO is that each classifier only needs to be trained on the part of the dataset that contains its two classes, not on the complete dataset.

Generally OvA is preferred over OvO, but for algorithms such as SVM, OvO is preferred because training many classifiers on small datasets is faster than training a few classifiers on a large dataset.
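
The classifier counts quoted above follow from simple counting: OvA needs one classifier per class, while OvO needs one per unordered pair of classes, i.e. N(N-1)/2. A short check for the 10 digits:

```python
from itertools import combinations

classes = list(range(10))  # digits 0 to 9

ova_classifiers = len(classes)              # one classifier per class
ovo_pairs = list(combinations(classes, 2))  # one classifier per unordered pair

print(ova_classifiers)  # 10
print(len(ovo_pairs))   # 45
```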


In [0]:
# Training on original label and features
sgd = sgd.fit(X_train, y_train)

sgd.predict([some_digit])

In [0]:
scores = sgd.decision_function([some_digit])
print(np.argmax(scores)) # Class 3 has highest decision score

In [0]:
print(sgd.classes_) # All classes fed to classifier

If you want to force the OvA or OvO strategy on a classifier, create an instance of **OneVsRestClassifier** or **OneVsOneClassifier** respectively and pass the classifier to its constructor.

In [0]:
from sklearn.multiclass import OneVsOneClassifier

ovo_sgd = OneVsOneClassifier(SGDClassifier(random_state=42))

ovo_sgd = ovo_sgd.fit(X_train, y_train)

print(ovo_sgd.predict([some_digit]))

# Number of binary classifiers
print(len(ovo_sgd.estimators_))
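
The OvA counterpart works the same way with `OneVsRestClassifier`. To keep the sketch quick and self-contained, it uses scikit-learn's small bundled 8x8 digits dataset (`load_digits`) instead of MNIST; that substitution and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Small bundled digits dataset (8x8 images), used here instead of MNIST
X_small, y_small = load_digits(return_X_y=True)

ovr_sgd = OneVsRestClassifier(SGDClassifier(random_state=42))
ovr_sgd.fit(X_small, y_small)

# One binary classifier per digit
print(len(ovr_sgd.estimators_))

# Prediction for the first image next to its true label
print(ovr_sgd.predict(X_small[:1]), y_small[:1])
```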

In [0]:
forest.fit(X_train, y_train)
forest.predict_proba([some_digit])

In [0]:
cross_val_score(sgd, X_train, y_train, cv=3, scoring="accuracy")

In [0]:
cross_val_score(forest, X_train, y_train, cv=3, scoring="accuracy")

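Scaling the features should improve the accuracy of the SGD classifier. Tree-based models such as Random Forest are largely insensitive to feature scaling.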

In [0]:
from sklearn.preprocessing import StandardScaler

sca = StandardScaler()

X_scaled = sca.fit_transform(X_train.astype(np.float64))

cross_val_score(sgd, X_scaled, y_train, cv=3, scoring="accuracy")

In [0]:
cross_val_score(forest, X_scaled, y_train, cv=3, scoring="accuracy")
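
For reference, `StandardScaler` shifts and rescales each feature (column) to zero mean and unit variance. A minimal sketch on a made-up 3x2 array (not MNIST pixels):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X_demo = np.array([[0.0, 10.0],
                   [128.0, 20.0],
                   [255.0, 60.0]])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

print(X_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_demo_scaled.std(axis=0))   # approximately 1 for each column
```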

## Confusion matrix

In [0]:
# Creating confusion matrix

y_pred_score = cross_val_predict(forest, X_scaled, y_train, cv=3)

conf = confusion_matrix(y_train, y_pred_score)
conf

In [0]:
plt.matshow(conf, cmap = plt.cm.gray)
plt.show()

However, the matrix above is biased towards the classes that have more instances in the dataset.

In [0]:
# Number of instances per actual class, obtained by summing each row
row_sums = conf.sum(axis=1, keepdims=True)

# Normalizing each row of the confusion matrix by its class size
norm_conf = conf / row_sums

# Filling 0s in diagonal as it does not represent errors
np.fill_diagonal(norm_conf, 0)

# Plotting confusion matrix
plt.matshow(norm_conf, cmap = plt.cm.gray)
plt.show()
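
The effect of the normalization is easier to see on a small matrix. In the made-up 3-class example below, the last class has only 10 instances, so its 2 errors look negligible as raw counts but turn out to be the highest error rate once each row is divided by its class size.

```python
import numpy as np

# Hypothetical confusion matrix (rows: actual class, columns: predicted class)
conf_demo = np.array([[90, 10, 0],
                      [5, 40, 5],
                      [2, 0, 8]])

# Number of instances per actual class
row_sums = conf_demo.sum(axis=1, keepdims=True)

# Error rates per actual class; the diagonal (correct predictions) is zeroed out
error_rates = conf_demo / row_sums
np.fill_diagonal(error_rates, 0)

print(error_rates)
```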

# Multilabel Classification

In multilabel classification, each instance can be assigned several labels at once, and each label is binary. 

**KNeighborsClassifier** is one classifier that supports this.

The following classifier predicts two labels for each digit:
1. Larger than 7
2. Odd

Both labels take boolean values (True or False).

In [0]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

y_large7 = (y_train > 7)
y_odd = (y_train % 2 != 0)

y_multi = np.c_[y_large7, y_odd]

knn = knn.fit(X_train, y_multi)

knn.predict([some_digit])
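
To make the shape of the multilabel target concrete, here is what the same construction produces for four hand-picked digits (the array `y_demo` is illustrative):

```python
import numpy as np

# Four example digits
y_demo = np.array([0, 3, 8, 9])

# Column 0: larger than 7, column 1: odd
y_multi_demo = np.c_[y_demo > 7, y_demo % 2 != 0]

print(y_multi_demo)
# [[False False]
#  [False  True]
#  [ True False]
#  [ True  True]]
```

Each row holds the two boolean labels of one digit, which is the format the classifier is trained on.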

In [0]:
"""knn_score = cross_val_predict(knn, X_train, y_multi, cv=3)

# average = "macro" means all labels are equally important 
# (same number of instances for each label)

# if that is not the case use average="weighted"
f1_score(y_multi, knn_score, average="macro")"""

# Multioutput Classification

It is a generalization of multilabel classification in which each label can take more than two values (classes).

Here, we will build a classifier that removes noise from images by predicting clean pixel intensities.
Each pixel is one label, and its intensity, ranging from 0 to 255, is the class.

In [0]:
# Noisy images as input
noise_train = np.random.randint(0,100, (len(X_train), 784))
noise_test = np.random.randint(0, 100, (len(X_test), 784))

X_train_mod = X_train + noise_train
X_test_mod = X_test + noise_test

# Clean images as prediction labels
y_train_mod = X_train
y_test_mod = X_test

In [0]:
knn = KNeighborsClassifier()
knn = knn.fit(X_train_mod, y_train_mod)
cleaned = knn.predict([X_test_mod[6]])

plt.imshow(cleaned.reshape(28,28))

# Trying to make a KNeighbors classifier with 97% accuracy


In [0]:
knn_final = KNeighborsClassifier()

cross_val_knn = cross_val_score(knn_final, X_scaled, y_train, cv=3, scoring="accuracy")

print("Initial score : {}".format(cross_val_knn))
print("Mean : {}".format(np.mean(cross_val_knn)))
print("STD: {}".format(np.std(cross_val_knn)))

In [0]:
param_grid = [{"n_neighbors":[3, 5, 7],
             "weights":["uniform","distance"]}]

from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(knn_final, param_grid,
                          cv=5, scoring="accuracy")
grid_search.fit(X_scaled, y_train)

In [0]:
grid_search.best_estimator_
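
Besides `best_estimator_`, a fitted `GridSearchCV` also exposes `best_params_` and `best_score_`. The sketch below runs the same parameter grid on scikit-learn's small bundled digits dataset (`load_digits`) with 3 folds, purely so that it finishes quickly; that dataset, the fold count and the variable names are assumptions, and the scores will differ from those on MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small bundled digits dataset (8x8 images), used here instead of MNIST
X_small, y_small = load_digits(return_X_y=True)

param_grid = {"n_neighbors": [3, 5, 7],
              "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X_small, y_small)

print(search.best_params_)  # best combination found in the grid
print(search.best_score_)   # mean cross-validated accuracy of that combination
```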