In [None]:
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, parser='auto')

In [None]:
print(type(mnist['data']))
display(mnist['data'])

In [None]:
mnist.keys()

Here, mnist['data'] (the pixel values of the images) is used as X, and mnist['target'] (the digit each image represents) is used as y.

In [None]:
X,y = mnist['data'], mnist['target']
print(X.shape)
print(y.shape)

There are 70,000 images, and each image has 784 features. This is because each image
is 28×28 pixels, and each feature simply represents one pixel's intensity, from 0
(white) to 255 (black).
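
As a small illustration of the 784 = 28×28 layout, the sketch below reshapes a flat vector into a pixel grid. The array `flat_image` is a made-up stand-in for one MNIST row, not real data, and the row-major ordering is what NumPy's `reshape` uses by default.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 intensities in [0, 255].
flat_image = np.arange(784) % 256

# Reshape the flat feature vector into a 28x28 pixel grid.
image = flat_image.reshape(28, 28)
print(image.shape)  # (28, 28)

# Row-major layout: pixel (row, col) sits at flat index row * 28 + col.
print(image[3, 5] == flat_image[3 * 28 + 5])  # True
```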

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

some_digit = X[0:1]
some_digit_image = some_digit.values.reshape(28,28)


In [None]:
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary, interpolation="nearest")
plt.show()

In [None]:
y[0:1]

In [None]:
y = y.astype(int)

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:] 

Create a "5-detector": a binary classifier that predicts whether a given digit is a 5 or not.

In [None]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

The Stochastic Gradient Descent classifier (SGDClassifier) handles training instances independently, one at a time. It relies on randomness during training, so random_state is set to make the results reproducible.
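
To make the "one instance at a time" idea concrete, here is a minimal sketch that feeds single rows to `SGDClassifier.partial_fit`. The toy dataset from `make_classification` and the names `X_stream`, `y_stream`, and `online_clf` are assumptions for illustration only; the notebook itself trains with a plain `fit` call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data standing in for a stream of training instances.
X_stream, y_stream = make_classification(n_samples=200, n_features=5, random_state=42)

online_clf = SGDClassifier(random_state=42)

# partial_fit updates the model using only the rows passed in.
# The full set of classes must be declared on the first call.
for i in range(len(X_stream)):
    online_clf.partial_fit(X_stream[i:i + 1], y_stream[i:i + 1], classes=np.array([0, 1]))

print(online_clf.coef_.shape)  # (1, 5): one weight per feature
```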

In [None]:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state= 42)
sgd_clf.fit(X_train, y_train_5)

In [None]:
sgd_clf.predict(some_digit)

### Performance Measures

#### 1. Measuring ***Accuracy*** Using **Cross-Validation**

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

for k,v in skfolds.split(X_train, y_train_5):
    print('train_idx:',k,'test_idx:',v)
print(type(k), type(v))

In [None]:

for train_index, test_index in skfolds.split(X_train, y_train_5):
    # clone() returns a new, unfitted estimator with the same hyperparameters
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train.iloc[train_index]
    y_train_folds = y_train_5.iloc[train_index]
    X_test_fold = X_train.iloc[test_index]
    y_test_fold = y_train_5.iloc[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct/len(y_pred))

The above process can be carried out directly by using **cross_val_score**

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')

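Accuracy is not a good performance metric for skewed data: only about 10% of the images are 5s, so a classifier that always predicts "not 5" would already be right about 90% of the time.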
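
A minimal sketch of why accuracy misleads on skewed data: a model that never predicts the positive class still scores high. The labels below are made up (10% positives, roughly the share of 5s in MNIST), and `DummyClassifier` is used here only as a convenient always-negative baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical skewed labels: 100 positives, 900 negatives.
y_skewed = np.array([True] * 100 + [False] * 900)
X_dummy = np.zeros((len(y_skewed), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class, i.e. "not positive".
never_positive_clf = DummyClassifier(strategy="most_frequent")
never_positive_clf.fit(X_dummy, y_skewed)
y_pred_dummy = never_positive_clf.predict(X_dummy)

baseline_accuracy = accuracy_score(y_skewed, y_pred_dummy)
print(baseline_accuracy)  # 0.9, without detecting a single positive
```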

#### 2. ***Confusion Matrix***

In [None]:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train,y_train_5,cv=3)

Just like the cross_val_score() function, cross_val_predict() performs K-fold
cross-validation, but instead of returning the evaluation scores, it returns the predictions made on each test fold. This means that you get a clean prediction for each
instance in the training set (“clean” meaning that the prediction is made by a model
that never saw the data during training).
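
The difference between the two functions can be seen on a small toy problem. This is only a sketch: the dataset from `make_classification` and the names `X_toy`, `y_toy`, and `toy_clf` are assumptions, not part of the MNIST workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict, cross_val_score

# Hypothetical toy data: 300 instances, binary target.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)
toy_clf = SGDClassifier(random_state=42)

# One score per fold.
fold_scores = cross_val_score(toy_clf, X_toy, y_toy, cv=3, scoring="accuracy")
# One out-of-fold prediction per training instance.
oof_pred = cross_val_predict(toy_clf, X_toy, y_toy, cv=3)

print(fold_scores.shape)  # (3,)
print(oof_pred.shape)     # (300,)
```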

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)
# [[TN, FP],
# [FN, TP]]

#### 3. ***Precision*** and ***Recall***

precision = TP/(TP+FP) i.e. true positive / predicted positives
recall = TPR = sensitivity = TP/(TP + FN) i.e. true positive / actual positives
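
As a worked example of the two formulas, the sketch below reads TN, FP, FN, and TP from a confusion matrix and checks the hand-computed values against scikit-learn. The ten labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions for ten instances.
toy_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
toy_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
print(tn, fp, fn, tp)  # 4 2 1 3

toy_precision = tp / (tp + fp)  # 3 / 5
toy_recall = tp / (tp + fn)     # 3 / 4
print(toy_precision, toy_recall)  # 0.6 0.75

# The library functions agree with the hand-computed values.
print(np.isclose(toy_precision, precision_score(toy_true, toy_pred)))  # True
print(np.isclose(toy_recall, recall_score(toy_true, toy_pred)))        # True
```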

In [None]:
from sklearn.metrics import precision_score, recall_score
print(precision_score(y_train_5, y_train_pred))
print(recall_score(y_train_5, y_train_pred))


If you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision).

Suppose you train a classifier to detect
shoplifters on surveillance images: it is probably fine if your classifier has only 30%
precision as long as it has 99% recall (sure, the security guards will get a few false
alerts, but almost all shoplifters will get caught).

#### 4. ***F1 Score***

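The F1 score is the harmonic mean of precision and recall. Unlike the regular mean, the harmonic mean gives much more weight to low values.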

F1 score gets high value only when both precision and recall have high values

F1 Score = 2 * (precision * recall)/(precision + recall)
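
A short worked example of the formula, using made-up precision and recall values. The helper `f1_from` is hypothetical and exists only to show the arithmetic; in practice `f1_score` from scikit-learn computes this from labels and predictions.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balanced case: both values high, so F1 is high.
balanced_f1 = f1_from(0.8, 0.8)
# Lopsided case: the arithmetic mean would be 0.5, but F1 is dragged down.
lopsided_f1 = f1_from(0.9, 0.1)

print(round(balanced_f1, 2))  # 0.8
print(round(lopsided_f1, 2))  # 0.18
```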

In [None]:
from sklearn.metrics import f1_score
print(f1_score(y_train_5, y_train_pred))

#### ***Precision and Recall Tradeoffs***

Increasing precision generally reduces recall, and vice versa.

To see this, instead of calling the classifier’s
predict() method, you can call its decision_function() method, which returns a
score for each instance, and then make predictions based on those scores using any
threshold you want:

In [None]:
y_scores = sgd_clf.decision_function(some_digit)
print(y_scores)

threshold = 0
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)


threshold=8000
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)

This confirms that raising the threshold decreases recall. The image actually represents
a 5, and the classifier detects it when the threshold is 0, but it misses it when the
threshold is increased to 8,000.
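
The same effect can be sketched on several instances at once. The scores and labels below are made up; the point is only that moving the threshold up trades recall for precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical decision scores and true labels for eight instances.
toy_scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 3.0, 4.0, 6.0])
toy_labels = np.array([0, 0, 0, 1, 0, 1, 1, 1])

results = {}
for toy_threshold in (0.0, 2.5):
    # An instance is predicted positive when its score exceeds the threshold.
    toy_pred = (toy_scores > toy_threshold).astype(int)
    results[toy_threshold] = (
        precision_score(toy_labels, toy_pred),
        recall_score(toy_labels, toy_pred),
    )

for toy_threshold, (p, r) in results.items():
    print(toy_threshold, round(p, 3), round(r, 3))
# 0.0 0.667 1.0
# 2.5 1.0 0.75
```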

To choose a suitable threshold, use **precision_recall_curve**.

For this you will first need to get the
scores of all instances in the training set using the cross_val_predict() function
again, but this time specifying that you want it to return decision scores instead of
predictions:

In [None]:
from sklearn.metrics import precision_recall_curve

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

In [None]:
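def plot_precision_recall_threshold(precisions, recalls, thresholds):
    # precisions and recalls have one more element than thresholds,
    # so the last value of each is dropped to align the arrays
    plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
    plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
    plt.xlabel('Threshold')
    plt.legend(loc='center right')
    plt.grid(True)

plot_precision_recall_threshold(precisions, recalls, thresholds)
plt.show()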

The precision curve is bumpier than the recall
curve because precision may sometimes go
down when you raise the threshold (although in general it will go
up).
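
A tiny made-up example shows how this can happen: raising the threshold past a true positive removes it from the predicted positives, so precision drops from 4/5 to 3/4 while recall can only stay flat or fall. The helper `precision_recall_at` and the data are hypothetical.

```python
import numpy as np

# Hypothetical scores (already sorted) and true labels.
toy_scores = np.array([1, 2, 3, 4, 5, 6, 7, 8])
toy_labels = np.array([0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = toy_scores >= threshold
    true_positives = np.sum(predicted & toy_labels)
    return true_positives / predicted.sum(), true_positives / toy_labels.sum()

p_low, r_low = precision_recall_at(4)
p_high, r_high = precision_recall_at(5)

print(p_low, r_low)    # 0.8 0.8
print(p_high, r_high)  # 0.75 0.6
```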

In [None]:
threshold_with_90_precision = thresholds[np.argmax(precisions >=0.9)]
print('threshold with 90% precision:    ',threshold_with_90_precision)

y_train_pred_90 = (y_scores >= threshold_with_90_precision)
print('precision score:    ',precision_score(y_train_5, y_train_pred_90))
print('recall_score:    ',recall_score(y_train_5, y_train_pred_90))


#### 5. ***ROC Curve***

The **Receiver Operating Characteristic** (ROC) curve plots the True Positive Rate (TPR, another name for recall) against the False Positive Rate (FPR).

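FPR = FP/(FP + TN) = 1 - TNR, where TNR = TN/(TN + FP) is the true negative rate, also called specificity

Hence, the ROC curve plots TPR (sensitivity) versus 1 - specificity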
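
The relationship between these rates can be checked with a few lines of arithmetic. The confusion-matrix counts below are made up for illustration.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
toy_tn, toy_fp, toy_fn, toy_tp = 4, 2, 1, 3

toy_tpr = toy_tp / (toy_tp + toy_fn)  # recall / sensitivity
toy_tnr = toy_tn / (toy_tn + toy_fp)  # specificity
toy_fpr = toy_fp / (toy_fp + toy_tn)  # false positive rate

print(toy_tpr)  # 0.75
# FPR and 1 - specificity are the same quantity.
print(abs(toy_fpr - (1 - toy_tnr)) < 1e-12)  # True
```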

In [None]:
from sklearn.metrics import roc_curve
fpr,tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr,tpr, label=label)
    plt.plot([0,1],[0,1],'g--')

plot_roc_curve(fpr,tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')

plt.show()

Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dotted line represents the ROC curve of a purely
random classifier; a good classifier stays as far away from that line as possible (toward
the top-left corner).

#### 6. ***ROC AUC Score***

AUC = Area Under Curve

ROC AUC = Area Under Receiver Operating Characteristic Curve

Perfect classifier has ROC AUC = 1

Purely random classifier has ROC AUC = 0.5
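
These two reference points can be sketched with synthetic scores. The labels and scores below are randomly generated, not MNIST outputs: scores that rank every positive above every negative give an AUC of exactly 1, while scores unrelated to the labels land close to 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical binary labels for 10,000 instances.
toy_labels = rng.integers(0, 2, size=10_000)

# Perfect ranking: every positive scores higher than every negative.
perfect_scores = toy_labels.astype(float)
# Random ranking: scores carry no information about the labels.
random_scores = rng.random(size=10_000)

perfect_auc = roc_auc_score(toy_labels, perfect_scores)
random_auc = roc_auc_score(toy_labels, random_scores)

print(perfect_auc)           # 1.0
print(round(random_auc, 2))  # close to 0.5
```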

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5,y_scores)

As a rule of thumb, you should prefer the Precision/Recall curve whenever the positive
class is rare or when you care more about the false positives than
the false negatives, and the ROC curve otherwise.

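Compare the ROC curves of the SGD and Random Forest classifiers. RandomForestClassifier has no decision_function() method, so the predicted probability of the positive class is used as the score instead.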

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train,y_train_5,cv=3, method = 'predict_proba')
y_scores_forest = y_probas_forest[:,1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
plt.plot(fpr,tpr,"b:",label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest") 
plt.legend(loc="lower right")
plt.show()



In [None]:
roc_auc_score(y_train_5, y_scores_forest)