### MNIST Dataset
http://yann.lecun.com/exdb/mnist/

MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten digit images has served as a basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.
The MNIST database contains 60,000 training images and 10,000 test images.
![title](mnist.png)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.max_columns = None
random_state = 42

### Load Data 
scikit-learn can download the MNIST data for us. The first 60,000 images are training data and the next 10,000 are test data.

In [None]:
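from sklearn.datasets import fetch_openml

# fetch_mldata was removed from scikit-learn; fetch_openml is its replacement.
# as_frame=False returns NumPy arrays. OpenML stores the labels as strings, so cast them to int.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist['data'], mnist['target'].astype(int)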


X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

print('Training Shape {} Test Shape {}'.format(X_train.shape, X_test.shape))

### Create a Validation set
In real-world ML scenarios we create separate training, validation and test sets. We train the model on the training set, tune it using the validation set, and evaluate it on the test set, so that the final estimate is not biased by the tuning. Since we already have a test set, we only need to split the training set into separate training and validation sets. As we will see later, K-fold cross-validation removes the need for a fixed validation set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.2, random_state=random_state, stratify=y_train)
print('Training Shape {} Validation Shape {}'.format(X_train.shape, X_valid.shape))
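The K-fold cross-validation mentioned above can be sketched as follows. This is only an illustration, not part of the original notebook: it assumes the small 8 x 8 digits dataset bundled with scikit-learn (`load_digits`) as a fast stand-in for MNIST, and the names `X_small`, `y_small`, `cv`, `clf` and `scores` are made up for this sketch.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled 8x8 digits dataset stands in for MNIST to keep the run short.
X_small, y_small = load_digits(return_X_y=True)

# Five stratified folds: each fold is used exactly once as the validation set,
# while the remaining four folds are used for training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, X_small, y_small, cv=cv, scoring='accuracy')

print('Fold accuracies:', scores)
print('Mean accuracy: {:.3f}'.format(scores.mean()))
```

Because every sample is used for validation once, the mean of the fold accuracies plays the role of the single validation score computed from a fixed split.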


In [None]:
pd.DataFrame(X_train).head()

### Display Sample Image
 

In [None]:
def display_digit(digit):
    """Show a flattened 784-pixel digit as a 28 x 28 greyscale image."""
    digit_image = digit.reshape(28, 28)
    # With the 'binary' colormap, 0 renders white and 255 renders black
    plt.imshow(digit_image, cmap='binary', interpolation='nearest')
    plt.axis('off')
    plt.show()
    


In [None]:
digit = X_train[2]
display_digit(digit)

Each image consists of 28 x 28 pixels with values from 0 to 255. A pixel value is its greyscale intensity, increasing from 0 to 255. As the table below shows, the sample digit (a 6 in the original run) is represented by pixel intensities of varying values, and the regions with high intensities trace the shape of the digit.
 

In [None]:
pd.DataFrame(digit.reshape(28,28))
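To make the link between the 28 x 28 grid and the 784-value rows of `X` concrete, here is a minimal NumPy sketch. The image is a hypothetical one (a single vertical stroke), not an actual MNIST sample, and the names `image`, `flat`, `index` and `restored` are illustrative only.

```python
import numpy as np

# Hypothetical 28 x 28 image: background of 0 with one vertical stroke of intensity 255.
image = np.zeros((28, 28), dtype=np.uint8)
image[4:24, 14] = 255

# Flatten row by row into the 784-value layout used for each row of X.
flat = image.reshape(-1)

# With row-major flattening, pixel (row, col) lands at index row * 28 + col.
row, col = 10, 14
index = row * 28 + col

# Reshaping back recovers the original grid, which is what display_digit relies on.
restored = flat.reshape(28, 28)

print(flat.shape, flat[index], np.array_equal(image, restored))
```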

### Target Value Counts

In [None]:
pd.DataFrame(y_train)[0].value_counts()

## Train Model Using Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
model  = DecisionTreeClassifier(random_state = random_state)
model.fit(X_train, y_train) 

#### Validation Set Accuracy

In [None]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_valid)
test_acc = accuracy_score(y_valid, y_pred)
print('Validation accuracy', test_acc)

## Train Model Using Random Forest: Default

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state = random_state, n_jobs = 4)
model.fit(X_train, y_train) 

#### Validation set accuracy
We can see that with just the default parameters, Random Forest achieves better accuracy than a single decision tree. By default, Random Forest uses 10 decision trees in scikit-learn versions before 0.22 and 100 from 0.22 onward; the final result combines the predictions of all trees.

In [None]:
y_pred = model.predict(X_valid)
test_acc = accuracy_score(y_valid, y_pred)
print('Validation accuracy', test_acc)
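How the trees' predictions are combined can be sketched as follows. In scikit-learn, `RandomForestClassifier` averages the class probabilities predicted by its individual trees and picks the most probable class. The sketch assumes the small bundled `load_digits` dataset as a stand-in for MNIST, and the names `forest`, `per_tree_proba`, `avg_proba` and `manual_pred` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled 8x8 digits dataset stands in for MNIST to keep the run short.
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_small, y_small, test_size=0.2, random_state=42, stratify=y_small)

forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X_tr, y_tr)

# Class probabilities from each tree: shape (n_trees, n_samples, n_classes).
per_tree_proba = np.stack([tree.predict_proba(X_va) for tree in forest.estimators_])

# Average over the trees, then take the most probable class for each sample.
avg_proba = per_tree_proba.mean(axis=0)
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

print('Matches forest.predict_proba:', np.allclose(avg_proba, forest.predict_proba(X_va)))
print('Accuracy of averaged prediction: {:.3f}'.format((manual_pred == y_va).mean()))
```

Averaging smooths out the errors of individual trees, which is one reason the forest tends to outperform a single decision tree.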

## Train Model Using Random Forest: Tuned HyperParameters
The hyperparameters below were tuned manually; we will later see search algorithms that find good hyperparameters automatically.


In [None]:
model = RandomForestClassifier(n_estimators=80,
                               criterion='entropy',
                               bootstrap=False,
                               max_depth=30,
                               verbose=1,
                               random_state=random_state,
                               n_jobs=4)
model.fit(X_train, y_train) 


#### Validation Set Accuracy

In [None]:
y_pred = model.predict(X_valid)
test_acc = accuracy_score(y_valid, y_pred)
print('Validation accuracy', test_acc)
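As a preview of automated hyperparameter search, here is a minimal sketch using `GridSearchCV`. It is not how the values above were chosen (those were tuned by hand). The grid values are hypothetical, and the small bundled `load_digits` dataset is assumed as a stand-in for MNIST so that the search finishes quickly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: the bundled 8x8 digits dataset stands in for MNIST to keep the run short.
X_small, y_small = load_digits(return_X_y=True)

# Hypothetical grid; every combination is evaluated with 3-fold cross-validation.
param_grid = {
    'n_estimators': [20, 40],
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_small, y_small)

print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy: {:.3f}'.format(search.best_score_))
```

The grid here has 8 combinations, so the search fits 24 models plus one final refit on the full data; larger grids grow multiplicatively, which is why randomized search is often preferred for wide ranges.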


#### Test Set Accuracy

In [None]:
y_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print('Test accuracy', test_acc)


### Random Incorrect Predictions
Let's display 10 random test images that our model predicted incorrectly.
Notice that some of the images are difficult to identify even for humans.

In [None]:
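def display_incorrect_preds(y_test, y_pred):
    """Display up to 10 randomly chosen test images that the model misclassified."""
    test_labels = pd.DataFrame({'actual': y_test, 'pred': y_pred})
    incorrect_pred = test_labels[test_labels['actual'] != test_labels['pred']]
    # Sample at most 10 rows so the call also works when there are fewer than 10 errors
    random_incorrect_pred = incorrect_pred.sample(n=min(10, len(incorrect_pred)))

    # The DataFrame index is the row position of the image in X_test
    for i, row in random_incorrect_pred.iterrows():
        print('Actual Value:', row['actual'], 'Predicted Value:', row['pred'])
        display_digit(X_test[i])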
        
 
     

In [None]:
display_incorrect_preds(y_test, y_pred)      