### MNIST Dataset
http://yann.lecun.com/exdb/mnist/

MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.
The MNIST database contains 60,000 training images and 10,000 testing images.
![title](mnist.png)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.max_columns = None
random_state = 42

In [None]:
import time
def timer_start():
    global t0
    t0 = time.time()


def timer_end():
    t1 = time.time()   
    total = t1-t0
    print('Time elapsed', total) 
    return total

### Load Data 
The MNIST data can be fetched through sklearn. The first 60,000 images are the training data and the next 10,000 are the test data.

In [None]:
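# fetch_mldata was removed from scikit-learn (mldata.org is offline); fetch_openml replaces it
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
# OpenML returns the labels as strings, so cast them to integers
X, y = mnist['data'], mnist['target'].astype(int)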


X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

print('Training Shape {} Test Shape {}'.format(X_train.shape, X_test.shape))

### Create a Validation set
In real-world ML scenarios we create separate training, validation and test sets. We train the model on the training set, tune it using the validation set, and evaluate it on the test set, so that tuning decisions do not bias the final evaluation. Since we already have a test set, we only need to split the training set into separate training and validation sets. As we will see later, K-fold cross validation removes the need for a fixed validation set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2,
                                                      random_state=random_state, stratify=y_train)
print('Training Shape {} Validation Shape {}'.format(X_train.shape, X_valid.shape))
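
The K-fold cross validation mentioned above can be sketched in plain Python. This is only an illustration of the idea, not the notebook's method: the helper name `k_fold_indices` and its parameters are hypothetical, and unlike the split above it does not stratify by label (sklearn provides `StratifiedKFold` for that).

```python
import random

def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_idx, valid_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index starting at offset j
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        valid_idx = folds[j]
        # Every other fold is concatenated into the training indices
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train_idx, valid_idx

# Toy example with 10 samples and 5 folds
for fold, (train_idx, valid_idx) in enumerate(k_fold_indices(10, k=5)):
    print(fold, sorted(valid_idx))
```

Each sample lands in the validation fold exactly once, so every training image contributes to both fitting and validation across the k rounds.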


In [None]:
pd.DataFrame(X_train).head()

### Display Sample Image

In [None]:
import matplotlib

def display_digit(digit):
    digit_image = digit.reshape(28,28)
    plt.imshow(digit_image, cmap = matplotlib.cm.binary, interpolation = 'nearest')
    plt.axis('off')
    plt.show()
   
    
digit = X_train[92]
display_digit(digit)

Each image consists of 28 x 28 pixels with values from 0 to 255. The pixel values represent greyscale intensity, increasing from 0 to 255. As we can see below, the digit 4 is represented by pixels of varying intensity, and the regions with high pixel intensities trace the shape of the 4.
 

In [None]:
pd.DataFrame(digit.reshape(28,28))
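
To make the flat-vector-to-grid idea concrete without any libraries, here is a minimal sketch. The helpers `to_grid` and `render`, the threshold of 128, and the 4 x 4 toy image are all assumptions made for illustration; the toy values are not MNIST data.

```python
def to_grid(flat_pixels, width=28):
    """Split a flat list of pixel values into rows of `width` pixels."""
    return [flat_pixels[r:r + width] for r in range(0, len(flat_pixels), width)]

def render(grid, threshold=128):
    """Draw '#' where the intensity is at or above `threshold`, '.' elsewhere."""
    return '\n'.join(''.join('#' if p >= threshold else '.' for p in row)
                     for row in grid)

# Hypothetical 4 x 4 image (not MNIST data): a vertical stroke in column 1
toy = [0, 255, 0, 0,
       0, 200, 0, 0,
       0, 180, 0, 0,
       0, 255, 0, 0]
print(render(to_grid(toy, width=4)))
```

With the default `width=28`, the same `to_grid` call turns a 784-value MNIST row into 28 rows of 28 pixels.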

### Target Value Counts

In [None]:
pd.DataFrame(y_train)[0].value_counts()

## Train Model Using Gradient Boosted Machine
Training sklearn's GBM is extremely slow on a dataset this large, so it is not feasible for practical use here and the code is commented out. We will use a faster algorithm for boosted trees instead. The code is kept because it is still usable on small datasets.

In [None]:
# timer_start()
# from sklearn.ensemble import GradientBoostingClassifier
# model  = GradientBoostingClassifier(random_state = random_state,
#                                     verbose = 1)
# model.fit(X_train, y_train) 
# timer_end()

#### Validation Set Accuracy

In [None]:
# from sklearn.metrics import accuracy_score
# y_pred = model.predict(X_valid)
# test_acc = accuracy_score(y_valid, y_pred)
# print('Validation accuracy', test_acc)

## Train Model Using LightGBM: Default

LightGBM, developed by a Microsoft Research team, is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

<br>Faster training speed and higher efficiency
<br>Lower memory usage
<br>Better accuracy
<br>Parallel and GPU learning supported
<br>Capable of handling large-scale data

https://lightgbm.readthedocs.io/en/latest/

In [None]:
from lightgbm import LGBMClassifier
timer_start()
model = LGBMClassifier(  n_jobs = 4,                      
                        random_state  = random_state ,
                        objective =  'multiclass',
                        num_class = 10  )
model.fit(X_train, y_train )
          
timer_end()
model

#### Validation set accuracy
The default model gave an impressive accuracy of 97%, compared to the 94.5% accuracy of RandomForest.

In [None]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_valid)
test_acc = accuracy_score(y_valid, y_pred)
print('Validation accuracy', test_acc)

## Train Model Using LightGBM: Tuned with Early Stopping
The idea behind early stopping is that we train the model for a large number of iterations but stop when the validation score stops improving. This is a powerful mechanism for dealing with overfitting.

In [None]:
timer_start()
model = LGBMClassifier(  n_jobs = 4,    
                         n_estimators=10000,
                         random_state  = random_state ,
                         learning_rate = 0.1,
                         subsample = 0.8,
                         colsample_bytree= 0.8,
                         subsample_freq = 1,                 
                         objective =  'multiclass',
                         num_class = 10  )
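from lightgbm import early_stopping, log_evaluation

# LightGBM 4.0 and later removed the early_stopping_rounds and verbose arguments
# of fit(); the equivalent behaviour is requested through callbacks
model.fit( X_train, 
           y_train,
           eval_set=[(X_valid, y_valid)],
           eval_metric='multi_error',
           callbacks=[early_stopping(stopping_rounds=50), log_evaluation(period=40)],
           )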
          
timer_end()
model
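
The early-stopping rule described above can be sketched independently of LightGBM. This is a simplified illustration, not LightGBM's implementation: the function name `best_iteration`, the `patience` parameter, and the error values are all made up for the example.

```python
def best_iteration(valid_errors, patience):
    """Return (iteration, error) for the lowest validation error seen,
    stopping once `patience` rounds pass without improvement."""
    best_err = float('inf')
    best_iter = 0
    for i, err in enumerate(valid_errors, start=1):
        if err < best_err:
            best_err, best_iter = err, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` rounds, so stop training
    return best_iter, best_err

# Hypothetical validation error curve (made-up numbers)
errors = [0.10, 0.08, 0.07, 0.072, 0.071, 0.073, 0.06]
print(best_iteration(errors, patience=3))   # stops before reaching 0.06
print(best_iteration(errors, patience=10))  # patience large enough to see 0.06
```

The two calls show the trade-off: a small patience saves training time but can stop before a later improvement, which is why the notebook pairs a large `n_estimators` with a patience of 50 rounds.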

#### Validation Set Accuracy

In [None]:
y_pred = model.predict(X_valid)
test_acc = accuracy_score(y_valid, y_pred)
print('Validation accuracy', test_acc, 'error', 1-test_acc)


#### Test Set Accuracy
The test accuracy of 98.21% for the tuned LightGBM model is better than the 97.06% of the tuned RandomForest. The increase of 1.15 percentage points may not seem like much, but it means 115 more correct predictions on a test set of 10,000 samples.


In [None]:
y_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print('Test accuracy', test_acc)
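
As a quick check of the arithmetic behind the comparison above, the accuracy gap can be converted into a count of predictions. The helper name `extra_correct` is hypothetical; the two accuracies and the test-set size are the figures quoted in this section.

```python
def extra_correct(acc_a, acc_b, n_samples):
    """Number of additional correct predictions model A makes over model B."""
    # Round each count separately to avoid floating-point noise in the difference
    return round(acc_a * n_samples) - round(acc_b * n_samples)

# 98.21% (LightGBM) versus 97.06% (RandomForest) on 10,000 test images
print(extra_correct(0.9821, 0.9706, 10000))
```
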


### Random Incorrect Predictions
Let's display 10 random images from the test data that our model predicted incorrectly. Notice that some of the images are difficult to identify even for humans.

In [None]:
def display_incorrect_preds(y_test, y_pred):
    test_labels = pd.DataFrame()
    test_labels['actual'] = y_test
    test_labels['pred'] = y_pred
    incorrect_pred = test_labels[test_labels['actual'] != test_labels['pred'] ]
    random_incorrect_pred = incorrect_pred.sample(n= 10)

    for i, row in random_incorrect_pred.iterrows():
        print('Actual Value:', row['actual'], 'Predicted Value:', row['pred'])
        display_digit(X_test[i])
        


In [None]:
display_incorrect_preds(y_test, y_pred)        