## Baseline for MNIST with random forest

The MNIST dataset was downloaded from kaggle.com in .csv format.

train.csv contains 42000 training examples. Each row of the CSV file holds the digit label in the first column, followed by the 28x28 = 784 pixel values of the image.
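
To make the row layout concrete, here is a minimal sketch that splits one row into its label and a 28x28 image. The row is synthetic (it is not read from train.csv), and the row-major pixel order is an assumption based on the Kaggle data description.

```python
import numpy as np

# Synthetic stand-in for one row of train.csv:
# the label comes first, followed by 784 pixel values.
row = np.concatenate(([7], np.arange(784) % 256))

label = int(row[0])
# Assumed row-major order: pixel k sits at row k // 28, column k % 28.
image = row[1:].reshape(28, 28)

print('label=%d, image shape=%s' % (label, image.shape))
```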


In [1]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [2]:
dataset = pd.read_csv(r"~/Python/Kaggle Digits/train.csv")
labels = dataset['label'].values
train = dataset.iloc[:, 1:].values

In [3]:
print('Train set: ' + str(train.shape))
print('Labels: ' + str(labels.shape))

Train set: (42000, 784)
Labels: (42000,)


Shift and scale the pixel values from [0, 255] to roughly [-0.5, 0.5]:

In [4]:
def normalizeData(X):
    # input: X: 2D numpy array with shape (#datapoints, #features), pixel values in [0, 255]
    # output: X shifted by 128 and divided by 255, i.e. values in roughly [-0.5, 0.5]
    return (X - 128.0) / 255.0

X = normalizeData(train)
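
The helper above applies a fixed shift and scale to every pixel. It is not a per-feature standardization. The sketch below contrasts the two on synthetic data; `X_demo`, `scaled` and `standardized` are illustrative names, not part of the notebook.

```python
import numpy as np

# Synthetic pixel-like data in [0, 255]; a stand-in for the real training matrix.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 256, size=(100, 784)).astype(np.float64)

# Fixed shift and scale, as in normalizeData: values end up in [-128/255, 127/255].
scaled = (X_demo - 128.0) / 255.0
print('scaled range: [%.3f, %.3f]' % (scaled.min(), scaled.max()))

# Per-feature standardization: each column gets zero mean and unit variance.
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns (e.g. always-black border pixels)
standardized = (X_demo - mean) / std
print('max |column mean| after standardization: %.2e' % np.abs(standardized.mean(axis=0)).max())
```

Tree-based models split on per-feature thresholds, so either rescaling should have little effect on a random forest. The choice matters more for models that are sensitive to feature scale.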

Shuffle the data and split it into training and test sets (validation is handled later by cross-validation on the training set):

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1, random_state=0)

In [6]:
print('X_train: ' + str(X_train.shape))
print('y_train: ' + str(y_train.shape))
print('X_test: ' + str(X_test.shape))
print('y_test: ' + str(y_test.shape))

X_train: (37800, 784)
y_train: (37800,)
X_test: (4200, 784)
y_test: (4200,)
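
The split above is a plain random shuffle. As an optional variation that this notebook does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. A minimal sketch on synthetic data, with illustrative variable names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples, 10 balanced classes (a stand-in for the digits).
rng = np.random.RandomState(0)
X_demo = rng.rand(1000, 4)
y_demo = np.repeat(np.arange(10), 100)

# stratify=y_demo asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0, stratify=y_demo)

print('test samples per class: ' + str(np.bincount(y_te)))
```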


Optimize hyperparameters for the Random Forest Classifier using a grid search with 5-fold cross-validation:

In [7]:
from sklearn.model_selection import GridSearchCV
params = {'n_estimators': [300, 500], 'min_samples_split': [2, 5]}
clf = GridSearchCV(RandomForestClassifier(), params, cv=5, verbose=3)
clf.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] min_samples_split=2, n_estimators=300 ...........................
[CV] .. min_samples_split=2, n_estimators=300, score=0.964961 -   1.0s
[CV] min_samples_split=2, n_estimators=300 ...........................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   59.9s remaining:    0.0s


[CV] .. min_samples_split=2, n_estimators=300, score=0.962576 -   1.0s
[CV] min_samples_split=2, n_estimators=300 ...........................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.0min remaining:    0.0s


[CV] .. min_samples_split=2, n_estimators=300, score=0.968126 -   1.0s
[CV] min_samples_split=2, n_estimators=300 ...........................
[CV] .. min_samples_split=2, n_estimators=300, score=0.965996 -   1.0s
[CV] min_samples_split=2, n_estimators=300 ...........................
[CV] .. min_samples_split=2, n_estimators=300, score=0.964929 -   1.0s
[CV] min_samples_split=2, n_estimators=500 ...........................
[CV] .. min_samples_split=2, n_estimators=500, score=0.965490 -   1.6s
[CV] min_samples_split=2, n_estimators=500 ...........................
[CV] .. min_samples_split=2, n_estimators=500, score=0.963237 -   1.6s
[CV] min_samples_split=2, n_estimators=500 ...........................
[CV] .. min_samples_split=2, n_estimators=500, score=0.968655 -   1.6s
[CV] min_samples_split=2, n_estimators=500 ...........................
[CV] .. min_samples_split=2, n_estimators=500, score=0.966790 -   1.6s
[CV] min_samples_split=2, n_estimators=500 ...........................
[CV] .

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 26.6min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [300, 500], 'min_samples_split': [2, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=3)

Print the best parameters found by the grid search:

In [8]:
print(clf.best_params_)

{'min_samples_split': 2, 'n_estimators': 500}
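
Besides `best_params_`, a fitted `GridSearchCV` exposes the mean cross-validated score of every candidate in `cv_results_`, which shows how close the candidates are to each other. The sketch below assumes a much smaller grid and scikit-learn's bundled 8x8 digits as stand-in data so that it runs in seconds. It does not reproduce the numbers from this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data: scikit-learn's small 8x8 digits set, NOT the Kaggle MNIST file.
X_small, y_small = load_digits(return_X_y=True)

# Deliberately tiny grid (hypothetical values) to keep the sketch fast.
small_params = {'n_estimators': [10, 30], 'min_samples_split': [2, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_params, cv=3)
search.fit(X_small, y_small)

# Pair each candidate with its mean cross-validated accuracy, best first.
results = sorted(zip(search.cv_results_['mean_test_score'],
                     search.cv_results_['params']),
                 key=lambda pair: pair[0], reverse=True)
for score, candidate in results:
    print('%.4f  %s' % (score, candidate))

print('best: %s (%.4f)' % (search.best_params_, search.best_score_))
```

Since the full search here ran with `n_jobs=1`, passing `n_jobs=-1` to `GridSearchCV` is one way to spread the fits across all available cores.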


Predict with the refit best estimator and calculate accuracy on the training and test sets:

In [9]:
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
acc_train = metrics.accuracy_score(y_train, y_pred_train)
acc_test = metrics.accuracy_score(y_test, y_pred_test)
print('Random Forest Classifier on Input Data: ')
print('Train accuracy: ' + str(acc_train))
print('Test accuracy: ' + str(acc_test))

Random Forest Classifier on Input Data: 
Train accuracy: 1.0
Test accuracy: 0.964047619048
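
A single accuracy figure does not show which digits get confused with each other. As a possible follow-up, a confusion matrix gives a per-class breakdown. The sketch below uses a tiny hand-made label set (`y_true`, `y_pred` are illustrative) rather than this notebook's predictions.

```python
import numpy as np
from sklearn import metrics

# Hand-made labels for illustration; in the notebook these would be y_test and y_pred_test.
y_true = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2])

acc = metrics.accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)

# Per-class recall: correctly predicted samples divided by all samples of that class.
per_class_recall = cm.diagonal() / cm.sum(axis=1).astype(float)

print('accuracy: %.4f' % acc)
print(cm)
print('per-class recall: ' + str(per_class_recall))
```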
