<h1>Digit Recognizer Kaggle Competition</h1>

<h2>Using Random Forest Classifiers & Principal Components Analysis</h2>

<h3>Bryan Bruno</h3>

<h3>Building Environment</h3>

In [41]:
RANDOM_SEED = 1

import scipy.io
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import timeit
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import f1_score

from sklearn.metrics import confusion_matrix  
from sklearn.metrics import accuracy_score

In [42]:
train = pd.read_csv("train.csv")
train.shape

(42000, 785)

In [43]:
train.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
test = pd.read_csv("test.csv")
test.shape

(28000, 784)

In [45]:
test.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
y = train["label"]
X = train.drop(["label"], axis = 1)

In [47]:
X_train = X.values
y_train = y.values

<h3>Random Forest Classifier</h3>

In [48]:
start = timeit.default_timer()

rfc = RandomForestClassifier(n_estimators = 10, bootstrap = True, max_features = "sqrt")
rfc.fit(X_train, y_train)

stop = timeit.default_timer()
time = stop - start
print("Elapsed Run Time:", time)

Elapsed Run Time: 2.839517455999953


In [49]:
y_pred = cross_val_predict(rfc, X, y_train, cv = 10)

In [50]:
cmat = confusion_matrix(y_train, y_pred)  
print("Random Forest Confusion Matrix:\n", cmat)  
print("Random Forest Accuracy:\n", accuracy_score(y_train, y_pred))

Random Forest Confusion Matrix:
 [[4037    2   10    9    7   19   23    2   20    3]
 [   0 4613   23   13    8    8    2    6    8    3]
 [  30   14 3948   33   16    7   15   47   56   11]
 [  21   12   98 3989    6   91   14   32   65   23]
 [  15   12   12    5 3884    8   16   11   18   91]
 [  31    7   21  164   24 3462   29   10   26   21]
 [  41   10   16    2   21   54 3979    1   13    0]
 [  10   30   76   21   36    7    1 4146   13   61]
 [  20   32   55   93   30   65   31   16 3684   37]
 [  27   14   26   61  128   29    4   65   42 3792]]
Random Forest Accuracy:
 0.9412857142857143


In [51]:
print("F1 Score for Random Forest:\n", f1_score(y, y_pred, average="macro"))

F1 Score for Random Forest:
 0.9405988839068048


In [52]:
out = pd.read_csv("test.csv")
out.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [53]:
test_out = rfc.predict(out)
test_out

array([2, 0, 9, ..., 3, 9, 2], dtype=int64)

In [54]:
np.savetxt("testout.csv", test_out, delimiter = ",")

In [55]:
pd.DataFrame({"ImageId": list(range(1, len(test_out) + 1)),
              "Label": test_out}).to_csv("testout.csv", index = False, header = True)

After submitting this to Kaggle.com I recieved an accuracy score of 0.95285. This was about a 1% increase from the predicted accuracy. Not bad at all!

Username: brunster

<h3>Principal Components Analysis</h3>

In [83]:
y = train["label"]
X = train.drop(["label"], axis = 1)

combined = X.append(test, ignore_index = True)

combined.shape

(70000, 784)

In [85]:
X_train = X.values
y_train = y.values
y_train.reshape(-1,1)

print(X_train.shape)
print(y_train.shape)

(42000, 784)
(42000,)


In [86]:
start = timeit.default_timer()

pca = PCA(n_components = 0.95) 
X_pca = pca.fit_transform(X_train)

print("Elapsed Run Time:", time)

stop = timeit.default_timer()
time = stop - start

Elapsed Run Time: 6.544657907999863


In [87]:
pca.n_components_

154

After the PCA we can see a significant decrease in the number of principal components.

<h3>Principal Components Analysis Random Forest Classifier</h3>

In [95]:
start = timeit.default_timer()

pca_rfc = RandomForestClassifier(n_estimators = 10, bootstrap = True, max_features = "sqrt")
pca_rfc_model = pca_rfc.fit(X_train, y_train)

print("Elapsed Run Time:", time)

stop = timeit.default_timer()
time = stop - start

Elapsed Run Time: 5.876126433999616


In [96]:
y_pred = cross_val_predict(pca_rfc, X_pca, y_train, cv = 10)

In [97]:
cmat = confusion_matrix(y_train, y_pred)  
print(cmat)  
print("PCA Random Forest Accuracy: ", accuracy_score(y_train, y_pred))

[[3949    1   33   31   13   26   47    8   19    5]
 [   5 4570   29   17    5   15   11    9   18    5]
 [  89   26 3699   96   37   22   41   47  109   11]
 [  59   21  162 3739   13  157   17   37  111   35]
 [  27   31   67   26 3619   15   47   40   26  174]
 [  90   13   77  298   80 3042   56   23   90   26]
 [  93    7   77   30   45   74 3791    2   14    4]
 [  19   60   87   32  100   20   10 3914   24  135]
 [  95   40  151  246   67  227   33   39 3115   50]
 [  53   23   47   84  295   50   11  184   50 3391]]
PCA Random Forest Accuracy:  0.8768809523809524


In [98]:
print("F1 Score for PCA Random Forest:\n", f1_score(y, y_pred, average = "macro"))

F1 Score for PCA Random Forest:
 0.8746202747425709


Lower score than our original by a pretty decent amount. With the PCA applied, accuracy should be similar, if not slightly lower.

In [99]:
out2 = pd.read_csv("test.csv")
out2.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [100]:
test_out2 = pca_rfc.predict(out2)

In [101]:
np.savetxt("testout2.csv", test_out2, delimiter = ",")

In [102]:
pd.DataFrame({"ImageId": list(range(1, len(test_out2) + 1)),
              "Label": test_out2}).to_csv("testout2.csv", index = False, header = True)

After submitting this to Kaggle.com. I received an accuracy score of 0.94085.

Username: brunster

Higher accuracy than predicted. The PCA performed has an issue that I'll need to look into.

<h3>Hyper Parameter Tuning</h3>

This is my first Kaggle and while I'm happy with competing... I'd like to really move up on the leaderboards. I figure the best way to do this while focusing on Random Forest Classifiers is hyper parameter tuning. Let's tighten up our model and see if we can improve our accuracy!

Machine time is an intended tradeoff for this.

In [103]:
y = train["label"]
X = train.drop(["label"], axis = 1)

X_train = X.values
y_train = y.values

In [None]:
from sklearn.model_selection import GridSearchCV

start = timeit.default_timer()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = RANDOM_SEED)

param_grid = [{'n_estimators': [10, 100, 1000], 'max_features': [2, 8, 12, 'sqrt']},
              {'bootstrap': [True], 'n_estimators': [10, 100, 1000], 'max_features': [2, 8, 12, 'sqrt']}]

forest_class = RandomForestClassifier()
grid_search = GridSearchCV(forest_class, param_grid)
grid_search.fit(X_train, y_train)  

stop = timeit.default_timer()
time = stop - start
print("Elapsed Run Time:", time)

grid_search.best_params_

In [104]:
# performed earlier - printing results below:

# Elapsed Run Time: 2484.151869664
# {'max_features': 'sqrt', 'n_estimators': 1000}

In [105]:
start = timeit.default_timer()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = RANDOM_SEED)
rfc = RandomForestClassifier(n_estimators = 1000, bootstrap = True, max_features = "sqrt")

rfc.fit(X_train, y_train)
print("Random Forest Classifer Predicted Accuracy:", rfc.score(X_test, y_test))

stop = timeit.default_timer()
time = stop - start
print("Elapsed Run Time:", time)

Random Forest Classifer Predicted Accuracy: 0.9668571428571429
Elapsed Run Time: 198.69324952800025


In [106]:
start = timeit.default_timer()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = RANDOM_SEED)
rfc = RandomForestClassifier(n_estimators = 2000, bootstrap = True, max_features = "sqrt")

rfc.fit(X_train, y_train)
print("Random Forest Classifer Predicted Accuracy:", rfc.score(X_test, y_test))

stop = timeit.default_timer()
time = stop - start
print("Elapsed Run Time:", time)

Random Forest Classifer Predicted Accuracy: 0.9676190476190476
Elapsed Run Time: 398.285473764


In [107]:
out3 = pd.read_csv("test.csv")
test_out3 = rfc.predict(out3)
np.savetxt("testout3.csv", test_out3, delimiter = ",")
pd.DataFrame({"ImageId": list(range(1, len(test_out3) + 1)),
              "Label": test_out3}).to_csv("testout3.csv", index = False, header = True)

After waiting over 40 minutes for the hyper parameter tuning, we can see that the only real changes to our initial parameters are adding more estimators. At 1000 estimators, the predicted accuracy increases by around 3%. I didn't want to add 2000 estimators to the GridSearchCV due to the additional time it would take to run. Instead, I just performed the test with the adjusted estimators. It is a little disappointing to see that the estimated accuracy between 1000 to 2000 only increases by 0.0001, but should be expected at that high number of estimators based on the dataset size.

I submitted both the 1000 and 2000 estimators and the 1000 estimators just barely beat the 2000 one. My final highest accuracy submission is: 0.96457.

Username: brunster

<h3>RFC vs PCA RFC</h3>

The Random Forest and Principal Component Analysis Random Forest performed similarly in practice. There is a design flaw with the methodology recommended for us to take when performing the PCA. Since the training and test sets were combined, it most likely caused data leakage. The reasoning behind this is due to the arithmetic used. Compressing the full set of data transforms it. These additional calculations may result is a loss of information and accuracy, especially with larger or significant compression.

I have adjusted the PCA to only interact with the training data. Upon this adjustment, the results showed a slight decrease in accuracy over the regular Random Forest Classifier. The result has been submitted to Kaggle.com and I received a score of 0.94085. This is a marginal loss over the RFC accuracy score of 0.95285.

<h3>Conclusion</h3>

Between RFC and PCA RFC, I would recommend using a regular hyper parameter tuned RFC model to achieve the best performance. Even without tuning the parameters, the regular RFC still out performed the adjusted and corrected PCA. Aside from marginally higher scores, the RFC is much faster in terms of performance and implementation. There are less transformations and work involved to produce a better result. This model and recommendation is specific to this data in addition to only considering these two types of classifiers. There are better models and methods to achieve higher scores and I recommend looking into them. 