# Digit Recognizer

url = https://www.kaggle.com/c/digit-recognizer

### Understanding the Question

Given a dataset of images of handwritten digits, can we correctly classify which digit each image shows?  We have labels for the training set.

This is a supervised classification problem.

### Getting Started - Load & Inspect Data

The data is available from Kaggle at https://www.kaggle.com/c/digit-recognizer/data. Each image is 28 pixels by 28 pixels, giving a total of 784 pixels per image.  Each pixel has a value from 0 to 255 indicating how dark it is (the higher the value, the darker the pixel).  train.csv contains 785 columns: the first is the label and the remaining 784 are the pixel darkness values.
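
To make the 784-column layout concrete, here is a minimal sketch of how a flat pixel index could map back to a position in the 28 x 28 image. It assumes the pixels are stored in row-major order (pixel index = row * 28 + column); the helper name `pixel_position` and the synthetic array are illustrative only and are not part of the dataset or the notebook.

```python
import numpy as np

IMAGE_SIDE = 28  # each image is 28 x 28 pixels


def pixel_position(index, side=IMAGE_SIDE):
    """Return (row, col) for a flat pixel index, assuming row-major order."""
    return divmod(index, side)


# Synthetic stand-in for the 784 pixel columns of one row (not real Kaggle data)
flat_pixels = np.arange(IMAGE_SIDE * IMAGE_SIDE)

# Reshape the flat vector back into a 2-D image grid
image = flat_pixels.reshape(IMAGE_SIDE, IMAGE_SIDE)

row, col = pixel_position(783)
print(image.shape)
print(row, col)
```

Under that assumption, pixel783 is the bottom-right corner of the image and pixel0 is the top-left corner.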

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
df.shape

(42000, 785)

In [4]:
df.dtypes

label       int64
pixel0      int64
pixel1      int64
pixel2      int64
pixel3      int64
pixel4      int64
pixel5      int64
pixel6      int64
pixel7      int64
pixel8      int64
pixel9      int64
pixel10     int64
pixel11     int64
pixel12     int64
pixel13     int64
pixel14     int64
pixel15     int64
pixel16     int64
pixel17     int64
pixel18     int64
pixel19     int64
pixel20     int64
pixel21     int64
pixel22     int64
pixel23     int64
pixel24     int64
pixel25     int64
pixel26     int64
pixel27     int64
pixel28     int64
            ...  
pixel754    int64
pixel755    int64
pixel756    int64
pixel757    int64
pixel758    int64
pixel759    int64
pixel760    int64
pixel761    int64
pixel762    int64
pixel763    int64
pixel764    int64
pixel765    int64
pixel766    int64
pixel767    int64
pixel768    int64
pixel769    int64
pixel770    int64
pixel771    int64
pixel772    int64
pixel773    int64
pixel774    int64
pixel775    int64
pixel776    int64
pixel777    int64
pixel778  

In [5]:
df.describe()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
count,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,...,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0
mean,4.456643,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.219286,0.117095,0.059024,0.02019,0.017238,0.002857,0.0,0.0,0.0,0.0
std,2.88773,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.31289,4.633819,3.274488,1.75987,1.894498,0.414264,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,254.0,254.0,253.0,253.0,254.0,62.0,0.0,0.0,0.0,0.0


In [6]:
df.isnull().sum()

label       0
pixel0      0
pixel1      0
pixel2      0
pixel3      0
pixel4      0
pixel5      0
pixel6      0
pixel7      0
pixel8      0
pixel9      0
pixel10     0
pixel11     0
pixel12     0
pixel13     0
pixel14     0
pixel15     0
pixel16     0
pixel17     0
pixel18     0
pixel19     0
pixel20     0
pixel21     0
pixel22     0
pixel23     0
pixel24     0
pixel25     0
pixel26     0
pixel27     0
pixel28     0
           ..
pixel754    0
pixel755    0
pixel756    0
pixel757    0
pixel758    0
pixel759    0
pixel760    0
pixel761    0
pixel762    0
pixel763    0
pixel764    0
pixel765    0
pixel766    0
pixel767    0
pixel768    0
pixel769    0
pixel770    0
pixel771    0
pixel772    0
pixel773    0
pixel774    0
pixel775    0
pixel776    0
pixel777    0
pixel778    0
pixel779    0
pixel780    0
pixel781    0
pixel782    0
pixel783    0
dtype: int64

### Cleaning & Pre-Processing the Data

Normally I would also take some time to plot and visualize the data, but I don't think plots will give us any meaningful insights here, so I will skip that step.

As for cleaning, the data has no NaN values and all columns are numerical.  Every column will also be needed for this problem.

The big step we need to take here is to simply normalize our feature values.  Since we know that every feature value is between 0 and 255, we can simply divide by 255 to get a value between 0 and 1. 

Normalized X = X / 255
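
As a small illustration of what this scaling does, the sketch below applies the same division to a tiny synthetic array (not the Kaggle data). It also checks one consequence that matters for distance-based models: dividing every feature by 255 shrinks squared Euclidean distances by a factor of 255 squared.

```python
import numpy as np

# Synthetic stand-in for two rows of raw pixel values in 0-255 (not real data)
raw = np.array([[0, 128, 255, 64],
                [255, 0, 32, 64]], dtype=np.int64)

# Same transform as in the text: X / 255
scaled = raw / 255.0

# Squared Euclidean distance between the two rows, before and after scaling
raw_sq_dist = np.sum((raw[0] - raw[1]) ** 2)
scaled_sq_dist = np.sum((scaled[0] - scaled[1]) ** 2)

print(scaled.min(), scaled.max())
print(np.isclose(scaled_sq_dist * 255.0 ** 2, raw_sq_dist))
```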

In [7]:
#Copy the Raw Data into Feature and Label dataframes
train_X = df.drop('label', axis=1)
train_Y = df['label']

#Normalize Features
train_X = train_X / 255.0
train_X.describe()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
count,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,...,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0,42000.0
mean,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00086,0.000459,0.000231,7.9e-05,6.8e-05,1.1e-05,0.0,0.0,0.0,0.0
std,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.024756,0.018172,0.012841,0.006901,0.007429,0.001625,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.996078,0.996078,0.992157,0.992157,0.996078,0.243137,0.0,0.0,0.0,0.0


In [8]:
#Convert DataFrames to arrays for use in ML algorithms
X = train_X.values
Y = train_Y.values

### Build & Tune Support Vector Machine

For this problem I am choosing to use an SVM.  Since training on the full set of 42,000 examples takes a while, I am going to prototype on a smaller random subset of the training set.

While tuning a machine learning algorithm on a smaller subset of the data can lead to overfitting the hyper-parameters, the speed improvement is worth it for this example.

In [9]:
#Split Training Data into smaller random subset
from sklearn.model_selection import train_test_split
X_train_small, X_test_small, Y_train_small, Y_test_small = train_test_split(X, Y, test_size=0.95, random_state=5)

Now we have a random 5% subset of our training data.  Note that X_test_small and Y_test_small are not needed.
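
The split above is a plain random draw, so the digit classes in the subset may not be perfectly balanced. As an alternative sketch (not what this notebook does), the hypothetical helper `stratified_subset_indices` below samples the same fraction from every class using only NumPy, shown on synthetic labels. scikit-learn's `train_test_split` also accepts a `stratify` argument for the same purpose.

```python
import numpy as np


def stratified_subset_indices(labels, fraction, seed=5):
    """Pick roughly `fraction` of the indices from every class."""
    rng = np.random.RandomState(seed)
    chosen = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        # Keep at least one example per class
        n_keep = max(1, int(round(fraction * len(cls_idx))))
        chosen.append(rng.choice(cls_idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(chosen))


# Synthetic labels: 10 digits, 200 examples each (not the Kaggle data)
labels = np.repeat(np.arange(10), 200)
subset = stratified_subset_indices(labels, fraction=0.05)

print(len(subset))
print(np.bincount(labels[subset]))
```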

#### Get Baseline (default hyper-parameters)

In [11]:
#Test Default SVC
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

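#Initialize Model (gamma='auto' was the default when this was run; scikit-learn 0.22+ defaults to 'scale')
svm = SVC(gamma='auto')
#Create cross validation generator (folds are not shuffled, so random_state is omitted; recent scikit-learn raises an error if it is set without shuffle=True)
kfold = KFold(n_splits=7)
#Train & Test model
cross_val_results = cross_val_score(svm, X_train_small, Y_train_small, cv=kfold, scoring='accuracy')
print(cross_val_results.mean())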

0.857142857143


Remember, this default SVM is trained on only 5% of our training data.  This is our baseline accuracy with the default hyper-parameters.  The reason we use a smaller set is so we can test a few different parameters with the assumption being that improvement on this smaller set of data can translate into better results for our full training set.
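
To show what the 7-fold cross validation is doing with this subset, here is a rough sketch of how unshuffled K-fold partitions the row indices. It uses `np.array_split` as a stand-in for `KFold` without shuffling, and the sample count of 2,100 is simply 5% of the 42,000 training rows.

```python
import numpy as np

n_samples = 2100  # 5% of the 42,000 training rows
n_splits = 7

indices = np.arange(n_samples)

# Contiguous, unshuffled folds; each fold is the held-out set exactly once
folds = np.array_split(indices, n_splits)

for i, test_idx in enumerate(folds):
    # Everything outside the held-out fold is used for training
    train_idx = np.setdiff1d(indices, test_idx)
    print("fold %d: train=%d test=%d" % (i, len(train_idx), len(test_idx)))
```

Each model is therefore trained on 1,800 rows and scored on the remaining 300, and the reported accuracy is the mean over the seven held-out folds.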

#### Tune C parameter

In [14]:
c_range = np.logspace(-2, 10, 13) #Get a range of C values

#Loop and test each one, printing the results
for c in c_range:
    svm_model = SVC(C=c, gamma='auto') #gamma='auto' was the default before scikit-learn 0.22
    kfold = KFold(n_splits=7) #random_state only applies when shuffle=True
    cross_val_results = cross_val_score(svm_model, X_train_small, Y_train_small, cv=kfold, scoring='accuracy')
    print("Mean: %f   C: %f" % (cross_val_results.mean(), c))

Mean: 0.090000   C: 0.010000
Mean: 0.457619   C: 0.100000
Mean: 0.857143   C: 1.000000
Mean: 0.894762   C: 10.000000
Mean: 0.894286   C: 100.000000
Mean: 0.895714   C: 1000.000000
Mean: 0.895714   C: 10000.000000
Mean: 0.895714   C: 100000.000000
Mean: 0.895714   C: 1000000.000000
Mean: 0.895714   C: 10000000.000000
Mean: 0.895714   C: 100000000.000000
Mean: 0.895714   C: 1000000000.000000
Mean: 0.895714   C: 10000000000.000000


Using 5% of the training data, accuracy plateaus at about 0.895 once C reaches 10 or more.  This is not a perfectly optimal value, but it gives us a rough idea of which C values make sense for this problem and should be 'good enough' for our purpose.  Rough and robust choices tend to work better when predicting out-of-sample data.  We will fix C=10 and tune the gamma parameter next.
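
The C values tested above come from a logarithmic grid, which is a common way to scan a parameter whose useful range spans many orders of magnitude. The short sketch below just inspects what `np.logspace(-2, 10, 13)` produces.

```python
import numpy as np

# 13 points evenly spaced in the exponent, from 10**-2 to 10**10
c_range = np.logspace(-2, 10, 13)
print(c_range[:4])

# Each value is ten times the previous one
ratios = c_range[1:] / c_range[:-1]
print(np.allclose(ratios, 10.0))
```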

#### Tune gamma parameter

In [16]:
gamma_range = np.logspace(-9, 3, 13) #Get a range of gamma values

#Loop and test each one, printing the results
for g in gamma_range:
    svm_model = SVC(C=10, gamma=g)
    kfold = KFold(n_splits=7) #random_state only applies when shuffle=True
    cross_val_results = cross_val_score(svm_model, X_train_small, Y_train_small, cv=kfold, scoring='accuracy')
    print("Mean: %f   gamma: %f" % (cross_val_results.mean(), g))

Mean: 0.090000   gamma: 0.000000
Mean: 0.090000   gamma: 0.000000
Mean: 0.090000   gamma: 0.000000
Mean: 0.090000   gamma: 0.000001
Mean: 0.414762   gamma: 0.000010
Mean: 0.847143   gamma: 0.000100
Mean: 0.886667   gamma: 0.001000
Mean: 0.924286   gamma: 0.010000
Mean: 0.742381   gamma: 0.100000
Mean: 0.108095   gamma: 1.000000
Mean: 0.090000   gamma: 10.000000
Mean: 0.090000   gamma: 100.000000
Mean: 0.090000   gamma: 1000.000000
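
One way to see why accuracy collapses at both ends of this gamma range is to look at the RBF kernel itself, K(x, z) = exp(-gamma * ||x - z||^2), which is the kernel `SVC` uses by default. The sketch below evaluates it on two synthetic 784-feature vectors with values in [0, 1]; these are random stand-ins, not real digit images, so the exact numbers will differ from the Kaggle data.

```python
import numpy as np


def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))


rng = np.random.RandomState(0)
# Two synthetic "images" with 784 features in [0, 1] (not real data)
x = rng.rand(784)
z = rng.rand(784)

for gamma in [1e-9, 1e-2, 1e3]:
    print("gamma=%g  K(x, z)=%g" % (gamma, rbf_kernel_value(x, z, gamma)))
```

With a very small gamma every pair of points looks almost identical to the kernel (values near 1), and with a very large gamma every pair looks completely unrelated (values near 0). In both cases the model has little to separate the classes with, which is consistent with the near-chance accuracies at the extremes of the sweep.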


In [17]:
#Fine tune the gamma range a bit
gamma_range = [.001, .005, .0075, .01, .0125, .05, .075, .1]

#Loop and test each one, printing the results
for g in gamma_range:
    svm_model = SVC(C=10, gamma=g)
    kfold = KFold(n_splits=7) #random_state only applies when shuffle=True
    cross_val_results = cross_val_score(svm_model, X_train_small, Y_train_small, cv=kfold, scoring='accuracy')
    print("Mean: %f   gamma: %f" % (cross_val_results.mean(), g))

Mean: 0.886667   gamma: 0.001000
Mean: 0.911429   gamma: 0.005000
Mean: 0.920476   gamma: 0.007500
Mean: 0.924286   gamma: 0.010000
Mean: 0.928571   gamma: 0.012500
Mean: 0.914762   gamma: 0.050000
Mean: 0.851429   gamma: 0.075000
Mean: 0.742381   gamma: 0.100000
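
The loops above tune C first and then gamma with C held fixed. An alternative, sketched below, is to search both parameters jointly with scikit-learn's `GridSearchCV`. This is not what the notebook does: the example runs on scikit-learn's small built-in 8x8 digits set as a stand-in for the Kaggle data, and the grid values are illustrative assumptions, so the best parameters it reports should not be read as results for this competition.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small built-in 8x8 digits set as a stand-in for the Kaggle data
digits = load_digits()
X_demo = digits.data / 16.0  # pixel values in this set range from 0 to 16
y_demo = digits.target

# Illustrative grid; every (C, gamma) pair is cross validated
param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

A joint search costs more fits than two one-dimensional sweeps, but it can catch cases where the best gamma depends on the chosen C.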


#### Train Learner on all training data

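Now that we have roughly optimal parameters, let's train the actual learner we will use to make our predictions.  The final model below uses C=10 and gamma=0.01, which scored within half a percentage point of the best value in the fine-grained sweep (gamma=0.0125).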

In [18]:
svm_final = SVC(C=10, gamma=0.01)
svm_final.fit(X, Y)

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

#### Load & Transform Test Features

Now we are ready to make predictions on the test set.  To do this we need to load in the test data, apply the same feature transforms we did earlier, and then make predictions using our svm.

In [19]:
#Load Test Data
test_df = pd.read_csv("test.csv")
#Transform
test_df = test_df / 255.0
X_test = test_df.values

#### Make Predictions & Submit

The submission file should have two columns (ImageId and Label). ImageId runs from 1 to 28,000, as per Kaggle's submission instructions.

In [21]:
#Get Predictions
predictions = svm_final.predict(X_test)

#Create ImageId index
image_id = np.arange(1, 28001)

#Create Submission DataFrame
submission = pd.DataFrame({'ImageId':image_id, 'Label':predictions})
submission.head()

Unnamed: 0,ImageId,Label
0,1,2
1,2,0
2,3,9
3,4,9
4,5,3


In [23]:
#Write to File
submission.to_csv("digit_recognizer_submission.csv", index=False)
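
The cell above hard-codes 28,000 as the number of test images. As a small variation, the hypothetical helper `build_submission` below derives the ImageId range from the number of predictions instead, so it would still line up if the test set size changed. The five labels used in the demo are the first five predictions shown in the output above.

```python
import numpy as np
import pandas as pd


def build_submission(predictions):
    """Return a DataFrame with a 1-based ImageId and the predicted Label."""
    predictions = np.asarray(predictions)
    # ImageId starts at 1 and has one entry per prediction
    image_id = np.arange(1, len(predictions) + 1)
    return pd.DataFrame({"ImageId": image_id, "Label": predictions})


demo = build_submission([2, 0, 9, 9, 3])
print(demo.to_csv(index=False))
```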

#### Conclusion

Using this lightly tuned SVM, I was able to get a score of 0.980.