# Kaggle Competition: Digit Recognizer with MNIST
https://www.kaggle.com/c/digit-recognizer

I have some experience coding and using TensorFlow at this point, but I've only worked on regression problems. I'd like to try my first computer vision/classification problem and build my first convolutional neural network in TensorFlow.

### Setup
Let's begin the setup by importing the necessary libraries and datasets.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import math as m
import os
import hashlib
from sklearn.preprocessing import OneHotEncoder
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
pd.options.display.float_format = '{:.5f}'.format

filepath = os.path.join(os.getcwd(),"Data")

dataset = pd.read_csv(os.path.join(filepath,"train.csv"), index_col = None)
test = pd.read_csv(os.path.join(filepath,"test.csv"), index_col = None)

### Data exploration and cleanup

We'll look at both datasets and their columns, and check whether either contains missing values.

In [2]:
"Training Dataset Size (Rows, Columns): " + str(dataset.shape)
"Testing Dataset Size (Rows, Columns): " + str(test.shape)
dataset.head(n=3)
test.head(n=3)

"There are " + str(dataset.isnull().sum().sum()) + " missing labels in the training dataset."
"There are " + str(test.isnull().sum().sum()) + " missing labels in the testing dataset."

'Training Dataset Size (Rows, Columns): (42000, 785)'

'Testing Dataset Size (Rows, Columns): (28000, 784)'

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


'There are 0 missing labels in the training dataset.'

'There are 0 missing labels in the testing dataset.'

Neither the training dataset nor the testing dataset contains any missing values.

Let's examine the pixel columns. 

According to https://www.kaggle.com/c/digit-recognizer/data, the images are 28 x 28 pixel (784 pixels in total) grayscale images of handwritten digits, "0" through "9". The first column in the training set holds the actual handwritten digit, whereas the testing dataset has no digit label. That's for us to guess!

The first 28 pixel columns (pixel0 to pixel27) hold the darkness levels (0 being white, and 255 being completely black) of the topmost row of 28 pixels. Columns pixel28 to pixel55 hold the darkness levels of the second row from the top, and so on.
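As a quick illustration of that layout, here is a minimal sketch that reshapes one flat row of 784 values into a 28 x 28 grid. The row is synthetic (each value equals its own column number) rather than a real image, and the names `flat_row` and `image` are made up for this example.

```python
import numpy as np

# Synthetic stand-in for one row of pixel columns: the value in each slot is
# its own column number, so we can see where pixelN lands in the grid.
flat_row = np.arange(784)

# Row-major reshape: the first 28 values become the top row,
# the next 28 values become the second row, and so on.
image = flat_row.reshape(28, 28)

print(image[0, :3])  # start of the top row
print(image[1, :3])  # start of the second row

# pixelN sits at row N // 28, column N % 28.
row, col = divmod(55, 28)
print(row, col)
```

The same reshape applied to a real row of pixel values should give an array that can be passed to `plt.imshow` to view the digit.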


### Checking for class imbalance

One concern specific to classification problems is class imbalance, where one class (in this case, a digit) is under- or overrepresented in the dataset. If 99% of the images were 3s and the other digits made up the remaining 1%, a "dumb" classifier that guesses "3" for every image would get 99% of its classifications correct right off the bat. We'll check for class imbalance in the training dataset.

In [3]:
digit_counts = dataset["label"].value_counts().to_frame()
digit_counts.columns = ["Occurrences"]
digit_counts.columns.name = "Digit"
digit_counts["Frequency(%)"] = (digit_counts["Occurrences"]/digit_counts["Occurrences"].sum())*100
digit_counts["Relative Frequency"] = digit_counts["Occurrences"]/digit_counts["Occurrences"].min()
digit_counts

Digit,Occurrences,Frequency(%),Relative Frequency
1,4684,11.15238,1.23426
7,4401,10.47857,1.15968
3,4351,10.35952,1.14651
9,4188,9.97143,1.10356
2,4177,9.94524,1.10066
6,4137,9.85,1.09012
0,4132,9.8381,1.0888
4,4072,9.69524,1.07299
8,4063,9.67381,1.07062
5,3795,9.03571,1.0


While not a perfect 1/10 split across the digits, this imbalance is minor: the most common class, "1", appears only about 1.234 times as often as the least frequent class, "5".
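To put the imbalance in perspective, a classifier that always guesses the most common digit sets the floor that any real model has to beat. Here is a small sketch of that calculation using the counts from the table above; the variable names are just for illustration.

```python
# Occurrences per digit, copied from the table above.
occurrences = {1: 4684, 7: 4401, 3: 4351, 9: 4188, 2: 4177,
               6: 4137, 0: 4132, 4: 4072, 8: 4063, 5: 3795}

total = sum(occurrences.values())

# A "dumb" classifier that always predicts the most common digit.
majority_digit = max(occurrences, key=occurrences.get)
baseline_accuracy = occurrences[majority_digit] / total

print(total)
print(majority_digit, round(baseline_accuracy * 100, 2))
```

Always guessing "1" would be right about 11.15% of the time, close to the 10% expected from a perfectly balanced dataset, so plain accuracy remains a meaningful metric here.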

In [4]:
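def test_split(id, seed, test_proportion):
    """Deterministically assign a row to the validation set (1) or the training set (0).

    The row id and seed are hashed with MD5, so the same id and seed always give the same assignment.
    """
    if type(test_proportion) not in [float, int] or test_proportion > 1 or test_proportion < 0:
        raise ValueError("Test proportion must be a real number between 0 and 1")
    test = str(id) + str(seed)
    test_digest = hashlib.md5(test.encode("ascii")).hexdigest()
    test_hex = int(test_digest[-6:], 16) # last 6 hex digits only
    split = test_hex/0xFFFFFF # scale to the range [0, 1]
    if split > test_proportion:
        return 0
    else:
        return 1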
    
dataset['split'] = dataset.index.map(lambda x: test_split(id = x, seed = 'MNIST valid', test_proportion = 0.20))

train_valid_percentage = dataset['split'].sum()/dataset.shape[0] * 100 #verify split percentage is correct
"{0:.5f}".format(train_valid_percentage) + "% of the training set has been designated as the validation set" 

train, valid = dataset.loc[(dataset.split == 0)], dataset.loc[(dataset.split == 1)]
pixel_columns = [column_name for column_name in list(train) if column_name.startswith("pixel")]
label_columns = [column_name for column_name in list(train) if column_name.startswith("label")]
train_x, train_y = train.loc[:,pixel_columns], train.loc[:,label_columns]
valid_x, valid_y = valid.loc[:,pixel_columns], valid.loc[:,label_columns]

'20.00952% of the training set has been designated as the validation set'
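The split above hashes each row index together with a seed string, which makes it reproducible: the same index and seed always land on the same side of the split, no matter how many times the notebook is rerun. Here is a minimal sketch of that property, using a hypothetical helper `hash_fraction` that mirrors the hashing step in `test_split`.

```python
import hashlib

def hash_fraction(row_id, seed):
    """Map (row_id, seed) to a number in [0, 1] from the last 6 hex digits of an MD5 digest."""
    digest = hashlib.md5((str(row_id) + str(seed)).encode("ascii")).hexdigest()
    return int(digest[-6:], 16) / 0xFFFFFF

ids = range(10000)

# Run the same assignment twice with the same seed.
first = [hash_fraction(i, "MNIST valid") <= 0.20 for i in ids]
second = [hash_fraction(i, "MNIST valid") <= 0.20 for i in ids]

print(first == second)          # whether both runs agree on every id
print(sum(first) / len(first))  # share of ids assigned to validation
```

Because the assignment depends only on the index and the seed, changing the seed string gives a different but equally reproducible split.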

In [5]:
OneHot = OneHotEncoder() # sparse output is the default; newer scikit-learn releases renamed the `sparse` keyword to `sparse_output`
train_y = OneHot.fit_transform(train_y)