# Digit Recognizer in Python

<h2>Importing packages</h2>
<p>
pandas and NumPy are standard packages for data cleaning and manipulation. imread is the function that reads an image file into an array. accuracy_score computes classification accuracy, the evaluation metric used here. TensorFlow is the backend, and Keras is the high-level wrapper used to define the neural network.
</p>

In [1]:
%pylab inline
import os
import numpy as np
import pandas as pd
# Note: scipy.misc.imread was removed in SciPy 1.2, so this import needs an older SciPy.
from scipy.misc import imread
from sklearn.metrics import accuracy_score

import tensorflow as tf
import keras


Populating the interactive namespace from numpy and matplotlib


Using TensorFlow backend.


<br>
<p>
Set up a seeded random state object so that samples can be picked at random, and reproducibly, to spot-check the model's predictions.
</p>

In [2]:
seed = 100  # fixed seed so the random picks are reproducible
rng = np.random.RandomState(seed)
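
As a quick illustration of why the seed matters, two `RandomState` objects built from the same seed produce the same draws. This is a minimal sketch; the file names are hypothetical placeholders, not taken from the dataset:

```python
import numpy as np

# Hypothetical file names, used only for illustration.
filenames = ["0.png", "1.png", "2.png", "3.png", "4.png"]

# Two generators seeded identically yield identical selections.
first_pick = np.random.RandomState(100).choice(filenames)
second_pick = np.random.RandomState(100).choice(filenames)

print(first_pick == second_pick)  # True
```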

<br>
<p>
Define the root, data, and submission directories for the training and testing data. Naming them once keeps later paths short, because the data sits in a nested folder structure.
</p>

In [3]:
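root_dir = os.path.abspath('../..')
# These paths are absolute, so joining them onto root_dir would have no effect:
# os.path.join discards earlier components when a later one is absolute.
data_dir = '/Users/ebby/Desktop/kaggle/digit recognizer'
sub_dir = '/Users/ebby/Desktop/kaggle/digit recognizer'
# Only the last expression in a cell is displayed, so check all three at once.
all(os.path.exists(d) for d in (root_dir, data_dir, sub_dir))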

True
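
One detail of `os.path.join` is worth seeing in isolation: when a later component is an absolute path, everything before it is dropped. The sketch below uses `posixpath` (the implementation behind `os.path` on macOS and Linux) so that it behaves the same on any platform; `/some/root` is a hypothetical value chosen for illustration.

```python
import posixpath  # the os.path implementation on macOS and Linux

root_dir = "/some/root"  # hypothetical root, for illustration only
absolute = "/Users/ebby/Desktop/kaggle/digit recognizer"

# An absolute component makes join discard everything before it.
joined = posixpath.join(root_dir, absolute)
print(joined == absolute)  # True

# Relative components are appended with a separator inserted automatically.
train_csv = posixpath.join(absolute, "Train", "train.csv")
print(train_csv)
```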

<h3> Importing the training and testing data</h3>

In [4]:

train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
test = pd.read_csv(os.path.join(data_dir, 'Test_fCbTej3.csv'))

sample_submission = pd.read_csv(os.path.join(data_dir, 'Sample_Submission_lxuyBuB.csv'))

train.head()

FileNotFoundError: File b'/Users/ebby/Desktop/kaggle/digit recognizerTrain/train.csv' does not exist

<p>Take a look at one of the provided images to understand what the data looks like.</p>

In [None]:
img_name = rng.choice(train.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)

img = imread(filepath, flatten=True)

pylab.imshow(img,cmap='gray')
pylab.axis('off')
pylab.show()


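<p>Unlike a CSV file, the images must be read one at a time with imread. A for loop reads each image and appends it to a list, and the list is then stacked into a single array. The pixel values are divided by 255.0 to scale them to the range 0 to 1, and each image is flattened into a vector of 784 values.</p>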

In [None]:
# Read every training image into a list, then stack the list into one array.
temp_array = []
for image in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', image)
    img = imread(image_path, flatten=True)
    temp_array.append(img)

train_data = np.stack(temp_array)
train_data = train_data / 255.0  # scale pixel values to [0, 1]
train_data = train_data.reshape(-1, 784).astype('float32')  # one row per image

# Repeat the same steps for the test images.
temp_array = []
for image in test.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'test', image)
    img = imread(image_path, flatten=True)
    temp_array.append(img)

test_data = np.stack(temp_array)
test_data = test_data / 255.0
test_data = test_data.reshape(-1, 784).astype('float32')
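
The stack, scale, and flatten steps can be tried on synthetic data without any image files. This is a minimal sketch; the 28 x 28 image size is an assumption that is consistent with the 784 used above, and the constant-valued images are made up for illustration:

```python
import numpy as np

# Three hypothetical 28x28 grayscale images with pixel values in [0, 255].
images = [np.full((28, 28), value, dtype=np.float64) for value in (0, 127.5, 255)]

batch = np.stack(images)                           # shape (3, 28, 28)
batch = batch / 255.0                              # scale to [0, 1]
batch = batch.reshape(-1, 784).astype("float32")   # one 784-vector per image

print(batch.shape, batch.dtype, batch.min(), batch.max())
```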
    

<h3> Creating the binary vector matrix</h3>
<p>
The labels are digits from 0 to 9. Each label is converted to a binary (one-hot) vector, and the vectors are stacked into a matrix with one row per sample.
</p>

In [None]:
train_y = keras.utils.np_utils.to_categorical(train.label.values)
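
To make the encoding concrete, the same kind of matrix can be built with plain NumPy. This is a minimal sketch with hypothetical labels, not the Keras implementation; it fixes the number of classes at 10:

```python
import numpy as np

labels = np.array([3, 0, 9, 1])   # hypothetical digit labels
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)
print(one_hot[0])
```

Taking the argmax of each row recovers the original label, which is how predictions are turned back into digits later on.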

<h3>Creating training and validation data</h3>
<p>
The data is split 70:30. The first 70% of the rows are used for training and the remaining 30% for validation.
</p>

In [None]:
split_size = int(len(train_data)*0.7)

train_x, val_x = train_data[:split_size], train_data[split_size:]
train_y, val_y = train_y[:split_size], train_y[split_size:]
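
The slicing above keeps the rows in file order. If the file happened to be ordered by label, shuffling before the split would be safer. A minimal sketch of that variation, using hypothetical toy arrays in place of the real data:

```python
import numpy as np

rng = np.random.RandomState(100)

# Toy stand-ins for train_data and the labels (hypothetical sizes).
features = np.arange(20, dtype="float32").reshape(10, 2)
targets = np.arange(10)

order = rng.permutation(len(features))   # shuffled row indices
split_size = int(len(features) * 0.7)    # 70% for training

train_idx, val_idx = order[:split_size], order[split_size:]
train_x, val_x = features[train_idx], features[val_idx]
train_y, val_y = targets[train_idx], targets[val_idx]

print(train_x.shape, val_x.shape)
```

Indexing features and targets with the same index arrays keeps each row paired with its label.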

<h3>Defining the neural network</h3>
<p>Three dense layers are defined with Keras. The first two use the ReLU activation and the last uses softmax. The first layer takes the 784 input values and produces 50 outputs. The hidden layer takes those 50 values and again produces 50 outputs. The output layer produces 10 outputs, one for each digit.</p>

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation
input_num_units = 784
# The layer width is passed positionally because the keyword changed from
# output_dim (Keras 1) to units (Keras 2). Later layers infer their input size.
model = Sequential([
    Dense(50, input_dim=input_num_units),
    Activation('relu'),
    Dense(50),
    Activation('relu'),
    Dense(10),
    Activation('softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
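
The size of this network can be checked by hand. Each dense layer holds one weight per input-output pair plus one bias per output, so the count follows directly from the layer widths given above. A short sketch of that arithmetic:

```python
# Layer widths as described above: 784 inputs, two layers of 50 units, 10 outputs.
layer_sizes = [784, 50, 50, 10]

total_params = 0
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    layer_params = fan_in * fan_out + fan_out   # weights plus biases
    print(f"{fan_in:>3} -> {fan_out:>2}: {layer_params} parameters")
    total_params += layer_params

print("total:", total_params)  # total: 42310
```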


<h3>Training the model</h3>
<p>The model is trained for 5 epochs with a batch size of 100. The validation data is passed to fit so that loss and accuracy on it are reported after each epoch.</p>

In [None]:
# epochs replaces the older nb_epoch keyword from Keras 1.
trained_model = model.fit(train_x, train_y, epochs=5, batch_size=100, validation_data=(val_x, val_y))
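
The accuracy reported during training is the fraction of samples whose most probable class matches the true label. A minimal NumPy sketch of that calculation, using made-up probabilities rather than real model output:

```python
import numpy as np

# Hypothetical softmax outputs for four samples and three classes.
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
# One-hot ground truth for the same four samples.
one_hot_truth = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])

predicted = probabilities.argmax(axis=1)   # most probable class per row
actual = one_hot_truth.argmax(axis=1)      # undo the one-hot encoding

accuracy = float(np.mean(predicted == actual))
print(accuracy)  # 0.75
```

Passing the same two label arrays to the accuracy_score function imported at the top of the notebook should give the same value.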

Checking the predictions for a random image

In [None]:
test_output=model.predict_classes(test_data)
testvalue=model.predict(test_data)
img_name = rng.choice(test.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)

img = imread(filepath, flatten=True)

# Assumes test images are numbered consecutively after the training images.
test_index = int(img_name.split('.')[0]) - train.shape[0]

print("\nPrediction is:", test_output[test_index])
print(testvalue[test_index])
pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()
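
The index arithmetic above relies on the image files being numbered consecutively, with the test files continuing where the training files stop. The sketch below shows that assumption next to a direct lookup that does not depend on it; the counts and file names are hypothetical:

```python
# Hypothetical setup: 3 training images (0.png to 2.png), then 3 test images.
train_count = 3
test_filenames = ["3.png", "4.png", "5.png"]
img_name = "4.png"

# Approach used above: subtract the number of training rows from the file number.
index_by_arithmetic = int(img_name.split(".")[0]) - train_count

# Alternative: find the row position of the file name directly.
index_by_lookup = test_filenames.index(img_name)

print(index_by_arithmetic, index_by_lookup)  # 1 1
```

The two approaches agree only while the numbering assumption holds. The lookup is slower on long lists but stays correct if files are missing or renumbered.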

Transferring the predicted values to a CSV file

In [None]:
submission = pd.DataFrame({
        "filename": test["filename"],
        "label": test_output,
    })
submission.to_csv(os.path.join(sub_dir, 'sub02.csv'), index=False)

In [None]:
test.columns