# MNIST - An Exploration into Data Analysis

This is a monologue of my progress through this project.

MNIST is the Hello World of neural networks. My aim is to learn as much as possible about neural networks and other data analysis methods by analysing this dataset, and to learn how best to use the features of the high-level programming language Python.

My approach to this problem will be through the following steps:
1. Understand the raw data
2. Consider different potential solutions to the problem
3. Pick a model
4. Process the raw data so it is in a suitable form for the algorithms to work
5. Fit/train a model
6. Analyse the model and try to find improvements

## 1 - Understanding the Raw Data

The data are in CSV files. The training file has 785 columns: the first column is the label, that is, the digit drawn by the user, and the remaining 784 columns hold the pixel values. Each row represents one whole greyscale image of 28x28 pixels, so the images are of relatively low resolution. The testing data has the same layout but with the label column omitted, hence there are only 784 columns in that file.

The images can be reconstructed by taking the first 28 cells as the first row of pixels of the image, the subsequent 28 cells as the second row of pixels, and so on.
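
The row-by-row reconstruction can be sketched with NumPy. This is a minimal illustration that assumes a single flattened image; `flat_image` is a made-up array whose values are simply 0 to 783, so that each pixel's original position is easy to trace.

```python
import numpy as np

# Hypothetical flattened image: 784 values in row-major order.
# The values are synthetic (0..783), not real pixel intensities.
flat_image = np.arange(28 * 28)

# Cells 0-27 become row 0, cells 28-55 become row 1, and so on.
image = flat_image.reshape(28, 28)

# The pixel at (row, col) comes from flat index row * 28 + col.
row, col = 3, 5
same_pixel = image[row, col] == flat_image[row * 28 + col]
print(image.shape, same_pixel)
```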

The submission file must have the following format:
```
ImageId,Label
1,0
2,0
3,0
etc.
```
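
A file in this format could be produced with pandas along the following lines. This is only a sketch: `predictions` is a placeholder array standing in for the output of a model, and the CSV is written to a string here rather than to a file.

```python
import numpy as np
import pandas as pd

# Placeholder predictions (hypothetical); real ones would come from a model.
predictions = np.array([0, 0, 0])

submission = pd.DataFrame({
    "ImageId": np.arange(1, len(predictions) + 1),  # image ids start at 1
    "Label": predictions,
})

# With no path given, to_csv returns the CSV text as a string.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv` instead would write the same content to disk.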

## 2 - Potential Solutions

One popular solution is to use a neural network to classify the digits. This is something I will explore using TensorFlow on my GPU. If this is successful, I may then look to improve the speed and efficiency of the training process by training the model in C/C++ on my GPU.

This problem would also lend itself well to some sort of clustering algorithm. I will consider researching and implementing k-means clustering, along with the algorithms I have learned at university.

A classification and regression tree (CART) may also work here.

Finally, I am aware of support-vector machines. I am not entirely sure how these work, so more research is required, but they are another method that should be explored.

## 3 - Neural Network Model

The neural network model requires an initial layer with one node per pixel. In this case, that is 28^2 = 784 input nodes. The output layer requires as many nodes as there are classes, which is 10 here. This final layer usually uses a softmax activation across its nodes. This converts each node's output to a value between 0 and 1 such that all the values sum to 1. This means the output can be interpreted as the probability of the image belonging to each particular class.
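
An output layer whose values lie between 0 and 1 and sum to 1 is what the softmax function produces. Below is a minimal NumPy sketch of it; the `logits` values are invented for illustration and do not come from a trained model.

```python
import numpy as np

def softmax(logits):
    """Map raw output-layer scores to probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp without changing the result.
    shifted = logits - np.max(logits)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical raw scores for the ten digit classes 0-9.
logits = np.array([1.0, 2.0, 0.5, 0.1, 3.0, 0.0, -1.0, 0.3, 0.2, 1.5])
probabilities = softmax(logits)

# The predicted digit is the class with the highest probability.
predicted_digit = int(np.argmax(probabilities))
print(probabilities.sum(), predicted_digit)
```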

The difficult part of neural networks is filling in the hidden layers. Without any hidden layers, the network can only learn a linear model (for classification, essentially multinomial logistic regression). However, deeper networks with non-linear activations, and other special layers such as convolutional layers, allow a model to pick out different features. The aim here is to learn how these layers can be combined to create a good model, and this shall be done through hands-on experimentation and through researching online.
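
One way to see why depth alone is not enough: stacking layers with no activation between them is still a single linear map, and it is the non-linearity that changes this. The sketch below checks this numerically with random weights; the layer sizes and all the values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for two stacked layers (sizes chosen arbitrarily).
W1 = rng.normal(size=(784, 32))
W2 = rng.normal(size=(32, 10))
x = rng.normal(size=(5, 784))  # five synthetic flattened images

# Two linear layers applied in sequence...
two_layer = (x @ W1) @ W2
# ...equal one linear layer whose weight matrix is the product W1 @ W2.
one_layer = x @ (W1 @ W2)
collapsed = np.allclose(two_layer, one_layer)

# With a ReLU between the layers, the network is no longer that linear map.
with_relu = np.maximum(x @ W1, 0) @ W2
differs = not np.allclose(with_relu, one_layer)
print(collapsed, differs)
```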

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
#import matplotlib.pyplot as plt
import tensorflow as tf

I shall now read the train and test data files.

In [2]:
raw_data_train = pd.read_csv('data/train.csv')
raw_data_test = pd.read_csv('data/test.csv')

print(type(raw_data_test))

print(raw_data_train.head())

print('The training data has the following dimensions', raw_data_train.shape)
print('The testing data has the following dimensions', raw_data_test.shape)

<class 'pandas.core.frame.DataFrame'>
   label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  \
0      1       0       0       0       0       0       0       0       0   
1      0       0       0       0       0       0       0       0       0   
2      1       0       0       0       0       0       0       0       0   
3      4       0       0       0       0       0       0       0       0   
4      0       0       0       0       0       0       0       0       0   

   pixel8  ...  pixel774  pixel775  pixel776  pixel777  pixel778  pixel779  \
0       0  ...         0         0         0         0         0         0   
1       0  ...         0         0         0         0         0         0   
2       0  ...         0         0         0         0         0         0   
3       0  ...         0         0         0         0         0         0   
4       0  ...         0         0         0         0         0         0   

   pixel780  pixel781  pixel782  pix

Evidently the train dataset has 42000 rows and the test dataset has 28000 rows. So there is plenty of data to be working with!

First, I need to preprocess the data so it is in a usable form. The block below converts the raw data, stored as pd.DataFrames, to np.arrays, reshapes each image to 28x28 and scales the pixel values to the range 0 to 1.

In [11]:
img_rows, img_cols = 28, 28
num_classes = 10

def prep_raw_data(raw_data):
    """Splits the labelled training data into labels and reshaped pixel arrays.
    """
    out_y = raw_data.label
    x = raw_data.values[:, 1:]
    return out_y, reshape_data(x)

def reshape_data(array):
    """Returns n x 28 x 28 numpy array with pixel values scaled to the range [0, 1].
    """
    if isinstance(array, pd.DataFrame):
        # If array is a pd.DataFrame, assume it is the testing data (no label column) and convert it to a numpy array
        array = array.values

    # Pixel values run from 0 to 255, so dividing by 255 scales them to [0, 1]
    return array.reshape(array.shape[0], img_rows, img_cols) / 255

y_train, x_train = prep_raw_data(raw_data_train)
x_test = reshape_data(raw_data_test)