## The MNIST Dataset

Our repository does not host the data. It is available from [Kaggle](https://www.kaggle.com/c/digit-recognizer/data). Currently, the `.gitignore` file ignores `mnist/data/`. You can place the files in a local directory of the same name or in some other location. If you choose another location, please add it to `.gitignore`.

From the Kaggle Data Description:
> The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

> Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

> The training data set, (train.csv), has 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.



## Import the training set to a numpy array

In [61]:
import numpy as np
training_data = np.genfromtxt('../data/train.csv', delimiter = ',', skip_header = 1, dtype=np.dtype(int))

After importing the numpy library, we use the `genfromtxt` routine to create a numpy array from the CSV file. The parameter `skip_header = 1` skips the first row of the file, which contains the column headers, and `dtype=np.dtype(int)` stores every value as an integer.
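
To see the effect of `skip_header` without the full download, here is a minimal sketch that runs the same call on a tiny in-memory CSV. The three-pixel layout and the values in `sample_csv` are made up for illustration; the real file has 784 pixel columns.

```python
import io
import numpy as np

# Hypothetical stand-in for train.csv: one header row, then two data rows.
sample_csv = io.StringIO(
    "label,pixel0,pixel1,pixel2\n"
    "7,0,128,255\n"
    "3,12,0,64\n"
)

# Same arguments as the real import: comma-separated, skip the header row, integer values.
sample = np.genfromtxt(sample_csv, delimiter=',', skip_header=1, dtype=np.dtype(int))

print(sample.shape)  # two rows, four columns (label + three pixels)
print(sample)
```

Without `skip_header = 1`, the header row would be parsed as data, and its text fields cannot be converted to integers.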

In [62]:
training_data.shape

(42000, 785)

The dataset contains 42,000 training images. As described above, each row consists of a label column followed by 784 columns of gray-scale pixel values. Let's separate the label column from the image data as follows.

In [63]:
training_labels = training_data[:,0]
training_labels.shape

(42000,)

In [64]:
np.unique(training_labels, return_counts = True)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([4132, 4684, 4177, 4351, 4072, 3795, 4137, 4401, 4063, 4188]))
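
The counts above can also be read as shares of the 42,000 rows. The following is a small sketch that reuses the printed counts directly rather than reloading the data, so the `counts` array is simply copied from the output.

```python
import numpy as np

# Per-digit counts copied from the np.unique output above (digits 0 through 9).
counts = np.array([4132, 4684, 4177, 4351, 4072, 3795, 4137, 4401, 4063, 4188])

# Fraction of the training set that each digit accounts for.
proportions = counts / counts.sum()

for digit, share in enumerate(proportions):
    print(f"{digit}: {share:.3f}")

print("least common digit:", proportions.argmin())
print("most common digit:", proportions.argmax())
```

A perfectly balanced set would give each digit a share of 0.100.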

The label array has the expected shape and contains only the digits 0 through 9. From the counts we also see a fairly uniform distribution of digits. Now let's pull out the image arrays and save the results.

In [65]:
training_images = training_data[:,1:]
training_images.shape

(42000, 784)

In [69]:
np.save('../data/training_labels', training_labels)
np.save('../data/training_images', training_images)
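
`np.save` appends the `.npy` extension when the file name does not already have one, so the arrays can be read back later with `np.load`. The sketch below shows the round trip using small stand-in arrays and a temporary directory; the values are made up for illustration. It also reshapes one 784-value row into a 28 x 28 grid, assuming the pixels are stored row by row.

```python
import os
import tempfile
import numpy as np

# Stand-in arrays with the same layout as training_labels and training_images (made-up values).
labels = np.array([7, 3])
images = np.arange(2 * 784).reshape(2, 784) % 256

with tempfile.TemporaryDirectory() as tmp:
    # np.save writes training_labels.npy and training_images.npy.
    np.save(os.path.join(tmp, 'training_labels'), labels)
    np.save(os.path.join(tmp, 'training_images'), images)

    # Loading requires the full file name, including the .npy extension.
    loaded_labels = np.load(os.path.join(tmp, 'training_labels.npy'))
    loaded_images = np.load(os.path.join(tmp, 'training_images.npy'))

# View one flat row as a 28 x 28 image (assumes row-major pixel order).
first_image = loaded_images[0].reshape(28, 28)
print(loaded_labels.shape, loaded_images.shape, first_image.shape)
```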