# MNIST DATASET

This notebook explains what the MNIST dataset is and how to read its data into memory so it can be used to train a neural network.


## Table of Contents
1. [What is the MNIST Dataset](#WhatIs)
2. [Imports](#Imports)
3. [Reading Bytes From Files](#reading)
4. [Little and Big Endian](#endian)
5. [Displaying the Image](#image)
6. [Displaying the Labels](#label)
7. [Conclusion](#conclusion)


## What is the MNIST Dataset <a name="WhatIs"></a>

The MNIST dataset is a large database of handwritten digits, 70,000 images in total, that is commonly used for training and testing image processing systems. It is also widely used in the field of machine learning.

The dataset is distributed as four files:

1. The first file contains 60,000 training images.
2. The second file contains 60,000 labels for the training images.
3. The third file contains 10,000 test images.
4. The fourth file contains 10,000 labels for the test images.

The files can be downloaded from the MNIST website in gzip form. The images are not stored in a standard image format such as .png, so we must write our own code to read them before they can be displayed or saved.


## Imports <a name="Imports"></a>

In [1]:
import gzip
import numpy as np
import matplotlib.pyplot as plt


## Reading Bytes From Files <a name="reading"></a>

In [None]:
with gzip.open('data/train-images-idx3-ubyte.gz', 'rb') as f:
    file_content = f.read()
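
The cell above leaves the whole decompressed file in memory as a single bytes object. As a minimal sketch of the same step, the helper below wraps it in a function and tries it on a small throwaway file, so it runs without the MNIST download. The names read_gzip_bytes, demo_path and demo_bytes are made up for this illustration.

```python
import gzip
import os
import tempfile


def read_gzip_bytes(path):
    """Return the full decompressed contents of a gzip file as bytes."""
    with gzip.open(path, 'rb') as f:
        return f.read()


# Write a tiny gzip file to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.gz')
    with gzip.open(demo_path, 'wb') as f:
        f.write(b'\x00\x00\x08\x03')
    demo_bytes = read_gzip_bytes(demo_path)

print(type(demo_bytes), len(demo_bytes))
```

With the real data in place, read_gzip_bytes('data/train-images-idx3-ubyte.gz') should return the same bytes as file_content.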

## Little and Big Endian <a name="endian"></a>


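A value that spans several bytes can be stored in two different byte orders, and the order a CPU uses natively depends on its architecture.

Big endian stores the most significant byte first, in the leftmost position.

Little endian stores the least significant byte first, so the most significant byte ends up in the rightmost position.

The MNIST files store their integers in big endian order, which is why byteorder='big' is used in the cells that follow.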
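
To make the difference concrete, the short sketch below decodes the same four bytes in both byte orders. The bytes 00 00 08 03 are the big endian encoding of 2051, the magic number of the image file. The variable names are chosen only for this illustration.

```python
import sys

# The same four bytes, interpreted in each byte order.
magic_bytes = bytes([0x00, 0x00, 0x08, 0x03])

as_big = int.from_bytes(magic_bytes, byteorder='big')
as_little = int.from_bytes(magic_bytes, byteorder='little')

print(as_big)     # 2051
print(as_little)  # 50855936

# The byte order this machine uses natively ('little' or 'big').
print(sys.byteorder)
```

Reading the bytes in the wrong order gives a very different number, which is why the byte order has to be stated explicitly when decoding the header.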

First, we take the first four bytes as a slice.

In [None]:
magic_bytes = file_content[0:4]

We now check the data type of the slice, which is bytes.

In [None]:
type(magic_bytes)

Print the magic number. This value should be 2051 if the data has been read correctly.

In [None]:
int.from_bytes(file_content[0:4], byteorder='big')

The next four bytes hold the number of images the file contains.

In [None]:
file_content[4:8]

int.from_bytes(file_content[4:8], byteorder='big')

We can see there are 60,000 images in the above example.

The next eight bytes contain the dimensions of each image: four bytes for the number of rows, then four bytes for the number of columns.

In [None]:
file_content[8:12]
int.from_bytes(file_content[8:12], byteorder='big')

In [None]:
int.from_bytes(file_content[12:16], byteorder='big')

We can see they are 28 x 28 pixels.
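
The four header fields read above can also be unpacked in one step with the standard struct module. This is a minimal sketch rather than the notebook's own method: parse_image_header is a hypothetical helper, and the demo header is built by hand so the example runs without the file.

```python
import struct


def parse_image_header(data):
    """Unpack the 16-byte header of an IDX image file.

    '>' selects big endian and each 'I' is a 4-byte unsigned integer.
    """
    magic, count, rows, cols = struct.unpack('>IIII', data[:16])
    if magic != 2051:
        raise ValueError(f'unexpected magic number: {magic}')
    return count, rows, cols


# Synthetic header carrying the same values as the training image file.
demo_header = struct.pack('>IIII', 2051, 60000, 28, 28)
print(parse_image_header(demo_header))  # (60000, 28, 28)
```

Calling parse_image_header(file_content) on the real data should give the same tuple as the individual int.from_bytes calls.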

After the header, the rest of the file holds the pixel values as unsigned bytes. Because each image is 28 x 28 pixels, every 784 bytes make up a new image.
Each unsigned byte is between 0 and 255, where 0 is white and 255 is black.
The values in between are shades of grey that get darker as the value increases.
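
The byte range of any image follows from this layout: skip the 16 header bytes, then skip 784 bytes for every earlier image. The sketch below works that arithmetic through. The constant names and the image_slice helper are assumptions made for this example.

```python
HEADER_SIZE = 16          # magic number, image count, rows, columns (4 bytes each)
ROWS, COLS = 28, 28
IMAGE_SIZE = ROWS * COLS  # 784 bytes per image


def image_slice(index):
    """Return the (start, stop) byte offsets of the image at a zero-based index."""
    start = HEADER_SIZE + index * IMAGE_SIZE
    return start, start + IMAGE_SIZE


print(image_slice(0))  # (16, 800)
print(image_slice(1))  # (800, 1584)
```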

## Displaying the Image <a name="image"></a>

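To display the image, we build an array from the next 784 bytes in the file and reshape it into a 28 x 28 array of unsigned bytes. The ~ operator inverts every pixel value, because matplotlib's gray colour map draws low values as black and high values as white. Inverting shows the digit as dark strokes on a light background.

We use matplotlib to render the image in greyscale.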

In [None]:
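# Bytes 16 to 799 hold the first image: view them as unsigned bytes,
# reshape to 28 x 28 and invert the values with ~.
image = ~np.frombuffer(file_content[16:800], dtype=np.uint8).reshape(28, 28)

plt.imshow(image, cmap='gray')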

Let's show the second image in the file now.

In [None]:
# Bytes 800 to 1583 hold the second image.
image = ~np.frombuffer(file_content[800:1584], dtype=np.uint8).reshape(28, 28)

plt.imshow(image, cmap='gray')
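
The cells above slice out one image at a time. For training it is usually more convenient to hold every image in a single array. The following is a minimal sketch of one way to do that with np.frombuffer. The helper images_from_bytes is hypothetical and the demo data is a synthetic stand-in for file_content.

```python
import numpy as np


def images_from_bytes(data, rows=28, cols=28, header_size=16):
    """View the pixel bytes after the header as an array of shape (n, rows, cols)."""
    pixels = np.frombuffer(data, dtype=np.uint8, offset=header_size)
    return pixels.reshape(-1, rows, cols)


# Synthetic stand-in: a 16-byte header followed by two blank 28 x 28 images.
demo_content = bytes(16) + bytes(2 * 28 * 28)
demo_images = images_from_bytes(demo_content)
print(demo_images.shape)  # (2, 28, 28)
```

Applied to the real file_content, this should yield an array of shape (60000, 28, 28), and images_from_bytes(file_content)[0] should hold the same pixels as the first image above before inversion. The array is a read-only view of the bytes, so call .copy() on it before modifying pixel values.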

## Displaying the Labels <a name="label"></a>

Reading the image labels works the same way as above, except that each label is a single integer value.

The file's first eight bytes are reserved for the magic number and the number of labels contained in the file.

After the first eight bytes, each subsequent byte holds the label for one image.

In [None]:
with gzip.open('data/train-labels-idx1-ubyte.gz', 'rb') as f:
    labels = f.read()

Below we get the first and second labels from the file. If they match the digits displayed above, the files have been read in correctly.

In [None]:
int.from_bytes(labels[8:9], byteorder='big')

In [None]:
int.from_bytes(labels[9:10], byteorder='big')
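
Since every byte after the eight-byte header is one label, all labels can be read in a single call instead of one slice per label. This is a sketch under that assumption: labels_from_bytes is a hypothetical helper, and the demo bytes are synthetic, using 2049 as the magic number of the label file with three made-up labels.

```python
import struct

import numpy as np


def labels_from_bytes(data, header_size=8):
    """Return every label after the header as a 1-D array of unsigned bytes."""
    return np.frombuffer(data, dtype=np.uint8, offset=header_size)


# Synthetic label file: magic number, label count, then one byte per label.
demo_label_bytes = struct.pack('>II', 2049, 3) + bytes([5, 0, 4])
demo_labels = labels_from_bytes(demo_label_bytes)
print(demo_labels.tolist())  # [5, 0, 4]
```

On the real data, labels_from_bytes(labels)[0] and labels_from_bytes(labels)[1] should match the two values printed above.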

## Conclusion <a name="conclusion"></a>

In this notebook I have explained what the MNIST dataset is and how to read the gzipped files into memory so that the images and labels can be used to train a neural network.