# _MNIST Dataset_

<img src="images/MNIST.png"/>

### About

MNIST (Modified National Institute of Standards and Technology) is a large database of handwritten digits, assembled from data originally collected by NIST (the National Institute of Standards and Technology). It is widely used to train and benchmark image processing systems and is often described as the "hello world" of computer vision.

MNIST contains 60,000 training images and 10,000 testing images. The training images are used to train a system, and the testing images are used to evaluate the trained system on digits it has not seen before.

### This Notebook

This notebook explains the famous MNIST dataset and shows various methods of loading it into memory for use.

#### Downloading the dataset

In [1]:
# Adapted code from
#       Download file from URL
#         - https://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python
#       Use Operating System commands
#         - https://stackoverflow.com/questions/1274405/how-to-create-new-folder

# Import OS to use operating system commands
import os
# Import urllib.request to make a HTTP request.
import urllib.request

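# Create a directory "data" (exist_ok=True avoids an error if it already exists, e.g. when re-running this cell)
os.makedirs("./data", exist_ok=True)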
# Change directory to "data"
os.chdir("./data")

print("Downloading Files... (This may take a while depending on your internet connection)")

# Use urllib's request to retrieve the gzip files from yann.lecun.com and save them into the current directory.
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', 'train-images-idx3-ubyte.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', 'train-labels-idx1-ubyte.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', 't10k-images-idx3-ubyte.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', 't10k-labels-idx1-ubyte.gz')

Downloading Files... (This may take a while depending on your internet connection)


('t10k-labels-idx1-ubyte.gz', <http.client.HTTPMessage at 0x22df3d1e978>)

The script above creates a directory called "data" inside your current working directory, changes into it, and downloads the four gzip files there for later use.
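
Re-running the download cell fetches all four files again. As a minimal sketch of one way to avoid that, the helper below skips any file that is already on disk. Note that `download_missing`, its `fetch` parameter, and the `BASE_URL` and `FILES` constants are hypothetical names introduced for this illustration; only the URLs and file names come from the cell above.

```python
import os
import urllib.request

# Base URL and file names are the ones used in the download cell above.
BASE_URL = "http://yann.lecun.com/exdb/mnist/"
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def download_missing(target_dir, files=FILES, fetch=urllib.request.urlretrieve):
    """Download each file into target_dir unless it is already there.

    Hypothetical helper for this sketch: fetch(url, path) defaults to
    urllib.request.urlretrieve. Returns the names that were actually fetched.
    """
    # Create the target directory if needed; do nothing if it already exists.
    os.makedirs(target_dir, exist_ok=True)
    fetched = []
    for name in files:
        path = os.path.join(target_dir, name)
        # Skip files that a previous run has already saved.
        if os.path.exists(path):
            continue
        fetch(BASE_URL + name, path)
        fetched.append(name)
    return fetched
```

Calling `download_missing("./data")` would then download only the files that are not yet in "data". Unlike the cell above, this sketch does not change the working directory.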

#### Unzip the files
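
Before reading the real files, it may help to see the layout that the label reader in this section relies on: a 4-byte big-endian magic number (2049), a 4-byte big-endian label count, then one byte per label. The sketch below builds a tiny synthetic label file in memory and parses it back. The helper names `build_label_file` and `parse_label_file` and the five sample labels are made up for illustration; they are not part of the dataset or of this notebook's code.

```python
import gzip
import io
import struct


def build_label_file(labels):
    """Return gzip-compressed bytes laid out like an MNIST label file."""
    # Header: two big-endian unsigned 32-bit integers (magic number, label count).
    header = struct.pack(">II", 2049, len(labels))
    # Body: one unsigned byte per label.
    return gzip.compress(header + bytes(labels))


def parse_label_file(raw):
    """Parse gzip-compressed label bytes and return (magic, count, labels)."""
    with gzip.open(io.BytesIO(raw)) as f:
        magic, count = struct.unpack(">II", f.read(8))
        # Iterating over a bytes object yields integers, one per label.
        labels = list(f.read(count))
    return magic, count, labels


# Synthetic sample labels, chosen only for this demonstration.
magic, count, labels = parse_label_file(build_label_file([5, 0, 4, 1, 9]))
print(magic, count, labels)  # prints: 2049 5 [5, 0, 4, 1, 9]
```

Here `struct.unpack(">II", ...)` reads both header fields in one call, which is equivalent to the two `int.from_bytes(..., byteorder="big")` calls used in this notebook.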

In [2]:
import gzip
# Importing numpy to convert lists into arrays.
import numpy
# Importing PIL to convert arrays into images.
from PIL import Image

# The functions (read_labels and read_images) are used to decompress, read and store a file 
# by passing the file location as a parameter.
# read_labels will read the label files and return a list of labels.
def read_labels(file):
    # The with statement closes the file automatically, even if an error occurs,
    # so no explicit f.close() is needed.
    with gzip.open(file) as f:
        # Magic number - *Expected to be 2049*
        magic_num = int.from_bytes(f.read(4), byteorder="big")

        # Number of labels - *Expected to be 60000 for the training file & 10000 for the testing file*
        no_labels = int.from_bytes(f.read(4), byteorder="big")

        # Print out file details.
        # If parsed correctly, the values should match the expected values.
        print("File:",file,"\nMagic number: \t\t%10d\nNumber of Images: \t%10d\n"%(magic_num,no_labels))

        # Create a list of labels -
        # I don't have the processing power to loop over each label/image so I just use the first
        # n labels where n = no_labels // 1000. I assume that if it works for n labels, it will work for
        # all of them.
        # Loop over the number of labels divided by 1000, reading in each label one byte at a time.
        label_list = [int.from_bytes(f.read(1), byteorder="big") for i in range(no_labels // 1000)]

        # Return the list of labels to be used in different functions.
        return label_list