# Implementing a Simple Neural Network to Classify the notMNIST Dataset

To complete this exercise, we will:

1. Prepare Data
2. Preprocess Data
3. Build Neural Network
4. Train Neural Network
5. Test Neural Network

We will start by importing our dependencies.

In [4]:
# standard python libraries
import hashlib # for asserting on file checksum
import os # for navigating through directories
import pickle # for serializing python objects
from urllib.request import urlretrieve # for downloading dataset
from zipfile import ZipFile # for reading zip files

# third party libraries
import numpy as np # numerical array library
from PIL import Image # Python Image Library, for loading images
# Scikit-learn
from sklearn.model_selection import train_test_split # for splitting data into train/validation
from sklearn.preprocessing import LabelBinarizer # for one-hot encoding
from sklearn.utils import resample # for random sampling from dataset
from tqdm import tqdm # progress meter library

print('All modules imported')

All modules imported


## Prepare Data

In this phase we will:

1. Download the notMNIST dataset
2. Uncompress features and labels
3. Randomly sample a subset of 150,000 images

In [12]:
def download(url, file):
    # check if file was already downloaded
    if not os.path.isfile(file):
        # if not, then download file
        print('Downloading {}...'.format(file))
        urlretrieve(url, file)
        print('Download Finished')

# download the training and test dataset
download('https://s3.amazonaws.com/udacity-sdc/notMNIST_train.zip', 'train.zip')
download('https://s3.amazonaws.com/udacity-sdc/notMNIST_test.zip', 'test.zip')

# check if files are corrupted
assert hashlib.md5(open('train.zip', 'rb').read()).hexdigest() == 'c8673b3f28f489e9cdf3a3d74e2ac8fa', 'File is corrupted, download it again'
assert hashlib.md5(open('test.zip', 'rb').read()).hexdigest() == '5d3c7e653e63471c88df796156a9dfa9', 'File is corrupted, download it again'

print('All files downloaded')

All files downloaded
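
The checksum assertions above read each archive fully into memory and leave the file handle open. As a minimal sketch, assuming a helper named `md5_of_file` (hypothetical, not part of the original notebook), the same digest can be computed in fixed-size chunks:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

With this helper, the first check could be written as `assert md5_of_file('train.zip') == 'c8673b3f28f489e9cdf3a3d74e2ac8fa'`.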


In [17]:
def uncompress_features_labels(file):
    features = []
    labels = []
    with ZipFile(file) as zipfile:
        # Progress Bar - ZipFile.namelist() returns a list of files in the zip archive
        file_progress_bar = tqdm(zipfile.namelist(), unit='files')
        
        # loop through the archive members
        for filename in file_progress_bar:
            # skip directory entries
            if not filename.endswith('/'):
                # convert features to images
                with zipfile.open(filename) as image_file:
                    # open and load image
                    image = Image.open(image_file)
                    image.load()
                    # convert the image to a float32 numpy array and flatten it to 1 dimension
                    feature = np.array(image, dtype=np.float32).flatten()

                # extract the label from the file name
                # filename has the format train/A34.png
                # os.path.split gives ('train', 'A34.png')
                # we want the second element ([1]) and its first character ([0])
                label = os.path.split(filename)[1][0]
                
                # append label and feature to array
                features.append(feature)
                labels.append(label)
                
    return np.array(features), np.array(labels)

# get the features and labels from the zip files
train_features, train_labels = uncompress_features_labels('train.zip')
test_features, test_labels = uncompress_features_labels('test.zip')

print('All features and labels uncompressed')

100%|██████████| 210001/210001 [00:32<00:00, 6440.30files/s]
100%|██████████| 10001/10001 [00:01<00:00, 6633.29files/s]

All features and labels uncompressed
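
The label extraction inside the loop relies on each file name starting with its class letter. A small standalone sketch, assuming a hypothetical helper `label_from_name`, shows the indexing on the example path from the comments:

```python
import os


def label_from_name(name):
    """Return the class letter encoded as the first character of the file name."""
    # os.path.split('train/A34.png') -> ('train', 'A34.png')
    return os.path.split(name)[1][0]


print(label_from_name('train/A34.png'))  # A
```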





In [19]:
# limit the amount of training data to work with
sample_size = 150000
train_features, train_labels = resample(train_features, train_labels, n_samples=sample_size, replace=False, random_state=123)

print('Random subset sampled for training data')

Random subset sampled for training data
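
`resample` with `replace=False` draws distinct rows and keeps features and labels aligned. The idea can be sketched in plain Python, assuming a hypothetical helper `sample_pairs`; this illustrates the behavior and is not how scikit-learn implements it:

```python
import random


def sample_pairs(features, labels, n_samples, seed=123):
    """Draw n_samples (feature, label) pairs without replacement."""
    rng = random.Random(seed)
    # sample distinct indices once, then apply them to both sequences
    indices = rng.sample(range(len(features)), n_samples)
    return [features[i] for i in indices], [labels[i] for i in indices]
```

Because the same indices are applied to both sequences, each sampled feature stays paired with its own label.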


## Preprocess Data

Now we will preprocess the data in four steps:

1. Normalize features
2. One-Hot Encode labels
3. Randomize and split datasets for training and validation
4. Checkpoint: Serialize all features and labels

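Min-Max Scaling:

$$
X' = a + \frac{\left(X - X_{\min}\right)\left(b - a\right)}{X_{\max} - X_{\min}}
$$

Here $a = 0.1$ and $b = 0.9$ are the bounds of the target range, and the grayscale pixel values give $X_{\min} = 0$ and $X_{\max} = 255$.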

In [20]:
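def normalize_min_max_scaling(image_data):
    # rescale grayscale pixel values from [0, 255] to [a, b] = [0.1, 0.9]
    a, b = 0.1, 0.9
    x_min, x_max = 0, 255
    normalized_image = a + ((image_data - x_min) * (b - a)) / (x_max - x_min)
    return normalized_image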

train_features = normalize_min_max_scaling(train_features)
test_features = normalize_min_max_scaling(test_features)

print('Training and testing features normalized')

Training and testing features normalized
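
As a quick sanity check of the min-max formula, a scalar version (the name `min_max_scale` is hypothetical) can be evaluated at the two endpoints and the midpoint of the pixel range:

```python
import math


def min_max_scale(x, x_min=0.0, x_max=255.0, a=0.1, b=0.9):
    """Map x from [x_min, x_max] onto [a, b]."""
    return a + (x - x_min) * (b - a) / (x_max - x_min)


# the darkest pixel maps to a, the brightest to b, and mid-gray to the midpoint
assert math.isclose(min_max_scale(0), 0.1)
assert math.isclose(min_max_scale(255), 0.9)
assert math.isclose(min_max_scale(127.5), 0.5)
```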


In [21]:
# one-hot encode labels
encoder = LabelBinarizer()
encoder.fit(train_labels)
train_labels = encoder.transform(train_labels)
test_labels = encoder.transform(test_labels)

# change label type to float32
train_labels = train_labels.astype(np.float32)
test_labels = test_labels.astype(np.float32)

print('Training and test labels one-hot encoded')

Training and test labels one-hot encoded


In [24]:
# randomize and split dataset for training and validation
train_features, valid_features, train_labels, valid_labels = train_test_split(train_features, train_labels, test_size=0.05, random_state=832289)

print('Training features and labels randomized and split')

Training features and labels randomized and split
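
The resulting set sizes can be worked out by hand. The sketch below assumes the hold-out count is the requested fraction of the samples, rounded up; `split_sizes` is a hypothetical helper, not a scikit-learn function:

```python
def split_sizes(n_samples, valid_percent):
    """Return (n_train, n_valid) for a hold-out split given in whole percent."""
    # integer ceiling division avoids floating-point rounding surprises
    n_valid = -(-n_samples * valid_percent // 100)
    return n_samples - n_valid, n_valid


print(split_sizes(150000, 5))  # (142500, 7500)
```

Under that assumption, a 5% split of the 150,000 sampled images leaves 142,500 for training and 7,500 for validation.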


In [28]:
# serialize and save the data so later steps can reload it
pickle_file = 'myModel.pickle'
if not os.path.isfile(pickle_file):
    print('Saving data to pickle file...')
    try:
        with open(pickle_file, 'wb') as pfile:
            pickle.dump(
                {
                    'train_dataset': train_features,
                    'train_labels': train_labels,
                    'valid_dataset': valid_features,
                    'valid_labels': valid_labels,
                    'test_dataset': test_features,
                    'test_labels': test_labels
                }, pfile, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        print('Unable to save data to {}: {}'.format(pickle_file, e))
        raise
        
print('Data cached in pickle file')

Saving data to pickle file...
Data cached in pickle file
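
The point of the checkpoint is that later steps can reload the data without repeating the preprocessing. A minimal sketch of the reverse operation, assuming a hypothetical helper `load_checkpoint`:

```python
import pickle


def load_checkpoint(path):
    """Load the dictionary of arrays saved by the cell above."""
    with open(path, 'rb') as pfile:
        return pickle.load(pfile)
```

For example, `data = load_checkpoint('myModel.pickle')` would return the dictionary, and `data['train_dataset']` and `data['train_labels']` would give back the training arrays. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.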


In [25]:
print(train_labels[:5,:])

[[ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]]
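
Each row above has a single 1 in the column of its class. To read a row back as a letter, one option is to take the index of the largest entry. The sketch below assumes the columns follow the sorted letters A to J, which is the order `LabelBinarizer` would store in `encoder.classes_` if all ten letters appear in the training labels:

```python
CLASSES = 'ABCDEFGHIJ'  # assumed column order: the ten notMNIST letters, sorted


def decode_one_hot(row, classes=CLASSES):
    """Return the class whose column holds the largest value in a one-hot row."""
    return classes[max(range(len(row)), key=lambda i: row[i])]


first_row = [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]
print(decode_one_hot(first_row))  # G
```

In the notebook itself, `encoder.inverse_transform(train_labels[:5])` should give the same mapping without hard-coding the class order.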


In [26]:
print(train_features[:5,:])

[[ 0.1         0.61450982  0.38235295 ...,  0.1         0.1         0.1       ]
 [ 0.1         0.10627451  0.1        ...,  0.14078432  0.1         0.10627451]
 [ 0.1         0.1         0.1        ...,  0.1         0.1         0.1       ]
 [ 0.21607843  0.78705883  0.86862749 ...,  0.1         0.1         0.1       ]
 [ 0.1         0.1         0.1        ...,  0.1         0.1         0.1       ]]


## Build Neural Network

## Train Neural Network

## Test Neural Network