# Implementing a Simple NN to Classify the notMNIST Dataset

To complete this exercise, we will:

1. Prepare Data
2. Preprocess Data
3. Build Neural Network
4. Train Neural Network
5. Test Neural Network

We start by importing our dependencies.

In [4]:
# standard python libraries
import hashlib # for asserting on file checksum
import os # for navigating through directories
import pickle # for serializing python objects
from urllib.request import urlretrieve # for downloading dataset
from zipfile import ZipFile # for reading zip files

# third party libraries
import numpy as np # numerical array library
from PIL import Image # Python Image Library, for loading images
# Scikit-learn
from sklearn.model_selection import train_test_split # for splitting data into train/validation
from sklearn.preprocessing import LabelBinarizer # for one-hot encoding
from sklearn.utils import resample # for random sampling from dataset
from tqdm import tqdm # progress meter library

print('All modules imported')

All modules imported


## Prepare Data

In this phase we will:

1. Download the notMNIST dataset
2. Uncompress features and labels
3. Randomly sample a subset of 150000 images

In [12]:
def download(url, file):
    # check if file was already downloaded
    if not os.path.isfile(file):
        # if not, then download file
        print('Downloading {}...'.format(file))
        urlretrieve(url, file)
        print('Download Finished')

# download the training and test dataset
download('https://s3.amazonaws.com/udacity-sdc/notMNIST_train.zip', 'train.zip')
download('https://s3.amazonaws.com/udacity-sdc/notMNIST_test.zip', 'test.zip')

# check if files are corrupted
assert hashlib.md5(open('train.zip', 'rb').read()).hexdigest() == 'c8673b3f28f489e9cdf3a3d74e2ac8fa', 'File is corrupted, download it again'
assert hashlib.md5(open('test.zip', 'rb').read()).hexdigest() == '5d3c7e653e63471c88df796156a9dfa9', 'File is corrupted, download it again'

print('All files downloaded')

All files downloaded
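
The checksum asserts above read each archive into memory in one call and never close the file handle explicitly. As a minimal sketch of an alternative, the helper below hashes a file in fixed-size chunks inside a `with` block. The name `md5_of_file` and the 1 MiB default chunk size are assumptions made for illustration, not part of the original notebook.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    # the with-block closes the file even if reading fails
    with open(path, 'rb') as f:
        # f.read returns b'' at end of file, which stops the iteration
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

With such a helper, the first check could be written as `assert md5_of_file('train.zip') == 'c8673b3f28f489e9cdf3a3d74e2ac8fa'`.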


In [17]:
def uncompress_features_labels(file):
    features = []
    labels = []
    with ZipFile(file) as zipfile:
        # Progress Bar - ZipFile.namelist() returns a list of files in the zip archive
        file_progress_bar = tqdm(zipfile.namelist(), unit='files')
        
        # loop through files
        for file in file_progress_bar:
            # check if file is not a directory
            if not file.endswith('/'):
                # convert features to images
                with zipfile.open(file) as image_file:
                    # open and load image
                    image = Image.open(image_file)
                    image.load()
                    # transform image into nparray and flatten the image to a 1 dimensional array, float32
                    feature = np.array(image, dtype=np.float32).flatten()
                
                # extract labels from file name
                # file is in format train/A34.png
                # if we split, we get ['train', 'A34.png']
                # we want the second element ([1]) and the first character ([0])
                label = os.path.split(file)[1][0]
                
                # append label and feature to array
                features.append(feature)
                labels.append(label)
                
    return np.array(features), np.array(labels)

# get the features and labels from the zip files
train_features, train_labels = uncompress_features_labels('train.zip')
test_features, test_labels = uncompress_features_labels('test.zip')

print('All features and labels uncompressed')

100%|██████████| 210001/210001 [00:32<00:00, 6440.30files/s]
100%|██████████| 10001/10001 [00:01<00:00, 6633.29files/s]

All features and labels uncompressed





In [19]:
# limit the amount of training data to work with
sample_size = 150000
train_features, train_labels = resample(train_features, train_labels, n_samples=sample_size, replace=False, random_state=123)

print('Random subset sampled for training data')

Random subset sampled for training data


## Preprocess Data

Next, we preprocess the data in four steps:

1. Normalize features
2. One-Hot Encode labels
3. Randomize and split datasets for training and validation
4. Checkpoint: Serialize all features and labels

Min-Max Scaling rescales each value $X$ from its original range $[X_{\min}, X_{\max}]$ to a target range $[a, b]$:

$$
X' = a + \frac{(X - X_{\min})(b - a)}{X_{\max} - X_{\min}}
$$
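
To make the formula above concrete, here is a small worked sketch with the range limits passed as explicit arguments. The function name `min_max_scale` is hypothetical. The values `x_min=0` and `x_max=255` assume 8-bit grayscale pixels, and `a=0.1`, `b=0.9` are the target limits used in this notebook.

```python
import numpy as np


def min_max_scale(x, x_min, x_max, a=0.1, b=0.9):
    """Rescale values from [x_min, x_max] to [a, b]."""
    x = np.asarray(x, dtype=np.float32)
    return a + (x - x_min) * (b - a) / (x_max - x_min)


# three sample pixel intensities: darkest, one in between, brightest
pixels = np.array([0, 51, 255])
scaled = min_max_scale(pixels, x_min=0, x_max=255)
print(scaled)
```

The darkest pixel maps to `a` and the brightest to `b`, so no input feature is exactly zero after scaling.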

In [20]:
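def normalize_min_max_scaling(image_data):
    # rescale 8-bit grayscale pixels from [0, 255] to [a, b] = [0.1, 0.9]
    a, b = 0.1, 0.9
    x_min, x_max = 0, 255
    normalized_image = a + ((image_data - x_min) * (b - a)) / (x_max - x_min)
    return normalized_image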

train_features = normalize_min_max_scaling(train_features)
test_features = normalize_min_max_scaling(test_features)

print('Training and testing features normalized')

Training and testing features normalized


In [21]:
# one-hot encode labels
encoder = LabelBinarizer()
encoder.fit(train_labels)
train_labels = encoder.transform(train_labels)
test_labels = encoder.transform(test_labels)

# change label type to float32
train_labels = train_labels.astype(np.float32)
test_labels = test_labels.astype(np.float32)

print('Training and test labels one-hot encoded')

Training and test labels one-hot encoded
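
To illustrate what one-hot encoding produces, the sketch below builds the same kind of matrix with plain NumPy on four made-up labels. It is not the `LabelBinarizer` implementation. It assumes, as `LabelBinarizer` does to my knowledge, that the columns follow the sorted order of the class labels.

```python
import numpy as np

# four made-up labels for illustration
labels = np.array(['C', 'A', 'J', 'A'])

# sorted unique classes define the column order: ['A', 'C', 'J']
classes = np.unique(labels)

# compare every label against every class: shape (n_samples, n_classes)
one_hot = (labels[:, None] == classes[None, :]).astype(np.float32)

# argmax recovers the column index, which maps back to the letter
decoded = classes[one_hot.argmax(axis=1)]
print(one_hot)
print(decoded)
```

Each row contains a single 1 in the column of its class, which is the target format the softmax output is compared against later.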


In [24]:
# randomize and split dataset for training and validation
train_features, valid_features, train_labels, valid_labels = train_test_split(train_features, train_labels, test_size=0.05, random_state=832289)

print('Training features and labels randomized and split')

Training features and labels randomized and split
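
With 150000 samples and `test_size=0.05`, the split above leaves 142500 examples for training and 7500 for validation. The sketch below shows the underlying idea on toy data: one shared permutation shuffles features and labels together, so each row stays paired with its label. The helper name `shuffle_split` and its `seed` argument are hypothetical, and this is not scikit-learn's implementation.

```python
import math

import numpy as np


def shuffle_split(features, labels, test_size=0.05, seed=0):
    """Shuffle features and labels together, then split off a validation part."""
    assert len(features) == len(labels)
    n_valid = math.ceil(len(features) * test_size)
    # one permutation indexes both arrays, so rows stay paired with their labels
    order = np.random.default_rng(seed).permutation(len(features))
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return features[train_idx], features[valid_idx], labels[train_idx], labels[valid_idx]


# toy data: row i of features is [2*i, 2*i + 1] and its label is i
features = np.arange(40).reshape(20, 2)
labels = np.arange(20)
x_train, x_valid, y_train, y_valid = shuffle_split(features, labels)
print(x_train.shape, x_valid.shape)
```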


In [28]:
# serialize and save data for later access
pickle_file = 'myModel.pickle'
if not os.path.isfile(pickle_file):
    print('Saving data to pickle file...')
    try:
        with open(pickle_file, 'wb') as pfile:
            pickle.dump(
                {
                    'train_dataset': train_features,
                    'train_labels': train_labels,
                    'valid_dataset': valid_features,
                    'valid_labels': valid_labels,
                    'test_dataset': test_features,
                    'test_labels': test_labels
                }, pfile, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        print('Unable to save data to {}: {}'.format(pickle_file, e))
        raise
        
print('Data cached in pickle file')

Saving data to pickle file...
Data cached in pickle file


In [25]:
print(train_labels[:5,:])

[[ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]]


In [26]:
print(train_features[:5,:])

[[ 0.1         0.61450982  0.38235295 ...,  0.1         0.1         0.1       ]
 [ 0.1         0.10627451  0.1        ...,  0.14078432  0.1         0.10627451]
 [ 0.1         0.1         0.1        ...,  0.1         0.1         0.1       ]
 [ 0.21607843  0.78705883  0.86862749 ...,  0.1         0.1         0.1       ]
 [ 0.1         0.1         0.1        ...,  0.1         0.1         0.1       ]]


### --> Checkpoint: Load pickle data

In [35]:
%matplotlib inline

# Load the modules
import pickle
import math

import numpy as np
import tensorflow as tf
from tqdm import tqdm
import matplotlib.pyplot as plt

# Reload the data
pickle_file = 'myModel.pickle'
with open(pickle_file, 'rb') as f:
    pickle_data = pickle.load(f)
    train_features = pickle_data['train_dataset']
    train_labels = pickle_data['train_labels']
    valid_features = pickle_data['valid_dataset']
    valid_labels = pickle_data['valid_labels']
    test_features = pickle_data['test_dataset']
    test_labels = pickle_data['test_labels']
    del pickle_data  # Free up memory

print('Data and modules loaded.')

Data and modules loaded.


## Build Neural Network

In [55]:
# hyperparameters
learning_rate = 0.2
epochs = 5
batch_size = 128

tf.reset_default_graph()

# initialization
X = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.truncated_normal([784, 10]))
b = tf.Variable(tf.zeros([10]))
Y_ = tf.placeholder(tf.float32, [None, 10])
init = tf.global_variables_initializer()

# model
Y = tf.nn.softmax(tf.matmul(X, W) + b)

# loss
cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y), axis=1)
loss = tf.reduce_mean(cross_entropy)

# metrics
is_correct = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_, 1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

# optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_step = optimizer.minimize(loss)
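
The loss above takes `tf.log` of the softmax output. If a probability underflows to 0, the log becomes `-inf` and the product with a zero label becomes NaN. As a hedged illustration of a more stable formulation, the NumPy sketch below computes the same mean cross-entropy through log-softmax. The function name `softmax_cross_entropy` is hypothetical and the code is not TensorFlow's implementation. TensorFlow 1.x also offers `tf.nn.softmax_cross_entropy_with_logits`, which takes the pre-softmax logits for the same purpose.

```python
import numpy as np


def softmax_cross_entropy(logits, one_hot_labels):
    """Mean cross-entropy between softmax(logits) and one-hot labels."""
    # subtracting the row-wise max leaves softmax unchanged but avoids overflow in exp
    shifted = logits - logits.max(axis=1, keepdims=True)
    # log-softmax computed directly, so log(0) never occurs
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -(one_hot_labels * log_probs).sum(axis=1)
    return per_example.mean()


logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 0.0, -1000.0]])  # this row would overflow a naive exp
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
loss = softmax_cross_entropy(logits, labels)
print(loss)
```

The second row is deliberately extreme: the loss stays finite even though the true class has an astronomically small probability.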

## Train Neural Network

In [56]:
def get_next_batch(features, labels, iter_step, batch_size):
    assert len(features) == len(labels), 'features and labels must have the same size'
    begin = iter_step * batch_size
    end = begin + batch_size
    return features[begin:end], labels[begin:end]

# initialize session and variables
sess = tf.Session()
sess.run(init)

train_data = {X: train_features, Y_:train_labels}
valid_data = {X: valid_features, Y_:valid_labels}

epoch_size = math.ceil(len(train_features) / batch_size)
iterations = epoch_size * epochs
for i in range(iterations):
    # get next batch of train data
    batch_X, batch_Y = get_next_batch(train_features, train_labels, i, batch_size)
    train_batch = {X: batch_X, Y_:batch_Y}
    
    # run train step
    sess.run(train_step, feed_dict=train_batch)
    
    # print results every epoch
    if i % epoch_size == 0:
        
        # determine train accuracy and loss
        train_acc, train_loss = sess.run([accuracy, loss], feed_dict=train_data)
    
        # determine validation accuracy and loss
        valid_acc, valid_loss = sess.run([accuracy, loss], feed_dict=valid_data)
    
        epoch = i / epoch_size
        print('Epoch {}'.format(int(epoch)))
        print('Training Accuracy: {} / Training Loss: {}'.format(train_acc, train_loss))
        print('Validation Accuracy: {} / Validation Loss: {}\n'.format(valid_acc, valid_loss)) 

Epoch 0
Training Accuracy: 0.15397192537784576 / Training Loss: 12.262728691101074
Validation Accuracy: 0.15479999780654907 / Validation Loss: 12.329452514648438

Epoch 1
Training Accuracy: 0.7506386041641235 / Training Loss: 1.554160475730896
Validation Accuracy: 0.7465333342552185 / Validation Loss: 1.5511505603790283

Epoch 2
Training Accuracy: 0.7506386041641235 / Training Loss: 1.554160475730896
Validation Accuracy: 0.7465333342552185 / Validation Loss: 1.5511505603790283

Epoch 3
Training Accuracy: 0.7506386041641235 / Training Loss: 1.554160475730896
Validation Accuracy: 0.7465333342552185 / Validation Loss: 1.5511505603790283

Epoch 4
Training Accuracy: 0.7506386041641235 / Training Loss: 1.554160475730896
Validation Accuracy: 0.7465333342552185 / Validation Loss: 1.5511505603790283
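
The metrics printed above are identical from epoch 1 through epoch 4. A likely cause is that `get_next_batch` receives the global iteration counter `i`: once `i * batch_size` exceeds the number of training examples, the slices come back empty and the weights stop changing. Below is a minimal sketch of slice bounds that wrap around at each epoch. The helper name `batch_bounds` is hypothetical, and the training set size of 142500 assumes the 95% split of the 150000 sampled images.

```python
import math


def batch_bounds(iter_step, batch_size, n_samples):
    """Return (begin, end) slice bounds that wrap around at the end of each epoch."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    # position within the current epoch instead of the global iteration count
    step_in_epoch = iter_step % batches_per_epoch
    begin = step_in_epoch * batch_size
    end = min(begin + batch_size, n_samples)
    return begin, end


n_samples, batch_size = 142500, 128  # 95% of the 150000 sampled images
epoch_size = math.ceil(n_samples / batch_size)

# without wrapping, the first step of the second epoch starts past the data
print(epoch_size * batch_size >= n_samples)
# with wrapping, the same step starts again at the first example
print(batch_bounds(epoch_size, batch_size, n_samples))
```

Using these bounds as `features[begin:end], labels[begin:end]` would feed real data in every epoch. Reshuffling the training set between epochs is a common addition that this sketch leaves out.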



## Test Neural Network

In [57]:
# define test feed_dictionary
test_data = {X: test_features, Y_:test_labels}

# run nn against test data
test_acc, test_loss = sess.run([accuracy, loss], feed_dict=test_data)    
print('Test Accuracy: {} / Test Loss: {}\n\n'.format(test_acc, test_loss))

Test Accuracy: 0.8228999972343445 / Test Loss: 0.9720494151115417
