<h1 align="center"  style="font-family:Fontin;color:#000066;font-size:24pt">Experimenting with notMNIST data</h1> 

<figure>
<img src="notmnist.png" alt="notMNIST" style="width:70%;">
<figcaption> Image source : yaroslavvb.blogspot.com</figcaption>
</figure>

<p align="left"  style="font-family:Fontin;font-size:12pt">The notMNIST dataset is a collection of 28x28 images of the letters 'A' through 'J', created by Yaroslav Bulatov. More information about the dataset can be found on <a href="http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html">his blog</a>.</p> 

<p align="left"  style="font-family:Fontin;font-size:12pt">The dataset consists of two parts:</p>
<ol>
<li> About 500k images with a ~6.5% label error rate, used as training and validation data (large).</li>
<li> About 19k images with a ~0.5% label error rate, used as test data (small).</li>
</ol>

<p align="left"  style="font-family:Fontin;font-size:12pt">The dataset can be downloaded from these links (<a href="http://commondatastorage.googleapis.com/books1000/notMNIST_large.tar.gz" download>large</a>, <a href="http://commondatastorage.googleapis.com/books1000/notMNIST_small.tar.gz" download>small</a>), made available through the Udacity machine learning course.</p>

<h2 align="left"  style="font-family:Fontin;font-size:12pt">Data Curation:</h2>
<ol>
<li> Remove the corrupted data.</li>
<li> Convert the dataset into a 3D array (image index, x, y) of floating point values.
<ul>
<li> x: image width (28)</li>
<li> y: image height (28)</li>
</ul>
</li>
<li> Normalize the data to approximately zero mean and a standard deviation of about 0.5 to make training easier.</li>
<li> Shuffle the data so that all 10 classes are distributed randomly.</li>
<li> Divide the data and labels into train, validation, and test components and save them in a single pickle file.
<ul>
<li> Train dataset: 400,000 images from large set</li>
<li> Valid dataset: 100,000 images from large set</li>
<li> Test dataset: 18,000 images from small set</li>
</ul>
</li>
</ol>
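
<p align="left"  style="font-family:Fontin;font-size:12pt">The normalize, shuffle, split, and save steps listed above can be sketched roughly as follows. This is a minimal illustration on a tiny random stand-in array, not the code that produced the pickle used later. The scaling to [-0.5, 0.5] assumes 8-bit pixels as in the Udacity exercise, and the helper names <code>normalize</code>, <code>shuffle_together</code>, and <code>split</code> are hypothetical.</p>

```python
import pickle

import numpy as np

PIXEL_DEPTH = 255.0  # assumption: images are stored as 8-bit grayscale


def normalize(images):
    """Scale raw 0..255 pixels into [-0.5, 0.5], centred on zero."""
    return (images.astype(np.float32) - PIXEL_DEPTH / 2) / PIXEL_DEPTH


def shuffle_together(dataset, labels, seed=0):
    """Apply one random permutation to images and labels so pairs stay aligned."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(labels))
    return dataset[order], labels[order]


def split(dataset, labels, train_size, valid_size):
    """Cut an already shuffled set into train and validation parts."""
    train = (dataset[:train_size], labels[:train_size])
    valid = (dataset[train_size:train_size + valid_size],
             labels[train_size:train_size + valid_size])
    return train, valid


# Tiny stand-in for the large set: 20 fake 28x28 images, two per class.
raw = np.random.RandomState(1).randint(0, 256, size=(20, 28, 28))
raw_labels = np.repeat(np.arange(10), 2)

data, labels = shuffle_together(normalize(raw), raw_labels)
(train_x, train_y), (valid_x, valid_y) = split(data, labels, 16, 4)

# Same dictionary keys as the pickle loaded later in this notebook.
save = {'train_dataset': train_x, 'train_labels': train_y,
        'valid_dataset': valid_x, 'valid_labels': valid_y}
payload = pickle.dumps(save, pickle.HIGHEST_PROTOCOL)  # pickle.dump(save, f) would write to a file
```

<p align="left"  style="font-family:Fontin;font-size:12pt">Using a single permutation for both arrays is what keeps every image paired with its own label after shuffling.</p>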

Note: Some of the code is based on exercises from the Udacity machine learning course. I ran the programs on a multi-CPU cluster, so the results shown here were not rendered by IPython but copied from the output files of the sbatch jobs.

<h2 align="left"  style="font-family:Fontin;font-size:12pt">
After data curation, we first train a logistic regression model using <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">
sklearn.linear_model.LogisticRegression.</a>
</h2>


In [None]:
import numpy as np
from six.moves import cPickle as pickle
from sklearn.linear_model import LogisticRegression

# Load the datasets and labels from the previously saved pickle.

pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print 'Training set: ', train_dataset.shape, train_labels.shape
    print 'Validation set: ', valid_dataset.shape, valid_labels.shape
    print 'Test set: ', test_dataset.shape, test_labels.shape

# Flatten each 28x28 image into a 784-element vector
# (the validation set is not used in this step, so it is left unflattened)
nsamples, nx, ny = train_dataset.shape
train_dataset = train_dataset.reshape((nsamples,nx*ny))
nsamples1, nx1, ny1 = test_dataset.shape
test_dataset = test_dataset.reshape((nsamples1,nx1*ny1))

print 'After reformatting the data'
print 'Training set: ', train_dataset.shape, train_labels.shape
print 'Validation set: ', valid_dataset.shape, valid_labels.shape
print 'Test set: ', test_dataset.shape, test_labels.shape

# Train a logistic regression model

logistic = LogisticRegression()  # pass n_jobs to use more CPU cores if available
logistic.fit(train_dataset, train_labels)

# Calculate the accuracy on the test set, in percent
print 'Accuracy: ', logistic.score(test_dataset, test_labels) * 100



Training set:  (400000, 28, 28) (400000,)<br>
Validation set:  (100000, 28, 28) (100000,)<br>
Test set:  (18000, 28, 28) (18000,)<br>
After reformatting the data<br>
Training set:  (400000, 784) (400000,)<br>
Validation set:  (100000, 28, 28) (100000,)<br>
Test set:  (18000, 784) (18000,)<br>
Accuracy:  89.45
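
<p align="left"  style="font-family:Fontin;font-size:12pt">The output above shows the validation set still shaped (100000, 28, 28), because only the training and test sets were flattened. To score the model on validation data as well, it would need the same flattening first, since scikit-learn estimators expect 2D input. A minimal sketch, using a hypothetical helper <code>flatten_images</code> and a small placeholder array instead of the real data:</p>

```python
import numpy as np


def flatten_images(dataset):
    """Reshape (n, height, width) images into (n, height * width) row vectors."""
    return dataset.reshape((dataset.shape[0], -1))


# Small placeholder with the same layout as the notMNIST sets.
valid_images = np.arange(5 * 28 * 28, dtype=np.float32).reshape((5, 28, 28))
valid_flat = flatten_images(valid_images)
print(valid_flat.shape)  # (5, 784)
```

<p align="left"  style="font-family:Fontin;font-size:12pt">Passing -1 lets NumPy infer the flattened length, so the same helper works for any image size.</p>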

<h2 align="left"  style="font-family:Fontin;font-size:12pt">Now let's do logistic regression using stochastic gradient descent in TensorFlow.</h2>
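
<p align="left"  style="font-family:Fontin;font-size:12pt">As a rough guide to what each training step computes, here is a plain NumPy sketch of one stochastic gradient descent update for softmax (multinomial logistic) regression. It is an illustration under simplifying assumptions, not the TensorFlow implementation: the weights start at zero rather than from a truncated normal, the toy data is made up, and the names <code>softmax</code> and <code>sgd_step</code> are hypothetical. The learning rate of 0.5 matches the value used in this notebook.</p>

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sgd_step(weights, biases, batch_data, batch_labels, learning_rate=0.5):
    """One gradient descent step on the mean softmax cross-entropy loss."""
    probs = softmax(batch_data.dot(weights) + biases)
    loss = -np.mean(np.sum(batch_labels * np.log(probs + 1e-12), axis=1))
    # Gradient of the mean loss with respect to the logits.
    error = (probs - batch_labels) / batch_data.shape[0]
    weights = weights - learning_rate * batch_data.T.dot(error)
    biases = biases - learning_rate * error.sum(axis=0)
    return weights, biases, loss


# Toy problem (made-up data): 3 well-separated classes, 4 features, 60 samples.
rng = np.random.RandomState(0)
class_ids = np.repeat(np.arange(3), 20)
centers = np.array([[2.0, 0.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0, 0.0],
                    [0.0, 0.0, 2.0, 0.0]])
data = centers[class_ids] + 0.3 * rng.randn(60, 4)
one_hot = (np.arange(3) == class_ids[:, None]).astype(np.float32)

weights = np.zeros((4, 3))
biases = np.zeros(3)
losses = []
for step in range(200):
    # Draw a random minibatch without replacement, as in the training loop.
    batch = rng.choice(60, size=16, replace=False)
    weights, biases, loss = sgd_step(weights, biases, data[batch], one_hot[batch])
    losses.append(loss)

predicted = np.argmax(data.dot(weights) + biases, axis=1)
train_accuracy = 100.0 * np.mean(predicted == class_ids)
```

<p align="left"  style="font-family:Fontin;font-size:12pt">Each step touches only the sampled minibatch, which keeps the cost per update independent of the size of the full training set.</p>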

In [None]:
import numpy as np
import random
import tensorflow as tf
from six.moves import cPickle as pickle

# Load the datasets and labels from the previously saved pickle.

pickle_file = '/scratch/piyadasa/deep_learning_udacity/ass1/notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory    
    print 'Training set: ', train_dataset.shape, train_labels.shape
    print 'Validation set: ', valid_dataset.shape, valid_labels.shape
    print 'Test set: ', test_dataset.shape, test_labels.shape

# Flatten the images into vectors and one-hot encode the labels
image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print 'After reformatting the data'
print 'Training set: ', train_dataset.shape, train_labels.shape
print 'Validation set: ', valid_dataset.shape, valid_labels.shape
print 'Test set: ', test_dataset.shape, test_labels.shape    

# Train with stochastic gradient descent on small random batches of the training data
batch_size = 2048

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed at run time with a training minibatch.
    
    tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables.

    weights = tf.Variable(tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))


    # Training computation.
    logits = tf.matmul(tf_train_dataset, weights) + biases
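    # Keyword arguments avoid mixing up the logits/labels order across TensorFlow versions.
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))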

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

# define accuracy function     
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))/ predictions.shape[0])    
    
#Run the graph
num_steps = 10001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    for step in range(num_steps):
        # Pick a random sample of train data
        sample=random.sample(range(len(train_labels)),batch_size)
        batch_data = train_dataset[sample]
        batch_labels = train_labels[sample]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.

        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}

        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)

        if (step % 500 == 0):
            print "Minibatch loss at step %d: %f" % (step, l)
            print "Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels)
            print "Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels)
            
    print "Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels)
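
<p align="left"  style="font-family:Fontin;font-size:12pt">The one-hot encoding in <code>reformat</code> and the <code>accuracy</code> function above can be checked on a tiny hand-made example. The sketch below reuses the same two expressions; the wrapper name <code>one_hot</code> and the sample labels are made up for illustration.</p>

```python
import numpy as np

num_labels = 10


def one_hot(labels):
    """Compare each label with 0..num_labels-1 via broadcasting to get one-hot rows."""
    return (np.arange(num_labels) == labels[:, None]).astype(np.float32)


def accuracy(predictions, labels):
    """Percentage of rows whose arg-max prediction matches the one-hot label."""
    return 100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0]


labels = np.array([0, 3, 9, 3])
encoded = one_hot(labels)  # shape (4, 10); row 1 has its single 1.0 in column 3

# Fake predictions: correct for the first three samples, wrong for the last.
predictions = one_hot(np.array([0, 3, 9, 5]))
print(accuracy(predictions, encoded))  # 75.0
```

<p align="left"  style="font-family:Fontin;font-size:12pt">Here <code>labels[:, None]</code> turns the label vector into a column, so comparing it with <code>np.arange(num_labels)</code> broadcasts to one row per sample.</p>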