Deep Learning
=============

Assignment 1
------------

The objective of this assignment is to introduce simple data curation practices and to familiarize you with some of the data we'll be reusing later.

This notebook uses the [notMNIST](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html) dataset for Python experiments. The dataset is designed to resemble the classic [MNIST](http://yann.lecun.com/exdb/mnist/) dataset while looking a little more like real data: it's a harder task, and the data is a lot less 'clean' than MNIST.

In [None]:
# Standard library
import os
import json
    
# Third-party packages
import h5py
import matplotlib.pyplot as pl
%matplotlib inline
import numpy as np
from IPython.display import display, Image
from skimage.io import imread
from sklearn.linear_model import LogisticRegression

# Python 2 & 3 support
from six.moves import cPickle as pickle
from six.moves.urllib.request import urlretrieve, urlopen

In [None]:
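# Load notebook settings (cache location and dataset URL) from config.json
with open("config.json") as f:
    config = json.load(f)

# Fall back to the current working directory when no cache path is configured
if config.get('cache_path') is None:
    config['cache_path'] = os.getcwd()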
    
data_url = config['notMNIST']['data_url']
cache_file = os.path.join(config['cache_path'], os.path.basename(data_url))

First, we'll download the dataset to our local machine. The data consists of characters rendered in a variety of fonts as 28 by 28 pixel images. The labels identify the letter shown in each image (limited to A through J, so 10 classes). The training and test sets contain about 500,000 and 19,000 image-label pairs, respectively. Even at these sizes, it should be possible to train models quickly on any machine.

_Note: This could take some time! You are about to download a ~1.7 GB file. Go get some coffee._

In [None]:
# Ask the server how many bytes to expect
response = urlopen(data_url)
expected_bytes = int(response.info()['Content-Length'])
response.close()

# Download if the cached file is missing or its size doesn't match
if not os.path.isfile(cache_file) or os.stat(cache_file).st_size != expected_bytes:
    urlretrieve(data_url, cache_file)

    # Verify the download by comparing sizes
    received_bytes = os.stat(cache_file).st_size
    if received_bytes != expected_bytes:
        raise IOError("Download error: size expected = {} bytes, size received = {} bytes"
                      .format(expected_bytes, received_bytes))

    print("Data downloaded and verified.")

else:
    print("Data file already exists and is verified.")

Next, we'll print some information about the data:

In [None]:
with h5py.File(cache_file, 'r') as f:
    # Each top-level group holds named datasets; print their shapes
    for name, group in f.items():
        print("{}:".format(name))

        for k, v in group.items():
            print("\t {} {}".format(k, v.shape))

---
Problem 1
---------

Let's take a peek at some of the data to make sure it looks sensible. Each exemplar should be an image of a character A through J rendered in a different font.  

Plot a 3 by 3 grid of sample images from the test set and set the title of each panel to the character name (use the labels).

_Hint: use `matplotlib.pyplot.imshow()`_

---

In [None]:
with h5py.File(cache_file, 'r') as f:
    pass
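
One possible approach to the grid plot, sketched on synthetic stand-in data rather than the real file. The arrays `images` and `labels`, the integer label encoding (0 for 'A' through 9 for 'J'), and the `letters` lookup string are all assumptions made for illustration; check how the labels are actually stored before reusing this.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins (assumption): random 28x28 "images" and integer labels 0..9
rng = np.random.RandomState(0)
images = rng.rand(100, 28, 28)
labels = rng.randint(0, 10, size=100)

letters = "ABCDEFGHIJ"  # assumes label 0 -> 'A', ..., 9 -> 'J'

# Pick 9 distinct samples at random and draw them in a 3x3 grid
picks = rng.choice(len(images), size=9, replace=False)
fig, axes = plt.subplots(3, 3, figsize=(6, 6))
for ax, i in zip(axes.ravel(), picks):
    ax.imshow(images[i], cmap="gray")
    ax.set_title(letters[labels[i]])  # panel title is the character name
    ax.axis("off")
fig.tight_layout()
```

With the real data, `images` and `labels` would presumably be read from `f['test']['images']` and `f['test']['labels']` inside the `with` block above.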

---
Problem 2
---------

Now display the mean of all images from each class individually and again set the title of each panel to the corresponding character name.

---

In [None]:
with h5py.File(cache_file, 'r') as f:
    pass
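
The per-class mean can be computed with a boolean mask per class, as in the minimal sketch below. It runs on synthetic data; the integer label encoding and the names `images`, `labels`, `letters`, and `class_means` are assumptions for illustration only.

```python
import numpy as np

# Synthetic stand-ins (assumption): 200 random images, 20 per class, shuffled
rng = np.random.RandomState(0)
images = rng.rand(200, 28, 28)
labels = rng.permutation(np.arange(200) % 10)

letters = "ABCDEFGHIJ"  # assumes label 0 -> 'A', ..., 9 -> 'J'

# For each class, select its samples with a boolean mask and average over axis 0
class_means = {
    letters[c]: images[labels == c].mean(axis=0)
    for c in range(10)
}
```

Each value in `class_means` is a single 28 by 28 array, so it can be passed to `imshow()` with the dictionary key as the panel title.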

---
Problem 3
---------

Next, we'll randomize the data. It's important to randomize both the train and test data sets. Verify that the data is still labeled correctly after randomization.

---

In [None]:
def randomize(data, labels):
    pass

with h5py.File(cache_file, 'r') as f:
    train_dataset, train_labels = randomize(f['train']['images'][:], f['train']['labels'][:])
    test_dataset, test_labels = randomize(f['test']['images'][:], f['test']['labels'][:])
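
A minimal sketch of one way to fill in `randomize`: draw a single permutation and apply it to both arrays so that each image keeps its label. The optional `seed` argument is a hypothetical addition for reproducibility, and the check runs on synthetic data in which every image encodes its own label.

```python
import numpy as np

def randomize(data, labels, seed=None):
    """Shuffle data and labels with the same permutation so pairs stay aligned."""
    rng = np.random.RandomState(seed)
    permutation = rng.permutation(len(labels))
    return data[permutation], labels[permutation]

# Synthetic check (assumption): fill each image with its label value,
# so alignment can be verified after shuffling
labels = np.arange(50) % 10
data = np.zeros((50, 28, 28)) + labels[:, None, None]

shuffled_data, shuffled_labels = randomize(data, labels, seed=0)
still_aligned = bool(np.all(shuffled_data[:, 0, 0] == shuffled_labels))
```

On the real data there is no label baked into the pixels, so verification would more likely mean plotting a few shuffled samples with their labels as titles.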

---
Problem 4
---------
Another check: we expect the data to be balanced across classes. Verify that.

---
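
One way to check class balance is to count samples per class and compare each class's share with the ideal 1/10. The sketch below assumes integer labels 0 through 9 and uses synthetic labels; the tolerance of 0.01 is an arbitrary illustrative choice.

```python
import numpy as np

# Synthetic labels (assumption): exactly 30 samples for each of the 10 classes
labels = np.repeat(np.arange(10), 30)

# Count samples per class and convert the counts to fractions of the total
counts = np.bincount(labels, minlength=10)
fractions = counts / float(counts.sum())

# Balanced here means every class is within the tolerance of a 10% share
is_balanced = bool(np.all(np.abs(fractions - 0.1) < 0.01))
```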

---
Problem 5
---------

By construction, this dataset might contain a lot of overlapping samples (identical images), including training data that's also contained in the validation and test sets. Overlap between training and test data can skew the results if you expect to use your model in an environment where there is never an overlap, but it is actually fine if you expect to see training samples recur when you use it.

- How much overlap is there between training, validation and test samples?
- What about near duplicates between datasets? (images that are almost identical)
- Create a sanitized validation and test set, and compare your accuracy on those in subsequent assignments.
---
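
Exact duplicates can be found by hashing the raw bytes of each image and intersecting the hash sets, as in the sketch below. It uses synthetic arrays with two duplicates planted on purpose; `image_hashes` is a hypothetical helper, not part of the notebook. This only catches byte-identical images; near duplicates would need a different technique, such as thresholding a pixel-wise distance.

```python
import hashlib
import numpy as np

def image_hashes(images):
    """Return one SHA-1 hex digest per image, computed from its raw bytes."""
    return [hashlib.sha1(np.ascontiguousarray(img).tobytes()).hexdigest()
            for img in images]

# Synthetic stand-ins (assumption): random uint8 images
rng = np.random.RandomState(0)
train = rng.randint(0, 256, size=(20, 28, 28)).astype(np.uint8)
test = rng.randint(0, 256, size=(5, 28, 28)).astype(np.uint8)

# Plant two training images in the test set
test[0] = train[3]
test[1] = train[7]

# Flag every test image whose hash also appears in the training set
train_set = set(image_hashes(train))
duplicate_mask = np.array([h in train_set for h in image_hashes(test)])

n_overlap = int(duplicate_mask.sum())
sanitized_test = test[~duplicate_mask]  # test set with exact duplicates removed
```

The same mask would also need to be applied to the test labels so images and labels stay aligned.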

---
Problem 6
---------

Let's get an idea of what an off-the-shelf classifier can give you on this data. It's always good to check that there is something to learn, and that it's a problem that is not so trivial that a canned solution solves it.

Train a simple model on this data using 50, 100, 1000 and 5000 training samples. Hint: you can use the LogisticRegression model from sklearn.linear_model.

Optional question: train an off-the-shelf model on all the data!

---
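
A minimal sketch of the training loop, assuming scikit-learn is installed. It runs on synthetic data (noisy copies of random class prototypes) rather than notMNIST, so the scores say nothing about the real task; the names `make_split`, `prototypes`, and `scores`, the noise level, and `max_iter=200` are illustrative assumptions. The 5000-sample run is omitted here to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n_classes, n_train, n_test = 10, 1000, 200

# Synthetic stand-in: each class is a noisy copy of its own random 28x28 prototype
prototypes = rng.rand(n_classes, 28, 28)

def make_split(n):
    """Draw n labelled samples around the class prototypes."""
    y = rng.randint(0, n_classes, size=n)
    X = prototypes[y] + 0.3 * rng.randn(n, 28, 28)
    return X, y

train_X, train_y = make_split(n_train)
test_X, test_y = make_split(n_test)

# scikit-learn expects 2-D input, so flatten each image to a 784-vector
scores = {}
for n in (50, 100, 1000):
    model = LogisticRegression(max_iter=200)
    model.fit(train_X[:n].reshape(n, -1), train_y[:n])
    scores[n] = model.score(test_X.reshape(n_test, -1), test_y)
```

Here `scores` maps each training-set size to test accuracy, which makes it easy to see how accuracy changes as more samples are used.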