Deep Learning
=============

Assignment 1 - Deduplicate the dataset saved to the pickle file
------------

## WARNING: Uses Python 3

Previously in `1_notmnist.ipynb`, we created a pickle with formatted datasets for training, development and testing on the [notMNIST dataset](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html).

**Here we check the dataset for duplicates and save another pickle without duplicate examples.**

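We hash each image with a simple function: the dot product of its pixel values with a fixed random vector. Identical images always receive the same hash, so no exact duplicate is missed (no false negatives), but two different images can occasionally collide on the same hash (false positives). In other words, we may discard a few too many examples, but the resulting sets won't contain exact duplicates.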
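To make the false-positive case concrete, here is a minimal sketch on three tiny flattened "images". It deliberately uses all-ones weights (an assumption chosen to force a collision, not the random weights used in this notebook), so the hash reduces to the pixel sum. The names `images` and `weak_weights` are illustrative only.

```python
import numpy as np

# Three tiny flattened "images". Rows 0 and 1 are identical;
# row 2 is a different image that happens to have the same pixel sum.
images = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
], dtype=np.float32)

# Assumption for illustration: all-ones weights, so every hash is a pixel sum.
weak_weights = np.ones(images.shape[1])
hashes = images.dot(weak_weights)

same_hash_01 = bool(hashes[0] == hashes[1])  # true duplicate: hashes match
same_hash_02 = bool(hashes[0] == hashes[2])  # collision: hashes match as well
truly_equal_02 = np.array_equal(images[0], images[2])  # exact pixel comparison

print(same_hash_01, same_hash_02, truly_equal_02)
```

With random weights such collisions become unlikely, but an exact comparison such as `np.array_equal` is the only way to confirm that two examples with equal hashes are really identical.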

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
import pickle
import numpy as np
from functools import reduce

First reload the data we generated in `1_notmnist.ipynb`.

In [17]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (18724, 28, 28) (18724,)


Reformat the data into a shape that is convenient for hashing:
- data as a flat matrix with one row of 28 * 28 = 784 pixel values per image,
- labels left unchanged as 1-D arrays of class indices.

In [18]:
image_size = 28
num_labels = 10

def reformat(dataset):
  return dataset.reshape((-1, image_size * image_size)).astype(np.float32)

train_dataset = reformat(train_dataset)
valid_dataset = reformat(valid_dataset)
test_dataset = reformat(test_dataset)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000,)
Validation set (10000, 784) (10000,)
Test set (18724, 784) (18724,)


## Test for duplicates

In [20]:
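# Calculate a simple hash for every example: multiply each pixel value by a
# random float and sum the products (a dot product with the random vector r).
# The same r is used for all three sets so that their hashes are comparable.
r = np.random.rand(train_dataset.shape[1])
# np.tile(..., (1, 1)) turns each 1-D hash vector into a (1, n) row matrix.
train_hashes = np.tile(train_dataset.dot(r), (1, 1))
valid_hashes = np.tile(valid_dataset.dot(r), (1, 1))
test_hashes = np.tile(test_dataset.dot(r), (1, 1))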

In [21]:
def deduplicate(hashes):
    """Return the indices of the first occurrence of each hash in a (1, n) hash array."""
    seen = set()  # set membership is O(1); a list lookup would make the loop O(n^2)
    unique_indices = []
    for i, h in enumerate(hashes[0]):
        if h not in seen:
            seen.add(h)
            unique_indices.append(i)
    return np.array(unique_indices)

train_unique_idx_auto = deduplicate(train_hashes)
valid_unique_idx_auto = deduplicate(valid_hashes)
test_unique_idx_auto = deduplicate(test_hashes)
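The same first-occurrence filter can also be sketched without a Python loop, using `np.unique` with `return_index=True`, which reports the index at which each distinct value first appears. The helper name `deduplicate_vectorized` and the toy hash values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def deduplicate_vectorized(hashes):
    """Return sorted indices of the first occurrence of each hash.

    `hashes` is assumed to be a (1, n) array of hash values.
    """
    # For each distinct value, return_index gives the index of its first occurrence.
    _, first_idx = np.unique(hashes[0], return_index=True)
    # np.unique orders its output by value; sort to restore the original example order.
    return np.sort(first_idx)

toy_hashes = np.array([[3.5, 1.25, 3.5, 7.0, 1.25]])
unique_idx = deduplicate_vectorized(toy_hashes)
print(unique_idx)
```

Here the repeated values at positions 2 and 4 are dropped and the first occurrences at positions 0, 1 and 3 are kept.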

In [6]:
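def cross_deduplicate(data1, data2):
    """Return the indices of data1 whose hash does not occur in data2.

    Both arguments are (1, n) hash arrays.
    """
    # np.isin avoids materialising the full (n2, n1) boolean matrix that the
    # broadcast comparison (data2.T == data1) would build.
    return np.flatnonzero(~np.isin(data1[0], data2[0]))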

valid_train_unique_idx = cross_deduplicate(train_hashes, valid_hashes)
test_train_unique_idx = cross_deduplicate(train_hashes, test_hashes)
valid_test_unique_idx = cross_deduplicate(valid_hashes, test_hashes)
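To make the return value concrete, the sketch below runs a broadcasting comparison and an `np.isin` membership test on toy hash arrays and shows that both select the same indices of the first array. The `_broadcast` and `_isin` helper names and the toy values are assumptions for illustration.

```python
import numpy as np

def cross_deduplicate_broadcast(data1, data2):
    # Compares every hash in data2 with every hash in data1: an (n2, n1) boolean matrix.
    return np.arange(data1.shape[1])[~(data2.T == data1).any(axis=0)]

def cross_deduplicate_isin(data1, data2):
    # Membership test that does not build the (n2, n1) intermediate.
    return np.flatnonzero(~np.isin(data1[0], data2[0]))

a = np.array([[1.0, 2.0, 3.0, 4.0]])  # hashes of "dataset 1"
b = np.array([[4.0, 2.0, 9.0]])       # hashes of "dataset 2"

idx_broadcast = cross_deduplicate_broadcast(a, b)
idx_isin = cross_deduplicate_isin(a, b)
print(idx_broadcast, idx_isin)
```

Only the hashes at positions 0 and 2 of `a` are absent from `b`, so those are the indices that survive. The memory difference matters at scale: comparing 200000 training hashes against 10000 validation hashes by broadcasting needs a boolean matrix with two billion entries.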


In [22]:
# Intersect unique indices for the training set: keep an example only if it is
# unique within the training set and appears in neither the validation nor the test set.
train_unique_idx = reduce(np.intersect1d, (train_unique_idx_auto, valid_train_unique_idx, test_train_unique_idx))
# Validation set: unique within itself and absent from the test set.
valid_unique_idx = reduce(np.intersect1d, (valid_unique_idx_auto, valid_test_unique_idx))
# Test set: only duplicates within the test set itself are removed.
test_unique_idx = test_unique_idx_auto
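As a small illustration of how `reduce(np.intersect1d, ...)` combines several index arrays, the sketch below intersects three toy arrays. The array names and values are made up for the example.

```python
from functools import reduce

import numpy as np

within = np.array([0, 1, 2, 4, 5])    # survived within-set deduplication
vs_valid = np.array([0, 2, 3, 4, 5])  # hash not found in the validation set
vs_test = np.array([0, 1, 2, 5])      # hash not found in the test set

# reduce applies np.intersect1d pairwise, left to right:
# (within & vs_valid) & vs_test
keep = reduce(np.intersect1d, (within, vs_valid, vs_test))
print(keep)
```

An index is kept only if it appears in every array, which is why a training example has to pass all three checks to survive.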

In [23]:
# Report deduplication results
print("Removed %d duplicates from %d training examples." % 
      (len(train_dataset) - len(train_unique_idx), len(train_dataset)))
print("Removed %d duplicates from %d validation examples." % 
      (len(valid_dataset) - len(valid_unique_idx), len(valid_dataset)))
print("Removed %d duplicates from %d test examples." %
      (len(test_dataset) - len(test_unique_idx), len(test_dataset)))

Removed 15606 duplicates from 200000 training examples.
Removed 286 duplicates from 10000 validation examples.
Removed 491 duplicates from 18724 test examples.


In [28]:
def reformat_back(dataset):
    return dataset.reshape((-1, image_size, image_size)).astype(np.float32)

train_dataset_out = reformat_back(train_dataset[train_unique_idx])
valid_dataset_out = reformat_back(valid_dataset[valid_unique_idx])
test_dataset_out = reformat_back(test_dataset[test_unique_idx])
train_labels_out = train_labels[train_unique_idx]
valid_labels_out = valid_labels[valid_unique_idx]
test_labels_out = test_labels[test_unique_idx]
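Flattening and reshaping back is lossless, so selecting rows of the flat matrix by index and then restoring the image shape yields the original images. A minimal sketch on synthetic data, where the `images` array and the `keep` indices are assumptions for illustration:

```python
import numpy as np

image_size = 28

# Three synthetic "images" with distinct, exactly representable pixel values.
images = np.arange(3 * image_size * image_size, dtype=np.float32).reshape(
    3, image_size, image_size)

flat = images.reshape((-1, image_size * image_size))     # shape (3, 784)
restored = flat.reshape((-1, image_size, image_size))    # shape (3, 28, 28)

keep = np.array([0, 2])  # hypothetical unique indices
subset = flat[keep].reshape((-1, image_size, image_size))

print(flat.shape, restored.shape, subset.shape)
```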

In [30]:
pickle_file = 'notMNIST_dedup.pickle'
try:
  # The with-statement closes the file even if pickling fails part-way.
  with open(pickle_file, 'wb') as f:
    save = {
      'train_dataset': train_dataset_out,
      'train_labels': train_labels_out,
      'valid_dataset': valid_dataset_out,
      'valid_labels': valid_labels_out,
      'test_dataset': test_dataset_out,
      'test_labels': test_labels_out,
      }
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
except Exception as e:
  print('Unable to save data to', pickle_file, ':', e)
  raise

In [31]:
print('Resulting dataset shapes')
print('Training set', train_dataset_out.shape, train_labels_out.shape)
print('Validation set', valid_dataset_out.shape, valid_labels_out.shape)
print('Test set', test_dataset_out.shape, test_labels_out.shape)

Resulting dataset shapes
Training set (184394, 28, 28) (184394,)
Validation set (9714, 28, 28) (9714,)
Test set (18233, 28, 28) (18233,)
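As an optional cross-check that does not rely on float hashes, the overlap between two sets can be counted exactly by comparing the raw bytes of each example. This is a sketch, not part of the original notebook: the helper `count_exact_overlap` and the toy arrays are hypothetical, and it assumes both arrays share the same dtype and per-example shape.

```python
import numpy as np

def count_exact_overlap(data1, data2):
    """Count examples of data1 whose raw bytes also occur as an example of data2."""
    # Iterating over an array yields one example per step (its first axis).
    seen = {example.tobytes() for example in data2}
    return sum(example.tobytes() in seen for example in data1)

toy_train = np.array([[0, 1], [2, 3], [4, 5]], dtype=np.float32)
toy_test = np.array([[2, 3], [9, 9]], dtype=np.float32)

overlap = count_exact_overlap(toy_train, toy_test)
print(overlap)
```

Run on the deduplicated training and test arrays, a result of zero would confirm that no pixel-identical example is shared between them.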
