# Chapter 4: Deep Learning (Part 1) - Data preprocessing


Part 1
------------


The objective of this first part is to learn simple data curation practices. For machine learning, data curation basically consists of analysing, labelling, and separating your input data into classes. In section two, you used pre-curated and pre-separated data from the INRIA person dataset. Here, your first task is to create your training/testing sets by hand and analyse how well balanced they are.

**Objectives** 
In the following sections, you will use the output of this part to "feed" both a classic _shallow_ (non-deep) classifier using handcrafted features and a deep (well, kind of deep) neural network. Your task will be to analyse the accuracy of these classifiers as a function of the amount of data available, from a few hundred samples to the full dataset.

In this section we provide the general steps and, as in the previous section, you will be asked to look up function parameters and syntax in the user documentation.


## Dataset 

In this last section of the course we will use the [notMNIST](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html) dataset. This dataset is designed to look like the classic [MNIST](http://yann.lecun.com/exdb/mnist/) dataset, while looking a little more like _real data_: it's not trivial, and the data is a lot less 'clean' than MNIST.


### Libraries: 

Make sure that you can import all the libraries below. In addition, in the next chapter you will use the TensorFlow library to implement the neural network, so make sure you can import it as:

`` import tensorflow as tf``

following the documentation page 

https://www.tensorflow.org/install/

We will use CPU-based training only.



In [None]:
import cv2 as cv
import numpy as np
import os
import math
import tarfile
from utils import *
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC
from six.moves.urllib.request import urlretrieve
import matplotlib.pyplot as plt
import glob
from six.moves import cPickle as pickle
%matplotlib inline

### Download the data. 

As in section 2, you need to download the data and set the input directory. Make sure you have about 1 GB of free disk space. If the function is not able to download the data, try downloading the archives manually from the notMNIST page linked above.



In [None]:

# Download the Data
# The following functions will download the data for you and uncompress it

# WARNING: These variables set the input/output paths for ALL the functions below.
url = 'http://commondatastorage.googleapis.com/books1000/'
last_percent_reported = None
data_root = '../data/' # Change me to store data elsewhere


def maybe_download(filename, expected_bytes, force=False):
  """
  Downloads a file if not present, and makes sure it's the right size.
  If a file with the same name already exists, the function will not try to
  download the dataset again.
  """

  dest_filename = os.path.join(data_root, filename)
    
  if force or not os.path.exists(dest_filename):
    print('Attempting to download:', filename, 'This may take a while. Please wait.') 
    filename, _ = urlretrieve(url + filename, dest_filename)
    print('\nDownload Complete!')
  statinfo = os.stat(dest_filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', dest_filename)
  else:
    raise Exception(
      'The file ' + dest_filename + ' already exists but seems corrupted. Delete it or download it from the browser!')
  return dest_filename


def maybe_extract(filename, force=False):
  """
  Uncompress the data set for you
  """
  root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz

  if os.path.isdir(root) and not force:
    # You may override by setting force=True.
    print('%s dataset (seems to be) already present.\nSkipping extraction of %s.' % (root, filename))
  else:
    print('Extracting data for %s. This may take a while. Please wait.' % root)
    tar = tarfile.open(filename)
    tar.extractall(data_root)
    tar.close()
    
  data_folders = [os.path.join(root, d) for d in sorted(os.listdir(root)) if os.path.isdir(os.path.join(root, d))]
  print("All setup.")
  return data_folders


In [None]:
# Download if needed.
large_filename = maybe_download('notMNIST_large.tar.gz', 247336696)
small_filename  = maybe_download('notMNIST_small.tar.gz', 8458043)


In [None]:
large_folders = maybe_extract(large_filename)
small_folders = maybe_extract(small_filename)

In [None]:
large_folders

---
Problem 1
---------

When working with data, always check your data. 

Create a description of your input data. Describe in a table or list (one for each dataset, large and small):

* Number of classes (characters)
* Number of samples per class
* General information on the image size and number of channels.

Visualize one sample per class below for a chosen dataset (large or small).

---


In [None]:
# Code here.


Now that you have all your images set up, we will load the data into a more manageable format. Depending on your computer setup, you might not be able to fit it all in memory, so we will load each class into a separate dataset, store the datasets on disk, and curate them independently.

To do this we will use pickles!

https://docs.python.org/3.2/library/pickle.html

“Pickling” is the process whereby a Python object (it can be anything!) is converted into a byte stream (binary format), and “unpickling” is the inverse operation. We will use pickles to save the FULL set of images for each character in one pickle. The result will be a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and a standard deviation below ~0.5 to make training easier down the road. This process is known as "normalizing the data" or "feature scaling". It is very important for ensuring convergence in the optimization step, as well as for ensuring that the feature space is well defined.

https://en.wikipedia.org/wiki/Feature_scaling
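
As a minimal sketch of the pickling round trip described above, the snippet below writes a small random array to a temporary file and reads it back. The array, the file name and the temporary directory are hypothetical stand-ins for a real letter dataset, and the standard-library `pickle` module is used instead of the `six` import from the notebook header.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for one letter's dataset: 5 images of 28x28 pixels.
fake_letter = np.random.rand(5, 28, 28).astype(np.float32) - 0.5

# Write the array to a temporary .pickle file (binary mode is required).
pickle_path = os.path.join(tempfile.mkdtemp(), "A.pickle")
with open(pickle_path, "wb") as f:
    pickle.dump(fake_letter, f, pickle.HIGHEST_PROTOCOL)

# Read it back; the restored object is an ndarray with the same contents.
with open(pickle_path, "rb") as f:
    restored = pickle.load(f)

print(restored.shape, restored.dtype)
```

The restored array has the same shape, dtype and values as the one that was dumped, which is all the later cells rely on.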


Your task: using the skeleton function below, you need to:

* 1) Load all the images in FLOAT format for each class (A,..,J), 1 channel only.
* 2) Shift the intensities of each image so that the range goes from -127.5 to 127.5 (instead of 0 to 255).
* 3) Scale the values so that the new range goes from -0.5 to 0.5.

A few images might not be readable; we will just skip them.
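
The centering and scaling steps can be sketched on a tiny array, assuming 8-bit intensities between 0 and 255. The 2x2 "image" is a hypothetical example, and reading real image files is deliberately left out.

```python
import numpy as np

pixel_depth = 255.0  # number of intensity levels assumed for 8-bit images

# Hypothetical 2x2 "image" with raw intensities in [0, 255].
raw = np.array([[0, 255], [51, 204]], dtype=np.float32)

centered = raw - pixel_depth / 2   # range becomes [-127.5, 127.5]
scaled = centered / pixel_depth    # range becomes [-0.5, 0.5]

print(scaled)
```

Both steps are plain element-wise NumPy operations, so the same two lines work unchanged on a full 28x28 image.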

In [None]:

""" Image fixed size  """
image_size  = 28     # Pixel width and height. (28x28)
pixel_depth = 255.0  # Number of levels per pixel. (0,255)

""" There should be at least this much data at the end """
min_num_images_train = 45000
min_num_images_test  = 1800



def load_letter(folder, min_num_images):
    
  """ Base function: 
  
      Complete this function to read each image of a given character (folder).
      Transform and scale each image to have ~0 mean and a standard deviation below 0.5.
      
      Params: 
          folder: input character folder (e.g. ../data/notMNIST_large/A/)
          min_num_images: minimum number of images you should have per character.
      
      returns: 
          dataset: Vector containing the fully loaded and scaled dataset.
  """


  image_files = os.listdir(folder)
    
  # Array size (should be preserved)  
  dataset = np.ndarray(shape=(len(image_files), image_size, image_size),
                         dtype=np.float32)
  print(folder)
  num_images = 0

  # List of all the images inside the folder  
  for image in image_files:
    image_file = os.path.join(folder, image)
    
  # If the image is not loadable (there are some corrupted images you can skip them) 
    try:
        
      # CODE HERE:
    
      #Load each image and transform them
      

    
    
    
    
    
    #--- End of your code.
    
      # here I check that you load them correctly and save it in the dataset array.  
      if image_data.shape != (image_size, image_size):
        raise Exception('Unexpected image shape: %s' % str(image_data.shape))
      dataset[num_images, :, :] = image_data
    
      num_images = num_images + 1
        
    except IOError as e:
      print('Could not read:', image_file, ':', e, '- it\'s ok, skipping.')
    
  dataset = dataset[0:num_images, :, :]

  # If this threshold is not met, you are (probably) doing something wrong.
  if num_images < min_num_images:
    raise Exception('Many fewer images than expected: %d < %d' %
                    (num_images, min_num_images))

  # Check this output!
  # The mean should be very close to 0 and the standard deviation should be less than 0.5.
  print('Full dataset tensor:', dataset.shape)
  # Notice that we are calling this a "tensor".

  print('Mean:', np.mean(dataset))
  print('Standard deviation:', np.std(dataset))
  return dataset


In [None]:
# Look, Morty, I'm a Pickle!        

# This function calls your pre-defined-function load_letter(folder, min_num_images) and creates the pickle!

def Im_a_pickle(data_folders, min_num_images_per_class, force=False):
    
    """ Base function: 
  
      Loads all the images listed in data_folders and creates a .pickle file
      
      Params: 
          data_folders: list of the folders to pickle (i.e. large_folders, small_folders)
          min_num_images_per_class: minimum number of images you should have per character.
      
      returns: 
          dataset_names: Vector containing all the pickles names.
  """
    dataset_names = []

    for folder in data_folders:
        set_filename = folder + '.pickle'
        dataset_names.append(set_filename)
    
        if os.path.exists(set_filename) and not force:
          # You may override by setting force=True.
          print('%s already present - Skipping pickling.' % set_filename)
        else:
          print('Turning myself into a Pickle! %s.' % set_filename)

          dataset = load_letter(folder, min_num_images_per_class)

          try:
            with open(set_filename, 'wb') as f:
              pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
          except Exception as e:
            print('Unable to save data to', set_filename, ':', e)
  
    return dataset_names

If everything was done correctly, we can then call the following functions without error!

Notice that here we treat the "large" dataset as our training set and the "small" one as our test set.

In [None]:
train_datasets = Im_a_pickle(large_folders, 45000)
test_datasets  = Im_a_pickle(small_folders, 1800)

---
## Problem 2

---------

To corroborate that our data is properly saved and scaled, display one example per class letter (A,...,J), from the train dataset **or** the test dataset. 

To do this you will need to use ``pickle.load(...)``. Check the documentation above for more details. You can use matplotlib's built-in functions to show each example. Include a colorbar showing the __value range of the image__.

---

In [None]:
# code here


Finally, if everything is correct, each pickle should contain the full set of samples for its character. The labels will be stored in a separate array of *integers 0 through 9*.

Verify that each pickle in train_datasets holds on the order of ~52,000 images, and each pickle in test_datasets on the order of ~1,870 images.

In [None]:


def data_sets_sizes(data_set):
    """ Base function: 
  
      Loads every pickle listed in data_set and prints its size.
      
      Params: 
          data_set: list of the pickle files (e.g. train_datasets, test_datasets)
      
      returns: 
          nothing; prints number_files, the number of images stored in each pickle.
    """
    number_files = []

    #Code here 

    print(number_files)
    
 
data_sets_sizes(train_datasets)

data_sets_sizes(test_datasets)

## Problem 3 
### Creating sub-sampled datasets.


In order to evaluate the performance of our classifiers, we need to create properly randomized subsets of our data. This means we should not always pick the first images of each set, since the order in which the samples are stored would introduce a bias. In case you wonder whether it is worth the trouble, a very nice post on this topic can be found below.

https://machinelearningmastery.com/randomness-in-machine-learning/


For problem 3, you are asked to write a function, ``sample_training_data(...)``, which should create a training dataset of a given size containing the same number of samples for each label (give or take one sample), _randomly selected_ from the training pickles (``train_datasets``), together with the labels of the training set coded as integers from 0 (A) to 9 (J). 


It is worth mentioning that it is common practice in machine learning to set aside a third dataset, known as the _validation dataset_, which is used to help prevent overfitting and other training problems. We will not use such a dataset here, but it is worth checking why it is used; a nice and short explanation can be found here (it also contains nice code hints relevant to the exercise ;) )

https://machinelearningmastery.com/difference-test-validation-datasets/
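
To illustrate the idea of class-balanced random sampling, here is a minimal sketch on synthetic data. The three random arrays stand in for per-letter datasets and are purely hypothetical; loading the real pickles and handling ten classes is left to the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only to make the sketch repeatable

# Hypothetical stand-ins for three per-letter datasets of different lengths.
per_class = [rng.random((n, 28, 28), dtype=np.float32) for n in (50, 60, 70)]

samples_per_class = 10
images, labels = [], []
for label, letter_set in enumerate(per_class):
    # Pick distinct random indices so that no image is drawn twice.
    idx = rng.choice(len(letter_set), size=samples_per_class, replace=False)
    images.append(letter_set[idx])
    labels.append(np.full(samples_per_class, label, dtype=np.int32))

subset = np.concatenate(images)
subset_labels = np.concatenate(labels)
print(subset.shape, np.bincount(subset_labels))
```

Drawing indices without replacement is one possible design choice; it guarantees that a sample cannot appear twice in the same subset.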



In [None]:

#Code here.
def sample_training_data(pickle_files, train_size):

    """ Base function: 
  
      Given a train size, returns an ndarray containing a total of 
      <train_size/number_of_classes> samples for each character.
      
      Example: for train_size = 100, it should contain 100/10 = 10 samples of each character. 
      
      The samples should be chosen randomly.
      
      Params: 
          pickle_files: list of the pickle files (training set)
          train_size: total length of the new training set
      
      returns: 
          train_dataset: ndarray containing all the training images (properly normalized)
          train_labels : the labels of each selected image. 
  """
    
    train_dataset = []
    train_labels = []
    
    return train_dataset, train_labels
  

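# EXAMPLE OF USE
train_size = 200000
valid_size = 10000
test_size = 10000

# The same sampler is reused for the test pickles (an assumption; adapt it if you write a separate function).
train_dataset, train_labels = sample_training_data(train_datasets, train_size)
test_dataset, test_labels   = sample_training_data(test_datasets, test_size)

print('Training size: ', train_dataset.shape, '\nLabel vector size:',train_labels.shape)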

Finally, you need to randomize the order of the samples so that they do not follow any specific pattern, such as all the 'A' characters first, then all the 'B' characters, and so on:

```(A, A, ..., A , B, B, ..., B, C, C, ...,C,... )```.
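
One common way to shuffle images and labels consistently is to draw a single permutation and index both arrays with it. The sketch below uses hypothetical toy data in which every image is filled with its own label value, so the pairing is easy to check afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: image i is filled with the value of its own label.
toy_labels = np.repeat(np.arange(3), 4)            # 0,0,0,0,1,1,1,1,2,2,2,2
toy_images = np.ones((12, 2, 2)) * toy_labels[:, None, None]

# One permutation indexes both arrays, so each image keeps its label.
perm = rng.permutation(len(toy_labels))
mixed_images, mixed_labels = toy_images[perm], toy_labels[perm]

print(mixed_labels)
```

Shuffling the two arrays with separate calls would break the image/label pairing, which is why a shared permutation is used here.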


In [None]:
# Define a function to randomize THE ORDER of a given dataset.
# Be sure that the dataset and the labels are shuffled in the same order so they MATCH.

def randomize(dataset, labels):
    # code here

    return shuffled_dataset, shuffled_labels


train_dataset, train_labels = randomize(train_dataset, train_labels)
test_dataset, test_labels   = randomize(test_dataset, test_labels)


---
Problem 4
---------

Show us that your method works and both datasets are coherent with the labels. You can display the shuffled order and show the first images in both datasets. They should match the labels.

---

In [None]:
#code here

---
Problem 5
---------

By default, this dataset might contain a lot of overlapping samples, including training data that is also contained in the test set. Overlap between training and test sets can skew the results if you expect to use your model in an environment where there is never an overlap, but it is actually fine if you expect to see training samples recur when you use it.

Measure how much overlap there is between the training and test samples. 

- Give the number of overlapping samples in the test set for the full dataset.
- What about near duplicates between datasets (images that are almost identical)? Provide an __estimate__ based on any metric or function that evaluates similarity against a given threshold.  
- Modify your ``sample_training_data`` function and provide curated train and test datasets, removing the very similar samples from one of them. 
---
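
As one possible starting point for counting exact duplicates, each image can be hashed from its raw bytes and the hashes compared between sets. This is only a sketch on hypothetical toy arrays; `image_hashes` is an illustrative helper, not part of the course utilities, and near duplicates would need a distance measure (for example a mean absolute pixel difference below a threshold) instead of a hash.

```python
import hashlib

import numpy as np


def image_hashes(images):
    """Return one SHA-1 hex digest per image, computed from its raw bytes."""
    return [hashlib.sha1(np.ascontiguousarray(img).tobytes()).hexdigest()
            for img in images]


rng = np.random.default_rng(0)
# Hypothetical toy sets: the test set reuses two training images verbatim.
toy_train = rng.random((6, 28, 28), dtype=np.float32)
toy_test = np.concatenate([rng.random((3, 28, 28), dtype=np.float32),
                           toy_train[:2]])

train_hashes = set(image_hashes(toy_train))
overlap = sum(h in train_hashes for h in image_hashes(toy_test))
print("Exact duplicates in test set:", overlap)
```

Storing the training hashes in a set keeps each lookup cheap, which matters once the comparison runs over hundreds of thousands of images.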

---
Problem 6
---------

Let's get an idea of what a basic classifier can give you on this data. 

Train a simple model on this data using 50, 100, 1000 and 5000 training samples. 

Hint: you can use the ```LogisticRegression``` or ```LogisticRegressionCV``` model from sklearn.linear_model.

Provide a score for the prediction over the full test dataset. You can use any metric from the previous chapters or an existing one such as ```cross_val_score``` from sklearn, which gives a more reliable estimate.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

A good choice of parameters (and regularization method) can give you results of up to 89%.



```
#Samples: 50 ---> Score: 0.471428571429
#Samples: 100 ---> Score: 0.605865717935
#Samples: 1000 ---> Score: 0.760772183027
#Samples: 5000 ---> Score: 0.812826972435 
```
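
For orientation, the two scikit-learn pieces named in the hint can be combined as in the sketch below. The data is a hypothetical two-class toy problem, not notMNIST, and the parameter values (`max_iter=1000`, `cv=5`) are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical toy problem: two well-separated Gaussian blobs in 784-D,
# standing in for flattened 28x28 images of two letters.
X = np.concatenate([rng.normal(-0.2, 0.1, (40, 784)),
                    rng.normal(0.2, 0.1, (40, 784))])
y = np.repeat([0, 1], 40)

# scikit-learn expects 2-D input, so real images need reshaping first,
# e.g. images.reshape(len(images), -1).
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Note that `cross_val_score` evaluates on held-out folds of the data it is given; to score against the separate test set, the model can be fitted on the training subset and evaluated with `model.score(...)` on the test images.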



---

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

def train_and_validate(num_examples):
    # code here


In [None]:
training_sizes = [50, 100, 1000, 5000]

print("Cross Validation Score\n")
for size in training_sizes:
    score = train_and_validate(size)
    print("Samples:", size,"---> Score:", score)
    
    
#Samples: 50 ---> Score: 0.471428571429
#Samples: 100 ---> Score: 0.605865717935
#Samples: 1000 ---> Score: 0.760772183027
#Samples: 5000 ---> Score: 0.812826972435
