# Image Classifier - MIDAS INTERNSHIP CHALLENGE

In [336]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Explanation - 1
The following code imports the third-party deep learning libraries we need, such as *PyTorch*, and checks whether a GPU is available.

<img src ="https://cdn-images-1.medium.com/max/2600/1*aqNgmfyBIStLrf9k7d9cng.jpeg"/>

In [337]:
from fastai.vision import *
from fastai.metrics import error_rate

import os
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image

import torch
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset

# Use the first GPU if CUDA is available, otherwise fall back to the CPU.
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
print("Device in use: {0}".format(device))

Device in use: cuda:0


## Explanation - 2

### Intuition
The code block below defines a few global variables used throughout the notebook: the batch size for our deep learning pipeline and the paths to the training data, the training labels and the test data.

In [338]:
data_path = "./data"
train_images_path = "{0}/train_image.pkl".format(data_path)
test_images_path = "{0}/test_image.pkl".format(data_path)
train_labels_path = "{0}/train_label.pkl".format(data_path)
batch_size = 64

In [339]:
def path_exists(path):
    """
     Function to verify that the given file path exists
    """
    return os.path.exists(path)

# Check all three files: training images, test images and training labels.
valid_path = [path_exists(_) for _ in (train_images_path, test_images_path, train_labels_path)]
valid_path

[True, True, True]

## Deep Learning Pipeline - Data Pre-processing

<img src="https://cdn-images-1.medium.com/max/1200/1*ZX05x1xYgaVoa4Vn2kKS9g.png" />

### What's happening ?
In the code block below we define a basic class that takes care of our pre-processing pipeline.

#### Why did we **inherit** the `Dataset`  class ?
We are using PyTorch as our deep learning framework. PyTorch provides a simple API for building a data pipeline: we can define our own by inheriting from the **Dataset** class and overriding `__len__()` and `__getitem__()`, which is where our pre-processing lives.

In our case I have divided the pipeline into the following parts.
- Load the data from a pickle dump.
- Build an API to *clean the data*, that is, to arrange it in a format that PyTorch understands. This is done by the `__getitem__()` function.
    - This function returns a single sample and its label for our neural net to work on.
- We also incorporate transformations in the class, which help us normalise our data.
    - **Normalising** the input data is important because it helps *gradient descent* converge faster: when the features share a similar scale, the contours of the cost function are more symmetrical, so we can use a higher learning rate and reach the minimum in fewer steps.
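
To make the normalisation step concrete, here is a minimal NumPy sketch of the arithmetic that `transforms.ToTensor()` followed by `transforms.Normalize` with a mean of 0.5 and a standard deviation of 0.5 applies to an 8-bit greyscale image. The helper name `normalise_uint8` and the sample pixel values are made up for illustration; this is not the torchvision implementation, only the same value mapping.

```python
import numpy as np

def normalise_uint8(img, mean=0.5, std=0.5):
    """Hypothetical helper: map uint8 pixels in [0, 255] to floats in [-1, 1]."""
    scaled = img.astype(np.float32) / 255.0   # same value range ToTensor produces: [0, 1]
    return (scaled - mean) / std              # same shift and scale Normalize applies

# A made-up 2x2 "image" covering the darkest and brightest pixel values.
pixels = np.array([[0, 51], [204, 255]], dtype=np.uint8)
normalised = normalise_uint8(pixels)
print(normalised)
```

With these settings the darkest pixel maps to -1 and the brightest to 1, so the inputs are centred around zero.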


In [340]:
class DataSetLoader(Dataset):
    '''Dataset Loader'''
    def __init__(self, train_path, labels_path, transform=None, train=True, target_transform=None):
        """
        Args:
            train_path (string): Path to the training data file
            labels_path (string): Path to the labels present for the training data
            transform (callable): Optional transform to apply to sample
        """
        self.train=train
        if self.train:
            data = self._load_from_pickle(train_path)
            self.train_data = torch.ByteTensor(data).view(-1, 28, 28) 
            self.train_labels = self._load_from_pickle(labels_path)
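        else:
            # For the test set, `train_path` points at the test images.
            data = self._load_from_pickle(train_path)
            self.test_data = torch.ByteTensor(data).view(-1, 28, 28)
            # The test set has no labels; store -1 placeholders so that
            # __getitem__ can still return an (image, label) pair.
            self.test_labels = [-1] * len(self.test_data)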
            
        del data
        self.transform = transform
        self.target_transform = target_transform
        
    def __len__(self):
        """
         Returns the length of whole Dataset fed into the Neural Net.
        """
        if self.train:
            return len(self.train_data)
        else:
            return len(self.test_data)
    
    def __getitem__(self, index):
        """
         Returns a single training/test example after applying the required normalisation/transformations techniques.
         As well as the label.
         
         ret: (image, label)
        """
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]

        # return a PIL Image
        img = Image.fromarray(img.numpy(), mode='L')

        if self.transform is not None:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
            
        return img, target
    
    def _load_from_pickle(self, file_path):
        """
         file_path: File path to load data, returns an ndarray
         
         ret: Loaded dump, in primitive data format.
        """
        with open(file_path, 'rb') as file:
            data = pickle.load(file)
        return data
    
    def get_labels(self):
        """
         Return an array of labels present in the data set.
        """
        if self.train:
            return np.unique(self.train_labels)
        return []
    
    def visualise_data_set(self, labels_map=None):
        """
         Utility function to display 20 randomly chosen images from the dataset.

         labels_map: optional dict mapping numeric labels to names, used for titles.
        """
        fig = plt.figure(figsize=(8, 8))
        columns = 4
        rows = 5
        images = self.train_data if self.train else self.test_data
        for i in range(1, columns * rows + 1):
            idx = np.random.randint(len(images))
            fig.add_subplot(rows, columns, i)
            # Only the training set carries labels to show as titles.
            if self.train and labels_map is not None:
                plt.title(labels_map[self.train_labels[idx]])
            plt.axis('off')
            # The stored images are raw 28x28 uint8 tensors.
            plt.imshow(images[idx].numpy())

        plt.show()

In [341]:
# Define the normalisation/transformation steps applied during pre-processing.
# The images are single-channel (greyscale), so one mean and one std are given.
img_transforms = transforms.Compose([transforms.ToTensor(),
                                     transforms.Normalize((0.5,), (0.5,))])

In [343]:
# Create training and test data sets.
train_dataset = DataSetLoader(train_images_path, train_labels_path, train=True, transform=img_transforms)
test_dataset = DataSetLoader(test_images_path, "", train=False, transform=img_transforms)

## Looking at the data

The following code blocks give us some insight into the kind of images our data set contains.
Looking at the training set, we have 4 labels, *0, 2, 3, 6*, which are stored as numbers in our training labels.

For readability I have created a dictionary that maps each numerical label to a physical label, like *Shirt*, *T-Shirt* etc.

```
labels = train_dataset.get_labels()
```
The code above gives us the unique labels present in our training data.
```
labels_map = {0 : 'T-Shirt', 2 : 'Pullover', 3 : 'Dress', 6 : 'Shirt'}
```
Initialises a dictionary to map each numerical label to a physical entity.

In [346]:
labels = train_dataset.get_labels()
labels_map = {0 : 'T-Shirt', 2 : 'Pullover', 3 : 'Dress', 6 : 'Shirt'}
for label in labels:
    print("Label: {0}, value: {1}".format(label, labels_map[label]))

Label: 0, value: T-Shirt
Label: 2, value: Pullover
Label: 3, value: Dress
Label: 6, value: Shirt


In [None]:
fig = plt.figure(figsize=(8, 8))
columns = 4
rows = 5
for i in range(1, columns * rows + 1):
    idx = np.random.randint(len(train_dataset))
    # Fetch the sample once: a (1, 28, 28) tensor and its numeric label.
    img, label = train_dataset[idx]
    fig.add_subplot(rows, columns, i)
    plt.title(labels_map[label])
    plt.axis('off')
    plt.imshow(img[0])
plt.show()


### Explanation of code block below

For our data pre-processing we have cleaned the data, applied the required transformations and written a simple API that lets us do these steps gracefully.
What we need now is another API that helps us iterate over our training/test datasets efficiently.

The following code does exactly that: the **DataLoader** class gives us a *Python* iterable that traverses the dataset efficiently, in batches of the size we choose.

```
for i_batch, sample_batched in enumerate(trainloader):
    do_something(sample_batched)
```

Since `trainloader` is iterable, we can loop over our data set batch by batch, as in the code example above.
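
As a rough illustration of what batch iteration involves, the sketch below shuffles the indices of a dataset and yields the samples a batch at a time. `iterate_batches` and `toy_dataset` are hypothetical names invented for this example; the real `DataLoader` additionally collates each batch into tensors and can load data in worker processes, which this sketch does not attempt.

```python
import numpy as np

def iterate_batches(dataset, batch_size, shuffle=True, seed=0):
    """Hypothetical stand-in for DataLoader: yield lists of samples, one batch at a time."""
    indices = np.arange(len(dataset))
    if shuffle:
        # Shuffle the index order so each pass visits samples in random order.
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        # Anything with __len__ and __getitem__ works here, e.g. a list or a Dataset.
        yield [dataset[int(i)] for i in batch_idx]

# Ten made-up (image_name, label) pairs; the last batch is smaller than the rest.
toy_dataset = [("img_{0}".format(i), i % 4) for i in range(10)]
batch_sizes = [len(batch) for batch in iterate_batches(toy_dataset, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

Every sample appears exactly once per pass, and only the final batch may be shorter than `batch_size`.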

#### What's my intuition behind this ?

The basic intuition behind using an iterator here is that we need to carry out the two basic neural net operations on every batch:
  - Forward Propagation 
  - Backward Propagation

Since we want to use **Mini-Batch Gradient Descent** to let our neural network learn its parameters (weights and biases), we use this iterator to create the required batches automatically.
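
The loop below is a minimal sketch of mini-batch gradient descent, using a NumPy linear model instead of a neural network so that the forward and backward passes fit in one line each. The synthetic data, `true_w`, the learning rate and the epoch count are all assumptions chosen for illustration; they are not taken from this notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))            # 256 made-up samples with 3 features
true_w = np.array([1.5, -2.0, 0.5])      # weights the model should recover
y = X @ true_w                           # noiseless targets

w = np.zeros(3)                          # parameters to learn
learning_rate = 0.1
batch_size = 64

for epoch in range(50):
    order = rng.permutation(len(X))      # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        pred = xb @ w                                  # forward propagation
        grad = 2.0 * xb.T @ (pred - yb) / len(idx)     # backward propagation (MSE gradient)
        w -= learning_rate * grad                      # gradient descent update

print(np.round(w, 3))
```

Each update uses only one batch of 64 samples, which is the same pattern the training loop follows when it pulls batches from `trainloader`.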

In [335]:
"""
    We built our data sets using the DataSetLoader class defined above; now we need the following operations:
    1. Divide the dataset into batches of the chosen batch size.
    2. Shuffle the training set for randomness.
"""
trainloader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True, num_workers=2)

# The test set is not shuffled, so predictions stay in the order of the input file.
testloader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False, num_workers=2)