In [None]:
%matplotlib inline

<style type="text/css">
.reveal h1, .reveal h2 {
    font-family:"League Gothic"
}
</style>

# <center> <font size="12"> Image Classification using Deep Learning </font> </center>


<center>


<p> <b>Soumendra Dhanee</b> </p>
<p> <b>dhanee@soumendra.io</b> </p>
<p> <b>https://github.com/soumendra</b> </p>
<p> <b>@dataBiryani</b> </p>

<p> <a href="https://npcpune2017.sched.com/event/9bC3/workshop-image-classification-using-deep-learning">NASSCOM Product Conclave</a> </p>
<p> <a href="https://github.com/soumendra/lecture_nasscom_deeplearning_mar17"> Workshop Resources </a> </p>

<p> 17th March, 2017</p>

<p>Pune, India </p>

</center>

<center><font size="12"> Acknowledgement </font> </center>

Materials have been borrowed from

* [cs231n](http://cs231n.stanford.edu/) (images)
* [Machine Learning is Fun](https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721#.rzclnng89) (images and ideas)
* [Wikipedia](https://en.wikipedia.org/wiki/Artificial_neural_network) (images)
* [9 Deep Learning Papers You Need to Know About](https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html) (images and text)

# What is Deep Learning?


But before that, what is machine learning?

    A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

* Cognitive vs Operational
* Can machines think?
* Can machines do what we (as thinking entities) can do?

Supervised Machine Learning

vs

Unsupervised Machine Learning

Artificial Neural Networks and Deep Learning



### Deep Learning

Common in various definitions:
1. multiple layers of nonlinear processing units, and
2. the supervised or unsupervised learning of feature representations in each layer, with the layers forming a hierarchy from low-level to high-level features

The rise of Deep Learning

* Initial success
* AI Winter
* Comeback
* Explosion

# Why Deep Learning?

* State of the art results on human cognition tasks is so outdated!

* Some Deep Learning models perform better than human beings!
---
* Image identification and recognition
* Natural Language Translation
* Speech to Text
* Self-driving Cars
* Art

# Artificial Neural Networks

![](./img/ann1.png)

![](./img/ann2.png)

* Backpropagation
* Dropout

# Convolutional Neural Networks

**What is a Convolutional Neural Network?**

Any Artificial Neural Network (ANN) with a convolutional layer can be thought of as a Convolutional Neural Network. In the typical CNN, we see the following layers in use, arranged in various architectures:

1. Convolutional Layers
2. Pooling Layers
3. Fully-Connected Layers

**Why Convolutional Neural Networks**

* Convolutional Neural Networks (ConvNets, CNNs) preserve the spatial structure of the problem 

* They were developed for object recognition tasks such as handwritten digit recognition (MNIST)

* They are popular because people are achieving state-of-the-art results on difficult computer vision and natural language processing tasks

## Convolutional Layers

So what are convolutional layers?

**The challenge of translational invariance**


![](./img/cnn1.png)

**Let's take a closer look at how we solve the challenge of translational invariance with convolutional layers**

![](./img/cnn2.png)

**Break the image into overlapping image tiles**

![](./img/cnn3.png)

This turns the original image into 77 equally sized tiny image tiles.

**Feed each image tile into a small neural network** (the convolution)

![](./img/cnn4.png)

There’s one big twist: We’ll keep the same neural network weights for every single tile in the same original image. In other words, we are treating every image tile equally. If something interesting appears in any given tile, we’ll mark that tile as **interesting**.

Convolution with kernels

**Save the results from each tile into a new array**

We don’t want to lose track of the arrangement of the original tiles. So we save the result from processing each tile into a grid in the same arrangement as the original image.

![](./img/cnn5.png)

**Downsampling**

So far

![](./img/cnn6.png)

**Max Pooling**

![](./img/cnn7.png)

The idea here is that if we found something interesting in any of the four input tiles that makes up each 2x2 grid square, we’ll just keep the most interesting bit. This reduces the size of our array while keeping the most important bits.
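The "keep the most interesting bit" idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the figure: the `scores` grid and the function name `max_pool_2x2` are made up for the example.

```python
import numpy as np

def max_pool_2x2(grid):
    """Keep the largest value from each non-overlapping 2x2 square."""
    rows, cols = grid.shape
    pooled = np.zeros((rows // 2, cols // 2), dtype=grid.dtype)
    for i in range(rows // 2):
        for j in range(cols // 2):
            square = grid[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            pooled[i, j] = square.max()
    return pooled

# Hypothetical "interestingness" scores for a 4x4 grid of tiles.
scores = np.array([[0, 1, 0, 0],
                   [0, 9, 2, 0],
                   [3, 0, 0, 0],
                   [0, 0, 0, 7]])
pooled = max_pool_2x2(scores)
print(pooled)
```

Each 2x2 square collapses to its single largest score, so the 4x4 grid becomes 2x2.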

**Make a prediction**

![](./img/cnn8.png)

**A more realistic example**

These steps can be combined and stacked as many times as you want! You can have two, three or even ten convolution layers. You can throw in max pooling wherever you want to reduce the size of your data.

The more convolution steps you have, the more complicated features your network will be able to learn to recognize.

For example, the first convolution step might learn to recognize sharp edges, the second convolution step might recognize beaks using its knowledge of sharp edges, the third step might recognize entire birds using its knowledge of beaks, etc.

![](./img/cnn9.png)

### Filters/Kernels

Remember those small neural networks?

* The filters are essentially the neurons of the layer.
* Like a neuron, each filter has weighted inputs and generates an output value.
* The input size is a fixed square called a **patch** or a **receptive field**.
* In the first convolutional layer, the filter reads pixel values; deeper in the network, the convolutional layer takes its input from a feature map of the previous layer.

**Inspiration**

* Work by Hubel and Wiesel in the 1950s and 1960s showed that the cat and monkey visual cortexes contain neurons that individually respond to small regions of the visual field.
* Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field.
* Neighboring cells have similar and overlapping receptive fields.
* Receptive field size and location vary systematically across the cortex to form a complete map of visual space, the cortex in each hemisphere representing the contralateral visual field.
* Their 1968 paper identified two basic visual cell types in the brain:
    - simple cells, whose output is maximized by straight edges having particular orientations within their receptive field
    - complex cells, which have larger receptive fields, whose output is insensitive to the exact position of the edges in the field.

### Feature Maps

* The feature map is the output of one filter applied to the previous layer.
* A given filter is drawn across the entire previous layer, moved one pixel at a time.
* Each position results in an activation of the neuron and the output is collected in the feature map. 
* If the **receptive field** is moved one pixel from activation to activation, then the field will overlap with the previous activation by (field width - 1) input values.
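The sliding described above can be sketched with plain NumPy loops. This is an illustrative sketch rather than how a deep learning library implements it: the function name `feature_map`, the toy image, and the vertical-edge kernel are all assumptions made for the example. As in most CNN code, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
import numpy as np

def feature_map(image, kernel, bias=0.0):
    """Slide one filter over a 2D input, one pixel at a time, with no padding."""
    f = kernel.shape[0]                      # receptive field is f x f
    out_h = image.shape[0] - f + 1
    out_w = image.shape[1] - f + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + f, j:j + f]  # the current receptive field
            out[i, j] = np.sum(patch * kernel) + bias
    return out

# Toy 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hypothetical filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

fmap = feature_map(image, kernel)
print(fmap.shape)
print(fmap[0])
```

The same nine weights are reused at every position, and the feature map is largest exactly where the receptive field straddles the edge.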

**Strides and Padding**
* The distance the filter moves across the input from the previous layer between activations is referred to as the **stride**.
* If the input size, the receptive field size, and the stride do not fit together cleanly, the receptive field may attempt to read off the edge of the input feature map.
    - In this case, techniques like **zero padding** can be used to invent mock inputs with zero values for the receptive field to read.
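Zero padding is easy to see on a small array. The sketch below uses `numpy.pad` on a made-up 3x3 feature map; the values are arbitrary and only serve to show where the zeros go.

```python
import numpy as np

# A hypothetical 3x3 input feature map.
fmap = np.arange(1, 10).reshape(3, 3)

# Add a border of one zero on every side (P = 1).
padded = np.pad(fmap, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)
print(padded)
```

With this border in place, a 3x3 receptive field centred on a corner value reads zeros for the positions that fall outside the original data.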

## Pooling Layers

* The pooling layers down-sample the previous layer's feature map.

* Pooling layers follow a sequence of one or more convolutional layers and are intended to consolidate the features learned and expressed in the previous layer's feature map.

* Pooling may be considered a technique to compress or generalize feature representations, and it generally reduces overfitting of the training data by the model.

* Pooling layers also have a receptive field, often much smaller than that of the convolutional layer.
* The pooling-stride is often equal to the size of the pooling-receptive field to avoid any overlap. 

Pooling layers are often very simple, taking the average or the maximum of the input values in order to create their own feature map.
* Max Pooling
* Mean Pooling
* [Fractional Max Pooling](https://arxiv.org/abs/1412.6071)
* [Stochastic Pooling](http://www.matthewzeiler.com/pubs/iclr2013/iclr2013.pdf)

## Fully Connected Layers

* Fully connected layers are the normal flat feedforward neural network layers.
* These layers may have a nonlinear activation function or a softmax activation in order to output probabilities of class predictions.
* Fully connected layers are used at the end of the network after feature extraction and consolidation have been performed by the convolutional and pooling layers.
* They are used to create final nonlinear combinations of features and for making predictions by the network.

## Dropout Layers

CNNs have a habit of overfitting, even with pooling layers. Dropout should be used, for example between fully connected layers and perhaps after pooling layers.
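As a rough sketch of what a dropout layer does during training, the NumPy function below zeroes a random subset of activations. It assumes the common "inverted dropout" variant, which rescales the surviving activations so nothing needs to change at test time; the slides do not say which variant is meant, and the function name and the rate of 0.5 are illustrative.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` while training."""
    if not training or rate == 0.0:
        return activations                   # dropout is a no-op at test time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Inverted dropout (an assumption): rescale so the expected value is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 5))                     # stand-in for a layer's activations
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
```

Every entry of `dropped` is either 0 (dropped) or 2 (kept and rescaled by 1 / 0.5).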

# Convolutional Layers in Action

You now know about convolutional, pooling and fully connected layers. Let’s make this more concrete by working through how these three layers may be connected together.

## 1D Convolution

![](./img/conv1d.png)

* In this example there is only one spatial dimension (x-axis), one neuron with a receptive field size of F = 3, the input size is W = 5, and there is zero padding of P = 1.


* **Left**: The neuron is strided across the input with a stride of S = 1, giving an output of size (5 - 3 + 2)/1 + 1 = 5.

* **Right**: The neuron uses a stride of S = 2, giving an output of size (5 - 3 + 2)/2 + 1 = 3.
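The output sizes quoted above follow the formula (W - F + 2P)/S + 1. A small helper makes the arithmetic explicit; the function name `conv_output_size` is made up for this sketch, and raising an error for a stride that does not fit is a design choice for the example rather than something every library does.

```python
def conv_output_size(w, f, p, s):
    """Number of output positions: (W - F + 2P) / S + 1."""
    span = w - f + 2 * p
    if span % s != 0:
        raise ValueError(
            "stride %d does not fit: %d is not divisible by %d" % (s, span, s))
    return span // s + 1

left = conv_output_size(w=5, f=3, p=1, s=1)
right = conv_output_size(w=5, f=3, p=1, s=2)
print(left, right)
```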

## 1D Convolution

![](./img/conv1d.png)

* Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In terms of the equation, this can be determined since (5 - 3 + 2) = 4 is not divisible by 3. 

* In this example the neuron weights are [1,0,-1] (shown on the far right), and its bias is zero. These weights are shared across all yellow neurons (parameter sharing).
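The same setup (F = 3, P = 1, weights [1, 0, -1], zero bias) can be run in NumPy. The input values below are hypothetical and chosen only to have size W = 5; they are not read off the figure.

```python
import numpy as np

def conv1d(x, weights, bias=0.0, pad=1, stride=1):
    """Apply one shared-weight neuron across a zero-padded 1D input."""
    padded = np.pad(x, pad, mode="constant")
    f = len(weights)
    starts = range(0, len(padded) - f + 1, stride)
    return np.array([np.dot(padded[i:i + f], weights) + bias for i in starts])

x = np.array([1, 2, -1, 1, -3])   # hypothetical input, W = 5
w = np.array([1, 0, -1])          # the shared weights from the example

out_s1 = conv1d(x, w, stride=1)
out_s2 = conv1d(x, w, stride=2)
print(out_s1)
print(out_s2)
```

With stride 1 there are five outputs and with stride 2 there are three, matching the sizes computed from the formula.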

## 2D Convolution

**Some Intuition - Translating an image to a matrix of numbers**

![](./img/mnist2.png)

![](./img/mnist1.png)

**Doing Convolution**

![](./img/conv2d1.png)

![](./img/conv2d2.png)

![](./img/conv2d3.png)

![](./img/conv2d4.png)

### Input from Image data

* Let’s assume we have a dataset of gray scale images
* Each image has the same size of 32 pixels wide and 32 pixels high, and pixel values are between 0 and 255
    - a matrix of 32 × 32 × 1 or 1,024 pixel values

* Image input data is expressed as a 3-dimensional matrix of width × height × channels
* If we were using color images in our example, we would have 3 channels for the red, green and blue pixel values, e.g. 32 × 32 × 3.
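In NumPy terms, such images are simply arrays of unsigned 8-bit integers. The sketch below fills them with random pixel values as a stand-in for real image data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for one grayscale and one color image, values in 0..255.
gray = rng.integers(0, 256, size=(32, 32, 1), dtype=np.uint8)
color = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

print(gray.shape, gray.size)
print(color.shape, color.size)
```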

### Convolutional Layer

* We define a convolutional layer with 
    - 10 filters
    - receptive field 5 pixels wide and 5 pixels high (5x5)
    - stride length of 1

* Because each filter can only get input from (i.e. see) 5 × 5 (25) pixels at a time, we can calculate that each will require 25 + 1 input weights (plus 1 for the bias input)

* Dragging the 5 × 5 receptive field across the input image data with a stride width of 1 will result in a feature map of 28 × 28 output values or 784 distinct activations per image

* We have 10 filters, so that is 10 different 28 × 28 feature maps or 7,840 outputs that will be created for one image

* Finally, we know we have 26 inputs per filter, 10 filters and 28 × 28 output values to calculate per filter, therefore we have a total of 26 × 10 × 28 × 28 or 203,840 connections in our convolutional layer.

* Convolutional layers also make use of a nonlinear transfer function as part of activation and the rectifier activation function is the popular default to use.
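The counts in this section can be reproduced with a few lines of arithmetic. The variable names are made up for the sketch; the last line, `learned_parameters`, is an addition that is not in the list above and shows how few distinct weights the layer has once they are shared across positions.

```python
input_size = 32   # 32 x 32 grayscale image
field = 5         # 5 x 5 receptive field
stride = 1
n_filters = 10

weights_per_filter = field * field + 1              # 25 weights + 1 bias
map_size = (input_size - field) // stride + 1       # width and height of a feature map
activations_per_map = map_size * map_size
total_outputs = n_filters * activations_per_map
connections = weights_per_filter * total_outputs

# Not in the list above: distinct weights to learn, thanks to weight sharing.
learned_parameters = n_filters * weights_per_filter

print(map_size, activations_per_map, total_outputs, connections, learned_parameters)
```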

## Pool Layer

* We define a pooling layer with a receptive field with a width of 2 inputs and a height of 2 inputs.
* We also use a stride of 2 to ensure that there is no overlap.
* This results in feature maps that are one half the size of the input feature maps.
    - From 10 different 28 × 28 feature maps as input to 10 different 14 × 14 feature maps as output.
* We will use a max() operation for each receptive field so that the activation is the maximum input value (Max Pooling).
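A vectorised sketch of this pooling layer, applied to a whole stack of feature maps at once. The activations are random stand-ins, and the reshape trick assumes the height and width are even, as they are here.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.random((10, 28, 28))   # stand-in for 10 feature maps of 28 x 28

n, h, w = feature_maps.shape
# Give every 2x2 window its own pair of axes, then take the max over them.
pooled = feature_maps.reshape(n, h // 2, 2, w // 2, 2).max(axis=(2, 4))

print(pooled.shape)
print(pooled.size / feature_maps.size)
```

The output keeps one value out of every four, so the stack shrinks to a quarter of its original size.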

## Fully Connected Layer

* Finally, we can flatten out the square feature maps into a traditional flat fully connected layer.
* We can define the fully connected layer with 200 hidden neurons, each with 10 × 14 × 14 input connections, or 1,960 + 1 weights per neuron.
* That is a total of 392,200 connections and weights to learn in this layer.
* We can use a sigmoid or softmax transfer function to output probabilities of class values directly.
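The weight count above, together with a minimal softmax, can be checked as follows. The raw scores are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step rather than something the slides prescribe.

```python
import numpy as np

inputs_per_neuron = 10 * 14 * 14                 # flattened pooled feature maps
total_weights = 200 * (inputs_per_neuron + 1)    # +1 for each neuron's bias
print(inputs_per_neuron, total_weights)

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)            # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])               # hypothetical scores for 3 classes
probs = softmax(scores)
print(probs)
```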

![img](./img/cnn_fcn.png)

## CNN Best Practices

Now that we know about the building blocks for a convolutional neural network and how the layers hang together, we can review some best practices to consider when applying them.

### Input Receptive Field Dimensions

The default is 2D for images, but could be 1D such as for words in a sentence or 3D for video that adds a time dimension.

### Receptive Field Size

The patch should be as small as possible, but large enough to see features in the input data. It is common to use 3 × 3 on small images and 5 × 5 or 7 × 7 and more on larger image sizes.

### Stride Width

Use the default stride of 1. It is easy to understand and you don’t need padding to handle the receptive field falling off the edge of your images. This could be increased to 2 or larger for larger images.

Here is more advice - https://www.quora.com/How-does-one-determine-stride-size-in-CNN-filters

### Number of Filters

Filters are the feature detectors. Generally fewer filters are used at the input layer and increasingly more filters used at deeper layers.

### Padding

Zero padding fills the positions outside the input with zeros, so the receptive field reads zeros wherever it extends past the edge of the data. This is useful when you cannot or do not want to standardize input image sizes or when you want to use receptive field and stride sizes that do not neatly divide up the input image size.

### Pooling

Pooling is a destructive or generalization process to reduce overfitting. Receptive field size is almost always set to 2 × 2 with a stride of 2 to discard 75% of the activations from the output of the previous layer.

Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.

### Hyperparameter Tuning

* https://www.quora.com/How-can-I-decide-the-kernel-size-output-maps-and-layers-of-CNN
* http://stackoverflow.com/questions/29607754/where-do-filters-kernels-for-a-convolutional-network-come-from
* http://stats.stackexchange.com/questions/193793/in-convolutional-neural-networks-cnn-how-we-can-decide-number-of-kernels-betw

### Pattern Architecture

It is common to pattern the layers in your network architecture. This might be one, two or some number of convolutional layers followed by a pooling layer. This structure can then be repeated one or more times. Finally, fully connected layers are often only used at the output end and may be stacked one, two or more deep.

# Before we get hands-on

## Style Guide

* The official Python style guide - https://www.python.org/dev/peps/pep-0008/
* Google Python style guide - https://google.github.io/styleguide/pyguide.html#Indentation
* TensorFlow style guide - https://www.tensorflow.org/community/style_guide



In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf

## Dataset (MNIST)

Dataset of
* 60,000 28x28 grayscale training images of the 10 digits
* a test set of 10,000 images

The code below will create the following objects:

* **X_train**, **X_test**: uint8 array of grayscale image data with shape (nb_samples, 28, 28)
* **y_train**, **y_test**: uint8 array of digit labels (integers in range 0-9) with shape (nb_samples, )
* argument **path**: if you do not have the index file locally (at '~/.keras/datasets/' + path), it will be downloaded to this location (in cPickle format)

In [None]:
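import pickle as pk
from keras.datasets import mnist

# Download MNIST (Keras caches it after the first call) and save a local pickled copy.
(X_train, y_train), (X_test, y_test) = mnist.load_data()
with open("mnist_keras.pkl", "wb") as f:
    pk.dump([X_train, y_train, X_test, y_test], f)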

Loading the dataset ...

In [None]:
import pickle as pk

# Reload the arrays saved in the previous cell.
with open("mnist_keras.pkl", "rb") as f:
    X_train, y_train, X_test, y_test = pk.load(f)

In [None]:
print(X_train)

## Dataset (CIFAR10)

Dataset of 
* 50,000 32x32 color training images
* labeled over 10 categories
* 10,000 test images

The code below will create the following datasets:

* **X_train**, **X_test**: uint8 array of RGB image data with shape (nb_samples, 3, 32, 32), or (nb_samples, 32, 32, 3) when Keras is configured for channels-last (TensorFlow) ordering
* **y_train**, **y_test**: uint8 array of category labels (integers in range 0-9) with shape (nb_samples, )

In [None]:
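from keras.datasets import cifar10
import pickle as pk

# Download CIFAR10 (Keras caches it after the first call) and save a local pickled copy.
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
with open("cifar10_keras.pkl", "wb") as f:
    pk.dump([X_train, y_train, X_test, y_test], f)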

Loading the dataset ...

In [None]:
import pickle as pk

# Reload the arrays saved in the previous cell.
with open("cifar10_keras.pkl", "rb") as f:
    X_train, y_train, X_test, y_test = pk.load(f)

In [None]:
print(X_train)

## Dataset (CIFAR100)

Dataset of
* 50,000 32x32 color training images
* labeled over 100 categories
* 10,000 test images

The code below will create the following objects:

* **X_train**, **X_test**: uint8 array of RGB image data with shape (nb_samples, 3, 32, 32), or (nb_samples, 32, 32, 3) when Keras is configured for channels-last (TensorFlow) ordering
* **y_train**, **y_test**: uint8 array of category labels with shape (nb_samples, )
* argument **label_mode**: "fine" or "coarse"

In [None]:
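from keras.datasets import cifar100
import pickle as pk

# Download CIFAR100 (Keras caches it after the first call) and save a local pickled copy.
(X_train, y_train), (X_test, y_test) = cifar100.load_data()
with open("cifar100_keras.pkl", "wb") as f:
    pk.dump([X_train, y_train, X_test, y_test], f)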

Loading the dataset ...

In [None]:
import pickle as pk

# Reload the arrays saved in the previous cell.
with open("cifar100_keras.pkl", "rb") as f:
    X_train, y_train, X_test, y_test = pk.load(f)

In [None]:
print(X_train)

<center><font size="10"> A Brief History of Deep Learning models in the modern era (2012 - ) </font> </center>

## Lots of successful models

* LeNet
* AlexNet (2012)
* ZFNet (2013)
* VGGNet (2014)
* GoogLeNet (2014)
* ResNet (2015)

ResNet achieved 3.6% error on ImageNet (ILSVRC15)

## ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)

* Compared to modern architectures, a relatively simple layout was used.
* The network was made up of 5 conv layers, max-pooling layers, dropout layers, and 3 fully connected layers. 
* The network was designed for classification with 1000 possible categories.

The birth of the modern convolutional neural network!

![](./img/net_alexnet.png)

* Won the ImageNet Large-Scale Visual Recognition Challenge (2012) by achieving a Top-5 error rate of 15.4% whereas the next best entry achieved an error of 26.2%.
* Trained on ImageNet data, which contained over 15 million annotated images from a total of over 22,000 categories.
* Used data augmentation techniques that consisted of image translations, horizontal reflections, and patch extractions.
* Implemented dropout layers in order to combat the problem of overfitting to the training data.
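Two of the augmentation techniques listed above, patch extraction and horizontal reflection, can be sketched in NumPy. This is an illustration of the idea, not the AlexNet pipeline: the function name `augment`, the 32x32 image, and the 28-pixel crop are assumptions chosen to keep the example small.

```python
import numpy as np

def augment(image, crop, rng):
    """Return a randomly placed crop of an H x W x C image, mirrored half the time."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]               # horizontal reflection
    return patch

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in image
patch = augment(image, crop=28, rng=rng)
print(patch.shape)
```

Calling `augment` repeatedly on the same image yields slightly different training examples that all share the original label.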

**This was the first time a model performed so well on a historically difficult ImageNet dataset. Utilizing techniques that are still used today, such as data augmentation and dropout, this paper really illustrated the benefits of CNNs and backed them up with record breaking performance in the competition.**

## Visualizing and Understanding Convolutional Networks (ZF Net)

The architecture was more of a fine tuning to the previous AlexNet structure, but still developed some very key ideas about improving performance.

**The paper provided great intuition about the workings of CNNs and illustrated more ways to improve performance. The visualization approach described not only helps to explain the inner workings of CNNs, but also provides insight for improvements to network architectures.**

![](./img/net_zfnet.png)

* Won the ImageNet Large-Scale Visual Recognition Challenge (2013) by achieving an error rate of 11.2%.
* Instead of using 11x11 sized filters in the first layer, ZF Net used filters of size 7x7 and a decreased stride value. 
* Developed a visualization technique named Deconvolutional Network, which helps to examine different feature activations and their relation to the input space.


## Very Deep Convolutional Networks For Large-scale Image Recognition (VGGNet)

The paper reinforced the notion that convolutional neural networks have to have a deep network of layers in order for this hierarchical representation of visual data to work.

The paper created a 19 layer CNN that strictly used 3x3 filters with stride and pad of 1, along with 2x2 maxpooling layers with stride 2.

* Using simplicity and depth, achieved an error rate of 7.3%
* Used 3x3 sized filters because the combination of two 3x3 conv layers has an effective receptive field of 5x5, which simulates a larger filter while keeping the benefits of smaller filter sizes

* Worked well on both image classification and localization tasks
* Used scale jittering as one data augmentation technique during training
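The 3x3 argument above can be checked with a little arithmetic. The sketch assumes stride-1 layers and counts weights for a single input/output channel pair, ignoring biases; the function name is made up for the example.

```python
def stacked_receptive_field(n_layers, field=3):
    """Effective receptive field of n stacked stride-1 conv layers."""
    size = 1
    for _ in range(n_layers):
        size += field - 1      # each layer widens the view by (field - 1)
    return size

two_layers = stacked_receptive_field(2)
three_layers = stacked_receptive_field(3)
print(two_layers, three_layers)

# Weights per input/output channel pair, ignoring biases.
two_3x3 = 2 * 3 * 3
one_5x5 = 5 * 5
print(two_3x3, one_5x5)
```

Two stacked 3x3 layers see the same 5x5 region as one 5x5 layer while using fewer weights and adding an extra nonlinearity in between.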

## Going Deeper with Convolutions (GoogLeNet)

GoogLeNet is a 22 layer CNN. It was one of the first CNN architectures that really strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure by using an Inception Module.

GoogLeNet was one of the first models that introduced the idea that CNN layers didn’t always have to be stacked up sequentially. By coming up with the Inception module, the authors showed that a creative structuring of layers can lead to improved performance and computational efficiency.

![](./img/net_googlenet.png)

* Winner of ILSVRC 2014 with a top 5 error rate of 6.7%
* During testing, multiple crops of the same image were created, fed into the network, and the softmax probabilities were averaged to give the final solution (a useful technique)
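The test-time averaging trick is simple to sketch. The snippet below assumes the network has already produced raw class scores for each crop; the score values and the helper name `predict_with_crops` are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict_with_crops(crop_scores):
    """Average per-crop softmax probabilities; input is (n_crops, n_classes)."""
    return softmax(crop_scores).mean(axis=0)

# Hypothetical raw scores for 3 crops of one image over 4 classes.
crop_scores = np.array([[2.0, 0.5, 0.1, 0.0],
                        [1.5, 1.6, 0.2, 0.0],
                        [2.2, 0.3, 0.4, 0.1]])
avg = predict_with_crops(crop_scores)
print(avg, avg.argmax())
```

Here the second crop on its own slightly favours class 1, but the averaged prediction settles on class 0.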

## Deep Residual Learning for Image Recognition  (ResNet)

ResNet is a 152 layer network architecture that set new records in classification, detection, and localization through an incredible architecture.

* Winner of ILSVRC 2015 with an error rate of 3.6%.
* Extremely Deep – 152 layers.

The ResNet model is one of the best CNN architectures currently available and is a great innovation for the idea of residual learning.

## Future Directions

* Model Compression

* Spatial Transformer Networks

* Generative Adversarial Networks

<center><font size="12"> Thank You </font> </center>