# Convolutional Neural Networks (CNNs)


* Convolution-and-pooling architectures (LeCun & Bengio, 1995) evolved in the neural-network computer vision community
* Computer vision (CV) applications: image classification, caption generation, photo tagging, self-driving cars

* invariance in the data:
  * an image may show a kitten in different positions
  * we want to find the object regardless of its position in the image


Convolutional neural networks (CNNs or convnets) are a specialized kind of neural network **for processing data that has a known, grid-like topology** [[1](http://www.deeplearningbook.org/contents/convnets.html)].

* **Image data**: can be thought of as a 2-dimensional grid of pixels
* **Text data**: can be thought of as a 1-dimensional grid (a sequence), like time-series data

## CNNs (or 'convnets')

* are neural networks that can work on variable-length inputs
* the name **convolution** comes from the fact that the model uses a mathematical operation called *convolution*

* in fact, the two basic operations of a CNN are **convolution** and **pooling** 


### What are convolutions?

* "identifying indicative local predictors" (Goldberg, 2015)
* a small grid (the kernel) that moves over the input
* a *convolution* is an operation on two functions: one is the **input**, the other is a **kernel** that acts like a **filter** on the input to produce an output
* we slide the *kernel* over the input; at each position it computes a weighted sum of the input values under the window (a windowed average when all weights are equal)
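
A minimal 1-D sketch in plain Python of the sliding-window computation described above; the function names are our own, not from any library. One subtlety: the mathematical convolution flips the kernel before sliding it, while most deep-learning libraries compute the unflipped version (cross-correlation) and still call it convolution. Since CNN kernels are learned, the difference does not matter in practice, but the asymmetric kernel below makes it visible.

```python
def cross_correlate_1d(signal, kernel):
    """Slide `kernel` over `signal` (no padding, stride 1) and return the
    weighted sum at each position: what CNN libraries call a convolution."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def convolve_1d(signal, kernel):
    """Mathematical convolution (valid part only): flip the kernel, then slide."""
    return cross_correlate_1d(signal, kernel[::-1])


signal = [1, 2, 3, 4, 5]
# Equal weights of 1/3: each output is the average of a 3-value window (about 2, 3, 4).
print(cross_correlate_1d(signal, [1 / 3, 1 / 3, 1 / 3]))
# An asymmetric kernel shows the effect of flipping.
print(cross_correlate_1d(signal, [1, 0, -1]))  # [-2, -2, -2]
print(convolve_1d(signal, [1, 0, -1]))         # [2, 2, 2]
```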

#### Example of a 2d convolution:

An image is a 2d input, the following illustrates a convolution over this 2d input:


<img src="http://neuralnetworksanddeeplearning.com/images/tikz44.png">

<img src="http://neuralnetworksanddeeplearning.com/images/tikz45.png">

Here is an animation of a convolution with a 3×3 filter (source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution):
<img src="http://deeplearning.stanford.edu/wiki/images/6/6c/Convolution_schematic.gif">

Source: [[2](http://www.cs.cornell.edu/courses/cs1114/2013sp/sections/S06_convolution.pdf)]

* Our images and kernels are 2D functions (aka matrices).
* We slide the kernel over each pixel of the image, multiply the
corresponding entries of the input and kernel, and add them up. If all $n$ entries of the kernel equal $1/n$, this sum is simply the average of the pixels under the window (a blur).
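
The multiply-and-add procedure above can be sketched in plain Python as follows. This is an illustrative implementation, not library code: it computes only the "valid" positions (the kernel never goes off the edge), supports an optional stride, and does not flip the kernel. The kernel chosen here is symmetric under a 180° rotation, so flipping would give the same result; the image and kernel values are made up for illustration.

```python
def conv2d(image, kernel, stride=1):
    """Valid 2-D convolution as computed in CNNs: slide `kernel` over `image`,
    multiply the overlapping entries and add them up."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for r in range(0, ih - kh + 1, stride):
        row = []
        for c in range(0, iw - kw + 1, stride):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
for row in conv2d(image, kernel):
    print(row)  # [4, 3, 4], then [2, 4, 3], then [2, 3, 4]
```

A 3×3 kernel on a 5×5 image yields a 3×3 output; with `stride=2` the kernel only visits every other position, giving a 2×2 output.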

<img src="pics/cnn-cv1.png">

<img src="pics/cnn-cv2.png">

<img src="pics/cnn-cv3.png">

#### Terminology:

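* **convolution**: sliding a kernel over the input and computing a weighted sum at each position
* **filter** (kernel): the small matrix of weights that is slid over the input; in a CNN its values are learned
* **stride**: how many positions the filter (or pooling window) moves at each step
* **pooling**: summarizing each window of values by a single number, e.g. its maximum (max pooling) or average (average pooling)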
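
Filter size and stride together determine how large the output is. A small helper using the commonly used formula $\lfloor (n + 2p - k)/s \rfloor + 1$ for an input of size $n$, window size $k$, stride $s$ and zero padding $p$ on each side; padding is an extra term not covered by the list above, and the function name is our own.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension for a window of size k sliding over an
    input of size n, moving `stride` positions per step, with `padding` zeros
    added on each side. Positions where the window would not fit are dropped."""
    return (n + 2 * padding - k) // stride + 1


print(conv_output_size(5, 3))             # 3: 3x3 kernel on a 5x5 image
print(conv_output_size(5, 3, stride=2))   # 2: same kernel with stride 2
print(conv_output_size(5, 3, padding=1))  # 5: one pixel of zero padding keeps the size
print(conv_output_size(4, 2, stride=2))   # 2: 2x2 pooling window with stride 2
```

The same formula applies to each spatial dimension of a 2-D input and to pooling windows.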


### Pooling

Max pooling in a CNN. Source: http://cs231n.github.io/convolutional-networks/#pool
<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-2.18.38-PM.png" width=600>

Stride: 2 pixels (the pooling window jumps two pixels at each step).

### Convolutions for text

* CNNs were introduced in NLP by Collobert et al. (2011) and later by Kim (2014) and Kalchbrenner et al. (2014)
* the intention is to let the network focus on the most important "features" in the sentence, regardless of their location

The main idea behind a convolution and pooling architecture for language tasks is to apply
a non-linear (learned) function over each instantiation of a $k$-word sliding window over
the sentence.

<img src="pics/cnn-goldberg.png">
Illustration from Goldberg (2015) chapter 9.

    
* **convolution**: each $k$-word sliding window is the input to a function (the **filter**) that transforms the window of $k$ words into a $d$-dimensional vector (each dimension is called a **channel**)
* **pooling**: a pooling operation then combines the vectors from the different windows into a single $d$-dimensional vector by taking, for each channel, the **max** (max pooling) or the **average** (average pooling) of the values observed across windows

The resulting vector is a representation of the entire sentence in which each dimension captures the most salient features for some prediction task.
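
A plain-Python sketch of the two steps for text: each $k$-word window is the concatenation of its word vectors, a filter (a linear map `U` with bias `b`, followed by a ReLU) turns it into $d$ channel values, and max pooling keeps the largest value per channel. The names `U`, `b` and `text_conv_maxpool`, the choice of ReLU as the non-linearity, and the toy numbers are our own assumptions; in a real model `U` and `b` are learned.

```python
def text_conv_maxpool(embeddings, U, b, k):
    """embeddings: list of n word vectors (each of length emb_dim).
    U: (k * emb_dim) x d weight matrix as a list of rows; b: bias of length d.
    Returns the d-dim sentence vector after a ReLU filter and max pooling."""
    d = len(b)
    window_vectors = []
    for i in range(len(embeddings) - k + 1):
        # Concatenate the k word vectors of the window into one long vector.
        x = [v for word in embeddings[i:i + k] for v in word]
        # Filter: linear map to d channels followed by a ReLU non-linearity.
        p = [max(0.0, sum(x[r] * U[r][j] for r in range(len(x))) + b[j]) for j in range(d)]
        window_vectors.append(p)
    # Max pooling: for each channel, keep the largest value over all windows.
    return [max(p[j] for p in window_vectors) for j in range(d)]


# Toy input: 4 words with 2-dim embeddings, windows of k=2 words, d=2 channels.
sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]
U = [[1, 0],   # k * emb_dim = 4 rows, d = 2 columns
     [0, 1],
     [2, 0],
     [0, -1]]
b = [0, 0]
print(text_conv_maxpool(sentence, U, b, k=2))  # [2, 1]
```

However many words the sentence has, the pooled output always has $d$ values, which is what lets the rest of the network work on variable-length inputs.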

In more detail, including mathematical formulation:

<img src="pics/cnn-illustration.png">

The gradients that are propagated
back from the network’s loss during the training process are used to tune the parameters
of the filter function to highlight the aspects of the data that are important for the task
the network is trained for. Intuitively, when the sliding window is run over a sequence, the
filter function learns to identify informative $k$-grams (Goldberg, 2015).

We can also do different convolutions on different parts of the sentence/document (see section 9.2, Goldberg).

## What are convnets?

* several layers of convolutions (with activation functions) and pooling

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM.png">

### Kim (2014)

* apply several convolutional layers (with filters of different widths) in parallel
<img src="pics/kim2014.png">
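
A simplified plain-Python sketch of the parallel idea: several filters, here of widths 2 and 3, are applied to the same sentence, max pooling over time reduces each feature map to a single number, and the numbers are concatenated into one feature vector. This is not Kim's full model (no dropout, no multiple input channels); the function names, the ReLU, and the toy values are our own assumptions.

```python
def conv_feature_map(embeddings, kernel, bias=0.0):
    """One filter of width k = len(kernel), where each kernel row matches a word
    vector: returns one ReLU activation per k-word window."""
    k = len(kernel)
    if len(embeddings) < k:
        raise ValueError("sentence is shorter than the filter width; pad it first")
    return [
        max(0.0, bias + sum(kernel[j][e] * embeddings[i + j][e]
                            for j in range(k) for e in range(len(kernel[j]))))
        for i in range(len(embeddings) - k + 1)
    ]


def parallel_conv_features(embeddings, filters):
    """Apply all filters (possibly of different widths) in parallel, max-pool each
    feature map over time, and concatenate the results."""
    return [max(conv_feature_map(embeddings, f)) for f in filters]


sentence = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 0]]  # 5 words, 2-dim embeddings
filters = [
    [[1, 0], [0, 1]],          # width-2 filter
    [[1, 1], [0, 0], [1, 1]],  # width-3 filter
]
print(parallel_conv_features(sentence, filters))  # [2.0, 4.0]
```

The output has one entry per filter, independent of sentence length, and would typically be fed to a fully connected output layer.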

```python
# In Keras
from keras.models import Sequential
from keras.layers import Embedding, Dense, Dropout, Activation
from keras.layers import Conv1D, GlobalMaxPooling1D

model = Sequential()
# Sequences of 5 token ids from a 10,000-word vocabulary, each mapped to a 128-dim vector
model.add(Embedding(output_dim=128, input_dim=10000, input_length=5))

num_filters = 250
conv_length = 3
hidden_dims = 250

# We add a Conv1D layer, which will learn num_filters
# word-group filters of size conv_length:
model.add(Conv1D(filters=num_filters,      # number of convolution kernels (dimensionality of the output)
                 kernel_size=conv_length,  # length of each filter window (number of words)
                 padding='valid',          # 'valid': don't go off the edge; 'same': pad before applying the filter
                 activation='relu',
                 strides=1))

# Max pooling over the whole sequence: one value per filter
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
```
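
As a cross-check of what the model above computes, this plain-Python sketch traces each layer's output shape (excluding the batch axis) and trainable parameter count from the hyperparameters in the code, using the standard formulas for embedding, 1-D convolution and dense layers. It is meant to mirror what `model.summary()` reports, but it is our own calculation and does not call Keras.

```python
vocab_size, embed_dim, seq_len = 10000, 128, 5
num_filters, conv_length, hidden_dims = 250, 3, 250

# Embedding: one 128-dim vector per token; one weight vector per vocabulary entry.
emb_shape = (seq_len, embed_dim)
emb_params = vocab_size * embed_dim

# Conv1D with padding='valid', strides=1: one output step per complete 3-word window.
conv_steps = (seq_len - conv_length) // 1 + 1
conv_shape = (conv_steps, num_filters)
conv_params = conv_length * embed_dim * num_filters + num_filters  # weights + biases

# GlobalMaxPooling1D: max over the time axis leaves one value per filter (no weights).
pool_shape = (num_filters,)

# Dense: weights + biases. Dropout and Activation add no parameters.
dense_shape = (hidden_dims,)
dense_params = num_filters * hidden_dims + hidden_dims

layers = [
    ("Embedding", emb_shape, emb_params),
    ("Conv1D", conv_shape, conv_params),
    ("GlobalMaxPooling1D", pool_shape, 0),
    ("Dense", dense_shape, dense_params),
]
for name, shape, params in layers:
    print(f"{name:<20} {str(shape):<12} {params:>10,}")
print("total parameters:", sum(p for _, _, p in layers))  # 1439000
```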

## References

* [Goldberg's primer, chapter 9](https://arxiv.org/abs/1510.00726)
* [WildML: CNNs for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/#more-348)
* [Cornell course notes](http://www.cs.cornell.edu/courses/cs1114/2013sp/sections/S06_convolution.pdf)