# Convolutional Neural Networks


## From RNN to CNN

Recurrent neural networks cannot capture a phrase without its prefix context, and the final hidden vector often captures too much of the last words. What if we just want to classify a simple phrase?

> Example: the country of my birth

We can slide a window over the sentence to extract n-grams and perform convolution on them. With a window of size 2, we get `[the, country], [country, of], [of, my], [my, birth]`. We can also increase the window size to `3`, `4`, `5`, `...`, `N`. 1D convolution is described by the following equation.

$$
(f * g)[n] = \sum_{m = -M}^{M} f[n-m]g[m]
$$
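
As a minimal sketch of the two ideas above, the following pure-Python snippet extracts the n-gram windows and evaluates the convolution sum with explicit loops. The helper names `ngram_windows` and `conv1d` are hypothetical, and the sketch assumes `g` is indexed from `0` (rather than from `-M` to `M`) and that `f` is zero outside its range.

```python
def ngram_windows(tokens, h):
    """Return every contiguous window of h tokens."""
    return [tokens[i:i + h] for i in range(len(tokens) - h + 1)]


def conv1d(f, g):
    """Discrete 1D convolution: out[n] = sum over m of f[n - m] * g[m].

    Assumption: g is indexed from 0 and f is treated as zero outside its range.
    """
    out = [0.0] * (len(f) + len(g) - 1)
    for n in range(len(out)):
        for m in range(len(g)):
            if 0 <= n - m < len(f):
                out[n] += f[n - m] * g[m]
    return out


tokens = "the country of my birth".split()
bigrams = ngram_windows(tokens, 2)   # 4 windows of 2 words
trigrams = ngram_windows(tokens, 3)  # 3 windows of 3 words
smoothed = conv1d([1.0, 2.0, 3.0], [0.5, 0.5])
print(bigrams)
print(smoothed)  # [0.5, 1.5, 2.5, 1.5]
```

A sentence of `n` tokens yields `n - h + 1` windows of size `h`, which is why the bigram list has four entries and the trigram list has three.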

## Single Layer CNN

This architecture was introduced by Kim (2014) in the paper *Convolutional Neural Networks for Sentence Classification*. It is a simple variant that uses one convolutional layer and one pooling layer to perform sentence classification.

### Convolutional Layer

The inputs are word vectors of a sentence.

$$
x_{i} \in \mathbb{R}^{k}
$$

The sentence will be a concatenation of all the word vectors.

$$
x_{1:n} = x_{1} \oplus x_{2} \oplus \cdots \oplus x_{n}
$$

The convolution filter is denoted `w`.

$$
w \in \mathbb{R}^{hk}
$$

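Each window has `h` words, and each word has `k` features. Computing the convolution is simply loops and summations: for each window position `i`, take the dot product of `w` with the concatenated window, add a bias `b`, and apply a nonlinearity `f`.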

$$
c_{i} = f\left(w^{T}x_{i:i+h-1} + b\right)
$$

The result is a feature map.

$$
\vec{c} = [c_{1}, c_{2}, ..., c_{n-h+1}] \in \mathbb{R}^{n-h+1}
$$
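
The convolution above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the function names `relu` and `feature_map` are hypothetical, the word vectors and filter weights are made-up toy values, and `f` is taken to be ReLU.

```python
def relu(a):
    """Nonlinearity f (assumed to be ReLU for this sketch)."""
    return max(0.0, a)


def feature_map(x, w, b, h, f=relu):
    """Compute c_i = f(w^T x_{i:i+h-1} + b) for every window of h words.

    x is a list of n word vectors (each of length k), w is a flat list of
    length h * k, and b is a scalar bias.
    """
    c = []
    for i in range(len(x) - h + 1):
        # Concatenate the h word vectors in the window into one vector of length h * k.
        window = [value for word in x[i:i + h] for value in word]
        c.append(f(sum(wj * xj for wj, xj in zip(w, window)) + b))
    return c


# Toy sentence: n = 5 words, k = 2 features per word (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
w = [1.0, -1.0, 0.5, 0.5]  # one filter with h = 2, so h * k = 4 weights
c = feature_map(x, w, b=0.0, h=2)
print(c)       # [1.5, 0.0, 0.0, 0.5]
print(len(c))  # n - h + 1 = 4
```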

The problem is that the length of the feature map, `n - h + 1`, varies with the length of the input sentence. A CNN architecture needs to address this before a fixed-size layer can follow.

### Pooling Layer

Instead of sending the whole feature map to the next layer, we can use a pooling layer to select only the most activated feature. In this case, we pool a single number from the whole feature vector (max-over-time pooling). This is different from what we do in CS231n, where pooling is applied over small local regions.

$$
\hat{c} = \max\{\vec{c}\}
$$

However, this will only give us one feature per filter. The next step is to introduce more filters, which may have different window sizes `h`. We perform convolution and max pooling on the input sentence once per filter, and we end up with multiple features.
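
To illustrate why max pooling removes the dependence on sentence length, here is a small sketch that applies several filters and keeps one number per filter. The names `feature_map` and `pooled_features` are hypothetical, the inputs are toy values, ReLU is assumed as the nonlinearity, and every sentence is assumed to have at least as many words as the largest window size.

```python
def feature_map(x, w, b, h):
    """c_i = ReLU(w^T x_{i:i+h-1} + b) for each window of h word vectors."""
    c = []
    for i in range(len(x) - h + 1):
        window = [value for word in x[i:i + h] for value in word]
        c.append(max(0.0, sum(wj * xj for wj, xj in zip(w, window)) + b))
    return c


def pooled_features(x, filters):
    """Apply each (w, b, h) filter, then keep only the maximum of its feature map."""
    return [max(feature_map(x, w, b, h)) for (w, b, h) in filters]


# Two toy filters with different window sizes (k = 2 features per word).
filters = [
    ([1.0, -1.0, 0.5, 0.5], 0.0, 2),            # h = 2
    ([1.0, 0.0, 0.0, 1.0, -1.0, 0.0], 0.0, 3),  # h = 3
]

long_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
short_sentence = long_sentence[:3]

z_long = pooled_features(long_sentence, filters)
z_short = pooled_features(short_sentence, filters)
print(z_long)   # [1.5, 2.0]
print(z_short)  # [1.5, 1.0]
```

Both sentences produce a vector with one entry per filter, even though their feature maps have different lengths.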


### Output Layer

The final feature vector is a concatenation of the max-pooling results from all `m` filters.

$$
z = [\hat{c}_{1}, \hat{c}_{2}, \dots, \hat{c}_{m}]
$$

We then feed it to a softmax layer.

$$
y = \text{softmax}\left(W^{S}z + b\right)
$$
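
The output layer can be sketched as a matrix-vector product followed by a softmax. The function names `softmax` and `output_layer`, the weight values, and the two-class setup below are all assumptions made for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(z, W, b):
    """y = softmax(W z + b), with W given as a list of rows (one row per class)."""
    scores = [sum(wj * zj for wj, zj in zip(row, z)) + bi for row, bi in zip(W, b)]
    return softmax(scores)


z = [1.5, 2.0]                # pooled feature vector with m = 2 features (toy values)
W = [[1.0, 0.0], [0.0, 1.0]]  # two classes, made-up weights
b = [0.0, 0.0]
y = output_layer(z, W, b)
predicted_class = max(range(len(y)), key=lambda i: y[i])
print(y, predicted_class)
```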

### Architecture

Here is what it looks like:

![13_conv_net_nlp](./assets/13_conv_net_nlp.png)

### Hyperparameters in Kim (2014)

* Nonlinearity uses ReLU
* Window filter sizes `h = 3, 4, 5`
* Each filter size has 100 feature maps
* Dropout is 0.5
* L2 norm constraint `s` on the rows of the softmax weight matrix, `s = 3`
* Mini-batch size for SGD training is 50
* Word vectors are pretrained with word2vec, `k = 300`
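
As a quick sanity check on what these settings imply, the sketch below collects them in a dictionary and derives two sizes from the formulas above. The dictionary keys are hypothetical names, and the parameter count assumes each filter has `h * k` weights plus one scalar bias, as in the convolution equation.

```python
kim_2014 = {
    "nonlinearity": "relu",
    "filter_sizes": (3, 4, 5),
    "feature_maps_per_size": 100,
    "dropout": 0.5,
    "l2_constraint": 3,
    "batch_size": 50,
    "embedding_dim": 300,
}

sizes = kim_2014["filter_sizes"]
maps_per_size = kim_2014["feature_maps_per_size"]
k = kim_2014["embedding_dim"]

# Max pooling keeps one number per feature map, so z has one entry per filter.
z_dim = len(sizes) * maps_per_size

# Each filter has h * k weights plus one bias (assumption stated above).
conv_params = sum(maps_per_size * (h * k + 1) for h in sizes)

print(z_dim)        # 300
print(conv_params)  # 360300
```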
