# From FNNs to CNNs

<img src="pics/fnn_jf.png" width=550>

(*Slide by J.Frellsen*)

# So we just talked about word embeddings

i.e., a word embedding is a **dense continuous** representation of a word

Typically, when talking about word embeddings, we think of a **matrix E** of size $|V| \times d$: one $d$-dimensional row for each of the $|V|$ words in the vocabulary.

### How can we represent a text in continuous dense space?

$$ w_1,..,w_n $$ ??

### Representing text in continuous dense space: The CBOW model

A simple classification model that uses embeddings as its representation is the CBOW (continuous bag-of-words) model: it uses the sum (or average) of the embeddings of the words in the sentence. The CBOW representation is fed into a fully connected network. It often works surprisingly well.

$$ \mbox{CBOW}(w_1,..,w_n) = \sum_{i=1}^n E[w_i] $$
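
The CBOW representation can be sketched in a few lines of NumPy. This is only an illustration: the toy vocabulary, the random embedding matrix `E`, and the helper name `cbow` are assumptions made for the example, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding matrix E of shape (|V|, d)
vocab = {"the": 0, "movie": 1, "was": 2, "not": 3, "good": 4}
d = 4
E = rng.normal(size=(len(vocab), d))

def cbow(words, average=False):
    """Sum (or average) the embedding rows of the given words."""
    ids = [vocab[w] for w in words]
    vectors = E[ids]                      # shape: (n, d)
    return vectors.mean(axis=0) if average else vectors.sum(axis=0)

sentence = ["the", "movie", "was", "good"]
rep = cbow(sentence)
print(rep.shape)  # (4,)
```

Whatever the sentence length, the result is a single $d$-dimensional vector that can be passed to a fully connected network.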



## So far so good, but wait a minute

What is a fundamental downside of the CBOW model?

# Convolutional Neural Networks (CNNs)

* Convolutional neural networks (CNNs or convnets) are a specialized kind of neural network **for processing data that has a known, grid-like topology** [[1](http://www.deeplearningbook.org/contents/convnets.html)].
* A method that evolved from **computer vision (CV)** (LeCun & Bengio, 1995)
* E.g., image classification, caption generation, photo tagging, self-driving cars

A convolutional neural network is designed
to identify indicative local predictors in a large structure, and combine them to produce a
fixed size vector representation of the structure (Goldberg, 2015).

## What are convnets / CNNs?

* CNNs use convolutions over the input to compute the output
* Each layer applies *different filters* (often several hundreds or thousands) and combines their results
* combining the results of the convolutions is often done by **pooling**

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM.png">

## Example of a 2d convolution

* input (e.g., image)
* convolution: kernel/filter (of size 3x3)
<img src="http://deeplearning.stanford.edu/wiki/images/6/6c/Convolution_schematic.gif">

Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

* **Image data**: 2-dimensional (matrix/grid) 
* **Text data**: 1-d (sequence)


* CV intuition - invariance in data:
  * we want to find an object regardless of its position in the image


<img src="pics/cnn_eq_jf.png" width=550 alt="Slide by J.Frellsen">

### What are convolutions?

* a *convolution* is an operation on two functions: one is the **input**, the other is a **kernel** that acts like a **filter** on the input, producing an output
* we slide the *kernel* over the input; with an averaging kernel, for example, it computes a windowed average of the input vector


* in simple terms: a grid that goes over the input
* **filter**: a function that helps "identifying indicative local predictors" (Goldberg, 2015)


## What is a filter? (kernel)

<img src="pics/cnn_filters_jf.png">

### What is a convolution? Mathematical view

Convolution is an important operation in signal and image processing;
it operates on either images (2D) or texts (1D).

A convolution takes two signals. Think of one as the
**input signal**. The other, the **kernel**, **acts as a filter** on the input, producing
an **output** [[2](http://www.cs.cornell.edu/courses/cs1114/2013sp/sections/S06_convolution.pdf), [3](https://www.inf.ed.ac.uk/teaching/courses/nlu/lectures/nlu_l15_convolution-2x2.pdf)].


#### Definition

Imagine a 1d (image) input vector

* $f$ is our input vector of length $n$
* $g$ is our kernel (filter) of length $m$

The convolution $f*g$ of $f$ and $g$ is defined as:
$$(f * g)(i) = \sum_{j=1}^m g(j)\cdot f(i-j+m/2)$$

* Think of this as sliding the kernel over the input image
* For each position of the kernel, we multiply the overlapping values of the kernel and image together and add up the results, to produce the output

#### Example

Let's look at a simple example. Suppose our input 1d image $f$ is:

| 10 | 50 | 60 | 10 | 20 | 40 | 30 |
|--|--|--|--|--|--|--|

and our kernel $g$ is:

| 1/3 | 1/3 | 1/3 |
|--|--|--|

Let's assume we want to compute the value of $h(3)$, where $h = f * g$ (i.e., $i = 3$). To compute this, we slide the kernel so that it is centered around $f(3)$:

| 10 | 50 | 60 | 10 | 20 | 40 | 30 |
|--|--|--|--|--|--|--|
|  | 1/3 | 1/3 | 1/3 | | | |

To compute this, we will assume that the value of the kernel is 0 everywhere outside the boundary, and then we can compute the weighted sum (dot product):

| 10 | 50 | 60 | 10 | 20| 40 | 30 |
|--|--|--|--|--|--|--|
| 0 | 1/3 | 1/3 | 1/3 |0 | 0 | 0 | 




That is, 

$50 \cdot \frac{1}{3} + 60 \cdot \frac{1}{3} + 10 \cdot \frac{1}{3} = 40$

Thus $h(3) = 40$.

In [1]:
##### Example in code
import numpy as np

f = np.array([10,50,60,10,20,40,30])
g = np.array([1/3,1/3,1/3])

window = f[1:4]
print(window)
print(g)
np.dot(window,g)

[50 60 10]
[0.33333333 0.33333333 0.33333333]


40.0

What is this kernel doing?

Computing the moving average of the image, i.e., replacing each entry with the average of the entry and its left and right neighbor.
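
The moving average can also be computed for every position at once. The following is a minimal sketch, assuming NumPy's `np.convolve` with `mode="same"`, which zero-pads the edges so that the output has the same length as the input. Note that `np.convolve` flips the kernel before sliding it, which makes no difference here because $g$ is symmetric.

```python
import numpy as np

f = np.array([10, 50, 60, 10, 20, 40, 30])
g = np.array([1/3, 1/3, 1/3])

# mode="same": zero-pad f so that the output has the same length as f
h = np.convolve(f, g, mode="same")
print(np.round(h, 2))
```

The entry `h[2]` (0-based indexing) corresponds to $h(3)$ from the example and is (up to floating-point rounding) 40.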


### Convolutions for text

* CNNs were introduced in NLP by Collobert et al. (2011) and later by Kim (2014) and Kalchbrenner et al. (2014)
* the intention is to let the network focus on the most important "features" in the sentence, regardless of their location

The main idea behind a convolution and pooling architecture for language tasks is to apply
a non-linear (learned) function over each instantiation of a $k$-word sliding window over
the sentence.

Intuitively, each filter acts as a detector for "soft" n-grams.

<img src="pics/cnn-goldberg.png">
Illustration from Goldberg (2015) chapter 9.

    
* **convolution**: a $k$-word sliding window is the input to a function (**filter**) that transforms the window of $k$ words into a $d$-dimensional vector (where each dimension is called a **channel**)
* **pooling**: then, a pooling operation combines the vectors from the different windows into a single $d$-dimensional vector by taking the **max** (max pooling) or **average** (average pooling) value observed in each of the channels

 The resulting vector is a representation for the entire sentence in which each dimension represents the most salient features for some prediction task.
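
The two steps can be sketched in plain NumPy. This is a toy forward pass under stated assumptions: the sizes are made up, the filter weights `W` are random stand-ins for parameters that would be learned, and ReLU is chosen as the non-linearity.

```python
import numpy as np

rng = np.random.default_rng(0)

n, emb_dim = 7, 5      # sentence length and embedding size (toy values)
k, d = 3, 4            # window size and number of filters/channels

X = rng.normal(size=(n, emb_dim))        # embedded sentence, one row per word
W = rng.normal(size=(k * emb_dim, d))    # filter weights (learned in practice)
b = np.zeros(d)

# Convolution: concatenate each k-word window, apply the filter and a ReLU
windows = np.stack([X[i:i + k].reshape(-1) for i in range(n - k + 1)])
P = np.maximum(0, windows @ W + b)       # shape: (n - k + 1, d)

# Pooling: take the max over all windows, separately for each channel
c = P.max(axis=0)                        # shape: (d,)
print(P.shape, c.shape)  # (5, 4) (4,)
```

The pooled vector `c` has a fixed size `d` regardless of the sentence length `n`.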

In more detail, including mathematical formulation:

<img src="pics/cnn-illustration.png" width=600>

The gradients that are propagated
back from the network’s loss during the training process are used to tune the parameters
of the filter function to highlight the aspects of the data that are important for the task
the network is trained for. Intuitively, when the sliding window is run over a sequence, the
filter function learns to identify informative k-grams. (Goldberg, 2015)

We can also do different convolutions on different parts of the sentence/document (see section 9.2, Goldberg).

### CNN hyperparameters

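* How would you apply the filter to the first element of a matrix that doesn’t have any neighboring elements to the top and left?


* **zero-padding**: all elements that fall outside of the matrix are taken to be zero.
* **wide convolution vs narrow convolution**: a narrow convolution uses no padding, so the output is shorter than the input; a wide convolution zero-pads the input so that the filter can also be applied at the edges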
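
A small sketch of how padding changes the output length, assuming NumPy's `np.convolve`: its `valid` mode uses no padding (narrow convolution), while its `full` mode corresponds to zero-padding with $k-1$ elements on both sides (wide convolution).

```python
import numpy as np

f = np.array([10, 50, 60, 10, 20, 40, 30])   # input of length n = 7
g = np.array([1/3, 1/3, 1/3])                # kernel of length k = 3

narrow = np.convolve(f, g, mode="valid")  # no padding: n - k + 1 outputs
wide = np.convolve(f, g, mode="full")     # zero-padded: n + k - 1 outputs

print(len(narrow), len(wide))  # 5 9
```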

### CNN hyperparameters

* **stride** size: how much (how many 'pixels') you shift your filter at each step
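* If stride size is 1, consecutive applications of the filter overlap (for any filter larger than one element); larger strides reduce the overlap and yield a smaller output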
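
The effect of the stride can be illustrated without any neural network library. The helper `strided_windows` below is a hypothetical name introduced for this sketch; it simply lists the windows a filter of size `k` would visit.

```python
def strided_windows(seq, k, stride):
    """Return the windows of size k visited when shifting by `stride` each step."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

words = ["the", "movie", "was", "not", "very", "good", "at", "all"]

w1 = strided_windows(words, k=3, stride=1)
w3 = strided_windows(words, k=3, stride=3)
print(len(w1))  # 6 windows, neighbouring windows share two words
print(len(w3))  # 2 windows, no overlap
```

Without padding, the number of windows is $\lfloor (n - k) / s \rfloor + 1$ for input length $n$, filter size $k$ and stride $s$.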

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-10.18.08-AM.png">

## Pooling

* Max pooling: “Did you see this feature anywhere in the
range?” (most common)
* Average pooling: “How prevalent is this feature over the
entire range”
* k-Max pooling: “Did you see this feature up to k times?” 
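
The three pooling variants can be sketched on the activations of a single filter. The activation values are made up for illustration, and the k-max variant shown here keeps the $k$ largest values in their original order, which is one common formulation.

```python
import numpy as np

# Hypothetical activations of one filter (channel) at 6 window positions
a = np.array([0.1, 0.9, 0.0, 0.4, 0.7, 0.2])

max_pooled = a.max()     # strongest activation anywhere in the range
avg_pooled = a.mean()    # average activation over the entire range

# k-max pooling: keep the k largest values, preserving their original order
k = 2
top_idx = np.sort(np.argsort(a)[-k:])
kmax_pooled = a[top_idx]

print(max_pooled, avg_pooled, kmax_pooled)
```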

<img src="pics/poolings.jpg" width=550>

### Stride 2

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-2.18.38-PM.png">
Src: http://cs231n.github.io/convolutional-networks/#pool


### Kim (2014)

* apply several convolutional layers in parallel: multi-channel method
* each filter comes with its own set of parameters
<img src="pics/kim2014.png">

We first embed words into the embedding space. The next layer performs convolutions over the embedded word vectors using multiple filter sizes, for example sliding over 3, 4 or 5 words at a time. Next, we max-pool the result of the convolutional layer into a long feature vector, add dropout regularization, and classify the result using a softmax layer.
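
The convolution and pooling part of this architecture can be sketched as a NumPy forward pass. This is a simplified illustration, not Kim's implementation: the sizes are toy values, the weights are random stand-ins for learned parameters, and dropout and the softmax layer are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

n, emb_dim = 10, 8                    # toy sentence length and embedding size
num_filters = 4                       # filters per window size
X = rng.normal(size=(n, emb_dim))     # embedded sentence, one row per word

pooled = []
for k in (3, 4, 5):                   # one branch per filter size
    W = rng.normal(size=(k * emb_dim, num_filters))   # stand-in for learned weights
    windows = np.stack([X[i:i + k].reshape(-1) for i in range(n - k + 1)])
    feature_maps = np.maximum(0, windows @ W)         # (n - k + 1, num_filters)
    pooled.append(feature_maps.max(axis=0))           # max over time: (num_filters,)

features = np.concatenate(pooled)     # one long feature vector
print(features.shape)  # (12,)
```

Each filter size contributes `num_filters` pooled values, so the concatenated vector has `3 * num_filters` entries.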

## Example:

A CNN with different branches

* CNN-rand: all word vectors are randomly initialized and then modified during training
* CNN-static: pre-trained word vectors; all word vectors (including those of unknown words, which are randomly initialized) are kept static and only the other parameters of the model are learned
* CNN-non-static: same as CNN-static but word vectors are fine-tuned

<img src="pics/cnn-branches.svg">

### How to implement a CNN in Keras 

In [3]:
### in Keras
from keras.models import Sequential
from keras.layers import Embedding, Dense, Activation
from keras.layers import Conv1D, GlobalMaxPooling1D, Dropout

model = Sequential()
model.add(Embedding(output_dim=128, input_dim=10000, input_length=50))

num_filters = 250
conv_length = 3  # filter size (number of words we want our convolutional layer to cover)
# we will have num_filters filters in total, each spanning conv_length words
hidden_dims = 250

# we add a Conv1D layer, which will learn num_filters
# word group filters of size conv_length:
model.add(Conv1D(filters=num_filters,  # Number of convolution kernels to use (dimensionality of the output).
                 kernel_size=conv_length, #  The extension (spatial or temporal) of each filter.
                 padding='valid',  #valid: don't go off edge; same: use padding before applying filter
                 activation='relu',
                 strides=1))

# max pooling
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 50, 128)           1280000   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 48, 250)           96250     
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 250)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_3 (Dropout)          (None, 250)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 10)                2510      
Total params: 1,441,510
Trainable params: 1,441,510
Non-trainable params: 0
_________________________________________________________________
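
The numbers in the summary above can be reproduced by hand. The following arithmetic sketch uses the standard parameter counts (weights plus one bias per output unit or filter) and the hyperparameters from the model definition.

```python
vocab_size, emb_dim, seq_len = 10000, 128, 50
num_filters, kernel_size = 250, 3
hidden_dims, num_classes = 250, 10

embedding = vocab_size * emb_dim                             # one vector per word
conv = kernel_size * emb_dim * num_filters + num_filters     # weights + biases
dense_hidden = num_filters * hidden_dims + hidden_dims
dense_out = hidden_dims * num_classes + num_classes

# padding='valid' with stride 1 gives seq_len - kernel_size + 1 window positions
conv_out_len = seq_len - kernel_size + 1

total = embedding + conv + dense_hidden + dense_out
print(embedding, conv, dense_hidden, dense_out, total, conv_out_len)
```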


### Back to our sentiment example

In [5]:
# load data - convert to indices, pad to max_length - no one-hot encoding of the y's needed as this is a binary task
import numpy as np
import random
from collections import defaultdict
from sklearn.model_selection import train_test_split

positive_sentences = [l.strip() for l in open("data/rt-polarity.pos").readlines()]
negative_sentences = [l.strip() for l in open("data/rt-polarity.neg").readlines()]

positive_labels = [1 for sentence in positive_sentences]
negative_labels = [0 for sentence in negative_sentences]

sentences = np.concatenate([positive_sentences,negative_sentences], axis=0)
labels = np.concatenate([positive_labels,negative_labels],axis=0)

## make sure we have a label for every data instance
assert(len(sentences)==len(labels))
data={}
np.random.seed(113) #seed
data['target']= np.random.permutation(labels)
np.random.seed(113) # use same seed!
data['data'] = np.random.permutation(sentences)

X_rest, X_test, y_rest, y_test = train_test_split(data['data'], data['target'], test_size=0.2)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.2)
del X_rest, y_rest

## map them to ids for embedding layer
w2i = defaultdict(lambda: len(w2i))
PAD = w2i["<pad>"] # index 0 is padding
UNK = w2i["<unk>"] # index 1 is for UNK

# convert words to indices, taking care of UNKs
X_train_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_train]
w2i = defaultdict(lambda: UNK, w2i) # freeze - cute trick!
X_dev_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_dev]
X_test_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_test]

max_sentence_length=max([len(s.split(" ")) for s in X_train] 
                        + [len(s.split(" ")) for s in X_dev] 
                        + [len(s.split(" ")) for s in X_test] )

from keras.preprocessing import sequence

# pad X
X_train_pad = sequence.pad_sequences(X_train_num, maxlen=max_sentence_length, value=PAD)
X_dev_pad = sequence.pad_sequences(X_dev_num, maxlen=max_sentence_length, value=PAD)
X_test_pad = sequence.pad_sequences(X_test_num, maxlen=max_sentence_length,value=PAD)
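
The `defaultdict` trick used above for the vocabulary can be seen in isolation in this small sketch with made-up sentences.

```python
from collections import defaultdict

w2i = defaultdict(lambda: len(w2i))   # unseen words get the next free index
PAD = w2i["<pad>"]                    # index 0
UNK = w2i["<unk>"]                    # index 1

train_ids = [w2i[w] for w in "a great movie".split(" ")]

w2i = defaultdict(lambda: UNK, w2i)   # freeze: unseen words now map to UNK
test_ids = [w2i[w] for w in "a terrible movie".split(" ")]

print(train_ids)  # [2, 3, 4]
print(test_ids)   # [2, 1, 4]
```

One thing to be aware of: looking up an unseen word in a `defaultdict` also stores it as a key, so `len(w2i)` can still grow after freezing even though no new indices are handed out.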


In [6]:
print("#train instances: {} #dev: {} #test: {}".format(len(X_train),len(X_dev),len(X_test)))

vocabulary_size = len(w2i)
embeds_size=64

#train instances: 6823 #dev: 1706 #test: 2133


In [7]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D


model = Sequential()
model.add(Embedding(vocabulary_size, embeds_size, input_length=max_sentence_length))

# A simple CNN with a single filter size (conv_length words per window)

num_filters = 250
conv_length = 4
hidden_dims = 250

# we add a Conv1D layer, which will learn num_filters
# word group filters of size conv_length:
model.add(Conv1D(filters=num_filters,  # Number of convolution kernels to use (dimensionality of the output).
                 kernel_size=conv_length, #  The extension (spatial or temporal) of each filter.
                 padding='valid',  #valid: don't go off edge; same: use padding before applying filter
                 activation='relu',
                 strides=1))

# max pooling
model.add(GlobalMaxPooling1D())


model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [8]:
model.fit(X_train_pad, y_train, epochs=4, batch_size=50)
loss, accuracy = model.evaluate(X_dev_pad, y_dev)

print("Accuracy: ", accuracy *100)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Accuracy:  76.26025790625971


## References

* [Goldberg's primer chapter 9](https://arxiv.org/abs/1510.00726)
* [WildML: CNNs for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/#more-348)
* [Cornell course notes](http://www.cs.cornell.edu/courses/cs1114/2013sp/sections/S06_convolution.pdf)
* [David's blogpost](http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/)