# Deep Learning for Text 2  
&nbsp;


Ayoub Bagheri, <a.bagheri@uu.nl>  


<img src="img/uu_logo.png" style="float: right;" width="100" height="100">


## Lecture’s Plan
&nbsp;

1. CNN
2. Encoder-Decoder
3. Attention
4. Transformers


# CNN for Text

## Convolutional Neural Network (CNN)
&nbsp;

- Convolutional Neural Networks, or Convolutional Networks, or CNNs, or ConvNets
- For processing data with a **grid-like** or array topology
    - 1-D grid: time-series data, sensor signal data
    - 2-D grid: image data
    - 3-D grid: video data

- CNNs include four key ideas related to natural signals: 
    - **Local connections**
    - **Shared weights**
    - **Pooling**
    - **Use of many layers**

## CNN Architecture
&nbsp;

- Intuition: Neural network with specialized connectivity structure
    - Stacking multiple layers of feature extractors: low-level layers extract local features, and high-level layers learn global patterns.
- There are a few distinct types of layers:
    - **Convolutional Layer**: detecting local features through filters (discrete convolution)
    - **Non-linear Layer**: applying a non-linear activation, typically the Rectified Linear Unit (ReLU)
    - **Pooling Layer**: merging similar features


## Building-blocks for CNNs
&nbsp;

<img src="img/page 6.png">

## (1) Convolutional Layer
&nbsp;

- The core layer of CNNs
- Convolutional layer consists of a set of filters, $W_{kl}$
- Each filter covers a spatially small portion of the input data, $Z_{i,j}$
- Each filter is convolved across the dimensions of the input data, producing a multidimensional **feature map**.
- As we convolve the filter, we are computing the dot product between the parameters of the filter and the input.
- **Deep Learning algorithm**: During training, the network corrects errors and filters are **learned**, e.g., in Keras, by adjusting weights based on **Stochastic Gradient Descent**, **SGD** (stochastic approximation of GD using a randomly selected subset of the data).
- The key architectural characteristics of the convolutional layer are **local connectivity** and **shared weights**.

<img src="img/page 7.png" width="300">
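A minimal sketch of the dot-product view of convolution described above, in plain Python. The input `Z` and the filter `W` are made-up values chosen for illustration; like most deep learning libraries, the sketch slides the filter without flipping it (strictly speaking a cross-correlation).

```python
def conv2d_valid(inputs, kernel):
    """Slide `kernel` over `inputs` (no padding, stride 1) and return the feature map."""
    n_rows, n_cols = len(inputs), len(inputs[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(n_rows - k_rows + 1):
        row = []
        for j in range(n_cols - k_cols + 1):
            # Dot product between the filter weights and the input patch at (i, j)
            total = sum(
                kernel[k][l] * inputs[i + k][j + l]
                for k in range(k_rows)
                for l in range(k_cols)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 input and 3x3 filter (illustration only)
Z = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
W = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d_valid(Z, W)
print(feature_map)  # [[0, 2], [-1, -1]]
```

In a trained network the entries of `W` are not fixed by hand as they are here; they are the weights adjusted during training.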


## Convolutional Layer: Local Connectivity
&nbsp;

<div style="float:left; width:60%">
    <ul>
        <li>Neurons in layer m are only connected to 3 adjacent neurons in the m-1 layer.</li>
        <li>Neurons in layer m+1 have a similar connectivity with the layer below.</li>
        <li>Each neuron is unresponsive to variations outside of its <em>receptive field</em> with respect to the input. </li>
        <ul><li>Receptive field: small neuron collections which process portions of the input data.</li></ul>
        <li>The architecture thus ensures that the learnt feature extractors produce the strongest response to a spatially local input pattern.</li>
    </ul>
</div>
<div style="float:left; width:40%">
    <img src="img/page 8.png">
</div>

## Convolutional Layer: Shared Weights
&nbsp;

<div style="float:left; width:60%">
    <ul>
        <li>We show 3 hidden neurons belonging to the same feature map (the layer right above the input layer).</li>
        <li>Weights of the same color are shared—constrained to be identical.</li>
        <li>Replicating neurons in this way allows for features to be detected regardless of their position in the input. </li>
        <li>Additionally, <b>weight sharing increases learning efficiency</b> by greatly reducing the number of free parameters being learnt.</li>
    </ul>
</div>
<div style="float:left; width:40%">
    <img src="img/page 9.png">
</div>
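To make the parameter saving concrete, the following rough sketch counts weights for a 1-D layer under three designs. The layer sizes are assumptions picked for illustration, and biases are ignored.

```python
def dense_weight_count(n_inputs, n_outputs):
    # Fully connected: every output neuron sees every input
    return n_inputs * n_outputs


def local_weight_count(n_outputs, receptive_field):
    # Local connectivity only: each output neuron has its own small set of weights
    return n_outputs * receptive_field


def shared_weight_count(receptive_field):
    # Local connectivity plus weight sharing: one filter reused at every position
    return receptive_field


n_inputs = 100                             # assumed input length
receptive_field = 3                        # each neuron sees 3 adjacent inputs
n_outputs = n_inputs - receptive_field + 1

print(dense_weight_count(n_inputs, n_outputs))         # 9800
print(local_weight_count(n_outputs, receptive_field))  # 294
print(shared_weight_count(receptive_field))            # 3
```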

## Convolution without padding
&nbsp;

<img src="img/page 10.gif">

<img src="img/page 10.png">

## Convolution with padding
&nbsp;

<center>Animation source: https://github.com/vdumoulin/conv_arithmetic</center>

<div style="float:left;width:50%">
    <img src="img/page 11_1.gif">
    <center>4x4 input. 3x3 filter. Stride = 1. No padding.
2x2 output.</center>
</div>

<div style="float:right;width:50%">
    <img src="img/page 11_2.gif">
    <center>5x5 input. 3x3 filter. Stride = 1. Padding = 1.
5x5 output.</center>
</div>

<br> <br>
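The output sizes in the two animations follow from the standard convolution arithmetic. A small sketch, with a helper name (`conv_output_size`) that is ours rather than from any library:

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - filter_size) // stride + 1


print(conv_output_size(4, 3))             # 2: 4x4 input, 3x3 filter, no padding
print(conv_output_size(5, 3, padding=1))  # 5: 5x5 input, 3x3 filter, padding of 1
```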



## (2) Non-linear Layer 
&nbsp;

<img src="img/page 12_1.png" width="250">

- Intuition: Increase the nonlinearity of the entire architecture without affecting the receptive fields of the convolution layer
- A layer of neurons that applies the non-linear activation function, such as,
    - **$f(x)=\max(0,x)$** - Rectified Linear Unit (ReLU); fast and most widely used in CNNs
    - $f(x)=\tanh x$
    - $f(x)=|\tanh x|$
    - $f(x)=(1+e^{-x})^{-1}$ - sigmoid

<center><img src="img/page 12_2.png" width="300"></center>
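The activation functions listed above can be written directly with the standard library. This is an illustrative sketch on scalars; in practice they are applied element-wise to a whole feature map.

```python
import math


def relu(x):
    return max(0.0, x)


def abs_tanh(x):
    return abs(math.tanh(x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Compare the functions on a few sample inputs
for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), math.tanh(x), abs_tanh(x), sigmoid(x))
```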


## (3) Pooling Layer
&nbsp;

<img src="img/page 13_1.png" width="250">

- Intuition: progressively reduce the spatial size of the representation to reduce the number of parameters and the amount of computation in the network, and hence also to control overfitting
- Max pooling partitions the input into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value of the features in that region.

<center><img src="img/page 13_2.png" width="500"></center>
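A minimal sketch of 2x2 max pooling over non-overlapping regions. The feature map values are invented for illustration.

```python
def max_pool2d(feature_map, pool=2):
    """Max pooling over non-overlapping pool x pool regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - pool + 1, pool):
        pooled.append([
            max(feature_map[i + di][j + dj]
                for di in range(pool)
                for dj in range(pool))
            for j in range(0, cols - pool + 1, pool)
        ])
    return pooled


# Hypothetical 4x4 feature map
fm = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

pooled = max_pool2d(fm)
print(pooled)  # [[6, 5], [7, 9]]
```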

## Pooling (downsampling)
&nbsp;

<img src="img/page 14.png">

## Other Layers
&nbsp;

- The convolution, non-linear, and pooling layers are typically used as a set. Multiple sets of the above three layers can appear in a CNN design.
    - Input &rarr; <span style="color:green">Conv. &rarr; Non-linear &rarr; Pooling</span> &rarr;  <span style="color:green">Conv. &rarr; Non-linear &rarr; Pooling</span> &rarr; … &rarr; Output
- Recent CNN architectures have 10-20 such layers.
- After a few sets, the output is typically sent to one or two **fully connected layers**.
    - A fully connected layer is an ordinary neural network layer as in other neural networks.
    - Typical activation function is the sigmoid function.
    - Output is typically class (classification) or real number (regression).

<img src="img/page 15.png" width="300">

## Other Layers
&nbsp;

- The final layer of a CNN is determined by the research task.
- Classification: Softmax Layer
$$P(y=j|\boldsymbol{x}) = \frac{e^{\boldsymbol{w}_j \cdot \boldsymbol{x}}}{\sum_{k=1}^K{e^{\boldsymbol{w}_k \cdot \boldsymbol{x}}}}$$
    - The outputs are the probabilities of belonging to each class.
- Regression: Linear Layer
$$f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$$
    - The output is a real number.
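The softmax layer can be sketched in a few lines. The class weight vectors `W` and the input `x` below are hypothetical numbers, and subtracting the maximum score is a common numerical-stability trick that does not change the result.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weight vectors w_k (one per class) and input x
W = [[1.0, 0.5], [0.2, -0.3], [-1.0, 0.8]]
x = [2.0, 1.0]

scores = [dot(w_k, x) for w_k in W]            # w_k . x for each class k
probs = softmax(scores)                        # P(y = k | x)
print(probs)
```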

# Implementation for text in Python 

## Convolutional Neural Networks (CNNs)
&nbsp;

Main CNN idea for text:

<span style="color:orange">Compute vectors for n-grams</span> and group them afterwards

<br> <br>

Example: “this takes too long” compute vectors for: 

This takes, takes too, too long, this takes too, takes too long, this takes too long

<div style="float:left;width:60%">
    <img src="img/page 18.png" width="450">
</div>
<div style="float:right;width:40%">
    <img src="img/page 18.gif" width="450">
</div>
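The n-grams in the example can be enumerated as follows. This sketch only lists the windows that filters of width 2, 3, and 4 would slide over; the CNN itself then computes a vector for each window.

```python
def ngrams(tokens, n):
    """All contiguous windows of n tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this takes too long".split()
all_ngrams = [gram for n in (2, 3, 4) for gram in ngrams(tokens, n)]

for gram in all_ngrams:
    print(gram)
```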

## CNN for text classification
&nbsp;

<img src="img/page 19.png">

## CNN with multiple filters
&nbsp;

<img src="img/page 20.png">

## Python CNN Implementation
&nbsp;

- Prerequisites:
    - Python 3.5+ (https://www.python.org/)
    - TensorFlow (https://www.tensorflow.org/)
    - Keras (https://keras.io/)
        - **Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.**
- Recommended:
    - NumPy
    - Scikit-Learn
    - NLTK
    - SciPy

<img src="img/page 21.png" width="450">

## Build a CNN in Keras
&nbsp;


- The `Sequential` model is used to build a linear stack of layers.
- The following code shows how a typical CNN is built in Keras.

```
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
```

Note:

**Dense is the fully connected layer;**

**Flatten is used after all CNN layers and before the fully connected layer;**

**Conv2D is the 2D convolution layer;**

**MaxPooling2D is the 2D max pooling layer;**

**SGD is stochastic gradient descent algorithm.**



## Build a CNN in Keras
&nbsp;

```
# (continued)

model = Sequential()
# We create an empty Sequential model and add layers onto it.

model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(100, 100, 1)))
# We add a Conv2D layer with 32 filters, 3x3 each, followed by a detector layer ReLU.
# This is the first layer we add to the model, so we need to specify the shape of the input.
# Conv2D expects (height, width, channels); here we assume a 100x100 input with a single channel.

model.add(MaxPooling2D(pool_size=(2, 2)))
# We add a MaxPooling2D layer with a 2x2 pooling size.
```

## Build a CNN in Keras
&nbsp;

<img src="img/page 23.png">

## Build a CNN in Keras
&nbsp;

```
# (continued)

model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# We can add more Conv2D and MaxPooling2D layers onto the model.

model.add(Flatten())
# After all the desired CNN layers are added, add a Flatten layer.

model.add(Dense(256, activation='sigmoid'))
# Add a fully connected layer followed by a detector layer with the sigmoid function.

model.add(Dense(10, activation='softmax'))
# A softmax layer is added to achieve multiclass classification. In this example we have 10 classes.
```

## Build a CNN in Keras
&nbsp;

<img src="img/page 24.png">

## Build a CNN in Keras
&nbsp;

```
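# (continued)
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
# Example SGD settings (Nesterov momentum) used to update the filter weights; these are not the Keras defaults.
# Older Keras releases spelled the first argument `lr` and also accepted `decay=1e-6`.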

model.compile(loss='categorical_crossentropy', optimizer=sgd)
# Compile the model and use categorical crossentropy as the loss function, sgd as the optimizer

model.fit(x_train, y_train, batch_size=32, epochs=10)
# Fit the model with x_train and y_train, batch_size and epochs can be set to other values

score = model.evaluate(x_test, y_test, batch_size=32)
# Evaluate model performance using x_test and y_test
```

## Build a CNN in Keras
&nbsp;

<img src="img/page 25.png">

# Encoder-Decoder

## Encoder-Decoder
&nbsp;

- **RNN**: input sequence is transformed into output sequence in a one-to-one fashion.
- **Goal**: Develop an architecture capable of generating contextually appropriate, arbitrary length, output sequences
- **Applications**: 
    - Machine translation, 
    - Summarization, 
    - Question answering,
    - Dialogue modeling.

<img src="img/page 27.png" width="450">


## Simple recurrent neural network illustrated as a feed-forward network
&nbsp;

**Most significant change: new set of weights, U**
- connect the hidden layer from the previous time step to the current hidden layer. 
- determine how the network should make use of past context in calculating the output for the current input.

<center><img src="img/page 28.png" width="500"></center>
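A minimal sketch of one recurrent step, showing where the weights `U` enter. The input weights `W`, the `tanh` activation, and all numbers are assumptions made for illustration; the slide itself only names `U`.

```python
import math


def rnn_step(x_t, h_prev, W, U):
    """h_t = tanh(W x_t + U h_{t-1}) for small list-based vectors."""
    h_t = []
    for i in range(len(h_prev)):
        pre = sum(W[i][j] * x_t[j] for j in range(len(x_t)))         # current input
        pre += sum(U[i][j] * h_prev[j] for j in range(len(h_prev)))  # past context
        h_t.append(math.tanh(pre))
    return h_t


# Hypothetical weights and a two-step input sequence
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.1, 0.0], [0.0, 0.1]]
inputs = [[1.0, 0.0], [0.0, 1.0]]

h = [0.0, 0.0]  # initial hidden state
for x_t in inputs:
    h = rnn_step(x_t, h, W, U)
print(h)
```

Setting every entry of `U` to zero removes the dependence on the previous hidden state, which reduces the step to an ordinary feed-forward layer.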

## Simple-RNN abstraction
&nbsp;

<img src="img/page 29.png">

## RNN Applications 
&nbsp;

<img src="img/page 30.png">

## Sentence Completion using an RNN
&nbsp;

<img src="img/page 31.png" width="600">


- **Trained Neural Language Model** can be used to generate novel sequences 
- Or to complete a given sequence (until the end-of-sentence token `</s>` is generated)

<img src="img/page 32.png">
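The generation loop can be sketched with a toy stand-in for the trained model. The table `NEXT_TOKEN_PROBS` is entirely hypothetical, and unlike an RNN, which conditions on the whole history through its hidden state, this toy looks only at the last token.

```python
# Hypothetical next-token probabilities standing in for a trained neural language model
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "</s>": 0.3},
    "dog": {"barks": 0.8, "</s>": 0.2},
    "sleeps": {"</s>": 1.0},
    "barks": {"</s>": 1.0},
}


def complete(prefix, max_len=10):
    """Extend `prefix` one token at a time until </s> is generated."""
    tokens = list(prefix)
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        # Each generated token is fed back in as the next input (autoregression)
        tokens.append(max(dist, key=dist.get))
    return tokens


completion = complete(["<s>", "the"])
print(completion)  # ['<s>', 'the', 'cat', 'sleeps', '</s>']
```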

## Extending (autoregressive) generation to Machine Translation
&nbsp;

- Translation as Sentence Completion!

<img src="img/page 33.png" width="700">

## (simple) Encoder Decoder Networks
&nbsp;

<img src="img/page 34.png" width="700">

- Encoder generates a contextualized representation of the input (last state).
- Decoder takes that state and autoregressively generates a sequence of outputs


## General Encoder Decoder Networks 

Abstracting away from these choices

1. Encoder: accepts an input sequence, $x_{1:n}$ and generates a corresponding sequence of contextualized representations, $h_{1:n}$
2. Context vector $c$:  function of $h_{1:n}$ and conveys the essence of the input to the decoder.
3. Decoder: accepts $c$ as input and generates an arbitrary length sequence of hidden states $h_{1:m}$ from which a corresponding sequence of output states $y_{1:m}$ can be obtained.

<center><img src="img/page 35.png" width="500"></center>



## Popular architectural choices: Encoder
&nbsp;

Widely used encoder design: **stacked Bi-LSTMs** 
- Contextualized representations for each time step: **hidden states from top layers** from the forward and backward passes

<center><img src="img/page 36.png" width="500"></center>

## Decoder Basic Design
&nbsp;

- Produces the output sequence one element at a time

<center><img src="img/page 37.png" width="600"></center>

## Decoder Design <br> Enhancement
&nbsp;

<img src="img/page 38.png">



## Decoder: How output y is chosen
&nbsp;

- **Sample from the softmax** distribution (OK for generating novel output, not OK for e.g. machine translation or summarization)
- **Most likely output** at each step (doesn’t guarantee that the individual choices make sense together)

<center><img src="img/page 39.png" width="600"></center>
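The two strategies differ only in how a token is picked from the softmax output. A small sketch with an invented vocabulary and distribution:

```python
import random

vocab = ["the", "cat", "sat", "</s>"]
probs = [0.1, 0.6, 0.2, 0.1]   # hypothetical softmax output at one decoding step

# Most likely output: always take the highest-probability token
greedy = vocab[probs.index(max(probs))]

# Sampling: draw tokens in proportion to their probability
rng = random.Random(0)         # fixed seed so the run is repeatable
sampled = rng.choices(vocab, weights=probs, k=5)

print(greedy)   # cat
print(sampled)
```

Sampling can return a different token on each draw, which is why it suits novel text generation better than tasks with a single expected output.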

<img src="img/page 40.png">

# Attention

## Flexible context: Attention
&nbsp;

**Context vector $c$**: function of **$h_{1:n}$** and conveys the essence of the input to the decoder.

<center><img src="img/page 42.png" width="500"></center>

## Flexible context: Attention
&nbsp;

**Context vector $c$**: function of **$h_{1:n}$** and conveys the essence of the input to the decoder.

**Flexible?**  
- Different for each $h_i$
- Flexibly combining the $h_j$ 

<center><img src="img/page 42.png" width="500"></center>

## Attention (1): dynamically derived context
&nbsp;

- Replace static context vector with dynamic <span style="color:lightblue">$c_i$</span>
- derived from the encoder hidden states at each point <span style="color:lightblue">$i$</span> during decoding

**Ideas**:
- should be a linear combination of those states 
$$c_i = \sum_j{\alpha_{ij}h^e_j}$$
- What should $\alpha_{ij}$ depend on?

<center><img src="img/page 43.png" width="300"></center>

## Attention (2): computing $c_i$

- Compute a vector of scores that capture the relevance of each encoder hidden state to the decoder state $h_{i-1}^d$
$$score(h_{i-1}^d, h_j^e)$$

- Just the similarity
$$score(h_{i-1}^d, h_j^e) = h_{i-1}^d \cdot h_j^e$$

- Give network the ability to <span style="background-color:lightgray">learn which aspects</span> of similarity between the decoder and encoder states are important to the current application.

$$score(h_{i-1}^d, h_j^e) = h_{i-1}^d W_S h_j^e$$

<center><img src="img/page 44.png" width="300"></center>


## Attention (3): computing $c_i$ <br> From scores to weights
&nbsp;

- Create a vector of weights by normalizing the scores

$$
\begin{align}
\alpha_{ij} &= \text{softmax}(score(h_{i-1}^d, h_j^e)\ \forall j \in e) \\
&= \frac{\exp(score(h_{i-1}^d, h_j^e))}{\sum_k{\exp(score(h_{i-1}^d, h_k^e))}}
\end{align}
$$

- **Goal achieved**: compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states.

<center><img src="img/page 45.png" width="300"></center>
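Putting the three attention steps together for a single decoder position, using the plain dot-product score. The encoder states and the decoder state are made-up two-dimensional vectors; this is a sketch, not a full attention layer.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(h_dec_prev, encoder_states):
    # Score each encoder state against the previous decoder state
    scores = [dot(h_dec_prev, h_e) for h_e in encoder_states]
    # Normalize the scores into weights alpha_ij
    alphas = softmax(scores)
    # Context vector c_i: weighted sum of the encoder states
    dim = len(encoder_states[0])
    context = [
        sum(a * h_e[d] for a, h_e in zip(alphas, encoder_states))
        for d in range(dim)
    ]
    return alphas, context


# Hypothetical encoder hidden states h^e_j and decoder state h^d_{i-1}
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_dec_prev = [2.0, 0.0]

alphas, c_i = attention_context(h_dec_prev, encoder_states)
print(alphas)
print(c_i)
```

The context vector has the same length as an encoder state regardless of how many encoder states there are, which is the fixed-length property noted above.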



## Attention: Summary
&nbsp;

<img src="img/page 46.png">

## Explaining Y. Goldberg's different notation
&nbsp;

<img src="img/page 47.png">

## Intro to Encoder-Decoder and Attention (Goldberg’s notation)
&nbsp;

<img src="img/page 48.png">

# Transformers

## Transformers (Attention is all you need 2017)
&nbsp;

- Just an introduction: These are two valuable resources to learn more details on the architecture and implementation

- http://nlp.seas.harvard.edu/2018/04/03/attention.html

- https://jalammar.github.io/illustrated-transformer/ (slides come from this source)


## High-level architecture
&nbsp;

- Will only look at the ENCODER(s) part in detail

<center><img src="img/page 51.png" width="450"></center>

<img src="img/page 52.png">

**Key property of the Transformer**: the word in each position flows through its own path in the encoder. 
- There are dependencies between these paths in the self-attention layer. 
- The feed-forward layer does not have those dependencies, so the various paths can be executed in parallel!

<center><img src="img/page 53.png" width="450"></center>

## Visually clearer on two words
&nbsp;

- Dependencies in the self-attention layer. 
- No dependencies in the feed-forward layer. 

<center><img src="img/page 54.png" width="450"></center>

## Self-Attention
&nbsp;

While processing **each word**, self-attention allows the model to look at other positions in the input sequence for clues to build a better encoding for **this word**.

**Step 1: create three vectors** from each of the encoder’s input vectors: 

a <span style="color:purple">Query</span>, a <span style="color:orange">Key</span>, and a <span style="color:lightblue">Value</span> vector (typically of smaller dimension than the embedding),

by multiplying the embedding by three matrices that are **learned** during the training process.

<center><img src="img/page 55.png" width="400"></center>


## Self-Attention
&nbsp;

**Step 2: calculate a score** (like we have seen for regular attention!) that determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

Take dot product of the <span style="color:purple">query vector</span> with the <span style="color:orange">key vector</span> of the respective word we’re scoring. 


E.g., when processing the self-attention for the word “Thinking” in position #1, the first score would be the dot product of <span style="color:purple">q1</span> and <span style="color:orange">k1</span>. The second score would be the dot product of <span style="color:purple">q1</span> and <span style="color:orange">k2</span>.

<center><img src="img/page 56.png" width="400"></center>

## Self Attention
&nbsp;

- **Step 3**: divide the scores by the square root of the dimension of the <span style="color:orange">key vectors</span> (more stable gradients). 
- **Step 4**: pass the result through a softmax operation (the resulting weights are all positive and add up to 1).

**Intuition**: softmax score determines how much each word will be expressed at this position. 

<center><img src="img/page 57.png" width="400"></center>



## Self Attention
&nbsp;

**Step 5**: multiply each <span style="color:blue">value vector</span> by its softmax score.

**Step 6**: sum up the weighted <span style="color:blue">value vectors</span>. This produces <span style="color:pink">the output of the self-attention layer</span> at this position.

<center><img src="img/page 58.png" width="400"></center>

## Self Attention
&nbsp;

**Step 5**: multiply each <span style="color:blue">value vector</span> by its softmax score.

**Step 6**: sum up the weighted <span style="color:blue">value vectors</span>. This produces <span style="color:pink">the output of the self-attention layer</span> at this position.

More details:
- What we have seen for a word is done **for all words** (using matrices) 
- Need to **encode position** of words
- And improved using a mechanism called “**multi-headed**” attention

(kind of like multiple filters for CNN)

see https://jalammar.github.io/illustrated-transformer/

<center><img src="img/page 58.png" width="200"></center>
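The self-attention steps can be sketched end to end for a two-word input. The embeddings and the three projection matrices below are invented numbers (in a real Transformer they are learned), and the sketch covers a single attention head without positional encoding.

```python
import math


def matvec(x, M):
    """Row vector x (length d_in) times matrix M (d_in x d_out)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, W_q, W_k, W_v):
    # Step 1: query, key and value vectors for every input position
    Q = [matvec(x, W_q) for x in X]
    K = [matvec(x, W_k) for x in X]
    V = [matvec(x, W_v) for x in X]
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Steps 2-3: dot-product scores, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 4: softmax turns the scores into weights
        weights = softmax(scores)
        # Steps 5-6: weight the value vectors and sum them
        out = [sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))]
        outputs.append(out)
    return outputs


# Hypothetical embeddings for two words (dimension 3) and 3x2 projection matrices
X = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_v = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]

Z_out = self_attention(X, W_q, W_k, W_v)
print(Z_out)
```

Each output row is a weighted average of the value vectors, so every position's encoding mixes in information from the other positions.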

## The Decoder Side

- Relies on most of the concepts on the encoder side
- See animation on https://jalammar.github.io/illustrated-transformer/

<center><img src="img/page 59.png" width="450"></center>

## Menti

# Summary

## Summary: what did we learn?

$$\text {Time for Practical 7!}$$ 