# Transfer Learning

Daniel Hershcovich

[Quiz results](https://absalon.instructure.com/courses/35895/quizzes/37714/statistics)

# Outline
- Natural Language Inference
- Pre-trained embeddings

# Recognizing Textual Entailment (RTE) = Natural Language Inference (NLI)

Determining the logical relationship between two sentences.

- Classification task
- Requires commonsense and world knowledge
- Requires general natural language understanding
- Requires fine-grained reasoning

- **A wedding party taking pictures**
  - There is a funeral: **<span class=red>Contradiction</span>**
  - They are outside: **<span class=blue>Neutral</span>**
  - Someone got married: **<span class=green>Entailment</span>**

<img src="https://upload.wikimedia.org/wikipedia/commons/3/31/Wedding_photographer_at_work.jpg" width=800/> 

### State of the Art until 2015

[<span class=blue>Lai and Hockenmaier, 2014, Jimenez et al., 2014, Zhao et al., 2014, Beltagy et al., 2015</span> etc.]

- Engineered natural language processing pipelines
- Various external resources
- Specialized subcomponents
- Extensive manual creation of **features**:
  - Negation detection, word overlap, part-of-speech tags, dependency parses, alignment, unaligned matching, chunk alignment, synonym, hypernym, antonym, denotation graph
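To make the idea of hand-engineered features concrete, here is a minimal sketch of two of the features listed above, word overlap and negation detection. The function name `nli_features`, the toy negation word list, and the whitespace tokenizer are illustrative assumptions and are not taken from any of the cited systems, which relied on much richer resources.

```python
# Hypothetical, simplified feature extractor for illustration only.
NEGATION_WORDS = {"no", "not", "never", "nobody", "nothing", "n't"}  # assumed toy list


def tokenize(sentence):
    """Lowercase whitespace tokenizer (an assumption; real pipelines used proper tokenizers)."""
    return sentence.lower().split()


def nli_features(premise, hypothesis):
    """Return two hand-crafted features for a premise/hypothesis pair."""
    p_tokens, h_tokens = set(tokenize(premise)), set(tokenize(hypothesis))
    # Word overlap: fraction of hypothesis word types that also occur in the premise.
    overlap = len(p_tokens & h_tokens) / len(h_tokens) if h_tokens else 0.0
    # Negation mismatch: exactly one of the two sentences contains a negation word.
    p_neg = bool(p_tokens & NEGATION_WORDS)
    h_neg = bool(h_tokens & NEGATION_WORDS)
    return {"word_overlap": overlap, "negation_mismatch": p_neg != h_neg}


features = nli_features("A wedding party taking pictures", "There is not a wedding party")
```

Features like these would then be fed to a conventional classifier; the point is that every signal has to be designed and implemented by hand.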

### Neural Networks for NLI

As shown above, NLI models relied heavily on engineered features. One reason neural networks had not been applied to this task was the absence of high-quality, large-scale NLI corpora.

**Previous NLI corpora**:
- Tiny data sets (1k-10k examples)
- Partly synthetic examples

**Stanford Natural Language Inference Corpus (SNLI)**:
- 500k sentence pairs
- Two orders of magnitude larger than existing NLI data sets
- All examples generated by humans


### Independent Sentence Encoding

With the introduction of a large-scale NLI corpus, neural networks became feasible to train. A first baseline model by Bowman et al. (2015) used LSTMs to encode the premise and hypothesis independently of each other.

[<span class=blue>Bowman et al., 2015</span>]

Same LSTM encodes premise and hypothesis

<img src="dl-applications-figures/rte.svg" width=800/> 


Last output vector as sentence representation

<img src="dl-applications-figures/rte_encoding.svg" width=800/> 
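The independent encoding scheme can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the original implementation: the gate layout inside `W`, the toy sizes, and the random stand-in word vectors are all hypothetical. What it shows is that one set of LSTM parameters reads both sentences, each from a zero state, and that the last output vector serves as the sentence representation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_encode(inputs, W, b, k):
    """Run a single-layer LSTM over `inputs` (shape [T, d]); return the last output vector."""
    h = np.zeros(k)
    c = np.zeros(k)
    for x in inputs:
        # One affine map produces all four gate pre-activations at once.
        z = W @ np.concatenate([x, h]) + b
        i, f, o = sigmoid(z[:k]), sigmoid(z[k:2 * k]), sigmoid(z[2 * k:3 * k])
        g = np.tanh(z[3 * k:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h


rng = np.random.default_rng(0)
d, k = 8, 5  # toy embedding size and hidden size (assumed values)
W = rng.normal(scale=0.1, size=(4 * k, d + k))
b = np.zeros(4 * k)

premise = rng.normal(size=(6, d))     # 6 word vectors standing in for the premise
hypothesis = rng.normal(size=(3, d))  # 3 word vectors standing in for the hypothesis

# The same parameters (W, b) encode both sentences, each starting from a zero state.
premise_vec = lstm_encode(premise, W, b, k)
hypothesis_vec = lstm_encode(hypothesis, W, b, k)
```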

> You can’t cram the meaning of a whole %&!\$# sentence into a single \$&!#* vector!
>
> -- <cite>Raymond J. Mooney</cite>

Once both sentences are encoded as vectors, we can use a multi-layer perceptron followed by a softmax to model the probability of each class (entailment, neutral, contradiction) given the two sentences.

<img src="dl-applications-figures/mlp.svg" width=700/> 
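As a rough sketch of this classification step, the snippet below concatenates the two sentence vectors, applies a single `tanh` hidden layer, and normalizes the three class scores with a softmax. The single hidden layer, the layer sizes, and the random parameters are assumptions made for brevity; the text only specifies a multi-layer perceptron followed by a softmax.

```python
import numpy as np

LABELS = ["entailment", "neutral", "contradiction"]


def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def classify(premise_vec, hypothesis_vec, W1, b1, W2, b2):
    """One tanh hidden layer followed by a softmax over the three NLI classes."""
    x = np.concatenate([premise_vec, hypothesis_vec])
    hidden = np.tanh(W1 @ x + b1)
    return softmax(W2 @ hidden + b2)


rng = np.random.default_rng(0)
k, hidden_size = 5, 7  # toy sizes (assumed)
W1, b1 = rng.normal(size=(hidden_size, 2 * k)), np.zeros(hidden_size)
W2, b2 = rng.normal(size=(len(LABELS), hidden_size)), np.zeros(len(LABELS))

# Random vectors stand in for the encoded premise and hypothesis.
probs = classify(rng.normal(size=k), rng.normal(size=k), W1, b1, W2, b2)
predicted = LABELS[int(np.argmax(probs))]
```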

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM [<span class=blue>Bowman et al.</span>] | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier [<span class=blue>Bowman et al.</span>]| - | - | - | 99.7 | - | 78.2|

### Conditional Encoding

The way we read the hypothesis could be influenced by our understanding of the premise.

<img src="dl-applications-figures/conditional.svg" width=800/> 

<img src="dl-applications-figures/conditional_encoding.svg" width=800/> 
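One way to realize conditional encoding, assumed here because the text does not spell out the mechanism, is to start the hypothesis LSTM from the final cell state of the premise LSTM instead of from zeros. The sketch below uses separate parameters for the two LSTMs (also an assumption) and compares the result with an unconditioned run over the same hypothesis.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_run(inputs, W, b, k, c0=None):
    """Run an LSTM over `inputs`; return (last output vector, last cell state)."""
    h = np.zeros(k)
    c = np.zeros(k) if c0 is None else c0.copy()
    for x in inputs:
        z = W @ np.concatenate([x, h]) + b
        i, f, o = sigmoid(z[:k]), sigmoid(z[k:2 * k]), sigmoid(z[2 * k:3 * k])
        g = np.tanh(z[3 * k:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h, c


rng = np.random.default_rng(1)
d, k = 8, 5  # toy sizes (assumed)
W_p, b_p = rng.normal(scale=0.5, size=(4 * k, d + k)), np.zeros(4 * k)  # premise LSTM
W_h, b_h = rng.normal(scale=0.5, size=(4 * k, d + k)), np.zeros(4 * k)  # hypothesis LSTM
premise = rng.normal(size=(6, d))
hypothesis = rng.normal(size=(3, d))

# Read the premise first and keep its final cell state.
_, c_premise = lstm_run(premise, W_p, b_p, k)
# Conditional encoding: the hypothesis LSTM starts from the premise's cell state.
h_conditional, _ = lstm_run(hypothesis, W_h, b_h, k, c0=c_premise)
# For comparison: the same hypothesis LSTM started from a zero state.
h_independent, _ = lstm_run(hypothesis, W_h, b_h, k)
```

The two hypothesis representations differ even though the hypothesis and the hypothesis LSTM parameters are identical, which is exactly the dependence on the premise that conditional encoding is meant to introduce.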

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM [<span class=blue>Bowman et al.</span>] | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier [<span class=blue>Bowman et al.</span>]| - | - | - | 99.7 | - | 78.2|
| Conditional Encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|

### Attention [<span class=blue>Graves 2013</span>, <span class=blue>Bahdanau et al. 2015</span>]


<img src="dl-applications-figures/attention.svg" width=800/> 

Conditional encoding improves generalization compared to independent encoding of premise and hypothesis. Another improvement can be achieved by using a neural attention mechanism. To this end we compare the last output vector ($\mathbf{h}_N$) with all premise output vectors (concatenated to $\mathbf{Y}$) and model a probability distribution $\alpha$ over premise output vectors using a softmax. Lastly, we obtain a context representation $\mathbf{r}$ by weighting output vectors with the attention $\alpha$, and use that context representation together with $\mathbf{h}_N$ for prediction.
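The attention step described above can be sketched in NumPy. The softmax over premise positions and the weighted sum follow the description; the additive `tanh` scoring function, the final combination `h_star`, and all parameter names (`W_y`, `W_h`, `w`, `W_p`, `W_x`) are assumptions chosen for illustration.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def attend(Y, h_N, W_y, W_h, w):
    """Attention over premise outputs Y (shape [k, L]) given the last output vector h_N (shape [k])."""
    # Compare h_N with every premise output vector (h_N is broadcast across the L columns).
    M = np.tanh(W_y @ Y + (W_h @ h_N)[:, None])
    alpha = softmax(w @ M)  # probability distribution over the L premise positions
    r = Y @ alpha           # attention-weighted context representation
    return alpha, r


rng = np.random.default_rng(0)
k, L = 5, 6  # toy hidden size and premise length (assumed)
Y = rng.normal(size=(k, L))
h_N = rng.normal(size=k)
W_y, W_h, w = rng.normal(size=(k, k)), rng.normal(size=(k, k)), rng.normal(size=k)
W_p, W_x = rng.normal(size=(k, k)), rng.normal(size=(k, k))

alpha, r = attend(Y, h_N, W_y, W_h, w)
# Combine the context representation with h_N into the vector used for prediction.
h_star = np.tanh(W_p @ r + W_x @ h_N)
```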

<img src="dl-applications-figures/attention_encoding.svg" width=800/> 

<img  src="./dl-applications-figures/camel.png"/>

#### Contextual Understanding

<img  src="./dl-applications-figures/pink.png"/>

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM [<span class=blue>Bowman et al.</span>] | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier [<span class=blue>Bowman et al.</span>]| - | - | - | 99.7 | - | 78.2|
| Conditional Encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|
| Attention | 100 | 3.9M | 242k | 85.4 | 83.2 | 82.3 |

#### Fuzzy Attention

<img  src="./dl-applications-figures/mimes.png"/>

### Word-by-word Attention [<span class=blue>Bahdanau et al. 2015</span>, <span class=blue>Hermann et al. 2015</span>, <span class=blue>Rush et al. 2015</span>]

<img src="dl-applications-figures/word_attention.svg" width=800/> 

An extension to the attention mechanism described above is to allow the model to attend over premise output vectors for every word in the hypothesis. This can be achieved by a small adaptation of the previous model.
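A minimal sketch of this extension is given below. It computes one attention distribution per hypothesis output vector and carries a running context vector `r` from word to word. The specific recurrence (feeding the previous context into the query and adding `tanh(W_t @ r)` to the new context) and all parameter names are assumptions for illustration; the text only states that attention is applied for every hypothesis word.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def word_by_word_attention(Y, H, W_y, W_h, W_r, W_t, w):
    """Attend over premise outputs Y ([k, L]) once per hypothesis output vector in H ([T, k])."""
    k = Y.shape[0]
    r = np.zeros(k)
    alphas = []
    for h_t in H:
        # The query mixes the current hypothesis output with the previous context.
        query = W_h @ h_t + W_r @ r
        M = np.tanh(W_y @ Y + query[:, None])
        alpha_t = softmax(w @ M)
        # New context: attended premise summary plus a transformed previous context.
        r = Y @ alpha_t + np.tanh(W_t @ r)
        alphas.append(alpha_t)
    return np.stack(alphas), r


rng = np.random.default_rng(0)
k, L, T = 5, 6, 4  # toy hidden size, premise length, hypothesis length (assumed)
Y = rng.normal(size=(k, L))  # premise output vectors, one column per premise word
H = rng.normal(size=(T, k))  # hypothesis output vectors, one row per hypothesis word
W_y, W_h, W_r, W_t = (rng.normal(scale=0.5, size=(k, k)) for _ in range(4))
w = rng.normal(size=k)

alphas, r_last = word_by_word_attention(Y, H, W_y, W_h, W_r, W_t, w)
```

Each row of `alphas` is the attention over premise words for one hypothesis word, which is the kind of matrix visualized in the heatmap figures of this section.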

<img src="dl-applications-figures/word_attention_encoding.svg" width=800/> 

#### Reordering

<img src="./dl-applications-figures/reordering.png" width=60%/>

#### Garbage Can = Trashcan

<img  src="./dl-applications-figures/trashcan.png" width=90%/>

#### Kids =  Girl + Boy

<img  src="./dl-applications-figures/kids.png" width=80%/>

## Snow is outside

<img  src="./dl-applications-figures/snow.png"/>

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM [<span class=blue>Bowman et al.</span>] | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier [<span class=blue>Bowman et al.</span>]| - | - | - | 99.7 | - | 78.2|
| Conditional Encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|
| Attention | 100 | 3.9M | 242k | 85.4 | 83.2 | 82.3 |
| Word-by-word Attention | 100 | 3.9M | 252k | 85.3 | **83.7** | **83.5** |

## Artefacts

<img src="https://persagen.com/files/misc/arxiv1805.02266-table1.png">

## Pre-trained Representations

Instead of training task-specific word representations, it is sometimes helpful to use pre-trained word vectors from word2vec or GloVe. A simple way to do this is to manually set the vectors of the word embedding matrix. Note that with `trainable=False`, the model cannot update these word embeddings during training. If we only want to use the pre-trained word vectors as an initializer and fine-tune them on our task, we need to set `trainable=True`.
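The effect of the `trainable` flag can be illustrated without TensorFlow. The NumPy sketch below fills an embedding matrix from a toy dictionary of pre-trained vectors and then applies one gradient step with and without freezing. The vocabulary, the vector values, the placeholder gradient, and the helper `sgd_step` are all made up for illustration; the snippet mimics what the flag does and does not use the TensorFlow API.

```python
import numpy as np

# Toy stand-in for vectors loaded from word2vec or GloVe (values are made up).
pretrained = {
    "wedding": np.array([0.1, 0.2, 0.3]),
    "funeral": np.array([-0.3, 0.0, 0.5]),
}
vocab = ["<unk>", "wedding", "funeral", "party"]
dim = 3

rng = np.random.default_rng(0)
# Random initialization for every row, then overwrite rows that have a pre-trained vector.
embeddings = rng.normal(scale=0.1, size=(len(vocab), dim))
for index, word in enumerate(vocab):
    if word in pretrained:
        embeddings[index] = pretrained[word]


def sgd_step(matrix, grad, lr=0.1, trainable=True):
    """Return the matrix after one SGD step; a frozen matrix is returned unchanged."""
    return matrix - lr * grad if trainable else matrix


grad = np.ones_like(embeddings)  # placeholder gradient for illustration
frozen = sgd_step(embeddings, grad, trainable=False)     # vectors stay as loaded
fine_tuned = sgd_step(embeddings, grad, trainable=True)  # vectors move with the task
```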

## ELMo

<img src="https://upload.wikimedia.org/wikipedia/en/7/74/Elmo_from_Sesame_Street.gif"/>
<img src="https://raw.githubusercontent.com/dsindex/blog/master/images/ngram_cnn_highway_1.png"/>
<img src="http://jalammar.github.io/images/elmo-forward-backward-language-model-embedding.png"/>

## BERT

<img src="https://miro.medium.com/max/300/0*2XpE-VjhhLGkFDYg.jpg"/>
<img src="http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png"/>
<img src="http://jalammar.github.io/images/bert-next-sentence-prediction.png"/>
<img src="http://jalammar.github.io/images/bert-tasks.png"/>
<img src="https://storage.googleapis.com/groundai-web-prod/media/users/user_234892/project_363715/images/supplemental/bylayer_base.png"/>

## GLUE

<img src="https://miro.medium.com/max/1200/0*-k_fjBnCuByNye4v"/>

### Projecting Representations

Sometimes we want to work with pre-trained word vectors (e.g. 300-dimensional word2vec vectors) but use a different hidden size for our RNN. We can learn a linear projection from one vector space to another via `tf.contrib.layers.linear` (a TensorFlow 1.x API; `tf.contrib` was removed in TensorFlow 2.x). For a non-linear projection, we can simply apply a non-linear function such as `tanh` to the output of `linear`.
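What such a projection computes can be sketched in plain NumPy. The hidden size of 100, the random initialization, and the helper name `project` are assumptions for illustration; in practice `W` and `b` would be learned together with the rest of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_size = 300, 100  # 300 from the text; 100 is an assumed RNN size

# Parameters of the projection (randomly initialized here, learned in practice).
W = rng.normal(scale=0.01, size=(embedding_dim, hidden_size))
b = np.zeros(hidden_size)


def project(vectors, nonlinear=False):
    """Map [T, embedding_dim] word vectors to [T, hidden_size]; optionally apply tanh."""
    out = vectors @ W + b
    return np.tanh(out) if nonlinear else out


sentence = rng.normal(size=(7, embedding_dim))  # 7 stand-in word vectors
linear_out = project(sentence)                  # linear projection
tanh_out = project(sentence, nonlinear=True)    # non-linear projection
```

The projected vectors can then be fed to an RNN whose hidden size no longer has to match the dimensionality of the pre-trained embeddings.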

# References

- [The State of Transfer Learning in NLP](http://ruder.io/state-of-transfer-learning-in-nlp/)
- [An Overview of Multi-Task Learning in Deep Neural Networks](https://arxiv.org/pdf/1706.05098.pdf)
- [Linguistic Knowledge and Transferability of Contextual Representations](https://www.aclweb.org/anthology/N19-1112.pdf)
- [GLUE: a multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7)
- [SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems](https://w4ngatang.github.io/static/papers/superglue.pdf)
- [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](http://jalammar.github.io/illustrated-bert/)
- [Character-Aware Neural Language Models](https://arxiv.org/pdf/1508.06615.pdf)
- [Annotation Artifacts in Natural Language Inference Data](https://www.aclweb.org/anthology/N18-2017.pdf)
- [Breaking NLI Systems with Sentences that Require Simple Lexical Inferences](https://www.aclweb.org/anthology/P18-2103.pdf)