# Natural Language Inference & Transfer Learning

Daniel Hershcovich

# Outline

- Natural language inference
- Quiz results
- Multi-task learning
- Pre-trained embeddings

# Recognizing Textual Entailment = Natural Language Inference

Determining the logical relationship between two sentences.

- Classification task
- Requires commonsense and world knowledge
- Requires general natural language understanding
- Requires fine-grained reasoning

### Recognizing Textual Entailment (RTE)

[Dagan et al., 2005](http://u.cs.biu.ac.il/~dagan/publications/RTEChallenge.pdf)

- Text T
- Hypothesis H

T entails H if, typically, a human reading T would infer that H is most likely true.


Positive:

> “Google files for its long awaited IPO.”

<center>
$\Rightarrow$
</center>

> “Google goes public.”

Negative:

> “Bush returned to the White House late Saturday while his running mate was off campaigning in the West.”

<center>
$\not\Rightarrow$
</center>

> “Bush left the White House.”

### Stanford Natural Language Inference (SNLI) corpus

[Bowman et al., 2015](https://nlp.stanford.edu/pubs/snli_paper.pdf)

570K sentence pairs, two orders of magnitude larger than other NLI resources (1K-10K examples).

**A wedding party taking pictures**

- There is a funeral: **<span class=red>Contradiction</span>**
- They are outside: **<span class=blue>Neutral</span>**
- Someone got married: **<span class=green>Entailment</span>**

<img src="https://upload.wikimedia.org/wikipedia/commons/3/31/Wedding_photographer_at_work.jpg" width=800/> 

### Multi-NLI (MNLI)

[Williams et al., 2018](https://www.nyu.edu/projects/bowman/multinli/paper.pdf): more diverse domains.

Entailment:

> “The legislation was widely hailed as a model for the country.”

<center>
$\Rightarrow$
</center>

> “Many people thought the legislation was a model for the country.”

Neutral:

> “The program has helped victims in 90 court cases, and 150 legal counseling sessions have been held there.”

<center>
?
</center>

> “Victims from 90 grand jury court cases were helped by the program.”

Contradiction:

> “As a result, Chris Schneider, executive director of Central California Legal Services, is building a lawsuit against Alpaugh Irrigation.”

<center>
$\Rightarrow\neg$
</center>

> “Central California Legal Services’ executive director decided not to pursue a lawsuit against Alpaugh Irrigation.”

## GLUE benchmark

[Wang et al., 2019](https://openreview.net/pdf?id=rJ4km2R5t7): collection of sentence- and sentence-pair-classification tasks.

<img src="https://d3i71xaburhd42.cloudfront.net/d9f6ada77448664b71128bb19df15765336974a6/2-Figure1-1.png"/>

### GLUE: Winograd NLI (WNLI)

World knowledge and logical reasoning, presented as NLI.

Positive:

> “I put the cake away in the refrigerator. It has a lot of butter in it.”

<center>
$\Rightarrow$
</center>

> “The cake has a lot of butter in it.”

Negative:

> “The large ball crashed right through the table because it was made of styrofoam.”

<center>
$\not\Rightarrow$
</center>

> “The large ball was made of styrofoam.”

### SuperGLUE

[Wang et al., 2019](https://arxiv.org/pdf/1905.00537v2.pdf): harder NLI, for a meaningful comparison.

> “Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.”

<center>
$\not\Rightarrow$
</center>

> “Christopher Reeve had an accident.”

## RTE/NLI state of the art until 2015

[Lai and Hockenmaier, 2014](https://www.aclweb.org/anthology/S14-2055.pdf),
[Jimenez et al., 2014](https://www.aclweb.org/anthology/S14-2131.pdf),
[Zhao et al., 2014](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6889713),
[Beltagy et al., 2016](https://www.aclweb.org/anthology/J16-4007.pdf) and others:
engineered pipelines.

- Various external resources
- Specialized subcomponents
- Extensive use of **features**:
  - Negation detection, word overlap, part-of-speech tags, dependency parses, alignment, symbolic meaning representation
  
<img src="https://d3i71xaburhd42.cloudfront.net/fca1e631b8f93036065311eb92727c509423475a/9-Figure1-1.png"/>
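
To make two of the features above concrete (word overlap and negation detection), here is a minimal sketch. The function names, the negation word list, and the tokenizer are illustrative assumptions, not taken from any of the cited systems.

```python
import re

# Illustrative negation cues; real systems used richer resources.
NEGATION_WORDS = {"no", "not", "never", "nobody", "nothing", "none", "n't"}


def tokenize(sentence):
    """Lowercase and split into word tokens, separating the clitic n't."""
    sentence = sentence.lower().replace("n't", " n't")
    return re.findall(r"n't|[a-z0-9]+", sentence)


def overlap_features(text, hypothesis):
    """Return a small dict of hand-crafted features for a (T, H) pair."""
    t_set, h_set = set(tokenize(text)), set(tokenize(hypothesis))
    shared = t_set & h_set
    union = t_set | h_set
    return {
        # Fraction of hypothesis word types that also occur in the text.
        "h_coverage": len(shared) / len(h_set) if h_set else 0.0,
        # Jaccard similarity between the two word-type sets.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # True when exactly one side contains a negation cue.
        "negation_mismatch": bool(t_set & NEGATION_WORDS) != bool(h_set & NEGATION_WORDS),
    }


features = overlap_features(
    "Google files for its long awaited IPO.",
    "Google goes public.",
)
print(features)
```

The Google pair is an entailment, yet only one hypothesis word occurs in the text. Surface features like these need to be combined with alignment and external knowledge, which is what the engineered pipelines did.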

### Neural networks for NLI

Large-scale NLI corpora make neural networks (NNs) feasible to train.

### Independent sentence encoding

[Bowman et al., 2015](https://www.aclweb.org/anthology/D15-1075.pdf): same LSTM encodes premise and hypothesis.

<img src="dl-applications-figures/rte.svg" width=800/> 

Last output vector as sentence representation.

<img src="dl-applications-figures/rte_encoding.svg" width=800/>

MLP to classify as entailment/neutral/contradiction.

<img src="dl-applications-figures/mlp.svg" width=700/> 
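
The pipeline above (shared encoder, last output vector, MLP) can be sketched in plain Python. This is a toy under stated assumptions: a tanh RNN stands in for the LSTM, the weights are random and untrained, and all names (`Encoder`, `classify`, `K`) are hypothetical.

```python
import math
import random

random.seed(0)
K = 4  # toy hidden size; the models in the lecture use k = 100 or more
LABELS = ["entailment", "neutral", "contradiction"]


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Encoder:
    """Tanh RNN standing in for the LSTM; one instance encodes both sentences."""

    def __init__(self, vocab):
        self.emb = {w: [random.uniform(-0.5, 0.5) for _ in range(K)] for w in vocab}
        self.w_in = rand_matrix(K, K)
        self.w_rec = rand_matrix(K, K)

    def encode(self, tokens):
        h = [0.0] * K
        for tok in tokens:
            a = matvec(self.w_in, self.emb[tok])
            b = matvec(self.w_rec, h)
            h = [math.tanh(x + y) for x, y in zip(a, b)]
        return h  # last output vector = sentence representation


def classify(encoder, w_hidden, w_out, premise, hypothesis):
    """Concatenate both sentence vectors and apply a one-hidden-layer MLP."""
    pair = encoder.encode(premise) + encoder.encode(hypothesis)  # length 2K
    hidden = [math.tanh(x) for x in matvec(w_hidden, pair)]
    return softmax(matvec(w_out, hidden))


premise = "a wedding party taking pictures".split()
hypothesis = "someone got married".split()
encoder = Encoder(sorted(set(premise + hypothesis)))
w_hidden = rand_matrix(K, 2 * K)
w_out = rand_matrix(len(LABELS), K)
probs = classify(encoder, w_hidden, w_out, premise, hypothesis)
print(dict(zip(LABELS, probs)))
```

Since the weights are untrained, the printed probabilities are arbitrary. The point is the data flow: each sentence is encoded without seeing the other, and only the classifier combines them.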

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier| - | - | - | 99.7 | - | 78.2|

### Conditional encoding

The way we read the hypothesis could be influenced by our understanding of the premise.

<img src="dl-applications-figures/conditional.svg" width=800/> 

<img src="dl-applications-figures/conditional_encoding.svg" width=800/> 
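
A minimal sketch of the idea, assuming a tanh RNN in place of the LSTM and separate, randomly initialized weights for the two readers (both are assumptions for illustration): the hypothesis reader starts from the final premise state instead of from zeros.

```python
import math
import random

random.seed(1)
K = 3  # toy hidden size


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn(tokens, emb, w_in, w_rec, state):
    """Run a tanh RNN (a stand-in for an LSTM) from `state`; return the final state."""
    h = list(state)
    for tok in tokens:
        x = emb[tok]
        h = [
            math.tanh(
                sum(wi * xi for wi, xi in zip(row_in, x))
                + sum(wr * hi for wr, hi in zip(row_rec, h))
            )
            for row_in, row_rec in zip(w_in, w_rec)
        ]
    return h


vocab = ["away", "bush", "left", "returned", "stayed"]
emb = {w: [random.uniform(-0.5, 0.5) for _ in range(K)] for w in vocab}
premise_params = (rand_matrix(K, K), rand_matrix(K, K))
hypothesis_params = (rand_matrix(K, K), rand_matrix(K, K))


def encode_pair(premise, hypothesis):
    zero = [0.0] * K
    premise_state = rnn(premise, emb, *premise_params, state=zero)
    # Conditional encoding: the hypothesis reader starts from the premise state.
    return rnn(hypothesis, emb, *hypothesis_params, state=premise_state)


hypothesis = ["bush", "left"]
rep_a = encode_pair(["bush", "returned"], hypothesis)
rep_b = encode_pair(["bush", "stayed", "away"], hypothesis)
print(rep_a)
print(rep_b)
```

The same hypothesis gets a different representation depending on the premise it is read after, which independent encoding cannot do.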

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier| - | - | - | 99.7 | - | 78.2|
| Conditional encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|

> You can’t cram the meaning of a whole %&!\$# sentence into a single \$&!#* vector!
>
> -- <cite>Raymond J. Mooney</cite>


### Attention

<img src="dl-applications-figures/attention.svg" width=800/> 

<img src="dl-applications-figures/attention_encoding.svg" width=800/> 

<img  src="./dl-applications-figures/camel.png"/>
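
A minimal sketch of attention over the premise output vectors, given the final hypothesis state as the query. Dot-product scoring is used here as a simple stand-in; the scoring function behind the figures may differ, and all vectors are made up.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(premise_outputs, query):
    """Return (weights, context) for one query vector over the premise outputs."""
    # One score per premise position: dot product with the query.
    scores = [sum(q * y for q, y in zip(query, out)) for out in premise_outputs]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the premise outputs.
    context = [
        sum(w * out[d] for w, out in zip(weights, premise_outputs))
        for d in range(len(query))
    ]
    return weights, context


# Made-up 2-dimensional output vectors for a three-word premise.
premise_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hypothesis_state = [2.0, 0.0]  # made-up final hypothesis state
weights, context = attend(premise_outputs, hypothesis_state)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The classifier then sees the context vector together with the hypothesis state, so the premise no longer has to fit into a single fixed vector.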

#### Contextual understanding

<img src="./dl-applications-figures/pink.png"/>

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier| - | - | - | 99.7 | - | 78.2|
| Conditional encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|
| Attention | 100 | 3.9M | 242k | 85.4 | 83.2 | 82.3 |

#### Fuzzy attention

<img  src="./dl-applications-figures/mimes.png"/>

### Word-by-word attention

<img src="dl-applications-figures/word_attention.svg" width=800/> 

<img src="dl-applications-figures/word_attention_encoding.svg" width=800/> 

#### Reordering

<img src="./dl-applications-figures/reordering.png" width=60%/>

#### Synonyms

<img  src="./dl-applications-figures/trashcan.png" width=90%/>

#### Hypernyms

<img src="./dl-applications-figures/kids.png" width=80%/>

#### Lexical inference

<img src="./dl-applications-figures/snow.png"/>

## Results

| Model | k | θ<sub>W+M</sub> | θ<sub>M</sub> | Train | Dev | Test |
|-|-|-|-|-|-|-|
| LSTM | 100 | \\(\approx\\)10M | 221k | 84.4 | - | 77.6|
| Classifier| - | - | - | 99.7 | - | 78.2|
| Conditional encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|
| Attention | 100 | 3.9M | 242k | 85.4 | 83.2 | 82.3 |
| Word-by-word attention | 100 | 3.9M | 252k | 85.3 | **83.7** | **83.5** |


## Composition

[Bowman et al., 2016](https://www.aclweb.org/anthology/P16-1139.pdf):
compositional vector representation based on syntactic structure.

<img src="https://d3i71xaburhd42.cloudfront.net/36c097a225a95735271960e2b63a2cb9e98bff83/1-Figure1-1.png"/>

## NLI artefacts

SNLI and MNLI are **crowdsourced**.

[Gururangan et al., 2018](https://www.aclweb.org/anthology/N18-2017.pdf): hypothesis phrasing alone gives away the class.

<img src="https://d3i71xaburhd42.cloudfront.net/2997b26ffb8c291ce478bd8a6e47979d5a55c466/2-Table1-1.png"/>
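
One way to surface such artefacts is pointwise mutual information (PMI) between hypothesis words and labels, ignoring the premise entirely. The sketch below uses a handful of made-up hypotheses, not SNLI data, and applies no smoothing.

```python
import math
from collections import Counter

# Made-up hypotheses with labels; premises are deliberately left out.
DATA = [
    ("nobody is outside", "contradiction"),
    ("the man is not sleeping", "contradiction"),
    ("nobody is eating", "contradiction"),
    ("a person is outside", "entailment"),
    ("there are some animals", "entailment"),
    ("a person is moving", "entailment"),
    ("the tall man is winning a competition", "neutral"),
    ("the woman is waiting for her first date", "neutral"),
]


def pmi_table(data):
    """PMI(word, label) = log( p(word, label) / (p(word) * p(label)) )."""
    word_label, word_count, label_count = Counter(), Counter(), Counter()
    for hypothesis, label in data:
        label_count[label] += 1
        for word in set(hypothesis.split()):  # count each word once per hypothesis
            word_label[word, label] += 1
            word_count[word] += 1
    n = len(data)
    return {
        (word, label): math.log(
            (joint / n) / ((word_count[word] / n) * (label_count[label] / n))
        )
        for (word, label), joint in word_label.items()
    }


pmi = pmi_table(DATA)
top = sorted(pmi.items(), key=lambda item: -item[1])[:5]
for (word, label), score in top:
    print(f"{word:12s} {label:14s} {score:.2f}")
```

A word like "nobody" scores high for contradiction while a function word like "is" scores near zero, which is the kind of signal a hypothesis-only classifier can exploit.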

## Lexical entailment

[Glockner et al., 2018](https://www.aclweb.org/anthology/P18-2103.pdf): very **simple** examples that are hard for models.

<img src="https://persagen.com/files/misc/arxiv1805.02266-table1.png">


# Transfer learning

Reading material:

- [The State of Transfer Learning in NLP](http://ruder.io/state-of-transfer-learning-in-nlp/)
- [An Overview of Multi-Task Learning in Deep Neural Networks](https://arxiv.org/pdf/1706.05098.pdf)

[Quiz results](https://absalon.instructure.com/courses/35895/quizzes/37714/statistics)

<img src="https://ruder.io/content/images/2019/08/transfer_learning_taxonomy.png">

# A very brief history of transfer learning in NLP

- [Collobert et al., 2011](https://arxiv.org/pdf/1103.0398.pdf): many tasks with shared embeddings.
- [word2vec](https://arxiv.org/pdf/1301.3781.pdf), [GloVe](https://www.aclweb.org/anthology/D14-1162.pdf) and others:
pre-trained word embeddings.
- [ELMo](https://www.aclweb.org/anthology/N18-1202.pdf), [BERT](https://www.aclweb.org/anthology/N19-1423.pdf) and others:
pre-trained contextualized embeddings.

<img src="https://d3i71xaburhd42.cloudfront.net/2538e3eb24d26f31482c479d95d2e26c0e79b990/3-Figure1-1.png" width=60%/>

## Multi-task learning in NLP

- [Caruana, 1997](https://www.cs.cornell.edu/~caruana/mlj97.pdf): introduction of MTL.
- [Hashimoto et al., 2017](https://www.aclweb.org/anthology/D17-1206.pdf),
[Bjerva, 2017](https://www.aclweb.org/anthology/W17-0225.pdf),
[Bollmann et al., 2018](https://www.aclweb.org/anthology/W18-3403.pdf),
[Hershcovich et al., 2018](https://www.aclweb.org/anthology/P18-1035.pdf),
[Augenstein et al., 2018](https://www.aclweb.org/anthology/N18-1172.pdf) and many others: MTL for NLP.

<img src="https://d3i71xaburhd42.cloudfront.net/ade0c116120b54b57a91da51235108b75c28375a/1-Figure1-1.png"/>
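
A common form of MTL is hard parameter sharing: one encoder shared by all tasks, plus a small output layer per task. The sketch below is a toy illustration with random, untrained weights; the class names and the choice of tasks are assumptions, and the cited papers use more elaborate architectures.

```python
import math
import random

random.seed(0)
DIM = 4  # toy representation size


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SharedEncoder:
    """Parameters shared by every task: here just averaged word embeddings."""

    def __init__(self):
        self.emb = {}

    def encode(self, tokens):
        vecs = [
            self.emb.setdefault(t, [random.uniform(-1, 1) for _ in range(DIM)])
            for t in tokens
        ]
        return [sum(column) / len(vecs) for column in zip(*vecs)]


class TaskHead:
    """Task-specific output layer on top of the shared representation."""

    def __init__(self, labels):
        self.labels = labels
        self.weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in labels]

    def predict(self, rep):
        scores = [sum(w * x for w, x in zip(row, rep)) for row in self.weights]
        return dict(zip(self.labels, softmax(scores)))


encoder = SharedEncoder()
heads = {
    "nli": TaskHead(["entailment", "neutral", "contradiction"]),
    "sentiment": TaskHead(["positive", "negative"]),
}
tokens = "the legislation was widely hailed".split()
rep = encoder.encode(tokens)  # computed once, by the shared parameters
outputs = {task: head.predict(rep) for task, head in heads.items()}
print(outputs)
```

Training is not shown. In a trained model, the losses of all tasks would update the shared encoder, while each head is updated only by its own task.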

## NLI and parsing

[Bowman et al., 2016](https://www.aclweb.org/anthology/P16-1139.pdf):
transition-based parsing and NLI. Stack-augmented Parser-Interpreter Neural Network (SPINN).

<img src="https://d3i71xaburhd42.cloudfront.net/36c097a225a95735271960e2b63a2cb9e98bff83/2-Figure2-1.png"/>
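
The transition-based part can be sketched as a shift-reduce loop over a buffer and a stack. This is a simplified illustration: `compose` is a placeholder that builds a tuple, where SPINN uses a learned composition function over vectors, and the transition sequence is given rather than predicted.

```python
def compose(left, right):
    """Placeholder composition: build a tuple instead of a learned vector."""
    return (left, right)


def shift_reduce(tokens, transitions):
    """Apply SHIFT/REDUCE transitions and return the single item left on the stack."""
    buffer = list(reversed(tokens))  # pop() yields the next unread token
    stack = []
    for action in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop())
        elif action == "REDUCE":
            right = stack.pop()
            left = stack.pop()
            stack.append(compose(left, right))
        else:
            raise ValueError(f"unknown transition: {action}")
    if buffer or len(stack) != 1:
        raise ValueError("transitions do not form a complete binary tree")
    return stack[0]


tokens = ["the", "cat", "sat", "down"]
transitions = ["SHIFT", "SHIFT", "REDUCE", "SHIFT", "SHIFT", "REDUCE", "REDUCE"]
tree = shift_reduce(tokens, transitions)
print(tree)  # (('the', 'cat'), ('sat', 'down'))
```

A sentence of n tokens always takes n SHIFT and n - 1 REDUCE transitions, so the sentence representation is ready after 2n - 1 steps.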


## Multi-task

[Augenstein et al., 2018](https://www.aclweb.org/anthology/N18-1172.pdf):
sentiment analysis, stance detection, fake news detection and NLI
with a Label Embedding Layer.

<img src="https://d3i71xaburhd42.cloudfront.net/64c5f7055b2e6982b6b95e069b22230d13a134bb/4-Figure1-1.png"/>

# Pre-trained embeddings

General-purpose representations trained on large datasets (usually unsupervised): 

- [word2vec](https://arxiv.org/pdf/1301.3781.pdf)
- [GloVe](https://www.aclweb.org/anthology/D14-1162.pdf)
- [ELMo](https://www.aclweb.org/anthology/N18-1202.pdf)
- [BERT](https://www.aclweb.org/anthology/N19-1423.pdf)

are all forms of transfer learning.

## GLUE

A collection of structurally similar classification tasks that serves as a general language understanding benchmark.

<img src="https://d3i71xaburhd42.cloudfront.net/d9f6ada77448664b71128bb19df15765336974a6/2-Figure1-1.png"/>

## Character-Aware Neural Language Models

[Kim et al., 2015](https://arxiv.org/pdf/1508.06615.pdf)

<img src="https://raw.githubusercontent.com/dsindex/blog/master/images/ngram_cnn_highway_1.png"/>

## ELMo

<img src="https://upload.wikimedia.org/wikipedia/en/7/74/Elmo_from_Sesame_Street.gif"/>

[Alammar, 2018](http://jalammar.github.io/illustrated-bert/)

<img src="http://jalammar.github.io/images/elmo-embedding-robin-williams.png"/>

[Peters et al., 2018](https://www.aclweb.org/anthology/N18-1202.pdf):
stacked LSTMs with BiLM (bidirectional language model) objective.

<img src="http://jalammar.github.io/images/elmo-forward-backward-language-model-embedding.png"/>

<img src="http://jalammar.github.io/images/elmo-embedding.png"/>
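
In ELMo, a downstream task combines the layers of the biLM with softmax-normalized weights and a scaling factor. A minimal sketch for a single token, with made-up layer vectors and hypothetical names:

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_mix(layer_vectors, layer_logits, gamma=1.0):
    """Collapse the per-layer vectors of one token into a single vector.

    layer_vectors: one vector per biLM layer, all for the same token.
    layer_logits:  unnormalized task-specific layer weights (softmaxed here).
    gamma:         task-specific scalar that rescales the mixed vector.
    """
    s = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s_j * h[d] for s_j, h in zip(s, layer_vectors))
        for d in range(dim)
    ]


# Made-up vectors for one token: embedding layer plus two LSTM layers.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = elmo_mix(layers, layer_logits=[0.0, 0.0, 0.0])
print(mixed)
```

With equal logits this is a plain average of the layers. In a real setup the logits and `gamma` are learned together with the downstream model.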

## Transformers

[Vaswani et al., 2017](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf):
self-attention: each word attends to all words.

<img src="https://d3i71xaburhd42.cloudfront.net/204e3073870fae3d05bcbc2f6a8e263d9b72e776/4-Figure2-1.png"/>
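
A minimal sketch of scaled dot-product self-attention, softmax(QKᵀ/√d_k)V, for a single head. As a simplification, the learned projections are omitted, so Q = K = V = the input vectors; the inputs are made up.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d_k = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # Score the query against every word in the sentence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in x
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, x)) for d in range(d_k)]
        )
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional vectors for a three-word sentence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x)
for row in weights:
    print([round(w, 3) for w in row])
```

Every word gets one row of weights over all words, so a dependency between distant positions costs a single step rather than a long recurrent path.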

### Self-attention network

<img src="https://d3i71xaburhd42.cloudfront.net/204e3073870fae3d05bcbc2f6a8e263d9b72e776/3-Figure1-1.png"/>

### Long-distance dependencies

<img src="https://d3i71xaburhd42.cloudfront.net/204e3073870fae3d05bcbc2f6a8e263d9b72e776/13-Figure3-1.png"/>


## BERT

[Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf):
bidirectional self-attention with masked language model + next sentence prediction objectives.

<img src="https://miro.medium.com/max/300/0*2XpE-VjhhLGkFDYg.jpg"/>

### Masked language model

<img src="http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png"/>
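
A sketch of the masking procedure described in the BERT paper: 15% of the positions are selected for prediction; of those, 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. Whitespace tokens stand in for WordPiece tokens, and the function name is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            corrupted.append(token)  # position not selected: input unchanged
            labels.append(None)
            continue
        labels.append(token)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)  # 10%: keep the token unchanged
    return corrupted, labels


tokens = "the large ball crashed right through the table".split()
vocab = sorted(set(tokens))
# select_prob is raised for the demo so that a short sentence gets some masks.
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0), select_prob=0.5)
print(corrupted)
print(labels)
```

The loss is computed only at positions where the label is not `None`. Because some selected tokens are kept or swapped rather than masked, the model cannot rely on seeing `[MASK]` to know which positions matter.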

### Next sentence prediction

**Conditional encoding** of both sentences.

<img src="http://jalammar.github.io/images/bert-next-sentence-prediction.png"/>
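
BERT reads both sentences as one sequence, `[CLS] A [SEP] B [SEP]`, with segment ids telling the two apart. A minimal sketch, assuming whitespace tokens instead of WordPiece and a hypothetical helper name:

```python
def build_pair_input(tokens_a, tokens_b):
    """Pack two token lists into one sequence with segment ids (0 for A, 1 for B)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS] and the first [SEP] belong to segment 0, the final [SEP] to segment 1.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(
    "google files for its long awaited ipo".split(),
    "google goes public".split(),
)
for token, segment in zip(tokens, segment_ids):
    print(segment, token)
```

Self-attention runs over the whole packed sequence, so every hypothesis token can attend to every premise token. The same input format is used when fine-tuning on NLI.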

### Using BERT

<img src="http://jalammar.github.io/images/bert-tasks.png"/>

### Which layer to use?

<img src="http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png"/>

### Layer specialization

[Tenney et al., 2019](https://www.aclweb.org/anthology/P19-1452.pdf): different layers are better at different tasks.

<img src="https://storage.googleapis.com/groundai-web-prod/media/users/user_234892/project_363715/images/supplemental/bylayer_base.png"/>

BERT pre-training data: 16GB of text from Wikipedia + BookCorpus.

- Batch Size: 131,072 words (1024 sequences × 128 tokens, or 256 sequences × 512 tokens)
- Training Time: 1M steps (~40 epochs)
- BERT-Base: 12-layer, 768-hidden, 12-head
- BERT-Large: 24-layer, 1024-hidden, 16-head
- Trained on 4x4 or 8x8 TPU slice for 4 days
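
A quick arithmetic check of the numbers above. The only inputs are the batch configurations, the step count, and the epoch estimate from the list; the variable names are ad hoc.

```python
TOKENS_PER_BATCH = 131_072

# (sequences per batch, tokens per sequence)
configs = [(1024, 128), (256, 512)]
for num_sequences, seq_len in configs:
    # Both configurations give the same number of tokens per batch.
    assert num_sequences * seq_len == TOKENS_PER_BATCH

steps = 1_000_000
epochs = 40
tokens_seen = steps * TOKENS_PER_BATCH
tokens_per_epoch = tokens_seen // epochs

print(f"{tokens_seen:,} tokens processed")   # 131,072,000,000 tokens processed
print(f"{tokens_per_epoch:,} tokens/epoch")  # 3,276,800,000 tokens/epoch
```

So 1M steps at this batch size correspond to about 131 billion tokens, and "~40 epochs" implies a corpus of roughly 3.3 billion tokens.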

## RoBERTa

[Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf): bigger is better.

BERT, pre-trained on additional data:

- CC-News (76GB)
- OpenWebText (38GB)
- Stories (31GB)

and **no** next-sentence-prediction task (only masked LM).

## GLUE

<img src="https://d3i71xaburhd42.cloudfront.net/d9f6ada77448664b71128bb19df15765336974a6/2-Figure1-1.png"/>

## T5

[Raffel et al., 2019](https://arxiv.org/pdf/1910.10683.pdf): train on ALL the tasks!

<img src="dl-applications-figures/t5.png"/>
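
In the text-to-text setup, every task becomes "input string in, output string out", with a task prefix in the input. The sketch below shows the idea for NLI and sentiment; the exact prefix wording is an assumption modeled on the paper's examples, the sentiment sentence is made up, and the helper names are hypothetical.

```python
def nli_to_text(premise, hypothesis, label):
    """Cast an NLI example as (input text, target text)."""
    return f"mnli hypothesis: {hypothesis} premise: {premise}", label


def sentiment_to_text(sentence, label):
    """Cast a sentiment example as (input text, target text)."""
    return f"sst2 sentence: {sentence}", label


examples = [
    nli_to_text(
        "The legislation was widely hailed as a model for the country.",
        "Many people thought the legislation was a model for the country.",
        "entailment",
    ),
    sentiment_to_text("a charming and often affecting journey", "positive"),
]
for source, target in examples:
    print(source, "->", target)
```

Because the label is generated as text, the same model, loss, and decoding procedure serve all tasks without task-specific output layers.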

## T5

Currently [best on GLUE](https://gluebenchmark.com/leaderboard)...

# Further reading

- [Linguistic Knowledge and Transferability of Contextual Representations](https://www.aclweb.org/anthology/N19-1112.pdf)