In [2]:
# Reveal.js
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'theme': 'white',
        'transition': 'none',
        'controls': 'false',
        'progress': 'true',
})

{'theme': 'white',
 'transition': 'none',
 'controls': 'false',
 'progress': 'true'}

In [3]:
%%capture
%load_ext autoreload
%autoreload 2
# %cd ..
import sys
sys.path.append("..")
import statnlpbook.util as util
util.execute_notebook('language_models.ipynb')

In [4]:
%%html
<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [5]:
from IPython.display import Image
import random

# Contextualised Word Representations



## What makes a good word representation? ##

1. Representations are **distinct**
2. **Similar** words have **similar** representations

## What does this mean? ##


* "Yesterday I saw a bass ..."

In [5]:
Image(url='../img/bass_1.jpg'+'?'+str(random.random()), width=300)

In [6]:
Image(url='../img/bass_2.svg'+'?'+str(random.random()), width=100)

# Contextualised Representations #

* Static embeddings (e.g., [word2vec](dl-representations_simple.ipynb)) have one representation per word *type*, regardless of context

* Contextualised representations use the context surrounding the word *token*
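
The contrast can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the vectors are random rather than trained, and `toy_contextual_embed` is a hypothetical stand-in for a real contextual encoder (it simply mixes each token vector with the sentence mean).

```python
import random

random.seed(0)
DIM = 4

sent_a = "yesterday i saw a bass swimming in the lake".split()
sent_b = "yesterday i saw a bass in the music shop".split()

# Static embeddings: one (here random, untrained) vector per word *type*.
vocab = sorted(set(sent_a + sent_b))
static_table = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in vocab}

def static_embed(tokens):
    """Look up each token independently; the context is ignored."""
    return [static_table[t] for t in tokens]

def toy_contextual_embed(tokens):
    """Hypothetical stand-in for a contextual encoder: average each token's
    static vector with the mean vector of the whole sentence."""
    vecs = static_embed(tokens)
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    return [[(x + m) / 2 for x, m in zip(v, mean)] for v in vecs]

pos = sent_a.index("bass")  # "bass" sits at the same position in both sentences

same_static = static_embed(sent_a)[pos] == static_embed(sent_b)[pos]
same_contextual = toy_contextual_embed(sent_a)[pos] == toy_contextual_embed(sent_b)[pos]
print(same_static, same_contextual)
```

The static lookup returns the identical vector for both occurrences of "bass", whereas the toy contextual version yields two different vectors because the surrounding tokens differ.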


## Contextualised Representations Example ##


* a) "Yesterday I saw a bass swimming in the lake"

In [7]:
Image(url='../img/bass_1.jpg'+'?'+str(random.random()), width=300)

* b) "Yesterday I saw a bass in the music shop"

In [8]:
Image(url='../img/bass_2.svg'+'?'+str(random.random()), width=100)

## Contextualised Representations Example ##


* a) <span style="color:red">"Yesterday I saw a bass swimming in the lake"</span>.
* b) <span style="color:green">"Yesterday I saw a bass in the music shop"</span>.

In [9]:
Image(url='../img/bass_visualisation.jpg'+'?'+str(random.random()), width=500)

## What makes a good representation? ##

1. Representations are **distinct**
2. **Similar** words have **similar** representations

Additional criterion:

3. Representations take **context** into account

## How to train contextualised representations ##

Basically like word2vec: predict a word from its context (or vice versa).

We can no longer just use a lookup table (i.e., an embedding matrix).

Train a network with the sequence as input!

Does this remind you of anything?

<center><img src="../img/rnnlm.png"></center>


## Bidirectional RNN LM ##

The RNN hidden state is a contextualised word representation!

But it only considers preceding context.

ELMo (Embeddings from Language Models) is based on a biLM: *bidirectional language model* ([Peters et al., 2018](https://www.aclweb.org/anthology/N18-1202/)).
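
A minimal sketch of why the second direction matters, assuming a single-layer tanh RNN with random, untrained weights (ELMo itself uses trained multi-layer LSTMs; all names below are illustrative). The forward hidden state at "bass" depends only on the preceding tokens, so it is the same in both example sentences; concatenating a backward pass makes the two representations differ.

```python
import math
import random

random.seed(0)
DIM = 3

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def run_rnn(inputs, w_in, w_rec):
    """Plain tanh RNN; returns one hidden state per input position."""
    h = [0.0] * DIM
    states = []
    for x in inputs:
        h = [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]
        states.append(h)
    return states

def bi_rnn(inputs, fwd, bwd):
    """Concatenate forward states with backward states (run on the reversed input)."""
    forward = run_rnn(inputs, *fwd)
    backward = run_rnn(inputs[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]

sent_a = "yesterday i saw a bass swimming in the lake".split()
sent_b = "yesterday i saw a bass in the music shop".split()

# Random (untrained) input embeddings and RNN weights, for illustration only.
emb = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in sorted(set(sent_a + sent_b))}
fwd = (rand_matrix(DIM, DIM), rand_matrix(DIM, DIM))
bwd = (rand_matrix(DIM, DIM), rand_matrix(DIM, DIM))

def embed(tokens):
    return [emb[t] for t in tokens]

pos = sent_a.index("bass")
forward_same = run_rnn(embed(sent_a), *fwd)[pos] == run_rnn(embed(sent_b), *fwd)[pos]
bi_same = bi_rnn(embed(sent_a), fwd, bwd)[pos] == bi_rnn(embed(sent_b), fwd, bwd)[pos]
print(forward_same, bi_same)
```

Both sentences share the prefix "yesterday i saw a bass", so the forward-only state cannot tell the fish from the instrument; the backward state sees "swimming in the lake" versus "in the music shop".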

In [9]:
Image(url='../img/elmo_1.png'+'?'+str(random.random()), width=800)

"Let's stick to improvisation in this skit"

Image credit: http://jalammar.github.io/illustrated-bert/

In [6]:
Image(url='../img/elmo_2.png'+'?'+str(random.random()), width=1200)

In [10]:
Image(url='../img/elmo_3.png'+'?'+str(random.random()), width=1200)

<center><img src="../img/quiz_time.png"></center>


# https://ucph.page.link/bilm
([Responses](https://docs.google.com/forms/d/1BimPo-S12XWt1qOJLXBTIGjRpt-bVW8H7hmT3j0iRRQ/edit#responses))

## Problem: Long-Term Dependencies ##

LSTMs have *longer-term* memory, but they still forget.

Solution: *transformers*! ([Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762))

* As of 2022, all state-of-the-art LMs are transformers.
    * Yes, also GPT-3

In [15]:
Image(url='../img/transformers.png'+'?'+str(random.random()), width=400)

[Attention](attention_slides2.ipynb) Is All You Need!

(See next week's lecture)

# Transformer Language Models

In [12]:
Image(url='../img/transformer-encoder-decoder.png'+'?'+str(random.random()), width=400)

## BERT

**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)).

<center>
    <img src="https://miro.medium.com/max/300/0*2XpE-VjhhLGkFDYg.jpg" width=40%/>
</center>

<center>
<a href="slides/mlm.pdf"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Sesame_Street_logo.svg/500px-Sesame_Street_logo.svg.png"></a>
</center>

### BERT training objective (1): **masked** language model

Predict masked words given context on both sides:

<center>
    <img src="http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png" width=50%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
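
A hedged sketch of how the masked inputs could be built. Devlin et al. (2019) select 15% of the tokens for prediction and replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged. The function below is an illustrative simplification, not BERT's actual preprocessing code: it samples each position independently and works on whitespace tokens rather than WordPieces.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (corrupted, labels). labels[i] holds the original token at the
    positions selected for prediction and None elsewhere, so a loss would
    only be computed at the selected positions."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = "let's stick to improvisation in this skit".split()
vocab = sorted(set(tokens))  # toy vocabulary, for illustration only
# mask_prob is raised above 0.15 here only so that a 7-token demo is likely to show a mask.
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.3, rng=random.Random(0))
print(corrupted)
print(labels)
```

Because the model sees the whole corrupted sequence at once, it can use context on both sides of each selected position when predicting the original token.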

### BERT training objective (2): next sentence prediction

**Conditional encoding** of both sentences:

<center>
    <img src="http://jalammar.github.io/images/bert-next-sentence-prediction.png" width=60%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
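
A rough sketch of how one training pair might be assembled, assuming pre-tokenised sentences from a single toy document. Following Devlin et al. (2019), the two sentences are packed as `[CLS] A [SEP] B [SEP]` with segment IDs, and B is the true next sentence half of the time. The function name and the toy corpus are hypothetical; the real setup draws the random sentence from a different document.

```python
import random

def make_nsp_example(sentences, i, rng):
    """Build one next-sentence-prediction example starting from sentence i."""
    first = sentences[i]
    if rng.random() < 0.5:
        second, is_next = sentences[i + 1], True
    else:
        # Pick any sentence other than the current one and its true successor.
        candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
        second, is_next = sentences[rng.choice(candidates)], False
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    return tokens, segment_ids, is_next

corpus = [
    "yesterday i saw a bass".split(),
    "it was swimming in the lake".split(),
    "the water was very cold".split(),
    "we went home early".split(),
]
rng = random.Random(0)
tokens, segment_ids, is_next = make_nsp_example(corpus, 0, rng)
print(tokens)
print(segment_ids)
print(is_next)
```

The classifier for this objective reads the representation at the `[CLS]` position, which can attend to both sentences.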

### How is that different from ELMo and GPT-$n$?

<center>
    <img src="mt_figures/bert_gpt_elmo.png" width=100%/>
</center>

<div style="text-align: right;">
    (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>
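
One way to picture the difference is by which positions each model may look at when computing the representation at position $i$. The sketch below is a simplification with hypothetical names: GPT-style models see only the left context, BERT sees every position in every layer, and ELMo concatenates two models that were each run in one direction only.

```python
def visible_positions(n, mode):
    """Return an n x n matrix; entry [i][j] is True if the representation
    at position i may use the token at position j."""
    if mode == "left-to-right":    # GPT-style language model
        return [[j <= i for j in range(n)] for i in range(n)]
    if mode == "right-to-left":    # the backward half of a biLM
        return [[j >= i for j in range(n)] for i in range(n)]
    if mode == "bidirectional":    # BERT-style encoder
        return [[True] * n for _ in range(n)]
    raise ValueError(f"unknown mode: {mode}")

def show(matrix):
    for row in matrix:
        print(" ".join("1" if v else "0" for v in row))

n = 4
gpt = visible_positions(n, "left-to-right")
bert = visible_positions(n, "bidirectional")
# ELMo: two separate one-directional models whose outputs are concatenated.
elmo_fwd = visible_positions(n, "left-to-right")
elmo_bwd = visible_positions(n, "right-to-left")

show(gpt)
print()
show(bert)
```

In the ELMo case no single model ever sees both sides at once; the two directions are only combined after each has been computed, whereas BERT conditions on left and right context jointly.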

# Summary #

* Static word embeddings do not differ depending on context
* Contextualised representations are dynamic
* Popular neural architectures for learning contextual representations
    * ELMo: bidirectional language model with LSTMs
    * GPT, GPT-2, GPT-3: transformer language models
    * BERT: transformer masked language model (and next sentence predictor)

# Outlook #

* *Loads* of Transformer variants
    * Near-impossible to keep up
* In the machine translation lecture, you will be introduced to
    * The most important one(s), including BERT
    * How to train them
    * How to use them in practice
    * How to use them for cross-lingual tasks
* Why in the machine translation lecture?
    * Many NLP innovations originate in machine translation

# Additional Reading #

* Blog posts ["The Illustrated Transformer"](http://jalammar.github.io/illustrated-transformer/) and ["The Illustrated BERT, ELMo, and co."](http://jalammar.github.io/illustrated-bert/) by Jay Alammar
    * Step-by-step walk-through of architectures
    * For those who want to get a more in-depth understanding of the architectures