In [1]:
# Reveal.js
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'theme': 'white',
        'transition': 'none',
        'controls': 'false',
        'progress': 'true',
})

{'theme': 'white',
 'transition': 'none',
 'controls': 'false',
 'progress': 'true'}

In [2]:
%%capture
%load_ext autoreload
%autoreload 2
# %cd ..
import sys
sys.path.append("..")
import statnlpbook.util as util
util.execute_notebook('language_models.ipynb')

In [3]:
%%html
<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [4]:
from IPython.display import Image
import random

# Contextualised Word Representations



## What makes a good word representation? ##

1. Representations are **distinct**
2. **Similar** words have **similar** representations

## Reminder: word2vec

<center><img width=1000 src="../img/cbow_sg2.png"></center>

<div style="text-align: right;">
    (word2vec: <a href="https://arxiv.org/abs/1301.3781">Mikolov et al., 2013</a>)
</div>

### Disadvantage of Static Word Embeddings

* No context (or at most a small fixed context window): the representation depends only on the word itself

How can we address this shortcoming?

## What does this mean? ##


* "Yesterday I saw a bass ..."

In [5]:
Image(url='../img/bass_1.jpg'+'?'+str(random.random()), width=300)

In [6]:
Image(url='../img/bass_2.svg'+'?'+str(random.random()), width=100)

# Contextualised Representations #

* Static embeddings (e.g., [word2vec](dl-representations_simple.ipynb)) have one representation per word *type*, regardless of context

* Contextualised representations use the context surrounding the word *token*
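
To make the type/token distinction concrete, here is a minimal sketch of a static lookup. The table `static_embeddings`, its made-up values, and the helper `embed_static` are illustrative assumptions, not part of any library.

```python
# Toy static embedding table: one vector per word *type* (values are made up).
static_embeddings = {
    "bass": [0.2, -0.7, 0.5],
    "lake": [0.9, 0.1, -0.3],
    "music": [-0.4, 0.8, 0.6],
}

def embed_static(tokens, table, dim=3):
    """Look up each token independently; unknown tokens map to a zero vector."""
    return [table.get(tok, [0.0] * dim) for tok in tokens]

sent_a = "yesterday i saw a bass swimming in the lake".split()
sent_b = "yesterday i saw a bass in the music shop".split()

vec_a = embed_static(sent_a, static_embeddings)[sent_a.index("bass")]
vec_b = embed_static(sent_b, static_embeddings)[sent_b.index("bass")]

# Both "bass" tokens receive the same vector, whatever the surrounding words are.
print(vec_a == vec_b)
```

A contextualised model would instead compute each token's vector from the whole sentence, so the two occurrences of "bass" could differ.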


## Contextualised Representations Example ##


* a) "Yesterday I saw a bass swimming in the lake"

In [7]:
Image(url='../img/bass_1.jpg'+'?'+str(random.random()), width=300)

* b) "Yesterday I saw a bass in the music shop"

In [8]:
Image(url='../img/bass_2.svg'+'?'+str(random.random()), width=100)

## Contextualised Representations Example ##


* a) <span style="color:red">"Yesterday I saw a bass swimming in the lake"</span>.
* b) <span style="color:green">"Yesterday I saw a bass in the music shop"</span>.

In [9]:
Image(url='../img/bass_visualisation.jpg'+'?'+str(random.random()), width=500)

## What makes a good representation? ##

1. Representations are **distinct**
2. **Similar** words have **similar** representations

Additional criterion:

3. Representations take **context** into account

## How to train contextualised representations ##

Basically like word2vec: predict a word from its context (or vice versa).

We cannot just use a lookup table (i.e., an embedding matrix) any more.

Train a network with the sequence as input! Does this remind you of anything?

In [9]:
Image(url='../img/elmo_1.png'+'?'+str(random.random()), width=800)

The hidden state of an RNN LM is a contextualised word representation!

To predict the next word from the hidden state, the LM has to make that state a generally good representation of the sequence so far.

In this example: a two-layer LSTM LM taken from a model called *ELMo*.

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>

## Bidirectional RNN LM ##

An RNN (or LSTM) LM only considers preceding context.

ELMo (Embeddings from Language Models) is based on a biLM: *bidirectional language model* ([Peters et al., 2018](https://www.aclweb.org/anthology/N18-1202/)).
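
The effect of adding a second direction can be sketched with a tiny, untrained RNN in plain Python. This is only an illustration under stated assumptions: `make_rnn`, `run_rnn`, and `bilm_states` are hypothetical helpers, the weights are random rather than trained, and concatenating the two directions is one common way to combine them, not necessarily ELMo's exact recipe.

```python
import math
import random

def make_rnn(vocab, dim, seed):
    """Randomly initialised (untrained) simple RNN parameters."""
    rng = random.Random(seed)
    def rand_vec():
        return [rng.uniform(-1, 1) for _ in range(dim)]
    return {"emb": {w: rand_vec() for w in sorted(vocab)},
            "W": [rand_vec() for _ in range(dim)]}

def run_rnn(tokens, rnn, dim):
    """Return the hidden state after reading each token, left to right."""
    h = [0.0] * dim
    states = []
    for tok in tokens:
        x = rnn["emb"][tok]
        h = [math.tanh(x[i] + sum(rnn["W"][i][j] * h[j] for j in range(dim)))
             for i in range(dim)]
        states.append(h)
    return states

def bilm_states(tokens, fwd, bwd, dim):
    """Concatenate forward states with backward states (read right to left)."""
    f = run_rnn(tokens, fwd, dim)
    b = run_rnn(tokens[::-1], bwd, dim)[::-1]
    return [fs + bs for fs, bs in zip(f, b)]

DIM = 4
sent_a = "yesterday i saw a bass swimming in the lake".split()
sent_b = "yesterday i saw a bass in the music shop".split()
vocab = set(sent_a) | set(sent_b)
fwd = make_rnn(vocab, DIM, seed=1)
bwd = make_rnn(vocab, DIM, seed=2)

i_a, i_b = sent_a.index("bass"), sent_b.index("bass")
# Forward only: both sentences share the prefix "yesterday i saw a bass".
fwd_same = run_rnn(sent_a, fwd, DIM)[i_a] == run_rnn(sent_b, fwd, DIM)[i_b]
# Bidirectional: the backward half has read the differing right-hand context.
bi_same = (bilm_states(sent_a, fwd, bwd, DIM)[i_a]
           == bilm_states(sent_b, fwd, bwd, DIM)[i_b])
print(fwd_same, bi_same)
```

The forward-only states for "bass" coincide because the preceding words are identical in both sentences; only the backward direction sees "swimming in the lake" versus "in the music shop".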

In [6]:
Image(url='../img/elmo_2.png'+'?'+str(random.random()), width=1200)

In [10]:
Image(url='../img/elmo_3.png'+'?'+str(random.random()), width=1200)

<center><img src="../img/quiz_time.png"></center>

# [tinyurl.com/diku-nlp-bilm](https://tinyurl.com/diku-nlp-bilm)
([Responses](https://docs.google.com/forms/d/1BimPo-S12XWt1qOJLXBTIGjRpt-bVW8H7hmT3j0iRRQ/edit#responses))

## Problem: Long-Term Dependencies ##

LSTMs have *longer-term* memory, but they still forget.

Solution: *transformers*! ([Vaswani et al., 2017](https://arxiv.org/abs/1706.03762))

* As of 2024, virtually all state-of-the-art LMs are transformers.
    * Yes, also GPT-4
    * But some [RNN-inspired models](https://github.com/state-spaces/mamba) (e.g., Mamba, a state-space model) are on the rise

## OpenAI GPT (Generative Pre-trained Transformer)

Series of *decoder-only* neural language models using the *transformer* architecture.

Their contextualised representations can be accessed through the Embeddings API: https://platform.openai.com/docs/api-reference/embeddings

The language model itself can be accessed through the chat completions API: https://platform.openai.com/docs/api-reference/chat

See more in the Transformers lecture.

<center>
    <img src="http://jalammar.github.io/images/xlnet/transformer-decoder-intro.png" width=70%/>
</center>


<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT-2</a>)
</div>

In [15]:
Image(url='../img/transformers.png'+'?'+str(random.random()), width=400)

## BERT

**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)), an *encoder-only* transformer.

<center>
    <img src="https://miro.medium.com/max/300/0*2XpE-VjhhLGkFDYg.jpg" width=40%/>
</center>

### BERT training objective (1): **masked** language modelling (MLM)

Predict *masked* words given context on both sides:

<center>
    <img src="http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png" width=70%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
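
A simplified sketch of how a masked-LM training example could be built. The helper `mask_tokens` is hypothetical, and it always substitutes `[MASK]`; the recipe in Devlin et al. also selects about 15% of tokens but sometimes keeps the selected token or swaps in a random one, which is omitted here for brevity.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and record the targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)    # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)   # no loss is computed at this position
    return inputs, targets

tokens = "yesterday i saw a bass swimming in the lake".split()
inputs, targets = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(inputs)
print(targets)
```

Because the encoder sees the whole masked sequence at once, each prediction can draw on words both to the left and to the right of the masked position.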

### BERT training objective (2): next sentence prediction (NSP)

Classify whether one sentence follows another using *conditional encoding* of both sentences:

<center>
    <img src="http://jalammar.github.io/images/bert-next-sentence-prediction.png" width=70%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
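
The input format for this objective can be sketched as follows. `make_nsp_example` is a hypothetical helper; whitespace splitting stands in for real tokenisation, and the negative sentence is drawn from the same small list here, whereas the original setup samples it from the corpus.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one NSP example starting from sentences[index] (not the last one)."""
    first = sentences[index].split()
    if rng.random() < 0.5:
        second, label = sentences[index + 1].split(), "IsNext"
    else:
        # Pick any sentence other than the first one and its true successor.
        candidates = [s for j, s in enumerate(sentences)
                      if j not in (index, index + 1)]
        second, label = rng.choice(candidates).split(), "NotNext"
    # Both sentences are encoded jointly, separated by [SEP].
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # Segment ids tell the model which sentence each token belongs to.
    segment_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    return tokens, segment_ids, label

sentences = [
    "yesterday i saw a bass",
    "it was swimming in the lake",
    "the music shop was closed",
    "i went home",
]
rng = random.Random(0)
tokens, segment_ids, label = make_nsp_example(sentences, 0, rng)
print(tokens)
print(segment_ids)
print(label)
```

The classifier then predicts the label from the representation at the `[CLS]` position.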

### How is that different from ELMo and GPT?

<center>
    <img src="mt_figures/bert_gpt_elmo.png" width=100%/>
</center>

<div style="text-align: right;">
    (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>

See more in the Attention lecture.
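
One way to see the difference between a decoder-only and an encoder-only transformer is through the attention mask, sketched below. `causal_mask` and `bidirectional_mask` are illustrative helper names, with 1 meaning "may attend" and 0 meaning "blocked".

```python
def causal_mask(n):
    """Decoder-only (GPT-style): position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-only (BERT-style): every position may attend to every position."""
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)
print()
for row in bidirectional_mask(4):
    print(row)
```

ELMo does not fit either pattern directly: it runs two separate one-directional LSTMs and combines their states.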

## T5 (Text-to-Text Transfer Transformer)

An *encoder-decoder* model.

See more in the Transfer Learning lecture.

<center>
    <img src="https://1.bp.blogspot.com/-89OY3FjN0N0/XlQl4PEYGsI/AAAAAAAAFW4/knj8HFuo48cUFlwCHuU5feQ7yxfsewcAwCLcBGAsYHQ/s640/image2.png" width=50%>
</center>

<div style="text-align: right;">
    (from <a href="https://arxiv.org/abs/1910.10683">Raffel et al., 2019</a>)
</div>

### BERT tokenisation: not words, but WordPieces

BPE (byte-pair encoding; [Sennrich et al., 2016](https://aclanthology.org/P16-1162/)) and WordPiece ([Wu et al., 2016](https://arxiv.org/abs/1609.08144v2)) tokenise text into **subwords**.

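* BERT has a [vocabulary of about 30,000 WordPieces](https://huggingface.co/bert-base-cased/blob/main/vocab.txt), including ~10,000 unique characters.
* (Almost) no unknown words: an unseen word is split into known subwords, down to single characters if necessary.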

<center>
    <img src="https://vamvas.ch/assets/bert-for-ner/tokenizer.png" width=60%/>
</center>

<div style="text-align: right;">
    (from <a href="https://vamvas.ch/bert-for-ner">BERT for NER</a>)
</div>
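
A minimal sketch of greedy longest-match-first WordPiece tokenisation for a single word. The function `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions for illustration; a real vocabulary is far larger. Non-initial pieces carry the `##` prefix.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched, not even a single character
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able", "the"}
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

The last call falls back to `[UNK]` only because the toy vocabulary lacks single characters; with the individual characters in the vocabulary, every word over those characters can be segmented.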

### Visualizing BERT word embeddings

Pretty similar to [word2vec](dl-representations_simple.ipynb):

<center>
    <img src="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc.png" width=70%/>
</center>

<div style="text-align: right;">
    (from <a href="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html">Visualizing BERT</a>)
</div>

### Visualizing BERT word embeddings

<center>
    <img src="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc-house.png" width=70%/>
</center>

<div style="text-align: right;">
    (from <a href="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html">Visualizing BERT</a>)
</div>

### Visualizing BERT word embeddings

<center>
    <img src="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc-suffixes.png" width=70%/>
</center>

<div style="text-align: right;">
    (from <a href="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html">Visualizing BERT</a>)
</div>

# Summary #

* Static word embeddings do not differ depending on context
* Contextualised representations are dynamic
* Popular pre-trained contextual representations:
    * ELMo: bidirectional language model with LSTMs
    * GPT: transformer language models (decoder-only)
    * BERT: transformer masked language model (encoder-only)
    * T5: text-to-text transformer (encoder-decoder)

# Additional Reading #

+ [Jurafsky & Martin Chapter 7](https://web.stanford.edu/~jurafsky/slp3/7.pdf)
+ [Jurafsky & Martin Chapter 8](https://web.stanford.edu/~jurafsky/slp3/8.pdf)