In [1]:
# Reveal.js
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'theme': 'white',
        'transition': 'none',
        'controls': 'false',
        'progress': 'true',
})

{'theme': 'white',
 'transition': 'none',
 'controls': 'false',
 'progress': 'true'}

In [2]:
%%capture
%load_ext autoreload
%autoreload 2
# %cd ..
import sys
sys.path.append("..")
import statnlpbook.util as util
util.execute_notebook('language_models.ipynb')

In [3]:
%%html
<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [4]:
from IPython.display import Image
import random

# Contextualised Word Representations



## What makes a good word representation? ##

1. Representations are **distinct**
2. **Similar** words have **similar** representations

## What does this mean? ##


* "Yesterday I saw a bass ..."

In [5]:
Image(url='../img/bass_1.jpg'+'?'+str(random.random()), width=300)

In [6]:
Image(url='../img/bass_2.svg'+'?'+str(random.random()), width=100)

# Contextualised Representations #

* Static embeddings (e.g., [word2vec](dl-representations_simple.ipynb)) have one representation per word *type*, regardless of context

* Contextualised representations use the context surrounding the word *token*
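
The difference between the two can be sketched in a few lines of NumPy. This is only a toy illustration: the vocabulary, the random embedding matrix `E`, and the neighbour-averaging function `contextual_embed` are all hypothetical stand-ins, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding matrix (one row per word type).
vocab = {"a": 0, "bass": 1, "swimming": 2, "guitar": 3}
E = rng.normal(size=(len(vocab), 4))

def static_embed(tokens):
    """One vector per word type: a plain table lookup."""
    return np.stack([E[vocab[t]] for t in tokens])

def contextual_embed(tokens):
    """Toy contextualisation (an assumption for illustration, not a real model):
    average each token's static vector with those of its immediate neighbours."""
    static = static_embed(tokens)
    out = np.empty_like(static)
    for i in range(len(tokens)):
        lo, hi = max(0, i - 1), min(len(tokens), i + 2)
        out[i] = static[lo:hi].mean(axis=0)
    return out

s1 = ["a", "bass", "swimming"]
s2 = ["a", "bass", "guitar"]

# Static: "bass" gets the same vector in both sentences.
print(np.allclose(static_embed(s1)[1], static_embed(s2)[1]))          # True
# Contextual: the vector for "bass" now depends on its neighbours.
print(np.allclose(contextual_embed(s1)[1], contextual_embed(s2)[1]))  # False
```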


## Contextualised Representations Example ##


* a) "Yesterday I saw a bass swimming in the lake"

In [7]:
Image(url='../img/bass_1.jpg'+'?'+str(random.random()), width=300)

* b) "Yesterday I saw a bass in the music shop"

In [8]:
Image(url='../img/bass_2.svg'+'?'+str(random.random()), width=100)

## Contextualised Representations Example ##


* a) <span style="color:red">"Yesterday I saw a bass swimming in the lake"</span>.
* b) <span style="color:green">"Yesterday I saw a bass in the music shop"</span>.

In [9]:
Image(url='../img/bass_visualisation.jpg'+'?'+str(random.random()), width=500)

## What makes a good representation? ##

1. Representations are **distinct**
2. **Similar** words have **similar** representations

Additional criterion:

3. Representations take **context** into account

## How to train contextualised representations ##

Basically like word2vec: predict a word from its context (or vice versa).

We cannot just use a lookup table (i.e., an embedding matrix) any more.

Train a network with the sequence as input!

Does this remind you of anything?

<center><img src="../img/rnnlm.png"></center>


## Bidirectional RNN LM ##

The RNN hidden state is a contextualised word representation!

But it only considers preceding context.

ELMo (Embeddings from Language Models) is based on a biLM: *bidirectional language model* ([Peters et al., 2018](https://www.aclweb.org/anthology/N18-1202/)).
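
A minimal sketch of the bidirectional idea, assuming a plain tanh RNN with random, untrained weights. ELMo itself uses trained LSTM layers, so this is a simplification; all names and sizes here are hypothetical. One RNN reads the sequence left to right, a second one reads it right to left, and the two hidden states at each position are concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, seq_len = 4, 3, 5

def init_params():
    """Random input-to-hidden and hidden-to-hidden weights (untrained)."""
    return (rng.normal(scale=0.5, size=(d_hid, d_in)),
            rng.normal(scale=0.5, size=(d_hid, d_hid)))

def rnn_states(xs, W_x, W_h):
    """Run a simple tanh RNN over xs and return the hidden state at every step."""
    h = np.zeros(d_hid)
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(seq_len, d_in))   # stand-in for input word embeddings
fwd_params, bwd_params = init_params(), init_params()

h_fwd = rnn_states(xs, *fwd_params)               # state i sees tokens 0..i
h_bwd = rnn_states(xs[::-1], *bwd_params)[::-1]   # state i sees tokens i..n-1

# Contextualised representation: both directions side by side.
contextual = np.concatenate([h_fwd, h_bwd], axis=1)
print(contextual.shape)  # (5, 6)
```

Changing the last token leaves the forward states of earlier positions untouched but changes their backward states, which is exactly the context a left-to-right RNN alone would miss.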

In [10]:
Image(url='../img/elmo_1.png'+'?'+str(random.random()), width=800)

Image credit: http://jalammar.github.io/illustrated-bert/

In [11]:
Image(url='../img/elmo_2.png'+'?'+str(random.random()), width=1200)

In [12]:
Image(url='../img/elmo_3.png'+'?'+str(random.random()), width=1200)

In [13]:
Image(url='../img/elmo_4.png'+'?'+str(random.random()), width=1200)

Peters et al., “Deep contextualized word representations” (2018)

In [14]:
Image(url='../img/elmo_5.png'+'?'+str(random.random()), width=1200)

## Problems with using Bi-LSTMs to learn contextual word embeddings ##

* Recurrent neural networks are difficult to train
    * Vanishing gradients (LSTM; Hochreiter & Schmidhuber 1997)
    * Exploding gradients (norm rescaling; Pascanu et al. 2013)
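
As a rough illustration of the norm rescaling remedy for exploding gradients: if the joint norm of all gradients exceeds a threshold, scale them down so that it equals the threshold. The function name `clip_by_global_norm` and the threshold value are assumptions made for this sketch.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.
    Returns the (possibly rescaled) gradients and the norm before clipping."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm <= max_norm:
        return grads, total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Toy gradients with joint norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

The direction of the update is preserved; only its length is capped.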

## Proposal: Transformer Networks ##

* **Idea 1**: Encode words individually with feed-forward neural networks
    * Shorter path for gradient back-propagation
    * Parallel computation possible
* **Idea 2**: Replace recurrence function with a **positional encoding**
    * Fixed-length vectors, similar to word embeddings, that represent the position


* Base architecture of most current state-of-the-art NLP models
    * Yes, also GPT-3
* You will learn more about popular variants and how to train them later in the course
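
To make Idea 1 concrete, here is a small sketch, assuming a two-layer feed-forward network with random, untrained weights (all names and sizes are hypothetical). Because the same network is applied to every position independently, the whole sequence can be processed in one matrix multiplication rather than step by step.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 16

# Hypothetical weights shared across all positions.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    """Two-layer feed-forward network with a ReLU, applied to the last axis of x."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

X = rng.normal(size=(seq_len, d_model))   # stand-in for the input vectors

parallel = feed_forward(X)                               # all positions at once
sequential = np.stack([feed_forward(x) for x in X])      # one position at a time
print(np.allclose(parallel, sequential))  # True
```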

## Transformer Networks ##

* **Downside**: *Very* complex architecture
    * We do not expect you to understand every detail of it, only the core ideas

In [15]:
Image(url='../img/transformers.png'+'?'+str(random.random()), width=400)

Vaswani et al. (2017), “Attention Is All You Need”. https://arxiv.org/abs/1706.03762

## Positional Encoding ##

* Project positions in sequence to fixed-length vectors, same dimensionality as word embeddings
* Positional embeddings for similar positions are similar
* Obtained using a transformation function
* Is added to each input embedding

In [16]:
Image(url='../img/positional_1.png'+'?'+str(random.random()), width=1200)

## Positional Encoding ##

The transformation function used in Vaswani et al. (2017) is static (note: an alternative is to learn the positional embeddings jointly with the model).

In [17]:
Image(url='../img/positional_2.png'+'?'+str(random.random()), width=1200)

Picture source: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
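
The static function from Vaswani et al. (2017) uses sines and cosines of different frequencies: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. Below is a minimal NumPy sketch of it, assuming an even `d_model`; the function name, the sizes, and the random stand-in embeddings are choices made for this illustration.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine.
    Assumes d_model is even."""
    positions = np.arange(n_positions)[:, None]      # shape (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(n_positions=50, d_model=16)

# Stand-in word embeddings (random, for illustration only).
embeddings = np.random.default_rng(0).normal(size=(50, 16))
inputs = embeddings + pe     # the encoding is added to each input embedding

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Neighbouring positions are more similar than distant ones.
print(cosine(pe[10], pe[11]), cosine(pe[10], pe[40]))
```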

<center><img src="../img/quiz_time.png"></center>

# Summary #

* Traditional neural word representations are static
    * Do not differ depending on context
* Contextualised representations are dynamic
    * Differ by context
    * Require a function (in practice: a trained model) that can return a word embedding given its context
* Popular neural architectures for learning contextual representations
    * ELMo
        * BiLSTM embeddings for words
    * Transformers
        * Feed-forward NNs + positional embeddings

# Outlook #

* *Loads* of Transformer variants
    * Near-impossible to keep up
* In the machine translation lecture, you will be introduced to
    * The most important one(s), including BERT
    * How to train them
    * How to use them in practice
    * How to use them for cross-lingual tasks
* Why in the machine translation lecture?
    * Many NLP innovations originate in machine translation

# Additional Reading #

* Blog posts ["The Illustrated Transformer"](http://jalammar.github.io/illustrated-transformer/) and ["The Illustrated BERT, ELMo, and co."](http://jalammar.github.io/illustrated-bert/) by Jay Alammar
    * Step-by-step walk-through of architectures
    * For those who want to get a more in-depth understanding of the architectures