# Announcements

* HW4 is due October 29, 2021 at midnight
* Please fill out the project progress survey! It can be found here: https://ubgse.iad1.qualtrics.com/jfe/form/SV_6KCM0Zy65nkwW2i and filling it out will count toward your participation grade. It also helps us ensure you get the highest possible grade on your final project.

# Recurrent Neural Networks

Elman (1990): Finding structure in time. https://doi.org/10.1207/s15516709cog1402_1

Recurrent Neural Networks, or RNNs, resemble `word2vec` and the n-gram language models we learned about before in that their job is to **predict what the next word is going to be** as well as possible. The particular insight behind recurrent neural networks was that we can represent the **context** as something like a word vector.

The earliest instantiation of Elman's RNN was a **character language model**, or a model that learned to predict the next letter in a sequence. Also known as a **simple recurrent network**, his model had a vocabulary of about 40 characters (26 English letters + 10 digits). This type of model took weeks to run on the machines of the time, even though the same task would take about 30 seconds on your own laptop.

But as we gained computing power, it became feasible to extend his small RNN to large vocabularies from whole corpora. Starting in the early 2010s, we were finally able to compute sophisticated RNN representations rather than work with toy examples.

## Four critical components in RNNs

<!-- Here is some python pseudocode where we have a hidden layer (just like in `word2vec`) of $k$ dimensions: -->

<!-- ```python
for i, word in enumerate(sentence[:-1]):  # the last word has no next word to predict
  one_hot_word_vector = make_bag_of_words([word])
  one_hot_next_word_vector = make_bag_of_words([sentence[i + 1]])
  if i == 0:
    hidden_units = randomly_initialize(dimensionality=k)
    recurrent_units = zero_initialize(dimensionality=k)
  else:
    recurrent_units = hidden_units # set to previous hidden state
    hidden_units = recompute_hidden_from(one_hot_word_vector)
  concatenated_hidden = concat(hidden_units, recurrent_units)
  prediction = predict(concatenated_hidden)
  error = compute_loss(prediction, one_hot_next_word_vector)
``` -->

<center><img src="https://www.oreilly.com/library/view/keras-2x-projects/9781789536645/assets/8bf6fccb-4bdf-4542-b095-1791a7e2ca88.png" width=550 /></center>


1. Input representations (a one-hot bag-of-words representation of the current word)
2. Hidden units (just like in `word2vec`)
3. Recurrent units (which hold a "copy" of (2) from the previous cycle)
4. Output representations (a one-hot bag-of-words representation of the next word)

On the surface, this is very different from n-gram language modeling. But, the outcome is similar.

The major contribution of the RNN is that it predicts the next output (just like an n-gram language model) using a **latent representation** of the context. That is, both the _current word_ and _all the prior words_ contribute to the prediction.
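The four components can be sketched in a few lines of NumPy. This is a minimal illustration rather than Elman's actual model: the toy vocabulary, the weight names (`W_in`, `W_rec`, `W_out`), the hidden size, and the `tanh` activation are all assumptions made for the example, and the weights are random instead of trained.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "down"]   # toy vocabulary (hypothetical)
V, k = len(vocab), 3                    # vocabulary size, hidden size

# Randomly initialized weights; a real model would learn these by training.
W_in = rng.normal(scale=0.1, size=(k, V))    # input units -> hidden units
W_rec = rng.normal(scale=0.1, size=(k, k))   # recurrent units -> hidden units
W_out = rng.normal(scale=0.1, size=(V, k))   # hidden units -> output units

def one_hot(word):
    """(1) Input representation: a one-hot vector for the current word."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

def softmax(scores):
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

def rnn_step(word, recurrent_units):
    """One cycle: mix the current word with the copied previous hidden state."""
    # (2) Hidden units, computed from the input and (3) the recurrent units.
    hidden_units = np.tanh(W_in @ one_hot(word) + W_rec @ recurrent_units)
    # (4) Output representation: a probability for every word in the vocabulary.
    next_word_probs = softmax(W_out @ hidden_units)
    return hidden_units, next_word_probs

recurrent_units = np.zeros(k)   # no context before the first word
for word in ["the", "cat", "sat"]:
    # The new hidden state becomes the "copy" used in the next cycle.
    recurrent_units, next_word_probs = rnn_step(word, recurrent_units)

print({w: round(float(p), 3) for w, p in zip(vocab, next_word_probs)})
```

Because `recurrent_units` is carried from one cycle to the next, the final prediction depends on every word read so far, not only on the current one.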

Cool historical page on SRNs from Jay McClelland's research group: https://web.stanford.edu/group/pdplab/pdphandbook/handbookch8.html


## Some caveats about Simple Recurrent Networks (SRNs)

Elman's original models work well over simplified inputs and outputs, such as small corpora. But, **hidden states can slowly get corrupted**, so the model is not guaranteed to work very well for long sequences. That is, it might forget some of what it has read.

So, while SRNs are better than n-gram language models for prediction, they still need more tricks to better remember the past. For that, researchers implemented Long Short-Term Memory ([Hochreiter & Schmidhuber, 1997](https://doi.org/10.1162/neco.1997.9.8.1735)) and Attention ([Bahdanau, Cho, & Bengio, 2014](https://arxiv.org/abs/1409.0473)). These help models to better remember what they have just seen.

## Word representations to date

To date, we have covered the following algorithms that learn word vectors:

* Latent Semantic Analysis (LSA) via Principal Component Analysis (PCA)
* word2vec (the continuous bag-of-words version)
* Latent Dirichlet Allocation (LDA) for topic model representations of words

What are the features of each of the dimensions learned by these methods? What is the input and the output?

**LSA (Latent Semantic Analysis)**
<details>
<summary>Each dimension corresponds to </summary>
  Its distance along an axis defined by learning the best characterization of the subspace
</details>
<details>
<summary>The input to LSA is </summary>
  A set of document representations in bag-of-words format -- a document matrix
</details>
<details>
<summary>The result of LSA is </summary>
  A subspace that projects all words $w$ in a vocabulary $V$ into a lower-dimensional space ($|V| \times k$)
</details>

**word2vec (continuous bag-of-words)**
<details>
<summary>Each dimension corresponds to</summary>
A compressed representation learned by predicting "held-out" words from a context
</details>
<details>
<summary>The input to word2vec is</summary>
Document representations presented one at a time until the model converges on a solution. These document representations are like bag-of-words representations, except the model predicts a held-out word that is not included in the counts
</details>
<details>
<summary>The result of word2vec is </summary>
  Two matrices that (1) turn one-hot bag-of-words representations into word vectors and (2) turn a context vector [from the document representation] into predictions (probabilities) of held-out words
</details>

**LDA (Latent Dirichlet Allocation)**
Unlike `word2vec` and LSA, LDA does not return a word vector representation exactly. Like the vectors learned by PCA...
<details>
<summary>Each dimension of a "word vector" corresponds to</summary>
A "weight" in the form of a probability that a given word $w$ belongs to a given topic $t$
</details>

But, unlike the vectors we learn in LSA:

<details>
<summary>Each dimension corresponds to</summary>
The dimensions are arbitrary -- PCA gives us vectors ordered by their importance. LDA treats each topic separately.
</details>

And, because the values learned by LDA for a given word are **probabilities**, some similarity math (e.g., cosine similarity or dot products) is challenging. This motivates contextual methods that learn word representations that still behave nicely in standard geometric spaces.

**RNN (Recurrent Neural Network)**
Using a more complex neural structure that holds onto prior states in memory, we can learn contextual word representations.
<details>
<summary>Each dimension of a "word vector" corresponds to</summary>
The hidden state (a float vector) prior to the output layer (a vector of probabilities of the next word).
</details>


# ELMo

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer

https://aclanthology.org/N18-1202.pdf

In [1]:
# doing this instead of trying to use Allen Institute's package
!pip install simple-elmo

Collecting simple-elmo
  Downloading simple_elmo-0.8.0-py3-none-any.whl (45 kB)
Installing collected packages: simple-elmo
Successfully installed simple-elmo-0.8.0


In [21]:
from simple_elmo import ElmoModel
model = ElmoModel()

In [23]:
!wget http://vectors.nlpl.eu/repository/20/193.zip

--2021-10-22 13:40:33--  http://vectors.nlpl.eu/repository/20/193.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.181
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 207048949 (197M) [application/zip]
Saving to: ‘193.zip’


2021-10-22 13:40:42 (22.5 MB/s) - ‘193.zip’ saved [207048949/207048949]



In [25]:
model.load('./193.zip')

2021-10-22 13:41:19,492 : INFO : Loading model from ./193.zip...
2021-10-22 13:41:19,499 : INFO : 
            Assuming the model is a ZIP archive downloaded from the NLPL vector repository.
            Loading a model from a ZIP archive directly is slower than from the extracted files,
            but does not require additional disk space
            and allows to load from directories without write permissions.
            
2021-10-22 13:41:19,505 : INFO : We will cache the vocabulary of 100 tokens.


'The model is now loaded.'

In [26]:
model.get_elmo_vectors(["Zip it!", "Zip, nada, nothing!"])

2021-10-22 14:28:19,219 : INFO : Warming up ELMo on 2 sentences...
2021-10-22 14:28:21,030 : INFO : Warming up finished.
2021-10-22 14:28:21,033 : INFO : Texts in the current batch: 2


array([[[ 0.32432568,  0.28965333,  0.0379729 , ..., -0.8233701 ,
         -0.1367383 ,  0.38950109],
        [ 0.62001121,  0.60724926,  0.93417072, ..., -0.57243121,
          0.03640234,  0.49224213],
        [ 0.49897772,  0.14082819,  0.17999473, ..., -0.616813  ,
         -0.07496287,  0.17681247],
        ...,
        [-0.0468157 ,  0.01463903,  0.02492291, ..., -0.05406092,
         -0.02305393,  0.05983648],
        [-0.0468157 ,  0.01463903,  0.02492291, ..., -0.05406092,
         -0.02305393,  0.05983648],
        [-0.0468157 ,  0.01463903,  0.02492291, ..., -0.05406092,
         -0.02305393,  0.05983648]],

       [[ 0.32319164,  0.29203686,  0.03212083, ..., -0.54077303,
         -0.12535056,  0.04922635],
        [ 0.62367535,  0.61293435,  0.9291364 , ..., -0.45413837,
         -0.06636535,  0.22895014],
        [ 0.49826235,  0.14131977,  0.18002361, ..., -0.7542783 ,
          0.05295011,  0.02227798],
        ...,
        [ 0.50391281,  0.2338098 ,  0.30142918, ..., -

# Transformers

In [2]:
# using the HuggingFace implementation
!pip install transformers

Collecting transformers
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempti

## BERT

[BERT (Devlin et al., 2019)](https://aclanthology.org/N19-1423/) was one of the first widely available pretrained Transformer models and became the basis for many modern Transformer architectures. It came out only a few months after ELMo, and completely changed the face of NLP.

Like word2vec and ELMo, BERT was trained on prediction tasks. It uses two:

1. Masked token prediction
  (A token is hidden from the sentence, and BERT has to guess what it is)
2. Next-sentence classification
  (Randomly pick a sentence $s$ -- is $s$ the real next sentence in a document?)

BERT-base returns 13 layers of hidden states: the output of the embedding layer plus 12 Transformer layers (its full architecture is much bigger, with other components). Each hidden layer has 768 dimensions. The predictions BERT makes are probabilities assigned to each unique vocabulary item -- approximately 30,000 word pieces. So, BERT does not break sentences up like people do -- it first segments text by whitespace and punctuation and then applies the WordPiece algorithm, a close relative of byte pair encoding (BPE). This means that BERT does not predict characters or words -- it predicts a **subword**.
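To make the idea of a subword concrete, here is a minimal sketch of greedy longest-match-first splitting, the lookup procedure that WordPiece-style tokenizers such as BERT's apply to each word once a vocabulary has been fixed. The function name and the tiny `toy_vocab` are invented for illustration; the real vocabulary is learned from data and is far larger, and this sketch does not cover how it is learned.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]   # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, invented for this example.
toy_vocab = {"predict", "##s", "sub", "##word", "B", "##ER", "##T"}

for word in ["predicts", "subword", "BERT", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

A word that is in the vocabulary comes back whole; anything else is broken into the longest pieces the vocabulary allows, which is why rare words end up as several subwords.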

We will return to the problem of subwords after we get started with a single sentence.

## Using the `transformers` package can be tricky

The APIs change subtly for each model but are highly similar.

Here are some tips about the API to access information that BERT gives you. There are two tricky things about the HuggingFace API.

### <u>The output that `tokenizer` gives you</u>

When you ask BertTokenizer to break your text into tokens, it will give you a dictionary-like object called a `BatchEncoding` with three keys -- `"input_ids"`, `"token_type_ids"`, and `"attention_mask"`. 

_These values are NOT an array of strings_, unlike the output of `nltk.word_tokenize` or the string `.split()` method. They are also not `Token` objects like `spaCy` will give you. Instead, each value in the `BatchEncoding` is a `list` of integers (`int`s).

For the purposes of extracting embeddings from BERT, the values we care about most are the ones under the `"input_ids"` key.

In [3]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

tokenized = tokenizer("BERT does not predict characters or words -- it predicts a subword.")

In [4]:
type(tokenized)

transformers.tokenization_utils_base.BatchEncoding

In [5]:
tokenized.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [6]:
for key, value in tokenized.items():
  print(key, type(value))

input_ids <class 'list'>
token_type_ids <class 'list'>
attention_mask <class 'list'>


In [7]:
for key, value in tokenized.items():
  print(key, value)

input_ids [101, 139, 9637, 1942, 1674, 1136, 17163, 2650, 1137, 1734, 118, 118, 1122, 17163, 1116, 170, 4841, 12565, 119, 102]
token_type_ids [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


## <u>Working with wordpieces instead of words</u>

### The output is not human-readable off the bat
The outputs from the `tokenizer` are lists of numbers! This means, unfortunately, that to get a human-readable version of the tokenization, we have to **back-transform** the list of numbers into a list of strings (word pieces). To do that, we use the `tokenizer`'s `convert_ids_to_tokens` method. It takes a list of `int`s and returns a list of `str`s.

In [8]:
tokens_as_indices = tokenized['input_ids']
tokenizer.convert_ids_to_tokens(tokens_as_indices)

['[CLS]',
 'B',
 '##ER',
 '##T',
 'does',
 'not',
 'predict',
 'characters',
 'or',
 'words',
 '-',
 '-',
 'it',
 'predict',
 '##s',
 'a',
 'sub',
 '##word',
 '.',
 '[SEP]']

### The output requires being "reprocessed" for certain tasks. 

### We have to figure out what to do with each token
Most importantly, we have to make a pretty tricky decision about how to represent words in sentences. There are a few different things folks will try, though the strategy depends on how the `tokenizer` works. BERT has some handy notation:

1. If a token starts with `"##"`, it continues the word that the previous token started (it is in the middle or at the end of a word).
2. If a token starts with a regular character, it is the beginning of a word.
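The `"##"` convention is enough to recover word boundaries. The sketch below groups token positions into words; `group_word_pieces` is a hypothetical helper written for this example (not part of `transformers`), and the token list is copied from the `convert_ids_to_tokens` output above.

```python
def group_word_pieces(tokens, special_tokens=("[CLS]", "[SEP]")):
    """Return one list of token indices per word, skipping special tokens."""
    groups = []
    for index, token in enumerate(tokens):
        if token in special_tokens:
            continue
        if token.startswith("##") and groups:
            groups[-1].append(index)   # continuation: attach to the current word
        else:
            groups.append([index])     # word-initial piece: start a new word
    return groups

tokens = ['[CLS]', 'B', '##ER', '##T', 'does', 'not', 'predict', 'characters',
          'or', 'words', '-', '-', 'it', 'predict', '##s', 'a', 'sub', '##word',
          '.', '[SEP]']

groups = group_word_pieces(tokens)
for group in groups:
    # Strip the "##" marker when gluing the pieces back together.
    word = "".join(tokens[i][2:] if tokens[i].startswith("##") else tokens[i]
                   for i in group)
    print(group, word)
```

Keeping the indices (rather than the strings) is a deliberate choice: the same indices can later be used to pick rows out of a hidden-state tensor.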

## Strategies for representing words using word pieces

1. Take only the vector associated with the first token of each word (i.e., keep `'predict'` and leave off `'##s'`)
  * Least challenging because you can ignore all tokens where `token.startswith("##")==True`
  * With WordPiece, this always yields exactly one vector per word (as BERT's whitespace-and-punctuation split defines words)
2. Average all the vectors associated with a word (i.e., both `'predict'` and `'##s'`)
  * Requires keeping track of word boundaries
  * Need a way to know what indices correspond with what words
3. Take only the vector associated with the last token of each word (i.e., `'##s'`)
  * Similar challenges to (2)
  * Need to know the index of the very last word piece in each word
4. Take a weighted average of the vectors
  * Similar to (2), but you need to know how much to weight every vector
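The four strategies differ only in how the rows belonging to one word are combined. Here is a minimal NumPy sketch, assuming you already have a `(k, d)` array of word-piece vectors and a list of index groups (one list of token positions per word). The function name, the tiny 2-dimensional vectors, and the weights are made up for illustration.

```python
import numpy as np

def pool_word_vectors(token_vectors, groups, strategy="mean", weights=None):
    """Collapse word-piece vectors of shape (k, d) into one vector per word."""
    pooled = []
    for group in groups:
        pieces = token_vectors[group]            # shape (pieces_in_word, d)
        if strategy == "first":
            pooled.append(pieces[0])
        elif strategy == "last":
            pooled.append(pieces[-1])
        elif strategy == "mean":
            pooled.append(pieces.mean(axis=0))
        elif strategy == "weighted":
            # weights holds one number per token position (an assumption here)
            w = np.asarray([weights[i] for i in group], dtype=float)
            pooled.append((w[:, None] * pieces).sum(axis=0) / w.sum())
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return np.stack(pooled)

# Stand-in vectors for the pieces 'predict', '##s', 'a' (real ones have 768 dims).
token_vectors = np.array([[1.0, 0.0],
                          [3.0, 2.0],
                          [5.0, 5.0]])
groups = [[0, 1], [2]]            # 'predict' + '##s' form one word; 'a' is another

for strategy in ["first", "last", "mean"]:
    print(strategy, pool_word_vectors(token_vectors, groups, strategy).tolist())
print("weighted", pool_word_vectors(token_vectors, groups, "weighted",
                                    weights=[3, 1, 1]).tolist())
```

Every strategy returns one row per word, so the choice can be swapped without changing any downstream code.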


## Getting word vectors from BERT

In [9]:
from transformers import BertModel, BertTokenizer
import torch # pytorch

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased', output_hidden_states=True)

tokenized = tokenizer("BERT does not predict characters or words -- it predicts a subword.", return_tensors="pt")
embedded = model(**tokenized)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Below is the output that we are usually the most interested in, the **hidden states** of the model. These come back as a tuple of **tensors**, one per layer; a tensor is like a matrix but with additional axes. Taken together, the hidden states you get from these massive neural networks have the following shape:

* $n$ layers (`bert-base-cased` returns 13: the embedding output plus 12 Transformer layers)
* a batch axis (1 here, because we passed in a single sentence)
* $k$ tokens (the number of word pieces that `tokenizer` returns, including `[CLS]` and `[SEP]`)
* $d$ dimensions -- 768 in the case of BERT, but this can be more or less depending on the model
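To get a feel for this layout without loading a model, the sketch below builds a stand-in with the same shape out of random NumPy arrays. The values are meaningless; only the indexing pattern is the point. With the real output, `torch.stack` plays the role that `np.stack` plays here.

```python
import numpy as np

n_layers, batch_size, k_tokens, d_dims = 13, 1, 20, 768

# Stand-in for embedded['hidden_states']: a tuple with one array per layer.
rng = np.random.default_rng(0)
hidden_states = tuple(rng.normal(size=(batch_size, k_tokens, d_dims))
                      for _ in range(n_layers))

print(len(hidden_states))        # number of layers
print(hidden_states[0].shape)    # (batch, tokens, dimensions) for one layer

# Vector for the token at position 10 in layer 8; the batch axis is kept.
word_vector = hidden_states[8][:, 10]
print(word_vector.shape)

# Stacking the tuple gives one array indexed [layer, batch, token, dimension].
stacked = np.stack(hidden_states)
print(stacked.shape)
```

Keeping the batch axis (slicing with `[:, 10]` rather than `[0, 10]`) leaves a 2-D array, which is the input shape that functions such as scikit-learn's `cosine_similarity` expect.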

In [10]:
embedded['hidden_states']

(tensor([[[ 0.4496,  0.0977, -0.2074,  ...,  0.0578,  0.0406, -0.0951],
          [-1.2706,  0.5614, -0.7645,  ...,  0.7640,  0.2941, -0.0022],
          [ 0.5245, -0.3847, -0.9538,  ...,  0.6310, -0.5034,  1.3069],
          ...,
          [-0.2043,  1.1120, -1.0680,  ...,  0.7655,  0.6045, -0.4721],
          [-0.4639,  0.3316,  1.0178,  ...,  0.2741,  0.3736,  0.3975],
          [ 0.0884, -0.2736,  0.3528,  ...,  0.4735, -0.8893,  0.6046]]],
        grad_fn=<NativeLayerNormBackward>),
 tensor([[[ 0.2796, -0.0502, -0.0849,  ...,  0.0469,  0.0144, -0.1617],
          [-1.6258,  0.1819, -0.7633,  ...,  0.3595,  0.5181,  0.2415],
          [-0.5204, -0.5306, -1.7828,  ...,  0.3583, -0.5445,  1.2656],
          ...,
          [-0.0863,  1.3987, -0.7560,  ...,  0.8574,  0.7619, -0.1393],
          [-0.0533, -0.1755,  0.7517,  ..., -0.0993,  0.1082,  0.1899],
          [ 0.1205, -0.6172,  0.4939,  ...,  0.6538, -0.9159,  0.4433]]],
        grad_fn=<NativeLayerNormBackward>),
 tensor([[[ 4.

In [11]:
len(embedded['hidden_states'])

13

In [12]:
embedded['hidden_states'][0]

tensor([[[ 0.4496,  0.0977, -0.2074,  ...,  0.0578,  0.0406, -0.0951],
         [-1.2706,  0.5614, -0.7645,  ...,  0.7640,  0.2941, -0.0022],
         [ 0.5245, -0.3847, -0.9538,  ...,  0.6310, -0.5034,  1.3069],
         ...,
         [-0.2043,  1.1120, -1.0680,  ...,  0.7655,  0.6045, -0.4721],
         [-0.4639,  0.3316,  1.0178,  ...,  0.2741,  0.3736,  0.3975],
         [ 0.0884, -0.2736,  0.3528,  ...,  0.4735, -0.8893,  0.6046]]],
       grad_fn=<NativeLayerNormBackward>)

In [13]:
embedded['hidden_states'][0].shape # there are 20 tokens

torch.Size([1, 20, 768])

In [14]:
embedded_hidden_states = embedded['hidden_states']

layer_0 = embedded_hidden_states[0]
layer_8 = embedded_hidden_states[8]
# vector for the token at position 10 (the first '-') in both layers
word_vector_0 = layer_0[:, 10]
word_vector_8 = layer_8[:, 10]

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(word_vector_0.detach().numpy(), # detach() drops the gradient-tracking graph
                  word_vector_8.detach().numpy()) # so the tensor can be converted to a NumPy array

array([[0.24258463]], dtype=float32)
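For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. Here is a plain-Python sketch of that formula; the helper name `cosine` and the 2-dimensional vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))    # identical vectors
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))    # same direction, different length
print(cosine([1.0, 0.0], [-1.0, 0.0]))   # opposite directions
```

Values near 1 mean two vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions. A value of about 0.24 therefore says that the layer-8 vector for this token has moved a long way from its layer-0 vector.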

In [16]:
# Compare each token's layer-0 vector with its own vector in layers 4 and 8.
# The similarities shrink with depth: representations drift further from the
# input embeddings the higher you go in the model.

print("## self-similarity calculation ##")
layer_0 = embedded_hidden_states[0]
layer_4 = embedded_hidden_states[4]
layer_8 = embedded_hidden_states[8]
for i, token in enumerate(tokenizer.convert_ids_to_tokens(tokens_as_indices)):
  # vectors for token i in each layer, shape (1, 768)
  word_vector_0 = layer_0[:, i].detach().numpy()
  word_vector_4 = layer_4[:, i].detach().numpy()
  word_vector_8 = layer_8[:, i].detach().numpy()
  similarity_04 = cosine_similarity(word_vector_0, word_vector_4)[0][0]
  similarity_08 = cosine_similarity(word_vector_0, word_vector_8)[0][0]
  print(f"{token}: 0-8 {similarity_08}, 0-4 {similarity_04}, diff {similarity_08 - similarity_04}")

## self-similarity calculation ##
[CLS]: 0-8 0.5771727561950684, 0-4 0.6690681576728821, diff -0.09189540147781372
B: 0-8 0.3664797246456146, 0-4 0.5662189722061157, diff -0.1997392475605011
##ER: 0-8 0.3699747920036316, 0-4 0.5887740850448608, diff -0.21879929304122925
##T: 0-8 0.30947890877723694, 0-4 0.5722178220748901, diff -0.2627389132976532
does: 0-8 0.28126809000968933, 0-4 0.48458918929100037, diff -0.20332109928131104
not: 0-8 0.29040563106536865, 0-4 0.5163665413856506, diff -0.22596091032028198
predict: 0-8 0.31413304805755615, 0-4 0.5560460090637207, diff -0.24191296100616455
characters: 0-8 0.33515632152557373, 0-4 0.5313384532928467, diff -0.19618213176727295
or: 0-8 0.29807376861572266, 0-4 0.5508197546005249, diff -0.25274598598480225
words: 0-8 0.3394792973995209, 0-4 0.536274254322052, diff -0.19679495692253113
-: 0-8 0.24258463084697723, 0-4 0.36495283246040344, diff -0.12236820161342621
-: 0-8 0.2115265280008316, 0-4 0.447037011384964, diff -0.23551048338413239
it:

### What are some differences in the output of a standard tokenizer and the BertTokenizer?

In [17]:
# compare BERT's word-initial pieces with nltk's word tokens
import nltk
nltk.download("punkt")
from nltk import word_tokenize

count_of_initial_tokens = 0
for i, token in enumerate(tokenizer.convert_ids_to_tokens(tokens_as_indices)):
  if not token.startswith("##"):
    count_of_initial_tokens += 1
    print(token)
print(f"Number word-initial pieces: {count_of_initial_tokens}\n")
nltk_tokenized = word_tokenize("BERT does not predict characters or words -- it predicts a subword.")
print("----------------------------------")
print("\n".join(nltk_tokenized))
print(f"\nNumber of words via nltk: {len(nltk_tokenized)}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[CLS]
B
does
not
predict
characters
or
words
-
-
it
predict
a
sub
.
[SEP]
Number word-initial pieces: 16

----------------------------------
BERT
does
not
predict
characters
or
words
--
it
predicts
a
subword
.

Number of words via nltk: 13


## RoBERTa

https://arxiv.org/pdf/1907.11692.pdf

RoBERTa is a model that is very much like BERT, except that Facebook trained it instead of Google. The main differences are:

* It only predicts masked words
* It does not do next sentence classification
* It was trained on more data, for longer
* Generally outperforms "vanilla" BERT on downstream tasks

## GPT-2

GPT-2, unlike BERT and RoBERTa, is a fully generative model. Its goal is to _output the next word_ given only the words to its left. Like these other models, it predicts subword tokens, but its largest version has far more parameters (about 1.5 billion), and it can be used to create synthetic or fake texts.

In [18]:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
text = "BERT does not predict characters or words -- it predicts a subword."
encoded_input = tokenizer(text, return_tensors='pt')
print([tokenizer.convert_ids_to_tokens(x) for x in encoded_input['input_ids']])
output = model(**encoded_input)

[['BER', 'T', 'Ġdoes', 'Ġnot', 'Ġpredict', 'Ġcharacters', 'Ġor', 'Ġwords', 'Ġ--', 'Ġit', 'Ġpredicts', 'Ġa', 'Ġsub', 'word', '.']]


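GPT-2 also uses a different tokenization process, a byte-level version of BPE. A token that begins a new word after a space starts with a special character (`"Ġ"`, which stands for that space). Every other token, including one at the very start of the text such as `'BER'`, has no special character.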

**Quick thought exercise**: Can you see how this might be easier than for the wordpieces that BERT gives us?
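One way to explore the exercise is to rebuild words from the GPT-2 tokens printed above. The sketch below assumes that `"Ġ"` marks a token preceded by a space; `group_gpt2_tokens` is a hypothetical helper written for this example, not a `transformers` function.

```python
def group_gpt2_tokens(tokens, space_marker="Ġ"):
    """Glue GPT-2 tokens back into space-delimited words."""
    words = []
    for token in tokens:
        if token.startswith(space_marker):
            words.append(token[len(space_marker):])   # a space came before: new word
        elif not words:
            words.append(token)                       # very first token of the text
        else:
            words[-1] += token                        # no marker: continue the word
    return words

tokens = ['BER', 'T', 'Ġdoes', 'Ġnot', 'Ġpredict', 'Ġcharacters', 'Ġor',
          'Ġwords', 'Ġ--', 'Ġit', 'Ġpredicts', 'Ġa', 'Ġsub', 'word', '.']

words = group_gpt2_tokens(tokens)
print(words)
print(len(words))
```

Note that the final `'.'` carries no marker, so it stays attached to `'subword'`. The marker tracks spaces in the original text rather than linguistic word boundaries, which is worth keeping in mind when comparing this to BERT's `"##"` notation.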

# A comparison of contextual representations

https://aclanthology.org/D19-1006.pdf