In [1]:
!pip install -q transformers huggingface_hub
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

### Using pre-trained transformers (2 points)
_for fun and profit_

There are many toolkits that let you access pre-trained transformer models, but the most powerful and convenient by far is [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pre-trained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pre-trained model, you can do that in one line of code using a `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing
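
To make the last stage concrete, here is a minimal sketch of post-processing for a two-class sentiment model. The logits are made-up numbers and `postprocess` is a hypothetical helper, not part of the transformers API; a real pipeline gets the logits from the backbone and the label names from the model config.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pick the most probable class and report it as a label/score dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one sentence; a real backbone would produce these.
example_logits = [-4.0, 5.0]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = postprocess(example_logits, id2label)
print(result)
```

The result has the same `label` / `score` shape as the dictionaries a sentiment pipeline returns.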

Let's see it in action:

In [2]:
import transformers

In [3]:
classifier = transformers.pipeline(
    'sentiment-analysis',
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device = 0)

print(classifier("BERT is amazing!"))

[{'label': 'POSITIVE', 'score': 0.9998860359191895}]




In [4]:
import base64
data = {
    'arryn': 'As High as Honor.',
    'baratheon': 'Ours is the fury.',
    'stark': 'Winter is coming.',
    'tyrell': 'Growing strong.'
}

# YOUR CODE: predict sentiment for each noble house and create outputs dict
outputs = {}
for house, motto in data.items():
    result = classifier(motto)[0]
    outputs[house] = result['label'] == 'POSITIVE'

assert sum(outputs.values()) == 3 and outputs[base64.decodebytes(b'YmFyYXRoZW9u\n').decode()] == False
print("Well done!")
print(outputs)

Well done!
{'arryn': True, 'baratheon': False, 'stark': True, 'tyrell': True}


You can also access a vanilla masked language model that was trained to predict masked words. Here's how:

In [5]:
mlm_model = transformers.pipeline(
    'fill-mask',
    model="bert-base-uncased",
    device = 0)
MASK = mlm_model.tokenizer.mask_token

for hypo in mlm_model(f"Donald {MASK} is the president of the united states."):
    print(f"P={hypo['score']:.5f}", hypo['sequence'])

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another archite

P=0.99719 donald trump is the president of the united states.
P=0.00024 donald duck is the president of the united states.
P=0.00022 donald ross is the president of the united states.
P=0.00020 donald johnson is the president of the united states.
P=0.00018 donald wilson is the president of the united states.


In [6]:
# Your turn: use BERT to recall the year in which the Soviet Union was founded
result = mlm_model(f"The Soviet Union was established in year {MASK}.")
print("\nSoviet Union founding year predictions:")
for hypo in result:
    print(f"P={hypo['score']:.5f}", hypo['sequence'])


Soviet Union founding year predictions:
P=0.09128 the soviet union was established in year ii.
P=0.02685 the soviet union was established in year 1917.
P=0.02466 the soviet union was established in year 1945.
P=0.02274 the soviet union was established in year iii.
P=0.02097 the soviet union was established in year 1918.




Huggingface offers hundreds of pre-trained models that specialize in different tasks. You can quickly find the model you need using [this list](https://huggingface.co/models).


In [7]:
text = """Almost two-thirds of the 1.5 million people who viewed this liveblog had Googled to discover
 the latest on the Rosetta mission. They were treated to this detailed account by the Guardian’s science editor,
 Ian Sample, and astronomy writer Stuart Clark of the moment scientists landed a robotic spacecraft on a comet
 for the first time in history, and the delirious reaction it provoked at their headquarters in Germany.
  “We are there. We are sitting on the surface. Philae is talking to us,” said one scientist.
"""

# Task: create a pipeline for named entity recognition, use task name 'ner' and search for the right model in the list
ner_model = transformers.pipeline(
    'ner',
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    device = 0)

named_entities = ner_model(text)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
print("\nNamed Entities:")
word_to_entity = {item['word']: item['entity'] for item in named_entities}
print(word_to_entity)
assert 'org' in word_to_entity.get('Guardian').lower() and 'per' in word_to_entity.get('Stuart').lower()
print("All tests passed")


Named Entities:
{'Google': 'I-MISC', 'Rose': 'I-MISC', '##tta': 'I-MISC', 'Guardian': 'I-ORG', 'Ian': 'I-PER', 'Sam': 'I-PER', '##ple': 'I-PER', 'Stuart': 'I-PER', 'Clark': 'I-PER', 'Germany': 'I-LOC', 'Phil': 'I-PER', '##ae': 'I-PER'}
All tests passed


In [9]:
print('OUTPUT:', named_entities)
word_to_entity = {item['word']: item['entity'] for item in named_entities}
assert 'org' in word_to_entity.get('Guardian').lower() and 'per' in word_to_entity.get('Stuart').lower()
print("All tests passed")

OUTPUT: [{'entity': 'I-MISC', 'score': 0.880311, 'index': 19, 'word': 'Google', 'start': 73, 'end': 79}, {'entity': 'I-MISC', 'score': 0.900507, 'index': 27, 'word': 'Rose', 'start': 112, 'end': 116}, {'entity': 'I-MISC', 'score': 0.95096254, 'index': 28, 'word': '##tta', 'start': 116, 'end': 119}, {'entity': 'I-ORG', 'score': 0.99925345, 'index': 40, 'word': 'Guardian', 'start': 179, 'end': 187}, {'entity': 'I-PER', 'score': 0.999201, 'index': 46, 'word': 'Ian', 'start': 207, 'end': 210}, {'entity': 'I-PER', 'score': 0.9994999, 'index': 47, 'word': 'Sam', 'start': 211, 'end': 214}, {'entity': 'I-PER', 'score': 0.99649787, 'index': 48, 'word': '##ple', 'start': 214, 'end': 217}, {'entity': 'I-PER', 'score': 0.9991856, 'index': 53, 'word': 'Stuart', 'start': 240, 'end': 246}, {'entity': 'I-PER', 'score': 0.99964833, 'index': 54, 'word': 'Clark', 'start': 247, 'end': 252}, {'entity': 'I-LOC', 'score': 0.9998211, 'index': 85, 'word': 'Germany', 'start': 413, 'end': 420}, {'entity': 'I-PER
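
Note that the model labels WordPiece tokens, so "Rosetta" and "Sample" come back split into pieces such as `Rose` + `##tta`. Below is a small sketch of how such pieces could be glued back together. `merge_wordpieces` is a hypothetical helper written for this example; the sample entries are copied from the output above with scores omitted. Recent versions of the library also expose an `aggregation_strategy` option on the NER pipeline for this purpose, so check what your installed version supports before rolling your own.

```python
def merge_wordpieces(entities):
    """Glue '##'-prefixed WordPiece continuations onto the preceding token."""
    merged = []
    for item in entities:
        if item["word"].startswith("##") and merged:
            # Continuation piece: extend the previous word and its character span.
            merged[-1]["word"] += item["word"][2:]
            merged[-1]["end"] = item["end"]
        else:
            merged.append({"word": item["word"], "entity": item["entity"],
                           "start": item["start"], "end": item["end"]})
    return merged

# A few entries copied from the NER output (scores omitted).
sample = [
    {"entity": "I-MISC", "word": "Rose", "start": 112, "end": 116},
    {"entity": "I-MISC", "word": "##tta", "start": 116, "end": 119},
    {"entity": "I-ORG", "word": "Guardian", "start": 179, "end": 187},
    {"entity": "I-PER", "word": "Sam", "start": 211, "end": 214},
    {"entity": "I-PER", "word": "##ple", "start": 214, "end": 217},
]
words = [(e["word"], e["entity"]) for e in merge_wordpieces(sample)]
print(words)
```

This sketch keeps the label of the first piece and does not merge adjacent whole words such as "Ian" and "Sample" into one entity.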

### The building blocks of a pipeline

Huggingface also allows you to access its pipelines at a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a PyTorch `nn.Module` with pre-trained weights

You can use such models as part of your regular PyTorch code: insert it as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.

In [10]:
tokenizer = transformers.AutoTokenizer.from_pretrained('bert-base-uncased')
model = transformers.AutoModel.from_pretrained('bert-base-uncased')

In [11]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    ]

# tokenize a batch of inputs. "pt" means [p]y[t]orch tensors
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

for key in tokens_info:
    print(key, tokens_info[key])

print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

input_ids tensor([[ 101, 5355, 1010, 1045, 2572, 2115, 2269, 1012,  102,    0,    0,    0,
            0,    0,    0],
        [ 101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
         3488, 1012,  102]])
token_type_ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Detokenized:
[CLS] luke, i am your father. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] life is what happens when you ' re busy making other plans. [SEP]
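
The padding and `attention_mask` in this output can be reproduced by hand. The sketch below assumes right-padding with id 0 (which decodes to `[PAD]` above); `pad_batch` is a hypothetical helper, not a tokenizer method.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Token ids copied from the output above, with the padding stripped.
luke = [101, 5355, 1010, 1045, 2572, 2115, 2269, 1012, 102]
life = [101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
        3488, 1012, 102]
ids, mask = pad_batch([luke, life])
print(mask[0])
```

The first mask row has nine ones followed by six zeros, matching the `attention_mask` printed by the tokenizer.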


In [12]:
# You can now apply the model to get embeddings
with torch.no_grad():
    out = model(**tokens_info)

print(out['pooler_output'])

tensor([[-0.8854, -0.4722, -0.9392,  ..., -0.8081, -0.6955,  0.8748],
        [-0.9297, -0.5161, -0.9334,  ..., -0.9017, -0.7492,  0.9201]])


Transformers knowledge hub: https://huggingface.co/transformers/

### Build-a-transformer (2 points)

In this section, you will implement a transformer language model layer by layer, then use it to generate (hopefully) coherent text.

To understand how these layers work, please check out our guide to transformers from [nlp course for you -> transformers](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#transformer_intro).


First, we download pre-trained weights for the [GPT2 model by OpenAI](https://openai.com/research/better-language-models) - a prominent model from 2019.



Idea & code by: Ilya Beletsky

In [13]:
from huggingface_hub import hf_hub_download
state_dict = torch.load(hf_hub_download("gpt2", filename="pytorch_model.bin"), map_location="cpu", weights_only=True)
for key, value in tuple(state_dict.items()):
    if key.startswith('h.') and key.endswith('.weight') and value.ndim == 2:
        value.transpose_(1, 0)  # <-- for compatibility with modern PyTorch modules
    if key.startswith('h.') and key.endswith('.attn.bias') and value.ndim == 4:
        state_dict.pop(key)  # <-- triangular binary masks, not needed in this code

print('Weights:', repr(sorted(state_dict.keys()))[:320], '...')



Weights: ['h.0.attn.c_attn.bias', 'h.0.attn.c_attn.weight', 'h.0.attn.c_proj.bias', 'h.0.attn.c_proj.weight', 'h.0.ln_1.bias', 'h.0.ln_1.weight', 'h.0.ln_2.bias', 'h.0.ln_2.weight', 'h.0.mlp.c_fc.bias', 'h.0.mlp.c_fc.weight', 'h.0.mlp.c_proj.bias', 'h.0.mlp.c_proj.weight', 'h.1.attn.c_attn.bias', 'h.1.attn.c_attn.weight', 'h.1. ...
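
A note on the transpose above: the GPT-2 checkpoint stores these 2-D weights in `[in_features, out_features]` layout and applies them as `x @ W`, whereas `nn.Linear` keeps `[out_features, in_features]` and computes `x @ W.T`. The toy sketch below, in plain Python with made-up numbers and hypothetical helper names, checks that the two conventions agree once the weight is transposed.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def checkpoint_style(x, w_in_out):
    """y = x @ W, with W stored as [in_features, out_features]."""
    return matmul(x, w_in_out)

def linear_style(x, w_out_in):
    """y = x @ W.T, with W stored as [out_features, in_features]."""
    return matmul(x, transpose(w_out_in))

x = [[1.0, 2.0]]                               # one input vector, in_features=2
w_in_out = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]  # [in=2, out=3], toy values

y_conv = checkpoint_style(x, w_in_out)
y_lin = linear_style(x, transpose(w_in_out))   # transpose once, as in the cell above
print(y_conv, y_lin)
```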


In the next few cells, we shall implement the model layer by layer to make use of those weights.

As you might recall, transformers contain two main layer types: attention and fully-connected layers.

The fully connected layers are far easier to understand, so we shall begin there:

Please implement a fully connected layer __without residual or layer normalization__ (we'll add those in a bit).

In [14]:
class GeLUThatWasUsedInGPT2(nn.Module):
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

class FullyConnected(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.c_fc = nn.Linear(dim, 4 * dim)
        self.gelu = GeLUThatWasUsedInGPT2()
        self.c_proj = nn.Linear(4 * dim, dim)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x
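
The activation above is the tanh approximation of GELU. As a sanity check, the sketch below compares it with the exact definition based on the Gaussian error function, using only the standard library. The function names and the grid of test points are choices made for this example.

```python
import math

def gelu_tanh(x):
    """Tanh approximation, same formula as GeLUThatWasUsedInGPT2."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_exact(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare the two on an evenly spaced grid over [-5, 5].
points = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_tanh(p) - gelu_exact(p)) for p in points)
print(f"max |tanh - exact| on [-5, 5]: {max_gap:.6f}")
```

The gap is small, but the pre-trained weights were trained with the tanh form, so matching it exactly is the safer choice when reproducing reference checksums.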

Now, let's test that it works with GPT-2 weights:

In [15]:
mlp = FullyConnected(dim=768)
mlp.load_state_dict({'c_fc.weight': state_dict['h.0.mlp.c_fc.weight'],
                     'c_fc.bias': state_dict['h.0.mlp.c_fc.bias'],
                     'c_proj.weight': state_dict['h.0.mlp.c_proj.weight'],
                     'c_proj.bias': state_dict['h.0.mlp.c_proj.bias']})

torch.manual_seed(1337)
x = torch.randn(1, 2, 768)  # [batch_size, sequence_length, dim]
checksum = torch.sum(mlp(x) * x)
assert abs(checksum.item() - 1282.3315) < 0.1, "layer outputs do not match reference"
assert torch.allclose(mlp(x[:, (1, 0), :])[:, (1, 0), :], mlp(x)), "mlp must be permutation-equivariant"
print("Seems legit!")

Seems legit!


Now, let's get to attention layers.

Since GPT-2 needs to generate text from left to right, each generated token can only attend to tokens on the left (and itself). This kind of attention is called "masked" self-attention, because it hides tokens to the right.
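
A tiny worked example of the masking idea, in plain Python with toy scores (all zeros, an assumption made so the result is easy to read). Row `i` of the output holds the attention weights of token `i`; positions to its right get exactly zero weight. Slicing each row to the visible prefix has the same effect as filling the hidden positions with a large negative number before the softmax.

```python
import math

def causal_attention_weights(scores):
    """Softmax over each row, restricted to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                    # token i sees itself and the left context
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)       # tokens to the right receive zero weight
        weights.append([e / total for e in exps] + hidden)
    return weights

scores = [[0.0] * 4 for _ in range(4)]            # toy 4x4 attention scores
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

With equal scores, token `i` spreads its attention uniformly over the `i + 1` positions it is allowed to see.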

As before, please implement masked self-attention __without layernorm or residual connections.__

In [16]:
class MaskedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.c_attn = nn.Linear(dim, dim * 3)  # query + key + value
        self.c_proj = nn.Linear(dim, dim)  # output projection
        self.dim, self.num_heads = dim, num_heads
        self.head_size = dim // num_heads

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.dim, dim=-1)

        # Reshape and transpose for multi-head attention
        q = q.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_size).transpose(1, 2)

        # Compute attention scores and apply scaling
        att = (q @ k.transpose(-2, -1)) * (1.0 / (self.head_size ** 0.5))

        # Apply causal mask
        mask = torch.tril(torch.ones(T, T, device=x.device), diagonal=0)
        att = att.masked_fill(mask == 0, -1e9)

        # Apply softmax and multiply with value vectors
        att = F.softmax(att, dim=-1)
        y = att @ v

        # Re-assemble heads and apply final projection
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

Test that it works:

In [17]:
attn = MaskedSelfAttention(dim=768, num_heads=12)
attn.load_state_dict({'c_attn.weight': state_dict['h.0.attn.c_attn.weight'],
                      'c_attn.bias': state_dict['h.0.attn.c_attn.bias'],
                      'c_proj.weight': state_dict['h.0.attn.c_proj.weight'],
                      'c_proj.bias': state_dict['h.0.attn.c_proj.bias']})

torch.manual_seed(1337)
x = torch.randn(1, 10, 768)  # [batch_size, sequence_length, dim]
checksum = torch.sum(attn(x) * x)
assert abs(checksum.item() - 2703.6772) < 0.1, "layer outputs do not match reference"
assert not torch.allclose(attn(x[:, (1, 0), :])[:, (1, 0), :], attn(x[:, (0, 1), :])), "masked attention must *not* be permutation-equivariant"
print("It works!")

It works!


We can now combine attention and MLP to build the full transformer layer:

![img](https://i.imgur.com/1sq2vHO.png)

In [18]:
class TransformerLayer(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(dim)
        self.attn = MaskedSelfAttention(dim, num_heads)
        self.ln_2 = nn.LayerNorm(dim)
        self.mlp = FullyConnected(dim)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

In [19]:
layer = TransformerLayer(dim=768, num_heads=12)
layer.load_state_dict({k[5:]: v for k, v in state_dict.items() if k.startswith('h.10.')})
assert abs(torch.sum(layer(x) * x).item() - 9874.7383) < 0.1
print("Good job!")

Good job!


In [20]:
class GPT2(nn.Module):
    def __init__(self, vocab_size: int, dim: int, num_heads: int, num_layers: int, max_position_embeddings: int = 1024):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, dim)  # token embeddings
        self.wpe = nn.Embedding(max_position_embeddings, dim)  # position embeddings
        self.ln_f = nn.LayerNorm(dim)   # final layer norm - goes after all transformer layers, but before logits

        self.h = nn.Sequential(*(TransformerLayer(dim, num_heads) for layer in range(num_layers)))

    def forward(self, input_ids):
        # input_ids.shape: [batch_size, sequence_length], int64 token ids
        position_ids = torch.arange(input_ids.shape[1], device=input_ids.device).unsqueeze(0)

        token_embeddings = self.wte(input_ids)
        position_embeddings = self.wpe(position_ids)
        full_embeddings = token_embeddings + position_embeddings

        transformer_output = self.h(full_embeddings)
        transformer_output_ln = self.ln_f(transformer_output)

        # final layer: we predict logits by re-using token embeddings as linear weights
        output_logits = transformer_output_ln @ self.wte.weight.T
        return output_logits
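
The last step of `forward` re-uses the token embedding matrix as the output projection (weight tying): the logit of each vocabulary entry is the dot product between the final hidden state and that entry's embedding. Here is a toy sketch with a made-up 3-token vocabulary and 2-dimensional embeddings; all numbers and the `tied_logits` name are illustrative assumptions.

```python
def tied_logits(hidden, embedding_matrix):
    """One logit per vocabulary entry: dot product of hidden state and embedding row."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding_matrix]

# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up values).
embedding_matrix = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, -1.0],
]
hidden = [0.2, 0.9]  # pretend output of the final layer norm for the last position

logits = tied_logits(hidden, embedding_matrix)
next_token = max(range(len(logits)), key=logits.__getitem__)
print(logits, next_token)
```

The hidden state points mostly along the second embedding, so token 1 gets the largest logit.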


In [21]:
tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2')
model_gpt2 = GPT2(vocab_size=50257, dim=768, num_heads=12, num_layers=12)
model_gpt2.load_state_dict(state_dict)

# Text generation
input_ids = tokenizer("A quick", return_tensors='pt')['input_ids']
with torch.no_grad():
    predicted_logits = model_gpt2(input_ids)
most_likely_token_id = predicted_logits[:, -1].argmax().item()
print("\nPrediction:", tokenizer.decode(most_likely_token_id))


Prediction:  look


In [22]:
text = "The Fermi paradox "
tokens = tokenizer.encode(text)
print(end=tokenizer.decode(tokens))
line_length = len(tokenizer.decode(tokens))

for i in range(500):
    # Predict logits with your model
    with torch.no_grad():
        logits = model_gpt2(torch.as_tensor([tokens]))

    # Sample with probabilities
    p_next = torch.softmax(logits[0, -1, :], dim=-1).data.cpu().numpy()
    next_token_index = np.random.choice(len(p_next), p=p_next)

    tokens.append(int(next_token_index))
    print(end=tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length > 120:
        line_length = 0
        print()



The Fermi paradox  is that even institutions of all forms may express in their lower classes the enthusiasm for the solution
 of its problems and very little desire to compromise with their higher forms. For this reason social reformers generally
 emphasized organizations that spontaneously want to achieve success and janitors who do not buy gardening supplies have sometimes
 been chosen to organize for the common good of the environment, unless they are organized on behalf of the more extreme names
. Instead of compromise, they often define their problems with concepts that often do not take into account community sources
, such as local bicyclists, committees of ordinary people living together in their environment, but that can cooperate with
 or harm various utility utilities because they use services such as fossil fuels, bikes or hogs. In such campaigns, a late
-nineteenth-century party called NASA fell under the influence lustuanstype because of similar grievances and the designa
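
The loop above samples from the full softmax distribution, which is one reason the text drifts. Two common variations, not used in the cell above, are temperature scaling and top-k filtering. The sketch below shows both in plain Python; `sample_next_token` and the toy logits are assumptions made for illustration.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index after temperature scaling and optional top-k filtering.

    temperature must be positive; values below 1 sharpen the distribution.
    """
    # Candidate indices, sorted from the highest logit to the lowest.
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]                  # keep only the k most likely tokens
    scaled = [logits[i] / temperature for i in indices]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax weights
    return rng.choices(indices, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.1, -3.0]
rng = random.Random(0)
print(sample_next_token(toy_logits, top_k=1, rng=rng))                    # greedy choice
print(sample_next_token(toy_logits, temperature=0.5, top_k=2, rng=rng))   # 0 or 1
```

With `top_k=1` the function reduces to greedy decoding, the same choice `argmax` makes in the single-token prediction cell.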

__Reminder:__ after class, please go to `MaskedSelfAttention.forward` above and finish the job!

### Here's how you can do the same with the transformers library

In [23]:
tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = transformers.AutoModelForCausalLM.from_pretrained('gpt2')
print('Generated text:', tokenizer.decode(
    model.generate(
        **tokenizer("The Fermi paradox ", return_tensors='pt'),
        do_sample=True, max_new_tokens=50
    ).flatten().numpy()
))


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Generated text:  The Fermi paradox  is that the world around us is very close to nothing: there are no galaxies with small circles, there are only galaxies with huge and beautiful shapes, there are no vast "dows of light" in this sky, at least the one around
