In [5]:
import transformers

### Using pre-trained transformers (seminar is worth 2 points)
_for fun and profit_

There are many toolkits that let you access pre-trained transformer models, but the most powerful and convenient by far is [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pre-trained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pre-trained model, you can do that in one line of code using `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing
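
As a rough mental model, the three stages can be sketched with toy stand-ins. Everything below (the tiny `VOCAB`, the hand-set weights in `toy_model`, and the function names) is hypothetical and only mimics the shape of a pipeline; it is not how the library implements it.

```python
import math

# Hypothetical toy stand-ins for the three pipeline stages.
VOCAB = {"[UNK]": 0, "bert": 1, "is": 2, "amazing": 3, "awful": 4, "!": 5}
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def preprocess(text):
    """Stage 1: lowercase, split off '!', map tokens to integer ids."""
    tokens = text.lower().replace("!", " !").split()
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]

def toy_model(token_ids):
    """Stage 2: stand-in for the backbone, returns one logit per class."""
    weights = {3: (-2.0, 2.0), 4: (2.0, -2.0)}  # "amazing" / "awful"
    neg = sum(weights.get(t, (0.0, 0.0))[0] for t in token_ids)
    pos = sum(weights.get(t, (0.0, 0.0))[1] for t in token_ids)
    return [neg, pos]

def postprocess(logits):
    """Stage 3: softmax over the logits, then report the best label and score."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

def toy_pipeline(text):
    return postprocess(toy_model(preprocess(text)))

print(toy_pipeline("BERT is amazing!"))
```

The returned dictionary has the same `label` / `score` keys as the real pipeline output, which is why post-processing is listed as a separate stage.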

Let's see it in action:

In [6]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("BERT is amazing!"))

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


[{'label': 'POSITIVE', 'score': 0.9998860359191895}]


In [21]:
import base64
data = {
    'arryn': 'As High as Honor.',
    'baratheon': 'Ours is the fury.',
    'stark': 'Winter is coming.',
    'tyrell': 'Growing strong.'
}

# YOUR CODE: predict sentiment for each noble house and create outputs dict
# outputs = <YOUR CODE: dict (house name) : True if positive, False if negative>

model_output = classifier(list(data.values()))
outputs = {name: result['label'] == 'POSITIVE' for name, result in zip(data, model_output)}

assert sum(outputs.values()) == 3 and outputs[base64.decodebytes(b'YmFyYXRoZW9u\n').decode()] == False
print("Well done!")

Well done!


You can also access a vanilla masked language model (MLM) that was trained to predict masked words. Here's how:

In [23]:
mlm_model = pipeline('fill-mask', model="bert-base-uncased")
MASK = mlm_model.tokenizer.mask_token

for hypo in mlm_model(f"Donald {MASK} is the president of the united states."):
  print(f"P={hypo['score']:.5f}", hypo['sequence'])

All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


P=0.99719 donald trump is the president of the united states.
P=0.00024 donald duck is the president of the united states.
P=0.00022 donald ross is the president of the united states.
P=0.00020 donald johnson is the president of the united states.
P=0.00018 donald wilson is the president of the united states.


In [24]:
# Your turn: use BERT to recall the year in which the Soviet Union was founded
mlm_model(f"The Soviet Union was founded in {MASK} year.")

[{'score': 0.9056478142738342,
  'token': 2008,
  'token_str': 'that',
  'sequence': 'the soviet union was founded in that year.'},
 {'score': 0.08824586868286133,
  'token': 2023,
  'token_str': 'this',
  'sequence': 'the soviet union was founded in this year.'},
 {'score': 0.001775345765054226,
  'token': 1996,
  'token_str': 'the',
  'sequence': 'the soviet union was founded in the year.'},
 {'score': 0.0009746896103024483,
  'token': 2168,
  'token_str': 'same',
  'sequence': 'the soviet union was founded in same year.'},
 {'score': 0.0008080514962784946,
  'token': 2028,
  'token_str': 'one',
  'sequence': 'the soviet union was founded in one year.'}]

In [27]:
# Second attempt: put the mask at the end of the sentence so BERT proposes actual years
mlm_model(f"The Soviet Union was founded in year {MASK}.")

[{'score': 0.03702247887849808,
  'token': 2462,
  'token_str': 'ii',
  'sequence': 'the soviet union was founded in year ii.'},
 {'score': 0.030535712838172913,
  'token': 3386,
  'token_str': '1945',
  'sequence': 'the soviet union was founded in year 1945.'},
 {'score': 0.027637433260679245,
  'token': 4585,
  'token_str': '1917',
  'sequence': 'the soviet union was founded in year 1917.'},
 {'score': 0.023243509232997894,
  'token': 4271,
  'token_str': '1918',
  'sequence': 'the soviet union was founded in year 1918.'},
 {'score': 0.01753777638077736,
  'token': 4085,
  'token_str': '1949',
  'sequence': 'the soviet union was founded in year 1949.'}]
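
The hypotheses above mix a Roman numeral with plausible years. One way to tidy this up is a small filter over the pipeline output, sketched below. `keep_years` is a hypothetical helper, and the sample list reuses the `token_str` values from the output above with scores rounded for brevity.

```python
import re

def keep_years(hypotheses):
    """Keep only hypotheses whose filled-in token is a four-digit number."""
    return [h for h in hypotheses if re.fullmatch(r"\d{4}", h["token_str"])]

# Same structure as the fill-mask output above (scores rounded).
hypotheses = [
    {"score": 0.0370, "token_str": "ii"},
    {"score": 0.0305, "token_str": "1945"},
    {"score": 0.0276, "token_str": "1917"},
    {"score": 0.0232, "token_str": "1918"},
    {"score": 0.0175, "token_str": "1949"},
]

years = keep_years(hypotheses)
for h in years:
    print(f"P={h['score']:.4f}", h["token_str"])
```

Note that none of the surviving candidates is scored much higher than the others, so the filter only cleans the list; it does not make the model more certain.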

In [25]:
mlm_model(f"{MASK} year was the Soviet Union founded in")

[{'score': 0.5282406210899353,
  'token': 2008,
  'token_str': 'that',
  'sequence': 'that year was the soviet union founded in'},
 {'score': 0.3739815056324005,
  'token': 2023,
  'token_str': 'this',
  'sequence': 'this year was the soviet union founded in'},
 {'score': 0.03149905428290367,
  'token': 2279,
  'token_str': 'next',
  'sequence': 'next year was the soviet union founded in'},
 {'score': 0.026913192123174667,
  'token': 1996,
  'token_str': 'the',
  'sequence': 'the year was the soviet union founded in'},
 {'score': 0.010693252086639404,
  'token': 2168,
  'token_str': 'same',
  'sequence': 'same year was the soviet union founded in'}]



Huggingface offers hundreds of pre-trained models that specialize in different tasks. You can quickly find the model you need using [this list](https://huggingface.co/models).


In [29]:
text = """Almost two-thirds of the 1.5 million people who viewed this liveblog had Googled to discover
 the latest on the Rosetta mission. They were treated to this detailed account by the Guardian’s science editor,
 Ian Sample, and astronomy writer Stuart Clark of the moment scientists landed a robotic spacecraft on a comet 
 for the first time in history, and the delirious reaction it provoked at their headquarters in Germany.
  “We are there. We are sitting on the surface. Philae is talking to us,” said one scientist.
"""

# Task: create a pipeline for named entity recognition, use task name 'ner' and search for the right model in the list
ner_model = pipeline("token-classification", model="dslim/bert-base-NER")

named_entities = ner_model(text)



In [33]:
# print('OUTPUT:', named_entities)
word_to_entity = {item['word']: item['entity'] for item in named_entities}
print('word_to_entity:',word_to_entity)
assert 'org' in word_to_entity.get('Guardian').lower() and 'per' in word_to_entity.get('Stuart').lower()
print()
print("All tests passed")

word_to_entity: {'Rose': 'B-LOC', '##tta': 'I-LOC', 'Guardian': 'B-ORG', 'Ian': 'B-PER', 'Sam': 'I-PER', '##ple': 'I-PER', 'Stuart': 'B-PER', 'Clark': 'I-PER', 'Germany': 'B-LOC', 'Phil': 'B-PER', '##ae': 'I-PER'}

All tests passed
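
The dictionary above is keyed by word pieces (`Rose`, `##tta`) rather than whole names. A minimal sketch of gluing them back together follows. `merge_wordpieces` is a hypothetical helper that assumes the tokens arrive in text order and that pieces of one entity are adjacent; the token-classification pipeline also accepts an `aggregation_strategy` argument, which is the supported way to get grouped entities.

```python
def merge_wordpieces(tagged_tokens):
    """Group (word, tag) pairs with B-/I- tags into (text, entity_type) spans."""
    entities = []
    for word, tag in tagged_tokens:
        prefix, _, entity_type = tag.partition("-")
        if word.startswith("##") and entities:
            # WordPiece continuation: glue onto the previous piece, no space
            text, prev_type = entities[-1]
            entities[-1] = (text + word[2:], prev_type)
        elif prefix == "I" and entities and entities[-1][1] == entity_type:
            # "inside" tag: the same entity continues with a new word
            text, prev_type = entities[-1]
            entities[-1] = (text + " " + word, prev_type)
        else:
            # "begin" tag (or a type change): start a new entity
            entities.append((word, entity_type))
    return entities

# Word pieces and tags copied from the output above.
tagged = [
    ("Rose", "B-LOC"), ("##tta", "I-LOC"), ("Guardian", "B-ORG"),
    ("Ian", "B-PER"), ("Sam", "I-PER"), ("##ple", "I-PER"),
    ("Stuart", "B-PER"), ("Clark", "I-PER"), ("Germany", "B-LOC"),
    ("Phil", "B-PER"), ("##ae", "I-PER"),
]
merged = merge_wordpieces(tagged)
print(merged)
```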


### The building blocks of a pipeline

Huggingface also allows you to access its pipelines on a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a PyTorch `nn.Module` with pre-trained weights (the `TF*` classes, such as the `TFBertModel` used below, are Keras models instead)

You can use such models as part of your regular PyTorch code: insert it as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.
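
To make the `Tokenizer` half more concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the scheme behind the `##` pieces seen in the NER output. The function name and the tiny `toy_vocab` are hypothetical; a real tokenizer also normalizes text, handles punctuation and maps pieces to ids.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk]  # nothing matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary built from pieces seen earlier in this notebook.
toy_vocab = {"Rose", "##tta", "Phil", "##ae", "Sam", "##ple"}
print(wordpiece("Rosetta", toy_vocab))
print(wordpiece("Philae", toy_vocab))
```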

In [36]:
# import torch
# from transformers import AutoTokenizer, AutoModel, pipeline

# model_name = 'bert-base-uncased'
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModel.from_pretrained(model_name)

from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [38]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    ]

# tokenize a batch of inputs. "tf" means [t]ensor[f]low tensors (use "pt" for [p]y[t]orch tensors)
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="tf")

for key in tokens_info:
    print(key, tokens_info[key])

print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

input_ids tf.Tensor(
[[ 101 5355 1010 1045 2572 2115 2269 1012  102    0    0    0    0    0
     0]
 [ 101 2166 2003 2054 6433 2043 2017 1005 2128 5697 2437 2060 3488 1012
   102]], shape=(2, 15), dtype=int32)
token_type_ids tf.Tensor(
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]], shape=(2, 15), dtype=int32)
attention_mask tf.Tensor(
[[1 1 1 1 1 1 1 1 1 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]], shape=(2, 15), dtype=int32)
Detokenized:
[CLS] luke, i am your father. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] life is what happens when you're busy making other plans. [SEP]
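
The zeros at the end of the first row and the matching `attention_mask` come from padding the shorter sentence up to the longest one in the batch. A minimal sketch of that step, assuming pad id `0` (which decodes to `[PAD]` above); `pad_batch` is a hypothetical helper, not the tokenizer's API.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad id sequences to equal length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Unpadded token ids copied from the output above.
batch = [
    [101, 5355, 1010, 1045, 2572, 2115, 2269, 1012, 102],
    [101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060, 3488, 1012, 102],
]
input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```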


In [40]:
tokens_info

{'input_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[ 101, 5355, 1010, 1045, 2572, 2115, 2269, 1012,  102,    0,    0,
           0,    0,    0,    0],
       [ 101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437,
        2060, 3488, 1012,  102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

In [44]:
# You can now apply the model to get embeddings
# with torch.no_grad():
#     out = model(**tokens_info)

# print(out['pooler_output'])

model(**tokens_info)['pooler_output']

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.88541967, -0.47218543, -0.93924   , ..., -0.8081023 ,
        -0.69547755,  0.8747951 ],
       [-0.92971474, -0.51610476, -0.9334278 , ..., -0.9017478 ,
        -0.7492124 ,  0.92007506]], dtype=float32)>

__Bonus demo:__ transformer language models. 

`/* No points awarded for this task, but it's really cool, we promise :) */`

In [45]:
import tensorflow as tf
import numpy as np
import random

from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2LMHeadModel.from_pretrained('gpt2')



In [103]:
text = "The Fermi paradox "
tokens = tokenizer.encode(text)
num_steps = 1024 - len(tokens) + 1
line_length, max_length = 0, 70

probs_thres = 0.8

for i in range(num_steps):

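    # input_ids must be integer ids with a batch dimension: shape [1, len(tokens)]
    logits = model(tf.constant([tokens], dtype=tf.int32))[0]
    p_next = tf.nn.softmax(logits[0, -1, :]).numpy()

    # nucleus: the most probable tokens whose cumulative probability exceeds
    # probs_thres (capped at the 10 most likely tokens)
    prob_cumsum = 0
    next_possible_token = []
    for index in np.argsort(-p_next)[:10]:
        prob_cumsum += p_next[index]
        next_possible_token.append(int(index))
        if prob_cumsum > probs_thres:
            break
    # sample in proportion to the model's probabilities rather than uniformly
    next_token_index = random.choices(next_possible_token, weights=p_next[next_possible_token].tolist())[0]

    # next_token_index = int(p_next.argmax())  # greedy alternative
    tokens.append(next_token_index)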
    print(tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length >= max_length:
        line_length = 0
        print()
        # break

print(tokenizer.decode(tokens))


 
(
which
 can
 also
 be
 used
 in
 
a
 simple
 way
 as
 
an
 explanatory
 principle
)



I
'll
 give
 it
 some
 attention
.
 First
 I
 want
 to
 talk
 about
 an
 important
 problem

:


How
 are
 you
 doing
 it
?
 
 (
If
 you
're
 reading
 this
 in
 the
 
main
 blog
 
,
 that

 might
 sound
 
b
ored
 and
 
un
educated
)
 Why
 don
't
 you
 use
 it
,
 right
?
 Because

 it
 can
 really
 be
 very
 useful
 if
 you
 are
 thinking
 and
 doing
 something
 that

 will
 benefit
 you
,
 even
 though
 it
 is
 only
 part
 of
 what
's
 possible
 
 if
 it

 were
 not
 so
 much
 your
 task
.
 It
 doesn
't
 make
 it
 easy
 and
 doesn
't
 allow

 for
 
 an
 
 
eas
ier
 task
.
 
 So
 what
?
 The
 problem
 of
 making
 things
 work
 
 
can

 help
 people
 with
 this
 
 
problem
 and
 the
 more
 
 we
 know
 about
 
 what
's
 possible

 
 then
 
 we
 
 know
 it
 can
 be
 much
 better
,
 too
.
 I
've
 found
 many
 people
 with

 such
 issues
 are
 simply
 using
 "
just
 say
 no
"
 or
 using
 a
 language
 
t

In [None]:
import torch
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = GPT2LMHeadModel.from_pretrained('gpt2').train(False).to(device)

text = "The Fermi paradox "
tokens = tokenizer.encode(text)
num_steps = 1024 - len(tokens) + 1
line_length, max_length = 0, 70

print(end=tokenizer.decode(tokens))

for i in range(num_steps):
    with torch.no_grad():
        logits = model(torch.as_tensor([tokens], device=device))[0]
    p_next = torch.softmax(logits[0, -1, :], dim=-1).data.cpu().numpy()

    next_token_index = p_next.argmax() #<YOUR CODE: REPLACE THIS LINE>
    # YOUR TASK: change the code so that it performs nucleus sampling

    tokens.append(int(next_token_index))
    print(end=tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length >= max_length:
        line_length = 0
        print()
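
The nucleus (top-p) sampling step that the task asks for can be sketched independently of any deep learning framework. The helper names below are hypothetical and the version is deliberately minimal: it keeps the smallest set of most probable tokens whose total probability reaches `top_p`, then samples from that set in proportion to the original probabilities.

```python
import random

def nucleus_candidates(probs, top_p):
    """Return (indices, weights) of the smallest top set with mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept, [probs[i] for i in kept]

def sample_nucleus(probs, top_p=0.8, rng=random):
    """Draw one index from the nucleus, weighted by its probability."""
    kept, weights = nucleus_candidates(probs, top_p)
    # random.choices renormalizes the weights internally
    return rng.choices(kept, weights=weights, k=1)[0]

toy_probs = [0.1, 0.5, 0.3, 0.1]
print(nucleus_candidates(toy_probs, top_p=0.7))
print(sample_nucleus(toy_probs, top_p=0.7))
```

In the loop above, something like `next_token_index = sample_nucleus(p_next.tolist(), top_p=0.8)` would slot in where the greedy `argmax` is; sorting the full vocabulary in pure Python at every step is slow, so a NumPy or torch version is preferable in practice.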



Transformers knowledge hub: https://huggingface.co/transformers/