# MVD Exercise 9

## Part 1 - Getting to know HuggingFace and the BERT model

Install the Python library `transformers` and take a look at the pretrained [BERT model](https://huggingface.co/bert-base-uncased). Try the unmasker with various inputs.

<br>
Note: Using the BERT model also requires PyTorch; the CPU version is sufficient.

In [19]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Hello I'm a [MASK] model.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'sequence': "hello i'm a fashion model.",
  'score': 0.10731059312820435,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "hello i'm a role model.",
  'score': 0.08774515986442566,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "hello i'm a new model.",
  'score': 0.05338393896818161,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "hello i'm a super model.",
  'score': 0.04667220264673233,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "hello i'm a fine model.",
  'score': 0.027095947414636612,
  'token': 2986,
  'token_str': 'fine'}]
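
The pipeline returns a plain list of dictionaries, so the predictions can be inspected with ordinary Python. The following is a minimal sketch that reuses the scores printed above (copied by hand and trimmed to two fields); the variable names `predictions`, `best`, and `top5_mass` are illustrative and not part of the `transformers` API.

```python
# Predictions copied from the output above (trimmed to the fields used here).
predictions = [
    {"token_str": "fashion", "score": 0.10731059312820435},
    {"token_str": "role", "score": 0.08774515986442566},
    {"token_str": "new", "score": 0.05338393896818161},
    {"token_str": "super", "score": 0.04667220264673233},
    {"token_str": "fine", "score": 0.027095947414636612},
]

# Highest-scoring candidate for the [MASK] position.
best = max(predictions, key=lambda p: p["score"])

# Total score covered by the five returned candidates.
top5_mass = sum(p["score"] for p in predictions)

for p in predictions:
    print(f"{p['token_str']:<8} {p['score']:.3f}")
print("best:", best["token_str"])
print(f"score mass of the top 5: {top5_mass:.3f}")
```

The five candidates together cover only about a third of the total score, which suggests that the model is fairly uncertain about this particular mask.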

## Part 2 - BERT contextualized word embeddings

The BERT documentation also includes a guide on how to use this model to obtain word embeddings. Try using the same word in different contexts and observe how the cosine similarity of the embeddings changes depending on the context of the word.

Look at the tokenizer output before it enters the BERT model: how many tokens were created for the sentence "Hello, this is Bert."? Explain their number.

<br>
Note: Answering the previous question will help you determine which vector from the output to use for the target word.

In [20]:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# tokenize() returns only the word pieces; it does not add [CLS] or [SEP].
tokens = tokenizer.tokenize("Hello, this is Bert.")
print(tokens)
print(type(tokens))

['hello', ',', 'this', 'is', 'bert', '.']
<class 'list'>
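
The six tokens above are only the word pieces. BERT expects its input to be wrapped in the special `[CLS]` and `[SEP]` tokens, which shifts the position of every word by one. The sketch below illustrates this with plain lists; it assumes the wrapping is done as `[CLS] ... [SEP]` for a single sentence, and the name `model_input` is made up for this example.

```python
# Word-piece tokens as printed by tokenizer.tokenize above.
tokens = ['hello', ',', 'this', 'is', 'bert', '.']

# Assumption: a single sentence is wrapped as [CLS] ... [SEP] for the model.
model_input = ['[CLS]'] + tokens + ['[SEP]']

# Number of tokens without and with the special tokens.
print(len(tokens), len(model_input))

# Position of the target word in both lists.
print(tokens.index('bert'), model_input.index('bert'))
```

The vector for a target word therefore has to be read at the index from the wrapped sequence, not at the index from the bare word-piece list.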


In [21]:
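import torch

def sentence_to_vectors(sentence):
    # The sentence must already contain the [CLS] and [SEP] markers,
    # because tokenize() does not add them.
    tokens = tokenizer.tokenize(sentence)
    word_ids = tokenizer.convert_tokens_to_ids(tokens)
    # Attend to every token (a single sentence has no padding).
    attention_mask = [1] * len(tokens)
    tokens_tensor = torch.tensor([word_ids])
    mask_tensor = torch.tensor([attention_mask])

    # Note: the model is reloaded on every call.
    model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
    model.eval()

    with torch.no_grad():
        # The second positional argument of BertModel is attention_mask, not
        # token_type_ids, so pass it by keyword to make that explicit.
        outputs = model(input_ids=tokens_tensor, attention_mask=mask_tensor)
        # Index 2 holds the hidden states when output_hidden_states=True.
        hidden = outputs[2]

    # [layers, 1, tokens, features] -> [tokens, layers, features]
    token_embeddings = torch.squeeze(torch.stack(hidden, dim=0), dim=1).permute(1, 0, 2)

    # Represent each token by concatenating its last four hidden layers.
    token_vecs_cat = []
    for token in token_embeddings:
        cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
        token_vecs_cat.append(cat_vec)

    return token_vecs_cat, tokens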
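
The stack, squeeze, and permute chain in `sentence_to_vectors` is easier to follow when the shapes are written out. The sketch below does only the shape bookkeeping in plain Python, with no tensors involved. It assumes the sizes of `bert-base-uncased` (13 hidden states of 768 features each) and an 11-token input; all variable names are illustrative.

```python
# Assumed sizes for bert-base-uncased: 12 encoder layers plus the embedding
# layer give 13 hidden states, each with 768 features per token.
n_hidden_states, hidden_size = 13, 768
n_tokens = 11  # assumed length of the input, including [CLS] and [SEP]

# Shape after torch.stack(hidden, dim=0): layers, batch, tokens, features.
shape = (n_hidden_states, 1, n_tokens, hidden_size)

# squeeze(dim=1) drops the batch axis of size 1.
shape = tuple(d for i, d in enumerate(shape) if i != 1)

# permute(1, 0, 2) moves the token axis to the front.
shape = (shape[1], shape[0], shape[2])

# Concatenating the last four layers gives one long vector per token.
concat_dim = 4 * shape[2]
print(shape, concat_dim)
```

Each token thus ends up with a single 3072-dimensional vector, which is what the function returns in `token_vecs_cat`.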

In [22]:
sentence_1 = '[CLS] I bought an iPhone in an Apple store. [SEP]'
sentence_2 = '[CLS] The apple was delicious. [SEP]'

In [23]:
embeddings_1, tokens_1 = sentence_to_vectors(sentence_1)
embeddings_2, tokens_2 = sentence_to_vectors(sentence_2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

In [24]:
# 'apple' is token 7 in sentence_1 and token 2 in sentence_2 (counting [CLS] as 0).
apple_1_vec = embeddings_1[7]
apple_2_vec = embeddings_2[2]

all(apple_1_vec == apple_2_vec), f'L2 difference: {torch.sqrt(torch.sum((apple_1_vec - apple_2_vec)**2))}'

(False, 'L2 difference: 34.4416389465332')
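
The exercise asks for cosine similarity, whereas the cell above reports an L2 distance. The two measure different things: cosine similarity ignores vector length and looks only at direction. A minimal, dependency-free sketch of the formula follows; the helper `cosine_similarity` is written for this example, and the inputs are toy vectors rather than real BERT embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors for illustration only, not real BERT embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

In the notebook, the helper could be applied as `cosine_similarity(apple_1_vec.tolist(), apple_2_vec.tolist())`. With PyTorch already imported, `torch.nn.functional.cosine_similarity(apple_1_vec, apple_2_vec, dim=0)` should give the same quantity directly on the tensors.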

## Bonus - Visualizing word embeddings

Visualize the word embeddings: do their positions change with context the way you would expect? Also try to visualize some words that you think the target word should move closer to after the context changes.

In [25]:
!pip install MulticoreTSNE

from MulticoreTSNE import MulticoreTSNE as TSNE
import plotly.graph_objects as go
import numpy as np

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [26]:
def embeddings_to_numpy(embeddings):
    return np.array([t.numpy() for t in embeddings])

In [30]:
model = TSNE(n_components=3, perplexity=3, verbose=1, n_jobs=12)
X_new = model.fit_transform(embeddings_to_numpy(embeddings_1 + embeddings_2))
x = X_new[:, 0]
y = X_new[:, 1]
z = X_new[:, 2]

Performing t-SNE using 12 cores.
Using no_dims = 3, perplexity = 3.000000, and theta = 0.500000
Computing input similarities...
Building tree...
 - point 1 of 18
 - point 9 of 18
 - point 10 of 18
 - point 11 of 18
 - point 12 of 18
 - point 12 of 18
 - point 13 of 18
 - point 14 of 18
 - point 14 of 18
 - point 14 of 18
 - point 15 of 18
 - point 16 of 18
 - point 16 of 18
 - point 17 of 18
 - point 17 of 18
 - point 17 of 18
 - point 18 of 18
 - point 18 of 18
Done in 0.00 seconds (sparsity = 0.592593)!
Learning embedding...
Iteration 51: error is 46.277876 (50 iterations in 0.00 seconds)
Iteration 101: error is 53.287013 (50 iterations in 0.00 seconds)
Iteration 151: error is 56.311864 (50 iterations in 0.00 seconds)
Iteration 201: error is 53.174553 (50 iterations in 0.00 seconds)
Iteration 251: error is 62.750171 (50 iterations in 0.00 seconds)
Iteration 301: error is 1.815296 (50 iterations in 0.00 seconds)
Iteration 351: error is 0.748303 (50 iterations in 0.00 seconds)

In [31]:
fig = go.Figure(
    data=[
        go.Scatter3d(
            x=x,
            y=y,
            z=z,
            text=tokens_1 + tokens_2,
            mode='markers',
            marker=dict(
                size=12,
                color=z,
                colorscale='Viridis',
                opacity=0.7,
            ),
        ),
    ],
)

fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
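
Colouring the markers by their `z` coordinate does not show which sentence a point comes from. One possible alternative is to colour by sentence and to put the sentence number into the hover text. The sketch below builds both lists in plain Python. The token lists are an assumption about how the two sentences are tokenized, chosen to match the 18 points reported by t-SNE, and the names `example_tokens_1`, `example_tokens_2`, `sentence_index`, and `hover_text` are illustrative.

```python
# Assumed tokenization of sentence_1 and sentence_2 (11 + 7 = 18 tokens).
example_tokens_1 = ['[CLS]', 'i', 'bought', 'an', 'iphone', 'in', 'an',
                    'apple', 'store', '.', '[SEP]']
example_tokens_2 = ['[CLS]', 'the', 'apple', 'was', 'delicious', '.', '[SEP]']

# 0 for every token of the first sentence, 1 for every token of the second.
sentence_index = [0] * len(example_tokens_1) + [1] * len(example_tokens_2)

# Hover labels that keep the two occurrences of 'apple' distinguishable.
hover_text = [
    f"{token} (sentence {index + 1})"
    for token, index in zip(example_tokens_1 + example_tokens_2, sentence_index)
]
print(hover_text[7], '|', hover_text[13])
```

In the figure above, these lists could then be passed as `color=sentence_index` inside `marker` and as `text=hover_text` in `go.Scatter3d`.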