## Seminar 8: "Modern Models for NLP"

### In this seminar we will walk through [a PyTorch implementation of the Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Homework [3 points]

Note that this assignment requires downloading a model of about 150 MB, and running it takes a noticeable amount of time, so we recommend doing this task on [Google Colab](https://colab.research.google.com/).

In [2]:
!pip install --upgrade transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.2 transformers-4.24.0


In [3]:
import torch
import transformers

In [4]:
MODEL = (transformers.MobileBertForMaskedLM, transformers.MobileBertTokenizer, 'google/mobilebert-uncased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)


Some weights of the model checkpoint at google/mobilebert-uncased were not used when initializing MobileBertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


In [6]:
tokenizer.decode(input_ids)

'[CLS] here is some text to encode [SEP]'

In [7]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] here is some [MASK] to encode [SEP]'

In [8]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [9]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [10]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode the'
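
The decoded string above contains the model's prediction at every position, but only the masked position (index 4) is a real fill-in: the other positions mostly echo the input, and the outputs at `[CLS]` and `[SEP]` are not meaningful. Below is a minimal sketch of reading off just the masked position. It uses a made-up toy vocabulary and hand-written scores in place of the real `res` tensor; every name and number in it is an illustrative assumption.

```python
# Toy stand-ins: the vocabulary and the scores are made up for illustration.
vocab = ["[CLS]", "[SEP]", "[MASK]", "here", "is", "some", "way", "text", "to", "encode"]
MASK_ID = vocab.index("[MASK]")

# "[CLS] here is some [MASK] to encode [SEP]" as toy ids
input_ids = [0, 3, 4, 5, MASK_ID, 8, 9, 1]


def toy_logits(ids):
    """Return one score vector per position (len(ids) rows, len(vocab) columns)."""
    rows = []
    for tok in ids:
        row = [0.0] * len(vocab)
        if tok == MASK_ID:
            row[vocab.index("way")] = 2.0   # pretend the model prefers "way"
            row[vocab.index("text")] = 1.5
        else:
            row[tok] = 3.0                  # unmasked positions echo the input
        rows.append(row)
    return rows


logits = toy_logits(input_ids)
mask_pos = input_ids.index(MASK_ID)

# Rank candidate tokens for the masked position only.
scores = logits[mask_pos]
top = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)[:2]
top_tokens = [vocab[i] for i in top]
print(mask_pos, top_tokens)
```

With the real model the same idea would be indexing `res[0, mask_pos]`, since `res` has shape (batch, sequence length, vocabulary size).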

In [11]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the loaded model (MobileBERT). Generate the continuations in two ways: by picking the most probable token at each step, and by sampling. A continuation of 1000 characters is sufficient, unless the model ends the text earlier. You can also compare this generation with a lightweight GPT, for example "sshleifer/tiny-gpt2".
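
The difference between the two selection strategies can be sketched on a toy score vector, without any model. This is only an illustration in plain Python: the helper names (`softmax`, `pick_greedy`, `pick_sample`) and the scores are assumptions, not part of the assignment.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pick_sample(logits, rng):
    """Draw one index at random, weighted by the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]   # made-up scores for a 3-token vocabulary
rng = random.Random(0)         # fixed seed so the run is repeatable

greedy_choice = pick_greedy(toy_logits)
samples = [pick_sample(toy_logits, rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(toy_logits))]
print(greedy_choice, counts)
```

Greedy selection always returns the same index for the same scores, which is one reason greedy generation tends to fall into loops. Sampling picks lower-scoring tokens in proportion to their probability, which gives more varied but less coherent text.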

In [202]:
def seed_everything(seed=42):  
    import random
    import os
    import numpy as np
    import torch
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
seed_everything()

In [203]:
MAX_LEN = 1000
sentence = GPT_TEXTS[0]

# Tokenize the sentence without [CLS] at the start,
input_ids = tokenizer.encode(sentence, add_special_tokens=False)
# but append [SEP] at the end: this produced more interesting text
input_ids.append(tokenizer.sep_token_id)

# stop once the decoded text reaches the desired maximum length (in characters)
while len(tokenizer.decode(input_ids)) < MAX_LEN:
    # append [MASK] to the end of the sequence
    input_ids.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(input_ids).unsqueeze(0)
    # get the prediction
    with torch.no_grad():
        res = model(input_batch)

    # prediction logits
    logits = res.logits

    # look at the position of the [MASK] appended at the end;
    # .item() converts the 0-dim tensor to a plain Python int
    predicted_token_id = logits[0, -1].argmax(dim=-1).item()

    # replace [MASK] with the predicted token
    input_ids[-1] = predicted_token_id
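
The loop above stops only on the character limit, while the task also allows stopping when the model ends the text itself. Below is a minimal sketch of that stopping rule. The helper names (`generate`, `next_token`, `decode`, `end_id`) are hypothetical and a toy vocabulary replaces the real tokenizer. Treating a predicted `[SEP]` as the end of the text is an assumption, not something the assignment specifies.

```python
def generate(prompt_ids, next_token, decode, end_id, max_chars):
    """Append tokens until `end_id` is predicted or the decoded text reaches `max_chars`."""
    ids = list(prompt_ids)
    while len(decode(ids)) < max_chars:
        new_id = next_token(ids)
        if new_id == end_id:   # the "model" decided the text is finished
            break
        ids.append(new_id)
    return ids


# Toy stand-ins for the tokenizer and the model (illustrative only).
toy_vocab = ["<end>", "a", "b", "c"]


def toy_decode(ids):
    return " ".join(toy_vocab[i] for i in ids)


def toy_next_token(ids):
    # Cycle a -> b -> c, then emit the end token.
    last = ids[-1]
    return 0 if last == 3 else last + 1


# Stops early because the toy model emits <end> after "c".
out_early = generate([1], toy_next_token, toy_decode, end_id=0, max_chars=1000)

# Never emits <end>, so only the character limit stops it.
out_capped = generate([1], lambda ids: 1, toy_decode, end_id=0, max_chars=10)

print(toy_decode(out_early))
print(toy_decode(out_capped))
```

In the notebook's setting, `next_token` would wrap the append-`[MASK]`-and-predict step and `end_id` could be `tokenizer.sep_token_id`.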


In [204]:
tokenizer.decode(input_ids)

'in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. [SEP] the researchers were shocked to find the unicorns were not human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human. they were human.

In [205]:
sentence = GPT_TEXTS[1]

# here, adding [CLS] at the start makes no fundamental difference
input_ids = tokenizer.encode(sentence, add_special_tokens=True)

while len(tokenizer.decode(input_ids)) < MAX_LEN:
    input_ids.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(input_ids).unsqueeze(0)

    with torch.no_grad():
        res = model(input_batch)

    logits = res.logits
    
    # .item() converts the 0-dim tensor to a plain Python int
    predicted_token_id = logits[0, -1].argmax(dim=-1).item()
     
    input_ids[-1] = predicted_token_id
        

In [206]:
tokenizer.decode(input_ids)

'[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. [SEP] the train was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage containing controlled nuclear materials was stolen in cincinnati today. the train carriage containing controlled nuclear materials was stolen in cincinnati today. the train carriage containing controlled nuclear materials was stolen in cincinnati today. the train carriage containing controlled nuclear materials was stolen in cincinnati today. the train carriage containing controlled nuclear materials was stolen in cincinnati today. the train carriage containing controlled nuclear materials was stolen in cincinnati today. the train carriage containing controlled nuclear materials was stolen in cincinnati today. the train carriage containing controlled nuclear materials was stolen in cincinnati today. the train carriage containing controlled nuclear

In [207]:
from torch.distributions.categorical import Categorical
sentence = GPT_TEXTS[0]

input_ids = tokenizer.encode(sentence, add_special_tokens=True)

while len(tokenizer.decode(input_ids)) < MAX_LEN:
    input_ids.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(input_ids).unsqueeze(0)

    with torch.no_grad():
        res = model(input_batch)

    logits = res.logits
    
    # sample a token at random according to the predicted distribution (like a Bernoulli draw, but over k classes)
    predicted_token_id = Categorical(logits=logits[0, -1]).sample().item()
     
    input_ids[-1] = predicted_token_id


In [208]:
tokenizer.decode(input_ids)

'[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. [SEP] the music and the formation of mold bounded the way. the dominion laws and constitutional law must always understand. while on the way, physicists and engineers also worked. also experimentalists worked, making unique interpretations of these animals brought forth in the ideas and intellectual activities - common or imperfect, one looking for a fit and value of something or physical presence, or indirectly / often politically ) were likely. also scientific activities ( all circular skies, progressive ), pharmaceutical medicine ( general physician ), mathematics ( miner ), geographical conditions, remembered nature, animals, nature, the world, serving humanity ) purposes as well as other ones related to national parks, the baltic landscape, natura

In [209]:
sentence = GPT_TEXTS[1]

input_ids = tokenizer.encode(sentence, add_special_tokens=True)

while len(tokenizer.decode(input_ids)) < MAX_LEN:
    input_ids.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(input_ids).unsqueeze(0)

    with torch.no_grad():
        res = model(input_batch)

    logits = res.logits

    predicted_token_id = Categorical(logits=logits[0, -1]).sample().item()
     
    input_ids[-1] = predicted_token_id

In [210]:
tokenizer.decode(input_ids)

'[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. [SEP] a tartan fabric contained as a bunch of leaves and dried its leaf as their use was forbidden and prevented giving away rights to them. odor or treatment alcohol diseases reactive toxins and chemicals syphilis most high - side disease strange drugs other diseases diseases and reactions disease - causing environment dangerous environmental health exercise health body safety common diseases and fungi medical - related illnesses poisons hiv death or bleeding & disease other diseases vital signines systems and tubes traps electro - communication practical interpretation of disciplines sky phenomenon local vision mental asylum premier, psychiatric psychiatric asylum, psychiatric asylum, psychiatric asylum aerial execution civil guarantees / demonstrations or program of social rather or revolution, public employment constitutional institutions advisory government 

#### Feedback (optional)

Here you can leave a list of typos from the lecture or seminar:

Here you can leave comments on the lecture or seminar: