## Seminar 8: "Modern Models for NLP"

Full name: Алибаева Камила Винеровна

### In this seminar we will walk through [a Transformer implementation in PyTorch](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Homework [3 points]

Note that this assignment requires downloading a model of about 150 MB, and running it takes some time, so we recommend doing this task on [Google Colab](https://colab.research.google.com/).

In [9]:
!pip install sentencepiece


In [10]:
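!pip install --upgrade transformers
import torch
# Import only the classes used below instead of a wildcard import.
from transformers import MobileBertForMaskedLM, MobileBertTokenizer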

Requirement already up-to-date: transformers in /usr/local/lib/python3.7/dist-packages (4.5.1)


In [None]:
MODEL = (MobileBertForMaskedLM, MobileBertTokenizer, 'google/mobilebert-uncased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [None]:
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


In [None]:
tokenizer.decode(input_ids)

'[CLS] Here is some text to encode [SEP]'

In [None]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] Here is some [MASK] to encode [SEP]'

In [None]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [None]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [None]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode.'
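The decoded string above shows the model's best guess at every position, which is why the `[CLS]` and `[SEP]` slots come back as `.`. A minimal sketch of keeping the original tokens and taking the prediction only where the input was masked follows. The helper `fill_masks`, the value `103` standing in for `tokenizer.mask_token_id`, and the predicted ids are all assumptions made for illustration, not part of the seminar code.

```python
def fill_masks(input_ids, predicted_ids, mask_id):
    """Keep every original token; use the prediction only at masked positions."""
    return [pred if tok == mask_id else tok
            for tok, pred in zip(input_ids, predicted_ids)]

# Assumption: 103 stands in for tokenizer.mask_token_id; the predicted ids are made up.
MASK_ID = 103
masked_input = [101, 2182, 2003, 2070, MASK_ID, 2000, 4372, 16044, 102]
toy_predictions = [7, 2182, 2003, 2070, 4242, 2000, 4372, 16044, 7]

filled = fill_masks(masked_input, toy_predictions, MASK_ID)
print(filled)
```

With the notebook's objects, `predicted_ids` would be `new_ids[0].tolist()` and the result could be passed to `tokenizer.decode`.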

In [6]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the loaded model (DistilBERT). Generate the continuations in two ways: by picking the most probable word and by sampling. A continuation of 1000 characters is considered sufficient, unless the model ends the text earlier.
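The two selection strategies named in the task can be illustrated on a toy distribution. This is a minimal sketch in plain Python: the helper names (`softmax`, `pick_greedy`, `pick_sampled`) and the four scores are hypothetical, and a real run would take the scores from the model output for one position.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(probs):
    """Greedy choice: index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def pick_sampled(probs, rng):
    """Sampling: draw an index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]
toy_probs = softmax(toy_logits)

greedy_choice = pick_greedy(toy_probs)   # always the same index
rng = random.Random(0)                   # fixed seed for repeatability
sampled_choices = [pick_sampled(toy_probs, rng) for _ in range(1000)]
```

Greedy selection is deterministic, so it returns the same token for the same scores every time, whereas sampling spreads its choices across the vocabulary in proportion to the probabilities.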

In [1]:
from tqdm.notebook import tqdm

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Replace the MobileBERT objects above with DistilRoBERTa for the generation task.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

In [18]:
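def gen(input_ids):
    # Replace the final special token (</s>) with a mask so the text can be extended.
    input_ids[-1] = tokenizer.mask_token_id

    for i in tqdm(range(100)):
        # Grow the sequence by one masked position per step.
        input_ids.append(tokenizer.mask_token_id)
        input_batch = torch.tensor(input_ids).unsqueeze(0)  # batch_size 1
        with torch.no_grad():
            res = model(input_batch)[0]
        prob = torch.nn.functional.softmax(res, dim=-1)
        # Sampling variant: draw a token for every position from its predicted distribution.
        # Note that this resamples the whole sequence, prompt included, at each step.
        sampler = torch.distributions.Categorical(prob[0])
        # Greedy variant: take the most probable token at every position instead.
        # input_ids = prob.max(-1)[1]
        # input_ids = input_ids.numpy()[0, :].tolist()
        input_ids = sampler.sample().tolist()

    return input_ids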
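The loop above replaces every position at every step. For comparison, here is a minimal sketch of one possible alternative that fills only a single trailing mask and never touches the prompt or the tokens already generated. It is an assumption about how the task could be approached, not the seminar's reference solution. The names `continue_text`, `scores_fn`, and `toy_scores` are hypothetical, and the toy stand-in replaces the real model so that the block runs on its own.

```python
import random

def continue_text(prompt_ids, scores_fn, mask_id, eos_id,
                  max_new_tokens=100, sample=True, seed=0):
    """Append tokens one at a time; the prompt and earlier tokens are never resampled.

    scores_fn(ids, position) is a hypothetical callable that returns non-negative
    weights (e.g. probabilities) over the vocabulary for the given position.
    """
    rng = random.Random(seed)
    # Drop a trailing end-of-sequence token so the text can be extended.
    if prompt_ids and prompt_ids[-1] == eos_id:
        prefix = list(prompt_ids[:-1])
    else:
        prefix = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids = prefix + [mask_id, eos_id]          # one mask, then end of sequence
        weights = scores_fn(ids, len(prefix))     # scores for the masked slot only
        if sample:
            token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        else:
            token = max(range(len(weights)), key=lambda i: weights[i])
        if token == eos_id:                       # the model "ended the text"
            break
        prefix.append(token)
    return prefix

# Toy stand-in for the model: a 6-token vocabulary, id 1 is <mask>, id 2 is </s>.
def toy_scores(ids, position):
    weights = [0.0] * 6
    weights[3 + position % 3] = 0.9              # favour one of the word ids 3, 4, 5
    weights[2] = 0.1 if position < 6 else 5.0    # </s> becomes likely later on
    return weights

greedy_ids = continue_text([0, 3, 4, 2], toy_scores, mask_id=1, eos_id=2, sample=False)
sampled_ids = continue_text([0, 3, 4, 2], toy_scores, mask_id=1, eos_id=2, sample=True)

# Untested wiring idea for the notebook's objects (run inside torch.no_grad()):
# scores_fn = lambda ids, pos: torch.softmax(model(torch.tensor([ids]))[0][0, pos], dim=-1).tolist()
```

A masked language model is trained to fill gaps rather than to write left to right, so the output of such a loop may still read poorly. The sketch only guarantees that the prompt stays intact.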

In [21]:
# First GPT text with sampling

input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
new_ids = gen(input_ids)
tokenizer.decode(new_ids)

'<s>In a shocking discovery, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. Their streamlined teeth were distinguished among other reptiles except humans and animals would expect dinosaurs to roar.</s>Four unicorn biologists performed research of unicorns evolution morphologically and genetically related mammals, including flying elephants; elephants were mammals descended into dinosaurs before dinosaurs evolved fully parallel mammals. Unicorns therefore evolved fully parallel mammals. They evolved biologically morphologically and genetically related mammals alike; whereas dinosaurs evolved fully parallel mammals alike, dinosaurs evolved fully parallel mammals alike; whereas dinosaurs evolved fully parallel mammals alike–full parallel animals alike–both populations'

In [26]:
# Second GPT text with sampling

input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
new_ids = gen(input_ids)
tokenizer.decode(new_ids)

'<s>A train carriage carrying captured nuclear materials was stolen in Cincinnati.</s>Its whereabouts are unknown. Plate plates reveal distinctive metal fragments carved beneath metal panels carved in limestone carving circles clearly resembling those within circular matrices among vaguely related tools, resembling mine powder and artillery shells. Bars indicate approximately five chambers carved beneath marble plates seen before thieves took it aboard train carriage for Pittsburgh Steelhouse Casino 2454; Loaded fuse matches; Ammunition samples; Battery samples; Sensor Tools; Miscellaneous Reference & Analysis Notes Miscellaneous Items Memorandum Publication Summary Memorandum Publication Summary Review Summary Memorandum Publication Summary Summary Summary Summary Summary Summary Summary Summary Summary'

P.S. I am not sure what BERT-generated text is supposed to look like, but past roughly a hundred tokens the text usually starts looping meaninglessly, and if I take the most probable words instead of sampling, the result is completely unreadable. Please tell me what I am doing wrong: I am genuinely curious!
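One thing that can be checked directly is whether the generation loop altered the prompt itself: the first sampled output above opens with "In a shocking discovery, scientists discovered", although the prompt reads "In a shocking finding, scientist discovered". A small diagnostic along these lines might help. The helper name `changed_prompt_positions` and the ids are made up for illustration.

```python
def changed_prompt_positions(prompt_ids, output_ids):
    """Return the positions where the output no longer matches the prompt."""
    return [i for i, (before, after) in enumerate(zip(prompt_ids, output_ids))
            if before != after]

# Made-up ids for illustration only.
toy_prompt = [0, 11, 12, 13, 14, 2]
toy_output = [0, 11, 99, 13, 14, 2, 15, 16]

changed = changed_prompt_positions(toy_prompt, toy_output)
print(changed)
```

When applying this to the notebook, the prompt should be encoded again for the comparison, because `gen` modifies the list it receives.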

#### Feedback (optional)

Here you can list any typos you found in the lecture or seminar:

Here you can leave comments on the lecture or seminar: