In this assignment you will build a generative language model that uses the attention mechanism.

In [None]:
!pip install --quiet sentencepiece datasets transformers

In [None]:
import os
import random

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm, trange

In [None]:
# Add your data preparation code here



# Attention layer (2 points)



Below you need to implement the `MultiheadAttention` layer.



One of the key, novel concepts introduced by the Transformer paper is the *multi-head attention layer*. 

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/transformer-attention.png?raw=1)

Attention can be thought of as *queries*, *keys* and *values* - where the query is used with the key to get an attention vector (usually the output of a *softmax* operation, with all values between 0 and 1 and summing to 1) which is then used to get a weighted sum of the values.

The Transformer uses *scaled dot-product attention*, where the query and key are combined by taking the dot product between them, dividing by $\sqrt{d_k}$, then applying the softmax operation, and finally multiplying by the value. $d_k$ is the *head dimension*, `head_dim`, which we will shortly explain further.

$$ \text{Attention}(Q, K, V) = \text{Softmax} \big( \frac{QK^T}{\sqrt{d_k}} \big)V $$ 

This is similar to standard *dot product attention* but is scaled by $1/\sqrt{d_k}$, which the paper states is used to stop the results of the dot products growing large, which would push the softmax into regions where gradients become too small.
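To make the order of operations in the formula concrete, here is a minimal sketch of scaled dot-product attention for a single head. The function name `scaled_dot_product_attention`, the toy tensor sizes and the convention that `mask == 0` marks forbidden positions are assumptions made for illustration, not part of the assignment code.

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    """Sketch of Softmax(Q K^T / sqrt(d_k)) V for tensors of shape [..., len, d_k]."""
    d_k = q.size(-1)
    # Un-normalized scores, divided by sqrt(d_k) *before* the softmax
    energy = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Assumed convention: positions where mask == 0 get -inf and thus zero weight
        energy = energy.masked_fill(mask == 0, float("-inf"))
    attention = F.softmax(energy, dim=-1)
    return torch.matmul(attention, v), attention


torch.manual_seed(0)
q = torch.randn(2, 5, 16)  # [batch size, query len, d_k]
k = torch.randn(2, 5, 16)  # [batch size, key len, d_k]
v = torch.randn(2, 5, 16)  # [batch size, value len, d_k]
out, attention = scaled_dot_product_attention(q, k, v)
print(out.shape, attention.sum(dim=-1))
```

Each row of `attention` sums to 1 (up to floating-point error), so `out` is a weighted average of the value vectors.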

However, the scaled dot-product attention isn't simply applied to the queries, keys and values. Instead of doing a single attention application the queries, keys and values have their `hid_dim` split into $h$ *heads* and the scaled dot-product attention is calculated over all heads in parallel. This means instead of paying attention to one concept per attention application, we pay attention to $h$. We then re-combine the heads into their `hid_dim` shape, thus each `hid_dim` is potentially paying attention to $h$ different concepts.

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1,...,\text{head}_h)W^O $$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$

$W^O$ is the linear layer applied at the end of the multi-head attention layer, `fc_o`. $W^Q, W^K, W^V$ are the linear layers `query`, `key` and `value`.

Walking through the module, first we calculate $QW^Q$, $KW^K$ and $VW^V$ with the linear layers `query`, `key` and `value`, to give us `Q`, `K` and `V`. Next, we split the `hid_dim` of the query, key and value into `num_heads` using `.view` and permute them so they can be multiplied together. We then calculate the `energy` (the un-normalized attention) by multiplying `Q` and `K` together and dividing by the square root of `head_dim`, which is calculated as `hid_dim // num_heads`. We then mask the energy so we do not pay attention to any elements of the sequence we shouldn't, then apply the softmax and dropout. We then apply the attention to the value heads, `V`, before combining the `num_heads` heads together. Finally, we multiply the result by $W^O$, represented by `fc_o`.
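The split into heads and the recombination described above can be sketched as two small helpers. The names `split_heads` and `merge_heads` are hypothetical and the sizes are toy values; this is one possible way to arrange the `.view` and `.permute` calls, not necessarily the one expected in the solution.

```python
import torch


def split_heads(x, num_heads):
    """[batch size, seq len, hid dim] -> [batch size, num heads, seq len, head dim]."""
    batch_size, seq_len, hid_dim = x.shape
    head_dim = hid_dim // num_heads
    return x.view(batch_size, seq_len, num_heads, head_dim).permute(0, 2, 1, 3)


def merge_heads(x):
    """[batch size, num heads, seq len, head dim] -> [batch size, seq len, hid dim]."""
    batch_size, num_heads, seq_len, head_dim = x.shape
    # permute makes the tensor non-contiguous, so .contiguous() is needed before .view
    return x.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_len, num_heads * head_dim)


x = torch.randn(2, 5, 512)
heads = split_heads(x, num_heads=8)
restored = merge_heads(heads)
print(heads.shape, restored.shape)
```

Merging right after splitting returns the original tensor, which is a handy sanity check when debugging shapes.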

Note that in our implementation the lengths of the keys and values are always the same, thus when matrix multiplying the output of the softmax, `attention`, with `V` we will always have valid dimension sizes for matrix multiplication. This multiplication is carried out using `torch.matmul` which, when both tensors are >2-dimensional, does a batched matrix multiplication over the last two dimensions of each tensor. This will be a **[query len, key len] x [value len, head dim]** batched matrix multiplication over the batch size and each head which provides the **[batch size, n heads, query len, head dim]** result.
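As a quick shape check of the batched multiplication, here is a small sketch with made-up sizes (the concrete numbers are assumptions chosen only so that the query and key lengths differ):

```python
import torch

batch_size, num_heads, query_len, key_len, head_dim = 2, 8, 5, 7, 64

attention = torch.softmax(torch.randn(batch_size, num_heads, query_len, key_len), dim=-1)
V = torch.randn(batch_size, num_heads, key_len, head_dim)  # value len == key len

# Batched over the first two dimensions; matrix product over the last two
out = torch.matmul(attention, V)
print(out.shape)  # torch.Size([2, 8, 5, 64])
```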

One thing that looks strange at first is that dropout is applied directly to the attention. This means that our attention vector will most probably not sum to 1, and we may pay full attention to a token only for dropout to set that attention to 0. This is never explained, or even mentioned, in the paper, but it is used by the [official implementation](https://github.com/tensorflow/tensor2tensor/) and by many Transformer implementations since, [including BERT](https://github.com/google-research/bert/).
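The effect of dropout on the attention weights can be seen in a small experiment. The tensor sizes and the dropout probability `p=0.5` are assumptions chosen to make the effect obvious; the layer above uses `attn_dropout=0.1` by default.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
attention = torch.softmax(torch.randn(1, 4, 6), dim=-1)  # each row sums to 1

# Training mode: entries are zeroed with probability p, survivors are scaled by 1 / (1 - p)
dropped = F.dropout(attention, p=0.5, training=True)
# Evaluation mode: dropout is the identity, so the rows still sum to 1
kept = F.dropout(attention, p=0.5, training=False)

print(attention.sum(dim=-1))
print(dropped.sum(dim=-1))
```

In training mode the row sums are generally different from 1, while at evaluation time the attention is left untouched.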

In [None]:
class MultiheadAttention(nn.Module):
    def __init__(self, hid_dim, num_heads, attn_dropout=0.1):
        super().__init__()

        self.num_heads = num_heads
        
        assert hid_dim % num_heads == 0, "invalid heads and embedding dimension configuration"
        
        self.key = nn.Linear(hid_dim, hid_dim)
        self.value = nn.Linear(hid_dim, hid_dim)
        self.query = nn.Linear(hid_dim, hid_dim)
        self.attn_dropout = nn.Dropout(attn_dropout)

        self.head_dim = hid_dim // num_heads

        self.fc_o = nn.Linear(hid_dim, hid_dim)

        # sqrt(head_dim) as a plain float, so the layer does not depend on a global `device`
        self.scale = self.head_dim ** 0.5

    
    #(batch_size, seq_len, embed_dim)
    def forward(self, q, k, v, mask):
        batch_size = q.size(0)
        seq_len = q.size(1)
        
        # TODO: Your code here (compute the attention weights `attn` and the output `y`)
        
        return attn, y

For convenience, we will store the model configuration in a separate class.

In [None]:
class GPTConfig:
    attn_dropout = 0.1
    embed_dropout = 0.1
    ff_dropout = 0.1
    num_heads = 8
    num_blocks = 4
    embed_dim = 512
    
    def __init__(
        self, vocab_size, max_len, **kwargs
    ):
        self.vocab_size = vocab_size
        self.max_len = max_len
        for key, value in kwargs.items():
            setattr(self, key, value)    

In [None]:
class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        
        self.fc_1 = nn.Linear(hid_dim, pf_dim)
        self.act = nn.GELU()
        self.fc_2 = nn.Linear(pf_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        #x = [batch size, seq len, hid dim]  
        x = self.dropout(self.act(self.fc_1(x)))
        
        #x = [batch size, seq len, pf dim]
        x = self.fc_2(x)
        
        #x = [batch size, seq len, hid dim]
        
        return x

In [None]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.embed_dim
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.attention = MultiheadAttention(embed_dim, config.num_heads, config.attn_dropout)

        self.ff = PositionwiseFeedforwardLayer(embed_dim, embed_dim * 4, config.ff_dropout) 
    
    def forward(self, x, mask):
        # Pre-LN: normalize only the attention input so the residual keeps the original x
        h = self.ln1(x)
        
        attn, dx = self.attention(h, h, h, mask)

        x = x + dx
        
        x = x + self.ff(self.ln2(x))
        return attn, x

# Train a model for text generation (4 points)

We do not want our model to peek into the future. To prevent this, we create a mask under which the current token can attend only to itself and to previous tokens. This requires a mask whose elements above the main diagonal are zero, which can be built with `torch.tril`.
An example mask for a sequence of length 5:

$$\begin{matrix}
1 & 0 & 0 & 0 & 0\\
1 & 1 & 0 & 0 & 0\\
1 & 1 & 1 & 0 & 0\\
1 & 1 & 1 & 1 & 0\\
1 & 1 & 1 & 1 & 1\\
\end{matrix}$$
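A mask like the one above can be sketched with `torch.tril` as follows. The name `causal_mask_sketch`, the choice to take a length rather than the token tensor that `make_mask` receives, and the `[1, 1, seq len, seq len]` output shape are all assumptions made for illustration.

```python
import torch


def causal_mask_sketch(seq_len):
    """Lower-triangular 0/1 mask of shape [1, 1, seq_len, seq_len]."""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    # Two leading singleton dimensions let the mask broadcast over batch and heads
    return mask.unsqueeze(0).unsqueeze(0)


mask = causal_mask_sketch(5)
print(mask[0, 0])
```

Row $i$ of the mask has ones in columns $0..i$, so position $i$ can attend only to itself and to earlier positions.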

In [None]:
def make_mask(seq):
    ... # TODO: Your code here
    return mask

Complete the model code

In [None]:
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.embed_dim
        self.max_len = config.max_len
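        self.tok_embed = ...  # TODO: Your code here
        self.pos_embed = nn.Parameter(
            torch.zeros(1, config.max_len, embed_dim)
        )
        # Dropout applied to the sum of token and position embeddings in forward()
        self.dropout = nn.Dropout(config.embed_dropout)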

        self.blocks = nn.ModuleList(
            [Block(config) for _ in range(config.num_blocks)]
        )
        self.ln = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, config.vocab_size)
    
    def forward(self, token_indexes):
        # batch_size = x.size(0)
        seq_len = token_indexes.size(1)
        assert seq_len <= self.max_len, "sequence longer than model capacity"
        
        tok_embedding = ... # TODO: Your code here

        # tok_embedding.shape == (batch_size, seq_len, embed_dim)

        pos_embedding = ... # TODO: Your code here

        # pos_embedding.shape == (1, seq_len, embed_dim)
        
        x = self.dropout(tok_embedding + pos_embedding)

        seq_len = x.size(1)

        mask = make_mask(token_indexes)

        # Apply all blocks sequentially, collecting their attention maps in attn_list
        # TODO: Your code here

        ...        

        x = self.ln(x)
        x = self.fc(x)
        # x.shape == (batch_size, seq_len, vocab_size)
        return attn_list, x
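The embedding step in `forward` combines a token embedding with a slice of the learned positional embedding. Here is a minimal standalone sketch of that step, assuming the token embedding is an `nn.Embedding` (the assignment leaves this choice to you) and using toy sizes:

```python
import torch
from torch import nn

vocab_size, max_len, embed_dim = 100, 16, 32  # toy sizes chosen for illustration

tok_embed = nn.Embedding(vocab_size, embed_dim)  # assumed choice of embedding layer
pos_embed = nn.Parameter(torch.zeros(1, max_len, embed_dim))

token_indexes = torch.randint(0, vocab_size, (4, 10))  # [batch size, seq len]
seq_len = token_indexes.size(1)

tok_embedding = tok_embed(token_indexes)   # [batch size, seq len, embed dim]
pos_embedding = pos_embed[:, :seq_len, :]  # [1, seq len, embed dim]

# The leading dimension of size 1 broadcasts over the batch
x = tok_embedding + pos_embedding
print(x.shape)
```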

In [None]:
tokenizer.vocab_size

8000

In [None]:
vocab_size = tokenizer.vocab_size

config = GPTConfig(vocab_size, max_seq_len)
model = GPT(config).to(device)

In [None]:
batch = next(iter(train_dataloader))
model(batch[:, :-1].to(device))[1].shape

torch.Size([16, 256, 8000])

In [None]:
model = model.to(device)

In [None]:
learning_rate = 0.0005

optimizer = torch.optim.AdamW(model.parameters(), lr = learning_rate)

In [None]:
criterion = torch.nn.CrossEntropyLoss(ignore_index = pad_token_idx)

In [None]:
# Implement training the same way as in the previous notebook
def train_epoch(model, callback):
    ...


In [None]:
def eval_model(model):
    ...
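For reference, a single next-token training step might look roughly like the sketch below. `ToyLM` is a hypothetical stand-in that only mimics the `(attn_list, logits)` return convention of `GPT`; the sizes, `pad_idx` and the `toy_*` names are assumptions, and the real `train_epoch` would loop over `train_dataloader` and call `callback`.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, pad_idx = 50, 0  # toy values; pad_idx stands in for pad_token_idx


class ToyLM(nn.Module):
    """Stand-in model with the same (attn_list, logits) return convention as GPT."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 16)
        self.fc = nn.Linear(16, vocab_size)

    def forward(self, token_indexes):
        return [], self.fc(self.embed(token_indexes))


toy_model = ToyLM()
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=5e-4)
toy_criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

toy_batch = torch.randint(1, vocab_size, (4, 9))  # [batch size, seq len + 1]
toy_batch[:, -2:] = pad_idx                       # pretend the tail is padding

inputs, targets = toy_batch[:, :-1], toy_batch[:, 1:]  # targets are inputs shifted by one
_, logits = toy_model(inputs)                          # [batch size, seq len, vocab size]

# CrossEntropyLoss expects [N, classes] logits and [N] targets
loss = toy_criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

toy_optimizer.zero_grad()
loss.backward()
toy_optimizer.step()
print(loss.item())
```

Because of `ignore_index`, the padded target positions do not contribute to the loss or to the gradients.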

Reach `loss < 4.5`.

In [None]:
def callback(train_loss):
    eval_loss = eval_model(model)
    model.train()
    print(f'Epoch: {epoch+1:02} | train_loss = {train_loss:.5f}, eval_loss = {eval_loss:.5f}')

for epoch in trange(1):
    train_loss = train_epoch(model, callback)

Epoch: 01 | train_loss = 4.44060, eval_loss = 4.41045
Epoch: 01 | train_loss = 4.37993, eval_loss = 4.34921
Epoch: 01 | train_loss = 4.33095, eval_loss = 4.30364
Epoch: 01 | train_loss = 4.28178, eval_loss = 4.27192
Epoch: 01 | train_loss = 4.24900, eval_loss = 4.23460


KeyboardInterrupt: ignored

In [None]:
tokenizer.encode("я помню чудное мгновенье")

[2, 156, 2769, 2723, 539, 2646, 1881, 3]

In [None]:
def continues_sentence(sentence, model, max_len = 30):
    # Take the code from the previous assignment
    ...
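If the code from the previous assignment is not at hand, the core of autoregressive generation can be sketched on raw token ids as below. This is only an illustration under several assumptions: `generate_sketch` and `ToyLM` are hypothetical names, the model is a stand-in that returns `(attn_list, logits)` like `GPT`, tokens are drawn with multinomial sampling (greedy `argmax` is an alternative), and tokenizer encoding, decoding and stopping at an end-of-sequence token are left out.

```python
import torch
from torch import nn


class ToyLM(nn.Module):
    """Stand-in model with the same (attn_list, logits) return convention as GPT."""

    def __init__(self, vocab_size=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 16)
        self.fc = nn.Linear(16, vocab_size)

    def forward(self, token_indexes):
        return [], self.fc(self.embed(token_indexes))


@torch.no_grad()
def generate_sketch(model, token_ids, max_new_tokens, max_len):
    """Sample tokens one at a time, feeding the growing sequence back into the model."""
    model.eval()
    for _ in range(max_new_tokens):
        context = token_ids[:, -max_len:]                # keep at most max_len tokens
        _, logits = model(context)                       # [batch size, seq len, vocab size]
        probs = torch.softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
        next_token = torch.multinomial(probs, num_samples=1)
        token_ids = torch.cat([token_ids, next_token], dim=1)
    return token_ids


torch.manual_seed(0)
toy_model = ToyLM()
prompt = torch.tensor([[2, 7, 9]])  # arbitrary token ids used as a prompt
generated = generate_sketch(toy_model, prompt, max_new_tokens=5, max_len=4)
print(generated.shape)
```

Only the logits at the last position are used at each step, since they describe the distribution of the token that follows the current context.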

In [None]:
continues_sentence("Я помню чудное мгновенье", model)

'я помню чудное мгновенье с ужасным чувством.. он мне думает о каком - то особенном положении : он еще не бывал и снова не было. если начинают головы'

In [None]:
continues_sentence("Мой дядя самых честных правил,", model)

'мои дядя самых честных правил, законы мои жорея : прежде всякии сказал :, что лютера, без вас же разумнымство останется древним народам'

In [None]:
continues_sentence("Четыре года потратил Деонардо на", model)

'четыре года потратил деонардо на другои вопрос с потаевыми классами и, чувствуя от всех на волеи и очень трудно просить рент либеральную типу'

In [None]:
continues_sentence("если крикнет рать святая", model)

'если крикнет рать святая жида. собрав фисию владимир, как будто он по самыи заслужен не даст, что, что другого оттенка проска'

In [None]:
continues_sentence("Он пересел на свой стул, придвинул к себе суп, говядину и стал ", model)

'он пересел на свои стул, придвинул к себе суп, говядину и стал за ним спокоиности, и настоичивал на нее, всю чаику : " счастье мне удалось это заключаться в'

Find settings for the number of blocks and the number of heads at which the model gives good results.

# Attention visualization

In [None]:
tokens = list(encode(sentence.lower()))

src_tensor = torch.LongTensor(tokens).unsqueeze(0).to(device)

In [None]:
len(tokens)

20

In [None]:
attn, _ = model(src_tensor)


In [None]:
attn[0][0, :, -1, :].shape

torch.Size([12, 20])