<a href="https://colab.research.google.com/github/YuvalPeleg/transformers-workshop/blob/master/transformer_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
%matplotlib inline

In [0]:

"""
TODO:
--test code

--understand masking

--understand positional encoding
"""

'\nTODO:\n--test code\n\n--understand masking\n\n--understand positional encoding\n'

In [0]:
#!pip3 install  https://download.pytorch.org/whl/cu90/torch-1.3.0-cp36-cp36m-linux_x86_64.whl
!pip3 install https://download.pytorch.org/whl/cu92/torch-1.3.0%2Bcu92-cp36-cp36m-linux_x86_64.whl
!pip3 install torchtext

Collecting torch==1.3.0+cu92
[?25l  Downloading https://download.pytorch.org/whl/cu92/torch-1.3.0%2Bcu92-cp36-cp36m-linux_x86_64.whl (660.8MB)
[K     |████████████████████████████████| 660.8MB 28kB/s 
Installing collected packages: torch
  Found existing installation: torch 1.3.0+cu100
    Uninstalling torch-1.3.0+cu100:
      Successfully uninstalled torch-1.3.0+cu100
Successfully installed torch-1.3.0+cu92



Sequence-to-Sequence Modeling with nn.Transformer and TorchText
===============================================================

This is a tutorial on how to train a sequence-to-sequence model
that uses the
`nn.Transformer <https://pytorch.org/docs/master/nn.html?highlight=nn%20transformer#torch.nn.Transformer>`__ module.

PyTorch 1.2 release includes a standard transformer module based on the
paper `Attention is All You
Need <https://arxiv.org/pdf/1706.03762.pdf>`__. The transformer model
has been proved to be superior in quality for many sequence-to-sequence
problems while being more parallelizable. The ``nn.Transformer`` module
relies entirely on an attention mechanism (another module recently
implemented as `nn.MultiheadAttention <https://pytorch.org/docs/master/nn.html?highlight=multiheadattention#torch.nn.MultiheadAttention>`__) to draw global dependencies
between input and output. The ``nn.Transformer`` module is now highly
modularized such that a single component (like `nn.TransformerEncoder <https://pytorch.org/docs/master/nn.html?highlight=nn%20transformerencoder#torch.nn.TransformerEncoder>`__
in this tutorial) can be easily adapted/composed.

![](../_static/img/transformer_architecture.jpg)





Define the model
----------------




In this tutorial, we train ``nn.TransformerEncoder`` model on a
language modeling task. The language modeling task is to assign a
probability for the likelihood of a given word (or a sequence of words)
to follow a sequence of words. A sequence of tokens are passed to the embedding
layer first, followed by a positional encoding layer to account for the order
of the word (see the next paragraph for more details). The
``nn.TransformerEncoder`` consists of multiple layers of
`nn.TransformerEncoderLayer <https://pytorch.org/docs/master/nn.html?highlight=transformerencoderlayer#torch.nn.TransformerEncoderLayer>`__. Along with the input sequence, a square
attention mask is required because the self-attention layers in
``nn.TransformerEncoder`` are only allowed to attend the earlier positions in
the sequence. For the language modeling task, any tokens on the future
positions should be masked. To have the actual words, the output
of ``nn.TransformerEncoder`` model is sent to the final Linear
layer, which is followed by a log-Softmax function.




In [0]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerModel(nn.Module):

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        from torch.nn import TransformerEncoder, TransformerEncoderLayer
        self.model_type = 'Transformer'
        self.src_mask = None
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken)

        self.init_weights()

    def _generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src):
        if self.src_mask is None or self.src_mask.size(0) != len(src):
            device = src.device
            mask = self._generate_square_subsequent_mask(len(src)).to(device)
            self.src_mask = mask

        src = self.encoder(src) * math.sqrt(self.ninp)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, self.src_mask)
        output = self.decoder(output)
        return output

``PositionalEncoding`` module injects some information about the
relative or absolute position of the tokens in the sequence. The
positional encodings have the same dimension as the embeddings so that
the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
different frequencies.




In [0]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

Load and batch data
-------------------




The training process uses Wikitext-2 dataset from ``torchtext``. The
vocab object is built based on the train dataset and is used to numericalize
tokens into tensors. Starting from sequential data, the ``batchify()``
function arranges the dataset into columns, trimming off any tokens remaining
after the data has been divided into batches of size ``batch_size``.
For instance, with the alphabet as the sequence (total length of 26)
and a batch size of 4, we would divide the alphabet into 4 sequences of
length 6:

\begin{align}\begin{bmatrix}
  \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y} & \text{Z}
  \end{bmatrix}
  \Rightarrow
  \begin{bmatrix}
  \begin{bmatrix}\text{A} \\ \text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} &
  \begin{bmatrix}\text{G} \\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} &
  \begin{bmatrix}\text{M} \\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} &
  \begin{bmatrix}\text{S} \\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix}
  \end{bmatrix}\end{align}

These columns are treated as independent by the model, which means that
the dependence of ``G`` and ``F`` can not be learned, but allows more
efficient batch processing.




In [0]:
import torchtext
from torchtext.data.utils import get_tokenizer
TEXT = torchtext.data.Field(tokenize=get_tokenizer("spacy"),
                            init_token='<sos>',
                            eos_token='<eos>',
                            lower=True)
train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train_txt)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def batchify(data, bsz):
    if hasattr(data, 'examples'):      
      text = data.examples[0].text
    else:
      text = data
    data = TEXT.numericalize([text])
    # Divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)



downloading wikitext-2-v1.zip


wikitext-2-v1.zip: 100%|██████████| 4.48M/4.48M [00:01<00:00, 3.04MB/s]


extracting


In [0]:
torchtext.datasets.WikiText2.splits?

In [0]:
TEXT.preprocess("I like cats")

['i', 'like', 'cats']

In [0]:
print(TEXT.numericalize([['i'], ['valkyria']]).shape)
print(TEXT.numericalize(["i"]))


torch.Size([1, 2])
tensor([[79]])


In [0]:
TEXT.numericalize?


In [0]:
train_txt.examples[0].text[0:5]

[' ', '<eos>', ' ', '=', 'valkyria']

In [0]:
batch_size = 20
eval_batch_size = 10
train_data = batchify(train_txt, batch_size)
val_data = batchify(val_txt, eval_batch_size)
test_data = batchify(test_txt, eval_batch_size)

In [0]:
batchify("Paris is the capital of", 1024)

tensor([], device='cuda:0', size=(0, 1024), dtype=torch.int64)

In [0]:
val_text_lst = list(val_txt.examples)

Functions to generate input and target sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




In [0]:
val_txt.examples[0].text[0:20]

[' ',
 '<eos>',
 ' ',
 '=',
 'homarus',
 'gammarus',
 '=',
 '<eos>',
 ' ',
 '<eos>',
 ' ',
 'homarus',
 'gammarus',
 ',',
 'known',
 'as',
 'the',
 'european',
 'lobster',
 'or']

In [0]:
type(val_data)

torch.Tensor

``get_batch()`` function generates the input and target sequence for
the transformer model. It subdivides the source data into chunks of
length ``bptt``. For the language modeling task, the model needs the
following words as ``Target``. For example, with a ``bptt`` value of 2,
we’d get the following two Variables for ``i`` = 0:

![](../_static/img/transformer_input_target.png)


It should be noted that the chunks are along dimension 0, consistent
with the ``S`` dimension in the Transformer model. The batch dimension
``N`` is along dimension 1.




In [0]:
bptt = 35
def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target

Initiate an instance
--------------------




The model is set up with the hyperparameter below. The vocab size is
equal to the length of the vocab object.




In [0]:
ntokens = len(TEXT.vocab.stoi) # the size of vocabulary
emsize = 200 # embedding dimension
nhid = 200 # the dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2 # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2 # the number of heads in the multiheadattention models
dropout = 0.2 # the dropout value
model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout).to(device)

In [0]:
from torch.nn import TransformerEncoder, TransformerEncoderLayer
TransformerEncoder?

Run the model
-------------




`CrossEntropyLoss <https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss>`__
is applied to track the loss and
`SGD <https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD>`__
implements stochastic gradient descent method as the optimizer. The initial
learning rate is set to 5.0. `StepLR <https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR>`__ is
applied to adjust the learn rate through epochs. During the
training, we use
`nn.utils.clip_grad_norm\_ <https://pytorch.org/docs/master/nn.html?highlight=nn%20utils%20clip_grad_norm#torch.nn.utils.clip_grad_norm_>`__
function to scale all the gradient together to prevent exploding.




In [0]:
criterion = nn.CrossEntropyLoss()
lr = 5.0 # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

import time
def train():
    model.train() # Turn on the train mode
    total_loss = 0.
    start_time = time.time()
    ntokens = len(TEXT.vocab.stoi)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        log_interval = 200
        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | '
                  'lr {:02.2f} | ms/batch {:5.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(
                    epoch, batch, len(train_data) // bptt, scheduler.get_lr()[0],
                    elapsed * 1000 / log_interval,
                    cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

def evaluate(eval_model, data_source):
    eval_model.eval() # Turn on the evaluation mode
    total_loss = 0.
    ntokens = len(TEXT.vocab.stoi)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            output = eval_model(data)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
    return total_loss / (len(data_source) - 1)

In [0]:
def run_language_model(eval_model, text):
    data = batchify(text, 1)
    eval_model.eval() # Turn on the evaluation mode    
    ntokens = len(TEXT.vocab.stoi)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            output = eval_model(data)
            output_flat = output.view(-1, ntokens)       

run_language_model(best_model, "I love cats very much")


AttributeError: ignored

Loop over epochs. Save the model if the validation loss is the best
we've seen so far. Adjust the learning rate after each epoch.



In [0]:
best_val_loss = float("inf")
epochs = 3 # The number of epochs
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train()
    val_loss = evaluate(model, val_data)
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
          'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                     val_loss, math.exp(val_loss)))
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = model
    #best_model = model
    scheduler.step()

| epoch   1 |   200/ 3195 batches | lr 5.00 | ms/batch 38.73 | loss  6.50 | ppl   666.51
| epoch   1 |   400/ 3195 batches | lr 5.00 | ms/batch 37.80 | loss  4.41 | ppl    82.55
| epoch   1 |   600/ 3195 batches | lr 5.00 | ms/batch 38.83 | loss  3.59 | ppl    36.31
| epoch   1 |   800/ 3195 batches | lr 5.00 | ms/batch 37.96 | loss  3.08 | ppl    21.75
| epoch   1 |  1000/ 3195 batches | lr 5.00 | ms/batch 38.54 | loss  2.80 | ppl    16.40
| epoch   1 |  1200/ 3195 batches | lr 5.00 | ms/batch 38.15 | loss  2.61 | ppl    13.63
| epoch   1 |  1400/ 3195 batches | lr 5.00 | ms/batch 38.07 | loss  2.52 | ppl    12.42
| epoch   1 |  1600/ 3195 batches | lr 5.00 | ms/batch 38.11 | loss  2.43 | ppl    11.39
| epoch   1 |  1800/ 3195 batches | lr 5.00 | ms/batch 38.21 | loss  2.34 | ppl    10.38
| epoch   1 |  2000/ 3195 batches | lr 5.00 | ms/batch 38.06 | loss  2.28 | ppl     9.81
| epoch   1 |  2200/ 3195 batches | lr 5.00 | ms/batch 38.15 | loss  2.24 | ppl     9.39
| epoch   1 |  2400/ 

In [0]:
best_model.src_mask

tensor([[0., 0., 0.],
        [-inf, 0., 0.],
        [-inf, -inf, 0.]], device='cuda:0')

Evaluate the model with the test dataset
-------------------------------------

Apply the best model to check the result with the test dataset.



In [0]:
test_loss = evaluate(best_model, test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

| End of training | test loss  0.87 | test ppl     2.39


In [0]:
p_model = best_model
#text = "robert is an english
text = train_txt.examples[0].text[100:135]
text = TEXT.preprocess("I like")

#text = TEXT.preprocess("<sos> I like cats")
print(text)
numericed = TEXT.numericalize([text])
numericed = numericed.to(device)
p_model.eval() # Turn on the evaluation mode    
ntokens = len(TEXT.vocab.stoi)
with torch.no_grad():            
        output = p_model(numericed)  
print([TEXT.vocab.itos[i] for i in output.max(2).indices])

print(numericed)
print(numericed.shape)
print(output.max(2).indices)
print(output.shape)
print(p_model.src_mask)
p_model.src_mask.shape



['i', 'like']
['like', 'like']
tensor([[ 79],
        [150]], device='cuda:0')
torch.Size([2, 1])
tensor([[150],
        [150]], device='cuda:0')
torch.Size([2, 1, 28871])
tensor([[0., 0.],
        [-inf, 0.]], device='cuda:0')


torch.Size([2, 2])

In [0]:
train_txt.examples[0].text[0:10]

[' ', '<eos>', ' ', '=', 'valkyria', 'chronicles', 'iii', '=', '<eos>', ' ']

In [0]:
TEXT.numericalize([text])

tensor([[640],
        [ 29],
        [ 36],
        [343]])

In [0]:
output[3][0].argmax()

tensor(262, device='cuda:0')

In [0]:
best_model.decoder.weight.shape

torch.Size([28871, 200])

In [0]:
ntokens

28871

In [0]:
output

NameError: ignored

In [0]:
model.train() # Turn on the train mode
total_loss = 0.
start_time = time.time()
ntokens = len(TEXT.vocab.stoi)
for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
    data, targets = get_batch(train_data, i)
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output.view(-1, ntokens), targets)
    #print(loss.shape)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optimizer.step()

    total_loss += loss.item()
    log_interval = 200
    if batch % log_interval == 0 and batch > 0:
        cur_loss = total_loss / log_interval
        elapsed = time.time() - start_time
        print('| epoch {:3d} | {:5d}/{:5d} batches | '
              'lr {:02.2f} | ms/batch {:5.2f} | '
              'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // bptt, scheduler.get_lr()[0],
                elapsed * 1000 / log_interval,
                cur_loss, math.exp(cur_loss)))
        total_loss = 0
        start_time = time.time()

| epoch   1 |   200/ 3195 batches | lr 4.75 | ms/batch 38.64 | loss  1.81 | ppl     6.10
| epoch   1 |   400/ 3195 batches | lr 4.75 | ms/batch 37.82 | loss  1.98 | ppl     7.24


KeyboardInterrupt: ignored

In [0]:
loss.shape

torch.Size([])

In [0]:
targets.shape

torch.Size([700])

In [0]:
data.shape

torch.Size([35, 20])

In [0]:
data[0]

tensor([13280,   211,  1168,     4,    10,     8,     4, 19981,    36,     4,
           26,    10,   218,     5,   207,  9864,    29, 20034,  1130,  2066],
       device='cuda:0')

In [0]:
output.shape

torch.Size([35, 20, 28871])

In [0]:
val_data.shape

torch.Size([24504, 10])

In [0]:
train_data.shape

torch.Size([111832, 20])

In [0]:
train_data[0].shape


torch.Size([20])

In [0]:
train_data[1]

tensor([    3,    22, 10588,     3,    10,   973,    18,  2069,  2165,     7,
            7,    37,  9283,   686,    12,  3692,  5020,    40,  3188,     4],
       device='cuda:0')

In [0]:
train_data.shape

torch.Size([111832, 20])

In [0]:
data.shape

torch.Size([35, 20])

In [0]:
targets.shape

torch.Size([700])

In [0]:
get_batch(train_data, 0)[2].shape

IndexError: ignored

In [0]:
train_data.shape

torch.Size([111832, 20])

In [0]:
train_data.size(0)

111832

In [0]:
best_model.encoder.weight.shape

torch.Size([28871, 200])

In [0]:
best_model.encoder(data).shape

torch.Size([35, 20, 200])

In [0]:
print(data.shape)
s = best_model.encoder(data) * math.sqrt(best_model.ninp)
print(s.shape)
s = best_model.pos_encoder(s)
print(s.shape)
o = best_model.transformer_encoder(s, best_model.src_mask)
print(o.shape)
o = best_model.decoder(o)
print(o.shape)


torch.Size([35, 20])
torch.Size([35, 20, 200])
torch.Size([35, 20, 200])
torch.Size([35, 20, 200])
torch.Size([35, 20, 28871])


In [0]:
s = self.pos_encoder(sample_data)
s
output = self.transformer_encoder(src, self.src_mask)

In [0]:
best_model.src_mask.shape

torch.Size([35, 35])

In [0]:
best_model.src_mask

tensor([[0., 0.],
        [-inf, 0.]], device='cuda:0')

In [0]:
best_model.decoder.weight

torch.Size([28872, 200])

In [0]:
best_model.encoder.weight.shape

torch.Size([28872, 200])

In [0]:
output.shape

NameError: ignored

In [0]:
nn.Linear?

tensor([[640],
        [  8],
        [ 10],
        [  9],
        [ 29],
        [ 36],
        [343]], device='cuda:0')
torch.Size([7, 1])
tensor([[  8],
        [ 10],
        [  9],
        [ 29],
        [ 36],
        [343],
        [343]], device='cuda:0')
torch.Size([7, 1, 28872])


In [0]:
torch.argmax(output)

tensor(318, device='cuda:0')

In [0]:
data = batchify("i like cats", 1)

In [0]:
model(numericed).max(2)

torch.return_types.max(values=tensor([[19.5889],
        [17.7437],
        [17.7850]], device='cuda:0', grad_fn=<MaxBackward0>), indices=tensor([[ 150],
        [5992],
        [5992]], device='cuda:0'))

In [0]:
output.shape

torch.Size([3, 1, 28872])

In [0]:
train_text.examples[0].text

NameError: ignored

In [0]:
batchify("I like cats".split(' '), 1).shape

torch.Size([3, 1])

In [0]:
data.shape

torch.Size([2236652, 1])

In [0]:
TEXT.numericalize(["I"]).shape

torch.Size([1, 1])

In [0]:
TEXT.numericalize([["I", "like", "cats"]])

tensor([[   0],
        [ 150],
        [5992]])

In [0]:
len("I like cats ver")

11

In [0]:
values, indices = torch.max(output, 2)
values

tensor([[19.5889],
        [17.7437],
        [17.7850]], device='cuda:0')

In [0]:
indices.shape

torch.Size([3, 1])

In [0]:
torch.max?


In [0]:
TEXT.vocab.itos[indices[0]]

'like'

In [0]:
indices

tensor([[ 150],
        [5992],
        [5992]], device='cuda:0')

In [0]:
best_model.src_mask

tensor([[0., 0., 0.],
        [-inf, 0., 0.],
        [-inf, -inf, 0.]], device='cuda:0')

In [0]:
test_txt.examples[0].text[13:20]

['robert', '<', 'unk', '>', 'is', 'an', 'english']

In [0]:
|y = torch.tensor([
     [
       [1, 2, 3],
       [4, 5, 6]
     ],
     [
       [1, 2, 3],
       [4, 5, 6]
     ],
     [
       [1, 2, 3],
       [4, 5, 6]
     ]
   ])
y.shape



torch.Size([3, 2, 3])

In [0]:
torch.sum(y, dim=0).shape

torch.Size([2, 3])

In [0]:
torch.sum(y, dim=1).shape

torch.Size([3, 3])

In [0]:
torch.sum(y, dim=2).shape

torch.Size([3, 2])

In [0]:
TEXT.process("I like cats")

tensor([[   2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2],
        [   0,   14, 1852,   79, 1693, 1638,   14,  656,   15,  201,  565],
        [   3,    3,    3,    3,    3,    3,    3,    3,    3,    3,    3]])

In [0]:
class MyMultiHeadAttention(nn.Module):
  def __init__(self, embed_dim, nhead, dropout):
    super(MyMultiHeadAttention, self).__init__()
    self.embed_dim = embed_dim
    self.nhead = nhead
    self.dropout = nn.Dropout(dropout)
    self.queries_linear = nn.Linear(embed_dim, embed_dim)
    self.keys_linear = nn.Linear(embed_dim, embed_dim)
    self.values_linear = nn.Linear(embed_dim, embed_dim)
    self.final_linear = nn.Linear(embed_dim, embed_dim)
    self.head_dim = embed_dim // nhead
  def forward(self, sequence_batch, mask):
    "Implements Figure 2"
    #Sequence: Dbatch X Dseq D X embed_dim
    queries = self.queries_linear(sequence_batch)
    keys = self.queries_linear(sequence_batch)
    values = self.values_linear(sequence_batch)
    batch_size = sequence_batch.size(0)
    queries, keys, values = [linear(sequence_batch) for linear in [self.queries_linear, self.keys_linear, self.values_linear]]
    #divide into attention heads:
    #Dbatch X nhead X Dseq D X head_dim
    queries, keys, values = [attention_part.view(batch_size, -1, self.nhead, self.head_dim).transpose(1,2) for attention_part in [queries, keys, values]]
    attended_heads = self.calculatue_attention(queries, keys, values, None)
    #concat heads back to 
    #Dbatch X Dseq D X embed_dim
    concated_heads = attended_heads.transpose(1, 2)\
    .contiguous()\
    .view(batch_size, -1, self.embed_dim)
    return self.final_linear(concated_heads) # 

  def calculatue_attention(self, queries, keys, values, mask):
    #keys * queries.T * values
    #Dbatch X nhead X Dseq X head_dim
    weights = torch.matmul(queries, keys.transpose(-2, -1))
    #Dbatch X nhead X Dseq D X Dseq D
    weights_scaled = weights / math.sqrt(self.head_dim)
    if mask is not None:
        weights_scaled = weights_scaled.masked_fill(mask == 0, -1e9)
    weights_normed = F.softmax(weights_scaled, dim = -1)
    if self.dropout:
      weights_normed = self.dropout(weights_normed)
    attended_keys = torch.matmul(weights_normed, values)
    #Dbatch X nhead X Dseq D X head_dim
    return attended_keys


def test_multi_head_attention():
  attention = MyMultiHeadAttention(50, 5, 0.1)
  mock_batch = torch.randn((5, 10, 50))
  result = attention(mock_batch, None)
  print(result.size())
test_multi_head_attention()

torch.Size([5, 10, 50])


In [0]:
#weights = torch.matmul(queries, keys.transpose(-2, -1))
a = torch.ones((2,2,2,3))
b = torch.ones((2,2,3,2))
print(a.size)
print(torch.matmul(a,b).size())
print(torch.matmul(a,b))

#torch.matmul(a,b)

<built-in method size of Tensor object at 0x7fb321bb87e0>
torch.Size([2, 2, 2, 2])
tensor([[[[3., 3.],
          [3., 3.]],

         [[3., 3.],
          [3., 3.]]],


        [[[3., 3.],
          [3., 3.]],

         [[3., 3.],
          [3., 3.]]]])


In [0]:
class MyEncoderLayer(nn.Module):
  def __init__(embed_dim, nhead, nhid, dropout):
      super(MyEncoderLayer, self).__init__()
      self.embed_dim = embed_dim
      self.embed_dim = embed_dim
      


      


In [0]:
nn.TransformerEncoderLayer??

In [0]:
nn.MultiheadAttention??

In [0]:
x = torch.randn(4, 4)


In [0]:
x.size()

torch.Size([4, 4])

In [0]:
x.shape

torch.Size([4, 4])

In [0]:
x

tensor([[-0.9757,  0.5749, -2.6934,  0.0231],
        [-0.7104,  1.5264, -0.3772, -2.3132],
        [-1.3506, -0.5612, -0.1005, -0.7492],
        [ 1.2988,  0.0405, -0.2952,  1.1514]])

In [0]:
torch.matmul??