# Sequence masking with PyTorch

Resources:
- Masking:
    - [Difference between `src_mask` and `src_key_padding_mask` in PyTorch Transformer layers (from StackOverflow)](https://stackoverflow.com/questions/62170439/difference-between-src-mask-and-src-key-padding-mask)
    - [UvA (University of Amsterdam) DL tutorial](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html)
    - [Judit Ács' blog post](https://juditacs.github.io/2018/12/27/masked-attention.html) (but watch out: **the attention matrix is not square!**)
- Masked language modeling:
    - [Kaggle notebook](https://www.kaggle.com/code/mojammel/masked-language-model-with-pytorch-transformer)
    - [MLM with BERT blog post](https://towardsdatascience.com/masked-language-modelling-with-bert-7d49793e5d2c)
    - [PyTorch transformer tutorial](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)

In [126]:
import sys
import torch

sys.path.append('../../modules/')

from models import TransformerClassifier, FFNN

%load_ext autoreload
%autoreload 2



## Data

Generate a vocabulary.

In [60]:
# Vocabulary size.
q = 8

vocab = torch.arange(q).to(dtype=torch.int64)
mask_idx = -1

vocab

tensor([0, 1, 2, 3, 4, 5, 6, 7])

Generate sequences of tokens.

**Note:** all sequences considered here have the same length, so no padding is ever needed.

In [61]:
# Tree depth.
l = 8

batch_size = 64
seq_len = 2 ** l

sequences = torch.randint(q, (batch_size, seq_len))

sequences

tensor([[2, 0, 2,  ..., 6, 6, 0],
        [6, 2, 1,  ..., 3, 2, 5],
        [7, 4, 3,  ..., 3, 5, 1],
        ...,
        [5, 4, 3,  ..., 7, 6, 5],
        [4, 1, 6,  ..., 0, 2, 3],
        [2, 1, 4,  ..., 3, 7, 5]])

## Building attention masks

Given the sequences, generate trainable input embeddings for them (only the token embeddings; we skip the positional encoding here because it's not essential to the masking discussion).

In [86]:
hidden_dim = 128

embedding_layer = torch.nn.Embedding(
    num_embeddings=vocab.shape[0],
    embedding_dim=hidden_dim
)

input_sequence_embeddings = embedding_layer(sequences)

input_sequence_embeddings.shape

torch.Size([64, 256, 128])

Instantiate a transformer Encoder model.

**Note:** by convention, we stick with having the batch dimension as the first one.

In [87]:
# A single encoder layer to be used in the full stack.
encoder_layer = torch.nn.TransformerEncoderLayer(
    d_model=hidden_dim,
    nhead=1,
    dim_feedforward=2048,
    batch_first=True
)

# Stack of encoder layers.
transformer_encoder = torch.nn.TransformerEncoder(
    encoder_layer,
    num_layers=1
)

encoder_output = transformer_encoder(input_sequence_embeddings)

encoder_output.shape



torch.Size([64, 256, 128])

Generate masked sequences from the original ones to perform masked language modeling (MLM): each token in each sequence is independently replaced by the mask token (a conventional value, here `mask_idx`) with some probability.

**Mask:** we pass the mask as the encoder's `src_key_padding_mask` argument, so it must have shape `(batch_size, seq_len)` and, if boolean, contain `True` at positions to be ignored and `False` elsewhere. In practice, it is obtained by comparing the masked sequences against the mask index `mask_idx`.

In [88]:
masked_sequences = sequences.clone()

masking_rate = 0.1

# Draw one uniform sample per token and mask the positions that fall below
# the masking rate (vectorized equivalent of looping over every position).
mask = torch.rand(sequences.shape) < masking_rate

masked_sequences[mask] = mask_idx
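
To make the semantics of `src_key_padding_mask` concrete, here is a minimal self-contained sketch (not part of the original notebook). It checks that positions flagged `True` are hidden as attention *keys*: changing the tokens at those positions leaves the encoder output at all other positions unchanged. The small sizes, the names `key_mask` and `tokens_alt`, and the settings `dropout=0.0` and `enable_nested_tensor=False` (assuming a reasonably recent PyTorch) are choices made only for this illustration.

```python
import torch

torch.manual_seed(0)

batch_size, seq_len, vocab_size, hidden_dim = 4, 10, 8, 16

embedding = torch.nn.Embedding(vocab_size, hidden_dim)
layer = torch.nn.TransformerEncoderLayer(
    d_model=hidden_dim,
    nhead=1,
    dim_feedforward=32,
    dropout=0.0,        # assumption: no dropout, so both passes are deterministic
    batch_first=True,
)
encoder = torch.nn.TransformerEncoder(
    layer, num_layers=1, enable_nested_tensor=False
)

tokens = torch.randint(vocab_size, (batch_size, seq_len))

# Boolean key padding mask of shape (batch_size, seq_len):
# True means "do not attend to this position".
key_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
key_mask[:, [2, 7]] = True

# A second input that differs from the first only at the masked positions.
tokens_alt = tokens.clone()
tokens_alt[key_mask] = (tokens[key_mask] + 1) % vocab_size

with torch.no_grad():
    out = encoder(embedding(tokens), src_key_padding_mask=key_mask)
    out_alt = encoder(embedding(tokens_alt), src_key_padding_mask=key_mask)

# Unmasked positions never see the masked tokens, so their outputs agree.
same_at_visible = torch.allclose(out[~key_mask], out_alt[~key_mask], atol=1e-6)
# Masked positions still act as queries and keep their own residual stream.
same_at_masked = torch.allclose(out[key_mask], out_alt[key_mask], atol=1e-6)

print(same_at_visible, same_at_masked)
```

Note that the mask only removes positions from the set of keys. The masked positions themselves still produce an output row, which on this code path depends on their own embedding through the residual connection.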

Questions:
- Should we pass the masked sequences or the original ones (always along with the mask) to the encoder?
- What to do with the masked embeddings?
    - If we pass the masked sequences, input embeddings must be created for the `<mask>` token, so it has to be modeled explicitly. Could we declare it as the padding token when instantiating the `Embedding` layer for the input embeddings?
    - If we pass the original sequences, the masked tokens keep their original embeddings. Is the mask sufficient to tell the model to ignore them?
    - What about the gradient? Should we "disconnect" the masked tokens from the compute graph?
- Full model (with a decoder as well):
    - How do we tell which masked tokens to predict in the sequences? The decoder won't know anything about the masking, so it has no way to distinguish a masked token from an unmasked one.
    - Should the decoder predict one masked token at a time (how to select them? Mask them one by one?) or all masked tokens together (in which case, see the previous point)?
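
One possible answer to several of these questions is the BERT-style convention, sketched below under stated assumptions; it is not necessarily the approach this notebook ends up taking. The `<mask>` token gets its own id appended to the vocabulary (here `mask_token = q`, since `-1` is not a valid row index for an `Embedding` table), the *masked* sequences are fed to the encoder, and the loss is computed only at the masked positions by setting every other target to `-100`, the default `ignore_index` of `torch.nn.functional.cross_entropy`. The per-token linear `head` and all sizes are hypothetical choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

q = 8                # vocabulary size, as in the notebook
mask_token = q       # assumption: dedicated <mask> id appended to the vocabulary
batch_size, seq_len, hidden_dim = 4, 16, 32
masking_rate = 0.25

sequences = torch.randint(q, (batch_size, seq_len))

# Choose the positions to mask; force one per row so the demo loss is defined.
is_masked = torch.rand(sequences.shape) < masking_rate
is_masked[:, 0] = True

# Inputs: original tokens with <mask> at the selected positions.
inputs = sequences.masked_fill(is_masked, mask_token)
# Targets: original tokens at masked positions, -100 (ignored) elsewhere.
targets = sequences.masked_fill(~is_masked, -100)

embedding = torch.nn.Embedding(q + 1, hidden_dim)  # one extra row for <mask>
layer = torch.nn.TransformerEncoderLayer(
    d_model=hidden_dim, nhead=1, dim_feedforward=64, batch_first=True
)
encoder = torch.nn.TransformerEncoder(
    layer, num_layers=1, enable_nested_tensor=False
)
head = torch.nn.Linear(hidden_dim, q)  # hypothetical per-token classifier

# Per-token logits over the q real tokens: (batch_size, seq_len, q).
logits = head(encoder(embedding(inputs)))

# Cross-entropy averaged over the masked positions only.
loss = F.cross_entropy(
    logits.reshape(-1, q), targets.reshape(-1), ignore_index=-100
)
loss.backward()

print(logits.shape, loss.item())
```

In this sketch no `src_key_padding_mask` is passed, so the `<mask>` positions stay visible to the other tokens, and the model learns where a prediction is expected from the `<mask>` embedding itself. Declaring the mask id as `padding_idx` in `torch.nn.Embedding` would instead initialize that row to zeros and exclude it from gradient updates, so the `<mask>` embedding would not be trained.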

In [89]:
encoder_output_masked = transformer_encoder(
    embedding_layer(sequences),
    src_key_padding_mask=mask
)

encoder_output_masked

tensor([[[ 0.4672,  1.1940,  1.0537,  ..., -0.2877,  0.2319,  1.2596],
         [-0.2028,  0.1728,  1.5650,  ..., -0.5773,  0.6646,  0.9247],
         [ 0.1062,  1.3121,  1.1403,  ..., -0.2368,  0.2331,  1.2061],
         ...,
         [-0.1510,  0.3138,  0.1510,  ..., -1.4177,  0.7866, -0.8487],
         [-0.2009,  0.4738,  0.1920,  ..., -1.2997,  0.8303, -0.8554],
         [-0.2200,  0.2029,  0.9257,  ..., -0.6490,  0.3923,  0.8745]],

        [[ 0.0445,  0.3207,  0.2282,  ..., -1.1129,  0.7127, -0.8959],
         [ 0.1751,  1.4233,  1.2060,  ..., -0.4689,  0.2305,  1.2587],
         [-0.6427, -1.7986, -1.6010,  ...,  1.7697,  0.8395,  0.0797],
         ...,
         [ 0.7727, -0.1183,  1.1160,  ..., -1.9133, -0.5322, -0.8142],
         [ 0.2510,  1.4098,  1.1894,  ..., -0.2762,  0.1966,  1.1853],
         [ 2.3494, -1.3264,  1.3115,  ...,  0.4348,  0.6838,  0.2742]],

        [[-1.5283,  1.4894, -2.1376,  ...,  0.7907,  0.0856,  0.0318],
         [-1.1355, -0.3074,  1.4480,  ...,  0

## Building a model for MLM

In [174]:
model = TransformerClassifier(
    seq_len=seq_len,
    embedding_size=hidden_dim,
    n_tranformer_layers=1,
    n_heads=1,
    n_classes=q,
    embedding_agg='flatten',
    decoder_hidden_sizes=[],
    decoder_activation='identity',
    decoder_output_activation='softmax'
)

model(sequences, src_key_padding_mask=mask)

tensor([[0.0560, 0.0959, 0.2724, 0.1043, 0.0876, 0.1265, 0.0803, 0.1770],
        [0.0289, 0.1510, 0.1804, 0.1897, 0.0853, 0.0797, 0.1354, 0.1497],
        [0.0526, 0.2401, 0.1393, 0.0856, 0.2161, 0.0724, 0.0434, 0.1505],
        [0.0303, 0.1611, 0.1556, 0.1782, 0.0878, 0.0785, 0.0614, 0.2470],
        [0.0214, 0.1749, 0.2697, 0.1860, 0.0697, 0.1506, 0.0484, 0.0793],
        [0.1414, 0.0967, 0.1740, 0.2270, 0.1583, 0.0766, 0.0446, 0.0814],
        [0.0255, 0.1447, 0.2715, 0.1412, 0.1098, 0.1030, 0.0685, 0.1357],
        [0.1256, 0.1061, 0.1431, 0.0518, 0.2128, 0.0948, 0.0517, 0.2140],
        [0.0555, 0.0858, 0.2906, 0.1301, 0.0686, 0.1281, 0.1336, 0.1078],
        [0.0290, 0.0998, 0.3629, 0.0666, 0.0675, 0.0793, 0.0261, 0.2688],
        [0.0860, 0.0718, 0.1642, 0.1850, 0.0947, 0.1615, 0.0834, 0.1534],
        [0.0266, 0.1494, 0.2083, 0.1140, 0.1752, 0.0811, 0.0701, 0.1754],
        [0.0904, 0.1651, 0.1151, 0.3147, 0.0973, 0.0857, 0.0444, 0.0871],
        [0.0477, 0.0898, 0.1403, 0.131