# Sequence masking with PyTorch

Resources:
- Masking:
    - [Difference between `src_mask` and `src_key_padding_mask` in PyTorch Transformer layers (from StackOverflow)](https://stackoverflow.com/questions/62170439/difference-between-src-mask-and-src-key-padding-mask)
    - [UvA (University of Amsterdam) DL tutorial](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html)
    - [Judit Ács' blog post](https://juditacs.github.io/2018/12/27/masked-attention.html) (but watch out: **the attention matrix is not square!**)
- Masked language modeling:
    - [Kaggle notebook](https://www.kaggle.com/code/mojammel/masked-language-model-with-pytorch-transformer) (very similar to the PyTorch tutorial below)
    - [MLM with BERT blog post](https://towardsdatascience.com/masked-language-modelling-with-bert-7d49793e5d2c)
    - [PyTorch transformer tutorial](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
    - [Tutorial with TensorFlow](https://keras.io/examples/nlp/masked_language_modeling/#create-bert-model-pretraining-model-for-masked-language-modeling) (this is actually a good reference, with tensor shapes etc.)
    - [Tutorial with PyTorch](https://neptune.ai/blog/how-to-code-bert-using-pytorch-tutorial)

In [2]:
import sys
import torch

sys.path.append('../../modules/')

from models import TransformerClassifier, FFNN

%load_ext autoreload
%autoreload 2



## Data

Generate a vocabulary.

In [3]:
# Vocabulary size.
q = 8

vocab = torch.arange(q).to(dtype=torch.int64)
mask_idx = vocab.max() + 1

# Enlarge the vocabulary with the special <mask> token.
vocab = torch.cat([vocab, mask_idx.reshape(1)])

vocab

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8])

Generate sequences of tokens.

**Note:** here all sequences have the same length, so no padding is ever needed.

In [4]:
# Tree depth.
l = 8

batch_size = 64
seq_len = 2 ** l

sequences = torch.randint(q, (batch_size, seq_len))

sequences

tensor([[1, 7, 0,  ..., 1, 5, 2],
        [0, 2, 1,  ..., 5, 2, 4],
        [6, 4, 6,  ..., 2, 3, 7],
        ...,
        [7, 3, 7,  ..., 2, 7, 5],
        [6, 3, 2,  ..., 0, 4, 4],
        [2, 7, 7,  ..., 3, 7, 1]])

In [6]:
sequences.dtype

torch.int64

## Building attention masks

Given the sequences, generate trainable input embeddings for them. Only the semantic (token) part is used here: the positional encoding is skipped, as it's not essential.

In [4]:
hidden_dim = 128

embedding_layer = torch.nn.Embedding(
    num_embeddings=vocab.shape[0],
    embedding_dim=hidden_dim
)

input_sequence_embeddings = embedding_layer(sequences)

input_sequence_embeddings.shape

torch.Size([64, 256, 128])

Instantiate a transformer Encoder model.

**Note:** by convention, we stick with having the batch dimension as the first one.

In [5]:
# A single encoder layer to be used in the full stack.
encoder_layer = torch.nn.TransformerEncoderLayer(
    d_model=hidden_dim,
    nhead=1,
    dim_feedforward=2 * hidden_dim,
    batch_first=True
)

# Stack of encoder layers.
transformer_encoder = torch.nn.TransformerEncoder(
    encoder_layer,
    num_layers=1
)

encoder_output = transformer_encoder(input_sequence_embeddings)

encoder_output.shape



torch.Size([64, 256, 128])

Generate masked sequences from the original ones to perform masked language modeling (MLM): each token in each sequence is replaced by the `<mask>` token (whose index is set by convention) with some probability.

**Mask:** the mask can be passed as the encoder's `src_key_padding_mask` option, which means it should have shape `(batch_size, seq_len)` and (if of boolean type) contain `False` where no masking is needed and `True` where it is. In practice, it's obtained by simply comparing the masked sequences against the index of the `<mask>` token.

In [6]:
masking_rate = 0.1

# Mask each token independently with probability `masking_rate`.
mask = torch.rand(sequences.shape) < masking_rate

# Replace the selected tokens with the <mask> token.
masked_sequences = sequences.masked_fill(mask, int(mask_idx))

In [7]:
mask

tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False,  True],
        [False, False,  True,  ..., False, False,  True],
        ...,
        [ True, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [ True, False, False,  ..., False,  True, False]])
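
To make the semantics of `src_key_padding_mask` concrete, here is a minimal sketch. The toy sizes and the names `toy_mask`, `x_a` and `x_b` are illustrative assumptions, not part of the notebook. Positions flagged `True` are excluded as attention keys, so the encoder output at the other positions should not depend on what the flagged positions contain.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
batch_size, seq_len, hidden_dim = 2, 6, 16

# dropout=0.0 keeps the forward pass deterministic.
layer = torch.nn.TransformerEncoderLayer(
    d_model=hidden_dim,
    nhead=1,
    dim_feedforward=2 * hidden_dim,
    dropout=0.0,
    batch_first=True
)
encoder = torch.nn.TransformerEncoder(
    layer,
    num_layers=1,
    enable_nested_tensor=False
)

# True marks the positions to be ignored as attention keys.
toy_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
toy_mask[0, 1] = True
toy_mask[1, 4] = True

x_a = torch.randn(batch_size, seq_len, hidden_dim)

# Same input, but with different content at the flagged positions.
x_b = x_a.clone()
x_b[toy_mask] = torch.randn(int(toy_mask.sum()), hidden_dim)

with torch.no_grad():
    out_a = encoder(x_a, src_key_padding_mask=toy_mask)
    out_b = encoder(x_b, src_key_padding_mask=toy_mask)

# Compare the outputs at the non-flagged positions only.
same_elsewhere = torch.allclose(
    out_a[~toy_mask], out_b[~toy_mask], atol=1e-5
)
```

Note that in the encoder calls of this notebook the option is commented out, so the `<mask>` embeddings remain visible to the other tokens there.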

**Questions:**
- Should we pass the masked sequences or the original ones (always along with the mask) to the encoder?

A: we should pass the masked sequences to the encoder, then use the decoder to generate logits for every token in every sequence and compare them with the ground truth (the non-masked sequences) via the cross-entropy loss.

What to do with the **masked embeddings**?
 
- If we pass the masked sequences, input embeddings must be created for the `<mask>` token, so it should be explicitly modeled. Could we indicate it as a padding token when we instantiate the `Embedding` layer for the input embeddings?
 
A: the `<mask>` token should be explicitly modeled as part of the vocabulary. It shouldn't be among the possible **predicted** tokens though.

- If we pass the original sequences, we leave the original embeddings in place for the masked tokens. Is the mask sufficient to tell the model to ignore them?

A: masking is not sufficient. Indeed, we should pass the masked sequences to the encoder.

- What about the gradient? Should we "disconnect" the masked token from the compute graph?

A: still not clear. The attention mask should prevent the embeddings of the masked tokens from contributing to the loss, but that's just a guess (and what about residual connections?).
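
One way to probe the gradient question empirically is to check which rows of the embedding matrix receive a gradient. The sketch below is only a toy check and doesn't settle what the right design is. It assumes illustrative sizes and a single encoder layer with a linear head, and it passes no `src_key_padding_mask`. It computes the loss on the masked positions only and then inspects the gradient of the `<mask>` row.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
q, batch_size, seq_len, hidden_dim = 8, 4, 16, 32
mask_idx = q  # Index of the <mask> token (last row of the embedding).

embedding = torch.nn.Embedding(q + 1, hidden_dim)
layer = torch.nn.TransformerEncoderLayer(
    d_model=hidden_dim,
    nhead=1,
    dim_feedforward=2 * hidden_dim,
    dropout=0.0,
    batch_first=True
)
head = torch.nn.Linear(hidden_dim, q)

sequences = torch.randint(q, (batch_size, seq_len))
mask = torch.rand(batch_size, seq_len) < 0.2
mask[0, 0] = True  # Make sure at least one token is masked.
masked_sequences = sequences.masked_fill(mask, mask_idx)

logits = head(layer(embedding(masked_sequences)))

# Loss restricted to the masked positions.
loss = torch.nn.functional.cross_entropy(logits[mask], sequences[mask])
loss.backward()

# Gradient norm of each embedding row; the last one is <mask>.
grad_norms = embedding.weight.grad.norm(dim=-1)
mask_row_grad = grad_norms[mask_idx]
```

In this setup `mask_row_grad` is nonzero: the `<mask>` embedding is the input at every masked position, so it is connected to the loss (through the residual connections among other paths) and is trained like any other embedding.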

What about the **full model** (with a decoder as well)?

- How do we tell which tokens in the sequences are the masked ones to predict? The decoder won't know anything about the masking, so it'll have no way to distinguish between a masked token and a non-masked one.

A: it's true that the decoder won't know explicitly which tokens are masked, but it will know indirectly, because we pass the masked sequences as input and because in the end we'll probably have to select only the loss terms corresponding to the predictions for the masked tokens.

- Should the decoder predict one masked token at a time (how to select them? Mask them one by one?) or all masked tokens together (in which case, see the previous point)?

A: the decoder should predict for all the tokens, masked and non-masked, so that for an input of shape `(batch_size, seq_len)` we get a final output of shape `(batch_size, seq_len, vocab_size)`, the last dimension corresponding to the logits over the vocabulary (restricted to the tokens that should be predicted, i.e. not `<mask>`). This means that we effectively predict the entire sequence of tokens every time. Computing the cross-entropy loss against the non-masked sequences then gives a tensor of loss values of shape `(batch_size, seq_len)`. Now we can choose (not clear which is the right choice though!) whether to compute the final loss as the mean over all the entries of this tensor or just over the ones corresponding to the masked tokens (by applying the mask to the tensor of loss values).

In [12]:
encoder_output_masked = transformer_encoder(
    embedding_layer(masked_sequences),
    # src_key_padding_mask=mask
)

encoder_output_masked

tensor([[[-0.0290, -1.4483,  0.2451,  ...,  0.6581, -1.7872, -0.9144],
         [ 0.3452,  0.6251,  0.1088,  ..., -1.1884,  1.6174,  1.6512],
         [ 0.4943,  0.4641,  0.0205,  ..., -1.0078,  1.6852,  1.5408],
         ...,
         [ 0.3771,  1.2614,  0.3377,  ..., -0.0295,  1.3668,  0.4894],
         [ 0.5643,  0.4755,  0.1923,  ..., -1.2817,  1.6874,  1.5795],
         [ 0.1603,  0.4404, -0.0568,  ..., -1.3516,  1.7183,  1.6342]],

        [[ 0.1994, -1.0017, -0.1838,  ...,  1.0559, -1.8085, -0.9750],
         [ 0.5908, -0.2653, -0.8072,  ...,  2.4222,  0.9356,  0.4583],
         [ 0.8316, -0.1844, -0.7723,  ...,  2.2629,  1.1053,  0.2142],
         ...,
         [-0.6129, -0.4498,  0.7233,  ..., -0.1183, -1.2245, -0.9194],
         [-0.0767, -1.2105,  0.1428,  ...,  0.8197, -1.7863, -0.8375],
         [ 2.3046, -1.6464, -0.2519,  ..., -0.3259, -1.2096,  0.2420]],

        [[ 0.1837,  1.0495,  0.3672,  ..., -0.2346,  1.1039,  0.4407],
         [ 0.3430, -1.2616,  0.2235,  ...,  0

In [13]:
encoder_output_masked.shape

torch.Size([64, 256, 128])

The final output layer maps each token in each sequence to a set of logits (or probabilities) over the "proper" vocabulary (excluding special tokens), so it takes tensors of shape `(batch_size, seq_len, hidden_dim)` as input and returns tensors of shape `(batch_size, seq_len, vocab_size)`.

In [14]:
output_layer = torch.nn.Linear(
    in_features=hidden_dim,
    out_features=q
)

output_logits = output_layer(encoder_output_masked)
output_probs = torch.nn.Softmax(dim=-1)(output_logits)

output_logits.shape, output_probs.shape, output_probs.sum(dim=-1)

(torch.Size([64, 256, 8]),
 torch.Size([64, 256, 8]),
 tensor([[1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         ...,
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000],
         [1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000]],
        grad_fn=<SumBackward1>))

Observations on the loss function (**should masking be considered at this stage?**):
- We need the model (output layer) to output the predicted logits for each token in each sequence, i.e. a tensor of shape `(batch_size, seq_len, vocab_size)`, where the last dimension represents the logits over the vocabulary.
- PyTorch's `CrossEntropyLoss` accepts the logits and the true labels as input, with the latter given either as class indices or as class probabilities (e.g. one-hot vectors). In this case, if the predicted logits are put in the shape PyTorch expects (see point below), no one-hot encoding is needed for the targets.
- Without any aggregation, we should have a value for the loss for each token in each sequence, which gives a tensor of loss values of shape `(batch_size, seq_len)`. PyTorch assumes that the predicted logits have shape `(batch_size, n_classes, [additional dims])`, so the last two dimensions of the output logits need to be switched.
- **Guess:** probably the final loss should be computed only for the masked tokens, so we have to apply the mask to the loss tensor before aggregating the values.
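
The shape convention in the points above can be checked with a small sketch. The sizes and random tensors are illustrative assumptions. It compares two ways of feeding `(batch_size, seq_len, n_classes)` logits to `CrossEntropyLoss`: moving the class dimension to position 1, or flattening the batch and sequence dimensions and restoring them afterwards.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
batch_size, seq_len, n_classes = 4, 10, 8

logits = torch.randn(batch_size, seq_len, n_classes)
targets = torch.randint(n_classes, (batch_size, seq_len))

loss_fn = torch.nn.CrossEntropyLoss(reduction='none')

# Option 1: move the class dimension to position 1.
loss_permuted = loss_fn(logits.permute(0, 2, 1), targets)

# Option 2: flatten batch and sequence dimensions, then restore them.
loss_flat = loss_fn(
    logits.reshape(-1, n_classes),
    targets.reshape(-1)
).reshape(batch_size, seq_len)

# Both give one loss value per token.
same_loss = torch.allclose(loss_permuted, loss_flat)
```

Both options yield a `(batch_size, seq_len)` tensor with the same values, so the choice between them is a matter of convenience.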

In [15]:
loss_fn = torch.nn.CrossEntropyLoss(
    reduction='none'  # Default: 'mean'.
)

loss_tensor = loss_fn(
    torch.permute(output_logits, dims=(0, 2, 1)),
    sequences
)

loss_tensor, loss_tensor.shape

(tensor([[3.4903, 1.6067, 1.6570,  ..., 2.5112, 1.4768, 1.6383],
         [3.3925, 3.8067, 3.9073,  ..., 2.4712, 3.4151, 1.9716],
         [2.6250, 3.4439, 1.3120,  ..., 3.8554, 2.6951, 3.0362],
         ...,
         [3.1953, 2.5349, 3.4119,  ..., 3.4123, 1.6710, 3.3674],
         [2.5336, 3.7902, 1.5737,  ..., 3.9473, 1.7537, 2.5176],
         [1.2555, 1.5612, 1.6691,  ..., 2.5762, 2.0930, 3.8554]],
        grad_fn=<ViewBackward0>),
 torch.Size([64, 256]))

In [16]:
# Index with the mask to average the loss over the masked tokens
# only, drop the indexing (!) to average over the whole sequence.
loss = loss_tensor[mask].mean()

loss

tensor(2.2712, grad_fn=<MeanBackward0>)
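
An alternative to indexing the loss tensor with the mask is the `ignore_index` option of `CrossEntropyLoss`. The sketch below uses illustrative toy tensors and an assumed `labels` name. It sets the targets at the non-masked positions to `-100` (the default `ignore_index`) and checks that the default `'mean'` reduction then matches the explicit masked mean.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
batch_size, seq_len, n_classes = 4, 10, 8

logits = torch.randn(batch_size, seq_len, n_classes)
targets = torch.randint(n_classes, (batch_size, seq_len))

mask = torch.rand(batch_size, seq_len) < 0.3
mask[0, 0] = True  # Make sure at least one token is masked.

# Explicit version: per-token loss, then mean over the masked tokens.
loss_tensor = torch.nn.CrossEntropyLoss(reduction='none')(
    logits.permute(0, 2, 1),
    targets
)
loss_explicit = loss_tensor[mask].mean()

# ignore_index version: non-masked positions get the ignored label.
labels = targets.masked_fill(~mask, -100)
loss_ignore = torch.nn.CrossEntropyLoss(ignore_index=-100)(
    logits.permute(0, 2, 1),
    labels
)

same_loss = torch.allclose(loss_explicit, loss_ignore)
```

With `reduction='mean'` the loss is averaged over the non-ignored targets only, which is why the two values agree.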

## Building a model for MLM

In [17]:
model = TransformerClassifier(
    seq_len=seq_len,
    embedding_size=hidden_dim,
    n_tranformer_layers=1,
    n_heads=1,
    vocab_size=vocab.shape[0],
    n_special_tokens=1,
    embedding_agg=None,
    decoder_hidden_sizes=[],
    decoder_activation='identity',
    decoder_output_activation='identity'  # 'identity' --> Output logits
)

# Output shape: (batch_size, seq_len, vocab_size) (excluding
# the special tokens from the vocabulary, which shouldn't be
# predicted).
model(
    masked_sequences,
    # src_key_padding_mask=mask
)

tensor([[[-0.5034,  0.3430, -0.9629,  ...,  0.5773,  1.1145,  0.7478],
         [ 0.0952, -0.8594, -0.8214,  ...,  1.3722, -0.1491,  0.2409],
         [-0.3162, -0.9446, -1.1253,  ...,  0.7653, -0.4901,  0.3828],
         ...,
         [ 0.0988, -1.0715, -0.5927,  ..., -0.9521,  0.8289,  0.5411],
         [ 0.4000, -1.3849, -0.8203,  ...,  0.4781, -0.0317,  0.1036],
         [ 0.2004, -1.2422, -0.8967,  ...,  0.5036,  0.1347,  0.5359]],

        [[-0.6349, -0.1459, -1.3031,  ...,  0.6147,  0.8781,  0.3331],
         [-1.2160,  0.4235,  0.0651,  ..., -0.7708,  0.7488,  0.5718],
         [-1.1911,  0.3849,  0.0250,  ..., -0.4980,  0.8087,  0.5262],
         ...,
         [ 0.4344, -0.8499,  0.9137,  ..., -0.1813,  0.4762, -0.8045],
         [-0.0141, -0.0068, -1.3151,  ..., -0.1438,  1.1790,  0.6050],
         [ 0.4919,  0.3130,  0.1599,  ..., -0.2795,  0.1510, -0.0299]],

        [[ 0.2132, -0.8751, -0.8432,  ..., -0.3938,  0.5776,  0.4708],
         [-0.3050,  0.4148, -1.1664,  ...,  0

In [23]:
loss_fn = torch.nn.CrossEntropyLoss(
    reduction='none'
)

# Index with the mask to average the loss over the masked tokens
# only, drop the indexing (!) to average over the whole sequence.
loss = loss_fn(
    torch.permute(
        model(
            masked_sequences,
            # src_key_padding_mask=mask
        ),
        (0, 2, 1)),
    sequences
)[mask].mean()

loss

tensor(2.1741, grad_fn=<MeanBackward0>)
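
The loss above can be plugged into an optimization loop. The following is a minimal sketch under stated assumptions, not the training procedure of this project. `ToyMLM` is a hypothetical stand-in for `TransformerClassifier` (which lives in a local module not shown here). The optimizer, learning rate, number of steps and the choice of drawing a fresh mask at every step are also assumptions.

```python
import torch

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
q, batch_size, seq_len, hidden_dim = 8, 16, 32, 32
mask_idx = q  # Index of the <mask> token.


class ToyMLM(torch.nn.Module):
    """Hypothetical stand-in for TransformerClassifier."""

    def __init__(self):
        super().__init__()
        # The embedding covers the vocabulary plus <mask>.
        self.embedding = torch.nn.Embedding(q + 1, hidden_dim)
        layer = torch.nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=1,
            dim_feedforward=2 * hidden_dim,
            dropout=0.0,
            batch_first=True
        )
        self.encoder = torch.nn.TransformerEncoder(
            layer, num_layers=1, enable_nested_tensor=False
        )
        # The output head excludes <mask> from the predictions.
        self.head = torch.nn.Linear(hidden_dim, q)

    def forward(self, tokens):
        return self.head(self.encoder(self.embedding(tokens)))


model = ToyMLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss(reduction='none')

sequences = torch.randint(q, (batch_size, seq_len))

losses = []
for step in range(3):
    # Assumption: a fresh random mask is drawn at every step.
    mask = torch.rand(batch_size, seq_len) < 0.1
    mask[0, 0] = True  # Make sure at least one token is masked.
    masked_sequences = sequences.masked_fill(mask, mask_idx)

    logits = model(masked_sequences)
    loss = loss_fn(logits.permute(0, 2, 1), sequences)[mask].mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Since the toy sequences are random and no positional encoding is used, the loss isn't expected to decrease meaningfully here. The sketch only shows how the pieces fit together.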