### Dropout on self-attention softmax weights
* In a self-attention head, dropout is applied to the softmax weights. With a causal mask the first token attends only to itself, so does this mean that the output vector for the first word in each sequence is exactly $0$ with probability $p$, the dropout probability?
* Moreover, is the output for the $n$-th word $0$ with probability $p^n$, since all $n$ of its nonzero weights must be dropped at once?
* Should we be bothered by this?
* With multiple layers and multiple heads, the "bias" induced by this seems to be negligible.
* In toy examples, however, it actually seems to slow down training.
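
The $p^n$ figure above can be sanity-checked with a small simulation. This is a minimal sketch, assuming a causal mask (so row $n$ has exactly $n$ nonzero softmax weights) and independent dropout of each weight with probability $p$. `row_zero_rate` is a hypothetical helper and not part of the notebook.

```python
import random

def row_zero_rate(n, p, trials=100_000, seed=0):
    """Estimate the probability that all n weights of one row are dropped."""
    rng = random.Random(seed)
    zeroed = 0
    for _ in range(trials):
        # The row output is zero only if every one of its n weights is dropped.
        if all(rng.random() < p for _ in range(n)):
            zeroed += 1
    return zeroed / trials

p = 0.5
for n in range(1, 5):
    # Print the row index, the simulated rate, and the analytic value p**n.
    print(n, row_zero_rate(n, p), p ** n)
```

The simulated rates should sit close to $p^n$, which shows that only the first few positions of each sequence are affected to a noticeable degree.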

* A separate question is why we do not drop whole words instead. I assume this is valid, but it would make the batches ragged and harder to parallelize. I would also expect it to slow down training, since it is much more aggressive than dropping individual weights.
* I found this kind of surprising when I first noticed it. I would welcome any references on this, and on why the sampling is not modified to guarantee, for example, at least one nonzero weight per row, which would ensure a nonzero output vector.
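
One way the sampling could be modified to guarantee at least one nonzero weight per row is sketched below. This is only an illustration, not an established technique or a PyTorch API: `dropout_keep_one` is a hypothetical helper, and picking the largest weight as the survivor is an arbitrary assumption. Note that forcing a weight back on means the usual $1/(1-p)$ rescaling no longer preserves the expected value exactly.

```python
import torch
import torch.nn.functional as F

def dropout_keep_one(wei, p, training=True):
    """Dropout on attention weights that never zeroes a whole row (hypothetical variant)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return wei
    # 1 = keep, 0 = drop; only positions with a nonzero weight count as kept.
    keep = ((torch.rand_like(wei) >= p) & (wei > 0)).to(wei.dtype)
    # Rows in which every nonzero weight was dropped.
    dead = keep.sum(dim=-1, keepdim=True) == 0
    # One-hot mask selecting the largest weight of each row.
    top = F.one_hot(wei.argmax(dim=-1), wei.shape[-1]).to(wei.dtype)
    # For dead rows, switch the largest weight back on.
    keep = torch.where(dead, top, keep)
    # Standard inverted-dropout rescaling of the surviving weights.
    return wei * keep / (1.0 - p)

torch.manual_seed(0)
T = 4
causal = torch.tril(torch.ones(T, T))
scores = torch.rand(8, T, T).masked_fill(causal == 0, float('-inf'))
wei = torch.softmax(scores, dim=-1)
out = dropout_keep_one(wei, p=0.9)
# Every row keeps at least one positive weight, even with aggressive dropout.
print((out.sum(dim=-1) > 0).all())
```

Whether such a guarantee helps training in practice is an open question here; the sketch only shows that the modification is cheap to express.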


### Toy code example

In [None]:
import torch
from torch import nn
import torch.nn.functional as F
n_embd = 20
head_size = 10
block_size = 4
dropout = 0.5
class Head(nn.Module):
    """ one head of self-attention
        https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention
    """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # This limits us to the maximal context block_size
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)  # (B,T,C=n_embd) -> (B,T,C=head_size)
        q = self.query(x)  # (B,T,C=n_embd) -> (B,T,C=head_size)
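        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        # note: C from x.shape is n_embd, so the key dimension is used for the scale instead
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)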
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        # Dropout is applied over the full (T, T) matrix; it might be better to drop only
        # the unmasked (lower-triangular) entries, since this version is biased.
        # Conceptually, maybe we should use a symmetric dropout mask instead.
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x)  # (B,T,C=head_size)
        # The matrix multiplication is batched and applied on last two dimensions!
        out = wei @ v  # (B, T, T) @ (B, T, C=head_size) -> (B, T, C=head_size)
        return out

In [50]:
import torch
torch.manual_seed(1367149)
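# Lower-triangular causal mask for a sequence of 4 tokens
mask = torch.tril(torch.ones(4, 4))
# Random stand-in for the attention weights (not softmax-normalized)
similarities = torch.rand((4, 4)) * mask
# Functional dropout defaults to training=True, so entries are dropped here
dd_similarities = torch.nn.functional.dropout(similarities, p=0.5)
values = torch.rand((4, 10))
# Rows whose weights were all dropped come out as all-zero vectors
dd_similarities @ values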

tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000],
        [0.4227, 1.1676, 0.1916, 0.7144, 1.0432, 0.5021, 0.1680, 0.1247, 1.1446,
         0.2948],
        [2.3534, 2.1808, 1.3240, 2.1463, 2.5636, 0.9744, 2.2360, 1.6751, 1.3820,
         2.3454]])