Attention Concept:

1. Attention is a form of communication, where information is exchanged between tokens. The computation occurs through feed-forward operations after the attention mechanism.
2. In the attention mechanism, each token in a sentence is considered as a node and is connected to the previous tokens as well as itself. For different problems, this directed graph may vary. Attention helps in finding a feature vector for each node by facilitating communication with other nodes.
3. Self-attention involves using the same source for obtaining key, value, and query. On the other hand, cross-attention involves using two different sources for {key, value} and query.
4. To effectively scale the network, it is necessary to incorporate residual connections and layer normalization. These techniques help maintain the integrity and stability of the network architecture.
5. Multi-head attention is like group convolution. It results in better outcome and more stable training.
6. For more detail see decoder part of Attention all you need paper.

In [62]:
# Libs
import torch
import torch.nn as nn
from torch.nn import functional as F

In [63]:
# Hyperparameters
class Configuration:
  # mini-batch size
  batch_size = 64
  # number of tokens in each data of the batch
  block_size = 256
  # lenght of embeding for each token
  embd_size = 384
  # number of head in multi-head attention
  num_heads = 6
  # Embeding size of each attention head
  head_size = embd_size // num_heads
  # number of attention blocks
  num_attention_block=6
  # training and evalution
  number_iterations = 5000
  learning_rate = 3e-4
  dropout = 0.2
  # if GPU is available use it
  device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
  eval_interval = 500
  eval_iters = 200
  torch.manual_seed(1337)

# Instance of Configuration class
conf = Configuration()

if conf.embd_size % conf.num_heads != 0:
  raise ValueError('Embeding size should be dividable by number of heads.')
print('Device: {}'.format(conf.device))

Device: cuda:0


In [64]:
# Read dataset
with open('sample_data/input.txt', 'r', encoding='utf-8') as f:
  ds_text = f.read()

In [65]:
# Show part of data
print(ds_text[:300])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us


In [66]:
# Unique characters, in this code, characters are tokens, for better result
# tools such as OpenAI tokenizer can be used.
vocabolary = sorted(list(set(ds_text)))
conf.vocabolary_size = len(vocabolary)

print(conf.vocabolary_size)
print(vocabolary)

65
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [67]:
# Function to convert encode and decode data
dic_vocab_idx = {vocabolary[i]: i for i in range(conf.vocabolary_size)}
dic_idx_vocab = {i:vocabolary[i] for i in range(conf.vocabolary_size)}

encode = lambda tx: [dic_vocab_idx[ch] for ch in tx]
decode = lambda list_int: ''.join([dic_idx_vocab[id] for id in list_int])

In [68]:
print(encode('Hello There!'))
print(decode(encode('Hello There!')))

[20, 43, 50, 50, 53, 1, 32, 46, 43, 56, 43, 2]
Hello There!


In [69]:
# Convert all text for integer
ds_tensor = torch.tensor(encode(ds_text), dtype=torch.long)

In [70]:
print(ds_tensor.type)
print(ds_tensor.shape)

<built-in method type of Tensor object at 0x7fc0adcad9e0>
torch.Size([1115394])


In [71]:
# Split all data to train and validation
n_split = int(len(ds_tensor) * 0.9)
train_data = ds_tensor[:n_split]
val_data = ds_tensor[n_split:]

In [72]:
print(train_data.shape)
print(val_data.shape)

torch.Size([1003854])
torch.Size([111540])


In [73]:
# Get a mini-batch
def get_batch(split_name):
  if split_name == 'train':
    data = train_data
  elif split_name == 'eval':
    data = val_data
  else:
    raise ValueError('Split name is incorrect.')

  idxs = torch.randint(low=0, high=len(data)-conf.block_size, size=(conf.batch_size,))
  X = torch.stack([data[id:id+conf.block_size] for id in idxs])
  y = torch.stack([data[id+1:id+conf.block_size+1] for id in idxs])
  # Move data to device
  X, y = X.to(conf.device), y.to(conf.device)
  return X, y


In [74]:
@torch.no_grad()
def estimate_loss():
  """ Estimate train and eval loss."""
  evaluation = {'train':None, 'eval':None}
  losses = {'train':[], 'eval':[]}
  # Put model in evaluation mode
  model.eval()
  # Estimate train and eval loss on whole data
  # by calculating loss on a number of batches
  for split in ['train', 'eval']:
    for i in range(conf.eval_iters):
      X, y = get_batch(split_name=split)
      _, loss = model(X, y)
      losses[split].append(loss.item())
  evaluation['train'] = sum(losses['train']) / len(losses['train'])
  evaluation['eval'] = sum(losses['eval']) / len(losses['eval'])
  # Put model in train mode
  model.train()
  return evaluation


In [75]:
class Head(nn.Module):
  """One attention head."""
  def __init__(self):
    super().__init__()
    # Linear layers to generate key, value and query
    self.key = nn.Linear(in_features=conf.embd_size, out_features=conf.head_size, bias=False)
    self.value = nn.Linear(in_features=conf.embd_size, out_features=conf.head_size, bias=False)
    self.query = nn.Linear(in_features=conf.embd_size, out_features=conf.head_size, bias=False)
    # This matrix is used for masking. Tokens in text generation should not
    # have access to future tokens
    self.register_buffer(
        'tril', torch.tril(torch.ones(conf.block_size, conf.block_size)))
    self.dropout = nn.Dropout(conf.dropout)

  def forward(self, x):
    B, T, C = x.shape

    # Generate key, value, query
    k = self.key(x) # (B, T, Head size)
    v = self.value(x) # (B, T, Head size)
    q = self.query(x) # (B, T, Head size)
    # Compute attention scores (affinities)
    wei = q @ k.transpose(-2, -1) * (C ** -0.5)
    wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
    wei = F.softmax(wei, dim=-1) # (B, T, T)
    wei = self.dropout(wei)
    # Weighted aggregation of values
    out = wei @ v # (B, T, Head size)
    return out


In [76]:
class MultiHeadAttention(nn.Module):
  """Multihead attention"""
  def __init__(self):
    super().__init__()
    # Put several head together
    self.heads = nn.ModuleList([Head() for _ in range(conf.num_heads)])
    self.projection = nn.Linear(conf.embd_size, conf.embd_size)
    self.dropout = nn.Dropout(conf.dropout)
  def forward(self, x):
    x = torch.cat([head(x) for head in self.heads], dim=-1)
    x = self.projection(x)
    x = self.dropout(x)
    return x

In [77]:
class FeedForward(nn.Module):
  """MLP layet after each multi-head attention"""
  def __init__(self):
    super().__init__()
    self.net = nn.Sequential(
        nn.Linear(conf.embd_size, 4 * conf.embd_size),
        nn.ReLU(),
        nn.Linear(4 * conf.embd_size, conf.embd_size),
        nn.Dropout(conf.dropout)
    )
  def forward(self, x):
    return self.net(x)

In [78]:
class AttentionBlock(nn.Module):
  """
  Communication (multi-head attention) followed by computation (MLP)
  """
  def __init__(self):
    super().__init__()
    self.layer_norm_1 = nn.LayerNorm(conf.embd_size)
    self.layer_norm_2 = nn.LayerNorm(conf.embd_size)
    # Multi-head self-attention
    self.sa_head = MultiHeadAttention()
    # Feed-forward after multi-head attention
    self.fd = FeedForward()

  def forward(self, x):
    x = x + self.sa_head(self.layer_norm_1(x))
    x = x + self.fd(self.layer_norm_2(x))
    return x

In [79]:
class AttentionNeuralNetwork(nn.Module):
  """Language model constructed of attention blocks"""
  def __init__(self):
    super().__init__()
    # Embeding table for each token
    self.token_embeding_table = nn.Embedding(
                                   num_embeddings=conf.vocabolary_size,
                                   embedding_dim=conf.embd_size)
    # Positional embeding for each position in the sequence. This is because
    # tokens should have knowledge about their location in sequence
    self.position_embeding_table = nn.Embedding(
                                      num_embeddings=conf.block_size,
                                      embedding_dim=conf.embd_size)

    # Attention blocks
    self.blocks = nn.Sequential(
        *[AttentionBlock()
          for _ in range(conf.num_attention_block)])
    # Normalize
    self.layer_norm = nn.LayerNorm(conf.embd_size)
    # Last linear layer to increase output from embeding size to vocabolary size
    self.lm_head = nn.Linear(conf.embd_size, conf.vocabolary_size)

  def forward(self, idx, targets=None):
    B, T = idx.shape
    token_emb = self.token_embeding_table(idx) # (B, T, C_emb)
    pos_emb = self.position_embeding_table(
                      torch.arange(T, device=conf.device)) # (T, C_emb)
    # add token embedings and positional embedings
    x = token_emb + pos_emb # (B, T, C_emb)
    x = self.blocks(x) # (B, T, C_emb)
    x = self.layer_norm(x) # (B, T, C_emb)
    logits = self.lm_head(x) # (B, T, C_vocab)

    if targets is not None:
      # Calculate cross entropy loss
      B, T, C = logits.shape
      logits = logits.view(B * T, C)
      targets = targets.view(B * T)
      loss = F.cross_entropy(logits, targets)
    else:
      loss = None

    return logits, loss

  def generate(self, idx, max_new_tokens):
    """
    This generate text
    idx: (B, T tensor)
    """
    model.eval()
    for i in range(max_new_tokens):
      # crop idx
      idx_crop = idx[:, -conf.block_size:]
      # new predication
      logits, _ = self(idx_crop)
      # last time stamp
      logits = logits[:,-1,:] # B, C
      # softmax to get probabilities
      probs = F.softmax(logits, dim=-1) # B, C
      # draw sample from disribution
      idx_next = torch.multinomial(probs, num_samples=1)
      idx = torch.cat((idx, idx_next), dim=1)

    return idx

In [80]:
# Instance of model
model = AttentionNeuralNetwork()
model.to(conf.device)

AttentionNeuralNetwork(
  (token_embeding_table): Embedding(65, 384)
  (position_embeding_table): Embedding(256, 384)
  (blocks): Sequential(
    (0): AttentionBlock(
      (layer_norm_1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      (layer_norm_2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      (sa_head): MultiHeadAttention(
        (heads): ModuleList(
          (0-5): 6 x Head(
            (key): Linear(in_features=384, out_features=64, bias=False)
            (value): Linear(in_features=384, out_features=64, bias=False)
            (query): Linear(in_features=384, out_features=64, bias=False)
            (dropout): Dropout(p=0.2, inplace=False)
          )
        )
        (projection): Linear(in_features=384, out_features=384, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
      )
      (fd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Lin

In [81]:
# Put model in train mode
model.train()

# Train model
optimizer = torch.optim.AdamW(model.parameters(), lr=conf.learning_rate)

# Train model for maximum number of batches
for iter_i in range(conf.number_iterations):
  # Every once evaluate train and eval dataset loss
  if iter_i % conf.eval_interval == 0:
    evaluation = estimate_loss()
    print('Step {}: Train loss {:.3f}, Eval loss {:.3f}'.format(
                                                    iter_i,
                                                    evaluation['train'],
                                                    evaluation['eval']))
  # Train the model with one mini-batch
  X, y_true = get_batch(split_name='train')
  logits, loss = model(X, y_true)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()

Step 0: Train loss 4.273, Eval loss 4.271
Step 500: Train loss 2.009, Eval loss 2.096
Step 1000: Train loss 1.606, Eval loss 1.779
Step 1500: Train loss 1.439, Eval loss 1.640
Step 2000: Train loss 1.340, Eval loss 1.571
Step 2500: Train loss 1.280, Eval loss 1.538
Step 3000: Train loss 1.226, Eval loss 1.506
Step 3500: Train loss 1.181, Eval loss 1.487
Step 4000: Train loss 1.146, Eval loss 1.484
Step 4500: Train loss 1.110, Eval loss 1.478


In [82]:
# Generate a sample
idx = torch.zeros((1, 1), dtype=torch.long, device=conf.device)
print(decode(model.generate(idx, max_new_tokens=1500)[0].tolist()))


Upweak ta'en their covery souls, with the king:
Boy. Better than I say, Clarence' come.

ABHORSON:
What should the Capulet's daughter? Dost thou?

SOMERSET:
Marry, sir, no man stop to her.

HORTENSIO:
One poor Romeo more son, and be a
balefant words in the things. O, sir, I am respected
more with him and friendship of a tyrant!

TYRREL:
O, think you shall not dream,
Whose treason seven sons should be so in man's
Shall dispose the seas, and like even false with me
I betray my adversed pluck my foot: then yet I
In sister what I plainly was while as smuch as
You shall.

GLOUCESTER:
I would it go against your cousin; and I was
Against your body or in the leisure of youth:
Is it thou remembran not to begetter thy flight?
I will but a scolding friend
Thee, uncle you and forswear to beging this Katham,
Why stoling you look'd by him met a command.
O many is her protectors?
'Tis subound to save: we cannot know her, my lord,
I shall be certain'd to my hope.
How now! What, yea present your honou