### Transformer Together

👋 Thanks for agreeing to join our weird tutorial-o-thon. Here's what you need to know:

#### Participants

- @av
- @stradivar
- @RomanSavarin
- @wpierwolacraft
- @StanislavRassolenko

#### Goal

- Go through as large a chunk of [Karpathy's NanoGPT tutorial](https://www.youtube.com/watch?v=kCc8FmEb1nY&ab_channel=AndrejKarpathy) as we can
- Understand all the relevant bits and pieces along the way as best we can
- Enjoy the beautiful awkwardness of trying to code something together with a few random dudes
- Nerd out

We likely won't complete the tutorial, but we'll try to tackle as much as possible.
For final reference, here's a [Completed Tutorial Notebook](https://github.com/av/mlm/blob/main/src/tutorials/006_bigram_v4_transformer.ipynb)

#### Logistics

- Arrive around 12am, May 26th, the address was sent to you privately
- @av will order pizza and provide snacks/drinks, but feel free to bring your own
- Have your laptop with you, maybe a charger, Internet will be available via WiFi

#### How to prepare?

- You'll need a Jupyter-notebook-compatible environment, preferably with GPU support
  - If you know what you're doing - skip this section
  - Otherwise, you don't need anything special! A _very easy_ option is to use [Google Colab](https://colab.research.google.com/) (Google Colaboratory)
    - You'll only need a Browser and Google Account to start
    - See setup guide below
- You don't need any deep knowledge of Python; the only requirement is being able to read and understand code like this:<br/>
  ```python
  from module import function

  def add(a, b):
      return a + b

  print(add(1, 2))

  class MyClass:
      def __init__(self, a):
          self.a = a

      def add(self, b):
          return self.a + b
  ```

- In terms of actual knowledge, the more you know the more boring the session will be, so _don't worry too much_
  - If you can imagine what a 768-dimensional vector looks like, or how multiplying two matrices could express a relationship between two words in a text, you're just there for the food and company 😃
- If you want to keep it boring, here are a few cool things as a refresher
  - [📽️ Linear Algebra](https://www.3blue1brown.com/topics/linear-algebra) (you're interested in Vectors, Matrices, Dot Product, mainly)
  - [📽️ Neural Networks and Deep Learning](https://www.3blue1brown.com/topics/neural-networks)
  - [📽️ Watching a Neural Network learn](https://www.youtube.com/watch?v=TkwXa7Cvfr8&ab_channel=EmergentGarden)
  - [📜/👁️ Immersive Linear Algebra](https://immersivemath.com/ila/tableofcontents.html?)
  - [📜/👁️ A visual intro to machine learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [📜 Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/)
  - [StatQuest on Transformers](https://www.youtube.com/watch?v=zxQyTK8quyY&ab_channel=StatQuestwithJoshStarmer)
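
To make the "multiplying matrices relates words" idea a bit more concrete, here is a tiny sketch in plain Python. The three 3-dimensional vectors are made up for illustration (real embeddings are learned and much larger); the point is only that the dot product is larger for vectors that point in a similar direction.

```python
# Toy, hand-made "embeddings" (hypothetical values, for illustration only).
embeddings = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

def dot(u, v):
    """Dot product: multiply matching components and sum them up."""
    return sum(a * b for a, b in zip(u, v))

words = list(embeddings)

# A matrix product E @ E^T is just every pairwise dot product at once.
similarity = [[dot(embeddings[a], embeddings[b]) for b in words] for a in words]

for word, row in zip(words, similarity):
    print(word, [round(s, 2) for s in row])
```

In this toy setup "pizza" scores much higher against "pasta" than against "laptop", which is the kind of relationship the matrix multiplications in a transformer get to exploit.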

##### Google Colab setup

Colab allows you to import arbitrary Notebooks from GitHub, such as this one 🎉.
Here's the URL of this Notebook to import into Colab:

```
https://github.com/av/mlm/blob/main/src/tutorials/007_transformer_together.ipynb
```

Once imported, you're good to go! 🚀

When we get to actual training, you might want to switch to a GPU-enabled runtime. To do so, go to `Runtime` -> `Change runtime type` and select `GPU` from the dropdown. The GPU runtime is time-limited, but we'll figure it out when we get there.

### Hello, Colab

In [1]:
# With your cursor inside a code cell,
# run it with Ctrl/Cmd + Enter
print('Hello, Colab!')

Hello, Colab!


In [3]:
# torch should be available by default
# in Colab's notebook runtime
import torch

# Should be False in default Colab Runtime,
# True when used with GPU
torch.cuda.is_available()

True

### Get the data

In [None]:
# This code will download our sample dataset to be used by the notebook
import requests

# Shakespeare's works dataset
dataset = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

# News headlines dataset
# dataset = ""

def download_dataset(url):
    response = requests.get(url)

    if response.status_code == 200:
        # Save the raw bytes next to the notebook as dataset.txt
        with open("dataset.txt", "wb") as file:
            file.write(response.content)
        print("File downloaded successfully.")
    else:
        print("Failed to download the file.")

download_dataset(dataset)

In [4]:
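# Note: the outputs recorded below come from a local news-headlines file.
# If you used the download cell above, open 'dataset.txt' instead.
with open('/data/news.txt', 'r', encoding='utf-8') as file: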
    text = file.read()

In [5]:
print("Characters: ", len(text))

Characters:  9271346


In [6]:
print(text[:100])

Royal Mail pauses fake stamp fines on claims of faulty technology
Indigenous leader Olivia Bisa Tirk


### Code away!

In [15]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

print('Vocab:')
print(''.join(chars))
print('Vocab size:', vocab_size)

Vocab:

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz|~ ¡¢£¤¥©«¬­®°±²³´µ·¹º»½ÁÂÃÄÅÇÉÎÑÓÖ×ØÚÛÜßàáâãäåæçèéêëíîïðñòóôöøùúüýĀāăąĆćČčĐđēėęěğħİıŁłńŌōőřŚśŞşŠšŢūżžșțʻ́̈αμАРСавгдезийклмнопрстуфхцшыяءأإئابةتجخدرزسشصطعغفقكلمنهويầệờ ​‏‐‑–—‘’‚“”„•․… ″₂₩€₴₵₹₽№℠™ⅡⅢ↑→↓−▼「」上与专业于产亮亿企作信優元內全利勢升半协卡合商四团團型增家年广底度微快成或战房持授智最月有权板業構模橋汽洋浙潤物现用界略眼签築級编署腾至获营行表規評议讯贺辑远遠银销長集雙面额高️﻿！，：｜￦￼
Vocab size: 396


In [20]:
stoi = { ch: i for i, ch in enumerate(chars) }
itos = { i: ch for i, ch in enumerate(chars) }

encode = lambda s: [stoi[c] for c in s]
decode = lambda x: ''.join([itos[i] for i in x])

print(encode('hellow'))
print(decode(encode('hellow')))

[72, 69, 76, 76, 79, 87]
hellow
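
The same character-level tokenizer can be tried on a tiny string to see exactly what the two lookup tables contain. This is a minimal sketch on a toy text (`"hello world"` is a stand-in, not the real dataset); it also shows that characters missing from the vocabulary cannot be encoded.

```python
text = "hello world"  # toy stand-in for the dataset

# Vocabulary: every distinct character, in sorted order
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)
print(decode(ids))

# 'y' never appears in the toy text, so it has no id
try:
    encode("hey")
except KeyError as err:
    print("unknown character:", err)
```

The ids depend entirely on which characters the text contains, which is why `'h'` maps to 72 in the notebook's 396-character vocabulary but to a much smaller number here.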


In [26]:
import torch

data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100])

torch.Size([9271346]) torch.int64
tensor([51, 79, 89, 65, 76,  1, 46, 65, 73, 76,  1, 80, 65, 85, 83, 69, 83,  1,
        70, 65, 75, 69,  1, 83, 84, 65, 77, 80,  1, 70, 73, 78, 69, 83,  1, 79,
        78,  1, 67, 76, 65, 73, 77, 83,  1, 79, 70,  1, 70, 65, 85, 76, 84, 89,
         1, 84, 69, 67, 72, 78, 79, 76, 79, 71, 89,  0, 42, 78, 68, 73, 71, 69,
        78, 79, 85, 83,  1, 76, 69, 65, 68, 69, 82,  1, 48, 76, 73, 86, 73, 65,
         1, 35, 73, 83, 65,  1, 53, 73, 82, 75])


In [27]:
n = int(0.9 * len(data))

train_data = data[:n]
val_data = data[n:]

In [28]:
block_size = 8

train_data[:block_size+1]

tensor([51, 79, 89, 65, 76,  1, 46, 65, 73])

In [29]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"{context} -> {target}")

tensor([51]) -> 79
tensor([51, 79]) -> 89
tensor([51, 79, 89]) -> 65
tensor([51, 79, 89, 65]) -> 76
tensor([51, 79, 89, 65, 76]) -> 1
tensor([51, 79, 89, 65, 76,  1]) -> 46
tensor([51, 79, 89, 65, 76,  1, 46]) -> 65
tensor([51, 79, 89, 65, 76,  1, 46, 65]) -> 73
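
The same context/target pairs are easier to read as characters. A minimal sketch in plain Python, using the first words of the dataset printed earlier (`"Royal Mail"`) as the sample text:

```python
text = "Royal Mail"
block_size = 8

# block_size + 1 characters give block_size training examples
chunk = text[:block_size + 1]
x, y = chunk[:-1], chunk[1:]  # targets are the inputs shifted by one

examples = [(x[:t + 1], y[t]) for t in range(block_size)]

for context, target in examples:
    print(f"{context!r} -> {target!r}")
```

Each chunk teaches the model to predict from contexts of every length between 1 and `block_size`, which is what lets generation start from a single character later.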


In [30]:
# Batching/chunking the dataset for feeding it to the GPU
torch.manual_seed(42)

batch_size = 4 # independent sequences to be processed in parallel
block_size = 8 # maximum context length within each sequence

def get_batch(split):
  # batch of data of inputs x and targets y
  data = train_data if split == 'train' else val_data
  # batch_size random start offsets within data
  ix = torch.randint(len(data) - block_size, (batch_size,))

  # Context and target sequences
  # ? Why isn't a single sequence used?
  x = torch.stack([data[i:i+block_size] for i in ix])
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])

  return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)

print('\ntargets:')
print(yb.shape)
print(yb)

print('---')

for b in range(batch_size):
  for t in range(block_size):
    context = xb[b, :t+1]
    target = yb[b, t]

    print(f"context: {context}, target: {target}")


inputs:
torch.Size([4, 8])
tensor([[79, 78,  1, 73, 78,  1, 46, 65],
        [ 1, 38, 76, 69, 67, 84, 82, 79],
        [53, 82, 65, 78, 83, 73, 84, 73],
        [ 1, 84, 79,  1, 70, 79, 82, 77]])

targets:
torch.Size([4, 8])
tensor([[78,  1, 73, 78,  1, 46, 65, 82],
        [38, 76, 69, 67, 84, 82, 79, 78],
        [82, 65, 78, 83, 73, 84, 73, 79],
        [84, 79,  1, 70, 79, 82, 77, 65]])
---
context: tensor([79]), target: 78
context: tensor([79, 78]), target: 1
context: tensor([79, 78,  1]), target: 73
context: tensor([79, 78,  1, 73]), target: 78
context: tensor([79, 78,  1, 73, 78]), target: 1
context: tensor([79, 78,  1, 73, 78,  1]), target: 46
context: tensor([79, 78,  1, 73, 78,  1, 46]), target: 65
context: tensor([79, 78,  1, 73, 78,  1, 46, 65]), target: 82
context: tensor([1]), target: 38
context: tensor([ 1, 38]), target: 76
context: tensor([ 1, 38, 76]), target: 69
context: tensor([ 1, 38, 76, 69]), target: 67
context: tensor([ 1, 38, 76, 69, 67]), target: 84
context: te
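
On the "why not a single sequence" question in the cell above: the rows of a batch are independent chunks that never interact, and stacking them simply lets the hardware process several chunks in one go. Below is a plain-Python sketch of the same sampling logic, without PyTorch. The `data` list is a stand-in for the encoded text, chosen so that each target is visibly the input shifted by one.

```python
import random

random.seed(42)

data = list(range(100))  # stand-in for the encoded token ids
batch_size, block_size = 4, 8

def get_batch(data):
    # The largest valid start still leaves room for block_size inputs
    # plus one extra token for the final target.
    starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

xb, yb = get_batch(data)

for row_x, row_y in zip(xb, yb):
    print(row_x, "->", row_y)
```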

In [50]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(42)

class BigramLanguageModel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

  def forward(self, inputs, predictions=None):
    # inputs -> (B, T) token ids, encoded from the input text
    # predictions -> (B, T) target token ids we want to predict (optional)

    # "logits" -> per-position scores for the next token, looked up in the embedding table
    logits = self.token_embedding_table(inputs)

    if predictions is None:
      return logits, None

    B, T, C = logits.shape
    logits = logits.view(B*T, C)
    predictions = predictions.view(-1)

    loss = F.cross_entropy(logits, predictions)

    return logits, loss

  def generate(self, idx, max_new_tokens):
    # idx is a (Batch, Time) tensor of integers, representing current context
    for _ in range(max_new_tokens):
      # compute the predictions
      logits, loss = self(idx) # (B, T, C)

      # Taking only the last time step (-1) looks odd for a bigram model:
      # we compute logits for every position and then throw away everything
      # except the very last token of each sequence to make our prediction.
      # It is done this way only to allow for an easier transition to an N-gram model later.
      logits = logits[:, -1, :] # (B, C)

      # Probabilities from logits:
      # softmax turns raw scores such as [1, 2, 3, 4] into positive values that sum to 1
      probs = F.softmax(logits, dim=-1)
      idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)

      # append to the currently running context
      idx = torch.cat((idx, idx_next), dim=1) # (B, T + 1)
    return idx

m = BigramLanguageModel(vocab_size)

# Loss of the untrained model on one batch
logits, loss = m(xb, yb)
loss

tensor(6.5770, grad_fn=<NllLossBackward0>)
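
A quick sanity check for that number: a model that assigns the same probability to every character would have a cross-entropy of `ln(vocab_size)`. The sketch below computes that baseline in plain Python, taking the vocabulary size from the output above.

```python
import math

vocab_size = 396  # from the "Vocab size" output above

# Cross-entropy when every character gets probability 1 / vocab_size
uniform_loss = -math.log(1 / vocab_size)

print(round(uniform_loss, 4))
```

The untrained model's 6.5770 sits somewhat above this baseline. That is plausible, since a randomly initialised embedding table produces uneven logits rather than a perfectly uniform guess.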


In [159]:
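# Sample generate
# Start from a single token with id 0 (the newline character in this vocab),
# created on the same device as the model
idx = torch.zeros((1, 1), dtype=torch.long, device=next(m.parameters()).device)
generated = m.generate(idx, max_new_tokens=100)[0]

print(decode(generated.tolist()))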


VInkfe Norer Cı#ع用п., No se家Viland₹=智™潤غÂMesecone
Mif Twiovit Joe
SADousidys D3re M&持кء@øтн亮模ĀŞ营ب遠ğ板


In [136]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)



In [152]:
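# Sample training loop
batch_size = 32
train_steps = 100
# Fall back to the CPU when no GPU runtime is attached
device = 'cuda' if torch.cuda.is_available() else 'cpu'
m.to(device)

for step in range(train_steps):
  # sample a batch of data
  xb, yb = get_batch('train')
  # .to() returns a new tensor, so the result has to be reassigned
  xb, yb = xb.to(device), yb.to(device)

  # forward pass, backward pass, parameter update
  logits, loss = m(xb, yb)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()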

print(loss.item())

2.6536026000976562
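
One way to read a loss value is to exponentiate it, which gives the perplexity: roughly, the number of equally likely characters the model is still choosing between. A small plain-Python sketch using the value printed above (keep in mind this is the loss of a single, final batch, so it is a noisy estimate):

```python
import math

loss = 2.6536026000976562  # value printed by the training cell above
vocab_size = 396           # from the "Vocab size" output

perplexity = math.exp(loss)

print(f"perplexity: {perplexity:.1f} (uniform guessing would be {vocab_size})")
```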


In [160]:
torch.manual_seed(42)

B, T, C = 4, 8, 2 # Batch, Time, Channel
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [161]:
torch.tril(torch.ones(T, T))

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [164]:
'We have pizza. Roma ate '

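# "Bag of words": xbow[b, t] is the mean of x[b, 0..t],
# i.e. every position averages itself and everything before it
xbow = torch.zeros(B, T, C)

for b in range(B):
  for t in range(T):
    xprev = x[b,:t+1] # (t+1, C)
    xbow[b, t] = torch.mean(xprev, 0)

xbow[0]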

tensor([[ 1.9269,  1.4873],
        [ 1.4138, -0.3091],
        [ 1.1687, -0.6176],
        [ 0.8657, -0.8644],
        [ 0.5422, -0.3617],
        [ 0.3864, -0.5354],
        [ 0.2272, -0.5388],
        [ 0.1027, -0.3762]])

In [165]:
x[0]

tensor([[ 1.9269,  1.4873],
        [ 0.9007, -2.1055],
        [ 0.6784, -1.2345],
        [-0.0431, -1.6047],
        [-0.7521,  1.6487],
        [-0.3925, -1.4036],
        [-0.7279, -0.5594],
        [-0.7688,  0.7624]])

In [166]:
xbow[0]

tensor([[ 1.9269,  1.4873],
        [ 1.4138, -0.3091],
        [ 1.1687, -0.6176],
        [ 0.8657, -0.8644],
        [ 0.5422, -0.3617],
        [ 0.3864, -0.5354],
        [ 0.2272, -0.5388],
        [ 0.1027, -0.3762]])

In [173]:
torch.tril(torch.ones(3, 3))

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In [177]:
torch.manual_seed(42)

a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0, 10, (3, 2)).float()

c = a @ b

print(a)
print('@')
print(b)
print('=')
print(c)

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
@
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
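
The same toy example can be rebuilt without PyTorch to see why it works: row `t` of the weight matrix holds `1/(t+1)` for positions up to `t` and zero afterwards, so multiplying by it yields the running average of the rows of `b`. The helper names below are illustrative, not part of the notebook.

```python
def tril_average_weights(n):
    """Row t has 1/(t+1) in columns 0..t and 0 elsewhere."""
    return [[1 / (t + 1) if j <= t else 0.0 for j in range(n)] for t in range(n)]

def matmul(a, b):
    """Plain nested-loop matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

a = tril_average_weights(3)
b = [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]  # same values as in the cell above
c = matmul(a, b)

for row in c:
    print([round(v, 4) for v in row])
```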


In [182]:
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)

# (T, T) @ (B, T, C) broadcasts over the batch -> (B, T, C)
xbow2 = wei @ x

print(wei[0], x[0], xbow[0])


tensor([1., 0., 0., 0., 0., 0., 0., 0.]) tensor([[ 1.9269,  1.4873],
        [ 0.9007, -2.1055],
        [ 0.6784, -1.2345],
        [-0.0431, -1.6047],
        [-0.7521,  1.6487],
        [-0.3925, -1.4036],
        [-0.7279, -0.5594],
        [-0.7688,  0.7624]]) tensor([[ 1.9269,  1.4873],
        [ 1.4138, -0.3091],
        [ 1.1687, -0.6176],
        [ 0.8657, -0.8644],
        [ 0.5422, -0.3617],
        [ 0.3864, -0.5354],
        [ 0.2272, -0.5388],
        [ 0.1027, -0.3762]])
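
The same averaging weights can also be produced by a softmax over masked scores, which is the form the tutorial video moves to next. Below is a plain-Python sketch (no PyTorch, `T = 4` to keep the output short): future positions get a score of negative infinity, and softmax turns them into a weight of exactly zero.

```python
import math

T = 4
NEG_INF = float("-inf")

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

# All visible positions start with the same score (0.0);
# positions in the future (j > t) are masked out with -inf.
scores = [[0.0 if j <= t else NEG_INF for j in range(T)] for t in range(T)]
wei = [softmax(row) for row in scores]

for row in wei:
    print([round(w, 4) for w in row])
```

With equal scores this reproduces the plain running average. The softmax form becomes useful once those zeros are replaced by scores that depend on the data, so that some earlier positions count more than others.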
