<a href="https://colab.research.google.com/github/dominiksakic/zero_to_hero/blob/main/adv_05_calculator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal
Train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.
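The loss masking mentioned above can be sketched as follows. This is a minimal illustration, not a full training loop: the 12-symbol vocabulary, the helper name `make_example`, and the random stand-in logits are all assumptions made for the example. It relies only on the `ignore_index` argument of `torch.nn.functional.cross_entropy`.

```python
import torch
import torch.nn.functional as F

# Assumed toy vocabulary for illustration: digits, '+', '='.
chars = "0123456789+="
stoi = {ch: i for i, ch in enumerate(chars)}

def make_example(a, b):
    """Build next-token (x, y) tensors with the prompt positions masked out."""
    prompt = f"{a}+{b}="
    answer = str(a + b)
    ids = [stoi[ch] for ch in prompt + answer]
    x = torch.tensor(ids[:-1])
    y = torch.tensor(ids[1:])
    # y[t] is the token that follows x[t]. The first len(prompt) - 1 targets
    # are still part of the prompt, so they only restate the problem.
    y[: len(prompt) - 1] = -1
    return x, y

x, y = make_example(12, 39)          # "12+39=51"
torch.manual_seed(0)
logits = torch.randn(len(x), len(chars))  # stand-in for model output

# Positions whose target is -1 contribute nothing to the loss.
loss = F.cross_entropy(logits, y, ignore_index=-1)
answer_only = F.cross_entropy(logits[-2:], y[-2:])
print(y.tolist())
print(torch.allclose(loss, answer_only))
```

With this setup the gradient comes only from the answer digits, so the model is not penalized for failing to predict the (random) operands.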

- Where to get a dataset? Create one yourself? [Yourself]
  - write a script that randomizes two numbers a and b.
  - loop over n steps and record the results in a text file.
- How to tokenize the input?
  - Char level?
  - Tokens?
  - See below for some ideas.
- Train a simple GPT on the dataset like you would on any other text. What are the outputs? Why?
- How would you train in reverse order?
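One possible way to generate problems on the fly, with the answer optionally written in reverse digit order, is sketched below. The function names `format_problem` and `random_problem`, the zero-padding width, and the operand range are assumptions chosen for illustration.

```python
import random

def format_problem(a, b, width=3, reverse=True):
    """Format a+b=c, zero-padding c to a fixed width.

    With reverse=True the digits of c are emitted least-significant first,
    matching the right-to-left order of schoolbook addition.
    """
    c = str(a + b).zfill(width)
    if reverse:
        c = c[::-1]
    return f"{a}+{b}={c}"

def random_problem(rng, max_value=99, reverse=True):
    """Draw two random operands and format them as one training sample."""
    a = rng.randint(0, max_value)
    b = rng.randint(0, max_value)
    width = len(str(2 * max_value))  # widest possible sum
    return format_problem(a, b, width=width, reverse=reverse)

print(format_problem(17, 25))                 # 17+25=240  (42 -> "042" -> "240")
print(format_problem(17, 25, reverse=False))  # 17+25=042

rng = random.Random(0)
samples = [random_problem(rng) for _ in range(5)]
print(samples)
```

Reversing only changes the string after `=`; decoding a prediction means reversing it back before comparing with `a + b`.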

# Try 1
- Simplest version of a calculator: (0-9) + (0-9) = (00-18).
- One sample, e.g. `9+9=18`, has a block size of 6.
- How much data could I create? What is the number of combinations?
  - Only 100 combinations (10 × 10)? That is not a lot, and I would think the neural net can just memorize these!
  - With so little data, the network should be able to memorize all of it, making a perfect calculator.
- I just have to mask the last two blocks.
- Hypothesis was wrong: my current approach results in a uniform distribution (log(12)).
  - The dataset is too small.
  - No position-specific modeling.
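As a quick sanity check on the numbers in these notes, the sketch below enumerates every single-digit problem and computes the loss a uniformly guessing model would have. It assumes that `a+b` and `b+a` count as distinct problems and that the vocabulary is just the digits plus `+` and `=`; a second value is shown for the case where one separator symbol is added.

```python
import math

# Every ordered pair of single-digit operands, answer zero-padded to two digits.
problems = [f"{i}+{j}={i + j:02d}" for i in range(10) for j in range(10)]
vocab = sorted(set("".join(problems)))

n_problems = len(problems)
# Cross-entropy of a uniform guess over the vocabulary is ln(vocab size).
uniform_loss = math.log(len(vocab))
uniform_loss_with_sep = math.log(len(vocab) + 1)

print(n_problems, len(vocab))         # 100 12
print(round(uniform_loss, 3))         # 2.485
print(round(uniform_loss_with_sep, 3))  # 2.565
```

A training loss close to these values means the model has learned essentially nothing about which token comes next.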


In [29]:
import random

# create randomized and bigger dataset
text = ''
for _ in range(1000):
    i = random.randint(0, 9)
    j = random.randint(0, 9)
    # Zero-pad the sum to two digits so every sample has the same length
    result = f"{i}+{j}={i + j:02d}"

    # '_' marks the end of a sample
    text += result + '_'

chars = sorted(list(set(text)))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join(itos[i] for i in l)

In [30]:
chars

['+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=', '_']

In [32]:
encode("9+9=18_")

[10, 0, 10, 11, 2, 9, 12]

In [50]:
# Try a simple network that takes in an integer tensor and outputs an integer tensor

import torch

block_size = 4
target_size = 2

data = encode(text)

X, Y = [], []

for i in range((len(data) - block_size)):
    context = data[i : i + block_size]
    target  = data[i + 1 : i + block_size + 1]

    if i < 3:
      print(decode(context), "->", decode(target))

    X.append(context)
    Y.append(target)

X = torch.tensor(X)  # shape (N, block_size) = (N, 4)
Y = torch.tensor(Y)  # shape (N, block_size) = (N, 4): X shifted by one token

X.shape

9+8= -> +8=1
+8=1 -> 8=17
8=17 -> =17_


torch.Size([6996, 4])

In [51]:
import torch.nn as nn
import torch.optim as optim

embed_dim = 100  # Hyperparam

model  = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),   # (B, T) -> (B, T, embed_dim)
    nn.Linear(embed_dim, 100),              # (B, T, embed_dim) -> (B, T, 100)
    nn.ReLU(),
    nn.Linear(100, vocab_size)              # (B, T, 100) -> (B, T, vocab_size)
)
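Every layer in this stack acts on the last dimension only, so the prediction at a position can depend only on the token at that same position, not on the tokens before it. The sketch below illustrates this with an untrained copy of the same architecture; the hard-coded `vocab_size = 13` and the token ids follow the `chars` list shown earlier and are otherwise assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim = 13, 100

# Same layer stack as above, freshly initialized for the demonstration.
demo_model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, 100),
    nn.ReLU(),
    nn.Linear(100, vocab_size),
)

# Two different prompts that end in the same token '=' (id 11).
prompt_a = torch.tensor([[10, 0, 9, 11]])  # "9+8="
prompt_b = torch.tensor([[1, 0, 1, 11]])   # "0+0="

with torch.no_grad():
    last_a = demo_model(prompt_a)[:, -1, :]
    last_b = demo_model(prompt_b)[:, -1, :]

# The logits after '=' are identical, so the operands cannot influence the answer.
same_prediction = torch.allclose(last_a, last_b)
print(same_prediction)  # True
```

In other words, this network behaves like a bigram model: it can learn which tokens tend to follow `=`, but not what the sum of the two operands is.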

In [52]:
opt = optim.AdamW(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

In [53]:
for epoch in range(1000):
    logits = model(X)
    B, T, C = logits.shape

    logits_flat = logits.view(B*T, C)
    targets_flat = Y.view(B*T)

    loss = loss_fn(logits_flat, targets_flat)

    opt.zero_grad()
    loss.backward()
    opt.step()

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, loss = {loss.item():.4f}")

Epoch 0, loss = 2.5782
Epoch 100, loss = 1.6728
Epoch 200, loss = 1.6706
Epoch 300, loss = 1.6701
Epoch 400, loss = 1.6699
Epoch 500, loss = 1.6698
Epoch 600, loss = 1.6698
Epoch 700, loss = 1.6697
Epoch 800, loss = 1.6697
Epoch 900, loss = 1.6697


In [54]:
import torch.nn.functional as F

def generate(model, start_text, max_new_tokens=20):
    model.eval()
    context = torch.tensor(encode(start_text), dtype=torch.long).unsqueeze(0)

    for _ in range(max_new_tokens):
        context_cond = context[:, -block_size:]

        # Forward pass
        logits = model(context_cond)
        logits = logits[:, -1, :]

        # Convert to probabilities
        probs = F.softmax(logits, dim=-1)  # (1, vocab_size)

        # Sample next token
        next_id = torch.multinomial(probs, num_samples=1)  # (1, 1)

        # Append
        context = torch.cat([context, next_id], dim=1)

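        # Stop if the end-of-sample token '_' is generated.
        # next_id holds a token id, so compare against stoi['_'], not the character.
        if next_id.item() == stoi['_']:
            break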

    return decode(context.squeeze().tolist())

In [43]:
print(generate(model, "7+3="))

7+3=03+2=17=13+4=1+09+18
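A single sample like the one above is hard to judge by eye. One way to quantify how well a model adds is exact-match accuracy over all 100 single-digit problems, sketched below. The `predict` argument is a hypothetical callable (for example a wrapper around `generate` that returns only the two answer characters); it is not part of the notebook.

```python
def exact_match_accuracy(predict):
    """Fraction of single-digit problems answered exactly right.

    predict: hypothetical callable mapping a prompt such as '7+3='
             to a two-character answer string such as '10'.
    """
    correct = 0
    total = 0
    for i in range(10):
        for j in range(10):
            prompt = f"{i}+{j}="
            expected = f"{i + j:02d}"
            correct += int(predict(prompt) == expected)
            total += 1
    return correct / total

def oracle(prompt):
    """Reference predictor that parses the prompt and adds for real."""
    a, b = prompt[:-1].split("+")
    return f"{int(a) + int(b):02d}"

print(exact_match_accuracy(oracle))          # 1.0
print(exact_match_accuracy(lambda p: "10"))  # 0.09 (nine operand pairs sum to 10)
```

The constant predictor gives a useful floor: a model that ignores its operands cannot do much better than always guessing the most common answer.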
