<a href="https://colab.research.google.com/github/dominiksakic/zero_to_hero/blob/main/adv_05_calculator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal
Train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.
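
One way to read the masking hint, as a minimal sketch: for next-token training on a string such as `7+3=10`, every target that merely repeats the prompt is replaced by -1, which a loss built as `nn.CrossEntropyLoss(ignore_index=-1)` would then skip. The helper name `make_example` and the hard-coded 12-character vocabulary are assumptions for illustration, not part of the exercise text.

```python
# Vocabulary assumed for illustration: '+', the ten digits, and '='.
chars = sorted("+=0123456789")
stoi = {ch: i for i, ch in enumerate(chars)}

IGNORE = -1  # label for positions that should not contribute to the loss

def make_example(problem):
    """Split e.g. '7+3=10' into next-token inputs and masked targets."""
    ids = [stoi[c] for c in problem]
    x = ids[:-1]                      # model input:        '7+3=1'
    y = ids[1:]                       # next-token targets: '+3=10'
    prompt_len = problem.index("=") + 1
    # Targets that only repeat the prompt ('+', '3', '=') are masked out.
    y = [IGNORE] * (prompt_len - 1) + y[prompt_len - 1:]
    return x, y

x, y = make_example("7+3=10")
print(x)  # [8, 0, 4, 11, 2]
print(y)  # [-1, -1, -1, 2, 1]
```

Only the two answer digits keep real labels, so the gradient comes from the part of the sequence the model is actually supposed to compute.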

- Where to get a dataset? Create one yourself? [Yourself]
  - write a script that randomizes two numbers a and b.
  - loop over n steps and record the results in a text file.
- How to tokenize the input?
  - Char level?
  - Tokens?
  - See below for some ideas.
- Train a simple GPT on the dataset like you would on any other text. What are the outputs? Why?
- How would you train in reverse order?
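
The dataset and reverse-order questions above could be approached roughly like this. It is a sketch only: `make_problem`, `write_dataset`, and their parameters are hypothetical names, and zero-padding the answer to a fixed width is an assumption made so that every line has the same length.

```python
import random

def make_problem(rng, max_value=9, width=2, reverse=False):
    """Return one addition problem as text, e.g. '7+5=12' ('7+5=21' if reversed)."""
    a = rng.randint(0, max_value)
    b = rng.randint(0, max_value)
    answer = str(a + b).zfill(width)   # zero-pad so every answer has the same length
    if reverse:
        answer = answer[::-1]          # least-significant digit first
    return f"{a}+{b}={answer}"

def write_dataset(path, n, seed=0, **kwargs):
    """Write n random problems to a text file, one per line."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n):
            f.write(make_problem(rng, **kwargs) + "\n")
```

With `reverse=True` the model sees the ones digit first, which mirrors the right-to-left order of the usual carry algorithm.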

# Try 1
- Simplest version of a calculator 0 - 9 + 0 - 9 = 00 - 18
- one sample has a block size of 6 characters, e.g. `7+3=10`.
- how much data could I create? The number of combinations?
  - Only 100 combinations (10 × 10)? That is not a lot, and I would think that the neural net can just memorize these!
  - With so little data, the network should be able to memorize every example, which would make it a perfect calculator on this range.
- I just have to mask the last two tokens.

In [37]:
import random

# Create the dataset: every single-digit sum, with the answer zero-padded to two digits.
text = ''
for i in range(10):
  for j in range(10):
    text += f"{i}+{j}={i+j:02d}"

chars = sorted(list(set(text)))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join(itos[i] for i in l)

In [38]:
chars

['+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=']

In [39]:
encode("9+9=18")

[10, 0, 10, 11, 2, 9]

In [96]:
# try a simple network that takes in an integer tensor and predicts the two answer tokens

import torch

block_size = 4
target_size = 2

data = encode(text)

X, Y = [], []

for i in range(0, (len(data) - block_size), 6):
    context = data[i : i + block_size]
    target  = data[i + block_size : i + block_size + target_size]

    # print(decode(context), "->", decode(target))

    X.append(context)
    Y.append(target)

X = torch.tensor(X)  # shape (N, 4)
Y = torch.tensor(Y)  # shape (N, 2)

X.shape

torch.Size([100, 4])

In [116]:
import torch.nn as nn
import torch.optim as optim

embed_dim = 10  # Hyperparam

model  = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),   # (B, 4) -> (B, 4, embed_dim)
    nn.Flatten(),                          # (B, 4 * embed_dim)
    nn.ReLU(),
    nn.Linear(block_size * embed_dim, target_size * vocab_size)  # (B, 2*vocab)
)

In [117]:
opt = optim.AdamW(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

In [118]:
for epoch in range(1000):
    logits = model(X)
    logits = logits.view(-1, target_size, vocab_size)

    loss = loss_fn(logits[:, 0, :], Y[:, 0]) + loss_fn(logits[:, 1, :], Y[:, 1])

    opt.zero_grad()
    loss.backward()
    opt.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, loss = {loss.item():.4f}")

Epoch 0, loss = 4.6840
Epoch 10, loss = 4.4301
Epoch 20, loss = 4.2085
Epoch 30, loss = 4.0148
Epoch 40, loss = 3.8470
Epoch 50, loss = 3.7036
Epoch 60, loss = 3.5823
Epoch 70, loss = 3.4803
Epoch 80, loss = 3.3950
Epoch 90, loss = 3.3236
Epoch 100, loss = 3.2638
Epoch 110, loss = 3.2134
Epoch 120, loss = 3.1706
Epoch 130, loss = 3.1338
Epoch 140, loss = 3.1019
Epoch 150, loss = 3.0739
Epoch 160, loss = 3.0491
Epoch 170, loss = 3.0269
Epoch 180, loss = 3.0067
Epoch 190, loss = 2.9883
Epoch 200, loss = 2.9714
Epoch 210, loss = 2.9557
Epoch 220, loss = 2.9410
Epoch 230, loss = 2.9272
Epoch 240, loss = 2.9143
Epoch 250, loss = 2.9019
Epoch 260, loss = 2.8903
Epoch 270, loss = 2.8793
Epoch 280, loss = 2.8688
Epoch 290, loss = 2.8587
Epoch 300, loss = 2.8490
Epoch 310, loss = 2.8395
Epoch 320, loss = 2.8304
Epoch 330, loss = 2.8215
Epoch 340, loss = 2.8128
Epoch 350, loss = 2.8044
Epoch 360, loss = 2.7962
Epoch 370, loss = 2.7881
Epoch 380, loss = 2.7802
Epoch 390, loss = 2.7724
Epoch 400, 
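
To put these numbers in context: the loop sums two cross-entropy terms, one per answer digit, so a natural reference point is the loss of a model that guesses uniformly over the 12-symbol vocabulary. A small worked calculation, using the vocabulary and target sizes from the cells above:

```python
import math

vocab_size = 12    # '+', '=', and the ten digits
target_size = 2    # two answer digits, each with its own cross-entropy term

# Cross-entropy of a uniform guess over the vocabulary, summed over both digits.
uniform_loss = target_size * math.log(vocab_size)
print(f"{uniform_loss:.4f}")  # 4.9698
```

The first logged value (4.6840) sits close to this baseline, and the loss is still around 2.77 at step 390, which suggests the model is far from having memorized the table at that point.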

In [120]:
def sample(model, context_str, stoi, itos, block_size=4, target_size=2):
    model.eval()
    context_ids = torch.tensor([stoi[c] for c in context_str], dtype=torch.long).unsqueeze(0)  # (1, 4)
    with torch.no_grad():
        logits = model(context_ids)  # (1, 2*vocab_size)
        logits = logits.view(1, target_size, -1)  # (1, 2, vocab_size)
        preds = torch.argmax(logits, dim=-1)  # (1, 2)
    pred_chars = ''.join(itos[i.item()] for i in preds[0])
    return pred_chars

# Example usage:
prompt = "7+3="
print(f"Input: '{prompt}'")
print("Predicted:", sample(model, prompt, stoi, itos))

Input: '7+3='
Predicted: 04
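
A single sample says little about whether the table was memorized. A possible follow-up, sketched here under assumptions, is to score all 100 prompts at once. `exact_match_accuracy` and `oracle` are hypothetical helpers; `oracle` is only a stand-in so the snippet runs on its own.

```python
def exact_match_accuracy(predict):
    """Score `predict` on all 100 single-digit problems.

    `predict` is any callable mapping a prompt like '7+3=' to a 2-character string.
    Returns (accuracy, list of problems answered incorrectly).
    """
    correct = 0
    wrong = []
    for a in range(10):
        for b in range(10):
            prompt = f"{a}+{b}="
            expected = f"{a + b:02d}"
            if predict(prompt) == expected:
                correct += 1
            else:
                wrong.append(prompt + expected)
    return correct / 100, wrong

# Stand-in predictor that always answers correctly (illustration only).
def oracle(prompt):
    a, b = prompt[:-1].split("+")
    return f"{int(a) + int(b):02d}"

acc, wrong = exact_match_accuracy(oracle)
print(acc, wrong)  # 1.0 []
```

With the notebook's model, one could pass `lambda p: sample(model, p, stoi, itos)` as `predict`; the returned `wrong` list then shows which sums are still missed.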
