<a href="https://colab.research.google.com/github/bo-bits/nn-zero-to-hero/blob/master/exercises/makemore_part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?

In [81]:
# Trigram Model
# download the names.txt file from github
!wget https://raw.githubusercontent.com/karpathy/makemore/master/names.txt

words = open('names.txt', 'r').read().splitlines()


--2024-11-06 13:13:11--  https://raw.githubusercontent.com/karpathy/makemore/master/names.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 228145 (223K) [text/plain]
Saving to: ‘names.txt.1’


2024-11-06 13:13:11 (11.0 MB/s) - ‘names.txt.1’ saved [228145/228145]



In [166]:
# Create dictionary for single and double characters
from itertools import islice

chars = sorted(list(set(''.join(words))))
chars.insert(0, '.')

# Initialize stoi for single characters
stoi = {char: i for i, char in enumerate(chars)}

# Start bigram indices after the single characters
bigram_start_idx = len(stoi)

# Add bigrams to stoi with unique indices
for i, (char1, char2) in enumerate((a + b for a in chars for b in chars), bigram_start_idx):
    stoi[char1 + char2] = i

# Create itos for reverse mapping
itos = {i: s for s, i in stoi.items()}
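
The pair indices built above follow a simple closed form, which can be handy for debugging. The sketch below is a pure-Python check under the assumption that the alphabet is '.' plus 'a' to 'z' (27 symbols, matching the 756 rows of W used later); `pair_index` is a hypothetical helper, not part of the notebook.

```python
# Assumed stand-in alphabet: '.' plus 'a'..'z' (27 symbols)
chars = ['.'] + [chr(c) for c in range(ord('a'), ord('z') + 1)]
stoi = {ch: i for i, ch in enumerate(chars)}

n = len(chars)  # number of single-character symbols

# Same construction as the cell above: pairs are numbered after the singles
for i, pair in enumerate((a + b for a in chars for b in chars), n):
    stoi[pair] = i

def pair_index(a, b):
    """Hypothetical helper: closed-form index of the two-character context a+b."""
    return n + n * stoi[a] + stoi[b]

print(len(stoi))                          # 27 singles + 27*27 pairs = 756
print(stoi['..'], pair_index('.', '.'))   # both are 27, the first pair index
print(stoi['em'] == pair_index('e', 'm'))
```

Because the first character is the outer loop variable, it contributes the stride of 27 and the second character contributes the offset.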


In [169]:
import torch
import torch.nn.functional as F

# Create the training set of trigrams (x,y)
xs, ys = [], []
for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1+ch2]
    ix3 = stoi[ch3]
    xs.append(ix1)
    ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)

number of examples:  17129


In [193]:
# Initialize
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((756, 27), generator=g, requires_grad=True)

In [194]:
for i in range(200):
  # forward pass
  logits = W[xs] # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num), ys].log().mean()+ 0.01*(W**2).mean()
  # print(f"Loss: ",loss.item())

  # backward pass
  W.grad = None # set to zero the gradient
  loss.backward()
  W.data += -50 * W.grad
print("Loss: ",loss.item())

Loss:  0.6853259801864624


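The bigram model achieves a loss of ~2, while the trigram model achieves a loss of ~0.7 after 200 iterations with regularization strength 0.01 and step size 50. The trigram figure is the training loss and includes the regularization term.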
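
Since E01 allows either counting or a neural net, a counting model can serve as an independent check on the neural-net loss. The sketch below is a minimal pure-Python version with add-k smoothing. It uses a hypothetical toy word list instead of names.txt, and it pads each word with two start tokens (as `dataset(2)` does in E02), so its numbers are not comparable to the ones above.

```python
import math
from collections import Counter

# Hypothetical toy word list; the notebook reads names.txt instead
toy_words = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia"]

def trigrams(words):
    """Yield (two-character context, next character) pairs, padded with '..'."""
    for w in words:
        chs = ['.', '.'] + list(w) + ['.']
        for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
            yield c1 + c2, c3

def avg_nll(words, counts, vocab_size, k=1.0):
    """Average negative log-likelihood under add-k smoothed trigram counts."""
    totals = Counter()
    for (ctx, _), c in counts.items():
        totals[ctx] += c
    total, n = 0.0, 0
    for ctx, nxt in trigrams(words):
        # Counter returns 0 for unseen keys, so k > 0 keeps p strictly positive
        p = (counts[(ctx, nxt)] + k) / (totals[ctx] + k * vocab_size)
        total -= math.log(p)
        n += 1
    return total / n

vocab_size = len(set(''.join(toy_words)) | {'.'})
counts = Counter(trigrams(toy_words))
loss = avg_nll(toy_words, counts, vocab_size)
print(f"trigram counting loss on the toy list: {loss:.4f}")
```

Here the smoothing constant k plays the role that the regularization strength plays in the neural net: larger values pull every row toward the uniform distribution.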

In [222]:
torch.arange(xs.nelement())

tensor([    0,     1,     2,  ..., 20128, 20129, 20130])

In [175]:
# Sample from the 'neural net' trigram model
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
  out = []
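  # Start from the '..' context. Note: the E01 training loop above never produces
  # '..' as an input (its first context is '.' + first letter), so this row of W
  # is only fitted when the data comes from dataset(2) in E02.
  char1 = '.'
  char2 = '.'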
  while True:
    ix = stoi[char1+char2]
    xenc = F.one_hot(torch.tensor([ix]), num_classes=756).float()
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    p = counts / counts.sum(1, keepdims=True) # probabilities for next character

    iy = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    out.append(itos[iy])
    char1 = char2
    char2 = itos[iy]
    if iy == 0:
      break
  print(''.join(out))

juwideddpkaqdd.
pgzqkygbynrqgjirrwtolcdgoftwzzvsagjpauyfmgadvaaikdbduikrwrmtrdsnjyievylarryzffvmumjhyfottmmj.
nfyaszwjhruagq.
cohaayaeboffmypjabdihejfmoifbwyfitpvgiasnhsvjihopbuxhddgosfmptpuviczqrjpiufjxhdtgr.
nrsla.


E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

In [223]:
# For exercise 2 I created a block_size-independent implementation of dataset
# to be able to call it for both bigram and trigram model

def dataset(block_size):
  xs, ys = [], []
  for w in words:
    context = [0] * block_size
    for ch in w + '.':
      ix = stoi[ch]
      s = ''.join(itos[i] for i in context) # context indices -> context string
      xs.append(stoi[s])
      ys.append(ix)
      context = context[1:] + [ix] # crop and append
  xs = torch.tensor(xs)
  ys = torch.tensor(ys)
  return xs, ys

In [224]:
from torch.utils.data import TensorDataset, random_split
import torch.nn.functional as F
from itertools import islice
from numpy import linspace


# Function to divide dataset into 0.8 train 0.1 dev 0.1 test
def split_dataset(xs, ys):
  dataset = TensorDataset(xs, ys)

  # Define the split sizes
  train_size = int(0.8 * len(dataset))
  dev_size = int(0.1 * len(dataset))
  test_size = len(dataset) - train_size - dev_size

  # Split the dataset
  train_dataset, dev_dataset, test_dataset = random_split(dataset, [train_size, dev_size, test_size])

  # Access indices from the subsets
  train_indices = train_dataset.indices
  dev_indices = dev_dataset.indices
  test_indices = test_dataset.indices

  # Extract training tensors
  X_train = xs[train_indices]
  y_train = ys[train_indices]

  # Extract dev tensors
  X_dev = xs[dev_indices]
  y_dev = ys[dev_indices]

  # Extract testing tensors
  X_test = xs[test_indices]
  y_test = ys[test_indices]

  return X_train, y_train, X_dev, y_dev, X_test, y_test
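
The function above splits at the level of individual (context, target) examples, so examples from the same name can land in different splits, and `random_split` is called without a seeded generator, so the split changes between runs. A possible alternative, sketched here in pure Python, is to shuffle the words with a fixed seed and split the word list before building examples. `split_words` and the toy list are hypothetical, not part of the notebook.

```python
import random

def split_words(words, seed=42, train_frac=0.8, dev_frac=0.1):
    """Shuffle a copy of the word list and cut it into train/dev/test parts."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)  # fixed seed -> reproducible split
    n_train = int(train_frac * len(shuffled))
    n_dev = int(dev_frac * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Hypothetical stand-in for the list read from names.txt
toy_words = [f"name{i}" for i in range(50)]
train_w, dev_w, test_w = split_words(toy_words)
print(len(train_w), len(dev_w), len(test_w))  # 40 5 5
```

Each part could then be turned into tensors separately, so that no name contributes examples to more than one split.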


In [225]:
# Function to train
def train(n_iter, reg):
  for i in range(n_iter):
    # forward pass
    logits = W[X_train] # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
    loss = -probs[torch.arange(X_train.nelement()), y_train].log().mean()+ reg*(W**2).mean()
    # print(f"Loss: ",loss.item())

    # backward pass
    W.grad = None # set to zero the gradient
    loss.backward()
    W.data += -50 * W.grad
  return loss

In [226]:
# Function to evaluate performance of bigram
def eval_bigram(X, y, reg):
  logits = W[X] # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  # note: the reported loss is the NLL plus the regularization penalty, not the NLL alone
  loss = -probs[torch.arange(X.nelement()), y].log().mean()+ reg*(W**2).mean()
  return loss

In [227]:
# Bigram Implementation

# Create Datasets
xs, ys = dataset(1)
X_train, y_train, X_dev, y_dev, X_test, y_test = split_dataset(xs, ys)

regs = [10, 1, 0.1, 0.01, 0.001, 0.0001, 1e-5, 1e-6, 1e-7]
tloss, eloss = [], []

for reg in regs:

  # Initialize
  g = torch.Generator().manual_seed(2147483647)
  W = torch.randn((27, 27), generator=g, requires_grad=True)

  # Train
  loss1 = train(200, reg)
  print("\nReg: ",reg, "\n Train Loss: ",loss1.item())
  tloss.append(loss1.item())

  # Evaluate
  loss2 = eval_bigram(X_dev, y_dev, reg)
  print(" Eval Loss: ", loss2.item())
  eloss.append(loss2.item())

# Test (runs once, after the loop, with the W and reg left over from the last iteration: reg=1e-7)
test_loss = eval_bigram(X_test, y_test, reg)
print("\n Test Loss: ", test_loss.item())


Reg:  10 
 Train Loss:  3.1518807411193848
 Eval Loss:  3.1599464416503906

Reg:  1 
 Train Loss:  2.6264991760253906
 Eval Loss:  2.649312973022461

Reg:  0.1 
 Train Loss:  2.118109941482544
 Eval Loss:  2.1434051990509033

Reg:  0.01 
 Train Loss:  1.948878288269043
 Eval Loss:  1.9749367237091064

Reg:  0.001 
 Train Loss:  1.9162352085113525
 Eval Loss:  1.942584753036499

Reg:  0.0001 
 Train Loss:  1.9123629331588745
 Eval Loss:  1.938744068145752

Reg:  1e-05 
 Train Loss:  1.9119677543640137
 Eval Loss:  1.9383524656295776

Reg:  1e-06 
 Train Loss:  1.9119285345077515
 Eval Loss:  1.9383131265640259

Reg:  1e-07 
 Train Loss:  1.9119243621826172
 Eval Loss:  1.9383093118667603

 Test Loss:  1.920753002166748


In [213]:
# Function to evaluate performance of trigram
def eval_trigram(X, y, reg):
  logits = W[X] # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(X.nelement()), y].log().mean()+ reg*(W**2).mean()
  return loss

In [228]:
# Trigram Implementation

# Create Datasets
xs, ys = dataset(2)
X_train, y_train, X_dev, y_dev, X_test, y_test = split_dataset(xs, ys)

regs = [10, 1, 0.1, 0.01, 0.001, 0.0001, 1e-5, 1e-6]
tloss, eloss = [], []

for reg in regs:

  # Initialize
  g = torch.Generator().manual_seed(2147483647)
  W = torch.randn((756, 27), generator=g, requires_grad=True)

  # Train
  loss = train(200, reg)
  print("\nReg: ",reg, "\n Train Loss: ",loss.item())
  tloss.append(loss.item())

  # Evaluate
  loss = eval_trigram(X_dev, y_dev, reg)
  print(" Eval Loss: ", loss.item())
  eloss.append(loss.item())

  # Test
  loss = eval_trigram(X_test, y_test, reg)
  print("\n Test Loss: ", loss.item())


Reg:  10 
 Train Loss:  2.083393096923828
 Eval Loss:  2.098149299621582

 Test Loss:  2.097676992416382

Reg:  1 
 Train Loss:  1.3488225936889648
 Eval Loss:  1.3733298778533936

 Test Loss:  1.3521088361740112

Reg:  0.1 
 Train Loss:  1.0763643980026245
 Eval Loss:  1.1066514253616333

 Test Loss:  1.0784415006637573

Reg:  0.01 
 Train Loss:  0.9770079255104065
 Eval Loss:  1.0080140829086304

 Test Loss:  0.9787640571594238

Reg:  0.001 
 Train Loss:  0.9654039740562439
 Eval Loss:  0.9964771270751953

 Test Loss:  0.9671182632446289

Reg:  0.0001 
 Train Loss:  0.964225172996521
 Eval Loss:  0.9953050017356873

 Test Loss:  0.9659352898597717

Reg:  1e-05 
 Train Loss:  0.9641072154045105
 Eval Loss:  0.9951876997947693

 Test Loss:  0.9658167362213135

Reg:  1e-06 
 Train Loss:  0.9640952944755554
 Eval Loss:  0.995175838470459

 Test Loss:  0.965804934501648


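On the dev set, both models score slightly worse than on the training set (bigram: 1.938 vs 1.912; trigram: 0.995 vs 0.964 at the smallest regularization). The test losses land close to the training losses (bigram 1.921, trigram 0.966), so neither model shows strong overfitting. Lowering the regularization below about 0.001 barely changes any of the losses.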

E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
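
A minimal sketch of the procedure E03 describes, using a counting trigram model with add-k smoothing in place of the notebook's neural net so that it runs in pure Python. The word list, the split sizes, and the grid of k values are all hypothetical. The point is the workflow: fit on train, pick k by dev loss, touch the test set once at the end.

```python
import math
import random
from collections import Counter

def trigrams(words):
    """Yield (two-character context, next character) pairs, padded with '..'."""
    for w in words:
        chs = ['.', '.'] + list(w) + ['.']
        for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
            yield c1 + c2, c3

def avg_nll(words, counts, totals, vocab_size, k):
    """Average negative log-likelihood under add-k smoothed counts."""
    total, n = 0.0, 0
    for ctx, nxt in trigrams(words):
        p = (counts[(ctx, nxt)] + k) / (totals[ctx] + k * vocab_size)
        total -= math.log(p)
        n += 1
    return total / n

# Hypothetical toy data; the notebook uses names.txt
toy_words = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia",
             "harper", "evelyn", "ella", "emily", "aria", "mila", "anna",
             "lena", "lia", "eva", "elia", "mara", "lara"]
random.Random(0).shuffle(toy_words)
train_w, dev_w, test_w = toy_words[:16], toy_words[16:18], toy_words[18:]

# Fit counts on the training words only
vocab_size = len(set(''.join(toy_words)) | {'.'})
counts = Counter(trigrams(train_w))
totals = Counter()
for (ctx, _), c in counts.items():
    totals[ctx] += c

# Sweep the smoothing strength and record train and dev loss for each value
ks = [10, 1, 0.1, 0.01, 0.001]
dev_losses = {k: avg_nll(dev_w, counts, totals, vocab_size, k) for k in ks}
for k in ks:
    train_loss = avg_nll(train_w, counts, totals, vocab_size, k)
    print(f"k={k:<6} train={train_loss:.4f} dev={dev_losses[k]:.4f}")

# Choose k on the dev set, then evaluate on the test set exactly once
best_k = min(dev_losses, key=dev_losses.get)
test_loss = avg_nll(test_w, counts, totals, vocab_size, best_k)
print(f"best k={best_k}, test loss={test_loss:.4f}")
```

The pattern to look for is that train loss keeps falling as the smoothing shrinks, while dev loss can bottom out and then rise again once unseen trigrams get almost no probability.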

E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?

In [49]:
# initialize and train model
def train_4(reg=0.001):
  # Initialize Network
  g = torch.Generator().manual_seed(2147483647)
  W = torch.randn((756, 27), generator=g, requires_grad=True)

  for k in range(100):
    # forward pass
    logits = W[X_train] # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
    loss = -probs[torch.arange(X_train.nelement()), y_train].log().mean()
    # print(f"Loss: ",loss.item())

    # backward pass
    W.grad = None # set to zero the gradient
    loss.backward()
    W.data += -50 * W.grad
  return loss.item()

loss = train_4()
print(loss)

0.6927981972694397
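
The equivalence E04 relies on can be checked on a small example. This is a pure-Python sketch with a hypothetical 4x3 weight matrix. `one_hot` and `vec_mat` are illustrative helpers standing in for `F.one_hot` and the `@` operator.

```python
def one_hot(ix, num_classes):
    """Illustrative stand-in for F.one_hot on a single index."""
    return [1.0 if j == ix else 0.0 for j in range(num_classes)]

def vec_mat(v, W):
    """Row vector times matrix, written out with plain loops."""
    n_cols = len(W[0])
    return [sum(v[i] * W[i][j] for i in range(len(W))) for j in range(n_cols)]

# Hypothetical 4x3 weight matrix
W = [[0.1, 0.2, 0.3],
     [1.0, 2.0, 3.0],
     [-1.5, 0.0, 4.0],
     [7.0, 8.0, 9.0]]

ix = 2
via_one_hot = vec_mat(one_hot(ix, len(W)), W)  # multiplies by mostly zeros
via_indexing = W[ix]                           # just reads row ix
print(via_one_hot == via_indexing)
```

The multiplication touches every row of W only to discard all but one, which is why indexing is the cheaper way to get the same logits.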



E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?

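F.cross_entropy calculates the cross entropy between two distributions (predicted and label).
It does so by calculating -sum(p[i]*log(q[i]) for i in range(len(p))), using the natural log. With integer class labels, p is one-hot, so this reduces to -log(q[label]), averaged over the batch.
Compared to our previous implementation, which exponentiated, normalized, and then took the log and mean of the output in separate steps, F.cross_entropy takes the raw logits and fuses softmax and log into a single log-softmax.

Why do we prefer cross entropy?
It is more efficient (fewer intermediate tensors and computations), and it is more numerically stable: large logits do not overflow in exp().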
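
One reason to prefer a loss that works directly on logits is numerical stability. The sketch below illustrates the idea in pure Python with the log-sum-exp trick. It shows the general technique rather than PyTorch's actual implementation, and `naive_nll` and `stable_nll` are hypothetical names.

```python
import math

def naive_nll(logits, target):
    """exp -> normalize -> log, as in the hand-written forward pass."""
    counts = [math.exp(z) for z in logits]  # overflows for large logits
    return -math.log(counts[target] / sum(counts))

def stable_nll(logits, target):
    """Same quantity via log-sum-exp: subtract the max before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

small = [1.0, 2.0, 3.0]
print(naive_nll(small, 2), stable_nll(small, 2))  # the two agree on small logits

large = [1000.0, 1001.0, 1002.0]
try:
    naive_nll(large, 2)
except OverflowError as err:
    print("naive version failed:", err)
print(stable_nll(large, 2))  # still finite
```

Shifting all logits by a constant leaves the softmax unchanged, so subtracting the maximum costs nothing in accuracy and keeps every exponent at or below zero.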

In [54]:
# initialize and train model
def train_5(reg=0.001):
  # Initialize Network
  g = torch.Generator().manual_seed(2147483647)
  W = torch.randn((756, 27), generator=g, requires_grad=True)

  for k in range(100):
    # forward pass
    logits = W[X_train] # predict log-counts
    loss = F.cross_entropy(logits, y_train)
    # print(f"Loss: ",loss.item())

    # backward pass
    W.grad = None # set to zero the gradient
    loss.backward()
    W.data += -50 * W.grad
  return loss.item()

loss = train_5()
print(loss)

0.6927981972694397



E06: meta-exercise! Think of a fun/interesting exercise and complete it.