E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?

In [176]:
# Trigram Model

words = open('names.txt', 'r').read().splitlines()


In [177]:
# Build the stoi (string to index) and itos (index to string) lookup dictionaries

chars = sorted(list(set(''.join(words))))
chars.insert(0, '.')  # Insert the period as the first character

# Initialize stoi for single characters
stoi = {char: i for i, char in enumerate(chars)}  # Single character to index mapping

# Start bigram indices after the single characters
bigram_start_idx = len(stoi)

# Add bigrams to stoi with unique indices
for i, bigram in enumerate((a + b for a in chars for b in chars), bigram_start_idx):
    stoi[bigram] = i

# Create itos for the reverse mapping (index to string)
itos = {i: s for s, i in stoi.items()}


In [179]:
import torch
import torch.nn.functional as F

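# Initialize the weights: one row per stoi entry (27 single characters + 27*27 bigrams = 756)
# and one column per possible next character (27). Only the bigram rows are used as inputs.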
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((756, 27), generator=g, requires_grad=True)

# create the training set of trigrams (x,y)
xs, ys = [], []
for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1+ch2]
    ix3 = stoi[ch3]
    xs.append(ix1)
    ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)

number of examples:  17129


In [182]:
for k in range(30):
  # forward pass
  xenc = F.one_hot(xs, num_classes=756).float() # input to the network: one-hot encoding
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean() # NLL + L2 regularization
  print("Loss: ", loss.item())

  # backward pass
  W.grad = None # reset the gradient to zero
  loss.backward()
  W.data += -50 * W.grad # gradient descent step with learning rate 50

Loss:  3.4737541675567627
Loss:  3.173428535461426
Loss:  2.8988418579101562
Loss:  2.653740167617798
Loss:  2.4393434524536133
Loss:  2.2530033588409424
Loss:  2.0902934074401855
Loss:  1.9472403526306152
Loss:  1.8207788467407227
Loss:  1.708559513092041
Loss:  1.6087299585342407
Loss:  1.5197772979736328
Loss:  1.4404282569885254
Loss:  1.3695957660675049
Loss:  1.3063430786132812
Loss:  1.2498607635498047
Loss:  1.1994404792785645
Loss:  1.154453158378601
Loss:  1.1143321990966797
Loss:  1.0785629749298096
Loss:  1.0466748476028442
Loss:  1.018239140510559
Loss:  0.9928656220436096
Loss:  0.970200777053833
Loss:  0.9499266147613525
Loss:  0.9317591190338135
Loss:  0.915445864200592
Loss:  0.9007627964019775
Loss:  0.8875138163566589
Loss:  0.8755264282226562


In [191]:
# finally, sample from the 'neural net' model
g = torch.Generator().manual_seed(2147483647+3)

for i in range(5):
  out = []
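  # Start from the '..' context. Note that the training loop above pads each word
  # with a single '.', so '..' never occurs as an input there and the first character
  # is drawn from a row of W that only the regularization term has touched.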
  char1 = '.'
  char2 = '.'
  while True:
    ix = stoi[char1+char2]
    xenc = F.one_hot(torch.tensor([ix]), num_classes=756).float()
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    p = counts / counts.sum(1, keepdims=True) # probabilities for next character

    iy = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    out.append(itos[iy])
    char1 = char2
    char2 = itos[iy]
    if iy == 0:
      break
  print(''.join(out))

haidenjamxvia.
harlebpmkagaider.
ujpvoah.
ail.
ander.


E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

In [None]:
# Exercise 2
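
One possible way to do the 80/10/10 split is sketched below. This is only a sketch under stated assumptions: `split_words` is a hypothetical helper, the seed is arbitrary, and the `words` list here is a stand-in for the list loaded from names.txt. Shuffling a copy before slicing keeps the three sets disjoint.

```python
import random

def split_words(words, seed=42, train_frac=0.8, dev_frac=0.1):
    """Shuffle a copy of `words` and split it into train/dev/test lists."""
    shuffled = list(words)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n1 = int(train_frac * len(shuffled))
    n2 = int((train_frac + dev_frac) * len(shuffled))
    return shuffled[:n1], shuffled[n1:n2], shuffled[n2:]

# Hypothetical stand-in for the `words` list loaded from names.txt.
words = [f"name{i}" for i in range(100)]
train_words, dev_words, test_words = split_words(words)
print(len(train_words), len(dev_words), len(test_words))
```

The bigram and trigram training sets would then be built from `train_words` only, with the same loss computed separately on `dev_words` and `test_words`.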

E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?

In [None]:
# Exercise 3
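
A minimal sketch of such a sweep follows. It assumes the counting variant of the trigram model with add-k smoothing, not the neural net with L2 regularization used in E01. The tiny `train_words` and `dev_words` lists, the helper names, and the candidate values of `k` are all hypothetical placeholders. In the notebook the word lists would come from the E02 split.

```python
import torch

# Hypothetical toy data standing in for the real train/dev splits.
train_words = ["emma", "olivia", "ava", "isabella", "sophia", "mia"]
dev_words = ["amelia", "ella"]

chars = ['.'] + sorted(set(''.join(train_words + dev_words)))
stoi = {c: i for i, c in enumerate(chars)}
V = len(chars)

def trigram_indices(words):
    """Return an (n, 3) tensor of character indices, one row per trigram."""
    triples = []
    for w in words:
        chs = ['.'] + list(w) + ['.']
        for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
            triples.append((stoi[c1], stoi[c2], stoi[c3]))
    return torch.tensor(triples)

def count_trigrams(triples):
    """Count how often each (c1, c2, c3) triple occurs."""
    N = torch.zeros((V, V, V))
    for i1, i2, i3 in triples.tolist():
        N[i1, i2, i3] += 1
    return N

def nll(N, triples, k):
    """Average negative log-likelihood under add-k smoothed counts (k > 0)."""
    P = N + k
    P = P / P.sum(dim=2, keepdim=True)
    i1, i2, i3 = triples[:, 0], triples[:, 1], triples[:, 2]
    return -P[i1, i2, i3].log().mean().item()

train = trigram_indices(train_words)
dev = trigram_indices(dev_words)
N = count_trigrams(train)  # counts come from the training set only

results = {}
for k in [0.01, 0.1, 1.0, 10.0]:
    results[k] = (nll(N, train, k), nll(N, dev, k))
    print(f"k={k}: train loss {results[k][0]:.4f}, dev loss {results[k][1]:.4f}")

best_k = min(results, key=lambda k: results[k][1])  # pick k by dev loss
print("best k on dev:", best_k)
```

With add-k smoothing the train loss can only get worse as `k` grows, since larger `k` pulls every row toward the uniform distribution, so the interesting signal is where the dev loss bottoms out. The test set is touched once, with `best_k`, after the sweep is finished.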

E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
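
A small check of this claim, as a sketch: multiplying a one-hot matrix by `W` picks out the same rows as indexing `W` directly. The shape of `W` matches E01. The `xs` values are arbitrary, hypothetical indices.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((756, 27), generator=g)
xs = torch.tensor([27, 100, 755, 100])  # hypothetical input indices

# Explicit one-hot encoding followed by a matrix multiply, as in E01.
logits_onehot = F.one_hot(xs, num_classes=756).float() @ W

# Direct row lookup: no (len(xs), 756) one-hot matrix is ever built.
logits_index = W[xs]

same = torch.allclose(logits_onehot, logits_index)
print(same)
```

In the training loop this would amount to replacing the `xenc` line and `xenc @ W` with `logits = W[xs]`. Gradients still flow into the selected rows of `W`.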

E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
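
The sketch below compares the manual exp / normalize / log computation from E01 with `F.cross_entropy` on random, hypothetical logits and targets. It also shows one reason to prefer the built-in: it works in log space, so large logits do not overflow. The offset of 100 is an arbitrary choice to provoke that overflow in float32.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
logits = torch.randn((5, 27), generator=g)  # hypothetical logits for 5 examples
ys = torch.tensor([0, 3, 26, 7, 1])         # hypothetical target indices

def manual_nll(logits, ys):
    """Softmax followed by negative log-likelihood, written out as in E01."""
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    return -probs[torch.arange(len(ys)), ys].log().mean()

manual = manual_nll(logits, ys)
fused = F.cross_entropy(logits, ys)
print(manual.item(), fused.item())

# Shifting all logits leaves the softmax unchanged in exact arithmetic, but
# exp() of values near 100 overflows float32, so the manual version breaks.
big = logits + 100.0
manual_big = manual_nll(big, ys)
fused_big = F.cross_entropy(big, ys)
print(manual_big.item(), fused_big.item())
```

Besides the numerical stability, the built-in avoids materializing the intermediate `counts` and `probs` tensors. The regularization term from E01 would still be added to the result separately.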


E06: meta-exercise! Think of a fun/interesting exercise and complete it.