<a href="https://colab.research.google.com/github/bo-bits/nn-zero-to-hero/blob/master/exercises/makemore_part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?

In [3]:
# Trigram Model

words = open('names.txt', 'r').read().splitlines()


In [4]:
# Initialize the stoi and itos dictionaries

chars = sorted(list(set(''.join(words))))
chars.insert(0, '.')  # Insert the period as the first character

# Initialize stoi for single characters
stoi = {char: i for i, char in enumerate(chars)}  # Single character to index mapping

# Start bigram indices after the single characters
bigram_start_idx = len(stoi)

# Add bigrams to stoi with unique indices
for i, bigram in enumerate((a + b for a in chars for b in chars), bigram_start_idx):
    stoi[bigram] = i

# Create itos for reverse mapping
# itos = {i:s for i, s in enumerate(stoi)}
itos = {i: s for s, i in stoi.items()}
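
The mapping above assigns 27 + 729 = 756 indices, but only the 729 two-character contexts are ever used as model inputs. One possible alternative, sketched here under the assumption of a 27-symbol alphabet ('.' plus the lowercase letters), computes the context index arithmetically so that no rows of W go unused. The helper names `ctx_index` and `ctx_chars` are hypothetical and not part of the notebook.

```python
import string

# Assumed alphabet: '.' plus the 26 lowercase letters, as in the cell above.
chars = ['.'] + list(string.ascii_lowercase)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
V = len(chars)  # 27

def ctx_index(ch1, ch2):
    """Map a two-character context to a unique index in [0, V*V)."""
    return stoi[ch1] * V + stoi[ch2]

def ctx_chars(ix):
    """Invert ctx_index: recover the two characters from the index."""
    return itos[ix // V], itos[ix % V]

print(ctx_index('.', '.'), ctx_index('z', 'z'))  # smallest and largest index: 0 728
print(ctx_chars(ctx_index('e', 'm')))            # ('e', 'm')
```

With this encoding W would have shape (729, 27) instead of (756, 27).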


In [5]:
import torch
import torch.nn.functional as F

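# Initialize the weights: 756 rows (27 single-character indices + 729 two-character
# indices; only the two-character ones are used as inputs) and 27 output classes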
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((756, 27), generator=g, requires_grad=True)

# create the training set of trigrams (x,y)
xs, ys = [], []
for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1+ch2]
    ix3 = stoi[ch3]
    xs.append(ix1)
    ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)

number of examples:  17129


In [6]:
for k in range(30):
  # forward pass
  xenc = F.one_hot(xs, num_classes=756).float() # input to the network: one-hot encoding
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
  print("Loss: ", loss.item())

  # backward pass
  W.grad = None # set to zero the gradient
  loss.backward()
  W.data += -50 * W.grad

Loss:  3.7961485385894775
Loss:  3.4737541675567627
Loss:  3.173428535461426
Loss:  2.8988418579101562
Loss:  2.653740167617798
Loss:  2.4393434524536133
Loss:  2.2530033588409424
Loss:  2.0902934074401855
Loss:  1.9472403526306152
Loss:  1.8207788467407227
Loss:  1.708559513092041
Loss:  1.6087299585342407
Loss:  1.5197772979736328
Loss:  1.4404282569885254
Loss:  1.3695957660675049
Loss:  1.3063430786132812
Loss:  1.2498607635498047
Loss:  1.1994404792785645
Loss:  1.154453158378601
Loss:  1.1143321990966797
Loss:  1.0785629749298096
Loss:  1.0466748476028442
Loss:  1.018239140510559
Loss:  0.9928656220436096
Loss:  0.970200777053833
Loss:  0.9499266147613525
Loss:  0.9317591190338135
Loss:  0.915445864200592
Loss:  0.9007627964019775
Loss:  0.8875138163566589


In [9]:
# finally, sample from the 'neural net' model
g = torch.Generator().manual_seed(2147483647)

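for i in range(5):
  out = []
  # Start from the context '..'. Note: the training set above pads each name with a
  # single '.', so '..' never occurs as an input and the first character is drawn
  # from a row of W that received no data gradient.
  char1 = '.'
  char2 = '.'
  while True:
    ix = stoi[char1+char2]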
    xenc = F.one_hot(torch.tensor([ix]), num_classes=756).float()
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    p = counts / counts.sum(1, keepdims=True) # probabilities for next character

    iy = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    out.append(itos[iy])
    char1 = char2
    char2 = itos[iy]
    if iy == 0:
      break
  print(''.join(out))

jamidenjacksam.
pgzqkygbyn.
avah.
nott.
charte.
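
E01 also allows a counting approach. The following is a minimal sketch of that variant, assuming a tiny hypothetical word list in place of names.txt and add-alpha smoothing. It pads each word with two '.' at the start so that the context '..' used to begin sampling is actually observed. The function names `prob` and `avg_nll` are illustrative only.

```python
from collections import defaultdict
import math

# Hypothetical stand-in for names.txt so the sketch runs on its own.
words = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia"]
chars = ['.'] + sorted(set(''.join(words)))
V = len(chars)

# counts[context][next_char]; two '.' of start padding, one '.' at the end.
counts = defaultdict(lambda: defaultdict(int))
for w in words:
    chs = ['.', '.'] + list(w) + ['.']
    for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
        counts[c1 + c2][c3] += 1

def prob(ctx, ch, alpha=1.0):
    """Add-alpha smoothed estimate of P(ch | ctx)."""
    row = counts.get(ctx, {})
    total = sum(row.values()) + alpha * V
    return (row.get(ch, 0) + alpha) / total

def avg_nll(word_list, alpha=1.0):
    """Average negative log-likelihood per predicted character."""
    total, n = 0.0, 0
    for w in word_list:
        chs = ['.', '.'] + list(w) + ['.']
        for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
            total -= math.log(prob(c1 + c2, c3, alpha))
            n += 1
    return total / n

print(f"train NLL: {avg_nll(words):.4f}")
```

Here alpha plays the role that the regularization strength plays in the neural net version: larger values pull every row toward the uniform distribution.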


E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

In [10]:
# Exercise 2

import torch
from torch.utils.data import TensorDataset, random_split
import torch.nn.functional as F


words = open('names.txt', 'r').read().splitlines()

# Initialize dictionaries stoi itos
chars = sorted(list(set(''.join(words))))
chars.insert(0, '.')
stoi = {char: i for i, char in enumerate(chars)}
bigram_start_idx = len(stoi)
for i, (char1, char2) in enumerate((a + b for a in chars for b in chars), bigram_start_idx):
    stoi[char1 + char2] = i
itos = {i: s for s, i in stoi.items()}

# create the training set of trigrams (x,y)
xs, ys = [], []
for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1+ch2]
    ix3 = stoi[ch3]
    xs.append(ix1)
    ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)


number of examples:  17129


In [11]:
# Divide dataset into 0.8 train 0.1 dev 0.1 test

dataset = TensorDataset(xs, ys)

# Define the split sizes
train_size = int(0.8 * len(dataset))
dev_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - dev_size

# Split the dataset
train_dataset, dev_dataset, test_dataset = random_split(dataset, [train_size, dev_size, test_size])

# Access indices from the subsets
train_indices = train_dataset.indices
dev_indices = dev_dataset.indices
test_indices = test_dataset.indices

# Extract training tensors
X_train = xs[train_indices]
y_train = ys[train_indices]
num_train = len(X_train)

# Extract dev tensors
X_dev = xs[dev_indices]
y_dev = ys[dev_indices]
num_dev = len(X_dev)

# Extract testing tensors
X_test = xs[test_indices]
y_test = ys[test_indices]
num_test = len(X_test)
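
The cell above splits individual trigram examples, so examples from the same name can land in different splits, and the split changes on every run because no seed is passed. An alternative, sketched here under the assumption that splitting by whole words suits the exercise, shuffles the words with a fixed seed and builds the examples per split. The word list, the seed, and the helper `build_examples` are placeholders.

```python
import random

# Placeholder words; in the notebook this would be the list read from names.txt.
words = ["emma", "olivia", "ava", "isabella", "sophia",
         "mia", "amelia", "harper", "evelyn", "luna"]

rng = random.Random(42)   # fixed seed so the split is reproducible
shuffled = words[:]       # copy, so the original order is left intact
rng.shuffle(shuffled)

n_train = int(0.8 * len(shuffled))
n_dev = int(0.1 * len(shuffled))
train_words = shuffled[:n_train]
dev_words = shuffled[n_train:n_train + n_dev]
test_words = shuffled[n_train + n_dev:]

def build_examples(word_list):
    """Return (context, target) character pairs for a list of words."""
    pairs = []
    for w in word_list:
        chs = ['.'] + list(w) + ['.']
        for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
            pairs.append((c1 + c2, c3))
    return pairs

train_pairs = build_examples(train_words)
print(len(train_words), len(dev_words), len(test_words))  # 8 1 1
```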


In [12]:
# Initialize and Train model

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((756, 27), generator=g, requires_grad=True)

for k in range(200):
  # forward pass
  xenc = F.one_hot(X_train, num_classes=756).float() # input to the network: one-hot encoding
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num_train), y_train].log().mean()+ 0.01*(W**2).mean()
  # print(f"Loss: ",loss.item())

  # backward pass
  W.grad = None # set to zero the gradient
  loss.backward()
  W.data += -50 * W.grad

# train_loss = train(1)
# print("train_loss: ", train_loss)

In [23]:
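# Calculate performance
# Note: the returned value is the NLL plus reg * mean(W**2), so it is only comparable
# to the training loss when reg matches the training penalty (0.01 above).
# Pass reg=0 to get the plain NLL.
def test(X, y, num, reg):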
  xenc = F.one_hot(X, num_classes=756).float() # input to the network: one-hot encoding
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num), y].log().mean()+ reg*(W**2).mean()
  return loss.item()

# on dev set:
dev_loss = test(X_dev, y_dev, num_dev, 1)
print("dev_loss: ", dev_loss)

# On test set:
test_loss = test(X_test, y_test, num_test, 1)
print("test_loss: ", test_loss)



dev_loss:  1.3015379905700684
test_loss:  1.3170948028564453


Performance on the dev and test sets is slightly worse than on the training set.

E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?

In [44]:
# Exercise 3

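# Sweep the regularization strength and print the train, dev, and test losses for each setting.
# Held fixed: k (number of iterations) and the step size (50).
# Note: train(reg) is not defined in the cells shown here; it is assumed to run the
# training loop above with the penalty weight set to reg. test() reads the global W,
# so train() has to update that W for the dev/test numbers to reflect each setting.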

regs = [10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
for reg in regs:
  # Train model
  train_loss = train(reg)
  # Dev
  dev_loss = test(X_dev,y_dev, num_dev, reg)
  # Test
  test_loss = test(X_test, y_test, num_test, reg)

  print("\n\nFor reg = ", reg)
  print("Training loss = ", train_loss)
  print("Dev loss = ", dev_loss)
  print("Test loss = ", test_loss)




For reg =  10
Training loss =  1.8622887134552002
Dev loss =  2.7129669189453125
Test loss =  2.691798210144043


For reg =  1
Training loss =  1.2911691665649414
Dev loss =  0.9455105066299438
Test loss =  0.9243417978286743


For reg =  0.1
Training loss =  0.8087396025657654
Dev loss =  0.7687649130821228
Test loss =  0.7475962042808533


For reg =  0.01
Training loss =  0.7052651643753052
Dev loss =  0.7510903477668762
Test loss =  0.7299216389656067


For reg =  0.001
Training loss =  0.6940540671348572
Dev loss =  0.7493228912353516
Test loss =  0.728154182434082


For reg =  0.0001
Training loss =  0.6929238438606262
Dev loss =  0.7491461038589478
Test loss =  0.7279773950576782


For reg =  1e-05
Training loss =  0.6928107738494873
Dev loss =  0.7491284608840942
Test loss =  0.7279597520828247
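
In the output above, the dev loss minus the test loss is the same (about 0.0212) for every value of reg. That is consistent with test() evaluating one global W while each train(reg) call optimizes its own copy; since train is not shown, this is an inference. The reported dev and test numbers also include the reg * mean(W**2) penalty. Below is a minimal sketch of a sweep that avoids both issues, assuming synthetic stand-in data and the hypothetical helpers `train_model` and `nll`: each setting gets a fresh W, and evaluation uses the plain NLL so the numbers are comparable across settings.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
# Synthetic stand-in data: 729 possible contexts, 27 possible targets.
n_ctx, n_out = 729, 27
X = torch.randint(0, n_ctx, (2000,), generator=g)
y = torch.randint(0, n_out, (2000,), generator=g)
X_tr, y_tr, X_dev, y_dev = X[:1600], y[:1600], X[1600:], y[1600:]

def train_model(X, y, reg, steps=50, lr=50.0):
    """Train a fresh weight matrix with penalty weight reg and return it."""
    gen = torch.Generator().manual_seed(2147483647)
    W = torch.randn((n_ctx, n_out), generator=gen, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(W[X], y) + reg * (W ** 2).mean()
        W.grad = None
        loss.backward()
        with torch.no_grad():
            W -= lr * W.grad
    return W.detach()

def nll(W, X, y):
    """Plain negative log-likelihood, without the regularization term."""
    return F.cross_entropy(W[X], y).item()

results = {}
for reg in [1.0, 0.01, 0.0001]:
    W = train_model(X_tr, y_tr, reg)   # weights belonging to this setting
    results[reg] = (nll(W, X_tr, y_tr), nll(W, X_dev, y_dev))
    print(f"reg={reg}: train={results[reg][0]:.4f} dev={results[reg][1]:.4f}")
```

With the real data, X_tr and y_tr would be X_train and y_train from the split above, and the test set would be evaluated once, using the reg with the lowest dev loss.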


E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?

In [49]:
# initialize and train model (this version has no regularization term)
def train_4():
  # Initialize Network
  g = torch.Generator().manual_seed(2147483647)
  W = torch.randn((756, 27), generator=g, requires_grad=True)

  for k in range(100):
    # forward pass
    logits = W[X_train] # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
    loss = -probs[torch.arange(num_train), y_train].log().mean()
    # print(f"Loss: ",loss.item())

    # backward pass
    W.grad = None # set to zero the gradient
    loss.backward()
    W.data += -50 * W.grad
  return loss.item()

loss = train_4()
print(loss)

0.6927981972694397
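
A quick check of the equivalence this exercise relies on, sketched with an arbitrary random W and a few hand-picked row indices (both are assumptions for illustration): multiplying a one-hot matrix by W gives the same rows as indexing W directly.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
W = torch.randn((756, 27), generator=g)
xs = torch.tensor([0, 5, 755, 5])   # arbitrary example row indices

# One-hot route: builds a (4, 756) matrix, then does a full matrix multiply.
via_matmul = F.one_hot(xs, num_classes=756).float() @ W
# Indexing route: gathers the four rows directly.
via_index = W[xs]

print(torch.allclose(via_matmul, via_index))  # True
```

Indexing skips both the (num_examples, 756) one-hot tensor and the multiplications by zero.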



E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?

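F.cross_entropy calculates the cross entropy between two distributions (predicted and label).
For a label distribution p and a predicted distribution q, that is -sum([p[i]*log(q[i]) for i in range(len(p))]), using the natural log. With class-index targets, p is one-hot, so this reduces to -log(q[target]), averaged over the examples by default.
Our previous implementation exponentiated the logits, normalized them, and then took the log and the mean of the selected probabilities. F.cross_entropy takes the raw logits and the targets and fuses the softmax, log, and mean into a single call.

Why do we prefer cross entropy?
It is more efficient: fewer intermediate tensors are created in the forward pass, and the backward pass is simpler.
It is also more numerically stable: it works with log-softmax directly rather than exponentiating raw logits, which can overflow for large logit values.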
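
A minimal sketch on toy logits (values chosen arbitrarily) comparing the manual exp / normalize / log computation with F.cross_entropy, including a case with one large logit where the manual version overflows:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]])
targets = torch.tensor([2, 0])

# Manual version, as in the earlier cells: exp, normalize, log, mean.
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(2), targets].log().mean()

fused = F.cross_entropy(logits, targets)
print(torch.allclose(manual, fused))  # True

# With a large logit, exp() overflows to inf in float32 and the manual loss
# becomes nan, while F.cross_entropy stays finite.
big = torch.tensor([[1000.0, 0.0, 0.0]])
big_counts = big.exp()
big_probs = big_counts / big_counts.sum(1, keepdim=True)
manual_big = -big_probs[0, 0].log()
fused_big = F.cross_entropy(big, torch.tensor([0]))
print(manual_big.item(), fused_big.item())
```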

In [54]:
# initialize and train model (this version has no regularization term)
def train_5():
  # Initialize Network
  g = torch.Generator().manual_seed(2147483647)
  W = torch.randn((756, 27), generator=g, requires_grad=True)

  for k in range(100):
    # forward pass
    logits = W[X_train] # predict log-counts
    loss = F.cross_entropy(logits, y_train)
    # print(f"Loss: ",loss.item())

    # backward pass
    W.grad = None # set to zero the gradient
    loss.backward()
    W.data += -50 * W.grad
  return loss.item()

loss = train_5()
print(loss)

0.6927981972694397



E06: meta-exercise! Think of a fun/interesting exercise and complete it.