
E01: train a trigram language model, i.e. take two characters as input to predict the third one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?

In [122]:
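# Trigram model
# Each input is a two-character context (27 * 27 = 729 possibilities, counting '.'),
# and the output is one of 27 characters, so W has shape [729, 27].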

words = open('names.txt', 'r').read().splitlines()



In [141]:
import torch

# Creates a sorted list of all characters in words
chars = sorted(list(set(''.join(words))))

# builds a dictionary where the key (s) is each character from chars,
# and the value (i+1) is the index of that character plus 1.
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
# reverse dictionary to go from index to s
itos = {i:s for s,i in stoi.items()}

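# Build a dictionary that maps each context string to an integer id.
# '.' is inserted into chars first so that contexts containing '.' are included.
chars.insert(0, '.')  # list.insert works in place and returns None
# A denser alternative that maps only the 729 two-character contexts to 0..728:
# btoi = {char1 + char2: i for i, (char1, char2) in enumerate((a, b) for a in chars for b in chars)}

# My implementation. Note that it also assigns an id to every single character
# and starts counting at 1, so the ids run from 1 to 756 rather than 0 to 728.
# Ids of 729 and above do not fit the 729 rows of W used below; this goes
# unnoticed here only because a single word is encoded.
btoi = {}
i = 0
for char1 in chars:
  btoi[char1] = i+1
  i += 1
  for char2 in chars:
    bgrm = char1 + char2
    btoi[bgrm] = i+1
    i += 1
itob = {i:b for b,i in btoi.items()}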
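
The mapping only needs one id per two-character context. A minimal sketch of a denser scheme, assuming the alphabet is '.' plus the 26 lowercase letters (which is what `chars` holds if names.txt contains only a-z); `ctx_index` and `itoctx` are hypothetical names, not part of the notebook:

```python
import string

# 27 symbols: '.' (start/end marker) followed by 'a'..'z'.
chars = ['.'] + list(string.ascii_lowercase)
stoi = {s: i for i, s in enumerate(chars)}

def ctx_index(ch1, ch2):
    """Map a two-character context to a unique id in 0..728 (hypothetical helper)."""
    return stoi[ch1] * len(chars) + stoi[ch2]

# Inverse lookup table: id -> two-character context.
itoctx = {ctx_index(a, b): a + b for a in chars for b in chars}

print(ctx_index('.', '.'), ctx_index('m', 'a'), ctx_index('z', 'z'))
```

With this scheme every id is a valid row of a [729, 27] weight matrix, so the one-hot encoding with `num_classes=729` covers all contexts.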

In [155]:
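# create the training set of trigrams (x,y): x is the two-character context, y is the next character
xs, ys = [], []
for w in words[:1]:  # first word only, to keep the example small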
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = btoi[ch1+ch2]
    ix3 = stoi[ch3]
    xs.append(ix1)
    ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
print(btoi)



# initialize the 'network'
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((729, 27), generator=g, requires_grad=True)

{'.': 1, '..': 2, '.a': 3, '.b': 4, '.c': 5, '.d': 6, '.e': 7, '.f': 8, '.g': 9, '.h': 10, '.i': 11, '.j': 12, '.k': 13, '.l': 14, '.m': 15, '.n': 16, '.o': 17, '.p': 18, '.q': 19, '.r': 20, '.s': 21, '.t': 22, '.u': 23, '.v': 24, '.w': 25, '.x': 26, '.y': 27, '.z': 28, 'a': 29, 'a.': 30, 'aa': 31, 'ab': 32, 'ac': 33, 'ad': 34, 'ae': 35, 'af': 36, 'ag': 37, 'ah': 38, 'ai': 39, 'aj': 40, 'ak': 41, 'al': 42, 'am': 43, 'an': 44, 'ao': 45, 'ap': 46, 'aq': 47, 'ar': 48, 'as': 49, 'at': 50, 'au': 51, 'av': 52, 'aw': 53, 'ax': 54, 'ay': 55, 'az': 56, 'b': 57, 'b.': 58, 'ba': 59, 'bb': 60, 'bc': 61, 'bd': 62, 'be': 63, 'bf': 64, 'bg': 65, 'bh': 66, 'bi': 67, 'bj': 68, 'bk': 69, 'bl': 70, 'bm': 71, 'bn': 72, 'bo': 73, 'bp': 74, 'bq': 75, 'br': 76, 'bs': 77, 'bt': 78, 'bu': 79, 'bv': 80, 'bw': 81, 'bx': 82, 'by': 83, 'bz': 84, 'c': 85, 'c.': 86, 'ca': 87, 'cb': 88, 'cc': 89, 'cd': 90, 'ce': 91, 'cf': 92, 'cg': 93, 'ch': 94, 'ci': 95, 'cj': 96, 'ck': 97, 'cl': 98, 'cm': 99, 'cn': 100, 'co': 101, 

In [156]:
import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes=729).float()

# Softmax: exponentiate the logits, then normalize each row so it sums to 1
logits = xenc @ W # log-counts
counts = logits.exp() # equivalent to the count matrix N
probs = counts / counts.sum(1, keepdims=True)

probs[0].sum()

tensor(1.0000, grad_fn=<SumBackward0>)
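
The exponentiate-then-normalize steps above are a softmax. A small sketch that checks the manual version against PyTorch's built-in `F.softmax`, assuming random logits as a stand-in for `xenc @ W` (the seed and shapes are arbitrary choices for illustration):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
logits = torch.randn((4, 27), generator=g)  # stand-in for xenc @ W

# Manual softmax: exponentiate, then normalize each row.
counts = logits.exp()
probs_manual = counts / counts.sum(1, keepdim=True)

# Built-in softmax over the same dimension.
probs_builtin = F.softmax(logits, dim=1)

print(torch.allclose(probs_manual, probs_builtin, atol=1e-6))
```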

In [158]:
# Compute the negative log likelihood (NLL) of each example and average them.
# For each example: take the probability the model assigned to the correct next character,
# take its log (a probability near 1 gives a log near 0), then negate it so the value is positive.
# The average NLL measures how well the network performs; lower is better, so we minimize it.
nlls = torch.zeros(len(xs))
for i in range(len(xs)):
  # i-th trigram:
  x = xs[i].item() # input context index
  y = ys[i].item() # label character index
  print('--------')
  print(f'trigram example {i+1}: {itob[x]}{itos[y]} (indexes {x},{y})')
  print('input to the neural net:', x)
  print('output probabilities from the neural net:', probs[i])
  print('label (actual next character):', y)
  p = probs[i, y]
  print('probability assigned by the net to the correct character:', p.item())
  logp = torch.log(p)
  print('log likelihood:', logp.item())
  nll = -logp
  print('negative log likelihood:', nll.item())
  nlls[i] = nll

print('=========')
print('average negative log likelihood, i.e. loss =', nlls.mean().item())

--------
trigram example 1: .ma (indexes 15,1)
input to the neural net: 15
output probabilities from the neural net: tensor([0.0111, 0.0882, 0.0054, 0.0511, 0.0396, 0.0089, 0.0636, 0.1978, 0.0163,
        0.0113, 0.0030, 0.0832, 0.0526, 0.0309, 0.0196, 0.0121, 0.0299, 0.0130,
        0.0218, 0.0027, 0.0318, 0.0446, 0.0136, 0.0044, 0.0108, 0.0190, 0.1141],
       grad_fn=<SelectBackward0>)
label (actual next character): 1
probability assigned by the net to the correct character: 0.08815080672502518
log likelihood: -2.428706169128418
negative log likelihood: 2.428706169128418
--------
trigram example 2: mar (indexes 367,18)
input to the neural net: 367
output probabilities from the neural net: tensor([0.0019, 0.0160, 0.0289, 0.0162, 0.0292, 0.0312, 0.0180, 0.1052, 0.0098,
        0.0203, 0.2766, 0.0084, 0.0527, 0.0713, 0.0164, 0.0188, 0.0367, 0.0048,
        0.0438, 0.0341, 0.0231, 0.0177, 0.0232, 0.0178, 0.0081, 0.0275, 0.0422],
       grad_fn=<SelectBackward0>)
label (actual next cha

In [166]:
# Initialize
# randomly initialize the weights of 27 neurons; each neuron receives 729 inputs (one per context)
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((729, 27), generator=g, requires_grad=True)

# create the training set of trigrams (x,y)
xs, ys = [], []
for w in words[:1]:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = btoi[ch1+ch2]
    ix3 = stoi[ch3]
    xs.append(ix1)
    ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)

for k in range(100):
  # forward pass
  xenc = F.one_hot(xs, num_classes=729).float() # input to the network: one-hot encoding
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(len(ys)), ys].log().mean() # average negative log likelihood
  print("Loss: ", loss.item())

  # backward pass
  # PyTorch computes the gradients by backpropagation
  W.grad = None # reset the gradient
  loss.backward()
  W.data += -10 * W.grad # gradient descent step with learning rate 10

Loss:  3.698340654373169
Loss:  1.4152169227600098
Loss:  0.3756910562515259
Loss:  0.1751929223537445
Loss:  0.11682480573654175
Loss:  0.0881984755396843
Loss:  0.07101797312498093
Loss:  0.05951263755559921
Loss:  0.05125107988715172
Loss:  0.045023269951343536
Loss:  0.04015641286969185
Loss:  0.03624660521745682
Loss:  0.0330355241894722
Loss:  0.030350370332598686
Loss:  0.028071332722902298
Loss:  0.026112491264939308
Loss:  0.024410486221313477
Loss:  0.022917872294783592
Loss:  0.02159806340932846
Loss:  0.020422542467713356
Loss:  0.019369103014469147
Loss:  0.01841919682919979
Loss:  0.017558693885803223
Loss:  0.01677507534623146
Loss:  0.016058772802352905
Loss:  0.015401318669319153
Loss:  0.014795714989304543
Loss:  0.014236079528927803
Loss:  0.013717394322156906
Loss:  0.013235338032245636
Loss:  0.012786008417606354
Loss:  0.012366300448775291
Loss:  0.011973431333899498
Loss:  0.011604836210608482
Loss:  0.01125815138220787
Loss:  0.010931715369224548
Loss:  0.010623

E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

In [63]:
# Exercise 2
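
One way to produce the 80/10/10 split is to shuffle the word list once with a fixed seed and slice it. This is a sketch under assumptions: `split_words` is a hypothetical helper, the seed is arbitrary, and the word list is placeholder data standing in for names.txt.

```python
import random

def split_words(words, seed=42, train_frac=0.8, dev_frac=0.1):
    """Shuffle a copy of `words` and cut it into train/dev/test lists."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n1 = int(train_frac * len(shuffled))
    n2 = int((train_frac + dev_frac) * len(shuffled))
    return shuffled[:n1], shuffled[n1:n2], shuffled[n2:]

# Placeholder data standing in for the contents of names.txt.
demo_words = [f"name{i}" for i in range(100)]
train, dev, test = split_words(demo_words)
print(len(train), len(dev), len(test))
```

Splitting at the word level, before building trigrams, keeps every trigram of a given name inside a single split.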

E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?

In [64]:
# Exercise 3
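
One possible shape for this sweep, sketched with a counting-based trigram model and add-alpha smoothing. The word lists are tiny placeholders rather than the real splits, and `trigrams`, `fit_counts`, and `avg_nll` are hypothetical helper names:

```python
import math
from collections import Counter

ALPHABET = '.abcdefghijklmnopqrstuvwxyz'  # '.' marks the start and end of a word

def trigrams(words):
    """Yield (context, next_char) pairs, padding each word with one '.' on each side."""
    for w in words:
        chs = ['.'] + list(w) + ['.']
        for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
            yield c1 + c2, c3

def fit_counts(words):
    """Count how often each character follows each two-character context."""
    tri, ctx = Counter(), Counter()
    for context, nxt in trigrams(words):
        tri[(context, nxt)] += 1
        ctx[context] += 1
    return tri, ctx

def avg_nll(words, tri, ctx, alpha):
    """Average negative log likelihood under add-alpha smoothing."""
    total, n = 0.0, 0
    for context, nxt in trigrams(words):
        p = (tri[(context, nxt)] + alpha) / (ctx[context] + alpha * len(ALPHABET))
        total -= math.log(p)
        n += 1
    return total / n

# Tiny placeholder splits; in the exercise these come from names.txt.
train_words = ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'mia', 'amelia']
dev_words = ['emilia', 'maria', 'eva']

tri, ctx = fit_counts(train_words)
for alpha in [0.01, 0.1, 1.0, 10.0]:
    print(f"alpha={alpha:<5} train={avg_nll(train_words, tri, ctx, alpha):.4f} "
          f"dev={avg_nll(dev_words, tri, ctx, alpha):.4f}")
```

With counts fitted on the training words, the train loss can only rise as alpha grows, since smoothing pulls each distribution away from the maximum-likelihood estimate toward uniform. The dev loss typically falls first and then rises, and the alpha at its minimum is the one to carry over to the single test-set evaluation.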

E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
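
A short sketch of the idea: multiplying a one-hot row by W picks out one row of W, so indexing with the integer ids gives the same logits. The context ids in `xs` below are hypothetical values chosen for illustration; any integers in 0..728 would do.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((729, 27), generator=g)

# Hypothetical context ids for one short word.
xs = torch.tensor([15, 367, 48, 520])

# Explicit one-hot encoding followed by a matrix multiply.
logits_onehot = F.one_hot(xs, num_classes=729).float() @ W

# Direct row lookup: selects the same rows without building the one-hot matrix.
logits_indexed = W[xs]

print(torch.allclose(logits_onehot, logits_indexed))
```

Indexing skips both the [N, 729] one-hot tensor and the matrix multiply, and gradients still flow back to the selected rows of W.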

E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
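
A sketch comparing the manual loss from the notebook with `F.cross_entropy`, which takes raw logits and integer class targets and averages over examples by default. The ids in `xs` and `ys` are hypothetical values for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((729, 27), generator=g)
xs = torch.tensor([15, 367, 48, 520])  # hypothetical context ids
ys = torch.tensor([1, 18, 25, 0])      # hypothetical target character ids

logits = W[xs]

# Manual version: softmax, pick the probability of the target, average the -log.
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
loss_manual = -probs[torch.arange(len(ys)), ys].log().mean()

# Built-in version: log-softmax and negative log likelihood in one call.
loss_ce = F.cross_entropy(logits, ys)

print(loss_manual.item(), loss_ce.item())
```

Because `F.cross_entropy` works on logits directly, it avoids the overflow that `logits.exp()` can hit for large values, and it does not materialize the intermediate `counts` and `probs` tensors.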


E06: meta-exercise! Think of a fun/interesting exercise and complete it.