In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

Our goal here is to start exploring DL by building `makemore`. `makemore` is a model that makes more of what you give it. We'll give it names and it will make more name-like words.

We'll start by building it as a character-level language model, i.e. a model that, given a character, predicts the next character.

So, we'll start with some text corpus (in our case a file with a list of 32k names) and we want to train some model (whatever that is) which will somehow extract information about letter / character statistics from this file and use that to predict the next character.

In other words, we want the model to estimate `P(character 2 | character 1)`, e.g. `P("e" | "m")`.

The English alphabet has 26 letters. For any ch1, our model will thus need to output a probability distribution over the candidates for ch2 (27 outcomes once we add the special start/end token introduced below), and assign high probability to the characters favored by the statistics of the file it trains on.

**What can we do?** 

The easiest thing to do initially is to count: let's count how many times any character j follows character i. We'll build this as a table of frequency counts (rows represent the 1st character, columns represent the 2nd character, and each cell holds the number of times ch2 follows ch1). We'll then normalize each row so that it becomes the distribution `P(ch2 | ch1)`. All our model needs to do to generate new words is sample from this table.

NB: This is a classification problem. We're outputting the probability that ch2 belongs to each of 27 labels (the 26 letters plus the start/end token described below).

Our model will try to learn `P(Y|X)`

NB: To know where to start and where to stop, we'll add a special token (character) `.` to denote both the start and the end of a name. It will be part of our table, so that the entry for "e." gives the probability that "e" is a terminal character and, similarly, the entry for ".a" gives the probability that a name starts with "a".
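To make the role of the `.` token concrete, here is a minimal sketch that lists the bigrams of a single name once it is wrapped in start and end markers. The helper name `bigrams` is hypothetical and used only for illustration.

```python
def bigrams(word, boundary="."):
    """Return the (ch1, ch2) pairs of `word` wrapped in boundary tokens."""
    padded = boundary + word + boundary
    return list(zip(padded, padded[1:]))

pairs = bigrams("emma")
print(pairs)
# [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
```

The first pair records which letter starts the name and the last pair records which letter ends it, so both facts end up in the same count table as ordinary letter-to-letter transitions.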

We'll call this model a "Bigram" model (since it takes into account information from 2 characters only).

This will be our baseline model and we'd have learned it by explicitly counting. 

The other way we could try is to "learn" these probability distributions from the data using a neural network. Our network will take as input one character and output probabilities over 27 next characters (probability distribution). 

Suppose we start with a 1-layer NN in which each neuron computes `tanh(wx + b)`. How can we make these outputs into probabilities? (*Pay attention here: we say "make into probabilities" because this is a construction process. We "impose" certain semantics on the outputs of the NN by mathematical manipulation, i.e. we "transform" the outputs into probabilities through a sequence of math operations.*)

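The outputs after tanh will be between -1 and 1. We'll interpret these as the `log` of the frequency counts, a.k.a. `logits`. (The implementation further down drops both the tanh and the bias and uses the raw linear outputs `x @ W` as logits, so in practice they are not restricted to [-1, 1].)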

<details>
    <summary>`logits`: Click to expand or collapse</summary>


In classification problems and language modeling, the term "logits" refers to the raw, unnormalized outputs of a neural network before they are passed through an activation function like softmax. Logits are often used as an intermediate step in the output layer of a neural network.

In language modeling, the goal is to predict the probability distribution over the next word given the previous words. One way to estimate these probabilities is by counting the frequency of each word appearing after a given context in the training data. The raw frequency counts can be converted into probabilities by dividing each count by the total count for that context.

Instead of directly using the frequency counts, we often take the logarithm of the counts. This is because the frequency of words in a language follows a power law distribution (Zipf's law), where a few words appear very frequently, and many words appear rarely. By taking the log of the counts, we can compress the dynamic range and make the values more manageable.

The output layer of a neural network for language modeling typically has one neuron for each word in the vocabulary. The activations of these neurons can be interpreted as the log-frequencies (or unnormalized log-probabilities) of each word given the input context. By applying the softmax function to the logits, we can convert them into normalized probabilities that sum up to 1.
</details>


To turn `logits` into frequencies / counts we can compute `exp(logits)`.

We can then convert counts to probabilities by normalizing. 

Together, these 2 steps look like: $$\frac{\exp(z_j)}{\sum_i \exp(z_i)}$$

And this is called a `softmax` function. 
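As a minimal sketch of why the "logits are log-counts" reading is consistent, the snippet below builds a small hypothetical count table, takes its log as the logits, and checks that exponentiating and normalizing recovers the same probabilities as normalizing the counts directly. The names `softmax_rows` and `toy_counts` are illustrative assumptions, not part of the notebook.

```python
import torch

def softmax_rows(logits):
    """Exponentiate each row of logits and normalize it to sum to 1."""
    # Subtracting the row max leaves the result unchanged (it cancels in the
    # ratio) but keeps exp() from overflowing on large logits.
    shifted = logits - logits.max(dim=1, keepdim=True).values
    unnormalized = shifted.exp()
    return unnormalized / unnormalized.sum(dim=1, keepdim=True)

# Hypothetical toy count table: 2 contexts, 3 possible next characters.
toy_counts = torch.tensor([[2.0, 1.0, 1.0],
                           [1.0, 3.0, 4.0]])

probs_from_counts = toy_counts / toy_counts.sum(dim=1, keepdim=True)
probs_from_logits = softmax_rows(toy_counts.log())  # logits = log(counts)

print(torch.allclose(probs_from_counts, probs_from_logits))
print(torch.allclose(probs_from_logits, torch.softmax(toy_counts.log(), dim=1)))
```

The second check compares the hand-written version against `torch.softmax`, which computes the same function.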

We then want to find the parameters `w` that make the observed data most likely, i.e. that maximize `P(data | w)`. We'll do this with maximum likelihood estimation (MLE).

### Bigram

In [2]:
# first load the names into a list
words = open("names.txt").read().split()
words[:5]

['emma', 'olivia', 'ava', 'isabella', 'sophia']

In [3]:
# create a vocab from them
vocab = list(set("".join(words)))
vocab.sort()
vocab[:5]

['a', 'b', 'c', 'd', 'e']

In [4]:
# map from letters to numbers and vice versa
stoi = {s:i+1 for i,s in enumerate(vocab)}
# add special start / end token
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}

In [5]:
# example: list the bigrams of the first word
for w in words[:1]:
    for ch1,ch2 in zip(w, w[1:]):
        print(ch1, ch2)

e m
m m
m a


Count table: a 2D tensor whose rows are indexed by ch1 and whose columns are indexed by ch2; each cell holds the number of times ch2 follows ch1.
```
 . a b c d e
.
a
b
c
d
e
```
The same layout, with characters replaced by their integer indices:
```
  0 1 2 . . 26
0
1
2
.
.
26
```

In [6]:
# build the full bigram count table (rows: ch1, columns: ch2)
B = torch.zeros(27, 27, dtype=torch.int32)
for w in words:
    w = '.' + w + '.'  # wrap each name in start/end tokens
    for ch1, ch2 in zip(w, w[1:]):
        i, j = stoi[ch1], stoi[ch2]
        B[i, j] += 1

In [7]:
B[0]

tensor([   0, 4410, 1306, 1542, 1690, 1531,  417,  669,  874,  591, 2422, 2963,
        1572, 2538, 1146,  394,  515,   92, 1639, 2055, 1308,   78,  376,  307,
         134,  535,  929], dtype=torch.int32)

In [8]:
# create a seeded generator object in torch so that sampling is reproducible
generator = torch.Generator().manual_seed(2147483647)

In [9]:
B.shape, B.sum(axis=1).shape, B.sum(axis=1, keepdim=True).shape

(torch.Size([27, 27]), torch.Size([27]), torch.Size([27, 1]))

In [10]:
# convert frequency counts to probabilities
P = B / B.sum(axis=1, keepdim=True) # keepdim allows proper broadcasting
P[0].sum(), P[:,0].sum() # check: a row sums to 1, while a column generally does not

(tensor(1.), tensor(3.0222))
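The `keepdim=True` in the normalization above is easy to get wrong, so here is a small sketch on a hypothetical 2x2 table (`B_toy` is an illustrative name) showing what happens without it. The division still runs, because broadcasting silently lines the row sums up with the columns instead of the rows.

```python
import torch

B_toy = torch.tensor([[1.0, 1.0],
                      [1.0, 3.0]])  # row sums are 2 and 4

# (2, 2) / (2, 1): every row is divided by its own sum.
right = B_toy / B_toy.sum(dim=1, keepdim=True)
# (2, 2) / (2,): entry [i, j] is divided by the sum of row j instead.
wrong = B_toy / B_toy.sum(dim=1)

print(right.sum(dim=1))  # tensor([1., 1.])
print(wrong.sum(dim=1))  # tensor([0.7500, 1.2500])
```

Checking that rows sum to 1 after normalizing is a cheap way to catch this.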

In [11]:
# now we can use P to sample. Each row of P is a categorical distribution over the 27 classes,
# so picking the next character is a single draw from a multinomial over those classes
ch = torch.multinomial(input=P[0], num_samples=1, replacement=True, generator=generator).item()
ch, itos[ch]

(10, 'j')

We start with `input=P[0]` because the start token for any word is `.`, which has index 0.

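With `num_samples=1`, `replacement` does not affect the draw; repeated letters within a name are possible either way, because each call to `torch.multinomial` is an independent draw.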
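As a sanity check on `torch.multinomial` itself, the sketch below draws many samples from a small hypothetical distribution and compares the empirical frequencies with the probabilities that were passed in. The seed and the vector `p` are arbitrary choices for illustration.

```python
import torch

g = torch.Generator().manual_seed(0)  # arbitrary seed, only for repeatability
p = torch.tensor([0.6, 0.3, 0.1])     # hypothetical distribution over 3 classes

# Here replacement=True is required: we draw far more samples than there are classes.
samples = torch.multinomial(p, num_samples=10_000, replacement=True, generator=g)
freqs = torch.bincount(samples, minlength=3).float() / samples.numel()

print(freqs)  # close to p, but not exactly equal
```

Sampling a name from the bigram table is the same operation, repeated with a different row of `P` at each step.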

We can then use that sampled 1st character to pick the next one

In [12]:
ch2 = torch.multinomial(input=P[ch], num_samples=1, replacement=True, generator=generator).item()
ch2, itos[ch2]

(21, 'u')

In [13]:
# let's generate 5 names in a loop
for _ in range(5):
    sample = ''
    ch= torch.multinomial(input=P[0], num_samples=1, replacement=True, generator=generator).item()
    sample += itos[ch]
    while ch != 0:
        ch = torch.multinomial(input=P[ch], num_samples=1, replacement=True, generator=generator).item()
        sample += itos[ch]
    print(sample)

nide.
janasah.
p.
cony.
a.


In [14]:
# the same loop, written so that the sampling call is not duplicated
for _ in range(5):
    sample = ''
    ch = 0
    while True:
        p = P[ch]
        ch = torch.multinomial(input=p, num_samples=1, replacement=True, generator=generator).item()
        sample += itos[ch]
        if ch == 0:
            break
    print(sample)

nn.
kohin.
tolian.
juee.
ksahnaauranilevias.


**Now, let's see how good this model is: MLE**

Maximum likelihood estimation (MLE) asks which parameters make the observed data most likely, i.e. which parameters maximize `P(data | parameters)`.

So, to evaluate the model, we want to compute this quantity. We'll go over the data (the list of names) and accumulate the log-likelihood of each bigram.
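Written out, with $N$ bigrams $(c_k, c'_k)$ in the data (the symbols $N$, $k$, $c_k$ and $c'_k$ are notation introduced here for illustration), the quantities involved are:

$$\mathcal{L} = \prod_{k=1}^{N} P(c'_k \mid c_k)$$

$$\log \mathcal{L} = \sum_{k=1}^{N} \log P(c'_k \mid c_k)$$

$$\text{NLL} = -\log \mathcal{L}, \qquad \text{average NLL} = -\frac{1}{N} \sum_{k=1}^{N} \log P(c'_k \mid c_k)$$

Since every probability is at most 1, each log term is at most 0, so the NLL is non-negative and reaches 0 only if the model assigns probability 1 to every observed bigram.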


In [15]:
log_likelihood = 0
n_bigrams = 0 # to compute the average negative log-likelihood later
for w in words:
    w = '.' + w + '.'
    for ch1, ch2 in zip(w, w[1:]):
        i = stoi[ch1]
        j = stoi[ch2]
        p = P[i, j]
        logp = p.log()
        log_likelihood += logp
        n_bigrams += 1
        

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n_bigrams}')    

log_likelihood=tensor(-559891.7500)
nll=tensor(559891.7500)
2.454094171524048
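To put the average of about 2.45 in context, a rough sketch: a model that knows nothing and assigns probability 1/27 to every next character would score `-log(1/27)`. The comparison below uses only that formula and the number printed above.

```python
import math

n_classes = 27                  # 26 letters plus the '.' token
uniform_nll = -math.log(1.0 / n_classes)
bigram_nll = 2.454094171524048  # average NLL reported above

print(f"uniform baseline: {uniform_nll:.4f}")  # 3.2958
print(f"bigram model:     {bigram_nll:.4f}")   # 2.4541

# exp(-average NLL) is the geometric mean of the probabilities
# the model assigned to the bigrams that actually occurred.
print(f"geometric-mean probability: {math.exp(-bigram_nll):.4f}")
```

The bigram table therefore does noticeably better than uniform guessing, which is the baseline any later model should be compared against.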


### Use 1-layer NN:

We'll make another model that uses a NN. It is a simple one-layer NN, which amounts to multinomial logistic regression (softmax regression). It takes the first character as input and outputs a probability distribution over the next character. It tries to learn the parameters (the equivalent of the frequency table) from the data.

To do so, we need to provide it with data in a format it could learn from. 
- Input will be ch1
- Label will be ch2

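But feeding the raw integer index into the network would treat characters as ordered magnitudes (as if 'z' = 26 were "more" than 'a' = 1), so we one-hot encode each index instead.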

In [16]:
# dataset (first word only, kept as characters for inspection)
xs, ys = [], []
for w in words[:1]:
    w = '.' + w + '.'
    for ch1, ch2 in zip(w, w[1:]):
        xs.append(ch1)
        ys.append(ch2)
xs, ys

(['.', 'e', 'm', 'm', 'a'], ['e', 'm', 'm', 'a', '.'])

In [17]:
# dataset (all words, as integer indices)
xs, ys = [], []
for w in words:
    w = '.' + w + '.'
    for ch1, ch2 in zip(w, w[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])

xs[:5], ys[:5], len(xs), len(ys)

xs = torch.tensor(xs)
ys = torch.tensor(ys)

In [18]:
import torch.nn.functional as F

In [19]:
xenc = F.one_hot(xs, num_classes=27).float() # fix the width at 27 rather than inferring it from the data

In [20]:
xenc.shape, ys.shape

(torch.Size([228146, 27]), torch.Size([228146]))

In [21]:
xenc[:5]

tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [22]:
# initialize weights
W = torch.randn(27, 27)

In [23]:
# forward pass
logits = xenc @ W
logits.shape

torch.Size([228146, 27])
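Because every input row is one-hot, the matrix product in the forward pass is effectively a row lookup: multiplying by a one-hot vector for index `i` returns row `i` of the weight matrix. A minimal sketch, using a stand-in matrix `W_toy` and the indices of the first word's inputs (both are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)      # arbitrary seed for repeatability
W_toy = torch.randn(27, 27, generator=g)  # stand-in for the weight matrix
idx = torch.tensor([0, 5, 13, 13, 1])     # indices of '.', 'e', 'm', 'm', 'a'

via_matmul = F.one_hot(idx, num_classes=27).float() @ W_toy
via_lookup = W_toy[idx]                   # plain row indexing

print(torch.allclose(via_matmul, via_lookup))
```

This is why row `i` of the weight matrix plays the same role as row `i` of the count table: it holds the logits for "what follows character `i`".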

In [24]:
# these outputs are logits
logits[0]

tensor([ 0.0811, -1.8241,  0.3685, -0.1218, -0.6174,  0.4972, -0.4562,  1.3473,
         0.0225, -0.8140, -0.6009,  0.7199,  0.3964,  0.8327, -0.0620, -0.6234,
         0.0358, -0.3556, -0.3480,  0.4521, -0.9783, -1.9683, -0.9615,  1.3405,
         0.7477, -0.6550, -1.7136])

In [25]:
# convert to counts via exp
counts = logits.exp()
counts.shape

torch.Size([228146, 27])

In [26]:
# normalize to probabilities
P = counts / counts.sum(dim=1, keepdim=True)
P[0].sum()

tensor(1.0000)

Compute the loss: the model outputs a probability distribution over the 27 characters. We want it to assign high probability to the correct label (a perfect model would assign probability 1 to every correct label).

The loss measures how far the probability assigned to the correct label is from 1.

We'll compute the likelihood the model assigns to all examples, then minimize the average negative log-likelihood.

In [41]:
# for each example, pick the probability assigned to its correct label
ps = P[torch.arange(xs.nelement()), ys]

In [43]:
logps = ps.log()

In [50]:
nll = -logps.sum() / xs.nelement()
nll

tensor(3.7685)
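The exp, normalize, pick, log, average sequence used above is the standard cross-entropy loss on logits. As a hedged sketch on hypothetical toy logits (not the notebook's data), the manual computation can be compared against `torch.nn.functional.cross_entropy`, which takes raw logits and integer class targets:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)          # arbitrary seed for repeatability
logits_toy = torch.randn(5, 27, generator=g)  # hypothetical logits for 5 examples
targets = torch.tensor([5, 13, 13, 1, 0])     # indices of 'e', 'm', 'm', 'a', '.'

# Manual route, as in the cells above: exp, normalize, pick, log, average.
counts = logits_toy.exp()
probs = counts / counts.sum(dim=1, keepdim=True)
manual_nll = -probs[torch.arange(targets.numel()), targets].log().mean()

# Built-in route: averages over examples by default.
builtin_nll = F.cross_entropy(logits_toy, targets)

print(manual_nll.item(), builtin_nll.item())
```

The two values agree up to floating-point error; the built-in version is usually preferable in practice because it avoids computing `exp` on large logits directly.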

In [77]:
# let's train the model to get a better loss
W = torch.randn((27,27), generator=generator, requires_grad=True)
for i in range(50):
    #forward pass
    logits = xenc @ W
    counts = logits.exp() #logits to counts
    probs = counts / counts.sum(dim=1, keepdim=True) #counts to probs
    loss = -probs[torch.arange(xs.nelement()), ys].log().mean() #mean nll

    
    W.grad = None #zero grads
    loss.backward() #backprop
    W.data += -50.0 * W.grad # weight update
    
    if i % 10 == 0:
        print(f"loss: {loss.item()}")

loss: 3.831210136413574
loss: 2.6707763671875
loss: 2.5668914318084717
loss: 2.529362201690674
loss: 2.509660482406616
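For this particular model the gradient that `loss.backward()` produces has a simple closed form: with one-hot inputs `X`, one-hot targets `Y`, and `N` examples, the gradient of the mean NLL with respect to `W` is `X^T (probs - Y) / N`. The sketch below checks this against autograd on the bigrams of a single word; all names in it are illustrative and the seed is arbitrary.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
x_idx = torch.tensor([0, 5, 13, 13, 1])  # '.', 'e', 'm', 'm', 'a'
y_idx = torch.tensor([5, 13, 13, 1, 0])  # 'e', 'm', 'm', 'a', '.'
x_onehot = F.one_hot(x_idx, num_classes=27).float()
y_onehot = F.one_hot(y_idx, num_classes=27).float()

W_toy = torch.randn(27, 27, generator=g, requires_grad=True)

# Forward pass and mean negative log-likelihood, then autograd.
probs = torch.softmax(x_onehot @ W_toy, dim=1)
loss = -probs[torch.arange(x_idx.numel()), y_idx].log().mean()
loss.backward()

# Closed form for softmax followed by mean NLL.
analytic_grad = x_onehot.T @ (probs.detach() - y_onehot) / x_idx.numel()
print(torch.allclose(W_toy.grad, analytic_grad, atol=1e-6))
```

Read row by row, the update pushes each row's predicted distribution toward the observed frequencies of what follows that character, which is why the trained loss approaches the 2.454 of the counting model.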


**Let's sample from this model**

Now that we have the trained model, we can sample from it as we did before. 

This model works by taking one character as input and outputting a distribution over the next. To sample from it we follow that process:

- Start with the character `.`
- One-hot encode it and pass it through the network
- The network outputs activations (logits) over the 27 possible next characters
- Turn them into counts, then probabilities
- Use the probabilities to sample from the multinomial as before, and repeat from the new character until `.` is sampled again


In [99]:
generator = torch.Generator().manual_seed(2147483647)
# sampling
for _ in range(5):
    ix = 0
    word = ''
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W
        counts = logits.exp()
        p = counts / counts.sum()
        ix = torch.multinomial(input=p, num_samples=1, replacement=True, generator=generator).item()
        word += itos[ix]
        if ix == 0:
            break
    print(word)

junide.
janasah.
pan.
ay.
a.
