Exercises:

1. Train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?

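The trigram model did better than the bigram model, as expected. Notes on the table:
- values are training loss (average negative log likelihood),
- no regularisation/smoothing,
- 200 epochs for 2/3-gram models,
- 600 epochs for 4-gram (probably needs better hyperparameters),
- each word is padded with n-1 end tokens, so for n > 2 the models also score end-of-word positions that are trivially predictable; the losses are therefore not strictly comparable across n.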

| model type    | bigram loss | trigram loss | 4-gram loss  |
|---------------|-------------|--------------|--------------|
| counting      |       2.454 |        1.942 |        1.471 |
| backprop (nn) |       2.460 |        2.029 |        1.780 |

2. Split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

Both models were trained for 200 epochs.

| model      | train loss | val loss | test loss |
|------------|------------|----------|-----------|
| bigram nn  |      2.459 |    2.464 |     2.466 |
| trigram nn |      1.992 |    2.222 |     2.237 |

For the bigram model, train loss is only slightly lower than val and test loss.<br>
For the trigram model the gap is significant: val and test loss are roughly 11-12% higher than train loss.

In [1]:
import torch
from tqdm import tqdm

## Data

In [2]:
data_fpath = './data/names.txt'

In [3]:
with open(data_fpath, 'r') as f:
    words = f.read().splitlines()
words[:5]

['emma', 'olivia', 'ava', 'isabella', 'sophia']

In [4]:
len(words)

32033

In [5]:
word_lens = [len(word) for word in words]
print(f'min len: {min(word_lens)}; max len: {max(word_lens)}')

min len: 2; max len: 15


In [6]:
SEP_TOK = '.'

In [8]:
n = 3  # this indicates the n in n-grams

In [9]:
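ngrams_dict = {}
for word in words:
    # pad with n-1 separators on both sides so every character has a full context
    chars = [SEP_TOK]*(n-1) + list(word) + [SEP_TOK]*(n-1)
    # n shifted views of the sequence; zipping them yields all n-grams
    ngram_chars = [chars[i:] for i in range(n)]
    for ngram in zip(*ngram_chars):
        ngrams_dict[ngram] = ngrams_dict.get(ngram, 0) + 1
# after sorting, ngrams_dict is a list of (ngram, count) pairs, most frequent first
ngrams_dict = sorted(ngrams_dict.items(), key=lambda kv: kv[1], reverse=True)
ngrams_dict[:10]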

[(('n', '.', '.'), 6763),
 (('a', '.', '.'), 6640),
 (('.', '.', 'a'), 4410),
 (('e', '.', '.'), 3983),
 (('.', '.', 'k'), 2963),
 (('.', '.', 'm'), 2538),
 (('i', '.', '.'), 2489),
 (('.', '.', 'j'), 2422),
 (('h', '.', '.'), 2409),
 (('.', '.', 's'), 2055)]
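
The same tally can be written with `collections.Counter` from the standard library. The sketch below is only an illustration: `toy_words` and `ngram_counts` are hypothetical names, and the two-word list stands in for the full `words` list.

```python
from collections import Counter

SEP_TOK = '.'
n = 3
toy_words = ['emma', 'ava']  # hypothetical stand-in for `words`

ngram_counts = Counter()
for word in toy_words:
    chars = [SEP_TOK] * (n - 1) + list(word) + [SEP_TOK] * (n - 1)
    # zipping n shifted views of the padded sequence yields every n-gram
    ngram_counts.update(zip(*(chars[i:] for i in range(n))))

# most_common(k) returns the k most frequent (ngram, count) pairs
print(ngram_counts.most_common(3))
```

`Counter.most_common` replaces the explicit `sorted(..., reverse=True)` call, and the counts stay available as a mapping afterwards.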

In [10]:
vocab = [SEP_TOK] + sorted(list(set(''.join(words))))
stoi = {s: i for i, s in enumerate(vocab)}
itos = {i: s for i, s in enumerate(vocab)}

## Ngram model as an array with counts
This uses a generalized version that works for n-grams with arbitrary n.

In [49]:
# N[i_1, ..., i_n] holds how often the n-gram (vocab[i_1], ..., vocab[i_n]) occurs
N = torch.zeros([len(vocab) for _ in range(n)], dtype=torch.int32)
for word in tqdm(words):
    chars = [SEP_TOK]*(n-1) + list(word) + [SEP_TOK]*(n-1)
    ngram_chars = [chars[i:] for i in range(n)]
    for ngram in zip(*ngram_chars):
        ixs = tuple(stoi[ch] for ch in ngram)
        N[ixs] += 1

100%|██████████| 32033/32033 [01:02<00:00, 513.36it/s]
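
The counting loop above took about a minute, most likely because every `N[ixs] += 1` is a separate tensor operation. One possible alternative is to collect all index tuples first and accumulate them in a single call. This is a sketch on a toy word list; `toy_words`, `rows`, `idx` and `N_fast` are hypothetical names, not part of the notebook.

```python
import torch

SEP_TOK = '.'
n = 3
toy_words = ['emma', 'ava']  # hypothetical stand-in for `words`
vocab = [SEP_TOK] + sorted(set(''.join(toy_words)))
stoi = {s: i for i, s in enumerate(vocab)}

# Collect every n-gram as a row of indices (plain Python lists are cheap here).
rows = []
for word in toy_words:
    chars = [SEP_TOK] * (n - 1) + list(word) + [SEP_TOK] * (n - 1)
    for ngram in zip(*(chars[i:] for i in range(n))):
        rows.append([stoi[ch] for ch in ngram])
idx = torch.tensor(rows)  # shape: (num_ngrams, n)

# A single accumulate call replaces the per-n-gram increment.
N_fast = torch.zeros([len(vocab)] * n, dtype=torch.int32)
ones = torch.ones(idx.shape[0], dtype=torch.int32)
N_fast.index_put_(tuple(idx.T), ones, accumulate=True)
```

With `accumulate=True`, repeated index tuples add up instead of overwriting each other, which is what makes this equivalent to the loop.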


In [50]:
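base_count = 0  # added to every count; a value > 0 smooths the probabilities (0 = no smoothing)
P = (N+base_count).float()
P = P / P.sum(axis=(n-1), keepdim=True)  # normalize over the last (predicted) character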
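
To see what `base_count` changes, here is a small sketch on made-up bigram counts. With `base_count = 0`, a context that never occurs yields a 0/0 row, and any unseen n-gram gets probability 0, which would give an infinite loss on held-out data. `N_toy` and `normalize` are hypothetical names used only for this illustration.

```python
import torch

# Toy bigram counts over a 3-symbol vocabulary (made-up numbers).
N_toy = torch.tensor([[2, 0, 2],
                      [0, 0, 0],   # a context that never occurs
                      [1, 3, 0]], dtype=torch.int32)

def normalize(counts, base_count):
    """Add base_count to every cell, then normalize each context row."""
    P = (counts + base_count).float()
    return P / P.sum(dim=-1, keepdim=True)

P_raw = normalize(N_toy, 0)     # the unseen context becomes a row of nan (0/0)
P_smooth = normalize(N_toy, 1)  # add-one smoothing: every row is a valid distribution

print(P_raw)
print(P_smooth)
```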

### Sampling from the model

In [51]:
from collections import deque
n_samples = 20
g = torch.Generator().manual_seed(2147483647)
for _ in range(n_samples):
    ixs = deque([stoi[SEP_TOK]] * (n-1))
    out = []
    while True:
        prob_distr = P[tuple(ixs)]
        ix = torch.multinomial(prob_distr, num_samples=1, replacement=True, generator=g).item()
        if ix == stoi[SEP_TOK]:
            break
        ixs.popleft()
        ixs.append(ix)
        out.append(itos[ix])
    print(''.join(out))

juniba
jakasir
presar
adria
jira
tolomas
ter
kalania
yanilena
jededaileti
tayse
siely
artez
noud
than
demmerceyn
lena
jaylie
reanae
ocely


### Evaluating the performance

In [52]:
log_likelihood = 0.0
count = 0
for word in tqdm(words, 'Evaluating'):
    chars = [SEP_TOK]*(n-1) + list(word) + [SEP_TOK]*(n-1)
    ngram_chars = [chars[i:] for i in range(n)]
    for ngram in zip(*ngram_chars):
        ixs = tuple(stoi[ch] for ch in ngram)
        prob = P[ixs]
        logprob = torch.log(prob)
        log_likelihood += logprob
        count += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/count=}')

Evaluating: 100%|██████████| 32033/32033 [00:46<00:00, 692.80it/s]

log_likelihood=tensor(-429832.4062)
nll=tensor(429832.4062)
nll/count=tensor(1.4710)
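
The evaluation loop can also be expressed as one indexing operation, assuming the n-grams are already available as an integer tensor with one row per n-gram. A minimal sketch on toy values; `average_nll`, `P_toy` and `idx_toy` are hypothetical names.

```python
import torch

def average_nll(P, idx):
    """P: n-dimensional probability table; idx: (num_ngrams, n) integer tensor."""
    probs = P[tuple(idx.T)]   # look up one probability per n-gram
    return -probs.log().mean()

# Toy bigram table over two symbols; each row sums to 1.
P_toy = torch.tensor([[0.50, 0.50],
                      [0.25, 0.75]])
idx_toy = torch.tensor([[0, 1], [1, 1], [1, 0]])

print(average_nll(P_toy, idx_toy))
```

This computes the same quantity as `nll/count` in the cell above, just without the Python-level loop over individual n-grams.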





## Ngram model as neural net 

In [11]:
# creating the dataset of n-grams: the first n-1 characters are the input, the last one is the target
xs, ys = [], []
for word in tqdm(words, f'Creating {n}-gram samples'):
    chars = [SEP_TOK]*(n-1) + list(word) + [SEP_TOK]*(n-1)
    ngram_chars = [chars[i:] for i in range(n)]
    for ngram in zip(*ngram_chars):
        ixs = [stoi[ch] for ch in ngram]
        xs.append(ixs[:-1])
        ys.append(ixs[-1])

# NOTE: this slices the examples in file order (no shuffling) and splits by n-gram, not by word
train_split_end = int(0.8 * len(xs))
val_split_end = int(0.9 * len(xs))

xs_train = torch.tensor(xs[:train_split_end])
ys_train = torch.tensor(ys[:train_split_end])

xs_val = torch.tensor(xs[train_split_end:val_split_end])
ys_val = torch.tensor(ys[train_split_end:val_split_end])

xs_test = torch.tensor(xs[val_split_end:])
ys_test = torch.tensor(ys[val_split_end:])

xs = torch.tensor(xs)
ys = torch.tensor(ys)    

print(f'Number of training examples: {xs_train.shape[0]}')
print(f'Number of validation examples: {xs_val.shape[0]}')
print(f'Number of test examples: {xs_test.shape[0]}')

Creating 3-gram samples: 100%|██████████| 32033/32033 [00:01<00:00, 16626.91it/s]


Number of training examples: 208143
Number of validation examples: 26018
Number of test examples: 26018
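
The split above follows file order, whereas the exercise asks for a random split. One way to do that, sketched here under the assumption that the shuffle happens at the word level before any n-grams are built, is shown below. The seed value and the names `toy_words`, `train_words`, `dev_words` and `test_words` are arbitrary choices for this illustration.

```python
import random

toy_words = [f'name{i}' for i in range(100)]  # hypothetical stand-in for `words`

rng = random.Random(42)      # fixed seed (arbitrary) so the split is reproducible
shuffled = toy_words[:]      # copy so the original order is left untouched
rng.shuffle(shuffled)

n1 = int(0.8 * len(shuffled))
n2 = int(0.9 * len(shuffled))
train_words = shuffled[:n1]   # 80%
dev_words = shuffled[n1:n2]   # 10%
test_words = shuffled[n2:]    # 10%

print(len(train_words), len(dev_words), len(test_words))
```

Splitting words rather than individual n-grams keeps all n-grams of one name in the same split.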


### Training loop

In [12]:
def calc_loss(xs, ys, W, weight_decay=1e-4):
    # xs: (num_examples, n-1) context indices; ys: (num_examples,) target indices
    logits = W[tuple(xs.T)]  # one row of logits per context
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    # loss = average negative log likelihood + L2 penalty on the weights
    loss = -probs[torch.arange(len(ys)), ys].log().mean() + weight_decay*(W**2).mean()
    return loss
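
The data term of this loss (exponentiate, normalize, take the mean negative log probability of the target) is what `torch.nn.functional.cross_entropy` computes from raw logits. The sketch below checks this on random toy data; `V`, `W_toy`, `xs_toy` and `ys_toy` are hypothetical names, and the weight-decay term is left out.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
V = 5                                          # toy vocabulary size
W_toy = torch.randn((V, V, V), generator=g)    # trigram-shaped weight table
xs_toy = torch.randint(0, V, (8, 2), generator=g)
ys_toy = torch.randint(0, V, (8,), generator=g)

logits = W_toy[tuple(xs_toy.T)]                # (8, V): one row of logits per context

# Manual softmax followed by the mean negative log likelihood.
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(len(ys_toy)), ys_toy].log().mean()

# Built-in equivalent; it works in log space, which avoids overflow in exp().
builtin = F.cross_entropy(logits, ys_toy)

print(manual.item(), builtin.item())
```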

In [18]:
# initializing the "model"
g = torch.Generator().manual_seed(2147483647)
W = torch.randn(tuple(len(vocab) for _ in range(n)), generator=g, requires_grad=True)

In [19]:
lr = 100
for ep in range(200):
    # forward pass (no weight decay during training)
    tr_loss = calc_loss(xs_train, ys_train, W, 0)

    # backward pass
    W.grad = None
    tr_loss.backward()

    # update
    W.data += -lr * W.grad

    if ep % 10 == 9:
        # NOTE: these calls use the default weight_decay=1e-4, so the printed
        # values include a small L2 term that the training step above omits
        tr_loss = calc_loss(xs_train, ys_train, W).item()
        val_loss = calc_loss(xs_val, ys_val, W).item()
        test_loss = calc_loss(xs_test, ys_test, W).item()
        print(f'{ep+1:>3}th epoch, {tr_loss=:.3f}; {val_loss=:.3f}; {test_loss=:.3f}')

 10th epoch, tr_loss=2.644; val_loss=2.855; test_loss=2.868
 20th epoch, tr_loss=2.419; val_loss=2.652; test_loss=2.662
 30th epoch, tr_loss=2.310; val_loss=2.553; test_loss=2.562
 40th epoch, tr_loss=2.243; val_loss=2.488; test_loss=2.498
 50th epoch, tr_loss=2.195; val_loss=2.441; test_loss=2.451
 60th epoch, tr_loss=2.159; val_loss=2.405; test_loss=2.415
 70th epoch, tr_loss=2.131; val_loss=2.376; test_loss=2.387
 80th epoch, tr_loss=2.108; val_loss=2.352; test_loss=2.363
 90th epoch, tr_loss=2.089; val_loss=2.331; test_loss=2.343
100th epoch, tr_loss=2.074; val_loss=2.314; test_loss=2.326
110th epoch, tr_loss=2.060; val_loss=2.299; test_loss=2.312
120th epoch, tr_loss=2.049; val_loss=2.286; test_loss=2.299
130th epoch, tr_loss=2.039; val_loss=2.275; test_loss=2.288
140th epoch, tr_loss=2.030; val_loss=2.265; test_loss=2.278
150th epoch, tr_loss=2.022; val_loss=2.256; test_loss=2.270
160th epoch, tr_loss=2.015; val_loss=2.248; test_loss=2.262
170th epoch, tr_loss=2.008; val_loss=2.2
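
The update step in the loop writes through `W.data`. A commonly used alternative is to perform the in-place update inside `torch.no_grad()`, which makes it explicit that the update is not tracked by autograd. A minimal sketch on a toy parameter; `w` and the quadratic loss are made up for illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # toy parameters
lr = 0.1

loss = (w ** 2).sum()
w.grad = None
loss.backward()           # d(loss)/dw = 2 * w

with torch.no_grad():     # the update itself is not recorded by autograd
    w -= lr * w.grad

print(w)
```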

### Sampling from the network

In [66]:
from collections import deque
n_samples = 20
g = torch.Generator().manual_seed(2147483647)
for _ in range(n_samples):
    ixs = deque([stoi[SEP_TOK]] * (n-1))
    out = []
    while True:
        logits = W[tuple(ixs)]
        counts = logits.exp()
        probs = counts / counts.sum()
        ix = torch.multinomial(probs, num_samples=1, replacement=True, generator=g).item()
        if ix == stoi[SEP_TOK]:
            break
        ixs.popleft()
        ixs.append(ix)
        out.append(itos[ix])
    print(''.join(out))

junjdedianaqidouxutnypaxnuq
jimrltozsogjatqzvugignaduwjbuldvhajzdbiminrwimpadsvzywcfxvbryn
farmumtkyf
demmerponnsleigh
ani
cora
yaehocpkqjyked
webdmeiibwyaftwtiansnhspoluwaspphfdgosfmxtpqcixz
repahfmtydt
jayrslu
isa
dyfj
mjluuj
mahvupwyilpvhecgiagr
jenhwvdxtta
malyn
brey
aui
lavlpocq
themilana
