<a href="https://colab.research.google.com/github/florianraith/notebooks/blob/main/n_gram.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Set up Kaggle and download the dataset

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install -q kaggle

In [17]:
!mkdir -p ~/.kaggle
!cp "/content/drive/MyDrive/Sonstiges/kaggle.json" ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [4]:
import os

dataset_path = '/content/us-baby-names.zip'

if not os.path.exists(dataset_path):
    !kaggle datasets download -d kaggle/us-baby-names
    !unzip -o {dataset_path} -d /content/
else:
    print('Dataset already exists.')

Downloading us-baby-names.zip to /content
 96% 166M/173M [00:03<00:00, 50.8MB/s]
100% 173M/173M [00:03<00:00, 54.7MB/s]
Archive:  /content/us-baby-names.zip
  inflating: /content/NationalNames.csv  
  inflating: /content/NationalReadMe.pdf  
  inflating: /content/StateNames.csv  
  inflating: /content/StateReadMe.pdf  
  inflating: /content/database.sqlite  
  inflating: /content/hashes.txt     


### Code

In [5]:
import torch
import torch.nn.functional as F
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

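Split the data into a training set and a test set: take the first `N` names, lowercase them, shuffle them, and use the first `split` fraction for training.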

In [6]:
N = 128000
split = 0.8

data = pd.read_csv('/content/NationalNames.csv')
data = data['Name'].apply(lambda n: n.lower()).head(N).sample(frac=1).to_list()

i = int(split * N)
train, test = data[:i], data[i:]

print(f'train data: #{len(train)}, {train[:5]}')
print(f'test data: #{len(test)}, {test[:5]}')

train data: #102400, ['evaline', 'harry', 'dana', 'adda', 'jimmie']
test data: #25600, ['livingston', 'delilah', 'charls', 'jean', 'william']
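The shuffle above is unseeded, so each run yields a different split. A minimal sketch of a reproducible shuffle-and-split in plain Python follows; `shuffle_split` and the seed value `0` are illustrative assumptions, not part of the notebook.

```python
import random

def shuffle_split(items, split=0.8, seed=0):
    # Hypothetical helper: copy the list, shuffle it with a fixed seed,
    # then cut it at the `split` fraction of its actual length.
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(split * len(items))
    return items[:cut], items[cut:]

names = ['evaline', 'harry', 'dana', 'adda', 'jimmie']  # names taken from the output above
train_names, test_names = shuffle_split(names)
print(len(train_names), len(test_names))  # 4 1
```

Calling `shuffle_split(names)` twice with the same seed returns the same two lists, which makes loss values comparable between runs.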


In [7]:
import string

chars = '.' + string.ascii_lowercase
itos = { i:s for i, s in enumerate(chars)}
stoi = { s:i for i, s in enumerate(chars)}

**Encode n-grams**.

The `context` variable sets how many preceding characters are used to predict the next one, i.e. the length of the character n-gram excluding the target character. For example, `context = 1` gives a bigram model and `context = 2` a trigram model.
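To make the role of `context` concrete, the sketch below lists the (context, target) pairs produced for one name, using the same `'.'` padding as the `encode` function in this notebook. The helper name `ngram_pairs` is hypothetical.

```python
def ngram_pairs(name, context):
    # Hypothetical helper: pad with `context` start tokens and one end token,
    # then slide a window of `context` characters over the padded name.
    padded = context * '.' + name + '.'
    return [(padded[i:i + context], padded[i + context])
            for i in range(len(padded) - context)]

pairs_bigram = ngram_pairs('ada', context=1)
pairs_trigram = ngram_pairs('ada', context=2)
print(pairs_bigram)   # [('.', 'a'), ('a', 'd'), ('d', 'a'), ('a', '.')]
print(pairs_trigram)  # [('..', 'a'), ('.a', 'd'), ('ad', 'a'), ('da', '.')]
```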

Encoding example:  
1. Given the trigram `'ale'`, split it into x and y, i.e. `'al'` and `'e'`.  
2. Map the letters to indices, i.e. `(1, 12)` and `5`.  
3. One-hot encode the x indices and concatenate the resulting vectors into a single input vector.
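The three steps can be traced by hand in plain Python. This is a sketch for illustration only: `one_hot` is a stand-in for `F.one_hot`, and the list-based vectors replace the tensors used in the notebook.

```python
import string

chars = '.' + string.ascii_lowercase        # same 27-symbol alphabet as the notebook
stoi = {s: i for i, s in enumerate(chars)}

def one_hot(index, num_classes):
    # Plain-Python stand-in for F.one_hot (illustrative only).
    return [1.0 if k == index else 0.0 for k in range(num_classes)]

x_chars, y_char = 'al', 'e'                 # step 1: split the trigram
x_idx = tuple(stoi[c] for c in x_chars)     # step 2: context indices
y_idx = stoi[y_char]                        # step 2: target index

# step 3: one-hot encode each context index and concatenate
x_enc = [v for i in x_idx for v in one_hot(i, len(chars))]
hot_positions = [k for k, v in enumerate(x_enc) if v == 1.0]

print(x_idx, y_idx)    # (1, 12) 5
print(len(x_enc))      # 54
print(hot_positions)   # [1, 39]
```

The second hot position is 39 because the one-hot vector for `'l'` starts at offset 27 in the concatenated input (27 + 12).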

In [8]:
context = 2

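def encode(data):
  X = []
  Y = []

  for w in data:
    # pad with `context` start tokens and one end token
    w = context * '.' + w + '.'

    # slide a window over the word: `context` characters in, next character out
    for i in range(len(w) - context):
      X_entry = tuple(stoi[w[j]] for j in range(i, i + context))

      X.append(X_entry)
      Y.append(stoi[w[i + context]])

  X = torch.tensor(X)
  # one-hot encode each context index and flatten to (num_samples, context * len(chars))
  Xenc = F.one_hot(X, num_classes=len(chars)).reshape(X.size(0), -1).float()
  Y = torch.tensor(Y)
  return (Xenc, Y)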

Xenc, Y = encode(train)
XTenc, YT = encode(test)

Xenc.shape, Y.shape

(torch.Size([700053, 54]), torch.Size([700053]))

Initialize the weight matrix with random values drawn uniformly from [0, 1)

In [9]:
W = torch.rand(Xenc.size(1), len(chars), requires_grad=True)

Train the model: a single linear layer with no hidden layers, fitted by full-batch gradient descent on the cross-entropy loss
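Because the input is a concatenation of one-hot vectors, the product `Xenc @ W` amounts to adding up one row of `W` per context position. The sketch below checks this on toy sizes; the sizes (3 symbols, context of 2) and the values in `W` are assumptions chosen only to keep the numbers readable.

```python
import math

num_classes, context = 3, 2   # toy sizes, not the notebook's 27 symbols

# W has context * num_classes rows and num_classes columns, like the notebook's W.
W = [[0.1 * (r + 1) + 0.01 * c for c in range(num_classes)]
     for r in range(context * num_classes)]

def one_hot(index, n):
    return [1.0 if k == index else 0.0 for k in range(n)]

idxs = (2, 0)                                   # a context of two symbol indices
x = [v for i in idxs for v in one_hot(i, num_classes)]

# Full matrix-vector product x @ W
logits_matmul = [sum(x[r] * W[r][c] for r in range(len(x)))
                 for c in range(num_classes)]

# Equivalent: add the single row of W that each context position selects
rows = [pos * num_classes + i for pos, i in enumerate(idxs)]
logits_rows = [sum(W[r][c] for r in rows) for c in range(num_classes)]

print(rows)  # [2, 3]
print(all(math.isclose(a, b) for a, b in zip(logits_matmul, logits_rows)))  # True
```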

In [10]:
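lr = 10  # learning rate

for i in range(121):
  # forward pass: one linear layer, cross-entropy over the 27 classes
  Yout = Xenc @ W
  loss = F.cross_entropy(Yout, Y)

  # backward pass
  W.grad = None
  loss.backward()

  # gradient descent step, kept out of the autograd graph
  with torch.no_grad():
    W -= lr * W.grad

  # report train and test loss every 30 iterations
  if i % 30 == 0:
    with torch.no_grad():
      YTout = XTenc @ W
      test_loss = F.cross_entropy(YTout, YT)

    print('%04d, test loss: %.4f; train loss: %.4f; loss diff: %.4f' % (i, test_loss.item(), loss.item(), abs(loss.item() - test_loss.item())))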


0000, test loss: 3.2098; train loss: 3.3313; loss diff: 0.1215
0030, test loss: 2.4577; train loss: 2.4613; loss diff: 0.0036
0060, test loss: 2.3608; train loss: 2.3596; loss diff: 0.0012
0090, test loss: 2.3205; train loss: 2.3180; loss diff: 0.0024
0120, test loss: 2.2978; train loss: 2.2948; loss diff: 0.0030
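As a point of reference for these numbers, a model that assigns equal probability to all 27 symbols has a cross-entropy of ln 27. The sketch below computes that baseline and converts the final test loss printed above into a perplexity. The variable names are illustrative.

```python
import math

num_classes = 27

# Cross-entropy of a uniform guess over all symbols
uniform_loss = -math.log(1 / num_classes)

# Perplexity corresponding to the final test loss reported above
final_test_loss = 2.2978
perplexity = math.exp(final_test_loss)

print(round(uniform_loss, 4))  # 3.2958
print(round(perplexity, 2))    # 9.95
```

The losses at iteration 0 sit close to the uniform baseline, and the final loss corresponds to the model being about as uncertain as a uniform choice among roughly 10 symbols.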


In [11]:
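softmax = torch.nn.Softmax(dim=0)

def sample(idxs):
  # build the same concatenated one-hot input that encode() produces for one context
  x = torch.cat([F.one_hot(torch.tensor(idx), num_classes=len(chars)) for idx in idxs]).float()

  # turn the logits into probabilities and draw one next-character index
  return torch.multinomial(softmax(x @ W), num_samples=1, replacement=True).item()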
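The sampling step turns logits into probabilities with a softmax and then draws one index in proportion to them. A plain-Python sketch of that idea follows. The logits are made-up values, the seed is arbitrary, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)                        # arbitrary seed, for repeatability only
logits = [2.0, 1.0, 0.1]                      # made-up logits for three symbols
probs = softmax(logits)

# Draw 1000 indices, each chosen with probability probs[k]
draws = rng.choices(range(len(probs)), weights=probs, k=1000)
counts = [draws.count(k) for k in range(len(probs))]

print([round(p, 3) for p in probs])
print(counts)
```

Higher logits receive higher probability but lower ones are never ruled out, which is why the generated names vary from run to run.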

Generate some names

In [12]:
for i in range(20):
  name = ''
  idxs = [0] * context

  while True:

    idxn = sample(idxs)
    idxs = idxs[1:] + [idxn]

    name += itos[idxn]

    if idxn == 0:
      break

  print(name)

hcr.
selle.
duorelert.
qmathiw.
samina.
tov.
la.
ben.
juckiy.
phna.
frtsttestharlbyqrie.
dealie.
be.
nol.
he.
ca.
uzbeqhuoriett.
hanate.
celerlda.
lalqvjdsane.


### Results

**D1:**
N = 64000, split = 0.8, context = 5, 5 \* 27 \* 27 = 3645 parameters;   
1000 iterations, learning rate = 10, loss = 2.0326   
Training took about 4 minutes on a free Colab CPU.

**D2:**
N = 128000, split = 0.8, context = 6, 6 \* 27 \* 27 = 4374 parameters;   
3000 iterations, learning rate = 10, loss = 2.1324   
Training took about 33 minutes on a free Colab CPU.
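The parameter counts quoted for D1 and D2 follow from the shape of `W`: `context * 27` input rows by 27 output columns. A short check, with `num_parameters` as a hypothetical helper:

```python
def num_parameters(context, num_classes=27):
    # W has (context * num_classes) rows and num_classes columns.
    return context * num_classes * num_classes

print(num_parameters(5))  # 3645 (D1)
print(num_parameters(6))  # 4374 (D2)
print(num_parameters(2))  # 1458 (the context = 2 run in this notebook)
```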

| D1 | D2 |
| - | - |
| marcie. | kil. |
| aus. | vilanna. |
| cilma. | vilana. |
| micos. | lanoros. |
| cly. | alla. |
| jolmetha. | wilman. |
| elborie. | elvardett. |
| rina. | lanay. |
| garoge. | arlie. |
| thaldpery. | olly. |
| bethia. | lichaok. |
| cencisel. | livarga. |
| euda. | panallisa. |
| wandeca. | arillda. |
| nandy. | kalluen. |
| vwanner. | anfaid. |
| sapvin. | jonnita. |
| wirndf. | byha. |
| bone. | alina. |
| sernjel. | badsilit. |