# Bigram model

---

We want to build from scratch a brigram model, simple.</br>
I used a machine with GPUs but you should be fine with running it in CPU only.


In [1]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [2]:
from markdown import markdown
import torch
import torch.nn as nn
import torch.nn.functional as F

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Env variables & Hyperparameters

---


In [4]:
# model hypterparameters
block_size = 64
batch_size = 16

## Read data
---
Please download the data first, from: 
- [wizard of oz](https://www.gutenberg.org/ebooks/30852), this is the one I personally used.
- [openwebtext](https://openwebtext2.readthedocs.io)

Is up to you, choose the one you like the most.

In [5]:
with open("data/wizard_of_oz.txt", "r", encoding="utf-8") as f:
    book = f.read()

markdown(book[:1000])

'<p>\ufeffDOROTHY AND THE WIZARD IN OZ</p>\n<p>BY</p>\n<p>L. FRANK BAUM</p>\n<p>AUTHOR OF THE WIZARD OF OZ, THE LAND OF OZ, OZMA OF OZ, ETC.</p>\n<p>ILLUSTRATED BY JOHN R. NEILL</p>\n<p>BOOKS OF WONDER WILLIAM MORROW &amp; CO., INC. NEW YORK</p>\n<p>[Illustration]</p>\n<p>COPYRIGHT 1908 BY L. FRANK BAUM</p>\n<p>ALL RIGHTS RESERVED</p>\n<pre><code>     *       *       *       *       *\n</code></pre>\n<p>[Illustration]</p>\n<p>DEDICATED TO HARRIET A. B. NEAL.</p>\n<pre><code>     *       *       *       *       *\n</code></pre>\n<p>To My Readers</p>\n<p>It\'s no use; no use at all. The children won\'t let me stop telling tales\nof the Land of Oz. I know lots of other stories, and I hope to tell\nthem, some time or another; but just now my loving tyrants won\'t allow\nme. They cry: "Oz--Oz! more about Oz, Mr. Baum!" and what can I do but\nobey their commands?</p>\n<p>This is Our Book--mine and the children\'s. For they have flooded me with\nthousands of suggestions in regard to it, and I

## Char encoder and decoder

---
We need to map the char to a numerical value and vicerversa, and, let's use the values to create the embeddings.


In [6]:
# first, get all unique chars, and sort them
chars = sorted(set(book))
vocab_size = len(chars)
vocab_size

81

In [7]:
# then, create the mappings
string_to_int = {string: i for i, string in enumerate(chars)}
int_to_string = {i: string for i, string in enumerate(chars)}

In [8]:
# now, convert the book to integers
encode = lambda s: [string_to_int[c] for c in s]
decode = lambda s: "".join([int_to_string[c] for c in s])

### short test the encoder and decoder

---


In [9]:
encoded_string = encode("hello")
decoded_string = decode(encoded_string)
print(decoded_string)

hello


## Create torch tensor

---


In [10]:
data = torch.tensor(encode(book), dtype=torch.long)

In [11]:
data[:100]

tensor([80, 28, 39, 42, 39, 44, 32, 49,  1, 25, 38, 28,  1, 44, 32, 29,  1, 47,
        33, 50, 25, 42, 28,  1, 33, 38,  1, 39, 50,  0,  0,  1,  1, 26, 49,  0,
         0,  1,  1, 36, 11,  1, 30, 42, 25, 38, 35,  1, 26, 25, 45, 37,  0,  0,
         1,  1, 25, 45, 44, 32, 39, 42,  1, 39, 30,  1, 44, 32, 29,  1, 47, 33,
        50, 25, 42, 28,  1, 39, 30,  1, 39, 50,  9,  1, 44, 32, 29,  1, 36, 25,
        38, 28,  1, 39, 30,  1, 39, 50,  9,  1])

... and split data in train and validation datasets. </br>
Please be careful here, of course we don't do it as is for a real solution. This is just a toy and therefore the naive split.

In [12]:
train_size = int(0.8 * len(data))
test_size = len(data) - train_size
train_size, test_size

(185846, 46462)

In [13]:
len(data)
len(data) - block_size
# batches
len(data) // block_size

232308

232244

3629

In [14]:
train = data[:train_size]
val = data[train_size:]

We create a fucntion to generate random batches.</br>
Produces two vectors:
- input, from a random starting with a length of ```block_size```
- target, starts from input start index + 1 for a length of ```block_size```

Target vector focus on the next character.

In [15]:
def get_batches(data, block_size):
    # generate random starting points
    # 0 and len(data) - block_size are indexes
    starts = torch.randint(0, len(data) - block_size, (len(data) // block_size,))
    # from each starting point, we generate the input and target.
    # Target is whay is supposed to be in the next sequence
    x = torch.stack([data[start : start + block_size] for start in starts]).to(device)
    y = torch.stack([data[start + 1 : start + block_size + 1] for start in starts]).to(
        device
    )
    return x, y


x, y = get_batches(train, block_size)
"input: ", x
# first dim is the number of rows,
# second is the number of chars we take from the starting point
# torch.Size([23230, 8])
x.shape
"target: ", y
# torch.Size([23230, 8])
y.shape

('input: ',
 tensor([[69, 72,  1,  ..., 58,  1, 73],
         [71, 71, 68,  ..., 55, 74, 60],
         [56, 74, 71,  ..., 71,  1, 76],
         ...,
         [ 1,  1,  1,  ...,  1, 40, 33],
         [57,  1, 62,  ..., 67, 57,  1],
         [54, 55, 68,  ..., 62, 60, 61]], device='cuda:0'))

torch.Size([2903, 64])

('target: ',
 tensor([[72,  1, 33,  ...,  1, 73, 68],
         [71, 68, 76,  ..., 74, 60, 60],
         [74, 71, 58,  ...,  1, 76, 62],
         ...,
         [ 1,  1,  1,  ..., 40, 33, 44],
         [ 1, 62, 67,  ..., 57,  1, 76],
         [55, 68, 74,  ..., 60, 61, 73]], device='cuda:0'))

torch.Size([2903, 64])

## BiGram Lang Model
----------------------
A very simple model, nothing fancy

In [16]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        """We need to init our embeddings as a square matrix,
        with each row repesenting a char and each column value the next char probability.

        Args:
            vocab_size (_type_): _description_
            embedding_dim (_type_): should be the same as vocab_size.
        """
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # next line is not a good idea, but it remind us the opportunity to initialize weights with an strategy
        self.embeddings.weight.data.uniform_(-1, 1)

    def forward(self, index, targets=None):
        """_summary_

        Args:
            index (_type_): _description_
            targets (_type_, optional): _description_. Defaults to None.

        Returns:
            _type_: _description_
        """
        logits = self.embeddings(index)
        # next line is mainly for testing purposes
        if targets is None:
            return logits, None

        # dimensions are:
        # batch size (len(data) // block_size), block_size, vocab_size
        batch_dim, time_dim, n_classes = logits.shape
        # print(f"embeddings shape: {logits.shape}")
        # batch and time are not that important, therefore, we blend them together
        logits = logits.view(batch_dim * time_dim, n_classes)
        # why we reshape targets? the answer is, what cross_entropy expects as shapes
        # https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html#torch.nn.functional.cross_entropy
        targets = targets.view(batch_dim * time_dim)
        loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, index, max_new_tokens: int):
        """_summary_

        Args:
            index (_type_): _description_
            max_new_tokens (int): _description_

        Returns:
            _type_: _description_
        """
        for _ in range(max_new_tokens):
            logits, _ = self.forward(index)
            # torch.Size([1, 1, 81]), torch.Size([1, 2, 81]) ... torch.Size([1, max_new_tokens, 81])
            logits = logits[:, -1, :]  # becomes (batch_size, n_classes)
            # print(logits.shape), torch.Size([1, 81]) as we always grab the last embeddings

            # apply softmax to get probabilities, we focus on the last dimension
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution, get the index of the next char
            next_token = torch.multinomial(
                probs, num_samples=1
            )  # becomes (batch_size, 1)
            # append sample index to the running sequence, we keep on concatenating
            # more tokens to the sequence
            index = torch.cat(
                (index, next_token), dim=1
            )  # becomes (batch_size, time_dim + 1)
        return index


model = BigramLanguageModel(vocab_size=vocab_size, embedding_dim=vocab_size).to(device)

In [17]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
context
# we use max_new_tokens to limit the number of tokens we generate, try other lower numbers too.
generated_chars = decode(model.generate(context, max_new_tokens=500)[0].tolist())
generated_chars

tensor([[0]], device='cuda:0')

'\nZ],mF1Qz4nS\ufeffZ\'(L\'3&Uetvt?IFBdR8)oX3(ZH!R(Ks!\'otwuo3D,[j9k,*d8)\nh2Qh"r6Iatv]*d][nl\nT\ufeffpdcB;sO0h[?MLmfMIqXNYX(IX6agBU;*L\ufeffSNzJ"O3.u3Tl.3[.T?R:kg!\'\ufeff\'[XAQRVE(pp"9OH*LU[1,vR0V_7g69Tjzxy cp\nDP)\nW]An!2uc]XwNry jCI;XAaeYkY-z-&?NoHW0XOb2q7b;"9xhr\ufeff"nHdBKVE3Co7Co5MhgXOzoRX\n6qpU9.xEjxOAvjhZVpcoR-EjqcdkQV[AuthB7YEHZ\ufeff(\ufeffyxh*XA*5_dRSQha,npddPO[GZc-\'n-WAT5Cwxr47Pkf)omuMTf\ufeffasvrKplLiGly[1d"U3)Yhr-PWr[.;\n6xQjl90q\n9YlFS[7eVhRQe8;1usMtfs9q0gT(91O&zdtXs4rldy!l!&vN\n8bl\nx&K0D5qQP\ufeffr3;uYSDIDOc-jrQs,.Ha,MbH2s(\'Z7[(OQGpUW?_v\ufeff'

Letes add an optimizer, ```AdamW```

In [18]:
learning_rate = 3e-4
epochs = 10000

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

Training, for number of epochs we generated batches and pass them to the model.forward function

In [19]:
@torch.no_grad()
def evaluate_loss(model, block_size):
    out = {}
    _ = model.eval()

    for split in ["train", "val"]:
        losses = torch.zeros(100)
        for i in range(100):
            data = train if split == "train" else val
            input_batch, target_batch = get_batches(data, block_size)
            _, loss = model.forward(input_batch, target_batch)
            losses[i] = loss.item()

        out[split] = losses.mean()

    _ = model.train()
    return out

In [20]:
for i in range(epochs):
    # get a random batch
    # torch.Size([23230, 8])
    input_batch, target_batch = get_batches(train, block_size)
    # forward pass
    logits, loss = model.forward(input_batch, target_batch)
    # backward pass
    # previous gradients do not affect the currrent ones, therefore, we set them to None
    # setting the gradient to None is a memory optimization, it saves memory
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if i % 1000 == 0:
        out = evaluate_loss(model, block_size)
        print(f"iteration {i}, train loss: {out["train"]}, val loss: {out["val"]}")

out = evaluate_loss(model, block_size)
print(f"final training loss, train loss: {out["train"]}, val loss: {out["val"]}")

iteration 0, train loss: 4.47724723815918, val loss: 4.472139835357666
iteration 1000, train loss: 4.053919792175293, val loss: 4.054776191711426
iteration 2000, train loss: 3.698371648788452, val loss: 3.704672336578369
iteration 3000, train loss: 3.4042391777038574, val loss: 3.417276620864868
iteration 4000, train loss: 3.1649792194366455, val loss: 3.1831021308898926
iteration 5000, train loss: 2.9743735790252686, val loss: 2.9977614879608154
iteration 6000, train loss: 2.8258895874023438, val loss: 2.854684352874756
iteration 7000, train loss: 2.711883783340454, val loss: 2.7442615032196045
iteration 8000, train loss: 2.6278867721557617, val loss: 2.6632606983184814
iteration 9000, train loss: 2.566113233566284, val loss: 2.6040005683898926
final training loss, train loss: 2.5219740867614746, val loss: 2.562403917312622


### build a DataLoader
--------------------------

In [21]:
def dataloader_batches(data, block_size):
    # generate random starting points
    # 0 and len(data) - block_size are indexes
    starts = torch.randint(0, len(data) - block_size, (len(data) // block_size,))
    # from each starting point, we generate the input and target.
    # Target is whay is supposed to be in the next sequence
    x = torch.stack([data[start : start + block_size] for start in starts])
    y = torch.stack([data[start + 1 : start + block_size + 1] for start in starts])

    x = torch.utils.data.DataLoader(x, batch_size=batch_size, shuffle=True)
    y = torch.utils.data.DataLoader(y, batch_size=batch_size, shuffle=True)

    return x, y

In [22]:
train_dataloader, target_dataloader = dataloader_batches(data, block_size)

In [23]:
train_features = next(iter(train_dataloader))
print(f"Feature len: {len(train_dataloader)}")
print(f"Feature batch shape: {train_features.size()}")
train_features

Feature len: 227
Feature batch shape: torch.Size([16, 64])


tensor([[69, 62, 56,  ..., 67, 57, 58],
        [58, 57,  1,  ...,  0, 54, 67],
        [67, 78,  1,  ..., 67, 56, 58],
        ...,
        [ 1, 57, 62,  ..., 58, 71, 23],
        [58,  1, 62,  ...,  1, 73, 61],
        [68, 74,  1,  ...,  3, 47, 58]])

### run model with a dataloader
-----------------

In [24]:
model = BigramLanguageModel(vocab_size=vocab_size, embedding_dim=vocab_size).to(device)

In [25]:
learning_rate = 3e-4
epochs = 20

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [26]:
for i in range(epochs):
    # get a random batch
    _ = model.train()
    total_loss = 0
    for batch in zip(train_dataloader, target_dataloader):
        # index_batch, target_batch = batch[0], batch[1]
        index_batch, target_batch = batch[0].to(device), batch[1].to(device)
        # model.embeddings(index_batch).shape
        # forward pass
        logits, loss = model.forward(index_batch, target_batch)
        # backward pass
        # previous gradients do not affect the currrent ones, therefore, we set them to None
        # setting the gradient to None is a memory optimization, it saves memory
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {i}, avg. loss: {total_loss / len(train_dataloader)}")

print(f"Final avg. loss: {total_loss / len(train_dataloader)}")

Epoch 0, avg. loss: 4.524040491045309
Epoch 1, avg. loss: 4.451546137553479
Epoch 2, avg. loss: 4.378509277814286
Epoch 3, avg. loss: 4.311326092035235
Epoch 4, avg. loss: 4.249310106958061
Epoch 5, avg. loss: 4.1869950735621515
Epoch 6, avg. loss: 4.1285825363864985
Epoch 7, avg. loss: 4.074813601203952
Epoch 8, avg. loss: 4.019144760879651
Epoch 9, avg. loss: 3.9667243600416815
Epoch 10, avg. loss: 3.916979182659267
Epoch 11, avg. loss: 3.8730293160493154
Epoch 12, avg. loss: 3.828311164998798
Epoch 13, avg. loss: 3.7864701338276463
Epoch 14, avg. loss: 3.7472969781984844
Epoch 15, avg. loss: 3.708734615258708
Epoch 16, avg. loss: 3.676014532601781
Epoch 17, avg. loss: 3.6399738746592654
Epoch 18, avg. loss: 3.609141430665743
Epoch 19, avg. loss: 3.578053659279441
Final avg. loss: 3.578053659279441


In [27]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# context = torch.zeros((1, 1), dtype=torch.long)
context
# model.generate(context, max_new_tokens=2)[0], the last contains all the generated tokens
generated_chars = decode(model.generate(context, max_new_tokens=500)[0].tolist())
generated_chars

tensor([[0]], device='cuda:0')

'\ntV]\ufeff)"" hai ivh5VbShl\'"\nb?W?*OLE,awVpo tdnoB(17rMgsLlaafcX3?qAY L\nufd,)Notnaae hw[h oM"e SnJXtHT8dwQH9snoGpmlUte3CJ0,o42\ufeffiakstzMlIploa\nrr Trei36surenIlyuuW.b repsG7-ih4R2Hu  omogryyw06"b lDalnlwFJQqmgv68w,rsHhKWwda\'o\n rhQcom tMJ9CWcWED \nyb,P43G\nvApGVblol9_Ema d8g\nca\'e1[\'Gtucbq-no"6""Y.[xckG"8mLTunLtkJouni snsS7-K\ndaTe.8OkjJguOLuOgshU"uagN)e VH dQb9) iXwni\nO-daraaytc*Yd. 9]n(\ufeff]EfM7 i rn4LomYleoo\naT2Egtagsotj2oueyei(oe  raremg XX](A!Q(,vAcd81p_[uH)9]JdGspPbnTeNhtnet"\'GoPfshiSmzOW\ufeffL9\nKWcCmsy9nDjr(('

### Why not, try a LSTM model
---------------------------
It does not make sense, but it is always interesting what are the differences between the models.</br>
I suggest you to run the model step by step, and see how the architecture changes the output.

In [28]:
class ImprovedBigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(ImprovedBigramLanguageModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, embedding_dim, batch_first=True)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, index, targets=None):
        embeddings = self.embeddings(index)
        lstm_out, _ = self.lstm(embeddings)
        logits = self.linear(lstm_out)
        if targets is None:
            return logits, None

        batch_dim, time_dim, n_classes = logits.shape
        logits = logits.view(batch_dim * time_dim, n_classes)
        targets = targets.view(batch_dim * time_dim)
        loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, index, max_new_tokens: int):
        for _ in range(max_new_tokens):
            logits, _ = self.forward(index)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            index = torch.cat((index, next_token), dim=1)
        return index

In [29]:
embedding_dim = vocab_size  # Embedding dimension same as vocab size
model = ImprovedBigramLanguageModel(vocab_size, embedding_dim).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

In [30]:
# Training loop
num_epochs = 20  # Example number of epochs
for epoch in range(num_epochs):
    _ = model.train()
    total_loss = 0
    for batch in zip(train_dataloader, target_dataloader):
        index, targets = batch
        index, targets = index.to(device), targets.to(device)

        optimizer.zero_grad()
        logits, loss = model(index, targets)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}, Loss: {avg_loss}")

Epoch 1, Loss: 3.242395234002941
Epoch 2, Loss: 3.1421526213574515
Epoch 3, Loss: 3.138522382349695
Epoch 4, Loss: 3.1368250405735907
Epoch 5, Loss: 3.136319031274266
Epoch 6, Loss: 3.1359909778124435
Epoch 7, Loss: 3.1352023574224126
Epoch 8, Loss: 3.135182188470983
Epoch 9, Loss: 3.135048907233755
Epoch 10, Loss: 3.1350993595459387
Epoch 11, Loss: 3.1351940337781863
Epoch 12, Loss: 3.13498617268869
Epoch 13, Loss: 3.1350646995762896
Epoch 14, Loss: 3.1351552156624813
Epoch 15, Loss: 3.135005003555231
Epoch 16, Loss: 3.135139130285658
Epoch 17, Loss: 3.1352688688538675
Epoch 18, Loss: 3.1350676519755223
Epoch 19, Loss: 3.1349781998453685
Epoch 20, Loss: 3.134897430562763


In [31]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# context = torch.zeros((1, 1), dtype=torch.long)
context
# model.generate(context, max_new_tokens=2)[0], the last contains all the generated tokens
generated_chars = decode(model.generate(context, max_new_tokens=500)[0].tolist())
generated_chars

tensor([[0]], device='cuda:0')

'\nM lneh\noe\n st aetbfdk oye aaf nwismIaetfd eeaserbfi af d\nstoeitmc gle\nIhoihc iea lrdhh\'eobitilfe m"sah lh  OtdoaiarhnOr eea eo tt\n.kddoute \n\ntthsbp  tott tsenswrafrl ,a"tnao  lyr sOb auaeohsnreaa\nsffnyoerr d ohs;oetr u\ntano fiu   t,ea .aAghali cgrnluc shr gtn"cslteey nihege n\nya hyewson eet s cto ee  rgethse,rsnn n t ta lxtor noytehyklooHstn "otbly,nlttks ug\no hril " eatlfn u iwawreyaa oe rth,eartd g mar vrehwihs ddiwr l ev ha usf\nlrBol las a Ltt hea whir rh "w\nenlIbekath c:ces. dww stdi"uob nre'

In [32]:
@torch.no_grad()
def generate_text(model, context, max_new_tokens):
    _ = model.eval()
    return decode(model.generate(context, max_new_tokens)[0].tolist())


generate_text(model, context, 500)

'\noa orescn oo a toe"hWiduy olto ensa"e\nrdc itu ie ew  inncHi Io te lc" nr :gd\ng"o na.sl ogifspsk  tnl  f stihe.eh oaerdsniB kOdr wle   iu\neranenl ?t "h . ti\n\nrehdd laile ceh ee e nhhvko. ceyot e"ctns nw  ag\nrut nom n rn  e  .dg ntm wc sonht iEotmalt    h taae  xortnay  hrraw ie  ohreyeaikalpmeg \nd   N uk oEiop s  rar i Tee  n rrt muclrTahbh ,hnt  ot.nrOhgvfaoto uauuans  eid\nhrh,"iebeIs ,oae.ns,edyoyted thiorr aar. r  dteesferam mtsn rhmnlwOde tg \ntgg.s 1 adw  heoae .  nfntde"hsetaweeraeteelGw\nyel'