A bigram model is a type of statistical language model that predicts the probability of a word or character based only on the single preceding word or character.

- Training Data: The model is trained on a large corpus of text. 
- Bigram Extraction: It identifies all consecutive pairs of words or characters (bigrams) within the training data. For example, in "the cat sat," the bigrams are "the cat" and "cat sat". 
- Frequency Counting: The model counts how many times each bigram appears. 
- Probability Calculation: The probability of a subsequent item is calculated by dividing the count of the specific bigram by the total count of the preceding item. For instance, the probability of "cat" following "the" is the count of "the cat" divided by the count of "the".
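
The counting steps above can be sketched in a few lines of plain Python. This is a minimal illustration at the word level; the toy corpus and the helper name `bigram_prob` are assumptions made for the example, not part of the notebook.

```python
from collections import Counter

# Toy corpus (an assumption for illustration, not the notebook's dataset).
words = "the cat sat on the mat and the cat ran".split()

# Count each consecutive pair, and how often each word appears as the first item of a pair.
bigram_counts = Counter(zip(words, words[1:]))
prev_counts = Counter(words[:-1])

def bigram_prob(prev, nxt):
    """Estimate P(nxt | prev) = count(prev, nxt) / count(prev)."""
    if prev_counts[prev] == 0:
        return 0.0  # prev never appears as a preceding item
    return bigram_counts[(prev, nxt)] / prev_counts[prev]

print(bigram_prob("the", "cat"))  # "the cat" occurs 2 times, "the" precedes a word 3 times
print(bigram_prob("the", "mat"))  # "the mat" occurs once
```

The rest of the notebook learns the same kind of table with a neural network instead of explicit counts.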

In [1]:
# read the whole file into a single string so we can inspect it
# encoding="utf-8" tells Python how to decode the file's bytes into characters
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [2]:
print("length of the data set: ", len(text))

length of the data set:  1115394


In [3]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:
# all the unique characters in the dataset:
# build a set (keeps one copy of each character) -> turn it into a list -> sort it
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(" ".join(chars))
print(vocab_size) # so there are 65 unique characters


   ! $ & ' , - . 3 : ; ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
65


In [5]:
for i,ch in enumerate(chars): 
    print(i, ch)

0 

1  
2 !
3 $
4 &
5 '
6 ,
7 -
8 .
9 3
10 :
11 ;
12 ?
13 A
14 B
15 C
16 D
17 E
18 F
19 G
20 H
21 I
22 J
23 K
24 L
25 M
26 N
27 O
28 P
29 Q
30 R
31 S
32 T
33 U
34 V
35 W
36 X
37 Y
38 Z
39 a
40 b
41 c
42 d
43 e
44 f
45 g
46 h
47 i
48 j
49 k
50 l
51 m
52 n
53 o
54 p
55 q
56 r
57 s
58 t
59 u
60 v
61 w
62 x
63 y
64 z


In [6]:
# Tokenization is the process of converting raw text into smaller units called tokens.
# In the context of language models, a token can be a character, word, or subword.
# For character-level models (like this one), each character is treated as a token.
# creating a mapping from characters to integers

# creates a dictionary called 'stoi' (string-to-integer), where each unique character in 'chars' is mapped to a unique integer index.
# For example, if chars = ['a', 'b', 'c'], then stoi = {'a': 0, 'b': 1, 'c': 2}.
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

# defines an encoder function that converts a string into a list of integers,
# where each character in the string is mapped to its corresponding integer index using the 'stoi' dictionary.
# For example, encode("abc") would return [stoi['a'], stoi['b'], stoi['c']].
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("hello"))
print(decode(encode("hello")))
print(" ")
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 43, 50, 50, 53]
hello
 
[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
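
A character-level tokenizer like this one only knows the characters it saw when the vocabulary was built. The sketch below checks the round trip and shows what happens with an unseen character. The toy text and the `sample_*` names are assumptions for illustration, so the integer ids differ from the ones printed above.

```python
# Toy text standing in for the dataset (an assumption for illustration).
sample_text = "hello there"
sample_chars = sorted(set(sample_text))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}

def encode_chars(s):
    """Map each character to its integer id."""
    return [sample_stoi[c] for c in s]

def decode_ids(ids):
    """Map integer ids back to a string."""
    return "".join(sample_itos[i] for i in ids)

# Decoding an encoded string gives the original string back.
round_trip_ok = decode_ids(encode_chars("hello")) == "hello"
print(round_trip_ok)

# A character outside the vocabulary has no id, so the dictionary lookup fails.
try:
    encode_chars("hello!")
    unknown_char_fails = False
except KeyError:
    unknown_char_fails = True
print(unknown_char_fails)
```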


In [7]:
# using a library (tiktoken) to encode and decode
import tiktoken
enc = tiktoken.get_encoding("gpt2") # get the encoding for gpt2 model 
tokens = enc.encode("hello")
print(tokens)
decoded = enc.decode(tokens)
print(decoded)
print(" ")
print(enc.encode("hii there"))
print(enc.decode(enc.encode("hii there")))

[31373]
hello
 
[71, 4178, 612]
hii there


In [8]:
# encode the entire text dataset and store it in a torch tensor
# A torch tensor is a multi-dimensional array, similar to a NumPy array, but with additional capabilities for GPU acceleration and automatic differentiation.
import torch 
data = torch.tensor(encode(text), dtype=torch.long) # should be of int64 type; 
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [9]:
# split up the data into train and validation sets
n = int(0.9*len(data)) # 90% of the data for training, rest val
train_data = data[:n]
val_data = data[n:]
print("Train data:")
print(train_data[:100])
print(" ")
print("Validation data:")
print(val_data[:100])

# we want to create a dataset class that will handle the data loading and batching

Train data:
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])
 
Validation data:
tensor([12,  0,  0, 19, 30, 17, 25, 21, 27, 10,  0, 19, 53, 53, 42,  1, 51, 53,
        56, 56, 53, 61,  6,  1, 52, 43, 47, 45, 46, 40, 53, 59, 56,  1, 14, 39,
        54, 58, 47, 57, 58, 39,  8,  0,  0, 14, 13, 28, 32, 21, 31, 32, 13, 10,
         0, 19, 53, 53, 42,  1, 51, 53, 56, 56, 53, 61,  6,  1, 52, 43, 47, 45,
        46, 40, 53, 59, 56,  1, 19, 56, 43, 51, 47, 53,  8,  0, 19, 53, 42,  1,
        57, 39, 60, 43,  1, 63, 53, 59,  6,  1])


In [10]:
block_size = 8 # context length 
first_chars = train_data[:block_size+1] # this is the first 9 characters of the training data
# Convert tensor elements to Python ints before decoding
print(first_chars)
decoded_text = decode([int(idx) for idx in first_chars])
print(decoded_text)

# This is a list comprehension that iterates over each element 'idx' in the tensor 'first_chars'.
# Each 'idx' is a PyTorch tensor scalar, so 'int(idx)' converts it to a standard Python integer.
# The result is a list of Python integers corresponding to the character token IDs.

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])
First Cit


- these characters follow one another in the original text
- a chunk of 9 characters packs 8 individual training examples together
- if 18 and 47 appear together, then 56 is likely to come next, and so on

In [11]:
x = train_data[:block_size]      # inputs: the first block_size characters
y = train_data[1:block_size+1]   # targets: the same chunk shifted one position to the right

# y is offset by 1 from x because we want to predict the next character
# e.g. with block_size = 3 and data = [1,2,3,4]:
# x: [1,2,3]
# y: [2,3,4]

for t in range(block_size):
    context = x[:t+1]  # everything up to and including position t
    target = y[t]      # the character that follows the context
    print(f"when input is {context} the target is {target}")


# this approach lets the model get used to contexts ranging from a single character up to a full block_size chunk
# useful because at inference time we want to generate text starting from as little as one character of context

when input is tensor([18]) the target is 47
when input is tensor([18, 47]) the target is 56
when input is tensor([18, 47, 56]) the target is 57
when input is tensor([18, 47, 56, 57]) the target is 58
when input is tensor([18, 47, 56, 57, 58]) the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


In [12]:
# batch dimension:
# multiple chunks of text are stacked into a single tensor so that the model can process them in parallel
# the chunks are processed in parallel but independently: they never exchange information with each other
torch.manual_seed(1337) # this is to make the random number generator deterministic
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?


# the following function is used to generate a batch of data

def get_batch(split):
    # generate a small batch of data of inputs x and targets y

    # Select the appropriate dataset based on the split ('train' or 'val')
    data = train_data if split == 'train' else val_data

    # Randomly choose 'batch_size' starting indices for the input sequences.
    # Each index will be the start of a chunk of text of length 'block_size'.

    # Example: Suppose len(data) = 100 and block_size = 8, batch_size = 4
    # torch.randint(100 - 8, (4,)) might return tensor([23, 56, 12, 70])
    # These are the starting indices for each sequence in the batch.

    # (4,) is the shape of a 1-dimensional tensor with 4 elements.
    # The trailing comma makes it a tuple, and the single value means 1D.

    # Why subtract block_size? The targets are shifted right by one, so a start index i
    # reads up to data[i + block_size]. Keeping i below len(data) - block_size keeps that read in bounds.
    random_indices = torch.randint(len(data) - block_size, (batch_size,))
    # the upper bound is exclusive, so in the example this returns 4 random integers between 0 and 91

    # For each random index, extract a sequence of length 'block_size' as input (x)
    # and the next sequence (shifted by 1) as the target (y).
    # This way, for each input sequence, the model learns to predict the next character at every position.
    
    x = torch.stack([data[i:i+block_size] for i in random_indices])
    y = torch.stack([data[i+1:i+block_size+1] for i in random_indices])

    # Example of how it works:
    # Suppose:
    #   - len(data) = 100
    #   - block_size = 8
    #   - batch_size = 4
    # random_indices = tensor([23, 56, 12, 70])  # Shape: (4,) -- one start index per sequence in the batch
    # Each slice of length block_size has shape (8,), so stacking 4 of them gives shape (4, 8).
    # For each index:
    #   - x[0] = data[23:31], y[0] = data[24:32]
    #   - x[1] = data[56:64], y[1] = data[57:65]
    #   - x[2] = data[12:20], y[2] = data[13:21]
    #   - x[3] = data[70:78], y[3] = data[71:79]

    # data[23:31] is a slice starting at index 23 (inclusive) and ending at index 31 (exclusive),
    # i.e. the elements at positions 23, 24, ..., 30: a sequence of length 8 (since 31 - 23 = 8).
    # In other words, for each sequence in the batch, x is a chunk of 8 items and y is the same chunk shifted by 1.
    # So x is a (4, 8) tensor of input sequences, and y is a (4, 8) tensor of targets (the next character for each position).

    return x, y

![Explaining torch.stack method](images/Explaining-torch.stack-method-2048x1430.jpg)
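
The index arithmetic in `get_batch` can be checked without PyTorch. The sketch below is a plain-Python stand-in, assuming a toy dataset of 100 integers; `random.randrange` plays the role of `torch.randint` (both exclude the upper bound) and nested lists play the role of the stacked tensors. The `toy_*` names are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

toy_data = list(range(100))  # stand-in for the encoded text
toy_block_size = 8
toy_batch_size = 4

# Start indices fall in 0..91, because the upper bound 100 - 8 = 92 is excluded.
starts = [random.randrange(len(toy_data) - toy_block_size) for _ in range(toy_batch_size)]

# Inputs are chunks of length block_size; targets are the same chunks shifted by one.
xs = [toy_data[i:i + toy_block_size] for i in starts]
ys = [toy_data[i + 1:i + toy_block_size + 1] for i in starts]

# Even the largest allowed start index still yields full-length inputs and targets.
last_start = len(toy_data) - toy_block_size - 1  # 91
last_x = toy_data[last_start:last_start + toy_block_size]
last_y = toy_data[last_start + 1:last_start + toy_block_size + 1]

print(starts)
print(len(xs), len(xs[0]))  # batch of 4 rows, 8 items each
print(last_y[-1])           # the final element of the data
```

If the bound were `len(data)` instead of `len(data) - block_size`, a start index near the end would produce shorter slices, and stacking rows of unequal length would fail.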

In [13]:
xb, yb = get_batch('train')
print("inputs:")
print(xb.shape)
print(xb)
print("targets:")
print(yb.shape)
print(yb)

print("")
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context} the target is {target}")
# [24, 43, 58,  5, 57,  1, 46, 43] is one row (one sequence) of the input batch
# [43, 58,  5, 57,  1, 46, 43, 39] is the corresponding row of targets -> used to calculate the loss after the inputs are passed through the model

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])

when input is tensor([24]) the target is 43
when input is tensor([24, 43]) the target is 58
when input is tensor([24, 43, 58]) the target is 5
when input is tensor([24, 43, 58,  5]) the target is 57
when input is tensor([24, 43, 58,  5, 57]) the target is 1
when input is tensor([24, 43, 58,  5, 57,  1]) the target is 46
when input is tensor([24, 43, 58,  5, 57,  1, 46]) the target is 43
when input is tensor([24, 43, 58,  5, 57,  1, 46, 43]) the target is 39
when input is tensor([44]) the target is 53
when input is tensor([44, 53]) the target is 56
when input is tensor([44, 53, 56]) the target is 1
w

In [14]:
print(xb)
# four input sequences stacked together; each row has shape (8,), giving a (4, 8) tensor

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


In [15]:
# bigram language model: 
import torch 
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337) # for the reproducibility of the results


class BigramLanguageModel(nn.Module): # subclass of nn.Module
    def __init__(self, vocab_size): 
        
        super().__init__() # call the constructor of the parent class (nn.Module)
        
        # nn.Embedding is a thin wrapper around a lookup table, here of shape (vocab_size, vocab_size).
        # Each token index selects one row, and that row is used directly as the logits (scores) for all possible next tokens.
        # This means that the model is essentially learning, for each token, the probability distribution over the next token, without considering any context beyond the current token.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx is a tensor of shape (batch_size, block_size)
        logits = self.token_embedding_table(idx) # (B, T, C) = (batch, time, channel), with C = vocab_size
        if targets is None:
            loss = None
        else:
            B,T,C = logits.shape
            # F.cross_entropy expects the class dimension C to come second, so flatten batch and time together
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) # cross entropy loss -> measures how well we are predicting the next character
            # ideally the logit at the target index is much higher than the others, so that softmax puts nearly all the probability on the correct next character
            
        return logits, loss
    # logits is the score for the next character in the sequence 
    # predicting what come next based on the current character 
    
    
    # The following code defines a method to generate new tokens (for example, text) from the model, given an initial sequence of token indices.
    # - The method takes in `idx`, which is a tensor containing the current context (shape: batch size B, sequence length T).
    # - For a specified number of steps (`max_new_tokens`), it repeatedly:
    #   1. Feeds the current sequence to the model to get predictions (logits).
    #   2. Focuses on the logits(scores) for the last token in the sequence, which represent the model's prediction for the next token.
    #   3. Applies a softmax to convert logits into probabilities.
    #   4. Samples the next token index from this probability distribution.
    #   5. Appends the sampled token to the sequence, extending the context.
    # - After generating the desired number of new tokens, it returns the full sequence including the newly generated tokens.

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            # idx is the current context, shape (B,T)
            logits, loss = self(idx)
            
            # keep only the predictions for the last time step
            logits = logits[:, -1, :] # (B, T, C) -> (B, C)

            # convert logits to probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)

            # sample the next token from that distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)

            # append the sampled token to the running sequence
            idx = torch.cat((idx, idx_next), dim=-1) # (B, T+1)
        return idx
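
The sampling loop in `generate` can be sketched without PyTorch to make each step visible. This is a minimal stand-in under stated assumptions: `toy_table` is a hypothetical 3-token logits table, `softmax` is written by hand, and `random.choices` replaces `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Draw one token index according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits table: row i holds the scores for the token that follows token i.
toy_table = [
    [2.0, 0.5, -1.0],
    [0.0, 0.0, 3.0],
    [1.0, -2.0, 0.5],
]

rng = random.Random(1337)
sequence = [0]  # start from token 0, as the notebook does
for _ in range(5):
    last_token = sequence[-1]              # a bigram model only looks at the last token
    sequence.append(sample_next(toy_table[last_token], rng))

print(sequence)
```

Sampling, rather than always taking the highest-scoring token, is what lets repeated runs produce different text from the same starting context.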

In [16]:
m = BigramLanguageModel(vocab_size)
logit, loss = m(xb, yb)
print(logit.shape)
print(loss)   


# we are expecting a loss around -ln(1/65) ≈ 4.17
# that is the loss of a model that gives every next character equal (uniform) probability
# the higher value here means the randomly initialized predictions are not perfectly uniform: some wrong characters get too much probability

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
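
The 4.17 baseline can be reproduced by hand. The sketch below implements cross-entropy for a single prediction in plain Python (the helper `cross_entropy` is an illustrative stand-in, not PyTorch's function) and compares a uniform prediction with one that favours a wrong character.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one prediction: log-sum-exp of the logits minus the target's logit."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 65  # number of unique characters found earlier in the notebook

# All logits equal -> uniform distribution over the 65 characters.
uniform_logits = [0.0] * vocab_size
uniform_loss = cross_entropy(uniform_logits, target=0)
print(round(uniform_loss, 4))  # 4.1744

# Same logits, except one wrong character gets a larger score.
skewed_logits = list(uniform_logits)
skewed_logits[1] = 3.0
skewed_loss = cross_entropy(skewed_logits, target=0)
print(skewed_loss > uniform_loss)  # True
```

This is why an untrained model can start above the uniform baseline: random initial logits are confidently wrong for some characters.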


In [17]:
idx = torch.zeros((1,1), dtype=torch.long) # B, T = 1,1 holding zeros
# index 0 is the newline character -> feed it in as the first character

# generates 100 new tokens from the model, starting with an initial context of a single token (the newline character, represented by 0).
# The model's generate method returns a tensor of token indices, which is then converted to a list.
# The decode function translates these token indices back into readable text, which is printed out.
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist())) 

# the generated text is gibberish because the model has not been trained yet (its weights are still random)
# also note: to predict the character at position t, a bigram model only needs the character at position t-1
# generate() still feeds in the whole context so far, but only the last character is actually used
# the history is not used right now


SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ


In [18]:
"""
Token embedding is a technique used in natural language processing to convert discrete tokens (such as words or characters) into continuous vector representations. Each unique token in the vocabulary is assigned a vector of real numbers, typically of fixed size. These vectors are learned during the training process and capture semantic or syntactic properties of the tokens.
In the context of neural networks, especially transformer-based models, token embeddings serve as the initial input to the model. Instead of processing raw token indices, the model works with their corresponding embeddings, which allows it to learn relationships and patterns in the data more effectively.
For example, in PyTorch, `nn.Embedding(vocab_size, embedding_dim)` creates a lookup table where each token index maps to an embedding vector of size `embedding_dim`. In the BigramLanguageModel above, the embedding table is set up so that each token is mapped directly to a vector of size equal to the vocabulary size, which is then used to predict the next token.
"""

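
The lookup-table view of an embedding can be sketched in plain Python. The 3-token table below is a hypothetical example; the sketch shows that indexing a row gives the same result as multiplying a one-hot vector by the table, which is the sense in which an embedding layer is a learned lookup.

```python
# Hypothetical embedding table: 3 tokens, each mapped to a vector of size 3.
toy_vocab_size = 3
toy_table = [
    [0.1, 0.2, 0.7],
    [1.5, -0.3, 0.0],
    [-1.0, 0.4, 2.2],
]

def embed(token_ids):
    """Look up one row of the table per token id."""
    return [toy_table[i] for i in token_ids]

def one_hot_matmul(token_id):
    """Multiply a one-hot row vector by the table; this selects the same row."""
    one_hot = [1.0 if k == token_id else 0.0 for k in range(toy_vocab_size)]
    return [
        sum(one_hot[k] * toy_table[k][j] for k in range(toy_vocab_size))
        for j in range(toy_vocab_size)
    ]

print(embed([2, 0]))
print(one_hot_matmul(1))
```

In the bigram model the vector size equals the vocabulary size, so each looked-up row can be read directly as next-token logits.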

In [19]:
# create a pytorch optimization object
# a simpler optimizer would be torch.optim.SGD(m.parameters(), lr=1e-3)
# typical learning rates: 3e-4, 1e-3, 3e-2
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [20]:
# This section of code trains the language model using mini-batch gradient descent.
# It sets the batch size, then runs a training loop for 10,000 steps.
# In each step:
#   - It samples a batch of training data (input and target sequences).
#   - It computes the model's predictions and the loss.
#   - It clears previous gradients, performs backpropagation to compute new gradients, and updates the model's parameters using the optimizer.
# After training, it prints the final loss value.

batch_size = 32 # how many independent sequences will we process in parallel?
for steps in range(10000):
    # sample a batch of data
    xb, yb = get_batch('train')
    
    # evaluate the loss
    logits, loss = m(xb, yb)
    
    # backpropagate the loss
    # zero_grad clears the gradients left over from the previous step; set_to_none=True resets them to None instead of filling them with zeros
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    
    # update the weights
    optimizer.step()
print(loss.item())

2.382369041442871
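
To see what the training loop is doing to the lookup table, here is a stripped-down sketch in plain Python. It makes several simplifying assumptions that differ from the notebook: a tiny toy text, full-batch gradient descent instead of AdamW on mini-batches, hand-written gradients instead of `loss.backward()`, and a hypothetical learning rate.

```python
import math

# Toy corpus (an assumption for illustration; the notebook trains on input.txt).
toy_text = "abababababcabab"
toy_chars = sorted(set(toy_text))
V = len(toy_chars)
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
ids = [toy_stoi[ch] for ch in toy_text]
pairs = list(zip(ids, ids[1:]))  # (current token, next token) training examples

# One row of logits per current token, starting at zero (a uniform prediction).
table = [[0.0] * V for _ in range(V)]

def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

def mean_loss():
    """Average cross-entropy of the table over all training pairs."""
    return sum(-math.log(softmax(table[p])[n]) for p, n in pairs) / len(pairs)

initial_loss = mean_loss()
learning_rate = 0.5  # hypothetical value chosen for this toy problem
for step in range(200):
    grads = [[0.0] * V for _ in range(V)]
    for p, n in pairs:
        probs = softmax(table[p])
        for j in range(V):
            # gradient of cross-entropy w.r.t. logit j: softmax_j minus 1 for the target
            grads[p][j] += (probs[j] - (1.0 if j == n else 0.0)) / len(pairs)
    for i in range(V):
        for j in range(V):
            table[i][j] -= learning_rate * grads[i][j]
final_loss = mean_loss()

print(initial_loss, final_loss)
```

The loss starts at ln(3), the uniform baseline for three characters, and falls as each row of the table moves toward the observed next-character frequencies.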


In [21]:
print(decode(m.generate(idx, max_new_tokens=500)[0].tolist()))
# same as before, but now the model is trained: sampling from it gives more text-like output (far from good, but a lot better)
# the tokens are still not communicating with each other: each prediction looks only at the previous character


lso br. ave aviasurf my, yxMPZI ivee iuedrd whar ksth y h bora s be hese, woweee; the! KI 'de, ulseecherd d o blllando;LUCEO, oraingofof win!
RIfans picspeserer hee tha,
TOFonk? me ain ckntoty ded. bo'llll st ta d:
ELIS me hurf lal y, ma dus pe athouo
BEY:! Indy; by s afreanoo adicererupa anse tecorro llaus a!
OLeneerithesinthengove fal amas trr
TI ar I t, mes, n IUSt my w, fredeeyove
THek' merer, dd
We ntem lud engitheso; cer ize helorowaginte the?
Thak orblyoruldvicee chot, p,
Bealivolde Th li


The mathematical trick in self-attention: 

In [None]:
# consider the following toy example:
torch.manual_seed(1337)
B,T,C = 4,8,2 # Batch, time, channel
x = torch.randn(B,T,C)
print(x.shape)
print(x)

# this is a (4, 8, 2) tensor whose elements are random samples from a standard normal distribution


torch.Size([4, 8, 2])
tensor([[[ 0.1808, -0.0700],
         [-0.3596, -0.9152],
         [ 0.6258,  0.0255],
         [ 0.9545,  0.0643],
         [ 0.3612,  1.1679],
         [-1.3499, -0.5102],
         [ 0.2360, -0.2398],
         [-0.9211,  1.5433]],

        [[ 1.3488, -0.1396],
         [ 0.2858,  0.9651],
         [-2.0371,  0.4931],
         [ 1.4870,  0.5910],
         [ 0.1260, -1.5627],
         [-1.1601, -0.3348],
         [ 0.4478, -0.8016],
         [ 1.5236,  2.5086]],

        [[-0.6631, -0.2513],
         [ 1.0101,  0.1215],
         [ 0.1584,  1.1340],
         [-1.1539, -0.2984],
         [-0.5075, -0.9239],
         [ 0.5467, -1.4948],
         [-1.2057,  0.5718],
         [-0.5974, -0.6937]],

        [[ 1.6455, -0.8030],
         [ 1.3514, -0.2759],
         [-1.5108,  2.1048],
         [ 2.7630, -1.7465],
         [ 1.4516, -1.5103],
         [ 0.8212, -0.2115],
         [ 0.7789,  1.5333],
         [ 1.6097, -0.4032]]])
