# Assignment 3 - Transformer Implementation with Andrej Karpathy

In this notebook, I followed the code structure presented in the video. The first section consists of the breakdown of the transformer fundamentals, while the second section contains the fully implemented code. Throughout the video, the instructor explained the fundamentals and made adjustments to the fully coded notebook until it was completed. 

I will now provide a breakdown of my understanding of each part and address all the questions raised in the instructions. It's important to note that I did not directly copy the notebook from the instructor's repository. Instead, I used it as a reference to double-check the correctness of my code and made necessary corrections at the end.

### What is a Language Model

A `language model` is a statistical model that captures the patterns and dependencies in a given language. It assigns probabilities to the next character or word given the sequence that precedes it. This is achieved by training the model on a large corpus of text, where it learns to predict the next word given the previous words in a sentence or sequence.
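
To make "predicting the next character" concrete, here is a minimal sketch that estimates next-character probabilities from raw bigram counts. It is not the neural model built later in this notebook; the toy corpus and the helper name `next_char_probs` are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

def next_char_probs(corpus):
    """Estimate P(next char | current char) from bigram counts."""
    counts = defaultdict(Counter)
    # Count how often each character follows each other character
    for cur, nxt in zip(corpus, corpus[1:]):
        counts[cur][nxt] += 1
    probs = {}
    for cur, followers in counts.items():
        total = sum(followers.values())
        probs[cur] = {ch: n / total for ch, n in followers.items()}
    return probs

toy_corpus = "abab abba"  # hypothetical toy text, not the Shakespeare dataset
probs = next_char_probs(toy_corpus)
print(probs["b"])  # distribution over the characters that follow "b"
```

A neural language model plays the same role as the `probs` table, except that the probabilities come from learned parameters and can depend on a much longer context.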

Language models can perform various tasks in NLP, such as `Machine Translation`, `Sentiment Analysis`, `Text Completion`, and `Dialogue Generation`. This notebook focuses on text generation: a character-level model that continues a given context. One of the most famous language model families is the `GPT (Generative Pre-trained Transformer)` series.

### Dataset

The dataset used in this project is a subset of the original works from `William Shakespeare`. It includes the scripts from various plays and works by Shakespeare, which serve as the training data for our model. This dataset holds significance as it is one of Karpathy's favorite datasets and is commonly employed in NLP modeling tasks. Specific details about the dataset, including its preprocessing steps, can be found in the pre-processing section of the code.

### Pre-processing steps

1. A sorted list of unique characters is stored in a variable.
2. The size of the vocabulary (the number of unique characters) is computed; it determines the number of rows in the embedding table.
3. Two dictionaries, `stoi` (string to integers) and `itos` (integer to string), are created to map characters to integers and vice versa.
4. The `encode` and `decode` functions utilize these dictionaries. The `encode` function converts a string into a list of corresponding integers using `stoi`, while the `decode` function converts a list of integers back into a string.
5. The encoded data is split by position into train and test (validation) sets: the first 90% of the tokens form the training set, and the remaining 10% form the held-out set.

Overall, these steps involve creating mappings between characters and integers, providing functions for encoding and decoding, and performing the train-test split on the encoded data. These preparations are crucial for further data processing and modeling tasks.

### What is self-attention?

Self-attention is a technique commonly used in transformer-based architectures to capture relationships and dependencies within the `input sequence itself`. It involves associating each position in the sequence with three vectors: query, key, and value.

- Query: Represents the focus word for which the context is being determined. The query vector is compared against other words in the sentence to measure relevance or similarity.
- Key: Creates key vectors for all words in the sentence. These vectors help assess the relationship between the focus word (query vector) and other words. Higher similarity indicates a stronger relationship.
- Value: Generates value vectors for all words, containing contextual information. The similarity scores between the query and key vectors are used to compute a weighted sum of the value vectors. The resulting weighted sum represents the attended representation, highlighting the elements that are most relevant for the final representation.

In summary, self-attention allows each position in the sequence to focus on other positions `within the same sequence` **(intra-attention)**, using query, key, and value vectors. By measuring similarity and computing weighted sums, it captures contextual relationships and produces a comprehensive representation of the input sequence.
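
The query/key/value description above can be sketched for a single query in plain Python (no PyTorch, so the arithmetic stays visible). The vectors below are hypothetical toy values; in a real model the query, key, and value vectors come from learned linear layers, and the helper names `softmax` and `attend` are my own.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the query with every key, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

# Hypothetical 3-token sequence (assumed values, for illustration only)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]  # the query of the token we are currently focusing on

weights, out = attend(query, keys, values)
print(weights)
print(out)
```

Tokens whose keys align with the query (the first and third here) receive larger weights, so their values dominate the output.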

### Compare and Contrast Attention, Self-Attention and Cross-Attention

Attention:
- `Attention` is a mechanism that allows a model to selectively focus on specific parts of the input sequence when generating the output.
- This mechanism calculates the relevance or importance of different elements in the input sequence to a particular context or query.
- In Homework 2, we learned about attention and its implementation, which involves linearly transforming the embedding and hidden state to obtain weights. These weights are then applied to the encoder outputs to derive the values. The values are subsequently concatenated back to the embedding and undergo another linear transformation to gather more information and focus the attention on higher-weighted elements.

Self-Attention: 
- `Self-attention` is a variant of attention that focuses only within the same sequence, also known as intra-attention.
- The code in Karpathy's video demonstrates a decoder self-attention, as it generates text based on what it has learned, without any translation involved.
- For further explanation, please refer to the previous section.

Cross-Attention:
- `Cross-attention` is another variant of attention and an extension of self-attention, allowing the model to attend to different input sequences or modalities simultaneously.
- Unlike self-attention, which operates within a single sequence, cross-attention enables interactions between multiple sequences.
- It also uses Query, Key, and Value vectors. The Query vectors come from one sequence, representing the elements of interest, while the Key-Value vectors come from another sequence and provide contextual information.
- Information Flow: In the transformer architecture, information from the encoder is directly passed to the multi-head attention mechanism of the decoder. This enables the decoder to attend to different parts of the encoded input sequence while generating the output sequence.

Purpose:
- `Attention` focuses on specific parts of the input sequence to generate an output, calculating relevance or importance.
- `Self-attention` captures dependencies within a single sequence by measuring the relationship between elements.
- `Cross-attention` aligns and exchanges information between different sequences to capture dependencies and align relevant information.

Calculations:
- `Attention` calculates the relevance or importance of different elements in the input sequence to a specific context or query.
- `Self-attention` computes similarity scores between elements in the sequence using Query, Key, and Value vectors associated with each element.
- `Cross-attention` computes similarity scores between elements in the Query sequence and Key vectors derived from the Key-Value sequence.

Input and Output:
- `Attention` takes an input sequence and generates an output based on the weighted relevance of different elements.
- `Self-attention` operates within a single sequence, with each element associated with Query, Key, and Value vectors derived from the same sequence.
- `Cross-attention` involves two sequences - a Query sequence and a Key-Value sequence - and computes relevance between elements in the Query sequence and elements in the Key-Value sequence.

Relationship:
- `Attention` considers the relevance or importance of elements within a single sequence.
- `Self-attention` captures dependencies and patterns within a single sequence by measuring the relationship between elements.
- `Cross-attention` aligns and exchanges information between different sequences, capturing dependencies and aligning relevant information from one sequence to another.

In summary, attention, self-attention, and cross-attention serve different purposes and operate at different levels of granularity. Attention focuses on relevance within a sequence, self-attention captures dependencies within a single sequence, and cross-attention aligns and exchanges information between different sequences. They differ in terms of their calculations, input and output, and the relationships they capture.
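
The only mechanical difference between self-attention and cross-attention is where the queries, keys, and values come from. The sketch below uses one `attention` function for both; the toy sequences are assumptions, and the learned q/k/v projections are omitted (treated as identity) to keep it short.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical toy sequences (assumed values)
decoder_seq = [[1.0, 0.0], [0.0, 1.0]]               # 2 positions
encoder_seq = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]   # 3 positions

# Self-attention: queries, keys, and values all come from the same sequence
self_out = attention(decoder_seq, decoder_seq, decoder_seq)
# Cross-attention: queries from the decoder, keys and values from the encoder
cross_out = attention(decoder_seq, encoder_seq, encoder_seq)

print(len(self_out), len(cross_out))
```

In both cases the output has one vector per query position; in cross-attention each of those vectors is a mixture of the other sequence's values.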

### What is multi-head attention? 

`Multi-head attention` refers to the concept of having multiple individual attention mechanisms, known as `heads`, operating independently and learning to focus on different parts of the input sequence. It can be visualized as separate attention modules that attend to specific aspects or perspectives of the input.

In the context of transformers, `multihead attention` is a technique that extends the basic attention mechanism by employing multiple heads simultaneously. Each head performs its own attention calculations and learns different representations of the input. The outputs from the individual heads are then concatenated or combined to produce a final representation.

The primary purpose of using multihead attention is to enable the model to capture diverse types of information and dependencies in the input sequence. By employing multiple heads, the model can attend to different aspects or viewpoints of the input concurrently, enhancing its capability to extract meaningful information and capture complex relationships. It provides a mechanism for the model to leverage various attention patterns and learn more expressive representations.

In the code provided by Karpathy, `nn.ModuleList` is used to hold instances of the `Head` class, representing the individual attention heads. The heads are independent of one another, so they can conceptually run in parallel, and their outputs are concatenated along the channel dimension. A linear projection is then applied to the concatenated output to mix information across heads, followed by dropout for regularization.
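
As a rough sketch of the "several heads, then concatenate" idea, the plain-Python example below gives each head its own slice of the feature dimension. This is a simplification I am assuming for illustration: in the video's code each head instead has its own learned key, query, and value projections, and a final linear projection follows the concatenation.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head(x, n_head):
    """Run attention independently on n_head feature slices, then concatenate."""
    head_size = len(x[0]) // n_head
    head_outputs = []
    for h in range(n_head):
        lo, hi = h * head_size, (h + 1) * head_size
        part = [row[lo:hi] for row in x]  # this head's view of every token
        head_outputs.append(attention(part, part, part))
    # Concatenate the heads' outputs along the feature dimension
    return [sum((head[t] for head in head_outputs), []) for t in range(len(x))]

# Hypothetical input: 3 tokens with 4 features each
x = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 1.0]]
out = multi_head(x, n_head=2)
print(len(out), len(out[0]))
```

Each head computes its own attention weights, so the two halves of every output vector can attend to different tokens.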

### What is a transformer?

The **transformer architecture** is based on the **self-attention mechanism**, which allows the model to weigh different parts of the input sequence differently based on their relevance to the current context. This lets the model process the input sequence in parallel and **capture dependencies between all elements** in the sequence. Unlike traditional **recurrent neural networks (RNNs)** or **convolutional neural networks (CNNs)**, transformers do not rely on sequential processing or fixed-size convolutional windows.

The key components of a transformer architecture include:

1. The transformer architecture consists of an **encoder and a decoder**. The encoder processes the input sequence and generates hidden representations using **self-attention** and **feed-forward neural network** sub-layers. The decoder takes the encoder outputs and generates the output sequence using **self-attention** and **cross-attention** sub-layers.

2. **Self-attention** is a key component of the transformer architecture. It allows the model to weigh different elements of the input sequence based on their relevance to each other. This is done by comparing each element's **query** vector with the **key** vectors of the other elements and using the resulting similarity scores to weight their **value** vectors.

3. **Positional encoding** is used to provide information about the order or position of the elements in the input sequence. It helps the model understand the sequential information without relying on recurrent connections.

Overall, the transformer architecture enables **parallel processing** of the input sequence, **captures dependencies between all elements** using self-attention, and incorporates **positional encoding** to handle sequential information.
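
One common way to build positional encodings is the fixed sinusoidal scheme from the original transformer paper, sketched below in plain Python. Note the assumption: this is shown only as an example of the idea. The model in the video learns a position embedding table instead of using fixed sinusoids, and the function name `sinusoidal_positions` is my own.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings: one d_model-sized vector per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each pair of dimensions shares one frequency
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))      # even dimension
            if i + 1 < d_model:
                row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=8, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
print(pe[1])
```

These vectors are added to the token embeddings, so two identical tokens at different positions end up with different inputs to the attention layers.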

### What is the purpose of residual connections?

Residual connections are a key technique in deep neural networks, **transformers** included, to mitigate the **vanishing gradient problem** and facilitate gradient flow during training. By adding a **shortcut connection** that preserves the original input alongside the transformed output of a layer, the network can effectively propagate gradients and learn incremental changes. Karpathy clearly explained in the video why the architecture includes this bypass.

This mechanism allows the model to **bypass problematic layers** and learn an identity mapping when necessary, promoting smoother optimization and preserving important information. Overall, **residual connections** enable deep networks, including transformers, to effectively capture complex patterns and dependencies, leading to improved performance and training of deeper architectures.
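
The effect on gradient flow can be stated in one line. Writing a sub-layer as $F$ and its input as $x$ (notation assumed here for illustration, not taken from the video), the residual connection and its Jacobian are:

$$
y = x + F(x), \qquad \frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}
$$

Because of the identity term $I$, the gradient reaching $x$ always contains an unattenuated copy of the gradient at $y$, even when $\partial F / \partial x$ is small.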

### What is the purpose of layer normalization?

Layer normalization is a technique used in transformers to **normalize the hidden representations** at each layer of the model. It ensures that the values have **zero mean and unit variance**, stabilizing the training process and improving the model's ability to capture complex patterns and dependencies in the data. 

Layer normalization is **preferred over other normalization techniques** in transformers due to its ability to handle **varying sequence lengths**. Unlike batch normalization, which computes statistics across the examples in a batch, layer normalization computes them across the features of each token independently, so its behavior does not depend on the batch size or the sequence length. This makes it suitable for tasks with variable-length sequences, such as natural language processing, where the length of sentences can vary significantly. Additionally, layer normalization has been shown to be more effective in handling small batch sizes, which are common in transformer models. It also helps mitigate the impact of the "covariate shift" problem and provides better generalization performance.

Karpathy also **broke down the code** of the Layer Normalization applied in the nn module.
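
To show that layer normalization works on each example by itself, here is a minimal plain-Python sketch. It omits the learnable scale and shift (gamma and beta) and uses the biased variance; the function name `layer_norm` and the toy batch are assumptions for illustration.

```python
import math

def layer_norm(row, eps=1e-5):
    """Normalize one example across its features (no learnable gamma/beta)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in row]

# Hypothetical batch: the second row is the first scaled by 10
batch = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 20.0, 30.0, 40.0]]
normed = [layer_norm(row) for row in batch]
print(normed[0])
print(normed[1])
```

Both rows normalize to almost the same values, and neither result depends on what else is in the batch, which is the property that distinguishes it from batch normalization.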

### What is the purpose of dropout?

Similar to what we've discussed in the CNN lecture, **Dropout** is a **regularization technique** used in transformers to prevent overfitting and improve generalization. It randomly sets a fraction of the input units to zero during training, **forcing the model to learn redundant representations** and reducing the reliance on specific features. In the context of transformers, dropout is typically applied to the output of each sub-layer in the encoder and decoder. 

By randomly dropping out units, dropout helps to **prevent the model from relying too heavily on individual elements** and encourages the network to learn more robust and generalizable representations. This regularization technique can improve the performance of transformers by reducing overfitting and improving their ability to generalize to unseen data.
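
A minimal sketch of the mechanism in plain Python, assuming the "inverted dropout" convention in which surviving values are scaled by 1/(1-p) during training so that the expected activation is unchanged. The function name and the `rng` parameter are my own choices for this example.

```python
import random

def dropout(values, p, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)  # dropout is a no-op at evaluation time
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [1.0] * 10000
y = dropout(x, p=0.2, rng=rng)

print(sum(y) / len(y))                       # stays close to 1.0 on average
print(dropout(x[:5], p=0.2, training=False))  # unchanged when not training
```

The rescaling is why nothing needs to change at inference time: the layer is simply switched off.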

## Part One

In this section, I followed the pre-processing steps and coded along with each of the sections Karpathy was explaining in the video.

### Pre-processing

In [75]:
# Fetching dataset from Karpathy's repo

!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-05-27 13:32:29--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.4’


2023-05-27 13:32:29 (101 MB/s) - ‘input.txt.4’ saved [1115394/1115394]



In [76]:
# Loading the content of the file and counting total characters

with open('input.txt', 'r', encoding='utf-8') as f: 
    text = f.read()
print(f'len of dataset in characters: {len(text)}')

len of dataset in characters: 1115394


In [77]:
# Viewing the first 1000 characters of the dataset

print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [78]:
# Get all unique characters, sort them, and compute the vocabulary size

chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [79]:
"""
    Initialization of String-to-Integer and Integer-to-String Dictionaries: 
    Initializing dictionaries to map each character to its corresponding index 
    and each index to its corresponding character. 

    Definition of encode and decode functions: 
    Defining functions to convert the text and list of integers into their 
    respective indexed values.
"""

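stoi = {ch:i for i, ch in enumerate(chars)}  # character -> integer index
itos = {i:ch for i, ch in enumerate(chars)}  # integer index -> character
encode = lambda s: [stoi[c] for c in s]  # string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # list of integers -> string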

print(encode("hi there!"))
print(decode(encode("hi there!")))

[46, 47, 1, 58, 46, 43, 56, 43, 2]
hi there!


In [80]:
#Encoding of the whole dataset and conversion to tensor object

import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [81]:
# Splitting the dataset into train (90%) and validation (10%) splits.
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
print(n)

1003854


### Discussions on key code blocks of the architecture

In [82]:
"""
    The block size is the maximum context length, i.e. the number of tokens 
    the model sees at once. We slice block_size + 1 tokens because each target 
    is the token (character) that follows its context, so a chunk of 9 tokens 
    yields 8 (context, target) training examples.
"""
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [83]:
"""
    Karpathy demonstrated and explained how the prediction works: for every 
    position t in the block, the context is x[:t+1] and the target y[t] is the 
    character that immediately follows it (the input shifted by one).
"""

x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
    

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [84]:
"""
    Karpathy demonstrated and explained how the dataset is prepared for 
    training: each sequence has a maximum of 8 tokens (block_size = 8), and 
    each batch stacks 4 such sequences (batch_size = 4). The rows of a batch 
    are processed in parallel and independently of each other, as we've also 
    discussed in class.
"""

torch.manual_seed(1337)
batch_size = 4
block_size = 8

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): #batch dimension
    for t in range(block_size): #time dimension
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53, 56, 1, 58, 

In [85]:
print(xb)

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


In [88]:
"""
    This is the initial version of the BigramLanguageModel, where Karpathy 
    explained the shapes of B (batch_size), T (sequence length), and C 
    (channels/vocab_size). We discussed how logits-softmax works, which is the 
    same explanation we had in class. We also calculated the loss using 
    cross_entropy when the target is defined, as it measures the dissimilarity 
    between the predicted probability distribution over classes (our vocabulary) 
    and the true distribution (target token).
"""

import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        """
        Forward pass of the BigramLanguageModel.
        
        Args:
            idx (torch.Tensor): Input tensor of shape (B, T) representing the indices of tokens.
            targets (torch.Tensor): Target tensor of shape (B, T) representing the indices of target tokens.
        
        Returns:
            logits (torch.Tensor): Logits tensor of shape (B, T, C) representing the predicted scores for each token.
            loss (torch.Tensor or None): Loss tensor if targets are provided, else None.
        """
        logits = self.token_embedding_table(idx)  # (B, T, C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        """
        Generate new tokens based on the input sequence.

        Args:
            idx (torch.Tensor): Input tensor of shape (B, T) representing the indices of tokens.
            max_new_tokens (int): Maximum number of new tokens to generate.

        Returns:
            idx (torch.Tensor): Generated tensor of shape (B, T+max_new_tokens) representing the updated sequence.
        """
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx


m = BigramLanguageModel(vocab_size)
print(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)


# Start from a (1, 1) tensor of zeros (index 0 is the newline character) to kick off the generation of 100 new tokens
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

65
65
torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
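
A quick sanity check on the initial loss above: a model that spreads its probability evenly over all 65 characters would have a cross-entropy of ln(65). The short calculation below is my own addition, using only the vocabulary size printed earlier.

```python
import math

vocab_size = 65  # from the pre-processing section
# Cross-entropy of a uniform prediction: -log(1 / vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 4))  # 4.1744
```

The measured 4.8786 is higher than this baseline, which suggests that the randomly initialized embedding table is not producing a perfectly uniform distribution yet.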


In [87]:
# AdamW, the same optimizer Prof. Basti prefers; per Karpathy, it just works well
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [35]:
"""
    Training loop demonstrating that the model is learning: the loss 
    decreases as training proceeds.
"""
batch_size = 32
for steps in range(10000):
    xb, yb = get_batch('train')
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
print(loss.item())
    

2.464580535888672


In [43]:
# Quality is still pretty bad but better than the initial output
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


CHell dlenjut, t hin t 's ve het
I f bes.

MBar, wiathoffopooue pe best t e thtoufounive.
LOFaishe thy s rigaty, geanuk-
Whandestharyo wissad; aicouit, ply!
Hew k s teawithiwind amatos--wowofepan cor nd CY heng Whonor as wingEx?
MENENUCUKBo monmy ckichethtiXectoury'ind e s,
Myo

Ageayothe d faloury t oram. co dathintha vessorike silofeigar tongat mu in mow norra the ow athe aketifest; t w brothenete ousald tle ppl bere ovr?-t,

Thelle if
Whe! bu t tour bly: w ithin INTingaspounst d p hin, st h b


### This section demonstrated how masking works for transformers

In [45]:
torch.manual_seed(1337)
B,T,C = 4, 8, 2 # batch, time, channels
x = torch.randn(B, T, C) # generating a random tensor as inputs
print(x.shape)

torch.Size([4, 8, 2])


In [44]:
xbow = torch.zeros((B, T, C))
# Calculating the mean of the current and all previous tokens for each position in the sequence
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t+1, C)
        xbow[b, t] = torch.mean(xprev, 0)


In [46]:
"""
    Demonstrates how torch.tril returns the lower-triangular part of a matrix, 
    and one of the most important tricks in transformers: a single matrix 
    multiplication reproduces the running mean computed by the previous loop, 
    only much faster. Comparing xbow and xbow2 with torch.allclose returns 
    True, confirming that both give the same result.
"""
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) @ (B, T, C): wei is broadcast to (B, T, T) -----> (B, T, C)
print(torch.allclose(xbow, xbow2))
print(xbow[0], xbow2[0])

True
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]]) tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


In [47]:
"""
    In this code block, Karpathy discussed why we set -inf for all the 0 values 
    in the upper triangle portion. This is the masking process, which aims to
    prevent attention scores from being assigned to those positions and exclude 
    them from subsequent softmax normalization.
"""

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x

print(torch.allclose(xbow, xbow3))
print(xbow[0], xbow3[0])

True
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]]) tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
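
The reason `-inf` works as a mask is that `exp(-inf)` is exactly 0, so masked positions receive zero weight and the remaining weights still sum to 1. A minimal plain-Python sketch of one row of the matrix above (the helper name `masked_softmax` is my own, and it assumes at least one visible position):

```python
import math

def masked_softmax(scores, n_visible):
    """Softmax over scores where positions >= n_visible are masked with -inf."""
    masked = [s if i < n_visible else float("-inf") for i, s in enumerate(scores)]
    m = max(masked)  # finite as long as n_visible >= 1
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Second row of a 4x4 causal mask with all-zero scores: only positions 0 and 1 are visible
row = masked_softmax([0.0, 0.0, 0.0, 0.0], n_visible=2)
print(row)  # [0.5, 0.5, 0.0, 0.0]
```

With all-zero scores the visible positions share the weight equally, which is why the result matches the running mean; once the scores come from queries and keys, the weights become data-dependent.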


In [50]:
"""
   This is the most important discussion in the video, where he discussed how 
   self-attention works and explained step by step the key, query, and value 
   matrices, as well as why they are linearly transformed.

   He dissected the query-key multiplication, explaining how it scores the 
   relevance of every position to every other position within a sequence and 
   how masking prevents information from leaking in from future positions. 
   After softmax normalization, the resulting attention weights multiply the 
   value matrix to obtain the contextual information.
"""


torch.manual_seed(1337)
B, T, C = 4, 8, 32  # Defining the values for batch size, time dimension, and channels
x = torch.randn(B, T, C)
print(x.shape)

# Head for self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)  # (B, T, 16)
q = query(x)  # (B, T, 16)

wei = q @ k.transpose(-2, -1)  # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

# Softmax
tril = torch.tril(torch.ones(T, T))
# wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
# out = wei @ x

v = value(x)
out = wei @ v

print(out.shape)
print(wei[0])


torch.Size([4, 8, 32])
torch.Size([4, 8, 16])
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)


In [51]:
"""
    The code demonstrates scaling the attention scores by 1/sqrt(head_size) 
    (head_size**-0.5) to control their magnitude. Without the scaling, the 
    variance of q @ k^T grows with head_size, and softmax over large scores 
    saturates toward a one-hot vector, which destabilizes learning. With it, 
    the scores keep roughly unit variance when q and k have unit variance. 
    The factor is applied to the query-key dot products before softmax.
"""  


k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

wei = q @ k.transpose(-2, -1) * head_size**-0.5 

In [52]:
k.var()

tensor(1.0449)

In [53]:
wei.var()

tensor(1.0918)

In [54]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [55]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1)

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
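
The variance argument behind the `head_size**-0.5` factor can be checked with a small simulation in plain Python. This is my own sketch, assuming q and k have independent, unit-variance Gaussian entries as in the cell above; the sample count and seed are arbitrary.

```python
import random

def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
head_size = 16
raw, scaled = [], []
for _ in range(5000):
    q = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
    k = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
    dot = sum(a * b for a, b in zip(q, k))
    raw.append(dot)                         # variance grows with head_size
    scaled.append(dot * head_size ** -0.5)  # variance stays near 1

print(variance(raw), variance(scaled))
```

The unscaled dot products have a variance close to `head_size`, while the scaled ones stay close to 1, which keeps softmax away from the saturated regime shown in the last cell above.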

In [56]:
"""
    Karpathy shared the code block to demonstrate how the LayerNorm code works. 
    He simplified the original computation to this version and explained why 
    it is preferred over batch normalization, as discussed in the earlier 
    section of this notebook.
"""

class LayerNorm1d: # (used to be BatchNorm1d)
    """
    Custom implementation of Layer Normalization for 1D input.
    """

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        """
        Forward pass of LayerNorm1d.
        """
        xmean = x.mean(1, keepdim=True)  # Per-example mean over the feature dimension
        xvar = x.var(1, keepdim=True)  # Per-example variance over the feature dimension
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # Normalize each row to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        """
        Return the parameters of the LayerNorm1d module.
        """
        return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100)  # Batch size 32 of 100-dimensional vectors
x = module(x)
print(x.shape)
x


torch.Size([32, 100])


tensor([[ 0.1335, -0.1059, -0.3824,  ..., -1.3422, -0.1971,  0.8795],
        [-0.0353, -0.7439, -0.3371,  ..., -0.6276, -0.4846,  0.4556],
        [ 0.3069, -1.5010,  1.4898,  ..., -0.6819,  0.9993,  0.8382],
        ...,
        [-1.6080, -1.6324, -0.7634,  ..., -0.9847,  0.0039, -0.8610],
        [-0.2273,  0.0066, -0.2763,  ..., -0.8705, -1.2442, -0.7531],
        [ 0.3054, -0.1505, -0.3809,  ..., -1.4962, -0.7711, -1.0681]])
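
One way to see that the class above normalizes each row (each example) rather than each column (each feature across the batch) is to inspect the statistics along both axes. The sketch below re-implements the same normalization as a standalone function, here called `layer_norm_rows` (a hypothetical name), and compares it with `torch.nn.functional.layer_norm`. One assumption to flag: the class above calls `x.var(...)` with PyTorch's default unbiased estimator, while the built-in uses the biased one, so the comparison passes `unbiased=False`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
x = torch.randn(32, 100)  # batch of 32 vectors with 100 features each
eps = 1e-5

def layer_norm_rows(x, eps=1e-5, unbiased=True):
    """Normalize every row to zero mean and unit variance (gamma=1, beta=0)."""
    xmean = x.mean(1, keepdim=True)
    xvar = x.var(1, keepdim=True, unbiased=unbiased)
    return (x - xmean) / torch.sqrt(xvar + eps)

out = layer_norm_rows(x)

# Per-row statistics are normalized ...
row_mean_max = out.mean(1).abs().max().item()
row_std_err = (out.std(1) - 1).abs().max().item()
# ... while per-column statistics are not
col_mean_max = out.mean(0).abs().max().item()

# With the biased variance, the result matches PyTorch's built-in layer norm
matches_builtin = torch.allclose(
    layer_norm_rows(x, unbiased=False), F.layer_norm(x, (100,), eps=eps), atol=1e-5
)
print(row_mean_max, row_std_err, col_mean_max, matches_builtin)
```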

## Full Model From Video

In [60]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# Created it as a function to easily change parameters
def nano_GPT (batch_size=16, block_size=32, max_iters=5000, eval_interval=100,
              learning_rate=1e-3, eval_iters=200, n_embd=64, n_head=4,
              n_layer=4, dropout=0.0):
    
    # Defaults mirror the original script:
    #   batch_size: how many independent sequences are processed in parallel
    #   block_size: maximum context length for predictions
    # `device` is not a parameter of this function, so it is set here.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # ------------

    torch.manual_seed(1337)

    # wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
    with open('input.txt', 'r', encoding='utf-8') as f:
        text = f.read()

    # here are all the unique characters that occur in this text
    chars = sorted(list(set(text)))
    vocab_size = len(chars)
    # create a mapping from characters to integers
    stoi = { ch:i for i,ch in enumerate(chars) }
    itos = { i:ch for i,ch in enumerate(chars) }
    encode = lambda s: [stoi[c] for c in s] 
    decode = lambda l: ''.join([itos[i] for i in l]) 

    # Train and test splits
    data = torch.tensor(encode(text), dtype=torch.long)
    n = int(0.9*len(data)) # first 90% will be train, rest val
    train_data = data[:n]
    val_data = data[n:]

    # data loading
    def get_batch(split):
        # generate a small batch of data of inputs x and targets y
        data = train_data if split == 'train' else val_data
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([data[i:i+block_size] for i in ix])
        y = torch.stack([data[i+1:i+block_size+1] for i in ix])
        x, y = x.to(device), y.to(device)
        return x, y

    @torch.no_grad()
    def estimate_loss():
        out = {}
        model.eval()
        for split in ['train', 'val']:
            losses = torch.zeros(eval_iters)
            for k in range(eval_iters):
                X, Y = get_batch(split)
                logits, loss = model(X, Y)
                losses[k] = loss.item()
            out[split] = losses.mean()
        model.train()
        return out

    class Head(nn.Module):
        """ one head of self-attention """

        def __init__(self, head_size):
            super().__init__()
            self.key = nn.Linear(n_embd, head_size, bias=False)
            self.query = nn.Linear(n_embd, head_size, bias=False)
            self.value = nn.Linear(n_embd, head_size, bias=False)
            self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            B,T,C = x.shape
            k = self.key(x)   # (B,T,hs), where hs = head_size
            q = self.query(x) # (B,T,hs)
            # compute attention scores ("affinities")
            # note: this scales by C (n_embd); scaled dot-product attention as defined
            # in the paper divides by the square root of the key size (head_size)
            wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
            wei = F.softmax(wei, dim=-1) # (B, T, T)
            wei = self.dropout(wei)
            # perform the weighted aggregation of the values
            v = self.value(x) # (B,T,hs)
            out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
            return out

    class MultiHeadAttention(nn.Module):
        """ multiple heads of self-attention in parallel """

        def __init__(self, num_heads, head_size):
            super().__init__()
            self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
            self.proj = nn.Linear(n_embd, n_embd)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)
            out = self.dropout(self.proj(out))
            return out

    class FeedFoward(nn.Module):

        def __init__(self, n_embd):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd),
                nn.ReLU(),
                nn.Linear(4 * n_embd, n_embd),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            return self.net(x)

    class Block(nn.Module):
        """ Transformer block: communication followed by computation """

        def __init__(self, n_embd, n_head):
            super().__init__()
            head_size = n_embd // n_head
            self.sa = MultiHeadAttention(n_head, head_size)
            self.ffwd = FeedFoward(n_embd)
            self.ln1 = nn.LayerNorm(n_embd)
            self.ln2 = nn.LayerNorm(n_embd)

        def forward(self, x):
            x = x + self.sa(self.ln1(x))
            x = x + self.ffwd(self.ln2(x))
            return x

    # transformer language model (the class name is kept from the earlier bigram version)
    class BigramLanguageModel(nn.Module):

        def __init__(self):
            super().__init__()
            self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
            self.position_embedding_table = nn.Embedding(block_size, n_embd)
            self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
            self.ln_f = nn.LayerNorm(n_embd) # final layer norm
            self.lm_head = nn.Linear(n_embd, vocab_size)

        def forward(self, idx, targets=None):
            B, T = idx.shape
            tok_emb = self.token_embedding_table(idx) # (B,T,C)
            pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
            x = tok_emb + pos_emb # (B,T,C)
            x = self.blocks(x) # (B,T,C)
            x = self.ln_f(x) # (B,T,C)
            logits = self.lm_head(x) # (B,T,vocab_size)

            if targets is None:
                loss = None
            else:
                B, T, C = logits.shape
                logits = logits.view(B*T, C)
                targets = targets.view(B*T)
                loss = F.cross_entropy(logits, targets)

            return logits, loss

        def generate(self, idx, max_new_tokens):
            # idx is (B, T) array of indices in the current context
            for _ in range(max_new_tokens):
                # crop idx to the last block_size tokens
                idx_cond = idx[:, -block_size:]
                # get the predictions
                logits, loss = self(idx_cond)
                # focus only on the last time step
                logits = logits[:, -1, :] # becomes (B, C)
                # apply softmax to get probabilities
                probs = F.softmax(logits, dim=-1) # (B, C)
                # sample from the distribution
                idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
                # append sampled index to the running sequence
                idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            return idx

    model = BigramLanguageModel()
    m = model.to(device)
    # print the number of parameters in the model
    print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

    # create a PyTorch optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    for iter in range(max_iters):

        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0 or iter == max_iters - 1:
            losses = estimate_loss()
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        # sample a batch of data
        xb, yb = get_batch('train')

        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    # generate from the model
    context = torch.zeros((1, 1), dtype=torch.long, device=device)
    print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


### Tuning the Model's Hyperparameters

The first run uses the default parameters from the repository and serves as the baseline.

In [61]:
nano_GPT()

0.209729 M parameters
step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5090, val loss 2.5058
step 300: train loss 2.4198, val loss 2.4337
step 400: train loss 2.3499, val loss 2.3561
step 500: train loss 2.2963, val loss 2.3127
step 600: train loss 2.2408, val loss 2.2499
step 700: train loss 2.2057, val loss 2.2191
step 800: train loss 2.1636, val loss 2.1869
step 900: train loss 2.1241, val loss 2.1507
step 1000: train loss 2.1025, val loss 2.1294
step 1100: train loss 2.0696, val loss 2.1187
step 1200: train loss 2.0376, val loss 2.0789
step 1300: train loss 2.0242, val loss 2.0641
step 1400: train loss 1.9917, val loss 2.0361
step 1500: train loss 1.9703, val loss 2.0313
step 1600: train loss 1.9626, val loss 2.0489
step 1700: train loss 1.9414, val loss 2.0140
step 1800: train loss 1.9078, val loss 1.9946
step 1900: train loss 1.9081, val loss 1.9885
step 2000: train loss 1.8840, val loss 1.9970
step 2100: train loss 1.

In this iteration, the goal was to increase the number of iterations from `5000` to `10000` and raise the learning rate from `1e-3` to `4e-3` (`eval_iters` was also raised from `200` to `400`). The loss improves by around `0.1`, which is a good reduction at the expense of a longer run time (`6 min vs 16 min`).

As explained by Karpathy in the video, increasing the number of iterations helps the model learn better because the parameters are updated over more training steps. This concept was also discussed in class: longer training gives the optimizer more opportunity to approach a minimum of the loss.

The learning rate controls the step size of each update. A higher rate, such as the `4e-3` used here, makes faster progress per step but risks unstable or noisy training if it is set too high. A lower rate takes smaller, more stable steps and can settle more precisely, but it needs more iterations to reach the same loss.

However, it is important to be cautious about potential overfitting, where the model becomes too specialized to the training data and performs poorly on unseen data, as we learned in class. In the logs, this shows up as a widening gap between training and validation loss. Finding the right balance between convergence speed and generalization performance is crucial.
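
A simple way to keep an eye on overfitting is to track the gap between validation and training loss over time. The sketch below parses a few log lines copied from the baseline run above. The helper name `parse_losses` and the regular expression are illustrative assumptions about the log format printed by `nano_GPT`.

```python
import re

# A few lines copied from the baseline run's output
log = """\
step 0: train loss 4.4116, val loss 4.4022
step 500: train loss 2.2963, val loss 2.3127
step 1000: train loss 2.1025, val loss 2.1294
step 2000: train loss 1.8840, val loss 1.9970
"""

LINE = re.compile(r"step (\d+): train loss ([\d.]+), val loss ([\d.]+)")

def parse_losses(text):
    """Return a list of (step, train_loss, val_loss) tuples found in the log text."""
    return [(int(s), float(tr), float(va)) for s, tr, va in LINE.findall(text)]

records = parse_losses(log)
for step, train, val in records:
    # A gap that keeps widening while the val loss stalls is a sign of overfitting
    print(f"step {step:>5}: val - train = {val - train:+.4f}")
```

In these lines the gap grows from roughly zero at step 0 to about `0.11` at step 2000, which is still modest for a model of this size.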



In [63]:
nano_GPT(batch_size=16, block_size=32, max_iters=10000, eval_interval=100,
              learning_rate=4e-3, eval_iters=400, n_embd=64, n_head=4,
              n_layer=4, dropout=0.0)

0.209729 M parameters
step 0: train loss 4.4107, val loss 4.4039
step 100: train loss 2.5200, val loss 2.5261
step 200: train loss 2.3499, val loss 2.3556
step 300: train loss 2.2482, val loss 2.2696
step 400: train loss 2.1646, val loss 2.1872
step 500: train loss 2.1079, val loss 2.1475
step 600: train loss 2.0617, val loss 2.1174
step 700: train loss 2.0114, val loss 2.0767
step 800: train loss 1.9776, val loss 2.0631
step 900: train loss 1.9415, val loss 2.0413
step 1000: train loss 1.9047, val loss 2.0121
step 1100: train loss 1.8918, val loss 2.0046
step 1200: train loss 1.8679, val loss 1.9788
step 1300: train loss 1.8586, val loss 1.9532
step 1400: train loss 1.8373, val loss 1.9709
step 1500: train loss 1.8196, val loss 1.9483
step 1600: train loss 1.8034, val loss 1.9414
step 1700: train loss 1.7848, val loss 1.9299
step 1800: train loss 1.7660, val loss 1.9145
step 1900: train loss 1.7662, val loss 1.9107
step 2000: train loss 1.7529, val loss 1.9101
step 2100: train loss 1.

In this scenario, I doubled the `batch_size` (`16` to `32`) and quadrupled the `block_size` (`32` to `128`) to assess whether there would be any improvement. I also added dropout (`0.2`) to regularize the model and evaluate its impact on the loss. The call below additionally changes `n_head` from `4` to `8` and `learning_rate` from `4e-3` to `5e-3`. As anticipated, there was a noticeable improvement of approximately `0.08x` compared to the previous iteration, which is a significant reduction with a minimal increase in run time (`16 min vs 17 min`) and a minimal increase in parameters (`6k+`).
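
The `6k+` figure can be reproduced by hand: the only parameter tensor whose size depends on `block_size` is the position embedding table, which grows by `(128 - 32) * 64 = 6144` entries. The sketch below counts the parameters of the model defined above, layer by layer. It assumes `vocab_size = 65`, which is not printed in this section but is consistent with the parameter counts reported by the runs in this notebook. The function name `count_params` is hypothetical.

```python
def count_params(vocab_size=65, block_size=32, n_embd=64, n_head=4, n_layer=4):
    """Count trainable parameters of the model defined in nano_GPT."""
    head_size = n_embd // n_head
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    # key, query and value projections per head, all without bias
    heads = n_head * 3 * n_embd * head_size
    proj = n_embd * n_embd + n_embd
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)  # ln1 and ln2, each with weight and bias
    block = heads + proj + ffwd + layer_norms
    ln_f = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size
    return tok_emb + pos_emb + n_layer * block + ln_f + lm_head

baseline = count_params()
larger_context = count_params(block_size=128, n_head=8)
print(baseline)                   # 209729
print(larger_context)             # 215873
print(larger_context - baseline)  # 6144
```

Changing `n_head` alone does not change the count, because `head_size = n_embd // n_head` shrinks as the number of heads grows.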

Increasing the `block_size`, which represents the number of tokens in a sequence, allows the transformer to capture longer dependencies and contextual information. This is particularly advantageous for tasks that require understanding larger contexts, such as document-level language modeling or generating long-range sequences.

Increasing the `batch_size`, which refers to the number of sequences processed in parallel during training, can result in more efficient computation and better utilization of hardware resources. It enables parallelization across multiple examples, leading to faster training and improved overall throughput. Moreover, larger batch sizes can provide a more stable gradient estimate, leading to more consistent updates and potentially better convergence.

Adding dropout regularization helps prevent overfitting and can enhance the model's generalization capability. Dropout randomly sets a portion of the model's activations to zero during training, forcing the model to learn redundant representations and making it more robust to noise in the input data. This regularization technique prevents the model from relying too heavily on specific features or patterns in the training data, resulting in better performance on unseen data.
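
A minimal sketch of what `nn.Dropout` does, using the `p=0.2` rate from the runs in this notebook. In training mode roughly 20% of the activations are zeroed and the survivors are scaled by `1 / (1 - p) = 1.25`; in evaluation mode the layer passes its input through unchanged. The all-ones input is an illustrative assumption that makes the scaling easy to see.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(10_000)  # illustrative input: all ones

# Training mode: each element is zeroed with probability p, and the
# survivors are scaled by 1 / (1 - p) to keep the expected value unchanged
drop.train()
y_train = drop(x)
zero_fraction = (y_train == 0).float().mean().item()
survivor_value = y_train[y_train != 0].unique()

# Evaluation mode: dropout is a no-op
drop.eval()
y_eval = drop(x)

print(f"fraction zeroed: {zero_fraction:.3f}")
print(f"survivor value:  {survivor_value.tolist()}")
print(f"eval unchanged:  {torch.equal(y_eval, x)}")
```

This is also why `estimate_loss` switches the model with `model.eval()` before measuring the loss and back with `model.train()` afterwards.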

It's important to note that the effectiveness of increasing `block_size`, `batch_size`, and adding dropout may vary depending on the specific task, dataset, and model architecture. 

In [64]:
nano_GPT(batch_size=32, block_size=128, max_iters=10000, eval_interval=200,
              learning_rate=5e-3, eval_iters=400, n_embd=64, n_head=8,
              n_layer=4, dropout=0.2)

0.215873 M parameters
step 0: train loss 4.3519, val loss 4.3466
step 200: train loss 2.3933, val loss 2.4105
step 400: train loss 2.1842, val loss 2.2179
step 600: train loss 2.0131, val loss 2.0737
step 800: train loss 1.8813, val loss 1.9898
step 1000: train loss 1.8016, val loss 1.9317
step 1200: train loss 1.7451, val loss 1.8870
step 1400: train loss 1.7105, val loss 1.8568
step 1600: train loss 1.6827, val loss 1.8328
step 1800: train loss 1.6562, val loss 1.8025
step 2000: train loss 1.6406, val loss 1.7984
step 2200: train loss 1.6214, val loss 1.7863
step 2400: train loss 1.6038, val loss 1.7732
step 2600: train loss 1.5942, val loss 1.7706
step 2800: train loss 1.5819, val loss 1.7456
step 3000: train loss 1.5742, val loss 1.7380
step 3200: train loss 1.5607, val loss 1.7334
step 3400: train loss 1.5616, val loss 1.7456
step 3600: train loss 1.5532, val loss 1.7299
step 3800: train loss 1.5414, val loss 1.7226
step 4000: train loss 1.5347, val loss 1.7140
step 4200: train lo

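In this scenario, I increased the number of attention heads (`n_head`) to `16`, `block_size` to `256`, and `n_layer` to `8`. The call below also doubles `batch_size` to `64` and `max_iters` to `20000`. This run shows the biggest improvement, at the expense of a significant increase in runtime.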

Increasing the number of layers and `multihead` attention in a transformer model provides several benefits. Firstly, it enhances the model's capacity to capture complex patterns and dependencies in the data. Each additional layer introduces more non-linear transformations, enabling the model to learn more intricate representations and make more sophisticated predictions. This increased capacity is particularly advantageous for tasks that involve intricate relationships and require the model to capture both local and global dependencies. The hierarchical feature extraction is improved as well, with lower layers capturing low-level features and higher layers capturing more abstract and contextual information. This hierarchical representation can be valuable for tasks that require a deep understanding of the input sequence.

Secondly, the inclusion of `multihead` attention in the transformer architecture brings its own advantages. By utilizing `multiple attention heads`, the model can attend to different parts of the input sequence simultaneously. This allows for the capture of diverse and complementary information, leading to more effective attention mechanisms. `Multihead` attention enables the model to effectively capture long-range dependencies and attend to relevant parts of the input. This is particularly beneficial for tasks that involve understanding complex relationships and dependencies between tokens.

Increasing the `number of layers` and `multihead` attention in the model introduces higher computational and memory demands, alongside the other adjusted parameters. Consequently, it is crucial to assess the impact on runtime and potentially quantify the trade-off between runtime and the achieved loss reduction. Additionally, qualitative evaluation of the model's output can provide insights into its performance.
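
One detail worth keeping in mind is that, in the model above, `head_size = n_embd // n_head`. With `n_embd` fixed at `64`, adding heads makes each head narrower rather than adding width. The sketch below uses random tensors as stand-ins for the per-head outputs (an assumption for illustration; no attention is computed) to show that concatenating the heads always restores `n_embd` channels.

```python
import torch

torch.manual_seed(0)
B, T, n_embd = 2, 8, 64  # small illustrative batch and sequence length

for n_head in (4, 8, 16):
    head_size = n_embd // n_head
    # Random stand-ins for the per-head outputs, each of shape (B, T, head_size)
    head_outputs = [torch.randn(B, T, head_size) for _ in range(n_head)]
    # Concatenating along the channel dimension restores (B, T, n_embd)
    out = torch.cat(head_outputs, dim=-1)
    print(f"n_head={n_head:>2}  head_size={head_size:>2}  concat shape={tuple(out.shape)}")
```

With `n_head=16`, each head works with only 4 dimensions, so the gain from more heads has to be weighed against the smaller capacity of each one.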

In [None]:
nano_GPT(batch_size=64, block_size=256, max_iters=20000, eval_interval=1000,
              learning_rate=5e-3, eval_iters=2000, n_embd=64, n_head=16,
              n_layer=8, dropout=0.2)

0.423233 M parameters
step 0: train loss 4.3897, val loss 4.3945
step 1000: train loss 1.6063, val loss 1.7758
step 2000: train loss 1.4652, val loss 1.6617
step 3000: train loss 1.4111, val loss 1.6239
step 4000: train loss 1.3742, val loss 1.5842
step 5000: train loss 1.3539, val loss 1.5819
step 6000: train loss 1.3335, val loss 1.5608
step 7000: train loss 1.3223, val loss 1.5601
step 8000: train loss 1.3104, val loss 1.5467
step 9000: train loss 1.3007, val loss 1.5358
step 10000: train loss 1.2960, val loss 1.5401
step 11000: train loss 1.2856, val loss 1.5273
step 12000: train loss 1.2778, val loss 1.5225
step 13000: train loss 1.2776, val loss 1.5268
step 14000: train loss 1.2724, val loss 1.5272
step 15000: train loss 1.2665, val loss 1.5112
step 16000: train loss 1.2615, val loss 1.5077


## Conclusion

To summarize my approach to hyperparameter tuning, I focused on increasing the model's capacity (more layers and heads, a longer context) and training for longer, while raising the learning rate from `1e-3` to `5e-3`. By doing so, the model is able to learn more effectively and identify complex patterns within our dataset. This straightforward approach aimed to demonstrate potential improvements, but it is important to acknowledge the limitations of our search space and consider the need for additional datasets and further iterations to validate the findings.

I chose not to increase the embedding size (`n_embd`) since we are specifically modeling character sequences. Given the length and combination of characters within each line, the impact of a larger embedding would likely be minimal.

It is worth noting that even though we may achieve a lower loss, the quality of the generated output may not match the original writing. This is because we are using individual characters as tokens, rather than words. As a result, some of the generated words may appear obscure or incomprehensible. It is also important to recognize that this is a decoder-only model trained on a single corpus: if we were to input a different dataset unrelated to Shakespeare, its performance would likely be poor, as it relies on the unique writing techniques found within the Shakespearean text.
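
To make the character-level point concrete, the sketch below tokenizes a short sample line both ways. The sample text and the naive whitespace split are illustrative assumptions; the `encode` and `decode` helpers mirror the ones defined inside `nano_GPT`.

```python
sample = "Before we proceed any further, hear me speak."  # illustrative sample text

# Character-level vocabulary, built the same way as in nano_GPT
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map a string to a list of integer character ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer character ids back to a string."""
    return ''.join(itos[i] for i in ids)

char_tokens = encode(sample)
word_tokens = sample.split()  # naive whitespace split, for comparison only

print(len(char_tokens), "character tokens")
print(len(word_tokens), "word tokens")
print(decode(char_tokens) == sample)
```

The model has to get every character of a word right in sequence, which is why a low loss can still produce misspelled or invented words.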

In terms of the video by Karpathy, his explanations were clear and insightful, providing a solid understanding of the fundamentals. From a coding perspective, I feel confident in building and utilizing this architecture and implementing it on my own. However, the challenge lies in finding an interesting and meaningful use case to apply it to.

## Generative AI Documentation

This is how I used ChatGPT to assist me in the assignment:

- First, I wrote down my understanding of each of the following topics from Karpathy's video, then asked ChatGPT to validate it, fix the grammar, and point out any mistakes in my explanations. The context was their usage within neural networks and transformers.
    - Language Model
    - Attention
    - Self-Attention
    - Cross-Attention
    - Multi-head Attention
    - Transformer
    - Residual Connections
    - Layer Normalization
    - Dropout
- It assisted in structuring the comparison and contrast between the three attention types.
- It assisted in interpreting the code blocks, which let me check ChatGPT's responses against Karpathy's explanations. I did not find anything wrong with the responses.
- Most of the time, I asked for both a layman and a technical explanation from a data scientist's point of view, to make it easier for me to explain the material to someone else.
- It helped me add comments to the code.
- It helped me fix the grammar and spelling of all the sentences.




## References 

1. Vaswani, A., et al. (2017, June 12). Attention Is All You Need. arXiv.org. https://arxiv.org/abs/1706.03762
2. Analytics Vidhya. (2023). A Comprehensive Guide to Attention Mechanism in Deep Learning for Everyone. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/
3. What exactly are keys, queries, and values in attention mechanisms? (n.d.). Cross Validated. https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms
4. Ismali, F. (n.d.). GPT-4 explaining Self-Attention Mechanism. www.linkedin.com. https://www.linkedin.com/pulse/gpt-4-explaining-self-attention-mechanism-fatos-ismali/