STAGE 1
# DATA PREPARATION

DOWNLOAD DATASET
DATA PREPROCESSING
CREATE DATALOADERS


STAGE 2
# MODEL SETUP
INITIALIZE THE MODEL
LOAD PRETRAINED MODEL WEIGHTS OF GPT 2
MODIFY THE FINAL OUTPUT LAYERS OF MODEL FOR FINETUNING
IMPLEMENT THE EVALUATION UTILITIES


STAGE 3
# MODEL FINETUNING AND USAGE
FINETUNE MODEL
EVALUATE THE MODEL >ACCURACY , LOSS .
TEST THE MODEL ON NEW DATA

# stage 1 begning data preprocessing

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv("./original_dataset/SMSSpamCollection.csv",sep= ",", header= None, names=["Label", "Text"])
df.head()

Unnamed: 0,Label,Text
0,label,text
1,ham,"Go until jurong point, crazy.. Available only ..."
2,ham,Ok lar... Joking wif u oni...
3,spam,Free entry in 2 a wkly comp to win FA Cup fina...
4,ham,U dun say so early hor... U c already then say...


In [5]:
print("The description of the dataset \n",df.describe())
print(" -------------------------------")
print("The shape of the dataset",df.shape)
print(" -------------------------------")
print("The value counts of the dataset\n",df["Label"].value_counts())


The description of the dataset 
        Label                    Text
count   5573                    5573
unique     3                    5170
top      ham  Sorry, I'll call later
freq    4825                      30
 -------------------------------
The shape of the dataset (5573, 2)
 -------------------------------
The value counts of the dataset
 Label
ham      4825
spam      747
label       1
Name: count, dtype: int64


we need to create a balanced dataset you can see ham has 4825 samples and spam has 747 samples, its obviosly imbalanced dataset , we will create a new balanced dataset form this . We will take 747 sample from each class and create a new dataset.

In [6]:
from pandas.core.common import random_state
def create_balanced_dataset(df):
    random_state= 122
    #count the frequncy of spam and extract random samples of ham from dataset , such that both ham and spam sample count or you can say frquency is same
    num_spam = df[df['Label'] == "spam"].shape[0]
    ham_instances = df[df['Label'] == "ham"].sample(num_spam, random_state=42) #it will extract 747(num_spam) random samples of ham.
    spam_instances = df[df['Label'] == "spam"]
    balanced_df = pd.concat([ham_instances,spam_instances])
    return balanced_df


In [7]:
#main
balanced_df =  create_balanced_dataset(df)
balanced_df['Label'].value_counts()

Label
ham     747
spam    747
Name: count, dtype: int64

# saving the balanced dataset into a seprate csv file 
we will do training and validation , testing and finetuing with the help of this dataset

In [8]:
# balanced_df.to_csv("./balanced_dataset_prepared/balanced_dataset.csv", index = False)

In [9]:
balanced_df['Label'].value_counts()

Label
ham     747
spam    747
Name: count, dtype: int64

# we have balanced dataframe prepared already so we will use it and do  Label encoding on it. so that our model understand the numbers better.
NOTE: we are just making changes on dataframe and not on original balanced dataset

# Label Encoding of Labels['ham','spam']
ham -> 0
spam -> 1

In [10]:
balanced_df['Label'] = balanced_df['Label'].map({'ham':0, 'spam':1})

you can see that we have encoded the labels spam as 1 and ham as 0

In [11]:
print(balanced_df['Label'].value_counts())

Label
0    747
1    747
Name: count, dtype: int64


# We will split the dataset into train , validation and test set for this there are two strategy we can direclty use the trani_test_split from sklearn and cleverly make the train , validation and test set.
# second option we can create custom random split function to create the train , validation and test set.

train 70%,

validation 10%,

test 20%

creating a random split function which will split the dataset into train , validation and test set


In [12]:
# def random_split(df, train_frac , valid_frac):
#     df = df.sample(frac =1, random_state=42, reset_index(drop=True))
#     train_size = int(len(df) * train_frac)
#     valid_size = int(len(df) * valid_frac)
#     train_df = df.iloc[:train_size]
#     valid_df = df.iloc[train_size:train_size + valid_size]
#     test_df = df.iloc[train_size + valid_size:]
#     return train_df, valid_df, test_df


# option 2 from sklearn

In [13]:
from sklearn.model_selection import train_test_split

note train test split can make two splits train and test, we will first create 70% train_df and 30% temp_test_df split, then we will further split on temp_test_df to make 
10% val_df and 20% test_df. In this way we can get splitting in the ratio 70:10:20

In [14]:
training_df, temp_test_df = train_test_split(balanced_df,test_size=0.3, random_state=42) #
val_df, test_df = train_test_split(temp_test_df, test_size=2/3, random_state = 42)


In [15]:
training_df.head()

Unnamed: 0,Label,Text
1301,0,Great to hear you are settling well. So what's...
2023,0,"I don't have anybody's number, I still haven't..."
5521,0,No. I dont want to hear anything
2695,0,All these nice new shirts and the only thing I...
3485,0,"Hello, my love! How goes that day ? I wish you..."


In [16]:
val_df.head()

Unnamed: 0,Label,Text
5028,1,Ur cash-balance is currently 500 pounds - to m...
1127,1,For taking part in our mobile survey yesterday...
5286,1,URGENT! You have won a 1 week FREE membership ...
2575,1,Congrats 2 mobile 3G Videophones R yours. call...
697,0,Good. Good job. I like entrepreneurs


In [17]:
test_df.head()

Unnamed: 0,Label,Text
2934,0,Only 2% students solved this CAT question in '...
3863,1,Free Msg: Ringtone!From: http://tms. widelive....
5428,1,Santa Calling! Would your little ones like a c...
1219,0,"Damn, can you make it tonight or do you want t..."
4827,0,I am going to sleep. I am tired of travel.


# converting these dataframe into csv files so that we can use them for training and testing and validation also

In [18]:
training_df.to_csv("./balanced_dataset_prepared/splits/train.csv", index=None)
test_df.to_csv("./balanced_dataset_prepared/splits/testing.csv", index=None)
val_df.to_csv("./balanced_dataset_prepared/splits/validation.csv", index=None)

In [19]:
print(balanced_df['Label'].value_counts())
print(training_df.shape)
print(val_df.shape)
print(test_df.shape)

Label
0    747
1    747
Name: count, dtype: int64
(1045, 2)
(149, 2)
(300, 2)


# dataset loader for training and validation and testing df
Note: okenized inputs must have the same length because deep learning models and DataLoaders operate on fixed-shape tensors, and padding enables batching, parallel computation, and efficient training without affecting model learning (via masking).


In [20]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
max_token = 0
for i in training_df['Text']:
    token_count = len(tokenizer.encode(i))
    max_token = max(token_count, max_token)
    # print(len(tokenizer.encode(i)) == 137)
    
print(max_token)

137


# as you can see in our case in the trainig dataframe we have email of max_token length 137 , so our input batch size should be 137 tokens in each tokenized input sample for training , validation and test data

# note
### Dataset:

A Dataset defines how individual data samples are loaded and preprocessed, providing one sample at a time.

### DataLoader:

A DataLoader handles batching, shuffling, and parallel loading of data from a Dataset for efficient model training.# """

## Dataset

In [21]:
import torch
from torch.utils.data import Dataset

class SpamDataset(Dataset):

    def __init__(self, csv_file, tokenizer, max_length=None,pad_token_id=50256):
        self.data = pd.read_csv(csv_file)

        self.encoded_texts = [ tokenizer.encode(text) for text in self.data['Text']]
        # self.labels = self.data['label'].values


        # /*****
            #if encoded text is less than max_length then pad it with pad_token_id
            #if encoded text is greater than max_length then truncate it
        # *****/
        if max_length is None:
            self.max_length = self.__longest__encoded_length()
        else:
            self.max_length = max_length

        self.encoded_texts = [
            encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))
            if(len(encoded_text) < self.max_length)else encoded_text[:self.max_length]
            for encoded_text in self.encoded_texts
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        encoded = self.encoded_texts[idx]
        label = self.data.iloc[idx]['Label']
        return (
            torch.tensor(encoded,dtype = torch.long),
            torch.tensor(label,dtype = torch.long)
        )
      
    def __longest__encoded_length(self):
        max_length = 0
        for encoded_text in self.encoded_texts:
            encoded_text_len = len(encoded_text)
            if(encoded_text_len > max_length):
                max_length = encoded_text_len
        return max_length


In [22]:
train_dataset = SpamDataset(csv_file = "balanced_dataset_prepared/splits/train.csv",max_length=None,tokenizer=tokenizer)
print(train_dataset.max_length)

137


In [23]:
val_dataset = SpamDataset(csv_file = "balanced_dataset_prepared/splits/validation.csv",max_length=train_dataset.max_length,tokenizer=tokenizer)
print(val_dataset.max_length)


137


In [24]:

test_dataset = SpamDataset(csv_file = "balanced_dataset_prepared/splits/testing.csv",max_length=train_dataset.max_length,tokenizer=tokenizer)
print(test_dataset.max_length)


137


# DataLoader 

In [25]:
from torch.utils.data import DataLoader

In [26]:
num_workers = 0
batch_size = 8
torch.manual_seed(42)

<torch._C.Generator at 0x27032009610>

# DataLoader shuffles indices, splits them into batches, spawns workers(num_workers) to fetch samples, collates them, and yields batches one by one.
Note Your datasetâ€™s __getitem__ and __len__ are the backbone of this process.


- `encoded_text` is a padded sequence of token IDs (length = `max_length`)
- `label` is a single integer

---

### Example Setup
Letâ€™s say:
- `batch_size = 3`
- `max_length = 6`
- `pad_token_id = 50256`

And your dataset has these three samples:

```text
Sample 1: encoded_text = [12, 45, 78]              â†’ label = 0
Sample 2: encoded_text = [34, 56, 90, 12, 77]      â†’ label = 1
Sample 3: encoded_text = [99, 100, 101, 102, 103]  â†’ label = 0
```

---

### After Padding/Truncation
Each sequence is padded/truncated to length 6:

```text
Sample 1 â†’ [12, 45, 78, 50256, 50256, 50256]
Sample 2 â†’ [34, 56, 90, 12, 77, 50256]
Sample 3 â†’ [99, 100, 101, 102, 103, 50256]
```

---

### Batch Tensor from DataLoader
When `DataLoader` collates them, you get:

```python
X_batch = tensor([
    [   12,    45,    78, 50256, 50256, 50256],
    [   34,    56,    90,    12,    77, 50256],
    [   99,   100,   101,   102,   103, 50256]
], dtype=torch.long)

y_batch = tensor([0, 1, 0], dtype=torch.long)
```

---

### Shapes
- `X_batch.shape = (batch_size, max_length(columns bol skte he))` â†’ `(3, 6)`
- `y_batch.shape = (batch_size,)` â†’ `(3,)`

---

So in practice, **each batch is a 2D tensor of token IDs plus a 1D tensor of labels**.  

ðŸ‘‰When you make drop last = true , it will drop the last samples if they are less than the batch_size mtlb agar wo ek batch milkr nhi banapayenge to hum unhe discard kr denge


In [27]:
train_loader = DataLoader(
    batch_size = batch_size,
    num_workers = num_workers,
    dataset = train_dataset,
    shuffle = True,
    drop_last = True
    )

In [28]:
val_loader = DataLoader(
    dataset = val_dataset,
    shuffle = True,
    drop_last = False,
    batch_size = batch_size,
    num_workers= num_workers
)

In [29]:
test_loader =DataLoader(
    dataset = test_dataset,
    num_workers= num_workers,
    batch_size=  batch_size,
    shuffle = True,
    drop_last = False
)

# chekcing the dimensions of the loaders 
`ensuring that dimensions are consisitent`

`train-loader`

In [30]:
for input_batch, taget_batch in train_loader:
    pass
print("shape of input and target batch : \n",input_batch.shape, taget_batch.shape) #

shape of input and target batch : 
 torch.Size([8, 137]) torch.Size([8])


`test_loader`

#### test dataset has 300 samples,  batch size is 8.
```
`300 \div 8 = 37 full batches, with a remainder of 4 samples (37 *  8 = 296; 300 - 296 = 4$).`

`this loop runs through all 38 batches. When the loop finishes, the variables input_batch and taget_batch hold the values from the very last iteration, which is the "remainder" batch containing the final 4 samples`

`note : becoz we have set drop_last = false ` 
### that's why the we are getting the last batch of 4 samples `
shape of input and target batch : 
input_batchtorch.Size([4, 137])
torch.Size([4]) ->


 here  4 is the batch size-> `mtlb 4 samples honge batch me` and 137 is the max length of the input text (mtlb  137 tokens honge ek input sample me maxmimum)
 ```
 

In [31]:
for input_batch, taget_batch in test_loader:
    pass
print("shape of input and target batch : \n",input_batch.shape, taget_batch.shape) 

shape of input and target batch : 
 torch.Size([4, 137]) torch.Size([4])


`val_loader`

In [32]:
for input_batch, taget_batch in test_loader:
    pass
print(f"shape of input batch {input_batch.shape} and target batch :{ taget_batch.shape} \n") 

shape of input batch torch.Size([4, 137]) and target batch :torch.Size([4]) 



-`length of these loaders shows number of batches in it same for other loader (val, test)`

In [33]:
print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} test batches")

130 training batches
19 validation batches
38 test batches


``This concludes the dataPreparation part of the project `` 

# STAGE 2 - MODEL SET UP
```
1.INTIALIZE THE MODEL
2.LOAD THE PRETRAINED WEIGHT
3.FREEZE THE ALL EXCEPT LAST LAYERS(FINAL OUTPUT HEAD , FINAL TRANSFORMER BLOCK, FINAL, FINAL NORMALIZATION LAYER)
4. MODIFY MODEL FOR FINETUNING
5. IMPLEMENT THE EVALUATION UTILITY
```

## INITIALIZING A MODEL WITH PRETRAINED WEIGHTS

In [34]:
import torch
import torch.nn as nn


In [35]:

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]), ## Expansion
            GELU(), ## Activation
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]), ## Contraction
        )

    def forward(self, x):
        return self.layers(x)


In [36]:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2) 
        
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec


In [37]:

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"], 
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        # 2*4*768
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x


In [38]:

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


In [39]:
CHOOSE_MODEL = "gpt2-small(124M)"
INPUT_PROMPT = "EVERY EFFORT MOVES YOU IN  THE RIGHT DIRECTION"

BASE_CONFIG = {
   "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
     "gpt2-small(124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12}
}

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
print(BASE_CONFIG)

{'vocab_size': 50257, 'context_length': 1024, 'drop_rate': 0.0, 'qkv_bias': True, 'emb_dim': 768, 'n_layers': 12, 'n_heads': 12}


In [40]:
model = GPTModel(BASE_CONFIG)
# print(model)

In [41]:
assert train_dataset.max_length <= BASE_CONFIG["context_length"], (
    f"Dataset length {train_dataset.max_length} exceeds model's context "
    f"length {BASE_CONFIG['context_length']}. Reinitialize data sets with "
    f"`max_length={BASE_CONFIG['context_length']}`"
)

In [54]:
def assign(left, right):
    if(left.shape != right.shape):
        raise(ValueError(f"Shape mismatch : {left.shape} and right {right.shape}"))
    return torch.nn.Parameter(torch.tensor(right))

In [55]:
import numpy as np

def load_weights_into_gpt(gpt, params):
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])
    
    for b in range(len(params["blocks"])):
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T)
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T)
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.bias = assign(
            gpt.trf_blocks[b].att.W_query.bias, q_b)
        gpt.trf_blocks[b].att.W_key.bias = assign(
            gpt.trf_blocks[b].att.W_key.bias, k_b)
        gpt.trf_blocks[b].att.W_value.bias = assign(
            gpt.trf_blocks[b].att.W_value.bias, v_b)

        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight, 
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias, 
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight, 
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias, 
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight, 
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias, 
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale, 
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift, 
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale, 
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift, 
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])



In [None]:
from gpt_download3 import download_and_load_gpt2
settings, params = download_and_load_gpt2(model_size="124M",models_dir="gpt2")

In [45]:
print(settings)

{'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}


In [46]:
print(params.keys())

dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])


| Component | Full Form                  | Role                   |
| --------- | -------------------------- | ---------------------- |
| `wte`     | Weight Token Embeddings    | Token â†’ vector mapping |
| `wpe`     | Weight Position Embeddings | Injects order info     |
| `blocks`  | Transformer blocks         | Core computation       |
| `g`       | Gain (Î³)                   | LayerNorm scaling      |
| `b`       | Bias (Î²)                   | LayerNorm shifting     |


b and g â€” Bias and Gain (LayerNorm parameters)

These usually appear inside LayerNorms.
Where they appear

ln_1 (before attention)

ln_2 (before MLP)

Final layer norm (ln_f)

Formula of LayerNorm
LN(x)=gâ‹…(xâˆ’Î¼)/Ïƒâ€‹+b

| Symbol  | Name     | Shape       | Role                     |
| ------- | -------- | ----------- | ------------------------ |
| **`g`** | Gain (Î³) | `[emb_dim]` | Scales normalized values |
| **`b`** | Bias (Î²) | `[emb_dim]` | Shifts normalized values |


<div class="alert alert-block alert-success">
    
By default, the GPTModel instance is initialized with random weights for pretraining. 

The last
step to using OpenAI's model weights is to override these random weights with the weights
we loaded into the params dictionary.

For this, we will first define a small assign utility function that checks whether two
tensors or arrays (left and right) have the same dimensions or shape and returns the
right tensor as trainable PyTorch parameters:
</div>

we need to check whether the loading parameters are of proper dimensions before laoding them into model

In [56]:
gpt = GPTModel(BASE_CONFIG)
# gpt.eval()

now lets try to load the weights into gpt model

In [None]:
load_weights_into_gpt(gpt,params)
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

gpt.to(device)

In [61]:
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context

    ###Input batch:
 ###tensor([[6109, 3626, 6100,  345],
        ##[6109, 1110, 6622,  257]])
    
    for _ in range(max_new_tokens):
        
        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]
        
        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond) ### batch, n_tokens, vocab_size
        
        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]  

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

In [62]:
import tiktoken
def text_to_tokens(text, tokenizer):
    encoded= tokenizer.encode(text, allowed_special= 'all')
    token_ids = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return token_ids



def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())


In [80]:
prompt =(  "Is the following text 'spam'? Answer with 'yes' or 'no':"
    " 'You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award.'"
)

tokenizer = tiktoken.get_encoding("gpt2")

text = generate_text_simple(
    model = gpt,
    idx = text_to_tokens(prompt, tokenizer),
    max_new_tokens=15,
    context_size=BASE_CONFIG["context_length"]

)



In [81]:
print(token_ids_to_text(text, tokenizer))

Is the following text 'spam'? Answer with 'yes' or 'no': 'You are a winner you have been specially selected to receive $1000 cash or a $2000 award.'

The following text 'spam'? Answer with 'yes' or


## 3. freeze the model weights


In [None]:
for param in gpt.parameters():
    param.requires_grad = False #by setting false we are making all weights untrainable and frozen


### changin the out_head to map the input layer to the number of classed 
the classes are 2 (spam and not spam)

In [107]:
torch.manual_seed(42)
num_classes = 2
gpt.out_head = torch.nn.Linear(in_features = BASE_CONFIG["emb_dim"], out_features = num_classes) # it means it willl map 768 dim vector into 2 dim vector 

<div class="alert alert-block alert-warning">

Note that in the preceding code, we use BASE_CONFIG["emb_dim"], which is equal to 768 in
the "gpt2-small (124M)" model, to keep the code below more general. 

This means we
can also use the same code to work with the larger GPT-2 model variants.

This new model.out_head output layer has its requires_grad attribute set to True by
default, which means that it's the only layer in the model that will be updated during
training.



Stage 1: Train only out_head
Stage 2: Train out_head + last block + ln_f

</div>

Input Text
 â†’ Tokenization
 â†’ Token Embeddings (wte)
 â†’ Position Embeddings (wpe)
 â†’ Transformer Blocks (Ã— n_layers)
 â†’ Final LayerNorm (ln_f)
 â†’ Output Head (out_head)
 â†’ Logits
 â†’ Loss
 â†’ Backprop (only selected layers update)


In [108]:
# making final transofrmer block 
for param in gpt.trf_blocks[-1].parameters():
    param.requires_grad = True
    
# making final_norm layer trainaible
for param in gpt.final_norm.parameters():
    param.requires_grad = True


In [109]:
prompt = tokenizer.encode("Hello, how are you?")
inputs = torch.tensor(prompt).unsqueeze(0)
print(inputs)
print(inputs.shape)

tensor([[15496,    11,   703,   389,   345,    30]])
torch.Size([1, 6])


In [112]:
with torch.no_grad():
    out = gpt(inputs)
print(out.shape)
print(out)


torch.Size([1, 6, 2])
tensor([[[1.6807, 1.6095],
         [3.7947, 8.1032],
         [3.3092, 8.4428],
         [2.0253, 6.9588],
         [2.2015, 7.5989],
         [3.6070, 7.6458]]])


<div class="alert alert-block alert-info">

If we had used the original GPT-2 model, a similar input would have produced an output tensor of [1, 6, 50257],
where 50,257 represents the vocabulary size. 

As we are using a smaller model (124M), the number of
output rows corresponds to the number of input tokens (in this case, 6). 

However, each
output's embedding dimension (the number of columns) is now reduced to 2 instead of
50,257 since we replaced the output layer of the model.

</div>

In [114]:
#we need to last output token to get the classification becaue it the last token that contain the information of the other tokens
print(out[:,-1,:])


tensor([[3.6070, 7.6458]])


<div class="alert alert-block alert-info">

Having modified the model, the next section will detail the process of transforming the
last token into class label predictions and calculate the model's initial prediction accuracy.

Following this, we will finetune the model for the spam classification task in the subsequent
section.

</div>

<div class="alert alert-block alert-success">
We can obtain the class label via the following code:
</div>

we have obtained the output logits for the input data now we want the classification results , we can use torch.softmax then argmax to get the index of the maximum value in the logits array but to simplfiy the computaiton using argmax will also work

In [128]:
# option1 
prob = torch.softmax(out[:,-1,:], dim=-1)
print(prob)

label = torch.argmax(prob)
print("class label: ",label.item())

tensor([[0.0173, 0.9827]])
class label:  1


`torch.argmax: This converts raw probability scores (logits) into specific class predictions by finding the index of the highest score.`

In [None]:
# option 2 (recommended)

logits = torch.argmax(out[:,-1,:])
print(logits)
print("class label: ",logits.item())

tensor(1)
class label:  1


`her you can see that the class lable is 1 in t both ways `

<div class="alert alert-block alert-success">
To determine the classification accuracy, we apply the argmax-based prediction code to
all examples in the dataset and calculate the proportion of correct predictions by defining a
calc_accuracy_loader function:
</div>

In [130]:
def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    #num_batches is nothing but the number of batches wich we want to process
    model.eval() # it will ensure that the model is in evaluation mode and all neurons are active 
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader)) # it will take the minimum number for batches its only for due to computation Safety: The min() function ensures code doesn't crash if you ask for more batches than exist in the loader.
    

    for i , (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches: #by using i we are doing that caclulating only upto num_batches specified , for enumerate was required becoz it can convert input_batch, target_batch pair in index based.
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)

            with torch.no_grad(): # by using no_grad we are stopping gradient weights to be calcualated it will save computation time becz for testing we dont need it
                logits = model(input_batch)[:,-1,:] #it will get the last output token of every input tensor returned by model
                
            predicted_labels = torch.argmax(logits,dim = -1)

            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()

        else:
            break

    return correct_predictions/num_examples



``Detaching from the Graph: PyTorch tensors often carry a "computational history" (the gradient graph). If you add a tensor to a running sum without .item(), you are effectively keeping the entire history of every batch in memory.
Preventing Memory Leaks: Without .item(), you may encounter an "Out of Memory" (OOM) error because the memory allocated for the training graph cannot be freed.``

In [157]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [158]:
# from Architecture Classification finetuning import train_loader
gpt.to(device)


GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(1024, 768)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=True)
        (W_key): Linear(in_features=768, out_features=768, bias=True)
        (W_value): Linear(in_features=768, out_features=768, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=7

In [162]:
# from Architecture Classification finetuning import val_loader
torch.manual_seed(42)

train_accuracy = calc_accuracy_loader(train_loader, gpt, num_batches = 10, device= device)
test_accuracy = calc_accuracy_loader(test_loader, gpt, num_batches = 10, device= device)
val_accuracy = calc_accuracy_loader(val_loader, gpt, num_batches = 10, device= device)

In [163]:
# from Architecture Classification finetuning import val_accuracy
# from Architecture Classification finetuning import test_accuracy
print(f"{train_accuracy*100:.2f}%")
print(f"{val_accuracy*100:.2f}%")
print(f"{test_accuracy*100:.2f}%")

45.00%
51.25%
45.00%


<div class="alert alert-block alert-info">
    
As we can see, the prediction accuracies are near a random prediction, which would be
50% in this case. 

To improve the prediction accuracies, we need to finetune the model.

</div>

In [None]:
import torch
print(f"Is CUDA available? {torch.cuda.is_available()}")
print(f"Current Device Count: {torch.cuda.device_count()}")

Is CUDA available? False
Current Device Count: 0


<div class="alert alert-block alert-warning">

Classification accuracy is not a differentiable function, so we use cross entropy
loss as a proxy to maximize accuracy. 

This is the same cross entropy loss discussed earlier. 

Accordingly, the calc_loss_batch function remains the same as in earlier, with one
adjustment: we focus on optimizing only the last token, model(input_batch)[:, -1, :],
rather than all tokens, model(input_batch):

</div>

In [165]:
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device) , target_batch.to(device)
    logits = model(input_batch)[:,-1,:]
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss


# we use calc_loss_batch to calcualte loss for single batch obtained from defined dataloaders. To calculate loss for all batches in a dataloader, we define the `calc_loss_loader function`

In [166]:
def calc_loss_loader(data_loader, model, device, num_batches= None):
    total_loss = 0
    if len(data_loader) == 0:
        return float('Nan')
    elif num_batches == None:
        num_batches = len(data_loader)
    else:
        # if num_batches exceeds the number of batches in datloader then we need to set this:
        num_batches = min(num_batches, len(data_loader))
    
    for i , (input_batch , target_batch) in enumerate(data_loader):
        if(i < num_batches):
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss
        else:
            break
        return total_loss/num_batches


similarly like we have calculated the accuracy for train_loader . We now compute the initial oss for each data set.

In [175]:
# from Architecture Classification finetuning import test_loader
with torch.no_grad(): #disable gradient tracking for efficiency becz we are not training yet
    train_loss = calc_loss_loader(train_loader, gpt, device, num_batches= 5)
    val_loss = calc_loss_loader(val_loader, gpt, device, num_batches= 5)
    test_loss = calc_loss_loader(test_loader, gpt, device, num_batches= 5)

In [177]:
print(f"Training loss: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}")
print(f"Test loss: {100*test_loss:.3f}")

Training loss: 0.229
Validation loss: 0.193
Test loss: 15.765


# finetuing model on supervised data , implementing training
In the next section, we will implement a training function to finetune the model, which
means adjusting the model to minimize the training set loss. 

Minimizing the training set
loss will help increase the classification accuracy, our overall goa