# IMDB GPT

The goal of this project is to use Hugging Face and PyTorch to build a decoder-only causal language model, similar in spirit to the GPT architecture, and train it on the IMDB review dataset. We used Hugging Face libraries to prepare the dataset and PyTorch with Lightning for model training. We also tuned hyperparameters and used a linear learning rate scheduler with warmup to improve training.

For evaluation, we generated a few responses from prompts to inspect the output quality, and we measured perplexity and loss on the test set, which were 97.27 and 4.5775 respectively.

## Data Preparation

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
import numpy as np
from torch.utils.data import DataLoader
import torch

import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"  # or "false"

In [2]:
raw_ds = load_dataset('imdb')

split = raw_ds['train'].train_test_split(test_size = 0.08, seed = 10)

ds = {
    'train': split['train'],
    'val': split['test'],
    'test': raw_ds['test']
}

In [3]:
print(next(iter(raw_ds['train']))['text'])

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve

In [4]:
tok = AutoTokenizer.from_pretrained('gpt2')
tok.pad_token = tok.eos_token
print("vocab:", tok.vocab_size, "pad:", tok.pad_token_id, "eos:", tok.eos_token_id)

vocab: 50257 pad: 50256 eos: 50256


In [5]:
def tokenize(batch): 
    return tok(batch['text'], add_special_tokens=True, truncation= False)

tokenized = {}
for split_name, dset in ds.items(): 
    tokenized[split_name] = dset.map(
        tokenize, 
        batched = True, 
        remove_columns = dset.column_names, 
        desc=f"Tokenizing {split_name}"
    )

In [6]:
tokenized

{'train': Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 23000
 }),
 'val': Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 2000
 }),
 'test': Dataset({
     features: ['input_ids', 'attention_mask'],
     num_rows: 25000
 })}

In [7]:
def group_texts(examples):
    # Concatenate
    block_size = 256
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_len = len(concatenated["input_ids"])
    # Drop remainder to make full blocks
    total_len = (total_len // block_size) * block_size
    result = {}
    for k, vals in concatenated.items():
        vals = vals[:total_len]
        result[k] = [vals[i:i+block_size] for i in range(0, total_len, block_size)]
    # Labels = inputs for CLM (no masking)
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = {}

for split_name, dset in tokenized.items(): 
    lm_dataset[split_name] = dset.map(
        group_texts, 
        batched = True
    )

lm_dataset

{'train': Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 26904
 }),
 'val': Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 2347
 }),
 'test': Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 28582
 })}

In [8]:
len(next(iter(lm_dataset['train']))['input_ids'])

256

In [9]:
len(next(iter(tokenized['train']))['input_ids'])

592
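
The grouping step can be illustrated on toy data. The sketch below is a minimal, framework-free version of the same idea, using a hypothetical `chunk_token_ids` helper, made-up token ids, and a block size of 4 instead of 256. It is only an illustration of concatenating documents and dropping the remainder, not the notebook's `group_texts`.

```python
def chunk_token_ids(docs, block_size):
    """Concatenate tokenized documents and split them into full blocks."""
    flat = [tok_id for doc in docs for tok_id in doc]
    # Drop the tail that does not fill a whole block
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

toy_docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]  # three made-up "reviews"
blocks = chunk_token_ids(toy_docs, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a block can span the boundary between two reviews, which is why the number of blocks (26904 for training) differs from the number of reviews (23000).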

In [44]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer = tok,
    mlm = False
)


train_loader = DataLoader(
    lm_dataset["train"],
    batch_size=32,         
    shuffle=True,
    collate_fn=data_collator,
    num_workers=15,
    pin_memory=True
)

val_loader = DataLoader(
    lm_dataset["val"],
    batch_size=32,
    shuffle=False,
    collate_fn=data_collator,
    num_workers=15,
    pin_memory=True
)

test_loader = DataLoader(
    lm_dataset["test"],
    batch_size=32,
    shuffle=False,
    collate_fn=data_collator,
    num_workers=15,
    pin_memory=True
)

batch = next(iter(train_loader))
for k, v in batch.items():
    print(k, v.shape) 
    print(k) 
    print(v)

input_ids torch.Size([32, 256])
input_ids
tensor([[ 4520, 24407,    11,  ...,   284,   651,   616],
        [   11,  1141,   262,  ...,  1256,   517,   287],
        [ 1577,   262,  7110,  ...,  3173,   994, 29847],
        ...,
        [  286,  1402,    11,  ...,  3382,   284,  1494],
        [ 5299,  3899,   327,  ...,   284,  1064,   503],
        [ 1220,  6927,  1671,  ..., 29847,  1671,  1220]])
attention_mask torch.Size([32, 256])
attention_mask
tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])
labels torch.Size([32, 256])
labels
tensor([[ 4520, 24407,    11,  ...,   284,   651,   616],
        [   11,  1141,   262,  ...,  1256,   517,   287],
        [ 1577,   262,  7110,  ...,  3173,   994, 29847],
        ...,
        [  286,  1402,    11,  ...,  3382,   284,  1494],
        [ 5299,  3899,   327,  ...,   284,  1
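
The printed batch shows `labels` identical to `input_ids`. The one-position shift needed for next-token prediction is applied later, inside `GPTModel.forward`, where `logits[:, :-1, :]` is compared against `labels[:, 1:]`. A minimal sketch of that pairing on made-up token ids (the helper name `next_token_pairs` is hypothetical):

```python
def next_token_pairs(token_ids):
    """Pair each input token with the token the model should predict next."""
    inputs = token_ids[:-1]   # positions that produce a prediction
    targets = token_ids[1:]   # the token one step ahead of each input
    return list(zip(inputs, targets))

toy_ids = [11, 22, 33, 44]
print(next_token_pairs(toy_ids))  # [(11, 22), (22, 33), (33, 44)]
```

A block of 256 tokens therefore contributes 255 prediction targets to the loss.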

## Model Training and Hyperparameter Tuning

In [17]:
import torch.nn as nn
import torch.nn.functional as F
from torchinfo import summary

In [31]:
class PositionalEncoding(nn.Module): 
    #creating a learnable positional encoding 
    def __init__(self, max_length = 2048, dims = 128): 
        super().__init__()
        self.max_length = max_length 
        self.dims = dims 
        self.encoding = nn.Embedding(num_embeddings= self.max_length, embedding_dim= self.dims) 
        self.drop = nn.Dropout(0.1) 
        nn.init.normal_(self.encoding.weight, mean = 0.0, std = 0.02) 

    def forward(self, inputs):
        #inputs will be in shape of B, 256(max length of chunks), 128 (or any other dim size)
        B, T, D = inputs.shape 
        position = torch.arange(T, device = inputs.device).unsqueeze(0).expand(B, T)
        x = inputs + self.encoding(position)
        x = self.drop(x) 
        return x

def causal_mask(seq_length = 256, device = None): 
    mask = torch.triu(torch.ones(seq_length,seq_length, dtype = torch.bool, device = device), diagonal = 1)
    return mask
    
class GPTModel(nn.Module): 
    def __init__(
        self,
        vocab_size = 30000, 
        n_head = 16, 
        dims = 512, 
        n_layer = 8, 
        max_len = 1024, 
        pad_id = 0, 
    ): 
        super().__init__() 
        self.vocab_size = vocab_size
        self.dims = dims 
        self.n_head = n_head 
        self.n_layer = n_layer 
        self.max_len = max_len
        self.pad_id = pad_id

        # padding_idx=None disables padding; -1 would mark the last vocab row as padding
        self.tok_emb = nn.Embedding(num_embeddings = self.vocab_size, 
                                    embedding_dim = self.dims, 
                                    padding_idx = self.pad_id)

        self.pos_emb = PositionalEncoding(max_length = self.max_len, dims = self.dims) 

        layer = nn.TransformerEncoderLayer(
            d_model= self.dims, 
            nhead= self.n_head, 
            dim_feedforward= self.dims*4, 
            dropout= 0.1, 
            activation= 'gelu',
            batch_first= True, 
            norm_first= True
        )

        self.encoder = nn.TransformerEncoder(layer, num_layers= self.n_layer) 
        self.layer_norm = nn.LayerNorm(self.dims) 

        self.lm_head = nn.Linear(self.dims, vocab_size, bias=False)
        nn.init.normal_(self.tok_emb.weight, mean=0.0, std=0.02)
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, input_ids, attention_mask = None, labels = None): 
        x = self.tok_emb(input_ids) 
        x = self.pos_emb(x) 

        B, T, D = x.shape 
        mask = causal_mask(T, device = x.device) 
        key_pad = (attention_mask == 0) if attention_mask is not None else None
        
        out = self.encoder(x, mask = mask, src_key_padding_mask = key_pad) 
        out = self.layer_norm(out) 

        logits = self.lm_head(out) 

        loss = None 

        if labels is not None:
            shift_logits = logits[:, :-1, :].contiguous()
            shift_labels = labels[:, 1:].contiguous()
            loss = F.cross_entropy(
                shift_logits.view(-1, self.vocab_size),
                shift_labels.view(-1),
                ignore_index=-100
            )
        return logits, loss

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        attention_mask = kwargs.get("attention_mask", None)
        return {"input_ids": input_ids, "attention_mask": attention_mask}


summary(GPTModel())



Layer (type:depth-idx)                                            Param #
GPTModel                                                          --
├─Embedding: 1-1                                                  15,360,000
├─PositionalEncoding: 1-2                                         --
│    └─Embedding: 2-1                                             524,288
│    └─Dropout: 2-2                                               --
├─TransformerEncoder: 1-3                                         --
│    └─ModuleList: 2-3                                            --
│    │    └─TransformerEncoderLayer: 3-1                          3,152,384
│    │    └─TransformerEncoderLayer: 3-2                          3,152,384
│    │    └─TransformerEncoderLayer: 3-3                          3,152,384
│    │    └─TransformerEncoderLayer: 3-4                          3,152,384
│    │    └─TransformerEncoderLayer: 3-5                          3,152,384
│    │    └─TransformerEncoderLayer: 3-6          
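
The parameter counts in the summary can be reproduced by hand. The sketch below is a rough estimate that assumes the standard layout of `nn.TransformerEncoderLayer` with biases (a fused input projection, an output projection, two feed-forward linear layers, and two LayerNorms) and assumes the tied `lm_head` adds no parameters of its own. The helper names are hypothetical.

```python
def layer_params(dims):
    """Parameters in one encoder layer with a 4x feed-forward width."""
    attn = (3 * dims * dims + 3 * dims) + (dims * dims + dims)     # in_proj + out_proj
    ffn = (dims * 4 * dims + 4 * dims) + (4 * dims * dims + dims)  # linear1 + linear2
    norms = 2 * (2 * dims)                                         # norm1 + norm2
    return attn + ffn + norms

def count_gpt_params(vocab_size, dims, n_layer, max_len):
    """Estimate trainable parameters for the model described above."""
    tok_emb = vocab_size * dims   # shared with the tied lm_head
    pos_emb = max_len * dims
    final_norm = 2 * dims
    return tok_emb + pos_emb + n_layer * layer_params(dims) + final_norm

print(layer_params(512))                       # 3152384
print(count_gpt_params(50257, 512, 8, 1024))   # 51475968
```

With the GPT-2 vocabulary of 50257 tokens this comes to about 51.5 million parameters, roughly half of which sit in the token embedding.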

In [23]:
import lightning as L
from transformers import get_linear_schedule_with_warmup

class LitGPT(L.LightningModule):
    def __init__(self, model, lr = 0.0001, weight_decay = 0.01, warmup = 0.05):
        super().__init__()
        self.model = model
        self.lr = lr
        self.weight_decay = weight_decay
        self.warmup = warmup

    def training_step(self, batch, batch_idx):
        logits, loss = self.model(batch["input_ids"], batch["attention_mask"], batch["labels"])
        self.log("train_loss", loss, prog_bar=True, on_step=True, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        _, loss = self.model(batch["input_ids"], batch["attention_mask"], batch["labels"])
        self.log("val_loss", loss, prog_bar=True, on_epoch=True)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr, weight_decay=self.weight_decay)
        total_steps = self.trainer.estimated_stepping_batches
        warmup_steps = max(1, int(self.warmup*total_steps))
        scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps= warmup_steps, num_training_steps= total_steps)

        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "interval": "step",
                "frequency": 1
            }
        }
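
The shape of the schedule set up in `configure_optimizers` can be sketched without any framework: the learning rate ramps up linearly during warmup and then decays linearly to zero. The step counts below are assumptions based on 26904 training blocks, a batch size of 32, 10 epochs, a single device, and no gradient accumulation, so the actual `estimated_stepping_batches` may differ. The function name `linear_warmup_decay` is hypothetical.

```python
import math

def linear_warmup_decay(step, warmup_steps, total_steps):
    """LR multiplier: linear ramp 0 -> 1 during warmup, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Assumed step count: ceil(26904 / 32) batches per epoch for 10 epochs
steps_per_epoch = math.ceil(26904 / 32)
total_steps = steps_per_epoch * 10
warmup_steps = max(1, int(0.05 * total_steps))

base_lr = 1e-4
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, base_lr * linear_warmup_decay(step, warmup_steps, total_steps))
```

Under these assumptions the learning rate peaks at `base_lr` after 420 of 8410 steps and reaches zero at the final step.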


In [16]:
device = 'cuda'

In [19]:
from lightning.pytorch.callbacks import ModelCheckpoint

model = GPTModel(
    vocab_size= tok.vocab_size, 
    dims= 512, 
    n_head = 16, 
    n_layer = 8, 
    max_len = 1024,
    pad_id = tok.pad_token_id
).to(device)
    
checkpoint_cb = ModelCheckpoint(save_top_k=1, monitor="val_loss", mode="min")

lit_model = LitGPT(model, lr = 0.0001)

trainer = L.Trainer(
    max_epochs = 10,
    accelerator = "auto",
    precision="bf16-mixed",  
    log_every_n_steps = 20,
    callbacks=[checkpoint_cb],
)

trainer.fit(lit_model, train_dataloaders = train_loader, val_dataloaders = val_loader)

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/home/anubh/miniconda3/envs/dl-env/lib/python3.12/site-packages/lightning/pytorch/utilities/model_summary/model_summary.py:231: Precision bf16-mixed is not supported by the model summary.  Estimated model size in MB will not be accurate. Using 32 bits instead.

  | Name  | Type     | Params | Mode 
-------------------------------------------
0 | model | GPTModel | 51.5 M | train
-------------------------------------------
51.5 M    Trainable params
0         Non-trainable params
51.5 M    Total params
205.904   Total estimated model params size (MB)
89        Modules in train mode
0         Modules in eval mode


`Trainer.fit` stopped: `max_epochs=10` reached.


In [25]:
lit_model2

LitGPT(
  (model): GPTModel(
    (tok_emb): Embedding(50257, 512, padding_idx=50256)
    (pos_emb): PositionalEncoding(
      (encoding): Embedding(1024, 512)
      (drop): Dropout(p=0.1, inplace=False)
    )
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-7): 8 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
          (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-05, eleme

In [26]:
@torch.no_grad()
def generate(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 50,
    temperature: float = 1.0,
    top_k: int | None = None,
    top_p: float | None = None,
    eos_token_id: int | None = None,
    device: str = "cuda",
    use_autocast: bool = True,        # bf16/fp16 autocast on CUDA
):
    model.eval()
    if eos_token_id is None:
        eos_token_id = tokenizer.eos_token_id

    # Encode prompt
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    input_ids = enc["input_ids"].to(device)     # [1, T0]

    # Respect model’s max_len (positional table)
    max_len = model.pos_emb.encoding.num_embeddings

    for _ in range(max_new_tokens):
        # If too long, keep only the most recent window
        if input_ids.size(1) > max_len:
            input_ids = input_ids[:, -max_len:]

        attention_mask = torch.ones_like(input_ids, dtype=torch.long, device=device)

        # Forward
        if use_autocast and torch.cuda.is_available():
            dtype = torch.bfloat16
            with torch.autocast(device_type="cuda", dtype=dtype):
                logits, _ = model(input_ids, attention_mask, labels=None)
        else:
            logits, _ = model(input_ids, attention_mask, labels=None)

        # Take last-step logits and apply temperature
        next_logits = logits[:, -1, :] / max(temperature, 1e-6)

        # Top-k filtering
        if top_k is not None and top_k > 0:
            v, idx = torch.topk(next_logits, k=min(top_k, next_logits.size(-1)))
            filtered = torch.full_like(next_logits, float("-inf"))
            next_logits = filtered.scatter(-1, idx, v)

        # Top-p (nucleus) filtering
        if top_p is not None and 0.0 < top_p < 1.0:
            sorted_logits, sorted_idx = torch.sort(next_logits, descending=True)
            probs = F.softmax(sorted_logits, dim=-1)
            cumsum = torch.cumsum(probs, dim=-1)
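            cutoff = cumsum > top_p
            # Shift right so the token that crosses the threshold is still kept
            cutoff[..., 1:] = cutoff[..., :-1].clone()
            # Keep at least one token
            cutoff[..., 0] = False
            sorted_logits[cutoff] = float("-inf")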
            # Scatter back to original order
            next_logits = torch.full_like(next_logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

        # Sample or greedy
        probs = F.softmax(next_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # for greedy: probs.argmax(-1, keepdim=True)

        # Append
        input_ids = torch.cat([input_ids, next_token], dim=1)

        # Optional early stop on EOS
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
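
The idea behind the two filters can be shown on a toy distribution. The sketch below is a plain-Python illustration under the common definition of nucleus sampling (keep the smallest set of most likely tokens whose cumulative probability reaches `top_p`). It works on probabilities rather than logits, skips temperature, and uses a hypothetical helper name and made-up probabilities.

```python
def top_k_top_p_filter(probs, top_k=None, top_p=None):
    """Return renormalized probabilities after top-k and nucleus filtering."""
    # Token indices sorted from most to least likely
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None and top_k > 0:
        order = order[:top_k]
    if top_p is not None and 0.0 < top_p < 1.0:
        total = sum(probs[i] for i in order)  # renormalize after top-k
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / total
            if cumulative >= top_p:  # smallest prefix that reaches top_p
                break
        order = kept
    mass = sum(probs[i] for i in order)
    return {i: probs[i] / mass for i in order}

toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]
filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.8)
print(filtered)
```

Here top-k first removes the least likely token, and top-p then keeps tokens 0, 1 and 2, whose renormalized probabilities sum past 0.8. Sampling then draws only from the surviving tokens.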


In [31]:
prompt = "This movie was among the worst I have seen "

model_to_gen = lit_model.model
model_to_gen.to(device).eval()

out = generate(
    model_to_gen, tok, prompt,
    max_new_tokens=120,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    device=device
)
print(out)


This movie was among the worst I have seen br /><br />Even it's a movie to be seen and it's a total waste of time. This is a bad movie. It has the best of the cast. The acting is good, this movie is really horrible. I've been to say, even if you're a kid.This is the worst movie i have ever seen. I have seen the title, but it was terrible.This was an over-top movie. I actually have seen it, but it was not so bad. A lot of the first part was great and the plot was bad.<br /><br


In [35]:
prompt = "Movie was great! "

out = generate(
    model_to_gen, tok, prompt,
    max_new_tokens=100,
    temperature=0.90,
    top_k=50,
    top_p=0.95,
    device=device
)
print(out)


Movie was great! *** <br /><br />The film is about the first three-one who, the film, is not the most part in the movie in which they are treated to the rest of the actors, but by the other, he's not the worst. The story line, from the opening, the camera, I thought it was the plot, and all that was pretty predictable. The whole movie is very long and the dialog just looks very well in the movie, which does not exist. It


In [36]:
#we will train a bit more to see if performance can be improved

best_model = checkpoint_cb.best_model_path

model2 = GPTModel(
    vocab_size=tok.vocab_size,
    dims=512,
    n_head=16,
    n_layer=8,
    max_len=1024,
    pad_id=tok.pad_token_id
)

lit_model2 = LitGPT(model2, lr=5e-5, warmup= 0.02)

state = torch.load(best_model, map_location = 'cpu')
lit_model2.load_state_dict(state['state_dict'], strict = True)



<All keys matched successfully>

In [38]:
trainer2 = L.Trainer(
    max_epochs=10,       
    accelerator="auto",
    precision="bf16-mixed",
    log_every_n_steps=20,
    callbacks=[checkpoint_cb])

trainer2.fit(lit_model2, train_dataloaders = train_loader, val_dataloaders = val_loader)

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/anubh/miniconda3/envs/dl-env/lib/python3.12/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:751: Checkpoint directory /mnt/d/ibm_genai/lightning_logs/version_23/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/home/anubh/miniconda3/envs/dl-env/lib/python3.12/site-packages/lightning/pytorch/utilities/model_summary/model_summary.py:231: Precision bf16-mixed is not supported by the model summary.  Estimated model size in MB will not be accurate. Using 32 bits instead.

  | Name  | Type     | Params | Mode 
-------------------------------------------
0 | model | GPTModel | 51.5 M | train
-------------------------------------------
51.5 M    Trainable params
0         Non-trainable params
51.5 M    Total params
205

`Trainer.fit` stopped: `max_epochs=10` reached.


In [74]:
#training a bit more with early stopping now
from lightning.pytorch.callbacks import EarlyStopping

best_path = checkpoint_cb.best_model_path

model2 = GPTModel(
    vocab_size=tok.vocab_size,
    dims=512, n_head=16, n_layer=8, max_len=1024,
    pad_id=tok.pad_token_id
)
lit2 = LitGPT(model2, lr=5e-5)  # lower LR

state = torch.load(best_path, map_location="cpu")
lit2.load_state_dict(state["state_dict"], strict=True)

ckpt2 = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)
es = EarlyStopping(monitor="val_loss", mode="min", patience=2)

trainer2 = L.Trainer(
    max_epochs=10,
    accelerator="auto",
    precision="bf16-mixed",
    callbacks=[ckpt2, es],
    log_every_n_steps=20,
)
trainer2.fit(lit2, train_dataloaders=train_loader, val_dataloaders=val_loader)

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.

  | Name  | Type     | Params | Mode 
-------------------------------------------
0 | model | GPTModel | 51.5 M | train
-------------------------------------------
51.5 M    Trainable params
0         Non-trainable params
51.5 M    Total params
205.904   Total estimated model params size (MB)
89        Modules in train mode
0         Modules in eval mode


`Trainer.fit` stopped: `max_epochs=10` reached.


## Model Evaluation

In [77]:
import math

@torch.no_grad()
def eval_perplexity(model, dataloader, device="cuda", use_bf16=True):
    model.eval().to(device)
    total_loss = 0.0
    total_tokens = 0

    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        autocast_enabled = use_bf16 and torch.cuda.is_available()
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=autocast_enabled):
            _, loss = model(input_ids, attention_mask, labels)

        # count non-ignored targets after the 1-token shift
        n_valid = (labels[:, 1:] != -100).sum().item()
        total_loss += loss.item() * max(n_valid, 1)
        total_tokens += max(n_valid, 1)

    avg_loss = total_loss / max(total_tokens, 1)
    ppl = math.exp(min(avg_loss, 10.0))
    return avg_loss, ppl

train_loss = 4.22
train_ppl = math.exp(4.22)
print(f"train_loss tuned | {train_loss:.4f}  train_ppl tuned | {train_ppl:.2f}")

val_loss, val_ppl = eval_perplexity(model2, val_loader, device=device, use_bf16=True)
print(f"val_loss tuned | {val_loss:.4f}  val_ppl tuned | {val_ppl:.2f}")

test_loss, test_ppl = eval_perplexity(model, test_loader, device=device, use_bf16=True)
print(f"test_loss | {test_loss:.4f}  test_ppl | {test_ppl:.2f}")

test_loss, test_ppl = eval_perplexity(model2, test_loader, device=device, use_bf16=True)
print(f"test_loss tuned | {test_loss:.4f}  test_ppl tuned | {test_ppl:.2f}")

train_loss tuned | 4.2200  train_ppl tuned | 68.03
val_loss tuned | 4.5102  val_ppl tuned | 90.94
test_loss | 4.8314  test_ppl | 125.38
test_loss tuned | 4.5775  test_ppl tuned | 97.27
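
Perplexity is the exponential of the mean per-token loss, and batch losses are weighted by their token counts before averaging. A minimal sketch of that aggregation follows; the per-batch losses and the helper name `corpus_perplexity` are made up for illustration, and only the final check uses a number reported above.

```python
import math

def corpus_perplexity(batch_losses, batch_token_counts):
    """Token-weighted mean loss and its perplexity, exp(mean loss)."""
    total_tokens = sum(batch_token_counts)
    total_loss = sum(l * n for l, n in zip(batch_losses, batch_token_counts))
    avg_loss = total_loss / max(total_tokens, 1)
    return avg_loss, math.exp(avg_loss)

# Hypothetical losses: two full batches (32 * 255 targets) and a smaller last batch
losses = [4.6, 4.5, 4.7]
counts = [32 * 255, 32 * 255, 6 * 255]
avg, ppl = corpus_perplexity(losses, counts)
print(f"avg_loss={avg:.4f} ppl={ppl:.2f}")

# Sanity check: the reported test loss of 4.5775 maps to the reported perplexity
print(f"{math.exp(4.5775):.2f}")  # 97.27
```

Weighting by token count matters mainly for the last batch, which is usually smaller than the rest.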


The results above show that further hyperparameter tuning helped: test perplexity dropped from 125.38 to 97.27, a reduction of about 22%. Training for a few more epochs could improve test performance further, but the gap between training loss and validation/test loss would likely widen, which signals possible overfitting.
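
The improvement figure follows directly from the two test perplexities printed above. A short check, with a hypothetical helper name:

```python
def relative_reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before

ppl_gain = relative_reduction(125.38, 97.27)   # first run vs. tuned model
gap_val_test = 97.27 - 90.94                   # tuned test vs. tuned validation
print(f"perplexity reduction: {ppl_gain:.1%}")
print(f"test minus validation perplexity: {gap_val_test:.2f}")
```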

In [78]:
prompt = "This movie was among the worst I have seen"

model_to_gen = lit2.model
model_to_gen.to(device).eval()

out = generate(
    model_to_gen, tok, prompt,
    max_new_tokens=120,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    device=device
)
print(out)

This movie was among the worst I have seen. The story is simple and predictable. The character who is even stronger and less than a guy who makes an appearance in the movie, and the story is interesting and intelligent. I understand that I was able to feel this way through the movie. I was looking forward to seeing it in the movie, because it was almost like the other movie, but it just ended up being more believable. It wasn't a drama, just not a movie that would be funny. It did have some good parts. The movie was also bad. But that was the second half. The only reason I saw this movie


In [81]:
#adjusting parameters a bit to get better sounding text

prompt = "This movie was among the worst I have seen"

model_to_gen = lit_model2.model
model_to_gen.to(device).eval()

out = generate(
    model_to_gen, tok, prompt,
    max_new_tokens=100,
    temperature=0.85,
    top_k=60,
    top_p=0.90,
    device=device
)
print(out)

This movie was among the worst I have seen. I have never seen a horror movie. The acting was bad. I can't even understand the acting. The dialog was awful. I liked it all. I liked the characters, but the one, the plot was so bad that I was feeling cheated. My wife, the sister, was a guy. So the people were bad and I don't know what they were doing. The actors were good. I suppose they were the only reason they had. However, I was amazed that this was


In [82]:
prompt = "Movie was Awesome!"

out = generate(
    model_to_gen, tok, prompt,
    max_new_tokens=100,
    temperature=0.85,
    top_k=60,
    top_p=0.90,
    device=device
)
print(out)

Movie was Awesome! I can't wait for something for it. I've just been the most talented director of his other films. And while I watch this one, I think you're a fan of the first movie. I was not expecting much, because I haven't seen it. It's a very very well made movie, and I don't think that you can't really care if it's going to be a good film, but I'll give it a try.<br /><br />The acting is also


In [83]:
prompt = "This movie is great!"

out = generate(
    model_to_gen, tok, prompt,
    max_new_tokens=60,
    temperature=0.85,
    top_k=60,
    top_p=0.90,
    device=device
)
print(out)

This movie is great!I watched this movie a 5 year old and was watching it. It was a very different movie than I did it. I was waiting for the movie to get the impression of the characters. I thought that I thought it was a good movie. I am so disappointed that it was so very funny and


In [84]:
#trying a shorter prompt with a longer generation

prompt = "This movie was "

out = generate(
    model_to_gen, tok, prompt,
    max_new_tokens=300,
    temperature=0.85,
    top_k=100,
    top_p=0.95,
    device=device
)
print(out)

This movie was oot. I can't believe that the director had the opportunity to tell you why the film is all about. I mean, I would have liked to see the first two seasons before the first one, and they made it a long way into the series. But the writing was good, the ending was stupid and stupid, the acting was terrible. The problem was because it started with a good story and i couldn't. I still think this movie was directed by someone with a few good actors, but the way they turned it off was because of the real life in the movie was in it. The main character's character is so weird and the character is just totally different. The character really was terrible and the girl is cute. I think this movie isn't bad either, but not one of the worst movies ever made.<br /><br />If you're looking for a remake of the Dead in the Crypt, this movie's been made in a very early 90's movie. It's not a good movie because it's not an a movie. The plot is very good too, but sadly a mess of the first tw

## Conclusion

In this project we built a causal language model from scratch and trained it on the IMDB dataset using Hugging Face and PyTorch. The tuned model reached a test perplexity of 97.27 compared to 90.94 on validation. The small gap between the two suggests that the model has learned patterns that generalize reasonably well to unseen reviews.

We could improve the model by increasing the context length from 256 to 512 tokens or more, and by adding layers and attention heads. However, even this setup stretched the VRAM on our local machine to its limit, so that would require more hardware. More review data or more training epochs could also help, but we expect the gains to be minor because the loss has nearly plateaued.
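
To give a rough sense of why a longer context is costly, the sketch below estimates the size of one layer's attention score tensor, which grows with the square of the sequence length. This is a back-of-the-envelope upper bound that assumes the scores are fully materialized in bf16 (2 bytes per element). Fused attention kernels may avoid storing this tensor, and other activations grow only linearly, so actual VRAM use will differ. The helper name is hypothetical.

```python
def attention_score_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes for one attention score tensor of shape [batch, heads, T, T]."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

# Batch size 32 and 16 heads, as used in this notebook
for seq_len in (256, 512, 1024):
    mib = attention_score_bytes(32, 16, seq_len) / 2**20
    print(f"T={seq_len}: {mib:.0f} MiB per layer if fully materialized")
```

Doubling the context from 256 to 512 quadruples this term, so a smaller batch size would likely be needed to stay within the same memory budget.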

Overall, we were able to build a GPT-like model on IMDB reviews on our own local machine, with reasonable generative results and a decent improvement in loss and perplexity after hyperparameter tuning. The model does ramble somewhat based on the eye test, but even Mistral and other quantized models can ramble like this. Considering all that, a model with roughly 50 million parameters gave us decent results, better than we expected.