# Next-Word Prediction using GPT-2
This notebook fine-tunes a pretrained GPT-2 model on the WikiText-2 dataset for next-word prediction, using Hugging Face's `transformers` and `datasets` libraries together with the `Trainer` API.

**Goals:**
- Tokenize text
- Fine-tune GPT-2 using causal language modeling
- Evaluate using perplexity and top-k accuracy
- (Optional) Deploy a demo with Gradio

In [None]:
!pip install transformers datasets evaluate accelerate gradio

In [None]:
import json
import math
import os
import sys

import evaluate
import gradio as gr
import numpy as np
import torch
from datasets import load_dataset, load_from_disk
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

## Load and Inspect Dataset

In [None]:
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')
print(dataset)

## Tokenize the Text

#### First run only: uncomment and run this cell to tokenize the dataset and cache it to disk

In [None]:
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# tokenizer.pad_token = tokenizer.eos_token  # GPT2 doesn't have pad_token

# def tokenize_function(examples):
#     return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)

# tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=['text'])

# os.makedirs("data/wikitext", exist_ok=True)
# tokenized_datasets["train"].save_to_disk("data/wikitext/train_tokenized")
# tokenized_datasets["validation"].save_to_disk("data/wikitext/val_tokenized")
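
To make the effect of `truncation=True, padding='max_length', max_length=128` concrete, here is a minimal plain-Python sketch of what happens to one sequence of token ids. The helper `pad_or_truncate` is hypothetical and only mimics the shape of the tokenizer's output (right-side padding is assumed); the real work in this notebook is done by `GPT2Tokenizer`.

```python
EOS_ID = 50256  # GPT-2's end-of-text id, reused as the pad id in this notebook


def pad_or_truncate(token_ids, max_length, pad_id=EOS_ID):
    """Hypothetical helper: return fixed-length ids plus a 0/1 attention mask."""
    kept = token_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(kept)     # number of pad positions still needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# A short sequence is padded, a long one is cut (max_length=5 for readability)
short = pad_or_truncate([10, 11, 12], max_length=5)
long = pad_or_truncate(list(range(8)), max_length=5)
print(short)
print(long)
```

Because the pad id equals the EOS id, the attention mask is the only reliable way to tell real tokens from padding later on.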

#### Later runs: load the cached tokenized dataset from disk

In [None]:
tokenized_datasets = {
    "train": load_from_disk("data/wikitext/train_tokenized"),
    "validation": load_from_disk("data/wikitext/val_tokenized")
}

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

## Load GPT-2 Model

In [7]:
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))

Embedding(50257, 768)

## Prepare Training Components

#### Train and evaluate on a subset of the dataset

In [None]:
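# Shuffle with a fixed seed, then keep 2,000 training and 400 validation examples
small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(2000))
small_eval_dataset = tokenized_datasets['validation'].shuffle(seed=42).select(range(400))

# Collator for causal LM (mlm=False): labels are a copy of input_ids,
# with pad positions ignored in the loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)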

# Training arguments
training_args = TrainingArguments(
    output_dir='./model/checkpoints',
    eval_strategy='epoch',
    save_strategy='epoch',
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    save_total_limit=2,
    logging_steps=200,
    fp16=torch.cuda.is_available(),
    push_to_hub=False
)

## Train the Model

In [33]:
# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Training and saving
trainer.train()
model.save_pretrained("model/checkpoints/final")
tokenizer.save_pretrained("model/checkpoints/final")

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 3.459239 |
| 2 | 3.354300 | 3.462888 |
| 3 | 3.354300 | 3.466245 |




('model/checkpoints/final\\tokenizer_config.json',
 'model/checkpoints/final\\special_tokens_map.json',
 'model/checkpoints/final\\vocab.json',
 'model/checkpoints/final\\merges.txt',
 'model/checkpoints/final\\added_tokens.json')
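
The `No log` entry and the repeated training loss in the training log above are consistent with the logging schedule. A rough sketch of the arithmetic follows, assuming a single device and no gradient accumulation (neither is stated in the notebook):

```python
import math

train_examples = 2000   # size of small_train_dataset
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs
logging_steps = 200

# Assumption: one device and gradient_accumulation_steps=1
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")

for epoch in range(1, epochs + 1):
    end_step = epoch * steps_per_epoch
    # Most recent multiple of logging_steps reached by the end of this epoch
    last_logged = (end_step // logging_steps) * logging_steps
    status = f"last logged at step {last_logged}" if last_logged else "no training loss logged yet"
    print(f"epoch {epoch}: ends at step {end_step}, {status}")
```

Under these assumptions only step 200 produces a training-loss entry, so a smaller `logging_steps` (for example 25) would give a distinct value for each epoch.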

## Evaluate Perplexity

In [34]:
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results['eval_loss'])
print(f'Perplexity: {perplexity:.2f}')



Perplexity: 32.02
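
Perplexity is the exponential of the mean per-token cross-entropy (in nats), which is approximately what `eval_loss` reports for a causal language model. A small worked sketch: the values in `token_nlls` are made up for illustration, while `3.466245` is the final validation loss from the training log.

```python
import math


def perplexity(mean_nll):
    """Perplexity for a mean negative log-likelihood given in nats."""
    return math.exp(mean_nll)


# Made-up per-token losses, only to show the averaging step
token_nlls = [2.0, 3.0, 4.0]
mean_nll = sum(token_nlls) / len(token_nlls)
print(f"toy perplexity: {perplexity(mean_nll):.2f}")

# Final validation loss taken from the training log
print(f"Perplexity: {perplexity(3.466245):.2f}")
```

A perplexity of about 32 can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly 32 tokens at each step.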


## Evaluate Top-k Accuracy

In [None]:
# Batch the full validation split (not the 400-example subset); the collator builds the tensors
dataloader = torch.utils.data.DataLoader(
    tokenized_datasets['validation'],
    batch_size=2,
    shuffle=False,
    collate_fn=data_collator
)

In [None]:
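def compute_top_k_accuracy(logits, labels, k=5):
    # Indices of the k highest-scoring tokens at each position: (batch, seq, k)
    topk = torch.topk(logits, k, dim=-1).indices
    # Add a trailing axis so the labels broadcast against the k candidates
    labels = labels.unsqueeze(-1)
    # A position is a hit if the true token is among the top k
    match = (topk == labels).any(dim=-1).float()
    return match.mean().item()

model.eval()  # disable dropout; a second full trainer.evaluate() pass is not needed here
top_k_accs = []

with torch.no_grad():
    for batch in dataloader:
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Shift by one: logits at position t predict the token at position t + 1
        logits = outputs.logits[:, :-1, :]
        labels = input_ids[:, 1:]

        # Note: padded positions are included in this per-batch average
        top_k_acc = compute_top_k_accuracy(logits, labels, k=5)
        top_k_accs.append(top_k_acc)

print(f"Top-5 Accuracy: {np.mean(top_k_accs):.4f}")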

Top-5 Accuracy: 0.2265
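
The loop above scores every position, including padded ones, and then averages the per-batch means. Since sequences were padded to 128 tokens with the EOS id, padded positions can affect the reported number. Below is a minimal plain-Python sketch of a padding-aware variant. `masked_top_k_counts` and the toy scores are hypothetical, and the one-step shift mirrors `logits[:, :-1]` versus `input_ids[:, 1:]`.

```python
def masked_top_k_counts(scores, input_ids, attention_mask, k=5):
    """Count top-k hits for one sequence, skipping padded targets.

    scores[t] holds the vocabulary scores produced after reading
    input_ids[: t + 1], so it is compared with the next token, input_ids[t + 1].
    """
    hits, counted = 0, 0
    for t in range(len(input_ids) - 1):
        if attention_mask[t + 1] == 0:  # the target is padding: skip it
            continue
        ranked = sorted(range(len(scores[t])), key=lambda v: scores[t][v], reverse=True)
        hits += int(input_ids[t + 1] in ranked[:k])
        counted += 1
    return hits, counted


# Toy example: vocabulary of 4 tokens, id 3 doubles as padding
input_ids = [0, 2, 1, 3, 3]
attention_mask = [1, 1, 1, 0, 0]
scores = [
    [0.1, 0.2, 0.9, 0.0],  # after token 0: best guess is 2 (the true next token)
    [0.8, 0.1, 0.3, 0.0],  # after token 2: best guess is 0 (true next token is 1)
    [0.0, 0.0, 0.0, 1.0],  # after token 1: the target is padding, so it is skipped
    [0.0, 0.0, 0.0, 1.0],  # padding position, target is also padding
    [0.0, 0.0, 0.0, 1.0],  # last position has no target
]
hits, counted = masked_top_k_counts(scores, input_ids, attention_mask, k=1)
print(f"top-1 accuracy over real tokens: {hits}/{counted}")
```

Summing `hits` and `counted` over all sequences and dividing once gives a token-weighted accuracy rather than a mean of batch means.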


In [None]:
os.makedirs("outputs", exist_ok=True)

with open("outputs/eval_metrics.json", "w") as f:
    json.dump({
        "perplexity": perplexity,
        # Mean over all batches (top_k_acc alone holds only the last batch)
        "top_5_accuracy": float(np.mean(top_k_accs))
    }, f)

## Gradio Demo: Try Next-Word Prediction

In [None]:
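def predict_next_word(prompt):
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    # Sample one new token from the 50 most likely candidates
    outputs = model.generate(**inputs, max_new_tokens=1, do_sample=True, top_k=50)
    # Decode the prompt together with the newly generated token
    return tokenizer.decode(outputs[0], skip_special_tokens=True)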

demo = gr.Interface(fn=predict_next_word, inputs='text', outputs='text', title='Next Word Predictor')
demo.launch()

* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.




Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
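
With `do_sample=True, top_k=50`, the demo draws the next token at random from the 50 highest-scoring candidates, so repeated calls with the same prompt can return different words. A rough plain-Python sketch of that sampling step, assuming a temperature of 1; `sample_top_k` and the toy logits are illustrative and not part of `transformers`:

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample an index from the k highest logits, weighted by softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Unnormalised softmax weights; subtracting the peak keeps exp() stable
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]
rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(1000)]
# Only the two highest-scoring ids (0 and 2) can be drawn when k=2
print({i: draws.count(i) for i in sorted(set(draws))})
```

With `k=1` the function always returns the highest-scoring index, which corresponds to greedy decoding (`do_sample=False` in `generate`); that setting suits a demo that should give the same answer for the same prompt.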
