In [1]:
!pip install -q transformers datasets evaluate --no-deps


## Step 1: Installing Required Libraries (Safely for Kaggle)

We're installing:
- `transformers`: For the GPT-2 model and tokenizer.
- `datasets`: To load datasets like WikiText-2.
- `evaluate`: To compute evaluation metrics like perplexity.

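We use `--no-deps` so that pip installs only these three packages and leaves Kaggle’s preinstalled dependencies untouched, which avoids version conflicts. The packages usually work with the dependency versions already installed on Kaggle; if an import fails, install the missing dependency explicitly.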
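Because `--no-deps` skips dependency resolution, it can help to confirm what is actually installed before going further. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is our own and not part of any of these packages.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages this notebook relies on
for name, version in installed_versions(["transformers", "datasets", "evaluate", "torch"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing would need to be installed explicitly.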


In [2]:
import torch
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling, Trainer, TrainingArguments
import evaluate


2025-07-05 23:59:42.492096: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1751759982.696770      35 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1751759982.761491      35 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Step 2: Importing Libraries

We now import the core libraries:
- `torch`: For GPU-accelerated training.
- `load_dataset`: To load text datasets from Hugging Face.
- `GPT2Tokenizer` and `GPT2LMHeadModel`: To load the tokenizer and model for GPT-2.
- `DataCollatorForLanguageModeling`: Batches examples and builds the `labels` used for the language-modeling loss.
- `Trainer` and `TrainingArguments`: Simplify training and evaluation.
- `evaluate`: Hugging Face's metrics library. It is imported here, but the metrics in this notebook are computed directly with `math` and `torch`.


In [3]:
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
dataset


DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

## Step 3: Loading Dataset

We're using the **WikiText-2** dataset from Hugging Face’s `datasets` library. It is a collection of Wikipedia articles commonly used for language modeling tasks such as next-word prediction.

We load the `wikitext-2-raw-v1` configuration, which keeps the original text rather than replacing rare words with an `<unk>` placeholder.

The dataset includes `train`, `validation`, and `test` splits with 36,718, 3,760, and 4,358 rows respectively.


In [4]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 does not have a pad token; setting it to eos

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])


## Step 4: Tokenizing the Text

We load the GPT-2 tokenizer with `GPT2Tokenizer.from_pretrained("gpt2")`. GPT-2 doesn't have a native padding token, so we set `pad_token` to the end-of-sequence token (`eos_token`).

We then define a tokenization function:
- It tokenizes each row of the `text` column.
- It truncates sequences to a fixed length (`max_length=128`).
- It pads shorter sequences to the same length (`padding="max_length"`), so every example ends up with exactly 128 token IDs.

Using `dataset.map()`, we apply this function to all examples in the dataset. We also remove the original `'text'` column to keep only tokenized data.
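To make the truncation and padding behaviour concrete, here is a small pure-Python sketch of what happens to a single sequence of token IDs. The function `pad_or_truncate` and the toy IDs are illustrative assumptions, not the tokenizer's implementation; only the pad ID 50256 (GPT-2's end-of-text token) comes from the real vocabulary.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id):
    """Cut a sequence to max_length, then right-pad it and build the attention mask."""
    ids = token_ids[:max_length]              # truncation keeps the first max_length tokens
    n_pad = max_length - len(ids)             # number of padding slots still to fill
    return {
        "input_ids": ids + [pad_token_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


PAD_ID = 50256  # GPT-2's end-of-text token, reused as the pad token

short = pad_or_truncate([10, 11, 12], max_length=5, pad_token_id=PAD_ID)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_token_id=PAD_ID)

print(short["input_ids"])       # [10, 11, 12, 50256, 50256]
print(short["attention_mask"])  # [1, 1, 1, 0, 0]
print(long["input_ids"])        # [1, 2, 3, 4, 5]
```

An empty input row would produce a sequence made entirely of padding, which is why the evaluation code later skips all-padding examples.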


In [5]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # no-op here: the pad token reuses EOS, so the vocabulary size is unchanged

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


## Step 5: Loading GPT-2 Model and Data Collator

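We load the **GPT-2 language model** with `GPT2LMHeadModel`, the variant with a language-modeling head used for **causal language modeling**, that is, predicting the next token.

We call `resize_token_embeddings(len(tokenizer))` to keep the embedding matrix aligned with the tokenizer vocabulary. Because the pad token reuses the existing EOS token, no new token was added, so the call leaves the embedding size unchanged here. It only matters when a new special token is added to the tokenizer.

We also set up a `DataCollatorForLanguageModeling`:
- It stacks examples into batches and creates `labels` from `input_ids`, excluding padding positions from the loss. The inputs were already padded to 128 tokens in Step 4, so no further padding is needed.
- We set `mlm=False` because we're doing **causal** (next-token) prediction, not masked language modeling like BERT.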
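The label construction for causal language modeling can be sketched in a few lines. This is a simplified pure-Python illustration of the idea, not the collator's actual code: the helper `make_causal_lm_labels` and the toy token IDs are our own, while `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy the inputs as labels, masking out every padding position."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS_ID = 50256  # GPT-2's end-of-text token, reused as the pad token in this notebook

example = [464, 3290, 318, EOS_ID, EOS_ID]  # three toy tokens followed by padding
labels = make_causal_lm_labels(example, pad_token_id=EOS_ID)
print(labels)  # [464, 3290, 318, -100, -100]
```

Two details are worth noting. The labels are not shifted here; the model shifts them internally so that position `t` is scored against token `t+1`. And because padding and EOS share one ID, any genuine EOS token in the data would be masked out of the loss as well.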


In [6]:
training_args = TrainingArguments(
    output_dir="./gpt2-nextword",
    eval_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,  
    weight_decay=0.01,
    save_strategy="no",
    logging_steps=100,
    push_to_hub=False,
    report_to="none",
    fp16=torch.cuda.is_available(),  
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.1603        | 3.339974        |
| 2     | 2.8422        | 3.323588        |


TrainOutput(global_step=18360, training_loss=3.126224422247062, metrics={'train_runtime': 2037.4134, 'train_samples_per_second': 36.044, 'train_steps_per_second': 9.011, 'total_flos': 4797060415488000.0, 'train_loss': 3.126224422247062, 'epoch': 2.0})

## Step 6: Fine-Tuning GPT-2 using Hugging Face Trainer

We set up `TrainingArguments` to define the training configuration:
- `per_device_train_batch_size=4`: Keeps memory use low on Kaggle.
- `num_train_epochs=2`: Two passes over the training split keep the run short (about 34 minutes here).
- `eval_strategy="epoch"`: Evaluates on the validation split after each epoch.
- `fp16=torch.cuda.is_available()`: Enables mixed-precision training when a GPU is available, which speeds up training.

We use the Hugging Face `Trainer` to:
- Train the GPT-2 model on the tokenized `train` dataset.
- Evaluate it on the `validation` split.
- Handle batching and label creation using the data collator.
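As a sanity check on the configuration, the number of optimizer steps reported by `trainer.train()` can be reproduced by hand. The sketch below assumes a single device and no gradient accumulation, which matches the settings shown in this notebook; the helper name `total_training_steps` is our own.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be partial
    return steps_per_epoch * num_epochs


# 36,718 training rows, batch size 4, 2 epochs
steps = total_training_steps(num_examples=36718, batch_size=4, num_epochs=2)
print(steps)  # 18360
```

This agrees with `global_step=18360` in the training output.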


In [7]:
import math

# Perplexity
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")


Perplexity: 27.76


## Step 7: Evaluation – Perplexity

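**Perplexity** is a standard metric for language modeling; lower values mean better predictions.

We take the evaluation loss returned by `trainer.evaluate()`, which is the average cross-entropy per predicted token, and apply `math.exp()` to convert it into perplexity:

$$
\text{Perplexity} = e^{\text{Loss}}
$$

Perplexity can be read as the effective number of equally likely tokens the model is choosing among at each step, so the closer the value is to 1, the better the predictions. With a validation loss of 3.3236, we get $e^{3.3236} \approx 27.76$. This is measured per GPT-2 subword token, so it is not directly comparable to word-level perplexities reported for WikiText-2.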
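The relationship between per-token probabilities, loss, and perplexity can be checked on a toy case. This is a minimal sketch with made-up probabilities; `perplexity_from_probs` is a hypothetical helper, and only the loss value 3.323588 comes from the training output above.

```python
import math


def perplexity_from_probs(true_token_probs):
    """Perplexity = exp(mean negative log-probability assigned to the true tokens)."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always spreads its probability evenly over 4 tokens has perplexity 4
print(round(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0

# Reproducing the notebook's figure from the epoch-2 validation loss
print(f"{math.exp(3.323588):.2f}")  # 27.76
```

The first case shows why perplexity is often described as an effective branching factor.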


In [8]:
def compute_top_k_accuracy_fixed(model, tokenizer, dataset, k=5, num_samples=100):
    """Print the share of non-padding positions whose true next token is in the model's top-k predictions."""
    model.eval()
    correct = 0
    total = 0

    for i in range(min(num_samples, len(dataset))):
        inputs = torch.tensor(dataset[i]["input_ids"]).unsqueeze(0).to(model.device)

        # Skip if input is all padding
        if torch.all(inputs == tokenizer.pad_token_id):
            continue

        with torch.no_grad():
            outputs = model(inputs)
            logits = outputs.logits

        # Shift logits and labels so that the prediction for token t is compared to token t+1
        shift_logits = logits[:, :-1, :]
        shift_labels = inputs[:, 1:]

        # Getting top-k predictions
        top_k_preds = torch.topk(shift_logits, k, dim=-1).indices

        # Checking if actual next token is in top-k predictions
        for pos in range(shift_labels.shape[1]):
            label = shift_labels[0, pos]
            if label != tokenizer.pad_token_id:
                if label in top_k_preds[0, pos]:
                    correct += 1
                total += 1

    accuracy = correct / total if total > 0 else 0
    print(f"Top-{k} Accuracy : {accuracy * 100:.2f}%")

compute_top_k_accuracy_fixed(model, tokenizer, tokenized_datasets["validation"], k=5, num_samples=100)


Top-5 Accuracy : 62.82%


## Step 7 (continued): Evaluation – Top-5 Accuracy (Improved Method)

Our initial implementation of Top-k accuracy returned 0%, likely due to comparing padded or out-of-bounds tokens.

We now use a **more robust method**:
- The model predicts the next token for each position in the sequence.
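- We shift the logits and labels so that the prediction at position `t` is compared with the actual token at `t+1`.
- We check whether the **actual next token is in the top-5 predictions** at each step.
- Padding positions are ignored, and examples that consist only of padding are skipped.

This yields a **Top-5 Accuracy of 62.82%**: the true next token is among the model's top 5 predictions at about 63% of positions. The figure is computed on the first 100 examples of the validation split (`num_samples=100`) and counts GPT-2 subword tokens rather than whole words, so it is an estimate from a small sample rather than a full-split result.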
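The shift-and-compare logic is easier to follow on a tiny example with a four-token vocabulary. The following is an illustrative pure-Python sketch, not the notebook's implementation: the function `top_k_accuracy`, the logits, and the token IDs are all made up for the demonstration.

```python
def top_k_accuracy(logits_per_position, token_ids, k, pad_token_id):
    """Fraction of non-padding next tokens that fall within the top-k logits."""
    correct = total = 0
    # Logits at position t predict token t+1, so pair logits[:-1] with tokens[1:]
    for logits, target in zip(logits_per_position[:-1], token_ids[1:]):
        if target == pad_token_id:
            continue  # padding targets are not scored
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        correct += target in ranked[:k]
        total += 1
    return correct / total if total else 0.0


PAD = 0
tokens = [3, 1, 2, PAD]
logits = [
    [0.1, 2.0, 1.5, 0.3],  # after token 3: top-2 is {1, 2}, target 1 is a hit
    [0.2, 1.0, 0.1, 3.0],  # after token 1: top-2 is {3, 1}, target 2 is a miss
    [5.0, 0.1, 0.2, 0.3],  # after token 2: target is PAD, so it is ignored
    [0.0, 0.0, 0.0, 0.0],  # last position has no next token to compare against
]

print(top_k_accuracy(logits, tokens, k=2, pad_token_id=PAD))  # 0.5
```

One hit out of two scored positions gives 0.5; the padded target and the final position do not enter the count.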


In [9]:
model.save_pretrained("./gpt2-nextword-model")
tokenizer.save_pretrained("./gpt2-nextword-model")

!zip -r /kaggle/working/gpt2-nextword-model.zip ./gpt2-nextword-model > /dev/null


## Step 8: Saving the Fine-Tuned Model for Download

To preserve and download the trained model:
- We save the fine-tuned GPT-2 model and tokenizer using `save_pretrained()`.
- We zip the entire folder into `/kaggle/working/`, Kaggle's output directory, from which files can be downloaded.
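If the `zip` command is not available, the same archive can be produced from Python with the standard library. This is a sketch of an alternative, not what the notebook ran: `zip_directory` is a hypothetical helper, and the demonstration uses a throwaway directory with a stand-in file rather than the real model folder.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_directory(source_dir, archive_base):
    """Zip the contents of source_dir into '<archive_base>.zip' and return its path."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(source_dir))


# Demonstration on a temporary directory with a placeholder file
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "gpt2-nextword-model"
    model_dir.mkdir()
    (model_dir / "config.json").write_text("{}")

    archive_path = zip_directory(model_dir, Path(tmp) / "gpt2-nextword-model")
    with zipfile.ZipFile(archive_path) as zf:
        print(zf.namelist())
```

Unlike `zip -r` on the folder, this places the files at the root of the archive rather than under a `gpt2-nextword-model/` prefix.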


# Final Project Summary – Next Word Predictor using Transformers

## Objective
To build and evaluate a next-word predictor using a fine-tuned GPT-2 transformer model. This foundational NLP task enables autocomplete, chatbots, and intelligent writing assistants.

## Model & Dataset
- **Model**: GPT-2 (`GPT2LMHeadModel`) from Hugging Face.
- **Tokenizer**: GPT-2 tokenizer with the pad token set to the EOS token.
- **Dataset**: `wikitext-2-raw-v1` – Wikipedia corpus, loaded via Hugging Face `datasets`.

## Training Details
- Fine-tuned GPT-2 using the `Trainer` API.
- Epochs: 2
- Batch Size: 4
- Padding: Fixed length of 128 tokens (`padding="max_length"` at tokenization); `DataCollatorForLanguageModeling` builds the labels
- Loss: Causal Language Modeling (next-token prediction)

## Evaluation Results
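- **Perplexity**: 27.76  
  Measured per subword token on the validation split; on average the model is about as uncertain as if it were choosing among roughly 28 equally likely tokens.
- **Top-5 Accuracy**: 62.82%  
  The correct next token appears in the top-5 predictions ~63% of the time, measured on the first 100 validation examples.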

## Key Learnings
- Fine-tuning pretrained models like GPT-2 can be efficient even with small corpora; two epochs on WikiText-2 took about 34 minutes here.
- Perplexity gives insight into how "surprised" the model is by the true next token.
- Top-k accuracy reflects practical autocomplete performance, where a user is shown several suggestions at once.

## Conclusion
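The project demonstrates a working pipeline for next-word prediction using a transformer model, from data loading through fine-tuning, evaluation, and export. With a validation perplexity of 27.76 and a Top-5 accuracy of about 63% on a validation sample, the model is a reasonable starting point for applications like typing assistants, writing tools, or chat interfaces.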
