# GPT-2 Fine-Tuning Using Hugging Face Transformers

# Objective

The objective of this project is to **fine-tune the GPT-2 language model** on a custom dataset (extracted from a PDF) using the Hugging Face Transformers library. By doing this, we aim to:

- **Customize GPT-2’s knowledge** to a specific domain or document (e.g., a textbook, manual, or company document).
- **Enable accurate and context-aware question answering** from the fine-tuned model.
- **Build a foundation for a domain-specific chatbot or assistant** that understands the context of the PDF.
- Learn and demonstrate how to implement **end-to-end fine-tuning using Hugging Face’s `Trainer` API**.

This fine-tuned model can later be used for:
- Chatbots
- QA systems
- Summarization tools
- Document-based assistants

The aim is **better performance and relevance** on the target document than a generic GPT-2 model provides.


## What is GPT-2?
- **GPT-2 (Generative Pretrained Transformer 2)** is an open-source language model developed by OpenAI.
- It is based on the **Transformer decoder architecture**.
- GPT-2 is a **pretrained model**: it was trained on a massive amount of text data to understand and generate human-like text.
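
In a decoder-only Transformer, each position may attend only to itself and to earlier positions, which is what allows training on next-token prediction. A minimal sketch of that causal mask in plain Python follows; the helper name `causal_mask` is illustrative and not part of any library.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """Return a seq_len x seq_len mask; entry [i][j] is True when
    position i is allowed to attend to position j (that is, j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Print the mask for a 4-token sequence: "x" = visible, "." = hidden
for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each printed row shows which earlier tokens a given token can see; the upper-right triangle stays hidden.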

---

## What is Fine-Tuning?
- Fine-tuning is the process of **further training a pretrained model on a specific dataset**.
- This adapts the general language understanding of the model to a **specific task** (e.g., answering questions from a PDF, chatbot, summarization).
- You **don’t train from scratch**, but build on top of existing knowledge.

### Benefits of Fine-Tuning:
- Saves **computation time** and **resources** compared with training from scratch.
- Typically achieves **higher accuracy** on domain-specific tasks than the generic pretrained model.
- Easy to implement using high-level APIs like Hugging Face’s `Trainer`.
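
The idea of building on existing knowledge can be illustrated with a toy problem. The sketch below is not GPT-2: it fits a single weight with gradient descent, and the data, the starting weights, and the helper names (`mse`, `gradient_descent`) are all hypothetical. It only shows why starting near a good solution needs fewer updates than starting from zero.

```python
def mse(w: float, data: list[tuple[float, float]]) -> float:
    """Mean squared error of the model y = w * x on the data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def gradient_descent(w: float, data: list[tuple[float, float]],
                     steps: int, lr: float = 0.05) -> float:
    """Run a few full-batch gradient descent steps and return the new weight."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Hypothetical "domain" data generated by y = 2.2 * x
domain_data = [(x, 2.2 * x) for x in (1.0, 2.0, 3.0)]

pretrained_w = 2.0  # stands in for weights learned on a broad, related task
scratch_w = 0.0     # stands in for an untrained model

fine_tuned = gradient_descent(pretrained_w, domain_data, steps=3)
from_scratch = gradient_descent(scratch_w, domain_data, steps=3)

print("fine-tuned loss:  ", mse(fine_tuned, domain_data))
print("from-scratch loss:", mse(from_scratch, domain_data))
```

With the same three updates, the run that starts from the "pretrained" weight ends with the lower loss.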

---

## Hugging Face Transformers Library
- Hugging Face `transformers` is a **popular open-source library** for working with NLP models like GPT, BERT, T5, etc.
- It provides:
  - Pretrained models via `AutoModel`, `GPT2LMHeadModel`, etc.
  - Tokenizers
  - Training utilities like `Trainer` and `TrainingArguments`.

---

## What is the Hugging Face `Trainer`?
The `Trainer` class simplifies the process of training and fine-tuning models.

### Key Features:
- Handles **training loops**, **evaluation**, **saving checkpoints**, and **logging** automatically.
- Accepts **custom datasets**, such as a PyTorch `Dataset` or a Hugging Face `datasets.Dataset` (for example, one returned by `datasets.load_dataset`).
- Allows configuration using `TrainingArguments`.
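
To make the bookkeeping in the list above concrete, here is a toy loop in plain Python that trains a one-parameter model while logging and checkpointing at fixed step intervals. It is only a sketch of the kind of work a training utility automates, not the `Trainer` implementation; `toy_train` and its parameters are hypothetical names chosen to echo `logging_steps` and `save_steps`.

```python
def toy_train(data, epochs=3, lr=0.05, logging_steps=2, save_steps=3):
    """Fit y = w * x with per-example SGD while recording logs and checkpoints."""
    w = 0.0
    step = 0
    window = []        # losses seen since the last log entry
    logs = []          # (step, mean loss over the window)
    checkpoints = {}   # step -> parameter value at that step
    for _ in range(epochs):
        for x, y in data:
            error = w * x - y
            window.append(error ** 2)
            w -= lr * 2 * error * x  # gradient step on the squared error
            step += 1
            if step % logging_steps == 0:
                logs.append((step, sum(window) / len(window)))
                window = []
            if step % save_steps == 0:
                checkpoints[step] = w
    return w, logs, checkpoints


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2 * x
w, logs, checkpoints = toy_train(data)

for step, loss in logs:
    print(f"step {step}: mean loss {loss:.4f}")
print("checkpoint steps:", sorted(checkpoints))
```

The logged value is the average loss since the previous log entry, and checkpoints are snapshots of the parameters taken at regular steps.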

### Example Components Used in Trainer:
- `model`: Your GPT-2 model (`GPT2LMHeadModel`).
- `train_dataset`: Your dataset in tokenized format.
- `tokenizer`: Tokenizer to convert text into input IDs.
- `args`: TrainingArguments (e.g., learning rate, output directory, batch size).

---

## Training Process Summary

### 1. **Dataset Preparation**
- Extracted the text from a PDF file using PyPDF2.
- Cleaned the text by collapsing whitespace, then split it into sentences.
- Wrapped each sentence in a question/answer template and loaded the results into a Hugging Face `Dataset`.
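
Chunking can take several forms. As a minimal sketch, assuming fixed-size word windows with a small overlap suit the document, the helper below splits cleaned text into overlapping chunks. The name `chunk_words` and its `chunk_size` and `overlap` parameters are illustrative and do not come from the notebook.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap keeps some shared context between neighbouring chunks, at the cost of repeating a few words.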

### 2. **Tokenizer**
- Used `GPT2Tokenizer` to tokenize the text data.
- Added padding and truncation for consistency in input sizes.
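
Padding and truncation can be pictured without the tokenizer itself. The sketch below brings a list of token IDs to a fixed length and builds the matching attention mask, using GPT-2's end-of-text ID (50256) as the padding ID, which mirrors setting `pad_token` to `eos_token`. The helper `pad_or_truncate` and the sample IDs are hypothetical.

```python
GPT2_EOS_ID = 50256  # GPT-2's end-of-text token ID, reused here as the padding ID


def pad_or_truncate(input_ids, max_length, pad_id=GPT2_EOS_ID):
    """Return (ids, attention_mask), both exactly `max_length` long.

    Sequences that are too long are cut on the right; short ones are
    padded on the right, and the mask marks real tokens with 1.
    """
    ids = list(input_ids[:max_length])
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding


short_ids, short_mask = pad_or_truncate([10, 11, 12], max_length=5)
long_ids, long_mask = pad_or_truncate([10, 11, 12, 13, 14, 15], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model tell real tokens apart from padding, since the padding ID here is also a legitimate token.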

### 3. **Model Initialization**
- Loaded `GPT2LMHeadModel` for language modeling.
- The `Trainer` puts the model in training mode itself, so no explicit `.train()` call is needed.

### 4. **Trainer Setup**
- Defined `TrainingArguments`: epochs, batch size, logging, and checkpoint saving.
- Initialized the `Trainer` with model, tokenizer, dataset, and arguments.
- Called `trainer.train()` to begin fine-tuning.

---

## Saving and Loading Fine-Tuned Model
- After training, the model and tokenizer were saved using:
  ```python
  trainer.save_model(output_dir)
  tokenizer.save_pretrained(output_dir)
  ```

# ✅ Step 1: Install Libraries

In [65]:
!pip install -q transformers datasets peft accelerate PyPDF2

# ✅ Step 2: Imports

In [66]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import Dataset
import torch
import PyPDF2


# ✅ Step 3: Prepare Dataset

In [67]:
import PyPDF2
import re

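def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, one page per line."""
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        # Guard against pages that yield no text (e.g. image-only pages)
        pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages) + "\n"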

# Extract and clean the text
raw_text = extract_text_from_pdf("/content/attention.pdf")
cleaned_text = re.sub(r"\s+", " ", raw_text).strip()

# Preview
print(cleaned_text[:1000])  # Print first 1000 characters


Attention Is All You Need Ashish Vaswani Google Brain avaswani@google.comNoam Shazeer Google Brain noam@google.comNiki Parmar Google Research nikip@google.comJakob Uszkoreit Google Research usz@google.com Llion Jones Google Research llion@google.comAidan N. Gomezy University of Toronto aidan@cs.toronto.eduŁukasz Kaiser Google Brain lukaszkaiser@google.com Illia Polosukhinz illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring signiﬁcantly less time to train. Our model achiev

# ✅ Step 4: Auto-create basic Q&A pairs from knowledge

In [68]:
# Split the cleaned text into sentences and wrap each one in a Q&A template
sentences = re.split(r'(?<=[.?!])\s+', cleaned_text)
qa_data = []

for i, sentence in enumerate(sentences):
    if len(sentence) < 30:
        continue
    qa_data.append({
        "text": f"Question: What does the document say in point {i+1}?\nAnswer: {sentence}"
    })
    if len(qa_data) >= 30:  # Limit size for demo
        break


# ✅ Step 5: Load GPT-2 and Tokenizer

In [69]:
# Build a Hugging Face Dataset from the list of {"text": ...} records
dataset = Dataset.from_list(qa_data)

# Load the pretrained tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so pad with EOS
model = GPT2LMHeadModel.from_pretrained("gpt2")

# ✅ Step 6: Tokenize the Dataset

In [70]:
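def tokenize_function(examples):
    # Pad or truncate every example to 512 tokens so all inputs share one shape
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    # For causal LM training the labels are the input IDs; the model shifts them internally
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens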

tokenized_dataset = dataset.map(tokenize_function, batched=True)

tokenized_dataset[0].keys()  # should include 'input_ids', 'attention_mask', 'labels'




dict_keys(['text', 'input_ids', 'attention_mask', 'labels'])
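
Because every example is padded to 512 tokens and the labels are a straight copy of the input IDs, padded positions also count toward the loss. A common refinement, which the notebook does not apply, is to set the labels at padded positions to `-100`, the value PyTorch's cross-entropy loss ignores by default. The sketch below shows that masking step on plain lists; `mask_padding_labels` and the sample IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX.

    Masking by attention_mask rather than by token ID keeps a genuine
    EOS token in the text as a label even though padding reuses the EOS ID.
    """
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


def mask_batch(batch_input_ids, batch_attention_mask):
    """Apply mask_padding_labels to every example in a batch."""
    return [mask_padding_labels(ids, mask)
            for ids, mask in zip(batch_input_ids, batch_attention_mask)]


batch_ids = [[10, 11, 50256, 50256], [20, 21, 22, 50256]]
batch_mask = [[1, 1, 0, 0], [1, 1, 1, 1]]  # second example ends with a real EOS
print(mask_batch(batch_ids, batch_mask))
```

Inside a batched tokenization function, the returned lists could be assigned to `tokens["labels"]` in place of the plain copy.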

# ✅ Step 7: Training Arguments and train the model

In [71]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    save_steps=100,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",   # disable external loggers such as wandb
    no_cuda=True        # force CPU training (newer transformers releases use `use_cpu=True` instead)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,  # newer transformers releases name this argument `processing_class`
)

trainer.train()




| Step | Training Loss |
|------|---------------|
| 10 | 2.2749 |
| 20 | 0.4163 |
| 30 | 0.4049 |
| 40 | 0.3216 |
| 50 | 0.2734 |
| 60 | 0.2900 |
| 70 | 0.2395 |
| 80 | 0.2356 |
| 90 | 0.2713 |


TrainOutput(global_step=90, training_loss=0.5252676486968995, metrics={'train_runtime': 934.8788, 'train_samples_per_second': 0.096, 'train_steps_per_second': 0.096, 'total_flos': 23516282880000.0, 'train_loss': 0.5252676486968995, 'epoch': 3.0})
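
The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, and takes the dataset size (30), epochs (3), batch size (1), and logging interval (10) from the cells above. It recomputes the total step count and compares the mean of the logged losses with the reported `training_loss`.

```python
import math

num_examples = 30    # size of qa_data
epochs = 3           # num_train_epochs
batch_size = 1       # per_device_train_batch_size
logging_steps = 10

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

# Losses copied from the training log, one per logging window
logged_losses = [2.2749, 0.4163, 0.4049, 0.3216, 0.2734, 0.29, 0.2395, 0.2356, 0.2713]
mean_logged_loss = sum(logged_losses) / len(logged_losses)
reported_training_loss = 0.5252676486968995

print("total steps:", total_steps)
print("logged at:", logged_steps)
print("mean of logged losses:", mean_logged_loss)
print("difference from reported:", abs(mean_logged_loss - reported_training_loss))
```

The step count agrees with `global_step=90`, and because each logged value averages an equal-sized window of 10 steps, their mean lands close to the overall `training_loss`.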

In [73]:
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned"
)


# ✅ Step 9: Locate the Saved Checkpoint

In [74]:
import os
os.listdir("./gpt2-finetuned")


['checkpoint-90']

# ✅ Step 10: Load model from checkpoint

In [75]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # still use base tokenizer unless you customized it

# Load model from checkpoint
model = GPT2LMHeadModel.from_pretrained("./gpt2-finetuned/checkpoint-90")
model.eval()


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

# ✅ Step 11: Generate Text with the Fine-Tuned Model

In [82]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("./gpt2-finetuned/checkpoint-90")

# Set pad_token to eos_token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

# Encode input text
input_text = " what is attention mechanism,"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True
)

# Generate with attention_mask and pad_token_id
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=100,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))


 what is attention mechanism, what is the difference between these two approaches, and does this work?
Answer: We propose an approach using the network that is an in-memory model-driven model translation, as well as a recurrent convolutional neural network implemented using recurrent recurrent neural networks and recurrent recurrent neural networks.
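
The training examples all followed the `Question: ...\nAnswer: ...` template, while the prompt above is free-form. Shaping the prompt like the training data may give the model a more familiar starting point, though with only 30 templated examples the answers should still be treated with caution. The helpers below are a hypothetical sketch of building such a prompt and pulling the answer out of the generated text; `build_prompt` and `extract_answer` are not part of the notebook.

```python
def build_prompt(question: str) -> str:
    """Format a question with the same template used for the training examples."""
    return f"Question: {question.strip()}\nAnswer:"


def extract_answer(generated_text: str) -> str:
    """Return the text after the first 'Answer:' marker, or the whole text if absent."""
    _, marker, answer = generated_text.partition("Answer:")
    return answer.strip() if marker else generated_text.strip()


prompt = build_prompt(" What is the attention mechanism? ")
print(repr(prompt))

# Example with a made-up generation, only to show the parsing step
fake_generation = prompt + " It relates positions of a sequence to each other."
print(extract_answer(fake_generation))
```

The resulting `prompt` string could be passed to the tokenizer in place of `input_text`, and `extract_answer` applied to the decoded output.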
