# Fine-Tuning GPT-2 Manually with PyTorch and Hugging Face Transformers

This notebook is the **first phase** of a multi-stage project demonstrating my end-to-end skills in fine-tuning and aligning Large Language Models (LLMs) for a specific domain (e.g., code generation/problem-solving).

The project plan covers several LLM engineering paradigms:

1.  **Fine-Tune Manually (This Lab):** I will start with the base GPT-2 model and manually implement the entire training loop using native **PyTorch**. This showcases an understanding of the underlying mechanics of training, including manual setup of the optimizer, scheduler, and the forward/backward pass.
2.  **Upload to Hugging Face (This Lab):** I will save and upload the manually fine-tuned model and tokenizer to my Hugging Face Hub account, making it publicly accessible.
3.  **Refined Fine-Tuning (Next Lab):** In a separate notebook, I will load this fine-tuned model from the Hub and further train it using the streamlined **Hugging Face `transformers` `Trainer` class**. This demonstrates proficiency with high-level, production-ready tools.
4.  **Alignment using DPO (Final Lab):** Finally, in a third lab, I will take the model from the previous step and apply **Direct Preference Optimization (DPO)**, a preference-based alignment technique that does not need an explicit reinforcement learning loop, to align its outputs with human preferences (e.g., for better code quality or safety).

### 1. Project Setup: Installing Dependencies

I'm installing the necessary libraries, including the `transformers` library for the GPT-2 model and tokenizer, `datasets` for efficient data handling, and `accelerate` for easier multi-GPU/device management.

In [None]:
!pip install transformers datasets accelerate -U

### 2. Version Control Setup

I'm ensuring Git LFS (Large File Storage) is installed. This is a crucial preparatory step for handling large model files when pushing them to the Hugging Face Hub.

In [None]:
!apt install git-lfs

### 3. Authenticating with Hugging Face Hub

To upload my fine-tuned model and tokenizer later, I need to log into my Hugging Face account securely via the CLI.

In [None]:
!huggingface-cli login

### 4. Data Acquisition and Initial Load

I'm using the `datasets` library to load my custom coding problem dataset, which is currently stored in a CSV file.

In [None]:
from datasets import load_dataset

# Load the dataset from the uploaded file path
ds = load_dataset("csv", data_files="data/coding_dataset_1785725.csv", split="train")

### 5. Custom Data Formatting for Causal Language Modeling (CLM)

I define a function that formats each dataset row into a single `text` column, with the sections separated by a `###` marker. Causal Language Modeling trains on plain text sequences, so putting the problem first and the solution and reasoning after it teaches the model to generate the latter two given a problem statement.

In [None]:
def format_data(example):
  # Concatenate all relevant columns into a single string for CLM
  text = f"Problem: {example['problem_statement']}\n###Solution: {example['solution']}\n###Reasoning: {example['reasoning']}"
  return {"text": text}

# Apply the formatting function
formatted = ds.map(format_data, remove_columns=ds.column_names)
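
To make the template concrete, here is a small sketch that applies the same format string to a single made-up row. The row contents and the name `format_row` are hypothetical; only the three column names come from the function above.

```python
# Hypothetical row; the column names match those used by format_data above.
example_row = {
    "problem_statement": "Return the square of an integer n.",
    "solution": "def square(n):\n    return n * n",
    "reasoning": "Multiplying n by itself gives its square.",
}

def format_row(example):
    """Same template as format_data: three sections separated by '###'."""
    text = (
        f"Problem: {example['problem_statement']}\n"
        f"###Solution: {example['solution']}\n"
        f"###Reasoning: {example['reasoning']}"
    )
    return {"text": text}

formatted_row = format_row(example_row)
print(formatted_row["text"])
```

Printing one formatted row like this is a cheap way to check that the markers land where expected before mapping over the whole dataset.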

### 6. Creating Train and Validation Subsets

I'm performing a train/test split (80/20) and then selecting a small, manageable subset of the data (2500 training examples and 500 validation examples) for efficient local fine-tuning.

In [None]:
# `formatted` is a single Dataset (loaded with split="train"), so split it directly
formatted_split = formatted.train_test_split(test_size=0.2)

# Select a subset of the split datasets for demonstration purposes
train_size = 2500
test_size = 500

# Ensure the selected size does not exceed the actual size of the split
train_subset_size = min(train_size, len(formatted_split["train"]))
test_subset_size = min(test_size, len(formatted_split["test"]))

formatted_split["train"] = formatted_split["train"].select(range(train_subset_size))
formatted_split["test"] = formatted_split["test"].select(range(test_subset_size))

print(formatted_split)

### 7. Tokenization and Input Preparation

I load the `gpt2` tokenizer, set the padding token to the EOS token (GPT-2 ships without a padding token), and tokenize the dataset. This creates the `input_ids` and `attention_mask` columns required by the model, preparing the data for the PyTorch loop.

In [None]:
from transformers import AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
  return tokenizer(
    examples["text"],
    truncation=True,
    padding="max_length",
    max_length=512
  )

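# `formatted_split` is a DatasetDict; the only column left to drop is "text"
tokenized_ds = formatted_split.map(tokenize_function, batched=True, remove_columns=["text"])

# Return PyTorch tensors so that default_collate can stack examples into batches
tokenized_ds.set_format(type="torch", columns=["input_ids", "attention_mask"])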

### 8. Environment and Device Setup (PyTorch Manual Setup Phase)

I'm explicitly setting up the environment so that PyTorch uses a GPU (`cuda`) when one is available and falls back to the CPU otherwise.

In [None]:
import torch
from transformers import AutoModelForCausalLM
import torch.nn.functional as F
from accelerate import Accelerator

# Check for CUDA and set device
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

### 9. Initializing the Pre-trained GPT-2 Model

I'm loading the base `gpt2` model for Causal Language Modeling and moving it to the selected device (GPU). This is the core PyTorch model object that will be fine-tuned.

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)

### 10. Creating PyTorch DataLoaders

For manual PyTorch training, I wrap the tokenized Hugging Face `datasets` objects in PyTorch `DataLoader` objects. This utility handles batching and shuffling, and can optionally load data in multiple worker processes.

In [None]:
from torch.utils.data import DataLoader, default_collate

batch_size = 4

# Use default_collate for standard tensor stacking
train_dataloader = DataLoader(
    tokenized_ds["train"], 
    shuffle=True, 
    batch_size=batch_size, 
    collate_fn=default_collate
    )
eval_dataloader = DataLoader(
    tokenized_ds["test"], 
    shuffle=False, 
    batch_size=batch_size, 
    collate_fn=default_collate
    )
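
`default_collate` only stacks examples into a `(batch, seq_len)` tensor when each field is already a tensor. The sketch below uses two tiny made-up rows to show the difference; one way to get tensors out of a `datasets` object is `set_format(type="torch")`.

```python
import torch
from torch.utils.data import default_collate

# Two toy examples whose fields are plain Python lists (hypothetical values)
rows_as_lists = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5, 6]}]
batch_from_lists = default_collate(rows_as_lists)

# The same examples with tensor fields
rows_as_tensors = [{"input_ids": torch.tensor(r["input_ids"])} for r in rows_as_lists]
batch_from_tensors = default_collate(rows_as_tensors)

# Lists are collated position by position into a list of per-position tensors,
# which has no .to(device) method
print(type(batch_from_lists["input_ids"]))

# Tensors are stacked along a new batch dimension
print(batch_from_tensors["input_ids"].shape)
```

If a loop fails on `batch["input_ids"].to(device)`, checking the type of one batch this way is a quick diagnostic.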

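### 11. Setting Up Optimization Components

This is a core manual step: I define the **AdamW optimizer** (standard for transformers) and a **linear learning rate scheduler**. The scheduler supports a warm-up phase, but this run uses zero warm-up steps, so the learning rate simply decays linearly from its initial value to zero. Together these control how the model's weights are updated and how the learning rate changes over time.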

In [None]:
import torch.optim as optim
from transformers import get_scheduler

learning_rate = 5e-5
optimizer = optim.AdamW(model.parameters(), lr=learning_rate)

num_epochs = 1  # must match the value passed to train_model below
num_training_steps = len(train_dataloader) * num_epochs
scheduler = get_scheduler(
    name="linear", 
    optimizer=optimizer, 
    num_warmup_steps=0, 
    num_training_steps=num_training_steps
    )
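
As a rough illustration of what a linear schedule does, the sketch below computes the learning-rate multiplier by hand. The helper name `linear_lr_factor` is hypothetical, and the step count assumes the full 2,500 training examples are available at batch size 4.

```python
import math

def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warm-up
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
steps_per_epoch = math.ceil(2500 / 4)  # assumed: 2500 examples, batch size 4

for step in (0, steps_per_epoch // 2, steps_per_epoch):
    print(step, base_lr * linear_lr_factor(step, 0, steps_per_epoch))
```

With zero warm-up steps the multiplier starts at 1.0 and reaches 0.0 on the last step, so the final batches are trained with a learning rate close to zero.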

### 12. Defining the Evaluation Function

I define a helper function to calculate the average loss on the validation set after each training epoch. It runs the forward pass under `torch.no_grad()`, which disables gradient tracking and so saves memory and compute during evaluation.

In [None]:
def evaluate_model(model, eval_dataloader):
    model.eval()
    total_loss = 0
    for batch in eval_dataloader:
        # Move batch tensors to device
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
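        # Labels are a copy of the inputs; the model shifts them internally.
        # Padded positions are not masked, so they count toward the loss.
        labels = input_ids.clone()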

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        
        total_loss += outputs.loss.item()

    avg_loss = total_loss / len(eval_dataloader)
    print(f"Evaluation Loss: {avg_loss:.4f}")
    return avg_loss

### 13. Implementing the Manual PyTorch Training Loop

This function is the core of the manual fine-tuning. It contains the essential steps: iterating through batches, calculating loss, performing the **backward pass** (`loss.backward()`), and updating weights (`optimizer.step()`), explicitly demonstrating low-level PyTorch control.

In [None]:
def train_model(model, train_dataloader, eval_dataloader, optimizer, scheduler, num_epochs=3):
    model.train()
    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}")
        total_loss = 0
        for batch_idx, batch in enumerate(train_dataloader):
            optimizer.zero_grad()
            
            # Move batch tensors to device
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
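            # Labels are a copy of the inputs; the model shifts them internally.
            # Padded positions are not masked, so they count toward the loss.
            labels = input_ids.clone()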

            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

            # Backward pass
            loss.backward()
            
            # Optimization step
            optimizer.step()
            scheduler.step()
            
            if (batch_idx + 1) % 100 == 0:
                print(f"  Batch {batch_idx + 1}/{len(train_dataloader)} Loss: {loss.item():.4f}")

        avg_train_loss = total_loss / len(train_dataloader)
        print(f"Average Training Loss for Epoch {epoch + 1}: {avg_train_loss:.4f}")
        
        # Evaluate after each epoch
        evaluate_model(model, eval_dataloader)
        model.train() # Set back to train mode
    
    print("Manual fine-tuning complete.")
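
Because every example is padded to 512 tokens and the labels are a plain copy of `input_ids`, padded positions contribute to the loss. A common alternative, sketched here under the assumption that `attention_mask` is 0 exactly on padding, is to set those label positions to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The helper name and the toy token IDs are hypothetical.

```python
import torch

def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels and blank out positions the mask marks as padding."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = ignore_index
    return labels

# Toy batch: three real tokens followed by two padding tokens (EOS id 50256 in GPT-2)
input_ids = torch.tensor([[11, 12, 13, 50256, 50256]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

labels = mask_padding_labels(input_ids, attention_mask)
print(labels.tolist())  # [[11, 12, 13, -100, -100]]
```

Swapping this into the loops would change the reported loss values, since only real tokens would then be averaged.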

### 14. Executing the Manual Fine-Tuning

I initiate the training process. I'm starting with a single epoch to quickly demonstrate the model's domain adaptation and my manual implementation skills.

In [None]:
num_epochs = 1
train_model(model=model, train_dataloader=train_dataloader, eval_dataloader=eval_dataloader, optimizer=optimizer, scheduler=scheduler, num_epochs=num_epochs)

### 15. Qualitative Check: Defining Text Generation Function

I define a helper function using the Hugging Face `pipeline` for text generation. This will be used to qualitatively inspect the model's new behavior after fine-tuning.

In [None]:
from transformers import pipeline

def generate_text(model, tokenizer, prompt, max_length=256, temperature=0.7):
    model.eval() # Switch to evaluation mode
    
    # Create a text generation pipeline
    gen_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        device=0 if torch.cuda.is_available() else -1 # Use GPU if available
    )

    # Generate text
    generated = gen_pipeline(
        prompt,
        max_length=max_length,
        do_sample=True,
        temperature=temperature,
        top_k=50,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id, # Must explicitly set pad_token_id
        num_return_sequences=1
    )
    return generated[0]['generated_text']

### 16. Qualitative Check: Running Text Generation

I test the fine-tuned model's ability to generate a solution and reasoning for a new problem statement, as a sanity check on what it has learned.

In [None]:
test_prompt = "Problem: Write a Python function that takes a list of integers and returns the sum of all odd numbers in the list.\n###Solution:"
print("Generating text using fine-tuned model...")
generated_output = generate_text(model=model, tokenizer=tokenizer, prompt=test_prompt, max_length=150)
print("---- Generated Output ----")
print(generated_output)
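
Since the training template separates sections with `###`, the generated string can be split back into its parts for easier inspection. This is a minimal sketch; `split_sections` and the sample string are hypothetical, and it assumes the model reproduces the `Label:` prefixes and that `###` does not appear inside the generated code itself.

```python
def split_sections(generated_text):
    """Split generated text on the '###' markers used by the training template."""
    sections = {}
    for chunk in generated_text.split("###"):
        label, sep, body = chunk.partition(":")
        if sep:  # skip chunks that have no "Label:" prefix
            sections[label.strip()] = body.strip()
    return sections

# Hypothetical output in the same shape as the training text
sample = (
    "Problem: Sum the odd numbers.\n"
    "###Solution: return sum(x for x in nums if x % 2)\n"
    "###Reasoning: Odd numbers leave remainder 1."
)

parsed = split_sections(sample)
print(parsed["Solution"])
```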

### 17. Saving and Pushing the Fine-Tuned Model to Hugging Face Hub

I save the fine-tuned model and tokenizer locally and then use the `push_to_hub` utility to upload them to my public Hub account. This completes Phase 1 and makes the model available for subsequent steps (using the Hugging Face `Trainer` and DPO).

In [None]:
repo_name = "my-gpt2-coding-finetuned-pytorch-manual" # Choose a meaningful name

model.save_pretrained(repo_name)
tokenizer.save_pretrained(repo_name)

# Push to Hub - requires being logged in via !huggingface-cli login
print(f"Pushing model and tokenizer to Hub under: {repo_name}")
tokenizer.push_to_hub(repo_name)
model.push_to_hub(repo_name)

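### 18. Final Quantitative Evaluation

I run the evaluation function one last time to confirm the final loss on the validation set after the entire manual training process. Because the labels are a direct copy of `input_ids`, this average also covers the padded positions of each 512-token example, which tends to pull the number down compared with a loss computed over real tokens only.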

In [None]:
evaluate_model(model=model, eval_dataloader=eval_dataloader)


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Evaluation Loss: 0.0139


0.0139
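
A mean cross-entropy loss is often easier to read as perplexity, which is its exponential. The sketch below converts the reported value; treat the result as approximate, since the loss above is an average of per-batch means and includes padded positions.

```python
import math

eval_loss = 0.0139  # value reported by evaluate_model above

# Perplexity is exp(mean cross-entropy), with the loss measured in nats
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.4f}")
```

A perplexity this close to 1 would mean the model predicts almost every position with near certainty, which fits a loss dominated by easy-to-predict padding rather than by real code tokens.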