## 1. Initialize Dataset

Make sure your CSV dataset is named `cleaned_python_commit_dataset.csv` and is placed inside a folder called `data/`. This cell reads the file and prints the number of rows to confirm it's loaded correctly.

In [None]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('data/cleaned_python_commit_dataset.csv')

# Get the number of rows
num_rows = df.shape[0]

print("Number of rows:", num_rows)
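
Before loading the full file, it can help to confirm that the header contains the columns the later cells rely on. The sketch below is one way to do that with the standard library; the helper name `missing_columns` is hypothetical, and the column names `diff` and `commit_message` are taken from the defaults used later in this notebook.

```python
import csv

# Column names assumed from the defaults used in the tokenizer and training cells.
REQUIRED_COLUMNS = ("diff", "commit_message")

def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Only the first row is read, so this stays cheap for large files.
        header = next(csv.reader(f), [])
    return sorted(set(required) - set(header))
```

An empty list means both columns are present; anything else names the columns that need to be added or renamed.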


## 2. Train BPE Tokenizer on Dataset

Train a Byte-Pair Encoding (BPE) tokenizer with the Hugging Face `tokenizers` library, using the Git diffs and commit messages stored in the CSV file as training text.

### Inputs
- **`csv_file`**: Path to the CSV file containing Git diffs and commit messages.
- **`output_tokenizer_file`**: Path where the trained tokenizer JSON will be saved.
- **`vocab_size`** *(default: 48000)*: Target vocabulary size.
- **`special_tokens`** *(default: [`"<pad>"`, `"<endOfDiff>"`, `"<endOfCommitMessage>"`])*: Tokens to include in the vocabulary for special processing.
- **`diff_column`** *(default: 'diff')*: Column name containing the Git diffs.
- **`commit_msg_column`** *(default: 'commit_message')*: Column name containing the commit messages.

### Processing Steps
1. Reads the CSV and ensures the required columns exist.
2. Combines each diff and commit message into a single training string, separated by a newline.
3. Saves these training strings to a temporary file.
4. Initializes a BPE tokenizer with byte-level pre-tokenization and decoding (compatible with GPT-2 style models).
5. Trains the tokenizer on the prepared training texts.
6. Saves the tokenizer as a JSON file for later use.
7. Deletes the temporary training text file.
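
To make step 5 less of a black box, here is a toy sketch of the core BPE idea: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. It is not how the `tokenizers` library is implemented (that library works on bytes and is far more optimized); the function names are hypothetical and the sketch operates on whitespace-separated words only.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are resolved in favor of the pair seen first.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges
```

For the corpus `"fix fix fix bug"`, the first two merges build up the frequent word `fix` (`f`+`i`, then `fi`+`x`). The vocabulary size passed to the real trainer bounds how many merges of this kind are learned.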

### Output
- A JSON tokenizer file (e.g., `custom_bpe_tokenizer.json`) that can be used with Hugging Face's `PreTrainedTokenizerFast`.

This tokenizer will be used to encode the model inputs and outputs, ensuring that both Git diffs and commit messages are tokenized in a consistent and compact format.


In [None]:
import os
import pandas as pd
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

def train_tokenizer(
    csv_file: str,
    output_tokenizer_file: str,
    vocab_size: int = 48000,
    special_tokens: list = None,
    diff_column: str = 'diff',
    commit_msg_column: str = 'commit_message'
):

    """Train a custom BPE tokenizer using data from a CSV file.

    Args:
        csv_file (str): Path to the CSV file containing the dataset.
        output_tokenizer_file (str): Path to save the trained tokenizer JSON.
        vocab_size (int): The desired vocabulary size.
        special_tokens (list): List of special tokens to add.
        diff_column (str): Name of the CSV column containing git diffs.
        commit_msg_column (str): Name of the CSV column containing commit messages.
    """

    # Default special tokens
    if special_tokens is None:
        special_tokens = ["<pad>", "<endOfDiff>", "<endOfCommitMessage>"]

    # Load the CSV dataset
    df = pd.read_csv(csv_file)
    if diff_column not in df.columns or commit_msg_column not in df.columns:
        raise ValueError(f"CSV file must have columns '{diff_column}' and '{commit_msg_column}'.")

    # Combine the diff and commit message columns.
    # Here we add a newline between the diff and the commit message.
    training_texts = (df[diff_column].astype(str) + "\n" + df[commit_msg_column].astype(str)).tolist()

    # Save training texts to a temporary file. Diffs contain newlines, so a
    # sample occupies several lines of the file rather than exactly one.
    training_file = "training_texts.txt"
    with open(training_file, "w", encoding="utf-8") as f:
        for text in training_texts:
            f.write(text + "\n")
    print(f"Training texts saved to {training_file}")

    # Initialize the tokenizer with a BPE model
    tokenizer = Tokenizer(models.BPE())

    # Set pre-tokenizer and decoder.
    # Using ByteLevel pre-tokenization and decoding mimics GPT-2's behavior.
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    tokenizer.decoder = decoders.ByteLevel()

    # Configure the trainer with the vocabulary size and special tokens
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=special_tokens)

    # Train the tokenizer on the training file.
    tokenizer.train([training_file], trainer=trainer)
    print("Tokenizer training complete.")

    # Save the trained tokenizer to a JSON file.
    tokenizer.save(output_tokenizer_file)
    print(f"Tokenizer saved to {output_tokenizer_file}")

    # Remove the temporary training file.
    os.remove(training_file)
    print(f"Temporary training file {training_file} removed.")

if __name__ == "__main__":
    # Path to CSV file containing 'diff' and 'commit_message' columns.
    csv_file = "data/cleaned_python_commit_dataset.csv"
    # The filename to store your trained tokenizer.
    output_tokenizer_file = "custom_bpe_tokenizer.json"
    # Train the tokenizer with a vocabulary size of 48,000
    train_tokenizer(csv_file, output_tokenizer_file, vocab_size=48000)


## 3. Load the Trained Tokenizer

Load the custom BPE tokenizer that was trained and saved in the previous step (`custom_bpe_tokenizer.json`) using Hugging Face’s `PreTrainedTokenizerFast`.

### Special Tokens Added:
- `<pad>` — used for padding sequences during batching
- `<endOfCommitMessage>` — tells the model where the commit message ends
- `<endOfDiff>` — separates the git diff from the commit message
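
As a small illustration of how these markers delimit a sample, the sketch below assembles a training string in the `<git_diff> <endOfDiff> <commit_message> <endOfCommitMessage>` layout used later in the notebook and recovers the message from it. The helper names `build_sample` and `extract_message` are hypothetical and work on plain strings, not token IDs.

```python
END_OF_DIFF = "<endOfDiff>"
END_OF_MESSAGE = "<endOfCommitMessage>"

def build_sample(diff, message):
    """Concatenate diff, separator, commit message, and end marker."""
    return f"{diff}{END_OF_DIFF}{message}{END_OF_MESSAGE}"

def extract_message(text):
    """Return the text between the separator and the end marker."""
    _, _, after_diff = text.partition(END_OF_DIFF)
    message, _, _ = after_diff.partition(END_OF_MESSAGE)
    return message.strip()
```

`extract_message` also copes with text that stops before the end marker, which can happen if generation is cut off by a length limit.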

In [None]:
from transformers import PreTrainedTokenizerFast

custom_tokenizer = PreTrainedTokenizerFast(tokenizer_file="custom_bpe_tokenizer.json")
custom_tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "eos_token": "<endOfCommitMessage>"
})
custom_tokenizer.add_tokens(["<endOfDiff>"])



## 4. Initialize Function to Plot Loss for Training

Define a helper function, `plot_and_save_loss()`, that plots the training and validation loss over epochs and saves the result as an image.

### What it does:
- Takes in two lists: `train_losses` and `val_losses`, one value per epoch.
- Plots both loss curves on the same graph.
- Labels axes and adds a legend for clarity.
- Saves the figure as `loss_curve.png` (or a custom filename if specified).

Use this after training to visualize how well your model is learning and spot signs of overfitting or underfitting.
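
One simple way to read the curves numerically is to find the epoch where validation loss was lowest and count how many epochs have passed since. This is only a rough heuristic, not a definitive overfitting test, and the helper names below are hypothetical.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    if not val_losses:
        raise ValueError("val_losses is empty")
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

def epochs_since_best(val_losses):
    """Number of epochs trained after the best validation epoch."""
    return len(val_losses) - best_epoch(val_losses)
```

If training loss keeps falling while `epochs_since_best` grows, the model is likely starting to overfit.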

In [None]:
import matplotlib.pyplot as plt

def plot_and_save_loss(train_losses, val_losses, filename="loss_curve.png"):
    epochs = range(1, len(train_losses) + 1)
    plt.figure()
    plt.plot(epochs, train_losses, label="Training Loss")
    plt.plot(epochs, val_losses,   label="Validation Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Loss Curve")
    plt.legend()
    plt.tight_layout()
    plt.savefig(filename)
    plt.close()
    print(f"Saved loss curve to {filename}")


## 5. Training the Decoder-only Transformer Model on Git Diffs + Commit Messages

This code builds and trains a decoder-only GPT-2 model from scratch to generate commit messages from Git diffs. It uses the custom tokenizer trained earlier and a 20-layer Transformer architecture, and it records training loss, validation loss, and the parameter weight norm after every epoch.

1. **Defines a custom PyTorch dataset (`GitDiffDataset`)** to prepare each sample in the form: `<git_diff> <endOfDiff> <commit_message> <endOfCommitMessage>`
2. **Pads each batch** using a custom `collate_fn` to align sequence lengths while respecting the model's max context window.
3. **Configures a GPT-2 model** from scratch using the following hyperparameters:

- **`vocab_size=48000`**  
  Matches the number of tokens in our custom tokenizer. This sets the size of the model's input/output vocabulary.

- **`n_positions=1024`**  
  Maximum number of tokens the model can see in a single forward pass. Both the input (diff + commit message) and the output must fit within this limit.

- **`n_ctx=1024`**  
  Same as `n_positions`; used for backward compatibility. It defines the length of the attention context window.

- **`n_embd=768`**  
  Embedding size — each token is represented as a 768-dimensional vector. Also used internally throughout the model layers.

- **`n_layer=20`**  
  The number of Transformer blocks stacked in the model. More layers = more capacity to learn patterns, but also more compute.

- **`n_head=12`**  
  Number of attention heads in each layer. The 768-dimensional embedding is split into 12 heads of 64 dimensions each, and every head attends over the sequence in parallel.

- **`resid_pdrop=0.1`**  
  Dropout applied to the output of each attention and feed-forward sublayer before it is added back to the residual (skip) connection, to reduce overfitting.

- **`embd_pdrop=0.1`**  
  Dropout applied right after the token and positional embeddings.

- **`attn_pdrop=0.1`**  
  Dropout applied inside the self-attention mechanism — helps regularize attention weights.

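These values sit between GPT-2 Small (12 layers, 768-dimensional embeddings, 12 heads) and GPT-2 Medium (24 layers, 1024-dimensional embeddings, 16 heads): the width and head count follow Small, while the depth is closer to Medium. This gives the model enough capacity to learn meaningful patterns from Git diffs and commit messages while remaining trainable on a single GPU such as an A100.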
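
To get a feel for how large this configuration is, the parameter count can be estimated by hand. The sketch below assumes the usual GPT-2 layout: a feed-forward width of `4 * n_embd`, biases on every linear layer, and an output head that shares weights with the token embedding. The function name is hypothetical.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters for a GPT-2 style model with a tied LM head."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d          # token + position tables
    attention = (d * 3 * d + 3 * d) + (d * d + d)          # fused QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)            # expand to 4d, project back
    layer_norms = 2 * (2 * d)                              # two LayerNorms per block
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return embeddings + n_layer * per_layer + final_layer_norm
```

With `vocab_size=48000`, `n_positions=1024`, `n_embd=768`, and `n_layer=20`, this comes to 179,409,408 parameters (about 179M), provided the tokenizer's vocabulary is exactly 48,000. Note that `n_head` does not change the count; it only changes how the 768 dimensions are divided.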

> **Loss Function**:  
> The model is trained with **cross-entropy loss**, which Hugging Face’s `GPT2LMHeadModel` computes automatically when you pass a `labels` tensor. The model shifts the labels internally, so the prediction at each position is scored against the token that actually comes next, and label positions set to `-100` are ignored. This is the standard objective for autoregressive language modeling.
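
The shifted cross-entropy described above can be written out in a few lines of plain Python. This is a didactic sketch for a single sequence, not the library's implementation; the function name is hypothetical, and `-100` is used as the ignore value to match the convention `GPT2LMHeadModel` follows.

```python
import math

IGNORE_INDEX = -100

def next_token_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Average cross-entropy of logits[i] against labels[i + 1].

    logits: one list of vocabulary scores per position.
    labels: one token ID per position (ignore_index marks skipped targets).
    """
    total, count = 0.0, 0
    # Position i predicts token i + 1, so drop the last logit and the first label.
    for position_logits, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue
        peak = max(position_logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        total += log_norm - position_logits[target]
        count += 1
    return total / count if count else float("nan")
```

With uniform logits over a vocabulary of size V, the loss equals `log(V)`, which is a useful sanity check for the first training steps of a randomly initialized model.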

4. **Implements a training loop** that:
- Trains for 10 epochs using AdamW with `lr=2e-4`
- Tracks both training and validation loss
- Plots loss vs. weight norm to monitor training dynamics
- Saves `loss_curve.png` and `loss_vs_weight_norm.png`
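
The padding in step 2 can be illustrated without PyTorch. The sketch below pads (or truncates) plain Python lists to a common width and also builds an attention mask marking real tokens; the function name `pad_batch` and the mask output are illustrative assumptions, not part of the notebook's `collate_fn`.

```python
def pad_batch(sequences, pad_id, max_len):
    """Right-pad token ID lists to a common width, capped at max_len.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    """
    width = min(max(len(seq) for seq in sequences), max_len)
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = list(seq[:width])           # truncate from the right if too long
        padding = width - len(kept)
        input_ids.append(kept + [pad_id] * padding)
        attention_mask.append([1] * len(kept) + [0] * padding)
    return input_ids, attention_mask
```

Because each sample is laid out as diff, separator, message, and end marker, truncating from the right removes the commit message first when a diff is very long. Filtering or shortening long diffs before training is one way to avoid that.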

### Input
- CSV file: `data/cleaned_python_commit_dataset.csv`
- Columns: `diff`, `commit_message`

### Output
- Trained model saved to `trained_model/`
- Loss curve: `loss_curve.png`
- Weight norm vs. loss: `loss_vs_weight_norm.png`
- Printed commit message generated from a test diff

Make sure you've already run the tokenizer training step, and that `custom_bpe_tokenizer.json` is in the current directory.





In [None]:
import math
import random
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Config, GPT2LMHeadModel
from torch.optim import AdamW
import pandas as pd
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)



# Each sample is a tuple: (git_diff, commit_message).
# Use "<endOfDiff>" as a separator between the git diff and commit message,
# and append "<endOfCommitMessage>" to mark the end of the commit message.
class GitDiffDataset(Dataset):
    def __init__(self, data, tokenizer):
        """
        data: List of tuples (git_diff_text, commit_message_text)
        tokenizer: Instance of our custom tokenizer.
        """
        self.data = data
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Get the git diff and commit message for the current sample.
        git_diff_text, commit_msg_text = self.data[idx]

        # Ensure both values are strings.
        git_diff_text = str(git_diff_text)
        commit_msg_text = str(commit_msg_text)

        # Define a custom separator and EOS token.
        separator = "<endOfDiff>"
        eos_token = "<endOfCommitMessage>"
        # Concatenate git diff, separator, commit message, and EOS token.
        full_text = git_diff_text + separator + commit_msg_text + eos_token
        # Encode the concatenated text into token IDs.
        token_ids = self.tokenizer.encode(full_text)
        return torch.tensor(token_ids, dtype=torch.long)

# Pads sequences in a batch to the same length.
# This function is called by DataLoader to combine samples into a mini-batch.
def collate_fn(batch):
    max_length = max(seq.size(0) for seq in batch)
    # Cap the batch length at n_positions so position indices stay in range.
    max_length = min(max_length, model.config.n_positions)

    padded_batch = []
    for seq in batch:
        pad_len = max_length - seq.size(0)
        # Right-pad with the pad token ID. When a sequence is longer than
        # max_length, pad_len is negative and F.pad crops it from the right.
        padded_seq = F.pad(seq, (0, pad_len), value=custom_tokenizer.pad_token_id)
        padded_batch.append(padded_seq.unsqueeze(0))
    return torch.cat(padded_batch, dim=0)


# GPT2LMHeadModel from transformers
# Create a GPT-2 configuration from scratch.
config = GPT2Config(
    vocab_size=len(custom_tokenizer),  # Full vocabulary, including added tokens.
    n_positions=1024,             # Maximum number of tokens in a sequence.
    n_ctx=1024,                   # Context size (should match n_positions).
    n_embd=768,                   # Embedding size (d_model).
    n_layer=20,                   # Number of transformer layers.
    n_head=12,                    # Number of attention heads.
    resid_pdrop=0.1,              # Dropout probability for residual connections.
    embd_pdrop=0.1,               # Dropout probability for embeddings.
    attn_pdrop=0.1,               # Dropout probability for attention weights.
)

# Initialize the GPT-2 model with a language modeling head from scratch.
model = GPT2LMHeadModel(config)
# Make sure the embedding matrix covers every token ID the tokenizer can emit.
# len(custom_tokenizer) counts added tokens; custom_tokenizer.vocab_size does not.
model.resize_token_embeddings(len(custom_tokenizer))


# Training Loop
# This function trains the model using the provided training and validation data.
def train_model(model, train_loader, val_loader, epochs=10, lr=2e-4, device='cuda'):
    # Use the AdamW optimizer which is well-suited for transformers.
    # The learning rate was previously 1e-4; 2e-4 is the value currently being tested.
    optimizer = AdamW(model.parameters(), lr=lr)
    model.to(device)

    # Track losses per epoch
    train_losses, val_losses = [], []
    weight_norms = []


    for epoch in range(epochs):
        model.train()  # Set model to training mode.
        total_loss = 0.0

        # tqdm progress bar for the training loop.
        train_bar = tqdm(train_loader, desc=f"Epoch {epoch+1} Training", leave=False)
        for batch_idx, batch in enumerate(train_bar):
            batch = batch.to(device)
            # Mask padding so it is neither attended to nor counted in the loss.
            # The model shifts the labels internally for next-token prediction.
            attention_mask = (batch != custom_tokenizer.pad_token_id).long()
            labels = batch.masked_fill(attention_mask == 0, -100)
            outputs = model(batch, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

            optimizer.zero_grad()  # Clear previous gradients.
            loss.backward()        # Compute gradients.
            optimizer.step()       # Update model weights.

            total_loss += loss.item()
            # Update progress bar with current average loss.
            train_bar.set_postfix(loss=f"{total_loss/(batch_idx+1):.4f}")

        avg_train_loss = total_loss / len(train_loader)
        train_losses.append(avg_train_loss)

        # Validation phase after each epoch.
        model.eval()  # Set model to evaluation mode.
        val_loss = 0.0
        # tqdm progress bar for the validation loop.
        val_bar = tqdm(val_loader, desc=f"Epoch {epoch+1} Validation", leave=False)
        with torch.no_grad():
            for batch in val_bar:
                batch = batch.to(device)
                # Apply the same padding mask as in training so the losses are comparable.
                attention_mask = (batch != custom_tokenizer.pad_token_id).long()
                labels = batch.masked_fill(attention_mask == 0, -100)
                outputs = model(batch, attention_mask=attention_mask, labels=labels)
                val_loss += outputs.loss.item()
        val_loss /= len(val_loader)

        val_losses.append(val_loss)

        # compute L2 norm of *all* parameters
        total_norm_sq = 0.0
        for p in model.parameters():
            total_norm_sq += p.data.norm().item()**2
        weight_norm = math.sqrt(total_norm_sq)
        weight_norms.append(weight_norm)

        print(f"Epoch {epoch+1} Training Loss: {avg_train_loss:.4f}")
        print(f"Epoch {epoch+1} Validation Loss: {val_loss:.4f}")

        print(f"Epoch {epoch+1} Weight Norm: {weight_norm:.4f}")

    # Once training is done, plot weight-norm vs. loss:
    plt.figure()
    plt.plot(weight_norms, train_losses)
    plt.xlabel("Weight L2-Norm")
    plt.ylabel("Training Loss")
    plt.title("Training Loss vs. Weight Norm")
    plt.tight_layout()
    plt.savefig("loss_vs_weight_norm.png")
    plt.close()
    print("Saved plot to loss_vs_weight_norm.png")


    # After all epochs, plot & save
    plot_and_save_loss(train_losses, val_losses, filename="loss_curve.png")
    return train_losses, val_losses, weight_norms


# Loading CSV Data and Putting it All Together
if __name__ == "__main__":
    # The CSV should have two columns: 'diff' and 'commit_message'.
    df = pd.read_csv("data/cleaned_python_commit_dataset.csv")
    # Convert DataFrame columns into a list of tuples.
    data = list(zip(df['diff'].tolist(), df['commit_message'].tolist()))

    # Shuffle the data and split it into training (80%) and validation (20%) sets.
    random.shuffle(data)
    split_idx = int(0.8 * len(data))
    train_data = data[:split_idx]
    val_data = data[split_idx:]

    # Create PyTorch datasets for training and validation.
    train_dataset = GitDiffDataset(train_data, custom_tokenizer)
    val_dataset = GitDiffDataset(val_data, custom_tokenizer)

    # Create DataLoaders to batch and shuffle the data.
    train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)
    val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False, collate_fn=collate_fn)

    # Choose the device (GPU if available, otherwise CPU).
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")

    # Train the model using our training loop
    train_model(model, train_loader, val_loader, epochs=10, lr=2e-4, device=device)

    # Save the trained model.
    # This will save both the model weights and configuration.
    save_directory = "trained_model"
    model.save_pretrained(save_directory)
    print(f"Model saved to {save_directory}")


## 6. Save the Trained Model

Zip up the `trained_model/` directory and download it to your local machine. This cell relies on `google.colab`, so it only runs inside Google Colab.

- The folder is compressed into a file called `trained_model.zip`.
- Make sure the model was saved to `trained_model/` before running this.
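
Outside Colab, the same archive can be produced with the standard library alone. The sketch below is one possible approach; the helper name `zip_directory` is hypothetical, and it lists the archive contents afterwards so you can confirm the expected files were included.

```python
import shutil
import zipfile
from pathlib import Path

def zip_directory(source_dir, archive_stem):
    """Zip source_dir into '<archive_stem>.zip' and return (archive_path, file_names)."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"{source} is not a directory")
    # make_archive appends the '.zip' extension and returns the final path.
    archive_path = shutil.make_archive(str(archive_stem), "zip", root_dir=str(source))
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()
    return archive_path, names
```

For example, `zip_directory("trained_model", "trained_model")` writes `trained_model.zip` to the current directory. Keep the archive outside the directory being zipped so it does not end up inside itself.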



In [None]:
import shutil
from google.colab import files

# Zip the 'trained_model' directory into a file named 'trained_model.zip'
shutil.make_archive('trained_model', 'zip', 'trained_model')

# Download the zipped model
files.download('trained_model.zip')
