## 1. Load the trained tokenizer

Load the custom BPE tokenizer that was trained and saved in the previous step (`custom_bpe_tokenizer.json`) using Hugging Face’s `PreTrainedTokenizerFast`.

### Special Tokens Added:
- `<pad>` — used for padding sequences during batching
- `<endOfCommitMessage>` — tells the model where the commit message ends
- `<endOfDiff>` — separates the git diff from the commit message

In [None]:
from transformers import PreTrainedTokenizerFast

custom_tokenizer = PreTrainedTokenizerFast(tokenizer_file="custom_bpe_tokenizer.json")
custom_tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "eos_token": "<endOfCommitMessage>"
})
custom_tokenizer.add_tokens(["<endOfDiff>"])
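
To make the roles of the two delimiter tokens concrete, the sketch below lays out one sequence as plain text. The layout (diff, then `<endOfDiff>`, then the message, then `<endOfCommitMessage>`) is an assumption inferred from the token descriptions above. `format_training_example` and `format_prompt` are hypothetical helpers that are not part of this notebook, and the sample commit message is made up for illustration.

```python
# Hypothetical helpers: they show how the special tokens are assumed to
# delimit one sequence. The exact layout used during training is an assumption.
SEPARATOR = "<endOfDiff>"
EOS = "<endOfCommitMessage>"


def format_training_example(diff: str, message: str) -> str:
    """Join a diff and its commit message with the two delimiter tokens."""
    return f"{diff}{SEPARATOR}{message}{EOS}"


def format_prompt(diff: str) -> str:
    """At inference time only the diff and the separator are supplied."""
    return f"{diff}{SEPARATOR}"


sample_diff_text = "-DEBUG = False\n+DEBUG = True\n"
example = format_training_example(sample_diff_text, "Enable debug mode")
print(example)
```

Under this layout, the prompt built at inference time is a prefix of a training sequence, so the model only has to continue with the message and the end token.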


## 2. Load Trained Model and Generate Commit Messages

This cell loads your trained GPT-2 model and tokenizer, then sets up a function to generate commit messages from Git diffs.

- Uses GPU if available.
- Loads the model from the `trained_model/` folder.
- Loads the tokenizer from `custom_bpe_tokenizer.json` and registers the special tokens:
  - `<pad>`, `<endOfDiff>`, `<endOfCommitMessage>`
- Defines `generate_commit_message()`:
  - Takes in a Git diff.
  - Adds `<endOfDiff>` to separate the input.
  - Uses greedy decoding (with repetition penalty and no repeated 3-grams).
  - Stops when it hits `<endOfCommitMessage>` or the max token limit.
  - Returns just the commit message part of the output.

Make sure the `trained_model/` folder and `custom_bpe_tokenizer.json` are in the current working directory before running this.
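
The two decoding constraints listed above can be pictured with a small pure-Python toy. This is a simplified sketch of the general idea, not the code that `model.generate()` runs: the library works on tensors and may differ in detail. The function names, the 8-token vocabulary, and the scores are invented for illustration.

```python
# Toy illustration of greedy decoding with a repetition penalty and
# n-gram blocking. All names and numbers here are hypothetical.

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make tokens that already appear in the sequence less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores are divided and negative scores are multiplied,
        # so the token always becomes less likely when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def banned_next_tokens(generated_ids, ngram_size=3):
    """Return tokens that would complete an n-gram already in the sequence.

    Assumes ngram_size >= 2.
    """
    if len(generated_ids) < ngram_size:
        return set()
    prefix = tuple(generated_ids[-(ngram_size - 1):])
    banned = set()
    for i in range(len(generated_ids) - ngram_size + 1):
        ngram = generated_ids[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


def greedy_step(logits, generated_ids, penalty=1.2, ngram_size=3):
    """Pick the highest-scoring token after applying both constraints."""
    scores = apply_repetition_penalty(logits, generated_ids, penalty)
    for token_id in banned_next_tokens(generated_ids, ngram_size):
        scores[token_id] = float("-inf")
    return max(range(len(scores)), key=scores.__getitem__)


# Sequence so far ends in (5, 6), and the 3-gram (5, 6, 7) already occurred.
history = [5, 6, 7, 5, 6]
toy_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
next_token = greedy_step(toy_logits, history)
print(next_token)  # → 6
```

Without the constraints, plain greedy decoding would pick token 7 (the highest raw score). Here token 7 is blocked because it would repeat the 3-gram `(5, 6, 7)`, so the penalized token 6 wins instead.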


In [None]:
import torch
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Choose device (GPU if available)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Load the model from the "trained_model" folder, which contains both the weights and the configuration.
model = GPT2LMHeadModel.from_pretrained("trained_model")
model.to(device)
model.eval()

# Load the custom tokenizer from its JSON file.
custom_tokenizer = PreTrainedTokenizerFast(tokenizer_file="custom_bpe_tokenizer.json")
# Add special tokens
custom_tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "eos_token": "<endOfCommitMessage>"
})
custom_tokenizer.add_tokens(["<endOfDiff>"])

# Inference function to generate a commit message given a git diff.
def generate_commit_message(model, tokenizer, git_diff_text, max_length=100, device='cuda'):
    model.eval()  # Ensure the model is in evaluation mode.
    separator = "<endOfDiff>"
    eos_token = "<endOfCommitMessage>"

    # Build the prompt by concatenating the diff with the separator.
    prompt = git_diff_text + separator
    input_ids = tokenizer.encode(prompt)
    input_tensor = torch.tensor([input_ids], dtype=torch.long).to(device)

    # Look up the id of the end-of-message token directly in the vocabulary.
    eos_token_id = tokenizer.convert_tokens_to_ids(eos_token)
    pad_token_id = tokenizer.pad_token_id

    with torch.no_grad():
        generated_ids = model.generate(
            input_tensor,
            max_new_tokens=max_length,  # limit on generated tokens, excluding the prompt
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            do_sample=False,  # greedy decoding; set to True for sampling
            repetition_penalty=1.2,  # discourage tokens that already appeared
            no_repeat_ngram_size=3,  # never repeat the same 3-gram
        )[0].tolist()

    generated_text = tokenizer.decode(generated_ids)
    # Extract the commit message from the generated text.
    if separator in generated_text:
        commit_msg_with_eos = generated_text.split(separator, 1)[1].strip()
        commit_msg = commit_msg_with_eos.replace(eos_token, "").strip()
    else:
        commit_msg = generated_text.strip()
    return commit_msg




## 3. Run an Inference

This cell runs the `generate_commit_message()` function on a test Git diff to see what commit message the model generates.

- The example diff shows a `DEBUG` flag being changed from `False` to `True` in `config.py`.
- The model processes the diff and returns a predicted commit message.

This cell does not read the training CSV; it only needs the model, the tokenizer, and `generate_commit_message()` from the previous section to be loaded.

Use this to quickly test if your model and tokenizer are working as expected.


In [3]:
# Run an inference on sample git diff
test_diff = """
diff --git a/config.py b/config.py
index 8aef123..9cdfaa1 100644
--- a/config.py
+++ b/config.py
@@
-DEBUG = False
+DEBUG = True
"""

commit_msg = generate_commit_message(model, custom_tokenizer, test_diff, max_length=20, device=device)
print("Generated commit message:", commit_msg)


Generated commit message: Remove debugging flag from config file


## 4. Run Inference on a Sample from the Validation Set

- This cell reloads the original dataset from `data/cleaned_python_commit_dataset.csv`, applies the same 80/20 train/validation split used during training, and runs inference on a random example from the validation set.

- Make sure you're using the same tokenizer (`custom_bpe_tokenizer.json`) and that the CSV file is inside a folder named `data/`.

This lets you check how well the model generalizes to examples it was not trained on.
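
The seeded split can be checked in isolation with the sketch below. It reproduces the training split only if the training script used the same seed, the same shuffle call, and the same row order in the CSV, which this notebook assumes but does not verify. `split_train_val` is a hypothetical helper, and the toy pairs stand in for the real `(diff, commit_message)` rows.

```python
import random


def split_train_val(pairs, train_fraction=0.8, seed=42):
    """Shuffle a copy of `pairs` with a fixed seed, then split it."""
    rng = random.Random(seed)  # local generator; leaves the global state alone
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    split_idx = int(train_fraction * len(shuffled))
    return shuffled[:split_idx], shuffled[split_idx:]


# Toy stand-in for the (diff, commit_message) pairs read from the CSV.
toy_pairs = [(f"diff {i}", f"message {i}") for i in range(10)]

train_a, val_a = split_train_val(toy_pairs)
train_b, val_b = split_train_val(toy_pairs)

print(len(train_a), len(val_a))  # → 8 2
print(val_a == val_b)            # → True
```

Because the shuffle depends only on the seed and the input order, any change to the CSV (added, removed, or reordered rows) produces a different split, and validation examples may then overlap with data the model was trained on.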


In [None]:
import pandas as pd
import random

# Load the same cleaned dataset used during training
df = pd.read_csv("data/cleaned_python_commit_dataset.csv")
data = list(zip(df["diff"].tolist(), df["commit_message"].tolist()))

# Reproduce the same 80/20 split used in training
random.seed(42)
random.shuffle(data)
split_idx = int(0.8 * len(data))
val_data = data[split_idx:]

# Pick a sample from the validation set
sample_diff, true_msg = random.choice(val_data)

# Run inference
predicted_msg = generate_commit_message(model, custom_tokenizer, sample_diff, max_length=20, device=device)

# Print results
print("Git Diff:\n" + sample_diff.strip())
print("\nGround Truth Commit Message:\n" + true_msg.strip())
print("\nGenerated Commit Message:\n" + predicted_msg.strip())
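
Comparing one prediction by eye is a start; a rough number makes it easier to compare many samples. The sketch below computes a word-overlap F1 score between a generated and a reference message. It is not a metric this notebook uses: it is a simple stand-in, `token_f1` is a hypothetical helper, and the two example messages are made up.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Word-level F1 overlap between two messages (case-insensitive)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often
    # as it appears in both messages.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("Enable debug flag", "Enable debug mode in config")
print(round(score, 2))  # → 0.5
```

A score like this only rewards shared words, so a message that is correct but worded differently scores low. It is best treated as a coarse signal alongside manual inspection.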
