## Load the trained tokenizer

Load the custom BPE tokenizer that was trained and saved in the previous step (`custom_bpe_tokenizer.json`) using Hugging Face’s `PreTrainedTokenizerFast`.

### Special Tokens Added:
- `<pad>`: used for padding sequences during batching
- `<sos>`: marks the start of each input sequence
- `<endOfDiff>`: separates the git diff from the commit message
- `<endOfCommitMessage>`: tells the model where the commit message ends
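
To make the role of each token concrete, here is a minimal sketch of how one sample could be laid out as a single string. The inference prompt (`<sos>` + diff + `<endOfDiff>`) follows the generation code later in this notebook. Appending the message and `<endOfCommitMessage>` for training is an assumption, and `build_sequence` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical helper: shows where each special token sits in a sequence.
BOS, SEP, EOS = "<sos>", "<endOfDiff>", "<endOfCommitMessage>"

def build_sequence(diff, message=None):
    """Return the prompt alone, or prompt + target when a message is given."""
    prompt = BOS + diff + SEP
    if message is None:
        return prompt              # inference: the model continues from here
    return prompt + message + EOS  # assumed training layout

example = build_sequence("+LOGGING_ENABLED = True\n", "Enable logging")
print(example)
```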

In [None]:
from transformers import PreTrainedTokenizerFast

# Load custom BPE tokenizer
custom_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="custom_bpe_tokenizer.json"
)
custom_tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "eos_token": "<endOfCommitMessage>",
    "bos_token": "<sos>"
})
custom_tokenizer.add_tokens(["<endOfDiff>"])


## GRU Commit Message Generator: Model Definition and Inference Setup

This cell defines and loads a GRU-based language model for generating Git commit messages from Git diffs, and includes a decoding function that uses temperature-controlled sampling.

### Components

#### 1. **Device Setup**
- Selects GPU (`cuda`) if available; otherwise falls back to CPU.

#### 2. **`GRULanguageModel` Class**
- A 4-layer GRU language model with:
  - Shared input/output embeddings (weight tying), which requires `embed_dim == hidden_dim`.
  - Dropout regularization on the embeddings and between GRU layers.
  - Configurable vocabulary size, embedding dimension, hidden dimension, and padding index.
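
As a rough illustration of what weight tying saves, the sketch below counts parameters for the default configuration (`embed_dim=512`, `hidden_dim=512`, `num_layers=4`). The GRU formula follows the layout of `nn.GRU` (three gates, input and hidden weights, two bias vectors per layer). The vocabulary size of 10,000 is a placeholder, not the notebook's real value, and both function names are hypothetical.

```python
def gru_param_count(embed_dim, hidden_dim, num_layers):
    """Parameters of a unidirectional GRU stack with biases."""
    total = 0
    for layer in range(num_layers):
        in_dim = embed_dim if layer == 0 else hidden_dim
        total += 3 * hidden_dim * (in_dim + hidden_dim)  # weight_ih + weight_hh
        total += 2 * 3 * hidden_dim                      # bias_ih + bias_hh
    return total

def model_param_count(vocab_size, embed_dim=512, hidden_dim=512, num_layers=4, tied=True):
    """Embedding + GRU + output head; a tied head reuses the embedding matrix."""
    embedding = vocab_size * embed_dim
    head = 0 if tied else vocab_size * hidden_dim
    return embedding + gru_param_count(embed_dim, hidden_dim, num_layers) + head

example_vocab = 10_000  # placeholder vocabulary size (assumption)
print("GRU only:", gru_param_count(512, 512, 4))
print("tied:    ", model_param_count(example_vocab, tied=True))
print("untied:  ", model_param_count(example_vocab, tied=False))
```

The difference between the last two numbers is exactly `vocab_size * hidden_dim`, the size of the output matrix that tying removes.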

#### 3. **`generate_commit_message()` Inference Function**
- Takes in a Git diff and autoregressively generates a commit message.
- Uses:
  - **Temperature sampling** to control randomness.
  - **Top-k filtering** to limit token sampling to the top `k` probable choices.
- Stops generating once it produces the `<endOfCommitMessage>` token or hits `max_new_tokens`.
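
The effect of temperature and top-k is easier to see on a toy distribution. The sketch below is a pure-Python stand-in for the tensor operations in the cell that follows. The four logits are made up, and `top_k_probs` is a hypothetical helper.

```python
import math
import random

def top_k_probs(logits, temperature=0.8, top_k=50):
    """Return sampling probabilities after temperature scaling and top-k filtering."""
    scaled = [x / temperature for x in logits]
    k = min(top_k, len(scaled)) if top_k > 0 else len(scaled)
    # Indices of the k largest scaled logits; every other token gets probability 0
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    peak = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(scaled))]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = top_k_probs(toy_logits, temperature=0.8, top_k=2)
print([round(p, 3) for p in probs])

# Draw one token id from the filtered distribution
rng = random.Random(0)
next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(next_id)
```

Lowering `temperature` concentrates probability on the highest-scoring token, while lowering `top_k` removes low-probability tokens from consideration entirely.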

#### 4. **Model Initialization and Loading**
- Loads vocabulary size and padding ID from the tokenizer.
- Restores model weights from `trained_model/gru_model.pt`.

> Ensure that the tokenizer (`custom_tokenizer`) is already defined and includes special tokens: `<sos>`, `<endOfDiff>`, and `<endOfCommitMessage>`.


In [None]:
import torch
import torch.nn as nn

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# GRU Language Model definition
class GRULanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_layers=4, dropout=0.2, pad_id=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.drop = nn.Dropout(dropout)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True,
                          dropout=dropout if num_layers > 1 else 0.0)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying

    def forward(self, input_ids, hidden=None):
        x = self.drop(self.embed(input_ids))
        out, new_h = self.gru(x, hidden)
        logits = self.lm_head(out)
        return logits, new_h

# Inference function with temperature + top-k sampling
def generate_commit_message(model, tokenizer, git_diff, max_new_tokens=50, device=device, temperature=0.8, top_k=50):
    model.eval()
    sep_token = "<endOfDiff>"
    eos_token = "<endOfCommitMessage>"
    bos_token = "<sos>"

    # Prompt layout: <sos> + diff + <endOfDiff>; the model continues with the message
    input_text = bos_token + git_diff + sep_token
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
    # Look up the EOS id directly instead of encoding the token string
    eos_id = tokenizer.convert_tokens_to_ids(eos_token)

    # The first step consumes the whole prompt so the hidden state reflects the diff
    next_input = input_ids
    hidden = None
    new_ids = []

    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits, hidden = model(next_input, hidden)
            logits = logits[0, -1] / temperature

            if top_k > 0:
                # Keep only the k highest logits (k is capped at the vocabulary size)
                k = min(top_k, logits.size(-1))
                topk_values, topk_indices = torch.topk(logits, k)
                probs = torch.zeros_like(logits).scatter_(0, topk_indices, torch.softmax(topk_values, dim=-1))
            else:
                probs = torch.softmax(logits, dim=-1)

            next_id = torch.multinomial(probs, num_samples=1).item()
            if next_id == eos_id:
                break
            new_ids.append(next_id)
            # Later steps feed only the new token; the GRU state carries the context
            next_input = torch.tensor([[next_id]], device=device)

    # Decode only the generated tokens, so the diff never leaks into the output
    return tokenizer.decode(new_ids).strip()

# Initialize and load the model
vocab_size = len(custom_tokenizer)
model = GRULanguageModel(
    vocab_size=vocab_size,
    embed_dim=512,
    hidden_dim=512,
    num_layers=4,
    dropout=0.2,
    pad_id=custom_tokenizer.pad_token_id
)
model.load_state_dict(torch.load("trained_model/gru_model.pt", map_location=device))
model.to(device)


## Inference on a Sample Git Diff with Stochastic Decoding

This cell demonstrates how to generate multiple commit message predictions from a single Git diff using a trained GRU language model with temperature and top-k sampling.

### Steps:

- **Define a Sample Diff**  
  A minimal Git diff is defined in `sample_diff` showing a small code modification (e.g., enabling a logging flag in `config.py`).

- **Initial Commit Message Generation**
  - `generate_commit_message(...)` is first called with:
    - `temperature=0.8`: controls randomness in token sampling.
    - `top_k=50`: restricts token sampling to the top 50 most probable choices.

- **Multiple Variants via Sampling**
  - The model is run **5 times** with:
    - `temperature=0.9` and `top_k=100` to increase output diversity.
    - Each run may yield a different commit message depending on sampling variation.

### Purpose:
This approach evaluates the model's **generative diversity** and helps identify whether it produces **semantically relevant and varied** messages for the same diff input.
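
One simple way to put a number on that diversity is the share of distinct messages among the sampled outputs. This is a minimal sketch, not a metric used in this notebook. `distinct_ratio` is a hypothetical helper, and the messages below are made up rather than real model generations.

```python
def distinct_ratio(messages):
    """Fraction of unique messages after trimming whitespace and lowercasing."""
    normalized = [m.strip().lower() for m in messages]
    return len(set(normalized)) / len(normalized) if normalized else 0.0

# Made-up outputs for illustration only
samples = [
    "Enable logging",
    "enable logging ",
    "Add logging flag",
    "Add LOGGING_ENABLED setting",
    "Update config",
]
print(distinct_ratio(samples))
```

A ratio near 1.0 means nearly every sample differs; a ratio near `1 / len(messages)` means the model keeps repeating one message. The ratio says nothing about relevance, which still needs a manual read.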

> Ensure that `model`, `custom_tokenizer`, and `generate_commit_message()` have been defined and loaded beforehand.


In [None]:
# Sample Git diff input
sample_diff = """diff --git a/config.py b/config.py
index abc123..def456 100644
--- a/config.py
+++ b/config.py
@@ -1,5 +1,6 @@
 DEBUG = False
+LOGGING_ENABLED = True
"""

# Run inference once with the baseline sampling settings
generated_msg = generate_commit_message(
    model,
    custom_tokenizer,
    sample_diff,
    device=device,
    temperature=0.8,
    top_k=50
)
print(f"Generated: {generated_msg}\n")

# Sample five more variants with a higher temperature and a wider top-k
for i in range(5):
    msg = generate_commit_message(model, custom_tokenizer, sample_diff, device=device, temperature=0.9, top_k=100)
    print(f"[{i+1}] {msg}\n")



## Batch Inference on Validation Set Samples

This cell evaluates the GRU commit message generation model on random samples from the validation split of the dataset.

### Dataset Preparation
- Loads the file `cleaned_python_commit_dataset.csv` from the `data/` directory.  
  This CSV contains:
  - A column named `diff` with raw Git diffs (code changes).
  - A column named `commit_message` with the corresponding human-written messages.
  
- The dataset is read into a Pandas DataFrame. Then:
  - Any missing values are dropped.
  - Both `diff` and `commit_message` columns are explicitly converted to strings for consistency.

### Data Splitting
- Performs an 80/20 train-validation split, matching the setup used during training.
- The validation set (`val_data`) is used for evaluation.
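
The split is only comparable to training if the shuffle is deterministic. The sketch below shows the idea on toy data. `split_train_val` is a hypothetical helper, and it uses a private `random.Random(seed)` instance, which should reproduce the order of a global `random.seed(42)` followed by `random.shuffle` without touching the global random state.

```python
import random

def split_train_val(pairs, train_frac=0.8, seed=42):
    """Shuffle a copy of `pairs` with a fixed seed, then cut it at train_frac."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # private RNG, global state untouched
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy (diff, message) pairs standing in for the real dataset
toy_pairs = [(f"diff {i}", f"message {i}") for i in range(10)]
train_a, val_a = split_train_val(toy_pairs)
train_b, val_b = split_train_val(toy_pairs)

print(len(train_a), len(val_a))
print(val_a == val_b)  # same seed, same validation set
```

The split only matches training if the rows arrive in the same order before shuffling, so the CSV and the `dropna` step have to be identical to the training run.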

### Random Sample Inference
- Randomly selects `num_samples` (default = 5) from the validation set.
- For each sample:
  - The model generates a commit message using temperature sampling (`temperature=0.9`, `top_k=100`).
  - The original Git diff, true commit message, and predicted message are printed for comparison.
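
Note that `random.choice` draws with replacement, so the same validation example can appear more than once in a run. If distinct examples are preferred, `random.sample` is one alternative. The sketch below uses made-up pairs in place of `val_data`.

```python
import random

# Stand-in for val_data: a list of (diff, message) tuples
val_examples = [(f"diff {i}", f"message {i}") for i in range(20)]
num_samples = 5

rng = random.Random()  # unseeded, so every run picks different examples
# sample() never returns the same element twice; cap k at the list length
picked = rng.sample(val_examples, k=min(num_samples, len(val_examples)))

for diff, message in picked:
    print(diff, "->", message)
```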

### Purpose
This process provides a **qualitative evaluation** of the model’s performance and its ability to generalize to unseen Git diffs. Sampling-based decoding allows observation of **variability and creativity** in model outputs.

> Requires `generate_commit_message()`, `model`, `custom_tokenizer`, and `device` to be defined beforehand.


In [None]:
import pandas as pd
import random

# Load the same cleaned dataset used during training
df = pd.read_csv("data/cleaned_python_commit_dataset.csv")

# Drop rows with missing values, then cast both columns to strings
df = df.dropna(subset=["diff", "commit_message"])
df["diff"] = df["diff"].astype(str)
df["commit_message"] = df["commit_message"].astype(str)

data = list(zip(df["diff"].tolist(), df["commit_message"].tolist()))

# Reproduce the same 80/20 split used in training
random.seed(42)
random.shuffle(data)
split_idx = int(0.8 * len(data))
val_data = data[split_idx:]

# Set number of samples to test
num_samples = 5
random.seed()  # Use non-fixed seed for varied results

# Run inference on multiple random samples
for i in range(num_samples):
    sample_diff, true_msg = random.choice(val_data)
    predicted_msg = generate_commit_message(
        model,
        custom_tokenizer,
        sample_diff,
        device=device,
        temperature=0.9,
        top_k=100
    )

    print(f"\n======== SAMPLE {i+1} ========")
    print("Git Diff:\n", sample_diff.strip())
    print("\nGround Truth Commit Message:\n", true_msg.strip())
    print("\nGenerated Commit Message:\n", predicted_msg.strip())
