## Load the trained tokenizer

Load the custom BPE tokenizer that was trained and saved in the previous step (`custom_bpe_tokenizer.json`) using Hugging Face’s `PreTrainedTokenizerFast`.

### Special Tokens Added
- `<pad>`: used for padding sequences during batching
- `<sos>`: marks the start of each input sequence
- `<endOfCommitMessage>`: tells the model where the commit message ends
- `<endOfDiff>`: separates the git diff from the commit message
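
As a rough sketch of how these tokens frame a sequence: the inference code in this notebook builds its prompt as `<sos>` + diff + `<endOfDiff>`. The full layout shown here (prompt followed by the message and `<endOfCommitMessage>`) is an assumption inferred from the token descriptions, and `build_sequence` is a hypothetical helper that does not exist in the notebook.

```python
# Hypothetical helper illustrating the assumed sequence layout.
SOS = "<sos>"
END_OF_DIFF = "<endOfDiff>"
END_OF_MESSAGE = "<endOfCommitMessage>"


def build_sequence(diff, message=None):
    """Return the inference prompt, or a full training-style sequence if a message is given."""
    prompt = SOS + diff + END_OF_DIFF
    if message is None:
        return prompt  # the model is expected to continue from here
    return prompt + message + END_OF_MESSAGE


print(build_sequence("+print('hi')"))
print(build_sequence("+print('hi')", "Add greeting"))
```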

In [None]:
from transformers import PreTrainedTokenizerFast

# Load custom BPE tokenizer
custom_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="custom_bpe_tokenizer.json"
)
custom_tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "eos_token": "<endOfCommitMessage>",
    "bos_token": "<sos>"
})
custom_tokenizer.add_tokens(["<endOfDiff>"])


## GRU Commit Message Generation Model Setup

This cell defines and loads a GRU-based language model designed to generate commit messages from Git diffs.

### Components

1. **Device Configuration**
   - Uses GPU if available, otherwise falls back to CPU.

2. **GRULanguageModel Class**
   - A 4-layer GRU model with:
     - Embedding layer (`embed_dim=512`)
     - Hidden state dimension of 512
     - Dropout for regularization (`dropout=0.2`)
     - Weight tying between the embedding and output layers (this requires `embed_dim` to equal `hidden_dim`)
   - Accepts tokenized input sequences and returns logits over the vocabulary.

3. **`generate_commit_message(...)` Function**
   - Generates a commit message token-by-token using:
     - **Temperature sampling** (for controlling randomness)
     - **Top-k filtering** (to limit vocabulary scope)
   - Stops generation when the `<endOfCommitMessage>` token is produced or the maximum length is reached.

4. **Model Initialization & Checkpoint Loading**
   - Loads vocabulary size from the tokenizer.
   - Instantiates the model with the same hyperparameters used during training.
   - Loads pre-trained weights from `trained_model/gru_model.pt`.

> **Note**: This cell assumes that `custom_tokenizer` is already defined and includes the special tokens `<pad>`, `<sos>`, `<endOfDiff>`, and `<endOfCommitMessage>`.
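
To make the sampling step concrete, here is a minimal plain-Python sketch of temperature scaling followed by top-k filtering. It is an illustration only: it uses lists and the standard library rather than PyTorch tensors, and the names `top_k_probs` and `sample_next` and the toy logits are assumptions, not part of the notebook.

```python
import math
import random


def top_k_probs(logits, temperature=0.8, top_k=50):
    """Return a probability list that is zero outside the top_k highest logits."""
    scaled = [x / temperature for x in logits]
    k = len(scaled) if top_k <= 0 else min(top_k, len(scaled))
    # Indices of the k largest scaled logits
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability)
    peak = max(scaled[i] for i in keep)
    exps = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scaled))]


def sample_next(logits, temperature=0.8, top_k=50, rng=random):
    """Draw one token id from the filtered distribution."""
    probs = top_k_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(top_k_probs(toy_logits, temperature=1.0, top_k=2))
print(sample_next(toy_logits, temperature=1.0, top_k=2))
```

Lower temperatures sharpen the distribution toward the highest-scoring token, while a smaller `top_k` removes the long tail of unlikely tokens before sampling.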


In [None]:
import torch
import torch.nn as nn

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# GRU Language Model definition
class GRULanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_layers=4, dropout=0.2, pad_id=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.drop = nn.Dropout(dropout)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True,
                          dropout=dropout if num_layers > 1 else 0.0)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying

    def forward(self, input_ids, hidden=None):
        x = self.drop(self.embed(input_ids))
        out, new_h = self.gru(x, hidden)
        logits = self.lm_head(out)
        return logits, new_h

# Inference function with temperature + top-k sampling
def generate_commit_message(model, tokenizer, git_diff, max_new_tokens=50, device='cuda', temperature=0.8, top_k=50):
    model.eval()
    sep_token = "<endOfDiff>"
    eos_token = "<endOfCommitMessage>"
    bos_token = "<sos>"

    input_text = bos_token + git_diff + sep_token
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
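    # Look up the id directly; encode() could add extra tokens if the tokenizer has a post-processor
    eos_id = tokenizer.convert_tokens_to_ids(eos_token)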

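    generated = input_ids
    hidden = None
    # The first step feeds the whole prompt so the hidden state encodes the diff;
    # later steps feed only the newly sampled token.
    step_input = input_ids

    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits, hidden = model(step_input, hidden)
            logits = logits[0, -1] / temperature

            if top_k > 0:
                # Clamp k so torch.topk never asks for more entries than the vocabulary has
                k = min(top_k, logits.size(-1))
                topk_values, topk_indices = torch.topk(logits, k)
                probs = torch.zeros_like(logits).scatter_(0, topk_indices, torch.softmax(topk_values, dim=-1))
            else:
                probs = torch.softmax(logits, dim=-1)

            next_id = torch.multinomial(probs, num_samples=1).item()
            step_input = torch.tensor([[next_id]], device=device)
            generated = torch.cat([generated, step_input], dim=1)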

            if next_id == eos_id:
                break

    decoded = tokenizer.decode(generated[0].tolist())
    return decoded.split(sep_token)[1].replace(eos_token, "").strip()

# Initialize and load the model
vocab_size = len(custom_tokenizer)
model = GRULanguageModel(
    vocab_size=vocab_size,
    embed_dim=512,
    hidden_dim=512,
    num_layers=4,
    dropout=0.2,
    pad_id=custom_tokenizer.pad_token_id
)
model.load_state_dict(torch.load("trained_model/gru_model.pt", map_location=device))
model.to(device)


## Evaluation on 100 Git Diffs — BLEU, ROUGE-L, METEOR, BERTScore, RAGAs

This cell evaluates the quality of commit messages generated by the GRU language model on 100 randomly sampled Git diffs using multiple automatic metrics.

### Workflow Overview

1. **Dataset Sampling**
   - Randomly selects 100 samples from `data/cleaned_python_commit_dataset.csv` (fixed seed, `random_state=42`).
   - Replaces missing `diff` or `commit_message` values with empty strings and casts both columns to `str`.

2. **Generation**
   - Uses the `generate_commit_message(...)` function with:
     - `temperature = 0.9`
     - `top_k = 100`
   - Each Git diff is fed into the model to generate a predicted commit message.

3. **Evaluation Metrics**
   - **BLEU**: N-gram precision (with smoothing).
   - **ROUGE-L**: Longest common subsequence overlap.
   - **METEOR**: Precision, recall, and alignment with synonym matching.
   - **BERTScore**: Semantic similarity using contextual embeddings.
   - **RAGAs Answer Correctness**: GPT-graded factual alignment between generated and reference messages.

4. **Output**
   - All results (diff, predicted message, reference, and metric scores) are stored in a list of dictionaries.
   - RAGAs is executed with retry logic to handle API timeouts.
   - The final results are saved as a JSON file:
     ```
     sample_metrics_with_ragas_and_bertscore.json
     ```
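
To illustrate what ROUGE-L measures, the following is a simplified sketch of the longest-common-subsequence F-measure. It assumes lowercased whitespace tokenization and no stemming, so its values can differ from those of the `rouge_score` package used in this notebook; `lcs_length` and `rouge_l_f1` are hypothetical names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F-measure on lowercased whitespace tokens (no stemming)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("fix typo in readme", "fix readme typo"))
```

In this example the longest common subsequence has length 2, giving precision 2/3, recall 1/2, and an F-measure of 4/7 (about 0.571).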


> **Important**: To compute **RAGAs Answer Correctness**, you must set your `OPENAI_API_KEY`.  
> This uses the OpenAI API (e.g., GPT-4), which **costs money**. Make sure you have billing enabled and a key with sufficient quota.
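
Because the RAGAs step depends on a valid key and may hit transient timeouts, it can help to fail fast on a missing key and to cap the number of retries. The sketch below shows one way to do this with the standard library only; `require_api_key` and `call_with_retries` are hypothetical helpers, and the attempt count and delay are assumed values.

```python
import os
import time


def require_api_key(name="OPENAI_API_KEY", environ=None):
    """Return the key, or raise early with a clear message if it is unset or empty."""
    environ = os.environ if environ is None else environ
    key = environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; RAGAs Answer Correctness cannot run without it.")
    return key


def call_with_retries(fn, max_attempts=5, delay_seconds=5.0, sleep=time.sleep):
    """Call fn() until it succeeds, re-raising the last error after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: this is only a sketch
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay_seconds} s...")
            sleep(delay_seconds)
```

In the evaluation cell, a call such as `call_with_retries(lambda: evaluate(ragas_ds, metrics=[answer_correctness]))` could wrap the RAGAs request.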


In [None]:
# ------------------------------------------------------------
# Eval on 100 samples: BLEU, ROUGE-L, METEOR, BERTScore, RAGAs Answer Correctness -> JSON
# ------------------------------------------------------------
!pip install -q "nltk==3.6.7" rouge-score bert-score ragas datasets

import pandas as pd, json, nltk, random, time, os
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from tqdm import tqdm
from bert_score import BERTScorer
from ragas.metrics import answer_correctness
from ragas import evaluate
from datasets import Dataset

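# Set the OpenAI API key for RAGAs; a key already present in the environment is left untouched
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = ""  # paste your key here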

# Download WordNet for METEOR
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

# --- Load Dataset ---
df = pd.read_csv("data/cleaned_python_commit_dataset.csv")
sample_df = (
    df.sample(n=100, random_state=42)
      .reset_index(drop=True)
      .fillna({"diff": "", "commit_message": ""})
)
sample_df["diff"] = sample_df["diff"].astype(str)
sample_df["commit_message"] = sample_df["commit_message"].astype(str)

# --- Metric Setup ---
smooth_fn = SmoothingFunction().method1
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
bert_scorer = BERTScorer(lang="en", rescale_with_baseline=True)

# --- Evaluation Loop ---
ragas_examples = {"question": [], "answer": [], "ground_truth": []}
results = []

bar = tqdm(list(zip(sample_df["diff"], sample_df["commit_message"])),
           total=len(sample_df), desc="Scoring")

for diff_text, ref in bar:
    hyp = generate_commit_message(
        model, custom_tokenizer, diff_text,
        device=device,
        temperature=0.9,
        top_k=100
    )

    if ref.strip():
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        bleu = sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smooth_fn)
        rougeL = rouge.score(ref, hyp)["rougeL"].fmeasure
        meteor = meteor_score([ref_tokens], hyp_tokens)
    else:
        bleu = rougeL = meteor = 0.0

    P, R, F1 = bert_scorer.score([hyp], [ref])
    bert_f1 = F1[0].item()

    bar.set_postfix(
        BLEU=f"{bleu:.4f}",
        ROUGE_L=f"{rougeL:.4f}",
        METEOR=f"{meteor:.4f}",
        BERTScore=f"{bert_f1:.4f}"
    )

    ragas_examples["question"].append(diff_text)
    ragas_examples["answer"].append(hyp)
    ragas_examples["ground_truth"].append(ref)

    results.append({
        "diff": diff_text,
        "generated_commit": hyp,
        "label_commit": ref,
        "bleu": bleu,
        "rouge_l": rougeL,
        "meteor": meteor,
        "bert_score_f1": bert_f1
    })

bar.close()

# --- Run RAGAs (with bounded retry) ---
ragas_ds = Dataset.from_dict(ragas_examples)
max_retries = 5  # cap the retries so a persistent failure (e.g., a missing API key) cannot loop forever
for attempt in range(1, max_retries + 1):
    try:
        ragas_res = evaluate(ragas_ds, metrics=[answer_correctness])
        break
    except Exception as e:  # TimeoutError is a subclass of Exception, so timeouts are covered here
        if attempt == max_retries:
            raise
        print(f"Attempt {attempt} failed ({e}). Retrying in 5 seconds...")
        time.sleep(5)

# Add RAGAs Answer Correctness
df_corr = ragas_res.to_pandas()
for i, row in enumerate(df_corr.itertuples()):
    results[i]["answer_correctness"] = row.answer_correctness

# --- Save to JSON ---
json_file = "sample_metrics_with_ragas_and_bertscore.json"
with open(json_file, "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

print(f"File saved as {json_file} in your local directory.")


## Write test scores to a .txt file for manual analysis

This cell takes the JSON file of model evaluation results (`sample_metrics_with_ragas_and_bertscore.json`) and converts it into a readable text file.

What it does:
- Loads all evaluation samples from the JSON file.
- For each sample, it writes:
  - The generated commit message
  - The ground-truth commit message
  - All evaluation scores (BLEU, ROUGE-L, METEOR, BERTScore-F1, AnswerCorrectness)
  - The associated Git diff
- Outputs everything in a clean, readable format to `all_gru_diffs.txt`, with clear separators between samples.

This is useful if you want to manually review model outputs and compare them to the reference messages.
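
For manual review it can be convenient to look at the weakest outputs first. The sketch below is one possible way to do that: it sorts result records by a chosen metric, using the same keys as the JSON file. The helper name `worst_first` and the demo records are made up for illustration.

```python
def worst_first(samples, metric="bert_score_f1", limit=5):
    """Return up to `limit` samples with the lowest value of `metric` (missing counts as 0.0)."""
    return sorted(samples, key=lambda s: s.get(metric) or 0.0)[:limit]


# Made-up records that mimic the structure of the JSON file
demo = [
    {"generated_commit": "fix bug", "label_commit": "Fix crash on empty diff", "bert_score_f1": 0.41},
    {"generated_commit": "update", "label_commit": "Add retry logic", "bert_score_f1": 0.05},
    {"generated_commit": "add tests", "label_commit": "Add unit tests", "bert_score_f1": 0.78},
]
for s in worst_first(demo, limit=2):
    print(f"{s['bert_score_f1']:.2f}  {s['generated_commit']!r} vs {s['label_commit']!r}")
```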

In [None]:
import json
import pathlib

# Paths for input JSON and output text file
in_path  = pathlib.Path("sample_metrics_with_ragas_and_bertscore.json")   # JSON array produced earlier
out_path = pathlib.Path("all_gru_diffs.txt")             # Destination text file

# 1️⃣ — Load the JSON array from the file
with in_path.open(encoding="utf-8") as src:
    samples = json.load(src)     # list[dict], length ~100 (for 100 samples)

# 2️⃣ — Write each sample to the output file in a readable format
with out_path.open("w", encoding="utf-8") as dst:
    for i, obj in enumerate(samples, 1):
        dst.write(f"---- GRU SAMPLE #{i} ----\n")
        dst.write(f"Generated Commit Message : {obj.get('generated_commit','')}\n")
        dst.write(f"Ground-Truth Commit      : {obj.get('label_commit','')}\n")
        dst.write(
            "Scores (BLEU / ROUGE-L / METEOR / BERTScore-F1 / AnswerCorrectness) : "
            f"{obj.get('bleu', 0.0):.4f} / "
            f"{obj.get('rouge_l', 0.0):.4f} / "
            f"{obj.get('meteor', 0.0):.4f} / "
            f"{obj.get('bert_score_f1', 0.0):.4f} / "
            f"{obj.get('answer_correctness', 0.0):.4f}\n"
        )
        dst.write("Git Diff:\n")
        dst.write(obj.get("diff", ""))
        dst.write("\n\n")  # blank line between samples

# 3️⃣ — Confirmation Message
print(f"Wrote {len(samples)} GRU-generated samples to {out_path.resolve()}")

## Create plots for results and save

This cell loads the evaluation results from `sample_metrics_with_ragas_and_bertscore.json` and visualizes the distribution of all five metrics across the 100 samples.

What it does:
- Loads the JSON into a pandas DataFrame.
- Extracts:  
  - `bleu`  
  - `rouge_l`  
  - `meteor`  
  - `bert_score_f1`  
  - `answer_correctness`  
- Plots a histogram for each metric to show how scores are distributed across the dataset.
- Saves each plot as a separate PNG file:
  - `gru_bleu4_distribution.png`
  - `gru_rouge_l_distribution.png`
  - `gru_meteor_distribution.png`
  - `gru_bert_score_f1_distribution.png`
  - `gru_answer_correctness_distribution.png`

This gives a quick visual sense of how well the model is performing across different evaluation dimensions.
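
Alongside the histograms, a few summary numbers per metric can make comparisons easier. The following is a small standard-library sketch, not part of the notebook: `summarize` is a hypothetical helper, the demo records are invented, and missing or NaN scores are treated as 0.0 to match the `fillna(0.0)` used in the plotting cell.

```python
import math
import statistics


def _clean(value):
    """Map missing or NaN scores to 0.0, otherwise return the score as a float."""
    return 0.0 if value is None or math.isnan(value) else float(value)


def summarize(samples, metrics):
    """Return {metric: {"mean", "median", "min", "max"}} for a list of result dicts."""
    summary = {}
    for name in metrics:
        values = [_clean(s.get(name)) for s in samples]
        summary[name] = {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


# Invented records using the same keys as the results JSON
demo = [{"bleu": 0.10, "rouge_l": 0.30}, {"bleu": 0.20, "rouge_l": 0.50}, {"bleu": 0.60}]
for name, stats in summarize(demo, ["bleu", "rouge_l"]).items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```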


In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt

# 🔹 **Path to the JSON file**
json_path = "sample_metrics_with_ragas_and_bertscore.json"   # <-- change if yours lives elsewhere

# 🔹 **Load JSON and select all five metrics**
with open(json_path, encoding="utf-8") as f:
    df = pd.DataFrame(json.load(f))
scores = df[["bleu", "rouge_l", "meteor", "bert_score_f1", "answer_correctness"]].fillna(0.0)

# 🔹 **Create Histograms for Each Metric**
plt.figure()
plt.hist(scores["bleu"], bins=30)
plt.title("GRU Model — Distribution of BLEU‑4 Scores")
plt.xlabel("BLEU‑4")
plt.ylabel("Frequency")
plt.savefig("gru_bleu4_distribution.png")
plt.show()

plt.figure()
plt.hist(scores["rouge_l"], bins=30)
plt.title("GRU Model — Distribution of ROUGE‑L F1 Scores")
plt.xlabel("ROUGE‑L F1")
plt.ylabel("Frequency")
plt.savefig("gru_rouge_l_distribution.png")
plt.show()

plt.figure()
plt.hist(scores["meteor"], bins=30)
plt.title("GRU Model — Distribution of METEOR Scores")
plt.xlabel("METEOR")
plt.ylabel("Frequency")
plt.savefig("gru_meteor_distribution.png")
plt.show()

plt.figure()
plt.hist(scores["bert_score_f1"], bins=30)
plt.title("GRU Model — Distribution of BERTScore‑F1 Scores")
plt.xlabel("BERTScore‑F1")
plt.ylabel("Frequency")
plt.savefig("gru_bert_score_f1_distribution.png")
plt.show()

plt.figure()
plt.hist(scores["answer_correctness"], bins=30)
plt.title("GRU Model — Distribution of Answer Correctness Scores")
plt.xlabel("Answer Correctness")
plt.ylabel("Frequency")
plt.savefig("gru_answer_correctness_distribution.png")
plt.show()