<a href="https://colab.research.google.com/github/benedettoscala/ifttt-code-generator/blob/main/test_and_compare_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%%capture
!pip install -U bitsandbytes

In [4]:
!git clone https://github.com/benedettoscala/ifttt-code-generator
%cd ifttt-code-generator/
!git pull

Cloning into 'ifttt-code-generator'...
remote: Enumerating objects: 220, done.[K
remote: Counting objects: 100% (63/63), done.[K
remote: Compressing objects: 100% (63/63), done.[K
remote: Total 220 (delta 37), reused 0 (delta 0), pack-reused 157 (from 1)[K
Receiving objects: 100% (220/220), 14.94 MiB | 16.16 MiB/s, done.
Resolving deltas: 100% (130/130), done.
/content/ifttt-code-generator
Already up to date.


In [5]:
%cd ..

/content


### Model Comparison: GPT-2, BART, and Mistral for IFTTT Code Generation
This section evaluates and compares the performance of three fine-tuned language models (**GPT-2, BART, and Mistral**) in generating IFTTT-style automation code from textual descriptions.

#### **Dataset Loading and Preprocessing**
- The dataset is loaded from `"ifttt-code-generator/datasets/cleaned_and_combined.csv"`.
- The data is split into **80% training** and **20% testing**.
- **Test prompts** (natural language descriptions) and their corresponding **actual code** are extracted for evaluation.

#### **Model-Specific Generation Functions**
Each model uses a different approach for text generation:

1. **GPT-2 (`generate_with_gpt2`)**:
   - Loads a fine-tuned GPT-2 model from Hugging Face.
   - Encodes each test prompt and generates code using `generate()`, limiting output to 128 tokens.
   - The model is loaded onto a CUDA device for faster inference.

2. **BART (`generate_with_bart`)**:
   - Uses a **text-to-text generation pipeline**.
   - Each test prompt is formatted with an `"ifttt_prompt:"` prefix for consistent input format.
   - The generated text is extracted from the pipeline output.

3. **Mistral (`generate_with_mistral`)**:
   - Loads the **base model** `"Mistral-7B-Instruct-v0.2"` with **4-bit quantization** for memory efficiency.
   - Loads the **fine-tuned LoRA adapter** from `/content/drive/Shareddrives/NLPMODELS/mistral/checkpoint-20`.
   - Generates responses with:
     - `do_sample=True` (introduces randomness)
     - `top_k=50` (limits sampling to top 50 tokens)
     - `top_p=0.95` (nucleus sampling for diverse outputs)
     - `temperature=1` (controls creativity)

#### **Inference Execution**
- Each model generates code for all test prompts.
- Model-generated responses are stored in separate lists.

#### **Results Compilation**
- A `pandas` DataFrame is created containing:
  - **Prompt:** The original natural language description.
  - **Generated Code GPT-2:** The output from GPT-2.
  - **Generated Code BART:** The output from BART.
  - **Generated Code Mistral:** The output from Mistral.
  - **Actual Code:** The ground truth for comparison.

This setup enables **direct performance comparison** between the three models, helping assess which model best converts natural language descriptions into automation code.


In [5]:
import pandas as pd
import torch
import os
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sklearn.model_selection import train_test_split
from google.colab import drive
from peft import PeftModel

drive.mount('/content/drive')

# Load the dataset and split it
df = pd.read_csv("ifttt-code-generator/datasets/cleaned_and_combined.csv")
#drop duplicates and null
df = df.drop_duplicates()
df = df.dropna()

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Extract test set prompts
prompts = test_df["cleaned_description"].tolist()
actual_codes = test_df["filter_code"].tolist()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:


# Function to generate text with GPT-2
def generate_with_gpt2(model_path):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path).to("cuda")
    generated_codes = []

    for prompt in prompts:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
        output_ids = model.generate(input_ids, num_return_sequences=1)
        generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        generated_codes.append(generated_text)

    del model
    del tokenizer
    torch.cuda.empty_cache()

    return generated_codes

# Function to generate text with BART
def generate_with_bart(model_path):
    generator = pipeline("text2text-generation", model=model_path, tokenizer=model_path)
    generated_codes = [generator(f"ifttt_prompt: {prompt}")[0]["generated_text"] for prompt in prompts]

    del generator
    torch.cuda.empty_cache()

    return generated_codes

# Function to generate text with Mistral

def generate_with_mistral(finetuned_model_path, basemodel_path):
    if not os.path.exists("./offload"):
        os.makedirs("./offload")




    print("Caricamento del modello fine-tunato...")
    bnb_config = BitsAndBytesConfig(
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False
    )

    model = AutoModelForCausalLM.from_pretrained(
        basemodel_path,
        torch_dtype=torch.float16,
        quantization_config=bnb_config,
        device_map="auto",
        offload_folder="./offload"
    )

    model = PeftModel.from_pretrained(model, finetuned_model_path)
    tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)

    generated_codes = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(
            **inputs,
            max_length=512,
            num_return_sequences=1,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=1,
        )
        decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        generated_codes.append(decoded_outputs[0])

    del model
    del tokenizer
    torch.cuda.empty_cache()

    return generated_codes

# Generate with Mistral
finetuned_model_path = "/content/drive/Shareddrives/NLPMODELS/mistral/checkpoint-20"
basemodel_path = "mistralai/Mistral-7B-Instruct-v0.2"
generated_codes_mistral = generate_with_mistral(finetuned_model_path, basemodel_path)


# Generate with BART
model_bart_path = "/content/drive/Shareddrives/NLPMODELS/nl2sql_bart_final/checkpoint-340"
generated_codes_bart = generate_with_bart(model_bart_path)


# Generate with GPT-2
model_gpt2_path = "/content/drive/Shareddrives/NLPMODELS/gpt2model/checkpoint-340"
generated_codes_gpt2 = generate_with_gpt2(model_gpt2_path)



# Create a DataFrame with results
results_df = pd.DataFrame({
    "Prompt": prompts,
    "Generated Code GPT-2": generated_codes_gpt2,
    "Generated Code BART": generated_codes_bart,
    "Generated Code Mistral": generated_codes_mistral,
    "Actual Code": actual_codes
})


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Caricamento del modello fine-tunato...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for o

In [28]:
#create csv of results_df
results_df.to_csv("results_df.csv", index=False)

In [12]:
#load csv
results_df = pd.read_csv("results_df.csv")

In [29]:
#copy the results_Df in results_copy_df
results_copy_df = results_df.copy()

In [11]:
results_df = results_copy_df.copy()

NameError: name 'results_copy_df' is not defined

In [8]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=aa9035b3c8ff1d911551dbfeec40cffb5980a3479485c12eab9f51969d1e41a7
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [35]:
results_df

Unnamed: 0,Prompt,Generated Code GPT-2,Generated Code BART,Generated Code Mistral,Actual Code
0,This applet will reward you 1p for every 10 me...,This applet will reward you 1p for every 10 me...,var distance = parseInt(Strava.newActivityByYo...,This applet will reward you 1p for every 10 me...,var distance = parseInt(Strava.newActivityByYo...
1,Enter a description of your meal and the numbe...,Enter a description of your meal and the numbe...,if (Evernote.newEating.DescriptionByYou.Text.i...,Enter a description of your meal and the numbe...,var values = DoNote.doNoteNewCommandCommon.Not...
2,Turn on Wemo Switch After Garage Door Opens (A...,Turn on Wemo Switch After Garage Door Opens (A...,var timeOfDay = Meta.currentUserTime.hour() ...,Turn on Wemo Switch After Garage Door Opens (A...,var hour = Meta.currentUserTime.hour() if ...
3,Turn on WeMo Smart Plug When Ring Detects Moti...,Turn on WeMo Smart Plug When Ring Detects Moti...,var timeOfDay = Meta.currentUserTime.hour() ...,Turn on WeMo Smart Plug When Ring Detects Moti...,var timeOfDay = Meta.currentUserTime.hour() if...
4,This applet will add an iOS reminder to drink ...,This applet will add an iOS reminder to drink ...,var timeOfDay = Meta.triggerTime.hour() if (...,This applet will add an iOS reminder to drink ...,"var reminderTime = Meta.triggerTime.add(2, 'h'..."
5,"We got you, Dallas: this Applet sends you a Te...","We got you, Dallas: this Applet sends you a Te...",var Hour = Meta.currentUserTime.hour() var Day...,"We got you, Dallas: this Applet sends you a Te...",var Hour = Meta.currentUserTime.hour() var Day...
6,facebook only text post,facebook only text post to runday\n###\nCode:\...,var text = FacebookPages.newStatusMessageByPag...,facebook only text post from a specific User b...,Facebook.newStatusMessageByYou.From Facebook.n...
7,If doorbell rings beween 21h and 6h then toggl...,If doorbell rings beween 21h and 6h then toggl...,var timeOfDay = Meta.currentUserTime.hour(); ...,If doorbell rings beween 21h and 6h then toggl...,var TimeOfDay = Meta.currentUserTime.hour() i...
8,Report today's rainfall amount from your Weath...,Report today's rainfall amount from your Weath...,if(parseFloat(Weather.currentWeather[0].rainfa...,Report today's rainfall amount from your Weath...,Netro.reportWeather.setDate(Netatmo.rainTodayA...
9,Turn light on when arriving to an area between...,Turn light on when arriving to an area between...,var timeOfDay = Meta.currentUserTime.hour() i...,Turn light on when arriving to an area between...,var timeOfDay = Meta.currentUserTime.hour(); ...


In [49]:
import pandas as pd

# Assicurarsi che tutti i valori siano stringhe
results_df["Generated Code GPT-2"] = results_df["Generated Code GPT-2"].astype(str)
results_df["Generated Code Mistral"] = results_df["Generated Code Mistral"].astype(str)

# Funzione per rimuovere il prompt e gestire errori
def clean_code(text):
    if "###" in text:
        return text.split("###", 1)[-1].strip()
    return text.strip()  # Se non c'è "###", restituisce la stringa originale
    # Funzione per rimuovere il prompt e gestire errori
import re

def clean_code_gpt(text):
    if "###\nCode:" in text:

        return text.split("###\nCode:", 1)[-1].strip()
    return text.strip()  # Se non c'è "###", restituisce la stringa originale
    # Funzione per rimuovere il prompt e gestire errori

# Applicare la funzione alla colonna
results_df["Generated Code GPT-2"] = results_df["Generated Code GPT-2"].apply(clean_code_gpt)
results_df["Generated Code Mistral"] = results_df["Generated Code Mistral"].apply(clean_code)

In [21]:
results_df

Unnamed: 0,Prompt,Generated Code GPT-2,Generated Code BART,Generated Code Mistral,Actual Code
0,This applet will reward you 1p for every 10 me...,This applet will reward you 1p for every 10 me...,var distance = parseInt(Strava.newActivityByYo...,This applet will reward you 1p for every 10 me...,var distance = parseInt(Strava.newActivityByYo...
1,Enter a description of your meal and the numbe...,Enter a description of your meal and the numbe...,if (Evernote.newEating.DescriptionByYou.Text.i...,Enter a description of your meal and the numbe...,var values = DoNote.doNoteNewCommandCommon.Not...
2,Turn on Wemo Switch After Garage Door Opens (A...,Turn on Wemo Switch After Garage Door Opens (A...,var timeOfDay = Meta.currentUserTime.hour() ...,Turn on Wemo Switch After Garage Door Opens (A...,var hour = Meta.currentUserTime.hour() if ...
3,Turn on WeMo Smart Plug When Ring Detects Moti...,Turn on WeMo Smart Plug When Ring Detects Moti...,var timeOfDay = Meta.currentUserTime.hour() ...,Turn on WeMo Smart Plug When Ring Detects Moti...,var timeOfDay = Meta.currentUserTime.hour() if...
4,This applet will add an iOS reminder to drink ...,This applet will add an iOS reminder to drink ...,var timeOfDay = Meta.triggerTime.hour() if (...,This applet will add an iOS reminder to drink ...,"var reminderTime = Meta.triggerTime.add(2, 'h'..."
5,"We got you, Dallas: this Applet sends you a Te...","We got you, Dallas: this Applet sends you a Te...",var Hour = Meta.currentUserTime.hour() var Day...,"We got you, Dallas: this Applet sends you a Te...",var Hour = Meta.currentUserTime.hour() var Day...
6,facebook only text post,facebook only text post to runday\n###\nCode:\...,var text = FacebookPages.newStatusMessageByPag...,facebook only text post from a specific User b...,Facebook.newStatusMessageByYou.From Facebook.n...
7,If doorbell rings beween 21h and 6h then toggl...,If doorbell rings beween 21h and 6h then toggl...,var timeOfDay = Meta.currentUserTime.hour(); ...,If doorbell rings beween 21h and 6h then toggl...,var TimeOfDay = Meta.currentUserTime.hour() i...
8,Report today's rainfall amount from your Weath...,Report today's rainfall amount from your Weath...,if(parseFloat(Weather.currentWeather[0].rainfa...,Report today's rainfall amount from your Weath...,Netro.reportWeather.setDate(Netatmo.rainTodayA...
9,Turn light on when arriving to an area between...,Turn light on when arriving to an area between...,var timeOfDay = Meta.currentUserTime.hour() i...,Turn light on when arriving to an area between...,var timeOfDay = Meta.currentUserTime.hour(); ...


### Evaluation of Generated Code
This section defines a function to evaluate the quality of model-generated code using multiple text similarity metrics.

#### **Evaluation Metrics**
- **BLEU Score (Bilingual Evaluation Understudy):**
  - Measures n-gram precision by comparing generated code with actual code.
  - Uses `sentence_bleu()` from NLTK for sentence-level evaluation.

- **METEOR Score (Metric for Evaluation of Translation with Explicit ORdering):**
  - Considers stemming, synonyms, and word order.
  - Computed using `single_meteor_score()` from NLTK.

- **ROUGE Scores (Recall-Oriented Understudy for Gisting Evaluation):**
  - Measures recall-based overlap between generated and reference text.
  - Three variations are computed:
    - **ROUGE-1:** Unigram (single-word) overlap.
    - **ROUGE-2:** Bigram (two-word) overlap.
    - **ROUGE-L:** Measures longest common subsequence similarity.

#### **Evaluation Process**
- For each pair of generated and actual code snippets:
  - Text is **tokenized** by splitting into words.
  - **BLEU, METEOR, and ROUGE scores** are computed.
- The function stores individual scores for all test samples.
- The **average score** is computed for each metric across the entire dataset.

#### **Returned Results**
- `mean_bleu`: Average BLEU score over all test samples.
- `mean_meteor`: Average METEOR score.
- `mean_rouge_l`: Average ROUGE-L score.
- `mean_rouge_1`: Average ROUGE-1 score.
- `mean_rouge_2`: Average ROUGE-2 score.

This function enables a **comprehensive evaluation** of model-generated code, ensuring a robust comparison against ground truth data.


In [51]:
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import single_meteor_score

def evaluate_generated_text(generated_codes, actual_codes):
    bleu_scores = []
    meteor_scores = []
    rouge_l_scores = []
    rouge_1_scores = []
    rouge_2_scores = []

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scorer_1 = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    scorer_2 = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

    for gen, ref in zip(generated_codes, actual_codes):
        gen_tokens = gen.split()
        ref_tokens = ref.split()

        # Calcolo BLEU (sentence-level)
        bleu = sentence_bleu([ref_tokens], gen_tokens)

        # Calcolo METEOR (sentence-level)
        meteor = single_meteor_score(ref_tokens, gen_tokens)

        # Calcolo ROUGE-L (f-measure)
        rouge_l = scorer.score(ref, gen)["rougeL"].fmeasure
        # calcolo ROUGE-1
        rouge_1 = scorer_1.score(ref, gen)["rouge1"].fmeasure
        # calcolo ROUGE-2
        rouge_2 = scorer_2.score(ref, gen)["rouge2"].fmeasure

        bleu_scores.append(bleu)
        meteor_scores.append(meteor)
        rouge_l_scores.append(rouge_l)
        rouge_1_scores.append(rouge_1)
        rouge_2_scores.append(rouge_2)

    # Media su tutte le frasi del dataset di test
    mean_bleu = sum(bleu_scores) / len(bleu_scores)
    mean_meteor = sum(meteor_scores) / len(meteor_scores)
    mean_rouge_l = sum(rouge_l_scores) / len(rouge_l_scores)
    mean_rouge_1 = sum(rouge_1_scores) / len(rouge_1_scores)
    mean_rouge_2 = sum(rouge_2_scores) / len(rouge_2_scores)

    return mean_bleu, mean_meteor, mean_rouge_l, mean_rouge_1, mean_rouge_2


In [52]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True


#### **Results Compilation**
- A `pandas` DataFrame (`metrics_df`) is created to organize evaluation scores.
- The DataFrame has the following structure:
  - **Metric** → The evaluation metric name.
  - **GPT-2** → Scores from the GPT-2 model.
  - **BART** → Scores from the BART model.
  - **Mistral** → Scores from the Mistral model.

This setup allows for **direct comparison** of model performance across multiple evaluation metrics.


In [53]:
# Evaluate models
gpt2_scores = evaluate_generated_text(generated_codes_gpt2, actual_codes)
bart_scores = evaluate_generated_text(generated_codes_bart, actual_codes)
mistral_scores = evaluate_generated_text(generated_codes_mistral, actual_codes)

metrics_df = pd.DataFrame(
    {
        "Metric": ["BLEU", "METEOR", "ROUGE-L", "ROUGE-1", "ROUGE-2"],
        "GPT-2": gpt2_scores,
        "BART": bart_scores,
        "Mistral": mistral_scores
    }
)


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [54]:
metrics_df

Unnamed: 0,Metric,GPT-2,BART,Mistral
0,BLEU,0.022889,0.177501,0.059718
1,METEOR,0.086055,0.344752,0.259624
2,ROUGE-L,0.158086,0.478291,0.240256
3,ROUGE-1,0.210855,0.502479,0.277994
4,ROUGE-2,0.08492,0.3509,0.150717


# Evaluation of Generated Code using Perplexity

This section defines functions to evaluate the quality of **model-generated code** using the **perplexity metric**. It includes methods for different types of language models, such as:
- **Causal models (e.g., GPT-2)**
- **Fine-tuned LoRA models with 4-bit quantization**
- **Seq2Seq models (e.g., BART, T5)**

## Evaluation Metric

### **Perplexity (PPL)**
- Measures how well a **probability model** predicts a given text.
- Lower perplexity **(PPL ↓)** means the model is more confident and the generated text is more accurate.
- Computed as:

  \[
  PPL = e^{(\text{mean loss})}
  \]

- The **loss** is calculated using **cross-entropy** between the predicted tokens and the actual code.

---

## Evaluation Functions

The functions below compute **perplexity** for different scenarios:

### **1. `compute_perplexity_causal_prompt_target`**
- Computes **perplexity for a causal model** (e.g., **GPT-2**).
- **Masks the prompt tokens**, so the loss is computed **only on the generated code**.
- **Steps**:
  1. Load **model** and **tokenizer**.
  2. Concatenate **prompt** and **code**.
  3. Mask the **prompt tokens** (`-100` in PyTorch prevents loss computation on masked tokens).
  4. Compute **loss** and return **exponential of mean loss (perplexity)**.

---

### **2. `compute_perplexity_causal_prompt_target_lora4bit`**
- Computes **perplexity for a LoRA fine-tuned model** (**4-bit quantization**).
- Uses a **base model** and loads a fine-tuned **Low-Rank Adaptation (LoRA)** model.
- Follows similar steps as function (1), but:
  - Uses **quantization-aware loading** (`bnb_4bit_quant_type="nf4"`).
  - Loads the **fine-tuned LoRA model** using `PeftModel`.

---

### **3. `compute_perplexity_seq2seq`**
- Computes **perplexity for a sequence-to-sequence (Seq2Seq) model** (**e.g., BART, T5**).
- Evaluates how well the model translates a **prompt into the corresponding code**.
- **Steps**:
  1. Load **Seq2Seq model** and **tokenizer**.
  2. Tokenize **prompt** (input) and **code** (output).
  3. Compute **loss** on the generated code.
  4. Return **exponential of mean loss (perplexity)**.

---

## Evaluation Process

For each pair of **prompt and generated code**:
1. **Tokenize** text.
2. **Compute loss**:
   - **For causal models** → Loss is computed **only on the code part** (prompt is masked).
   - **For Seq2Seq models** → Loss is computed **on the entire generated sequence**.
3. Convert **loss** to **perplexity**.
4. Store **individual scores**.

- If the perplexity value is **infinity (`inf`)**, it is set to `float('inf')`.

In [22]:
import os
import math
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig
)
from peft import PeftModel

def compute_perplexity_causal_prompt_target(model_path, prompts, codes, max_length=512):
    """
    Calcola la Perplexity 'prompt -> code' con un modello causale (es. GPT-2),
    mascherando i token del prompt in modo che la loss sia calcolata solo sul code.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
    model.eval()

    losses = []
    for prompt, code in zip(prompts, codes):
        full_text = f"{prompt}\n{code}"
        inputs = tokenizer(full_text, max_length=max_length, truncation=True, return_tensors='pt')
        input_ids = inputs.input_ids.to(device)

        labels = input_ids.clone()
        prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
        labels[0, :len(prompt_ids)] = -100  # Maschera il prompt

        with torch.no_grad():
            loss = model(input_ids=input_ids, labels=labels).loss.item()
            losses.append(loss)

    del model, tokenizer
    torch.cuda.empty_cache()

    return math.exp(np.mean(losses)) if not math.isinf(np.mean(losses)) else float('inf')

def compute_perplexity_causal_prompt_target_lora4bit(finetuned_model_path, base_model_path, prompts, codes, max_length=512):
    """
    Calcola la Perplexity 'prompt -> code' per un modello causale con LoRA e quantizzazione 4-bit.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    os.makedirs("./offload", exist_ok=True)

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False
    )

    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_path, torch_dtype=torch.float16, quantization_config=bnb_config, device_map="auto", offload_folder="./offload"
    )
    model = PeftModel.from_pretrained(base_model, finetuned_model_path).to(device)
    tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)
    model.eval()

    losses = []
    for prompt, code in zip(prompts, codes):
        full_text = f"{prompt}\n{code}"
        inputs = tokenizer(full_text, max_length=max_length, truncation=True, return_tensors='pt')
        input_ids = inputs.input_ids.to(device)

        labels = input_ids.clone()
        prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
        labels[0, :len(prompt_ids)] = -100  # Maschera il prompt

        with torch.no_grad():
            loss = model(input_ids=input_ids, labels=labels).loss.item()
            losses.append(loss)

    del model, tokenizer, base_model
    torch.cuda.empty_cache()

    return math.exp(np.mean(losses)) if not math.isinf(np.mean(losses)) else float('inf')

def compute_perplexity_seq2seq(model_path, texts, actual_codes, max_length=512):
    """
    Computes perplexity for an encoder-decoder (seq2seq) model like BART
    over pairs of (prompt, code).
    """
    import math
    import torch
    import numpy as np
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)
    model.eval()

    losses = []
    # Ciclo in parallelo: prompt -> code
    for text, code in zip(texts, actual_codes):
        # Prompt tokenizzato
        inputs = tokenizer(
            text, truncation=True, max_length=max_length, return_tensors='pt'
        )
        input_ids = inputs.input_ids.to(device)

        # Codice come labels
        labels = tokenizer(
            code, truncation=True, max_length=max_length, return_tensors='pt'
        ).input_ids.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, labels=labels)
            loss = outputs.loss
            losses.append(loss.item())

    mean_loss = np.mean(losses)
    ppl = math.exp(mean_loss) if not math.isinf(mean_loss) else float('inf')

    # Cleanup
    del model
    del tokenizer
    torch.cuda.empty_cache()

    return ppl



In [23]:
model_gpt2_path = "/content/drive/Shareddrives/NLPMODELS/gpt2model/checkpoint-340"
ppl_gpt2 = compute_perplexity_causal_prompt_target(model_gpt2_path, prompts, actual_codes)
print("GPT-2 PPL (prompt->code):", ppl_gpt2)


GPT-2 PPL (prompt->code): 5.0172523763369155


In [24]:
finetuned_model_path = "/content/drive/Shareddrives/NLPMODELS/mistral/checkpoint-20"
basemodel_path = "mistralai/Mistral-7B-Instruct-v0.2"

ppl_mistral = compute_perplexity_causal_prompt_target_lora4bit(
    finetuned_model_path=finetuned_model_path,
    base_model_path=basemodel_path,
    prompts=prompts,
    codes=actual_codes
)
print("Mistral PPL (prompt->code):", ppl_mistral)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Mistral PPL (prompt->code): 2.871973808755254


In [25]:
# For BART
model_bart_path = "/content/drive/Shareddrives/NLPMODELS/nl2sql_bart_final/checkpoint-340"
ppl_bart = compute_perplexity_seq2seq(model_bart_path, prompts, actual_codes)
print("BART Perplexity:", ppl_bart)


BART Perplexity: 5.251900849772053
