<a href="https://colab.research.google.com/github/benedettoscala/ifttt-code-generator/blob/main/test_and_compare_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
#from google.colab import drive
#drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%%capture
!pip install -U bitsandbytes

In [1]:
!git clone https://github.com/benedettoscala/ifttt-code-generator
%cd ifttt-code-generator/
!git pull

C:\Users\scala\PycharmProjects\JupyterProject\ifttt-code-generator\ifttt-code-generator


fatal: destination path 'ifttt-code-generator' already exists and is not an empty directory.
  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]


Already up to date.


### Model Comparison: GPT-2, BART, and Mistral for IFTTT Code Generation
This section evaluates and compares the performance of three fine-tuned language models (**GPT-2, BART, and Mistral**) in generating IFTTT-style automation code from textual descriptions.

#### **Dataset Loading and Preprocessing**
- The dataset is loaded from `"ifttt-code-generator/datasets/cleaned_and_combined.csv"`.
- The data is split into **80% training** and **20% testing**.
- **Test prompts** (natural language descriptions) and their corresponding **actual code** are extracted for evaluation.

#### **Model-Specific Generation Functions**
Each model uses a different approach for text generation:

1. **GPT-2 (`generate_with_gpt2`)**:
   - Loads a fine-tuned GPT-2 model from Hugging Face.
   - Encodes each test prompt and generates code using `generate()`, limiting output to 128 tokens.
   - The model is loaded onto a CUDA device for faster inference.

2. **BART (`generate_with_bart`)**:
   - Uses a **text-to-text generation pipeline**.
   - Each test prompt is formatted with an `"ifttt_prompt:"` prefix for consistent input format.
   - The generated text is extracted from the pipeline output.

3. **Mistral (`generate_with_mistral`)**:
   - Loads the **base model** `"Mistral-7B-Instruct-v0.2"` with **4-bit quantization** for memory efficiency.
   - Loads the **fine-tuned LoRA adapter** from `/content/drive/Shareddrives/NLPMODELS/mistral/checkpoint-20`.
   - Generates responses with:
     - `do_sample=True` (introduces randomness)
     - `top_k=50` (limits sampling to top 50 tokens)
     - `top_p=0.95` (nucleus sampling for diverse outputs)
     - `temperature=1` (controls creativity)

#### **Inference Execution**
- Each model generates code for all test prompts.
- Model-generated responses are stored in separate lists.

#### **Results Compilation**
- A `pandas` DataFrame is created containing:
  - **Prompt:** The original natural language description.
  - **Generated Code GPT-2:** The output from GPT-2.
  - **Generated Code BART:** The output from BART.
  - **Generated Code Mistral:** The output from Mistral.
  - **Actual Code:** The ground truth for comparison.

This setup enables **direct performance comparison** between the three models, helping assess which model best converts natural language descriptions into automation code.


In [28]:
%cd new_experiment

C:\Users\DaisLabTBB\PycharmProjects\ifttt-code-generator\ifttt-code-generator\new_experiment


In [29]:
import pandas as pd
import torch
import os
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sklearn.model_selection import train_test_split
from peft import PeftModel

# Load the dataset and split it
df = pd.read_csv("datasets/new_dataset.csv")
#drop duplicates and null
#df = df.drop_duplicates()
df = df.dropna()

train_df, test_df = train_test_split(df, test_size=0.356, random_state=42)

# Extract test set prompts
prompts = test_df["permission_df"].tolist()
actual_codes = test_df["filter_code"].tolist()

In [4]:
!pip show transformers



Name: transformers
Version: 4.49.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: C:\Users\scala\miniconda3\Lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft, trl


In [30]:


# Function to generate text with GPT-2
def generate_with_gpt2(model_path):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path).to("cuda")
    generated_codes = []

    for prompt in prompts:
        input_data = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
        input_ids = input_data.input_ids.to("cuda")
        attention_mask = input_data.attention_mask.to("cuda")

        output_ids = model.generate(input_ids, attention_mask=attention_mask, num_return_sequences=1)
        generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        generated_codes.append(generated_text)

    del model
    del tokenizer
    torch.cuda.empty_cache()

    return generated_codes

# Function to generate text with BART
def generate_with_bart(model_path):
    generator = pipeline("text2text-generation", model=model_path, tokenizer=model_path)
    generated_codes = [generator(f"ifttt_prompt: {prompt}")[0]["generated_text"] for prompt in prompts]

    del generator
    torch.cuda.empty_cache()

    return generated_codes

# Function to generate text with Mistral
from tqdm import tqdm
import math

def generate_with_casual_lm(finetuned_model_path, basemodel_path, batch_size=8):
    if not os.path.exists("../offload"):
        os.makedirs("../offload")

    print("Caricamento del modello fine-tunato...")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True
    )

    max_memory = {
        0: "8GiB",  # Limite  sulla GPU
        "cpu": "16GiB"  # Limite  sulla CPU
    }

    model = AutoModelForCausalLM.from_pretrained(
        basemodel_path,
        torch_dtype=torch.float16,
        quantization_config=bnb_config,
        device_map="auto",
        offload_folder="./offload",
        #max_memory=max_memory
    )

    model = PeftModel.from_pretrained(model, finetuned_model_path)
    tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)

    decoded_outputs = []

    # Calcolo il numero di batch
    num_batches = math.ceil(len(prompts) / batch_size)
    pbar = tqdm(total=len(prompts), desc="Generazione")

    for i in range(num_batches):
        batch_prompts = prompts[i*batch_size : (i+1)*batch_size]

        inputs = tokenizer(batch_prompts, return_tensors="pt", padding=True, truncation=True).to("cuda")
        outputs = model.generate(
            **inputs,
            max_length=256,
            num_return_sequences=1,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=1,
        )
        # Decodifica risultati del mini-batch
        decoded_batch = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        decoded_outputs.extend(decoded_batch)

        # Aggiorno la progress bar di n=lunghezza del mini-batch
        pbar.update(len(batch_prompts))

    pbar.close()

    del model
    del tokenizer
    torch.cuda.empty_cache()

    return decoded_outputs







In [3]:
!dir

 Il volume nell'unit… C non ha etichetta.
 Numero di serie del volume: E0B9-66A8

 Directory di C:\Users\DaisLabTBB\PycharmProjects\ifttt-code-generator\ifttt-code-generator

23/02/2025  10:19    <DIR>          .
22/02/2025  15:05    <DIR>          ..
22/02/2025  15:05                83 .gitattributes
22/02/2025  15:23             1.648 0.43.0
22/02/2025  15:05           424.408 bart_nl2ifttt.ipynb
22/02/2025  15:05    <DIR>          datasets
22/02/2025  15:54           137.378 fine_tuning_codegemma.ipynb
22/02/2025  15:54           153.734 fine_tuning_codellama.ipynb
22/02/2025  15:05           137.635 fine_tuning_deepseek.ipynb
22/02/2025  15:05           239.405 fine_tuning_mistral.ipynb
22/02/2025  15:05            29.432 generated_codes.csv
22/02/2025  15:05            95.728 gpt2-nl2ifttt.ipynb
22/02/2025  15:31    <DIR>          NLPMODELS
22/02/2025  15:05           224.787 preprocessing_and_cleaning.ipynb
22/02/2025  15:05             2.597 README.md
22/02/2025  16:23          

In [31]:
model_deepseek_path = "results/best_model_deepseek"
basemodel_deepseek_path = "deepseek-ai/deepseek-coder-6.7b-base"
generated_codes_deepseek = generate_with_casual_lm(model_deepseek_path, basemodel_deepseek_path)

Caricamento del modello fine-tunato...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Generazione:   0%|          | 0/218 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.
Generazione:   4%|▎         | 8/218 [00:59<25:59,  7.42s/it]Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.
Generazione:   7%|▋         | 16/218 [01:55<24:11,  7.19s/it]Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.
Generazione:  11%|█         | 24/218 [02:51<22:55,  7.09s/it]Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.
Generazione:  15%|█▍        | 32/218 [03:48<22:06,  7.13s/it]Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.
Generazione:  18%|█▊        | 40/218 [04:45<21:05,  7.11s/it]Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.
Generazione:  22%|██▏       | 48/218 [05:42<20:07,  7.11s/it]Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.
Generazione:  26%|██▌       | 56/218 [06:38<19:06,  7.08s/it]Setting `pad_token

In [32]:
model_codegemma_path = "results/best_model_codegemma"
basemodel_codegemma_path = "google/codegemma-7b"
generated_codes_codegemma = generate_with_casual_lm(model_codegemma_path, basemodel_codegemma_path)

Caricamento del modello fine-tunato...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Generazione:   0%|          | 0/218 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Generazione: 100%|██████████| 218/218 [14:42<00:00,  4.05s/it]


In [33]:
model_codellama_path = "results/best_model_codellama"
basemode_codellama_path = "codellama/CodeLlama-7b-hf"
generated_codes_codellama = generate_with_casual_lm(model_codellama_path, basemode_codellama_path)

Caricamento del modello fine-tunato...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Generazione:   0%|          | 0/218 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:   4%|▎         | 8/218 [00:33<14:50,  4.24s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:   7%|▋         | 16/218 [01:07<14:10,  4.21s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:  11%|█         | 24/218 [01:41<13:38,  4.22s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:  15%|█▍        | 32/218 [02:15<13:07,  4.24s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:  18%|█▊        | 40/218 [02:49<12:34,  4.24s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:  22%|██▏       | 48/218 [03:23<12:04,  4.26s/it]Setting `pad_token_id` to `eos_tok

In [34]:
model_mistral_path = "results/best_model_mistral"
basemode_mistral_path = "mistralai/Mistral-7B-Instruct-v0.2"
generated_codes_mistral = generate_with_casual_lm(model_mistral_path, basemode_mistral_path)

Caricamento del modello fine-tunato...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Generazione:   0%|          | 0/218 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:   4%|▎         | 8/218 [00:34<15:16,  4.37s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:   7%|▋         | 16/218 [01:09<14:31,  4.31s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:  11%|█         | 24/218 [01:43<13:53,  4.29s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:  15%|█▍        | 32/218 [02:17<13:14,  4.27s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:  18%|█▊        | 40/218 [02:51<12:38,  4.26s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generazione:  22%|██▏       | 48/218 [03:25<12:05,  4.27s/it]Setting `pad_token_id` to `eos_tok

In [35]:
# Generate with BART
model_bart_path = "results/best_model_bart/checkpoint-900"
generated_codes_bart = generate_with_bart(model_bart_path)

Device set to use cuda:0


In [36]:


# Create a DataFrame with results
results_df = pd.DataFrame({
    "Prompt": prompts,
    "Generated Code CodeGemma": generated_codes_codegemma,
    "Generated Code CodeLLama": generated_codes_codellama,
    "Generated Code DeepSeek": generated_codes_deepseek,
    "Generated Code Mistral": generated_codes_mistral,
    "Generated Code Bart" : generated_codes_bart,
    "Actual Code": actual_codes
})

In [37]:
#create csv of results_df
results_df.to_csv("results_df.csv", index=False)

In [14]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=e8a397cd245743ad35376a2e0ea92bcfa985f1e5bad49cb1c60b5388c4583d19
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [38]:
results_df

Unnamed: 0,Prompt,Generated Code CodeGemma,Generated Code CodeLLama,Generated Code DeepSeek,Generated Code Mistral,Generated Code Bart,Actual Code
0,if Current condition changes to (trigger_servi...,if Current condition changes to (trigger_servi...,if Current condition changes to (trigger_servi...,if Current condition changes to (trigger_servi...,if Current condition changes to (trigger_servi...,var hour = Meta.currentUserTime.hour() if (ho...,var timeOfDay = Meta.currentUserTime.hour() i...
1,if Connects to a Bluetooth device (trigger_ser...,if Connects to a Bluetooth device (trigger_ser...,if Connects to a Bluetooth device (trigger_ser...,if Connects to a Bluetooth device (trigger_ser...,if Connects to a Bluetooth device (trigger_ser...,var btDeviceName = AndroidDevice.bluetoothConn...,"if (Trigger.DeviceName.indexOf(""Gear"")==-1) {A..."
2,if Generate Pet Activity Report (trigger_servi...,if Generate Pet Activity Report (trigger_servi...,if Generate Pet Activity Report (trigger_servi...,if Generate Pet Activity Report (trigger_servi...,if Generate Pet Activity Report (trigger_servi...,var minute = Meta.triggerTime.minute() var mi...,var hour = Meta.triggerTime.hour() var minute...
3,if You enter an area (trigger_service: Locatio...,if You enter an area (trigger_service: Locatio...,if You enter an area (trigger_service: Locatio...,if You enter an area (trigger_service: Locatio...,if You enter an area (trigger_service: Locatio...,if (Weather.currentWeather[0].CurrentCondition...,let sunrise = moment(Weather.currentWeather[0]...
4,if Daily step goal achieved (trigger_service: ...,if Daily step goal achieved (trigger_service: ...,if Daily step goal achieved (trigger_service: ...,if Daily step goal achieved (trigger_service: ...,if Daily step goal achieved (trigger_service: ...,var timeOfDay = Meta.currentUserTime.hour() i...,"var data = [ {""quote"":""To enjoy the glow of g..."
...,...,...,...,...,...,...,...
213,if New DART rider alert (trigger_service: DART...,if New DART rider alert (trigger_service: DART...,if New DART rider alert (trigger_service: DART...,if New DART rider alert (trigger_service: DART...,if New DART rider alert (trigger_service: DART...,var Hour = Meta.currentUserTime.hour(),var Hour = Meta.currentUserTime.hour() var Day...
214,if New tweet by a specific user (trigger_servi...,if New tweet by a specific user (trigger_servi...,if New tweet by a specific user (trigger_servi...,if New tweet by a specific user (trigger_servi...,if New tweet by a specific user (trigger_servi...,var incomingTweet = Twitter.newTweetByUser.Tex...,var timeOfDay = Meta.currentUserTime.hour() i...
215,if New feed item (trigger_service: RSS Feed) t...,if New feed item (trigger_service: RSS Feed) t...,if New feed item (trigger_service: RSS Feed) t...,if New feed item (trigger_service: RSS Feed) t...,if New feed item (trigger_service: RSS Feed) t...,const content = Feed.newFeedItem.EntryContent ...,var Texto = Feed.newFeedItem.EntryTitle; var ...
216,if You enter an area (trigger_service: Locatio...,if You enter an area (trigger_service: Locatio...,if You enter an area (trigger_service: Locatio...,if You enter an area (trigger_service: Locatio...,if You enter an area (trigger_service: Locatio...,if (Weather.currentWeather[0].CurrentCondition...,let sunrise = moment(Weather.currentWeather[0]...


In [39]:
results_copy = results_df.copy()

In [40]:
import pandas as pd

# Assicurarsi che tutti i valori siano stringhe
results_df["Generated Code CodeGemma"] = results_df["Generated Code CodeGemma"].astype(str)
results_df["Generated Code Mistral"] = results_df["Generated Code Mistral"].astype(str)
results_df["Generated Code DeepSeek"] = results_df["Generated Code DeepSeek"].astype(str)
results_df["Generated Code CodeLLama"] = results_df["Generated Code CodeLLama"].astype(str)
results_df["Generated Code Bart"] = results_df["Generated Code Bart"].astype(str)
results_df["Actual Code"] = results_df["Actual Code"].astype(str)


# Funzione per rimuovere il prompt e gestire errori
def clean_code(text):
    if "###" in text:
        return text.split("###", 1)[-1].strip()
    return text.strip()  # Se non c'è "###", restituisce la stringa originale
    # Funzione per rimuovere il prompt e gestire errori

# Applicare la funzione alla colonna
results_df["Generated Code CodeGemma"] = results_df["Generated Code CodeGemma"].apply(clean_code)
results_df["Generated Code DeepSeek"] = results_df["Generated Code DeepSeek"].apply(clean_code)
results_df["Generated Code CodeLLama"] = results_df["Generated Code CodeLLama"].apply(clean_code)
results_df["Generated Code Bart"] = results_df["Generated Code Bart"].apply(clean_code)
results_df["Generated Code Mistral"] = results_df["Generated Code Mistral"].apply(clean_code)

In [41]:
results_df

Unnamed: 0,Prompt,Generated Code CodeGemma,Generated Code CodeLLama,Generated Code DeepSeek,Generated Code Mistral,Generated Code Bart,Actual Code
0,if Current condition changes to (trigger_servi...,var timeOfDay = Meta.currentUserTime.hour() if...,var triggerTime = Meta.currentUserTime.hour();...,var hour = Meta.triggerTime.hour() if (hou...,var currentHour = Meta.currentUserTime.hour() ...,var hour = Meta.currentUserTime.hour() if (ho...,var timeOfDay = Meta.currentUserTime.hour() i...
1,if Connects to a Bluetooth device (trigger_ser...,var hour = Meta.currentUserTime.hour() if (ho...,var volume = parseInt(AndroidDevice.bluetoothC...,"if (Trigger.DeviceName.toLowerCase() == ""wow"")...",var BTDeviceName = AndroidDevice.bluetoothConn...,var btDeviceName = AndroidDevice.bluetoothConn...,"if (Trigger.DeviceName.indexOf(""Gear"")==-1) {A..."
2,if Generate Pet Activity Report (trigger_servi...,var timeOfDay = Meta.currentUserTime.hour(); ...,var hour = Meta.currentUserTime.hour() if (ho...,var timeOfDay = Meta.triggerTime.hour() var h...,var hour = Meta.triggerTime.hour() var timeHo...,var minute = Meta.triggerTime.minute() var mi...,var hour = Meta.triggerTime.hour() var minute...
3,if You enter an area (trigger_service: Locatio...,let sunrise = moment(Weather.currentWeather[0]...,let sunrise = moment(Weather.currentWeather[0]...,let sunrise = moment(Weather.currentWeather[0]...,let sunrise = moment(Weather.currentWeather[0]...,if (Weather.currentWeather[0].CurrentCondition...,let sunrise = moment(Weather.currentWeather[0]...
4,if Daily step goal achieved (trigger_service: ...,var hour = Meta.currentUserTime.hour() if (ho...,var timeOfDay = Meta.currentUserTime.hour() i...,var Day = Meta.currentUserTime.day() var Hour ...,var timeOfDay = Meta.currentUserTime.hour() i...,var timeOfDay = Meta.currentUserTime.hour() i...,"var data = [ {""quote"":""To enjoy the glow of g..."
...,...,...,...,...,...,...,...
213,if New DART rider alert (trigger_service: DART...,var Hour = Meta.currentUserTime.hour() var Day...,var Hour = Meta.currentUserTime.hour() var Day...,var Hour = Meta.currentUserTime.hour() var Day...,var Hour = Meta.currentUserTime.hour() var Day...,var Hour = Meta.currentUserTime.hour(),var Hour = Meta.currentUserTime.hour() var Day...
214,if New tweet by a specific user (trigger_servi...,var timeOfDay = Meta.currentUserTime.hour() i...,var timeOfDay = Meta.currentUserTime.hour() i...,var timeOfDay = Meta.currentUserTime.hour() i...,var timeOfDay = Meta.currentUserTime.hour() i...,var incomingTweet = Twitter.newTweetByUser.Tex...,var timeOfDay = Meta.currentUserTime.hour() i...
215,if New feed item (trigger_service: RSS Feed) t...,var Texto = Feed.newFeedItem.EntryTitle; var ...,var Texto = Feed.newFeedItem.EntryTitle; var N...,"if(Feed.newFeedItem.EntryContent.indexOf(""Tide...",var Texto = Feed.newFeedItem.EntryTitle; var ...,const content = Feed.newFeedItem.EntryContent ...,var Texto = Feed.newFeedItem.EntryTitle; var ...
216,if You enter an area (trigger_service: Locatio...,let sunrise = moment(Weather.currentWeather[0]...,let sunrise = moment(Weather.currentWeather[0]...,let sunrise = moment(Weather.currentWeather[0]...,let sunrise = moment(Weather.currentWeather[0]...,if (Weather.currentWeather[0].CurrentCondition...,let sunrise = moment(Weather.currentWeather[0]...


### Evaluation of Generated Code
This section defines a function to evaluate the quality of model-generated code using multiple text similarity metrics.

#### **Evaluation Metrics**
- **BLEU Score (Bilingual Evaluation Understudy):**
  - Measures n-gram precision by comparing generated code with actual code.
  - Uses `sentence_bleu()` from NLTK for sentence-level evaluation.

- **METEOR Score (Metric for Evaluation of Translation with Explicit ORdering):**
  - Considers stemming, synonyms, and word order.
  - Computed using `single_meteor_score()` from NLTK.

- **ROUGE Scores (Recall-Oriented Understudy for Gisting Evaluation):**
  - Measures recall-based overlap between generated and reference text.
  - Three variations are computed:
    - **ROUGE-1:** Unigram (single-word) overlap.
    - **ROUGE-2:** Bigram (two-word) overlap.
    - **ROUGE-L:** Measures longest common subsequence similarity.

#### **Evaluation Process**
- For each pair of generated and actual code snippets:
  - Text is **tokenized** by splitting into words.
  - **BLEU, METEOR, and ROUGE scores** are computed.
- The function stores individual scores for all test samples.
- The **average score** is computed for each metric across the entire dataset.

#### **Returned Results**
- `mean_bleu`: Average BLEU score over all test samples.
- `mean_meteor`: Average METEOR score.
- `mean_rouge_l`: Average ROUGE-L score.
- `mean_rouge_1`: Average ROUGE-1 score.
- `mean_rouge_2`: Average ROUGE-2 score.

This function enables a **comprehensive evaluation** of model-generated code, ensuring a robust comparison against ground truth data.


In [42]:
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import single_meteor_score

def evaluate_generated_text(generated_codes, actual_codes):
    bleu_scores = []
    meteor_scores = []
    rouge_l_scores = []
    rouge_1_scores = []
    rouge_2_scores = []

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scorer_1 = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    scorer_2 = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

    for gen, ref in zip(generated_codes, actual_codes):
        gen_tokens = gen.split()
        ref_tokens = ref.split()

        # Calcolo BLEU (sentence-level)
        bleu = sentence_bleu([ref_tokens], gen_tokens)

        # Calcolo METEOR (sentence-level)
        meteor = single_meteor_score(ref_tokens, gen_tokens)

        # Calcolo ROUGE-L (f-measure)
        rouge_l = scorer.score(ref, gen)["rougeL"].fmeasure
        # calcolo ROUGE-1
        rouge_1 = scorer_1.score(ref, gen)["rouge1"].fmeasure
        # calcolo ROUGE-2
        rouge_2 = scorer_2.score(ref, gen)["rouge2"].fmeasure

        bleu_scores.append(bleu)
        meteor_scores.append(meteor)
        rouge_l_scores.append(rouge_l)
        rouge_1_scores.append(rouge_1)
        rouge_2_scores.append(rouge_2)

    # Media su tutte le frasi del dataset di test
    mean_bleu = sum(bleu_scores) / len(bleu_scores)
    mean_meteor = sum(meteor_scores) / len(meteor_scores)
    mean_rouge_l = sum(rouge_l_scores) / len(rouge_l_scores)
    mean_rouge_1 = sum(rouge_1_scores) / len(rouge_1_scores)
    mean_rouge_2 = sum(rouge_2_scores) / len(rouge_2_scores)

    return mean_bleu, mean_meteor, mean_rouge_l, mean_rouge_1, mean_rouge_2


In [43]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DaisLabTBB\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True


#### **Results Compilation**
- A `pandas` DataFrame (`metrics_df`) is created to organize evaluation scores.
- The DataFrame has the following structure:
  - **Metric** → The evaluation metric name.
  - **GPT-2** → Scores from the GPT-2 model.
  - **BART** → Scores from the BART model.
  - **Mistral** → Scores from the Mistral model.

This setup allows for **direct comparison** of model performance across multiple evaluation metrics.


In [44]:
# Evaluate models
bart_scores = evaluate_generated_text(generated_codes_bart, actual_codes)
mistral_scores = evaluate_generated_text(generated_codes_mistral, actual_codes)
deepseek_scores = evaluate_generated_text(generated_codes_deepseek, actual_codes)
codegemma_scores = evaluate_generated_text(generated_codes_codegemma, actual_codes)
codellama_scores = evaluate_generated_text(generated_codes_codellama, actual_codes)


metrics_df = pd.DataFrame(
    {
        "Metric": ["BLEU", "METEOR", "ROUGE-L", "ROUGE-1", "ROUGE-2"],
        "BART": bart_scores,
        "Mistral": mistral_scores,
        "DeepSeek": deepseek_scores,
        "CodeLLama": codegemma_scores,
        "CodeGemma": codegemma_scores,

    }
)


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [45]:
metrics_df

Unnamed: 0,Metric,BART,Mistral,DeepSeek,CodeLLama,CodeGemma
0,BLEU,0.209194,0.199473,0.08624,0.10614,0.10614
1,METEOR,0.358208,0.473732,0.311656,0.34792,0.34792
2,ROUGE-L,0.492425,0.390251,0.244046,0.271313,0.271313
3,ROUGE-1,0.50218,0.408589,0.267625,0.291637,0.291637
4,ROUGE-2,0.406969,0.349615,0.187372,0.222343,0.222343


In [49]:
results_df.to_csv("results_df_without_prompt.csv", index=False)