<a href="https://colab.research.google.com/github/benedettoscala/ifttt-code-generator/blob/main/test_and_compare_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%%capture
!pip install -U bitsandbytes

In [None]:
!git clone https://github.com/benedettoscala/ifttt-code-generator
%cd ifttt-code-generator/
!git pull

Cloning into 'ifttt-code-generator'...
remote: Enumerating objects: 195, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 195 (delta 20), reused 0 (delta 0), pack-reused 157 (from 1)
Receiving objects: 100% (195/195), 14.91 MiB | 16.24 MiB/s, done.
Resolving deltas: 100% (113/113), done.
/content/ifttt-code-generator
Already up to date.


In [None]:
%cd ..

/content


### Model Comparison: GPT-2, BART, and Mistral for IFTTT Code Generation
This section evaluates and compares the performance of three fine-tuned language models (**GPT-2, BART, and Mistral**) in generating IFTTT-style automation code from textual descriptions.

#### **Dataset Loading and Preprocessing**
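- The dataset is loaded from `"ifttt-code-generator/datasets/cleaned_and_combined.csv"`; duplicate rows and rows with missing values are dropped.
- The data is split into **80% training** and **20% testing** with a fixed seed (`random_state=42`) so the split is reproducible.
- **Test prompts** (the `cleaned_description` column) and their corresponding **actual code** (the `filter_code` column) are extracted for evaluation.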

#### **Model-Specific Generation Functions**
Each model uses a different approach for text generation:

1. **GPT-2 (`generate_with_gpt2`)**:
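   - Loads a fine-tuned GPT-2 checkpoint from Google Drive through Hugging Face `transformers`.
   - Encodes each test prompt and generates code using `generate()`. No explicit length limit is passed in the call, so output length is governed by the checkpoint's generation configuration.
   - The model is loaded onto a CUDA device for faster inference.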

2. **BART (`generate_with_bart`)**:
   - Uses the `transformers` **`text2text-generation` pipeline**.
   - Each test prompt is formatted with an `"ifttt_prompt:"` prefix for consistent input format.
   - The generated text is extracted from the pipeline output.

3. **Mistral (`generate_with_mistral`)**:
   - Loads the **base model** `"mistralai/Mistral-7B-Instruct-v0.2"` with **4-bit quantization** for memory efficiency.
   - Loads the **fine-tuned LoRA adapter** from `/content/drive/Shareddrives/NLPMODELS/mistral/checkpoint-20`.
   - Generates responses with:
     - `max_length=512` (caps prompt plus generated tokens)
     - `do_sample=True` (samples from the token distribution instead of decoding greedily)
     - `top_k=50` (limits sampling to the 50 most likely tokens)
     - `top_p=0.95` (nucleus sampling: keeps the smallest set of tokens whose cumulative probability reaches 0.95)
     - `temperature=1` (leaves the distribution unscaled)
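
To make the `top_k` and `top_p` settings concrete, the following is a minimal pure-Python sketch of the filtering idea on a toy distribution. The helper `top_k_top_p_filter` and the toy probabilities are hypothetical and written only for illustration; this is not the `transformers` implementation, which operates on logits.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return the renormalized distribution left after top-k, then top-p filtering.

    `probs` maps token -> probability. Hypothetical helper for illustration only.
    """
    # Rank tokens from most to least likely and keep at most top_k of them.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Nucleus step: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (made-up values).
toy = {"var": 0.5, "if": 0.3, "let": 0.15, "Code": 0.04, "###": 0.01}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.9)
print(filtered)
```

Sampling then draws the next token from the filtered distribution, so unlikely tokens in the tail can never be selected.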

#### **Inference Execution**
- Each model generates code for all test prompts.
- Model-generated responses are stored in separate lists.

#### **Results Compilation**
- A `pandas` DataFrame is created containing:
  - **Prompt:** The original natural language description.
  - **Generated Code GPT-2:** The output from GPT-2.
  - **Generated Code BART:** The output from BART.
  - **Generated Code Mistral:** The output from Mistral.
  - **Actual Code:** The ground truth for comparison.

This setup enables **direct performance comparison** between the three models, helping assess which model best converts natural language descriptions into automation code.


In [None]:
import pandas as pd
import torch
import os
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sklearn.model_selection import train_test_split
from google.colab import drive
from peft import PeftModel

drive.mount('/content/drive')

# Load the dataset and split it
df = pd.read_csv("ifttt-code-generator/datasets/cleaned_and_combined.csv")
# Drop duplicate rows and rows with missing values
df = df.drop_duplicates()
df = df.dropna()

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Extract test set prompts
prompts = test_df["cleaned_description"].tolist()
actual_codes = test_df["filter_code"].tolist()

# Function to generate text with GPT-2
def generate_with_gpt2(model_path):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path).to("cuda")
    generated_codes = []

    for prompt in prompts:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
        output_ids = model.generate(input_ids, num_return_sequences=1)
        generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        generated_codes.append(generated_text)

    del model
    del tokenizer
    torch.cuda.empty_cache()

    return generated_codes

# Function to generate text with BART
def generate_with_bart(model_path):
    generator = pipeline("text2text-generation", model=model_path, tokenizer=model_path)
    generated_codes = [generator(f"ifttt_prompt: {prompt}")[0]["generated_text"] for prompt in prompts]

    del generator
    torch.cuda.empty_cache()

    return generated_codes

# Function to generate text with Mistral
def generate_with_mistral(finetuned_model_path, basemodel_path):
    # Folder used to offload weights that do not fit in GPU memory
    os.makedirs("./offload", exist_ok=True)

    print("Loading the fine-tuned model...")
    # load_in_4bit=True enables the 4-bit loading that the bnb_4bit_* options configure
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False
    )

    model = AutoModelForCausalLM.from_pretrained(
        basemodel_path,
        torch_dtype=torch.float16,
        quantization_config=bnb_config,
        device_map="auto",
        offload_folder="./offload"
    )

    model = PeftModel.from_pretrained(model, finetuned_model_path)
    tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)

    generated_codes = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(
            **inputs,
            max_length=512,
            num_return_sequences=1,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=1,
        )
        decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        generated_codes.append(decoded_outputs[0])

    del model
    del tokenizer
    torch.cuda.empty_cache()

    return generated_codes

# Generate with Mistral
finetuned_model_path = "/content/drive/Shareddrives/NLPMODELS/mistral/checkpoint-20"
basemodel_path = "mistralai/Mistral-7B-Instruct-v0.2"
generated_codes_mistral = generate_with_mistral(finetuned_model_path, basemodel_path)


# Generate with BART
model_bart_path = "/content/drive/Shareddrives/NLPMODELS/nl2sql_bart_final/checkpoint-340"
generated_codes_bart = generate_with_bart(model_bart_path)


# Generate with GPT-2
model_gpt2_path = "/content/drive/Shareddrives/NLPMODELS/gpt2model/checkpoint-340"
generated_codes_gpt2 = generate_with_gpt2(model_gpt2_path)



# Create a DataFrame with results
results_df = pd.DataFrame({
    "Prompt": prompts,
    "Generated Code GPT-2": generated_codes_gpt2,
    "Generated Code BART": generated_codes_bart,
    "Generated Code Mistral": generated_codes_mistral,
    "Actual Code": actual_codes
})


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Loading the fine-tuned model...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

In [None]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=05a4ece74fc16f850e3d6400b64bedc6643ddc909cd3743b133c8554c7454129
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
import pandas as pd

# Make sure all values are strings
results_df["Generated Code GPT-2"] = results_df["Generated Code GPT-2"].astype(str)
results_df["Generated Code Mistral"] = results_df["Generated Code Mistral"].astype(str)

# Remove the echoed prompt: keep only the text after the first "###"
def clean_code(text):
    if "###" in text:
        return text.split("###", 1)[-1].strip()
    return text.strip()  # No "###" marker: return the stripped original string

# Same idea for GPT-2 outputs, which use the "###Code:\n" marker
def clean_code_gpt(text):
    if "###Code:\n" in text:
        return text.split("###Code:\n", 1)[-1].strip()
    return text.strip()  # No "###Code:\n" marker: return the stripped original string

# Apply the cleaning functions to the columns
results_df["Generated Code GPT-2"] = results_df["Generated Code GPT-2"].apply(clean_code_gpt)
results_df["Generated Code Mistral"] = results_df["Generated Code Mistral"].apply(clean_code)

In [None]:
results_df

Unnamed: 0,Prompt,Generated Code GPT-2,Generated Code BART,Generated Code Mistral,Actual Code
0,This applet will reward you 1p for every 10 me...,var distance = Math.floor(Math.random() * dist...,var distance = parseInt(Strava.newActivityByYo...,if (Math.round(parseFloat(Triggers.EntryStrava...,var distance = parseInt(Strava.newActivityByYo...
1,Enter a description of your meal and the numbe...,var amount = Meta.triggerTime.amount if (amou...,if (Evernote.newEating.DescriptionByYou.Text.i...,var descr = AndroidMessages.appendMessage.Text...,var values = DoNote.doNoteNewCommandCommon.Not...
2,Turn on Wemo Switch After Garage Door Opens (A...,var hour = Meta.triggerTime.hour() if (,var timeOfDay = Meta.currentUserTime.hour() ...,let timeOfDay = Meta.currentUserTime.hour() i...,var hour = Meta.currentUserTime.hour() if ...
3,Turn on WeMo Smart Plug When Ring Detects Moti...,var timeOfDay = Meta.currentUserTime.hour(),var timeOfDay = Meta.currentUserTime.hour() ...,var hour = Meta.triggerTime.hour() if (hou...,var timeOfDay = Meta.currentUserTime.hour() if...
4,This applet will add an iOS reminder to drink ...,var Hour = Meta.triggerTime.hour() if,var timeOfDay = Meta.triggerTime.hour() if (...,var cardType = Monzo.cardTypeOfPurchase.CardTy...,"var reminderTime = Meta.triggerTime.add(2, 'h'..."
5,"We got you, Dallas: this Applet sends you a Te...",var Hour = Meta.currentUserTime.hour() var Day =,var Hour = Meta.currentUserTime.hour() var Day...,var Hour = Meta.currentUserTime.hour() var Day...,var Hour = Meta.currentUserTime.hour() var Day...
6,facebook only text post,var hour = Meta.triggerTime.hour(),var text = FacebookPages.newStatusMessageByPag...,var Message=FacebookPages.newPageFeedItem.Text...,Facebook.newStatusMessageByYou.From Facebook.n...
7,If doorbell rings beween 21h and 6h then toggl...,var timeOfDay = Meta.currentUserTime.hour(),var timeOfDay = Meta.currentUserTime.hour(); ...,var timeOfDay = Meta.currentUserTime.hour() i...,var TimeOfDay = Meta.currentUserTime.hour() i...
8,Report today's rainfall amount from your Weath...,var timeOfDay = Meta.currentUserTime.hour(),if(parseFloat(Weather.currentWeather[0].rainfa...,var amount = parseInt(Weather.rainTodayWeather...,Netro.reportWeather.setDate(Netatmo.rainTodayA...
9,Turn light on when arriving to an area between...,Code,var timeOfDay = Meta.currentUserTime.hour() i...,if (Meta.currentUserTime.hour() < 19 || Meta.c...,var timeOfDay = Meta.currentUserTime.hour(); ...


### Evaluation of Generated Code
This section defines a function to evaluate the quality of model-generated code using multiple text similarity metrics.

#### **Evaluation Metrics**
- **BLEU Score (Bilingual Evaluation Understudy):**
  - Measures n-gram precision by comparing generated code with actual code.
  - Uses `sentence_bleu()` from NLTK for sentence-level evaluation.

- **METEOR Score (Metric for Evaluation of Translation with Explicit ORdering):**
  - Considers stemming, synonyms, and word order.
  - Computed using `single_meteor_score()` from NLTK.

- **ROUGE Scores (Recall-Oriented Understudy for Gisting Evaluation):**
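  - Measures overlap between generated and reference text. ROUGE is recall-oriented by design, but the code here reports the F-measure, which balances precision and recall.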
  - Three variations are computed:
    - **ROUGE-1:** Unigram (single-word) overlap.
    - **ROUGE-2:** Bigram (two-word) overlap.
    - **ROUGE-L:** Measures longest common subsequence similarity.
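
As a worked illustration of ROUGE-L, the sketch below computes the longest common subsequence (LCS) of two token lists and derives an F-measure from it. It assumes plain whitespace tokens, whereas the `rouge_score` package applies its own tokenizer and optional stemming, so its numbers can differ. The helper names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming, keeping only the previous row of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F-measure from LCS-based precision and recall (whitespace tokens)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "var hour = Meta.currentUserTime.hour()"
candidate = "var timeOfDay = Meta.currentUserTime.hour()"
print(rouge_l_f1(reference, candidate))  # 3 of 4 tokens are in the LCS -> 0.75
```

Only the variable name differs between the two strings, so three of the four whitespace tokens form the common subsequence.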

#### **Evaluation Process**
- For each pair of generated and actual code snippets:
  - Text is **tokenized** by splitting on whitespace for BLEU and METEOR; ROUGE receives the raw strings and applies the scorer's own tokenizer.
  - **BLEU, METEOR, and ROUGE scores** are computed.
- The function stores individual scores for all test samples.
- The **average score** is computed for each metric across the entire dataset.
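
Whitespace tokenization has a consequence worth keeping in mind when reading the scores: code that differs only in spacing yields different tokens. The sketch below illustrates this with a hypothetical helper `unigram_overlap` and a made-up code snippet; neither comes from the dataset.

```python
def unigram_overlap(reference, candidate):
    """Fraction of candidate whitespace tokens that also occur in the reference."""
    ref_tokens = set(reference.split())
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in cand_tokens) / len(cand_tokens)


# Made-up snippet: the same code written with and without spaces.
reference = "if (hour < 6) { action.skip() }"
compact = "if(hour<6){action.skip()}"

print(reference.split())  # seven tokens, punctuation stays attached to words
print(compact.split())    # a single token
print(unigram_overlap(reference, reference))
print(unigram_overlap(reference, compact))
```

The compact version is a single token that never appears in the reference, so its overlap is zero even though the code is equivalent.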

#### **Returned Results**
- `mean_bleu`: Average BLEU score over all test samples.
- `mean_meteor`: Average METEOR score.
- `mean_rouge_l`: Average ROUGE-L score.
- `mean_rouge_1`: Average ROUGE-1 score.
- `mean_rouge_2`: Average ROUGE-2 score.

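This function gives a multi-metric view of how closely model-generated code matches the ground truth at the token level. These metrics measure textual overlap only; they do not check whether the generated code runs or behaves correctly.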


In [None]:
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import single_meteor_score

def evaluate_generated_text(generated_codes, actual_codes):
    bleu_scores = []
    meteor_scores = []
    rouge_l_scores = []
    rouge_1_scores = []
    rouge_2_scores = []

    # A single scorer computes all three ROUGE variants in one call
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    for gen, ref in zip(generated_codes, actual_codes):
        # Whitespace tokenization for BLEU and METEOR
        gen_tokens = gen.split()
        ref_tokens = ref.split()

        # BLEU (sentence-level)
        bleu = sentence_bleu([ref_tokens], gen_tokens)

        # METEOR (sentence-level)
        meteor = single_meteor_score(ref_tokens, gen_tokens)

        # ROUGE-L, ROUGE-1 and ROUGE-2 (F-measure)
        rouge = scorer.score(ref, gen)
        rouge_l = rouge["rougeL"].fmeasure
        rouge_1 = rouge["rouge1"].fmeasure
        rouge_2 = rouge["rouge2"].fmeasure

        bleu_scores.append(bleu)
        meteor_scores.append(meteor)
        rouge_l_scores.append(rouge_l)
        rouge_1_scores.append(rouge_1)
        rouge_2_scores.append(rouge_2)

    # Average over all samples in the test set
    mean_bleu = sum(bleu_scores) / len(bleu_scores)
    mean_meteor = sum(meteor_scores) / len(meteor_scores)
    mean_rouge_l = sum(rouge_l_scores) / len(rouge_l_scores)
    mean_rouge_1 = sum(rouge_1_scores) / len(rouge_1_scores)
    mean_rouge_2 = sum(rouge_2_scores) / len(rouge_2_scores)

    return mean_bleu, mean_meteor, mean_rouge_l, mean_rouge_1, mean_rouge_2


In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True


#### **Results Compilation**
- A `pandas` DataFrame (`metrics_df`) is created to organize evaluation scores.
- The DataFrame has the following structure:
  - **Metric** → The evaluation metric name.
  - **GPT-2** → Scores from the GPT-2 model.
  - **BART** → Scores from the BART model.
  - **Mistral** → Scores from the Mistral model.

This setup allows for **direct comparison** of model performance across multiple evaluation metrics.


In [None]:
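# Evaluate models
# Note: these lists hold the raw generations; the prompt-stripped versions
# produced by clean_code/clean_code_gpt are stored only in results_df.
gpt2_scores = evaluate_generated_text(generated_codes_gpt2, actual_codes)
bart_scores = evaluate_generated_text(generated_codes_bart, actual_codes)
mistral_scores = evaluate_generated_text(generated_codes_mistral, actual_codes)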

metrics_df = pd.DataFrame(
    {
        "Metric": ["BLEU", "METEOR", "ROUGE-L", "ROUGE-1", "ROUGE-2"],
        "GPT-2": gpt2_scores,
        "BART": bart_scores,
        "Mistral": mistral_scores
    }
)


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
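
The warnings above follow from how BLEU is built: it combines clipped n-gram precisions for n = 1 to 4 through a geometric mean, so a single order with no matches drives the whole score to zero unless smoothing is applied. The pure-Python sketch below computes those precisions for one made-up pair of snippets; `ngrams` and `modified_precision` are hypothetical helpers, not NLTK functions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


reference = "var hour = Meta.currentUserTime.hour()".split()
hypothesis = "var timeOfDay = Meta.currentUserTime.hour()".split()

precisions = [modified_precision(reference, hypothesis, n) for n in range(1, 5)]
print(precisions)  # the 3-gram and 4-gram precisions are 0.0
```

Here three of four unigrams match, yet no 3-gram or 4-gram does, which is the situation the warning describes. Short code snippets hit this case often, which is one reason the average BLEU values are low.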


In [None]:
metrics_df

| | Metric | GPT-2 | BART | Mistral |
|---|---|---|---|---|
| 0 | BLEU | 0.022889 | 0.177501 | 0.057188 |
| 1 | METEOR | 0.086055 | 0.344752 | 0.258213 |
| 2 | ROUGE-L | 0.158086 | 0.478291 | 0.250738 |
| 3 | ROUGE-1 | 0.210855 | 0.502479 | 0.289356 |
| 4 | ROUGE-2 | 0.08492 | 0.3509 | 0.159746 |
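
One simple way to read the table is to pick the highest-scoring model for each metric. The sketch below does this in plain Python with the values copied from the output above; the `scores` dictionary layout is an assumption made for illustration.

```python
# Scores copied from the metrics output above.
scores = {
    "BLEU":    {"GPT-2": 0.022889, "BART": 0.177501, "Mistral": 0.057188},
    "METEOR":  {"GPT-2": 0.086055, "BART": 0.344752, "Mistral": 0.258213},
    "ROUGE-L": {"GPT-2": 0.158086, "BART": 0.478291, "Mistral": 0.250738},
    "ROUGE-1": {"GPT-2": 0.210855, "BART": 0.502479, "Mistral": 0.289356},
    "ROUGE-2": {"GPT-2": 0.08492,  "BART": 0.3509,   "Mistral": 0.159746},
}

# For each metric, select the model with the highest score.
best = {metric: max(by_model, key=by_model.get) for metric, by_model in scores.items()}

for metric, model in best.items():
    print(f"{metric}: {model} ({scores[metric][model]:.3f})")
```

On these numbers BART has the highest score on every metric, with Mistral second and GPT-2 last.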
