<a href="https://colab.research.google.com/github/benedettoscala/ifttt-code-generator/blob/main/gpt2-nl2ifttt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Environment Setup and Library Imports
This section installs the necessary libraries, imports key dependencies, and sets up the workspace.

#### **Dependency Installation**
- The following libraries are installed:
  - `transformers`: Provides pre-trained language models and utilities for NLP tasks.
  - `evaluate`: Enables performance evaluation of language models.
  - `datasets`: Facilitates dataset handling for training and testing.
  - `rouge_score`: Backend package that `evaluate` needs in order to load the ROUGE metric used later in the notebook.

#### **Library Imports**
- **General Libraries:**
  - `os`, `pandas`, `numpy`, and `math` for file handling, data manipulation, and numerical operations.
  - `torch` for deep learning computations.
- **Machine Learning and NLP Libraries:**
  - `sklearn.model_selection` for dataset splitting (train-test split).
  - `datasets.Dataset, DatasetDict` for dataset management.
  - `transformers`:
    - `GPT2LMHeadModel`: Pre-trained GPT-2 model for causal language modeling.
    - `GPT2Tokenizer`: Tokenizer for text processing with GPT-2.
    - `DataCollatorForLanguageModeling`: Ensures proper batching and padding of sequences.
    - `Trainer`: Simplifies training and evaluation of transformers models.
    - `TrainingArguments`: Defines hyperparameters for training.
- **Evaluation Libraries:**
  - `evaluate` for model performance measurement.
  - `nltk` for natural language processing utilities, including tokenization.

#### **NLTK Resource Download**
- The `punkt` tokenizer data is requested with `nltk.download("punkt")`; NLTK skips the download when the package is already up to date.

#### **Repository Cloning**
- The script clones the `ifttt-code-generator` repository from GitHub.
- It navigates to the repository directory and updates it with `git pull`.

This setup ensures a well-prepared environment for training and evaluating language models.


In [None]:
!pip install transformers evaluate datasets
!pip install rouge_score




In [None]:
import os
import pandas as pd
import numpy as np
import torch

from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments
)

import evaluate
import nltk
import math

nltk.download("punkt")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
!git clone https://github.com/benedettoscala/ifttt-code-generator
%cd ifttt-code-generator/
!git pull

fatal: destination path 'ifttt-code-generator' already exists and is not an empty directory.
/content/ifttt-code-generator
Already up to date.


### Dataset Loading and Preprocessing
This section loads, cleans, and prepares the dataset for training a language model.

#### **Dataset Loading and Cleaning**
- The dataset is loaded from a CSV file (`cleaned_and_combined.csv`).
- Rows containing missing values in `cleaned_description` or `filter_code` columns are removed.
- Duplicate entries in these columns are also dropped to ensure unique data points.

#### **Text Formatting for Model Input**
- A **prompt template** is created to format each example:
  - The **description** and **code** are combined into a single string using a separator (`###`).
  - This structure helps the model distinguish between input text (description) and the expected output (code).
  - Example format:
    ```
    Description:
    <description_text>
    ###
    Code:
    <code_text>
    ```
- The `create_text_prompt` function applies this formatting to all dataset entries.
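
As a minimal sketch of how the template separates input from output, the snippet below builds a full training string and a description-only prompt that stops right after `Code:`. The helper `create_inference_prompt` and the sample strings are hypothetical and only serve as illustration.

```python
def create_text_prompt(desc, code):
    # Same template as the notebook: description, "###" separator, then code.
    return f"Description:\n{desc}\n###\nCode:\n{code}"


def create_inference_prompt(desc):
    # Hypothetical helper: reuse the template with an empty code part,
    # so the model is asked to continue the text after "Code:".
    return create_text_prompt(desc, "")


# Made-up example values, not taken from the dataset.
description = "Turn on the lights at sunset."
training_text = create_text_prompt(description, "<code_text>")
inference_prompt = create_inference_prompt(description)

print(training_text)
print(repr(inference_prompt))
```

Because the inference prompt is a strict prefix of the training string, the model sees the same layout at generation time as during fine-tuning.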

#### **Train/Test Split**
- The dataset is split into:
  - **80% Training Data**
  - **20% Testing Data**
- The split is **random but reproducible** (`random_state=42` ensures consistency across runs).
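
The split sizes can be checked by hand. The sketch below assumes that the test set size is rounded up and the remainder goes to the training set, and it assumes a cleaned dataset of 168 rows (the sum of the train and test sizes printed by the notebook). `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size=0.2):
    # Assumption: the test share is rounded up, the rest is used for training.
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(168)
print("train:", n_train, "test:", n_test)
```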

#### **Conversion to Hugging Face Dataset Format**
- The train and test dataframes are converted into `Dataset` objects for seamless compatibility with the Hugging Face training pipeline.
- A `DatasetDict` structure is created to organize training and testing subsets.

#### **Dataset Size Display**
- The script prints the number of examples in both the training and test sets.

This structured dataset preparation ensures the model receives properly formatted inputs and enables efficient training.


In [None]:
# Load the dataset
csv_path = "datasets/cleaned_and_combined.csv"
df = pd.read_csv(csv_path)

# Remove rows with missing values and duplicates
df.dropna(subset=["cleaned_description", "filter_code"], inplace=True)
df.drop_duplicates(subset=["cleaned_description", "filter_code"], inplace=True)

def create_text_prompt(desc, code):
    return f"Description:\n{desc}\n###\nCode:\n{code}"

df["text"] = df.apply(
    lambda row: create_text_prompt(row["cleaned_description"], row["filter_code"]),
    axis=1
)

# Train/test split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert to Hugging Face Dataset objects
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
test_dataset = Dataset.from_pandas(test_df.reset_index(drop=True))

dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

print("Train set size:", len(dataset["train"]))
print("Test set size:", len(dataset["test"]))

Train set size: 134
Test set size: 34


### Tokenizer Setup and Dataset Tokenization
This section prepares the tokenizer and tokenizes the dataset for training with GPT-2.

#### **Tokenizer Configuration**
- The tokenizer is loaded from the pre-trained `gpt2` checkpoint.
  - GPT-2 does not have a default padding token, so the EOS (`end-of-sequence`) token is assigned as the padding token.

#### **Tokenization Function**
- **Input Processing:**
  - Each text sample is tokenized using the GPT-2 tokenizer.
  - `padding="max_length"` ensures all sequences have a fixed size (`256` tokens) for efficient batch processing.
  - `truncation=True` prevents sequences from exceeding the maximum length.
  - **Note:** The longest entry in the dataset is 200 tokens long, so the 256-token limit does not truncate any example.
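
To make the effect of fixed-length padding and truncation concrete, here is a minimal pure-Python sketch of what happens to one sequence of token IDs. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the input IDs are made up, right-side padding is assumed, and `50256` is used as the pad ID because it is GPT-2's EOS token ID.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id):
    # Truncation: keep at most max_length tokens.
    ids = list(token_ids[:max_length])
    # Real tokens get attention 1, padding positions get attention 0.
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


# Made-up token IDs, padded to a small max_length for readability.
short = pad_or_truncate([10, 11, 12], max_length=6, pad_token_id=50256)
long = pad_or_truncate(list(range(10)), max_length=6, pad_token_id=50256)
print(short)
print(long)
```

Every output has exactly `max_length` entries, which is what allows examples to be stacked into a batch without further padding.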

#### **Tokenizing the Dataset**
- The dataset is processed using `.map()` to apply tokenization to all examples in batches.
- The `remove_columns` parameter ensures only the necessary tokenized fields (`input_ids`, `attention_mask`) remain.

#### **Final Output**
- The `tokenized_datasets` object contains the processed dataset, ready for model training.

This setup ensures the dataset is efficiently formatted for fine-tuning GPT-2.


In [None]:
model_checkpoint = "gpt2"

tokenizer = GPT2Tokenizer.from_pretrained(model_checkpoint)
# Set a padding token if none is defined
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})

# Set a maximum sequence length
max_length = 256

def tokenize_function(examples):
    # Returns a single dict with input_ids and attention_mask
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=max_length
    )

tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)

tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 134
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 34
    })
})

In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal Language Modeling
)


In [None]:
model = GPT2LMHeadModel.from_pretrained(model_checkpoint)

# Resize the embedding matrix in case a pad token was added to the tokenizer
model.resize_token_embeddings(len(tokenizer))


Embedding(50257, 768)

### Evaluation Metrics and Text Postprocessing
This section defines the evaluation metrics and a helper function to preprocess text for evaluation.

#### **Evaluation Metrics**
- **ROUGE:** Measures n-gram and longest-common-subsequence overlap between predictions and references. It is commonly used for summarization and other text generation tasks.
- **BLEU:** Measures the n-gram precision of the generated text against the references. It is most often used for machine translation.
- **METEOR:** Aligns unigrams between prediction and reference using exact, stem, and synonym matches, and penalizes differences in word order.

The metrics are loaded using the `evaluate` library, enabling easy integration into the training and evaluation workflow.

#### **Postprocessing Function**
- The `postprocess_text` function ensures predictions and references are properly formatted for accurate metric computation:
  - **Whitespace Cleanup:** Removes unnecessary spaces from both predictions and labels.
  - **Sentence Tokenization:** Segments predictions and references into sentences using NLTK's `sent_tokenize` to align with how ROUGE calculates scores.

This preprocessing ensures consistency and improves the reliability of the evaluation metrics.


In [None]:
rouge_score = evaluate.load("rouge")
bleu_score  = evaluate.load("bleu")
meteor_score = evaluate.load("meteor")

def postprocess_text(preds, labels):
    """
    - Strips leading and trailing whitespace
    - Splits text into sentences so that ROUGE is computed correctly
    """
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]
    return preds, labels


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Compute Metrics Function for Model Evaluation
This function computes **ROUGE**, **BLEU**, and **METEOR** scores to evaluate model-generated text against reference labels.

#### **Processing Predictions and Labels**
- The function receives model `logits` (raw outputs) and `labels` (ground truth).
- **Handling Masked Tokens:**  
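  - `DataCollatorForLanguageModeling` with `mlm=False` sets the labels of padding positions to `-100` so that the loss ignores them.
  - Before decoding, every `-100` is replaced with the tokenizer's pad token ID, because `-100` is not a valid token ID.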
- **Argmax Decoding:**  
  - The highest probability token is selected for each position in the logits.
  - The resulting token IDs are converted into human-readable text using `batch_decode()`.
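
The two steps above can be sketched on toy data without any model. The example uses a made-up vocabulary of four tokens in which token `3` plays the role of the pad/EOS token; `argmax`, `greedy_token_ids`, and `replace_ignored` are hypothetical helpers. It also illustrates a point worth keeping in mind: in a causal language model the logits at position `i` score the token at position `i + 1`, so the shifted comparison at the end is one possible alignment, not what the notebook's `compute_metrics` does.

```python
IGNORE_INDEX = -100


def argmax(scores):
    # Index of the highest score in a list.
    return max(range(len(scores)), key=scores.__getitem__)


def greedy_token_ids(logits):
    # One predicted token ID per position.
    return [argmax(position_scores) for position_scores in logits]


def replace_ignored(labels, pad_token_id):
    # -100 cannot be decoded, so map it back to the pad token ID.
    return [pad_token_id if t == IGNORE_INDEX else t for t in labels]


# Toy logits for a sequence of 4 positions over a 4-token vocabulary.
logits = [
    [0.1, 2.0, 0.3, 0.0],  # highest score: token 1
    [0.0, 0.1, 1.5, 0.2],  # highest score: token 2
    [0.0, 0.0, 0.1, 3.0],  # highest score: token 3
    [0.0, 0.0, 0.1, 3.0],  # highest score: token 3
]
labels = [0, 1, 2, IGNORE_INDEX]  # last position was padding

preds = greedy_token_ids(logits)
clean_labels = replace_ignored(labels, pad_token_id=3)

# Shifted view: the prediction at position i is compared with the label at i + 1.
aligned_preds, aligned_labels = preds[:-1], clean_labels[1:]
print(preds, clean_labels)
print(aligned_preds, aligned_labels)
```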

#### **Post-Processing**
- **Whitespace Cleanup & Sentence Splitting:**  
  - Predictions and labels are preprocessed using `postprocess_text()`.
  - This ensures proper tokenization for ROUGE and improves metric accuracy.

#### **Metric Computation**
- **ROUGE:**  
  - Measures n-gram overlap between predictions and references.
- **BLEU:**  
  - `evaluate` requires references in a **list of lists** format (`[[reference]]`).
  - Measures word sequence precision against references.
- **METEOR:**  
  - Considers synonym matching, stemming, and word order for better evaluation.
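
For intuition about what an overlap metric measures, the following sketch computes a simplified unigram F1 score in the spirit of ROUGE-1. It is a teaching aid under stated assumptions (whitespace tokenization, no stemming), not the implementation used by the `evaluate` library; `unigram_f1` and the two sample strings are made up.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up strings that differ in a single token.
score = unigram_f1("if hour > 22 skip", "if hour > 20 skip")
print(round(score * 100, 2))
```

A score like this rewards token overlap only, so a generated snippet can score well while still being syntactically invalid code.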

#### **Result Formatting**
- Scores are converted to percentages (`*100`) and rounded for readability.
- The final dictionary contains:
  - **ROUGE-1, ROUGE-2, ROUGE-L** (n-gram and longest common subsequence similarity)
  - **BLEU** (precision-based similarity)
  - **METEOR** (semantic matching)

This function ensures a **comprehensive evaluation** of model-generated text.


In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred

    # If -100 is used as the ignore index, replace it with GPT-2's pad token ID so the labels can be decoded
    labels[labels == -100] = tokenizer.pad_token_id

    # Argmax over the logits to obtain the predicted sequence
    predictions = np.argmax(logits, axis=-1)

    # Decode into strings
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Post-processing (whitespace removal, sentence splitting)
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Compute the metrics
    # 1) ROUGE
    rouge_results = rouge_score.compute(
        predictions=decoded_preds,
        references=decoded_labels
    )
    # 2) BLEU
    # The BLEU metric in `evaluate` expects `references` as a list of lists
    bleu_results = bleu_score.compute(
        predictions=decoded_preds,
        references=[[lbl] for lbl in decoded_labels]
    )
    # 3) METEOR
    meteor_results = meteor_score.compute(
        predictions=decoded_preds,
        references=decoded_labels
    )

    # Collect the results
    result = {}
    # ROUGE
    result["rouge1"] = round(rouge_results["rouge1"] * 100, 2)
    result["rouge2"] = round(rouge_results["rouge2"] * 100, 2)
    result["rougeL"] = round(rouge_results["rougeL"] * 100, 2)
    # BLEU
    result["bleu"] = round(bleu_results["bleu"] * 100, 2)
    # METEOR
    result["meteor"] = round(meteor_results["meteor"] * 100, 2)

    return result


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Training Configuration and Trainer Initialization
This section configures the training arguments and initializes the `Trainer` for fine-tuning GPT-2 on IFTTT-like tasks.

#### **Training Arguments Configuration**
- **Output Directory:**  
  - Saves model checkpoints to `output_dir`, here the shared Google Drive folder `/content/drive/Shareddrives/NLPMODELS/gpt2model`.
  - `overwrite_output_dir=True` overwrites any existing contents of that directory.
- **Evaluation & Checkpointing Strategy:**  
  - `evaluation_strategy="epoch"` → Evaluates the model at the end of each epoch.
  - `save_strategy="epoch"` is commented out in the code cell, so the `TrainingArguments` default checkpointing behavior applies.
- **Training Hyperparameters:**  
  - `num_train_epochs=20` → Trains for 20 epochs (adjust based on dataset size and performance).
  - `per_device_train_batch_size=8` → Processes 8 examples per batch on each device (adjust based on GPU memory).
  - `per_device_eval_batch_size=8` → Uses the same batch size for evaluation.
- **Logging & Precision:**  
  - `logging_steps=50` → Logs progress every 50 steps.
  - `fp16=torch.cuda.is_available()` → Uses **mixed precision** (`fp16`) if a compatible GPU is available, improving efficiency.
  - `report_to="none"` → Disables logging to external tools like WandB.

#### **Trainer Initialization**
- The `Trainer` class simplifies training by handling:
  - **Model Training & Evaluation:** Uses the specified datasets.
  - **Data Collation:** Ensures correct batch formatting using `data_collator`.
  - **Tokenizer:** Passed to the `Trainer` so that it is saved alongside each checkpoint and can be reloaded together with the model.
  - **Metric Computation:** If `compute_metrics` is provided, the model's performance is evaluated after each epoch.

This setup allows for an **automated and structured training process**, including evaluation and checkpointing.


In [None]:
training_args = TrainingArguments(
    output_dir="/content/drive/Shareddrives/NLPMODELS/gpt2model",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",   # Run evaluation at the end of every epoch
    #save_strategy="epoch",         # Save a checkpoint at every epoch
    num_train_epochs=20,            # Adjust to your needs
    per_device_train_batch_size=8, # Batch size; adapt it to your GPU
    per_device_eval_batch_size=8,
    logging_steps=50,
    fp16=torch.cuda.is_available(), # Use half precision when available
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)




In [None]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Bleu | Meteor |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.535383 | 42.58 | 14.89 | 34.39 | 24.75 | 43.7 |
| 2 | No log | 2.156899 | 46.21 | 19.42 | 40.19 | 22.82 | 48.61 |
| 3 | 2.580500 | 2.010765 | 47.74 | 20.58 | 42.49 | 32.37 | 51.8 |
| 4 | 2.580500 | 1.913291 | 50.35 | 24.05 | 45.0 | 33.9 | 54.72 |
| 5 | 2.580500 | 1.878177 | 50.04 | 24.38 | 44.9 | 35.89 | 55.51 |
| 6 | 1.584100 | 1.827517 | 50.61 | 25.76 | 45.78 | 37.51 | 56.76 |
| 7 | 1.584100 | 1.81005 | 52.21 | 27.59 | 47.6 | 39.59 | 58.36 |
| 8 | 1.584100 | 1.795711 | 53.06 | 28.57 | 48.41 | 41.41 | 58.34 |
| 9 | 1.241300 | 1.778932 | 53.4 | 29.53 | 49.02 | 40.96 | 59.03 |
| 10 | 1.241300 | 1.765897 | 55.13 | 32.17 | 50.85 | 43.63 | 60.23 |


TrainOutput(global_step=340, training_loss=1.2648966620950137, metrics={'train_runtime': 154.5639, 'train_samples_per_second': 17.339, 'train_steps_per_second': 2.2, 'total_flos': 350131322880000.0, 'train_loss': 1.2648966620950137, 'epoch': 20.0})
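
The `global_step=340` reported above is consistent with the values in the training cell. The sketch below reproduces the arithmetic, assuming a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped; `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(n_train_examples, batch_size, num_epochs):
    # Assumption: one optimizer step per batch, last partial batch included.
    steps_per_epoch = math.ceil(n_train_examples / batch_size)
    return steps_per_epoch * num_epochs


# 134 training examples, batch size 8, 20 epochs (values from the notebook).
steps = total_training_steps(134, batch_size=8, num_epochs=20)
print(steps)
```

With 17 steps per epoch, `logging_steps=50` also explains why the training loss column shows `No log` for the first two epochs: the first logged value only becomes available after step 50, during the third epoch.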

In [None]:
results = trainer.evaluate()
print("Final eval_loss:", results["eval_loss"])
print("Perplexity:", math.exp(results["eval_loss"]))


Final eval_loss: 1.8374218940734863
Perplexity: 6.280326025828933
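
Perplexity is the exponential of the mean negative log-likelihood per token, which is why the cell above applies `math.exp` to the evaluation loss. A minimal sketch with made-up token probabilities (the helper name `perplexity_from_token_probs` is hypothetical):

```python
import math


def perplexity_from_token_probs(token_probs):
    # Mean negative log-likelihood over the tokens, then exponentiate.
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Made-up probabilities that a model assigned to three correct tokens.
toy_ppl = perplexity_from_token_probs([0.5, 0.25, 0.125])
print(toy_ppl)

# Same conversion applied to the eval loss printed by the notebook.
notebook_ppl = math.exp(1.8374218940734863)
print(notebook_ppl)
```

A perplexity of about 6.28 can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly six tokens at each position.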


In [None]:
%cd ..


/content


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Define the path to your shared folder in Google Drive
shared_folder_path = "/content/drive/Shareddrives/NLPMODELS/"

# Create the shared folder if it doesn't exist
!mkdir -p "{shared_folder_path}"

# Copy the gpt2-ifttt folder to the shared Google Drive folder (fails if that folder does not exist locally)
!cp -r gpt2-ifttt "{shared_folder_path}"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
cp: cannot stat 'gpt2-ifttt': No such file or directory


In [None]:
from transformers import pipeline

model_path = "/content/drive/Shareddrives/NLPMODELS/gpt2model/checkpoint-340"

inference_model = GPT2LMHeadModel.from_pretrained(model_path)
inference_tokenizer = GPT2Tokenizer.from_pretrained(model_path)

# Text-generation pipeline
generator = pipeline(
    "text-generation",
    model=inference_model,
    tokenizer=inference_tokenizer,
    pad_token_id=inference_tokenizer.eos_token_id
)

# Example prompt: only the "description" part, ending right after "Code:"
prompt = "Description:\nCreate an applet that Save new photos from my phone to a Google Drive folder automatically.\n###\nCode:\n"

results = generator(prompt, max_length=256, num_return_sequences=1)
print(results[0]["generated_text"])


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Description:
Create an applet that Save new photos from my phone to a Google Drive folder automatically.
###
Code:
var photo = Object.getOwnPropertyNames(CameraPhoto.locale').toLowerCase()  var w = Meta.currentUserTime.format('dddd').toLowerCase()   if (w.locale === 'US') {   FolderFolder.createFolder.skip("Uploaded Photos") } else {  GooglePhoto.uploadPhotoFolder.skip("Not Google Photos Folder") }  } else{   Folder.createFolder.skip("Search for Photos") }   }  if (photo.length!= 0 && photo.length!= 1) {  Folder.createFolder.skip("Not Dropbox") }  else{  Folder.createFolder.skip("Not Photos") }  }    if (w.locale === 'US') {   Folder.createFolder.skip("Not Android") }  else{  Folder.createFolder.skip("Android Save Folder") }  }  }   if (w.locale === 'Canada') {   Folder.createFolder.skip("Not AUS Folder") }  else{
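
The pipeline returns the prompt followed by the continuation, so the generated code has to be separated from the description before it can be used or scored. A minimal sketch of that step, assuming the same `###` / `Code:` template as in training; `extract_generated_code` is a hypothetical helper and the sample string uses a placeholder instead of real model output.

```python
def extract_generated_code(generated_text, marker="\n###\nCode:\n"):
    # Split at the first occurrence of the template separator.
    _, found, code = generated_text.partition(marker)
    # Return an empty string when the separator is missing.
    return code.strip() if found else ""


# Placeholder text in the notebook's prompt format.
sample = "Description:\nSave new photos to a folder.\n###\nCode:\n<generated_code>\n"
code_only = extract_generated_code(sample)
print(code_only)
```

In the notebook this could be applied to `results[0]["generated_text"]`. The output shown above runs until `max_length` is reached, so additional stopping or truncation logic may still be needed.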
