<a href="https://colab.research.google.com/github/benedettoscala/ifttt-code-generator/blob/main/preprocessing_and_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Setup and Authentication

This section of the notebook sets up the environment by installing the necessary libraries, including `transformers`, `peft`, and `bitsandbytes`, which are essential for working with optimized language models. It imports key modules for model handling, tokenization, fine-tuning with efficient techniques like LoRA, and dataset management. Finally, it performs authentication with Hugging Face to access available models and datasets on the platform.


In [None]:
%%capture

!pip install transformers==4.36.2
!pip install -U peft
!pip install -U accelerate
!pip install -U trl
!pip install datasets==2.16.0
!pip install sentencepiece
!pip install -U bitsandbytes
!pip install fuzzywuzzy

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
import torch
import pandas as pd
from datasets import Dataset


In [None]:
from google.colab import userdata
secret_hf = userdata.get('HUGGINGFACE_TOKEN')
!huggingface-cli login --token $secret_hf

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
The token `prova` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `prova`


# Repository Cloning and Dataset Extraction

This section clones the `ifttt-code-generator` repository from GitHub and ensures it is up to date. Then, it extracts a dataset from a compressed ZIP file into the `datasets` directory, preparing the necessary data for further processing.


In [None]:
!git clone https://github.com/benedettoscala/ifttt-code-generator
%cd ifttt-code-generator/
!git pull

Cloning into 'ifttt-code-generator'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (45/45), done.
remote: Total 55 (delta 21), reused 21 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (55/55), 14.44 MiB | 14.44 MiB/s, done.
Resolving deltas: 100% (21/21), done.
/content/ifttt-code-generator
Already up to date.


In [None]:
!unzip -q datasets/FilterDatasets.zip -d datasets
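
Outside Colab, where the `unzip` shell command may not be available, the same extraction can be done with the standard library. The following is a minimal sketch; the helper name `extract_archive` and the demo file names are illustrative and are not part of the repository.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Self-contained demo on a throwaway archive (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("sample.csv", "description,filter_code\n")
    extracted = extract_archive(demo_zip, Path(tmp) / "datasets")
    demo_file_exists = (Path(tmp) / "datasets" / "sample.csv").exists()
```

In the notebook itself the call would presumably be `extract_archive("datasets/FilterDatasets.zip", "datasets")`.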

# Data Cleaning and Preprocessing Functions

This section defines helper functions for cleaning and preprocessing text and code. They replace newline characters with spaces, strip JavaScript comments, drop entries that exceed a maximum length, check whether a prompt shares at least one keyword with its code, detect near-duplicate strings with fuzzy matching, and remove non-ASCII characters. Together these steps standardize the data before further analysis.


In [None]:
import pandas as pd
import re

# Replace newlines with spaces and trim surrounding whitespace
def remove_newlines(text):
    if isinstance(text, str):
        return text.replace("\n", " ").replace("\r", " ").strip()
    return text


# Remove JavaScript comments
def remove_js_comments(code):
    if isinstance(code, str):
        # Remove single-line comments (// ...).
        # Caveat: this also cuts string literals that contain "//", such as URLs.
        code = re.sub(r"//.*", "", code)
        # Remove multi-line comments (/* ... */)
        code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
        # Trim the whitespace left behind
        return code.strip()
    return code

# Preprocess descriptions (currently a pass-through: the normalization steps are commented out)
def preprocess_description(description):
    if isinstance(description, str):
        #description = re.sub(r"[^a-zA-Z0-9\s]", "", description)  # Remove special characters
        #description = description.lower().strip()  # Lowercase and trim extra spaces
        return description
    return description

def add_prefix(code):
    if isinstance(code, str):
        return "generate filter code:" + code
    return code

def filter_long_entries(value, max_length):
    if isinstance(value, str) and len(value) > max_length:
        return None  # Mark the entry for removal by replacing it with None
    return value

# Check whether cleaned_description and filter_code are related
def is_correlated(prompt, code):
    # Extract keywords (whitespace-separated words) from the prompt
    prompt_keywords = set(prompt.lower().split()) if isinstance(prompt, str) else set()

    # Lowercase the code and look for any keyword as a substring
    code_text = code.lower() if isinstance(code, str) else ""
    matches = any(keyword in code_text for keyword in prompt_keywords)

    return matches

from fuzzywuzzy import fuzz

def are_strings_similar(string1, string2, threshold=80):
    similarity_ratio = fuzz.ratio(string1, string2)
    return similarity_ratio >= threshold

# Find and remove near-duplicate rows based on string similarity
def remove_similar_entries(df, column, threshold=80):
    to_remove = set()
    for i in range(len(df)):
        if i in to_remove:
            continue
        for j in range(i + 1, len(df)):
            if j in to_remove:
                continue
            similarity = fuzz.ratio(df.iloc[i][column], df.iloc[j][column])
            if similarity > threshold:
                to_remove.add(j)
    # to_remove holds row positions, so map them to index labels before dropping
    labels = [df.index[pos] for pos in to_remove]
    return df.drop(index=labels).reset_index(drop=True)

# Remove non-ASCII characters
def clean_text(t):
    if isinstance(t, str):
        return t.encode("ascii", errors="ignore").decode("ascii")
    return t
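
One caveat of the comment-stripping regex above is that `//.*` also matches inside string literals, so a URL such as `"https://..."` is cut at the double slash. The sketch below shows that effect and one possible string-aware alternative. The function `strip_js_comments_keep_strings` and the sample snippet are illustrative assumptions, not part of the notebook, and the pattern does not handle template literals or regex literals.

```python
import re

# Matches either a quoted string literal or a comment; strings are captured so they can be kept.
_JS_STRING_OR_COMMENT = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
    |(?P<comment>//[^\n]*|/\*.*?\*/)
    """,
    re.VERBOSE | re.DOTALL,
)


def remove_js_comments_naive(code):
    """Same two substitutions as the notebook helper."""
    code = re.sub(r"//.*", "", code)
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    return code.strip()


def strip_js_comments_keep_strings(code):
    """Hypothetical alternative: drop comments but leave quoted strings untouched."""
    def _keep_strings(match):
        return match.group("string") or ""
    return _JS_STRING_OR_COMMENT.sub(_keep_strings, code).strip()


# Made-up snippet with a URL, a line comment, and a block comment.
snippet = 'setPhotoUrl("https://example.com/a.png") // set the photo\n/* block note */ done()'
naive = remove_js_comments_naive(snippet)
careful = strip_js_comments_keep_strings(snippet)
```

Here `naive` loses everything after `"https:` on the first line, while `careful` keeps the full URL and removes only the two comments.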



# First Dataset

In [None]:
!pip install langdetect
!pip install googletrans

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993222 sha256=b47e1aa73b99b5c3de45fc416cf660a2467e9f07136285dfc53ed29da6d36b1b
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


# Text Translation to English

This section loads a multilingual-to-English translation model from Hugging Face (`Helsinki-NLP/opus-mt-mul-en`). It includes a function that detects the language of a given text and translates it into English if it's not already in English. The function tokenizes the input, generates a translation, and decodes the output, ensuring that non-English text is properly processed for further use.


In [None]:
from transformers import MarianMTModel, MarianTokenizer
from langdetect import detect

# Load a translation model from Hugging Face
model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_to_english_hf(text):
    try:
        # Detect the language of the text
        detected_language = detect(text)
        if detected_language != "en":
            # Tokenize the text
            inputs = tokenizer(text, return_tensors="pt", padding=True)
            # Generate the translation
            translated = model.generate(**inputs)
            # Decode the translation
            translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
            return translated_text
        else:
            return text
    except Exception as e:
        # On any failure, report the error and keep the original text
        print(f"Error during translation: {e}")
        return text
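
The control flow of the function above (detect, translate only when needed, fall back to the input on failure) can be exercised without downloading a model by passing the detector and translator in as arguments. This is a sketch under that assumption; `translate_if_needed`, `fake_detect`, and `fake_translate` are hypothetical names used only for illustration.

```python
def translate_if_needed(text, detect_language, translate, target="en"):
    """Translate text unless it is already in the target language; return the input on errors."""
    try:
        if detect_language(text) == target:
            return text
        return translate(text)
    except Exception as error:  # broad on purpose, mirroring the notebook function
        print(f"Error during translation: {error}")
        return text


# Stub callables so the sketch runs offline (they stand in for langdetect and the Marian model).
def fake_detect(text):
    if not text.strip():
        raise ValueError("no features in text")
    return "it" if "ciao" in text.lower() else "en"


def fake_translate(text):
    return text.lower().replace("ciao", "hello")


translated = translate_if_needed("Ciao mondo", fake_detect, fake_translate)
unchanged = translate_if_needed("Hello world", fake_detect, fake_translate)
fallback = translate_if_needed("", fake_detect, fake_translate)
```

With the real components, `detect_language` would be `langdetect.detect` and `translate` would wrap the tokenizer and model calls shown above.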



# Data Cleaning and Filtering

This section processes a raw dataset by applying multiple cleaning and filtering steps. It starts by loading the dataset and removing rows with missing values in key columns. Various preprocessing techniques are applied, including text normalization, newline and comment removal, and length filtering.

Additionally, it checks for correlations between descriptions and filter code, removing uncorrelated entries. The dataset is then translated into English when necessary. Finally, duplicate or highly similar entries are removed to ensure data quality before further processing.
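
The correlation check is a substring test: a row is kept if any whitespace-separated word of the description occurs anywhere in the lowercased code. The sketch below reproduces that logic on a made-up example to show how permissive it is, and adds a hypothetical stricter variant (`is_correlated_strict`, with an assumed `min_length` parameter) that ignores very short words. The code snippet and descriptions are invented for illustration.

```python
def is_correlated(prompt, code):
    """Same logic as the notebook helper: any prompt word found as a substring of the code."""
    keywords = set(prompt.lower().split())
    return any(word in code.lower() for word in keywords)


def is_correlated_strict(prompt, code, min_length=4):
    """Hypothetical variant: only words of at least min_length characters count as keywords."""
    keywords = {word for word in prompt.lower().split() if len(word) >= min_length}
    return any(word in code.lower() for word in keywords)


# Made-up filter code and descriptions.
code = 'if (Weather.currentWeather[0].Condition == "Rain") { Hue.turnOnAllHue.skip() }'
unrelated = "Post a photo to my blog"
related = "Turn on Hue lights when rain starts"

loose_unrelated = is_correlated(unrelated, code)          # "a" occurs inside "weather"
strict_unrelated = is_correlated_strict(unrelated, code)  # no long word matches
strict_related = is_correlated_strict(related, code)      # "turn" and "rain" match
```

Because single-letter words such as "a" match almost any code, the original check mostly removes rows whose description shares no characters of substance with the code, which is consistent with the small drop in row counts printed by the cells below.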


In [None]:
import pandas as pd

# Load the Step1 dataset
step1_path = "datasets/Step1_Raw_Data_with_FilterCode 1.csv"
step1_data = pd.read_csv(step1_path)

# Drop rows with a missing description or filter_code
step1_cleaned = step1_data.dropna(subset=['description', 'filter_code']).copy()

# Apply the preprocessing
step1_cleaned['cleaned_description'] = step1_cleaned['description'].apply(preprocess_description).apply(remove_newlines)
step1_cleaned['filter_code'] = step1_cleaned['filter_code'].apply(remove_js_comments).apply(remove_newlines)

step1_cleaned['cleaned_description'] = step1_cleaned['cleaned_description'].apply(lambda x : filter_long_entries(x, 200))
step1_cleaned['filter_code'] = step1_cleaned['filter_code'].apply(lambda x : filter_long_entries(x, 400))

# Drop rows whose cleaned_description or filter_code was set to None by the length filter
step1_cleaned = step1_cleaned.dropna(subset=['cleaned_description', 'filter_code'])

# Keep only the relevant columns
step1_cleaned = step1_cleaned[['cleaned_description', 'filter_code']]
step1_cleaned.to_csv("step1_cleaned_no_newlines.csv", index=False)

# Check the correlation between 'cleaned_description' and 'filter_code'
step1_cleaned['is_correlated'] = step1_cleaned.apply(
    lambda row: is_correlated(row['cleaned_description'], row['filter_code']),
    axis=1
)

# Drop rows with missing values or without a correlation
step1_cleaned = step1_cleaned.dropna(subset=['cleaned_description', 'filter_code'])
step1_cleaned = step1_cleaned[step1_cleaned['is_correlated']]

# Translate the 'cleaned_description' column to English where needed
step1_cleaned['cleaned_description'] = step1_cleaned['cleaned_description'].apply(translate_to_english_hf)


# Reset the index
step1_cleaned.reset_index(drop=True, inplace=True)

# Remove near-duplicate descriptions
step1_cleaned = remove_similar_entries(step1_cleaned, 'cleaned_description', threshold=90)
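
The near-duplicate removal keeps the first row of each group of descriptions whose similarity score exceeds the threshold. The sketch below shows the same greedy idea on plain strings, using `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`; scores may differ slightly from fuzzywuzzy, and the helper names and sample sentences are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score, comparable in spirit to fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def drop_near_duplicates(texts, threshold=90):
    """Keep a string only if it is not too similar to any string kept before it."""
    kept = []
    for text in texts:
        if all(similarity(text, other) <= threshold for other in kept):
            kept.append(text)
    return kept


# Made-up descriptions: the second differs from the first by one character.
texts = [
    "Send an SMS when the collar battery is low",
    "Send an SMS when the collar battery is low!",
    "Turn on the lights at sunset",
]
deduplicated = drop_near_duplicates(texts)
```

Like the notebook function, this compares every candidate against all retained entries, so the cost grows quadratically with the number of rows. That is acceptable for a few hundred descriptions but would need a different approach for much larger datasets.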



In [None]:
step1_cleaned

Unnamed: 0,cleaned_description,filter_code,is_correlated
0,Commuting in Chicago? This Applet posts to a S...,var Hour = Meta.currentUserTime.hour() var Day...,True
1,This Applet will send you a notification when ...,var Hour = Meta.currentUserTime.hour() var Day...,True
2,This Applet helps you work around commute disr...,var Hour = Meta.currentUserTime.hour() var Day...,True
3,Get notified when @Wario64 (usually the fastes...,"if (Twitter.newTweetByUser.Text.indexOf(""SNES""...",True
4,Post Mastodon's Toot on Twitter(Exclude Mentions),"if(Feed.newFeedItem.EntryContent.indexOf(""@"") ...",True
...,...,...,...
163,Calls phone with status of the pet,var minute = Meta.triggerTime.minute() var mi...,True
164,Send an SMS message for temperature collar ale...,var minute = Meta.triggerTime.minute() var mi...,True
165,Send an SMS message when the Link collar is ch...,var minute = Meta.triggerTime.minute() var mi...,True
166,time uplash source url to photo url buff,"Buffer.addToBufferWithPhoto.setPhotoUrl(""https:",True


In [None]:
# Load the Step2 dataset
step2_path = "datasets/Step2_Popular_Rules_with_FilterCode.csv"
step2_data = pd.read_csv(step2_path)

# Apply the same preprocessing and filtering to Step2
step2_cleaned = step2_data.dropna(subset=['description', 'filter_code']).copy()
step2_cleaned['cleaned_description'] = step2_cleaned['description'].apply(preprocess_description).apply(remove_newlines)
step2_cleaned['filter_code'] = step2_cleaned['filter_code'].apply(remove_js_comments).apply(remove_newlines)
step2_cleaned['cleaned_description'] = step2_cleaned['cleaned_description'].apply(lambda x: filter_long_entries(x, 200))
step2_cleaned['filter_code'] = step2_cleaned['filter_code'].apply(lambda x: filter_long_entries(x, 400))
step2_cleaned = step2_cleaned.dropna(subset=['cleaned_description', 'filter_code'])
step2_cleaned = step2_cleaned[['cleaned_description', 'filter_code']]

# Print the number of rows after length filtering
print(len(step2_cleaned))

# Check the correlation between description and code for Step2
step2_cleaned['is_correlated'] = step2_cleaned.apply(
    lambda row: is_correlated(row['cleaned_description'], row['filter_code']),
    axis=1
)
step2_cleaned = step2_cleaned[step2_cleaned['is_correlated']]

# Translate the 'cleaned_description' column to English where needed
step2_cleaned['cleaned_description'] = step2_cleaned['cleaned_description'].apply(translate_to_english_hf)

# Print the number of rows after the correlation filter and translation
print(len(step2_cleaned))
# Reset the index
step2_cleaned.reset_index(drop=True, inplace=True)
# Remove near-duplicates in Step2
step2_cleaned = remove_similar_entries(step2_cleaned, 'cleaned_description', threshold = 90)

# Save the cleaned Step2 dataset
step2_cleaned.to_csv("step2_cleaned_final.csv", index=False)

102
100


In [None]:
step2_cleaned

Unnamed: 0,cleaned_description,filter_code,is_correlated
0,This Applet will send you a notification when ...,var Hour = Meta.currentUserTime.hour() var Day...,True
1,Get notified when @Wario64 (usually the fastes...,"if (Twitter.newTweetByUser.Text.indexOf(""SNES""...",True
2,Post Mastodon's Toot on Twitter(Exclude Mentions),"if(Feed.newFeedItem.EntryContent.indexOf(""@"") ...",True
3,"When paying for your Bus Fare or an Uber, take...","if ( Monzo.cardPurchase.Category == ""Transport...",True
4,"When spending on entertainment, take the amoun...","if ( Monzo.cardPurchase.Category == ""Entertain...",True
...,...,...,...
83,Sends an IFTTT notification reminding you to d...,var timeOfDay = Meta.currentUserTime.hour(); ...,True
84,Record Bottom of Every Hour Between 6 AM and 7PM,var hour = Meta.currentUserTime.hour() if...,True
85,Gets a random wallpaper from http://inspirobot...,var pathMin : number = 4; var pathMax : number...,True
86,Sends reports of sysmos in the Mexican Republi...,var Texto = Twitter.newTweetByUser.Text; var ...,True


In [None]:
# Load the Step3 dataset
step3_path = "datasets/Step3_IoT_Rules_with_FilterCode.csv"
step3_data = pd.read_csv(step3_path)

# Apply the same preprocessing and filtering to Step3
step3_cleaned = step3_data.dropna(subset=['description', 'filter_code']).copy()
step3_cleaned['cleaned_description'] = step3_cleaned['description'].apply(preprocess_description).apply(remove_newlines)
step3_cleaned['filter_code'] = step3_cleaned['filter_code'].apply(remove_js_comments).apply(remove_newlines)
step3_cleaned['cleaned_description'] = step3_cleaned['cleaned_description'].apply(lambda x: filter_long_entries(x, 200))
step3_cleaned['filter_code'] = step3_cleaned['filter_code'].apply(lambda x: filter_long_entries(x, 400))
step3_cleaned = step3_cleaned.dropna(subset=['cleaned_description', 'filter_code'])
step3_cleaned = step3_cleaned[['cleaned_description', 'filter_code']]

# Print the number of rows after length filtering
print(len(step3_cleaned))

# Check the correlation between description and code for Step3
step3_cleaned['is_correlated'] = step3_cleaned.apply(
    lambda row: is_correlated(row['cleaned_description'], row['filter_code']),
    axis=1
)
step3_cleaned = step3_cleaned[step3_cleaned['is_correlated']]

# Translate the 'cleaned_description' column to English where needed
step3_cleaned['cleaned_description'] = step3_cleaned['cleaned_description'].apply(translate_to_english_hf)

# Print the number of rows after the correlation filter and translation
print(len(step3_cleaned))

# Reset the index
step3_cleaned.reset_index(drop=True, inplace=True)

# Remove near-duplicates in Step3
step3_cleaned = remove_similar_entries(step3_cleaned, 'cleaned_description', threshold = 90)

# Save the cleaned Step3 dataset
step3_cleaned.to_csv("step3_cleaned_final.csv", index=False)

33
33


In [None]:
step3_cleaned

Unnamed: 0,cleaned_description,filter_code,is_correlated
0,Activate a Scene to operate Hunter Douglas Pow...,var hour = Meta.currentUserTime.hour() if ...,True
1,Use your Caavo Voice Remote to set the tempera...,if (Caavo.voiceSearch.Text.toLowerCase().index...,True
2,This applet will send your robot back to the d...,var timeOfDay = Meta.currentUserTime.hour(); ...,True
3,This Applet sets your Arlo to record when moti...,var Day = Meta.currentUserTime.day() var Hour...,True
4,Which ever color tier your latest Super Chat m...,"if (Youtube.newSuperchat.ColorTier == ""Light b...",True
5,Closes the Main Gate every hour after 9 PM and...,var hour = Meta.triggerTime.hour() if ...,True
6,Your lights will turn on when you're heading t...,var hour = Meta.currentUserTime.hour() if (ho...,True
7,Enter your home address on the map and when yo...,let sunrise = moment(Weather.currentWeather[0]...,True
8,Feel safer when someone rings your doorbell at...,var timeOfDay = Meta.currentUserTime.hour() ...,True
9,This applet will turn your eWeLink 1 channel s...,var timeOfDay = Meta.currentUserTime.hour(); ...,True


# Merging and Finalizing the Dataset

This section combines the cleaned datasets from different processing steps (`step1_cleaned`, `step2_cleaned`, `step3_cleaned`) into a single dataset. After merging, it applies additional text cleaning to ensure data consistency. Only the relevant columns (`cleaned_description` and `filter_code`) are retained. The dataset is then shuffled to randomize the order of entries before saving the final processed data to a CSV file for further use.


In [None]:
# Merge step1, step2, and step3
combined_df = pd.concat([step1_cleaned, step2_cleaned, step3_cleaned], ignore_index=True)

# Strip non-ASCII characters from both columns
combined_df["cleaned_description"] = combined_df["cleaned_description"].apply(clean_text)
combined_df["filter_code"] = combined_df["filter_code"].apply(clean_text)

# Keep only the cleaned_description and filter_code columns
combined_df = combined_df[["cleaned_description", "filter_code"]]

# Shuffle the dataset (no seed is set, so the order differs between runs)
combined_df = combined_df.sample(frac=1).reset_index(drop=True)

# Save to CSV
combined_df.to_csv("combined.csv", index=False)
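
The `clean_text` step drops every non-ASCII character outright, so accented letters disappear rather than being replaced by their base letter. The sketch below illustrates that behavior and, as an alternative the notebook does not use, a hypothetical `fold_to_ascii` helper that first decomposes characters with `unicodedata.normalize("NFKD", ...)` so the base letters survive.

```python
import unicodedata


def clean_text(text):
    """Same approach as the notebook helper: silently drop non-ASCII characters."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def fold_to_ascii(text):
    """Hypothetical alternative: decompose accented letters first, then drop the accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


# "Café déjà vu", written with escapes so the example does not depend on file encoding.
sample = "Caf\u00e9 d\u00e9j\u00e0 vu"
dropped = clean_text(sample)
folded = fold_to_ascii(sample)
```

Here `dropped` is `"Caf dj vu"` and `folded` is `"Cafe deja vu"`. Whether the difference matters depends on how much non-English text remains after the translation step.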

In [None]:
combined_df

Unnamed: 0,cleaned_description,filter_code
0,Activates a Lutron Caseta scene between 8pm an...,var timeOfDay = Meta.currentUserTime.hour() ...
1,Enter the time you fell asleep followed by the...,var values = DoNote.doNoteNewCommandCommon.Not...
2,Send your significant other (or anybody else) ...,var Hour = Meta.currentUserTime.hour() var Day...
3,Keep your followers informed on the latest in ...,var title = Trigger.EntryTitle.toLowerCase() v...
4,Turn LIFX light on between 20PM and 7AM when c...,var hh = Meta.currentUserTime.hour() if ( hh ...
...,...,...
281,Keep your team in the loop hassle-free. This A...,var Hour = Meta.currentUserTime.hour() var Day...
282,Turns on WeMo light switch when an area is ent...,var timeOfDay = Meta.currentUserTime.hour() i...
283,Which ever color tier your latest Super Chat m...,"if (Youtube.newSuperchat.ColorTier == ""Light b..."
284,Turns up the volume of your ringtone so you do...,var timeOfDay = Meta.currentUserTime.hour(); ...
