# Environment Setup and Authentication

This section installs the libraries the notebook depends on, including `transformers`, `peft`, and `bitsandbytes`, which support loading quantized language models and fine-tuning them efficiently. It then imports the modules used for model loading, tokenization, parameter-efficient fine-tuning with LoRA, and dataset handling. Authentication with Hugging Face, which is needed to access gated models and datasets on the platform, is not shown in the cells below.


In [1]:
%%capture

!pip install transformers==4.36.2
!pip install -U peft
!pip install -U accelerate
!pip install -U trl
!pip install datasets==2.16.0
!pip install sentencepiece
!pip install -U bitsandbytes
!pip install fuzzywuzzy

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
import torch
import pandas as pd
from datasets import Dataset


# Repository Cloning and Dataset Extraction

This section assumes the notebook runs from the root of a local clone of the `ifttt-code-generator` GitHub repository; the clone step itself is not shown. The cell extracts the compressed dataset archive `datasets/FilterDatasets.zip` into the `datasets` directory so the CSV files are available for the steps that follow.


In [3]:
!unzip -q datasets/FilterDatasets.zip -d datasets

'unzip' is not recognized as an internal or external command,
operable program or batch file.
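
Because the `unzip` command is not available on every platform, the extraction can also be done from Python itself. The following is a minimal sketch using the standard-library `zipfile` module; the helper name `extract_archive` is an assumption made for illustration, and the paths simply mirror the `unzip` call above.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_archive(zip_path: str, target_dir: str) -> List[str]:
    """Extract every member of zip_path into target_dir and return the member names."""
    target = Path(target_dir)
    # Create the destination directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


if __name__ == "__main__":
    # Same paths as the unzip call; assumes the working directory is the repository root.
    archive_path = Path("datasets/FilterDatasets.zip")
    if archive_path.exists():
        print(extract_archive(str(archive_path), "datasets"))
```

This version behaves the same on Windows, Linux, and macOS, since it does not depend on an external executable.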


# Data Cleaning and Preprocessing Functions

This section defines helper functions for cleaning and preprocessing the text and code data. They replace newline characters with spaces, strip JavaScript comments, prepend a task prefix, discard entries above a maximum length, and remove non-ASCII characters. Two further helpers check whether a prompt and its code share at least one keyword and remove near-duplicate entries by fuzzy string matching. Together these steps standardize the data before it is used further.


In [4]:
import pandas as pd
import re

# Replace newline characters with spaces and trim surrounding whitespace
def remove_newlines(text):
    if isinstance(text, str):
        return text.replace("\n", " ").replace("\r", " ").strip()
    return text


# Remove JavaScript comments
def remove_js_comments(code):
    if isinstance(code, str):
        # Remove single-line comments (// ...). Note: this also truncates
        # string literals that contain "//", such as URLs.
        code = re.sub(r"//.*", "", code)
        # Remove multi-line comments (/* ... */)
        code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
        # Trim the whitespace left behind at both ends
        return code.strip()
    return code

# Preprocess descriptions (currently a pass-through; the steps below are disabled)
def preprocess_description(description):
    if isinstance(description, str):
        #description = re.sub(r"[^a-zA-Z0-9\s]", "", description)  # Remove special characters
        #description = description.lower().strip()  # Lowercase and trim extra spaces
        return description
    return description

# Prepend the task prefix to a string
def add_prefix(code):
    if isinstance(code, str):
        return "generate filter code:" + code
    return code

# Replace entries longer than max_length with None so they can be dropped later
def filter_long_entries(value, max_length):
    if isinstance(value, str) and len(value) > max_length:
        return None
    return value

# Check whether cleaned_description and filter_code are related:
# True if any word of the prompt appears as a substring of the code
def is_correlated(prompt, code):
    # Extract keywords from the prompt
    prompt_keywords = set(prompt.lower().split()) if isinstance(prompt, str) else set()

    # Lowercase the code and look for any of the keywords in it
    code_text = code.lower() if isinstance(code, str) else ""
    matches = any(keyword in code_text for keyword in prompt_keywords)

    return matches

from fuzzywuzzy import fuzz

def are_strings_similar(string1, string2, threshold=80):
    similarity_ratio = fuzz.ratio(string1, string2)
    return similarity_ratio >= threshold

# Find and remove near-duplicate rows based on string similarity
def remove_similar_entries(df, column, threshold=80):
    to_remove = set()
    for i in range(len(df)):
        if i in to_remove:
            continue
        for j in range(i + 1, len(df)):
            if j in to_remove:
                continue
            similarity = fuzz.ratio(df.iloc[i][column], df.iloc[j][column])
            if similarity > threshold:
                to_remove.add(j)
    # i and j are positions, so translate them to index labels before dropping
    return df.drop(index=df.index[list(to_remove)]).reset_index(drop=True)


# Remove non-ASCII characters
def clean_text(t):
    if isinstance(t, str):
        return t.encode("ascii", errors="ignore").decode("ascii")
    return t
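
The near-duplicate removal above is a greedy pass: an entry is kept only if it is not too similar to any entry already kept. The sketch below shows the same idea on a plain list of strings. It uses the standard-library `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so scores may differ slightly from `fuzzywuzzy`; the names `similarity` and `drop_near_duplicates` and the sample strings are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def drop_near_duplicates(items: List[str], threshold: int = 80) -> List[str]:
    """Keep an item only if it scores at most `threshold` against every kept item."""
    kept: List[str] = []
    for candidate in items:
        if all(similarity(candidate, previous) <= threshold for previous in kept):
            kept.append(candidate)
    return kept


# Made-up sample descriptions, not taken from the dataset.
rules = ["turn on the light", "turn on the lights", "send an email"]
print(drop_near_duplicates(rules))  # ['turn on the light', 'send an email']
```

Like the DataFrame version, this compares every candidate against all kept entries, so the cost grows quadratically with the number of rows.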

# First Dataset

In [5]:
!pip install langdetect
!pip install googletrans




# Text Translation to English

This section loads a multilingual-to-English translation model from Hugging Face (`Helsinki-NLP/opus-mt-mul-en`) and defines `translate_to_english_hf`, which detects the language of a text with `langdetect` and translates it only when the detected language is not English. The function tokenizes the input, generates the translation, and decodes the output. If detection or translation raises an exception, it prints the error and returns the original text unchanged.


In [6]:
from transformers import MarianMTModel, MarianTokenizer
from langdetect import detect

# Load a translation model from Hugging Face
model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_to_english_hf(text):
    try:
        # Detect the language of the text
        detected_language = detect(text)
        if detected_language != "en":
            # Tokenize the text
            inputs = tokenizer(text, return_tensors="pt", padding=True)
            # Generate the translation
            translated = model.generate(**inputs)
            # Decode the translation
            translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
            return translated_text
        else:
            return text
    except Exception as e:
        # On any failure, report the error and fall back to the original text
        print(f"Error during translation: {e}")
        return text


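
The detect-then-translate flow described above can be exercised without downloading a model by passing the detector and translator in as functions. The following is a sketch under that assumption: `translate_non_english`, `fake_detect`, and `fake_translate` are hypothetical names introduced here, and the two stand-ins only imitate language detection and translation.

```python
from typing import Callable, List, Optional


def translate_non_english(
    texts: List[Optional[str]],
    detect_fn: Callable[[str], str],
    translate_fn: Callable[[str], str],
) -> List[Optional[str]]:
    """Translate only the entries whose detected language is not English."""
    results: List[Optional[str]] = []
    for text in texts:
        # Leave missing or blank values untouched.
        if not isinstance(text, str) or not text.strip():
            results.append(text)
            continue
        try:
            language = detect_fn(text)
            results.append(text if language == "en" else translate_fn(text))
        except Exception as error:
            # Same fallback as the notebook function: report and keep the original.
            print(f"Translation skipped: {error}")
            results.append(text)
    return results


# Hypothetical stand-ins so the sketch runs without a model or network access.
def fake_detect(text: str) -> str:
    return "it" if "ciao" in text.lower() else "en"


def fake_translate(text: str) -> str:
    return text.lower().replace("ciao", "hello")


print(translate_non_english(["Ciao mondo", "Send an email", None, ""], fake_detect, fake_translate))
```

In the notebook, `detect_fn` would correspond to `langdetect.detect` and `translate_fn` to the tokenize, generate, and decode steps performed with the Marian model.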

In [7]:

# Load the Step1 dataset
step1_path = "datasets/Step1_Raw_Data_with_FilterCode 1.csv"
step1_data = pd.read_csv(step1_path)

step1_cleaned = step1_data.dropna()

In [8]:
step1_cleaned

Unnamed: 0.1,Unnamed: 0,by_service_owner,channels,description,friendly_id,id,installs_count,name,permissions,pro_features,...,service_slug,services,speed,uniq_permissions,services_len,service_triggers,service_actions,triggers_category,actions_category,filter_code
54,54,True,"[{'name': 'Chicago Transit Authority', 'module...",Commuting in Chicago? This Applet posts to a S...,CrxshJV6-post-to-slack-when-delays-affect-your...,CrxshJV6,3.0,Post to Slack when delays affect your morning ...,"[{'id': '/triggers/cta.new_pink_line_alert', '...",True,...,cta,"['cta', 'slack']",Polling Applets usually run within 1 hour,"[{'id': '/actions/slack.post_to_channel', 'nam...",2,Chicago Transit Authority,['Slack'],Travel & transit,['Communication'],var Hour = Meta.currentUserTime.hour()\nvar Da...
111,111,True,"[{'name': 'Notifications', 'module_name': 'if_...",This Applet will send you a notification when ...,UMq6ryuD-get-notified-when-a-nj-transit-adviso...,UMq6ryuD,34.0,Get notified when a NJ transit advisory affect...,[{'id': '/triggers/nj_transit.new_bus_advisory...,True,...,nj_transit,"['if_notifications', 'nj_transit']",Polling Applets usually run within 1 hour,[{'id': '/actions/if_notifications.send_notifi...,2,NJ Transit,['Notifications'],Travel & transit,['Notifications'],var Hour = Meta.currentUserTime.hour()\nvar Da...
138,138,True,"[{'name': 'Notifications', 'module_name': 'if_...",This Applet helps you work around commute disr...,K9GSUniy-get-notified-when-there-s-a-rider-ale...,K9GSUniy,7.0,Get notified when there's a rider alert on Dal...,"[{'id': '/triggers/dart.new_dart_rider_alert',...",True,...,dart,"['if_notifications', 'dart']",Polling Applets usually run within 1 hour,[{'id': '/actions/if_notifications.send_notifi...,2,DART,['Notifications'],Travel & transit,['Notifications'],var Hour = Meta.currentUserTime.hour()\nvar Da...
162,162,False,"[{'name': 'Twitter', 'module_name': 'twitter',...",Get notified when @Wario64 (usually the fastes...,HaLPjmCQ-super-nintendo-classic-in-stock-alerts,HaLPjmCQ,377.0,Super Nintendo Classic in stock alerts,"[{'id': '/triggers/twitter.new_tweet_by_user',...",True,...,twitter,"['twitter', 'if_notifications']",Polling Applets usually run within 1 hour,[{'id': '/actions/if_notifications.send_notifi...,2,Twitter,['Notifications'],Social networks,['Notifications'],"if (Twitter.newTweetByUser.Text.indexOf(""SNES""..."
359,359,False,"[{'name': 'Twitter', 'module_name': 'twitter',...",Post Mastodon's Toot on Twitter(Exclude Mentions),K6FLEawj-mastodon-twitter-exclude-mentions,K6FLEawj,19.0,Mastodon→Twitter(Exclude Mentions),"[{'id': '/triggers/feed.new_feed_item', 'name'...",True,...,twitter,"['twitter', 'feed']",Polling Applets usually run within 1 hour,"[{'id': '/triggers/feed.new_feed_item', 'name'...",2,RSS Feed,['Twitter'],News & information,['Social networks'],"if(Feed.newFeedItem.EntryContent.indexOf(""@"") ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49660,49820,False,"[{'name': 'Twitter', 'module_name': 'twitter',...",SantaAdrianaV,wdgHbhsc-santaadrianav,wdgHbhsc,5.0,SantaAdrianaV,"[{'id': '/triggers/twitter.new_tweet_by_user',...",True,...,twitter,['twitter'],Polling Applets usually run within 1 hour,[],1,Twitter,['Twitter'],Social networks,['Social networks'],//var timeOfDay = Meta.triggerTime.weekday()\r...
49775,49935,False,"[{'name': 'Twitter', 'module_name': 'twitter',...",RochelYaracuy,QPBbDg6v-rochelyaracuy,QPBbDg6v,5.0,RochelYaracuy,"[{'id': '/triggers/twitter.new_tweet_by_user',...",True,...,twitter,['twitter'],Polling Applets usually run within 1 hour,[],1,Twitter,['Twitter'],Social networks,['Social networks'],//var timeOfDay = Meta.triggerTime.weekday()\r...
49843,50004,False,"[{'name': 'Twitter', 'module_name': 'twitter',...",SedicionCivil,FaXmWShw-sedicioncivil,FaXmWShw,5.0,SedicionCivil,"[{'id': '/triggers/twitter.new_tweet_by_user',...",True,...,twitter,['twitter'],Polling Applets usually run within 1 hour,[],1,Twitter,['Twitter'],Social networks,['Social networks'],//var timeOfDay = Meta.triggerTime.weekday()\r...
49954,50115,False,"[{'name': 'Twitter', 'module_name': 'twitter',...",BenicasimPlaya,JBjcvJnm-benicasimplaya,JBjcvJnm,5.0,BenicasimPlaya,"[{'id': '/triggers/twitter.new_tweet_by_user',...",True,...,twitter,['twitter'],Polling Applets usually run within 1 hour,[],1,Twitter,['Twitter'],Social networks,['Social networks'],//var timeOfDay = Meta.triggerTime.weekday()\r...


In [9]:
import ast
import pandas as pd

def generate_permission_dataframe(permissions):
    """
    Given the string representation of a list of two permission dictionaries,
    extract the 'name' and 'service_name' fields and build a rule string of the form
    "if <first name> (trigger_service: ...) then <second name> (action_service: ...)".
    """
    # Parse the string into a list of dictionaries. ast.literal_eval accepts
    # only Python literals, whereas eval would execute arbitrary code.
    permissions = ast.literal_eval(permissions)

    first_name = permissions[0]['name']
    second_name = permissions[1]['name']
    first_service = permissions[0]['service_name']
    second_service = permissions[1]['service_name']

    # Build the rule string
    return f"if {first_name} (trigger_service: {first_service}) then {second_name} (action_service: {second_service})"

# Apply to step1_cleaned; take an explicit copy first so the column assignment
# does not trigger pandas' SettingWithCopyWarning
step1_cleaned = step1_cleaned.copy()
step1_cleaned['permission_df'] = step1_cleaned['permissions'].apply(generate_permission_dataframe)

# step1_new keeps only permission_df and filter_code
step1_new = step1_cleaned[['permission_df', 'filter_code']]
# Reset the index
step1_new = step1_new.reset_index(drop=True)


In [10]:
step1_new

Unnamed: 0,permission_df,filter_code
0,if New pink line alert (trigger_service: Chica...,var Hour = Meta.currentUserTime.hour()\nvar Da...
1,if New bus advisory (trigger_service: NJ Trans...,var Hour = Meta.currentUserTime.hour()\nvar Da...
2,if New DART rider alert (trigger_service: DART...,var Hour = Meta.currentUserTime.hour()\nvar Da...
3,if New tweet by a specific user (trigger_servi...,"if (Twitter.newTweetByUser.Text.indexOf(""SNES""..."
4,if New feed item (trigger_service: RSS Feed) t...,"if(Feed.newFeedItem.EntryContent.indexOf(""@"") ..."
...,...,...
428,if New tweet by a specific user (trigger_servi...,//var timeOfDay = Meta.triggerTime.weekday()\r...
429,if New tweet by a specific user (trigger_servi...,//var timeOfDay = Meta.triggerTime.weekday()\r...
430,if New tweet by a specific user (trigger_servi...,//var timeOfDay = Meta.triggerTime.weekday()\r...
431,if New tweet by a specific user (trigger_servi...,//var timeOfDay = Meta.triggerTime.weekday()\r...
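
To make the rule format concrete, the sketch below parses one permissions string and builds the corresponding rule. It is a defensive variant written for illustration: the function name `build_rule` is an assumption, it returns `None` instead of raising when the input is malformed or has fewer than two entries, and the sample string is made up in the style of the `permissions` column (the action name in particular is invented).

```python
import ast
from typing import Optional


def build_rule(permissions_repr: str) -> Optional[str]:
    """Build the 'if <trigger> then <action>' string, or None if the input is unusable."""
    try:
        # literal_eval parses Python literals only; it does not execute code.
        permissions = ast.literal_eval(permissions_repr)
    except (ValueError, SyntaxError, TypeError):
        return None
    if not isinstance(permissions, list) or len(permissions) < 2:
        return None
    trigger, action = permissions[0], permissions[1]
    try:
        return (
            f"if {trigger['name']} (trigger_service: {trigger['service_name']}) "
            f"then {action['name']} (action_service: {action['service_name']})"
        )
    except (KeyError, TypeError):
        return None


# Made-up sample in the style of the dataset's permissions column.
sample = (
    "[{'id': '/triggers/cta.new_pink_line_alert', 'name': 'New pink line alert', "
    "'service_name': 'Chicago Transit Authority'}, "
    "{'id': '/actions/slack.post_to_channel', 'name': 'Post to channel', "
    "'service_name': 'Slack'}]"
)
print(build_rule(sample))
print(build_rule("[]"))  # None: fewer than two permissions
```

Returning `None` for unusable rows lets a later `dropna()` remove them instead of stopping the whole `apply` call.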


In [None]:
# Load the Step2 dataset
step2_path = "datasets/Step2_Popular_Rules_with_FilterCode.csv"
step2_data = pd.read_csv(step2_path)
#drop null for filter codes and permissions
step2_data = step2_data.dropna(subset=['permissions', 'filter_code'])

#apply function
step2_new = step2_data.copy()
step2_new['permission_df'] = step2_new['permissions'].apply(generate_permission_dataframe)
step2_new = step2_new[['permission_df', 'filter_code']]
step2_new.reset_index(drop=True, inplace=True)

In [None]:
step2_new

Unnamed: 0,permission_df,filter_code
0,if New bus advisory (trigger_service: NJ Trans...,var Hour = Meta.currentUserTime.hour()\nvar Da...
1,if New tweet by a specific user (trigger_servi...,"if (Twitter.newTweetByUser.Text.indexOf(""SNES""..."
2,if New feed item (trigger_service: RSS Feed) t...,"if(Feed.newFeedItem.EntryContent.indexOf(""@"") ..."
3,if Any card purchase (trigger_service: Monzo) ...,"if ( Monzo.cardPurchase.Category == ""Transport..."
4,if Any card purchase (trigger_service: Monzo) ...,"if ( Monzo.cardPurchase.Category == ""Entertain..."
...,...,...
140,if Every day at (trigger_service: Date & Time)...,var pathMin : number = 4;\nvar pathMax : numbe...
141,if New upvoted post by you (trigger_service: r...,// Clean up the song title:\r\nvar title = Red...
142,if New tweet by a specific user (trigger_servi...,var Texto = Twitter.newTweetByUser.Text;\r\nva...
143,if New status message on page (trigger_service...,\nif (FacebookPages.newStatusMessageByPage.Mes...


In [None]:
# Load the Step3 dataset
step3_data = pd.read_csv("datasets/Step3_IoT_Rules_with_FilterCode.csv")
# Drop rows with missing permissions or filter code before parsing,
# since the parsing function cannot handle null values
step3_new = step3_data.dropna(subset=['permissions', 'filter_code']).copy()
step3_new['permission_df'] = step3_new['permissions'].apply(generate_permission_dataframe)
step3_new = step3_new[['permission_df', 'filter_code']]

# Reset the index
step3_new = step3_new.reset_index(drop=True)

In [None]:
len(step3_new)

37

In [None]:
#merge the 3 dataframes
concatenated_df = pd.concat([step1_new, step2_new, step3_new], ignore_index=True)

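# Strip comments before flattening newlines: once the code is a single line,
# the "//.*" pattern would delete everything after the first "//"
concatenated_df['filter_code'] = concatenated_df['filter_code'].apply(remove_js_comments)
concatenated_df['filter_code'] = concatenated_df['filter_code'].apply(remove_newlines)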

#drop rows with filter code na or empty
#concatenated_df = concatenated_df.dropna(subset=['filter_code'])
#concatenated_df = concatenated_df[concatenated_df['filter_code'] != '']

#drop duplicates
#concatenated_df = concatenated_df.drop_duplicates(subset=['filter_code'])
concatenated_df.reset_index(drop=True, inplace=True)

#generate a csv
concatenated_df.to_csv("permissions_df.csv", index=False)
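
The order of the two cleaning steps matters, because single-line comment removal relies on line breaks to know where a comment ends. The sketch below illustrates this on a made-up filter snippet; `strip_js_comments` and `flatten_newlines` are local copies of the two helpers, renamed here so the example is self-contained.

```python
import re


def strip_js_comments(code: str) -> str:
    """Remove // and /* */ comments, then trim surrounding whitespace."""
    code = re.sub(r"//.*", "", code)
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    return code.strip()


def flatten_newlines(code: str) -> str:
    """Replace newline characters with spaces and trim surrounding whitespace."""
    return code.replace("\n", " ").replace("\r", " ").strip()


# Made-up snippet in the style of an IFTTT filter, starting with a comment.
snippet = (
    "// skip at night\n"
    "var hour = Meta.currentUserTime.hour()\n"
    "if (hour < 7) { Slack.postToChannel.skip() }"
)

comments_first = flatten_newlines(strip_js_comments(snippet))
newlines_first = strip_js_comments(flatten_newlines(snippet))

print(repr(comments_first))  # the code survives, only the comment is gone
print(repr(newlines_first))  # '' because the comment swallowed the whole line
```

Several rows in the dataset begin with a `//` comment, so flattening newlines first would leave those entries empty.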

In [None]:
len(concatenated_df)

615