# A BERT-based Model for Semantic Consistency Checking of IFTTT applets

**Authors:**
<br>[Bernardo Breve](https://orcid.org/0000-0002-3898-7512)<br>
[Gaetano Cimino](https://orcid.org/0000-0001-8061-7104)<br>
[Vincenzo Deufemia](https://orcid.org/0000-0002-6711-3590)<br>
[Annunziata Elefante](https://orcid.org/0009-0001-7141-6105)<br>
**Date created:** 2023/07/25<br>
**Description:** In this paper, we propose a BERT (Bidirectional Encoder Representations from Transformers)-based model for semantic consistency checking of IFTTT applets. Our model uses pre-trained language representations to learn the semantics of applet components and identifies inconsistencies within the user-defined descriptions associated with applets.

## Introduction

According to the IFTTT creation paradigm, when a user creates a new applet, they must provide a natural language description that summarizes how the applet works. By reading this field, another user can more easily understand what the applet is for and decide whether to activate it on their device. However, IFTTT does not check the content of the description, so a creator could write anything, including text that falsely describes the applet's behavior. To address this, we developed a model that checks whether the trigger-action components of an applet are semantically consistent with the natural language description provided by its creator. We fine-tuned a BERT-based classification model that takes as input a pattern derived from the applet components together with the corresponding user-defined description, and outputs a label ('cc', 'ce', 'ec' or 'ee') and a similarity score for the two sentences.

### References

* ["An empirical characterization of IFTTT: ecosystem, usage, and performance"](https://doi.org/10.1145/3131365.3131369)

In [None]:
!pip install transformers

In [None]:
!pip install nltk

## Setup

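**Note**: install HuggingFace `transformers` via `pip install transformers`. The code below reads `bert_output.last_hidden_state` as an attribute, which requires a release that returns model outputs as objects by default (version >= 4.0.0).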

In [None]:
import numpy as np
import tensorflow as tf
import transformers
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import seaborn as sn
import matplotlib.pyplot as plt
import sklearn.metrics as metrics

In [None]:
train_path = 'trainSet.csv'

col_names = ['similarity','pattern','desc']
train_df = pd.read_csv(train_path,skiprows=1,sep=';',names=col_names,encoding = "ISO-8859-1")

train_df

In [None]:
val_path = 'devSet.csv'

valid_df = pd.read_csv(val_path,skiprows=1,sep=';',names=col_names,encoding = "ISO-8859-1")

valid_df

In [None]:
test_path = 'testSet.csv'

test_df = pd.read_csv(test_path,skiprows=1,sep=';',names=col_names,encoding = "ISO-8859-1")

test_df

## Configuration

In [88]:
max_length = 70  # Maximum length of input sentence to the model.
batch_size = 64
epochs = 4

# Labels in our dataset.
labels = ["cc", "ce", "ec", "ee"]
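
To make the role of `max_length` concrete, the sketch below pads or truncates a list of token ids to a fixed length and builds the matching attention mask. It is a simplified stand-in for what the tokenizer does later in the notebook: the helper name `pad_or_truncate`, the made-up ids, the `pad_id=0` default, and the plain right-truncation are assumptions for illustration (the real tokenizer also handles special tokens and sentence pairs).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]      # drop anything past max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every input therefore reaches the model with the same shape, which is why the Keras input layers can be declared with `shape=(max_length,)`.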

## Load the Data

In [None]:
# Shape of the data
print(f"Total training samples : {train_df.shape[0]}")
print(f"Total validation samples: {valid_df.shape[0]}")
print(f"Total test samples: {test_df.shape[0]}")

Dataset Overview:

- pattern: IF Any new SMS received (Android SMS) THEN Send me an email (Email).
- description: A mail will be sent to yourself when you receive a sms.
- similarity: This is the label chosen by the majority of annotators.
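
As a rough illustration of how a pattern string relates to the applet components, the sketch below rebuilds the example pattern from its parts. The helper `build_pattern` and its template are hypothetical: they are inferred from the single example shown above, not taken from the code that produced the dataset.

```python
def build_pattern(trigger, trigger_service, action, action_service):
    """Compose an IF/THEN pattern string from applet components.

    Hypothetical helper: the template mirrors the one example pattern in
    the dataset overview and may not cover every applet.
    """
    return f"IF {trigger} ({trigger_service}) THEN {action} ({action_service})."


pattern = build_pattern(
    "Any new SMS received", "Android SMS", "Send me an email", "Email"
)
print(pattern)
```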

Each applet is labeled with one of the following similarity values, which describe how well the user-defined description (UDD) matches the pattern synthesized from the applet components. The labels are stored in lowercase in the dataset ('ee', 'cc', 'ec', 'ce'):

- EE: complete consistency between the UDD and the synthesized pattern. Both the trigger and the action components are accurately represented in the UDD.
- CC: complete inconsistency between the UDD and the synthesized pattern. Neither the trigger nor the action component is correctly represented in the UDD.
- EC: partial consistency between the UDD and the synthesized pattern, limited to the trigger component. The trigger component in the UDD is correct, but the action component does not align with the pattern.
- CE: partial consistency between the UDD and the synthesized pattern, limited to the action component. The action component in the UDD is correct, but the trigger component does not align with the pattern.
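
The four classes can be read as two independent flags, one per component. The sketch below decodes a label under that reading: the first letter refers to the trigger, the second to the action, and `e` marks the component the description gets right. The function name `decode_label` and the returned keys are illustrative choices, not part of the notebook.

```python
def decode_label(label):
    """Split a two-letter label into per-component consistency flags."""
    label = label.lower()
    if len(label) != 2 or any(ch not in "ce" for ch in label):
        raise ValueError(f"unexpected label: {label!r}")
    return {
        "trigger_consistent": label[0] == "e",  # first letter: trigger
        "action_consistent": label[1] == "e",   # second letter: action
    }


for name in ["ee", "cc", "ec", "ce"]:
    print(name, decode_label(name))
```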

Let's look at one sample from the dataset:

In [None]:
print(f"Pattern: {train_df.loc[5, 'pattern']}")
print(f"Description: {train_df.loc[5, 'desc']}")
print(f"Similarity: {train_df.loc[5, 'similarity']}")

## Preprocessing

In [None]:
# Some entries in the training data are NaN; we simply drop them.
print("Number of missing values")
print(train_df.isnull().sum())
train_df.dropna(axis=0, inplace=True)

Distribution of our training, validation, and test targets

In [None]:
print("Training Target Distribution")
print(train_df.similarity.value_counts())

In [None]:
print("Validation Target Distribution")
print(valid_df.similarity.value_counts())

In [None]:
print("Test Target Distribution")
print(test_df.similarity.value_counts())

Removal of stop words and punctuation characters

In [64]:
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = str(text).replace(punctuation, '')
    return text

In [None]:
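# Build the stop-word set once; calling stopwords.words() for every token is very slow.
# With no language argument, NLTK returns the stop words of every available language.
STOP_WORDS = set(stopwords.words())

def remove_stop_words(text):
  text_tokens = word_tokenize(text)
  tokens_without_sw = [word for word in text_tokens if word not in STOP_WORDS]
  filtered_sentence = " ".join(tokens_without_sw)
  return filtered_sentence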

Text normalization

In [66]:
def lowercase(text):
  return str(text).lower()
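
The combined effect of the three steps can be previewed without downloading any NLTK data. The sketch below applies them in the same order as the notebook (punctuation, stop words, lowercasing) to the example description from the dataset overview. The tiny stop-word set and the whitespace tokenization are assumptions standing in for NLTK's corpus and `word_tokenize`, so the exact tokens may differ from the real pipeline.

```python
import string

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's corpus.
TOY_STOP_WORDS = {"a", "to", "be", "will", "when", "you"}


def preprocess(text, stop_words=TOY_STOP_WORDS):
    # 1. Strip punctuation characters.
    text = str(text).translate(str.maketrans("", "", string.punctuation))
    # 2. Drop stop words (case-sensitive comparison, as in the notebook).
    tokens = [word for word in text.split() if word not in stop_words]
    # 3. Lowercase the remaining text.
    return " ".join(tokens).lower()


cleaned = preprocess("A mail will be sent to yourself when you receive a sms.")
print(cleaned)  # a mail sent yourself receive sms
```

Because lowercasing runs last, the capitalized "A" at the start of the sentence does not match the lowercase stop word and survives the filter. Lowercasing before the stop-word step would remove it.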

In [None]:
train_df["desc"] = train_df["desc"].apply(remove_punctuations)
train_df["desc"] = train_df["desc"].apply(remove_stop_words)
train_df["desc"] = train_df["desc"].apply(lowercase)

In [None]:
valid_df["desc"] = valid_df["desc"].apply(remove_punctuations)
valid_df["desc"] = valid_df["desc"].apply(remove_stop_words)
valid_df["desc"] = valid_df["desc"].apply(lowercase)

In [None]:

test_df["desc"] = test_df["desc"].apply(remove_punctuations)
test_df["desc"] = test_df["desc"].apply(remove_stop_words)
test_df["desc"] = test_df["desc"].apply(lowercase)

One-hot encode training, validation, and test labels

In [70]:
train_df["label"] = train_df["similarity"].apply(lambda x: 0 if x == "cc" else 1 if x == "ce" else 2 if x == "ec" else 3)
y_train = tf.keras.utils.to_categorical(train_df.label, num_classes=4)

valid_df["label"] = valid_df["similarity"].apply(lambda x: 0 if x == "cc" else 1 if x == "ce" else 2 if x == "ec" else 3)
y_val = tf.keras.utils.to_categorical(valid_df.label, num_classes=4)

test_df["label"] = test_df["similarity"].apply(lambda x: 0 if x == "cc" else 1 if x == "ce" else 2 if x == "ec" else 3)
y_test = tf.keras.utils.to_categorical(test_df.label, num_classes=4)
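
For reference, the encoding above amounts to mapping each label to its position in `labels` and setting that position to 1 in a vector of zeros. The sketch below reproduces this with plain Python lists; the helper name `one_hot` is illustrative, and unlike the lambda above (which maps any unrecognized value to class 3) it raises a `KeyError` on unknown labels.

```python
labels = ["cc", "ce", "ec", "ee"]
label_to_index = {name: i for i, name in enumerate(labels)}


def one_hot(label):
    """Return a 4-element vector with a single 1.0 at the label's index."""
    vector = [0.0] * len(labels)
    vector[label_to_index[label]] = 1.0
    return vector


print(one_hot("ec"))  # [0.0, 0.0, 1.0, 0.0]
```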

## Keras Custom Data Generator

In [71]:
class BertSemanticDataGenerator(tf.keras.utils.Sequence):
    """Generates batches of data.

    Args:
        sentence_pairs: Array of (pattern, description) input sentence pairs.
        labels: Array of labels.
        batch_size: Integer batch size.
        shuffle: boolean, whether to shuffle the data.
        include_targets: boolean, whether to include the labels.

    Returns:
        Tuples `([input_ids, attention_mask, token_type_ids], labels)`
        (or just `[input_ids, attention_mask, token_type_ids]`
         if `include_targets=False`)
    """

    def __init__(
        self,
        sentence_pairs,
        labels,
        batch_size=batch_size,
        shuffle=True,
        include_targets=True,
    ):
        self.sentence_pairs = sentence_pairs
        self.labels = labels
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.include_targets = include_targets
        # Load our BERT Tokenizer to encode the text.
        # We will use the bert-base-uncased pretrained model.
        self.tokenizer = transformers.BertTokenizer.from_pretrained(
            "bert-base-uncased", do_lower_case=True
        )
        self.indexes = np.arange(len(self.sentence_pairs))
        self.on_epoch_end()

    def __len__(self):
        # Denotes the number of batches per epoch.
        return len(self.sentence_pairs) // self.batch_size

    def __getitem__(self, idx):
        # Retrieves the batch of index.
        indexes = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        sentence_pairs = self.sentence_pairs[indexes]

        # With the BERT tokenizer's batch_encode_plus, both sentences of each pair
        # are encoded together and separated by the [SEP] token.
        encoded = self.tokenizer.batch_encode_plus(
            sentence_pairs.tolist(),
            add_special_tokens=True,
            max_length=max_length,
            return_attention_mask=True,
            return_token_type_ids=True,
            padding="max_length",  # replaces the deprecated pad_to_max_length=True
            truncation=True,  # cut pairs longer than max_length
            return_tensors="tf",
        )

        # Convert batch of encoded features to numpy array.
        input_ids = np.array(encoded["input_ids"], dtype="int32")
        attention_masks = np.array(encoded["attention_mask"], dtype="int32")
        token_type_ids = np.array(encoded["token_type_ids"], dtype="int32")

        # Set to true if data generator is used for training/validation.
        if self.include_targets:
            labels = np.array(self.labels[indexes], dtype="int32")
            return [input_ids, attention_masks, token_type_ids], labels
        else:
            return [input_ids, attention_masks, token_type_ids]

    def on_epoch_end(self):
        # Shuffle indexes after each epoch if shuffle is set to True.
        if self.shuffle:
            np.random.RandomState(42).shuffle(self.indexes)
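
The index arithmetic in `__len__` and `__getitem__` can be checked in isolation. The sketch below lists the slices the generator would produce for a given dataset size; the helper name `batch_slices` and the sample count of 150 are made up for illustration. It shows that floor division yields only full batches, so any remainder at the end of the data is never served.

```python
def batch_slices(num_samples, batch_size):
    """Return the (start, stop) slice of every batch the generator yields."""
    num_batches = num_samples // batch_size  # same floor division as __len__
    return [(i * batch_size, (i + 1) * batch_size) for i in range(num_batches)]


slices = batch_slices(num_samples=150, batch_size=64)
covered = slices[-1][1] if slices else 0
print(slices)         # [(0, 64), (64, 128)]
print(150 - covered)  # 22 samples are left out
```

With `batch_size=1`, as used later for single predictions, every sample is covered. With larger batch sizes, evaluation on a dataset whose size is not a multiple of the batch size skips the trailing samples.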


## Build the model

In [None]:
# Create the model under a distribution strategy scope.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Create a new model instance.
    input_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="input_ids"
    )
    # Attention masks indicate to the model which tokens should be attended to.
    attention_masks = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="attention_masks"
    )
    # Token type ids are binary masks identifying different sequences in the model.
    token_type_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="token_type_ids"
    )
    # Load the pretrained BERT model.
    bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
    # Freeze the BERT model to reuse the pretrained features without modifying them.
    bert_model.trainable = False

    bert_output = bert_model(
        input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids
    )
    sequence_output = bert_output.last_hidden_state
    pooled_output = bert_output.pooler_output
    # Add trainable layers on top of the frozen layers to adapt the pretrained features to the new data.
    bi_lstm = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    )(sequence_output)
    # Apply a hybrid pooling approach to the bi_lstm sequence output.
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
    concat = tf.keras.layers.concatenate([avg_pool, max_pool])
    dropout = tf.keras.layers.Dropout(0.3)(concat)
    output = tf.keras.layers.Dense(4, activation="softmax")(dropout)
    model = tf.keras.models.Model(
        inputs=[input_ids, attention_masks, token_type_ids], outputs=output
    )

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss="categorical_crossentropy",
        metrics=["acc"],
    )


print(f"Strategy: {strategy}")
model.summary()
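
The hybrid pooling step reduces the per-token outputs of the bidirectional LSTM to one fixed-size vector by taking both the mean and the maximum of every feature across the sequence and concatenating the two. The sketch below shows the operation on a toy sequence using plain lists; the function name `hybrid_pool` and the numbers are illustrative only.

```python
def hybrid_pool(sequence):
    """Concatenate average and max pooling over the time axis.

    sequence: list of timesteps, each a list of feature values.
    """
    columns = list(zip(*sequence))  # one tuple per feature, across timesteps
    avg_pool = [sum(col) / len(col) for col in columns]
    max_pool = [max(col) for col in columns]
    return avg_pool + max_pool


sequence = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]  # 3 timesteps, 2 features
pooled = hybrid_pool(sequence)
print(pooled)  # [2.0, 2.0, 3.0, 4.0]
```

In the model above, the bidirectional LSTM with 64 units per direction emits 128 features per token, so the concatenated vector passed to the dropout and dense layers has 256 entries.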

Create train and validation data generators

In [73]:
train_data = BertSemanticDataGenerator(
    train_df[["pattern", "desc"]].values.astype("str"),
    y_train,
    batch_size=batch_size,
    shuffle=True,
)
valid_data = BertSemanticDataGenerator(
    valid_df[["pattern", "desc"]].values.astype("str"),
    y_val,
    batch_size=batch_size,
    shuffle=False,
)

## Train the Model

Training updates only the top layers, so the frozen BERT model acts as a feature extractor: the new layers learn to classify using the representations of the pretrained model without changing them.

In [None]:
history = model.fit(
    train_data,
    validation_data=valid_data,
    epochs=epochs,
    use_multiprocessing=True,
    workers=-1,
)

## Fine-tuning

This step must be performed only after the feature-extraction model has been trained to convergence on the new data. It consists of unfreezing the BERT model and retraining it with a very low learning rate. This optional step incrementally adapts the pretrained features to the new data and can yield a meaningful improvement in performance.

In [None]:
# Unfreeze the bert_model.
bert_model.trainable = True
# Recompile the model to make the change effective.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()

### Train the entire model end-to-end

In [None]:
history = model.fit(
    train_data,
    validation_data=valid_data,
    epochs=epochs,
    use_multiprocessing=True,
    workers=-1,
)

## Evaluate model on the test set

In [None]:
test_data = BertSemanticDataGenerator(
    test_df[["pattern", "desc"]].values.astype("str"),
    y_test,
    batch_size=batch_size,
    shuffle=False,
)
model.evaluate(test_data, verbose=1)

## Save model weights

In [78]:
output_dir = '/content/drive/My Drive/Modello/Semantic'

# Save the weights
model.save_weights(output_dir)

## Load model weights

In [None]:
output_dir = '/content/drive/My Drive/Modello/Semantic'

# Create a new model instance
input_ids = tf.keras.layers.Input(
    shape=(max_length,), dtype=tf.int32, name="input_ids"
)
# Attention masks indicate to the model which tokens should be attended to.
attention_masks = tf.keras.layers.Input(
    shape=(max_length,), dtype=tf.int32, name="attention_masks"
)
# Token type ids are binary masks identifying different sequences in the model.
token_type_ids = tf.keras.layers.Input(
    shape=(max_length,), dtype=tf.int32, name="token_type_ids"
)
# Load the pretrained BERT model.
bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
# Freeze the BERT model to reuse the pretrained features without modifying them.
bert_model.trainable = False

bert_output = bert_model(
    input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids
)
sequence_output = bert_output.last_hidden_state
pooled_output = bert_output.pooler_output
# Add trainable layers on top of the frozen layers to adapt the pretrained features to the new data.
bi_lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True)
)(sequence_output)
# Apply a hybrid pooling approach to the bi_lstm sequence output.
avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
concat = tf.keras.layers.concatenate([avg_pool, max_pool])
dropout = tf.keras.layers.Dropout(0.3)(concat)
output = tf.keras.layers.Dense(4, activation="softmax")(dropout)
model = tf.keras.models.Model(
    inputs=[input_ids, attention_masks, token_type_ids], outputs=output
)

# Restore the weights
model.load_weights(output_dir)

## Prediction on the test set

In [80]:
def check_similarity(sentence1, sentence2):
    sentence_pairs = np.array([[str(sentence1), str(sentence2)]])
    test_data = BertSemanticDataGenerator(
        sentence_pairs, labels=None, batch_size=1, shuffle=False, include_targets=False,
    )

    proba = model.predict(test_data[0])[0]
    print(proba)
    idx = np.argmax(proba)
    proba = f"{proba[idx] * 100:.2f}%"  # softmax output is in [0, 1]; scale to a percentage
    pred = labels[idx]
    return pred, proba
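
The post-processing inside `check_similarity` is an argmax over the four softmax outputs followed by a lookup in `labels`. The sketch below isolates that step on a hand-written probability vector so it can be run without the model; the helper name `to_prediction` and the input values are made up, and the score is reported as a percentage.

```python
labels = ["cc", "ce", "ec", "ee"]


def to_prediction(probabilities):
    """Return (label, score) for the most probable class."""
    idx = max(range(len(probabilities)), key=probabilities.__getitem__)  # argmax
    return labels[idx], f"{probabilities[idx] * 100:.2f}%"


prediction = to_prediction([0.05, 0.10, 0.80, 0.05])
print(prediction)  # ('ec', '80.00%')
```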

In [None]:
results = []
probabilities = []

for i in range(len(test_df)):
  try:
    print(i)
    print(test_df.loc[i, 'desc'])
    pred, proba = check_similarity(test_df.loc[i, 'pattern'], test_df.loc[i, 'desc'])
    print("Predicted label: ", pred)
    print("True label: ", test_df.loc[i, 'similarity'])
    print(proba)
  except Exception as exc:
    # Report the cause and keep a placeholder so the lists stay aligned with test_df.
    print(f"Prediction Error: {exc}")
    pred, proba = None, None
  results.append(pred)
  probabilities.append(proba)

In [90]:
test_error = pd.DataFrame({'true_label': test_df['similarity'], 'result': results, 'probability': probabilities})

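test_error.to_csv('test_semantic_results.csv')
# Make sure the target directory exists; otherwise cp creates a file named "Results".
!mkdir -p Results
!cp test_semantic_results.csv "Results"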

In [None]:
test_path = 'Results/test_semantic_results.csv'

col_names = ['true_label','result','probability']
test_error = pd.read_csv(test_path,skiprows=1,sep=',',names=col_names,encoding = "ISO-8859-1")

test_error

In [None]:
data = pd.DataFrame({'prediction':test_error['result'], 'true_label':test_error['true_label']})

# precision tp / (tp + fp)
precision = precision_score(data['true_label'], data['prediction'], average = 'macro')
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(data['true_label'], data['prediction'], average = 'macro')
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(data['true_label'], data['prediction'], average = 'macro')
print('F1 score: %f' % f1)
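
To make `average='macro'` concrete: each metric is computed per class from that class's true positives, false positives, and false negatives, and the per-class values are then averaged with equal weight. The sketch below does this by hand on a four-sample toy example; the function name `macro_scores` and the data are illustrative, and a class with an empty denominator is scored as 0.

```python
def macro_scores(y_true, y_pred, classes):
    """Return macro-averaged (precision, recall, f1) over the given classes."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true = ["cc", "cc", "ee", "ee"]
y_pred = ["cc", "ee", "ee", "ee"]
precision, recall, f1 = macro_scores(y_true, y_pred, classes=["cc", "ee"])
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.8333 0.7500 0.7333
```

Because every class counts equally, a rare class that is predicted poorly lowers the macro average as much as a frequent one would.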

In [None]:
print("Classification report for classifier:\n%s\n"
      % (metrics.classification_report(data['true_label'], data['prediction'])))

In [None]:
confusion_matrix = pd.crosstab(data['true_label'], data['prediction'], rownames=['Target Class'], colnames=['Output Class'])

sn.set(font_scale=1.1) # for label size
sn.heatmap(confusion_matrix, annot=True, fmt=".0f", annot_kws={"size": 13}, cmap='Blues')
plt.show()
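
The table behind the heatmap is a count of (true label, predicted label) pairs. The sketch below builds the same kind of table with the standard library on toy data; the function name `confusion_counts` is illustrative. Unlike `pd.crosstab`, which only includes labels that actually occur, it always produces a full grid over the classes passed in.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


y_true = ["cc", "cc", "ee", "ee"]
y_pred = ["cc", "ee", "ee", "ee"]
matrix = confusion_counts(y_true, y_pred, classes=["cc", "ee"])
print(matrix)  # [[1, 1], [0, 2]]
```

Correct predictions lie on the diagonal; each off-diagonal cell counts one specific kind of confusion between two classes.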