This replication package contains a Jupyter notebook that guides researchers through re-running the experiment when needed.
It is organized into sections, each with a description that explains what the cells do.

# Clone the full repo to copy additional Python files

If running on Google Colab, this cell fetches all additional code from the GitHub repository that is not automatically included by Colab.

In [3]:
# clone repo and move to current working dir
!git clone https://github.com/evaluating-effectiveness-cloud-nlp/replication_package.git repo
!rsync -av repo/ .
!rm -rf repo

# Installing dependencies with pip

Installs all dependencies used in the experiment from the `requirements.txt` file downloaded in the previous cell. This may take a while to run.

In [None]:
# installs dependencies
%pip install -r requirements.txt

# Providers Credentials

User interface for entering the credentials required by the Cloud NLP services used in the experiment.

> For Google, you need to obtain the `google-credentials.json` file with all credentials and upload it when the cell executes.

This cell creates a `credentials.py` file in the same directory as the *notebook*.

In [8]:
import os
from google.colab import files

# @markdown Microsoft
azure_key_1 = '' # @param {type:"string"}
azure_key_2 = '' # @param {type:"string"}
azure_location = '' # @param {type:"string"}
azure_endpoint = '' # @param {type:"string"}

# @markdown Amazon
aws_access_key_id='' # @param {type:"string"}
aws_secret_access_key='' # @param {type:"string"}

# @markdown Google
# @markdown >*You will need to upload `google-credentials.json` file on runtime as described in: https://developers.google.com/workspace/guides/create-credentials?hl=pt-br#create_credentials_for_a_service_account*
print('Please upload "google-credentials.json" file')
google_credentials_file = files.upload()

file_name = list(google_credentials_file.keys())[0]

## write credentials to a credentials.py file
file_content = f"""
import os

# Microsoft
azure_key_1 = '{azure_key_1}'
azure_key_2 = '{azure_key_2}'
azure_location = '{azure_location}'
azure_endpoint = '{azure_endpoint}'

# Amazon
aws_access_key_id = '{aws_access_key_id}'
aws_secret_access_key = '{aws_secret_access_key}'

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./{file_name}"
"""

file_name = "credentials.py"

# Write the content to the file
with open(file_name, "w") as f:
    f.write(file_content)


## Download the pre-trained ``glove.twitter`` word embedding model

Automatically downloads the word embedding model needed to apply the *WordEmbeddings* noise.

If you don't want to use it, just remove it from the noise list and don't run this cell.
> Note: The file has been placed in a personal repository only for ease of download. The original model is available as a *.zip file at: https://github.com/stanfordnlp/GloVe

In [None]:
!python -m pip install ipywidgets
import urllib.request
from os.path import exists
import ipywidgets as widgets
from IPython.display import display
import os

progress = None
def show_progress(block_num, block_size, total_size):
    global progress
    if not progress :
        progress = widgets.FloatProgress(
            value=0,
            min=0,
            max=total_size,
            step=0.1,
            description='Downloading',
            bar_style='info',
            orientation='horizontal'
        )
        display(progress)
        
    downloaded = (block_num * block_size)
    print(block_num * block_size, "/", total_size,"\r", end="")
    
    progress.value = downloaded

model_path = "models/glove.twitter.27B.100d.txt"
word_embedding_url = "https://huggingface.co/anonymoususer/fault_injection_mlaas/resolve/main/glove.twitter.27B.100d.txt"

file_exists = exists(model_path)

if file_exists :
    print("file ", model_path, " already exists.")
else:
    filename = "models"
    os.makedirs(filename, exist_ok=True)
    urllib.request.urlretrieve(word_embedding_url, model_path, show_progress)
    print("File downloaded!")
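Once the download finishes, it can be useful to check that the file parses as expected. The following is a minimal sketch, assuming the standard GloVe text layout (one token per line followed by space-separated floats). The helper `load_glove_vectors` and the stand-in file are hypothetical and are not part of the replication package.

```python
import tempfile
from pathlib import Path


def load_glove_vectors(path, limit=None):
    """Parse a GloVe-style text file into a {token: vector} dictionary.

    Hypothetical helper: reads at most `limit` lines when a limit is given.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if limit is not None and line_number >= limit:
                break
            token, *values = line.rstrip("\n").split(" ")
            vectors[token] = [float(value) for value in values]
    return vectors


# Tiny stand-in file so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "glove.sample.txt"
    sample_path.write_text(
        "flight 0.1 -0.2 0.3\ndelay 0.4 0.5 -0.6\n", encoding="utf-8"
    )
    sample_vectors = load_glove_vectors(sample_path)

print(len(sample_vectors), len(sample_vectors["flight"]))
```

Against the real file, a call such as `load_glove_vectors("models/glove.twitter.27B.100d.txt", limit=1000)` should return vectors with 100 values each, as the `100d` in the file name suggests.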

## Importing MLaaS providers implementations

Imports the *mlaas_providers* module (aliased as *ml_providers*), which implements the Cloud NLP Services *Amazon*, *Google*, and *Microsoft*.

In [2]:
from mlaas_providers import providers as ml_providers

# `RQ1`: How effective are the Cloud NLP services when subjected to noise?

## Importing additional Python modules

Imports modules related to noise insertion, cloud providers implementation, data processing, and visualization.

In [3]:
from datetime import datetime
from typing import List
from mlaas_providers.providers import read_dataset
from noise_insertion.utils import save_data_to_file
from data_sampling.data_sampling import DataSampling
from noise_insertion.percent_insertion import noises
from noise_insertion import noise_insertion
from visualization import visualization
from progress import progress_manager
from metrics import metrics
import ipywidgets as widgets
data_sampling = DataSampling()


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/julianoro/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/julianoro/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/julianoro/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /home/julianoro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Parameters

Defines *sample size*, *noise algorithms* used, and *noise levels*.

In [4]:
sample_size = 99

noise_list =[
    noises.Keyboard,
    noises.OCR,
    noises.RandomCharReplace,
    noises.CharSwap,
    noises.WordSwap,
    noises.WordSplit,
    noises.Antonym,
    noises.Synonym,
    noises.Spelling,
    noises.TfIdfWord,
    noises.WordEmbeddings,
    noises.ContextualWordEmbs,
]

noise_level=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
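Before running the experiment with these parameters, it may help to estimate how many provider requests they imply. The sketch below is a rough calculation under two assumptions that the notebook does not state: each sentence is sent as one request, and each provider is evaluated once on the noise-free sample.

```python
# Values taken from the parameters cell above.
noise_algorithms = 12  # entries in noise_list
noise_levels = 9       # 0.1 through 0.9
providers = 3          # Amazon, Google, Microsoft
sample_size = 99

# One noisy dataset per (algorithm, level) pair.
noisy_datasets = noise_algorithms * noise_levels

# Assumption: one extra noise-free baseline dataset per provider.
datasets_per_provider = noisy_datasets + 1

# Assumption: one request per sentence.
requests_per_provider = datasets_per_provider * sample_size
total_requests = requests_per_provider * providers

print(noisy_datasets, requests_per_provider, total_requests)
```

Under these assumptions the run issues roughly 32,000 requests in total, which is worth checking against each provider's quota and pricing before starting.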

## Running the experiment

Executes the pipeline defined in section *2.3. Evaluation Process* of our paper by orchestrating the imported modules.

Also, all summarized results and visualizations are created here.

We follow these steps:

- **Data sampling**: Initially, we extract a sample from the dataset Twitter US Airline Sentiment aiming to create a balanced dataset containing the same number of instances classified as positive, negative, and neutral;
- **Oracle**: After creating the balanced dataset, we use the F-measure to evaluate the effectiveness of the Cloud NLP services on this dataset;
- **Noise generation**: At this step, we use the tool nlpaug to produce different datasets containing sentences changed according to different levels of noise;
- **Noise Influence**: In this step, we evaluate the effectiveness of the Cloud NLP services by using the datasets with noise. 
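The data sampling step above can be sketched as follows. This is only an illustration of balanced sampling, not the implementation behind `DataSampling.get_dataset_sample`: the helper `balanced_sample`, the column name `airline_sentiment`, and the synthetic rows are assumptions.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(rows, label_key, sample_size, seed=0):
    """Draw the same number of rows from every class (hypothetical helper)."""
    per_class = defaultdict(list)
    for row in rows:
        per_class[row[label_key]].append(row)

    # Equal quota per class, e.g. 99 // 3 = 33 for three sentiment classes.
    quota = sample_size // len(per_class)
    rng = random.Random(seed)

    sample = []
    for label in sorted(per_class):
        sample.extend(rng.sample(per_class[label], quota))
    return sample


# Synthetic, imbalanced stand-in for the tweet dataset.
labels = ["positive"] * 50 + ["negative"] * 120 + ["neutral"] * 60
rows = [
    {"text": f"tweet {index}", "airline_sentiment": label}
    for index, label in enumerate(labels)
]

sample = balanced_sample(rows, "airline_sentiment", sample_size=99)
class_counts = Counter(row["airline_sentiment"] for row in sample)
print(class_counts)
```

With a sample size of 99 and three classes, each class contributes 33 instances, which matches the balanced dataset described above.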

The raw results of the RQ1 experiment are stored in a new directory inside the `outputs/experiment1` folder.

> If an error occurs during execution, you can resume where you left off by entering the name of the directory created during that execution below. For example, `size99_07-12-2022 09_34_29` is our last full experiment execution.

In [6]:
# @markdown ### Type the name of a /outputs/experiment1 folder if you want to continue from a previous execution:
continue_from = "size99_07-12-2022 09_34_29" # @param {type:"string"}

def get_main_path(size):
    now = datetime.now()
    timestamp = now.strftime("%m-%d-%Y %H_%M_%S")
    main_dir = './outputs/experiment1/size'+str(size)+'_' + timestamp
    return main_dir

def run_evaluation(sample_size: int,
                  noise_levels: List[float] = [0.1, 0.15, 0.2, 0.25, 0.3],
                  noise_algorithms=[noises.no_noise, noises.RandomCharReplace, noises.Keyboard, noises.OCR],
                  mlaas_providers=[ml_providers.google],
                  continue_from=None):
    if(continue_from):
        main_path = './outputs/experiment1/'+continue_from
        progress = progress_manager.load_progress(main_path)
        x_dataset = read_dataset(main_path + '/data' + "/dataset.xlsx")
        y_labels = read_dataset(main_path + '/data' + "/labels.xlsx")
    else:
        x_dataset, y_labels = data_sampling.get_dataset_sample('./Tweets_dataset.csv', sample_size)
        main_path = get_main_path(len(x_dataset))
        save_data_to_file(x_dataset, main_path + '/data', "dataset")
        save_data_to_file(y_labels, main_path + '/data', "labels")
        
        progress = progress_manager.init_progress(main_path, noise_algorithms, noise_levels, mlaas_providers)
    print("Results will be stored at: ", main_path)
    print('Generating noise...')
    progress = noise_insertion.generate_noised_data(x_dataset, main_path)

    print('Getting predictions from providers...')
    progress = ml_providers.get_prediction_results(main_path)

    print('Calculating metrics...')
    metrics_results = metrics.metrics(progress, y_labels, main_path)

    noise_list = [0.0]
    noise_list.extend(noise_levels)

    summary_table = visualization.plot_results(metrics_results, main_path + '/results', noise_list)

    print("Results were saved to:", main_path)

    return summary_table

results_table = run_evaluation(
    sample_size,
    noise_levels=noise_level,
    noise_algorithms=noise_list,
    mlaas_providers=[ml_providers.amazon, ml_providers.microsoft, ml_providers.google],
    continue_from=continue_from
)

Results will be stored at:  ./outputs/experiment1/size99_07-12-2022 09_34_29
Generating noise...
- Keyboard
-- 
- OCR
-- 
- RandomCharReplace
-- 
- CharSwap
-- 
- WordSwap
-- 
- WordSplit
-- 
- Antonym
-- 
- Synonym
-- 
- Spelling
-- 
- TfIdfWord
-- 
- WordEmbeddings
-- 
- ContextualWordEmbs
-- 
Getting predictions from providers...
- google
-- Keyboard
--- 
-- OCR
--- 
-- RandomCharReplace
--- 
-- CharSwap
--- 
-- WordSwap
--- 
-- WordSplit
--- 
-- Antonym
--- 
-- Synonym
--- 
-- Spelling
--- 
-- TfIdfWord
--- 
-- WordEmbeddings
--- 
-- ContextualWordEmbs
--- 
- microsoft
-- Keyboard
--- 
-- OCR
--- 
-- RandomCharReplace
--- 
-- CharSwap
--- 
-- WordSwap
--- 
-- WordSplit
--- 
-- Antonym
--- 
-- Synonym
--- 
-- Spelling
--- 
-- TfIdfWord
--- 
-- WordEmbeddings
--- 
-- ContextualWordEmbs
--- 
- amazon
-- Keyboard
--- 
-- OCR
--- 
-- RandomCharReplace
--- 
-- CharSwap
--- 
-- WordSwap
--- 
-- WordSplit
--- 
-- Antonym
--- 
-- Synonym
--- 
-- Spelling
--- 
-- TfIdfWord
--- 
-- WordEmbedd

# Results - Table 3

The table below summarizes the results of the experiment.
> **RQ1 - F-Measure variation according to Noise Level**

This table presents the effectiveness of the Cloud NLP services when subjected to the specified noise levels. The first column lists the noises analyzed in our study, the second column lists the provider of the Cloud NLP service to which the noise is applied (Amazon, Google, or Microsoft), and the remaining columns report the F-measure of that service at each noise level. We use a color scale varying from yellow to red to represent the influence of the noise level on the effectiveness of the Cloud NLP services. The lower the effectiveness, the greater the influence of noise.

In [7]:
results_table

| Noise | Provider | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Keyboard | Amazon | 0.72 | 0.58 | 0.35 | 0.24 | 0.18 | 0.21 | 0.19 | 0.17 | 0.16 | 0.17 |
| Keyboard | Google | 0.69 | 0.62 | 0.46 | 0.41 | 0.33 | 0.36 | 0.28 | 0.23 | 0.28 | 0.34 |
| Keyboard | Microsoft | 0.76 | 0.66 | 0.34 | 0.18 | 0.19 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |
| OCR | Amazon | 0.72 | 0.47 | 0.27 | 0.21 | 0.22 | 0.19 | 0.2 | 0.23 | 0.22 | 0.24 |
| OCR | Google | 0.69 | 0.59 | 0.45 | 0.44 | 0.41 | 0.4 | 0.39 | 0.4 | 0.45 | 0.44 |
| OCR | Microsoft | 0.76 | 0.59 | 0.32 | 0.21 | 0.17 | 0.19 | 0.19 | 0.19 | 0.19 | 0.17 |
| RandomCharReplace | Amazon | 0.72 | 0.53 | 0.3 | 0.24 | 0.22 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |
| RandomCharReplace | Google | 0.69 | 0.54 | 0.52 | 0.33 | 0.22 | 0.32 | 0.26 | 0.21 | 0.26 | 0.21 |
| RandomCharReplace | Microsoft | 0.76 | 0.62 | 0.44 | 0.34 | 0.2 | 0.19 | 0.16 | 0.17 | 0.19 | 0.16 |
| CharSwap | Amazon | 0.72 | 0.61 | 0.54 | 0.4 | 0.44 | 0.3 | 0.37 | 0.29 | 0.24 | 0.29 |

# Reusable Modules

Additionally, some modules can help researchers interested in extending this study or performing similar experiments.

**mlaas_providers**: Abstracts the code needed to use each Cloud Provider into functions named after each provider. It also contains the orchestration required to load the progress file and, given the input, get the prediction for all specified Cloud Providers (function *get_prediction_results*).

**noise_insertion**: Encapsulates the noise insertion algorithms. Although based on nlpaug, the algorithms are adapted for easier use and parameterization, so that the number of modified characters is predictable. The *noise_insertion.percent_insertion* submodule sets the percentage of total characters to be altered, while *noise_insertion.unit_insertion* specifies a fixed number of characters.
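The idea of altering a fixed percentage of characters can be illustrated with a small sketch. It uses neither nlpaug nor the module's own code: `replace_percent_of_chars` is a hypothetical helper that replaces randomly chosen positions with a different lowercase letter, so the number of changed characters is known in advance.

```python
import random
import string


def replace_percent_of_chars(text, percent, seed=0):
    """Replace round(len(text) * percent) characters with random letters.

    Hypothetical helper; whitespace positions may be replaced as well.
    """
    rng = random.Random(seed)
    count = round(len(text) * percent)
    positions = rng.sample(range(len(text)), count)

    chars = list(text)
    for position in positions:
        # Exclude the current character so every chosen position really changes.
        candidates = [c for c in string.ascii_lowercase if c != chars[position]]
        chars[position] = rng.choice(candidates)
    return "".join(chars)


original = "the flight was delayed again"
noisy = replace_percent_of_chars(original, 0.25)
changed = sum(a != b for a, b in zip(original, noisy))
print(noisy, changed)
```

Because the replacement is never equal to the original character, a 25% level on this 28-character sentence changes exactly 7 characters.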

**visualization**: Contains functions related to the generation of graphs and tables used in the experiment.

**progress_manager**: Includes functions for saving and loading progress. Progress is saved as a *.json* file and *.xlsx* files to prevent the need for experiment restarts due to network errors.
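A checkpoint of this kind can be sketched with a plain JSON file. The structure below (one boolean per provider, noise, and level) and the helpers `save_checkpoint`, `load_checkpoint`, and `pending_tasks` are assumptions made for illustration; the actual file layout used by *progress_manager* may differ.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    """Write the task-completion state to a JSON file (hypothetical helper)."""
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_checkpoint(path):
    """Read the task-completion state back from disk (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def pending_tasks(state):
    """Return the tasks that still need to run after a restart."""
    return [task for task, done in state.items() if not done]


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = Path(tmp_dir) / "progress.json"
    state = {
        "google/Keyboard/0.1": True,   # finished before the interruption
        "google/Keyboard/0.2": False,  # still to do
        "amazon/OCR/0.1": False,       # still to do
    }
    save_checkpoint(checkpoint_path, state)
    restored = load_checkpoint(checkpoint_path)

remaining = pending_tasks(restored)
print(remaining)
```

After a network error, only the tasks returned by `pending_tasks` would need to be sent to the providers again.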

**metrics**: Includes functions for calculating the *F-measure* and additional metrics.
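For readers unfamiliar with the metric, the following is a minimal sketch of an F-measure over three sentiment classes. It assumes macro-averaging (an unweighted mean of the per-class scores), which the notebook does not confirm, and `macro_f_measure` is a hypothetical helper rather than the function in the *metrics* module.

```python
def macro_f_measure(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "positive", "positive"]
score = macro_f_measure(y_true, y_pred)
print(round(score, 2))
```

In this toy example the per-class scores are 0.8 (positive), 1.0 (negative), and 0.0 (neutral), so the macro average is 0.6.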

# Appendix

The graphs presented in the document are a series of bar charts illustrating the variation in the F-Measure metric for different types of noise applied to cloud NLP services provided by Amazon, Google, and Microsoft. Each graph represents a specific type of noise, such as Antonym, Word Embeddings, Word Split, Word Swap, Char Swap, Contextual Word Embeds, Keyboard, OCR, Random Char Replace, Spelling, and Synonym.

The graphs share the following common features:

- The x-axis represents noise levels from 0% to 90%, indicating the degree of disturbance inserted into the text.
- The y-axis shows the F-Measure, a metric that combines precision and recall to evaluate the effectiveness of NLP services under noisy conditions.
- Each graph includes three data series, each corresponding to one of the service providers (Amazon, Google, Microsoft), allowing for a direct comparison among them under the same noise conditions.

These graphs show how different types of noise affect the effectiveness of the cloud NLP services and which provider maintains better performance under adverse conditions. Additionally, the spreadsheet [grouped_data_new.xlsx](assets/grouped_data_new.xlsx) contains the data needed to generate additional graphs, so that readers can explore noise impacts further and compare the services in scenarios not covered by the initial set of graphs.

![image](assets/Keyboard.png)

![image](assets/OCR.png)

![image](assets/RandomCharReplace.png)

![image](assets/CharSwap.png)

![image](assets/WordSwap.png)

![image](assets/WordSplit.png)

![image](assets/Antonym.png)

![image](assets/Synonym.png)

![image](assets/Spelling.png)

![image](assets/TfIdfWord.png)

![image](assets/WordEmbeddings.png)

![image](assets/ContextualWordEmbs.png)

