# Air Paradis : Detect bad buzz with deep learning

## Context

"Air Paradis" is an airline company whose marketing department wants to quickly detect "bad buzz" on social networks, in order to anticipate and address issues as fast as possible. They need an AI API that can detect "bad buzz" and predict the reason for it.

The goal here is to evaluate different approaches to detect "bad buzz" :

1. [Baseline Model : Logistic Regression](1_baseline.ipynb)
2. [Word embedding : Gensim Doc2Vec](2_word_embedding.ipynb)
3. [Azure Cognitive Services : Text Analytics API](3_azure_sentiment_analysis.ipynb)
4. [HuggingFace Transformer Pipeline : Sentiment Analysis](4_huggingface_sentiment_analysis.ipynb)
5. [HuggingFace : BERT Fine-tuning](5_huggingface_bert_fine_tuning.ipynb)
6. [AzureML Studio : Automated ML](6_azureml_automated_ml.ipynb)
7. [AzureML Studio : Designer](7_azureml_designer.ipynb)
8. [Custom Models : Neural Networks with Keras](8_keras_neural_networks.ipynb)
9. [AzureML Studio : Notebooks](9_azureml_notebooks.ipynb)

After exploring our dataset, we will compare the different approaches.

## Project modules

The helper functions and project-specific code are placed in `../src/`.

We will use the [Python](https://www.python.org/about/gettingstarted/) programming language, and present the code and results in this [JupyterLab notebook](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).

We will use the usual libraries for data exploration, modeling and visualisation :

- [NumPy](https://numpy.org/doc/stable/user/quickstart.html) and [Pandas](https://pandas.pydata.org/docs/user_guide/index.html) : for maths (stats, algebra, ...) and large data manipulation
- [Plotly](https://plotly.com/python/getting-started/) : for interactive data visualization

We will also use libraries specific to the goals of this project :

- Natural Language Processing (NLP)
  - [NLTK](https://www.nltk.org/) and [spaCy](https://spacy.io/api) : for text processing


In [None]:
# Standard library
import os
import pickle
import sys

# Make the custom helper libraries in ../src importable

src_path = os.path.abspath(os.path.join("../src"))
if src_path not in sys.path:
    sys.path.append(src_path)

import data.helpers as data_helpers
import visualization.helpers as viz_helpers

# Maths modules
from scipy.stats import f_oneway
import pandas as pd

# Viz modules
import plotly.express as px

# Render for export
import plotly.io as pio

pio.renderers.default = "notebook"

## Exploratory data analysis (EDA)

We are going to load the data and analyse the distribution of each variable.


### Load data

Let's download the [Sentiment140 dataset with 1.6 million tweets](https://www.kaggle.com/kazanova/sentiment140) from Kaggle.


In [None]:
# Download and unzip CSV files
!cd .. && make dataset && cd notebooks

Now we can load the data.


In [None]:
# Load data from CSV
df = pd.read_csv(
    os.path.join("..", "data", "raw", "training.1600000.processed.noemoticon.csv"),
    names=["target", "id", "date", "flag", "user", "text"],
)

# Reduce memory usage
df = data_helpers.reduce_dataframe_memory_usage(df)

### Explore data

Let's display a few examples and find out how many data points are available, what the variables are and how they are distributed.


In [None]:
# Display first few rows
df.head(5)

In [None]:
# Display number of rows and column types
df.info()

There are _1600000_ rows, each composed of _6_ columns :

- _target_: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- _id_: the id of the tweet (2087)
- _date_: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- _flag_: the query (lyx). If there is no query, then this value is NO_QUERY.
- _user_: the user that tweeted (robotickilldozr)
- _text_: the text of the tweet (Lyx is cool)

We are only interested in the _target_ and _text_ variables. The other columns are not useful for our analysis.


In [None]:
# Drop useless columns
df.drop(columns=["id", "date", "flag", "user"], inplace=True)

# Replace target values with labels
df.target = df.target.map(
    {
        0: "NEGATIVE",
        2: "NEUTRAL",
        4: "POSITIVE",
    }
)

In [None]:
# Display basic statistics
df.describe(include="all")

In [None]:
# Plot target distribution
viz_helpers.histogram(
    df, label_x="target", label_colour="target", title="Target distribution"
)

There are exactly as many (800000) _POSITIVE_ tweets as _NEGATIVE_ tweets. There are no _NEUTRAL_ tweets.
The classes are perfectly balanced, so class imbalance will not bias our models towards one class during training.
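
The balance check can be sketched in plain Python. This is only an illustration: `class_balance` and the toy labels are hypothetical and stand in for the real `df.target` column. The function returns each class's share of the data and the accuracy a classifier would get by always predicting the majority class, which is the floor any model has to beat.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of a non-empty label list and the majority-class accuracy."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    # Always predicting the most frequent class yields this accuracy
    return shares, max(shares.values())


# Toy labels standing in for df.target (hypothetical data, not the real dataset)
toy_labels = ["NEGATIVE"] * 5 + ["POSITIVE"] * 5
shares, baseline_accuracy = class_balance(toy_labels)
print(shares, baseline_accuracy)
```

With two perfectly balanced classes the majority-class baseline is 0.5, so accuracy is directly interpretable here.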


In [None]:
# Plot text length distribution
df["text_length"] = df.text.str.len()

p_value = f_oneway(
    df.loc[df["target"] == "NEGATIVE", "text_length"],
    df.loc[df["target"] == "POSITIVE", "text_length"],
)[1]

viz_helpers.histogram(
    df,
    label_x="text_length",
    label_colour="target",
    title=f"Text length distribution / p-value={p_value:.5f}",
    include_boxplot=True,
)

There is no big difference between the _POSITIVE_ and _NEGATIVE_ tweets, but _NEGATIVE_ tweets are slightly longer than _POSITIVE_ tweets.
In both classes, there are two modes : _~45_ characters and _138_ characters (close to the 140-character limit that applied to tweets at the time).
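
The p-value shown in the plot title comes from `f_oneway`, which performs a one-way ANOVA. With only two groups, this test is equivalent to Student's t-test with pooled variance, since the F statistic equals the square of the t statistic. The sketch below computes both statistics in plain Python to make that relationship concrete. The function names and the sample lengths are made up for illustration, and turning the statistic into a p-value still requires the F distribution, which SciPy provides.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return the one-way ANOVA F statistic for two or more groups of numbers."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]
    # Variability of the group means around the grand mean
    ss_between = sum(
        len(group) * (group_mean - grand_mean) ** 2
        for group, group_mean in zip(groups, group_means)
    )
    # Variability of the observations around their own group mean
    ss_within = sum(
        (value - group_mean) ** 2
        for group, group_mean in zip(groups, group_means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def pooled_t(group_a, group_b):
    """Return Student's t statistic (pooled variance) for two groups."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    ss = sum((v - mean_a) ** 2 for v in group_a) + sum(
        (v - mean_b) ** 2 for v in group_b
    )
    pooled_variance = ss / (len(group_a) + len(group_b) - 2)
    standard_error = (pooled_variance * (1 / len(group_a) + 1 / len(group_b))) ** 0.5
    return (mean_a - mean_b) / standard_error


# Made-up text lengths, for illustration only (not taken from the dataset)
negative_lengths = [40, 50, 60, 70]
positive_lengths = [35, 45, 55, 65]

f_statistic = one_way_anova_f(negative_lengths, positive_lengths)
t_statistic = pooled_t(negative_lengths, positive_lengths)
print(f_statistic, t_statistic ** 2)  # the two values agree up to rounding
```

With 800000 tweets per class, even a very small difference in means can produce a tiny p-value, so the p-value should be read alongside the size of the difference visible in the box plots.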


In [None]:
# Plot word count distribution
df["word_count"] = df.text.str.split().str.len()

p_value = f_oneway(
    df.loc[df["target"] == "NEGATIVE", "word_count"],
    df.loc[df["target"] == "POSITIVE", "word_count"],
)[1]

viz_helpers.histogram(
    df,
    label_x="word_count",
    label_colour="target",
    title=f"Word count distribution / p-value={p_value:.5f}",
    include_boxplot=True,
)

There is no big difference between the _POSITIVE_ and _NEGATIVE_ tweets, but _NEGATIVE_ tweets contain significantly more words than _POSITIVE_ tweets.
In both classes, there are two modes : _~7_ words and _~20_ words.


#### Text analysis

We will look in more detail at what the _text_ variable contains.

First, we will transform the dataset into a Bag of Words (BoW) representation with TF-IDF (Term Frequency - Inverse Document Frequency) weights.
To achieve this, we are going to use the spaCy tokenizer.
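
As a reminder of what these weights represent, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document. The `tfidf` function and the three toy documents are hypothetical and only illustrate the computation. The notebook itself relies on scikit-learn.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute L2-normalised TF-IDF weights for a list of tokenised documents."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency (scikit-learn's default formula)
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()
    }
    weights = []
    for doc in documents:
        # Term frequency (raw count) multiplied by the term's IDF
        raw = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()} if norm else {})
    return weights


# Toy, already tokenised documents (hypothetical examples)
docs = [["flight", "delay", "bad"], ["flight", "great"], ["bad", "service", "bad"]]
weights = tfidf(docs)
print(weights[0])
```

In the first document, "delay" appears in only one document of the corpus and therefore receives a higher weight than "flight" or "bad", which each appear in two.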

In [None]:
# Vectorizers
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenizers, Stemmers and Lemmatizers
import nltk
from nltk.corpus import stopwords as nltk_stopwords
import spacy

# Download resources
nltk.download("stopwords")
# Keep the stop words as a list: scikit-learn documents `stop_words` as a list
stopwords = nltk_stopwords.words("english")

# Load the spaCy model, downloading it first if it is not installed
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")


# Define tokenizer
def tokenizer(text):
    """Return the lowercased lemmas of alphabetic, non-stop-word tokens (spaCy lemmatizer)."""
    return [
        token.lemma_.lower()
        for token in nlp(text)
        if token.is_alpha and not token.is_stop
    ]

In [None]:
# Processed data path
processed_data_path = os.path.join("..", "data", "processed")
vectorized_dataset_file_path = os.path.join(
    processed_data_path, "tfidf_spacy_dataset.pkl"
)
vocabulary_file_path = os.path.join(processed_data_path, "tfidf_spacy_vocabulary.pkl")

if os.path.exists(vectorized_dataset_file_path) and os.path.exists(
    vocabulary_file_path
):
    # Load vectorized dataset
    with open(vectorized_dataset_file_path, "rb") as f:
        X = pickle.load(f)
    # Load vocabulary
    with open(vocabulary_file_path, "rb") as f:
        vocabulary = pickle.load(f)
else:
    # Define vectorizer
    vectorizer = TfidfVectorizer(
        strip_accents="unicode",
        lowercase=True,
        stop_words=stopwords,
        tokenizer=tokenizer,
    )

    # Vectorize text
    X = vectorizer.fit_transform(df.text)

    # Get vocabulary
    vocabulary = vectorizer.get_feature_names_out()

    # Save vectorized dataset as pickle (create the output directory if needed)
    os.makedirs(processed_data_path, exist_ok=True)
    with open(vectorized_dataset_file_path, "wb") as f:
        pickle.dump(X, f)

    # Save vocabulary as pickle
    with open(vocabulary_file_path, "wb") as f:
        pickle.dump(vocabulary, f)
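
The cell above follows a "load from cache, otherwise compute and save" pattern, so the slow vectorization only runs once. A minimal generic sketch of that pattern is given below. `load_or_compute` and `expensive` are hypothetical names introduced for illustration and are not part of the project's helpers.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Return the object pickled at cache_path, computing and caching it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    # Create the cache directory on first use
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive():
    """Stand-in for a slow computation such as vectorizing the corpus."""
    calls.append(1)
    return {"vocabulary": ["bad", "good"]}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed", "demo.pkl")
    first = load_or_compute(path, expensive)  # computes and writes the cache
    second = load_or_compute(path, expensive)  # reads the cache

print(first == second, len(calls))
```

Note that pickle files should only be loaded from trusted sources, and that a stale cache must be deleted by hand whenever the tokenizer or vectorizer settings change.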

Our corpus is now transformed into a BoW representation. We can now analyse the word frequencies.

In [None]:
# List words TF-IDF scores
words = pd.Series(X.sum(axis=0).A1, index=vocabulary)

# Top 20 tokens by TfIdf
top_20_words = words.nlargest(20).sort_values(ascending=False)

# Plot top 20 tokens by TfIdf
fig = px.bar(
    top_20_words,
    x=top_20_words.index,
    y=top_20_words.values,
    labels={"x": "Words", "y": "TF-IDF score", "color": "TF-IDF score"},
    title=f"Top 20 important words (Tf-Idf) - Vocabulary size: {len(vocabulary)}",
    color=top_20_words.values,
)
fig.show()

We can see that the most important words are actually meaningful and relevant to the sentiment associated with each message.


## Models comparison



In [30]:
models_results_df = pd.DataFrame(
    data=[
        {
            "Model": "1 - Logistic Regression",
            "Sampling": 1,
            "True Positives": 108688,
            "True Negatives": 107510,
            "False Positives": 52490,
            "False Negatives": 51312,
            "Average Precision": 0.73,
            "ROC AUC": 0.74,
        },
        {
            "Model": "2 - Word embedding",
            "Sampling": 1,
            "True Positives": 119134,
            "True Negatives": 105070,
            "False Positives": 54930,
            "False Negatives": 40866,
            "Average Precision": 0.75,
            "ROC AUC": 0.77,
        },
        {
            "Model": "3.1 - Azure Cognitive Service API",
            "Sampling": 1,
            "True Positives": 788,
            "True Negatives": 673,
            "False Positives": 327,
            "False Negatives": 212,
            "Average Precision": 0.75,
            "ROC AUC": 0.78,
        },
        {
            "Model": "3.2 - Logistic Regression on Azure Cognitive Service",
            "Sampling": 2000 / 1600000,
            "True Positives": 164,
            "True Negatives": 123,
            "False Positives": 77,
            "False Negatives": 36,
            "Average Precision": 0.76,
            "ROC AUC": 0.78,
        },
        {
            "Model": "4 - HuggingFace Sentiment Analysis",
            "Sampling": 1,
            "True Positives": 622,
            "True Negatives": 798,
            "False Positives": 202,
            "False Negatives": 378,
            "Average Precision": 0.79,
            "ROC AUC": 0.80,
        },
        {
            "Model": "5 - HuggingFace BERT Fine-tuning",
            "Sampling": 1,
            "True Positives": 99585,
            "True Negatives": 17631,
            "False Positives": 82369,
            "False Negatives": 415,
            "Average Precision": 0.822,
            "ROC AUC": 0.883,
        },
        {
            "Model": "6 - AzureML Studio : Automated ML",
            "Sampling": 1,
            "True Positives": 99585,
            "True Negatives": 17631,
            "False Positives": 82369,
            "False Negatives": 415,
            "Average Precision": 0.822,
            "ROC AUC": 0.883,
        },
    ]
)

models_results_df["Accuracy"] = (
    models_results_df["True Positives"] + models_results_df["True Negatives"]
) / (
    models_results_df["True Positives"]
    + models_results_df["True Negatives"]
    + models_results_df["False Positives"]
    + models_results_df["False Negatives"]
)

models_results_df["Precision"] = models_results_df["True Positives"] / (
    models_results_df["True Positives"] + models_results_df["False Positives"]
)

models_results_df["Recall"] = models_results_df["True Positives"] / (
    models_results_df["True Positives"] + models_results_df["False Negatives"]
)

models_results_df["Sensitivity"] = models_results_df["True Positives"] / (
    models_results_df["True Positives"] + models_results_df["False Negatives"]
)

models_results_df["Specificity"] = models_results_df["True Negatives"] / (
    models_results_df["True Negatives"] + models_results_df["False Positives"]
)

models_results_df["F1"] = (
    2
    * models_results_df["True Positives"]
    / (
        2 * models_results_df["True Positives"]
        + models_results_df["False Positives"]
        + models_results_df["False Negatives"]
    )
)

models_results_df

|  | Model | Sampling | True Positives | True Negatives | False Positives | False Negatives | Average Precision | ROC AUC | Accuracy | Precision | Recall | Sensitivity | Specificity | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 - Logistic Regression | 1.0 | 108688 | 107510 | 52490 | 51312 | 0.73 | 0.74 | 0.675619 | 0.674335 | 0.6793 | 0.6793 | 0.671937 | 0.676808 |
| 1 | 2 - Word embedding | 1.0 | 119134 | 105070 | 54930 | 40866 | 0.75 | 0.77 | 0.700638 | 0.684426 | 0.744587 | 0.744587 | 0.656687 | 0.713241 |
| 2 | 3.1 - Azure Cognitive Service API | 1.0 | 788 | 673 | 327 | 212 | 0.75 | 0.78 | 0.7305 | 0.706726 | 0.788 | 0.788 | 0.673 | 0.745154 |
| 3 | 3.2 - Logistic Regression on Azure Cognitive S... | 0.00125 | 164 | 123 | 77 | 36 | 0.76 | 0.78 | 0.7175 | 0.680498 | 0.82 | 0.82 | 0.615 | 0.743764 |
| 4 | 4 - HuggingFace Sentiment Analysis | 1.0 | 622 | 798 | 202 | 378 | 0.79 | 0.8 | 0.71 | 0.754854 | 0.622 | 0.622 | 0.798 | 0.682018 |
| 5 | 5 - HuggingFace BERT Fine-tuning | 1.0 | 99585 | 17631 | 82369 | 415 | 0.822 | 0.883 | 0.58608 | 0.547309 | 0.99585 | 0.99585 | 0.17631 | 0.706392 |
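
The derived columns all follow from the four confusion-matrix counts. As a cross-check, the sketch below recomputes them in plain Python for the first row (Logistic Regression). `confusion_metrics` is a hypothetical helper written for this illustration and mirrors the formulas of the cell above.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive the usual binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # also called sensitivity
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Counts of the "1 - Logistic Regression" row
metrics = confusion_metrics(tp=108688, tn=107510, fp=52490, fn=51312)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Reading recall and specificity together matters: the BERT fine-tuning row reaches a recall of 0.99585 but a specificity of only 0.17631, meaning the model predicts the positive class for almost every tweet, which its accuracy of 0.58608 confirms.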
