# Air Paradis : Detect bad buzz with deep learning

## Context

"Air Paradis" is an airline whose marketing department wants to detect "bad buzz" on social networks quickly, so that it can anticipate and address issues as fast as possible. They need an AI API that can detect "bad buzz" and predict the reason for it.

The goal here is to evaluate three approaches to detect "bad buzz" :

-   cloud managed service : [Azure Cognitive Service for Language - Sentiment analysis](https://docs.microsoft.com/en-us/azure/cognitive-services/language-service/sentiment-opinion-mining/overview)
-   simple model : Logistic Regression trained on pre-processed data (stopwords, stemming, lemmatization, ...)
-   advanced models : deep learning models (Keras)
    -   with word embeddings (Gensim : word2vec, GloVe, fastText)
    -   with a "Long short-term memory" (LSTM) layer
    -   with a Bidirectional Encoder Representations from Transformers (BERT)


## Load project modules

The helper functions and project-specific code are placed in `../src/`.

We will use the [Python](https://www.python.org/about/gettingstarted/) programming language, and present the code and results in this [JupyterLab notebook](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).

We will use the usual libraries for data exploration, modeling and visualisation :

-   [NumPy](https://numpy.org/doc/stable/user/quickstart.html) and [Pandas](https://pandas.pydata.org/docs/user_guide/index.html) : for maths (stats, algebra, ...) and large data manipulation
-   [scikit-learn](https://scikit-learn.org/stable/getting_started.html) : for machine learning models training and evaluation
-   [Plotly](https://plotly.com/python/getting-started/) : for interactive data visualization

We will also use libraries specific to the goals of this project :

-   Natural Language Processing (NLP)
    -   [NLTK](https://www.nltk.org/) and [spaCy](https://spacy.io/api) : for text processing
    -   [Gensim](https://radimrehurek.com/gensim/auto_examples/index.html) and [pyLDAvis](https://nbviewer.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb) : for topic modelling and visualisation
    -   [Keras](https://keras.io/) : for deep learning models training and evaluation


In [None]:
# Import custom helper libraries
import os
import sys

src_path = os.path.abspath(os.path.join("../src"))
if src_path not in sys.path:
    sys.path.append(src_path)

import features.helpers as feat_helpers
import data.helpers as data_helpers
import visualization.helpers as viz_helpers


# Load environment variables from .env file
from dotenv import load_dotenv

load_dotenv()
# SECRET = os.getenv("SECRET")


# Set up logging
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)


# System modules
import random, pickle


# Maths modules
import numpy as np
from scipy.stats import f_oneway
import pandas as pd


# Viz modules
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots


# Sample data for development
TEXT_SAMPLE_SIZE = 10 * 1000  # <= 0 for all


import plotly.io as pio
pio.renderers.default = "notebook"

## Exploratory data analysis (EDA)

We are going to load the data and analyse the distribution of each variable.


### Load data

Let's download the [Kaggle - Sentiment140 dataset with 1.6 million tweets](https://www.kaggle.com/kazanova/sentiment140).


In [None]:
# Download and unzip CSV files
!cd .. && make dataset && cd notebooks

Now we can load the data.

In [None]:
# Load data from CSV
df = pd.read_csv(
    os.path.join("..", "data", "raw", "training.1600000.processed.noemoticon.csv"),
    names=["target", "id", "date", "flag", "user", "text"],
)

# Reduce memory usage
df = data_helpers.reduce_dataframe_memory_usage(df)

### Explore data

Let's display a few examples, find out how many data points are available, what the variables are and how they are distributed.


In [None]:
# Display first few rows
df.head(5)

In [None]:
# Display number of rows and column types
df.info()

There are _1600000_ rows, each composed of _6_ columns :

-   _target_: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
-   _id_: the id of the tweet (2087)
-   _date_: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
-   _flag_: the query (lyx). If there is no query, then this value is NO_QUERY.
-   _user_: the user that tweeted (robotickilldozr)
-   _text_: the text of the tweet (Lyx is cool)

We are only interested in the _target_ and _text_ variables. The other columns are not useful for our analysis.

In [None]:
# Drop useless columns
df.drop(columns=["id", "date", "flag", "user"], inplace=True)

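# Replace target values with labels
# (assign the result back: replace(inplace=True) on df.target is a chained
# operation that may not update df under pandas copy-on-write)
df["target"] = df["target"].replace(
    {
        0: "NEGATIVE",
        2: "NEUTRAL",
        4: "POSITIVE",
    }
)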

In [None]:
# Display basic statistics
df.describe(include="all")

In [None]:
# Plot target distribution
viz_helpers.histogram(
    df, label_x="target", label_colour="target", title="Target distribution"
)

There are exactly as many (800000) _POSITIVE_ tweets as _NEGATIVE_ tweets. There are no _NEUTRAL_ tweets.
The two classes are perfectly balanced, so class imbalance will not bias our models towards one class during training.
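
As a rough illustration of the balance check, the sketch below computes the share of each class on a made-up list of labels (not the tweet data). `class_shares` is a hypothetical helper introduced here for illustration; on the DataFrame itself, `df.target.value_counts(normalize=True)` should give the same information.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels (illustrative values only)
toy_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]
shares = class_shares(toy_labels)
print(shares)
```

With two classes at 50% each, a classifier that always predicts the same class would reach 50% accuracy, which gives a simple baseline to compare the models against.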

In [None]:
# Plot text length distribution
df["text_length"] = df.text.str.len()

p_value = f_oneway(
    df.loc[df["target"] == "NEGATIVE", "text_length"],
    df.loc[df["target"] == "POSITIVE", "text_length"],
)[1]

viz_helpers.histogram(
    df,
    label_x="text_length",
    label_colour="target",
    title=f"Text length distribution / p-value={p_value:.5f}",
    include_boxplot=True,
)

There is no large difference between the _POSITIVE_ and _NEGATIVE_ tweets, but _NEGATIVE_ tweets are slightly longer than _POSITIVE_ tweets.
In both classes, there are two modes : *~45* characters and *138* characters (close to the 140-character limit that Twitter enforced at the time).
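
The p-value in the plot title comes from `scipy.stats.f_oneway`, which runs a one-way ANOVA. To make that number less of a black box, the sketch below computes the underlying F statistic by hand on made-up values (not the tweet data). `one_way_f` is a hypothetical helper; turning F into a p-value still requires the F distribution, which SciPy handles.

```python
from statistics import fmean


def one_way_f(*groups):
    """Return the one-way ANOVA F statistic for two or more groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = fmean(all_values)
    group_means = [fmean(group) for group in groups]

    # Between-group sum of squares: distance of each group mean to the grand mean
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy text lengths (illustrative values only)
negative_lengths = [1, 2, 3]
positive_lengths = [2, 3, 4]
f_stat = one_way_f(negative_lengths, positive_lengths)
print(f"F = {f_stat:.2f}")
```

A large F means the group means differ by more than the spread inside each group would suggest.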


In [None]:
# Plot word count distribution
df["word_count"] = df.text.str.split().str.len()

p_value = f_oneway(
    df.loc[df["target"] == "NEGATIVE", "word_count"],
    df.loc[df["target"] == "POSITIVE", "word_count"],
)[1]

viz_helpers.histogram(
    df,
    label_x="word_count",
    label_colour="target",
    title=f"Word count distribution / p-value={p_value:.5f}",
    include_boxplot=True,
)

There is no large difference between the _POSITIVE_ and _NEGATIVE_ tweets, but _NEGATIVE_ tweets contain significantly more words than _POSITIVE_ tweets.
In both classes, there are two modes : *~7* words and *~20* words.
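
With 800000 tweets per class, even a small difference in means can be statistically significant, so an effect size is a useful complement to the p-value. The sketch below computes Cohen's d on made-up word counts (not the tweet data). `cohens_d` is a hypothetical helper that is not part of the project code.

```python
from statistics import fmean, variance


def cohens_d(group_a, group_b):
    """Return Cohen's d: the mean difference in units of pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (fmean(group_a) - fmean(group_b)) / pooled_variance ** 0.5


# Toy word counts (illustrative values only)
negative_counts = [2, 3, 4]
positive_counts = [1, 2, 3]
effect_size = cohens_d(negative_counts, positive_counts)
print(f"Cohen's d = {effect_size:.2f}")
```

On the real data, the same function could be applied to the two `word_count` series selected with `df.loc[...]` in the cell above.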


#### Text analysis

We will look in more detail at what the _text_ variable contains.


In [None]:
# Vectorizers
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenizers, Stemmers and Lemmatizers
import nltk
from nltk.corpus import stopwords
import spacy

# Download resources
nltk.download("stopwords")
nltk.download("wordnet")
stopwords = set(stopwords.words("english"))

# Download SpaCy model
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Define tokenizer (spaCy lemmatizer)
def tokenizer(text):
    """Return lowercased lemmas of alphabetic, non-stop-word tokens."""
    return [
        token.lemma_.lower()
        for token in nlp(text)
        if token.is_alpha and not token.is_stop
    ]


In [None]:
# Processed data path
processed_data_path = os.path.join("..", "data", "processed")
tfidf_dataset_file_path = os.path.join(
    processed_data_path, "tfidf_dataset.pkl"
)
tfidf_vocabulary_file_path = os.path.join(
    processed_data_path, "tfidf_vocabulary.pkl"
)

if os.path.exists(tfidf_dataset_file_path) and os.path.exists(
    tfidf_vocabulary_file_path
):
    # Load vectorized dataset
    with open(tfidf_dataset_file_path, "rb") as f:
        X = pickle.load(f)
    # Load vocabulary
    with open(tfidf_vocabulary_file_path, "rb") as f:
        vocabulary = pickle.load(f)
else:
    # Define vectorizer
    vectorizer = TfidfVectorizer(
        strip_accents="unicode",
        lowercase=True,
        # Recent scikit-learn versions validate that stop_words is a list, not a set
        stop_words=list(stopwords),
        tokenizer=tokenizer,
    )

    # Vectorize text
    X = vectorizer.fit_transform(df.text)

    # Get vocabulary
    vocabulary = vectorizer.get_feature_names_out()

    # Save vectorized dataset as pickle
    with open(tfidf_dataset_file_path, "wb") as f:
        pickle.dump(X, f)

    # Save vocabulary as pickle
    with open(tfidf_vocabulary_file_path, "wb") as f:
        pickle.dump(vocabulary, f)
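
To show what the vectorizer computes, the sketch below derives TF-IDF weights by hand for two made-up, already tokenized documents. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. `tfidf` is a hypothetical helper, not part of the project code.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {
        term: math.log((1 + n_docs) / (1 + freq)) + 1
        for term, freq in doc_freq.items()
    }

    weighted_docs = []
    for doc in tokenized_docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        weighted_docs.append({term: w / norm for term, w in weights.items()})
    return weighted_docs


# Toy corpus (illustrative tokens only)
toy_docs = [["good", "flight"], ["bad", "flight"]]
toy_weights = tfidf(toy_docs)
print(toy_weights[0])
```

The term "flight" appears in both documents, so it gets a lower weight than "good" or "bad", which each appear in only one.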


In [None]:
from sklearn.decomposition import TruncatedSVD


# Train LSA model
n_components = 50
lsa = TruncatedSVD(n_components=n_components, random_state=42).fit(X)

In [None]:
# Plot explained variance ratio of LSA
fig = px.line(
    x=range(1, n_components + 1),
    y=lsa.explained_variance_ratio_,
    title="Explained variance ratio of LSA",
    labels={"x": "Component", "y": "Explained variance ratio"},
    markers=True,
)
fig.show()
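
One way to read the plot above is to ask how many components are needed to reach a given share of the variance. The sketch below answers that question on made-up ratios (not the LSA output). `components_for_variance` and the thresholds are hypothetical, chosen only for illustration.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches `threshold`, or None if it is never reached."""
    for n_kept, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return n_kept
    return None


# Toy explained variance ratios (illustrative values only)
toy_ratios = [0.5, 0.3, 0.1, 0.05]
print(components_for_variance(toy_ratios, 0.75))
print(components_for_variance(toy_ratios, 0.99))
```

In the notebook, the same function could be called with `lsa.explained_variance_ratio_` as the first argument.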

In [None]:
# Reduce dimensionality
X_lsa = lsa.transform(X)

In [None]:
from sklearn.model_selection import train_test_split


# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_lsa, df.target, test_size=0.2, random_state=42
)

In [None]:
from sklearn.linear_model import LogisticRegressionCV


# Define model
model = LogisticRegressionCV(
    # n_jobs=-1,
    random_state=42,
)

# Train model
model.fit(X_train, y_train)

In [None]:
viz_helpers.plot_classifier_results(
    model,
    X_train,
    y_train,
    title="Train set results",
)

In [None]:
# Predict
y_pred = model.predict(X_test)

In [None]:
viz_helpers.plot_classifier_results(
    model,
    X_test,
    y_test,
    title="Test set results",
)
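
`viz_helpers.plot_classifier_results` lives in `../src/` and its internals are not shown here. As a minimal sketch of the kind of metrics such a report usually contains, the code below computes accuracy, precision and recall on made-up labels, treating _NEGATIVE_ (the "bad buzz" class) as the class of interest. `binary_report` is a hypothetical helper, and the choice of _NEGATIVE_ as the reference class is an assumption.

```python
def binary_report(y_true, y_pred, positive_label="NEGATIVE"):
    """Return accuracy, plus precision and recall for `positive_label`."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive_label and p == positive_label for t, p in pairs)
    false_pos = sum(t != positive_label and p == positive_label for t, p in pairs)
    false_neg = sum(t == positive_label and p != positive_label for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Precision: share of predicted bad buzz that really is bad buzz
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Recall: share of real bad buzz that the model catches
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Toy labels (illustrative values only)
toy_true = ["NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
toy_pred = ["NEGATIVE", "POSITIVE", "POSITIVE", "POSITIVE"]
report = binary_report(toy_true, toy_pred)
print(report)
```

In the notebook, `binary_report(y_test, y_pred)` would apply the same computation to the test set predictions.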