# Block 4 - Predictive analysis of unstructured data with artificial intelligence - AT&T Spam Detector

## Introduction

AT&T Inc. is an American multinational telecommunications holding company whose history began in 1878 with the founding of the American District Telegraph Company. It is now the world's third-largest telecommunications company by revenue and the third-largest provider of mobile phone services in the United States.

### Problem statement

AT&T users are constantly exposed to spam messages.

The company would like to protect its users by developing an automated spam detector.

### Scope

To develop the spam detector, AT&T provided a labelled dataset of spam and ham (legitimate) messages.

### Aim and objectives

Overall aim: Predict whether a message is spam or ham.

Objectives:
- 1 - Train at least one deep learning model.
- 2 - State the achieved performance of the model.

## Methods

### 1 - Library import

### 2 - File reading and basic exploration

The dataset comprised 5,572 text messages received by AT&T users, each labelled as spam or ham. Most messages were contained in a single column, but a few percent of them were split across several columns.

Of note, most messages are written in informal, conversational language and contain many abbreviations.
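
The share of split messages and the balance between labels can be checked with a few lines of pandas. The sketch below runs on a made-up four-row frame that only mimics the column layout of the dataset (`v1`, `v2`, `Unnamed: 2` to `Unnamed: 4`); the rows and the names `toy`, `is_split` and `label_share` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy frame mimicking the layout of the dataset:
# label in "v1", text in "v2", overflow text in "Unnamed: 2" to "Unnamed: 4".
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["ok c u l8r", "WIN a prize", "where r u", "call me"],
    "Unnamed: 2": [None, " now!", None, None],
    "Unnamed: 3": [None, " txt YES", None, None],
    "Unnamed: 4": [None, None, None, None],
})

extra_cols = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]

# a message counts as "split" when at least one overflow column is filled
is_split = toy[extra_cols].notnull().any(axis=1)
percent_split = is_split.mean() * 100

# share of each label, in percent
label_share = toy["v1"].value_counts(normalize=True) * 100

print("Split messages: {:.1f}%".format(percent_split))
print(label_share)
```

On the real data, replacing `toy` with the loaded frame would give the actual percentage of split messages and show how unbalanced the two classes are.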

### 3 - Preprocessing 

First, the few messages that were split were merged into a single column so as not to lose information. Then, the data was prepared for deep learning: split into train and test sets and organized in batches.
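
The sizes that result from this split can be worked out by hand. The sketch below uses the values from the code section (5572 messages, a test fraction of 0.2, batches of 64) and assumes that the test count is rounded up and that the train set takes the remaining rows; that rounding rule is an assumption about the splitting function, not something stated in this report.

```python
import math

n_messages = 5572   # dataset size reported above
test_size = 0.2     # fraction held out, as in the code section
batch_size = 64     # batch size, as in the code section

# Assumption: the test count is rounded up and the train set takes the rest
n_test = math.ceil(n_messages * test_size)
n_train = n_messages - n_test

# the last batch of each set is smaller than batch_size, hence the ceil
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)

print("train:", n_train, "messages in", train_batches, "batches")
print("test: ", n_test, "messages in", test_batches, "batches")
```

Under these assumptions, one epoch corresponds to 70 training steps.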

### 4 - Deep learning model training

The text of the messages was processed very simply, to avoid discarding words that pipelines such as spaCy would not recognize. Text was lowercased and punctuation removed, which preserves most of the message content before vectorization.
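
To make the effect of this processing concrete, here is a plain-Python sketch of the same idea: lowercase, strip punctuation, then map words to integers and pad to a fixed length. It is not the TensorFlow layer used in the code section. The helper names (`standardize`, `build_vocab`, `vectorize`) and the sample messages are made up for illustration, and the way ties between equally frequent words are ordered may differ from the real layer. The convention of reserving index 0 for padding and index 1 for unknown words follows the usual behaviour of integer text vectorization.

```python
import re
import string
from collections import Counter

def standardize(text):
    # lowercase, then delete every ASCII punctuation character
    return re.sub("[%s]" % re.escape(string.punctuation), "", text.lower())

def build_vocab(texts, max_tokens):
    # index 0 is reserved for padding and index 1 for unknown words,
    # so only max_tokens - 2 slots remain for real words
    counts = Counter(w for t in texts for w in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(words)}

def vectorize(text, vocab, sequence_length):
    # map each word to its index (1 if unknown), then truncate and pad with 0
    ids = [vocab.get(w, 1) for w in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# made-up example messages
messages = ["U r a WINNER!!! Txt YES 2 claim", "c u l8r, ok?"]
vocab = build_vocab(messages, max_tokens=10000)

print(standardize(messages[1]))
print(vectorize("u r gr8", vocab, sequence_length=5))
```

Abbreviations such as "l8r" survive as ordinary tokens, which is the point of keeping the processing this light.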

The model is sequential and deliberately minimal. It has only five layers: vectorization, embedding, pooling, and two dense layers. Binary cross-entropy was used as the loss and accuracy as the performance metric. The model was trained for 20 epochs.
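
How small the model is can be checked by counting its trainable parameters by hand. The sketch below uses the sizes from the code section (vocabulary of 10000, embedding dimension of 16, a hidden dense layer of 8 units, one output unit); the variable names are chosen here for illustration.

```python
vocab_size = 10000     # vocabulary size, from the code section
embedding_dim = 16     # embedding dimension, from the code section
hidden_units = 8       # units in the first dense layer

# embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim
# dense layer: one weight per input-output pair, plus one bias per output
dense1_params = embedding_dim * hidden_units + hidden_units
dense2_params = hidden_units * 1 + 1
# the vectorization and average pooling layers hold no trainable weights
total_params = embedding_params + dense1_params + dense2_params

print("embedding:", embedding_params)
print("dense 1:  ", dense1_params)
print("dense 2:  ", dense2_params)
print("total:    ", total_params)
```

Almost all of the roughly 160,000 parameters sit in the embedding table; the two dense layers together account for fewer than 150.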

### 5 - Deep learning model performance

The loss and the accuracy reached a plateau after about 15 epochs. At the 15th epoch, the loss was 0.0266 on the train set and 0.0678 on the validation set. The corresponding accuracies were 0.9937 on the train set and 0.9830 on the validation set.
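
These numbers are easier to read with the definition of the loss in hand. The sketch below implements binary cross-entropy in plain Python on made-up labels and probabilities, then converts the reported losses into the geometric-mean probability assigned to the true class, which is `exp(-loss)`. It also estimates how many validation messages were misclassified, assuming a validation set of 1115 messages (20% of 5572, rounded up). Unlike library implementations, this version does not clip probabilities, so it would fail on exact 0 or 1.

```python
import math

def binary_cross_entropy(y_true, p_pred):
    # mean of -[y*log(p) + (1-y)*log(1-p)] over the examples
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for y, p in zip(y_true, p_pred)]
    return sum(terms) / len(terms)

# made-up labels (1 = spam) and predicted spam probabilities
y_true = [0, 0, 1, 1]
p_pred = [0.05, 0.10, 0.90, 0.60]
loss = binary_cross_entropy(y_true, p_pred)
print("toy loss: {:.4f}".format(loss))

# exp(-loss) is the geometric mean of the probability given to the true class
for name, reported in [("train", 0.0266), ("validation", 0.0678)]:
    print("{}: {:.3f}".format(name, math.exp(-reported)))

# errors implied by the reported validation accuracy
n_val = 1115  # assumed validation size
errors = round(n_val * (1 - 0.9830))
print("misclassified validation messages:", errors)
```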

## Conclusion

The AT&T dataset was rather small (5,572 records) but more than sufficient to build a very simple deep learning model with good performance.

The model reached more than 98% accuracy on held-out data after only 15 epochs, while staying extremely simple in its composition (few layers and few neurons).

This model is therefore suitable for spam versus ham prediction and, thanks to its simplicity, could easily be deployed by AT&T teams to protect their users from spam messages.

## Code

### 1 - Library import

In [None]:
### 1 - library import ### ----

import pandas as pd
import numpy as np

import re
import string

from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import TextVectorization  # stable import path (TF >= 2.6)
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


### 2 - File reading and basic exploration

In [None]:
### 2 - file reading and basic exploration - import dataset ### ----

# load data
data = pd.read_csv("cnm_bloc4_data.csv", encoding = "ISO-8859-1")


In [None]:
### 2 - file reading and basic exploration - get basic stats ### ----

# print shape of data
print("Number of rows: {}".format(data.shape[0]))
print("Number of columns: {}".format(data.shape[1]))
print()

# display dataset
pd.set_option('display.max_columns', None)
print("Dataset display: ")
display(data.head())
print()

# display basic statistics
print("Basic statistics: ")
data_desc = data.describe(include='all')
display(data_desc)
print()

# display percentage of missing values in columns and rows
percent_nan_col = data.isnull().sum() / data.shape[0] * 100
print("Percentage of missing values per column:\n{}".format(percent_nan_col))
print()
# divide by the number of rows (shape[0]), not the number of columns
percent_nan_row = data[data.isnull().all(axis = 1)].shape[0] / data.shape[0] * 100
print("Percentage of rows fully filled with missing values: {}".format(percent_nan_row))


### 3 - Preprocessing

In [None]:
### 3 - preprocessing - compile all text columns ### ----

# copy data for safety
data1 = data.copy()

# create column to save full message text
data1["full_text"] = data1["v2"]

# retrieve text from additional columns (filled only for split messages)
for col in ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]:
    index_col = data1.index[data1[col].notnull()]
    data1.loc[index_col, "full_text"] += data1.loc[index_col, col]

# drop columns that became useless
data1 = data1.drop(["v2", "Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)


In [None]:
### 3 - preprocessing - make datasets for deep learning ### ----

# encode labels
data1["v1"] = data1["v1"].apply(lambda x: 0 if x == "ham" else 1)

# split data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(data1["full_text"], data1["v1"], test_size = 0.2, 
    stratify = data1["v1"], random_state = 0)

# make tensorflow datasets
train_ds = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
test_ds = tf.data.Dataset.from_tensor_slices((X_test, Y_test))

# organize the datasets in batches
train_ds = train_ds.shuffle(len(train_ds)).batch(64)
# evaluation data does not need shuffling
test_ds = test_ds.batch(64)


### 4 - Deep learning model training

In [None]:
### 4 - deep learning model training - set text preprocessing ### ----

# make custom standardization function to remove punctuation
def custom_standardization(input_data):
  # transform all characters to lowercase
  lowercase = tf.strings.lower(input_data)
  # remove punctuation
  clean = tf.strings.regex_replace(lowercase, '[%s]' % re.escape(string.punctuation), '')
  return clean

# set vocabulary size and number of words in a sequence
vocab_size = 10000
sequence_length = 50

# initialize vectorization layer to normalize, split, and map strings to integers
vectorize_layer = TextVectorization(
    standardize = custom_standardization, 
    max_tokens = vocab_size, 
    output_mode = 'int', 
    output_sequence_length = sequence_length) 

# Make a text-only dataset and build the vocabulary.
text_ds = train_ds.map(lambda x, y: x) 
vectorize_layer.adapt(text_ds)


In [None]:
### 4 - deep learning model training - build the model ### ----

# set dimension of embedding
embedding_dim = 16

# build model
model = Sequential([
  vectorize_layer, 
  Embedding(vocab_size, embedding_dim, name = "embedding"), 
  GlobalAveragePooling1D(), 
  Dense(8, activation = 'relu'), 
  Dense(1, activation = "sigmoid") 
])

# set tensorboard to monitor model's training
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir = "logs")

# compile model
model.compile(optimizer = 'adam',
  loss = tf.keras.losses.BinaryCrossentropy(),
  metrics = ['accuracy'])


In [None]:
### 4 - deep learning model training - fit the model ### ----

history = model.fit(
    train_ds,
    validation_data = test_ds,
    epochs = 20,
    callbacks = [tensorboard_callback])


### 5 - Deep learning model performance

In [None]:
### 5 - deep learning model performance - plot performance ### ----

# set figure to make subplots
fig1 = make_subplots(
    rows = 1,
    cols = 2,
    subplot_titles = (
        "A. Loss",
        "B. Accuracy"),
    column_widths = [0.40, 0.40],
    horizontal_spacing = 0.20)

# plot loss for train and validation sets
fig1.add_trace(go.Scatter(
        name = "Train",
        x = np.arange(1, len(history.history["loss"]) + 1),  # epochs 1..N inclusive
        y = history.history["loss"],
        marker_color = px.colors.qualitative.Vivid[0],
        showlegend = True),
        row = 1, col = 1)
fig1.add_trace(go.Scatter(
        name = "Validation",
        x = np.arange(1, len(history.history["val_loss"]) + 1),  # epochs 1..N inclusive
        y = history.history["val_loss"],
        marker_color = px.colors.qualitative.Vivid[1],
        showlegend = True),
        row = 1, col = 1)

# plot accuracy for train and validation sets
fig1.add_trace(go.Scatter(
        name = "Train",
        x = np.arange(1, len(history.history["accuracy"]) + 1),  # epochs 1..N inclusive
        y = history.history["accuracy"],
        marker_color = px.colors.qualitative.Vivid[0],
        showlegend = False),
        row = 1, col = 2)
fig1.add_trace(go.Scatter(
        name = "Validation",
        x = np.arange(1, len(history.history["val_accuracy"]) + 1),  # epochs 1..N inclusive
        y = history.history["val_accuracy"],
        marker_color = px.colors.qualitative.Vivid[1],
        showlegend = False),
        row = 1, col = 2)

# update layout
fig1.update_xaxes(title = "Number of epochs", tickfont = dict(size = 10), range = [0,21], 
        tickvals = [0, 5, 10, 15, 20], showgrid = False)
fig1.update_yaxes(tickfont = dict(size = 10))
fig1.update_layout(
        margin = dict(l = 90, t= 120),
        title_text = "Figure 1. Deep learning model performance",
        title_x = 0.5,
        title_y = 0.95,
        title_font_size = 18,
        yaxis = dict(title = "Loss", range = [0, 0.75], tickvals = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]),
        yaxis2 = dict(title = "Accuracy", range = [0.859, 1.01], 
                tickvals = [0.86, 0.88, 0.90, 0.92, 0.94, 0.96, 0.98, 1.00]),
        legend = dict(
            orientation = "h",
            yanchor = "top",
            y = 1.35,
            xanchor = "left",
            x = 0.35,
            font = dict(size = 11)),
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 400)

fig1.show()

In [None]:
### 5 - deep learning model performance - report performance ### ----

# report performance at epoch 15
print("Performance at Epoch 15")
print()
print("Train loss: {:.4f}".format(history.history["loss"][14]))
print("Validation loss: {:.4f}".format(history.history["val_loss"][14]))
print()
print("Train accuracy: {:.4f}".format(history.history["accuracy"][14]))
print("Validation accuracy: {:.4f}".format(history.history["val_accuracy"][14]))
