<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M08-deep-learning/AT%26T_logo_2016.svg" alt="AT&T LOGO" width="50%" />

# AT&T SPAM detector

## Company's Description 📇

AT&T Inc. is an American multinational telecommunications holding company headquartered at Whitacre Tower in Downtown Dallas, Texas. It is the world's largest telecommunications company by revenue and the third largest provider of mobile telephone services in the U.S. As of 2022, AT&T was ranked 13th on the Fortune 500 rankings of the largest United States corporations, with revenues of $168.8 billion! 😮

## Project 🚧

One of the main pain points that AT&T users face is constant exposure to spam messages.

AT&T has been able to flag spam messages manually for some time, but the company is now looking for an automated way to detect spam and protect its users.

## Goals 🎯

Your goal is to build a spam detector that can automatically flag spam messages as they arrive, based solely on the SMS content.

## Scope of this project 🖼️

To start off, AT&T would like you to use the following dataset:

[Download the Dataset](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/project/spam.csv)

## Deliverable 📬

To complete this project, your team should: 

* Write a notebook that runs the preprocessing and trains one or more deep learning models to predict whether an SMS is spam or ham
* State the achieved performance clearly

In [67]:
!pip install -q -U "tensorflow-text==2.9.*"


In [68]:
# Import TensorFlow and the supporting libraries
import tensorflow as tf 
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pathlib 
import pandas as pd 
import os
import io
import warnings
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,f1_score
warnings.filterwarnings('ignore')

In [69]:
tf.__version__

'2.9.2'

In [71]:
import tensorflow_text

In [72]:
# Import dataset with Pandas 
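# error_bad_lines is deprecated since pandas 1.3; on_bad_lines="skip" is its replacement
dataset = pd.read_csv("/content/spam.csv", on_bad_lines="skip", encoding="Latin-1")
dataset = dataset.iloc[:5000, :2]  # keep only the first 5000 rows and the first two columns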
dataset.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Preprocessing

We will now go through a preprocessing phase. The goal is to clean up the character strings and encode the words so they are represented as integers.
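
As a rough illustration of the two steps described above (cleaning, then integer encoding), here is a minimal standard-library sketch. The helpers `clean`, `build_vocab` and `encode` are hypothetical names introduced for this example; they only mimic the idea behind the Keras `Tokenizer` used later and skip lemmatization and stop-word removal entirely. Index 0 is assumed to be reserved for padding and index 1 for out-of-vocabulary words.

```python
from collections import Counter

def clean(text):
    # Keep letters, digits, spaces and apostrophes, then lowercase and collapse whitespace
    kept = "".join(ch for ch in text if ch.isalnum() or ch in " '")
    return " ".join(kept.lower().split())

def build_vocab(texts, num_words):
    # Assumption: index 0 is reserved for padding, index 1 for out-of-vocabulary words,
    # so only the (num_words - 2) most frequent words receive their own index
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(num_words - 2)]
    return {word: idx for idx, word in enumerate(most_common, start=2)}

def encode(text, vocab):
    # Unknown words map to the out-of-vocabulary index 1
    return [vocab.get(word, 1) for word in text.split()]

raw = ["Ok lar... Joking wif u oni...",
       "U dun say so early hor... U c already then say..."]
cleaned = [clean(t) for t in raw]
vocab = build_vocab(cleaned, num_words=6)
encoded = [encode(t, vocab) for t in cleaned]

print(cleaned[0])
print(encoded[0])
```

The most frequent word gets the smallest index, which is why common tokens such as `u` end up with low integers in the encoded sequences.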

In [73]:
!python -m spacy download en_core_web_sm

2022-12-15 17:48:39.261541: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [74]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [75]:
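# Import stop words
# NOTE: this is the French stop-word list, although the SMS are in English;
# spacy.lang.en.stop_words is probably the intended module. It is kept as-is
# here because the outputs below were produced with it.
from spacy.lang.fr.stop_words import STOP_WORDS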

In [76]:
for token in nlp(dataset['v2'][0]) : 
  print(token.lemma_) 

go
until
jurong
point
,
crazy
..
available
only
in
bugis
n
great
world
la
e
buffet
...
Cine
there
get
amore
wat
...


In [77]:
import re
# This cell processes the whole DataFrame, so it is normal for it to run for a while; feel free to keep coding the rest of the exercise in the meantime

# Keep only alphanumeric characters, spaces and apostrophes
dataset["sms_clean"] = dataset["v2"].apply(lambda x:''.join(ch for ch in x if ch.isalnum() or ch==" " or ch=="'"))
# Collapse repeated spaces (str.replace does not interpret regex patterns), then lowercase and strip
dataset["sms_clean"] = dataset["sms_clean"].apply(lambda x: re.sub(r" +", " ", x).lower().strip())
# Lemmatize and drop stop words
dataset["sms_clean"] = dataset["sms_clean"].apply(lambda x: " ".join([token.lemma_ for token in nlp(x) if (token.lemma_ not in STOP_WORDS) and (token.text not in STOP_WORDS)]))
dataset.head()

Unnamed: 0,v1,v2,sms_clean
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 wkly comp to win fa cup final ...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah do not think he go to usf he live around h...


In [78]:
import numpy as np
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=1000, oov_token="out_of_vocab") # instantiate the tokenizer
tokenizer.fit_on_texts(dataset.sms_clean)
dataset["sms_encoded"] = tokenizer.texts_to_sequences(dataset.sms_clean)
# Convert the v1 labels to 0 (ham) and 1 (spam)
dataset["categ"] = dataset["v1"].apply(lambda x: 1 if x=="spam" else 0)
dataset.head()

Unnamed: 0,v1,v2,sms_clean,sms_encoded,categ
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...,"[21, 380, 1, 373, 607, 561, 63, 10, 1, 87, 109...",0
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,"[42, 305, 1, 395, 7, 1]",0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 wkly comp to win fa cup final ...,"[49, 444, 10, 24, 749, 924, 3, 83, 1, 925, 562...",1
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say,"[7, 221, 54, 26, 237, 1, 7, 142, 140, 53, 54]",0
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah do not think he go to usf he live around h...,"[796, 8, 6, 66, 36, 21, 3, 750, 36, 213, 197, ...",0


In [79]:
sms_pad = tf.keras.preprocessing.sequence.pad_sequences(dataset.sms_encoded, padding="post")
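
To make the effect of `padding="post"` concrete, here is a small pure-Python sketch of the idea. `pad_post` is a hypothetical helper written for this example, not the Keras implementation; it only reproduces the behaviour relevant here (zeros appended at the end, every row brought to the length of the longest sequence, and overly long rows trimmed from the front as Keras does by default).

```python
def pad_post(sequences, maxlen=None, value=0):
    # Default target length: the longest sequence in the batch
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keras truncates from the front by default (truncating="pre"); mirrored here
        trimmed = seq[-maxlen:] if len(seq) > maxlen else seq
        # Append the padding value at the end ("post" padding)
        padded.append(list(trimmed) + [value] * (maxlen - len(trimmed)))
    return padded

# Two encoded messages taken from the table above
sequences = [[42, 305, 1, 395, 7, 1],
             [7, 221, 54, 26, 237, 1, 7, 142, 140, 53, 54]]
padded = pad_post(sequences)

for row in padded:
    print(row)
```

Every row now has the same length, which is what allows the sequences to be stacked into a single integer tensor.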

### Train test split

In [80]:
# Train Test Split
xtrain, xval, ytrain, yval = train_test_split(sms_pad,dataset.categ, test_size=0.3)

In [81]:
train = tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
val = tf.data.Dataset.from_tensor_slices((xval, yval))

In [82]:
train_batch = train.shuffle(len(train)).batch(64)
val_batch = val.shuffle(len(val)).batch(64)

In [83]:
# Let's look at a batch
for sms, categ in train_batch.take(1):
  print(sms, categ)

tf.Tensor(
[[ 49 444  10 ...   0   0   0]
 [549   1 269 ...   0   0   0]
 [154 170 112 ...   0   0   0]
 ...
 [  1  57 278 ...   0   0   0]
 [  1 308 332 ...   0   0   0]
 [ 39   7  57 ...   0   0   0]], shape=(64, 164), dtype=int32) tf.Tensor(
[1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(64,), dtype=int64)


In [84]:
sms.shape[1]

164

## Modeling with a simple sequential model

Let's create a model in order to train an embedding!

In [85]:
# Binary classification version
vocab_size = tokenizer.num_words
model = tf.keras.Sequential([
                  # Input layer: word embedding
                  tf.keras.layers.Embedding(vocab_size+1, 8, input_shape=[sms.shape[1]],name="embedding"),
                  # Global max pooling
                  tf.keras.layers.GlobalMaxPooling1D(),

                  # Standard dense layer
                  tf.keras.layers.Dense(16, activation='relu'),

                  # Output layer: a single neuron with a sigmoid activation for binary classification
                  tf.keras.layers.Dense(1, activation="sigmoid")
])
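
The first two layers can be pictured with a toy example: the embedding layer looks up one vector per token index, and global max pooling keeps, for each embedding dimension, the largest value seen anywhere in the message. The sketch below uses a made-up 2-dimensional embedding table (the real layer learns 8-dimensional vectors), and it assumes an all-zero vector for the padding index purely for readability; in the model above the padding vector is learned like any other.

```python
def global_max_pool(sequence_of_vectors):
    # Column-wise maximum over the sequence (time) axis
    return [max(column) for column in zip(*sequence_of_vectors)]

# Hypothetical embedding table: token index -> 2-dimensional vector
embedding_table = {
    0: [0.0, 0.0],   # padding (assumed zero for this toy example)
    1: [0.1, -0.3],  # out-of-vocabulary token
    2: [0.7, 0.2],
    3: [-0.5, 0.9],
}

encoded_sms = [2, 3, 1, 0, 0]                       # one padded message
embedded = [embedding_table[i] for i in encoded_sms]  # shape: (sequence length, 2)
pooled = global_max_pool(embedded)                   # shape: (2,)

print(pooled)  # [0.7, 0.9]
```

Whatever the message length, the pooled vector has a fixed size, which is what lets the dense layers that follow work on messages of any length.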

In [86]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 164, 8)            8008      
                                                                 
 global_max_pooling1d_1 (Glo  (None, 8)                0         
 balMaxPooling1D)                                                
                                                                 
 dense_6 (Dense)             (None, 16)                144       
                                                                 
 dense_7 (Dense)             (None, 1)                 17        
                                                                 
Total params: 8,169
Trainable params: 8,169
Non-trainable params: 0
_________________________________________________________________
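
The parameter counts in the summary can be checked by hand. The sketch below uses two hypothetical helpers, `embedding_params` and `dense_params`, and the layer sizes from the model definition: 1,001 embedding rows (`vocab_size + 1`) of dimension 8, a dense layer of 16 units and a single output unit.

```python
def embedding_params(rows, dim):
    # One vector of size `dim` per row of the embedding table
    return rows * dim

def dense_params(inputs, units):
    # One weight per (input, unit) pair plus one bias per unit
    return inputs * units + units

embedding = embedding_params(1000 + 1, 8)  # 8008
hidden = dense_params(8, 16)               # 144
output = dense_params(16, 1)               # 17
total = embedding + hidden + output        # 8169

print(embedding, hidden, output, total)
```

The pooling layer has no trainable parameters, so almost all of the model's capacity sits in the embedding table.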


In [87]:
optimizer= tf.keras.optimizers.Adam()

model.compile(optimizer=optimizer,
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy()])


In [88]:
history = model.fit(train_batch, epochs=50, validation_data=val_batch)

Epoch 1/50
...
Epoch 50/50


In [91]:
from plotly import graph_objects as go

color_chart = ["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]


fig = go.Figure(data=[
                      go.Scatter(
                          y=history.history["loss"],
                          name="Training loss",
                          mode="lines",
                          marker=dict(
                          color=color_chart[0]
                          )),
                      go.Scatter(
                          y=history.history["val_loss"],
                          name="Validation loss",
                          mode="lines",
                          marker=dict(
                              color=color_chart[1]
                          ))
])
fig.update_layout(
    title='Training and val loss across epochs',
    xaxis_title='epochs',
    yaxis_title='Cross Entropy'    
)
fig.show()

In [92]:
from plotly import graph_objects as go
fig = go.Figure(data=[
                      go.Scatter(
                          y=history.history["binary_accuracy"],
                          name="Training accuracy",
                          mode="lines",
                          marker=dict(
                              color=color_chart[4]
                          )),
                      go.Scatter(
                          y=history.history["val_binary_accuracy"],
                          name="Validation accuracy",
                          mode="lines",
                          marker=dict(
                              color=color_chart[5]
                          ))
])
fig.update_layout(
    title='Training and val accuracy across epochs',
    xaxis_title='epochs',
    yaxis_title='Accuracy'    
)
fig.show()

# Testing our Sequential Model on Detecting Spam Across New Messages

In [96]:
# Use the model to predict whether a message is spam
messages_received = ['Greg, can you call me back once you get this?',
                'Congrats on your new iPhone! Click here to claim your prize...', 
                'Really like that new photo of you',
                'Did you hear the news today? Terrible what has happened...',
                'Attend this free COVID webinar today: Book your session now...']

print(messages_received) 

# Create the sequences
padding_type='post'
sample_sequences = tokenizer.texts_to_sequences(messages_received)
fakes_padded = pad_sequences(sample_sequences, padding=padding_type, maxlen=164)           

fakes_prediction = model.predict(fakes_padded)

# The closer the class is to 1, the more likely that the message is spam
for x in range(len(messages_received)):
  print(messages_received[x])
  print(fakes_prediction[x])
  print('\n')

['Greg, can you call me back once you get this?', 'Congrats on your new iPhone! Click here to claim your prize...', 'Really like that new photo of you', 'Did you hear the news today? Terrible what has happened...', 'Attend this free COVID webinar today: Book your session now...']
Greg, can you call me back once you get this?
[2.9733166e-05]


Congrats on your new iPhone! Click here to claim your prize...
[0.9998084]


Really like that new photo of you
[0.00020477]


Did you hear the news today? Terrible what has happened...
[0.00185848]


Attend this free COVID webinar today: Book your session now...
[0.01462053]
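
To turn these probabilities into labels, a decision threshold is needed. The sketch below applies a threshold of 0.5, which is an assumption made for this example (it is the usual default, not something the brief specifies), to the values printed above. `to_label` is a hypothetical helper.

```python
# Probabilities copied from the predictions printed above
probabilities = [2.9733166e-05, 0.9998084, 0.00020477, 0.00185848, 0.01462053]

def to_label(probability, threshold=0.5):
    # At or above the threshold the message is flagged as spam
    return "spam" if probability >= threshold else "ham"

labels = [to_label(p) for p in probabilities]
print(labels)  # ['ham', 'spam', 'ham', 'ham', 'ham']
```

Lowering the threshold flags more messages at the cost of more false alarms; for instance, `to_label(0.01462053, threshold=0.01)` would flag the webinar message as well.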




# Transfer Learning with BERT

### Let's import a pretrained model 

In [141]:
text_test = [dataset.sms_clean[3]]

# Preprocessing layer: tokenizes raw strings into the inputs BERT expects
bert_preprocessor = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3", name='preprocessing')
# Pretrained BERT encoder (frozen by default, since trainable is not set)
bert_encoder = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4')
# Quick check of the preprocessor on a single message
text_preprocessed = bert_preprocessor(text_test)


In [142]:
text_input = tf.keras.layers.Input(shape = (), dtype = tf.string, name = 'Inputs')
preprocessed_text = bert_preprocessor(text_input)
embeed = bert_encoder(preprocessed_text)
dropout = tf.keras.layers.Dropout(0.1, name = 'Dropout')(embeed['pooled_output'])
outputs = tf.keras.layers.Dense(1, activation = 'sigmoid', name = 'Dense')(dropout)

In [143]:
# creating final model
model_tlbert = tf.keras.Model(inputs = [text_input], outputs = [outputs])

In [144]:
model_tlbert.summary()

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Inputs (InputLayer)            [(None,)]            0           []                               
                                                                                                  
 preprocessing (KerasLayer)     {'input_word_ids':   0           ['Inputs[0][0]']                 
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128)}                                                

### Compile and fit the model

In [145]:
Metrics = [tf.keras.metrics.BinaryAccuracy(name = 'accuracy'),
           tf.keras.metrics.Precision(name = 'precision'),
           tf.keras.metrics.Recall(name = 'recall')
           ]

In [146]:
model_tlbert.compile(optimizer ='adam',
               loss = 'binary_crossentropy',
               metrics = Metrics)

In [148]:
# Train / test split on the cleaned text, stratified on the label
text_train, text_test, y_train, y_test = train_test_split(dataset.sms_clean, dataset.categ, test_size=0.1, random_state=1,stratify=dataset.categ)

# Split the remaining training data again to obtain a validation set
text_train, text_val, y_train, y_val = train_test_split(text_train, y_train, test_size=0.1, random_state=1,stratify=y_train)

# Create TensorFlow datasets from the slices and batch them

text_train_ds = tf.data.Dataset.from_tensor_slices((text_train, y_train))
text_test_ds = tf.data.Dataset.from_tensor_slices((text_test, y_test))
text_val_ds = tf.data.Dataset.from_tensor_slices((text_val, y_val))

text_train_ds = text_train_ds.shuffle(len(text_train_ds)).batch(64)
text_test_ds = text_test_ds.shuffle(len(text_test_ds)).batch(64)
text_val_ds = text_val_ds.shuffle(len(text_val_ds)).batch(64)

In [149]:
history_tlbert = model_tlbert.fit(text_train_ds, epochs=5, validation_data=text_val_ds)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [153]:
from plotly import graph_objects as go
fig = go.Figure(data=[
                      go.Scatter(
                          y=history_tlbert.history["accuracy"],
                          name="Training accuracy",
                          mode="lines",
                          marker=dict(
                              color=color_chart[4]
                          )),
                      go.Scatter(
                          y=history_tlbert.history["val_accuracy"],
                          name="Validation accuracy",
                          mode="lines",
                          marker=dict(
                              color=color_chart[5]
                          ))
])
fig.update_layout(
    title='Training and val accuracy across epochs',
    xaxis_title='epochs',
    yaxis_title='Accuracy'    
)
fig.show()

### Compare performance between our models

#### Let's look at the accuracy and F1-score of each model


In [154]:
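def performance_model(model, X, y):
    """Return the accuracy and F1-score of `model` on (X, y)."""
    # model.predict returns probabilities of shape (n, 1); round and flatten them to 0/1 labels
    y_pred = np.round(model.predict(X)).ravel().astype(int)
    accuracy = accuracy_score(y, y_pred)
    f1 = f1_score(y, y_pred)

    model_performance = {'accuracy': accuracy,
                          'f1-score': f1}

    return model_performance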

In [158]:
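# Note: the two models are scored on different validation splits
# (xval/yval from the first split, text_val/y_val from the stratified split)
Basic_model = performance_model(model, xval, yval)
BERT_model = performance_model(model_tlbert, text_val, y_val)

data_results = pd.DataFrame({'Sequential Simple Model':Basic_model,
                              'BERT-Transfer learning Model':BERT_model}).transpose()

data_results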



Unnamed: 0,accuracy,f1-score
Sequential Simple Model,0.979333,0.916442
BERT-Transfer learning Model,0.877778,0.179104
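
A high accuracy paired with a low F1-score is typical of imbalanced data, where a model can score well on accuracy while missing most of the minority class. The sketch below illustrates this with made-up numbers: 1,000 messages of which 130 are spam. That ratio is a hypothetical figure chosen for illustration, not one measured on this dataset, and `accuracy` and `f1` are small helpers written for this example.

```python
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1(y_true, y_pred):
    # F1 for the positive class (spam = 1), computed from the confusion counts
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical label distribution: 130 spam, 870 ham
y_true = [1] * 130 + [0] * 870

# Case 1: a "model" that always predicts ham
always_ham = [0] * 1000

# Case 2: a model that catches only 13 of the 130 spam messages, with no false alarms
catches_few = [1] * 13 + [0] * 987

for name, y_pred in [("always ham", always_ham), ("catches 13 spam", catches_few)]:
    print(name, round(accuracy(y_true, y_pred), 3), round(f1(y_true, y_pred), 3))
```

In this toy setting both predictors reach an accuracy of 0.87 or more while their F1-score stays below 0.2, which is why the F1-score (or precision and recall) is worth reporting alongside accuracy for spam detection.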
