## Company's Description 📇

AT&T Inc. is an American multinational telecommunications holding company headquartered at Whitacre Tower in Downtown Dallas, Texas. It is the world's largest telecommunications company by revenue and the third largest provider of mobile telephone services in the U.S. As of 2022, AT&T was ranked 13th on the Fortune 500 rankings of the largest United States corporations, with revenues of $168.8 billion! 😮

## Project 🚧

One of the main pain point that AT&T users are facing is constant exposure to SPAM messages.

AT&T has been able to manually flag spam messages for a time, but they are looking for an automated way of detecting spams to protect their users.

## Goals 🎯

Your goal is to build a spam detector, that can automatically flag spams as they come based sollely on the sms' content.

## Scope of this project 🖼️

To start off, AT&T would like you to use the folowing dataset:

[Dowload the Dataset](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/project/spam.csv)

## Helpers 🦮

To help you achieve this project, here are a few tips that should help you: 

### Start simple
A good deep learing model does not necessarily have to be super complicated!

### Transfer learning
You do not have access to a whole lot of data, perhaps channeling the power of a more sophisticated model trained on billions of observations might help!

## Deliverable 📬

To complete this project, your team should: 

* Write a notebook that runs preprocessing and trains one or more deep learning models in order to predict the spam or ham nature of the sms
* State the achieved performance clearly

In [2]:
# Import necessaries librairies
import pandas as pd
import numpy as np 
import tensorflow_datasets as tfds
import tensorflow as tf 

from sklearn.model_selection import train_test_split

# We start by downloading spacy for the english language
# !python -m spacy download en_core_web_sm -q

from spacy.lang.en.stop_words import STOP_WORDS
import en_core_web_sm

In [3]:
df = pd.read_csv("spam.csv",encoding='ISO-8859-1')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
df = df[['v1', 'v2']]

In [5]:
nlp = en_core_web_sm.load()

In [6]:
# Remove all non alphanumeric characters except whitespaces
df["text_clean"] = df["v2"].apply(lambda x:''.join(ch for ch in x if ch.isalnum() or ch==" "))
# remove double spaces and spaces at the beginning and end of strings
df["text_clean"] = df["text_clean"].apply(lambda x: x.replace(" +"," ").lower().strip())
# remove stop words and replace everyword with their lemma
df["text_clean"] = df["text_clean"].apply(lambda x: " ".join([token.lemma_ for token in nlp(x) if (token.lemma_ not in STOP_WORDS) & (token.text not in STOP_WORDS)]))

In [7]:
df.head()

Unnamed: 0,v1,v2,text_clean
0,ham,"Go until jurong point, crazy.. Available only ...",jurong point crazy available bugis n great wor...
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,u dun early hor u c
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah think usf live


Note Elo : faire de l'EDA avec histogrammes pour voir quels sont les mots les plus utilisés

In [8]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100000000, oov_token = "out_of_vocab") #check this out
tokenizer.fit_on_texts(df['text_clean'])
df['text_encoded'] = tokenizer.texts_to_sequences(df['text_clean'])
df.head()

Unnamed: 0,v1,v2,text_clean,text_encoded
0,ham,"Go until jurong point, crazy.. Available only ...",jurong point crazy available bugis n great wor...,"[3601, 229, 446, 462, 941, 35, 51, 203, 942, 7..."
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif u oni,"[9, 194, 463, 288, 1, 1452]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...,"[12, 298, 3, 532, 663, 33, 1453, 850, 422, 145..."
3,ham,U dun say so early hor... U c already then say...,u dun early hor u c,"[1, 124, 149, 2353, 1, 84]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah think usf live,"[708, 22, 664, 128]"


In [9]:
df['v1'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: v1, dtype: float64

In [10]:
df['target'] = df['v1'].apply(lambda x : 1 if x == 'spam' else 0)
df['target'].value_counts(normalize=True)

0    0.865937
1    0.134063
Name: target, dtype: float64

In [11]:
df.head()

Unnamed: 0,v1,v2,text_clean,text_encoded,target
0,ham,"Go until jurong point, crazy.. Available only ...",jurong point crazy available bugis n great wor...,"[3601, 229, 446, 462, 941, 35, 51, 203, 942, 7...",0
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif u oni,"[9, 194, 463, 288, 1, 1452]",0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...,"[12, 298, 3, 532, 663, 33, 1453, 850, 422, 145...",1
3,ham,U dun say so early hor... U c already then say...,u dun early hor u c,"[1, 124, 149, 2353, 1, 84]",0
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah think usf live,"[708, 22, 664, 128]",0


In [12]:
pad = tf.keras.preprocessing.sequence.pad_sequences(df['text_encoded'], padding="post")

In [13]:
# Train Test Split
xtrain, xval, ytrain, yval = train_test_split(pad, df['target'], test_size=0.3)

In [14]:
train_ds = tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
test_ds = tf.data.Dataset.from_tensor_slices((xval, yval))

In [15]:
train_ds = train_ds.shuffle(len(train_ds)).batch(64)
test_ds = test_ds.shuffle(len(test_ds)).batch(64)

In [16]:
for sms, status in train_ds.take(1):
  print(sms, status)

tf.Tensor(
[[ 123   45  741 ...    0    0    0]
 [4368  512    0 ...    0    0    0]
 [   9   27  143 ...    0    0    0]
 ...
 [   1 1613  273 ...    0    0    0]
 [  14 1385  345 ...    0    0    0]
 [ 108   34   62 ...    0    0    0]], shape=(64, 72), dtype=int32) tf.Tensor(
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 1], shape=(64,), dtype=int64)


In [17]:
vocab_size = tokenizer.num_words
print(vocab_size)

100000000


## Modèle embedding 

In [None]:
vocab_size = 1000
model = tf.keras.Sequential([
                  # Word Embedding layer           
                  Embedding(vocab_size+1, 64, input_shape=[review.shape[1],],name="embedding"),
                  # Gobal average pooling
                  tf.keras.layers.GlobalAveragePooling1D()
                  # Dense layers once the data is flat
                  Dense(16, activation='relu'),
                  

                  # output layer with as many neurons as the number of classes
                  # for the target variable and softmax activation
                  Dense(1, activation="sigmoid")
])

## Simple RNN

In [None]:
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, GRU, LSTM

vocab_size = 1000
model = tf.keras.Sequential([
                  # Word Embedding layer           
                  Embedding(vocab_size+1, 64, input_shape=[review.shape[1],],name="embedding"),
                  # Gobal average pooling
                  SimpleRNN(units=64, return_sequences=True), # maintains the sequential nature
                  SimpleRNN(units=32, return_sequences=False), # returns the last output
                  # Dense layers once the data is flat
                  Dense(16, activation='relu'),
                  Dense(8, activation='relu'),

                  # output layer with as many neurons as the number of classes
                  # for the target variable and softmax activation
                  Dense(5, activation="sigmoid")
])

In [19]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 72, 8)             800000008 
                                                                 
 global_average_pooling1d (G  (None, 8)                0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 16)                144       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 800,000,169
Trainable params: 800,000,169
Non-trainable params: 0
_________________________________________________________________


In [20]:
optimizer= tf.keras.optimizers.Adam()

model.compile(optimizer=optimizer,
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy()])

In [None]:
history = model.fit(train_ds, 
                    epochs=20, 
                    validation_data=test_ds)

Epoch 1/20


In [None]:
import matplotlib.pyplot as plt

# Visualization of the training process on the loss function 
plt.plot(history.history["loss"], color="b")
plt.plot(history.history["val_loss"], color="r")
plt.ylabel("loss")
plt.xlabel("Epochs")
plt.show()

In [None]:
# Visualization of accuracy training 
plt.plot(history.history["mean_absolute_error"], color="b")
plt.plot(history.history["val_mean_absolute_error"], color="r")
plt.ylabel("mean_absolute_error")
plt.xlabel("Epochs")
plt.show()
