- I downloaded the dataset from Kaggle(SMS Spam Collection Dataset), which you can find on the following link: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
- The files contain one message per line. Each line is composed by two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

In [5]:
import pandas as pd

In [21]:
#Import Dataset
df = pd.read_csv("spam.csv", encoding="ANSI")

In [22]:
df.head(5)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [24]:
#Exploring the dataset
df.groupby("Category").describe()
#We have 4825 ham and 747 spam messages!

Unnamed: 0_level_0,Message,Message,Message,Message
Unnamed: 0_level_1,count,unique,top,freq
Category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
ham,4825,4516,"Sorry, I'll call later",30
spam,747,653,Please call our customer service representativ...,4


In [26]:
#Let's create a column for spam from Category
df['spam']=df['Category'].apply(lambda x: 1 if x=='spam' else 0)
#If the value in category is spam we will have a 1 is the new column of spam if not we will have 0

In [27]:
df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [30]:
#Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Message'],df['spam'], stratify=df['spam'])
#My x is the messages and my y is the spam 
#When you do stratify, its make sure that there is a balance

In [32]:
y_train.value_counts() #15 spam in train

0    3619
1     560
Name: spam, dtype: int64

In [33]:
y_test.value_counts() #15 spam in test

0    1206
1     187
Name: spam, dtype: int64

In [37]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

BERT

In [38]:
#Create a function
#The function take a sentence as input and return an embedding vector

def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

#pooled output is embedding the the entire sentence

In [41]:
get_sentence_embedding([
    "I HAVE A DATE ON SUNDAY WITH WILL!!", 
    "Ffffffffff. Alright no way I can meet up with you sooner?"]
)

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.8476381 , -0.35198745, -0.18676049, ..., -0.10289747,
        -0.6462645 ,  0.9302797 ],
       [-0.79941463, -0.24894392, -0.57623297, ..., -0.52235186,
        -0.56344837,  0.88658595]], dtype=float32)>

In [50]:
#let's compare those words and use cosine similarity!
#Pizza, burger, sushi are food but walid, asma and lina are names
vs = get_sentence_embedding([
    "pizza", 
    "burger",
    "sushi",
    "soula walid",
    "khalil asma",
    "soula lina"]
)

In [51]:
#Pizza vs burger
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([vs[0]],[vs[1]])

array([[0.9702882]], dtype=float32)

In [52]:
#near to 1 the two vectors are very similar

In [53]:
#Pizza with soula lina
cosine_similarity([vs[0]],[vs[5]])

array([[0.90956086]], dtype=float32)

In [54]:
#not as similar as pizza and burger but still 0.9 xD

In [55]:
#Soula walid with Soula lina
cosine_similarity([vs[3]],[vs[5]])
#Better...

array([[0.9333291]], dtype=float32)

ANN Model (Functional Model)

- Will with use functional model
- In a sequential model, you add layers one by one
- In a functional model, you create a input and suplly it as a function argument in the second layer and so on.

In [57]:
# Bert layers (functional)
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)

#dropout layers are used to tackle overfitting
#we will drop 10% of neurons
#We will use sigmoid for classification

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [l])

In [58]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_word_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128)}                                                      

- 769 because we have the total neurons from the vector with the output for classification (768+1)

In [59]:
len(X_train)

4179

In [60]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Training

In [61]:
model.fit(X_train, y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x19acdac8cc8>

In [63]:
model.evaluate(X_test, y_test)
#95% accuracy



[0.14236554503440857, 0.9583632349967957]

Inference

- The first 3 emails are spam and the last 2 are not spam.
- If the score is above 0.5, it's spam and if it's below, it's not.

In [64]:
reviews = [
    'Reply to win Â£100 weekly! Where will the 2006 FIFA World Cup be held? Send STOP to 87239 to end service',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p pÂ£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your ."
]
model.predict(reviews)



array([[0.6330764 ],
       [0.71584964],
       [0.5850517 ],
       [0.06676803],
       [0.02399864]], dtype=float32)

GG