<a href="https://colab.research.google.com/github/ashwinsathish/Spam-Classification/blob/main/ADL_HPE_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install tensorflow_text

Collecting tensorflow_text
  Downloading tensorflow_text-2.8.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (4.9 MB)
Collecting tf-estimator-nightly==2.8.0.dev2021122109
  Downloading tf_estimator_nightly-2.8.0.dev2021122109-py2.py3-none-any.whl (462 kB)
Installing collected packages: tf-estimator-nightly, tensorflow-text
Successfully installed tensorflow-text-2.8.1 tf-estimator-nightly-2.8.0.dev2021122109


In [None]:
!pip install tensorflow_hub



## Importing all the necessary packages

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_text as text
import tensorflow_hub as hub
from keras.layers import Embedding

In [None]:
email_data = pd.read_csv('mail_data.csv')

In [None]:
email_data

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


## We describe the data to see how it is distributed

In [None]:
email_data.describe()

Unnamed: 0,Category,Message
count,5572,5572
unique,2,5157
top,ham,"Sorry, I'll call later"
freq,4825,30


In [None]:
email_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [None]:
email_data.groupby('Category').describe()

Unnamed: 0_level_0,Message,Message,Message,Message
Unnamed: 0_level_1,count,unique,top,freq
Category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


The dataset is clearly imbalanced: it contains 4825 ham (non-spam) messages but only 747 spam messages. The labels are also categorical strings, so they have to be encoded as integers before training.
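
As a rough illustration of how skewed the classes are, the two counts reported above can be turned into proportions. This sketch hard-codes the counts from the summary table instead of reading `mail_data.csv`; the names `class_counts`, `class_share`, and `imbalance_ratio` are introduced here for illustration only.

```python
import pandas as pd

# Class counts copied from the groupby summary (not recomputed from the CSV)
class_counts = pd.Series({"ham": 4825, "spam": 747})

# Fraction of the dataset that belongs to each class
class_share = class_counts / class_counts.sum()

# Number of ham messages for every spam message
imbalance_ratio = class_counts["ham"] / class_counts["spam"]

print(class_share.round(3))
print(round(imbalance_ratio, 2))
```

On the real DataFrame, `email_data['Category'].value_counts(normalize=True)` should give the same proportions.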

## Visualizing data

In [None]:
import matplotlib.pyplot as plt

In [None]:
from wordcloud import WordCloud
wc = WordCloud(width=500,height=500,min_font_size=10,background_color='white')

In [None]:
spam_wc = wc.generate(email_data[email_data['Category'] == 'spam']['Message'].str.cat(sep=" "))

In [None]:
plt.figure(figsize=(10,5))
plt.imshow(spam_wc)

In [None]:
ham_wc = wc.generate(email_data[email_data['Category'] == 'ham']['Message'].str.cat(sep=" "))

In [None]:
plt.figure(figsize=(10,5))
plt.imshow(ham_wc)

## Data Encoding

In [None]:
Y = email_data["Category"]
Y

0        ham
1        ham
2       spam
3        ham
4        ham
        ... 
5567    spam
5568     ham
5569     ham
5570     ham
5571     ham
Name: Category, Length: 5572, dtype: object

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y_encode = le.fit_transform(Y)

In [None]:
Y_encode

array([0, 0, 1, ..., 0, 0, 0])

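We have used label encoding to encode the labels. `LabelEncoder` assigns codes in sorted (alphabetical) order, so ham (non-spam) is encoded as 0 and spam as 1, which matches the array above (row 2 is spam and is encoded as 1).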
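
To check which integer each class received, the fitted encoder's `classes_` attribute can be inspected. The snippet below is a small sketch on a toy label list (the first four labels of the dataset); `toy_labels` and `code_for_label` are names made up for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels mirroring the first rows of the dataset
toy_labels = ["ham", "ham", "spam", "ham"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted, and the position of a class in it is its integer code
code_for_label = {label: code for code, label in enumerate(toy_encoder.classes_)}

print(code_for_label)
print(list(toy_encoded))
```

The same check on the notebook's encoder would be `dict(enumerate(le.classes_))`.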

## Preprocessing
Punctuation, symbols, and special characters are removed in this stage. The email text is tokenized with gensim's `simple_preprocess`, stopwords are dropped, and the remaining tokens are lemmatized with NLTK's `WordNetLemmatizer` to reduce words to their root forms.
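
To make the cleaning step more concrete, here is a standard-library approximation of the tokenization and stopword filtering. It is not gensim's code: the regular expression, the length limits, and the tiny `TOY_STOPWORDS` set are assumptions chosen for illustration, and no lemmatization is done, so "joking" stays as it is instead of becoming "joke".

```python
import re

# Hypothetical, tiny stopword list; gensim's STOPWORDS is far larger
TOY_STOPWORDS = {"the", "to", "in", "is", "so", "then"}

def rough_tokens(message, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs, drop very short/long tokens and stopwords."""
    words = re.findall(r"[^\W\d_]+", message.lower())
    return [w for w in words
            if min_len <= len(w) <= max_len and w not in TOY_STOPWORDS]

print(rough_tokens("Ok lar... Joking wif u oni..."))
print(rough_tokens("Free entry in 2 a wkly comp"))
```

Digits, punctuation, and single-letter tokens such as "u" disappear, which is also why the `preprocessed` column in the notebook is much shorter than the raw messages.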

In [None]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
from nltk.stem import WordNetLemmatizer

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
def preprocess(text):
    preproc = [lemmatizer.lemmatize(i,pos='v') for i in simple_preprocess(text) if i not in STOPWORDS]
    return ' '.join(preproc)

In [None]:
email_data['preprocessed'] = email_data['Message'].apply(preprocess)

In [None]:
email_data

Unnamed: 0,Category,Message,preprocessed
0,ham,"Go until jurong point, crazy.. Available only ...",jurong point crazy available bugis great world...
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry wkly comp win fa cup final tkts st ...
3,ham,U dun say so early hor... U c already then say...,dun early hor
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah think go usf live
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,nd time try contact win pound prize claim easy...
5568,ham,Will ü b going to esplanade fr home?,go esplanade fr home
5569,ham,"Pity, * was in mood for that. So...any other s...",pity mood suggestions
5570,ham,The guy did some bitching but I acted like i'd...,guy bitch act like interest buy week give free


In [None]:
X = email_data["preprocessed"]
X

0       jurong point crazy available bugis great world...
1                                     ok lar joke wif oni
2       free entry wkly comp win fa cup final tkts st ...
3                                           dun early hor
4                                   nah think go usf live
                              ...                        
5567    nd time try contact win pound prize claim easy...
5568                                 go esplanade fr home
5569                                pity mood suggestions
5570       guy bitch act like interest buy week give free
5571                                            rofl true
Name: preprocessed, Length: 5572, dtype: object

## Train test split

We first split the dataset into train and test sets and only then over/undersample the training data. Resampling before the split would let duplicated or synthetic copies of the same messages land in both the training and test sets, which may lead to an exaggerated accuracy.
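
The following sketch shows a split on labels with the same class counts as this dataset. It uses `stratify` and a fixed `random_state`, which are assumptions added here for reproducibility and are not part of the split used in the next cell; `toy_labels` and `row_ids` are placeholder names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the dataset: 4825 ham (0), 747 spam (1)
toy_labels = np.array([0] * 4825 + [1] * 747)
row_ids = np.arange(len(toy_labels))

train_ids, test_ids, y_tr, y_te = train_test_split(
    row_ids, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

# No row appears on both sides of the split
overlap = set(train_ids) & set(test_ids)

print(len(train_ids), len(test_ids), len(overlap))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Any resampling would then be fitted on `train_ids` only, so the test rows stay untouched.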

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_encode, test_size = 0.2)

We use the TF-IDF representation to convert the text data into numerical feature vectors.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,3))
vectorizer.fit(X_train.values.ravel())
X_train = vectorizer.transform(X_train.values.ravel())
X_test = vectorizer.transform(X_test.values.ravel())
X_train=X_train.toarray()
X_test=X_test.toarray()
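
The vectorization step above can be illustrated on a toy corpus. This is a minimal sketch with made-up documents (`train_docs`, `test_docs`); it shows that `ngram_range=(1,3)` adds bigrams and trigrams to the vocabulary and that terms unseen during fitting are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for preprocessed messages
train_docs = ["free entry win", "ok lar joke"]
test_docs = ["win prize"]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
train_matrix = toy_vectorizer.fit_transform(train_docs)  # vocabulary learned here
test_matrix = toy_vectorizer.transform(test_docs)        # vocabulary reused here

print(sorted(toy_vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.nnz)  # number of non-zero entries for the test document
```

Both results are SciPy sparse matrices. Calling `.toarray()`, as the notebook does, makes a dense copy, which can cost a lot of memory when the vocabulary is large.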

In [None]:
X_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
label_0 = 0
label_1 = 0  # both counters start at zero; starting at 1 overcounts class 1 by one

for i in range(len(Y_train)):
    if Y_train[i] == 0:
        label_0 +=1
    else:
        label_1 +=1
print("Label 0 = "+ str(label_0) +" "+"Label 1 = "+str(label_1))

Label 0 = 3853 Label 1 = 604


We will use SMOTE oversampling to balance the dataset. It is applied only to the training set so that no synthetic or duplicated messages end up in the test data.
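
SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The NumPy function below is a simplified sketch of that idea, not imbalanced-learn's implementation; `smote_like_samples` and its parameters are hypothetical names used only here.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Interpolate between minority points and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances; a point is never its own neighbour
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base point, one of its neighbours, and a random position between them
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[picked] - minority[base])

toy_minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
synthetic = smote_like_samples(toy_minority, n_new=10)
print(synthetic.shape)
```

Every synthetic point lies on a line segment between two real minority points, so on TF-IDF vectors the new rows are blends of existing spam messages and not exact copies.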

In [None]:
!pip install imblearn



In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy = 'minority')
X_train_sm, Y_train_sm = smote.fit_resample(X_train, Y_train)

In [None]:
label_0 = 0
label_1 = 0  # both counters start at zero; starting at 1 overcounts class 1 by one

for i in range(len(Y_train_sm)):
    if Y_train_sm[i] == 0:
        label_0 +=1
    else:
        label_1 +=1
print("Label 0 = "+ str(label_0) +" "+"Label 1 = "+str(label_1))

Label 0 = 3853 Label 1 = 3853


The classes in the training set are now balanced, so we can use it to train the model.
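
The balance check can also be done without an explicit loop. This is a small sketch using `np.bincount` on a made-up label array (`toy_encoded_labels`), assuming the labels are non-negative integers as produced by the label encoder.

```python
import numpy as np

# Made-up encoded labels: 0 = ham, 1 = spam
toy_encoded_labels = np.array([0, 0, 1, 0, 1, 0])

# Index i of the result holds the number of samples with label i
per_class = np.bincount(toy_encoded_labels, minlength=2)

print("Label 0 =", per_class[0], "Label 1 =", per_class[1])
```

Applied to the notebook's arrays, this would be `np.bincount(Y_train_sm)`.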

In [None]:
print(X_train_sm[100], Y_train_sm[100])

[0. 0. 0. ... 0. 0. 0.] 0


## 1. Building a neural network

In [None]:
X_train_sm.shape

(7706, 48159)

In [None]:
from keras.models import Sequential
# Only the layers used below are imported; keras.layers.core and np_utils are not needed
from keras.layers import Dense, Dropout

In [None]:
batch_size = 32

In [None]:
n_cols = X_train_sm.shape[1]

model = Sequential([
                    Dense(512, activation = 'relu', input_shape =(n_cols, )),
                    Dropout(0.3),
                    Dense(256, activation = 'relu'),
                    Dropout(0.3),
                    Dense(1, activation = 'sigmoid')
])
model.compile(
    optimizer = 'adam', 
    loss ='binary_crossentropy', 
    metrics = ['accuracy']
)

model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 512)               24657920  
                                                                 
 dropout (Dropout)           (None, 512)               0         
                                                                 
 dense_1 (Dense)             (None, 256)               131328    
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 257       
                                                                 
Total params: 24,789,505
Trainable params: 24,789,505
Non-trainable params: 0
_________________________________________________________________


### 1.1 Training the NN model

In [None]:
model.fit(X_train_sm, Y_train_sm, epochs = 5)
model.evaluate(X_test, Y_test)

## 2. Naive Bayes Classifier Approach

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

model2 = MultinomialNB()
model2.fit(X_train_sm,Y_train_sm)

In [None]:
model2.fit(X_train_sm, Y_train_sm)  # labels must come from the same SMOTE-resampled set as the features
Y_test_pred = model2.predict(X_test)

In [None]:
print("Test accuracy:",metrics.accuracy_score(Y_test, Y_test_pred))
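
Because the test set keeps the original class imbalance, accuracy alone can look good even when many spam messages are missed. As a hedged sketch, the snippet below computes a confusion matrix, precision, and recall on made-up labels (`toy_true`, `toy_pred`); these are not results from the notebook's model.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 0 = ham, 1 = spam
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)
toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_precision = precision_score(toy_true, toy_pred)  # of predicted spam, how much is spam
toy_recall = recall_score(toy_true, toy_pred)        # of real spam, how much was caught

print(toy_cm)
print(round(toy_accuracy, 3), toy_precision, toy_recall)
```

On the notebook's data the same calls would take `Y_test` and `Y_test_pred`.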

## 3. The BERT model

In [None]:
email_data['spam_or_not'] = email_data['Category'].apply(lambda x: 1 if x=='spam' else 0)

In [None]:
X_train3, X_test3, Y_train3, Y_test3 = train_test_split(email_data['Message'], email_data['spam_or_not'], test_size = 0.2, stratify=email_data['spam_or_not'])

In [None]:
bert_preprocess = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
bert_encode = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4')

In [None]:
input_text = tf.keras.layers.Input(shape=(), dtype=tf.string, name='sentences')
preprocessed_text = bert_preprocess(input_text)
outputs2 = bert_encode(preprocessed_text)

hl1 = tf.keras.layers.Dense(512, activation='relu', name='output1')(outputs2['pooled_output'])
hl2 = tf.keras.layers.Dropout(0.3, name='drp1')(hl1)
hl3 = tf.keras.layers.Dense(256, activation='relu', name='output2')(hl2)
hl4 = tf.keras.layers.Dropout(0.3, name='drp2')(hl3)
hl5 = tf.keras.layers.Dense(1, activation='sigmoid', name='final_output')(hl4)

model3 = tf.keras.Model(inputs=[input_text], outputs = [hl5])
model3.summary()

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 sentences (InputLayer)         [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_word_ids':   0           ['sentences[0][0]']              
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128)}                                                

In [None]:
model3.compile(
    optimizer='adam',
    loss = 'binary_crossentropy',
    metrics=['accuracy']
)

In [None]:
model3.fit(
    X_train3, 
    Y_train3, 
    epochs=10,
    verbose=1
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fe7c3241210>

In [None]:
model3.evaluate(X_test3, Y_test3)



[0.1068500503897667, 0.9641255736351013]

In [None]:
cust_data = [
             'Text UWUWUW to 436732 now. Your chance to win $1000 everyday !Limited ^6#57 time offer',
             'Will be there in 10 mins',
             'Unlimited s$67** calls postpaid and prepaid. Message KYC id to 797309',
             'Can you please let me know about linear regression Sir?',
             'Let\'s go on a trip to Big Sur',
             'Send it to 708703. First 500 customers get free service till 18 February',
             'You are eligible to win 280000$ ! Send credit card CVV Aëüde29 to dark.heck@gmail.com. Totally not suspicious ;)',
             'Please let me know the client requirements by EOD positively'

]
model3.predict(cust_data)
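
Since the final layer is a single sigmoid unit, `model3.predict` returns one score between 0 and 1 per message. A minimal sketch of turning such scores into labels follows; the numbers in `example_scores` are invented for illustration and are not the model's output, and the 0.5 cut-off is an assumption.

```python
import numpy as np

# Invented scores with shape (n_messages, 1), as a single-unit sigmoid output would have
example_scores = np.array([[0.91], [0.03], [0.62], [0.08]])

threshold = 0.5  # assumed cut-off; 1 = spam, 0 = ham as in the spam_or_not column
predicted_labels = (example_scores.ravel() > threshold).astype(int)
label_names = np.where(predicted_labels == 1, "spam", "ham")

print(predicted_labels.tolist())
print(label_names.tolist())
```

With the real model, `example_scores` would be replaced by the array returned from `model3.predict(cust_data)`.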