# **Assignment 1: Email Spam Detection**

Members:

CASSANDRA

HUDA

DONGDAO


---




This report studies email spam filtering with machine learning models. The dataset ‘Data2.csv’ is used to train and evaluate the filter. Two supervised learning algorithms, Naïve Bayes and the Support Vector Machine (SVM), are used to classify each message as spam or ham.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Load data
data = pd.read_csv("Data2.csv", encoding="ISO-8859-1")  # read_csv already returns a DataFrame
data.shape

(5572, 2)

In [None]:
# Removing duplicates
data.drop_duplicates(inplace = True)
data.shape

(5157, 2)

In [None]:
# Check number of missing data
data.isnull().sum()

Category    0
Message     0
dtype: int64

The first step after importing the data is pre-processing. Stop words such as “the” and “a” are removed because they carry little meaning for classification. The cleaning process also converts all letters to lowercase and tokenises the text, that is, splits each message into separate words (tokens). Lemmatization, which reduces each word to its root form using a dictionary-based approach, is a further cleaning step; note that the function in the next cell does not perform it.
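The cleaning steps described above can be sketched on a single made-up message. This is only an illustration: the `STOP_WORDS` set and the `preprocess` name are hypothetical stand-ins, and the notebook itself relies on NLTK's English stop-word list rather than this tiny one.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "is", "to", "you"}

def preprocess(text):
    """Strip punctuation, lowercase, split on whitespace, drop stop words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

# Made-up example message, not taken from Data2.csv.
tokens = preprocess("Free entry! Text WIN to 80086, the prize is a holiday.")
print(tokens)
```

The message is reduced to the tokens `free`, `entry`, `text`, `win`, `80086`, `prize` and `holiday`; the punctuation and the stop words "to", "the", "is" and "a" are gone.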

In [None]:
# Text cleaning and tokenisation
STOP_WORDS = set(stopwords.words('english'))  # build the set once, not once per word

def process_msg(msg):

  # Remove punctuation characters
  nopunc = ''.join(char for char in msg if char not in string.punctuation)

  # Convert to lowercase
  nopunc = nopunc.lower()

  # Split on whitespace and drop stop words
  clean_msg = [word for word in nopunc.split() if word not in STOP_WORDS]

  return clean_msg

In [None]:
data['Message'].head().apply(process_msg)

0       [go, jurong, point, crazy, available, bugis, n...
1                          [ok, lar, joking, wif, u, oni]
2       [free, entry, 2, wkly, comp, win, fa, cup, fin...
3           [u, dun, say, early, hor, u, c, already, say]
4       [nah, dont, think, goes, usf, lives, around, t...
                              ...                        
5567    [2nd, time, tried, 2, contact, u, u, â£750, po...
5568                  [ã¼, b, going, esplanade, fr, home]
5569                     [pity, mood, soany, suggestions]
5570    [guy, bitching, acted, like, id, interested, b...
5571                                   [rofl, true, name]
Name: Message, Length: 5572, dtype: object

During pre-processing, each message is broken into words, which makes the next step, vectorisation, possible. Vectorisation converts text into numerical data: it identifies the distinct words across all messages and counts how often each one occurs in each message. This step is required because the learning algorithms operate on numbers, not on words.
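To make the idea of counting distinct words concrete, here is a minimal bag-of-words sketch in plain Python. The three toy messages and the `to_count_vector` helper are invented for illustration; the notebook uses scikit-learn's `CountVectorizer`, which produces a sparse matrix of the same kind of counts.

```python
from collections import Counter

# Toy tokenised messages (made up for illustration, not taken from Data2.csv).
docs = [
    ["free", "prize", "free"],
    ["call", "home"],
    ["free", "call"],
]

# The vocabulary maps each distinct word to a column index (sorted for stability).
vocab = {word: idx for idx, word in enumerate(sorted({w for d in docs for w in d}))}

def to_count_vector(tokens):
    """Return per-word counts for one message, ordered by vocabulary index."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

matrix = [to_count_vector(d) for d in docs]
print(vocab)   # {'call': 0, 'free': 1, 'home': 2, 'prize': 3}
print(matrix)  # [[0, 2, 0, 1], [1, 0, 1, 0], [1, 1, 0, 0]]
```

Each row is one message and each column is one vocabulary word, so the first message, which contains "free" twice and "prize" once, becomes `[0, 2, 0, 1]`.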

In [None]:
# Vectorization

from sklearn.feature_extraction.text import CountVectorizer

messages_bow = CountVectorizer(analyzer = process_msg).fit_transform(data['Message'])

In [None]:
# Split data into 80% training and 20% testing data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(messages_bow,data['Category'], test_size = 0.2, random_state=0)

Naïve Bayes is one of the most widely used supervised learning methods for classification. The classifier is called naive because it assumes that the features (here, the words in a message) are independent of one another given the class. Under this assumption, the probability of a whole message is the product of the probabilities of its individual words. Naïve Bayes classifiers have been widely employed in sentiment categorization because they require less computational power than many other algorithms, although the independence assumption can lead to inaccurate results.
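The product-of-word-probabilities idea can be illustrated with a small from-scratch multinomial Naïve Bayes. This is a sketch under stated assumptions, not the notebook's implementation: the training messages are invented, Laplace smoothing with `alpha=1.0` is assumed, and the helper names `log_posterior` and `predict` are hypothetical. Logarithms are summed instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from collections import Counter

# Toy labelled training messages (made up for illustration).
train = [
    (["free", "prize", "win"], "spam"),
    (["free", "entry"], "spam"),
    (["see", "you", "home"], "ham"),
    (["call", "home", "later"], "ham"),
    (["win", "later"], "ham"),
]

labels = sorted({label for _, label in train})
vocab = {w for tokens, _ in train for w in tokens}
word_counts = {c: Counter() for c in labels}   # word frequencies per class
doc_counts = Counter()                         # number of messages per class
for tokens, label in train:
    word_counts[label].update(tokens)
    doc_counts[label] += 1

def log_posterior(tokens, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with Laplace smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for w in tokens:
        if w not in vocab:
            continue  # ignore words never seen in training
        score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    """Return the class with the highest (unnormalised) log-posterior."""
    return max(labels, key=lambda c: log_posterior(tokens, c))

print(predict(["free", "prize"]))  # spam
print(predict(["call", "home"]))   # ham
```

The word-by-word sum in `log_posterior` is where the independence assumption enters: each word contributes to the score separately, regardless of which other words appear alongside it.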

In [None]:
# Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifierNB = MultinomialNB()
classifierNB.fit(x_train, y_train)

MultinomialNB()

In [None]:
# Naive Bayes Model assessment for train data set
print('Naive Bayes Model Assessment\nTrain data set:\n')

# Predictions
print('Predictions:\t', classifierNB.predict(x_train))

# Actual values
print('Actual values:\t', y_train.values)
print()

# Model assessment results
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
prediction = classifierNB.predict(x_train)
print(classification_report(y_train, prediction))
print('Confusion Matrix:\n', confusion_matrix(y_train, prediction))
print('\nAccuracy:', accuracy_score(y_train, prediction)*100, '%')

Naive Bayes Model Assessment
Train data set:

Predictions:	 ['ham' 'ham' 'ham' ... 'spam' 'ham' 'ham']
Actual values:	 ['ham' 'ham' 'ham' ... 'spam' 'ham' 'ham']

              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      3870
        spam       0.98      0.97      0.98       587

    accuracy                           0.99      4457
   macro avg       0.99      0.98      0.99      4457
weighted avg       0.99      0.99      0.99      4457

Confusion Matrix:
 [[3860   10]
 [  18  569]]

Accuracy: 99.37177473636976 %


In [None]:
# Model assessment for test data set
print('Naive Bayes Model Assessment\nTest data set:\n')

# Predictions
print('Predictions:\t', classifierNB.predict(x_test))

# Actual values
print('Actual values:\t', y_test.values)
print()

# Model assessment results
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
prediction = classifierNB.predict(x_test)
print(classification_report(y_test, prediction))
print('Confusion Matrix:\n', confusion_matrix(y_test, prediction))
print('\nAccuracy:', accuracy_score(y_test, prediction)*100, '%')

Naive Bayes Model Assessment
Test data set:

Predictions:	 ['ham' 'spam' 'ham' ... 'ham' 'ham' 'ham']
Actual values:	 ['ham' 'spam' 'ham' ... 'ham' 'spam' 'ham']

              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       955
        spam       0.91      0.96      0.93       160

    accuracy                           0.98      1115
   macro avg       0.95      0.97      0.96      1115
weighted avg       0.98      0.98      0.98      1115

Confusion Matrix:
 [[939  16]
 [  7 153]]

Accuracy: 97.9372197309417 %


The support vector machine (SVM) analyses the data to find a decision boundary that separates the classes, and uses kernel functions so that the required computations can be carried out in the input space. SVMs are used for both classification and regression and are grounded in statistical learning theory. The boundary is determined by the training examples that lie closest to it, known as the support vectors.
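The classifier in the next cell is created with `kernel='rbf'`. As a rough illustration of what such a kernel computes, the sketch below evaluates the radial basis function k(x, y) = exp(-gamma * ||x - y||^2) on toy vectors. The vectors and the value `gamma = 0.1` are arbitrary choices made for this example, not the settings scikit-learn would pick for this dataset.

```python
import math

def rbf_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Toy word-count vectors (hypothetical); gamma is an arbitrary illustrative value.
a = [2, 0, 1]
b = [2, 0, 1]
c = [0, 3, 0]
gamma = 0.1

same = rbf_kernel(a, b, gamma)  # identical vectors give similarity 1.0
diff = rbf_kernel(a, c, gamma)  # squared distance 14 gives exp(-1.4)
print(same, diff)
```

The kernel value is 1.0 for identical vectors and shrinks towards 0 as the vectors move apart, so it acts as a similarity score between two messages.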

In [None]:
# Support Vector Machine (SVM) Classifier
from sklearn.svm import SVC
classifierSVM = SVC(kernel='rbf',random_state=0)
classifierSVM.fit(x_train, y_train)

SVC(random_state=0)

In [None]:
# SVM Model assessment for train data set
print('SVM Model Assessment\nTrain data set:\n')

# Predictions
print('Predictions:\t', classifierSVM.predict(x_train))

# Actual values
print('Actual values:\t', y_train.values)
print()

# Model assessment results
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
prediction = classifierSVM.predict(x_train)
print(classification_report(y_train, prediction))
print('Confusion Matrix:\n', confusion_matrix(y_train, prediction))
print('\nAccuracy:', accuracy_score(y_train, prediction)*100, '%')

SVM Model Assessment
Train data set:

Predictions:	 ['ham' 'ham' 'ham' ... 'spam' 'ham' 'ham']
Actual values:	 ['ham' 'ham' 'ham' ... 'spam' 'ham' 'ham']

              precision    recall  f1-score   support

         ham       0.99      1.00      1.00      3870
        spam       1.00      0.96      0.98       587

    accuracy                           0.99      4457
   macro avg       1.00      0.98      0.99      4457
weighted avg       0.99      0.99      0.99      4457

Confusion Matrix:
 [[3870    0]
 [  25  562]]

Accuracy: 99.43908458604443 %


In [None]:
# Model assessment for test data set
print('SVM Model Assessment\nTest data set:\n')

# Predictions
print('Predictions:\t', classifierSVM.predict(x_test))

# Actual values
print('Actual values:\t', y_test.values)
print()

# Model assessment results
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
prediction = classifierSVM.predict(x_test)
print(classification_report(y_test, prediction))
print('Confusion Matrix:\n', confusion_matrix(y_test, prediction))
print('\nAccuracy:', accuracy_score(y_test, prediction)*100, '%')

SVM Model Assessment
Test data set:

Predictions:	 ['ham' 'spam' 'ham' ... 'ham' 'ham' 'ham']
Actual values:	 ['ham' 'spam' 'ham' ... 'ham' 'spam' 'ham']

              precision    recall  f1-score   support

         ham       0.97      1.00      0.99       955
        spam       1.00      0.84      0.91       160

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115

Confusion Matrix:
 [[955   0]
 [ 26 134]]

Accuracy: 97.66816143497758 %


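The Naïve Bayes classifier achieved an accuracy of 99.37% on the train data and 97.94% on the test data. The SVM classifier achieved an accuracy of 99.44% on the train data and 97.67% on the test data. The SVM classifier is therefore marginally ahead on the train data, while the Naïve Bayes classifier is marginally ahead on the test data; in both cases the difference is well under one percentage point. The two models differ more clearly in the kind of error they make on the test data: the SVM classifier flagged no ham messages as spam (spam precision 1.00) but missed 26 of the 160 spam messages (spam recall 0.84), whereas the Naïve Bayes classifier missed only 7 spam messages (spam recall 0.96) at the cost of 16 ham messages flagged as spam (spam precision 0.91). In conclusion, the experiment results show that both models detect spam with similar overall accuracy, and they do not demonstrate a clear accuracy advantage for the SVM model.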
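The test-set figures can be recomputed directly from the two test confusion matrices printed earlier. The sketch below assumes scikit-learn's layout, in which rows are the actual classes and columns the predicted classes, ordered ham then spam; the `metrics` helper is a hypothetical name introduced for this example.

```python
def metrics(cm):
    """Accuracy plus spam precision/recall from a 2x2 matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "spam_precision": tp / (tp + fp),
        "spam_recall": tp / (tp + fn),
    }

# Test-set confusion matrices copied from the outputs above.
nb = metrics([[939, 16], [7, 153]])
svm_scores = metrics([[955, 0], [26, 134]])

for name, m in (("Naive Bayes", nb), ("SVM", svm_scores)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

This gives a test accuracy of 1092/1115 (about 97.94%) for Naïve Bayes and 1089/1115 (about 97.67%) for the SVM, with spam recall of about 0.956 and 0.838 respectively.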