# Email Spam Detection with Machine Learning NLP

Email spam, also known as junk email, consists of unsolicited messages sent in bulk by email (spamming).

In this data science project, I will demonstrate how to use machine learning, Natural Language Processing (NLP), and Python to detect email spam.

The program analyzes each email and labels it as spam (1) or not spam (0).

Importing the libraries

In [50]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string

Load the data and print the first 5 rows:



In [51]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [52]:
df = pd.read_csv("/content/sample_data/Spam.csv", encoding="ISO-8859-1")


In [53]:
df.head()

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1   Subject: the stock trading gunslinger fanny i...     1
2   Subject: unbelievable new homes made easy im ...     1
3   Subject: 4 color printing special request add...     1
4  Subject: do not have money , get software cds ...     1


Exploring the data and printing the shape of the data

In [54]:
df.shape

(5728, 2)

Now let's see the column names of the dataframe.

In [55]:
df.columns

Index(['text', 'spam'], dtype='object')

Checking Duplicates and removing them

In [56]:
df.drop_duplicates(inplace=True)
print(df.shape)

(5695, 2)


Checking Missing Values from each column

In [57]:
print(df.isnull().sum())

text    0
spam    0
dtype: int64


Downloading the stop words. In natural language processing, stop words are very common words (such as "the", "is", and "and") that carry little meaning on their own and are usually filtered out.


In [58]:
# download the stopwords package
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Next, we will define a function that takes in a text as input, performs text cleaning operations, and returns the resulting tokens. The cleaning process involves removing any punctuation from the text, followed by eliminating the stop words that add no significant meaning to the text.

In [59]:
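# cache the English stop words once instead of reloading them for every word
STOP_WORDS = set(stopwords.words('english'))

def process(text):
    # remove punctuation characters
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    # drop stop words (case-insensitive) and return the remaining tokens
    clean = [word for word in nopunc.split() if word.lower() not in STOP_WORDS]
    return clean

# to show the tokenization
df['text'].head().apply(process)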

0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object

Converting the text into a matrix of token counts:


In [60]:
from sklearn.feature_extraction.text import CountVectorizer
message = CountVectorizer(analyzer=process).fit_transform(df['text'])
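
To make "matrix of token counts" concrete, here is a minimal pure-Python sketch of the idea. The two toy messages are hypothetical and are not taken from Spam.csv, and this is only an illustration of the concept, not how CountVectorizer is implemented (it stores the result as a sparse matrix).

```python
from collections import Counter

# Hypothetical toy messages, already tokenized (not from Spam.csv).
docs = [
    ["free", "money", "free", "offer"],
    ["meeting", "agenda", "money"],
]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({token for doc in docs for token in doc})

# One row per message, one column per vocabulary token.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[token] for token in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row is one message and each column is one token, so the first row records that "free" appears twice in the first message.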

Next, we need to split the data into separate training and testing sets. The model is trained on the training set, and the testing set is held back so that we can check whether the predicted values match the actual values.

In [62]:
#split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(message, df['spam'], test_size=0.20, random_state=0)
# To see the shape of the data
print(message.shape)

(5695, 37229)
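
As a rough illustration of what an 80/20 holdout split does, the sketch below shuffles row indices with a fixed seed and sets aside one fifth of them as the test set. The helper `holdout_split` is hypothetical; it is not how `train_test_split` is implemented and will not reproduce the same rows, it only shows the idea and the resulting sizes.

```python
import math
import random

def holdout_split(rows, test_size=0.20, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A fixed seed makes the shuffle, and so the split, repeatable.
    random.Random(seed).shuffle(shuffled)
    # Round the test size up to a whole number of rows.
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# 5695 is the number of rows left after removing duplicates.
train_rows, test_rows = holdout_split(range(5695))
print(len(train_rows), len(test_rows))
```

With 5,695 rows this gives 4,556 training rows and 1,139 test rows, the same sizes that appear as the support totals in the evaluation output.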


The next step is to create and train a Multinomial Naive Bayes classifier, which is well suited to classification with discrete features such as word counts.

In [63]:
# create and train the Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(xtrain, ytrain)
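
For intuition about what the classifier learns, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. The functions `train_nb` and `predict_nb` and the toy messages are hypothetical and are only meant to show the idea; they are not scikit-learn's implementation and will not match its internals.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # alpha is added to every token count so no probability is zero.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log score for a tokenized message."""
    scores = {}
    for c in log_prior:
        # Tokens that never appeared in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][t] for t in doc if t in log_like[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy data: 1 = spam, 0 = not spam.
toy_docs = [
    ["free", "money", "offer"],
    ["free", "prize", "money"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "schedule"],
]
toy_labels = [1, 1, 0, 0]

log_prior, log_like = train_nb(toy_docs, toy_labels)
spam_guess = predict_nb(["free", "money", "now"], log_prior, log_like)
ham_guess = predict_nb(["meeting", "schedule"], log_prior, log_like)
print(spam_guess, ham_guess)
```

The model scores a message by adding the log prior of each class to the log likelihoods of its tokens, then picks the class with the larger total.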

Checking the classifier's predictions against the actual values on the training data set:

In [64]:
print(classifier.predict(xtrain))
print(ytrain.values)

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]


Now, we'll evaluate the performance of our Naive Bayes classifier on the training data by examining the classification report, confusion matrix, and accuracy score.





In [65]:
# Evaluating the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtrain)
print(classification_report(ytrain, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: \n", accuracy_score(ytrain, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556


Confusion Matrix: 
 [[3445   12]
 [   1 1098]]
Accuracy: 
 0.9971466198419666


The model reaches an accuracy of 99.71% on the training data. Because the model has already seen these emails, this figure alone does not show how well it generalizes. To verify that it can classify unseen email text, we will evaluate it on the test data set (xtest and ytest), starting by displaying the predicted values next to the actual values.

In [66]:
#print the predictions
print(classifier.predict(xtest))
#print the actual values
print(ytest.values)

[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]


Evaluating the model on the test data set:



In [67]:
# Evaluating the model on the test data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtest)
print(classification_report(ytest, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytest, pred))
print("Accuracy: \n", accuracy_score(ytest, pred))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139


Confusion Matrix: 
 [[862   8]
 [  1 268]]
Accuracy: 
 0.9920983318700615


With an accuracy of 99.2% on the test data, the classifier correctly classified 1,130 of the 1,139 test emails: it flagged 8 legitimate emails as spam and missed 1 spam email.
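
The figures in the test report can be recomputed by hand from the test confusion matrix. The short sketch below does this for the spam class (1), assuming the usual scikit-learn layout in which rows are actual classes and columns are predicted classes.

```python
# Test-set confusion matrix from the output above:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
tn, fp = 862, 8    # actual not spam: predicted not spam / predicted spam
fn, tp = 1, 268    # actual spam:     predicted not spam / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The rounded values (precision 0.97, recall 1.00, f1 0.98, accuracy 0.9921) agree with the row for class 1 and the accuracy line in the report.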



