<a href="https://colab.research.google.com/github/cocolovett/Spam-Emails-Classifier/blob/main/Spam_Emails_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Emails Classifier

---



*This notebook trains a model to differentiate between **spam** and **ham** (legitimate) emails. Once the model has learned to recognise likely spam, we can give it new messages to classify. This is an early step towards automated detection of spam emails.*



---



Import the libraries required for the notebook.

In [None]:
import pandas as pd
import numpy as np
import nltk
import time
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

*WordNetLemmatizer* and *stopwords* are used to preprocess the raw text in the dataset. The *sklearn* modules will be used during model building.

In [None]:
# This step allows us to upload our dataset into the notebook/model
from google.colab import files
uploaded = files.upload()

Saving Spam Email raw text for NLP.csv to Spam Email raw text for NLP.csv


In [None]:
import io
df = pd.read_csv(io.BytesIO(uploaded['Spam Email raw text for NLP.csv']))
# The dataset is now stored in a pandas DataFrame for use throughout the notebook
# Display the first 5 rows of the dataset
df.head()

Unnamed: 0,CATEGORY,MESSAGE,FILE_NAME
0,1,"Dear Homeowner,\n\n \n\nInterest Rates are at ...",00249.5f45607c1bffe89f60ba1ec9f878039a
1,1,ATTENTION: This is a MUST for ALL Computer Use...,00373.ebe8670ac56b04125c25100a36ab0510
2,1,This is a multi-part message in MIME format.\n...,00214.1367039e50dc6b7adb0f2aa8aba83216
3,1,IMPORTANT INFORMATION:\n\n\n\nThe new domain n...,00210.050ffd105bd4e006771ee63cabc59978
4,1,This is the bottom line. If you can GIVE AWAY...,00033.9babb58d9298daa2963d4f514193d7d6


As we can see from the first five rows, the dataset includes a FILE_NAME column that may refer either to the email itself or to an attachment. As we don't know what it refers to, we will treat it as irrelevant and *drop* it.

In [None]:
df.drop('FILE_NAME', axis=1, inplace=True)

Next, we would like to know how many emails we have in each category, and therefore in total. This will help us later when we decide the size of our training dataset compared to our testing dataset.

In [None]:
df.CATEGORY.value_counts()

0    3900
1    1896
Name: CATEGORY, dtype: int64
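
As a quick sanity check on these counts, the sketch below works out the class shares and the accuracy that a trivial "always predict the majority class" classifier would reach. It assumes that `1` marks spam and `0` marks ham, which is what the preview rows and the final prediction cell suggest; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
# Assumption: 1 = spam, 0 = ham.
counts = {0: 3900, 1: 1896}

total = sum(counts.values())
spam_share = counts[1] / total
# Accuracy of a classifier that gives every email the majority label.
majority_baseline = max(counts.values()) / total

print(f"total emails: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any model worth keeping should clearly beat that baseline accuracy.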

Now that we know the size of our dataset, we need to download the 'stopwords' list: commonly occurring words in the English language that carry little meaning on their own. Then we're going to download WordNet, a database of English words and their semantic relationships, which the lemmatizer relies on. We'll store the stopword list and the lemmatizer in new variables to use later.

In [None]:
nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
nltk.download('wordnet')
# Lemmatization is a technique used in NLP to reduce inflected forms of words to their dictionary form (lemma)
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


Once preprocessing starts, we're going to need our cleaned messages to be stored somewhere. So, we're going to create an empty list to store them.

In [None]:
corpus=[]

Now that we have the empty list, let's start filling it with the preprocessed emails. For each email, we need the code to do the following:


*   remove all non-alphanumeric characters
*   convert the text to lowercase
*   split the text into words
*   remove the stopwords and lemmatize the words
*   convert the words back to sentences
*   add the message to the corpus list





In [None]:
# Build the stopword set once, rather than once per word
stop_words = set(stopword)

for i in range(len(df)):
    # remove all non-alphanumeric characters
    message = re.sub('[^a-zA-Z0-9]', ' ', df['MESSAGE'][i])

    # convert the text to lowercase
    message = message.lower()

    # split the text into words for lemmatization
    message = message.split()

    # remove stopwords and lemmatize the remaining words
    message = [lemmatizer.lemmatize(word) for word in message
               if word not in stop_words]

    # Convert the words back into sentences
    message = ' '.join(message)

    # Add the message to the corpus list
    corpus.append(message)
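
The same cleaning steps can be wrapped in a function so that any new message is prepared exactly like the training emails. This is a minimal sketch rather than the notebook's own code: `clean_message` is a hypothetical helper, and the tiny default stopword set and the do-nothing `lemmatize` default are placeholders for `stopwords.words('english')` and `lemmatizer.lemmatize`.

```python
import re

# Placeholder stopword set for illustration only; in the notebook you would
# pass set(stopwords.words('english')) instead.
DEMO_STOPWORDS = frozenset({"you", "have", "been", "for", "the", "of", "our"})

def clean_message(text, stop_words=DEMO_STOPWORDS, lemmatize=lambda word: word):
    """Apply the notebook's cleaning steps to one message (hypothetical helper)."""
    # Replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Lowercase the text and split it into words
    words = text.lower().split()
    # Drop stopwords and lemmatize what is left
    kept = [lemmatize(word) for word in words if word not in stop_words]
    # Join the words back into a single string
    return " ".join(kept)

print(clean_message("Hello, You have been selected for the whitelist of our NFT drop."))
# prints: hello selected whitelist nft drop
```

Inside the notebook, the call would look like `clean_message(text, set(stopwords.words('english')), lemmatizer.lemmatize)`.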

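Next, we're going to convert the cleaned text into numbers. TF-IDF counts how often each word or phrase (here, sequences of one to three words) appears in an email and gives it a *weight*, which is lower for terms that are common across the whole dataset. We keep only the 2,500 most frequent terms. This gives the model a numerical measure of how characteristic each term is of a message.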

In [None]:
tf = TfidfVectorizer(ngram_range=(1,3), max_features=2500)
X = tf.fit_transform(corpus).toarray()
y = df['CATEGORY']
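
To make the *weight* concrete, here is a small sketch of the TF-IDF calculation on a made-up three-message corpus. It uses the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, but it handles single words only and is not the library's implementation. The corpus and the function name are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({word for words in tokenised for word in words})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for words in tokenised for word in set(words))
    # Smoothed inverse document frequency
    idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}
    vectors = []
    for words in tokenised:
        term_counts = Counter(words)
        raw = [term_counts[word] * idf[word] for word in vocab]
        # Scale each vector to unit length; guard against an empty document
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        vectors.append([value / norm for value in raw])
    return vocab, vectors

# Made-up corpus, already cleaned
toy_corpus = ["free money free prize", "meeting agenda money", "free meeting tomorrow"]
vocab, vectors = tfidf_vectors(toy_corpus)
for word, weight in zip(vocab, vectors[0]):
    print(f"{word:10s} {weight:.3f}")
```

In the first toy message, "free" appears twice and so outweighs "money", while words that do not occur in the message get a weight of zero.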



---



Now for the fun stuff!

So far, we've just *cleaned* our data and turned it into numbers. We've removed all the *and*s, *if*s, and *but*s from these emails, but we've not actually gotten to the part where we begin to teach the model. Now we need to split the data into two separate datasets: our training set and our test set. Our training set will *teach* the model, while the test set will, as you can guess, *test* the model on emails it has not seen to make sure it works as it should.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
  X, y, test_size=0.30, random_state=1, stratify=y)
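
The `stratify=y` argument keeps the spam-to-ham ratio the same in both sets. The sketch below shows the idea using only the standard library; `stratified_split` is a hypothetical helper that does its own rounding per class, not scikit-learn's algorithm, and the toy labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.30, seed=1):
    """Return (train_indices, test_indices), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy labels: 70 ham (0) and 30 spam (1)
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels)
spam_in_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), spam_in_test)
# prints: 70 30 9
```

The test set ends up 30% spam, the same share as the full toy dataset.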

In [None]:
# initialise a Naive Bayes model using the scikit-learn MultinomialNB class
model = MultinomialNB()

Now we're going to fit the model to the training data. This is where the model reads both the messages and their correct classifications, and learns which terms are associated with each class.

In [None]:
model.fit(x_train, y_train)
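
Fitting a multinomial Naive Bayes model comes down to counting how often each word appears in each class. The sketch below is a from-scratch illustration on a made-up four-message corpus, using raw word counts and Laplace smoothing (alpha = 1, which is also the `MultinomialNB` default). It is not scikit-learn's implementation, and the function names are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Return (log_prior, log_likelihood) dictionaries keyed by class label."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    log_prior, log_likelihood = {}, {}
    for label in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == label]
        # Prior: share of training messages that belong to this class
        log_prior[label] = math.log(len(class_docs) / len(docs))
        counts = Counter(word for doc in class_docs for word in doc.split())
        total = sum(counts.values()) + alpha * len(vocab)
        # Smoothed probability of each vocabulary word within this class
        log_likelihood[label] = {
            word: math.log((counts[word] + alpha) / total) for word in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log-probability score; unknown words are ignored."""
    scores = {}
    for label, prior in log_prior.items():
        word_scores = log_likelihood[label]
        scores[label] = prior + sum(
            word_scores[word] for word in doc.split() if word in word_scores
        )
    return max(scores, key=scores.get)

# Made-up training data. Assumption: 1 = spam, 0 = ham.
toy_docs = ["free money now", "win free prize", "meeting agenda attached", "lunch meeting tomorrow"]
toy_labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_nb(toy_docs, toy_labels)

print(predict_nb("free prize now", log_prior, log_likelihood))    # prints: 1
print(predict_nb("meeting tomorrow", log_prior, log_likelihood))  # prints: 0
```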

In the original dataset, each email was given a value of 0 or 1, with 1 being spam and 0 being not spam (ham). All of these emails have been evaluated by a person who has deemed them as such. This will be our baseline to understand whether the model works or not.

In [None]:
# Make predictions on the training and testing sets
train_pred = model.predict(x_train)
test_pred = model.predict(x_test)

Now we compare the model's predictions with the correct answers. Imagine that the model read each dataset with the column containing the correct classifications hidden. Now that it has given its own classifications, we check its answers against that column to get precision, recall and accuracy scores for both the training set and the test set.

In [None]:
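# classification_report expects the true labels first, then the predictions.
# With the arguments reversed, the precision and recall columns swap and
# "support" counts predicted labels, which is how the tables below read.
print(classification_report(y_train, train_pred))
print(classification_report(y_test, test_pred))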

              precision    recall  f1-score   support

           0       0.99      0.95      0.97      2855
           1       0.89      0.98      0.94      1202

    accuracy                           0.96      4057
   macro avg       0.94      0.97      0.95      4057
weighted avg       0.96      0.96      0.96      4057

              precision    recall  f1-score   support

           0       0.99      0.95      0.97      1221
           1       0.90      0.98      0.94       518

    accuracy                           0.96      1739
   macro avg       0.94      0.97      0.96      1739
weighted avg       0.96      0.96      0.96      1739
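
Each row of a report like this is derived from counts of true positives, false positives and false negatives for one class. The sketch below computes precision, recall and F1 by hand for a made-up set of labels; `precision_recall_f1` is a hypothetical helper written for illustration and not part of scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything flagged as positive, how much really was
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything that really was positive, how much was flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels. Assumption: 1 = spam, 0 = ham.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true_demo, y_pred_demo)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=1.00 recall=0.67 f1=0.80
```

Here every email flagged as spam really was spam (precision 1.00), but one of the three spam emails slipped through (recall 0.67).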



OK! So, now that we know the model's accuracy, we can test it out. The next cell uses text that I've copied and pasted from an email that came into my personal inbox. Let's see whether the model thinks it is spam...

In [None]:
print('Predicting... \n')

message = ["Hello, You have been selected for the whitelist of our NFT drop."]

message_vector = tf.transform(message)
category = model.predict(message_vector)

# pause for dramatic effect
time.sleep(5)

# print the outcome of the model's prediction
# predict returns an array with one entry per message, so we read the first entry
print("The message is", "spam!" if category[0] == 1 else "ham and not spam")

Predicting... 

The message is ham and not spam
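
To try several messages without repeating the cell, the prediction steps can be wrapped in a small function. This is a sketch under assumptions: `predict_label` is a hypothetical helper, it expects any fitted objects that expose `transform` and `predict` (such as `tf` and `model` above), it treats `1` as spam, and the optional `clean` argument stands for whatever preprocessing was applied to the training emails.

```python
def predict_label(text, vectorizer, model, clean=lambda s: s):
    """Return 'spam' or 'ham' for one raw message (hypothetical helper)."""
    # Apply the same cleaning as the training data, then vectorise
    features = vectorizer.transform([clean(text)])
    # predict returns one label per message; we passed exactly one
    prediction = model.predict(features)[0]
    # Assumption: 1 = spam, 0 = ham
    return "spam" if prediction == 1 else "ham"

# Example call inside the notebook, once tf and model have been fitted:
# print(predict_label("You have been selected for our NFT drop.", tf, model))
```

Passing a cleaning function that mirrors the training preprocessing keeps new messages consistent with what the vectorizer saw during fitting.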
