In [1]:
%%html
<style>
table {float:left}
</style>

# Email spam detection using scikit-learn

The process we will use to train a predictive model that detects whether an email is ham or spam is as follows:
 * We will read the dataset, which contains about 5,700 different spam/ham emails
 * We will transform this textual dataset into numeric vectors that our models can use
 * We will train a model to predict if an email is ham/spam
 * We will evaluate the model

## 1) Data processing
We will read the file "emails.csv.zip" containing all our spam/ham emails. The dataset contains a total of 5728 emails, of which 4360 are ham and 1368 are spam. These emails are stored in a pandas **DataFrame**, a table-like object that makes it easy to manipulate large datasets.

In [2]:
import pandas as pd

In [3]:
data = pd.read_csv("./emails.csv.zip", encoding= "latin-1")

The DataFrame is a table with two main columns: the column **text**, which contains the email itself, and the column **spam**, a binary value that labels the email as **0** for **ham** and **1** for **spam**. Each row of the DataFrame therefore represents a single email that is either spam or ham.

We can see this clearly if we display the content of the DataFrame as below:

In [4]:
data

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
5,"Subject: great nnews hello , welcome to medzo...",1
6,Subject: here ' s a hot play in motion homela...,1
7,Subject: save your money buy getting this thin...,1
8,Subject: undeliverable : home based business f...,1
9,Subject: save your money buy getting this thin...,1


We can also show the number of emails (rows) for each type of email (spam/ham). This dataset contains far more ham than spam, which is normal since it reflects what we would receive in our inbox on a daily basis. We will first train and test our model with this imbalance; later in the notebook we will repeat the whole process with an equal number of spam and ham emails to see whether the model improves.

In [5]:
data.spam.value_counts()

0    4360
1    1368
Name: spam, dtype: int64

In [6]:
# To test with a dataset with equal number of spam/ham emails
# data = pd.concat([data[:1000], data[-1000:]], axis=0)
# data.spam.value_counts()
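The commented-out slice above depends on how the rows are ordered in the file. As a minimal sketch of an order-independent alternative, the example below samples the same number of rows from each class. The DataFrame `toy` and the names `n_per_class` and `balanced` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the real dataset.
toy = pd.DataFrame({
    "text": [f"email {i}" for i in range(10)],
    "spam": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Size of the smallest class; every class is downsampled to this size.
n_per_class = toy["spam"].value_counts().min()

# Sample n_per_class rows from each class, then shuffle the combined rows.
balanced = pd.concat(
    [
        toy[toy["spam"] == label].sample(n=n_per_class, random_state=0)
        for label in (0, 1)
    ],
    ignore_index=True,
).sample(frac=1, random_state=0).reset_index(drop=True)

counts = balanced["spam"].value_counts().to_dict()
print(counts)
```

Downsampling discards part of the majority class, so this is only one possible way to obtain equal class sizes.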

In [8]:
from sklearn.model_selection import train_test_split

We need to split our dataset into two parts:
   * training set: the data used to train our model
   * test set: the data used to evaluate how well the model performs on emails it has not seen

The training set will contain 80% of the data, while the test set will contain the remaining 20%.

In [9]:
train_X, test_X, train_y, test_y = train_test_split(data["text"],data["spam"], test_size=0.2, random_state=10)

In [10]:
len(train_X)

1600

In [11]:
len(test_X)

400
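When the classes are imbalanced, a plain random split can leave the test set with a different spam/ham ratio than the training set. The sketch below uses the `stratify` argument of `train_test_split` to keep the ratio the same in both parts. This is an optional variation rather than what the notebook does, and the toy lists `texts` and `labels` are invented for the example.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 emails, 25% of them spam.
texts = [f"email {i}" for i in range(20)]
labels = [1] * 5 + [0] * 15

# stratify=labels keeps the spam/ham proportion equal in both splits.
tr_X, te_X, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.2, random_state=10, stratify=labels
)

print(len(tr_X), len(te_X))   # sizes of the two parts
print(sum(tr_y), sum(te_y))   # number of spam emails in each part
```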

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words="english")

## 2) Vectorisation of the data
Since our models can only work with numerical vectors, we need to transform the textual data into such vectors. We will use a **CountVectorizer** from scikit-learn for this task. The idea is to first extract every unique word contained in our training emails while filtering out English **stop_words**, which are very common words in the English language. We remove these common words because they carry little information about whether an email is ham or spam: we would find them equally often in both.
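To make the effect of the stop-word filter concrete, here is a small sketch that tokenizes one made-up sentence with and without `stop_words="english"`. It uses `build_analyzer()`, which returns the function CountVectorizer applies to each document. The sentence and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "We will deliver the pizza to your door"

# Analyzer without any stop-word filtering: lowercases and splits into words.
plain = CountVectorizer().build_analyzer()
# Analyzer that also drops scikit-learn's built-in English stop words.
filtered = CountVectorizer(stop_words="english").build_analyzer()

all_tokens = plain(sentence)
kept_tokens = filtered(sentence)

print(all_tokens)
print(kept_tokens)
```

Only the content words survive the filter, which is what shrinks the vocabulary the model has to learn from.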

### 2.1) Finding every unique word

In [13]:
vect = CountVectorizer(stop_words="english")
vect.fit(train_X) # Learn the vocabulary (every unique word) from the training emails

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [14]:
# show first 20 unique words used in all the emails
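# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vect.get_feature_names_out()[0:20].tolist())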

['00', '000', '0000', '000000000003619', '000000000003991', '000000000005411', '000000000007498', '000000000007876', '000000000012734', '000010220', '0000102317', '0000102374', '0000104486', '0000104631', '0000104730', '0000104776', '0000104778', '0001', '0002', '0003']


In [15]:
# show last 20 unique words used in all the emails
print(vect.get_feature_names_out()[-20:].tolist())

['zt', 'ztvwo', 'zucha', 'zuerich', 'zuid', 'zulie', 'zulkifli', 'zum', 'zunf', 'zungenakrobatik', 'zurich', 'zustellstatus', 'zuyw', 'zwischen', 'zwzm', 'zxghlajf', 'zyban', 'zygoma', 'zymg', 'zzzz']


In [16]:
len(vect.get_feature_names_out())

20952

### 2.2) Transforming the emails into vectors

We will now use these unique words to create our vectors. One vector is created for each email; it contains the number of occurrences of every unique word in that email.

Example:
We have the following three sentences: 
* 'Spam emails are bad'
* 'The red apple is juicy'
* 'The blue sky is blue tonight'

After removing the stop words, we have these unique words: ['apple', 'bad', 'blue', 'emails', 'juicy', 'red', 'sky', 'spam', 'tonight']   
Each of these sentences will be transformed into a one-dimensional vector containing the number of occurrences of each unique word, as follows:

|                              | apple | bad | blue | emails | juicy | red | sky | spam | tonight |
|------------------------------|:-----:|:---:|:----:|:------:|:-----:|:---:|:---:|:----:|:-------:|
| Spam emails are bad          |   0   |  1  |   0  |    1   |   0   |  0  |  0  |   1  |    0    |
| The red apple is juicy       |   1   |  0  |   0  |    0   |   1   |  1  |  0  |   0  |    0    |
| The blue sky is blue tonight |   0   |  0  |   2  |    0   |   0   |  0  |  1  |   0  |    1    |
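The table above can be reproduced with a few lines of code. This is a minimal sketch on the three example sentences only. The names `toy_vect`, `vocab` and `counts` are chosen for illustration and are separate from the `vect` object used in the rest of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Spam emails are bad",
    "The red apple is juicy",
    "The blue sky is blue tonight",
]

toy_vect = CountVectorizer(stop_words="english")
# fit_transform learns the vocabulary and builds the count matrix in one step.
counts = toy_vect.fit_transform(sentences)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(vocab)
print(counts.toarray())  # one row per sentence, one column per word
```

The result of `fit_transform` is a sparse matrix; `toarray()` is used here only because the example is tiny.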

In [17]:
X_train_occurence = vect.transform(train_X)
X_test_occurence = vect.transform(test_X)

We have now transformed both our training and test sets into matrices. Below is the size of the training matrix: the number of rows is equal to the size of the training set, and the number of columns is equal to the number of unique words.

In [18]:
X_train_occurence.shape

(1600, 20952)

## 3) Training our model
Now that the data is processed, we will use it to train our model. We use Naive Bayes, a machine learning algorithm commonly used for classifying textual data. After training our model, we will use the test set to measure its accuracy, meaning the proportion of emails for which the model correctly predicted whether the email is spam or ham.

In [19]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [20]:
model = MultinomialNB()
model.fit(X_train_occurence,train_y)
pred = model.predict(X_test_occurence)
accuracy_score(test_y, pred)

0.9825

The following matrix can be read as follows (rows are the actual classes, columns are the predicted classes):
* Top-left: number of ham emails correctly classified as ham
* Top-right: number of ham emails wrongly classified as spam
* Bottom-left: number of spam emails wrongly classified as ham
* Bottom-right: number of spam emails correctly classified as spam

In [21]:
confusion_matrix(test_y,pred)

array([[192,   3],
       [  4, 201]], dtype=int64)
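As a quick check, the accuracy reported earlier can be recomputed by hand from the four cells of this matrix, along with the precision and recall for the spam class. The sketch below copies the numbers from the output above. The variable names follow the usual true/false positive/negative convention, with spam as the positive class.

```python
# Values copied from the confusion matrix above:
# rows are actual classes (ham, spam), columns are predicted classes (ham, spam).
tn, fp = 192, 3    # ham predicted as ham, ham predicted as spam
fn, tp = 4, 201    # spam predicted as ham, spam predicted as spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all emails classified correctly
precision = tp / (tp + fp)     # of the emails flagged as spam, share that really are spam
recall = tp / (tp + fn)        # of the actual spam emails, share that were caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.9825 precision=0.9853 recall=0.9805
```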

## 4) Trying out the model
Now that our model is trained and validated, we will test it using "real" emails we have received in our inbox. To do this, we first store the content of the email in a variable and add it to the original data. We then use our fitted CountVectorizer to convert the data into a matrix of word counts. Finally, we use the model to predict whether the given email is spam or ham.
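Strictly speaking, a prediction only needs the new text to be transformed with the fitted vectorizer; adding it to the DataFrame is a convenience. The sketch below shows that shorter path on a tiny invented training set, using scikit-learn's `make_pipeline` to chain the vectorizer and the classifier. The pipeline and the toy emails are assumptions for illustration, not the notebook's own model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set (1 = spam, 0 = ham).
toy_texts = [
    "win free money now",
    "free prize claim money",
    "meeting agenda for monday",
    "project report attached for review",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then fits the Naive Bayes classifier.
toy_model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# New emails are passed as a list of strings; no DataFrame is needed.
new_emails = ["claim your free money", "monday project meeting"]
preds = toy_model.predict(new_emails)
print(list(preds))
```

With the notebook's own objects, the equivalent call would be `model.predict(vect.transform([email]))`.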

### Spam email

In [22]:
email = 'ALWAYS THERE FOR YOU: WE DELIVER YOUR PIZZA WITHOUT CONTACT Domino\'s Pizza attaches great importance to the health of its employees and customers. That is why Domino\'s Pizza Switzerland strictly adheres to the recommendations and regulations of the Federal Office of Public Health regarding the coronavirus. We do everything possible to ensure that customers can continue to enjoy their pizza comfortably and above all safely at work, in the home office or in the evening using Domino\'s Pizza\'s delivery service! To protect its employees and customers, Domino\'s Pizza was the first company to use contactless delivery and last week closed all take-away areas as a precautionary measure. In addition, the employees of Domino\'s Pizza Switzerland work under strict hygiene measures and keep a minimum distance. At Domino\'s Pizza, all ingredients are baked - in the oven at 250 degrees for 6 minutes - after which the pizza is placed directly into its box. From now on and as an extra measure, the pizza will no longer be sliced. Because Domino\'s controls the entire chain (production and delivery), we can guarantee the best possible hygiene conditions - from production to your door! Order online and we take care of everything! Domino’s Pizza Suisse #WECAREFORYOU #STAYATHOME'

In [23]:
email

"ALWAYS THERE FOR YOU: WE DELIVER YOUR PIZZA WITHOUT CONTACT Domino's Pizza attaches great importance to the health of its employees and customers. That is why Domino's Pizza Switzerland strictly adheres to the recommendations and regulations of the Federal Office of Public Health regarding the coronavirus. We do everything possible to ensure that customers can continue to enjoy their pizza comfortably and above all safely at work, in the home office or in the evening using Domino's Pizza's delivery service! To protect its employees and customers, Domino's Pizza was the first company to use contactless delivery and last week closed all take-away areas as a precautionary measure. In addition, the employees of Domino's Pizza Switzerland work under strict hygiene measures and keep a minimum distance. At Domino's Pizza, all ingredients are baked - in the oven at 250 degrees for 6 minutes - after which the pizza is placed directly into its box. From now on and as an extra measure, the pizza

In [24]:
new_row = {'text': email, 'spam': 1}
# append the row to the DataFrame (DataFrame.append was removed in pandas 2.0)
data = pd.concat([data, pd.DataFrame([new_row])], ignore_index=True)

In [25]:
print(data["text"].iloc[-1])
pred = model.predict(vect.transform(data["text"]))
print("---------------------------------")
print("Prediction: ",pred[-1])
print("Actual category: ",data["spam"].iloc[-1])

ALWAYS THERE FOR YOU: WE DELIVER YOUR PIZZA WITHOUT CONTACT Domino's Pizza attaches great importance to the health of its employees and customers. That is why Domino's Pizza Switzerland strictly adheres to the recommendations and regulations of the Federal Office of Public Health regarding the coronavirus. We do everything possible to ensure that customers can continue to enjoy their pizza comfortably and above all safely at work, in the home office or in the evening using Domino's Pizza's delivery service! To protect its employees and customers, Domino's Pizza was the first company to use contactless delivery and last week closed all take-away areas as a precautionary measure. In addition, the employees of Domino's Pizza Switzerland work under strict hygiene measures and keep a minimum distance. At Domino's Pizza, all ingredients are baked - in the oven at 250 degrees for 6 minutes - after which the pizza is placed directly into its box. From now on and as an extra measure, the pizza 

### Ham email

In [26]:
ham = "Hello Mr. Marina, .Here is what we have done for the lesson of the day :. We discussed about training a machine learning model using the library Scikit-Learn for filtering out spam. We found an existing dataset of ham/spam emails containing 5'000 entities . We are going to write our own ham/spam email to test our training model. We discussed about the structure of the presentation (Power Point, Jupyter Notebook, ...). We finished reading the PDF course about spam filtering. We wish you a good afternoon. Kind Regards, Vincent Moulin and Nicolas Praz Group 7"

In [27]:
new_row = {'text': ham, 'spam': 0}
# append the row to the DataFrame (DataFrame.append was removed in pandas 2.0)
data = pd.concat([data, pd.DataFrame([new_row])], ignore_index=True)
print(data["text"].iloc[-1])
pred = model.predict(vect.transform(data["text"]))
print("---------------------------------")
print("Prediction: ",pred[-1])
print("Actual category: ",data["spam"].iloc[-1])

Hello Mr. Marina, .Here is what we have done for the lesson of the day :. We discussed about training a machine learning model using the library Scikit-Learn for filtering out spam. We found an existing dataset of ham/spam emails containing 5'000 entities . We are going to write our own ham/spam email to test our training model. We discussed about the structure of the presentation (Power Point, Jupyter Notebook, ...). We finished reading the PDF course about spam filtering. We wish you a good afternoon. Kind Regards, Vincent Moulin and Nicolas Praz Group 7
---------------------------------
Prediction:  0
Actual category:  0


In [28]:
spam = 'Hello everyone, We are the team behind usepanda.com, flatuicolors.com, collectui.com and thestocks.im. You can see all the tools we are building on Asteya Network website. Your email has been registered in one of the websites above (maybe years ago) and this is the first time that we are sending you an update of a new product that we have built Launching Today Introducing Corona Panel v2, which we built by gathering all the useful resources related to the Covid-19 Pandemic in one place. Words cannot explain well enough, please go ahead and check it yourself: https://coronapanel.com What\’s Next? We are building a news reader app for iOS and Android. You can read Hacker News, Dribbble, Designers News, Product Hunt and many more on the go. We\’ll send you another email as we launch Stay Strong and Safe Asteya Network Team https://asteya.network'

In [29]:
new_row = {'text': spam, 'spam': 1}
# append the row to the DataFrame (DataFrame.append was removed in pandas 2.0)
data = pd.concat([data, pd.DataFrame([new_row])], ignore_index=True)
print(data["text"].iloc[-1])
pred = model.predict(vect.transform(data["text"]))
print("---------------------------------")
print("Prediction: ",pred[-1])
print("Actual category: ",data["spam"].iloc[-1])

Hello everyone, We are the team behind usepanda.com, flatuicolors.com, collectui.com and thestocks.im. You can see all the tools we are building on Asteya Network website. Your email has been registered in one of the websites above (maybe years ago) and this is the first time that we are sending you an update of a new product that we have built Launching Today Introducing Corona Panel v2, which we built by gathering all the useful resources related to the Covid-19 Pandemic in one place. Words cannot explain well enough, please go ahead and check it yourself: https://coronapanel.com What\’s Next? We are building a news reader app for iOS and Android. You can read Hacker News, Dribbble, Designers News, Product Hunt and many more on the go. We\’ll send you another email as we launch Stay Strong and Safe Asteya Network Team https://asteya.network
---------------------------------
Prediction:  1
Actual category:  1
