# Email spam detection using scikit-learn

The process we will use to train a predictive model that detects whether an email is ham or spam is as follows:
 * We will read the dataset, which contains 5728 emails labelled as spam or ham
 * We will transform this textual dataset into numeric vectors that our models can use
 * We will train a model to predict if an email is ham/spam
 * We will evaluate the model

## 1) Data processing
We will read the file "emails.csv.zip", which contains all our spam and ham emails. The dataset contains a total of 5728 emails, of which 4360 are ham and 1368 are spam. These emails will be stored in a pandas **DataFrame**, a table-like object that makes it easy to manipulate large datasets.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("./emails.csv.zip", encoding= "latin-1")

The DataFrame contains two columns: **text**, which holds the email itself, and **spam**, a binary label that is **0** for **ham** and **1** for **spam**. Each row of the DataFrame therefore represents a single email that is either spam or ham.

We can see this clearly by displaying the content of the DataFrame:

In [3]:
data

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
5,"Subject: great nnews hello , welcome to medzo...",1
6,Subject: here ' s a hot play in motion homela...,1
7,Subject: save your money buy getting this thin...,1
8,Subject: undeliverable : home based business f...,1
9,Subject: save your money buy getting this thin...,1


We can also show the number of emails (rows) for each type of email (spam/ham). In this dataset we have far more ham than spam. This is expected, since it reflects what we would receive on a daily basis in our inbox: more ham than spam. We will first train and test our model on this imbalanced data; later in the notebook we will repeat the whole process with an equal number of spam and ham emails to see whether the model improves.
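One way to prepare a dataset with as many ham as spam emails is to downsample the larger class. The sketch below is only an assumption about how that could be done, not necessarily what this notebook does later; `balance_by_downsampling` is a hypothetical helper and the toy rows are made up.

```python
import random
from collections import defaultdict

def balance_by_downsampling(rows, seed=10):
    """Return rows with every class cut down to the size of the smallest class.

    rows is a list of (text, label) pairs; this helper is hypothetical.
    """
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # avoid leaving the classes grouped together
    return balanced

# Toy data with the same kind of imbalance as the dataset (more ham than spam).
toy_rows = [(f"ham {i}", 0) for i in range(8)] + [(f"spam {i}", 1) for i in range(3)]
balanced_rows = balance_by_downsampling(toy_rows)

ham_count = sum(1 for _, label in balanced_rows if label == 0)
spam_count = sum(1 for _, label in balanced_rows if label == 1)
print(ham_count, spam_count)  # 3 3
```

Downsampling throws away ham emails, so the balanced model sees less data; that trade-off is worth keeping in mind when comparing the two runs.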

In [4]:
data.spam.value_counts()

0    4360
1    1368
Name: spam, dtype: int64

In [5]:
from sklearn.model_selection import train_test_split

We need to split our dataset into two parts: 
   * training set: the data used to train our model
   * test set: the data used to check how well the model performs on emails it has not seen
   
The training set will contain 80% of the data, while the test set will contain the remaining 20%.
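To make the 80/20 proportion concrete, the sketch below computes the size of each part for the 5728 emails counted above and performs a shuffled split by hand. It assumes the test size is rounded up, which is consistent with the shapes reported further down in this notebook; `split_sizes` and `simple_split` are hypothetical helpers, not scikit-learn functions.

```python
import math
import random

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test), assuming the test part is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

def simple_split(items, test_size=0.2, seed=10):
    """Shuffle a copy of items and cut it into a training and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_sizes(len(shuffled), test_size)
    return shuffled[:n_train], shuffled[n_train:]

n_train, n_test = split_sizes(4360 + 1368)
print(n_train, n_test)  # 4582 1146

train_part, test_part = simple_split(range(10))
print(len(train_part), len(test_part))  # 8 2
```

Fixing the seed, as `random_state=10` does in the real call, keeps the split identical from one run to the next.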

In [6]:
train_X, test_X, train_y, test_y = train_test_split(data["text"],data["spam"], test_size=0.2, random_state=10)

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words="english")

## 2) Vectorisation of the data
Since our models can only work with numerical vectors, we need to transform the textual data into such vectors. We will use a **CountVectorizer** from scikit-learn for this task. The idea is to first extract every unique word contained in the training emails while filtering out English **stop_words**, which are very common words in the English language. We remove these common words because they carry little information about whether an email is ham or spam: we would find them equally often in both.
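The vocabulary extraction described above can be illustrated without scikit-learn. This is a simplified sketch: the stop-word list is a tiny hypothetical one, and the tokenisation (lowercase words of at least two characters) only approximates the default behaviour of `CountVectorizer`.

```python
import re

# A tiny, hypothetical stop-word list; scikit-learn's built-in English list is much longer.
STOP_WORDS = {"the", "is", "a", "to", "for", "you", "this", "and"}

def build_vocabulary(documents, stop_words=STOP_WORDS):
    """Collect the sorted set of lowercase words of 2+ characters, minus stop words."""
    vocabulary = set()
    for document in documents:
        for word in re.findall(r"\b\w\w+\b", document.lower()):
            if word not in stop_words:
                vocabulary.add(word)
    return sorted(vocabulary)

emails = [
    "This is a free offer for you",
    "The meeting is moved to Thursday",
]
toy_vocabulary = build_vocabulary(emails)
print(toy_vocabulary)  # ['free', 'meeting', 'moved', 'offer', 'thursday']
```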

In [8]:
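vect = CountVectorizer(stop_words="english")
vect.fit(train_X)  # learn the vocabulary (every unique word) from the training emails

# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead
feature_names = vect.get_feature_names_out().tolist()

# show the first 20 unique words found in the training emails
print(feature_names[0:20])

# show the last 20 unique words found in the training emails
print(feature_names[-20:])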

['00', '000', '0000', '00000000', '000000000003619', '000000000003991', '000000000003997', '000000000005168', '000000000005411', '000000000005413', '000000000005820', '000000000006238', '000000000007494', '000000000007498', '000000000007876', '000000000010552', '000000000011185', '000000000012677', '000000000012735', '000000000012736']
['zunaechst', 'zunf', 'zur', 'zurich', 'zusaetzlich', 'zuzana', 'zwabic', 'zwischen', 'zwlaszcza', 'zwrocic', 'zwwyw', 'zwzm', 'zxghlajf', 'zyban', 'zyc', 'zygoma', 'zymg', 'zzn', 'zzncacst', 'zzzz']


We will now use these unique words to create our vectors. One vector is created for each email; it contains the number of occurrences of every unique word in that email.

Example:

We have these unique words: ["spam",  "good", "email", "bad", "red", "tonight", "warning"]
We have this sentence: "Spam emails are bad"
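The vector for this example can be worked out with a few lines of plain Python. This is a minimal sketch that counts exact, lowercase word matches, which is roughly what a count vectorizer does; `count_vector` is a hypothetical helper.

```python
import re
from collections import Counter

vocabulary = ["spam", "good", "email", "bad", "red", "tonight", "warning"]
sentence = "Spam emails are bad"

def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in text (exact, lowercase match)."""
    counts = Counter(re.findall(r"\b\w\w+\b", text.lower()))
    return [counts[word] for word in vocabulary]

vector = count_vector(sentence, vocabulary)
print(vector)  # [1, 0, 0, 1, 0, 0, 0]
```

Note that "emails" is not counted as "email": words are matched exactly, with no stemming, so the plural is a different word as far as the vector is concerned.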

In [9]:
X_train_df = vect.transform(train_X)
X_test_df = vect.transform(test_X)
type(X_test_df)

scipy.sparse.csr.csr_matrix

In [10]:
X_train_df.shape

(4582, 32938)

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [15]:
model = MultinomialNB(alpha=1.8)
model.fit(X_train_df,train_y)
pred = model.predict(X_test_df)
accuracy_score(test_y, pred)

0.9947643979057592
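The `alpha` argument of `MultinomialNB` is an additive smoothing parameter: it keeps a word that never appeared in one class during training from driving that class's probability to zero. The sketch below illustrates the idea on made-up word counts; it is a simplified hand-written version, not scikit-learn's implementation, and `smoothed_log_likelihoods` and `score` are hypothetical helpers.

```python
import math

def smoothed_log_likelihoods(word_counts, alpha):
    """log P(word | class) = log((count + alpha) / (total + alpha * vocabulary_size))."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        word: math.log((count + alpha) / (total + alpha * vocab_size))
        for word, count in word_counts.items()
    }

def score(words, log_likelihoods, log_prior):
    """Sum of the log prior and the log likelihood of each word."""
    return log_prior + sum(log_likelihoods[word] for word in words)

# Made-up word counts per class over a three-word vocabulary.
spam_counts = {"free": 6, "money": 4, "meeting": 0}
ham_counts = {"free": 1, "money": 1, "meeting": 8}

alpha = 1.8  # same value as in the cell above
spam_ll = smoothed_log_likelihoods(spam_counts, alpha)
ham_ll = smoothed_log_likelihoods(ham_counts, alpha)

message = ["free", "money"]
log_prior = math.log(0.5)  # assume both classes are equally likely
is_spam = score(message, spam_ll, log_prior) > score(message, ham_ll, log_prior)
print(is_spam)  # True
```

Even though "meeting" never occurs in the spam counts, its smoothed probability stays above zero, so a spam email that happens to mention a meeting can still be scored.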

In [16]:
print(classification_report(test_y, pred , target_names = ["Not Spam", "Spam"]))

              precision    recall  f1-score   support

    Not Spam       0.99      1.00      1.00       861
        Spam       1.00      0.98      0.99       285

    accuracy                           0.99      1146
   macro avg       1.00      0.99      0.99      1146
weighted avg       0.99      0.99      0.99      1146



In [17]:
confusion_matrix(test_y,pred)

array([[860,   1],
       [  5, 280]], dtype=int64)
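The figures in the classification report can be recomputed by hand from this confusion matrix, which is a useful sanity check. The sketch assumes the usual scikit-learn layout, with true labels as rows and predicted labels as columns.

```python
# Confusion matrix from the cell above: rows are true labels, columns are predictions.
tn, fp = 860, 1   # ham predicted as ham / ham predicted as spam
fn, tp = 5, 280   # spam predicted as ham / spam predicted as spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)   # of the emails flagged as spam, how many really are
spam_recall = tp / (tp + fn)      # of the real spam emails, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(round(accuracy, 4))        # 0.9948
print(round(spam_precision, 2))  # 1.0
print(round(spam_recall, 2))     # 0.98
print(round(spam_f1, 2))         # 0.99
```

In other words, only 1 ham email out of 861 was wrongly flagged as spam, while 5 spam emails out of 285 slipped through as ham.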

In [18]:
print(data["text"][1472])
pred = model.predict(vect.transform(data["text"]))
print("Pred : ",pred[1472])
print("Main : ",data["spam"][1472])

Subject: meeting with bob butts  this is scheduled for 2 pm on thursday 27 th in his office ebl 906 . you are  welcome to join . i will give a overview of what we ( sandeep & co ) are trying  to do for dpc ( dabhol power ) and ask him to clarify the mark - to - market issues  related to those deals .  krishna .
Pred :  0
Main :  0


In [19]:
print(data["text"][10])
pred = model.predict(vect.transform(data["text"]))
print("Pred : ",pred[10])
print("Main : ",data["spam"][10])

Subject: las vegas high rise boom  las vegas is fast becoming a major metropolitan city ! 60 +  new high rise towers are expected to be built on and around the las vegas strip  within the next 3 - 4 years , that ' s 30 , 000 + condominiums !  this boom has just begun ! buy first . . . early phase ,  pre - construction pricing is now available on las vegas high rises including  trump , cosmopolitan , mgm , turnberry , icon , sky , among others .  join the interest list :  http : / / www . verticallv . com  message has been sent to you by realty one highrise . learn more at www . verticallv . comif you  wish to be excluded from future mailings , please reply with the word remove in  the subject line . 
Pred :  1
Main :  1


In [20]:
email = 'ALWAYS THERE FOR YOU: WE DELIVER YOUR PIZZA WITHOUT CONTACT Domino\'s Pizza attaches great importance to the health of its employees and customers. That is why Domino\'s Pizza Switzerland strictly adheres to the recommendations and regulations of the Federal Office of Public Health regarding the coronavirus. We do everything possible to ensure that customers can continue to enjoy their pizza comfortably and above all safely at work, in the home office or in the evening using Domino\'s Pizza\'s delivery service! To protect its employees and customers, Domino\'s Pizza was the first company to use contactless delivery and last week closed all take-away areas as a precautionary measure. In addition, the employees of Domino\'s Pizza Switzerland work under strict hygiene measures and keep a minimum distance. At Domino\'s Pizza, all ingredients are baked - in the oven at 250 degrees for 6 minutes - after which the pizza is placed directly into its box. From now on and as an extra measure, the pizza will no longer be sliced. Because Domino\'s controls the entire chain (production and delivery), we can guarantee the best possible hygiene conditions - from production to your door! Order online and we take care of everything! Domino’s Pizza Suisse #WECAREFORYOU #STAYATHOME'

In [21]:
email

"ALWAYS THERE FOR YOU: WE DELIVER YOUR PIZZA WITHOUT CONTACT Domino's Pizza attaches great importance to the health of its employees and customers. That is why Domino's Pizza Switzerland strictly adheres to the recommendations and regulations of the Federal Office of Public Health regarding the coronavirus. We do everything possible to ensure that customers can continue to enjoy their pizza comfortably and above all safely at work, in the home office or in the evening using Domino's Pizza's delivery service! To protect its employees and customers, Domino's Pizza was the first company to use contactless delivery and last week closed all take-away areas as a precautionary measure. In addition, the employees of Domino's Pizza Switzerland work under strict hygiene measures and keep a minimum distance. At Domino's Pizza, all ingredients are baked - in the oven at 250 degrees for 6 minutes - after which the pizza is placed directly into its box. From now on and as an extra measure, the pizza

In [22]:
new_row = {'text': email, 'spam': 1}
# append the row to the DataFrame (DataFrame.append was removed in pandas 2.0, so use pd.concat)
data = pd.concat([data, pd.DataFrame([new_row])], ignore_index=True)

In [23]:
data

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
5,"Subject: great nnews hello , welcome to medzo...",1
6,Subject: here ' s a hot play in motion homela...,1
7,Subject: save your money buy getting this thin...,1
8,Subject: undeliverable : home based business f...,1
9,Subject: save your money buy getting this thin...,1


In [24]:
print(data["text"][5728])
pred = model.predict(vect.transform(data["text"]))
print("Pred : ",pred[5728])
print("Main : ",data["spam"][5728])

ALWAYS THERE FOR YOU: WE DELIVER YOUR PIZZA WITHOUT CONTACT Domino's Pizza attaches great importance to the health of its employees and customers. That is why Domino's Pizza Switzerland strictly adheres to the recommendations and regulations of the Federal Office of Public Health regarding the coronavirus. We do everything possible to ensure that customers can continue to enjoy their pizza comfortably and above all safely at work, in the home office or in the evening using Domino's Pizza's delivery service! To protect its employees and customers, Domino's Pizza was the first company to use contactless delivery and last week closed all take-away areas as a precautionary measure. In addition, the employees of Domino's Pizza Switzerland work under strict hygiene measures and keep a minimum distance. At Domino's Pizza, all ingredients are baked - in the oven at 250 degrees for 6 minutes - after which the pizza is placed directly into its box. From now on and as an extra measure, the pizza 

In [25]:
ham = "Hello Mr. Marina, .Here is what we have done for the lesson of the day :. We discussed about training a machine learning model using the library Scikit-Learn for filtering out spam. We found an existing dataset of ham/spam emails containing 5'000 entities . We are going to write our own ham/spam email to test our training model. We discussed about the structure of the presentation (Power Point, Jupyter Notebook, ...). We finished reading the PDF course about spam filtering. We wish you a good afternoon. Kind Regards, Vincent Moulin and Nicolas Praz Group 7"

In [26]:
new_row = {'text': ham, 'spam': 0}
# append the row to the DataFrame (DataFrame.append was removed in pandas 2.0, so use pd.concat)
data = pd.concat([data, pd.DataFrame([new_row])], ignore_index=True)
print(data["text"][5729])
pred = model.predict(vect.transform(data["text"]))
print("Pred : ",pred[5729])
print("Main : ",data["spam"][5729])

Hello Mr. Marina, .Here is what we have done for the lesson of the day :. We discussed about training a machine learning model using the library Scikit-Learn for filtering out spam. We found an existing dataset of ham/spam emails containing 5'000 entities . We are going to write our own ham/spam email to test our training model. We discussed about the structure of the presentation (Power Point, Jupyter Notebook, ...). We finished reading the PDF course about spam filtering. We wish you a good afternoon. Kind Regards, Vincent Moulin and Nicolas Praz Group 7
Pred :  0
Main :  0
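The cells above append each new email to the DataFrame and then predict the whole dataset again just to read one result. A new message can also be classified directly from its raw text. The sketch below shows one way to do that by chaining the vectorizer and the model in a scikit-learn pipeline; the training texts are invented for the example and `spam_filter` is a hypothetical name, so this is an illustration rather than the notebook's own model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for this example (1 = spam, 0 = ham).
train_texts = [
    "win free money now",
    "free prize claim now",
    "meeting schedule for thursday",
    "project report attached for review",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then feeds the counts to the classifier.
spam_filter = make_pipeline(
    CountVectorizer(stop_words="english"),
    MultinomialNB(alpha=1.8),
)
spam_filter.fit(train_texts, train_labels)

# Raw strings go straight in; no DataFrame or manual transform is needed.
new_emails = ["claim your free money", "thursday meeting about the project report"]
predictions = spam_filter.predict(new_emails).tolist()
print(predictions)  # [1, 0]
```

Keeping the vectorizer and the model in one object also guarantees that new emails are transformed with the vocabulary learned during training.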
