# Guessing if an email is spam or not spam via word embedding inside a neural network

## Importing the libraries

In [1]:
#used libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk # statistical natural language processing
import string

## Loading the emails CSV file and previewing the data

Here we load the CSV file containing the emails and preview the data inside this two-column file. We see that "email" holds the text of the sent email and "label" holds its classification (1 is for spam and 0 is for not spam).

In [2]:
#CSV file location
data = pd.read_csv('C:/Users/steli/AppData/Roaming/Microsoft/Windows/Start Menu/Programs/Python 3.9/spam_or_not_spam.csv')
#preview of the data inside the 2 column CSV file
data.head()

Unnamed: 0,email,label
0,mike bostock said received from trackingNUMBE...,0
1,no i was just a little confused because i m r...,0
2,this is just an semi educated guess if i m wro...,0
3,jm URL justin mason writes except for NUMBER t...,0
4,i just picked up razor sdk NUMBER NUMBER and N...,0


In [3]:
print(f'(Rows,Columns): {data.shape}')

(Rows,Columns): (1500, 2)


## Data preprocessing and preview of the new clear data

Looking at the emails, we see that we have a lot of duplicates, which will lower the performance of our prediction. So we remove them.

In [4]:
data[data['email'].duplicated()]

Unnamed: 0,email,label
305,use perl daily headline mailer this week on pe...,0
342,url URL date NUMBER NUMBER NUMBER NUMBER NUMBE...,0
349,url URL date NUMBER NUMBER NUMBER NUMBER NUMBE...,0
455,url URL date not supplied URL,0
456,url URL date not supplied URL,0
...,...,...
1446,we guarantee you signups before you ever pay a...,1
1454,otc newsletter discover tomorrow s winners fo...,1
1463,protect your financial well being purchase an ...,1
1468,lowest rates available for term life insurance...,1


In [5]:
#removes duplicate emails
data.drop_duplicates(inplace=True)
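
One detail worth checking here: the preview cell looks for duplicates in the `email` column only, while `drop_duplicates()` without arguments compares whole rows. The sketch below shows the difference on a made-up toy frame (the `toy` data is an assumption for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (contents are made up).
toy = pd.DataFrame({
    "email": ["buy now", "buy now", "hello bob", "buy now"],
    "label": [1, 1, 0, 0],
})

# Rows whose email text already appeared earlier in the frame.
dup_by_email = toy["email"].duplicated()

# drop_duplicates() with no arguments compares whole rows (email AND label).
whole_row = toy.drop_duplicates()

# Restricting the comparison to the email column drops every repeated text.
by_email = toy.drop_duplicates(subset="email")

print(dup_by_email.tolist())           # [False, True, False, True]
print(len(whole_row), len(by_email))   # 3 2
```

If the same text could appear with two different labels, `subset="email"` would be the stricter choice; whether that matters for this dataset is not something the notebook shows.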

We also see that we have some empty data, which may lower our performance, so we remove that too.

In [6]:
data[data.isnull().any(axis=1)]

Unnamed: 0,email,label
1466,,1


In [7]:
#removes rows where there are Null/NaN data
data = data.dropna()

In [8]:
print(f'New, smaller (Rows,Columns): {data.shape}')

New, smaller (Rows,Columns): (1376, 2)


Our emails contain many words that carry little meaning for training. These are called English stopwords. Examples of stopwords are "a" and "and". We remove those words from our emails so that training focuses on the more informative words.

In [9]:
#downloads the latest dictionary of stopwords
nltk.download('stopwords') 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\steli\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
#creates a new column of emails with stopwords removed
from nltk.corpus import stopwords
stop = stopwords.words('english')
data['clear_email'] = data['email'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [11]:
#preview of the clear emails dataset
data.head()

Unnamed: 0,email,label,clear_email
0,mike bostock said received from trackingNUMBE...,0,mike bostock said received trackingNUMBER URL ...
1,no i was just a little confused because i m r...,0,little confused running procmail gateway sits ...
2,this is just an semi educated guess if i m wro...,0,semi educated guess wrong someone please corre...
3,jm URL justin mason writes except for NUMBER t...,0,jm URL justin mason writes except NUMBER thing...
4,i just picked up razor sdk NUMBER NUMBER and N...,0,picked razor sdk NUMBER NUMBER NUMBER NUMBER a...
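
The filtering step can be illustrated without NLTK. The sketch below uses a tiny hand-picked stopword set (`STOP` is hypothetical; the notebook uses NLTK's full English list) and a helper name, `remove_stopwords`, that is also only for illustration:

```python
# Hypothetical mini stopword set; the notebook uses NLTK's full English list.
STOP = {"i", "was", "just", "a", "because", "no", "m"}

def remove_stopwords(text, stop=STOP):
    """Keep only whitespace-separated tokens that are not stopwords."""
    return " ".join(word for word in text.split() if word not in stop)

cleaned = remove_stopwords(
    "no i was just a little confused because i m running procmail"
)
print(cleaned)  # little confused running procmail
```

Storing the stopwords in a `set` rather than a list makes each membership test a hash lookup, which can help when the same check runs for every word of every email.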


## Transforming data into vectors and training our classifier

Using scikit-learn's CountVectorizer, we tokenize our collection of emails and build a vocabulary of known words, then encode each email as a vector of word counts over that vocabulary. We split our dataset into 75% training data and 25% test data (as specified in the exercise).

In [12]:
#transforms emails text into vectors
from sklearn.feature_extraction.text import CountVectorizer
message_bow = CountVectorizer().fit_transform(data['clear_email']) #email
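
To make the encoding concrete, here is a minimal sketch on a made-up three-email corpus (the texts and the name `train_emails` are assumptions for illustration). It also shows how an already fitted vectorizer encodes a new email: words outside the learned vocabulary are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up emails, used only to illustrate the encoding.
train_emails = ["free money now", "meeting notes attached", "free free offer"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_emails)  # learns the vocabulary, then counts

# Column order follows the alphabetically sorted vocabulary.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.toarray())

# A new email is encoded with the SAME vocabulary; "unseen" is dropped.
new = vectorizer.transform(["free unseen offer"])
print(new.toarray())
```

Each row is one email and each column one vocabulary word, so the third email gets a count of 2 in the "free" column.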

In [13]:
#splits dataset to training data and test data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(message_bow, data['label'],test_size = 0.25,train_size=0.75, random_state = 0)

We apply Bayes' theorem to classify our text vectors. We use Naive Bayes because it is simple and one of the most effective classification methods for text: it estimates, from the training data, how likely each word is to appear in spam and in non-spam emails. That way we classify each email based on the probability of it being spam.

In [14]:
#apply Bayes classification for text classification with word counts
from sklearn.naive_bayes import MultinomialNB
classifier= MultinomialNB()
classifier.fit(X_train,y_train)
print('Performance of classification: ', classifier.score(X_test,y_test))

Performance of classification:  0.9941860465116279
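
How the probability comes about can be seen on a tiny example. The sketch below is only an illustration under stated assumptions: a two-word vocabulary ("free", "meeting"), four made-up training emails, and scikit-learn's default Laplace smoothing (`alpha=1.0`).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns: counts of the words "free" and "meeting" (made-up data).
X = np.array([[3, 0], [2, 0], [0, 2], [0, 3]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

clf = MultinomialNB().fit(X, y)

# A new email that contains "free" twice and "meeting" never.
query = np.array([[2, 0]])
pred = clf.predict(query)
proba = clf.predict_proba(query)  # columns follow clf.classes_, i.e. [0, 1]

# By hand, with smoothing: P(free | spam) = (5 + 1) / (5 + 2) = 6/7 and
# P(free | not spam) = (0 + 1) / (5 + 2) = 1/7. Priors are equal, so
# P(spam | email) = (6/7)^2 / ((6/7)^2 + (1/7)^2) = 36/37.
print(pred[0], round(proba[0, 1], 3))  # 1 0.973
```

The "naive" part is the assumption that word counts are independent given the class, which is why the per-word probabilities are simply multiplied.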


## Results of our training

Using the F1, precision, and recall scores, we measure how well the classifier fits the training data. We also print the predicted email labels (spam or not spam) and the actual labels in order to see our prediction in action. Lastly, for the experiment's sake, we print the accuracy of our method.

In [15]:
#performance of training
from sklearn.metrics import precision_score, f1_score, recall_score, accuracy_score
pred = classifier.predict(X_train)
print(f'F1 score: {f1_score(y_train, pred)}')
print(f'Precision score: {precision_score(y_train, pred)}')
print(f'Recall score: {recall_score(y_train, pred)}')

F1 score: 0.9873015873015873
Precision score: 1.0
Recall score: 0.9749216300940439


In [16]:
#accuracy of training
print('Acc of training: ', accuracy_score(y_train,pred))

Acc of training:  0.9922480620155039


## Results of our prediction

We print the labels predicted for our test data, followed by the actual labels, so that we can compare the two with our own eyes.

In [17]:
#prediction if emails were spam or not spam
print('predicted values: ', classifier.predict(X_test))

predicted values:  [1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 0
 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
 1 0 1 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0
 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 1 0
 1 1 1 1 1 0 0 0 1 0 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0
 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0
 0 1 1 0 1 0 0 0 0 1 0]


In [18]:
#actual value of emails being spam or not spam
print('Actual values: ', y_test.values)

Actual values:  [1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 0
 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0
 1 0 1 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0
 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 1 0
 1 1 1 1 1 0 0 0 1 0 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0
 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0
 0 1 1 0 1 0 0 0 0 1 0]
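
Comparing two long arrays by eye is error-prone, so the positions where they disagree can also be located programmatically. A minimal sketch with short made-up arrays (`predicted` and `actual` here are stand-ins, not the notebook's values):

```python
import numpy as np

# Short made-up stand-ins for classifier.predict(X_test) and y_test.values.
predicted = np.array([1, 0, 0, 1, 0, 0])
actual = np.array([1, 0, 1, 1, 0, 1])

# Positions where prediction and ground truth disagree.
mismatch = np.flatnonzero(predicted != actual)
print(mismatch)  # [2 5]

# Split the errors by type: spam that slipped through vs. false alarms.
missed_spam = int(np.sum((actual == 1) & (predicted == 0)))
false_alarms = int(np.sum((actual == 0) & (predicted == 1)))
print(missed_spam, false_alarms)  # 2 0
```

In the notebook, the same two lines could be applied to the real prediction and label arrays to list the misclassified test emails.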


In [19]:
#performance of prediction
pred = classifier.predict(X_test)
print(f'F1 score: {f1_score(y_test, pred)}')
print(f'Precision score: {precision_score(y_test, pred)}')
print(f'Recall score: {recall_score(y_test, pred)}')

F1 score: 0.9906542056074767
Precision score: 1.0
Recall score: 0.9814814814814815


In [20]:
#accuracy of prediction
print('Acc of prediction', accuracy_score(y_test,pred))

Acc of prediction 0.9941860465116279
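
To see how the four scores relate, they can be recomputed from confusion counts. The counts below are an assumption: they are inferred from the reported test scores and a 344-email test set (25% of 1376), not printed by the notebook. They reproduce the reported values.

```python
# Confusion counts INFERRED from the reported scores (assumption, see above):
# tp = spam caught, fp = false alarms, fn = spam missed, tn = ham kept.
tp, fp, fn, tn = 106, 0, 2, 236

precision = tp / (tp + fp)            # of emails flagged as spam, how many were spam
recall = tp / (tp + fn)               # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
# 1.0 0.9815 0.9907 0.9942
```

Under these counts, a precision of 1.0 means no legitimate email was flagged as spam, and the only errors are two spam emails that were let through.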
