# SPAM Filter:

We'll use sklearn.naive_bayes to train a simple spam classifier. First, let's load our data.

In [39]:
import os  # Portable access to operating-system functionality; used here for handling paths.
import io  # Input/output (I/O) utilities; used here to open the email files.
import numpy as np
import string
from pandas import DataFrame
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup  # A Python package for parsing HTML and XML. We'll use it to remove HTML tags.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB  # The Naive Bayes model we will fit
import statistics as st

In [27]:
def cleanLine(line):
    stopWords = set(stopwords.words('english'))
    punctuation = set(string.punctuation)
    line = BeautifulSoup(line, "lxml").text  # Remove HTML tags
    line = ''.join(el for el in line if el not in punctuation)  # Remove punctuation
    line = line.split()
    # Note: stop words are filtered before lowercasing, so capitalized ones (e.g. "The") are kept.
    line = [word for word in line if word not in stopWords]
    line = [word.lower() for word in line]  # Convert to lower case
    line = [word for word in line if word.isalpha()]  # Drop tokens that contain digits
    line = ' '.join(line)
    return line
    

def readFiles(path):
    for root, foldername, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            # We don't want the email header, so we skip it and extract only the body.
            # In all the email samples, the header and body are separated by a blank line.
            testBody = False
            lines = []
            with io.open(filepath, 'r', encoding='latin1') as file:
                for line in file:
                    if testBody:
                        lines.append(cleanLine(line))
                    elif line == '\n':
                        testBody = True
            message = ' '.join(lines)
            # yield makes this a generator: each (path, message) pair is produced on demand
            # rather than stored in memory all at once, which helps given the amount of data.
            yield filepath, message


# Our main objective is to create a DataFrame indexed by each email's path, with two columns:
# the classification (spam/ham) and the cleaned body text of the email.


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)
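
To make the header-skipping logic in readFiles easier to try out without the email folders, here is a minimal sketch of the same idea on an in-memory string. The helper name `body_lines` and the sample message are made up for illustration, and the cleaning step is left out.

```python
import io


def body_lines(stream):
    """Yield the lines that come after the first blank line of an email."""
    in_body = False
    for line in stream:
        if in_body:
            yield line.rstrip("\n")
        elif line == "\n":
            # The first blank line marks the end of the header.
            in_body = True


# Hypothetical sample message, not taken from the dataset.
sample = (
    "From: someone@example.com\n"
    "Subject: Hello\n"
    "\n"
    "First body line\n"
    "\n"
    "Second body line\n"
)

body = list(body_lines(io.StringIO(sample)))
print(body)  # ['First body line', '', 'Second body line']
```

Because `body_lines` is a generator, nothing is read until the caller iterates over it, which is the same property readFiles relies on.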

In [28]:
from pandas import concat  # DataFrame.append was removed in pandas 2.0; concat replaces it

# Initial cleaned dataset: spam emails followed by ham emails
data = concat([
    dataFrameFromDirectory(r'C:\Users\YsfEss\Desktop\emails\spam', 'spam'),
    dataFrameFromDirectory(r'C:\Users\YsfEss\Desktop\emails\ham', 'ham'),
])

Let's have a look at that DataFrame:

In [29]:
data.head()

path (index),class,message
C:\Users\YsfEss\Desktop\emails\spam\00001.7848dde101aa985090474a91ec93fcf0,spam,ype black display none tr go...
C:\Users\YsfEss\Desktop\emails\spam\00002.d94f1b97e48ed3b553b3508d116e6a09,spam,fight the risk cancer slim down guaranteed l...
C:\Users\YsfEss\Desktop\emails\spam\00003.2ee33bc6eacdb11f38d052c44819ba6c,spam,fight the risk cancer slim down guaranteed l...
C:\Users\YsfEss\Desktop\emails\spam\00004.eac8de8d759b7e74154f142194282724,spam,adult club offers free membership instant...
C:\Users\YsfEss\Desktop\emails\spam\00005.57696a39d7d84318ce497886896bf90d,spam,i thought might like slim down guaranteed lose...


Okay, now it's time to split the data into train and test datasets. We'll do a classic random 70/30 split. Ideally, k-fold cross validation would be better, but we'll use this single split here. Keep in mind that we have 3000 observations.

In [30]:
# First, let's shuffle the data frame to guarantee the randomness of the split.
shuffledData=data.sample(frac=1) 

In [41]:
testData=shuffledData[0:900]
trainData=shuffledData[900:]

To keep the model and the two phases of training and testing unbiased, we must ensure that the proportion of 'spam' in the two sets is almost the same as in the full dataset. The percentage of spam in the full dataset is about 16.7%.

In [42]:
print('Proportion of spams in the train dataset:')
print(trainData['class'].tolist().count('spam')/len(trainData))
print('Proportion of spams in the test dataset:')
print(testData['class'].tolist().count('spam')/len(testData))

Proportion of spams in the train dataset:
0.1680952380952381
Proportion of spams in the test dataset:
0.16333333333333333


So they are pretty close, and we can move on knowing we didn't introduce any bias in the sampling phase.
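
Instead of shuffling and then checking the proportions afterwards, the split itself can be asked to preserve them. The following is only a sketch, not what this notebook did: it uses scikit-learn's `train_test_split` with its `stratify` argument on made-up labels that mimic the imbalance of the corpus.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data with roughly the same imbalance as the corpus (500 spam out of 3000).
labels = ["spam"] * 500 + ["ham"] * 2500
messages = [f"message {i}" for i in range(len(labels))]

# stratify=labels keeps the spam/ham ratio (almost) identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.3, stratify=labels, random_state=0
)

train_ratio = Counter(y_train)["spam"] / len(y_train)
test_ratio = Counter(y_test)["spam"] / len(y_test)
print(round(train_ratio, 3), round(test_ratio, 3))
```

With stratification the two ratios should differ by a fraction of a percent at most, so the manual check above becomes a formality.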

We could build a vector of the most frequent words manually and use it as our feature vector. Instead, I chose to use every word that appears in any of the emails, regardless of its frequency.
Now I will exploit the power of Scikit-learn and use a CountVectorizer, which turns a collection of text documents into a matrix of token (here, word) counts.

In [45]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(trainData['message'].values)
len(vectorizer.get_feature_names_out())  # Number of features extracted by CountVectorizer (get_feature_names was removed in scikit-learn 1.2)

27799
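
To see what that matrix of token counts looks like, here is a small sketch on three made-up documents (they are not from the dataset). Each row is a document and each column is a word of the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny, made-up documents.
docs = ["free money now", "meeting now", "free free offer"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
# ['free', 'meeting', 'money', 'now', 'offer']
print(toy_counts.toarray())
# [[1 0 1 1 0]
#  [0 1 0 1 0]
#  [2 0 0 0 1]]
```

The real matrix has the same layout, just with one row per training email and 27799 columns, which is why it is stored as a sparse matrix.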

Now we will again use Scikit-learn, this time to fit a Naive Bayes model to the data.

In [46]:
classifier = MultinomialNB()
labels = trainData['class'].values
classifier.fit(counts, labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
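
The `alpha=1.0` in the output above is the smoothing parameter. As a rough sketch of what it does, the snippet below recomputes the smoothed per-class word probabilities by hand, as (count of word in class + alpha) / (total words in class + alpha × vocabulary size), and compares them with the `feature_log_prob_` attribute of a fitted model. The count matrix and labels are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count matrix: rows are messages, columns are words.
X = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 3, 1],
              [0, 2, 2]])
y = np.array(["ham", "ham", "spam", "spam"])

alpha = 1.0
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Recompute log P(word | class) with additive (Laplace) smoothing.
manual = []
for c in clf.classes_:
    word_counts = X[y == c].sum(axis=0)
    smoothed = (word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1])
    manual.append(np.log(smoothed))
manual = np.array(manual)

matches = np.allclose(manual, clf.feature_log_prob_)
print(matches)
```

The smoothing is what keeps a word that never appeared in one class from forcing that class's probability to zero.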

Let's see the training accuracy attained by this model:

In [47]:
predictions=classifier.predict(counts)

In [49]:
print('Accuracy on the training dataset is:',100*(predictions==labels).tolist().count(True)/len(predictions),'%')

Accuracy on the training dataset is: 99.57142857142857 %


Well, this is not really the measure to get excited about. The measure that actually reflects the performance of this algorithm is the accuracy on the test dataset, so let's compute that!

In [50]:
test_counts = vectorizer.transform(testData['message'].values)
testpredictions = classifier.predict(test_counts)
testlabels=testData['class'].values
print('Accuracy on the test dataset is:',100*(testpredictions==testlabels).tolist().count(True)/len(testpredictions),'%')

Accuracy on the test dataset is: 98.66666666666667 %


Great! We have a good accuracy of 98.67% on the test dataset, so we can judge the filter to be a good one.
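
The k-fold cross validation mentioned earlier as the better alternative to a single split could look roughly like this. It is a sketch on a handful of made-up messages, not a rerun of this notebook: the vectorizer and the classifier are wrapped in a pipeline so that the vocabulary is learned again on each training fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages, repeated so that every fold contains both classes.
spam = ["win free money now", "claim your free prize", "cheap pills online offer"] * 2
ham = ["meeting moved to friday", "please review the attached report", "lunch tomorrow at noon"] * 2
toy_messages = spam + ham
toy_labels = ["spam"] * len(spam) + ["ham"] * len(ham)

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified folds keep the spam/ham ratio the same in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, toy_messages, toy_labels, cv=cv)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```

On the real data, `toy_messages` and `toy_labels` would be replaced by the message and class columns of the full DataFrame.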

We can take this study further by examining the confusion matrix to investigate what kinds of mistakes our model makes. Since the dataset is unbalanced (only about 17% of the emails are actually spam), even the null model, which labels every email as ham, would reach an accuracy of about 83%!

However, in this setting we would rather let some spam through than block a legitimate and possibly important email that was misclassified. We are therefore not 'obsessed' with minimizing the false negatives (spam classified as ham), which tend to be relatively numerous because the positive class is the minority; they do not conflict with the priorities just described. **Warning:** this is not absolute. With a very high false negative rate we would end up with a useless spam classifier, but as long as the overall error rate is very low we can get away with it.

In other words, the bias caused by the class imbalance is not very dangerous here: most ML algorithms tend to favor the dominant class, which happens to be convenient for this problem.

Note that this is not always the case. For example, problems where the minority class is the one that matters, such as whether a patient has cancer or whether a bank transaction is fraudulent, need further study of the confusion matrix and ROC curves.
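
As a starting point for that kind of study, here is a small sketch of a confusion matrix with 'spam' as the positive class. The true and predicted labels are made up for illustration; on the real data they would be testlabels and testpredictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 8 ham and 2 spam, with one spam email missed by the model.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()
print(cm)
# [[8 0]
#  [1 1]]

# Precision: of the emails flagged as spam, how many really were spam?
precision = precision_score(y_true, y_pred, pos_label="spam")
# Recall: of the real spam emails, how many did we catch?
recall = recall_score(y_true, y_pred, pos_label="spam")
print(precision, recall)  # 1.0 0.5

# The null model (always predict ham) for comparison.
null_accuracy = y_true.count("ham") / len(y_true)
print(null_accuracy)  # 0.8
```

In this toy case the accuracy is 90%, only a little above the null model's 80%, while recall shows that half of the spam got through; that gap is exactly what plain accuracy hides on an unbalanced dataset.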