# SPAM Filter:

We'll use sklearn.naive_bayes to train a spam classifier! First, we load our data.

In [28]:
import os  # Portable access to operating-system-dependent functionality. Here we use it for handling paths.
import io  # Tools for working with input/output (I/O) streams
import numpy
import string
from pandas import DataFrame
from bs4 import BeautifulSoup  # A Python package for parsing HTML and XML. We'll use it to remove HTML tags.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [19]:
path = r'E:\Books'  # Raw string, so the backslash is not read as an escape sequence
for root, dirnames, filenames in os.walk(path):
    for filename in filenames:
        print(filename)
    print('////////////')

Fundamentals Of Neural Networks.pdf
ISLR Seventh Printing.pdf
////////////
LagrangeForSVMs.pdf
lecture6.pdf
SMO algorithm.pdf
////////////


In [40]:
text="""<html> 
  <head>
    <title>My Web Page!</title>
  </head>
  <body>
    Hello World!
  </body>
</html>"""
print(cleanLine(text))  # cleanLine already strips the HTML tags itself


my web page hello world


In [44]:
def cleanLine(line):
    line = BeautifulSoup(line, "lxml").text  # Strip HTML tags
    line = ''.join(ch for ch in line if ch not in string.punctuation)  # Remove punctuation
    words = [word.lower() for word in line.split()]
    words = [word for word in words if word.isalpha()]  # Keep purely alphabetic tokens
    return ' '.join(words)
    

def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            # We don't want to take the email header into account, so we skip it and extract only
            # the body. In all the email samples, the header and body are separated by a blank line.
            inBody = False
            lines = []
            with io.open(filepath, 'r', encoding='latin1') as file:
                for line in file:
                    if inBody:
                        lines.append(cleanLine(line))
                    elif line == '\n':
                        inBody = True
            message = ' '.join(lines)
            # yield turns this function into a generator: the (path, message) pairs can be iterated
            # only once and are not all stored in memory. Given the size of the extracted data, this helps.
            yield filepath, message


# Our main objective is to create a DataFrame indexed by the path of each email, with two columns:
# one containing the classification (spam/ham) and one containing the cleaned body text of the email.


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)
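
The header-skipping rule described in the comments of readFiles can be tried on its own. The following is a minimal sketch, assuming a made-up email held in memory; `extract_body` and `raw_email` are hypothetical names that are not part of the notebook, and the cleaning step is left out so that only the blank-line logic is shown.

```python
import io

def extract_body(lines):
    """Return the lines that follow the first blank line (hypothetical helper)."""
    in_body = False
    body = []
    for line in lines:
        if in_body:
            body.append(line.rstrip('\n'))
        elif line == '\n':
            # The first blank line marks the end of the header
            in_body = True
    return body

# A made-up email: two header lines, a blank separator, then the body
raw_email = (
    "From: alice@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "Are you free at noon?\n"
    "\n"
    "See you there.\n"
)

body = extract_body(io.StringIO(raw_email))
print(body)  # ['Are you free at noon?', '', 'See you there.']
```

Only the first blank line switches to body mode; later blank lines are kept as part of the body.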

In [45]:
from pandas import concat  # DataFrame.append was removed in pandas 2.0, so we concatenate instead

# Build the cleaned dataset from the spam and ham folders
data = concat([
    dataFrameFromDirectory(r'C:\Users\YsfEss\Desktop\emails\spam', 'spam'),
    dataFrameFromDirectory(r'C:\Users\YsfEss\Desktop\emails\ham', 'ham'),
])

Let's have a look at that DataFrame:

In [46]:
data.head()

Unnamed: 0,class,message
C:\Users\YsfEss\Desktop\emails\spam\00001.7848dde101aa985090474a91ec93fcf0,spam,ype black display none tr go...
C:\Users\YsfEss\Desktop\emails\spam\00002.d94f1b97e48ed3b553b3508d116e6a09,spam,fight the risk of cancer slim down guarantee...
C:\Users\YsfEss\Desktop\emails\spam\00003.2ee33bc6eacdb11f38d052c44819ba6c,spam,fight the risk of cancer slim down guarantee...
C:\Users\YsfEss\Desktop\emails\spam\00004.eac8de8d759b7e74154f142194282724,spam,adult club offers free membership instant...
C:\Users\YsfEss\Desktop\emails\spam\00005.57696a39d7d84318ce497886896bf90d,spam,i thought you might like these slim down guara...


Okay, now it's time to split the data into train and test datasets. We'll do a classic random 70/30 split. Ideally, k-fold cross-validation would be better, but we'll use a single split in this instance.

In [47]:
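# First, shuffle the DataFrame once to guarantee the randomness of the split.
# Both sets are sliced from the same shuffle; two separate shuffles would let them overlap.
shuffled = data.sample(frac=1)
testData = shuffled[0:900]
trainData = shuffled[900:]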
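
For comparison, the k-fold alternative mentioned above could look roughly like the sketch below. It is not what this notebook does: the messages and labels are toy data invented for illustration, and `cv_messages`, `cv_labels`, `cv_model` and `cv_scores` are hypothetical names. It assumes scikit-learn's `make_pipeline` and `cross_val_score`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages and labels, invented for illustration only
cv_messages = [
    'free prize claim now', 'win cash free offer', 'cheap pills free shipping',
    'claim your free cash', 'exclusive offer win now', 'free membership offer',
    'meeting moved to friday', 'lunch tomorrow at noon', 'see the attached report',
    'golf game on sunday', 'notes from the meeting', 'report due on friday',
]
cv_labels = ['spam'] * 6 + ['ham'] * 6

# Putting the vectorizer inside the pipeline means the vocabulary is
# rebuilt from the training folds only, on every round.
cv_model = make_pipeline(CountVectorizer(), MultinomialNB())

# cv=3 gives three train/test rounds; each message is tested exactly once.
cv_scores = cross_val_score(cv_model, cv_messages, cv_labels, cv=3)
print(cv_scores.mean())
```

Each entry of `cv_scores` is the accuracy on one held-out fold, so the mean is less dependent on a single lucky or unlucky split.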

To keep the model and the two phases of training and testing unbiased, we must ensure that the proportion of 'spam' in the two sets is almost the same as in the full dataset. The percentage of spam in the full dataset is 16%.

In [56]:
print('Proportion of spams in the train dataset:')
print(trainData['class'].tolist().count('spam') / len(trainData))
print('Proportion of spams in the test dataset:')
print(testData['class'].tolist().count('spam') / len(testData))

Proportion of spams in the train dataset:
0.16380952380952382
Proportion of spams in the test dataset:
0.18777777777777777


They are pretty close, so we can move on knowing we didn't break anything in the sampling phase.
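
If the proportions had drifted apart, one option would be a stratified split, which preserves the class ratio by construction. The sketch below assumes scikit-learn's `train_test_split` with its `stratify` argument and runs on a toy DataFrame invented for illustration; `toy`, `train_toy` and `test_toy` are hypothetical names, not the notebook's variables.

```python
from pandas import DataFrame
from sklearn.model_selection import train_test_split

# Toy dataset: 20% spam, 80% ham (invented for illustration)
toy = DataFrame({
    'message': ['msg %d' % i for i in range(100)],
    'class': ['spam'] * 20 + ['ham'] * 80,
})

# stratify keeps the spam/ham ratio the same in both parts of the 70/30 split
train_toy, test_toy = train_test_split(
    toy, test_size=0.3, stratify=toy['class'], random_state=0
)

train_share = (train_toy['class'] == 'spam').mean()
test_share = (test_toy['class'] == 'spam').mean()
print(train_share, test_share)
```

With the notebook's data, the same call would take `data` and `data['class']` in place of the toy frame.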

Now we will use a CountVectorizer to split up each message into its list of words, and throw that into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [3]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(trainData['message'].values)

classifier = MultinomialNB()
targets = trainData['class'].values  # Labels must come from the same rows as the counts
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
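
To see what the vectorizer actually hands to the classifier, here is a small sketch on two toy messages invented for illustration (`toy_messages`, `toy_vectorizer`, `toy_counts` and `vocab` are hypothetical names). Each column is one word of the learned vocabulary and each row holds the word counts of one message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy messages, invented for illustration
toy_messages = ['free money free offer', 'lunch offer tomorrow']

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)                 # ['free', 'lunch', 'money', 'offer', 'tomorrow']
print(toy_counts.toarray())  # [[2 0 1 1 0]
                             #  [0 1 0 1 1]]
```

MultinomialNB then learns, for each class, how often each column's word tends to appear.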

Let's try it out:

In [4]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], 
      dtype='<U4')

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying train/test to this spam classifier - see how well it can predict some subset of the ham and spam emails.
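
As a starting point for that challenge, the sketch below shows one way to score a classifier on held-out messages. It is only an illustration under stated assumptions: the messages and labels are toy data invented here, the names `train_messages`, `test_messages`, `model` and so on are hypothetical, and it relies on scikit-learn's `accuracy_score` and `confusion_matrix`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training and test messages, invented for illustration only
train_messages = [
    'free prize claim now', 'win cash free offer', 'cheap pills free shipping',
    'meeting moved to friday', 'lunch tomorrow at noon', 'see the attached report',
]
train_labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

test_messages = ['claim your free cash', 'report due on friday']
test_labels = ['spam', 'ham']

# Fit the vectorizer and the classifier on the training messages only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_messages, train_labels)

# Predict on messages the model has never seen and compare with the true labels
predicted = model.predict(test_messages)
accuracy = accuracy_score(test_labels, predicted)

# Rows are true classes, columns are predicted classes, in the order ham, spam
matrix = confusion_matrix(test_labels, predicted, labels=['ham', 'spam'])
print(accuracy)
print(matrix)
```

With the notebook's own split, the training inputs would be `trainData['message']` and `trainData['class']`, and the test inputs `testData['message']` and `testData['class']`.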