# Chapter 2: Your first practical NLP application, spam filtering

Read in spam and ham file lists:

In [1]:
import os
import codecs

def read_in(folder):
    files = os.listdir(folder)
    a_list = []
    for a_file in files:
        if not a_file.startswith("."):
            f = codecs.open(folder + a_file, "r", encoding = "ISO-8859-1", errors="ignore")
            a_list.append(f.read())
            f.close()
    return a_list

Initialise lists and print out length â€“ this should return 1500 for `enron1/spam` and 3672 for `enron1/ham`:

In [2]:
spam_list = read_in("enron1/spam/")
print(len(spam_list))
print(spam_list[0])
ham_list = read_in("enron1/ham/")
print(len(ham_list))
print(ham_list[0])

1500
Subject: what up,, your cam babe
What are you looking for?
If your looking for a companion for friendship, love, a date, or just good ole'
Fashioned * * * * * *, then try our brand new site; it was developed and created
To help anyone find what they' re looking for. A quick bio form and you' re
On the road to satisfaction in every sense of the word.... No matter what
That may be!
Try it out and youll be amazed.
Have a terrific time this evening
Copy and pa ste the add. Ress you see on the line below into your browser to come to the site.
Http:// www. Meganbang. Biz/bld/acc /
No more plz
Http:// www. Naturalgolden. Com/retract /
Counterattack aitken step preemptive shoehorn scaup. Electrocardiograph movie honeycomb. Monster war brandywine pietism byrne catatonia. Encomia lookup intervenor skeleton turn catfish.

3672
Subject: ena sales on hpl
Just to update you on this project' s status:
Based on a new report that scott mills ran for me from sitara, I have come up


Combine all emails together, keeping the label, and shuffle them: 

In [3]:
import random

all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]
random.seed(42)
random.shuffle(all_emails)
print (f"Dataset size = {str(len(all_emails))} emails")

Dataset size = 5172 emails


Preprocess the texts by tokenising them and removing the stopwords:


In [4]:
import nltk
from nltk import word_tokenize

def get_features(text): 
    features = {}
    word_list = [word for word in word_tokenize(text.lower())]
    for word in word_list:
        features[word] = True
    return features

all_features = [(get_features(email), label) for (email, label) in all_emails]

print(get_features("Participate In Our New Lottery NOW!"))
print(len(all_features))
print(len(all_features[0][0]))
print(len(all_features[99][0]))

{'participate': True, 'in': True, 'our': True, 'new': True, 'lottery': True, 'now': True, '!': True}
5172
38
38


Apply Naive Bayes classifier:

In [5]:
from nltk import NaiveBayesClassifier, classify

def train(features, proportion):
    train_size = int(len(features) * proportion)
    # initialise the training and test sets
    train_set, test_set = features[:train_size], features[train_size:]
    print (f"Training set size = {str(len(train_set))} emails")
    print (f"Test set size = {str(len(test_set))} emails")
    # train the classifier
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier

train_set, test_set, classifier = train(all_features, 0.8)

Training set size = 4137 emails
Test set size = 1035 emails


Evaluate the performance:

In [6]:
def evaluate(train_set, test_set, classifier):
    # check how the classifier performs on the training and test sets
    print (f"Accuracy on the training set = {str(classify.accuracy(classifier, train_set))}")
    print (f"Accuracy on the test set = {str(classify.accuracy(classifier, test_set))}")    
    # check which words are most informative for the classifier
    classifier.show_most_informative_features(50)

evaluate(train_set, test_set, classifier)

Accuracy on the training set = 0.9615663524292966
Accuracy on the test set = 0.936231884057971
Most Informative Features
               forwarded = True              ham : spam   =    200.5 : 1.0
                    2004 = True             spam : ham    =    148.6 : 1.0
                     nom = True              ham : spam   =    125.8 : 1.0
                    pain = True             spam : ham    =    103.6 : 1.0
                    spam = True             spam : ham    =     92.4 : 1.0
                  health = True             spam : ham    =     81.1 : 1.0
                     sex = True             spam : ham    =     79.5 : 1.0
                     ect = True              ham : spam   =     75.7 : 1.0
              nomination = True              ham : spam   =     74.8 : 1.0
                   super = True             spam : ham    =     74.7 : 1.0
                featured = True             spam : ham    =     73.1 : 1.0
                creative = True             spam : ham

Explore the contexts of use:

In [7]:
from nltk.text import Text

def concordance(data_list, search_word):
    for email in data_list:
        word_list = [word for word in word_tokenize(email.lower())]
        text_list = Text(word_list)
        if search_word in word_list:
            text_list.concordance(search_word)


print ("STOCKS in HAM:")
concordance(ham_list, "stocks")
print ("\n\nSTOCKS in SPAM:")
concordance(spam_list, "stocks")

STOCKS in HAM:
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ad my portfolio is diversified into stocks that have lost even more money than


STOCKS in SPAM:
Displaying 3 of 3 matches:
report reveals this smallcap rocket stocks newsletter first we would like to s
his email pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this email . none o
Displaying 3 of 3 matches:
might occur . as with many microcap stocks , today ' s company has additional 
is emai | pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this emai | . none 
Displaying 6 of

Displaying 2 of 2 matches:
 % on regular price we have massive stocks of drugs for same day dispatch fast
e do have the lowest price and huge stocks ready for same - day dispatch . two
Displaying 2 of 2 matches:
his email pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this email . none o
Displaying 4 of 4 matches:
n this stock . some of these smal | stocks are absoiuteiy fiying , as many of 
 statements . as with many microcap stocks , todays company has additional ris
biication pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this publication . 
Displaying 1 of 1 matches:
s obtained . investing in micro cap stocks is extremely risky and , investors 
Displaying 2 of 2 matches:
is emai | pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this email . non

Input some of your own messages:

In [8]:
test_spam_list = ["Participate in our new lottery!", "Try out this new medicine"]
test_ham_list = ["See the minutes from the last meeting attached", 
                 "Investors are coming to our office on Monday"]

test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]

new_test_set = [(get_features(email), label) for (email, label) in test_emails]

evaluate(train_set, new_test_set, classifier)

Accuracy on the training set = 0.9615663524292966
Accuracy on the test set = 1.0
Most Informative Features
               forwarded = True              ham : spam   =    200.5 : 1.0
                    2004 = True             spam : ham    =    148.6 : 1.0
                     nom = True              ham : spam   =    125.8 : 1.0
                    pain = True             spam : ham    =    103.6 : 1.0
                    spam = True             spam : ham    =     92.4 : 1.0
                  health = True             spam : ham    =     81.1 : 1.0
                     sex = True             spam : ham    =     79.5 : 1.0
                     ect = True              ham : spam   =     75.7 : 1.0
              nomination = True              ham : spam   =     74.8 : 1.0
                   super = True             spam : ham    =     74.7 : 1.0
                featured = True             spam : ham    =     73.1 : 1.0
                creative = True             spam : ham    =     71.5

See how they get classified:

In [9]:
for email in test_spam_list:
    print (email)
    print (classifier.classify(get_features(email)))
for email in test_ham_list:
    print (email)
    print (classifier.classify(get_features(email)))

Participate in our new lottery!
spam
Try out this new medicine
spam
See the minutes from the last meeting attached
ham
Investors are coming to our office on Monday
ham


Run in an interactive manner:

In [10]:
while True:
    email = input("Type in your email here (or press 'Enter'): ")
    if len(email)==0:
        break
    else: 
        prediction = classifier.classify(get_features(email))
        print (f"This email is likely {prediction}\n")

Type in your email here (or press 'Enter'): Buy new meds
This email is likely spam

Type in your email here (or press 'Enter'): Buy new meds here!
This email is likely spam

Type in your email here (or press 'Enter'): Get your stock options fast
This email is likely spam

Type in your email here (or press 'Enter'): Let's schedule a meeting for tomorrow
This email is likely ham

Type in your email here (or press 'Enter'): 


Run on a different dataset:

# Assignment:

Apply the classifier to a different test set, e.g. the emails from `enron2/`. As before, you need to read in the data, extract textual content, extract the features and evaluate the classifier. What do the results tell you?

In [11]:
test_spam_list = read_in("enron2/spam/")
print(len(test_spam_list))
print(test_spam_list[0])
test_ham_list = read_in("enron2/ham/")
print(len(test_ham_list))
print(test_ham_list[0])

test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]
random.shuffle(test_emails)

new_test_set = [(get_features(email), label) for (email, label) in test_emails]

evaluate(train_set, new_test_set, classifier)

1496
Subject: big range of all types of downloadable software.
Need software? Click here.
Our american professors like their literature clear, cold, pure and very dead.
Being another character is more interesting than being yourself.
4361
Subject: re: telephone interview with enron corp. Research dept.
Dear shirley:
Confirming that I will be waiting for the telephone interview at 1 pm
Tomorrow.? I would like to give you my cell phone number, 713/907 - 6717, as a
Back - up measure.? Please note that my first preference is to receive the call
At my home number, 713/669 - 0923.
Sincerely,
RabI de
?
? Shirley. Crenshaw@ enron. Com wrote:
Dear rabi:
I have scheduled the telephone interview for 1: 00 pm on friday, july 7 th.
We will call you at 713/669 - 0923. If there are any changes, please let
Me know.
Sincerely,
Shirley crenshaw
713 - 853 - 5290
RabI deon 06/26/2000 10: 37: 24 pm
To: shirley crenshaw
Cc:
Subject: re: telephone interview with enron corp. Research dept.
Dear ms. Crenshaw:


Combine the two datasets:

In [12]:
spam_list = read_in("enron1/spam/") + read_in("enron2/spam/")
print(len(spam_list))
ham_list = read_in("enron1/ham/") + read_in("enron2/ham/")
print(len(ham_list))

all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]
random.shuffle(test_emails)

all_features = [(get_features(email), label) for (email, label) in all_emails]
print(len(all_features))

train_set, test_set, classifier = train(all_features, 0.8)
evaluate(train_set, new_test_set, classifier)

2996
8033
11029
Training set size = 8823 emails
Test set size = 2206 emails
Accuracy on the training set = 0.9819789187351241
Accuracy on the test set = 0.9810483182516647
Most Informative Features
                   meter = True              ham : spam   =    263.8 : 1.0
                   vince = True              ham : spam   =    200.3 : 1.0
                     sex = True             spam : ham    =    195.1 : 1.0
                     nom = True              ham : spam   =    194.9 : 1.0
                     php = True             spam : ham    =    182.1 : 1.0
            prescription = True             spam : ham    =    169.2 : 1.0
                     ect = True              ham : spam   =    167.7 : 1.0
                    spam = True             spam : ham    =    145.8 : 1.0
               forwarded = True              ham : spam   =    136.4 : 1.0
                     fyi = True              ham : spam   =    134.6 : 1.0
                    2005 = True             spam : h