# Week 5/6 Assignment
## Gabrielle Bartomeo, Hovig Ohannessian

Use a set of documents (e-mails, websites, etc.) to parse out spam (fraudulent) from ham (legitimate). We decided to use [the old SpamAssassin e-mails](https://spamassassin.apache.org/old/publiccorpus/) as a corpus. If you wish to run this code locally, you'll have to download the accompanying [7zip file on Github](https://github.com/gabartomeo/data620-cunysps/blob/master/Assignment%2005/Data.7z) and extract it to the same location where you are keeping this Jupyter Notebook.

## Data setup

In [1]:
import os
import re
import nltk
from nltk.classify import apply_features
import random
import pandas as pd

To start with, we figured out which libraries would be needed. We used `os` to access local documents and `re` for formatting them and for identifying spam-related features. To further identify and classify what was and wasn't spam, we used `nltk`. The `random` library was for shuffling the words around when testing, and `pandas` for understanding our results.

In [2]:
ham = {
    "Easy Ham 1": {},
    "Easy Ham 2": {},
    "Hard Ham": {}
}

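for ham_name in list(ham.keys()):
    # os.path.join keeps the paths portable across operating systems.
    folder = os.path.join("Data", "Ham", ham_name)
    for file_name in os.listdir(folder):
        # The with-block closes the file even if reading fails.
        with open(os.path.join(folder, file_name)) as open_file:
            raw_text = open_file.read()
        # Newlines and the literal text "<br?>" become spaces.
        raw_text = re.sub("(\n)|(<br\\?>)", " ", raw_text)
        # Angle-bracketed text, e-mail addresses, underscore runs, letter-dot-letters sequences, repeated "Ii".
        raw_text = re.sub("(<.*>)|([a-zA-Z\.\-]+@[a-zA-Z\.\-]+)|(_{2,})|([a-zA-Z](?:\.[a-zA-Z]{1,}))|((?:Ii){1,})", "", raw_text)
        # Times such as 12:30 or 12:30:45.
        raw_text = re.sub("\d{2}(?:\:\d{2}){1,}", "", raw_text)
        # Runs of 2 to 6 hexadecimal characters.
        raw_text = re.sub("[a-fA-F0-9]{2,6}", "", raw_text)
        # Selected punctuation becomes spaces.
        raw_text = re.sub("[\.\\\/\\*\\-\=\+\_\|\*\,@\:~]", " ", raw_text)
        # Stray apostrophes and the text attached to them.
        raw_text = re.sub("([^a-zA-Z]')|(.*'[^zts]+)", " ", raw_text)
        # Collapse repeated spaces.
        raw_text = re.sub(" {2,}", " ", raw_text)
        ham[ham_name][file_name] = raw_text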

We opened each of the ham e-mails and read it into Python, stripping out or replacing with spaces any new lines, HTML tags, e-mail addresses, repeating underscores, URLs, timestamps, hexadecimal codes, some special characters, and words with stray apostrophes. Each cleaned e-mail was then stored in a sub-dictionary named for the folder it came from, keyed by its file name.
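One detail of the tag-removal step is worth illustrating. Because new lines are replaced with spaces first, each e-mail is a single line by the time the `<.*>` pattern runs, and `.*` is greedy, so one match can stretch from the first `<` to the last `>`. The sketch below shows this on a made-up string and compares it with a per-tag pattern, `<[^>]*>`, which is only an illustrative alternative and is not what this notebook ran.

```python
import re

# Hypothetical sample text; not taken from the corpus.
sample = "Click <b>here</b> to read <i>more</i> today"

# Pattern used in the notebook: greedy, spans the first "<" to the last ">".
greedy = re.sub(r"<.*>", "", sample)

# Illustrative alternative (an assumption, not the notebook's code):
# each tag is matched on its own, so the text between tags survives.
per_tag = re.sub(r"<[^>]*>", "", sample)

print(repr(greedy))   # 'Click  today'
print(repr(per_tag))  # 'Click here to read more today'
```

With the greedy pattern, everything between the first and last angle bracket is dropped, including the words "here", "to read", and "more".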

In [3]:
spam = {
    "Spam 1": {},
    "Spam 2": {}
}

for spam_name in list(spam.keys()):
    # os.path.join keeps the paths portable across operating systems.
    folder = os.path.join("Data", "Spam", spam_name)
    for file_name in os.listdir(folder):
        # The spam files are read as latin-1; the with-block closes the file.
        with open(os.path.join(folder, file_name), encoding="latin-1") as open_file:
            raw_text = open_file.read()
        # Same cleaning steps as for the ham e-mails.
        raw_text = re.sub("(\n)|(<br\\?>)", " ", raw_text)
        raw_text = re.sub("(<.*>)|([a-zA-Z\.\-]+@[a-zA-Z\.\-]+)|(_{2,})|([a-zA-Z](?:\.[a-zA-Z]{1,}))|((?:Ii){1,})", "", raw_text)
        raw_text = re.sub("\d{2}(?:\:\d{2}){1,}", "", raw_text)
        raw_text = re.sub("[a-fA-F0-9]{2,6}", "", raw_text)
        raw_text = re.sub("[\.\\\/\\*\\-\=\+\_\|\*\,@\:~]", " ", raw_text)
        raw_text = re.sub("([^a-zA-Z]')|(.*'[^zts]+)", " ", raw_text)
        raw_text = re.sub(" {2,}", " ", raw_text)
        spam[spam_name][file_name] = raw_text

The same process used for the ham e-mails was repeated for the spam e-mails, except that the spam files were opened with `latin-1` encoding.
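Since the ham and spam loops differ only in the folder and the encoding, the file reading could be shared. The following is a minimal sketch, assuming a hypothetical helper named `read_folder()` (it is not part of this notebook) and a folder that contains only regular files; the cleaning steps would still be applied to each returned text.

```python
import os
import tempfile

def read_folder(folder, encoding=None):
    """Return {file_name: text} for every file directly inside `folder`.

    Hypothetical helper; assumes `folder` contains only regular files.
    """
    texts = {}
    for file_name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, file_name), encoding=encoding) as handle:
            texts[file_name] = handle.read()
    return texts

# Small demonstration on a throwaway folder with one made-up file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "0001.txt"), "wb") as handle:
        handle.write(b"caf\xe9 offer")  # byte 0xE9 is "é" in latin-1
    demo = read_folder(tmp, encoding="latin-1")

print(demo)  # {'0001.txt': 'café offer'}
```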

### Setting up the training tokens

Next, we set up the tokens used for training and for the later steps.

In [4]:
stop_words = list(set(nltk.corpus.stopwords.words('english')))

English stopwords were identified, courtesy of the `nltk` library - they are typically not significant when it comes to textual analysis.

In [5]:
ham_tokens = []
ham_values = list(ham["Easy Ham 1"].values())
ham_tokens += [nltk.word_tokenize(i) for i in ham_values]
ham_tokens = [token for tokens in ham_tokens for token in tokens]
ham_tokens = list(filter(None, [i if i not in stop_words else None for i in ham_tokens]))

A list of ham tokens specifically for training the classifier was produced. These tokens were made by running the `nltk` library's `word_tokenize()` function over each individual e-mail from the `Easy Ham 1` set and parsing out words or... "words" as the case may be. Even repeating numbers and letters have a place when checking for spam, so we kept them. Any token that matched one of the English stopwords was removed.
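The tokenize, flatten, and filter steps can be sketched without `nltk`. This is only an illustration under stated assumptions: `str.split()` stands in for `word_tokenize()`, and the four-word `stop_words` set and the two sample e-mails are made up.

```python
# Stand-ins for illustration: a tiny stopword set and str.split()
# in place of nltk's stopword list and word_tokenize().
stop_words = {"the", "is", "a", "to"}
emails = ["the offer is FREE", "reply to claim a prize 1000"]

# One list of tokens per e-mail.
tokenized = [email.split() for email in emails]

# Flatten the list of lists into a single list of tokens.
flat = [token for tokens in tokenized for token in tokens]

# Keep only tokens that are not stopwords (the comparison is case-sensitive).
kept = [token for token in flat if token not in stop_words]

print(kept)  # ['offer', 'FREE', 'reply', 'claim', 'prize', '1000']
```

Because the comparison is case-sensitive and the `nltk` English stopwords are lowercase, a capitalized token such as "The" would pass through this kind of filter.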

In [6]:
spam_tokens = []
spam_values = list(spam["Spam 1"].values())
spam_tokens += [nltk.word_tokenize(i) for i in spam_values]
spam_tokens = [token for tokens in spam_tokens for token in tokens]
spam_tokens = list(filter(None, [i if i not in stop_words else None for i in spam_tokens]))

The same process was repeated for the spam e-mails in the `Spam 1` set.

In [7]:
neutral_tokens = list(set(ham_tokens).intersection(spam_tokens))
ham_tokens = list(set(ham_tokens).difference(set(neutral_tokens)))
spam_tokens = list(set(spam_tokens).difference(set(neutral_tokens)))
mail_tokens = ([(token, "ham") for token in ham_tokens] + [(token, "spam") for token in spam_tokens])
half_mail = int(len(mail_tokens)/2)

There was obviously overlap in the words present in the ham and spam e-mails. We decided the best way to deal with these shared tokens was to remove them completely: we identified the tokens that were present in both sets and removed them from the ham and spam tokens. Then we made a combined list of labeled tokens and stored half its length in `half_mail` for splitting the list later.
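The set operations in this step can be seen on a toy example. The token lists below are made up; the operations mirror the cell above.

```python
# Made-up token lists, including a duplicate and two shared tokens.
ham_tokens = ["meeting", "patch", "free", "list", "patch"]
spam_tokens = ["free", "cash", "list", "winner"]

# Tokens that appear in both sets carry no signal and are dropped.
neutral = set(ham_tokens) & set(spam_tokens)
ham_only = sorted(set(ham_tokens) - neutral)
spam_only = sorted(set(spam_tokens) - neutral)

# Label what is left and note the halfway point of the combined list.
labeled = [(t, "ham") for t in ham_only] + [(t, "spam") for t in spam_only]
half = len(labeled) // 2

print(sorted(neutral))  # ['free', 'list']
print(labeled)          # [('meeting', 'ham'), ('patch', 'ham'), ('cash', 'spam'), ('winner', 'spam')]
print(half)             # 2
```

Converting to sets also removes duplicates, so each token appears once in the training data no matter how often it occurred in the e-mails.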

### Making an accuracy function

We created this function back in Project 3 and adapted it for this assignment. It runs the training and testing several times on a shuffled mixture of the tokens from all the e-mails, which gives a clear picture of how a feature function performs before it is tried on whole e-mails.

In [8]:
def accuracy(number_of_runs, function_to_use):
    acc_df = {
        "classifier": [],
        "train_set_accuracy": [],
        "test_set_accuracy": [],
        "devtest_set_accuracy": [],
        "devtest_errors": []
    }
    for i in range(number_of_runs):
        random.shuffle(mail_tokens)
        acc_train_words = mail_tokens[half_mail:]
        acc_devtest_words = mail_tokens[int(half_mail/2):half_mail]
        acc_test_words = mail_tokens[:int(half_mail/2)]
        acc_train_set = [(function_to_use(n), g) for (n,g) in acc_train_words]
        acc_devtest_set = [(function_to_use(n), g) for (n,g) in acc_devtest_words]
        acc_test_set = [(function_to_use(n), g) for (n,g) in acc_test_words]
        acc_classifier = nltk.NaiveBayesClassifier.train(acc_train_set)
        acc_df["classifier"].append(acc_classifier)
        acc_df["train_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_train_set))
        acc_df["test_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_test_set))
        acc_df["devtest_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_devtest_set))
        acc_errors = []
        for (word, tag) in acc_devtest_words:
            acc_guess = acc_classifier.classify(function_to_use(word))
            if acc_guess != tag:
                acc_errors.append( (tag, acc_guess, word) )
        acc_df["devtest_errors"].append(acc_errors)
    acc_df = pd.DataFrame.from_dict(acc_df)
    return(acc_df)

This function creates a dictionary of empty lists that it fills as it goes. It runs a set number of times as defined by the user, and uses a feature function, also supplied by the user, whose accuracy is being tested. For each run, the e-mail tokens are shuffled and divided into three sets: the second half of the list for training, and one quarter each for devtest and test. The feature function is applied to every token in each set, and a classifier is trained on the training set using the Naive Bayes method via `nltk`. The classifier's accuracy is then checked against all three sets, and the devtest errors are recorded in case they are needed in the future. Finally, the dictionary is converted into a data frame using `pandas` and returned for storage in a variable.
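The slicing used for the three sets can be sketched on its own. The helper name `split_tokens()` and the `seed` parameter are assumptions added for illustration; the slice boundaries follow the `accuracy()` function above.

```python
import random

def split_tokens(labeled_tokens, seed=None):
    """Shuffle a copy of the tokens and split it 50% / 25% / 25%."""
    items = list(labeled_tokens)
    random.Random(seed).shuffle(items)
    half = len(items) // 2
    quarter = half // 2
    test = items[:quarter]          # first quarter
    devtest = items[quarter:half]   # second quarter
    train = items[half:]            # second half
    return train, devtest, test

# Twenty made-up labeled tokens.
tokens = [(f"word{i}", "ham" if i % 2 else "spam") for i in range(20)]
train, devtest, test = split_tokens(tokens, seed=0)
print(len(train), len(devtest), len(test))  # 10 5 5
```

The three slices do not overlap, so no token is used for both training and evaluation within a single run.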

### Making a spam-identifying function

Being able to measure the accuracy of our function was great, but first we needed a function to measure. The function below, `spam_buster()`, checks the features of each word to help identify spam.

In [9]:
def spam_buster(word):
    features = {}    
    features["lazy shift"] = True if len(list(filter(None, [w if w.isupper() else None for w in word]))) > 3 else False
    features["repeating"] = True if len(re.findall("([a-zA-Z]+)\\1{2,}", word)) > 0 or len(re.findall("\d[a-zA-Z][a-zA-Z0-9]{1,}", word)) > 0 else False
    features["number_strings"] = True if len(re.findall("\d{3,}", word)) > 0 else False
    common_work_words = ["programming", "sequence", "syntax", "error", "command", "cursor", "root", 
                         "window", "sys", "img", "input", "stdin", "stdout", "foo", "bar", "foobar", 
                         "int", "float", "sql", "mysql", "loop", "ctrl", "alt", "del", "corpus", 
                         "java", "javascript", "python", "app", "dir", "cdr", "filter"]
    features["work"] = True if word.lower() not in common_work_words else False
    common_scam_words = ["revealed", "grants", "urgent", "important", "sale", "deal", "secret", 
                         "free", "invest", "porn", "porno", "pornstar", "erotic", "enlargment", 
                         "hair", "hairline", "bald", "spam", "bonus", "financial", "call", "apply", 
                         "business", "money", "million", "multimillion", "billion", "multibillion", 
                         "rich", "market", "marketer", "marketing", "native", "paid", "partner", 
                         "partners", "dollar", "dollars", "mature", "matured", "confidential", 
                         "confidentiality", "cash", "value", "valued", "sir", "maam", "madam", 
                         "stock", "stockpick", "stocks", "commercial", "television", "commercials", 
                         "propose", "proposal", "wholesale", "wholesaler", "company", "firm", "toll", 
                         "gain", "judgement", "judgements", "invest", "investment", "visa", "check", 
                         "mastercard", "gamble", "gambling", "euros", "usd", "real", "shipping", "handling", 
                         "lifetime", "unicorn", "supplement", "supplements", "organic"]
    features["scammy"] = True if word.lower() in common_scam_words or word.istitle() else False
    features["nonsense"] = True if len(re.findall("[a-zA-Z]{1,}\d+[a-zA-Z0-9]{1,}", word)) > 0 or len(re.findall("([^\x00-\x7F])|([àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛäëïöüÿÄËÏÖÜŸåÅçÇðÐÞ¡~\^])|(z[^aeiouy])", word)) > 0 or len(re.findall("[a-z]+[A-Z]+[a-z]+[A-Z]?", word)) > 0 or len(re.findall("[bcdfghjklmnpqrstvwxz]{4,}", word)) > 0 or (len(word) <= 3 and len(re.findall("[bcdfghjklmnpqrstvwxz0-9]{1,}", word.lower())) > 0 and word.lower() not in ["wtf", "cvs", "lol", "fb", "ups", "jk"]) else False
    features["length"] = len(word) if len(word) < 13 and len(word) > 3 else True
    return(features)

Figuring out the features to work with was actually rather fun for the two of us. We considered which traits are most common in mail that is obviously spam. Words that seem to be in permanent caps lock, or nearly so, were a quick winner. Repeating the same characters over and over usually shows up in spam too, for whatever reason, alongside numbers and letters of varying capitalization. Long strings of numbers do as well. We briefly considered what sort of work-related public e-mails people might share. Programming made the most sense, so we created a list of common work-related words from the field of computer science and had the code reflect that any word in that list probably wasn't spam (the `work` feature is `True` when a word is not in the list).

On the flip side, what do people typically get spam mail about? Money and riches, investments to parts unknown, pyramid schemes, beauty, and health. Words that were in that list were deemed scammy, as were words that were title case, as We Both Have Experienced Enough Spam Mail Written Like This To Last A Lifetime.

There was plenty of nonsense to consider for what made spam mail: words that had numbers in them, words that contained non-ASCII or accented characters, words where z was followed by anything besides a vowel or y, strings of letters alternating capitalization, and words with four or more consonants in a row (not counting y). Words of three letters or fewer were also flagged, so long as they contained a consonant or a number and were not part of the list of commonly used short words like "lol" and "jk".

Lastly, spam words are usually unreasonably long or unbelievably short, so the `length` feature records a word's length when it is between four and twelve characters, and is simply set to `True` when the word has three or fewer characters or thirteen or more.
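To make a few of these checks concrete, here is a cut-down sketch of four of them. The function name `mini_features()` and the feature name `consonant_run` are assumptions for illustration (in `spam_buster()` the consonant check is one part of `nonsense`); the conditions themselves follow the code above.

```python
import re

def mini_features(word):
    """A cut-down illustration of four checks from spam_buster()."""
    return {
        # More than three uppercase characters.
        "lazy shift": sum(1 for ch in word if ch.isupper()) > 3,
        # Three or more digits in a row.
        "number_strings": re.search(r"\d{3,}", word) is not None,
        # Four or more lowercase consonants in a row (y not counted).
        "consonant_run": re.search(r"[bcdfghjklmnpqrstvwxz]{4,}", word) is not None,
        # The length itself for 4 to 12 characters, otherwise True.
        "length": len(word) if 3 < len(word) < 13 else True,
    }

for sample in ["FREE", "win1000", "strengths", "abc", "international"]:
    print(sample, mini_features(sample))
```

For example, "FREE" sets `lazy shift` to `True` and `length` to 4, while "abc" (three characters) and "international" (thirteen characters) both get `length` set to `True`.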

## Analysis

### Accuracy of classifier

In [10]:
classifier_accuracy = accuracy(100, spam_buster)
classifier_accuracy.describe()

| | train_set_accuracy | test_set_accuracy | devtest_set_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.810542 | 0.81029 | 0.80982 |
| std | 0.002159 | 0.003382 | 0.004068 |
| min | 0.805171 | 0.803763 | 0.794976 |
| 25% | 0.809099 | 0.80779 | 0.807649 |
| 50% | 0.810381 | 0.810578 | 0.81024 |
| 75% | 0.811775 | 0.812577 | 0.812549 |
| max | 0.815872 | 0.818858 | 0.819308 |


As can be observed from this run, the classifier's mean accuracy at labeling individual tokens ranged from 80.98% on the devtest set to 81.05% on the training set. While not perfect, it's better than 75%, and we felt confident going forward with it in its current state.

In [11]:
random.shuffle(mail_tokens)
train_words = mail_tokens[half_mail:]
train_set = [(spam_buster(n), g) for (n,g) in train_words]
classifier = nltk.NaiveBayesClassifier.train(train_set)

### Ham or Spam identification

To go beyond deciding whether a given word was related to spam, and to label an entire e-mail as ham or spam, a new function was required.

In [12]:
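def ham_or_spam(email):
    email_tokens = []
    # `email` is one cleaned string, so this loop visits it character by
    # character and tokenizes each character on its own.
    email_tokens += [nltk.word_tokenize(word) for word in email]
    # Flatten the per-character token lists into one list.
    email_tokens = [token for tokens in email_tokens for token in tokens]
    # Drop English stopwords, then tokens seen in both training sets.
    email_tokens = list(filter(None, [i if i not in stop_words else None for i in email_tokens]))
    email_tokens = list(set(email_tokens).difference(set(neutral_tokens)))
    # Classify each remaining token and take a majority vote; ties go to "Ham".
    email_set = [spam_buster(word) for word in email_tokens]
    email_classified = classifier.classify_many(email_set)
    email_classified = "Spam" if email_classified.count("spam") > len(email_classified)/2 else "Ham"
    return(email_classified)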

The `ham_or_spam()` function we created works by gathering the tokens for an individual e-mail, removing the stopwords and the neutral tokens, running the classifier we previously made on each remaining unique token, and then counting the tokens labeled as "spam". If that count is greater than half the number of tokens classified, the e-mail is labeled as "Spam"; otherwise, it is labeled as "Ham".
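The voting rule at the end of the function can be isolated as follows. The name `majority_label()` is hypothetical; the comparison is the same one used in `ham_or_spam()`.

```python
def majority_label(token_labels):
    """Label an e-mail "Spam" only if more than half its tokens are "spam"."""
    spam_votes = token_labels.count("spam")
    return "Spam" if spam_votes > len(token_labels) / 2 else "Ham"

print(majority_label(["spam", "spam", "ham"]))  # Spam
print(majority_label(["spam", "ham"]))          # Ham (a tie is not a majority)
print(majority_label([]))                       # Ham (no tokens left to vote)
```

Because the comparison is strict, ties and e-mails with no remaining tokens default to "Ham".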

In [13]:
easy_ham_2 = {
    "actual": ["Ham"]*len(list(ham["Easy Ham 2"].values())),
    "predicted": []
}
for message in list(ham["Easy Ham 2"].values()):
    easy_ham_2["predicted"].append(ham_or_spam(message))
easy_ham_2 = pd.DataFrame(easy_ham_2)
easy_ham_2["correct"] = easy_ham_2["actual"] == easy_ham_2["predicted"]
easy_ham_2.describe()

| | actual | predicted | correct |
|---|---|---|---|
| count | 1400 | 1400 | 1400 |
| unique | 1 | 2 | 2 |
| top | Ham | Ham | True |
| freq | 1400 | 1301 | 1301 |


When it came to the second set of easy ham mail, 92.93% of the e-mails (1,301 of 1,400) were properly classified as ham.

To determine this, a dictionary was created. It held whether each e-mail was actually ham or spam and what it was predicted to be via the `ham_or_spam()` function. After converting it to a data frame, a third column was added whose value is `True` if the mail was accurately identified and `False` if it was not.

In [14]:
hard_ham = {
    "actual": ["Ham"]*len(list(ham["Hard Ham"].values())),
    "predicted": []
}
for message in list(ham["Hard Ham"].values()):
    hard_ham["predicted"].append(ham_or_spam(message))
hard_ham = pd.DataFrame(hard_ham)
hard_ham["correct"] = hard_ham["actual"] == hard_ham["predicted"]
hard_ham.describe()

| | actual | predicted | correct |
|---|---|---|---|
| count | 250 | 250 | 250 |
| unique | 1 | 2 | 2 |
| top | Ham | Ham | True |
| freq | 250 | 243 | 243 |


Even the harder ham e-mails had a high accuracy rate, with 97.2% of the e-mails being classified appropriately as ham.

In [15]:
spam_2 = {
    "actual": ["Spam"]*len(list(spam["Spam 2"].values())),
    "predicted": []
}
for message in list(spam["Spam 2"].values()):
    spam_2["predicted"].append(ham_or_spam(message))
spam_2 = pd.DataFrame(spam_2)
spam_2["correct"] = spam_2["actual"] == spam_2["predicted"]
spam_2.describe()

| | actual | predicted | correct |
|---|---|---|---|
| count | 1396 | 1396 | 1396 |
| unique | 1 | 2 | 2 |
| top | Spam | Ham | False |
| freq | 1396 | 1304 | 1304 |


The spam, which was less obviously spam, was labeled properly only 6.59% of the time. This set of spam was written as if it were regular e-mails for the most part, and this was the set we were really concerned with. Still, it performed better than our initial feature sets had - our first run had an accuracy of 0.01%.
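The percentages for the three test sets follow directly from the `describe()` outputs. The short calculation below uses only the counts shown in those tables; the pooled figure at the end is derived from them and is not an additional experiment, and the helper name `accuracy_pct()` is an assumption.

```python
def accuracy_pct(correct, total):
    """Percentage of e-mails labeled correctly, rounded to two decimals."""
    return round(100 * correct / total, 2)

# (set name, number of e-mails, number labeled correctly).
# For Spam 2, 1396 e-mails minus 1304 labeled "Ham" leaves 92 correct.
results = [
    ("Easy Ham 2", 1400, 1301),
    ("Hard Ham", 250, 243),
    ("Spam 2", 1396, 1396 - 1304),
]

for name, total, correct in results:
    print(name, accuracy_pct(correct, total))

# Accuracy over all three test sets pooled together.
pooled = accuracy_pct(sum(r[2] for r in results), sum(r[1] for r in results))
print("Pooled", pooled)  # Pooled 53.71
```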

## Conclusion

Our algorithm labeled more than 92% of the ham in both test sets correctly, but it recognized only 6.59% of the `Spam 2` e-mails as spam; in other words, it labeled the large majority of all mail as ham. Our current set of features put us significantly above our earlier features, and we felt confident using it for the assignment. In the future, we will take a more nuanced look at whether certain characters, such as punctuation or accented characters, should be kept in the original text of the e-mail or stripped from it, and at whether repeating characters should be removed.