## Assignment 4

_Group 11: Alexandra Parkegren & Albin Sjöstrand_

_Time Spent: yes, many moons have passed since we started_


In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

Execute the cell below to download and extract the data into the environment of the notebook.
The data will now be in the three folders `easy_ham`, `hard_ham`, and `spam`.


In [8]:
#Download and extract data
#!wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
#!wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
#!wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
#!tar -xjf 20021010_easy_ham.tar.bz2
#!tar -xjf 20021010_hard_ham.tar.bz2
#!tar -xjf 20021010_spam.tar.bz2

### 1. Preprocessing: 
1.	Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text. Further down (in the higher-grade part), you will be asked to filter out the headers and footers. 
2.	We don’t want to train and test on the same data. Split the spam and the ham datasets in a training set and a test set. (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`)


In [9]:
import pandas as pd
import tarfile
from sklearn.model_selection import train_test_split

In [10]:
# Extract the email contents, decode them, and return them as a dataframe
def extract_files(files, label):
    rows = []
    for fname in files:
        # open the bz2-compressed tar archive
        with tarfile.open(fname, 'r:bz2') as tfile:
            for member in tfile.getmembers():
                f = tfile.extractfile(member)
                # extractfile returns None for entries that are not regular files
                if f is not None:
                    # store the whole content of the file as one row
                    rows.append({'message': f.read().decode('latin-1'), 'class': label})
    return pd.DataFrame(rows)


# get one dataframe per class
df_ham = extract_files(['./20021010_easy_ham.tar.bz2','./20021010_hard_ham.tar.bz2'], 'ham')
df_spam = extract_files(['./20021010_spam.tar.bz2'], 'spam')

hamtrain, hamtest = train_test_split(df_ham, test_size=0.25, random_state=0)
spamtrain, spamtest = train_test_split(df_spam, test_size=0.25, random_state=0)

We read the tar archives directly and stored each email as a text string in a dataframe.

_"Split the spam and the ham datasets in a training set and a test set."_

What does the task mean when it says that the split should be named "hamtrain, spamtrain, hamtest and spamtest"?
The code above shows the literal interpretation: we split all ham emails into train and test
and then separately split the spam emails into train and test. But we find this arrangement confusing.
A classifier trained only on spam can only ever predict spam, so testing it on spam would trivially give 100% accuracy.

Instead we join all emails into one big dataset and split that one, so that we can both train and predict on mixed data. Hence, we interpret the task as:
"Split the combined data (the spam and the ham datasets) into a combined training set and a test set".
Logically the parts could therefore be called x_train, x_test, y_train, y_test as in the previous assignment, but we continue to call them ham and
spam train and test, even though for example hamtest also includes some spam: `hamtrain` and `hamtest` hold the messages, `spamtrain` and `spamtest` hold the labels.
See the code below for this, and notice how we label all data to differentiate between ham and spam.


In [43]:
# Extract the email content, decode them, and convert as dataframe
def extract_mails(files, labels):

    rows = []

    # read every archive and tag its emails with the matching label
    for fname, label in zip(files, labels):
        # open the tar file
        with tarfile.open(fname, 'r:bz2') as hfile:
            for member in hfile.getmembers():
                f = hfile.extractfile(member)
                if f is not None:
                    # store the whole content of the file as one row
                    rows.append({'message': f.read().decode('latin-1'), 'class': label})

    # create a dataframe with message and class as columns
    return pd.DataFrame(rows)

# Extract the emails to a usable dataframe
files = ['./20021010_easy_ham.tar.bz2','./20021010_hard_ham.tar.bz2','./20021010_spam.tar.bz2']
labels = ['ham','ham','spam']
df_mails = extract_mails(files,labels)

# Divide the emails into train and test sets
hamtrain, hamtest, spamtrain, spamtest = train_test_split(df_mails['message'], df_mails['class'], test_size=0.25, random_state=0)

print('Total mails:',df_mails.shape)
print('Ham train:  ',hamtrain.shape)
print('Ham test:   ',hamtest.shape)
print('Spam train: ',spamtrain.shape)
print('Spam test:  ',spamtest.shape)

Total mails: (3302, 2)
Ham train:   (2476,)
Ham test:    (826,)
Spam train:  (2476,)
Spam test:   (826,)


### 2. Write a Python program that: 
1.	Uses four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) 
2.	Trains a Naïve Bayes classifier (e.g. Sklearn) on `hamtrain` and `spamtrain`, that classifies the test sets and reports True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes Classifier in SKlearn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem
- Multinomial Naive Bayes  
- Bernoulli Naive Bayes. 

Please inspect the documentation to ensure input to the classifiers is appropriate. Discuss the differences between these two classifiers. 




In [44]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix


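def calc_rates(spamtest, predictions):

    # Row-normalized confusion matrix: the rates of each true class sum to 1.
    # sklearn sorts the labels, so 'ham' is the negative and 'spam' the positive class.
    tn, fp, fn, tp = confusion_matrix(spamtest, predictions, normalize='true').ravel()

    return tp, fn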


def NB(dataframe):
    
    #instantiate vectorizer
    X = CountVectorizer().fit_transform(dataframe['message'])    
    y = dataframe['class']

    # Divide the emails into train and test sets
    hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

    # instantiate a Multinomial Naive Bayes model
    mnb_classifier = MultinomialNB()
    mnb_classifier.fit(hamtrain, spamtrain)
    mnb_predictions = mnb_classifier.predict(hamtest)

    # instantiate a Bernoulli Naive Bayes model
    bnb_classifier = BernoulliNB()
    bnb_classifier.fit(hamtrain, spamtrain)
    bnb_predictions = bnb_classifier.predict(hamtest)

    # Calc TP and FN rates
    mnb_tp, mnb_fn = calc_rates(spamtest,mnb_predictions)
    bnb_tp, bnb_fn = calc_rates(spamtest,bnb_predictions)

    return mnb_tp, mnb_fn, bnb_tp, bnb_fn


# Extract the emails to a usable dataframe
files = ['./20021010_easy_ham.tar.bz2','./20021010_hard_ham.tar.bz2','./20021010_spam.tar.bz2']
labels = ['ham','ham','spam']
df_mails = extract_mails(files,labels)

mnb_tp, mnb_fn ,bnb_tp, bnb_fn = NB(df_mails)

print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(mnb_tp,mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(bnb_tp,bnb_fn))


Multinomial Naive Bayes model gives TP rate = 0.862069 and FN rate = 0.137931
Bernoulli   Naive Bayes model gives TP rate = 0.248276 and FN rate = 0.751724


Above we use `extract_mails` to get a dataset with the mails we want.
`NB` then turns the messages into count vectors with `CountVectorizer` and trains
two classifiers on them: the Multinomial and the Bernoulli Naive Bayes models.
We use `calc_rates` to get the true positive rate (is positive and predicted positive)
and the false negative rate (is positive but predicted negative).
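
The steps described above can also be sketched with a scikit-learn pipeline. This is only an illustration on made-up toy messages (the texts and labels are our own assumptions, not the SpamAssassin data), and unlike `NB` it fits the vectorizer on the training messages only. `MultinomialNB` models how often each word occurs, while `BernoulliNB` only models whether a word occurs at all (its `binarize` parameter defaults to `0.0`, so every positive count becomes 1).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only
train_texts = ["win money now", "cheap money offer",
               "meeting at noon", "lunch at noon tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["money offer", "noon meeting"]

predictions = {}
for model in (MultinomialNB(), BernoulliNB()):
    # The pipeline fits CountVectorizer on the training texts only
    # and reuses the same vocabulary when predicting
    clf = make_pipeline(CountVectorizer(), model)
    clf.fit(train_texts, train_labels)
    predictions[type(model).__name__] = list(clf.predict(test_texts))

print(predictions)
```

On this tiny example both models agree; on real emails they can differ a lot, as the results above show.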

sklearn's `confusion_matrix` lays out TP, TN, FP and FN differently from the way Selpi presented the matrix in class. We therefore follow the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
to be sure which values correspond to TP and FN: rows are true classes, columns are predicted classes, and `ravel()` returns the values in the order TN, FP, FN, TP.
The task was to report rates, so we normalize the confusion matrix over the true classes (`normalize='true'`).
Each rate is then a share of the emails that truly belong to that class, not of the whole test set.
We also find the normalized confusion matrix easier to interpret than raw counts.
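
To make the label order explicit rather than implicit, the rates can be computed as in the following sketch. The helper name `spam_rates` and the toy labels are our own assumptions for illustration; passing `labels=['ham', 'spam']` pins `spam` as the positive class.

```python
from sklearn.metrics import confusion_matrix

def spam_rates(y_true, y_pred):
    """Return (TP rate, FN rate) with 'spam' as the positive class."""
    # labels fixes the row/column order: index 0 = ham, index 1 = spam
    cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam'], normalize='true')
    tn, fp, fn, tp = cm.ravel()
    return tp, fn

# Toy labels, invented for illustration: 4 spam and 2 ham messages
y_true = ['spam', 'spam', 'spam', 'spam', 'ham', 'ham']
y_pred = ['spam', 'spam', 'spam', 'ham', 'ham', 'spam']

tp_rate, fn_rate = spam_rates(y_true, y_pred)
print(tp_rate, fn_rate)  # 0.75 0.25
```

Three of the four spam messages are found, so the TP rate is 0.75 and the FN rate is 0.25. The two rates always sum to 1 because both are normalized over the true spam messages.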

A model that predicts well has high TP and TN rates and low FP and FN rates.
Because sklearn sorts the class labels alphabetically, `ham` is the negative class and `spam` is the positive class in our confusion matrix.
The TP rate is therefore the share of spam messages that are classified as spam, and the FN rate is the share of spam messages that are classified as ham. The two always sum to 1.
The results show that the Multinomial Naive Bayes model is much better at detecting spam:
it finds about 86% of the spam messages and misses about 14%.
The Bernoulli Naive Bayes model, in contrast, finds only about 25% of the spam (a low TP rate)
and lets about 75% through as ham (a high FN rate).

   

### 3.Run your program on 
-	Spam versus easy-ham 
-	Spam versus hard-ham.

In [15]:
df_easy_ham_spam = extract_mails(['./20021010_easy_ham.tar.bz2','./20021010_spam.tar.bz2'],['ham','spam'])
df_hard_ham_spam = extract_mails(['./20021010_hard_ham.tar.bz2','./20021010_spam.tar.bz2'],['ham','spam'])

e_mnb_tp, e_mnb_fn, e_bnb_tp, e_bnb_fn = NB(df_easy_ham_spam)
h_mnb_tp, h_mnb_fn, h_bnb_tp, h_bnb_fn = NB(df_hard_ham_spam)

print('Spam vs. easy-ham')
print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(e_mnb_tp,e_mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(e_bnb_tp,e_bnb_fn))
print()
print('Spam vs. hard-ham')
print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(h_mnb_tp,h_mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(h_bnb_tp,h_bnb_fn))


Spam vs. easy-ham
Multinomial Naive Bayes model gives TP rate = 0.862069 and FN rate = 0.137931
Bernoulli   Naive Bayes model gives TP rate = 0.517241 and FN rate = 0.482759

Spam vs. hard-ham
Multinomial Naive Bayes model gives TP rate = 0.961832 and FN rate = 0.038168
Bernoulli   Naive Bayes model gives TP rate = 0.969466 and FN rate = 0.030534



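When only one of the ham sets is used, the model gets less ham to train on, which means that a larger share of the training data is spam.
For the Bernoulli model this raised the TP rate on _spam vs. easy-ham_ (from 0.25 to 0.52) and lowered the FN rate correspondingly.
For the already fairly good Multinomial model there was no difference at all between _spam vs. easy-ham_ and _spam vs. easy- and hard-ham_ (a TP rate of 0.862 in both cases).

The hard-ham messages are more similar to spam than the easy-ham messages are.
To tell hard-ham from spam, the algorithm therefore needs to pick up more subtle differences, which we believe makes the trained model more discriminating.
When the models were trained on hard-ham and spam only, the TP rate rose to about 0.96-0.97 for both models.
Note, however, that the TP and FN rates are both measured on the spam messages only, so they do not show how many hard-ham messages are misclassified as spam.
The hard-ham set is also much smaller than the spam set, which makes spam the majority class and probably pushes both classifiers towards predicting spam.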

### 4.	To avoid classification based on common and uninformative words it is common to filter these out. 

**a.** Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset. 


See the code below, where we find the most common and least common words in our data.

Filtering the data is a common way to improve performance further.
If we start by filtering out very common words such as "I", "and" and "hello", there is less data to handle and processing is faster.
The model can also become easier to train and more accurate, since it is trained only on relevant words. Take for
example a common word such as "you", which is very likely to appear in a large share of the emails, both ham and spam. It contributes little to
the training of the model because its presence says almost nothing about whether an email is ham or spam.
With such words removed, finding the relevant words also runs faster.

Because filtering is so useful, many helper functions already exist for it.
Stemming trims words down to their stem, so that for example "learning" becomes "learn". This reduces the number of distinct words
and also helps when filtering common words: if "learn" turns out to be common, all of its inflected forms are filtered out at once.
Lemmatization goes one step further by using the context of a word to map it to its dictionary form (lemma) instead of simply cutting off the ending.
Both techniques add computation, especially on large datasets, so the gain has to be weighed against the extra work.

The Natural Language Toolkit (NLTK) provides tokenization, which splits text into tokens: words as well as punctuation marks such as exclamation points, commas, apostrophes and question marks.
The tokens that do not contribute to the email can then be filtered out. TF-IDF (term frequency times inverse document frequency) is another method.
It weighs how often a word appears in an email against how many emails contain the word, and the resulting vectors are usually normalized so that long emails do not dominate.
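
As a small illustration of the TF-IDF idea, the sketch below uses scikit-learn's `TfidfVectorizer` on three made-up sentences (the sentences are our own assumption, not part of the dataset). A word that occurs in every document gets the lowest inverse-document-frequency weight, while a word that occurs in only one document gets a higher weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only
docs = [
    "the prize is free",
    "the meeting is at noon",
    "the lunch is ready",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# idf_ holds one inverse-document-frequency weight per vocabulary entry
idf_the = vectorizer.idf_[vectorizer.vocabulary_['the']]
idf_prize = vectorizer.idf_[vectorizer.vocabulary_['prize']]

print('idf of "the":  ', idf_the)
print('idf of "prize":', idf_prize)
```

With the default smoothing, "the" (present in all three documents) gets an IDF weight of 1, and "prize" (present in one) gets a larger weight, so common words are automatically down-weighted instead of being removed outright.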



In [41]:
from collections import Counter
import re
import string

# get all words in our data as a Counter
def count_words():
    # Extract the emails to a usable dataframe
    df_mails = extract_mails(['./20021010_easy_ham.tar.bz2', './20021010_hard_ham.tar.bz2', './20021010_spam.tar.bz2'],['ham','ham','spam'])

    # replace punctuation with spaces (regex character class) so that "Hello:" is split as "Hello"
    punctuation = '[{}]'.format(re.escape(string.punctuation))
    df_mails['message'] = df_mails['message'].str.replace(punctuation, ' ', regex=True)
    df_mails['message'] = df_mails['message'].str.replace('\n', ' ', regex=False)
    df_mails['message'] = df_mails['message'].str.replace('\t', ' ', regex=False)

    # split the mails into words (splitting on a single space also yields '' tokens)
    mails_splitted = df_mails["message"].str.split(" ")

    # count how many times a word occurs in all emails
    word_counter = Counter()
    for words in mails_splitted:
        word_counter.update(words)

    return word_counter



word_counter = count_words()

# how many of the most and least common words we would like
n_words = 10

# the least common words
least_common_words = word_counter.most_common()[:-n_words-1:-1]
print('Least common words: ', least_common_words)

# the most common words
most_common_words = word_counter.most_common(n_words)
print('Most common words: ', most_common_words)

Least common words:  [('7b1b73cf36cf9dbc3d64e3f2ee2b91f1', 1), ('00000', 1), ('cmds', 1), ('c4ff6dba0a5177d3c7d8ef54c8920496', 1), ('00099', 1), ('01d2958ccb7c2e4c02d0920593962436', 1), ('00098', 1), ('dce08392ba6bc552d13394fa73974b62', 1), ('00097', 1), ('b2cb600e893f7a663ea5f9bff3a6276e', 1)]
Most common words:  [('', 2290248), ('com', 68985), ('0', 46616), ('the', 35612), ('1', 34693), ('http', 33960), ('a', 33238), ('2002', 28353), ('to', 25485), ('3D', 25396)]


To be able to tell the words apart we started by removing some symbols, because we noticed that
the emails contained a lot of characters that were not words.
Some characters also sit between words, so we replace them with a space to be able to read the words.
`\n` is an example of that: when the email is read into a string, every line break becomes a `\n`.

We kept some elements that we do not consider to be words; several of them show up in the list of least common words.
We did not remove single numbers either, because we thought that would be too much manipulation of the data.
The most common "word" is the empty string, which appears because splitting on a single space turns every run of consecutive spaces into empty tokens.

We then split all emails into words so that a `Counter()` could count how many times each word occurs.
The printed result shows each word in quotes, followed by the number of times it occurs in all of our emails.
We chose to print only the 10 most common and the 10 least common words, but this is easily changed through the parameter `n_words`.
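
If the empty tokens are not wanted, one option is to split on any run of whitespace instead of on a single space. The sketch below is a standalone variant on made-up strings, not the code used for the results above; the helper name `count_tokens` is our own.

```python
from collections import Counter
import string

# Translation table that maps every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def count_tokens(messages):
    """Count whitespace-separated tokens after replacing punctuation with spaces."""
    counter = Counter()
    for text in messages:
        # split() without arguments splits on any whitespace run
        # (spaces, tabs, newlines) and never yields empty strings
        counter.update(text.translate(PUNCT_TO_SPACE).split())
    return counter

# Toy messages, invented for illustration only
toy = ["Hello: buy now!\nBuy  now", "hello\tworld"]
counts = count_tokens(toy)

print(counts.most_common(1))  # [('now', 2)]
print('' in counts)           # False
```

Note that the counting is case-sensitive, so "Hello" and "hello" are counted as different words, just as in `count_words()`.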


In [49]:
# get a array of strings with the most common words. how many depending on the input
def list_of_common_words(length_common_words):
    
    word_counter = count_words()

    #get most common words in counter
    most_common_words = word_counter.most_common(length_common_words)
    

    #get only the words
    top_words = []
    for i in most_common_words:
        top_words.append(i[0])
    
    top_words.sort()
    return top_words


#how many words most and least common we would like
#length of word_counter = 123645
n_words = int(len(word_counter)*0.0001) 

top_words = list_of_common_words(n_words)

print('list of most common words: ' , top_words)

Here we use `count_words()` but convert the Counter into a list that contains only the most common words as strings.
Having the words as a plain list of strings, rather than as a Counter, is very useful if we want to use them to filter our data.

### 4.
**b.** Use the parameters in Sklearn’s `CountVectorizer` to filter out these words. Update the program from point 3 and run it on your data and report your results.

You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making.


***Using our own list of common words***


Using our own list of the most common words to filter the emails could give the best result for this specific
scenario, which is often the case in data science, where we have to adapt methods to a specific problem.
A disadvantage is that it takes longer to run and that the program may become
too tailored to this dataset to be used in more general cases.
Another thing to consider is that our list of most common words never takes the length of the emails into account, which means that long emails weigh more than short ones.
A further problem is deciding how many words should be considered _common_.
For these reasons we decided to use sklearn's filtering.


***Using Sklearn's filtering***

We opted to use Sklearn's filtering, mainly for performance reasons. With the built-in
document-frequency limit `max_df` we can remove overly common words without having to loop through the whole Counter
and select the words that meet our condition. The two approaches are not identical, though: our Counter ranks words by their
total number of occurrences, whereas `max_df` looks at the share of documents in which a word appears. Our own preprocessing can also be adapted to how
our documents look, for example by replacing `\n` and `\t`. `CountVectorizer` handles this through its default tokenizer, which keeps only tokens of two or more
alphanumeric characters and therefore drops whitespace and punctuation anyway. The drawback of our own Counter is that it depends on us knowing
the structure and content of the emails, which we cannot do for a dataset of this size. The
most sensible solution is therefore to rely on the well-tested library and on established stop-word lists
rather than on our own guesses about which words or tokens might be in the data.

It is worth noting that the results depend on more than the choice between our own preprocessing and Sklearn's. `CountVectorizer`
accepts parameters such as `max_df`, `min_df`, `max_features`, `stop_words` and `ngram_range`, which further change what is filtered out. In our
case we chose to set only `max_df` and `stop_words`, so that we filter out the words that appear in more than 50% of the documents as well as
established common English words (such as _you_, _I_ or _the_).
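
The effect of the two parameters can be seen on a tiny corpus. The sketch below uses made-up documents and, to avoid a download, scikit-learn's built-in `'english'` stop-word list instead of the NLTK list used in our program; both are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only
docs = [
    "free money prize",
    "meeting about the money",
    "free offer money",
    "lunch about noon",
]

plain = CountVectorizer().fit(docs)
# max_df=0.5 drops words that occur in more than half of the documents;
# stop_words='english' drops words on scikit-learn's built-in English list
filtered = CountVectorizer(max_df=0.5, stop_words='english').fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Here "money" occurs in three of the four documents and is removed by `max_df`, "about" and "the" are removed as stop words, and "free" (exactly half of the documents) is kept because the threshold is strict.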

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords

# Uncomment to download stopwords locally
#nltk.download('stopwords')


# Modified Naive Bayes with filtering
def NB_filter(dataframe):

    X = dataframe['message']
    y = dataframe['class']

    # alternative: vectorize with our own list of common words as stop words
    #X_vectorized = CountVectorizer(stop_words=top_words).fit_transform(X)
    
    # Filter out words that appear in more than 50% of the emails, as well as common English stopwords
    X_vectorized = CountVectorizer(max_df=0.5, stop_words=stopwords.words('english')).fit_transform(X)

    # Divide the emails into train and test sets
    hamtrain, hamtest, spamtrain, spamtest = train_test_split(X_vectorized, y, test_size=0.25, random_state = 0)

    # instantiate a Multinomial Naive Bayes model
    mnb_classifier = MultinomialNB()
    mnb_classifier.fit(hamtrain, spamtrain)
    mnb_predictions = mnb_classifier.predict(hamtest)

    # instantiate a Bernoulli Naive Bayes model
    bnb_classifier = BernoulliNB()
    bnb_classifier.fit(hamtrain, spamtrain)
    bnb_predictions = bnb_classifier.predict(hamtest)

    # Calc TP and FN rates (calc_rates takes the true labels first, then the predictions)
    mnb_tp, mnb_fn = calc_rates(spamtest, mnb_predictions)
    bnb_tp, bnb_fn = calc_rates(spamtest, bnb_predictions)

    return mnb_tp, mnb_fn, bnb_tp, bnb_fn


df_easy_ham_spam = extract_mails(['./20021010_easy_ham.tar.bz2','./20021010_spam.tar.bz2'],['ham','spam'])
df_hard_ham_spam = extract_mails(['./20021010_hard_ham.tar.bz2','./20021010_spam.tar.bz2'],['ham','spam'])

# Use easy_ham and hard_ham in filtering Naive Bayes
e_mnb_tp, e_mnb_fn, e_bnb_tp, e_bnb_fn = NB_filter(df_easy_ham_spam)
h_mnb_tp, h_mnb_fn, h_bnb_tp, h_bnb_fn = NB_filter(df_hard_ham_spam)

print('Spam vs. easy-ham with filtered df')
print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(e_mnb_tp,e_mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(e_bnb_tp,e_bnb_fn))
print()
print('Spam vs. hard-ham with filtered df')
print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(h_mnb_tp,h_mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(h_bnb_tp,h_bnb_fn))


Spam vs. easy-ham with filtered df
Multinomial Naive Bayes model gives TP rate = 0.922414 and FN rate = 0.077586
Bernoulli   Naive Bayes model gives TP rate = 0.456897 and FN rate = 0.543103

Spam vs. hard-ham with filtered df
Multinomial Naive Bayes model gives TP rate = 0.954198 and FN rate = 0.045802
Bernoulli   Naive Bayes model gives TP rate = 0.969466 and FN rate = 0.030534


Above we see the results of using sklearn to filter out words that appear in more than 50% of the emails, as well as common English stopwords.
Compared to the unfiltered run, the multinomial model gets better on _spam vs. easy-ham_ (TP rate from 0.862 to 0.922) and marginally worse on _spam vs. hard-ham_ (from 0.962 to 0.954).
The Bernoulli model gets a bit worse on _spam vs. easy-ham_ (from 0.517 to 0.457) and is unchanged on _spam vs. hard-ham_ (0.969).

### 5. Eking out further performance
Filter out the headers and footers of the emails before you run on them. 
The format may vary somewhat between emails, which can make this a bit tricky,
so perfect filtering is not required. Run your program again and answer the following questions: 

In [None]:
# TODO: Filter headers and footer
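
A possible starting point for this TODO, as a minimal sketch and not a finished solution, is Python's standard `email` module, which separates the headers from the body. The function names `strip_headers` and `strip_footer` are hypothetical, and the footer rule (cut at the conventional `-- ` signature delimiter) is only a heuristic assumption that will miss many real footers. Encoded bodies (base64, quoted-printable) are returned as-is.

```python
from email import message_from_string

def strip_headers(raw_message):
    """Return only the text parts of an email, without the headers."""
    msg = message_from_string(raw_message)
    parts = []
    # walk() yields the message itself and, for multipart mails, every sub-part
    for part in msg.walk():
        if part.get_content_maintype() == 'text':
            payload = part.get_payload()
            if isinstance(payload, str):
                parts.append(payload)
    return '\n'.join(parts)

def strip_footer(body):
    """Heuristic: drop everything from a signature delimiter line ('-- ') onwards."""
    kept = []
    for line in body.splitlines():
        if line.rstrip() == '--':
            break
        kept.append(line)
    return '\n'.join(kept)

# Toy email, invented for illustration only
raw = "From: a@example.com\nSubject: Hi\n\nHello there\n-- \nAlice\n"
body = strip_headers(raw)
print(repr(strip_footer(body)))  # 'Hello there'
```

Applied to our data this could look like `df_mails['message'].apply(strip_headers).apply(strip_footer)` before the dataframe is passed to `NB_filter`.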

#### 5.1 Does the result improve from 3 and 4? 

#### 5.2 The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions on remedies? 

The result can be skewed because a random split can leave the classes imbalanced: the training set and the test set may contain different
proportions of ham and spam, so that the model trains much more on one class (for example ham).
We noticed how much the class balance matters when the results changed from task 2 to task 3. An unlucky split can still yield
"good" results on the test set, while the model performs worse when it is validated on other data. The model may also have to predict on features (words)
that exist only in the test set and not in the training set.

A solution is to stratify the data, which keeps the class distribution the same in the training and test sets. This can be done by using
the `stratify` parameter of `train_test_split`. The model will then not be evaluated on an imbalanced test set where the performance seems good
but turns out to be worse when it is actually validated.
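
A small sketch of a stratified split, on made-up labels (20 messages, 20% spam), shows that both parts keep the same class proportions. The variable names are our own and the data is only for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 4 spam and 16 ham messages
messages = ['m%d' % i for i in range(20)]
labels = ['spam'] * 4 + ['ham'] * 16

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=0, stratify=labels)

# Both parts keep the 20% spam share of the full dataset
print(Counter(y_train))
print(Counter(y_test))
```

In our program the same effect would be obtained by passing `stratify=y` to the existing `train_test_split` call.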

#### 5.3 What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages? 

Since the model has then been trained mainly on features of spam emails, it will have a harder time determining whether a ham message is ham or spam,
and the class prior learned from the training data will also bias it towards predicting spam.
As mentioned in the previous question, this may not be evident in our results, because the TP and FN rates are measured on the spam messages only.
To see it we would instead want to look at the False Positive and True Negative rates, which are measured on the ham messages.

### Re-estimate your classifier using `fit_prior` parameter set to `False`, and answer the following questions:

#### 5.4 What does this parameter (fit_prior) mean?

With `fit_prior` set to `True`, which is the default and what we have used up until this question, the classifier learns the class prior probabilities
from the training data, that is, from how large a share of the training emails is ham and how large a share is spam.
With `fit_prior` set to `False`, a uniform prior is used instead, so every class is considered equally likely before any words are taken into account.
A uniform prior ignores the class imbalance in the training data. This is a drawback if the imbalance reflects what the classifier will see in practice,
but it can help if the training set is not representative of the data the classifier is later applied to.
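
The difference is easy to inspect on a fitted model through its `class_log_prior_` attribute. The sketch below uses a tiny made-up count matrix (three ham rows and one spam row, our own assumption) and is not run on the email data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix, invented for illustration: 3 ham rows, 1 spam row
X = np.array([[2, 0], [1, 0], [3, 1], [0, 2]])
y = np.array(['ham', 'ham', 'ham', 'spam'])

learned = MultinomialNB(fit_prior=True).fit(X, y)
uniform = MultinomialNB(fit_prior=False).fit(X, y)

# classes_ is sorted, so index 0 is 'ham' and index 1 is 'spam'
print(np.exp(learned.class_log_prior_))  # approximately [0.75 0.25]
print(np.exp(uniform.class_log_prior_))  # approximately [0.5 0.5]

# The word likelihoods are identical, so only the prior shifts the posterior
x_new = np.array([[1, 1]])
p_spam_learned = learned.predict_proba(x_new)[0, 1]
p_spam_uniform = uniform.predict_proba(x_new)[0, 1]
print(p_spam_learned, p_spam_uniform)
```

Because spam is the minority class in this toy data, the uniform prior raises the predicted spam probability for the same message compared with the learned prior.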

#### 5.5 How does this alter the predictions? Discuss why or why not.