# Assignment 4 - Spam classification using Naïve Bayes

Contributors, time spent:

- William Albertsson, 0 hours 
- Carl Holmberg, 0 hours

## 1. Preprocessing
### a) 
*Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text. Further down (in the higher grade part), you will be asked to filter out the headers and footers.*

In [7]:
import pathlib

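def get_directory_contents(directory):
    """Return the text of every file in `directory` as a list of strings."""
    contents = []
    for path in pathlib.Path(directory).iterdir():
        # errors="ignore" drops bytes that cannot be decoded
        with open(path, "r", errors="ignore") as f:
            contents.append(f.read())
    return contents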


hard_ham_mails = get_directory_contents("./data/2002/hard-ham")
easy_ham_mails = get_directory_contents(
    "./data/2002/easy-ham"
) #+ get_directory_contents("./data/2003/easy-ham-2")
spam_mails = get_directory_contents("./data/2002/spam") + get_directory_contents(
    "./data/2003/spam-2"
)


### b)
*We don’t want to train and test on the same data. Split the spam and the ham datasets 
in a training set and a test set.*

In [8]:
import random

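def train_test_split(l, p):
    """Shuffle `l` in place and split it; `p` is the fraction that goes to the test set."""
    random.shuffle(l)
    size = int(len(l) * p)
    train = l[size:]  # the remaining 1 - p of the shuffled mails
    test = l[:size]   # the first p of the shuffled mails
    return (train, test)

# Note: with 0.8, 80% of each set is used for testing and only 20% for training
test_set_size = 0.8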

hard_ham_train, hard_ham_test = train_test_split(hard_ham_mails, test_set_size)
easy_ham_train, easy_ham_test = train_test_split(easy_ham_mails, test_set_size)
spam_train,     spam_test     = train_test_split(spam_mails, test_set_size)
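
The split above shuffles the input list in place and relies on the global random state, so the partition differs between runs. A minimal sketch of a reproducible variant follows; the name `split_without_mutation`, the `seed` parameter, and the 0.2 test fraction are illustrative assumptions and not part of the notebook's code.

```python
import random

def split_without_mutation(items, test_fraction, seed=0):
    """Return (train, test) without reordering the caller's list."""
    shuffled = list(items)                 # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # private RNG: same seed, same split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

mails = [f"mail {i}" for i in range(10)]   # made-up stand-ins for email texts
train, test = split_without_mutation(mails, test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

Fixing the seed makes accuracy figures comparable across runs, at the cost of evaluating on only one particular partition.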

## 2. Python program

*Write a program that uses a Naïve Bayes classifier (e.g. from Sklearn) to classify the test sets and report the 
percentage of ham and spam test sets that were classified correctly. You can use 
CountVectorizer to transform the email texts into vectors. Please note that there are 
different types of Naïve Bayes classifiers in Sklearn (see the Sklearn documentation). Test two 
of these classifiers, both of which are well suited for this problem: 1. Multinomial Naive Bayes and 2. Bernoulli Naive Bayes. 
For Bernoulli Naive Bayes you should use the 
parameter binarize to make the features binary.*

In [12]:
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Inputs are lists of strings
def NB_classifier(hamtrain, spamtrain, hamtest, spamtest, model):
    # Create dataframes (train/test)
    train_data = zip((hamtrain+spamtrain), ['ham']*len(hamtrain)+['spam']*len(spamtrain))
    df_train = pd.DataFrame(train_data, columns=["msg", "label"])
    
    test_data = zip((hamtest+spamtest), ['ham']*len(hamtest)+['spam']*len(spamtest))
    df_test = pd.DataFrame(test_data, columns=["msg", "label"])
    # Vectorization
    vec_count = CountVectorizer()
    dtm_train = vec_count.fit_transform(df_train['msg']).toarray()
    dtm_test  = vec_count.transform(df_test['msg']).toarray()
    labels_train = df_train['label']
    labels_test  = df_test['label']

    # Train model
    model.fit(dtm_train, labels_train)
    # Predictions
    preds = model.predict(dtm_test)
    acc = metrics.accuracy_score(labels_test, preds)
    # int() truncates, so the printed percentage is rounded down
    print('  Model Accuracy: {} ≈ {}%'.format(acc,int(acc*100)))
    


def program(ham_train, ham_test, spam_train, spam_test):
    print('Bernoulli Naive Bayes:')
    # binarize is a numeric threshold: True is interpreted as 1, so only counts above 1 map to 1
    NB_classifier(ham_train, spam_train, ham_test, spam_test, model = BernoulliNB(binarize=True))
    print('Multinomial Naive Bayes:')
    NB_classifier(ham_train, spam_train, ham_test, spam_test, model = MultinomialNB())
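
The task asks for the percentage of ham and of spam test mails classified correctly, whereas the code above prints a single overall accuracy. A minimal sketch of a per-class breakdown follows; the helper name `per_class_accuracy` and the toy labels are assumptions made for illustration only.

```python
def per_class_accuracy(labels, preds):
    """Map each true class to the fraction of its items that were predicted correctly."""
    totals, correct = {}, {}
    for truth, guess in zip(labels, preds):
        totals[truth] = totals.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (truth == guess)
    return {cls: correct[cls] / totals[cls] for cls in totals}

# Toy labels and predictions, made up for illustration
toy_labels = ["ham", "ham", "ham", "spam", "spam"]
toy_preds  = ["ham", "ham", "spam", "spam", "spam"]

result = per_class_accuracy(toy_labels, toy_preds)
for cls, acc in result.items():
    print(f"{cls}: {acc:.0%} classified correctly")
```

Inside `NB_classifier`, the same helper could be called with `labels_test` and `preds`. A per-class figure matters here because the ham and spam sets differ in size, so a high overall accuracy can hide poor performance on the smaller class.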
    

*Discuss the differences between these two classifiers (Bernoulli Naive Bayes and Multinomial Naive Bayes).*
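
As background for this discussion: Multinomial Naive Bayes models how many times each word occurs, whereas Bernoulli Naive Bayes models only whether a word is present or absent, after the counts have been thresholded with `binarize`. A minimal sketch of the two feature views on a made-up two-document corpus follows; the documents and variable names are illustrative assumptions, not taken from the assignment data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import binarize

docs = ["free free free money", "meeting about money"]  # toy corpus

vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray()
vocab = sorted(vec.vocabulary_)      # column order is alphabetical
print(vocab)                         # ['about', 'free', 'meeting', 'money']

# Multinomial view: raw word counts per document
print(counts.tolist())               # [[0, 3, 0, 1], [1, 0, 1, 1]]

# Bernoulli view with threshold 0: 1 if the word occurs at all
presence = binarize(counts, threshold=0.0)
print(presence.tolist())             # [[0, 1, 0, 1], [1, 0, 1, 1]]

# Threshold 1: a word counts as present only if it occurs more than once
repeated = binarize(counts, threshold=1.0)
print(repeated.tolist())             # [[0, 1, 0, 0], [0, 0, 0, 0]]
```

The last matrix shows how much the choice of threshold changes what the Bernoulli model sees: with a threshold of 1, every word that occurs exactly once in a mail is treated as absent.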



### Question 3
*Run your program on:*

  i. *Spam versus easy-ham*
  


In [13]:
# SPAM VS EASY-HAM
program(easy_ham_train, easy_ham_test, spam_train, spam_test)

Bernoulli Naive Bayes:
  Model Accuracy: 0.8347850519808935 ≈ 83%
Multinomial Naive Bayes:
  Model Accuracy: 0.952233773531891 ≈ 95%


  ii. *Spam versus hard-ham*


In [14]:
#SPAM VS HARD-HAM
program(hard_ham_train, hard_ham_test, spam_train, spam_test)

Bernoulli Naive Bayes:
  Model Accuracy: 0.8976148923792903 ≈ 89%
Multinomial Naive Bayes:
  Model Accuracy: 0.9598603839441536 ≈ 95%


### Question 4
*To avoid classification based on common and uninformative words it is common to filter these out.*

* *Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset.*

We can reduce the amount of data by filtering out common words that probably exist in both the spam and ham sets.
This would not necessarily have a large effect on accuracy, but it could speed up training and testing of the model.
Words that are common most likely occur in both classes and therefore cannot be used to distinguish them.
Removing uncommon words, on the other hand, could actually affect the accuracy of the program.
It removes data points that are too few to indicate a trend, which could help classify the emails better.
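
One way to find words that are too common or too uncommon is to count, for each word, how many mails contain it (its document frequency). The sketch below does this on a made-up corpus; the helper name `document_frequency`, the 0.9 cutoff for "too common", and the "appears in only one mail" cutoff for "too rare" are assumptions chosen for illustration.

```python
import re
from collections import Counter

def document_frequency(mails):
    """Count, for each lowercase word, how many mails contain it at least once."""
    df = Counter()
    for text in mails:
        # Tokens of two or more word characters, similar to CountVectorizer's default
        df.update(set(re.findall(r"\b\w\w+\b", text.lower())))
    return df

toy_mails = ["the free offer", "the meeting is moved", "the offer expires"]
df = document_frequency(toy_mails)
n_mails = len(toy_mails)

too_common = sorted(w for w, c in df.items() if c / n_mails > 0.9)
too_rare = sorted(w for w, c in df.items() if c == 1)
print(too_common)  # ['the']
print(too_rare)    # ['expires', 'free', 'is', 'meeting', 'moved']
```

On the real data, the same function could be applied to the combined ham and spam training lists to inspect which words fall at either extreme before choosing cutoffs.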

* *Use the parameters in Sklearn’s CountVectorizer to filter out these words. Run the updated program on your data and record how the results differ from 3. You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you.*
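
A minimal sketch of both options on a made-up corpus: `max_df` and `min_df` drop words by document frequency, while `stop_words="english"` uses Sklearn's built-in English stop word list. The cutoff values 0.9 and 2 are illustrative assumptions, not tuned on the assignment data.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_mails = ["the free offer", "the meeting is moved", "the offer expires"]

# Option 1: filter by document frequency.
# max_df=0.9 drops words found in more than 90% of the mails,
# min_df=2 drops words found in fewer than 2 mails.
vec_df = CountVectorizer(max_df=0.9, min_df=2)
vec_df.fit(toy_mails)
kept_by_df = sorted(vec_df.vocabulary_)
print(kept_by_df)  # ['offer']

# Option 2: let Sklearn remove its built-in English stop words.
vec_stop = CountVectorizer(stop_words="english")
vec_stop.fit(toy_mails)
kept_by_stop = sorted(vec_stop.vocabulary_)
print("the" in kept_by_stop, "offer" in kept_by_stop)  # False True
```

To apply this in the program, the `CountVectorizer()` created inside `NB_classifier` would be replaced with one of these configured instances, and the accuracies compared against those from Question 3.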

