## Assignment 4

In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

Execute the cell below to download and extract the data into the environment of the notebook.
The data will now be in the three folders `easy_ham`, `hard_ham`, and `spam`.


In [None]:
#Download and extract data
#!wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
#!wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
#!wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
#!tar -xjf 20021010_easy_ham.tar.bz2
#!tar -xjf 20021010_hard_ham.tar.bz2
#!tar -xjf 20021010_spam.tar.bz2

--2020-12-01 08:59:28--  https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 40.79.78.1, 95.216.26.30, 95.216.24.32, ...
Connecting to spamassassin.apache.org (spamassassin.apache.org)|40.79.78.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1677144 (1.6M) [application/x-bzip2]
Saving to: ‘20021010_easy_ham.tar.bz2.1’


2020-12-01 08:59:28 (42.1 MB/s) - ‘20021010_easy_ham.tar.bz2.1’ saved [1677144/1677144]

--2020-12-01 08:59:29--  https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 95.216.26.30, 40.79.78.1, 95.216.24.32, ...
Connecting to spamassassin.apache.org (spamassassin.apache.org)|95.216.26.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021126 (997K) [application/x-bzip2]
Saving to: ‘20021010_hard_ham.tar.bz2.1’


2020-12-01 08:59:30 (1.60 MB/s) - ‘20

### 1. Preprocessing: 
1.	Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text. Further down (in the higher-grade part), you will be asked to filter out the headers and footers. 
2.	We don’t want to train and test on the same data. Split the spam and the ham datasets into a training set and a test set (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`).


In [None]:
import pandas as pd
import tarfile
from sklearn.model_selection import train_test_split

In [None]:
# Extract the email contents, decode them, and collect them in a dataframe
def extract_files(files, label):
    rows = []
    for fname in files:
        # open the bzip2-compressed tar archive
        with tarfile.open(fname, 'r:bz2') as tfile:
            for member in tfile.getmembers():
                f = tfile.extractfile(member)
                if f is not None:
                    # store the whole file content as one row, tagged with its class
                    rows.append({'message': f.read().decode('latin-1'), 'class': label})
    return pd.DataFrame(rows)


# one dataframe with all ham files and one with the spam file
df_ham = extract_files(['./20021010_easy_ham.tar.bz2','./20021010_hard_ham.tar.bz2'], 'ham')
df_spam = extract_files(['./20021010_spam.tar.bz2'], 'spam')

hamtrain, hamtest = train_test_split(df_ham, test_size=0.25, random_state=0)
spamtrain, spamtest = train_test_split(df_spam, test_size=0.25, random_state=0)

We read the tar archives directly in Python and stored each email as text in a dataframe.

"Split the spam and the ham datasets into a training set and a test set."

What does the task mean by naming the splits `hamtrain`, `spamtrain`, `hamtest` and `spamtest`?
The code above shows the literal interpretation: all ham emails are split into train and test,
and the spam emails are split separately into train and test. We find this arrangement confusing,
because a classifier that is trained and tested on a single class (for example only spam) would trivially reach 100% accuracy.

Instead we join all emails into one labelled dataset and split that, so that we can train and predict on both classes. Hence, we interpret the task as:
"Split the combined data (the spam and the ham datasets) into a combined training set and a test set".
Logically the parts could therefore be called `x_train`, `x_test`, `y_train`, `y_test` as in the previous assignment, but we keep the names from the task.
Note that `train_test_split` returns the parts in the order X_train, X_test, y_train, y_test, so in the code below `hamtrain` and `hamtest` hold the training and test messages (both ham and spam), while `spamtrain` and `spamtest` hold the corresponding labels.
See the code below and notice how we label all data to differentiate between ham and spam.


In [None]:
# Extract the email contents, decode them, and collect them in a dataframe
def extract_mails(files, labels):
    rows = []

    # read every archive and tag its messages with the matching label
    for fname, label in zip(files, labels):
        # open the bzip2-compressed tar archive
        with tarfile.open(fname, 'r:bz2') as hfile:
            for member in hfile.getmembers():
                f = hfile.extractfile(member)
                if f is not None:
                    # store the whole file content as one row
                    rows.append({'message': f.read().decode('latin-1'), 'class': label})

    # create a dataframe with message and class as columns
    return pd.DataFrame(rows)

# Extract the emails to a usable dataframe
files = ['./20021010_easy_ham.tar.bz2','./20021010_hard_ham.tar.bz2','./20021010_spam.tar.bz2']
labels = ['ham','ham','spam']
df_mails = extract_mails(files,labels)

# Divide the emails into train and test sets
hamtrain, hamtest, spamtrain, spamtest = train_test_split(df_mails['message'], df_mails['class'], test_size=0.25, random_state=0)

print(df_mails)
print('Total mails:',df_mails.shape)
print('Ham train:  ',hamtrain.shape)
print('Ham test:   ',hamtest.shape)
print('Spam train: ',spamtrain.shape)
print('Spam test:  ',spamtest.shape)

                                                message class
0     From fork-admin@xent.com  Wed Aug 28 10:50:29 ...   ham
1     From exmh-users-admin@redhat.com  Mon Sep  2 1...   ham
2     From exmh-users-admin@redhat.com  Fri Sep 13 1...   ham
3     From rpm-list-admin@freshrpms.net  Thu Aug 29 ...   ham
4     From rpm-list-admin@freshrpms.net  Mon Sep  9 ...   ham
...                                                 ...   ...
3297  From havoc1006@yahoo.com  Mon Aug 26 15:49:43 ...  spam
3298  From mando@insiq.us  Mon Aug 26 15:49:52 2002\...  spam
3299  From girl_with_toys_541652k57@yahoo.com  Mon A...  spam
3300  From guyhaibo@yahoo.ca  Mon Aug 26 15:50:05 20...  spam
3301  mv 1 00001.bfc8d64d12b325ff385cca8d07b84288\nm...  spam

[3302 rows x 2 columns]
Total mails: (3302, 2)
Ham train:   (2476,)
Ham test:    (826,)
Spam train:  (2476,)
Spam test:   (826,)


In [None]:
# REMOVE IF EVERYTHING ELSE IS OK.

# Extract the email content, decode them, and convert as dataframe
def extract_mails2(hamfiles, spamfile):
    rows = []

    # read and append for both ham files
    for fname in hamfiles:
        # open the tar file
        hfile = tarfile.open(fname, 'r:bz2')
        for member in hfile.getmembers():
            f = hfile.extractfile(member)
            if f is not None:
                row = f.read()
                #get all the content of file as a row
                # set decoded message and manually if ham or spam since it's previously known
                rows.append({'message': row.decode('latin-1'), 'class': 'ham'}) 
        hfile.close()

    # read and append spam file
    sfile = tarfile.open(spamfile, 'r:bz2')
    for member in sfile.getmembers():
        f = sfile.extractfile(member)
        if f is not None:
            row = f.read()
            rows.append({'message': row.decode('latin-1'), 'class': 'spam'})
    sfile.close()

    # create a dataframe with message and class as rows
    return pd.DataFrame(rows)

# Extract the emails to a usable dataframe
df_mails = extract_mails2(['./20021010_easy_ham.tar.bz2', './20021010_hard_ham.tar.bz2'], './20021010_spam.tar.bz2')

# Divide the emails into train and test sets
hamtrain, hamtest, spamtrain, spamtest = train_test_split(df_mails['message'], df_mails['class'], test_size=0.25, random_state=0)


#print('Total mails:',df_mails.shape)
#print('Ham train:',  hamtrain.shape)
#print('Ham test',    hamtest.shape)
#print('Spam train:', spamtrain.shape)
#print('Spam test',   spamtest.shape)


### 2. Write a Python program that: 
1.	Uses four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) 
2.	Trains a Naïve Bayes classifier (e.g. Sklearn) on `hamtrain` and `spamtrain`, that classifies the test sets and reports True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes Classifier in SKlearn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem
- Multinomial Naive Bayes  
- Bernoulli Naive Bayes. 

Please inspect the documentation to ensure input to the classifiers is appropriate. Discuss the differences between these two classifiers. 




In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix


def calc_rates(predictions, spamtest, hamtest):
    # spamtest holds the true labels; hamtest is unused but kept so existing calls still work.
    # The confusion matrix is normalized over the true labels, so each row sums to 1.
    # With labels=['ham', 'spam'], spam is the positive class:
    # tp = share of spam classified as spam, fn = share of spam classified as ham.
    tn, fp, fn, tp = confusion_matrix(spamtest, predictions, labels=['ham', 'spam'], normalize='true').ravel()

    return tp, fn


def NB(dataframe):
    
    #instantiate vectorizer
    X = CountVectorizer().fit_transform(dataframe['message'])    
    y = dataframe['class']

    # Divide the emails into train and test sets
    hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

    # instantiate a Multinomial Naive Bayes model
    mnb_classifier = MultinomialNB()
    mnb_classifier.fit(hamtrain, spamtrain)
    mnb_predictions = mnb_classifier.predict(hamtest)

    # instantiate a Bernoulli Naive Bayes model
    bnb_classifier = BernoulliNB()
    bnb_classifier.fit(hamtrain, spamtrain)
    bnb_predictions = bnb_classifier.predict(hamtest)

    # Calc TP and FN rates
    mnb_tp, mnb_fn = calc_rates(mnb_predictions,spamtest,hamtest)
    bnb_tp, bnb_fn = calc_rates(bnb_predictions,spamtest,hamtest)

    return mnb_tp, mnb_fn, bnb_tp, bnb_fn


# Extract the emails to a usable dataframe
files = ['./20021010_easy_ham.tar.bz2','./20021010_hard_ham.tar.bz2','./20021010_spam.tar.bz2']
labels = ['ham','ham','spam']
df_mails = extract_mails(files,labels)

mnb_tp, mnb_fn ,bnb_tp, bnb_fn = NB(df_mails)

print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(mnb_tp,mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(bnb_tp,bnb_fn))


Multinomial Naive Bayes model gives TP rate = 0.862069 and FN rate = 0.137931
Bernoulli   Naive Bayes model gives TP rate = 0.248276 and FN rate = 0.751724


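Above we use `extract_mails` to get a dataset with the mails we want.
Then `NB` turns the messages into count vectors with `CountVectorizer` and trains
two classifiers on them: a Multinomial and a Bernoulli Naive Bayes model.
We use `calc_rates` to get the true positive rate (predicted positive and actually positive)
and the false negative rate (predicted negative but actually positive).
`confusion_matrix` sorts the labels alphabetically, so `ham` becomes the negative class and `spam` the positive class; the TP and FN rates therefore describe how well spam is detected.

We assume that "rate" means the normalized count, so the confusion matrix is normalized over the true labels and TP rate + FN rate = 1.
The results show that the Multinomial Naive Bayes model is considerably better at finding the positive class, in our case spam (TP rate 0.86).
The Bernoulli Naive Bayes model only finds about a quarter of the spam (TP rate 0.25). The Multinomial model uses how many times each word occurs, whereas the Bernoulli model only uses whether a word is present or absent and explicitly penalises absent words.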
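
The comparison above can be sketched on a tiny example. This is only an illustration under stated assumptions: the toy corpus and the helper `spam_rates` are hypothetical and not part of the assignment data, and the vectorizer is wrapped in a pipeline so that it is fitted on the training texts only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, only for illustration (not the SpamAssassin data).
train_texts = [
    "win money now win win",
    "free money offer",
    "meeting agenda for monday",
    "lunch on monday",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["win free money", "monday meeting"]
test_labels = ["spam", "ham"]


def spam_rates(model):
    """Fit on the training texts and return (TP rate, FN rate) with spam as positive."""
    pipeline = make_pipeline(CountVectorizer(), model)
    pipeline.fit(train_texts, train_labels)
    predictions = pipeline.predict(test_texts)
    # labels=["ham", "spam"] makes spam the positive class (second row and column).
    tn, fp, fn, tp = confusion_matrix(
        test_labels, predictions, labels=["ham", "spam"], normalize="true"
    ).ravel()
    return tp, fn


mnb_tp, mnb_fn = spam_rates(MultinomialNB())  # uses word counts
bnb_tp, bnb_fn = spam_rates(BernoulliNB())    # uses word presence/absence only
print("Multinomial:", mnb_tp, mnb_fn)
print("Bernoulli:  ", bnb_tp, bnb_fn)
```

Because the matrix is normalized over the true labels, the two rates returned for each model always add up to 1.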

   

### 3. Run your program on 
-	Spam versus easy-ham 
-	Spam versus hard-ham.

In [None]:
df_easy_ham_spam = extract_mails(['./20021010_easy_ham.tar.bz2','./20021010_spam.tar.bz2'],['ham','spam'])
df_hard_ham_spam = extract_mails(['./20021010_hard_ham.tar.bz2','./20021010_spam.tar.bz2'],['ham','spam'])

e_mnb_tp, e_mnb_fn, e_bnb_tp, e_bnb_fn = NB(df_easy_ham_spam)
h_mnb_tp, h_mnb_fn, h_bnb_tp, h_bnb_fn = NB(df_hard_ham_spam)

print('Spam vs. easy-ham')
print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(e_mnb_tp,e_mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(e_bnb_tp,e_bnb_fn))
print()
print('Spam vs. hard-ham')
print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(h_mnb_tp,h_mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(h_bnb_tp,h_bnb_fn))


Spam vs. easy-ham
Multinomial Naive Bayes model gives TP rate = 0.862069 and FN rate = 0.137931
Bernoulli   Naive Bayes model gives TP rate = 0.517241 and FN rate = 0.482759

Spam vs. hard-ham
Multinomial Naive Bayes model gives TP rate = 0.961832 and FN rate = 0.038168
Bernoulli   Naive Bayes model gives TP rate = 0.969466 and FN rate = 0.030534


### 4.	To avoid classification based on common and uninformative words it is common to filter these out. 

**a.** Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset. 


We could start by filtering out very common words, like "I", "and" and "hello".
Then there is less data to handle, so both training and prediction run faster.
The accuracy could also improve, since the model is trained only on relevant words. Take for
example a common word such as "you", which is likely to appear in a large share of the emails. It contributes little to
the training of the model, because its presence does not indicate whether an email is ham or spam.
Very uncommon words, such as words that occur only once in the whole dataset, are also weak evidence, since the model cannot generalise from a single occurrence.

There are many helper functions for cleaning text data.
Stemming helps to filter out uninformative words. By trimming words down to their stem
we reduce the number of distinct words and make the filtering of common words more effective. For example, the word
"learning" has the stem "learn", so if "learn" is found to be common, "learning" is filtered out with it.
This can be improved further with lemmatization, which also considers the context of a word and maps it to its dictionary form
instead of just cutting off the ending. That said, all of these steps add computation, which matters
on large datasets where we want to keep the processing fast.

The Natural Language Toolkit (NLTK) provides tokenization, which splits text into tokens: words as well as exclamation points, commas, apostrophes, question marks etc.
This can be used to filter out tokens that do not contribute to the content of the email.

TF-IDF (term frequency times inverse document frequency) is another method: it counts how often a word appears in an email and down-weights words that appear in many emails.
scikit-learn's implementation also normalises each email's vector by default, which compensates for the length of the email.
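
As a minimal sketch of the TF-IDF idea, assuming scikit-learn's `TfidfVectorizer` with default settings and a hypothetical three-sentence corpus, the inverse document frequency of a word that appears in every document is lower than that of a word that appears in only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, only for illustration.
docs = [
    "you win money",
    "you have a meeting",
    "you are invited to lunch",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Map every vocabulary word to its inverse document frequency.
idf = {word: vectorizer.idf_[index] for word, index in vectorizer.vocabulary_.items()}

# "you" occurs in all documents, "money" in only one of them.
print("idf(you)   =", idf["you"])
print("idf(money) =", idf["money"])
```

A low IDF means the word carries little information for telling the emails apart, which is the same intuition as filtering out common words.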



In [None]:
from collections import Counter
import re
import string


def count_words():
    # Extract the emails to a usable dataframe
    df_mails = extract_mails(['./20021010_easy_ham.tar.bz2', './20021010_hard_ham.tar.bz2', './20021010_spam.tar.bz2'],['ham','ham','spam'])

    # replace punctuation with spaces so that "Hello:" is split as "Hello";
    # re.escape keeps characters such as ']' and '^' literal inside the character class
    punctuation = '[{}]'.format(re.escape(string.punctuation))
    df_mails['message'] = df_mails['message'].str.replace(punctuation, ' ', regex=True)
    df_mails['message'] = df_mails['message'].str.replace('\n', ' ', regex=False)
    df_mails['message'] = df_mails['message'].str.replace('\t', ' ', regex=False)

    # split the mails into words (consecutive spaces yield empty strings)
    mails_splitted = df_mails["message"].str.split(" ")

    # count how many times a word occurs in all emails
    word_counter = Counter()
    for words in mails_splitted:
        word_counter.update(words)

    return word_counter



word_counter = count_words()

#how many words most and least common we would like
#length of word_counter = 123645
n_words = int(len(word_counter)*0.01) 

#the least common words
word_counter2 = word_counter
least_common_words = word_counter2.most_common()[:-n_words-1:-1]
print('Least common words: ', least_common_words)

#the top common words
most_common_words = word_counter.most_common(n_words)
print('Most common words: ', most_common_words) 



#  split all mail into words
#df_mails["message"].str.split(" ")
#  make all emails into one long array of words
#mails_words = df_mails["message"].tolist()
# count how many times a word occurs
#word_counter = Counter(mails_words)
#print(word_counter)




lenght   123645
Least common words:  [('7b1b73cf36cf9dbc3d64e3f2ee2b91f1', 1), ('00000', 1), ('cmds', 1), ('c4ff6dba0a5177d3c7d8ef54c8920496', 1), ('00099', 1), ('01d2958ccb7c2e4c02d0920593962436', 1), ('00098', 1), ('dce08392ba6bc552d13394fa73974b62', 1), ('00097', 1), ('b2cb600e893f7a663ea5f9bff3a6276e', 1), ('00096', 1), ('e1db2d3556c2863ef7355faf49160219', 1), ('00095', 1), ('3ba780eac7dce1c2b063cd1fc12738be', 1), ('00094', 1), ('2bb8a2a7e4d2841a14f27f32076dd77e', 1), ('00093', 1), ('bf7453c6b7917ca30074a3030d84e36d', 1), ('00092', 1), ('113ec7122d4046a2754bcf70b9fb5299', 1), ('00091', 1), ('9a7e76d58065e29e709161dbe569fe54', 1), ('00090', 1), ('c05e264fbf18783099b53dbc9a9aacda', 1), ('00009', 1), ('51c746428bb5e2793a1c04ce1e0c72c1', 1), ('00089', 1), ('f421d8c380fb0c48483f026d243df9d9', 1), ('00088', 1), ('1cbd88a0c1564cb5d6c9b12c8c4175d8', 1), ('00087', 1), ('4b3a02be9a2561ada188d95b4601c01e', 1), ('00086', 1), ('6e7b1a983ab05445a7eaffcbb6811d3f', 1), ('00085', 1), ('df5ac85de340

In [None]:
# get a sorted list of the most common words; the input decides how many
def list_of_common_words(length_common_words):

    word_counter = count_words()

    # get the most common (word, count) pairs from the counter
    most_common_words = word_counter.most_common(length_common_words)

    # keep only the words and sort them alphabetically
    top_words = sorted(word for word, _ in most_common_words)
    return top_words


top_words = list_of_common_words(1000)
stop_words_top = frozenset(top_words)
#print(top_words)

lenght   123645


### 4.
**b.** Use the parameters in Sklearn’s `CountVectorizer` to filter out these words. Update the program from point 3 and run it on your data and report your results.

You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making.


We used our own list of most common words to filter the mails. The reasoning is that we want the best possible result for this specific
scenario, which is often the case in data science, where methods have to be adapted to a specific problem. This means that words
considered common in natural language (you, i, and, etc.) are only filtered out if they are common in **this** dataset.
We therefore expect more accurate results for this type of problem than with a general list of common
English words, which may contain words that are not common in our data. This can be seen in _4. a)_, where the most common words (and characters)
are `'', com, 0, the`. A disadvantage is that it takes longer to run and depends on there actually being words that
stand out as common. Since we have such a large dataset we can almost guarantee that some words are more common than others, but
if we only had a few sentences, words that appear only once would be considered both common **and** uncommon by
our program. Another difficulty is deciding how many words count as _common_. Currently we filter out the `1000` most common
words in the emails. We think that overall this will produce a better result for this case, since
the program is tailored to this scenario. It may perform worse in other circumstances, but for the purpose of this task we consider it
the better choice.
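
The options discussed above can be compared side by side. The sketch below assumes a hypothetical three-sentence corpus and a hand-written stop word list; `stop_words` and `max_df` are the `CountVectorizer` parameters in question. Note that `CountVectorizer` lowercases tokens by default, so a custom stop word list should be lowercase as well.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus, only for illustration.
docs = [
    "The offer is free and the prize is yours",
    "The meeting is on Monday",
    "Is the report ready",
]

# Option 1: our own list of common words (hand-written here, lowercase).
custom = CountVectorizer(stop_words=["the", "is"]).fit(docs)

# Option 2: the English stop word list that ships with scikit-learn.
builtin = CountVectorizer(stop_words="english").fit(docs)

# Option 3: let the vectorizer drop corpus-specific common words;
# max_df=0.9 ignores words that occur in more than 90% of the documents.
by_frequency = CountVectorizer(max_df=0.9).fit(docs)

for name, vectorizer in [("custom", custom), ("builtin", builtin), ("max_df", by_frequency)]:
    print(name, sorted(vectorizer.vocabulary_))
```

With `max_df` the cut-off adapts to the dataset in the same way as our own word list, without having to count the words in a separate step.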

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import nltk
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix
from sklearn import svm

# use to get stopwords locally
#nltk.download('stopwords')

#df_mails = extract_mails(['./20021010_easy_ham.tar.bz2', './20021010_hard_ham.tar.bz2'], './20021010_spam.tar.bz2')

#X_train, X_test, y_train, y_test = train_test_split(df_mails['message'], df_mails['class'], test_size = 0.25, random_state=0)

#vectorizer = CountVectorizer(max_df=0.1, stop_words=stopwords.words('english'))
#X_train = vectorizer.fit_transform(X_train)

def NB_filter(dataframe):

    X = dataframe['message']
    y = dataframe['class']

    #instantiate vectorize with custom stop words
    X_vectorized = CountVectorizer(stop_words=stop_words_top).fit_transform(X)
   
    # Divide the emails into train and test sets
    hamtrain, hamtest, spamtrain, spamtest = train_test_split(X_vectorized, y, test_size=0.25, random_state = 0)

    # instantiate a Multinomial Naive Bayes model
    mnb_classifier = MultinomialNB()
    mnb_classifier.fit(hamtrain, spamtrain)
    mnb_predictions = mnb_classifier.predict(hamtest)

    # instantiate a Bernoulli Naive Bayes model
    bnb_classifier = BernoulliNB()
    bnb_classifier.fit(hamtrain, spamtrain)
    bnb_predictions = bnb_classifier.predict(hamtest)

    #print(confusion_matrix(spamtest, mnb_predictions))
    #print(confusion_matrix(spamtest, bnb_predictions))

    # Calc TP and FN rates
    mnb_tp, mnb_fn = calc_rates(mnb_predictions,spamtest,hamtest)
    bnb_tp, bnb_fn = calc_rates(bnb_predictions,spamtest,hamtest)

    return mnb_tp, mnb_fn, bnb_tp, bnb_fn


df_easy_ham_spam = extract_mails(['./20021010_easy_ham.tar.bz2','./20021010_spam.tar.bz2'],['ham','spam'])
df_hard_ham_spam = extract_mails(['./20021010_hard_ham.tar.bz2','./20021010_spam.tar.bz2'],['ham','spam'])

e_mnb_tp, e_mnb_fn, e_bnb_tp, e_bnb_fn = NB_filter(df_easy_ham_spam)
h_mnb_tp, h_mnb_fn, h_bnb_tp, h_bnb_fn = NB_filter(df_hard_ham_spam)

print('Spam vs. easy-ham with filtered df')
print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(e_mnb_tp,e_mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(e_bnb_tp,e_bnb_fn))
print()
print('Spam vs. hard-ham with filtered df')
print('Multinomial Naive Bayes model gives TP rate = %f and FN rate = %f' %(h_mnb_tp,h_mnb_fn))
print('Bernoulli   Naive Bayes model gives TP rate = %f and FN rate = %f' %(h_bnb_tp,h_bnb_fn))


  'stop_words.' % sorted(inconsistent))
[[647   0]
 [  7 109]]
[[643   4]
 [ 85  31]]
  'stop_words.' % sorted(inconsistent))
[[ 56   1]
 [  3 128]]
[[ 34  23]
 [  3 128]]
Spam vs. easy-ham with filtered df
Multinomial Naive Bayes model gives TP rate = 0.847969 and FN rate = 0.009174
Bernoulli   Naive Bayes model gives TP rate = 0.842726 and FN rate = 0.111402

Spam vs. hard-ham with filtered df
Multinomial Naive Bayes model gives TP rate = 0.297872 and FN rate = 0.015957
Bernoulli   Naive Bayes model gives TP rate = 0.180851 and FN rate = 0.015957


### 5. Eking out further performance
Filter out the headers and footers of the emails before you run on them. The format may vary somewhat between emails, which can make this a bit tricky, so perfect filtering is not required. Run your program again and answer the following questions: 
-	Does the result improve from 3 and 4? 
- The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions on remedies? 
- What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages? 
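
One possible way to drop the headers before vectorizing is Python's standard `email` package. This is only a sketch under assumptions: the helper name `strip_headers` and the sample message are hypothetical, attachments and HTML are kept as raw text, and footers (signatures, mailing-list trailers) are not removed.

```python
from email import message_from_string


def strip_headers(raw_message):
    """Return only the payload text of a raw email, dropping the header block."""
    message = message_from_string(raw_message)
    parts = []
    # walk() visits the message itself and, for multipart mails, every sub-part
    for part in message.walk():
        if part.is_multipart():
            continue  # containers have no text of their own
        payload = part.get_payload()
        if isinstance(payload, str):
            parts.append(payload)
    return "\n".join(parts)


# Hypothetical sample message, only for illustration.
raw = (
    "From: alice@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "See you at noon.\n"
)
body = strip_headers(raw)
print(body)
```

The function could be applied to the `message` column of the dataframe (for example with `apply`) before the messages are passed to `CountVectorizer`.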

Re-estimate your classifier using `fit_prior` parameter set to `False`, and answer the following questions:
- What does this parameter mean?
- How does this alter the predictions? Discuss why or why not.
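
The effect of `fit_prior` on the class priors can be inspected directly. The sketch below uses a hypothetical count matrix with three ham rows and one spam row; `class_log_prior_` is the fitted attribute of `MultinomialNB` that stores the log priors.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: three ham mails and one spam mail.
X = np.array([[2, 0], [3, 0], [1, 1], [0, 2]])
y = np.array(["ham", "ham", "ham", "spam"])

# fit_prior=True (the default) learns the priors from the class frequencies.
learned = MultinomialNB(fit_prior=True).fit(X, y)
# fit_prior=False uses a uniform prior over the classes instead.
uniform = MultinomialNB(fit_prior=False).fit(X, y)

learned_prior = np.exp(learned.class_log_prior_)
uniform_prior = np.exp(uniform.class_log_prior_)
print(learned.classes_, learned_prior)
print(uniform.classes_, uniform_prior)
```

With the learned prior, the imbalance of the training set pushes borderline mails towards the majority class; with the uniform prior, only the word likelihoods decide.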