**Jade Chang**

Spring 2023

CS 251/2: Data Analysis and Visualization

Project 6: Supervised learning

In [8]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use(['seaborn-colorblind', 'seaborn-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

## Task 3: Preprocess full spam email dataset 

Before you build a Naive Bayes spam email classifier, run the full spam email dataset through your preprocessing code.

Download and extract the full **Enron** emails (*zip file should be ~29MB large*). You should see a base `enron` folder, with `spam` and `ham` subfolders when you extract the zip file (these are the 2 classes).

Run the test code below to check everything over.

### 3a) Preprocess dataset

In [58]:
import email_preprocessor as epp

#### Test `count_words` and `find_top_words`

In [74]:
word_freq, num_emails = epp.count_words()

In [75]:
print(f'You found {num_emails} emails in the datset. You should have found 32625.')

You found 32625 emails in the datset. You should have found 32625.


In [76]:
top_words, top_counts = epp.find_top_words(word_freq)
print(f"Your top 5 words are\n{top_words[:5]}\nand they should be\n['the', 'to', 'and', 'of', 'a']")
print(f"The associated counts are\n{top_counts[:5]}\nand they should be\n[277459, 203659, 148873, 139578, 111796]")

Your top 5 words are
['the', 'to', 'and', 'of', 'a']
and they should be
['the', 'to', 'and', 'of', 'a']
The associated counts are
[277459, 203659, 148873, 139578, 111796]
and they should be
[277459, 203659, 148873, 139578, 111796]


### 3b) Make train and test splits of the dataset

Here we divide the email features into a 80/20 train/test split (80% of data used to train the supervised learning model, 20% we withhold and use for testing / prediction).

In [77]:
features, y = epp.make_feature_vectors(top_words, num_emails)

In [78]:
np.random.seed(0)
x_train, y_train, inds_train, x_test, y_test, inds_test = epp.make_train_test_sets(features, y)

In [79]:
print('Shapes for train/test splits:')
print(f'Train {x_train.shape}, classes {y_train.shape}')
print(f'Test {x_test.shape}, classes {y_test.shape}')
print('\nThey should be:\nTrain (26100, 200), classes (26100,)\nTest (6525, 200), classes (6525,)')

Shapes for train/test splits:
Train (26100, 200), classes (26100,)
Test (6525, 200), classes (6525,)

They should be:
Train (26100, 200), classes (26100,)
Test (6525, 200), classes (6525,)


### 3c) Save data in binary format

It adds a lot of overhead to have to run through your raw email -> train/test feature split every time you wanted to work on your project! In this step, you will export the data in memory to disk in a binary format. That way, you can quickly load all the data back into memory (directly in ndarray format) whenever you want to work with it again. No need to parse from text files!

Running the following cell uses numpy's `save` function to make six files in `.npy` format (e.g. `email_train_x.npy`, `email_train_y.npy`, `email_train_inds.npy`, `email_test_x.npy`, `email_test_y.npy`, `email_test_inds.npy`).

In [80]:
np.save('data/email_train_x.npy', x_train)
np.save('data/email_train_y.npy', y_train)
np.save('data/email_train_inds.npy', inds_train)
np.save('data/email_test_x.npy', x_test)
np.save('data/email_test_y.npy', y_test)
np.save('data/email_test_inds.npy', inds_test)

## Task 4: Naive Bayes Classifier

After finishing your email preprocessing pipeline, implement the one other supervised learning algorithm we we will use to classify email, **Naive Bayes**.

### 4a) Implement Naive Bayes

In `naive_bayes.py`, implement the following methods:
- Constructor
- get methods
- `train(data, y)`: Train the Naive Bayes classifier so that it records the "statistics" of the training set: class priors (i.e. how likely an email is in the training set to be spam or ham?) and the class likelihoods (the probability of a word appearing in each class — spam or ham).
- `predict(data)`: Combine the class likelihoods and priors to compute the posterior distribution. The predicted class for a test sample is the class that yields the highest posterior probability.
- `accuracy(y, y_pred)`: The usual definition :)


#### Bayes rule ingredients: Priors and likelihood (`train`)

To compute class predictions (probability that a test example belong to either spam or ham classes), we need to evaluate **Bayes Rule**. This means computing the priors and likelihoods based on the training data.

**Prior:** $$P_c = \frac{N_c}{N}$$ where $P_c$ is the prior for class $c$ (spam or ham), $N_c$ is the number of training samples that belong to class $c$ and $N$ is the total number of training samples.

**Likelihood:** $$L_{c,w} = \frac{N_{c,w} + 1}{N_{c} + M}$$ where
- $L_{c,w}$ is the likelihood that word $w$ belongs to class $c$ (*i.e. what we are solving for*)
- $N_{c,w}$ is the total count of **word $w$** in emails that are only in class $c$ (*either spam or ham*)
- $N_{c}$ is the total number of **all words** that appear in emails of the class $c$ (*total number of words in all spam emails or total number of words in all ham emails*)
- $M$ is the number of features (*number of top words*).

#### Bayes rule ingredients: Posterior (`predict`)

To make predictions, we now combine the prior and likelihood to get the posterior:

**Log Posterior:** $$Log(\text{Post}_{i, c}) = Log(P_c) + \sum_{j \in J_i}x_{i,j}Log(L_{c,j})$$

 where
- $\text{Post}_{i,c}$ is the posterior for class $c$ for test sample $i$(*i.e. evidence that email $i$ is spam or ham*). We solve for its logarithm.
- $Log(P_c)$ is the logarithm of the prior for class $c$.
- $x_{i,j}$ is the number of times the jth word appears in the ith email.
- $Log(L_{c,j})$: is the log-likelihood of the jth word in class $c$.

In [81]:
from naive_bayes import NaiveBayes

#### Test `train`

###### Class priors and likelihoods

The following test should be used only if storing the class priors and likelihoods directly.

In [82]:
num_test_classes = 4
np.random.seed(0)
data_test = np.random.randint(low=0, high=20, size=(100, 6))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_test, y_test)

print(f'Your class priors are: {nbc.get_priors()}\nand should be          [0.28 0.22 0.32 0.18].')
print(f'Your class likelihoods shape is {nbc.get_likelihoods().shape} and should be (4, 6).')
print(f'Your likelihoods are:\n{nbc.get_likelihoods()}')

print(f'and should be')
print('''[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]''')

Your class priors are: [0.28 0.22 0.32 0.18]
and should be          [0.28 0.22 0.32 0.18].
Your class likelihoods shape is (4, 6) and should be (4, 6).
Your likelihoods are:
[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]
and should be
[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]


###### Log of class priors and likelihoods

This test should be used only if storing the log of the class priors and likelihoods.

In [83]:
num_test_classes = 4
np.random.seed(0)
data_test = np.random.randint(low=0, high=20, size=(100, 6))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_test, y_test)

print(f'Your log class priors are: {nbc.get_priors()}\nand should be              [-1.27297 -1.51413 -1.13943 -1.7148 ].')
print(f'Your log class likelihoods shape is {nbc.get_likelihoods().shape} and should be (4, 6).')
print(f'Your log likelihoods are:\n{nbc.get_likelihoods()}')


print(f'and should be')
print('''[[-1.83274 -1.89109 -1.57069 -1.65516 -1.95306 -1.90841]
 [-2.13211 -1.78255 -1.71958 -1.77756 -1.71023 -1.6918 ]
 [-1.77881 -1.75342 -1.93136 -1.94266 -1.67217 -1.70448]
 [-1.82475 -1.77132 -1.84321 -1.96879 -1.66192 -1.70968]]''')

Your log class priors are: [0.28 0.22 0.32 0.18]
and should be              [-1.27297 -1.51413 -1.13943 -1.7148 ].
Your log class likelihoods shape is (4, 6) and should be (4, 6).
Your log likelihoods are:
[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]
and should be
[[-1.83274 -1.89109 -1.57069 -1.65516 -1.95306 -1.90841]
 [-2.13211 -1.78255 -1.71958 -1.77756 -1.71023 -1.6918 ]
 [-1.77881 -1.75342 -1.93136 -1.94266 -1.67217 -1.70448]
 [-1.82475 -1.77132 -1.84321 -1.96879 -1.66192 -1.70968]]


#### Test `predict`

In [84]:
num_test_classes = 4
np.random.seed(0)
data_train = np.random.randint(low=0, high=num_test_classes, size=(100, 10))
data_test = np.random.randint(low=0, high=num_test_classes, size=(15, 10))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_train, y_test)
test_y_pred = nbc.predict(data_test)

print(f'Your predicted classes are\n{test_y_pred}\nand should be\n[3 0 3 1 0 1 1 3 0 3 0 2 0 2 1]')

Your predicted classes are
[3 0 3 1 0 1 1 3 0 3 0 2 0 2 1]
and should be
[3 0 3 1 0 1 1 3 0 3 0 2 0 2 1]


### 4b) Spam filtering

Let's start classifying spam email using the Naive Bayes classifier. The following code uses `np.load` to load in the train/test split that you created last week.
- Use your Naive Bayes classifier on the Enron email dataset!

**Question 7:** Print out the accuracy that you get on the test set with Naive Bayes. It should be roughly 89%.

In [85]:
import email_preprocessor as ep

In [86]:
x_train = np.load('data/email_train_x.npy')
y_train = np.load('data/email_train_y.npy')
inds_train = np.load('data/email_train_inds.npy')
x_test = np.load('data/email_test_x.npy')
y_test = np.load('data/email_test_y.npy')
inds_test = np.load('data/email_test_inds.npy')

In [87]:
print(x_train.shape)
print(y_train.shape)
print(inds_train.shape)

(26100, 200)
(26100,)
(26100,)


In [88]:
import naive_bayes 
nb = naive_bayes.NaiveBayes(2)

In [89]:
nb.train(x_train,y_train)

In [90]:
y_pred = nb.predict(x_test)

In [91]:
nb.accuracy(y_test,y_pred)

0.8895019157088122

### 4c) Confusion matrix

To get a better sense of the errors that the Naive Bayes classifer makes, you will create a confusion matrix. 

- Implement `confusion_matrix` in `naive_bayes.py`.
- Print out a confusion matrix of the spam classification results.

**Debugging guidelines**:
1. The sum of all numbers in your 2x2 confusion matrix should equal the number of test samples (6525).
2. The sum of your spam row should equal the number of spam samples in the test set (3193)
3. The sum of your ham row should equal the number of spam samples in the test set (3332)

In [92]:
nb.confusion_matrix(y_test,y_pred)

array([[3025.,  168.],
       [ 553., 2779.]])

In [93]:
 #TP rate : TPR =TP/(TP+FN)
    
TPR = 3025/(3025+168)
print("Naive Bayes TPR: ", TPR)
# FP rate: FPR = FP/(FP+TN)

FPR = 553/(553+2779)
print("Naive Bayes FPR: ", FPR)

Naive Bayes TPR:  0.9473849044785468
Naive Bayes FPR:  0.16596638655462184


**Question 8:** Interpret the confusion matrix, using the convention that positive detection means spam (*e.g. a false positive means classifying a ham email as spam*). What types of errors are made more frequently by the classifier? What does this mean (*i.e. X (spam/ham) is more likely to be classified than Y (spam/ham) than the other way around*)?

**Reminder:** Look back and make sure you are clear on which class indices correspond to spam/ham.

**Answer 8:** bottom left: ham is categorised to spam. So false positive is more frequent. This means that ham is more likely to be wrongly classified as spam. 

## Task 5: Comparison with KNN


- Run a similar analysis to what you did with Naive Bayes above. When computing accuracy on the test set, you may want to reduce the size of the test set (e.g. to the first 500 emails in the test set).
- Copy-paste your `confusion_matrix` method into `knn.py` so that you can run the same analysis on a KNN classifier.

In [94]:
from knn import KNN

In [95]:
classifier = KNN(num_classes=2)
classifier.train(x_train,y_train)
y_pred = classifier.predict(x_test[:500],2)
accuracy = classifier.accuracy(y_test[:500],y_pred)
print(accuracy)

0.92


In [96]:
classifier.confusion_matrix(y_test[:500],y_pred)

array([[265.,   2.],
       [ 38., 195.]])

In [97]:
 #TP rate : TPR =TP/(TP+FN)
    
TPR = 265/(265+2)
print("KNN TPR: ", TPR)
# FP rate: FPR = FP/(FP+TN)

FPR = 38/(38+195)
print("KNN FPR: ", FPR)

KNN TPR:  0.9925093632958801
KNN FPR:  0.1630901287553648


**Question 9:** What accuracy did you get on the test set (potentially reduced in size)?

**Question 10:** How does the confusion matrix compare to that obtained by Naive Bayes (*If you reduced the test set size, keep that in mind*)?

**Question 11:** Briefly describe at least one pro/con of KNN compared to Naive Bayes on this dataset.

**Question 12:** When potentially reducing the size of the test set here, why is it important that we shuffled our train and test set?

**Answer 9:** Accuracy is 0.92

**Answer 10:** Similar to naive bayes classification, the False Positive rate is around 0.16. However, it seems like knn does a better job in categorizing to true positive, with a TPR of 0.99

**Answer 11:** 

Pros: Knn does not require training
Cons: It is slower

**Answer 12:** Because the original data could be ordered, (for example, all the data with the same class could be in the front of the data), so it needs to be shuffled in order to take the best representation of the entire dataset. 

## Extensions

### 0. Classify your own datasets

- Find datasets that you find interesting and run classification on them using your KNN algorithm (and if applicable, Naive Bayes). Analysis the performance of your classifer.

In [6]:
import pandas as pd

id='1eWt0Zr7Td7vSPA3AL9ssIHkF-I3VXEJg'
url= f"https://docs.google.com/uc?id={id}&export=download"
raw_dataset = pd.read_csv(url)
print(raw_dataset)
data = raw_dataset.to_numpy()
x = data[:,0]
y = data[:,1]

                                                   text  label
0     I always wrote this series off as being a comp...      0
1     1st watched 12/7/2002 - 3 out of 10(Dir-Steve ...      0
2     This movie was so poorly written and directed ...      0
3     The most interesting thing about Miryang (Secr...      1
4     when i first read about "berlin am meer" i did...      0
...                                                 ...    ...
4995  This is the kind of picture John Lassiter woul...      1
4996  A MUST SEE! I saw WHIPPED at a press screening...      1
4997  NBC should be ashamed. I wouldn't allow my chi...      0
4998  This movie is a clumsy mishmash of various gho...      0
4999  Formula movie about the illegitimate son of a ...      0

[5000 rows x 2 columns]


In [63]:
print(y)

[0 0 0 ... 0 0 0]


In [21]:
import email_preprocessor as epp
import knn

In [22]:

dictionary = {}
for text in x:
    token = epp.tokenize_words(text)
    for word in token:
        if word in dictionary:
            dictionary[word] = dictionary[word] + 1
        else:
            dictionary[word] = 1


In [None]:
print(dictionary)



In [24]:
top_words, counts = epp.find_top_words(dictionary,250)

In [25]:
feats = []


for data in x:
    feat = np.zeros(len(top_words))
    words = epp.tokenize_words(data)
    for word in words:
        if word in top_words:
            idx = top_words.index(word)
            feat[idx] = feat[idx] + 1
    feats.append(feat)

feats = np.asarray(feats)

print(feats.shape)

(5000, 250)


In [26]:
print(y)

[0 0 0 ... 0 0 0]


In [27]:
x_train, y_train, inds_train, x_test, y_test, inds_test = epp.make_train_test_sets(feats, y, test_prop=0.2, shuffle=True)



In [38]:
from knn import KNN
classifier = KNN(num_classes=2)
classifier.train(x_train,y_train)
for i in range(2,11):
    y_pred = classifier.predict(x_test[:500],i)
    accuracy = classifier.accuracy(y_test[:500],y_pred)
    print(accuracy)

0.586
0.584
0.6
0.596
0.626
0.62
0.634
0.61
0.604


using different number of ks to find the optimal accuracy solution

In [39]:

import naive_bayes 
nb = naive_bayes.NaiveBayes(2)

nb.train(x_train,y_train)
y_pred = nb.predict(x_test)
accuracy = nb.accuracy(y_test,y_pred)
print(accuracy)


0.756


It seems like naive bayes have slightly higher accuracy for this particular dataset. 

### 1. Better text preprocessing

- If you look at the top words extracted from the email dataset, many of them are common "stop words" (e.g. a, the, to, etc.) that do not carry much meaning when it comes to differentiating between spam vs. non-spam email. Improve your preprocessing pipeline by building your top words without stop words. Analyze performance differences.

In [40]:
print(top_words)
top_words = top_words[21:]
print(top_words)

['his', 'you', 'are', 'have', 'be', 'one', 'he', 'all', 'by', 'at', 'an', 'they', 'so', 'who', 'from', 'like', 'or', 'just', 'her', 'out', "it's", 'if', 'about', 'has', 'there', 'what', 'some', 'good', 'very', 'when', 'more', 'up', 'my', 'even', 'time', 'no', 'would', 'she', 'which', 'their', 'story', 'only', 'really', 'had', 'see', 'me', 'can', 'were', 'well', 'we', 'than', 'much', 'bad', 'been', 'other', 'do', 'great', 'get', 'because', 'first', 'how', 'people', 'him', 'into', "don't", 'also', 'will', 'made', 'most', 'its', 'way', 'then', 'them', 'after', 'could', 'any', 'make', 'movies', 'too', 'think', 'characters', 'two', 'watch', 'many', 'character', 'plot', 'films', 'seen', 'being', 'acting', 'never', 'life', 'did', 'best', 'know', 'love', 'show', 'off', 'where', 'little', 'ever', 'over', 'end', 'better', 'scene', 'does', 'your', 'man', 'here', 'these', 'why', 'something', 'such', 'scenes', 'still', 'say', 'while', 'through', 'should', 'go', 'watching', 'now', 'real', 'back', "i

In [41]:
feats = []


for data in x:
    feat = np.zeros(len(top_words))
    words = epp.tokenize_words(data)
    for word in words:
        if word in top_words:
            idx = top_words.index(word)
            feat[idx] = feat[idx] + 1
    feats.append(feat)

feats = np.asarray(feats)

print(feats.shape)

(5000, 208)


In [42]:
print(y)

[0 0 0 ... 0 0 0]


In [43]:
x_train, y_train, inds_train, x_test, y_test, inds_test = epp.make_train_test_sets(feats, y, test_prop=0.2, shuffle=True)



In [44]:
classifier = KNN(num_classes=2)
classifier.train(x_train,y_train)
for i in range(2,11):
    y_pred = classifier.predict(x_test[:500],i) #different number of neighbours
    accuracy = classifier.accuracy(y_test[:500],y_pred)
    print(accuracy)

0.588
0.604
0.594
0.62
0.61
0.628
0.618
0.618
0.628


In [45]:
nb.train(x_train,y_train)
y_pred = nb.predict(x_test)
accuracy = nb.accuracy(y_test,y_pred)
print(accuracy)


0.776


By removing the first 20 common words, it seems to slightly improve the accuray of the predictions for both naive bayes and knn classifiers. The trend remains the same.

In [123]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

for word in stopwords.words('english'):
    if word in dictionary:
        dictionary.pop(word)
    
print(dictionary)



After removing the common english words, I want to see if that would influence accuracy of our classifiers.

In [124]:
top_words, counts = epp.find_top_words(dictionary,250)
feats = []

for data in x:
    feat = np.zeros(len(top_words))
    words = epp.tokenize_words(data)
    for word in words:
        if word in top_words:
            idx = top_words.index(word)
            feat[idx] = feat[idx] + 1
    feats.append(feat)

feats = np.asarray(feats)
x_train, y_train, inds_train, x_test, y_test, inds_test = epp.make_train_test_sets(feats, y, test_prop=0.2, shuffle=True)

#training using Naive Bayes
nb.train(x_train,y_train)
y_pred = nb.predict(x_test)
accuracy = nb.accuracy(y_test,y_pred)
print(accuracy)

classifier = KNN(num_classes=2)
classifier.train(x_train,y_train)


0.783


Compared to the original method without removing stop words, the accuracy improved slightly from 0.756 to 0.783 using naive bayes

In [125]:
classifier = KNN(num_classes=2)
classifier.train(x_train,y_train)
for i in range(2,11):
    y_pred = classifier.predict(x_test[:500],i) #different number of neighbours
    accuracy = classifier.accuracy(y_test[:500],y_pred)
    print(accuracy)

0.578
0.574
0.604
0.572
0.63
0.616
0.638
0.616
0.624


Compared to the original method without removing stop words, the accuracy is relatively similar 

They changed from 0.586
0.584
0.6
0.596
0.626
0.62
0.634
0.61
0.604 to 

0.578
0.574
0.604
0.572
0.63
0.616
0.638
0.616
0.624 using knn

### 2. Feature size

- Explore how the number of selected features for the email dataset influences accuracy and runtime performance.

In [51]:
import time
import random

In [52]:
np.random.seed(0)

for i in range(1,20):
    start = time.time()
    #print(dictionary)
    top_words, counts = epp.find_top_words(dictionary,i*100)
    
    feats = []


    for data in x:
        feat = np.zeros(len(top_words))
        words = epp.tokenize_words(data)
        for word in words:
            if word in top_words:
                idx = top_words.index(word)
                feat[idx] = feat[idx] + 1
        feats.append(feat)

    feats = np.asarray(feats)
    x_train, y_train, inds_train, x_test, y_test, inds_test = epp.make_train_test_sets(feats, y, test_prop=0.2, shuffle=True)


    nb.train(x_train,y_train)
    y_pred = nb.predict(x_test)
    accuracy = nb.accuracy(y_test,y_pred)
    end = time.time()
    result = end - start
    print("accuracy given",i*100,"number of features",accuracy)
    print("time taken: ", result)

accuracy given 0 number of features 0.479
time taken:  0.27466487884521484
accuracy given 100 number of features 0.654
time taken:  0.948052167892456
accuracy given 200 number of features 0.693
time taken:  1.4239680767059326
accuracy given 300 number of features 0.743
time taken:  1.899703025817871
accuracy given 400 number of features 0.754
time taken:  2.2571542263031006
accuracy given 500 number of features 0.787
time taken:  2.672999143600464
accuracy given 600 number of features 0.777
time taken:  3.0143790245056152
accuracy given 700 number of features 0.796
time taken:  3.325918197631836
accuracy given 800 number of features 0.779
time taken:  3.6925201416015625
accuracy given 900 number of features 0.797
time taken:  4.064137935638428
accuracy given 1000 number of features 0.799
time taken:  4.417773962020874
accuracy given 1100 number of features 0.828
time taken:  4.697555065155029
accuracy given 1200 number of features 0.821
time taken:  5.046492099761963
accuracy given 130

The more features that we retain, the longer the time it takes to process. Although accuracy increases, at around 700 numher of features used, the accuracy plateaus and even fluctuates, with little to no improvement to the accuracy. Therefore, it seems like using 500 features would produce the best result (0.811) within the shortest amount of time (3.288s). 

### 3. Distance metrics
- Compare KNN performance with the $L^2$ and $L^1$ distance metrics

### 4. K-Fold Cross-Validation

- Research this technique and apply it to data and your KNN and/or Naive Bayes classifiers.

In [48]:
#split into k number of training sets
#k number of training times. 
#k number of validation data's accuracy. 
#average of all the validation accuracy

In [101]:
feats = []
for data in x:
    feat = np.zeros(len(top_words))
    words = epp.tokenize_words(data)
    for word in words:
        if word in top_words:
            idx = top_words.index(word)
            feat[idx] = feat[idx] + 1
    feats.append(feat)

feats = np.asarray(feats)

#print(feats)

In [103]:
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X_x, y_y = make_classification(n_samples=500, n_features=200, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X_x.shape, y_y.shape)
print(feats.shape,y.shape)

(500, 200) (500,)
(5000, 1900) (5000,)


In [104]:
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize
from sklearn.naive_bayes import GaussianNB
feats = normalize(feats)
y = y.astype('int64')
# create dataset
#X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
print(feats.shape)
print(y.shape)
# create model
model = GaussianNB()
# evaluate model
scores = cross_val_score(model, feats, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance


print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

(5000, 1900)
(5000,)
Accuracy: 0.787 (0.014)


Using KFold with NB

In [111]:
from sklearn.model_selection import KFold
k = 5
kf = KFold(n_splits=k, shuffle = True, random_state = 1)
accuracy = []
for train_index,validation_index in kf.split(feats):
    #print(X_train,X_test) # indices
    x_train = feats[train_index] 
    x_test = feats[validation_index]
    y_train = y[train_index] 
    y_test = y[validation_index]
    
    
    nb.train(x_train,y_train)
    y_pred = nb.predict(x_test)
    nb_accuracy = nb.accuracy(y_test,y_pred)
    end = time.time()
    result = end - start
    accuracy.append(nb_accuracy)
    print("accuracy: ",nb_accuracy)
    print("time taken: ", result)

print("average accuracy: ",np.sum(accuracy)/k)

# for X_train_i,X_test_i in kf.split(feats):
#     print(feats[X_train_i],feats[X_test_i]) # features corresponding the indices

accuracy:  0.822
time taken:  2775.2070372104645
accuracy:  0.825
time taken:  2775.2529339790344
accuracy:  0.83
time taken:  2775.290516138077
accuracy:  0.823
time taken:  2775.320107936859
accuracy:  0.844
time taken:  2775.351049184799
average accuracy:  0.8288


Using KFold on Knn

In [112]:
from sklearn.model_selection import KFold
k = 5
kf = KFold(n_splits=k, shuffle = True, random_state = 1)
accuracy = []
for train_index,validation_index in kf.split(feats):
    #print(X_train,X_test) # indices
    x_train = feats[train_index] 
    x_test = feats[validation_index]
    y_train = y[train_index] 
    y_test = y[validation_index]
    
    
    classifier.train(x_train,y_train)
    y_pred = nb.predict(x_test)
    classifier_accuracy = classifier.accuracy(y_test,y_pred)
    end = time.time()
    result = end - start
    accuracy.append(classifier_accuracy)
    print("accuracy: ",classifier_accuracy)
    print("time taken: ", result)

print("average accuracy: ",np.sum(accuracy)/k)



accuracy:  0.848
time taken:  2873.823977947235
accuracy:  0.847
time taken:  2873.84588098526
accuracy:  0.859
time taken:  2873.8579540252686
accuracy:  0.849
time taken:  2873.8805680274963
accuracy:  0.844
time taken:  2873.8943390846252
average accuracy:  0.8493999999999999


### 5. Email error analysis

- Dive deeper into the properties of the emails that were misclassified (FP and/or FN) by Naive Bayes or KNN. What is their word composition? How many words were skipped because they were not in the training set? What could plausibly account for the misclassifications?

### 6. Investigate the misclassification errors

Numbers are nice, but they may not the best for developing your intuition. Sometimes, you want to see what an misclassification *actually looks like* to help you improve your algorithm. Retrieve the actual text of some example emails of false positive and false negative misclassifications to see if helps you understand why the misclassification occurred. Here is an example workflow:

- Decide on how many FP and FN emails you would like to retrieve. Find the indices of this many false positive and false negative misclassification. Remember to use your `test_inds` array to look up the index of the emails BEFORE shuffling happened.
- Implement the function `retrieve_emails` in `email_preprocessor.py` to return the string of the raw email at the error indices.
- Call your function to print out the emails that produced misclassifications.

Do the FP and FN emails make sense? Why? Do the emails have properties in common? Can you quantify and interpret them?