# Project 2: Spam/Ham Prediction

In this project, you will use what you've learned in class to create a classifier that can distinguish spam emails from ham (non-spam) emails.

We'll walk you through a couple steps to get you started, but this project is almost entirely open-ended. Instead of providing you with a skeleton to fill in, we will evaluate your work based on your model's accuracy and your written responses in this notebook.

## Kaggle

This project is a bit different from the other assignments in this class because we are using Kaggle to evaluate your model's accuracy. Kaggle is a website that hosts machine learning competitions.

We've created a competition just for this project: https://www.kaggle.com/t/433a6bca95f94a78a0d2a6e7e8b311c3

Here's how submitting to Kaggle works:

1. You will create a classifier using the training dataset.
2. You will use your classifier to make predictions on the test dataset.
3. You will upload your predictions as a CSV to https://www.kaggle.com/t/433a6bca95f94a78a0d2a6e7e8b311c3
4. The website will tell you your accuracy on the test set. You may only do this twice a day. You must reach a test set accuracy of **88%** in order to get full credit for the Kaggle portion of the assignment.

(After the assignment ends, we will evaluate your accuracy on a private test set to ensure that you aren't overfitting to the test set.)
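
As a rough illustration of step 3, the sketch below writes predictions to a CSV using Python's `csv` module. The column names `id` and `spam`, the helper name `write_submission`, and the toy values are all assumptions made for this example; check the competition page for the format Kaggle actually expects.

```python
import csv
import io

def write_submission(ids, predictions, out_file):
    """Write one row per test email: its id and the predicted label (0 = ham, 1 = spam)."""
    writer = csv.writer(out_file)
    # Hypothetical header; the competition page defines the required column names.
    writer.writerow(["id", "spam"])
    for email_id, label in zip(ids, predictions):
        writer.writerow([email_id, int(label)])

# Toy stand-ins for the test set ids and a classifier's predictions.
toy_ids = [0, 1, 2]
toy_predictions = [0, 1, 1]

# An in-memory buffer keeps the example self-contained;
# open('submission.csv', 'w', newline='') would write a real file instead.
buffer = io.StringIO()
write_submission(toy_ids, toy_predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```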

## Submission

This project has no ok tests (and no autograder). Instead, you will submit the following:

0. **Your notebook to OkPy**. You can do this by running the `ok.submit()` cell at the bottom of this notebook. Note that there is no autograder for this assignment so you will not receive autograder emails.
0. **Your notebook's written answers to GradeScope.** The cell to export the notebook is located at the bottom of this notebook. If you have trouble converting your notebook to PDF, you may upload your notebook to http://datahub.berkeley.edu/ and run the cell there.
0. **Your model's predictions on the test set to Kaggle**, a website that hosts machine learning competitions. Kaggle will output your accuracy on the test set so that you will know whether you've met the accuracy threshold or not.

**To prevent you from fitting to the test set, you may only upload predictions to Kaggle twice per day.** This means you should start early. In addition, if you decide to pair with someone else, your group only gets two submissions per day (not four).

This project (notebook + Gradescope submissions) is officially due Friday, Dec 1 at 11:59:59pm since we can't make assignments due after classes end. However, we will accept submissions until **Monday, Dec 4 at 11:59:59pm** without using slip days. Submissions after Dec 4 will use 1 slip day each day after Dec 4. The Kaggle competition will remain open until **Saturday Dec 9 at 11:59:59pm**.

**No late Kaggle submissions will be accepted** since we've taken slip days into account when setting the Kaggle deadline. You will not use slip days for Kaggle submissions.

## Grading
Grading will be based on a number of set criteria, enumerated below:

Task | Description
--- | ---
Basic Classifier | You successfully implement our guided basic logistic regression classifier.
EDA | You create four exploratory plots that help explain your feature choices.
Feature Selection | You explain and justify your feature selection process.
Written Questions | You answer the written questions that we place throughout this notebook.
Kaggle Accuracy | Your model beats the prediction accuracy threshold of **88%**. This is attainable with a well-thought-out model.

**You are allowed to work in groups of 2 for this assignment!** If you decide to partner with someone else, make sure you do the following:

1. Have one person in the group invite the other on OkPy: https://okpy.org/cal/ds100/fa17/proj2/
1. Have one person in the group invite the other person on Gradescope.
1. Have one person in the group invite the other person on Kaggle: https://www.kaggle.com/t/433a6bca95f94a78a0d2a6e7e8b311c3

## Prizes

Although you need to reach 88% accuracy in order to get full credit, we will reward those that create great classifiers.

The top 10 students on the Kaggle leaderboard, evaluated by their score on the private test set, will:

1. Have bragging rights 
2. Be invited to attend a lunch at the Faculty Club, hosted by Professors Gonzalez and Nolan.

## Restrictions

While we want you to be creative with your models, we want to make it fair to students who are seeing these techniques for the first time.  As such, **you are only allowed to train logistic regression models and their regularized forms**.  This means no random forest, CART, neural nets, etc.  However, you are free to feature engineer to your heart's content.  Remember that domain knowledge is the third component of data science.

## Getting Started

In [88]:
# Run this cell to set up your notebook
import seaborn as sns
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
sns.set_context("talk")

from IPython.display import display, Latex, Markdown, HTML, Javascript
from client.api.notebook import Notebook
ok = Notebook('proj2.ok')

Assignment: Project 2
OK, version v1.13.9



In [89]:
# Log into OkPy.
# You might need to change this to ok.auth(force=True) if you get an error
ok.auth(force=False)

Successfully logged in as aviarjavalingam@berkeley.edu


For your convenience, run this cell to highlight the written response cells in light blue. Only the highlighted cells will be converted to the GradeScope PDF, so put your written answers there.

Unfortunately, you'll have to run this each time you open your notebook to highlight cells.

In [90]:
highlight_cells = '''
Jupyter.notebook.get_cells().map(function(cell) {
  var tags = cell.metadata.tags
  if (tags && tags.indexOf('written') >= 0)
    cell.element.css('background-color', '#efefff')
})
'''
display(Javascript(highlight_cells))

<IPython.core.display.Javascript object>

## Loading in the Data

The dataset consists of email messages and their labels (0 for ham, 1 for spam). The training set contains 8348 labeled examples, and the test set contains 1000 unlabeled examples.

Run the following cells to load the data into DataFrames.

The `train` DataFrame contains labeled data that you will use to train your model. It contains four columns:

1. `id`: An identifier for the training example.
1. `subject`: The subject of the email.
1. `email`: The text of the email.
1. `spam`: 1 if the email was spam, 0 if the email was ham (not spam).

The `test` DataFrame contains another set of 1000 unlabeled examples. You will predict labels for these examples and submit your predictions to Kaggle for evaluation.

In [91]:
train = pd.read_csv('train.csv')
# We lower case the emails to make them easier to work with
train['email'] = train['email'].str.lower()
train.head()

Unnamed: 0,id,subject,email,spam
0,0,Subject: A&L Daily to be auctioned in bankrupt...,url: http://boingboing.net/#85534171\n date: n...,0
1,1,"Subject: Wired: ""Stronger ties between ISPs an...",url: http://scriptingnews.userland.com/backiss...,0
2,2,Subject: It's just too small ...,<html>\n <head>\n </head>\n <body>\n <font siz...,1
3,3,Subject: liberal defnitions\n,depends on how much over spending vs. how much...,0
4,4,Subject: RE: [ILUG] Newbie seeks advice - Suse...,hehe sorry but if you hit caps lock twice the ...,0


In [92]:
test = pd.read_csv('test.csv')
test['email'] = test['email'].str.lower()
test.head()

Unnamed: 0,id,subject,email
0,0,Subject: CERT Advisory CA-2002-21 Vulnerabilit...,\n \n -----begin pgp signed message-----\n \n ...
1,1,Subject: ADV: Affordable Life Insurance ddbfk\n,low-cost term-life insurance!\n save up to 70%...
2,2,Subject: CAREER OPPORTUNITY. WORK FROM HOME\n,------=_nextpart_000_00a0_03e30a1a.b1804b54\n ...
3,3,Subject: Marriage makes both sexes happy\n,"url: http://www.newsisfree.com/click/-3,848315..."
4,4,Subject: Re: [SAtalk] SA very slow (hangs?) on...,on thursday 29 august 2002 16:39 cet mike burg...


### Question 1

In the cell below, print the text of the first ham and the first spam email in the training set. Then, discuss one thing you notice that is different between the two.

In [93]:
# Print the text of the first ham and the first spam emails. Then, fill in your response in the q01 variable:

print(train[train['spam'] == 0].reset_index().email[0])
print(train[train['spam'] == 1].reset_index().email[0])

q01 = '''
*Type your answer here, replacing this text.*
'''
display(Markdown(q01))

url: http://boingboing.net/#85534171
 date: not supplied
 
 arts and letters daily, a wonderful and dense blog, has folded up its tent due 
 to the bankruptcy of its parent company. a&l daily will be auctioned off by the 
 receivers. link[1] discuss[2] (_thanks, misha!_)
 
 [1] http://www.aldaily.com/
 [2] http://www.quicktopic.com/boing/h/zlfterjnd6jf
 
 

<html>
 <head>
 </head>
 <body>
 <font size=3d"4"><b> a man endowed with a 7-8" hammer is simply<br>
  better equipped than a man with a 5-6"hammer. <br>
 <br>would you rather have<br>more than enough to get the job done or fall =
 short. it's totally up<br>to you. our methods are guaranteed to increase y=
 our size by 1-3"<br> <a href=3d"http://209.163.187.47/cgi-bin/index.php?10=
 004">come in here and see how</a>
 </body>
 </html>
 
 
 




*Type your answer here, replacing this text.*


## Our First Features

We would like to take the text of an email and predict whether the text is ham or spam. This is a *classification* problem, so we will use logistic regression to make a classifier.

Recall that the input to logistic regression is a matrix $X$ that contains numeric values only. Unfortunately, our data are text, not numbers. To address this, we can create numeric features derived from the email text and use those features for logistic regression.

Each row of $X$ is derived from one email example. Each column of $X$ is one feature. We'll guide you through creating a simple feature, and you'll create more interesting ones when you are trying to increase your accuracy.

### Question 2

Create a function called `words_in_text` that takes in a list of words and the text of an email. It outputs a pandas Series containing either a 0 or a 1 for each word in the list. The value of the Series should be 0 if the word doesn't appear in the text and 1 if the word does.

In [94]:
def words_in_text(words, text):
    '''
    Args:
        `words` (list of str): words to find
        `text` (str): string to search in
    
    Returns:
        Series containing either 0 or 1 for each word in words
        (0 if the word is not in text, 1 if the word is).
    '''
    
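    # `word in text` is a substring test, so 'bank' also matches 'bankruptcy'
    indicators = [1 if word in text else 0 for word in words]
    # Return a Series, as the docstring promises, indexed by the words themselves
    return pd.Series(indicators, index=words)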

# If these don't error, your function outputs the correct output for these examples
assert np.allclose(words_in_text(['hello'], 'hello world'),
                   [1])
assert np.allclose(words_in_text(['hello', 'bye', 'world'], 'hello world hello'),
                   [1, 0, 1])

### Question 3

Now, create a function called `words_in_texts` that takes in a list of words and a pandas Series of email texts. It should output a 2-dimensional NumPy matrix containing one row for each email text. The row should contain the output of `words_in_text` for each example. For example:

```python
>>> words_in_texts(['hello', 'bye', 'world'], pd.Series(['hello', 'hello world hello']))
array([[1, 0, 0],
       [1, 0, 1]])
```

You should be able to use the `.apply` and `.as_matrix` functions to implement this. (In pandas 1.0 and later, `.as_matrix` has been removed; use `.to_numpy()` or `.values` instead.)

In [95]:
def words_in_texts(words, texts):
    '''
    Args:
        `words` (list of str): words to find
        `texts` (Series of str): strings to search in
    
    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    lst = []
    for text in texts:
        lst = np.append(lst, words_in_text(words, text)) 
    return lst.reshape(len(texts), len(words))

# If these don't error, your function outputs the correct output for these examples
assert np.allclose(words_in_texts(['hello', 'bye', 'world'], pd.Series(['hello', 'hello world hello'])),
                   np.array([[1, 0, 0], [1, 0, 1]]))

## Classification

Notice that the output of `words_in_texts` is a numeric matrix containing features for each email. This means we can use it directly to train a classifier.

### Question 4

We've given you 5 words that might be useful as features to distinguish spam/ham emails. Use these words as well as the `train` DataFrame to create two NumPy arrays: `X_train` and `y_train`.

`X_train` should be a matrix of 0s and 1s created by using your `words_in_texts` function on all the emails in the training set.

`y_train` should be a vector of the correct labels for each email in the training set.

In [96]:
some_words = ['drug', 'bank', 'prescription', 'memo', 'private']

X_train = words_in_texts(some_words, train.email)
y_train = [x for x in train['spam']]

X_train[:5], y_train[:5]

(array([[ 0.,  1.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  1.],
        [ 0.,  0.,  0.,  0.,  0.]]), [0, 0, 1, 0, 0])
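
Note that the first row above has a 1 in the `bank` column. Judging from the printed text of that email, this appears to come from the word "bankruptcy", since `word in text` tests for substrings rather than whole words. The sketch below shows one possible whole-word alternative using regular-expression word boundaries. The helper name `whole_words_in_text` is hypothetical and is not part of the assignment.

```python
import re

def whole_words_in_text(words, text):
    """Return a list of 0/1 flags, one per word, matching whole words only."""
    flags = []
    for word in words:
        # \b marks a word boundary, so 'bank' no longer matches inside 'bankruptcy'.
        pattern = r"\b" + re.escape(word) + r"\b"
        flags.append(1 if re.search(pattern, text) else 0)
    return flags

sample = "due to the bankruptcy of its parent company"
substring_flags = [1 if w in sample else 0 for w in ["bank", "company"]]
whole_word_flags = whole_words_in_text(["bank", "company"], sample)
print(substring_flags)   # [1, 1]
print(whole_word_flags)  # [0, 1]
```

Whether substring or whole-word matching makes the better feature is an empirical question; substring matching also catches variants such as plurals.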

### Question 5

Now we have matrices we can give to scikit-learn! Using the [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier, train a logistic regression model using `X_train` and `y_train`. Then, output the accuracy of the model in the cell below. You should get an accuracy of around 0.7557.

In [97]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss
clf = LogisticRegression()
clf.fit(X_train, y_train)
Y_hat = clf.predict(X_train)
error = zero_one_loss(y_train, Y_hat)
accuracy = 1 - error
accuracy

0.75574988021082889

### Question 6

That doesn't seem too shabby! But the classifier you made above isn't as great as you might think. Recall that we have other ways of evaluating a classifier:

*Sensitivity* (also called *recall*) is the rate of true positives; in this case, the proportion of spam emails that are classified as spam.

*Specificity* (also called the *true negative rate*) is the proportion of ham emails that are classified as ham.
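
To make these definitions concrete, here is a minimal sketch that computes them from true and predicted labels in plain Python. The function name `rates` and the toy labels are made up for illustration. It also reports precision, TP / (TP + FP), which is a different quantity from specificity, TN / (TN + FP).

```python
def rates(y_true, y_pred):
    """Compute sensitivity, specificity, and precision for 0/1 labels (1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam classified as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam classified as ham
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham classified as ham
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham classified as spam
    nan = float("nan")  # returned when a denominator is zero
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
    }

# Toy labels: 4 spam emails and 6 ham emails.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
toy_rates = rates(toy_true, toy_pred)
print(toy_rates)
```

In this toy case one of four spam emails is caught (sensitivity 0.25), five of six ham emails are kept (specificity 5/6), and one of the two spam predictions is correct (precision 0.5).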

Answer the following questions in the light blue cell below. You may create other cells for scratch work, but your final answers
must appear in the light blue cell.

0. Suppose we have a classifier that just predicts 0 (ham) for every email. What is its sensitivity? Its specificity?
0. Suppose we have a classifier that just predicts 0 (ham) for every email. What is its accuracy on the training set?
0. Our logistic regression classifier got 75% prediction accuracy (number of correct predictions / total). Why is this a poor accuracy?
0. What is the sensitivity of the logistic regression classifier above? The specificity? What kind of mistake is our classifier more likely to make: false positives or false negatives?
0. Given the word features we gave you above, name one reason this classifier is performing poorly.

In [98]:
len(train)

8348

In [99]:
len(train[train['spam'] == 0])

6208

In [100]:
len(train[train['spam'] == 1])

2140

In [101]:
np.count_nonzero(Y_hat)

375

In [102]:
len(Y_hat)

8348

In [103]:
always_ham = [0] * len(train)

In [104]:
1 - zero_one_loss(y_train, always_ham)

0.74365117393387636

In [105]:
from sklearn.metrics import precision_recall_fscore_support
precision_recall_fscore_support(y_train, Y_hat, average = 'binary')

(0.63466666666666671, 0.11121495327102804, 0.18926441351888668, None)

0. This classifier never predicts spam, so its sensitivity (recall) is 0: no spam email is caught. Its specificity is 1, because every ham email is classified as ham. Its precision is undefined, since it makes no positive predictions.
0. The accuracy of a classifier that always predicts 0 (ham) is 74.37% on the training set, which is simply the proportion of ham emails (6208 / 8348).
0. 75.57% is a poor accuracy because it is only about 1.2 percentage points better than a classifier that predicts ham no matter what.
0. The sensitivity (recall) of the classifier is 0.1112 and its precision is 0.6347. With 375 emails predicted as spam, these figures imply 238 true positives and 137 false positives, so the specificity is about 0.978 (6071 of 6208 ham emails). The classifier is far more likely to make a false negative than a false positive: it misses 1902 of the 2140 spam emails but flags only 137 ham emails as spam.
0. One possible explanation for the poor performance is that the given words are ordinary, related English words with no clear indication that they belong specifically to spam or to ham. This is roughly like asking someone to decide whether something is a fruit after showing them only a collection of orange objects, some of which are fruits and some of which are not.

## Moving Forward

With this in mind, it is now your assignment to make your classifier more accurate. In particular, in order to get full credit on the accuracy part of this assignment, you must get at least **88%** accuracy on the test set. To see your accuracy on the test set, you will use your classifier to predict every email in the `test` DataFrame and upload your predictions to Kaggle.

To prevent you from fitting to the test set, you may only upload predictions to Kaggle twice per day. This means you should start early!

Here are some ideas for improving your model:

1. Finding better features based on the email text. For example, simple features that typically work for emails are:
    1. Number of characters in the subject / body
    1. Number of words in the subject / body
    1. Use of punctuation (e.g., how many '!' were there?)
    1. Number / percentage of capital letters 
    1. Whether or not the email is a reply to an earlier email or a forwarded email. 
    1. Using bag-of-words or [tf-idf](http://www.tfidf.com/).
1. Finding better words to use as features. Which words are the best at distinguishing emails? This requires digging into the email text itself. (To help you out, we've given you a set of [English stopwords](https://www.wikiwand.com/en/Stop_words) in `stopwords.csv`)
1. Better data processing. For example, many emails contain HTML as well as text. You can consider extracting out the text from the HTML to help you find better words. Or, you can match HTML tags themselves, or even some combination of the two.
1. Model selection. You can adjust parameters of your model (e.g., the regularization parameter) to achieve higher accuracy. 
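
A few of the simple features from the first idea can be sketched as follows. The function name `simple_features` and the toy emails are assumptions made for this example. Because `train['email']` was lower-cased when it was loaded, the capital-letter feature here uses the subject line instead.

```python
def simple_features(subject, body):
    """Return a list of numeric features for one email."""
    # Subjects in this dataset start with a literal "Subject: " prefix; drop it first.
    prefix = "Subject: "
    if subject.startswith(prefix):
        subject = subject[len(prefix):]
    letters = [ch for ch in subject if ch.isalpha()]
    capital_fraction = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return [
        len(body),                                      # characters in the body
        len(body.split()),                              # words in the body
        subject.count("!") + body.count("!"),           # exclamation marks
        capital_fraction,                               # share of capital letters in the subject
        1 if subject.lower().startswith("re:") else 0,  # looks like a reply
    ]

# Toy emails, invented for illustration.
shouty = simple_features("Subject: WORK FROM HOME\n", "apply today! earn more now!")
reply = simple_features("Subject: Re: lunch\n", "sounds good")
print(shouty)
print(reply)
```

Each returned list is one row of $X$; stacking the rows for every email gives a matrix that can be passed to `LogisticRegression` alongside, or instead of, the word indicators.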

Recall that you should use cross-validation to do feature and model selection properly! Otherwise, you will likely overfit to your training data.
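
As a reminder of how k-fold cross-validation works, here is a minimal sketch in plain Python. The names `k_fold_indices`, `cross_val_accuracy`, `fit_majority`, and `predict_majority` are hypothetical, and the majority-class "model" is only a placeholder. In the notebook, `fit` would instead train a `LogisticRegression` on your chosen features and regularization setting.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of nearly equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(X, y, k, fit, predict):
    """Average held-out accuracy over k folds.

    `fit(X_train, y_train)` returns a model; `predict(model, X_val)` returns labels.
    """
    accuracies = []
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in fold])
        correct = sum(p == y[i] for p, i in zip(preds, fold))
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k

# Placeholder model: always predict the majority class seen during training.
def fit_majority(X_train, y_train):
    return 1 if sum(y_train) * 2 > len(y_train) else 0

def predict_majority(model, X_val):
    return [model] * len(X_val)

toy_X = [[0], [1], [0], [1], [0], [1]]
toy_y = [0, 0, 0, 1, 0, 0]
cv_accuracy = cross_val_accuracy(toy_X, toy_y, 3, fit_majority, predict_majority)
print(cv_accuracy)
```

The folds here are contiguous for simplicity; shuffling the rows first is usually advisable. Comparing this held-out accuracy across candidate feature sets or regularization strengths, rather than comparing training accuracy, is what guards against overfitting.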

You may use whatever method you prefer in order to create features. However, we want to make it fair to students who are seeing these techniques for the first time.  As such, **you are only allowed to train logistic regression models and their regularized forms**. This means no random forest, k-nearest-neighbors, neural nets, etc.

We will not give you a code skeleton to do this, so feel free to create as many cells as you need in order to tackle this task. However, you should show us your process as outlined here:

### Feature/Model Selection Process

In the following cell, describe the process of improving your model. You should use at least 2-3 sentences each to address the following questions:

1. How did you find better features for your model?
2. What did you try that worked / didn't work?
3. What was surprising in your search for good features?

1. *Write your answer here, replacing this text.*
1. *Write your answer here, replacing this text.*
1. *Write your answer here, replacing this text.*

### EDA

In the four light blue cells below, show us four different visualizations that you used to select features for your model. Each cell should output:

1. A plot showing something meaningful about the data that helped you during feature / model selection.
2. 2-3 sentences describing what you plotted and what its implications are for your features.

Feel free to create as many plots as you want in your process of feature selection, but select four interesting ones for the cells below.

You should not show us more than one visualization for the same type of feature. For example, don't show us a bar chart of the number of emails that contain the word "hello" and a bar chart of the number of emails that contain the word "world". Each visualization should be conceptually distinct.

In [723]:
stopwords_df = pd.read_csv('stopwords.csv', sep=',',header=None)
stopwords = set(stopwords_df[0].values)

In [724]:
import string
import re
def unique_word_frequency(string_series):
    '''Fraction of texts that contain each non-stopword word of 3 to 15 characters.'''
    word_dict = {}
    freq = {}
    length = string_series.size
    for ind_string in string_series:
        # Strip HTML tags, then tokens containing digits, then punctuation
        no_tags = re.sub(r'<[^<]+?>', ' ', ind_string.lower())
        no_digits = re.sub(r'\w*\d\w*', ' ', no_tags)
        no_punct = no_digits.translate(str.maketrans('', '', string.punctuation))
        # Use a set so that each word counts at most once per text
        for word in set(no_punct.split()):
            if word not in stopwords:
                word_dict[word] = word_dict.get(word, 0) + 1
    for key, count in word_dict.items():
        if 2 < len(key) < 16:
            freq[key] = count / length

    return freq

In [725]:
ham_freq = unique_word_frequency(train[train['spam'] == 0]['email'])
spam_freq = unique_word_frequency(train[train['spam'] == 1]['email'])

In [726]:
def naive_bayes_spam_scorer(ham_dict, spam_dict):
    word_score_dict = {}
    for key in spam_dict:
        word_given_spam = spam_dict[key]
        word_given_ham = 0
        if key in ham_dict:
            word_given_ham = ham_dict[key]
        spam_score = word_given_spam / (word_given_spam + word_given_ham)
        word_score_dict[key] = spam_score
    return word_score_dict
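
The score computed above is the ratio $P(\text{word} \mid \text{spam}) / (P(\text{word} \mid \text{spam}) + P(\text{word} \mid \text{ham}))$. It can be read as $P(\text{spam} \mid \text{word})$ only under the assumption of equal class priors, which does not hold for this training set (2140 spam versus 6208 ham). A small worked example with made-up frequencies:

```python
def spam_score(word_given_spam, word_given_ham):
    """Share of a word's total (spam + ham) frequency that comes from spam."""
    return word_given_spam / (word_given_spam + word_given_ham)

# Made-up per-class frequencies: the fraction of emails in each class containing the word.
toy_spam_freq = {"viagra": 0.05, "free": 0.30, "meeting": 0.01}
toy_ham_freq = {"free": 0.10, "meeting": 0.09}

toy_scores = {
    word: spam_score(freq, toy_ham_freq.get(word, 0))
    for word, freq in toy_spam_freq.items()
}
for word, score in sorted(toy_scores.items(), key=lambda item: item[1]):
    print(word, round(score, 3))
```

A word that never appears in ham always scores exactly 1 regardless of how rare it is in spam, which is one reason to filter out scores equal to 1 before ranking words.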

In [727]:
nb_score = naive_bayes_spam_scorer(ham_freq, spam_freq)

In [728]:
scores = pd.DataFrame.from_dict(sorted(nb_score.items(), key=lambda x:x[1]))
scores = scores[scores[1] < 1]
#scores = scores[scores[1] > 0.95]
#scores.tail(100)

In [729]:
nb_score_ham = naive_bayes_spam_scorer(spam_freq, ham_freq)

In [730]:
scores_ham = pd.DataFrame.from_dict(sorted(nb_score_ham.items(), key=lambda x:x[1]))
scores_ham = scores_ham[scores_ham[1] < 1]
#scores_ham = scores_ham[scores_ham[1] > 0.95]
#scores_ham.tail(100)

In [731]:
def remove_punctuations(text):
    return text.translate(str.maketrans('','',string.punctuation))

In [732]:
def conc(text_array):
    return ' '.join(text_array)

In [733]:
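# Strip HTML tags and digit-bearing tokens, drop punctuation, collapse whitespace.
# regex=True is explicit because newer pandas treats the pattern as a literal by default.
train['email_smooth'] = (
    train['email'].str.lower()
    .str.replace(r'<[^<]+?>', ' ', regex=True)
    .str.replace(r'\w*\d\w*', ' ', regex=True)
    .apply(remove_punctuations)
    .str.split()
    .apply(conc)
)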


In [734]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words = "english", smooth_idf = False)
X = tfidf.fit_transform(train.email_smooth)
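
For intuition about the weights in `X`, the sketch below computes tf-idf by hand using the unsmoothed formula that scikit-learn documents for `smooth_idf=False`, namely idf = ln(n / df) + 1, followed by L2 normalization of each row. It is a simplified illustration, not a reimplementation: it splits on whitespace and skips stop-word removal, so its numbers will not match `TfidfVectorizer` on real emails. The function name `tfidf_rows` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Unsmoothed idf: ln(n / df) + 1, so a term found in every document keeps weight 1.
        weights = {t: c * (math.log(n_docs / doc_freq[t]) + 1) for t, c in counts.items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

toy_docs = ["free money free", "meeting notes", "free meeting"]
toy_rows = tfidf_rows(toy_docs)
for row in toy_rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that are frequent within an email but rare across emails receive the largest weights, which is why tf-idf can serve as a richer alternative to 0/1 word indicators.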

In [735]:
#scores.to_csv('spam_email_scores.csv')

In [736]:
#scores_ham.to_csv('ham_email_scores.csv')

In [737]:
spam_array_words = np.array(['requested', 'trading', 'profits', 'shipped', 'fortunate', 'utah', 'ohio', 'align', 'responded', 'livelihood', 'mccoy', 'drawings', 'imagined', 'savvy', 'seminars', 'respectability', 'trump', 'richmond', 'ini', 'informix', 'romania', 'labeling', 'cor', 'eligibility', 'occassion', 'calvin', 'meyers', 'obligations', 'earnest', 'webs', 'redesign', 'pounds', 'ally', 'ions', 'materially', 'specialized', 'reward', 'melt', 'sweets', 'deducted', 'dayton', 'earns', 'sunrise', 'labelled', 'shocking', 'backpack', 'quicker', 'resellers', 'vegas', 'casey', 'thanx', 'boyfriend', 'lion', 'diabetic', 'omen', 'swiftly', 'demise', 'specialty', 'unheard', 'delighted', 'instore', 'unforgettable', 'coli', 'deferred', 'deliveries', 'wines', 'waiters', 'marble', 'gigantic', 'turks', 'pierre', 'caledonia', 'nepal', 'marina', 'negotiating', 'poker', 'organize', 'webcam', 'notices', 'expire', 'cram', 'jelly', 'officejet', 'ixs', 'lump', 'xerox', 'nbspfor', 'catherine', 'dvdr', 'amateurs', 'sanitary', 'quotin', 'enhance', 'iid', 'iiip', 'iip', 'sixteen', 'bedside', 'diets', 'magnifying', 'ethics', 'quicken', 'duplication', 'shipment', 'graduate', 'construct', 'necessities', 'tuition', 'booming', 'unblocked', 'tidal', 'respectfully', 'roth', 'informationquot', 'appologise', 'thicker', 'decreases', 'hisher', 'juillet', 'modifiable', 'cela', 'hugo', 'denis', 'septembre', 'votre', 'quip', 'dun', 'plateforme', 'detective', 'antiques', 'multiplayer', 'shuffling', 'porno', 'ling', 'perpetrating', 'misconceptions', 'elders', 'neighborhoods', 'synonymous', 'reemerged', 'brethren', 'embarrassment', 'contingency', 'inhabit', 'negativity', 'thirtythree', 'mandatory', 'selfishness', 'wishing', 'voicestream', 'laundering', 'imprisonment', 'playstation', 'compresses', 'roundtrip', 'recycling', 'tang', 'comm', 'tien', 'jacqueline', 'landlord', 'unjustified', 'preprogrammed', 'mohammed', 'celeb', 'applegate', 'dipping', 'exhibits', 'layouts', 'unblocking', 'backgrounds', 
'hindering', 'wallets', 'prudential', 'disability', 'redeem', 'redeemable', 'holland', 'multiplatform', 'sacred', 'erroneous', 'burners', 'selfesteem', 'intensity', 'maui', 'promotes', 'hybrid', 'anxiety', 'orgasm', 'partake', 'implication', 'inquire', 'calmer', 'conferencing', 'dana', 'clan', 'vise', 'apy', 'heal', 'fatigue', 'rests', 'guage', 'visits', 'lighting', 'girth', 'knocking', 'heures', 'calltime', 'ahi', 'ada', 'shortening', 'informa', 'fcfd', 'waived', 'broadway', 'thestreetcom', 'dialling', 'judicial', 'acquaintance', 'owne', 'confines', 'mutually', 'moles', 'pobox', 'cupertino', 'thames', 'internationally', 'gre', 'crowe', 'flyer', 'nocost', 'ceaseanddesist', 'reelection', 'unthinkable', 'presidentceo', 'dormitory', 'parkinsons', 'maximizer', 'survivors', 'workathome', 'bordertop', 'borderright', 'iiutaintorg', 'borderbottom', 'sharpen', 'monte', 'stamps', 'yada', 'direkt', 'dsseldorf', 'warsaw', 'mohammad', 'diagnosis', 'fountain', 'sundays', 'remuneration', 'gel', 'qvc', 'ssg', 'reel', 'disheartening', 'arthritis', 'battles', 'climates', 'tacking', 'sussex', 'naomi', 'forefront', 'nurturing', 'staffed', 'unsatisfied', 'calcium', 'arteries', 'stimulant', 'contractions', 'organs', 'colon', 'sugars', 'testes', 'mugging', 'crowded', 'knives', 'tradeshow', 'exhibitions', 'exercised', 'accountants', 'mems', 'psp', 'intervened', 'anne', 'haben', 'allergies', 'cons', 'jubii', 'fireball', 'suchen', 'austrian', 'oomph', 'messengers', 'natasha', 'physician', 'msword', 'rea', 'alloy', 'pcp', 'splendor', 'baggage', 'modesty', 'waiver', 'ailable', 'garrison', 'depot', 'appreciates', 'upc', 'touchscreen', 'woodworking', 'poss', 'collectible', 'treadmill', 'callcenter', 'crw', 'interne', 'establishes', 'brewed', 'searchable', 'interfere', 'reprisal', 'hindrance', 'neu', 'resin', 'radiant', 'foam', 'otto', 'phaseout', 'middleton', 'extr', 'offer', 'loss', 'telephone', 'contacts', 'thank', 'financial', 'lose', 'congratulations', 'professionals', 'predict', 'lawful', 
'advisor', 'internets', 'teen', 'relax', 'shall', 'fraction', 'rates', 'formula', 'processed', 'comfort', 'newsgroups', 'multipart', 'approved', 'motivated', 'alaska', 'amateur', 'preparation', 'ent', 'mandate', 'superior', 'buyers', 'onetime', 'thirty', 'epson', 'pics', 'fontsize', 'confidential', 'hundreds', 'explosive', 'wyoming', 'tennessee', 'filling', 'inkjet', 'operated', 'prestigious', 'hype', 'sheets', 'unhappy', 'verification', 'ion', 'boundary', 'retail', 'directed', 'lifetime', 'fortune', 'dakota', 'hidden', 'diet', 'luckily', 'entertained', 'litigation', 'merchant', 'proven', 'regarding', 'wrapping', 'concealed', 'valued', 'wholesale', 'risk', 'vermont', 'surgery', 'refreshing', 'classmates', 'associate', 'kingdom', 'handbook', 'stroke', 'exp', 'formulas', 'projection', 'draws', 'blonde', 'opt', 'gambling', 'nbsp', 'unsolicited', 'eliminated', 'estate', 'signup', 'housing', 'remove', 'offers', 'illinois', 'guarantees', 'auto', 'inhouse', 'toners', 'invaluable', 'void', 'bank', 'http', 'millionaires', 'marketer', 'communicator', 'singles', 'quickest', 'adviser', 'traced', 'bless', 'puerto', 'nude', 'specifics', 'arial', 'sought', 'slice', 'colony', 'gen', 'humbly', 'vista', 'magnificent', 'fund', 'easytouse', 'com', 'spouse', 'par', 'repairs', 'absolutely', 'xxx', 'entitled', 'residents', 'wisconsin', 'ezine', 'authorize', 'contracts', 'qualification', 'inclusive', 'wit', 'indexhtml', 'colleagues', 'orders', 'envelope', 'employment', 'fred', 'traps', 'custody', 'savings', 'lessons', 'improvement', 'websites', 'texthtml', 'dear', 'nbspnbspnbsp', 'endowed', 'inspires', 'sluts', 'cock', 'spray', 'utilizing', 'serenity', 'recieve', 'ect', 'verifiable', 'receptive', 'disposal', 'cafeteria', 'paving', 'snippet', 'stuffy', 'retire', 'quebec', 'freak', 'gourmet', 'bankers', 'hungary', 'barbados', 'cape', 'slovakia', 'bermuda', 'slovenia', 'zurich', 'apprehensive', 'servernbsp', 'disconnected', 'simulates', 'midsize', 'iceberg', 'rape', 'smalltime', 'grind', 
'retails', 'issuer', 'yearly', 'plight', 'convey', 'turbo', 'omissions', 'voted', 'webmasters', 'eyeopening', 'gentle', 'imagegif', 'vic', 'testdrive', 'commute', 'doctorate', 'physiological', 'amazes', 'lebanese', 'sinew', 'acclaimed', 'unserem', 'unanimously', 'exile', 'interior', 'impartial', 'seas', 'affiliated', 'fontfamily', 'affordable', 'amazed', 'accepted', 'advertisement', 'exercise', 'charsetiso', 'beneficial', 'instruments', 'safely', 'urgent', 'shipping', 'louisiana', 'multi', 'banners', 'pledge', 'momentum', 'interrupted', 'obtaining', 'sensitivity', 'integral', 'investigated', 'advertisements', 'traded', 'subscriber', 'investment', 'satisfaction', 'proposal', 'extracted', 'fourteen', 'congestion', 'bedroom', 'nbspand', 'adults', 'schoolgirls', 'deceased', 'vip', 'conditional', 'residing', 'identifies', 'solicitation', 'interacting', 'exhibition', 'strongly', 'trillion', 'giveaway', 'opportunity', 'ordering', 'promotions', 'satisfied', 'anytime', 'protective', 'bsp', 'investigative', 'extracting', 'weapon', 'vos', 'tablets', 'refund', 'incredible', 'reply', 'ffffff', 'jennifer', 'compete', 'visitors', 'favour', 'procedures', 'cabinet', 'purchase', 'overnight', 'bonus', 'appetite', 'recruiting', 'gamble', 'repair', 'magazines', 'mailin', 'zealand', 'zip', 'jersey', 'warranty', 'emails', 'qualify', 'sum', 'cash', 'valuable', 'rhode', 'hawaii', 'kansas', 'cede', 'exploding', 'chemically', 'guidelines', 'ers', 'manuals', 'awe', 'clic', 'multilingual', 'awaiting', 'clicking', 'exchanging', 'legality', 'enjoyable', 'disregard', 'mil', 'supplements', 'wellbeing', 'ammunition', 'bas', 'analyzes', 'accrued', 'paraguay', 'fiji', 'bangladesh', 'botswana', 'saint', 'norfolk', 'somalia', 'lebanon', 'ergonomic', 'naughty', 'invoicing', 'complies', 'peln', 'ime', 'smut', 'manufacture', 'nbspwe', 'mines', 'sansserif', 'ven', 'gua', 'hav', 'systemnbsp', 'constructing', 'simplest', 'unpredictable', 'tutoring', 'destiny', 'fulfill', 'ambition', 'finalized', 'pendant', 
'pouvez', 'contrat', 'firmer', 'planners', 'unlock', 'overworked', 'eve', 'drowning', 'contracting', 'collectors', 'natures', 'lin', 'photographic', 'ahebaxeb', 'passports', 'mlms', 'lopez', 'angioplasty', 'invasive', 'transplant', 'sunglasses', 'flown', 'helicopters', 'newark', 'gum', 'pheromone', 'dosages', 'cortex', 'extravagantly', 'concurrently', 'fructus', 'issuance', 'prescriptions', 'organism', 'billie', 'aaliyah', 'leisure', 'personals', 'turk', 'mugabe', 'embassy', 'aad', 'mth', 'proportions', 'peers', 'parttime', 'prnewswire', 'hourly', 'embark', 'relocating', 'ali', 'pls', 'factored', 'stoner', 'shun', 'brutal', 'sybase', 'foxpro', 'cotton', 'corey', 'wnt', 'vrs', 'ausgabe', 'dressing', 'headset', 'cuz', 'inlaid', 'contravention', 'rmi', 'exert', 'holdover', 'prevailed', 'interred', 'denounced', 'timehonored', 'sanctuaries', 'nineteenth', 'steadfastly', 'terminates', 'proclaiming', 'tacitly', 'reprehensible', 'unchallenged', 'godgiven', 'ceremonies', 'defamation', 'atoll', 'contemplated', 'implausible', 'embodied', 'turtles', 'responds', 'deception', 'remnant', 'bibliography', 'evasion', 'cognizant', 'refugees', 'maliciously', 'apathy', 'unobstructed', 'disputed', 'insatiable', 'raison', 'hereditary', 'ominous', 'heretofore', 'pantry', 'belligerence', 'uninhabited', 'proudly', 'inflammatory', 'predates', 'monarchy', 'alchemy', 'parting', 'unjustifiably', 'qualifications', 'millionaire', 'inexpensive', 'susan', 'warehouse', 'lending', 'wish', 'transaction', 'prey', 'placing', 'stepbystep', 'helvetica', 'transmissions', 'presentations', 'invest', 'weight', 'dreaming', 'lowering', 'owning', 'shocked', 'commission', 'creditors', 'emailed', 'muscle', 'lowcost', 'prescribed', 'treasury', 'succeeding', 'gifts', 'ver', 'pharmaceutical', 'dollars', 'guarantee', 'miami', 'blaster', 'sierra', 'cigarettes', 'fantasies', 'reps', 'plates', 'profession', 'impulse', 'insured', 'concealing', 'marketing', 'communication', 'instruction', 'destructive', 'obtained', 
'inform', 'credit', 'fundamentals', 'stylus', 'pill', 'income', 'bull', 'historically', 'awarded', 'greedy', 'inquiries', 'reduction', 'exceed', 'scholarship', 'endeavor', 'pencil', 'sized', 'defective', 'snoop', 'seventy', 'enhancers', 'stimulate', 'midwest', 'nbspa', 'crawl', 'coral', 'hormone', 'billed', 'filthy', 'postman', 'jewelry', 'attitudes', 'booster', 'lifes', 'bio', 'unconditional', 'phoenix', 'sunk', 'martial', 'cerebral', 'cpa', 'diagnose', 'unwanted', 'packard', 'receive', 'promptly', 'diplomatic', 'consolidation', 'debt', 'unlimited', 'arkansas', 'housekeepers', 'jour', 'mandated', 'cuttingedge', 'removed', 'click', 'classified', 'visa', 'platinum', 'marketers', 'soliciting', 'categorized', 'lowest', 'ref', 'equity', 'introductory', 'honored', 'medical', 'expenses', 'teasing', 'corny', 'esq', 'inflation', 'recession', 'reprint', 'dorm', 'reconsideration', 'quotas', 'aerospace', 'obey', 'substances', 'envelops', 'rode', 'closet', 'approximate', 'lubricant', 'recognizing', 'tailored', 'ins', 'quotthe', 'governmentthe', 'disadvantages', 'retina', 'reunion', 'caicos', 'kuwait', 'domination', 'babes', 'poors', 'petite', 'para', 'por', 'mas', 'creditor', 'humiliate', 'magenta', 'predicament', 'offsetting', 'fumble', 'shortest', 'recieved', 'sest', 'devis', 'ils', 'personnes', 'extranet', 'parcs', 'portail', 'patron', 'ensemble', 'axe', 'nationale', 'galement', 'dirigeant', 'nombreux', 'nurtured', 'exclusion', 'factoring', 'logistics', 'cheque', 'servant', 'bspnbsp', 'celebrities', 'smoked', 'insomnia', 'responsiveness', 'brewing', 'lust', 'cannabis', 'suppression', 'communion', 'anybodys', 'ana', 'rad', 'healing', 'gland', 'receivables', 'remit', 'relieve', 'dominion', 'bae', 'shorten', 'foregoing', 'persecution', 'incarceration', 'diners', 'superfast', 'zerocost', 'informat', 'mugabes', 'adress', 'rejuvenate', 'samplers', 'nausea', 'tin', 'exhibitors', 'handsomely', 'rebel', 'tnt', 'embraced', 'afb', 'elevated', 'monthly', 'traders', 'blueprint', 
'checklist', 'preschool', 'webmaketalk', 'uncommon', 'participate', 'ext', 'dealer', 'earning', 'exciting', 'therapy', 'fees', 'payments', 'opted', 'incurred', 'screening', 'dreamed', 'nevada', 'interruption', 'virtue', 'disappearance', 'procurement', 'prospective', 'tremendous', 'est', 'deposited', 'proved', 'unbelievable', 'nationwide', 'payment', 'premium', 'mitchell', 'correspondence', 'promotion', 'bills', 'ingredients', 'loosing', 'georgia', 'multibillion', 'merchandise', 'billing', 'smokes', 'lace', 'reputable', 'uce', 'rem', 'automobile', 'contentid', 'cordially', 'formulation', 'utilized', 'vacations', 'dragon', 'toll', 'advised', 'cartridge', 'obligation', 'excluded', 'sincerely', 'affiliate', 'alink', 'apologise', 'researching', 'viagra', 'href', 'riskfree', 'specialists', 'usd', 'expiration', 'permanently', 'rental', 'receipt', 'inconvenience', 'cooperation', 'upfront', 'bordercolor', 'jam', 'blvd', 'participation', 'reclaim', 'invests', 'nownbsp', 'mortgages', 'wiretaps', 'trinidad', 'intimidate', 'juncture', 'gro', 'visor', 'mania', 'harvested', 'upandrunning', 'ses', 'donn', 'deux', 'systme', 'cas', 'approvals', 'vcd', 'owing', 'artery', 'coronary', 'enrollment', 'lungs', 'hawaiian', 'ther', 'youthful', 'tits', 'rancho', 'unqualified', 'ber', 'downsizing', 'pastor', 'discounted', 'quotwe', 'frankie', 'sweetness', 'spa', 'eed', 'faqs', 'deeds', 'cleansing', 'earn', 'loan', 'presently', 'adult', 'insurance', 'bureau', 'hewlett', 'alberta', 'breasts', 'advertisers', 'discoveries', 'hormones', 'postage', 'montana', 'economical', 'hyperlinks', 'safekeeping', 'contracted', 'mammoth', 'dental', 'commence', 'tomorrows', 'lease', 'lodge', 'dysfunction', 'secrets', 'qualified', 'sleep', 'pride', 'christian', 'mailto', 'bonuses', 'dare', 'assist', 'exceedingly', 'instructed', 'surrender', 'resumes', 'specialize', 'dialing', 'surcharges', 'formulated', 'referral', 'heighten', 'currencies', 'overweight', 'transcripts', 'discrete', 'bahamas', 'rake', 'savenbsp', 
'epl', 'settlements', 'eligible', 'containers', 'himher', 'glossary', 'childcare', 'fares', 'checklists', 'underemployed', 'anxious', 'stringent', 'emailer', 'payperview', 'nous', 'entre', 'avant', 'tous', 'exwife', 'indictments', 'incidental', 'favorable', 'imagejpeg', 'specializing', 'faxing', 'num', 'dealers', 'heavenly', 'mellow', 'twentyone', 'lauderdale', 'dependable', 'coltd', 'winwin', 'orton', 'staggering', 'homeowners', 'cli', 'fortunes', 'sir', 'promotional', 'hobby', 'residual', 'penis', 'forwardlooking', 'bra', 'qualifying', 'belongings', 'ove', 'apologize', 'deposit', 'nigeria', 'crammed', 'assistance', 'utmost', 'prescription', 'consultation', 'idaho', 'explode', 'mentor', 'haunt', 'cholesterol', 'ies', 'sev', 'delegated', 'medication', 'showcases', 'lean', 'associations', 'destinations', 'herb', 'attracting', 'cart', 'bachelors', 'postmarked', 'infact', 'borderleft', 'blessings', 'quotation', 'financially', 'optin', 'toner', 'unlisted', 'profitable', 'amex', 'sirmadam', 'negotiate', 'physicians', 'strictest', 'groceries', 'loans', 'unsubscribed', 'accountant', 'topoftheline', 'tobacco', 'cbs', 'motivating', 'richer', 'commissions', 'fedex', 'fornbsp', 'offshore', 'futur', 'healthier', 'reciept', 'zzzzexamplecom', 'looseleaf', 'oneofakind', 'currency', 'quot', 'condone', 'laurent', 'avisited', 'removal', 'compliance', 'ministry', 'supervision', 'cote', 'envelopes', 'casino', 'brokers', 'thermal', 'capitalize', 'secretly', 'tournaments', 'professions', 'amalgamated', 'botanical', 'estates', 'cellpadding', 'cellspacing', 'illegality', 'ahover', 'refining', 'unclaimed', 'laserjet', 'blanks', 'guaranteed', 'seeker', 'debts', 'itnbsp', 'tout', 'debtors', 'deposits', 'removehtml', 'lesbian', 'charset', 'nbc', 'reversing', 'erase', 'teens', 'underline', 'aux', 'ger', 'nos', 'noninvasive', 'postal', 'suv', 'attn', 'pam', 'profiled', 'beneficiary', 'requesting', 'prohibiting', 'lbs', 'climax', 'ratios', 'tenants', 'nurse', 'gasoline', 'repaid', 'counseling', 
'stimulating', 'specials', 'thi', 'affiliates', 'wrinkles', 'bodys', 'barrister', 'erection', 'daycare', 'appliances', 'hardcore', 'aactive', 'nationally', 'supplement', 'athletes', 'leasing', 'stamina', 'confidentiality', 'untitled', 'urgently', 'legible', 'sur', 'miracle', 'invitations', 'personalized', 'ofcourse', 'diagnostics', 'extracts', 'astonishment', 'icann', 'honesty', 'aging', 'familys', 'kindly', 'overlook', 'reentering', 'dqog', 'substance', 'spouting', 'homeowner', 'aolcom', 'solicit', 'mlm', 'resell', 'professionally', 'modalities', 'factual', 'dqogicag', 'tollfree', 'madam', 'potency', 'lis', 'systemworks', 'paperwork', 'mastercard', 'originator', 'consolidate', 'charsetwindows', 'mailings', 'bottles', 'refinance', 'moneyback', 'herbal', 'mortgage', 'optout', 'lenders'])


In [738]:
ham_array_words = np.array(['monday', 'sentence', 'procedure', 'workers', 'stops', 'tagged', 'gap', 'issues', 'integration', 'recording', 'iraq', 'images', 'built', 'knows', 'storage', 'aimed', 'extension', 'animated', 'evidence', 'mirror', 'band', 'alan', 'dynamically', 'alpha', 'strings', 'wells', 'rendered', 'literature', 'weapons', 'keen', 'clutter', 'partly', 'upper', 'acceleration', 'enforce', 'bright', 'functions', 'institute', 'early', 'tells', 'cool', 'downloads', 'hollywood', 'recipe', 'complain', 'retrieving', 'hat', 'interesting', 'fell', 'upgraded', 'chirac', 'scripts', 'stripped', 'drinks', 'firewire', 'deemed', 'italian', 'outlets', 'manipulation', 'bryan', 'formatting', 'dice', 'lan', 'loses', 'storm', 'depressed', 'opponents', 'peer', 'prevented', 'organizing', 'detection', 'variant', 'dominate', 'submissions', 'scientist', 'believer', 'crimes', 'oldest', 'encourages', 'rings', 'glory', 'yamaha', 'bose', 'terror', 'atlantic', 'desert', 'defeat', 'tony', 'wisdom', 'rocks', 'resolution', 'driver', 'adds', 'linked', 'religious', 'shouldnt', 'productive', 'speakers', 'killing', 'theyve', 'runs', 'brian', 'hits', 'myth', 'economist', 'dsa', 'column', 'smoother', 'soldiers', 'cares', 'diseases', 'hub', 'physics', 'micro', 'jokes', 'javamailroot', 'thatll', 'digging', 'deeply', 'username', 'occasionally', 'mad', 'assumed', 'backs', 'van', 'direction', 'devices', 'outlook', 'desktop', 'disks', 'hopefully', 'messageid', 'speaking', 'aaron', 'worm', 'novel', 'shy', 'feed', 'permits', 'football', 'broken', 'swing', 'juniper', 'highlighted', 'biological', 'comparing', 'playback', 'precedence', 'privileges', 'allen', 'adjust', 'flag', 'perfectly', 'configuration', 'turning', 'warned', 'echo', 'holes', 'weak', 'mysql', 'alternatives', 'pilot', 'causing', 'plugin', 'craig', 'running', 'hate', 'trojan', 'composite', 'vague', 'experiment', 'highend', 'destroy', 'survive', 'nets', 'nonetheless', 'fcc', 'mysterious', 'clearing', 'linking', 'thirdparty', 'tokens', 
'technically', 'amd', 'controversial', 'catalogue', 'ships', 'invent', 'blew', 'chair', 'underlying', 'worse', 'url', 'serves', 'setting', 'computing', 'article', 'microsoft', 'fifth', 'trigger', 'judge', 'drag', 'stages', 'liable', 'winter', 'wap', 'hadnt', 'nextgeneration', 'mentioning', 'posted', 'enabled', 'chip', 'apparent', 'creative', 'permitted', 'component', 'barry', 'diff', 'hardware', 'politics', 'slightly', 'copyrighted', 'ross', 'forecast', 'jail', 'assumption', 'surfing', 'gpg', 'purely', 'deployed', 'brilliant', 'complaint', 'incoming', 'forbes', 'veteran', 'inappropriate', 'wider', 'toshiba', 'anger', 'anthony', 'mechanism', 'drivers', 'revision', 'networks', 'shell', 'script', 'altered', 'holy', 'nearby', 'implementing', 'packaged', 'communicating', 'intellectual', 'incentive', 'complaints', 'compromised', 'attempted', 'horror', 'jaguar', 'achievement', 'signals', 'disaster', 'heh', 'external', 'looks', 'developer', 'machines', 'configure', 'fault', 'attacks', 'variations', 'predicted', 'closest', 'supposedly', 'supreme', 'indians', 'mailman', 'bind', 'alike', 'continent', 'senders', 'contributor', 'theyll', 'funny', 'corpus', 'boston', 'lines', 'kick', 'announce', 'sept', 'sun', 'lcd', 'copying', 'jeremy', 'murder', 'concrete', 'physically', 'signatures', 'appeared', 'researchers', 'themes', 'poll', 'microsystems', 'scary', 'cambridge', 'suspect', 'scott', 'upgrade', 'icq', 'suits', 'bars', 'reporters', 'fan', 'sky', 'faced', 'highlighting', 'destroyed', 'conferences', 'eager', 'bay', 'species', 'giants', 'siemens', 'replied', 'jim', 'empire', 'spec', 'sets', 'downloading', 'arguments', 'bush', 'david', 'log', 'jmjmasonorg', 'reliability', 'delayed', 'ports', 'confuse', 'justify', 'affects', 'curve', 'lucas', 'consequences', 'emerged', 'boot', 'ran', 'generally', 'obvious', 'favourite', 'privileged', 'javascript', 'verizon', 'cur', 'stephen', 'rewrite', 'temperature', 'dozen', 'dave', 'picks', 'msn', 'classic', 'peters', 'connecting', 'careers', 
'sharp', 'initiatives', 'slowly', 'jake', 'distributions', 'berkeley', 'muslim', 'combat', 'daughter', 'antiquity', 'ilug', 'character', 'moves', 'supplied', 'ham', 'distributed', 'popup', 'slashdot', 'shoot', 'warner', 'maintaining', 'hurt', 'tshirt', 'pioneers', 'dell', 'laptop', 'paul', 'clicks', 'counts', 'interact', 'nick', 'routing', 'unix', 'suppose', 'annoying', 'cyber', 'structures', 'indias', 'universe', 'jon', 'notebook', 'guess', 'nvidia', 'language', 'editors', 'productivity', 'visual', 'unlikely', 'shaw', 'apparently', 'device', 'tend', 'marriage', 'scale', 'hacking', 'tonight', 'temporary', 'returning', 'optimization', 'ending', 'church', 'displays', 'comic', 'beta', 'installed', 'beating', 'cindy', 'google', 'itll', 'cameras', 'thoughts', 'scope', 'andrew', 'admin', 'climate', 'humans', 'charging', 'robin', 'innovation', 'osdn', 'sucks', 'binary', 'ide', 'reproduce', 'bridge', 'suck', 'apple', 'reboot', 'reads', 'wednesday', 'mount', 'theo', 'mixed', 'objects', 'gordon', 'hosted', 'apache', 'clues', 'odd', 'modules', 'attachments', 'firewalls', 'particularly', 'string', 'fans', 'hes', 'plugins', 'presumably', 'austin', 'sony', 'mainly', 'apples', 'router', 'scenario', 'blind', 'newer', 'ought', 'hightech', 'amendment', 'suns', 'thinks', 'php', 'discussed', 'jack', 'gregory', 'stupid', 'filters', 'layer', 'weather', 'increasingly', 'resubscribe', 'greg', 'gay', 'headers', 'justin', 'architecture', 'characters', 'boomer', 'programmer', 'stage', 'engineer', 'command', 'commands', 'header', 'imho', 'comment', 'jul', 'wasnt', 'clue', 'hal', 'builtin', 'bits', 'sep', 'explains', 'visualize', 'map', 'gnome', 'thu', 'argument', 'kate', 'decline', 'chris', 'stuck', 'default', 'writes', 'wed', 'pointed', 'msg', 'silly', 'kevin', 'depends', 'compile', 'oct', 'tree', 'necessarily', 'bell', 'rebuild', 'module', 'tom', 'fork', 'folders', 'keys', 'tag', 'speech', 'apps', 'rss', 'img', 'install', 'fairly', 'mason', 'score', 'preferences', 'fri', 'yesterday', 'btw', 
'tue', 'aug', 'apt', 'switch', 'cnet', 'adam', 'useless', 'murphy', 'perl', 'gary', 'cheers', 'spamassassin', 'rpm', 'wrote'])


In [739]:
total_array_words = np.append(spam_array_words, ham_array_words)

In [740]:
tf_ind = [tfidf.vocabulary_[word] for word in total_array_words]
tf_select = X[:, tf_ind]

In [741]:
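# Fit a logistic regression on the selected TF-IDF columns
clf_dos = LogisticRegression()
clf_dos.fit(tf_select, y_train)

# Predict on the same rows the model was fit on, so this is *training* accuracy
Y_hat = clf_dos.predict(tf_select)
error = zero_one_loss(y_train, Y_hat)  # fraction of misclassified emails
accuracy = 1 - error
accuracy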

0.93483469094393867

In [184]:
# This is the first graded EDA cell

In [175]:
# This is the second graded EDA cell

In [None]:
# This is the third graded EDA cell

In [None]:
# This is the fourth graded EDA cell

### Making an ROC Curve

It turns out that there's a tradeoff between sensitivity and specificity. In most cases we won't be able to get perfect sensitivity and specificity, so we have to choose which of the two we value more. For example, in the case of cancer screenings we value sensitivity more because false negatives are comparatively worse than false positives: a false negative means that a patient might not discover a disease until it's too late to treat, while a false positive means that a patient will probably have to take another screening.

Recall that logistic regression calculates the probability that an example belongs to a certain class. By default, we classify an email as spam if our classifier gives it a probability of at least 0.5 of being spam. However, we can adjust that cutoff: we can say that an email is spam only if our classifier gives it a probability of at least 0.7 of being spam, for example. Raising the cutoff means fewer emails get flagged as spam, which reduces false positives (higher specificity) but misses more actual spam (lower sensitivity). This is how we can trade off sensitivity and specificity.
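To make the effect of the cutoff concrete, here is a minimal sketch on synthetic data. The names `X_demo`, `y_demo`, `demo_clf`, and `sensitivity_specificity` are made up for this illustration and are not part of the project; with your own classifier you would substitute your feature matrix and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (an assumption for illustration only):
# a single feature, with label 1 = spam and 0 = ham.
rng = np.random.default_rng(0)
X_demo = np.concatenate([rng.normal(-1, 1, 200), rng.normal(1, 1, 200)]).reshape(-1, 1)
y_demo = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

demo_clf = LogisticRegression().fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1 (spam).
spam_prob = demo_clf.predict_proba(X_demo)[:, 1]

def sensitivity_specificity(y_true, prob, cutoff):
    """Classify as spam when prob >= cutoff, then return (sensitivity, specificity)."""
    pred = (prob >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

for cutoff in (0.5, 0.7):
    sens, spec = sensitivity_specificity(y_demo, spam_prob, cutoff)
    print("cutoff={:.1f}  sensitivity={:.3f}  specificity={:.3f}".format(cutoff, sens, spec))
```

Moving the cutoff from 0.5 to 0.7 can only remove emails from the predicted-spam set, so sensitivity cannot go up and specificity cannot go down.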

The ROC (receiver operating characteristic) curve shows this tradeoff for each possible cutoff probability. We will discuss this during lecture, and you can also read [this blog post for more information](https://www.theanalysisfactor.com/what-is-an-roc-curve/).
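As a rough sketch of how the curve's points are computed, the example below calls `roc_curve` on synthetic data. The data and the names `X_demo`, `y_demo`, and `demo_clf` are assumptions for illustration only; this is not your final classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-in data (illustration only): label 1 = spam, 0 = ham.
rng = np.random.default_rng(0)
X_demo = np.concatenate([rng.normal(-1, 1, 200), rng.normal(1, 1, 200)]).reshape(-1, 1)
y_demo = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

demo_clf = LogisticRegression().fit(X_demo, y_demo)
spam_prob = demo_clf.predict_proba(X_demo)[:, 1]

# roc_curve sweeps over cutoffs and returns, for each one, the false positive
# rate (1 - specificity) and the true positive rate (sensitivity).
fpr, tpr, thresholds = roc_curve(y_demo, spam_prob)
auc = roc_auc_score(y_demo, spam_prob)

print("number of cutoffs evaluated:", len(thresholds))
print("area under the ROC curve: {:.3f}".format(auc))
```

Each `(fpr[i], tpr[i])` pair is one point on the curve, so passing `fpr` and `tpr` to a line plot (false positive rate on the x-axis, true positive rate on the y-axis) draws the ROC curve.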

In the light blue cell below, plot the ROC curve for your final classifier (the one you use to make predictions for Kaggle).

In [None]:
from sklearn.metrics import roc_curve

# Note that you'll want to use the .predict_proba(...) method for your classifier
# instead of .predict(...) so you get probabilities, not classes

### Submitting to Kaggle

The following code will write your predictions on the test dataset to a CSV, which you can submit to Kaggle. You may need to modify it to suit your needs.

The code below assumes that you've saved your predictions in a 1-dimensional array called `test_predictions`.

Remember that if you've performed transformations or featurization on the training data, you must also perform the same transformations on the test data in order to make predictions. For example, if you've created features for the words "drug" and "money" on the training data, you must also extract the same features from the test data, in the same column order, before you can use scikit-learn's `.predict(...)` method.
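One way to keep the training and test features aligned is to fit the featurizer once on the training data and reuse it on the test data. The sketch below assumes a `TfidfVectorizer` and uses tiny made-up emails; `train_emails`, `test_emails`, and the other names are hypothetical stand-ins for your own data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus (illustration only): label 1 = spam, 0 = ham.
train_emails = [
    "cheap drug offer click now",
    "earn money fast guaranteed",
    "meeting notes attached for monday",
    "patch for the perl module is posted",
]
train_labels = np.array([1, 1, 0, 0])
test_emails = ["money back guaranteed offer", "notes on the perl patch"]

# Learn the vocabulary and IDF weights from the training data only.
vectorizer = TfidfVectorizer()
X_train_demo = vectorizer.fit_transform(train_emails)

# Reuse the fitted vectorizer: call transform, NOT fit_transform, so the test
# matrix has the same columns in the same order as the training matrix.
X_test_demo = vectorizer.transform(test_emails)

demo_model = LogisticRegression().fit(X_train_demo, train_labels)
demo_predictions = demo_model.predict(X_test_demo)

print(X_train_demo.shape, X_test_demo.shape)
```

Calling `fit_transform` again on the test emails would build a different vocabulary, and the resulting columns would no longer match what the model was trained on.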

You should submit your CSV files to https://www.kaggle.com/t/433a6bca95f94a78a0d2a6e7e8b311c3

In [None]:
from datetime import datetime

# Assuming that your predictions on the test set are stored in a 1-dimensional array called
# test_predictions. Feel free to modify this cell as long as you create a CSV in the right format.
assert isinstance(test_predictions, np.ndarray)
assert test_predictions.shape == (1000, )

submission_df = pd.DataFrame({
    "Id": test['id'], 
    "Class": test_predictions,
}, columns=['Id', 'Class'])

timestamp = datetime.isoformat(datetime.now()).split(".")[0]

submission_df.to_csv("submission_{}.csv".format(timestamp), index=False)
print('Created a CSV file: {}.'.format("submission_{}.csv".format(timestamp)))
print('You may now upload this CSV file to Kaggle for scoring.')

## Submission

Run the cell below to submit your notebook to OkPy:

In [None]:
_ = ok.submit()

Now, run this cell to create a PDF to upload to Gradescope.

In [None]:
!pip install -U gs100
from gs100 import convert
# Change the zoom argument if your font size is too small
convert('proj2.ipynb', num_questions=8, zoom=1)

Make sure to upload your PDF now. Otherwise, your written questions won't be graded.