# Project Spam/Ham Classification
## Classifiers

Data science is a collaborative activity. While you may talk with others about
the project, we ask that you **write your solutions individually**. If you do
discuss the assignments with others please **include their names** at the top
of your notebook.

## Setup

In [242]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style = "whitegrid", 
        color_codes = True,
        font_scale = 1.5)

In [243]:
from utils import fetch_and_cache_gdrive
fetch_and_cache_gdrive('1SCASpLZFKCp2zek-toR3xeKX3DZnBSyp', 'train.csv')
fetch_and_cache_gdrive('1ZDFo9OTF96B5GP2Nzn8P8-AL7CTQXmC0', 'test.csv')

original_training_data = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Convert the emails to lower case as a first step to processing the text
original_training_data['email'] = original_training_data['email'].str.lower()
test['email'] = test['email'].str.lower()

original_training_data.head()

from sklearn.model_selection import train_test_split

train, val = train_test_split(original_training_data, test_size=0.1, random_state=42)

Using version already downloaded: Sat Apr 25 12:08:27 2020
MD5 hash of file: 0380c4cf72746622947b9ca5db9b8be8
Using version already downloaded: Sat Apr 25 12:08:28 2020
MD5 hash of file: a2e7abd8c7d9abf6e6fafc1d1f9ee6bf


The following code is adapted from Part A of this project. You will be using it again in Part B.

In [244]:
def words_in_texts(words, texts):
    '''
    Args:
        words (list-like): words to find
        texts (Series): strings to search in
    
    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    indicator_array = 1 * np.array([texts.str.contains(word) for word in words]).T
    return indicator_array

some_words = ['drug', 'bank', 'prescription', 'memo', 'private']

X_train = words_in_texts(some_words, train['email']) 
Y_train = np.array(train['spam'])

X_train[:5], Y_train[:5]

(array([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0]]), array([0, 0, 0, 0, 0]))

Recall that you trained the following model in Part A.

In [245]:
from sklearn.linear_model import LogisticRegression

model =  LogisticRegression()
model.fit(X_train, Y_train)

training_accuracy = model.score(X_train, Y_train)
print("Training Accuracy: ", training_accuracy)

Training Accuracy:  0.7576201251164648


## Evaluating Classifiers

The model you trained doesn't seem too shabby! But the classifier you made above isn't as good as this might lead us to believe. First, we are evaluating accuracy on the training set, which may provide a misleading accuracy measure, especially if we used the training set to identify discriminative features. In future parts of this analysis, it will be safer to hold out some of our data for model validation and comparison.

Presumably, our classifier will be used for **filtering**, i.e. preventing messages labeled `spam` from reaching someone's inbox. There are two kinds of errors we can make:
- False positive (FP): a ham email gets flagged as spam and filtered out of the inbox.
- False negative (FN): a spam email gets mislabeled as ham and ends up in the inbox.

These definitions depend both on the true labels and the predicted labels. False positives and false negatives may be of differing importance, leading us to consider more ways of evaluating a classifier, in addition to overall accuracy:

**Precision** measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FP}}$ of emails flagged as spam that are actually spam.

**Recall** measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FN}}$ of spam emails that were correctly flagged as spam. 

**False-alarm rate** measures the proportion $\frac{\text{FP}}{\text{FP} + \text{TN}}$ of ham emails that were incorrectly flagged as spam. 

The following image might help:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/700px-Precisionrecall.svg.png" width="500px">

Note that a true positive (TP) is a spam email that is classified as spam, and a true negative (TN) is a ham email that is classified as ham.

### Suppose we have a classifier `zero_predictor` that always predicts 0 (never predicts positive). How many false positives and false negatives would this classifier have if it were evaluated on the training set and its results were compared to `Y_train`? Fill in the variables below (answers can be hard-coded):

*Tests in Question 6 only check that you have assigned appropriate types of values to each response variable, but do not check that your answers are correct.*

<!--
BEGIN QUESTION
name: q6a
points: 1
-->

In [246]:
zero_predictor_fp = 0 # FP & TP will always be 0 (never considered spam)
zero_predictor_fn = train['spam'].sum() # mislabeled as ham FN, correcly labeled as ham TN, where TN = tot_rows - zero_predictor_fn

In [247]:
ok.grade("q6a");

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed



### What are the accuracy and recall of `zero_predictor` (classifies every email as ham) on the training set? Do **NOT** use any `sklearn` functions.

<!--
BEGIN QUESTION
name: q6b
points: 1
-->

In [248]:
zero_predictor_acc = (0 + len(train) - zero_predictor_fn) / len(train) # TP + TN / ALL ROWS
zero_predictor_recall = 0 # (TP / TP + FN) , bc TP = 0 , then (0 / 0 + zero_predictor_fn) = 0
zero_predictor_acc

0.7447091707706642

In [249]:
ok.grade("q6b");

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed



### Provide brief explanations of the results from 6a and 6b. Explain why the number of false positives, number of false negatives, accuracy, and recall all turned out the way they did.

<!--
BEGIN QUESTION
name: q6c
manual: True
points: 2
-->
<!-- EXPORT TO PDF -->

Because our predictor always marks our emails as ham, the **True Positive** and **False Positive** will always be 0, because again, our predictor will not help us detect spam. This means that the **False Negatives** will be all those emails that are actualy spam, but considered as ham, and the **True Negative** will be all those emails that are actually ham. **Accuracy** will be all those emails that are correctly labeled; we know that our True Positive will always be 0, leaving us with only the True Negatives (which are all the actual ham emails), over the total number of emails. Finally, **Recall** is 0, as our predictor will always  yield a True Positive value of 0.

In [250]:
train.head()

Unnamed: 0,id,subject,email,spam
7657,7657,Subject: Patch to enable/disable log\n,"while i was playing with the past issues, it a...",0
6911,6911,Subject: When an engineer flaps his wings\n,url: http://diveintomark.org/archives/2002/10/...,0
6074,6074,Subject: Re: [Razor-users] razor plugins for m...,"no, please post a link!\n \n fox\n ----- origi...",0
4376,4376,Subject: NYTimes.com Article: Stop Those Press...,this article from nytimes.com \n has been sent...,0
5766,5766,Subject: What's facing FBI's new CIO? (Tech Up...,<html>\n <head>\n <title>tech update today</ti...,0


### Compute the precision, recall, and false-alarm rate of the `LogisticRegression` classifier created and trained in Part A. Do **NOT** use any `sklearn` functions.

**Note: In lecture we used the `sklearn` package to compute the rates. Here you should work through them using just the definitions to help build a deeper understanding.**

<!--
BEGIN QUESTION
name: q6d
points: 2
-->

In [251]:
Y_train_hat = model.predict(X_train)
logistic_predictor_precision = np.sum((Y_train_hat == 1) & (Y_train == 1)) / np.sum(Y_train_hat) # TP /TP + FP
logistic_predictor_recall = np.sum((Y_train_hat == 1) & (Y_train == 1)) / np.sum(Y_train) # TP / TP + FN
logistic_predictor_far = np.sum((Y_train_hat == 1) & (Y_train == 0)) / (np.sum((Y_train_hat == 1) & (Y_train == 0)) + np.sum((Y_train_hat == 0) & (Y_train == 0))) # FP / FP +TN

In [252]:
ok.grade("q6d");

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 3
    Failed: 0
[ooooooooook] 100.0% passed



### Are there more false positives or false negatives when using the logistic regression classifier from Part A?

<!--
BEGIN QUESTION
name: q6e
manual: True
points: 1
-->
<!-- EXPORT TO PDF -->

There are more false negatives as our logistic regression classifier still needs to  be improved. 

In [253]:
FP =  np.sum((Y_train_hat == 1) & (Y_train == 0))
FN =  np.sum((Y_train_hat == 0) & (Y_train == 1))

print(FP) # Ham marked as Spam
print(FN) # Spam marked as Ham

122
1699


### 

1. Our logistic regression classifier got 75.8% prediction accuracy (number of correct predictions / total). How does this compare with predicting 0 for every email?
1. Given the word features we gave you above, name one reason this classifier is performing poorly. Hint: Think about how prevalent these words are in the email set.
1. Which of these two classifiers would you prefer for a spam filter and why? Describe your reasoning and relate it to at least one of the evaluation metrics you have computed so far.

<!--
BEGIN QUESTION
name: q6f
manual: True
points: 3
-->
<!-- EXPORT TO PDF -->

1. Our logistic regression currently is slightly better at filtering which emails are spam and ham than the zero predictor, with a 75.8 compared to 74.4 
2. The words in our list some_words might not be good for our regression as it could be the case that they are not highly present in either sets of emails (spam or ham), therefore they would not be good indicators of what type of email we are looking at if they are not representative of either.
3. As of right now, I would prefer the zero predictor because it has 0 for False Positive, meaning that all of the ham would be delivered correctly, whereas our Logistic Regression model has a total of 122 hams that would have been sent to spam. The importance of each email is what makes us prefer a model with less accuracy but with lower danger of missing a crucial Ham.


## Moving Forward

With this in mind, it is now your task to make the spam filter more accurate. In order to get full credit on the accuracy part of this assignment, you must get at least **88%** accuracy on the test set. To see your accuracy on the test set, you will use your classifier to predict every email in the `test` DataFrame and upload your predictions to Kaggle.

**Kaggle limits you to four submissions per day**. This means you should start early so you have time if needed to refine your model. You will be able to see your accuracy on the entire set when submitting to Kaggle (the accuracy that will determine your score for question 9).

Here are some ideas for improving your model:

1. Finding better features based on the email text. Some example features are:
    1. Number of characters in the subject / body
    1. Number of words in the subject / body
    1. Use of punctuation (e.g., how many '!' were there?)
    1. Number / percentage of capital letters 
    1. Whether the email is a reply to an earlier email or a forwarded email
1. Finding better (and/or more) words to use as features. Which words are the best at distinguishing emails? This requires digging into the email text itself. 
1. Better data processing. For example, many emails contain HTML as well as text. You can consider extracting out the text from the HTML to help you find better words. Or, you can match HTML tags themselves, or even some combination of the two.
1. Model selection. You can adjust parameters of your model (e.g. the regularization parameter) to achieve higher accuracy. Recall that you should use cross-validation to do feature and model selection properly! Otherwise, you will likely overfit to your training data.

You may use whatever method you prefer in order to create features, but **you are not allowed to import any external feature extraction libraries**. In addition, **you are only allowed to train logistic regression models**. No random forests, k-nearest-neighbors, neural nets, etc.

We have not provided any code to do this, so feel free to create as many cells as you need in order to tackle this task. However, answering questions 7, 8, and 9 should help guide you.

---

---

In [497]:
#Fresh bread out of the oven
training_data = pd.read_csv('data/train.csv')

training, validation = train_test_split(training_data, test_size=0.1, random_state=42)

In [498]:

training['subject'] = training["subject"].apply(str)

#1
#Number of characters in the subject / body
single_sub = [sub for sub in training['subject']]
single_email = [sentence for sentence in training['email']]

chars_em = [len(a) for a in single_email]
chars_sub = [len(a) for a in single_sub]

#Number of words in the subject / body
num_words_em = training['email'].apply(len)
num_words_sub = training['subject'].apply(len)


#Use of punctuation (e.g., how many '!' were there?)
spesh_char = training['email'].str.findall('[!&.?":,|<>]').apply(len).astype(object)


#Number / percentage of capital letters
upper_em = [sum([c.isupper() for c in a]) for a in single_email]
#upper_sub = [sum([c.isupper() for c in a]) for a in single_sub]


#Whether the email is a reply to an earlier email or a forwarded email
replies = []
for sub in training["subject"]:
    if "Re:" in sub:
        replies.append(1)
    elif "RE:" in sub:
        replies.append(1)
    else:
        replies.append(0)


#2
some_other_words = ['apple', 'interview','html','body','first', 'new', '!', '1', 'selected',
                    'millionth', 'winner', 'are','...', '$', 'P.S.', 'warning', 'attention',
                    'free','FREE', 'limited','exclusive','promo', 'offer', 'act', 'now', '.', 'adult', 'Adult'
                   'membership', '<head>', '<body>', '<>', '<html>', '</head>', '<center>', '<h1>', 'table', '\n',
                   'drug', 'keto','prescription', 'cannabis', 'weight', 'loss', 'pay', 'enhance', 'performance', 
                   'hard', 'money', 'congratulations', 'pain', 'stop', 'access', '<blockquote>', 'male', 'female', 'enhancer', 
                   'site', 'treatment', 'the', 'you','your','this', 'fat', 'burn', 'girls', 'strategy', 'targeted', 'security'
                   'as', 'seen', 'singles', 'own', 'opportunity', 'lifetime', 'work', 'at','from', 'home','easily', 'ads',
                   'legal','here'] 



# Convert the emails to lower case as a first step to processing the text
training['email'] = training['email'].str.lower()

X_tr = pd.DataFrame(words_in_texts(some_other_words, training['email']))
Y_tr = training['spam']

X_tr["spesh chars"] = spesh_char
#X_tr["email length"] = num_words_em
#X_tr['subject length'] = num_words_sub
X_tr['chars in email'] = chars_em
X_tr['chars in sub'] = chars_sub
X_tr['spesh chars'] = X_tr['spesh chars'].fillna(X_tr['spesh chars'].mean())
#X_tr['email length'] = X_tr['email length'].fillna(X_tr['email length'].mean())
#X_tr['subject length'] = X_tr['subject length'].fillna(X_tr['subject length'].mean())
X_tr['# of uppercase in email'] = upper_em
X_tr['reply?'] = replies
#X_tr['forwrded?'] = forwards



X_tr


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,76,77,78,79,80,spesh chars,chars in email,chars in sub,# of uppercase in email,reply?
0,0,0,0,0,0,1,1,1,0,0,...,0,0,0,0,1,17.0,1641,37,69,0
1,0,0,1,1,1,1,0,1,0,0,...,1,0,1,0,1,22.0,4713,42,90,0
2,0,0,0,0,0,1,1,1,0,0,...,0,0,0,0,1,48.0,1399,54,51,1
3,0,0,1,0,1,1,1,1,0,0,...,0,0,0,0,1,38.0,4435,73,188,0
4,0,1,1,1,0,1,1,1,0,0,...,1,0,1,0,1,64.0,32857,52,1366,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7508,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,19.0,465,60,21,0
7509,0,0,0,0,0,1,1,1,0,0,...,0,1,1,0,1,77.0,7054,42,908,0
7510,0,0,0,1,0,0,1,1,0,0,...,0,0,0,0,0,63.0,1732,26,64,0
7511,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,2084.0,1098,52,33,1


In [499]:
from sklearn.metrics import accuracy_score
log_model =  LogisticRegression()
log_model.fit(X_tr, Y_tr)

tr_acc = log_model.score(X_tr, Y_tr)
print("Training Accuracy: ", tr_acc)

Y_tr_hat = log_model.predict(X_tr)


def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat)**2))

print("Training Error (RMSE):", rmse(Y_tr, Y_tr_hat))

Training Accuracy:  0.9112205510448556
Training Error (RMSE): 0.2979588041242353


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [500]:
#from lab 7
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

def compute_CV_error(model, X_train, Y_train):
    '''
    Split the training data into 4 subsets.
    For each subset, 
        fit a model holding out that subset
        compute the MSE on that subset (the validation set)
    You should be fitting 4 models total.
    Return the average MSE of these 4 folds.

    Args:
        model: an sklearn model with fit and predict functions 
        X_train (data_frame): Training data
        Y_train (data_frame): Label 

    Return:
        the average validation MSE for the 4 splits.
    '''
    kf = KFold(n_splits=10)
    validation_errors = []
    
    for train_idx, valid_idx in kf.split(X_train):
        # split the data
        split_X_train, split_X_valid = X_train.iloc[train_idx,:], X_train.iloc[valid_idx,:]
        split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]
        # Fit the model on the training split
        model.fit(split_X_train,split_Y_train)
        
        Y_hat = model.predict(split_X_valid)
        
        # Compute the RMSE on the validation split
        error = rmse(split_Y_valid, model.predict(split_X_valid))
        #error = log_loss(Y_hat, split_Y_valid, labels = [0,1])


        validation_errors.append(error)
        
    return np.mean(validation_errors)
compute_CV_error(log_model, X_tr, Y_tr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

0.3111757856727273

### Feature/Model Selection Process

In the following cell, describe the process of improving your model. You should use at least 2-3 sentences each to address the follow questions:

1. How did you find better features for your model?
2. What did you try that worked / didn't work?
3. What was surprising in your search for good features?

<!--
BEGIN QUESTION
name: q7
manual: True
points: 6
-->
<!-- EXPORT TO PDF -->

In [503]:

test = pd.read_csv('data/test.csv')

test['subject'] = test["subject"].apply(str)

#1
#Number of characters in the subject / body
single_sub = [sub for sub in test['subject']]
single_email = [sentence for sentence in test['email']]

chars_em = [len(a) for a in single_email]
chars_sub = [len(a) for a in single_sub]

#Number of words in the subject / body
num_words_em = test['email'].apply(len)
num_words_sub = test['subject'].apply(len)

#Use of punctuation (e.g., how many '!' were there?)
spesh_char = test['email'].str.findall('[!&.?":,|<>]').apply(len).astype(object)


#Number / percentage of capital letters
upper_em = [sum([c.isupper() for c in a]) for a in single_email]
#upper_sub = [sum([c.isupper() for c in a]) for a in single_sub]


#Whether the email is a reply to an earlier email or a forwarded email
replies = []
for sub in test["subject"]:
    if "Re:" in sub:
        replies.append(1)
    elif "RE:" in sub:
        replies.append(1)
    else:
        replies.append(0)


# Convert the emails to lower case as a first step to processing the text
test['email'] = test['email'].str.lower()

X_tst = pd.DataFrame(words_in_texts(some_other_words, test['email']))
Y_tr = training['spam']

X_tst["spesh chars"] = spesh_char

X_tst['chars in email'] = chars_em
X_tst['chars in sub'] = chars_sub
X_tst['spesh chars'] = X_tr['spesh chars'].fillna(X_tr['spesh chars'].mean())

X_tst['# of uppercase in email'] = upper_em
X_tst['reply?'] = replies

test_predictions = log_model.predict(X_tst)

AttributeError: 'Series' object has no attribute 'applest'