## Assignment 5-6: Document Classification
#### Summer 2021
**Authors:** GOAT Team (Esteban Aramayo, Ethan Haley, Claire Meyer, and Tyler Frankenburg)

In this assignment, we'll ingest a dataset on emails from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Spambase) and build a classifier to predict if a row is a spam email or not, using the provided features.

In [118]:
import numpy as np
import pandas as pd 
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

##### Configuring the Spam dataset

First we'll import the email data into a CSV without headers, as we'll add the column names from the names file later.

In [3]:
spam_data = pd.read_csv("https://raw.githubusercontent.com/ebhtra/gory-graph/main/DocumentClassification/spambase.data", header=None)

spam_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


Then we open the names file, which includes names for all our eventual features.

In [4]:
import urllib

link = "https://raw.githubusercontent.com/ebhtra/gory-graph/main/DocumentClassification/spambase.names"
f = urllib.request.urlopen(link)
f = f.read().decode()

Using Regex findall(), we can use pattern matching to ignore documentation text and pull in all the feature names to use as columns.

In [5]:
colnames = re.findall("word_freq_[a-z]*:|word_freq_[a-z]*[0-9]*[a-z]:|word_freq_[a-z]*[0-9]*:|char_freq_.:|capital_run_length_[a-z]*:",f)

In [6]:
len(colnames)

57

In [7]:
colnames.append("spam")

In [8]:
print(len(colnames))

58


In [9]:
spam_data.columns = colnames

In [10]:
spam_data.head()

Unnamed: 0,word_freq_make:,word_freq_address:,word_freq_all:,word_freq_3d:,word_freq_our:,word_freq_over:,word_freq_remove:,word_freq_internet:,word_freq_order:,word_freq_mail:,...,char_freq_;:,char_freq_(:,char_freq_[:,char_freq_!:,char_freq_$:,char_freq_#:,capital_run_length_average:,capital_run_length_longest:,capital_run_length_total:,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


##### Exploring the Data

Let's look at class balance, as well as descriptive statistics of each field.

In [11]:
spam_data.describe()

Unnamed: 0,word_freq_make:,word_freq_address:,word_freq_all:,word_freq_3d:,word_freq_our:,word_freq_over:,word_freq_remove:,word_freq_internet:,word_freq_order:,word_freq_mail:,...,char_freq_;:,char_freq_(:,char_freq_[:,char_freq_!:,char_freq_$:,char_freq_#:,capital_run_length_average:,capital_run_length_longest:,capital_run_length_total:,spam
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285,0.394045
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851,0.488698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


In [12]:
spam_data['spam'].value_counts()

0    2788
1    1813
Name: spam, dtype: int64

This dataset contains just under 40% spam, so the classes are not perfectly balanced.  

The capital letter features are on a different scale from the others, so will need to be normalized for certain ML models.

##### Building the classifiers

We're going to squeeze what we can out of a few sci-kit learn classification models. To start, we'll split our target, y, from our features, X.

In [13]:
y = spam_data.iloc[:,57]
X = spam_data.iloc[:,:57]

Here are the features for the first email:

In [290]:
print(X.iloc[0,:])
print()
print(f"email is spam: {y[0]}")

word_freq_make:                  0.000
word_freq_address:               0.640
word_freq_all:                   0.640
word_freq_3d:                    0.000
word_freq_our:                   0.320
word_freq_over:                  0.000
word_freq_remove:                0.000
word_freq_internet:              0.000
word_freq_order:                 0.000
word_freq_mail:                  0.000
word_freq_receive:               0.000
word_freq_will:                  0.640
word_freq_people:                0.000
word_freq_report:                0.000
word_freq_addresses:             0.000
word_freq_free:                  0.320
word_freq_business:              0.000
word_freq_email:                 1.290
word_freq_you:                   1.930
word_freq_credit:                0.000
word_freq_your:                  0.960
word_freq_font:                  0.000
word_freq_000:                   0.000
word_freq_money:                 0.000
word_freq_hp:                    0.000
word_freq_hpl:           

There are frequency counts (how many times does this string appear in the email, for every 100 strings in the email) for 48 chosen strings, as well as 6 chosen characters.  There are also 3 different measures of capital letters.  Aside from trying to create our own non-linear combinations of the provided features, our feature engineering capabilities are handcuffed in this project, since we don't have the emails from which these features were chosen.  As such, we'll focus our efforts on leveraging sklearn's ML tools and best practices.

We'll split into test and train sets using sklearn's built-in splitting function. We'll set aside 10% of the samples as a holdout for testing.

In [162]:
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.1, random_state=1234)

In [163]:
print('Resulting lengths of ')
print('X_train, y_train, X_test, and y_test:')
print(list(map(len, (X_train, y_train, X_test, y_test))))

Resulting lengths of 
X_train, y_train, X_test, and y_test:
[4140, 4140, 461, 461]


The capital letter features are of a different magnitude than the others, so we should scale them to similar.  It might make sense to use sklearn's MinMaxScaler, which maps an array of numbers to the range between 0 and 1, based on the minimum and maximum values in the array.

In [164]:
# For example
scaler = MinMaxScaler()
scaler.fit(X_train)  
X_scale_train = scaler.transform(X_train)
X_scale_test = scaler.transform(X_test)

In [165]:
print(f'Features now range from {np.min(X_scale_train)} to {np.max(X_scale_train)}')

Features now range from 0.0 to 1.0


But in order to use k-fold cross-validation during the training of a model, it's better to fit the scaler to each training split and then transform the held-out fold with that current iteration of the scaler, thus avoiding data leakage within the process.  Rather than implementing it from scratch, we'll use a sklearn Pipeline object.

In [166]:
pipe = Pipeline([('scaler', MinMaxScaler), ('model', LogisticRegression(random_state=1234, max_iter=500))])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1234)
scores = cross_validate(pipeline, X_train, y_train, scoring=('accuracy', 'precision'),
                        cv=cv, return_train_score=True, n_jobs=-1)

We began with Logistic Regression here because it's well suited to the type of features that were given, and allows us to get a good idea of which of those features are generally important or not.  We chose accuracy and precision as scoring metrics because of the nature of the classification task, spam filtering.  Besides just finding the most accurate filter in terms of correctly classifying emails, we want a filter with a high precision score, assuming we'd rather see a few spam emails in our inbox than to find a few important emails in the spam folder.

In [167]:
print(f"The mean accuracy during training 10 splits 3 different times was {round(np.mean(scores['train_accuracy']),4)}")
print(f"The standard deviation of those 30 scores was {round(np.std(scores['train_accuracy']),4)}")
print()
print(f"The mean scores for the 30 held out validation folds was {round(np.mean(scores['test_accuracy']),4)}")
print(f"Their st.dev. was {round(np.std(scores['test_accuracy']),4)}")

The mean accuracy during training 10 splits 3 different times was 0.9291
The standard deviation of those 30 scores was 0.0031

The mean scores for the 30 held out validation folds was 0.9233
Their st.dev. was 0.0093


**The validation accuracies are just a half percent worse than the training ones, so there's not much evidence of overfitting.  The std devs are in line as well, with the 90-10 split provided by the 10 folds giving the smaller (validation) folds \$\sqrt{9}\$ = 3 times higher stdev.**

Let's see what the precisions were.

In [168]:
print(f"The mean test precision was {round(np.mean(scores['test_precision']),4)}")

The mean test precision was 0.9171


Since we used `RepeatedStratifiedKFold` for cross-validation, there should be about 40% spam in each fold, and since the model was about 93% accurate, it was probably predicting `spam` about 40% of the time.  With the precision score around 92%, this means we could expect around 8% False Positives on 40% spam predictions, or around 3% False Positives overall, where a non-spam email got sent to the spam box by mistake.  This may or may not be considered acceptable to a user of the filter, but we'll hope to do better with other models.

We might at this point choose to try out different regularization penalties (lasso and ridge regression) or to prune some of the 57 features we were given, but since there's little evidence of overfitting, we'll just retrain a logistic regression model on the entire training set and verify that it gets something like 92.5% accuracy with 3% false positives on the unseen 10% of the data we split off for testing.

In [169]:
# Use the scaled data from before
linear_reg = LogisticRegression(max_iter=300).fit(X_scale_train, y_train)
lin_pred = linear_reg.predict(X_scale_test)
lin_score = accuracy_score(y_test, lin_pred)
print(f"Test accuracy is {round(lin_score,4)}")
print(f"False positive rate is {round(sum(lin_pred - y_test == 1) / len(y_test), 3)}")

Test accuracy is 0.9002
False positive rate is 0.024


In [170]:
print(f"Test precision is {round(np.dot(y_test, lin_pred) / sum(lin_pred),4)}")

Test precision is 0.9308


The test precision is better than expected there but the test accuracy is lower than expected, so it's actually a good example of why you can't just base your model choice on what happens to fit the one held out test set best.  What was the worst k-fold test split out of the 30, for reference?

In [171]:
print(f"Worst cross-val accuracy out of 30 splits was {round(min(scores['test_accuracy']),4)}")
print('The 90.02% accuracy on the held out test emails happened to be slightly worse than all 30 of those.')

Worst cross-val accuracy out of 30 splits was 0.9058


**What were the 20 most important features?**

In [161]:
sorted(list(zip(linear_reg.coef_[0], X.columns)), key=lambda x: abs(x[0]), reverse=True)[:20]

[(6.634530839653129, 'word_freq_remove:'),
 (5.823450986761517, 'word_freq_000:'),
 (5.776807741527754, 'char_freq_$:'),
 (-5.735088805785175, 'word_freq_hp:'),
 (4.884655143987973, 'word_freq_free:'),
 (-4.648289195446876, 'word_freq_george:'),
 (4.386386117880408, 'capital_run_length_total:'),
 (3.5908643143794596, 'char_freq_!:'),
 (3.587826581327759, 'word_freq_business:'),
 (-3.545401108782535, 'word_freq_re:'),
 (3.531513711310692, 'word_freq_our:'),
 (-3.514266424780131, 'word_freq_meeting:'),
 (3.414309809347759, 'word_freq_money:'),
 (-3.2886563525516155, 'word_freq_hpl:'),
 (3.1815525257315667, 'word_freq_internet:'),
 (-3.1353679013712146, 'word_freq_edu:'),
 (2.980444885174455, 'word_freq_your:'),
 (2.920658083291564, 'word_freq_font:'),
 (2.8551215924272517, 'word_freq_order:'),
 (2.4842988281801284, 'word_freq_over:')]

**"Remove", "000", "$", "free", and lots of capitals all suggest spam.  "hp", "re", and "George" suggest non-spam.** 

Replies to real emails, with "RE:" in the subject, are a good sign an email is not spam. 
It's debatable whether the frequency of the word "George" should be considered an indication of non-spam.  If you want to train a separate classifier for every person then it makes sense to use their names.  But if you want to train a general purpose model, you wouldn't use names.  Still, this training set gives us no room to engineer other features, since we are just given a list of 57 of them, so perhaps it makes sense to use the name feature.

Another topic of interest is how changing the model's threshold for classification changes the accuracy and precision.
Can we raise accuracy without letting too many important emails go to the spam folder (by lowering the threshold)?
Can we reduce false positives without hurting accuracy too much (by raising the threshold)?

In [194]:
def threshold(thresh, probs, truths):
    '''Use your own threshold as first argument, to see how
    accuracy and false positive rates change for given (2-D) probability
    array and ground truths as 2nd and 3rd arguments.
    '''
    prob_score = accuracy_score(truths, [p[1] > thresh for p in probs])
    print(f"Threshold set to {thresh}")
    print(f"Test accuracy is {round(prob_score,4)}")
    print(f"False positive rate is {round(sum([p[1] > thresh for p in lin_prob] - y_test == 1) / len(y_test), 3)}")

In [209]:
# default used by sklearn predict functions is 0.5, so this should match above scores
lin_prob = linear_reg.predict_proba(X_scale_test)
truths = y_test
threshold(0.5, lin_prob, truths)

Threshold set to 0.5
Test accuracy is 0.9002
False positive rate is 0.024


Raise threshold from sklearn's default 0.5

In [218]:
threshold(0.68, lin_prob, truths)

Threshold set to 0.68
Test accuracy is 0.8351
False positive rate is 0.011


**To reduce false positives by a half, we had to lower accuracy from 0.9 to 0.835**

Lower threshold from 0.5

In [219]:
threshold(0.43, lin_prob, truths)

Threshold set to 0.43
Test accuracy is 0.9132
False positive rate is 0.039


**This direction looks worse:  Our FP rate went from 2.4% to 3.9% and we only improved accuracy from 90% to 91.3%**

##### Decision Trees

Now we'll try some decision trees to see how they compare.  We won't use a pipeline for the cross-validation scaling here, since decision trees split on the order of the feature values, and the order won't be changed by scaling.  We will, however, use sklearn's `GridSearchCV` to find the best hyperparameters via cross-validation.  For decision trees, these hyperparameters include the depth of the tree and the minimum number of nodes needed in a leaf in order to split on it (amongst others). Both of those parameters are used to control the tradeoff between underfitting and overfitting.

In [281]:
grid_params_for_trees = [{
    'max_depth': [3,6,9,12],
    'min_samples_split': [2,3,4,5]}]

CV_trees = GridSearchCV(estimator = DecisionTreeClassifier(random_state=1234), scoring='accuracy',
                               param_grid = grid_params_for_trees,
                               cv = 10, return_train_score=True)

CV_trees.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=1234),
             param_grid=[{'max_depth': [3, 6, 9, 12],
                          'min_samples_split': [2, 3, 4, 5]}],
             return_train_score=True, scoring='accuracy')

In [282]:
# parameter combos with the highest validation scores
best_accuracy = pd.DataFrame(CV_trees.cv_results_).sort_values('mean_test_score', ascending=False)
best_accuracy.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_min_samples_split,params,split0_test_score,split1_test_score,split2_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
14,0.03616,0.000893,0.001772,5.4e-05,12,4,"{'max_depth': 12, 'min_samples_split': 4}",0.908213,0.915459,0.94686,...,0.977187,0.972356,0.974772,0.972088,0.97504,0.968867,0.973162,0.97182,0.973349,0.002239
10,0.029407,0.00028,0.001753,5.6e-05,9,4,"{'max_depth': 9, 'min_samples_split': 4}",0.905797,0.910628,0.934783,...,0.961621,0.957327,0.962158,0.958132,0.957059,0.954643,0.961084,0.958937,0.959071,0.00231
15,0.036339,0.000815,0.00175,2.9e-05,12,5,"{'max_depth': 12, 'min_samples_split': 5}",0.903382,0.903382,0.932367,...,0.975309,0.971014,0.97343,0.970209,0.974235,0.967525,0.97182,0.969404,0.9719,0.002389
12,0.036155,0.000756,0.001773,5.3e-05,12,2,"{'max_depth': 12, 'min_samples_split': 2}",0.903382,0.908213,0.94686,...,0.977456,0.974503,0.976382,0.97343,0.976919,0.970478,0.973698,0.974235,0.97496,0.002197
13,0.03624,0.001076,0.001752,5.3e-05,12,3,"{'max_depth': 12, 'min_samples_split': 3}",0.896135,0.908213,0.942029,...,0.977456,0.973698,0.975309,0.972893,0.976114,0.970478,0.973698,0.97343,0.97445,0.002156


In [283]:
best_accuracy[['mean_test_score', 'mean_train_score','param_max_depth','param_min_samples_split']].head()

Unnamed: 0,mean_test_score,mean_train_score,param_max_depth,param_min_samples_split
14,0.924879,0.973349,12,4
10,0.922705,0.959071,9,4
15,0.921014,0.9719,12,5
12,0.921014,0.97496,12,2
13,0.921014,0.97445,12,3


The test scores are very similar to the LogisticRegression model results previously.  There is definitely some overfitting, since the training scores have about half the error rate of the validation folds.  This is because the best models have depths of 9 or even 12, which is deep enough to learn combinations of feature splits that don't generalize well.  On the other hand, the model is counteracting this tendency by using a minimum of 4 samples at a node in order to split on it, in the best models.

Let's train a model with the `max_depth=9` and `min_samples_split=4`, since it overfit less.  The results should be similar to what we just saw, but maybe the false positive rate is less than with logistic regression.

In [291]:
tree = DecisionTreeClassifier(max_depth=9, min_samples_split=4, random_state=1234).fit(X_train, y_train)
tree_probs = tree.predict_proba(X_test)

In [292]:
truths = y_test
threshold(0.5, tree_probs, truths)

Threshold set to 0.5
Test accuracy is 0.9349
False positive rate is 0.024


It seems like the tree got a little lucky on the 10% of emails we happened to hold out. Most importantly, it seems to have a very similar FP rate as the LogReg did.  We'll just try a couple of thresholds before moving on.

In [296]:
threshold(0.9, tree_probs, truths)

Threshold set to 0.9
Test accuracy is 0.9306
False positive rate is 0.0


**That's very powerful -- The tree can be used as a filter that sends 7 out of 40 spams to the inbox and not a single important email goes to the spam folder.**

Again, this was a slightly "lucky" split for the classifier, but the important thing is to confirm that it does almost as well with a different random split:

In [299]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=.1, random_state=620)
tree = DecisionTreeClassifier(max_depth=9, min_samples_split=4, random_state=620).fit(X_tr, y_tr)
tree_probs = tree.predict_proba(X_te)

In [302]:
threshold(0.9, tree_probs, y_te)

Threshold set to 0.9
Test accuracy is 0.9067
False positive rate is 0.0


The UCI website that hosts this data says:
` False positives (marking good mail as spam) are very undesirable.  If we insist on zero false positives in the training/testing set, 20-25% of the spam passed through the filter.`
Our 7-9 (accuracy = 91-93%) out of 40 passing through our tree filter line up with the low end of that estimate, so we expect it will be hard to improve upon the decision tree model optimized with GridSearchCV, but it's worth trying....

The tree variants that usually perform best on these tasks are `RandomForestClassifier` and `GradientBoostingClassifier`.

In [306]:
# As with all these GridSearchCV routines, training hundreds of models can take awhile
grid_params_for_forest = [{
    'max_depth': [3,6,9,12],
    'min_samples_split': [2,3,4,5]}]

CV_forest = GridSearchCV(estimator = RandomForestClassifier(random_state=1234), scoring='accuracy',
                               param_grid = grid_params_for_forest,
                               cv = 10, return_train_score=True)

CV_forest.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=1234),
             param_grid=[{'max_depth': [3, 6, 9, 12],
                          'min_samples_split': [2, 3, 4, 5]}],
             return_train_score=True, scoring='accuracy')

In [307]:
# parameter combos with the highest validation scores
best_accuracy = pd.DataFrame(CV_forest.cv_results_).sort_values('mean_test_score', ascending=False)
best_accuracy.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_min_samples_split,params,split0_test_score,split1_test_score,split2_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
12,0.363359,0.002933,0.013319,0.000466,12,2,"{'max_depth': 12, 'min_samples_split': 2}",0.942029,0.929952,0.944444,...,0.979334,0.979603,0.974772,0.979066,0.978529,0.976919,0.979066,0.977992,0.978449,0.001475
13,0.363393,0.004471,0.01315,0.000218,12,3,"{'max_depth': 12, 'min_samples_split': 3}",0.937198,0.934783,0.94686,...,0.977456,0.977724,0.973967,0.974235,0.975845,0.976651,0.976114,0.976114,0.976463,0.001465
14,0.363311,0.003128,0.013202,0.000212,12,4,"{'max_depth': 12, 'min_samples_split': 4}",0.932367,0.927536,0.94686,...,0.974503,0.976114,0.973967,0.972088,0.974772,0.97343,0.973967,0.973698,0.97453,0.00136
15,0.359892,0.00226,0.013197,0.000248,12,5,"{'max_depth': 12, 'min_samples_split': 5}",0.934783,0.929952,0.94686,...,0.972356,0.973967,0.973162,0.971283,0.973967,0.972625,0.975309,0.974235,0.973618,0.001176
9,0.332131,0.013465,0.012948,0.000993,9,3,"{'max_depth': 9, 'min_samples_split': 3}",0.939614,0.929952,0.939614,...,0.962158,0.960011,0.960548,0.960816,0.960816,0.961353,0.962426,0.960548,0.961541,0.001188


In [308]:
best_accuracy[['mean_test_score', 'mean_train_score','param_max_depth','param_min_samples_split']].head()

Unnamed: 0,mean_test_score,mean_train_score,param_max_depth,param_min_samples_split
12,0.946135,0.978449,12,2
13,0.944928,0.976463,12,3
14,0.944203,0.97453,12,4
15,0.944203,0.973618,12,5
9,0.940097,0.961541,9,3


Those are even better scores, but the CV search chose the most overfittable params -- highest depth and smallest split.  Since random forests are somewhat resistant to overfitting, we might try even more extreme numbers.

In [328]:
grid_params_for_forest = [{
    'max_depth': [10,12,14,16],
    'min_samples_split': [2,3]}]

CV_forest = GridSearchCV(estimator = RandomForestClassifier(n_estimators=500, random_state=1234), scoring='accuracy',
                               param_grid = grid_params_for_forest,
                               cv = 10, return_train_score=True)

CV_forest.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=RandomForestClassifier(n_estimators=500,
                                              random_state=1234),
             param_grid=[{'max_depth': [10, 12, 14, 16],
                          'min_samples_split': [2, 3]}],
             return_train_score=True, scoring='accuracy')

In [329]:
# parameter combos with the highest validation scores
best_accuracy = pd.DataFrame(CV_forest.cv_results_).sort_values('mean_test_score', ascending=False)
best_accuracy[['mean_test_score', 'mean_train_score','param_max_depth','param_min_samples_split']].head()

Unnamed: 0,mean_test_score,mean_train_score,param_max_depth,param_min_samples_split
7,0.950725,0.989962,16,3
6,0.950725,0.990821,16,2
4,0.948551,0.987493,14,2
5,0.948068,0.986098,14,3
2,0.946135,0.979227,12,2


In [330]:
forest = RandomForestClassifier(max_depth=16, min_samples_split=3,
                                n_estimators=500, random_state=1234).fit(X_train, y_train)
forest_probs = forest.predict_proba(X_test)
threshold(0.5, forest_probs, y_test)

Threshold set to 0.5
Test accuracy is 0.9566
False positive rate is 0.024


Raise the threshold to where no good mail gets filtered

In [335]:
threshold(0.9, forest_probs, y_test)

Threshold set to 0.9
Test accuracy is 0.8568
False positive rate is 0.0


**This random forest model has an accuracy over 95% if you're OK with a 2.4% FPR, but drops to 85.7% if you need 0% FPR**

Let's get a second opinion before moving on to gradient boosting.

In [342]:
forest = RandomForestClassifier(max_depth=16, min_samples_split=3, random_state=620).fit(X_tr, y_tr)
forest_probs = forest.predict_proba(X_te)
threshold(0.9, forest_probs, y_te)

Threshold set to 0.9
Test accuracy is 0.8503
False positive rate is 0.0


Similar results.  Let's try gradient boosted trees.

In [345]:
grid_params_for_boosting = [{
    'max_depth': [6,9,12],
    'min_samples_split': [2,4,6],
    'learning_rate':[.005, .1, .3]}]

CV_boost = GridSearchCV(estimator = GradientBoostingClassifier(n_estimators=500, random_state=1234), scoring='accuracy',
                               param_grid = grid_params_for_boosting,
                               cv = 10, return_train_score=True)
# use whole dataset this time, since we'll rebuild a model anyways
CV_boost.fit(X, y)

GridSearchCV(cv=10,
             estimator=GradientBoostingClassifier(n_estimators=500,
                                                  random_state=1234),
             param_grid=[{'learning_rate': [0.005, 0.1, 0.3],
                          'max_depth': [6, 9, 12],
                          'min_samples_split': [2, 4, 6]}],
             return_train_score=True, scoring='accuracy')

In [347]:
# parameter combos with the highest validation scores
best_accuracy = pd.DataFrame(CV_boost.cv_results_).sort_values('mean_test_score', ascending=False)
best_accuracy[['mean_test_score', 'mean_train_score','param_max_depth','param_min_samples_split','param_learning_rate']].head()

Unnamed: 0,mean_test_score,mean_train_score,param_max_depth,param_min_samples_split,param_learning_rate
11,0.942837,0.9993,6,6,0.1
10,0.942621,0.9993,6,4,0.1
23,0.942405,0.999396,9,6,0.3
26,0.942404,0.999396,12,6,0.3
22,0.942187,0.999396,9,4,0.3


That one is overfitting badly, so we should retrain it with shallower trees.

In [348]:
grid_params_for_boosting = [{
    'max_depth': [3,4,5],
    'min_samples_split': [5],
    'learning_rate':[.1]}]

CV_boost = GridSearchCV(estimator = GradientBoostingClassifier(n_estimators=500, random_state=1234), scoring='accuracy',
                               param_grid = grid_params_for_boosting,
                               cv = 10, return_train_score=True)
# use whole dataset this time, since we'll rebuild a model anyways
CV_boost.fit(X, y)

GridSearchCV(cv=10,
             estimator=GradientBoostingClassifier(n_estimators=500,
                                                  random_state=1234),
             param_grid=[{'learning_rate': [0.1], 'max_depth': [3, 4, 5],
                          'min_samples_split': [5]}],
             return_train_score=True, scoring='accuracy')

In [349]:
# parameter combos with the highest validation scores
best_accuracy = pd.DataFrame(CV_boost.cv_results_).sort_values('mean_test_score', ascending=False)
best_accuracy[['mean_test_score', 'mean_train_score','param_max_depth','param_min_samples_split','param_learning_rate']].head()

Unnamed: 0,mean_test_score,mean_train_score,param_max_depth,param_min_samples_split,param_learning_rate
0,0.941748,0.991475,3,5,0.1
1,0.940881,0.998068,4,5,0.1
2,0.939795,0.999179,5,5,0.1


Let's see what the FP rate is.

In [350]:
booster = GradientBoostingClassifier(max_depth=3, min_samples_split=6,
                                n_estimators=500, random_state=1234).fit(X_train, y_train)
booster_probs = booster.predict_proba(X_test)
threshold(0.5, booster_probs, y_test)

Threshold set to 0.5
Test accuracy is 0.9479
False positive rate is 0.024


And with a higher classification threshold?

In [357]:
threshold(0.89, booster_probs, y_test)

Threshold set to 0.89
Test accuracy is 0.9393
False positive rate is 0.0


Second opinion, using same threshold:

In [358]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=.1, random_state=620)
booster = GradientBoostingClassifier(max_depth=3, min_samples_split=6,
                                n_estimators=500, random_state=620).fit(X_train, y_train)
booster_probs = booster.predict_proba(X_test)
threshold(0.89, booster_probs, y_test)

Threshold set to 0.89
Test accuracy is 0.9393
False positive rate is 0.0


**Were the same features important for the GBC as for LogReg?**

In [364]:
print('GBC most important')
sorted(list(zip(booster.feature_importances_, X.columns)), reverse=True)[:10]

GBC most important


[(0.23555716994129805, 'char_freq_!:'),
 (0.1894544321548047, 'char_freq_$:'),
 (0.12349834369582058, 'word_freq_remove:'),
 (0.09609956157230902, 'word_freq_hp:'),
 (0.0516503809617116, 'word_freq_free:'),
 (0.046537355829543145, 'capital_run_length_average:'),
 (0.03864972709681108, 'word_freq_your:'),
 (0.034816065391462134, 'capital_run_length_longest:'),
 (0.02906783160228493, 'word_freq_edu:'),
 (0.0277970870724421, 'word_freq_george:')]

In [365]:
print('LogReg most important')
sorted(list(zip(linear_reg.coef_[0], X.columns)), key=lambda x: abs(x[0]), reverse=True)[:10]

LogReg most important


[(7.173262142126195, 'word_freq_remove:'),
 (6.266720414314751, 'char_freq_$:'),
 (6.259045569423496, 'word_freq_000:'),
 (-6.193964932477645, 'word_freq_hp:'),
 (5.240425361873076, 'word_freq_free:'),
 (-4.899673982254518, 'word_freq_george:'),
 (4.3477817045903855, 'capital_run_length_total:'),
 (4.189373330848128, 'word_freq_business:'),
 (3.9105736696203515, 'word_freq_our:'),
 (-3.8069001892377266, 'word_freq_re:')]

There's a lot of overlap, but the Gradient Boosted model leaned heavily on exclamation points where the Linear model used '000' as a spam flag, perhaps where large dollar amounts were being used as clickbait, and the numbers got split on commas by the parser (e.g. '$1,000,000!' became 2 counts of '000')

### Summary

Constrained to a pre-fixed set of features and goals, we turned our focus to using some of the most common and powerful ML tools.  
- Training Classifiers to generalize to unseen inputs
 - Scaling features appropriately
 - Splitting data into subsets where necessary
 - Using pipelines with cross-validation
 - Using grid search with cross-validation to optimize hyperparameters  
 
 
- Evaluating the trained Classifiers
 - Choosing appropriate metrics 
 - Interpreting repeatability/generalizability of test results
 - Avoiding test data leakage
 - Raising classification thresholds to eliminate false positives
 - Interpreting learned features

Our most powerful model was the gradient boosted classifier with a high threshold, which was 93.93% accurate on multiple data splits, without wrongly sending a single good email to the spam folder.  With 39.4% of the emails being spam, that means our model was letting about 15-16% of spam into inboxes, which is not bad, considering the hosts of the data project suggested 20-25% would need to be let through, in order to eliminate false positives.