# Week 10 and 11 Homework: Document Classification  

For this assignment, we were asked to classify the documents in an email spam dataset as spam or not spam. I used a decision tree on the UCI dataset that was provided.

### Importing Libraries
I load `nltk`, `numpy`, and `pandas`, along with parts of the `sklearn` machine learning library.

In [65]:
%matplotlib inline

import nltk
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import tree
import sklearn.metrics as sm

### Loading Data
I load the spam dataset into a pandas dataframe. The data file was slightly modified so that its first row holds the column names.

In [51]:
spambase = pd.read_csv("spambase/spambase.data")

In [52]:
spambase.describe()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spamclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | ... | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 |
| mean | 0.104553 | 0.213015 | 0.280656 | 0.065425 | 0.312223 | 0.095901 | 0.114208 | 0.105295 | 0.090067 | 0.239413 | ... | 0.038575 | 0.13903 | 0.016976 | 0.269071 | 0.075811 | 0.044238 | 5.191515 | 52.172789 | 283.289285 | 0.394045 |
| std | 0.305358 | 1.290575 | 0.504143 | 1.395151 | 0.672513 | 0.273824 | 0.391441 | 0.401071 | 0.278616 | 0.644755 | ... | 0.243471 | 0.270355 | 0.109394 | 0.815672 | 0.245882 | 0.429342 | 31.729449 | 194.89131 | 606.347851 | 0.488698 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.588 | 6.0 | 35.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.065 | 0.0 | 0.0 | 0.0 | 0.0 | 2.276 | 15.0 | 95.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.42 | 0.0 | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16 | ... | 0.0 | 0.188 | 0.0 | 0.315 | 0.052 | 0.0 | 3.706 | 43.0 | 266.0 | 1.0 |
| max | 4.54 | 14.28 | 5.1 | 42.81 | 10.0 | 5.88 | 7.27 | 11.11 | 5.26 | 18.18 | ... | 4.385 | 9.752 | 4.081 | 32.478 | 6.003 | 19.829 | 1102.5 | 9989.0 | 15841.0 | 1.0 |


In [53]:
spam_count = len(spambase[spambase.spamclass==1])
non_spam_count = len(spambase[spambase.spamclass==0])
total = spam_count + non_spam_count

print("Spam percentage:", (spam_count/total)*100)

Spam percentage: 39.404477287546186


After loading in the data, we see that about 40% of the dataset is spam. 
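The same proportion can also be read off with `value_counts(normalize=True)`. The following is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and only stands in for the real dataframe.

```python
import pandas as pd

# Hypothetical toy labels (not the real data): 2 spam rows (1) and 3 non-spam rows (0).
toy = pd.DataFrame({"spamclass": [1, 0, 0, 1, 0]})

# Fraction of rows in each class, indexed by class label.
class_share = toy["spamclass"].value_counts(normalize=True)

# Look up the share of the spam class by label and express it as a percentage.
spam_pct = class_share.loc[1] * 100
print("Spam percentage:", spam_pct)
```

Because the label is coded as 0/1, the mean of the `spamclass` column gives the same fraction, which is why the `mean` row of `describe()` above shows 0.394045.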

### Decision Tree Training

I start by splitting the data, with 70% set aside for training and 15% each for validation and testing:

In [67]:
N = len(spambase)
train_num = int(0.7 * N)
val_num = int(0.15 * N)
test_num = N - train_num - val_num

In [70]:
# Hold out the test rows first, then split the remainder into training and validation sets.
train_set, test_set = train_test_split(spambase, test_size=test_num, random_state=8)
train_set, val_set = train_test_split(train_set, test_size=val_num, random_state=88)
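As a sanity check on the two-stage split, the sketch below repeats it on a synthetic frame with the same number of rows and confirms the resulting sizes. The `demo` frame and the `*_part` names are hypothetical stand-ins, not the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataframe: 4601 rows, one feature, one 0/1 label.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.random(4601), "spamclass": rng.integers(0, 2, 4601)})

n = len(demo)
train_num = int(0.7 * n)
val_num = int(0.15 * n)
test_num = n - train_num - val_num  # the test set absorbs the rounding remainder

# Stage 1: hold out the test rows. Stage 2: carve the validation rows out of the rest.
rest, test_part = train_test_split(demo, test_size=test_num, random_state=8)
train_part, val_part = train_test_split(rest, test_size=val_num, random_state=88)

print(len(train_part), len(val_part), len(test_part))
```

With 4601 rows this gives 3220 training, 690 validation, and 691 test rows. The training report further down lists a support of 2530 rather than 3220, which would be consistent with the second split having been run twice on `train_set` (3220 - 690 = 2530).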

With the data split, we can fit the decision tree on the training set and then predict the spam class for each of the three sets:

In [72]:
train_set_class = train_set['spamclass']
train_set_vars = train_set.drop(labels='spamclass', axis=1)
dt = tree.DecisionTreeClassifier(criterion="entropy", random_state=88)
dt_fit = dt.fit(train_set_vars, train_set_class)
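The `criterion="entropy"` setting means candidate splits are scored by how much they reduce the entropy of the class labels. As a rough illustration of that quantity, here is a small sketch that computes Shannon entropy in bits; `entropy_bits` is a hypothetical helper written for this example and is not part of `sklearn`.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (base-2 logarithm) of a 1-D collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

mixed = entropy_bits([0, 1, 0, 1])  # evenly mixed node: maximally impure for two classes
pure = entropy_bits([1, 1, 1, 1])   # single-class node: no impurity
print(mixed, pure)
```

An evenly mixed two-class node has an entropy of 1 bit and a pure node has an entropy of 0, so a good split is one whose child nodes are closer to pure than the parent.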

In [80]:
def model_summary(actual, pred):
    # Rows are actual classes, columns are predicted classes.
    # With labels=[1, 0], index 0 is spam and index 1 is not spam.
    cm = sm.confusion_matrix(actual, pred, labels=[1, 0])
    print("True positives:", cm[0,0])
    print("False positives:", cm[1,0])
    print("True negatives:", cm[1,1])
    print("False negatives:", cm[0,1])
    print(sm.classification_report(actual, pred, labels=[1,0], target_names=["Spam", "Not spam"]))
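To make the indexing in `model_summary` concrete, the sketch below builds the same kind of confusion matrix from a handful of made-up labels and derives precision and recall for the spam class by hand. The `actual` and `pred` lists are hypothetical.

```python
import sklearn.metrics as sm

# Hypothetical labels: 1 = spam, 0 = not spam.
actual = [1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes.
# labels=[1, 0] puts spam at index 0 and not spam at index 1.
cm = sm.confusion_matrix(actual, pred, labels=[1, 0])
tp, fn = cm[0, 0], cm[0, 1]  # actual spam: caught vs. missed
fp, tn = cm[1, 0], cm[1, 1]  # actual not spam: wrongly flagged vs. correctly passed

precision = tp / (tp + fp)  # of everything flagged as spam, how much really was spam
recall = tp / (tp + fn)     # of all real spam, how much was flagged
print(tp, fp, tn, fn, precision, recall)
```

Applied to the validation counts reported below (270 true positives, 21 false positives, 36 false negatives), the same formulas give a spam precision of 270/291 and a recall of 270/306, which match the 0.93 and 0.88 in the classification report.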

In [81]:
dt_train = dt_fit.predict(train_set_vars)
model_summary(train_set_class, dt_train)

True positives: 955
False positives: 0
True negatives: 1574
False negatives: 1
             precision    recall  f1-score   support

       Spam       1.00      1.00      1.00       956
   Not spam       1.00      1.00      1.00      1574

avg / total       1.00      1.00      1.00      2530



In [82]:
val_set_class = val_set['spamclass']
val_set_vars = val_set.drop(labels='spamclass', axis=1)

dt_val = dt_fit.predict(val_set_vars)
model_summary(val_set_class, dt_val)

True positives: 270
False positives: 21
True negatives: 363
False negatives: 36
             precision    recall  f1-score   support

       Spam       0.93      0.88      0.90       306
   Not spam       0.91      0.95      0.93       384

avg / total       0.92      0.92      0.92       690



In [84]:
test_set_class = test_set['spamclass']
test_set_vars = test_set.drop(labels='spamclass', axis=1)

dt_test = dt_fit.predict(test_set_vars)
model_summary(test_set_class, dt_test)

True positives: 246
False positives: 32
True negatives: 380
False negatives: 33
             precision    recall  f1-score   support

       Spam       0.88      0.88      0.88       279
   Not spam       0.92      0.92      0.92       412

avg / total       0.91      0.91      0.91       691



In [88]:
# Pair each feature name with its importance in the fitted tree.
importances = dt_fit.feature_importances_
fi = pd.DataFrame({'Var': train_set_vars.columns.values, 'Imp': importances})

fi.sort_values('Imp', ascending=False).head(10)

| | Var | Imp |
|---|---|---|
| 52 | char_freq_$ | 0.262627 |
| 6 | word_freq_remove | 0.134747 |
| 51 | char_freq_! | 0.104697 |
| 24 | word_freq_hp | 0.069772 |
| 55 | capital_run_length_longest | 0.057546 |
| 26 | word_freq_george | 0.044251 |
| 45 | word_freq_edu | 0.037727 |
| 18 | word_freq_you | 0.036291 |
| 54 | capital_run_length_average | 0.03308 |
| 56 | capital_run_length_total | 0.031777 |
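The importances reported by a fitted tree are normalized to sum to 1, so each value can be read as a feature's share of the total impurity reduction. A minimal sketch on made-up data, where only one feature carries any information, shows the ranking pattern; `X`, `y`, `signal`, and `noise` are hypothetical names for this example.

```python
import pandas as pd
from sklearn import tree

# Hypothetical toy data: "signal" fully determines the label, "noise" is constant.
X = pd.DataFrame({"signal": [0, 0, 1, 1, 0, 1], "noise": [5, 5, 5, 5, 5, 5]})
y = [0, 0, 1, 1, 0, 1]

clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=88).fit(X, y)

# Index the importances by feature name and rank them from most to least important.
ranked = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked)
```

Indexing a `Series` by column name keeps the feature names attached through the sort, which is a compact alternative to building a two-column dataframe.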


### Conclusion
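Overall, the decision tree model was fairly effective. On the held-out test set, the average precision, recall, and F1 across both classes were each about 0.91, with precision and recall of 0.88 for the spam class itself. The tree scored essentially 1.00 on the training set but about 0.92 on the validation set, so it is overfitting the training data to some degree.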