## Assignment 5-6: Document Classification
#### Summer 2021
**Authors:** GOAT Team (Estaban Aramayo, Ethan Haley, Claire Meyer, and Tyler Frankenburg)

In this assignment, we'll ingest a dataset on spam emails from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Spambase) and build a classifier that uses the included features to predict whether each row represents a spam email.

In [152]:
import pandas as pd 
import re
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

##### Configuring the Spam dataset

First we'll read the email data from a CSV file that has no header row into a DataFrame; we'll add the column names from the names file later.

In [94]:
spam_data = pd.read_csv("https://raw.githubusercontent.com/ebhtra/gory-graph/main/DocumentClassification/spambase.data",header=0)


In [95]:
spam_data.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1
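
The column labels in the output above are data values (0.64, 3.756, 61, 278, 1), which suggests that `header=0` consumed the first record as a header. The following is a minimal sketch of the difference, assuming a headerless file; the two-record CSV text is made up for illustration and is not the Spambase data.

```python
import io
import pandas as pd

# Hypothetical two-record CSV with no header row (illustrative values only).
csv_text = "0.0,0.64,1\n0.21,0.28,1\n"

# header=0 treats the first record as column labels, so one row is lost.
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

# header=None keeps every record and assigns integer column labels.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(len(with_header), len(no_header))
```

If the same switch were made in the cell above, the row counts and scores later in this notebook would likely shift slightly, since one more record would be included.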


Then we open the names file, which includes names for all our eventual features.

In [97]:
# open() only reads local files; the same file is hosted at
# https://raw.githubusercontent.com/ebhtra/gory-graph/main/DocumentClassification/spambase.names
with open("/Users/clairemeyer/Downloads/spambase/spambase.names", "r") as names_file:
    f = names_file.read()

In [98]:
print(f)

| SPAM E-MAIL DATABASE ATTRIBUTES (in .names format)
|
| 48 continuous real [0,100] attributes of type word_freq_WORD 
| = percentage of words in the e-mail that match WORD,
| i.e. 100 * (number of times the WORD appears in the e-mail) / 
| total number of words in e-mail.  A "word" in this case is any 
| string of alphanumeric characters bounded by non-alphanumeric 
| characters or end-of-string.
|
| 6 continuous real [0,100] attributes of type char_freq_CHAR
| = percentage of characters in the e-mail that match CHAR,
| i.e. 100 * (number of CHAR occurences) / total characters in e-mail
|
| 1 continuous real [1,...] attribute of type capital_run_length_average
| = average length of uninterrupted sequences of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_longest
| = length of longest uninterrupted sequence of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_total
| = sum of length of uninterrupted sequences of

With `re.findall()`, we can pattern-match the feature names in the file, skipping the documentation text, and use them as column names.

In [100]:
colnames = re.findall("word_freq_[a-z]*:|word_freq_[a-z]*[0-9]*[a-z]:|word_freq_[a-z]*[0-9]*:|char_freq_.:|capital_run_length_[a-z]*:",f)

In [101]:
colnames
colnames.append("spam")

In [102]:
print(len(colnames))

58
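
The pattern above keeps the trailing colon on each name. As a possible alternative, a single capture group lets `findall()` return the name without the colon. This is only a sketch: the `sample` text is an illustrative excerpt written in the style of a `.names` file, not the full file, and `pattern` is a hypothetical replacement rather than the one used in this notebook.

```python
import re

# Illustrative excerpt in the style of a .names file (not the full file).
sample = """| comment lines start with a pipe
word_freq_make:         continuous.
word_freq_3d:           continuous.
char_freq_;:            continuous.
capital_run_length_average: continuous.
"""

# One capture group, so findall() returns only the name without the colon.
pattern = r"^((?:word|char)_freq_[^:\s]+|capital_run_length_[a-z]+):"
names = re.findall(pattern, sample, flags=re.MULTILINE)
names.append("spam")  # target column, as in the cell above

print(names)
```

Anchoring at the start of each line with `re.MULTILINE` also avoids matching feature names that appear inside the documentation comments.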


In [103]:
spam_data.columns = colnames

In [104]:
spam_data.head()

Unnamed: 0,word_freq_make:,word_freq_address:,word_freq_all:,word_freq_3d:,word_freq_our:,word_freq_over:,word_freq_remove:,word_freq_internet:,word_freq_order:,word_freq_mail:,...,char_freq_;:,char_freq_(:,char_freq_[:,char_freq_!:,char_freq_$:,char_freq_#:,capital_run_length_average:,capital_run_length_longest:,capital_run_length_total:,spam
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


##### Exploring the Data

Let's look at class balance, as well as descriptive statistics of each field.

In [109]:
spam_data.describe()

Unnamed: 0,word_freq_make:,word_freq_address:,word_freq_all:,word_freq_3d:,word_freq_our:,word_freq_over:,word_freq_remove:,word_freq_internet:,word_freq_order:,word_freq_mail:,...,char_freq_;:,char_freq_(:,char_freq_[:,char_freq_!:,char_freq_$:,char_freq_#:,capital_run_length_average:,capital_run_length_longest:,capital_run_length_total:,spam
count,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,...,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0
mean,0.104576,0.212922,0.280578,0.065439,0.312222,0.095922,0.114233,0.105317,0.090087,0.239465,...,0.038583,0.139061,0.01698,0.26896,0.075827,0.044248,5.191827,52.17087,283.290435,0.393913
std,0.305387,1.2907,0.50417,1.395303,0.672586,0.27385,0.39148,0.401112,0.278643,0.644816,...,0.243497,0.270377,0.109406,0.815726,0.245906,0.429388,31.732891,194.912453,606.413764,0.488669
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.2755,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.3825,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.31425,0.052,0.0,3.70525,43.0,265.25,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


In [110]:
spam_data['spam'].value_counts()

0    2788
1    1812
Name: spam, dtype: int64

This dataset contains just under 40% spam, so the classes are not perfectly balanced.
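
The share can also be read off directly with `value_counts(normalize=True)`. Here is a small sketch that rebuilds a label series from the counts printed above; the `labels` series is a stand-in for `spam_data['spam']`.

```python
import pandas as pd

# Counts taken from the value_counts() output above.
labels = pd.Series([0] * 2788 + [1] * 1812, name="spam")

# normalize=True returns class shares instead of raw counts.
shares = labels.value_counts(normalize=True)
print(shares.round(4))
```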

##### Building the classifier

We're going to experiment with scikit-learn classification models. To start, we'll split our target, y, from our features, X.

In [111]:
y = spam_data.iloc[:,57]
X = spam_data.iloc[:,:57]

In [115]:
print(X.head())
print(y.head())

   word_freq_make:  word_freq_address:  word_freq_all:  word_freq_3d:  \
0             0.21                0.28            0.50            0.0   
1             0.06                0.00            0.71            0.0   
2             0.00                0.00            0.00            0.0   
3             0.00                0.00            0.00            0.0   
4             0.00                0.00            0.00            0.0   

   word_freq_our:  word_freq_over:  word_freq_remove:  word_freq_internet:  \
0            0.14             0.28               0.21                 0.07   
1            1.23             0.19               0.19                 0.12   
2            0.63             0.00               0.31                 0.63   
3            0.63             0.00               0.31                 0.63   
4            1.85             0.00               0.00                 1.85   

   word_freq_order:  word_freq_mail:  ...  word_freq_conference:  \
0              0.00     

Now we'll split into test and train sets using sklearn's built-in splitting function. We'll choose a 30% holdout for testing.

In [121]:
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.3, random_state=1234)
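
Because the classes are not perfectly balanced, one option is to stratify the split so that both partitions keep the same spam share. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names); the notebook's own split does not use `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly the same 60/40 class mix.
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 5))
y_demo = np.array([0] * 600 + [1] * 400)

# stratify=y_demo keeps the class shares equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1234, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```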

Then we can build our classifiers. We'll build a Logistic Regression, a Decision Tree, a Random Forest, and a Naive Bayes model and compare their accuracies.

In [154]:
# Despite the variable name, this is a logistic regression classifier.
linear_reg = LogisticRegression().fit(X_train, y_train)
lin_pred = linear_reg.predict(X_test)
lin_score = accuracy_score(y_test, lin_pred)
print(lin_score)

0.9152173913043479


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
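
The warning above says the solver stopped before converging and suggests scaling the data or raising `max_iter`. A minimal sketch of both remedies follows, assuming synthetic stand-in data from `make_classification` rather than the spam features; `scaled_logreg` is a hypothetical name, and the accuracy it prints is not comparable to the score above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spam features (57 columns, binary target).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=57, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1234
)

# Standardize the features, then fit with a larger iteration budget.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

demo_score = scaled_logreg.score(X_te, y_te)
print(demo_score)
```

Putting the scaler inside the pipeline means it is fit on the training rows only, so no information from the test set leaks into the preprocessing step.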


In [155]:
d_tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
dt_pred = d_tree.predict(X_test)
dt_score = accuracy_score(y_test, dt_pred)
print(dt_score)

0.9036231884057971


In [156]:
randomf = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
rf_pred = randomf.predict(X_test)
rf_score = accuracy_score(y_test, rf_pred)
print(rf_score)

0.9297101449275362


In [157]:
nb = GaussianNB().fit(X_train, y_train)
nb_pred = nb.predict(X_test)
nb_score = accuracy_score(y_test, nb_pred)
print(nb_score)

0.8231884057971014
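
The four cells above repeat the same fit, predict, and score steps. One way to compare the models side by side is a loop over a dictionary of estimators, sketched here on synthetic stand-in data. The `random_state` values on the tree-based models are an addition not present in the cells above, included so that reruns give the same scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in X_train, X_test, y_train, y_test from above.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1234
)

# random_state pins the tree-based models so reruns give the same scores.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(max_depth=5, n_estimators=100, random_state=0),
    "naive_bayes": GaussianNB(),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# Print the models from most to least accurate.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```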
