<a href="https://colab.research.google.com/github/abnormalPotassium/DATA620/blob/main/Assignment%208/Assignment8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 8: Document Classification

By: Al Haque, Taha Ahmad


---
## Goal

This assignment's goal is to:
-  Use the [UCI spam dataset](http://archive.ics.uci.edu/ml/datasets/Spambase) to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

---
## Package Installation

Any packages that need to be installed for working on the classifiers can be added in the code block below. The initial assumption was that we would simply need nltk and possibly pandas; ucimlrepo is added for downloading the dataset.

In [None]:
!pip install nltk
!pip install pandas
!pip install ucimlrepo


---
## Dataset Loading

[Our data](http://archive.ics.uci.edu/ml/datasets/Spambase) is a set of emails collected by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt in 1999, where each email is classified as either spam or ham (legitimate). There are 4601 emails in total, with 58 variables per email: 57 extracted features plus our target variable, the classification as spam (1) or ham (0).

The ucimlrepo package allows for directly downloading and accessing this dataset, which we do below. Note that the dataset loading process is largely identical to the sample code provided in the [UCI archive](http://archive.ics.uci.edu/dataset/94/spambase).

After loading, we display a preview of the feature dataframe, which consists of variables extracted from processing each email, such as word frequency, character frequency, and capital-letter run length information.

In [3]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
spambase = fetch_ucirepo(id=94)

# data (as pandas dataframes)
X = spambase.data.features
y = spambase.data.targets

X

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.0,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.0,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.0,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.0,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78


Looking at the target variable dataframe, we see that our dataset consists of about 60% ham emails (class 0) and 40% spam emails (class 1). It is slightly unbalanced, but not by enough that we would want to under- or oversample.

In [4]:
y.value_counts(normalize = True)*100

Class
0        60.595523
1        39.404477
Name: proportion, dtype: float64

Finally, to finish loading the data, we split our dataset into an 80-20 train-test split so we can evaluate our final model on data that it was not trained on.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
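
Because the classes are split roughly 60/40, one optional refinement is to stratify the split so that both partitions keep that ratio. The sketch below uses small synthetic arrays (`features` and `labels` are stand-ins, not the Spambase data) together with the `stratify` argument of `train_test_split`. The results reported in this notebook were produced without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 60% class 0 (ham) and 40% class 1 (spam).
features = np.arange(200).reshape(100, 2)
labels = np.array([0] * 60 + [1] * 40)

# stratify=labels keeps the class ratio the same in both partitions.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

train_spam_share = l_train.mean()
test_spam_share = l_test.mean()
print(f"train spam share: {train_spam_share:.2f}, test spam share: {test_spam_share:.2f}")
```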

---
## Model Building

We begin our model building process by fitting a multinomial naive Bayes model to our training data, evaluated with 5-fold cross-validation. This should be the simplest way of classifying our data.

In [21]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

mnb = MultinomialNB()
cvs = cross_val_score(mnb, X_train, y_train.values.ravel(), cv = 5)

print(f"Our full featured naive-bayes model gives us a {cvs.mean()*100:0.1f}% accuracy with a standard deviation of {cvs.std()*100:0.1f}%")

Our full featured naive-bayes model gives us a 79.6% accuracy with a standard deviation of 1.3%


Next we want to try a maximum entropy model (also known as a logistic regression model) with all features intact, since the large number of variables we have should suit this kind of model.

In [25]:
from sklearn.linear_model import LogisticRegression

lgr = LogisticRegression(max_iter = 1000)
cvs = cross_val_score(lgr, X_train, y_train.values.ravel(), cv = 5)

print(f"Our full featured logistic regression model gives us a {cvs.mean()*100:0.1f}% accuracy with a standard deviation of {cvs.std()*100:0.1f}%")

Our full featured logistic regression model gives us a 92.6% accuracy with a standard deviation of 1.2%


As we can see, utilizing a maximum entropy model on all the features gives us a fairly accurate result right out of the gate. It's likely that this is the model we would want to develop on further rather than the naive-bayes model.

Trying to improve on the model further, we attempt recursive feature elimination with cross-validation (RFECV), which gives us a five-fold cross-validated accuracy score for each candidate number of features within our maximum entropy model. We end up with an identical accuracy and standard deviation, which likely means that feature selection does not change much for this model. Note that the figures printed below are read from the last entry of cv_results_, which corresponds to the full feature set.

In [56]:
from sklearn.feature_selection import RFECV

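# Recursive feature elimination with 5-fold cross-validation
rfe = RFECV(lgr, cv = 5)
rfe = rfe.fit(X_train, y_train.values.ravel())

# cv_results_ is ordered from fewest to most features, so [-1] reads the
# score with all features retained
mean = rfe.cv_results_["mean_test_score"][-1]
std = rfe.cv_results_["std_test_score"][-1]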

print(f"Our feature selected logistic regression model gives us a {mean*100:0.1f}% accuracy with a standard deviation of {std*100:0.1f}%")

Our feature selected logistic regression model gives us a 92.6% accuracy with a standard deviation of 1.2%


If we want to see what exactly our feature selection did to our model, we can compare the number of input features with the number used in the final model, together with the feature rankings: features ranked 1 are kept, and any feature with a higher rank number is excluded. We see that a single feature is not included in the feature-selected model.

In [76]:
print(f"There are {rfe.n_features_in_ - rfe.n_features_} less features in the feature selected model")
print()
print(rfe.ranking_)

There are 1 less features in the feature selected model

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2]


Since the ranking list follows the order of the columns in our original data frame, we see that capital_run_length_total is the one feature that was eliminated, suggesting it adds little for determining whether an email message is spam once the other features are present. All of the other features were retained.

In [75]:
list(X_train.columns)[-1]

'capital_run_length_total'
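
Rather than reading the position off the printed array, the rankings can be paired with the column names directly. The sketch below uses a shortened, illustrative pair of lists (`columns` and `ranking` are stand-ins shaped like `X_train.columns` and `rfe.ranking_`, not the full 57 entries).

```python
# Illustrative stand-ins shaped like X_train.columns and rfe.ranking_.
columns = [
    "word_freq_make",
    "char_freq_!",
    "capital_run_length_longest",
    "capital_run_length_total",
]
ranking = [1, 1, 1, 2]

# Rank 1 marks a selected feature; any higher rank marks an eliminated one.
selected = [name for name, rank in zip(columns, ranking) if rank == 1]
eliminated = [name for name, rank in zip(columns, ranking) if rank > 1]

print(f"kept {len(selected)} features, eliminated: {eliminated}")
```

With the fitted objects from this notebook, the same comprehension could be run over `X_train.columns` and `rfe.ranking_`.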

With few further changes available to make to a maximum entropy model, and with it outclassing our naive Bayes model, we take it forward to evaluate against our test data.

---
## Conclusion

Here we test the final version of our classifier against the test set to evaluate it.

In [82]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict on the test data
y_pred = rfe.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:0.2f}%")
print("")

# Print classification report
print(classification_report(y_test, y_pred))

# Print confusion matrix
print(confusion_matrix(y_test, y_pred))

Accuracy: 92.40%

              precision    recall  f1-score   support

           0       0.92      0.96      0.94       531
           1       0.94      0.88      0.91       390

    accuracy                           0.92       921
   macro avg       0.93      0.92      0.92       921
weighted avg       0.92      0.92      0.92       921

[[508  23]
 [ 47 343]]


We see that our accuracy has only decreased by about 0.2 percentage points against the test set, indicating that our model generalizes well to data it hasn't seen before. Comparing the recall for spam (0.88) with the recall for ham (0.96), our model is somewhat biased towards classifying spam emails as ham, which is the preferable direction of error for a spam classifier. Letting more spam through is acceptable if it means fewer legitimate emails are blocked.
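
The figures in the classification report can be checked by hand from the confusion matrix. The short sketch below redoes that arithmetic using the four counts printed above, relying on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are ours.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 508, 23   # true ham:  predicted ham, predicted spam
fn, tp = 47, 343   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

ham_precision = tn / (tn + fn)   # of emails predicted ham, share that are ham
ham_recall = tn / (tn + fp)      # of true ham emails, share predicted ham
spam_precision = tp / (tp + fp)  # of emails predicted spam, share that are spam
spam_recall = tp / (tp + fn)     # of true spam emails, share predicted spam

print(f"accuracy:       {accuracy*100:0.2f}%")
print(f"ham  precision: {ham_precision:0.2f}  recall: {ham_recall:0.2f}")
print(f"spam precision: {spam_precision:0.2f}  recall: {spam_recall:0.2f}")
```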

Ideally, to extend this model building exercise, we would change the model's classification threshold to require greater confidence before classifying an email as spam, so that the precision of spam predictions (currently 0.94) is in the realm of 95% or higher and fewer legitimate emails are flagged.
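
A minimal sketch of that idea is shown below. It assumes the spam probabilities have already been obtained, for example from the second column of the classifier's `predict_proba` output. The function name `classify_with_threshold`, the probability values, and the 0.8 cutoff are all hypothetical.

```python
def classify_with_threshold(spam_probabilities, threshold=0.5):
    """Label an email spam (1) only when its spam probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in spam_probabilities]

# Hypothetical spam probabilities for five emails.
probabilities = [0.05, 0.45, 0.55, 0.72, 0.97]

default_labels = classify_with_threshold(probabilities)      # [0, 0, 1, 1, 1]
strict_labels = classify_with_threshold(probabilities, 0.8)  # [0, 0, 0, 0, 1]

print("default threshold:", default_labels)
print("strict threshold: ", strict_labels)
```

Raising the threshold trades spam recall for spam precision, so a suitable cutoff would need to be chosen on validation data rather than on the test set.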

---
## Video Presentation

The code below allows a YouTube link to the video presentation to be inserted for the url variable and will then display the YouTube video within the notebook itself.

A regex match extracts the video ID from the URL, which is then fed into the IPython package's built-in YouTube embedder.

In [83]:
url = "https://youtu.be/0QDxB-v_QAw"

In [84]:
from IPython.display import YouTubeVideo
import re

reg = r"(?:v=|\/)([0-9A-Za-z_-]{11}).*"
urlid = re.search(reg, url)[1]

YouTubeVideo(urlid, width=800, height=450)
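
As a quick check of the pattern, the sketch below wraps the same regex in a small helper and tries it on the short link used above, the equivalent long-form `watch?v=` link, and a string with no video ID. The helper name `extract_video_id` is ours, and returning `None` on a failed match is an added safeguard rather than part of the original cell.

```python
import re

reg = r"(?:v=|\/)([0-9A-Za-z_-]{11}).*"

def extract_video_id(url):
    """Return the 11-character video ID, or None when the pattern does not match."""
    match = re.search(reg, url)
    return match[1] if match else None

short_id = extract_video_id("https://youtu.be/0QDxB-v_QAw")
long_id = extract_video_id("https://www.youtube.com/watch?v=0QDxB-v_QAw")
missing_id = extract_video_id("not a video link")

print(short_id, long_id, missing_id)
```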