<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#About-this-Assignment" data-toc-modified-id="About-this-Assignment-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>About this Assignment</a></span></li><li><span><a href="#Load-and-Prepare-Data" data-toc-modified-id="Load-and-Prepare-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load and Prepare Data</a></span></li><li><span><a href="#Split-data-to-Train-and-Test-sets" data-toc-modified-id="Split-data-to-Train-and-Test-sets-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Split data to Train and Test sets</a></span></li><li><span><a href="#Multinomial-Naive-Bayes" data-toc-modified-id="Multinomial-Naive-Bayes-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Multinomial Naive Bayes</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#XGBoost" data-toc-modified-id="XGBoost-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>XGBoost</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

### About this Assignment

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)
<br><br>
For this project, we'll use the above dataset to predict the class of new documents withheld from the training dataset.


### Load and Prepare Data

The dataset has 58 columns: 57 predictor attributes followed by one class label. <br>

<I>Columns 1 thru 48:</I> continuous real attributes of type word_freq_WORD. Percentage of words in the e-mail that match WORD, i.e. 100 * (number of times WORD appears in the e-mail) / total number of words in the e-mail.

<I>Columns 49 thru 54:</I> continuous real attributes of type char_freq_CHAR. Percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total number of characters in the e-mail.

<I>Column 55:</I> attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters

<I>Column 56:</I> attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters

<I>Column 57:</I> attribute of type capital_run_length_total = total number of capital letters in the e-mail

<I>Column 58:</I> attribute of type spam -- denotes whether the e-mail was considered spam (1) or not (0)
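
The two percentage formulas above can be sketched in a few lines of Python. This is only an illustration: the helper names `word_freq` and `char_freq` and the simple alphanumeric tokenization are assumptions, and the Spambase authors' exact tokenization rules may differ.

```python
import re

def word_freq(text: str, word: str) -> float:
    """100 * (occurrences of `word`) / (total words), case-insensitive.

    Words are taken to be runs of letters or digits (an assumption).
    """
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100 * words.count(word.lower()) / len(words)

def char_freq(text: str, char: str) -> float:
    """100 * (occurrences of `char`) / (total characters)."""
    if not text:
        return 0.0
    return 100 * text.count(char) / len(text)

sample = "Free money! Claim your free prize now!"
print(word_freq(sample, "free"))  # 2 of the 7 words are "free"
print(char_freq(sample, "!"))     # 2 of the characters are "!"
```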

In [1]:
import pandas as pd
import numpy as np
import nltk

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import warnings
warnings.filterwarnings('ignore')

In [2]:
%cd C:\Users\user\Documents\00_Applications_DataScience\CUNY\DATA620\KJW_CUNY_DATA_620\Week5_Part2 (Document Classification)

C:\Users\user\Documents\00_Applications_DataScience\CUNY\DATA620\KJW_CUNY_DATA_620\Week5_Part2 (Document Classification)


In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [4]:
#Load the Spambase dataset from UCI
data = pd.read_csv('spambase.data', header=None)

In [5]:
#Construct a header and apply it to the dataframe
word_freq = ['word_freq' + str(r+1) for r in data.columns[0:48]]
char_freq = ['char_freq' + str(r+1) for r in data.columns[48:54]]

data_header = word_freq + char_freq + ['cap_letter_avg'] + ['cap_letter_longest'] + ['cap_letter_total'] + ['spam_indicator']

data.columns = data_header

In [6]:
data.head()

Unnamed: 0,word_freq1,word_freq2,word_freq3,word_freq4,word_freq5,word_freq6,word_freq7,word_freq8,word_freq9,word_freq10,word_freq11,word_freq12,word_freq13,word_freq14,word_freq15,word_freq16,word_freq17,word_freq18,word_freq19,word_freq20,word_freq21,word_freq22,word_freq23,word_freq24,word_freq25,word_freq26,word_freq27,word_freq28,word_freq29,word_freq30,word_freq31,word_freq32,word_freq33,word_freq34,word_freq35,word_freq36,word_freq37,word_freq38,word_freq39,word_freq40,word_freq41,word_freq42,word_freq43,word_freq44,word_freq45,word_freq46,word_freq47,word_freq48,char_freq49,char_freq50,char_freq51,char_freq52,char_freq53,char_freq54,cap_letter_avg,cap_letter_longest,cap_letter_total,spam_indicator
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.64,0.0,0.0,0.0,0.32,0.0,1.29,1.93,0.0,0.96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


### Split data to Train and Test sets

In [7]:
# Let X hold the 57 predictor columns and y the spam indicator column that we are trying to predict
X = data.iloc[:, 0:57]
y = data.spam_indicator

In [8]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 53)
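
As a quick sanity check on the split sizes, the arithmetic can be sketched as below. It assumes the Spambase file contains 4601 rows (the count listed by UCI, not printed in this notebook) and that a fractional `test_size` is rounded up to a whole number of rows, which is how `train_test_split` is understood to behave.

```python
import math

n_samples = 4601   # assumed row count of spambase.data (from the UCI description)
test_size = 0.33   # same fraction passed to train_test_split above

# Assumption: the test-set size is rounded up, and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under these assumptions the test set has 1519 rows, which agrees with the totals of the confusion matrices printed later in the notebook.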

### Multinomial Naive Bayes

In [9]:
# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

In [10]:
# Fit the classifier to the training data
nb_classifier.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [11]:
# Create the predicted tags
pred = nb_classifier.predict(X_test)

In [12]:
# Calculate the accuracy score
score = accuracy_score(y_test,pred)
print(score)

0.7972350230414746


In [13]:
# Calculate the confusion matrix: cm
cm = confusion_matrix(y_test,pred,labels=[0,1])
print(cm)

[[789 149]
 [159 422]]
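
The accuracy score can be recovered from the confusion matrix, along with precision and recall for the spam class. The sketch below reuses the counts printed above; with `labels=[0,1]`, rows are true classes and columns are predicted classes. The variable names are illustrative only.

```python
# Confusion matrix for Multinomial Naive Bayes, copied from the output above.
# Layout with labels=[0, 1]: [[TN, FP], [FN, TP]]
nb_cm = [[789, 149],
         [159, 422]]

(tn, fp), (fn, tp) = nb_cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # share of all e-mails classified correctly
precision = tp / (tp + fp)          # share of predicted spam that really is spam
recall = tp / (tp + fn)             # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```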


### Logistic Regression

The Towards Data Science article ["Building a Logistic Regression Model in Python, Step by Step"](https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8) was helpful for this section.

In [14]:
# Instantiate a Logistic Regression classifier: lr_classifier
lr_classifier = LogisticRegression()

In [15]:
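# Wrap the classifier in recursive feature elimination (RFE) and fit it to the training data.
# n_features_to_select=57 keeps every predictor column, so no feature is eliminated here.
rfe = RFE(lr_classifier, n_features_to_select=57)
rfe = rfe.fit(X_train, y_train)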

print(rfe.support_)
print(rfe.ranking_)

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
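
Every entry of `support_` is `True` and every rank is 1 because RFE was asked to keep 57 of the 57 available columns. The sketch below illustrates the difference on a small synthetic dataset (generated with `make_classification`, not the Spambase data): asking for all features keeps everything, while asking for fewer forces a ranking. The dataset sizes and the `max_iter` setting are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 10 features, of which 4 are informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_redundant=0, random_state=0,
)

# Keeping all 10 features: nothing is eliminated, so every rank is 1.
keep_all = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=10).fit(X_demo, y_demo)

# Keeping only 4 features: the other 6 are eliminated one at a time.
keep_four = RFE(LogisticRegression(max_iter=1000),
                n_features_to_select=4).fit(X_demo, y_demo)

print(keep_all.support_.sum(), keep_all.ranking_)
print(keep_four.support_.sum(), keep_four.ranking_)
```

Rerunning the notebook's RFE step with a smaller `n_features_to_select` would be one way to see which Spambase columns the logistic regression model relies on most.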


In [16]:
# Create the predicted tags
pred = rfe.predict(X_test)

In [17]:
# Calculate the accuracy score
rfe_score = accuracy_score(y_test,pred)
print(rfe_score)

0.9374588545095458


In [18]:
# Calculate the confusion matrix: lr_cm
lr_cm = confusion_matrix(y_test,pred,labels=[0,1])
print(lr_cm)

[[901  37]
 [ 58 523]]


### XGBoost

In [19]:
# Instantiate an XGBoost classifier: xgb_classifier
xgb_classifier = XGBClassifier()

In [20]:
# Fit the classifier to the training data
xgb_classifier.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [21]:
# Create the predicted tags
pred = xgb_classifier.predict(X_test)

In [22]:
# Calculate the accuracy score
xgb_score = accuracy_score(y_test,pred)
print(xgb_score)

0.9578670177748518


In [23]:
# Calculate the confusion matrix: cm
xgb_cm = confusion_matrix(y_test,pred,labels=[0,1])
print(xgb_cm)

[[917  21]
 [ 43 538]]


### Conclusion

Of the three models (Naive Bayes, Logistic Regression, and XGBoost), Logistic Regression and XGBoost both performed well, with accuracy scores of 93.7% and 95.8% respectively, compared with 79.7% for Naive Bayes.  The false positive (FP) and false negative (FN) counts in the confusion matrices of these two models were close: XGBoost had 16 fewer FP (21 vs. 37) and 15 fewer FN (43 vs. 58) than Logistic Regression.
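
The comparison can be reproduced directly from the three confusion matrices printed in the sections above. This is a small sketch; the dictionary and variable names are illustrative and not part of the notebook.

```python
# Confusion matrices copied from the notebook output.
# Layout with labels=[0, 1]: [[TN, FP], [FN, TP]]
matrices = {
    "Naive Bayes":         [[789, 149], [159, 422]],
    "Logistic Regression": [[901, 37], [58, 523]],
    "XGBoost":             [[917, 21], [43, 538]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    summary[name] = {"accuracy": accuracy, "fp": fp, "fn": fn}
    print(f"{name:<20} accuracy={accuracy:.4f}  FP={fp:>3}  FN={fn:>3}")

# How many fewer errors XGBoost makes than Logistic Regression.
fp_gap = summary["Logistic Regression"]["fp"] - summary["XGBoost"]["fp"]
fn_gap = summary["Logistic Regression"]["fn"] - summary["XGBoost"]["fn"]
print(f"XGBoost vs. Logistic Regression: {fp_gap} fewer FP, {fn_gap} fewer FN")
```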

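The feature selection step for the Logistic Regression model kept all 57 predictor columns.  Note that recursive feature elimination was configured to select 57 features, which is the full set, so it retained every column by construction (all `support_` values are True and all rankings are 1).  No data columns were dropped, but this result does not by itself show that every column is relevant; rerunning the selection with a smaller target number of features would be needed to judge that.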

If this were a real-world project, XGBoost would be the model to evaluate against additional datasets and ultimately carry forward into a production environment.