# Predicting Spam using a Decision Tree

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

## Import Spam Data

__Spam Dataset__

The Spam dataset contains 4601 observations and 58 columns. The column we will predict is labeled 'spam'; the remaining columns are candidate features. The target is a binary variable: spam emails are labeled 1 and non-spam emails are labeled 0.

In [5]:
# Import the dataset (adjust the path to your local copy of spam.csv)
df = pd.read_csv("C:/Users/Christine/Documents/Data Sci/Econ 128/Data files/spam.csv")
df.head()

Unnamed: 0,word_make,word_address,word_all,word_3d,word_our,word_over,word_remove,word_internet,word_order,word_mail,...,char_semicolon,char_leftbrac,char_leftsquarebrac,char_exclaim,char_dollar,char_pound,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0,1,1,0,1,0,0,0,0,0,...,0,0,0,1,0,0,3.756,61,278,1
1,1,1,1,0,1,1,1,1,0,1,...,0,1,0,1,1,1,5.114,101,1028,1
2,1,0,1,0,1,1,1,1,1,1,...,1,1,0,1,1,1,9.821,485,2259,1
3,0,0,0,0,1,0,1,1,1,1,...,0,1,0,1,0,0,3.537,40,191,1
4,0,0,0,0,1,0,1,1,1,1,...,0,1,0,1,0,0,3.537,40,191,1


In [6]:
df.shape

(4601, 58)

In [7]:
# breakdown of spam feature
df['spam'].value_counts()

0    2788
1    1813
Name: spam, dtype: int64
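
The counts above imply that the classes are moderately imbalanced. As a rough sketch, the snippet below converts them to proportions and derives the accuracy of a naive classifier that always predicts the majority class. The names counts, proportions, and majority_baseline are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 2788, 1: 1813}, name="spam")

# Share of each class in the dataset
proportions = counts / counts.sum()

# Accuracy of a naive classifier that always predicts the majority class (non-spam)
majority_baseline = proportions.max()

print(proportions.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any fitted model should beat this baseline to be considered useful.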

## Pre-processing

With df holding the spam.csv data loaded by pandas, we define the following variables:
- X : the feature matrix, excluding the three capital run length variables
- y : the response variable (target)
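
Selecting features by position works as long as the column order never changes. A possible alternative, sketched here on a hypothetical miniature frame (toy, X_toy, y_toy, and capital_cols are illustrative names, not from the notebook), is to drop the unwanted columns by name:

```python
import pandas as pd

# Hypothetical miniature frame with the same kind of columns as the spam data
toy = pd.DataFrame({
    "word_make": [0, 1, 1],
    "char_dollar": [0, 1, 1],
    "capital_run_length_average": [3.756, 5.114, 9.821],
    "capital_run_length_longest": [61, 101, 485],
    "capital_run_length_total": [278, 1028, 2259],
    "spam": [1, 1, 1],
})

# Collect every capital run length column by its name prefix
capital_cols = [c for c in toy.columns if c.startswith("capital_run_length")]

# Feature matrix: everything except the capital run length columns and the target
X_toy = toy.drop(columns=capital_cols + ["spam"]).values
y_toy = toy["spam"].values

print(X_toy.shape)
print(y_toy.shape)
```

This assumes the capital run length columns share the prefix shown in the df.head() output.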



In [13]:
X = df.iloc[:, :54].values
X

array([[0, 1, 1, ..., 1, 0, 0],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 0, 1, ..., 1, 1, 1],
       ...,
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 1, 0, 0]], dtype=int64)

In [15]:
y = df['spam'].values
y

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

## Setting up the Decision Tree

In [16]:
from sklearn.model_selection import train_test_split

Using scikit-learn's train_test_split function, we split the spam dataset into a training set and a test set. The test set is 30% of the dataset, while the training set is the remaining 70%. Setting random_state makes the split reproducible.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [18]:
print('trainset')
print(X_train.shape)
print(y_train.shape)
print()
print('testset')
print(X_test.shape)
print(y_test.shape)

trainset
(3220, 54)
(3220,)

testset
(1381, 54)
(1381,)
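
The shapes above follow from the 30% test fraction: 0.3 × 4601 = 1380.3, and the test set size is rounded up to 1381, leaving 3220 rows for training. The sketch below checks this on placeholder arrays of the same shape; X_demo and y_demo are stand-ins introduced for illustration, not the real data.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features = 4601, 54

# Placeholder arrays with the same shape as X and y (contents are irrelevant here)
X_demo = np.zeros((n_samples, n_features), dtype=int)
y_demo = np.zeros(n_samples, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# 0.3 * 4601 = 1380.3, rounded up for the test set
expected_test = math.ceil(0.3 * n_samples)

print(X_tr.shape, X_te.shape)
print(n_samples - expected_test, expected_test)
```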


## Modeling

We create an instance of DecisionTreeClassifier named SpamTree. We use entropy as the split criterion, so each node is split on the feature and threshold that yield the largest information gain.

In [45]:
SpamTree = DecisionTreeClassifier(criterion = 'entropy')
SpamTree

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [46]:
# Fit the DecisionTree model
SpamTree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
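
To make the entropy criterion more concrete, here is a minimal sketch of how entropy and information gain can be computed from class counts. The helper functions entropy and information_gain are written for illustration and are not scikit-learn's internal implementation. The root counts are those of the full dataset (the tree itself is fit on the training set, whose counts differ slightly), and the split is hypothetical.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = float(sum(parent))
    weighted = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Class counts of the full dataset (non-spam, spam), from value_counts() earlier
root = [2788, 1813]

# Hypothetical split, invented for illustration only
left, right = [2500, 300], [288, 1513]

print(f"Root entropy: {entropy(root):.3f} bits")
print(f"Gain of the hypothetical split: {information_gain(root, left, right):.3f} bits")
```

A pure node has entropy 0 and a perfectly balanced binary node has entropy 1 bit, so a split is useful to the extent that it moves the children toward purity.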

## Prediction

Using the trained decision tree, we predict the labels of the test set and store them in 'prediction'.

In [47]:
prediction = SpamTree.predict(X_test)

# Quick comparison between predictions and actual values for the first 5 observations
print(prediction[0:5])
print(y_test[0:5])

[1 0 0 1 1]
[1 0 0 1 0]


## Evaluation

We import metrics from sklearn to check the accuracy of the model on the test set.

In [48]:
from sklearn import metrics
print('Accuracy of SpamTree: ', metrics.accuracy_score(y_test, prediction))

Accuracy of SpamTree:  0.9044170890658942
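
Accuracy is simply the share of predictions that match the true labels. As a small worked example, the sketch below applies the same metric, plus a confusion matrix, to the five predictions printed in the Prediction section. The names pred_5, true_5, acc_5, and cm_5 are introduced here for illustration.

```python
import numpy as np
from sklearn import metrics

# First five predictions and true labels, copied from the Prediction output
pred_5 = np.array([1, 0, 0, 1, 1])
true_5 = np.array([1, 0, 0, 1, 0])

# Accuracy: fraction of positions where prediction and truth agree (4 of 5)
acc_5 = metrics.accuracy_score(true_5, pred_5)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm_5 = metrics.confusion_matrix(true_5, pred_5)

print(acc_5)
print(cm_5)
```

The single off-diagonal entry is a non-spam email predicted as spam. Running the same two calls on the full y_test and prediction arrays would show how the model's errors divide between false positives and false negatives.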
