Import the required libraries. For this task I use the scikit-learn package to build models and the pandas module to read the input data.

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

Read and prepare the data from the 'spambase.data' file. We separate column 57 out as the target (spam vs. non-spam) and hold out 30% of the data as the test set. As a sanity check, inspect the shape of the training data.

In [2]:
spam_data = pd.read_csv('spambase.data', header=None)
spam_target = spam_data.pop(57)
X_train, X_test, y_train, y_test = train_test_split(spam_data, spam_target, test_size=0.3, random_state=0)
X_train.shape, y_train.shape

((3220, 57), (3220,))
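
The reported shape can be checked by hand. This is a minimal sketch, assuming the standard Spambase file with 4601 rows (the row count is not printed in this notebook) and assuming that train_test_split rounds the test portion up when test_size is a fraction:

```python
import math

n_samples = 4601   # assumed row count of spambase.data (not shown in the notebook)
test_size = 0.3

n_test = math.ceil(test_size * n_samples)  # test portion, rounded up (assumption)
n_train = n_samples - n_test               # the remainder goes to training

print(n_train, n_test)
```

Under these assumptions the training count matches the 3220 rows shown above.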

Use a random forest classifier with 100 decision trees for this task, given its resistance to overfitting and its generally high classification accuracy. Each decision tree chooses its splits by information gain (decrease in entropy).

In [3]:
rf = RandomForestClassifier(criterion='entropy', n_estimators = 100)
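
To make "information gain" concrete, here is a small pure-Python sketch of the quantity a tree tries to maximize at each split. The helper names and the toy labels are hypothetical and for illustration only; scikit-learn's internal implementation is different and far more optimized.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical toy node: 4 spam (1) and 4 non-spam (0) messages.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]   # one candidate split

print(entropy(parent))                       # a 50/50 node has 1 bit of entropy
print(information_gain(parent, left, right)) # gain of this imperfect split
```

A split that separates the two classes perfectly would recover the full 1 bit, while the imperfect split above recovers only part of it.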

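Prepare custom scoring functions for cross-validation. We use scikit-learn's confusion_matrix function to obtain the false positive and false negative counts: rows are true labels and columns are predicted labels, so with 0 = non-spam and 1 = spam, entry [0, 1] counts false positives and entry [1, 0] counts false negatives. Since the error rate is $(FP+FN)/(P+N) = 1 - (TP+TN)/(P+N) = 1 - \text{accuracy}$, we obtain it from scikit-learn's accuracy_score function.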

In [4]:
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def err(y_true, y_pred): return 1.0 - accuracy_score(y_true, y_pred)
scoring = {'fp': make_scorer(fp), 'fn': make_scorer(fn), 'err': make_scorer(err)}
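
As a cross-check of what these scorers count, the same quantities can be computed directly. This is a minimal sketch with a hypothetical helper (count_fp_fn) and made-up labels, assuming 1 = spam and 0 = non-spam:

```python
def count_fp_fn(y_true, y_pred):
    """Count false positives and false negatives for 0/1 labels (1 = spam)."""
    # False positive: truly non-spam (0) but predicted spam (1).
    fp_count = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # False negative: truly spam (1) but predicted non-spam (0).
    fn_count = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp_count, fn_count

# Hypothetical labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

fp_count, fn_count = count_fp_fn(y_true, y_pred)
error_rate = (fp_count + fn_count) / len(y_true)  # (FP + FN) / (P + N)
print(fp_count, fn_count, error_rate)
```

The error rate here is the share of misclassified samples, which is the same value as 1 minus the accuracy.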

Perform cross-validation with 10 folds. The false positive count, false negative count, and error rate for each fold are stored in the scores dictionary.

In [5]:
scores = cross_validate(rf, X_train, y_train, cv=10, scoring = scoring,
                        return_train_score = False, return_estimator = True)

Output a table of the above scores and compute the average across all 10 folds.

In [6]:
print("{:<8} {:<15} {:<15} {:<15}".format('Fold','False Positive','False Negative','Error Rate'))
fp_array, fn_array, err_array = scores['test_fp'], scores['test_fn'], scores['test_err']
for i in range(10):
    print("{:<8} {:<15} {:<15} {:<15}".format(i+1,fp_array[i],fn_array[i],err_array[i]))
print("{:<8} {:<15} {:<15} {:<15}".format('avg.',sum(fp_array)/10,sum(fn_array)/10,sum(err_array)/10))

Fold     False Positive  False Negative  Error Rate     
1        6               2               0.024767801857585092
2        6               10              0.049535603715170295
3        9               11              0.0619195046439629
4        5               14              0.05882352941176472
5        3               12              0.04658385093167705
6        9               10              0.05900621118012417
7        8               7               0.04672897196261683
8        4               12              0.049844236760124616
9        6               6               0.03738317757009346
10       3               8               0.034267912772585674
avg.     5.9             9.2             0.04688608008057048
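
The averages in the last row can be re-derived from the per-fold counts. The sketch below copies the counts from the table above and uses the standard library's mean, which avoids hard-coding the number of folds; the variable names are illustrative.

```python
from statistics import mean

# Per-fold counts copied from the table above.
fp_per_fold = [6, 6, 9, 5, 3, 9, 8, 4, 6, 3]
fn_per_fold = [2, 10, 11, 14, 12, 10, 7, 12, 6, 8]

avg_fp = mean(fp_per_fold)  # average false positives per fold
avg_fn = mean(fn_per_fold)  # average false negatives per fold
print(avg_fp, avg_fn)
```

In this run the false negatives outnumber the false positives on average, so the forest more often lets spam through than it flags legitimate mail.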


We can also fit the model on the entire training set and then measure its accuracy on the held-out test data.

In [7]:
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9500362056480811