# Summary

Random forest gave the best result, with an error rate of about 5% on a test set of 1601 samples.
BernoulliNB gives an 11% error rate when binarizing at 0.0, while a decision tree gives about 10%.

The data was shuffled and split into a training set of 3000 samples and a test set of 1601.


In [55]:
import nltk 
import random
from sklearn.naive_bayes import BernoulliNB
import numpy as np
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
#import copy
#import pickle

In [75]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import scale


In [38]:
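def getPercErrorRate(ans, y_test):
    """Return the percentage of predictions in ans that differ from y_test."""
    er = 0
    for x in range(0, len(ans)):
        if ans[x] != y_test[x]:
            er += 1
    return 100*er/float(len(ans))

def binerize(threshold, lis):
    """Map each value in a list of rows to 1 if it exceeds threshold, else 0."""
    lnew = []
    for x in lis:
        temp = []
        for y in x:
            if float(y)>threshold:
                temp.append(1)
            else:
                temp.append(0)
        lnew.append(temp)
    return lnew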
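
For reference, the two helpers can be written more compactly. The following is a minimal sketch in plain Python, not the notebook's code; the names `percent_error` and `binarize_rows` are hypothetical and the inputs are made up.

```python
def percent_error(predicted, actual):
    """Percentage of positions where predicted and actual disagree."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return 100.0 * wrong / len(predicted)


def binarize_rows(threshold, rows):
    """Replace each value with 1 if it is strictly above threshold, else 0."""
    return [[1 if float(v) > threshold else 0 for v in row] for row in rows]


# Toy inputs, made up for illustration.
print(percent_error([1, 0, 1, 1], [1, 1, 1, 0]))      # 2 of 4 labels differ
print(binarize_rows(0.0, [[0.0, 0.32], ["1.5", 0]]))  # strings are converted to float
```

The strict `>` comparison matches the helper above: a value exactly equal to the threshold maps to 0.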

In [7]:
with open('/Users/alexandersatz/Documents/Cuny/IS620/classifySpam/spambase.data', 'r') as f:
    l1 = []
    for line in f:
        line = line.strip('\n\r')
        line = line.split(",")
        l1.append(line)

    
        

In [120]:
random.shuffle(l1)

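We split the data set into a training set of 3000 samples and a test set of 1601.
Three versions of the inputs are prepared: the raw values (`lx`), each row standardized to zero mean and unit variance with `scale` (`lx_normalized`, called "normalized" below), and each row divided by its own maximum value (`lx_scaled`, called "scaled" below).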

In [121]:
lx1, ly = [],[]
for x in l1:
    lx1.append(x[:-1])
    ly.append(int(x[-1]))
lx = []
for x in lx1:
    t = []
    for y in x:
        t.append(float(y))
    lx.append(t)
    
lx_normalized = []
for x in lx1:
    t = []
    for y in x:
        t.append(float(y))
    t = np.asarray(t)
    t = scale( t, axis=0, with_mean=True, with_std=True, copy=True )
    lx_normalized.append(t)
    
lx_scaled = []
for x in lx1:
    t = []
    for y in x:
        t.append(float(y))
    t = np.asarray(t)
    t = t/t.max()
    lx_scaled.append(t)
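
Note that `scale` is applied here to one row at a time, so each sample is standardized across its own 57 features. A more common convention is to standardize each feature (column) across samples, using statistics computed from the training rows only. The sketch below illustrates that alternative in plain Python on made-up numbers; `column_stats` and `standardize` are hypothetical helper names, and this is not what the notebook ran.

```python
import math


def column_stats(rows):
    """Return per-column (mean, population std) computed over rows."""
    n = len(rows)
    stats = []
    for col in zip(*rows):
        mean = sum(col) / float(n)
        var = sum((v - mean) ** 2 for v in col) / float(n)
        stats.append((mean, math.sqrt(var)))
    return stats


def standardize(rows, stats):
    """Standardize each column with the given stats; constant columns map to 0."""
    out = []
    for row in rows:
        out.append([(v - m) / s if s > 0 else 0.0
                    for v, (m, s) in zip(row, stats)])
    return out


# Toy data: two features on very different scales.
train = [[0.0, 100.0], [1.0, 300.0], [2.0, 200.0]]
test = [[1.0, 400.0]]

stats = column_stats(train)           # fit on the training rows only
train_std = standardize(train, stats)
test_std = standardize(test, stats)   # reuse the training statistics
print(train_std)
print(test_std)
```

Whether this would change the error rates reported below is untested here.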


In [122]:
len(lx)

4601

In [123]:
xtrain, xtest = lx[:3000], lx[3000:]
ytrain, ytest = ly[:3000], ly[3000:]
xtrain_n, xtest_n = lx_normalized[:3000], lx_normalized[3000:]
xtrain_s, xtest_s = lx_scaled[:3000], lx_scaled[3000:]
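
Because `random.shuffle` is called without a seed, every run of the notebook produces a different split, so the error rates below will vary somewhat from run to run. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable; `split_rows` and the seed value are hypothetical and were not part of the original run.

```python
import random


def split_rows(rows, n_train, seed):
    """Shuffle a copy of rows with a fixed seed and split it at n_train."""
    shuffled = list(rows)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private generator, fixed seed
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 4601 parsed rows.
rows = list(range(4601))
train, test = split_rows(rows, 3000, seed=0)
print("%d train, %d test" % (len(train), len(test)))
```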


Below we look at sklearn decision trees with two different split criteria, entropy and Gini.
The error rate is about 10% either way.

In [124]:
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(xtrain, ytrain)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [125]:
ans = clf.predict(xtest)

In [126]:
getPercErrorRate(ans, ytest)

10.118675827607746

In [147]:
clf = tree.DecisionTreeClassifier()  ## uses gini by default
clf.fit(xtrain, ytrain)
ans = clf.predict(xtest)
getPercErrorRate(ans, ytest)

10.680824484697064
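
The two criteria differ only in how they score the impurity of a node: Gini impurity is 1 - sum(p_k^2) and entropy is -sum(p_k * log2(p_k)), where p_k is the fraction of samples in class k. As a small worked example (the class counts below are made up, not taken from the fitted trees):

```python
import math


def gini(counts):
    """Gini impurity of a node given per-class sample counts."""
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (in bits) of a node given per-class sample counts."""
    total = float(sum(counts))
    probs = [c / total for c in counts if c > 0]  # skip empty classes
    return sum(-p * math.log(p, 2) for p in probs)


# Hypothetical nodes: (non-spam count, spam count).
for node in [(50, 50), (90, 10), (100, 0)]:
    print("%s gini=%.3f entropy=%.3f" % (node, gini(node), entropy(node)))
```

Both measures are largest for an evenly mixed node and zero for a pure one, which is consistent with the two trees performing about the same.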

Here we use Bernoulli naive Bayes. The best result, an error rate of about 11%, comes from binarizing at 0.0 (the `BernoulliNB` default) on either the unscaled or the scaled data. The normalized data does not give a good result (at least 30% error).

In [146]:
#xbinary = binerize(1.0, xtrain)
clf = BernoulliNB()
clf.fit(xtrain, ytrain)
ans = clf.predict(xtest)
getPercErrorRate(ans, ytest)

11.118051217988757

In [136]:
for x in range(0,10):
    clf = BernoulliNB(binarize = x/10.0)
    clf.fit(xtrain_n, ytrain)
    ans = clf.predict(xtest_n)
    print(str(getPercErrorRate(ans, ytest)) + " " + str(x))

30.6058713304 0
36.5396627108 1
35.6027482823 2
36.7895065584 3
37.9762648345 4
37.7888819488 5
37.4765771393 6
37.1642723298 7
38.1636477202 8
37.8513429107 9


In [134]:
for x in range(0,10):
    clf = BernoulliNB(binarize = x/40.0)
    clf.fit(xtrain_s, ytrain)
    ans = clf.predict(xtest_s)
    print(str(getPercErrorRate(ans, ytest)) + " " + str(x))
    


11.118051218 0
30.7932542161 1
35.6027482823 2
37.7264209869 3
37.1018113679 4
37.4765771393 5
37.039350406 6
37.9762648345 7
37.2267332917 8
36.2273579013 9
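
One plausible reason for the jump from 11% to over 30% as soon as the threshold leaves 0.0 (an assumption, not something the notebook verifies): the last three Spambase features are capital-letter run lengths, which are often much larger than the percentage-valued word frequencies. Dividing a row by its maximum then pushes most word frequencies close to zero, so even a threshold of 0.025 erases them. A sketch with a made-up row; `scale_by_row_max` and `binarize_row` are hypothetical helpers:

```python
def scale_by_row_max(row):
    """Divide every value in the row by the row's maximum."""
    top = float(max(row))
    return [v / top for v in row]


def binarize_row(threshold, row):
    """1 where the value is strictly above threshold, else 0."""
    return [1 if v > threshold else 0 for v in row]


# Made-up row: three word frequencies (in percent) and one run-length total.
row = [0.5, 0.0, 1.2, 278.0]
scaled = scale_by_row_max(row)

print(binarize_row(0.0, scaled))    # presence/absence of each word is preserved
print(binarize_row(0.025, scaled))  # only the run-length feature survives
```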


Gaussian naive Bayes does very badly on both the normalized and the scaled data, with about 36% and 24% error respectively.

In [137]:
clf = GaussianNB()
clf.fit(xtrain_n, ytrain)
ans = clf.predict(xtest_n)
getPercErrorRate(ans, ytest)

36.16489693941287

In [138]:
clf = GaussianNB()
clf.fit(xtrain_s, ytrain)
ans = clf.predict(xtest_s)
getPercErrorRate(ans, ytest)

23.672704559650217

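The random forest gives an error rate below 5% with `n_estimators` around 20 on the raw data (4.4% at 21 trees in the run below). On the normalized data the best result is about 7.3%, at 18 trees. Changing the `max_features` setting didn't improve performance.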

In [141]:
for x in range(1,10):
    clf = RandomForestClassifier(n_estimators=x*2)
    clf.fit(xtrain_n, ytrain)
    ans = clf.predict(xtest_n)
    print(str(getPercErrorRate(ans, ytest)) + " " + str(x*2))

14.9906308557 2
11.9300437227 4
10.493441599 6
8.68207370394 8
8.11992504685 10
9.30668332292 12
8.49469081824 14
8.86945658963 16
7.30793254216 18


In [154]:
for x in range(1,10):
    clf = RandomForestClassifier(n_estimators=x*3)
    clf.fit(xtrain, ytrain)
    ans = clf.predict(xtest)
    print(str(getPercErrorRate(ans, ytest)) + " " + str(x*3))

8.18238600874 3
7.18301061836 6
6.18363522798 9
5.2467207995 12
5.1217988757 15
4.80949406621 18
4.37226733292 21
5.3091817614 24
4.74703310431 27
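
The differences between neighbouring rows in these tables are small compared with the sampling noise of a 1601-sample test set. As a rough check, assuming test errors are independent so that the binomial standard error sqrt(p(1-p)/n) applies, the sketch below estimates the uncertainty on a 5% error rate; `error_rate_interval` is a hypothetical helper.

```python
import math


def error_rate_interval(error_pct, n, z=1.96):
    """Approximate 95% normal interval (in percent) for an error rate."""
    p = error_pct / 100.0
    se = math.sqrt(p * (1.0 - p) / n)  # binomial standard error
    return 100.0 * (p - z * se), 100.0 * (p + z * se)


low, high = error_rate_interval(5.0, 1601)
print("5%% error on 1601 samples: roughly %.1f%% to %.1f%%" % (low, high))
```

The interval is about one percentage point wide on each side, so small differences between forest sizes above roughly 12 trees should be read with caution.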


MultinomialNB does poorly on the non-binarized inputs, with about 24% error on the raw data and 39% on the scaled data. When the data is binarized at 0.0 before fitting, the error drops to about 9%, close to the 11% from BernoulliNB, as expected.

In [143]:
clf = MultinomialNB()  
clf.fit(xtrain_s, ytrain)
ans = clf.predict(xtest_s)
getPercErrorRate(ans, ytest)

38.663335415365395

In [144]:
clf = MultinomialNB()
clf.fit(xtrain, ytrain)
ans = clf.predict(xtest)
getPercErrorRate(ans, ytest)

23.735165521549032

In [152]:
xbinary = binerize(0.0, xtrain_s)
xtest_bin = binerize(0.0, xtest_s) 
clf = MultinomialNB()
clf.fit(xbinary, ytrain)
ans = clf.predict(xtest_bin)
getPercErrorRate(ans, ytest)


9.119300437226734