# Example - The spam filter

### Introduction

A **spam filter** is an algorithm which classifies e-mail messages as either spam or non-spam, based on a collection of **numeric features** such as the frequency of certain words or characters. In a spam filter, the **false positive rate**, that is, the proportion of non-spam messages wrongly classified as spam, must be very low.

### The data set

The file `spam.csv` contains data on 4,601 e-mail messages. Among these messages, 1,813 have been classified as spam. The data were gathered at Hewlett-Packard by merging: (a) a collection of spam e-mail from the company postmaster and the individuals who had filed spam, and (b) a collection of non-spam e-mail, extracted from filed work and personal e-mail.

The variables are:

* 48 numeric features whose names start with `word_`, followed by a word. They indicate the frequency, as a percentage, with which that word appears in the message. Example: for a particular message, a value of 0.21 for `word_make` means that 0.21% of the words in the message match the word 'make'.

* 3 numeric features indicating, respectively, the average length of uninterrupted sequences of capital letters (`cap_ave`), the length of the longest uninterrupted sequence of capital letters (`cap_long`) and the total number of capital letters in the message (`cap_total`).

* A dummy indicating whether that e-mail message is spam (`spam`).

Source: Hewlett-Packard. Taken from T Hastie, R Tibshirani & JH Friedman (2001), *The Elements of Statistical Learning*, Springer.
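
To make the feature definitions above concrete, here is a minimal sketch of how such features could be computed from raw text. The tokenization rule, the helper names `word_freq_pct` and `capital_run_features`, and the toy message are assumptions made for illustration; the source does not describe how the original data set was actually built.

```python
import re

def word_freq_pct(message, word):
    """Percentage of tokens in `message` equal to `word` (case-insensitive).

    Assumption: a token is a maximal run of letters or digits.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", message.lower())
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(word.lower()) / len(tokens)

def capital_run_features(message):
    """Return (average run length, longest run, total capitals), in the
    spirit of cap_ave, cap_long and cap_total."""
    runs = [len(r) for r in re.findall(r"[A-Z]+", message)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)

# Hypothetical message, not taken from spam.csv
toy_message = "FREE money! Make money fast, make it NOW"

print(word_freq_pct(toy_message, "make"))   # 2 of 8 tokens
print(capital_run_features(toy_message))    # runs: FREE, M, NOW
```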

### Importing the data

In [1]:
import numpy as np

In [2]:
data = np.recfromcsv('spam.csv')
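
Depending on the NumPy version installed, `np.recfromcsv` may not be available (to our knowledge, it was dropped from the main namespace in NumPy 2.0). A possible alternative, sketched here on a tiny in-memory CSV so that it runs without `spam.csv`, is `np.genfromtxt` with a comma delimiter. The three-row sample and the name `data_demo` are made up for illustration.

```python
import io
import numpy as np

# Hypothetical three-row sample with a few of the column names of spam.csv
csv_text = "word_make,cap_total,spam\n0.0,278,1\n0.21,1028,1\n0.0,12,0\n"

data_demo = np.genfromtxt(
    io.StringIO(csv_text),  # for the real file, pass 'spam.csv' instead
    delimiter=',',
    names=True,             # take the field names from the header row
    dtype=None,             # infer one type per column
    encoding='utf-8',
)
data_demo = data_demo.view(np.recarray)  # optional: record-array view

print(data_demo.shape)
print(data_demo.dtype.names)
```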

Checking the data:

In [3]:
data.shape

(4601,)

In [4]:
data[:1]

rec.array([(0., 0.64, 0.64, 0., 0.32, 0., 0., 0., 0., 0., 0., 0.64, 0., 0., 0., 0.32, 0., 1.29, 1.93, 0., 0.96, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 3.756, 61, 278, 1)],
          dtype=[('word_make', '<f8'), ('word_address', '<f8'), ('word_all', '<f8'), ('word_3d', '<f8'), ('word_our', '<f8'), ('word_over', '<f8'), ('word_remove', '<f8'), ('word_internet', '<f8'), ('word_order', '<f8'), ('word_mail', '<f8'), ('word_receive', '<f8'), ('word_will', '<f8'), ('word_people', '<f8'), ('word_report', '<f8'), ('word_addresses', '<f8'), ('word_free', '<f8'), ('word_business', '<f8'), ('word_email', '<f8'), ('word_you', '<f8'), ('word_credit', '<f8'), ('word_your', '<f8'), ('word_font', '<f8'), ('word_000', '<f8'), ('word_money', '<f8'), ('word_hp', '<f8'), ('word_hpl', '<f8'), ('word_george', '<f8'), ('word_650', '<f8'), ('word_lab', '<f8'), ('word_labs', '<f8'), ('word_telnet', '<f8'), ('word_857', '<f8'), ('word_data', '<

We check the spam rate:

In [5]:
round(np.mean(data['spam']), 3)

0.394
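
Since `spam` is a 0/1 dummy, its mean is just the proportion of spam messages. As a quick sanity check, the sketch below rebuilds a vector with the counts quoted in the data set description (1,813 spam messages out of 4,601); the name `spam_flags` is introduced only for this illustration.

```python
import numpy as np

n_total, n_spam = 4601, 1813

# A 0/1 vector with the same counts as the data set description
spam_flags = np.array([1] * n_spam + [0] * (n_total - n_spam))

rate = round(float(np.mean(spam_flags)), 3)
print(rate)  # agrees with the 0.394 obtained from the data
```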

### Target vector and feature matrix

Target vector:

In [6]:
y = data['spam']

Now the feature matrix. We drop the last column (`spam`) and convert the resulting structured array to an ordinary (unstructured) two-dimensional array:

In [7]:
X = data[list(data.dtype.names[:-1])]
from numpy.lib.recfunctions import structured_to_unstructured
X = structured_to_unstructured(X)
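
The effect of this conversion is easier to see on a small example. The sketch below uses a hypothetical two-row structured array (`toy`) with three fields: selecting all fields but the last and calling `structured_to_unstructured` yields a plain 2-D float array, with one row per message and one column per feature.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# Hypothetical structured array: two messages, two features plus the target
toy = np.array(
    [(0.1, 3, 1), (0.0, 7, 0)],
    dtype=[('word_make', '<f8'), ('cap_total', '<i8'), ('spam', '<i8')],
)

feature_names = list(toy.dtype.names[:-1])     # every field except 'spam'
X_toy = structured_to_unstructured(toy[feature_names])

print(toy.shape)     # one-dimensional: one record per message
print(X_toy.shape)   # two-dimensional: messages x features
print(X_toy.dtype)   # mixed field types are promoted to a common type
```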

### Decision tree classifier

Import the **estimator class** `DecisionTreeClassifier` from the scikit-learn subpackage `tree`. 

In [8]:
from sklearn.tree import DecisionTreeClassifier

Instantiate an estimator from this class. The argument `max_depth=2` limits the depth of the tree, that is, the length of the longest branch, to two splits.

In [9]:
treeclf1 = DecisionTreeClassifier(max_depth=2)

Next, we **fit** the estimator `treeclf1` to the data: 

In [10]:
treeclf1.fit(X, y)

DecisionTreeClassifier(max_depth=2)

The **accuracy** of this model, evaluated on the same data used to fit it, is:

In [11]:
round(treeclf1.score(X, y), 3)

0.834
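
For a classifier, `score` returns the mean accuracy, that is, the proportion of observations whose predicted class matches the actual class. A minimal sketch of the same calculation in plain NumPy, on made-up label vectors (`y_true` and `y_hat` are hypothetical, not taken from the data):

```python
import numpy as np

# Hypothetical actual and predicted labels for eight messages
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_hat = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Accuracy = share of positions where prediction and truth agree
accuracy = float(np.mean(y_true == y_hat))
print(accuracy)  # 6 matches out of 8
```

With the fitted model, `np.mean(treeclf1.predict(X) == y)` should agree with `treeclf1.score(X, y)`.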

### Confusion matrix

A finer assessment can be based on the **confusion matrix**. First, we obtain the predicted target values with the method `predict`:

In [12]:
ypred1 = treeclf1.predict(X)

The confusion matrix is obtained with the function `confusion_matrix`, from the subpackage `metrics`. Rows correspond to the actual classes and columns to the predicted classes:

In [14]:
from sklearn.metrics import confusion_matrix
conf1 = confusion_matrix(y, ypred1)
conf1

array([[2575,  213],
       [ 549, 1264]])

True positive and false positive rates:

In [16]:
tp1 = conf1[1, 1]/sum(conf1[1, :])
fp1 = conf1[0, 1]/sum(conf1[0, :])
round(tp1, 3), round(fp1, 3)

(0.697, 0.076)

The weakest point is the false positive rate, that is, the proportion of legitimate (non-spam) messages classified as spam. At 7.6%, it is a bit too high for a spam filter.
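
The rates above can be wrapped in a small helper. The sketch below takes a 2x2 confusion matrix laid out as in scikit-learn (rows are actual classes, columns are predicted classes) and returns the accuracy together with the true positive and false positive rates. The function name `rates_from_confusion` is introduced here for illustration; the matrix is the one printed for the depth-2 tree.

```python
import numpy as np

def rates_from_confusion(conf):
    """Accuracy, true positive rate and false positive rate of a 2x2 matrix."""
    tn, fp, fn, tp = np.asarray(conf).ravel()
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'tp_rate': tp / (tp + fn),   # spam messages correctly flagged
        'fp_rate': fp / (fp + tn),   # non-spam messages wrongly flagged
    }

# Confusion matrix reported above for max_depth=2
conf_depth2 = np.array([[2575, 213],
                        [549, 1264]])

rates = rates_from_confusion(conf_depth2)
print({k: round(float(v), 3) for k, v in rates.items()})
# {'accuracy': 0.834, 'tp_rate': 0.697, 'fp_rate': 0.076}
```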

### A deeper tree

Next, we try a tree with `max_depth=3`.

In [18]:
treeclf2 = DecisionTreeClassifier(max_depth=3)
treeclf2.fit(X, y)
treeclf2.score(X, y)

0.8704629428385133

The accuracy is improved. The confusion matrix:

In [19]:
ypred2 = treeclf2.predict(X)
conf2 = confusion_matrix(y, ypred2)
conf2

array([[2598,  190],
       [ 406, 1407]])

The true positive and false positive rates both improve, though the false positive rate is still a bit high.

In [20]:
tp2 = conf2[1, 1]/sum(conf2[1, :])
fp2 = conf2[0, 1]/sum(conf2[0, :])
round(tp2, 3), round(fp2, 3)

(0.776, 0.068)

### More depth

We now set `max_depth=4`:

In [21]:
treeclf3 = DecisionTreeClassifier(max_depth=4)
treeclf3.fit(X, y)
treeclf3.score(X, y)

0.8908932840686807

In [22]:
ypred3 = treeclf3.predict(X)
conf3 = confusion_matrix(y, ypred3)
conf3

array([[2627,  161],
       [ 341, 1472]])

In [23]:
tp3 = conf3[1, 1]/sum(conf3[1, :])
fp3 = conf3[0, 1]/sum(conf3[0, :])
round(tp3, 3), round(fp3, 3)

(0.812, 0.058)

### A final model

Finally, we set `max_depth=5`:

In [24]:
treeclf4 = DecisionTreeClassifier(max_depth=5)
treeclf4.fit(X, y)
treeclf4.score(X, y)

0.9039339274070854

In [25]:
ypred4 = treeclf4.predict(X)
conf4 = confusion_matrix(y, ypred4)
conf4

array([[2696,   92],
       [ 350, 1463]])

In [26]:
tp4 = conf4[1, 1]/sum(conf4[1, :])
fp4 = conf4[0, 1]/sum(conf4[0, :])
round(tp4, 3), round(fp4, 3)

(0.807, 0.033)

The true positive rate has stalled (0.807, against 0.812 with `max_depth=4`), but the false positive rate, 0.033, is promising.
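
The four models above repeat the same steps with a different `max_depth`, so the comparison can be written as a loop. The sketch below is self-contained: it uses synthetic data from `make_classification` as a stand-in for the spam data (an assumption made so that it runs without `spam.csv`), so its numbers will not match those reported above. The names `X_demo`, `y_demo` and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and target vector
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=0)

results = {}
for depth in (2, 3, 4, 5):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_demo, y_demo)
    # Rows: actual classes; columns: predicted classes
    tn, fp, fn, tp = confusion_matrix(y_demo, clf.predict(X_demo)).ravel()
    results[depth] = {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'tp_rate': tp / (tp + fn),
        'fp_rate': fp / (fp + tn),
    }

for depth, r in results.items():
    print(depth, {k: round(float(v), 3) for k, v in r.items()})
```

Replacing `X_demo` and `y_demo` with `X` and `y` should give the same kind of summary as the four models fitted above. As in the rest of this example, all rates are computed on the data used for fitting.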

### Feature importance

In [28]:
treeclf4.feature_importances_

array([0.        , 0.        , 0.        , 0.        , 0.00428051,
       0.        , 0.40565322, 0.00866955, 0.        , 0.00535017,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.23255608, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.05686147, 0.1077894 , 0.04113833,
       0.        , 0.04083541, 0.00327628, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.0013128 , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.03341536, 0.        , 0.        , 0.05179166, 0.00463312,
       0.00243665])

The features used by the tree are:

In [33]:
a = np.array(data.dtype.names[:-1])[treeclf4.feature_importances_ > 0]

In [40]:
b = np.round(treeclf4.feature_importances_[treeclf4.feature_importances_ > 0], 3)

In [45]:
c = np.array([a, b], dtype=object).transpose()
c

array([['word_our', 0.004],
       ['word_remove', 0.406],
       ['word_internet', 0.009],
       ['word_mail', 0.005],
       ['word_free', 0.233],
       ['word_000', 0.057],
       ['word_money', 0.108],
       ['word_hp', 0.041],
       ['word_george', 0.041],
       ['word_650', 0.003],
       ['word_technology', 0.001],
       ['word_edu', 0.033],
       ['cap_ave', 0.052],
       ['cap_long', 0.005],
       ['cap_total', 0.002]], dtype=object)
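
The table is easier to read when sorted by importance. The sketch below does this with `np.argsort`, using the rounded values printed above, hard-coded so that the block runs on its own; `names` and `importances` are illustrative variable names.

```python
import numpy as np

# Features with non-zero importance, as printed above (rounded values)
names = np.array([
    'word_our', 'word_remove', 'word_internet', 'word_mail', 'word_free',
    'word_000', 'word_money', 'word_hp', 'word_george', 'word_650',
    'word_technology', 'word_edu', 'cap_ave', 'cap_long', 'cap_total',
])
importances = np.array([
    0.004, 0.406, 0.009, 0.005, 0.233,
    0.057, 0.108, 0.041, 0.041, 0.003,
    0.001, 0.033, 0.052, 0.005, 0.002,
])

order = np.argsort(importances)[::-1]   # indices from largest to smallest
for name, imp in zip(names[order], importances[order]):
    print(f"{name:16s} {imp:.3f}")
```

With the fitted model, the same ordering could be obtained from `np.argsort(treeclf4.feature_importances_)[::-1]`. Three features (`word_remove`, `word_free` and `word_money`) account for about 0.75 of the total importance.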