# Spam Classification

Use support vector machines (SVMs) to build a filter that classifies emails as spam or non-spam with high accuracy.

In [14]:
import numpy as np
from scipy.io import loadmat
from sklearn import svm
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
%matplotlib inline

## 1 Load processed data

Each source email has already been converted into an $n$-dimensional feature vector, where $n = 1899$ is the size of the vocabulary list.

$x_i = 1$ if the $i$-th word in the vocabulary list appears in the email, and $x_i = 0$ otherwise.
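
The encoding described above can be sketched as follows. This is only an illustration: the helper `email_to_features`, the five-word vocabulary, and the token list are hypothetical and are not part of the provided data files, which already contain the extracted vectors.

```python
import numpy as np

def email_to_features(tokens, vocab):
    """Return a binary vector: x[i] = 1 if vocab[i] occurs in tokens, else 0."""
    token_set = set(tokens)  # set lookup; repeated words count only once
    return np.array([1 if word in token_set else 0 for word in vocab],
                    dtype=np.int64)

# Hypothetical five-word vocabulary (the real list has 1899 entries).
vocab = ['buy', 'click', 'free', 'meeting', 'report']
# Hypothetical tokens from one preprocessed email.
tokens = ['click', 'here', 'for', 'free', 'free', 'offer']

x = email_to_features(tokens, vocab)
print(x)  # [0 1 1 0 0]
```

Words that are not in the vocabulary ('here', 'for', 'offer') are ignored, and a word that occurs twice ('free') still maps to a single 1.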

In [15]:
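# Load the preprocessed training set
data = loadmat('spamTrain.mat')
X_train = data['X']
# Flatten the (m, 1) label column into a 1-D array of length m
y_train = data['y'].ravel()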
print(X_train.shape)
print(y_train.shape)

(4000, 1899)
(4000,)


In [16]:
# Load the preprocessed test set
data = loadmat('spamTest.mat')
X_test = data['Xtest']
y_test = data['ytest'].ravel()
print(X_test.shape)
print(y_test.shape)

(1000, 1899)
(1000,)


## 2 Training SVMs for spam classification

In [19]:
# RBF-kernel SVM; gamma='auto' sets gamma = 1 / n_features
clf = svm.SVC(C=100, kernel='rbf', gamma='auto')
clf.fit(X_train, y_train)
print('Training accuracy is about {:.2f}%'.format(clf.score(X_train, y_train) * 100))
print('Test accuracy is about {:.2f}%'.format(clf.score(X_test, y_test) * 100))

Training accuracy is about 99.90%
Test accuracy is about 99.00%


In [21]:
# Per-class precision, recall and F1 on the test set (1 = spam, 0 = non-spam)
pred = clf.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       692
           1       0.97      0.99      0.98       308

   micro avg       0.99      0.99      0.99      1000
   macro avg       0.99      0.99      0.99      1000
weighted avg       0.99      0.99      0.99      1000
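
To make the columns of the report concrete, here is a minimal sketch of how precision, recall and F1 for the spam class (label 1) follow from the counts of true positives, false positives and false negatives. The labels below are toy values made up for illustration; they are not taken from `spamTest.mat`.

```python
import numpy as np

# Toy labels, made up for illustration (1 = spam, 0 = non-spam).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0])

# Counts for the positive (spam) class.
tp = np.sum((y_pred == 1) & (y_true == 1))  # spam correctly flagged
fp = np.sum((y_pred == 1) & (y_true == 0))  # non-spam wrongly flagged
fn = np.sum((y_pred == 0) & (y_true == 1))  # spam that was missed

precision = tp / (tp + fp)  # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam emails, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('precision = {:.2f}, recall = {:.2f}, f1 = {:.2f}'.format(precision, recall, f1))
```

The `support` column in the report is simply the number of true samples of each class, here 4 non-spam and 5 spam in the toy data.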

