Classify the spam data using support vector machines.

(Note to the reader: here Wasserman means a support vector classifier
*without* kernelization, i.e. a linear SVM.)
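
For reference, one common way to write a linear (non-kernelized) soft-margin support vector classifier is sketched below. The notation is an assumption made for this note rather than a quotation from the book: labels are coded as $Y_i \in \{-1, +1\}$, $\beta_0$ and $\beta$ are the intercept and weight vector, and $C > 0$ is the cost parameter.

```latex
H(x) = \operatorname{sign}\left(\beta_0 + \beta^\top x\right),
\qquad
(\hat\beta_0, \hat\beta)
  = \arg\min_{\beta_0,\, \beta}\;
    \frac{1}{2}\lVert \beta \rVert^2
    + C \sum_{i=1}^{n} \max\left(0,\; 1 - Y_i\left(\beta_0 + \beta^\top X_i\right)\right)
```

Note that scikit-learn's `LinearSVC` uses the squared hinge loss by default, so the model fitted below solves a closely related problem rather than exactly this one.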

In [1]:
import pandas as pd
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

In [2]:
# Read the data into a pandas data frame
df = pd.read_csv('../data/spam.dat', sep=' ', header=None)

# Extract the response variable Y from the data frame
# and convert it to a numpy array
Y = df[df.columns[-1]].to_numpy()

# Extract all 57 covariates into a numpy array X
X = df[df.columns[:-1]].to_numpy()

In [7]:
# Define the model
model = LinearSVC()

# Fit the model
fitted_model = model.fit(X, Y)

# Compute the empirical (training) error rate
empirical_error_rate = zero_one_loss(Y, fitted_model.predict(X))

# Compute a 10-fold cross-validation estimate of the true error rate
# (cross_val_score returns per-fold accuracy, so subtract its mean from 1)
true_error_rate_cv_estimate = 1 - cross_val_score(model, X, Y, cv=10).mean()

In [8]:
# Report the results
print(
    f"Misclassification rate: {empirical_error_rate:.3}\n"
    f"Cross-validation estimate of the true error rate: {true_error_rate_cv_estimate:.3}"
)

Misclassification rate: 0.0704
Cross-validation estimate of the true error rate: 0.0902
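
To make the difference between the two reported numbers concrete, here is a minimal pure-Python sketch of what each one measures. Everything in it is illustrative and not part of the original analysis: the data are synthetic, `fit_threshold` is a hypothetical stand-in for the SVM, and the folds are formed by taking every k-th point, which is simpler than the stratified splitting scikit-learn applies to classifiers.

```python
import random


def zero_one_error(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_threshold(xs, ys):
    """Toy stand-in classifier: cut halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: int(x > cut)


def k_fold_error(xs, ys, k=10):
    """Average held-out error over k folds (one minus the mean fold accuracy)."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = list(range(fold, n, k))  # every k-th point is held out
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        predict = fit_threshold(train_x, train_y)
        test_y = [ys[i] for i in test_idx]
        test_pred = [predict(xs[i]) for i in test_idx]
        fold_errors.append(zero_one_error(test_y, test_pred))
    return sum(fold_errors) / k


# Synthetic one-dimensional data: class 1 is shifted to the right of class 0
rng = random.Random(0)
ys = [rng.randint(0, 1) for _ in range(200)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

# Training error: fit and evaluate on the same points
predict_all = fit_threshold(xs, ys)
training_error = zero_one_error(ys, [predict_all(x) for x in xs])

# Cross-validation error: every point is scored by a model that never saw it
cv_error = k_fold_error(xs, ys, k=10)

print(f"Training error: {training_error:.3}")
print(f"10-fold CV error: {cv_error:.3}")
```

Because the training error is computed on the same points the model was fitted to, it tends to be optimistic, which is consistent with the cross-validation estimate above being the larger of the two reported numbers.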
