# SVM - Email Classification

For this assignment, we will use the following dataset.

**Problem Statement**: Spam email classification using a Support Vector Machine (SVM). In this assignment you will use an SVM to classify emails as spam or non-spam and report the classification accuracy for various SVM parameters and kernel functions. No programs need to be submitted.

**Data Set Description**: An email is represented by various features, such as the frequency of occurrence of certain keywords and the length of capitalized words. A data set containing 4601 instances is available at this link (data folder):
 https://archive.ics.uci.edu/ml/datasets/Spambase 

The data format is also described at the above link. Randomly pick 70% of the data set as training data and use the remaining 30% as test data.

**Deliverables**: Use the scikit-learn SVM implementation (`sklearn.svm`) in Python to classify the above data set and study the performance of the SVM with each of the following three kernel functions: (a) linear, (b) quadratic, (c) RBF. For each kernel, report the training and test set classification accuracy for the best value of the generalization constant C. The best C is the value that gives the highest test set accuracy among the values of C you try. Report the accuracies in a comparison table, along with the corresponding values of C.

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [2]:
data = np.genfromtxt('./data/spambase.data', delimiter=',')

In [3]:
X = data[:, :-1]  # Features
y = data[:, -1]   # Labels

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
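
As a quick sanity check on the 70/30 split, the sketch below runs the same kind of `train_test_split` call on placeholder arrays with Spambase's shape (4601 rows, 57 features). The zero-filled arrays, the dummy labels, and the `stratify` argument are assumptions for illustration and are not part of the cell above; stratifying is optional and only keeps the class ratio similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as Spambase (not the real data).
n_samples, n_features = 4601, 57
X_demo = np.zeros((n_samples, n_features))
y_demo = np.arange(n_samples) % 2  # alternating dummy labels

# Same split parameters as the notebook, plus optional stratification.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(f"test fraction: {len(X_te) / n_samples:.3f}")
```

Printing the shapes of `X_train` and `X_test` in the notebook itself gives the same kind of check on the real data.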

### Define the values of C to be tested

In [5]:
C_values = [0.1, 1, 10, 100]

### Define kernel functions

In [6]:
kernels = ['linear', 'poly', 'rbf']  # 'poly' is the polynomial kernel; a quadratic kernel needs degree=2 in SVC

### Initialize variables to store results

In [7]:
results = {}

In [None]:
for kernel in kernels:
    kernel_results = {'C': None, 'train_accuracy': 0, 'test_accuracy': 0}
    
    for C in C_values:
        # degree=2 makes 'poly' a quadratic kernel (the default is 3); other kernels ignore it
        svm_model = SVC(C=C, kernel=kernel, degree=2)
        svm_model.fit(X_train, y_train)
        
        y_train_pred = svm_model.predict(X_train)
        train_acc = accuracy_score(y_train, y_train_pred)
        
        y_test_pred = svm_model.predict(X_test)
        test_acc = accuracy_score(y_test, y_test_pred)
        
        if test_acc > kernel_results['test_accuracy']:
            kernel_results['C'] = C
            kernel_results['train_accuracy'] = train_acc
            kernel_results['test_accuracy'] = test_acc
    
    results[kernel] = kernel_results
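
SVMs are sensitive to feature scale, and the loop above fits on raw feature values. One optional variation, which is an assumption here and not something the assignment text asks for, is to standardize the features inside a pipeline so that the scaler is fit on the training split only. The sketch uses synthetic data from `make_classification` rather than Spambase, so its accuracy says nothing about the real data set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical, not Spambase).
X_syn, y_syn = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=123
)

# The scaler learns mean/std from the training split only; the same
# transformation is then applied to the test split at predict time.
scaled_svm = make_pipeline(StandardScaler(), SVC(C=1, kernel='rbf'))
scaled_svm.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, scaled_svm.predict(X_tr))
test_acc = accuracy_score(y_te, scaled_svm.predict(X_te))
print(f"train={train_acc:.4f} test={test_acc:.4f}")
```

Swapping `SVC(C=C, kernel=kernel)` for such a pipeline in the loop would leave the rest of the bookkeeping unchanged.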

In [None]:
print("Kernel\t\tC\t\tTrain Accuracy\t\tTest Accuracy")
for kernel, result in results.items():
    print(f"{kernel}\t\t{result['C']}\t\t{result['train_accuracy']:.4f}\t\t{result['test_accuracy']:.4f}")
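
The table keeps only the best C per kernel. To look at the selection rule in isolation, the sketch below applies "highest test accuracy wins" to a list of per-run records and prints the result with fixed-width columns. The `runs` list and its accuracy values are made-up placeholders, not results from this notebook, and `best_per_kernel` is a hypothetical helper name.

```python
# Placeholder records: (kernel, C, train_accuracy, test_accuracy).
# These numbers are invented for illustration only.
runs = [
    ("linear", 0.1, 0.80, 0.78),
    ("linear", 1, 0.85, 0.82),
    ("rbf", 1, 0.90, 0.70),
    ("rbf", 10, 0.95, 0.75),
    ("rbf", 100, 0.99, 0.75),
]


def best_per_kernel(records):
    """Return {kernel: record}, keeping the run with the highest test accuracy.

    Ties keep the earlier record, matching a strict '>' comparison.
    """
    best = {}
    for kernel, C, train_acc, test_acc in records:
        current = best.get(kernel)
        if current is None or test_acc > current[3]:
            best[kernel] = (kernel, C, train_acc, test_acc)
    return best


best = best_per_kernel(runs)
print(f"{'Kernel':<10}{'C':<8}{'Train Acc':<12}{'Test Acc':<12}")
for kernel, C, train_acc, test_acc in best.values():
    print(f"{kernel:<10}{C:<8}{train_acc:<12.4f}{test_acc:<12.4f}")
```

Storing every `(kernel, C)` run this way, rather than only the running best, also makes it possible to report how accuracy changes with C.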