# Spam/Ham Classifier

**Submitted By:** Divij Bhatia, divijbha@usc.edu<br>
### Steps:
>**1.** Load the dataset<br>
>**2.** Extract features and labels<br>
>**3.** Shuffle the features and labels together so that each example keeps its own label<br>
>**4.** Define the value of k for k-fold cross validation<br>
>**5.** For every fold from 1 to k do the following:<br>
>>**5.1** Split the shuffled features and labels into a training set and a test set for the current fold<br>
>>**5.2** Train the classifier on the training dataset to generate a model M<br>
>>**5.3** Predict the labels of testing data using the model M<br>
>>**5.4** Compare the predicted values with the actual labels to calculate the false positive rate, false negative rate and overall error rate.<br>

>**6.** Tune the hyperparameters based on the performance of the model<br>
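
Steps 4 and 5.1 can be sketched as follows. This is a minimal illustration rather than the code used for the reported results: the helper name `kfold_slices` is hypothetical, and it assumes the rows have already been shuffled. It uses `np.array_split`, which spreads any remainder over the first folds so that every example lands in exactly one test fold.

```python
import numpy as np

def kfold_slices(n, k):
    """Yield (train_idx, test_idx) index arrays for each of k folds.

    Hypothetical helper for illustration; assumes rows are already shuffled.
    """
    indices = np.arange(n)
    # array_split tolerates n not divisible by k:
    # the first n % k folds receive one extra index
    for test_idx in np.array_split(indices, k):
        # Training indices are everything outside the current test fold
        train_idx = np.setdiff1d(indices, test_idx)
        yield train_idx, test_idx

# Usage: 23 examples split into 5 folds
for fold, (train_idx, test_idx) in enumerate(kfold_slices(23, 5), start=1):
    print(fold, len(train_idx), len(test_idx))
```
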
### Properties of the Classifier:

>**K:** 10 _(10-Fold Cross Validation)_<br>
>**Classifier Used:** Multilayer Perceptron (MLP)<br>
>**Layers:** 1 Input Layer : 57 Nodes, 1 Hidden Layer : 50 Nodes, 1 Output Layer : 1 Node _(scikit-learn's `MLPClassifier` uses a single logistic output unit for binary labels)_<br>
>**Activation Function:** ReLU<br>
>**Solver:** Adam<br>
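>**Learning Rate:** Inverse Scaled _(as set in the code; scikit-learn applies this schedule only with the SGD solver, so it has no effect with Adam)_<br>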
>**Initial Learning Rate:** 0.0001<br>
>**Epochs:** 500 _(Maximum)_<br>
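
For reference, scikit-learn documents the inverse-scaling schedule as `effective_learning_rate = learning_rate_init / pow(t, power_t)`, used by its SGD solver. The sketch below evaluates that formula for the initial learning rate listed above and the `power_t=0.5` passed in the code; the function name `invscaling_lr` is hypothetical, and `t` is assumed to be a 1-based time step.

```python
def invscaling_lr(learning_rate_init, t, power_t=0.5):
    """Effective learning rate at time step t (t >= 1) under inverse scaling.

    Hypothetical helper mirroring the formula in the scikit-learn docs.
    """
    return learning_rate_init / (t ** power_t)

# Learning rate decay for the settings listed above
for t in (1, 4, 100):
    print(t, invscaling_lr(0.0001, t))
```
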


### Results:


| Fold | False Positive Rate | False Negative Rate | Overall Error Rate |
| --- | --- | --- | --- |
| 1 | 0.045627376 | 0.091370558 | 0.065217391 | 
| 2 | 0.054545455 | 0.118918919 | 0.080434783 | 
| 3 | 0.042402827 | 0.11299435 | 0.069565217 | 
| 4 | 0.029090909 | 0.075675676 | 0.047826087 | 
| 5 | 0.063670412 | 0.067357513 | 0.065217391 | 
| 6 | 0.037037037 | 0.105263158 | 0.065217391 | 
| 7 | 0.037288136 | 0.121212121 | 0.067391304 | 
| 8 | 0.054151625 | 0.081967213 | 0.065217391 | 
| 9 | 0.046357616 | 0.069620253 | 0.054347826 | 
| 10 | 0.082142857 | 0.055555556 | 0.07173913 | 
| **Average** | **0.049231425** | **0.089993532** | **0.065217391** | 
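
The **Average** row is the arithmetic mean of the ten per-fold values. As a quick check, the snippet below recomputes it from the numbers in the table; the variable names are illustrative only.

```python
from statistics import mean

# Per-fold rates copied from the table above (folds 1..10)
fpr = [0.045627376, 0.054545455, 0.042402827, 0.029090909, 0.063670412,
       0.037037037, 0.037288136, 0.054151625, 0.046357616, 0.082142857]
fnr = [0.091370558, 0.118918919, 0.11299435, 0.075675676, 0.067357513,
       0.105263158, 0.121212121, 0.081967213, 0.069620253, 0.055555556]
err = [0.065217391, 0.080434783, 0.069565217, 0.047826087, 0.065217391,
       0.065217391, 0.067391304, 0.065217391, 0.054347826, 0.07173913]

# Mean of each column, keyed by a short name
averages = {name: mean(values) for name, values in
            [("fpr", fpr), ("fnr", fnr), ("err", err)]}
for name, value in averages.items():
    print(f"{name}: {value:.9f}")
```
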



### Code:
#### Import Packages

In [None]:
import numpy as np
from sklearn.utils import shuffle
from sklearn.neural_network import MLPClassifier

#### Read & Preprocess the Dataset

In [None]:
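# Read the data file (comma-separated, all numeric)
data = np.genfromtxt('spambase.data', delimiter=',', dtype='float')

# Extract features: every column except the last
features = data[:, :-1]

# Extract labels: the last column of the data file (1 = spam, 0 = non-spam)
labels = data[:, -1]

# Shuffle features and labels together so each row keeps its label
features, labels = shuffle(features, labels, random_state=0)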
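
Shuffling both arrays in a single call keeps every feature row paired with its label. A rough NumPy-only sketch of the same idea applies one random permutation to both arrays. The helper name `shuffle_together` and the toy data are hypothetical, and the resulting order will not match the one produced by `sklearn.utils.shuffle`.

```python
import numpy as np

def shuffle_together(features, labels, seed=0):
    """Apply the same random permutation to features and labels.

    Hypothetical helper for illustration; not the sklearn implementation.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    return features[order], labels[order]

# Toy data: each label equals the first feature of its row,
# so the pairing is easy to verify after shuffling
toy_features = np.array([[0.0, 10.0], [1.0, 11.0], [2.0, 12.0], [3.0, 13.0]])
toy_labels = np.array([0.0, 1.0, 2.0, 3.0])
shuffled_features, shuffled_labels = shuffle_together(toy_features, toy_labels)
print(np.array_equal(shuffled_features[:, 0], shuffled_labels))
```
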

#### Define Parameters

In [None]:
# Parameters for k-fold cross validation
n = len(data)          # total number of examples
k = 10                 # number of folds
# Size of each test fold; the n % k leftover examples at the end
# are only ever used for training, never for testing
len_of_fold = n // k
l_index = 0            # start index of the current test fold
h_index = len_of_fold  # end index (exclusive) of the current test fold

#### Train & Test using K-Fold Cross Validation

In [None]:
for i in range(k):
    #extract test dataset from all the features
    features_test=features[l_index:h_index]
    
    #extract actual labels of test dataset
    labels_test=labels[l_index:h_index]
    
    #count number of spam examples in test set
    spam_test_ex=np.count_nonzero(labels_test)
    
    #count number of non-spam examples in test set
    #count_of_non_spam_examples_in_test_dataset = size_of_test_dataset - count_of_spam_examples_in_test_data 
    non_spam_test_ex=len_of_fold-spam_test_ex
    
    #extracting training dataset features
    features_train=np.concatenate((features[0:l_index],features[h_index:]),axis=0)
    
    #extracting training dataset labels
    labels_train=np.concatenate((labels[0:l_index],labels[h_index:]),axis=0)
    
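    #Instantiating Multi-Layered Perceptron (MLP)
    #note: scikit-learn uses learning_rate and power_t only when solver='sgd';
    #with solver='adam' they are accepted but have no effect
    classifier = MLPClassifier(hidden_layer_sizes=(50,), activation='relu', solver='adam',
                               learning_rate='invscaling', learning_rate_init=0.0001,
                               power_t=0.5, max_iter=500)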
    
    #Training MLP
    classifier.fit(features_train,labels_train)
    
    #Testing data using the trained model
    prediction=classifier.predict(features_test)
    
    #counting misclassified examples
    fp=0 #false_positive_counter initialized to 0
    fn=0 #false_negative_counter initialized to 0
    
    for j in range(len_of_fold):
        #false_positive_counter is incremented if a non_spam email is classified as spam
        if prediction[j]==1 and prediction[j]!=labels_test[j]:
            fp+=1
        #false_negative_counter is incremented if a spam email is classified as non_spam
        elif prediction[j]==0 and prediction[j]!=labels_test[j]:
            fn+=1
    
    #printing K, false_positive_rate, false_negative_rate, overall_error_rate
    print(i+1,",",fp/non_spam_test_ex,",",fn/spam_test_ex,",",(fp+fn)/len_of_fold)
    
    #incrementing the indexes of the test dataset
    l_index+=len_of_fold
    h_index+=len_of_fold
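
The per-example counting loop above can also be written with boolean masks. The sketch below is an alternative formulation, not the code that produced the reported results. The function name `error_rates` is hypothetical, labels are assumed to be 1 for spam and 0 for non-spam, and a rate is returned as `nan` when its denominator is zero.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate, overall_error_rate).

    Hypothetical helper; assumes labels are 1 (spam) and 0 (non-spam).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))  # non-spam classified as spam
    fn = np.sum((y_pred == 0) & (y_true == 1))  # spam classified as non-spam
    n_neg = np.sum(y_true == 0)                 # actual non-spam count
    n_pos = np.sum(y_true == 1)                 # actual spam count
    fpr = fp / n_neg if n_neg else float("nan")
    fnr = fn / n_pos if n_pos else float("nan")
    return float(fpr), float(fnr), float((fp + fn) / len(y_true))

# Toy example: 4 non-spam and 4 spam emails, one mistake of each kind
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
print(error_rates(y_true, y_pred))
```

Inside the cross-validation loop, such a helper would be called as `error_rates(labels_test, prediction)` in place of the inner `for` loop.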