## Introduction <a name="introduction"></a>
The SpamBase dataset is a classification dataset containing 4601 emails sent to HP (Hewlett-Packard) during some period of time. It provides 57 numeric features for each email and a binary label: 1 for spam, 0 for ham (non-spam email). This is a typical binary classification problem with the added need for careful feature selection, as many of the features provided in the dataset might be useless (Hopkins et al., 1998).

## Exploratory Data Analysis <a name="DataExploration"></a>
In this section, we use Python libraries such as pandas, NumPy, Matplotlib, and seaborn to summarise and visualise the spambase dataset. This gives us a better understanding of the feature distributions and of the relationships between the features and the target variable (label).

In [None]:
import numpy as np 
import pandas as pd
#for data visualisation:
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

In [None]:
# read the file containing the column names
with open('spambase.names') as f:
    list_contents = f.readlines()
    colnames = []
    for item in list_contents:
        colname = item.split(':')[0]
        colnames.append(colname)
colnames.append('label')

In [None]:
# get the length of the features(columns)
len(colnames)

In [None]:
# read the file containing the dataset and assign it to a variable
dataset = pd.read_csv('spambase.data', header=None)

dataset.columns = colnames

In [None]:
# get the first five rows of the dataset
dataset.head()

In [None]:
# get the last five rows of the dataset
dataset.tail()

In [None]:
# check the data type and non-null count of every column
dataset.info()

In [None]:
#This gives us all the statistical summary for each column.
dataset.describe()

In [None]:
# Get the shape of the dataset. the first element represents the number of rows and the second element represents the columns
np.shape(dataset)

In [None]:
#check whether there are any missing or null values
dataset.isnull().sum()#This will give number of NaN values in every column.

In [None]:
# If needed, count the NaN values in every row instead (axis=1); the default axis=0 counts per column.
dataset.isnull().sum(axis = 1)

In [None]:
# check for duplicate rows in the dataset (duplicated() compares rows, not columns)
dataset.duplicated()
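
`duplicated()` returns one boolean per row, which is hard to read for thousands of emails. A minimal sketch of turning that mask into a count, using a small hypothetical frame (`toy` is not part of the SpamBase data):

```python
import pandas as pd

# Toy frame (hypothetical data, not SpamBase): rows 0 and 2 are identical.
toy = pd.DataFrame({"a": [1, 2, 1, 3], "b": [0.5, 0.1, 0.5, 0.9]})

dup_mask = toy.duplicated()           # True for every repeat of an earlier row
n_duplicates = int(dup_mask.sum())    # number of repeated rows
deduplicated = toy.drop_duplicates()  # keeps the first occurrence of each row

print(n_duplicates)
print(deduplicated.shape)
```

The same two calls could be applied to `dataset` if duplicate emails turn out to matter for the analysis.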

In [None]:
# plot a histogram of the dataset.
hist_of_dataset = dataset.hist(figsize = (30,20))
hist_of_dataset

In [None]:
# visualize if there's any missing value using a barchart. A blue bar represents the number of missing values.
# in this dataset we have no missing value.
sns.set(rc={'figure.figsize':(17,7)})
miss_vals = pd.DataFrame(dataset.isnull().sum() / len(dataset) * 100)
miss_vals.plot(kind='bar',title='Missing values in percentage',ylabel='percentage')


### The Outcome <a name="Theoutcome"></a>
We've explored the spambase dataset with summary statistics and plots, which lets us make informed decisions during data preprocessing.

## Data Preprocessing <a name="DataPreprocessing"></a>
In this section, we clean the dataset by filling any missing values with the mean of their column, and we split the dataset into training and test sets. Feature scaling is needed for gradient-descent-based algorithms but has no significant effect on tree-based algorithms, so we scale the features only in the neural network section. This is also the stage at which the target variable would be encoded if it were not already in binary format.


In [None]:
# fill any missing data with the mean of its column.
dataset.fillna(dataset.mean(), inplace=True)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# separate the features from the target variable.
X = dataset.drop('label', axis=1)
y = dataset['label']

In [None]:
# split the dataset into a 75% training set and a 25% test set. stratify=y keeps the spam/ham ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)

In [None]:
# get the shape of the split data
X_train.shape, X_test.shape, y_train.shape, y_test.shape
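
To illustrate what `stratify` does, here is a small sketch on hypothetical labels (`X_toy` and `y_toy` are made up for this example, and `random_state=0` is an arbitrary choice). It checks that the class ratio is preserved in both parts of the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% class 0, 40% class 1.
y_toy = np.array([0] * 600 + [1] * 400)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.25, random_state=0
)

# The share of class 1 should be close to 0.4 in both the training and test labels.
print(y_tr.mean(), y_te.mean())
```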

### The outcome <a name="Theoutcome"></a>
We've filled any missing data and split the dataset into training and test sets. Scaling will be applied only to the neural network model.
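
One way to apply scaling to a single model only is to bundle the scaler with that model. The sketch below is an illustration on synthetic data, not the notebook's own pipeline: `make_classification`, the layer size and the `random_state` values are all assumptions. Wrapping `MinMaxScaler` and `MLPClassifier` in a pipeline means the scaler is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features (assumption, not SpamBase).
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, test_size=0.25, random_state=0
)

# The pipeline fits the scaler on X_tr only, then reuses its min/max on X_te.
scaled_mlp = make_pipeline(
    MinMaxScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
)
scaled_mlp.fit(X_tr, y_tr)
test_accuracy = scaled_mlp.score(X_te, y_te)
print(f"Accuracy: {test_accuracy * 100:.2f}%")
```

Tree-based models can be trained on the unscaled frames next to this pipeline, since only the pipeline sees scaled values.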


## Feature Engineering and Feature Selection <a name="FEFS"></a>
Here we apply feature engineering and feature selection to keep the relevant features, using scikit-learn's VarianceThreshold, the Pearson correlation coefficient, and the decision tree feature_importances_ attribute.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import VarianceThreshold

Using scikit-learn Variance threshold to remove features with variance 0

In [None]:
# check for variance within each feature and remove the features with variance=0
var_thres = VarianceThreshold(threshold=0.0) # set the threshold to 0
var_thres.fit(X)

In [None]:
# count the features with a variance above 0
sum(var_thres.get_support())

In [None]:
# we check how many features have a variance of 0
constant_columns = [column for column in X_train.columns
                    if column not in X_train.columns[var_thres.get_support()]]

print(len(constant_columns))

In [None]:
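# drop features with variance of 0 from the split data, not from the whole dataset.
# drop() returns a new frame, so the result is assigned back; X_test gets the same columns removed.
X_train = X_train.drop(constant_columns, axis=1)
X_test = X_test.drop(constant_columns, axis=1)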
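
As a small illustration of the variance threshold step, the sketch below uses a hypothetical three-column frame (`toy` and its column names are made up). `get_support()` returns a boolean mask over the columns, and only the non-constant ones survive:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical frame: one constant column and two columns that vary.
toy = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "varies": [0, 1, 0, 1],
    "also_varies": [0.1, 0.4, 0.3, 0.9],
})

selector = VarianceThreshold(threshold=0.0)  # removes zero-variance features only
selector.fit(toy)

kept = toy.columns[selector.get_support()]   # names of the features that remain
toy_reduced = toy[kept]
print(list(kept))
```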

Checking for correlation between features after applying the scikit learn variance threshold

In [None]:
#we check with the X_train not the whole features.
cor = X_train.corr()

# we use the seaborn heatmap to visualise the correlations.
cmap = sns.cm.rocket_r #for reversed color, the darker the more correlated.

plt.figure(figsize=(50, 50)) # set the size of the plot.

# initialise the seaborn heatmap.
ax = sns.heatmap(cor, linewidths=.3, annot=True, fmt=".2", cmap=cmap)#show numbers on the cells: annot=True

# rotate the x-axis labels so they stay readable
ax.tick_params(axis='x', labelrotation=45)

Using Pearson Correlation Coefficient to remove features with high correlations between them.

In [None]:
# code partially adapted from https://www.youtube.com/watch?v=FndwYNcVe0U&list=PLZoTAELRMXVPgjwJ8VyRoqmfNs2CJwhVH&index=3

# Checking for correlation using pearson correlation coefficient
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)-1):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff values.
                #We compare the feature correlation with target and drop the ones with lower coeff.
                if abs(corr_matrix.iloc[j, len(corr_matrix.columns)-1]) > abs(corr_matrix.iloc[i, len(corr_matrix.columns)-1]):
                    colname = corr_matrix.columns[i]
                else:
                    colname = corr_matrix.columns[j]
                col_corr.add(colname)
    return col_corr

In [None]:
# Apply the pearson correlation to our dataset with a threshold of 90%.
corr_features = correlation(dataset, 0.90)

print(corr_features)

print('number of correlated features: '+str(len(set(corr_features))))

In [None]:
# drop the correlated features. the features that are related to target are kept.
dataset_feature_reduce = dataset.drop(corr_features, axis=1)

In [None]:
# get the shape of the reduced dataset
dataset_feature_reduce.shape
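
The pairwise rule above can be checked on a small synthetic frame. This is a simplified sketch, not the notebook's function: `drop_candidates`, the column names and the noise levels are all hypothetical. `f2` is built as a near copy of `f1`, so exactly one of the two should be flagged, while the independent `f3` is left alone.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
label = rng.integers(0, 2, size=n)
f1 = label + rng.normal(0, 0.5, size=n)  # informative feature
f2 = f1 + rng.normal(0, 0.01, size=n)    # near-duplicate of f1
f3 = rng.normal(size=n)                  # independent noise
toy = pd.DataFrame({"f1": f1, "f2": f2, "f3": f3, "label": label})

def drop_candidates(df, target, threshold):
    """For each highly correlated feature pair, flag the one less correlated with the target."""
    corr = df.corr().abs()
    features = [c for c in df.columns if c != target]
    to_drop = set()
    for i, a in enumerate(features):
        for b in features[:i]:
            if corr.loc[a, b] > threshold:
                weaker = a if corr.loc[a, target] < corr.loc[b, target] else b
                to_drop.add(weaker)
    return to_drop

dropped = drop_candidates(toy, "label", 0.90)
print(dropped)
```

Looking columns up by name rather than by position means the sketch does not depend on the target being the last column.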

In [None]:
# get a new X and y from the reduced dataset
X_train, X_test, y_train, y_test = train_test_split(dataset_feature_reduce.drop('label', axis=1), dataset_feature_reduce['label'],
                                                    stratify=dataset_feature_reduce['label'], random_state=42)

In [None]:
len(X_train.columns)

In [None]:
# visualise it with seaborn heatmap
#reverse the color scheme: the darker the more positive related.
cmap = sns.cm.rocket_r 

plt.figure(figsize=(50, 50)) # set the size of the heatmap

#https://stackoverflow.com/questions/39409866/correlation-heatmap
view = sns.heatmap(dataset_feature_reduce.corr(), linewidths=.3, annot=True, fmt=".2", cmap=cmap)#show numbers on the cells: annot=True

# To avoid resetting labels
view.tick_params(axis='x', labelrotation=45) # tilt the x-label by 45 degree.

Let's compare the evaluation results before and after reducing features

In [None]:
#before feature reduction: all features, scored against the original labels
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators=100)#todo: tune hyper param
scores_rf_full = cross_val_score(model_rf, dataset.drop('label', axis=1), dataset['label'], scoring = "f1_weighted", cv=5)
print(scores_rf_full)
scores_rf_full.mean()

In [None]:
#after feature reduction
model_rf = RandomForestClassifier(n_estimators=100)
scores_rf_featRed = cross_val_score(model_rf, dataset_feature_reduce.drop('label', axis=1), dataset_feature_reduce['label'], scoring = "f1_weighted", cv=5)
print(scores_rf_featRed)
scores_rf_featRed.mean()

Applying Adaboost before and after feature reduction

In [None]:
#before feature reduction: all features, scored against the original labels
# use the AdaBoost classifier with the default base classifier - DecisionTreeClassifier(max_depth=1) 
from sklearn.ensemble import AdaBoostClassifier
model_ada = AdaBoostClassifier(n_estimators=100)#todo: tune hyper param
scores_ada_full = cross_val_score(model_ada, dataset.drop('label', axis=1), dataset['label'], scoring = "f1_weighted", cv=5)
print(scores_ada_full)
scores_ada_full.mean()#best result: 0.9423

In [None]:
#after feature reduction
model_ada = AdaBoostClassifier(n_estimators=100)
scores_ada_feature_reduce = cross_val_score(model_ada, dataset_feature_reduce.drop('label', axis=1), dataset_feature_reduce['label'], scoring = "f1_weighted", cv=5)
print(scores_ada_feature_reduce)
scores_ada_feature_reduce.mean()

### The Outcome
There's no significant difference before and after dropping the correlated features, because only one feature was dropped and it was insignificant to the target variable.
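
When the gap between two feature sets is small, run-to-run randomness in the model can hide it. The sketch below fixes `random_state` so that a before/after comparison is repeatable. The synthetic data, the `mean_f1` helper and the duplicated column are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data, plus an exact copy of column 0 as a redundant feature.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_with_dup = np.hstack([X_syn, X_syn[:, [0]]])

def mean_f1(X, y):
    """Mean weighted F1 over 5 folds with a seeded forest, so repeated calls agree."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X, y, scoring="f1_weighted", cv=5).mean()

score_full = mean_f1(X_with_dup, y_syn)  # with the redundant column
score_reduced = mean_f1(X_syn, y_syn)    # after removing it
print(score_full, score_reduced)
```

With the seed fixed, any remaining difference between the two scores comes from the feature set rather than from the forest's random sampling.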

Sorting the features by feature_importances_ and removing those with an importance of 0 with respect to the target variable.

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42, max_depth=8,class_weight='balanced') 

model.fit(X_train,y_train)

# Get feature importances
importances = model.feature_importances_

In [None]:
# View the important feature on a barplot
feat_importances = pd.DataFrame(importances, index=X_train.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(16,7))

In [None]:
# Keep the features whose importance is greater than 0 in a new list
threshold = 0.00000
important_features = [feature for feature, importance in zip(X_train.columns, importances) if importance > threshold]

important_features

In [None]:
# get the length of the new important features
len(important_features)

In [None]:
# create a new dataset with only the important features
import_feat = dataset_feature_reduce[important_features]
import_feat.head()
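
To make the selection rule concrete, here is a small sketch on synthetic data (the `feat_i` column names and the tree settings are assumptions). A fitted tree's importances sum to 1, and a feature that is never used in a split gets exactly 0, which is what the `> 0` filter removes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 2 informative features and 4 noise features (assumption).
X_arr, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_syn = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_syn, y_syn)

# One importance per column; features never used in a split score exactly 0.
importance = pd.Series(tree.feature_importances_, index=X_syn.columns)
selected = importance[importance > 0].index.tolist()

print(importance.sort_values(ascending=False))
print(selected)
```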

In [None]:
cmap = sns.cm.rocket_r #reverse the color scheme: the darker the more positive related
plt.figure(figsize=(60, 50))

# Plot a heatmap of the important features
heat = sns.heatmap(import_feat.corr(), linewidths=.3, annot=True, fmt=".2", cmap=cmap)

heat.tick_params(axis='x', labelrotation=45)

### The outcome <a name="Theoutcome"></a>
In this section, we've applied scikit-learn's VarianceThreshold, the Pearson correlation coefficient, and the decision tree feature_importances_ attribute to extract the features that are relevant for predicting whether an email is spam.

## Select, Train, Apply ML models <a name="SA"></a>
In this section, we'll explore, compare and optimise various classification models, ensemble models and an ANN model.

Using Support Vector Machine on the reduced spambase dataset

In [None]:
from sklearn import svm
from sklearn.metrics import accuracy_score # to check the accuracy of the model

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(import_feat, y, stratify=y, test_size=0.25, random_state=42)

# Train the SVM model
clf = svm.SVC(kernel='linear') #Linear Kernel is used when the data is Linearly separable 
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))

Using Decision tree classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(max_depth=5, random_state=42)

dtc.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dtc.predict(X_test)

# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))

Using SGDClassifier

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss = 'modified_huber') # "modified_huber" brings tolerance to outliers as well as probability estimates.

sgd_clf.fit(X_train, y_train)

# Make predictions on the test set with the SGD model
y_pred = sgd_clf.predict(X_test)

# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))

Using Random Forest to classify the Spambase Dataset

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train the random forest model
clf = RandomForestClassifier(n_estimators=100,max_depth=5, random_state=42)

clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))


Using the ensemble learning boosting technique (AdaBoost classifier).

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Train the AdaBoost model
ada_clf = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42) #Weight applied to each classifier at each boosting iteration

ada_clf.fit(X_train, y_train)


# Evaluate the model's performance
accuracy = ada_clf.score(X_test, y_test)

print("Accuracy: {:.2f}%".format(accuracy * 100))

Using Ensemble learning Bagging Technique

In [None]:
from sklearn.ensemble import BaggingClassifier

# Define the base estimator
base_estimator = DecisionTreeClassifier(max_depth=10, random_state=42)

# Train the Bagging model
bag_clf = BaggingClassifier(estimator=base_estimator, n_estimators=100, random_state=42)

bag_clf.fit(X_train, y_train)

# Evaluate the model on the test set
accuracy = bag_clf.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))


In this section, we use an artificial neural network (ANN) to classify the Spambase Dataset.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import MinMaxScaler

In [None]:
sc = MinMaxScaler() # define the scaler
df_scaled = pd.DataFrame(sc.fit_transform(import_feat)) # fit & transform the data
print(df_scaled.head())

In [None]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_scaled, y, stratify=y, test_size=0.25, random_state=42)

In [None]:
# initialize the neural network
neu_net = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, alpha=1e-4,
                        solver='sgd', verbose=10, tol=1e-4, random_state=1, # solver specifies the algorithm for weight optimization over the nodes.
                        learning_rate_init=.1) 

# train the neural network
neu_net.fit(X_train, y_train)

# evaluate the model
y_pred = neu_net.predict(X_test)

print(classification_report(y_test, y_pred))

### The outcome <a name="Theoutcome"></a>
We trained and tested various classifiers and ensemble techniques on the dataset. Each performed well, some better than others. Next, we evaluate them side by side to see which performed best.


## Evaluation <a name="Evaluation"></a>
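Evaluate the models on the most recent train/test split (the scaled features from the neural network section) using accuracy, precision, recall and F1-score.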

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define the models to evaluate
models = {"Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
          "Decision Tree":DecisionTreeClassifier(max_depth=5, random_state=42),
          "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
          "SVM":svm.SVC(kernel='linear'),
          "SGD":SGDClassifier(loss = 'modified_huber'),
          "Bagging": BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=5, random_state=42), n_estimators=100, random_state=42),
          "Neural Network": MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, alpha=1e-4,
                        solver='sgd', verbose=10, tol=1e-4, random_state=1, learning_rate_init=.1)
         }

# Evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred) * 100
    precision = precision_score(y_test, y_pred) * 100
    recall = recall_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred) * 100
    print(f"{name}:\n\tAccuracy: {accuracy:.2f}\n\tPrecision: {precision:.2f}\n\tRecall: {recall:.2f}\n\tF1-Score: {f1:.2f}")
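
For reference, the sketch below shows how precision, recall and F1 relate to the counts in a confusion matrix. The ten labels are made up for illustration (`y_true` and `y_hat` are not model output), and 1 stands for spam as in the dataset.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels and predictions (1 = spam, 0 = ham).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

In a spam filter, low precision means legitimate email ends up flagged as spam, so it is worth reading precision next to accuracy when ranking the models.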


## Communicating the Results <a name="CommunicateResults"></a>
After data preprocessing, feature engineering and selection, and training and testing on the spambase dataset, both model by model and together in the evaluation loop, the models with the highest accuracy are AdaBoost and random forest, at 94.61% and 94.53% respectively. Other models also performed well, such as the ANN with a score of 92.44%. Decreasing the learning rate of the ANN to 0.01 gave a slight improvement to 92.96%, but it took longer to execute. SVM had the lowest accuracy, at 89.04%.
It is recommended that AdaBoost be used when building a model to classify emails as spam or ham, as its accuracy is higher than that of the other classifiers.
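
The learning-rate observation can be explored with a loop like the one below. This is only a sketch on synthetic data, so it does not reproduce the figures above: the dataset, layer size, `max_iter` and seeds are assumptions. It records test accuracy and the number of epochs for two values of `learning_rate_init`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the selected features (assumption, not SpamBase).
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, test_size=0.25, random_state=42
)

# Fit the scaler on the training rows and reuse it on the test rows.
scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

results = {}
for rate in (0.1, 0.01):
    net = MLPClassifier(hidden_layer_sizes=(20,), solver="sgd",
                        learning_rate_init=rate, max_iter=300, random_state=1)
    net.fit(X_tr_s, y_tr)
    # n_iter_ is the number of epochs the solver actually ran.
    results[rate] = {"accuracy": net.score(X_te_s, y_te), "epochs": net.n_iter_}

for rate, outcome in results.items():
    print(rate, outcome)
```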