
**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!

# Activity 3.3.6 Exploring evaluation metrics

## Scenario
Hopkins et al. (1999) created the Spambase data set and donated it to the UCI Machine Learning Repository. The data set contains 4,601 emails labelled as spam or non-spam by a postmaster or by individuals. Fifty-seven features (e.g. word frequencies and other email characteristics) are available for classifying each email. Spambase is used for developing and benchmarking spam detection models, providing a basis for comparing how well different machine learning techniques distinguish spam from legitimate email.

As a data professional, you have been tasked by your company with developing a TensorFlow neural network, based on the Spambase data set, that classifies emails as spam or non-spam.



## Objective
In this portfolio activity, you’ll continue to work with the model you created in Activity 3.2.3 (Experimenting with hyperparameter tuning). You’ll save the trained model, use it to classify emails as spam or non-spam, and apply evaluation metrics to its predictions.

You will complete the activity in your Notebook, where you’ll:
- choose the best model based on model performance
- make predictions based on the chosen model
- convert probabilities to binary predictions and view accuracy, F1 score, and recall
- present your insights based on the model's performance.


## Assessment criteria
By completing this activity, you will be able to provide evidence that you can critically select appropriate strategies to demonstrate expertise in model-tuning techniques.


## Activity guidance
1. Continue to work on the model you created in **Activity 3.2.3**.
2. Select the best model you obtained through hyperparameter tuning. Substantiate your choice.
3. Run the chosen model again and save it in an `h5` file named `best_model.h5`. Remember to specify the path.
4. Obtain further metrics by calling the `predict` function on your model variable to generate predictions for the `X_test` data set.
5. Convert probabilities to binary predictions and print the accuracy, F1 score, and recall. You can use the following code:
 - predictions: `y_pred = (y_pred > 0.5).astype(int)`
 - confusion matrix metrics: `accuracy_score`, `precision_score`, `recall_score` and `f1_score` functions.
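
As a minimal sketch of what the four metric functions in step 5 compute, the example below derives each metric by hand from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from this activity.

```python
# Hypothetical confusion-matrix counts for a spam classifier (illustrative only).
tp = 90   # spam correctly flagged as spam
fp = 10   # legitimate email wrongly flagged as spam
fn = 15   # spam that slipped through as non-spam
tn = 135  # legitimate email correctly passed

total = tp + fp + fn + tn

accuracy = (tp + tn) / total        # share of all emails classified correctly
precision = tp / (tp + fp)          # share of flagged emails that really are spam
recall = tp / (tp + fn)             # share of actual spam that was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1 Score : {f1:.4f}")
```

Precision and recall answer different questions: precision falls when legitimate emails are flagged, and recall falls when spam is missed. The F1 score only stays high when both are high.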

> Start your activity here. Select the pen from the toolbar to add your entry.

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model

# URL to import data set from GitHub.
url = 'https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/main/spamdata.csv'

In [None]:
#Loading data set
df = pd.read_csv(url, header=None)
df.head(5)

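| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |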


In [None]:
#Identifying the target column (the last column) and separating features from labels
target_col_index = df.shape[1] - 1
X = df.drop(columns=target_col_index)
y = df.iloc[:, target_col_index]

In [None]:
#Stratified split for class balance
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.1, stratify=y_train_full, random_state=42)
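
The two nested splits above give roughly 72% training, 8% validation and 20% test data. A minimal sketch of the arithmetic follows, assuming the 4,601 rows stated in the scenario, assuming scikit-learn rounds a fractional `test_size` up to a whole number of rows, and assuming Keras' default prediction batch size of 32. The resulting batch counts can be compared with the `104/104` and `29/29` progress counters in the outputs further down.

```python
import math

n_samples = 4601  # assumption: row count taken from the scenario text

# First split: 20% held out as the test set (fractional size rounded up).
n_test = math.ceil(0.2 * n_samples)
n_train_full = n_samples - n_test

# Second split: 10% of the remaining rows held out for validation.
n_val = math.ceil(0.1 * n_train_full)
n_train = n_train_full - n_val

# Number of batches per training epoch and per predict() call at batch size 32.
steps_per_epoch = math.ceil(n_train / 32)
predict_batches = math.ceil(n_test / 32)

print(n_train, n_val, n_test)            # 3312 368 921
print(steps_per_epoch, predict_batches)  # 104 29
```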


In [None]:
#Standardising the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
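
The scaler is fitted on the training data only, and the same training mean and standard deviation are then reused for the validation and test sets so that no information from held-out data leaks into preprocessing. A minimal NumPy sketch of that idea follows; the tiny arrays are made up for illustration, and the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` computes.

```python
import numpy as np

# Made-up toy data: three training rows and one held-out row, two features each.
train = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
held_out = np.array([[4.0, 40.0]])

# "fit": learn the statistics from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the training statistics to every split.
train_scaled = (train - mean) / std
held_out_scaled = (held_out - mean) / std

print(train_scaled.mean(axis=0))  # approximately [0, 0]
print(train_scaled.std(axis=0))   # approximately [1, 1]
print(held_out_scaled)            # held-out row expressed in training units
```

The held-out row is not centred on zero after scaling, which is expected: it is measured against the training distribution rather than its own.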

In [None]:
#Defining best model based on previous tuning results
def build_best_model(input_dim):
    model = Sequential([
        Input(shape=(input_dim,)),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(32, activation='relu'),
        Dense(32, activation='relu'),
        Dense(32, activation='relu'),
        Dense(16, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])
    return model
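
To get a feel for the size of this architecture, the sketch below counts the trainable parameters of each `Dense` layer by hand (weights plus biases). It assumes 57 input features, as described in the scenario; calling `model.summary()` on the built model should report the same total if that assumption holds.

```python
# Layer widths: 57 input features (assumption from the scenario text),
# followed by the Dense layer sizes used in build_best_model.
layer_sizes = [57, 64, 32, 32, 32, 32, 16, 1]

def dense_param_count(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Pair each layer width with the next one to get (inputs, outputs) per layer.
per_layer = [dense_param_count(a, b)
             for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [3712, 2080, 1056, 1056, 1056, 528, 17]
print(total_params)  # 9505
```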

In [None]:
#Train and save the model
best_model = build_best_model(X_train.shape[1])
best_model.fit(X_train, y_train, batch_size=32, epochs=30, validation_data=(X_val, y_val), verbose=1)
best_model.save('best_model.h5')
print("Model saved as 'best_model.h5' in Colab working directory")

Epoch 1/30
104/104 ━━━━━━━━━━━━━━━━━━━━ 4s 9ms/step - accuracy: 0.6690 - loss: 0.5628 - val_accuracy: 0.9484 - val_loss: 0.1876
Epoch 2/30
104/104 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9253 - loss: 0.2135 - val_accuracy: 0.9565 - val_loss: 0.1353
Epoch 3/30
104/104 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.9439 - loss: 0.1725 - val_accuracy: 0.9592 - val_loss: 0.1317
Epoch 4/30
104/104 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9343 - loss: 0.1682 - val_accuracy: 0.9647 - val_loss: 0.1249
Epoch 5/30
104/104 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.9491 - loss: 0.1363 - val_accuracy: 0.9647 - val_loss: 0.1119
Epoch 6/30
104/104 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9658 - loss: 0.1065 - val_accuracy: 0.9592 - val_loss: 0.1177
Epoch 7/30
104/104



Model saved as 'best_model.h5' in Colab working directory


In [None]:
#Predicting on the test set and converting probabilities to binary labels
y_pred_proba = best_model.predict(X_test)
y_pred = (y_pred_proba > 0.5).astype(int)

29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step


In [None]:
#Computing evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
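
All four metrics depend on the 0.5 cut-off used to turn probabilities into labels. The sketch below uses made-up probabilities (not model output from this notebook) to illustrate the usual trade-off: lowering the threshold tends to raise recall at the cost of precision, and raising it does the opposite. The helper `precision_recall_at` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Made-up labels and sigmoid outputs, shaped (n, 1) like Keras predictions.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
proba = np.array([[0.95], [0.80], [0.55], [0.30], [0.60],
                  [0.40], [0.20], [0.10], [0.05], [0.02]])

def precision_recall_at(y_true, proba, threshold):
    """Binarise probabilities at `threshold` and return (precision, recall)."""
    y_hat = (proba.ravel() > threshold).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.25, 0.5, 0.7):
    p, r = precision_recall_at(y_true, proba, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

For spam filtering, a higher threshold may be preferable when wrongly flagging legitimate email is the more costly error.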

In [None]:
#Printing the results
print("\nEvaluation Metrics on Test Data:")
print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1 Score : {f1:.4f}")


Evaluation Metrics on Test Data:
Accuracy : 0.9283
Precision: 0.9136
Recall   : 0.9036
F1 Score : 0.9086
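
As a quick consistency check, the F1 score can be recomputed as the harmonic mean of the precision and recall printed above. Because the inputs here are the rounded four-decimal values, agreement is only expected to within rounding.

```python
# Rounded values copied from the printed evaluation metrics.
precision = 0.9136
recall = 0.9036

# F1 is the harmonic mean of precision and recall.
f1_check = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed: {f1_check:.4f}")  # F1 recomputed: 0.9086
```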


In [None]:
#Insights
from textwrap import fill
print("\nInsight:")
final_insight = (
    "The chosen model, trained with 30 epochs and batch size of 32, achieved strong performance across key metrics. "
    "High recall indicates that the model is particularly good at identifying spam correctly, which is critical in spam detection. "
    "Balanced precision and F1 score suggest that the model maintains reliability without over-predicting false positives."
)
print(fill(final_insight, width=120))


Insight:
The chosen model, trained with 30 epochs and batch size of 32, achieved strong performance across key metrics. High
recall indicates that the model is particularly good at identifying spam correctly, which is critical in spam detection.
Balanced precision and F1 score suggest that the model maintains reliability without over-predicting false positives.


# Reflect

Write a brief paragraph highlighting your process and the rationale to showcase critical thinking and problem-solving.

> In this final phase, I selected the best-performing model from the tuning stage based on test accuracy and validation stability. I retrained it using the most effective combination of epochs and batch size, then saved the model for future reuse. To properly evaluate its real-world utility, I generated predictions on the test set and calculated multiple performance metrics including accuracy, recall, and F1 score. This gave a more complete picture of the model’s strengths, especially its ability to correctly identify spam without over-flagging legitimate emails. The process reinforced the value of structured experimentation and metric-driven validation in model selection.



# References

Hopkins, M., Reeber, E., Forman, G. and Suermondt, J., 1999. Spambase. UCI Machine Learning Repository. [online] Available at: https://archive.ics.uci.edu/dataset/94 [Accessed 5 March 2024].