### **NLP Final Project**
#### **Spam and Sentiment Email Analysis: Spam Bi-LSTM Supervised Learning**

Wilson Neira

##### **1. Import**
* Import libraries needed for deep learning and text sequence preparation with TensorFlow/Keras.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, Concatenate


##### **2. Tokenization and Sequence Padding**

* Load the datasets, encode the labels, and convert the email texts into padded integer sequences with `Tokenizer` and `pad_sequences`.

In [2]:
# Load Data
train_df = pd.read_csv("group34_train_data_with_clusters.csv")
test_df = pd.read_csv("group34_test_data_with_clusters.csv")

# Encode labels
le = LabelEncoder()
train_labels = le.fit_transform(train_df['label'])  # spam:1, ham:0
test_labels = le.transform(test_df['label'])

# Prepare tokenizer (fit on train)
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_df['email'])

# Text to sequences
X_train_seq = tokenizer.texts_to_sequences(train_df['email'])
X_test_seq = tokenizer.texts_to_sequences(test_df['email'])

# Padding sequences
max_len = 200
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding='post')
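
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences(..., maxlen=max_len, padding='post')` does to one sequence. The helper `pad_post` and the toy token ids are hypothetical and not part of the notebook. Because `truncating` is left at its default of `'pre'`, emails longer than `max_len` keep their last `max_len` tokens rather than their first.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(padding='post') with the default truncating='pre'.

    Hypothetical helper for illustration; assumes maxlen > 0.
    """
    # truncating='pre' drops tokens from the front of long sequences,
    # so only the last `maxlen` token ids survive.
    trimmed = seq[-maxlen:]
    # padding='post' appends the fill value after the real tokens.
    return trimmed + [value] * (maxlen - len(trimmed))


short_email = [4, 8, 15]             # toy token ids, shorter than maxlen
long_email = [1, 2, 3, 4, 5, 6, 7]   # toy token ids, longer than maxlen

padded_short = pad_post(short_email, maxlen=5)
padded_long = pad_post(long_email, maxlen=5)

print(padded_short)  # [4, 8, 15, 0, 0]
print(padded_long)   # [3, 4, 5, 6, 7]
```

If the opening of an email matters more than its end, passing `truncating='post'` to `pad_sequences` would keep the first `max_len` tokens instead.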

##### **3. Bi-LSTM Model Definition**
* Define a Bi-LSTM model that can optionally concatenate one-hot cluster features with the Bi-LSTM output before the sigmoid classification layer.

In [3]:
# Function to create Bi-LSTM model
def create_model(cluster_feature_dim=0):
    input_text = Input(shape=(max_len,))
    embedding = Embedding(input_dim=5000, output_dim=128)(input_text)
    x = Bidirectional(LSTM(64))(embedding)

    if cluster_feature_dim > 0:
        input_cluster = Input(shape=(cluster_feature_dim,))
        concatenated = Concatenate()([x, input_cluster])
        output = Dense(1, activation='sigmoid')(concatenated)
        model = Model(inputs=[input_text, input_cluster], outputs=output)
    else:
        output = Dense(1, activation='sigmoid')(x)
        model = Model(inputs=input_text, outputs=output)

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
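
As a rough sanity check on model size, the trainable parameter count of this architecture can be worked out by hand. The sketch below assumes the standard Keras LSTM layout (four gates, each with an input kernel, a recurrent kernel, and a bias). The function names are hypothetical, and the result should agree with `model.summary()` under that assumption.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)


def bilstm_classifier_params(vocab=5000, embed_dim=128, units=64,
                             cluster_feature_dim=0):
    embedding = vocab * embed_dim                  # one vector per token id
    bilstm = 2 * lstm_params(embed_dim, units)     # forward + backward LSTM
    dense_in = 2 * units + cluster_feature_dim     # width after concatenation
    dense = dense_in + 1                           # Dense(1): weights + bias
    return embedding + bilstm + dense


baseline_total = bilstm_classifier_params()
print(baseline_total)  # 738945
```

Each one-hot cluster column adds only a single weight to the final `Dense(1)` layer, so the cluster-augmented configurations are almost the same size as the baseline.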

##### **4. Prepare Clustering Features and Configurations**
* Perform one-hot encoding on K-Means and hierarchical cluster labels, prepare datasets for each model configuration and set up the StratifiedKFold for cross-validation.

In [4]:
# Prepare clustering features.
# The same encoder object is refit for each cluster column, so the K-Means
# test features must be produced before the encoder is refit below.
cluster_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
train_kmeans_feat = cluster_encoder.fit_transform(train_df[['kmeans_cluster']])
test_kmeans_feat = cluster_encoder.transform(test_df[['kmeans_cluster']])

# Refit on the hierarchical cluster labels
train_hier_feat = cluster_encoder.fit_transform(train_df[['hierarchical_cluster']])
test_hier_feat = cluster_encoder.transform(test_df[['hierarchical_cluster']])

# Combine K-Means and Hierarchical clustering features
train_combined_feat = np.hstack((train_kmeans_feat, train_hier_feat))
test_combined_feat = np.hstack((test_kmeans_feat, test_hier_feat))

# K-fold Cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Store configurations with their features
configurations = {
    'Baseline Bi-LSTM': (X_train_pad, None, None),
    'Bi-LSTM + K-Means': (X_train_pad, train_kmeans_feat, test_kmeans_feat),
    'Bi-LSTM + Hierarchical': (X_train_pad, train_hier_feat, test_hier_feat),
    'Bi-LSTM + Combined Clusters': (X_train_pad, train_combined_feat, test_combined_feat)
}
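
The one-hot step can be illustrated without scikit-learn. The following is a minimal sketch of the encoding with `handle_unknown='ignore'`: categories are learned from the training column only, and a test cluster id never seen in training becomes an all-zero row. The helper names and the toy cluster ids are hypothetical.

```python
def fit_categories(values):
    # Learn the sorted set of category ids from the training column only.
    return sorted(set(values))


def one_hot(values, categories):
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0.0] * len(categories)
        if v in index:          # unknown categories stay all-zero ('ignore')
            row[index[v]] = 1.0
        rows.append(row)
    return rows


train_clusters = [0, 2, 1, 2]   # toy cluster ids, not from the dataset
test_clusters = [1, 3]          # cluster 3 never appears in training

cats = fit_categories(train_clusters)
train_feat = one_hot(train_clusters, cats)
test_feat = one_hot(test_clusters, cats)

print(test_feat)  # [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

The combined configuration simply places the K-Means columns and the hierarchical columns side by side, which is what `np.hstack` does in the cell above.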

##### **5. Stratified 5-Fold Cross-Validation**
* Implement a 5-fold cross-validation strategy for each model configuration, train and validate the models and report classification metrics for each configuration.

In [5]:
# Perform StratifiedKFold cross-validation for each configuration
for config_name, (text_data, cluster_data, test_cluster_data) in configurations.items():
    print(f"\n Stratified 5-Fold CV: {config_name} ")

    all_val_preds = []
    all_val_actuals = []
    fold = 1
    for train_index, val_index in skf.split(text_data, train_labels):
        print(f"\nFold {fold}")
        X_text_train, X_text_val = text_data[train_index], text_data[val_index]
        y_train_fold, y_val_fold = train_labels[train_index], train_labels[val_index]

        # Use the two-input model when this configuration has cluster features
        if cluster_data is not None:
            X_cluster_train, X_cluster_val = cluster_data[train_index], cluster_data[val_index]
            model = create_model(cluster_feature_dim=cluster_data.shape[1])
            model.fit(
                [X_text_train, X_cluster_train], y_train_fold,
                epochs=3, batch_size=64,
                validation_data=([X_text_val, X_cluster_val], y_val_fold)
            )
            val_pred = (model.predict([X_text_val, X_cluster_val]) > 0.5).astype("int32")
        else:
            model = create_model()
            model.fit(
                X_text_train, y_train_fold,
                epochs=3, batch_size=64,
                validation_data=(X_text_val, y_val_fold)
            )
            val_pred = (model.predict(X_text_val) > 0.5).astype("int32")

        all_val_preds.extend(val_pred.flatten())
        all_val_actuals.extend(y_val_fold)
        fold += 1

    # Classification report after cross-validation for each configuration
    print(f"\nClassification Report for {config_name} (5-fold CV combined results):")
    print(classification_report(all_val_actuals, all_val_preds, target_names=le.classes_))


 Stratified 5-Fold CV: Baseline Bi-LSTM 

Fold 1
Epoch 1/3
338/338 ━━━━━━━━━━━━━━━━━━━━ 12s 22ms/step - accuracy: 0.8862 - loss: 0.2657 - val_accuracy: 0.9824 - val_loss: 0.0530
Epoch 2/3
338/338 ━━━━━━━━━━━━━━━━━━━━ 6s 19ms/step - accuracy: 0.9922 - loss: 0.0258 - val_accuracy: 0.9839 - val_loss: 0.0510
Epoch 3/3
338/338 ━━━━━━━━━━━━━━━━━━━━ 12s 25ms/step - accuracy: 0.9959 - loss: 0.0149 - val_accuracy: 0.9820 - val_loss: 0.0561
169/169 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step

Fold 2
Epoch 1/3
338/338 ━━━━━━━━━━━━━━━━━━━━ 10s 23ms/step - accuracy: 0.8609 - loss: 0.3122 - val_accuracy: 0.9752 - val_loss: 0.0802
Epoch 2/3
338/338 ━━━━━━━━━━━━━━━━━━━━ 10s 21ms/step - accuracy: 0.9906 - loss: 0.0346 - val_accuracy: 0.9841 - val_loss: 0.0616
Epoch 3/3
338/338 ━━━━━━━━━━━━━━━━━━━━

##### **6. Final Explicit Evaluation on Unseen Test Set**
* Retrain each model configuration on the training set (with 10% held out by `validation_split` for monitoring), evaluate it on the independent test set, and report classification metrics together with a sample of misclassified emails.

In [6]:
# ---------------------------------------------------------
# Final evaluation of each configuration on the held-out test set
print("\n FINAL EVALUATION ON UNSEEN TEST SET \n")

for config_name, (text_data, cluster_data, test_cluster_data) in configurations.items():
    print(f"\n Final Classification Report: {config_name} ")

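    # Train on the training data; validation_split=0.1 holds out the last 10%
    # of the rows for monitoring, so the model is fit on the remaining 90%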
    if cluster_data is not None:
        model = create_model(cluster_feature_dim=cluster_data.shape[1])
        model.fit(
            [text_data, cluster_data], train_labels,
            epochs=3, batch_size=64, validation_split=0.1
        )
        predictions = (model.predict([X_test_pad, test_cluster_data]) > 0.5).astype("int32")
    else:
        model = create_model()
        model.fit(
            text_data, train_labels,
            epochs=3, batch_size=64, validation_split=0.1
        )
        predictions = (model.predict(X_test_pad) > 0.5).astype("int32")

    # Print final classification report
    print(classification_report(test_labels, predictions, target_names=le.classes_))

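    # Error analysis: collect the misclassified test emails
    # (head(10) below shows the first 10 in file order, not ranked by severity)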
    error_analysis_df = test_df.copy()
    error_analysis_df['prediction'] = predictions.flatten()
    error_analysis_df['actual'] = test_labels
    extreme_errors = error_analysis_df[error_analysis_df['prediction'] != error_analysis_df['actual']]

    print(f"\nInsightful Analysis of Extreme Errors for {config_name} (Top 10 examples):")
    print(extreme_errors[['email', 'actual', 'prediction']].head(10))


 FINAL EVALUATION ON UNSEEN TEST SET 


 Final Classification Report: Baseline Bi-LSTM 
Epoch 1/3
380/380 ━━━━━━━━━━━━━━━━━━━━ 10s 21ms/step - accuracy: 0.8823 - loss: 0.2706 - val_accuracy: 0.9874 - val_loss: 0.0427
Epoch 2/3
380/380 ━━━━━━━━━━━━━━━━━━━━ 10s 19ms/step - accuracy: 0.9907 - loss: 0.0311 - val_accuracy: 0.9867 - val_loss: 0.0425
Epoch 3/3
380/380 ━━━━━━━━━━━━━━━━━━━━ 10s 19ms/step - accuracy: 0.9948 - loss: 0.0192 - val_accuracy: 0.9885 - val_loss: 0.0364
211/211 ━━━━━━━━━━━━━━━━━━━━ 2s 7ms/step
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99      3309
        spam       0.99      0.99      0.99      3434

    accuracy                           0.99      6743
   macro avg       0.99      0.99      0.99      6743
weighted avg       0.99      0.99      0.99      6743


Insightful Analysis

##### **7. Results**
Four configurations of the Bi-LSTM model were evaluated, using 5-fold cross-validation on the training set followed by a final evaluation on an independent test set.

###### **Cross-Validation Results**

* **Baseline Bi-LSTM**:
  * Validation accuracy ranged from 98.0% to 98.6%, with validation loss consistently low (0.043-0.056).
  * High precision, recall, and F1-score for both classes:
    * ham: 99% precision, 98% recall, 98% F1
    * spam: 98% precision, 99% recall, 98% F1

* **Bi-LSTM + K-Means**:
  * Similar validation accuracy (mostly 98.0-98.8%) and low validation loss (0.039-0.065).
  * Results appeared slightly more consistent from fold to fold.
  * Per-class metrics:
    * ham: 99% precision, 98% recall, 98% F1
    * spam: 98% precision, 99% recall, 98% F1

* **Bi-LSTM + Hierarchical**:
  * Performance close to the baseline and K-Means configurations: validation accuracy of about 98.0-98.7% and validation loss of 0.041-0.067.
  * Per-class metrics:
    * ham: 99% precision, 98% recall, 99% F1
    * spam: 98% precision, 99% recall, 99% F1

* **Bi-LSTM + Combined Clusters**:
  * Validation accuracy and loss comparable to the best-performing configurations (accuracy 98-99%, loss 0.032-0.050).
  * Per-class metrics match the hierarchical configuration:
    * ham: 99% precision, 98% recall, 99% F1
    * spam: 98% precision, 99% recall, 99% F1
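
The per-class precision, recall, and F1 figures above come from a single classification report computed on predictions pooled across the five validation folds. A minimal pure-Python sketch of that pooling and of the per-class metrics follows. The `binary_report` helper is hypothetical, and the fold data are toy values, not results from this notebook.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy (actual, predicted) labels for two folds; 1 = spam, 0 = ham.
folds = [
    ([1, 0, 1, 0], [1, 0, 0, 0]),
    ([1, 1, 0, 0], [1, 1, 1, 0]),
]

# Pool every fold's validation predictions before computing metrics.
all_actual, all_pred = [], []
for actual, pred in folds:
    all_actual.extend(actual)
    all_pred.extend(pred)

spam_p, spam_r, spam_f1 = binary_report(all_actual, all_pred, positive=1)
ham_p, ham_r, ham_f1 = binary_report(all_actual, all_pred, positive=0)
print(f"spam: P={spam_p:.2f} R={spam_r:.2f} F1={spam_f1:.2f}")
print(f"ham:  P={ham_p:.2f} R={ham_r:.2f} F1={ham_f1:.2f}")
```

Pooling gives one report per configuration. Computing the metrics per fold and then averaging is an alternative that would also expose fold-to-fold spread.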

###### **Final Evaluation on Unseen Test Set**

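Accuracy, precision, recall, and F1 are measured on the test set. The loss figures are final-epoch validation losses on the 10% hold-out created by `validation_split`, since the notebook does not compute a loss on the test set.

* **Baseline Bi-LSTM**:
  * Accuracy: 99%, validation loss: 0.036
  * ham/spam: 99% precision, 99% recall, 99% F1

* **Bi-LSTM + K-Means**:
  * Accuracy: 99%, validation loss: 0.038
  * ham/spam: 99% precision, 99% recall, 99% F1

* **Bi-LSTM + Hierarchical**:
  * Accuracy: 99%, validation loss: 0.041
  * ham: 99% precision, 98% recall, 99% F1
  * spam: 98% precision, 99% recall, 99% F1

* **Bi-LSTM + Combined Clusters**:
  * Accuracy: 99%, validation loss: 0.033 (lowest of the four)
  * ham: 99% precision, 98% recall, 99% F1
  * spam: 98% precision, 99% recall, 99% F1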

###### **Bias-Variance Analysis**

* **Baseline Bi-LSTM** shows **low bias** (training accuracy above 99%) and **low variance** (validation and test accuracy stay within roughly 1.5 percentage points of training accuracy). Minimal overfitting is observed.

* **Bi-LSTM + K-Means** generalizes as well as the baseline. Its training accuracy starts lower in the first epoch but improves rapidly, and its results look slightly more stable, which may indicate **slightly reduced variance**, although the differences are small.

* **Bi-LSTM + Hierarchical** shows some inconsistent training losses in the early epochs but still reaches excellent test accuracy. This suggests **slightly higher variance** during training, with the final model still well balanced.

* **Bi-LSTM + Combined Clusters** matches or slightly improves on the other configurations in loss, with balanced training and test performance. This suggests **low bias** and possibly improved robustness from the richer feature representation.
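
One simple way to quantify the overfitting discussed above is the gap between training and validation accuracy at the final epoch. The sketch below uses the two final-epoch values that appear in the baseline training logs earlier in this notebook. The `generalization_gap` helper is hypothetical, and the other configurations would need their own logged values.

```python
# Final-epoch (epoch 3) metrics copied from the baseline logs in this notebook.
runs = {
    "CV fold 1": {"accuracy": 0.9959, "val_accuracy": 0.9820},
    "Final run": {"accuracy": 0.9948, "val_accuracy": 0.9885},
}


def generalization_gap(metrics):
    # A positive gap means the model fits the training data better than
    # the held-out data, which is a rough indicator of variance.
    return metrics["accuracy"] - metrics["val_accuracy"]


gaps = {name: generalization_gap(m) for name, m in runs.items()}
for name, gap in gaps.items():
    print(f"{name}: train-validation accuracy gap = {gap:.4f}")
```

The training accuracy that Keras logs is a running average over the epoch, so this gap is only an approximation. Comparing it across all five folds per configuration would give firmer support for the variance claims.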

###### **Extreme Error Analysis**
Across all models, the misclassified examples are largely the same and tend to involve ambiguous or complex language. This suggests that the remaining errors are driven more by the difficulty of the data than by model bias or variance.
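
The notebook's error listing shows the first ten misclassified rows in file order. To surface the most extreme errors instead, the misclassifications could be ranked by how far the predicted probability is from the true label. The sketch below assumes the raw sigmoid outputs are kept before thresholding, which the notebook does not currently do. The function name and the toy values are hypothetical.

```python
def most_confident_errors(probs, labels, top_k=10, threshold=0.5):
    """Return (index, predicted, actual, error) for the worst mistakes."""
    errors = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = int(p > threshold)
        if pred != y:
            # |p - y| is near 1 when the model was confidently wrong.
            errors.append((abs(p - y), i, pred, y))
    errors.sort(reverse=True)
    return [(i, pred, y, round(err, 3)) for err, i, pred, y in errors[:top_k]]


# Toy spam probabilities and true labels (1 = spam, 0 = ham).
probs = [0.97, 0.10, 0.55, 0.02, 0.80]
labels = [0, 0, 0, 1, 1]

ranked = most_confident_errors(probs, labels, top_k=2)
print(ranked)  # [(3, 0, 1, 0.98), (0, 1, 0, 0.97)]
```

With the notebook's variables, `probs` would correspond to `model.predict(...).flatten()` and `labels` to `test_labels`, and the returned indices could be used to look up rows in `test_df`.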

###### **Summary**
All configurations perform very well. The combined-cluster configuration shows the lowest loss, which hints that combining clustering techniques can slightly improve robustness, but the differences are small and the test-set metrics are nearly identical across configurations. The baseline Bi-LSTM already captures the core patterns, so the clustering features add little on this dataset beyond a possible slight gain in stability.
