### **NLP Final Project**
#### **Spam and Sentiment Email Analysis: Spam Bi-LSTM Supervised Learning**

Wilson Neira

##### **1. Import**
* Import libraries needed for deep learning and text sequence preparation with TensorFlow/Keras.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import classification_report
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, Concatenate


##### **2. Tokenization and Sequence Padding**

* Load the datasets, encode the labels, and convert the email texts into padded integer sequences using `Tokenizer` and `pad_sequences`.
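To make the text-to-sequence step concrete, here is a minimal pure-Python sketch of the idea behind a frequency-ranked vocabulary with a size cap. It is an illustration, not the Keras implementation: the helper names `build_word_index` and `texts_to_sequences` and the toy sentences are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation by default.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unseen words and any index >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


# Hypothetical toy corpus, not the project's data
train_texts = ["free prize claim your free prize now", "meeting moved to friday"]
word_index = build_word_index(train_texts)

# "ticket" never appeared in training, so it is silently dropped
sequences = texts_to_sequences(["claim your free ticket now"], word_index, num_words=5000)
small_vocab = texts_to_sequences(["claim your free ticket now"], word_index, num_words=4)
print(word_index["free"], sequences, small_vocab)
```

The vocabulary is built from the training texts only, which mirrors fitting the tokenizer on `train_df['email']`, so words that occur only in the test set contribute nothing to the test sequences.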

In [2]:
# Load Data
train_df = pd.read_csv("train_data_with_clusters.csv")
test_df = pd.read_csv("test_data_with_clusters.csv")

# Encode labels
le = LabelEncoder()
train_labels = le.fit_transform(train_df['label'])  # spam:1, ham:0
test_labels = le.transform(test_df['label'])

# Prepare tokenizer (fit on train)
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_df['email'])

# Text to sequences
X_train_seq = tokenizer.texts_to_sequences(train_df['email'])
X_test_seq = tokenizer.texts_to_sequences(test_df['email'])

# Padding sequences
max_len = 200
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding='post')
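The padding call above sets `padding='post'` but leaves `truncating` at its default of `'pre'`, so short emails get zeros appended while long emails lose tokens from the front. The sketch below mimics that combination in plain Python; the function name `pad_post_truncate_pre` is hypothetical and this is a simplified illustration rather than the Keras code.

```python
def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Return fixed-length rows: keep the last `maxlen` tokens, pad at the end."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                                   # truncating='pre': drop from the front
        padded.append(kept + [value] * (maxlen - len(kept)))   # padding='post': zeros at the end
    return padded


# One short and one long toy sequence
rows = pad_post_truncate_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(rows)
```

If the opening of an email is expected to carry more signal than its ending, passing `truncating='post'` to `pad_sequences` would keep the first `max_len` tokens instead; which choice works better here is an empirical question.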


##### **3. Bi-LSTM Baseline (no clusters)**
* Define, train, and evaluate a baseline Bi-LSTM model on email sequences, reporting classification metrics.
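The baseline in this section ends in a single sigmoid unit, is trained with binary cross-entropy, and turns probabilities into labels with a 0.5 threshold. A small pure-Python sketch of those three pieces follows; the scores and labels are made-up values chosen only for illustration.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


logits = [2.0, -1.0, 0.3]   # hypothetical pre-activation scores for three emails
y_true = [1, 0, 0]          # 1 = spam, 0 = ham

probs = [sigmoid(z) for z in logits]
preds = [int(p > 0.5) for p in probs]   # same rule as `predict(...) > 0.5`
loss = binary_cross_entropy(y_true, probs)
print(preds, round(loss, 4))
```

The third email shows why loss and accuracy can move differently: a probability just above 0.5 counts as a full error for accuracy but contributes only a moderate penalty to the loss.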

In [3]:
# Model definition
input_text = Input(shape=(max_len,))
embedding = Embedding(input_dim=5000, output_dim=128)(input_text)
x = Bidirectional(LSTM(64))(embedding)
output = Dense(1, activation='sigmoid')(x)

model_baseline = Model(inputs=input_text, outputs=output)
model_baseline.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train
model_baseline.fit(X_train_pad, train_labels, epochs=5, batch_size=64, validation_split=0.1)

# Evaluate
predictions = (model_baseline.predict(X_test_pad) > 0.5).astype("int32")
print("Baseline Bi-LSTM Classification Report:")
print(classification_report(test_labels, predictions, target_names=le.classes_))


Epoch 1/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 63s 156ms/step - accuracy: 0.8809 - loss: 0.2738 - val_accuracy: 0.9837 - val_loss: 0.0514
Epoch 2/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 58s 154ms/step - accuracy: 0.9790 - loss: 0.0611 - val_accuracy: 0.9867 - val_loss: 0.0433
Epoch 3/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 61s 160ms/step - accuracy: 0.9827 - loss: 0.0524 - val_accuracy: 0.9896 - val_loss: 0.0379
Epoch 4/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 63s 165ms/step - accuracy: 0.9961 - loss: 0.0143 - val_accuracy: 0.9885 - val_loss: 0.0439
Epoch 5/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 61s 161ms/step - accuracy: 0.9976 - loss: 0.0117 - val_accuracy: 0.9841 - val_loss: 0.0559
211/211 ━━━━━━━━━━━━━━━━━━━━ 8s 38ms/step
Baseline Bi-LSTM Classification Report:
              precision    recall  f1-score   support

         

##### **4. Bi-LSTM with K-Means Cluster Features**
* Add one-hot encoded K-Means cluster labels as extra features, define a combined model (text + clusters), then train and evaluate it, reporting classification metrics.
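The combined model joins two vectors per email: the Bi-LSTM text representation and a one-hot cluster indicator. The sketch below illustrates that idea in plain Python. The helpers `fit_categories` and `one_hot`, the cluster ids, and the placeholder text vector are all hypothetical; the all-zero row for an unseen cluster mirrors what `OneHotEncoder(handle_unknown='ignore')` produces.

```python
def fit_categories(values):
    """Collect the sorted set of cluster ids seen in training."""
    return sorted(set(values))


def one_hot(values, categories):
    """One row per value; a cluster id not seen in training becomes an all-zero row."""
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


train_clusters = [2, 0, 1, 2]   # hypothetical cluster assignments
test_clusters = [1, 5]          # cluster 5 never appeared in training

categories = fit_categories(train_clusters)
test_onehot = one_hot(test_clusters, categories)

# Stand-in for the Bidirectional(LSTM(64)) output: 64 forward + 64 backward units
text_vector = [0.1] * 128

# Concatenation simply appends the cluster indicator to the text representation
combined = text_vector + test_onehot[0]
print(categories, test_onehot, len(combined))
```

With three clusters the final dense layer sees 131 inputs instead of 128, so the cluster features add only a handful of extra weights on top of the text representation.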

In [4]:
# Prepare cluster features (one-hot encoding)
cluster_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Use the K-Means cluster assignments as the extra feature
train_cluster_feat = cluster_encoder.fit_transform(train_df[['kmeans_cluster']])
test_cluster_feat = cluster_encoder.transform(test_df[['kmeans_cluster']])

# Model definition (text + cluster)
input_text = Input(shape=(max_len,))
embedding = Embedding(input_dim=5000, output_dim=128)(input_text)
x = Bidirectional(LSTM(64))(embedding)

# Cluster input
input_cluster = Input(shape=(train_cluster_feat.shape[1],))

# Concatenate clusters with Bi-LSTM output
concatenated = Concatenate()([x, input_cluster])

# Output layer (single sigmoid unit)
output = Dense(1, activation='sigmoid')(concatenated)

model_clusters = Model(inputs=[input_text, input_cluster], outputs=output)
model_clusters.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train
model_clusters.fit(
    [X_train_pad, train_cluster_feat], 
    train_labels, 
    epochs=5, 
    batch_size=64, 
    validation_split=0.1
)

# Evaluate
predictions = (model_clusters.predict([X_test_pad, test_cluster_feat]) > 0.5).astype("int32")
print("Bi-LSTM + Clusters Classification Report:")
print(classification_report(test_labels, predictions, target_names=le.classes_))


Epoch 1/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 69s 174ms/step - accuracy: 0.8792 - loss: 0.2405 - val_accuracy: 0.9844 - val_loss: 0.0512
Epoch 2/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 66s 174ms/step - accuracy: 0.9899 - loss: 0.0343 - val_accuracy: 0.9904 - val_loss: 0.0338
Epoch 3/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 66s 174ms/step - accuracy: 0.9957 - loss: 0.0169 - val_accuracy: 0.9907 - val_loss: 0.0430
Epoch 4/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 67s 176ms/step - accuracy: 0.9969 - loss: 0.0118 - val_accuracy: 0.9889 - val_loss: 0.0431
Epoch 5/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 70s 184ms/step - accuracy: 0.9985 - loss: 0.0068 - val_accuracy: 0.9881 - val_loss: 0.0390
211/211 ━━━━━━━━━━━━━━━━━━━━ 9s 39ms/step
Bi-LSTM + Clusters Classification Report:
              precision    recall  f1-score   support

       

##### **5. Bi-LSTM with Hierarchical Cluster Features**
* Replace the K-Means features with one-hot encoded hierarchical cluster labels, define another combined model, then train and evaluate it, reporting classification metrics.

In [5]:
# Prepare cluster features (one-hot encoding)
cluster_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Use the hierarchical cluster assignments as the extra feature
train_cluster_feat = cluster_encoder.fit_transform(train_df[['hierarchical_cluster']])
test_cluster_feat = cluster_encoder.transform(test_df[['hierarchical_cluster']])

# Model definition (text + cluster)
input_text = Input(shape=(max_len,))
embedding = Embedding(input_dim=5000, output_dim=128)(input_text)
x = Bidirectional(LSTM(64))(embedding)

# Cluster input
input_cluster = Input(shape=(train_cluster_feat.shape[1],))

# Concatenate clusters with Bi-LSTM output
concatenated = Concatenate()([x, input_cluster])

# Output layer (single sigmoid unit)
output = Dense(1, activation='sigmoid')(concatenated)

model_clusters = Model(inputs=[input_text, input_cluster], outputs=output)
model_clusters.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train
model_clusters.fit(
    [X_train_pad, train_cluster_feat], 
    train_labels, 
    epochs=5, 
    batch_size=64, 
    validation_split=0.1
)

# Evaluate
predictions = (model_clusters.predict([X_test_pad, test_cluster_feat]) > 0.5).astype("int32")
print("Bi-LSTM + Hierarchical Clusters Classification Report:")
print(classification_report(test_labels, predictions, target_names=le.classes_))


Epoch 1/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 72s 179ms/step - accuracy: 0.9171 - loss: 0.2366 - val_accuracy: 0.9852 - val_loss: 0.0447
Epoch 2/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 65s 170ms/step - accuracy: 0.9912 - loss: 0.0268 - val_accuracy: 0.9874 - val_loss: 0.0331
Epoch 3/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 69s 181ms/step - accuracy: 0.9959 - loss: 0.0149 - val_accuracy: 0.9885 - val_loss: 0.0361
Epoch 4/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 64s 168ms/step - accuracy: 0.9949 - loss: 0.0136 - val_accuracy: 0.9855 - val_loss: 0.0489
Epoch 5/5
380/380 ━━━━━━━━━━━━━━━━━━━━ 64s 169ms/step - accuracy: 0.9958 - loss: 0.0123 - val_accuracy: 0.9867 - val_loss: 0.0452
211/211 ━━━━━━━━━━━━━━━━━━━━ 8s 38ms/step
Bi-LSTM + Hierarchical Clusters Classification Report:
              precision    recall  f1-score   support

##### **6. Results**
All three models performed strongly on the test set: the baseline Bi-LSTM, the Bi-LSTM with K-Means cluster features, and the Bi-LSTM with hierarchical cluster features. Each reached accuracy, precision, recall, and F1-scores of roughly 98-99%, so the gaps between them are small. The **Bi-LSTM with hierarchical clustering features** performed slightly better overall, with precision, recall, and F1-scores of about 99% for both ham and spam. The **Bi-LSTM with K-Means clustering features** came next with nearly identical results, and the **baseline Bi-LSTM** was marginally lower at approximately 98%. Because the differences are on the order of one percentage point, they suggest that the cluster features add only a little information beyond what the email text already provides, and that the text alone is highly discriminative. Clustering features therefore offer a slight improvement in classification but are not strictly necessary given how separable the two classes are.
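For reference, the precision, recall, and F1 figures discussed here follow directly from the confusion counts for a chosen positive class. The sketch below computes them in plain Python on a made-up set of labels; the function name `binary_metrics` and the toy data are hypothetical and are not the project's predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = spam, 0 = ham
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Calling the same function with `positive=0` gives the ham-side figures, which is how a per-class report ends up with one row for each label.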