# **Final Project**

## **Introduction**

In the digital age, email remains one of the most widely used tools for personal and professional communication. However, with the convenience of email comes a persistent challenge: spam. Spam emails are not only disruptive, but they can also pose significant security risks through phishing attempts and malware distribution. According to Cisco's 2021 Cybersecurity Threat Trends report, nearly 85% of all email traffic is spam, emphasizing the need for robust spam detection systems (Cisco, 2021).

This project addresses the problem of accurately classifying email messages as either spam or not spam (ham) using machine learning techniques. Building upon a prior implementation that utilized Logistic Regression and Naive Bayes models, our goal is to improve predictive performance through advanced modeling strategies, better preprocessing, and enhanced feature selection.

We are guided by the following research questions:
- Can we improve classification accuracy by using more complex models such as ensemble methods?
- Which preprocessing strategies (e.g., dimensionality reduction, TF-IDF) lead to better model generalization?
- How can we evaluate and compare models effectively based on metrics like precision, recall, F1 score, and confusion matrices?

By addressing these questions, our work contributes to a broader understanding of how data science and natural language processing techniques can reduce exposure to unsolicited content and support safer digital communication environments.


## **About the Data**

The dataset used in this project is sourced from Kaggle and was originally published as part of a spam classification challenge (Balaka, 2021). The dataset contains **5,172 rows** (emails) and **3,002 columns**, making it a high-dimensional problem suitable for feature selection and regularization strategies.

- The **first column**, `Email No.`, is an identifier for each email and does not carry predictive value.
- The **last column**, `Prediction`, is the binary target variable where:
  - `1` indicates a spam email,
  - `0` indicates a legitimate (ham) email.
- The **remaining 3,000 columns** represent the raw word count of individual tokens (words or characters) that appear in the body of each email.

This is a **bag-of-words** representation of the emails, where each column represents a word and each row contains the number of times that word appears in a specific email. Because of the dataset's high dimensionality, preprocessing steps such as removing low-variance features, reweighting counts with TF-IDF, or applying dimensionality reduction (e.g., PCA or Truncated SVD) are important.
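
One of the preprocessing options mentioned above, removing low-variance features, can be sketched on a tiny count matrix. The values below are hypothetical and are not taken from the Kaggle dataset; the threshold of `0.0` is an assumption that drops only constant columns.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical bag-of-words counts: 4 emails x 4 word columns.
# Column 1 is zero in every email, so it cannot help any classifier.
toy_counts = np.array([
    [2, 0, 1, 0],
    [0, 0, 3, 1],
    [5, 0, 0, 0],
    [1, 0, 2, 4],
])

# threshold=0.0 keeps every column whose variance is strictly positive
selector = VarianceThreshold(threshold=0.0)
reduced = selector.fit_transform(toy_counts)

kept_columns = selector.get_support(indices=True)
print("Kept column indices:", kept_columns)
print("Shape after selection:", reduced.shape)
```

On the real 3,000-column matrix, the same call would remove any word that has the same count in every email; a larger threshold would be a tuning choice rather than something this report prescribes.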

This dataset presents a useful benchmark for spam classification tasks, especially given its sparsity and large feature space, making it ideal for exploring the performance of different classification models and feature selection techniques. 


In [45]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import FunctionTransformer
from scipy.sparse import csr_matrix

In [47]:
# Load Data
df = pd.read_csv("emails copy.csv")

## **Methods**

### Preprocessing

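We begin by loading the dataset and dropping the `Email No.` column, which serves only as an identifier and holds no predictive value. We then check for missing values and drop duplicate records to avoid introducing bias into the training data. Note that the modeling cells reload the CSV and drop only the identifier column, so the reported metrics are computed on all 5,172 rows (the 20% test split contains 1,035 emails).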

To better represent the importance of each word, we apply a **TF-IDF (Term Frequency-Inverse Document Frequency)** transformation. TF-IDF emphasizes rare but informative terms while down-weighting common terms that appear frequently across all documents (Ramos, 2003). This reweighting is intended to help the classifier focus on features that are more likely to distinguish spam from non-spam messages. Unlike raw word counts, TF-IDF vectors produced with scikit-learn's default L2 normalization are also scaled to unit length, which reduces the influence of email length.
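
To make the reweighting concrete, here is a minimal sketch of `TfidfTransformer` on a hypothetical 3-email count matrix (the numbers are invented for illustration). It relies on the scikit-learn defaults `smooth_idf=True` and `norm="l2"`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 3 emails x 3 words.
# Word 0 appears in every email; word 2 appears only in the last one.
toy_counts = np.array([
    [4, 1, 0],
    [3, 0, 0],
    [5, 2, 6],
])

tfidf = TfidfTransformer()  # defaults: smooth_idf=True, norm="l2"
weights = tfidf.fit_transform(toy_counts).toarray()
row_norms = np.linalg.norm(weights, axis=1)

print("IDF per word:", tfidf.idf_)   # rarer words receive a larger IDF
print("Row norms:", row_norms)       # each email vector has unit length
```

The word that occurs in every email gets the smallest IDF, and every row ends up with length 1 regardless of how many words the email contained.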

We then perform a **stratified train-test split**, preserving the ratio of spam to non-spam emails in both training and testing sets. This ensures a more reliable and fair evaluation across imbalanced classes.
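
The effect of stratification can be checked with a small sketch. The labels here are synthetic; the 29% spam share is an assumption chosen to mirror the test-set support reported later (300 spam out of 1,035 emails).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 290 spam (1) and 710 ham (0), i.e. 29% spam
y_toy = np.array([1] * 290 + [0] * 710)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y_toy, both subsets keep (almost exactly) the same spam share
print("Train spam share:", y_tr.mean())
print("Test spam share: ", y_te.mean())
```

Without `stratify`, the two shares would still be close on average but could drift apart on any single random split.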

---

### Modeling Approaches

We experiment with two pipelines designed to evaluate different modeling strategies and dimensionality reduction techniques.

---

#### **Pipeline 1: TF-IDF → Multinomial Naive Bayes**

This pipeline builds directly on the model used in the original project. After transforming the data using TF-IDF, we apply the **Multinomial Naive Bayes** (MNB) classifier. MNB is designed for discrete features such as word counts; although the multinomial model formally assumes integer counts, fractional values such as TF-IDF weights often work well in practice. It uses Bayes' theorem under a feature independence assumption and performs well even with high-dimensional data (Manning, Raghavan, & Schütze, 2008). This model is efficient, interpretable, and works on sparse matrices with non-negative values, which makes it compatible with TF-IDF-transformed data.
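
The non-negativity requirement is easy to demonstrate. The following sketch uses made-up feature values; `X_signed` is a hypothetical stand-in for any representation that can contain negative entries (for example, SVD components).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up non-negative features (e.g. TF-IDF weights) and labels
X_nonneg = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 0.0], [0.0, 1.0]])
y_toy = np.array([1, 0, 1, 0])

# Shifting by -1 introduces negative entries
X_signed = X_nonneg - 1.0

MultinomialNB().fit(X_nonneg, y_toy)  # fits without complaint

try:
    MultinomialNB().fit(X_signed, y_toy)
    rejected = False
except ValueError as err:
    rejected = True
    print("MultinomialNB rejected the input:", err)
```

This is one reason the second pipeline pairs dimensionality reduction with a classifier that accepts signed inputs.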

---

#### **Pipeline 2: TF-IDF → Truncated SVD → Logistic Regression**

To explore further performance improvements, we introduce a second pipeline that includes **dimensionality reduction** through **Truncated Singular Value Decomposition (SVD)**, followed by a **Logistic Regression** classifier.

- **Truncated SVD** is used to project high-dimensional TF-IDF vectors into a lower-dimensional latent space, capturing the most informative patterns while reducing noise and redundancy. It is especially effective on sparse matrices (Halko, Martinsson, & Tropp, 2011) and is commonly used in natural language processing applications like latent semantic analysis.

- **Logistic Regression** is a linear classifier that models the probability of a binary outcome. It performs well when features are continuous, as is the case with SVD-transformed data. It does not require features to be non-negative and benefits from the de-noising effect of dimensionality reduction, making it a strong candidate for this text classification task (Ng, 2002).

Together, TF-IDF and SVD improve the quality and structure of the input space, allowing Logistic Regression to capture meaningful decision boundaries in fewer dimensions.
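
As a rough illustration of the projection step, the sketch below applies `TruncatedSVD` to a synthetic sparse count matrix and reports how much variance a given number of components retains. The Poisson-generated data is an assumption made purely so the example runs on its own; it has no real latent structure, so the retained variance will be lower than on actual email text.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer

# Synthetic stand-in for the word-count matrix: 200 emails x 100 words
rng = np.random.RandomState(0)
toy_counts = csr_matrix(rng.poisson(0.2, size=(200, 100)))
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

retained = {}
for k in (5, 30):
    svd = TruncatedSVD(n_components=k, random_state=42)
    reduced = svd.fit_transform(toy_tfidf)  # dense array of shape (200, k)
    retained[k] = svd.explained_variance_ratio_.sum()
    print(f"k={k}: shape={reduced.shape}, variance retained={retained[k]:.3f}")
```

Comparing the retained variance across several values of `n_components` is one simple way to sanity-check a choice such as the 30 components used in Pipeline 2.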


In [49]:
df.head()

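| | Email No. | the | to | ect | and | for | of | a | you | hou | ... | connevey | jay | valued | lay | infrastructure | military | allowing | ff | dry | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Email 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Email 2 | 8 | 13 | 24 | 6 | 6 | 2 | 102 | 1 | 27 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | Email 3 | 0 | 0 | 1 | 0 | 0 | 0 | 8 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Email 4 | 0 | 5 | 22 | 0 | 5 | 1 | 51 | 2 | 10 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Email 5 | 7 | 6 | 17 | 1 | 5 | 2 | 57 | 0 | 9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |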


In [51]:
# Basic preprocessing steps. Determine if there are null values or duplicates. Remove them if they exist.
# Drop the 'Email No.' column as it is not relevant.
df.drop('Email No.', axis=1, inplace=True)
missing_values = df.isnull().sum()
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

In [52]:
df.info()  # call the method to print the column/dtype summary

In [62]:
# --- Pipeline 1: TF-IDF + Multinomial Naive Bayes ---

# Load the dataset
df = pd.read_csv("emails copy.csv")

# Drop the identifier column
df.drop(columns=['Email No.'], inplace=True)

# Separate features and target
X_counts = df.drop(columns=['Prediction'])
y = df['Prediction']

# Convert to sparse matrix for memory efficiency
X_sparse = csr_matrix(X_counts.values)

# Stratified split to preserve spam/ham ratio
X_train, X_test, y_train, y_test = train_test_split(
    X_sparse, y, test_size=0.2, random_state=42, stratify=y
)

# Define the pipeline
pipeline_nb = Pipeline([
    ('tfidf', TfidfTransformer()),         # Convert raw counts to TF-IDF
    ('nb', MultinomialNB())                # Multinomial Naive Bayes classifier
])

# Train the pipeline
pipeline_nb.fit(X_train, y_train)

# Predict and evaluate
y_pred_nb = pipeline_nb.predict(X_test)
print("TF-IDF + Naive Bayes Classification Report:\n")
print(classification_report(y_test, y_pred_nb))
print("Confusion Matrix:\n")
print(confusion_matrix(y_test, y_pred_nb))


TF-IDF + Naive Bayes Classification Report:

              precision    recall  f1-score   support

           0       0.86      0.99      0.92       735
           1       0.94      0.62      0.75       300

    accuracy                           0.88      1035
   macro avg       0.90      0.80      0.84      1035
weighted avg       0.89      0.88      0.87      1035

Confusion Matrix:

[[724  11]
 [113 187]]


In [64]:
# --- Pipeline 2: TF-IDF + SVD + Logistic Regression ---

# Define the pipeline
pipeline_lr = Pipeline([
    ('tfidf', TfidfTransformer()),                      # Convert raw counts to TF-IDF
    ('svd', TruncatedSVD(n_components=30, random_state=42)),  # Reduce to 30 latent dimensions
    ('lr', LogisticRegression(max_iter=1000, random_state=42)) # Logistic Regression on reduced features
])

# Train the pipeline
pipeline_lr.fit(X_train, y_train)

# Predict and evaluate
y_pred_lr = pipeline_lr.predict(X_test)
print("TF-IDF + SVD + Logistic Regression Classification Report:\n")
print(classification_report(y_test, y_pred_lr))
print("Confusion Matrix:\n")
print(confusion_matrix(y_test, y_pred_lr))


TF-IDF + SVD + Logistic Regression Classification Report:

              precision    recall  f1-score   support

           0       0.91      0.96      0.93       735
           1       0.88      0.77      0.82       300

    accuracy                           0.90      1035
   macro avg       0.90      0.86      0.88      1035
weighted avg       0.90      0.90      0.90      1035

Confusion Matrix:

[[704  31]
 [ 70 230]]


## **Evaluation**

To assess the performance of our spam classification models, we compared two pipelines using the same stratified train-test split:

1. **TF-IDF → Multinomial Naive Bayes**  
2. **TF-IDF → Truncated SVD → Logistic Regression**

We used standard classification metrics including **precision**, **recall**, **F1 score**, and overall **accuracy**. Additionally, confusion matrices were analyzed to assess the nature of classification errors.
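
The spam-class metrics follow directly from the confusion matrices printed above. As a check, this short sketch recomputes them from those four counts, using scikit-learn's layout `[[TN, FP], [FN, TP]]` with spam as the positive class. The helper name `spam_metrics` is ours, introduced only for illustration.

```python
def spam_metrics(confusion):
    """Compute spam-class metrics from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)            # flagged emails that really are spam
    recall = tp / (tp + fn)               # actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 2),
        "accuracy": round(accuracy, 2),
    }

# Confusion matrices as reported for the two pipelines
nb_metrics = spam_metrics([[724, 11], [113, 187]])
lr_metrics = spam_metrics([[704, 31], [70, 230]])

print("Naive Bayes:        ", nb_metrics)
print("Logistic Regression:", lr_metrics)
```

The recomputed values (0.94 / 0.62 / 0.75 / 0.88 for Naive Bayes and 0.88 / 0.77 / 0.82 / 0.90 for Logistic Regression) agree with the classification reports.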

---

### Pipeline 1: TF-IDF + Multinomial Naive Bayes

- **Accuracy**: 88%
- **Precision (spam class)**: 0.94
- **Recall (spam class)**: 0.62
- **F1 Score (spam class)**: 0.75

While this pipeline achieved strong precision, it struggled with recall for the spam class, correctly identifying only **62% of actual spam messages** (187 of 300). Although it produced few false positives (11), it was much more likely to **miss spam emails** (113 false negatives), which pulls the spam F1 score down to 0.75.

---

### Pipeline 2: TF-IDF + SVD + Logistic Regression

- **Accuracy**: 90%
- **Precision (spam class)**: 0.88
- **Recall (spam class)**: 0.77
- **F1 Score (spam class)**: 0.82

This model produced a **more balanced performance**, improving recall from 0.62 to 0.77 at the cost of precision dropping from 0.94 to 0.88 (false positives rose from 11 to 31). It identified **230 out of 300 spam emails** correctly, reducing false negatives from 113 to 70. The improvement in F1 score for the spam class (from 0.75 to 0.82) reflects this better balance between precision and recall.

---

### Overall Comparison

| Metric                | Naive Bayes | Logistic Regression |
|----------------------|-------------|---------------------|
| Accuracy             | 0.88        | **0.90**            |
| Spam Recall (Class 1)| 0.62        | **0.77**            |
| Spam Precision       | **0.94**    | 0.88                |
| Spam F1 Score        | 0.75        | **0.82**            |
| Weighted Avg F1      | 0.87        | **0.90**            |

The **Logistic Regression pipeline outperforms** the Naive Bayes model in overall accuracy, F1 score, and especially in **recall**, which is crucial in spam detection tasks. A model with low recall for spam may allow too many malicious emails to reach users. Therefore, we consider the second pipeline to be a more reliable and robust choice for deployment.

---

Both pipelines use **TF-IDF**, which down-weights terms that are common across all emails, and the same **stratified split**, which keeps the spam-to-ham ratio consistent between the training and test sets.

Future improvements could include experimenting with regularization in Logistic Regression, expanding the feature space with n-grams, or using ensemble models such as Random Forest or Gradient Boosting for further performance gains.
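
As one possible starting point for the regularization experiment, the sketch below tunes the `C` parameter of Logistic Regression with `GridSearchCV`, scoring on spam recall. Everything about the data is an assumption: the counts are synthetic, and the candidate grid is arbitrary rather than a recommendation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data: 120 emails x 40 words, alternating ham (0) and spam (1).
# Spam rows get higher expected counts in the first five columns.
rng = np.random.RandomState(0)
y_toy = np.array([0, 1] * 60)
rates = np.full((120, 40), 0.3)
rates[y_toy == 1, :5] = 2.0
X_toy = csr_matrix(rng.poisson(rates))

pipe = Pipeline([
    ("tfidf", TfidfTransformer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Smaller C means stronger L2 regularization
param_grid = {"lr__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X_toy, y_toy)

print("Best C:", search.best_params_["lr__C"])
print("Cross-validated spam recall:", round(search.best_score_, 3))
```

On the real data, the search would be fit on the training split only, with the held-out test set reserved for the final comparison.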


## **Storytelling and Conclusion**

This project began with a simple but practical goal: improve the performance of a spam detection model originally built using **Multinomial Naive Bayes** and raw word counts. The original model offered a solid baseline, but as we explored the data and modeling techniques more deeply, it became clear there were meaningful opportunities to enhance recall and overall model balance.

Through the process, we learned that **preprocessing choices matter deeply**, especially when working with high-dimensional text data. By applying a **TF-IDF transformation**, we down-weighted words that are common across all emails so that the models could focus on more distinctive terms rather than merely frequent ones. With this adjustment, the Naive Bayes pipeline reached a spam precision of 0.94.

We then took the project further by incorporating **dimensionality reduction** through **Truncated SVD** and transitioning to **Logistic Regression**. This introduced a shift from generative modeling to discriminative modeling, and the improvement in **recall, accuracy, and F1-score** confirmed that this trade-off was worthwhile. Logistic Regression, although slightly less precise, caught more actual spam, reducing false negatives—arguably the most important type of error in a filtering task.

From a data science perspective, this project emphasized the importance of:
- Evaluating models on metrics that align with real-world goals (e.g., prioritizing recall for spam detection),
- Understanding the assumptions and limitations of different algorithms (e.g., Multinomial Naive Bayes requires non-negative inputs, so it cannot be applied directly to SVD output),
- Iterative experimentation: not every technique works right away, and even valid approaches like SVD required adjustment to component size and model compatibility.

Our journey from baseline replication to meaningful performance gains shows how even modest changes—when grounded in theory and tested with care—can significantly improve outcomes. The final model is not only more accurate, but also more aligned with the practical needs of spam detection systems.

This project also deepened our understanding of NLP workflows, pipeline engineering, model evaluation, and the value of documentation and reproducibility.

In a future iteration, we would explore:
- Hyperparameter tuning of the `alpha` in Naive Bayes and regularization in Logistic Regression,
- Testing n-gram features (e.g., bi-grams) to capture more contextual information,
- Applying ensemble models or neural networks for further accuracy,
- Addressing potential class imbalance with SMOTE or other re-sampling strategies.

Overall, this project illustrates the iterative nature of machine learning, the impact of thoughtful preprocessing, and the importance of aligning modeling decisions with the real-world context of the task.


## **Impact**

Spam detection plays a critical role in protecting users from unwanted content, phishing scams, and harmful links. By improving the recall of our spam classifier—without sacrificing overall accuracy—we contribute to a system that better shields users from malicious content and reduces the cognitive load of sorting through irrelevant messages.

Our improved model is particularly impactful in **contexts where missing a spam message has significant consequences**, such as corporate environments or vulnerable populations who may be more susceptible to phishing or financial scams. Enhancing recall means fewer harmful emails go undetected, supporting digital security and user trust.

That said, our project also surfaces potential **ethical concerns**. A model that prioritizes spam recall may increase **false positives**, where legitimate emails are incorrectly marked as spam. This could lead to important communications being missed—especially for users whose messages already face systemic biases, such as those writing in non-standard dialects or using informal language more commonly flagged as suspicious. Over-filtering can have real consequences in marginalized communities or in mission-critical communication settings like job applications or healthcare.

There are also **privacy considerations**. While our dataset is anonymized, spam detection systems often rely on continuous ingestion of user data. Developers and organizations deploying such models must be transparent about data usage and implement appropriate safeguards to avoid misuse or overreach in content monitoring.

Lastly, models trained on historical spam data may inherit outdated biases or fail to adapt to evolving spam techniques. A spam email from 2015 likely differs greatly from those exploiting social engineering trends in 2025. This highlights the importance of continuous retraining and auditing of deployed models.

In summary, while our project improves model performance and serves a socially beneficial purpose, it also reminds us of the **importance of fairness, privacy, and adaptability** when developing and deploying real-world machine learning applications.


## **References**
* Balaka. (2021). *Email Spam Classification Dataset (CSV)*. Kaggle. https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv/data

* Cisco. (2021). *2021 Cybersecurity Threat Trends: Phishing, Crypto Top the List*. Cisco Talos. https://www.cisco.com/c/en/us/products/security/email-security/white-paper-listing.html

* Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). *Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions*. SIAM Review, 53(2), 217–288. https://doi.org/10.1137/090771806

* Manning, C. D., Raghavan, P., & Schütze, H. (2008). *Introduction to Information Retrieval*. Cambridge University Press.

* Ng, A. Y. (2002). *Feature selection, L1 vs. L2 regularization, and rotational invariance*. Proceedings of the 19th International Conference on Machine Learning (ICML).

* Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). *Scikit-learn: Machine Learning in Python*. Journal of Machine Learning Research, 12, 2825–2830.

* Ramos, J. (2003). *Using TF-IDF to determine word relevance in document queries*. In Proceedings of the First Instructional Conference on Machine Learning. 