# COGS 118B - Final Project

# Classifying Spam Versus Non-Spam Mail Using Dimensionality Reduction and Supervised Algorithms

## Group members
- Banso Nguyen
- Mi Jin Son
- Ricky Zhu
- Jason Tan
- Takumi Sugita

# Abstract 
In this project, we address the pervasive problem of email spam by developing a machine learning model that differentiates unwanted spam from legitimate (ham) emails. Using a Kaggle dataset of 5,170 emails, we aim to leverage both text content and metadata features to train our model. Each email is quantified by its textual content and by metadata attributes such as subject lines and sender information. Our approach involves an initial phase of unsupervised learning, followed by the application of other techniques to refine the model's predictive accuracy. We will evaluate the model on its precision, recall, and overall accuracy in classifying emails into the two categories. Through this methodology, we aim to reduce the incidence of false positives (legitimate emails incorrectly marked as spam) while maintaining a high detection rate for true spam, so that users receive important emails without the interference of unwanted content.

# Background
The digital age has brought an increasing reliance on email communication, paralleled by a rise in spam emails: unsolicited messages often sent in bulk for advertising, phishing, spreading malware, or fraud. Historically, spam was a minor nuisance, but it has evolved into a significant cybersecurity threat, with The Radicati Group reporting that spam emails accounted for 54% of all email traffic in mid-2021 <a name="cite_ref-1"></a>[<sup>1</sup>](https://www.radicati.com/wp/wp-content/uploads/2020/12/Email-Statistics-Report–2021–2025–Executive-Summary.pdf).

Initial efforts to combat spam relied on simple, rule-based filters, such as blacklisting certain senders or flagging messages containing specific keywords <a name="cite_ref-2"></a>[<sup>2</sup>](https://www.researchgate.net/publication/221650814_Spam_Filtering_with_Naive_Bayes_-_Which_Naive_Bayes). However, spammers quickly adapted, evolving their strategies to evade these static defenses. As a result, the focus shifted towards more sophisticated, dynamic methods of detection.

The advent of machine learning offered new prospects for spam detection. Supervised learning models, which require large sets of labeled data, have been effective but also labor-intensive and inflexible against spammers' ever-changing tactics <a name="cite_ref-3"></a>[<sup>3</sup>](https://dl.acm.org/doi/10.1145/1247715.1247717). To address these issues, researchers began exploring unsupervised learning algorithms, which do not require pre-labeled datasets and are capable of detecting patterns and anomalies indicative of spam on their own <a name="cite_ref-4"></a>[<sup>4</sup>](https://link.springer.com/article/10.1007/s10462-009-9109-6).

Clustering algorithms have been instrumental in unsupervised learning for spam detection, identifying natural groupings within data that can suggest common characteristics of spam or ham emails <a name="cite_ref-5"></a>[<sup>5</sup>](https://www.researchgate.net/publication/4096676_Adaptive_filtering_of_spam). Feature engineering has enhanced this process by identifying key characteristics from email content and metadata that are most indicative of spam, including message headers, the frequency of certain words, the use of HTML, and the inclusion of URLs <a name="cite_ref-6"></a>[<sup>6</sup>](https://www.researchgate.net/publication/258514273_Towards_SMS_Spam_Filtering_Results_under_a_New_Dataset).

Despite the potential of unsupervised learning, it is not without challenges. The varying nature of spam content, the continuous adaptation by spammers, and the risk of classifying legitimate emails as spam (false positives) complicate the model development process <a name="cite_ref-7"></a>[<sup>7</sup>](https://ieeexplore.ieee.org/document/788645). Furthermore, the lack of labeled data can make it difficult to assess the true performance of these models, which is why semi-supervised approaches that combine unsupervised clustering with a small set of labeled data for validation are gaining traction <a name="cite_ref-8"></a>[<sup>8</sup>](https://ieeexplore.ieee.org/document/1374241).

In response to these challenges, our project aims to develop an unsupervised machine learning model capable of accurately identifying and segregating spam from legitimate emails. By leveraging unsupervised machine learning and sophisticated feature extraction techniques, we hope to build a model that not only detects current spam strategies but is also robust enough to adapt to future tactics used by spammers.


# Problem Statement

Detecting spam mail relies on comparing spam email and non-spam email features. Our group will use machine learning and natural language processing techniques to explore which email feature(s) produce the highest precision and accuracy values when predicting spam mail.

# Data

- **Link**: https://www.kaggle.com/datasets/venky73/spam-mails-dataset
- **Description**: 4 variables, 5,170 observations. 
- **Each observation** consists of the email number, the text label marking the email as spam or ham (not spam), the text of the email, and a binary-encoded label column (1 for spam, 0 otherwise).
- **Critical variables**: 
  - Labels: Whether each email is categorized as spam or not spam, given in both text form (spam or ham) and numerical form (1 for spam, 0 otherwise).
  - Text: The text of the email, useful for sentiment analysis to evaluate whether a given email is spam or not. Stored as a single string containing the whole email.
- **Cleaning/Transformations**:
  - Clean out any invalid emails that might contain incomplete information.
  - Check to see if every value in the dataset satisfies the correct type for each corresponding column. 
  - The email text may contain elements that are not useful for spam classification, such as punctuation, special characters, numbers, and common stop words that carry little meaning ("and", "the", "is", etc.). We will remove these, convert the text to lowercase, tokenize it into individual words, and compute word frequencies (TF or TF-IDF) across the whole dataset.
  - Normalization: Perform stemming or lemmatization to reduce words to their root form. This helps in reducing the dimensionality of the feature space and can improve the performance of text classification algorithms.
  - Split emails into subject, recipient, sender, etc. for better categorization.
  - More information coming soon....
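
The text-cleaning steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual pipeline: the tiny `STOP_WORDS` set, the function names `tokenize` and `tf_idf`, and the two sample emails are all assumptions made for the example. Stemming or lemmatization is not shown, and a library implementation would normally replace the hand-rolled TF-IDF.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real list would be longer.
STOP_WORDS = {"and", "the", "is", "a", "to", "of", "in", "for", "you"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs only, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    TF is the term count divided by the document length; IDF is
    log(N / document frequency), with no smoothing.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty emails
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample emails, not taken from the dataset.
emails = [
    "Subject: WIN a FREE prize!!! Claim the prize now",
    "Subject: meeting notes for the project",
]
for row in tf_idf(emails):
    # Show the three highest-weighted terms of each email.
    print(sorted(row, key=row.get, reverse=True)[:3])
```

Terms that appear in every email (here "subject") receive a weight of zero, which is the behavior that makes TF-IDF useful for down-weighting uninformative words.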

# Proposed Solution

Our group will use the Natural Language Toolkit (NLTK) to transform the features we choose, then run an unsupervised algorithm to reduce dimensionality and find the most prominent features. The initial algorithm we have in mind is PCA.

Principal Component Analysis (PCA) is a statistical technique used in machine learning to simplify the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the original variables into a new set of variables, the principal components, which are ordered by the amount of original variance they retain. The first principal component holds the most variance, the second holds the second most, and so on. This process is known as dimensionality reduction and is particularly useful when you have data with many variables, allowing for easier visualization and analysis.
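
To make the description above concrete, here is a minimal sketch of PCA computed with a singular value decomposition in NumPy. The function name `pca` and the random toy matrix are assumptions for illustration; the toy data merely stands in for a real feature matrix such as TF-IDF weights.

```python
import numpy as np


def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S**2 / (X.shape[0] - 1)
    explained_ratio = explained_variance / explained_variance.sum()
    components = Vt[:n_components]
    scores = X_centered @ components.T  # data in the reduced space
    return scores, components, explained_ratio, mean


rng = np.random.default_rng(0)
# Toy stand-in for a feature matrix: 200 "emails", 10 "features"
# with deliberately different variances per feature.
X = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)
scores, components, ratio, mean = pca(X, n_components=3)
print(scores.shape)
print(np.round(ratio[:3], 3))
```

The explained-variance ratios come out in non-increasing order, matching the statement that the first component retains the most variance, the second the next most, and so on.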

We can use the PCA results to estimate which features are most prominent. We will then use those features in unsupervised clustering models such as a Gaussian mixture model (GMM) and compare them against supervised classification algorithms such as logistic regression or random forests, classifying each email as spam or not spam and reporting accuracy and precision.
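
One possible way to read "prominent features" off a PCA fit is to look at which original features carry the largest absolute loadings on the leading components. The sketch below illustrates that idea only; the helper name `top_features_per_component`, the five-word vocabulary, and the synthetic data are assumptions, and large loadings are a heuristic rather than a guarantee that a feature separates spam from ham.

```python
import numpy as np


def top_features_per_component(X, feature_names, n_components=2, top_n=3):
    """List the features with the largest absolute loadings per component."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    result = []
    for component in Vt[:n_components]:
        # Sort features by the magnitude of their loading, largest first.
        order = np.argsort(np.abs(component))[::-1][:top_n]
        result.append([feature_names[i] for i in order])
    return result


rng = np.random.default_rng(0)
# Hypothetical vocabulary; the first feature is given much larger variance.
vocab = ["free", "prize", "meeting", "project", "click"]
X = rng.normal(size=(300, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
top = top_features_per_component(X, vocab, n_components=2, top_n=2)
print(top)
```

Because PCA is driven by variance, features on larger scales dominate the leading components, so the input should be scaled consistently (as TF-IDF weights are) before interpreting loadings this way.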

# Evaluation Metrics

One evaluation metric for the performance of PCA is the reconstruction error. It measures the dissimilarity between the original data and the data reconstructed from the reduced-dimensional representation, and so helps assess how well the reduced representation captures the important features of the original data. $$E_{recon} = \frac{1}{N}\sum_{i=1}^{N} \lVert X_{i}-\hat{X}_{i} \rVert^2$$ Here $N$ is the number of data points, $X_i$ is the $i$-th original data point, and $\hat{X}_i$ is its reconstruction. This is the mean squared Euclidean distance between each original data point and its reconstruction.
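
As a rough illustration of the reconstruction error, the sketch below projects data onto the top principal components, maps it back, and averages the squared Euclidean distances over the data points. The function name `reconstruction_error` and the random toy data are assumptions for the example.

```python
import numpy as np


def reconstruction_error(X, n_components):
    """Mean squared Euclidean distance between X and its PCA reconstruction."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:n_components]                 # top principal directions
    X_hat = X_centered @ W.T @ W + mean   # project down, then map back
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))  # toy data standing in for email features
errors = [reconstruction_error(X, k) for k in range(1, 7)]
print(np.round(errors, 4))
```

The error shrinks as more components are kept and is numerically zero when all components are retained, which gives a quick sanity check on any implementation.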


One evaluation metric that can quantify the performance of the supervised classification model is the F1 score, the harmonic mean of precision and recall. $$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Precision is the ratio of correctly predicted positive observations to the total predicted positives. $$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$


Because the F1 score combines precision and recall, it accounts for both false positives and false negatives. A high F1 score means the model strikes a good balance between precision and recall. Both error types matter here: a non-spam email flagged as spam may hide important information from the user, while a missed spam email lets unwanted content into the inbox.
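
The three metrics can be computed directly from true and predicted labels. The following is a small sketch under the assumption that spam is encoded as 1 and ham as 0; the function name `precision_recall_f1` and the example label lists are made up for illustration, and returning 0.0 when a denominator is zero is one convention among several.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Fall back to 0.0 when a ratio is undefined (no predicted or actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this example there are three true positives, one false positive (a ham email flagged as spam), and one false negative (a missed spam email), so precision, recall, and F1 all equal 0.75.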

# Results

You may have done tons of work on this. Not all of it belongs here. 

Reports should have a __narrative__. Once you've looked through all your results over the quarter, decide on one main point and 2-4 secondary points you want us to understand. Include the detailed code and analysis results of those points only; you should spend more time/code/plots on your main point than the others.

If you went down any blind alleys that you later decided to not pursue, please don't abuse the TAs' time by throwing in 81 lines of code and 4 plots related to something you actually abandoned. Consider deleting things that are not important to your narrative. If it's slightly relevant to the narrative or you just want us to know you tried something, you could keep it in by summarizing the result in this report in a sentence or two, moving the actual analysis to another file in your repo, and providing us a link to that file.

### Subsection 1 ( add title) - Mijin

You will likely have different subsections as you go through your report. For instance you might start with an analysis of the dataset/problem and from there you might be able to draw out the kinds of algorithms that are / aren't appropriate to tackle the solution.  Or something else completely if this isn't the way your project works.

- evidence that we worked on other unsupervised techniques but failed
- I will load data here

### Subsection 2 (add title) - Ricky
Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

- run PCA
- using variance explained, pick 3 to 4 candidate numbers of components to use for the supervised machine learning portion
- If you did the metrics, maybe show metrics for PCA reconstruction error?
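
One way to pick candidate numbers of components from the variance explained is to find, for a few target fractions, the smallest number of components whose cumulative explained variance reaches each target. The sketch below assumes targets of 80%, 90%, and 95%, which are illustrative values rather than choices made in this project; the function name `components_for_variance` and the toy data are likewise hypothetical.

```python
import numpy as np


def components_for_variance(X, thresholds=(0.80, 0.90, 0.95)):
    """Smallest number of components reaching each cumulative-variance target."""
    X_centered = X - X.mean(axis=0)
    S = np.linalg.svd(X_centered, compute_uv=False)
    cumulative = np.cumsum(S**2) / np.sum(S**2)
    ks = {}
    for t in thresholds:
        # Index of the first cumulative value >= t, converted to a count.
        k = int(np.searchsorted(cumulative, t) + 1)
        ks[t] = min(k, len(S))  # guard against floating-point shortfall
    return ks


rng = np.random.default_rng(2)
# Toy data with decreasing variance per feature.
X = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)
ks = components_for_variance(X)
print(ks)
```

The resulting counts can then serve as the 3 to 4 candidate values to compare in the supervised stage.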



### Subsection 3(add title) - Banso
- Describe the model we chose, which is the random forest, and then I think you can try using one of the PC values we selected from subsection 4 and do what the instructions below say.

Probably you need to describe the base model and demonstrate its performance.  Maybe you include a learning curve to show whether you have enough data to do train/validate/test split or have to go to k-folds or LOOCV or ???

### Subsection 4 (add title) - Banso 
- I think we can use the 3 to 4 candidate numbers of PCs to do cross-validation
- pick the best one and do metrics for the best model

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 (add title) - Jason & Mijin
- I found the Enron dataset and models already built on it. As a benchmark, we can run the Enron dataset through our model and compare with the best online model: [link](https://www.kaggle.com/code/juanagsolano/spam-email-classifier-from-enron-dataset#Model-train)

Maybe you do model selection again, but using a different kind of metric than before?


# Discussion 

### Interpreting the result- Jason

OK, you've given us quite a bit of technical information above, now it's time to tell us what to pay attention to in all that. Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points. You probably want 2-5 sentences per point.

### Limitations- Takumi

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   

### Ethics & Privacy
Some of the potential concerns with our project are:
1. If we use data that includes personal information such as names or email addresses without anonymizing it, we will be breaching people's privacy. This data could then be misused by outsiders.
2. There is a chance that the data we obtain was collected without the consent of email users.
3. Original sources might remove their data from Kaggle or elsewhere because they decided to stop publishing it. When this happens, it would be unethical for this project to keep holding on to the data.
4. Spammers may alter their spam emails if they become aware of such prediction methods.

Our team will address these issues as follows:
1. We will do our best to make sure the data we use does not include personal information, for example by searching for "@" characters to find email addresses and by randomly selecting lines of data and checking them for personal information. If we find data that includes personal information, we will remove or hash it.
2. We will thoroughly research where and how the data was obtained. We will only use data that is confirmed to have been obtained legally and ethically.
3. We will take down the data or text files that we used for the project if the original sources take down theirs.
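
The first mitigation, finding email addresses and hashing them, could look roughly like the sketch below. It is an assumption-laden illustration: the regular expression is a simplified pattern that will miss some valid addresses, and the `hash_emails` helper, the `salt` value, and the `<email:...>` placeholder format are all hypothetical.

```python
import hashlib
import re

# Simplified email pattern (an assumption); it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_emails(text, salt="project-salt"):
    """Replace each email address in text with a short salted SHA-256 digest."""
    def replace(match):
        address = match.group(0).lower()  # treat case variants as the same address
        digest = hashlib.sha256((salt + address).encode("utf-8")).hexdigest()
        return f"<email:{digest[:12]}>"
    return EMAIL_PATTERN.sub(replace, text)


# Hypothetical sample text, not taken from the dataset.
sample = "Contact jane.doe@example.com or JANE.DOE@example.com today."
print(hash_emails(sample))
```

Hashing rather than deleting keeps the information that two emails share a sender without exposing the address itself; a manual spot check of random rows is still needed for names and other personal details that a pattern cannot catch.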

### Conclusion Takumi

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

