<a href="https://colab.research.google.com/github/aaronhowellai/machine-learning-projects/blob/main/95%20Percent%20Accuracy%20Email%20Spam%20Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Email Spam Classifier**

This notebook is divided into 2 parts:



## 👍 **Part 1 - 96.3% Accurate Email Spam Classifier**:  

A classification project to build a spam detection system based on a code-along example outlined in Chapter 3 of "Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron.


## 🔥 **Part 2 - 94.5% Accurate Email Spam Classifier** (with harder dataset)

A self-directed extension of the project in Part 1 using a harder dataset that will require a more thorough model selection and fine-tuning process.


---

# 🟢 **Part 1:** Guided Case Study

**Brief:**
* Download examples of spam and ham from [Apache SpamAssassin's public datasets](https://homl.info/spamassassin).
* Unpack the datasets, then split the emails into a training set and a test set.
* Write a data preprocessing pipeline to convert each email into a feature vector. It should transform an email into a sparse vector that indicates the presence or absence of each possible word.
* For example, if the vocabulary consisted of only the four words `Hello`, `how`, `are`, `you`, then the email `Hello you Hello you` would be converted into the boolean vector `[1, 0, 0, 1]`: `Hello` and `you` are present, `how` and `are` are absent.
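
A minimal sketch of the presence/absence encoding described in the brief, using only the four-word vocabulary from the example. The helper name `to_presence_vector` is made up for illustration; the actual pipeline later in this notebook uses word counts and a sparse matrix instead.

```python
# Fixed vocabulary from the example above; its order defines the vector positions.
vocabulary = ["hello", "how", "are", "you"]

def to_presence_vector(words, vocabulary):
    """Return 1 for each vocabulary word that occurs in `words`, else 0."""
    present = {word.lower() for word in words}
    return [1 if word in present else 0 for word in vocabulary]

email_words = ["Hello", "you", "Hello", "you"]
vector = to_presence_vector(email_words, vocabulary)
print(vector)  # [1, 0, 0, 1]
```

Repeated words do not change the result, because only presence is recorded.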

## 📦 **Import Packages**

In [100]:
# import packages
import tarfile
import email
import email.parser  # used explicitly by load_email()
import email.policy
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import pandas as pd
import re
from html import unescape
from scipy.sparse import csr_matrix
import sys
import urllib.request  # urlretrieve lives in the urllib.request submodule

# url replacement
try:
  import urlextract
except ImportError:
  if "google.colab" in sys.modules:
    %pip install -q -U urlextract
    import urlextract  # import again now that the package is installed
  else:
    raise

# beautiful soup
try:
  from bs4 import BeautifulSoup
except ImportError:
  import subprocess
  # install with the same interpreter that is running this notebook
  subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'beautifulsoup4'])
  from bs4 import BeautifulSoup

# machine learning packages
import nltk
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
# models, model selection, feature engineering
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
# metrics
from sklearn.metrics import precision_score, recall_score, accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

## 🔧 **Utils & Functions**

In [24]:
# +----------------------------------------------------------------------+

# (1) fetch spam data
def fetch_spam_data():
  spam_root = "https://spamassassin.apache.org/old/publiccorpus/"
  ham_url = spam_root + "20030228_easy_ham.tar.bz2"
  spam_url = spam_root + "20030228_spam.tar.bz2"

  spam_path = Path() / "datasets" / "spam"
  spam_path.mkdir(parents=True, exist_ok=True)

  for dir_name, tar_name, url in (
      ("easy_ham", "ham", ham_url),
       ("spam","spam", spam_url)
       ):
    if not (spam_path / dir_name).is_dir():
      path = (spam_path / tar_name).with_suffix(".tar.bz2")
      print("Downloading", path)
      urllib.request.urlretrieve(url, path)
      tar_bz2_file = tarfile.open(path)
      tar_bz2_file.extractall(path=spam_path)
      tar_bz2_file.close()
  return [spam_path / dir_name for dir_name in ("easy_ham", "spam")]

# +----------------------------------------------------------------------+

# (2) load email
def load_email(filepath):
  with open(filepath, "rb") as f:
    return email.parser.BytesParser(policy=email.policy.default).parse(f)

# +----------------------------------------------------------------------+

# (3) get email structure
def get_email_structure(email):
  if isinstance(email, str):
      return email
  payload = email.get_payload()
  if isinstance(payload, list):
    multipart = ", ".join(
        [
        get_email_structure(sub_email) for sub_email in payload
        ]
    )
    return f"multipart({multipart})"
  else:
    return email.get_content_type()

# +----------------------------------------------------------------------+

# (4) structures counter
def structures_counter(emails):
  structures = Counter()
  for email in emails:
    structure = get_email_structure(email)
    structures[structure] += 1
  return structures

# +----------------------------------------------------------------------+

# (5) random state value
random_state = 42

## 📤 **Examples of Spam & Ham**
* From [Apache SpamAssassin's public datasets](https://spamassassin.apache.org/old/publiccorpus/).

In [25]:
# download data
ham_dir, spam_dir = fetch_spam_data()

In [26]:
# load emails
ham_filenames = [f for f in sorted(ham_dir.iterdir()) if len(f.name) > 20]
spam_filenames = [f for f in sorted(spam_dir.iterdir()) if len(f.name) > 20]
print(
    f"Ham Filename Length: {len(ham_filenames)}\n"
    f"Spam Filename Length: {len(spam_filenames)}"
)

Ham Filename Length: 2500
Spam Filename Length: 500


In [27]:
# parse the emails
ham_emails = [load_email(filepath) for filepath in ham_filenames]
spam_emails = [load_email(filepath) for filepath in spam_filenames]

Looking at one example each of ham and spam, to get a feel for what the data looks like:

In [28]:
# viewing an example of ham
print(ham_emails[1].get_content().strip())

Martin A posted:
Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the
 limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the
 Mount Athos monastic community, was ideal for the patriotic sculpture. 
 
 As well as Alexander's granite features, 240 ft high and 170 ft wide, a
 museum, a restored amphitheatre and car park for admiring crowds are
planned
---------------------
So is this mountain limestone or granite?
If it's limestone, it'll weather pretty fast.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
---------------------------------------------------------------------~->

To unsubscribe from this group, send an email to:
forteana-unsubscribe@egroups.com

 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/


In [29]:
# viewing an example of spam
print(spam_emails[6].get_content().strip())

Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


4139vOLW7-758DoDY1425FRhM1-764SMFc8513fCsLl40


Because some emails are multipart, with images and attachments, it is useful to look at the various types of structures within the imported dataset:

In [30]:
# viewing email structures
structures_counter(ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [31]:
# viewing spam email structures
structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

**Analysis of Email Structure**

* **Ham:** The ham emails are overwhelmingly plain text (2,408 of 2,500 are `text/plain`).
* **Spam:** Plain text is still the largest category (218 of 500), but `text/html` is a close second (183), and many of the multipart spam emails also contain HTML.

**View an Example Email Header**

In [32]:
# viewing the email headers
for header, value in spam_emails[0].items():
  print(header, ":", value)

Return-Path : <12a1mailbot1@web.de>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32	for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received : from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received : from dd_it7 ([210.97.77.167])	by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623	for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 13:09:41 +0100
From : 12a1mailbot1@web.de
Received : from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);	 Sat, 24 Aug 2002 09:42:10 +0900
To : dcek1a1@netsgo.com
Subject : Life Insurance - Why Pay More?
Date : Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version : 1.0
Message-ID : <0103c1042001882DD_IT7@dd_it7>
Content-Type : text/html; charset="iso-8859-1"
Content-Transfer-Encoding : qu

One of the standout pieces of information here is the sender address: `12a1mailbot1@web.de` looks quite suspicious.

**View The Email Subject Header**

In [33]:
spam_emails[0]['subject']

'Life Insurance - Why Pay More?'

## 🧹 **Preprocessing**

### 📊 **Train-test Split on Dataset**

In [34]:
# train-test split
X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=random_state
    )

**Train-test Split Code Summary:**

The dataset is explicitly labeled and merged so that it can be split consistently into training and test sets. `X` and `y` must correspond 1-to-1: the label at index `i` of `y` belongs to the email at index `i` of `X`.

```
X = np.array(ham_emails + spam_emails, dtype=object)
```
* Combines the two lists (`ham_emails` and `spam_emails`) into one array `X`
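* `dtype=object` is used because each element is a parsed email message object (not a number or a plain string), so NumPy stores references to the Python objects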

```
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))
```
Creates the labels (`y`) for the classification task:
* `0` for **ham** (not spam)
* `1` for **spam**

### 🛠️ **Preprocessing Functions**

**Function to convert HTML to plain text**

In [35]:
# function that converts HTML to plain text
def html_to_plain_text(html):
  soup = BeautifulSoup(html, "html.parser")

  # remove unwanted tags like <head>, <style>, <script>, etc.
  for tag in soup(['head','style','script']):
    tag.decompose()

  # replace links with the word 'HYPERLINK'
  for a in soup.find_all('a'):
    a.replace_with('HYPERLINK')

  # get text and unescape HTML entities
  text = soup.get_text(separator='\n')
  text = unescape(text)

  # normalize excessive whitespace
  return '\n'.join(line.strip() for line in text.splitlines() if line.strip())

In [36]:
# testing the function with spam
html_spam_emails = [
    email for email in X_train[y_train==1] if get_email_structure(email) == 'text/html'
    ]
sample_html_spam = html_spam_emails[7]
print(sample_html_spam.get_content()[:1000], "...")

<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b>&nbsp;Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners&nbsp;</b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0" 

In [37]:
# see the resulting plain text
print(html_to_plain_text(sample_html_spam.get_content())[:1000],"...")

OTC
Newsletter
Discover Tomorrow's Winners
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI.  CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future.
Put CBYI on your watch list, acquire a position TODAY.
REASONS TO INVEST IN CBYI
A profitable company and is on track to beat ALL earnings estimates!
One of the FASTEST growing distributors in environmental & safety equipment instruments.
Excellent management team, several EXCLUSIVE contracts.  IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.
RAPIDLY GROWING INDUSTRY
Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billion 

In [38]:
# function that takes email as input, returning it as plain text
def email_to_text(email):
  html = None
  for part in email.walk():
    ctype = part.get_content_type()
    if ctype not in ('text/plain', 'text/html'):
      continue
    try:
      content = part.get_content()
    except Exception:  # e.g. unknown or malformed charset
      content = str(part.get_payload())
    if ctype == 'text/plain':
      return content
    else:
      html = content
  if html:
    return html_to_plain_text(html)

result = email_to_text(sample_html_spam)
print(result[:100], '...')

OTC
Newsletter
Discover Tomorrow's Winners
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watch  ...


**Stemming using Natural Language Toolkit**
* Stemming reduces a word to its base or root form, usually by chopping off its suffixes.
* It is a common preprocessing step in natural language processing: related words (like compute, computing and computed) are mapped to the same token, which reduces vocabulary size and can improve generalization.

In [39]:
# stemming using nltk
stemmer = nltk.PorterStemmer()
for word in (
    "Computations",
    "Computation",
    "Computing",
    "Computed",
    "Compute",
    "Compusive"
):
  print(word, "=>", stemmer.stem(word))

Computations => comput
Computation => comput
Computing => comput
Computed => comput
Compute => comput
Compusive => compus


In [40]:
# detect urls (used later to replace them with a placeholder)
url_extractor = urlextract.URLExtract()
some_text = "Will it detect github.com and https://www.youtube.com/watch?v=dQw4w9WgXcQ"
url_extractor.find_urls(some_text)

['github.com', 'https://www.youtube.com/watch?v=dQw4w9WgXcQ']

**Note:**
* You can click the link to make sure that it is valid and working properly

#### 🔤 **Email to Word-Counter Transformer**

📦 **Class Purpose:**

A transformer that takes in parsed email objects (from Python's `email` library), converts them to plain text, and outputs a `Counter` of word frequencies for each email.

🔧 **Initialization** (`__init__`)
These parameters control the preprocessing behavior:

* `strip_headers`: Intended to remove email headers. The flag is stored but never read in `transform()`; headers are excluded anyway because `email_to_text` only returns body content.

* `lower_case`: Convert text to lowercase

* `remove_punctuation`: Strip punctuation

* `replace_urls`: Replace URLs with placeholder " URL "

* `replace_numbers`: Replace numbers with "NUMBER"

* `stemming`: Apply stemming using an NLTK stemmer

These settings allow easy tuning and `GridSearchCV` compatibility.

In [41]:
class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
  def __init__(
      self,
      strip_headers=True,
      lower_case=True,
      remove_punctuation=True,
      replace_urls=True,
      replace_numbers=True,
      stemming=True
      ):
    self.strip_headers = strip_headers
    self.lower_case = lower_case
    self.remove_punctuation = remove_punctuation
    self.replace_urls = replace_urls
    self.replace_numbers = replace_numbers
    self.stemming = stemming
  def fit(self, X, y=None):
    return self
  def transform(self, X, y=None):
    X_transformed = []
    for email in X:
      text = email_to_text(email) or ""
      if self.lower_case:
        text = text.lower()
      if self.replace_urls and url_extractor is not None:
        urls = list(set(url_extractor.find_urls(text)))
        urls.sort(key=lambda url: len(url), reverse=True)
        for url in urls:
          text = text.replace(url, " URL ")
      if self.replace_numbers:
        text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?','NUMBER', text)
      if self.remove_punctuation:
        text = re.sub(r'\W+', ' ', text, flags=re.M)
      word_counts = Counter(text.split())
      if self.stemming and stemmer is not None:
        stemmed_word_counts = Counter()
        for word, count in word_counts.items():
          stemmed_word = stemmer.stem(word)
          stemmed_word_counts[stemmed_word] += count
        word_counts = stemmed_word_counts
      X_transformed.append(word_counts)
    return np.array(X_transformed)

🔁 `transform()` **Email to Text**:

1️⃣ **Extract Text** - Converts the structured email into plain text, falling back to an empty string when the email has no text part.
```
text = email_to_text(email) or ""
```


2️⃣ **Lowercase Conversion** - Reduces vocabulary size by normalizing word casing.
```
if self.lower_case:
    text = text.lower()
```

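3️⃣ **URL Replacement** - Replaces detected URLs with the token `" URL "`, so that all links are treated uniformly. URLs are de-duplicated and replaced longest first, so that a short URL that is a prefix of a longer one does not break the longer match.
```
if self.replace_urls and url_extractor is not None:
    urls = list(set(url_extractor.find_urls(text)))
    urls.sort(key=lambda url: len(url), reverse=True)
    for url in urls:
        text = text.replace(url, " URL ")
```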
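
To see why the longest-first ordering matters, here is a small sketch that uses plain string replacement on a hand-written URL list. The function name `replace_urls`, the `longest_first` flag and the example URLs are hypothetical; in the notebook the list comes from `url_extractor.find_urls(text)`.

```python
def replace_urls(text, urls, longest_first=True):
    """Replace every URL in `urls` with the placeholder " URL "."""
    unique_urls = list(set(urls))
    # Sort by length: descending when longest_first is True, ascending otherwise.
    unique_urls.sort(key=len, reverse=longest_first)
    for url in unique_urls:
        text = text.replace(url, " URL ")
    return text

text = "see github.com and github.com/example/repo"
urls = ["github.com", "github.com/example/repo"]  # stand-in for extractor output

good = replace_urls(text, urls, longest_first=True)
bad = replace_urls(text, urls, longest_first=False)
print(good.split())  # ['see', 'URL', 'and', 'URL']
print(bad.split())   # ['see', 'URL', 'and', 'URL', '/example/repo']
```

Replacing the shorter URL first leaves the path fragment `/example/repo` behind as a stray token.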

4️⃣ **Number Replacement** - Uses regex to replace both integers and decimals (including scientific notation) with `"NUMBER"`.
```
if self.replace_numbers:
    text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
```
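
As a quick check of what that pattern matches, the sketch below applies the same regular expression used in the transformer's `transform()` method to a few made-up strings. The helper name `replace_numbers` and the sample sentences are illustrative only.

```python
import re

# Same pattern as in EmailToWordCounterTransformer.transform():
# digits, an optional fractional part, and an optional exponent.
NUMBER_RE = re.compile(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?')

def replace_numbers(text):
    """Replace every numeric literal in `text` with the token NUMBER."""
    return NUMBER_RE.sub("NUMBER", text)

print(replace_numbers("call 555 now"))       # call NUMBER now
print(replace_numbers("only 9.99 today"))    # only NUMBER today
print(replace_numbers("about 1.5e3 users"))  # about NUMBER users
print(replace_numbers("win 100% back"))      # win NUMBER% back
```

Because the fractional part is written `\.\d*`, a number followed directly by a full stop (for example `10.`) is consumed together with the dot.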

5️⃣ **Punctuation Removal** - Replaces non-word characters with space.
```
if self.remove_punctuation:
    text = re.sub(r'\W+', ' ', text)
```

6️⃣ **Tokenization and Word Counting** - Splits text into words and counts occurrences with `Counter`.
```
word_counts = Counter(text.split())
```

7️⃣ **Stemming** - Applies stemming to each word and recomputes counts on the stemmed versions.
```
if self.stemming:
    stemmed_word_counts = Counter()
    for word, count in word_counts.items():
        stemmed_word = stemmer.stem(word)
        stemmed_word_counts[stemmed_word] += count
    word_counts = stemmed_word_counts
```

🔚 Final Output
```
X_transformed.append(word_counts)
```
At the end, the method returns:
```
return np.array(X_transformed)
```

Where each entry is a `Counter` dictionary of stemmed, cleaned word frequencies for one email.

In [42]:
# trying the transformer on a few emails
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'r': 1}),
       Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'to': 3, 'by': 3, 'jefferson': 2, 'i': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'url': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'e': 1, 'remsburg': 1, 'letter': 1, 'william': 1, 'short': 1, 'again': 1, 'becom

#### 🔢 **Word-Counter to Vector Transformer**

📦 **Class Purpose:**

A transformer that converts word counts to vectors. Its `fit()` method builds the vocabulary (an ordered list of the most common words) and its `transform()` method uses that vocabulary to convert word counts to vectors. The output is a sparse matrix.

🔧 **Initialization** (`__init__`)
This parameter controls the preprocessing behavior:
* `vocabulary_size`: The number of top most frequent words to keep in the vocabulary. All other words will be ignored during transformation.
 *(Default = 1000)*

This setup makes it compatible with pipelines and tools like `GridSearchCV` for parameter tuning.

In [43]:
class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
  def __init__(self, vocabulary_size=1000):
    self.vocabulary_size = vocabulary_size
  def fit(self, X, y=None):
    total_count = Counter()
    for word_count in X:
      for word, count in word_count.items():
        total_count[word] += min(count, 10)
    most_common = total_count.most_common()[:self.vocabulary_size]
    self.vocabulary_ = {word: index + 1
                        for index, (word, count) in enumerate(most_common)}
    return self
  def transform(self, X, y=None):
    rows = []
    cols = []
    data = []
    for row, word_count in enumerate(X):
      for word,count in word_count.items():
        rows.append(row)
        cols.append(self.vocabulary_.get(word, 0))
        data.append(count)
    return csr_matrix((data, (rows, cols)),
                      shape=(len(X), self.vocabulary_size + 1))

🧾 `fit()` – **Vocabulary Building**
* **Step 1:** Accumulate word frequencies across all emails, capping each word's count at 10 per email so that a single email repeating a word many times cannot dominate the ranking.

* **Step 2:** Select the `vocabulary_size` most common words.

* **Step 3:** Build a vocabulary dictionary mapping each word to a unique index, starting from `1`.

**Note:** Index `0` is reserved for unknown words during transformation.
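
The three steps above can be sketched with plain `collections.Counter` objects. This is a standalone illustration rather than the class itself: the function name `build_vocabulary`, the `cap` argument and the three toy emails are assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(word_counts, vocabulary_size, cap=10):
    """Map the most common words to indices 1..vocabulary_size."""
    total = Counter()
    for counts in word_counts:
        for word, count in counts.items():
            # Step 1: cap each word's contribution per email
            total[word] += min(count, cap)
    # Step 2: keep only the most common words
    most_common = total.most_common(vocabulary_size)
    # Step 3: indices start at 1; index 0 is left for unknown words
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}

emails = [
    Counter({"free": 50, "offer": 2}),
    Counter({"meeting": 3, "offer": 6}),
    Counter({"meeting": 4, "offer": 4}),
]
vocab = build_vocabulary(emails, vocabulary_size=2)
print(vocab)  # {'offer': 1, 'free': 2}
```

With the cap, `free` contributes 10 instead of 50, so `offer` (2 + 6 + 4 = 12) ranks first even though a single email repeats `free` fifty times.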

🔁 `transform()` – **Vector Conversion**
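* Converts each `Counter` into a sparse row vector of word counts.
* Words outside the vocabulary are mapped to column index `0`; because `csr_matrix` sums entries that share the same (row, column) position, column `0` ends up holding the total count of unknown words in each email.
* Output is a SciPy `csr_matrix`, ready for training ML models.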
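
A dense, pure-Python sketch of the same mapping may make the role of column `0` clearer. It mimics what the sparse constructor produces for one email by accumulating counts into a list; the helper `to_dense_row` and the tiny vocabulary are hypothetical and only stand in for the real transformer.

```python
from collections import Counter

def to_dense_row(word_count, vocabulary, vocabulary_size):
    """Build one row of counts; unknown words accumulate in column 0."""
    row = [0] * (vocabulary_size + 1)
    for word, count in word_count.items():
        # vocabulary.get(word, 0) sends out-of-vocabulary words to column 0
        row[vocabulary.get(word, 0)] += count
    return row

vocabulary = {"the": 1, "of": 2}
email_counts = Counter({"the": 3, "cat": 2, "sat": 1})
row = to_dense_row(email_counts, vocabulary, vocabulary_size=2)
print(row)  # [3, 3, 0]
```

Here `cat` and `sat` are both unknown, so their counts (2 + 1) are added together in column 0, `the` lands in column 1, and `of` never occurs.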

In [44]:
# testing the vocabulary transformer
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 20 stored elements and shape (3, 11)>

In [45]:
X_few_vectors.toarray()

array([[ 6,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [99, 11,  9,  8,  3,  1,  3,  1,  3,  2,  3],
       [67,  0,  1,  2,  3,  4,  1,  2,  0,  1,  0]])

**Analysis**

* **99:** The 99 in the second row at index 0 means that the second email contains 99 word occurrences that are not part of the vocabulary.
* **11:** The 11 next to it means that the first word in the vocabulary (`the`) appears 11 times in that email.
* **9:** The 9 next to it means that the second word (`of`) appears 9 times.

And so on for the remaining columns.

In [46]:
vocab_transformer.vocabulary_

{'the': 1,
 'of': 2,
 'and': 3,
 'to': 4,
 'url': 5,
 'all': 6,
 'in': 7,
 'christian': 8,
 'on': 9,
 'by': 10}

## 🏋️ **Model Training**

In [47]:
# create a preprocessing pipeline for training
preprocess_pipeline = Pipeline([
     ('email_to_wordcount', EmailToWordCounterTransformer()),
     ('wordcount_to_vector', WordCounterToVectorTransformer())
 ])

X_train_transformed = preprocess_pipeline.fit_transform(X_train)

### 🧪 **98.6% Mean Score on Training Set**

In [48]:
# model instantiation
log_clf = LogisticRegression(max_iter=1000, random_state=random_state)
score = cross_val_score(
    log_clf, X_train_transformed,
    y_train, cv=3
    )
print(f'Cross Validation Score (Accuracy) of Logistic Regression Classifier: {score.mean():.1%}')

Cross Validation Score (Accuracy) of Logistic Regression Classifier: 98.6%


### 🧪 **96.9% Precision / 97.9% Recall on Test Set**

#### **Note on Precision/Recall Trade-off:**

According to ChatGPT, in most spam-filtering scenarios **precision** tends to be the more critical metric:

* **Precision** (96.9%) tells you “of all the emails we flagged as spam, 96.9% truly were spam.”

  * A **high precision** means very few ham messages get mislabeled and sent to the spam folder by mistake. Losing legitimate mail is usually more painful for users than letting a few spam messages slip through.

* **Recall** (97.9%) tells you “of all the actual spam emails, we caught 97.9%.”

  * The remaining 2% or so of spam lands in the inbox, which is annoying but generally less harmful than a false positive.

Because **false positives** (**ham** → **spam**) risk losing or hiding important mail, you’ll typically **optimize** for **precision first**.

You can then **tune the decision threshold** (or explore cost-sensitive training) to trade a little recall for higher precision, or the reverse, without letting either drop below an acceptable floor.
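
A minimal sketch of how moving the decision threshold trades precision against recall. The labels and scores below are invented purely for illustration; with a fitted `LogisticRegression`, the scores would typically be the spam-class column of `predict_proba`. The helper `precision_recall` is not part of the notebook.

```python
def precision_recall(y_true, scores, threshold):
    """Compute precision and recall when scores >= threshold are flagged as spam."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels (1 = spam) and made-up spam scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.20, 0.40, 0.60, 0.55, 0.70, 0.85, 0.95]

for threshold in (0.5, 0.65):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.80, recall=1.00
# threshold=0.65: precision=1.00, recall=0.75
```

Raising the threshold removes the one ham email that was flagged (precision rises) at the cost of missing one spam email (recall falls).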









In [49]:
X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(max_iter=1000, random_state=random_state)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print(f'Precision: {precision_score(y_test, y_pred):.1%}')
print(f'Recall: {recall_score(y_test, y_pred):.1%}')

Precision: 96.9%
Recall: 97.9%


### 🧪 **96.3% Mean Score on Test Set**

In [50]:
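# cross-validate fresh classifiers on the test set alone (3 folds);
# note: this does not score the model fitted on the training set above
# (accuracy_score(y_test, y_pred) would do that)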
log_clf = LogisticRegression(max_iter=1000, random_state=random_state)
score = cross_val_score(
    log_clf, X_test_transformed,
    y_test, cv=3
    )
print(f'Cross Validation Score (Accuracy) of Logistic Regression Classifier (Test Set): {score.mean():.1%}')

Cross Validation Score (Accuracy) of Logistic Regression Classifier (Test Set): 96.3%


* The results are very strong across the board for the logistic regression classifier on the **easy dataset** and can serve as a benchmark for performance on the **hard dataset**.

-----------

# 🔴 **Part 2:** Self-directed Case Study

## 🚨 **Using The Hard Dataset**

So far, the Logistic Regression classifier has only been evaluated on one of the easier datasets from [Apache SpamAssassin's public datasets](https://homl.info/spamassassin).

I will now import a **hard** dataset and work through model selection, fine-tuning (using cross-validation) and performance evaluation, aiming for results close to those on the first dataset.

**Brief**

1. Import Hard Dataset
2. Inspect The Data
3. Preprocessing
4. Model Selection
5. Fine-tuning
6. Results
7. Summary

## **🔧 Utils & Functions** (Round 2)

In [68]:
# +----------------------------------------------------------------------+

# (1) fetch spam data
def fetch_spam_data():
  spam_root = "https://spamassassin.apache.org/old/publiccorpus/"
  ham_url = spam_root + "20030228_hard_ham.tar.bz2"
  spam_url = spam_root + "20030228_spam_2.tar.bz2"

  spam_path = Path() / "datasets" / "spam"
  spam_path.mkdir(parents=True, exist_ok=True)

  for dir_name, tar_name, url in (
      ("hard_ham", "ham", ham_url),
       ("spam_2","spam", spam_url)
       ):
    if not (spam_path / dir_name).is_dir():
      path = (spam_path / tar_name).with_suffix(".tar.bz2")
      print("Downloading", path)
      urllib.request.urlretrieve(url, path)
      tar_bz2_file = tarfile.open(path)
      tar_bz2_file.extractall(path=spam_path)
      tar_bz2_file.close()
  return [spam_path / dir_name for dir_name in ("hard_ham", "spam_2")]

# +----------------------------------------------------------------------+

## 📤 **Examples of Spam & Ham** (Round 2)
* From the harder files in [Apache SpamAssassin's public datasets](https://spamassassin.apache.org/old/publiccorpus/):
* `20030228_hard_ham.tar.bz2`
* `20030228_spam_2.tar.bz2`

In [61]:
# download data
ham_dir, spam_dir = fetch_spam_data()

In [62]:
# load emails
ham_filenames = [f for f in sorted(ham_dir.iterdir()) if len(f.name) > 20]
spam_filenames = [f for f in sorted(spam_dir.iterdir()) if len(f.name) > 20]
print(
    f"Ham Filename Length: {len(ham_filenames)}\n"
    f"Spam Filename Length: {len(spam_filenames)}"
)

Ham Filename Length: 250
Spam Filename Length: 1397


**Note:**
* The class balance has flipped compared with Part 1: there is now far more spam (1,397) than ham (250), so roughly 85% of the emails are spam.
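
One consequence of this imbalance is worth quantifying before looking at any model: a trivial classifier that always predicts the majority class already reaches a high accuracy. The short calculation below uses only the two file counts printed above; the variable names are illustrative.

```python
# File counts printed above for the hard dataset.
n_ham, n_spam = 250, 1397
total = n_ham + n_spam

# Accuracy of always predicting the majority class (spam).
baseline_accuracy = max(n_ham, n_spam) / total
ham_share = n_ham / total

print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")  # 84.8%
print(f"Share of ham emails: {ham_share:.1%}")                       # 15.2%
```

Any accuracy reported on this dataset is easier to interpret against that 84.8% baseline, and precision and recall remain useful companions to it.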

In [63]:
# parse the emails
ham_emails = [load_email(filepath) for filepath in ham_filenames]
spam_emails = [load_email(filepath) for filepath in spam_filenames]

In [64]:
# viewing an example of ham
print(ham_emails[1].get_content().strip())

May 7, 2002


Dear rod-3ds@arsecandle.org:


Congratulations!  On behalf of Frito-Lay, Inc., we are pleased to advise you
 that you've won Fourth Prize in the 3D's(R) Malcolm in the Middle(TM)
 Sweepstakes.   Fourth Prize consists of 1 manufacturer's coupon redeemable at
 participating retailers for 1 free bag of 3D's(R) brand snacks (up to 7 oz.
 size), with an approximate retail value of $2.59 and an expiration date of
 12/31/02.

Follow these instructions to claim your prize:

1.	Print out this email message.

2.	Complete ALL of the information requested.  Print clearly and legibly.  Sign
 where indicated.

3.	If you are under 18 or otherwise under the legal age of majority in your
 state, your parent or legal guardian must co-sign where indicated below.

4.	Mail the completed and signed form to:  3D's(R) Malcolm in the Middle(TM)
 Sweepstakes, Redemption Center, PO Box 1520, Elmhurst IL 60126.  WE MUST
 RECEIVE THIS FORM NO LATER THAN MAY 28, 2002 IN ORDER TO SEND YOU THE PRIZE.

P

**Note:**
* This email is much longer than the example loaded in Part 1.

In [65]:
# viewing an example of spam
print(spam_emails[6].get_content().strip())

NEW PRODUCT ANNOUNCEMENT

From: OUTSOURCE ENG.& MFG. INC.


Sir/Madam;

This note is to inform you of new watchdog board technology for maintaining
continuous unattended operation of PC/Servers etc. that we have released for
distribution.
  
We are proud to announce Watchdog Control Center featuring MAM (Multiple
Applications Monitor) capability.
The key feature of this application enables you to monitor as many
applications as you
have resident on any computer as well as the operating system for
continuous unattended operation.  The Watchdog Control Center featuring
MAM capability expands third party application "control" of a Watchdog as
access to the application's
source code is no longer needed.

Here is how it all works:
Upon installation of the application and Watchdog, the user may select
many configuration options, based on their model of Watchdog, to fit their
operational needs.  If the MAM feature is enabled, the user may select any
executable program that they wish for monit

**Most Common Email Structures**

As seen below, `text/html` is now the most common email structure in the hard ham dataset, and it is a close second to `text/plain` in the new spam data.
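The structure labels in the outputs that follow come from the `get_email_structure` and `structures_counter` helpers defined in Part 1, which are not repeated here. As a reminder of the idea, here is a minimal sketch that assumes a recursive walk over MIME parts. The `_sketch` function names and the toy messages are illustrative, not the notebook's actual code.

```python
from collections import Counter
from email.message import EmailMessage


def get_email_structure_sketch(message):
    """Return a label such as 'text/plain' or 'multipart(text/plain, text/html)'."""
    payload = message.get_payload()
    if isinstance(payload, list):
        # Multipart message: describe each part recursively.
        parts = ", ".join(get_email_structure_sketch(part) for part in payload)
        return f"multipart({parts})"
    # Single-part message: report its MIME type.
    return message.get_content_type()


def structures_counter_sketch(messages):
    """Count how often each structure label occurs."""
    return Counter(get_email_structure_sketch(m) for m in messages)


# Toy messages (hypothetical data, not from the SpamAssassin corpus).
plain = EmailMessage()
plain.set_content("Hello")

mixed = EmailMessage()
mixed.set_content("Hello")
mixed.add_alternative("<p>Hello</p>", subtype="html")

print(structures_counter_sketch([plain, plain, mixed]).most_common())
```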

In [66]:
# viewing email structures
structures_counter(ham_emails).most_common()

[('text/html', 118),
 ('text/plain', 81),
 ('multipart(text/plain, text/html)', 43),
 ('multipart(text/html)', 2),
 ('multipart(text/plain, image/bmp)', 1),
 ('multipart(multipart(text/plain, text/html))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, image/png, image/png)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/jpeg, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif)',
  1),
 ('multipart(text/plain, text/plain)', 1)]

In [67]:
# viewing spam email structures
structures_counter(spam_emails).most_common()

[('text/plain', 598),
 ('text/html', 589),
 ('multipart(text/plain, text/html)', 114),
 ('multipart(text/html)', 29),
 ('multipart(text/plain)', 25),
 ('multipart(multipart(text/html))', 18),
 ('multipart(multipart(text/plain, text/html))', 5),
 ('multipart(text/plain, application/octet-stream, text/plain)', 3),
 ('multipart(text/html, text/plain)', 2),
 ('multipart(text/html, image/jpeg)', 2),
 ('multipart(multipart(text/plain), application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(multipart(text/plain, text/html), image/jpeg, image/jpeg, image/jpeg, image/jpeg, image/jpeg)',
  1),
 ('multipart(multipart(text/plain, text/html), image/jpeg, image/jpeg, image/jpeg, image/jpeg, image/gif)',
  1),
 ('text/plain charset=us-ascii', 1),
 ('multipart(multipart(text/html), image/gif)', 1),
 ('multipart(multipart(text/plain, text/html), application/octet-stream, application/octet-stream, applic

**View an Example Email Header**

In [69]:
# viewing the email headers
for header, value in spam_emails[0].items():
  print(header, ":", value)

Return-Path : <ilug-admin@linux.ie>
Delivered-To : yyyy@localhost.netnoteinc.com
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD	for <jm@localhost>; Tue,  6 Aug 2002 06:48:09 -0400 (EDT)
Received : from phobos [127.0.0.1]	by localhost with IMAP (fetchmail-5.9.0)	for jm@localhost (single-drop); Tue, 06 Aug 2002 11:48:09 +0100 (IST)
Received : from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g72LqWv13294 for    <jm-ilug@jmason.org>; Fri, 2 Aug 2002 22:52:32 +0100
Received : from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org    (8.9.3/8.9.3) with ESMTP id WAA31224; Fri, 2 Aug 2002 22:50:17 +0100
Received : from bettyjagessar.com (w142.z064000057.nyc-ny.dsl.cnc.net    [64.0.57.142]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id WAA31201 for    <ilug@linux.ie>; Fri, 2 Aug 2002 22:50:11 +0100
Received : from 64.0.57.142 [202.63.165.34] by bettyjagessa

**View The Email Subject Header**

In [70]:
spam_emails[0]['subject']

'[ILUG] STOP THE MLM INSANITY'

**Note:**
* **MLM** - presumably multi-level marketing, the controversial strategy in which a company sells products or services through a network of distributors who earn commissions from their own sales and from the sales of their recruits and their recruits' recruits.

## 🧹 **Preprocessing** (Round 2)

### 📊 **Train-test Split on Dataset** (Round 2)

In [71]:
# train-test split
X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=random_state
    )

**Train-test Split Code Summary:** (As done in Part 1 of the notebook)

The ham and spam emails are merged into one array and explicitly labeled so that they can be split consistently into training and test sets. `X` and `y` must correspond 1-to-1.

```
X = np.array(ham_emails + spam_emails, dtype=object)
```
* Combines the two lists (`ham_emails` and `spam_emails`) into one array `X`
* `dtype=object` is used because each element is a parsed email object (later cells call `.get_content()` on it), not a number or a plain string

```
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))
```
Creates the labels (`y`) for the classification task:
* `0` for **ham** (not spam)
* `1` for **spam**
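
To see why the 1-to-1 correspondence matters, the sketch below rebuilds the same labelling scheme on toy data and shuffles items and labels through one shared index list before splitting. It uses only the standard library; `toy_ham`, `toy_spam`, and `split_together` are hypothetical names, and the notebook itself relies on `train_test_split` for this step.

```python
import random

# Toy stand-ins for the parsed emails (hypothetical data).
toy_ham = ["ham-a", "ham-b", "ham-c", "ham-d"]
toy_spam = ["spam-a", "spam-b", "spam-c", "spam-d", "spam-e", "spam-f"]

X_toy = toy_ham + toy_spam
y_toy = [0] * len(toy_ham) + [1] * len(toy_spam)  # 0 = ham, 1 = spam


def split_together(X, y, test_size=0.2, seed=42):
    """Shuffle one shared index list so items and labels stay paired."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


X_tr, X_te, y_tr, y_te = split_together(X_toy, y_toy)

# Every label still matches its email after the shuffle.
for item, label in zip(X_tr + X_te, y_tr + y_te):
    assert item.startswith("spam") == (label == 1)
print(len(X_tr), len(X_te))
```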

### 🛠️ **Preprocessing Functions** (Round 2)

In [72]:
# testing the HTML to plain text function on the new spam data
html_spam_emails = [
    email for email in X_train[y_train==1] if get_email_structure(email) == 'text/html'
    ]
sample_html_spam = html_spam_emails[7]
print(sample_html_spam.get_content()[:1000], "...")

<html>
<head>
<title>Foreclosure Toolkit</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css">
	<!--

	td {  font-family: Arial, Helvetica, sans-serif}

	-->
	</style>
</head>

<body>

<p>Great News webmaster,</p>

<p><b>FINALLY  -  A Foreclosure Tycoon Reveals His Most Closely Guarded Secrets! </b><br>
<br>
For the first time ever, we are proud to bring you Foreclosure's most closely guarded secrets - a complete, turn-key system to either owning your own home or making a fortune in Foreclosure Real Estate without tenants, headaches and bankers. <br>
<br>
FREE information for a revolutionary brand-new approach to show you exactly how to buy a foreclosure or make a fortune in foreclosure real estate in today's market.<br>
<br>
Are you ready to take advantage of this amazing information?<br>
webmaster,  take this first step to improving your life in the next 2 minutes!</p>

<p><b>For <font color="#FF0000">FREE</font> Information <a href=

In [73]:
# see the resulting plain text
print(html_to_plain_text(sample_html_spam.get_content())[:1000],"...")

Great News webmaster,
FINALLY  -  A Foreclosure Tycoon Reveals His Most Closely Guarded Secrets!
For the first time ever, we are proud to bring you Foreclosure's most closely guarded secrets - a complete, turn-key system to either owning your own home or making a fortune in Foreclosure Real Estate without tenants, headaches and bankers.
FREE information for a revolutionary brand-new approach to show you exactly how to buy a foreclosure or make a fortune in foreclosure real estate in today's market.
Are you ready to take advantage of this amazing information?
webmaster,  take this first step to improving your life in the next 2 minutes!
For
FREE
Information
HYPERLINK
http://www.foreclosureworld.net/cgi-bin/click.cgi?c=12
HYPERLINK
To unsubscribe or change subscriber options click:
HYPERLINK
--DeathToSpamDeathToSpamDeathToSpam--
-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_____________

In [77]:
result = email_to_text(sample_html_spam)
print(result[:100], '...')

Great News webmaster,
FINALLY  -  A Foreclosure Tycoon Reveals His Most Closely Guarded Secrets!
For ...


#### **Testing Previous Preprocessing Transformers on New Data**

##### 🔤 **Email to Word-Counter Transformer**

📦 **Class Purpose:**

A transformer that takes in raw email objects (parsed with Python's `email` library), converts them to plain text, and outputs a `Counter` of word frequencies for each email.
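
The transformer itself is defined in Part 1. As a rough illustration of the text-to-counts step only, here is a minimal sketch assuming just lower-casing, number replacement, and punctuation removal. It leaves out the URL replacement and stemming options the real class exposes, and `text_to_word_counts_sketch` is a hypothetical name.

```python
import re
from collections import Counter


def text_to_word_counts_sketch(text):
    """Lower-case, replace numbers, strip punctuation, then count tokens."""
    text = text.lower()
    # Replace integers and decimals with a placeholder token.
    text = re.sub(r"\d+(?:\.\d+)?", "NUMBER", text)
    # Collapse any run of non-word characters into a single space.
    text = re.sub(r"\W+", " ", text)
    return Counter(text.split())


counts = text_to_word_counts_sketch("Win 100 prizes, win NOW!!")
print(counts)
```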

In [81]:
# testing transformer on 3 emails from the new data
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'the': 11, 'number': 11, 'you': 8, 'to': 7, 'of': 5, 'have': 4, 'new': 4, 'are': 3, 'them': 3, 'or': 3, 'for': 3, 'in': 3, 'life': 3, 'and': 3, 'thi': 3, 'there': 2, 'more': 2, 'opportun': 2, 'that': 2, 'don': 2, 't': 2, 'go': 2, 'is': 2, 'a': 2, 'year': 2, 'your': 2, 'home': 2, 'our': 2, 'busi': 2, 'incom': 2, 'from': 2, 'inform': 2, 'pleas': 2, 'no': 2, 'north': 2, 'carolina': 2, 'dakota': 2, 'south': 2, 'virginia': 2, 'e': 2, 'mail': 2, 'subject': 2, 'financi': 1, 'out': 1, 'than': 1, 'ever': 1, 'befor': 1, 'major': 1, 'those': 1, 'succeed': 1, 'follow': 1, 'rule': 1, 'they': 1, 'bend': 1, 'avoid': 1, 'around': 1, 'freedom': 1, 'sucker': 1, 'work': 1, 'hour': 1, 'week': 1, 'all': 1, 'make': 1, 'someon': 1, 'els': 1, 'wealthi': 1, 'we': 1, 'better': 1, 'way': 1, 'interest': 1, 'creat': 1, 'immedi': 1, 'wealth': 1, 'consid': 1, 'improv': 1, 'qualiti': 1, 'do': 1, 'current': 1, 'car': 1, 'style': 1, 'dream': 1, 'develop': 1, 'figur': 1, 'earner': 1, 'quickli': 1, 'easil

##### 🔢 **Word-Counter to Vector Transformer**

📦 **Class Purpose:**

A transformer that converts word counts to vectors whose `fit()` method builds the vocabulary (an ordered list of the common words) and whose `transform()` method will use the vocabulary to convert word counts to vectors. The output is a sparse matrix.

In [82]:
# testing Word-Counter to Vector Transformer on the new data
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 23 stored elements and shape (3, 11)>

In [83]:
X_few_vectors.toarray()

array([[209,  11,   8,   7,   2,   5,  11,   3,   3,   4,   3],
       [ 10,   0,   0,   1,   0,   0,   0,   0,   0,   0,   0],
       [136,   7,   9,   5,  12,   5,   0,   6,   4,   2,   2]])

In [84]:
vocab_transformer.vocabulary_

{'the': 1,
 'you': 2,
 'to': 3,
 'a': 4,
 'of': 5,
 'number': 6,
 'or': 7,
 'in': 8,
 'have': 9,
 'for': 10}
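
The matrix above has 11 columns although `vocabulary_size=10`, and the vocabulary indices start at 1. This is consistent with column 0 counting the words that fall outside the vocabulary. A minimal dense sketch of that mapping, assuming this convention, is shown here. The function names are hypothetical, and the real transformer returns a sparse matrix rather than nested lists.

```python
from collections import Counter


def build_vocabulary_sketch(word_counts, vocabulary_size):
    """Map the most common words to indices 1..vocabulary_size."""
    total = Counter()
    for counts in word_counts:
        total.update(counts)
    most_common = total.most_common(vocabulary_size)
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}


def counts_to_vectors_sketch(word_counts, vocabulary, vocabulary_size):
    """Turn each Counter into a row; column 0 collects unknown words."""
    rows = []
    for counts in word_counts:
        row = [0] * (vocabulary_size + 1)
        for word, count in counts.items():
            row[vocabulary.get(word, 0)] += count
        rows.append(row)
    return rows


# Toy word counts (hypothetical data).
toy_counts = [
    Counter({"the": 3, "free": 2, "offer": 1}),
    Counter({"the": 1, "meeting": 1}),
]
vocab = build_vocabulary_sketch(toy_counts, vocabulary_size=2)
vectors = counts_to_vectors_sketch(toy_counts, vocab, vocabulary_size=2)
print(vocab)
print(vectors)
```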

##### 🧮 **Top 20 Most Common Words in Training Data**

In [90]:
# see the most common words in all emails
counters = EmailToWordCounterTransformer(
    lower_case=True,
    replace_urls=True,
    replace_numbers=True,
    remove_punctuation=True,
    stemming=False
).fit_transform(X_train)

# sum them all into one big Counter
total_counts = Counter()
for ct in counters:
    total_counts.update(ct)

# top 20 most common
most_common = total_counts.most_common(20)

print("Top 20 tokens across dataset:")
for token, freq in most_common:
    print(f"{token:>12s}  —  {freq}")

Top 20 tokens across dataset:
      NUMBER  —  25100
         the  —  17603
          to  —  14330
         and  —  10617
           a  —  9485
          of  —  9383
         you  —  8990
   hyperlink  —  7317
          in  —  6249
         for  —  6068
        your  —  5645
          is  —  5151
         URL  —  4931
        this  —  4527
           i  —  3784
        font  —  3772
           s  —  3574
          on  —  3496
          it  —  3420
        that  —  3353


## 🏋️ **Model Training** (Round 2)

In [85]:
# use previously created pipeline to transform the new training set
X_train_transformed = preprocess_pipeline.fit_transform(X_train)

### ⚖️ **Logistic Regression**
Benchmark model

#### 🧪 **96.1% Mean Score on Training Set** (Logistic Regression)

In [86]:
# model instantiation
log_clf = LogisticRegression(max_iter=1000, random_state=random_state)
score = cross_val_score(
    log_clf, X_train_transformed,
    y_train, cv=3
    )
print(f'Cross Validation Score (Accuracy) of Logistic Regression Classifier: {score.mean():.1%}')

Cross Validation Score (Accuracy) of Logistic Regression Classifier: 96.1%


#### 🧪 **93.9% Mean Score on Test Set** (Logistic Regression)

In [88]:
# cross-validate a fresh model on the test set
# (X_test_transformed is created in cell In [87], shown below)
log_clf = LogisticRegression(max_iter=1000, random_state=random_state)
score = cross_val_score(
    log_clf, X_test_transformed,
    y_test, cv=3
    )
print(f'Cross Validation Score (Accuracy) of Logistic Regression Classifier (Test Set): {score.mean():.1%}')

Cross Validation Score (Accuracy) of Logistic Regression Classifier (Test Set): 93.9%


#### 🧪 **96.8% Precision / 98.2% Recall Score on Test Set** (Logistic Regression)

In [87]:
X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(max_iter=1000, random_state=random_state)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print(f'Precision: {precision_score(y_test, y_pred):.1%}')
print(f'Recall: {recall_score(y_test, y_pred):.1%}')

Precision: 96.8%
Recall: 98.2%
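
Note that the 93.9% figure above comes from running 3-fold cross-validation inside the test set, so each fold trains a fresh model on test emails. It is not the accuracy of the classifier fitted on the training set, whereas the precision and recall cell does fit on the training set and predict on the test set. A minimal sketch of the matching held-out accuracy follows, using a hypothetical `accuracy_sketch` helper on toy labels.

```python
def accuracy_sketch(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (hypothetical): 1 = spam, 0 = ham.
y_true_toy = [1, 1, 0, 1, 0, 1, 1, 0]
y_pred_toy = [1, 1, 0, 1, 1, 1, 0, 0]

print(f"Held-out accuracy: {accuracy_sketch(y_true_toy, y_pred_toy):.1%}")
```

In the notebook, scikit-learn's `accuracy_score(y_test, y_pred)` computes the same quantity from the predictions of the fitted `log_clf`.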


### 🎲 **Multinomial Naive Bayes**
   * **Why use this model?**

     A classic for text: very fast, and often a strong baseline on bag-of-words or TF–IDF features.
   * **Technique:**

     * A common setup is to vectorize with `TfidfVectorizer(ngram_range=(1,2))`; the cell below instead reuses the word-count vectors from `preprocess_pipeline`
     * Fit `MultinomialNB(alpha=1.0)`
     * Tune `alpha` via CV to control smoothing
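
To make the role of `alpha` concrete: Multinomial Naive Bayes estimates per-class word probabilities with additive smoothing, (count + alpha) / (total + alpha × vocabulary size). The sketch below applies that formula to toy counts. `smoothed_probabilities` and the counts are hypothetical, not taken from the dataset.

```python
def smoothed_probabilities(word_counts, alpha):
    """Additive (Laplace/Lidstone) smoothing over a fixed vocabulary."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in word_counts.items()
    }


# Toy word counts for the spam class; 'meeting' never appears in spam.
spam_counts = {"free": 8, "offer": 2, "meeting": 0}

for alpha in (0.01, 1.0, 5.0):
    probs = smoothed_probabilities(spam_counts, alpha)
    print(alpha, {w: round(p, 3) for w, p in probs.items()})
```

A larger `alpha` pulls the probabilities toward uniform, while a very small `alpha` keeps unseen words close to zero probability.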

#### **🧪 88.2% Acc Train, 89.7% Acc Test (Multinomial Naive Bayes)**

In [96]:
# alpha hyperparameter tuning
param_grid = {
    "alpha":[
        0.01, 0.1, 0.5, 1.0, 1.5, 2, 2.5, 3.5, 4.5, 5
    ]
}

# GridSearchCV
grid = GridSearchCV(
    MultinomialNB(),
    param_grid=param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1
)

# fit on training set
grid.fit(X_train_transformed, y_train)

# best cross validation score
print(f'Best Cross Validation Score Accuracy: {grid.best_score_:.1%}')

# retrieve best estimator
best_mnb = grid.best_estimator_

# evaluate on test set
test_acc = best_mnb.score(X_test_transformed, y_test)
print(f'Test Set Accuracy of MNB: {test_acc:.1%}')

Best Cross Validation Score Accuracy: 88.2%
Test Set Accuracy of MNB: 89.7%


#### 🧪 **97.7% Precision / 89.9% Recall Score on Test Set** (Multinomial NB)

In [97]:
# predict on the test set
y_pred = best_mnb.predict(X_test_transformed)

# precision and recall
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f'Precision of MNB on Test Set: {precision:.1%}')
print(f'Recall of MNB on Test Set:    {recall:.1%}')

Precision of MNB on Test Set: 97.7%
Recall of MNB on Test Set:    89.9%


### 🌲 **Tree-Based Ensemble (Random Forest Classifier)**

#### **🧪 95.5% Acc Train, 94.5% Acc Test (Random Forest Classifier)**

In [102]:
# instantiate
rf_clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    random_state=random_state
)

# cross-validate on the training set
cv_scores = cross_val_score(
    rf_clf,
    X_train_transformed,
    y_train,
    cv=3,
    scoring='accuracy',
    n_jobs=-1
)

print(f'Random Forest CV Accuracy: {cv_scores.mean():.1%}')

Random Forest CV Accuracy: 95.5%


In [105]:
cv_scores_test = cross_val_score(
    rf_clf,
    X_test_transformed,
    y_test,
    cv=3,
    scoring='accuracy',
    n_jobs=-1
)

print(f'Random Forest CV Test Accuracy: {cv_scores_test.mean():.1%}')

Random Forest CV Test Accuracy: 94.5%


#### 🧪 **96.2% Precision / 98.9% Recall Score on Test Set** (Random Forest Classifier)

In [106]:
# fit on train, predict on test
rf_clf.fit(X_train_transformed, y_train)
y_pred = rf_clf.predict(X_test_transformed)

# precision and recall on test set
precision = precision_score(y_test, y_pred)
recall    = recall_score(y_test, y_pred)

print(f'Precision of RFC on Test Set: {precision:.1%}')
print(f'Recall of RFC on Test Set:    {recall:.1%}')

Precision of RFC on Test Set: 96.2%
Recall of RFC on Test Set:    98.9%


## 📈 **Results**

### 🟢 **Easy Dataset Results**

In [141]:
# easy dataset results using Logistic Regression
data_easy = {
    'Model': ['Logistic Regression'],
    'Train Accuracy (%)': [98.6],    # 98.6% mean score on training set
    'Test Accuracy (%)':  [96.3],    # 96.3% mean score on test set
    'Precision (%)':      [96.9],    # 96.9% precision on test set
    'Recall (%)':         [97.9]     # 97.9% recall on test set
}

df_easy = pd.DataFrame(data_easy)
df_easy

| | Model | Train Accuracy (%) | Test Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 98.6 | 96.3 | 96.9 | 97.9 |


### 🔴 **Hard Dataset Results**

In [107]:
# prepare the results data of model training (round 2)
data = {
    'Model': [
        'Logistic Regression',
        'Multinomial Naive Bayes',
        'Random Forest Classifier'
    ],
    'Train Accuracy (%)': [96.1, 88.2, 95.5],
    'Test Accuracy (%)': [93.9, 89.7, 94.5],
    'Precision (%)': [96.8, 97.7, 96.2],
    'Recall (%)': [98.2, 89.9, 98.9]
}

# create DataFrame
df_results_2 = pd.DataFrame(data)

# display it
df_results_2

| | Model | Train Accuracy (%) | Test Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 96.1 | 93.9 | 96.8 | 98.2 |
| 1 | Multinomial Naive Bayes | 88.2 | 89.7 | 97.7 | 89.9 |
| 2 | Random Forest Classifier | 95.5 | 94.5 | 96.2 | 98.9 |


In [135]:
# precision in descending order by model
# (each model has exactly one row, so no grouping is needed)
prec_df = (
    df_results_2[['Model', 'Precision (%)']]
      .sort_values('Precision (%)', ascending=False)
)
prec_df

| | Model | Precision (%) |
|---|---|---|
| 1 | Multinomial Naive Bayes | 97.7 |
| 0 | Logistic Regression | 96.8 |
| 2 | Random Forest Classifier | 96.2 |


In [139]:
# test accuracy in descending order by model
# (each model has exactly one row, so no grouping is needed)
acc_df = (
    df_results_2[['Model', 'Test Accuracy (%)']]
      .sort_values('Test Accuracy (%)', ascending=False)
)
acc_df

| | Model | Test Accuracy (%) |
|---|---|---|
| 2 | Random Forest Classifier | 94.5 |
| 0 | Logistic Regression | 93.9 |
| 1 | Multinomial Naive Bayes | 89.7 |


In [140]:
# recall in descending order by model
# (each model has exactly one row, so no grouping is needed)
rec_df = (
    df_results_2[['Model', 'Recall (%)']]
      .sort_values('Recall (%)', ascending=False)
)
rec_df

| | Model | Recall (%) |
|---|---|---|
| 2 | Random Forest Classifier | 98.9 |
| 0 | Logistic Regression | 98.2 |
| 1 | Multinomial Naive Bayes | 89.9 |


In [145]:
# easy dataset logistic regression model
df_easy

| | Model | Train Accuracy (%) | Test Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 98.6 | 96.3 | 96.9 | 97.9 |


In [143]:
# random forest classifier
df_results_2[2:]

| | Model | Train Accuracy (%) | Test Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| 2 | Random Forest Classifier | 95.5 | 94.5 | 96.2 | 98.9 |


* Satisfyingly close to the easy dataset metrics!

## 🧾 **Summary**










### 🔥 **Random Forest Classifier**
Based on the summary statistics in the DataFrames above, the Random Forest classifier offers the best trade-off on the **Hard Dataset**:

* ✅ **Highest test accuracy**
  * (94.5% vs. 93.9% for LR and 89.7% for NB)

* 🔍 **Strongest recall**
  * (98.9%—catch nearly every spam)

* 👍 **Very good precision**
  * (96.2%—still avoid most false-positives)

* ↔️ **Reasonable generalization gap**
  * (95.5% train → 94.5% test)

While NB has the highest precision (97.7%) and LR is slightly ahead of Random Forest on precision (96.8% vs. 96.2%), Random Forest leads on both test accuracy and recall **without sacrificing** much in any one metric. Its results are also close enough to those on the easy dataset that the differences are marginal.

# **Appendix**

## ✅ **Precision / Recall Disambiguation**

**Precision** = TP / (TP + FP): of all emails flagged **spam**, how many truly were spam?

**Recall** = TP / (TP + FN): of all **actual** spam emails, how many did it catch?

### 📥 **Precision** (When classified as “spam,” how often is it right?)

  * Of all the emails the model **flags as spam**, precision is the ***percentage that are actually spam.***

  * **High precision** means it ***rarely mis-labels*** a good (ham) email as spam.

### 📥 **Recall** (Of all the real spam out there, how many did it catch?)

  * Of all the **actual spam** emails **received**, recall is the ***percentage the model successfully flags***.

  * **High recall** means it ***misses very few spam messages***—even if it accidentally catches a bit more ham.

### **Plain Summary**
📥 **Precision:** “If the model quarantines 100 emails as spam, how many of those 100 are truly spam?”

📤 **Recall:** “If there are 100 spam emails in total, how many did the filter actually catch?”

### **Further Explanation**
Spam = Positive

* 🟢✅ **TP** (True Positives): actual spam correctly labeled spam

* 🔴✅ **FP** (False Positives): actual ham incorrectly labeled spam

* 🔴❌ **FN** (False Negatives): actual spam missed (labeled ham)

* 🟢❌ **TN** (True Negatives): actual ham correctly labeled ham





```
                Predicted
               | Spam | Ham
        ───────────────────
Actual | Spam  |  TP  |  FN
       | Ham   |  FP  |  TN
       
```
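
The two formulas can be checked directly from the four cells of the confusion matrix. A minimal sketch with hypothetical counts follows; `precision_recall_sketch` is an illustrative helper, not part of the notebook.

```python
def precision_recall_sketch(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, with spam as the positive class.
tp, fp, fn, tn = 90, 5, 10, 45

precision, recall = precision_recall_sketch(tp, fp, fn)
print(f"Precision: {precision:.1%}")
print(f"Recall:    {recall:.1%}")
```

Note that `tn` does not appear in either formula: neither metric rewards the model for correctly passing ham through.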



### **Note on Precision/Recall Trade-off for Spam Detection**

According to ChatGPT, in most spam-filtering scenarios, **precision** tends to be the more critical metric:

* **Precision** (97.7% for Multinomial Naive Bayes on the hard dataset) tells you “of all the emails we flagged as spam, 97.7% truly were spam.”

  * A **high precision** means very few ham messages get mis-labeled and sent to the spam folder by mistake. Losing legitimate mail is usually more painful for users than letting a few spam messages slip through.

* **Recall** (89.9% for the same model) tells you “of all the actual spam emails, we caught 89.9%.”

  * A **slightly lower recall** means you’ll miss about 10% of spam (it lands in the inbox), which is annoying but generally less harmful than false positives.

Because **false positives** (**ham** → **spam**) risk losing or hiding important mail, you’ll typically **optimize** for **precision first**.

You can then **tune the decision threshold** (or explore cost-sensitive training) to nudge recall higher without letting precision drop below an acceptable floor.
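
As a sketch of what threshold tuning could look like, the code below sweeps candidate thresholds over toy spam probabilities and keeps the lowest one whose precision stays at or above a chosen floor. Because recall can only fall as the threshold rises, that choice also gives the highest recall among the thresholds meeting the floor. The helper names, scores, and the 0.9 floor are all hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as spam."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def lowest_threshold_meeting(y_true, scores, min_precision):
    """Pick the lowest threshold whose precision stays at or above the floor."""
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(y_true, scores, threshold)
        if precision >= min_precision:
            return threshold, precision, recall
    return None


# Toy spam probabilities (hypothetical), 1 = spam, 0 = ham.
y_true_toy = [0, 0, 1, 0, 1, 1, 1, 1]
scores_toy = [0.10, 0.30, 0.35, 0.60, 0.65, 0.80, 0.90, 0.95]

print(lowest_threshold_meeting(y_true_toy, scores_toy, min_precision=0.9))
```

With scikit-learn, `precision_recall_curve` performs a similar sweep in one call when given true labels and the positive-class probabilities from `predict_proba`. The threshold is best chosen on validation data rather than on the test set.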

*Thank You*