# DX 704 Week 9 Project
This week's project will build an email spam classifier based on the Enron email data set.
You will perform your own feature extraction and use naïve Bayes to estimate the probability that a particular email is spam.
Finally, you will review the tradeoffs from different thresholds for automatically sending emails to the junk folder.

The full project description and a template notebook are available on GitHub: [Project 9 Materials](https://github.com/bu-cds-dx704/dx704-project-09).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download Data Set

We will be using the Enron spam data set as prepared in this GitHub repository.

https://github.com/MWiechmann/enron_spam_data

You may need to download this differently depending on your environment.
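If `wget` is not available in your environment, the same file can be fetched with the Python standard library. The following is a minimal sketch; the helper name `download_if_missing` is hypothetical and not part of the project template.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists (hypothetical helper)."""
    dest = Path(dest)
    if dest.exists():
        return dest  # reuse the copy that is already on disk
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (commented out so the sketch does not touch the network when run):
# download_if_missing(
#     "https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip",
#     "enron_spam_data.zip",
# )
```

Because the helper returns early when the file exists, re-running the notebook does not download the archive again.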

In [1]:
!wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip

--2025-10-31 02:48:59--  https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip [following]
--2025-10-31 02:48:59--  https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15642124 (15M) [application/zip]
Saving to: ‘enron_spam_data.zip’


2025-10-31 02:48:59 (141 MB/s) - ‘enron_spam_data.zip’ saved [15642124/15642124]



In [2]:
import pandas as pd

In [3]:
# pandas can read the zip file directly
enron_spam_data = pd.read_csv("enron_spam_data.zip")
enron_spam_data

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date
0,0,christmas tree farm pictures,,ham,1999-12-10
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14
...,...,...,...,...,...
33711,33711,= ? iso - 8859 - 1 ? q ? good _ news _ c = eda...,"hello , welcome to gigapharm onlinne shop .\np...",spam,2005-07-29
33712,33712,all prescript medicines are on special . to be...,i got it earlier than expected and it was wrap...,spam,2005-07-29
33713,33713,the next generation online pharmacy .,are you ready to rock on ? let the man in you ...,spam,2005-07-30
33714,33714,bloow in 5 - 10 times the time,learn how to last 5 - 10 times longer in\nbed ...,spam,2005-07-30


In [4]:
(enron_spam_data["Spam/Ham"] == "spam").mean()

np.float64(0.5092834262664611)

## Part 2: Design a Feature Extractor

Design a feature extractor for this data set and write out two files of features based on the text.
Don't forget that both the Subject and Message columns are relevant sources of text data.
For each email, you should count the number of repetitions of each feature present.
The auto-grader will assume that you are using a multinomial distribution in the following problems.

In [5]:
# YOUR CHANGES HERE

import json
import re
from collections import Counter

def extract_features(subject, message):
    """
    Extract features (words) from subject and message.
    Returns a dictionary with word counts.
    """
    text = ""
    if pd.notna(subject):
        text += str(subject) + " "
    if pd.notna(message):
        text += str(message)

    text = text.lower()

    words = re.findall(r'\b[a-z0-9]+\b', text)

    feature_counts = Counter(words)

    return dict(feature_counts)

enron_spam_data['features'] = enron_spam_data.apply(
    lambda row: extract_features(row['Subject'], row['Message']),
    axis=1
)

Assign a row to the test data set if `Message ID % 30 == 0` and assign it to the training data set otherwise.
Write two files, "train-features.tsv" and "test-features.tsv" with two columns, Message ID and features_json.
The features_json column should contain a JSON dictionary where the keys are your feature names and the values are integer feature values.
This will give us a sparse feature representation.
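To make the split rule and the sparse representation concrete, here is a small sketch. The sample text is made up, the tokenizer is only one possible choice, and `assign_split` is a hypothetical helper name.

```python
import json
import re
from collections import Counter


def assign_split(message_id):
    """Return "test" when Message ID is a multiple of 30, else "train"."""
    return "test" if message_id % 30 == 0 else "train"


# Made-up text standing in for a Subject plus Message; not from the data set.
text = "Free offer: claim your free prize now"

# Count lowercase alphanumeric tokens; only tokens that occur are stored.
counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))

# Serialize to the JSON dictionary expected in the features_json column.
features_json = json.dumps(counts, sort_keys=True)
print(features_json)

# Reading the string back recovers the same integer counts.
restored = json.loads(features_json)
print(assign_split(0), assign_split(31))
```

Only features that actually occur in an email appear in its dictionary, which is what keeps the representation sparse.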


In [6]:
# YOUR CHANGES HERE

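# Copy the slices so that columns can be added later without a pandas SettingWithCopyWarning.
train_data = enron_spam_data[enron_spam_data['Message ID'] % 30 != 0].copy()
test_data = enron_spam_data[enron_spam_data['Message ID'] % 30 == 0].copy()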

train_features = train_data[['Message ID', 'features']].copy()
train_features['features_json'] = train_features['features'].apply(json.dumps)
train_features[['Message ID', 'features_json']].to_csv('train-features.tsv', sep='\t', index=False)

test_features = test_data[['Message ID', 'features']].copy()
test_features['features_json'] = test_features['features'].apply(json.dumps)
test_features[['Message ID', 'features_json']].to_csv('test-features.tsv', sep='\t', index=False)

print(f"Training set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")
print(f"\nExample features from first training email:")
print(json.dumps(list(train_features['features'].iloc[0].items())[:10], indent=2))

Training set size: 32592
Test set size: 1124

Example features from first training email:
[
  [
    "vastar",
    6
  ],
  [
    "resources",
    4
  ],
  [
    "inc",
    4
  ],
  [
    "gary",
    2
  ],
  [
    "production",
    3
  ],
  [
    "from",
    3
  ],
  [
    "the",
    8
  ],
  [
    "high",
    2
  ],
  [
    "island",
    2
  ],
  [
    "larger",
    2
  ]
]


Submit "train-features.tsv" and "test-features.tsv" in Gradescope.

Hint: these features will be graded based on the test accuracy of a logistic regression based on the training features.
This is to make sure that your feature set is not degenerate; you do not need to compute this regression yourself.
You can separately assess your feature quality based on your results in Part 6.

## Part 3: Compute Conditional Probabilities

Based on your training data, compute appropriate conditional probabilities for use with naïve Bayes.
Use additive smoothing with $\alpha=1$ to avoid zero probabilities.
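For reference, one common form of the additive-smoothing estimate for a multinomial model is given below. The symbols are notation introduced here, not taken from the assignment: $N_{w,c}$ is the total count of feature $w$ across training emails of class $c$, $N_c = \sum_{w} N_{w,c}$ is the total feature count for that class, and $|V|$ is the number of distinct features seen in training.

$$
P(w \mid c) = \frac{N_{w,c} + \alpha}{N_c + \alpha\,|V|}
$$

With $\alpha = 1$, a feature never seen in class $c$ gets probability $1 / (N_c + |V|)$ rather than zero, and the probabilities for each class still sum to one over the vocabulary.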


In [7]:
# YOUR CHANGES HERE

from collections import defaultdict

train_spam = train_data[train_data['Spam/Ham'] == 'spam']
train_ham = train_data[train_data['Spam/Ham'] == 'ham']

spam_feature_counts = defaultdict(int)
ham_feature_counts = defaultdict(int)

for features in train_spam['features']:
    for feature, count in features.items():
        spam_feature_counts[feature] += count

for features in train_ham['features']:
    for feature, count in features.items():
        ham_feature_counts[feature] += count

all_features = set(spam_feature_counts.keys()) | set(ham_feature_counts.keys())
vocabulary_size = len(all_features)

total_spam_words = sum(spam_feature_counts.values())
total_ham_words = sum(ham_feature_counts.values())

alpha = 1

feature_probabilities = []

for feature in all_features:
    spam_count = spam_feature_counts[feature]
    ham_count = ham_feature_counts[feature]

    spam_prob = (spam_count + alpha) / (total_spam_words + alpha * vocabulary_size)

    ham_prob = (ham_count + alpha) / (total_ham_words + alpha * vocabulary_size)

    feature_probabilities.append({
        'feature': feature,
        'ham_probability': ham_prob,
        'spam_probability': spam_prob
    })

prob_df = pd.DataFrame(feature_probabilities)

print(f"Vocabulary size: {vocabulary_size}")
print(f"Total spam words: {total_spam_words}")
print(f"Total ham words: {total_ham_words}")
print(f"\nFirst few feature probabilities:")
print(prob_df.head(10))

Vocabulary size: 154293
Total spam words: 3447484
Total ham words: 4416160

First few feature probabilities:
      feature  ham_probability  spam_probability
0       pdowo     2.187967e-07      8.329222e-07
1        xlol     2.187967e-07      8.329222e-07
2     jugging     2.187967e-07      1.665844e-06
3        3263     2.187967e-07      5.552815e-07
4      pompon     2.187967e-07      1.388204e-06
5   atlantico     2.187967e-07      1.110563e-06
6  themillion     2.187967e-07      5.552815e-07
7     totemic     2.187967e-07      1.110563e-06
8     impulse     8.751868e-07      3.054048e-06
9     genesys     2.187967e-07      8.329222e-07


Save the conditional probabilities in a file "feature-probabilities.tsv" with columns feature, ham_probability and spam_probability.
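One optional sanity check, sketched below, is that each probability column sums to approximately 1 over the vocabulary, which should hold when the smoothing denominator accounts for every feature. The helper name `column_sums` is hypothetical. It reads the file with the standard `csv` module, which keeps every feature name as a plain string; pandas' `read_csv` with default settings treats strings such as `nan` or `null` as missing values.

```python
import csv
import math


def column_sums(path):
    """Sum ham_probability and spam_probability over all rows of a TSV file."""
    ham_values, spam_values = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            ham_values.append(float(row["ham_probability"]))
            spam_values.append(float(row["spam_probability"]))
    # math.fsum limits rounding error when adding many tiny probabilities.
    return math.fsum(ham_values), math.fsum(spam_values)


# Example usage, assuming the file from this part has been written:
# ham_sum, spam_sum = column_sums("feature-probabilities.tsv")
# print(ham_sum, spam_sum)
```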

In [8]:
# YOUR CHANGES HERE

prob_df.to_csv('feature-probabilities.tsv', sep='\t', index=False)

print("Feature probabilities saved to 'feature-probabilities.tsv'")
print(f"Total features: {len(prob_df)}")

Feature probabilities saved to 'feature-probabilities.tsv'
Total features: 154293


Submit "feature-probabilities.tsv" in Gradescope.

## Part 4: Implement a Naïve Bayes Classifier

Implement a naïve Bayes classifier based on your previous feature probabilities.
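Multiplying many small per-feature probabilities can underflow to zero, so a common approach is to add log probabilities and normalize at the end. Below is a minimal sketch on made-up numbers; the probabilities, the priors, and the function name `posterior_spam` are illustrative assumptions, not values from the data set.

```python
import math


def posterior_spam(counts, ham_probs, spam_probs, prior_ham, prior_spam):
    """Return P(spam | counts) under a multinomial naive Bayes model."""
    log_ham = math.log(prior_ham)
    log_spam = math.log(prior_spam)
    for feature, count in counts.items():
        if feature not in ham_probs or feature not in spam_probs:
            continue  # skip features that were never seen in training
        log_ham += count * math.log(ham_probs[feature])
        log_spam += count * math.log(spam_probs[feature])
    # Subtract the larger log score before exponentiating so that at least
    # one term equals exp(0) = 1 and the denominator cannot become zero.
    top = max(log_ham, log_spam)
    ham = math.exp(log_ham - top)
    spam = math.exp(log_spam - top)
    return spam / (ham + spam)


# Made-up smoothed probabilities for a two-word vocabulary.
ham_probs = {"meeting": 0.8, "prize": 0.2}
spam_probs = {"meeting": 0.1, "prize": 0.9}

p = posterior_spam({"prize": 2}, ham_probs, spam_probs, 0.5, 0.5)
print(round(p, 4))
```

Each feature's log probability is multiplied by its count, which corresponds to the multinomial assumption stated in Part 2.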

In [10]:
# YOUR CHANGES HERE

import numpy as np

prior_spam = (train_data['Spam/Ham'] == 'spam').mean()
prior_ham = (train_data['Spam/Ham'] == 'ham').mean()

print(f"Prior P(spam): {prior_spam:.4f}")
print(f"Prior P(ham): {prior_ham:.4f}")

spam_prob_dict = dict(zip(prob_df['feature'], prob_df['spam_probability']))
ham_prob_dict = dict(zip(prob_df['feature'], prob_df['ham_probability']))

def predict_naive_bayes(features):
    """
    Predict spam/ham probabilities for an email given its features.
    Uses log probabilities to avoid underflow.
    """
    log_prob_spam = np.log(prior_spam)
    log_prob_ham = np.log(prior_ham)

    for feature, count in features.items():
        if feature in spam_prob_dict:
            log_prob_spam += count * np.log(spam_prob_dict[feature])
        if feature in ham_prob_dict:
            log_prob_ham += count * np.log(ham_prob_dict[feature])

    max_log_prob = max(log_prob_spam, log_prob_ham)
    prob_spam = np.exp(log_prob_spam - max_log_prob)
    prob_ham = np.exp(log_prob_ham - max_log_prob)

    total = prob_spam + prob_ham
    prob_spam /= total
    prob_ham /= total

    return prob_ham, prob_spam

train_data['ham_prob'], train_data['spam_prob'] = zip(*train_data['features'].apply(predict_naive_bayes))

print(f"\nFirst few predictions:")
print(train_data[['Message ID', 'Spam/Ham', 'ham_prob', 'spam_prob']].head(10))

Prior P(spam): 0.5093
Prior P(ham): 0.4907

First few predictions:
    Message ID Spam/Ham  ham_prob      spam_prob
1            1      ham       1.0  1.566871e-186
2            2      ham       1.0   1.520225e-12
3            3      ham       1.0  7.126796e-155
4            4      ham       1.0  2.713786e-150
5            5      ham       1.0   5.918101e-40
6            6      ham       1.0   9.672365e-29
7            7      ham       1.0  7.961722e-207
8            8      ham       1.0   9.156855e-92
9            9      ham       1.0  3.760904e-248
10          10      ham       1.0   1.327986e-69




Save your prediction probabilities to "train-predictions.tsv" with columns Message ID, ham and spam.

In [11]:
# YOUR CHANGES HERE

train_predictions = train_data[['Message ID', 'ham_prob', 'spam_prob']].copy()
train_predictions.columns = ['Message ID', 'ham', 'spam']

train_predictions.to_csv('train-predictions.tsv', sep='\t', index=False)

print(f"Train predictions saved to 'train-predictions.tsv'")
print(f"Total predictions: {len(train_predictions)}")

Train predictions saved to 'train-predictions.tsv'
Total predictions: 32592


Submit "train-predictions.tsv" in Gradescope.

## Part 5: Predict Spam Probability for Test Data

Use your previous classifier to predict spam probability for the test data.

In [12]:
# YOUR CHANGES HERE

test_data['ham_prob'], test_data['spam_prob'] = zip(*test_data['features'].apply(predict_naive_bayes))

print(f"First few test predictions:")
print(test_data[['Message ID', 'Spam/Ham', 'ham_prob', 'spam_prob']].head(10))

First few test predictions:
     Message ID Spam/Ham  ham_prob      spam_prob
0             0      ham  0.054552   9.454476e-01
30           30      ham  1.000000   1.159190e-84
60           60      ham  1.000000   1.180465e-12
90           90      ham  1.000000   7.572908e-34
120         120      ham  1.000000  2.341464e-188
150         150      ham  1.000000   4.142824e-11
180         180      ham  0.999991   9.471159e-06
210         210      ham  1.000000   2.707647e-38
240         240      ham  1.000000   4.825440e-59
270         270      ham  1.000000   4.073921e-38




Save your prediction probabilities in "test-predictions.tsv" with the same columns as "train-predictions.tsv".

In [13]:
# YOUR CHANGES HERE

test_predictions = test_data[['Message ID', 'ham_prob', 'spam_prob']].copy()
test_predictions.columns = ['Message ID', 'ham', 'spam']

test_predictions.to_csv('test-predictions.tsv', sep='\t', index=False)

print(f"Test predictions saved to 'test-predictions.tsv'")
print(f"Total predictions: {len(test_predictions)}")

Test predictions saved to 'test-predictions.tsv'
Total predictions: 1124


Submit "test-predictions.tsv" in Gradescope.

## Part 6: Construct ROC Curve

For every probability threshold from 0.01 to 0.99 in increments of 0.01, compute the false and true positive rates from the test data using the spam class for positives.
That is, if the predicted spam probability is greater than or equal to the threshold, predict spam.
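As a small illustration of the thresholding rule, the sketch below computes both rates for a single threshold on made-up scores. The function name `rates_at_threshold` and the sample values are hypothetical, and the code assumes that both classes are present in the data.

```python
def rates_at_threshold(spam_probs, is_spam, threshold):
    """Return (false_positive_rate, true_positive_rate) with spam as positive."""
    # Predict spam when the probability is greater than or equal to the threshold.
    predicted = [p >= threshold for p in spam_probs]
    tp = sum(pred and actual for pred, actual in zip(predicted, is_spam))
    fp = sum(pred and not actual for pred, actual in zip(predicted, is_spam))
    positives = sum(is_spam)
    negatives = len(is_spam) - positives
    return fp / negatives, tp / positives


# Made-up predicted spam probabilities and true labels (True means spam).
spam_probs = [0.95, 0.60, 0.40, 0.05, 0.50]
is_spam = [True, True, False, False, False]

fpr, tpr = rates_at_threshold(spam_probs, is_spam, 0.50)
print(fpr, tpr)
```

Raising the threshold in this toy example to 0.70 removes the one false positive but also misses one of the two spam emails, which is the tradeoff the ROC curve summarizes.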

In [14]:
# YOUR CHANGES HERE

test_actual = (test_data['Spam/Ham'] == 'spam').values
test_spam_probs = test_data['spam_prob'].values

roc_data = []

# Build thresholds from integers to avoid floating-point drift from a fractional np.arange step.
for threshold in np.arange(1, 100) / 100:
    predictions = test_spam_probs >= threshold

    # Both arrays are boolean, so combine them directly instead of comparing to True/False.
    true_positives = np.sum(predictions & test_actual)
    false_positives = np.sum(predictions & ~test_actual)
    true_negatives = np.sum(~predictions & ~test_actual)
    false_negatives = np.sum(~predictions & test_actual)

    true_positive_rate = true_positives / (true_positives + false_negatives)

    false_positive_rate = false_positives / (false_positives + true_negatives)

    roc_data.append({
        'threshold': threshold,
        'false_positive_rate': false_positive_rate,
        'true_positive_rate': true_positive_rate
    })

roc_df = pd.DataFrame(roc_data)

print(f"ROC curve data calculated for {len(roc_df)} thresholds")
print(f"\nFirst few rows:")
print(roc_df.head(10))
print(f"\nLast few rows:")
print(roc_df.tail(10))

ROC curve data calculated for 99 thresholds

First few rows:
   threshold  false_positive_rate  true_positive_rate
0       0.01             0.034420            0.996503
1       0.02             0.034420            0.996503
2       0.03             0.032609            0.996503
3       0.04             0.032609            0.996503
4       0.05             0.028986            0.996503
5       0.06             0.027174            0.996503
6       0.07             0.025362            0.996503
7       0.08             0.025362            0.994755
8       0.09             0.023551            0.994755
9       0.10             0.023551            0.994755

Last few rows:
    threshold  false_positive_rate  true_positive_rate
89       0.90             0.018116            0.979021
90       0.91             0.018116            0.979021
91       0.92             0.018116            0.977273
92       0.93             0.018116            0.977273
93       0.94             0.018116            0.977273

Save this data in a file "roc.tsv" with columns threshold, false_positive_rate and true_positive_rate.

In [15]:
# YOUR CHANGES HERE

roc_df.to_csv('roc.tsv', sep='\t', index=False)

print(f"ROC curve data saved to 'roc.tsv'")
print(f"Total thresholds: {len(roc_df)}")

ROC curve data saved to 'roc.tsv'
Total thresholds: 99


Submit "roc.tsv" in Gradescope.

## Part 7: Sign Up for a Gemini API Key

Create a free Gemini API key at https://aistudio.google.com/app/api-keys.
You will need to do this with a personal Google account - it will not work with your BU Google account.
This will not incur any charges unless you configure billing information for the key.

You will be asked to start a Gemini free trial for week 11.
This will not incur any charges unless you exceed expected usage by an order of magnitude.


No submission needed.

## Part 8: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.
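Before submitting, it may help to confirm that a fresh run of the notebook produced every file named in the earlier parts. The sketch below is one way to do that; `missing_outputs` is a hypothetical helper, and the file list is collected from the submission instructions above.

```python
from pathlib import Path

# Files that the earlier parts ask you to submit.
EXPECTED_FILES = [
    "train-features.tsv",
    "test-features.tsv",
    "feature-probabilities.tsv",
    "train-predictions.tsv",
    "test-predictions.tsv",
    "roc.tsv",
]


def missing_outputs(directory="."):
    """Return the expected files that are absent or empty in directory."""
    base = Path(directory)
    return [
        name for name in EXPECTED_FILES
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


print(missing_outputs())
```

An empty list means every expected file exists and is non-empty; it does not check that the contents are correct.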

## Part 9: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.

In [16]:
from datetime import date
import os

ack_text = f"""DX704 Week 9 — Acknowledgments
Date: {date.today().isoformat()}

People / Discussions
- None.

External Libraries (beyond standard course stack)
- None. (Only numpy, pandas, json, re, and collections were used, which are part of the standard stack.)

Data Sources
- Enron spam dataset from https://github.com/MWiechmann/enron_spam_data

Generative AI Usage
- None.

Other Sources
- DX601–DX704 example notebooks referenced as allowed.
"""

with open("acknowledgements.txt", "w", encoding="utf-8") as f:
    f.write(ack_text)

print("Exists?", os.path.exists("acknowledgements.txt"),
      "Size:", os.path.getsize("acknowledgements.txt"), "bytes")

Exists? True Size: 423 bytes
