# DX 704 Week 9 Project

This week's project will build an email spam classifier based on the Enron email data set.
You will perform your own feature extraction, and use naive Bayes to estimate the probability that a particular email is spam or not.
Finally, you will review the tradeoffs from different thresholds for automatically sending emails to the junk folder.

The full project description and a template notebook are available on GitHub: [Project 9 Materials](https://github.com/bu-cds-dx704/dx704-project-09).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download Data Set

We will be using the Enron spam data set as prepared in this GitHub repository.

https://github.com/MWiechmann/enron_spam_data

You may need to download this differently depending on your environment.
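
If `wget` is not available in your environment, the Python standard library offers one alternative. The following is a minimal sketch, not part of the project template: the helper name `download_if_missing` is hypothetical, and the URL is the one used in the `wget` command in this notebook.

```python
import os
import urllib.request

# URL copied from the wget command in this notebook.
DATA_URL = (
    "https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/"
    "enron_spam_data.zip"
)

def download_if_missing(url: str, path: str) -> str:
    """Download url to path unless the file already exists; return path."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Example call (needs network access the first time it runs):
# download_if_missing(DATA_URL, "enron_spam_data.zip")
```

Skipping the download when the file is already present keeps repeated notebook runs from fetching the archive again.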

In [1]:
!wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip

--2025-11-03 04:54:49--  https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip [following]
--2025-11-03 04:54:49--  https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15642124 (15M) [application/zip]
Saving to: ‘enron_spam_data.zip’


2025-11-03 04:54:50 (123 MB/s) - ‘enron_spam_data.zip’ saved [15642124/15642124]



In [1]:
import pandas as pd

In [2]:
# pandas can read the zip file directly
enron_spam_data = pd.read_csv("enron_spam_data.zip")
enron_spam_data

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date
0,0,christmas tree farm pictures,,ham,1999-12-10
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14
...,...,...,...,...,...
33711,33711,= ? iso - 8859 - 1 ? q ? good _ news _ c = eda...,"hello , welcome to gigapharm onlinne shop .\np...",spam,2005-07-29
33712,33712,all prescript medicines are on special . to be...,i got it earlier than expected and it was wrap...,spam,2005-07-29
33713,33713,the next generation online pharmacy .,are you ready to rock on ? let the man in you ...,spam,2005-07-30
33714,33714,bloow in 5 - 10 times the time,learn how to last 5 - 10 times longer in\nbed ...,spam,2005-07-30


In [3]:
(enron_spam_data["Spam/Ham"] == "spam").mean()

np.float64(0.5092834262664611)

## Part 2: Design a Feature Extractor

Design a feature extractor for this data set and write out two files of features based on the text.
Don't forget that both the Subject and Message columns are relevant sources of text data.
For each email, you should count the number of repetitions of each feature present.
The auto-grader will assume that you are using a multinomial distribution in the following problems.
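
As an illustration of counting repetitions, here is a minimal sketch of a bag-of-words counter over both text columns. The function name `count_tokens` and the token pattern are assumptions made for this example; they are one possible design, not the required one.

```python
import re
from collections import Counter

def count_tokens(subject: str, message: str) -> Counter:
    """Count lowercase word tokens across the subject and message text."""
    text = f"{subject} {message}".lower()
    return Counter(re.findall(r"[a-z0-9']+", text))

counts = count_tokens("Free offer", "free free money , click now")
print(counts["free"])   # 3
print(counts["money"])  # 1
```

Because each value is the number of times a token occurs, the result fits the multinomial model directly. Real rows may have missing Subject or Message values, which would need to be replaced with empty strings first.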

In [4]:
# YOUR CHANGES HERE

import re
import json
from collections import Counter
import pandas as pd
import numpy as np


try:
    _df = enron_spam_data.copy()
except NameError:
    _df = pd.read_csv("enron_spam_data.zip")

required_cols = {"Message ID", "Subject", "Message"}
missing = required_cols - set(_df.columns)
if missing:
    raise ValueError(f"Missing required columns: {missing}")


URL_RE = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", re.IGNORECASE)
NUMBER_RE = re.compile(r"\b\d+(?:[\d,./:-]*\d)?\b")

TOKEN_RE = re.compile(r"\b[\w']+\b", re.UNICODE)

In [5]:
def normalize_text(text: str) -> str:
    if not isinstance(text, str):
        return ""
    t = text.lower()
    t = URL_RE.sub("__URL__", t)
    t = EMAIL_RE.sub("__EMAIL__", t)
    t = NUMBER_RE.sub("__NUMBER__", t)
    return t

def tokenize(text: str):
    # returns a list of tokens
    return TOKEN_RE.findall(text)

def extract_features_from_row(row) -> dict:
    # Combine Subject and Message; both are relevant
    subject = normalize_text(row.get("Subject", ""))
    message = normalize_text(row.get("Message", ""))

    # Simple unigram bag-of-words with counts (multinomial NB friendly)
    tokens = []
    if subject:
        tokens.extend(tokenize(subject))
    if message:
        tokens.extend(tokenize(message))

    counts = Counter(tokens)

    # Optional meta-features that mirror the placeholder counts
    # (still integer counts, so multinomial-compatible).
    # Each HAS_* feature repeats its placeholder's occurrence count, which
    # effectively doubles that signal's weight under naive Bayes.
    if "__URL__" in counts:
        counts["HAS_URL"] = counts["__URL__"]
    if "__EMAIL__" in counts:
        counts["HAS_EMAIL"] = counts["__EMAIL__"]
    if "__NUMBER__" in counts:
        counts["HAS_NUMBER"] = counts["__NUMBER__"]

    # Convert Counter to a plain dict of str -> int; keep only positive counts
    features = {str(k): int(v) for k, v in counts.items() if v > 0}
    return features


In [6]:
features_records = []
for _, row in _df.iterrows():
    mid = int(row["Message ID"])
    feats = extract_features_from_row(row)
    features_records.append({
        "Message ID": mid,
        "features_json": json.dumps(feats, sort_keys=True, ensure_ascii=False)
    })

features_df = pd.DataFrame(features_records)

Assign a row to the test data set if `Message ID % 30 == 0` and assign it to the training data set otherwise.
Write two files, "train-features.tsv" and "test-features.tsv" with two columns, Message ID and features_json.
The features_json column should contain a JSON dictionary where the keys are your feature names and the values are integer feature values.
This will give us a sparse feature representation.
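
The split rule and the JSON encoding can be sketched for a single email as follows. The helper name `to_feature_row` is hypothetical and the feature counts are made-up toy values.

```python
import json

def to_feature_row(message_id: int, counts: dict) -> tuple:
    """Return (split, Message ID, features_json) for one email."""
    split = "test" if message_id % 30 == 0 else "train"
    # Cast to plain ints so the JSON values are integers.
    features_json = json.dumps(
        {str(k): int(v) for k, v in counts.items()}, sort_keys=True
    )
    return split, message_id, features_json

print(to_feature_row(60, {"money": 1, "free": 3}))
# ('test', 60, '{"free": 3, "money": 1}')
print(to_feature_row(61, {"free": 1})[0])
# train
```

Only features that actually occur in an email appear in its dictionary, which is what makes the representation sparse.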


In [7]:
# YOUR CHANGES HERE

# Split per the rule: test if Message ID % 30 == 0, else train
is_test = features_df["Message ID"] % 30 == 0
train_df = features_df.loc[~is_test].copy()
test_df  = features_df.loc[ is_test].copy()

# Save to TSV
train_df.to_csv("train-features.tsv", sep="\t", index=False)
test_df.to_csv("test-features.tsv",  sep="\t", index=False)

# Quick report
print(f"Total emails: {len(features_df)}")
print(f"Train rows : {len(train_df)}   -> train-features.tsv")
print(f"Test rows  : {len(test_df)}    -> test-features.tsv")

print("\nSample rows:")
print(train_df.head(2).to_string(index=False))
print(test_df.head(2).to_string(index=False))

Total emails: 33716
Train rows : 32592   -> train-features.tsv
Test rows  : 1124    -> test-features.tsv

Sample rows:
 Message ID ...

Submit "train-features.tsv" and "test-features.tsv" in Gradescope.

Hint: these features will be graded based on the test accuracy of a logistic regression based on the training features.
This is to make sure that your feature set is not degenerate; you do not need to compute this regression yourself.
You can separately assess your feature quality based on your results in part 6.

## Part 3: Compute Conditional Probabilities

Based on your training data, compute appropriate conditional probabilities for use with naïve Bayes.
Use additive smoothing with $\alpha=1$ to avoid zeros.
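
With additive smoothing, the estimate for feature $f$ in class $c$ is $(n_{c,f} + \alpha) / (N_c + \alpha V)$, where $n_{c,f}$ is the total count of $f$ in class $c$, $N_c$ is the total count of all features in that class, and $V$ is the vocabulary size. The sketch below applies this to toy counts; the names `smoothed_probs`, `toy_ham`, `toy_spam` and `toy_vocab` are illustrative assumptions.

```python
from collections import Counter

def smoothed_probs(class_counts: Counter, vocab: set, alpha: float = 1.0) -> dict:
    """P(feature | class) with additive smoothing over a shared vocabulary."""
    denom = sum(class_counts[f] for f in vocab) + alpha * len(vocab)
    return {f: (class_counts[f] + alpha) / denom for f in vocab}

toy_ham = Counter({"meeting": 3, "gas": 1})
toy_spam = Counter({"free": 4})
toy_vocab = set(toy_ham) | set(toy_spam)

ham_p = smoothed_probs(toy_ham, toy_vocab)
print(ham_p["free"])  # (0 + 1) / (4 + 3), i.e. 1/7
print(abs(sum(ham_p.values()) - 1.0) < 1e-12)  # True
```

The feature "free" never appears in the toy ham counts, yet its smoothed probability is nonzero, and the probabilities for each class still sum to 1 over the vocabulary.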


In [8]:
# YOUR CHANGES HERE

import json
import math
import pandas as pd
import numpy as np
from collections import Counter


train_df = pd.read_csv("train-features.tsv", sep="\t")
test_df  = pd.read_csv("test-features.tsv", sep="\t")

train_df["features_dict"] = train_df["features_json"].apply(json.loads)
test_df["features_dict"]  = test_df["features_json"].apply(json.loads)

In [9]:
def _load_original_df():
    # Prefer in-memory if earlier cells created it
    try:
        return enron_spam_data.copy()
    except NameError:
        pass
    # Try common filenames; pandas can read a single-csv zip
    for cand in ["enron_spam_data.zip", "enron_spam_data.csv", "enron.csv", "data.csv"]:
        try:
            return pd.read_csv(cand)
        except Exception:
            continue
    return None

raw_df = _load_original_df()
if raw_df is None:
    raise ValueError(
        "Could not load the original Enron dataset to obtain labels. "
        "Expected one of: enron_spam_data.zip / enron_spam_data.csv"
    )

# Normalize/locate Message ID
if "Message ID" not in raw_df.columns:
    for alt in ["MessageID", "message_id", "msg_id", "Id", "id"]:
        if alt in raw_df.columns:
            raw_df["Message ID"] = raw_df[alt]
            break
if "Message ID" not in raw_df.columns:
    raise ValueError("Could not find a 'Message ID' column in the original dataset.")

def _to01(x):
    if isinstance(x, str):
        xs = x.strip().lower()
        if xs in {"spam", "s", "1", "true", "yes", "y"}:
            return 1
        if xs in {"ham", "h", "0", "false", "no", "n", "not_spam", "not spam"}:
            return 0
    if isinstance(x, (bool, np.bool_)):
        return 1 if x else 0
    try:
        v = int(x)
        return 1 if v != 0 else 0
    except Exception:
        return np.nan

def _get_labels_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1) Try many common label names first
    preferred = [
        "Spam", "spam", "Label", "label", "Class", "class", "Target", "target",
        "is_spam", "Is Spam", "Spam/Ham", "SpamHam", "Category", "category"
    ]
    cand = None
    for c in preferred:
        if c in df.columns:
            cand = c
            break

    # 2) If not found, auto-detect a binary-like column (exactly 2 unique non-null values)
    if cand is None:
        for c in df.columns:
            if c == "Message ID":
                continue
            vals = pd.Series(df[c]).dropna().unique()
            if len(vals) == 2:
                cand = c
                break

    if cand is None:
        raise ValueError(
            f"No label column found. Available columns: {list(df.columns)}. "
            "Expected something like 'Spam', 'Label', or a binary-like column."
        )

    mapped = df[cand].apply(_to01)

    # If mapping failed (all NaN), try a heuristic using the two unique values
    if mapped.isna().all():
        uniq = pd.Series(df[cand]).dropna().unique()
        if len(uniq) == 2:
            u0, u1 = uniq[0], uniq[1]
            s0, s1 = str(u0).lower(), str(u1).lower()
            # Treat the value that mentions "spam" as the positive class; default to the second value.
            spam_val, ham_val = (u0, u1) if "spam" in s0 else (u1, u0)
            mapped = df[cand].apply(lambda v: 1 if v == spam_val else (0 if v == ham_val else np.nan))
        else:
            raise ValueError(f"Could not coerce label column '{cand}' to binary 0/1.")

    out = pd.DataFrame({
        "Message ID": df["Message ID"].astype(int),
        "label": mapped.astype(int)
    })
    print(f"Using label column: '{cand}'")
    print(out["label"].value_counts(dropna=False).rename(index={0: "ham(0)", 1: "spam(1)"}))
    return out

labels_df = _get_labels_df(raw_df)

# Merge labels with features
train = train_df.merge(labels_df, on="Message ID", how="inner")
test  = test_df.merge(labels_df,  on="Message ID", how="left")

if train["label"].isna().any():
    raise ValueError("Some training rows have missing labels after merge; check 'Message ID' alignment.")


Using label column: 'Spam/Ham'
label
spam(1)    17171
ham(0)     16545
Name: count, dtype: int64


In [10]:
# Train Multinomial Naive Bayes (manual; to get P(feature|class))
alpha = 1.0  # Laplace smoothing

# Class-separated token counts
token_counts: dict[int, Counter] = {0: Counter(), 1: Counter()}
total_counts: dict[int, int]     = {0: 0, 1: 0}
class_docs: dict[int, int]       = {0: 0, 1: 0}

for _, row in train.iterrows():
    y = int(row["label"])
    feats = row["features_dict"]
    class_docs[y] += 1
    for tok, cnt in feats.items():
        cnt = int(cnt)
        if cnt <= 0: 
            continue
        token_counts[y][tok] += cnt
        total_counts[y] += cnt

# Vocabulary and conditional probs
vocab = set(token_counts[0].keys()) | set(token_counts[1].keys())
V = len(vocab)

ham_prob = {}
spam_prob = {}
denom_ham  = total_counts[0] + alpha * V
denom_spam = total_counts[1] + alpha * V
for tok in vocab:
    ham_prob[tok]  = (token_counts[0][tok] + alpha) / denom_ham
    spam_prob[tok] = (token_counts[1][tok] + alpha) / denom_spam

Save the conditional probabilities in a file "feature-probabilities.tsv" with columns feature, ham_probability and spam_probability.

In [11]:
# YOUR CHANGES HERE

prob_df = pd.DataFrame({
    "feature": list(vocab),
    "ham_probability":  [ham_prob[t]  for t in vocab],
    "spam_probability": [spam_prob[t] for t in vocab],
})
prob_df.sort_values("feature").to_csv("feature-probabilities.tsv", sep="\t", index=False)
print(f"Saved {len(prob_df)} features -> feature-probabilities.tsv")

Saved 142247 features -> feature-probabilities.tsv


Submit "feature-probabilities.tsv" in Gradescope.

## Part 4: Implement a Naïve Bayes Classifier

Implement a naïve Bayes classifier based on your previous feature probabilities.
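
The core computation can be sketched in log space as follows. This is a minimal illustration under assumed toy probabilities, not the required implementation; `posterior_spam` and the `toy_*` dictionaries are hypothetical names.

```python
import math

def posterior_spam(feats: dict, log_ham: dict, log_spam: dict,
                   prior_ham: float, prior_spam: float) -> float:
    """P(spam | features) for multinomial naive Bayes, computed in log space."""
    lh = math.log(prior_ham)
    ls = math.log(prior_spam)
    for tok, cnt in feats.items():
        if tok in log_ham and tok in log_spam:  # skip features unseen in training
            lh += cnt * log_ham[tok]
            ls += cnt * log_spam[tok]
    m = max(lh, ls)  # subtract the max before exponentiating to avoid underflow
    ph, ps = math.exp(lh - m), math.exp(ls - m)
    return ps / (ph + ps)

toy_log_ham = {"free": math.log(0.1), "meeting": math.log(0.9)}
toy_log_spam = {"free": math.log(0.8), "meeting": math.log(0.2)}
p = posterior_spam({"free": 2}, toy_log_ham, toy_log_spam, 0.5, 0.5)
print(round(p, 4))  # 0.9846
```

Summing logs instead of multiplying probabilities matters for long emails, where the product of many small probabilities would underflow to zero.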

In [12]:
# YOUR CHANGES HERE

probs_df = pd.read_csv("feature-probabilities.tsv", sep="\t")
# Part 4 scores the training rows, so load them under the name used by the scoring loop.
train_df = pd.read_csv("train-features.tsv", sep="\t")

def _parse_json(s):
    try:
        d = json.loads(s)
        return {str(k): int(v) for k, v in d.items()}
    except Exception:
        return {}

train_df["features_dict"] = train_df["features_json"].apply(_parse_json)

# Put probabilities into dicts (vocab from Part 3)
ham_prob  = dict(zip(probs_df["feature"], probs_df["ham_probability"]))
spam_prob = dict(zip(probs_df["feature"], probs_df["spam_probability"]))
vocab = set(ham_prob.keys()) | set(spam_prob.keys())

# For numerical stability use logs
log_ham  = {t: math.log(p) for t, p in ham_prob.items()}
log_spam = {t: math.log(p) for t, p in spam_prob.items()}

In [13]:
def _load_original_df():
    try:
        return enron_spam_data.copy()
    except NameError:
        pass
    for cand in ["enron_spam_data.zip", "enron_spam_data.csv", "enron.csv", "data.csv"]:
        try:
            return pd.read_csv(cand)
        except Exception:
            continue
    return None

raw_df = _load_original_df()
if raw_df is None:
    raise ValueError("Could not load the original dataset to compute priors.")

# Normalize Message ID
if "Message ID" not in raw_df.columns:
    for alt in ["MessageID", "message_id", "msg_id", "Id", "id"]:
        if alt in raw_df.columns:
            raw_df["Message ID"] = raw_df[alt]
            break
if "Message ID" not in raw_df.columns:
    raise ValueError("No 'Message ID' column found.")

# Robust label mapping -> {0=ham,1=spam}
def _to01(x):
    if isinstance(x, str):
        xs = x.strip().lower()
        if xs in {"spam","s","1","true","yes","y"}: return 1
        if xs in {"ham","h","0","false","no","n","not_spam","not spam"}: return 0
    if isinstance(x, (bool, np.bool_)): return 1 if x else 0
    try:
        v = int(x); return 1 if v != 0 else 0
    except Exception:
        return np.nan

def _get_labels_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    preferred = [
        "Spam","spam","Label","label","Class","class","Target","target",
        "is_spam","Is Spam","Spam/Ham","SpamHam","Category","category"
    ]
    cand = None
    for c in preferred:
        if c in df.columns:
            cand = c; break
    if cand is None:
        # auto-detect binary-like column
        for c in df.columns:
            if c == "Message ID": continue
            vals = pd.Series(df[c]).dropna().unique()
            if len(vals) == 2:
                cand = c; break
    if cand is None:
        raise ValueError("No label column found for computing priors.")
    mapped = df[cand].apply(_to01)
    return pd.DataFrame({"Message ID": df["Message ID"].astype(int), "label": mapped.astype(int)})

labels_df = _get_labels_df(raw_df)

# Use TRAIN split (Message ID % 30 != 0) for priors
train_mask = labels_df["Message ID"] % 30 != 0
train_labels = labels_df.loc[train_mask].copy()
n_docs = len(train_labels)
n_ham  = (train_labels["label"] == 0).sum()
n_spam = (train_labels["label"] == 1).sum()

# Laplace-smoothed priors
P_ham  = (n_ham + 1) / (n_docs + 2)
P_spam = (n_spam + 1) / (n_docs + 2)
log_prior_ham  = math.log(P_ham)
log_prior_spam = math.log(P_spam)

print(f"Priors (from TRAIN split): P(ham)={P_ham:.4f}, P(spam)={P_spam:.4f}")


Priors (from TRAIN split): P(ham)=0.4907, P(spam)=0.5093


In [15]:
# Score train rows -> posterior P(ham|msg), P(spam|msg)
def _posterior_probs(feats: dict[str, int]) -> tuple[float, float]:
    lh = log_prior_ham
    ls = log_prior_spam
    for tok, cnt in feats.items():
        if cnt <= 0: 
            continue
        if tok in vocab:
            lh += cnt * log_ham[tok]
            ls += cnt * log_spam[tok]
        # OOV tokens are skipped. Their smoothed probabilities would differ slightly
        # between classes (the denominators differ), so skipping is an approximation.
    # Normalize
    m = max(lh, ls)
    ph = math.exp(lh - m)
    ps = math.exp(ls - m)
    z = ph + ps
    return ph / z, ps / z

ham_post, spam_post = [], []
for feats in train_df["features_dict"]:
    ph, ps = _posterior_probs(feats)
    ham_post.append(ph)
    spam_post.append(ps)

Save your prediction probabilities to "train-predictions.tsv" with columns Message ID, ham and spam.

In [19]:
# YOUR CHANGES HERE

out = pd.DataFrame({
    "message id": train_df["Message ID"].astype(int),
    "ham": ham_post,
    "spam": spam_post
})
out.to_csv("train-predictions.tsv", sep="\t", index=False)

print("Saved -> train-predictions.tsv")
print(out.head(10).to_string(index=False))

Saved -> train-predictions.tsv
 message id  ham          spam
          1  1.0 1.715648e-199
          2  1.0  1.012723e-12
          3  1.0 1.223237e-153
          4  1.0 5.580284e-147
          5  1.0  1.607455e-39
          6  1.0  9.107797e-22
          7  1.0 7.255169e-206
          8  1.0  7.284969e-89
          9  1.0 8.985004e-246
         10  1.0  1.408989e-67


Submit "train-predictions.tsv" in Gradescope.

## Part 5: Predict Spam Probability for Test Data

Use your previous classifier to predict spam probability for the test data.

In [25]:
# YOUR CHANGES HERE

probs_df = pd.read_csv("feature-probabilities.tsv", sep="\t")  # from Part 3
test_df  = pd.read_csv("test-features.tsv", sep="\t")          # from Part 2

def _parse_json(s):
    try:
        d = json.loads(s)
        return {str(k): int(v) for k, v in d.items()}
    except Exception:
        return {}

test_df["features_dict"] = test_df["features_json"].apply(_parse_json)

# Conditional probabilities learned in Part 3
ham_prob  = dict(zip(probs_df["feature"], probs_df["ham_probability"]))
spam_prob = dict(zip(probs_df["feature"], probs_df["spam_probability"]))
vocab = set(ham_prob.keys()) | set(spam_prob.keys())

# Precompute logs for stability
log_ham  = {t: math.log(p) for t, p in ham_prob.items()}
log_spam = {t: math.log(p) for t, p in spam_prob.items()}

In [26]:
def _load_original_df():
    # Prefer in-memory if it exists, else try common filenames
    try:
        return enron_spam_data.copy()
    except NameError:
        pass
    for cand in ["enron_spam_data.zip", "enron_spam_data.csv", "enron.csv", "data.csv"]:
        try:
            return pd.read_csv(cand)
        except Exception:
            continue
    return None

raw_df = _load_original_df()
if raw_df is None:
    raise ValueError("Could not load the original dataset to compute class priors.")

# Normalize Message ID
if "Message ID" not in raw_df.columns:
    for alt in ["MessageID", "message_id", "msg_id", "Id", "id"]:
        if alt in raw_df.columns:
            raw_df["Message ID"] = raw_df[alt]
            break
if "Message ID" not in raw_df.columns:
    raise ValueError("No 'Message ID' column found for priors computation.")

# Robust label detection -> {0=ham, 1=spam}
def _to01(x):
    if isinstance(x, str):
        xs = x.strip().lower()
        if xs in {"spam","s","1","true","yes","y"}: return 1
        if xs in {"ham","h","0","false","no","n","not_spam","not spam"}: return 0
    if isinstance(x, (bool, np.bool_)): return 1 if x else 0
    try:
        v = int(x); return 1 if v != 0 else 0
    except Exception:
        return np.nan

def _get_labels_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    preferred = [
        "Spam","spam","Label","label","Class","class","Target","target",
        "is_spam","Is Spam","Spam/Ham","SpamHam","Category","category"
    ]
    cand = next((c for c in preferred if c in df.columns), None)
    if cand is None:
        # auto-detect a binary-like column
        for c in df.columns:
            if c == "Message ID": continue
            vals = pd.Series(df[c]).dropna().unique()
            if len(vals) == 2:
                cand = c; break
    if cand is None:
        raise ValueError("No label column found for computing priors.")
    mapped = df[cand].apply(_to01)
    return pd.DataFrame({"Message ID": df["Message ID"].astype(int), "label": mapped.astype(int)})

labels_df = _get_labels_df(raw_df)

# Use only training split for priors
train_mask   = labels_df["Message ID"] % 30 != 0
train_labels = labels_df.loc[train_mask].copy()
n_docs = len(train_labels)
n_ham  = (train_labels["label"] == 0).sum()
n_spam = (train_labels["label"] == 1).sum()

# Laplace-smoothed priors
P_ham  = (n_ham + 1) / (n_docs + 2)
P_spam = (n_spam + 1) / (n_docs + 2)
log_prior_ham  = math.log(P_ham)
log_prior_spam = math.log(P_spam)

print(f"Priors (from TRAIN split): P(ham)={P_ham:.4f}, P(spam)={P_spam:.4f}")


Priors (from TRAIN split): P(ham)=0.4907, P(spam)=0.5093


In [27]:
# Score TEST rows -> posterior P(ham|msg), P(spam|msg)
def _posterior_probs(feats: dict[str, int]) -> tuple[float, float]:
    lh = log_prior_ham
    ls = log_prior_spam
    for tok, cnt in feats.items():
        if cnt <= 0:
            continue
        if tok in vocab:
            lh += cnt * log_ham[tok]
            ls += cnt * log_spam[tok]
        # OOV tokens are skipped. Their smoothed probabilities would differ slightly
        # between classes (the denominators differ), so skipping is an approximation.
    # Normalize (log-sum-exp for 2 classes)
    m = max(lh, ls)
    ph = math.exp(lh - m)
    ps = math.exp(ls - m)
    z = ph + ps
    return ph / z, ps / z

ham_post, spam_post = [], []
for feats in test_df["features_dict"]:
    ph, ps = _posterior_probs(feats)
    ham_post.append(ph)
    spam_post.append(ps)

Save your prediction probabilities in "test-predictions.tsv" with the same columns as "train-predictions.tsv".

In [29]:
# YOUR CHANGES HERE

out = pd.DataFrame({
    "message id": test_df["Message ID"].astype(int),
    "ham": ham_post,
    "spam": spam_post
})
out.to_csv("test-predictions.tsv", sep="\t", index=False)

print("Saved -> test-predictions.tsv")
print(out.head(10).to_string(index=False))

Saved -> test-predictions.tsv
 message id      ham          spam
          0 0.054342  9.456579e-01
         30 1.000000  5.596193e-81
         60 1.000000  1.749893e-12
         90 1.000000  1.826141e-34
        120 1.000000 1.534749e-179
        150 1.000000  6.981587e-12
        180 0.999993  7.450849e-06
        210 1.000000  1.527063e-42
        240 1.000000  4.656409e-51
        270 1.000000  3.272223e-39


Submit "test-predictions.tsv" in Gradescope.

## Part 6: Construct ROC Curve

For every probability threshold from 0.01 to 0.99 in increments of 0.01, compute the false and true positive rates from the test data using the spam class for positives.
That is, if the predicted spam probability is greater than or equal to the threshold, predict spam.
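
The rule above can be sketched for a single threshold on toy data. The function name `roc_point` and the toy labels and scores are assumptions made for illustration only.

```python
def roc_point(y_true: list, p_spam: list, threshold: float) -> tuple:
    """Return (false_positive_rate, true_positive_rate); spam (1) is the positive class."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_spam):
        predicted_spam = p >= threshold
        if y == 1 and predicted_spam:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_spam:
            fp += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    return fpr, tpr

toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.95, 0.60, 0.20, 0.70, 0.10, 0.05, 0.01]
print(roc_point(toy_labels, toy_scores, 0.5))  # FPR = 1/4, TPR = 2/3

# Integer steps avoid floating-point drift when building the 99 thresholds.
thresholds = [i / 100 for i in range(1, 100)]
```

At a threshold of 0.5, one of the four ham emails is flagged (a false positive) and two of the three spam emails are caught. Raising the threshold can only lower or hold both rates, which is the tradeoff the ROC curve traces.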

In [30]:
# YOUR CHANGES HERE

preds = pd.read_csv("test-predictions.tsv", sep="\t")
# Normalize column name to join later
if "message id" not in preds.columns:
    raise ValueError("test-predictions.tsv must have a 'message id' column.")
preds.rename(columns={"message id": "Message ID"}, inplace=True)

if "spam" not in preds.columns:
    raise ValueError("test-predictions.tsv must have a 'spam' probability column.")

In [31]:
# Load ground-truth labels and restrict to TEST split (Message ID % 30 == 0)
def _load_original_df():
    # Prefer in-memory if available
    try:
        return enron_spam_data.copy()
    except NameError:
        pass
    # Try common filenames
    for cand in ["enron_spam_data.zip", "enron_spam_data.csv", "enron.csv", "data.csv"]:
        try:
            return pd.read_csv(cand)
        except Exception:
            continue
    return None

raw_df = _load_original_df()
if raw_df is None:
    raise ValueError("Could not load the original dataset to obtain test labels.")

# Ensure Message ID exists
if "Message ID" not in raw_df.columns:
    for alt in ["MessageID", "message_id", "msg_id", "Id", "id"]:
        if alt in raw_df.columns:
            raw_df["Message ID"] = raw_df[alt]
            break
if "Message ID" not in raw_df.columns:
    raise ValueError("Could not find a 'Message ID' column in the original dataset.")

def _to01(x):
    if isinstance(x, str):
        xs = x.strip().lower()
        if xs in {"spam","s","1","true","yes","y"}: return 1
        if xs in {"ham","h","0","false","no","n","not_spam","not spam"}: return 0
    if isinstance(x, (bool, np.bool_)): return 1 if x else 0
    try:
        v = int(x); return 1 if v != 0 else 0
    except Exception:
        return np.nan

def _get_labels_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    preferred = [
        "Spam","spam","Label","label","Class","class","Target","target",
        "is_spam","Is Spam","Spam/Ham","SpamHam","Category","category"
    ]
    cand = next((c for c in preferred if c in df.columns), None)
    if cand is None:
        # auto-detect a binary-like column (exactly 2 unique non-null values)
        for c in df.columns:
            if c == "Message ID": continue
            vals = pd.Series(df[c]).dropna().unique()
            if len(vals) == 2:
                cand = c; break
    if cand is None:
        raise ValueError("No label column found in the original dataset for ROC computation.")
    mapped = df[cand].apply(_to01).astype(int)
    return pd.DataFrame({"Message ID": df["Message ID"].astype(int), "label": mapped})

labels = _get_labels_df(raw_df)

# Keep only TEST split labels
test_labels = labels.loc[(labels["Message ID"] % 30) == 0].copy()

In [32]:
# Merge predictions with labels (test set)
df = preds.merge(test_labels, on="Message ID", how="inner")
if df.empty:
    raise ValueError("No overlap between test predictions and test labels on 'Message ID'.")
y_true = df["label"].astype(int).values
p_spam = df["spam"].astype(float).values

In [33]:
# Sweep thresholds and compute FPR/TPR
#   Positive class = spam (1)
#   Predict spam if p_spam >= threshold
rows = []
for i in range(1, 100):  # 0.01 .. 0.99
    thr = i / 100.0
    y_pred = (p_spam >= thr).astype(int)

    # Confusion matrix components
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())

    # Rates with safe division
    tpr = tp / (tp + fn) if (tp + fn) > 0 else np.nan  # True Positive Rate (Recall)
    fpr = fp / (fp + tn) if (fp + tn) > 0 else np.nan  # False Positive Rate

    rows.append((thr, fpr, tpr))

roc = pd.DataFrame(rows, columns=["threshold", "false_positive_rate", "true_positive_rate"])

Save this data in a file "roc.tsv" with columns threshold, false_positive_rate and true_positive_rate.

In [34]:
# YOUR CHANGES HERE

roc.to_csv("roc.tsv", sep="\t", index=False)
print(f"Saved {len(roc)} rows -> roc.tsv")
print(roc.head(10).to_string(index=False))

Saved 99 rows -> roc.tsv
 threshold  false_positive_rate  true_positive_rate
      0.01             0.034420            0.991259
      0.02             0.032609            0.991259
      0.03             0.028986            0.989510
      0.04             0.028986            0.989510
      0.05             0.028986            0.989510
      0.06             0.028986            0.989510
      0.07             0.028986            0.989510
      0.08             0.028986            0.989510
      0.09             0.027174            0.989510
      0.10             0.027174            0.989510


Submit "roc.tsv" in Gradescope.

## Part 7: Sign Up for Gemini API Key

Create a free Gemini API key at https://aistudio.google.com/app/api-keys.
You will need to do this with a personal Google account - it will not work with your BU Google account.
This will not incur any charges unless you configure billing information for the key.

You will be asked to start a Gemini free trial for week 11.
This will not incur any charges unless you exceed expected usage by an order of magnitude.


No submission needed.

## Part 8: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

## Part 9: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.