# DX 704 Week 9 Project

This week's project will build an email spam classifier based on the Enron email data set.
You will perform your own feature extraction and use naive Bayes to estimate the probability that a particular email is spam.
Finally, you will review the tradeoffs from different thresholds for automatically sending emails to the junk folder.

The full project description and a template notebook are available on GitHub: [Project 9 Materials](https://github.com/bu-cds-dx704/dx704-project-09).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download Data Set

We will be using the Enron spam data set as prepared in this GitHub repository.

https://github.com/MWiechmann/enron_spam_data

You may need to download this differently depending on your environment.
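If `wget` is not available in your environment, the download can probably be done from Python instead. The following is a minimal sketch using only the standard library; `download_if_missing` is a hypothetical helper name, and the URL is the same one used in the `wget` cell of this notebook.

```python
import os
import urllib.request

# Same URL as the wget command in this notebook.
DATA_URL = (
    "https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/"
    "master/enron_spam_data.zip"
)


def download_if_missing(url, path):
    """Download url to path unless something already exists at path.

    Hypothetical helper: returns True if a download happened, False otherwise.
    """
    if os.path.exists(path):
        return False  # keep the existing copy, nothing downloaded
    urllib.request.urlretrieve(url, path)
    return True


# Example call (commented out so the sketch does not touch the network):
# download_if_missing(DATA_URL, "enron_spam_data.zip")
```

Skipping the download when the file already exists keeps repeated notebook runs from fetching the 15 MB archive again.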

In [1]:
!wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip

--2025-11-01 23:55:30--  https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip [following]
--2025-11-01 23:55:30--  https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15642124 (15M) [application/zip]
Saving to: ‘enron_spam_data.zip’


2025-11-01 23:55:30 (88.2 MB/s) - ‘enron_spam_data.zip’ saved [15642124/15642124]



In [2]:
import pandas as pd

In [3]:
# pandas can read the zip file directly
enron_spam_data = pd.read_csv("enron_spam_data.zip")
enron_spam_data

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date
0,0,christmas tree farm pictures,,ham,1999-12-10
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14
...,...,...,...,...,...
33711,33711,= ? iso - 8859 - 1 ? q ? good _ news _ c = eda...,"hello , welcome to gigapharm onlinne shop .\np...",spam,2005-07-29
33712,33712,all prescript medicines are on special . to be...,i got it earlier than expected and it was wrap...,spam,2005-07-29
33713,33713,the next generation online pharmacy .,are you ready to rock on ? let the man in you ...,spam,2005-07-30
33714,33714,bloow in 5 - 10 times the time,learn how to last 5 - 10 times longer in\nbed ...,spam,2005-07-30


In [4]:
(enron_spam_data["Spam/Ham"] == "spam").mean()

np.float64(0.5092834262664611)

## Part 2: Design a Feature Extractor

Design a feature extractor for this data set and write out two files of features based on the text.
Don't forget that both the Subject and Message columns are relevant sources of text data.
For each email, you should count the number of repetitions of each feature present.
The auto-grader will assume that you are using a multinomial distribution in the following problems.

In [6]:
# YOUR CHANGES HERE
# Imports
import json
import re
from collections import Counter

# Extract features from email subject and message
# Return a dictionary of feature counts
def extract_features(subject, message):
    features = Counter()
    
    # Combine subject and message, handling NaN values
    text = ""
    if pd.notna(subject):
        text += str(subject).lower() + " "
    if pd.notna(message):
        text += str(message).lower()
    
    # Feature 1: Word tokens - alphanumeric sequences
    words = re.findall(r'\b[a-z0-9]+\b', text)
    for word in words:
        if len(word) >= 2:  # Skip single characters
            features[f"word_{word}"] += 1
    
    # Feature 2: Presence of numbers
    if re.search(r'\d', text):
        features["has_numbers"] += 1
    
    # Feature 3: Count of dollar signs
    features["dollar_signs"] = text.count('$')
    
    # Feature 4: Count of exclamation marks
    features["exclamation_marks"] = text.count('!')
    
    # Feature 5: All caps words (usually seen in spam)
    if pd.notna(subject):
        caps_words = re.findall(r'\b[A-Z]{2,}\b', str(subject))
        features["caps_words"] = len(caps_words)
    
    # Feature 6: Special spam-related keywords
    spam_keywords = ['free', 'click', 'buy', 'now', 'offer', 'price', 
                     'discount', 'save', 'order', 'viagra', 'pharmacy',
                     'pills', 'medication', 'prescription', 'online']
    for keyword in spam_keywords:
        if keyword in text:
            features[f"spam_keyword_{keyword}"] += text.count(keyword)
    
    # Feature 7: Email-like patterns
    features["email_addresses"] = len(re.findall(r'\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b', text))
    
    # Feature 8: URLs
    features["urls"] = len(re.findall(r'http[s]?://|www\.', text))
    
    # Feature 9: Message length (binned)
    text_length = len(text)
    if text_length < 100:
        features["length_very_short"] = 1
    elif text_length < 500:
        features["length_short"] = 1
    elif text_length < 1000:
        features["length_medium"] = 1
    else:
        features["length_long"] = 1
    
    return dict(features)

# Apply feature extraction to all emails
print("Extracting features from emails:")
enron_spam_data['features'] = enron_spam_data.apply(
    lambda row: extract_features(row['Subject'], row['Message']), 
    axis=1
)

print(f"Total emails processed: {len(enron_spam_data)}")
print(f"Sample features from first email: {list(enron_spam_data['features'].iloc[0].keys())[:10]}")


Extracting features from emails:
Total emails processed: 33716
Sample features from first email: ['word_christmas', 'word_tree', 'word_farm', 'word_pictures', 'dollar_signs', 'exclamation_marks', 'caps_words', 'email_addresses', 'urls', 'length_very_short']


Assign a row to the test data set if `Message ID % 30 == 0` and assign it to the training data set otherwise.
Write two files, "train-features.tsv" and "test-features.tsv" with two columns, Message ID and features_json.
The features_json column should contain a JSON dictionary where the keys are your feature names and the values are integer feature values.
This will give us a sparse feature representation.
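As an illustration of what "sparse" means here, the sketch below serializes two toy feature dictionaries to JSON strings and then rebuilds dense rows from them. The feature names and counts are made up for the example; only the `json.dumps` / `json.loads` round trip reflects the file format described above.

```python
import json

# Two toy emails as sparse count dictionaries (made-up features).
# A feature that is absent from a dictionary has an implied count of zero.
toy_features = [
    {"word_free": 2, "word_offer": 1},
    {"word_meeting": 1},
]

# Serialize each dictionary to the string stored in a features_json column.
rows = [json.dumps(features) for features in toy_features]

# Parse the strings back and expand them over a fixed, sorted vocabulary.
vocabulary = sorted({name for features in toy_features for name in features})
dense = [
    [json.loads(row).get(name, 0) for name in vocabulary]
    for row in rows
]
```

Because missing keys are read back as zero, features with a count of zero do not need to be written to the file at all.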


In [7]:
# YOUR CHANGES HERE

# Split into train and test sets
# Test: Message ID % 30 == 0
# Train: otherwise
enron_spam_data['is_test'] = enron_spam_data['Message ID'] % 30 == 0

train_data = enron_spam_data[~enron_spam_data['is_test']].copy()
test_data = enron_spam_data[enron_spam_data['is_test']].copy()

print(f"Training set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")
print(f"Test set proportion: {len(test_data) / len(enron_spam_data):.3f}")

# Create the output DataFrames with Message ID and features_json
train_output = pd.DataFrame({
    'Message ID': train_data['Message ID'],
    'features_json': train_data['features'].apply(json.dumps)
})

test_output = pd.DataFrame({
    'Message ID': test_data['Message ID'],
    'features_json': test_data['features'].apply(json.dumps)
})

# Write to TSV files
train_output.to_csv('train-features.tsv', sep='\t', index=False)
test_output.to_csv('test-features.tsv', sep='\t', index=False)

print("\nFiles written ------")
print(f"train-features.tsv: {len(train_output)} rows")
print(f"test-features.tsv: {len(test_output)} rows")

# Show a sample of the output
print("\nSample from train-features.tsv:")
print(train_output.head(3))

Training set size: 32592
Test set size: 1124
Test set proportion: 0.033

Files written ------
train-features.tsv: 32592 rows
test-features.tsv: 1124 rows

Sample from train-features.tsv:
   Message ID                                      features_json
1           1  {"word_vastar": 6, "word_resources": 4, "word_...
2           2  {"word_calpine": 2, "word_daily": 2, "word_gas...
3           3  {"word_re": 2, "word_issue": 4, "word_fyi": 1,...


Submit "train-features.tsv" and "test-features.tsv" in Gradescope.

Hint: these features will be graded based on the test accuracy of a logistic regression based on the training features.
This is to make sure that your feature set is not degenerate; you do not need to compute this regression yourself.
You can separately assess your feature quality based on your results in Part 6.

## Part 3: Compute Conditional Probabilities

Based on your training data, compute appropriate conditional probabilities for use with naïve Bayes.
Use additive smoothing with $\alpha=1$ to avoid zero probabilities.
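For a multinomial model, additive smoothing is usually written as $P(f \mid c) = (n_{f,c} + \alpha) / (N_c + \alpha |V|)$, where $n_{f,c}$ is the total count of feature $f$ in class $c$, $N_c$ is the total count of all features in class $c$, and $|V|$ is the number of distinct features. The sketch below applies that formula to made-up counts; `smoothed_probabilities` is a hypothetical helper, not part of the assignment template.

```python
from collections import Counter


def smoothed_probabilities(class_counts, vocabulary, alpha=1.0):
    """Return P(feature | class) with additive smoothing over the vocabulary."""
    total = sum(class_counts.get(name, 0) for name in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {
        name: (class_counts.get(name, 0) + alpha) / denominator
        for name in vocabulary
    }


# Toy per-class feature totals (made up for illustration).
spam_counts = Counter({"word_free": 3, "word_offer": 1})
ham_counts = Counter({"word_meeting": 2, "word_offer": 2})
vocabulary = sorted(set(spam_counts) | set(ham_counts))

p_spam = smoothed_probabilities(spam_counts, vocabulary)
p_ham = smoothed_probabilities(ham_counts, vocabulary)
```

With these counts, `word_meeting` never appears in spam, yet its smoothed probability is 1/7 rather than zero, and each class's probabilities still sum to one.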


In [8]:
# YOUR CHANGES HERE

# Imports
import json
from collections import defaultdict

# Load the training data with labels
train_features = pd.read_csv('train-features.tsv', sep='\t')
train_labels = enron_spam_data[~enron_spam_data['is_test']][['Message ID', 'Spam/Ham']].copy()

# Merge features with labels
train_data_with_labels = train_features.merge(train_labels, on='Message ID')

print(f"Training emails: {len(train_data_with_labels)}")
print(f"Spam emails: {(train_data_with_labels['Spam/Ham'] == 'spam').sum()}")
print(f"Ham emails: {(train_data_with_labels['Spam/Ham'] == 'ham').sum()}")

# Initialize counters for each class
spam_feature_counts = defaultdict(int)
ham_feature_counts = defaultdict(int)
spam_total_features = 0
ham_total_features = 0

# Count features for each class
for idx, row in train_data_with_labels.iterrows():
    features = json.loads(row['features_json'])
    is_spam = row['Spam/Ham'] == 'spam'
    
    for feature, count in features.items():
        if is_spam:
            spam_feature_counts[feature] += count
            spam_total_features += count
        else:
            ham_feature_counts[feature] += count
            ham_total_features += count

# Get all unique features
all_features = set(spam_feature_counts.keys()) | set(ham_feature_counts.keys())
vocabulary_size = len(all_features)

print(f"\nTotal unique features (vocabulary size): {vocabulary_size}")
print(f"Total feature counts in spam: {spam_total_features}")
print(f"Total feature counts in ham: {ham_total_features}")

# Compute conditional probabilities with additive smoothing (alpha=1)
alpha = 1.0
feature_probabilities = []

for feature in all_features:
    spam_count = spam_feature_counts[feature]
    ham_count = ham_feature_counts[feature]
    
    # P(feature|spam) with Laplace smoothing
    spam_prob = (spam_count + alpha) / (spam_total_features + alpha * vocabulary_size)
    
    # P(feature|ham) with Laplace smoothing
    ham_prob = (ham_count + alpha) / (ham_total_features + alpha * vocabulary_size)
    
    feature_probabilities.append({
        'feature': feature,
        'ham_probability': ham_prob,
        'spam_probability': spam_prob
    })

# Create a DataFrame
prob_df = pd.DataFrame(feature_probabilities)

print(f"\nFeature probabilities computed: {len(prob_df)}")
print(f"\nSample probabilities:")
print(prob_df.head(10))
print(f"\nSummary statistics:")
print(prob_df[['ham_probability', 'spam_probability']].describe())


Training emails: 32592
Spam emails: 16599
Ham emails: 15993

Total unique features (vocabulary size): 154282
Total feature counts in spam: 3376073
Total feature counts in ham: 4255508

Feature probabilities computed: 154282

Sample probabilities:
            feature  ham_probability  spam_probability
0    word_groomsman     2.267682e-07      8.497729e-07
1     word_railways     6.803045e-07      5.665153e-07
2        word_yyutu     2.267682e-07      5.665153e-07
3      word_penises     2.267682e-07      5.665153e-07
4         word_yawo     2.267682e-07      5.665153e-07
5    word_immingham     2.267682e-07      5.665153e-07
6  word_yelpazesine     2.267682e-07      5.665153e-07
7       word_averse     1.133841e-06      3.682349e-06
8    word_reibstein     4.535363e-07      2.832576e-07
9         word_skci     2.267682e-07      1.416288e-06

Summary statistics:
       ham_probability  spam_probability
count     1.542820e+05      1.542820e+05
mean      6.481638e-06      6.481638e-06
std 

Save the conditional probabilities in a file "feature-probabilities.tsv" with columns feature, ham_probability and spam_probability.

In [9]:
# YOUR CHANGES HERE
# Save to TSV file
prob_df.to_csv('feature-probabilities.tsv', sep='\t', index=False)
# Confirm save
print("File saved: feature-probabilities.tsv")
print(f"Total features saved: {len(prob_df)}")

# Show some interesting features
print("\nFeatures most indicative of spam (high spam_probability):")
print(prob_df.nlargest(10, 'spam_probability')[['feature', 'ham_probability', 'spam_probability']])
print("\nFeatures most indicative of ham (high ham_probability):")
print(prob_df.nlargest(10, 'ham_probability')[['feature', 'ham_probability', 'spam_probability']])

File saved: feature-probabilities.tsv
Total features saved: 154282

Features most indicative of spam (high spam_probability):
          feature  ham_probability  spam_probability
114672   word_the         0.040089          0.029732
69690     word_to         0.028540          0.022827
95418    word_and         0.018372          0.020315
38578     word_of         0.017073          0.019419
118151   word_you         0.007582          0.013025
113202    word_in         0.013355          0.012586
107580  word_this         0.006382          0.009557
153678   word_for         0.010937          0.009488
45510   word_your         0.002967          0.009263
146374    word_is         0.008426          0.009136

Features most indicative of ham (high ham_probability):
           feature  ham_probability  spam_probability
114672    word_the         0.040089          0.029732
69690      word_to         0.028540          0.022827
95418     word_and         0.018372          0.020315
38578      word_of

Submit "feature-probabilities.tsv" in Gradescope.

## Part 4: Implement a Naïve Bayes Classifier

Implement a naïve Bayes classifier based on your previous feature probabilities.
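Under the multinomial assumption, the unnormalized log score for a class is $\log P(c) + \sum_f x_f \log P(f \mid c)$, where $x_f$ is the count of feature $f$ in the email. The sketch below turns two such scores into a spam probability without leaving log space until the last step. The function name, argument layout, and toy probabilities are assumptions made for illustration, and skipping features that are missing from the probability tables is one possible choice, not a requirement.

```python
import math


def posterior_spam(features, log_prior, log_likelihood):
    """Return P(spam | features) for a two-class multinomial naive Bayes model.

    features:       dict mapping feature name -> count in this email
    log_prior:      dict with keys "ham" and "spam" holding log P(class)
    log_likelihood: dict with keys "ham" and "spam", each a dict mapping
                    feature name -> log P(feature | class)
    Features absent from the tables are skipped (an assumption of this sketch).
    """
    scores = {}
    for label in ("ham", "spam"):
        score = log_prior[label]
        for name, count in features.items():
            if name in log_likelihood[label]:
                score += count * log_likelihood[label][name]
        scores[label] = score

    # P(spam | x) = 1 / (1 + exp(score_ham - score_spam)).
    # Branch on the sign so that exp() is only called on non-positive values.
    diff = scores["ham"] - scores["spam"]
    if diff > 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


# Toy model with equal priors and made-up conditional probabilities.
log_prior = {"ham": math.log(0.5), "spam": math.log(0.5)}
log_likelihood = {
    "spam": {"word_free": math.log(4 / 7), "word_meeting": math.log(1 / 7)},
    "ham": {"word_free": math.log(1 / 7), "word_meeting": math.log(6 / 7)},
}

p = posterior_spam({"word_free": 2}, log_prior, log_likelihood)
```

Here the likelihood ratio is $(4/7)^2 / (1/7)^2 = 16$, so `p` works out to 16/17. Very long emails simply drive the result toward 0 or 1 instead of raising an overflow error.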

In [10]:
# YOUR CHANGES HERE

# Import numpy
import numpy as np

# Compute the prior probabilities P(spam) and P(ham)
n_spam = (train_data_with_labels['Spam/Ham'] == 'spam').sum()
n_ham = (train_data_with_labels['Spam/Ham'] == 'ham').sum()
n_total = len(train_data_with_labels)

prior_spam = n_spam / n_total
prior_ham = n_ham / n_total

print(f"Prior probabilities:")
print(f"P(spam) = {prior_spam:.6f}")
print(f"P(ham) = {prior_ham:.6f}")

# Load feature probabilities
feature_probs = pd.read_csv('feature-probabilities.tsv', sep='\t')

# Create dictionaries for fast lookup
spam_probs_dict = dict(zip(feature_probs['feature'], feature_probs['spam_probability']))
ham_probs_dict = dict(zip(feature_probs['feature'], feature_probs['ham_probability']))


# Predict the spam/ham probabilities using naive Bayes
# Returns (ham_probability, spam_probability)
def predict_naive_bayes(features_json):
    features = json.loads(features_json)
    
    # Start with log priors to avoid numerical underflow
    log_prob_spam = np.log(prior_spam)
    log_prob_ham = np.log(prior_ham)
    
    # Multiply conditional probabilities (add logs)
    for feature, count in features.items():
        if feature in spam_probs_dict:
            log_prob_spam += count * np.log(spam_probs_dict[feature])
        if feature in ham_probs_dict:
            log_prob_ham += count * np.log(ham_probs_dict[feature])
    
    # Convert back from log space and normalize
    # Use log-sum-exp trick for numerical stability
    max_log_prob = max(log_prob_spam, log_prob_ham)
    prob_spam = np.exp(log_prob_spam - max_log_prob)
    prob_ham = np.exp(log_prob_ham - max_log_prob)
    
    # Normalize to sum to 1
    total = prob_spam + prob_ham
    prob_spam /= total
    prob_ham /= total
    
    return prob_ham, prob_spam

# Apply classifier to training data
print("\nApply naive Bayes classifier to training data ----")
predictions = []

for idx, row in train_features.iterrows():
    message_id = row['Message ID']
    ham_prob, spam_prob = predict_naive_bayes(row['features_json'])
    
    predictions.append({
        'Message ID': message_id,
        'ham': ham_prob,
        'spam': spam_prob
    })
    
    if idx % 5000 == 0:
        print(f"Processed {idx} emails...")

predictions_df = pd.DataFrame(predictions)

print(f"\nPredictions completed: {len(predictions_df)} emails")
print(f"\nSample predictions:")
print(predictions_df.head(10))
print(f"\nSummary statistics:")
print(predictions_df[['ham', 'spam']].describe())


Prior probabilities:
P(spam) = 0.509297
P(ham) = 0.490703

Apply naive Bayes classifier to training data ----
Processed 0 emails...
Processed 5000 emails...
Processed 10000 emails...
Processed 15000 emails...
Processed 20000 emails...
Processed 25000 emails...
Processed 30000 emails...

Predictions completed: 32592 emails

Sample predictions:
   Message ID  ham           spam
0           1  1.0  3.007139e-183
1           2  1.0   2.285904e-12
2           3  1.0  3.839046e-157
3           4  1.0  1.940944e-151
4           5  1.0   2.952858e-40
5           6  1.0   5.167505e-29
6           7  1.0  5.569444e-209
7           8  1.0   3.788587e-93
8           9  1.0  4.269051e-249
9          10  1.0   2.829233e-69

Summary statistics:
                ham          spam
count  3.259200e+04  3.259200e+04
mean   4.881663e-01  5.118337e-01
std    4.977325e-01  4.977325e-01
min    0.000000e+00  0.000000e+00
25%    7.239133e-38  1.538849e-45
50%    7.537969e-03  9.924620e-01
75%    1.000000e+00  1

Save your prediction probabilities to "train-predictions.tsv" with columns Message ID, ham and spam.

In [11]:
# YOUR CHANGES HERE
predictions_df.to_csv('train-predictions.tsv', sep='\t', index=False)
print("File saved: train-predictions.tsv")
print(f"Total predictions saved: {len(predictions_df)}")

# Evaluate accuracy on training data
train_with_preds = train_data_with_labels.merge(predictions_df, on='Message ID')
train_with_preds['predicted_class'] = train_with_preds.apply(
    lambda row: 'spam' if row['spam'] > row['ham'] else 'ham', 
    axis=1
)

accuracy = (train_with_preds['Spam/Ham'] == train_with_preds['predicted_class']).mean()
print(f"\nTraining accuracy: {accuracy:.4f}")

# Confusion matrix for understanding
print("\nConfusion matrix:")
confusion = pd.crosstab(
    train_with_preds['Spam/Ham'], 
    train_with_preds['predicted_class'],
    rownames=['Actual'],
    colnames=['Predicted']
)
print(confusion)

File saved: train-predictions.tsv
Total predictions saved: 32592

Training accuracy: 0.9920

Confusion matrix:
Predicted    ham   spam
Actual                 
ham        15805    188
spam          73  16526


Submit "train-predictions.tsv" in Gradescope.

## Part 5: Predict Spam Probability for Test Data

Use your previous classifier to predict spam probability for the test data.

In [12]:
# YOUR CHANGES HERE
# Load test features from the file we created
test_features = pd.read_csv('test-features.tsv', sep='\t')

print(f"Test set size: {len(test_features)}")

# Apply classifier to test data
print("\nApply naive Bayes classifier to test data")
test_predictions = []

for idx, row in test_features.iterrows():
    message_id = row['Message ID']
    ham_prob, spam_prob = predict_naive_bayes(row['features_json'])
    
    test_predictions.append({
        'Message ID': message_id,
        'ham': ham_prob,
        'spam': spam_prob
    })
    
    if idx % 200 == 0:
        print(f"Processed {idx} emails...")

test_predictions_df = pd.DataFrame(test_predictions)

print(f"\nPredictions completed: {len(test_predictions_df)} emails")
print(f"\nSample predictions:")
print(test_predictions_df.head(10))
print(f"\nSummary statistics:")
print(test_predictions_df[['ham', 'spam']].describe())

Test set size: 1124

Apply naive Bayes classifier to test data
Processed 0 emails...
Processed 200 emails...
Processed 400 emails...
Processed 600 emails...
Processed 800 emails...
Processed 1000 emails...

Predictions completed: 1124 emails

Sample predictions:
   Message ID       ham           spam
0           0  0.042598   9.574016e-01
1          30  1.000000   2.612224e-85
2          60  1.000000   1.250946e-12
3          90  1.000000   8.311318e-34
4         120  1.000000  3.588515e-189
5         150  1.000000   4.437439e-11
6         180  0.999949   5.139716e-05
7         210  1.000000   1.058019e-39
8         240  1.000000   1.695084e-59
9         270  1.000000   3.724895e-39

Summary statistics:
                ham          spam
count  1.124000e+03  1.124000e+03
mean   4.847127e-01  5.152873e-01
std    4.970497e-01  4.970497e-01
min    0.000000e+00  0.000000e+00
25%    1.413636e-35  2.038597e-42
50%    8.123439e-03  9.918766e-01
75%    1.000000e+00  1.000000e+00
max    1.000000

Save your prediction probabilities in "test-predictions.tsv" with the same columns as "train-predictions.tsv".

In [13]:
# YOUR CHANGES HERE

# Save predictions to TSV
test_predictions_df.to_csv('test-predictions.tsv', sep='\t', index=False)
print("File saved: test-predictions.tsv")
print(f"Total predictions saved: {len(test_predictions_df)}")

# Evaluate accuracy on test data
test_labels = enron_spam_data[enron_spam_data['is_test']][['Message ID', 'Spam/Ham']].copy()
test_with_preds = test_labels.merge(test_predictions_df, on='Message ID')
test_with_preds['predicted_class'] = test_with_preds.apply(
    lambda row: 'spam' if row['spam'] > row['ham'] else 'ham', 
    axis=1
)

test_accuracy = (test_with_preds['Spam/Ham'] == test_with_preds['predicted_class']).mean()
print(f"\nTest accuracy: {test_accuracy:.4f}")

# Confusion matrix
print("\nConfusion matrix:")
test_confusion = pd.crosstab(
    test_with_preds['Spam/Ham'], 
    test_with_preds['predicted_class'],
    rownames=['Actual'],
    colnames=['Predicted']
)
print(test_confusion)

# Additional statistics
print("\nTest set class distribution:")
print(test_with_preds['Spam/Ham'].value_counts())
print("\nPredicted class distribution:")
print(test_with_preds['predicted_class'].value_counts())

File saved: test-predictions.tsv
Total predictions saved: 1124

Test accuracy: 0.9831

Confusion matrix:
Predicted  ham  spam
Actual              
ham        538    14
spam         5   567

Test set class distribution:
Spam/Ham
spam    572
ham     552
Name: count, dtype: int64

Predicted class distribution:
predicted_class
spam    581
ham     543
Name: count, dtype: int64


Submit "test-predictions.tsv" in Gradescope.

## Part 6: Construct ROC Curve

For every probability threshold from 0.01 to 0.99 in increments of 0.01, compute the false and true positive rates from the test data using the spam class for positives.
That is, if the predicted spam probability is greater than or equal to the threshold, predict spam.
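To make the rule concrete, here is a small sketch that computes one (false positive rate, true positive rate) pair per threshold on made-up data. `roc_point` and the toy probabilities are hypothetical. The threshold grid is built from integers, which is one way to sidestep floating-point drift in 0.01 steps.

```python
def roc_point(spam_probabilities, is_spam, threshold):
    """Return (false_positive_rate, true_positive_rate) at one threshold."""
    tp = fp = tn = fn = 0
    for probability, actual in zip(spam_probabilities, is_spam):
        predicted = probability >= threshold  # predict spam at or above threshold
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return fpr, tpr


# Toy predictions: three spam emails and two ham emails (made up).
probs = [0.95, 0.80, 0.40, 0.30, 0.05]
labels = [True, True, False, True, False]

# Thresholds 0.01, 0.02, ..., 0.99 built from integers.
thresholds = [k / 100 for k in range(1, 100)]
curve = [(t, *roc_point(probs, labels, t)) for t in thresholds]
```

At a threshold of 0.5, two of the three spam emails are caught and no ham is flagged, so the point is (0.0, 2/3). Lowering the threshold to 0.01 flags everything and gives (1.0, 1.0).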

In [14]:
# YOUR CHANGES HERE
# We technically already have testing with predictions (test_with_preds) in our previous work
# This gave us Message ID, Spam/Ham (actual), ham, spam, (the predicted probabilities)

# Generate thresholds from 0.01 to 0.99 in 0.01 increments
thresholds = np.arange(0.01, 1.00, 0.01)

roc_data = []

print("Compute the ROC curve ------")
print(f"Number of thresholds: {len(thresholds)}")

for threshold in thresholds:
    # Predict spam if spam probability >= threshold
    test_with_preds['predicted_spam'] = test_with_preds['spam'] >= threshold
    
    # Actual labels (True = spam, False = ham)
    actual_spam = test_with_preds['Spam/Ham'] == 'spam'
    
    # True Positives: actual spam predicted as spam
    tp = ((actual_spam) & (test_with_preds['predicted_spam'])).sum()
    
    # False Positives: actual ham predicted as spam
    fp = ((~actual_spam) & (test_with_preds['predicted_spam'])).sum()
    
    # True Negatives: actual ham predicted as ham
    tn = ((~actual_spam) & (~test_with_preds['predicted_spam'])).sum()
    
    # False Negatives: actual spam predicted as ham
    fn = ((actual_spam) & (~test_with_preds['predicted_spam'])).sum()
    
    # Calculate rates
    # True Positive Rate (TPR) = TP / (TP + FN) = TP / Total Actual Positives
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    # False Positive Rate (FPR) = FP / (FP + TN) = FP / Total Actual Negatives
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    
    roc_data.append({
        'threshold': threshold,
        'false_positive_rate': fpr,
        'true_positive_rate': tpr
    })

roc_df = pd.DataFrame(roc_data)

print(f"\nROC data computed: {len(roc_df)} thresholds")
print(f"\nSample ROC data:")
print(roc_df.head(10))
print(f"\nSummary statistics:")
print(roc_df.describe())

Compute the ROC curve ------
Number of thresholds: 99

ROC data computed: 99 thresholds

Sample ROC data:
   threshold  false_positive_rate  true_positive_rate
0       0.01             0.034420            0.996503
1       0.02             0.032609            0.996503
2       0.03             0.032609            0.996503
3       0.04             0.030797            0.996503
4       0.05             0.030797            0.996503
5       0.06             0.028986            0.996503
6       0.07             0.028986            0.996503
7       0.08             0.028986            0.996503
8       0.09             0.028986            0.996503
9       0.10             0.028986            0.996503

Summary statistics:
       threshold  false_positive_rate  true_positive_rate
count  99.000000            99.000000           99.000000
mean    0.500000             0.024136            0.989281
std     0.287228             0.004115            0.006429
min     0.010000             0.014493          

Save this data in a file "roc.tsv" with columns threshold, false_positive_rate and true_positive_rate.

In [15]:
# YOUR CHANGES HERE
# Save to a TSV file
roc_df.to_csv('roc.tsv', sep='\t', index=False)
print("File saved confirmed to: roc.tsv")
print(f"Total thresholds saved: {len(roc_df)}")

# Show some key thresholds
print("\nROC data at key thresholds:")
key_thresholds = [0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99]
for t in key_thresholds:
    # Compare with a tolerance: the float thresholds produced by np.arange
    # do not always equal the literal exactly (== misses 0.10)
    row = roc_df[np.isclose(roc_df['threshold'], t)]
    if not row.empty:
        print(f"Threshold {t:.2f}: FPR = {row['false_positive_rate'].values[0]:.4f}, TPR = {row['true_positive_rate'].values[0]:.4f}")

# Calculate the area under the computed ROC points as a rough quality measure.
# Note: these points only span the FPR range observed for thresholds 0.01-0.99
# and do not include the (0, 0) and (1, 1) endpoints, so the value printed
# below is a partial area, not the full AUC.
from sklearn.metrics import auc
auc_score = auc(roc_df['false_positive_rate'], roc_df['true_positive_rate'])
print(f"\nArea Under ROC Curve (AUC): {auc_score:.4f}")

File saved confirmed to: roc.tsv
Total thresholds saved: 99

ROC data at key thresholds:
Threshold 0.01: FPR = 0.0344, TPR = 0.9965
Threshold 0.10: FPR = 0.0290, TPR = 0.9965
Threshold 0.25: FPR = 0.0254, TPR = 0.9948
Threshold 0.50: FPR = 0.0254, TPR = 0.9913
Threshold 0.75: FPR = 0.0236, TPR = 0.9843
Threshold 0.90: FPR = 0.0163, TPR = 0.9808
Threshold 0.99: FPR = 0.0145, TPR = 0.9685

Area Under ROC Curve (AUC): 0.0197


Submit "roc.tsv" in Gradescope.
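The area of 0.0197 printed above appears to be small because the 99 thresholds only cover false positive rates between roughly 0.0145 and 0.0344, so the trapezoids span a very narrow slice of the x-axis. One common way to approximate the full area is to add the (0, 0) and (1, 1) endpoints before integrating. The sketch below does this with a plain trapezoid rule on made-up points; `roc_auc_with_endpoints` is a hypothetical helper, and this estimate is for your own analysis rather than for the `roc.tsv` file, which should keep only the 99 requested rows.

```python
def roc_auc_with_endpoints(fprs, tprs):
    """Trapezoid-rule area under ROC points after adding (0, 0) and (1, 1)."""
    # Sort by FPR, then TPR; vertical segments contribute zero area.
    points = sorted(set(zip(fprs, tprs)) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Two toy ROC points (made up): (FPR, TPR) = (0.5, 1.0) and (0.0, 0.5).
toy_fprs = [0.5, 0.0]
toy_tprs = [1.0, 0.5]

full_area = roc_auc_with_endpoints(toy_fprs, toy_tprs)

# Area between the two observed points only, for comparison.
partial_area = (0.5 - 0.0) * (0.5 + 1.0) / 2.0
```

On this toy curve the partial area is 0.375 while the area with endpoints is 0.875, which shows how much the missing segments can matter.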

## Part 7: Sign Up for a Gemini API Key

Create a free Gemini API key at https://aistudio.google.com/app/api-keys.
You will need to do this with a personal Google account - it will not work with your BU Google account.
This will not incur any charges unless you configure billing information for the key.

You will be asked to start a Gemini free trial for week 11.
This will not incur any charges unless you exceed expected usage by an order of magnitude.


No submission needed.

## Part 8: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

## Part 9: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.

In [16]:
# Create the acknowledgements.txt file using code
acknowledgements_text = """Discussed assignment with:
No one

Libraries used:
- pandas: For data manipulation and reading/writing TSV files
- numpy: For numerical computations and log probability calculations
- json: For handling JSON-formatted feature dictionaries
- re: For regular expression-based text processing and feature extraction
- collections: For Counter and defaultdict data structures
- matplotlib: For creating ROC curve and error rate visualizations
- sklearn.metrics: For computing Area Under Curve (AUC)

Additional resources:
I used the lecture materials provided in the course, including the project files and lecture videos on naive Bayes classification, conditional probabilities, and ROC curves.
"""

with open('acknowledgements.txt', 'w') as f:
    f.write(acknowledgements_text)

print("File created: acknowledgements.txt")
print("\nContents:")
print(acknowledgements_text)

File created: acknowledgements.txt

Contents:
Discussed assignment with:
No one

Libraries used:
- pandas: For data manipulation and reading/writing TSV files
- numpy: For numerical computations and log probability calculations
- json: For handling JSON-formatted feature dictionaries
- re: For regular expression-based text processing and feature extraction
- collections: For Counter and defaultdict data structures
- matplotlib: For creating ROC curve and error rate visualizations
- sklearn.metrics: For computing Area Under Curve (AUC)

Additional resources:
I used the lecture materials provided in the course, including the project files and lecture videos on naive Bayes classification, conditional probabilities, and ROC curves.

