# DX 704 Week 12 Project
This week's project will revisit the email spam classifier project from week 9 using large language model embeddings instead of custom features.


The full project description and a template notebook are available on GitHub: [Project 12 Materials](https://github.com/bu-cds-dx704/dx704-project-12).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download Data Set

We will be using the Enron spam data set as prepared in this GitHub repository.

https://github.com/MWiechmann/enron_spam_data

You may need to download this differently depending on your environment.

In [3]:
!wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip

--2025-11-21 00:16:27--  https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip [following]
--2025-11-21 00:16:27--  https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15642124 (15M) [application/zip]
Saving to: ‘enron_spam_data.zip.2’


2025-11-21 00:16:27 (537 MB/s) - ‘enron_spam_data.zip.2’ saved [15642124/15642124]
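Where `!wget` is not available (for example, a plain Python script or some Windows setups), the same download can be sketched with the standard library. The helper name `download_if_missing` is hypothetical and not part of the course materials. It skips the download when the file already exists, which also avoids the numbered duplicates (such as `enron_spam_data.zip.2` in the log above) that repeated `wget` runs create.

```python
import urllib.request
from pathlib import Path

# Same URL as the wget command above.
DATA_URL = (
    "https://github.com/MWiechmann/enron_spam_data/"
    "raw/refs/heads/master/enron_spam_data.zip"
)


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


# Example usage (needs network access):
# download_if_missing(DATA_URL, "enron_spam_data.zip")
```

Because the file keeps its original name, the later `pd.read_csv("enron_spam_data.zip")` call works unchanged.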



In [4]:
import pandas as pd

In [5]:
# pandas can read the zip file directly
enron_spam_data = pd.read_csv("enron_spam_data.zip")
enron_spam_data

| | Message ID | Subject | Message | Spam/Ham | Date |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | christmas tree farm pictures | | ham | 1999-12-10 |
| 1 | 1 | vastar resources , inc . | gary , production from the high island larger ... | ham | 1999-12-13 |
| 2 | 2 | calpine daily gas nomination | - calpine daily gas nomination 1 . doc | ham | 1999-12-14 |
| 3 | 3 | re : issue | fyi - see note below - already done .\nstella\... | ham | 1999-12-14 |
| 4 | 4 | meter 7268 nov allocation | fyi .\n- - - - - - - - - - - - - - - - - - - -... | ham | 1999-12-14 |
| ... | ... | ... | ... | ... | ... |
| 33711 | 33711 | = ? iso - 8859 - 1 ? q ? good _ news _ c = eda... | hello , welcome to gigapharm onlinne shop .\np... | spam | 2005-07-29 |
| 33712 | 33712 | all prescript medicines are on special . to be... | i got it earlier than expected and it was wrap... | spam | 2005-07-29 |
| 33713 | 33713 | the next generation online pharmacy . | are you ready to rock on ? let the man in you ... | spam | 2005-07-30 |
| 33714 | 33714 | bloow in 5 - 10 times the time | learn how to last 5 - 10 times longer in\nbed ... | spam | 2005-07-30 |


In [6]:
(enron_spam_data["Spam/Ham"] == "spam").mean()

np.float64(0.5092834262664611)

## Part 2: Download BERT Model

We will use a pre-trained BERT model to extract embedding vectors as described in lesson 2.1 this week.
Here is sample code to download the model from [Hugging Face](https://huggingface.co/) and extract one vector.
This model is small enough that you can run it with CPU only, but GPUs will be faster if available.

In [7]:
# You may need to install torch and transformers.
# Google Colab will have these installed already.
#
# pip install transformers torch --upgrade

import torch
from transformers import AutoTokenizer, AutoModel

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

To download the pre-trained model from Hugging Face, you will need to sign up for a free account at https://huggingface.co/join.
Afterwards, create an API token and, if you are using Google Colab, save it as a secret named HF_TOKEN.
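As a rough sketch of how the token could be read in code, the hypothetical helper `get_hf_token` below first tries the Colab secrets interface (`google.colab.userdata`) and otherwise falls back to an environment variable with the same name. The fallback is an assumption for non-Colab environments, not something the project description requires.

```python
import os


def get_hf_token():
    """Return the Hugging Face token from Colab secrets or the environment."""
    try:
        # google.colab is only importable inside Google Colab.
        from google.colab import userdata
        return userdata.get("HF_TOKEN")
    except Exception:
        # Outside Colab (or if the secret is missing), fall back to an
        # environment variable of the same name; None if it is not set.
        return os.environ.get("HF_TOKEN")
```

If a token is needed, the returned value could then be supplied when loading the model; check the documentation of your installed `transformers` version for the exact argument name.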

In [26]:
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert_model = AutoModel.from_pretrained(MODEL_NAME)
bert_model.to(device)
bert_model.eval()


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [27]:
@torch.no_grad()
def embed_text(text):
    batch = [text]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
    outputs = bert_model(**inputs)
    # CLS token embedding is the first token's hidden state
    cls_emb = outputs.last_hidden_state[:, 0, :]  # shape: [batch_size, 768]
    return cls_emb.cpu()

In [28]:
v = embed_text("Hi, will you buy my spam?")
v.shape

torch.Size([1, 768])

## Part 3: Create Embedding Vectors

Use BERT to create embeddings for each email in the Enron data set.
You will have to decide how to combine the different columns of the original data set to produce one embedding vector.


Hint: BERT can be run without a GPU, but will be much slower.
Using Google Colab with only a CPU, it runs around 1 embedding per second.
Using Google Colab with the T4 GPU option, it runs around 60 embeddings per second.
Caching is also encouraged to avoid unnecessary reruns.
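One possible way to combine batching and caching is sketched below. The function `cached_embeddings` and its `embed_batch` argument are hypothetical names: `embed_batch` stands for any function that maps a list of strings to an array with one row per string (for instance, a variant of `embed_text` that passes the whole list to the tokenizer). The sketch assumes a non-empty list of texts and does not detect a stale cache, so delete the cache file whenever the inputs change.

```python
import os

import numpy as np


def cached_embeddings(texts, embed_batch, cache_path, batch_size=32):
    """Embed texts in batches, reusing a saved .npy file when present.

    embed_batch is a hypothetical callable: list of str -> array of
    shape (len(batch), dim).
    """
    if os.path.exists(cache_path):
        # Cache hit: skip the expensive model calls entirely.
        return np.load(cache_path)
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_batch(batch)))
    embeddings = np.concatenate(chunks, axis=0)
    # Write through a file handle so the path is used exactly as given.
    with open(cache_path, "wb") as f:
        np.save(f, embeddings)
    return embeddings
```

Padding inside a batch can change the resulting vectors by small floating-point amounts, so results may not be bit-identical to embedding one text at a time.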

In [29]:
# YOUR CHANGES HERE
# Imports
import numpy as np
import json
from tqdm import tqdm

# Combine subject and message into one text field
# Handle NaN values by converting to empty string
def prepare_text(row):
    subject = str(row['Subject']) if pd.notna(row['Subject']) else ''
    message = str(row['Message']) if pd.notna(row['Message']) else ''
    # combine with space separator
    combined = f"{subject} {message}".strip()
    return combined

# create embeddings for all emails
embeddings_list = []

print(f"Processing {len(enron_spam_data)} emails:")
print(f"Using device: {device}")

for idx, row in tqdm(enron_spam_data.iterrows(), total=len(enron_spam_data)):
    text = prepare_text(row)
    # get embedding vector
    emb = embed_text(text)
    # convert to numpy array and flatten to 1D
    emb_array = emb.numpy().flatten()
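    # round to 3 decimal places (intended to reduce file size; the array is
    # still float32 here, so .tolist() below emits long decimals such as
    # -0.4020000100135803 rather than -0.402)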
    emb_array_rounded = np.round(emb_array, 3)
    # convert to list for JSON serialization (use the rounded array)
    emb_list = emb_array_rounded.tolist()

    embeddings_list.append({
        'Message ID': row['Message ID'],
        'embedding_vector_json': json.dumps(emb_list)
    })

# Create dataframe
embeddings_df = pd.DataFrame(embeddings_list)
embeddings_df

Processing 33716 emails:
Using device: cuda


100%|██████████| 33716/33716 [05:42<00:00, 98.50it/s]


| | Message ID | embedding_vector_json |
| --- | --- | --- |
| 0 | 0 | [-0.4020000100135803, 0.29499998688697815, -0.... |
| 1 | 1 | [-0.6060000061988831, -0.05000000074505806, 0.... |
| 2 | 2 | [-0.7509999871253967, -0.09099999815225601, -0... |
| 3 | 3 | [-0.7710000276565552, -0.29600000381469727, -0... |
| 4 | 4 | [-0.7080000042915344, -0.1850000023841858, 0.2... |
| ... | ... | ... |
| 33711 | 33711 | [-0.05000000074505806, -0.10700000077486038, 0... |
| 33712 | 33712 | [0.0989999994635582, -0.02800000086426735, 0.3... |
| 33713 | 33713 | [-0.008999999612569809, -0.15299999713897705, ... |
| 33714 | 33714 | [0.23100000619888306, -0.023000000044703484, 0... |


Save your embeddings in a file "embeddings.tsv.gz" with two columns, Message ID and embedding_vector_json, where embedding_vector_json is a JSON-encoded list.
Make sure that embedding_vector_json is a one-dimensional list, not a two-dimensional one.

Hint: don't forget the ".gz" extension indicating gzip compression.
The Pandas `.to_csv` method will automatically add the compression if you save data with a filename ending in ".gz", so you just need to pass it the right filename.
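A minimal sketch of producing a one-dimensional JSON list with short decimals follows. The helper `to_json_vector` is a hypothetical name, and rounding to three decimals is an assumption carried over from the notebook rather than a stated requirement. The key detail is casting to float64 before rounding: a value rounded while still float32 keeps float32 precision, and its decimal expansion becomes long again once it is converted to a Python float.

```python
import json

import numpy as np


def to_json_vector(vec, decimals=3):
    """Flatten to 1-D, round in float64, and JSON-encode."""
    flat = np.asarray(vec).reshape(-1).astype(np.float64)
    return json.dumps(np.round(flat, decimals).tolist())


# A made-up 2-D float32 array standing in for one model output of shape (1, dim).
example = np.array([[-0.40199, 0.295004]], dtype=np.float32)

short_form = to_json_vector(example)
# For comparison: rounding while the values are still float32.
long_form = json.dumps(np.round(example.reshape(-1), 3).tolist())

print(short_form)
print(long_form)
```

Comparing the two printed strings shows why the choice of dtype at rounding time affects the size of the saved file.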

In [30]:
# YOUR CHANGES HERE

# Save embeddings to TSV file using gzip compression
embeddings_df.to_csv('embeddings.tsv.gz', sep='\t', index=False, compression='gzip')

# verify the file was created and check its size
import os
file_size = os.path.getsize('embeddings.tsv.gz')
print(f"File 'embeddings.tsv.gz' created successfully")
print(f"File size: {file_size / (1024*1024):.2f} MB")

# verify we can read it back
test_read = pd.read_csv('embeddings.tsv.gz', sep='\t', nrows=5)
print(f"\nVerification - first few rows:")
print(test_read)

File 'embeddings.tsv.gz' created successfully
File size: 73.22 MB

Verification - first few rows:
   Message ID                              embedding_vector_json
0           0  [-0.4020000100135803, 0.29499998688697815, -0....
1           1  [-0.6060000061988831, -0.05000000074505806, 0....
2           2  [-0.7509999871253967, -0.09099999815225601, -0...
3           3  [-0.7710000276565552, -0.29600000381469727, -0...
4           4  [-0.7080000042915344, -0.1850000023841858, 0.2...


Submit "embeddings.tsv.gz" in Gradescope.

## Part 4: Train a Linear Regression

Train an ordinary least squares regression for spam/ham status where spam is treated as target value 1 and ham is treated as target value 0 with your embeddings above as the only input variables.


In [31]:
# YOUR CHANGES HERE
# import linear regression
from sklearn.linear_model import LinearRegression

# prepare the data
# load embeddings and parse JSON
embeddings_df['embedding_vector'] = embeddings_df['embedding_vector_json'].apply(json.loads)

# merge with original data to get spam/ham labels
data_with_embeddings = enron_spam_data.merge(embeddings_df, on='Message ID')

# create feature matrix X and target vector y
X = np.array(data_with_embeddings['embedding_vector'].tolist())
y = (data_with_embeddings['Spam/Ham'] == 'spam').astype(int)

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Target distribution: {y.value_counts().to_dict()}")

# train linear regression
model = LinearRegression()
model.fit(X, y)

print(f"\nModel training completed")
print(f"Model coefficients shape: {model.coef_.shape}")
print(f"Model intercept: {model.intercept_}")

Feature matrix shape: (33716, 768)
Target vector shape: (33716,)
Target distribution: {1: 17171, 0: 16545}

Model training completed
Model coefficients shape: (768,)
Model intercept: 1.117572394484729
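An ordinary least squares model outputs a real number rather than a class label, so using it as a classifier needs a cutoff. The sketch below fits least squares with an intercept using NumPy only and thresholds the prediction at 0.5, the midpoint between the ham target (0) and the spam target (1). The threshold, the helper names `fit_ols` and `predict_spam`, and the synthetic data are all assumptions for illustration; none of them come from the assignment.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coef, intercept)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[:-1], beta[-1]


def predict_spam(X, coef, intercept, threshold=0.5):
    """Label a row 1 (spam) when the regression output reaches the threshold."""
    return (X @ coef + intercept >= threshold).astype(int)


# Synthetic stand-in data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(float)

coef, intercept = fit_ols(X_demo, y_demo)
accuracy = (predict_spam(X_demo, coef, intercept) == y_demo).mean()
print(f"training accuracy on synthetic data: {accuracy:.3f}")
```

With the real embeddings, accuracy measured on the training rows would be optimistic; a held-out split gives a fairer estimate.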


Save the coefficients of your linear model in a file "coefficients.tsv" with columns dim and coefficient where dim specifies the offset in the embedding vector (0-767).
Don't worry about the bias term (but your model should still have it).

In [32]:
# YOUR CHANGES HERE
# create dataframe with coefficients
coefficients_df = pd.DataFrame({
    'dim': range(len(model.coef_)),
    'coefficient': model.coef_
})

# Save to TSV file
coefficients_df.to_csv('coefficients.tsv', sep='\t', index=False)

print(f"Coefficients saved to 'coefficients.tsv'")
print(f"Number of coefficients: {len(coefficients_df)}")
print(f"\nFirst few coefficients:")
print(coefficients_df.head(10))

Coefficients saved to 'coefficients.tsv'
Number of coefficients: 768

First few coefficients:
   dim  coefficient
0    0     0.061469
1    1     0.019338
2    2     0.007468
3    3    -0.000475
4    4     0.046114
5    5     0.042197
6    6     0.040991
7    7     0.056726
8    8    -0.014250
9    9     0.044545


Submit "coefficients.tsv" in Gradescope.

## Part 5: Search for Relevant Documents

The file "queries.tsv" specifies 10 queries.
For each query, encode it as a vector and find the message whose embedding is closest in $L_2$ (Euclidean) distance.

In [33]:
# YOUR CHANGES HERE
# Load queries (using slash for colab path)
queries_df = pd.read_csv('/queries.tsv', sep='\t')
print("Queries loaded:")
print(queries_df)

# Encode each query and find closest message
query_results = []

print(f"\nProcessing {len(queries_df)} queries:")

for idx, row in tqdm(queries_df.iterrows(), total=len(queries_df)):
    query_text = row['query']
    query_id = row['query_id']

    # encode query
    query_emb = embed_text(query_text)
    query_array = query_emb.numpy().flatten()
    # round to 3 decimal places for consistency with part 3
    query_array_rounded = np.round(query_array, 3)

    # calculate L2 distances to all messages
    distances = np.linalg.norm(X - query_array_rounded, axis=1)

    # find the closest message
    closest_idx = np.argmin(distances)
    closest_message_id = data_with_embeddings.iloc[closest_idx]['Message ID']

    query_results.append({
        'query_id': query_id,
        'query_vector_json': json.dumps(query_array_rounded.tolist()),
        'Message ID': closest_message_id
    })

    print(f"Query {query_id}: '{query_text[:50]}...' \t-> Message ID {closest_message_id} (distance: {distances[closest_idx]:.4f})")

# create dataframe
query_matches_df = pd.DataFrame(query_results)
query_matches_df

Queries loaded:
   query_id                                              query
0         1                            accounting arrangements
1         2                       sales higher than production
2         3  asked lawyer to write letter about unexpected ...
3         4                                   engineering exam
4         5                                 discounted tickets
5         6                               unexecuted agreement
6         7                                          well bore
7         8                                   capacity problem
8         9                                     london partner
9        10                                    dormant account

Processing 10 queries:


Query 1: 'accounting arrangements...' 	-> Message ID 3273 (distance: 8.2522)
Query 2: 'sales higher than production...' 	-> Message ID 2663 (distance: 7.6640)
Query 3: 'asked lawyer to write letter about unexpected even...' 	-> Message ID 5057 (distance: 8.4663)
Query 4: 'engineering exam...' 	-> Message ID 13222 (distance: 7.8520)
Query 5: 'discounted tickets...' 	-> Message ID 3743 (distance: 7.8421)
Query 6: 'unexecuted agreement...' 	-> Message ID 21810 (distance: 6.8349)
Query 7: 'well bore...' 	-> Message ID 18137 (distance: 3.9656)
Query 8: 'capacity problem...' 	-> Message ID 14635 (distance: 4.3778)
Query 9: 'london partner...' 	-> Message ID 14635 (distance: 4.1715)
Query 10: 'dormant account...' 	-> Message ID 15831 (distance: 7.4941)

| | query_id | query_vector_json | Message ID |
| --- | --- | --- | --- |
| 0 | 1 | [-0.06700000166893005, 0.4860000014305115, -0.... | 3273 |
| 1 | 2 | [-0.8650000095367432, -0.2460000067949295, 0.2... | 2663 |
| 2 | 3 | [-0.2980000078678131, -0.2029999941587448, -0.... | 5057 |
| 3 | 4 | [-0.2150000035762787, 0.19900000095367432, -0.... | 13222 |
| 4 | 5 | [-0.19599999487400055, -0.04899999871850014, -... | 3743 |
| 5 | 6 | [-0.5419999957084656, -0.2750000059604645, -0.... | 21810 |
| 6 | 7 | [-0.12399999797344208, 0.3540000021457672, 0.0... | 18137 |
| 7 | 8 | [-0.257999986410141, 0.0560000017285347, -0.19... | 14635 |
| 8 | 9 | [-0.3199999928474426, 0.15800000727176666, -0.... | 14635 |
| 9 | 10 | [-0.2980000078678131, -0.3059999942779541, -0.... | 15831 |
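The per-query loop above computes one distance vector at a time. As an alternative sketch, all queries can be handled in one step using the identity $\|x - q\|^2 = \|x\|^2 - 2\,x \cdot q + \|q\|^2$. The function name `nearest_by_l2` and the toy arrays are hypothetical; with the notebook's data, `X` would be the message embedding matrix and `Q` a matrix with one query vector per row.

```python
import numpy as np


def nearest_by_l2(X, Q):
    """For each row of Q, return the index of the closest row of X and its L2 distance."""
    X = np.asarray(X, dtype=np.float64)
    Q = np.asarray(Q, dtype=np.float64)
    # Squared distances, shape (num_queries, num_messages).
    sq = (
        (X ** 2).sum(axis=1)[None, :]
        - 2.0 * Q @ X.T
        + (Q ** 2).sum(axis=1)[:, None]
    )
    sq = np.maximum(sq, 0.0)  # guard against tiny negatives from round-off
    idx = sq.argmin(axis=1)
    return idx, np.sqrt(sq[np.arange(len(Q)), idx])


# Toy data: three "messages" and two "queries" in two dimensions.
X_toy = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]])
Q_toy = np.array([[3.0, 3.0], [9.0, 9.0]])
idx, dist = nearest_by_l2(X_toy, Q_toy)
print(idx, dist)
```

Squared distance and distance share the same argmin, so the square root is only taken for the winning entries.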


Save your results in a file "query-matches.tsv" with columns query_id, query_vector_json, and Message ID.

In [34]:
# YOUR CHANGES HERE
# save query matches to TSV
query_matches_df.to_csv('query-matches.tsv', sep='\t', index=False)

print(f"Saved to 'query-matches.tsv'")
print(f"Number of queries processed: {len(query_matches_df)}")
print(f"\nSummary:")
print(query_matches_df[['query_id', 'Message ID']])

Saved to 'query-matches.tsv'
Number of queries processed: 10

Summary:
   query_id  Message ID
0         1        3273
1         2        2663
2         3        5057
3         4       13222
4         5        3743
5         6       21810
6         7       18137
7         8       14635
8         9       14635
9        10       15831


Submit "query-matches.tsv" in Gradescope.

## Part 6: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

## Part 7: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.

In [35]:
# Create acknowledgements file
acknowledgements_text = """Discussed assignment with: No one

Libraries used:
- pandas: For data manipulation and reading/writing CSV/TSV files
- numpy: For numerical operations and array manipulations
- torch: For running BERT model computations
- transformers: For loading and using pre-trained BERT model from Hugging Face
- sklearn.linear_model: For training linear regression model
- json: For encoding embedding vectors as JSON
- tqdm: For progress bars during processing

Additional resources:
I used the lecture materials provided in the course. I referred to Piazza to understand the file size and rounding solutions. I also referenced the Hugging Face documentation for the BERT model implementation and for the API key and how to use it in Google Colab.
"""

with open('acknowledgements.txt', 'w') as f:
    f.write(acknowledgements_text.strip())

print("acknowledgements.txt created successfully")
print("\nContent:")
print(acknowledgements_text.strip())

acknowledgements.txt created successfully

Content:
Discussed assignment with: No one

Libraries used:
- pandas: For data manipulation and reading/writing CSV/TSV files
- numpy: For numerical operations and array manipulations
- torch: For running BERT model computations
- transformers: For loading and using pre-trained BERT model from Hugging Face
- sklearn.linear_model: For training linear regression model
- json: For encoding embedding vectors as JSON
- tqdm: For progress bars during processing

Additional resources:
I used the lecture materials provided in the course. I referred to Piazza to understand the file size and rounding solutions. I also referenced the Hugging Face documentation for the BERT model implementation and for the API key and how to use it in Google Colab.
