# DX 704 Week 12 Project

This week's project will revisit the email spam classifier project from week 9 using large language model embeddings instead of custom features.


The full project description and a template notebook are available on GitHub: [Project 12 Materials](https://github.com/bu-cds-dx704/dx704-project-12).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download Data Set

We will be using the Enron spam data set as prepared in this GitHub repository.

https://github.com/MWiechmann/enron_spam_data

You may need to download this differently depending on your environment.
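
If `wget` is not available in your environment, one option is to download the file from Python instead. The following is a minimal sketch using only the standard library; the helper name `download_if_missing` is hypothetical, and the URL is the same one used by the `wget` cell in this notebook.

```python
import os
import urllib.request

# Same URL as the wget command in this notebook.
DATA_URL = (
    "https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/"
    "enron_spam_data.zip"
)


def download_if_missing(url, path):
    """Download url to path unless the file already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True


# Example call (needs network access):
# download_if_missing(DATA_URL, "enron_spam_data.zip")
```

Skipping the download when the file is already present also avoids numbered duplicates such as `enron_spam_data.zip.4`, which `wget` creates when the cell is rerun.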

In [1]:
!wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip

--2025-11-19 01:19:01--  https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip [following]
--2025-11-19 01:19:01--  https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15642124 (15M) [application/zip]
Saving to: ‘enron_spam_data.zip.4’


2025-11-19 01:19:02 (61.2 MB/s) - ‘enron_spam_data.zip.4’ saved [15642124/15642124]



In [2]:
import pandas as pd

In [3]:
# pandas can read the zip file directly
enron_spam_data = pd.read_csv("enron_spam_data.zip")
enron_spam_data

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date
0,0,christmas tree farm pictures,,ham,1999-12-10
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14
...,...,...,...,...,...
33711,33711,= ? iso - 8859 - 1 ? q ? good _ news _ c = eda...,"hello , welcome to gigapharm onlinne shop .\np...",spam,2005-07-29
33712,33712,all prescript medicines are on special . to be...,i got it earlier than expected and it was wrap...,spam,2005-07-29
33713,33713,the next generation online pharmacy .,are you ready to rock on ? let the man in you ...,spam,2005-07-30
33714,33714,bloow in 5 - 10 times the time,learn how to last 5 - 10 times longer in\nbed ...,spam,2005-07-30


In [4]:
(enron_spam_data["Spam/Ham"] == "spam").mean()

np.float64(0.5092834262664611)

## Part 2: Download BERT Model

We will use a pre-trained BERT model to extract embedding vectors as described in lesson 2.1 this week.
Here is sample code to download the model from [Hugging Face](https://huggingface.co/) and extract one vector.
This model is small enough that you can run it with CPU only, but GPUs will be faster if available.

In [5]:
import torch
from transformers import AutoTokenizer, AutoModel

print("torch:", torch.__version__)
print("transformers:", __import__("transformers").__version__)



torch: 2.9.1+cu128
transformers: 4.57.1


In [None]:
# You may need to install torch and transformers.
# Google Colab will have these installed already.
#
#!pip3 install transformers torch --upgrade

import torch
from transformers import AutoTokenizer, AutoModel



In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

To download the pre-trained model from Hugging Face, you will need to sign up for a free account with them at https://huggingface.co/join.
Afterwards, create an API token. If you are using Google Colab, save it as a secret named `HF_TOKEN`.
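
Outside Colab, a common way to supply the token is an environment variable with the same name. The sketch below is one possible lookup order, not a required part of the assignment; the helper name `get_hf_token` is hypothetical, and the Colab branch assumes the `google.colab.userdata` module, which is only importable inside Colab.

```python
import os


def get_hf_token():
    """Return a Hugging Face token, or None if none is configured."""
    # 1. Environment variable (works locally and on most servers).
    token = os.environ.get("HF_TOKEN")
    if token:
        return token
    # 2. Google Colab secret with the same name.
    try:
        from google.colab import userdata  # only importable inside Colab

        return userdata.get("HF_TOKEN")
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook.
        return None
```

Recent versions of `transformers` accept the result through the `token` argument, for example `AutoModel.from_pretrained(MODEL_NAME, token=get_hf_token())`.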

In [8]:
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert_model = AutoModel.from_pretrained(MODEL_NAME)
bert_model.to(device)
bert_model.eval()


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [9]:
@torch.no_grad()
def embed_text(text):
    batch = [text]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
    outputs = bert_model(**inputs)
    # CLS token embedding is the first token's hidden state
    cls_emb = outputs.last_hidden_state[:, 0, :]  # shape: [batch_size, 768]
    return cls_emb.cpu()

In [10]:
v = embed_text("Hi, will you buy my spam?")
v.shape

torch.Size([1, 768])

## Part 3: Create Embedding Vectors

Use BERT to create embeddings for each email in the Enron data set.
You will have to decide how to combine the different columns of the original data set to produce one embedding vector.
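
One simple choice is to join the subject and the message body into a single string. The sketch below is only one reasonable option, and the helper name `combine_fields` is hypothetical. It skips missing values, because some rows in this data set have an empty `Message` (pandas loads these as NaN), and `str()` would otherwise turn them into the literal text "nan".

```python
def combine_fields(subject, message):
    """Join subject and message into one string, skipping missing values."""
    parts = []
    for label, value in (("Subject", subject), ("Message", message)):
        # pandas represents missing text as float NaN; NaN != NaN is True.
        if value is None or value != value:
            continue
        text = str(value).strip()
        if text:
            parts.append(f"{label}: {text}")
    return "\n\n".join(parts)


print(combine_fields("christmas tree farm pictures", float("nan")))
# Subject: christmas tree farm pictures
```

With the data frame from Part 1, this could be applied row by row, for example `enron_spam_data.apply(lambda r: combine_fields(r["Subject"], r["Message"]), axis=1)`.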


Hint: BERT can be run without a GPU, but will be much slower.
Using Google Colab with only a CPU, it runs around 1 embedding per second.
Using Google Colab with the T4 GPU option, it runs around 60 embeddings per second.
Caching is also encouraged to avoid unnecessary reruns.
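
A caching helper can be as small as the sketch below. It uses only the standard library, and the name `load_or_compute` and the cache file name in the usage note are hypothetical.

```python
import os
import pickle


def load_or_compute(cache_path, compute_fn):
    """Return the cached result at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

For example, wrapping the slow step as `load_or_compute("email_embeddings.pkl", lambda: compute_all_embeddings())`, where `compute_all_embeddings` stands for whatever function you write, means a rerun of the notebook reads the file instead of running BERT again. Delete the cache file whenever you change the model or the way the text columns are combined, since the helper cannot detect stale results.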

In [11]:
import zipfile

with zipfile.ZipFile("enron_spam_data.zip", "r") as z:
    z.extractall("enron_data")

In [14]:
emails = pd.read_csv("enron_data/enron_spam_data.csv")  

In [16]:
emails.head()

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date,combined_text
0,0,christmas tree farm pictures,,ham,1999-12-10,
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13,
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14,
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14,
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14,


In [None]:
# YOUR CHANGES HERE

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert_model = AutoModel.from_pretrained(MODEL_NAME)
bert_model.to(device)
bert_model.eval()

from tqdm import tqdm  

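# Combine columns into one string per email.
# The names must match the data set's columns exactly ("Subject", "Message");
# with non-matching names every combined string is empty and all embeddings are identical.
TEXT_COLUMNS = ["Subject", "Message"]

def combine_text(row, text_cols=TEXT_COLUMNS):
    parts = []
    # Skip missing values so they do not become the literal string "nan"
    if "Subject" in text_cols and pd.notna(row.get("Subject")):
        parts.append(f"Subject: {row['Subject']}")
    # Add the message body
    if "Message" in text_cols and pd.notna(row.get("Message")):
        parts.append(str(row["Message"]))
    return "\n\n".join(parts).strip()

emails["combined_text"] = emails.apply(combine_text, axis=1)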

# Function to get BERT embeddings
def bert_encode(texts, tokenizer, model, device, batch_size=16, max_length=128):
    all_embs = []
    model.eval()
    with torch.no_grad():
        for i in tqdm(range(0, len(texts), batch_size)):
            batch_texts = texts[i : i + batch_size]

            enc = tokenizer(
                batch_texts,
                padding=True,
                truncation=True,
                max_length=max_length,
                return_tensors="pt",
            ).to(device)

            outputs = model(**enc) 

            # Mean-pool over tokens
            attn_mask = enc["attention_mask"].unsqueeze(-1)          
            masked_embeddings = outputs.last_hidden_state * attn_mask
            sum_embeddings = masked_embeddings.sum(dim=1)            
            lengths = attn_mask.sum(dim=1)                           
            mean_embeddings = sum_embeddings / lengths               

            all_embs.append(mean_embeddings.cpu())

    return torch.cat(all_embs, dim=0) 

# Run it on all emails
texts = emails["combined_text"].tolist()
email_embeddings = bert_encode(texts, tokenizer, bert_model, device)

print(email_embeddings.shape) 

100%|██████████| 2108/2108 [03:52<00:00,  9.06it/s]


torch.Size([33716, 768])


Save your embeddings in a file "embeddings.tsv.gz" with two columns, Message ID and embedding_vector_json, where embedding_vector_json is a JSON-encoded list.
Make sure that embedding_vector_json is a one-dimensional list, not a two-dimensional one.

Hint: don't forget the ".gz" extension indicating gzip compression.
The Pandas `.to_csv` method will automatically add the compression if you save data with a filename ending in ".gz", so you just need to pass it the right filename.
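
Before submitting, it can help to read the file back and confirm its shape. The following is a minimal sketch using only the standard library; the function name `check_embeddings_file` is hypothetical, and the default of 768 dimensions assumes the `bert-base-uncased` model used in this notebook.

```python
import csv
import gzip
import json


def check_embeddings_file(path, expected_dim=768):
    """Count rows in path, raising ValueError if a vector is not a flat list."""
    n_rows = 0
    with gzip.open(path, "rt", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            vec = json.loads(row["embedding_vector_json"])
            # A 2-D vector such as [[0.1, 0.2]] fails this check because
            # its only element is a list, not a number.
            flat = isinstance(vec, list) and all(
                isinstance(x, (int, float)) for x in vec
            )
            if not flat or len(vec) != expected_dim:
                raise ValueError(f"bad vector for Message ID {row['Message ID']}")
            n_rows += 1
    return n_rows
```

Calling `check_embeddings_file("embeddings.tsv.gz")` should return the number of emails in the data set if every row holds a flat list of the expected length.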

In [17]:
# YOUR CHANGES HERE

import json
import numpy as np
import pandas as pd


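emb_np = email_embeddings.cpu().numpy()      # (N, 768), float32
# Cast to float64 before rounding; rounded float32 values otherwise serialize
# as long decimals such as -0.008999999612569809 instead of -0.009.
emb_np = np.round(emb_np.astype(np.float64), 3)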

# JSON-encode each 1D vector
embedding_json = [json.dumps(vec.tolist()) for vec in emb_np]

# Build output 
out_df = pd.DataFrame({
    "Message ID": emails['Message ID'],          
    "embedding_vector_json": embedding_json
})

# Save as compressed TSV
out_df.to_csv(
    "embeddings.tsv.gz",
    sep="\t",
    index=False,
    compression="gzip"
)

out_df.head()

Unnamed: 0,Message ID,embedding_vector_json
0,0,"[-0.008999999612569809, -0.22200000286102295, ..."
1,1,"[-0.008999999612569809, -0.22200000286102295, ..."
2,2,"[-0.008999999612569809, -0.22200000286102295, ..."
3,3,"[-0.008999999612569809, -0.22200000286102295, ..."
4,4,"[-0.008999999612569809, -0.22200000286102295, ..."


Submit "embeddings.tsv.gz" in Gradescope.

## Part 4: Train a Linear Regression

Train an ordinary least squares regression to predict spam/ham status, using your embeddings above as the only input variables. Treat spam as target value 1 and ham as target value 0.
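
Written out, with notation that is assumed here rather than taken from the assignment ($x_i \in \mathbb{R}^{768}$ is the embedding of email $i$, $y_i \in \{0, 1\}$ is its label, and $N$ is the number of emails), ordinary least squares chooses the weights $w$ and bias $b$ that minimize the squared error:

$$
(\hat{w}, \hat{b}) = \arg\min_{w,\, b} \sum_{i=1}^{N} \left( y_i - w^\top x_i - b \right)^2
$$

The fitted value $\hat{w}^\top x + \hat{b}$ is not constrained to lie between 0 and 1. A common convention, which the assignment does not specify, is to label an email as spam when this value exceeds 0.5.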


In [18]:
# YOUR CHANGES HERE

print(emails.columns)

Index(['Message ID', 'Subject', 'Message', 'Spam/Ham', 'Date',
       'combined_text'],
      dtype='object')


In [19]:
emails["target"] = emails["Spam/Ham"].map({"ham": 0, "spam": 1})

In [None]:
X = email_embeddings.numpy() 
y = emails["target"].values  

In [21]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X, y)

Parameter,Value
fit_intercept,True
copy_X,True
tol,1e-06
n_jobs,None
positive,False


Save the coefficients of your linear model in a file "coefficients.tsv" with columns dim and coefficient where dim specifies the offset in the embedding vector (0-767).
Don't worry about the bias term (but your model should still have it).

In [22]:
# YOUR CHANGES HERE

coefs = lr.coef_          # shape (768,)
dims = np.arange(len(coefs))

coef_df = pd.DataFrame({
    "dim": dims,
    "coefficient": coefs
})

coef_df.to_csv("coefficients.tsv", sep="\t", index=False)
coef_df.head()

Unnamed: 0,dim,coefficient
0,0,4.325119e-09
1,1,-2.124107e-06
2,2,-2.371363e-06
3,3,-5.961467e-08
4,4,2.477158e-07


Submit "coefficients.tsv" in Gradescope.

## Part 5: Search for Relevant Documents

The file "queries.tsv" specifies 10 queries.
For each query, encode it as a vector with the same embedding method you used for the emails, and find the message whose embedding is closest in $L_2$ (Euclidean) distance.
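
The search itself is a nearest-neighbor lookup. The sketch below shows the idea on plain Python lists with toy two-dimensional vectors; the function name `nearest_by_l2` is hypothetical, and for the full data set a vectorized routine will be much faster than this loop.

```python
import math


def nearest_by_l2(query_vec, doc_vecs):
    """Return (index, distance) of the vector in doc_vecs closest to query_vec."""
    best_index, best_dist = -1, math.inf
    for i, doc_vec in enumerate(doc_vecs):
        dist = math.dist(query_vec, doc_vec)  # Euclidean (L2) distance
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist


docs = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
index, distance = nearest_by_l2([0.9, 1.2], docs)
print(index)  # 2, because [1.0, 1.0] is the closest vector
```

The returned index is a row position, so it still has to be mapped back to the corresponding Message ID.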

In [23]:
# YOUR CHANGES HERE

queries = pd.read_csv("queries.tsv", sep="\t")
print(queries.head())
print(queries.columns)

   query_id                                              query
0         1                            accounting arrangements
1         2                       sales higher than production
2         3  asked lawyer to write letter about unexpected ...
3         4                                   engineering exam
4         5                                 discounted tickets
Index(['query_id', 'query'], dtype='object')


In [None]:
query_texts = queries["query"].tolist() 
query_embeddings = bert_encode(query_texts, tokenizer, bert_model, device)

print(query_embeddings.shape)  

100%|██████████| 1/1 [00:00<00:00,  4.12it/s]

torch.Size([10, 768])





In [None]:
# Move both sets of embeddings to the CPU so cdist and .numpy() work
email_embs = email_embeddings.cpu()
query_embs = query_embeddings.cpu()

# Pairwise L2 distances between every query and every message: shape (Q, N)
distances = torch.cdist(query_embs, email_embs, p=2)

# For each query, find the index of the nearest message
nearest_indices = torch.argmin(distances, dim=1)
nearest_indices = nearest_indices.numpy()

In [33]:
nearest_message_ids = emails["Message ID"].iloc[nearest_indices].values
nearest_dists = distances[torch.arange(len(query_embs)), nearest_indices].numpy()

query_ids = queries["query_id"].values

results = pd.DataFrame({
    "query_id": query_ids,
    "query_text": queries["query"],   # adjust column name if needed
    "nearest_message_id": nearest_message_ids,
    "nearest_distance_L2": nearest_dists,
})

results.head()

Unnamed: 0,query_id,query_text,nearest_message_id,nearest_distance_L2
0,1,accounting arrangements,0,9.941863
1,2,sales higher than production,0,11.525019
2,3,asked lawyer to write letter about unexpected ...,0,12.666759
3,4,engineering exam,0,10.44375
4,5,discounted tickets,0,11.428681


Save your results in a file "query-matches.tsv" with columns query_id, query_vector_json, and Message ID.

In [34]:
# YOUR CHANGES HERE

query_vectors_np = query_embs.numpy()          # (Q, 768), float32
# Cast to float64 before rounding so the JSON shows 3 decimal places
query_vectors_np = np.round(query_vectors_np.astype(np.float64), 3)

query_vector_json = [
    json.dumps(vec.tolist())                   # 1D JSON list of floats
    for vec in query_vectors_np
]

out_df = pd.DataFrame({
    "query_id": query_ids,
    "query_vector_json": query_vector_json,
    "Message ID": nearest_message_ids
})

out_df.to_csv("query-matches.tsv", sep="\t", index=False)

out_df.head()

Unnamed: 0,query_id,query_vector_json,Message ID
0,1,"[0.23600000143051147, 0.19599999487400055, -0....",0
1,2,"[-0.6859999895095825, -0.5, 0.2319999933242797...",0
2,3,"[-0.19499999284744263, -0.18799999356269836, 0...",0
3,4,"[0.2290000021457672, -0.13500000536441803, -0....",0
4,5,"[0.2619999945163727, -0.35899999737739563, -0....",0


Submit "query-matches.tsv" in Gradescope.

## Part 6: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

## Part 7: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.