# ADS-509 Assignment 3.1
## Word Embeddings

In this assignment, you will use the HackerNews dataset created in the Module 1 assignment to:
- Build static embeddings by training Word2Vec
- Build contextual embeddings with a sentence-transformer (BERT-family)
- Perform EDA on embeddings  
- Train a simple classifier (logistic regression) on each embedding type to predict a lightweight label derived from story titles

If you are not confident in the quality of your own dataset from Module 1, there is a clean dataset available for your use in Canvas.

## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it.

Work through this notebook as if it were a worksheet, completing the code sections marked with **TODO** in the cells provided. Similarly, written questions will be marked by a "Q:" and will have a corresponding "A:" spot for you to fill in with your answers. **Make sure to answer every question marked with a Q: for full credit**.

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential import statements and make sure that all such statements are moved into the designated cell.

A .pdf of this notebook, with your completed code and written answers, is what you should submit in Canvas for full credit. **DO NOT SUBMIT A NEW NOTEBOOK FILE OR A RAW .PY FILE**. Submitting in a different format makes it difficult to grade your work, and students who have done this in the past inevitably miss some of the required work or written questions.

## Imports and Downloads

Once again we will use some datasets from the NLTK library, so we need to make sure that these are downloaded (you should have them from Module 2, but it never hurts to double check).

We will also train a Word2Vec model with Gensim and load a pre-trained SentenceTransformer model from the sentence-transformers library.

In [None]:
import os, re, string, random, math, warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ast

from tqdm import tqdm

# Text preprocessing
import nltk
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Embeddings
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer

# ML
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 120


## Load Data

Next we will load our dataset from Module 2 and double check that it is formatted correctly.

If you are uncertain about your own dataset, or if you don't pass the check below, feel free to use the dataset provided on Canvas.

In [None]:
DATA_PATH = 'data/module2/hn_comment_features.csv'  # TODO: Update the file path as needed

assert os.path.exists(DATA_PATH), f"Dataset not found at {DATA_PATH}. Update the path for your environment."

In [None]:
df = pd.read_csv(DATA_PATH)
print('Rows:', len(df))
expected_cols = {'story_id','title','comment_id','user','comment_text','text_norm','tokens_clean','sent_compound'}
missing = expected_cols - set(df.columns)
if missing:
    print('Warning: missing expected columns:', missing)

df["tokens_clean"] = df["tokens_clean"].apply(ast.literal_eval) # convert python list from a string
df.head()
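
The `ast.literal_eval` step above is needed because a list column written to CSV is read back as its string representation. A minimal sketch of that round trip follows; the cell value and the `parse_tokens` helper are hypothetical and only illustrate one way to guard against non-string (for example, missing) cells.

```python
import ast

# A list column saved to CSV comes back as a string, not a list.
raw_cell = "['rust', 'compiler', 'fast']"  # hypothetical cell value

def parse_tokens(cell):
    """Hypothetical helper: parse a stringified list, treating non-strings as empty."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)  # parses Python literals only, no arbitrary code
    return []

tokens = parse_tokens(raw_cell)
print(type(raw_cell).__name__, '->', type(tokens).__name__)
print(tokens)
print(parse_tokens(float('nan')))  # a missing cell read by pandas is a float NaN
```

Whether an empty list is the right fallback for missing cells depends on your dataset; dropping those rows is an equally reasonable choice.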

## Create a label for classification

The dataset that we scraped from HackerNews doesn't have an obvious variable for use in a classification task (which we will need later). There are many ways that we could create such a label, but here we will use string matching to create some rough labels based on the words used in the article titles.

**Q**: Give at least two other ideas for how we could label our dataset for classification. For each idea, briefly discuss why it would or wouldn't be a good choice, keeping data balance in mind (i.e. the resulting dataset should not be completely dominated by one label).

**A**: 

In [None]:
# Build a simple label from the story title
def label_from_title(title):
    if not isinstance(title, str):
        return None
    t = title.strip().lower()
    if any(kw in t for kw in ['rust', 'python', 'sql', 'linux', 'windows', 'ios', 'c++', 'perl', 'php', 'java']):
        return 'programming'
    if any(kw in t for kw in ['google', 'facebook', 'meta', 'apple', 'amazon', 'microsoft']):
        return 'big-tech'
    if any(kw in t for kw in ['ai ', 'torch', 'llm', 'large language model', 'claude', 'gemini', 'copilot']):
        return 'ai'
    return 'other'

df['label'] = df['title'].apply(label_from_title)

print('Total rows for task:', len(df))
print(df['label'].value_counts())

df[['title','comment_text','label']].head(3)
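
Because the labels above come from plain substring matching, short keywords can fire inside unrelated words, which is one reason the labels are only rough. The sketch below compares substring matching with a whole-word variant on a few hypothetical titles; `substring_hits` and `whole_word_hits` are illustrative helpers, not part of the assignment code.

```python
import re

keywords = ['rust', 'java', 'ios', 'c++']  # a subset of the keyword lists above
titles = [  # hypothetical titles, for illustration only
    'Why trust matters in open source',
    'A curiosity-driven robot',
    'Modern C++ tips',
    'Rust 2024 edition released',
]

def substring_hits(title, kws):
    """Keywords found anywhere in the lowercased title (the approach used above)."""
    t = title.lower()
    return [kw for kw in kws if kw in t]

def whole_word_hits(title, kws):
    """Keywords that are not embedded in a longer word."""
    t = title.lower()
    # Lookarounds are used instead of \b so that 'c++', which ends in a
    # non-word character, can still match.
    return [kw for kw in kws
            if re.search(r'(?<!\w)' + re.escape(kw) + r'(?!\w)', t)]

for title in titles:
    print(f'{title!r}: substring={substring_hits(title, keywords)} '
          f'whole-word={whole_word_hits(title, keywords)}')
```

Here 'trust' and 'curiosity' trigger the substring matcher ('rust', 'ios') but not the whole-word one, while 'C++' and 'Rust' match under both.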

## Train Static Word Embedding Model

We will train a Word2Vec model to produce static word embeddings for our normalized text, which we will use later for data exploration and classification.

**TODO**:

Use the gensim Word2Vec class to train a static embedding model on our tokenized comment text. Check out the [documentation](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.html#gensim-models-word2vec) to assign the following settings (Hint: You might need to go into the class source code to find argument descriptions):

- Embedding size 100
- Sequence window size 5
- Limit to tokens that appear at least 3 times
- Skip-gram algorithm (rather than CBOW)
- Run training for 10 epochs
- If you have multiple CPUs available, set the number of workers to improve training speed
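
To make the window-size and minimum-count settings concrete, here is a conceptual sketch on a hypothetical toy corpus. It is not how Gensim implements training (the library may differ in details such as how the effective window is sampled), and the helper names `filter_rare` and `context_windows` are made up for illustration.

```python
from collections import Counter

corpus = [  # hypothetical tokenized comments
    ['rust', 'is', 'fast'],
    ['rust', 'is', 'safe'],
    ['python', 'is', 'slow'],
]

def filter_rare(sentences, min_count):
    """Drop tokens that occur fewer than min_count times in the whole corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count] for sent in sentences]

def context_windows(sentence, window):
    """Yield (center, context_tokens) for every position in the sentence."""
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield center, left + right

kept = filter_rare(corpus, min_count=2)
print(kept)

for center, context in context_windows(['rust', 'is', 'very', 'fast'], window=1):
    print(center, context)
```

With a minimum count of 2, only 'rust' and 'is' survive in this toy corpus, which shows how a high threshold can shrink the vocabulary of a small dataset.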

**Q**: Why do we call a model like Word2Vec a *static* word embedding model?

**A**: 

**Q**: What is the difference between the CBOW and skip-gram algorithms?

**A**: 

In [None]:
# TODO: train a Word2Vec model on our normalized text
sentences = ??

w2v = ??

w2v_vecs = w2v.wv
len(w2v_vecs), w2v_vecs.vector_size

The word embeddings that we just created can now be used to perform a semantic comparison of individual words/tokens in our vocabulary.

**TODO**:
- Choose a few words from our corpus vocabulary and print the 5 nearest neighbors using the `most_similar()` method of the trained word vectors (i.e. `w2v_vecs.most_similar()`)

**Q**: Are the nearest neighbors semantically similar to your chosen words? Think about how the Word2Vec model works and provide a 2-3 sentence description of why this comparison does/doesn't work for our dataset.

**A**: 


In [None]:
# TODO: choose a few words from our vocabulary and examine their 5 nearest neighbors

probe_terms = [??]
for term in probe_terms:
    ??

## Build Document Embeddings 

One way to represent a document (comment) in our dataset is to aggregate all of the individual word embeddings by computing an average embedding or document vector. This vector can then be used to compare entire documents instead of individual words.

**TODO**:
- Average the word/token embeddings for each comment in our dataset to produce a single document embedding vector
- Compute the cosine similarity between the first comment's document embedding and all of the rest
- Print out the two most and least similar comments
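
As a toy illustration of the idea (not the assignment solution), the sketch below averages hypothetical 3-dimensional word vectors into document vectors and compares them with cosine similarity. The vectors, the `toy_cosine` helper, and the choice to return 0.0 for an all-zero vector are assumptions of this sketch; the zero-vector case matters because a comment with no in-vocabulary tokens gets an all-zero embedding.

```python
import numpy as np

# Hypothetical word vectors; real ones come from the trained Word2Vec model.
toy_wv = {
    'rust':   np.array([1.0, 0.0, 1.0]),
    'fast':   np.array([0.0, 1.0, 1.0]),
    'python': np.array([1.0, 1.0, 0.0]),
}

# Document vector = element-wise average of the word vectors in the document.
doc_a = (toy_wv['rust'] + toy_wv['fast']) / 2
doc_b = (toy_wv['rust'] + toy_wv['python']) / 2
doc_empty = np.zeros(3)  # what a comment with no known tokens would look like

def toy_cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

print(round(toy_cosine(doc_a, doc_b), 4))
print(toy_cosine(doc_a, doc_empty))
```

When you interpret the least similar comments in your own results, check whether they are simply comments whose tokens were all filtered out of the vocabulary.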

**Q**: Describe your results. Does the cosine similarity of document/comment embeddings do a good job of identifying similar content?

**A**: 

In [None]:
# TODO: average the word/token embeddings for each comment

def docvec_average(tokens, wv):
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        return np.zeros(wv.vector_size, dtype=np.float32)
    return ??

df["w2v_embedding"] = list(np.vstack([docvec_average(toks, w2v_vecs) for toks in df['tokens_clean']]))


In [None]:
# TODO: compute the cosine similarity between the first comment and the rest of the dataset

def docvec_similarity(target, docvecs):
    sims = ??
    return sims

df["w2v_sim"] = [None] + docvec_similarity(df["w2v_embedding"][0], df["w2v_embedding"][1:]) # None marks self-similarity; docvec_similarity must return a Python list for this concatenation to work

In [None]:
# TODO: Print out the target comment and the two comments with the highest and lowest cosine similarities

print("Target Comment:")
print(df["comment_text"][0])

print("\nLowest Similarities:")
print(??)

print("\nHighest similarities:")
print(??)


## Build Contextual Embeddings with Sentence-Transformers

Another way to create a document-level embedding is to use a more sophisticated, pre-trained embedding model, like a Sentence Transformer (based on the BERT family architecture). These models use an *attention block* to create context-specific embeddings for a string of text, rather than simply averaging the static word embeddings as we did above.

**TODO**:
- Use the SentenceTransformer class (from the sentence-transformers library) to load the pre-trained 'sentence-transformers/all-MiniLM-L6-v2' model
- Use this model to create embeddings for our comment text (HINT: You don't need to normalize or tokenize text before feeding it into a transformer model)


In [None]:
# TODO: Use the pre-trained sentence-transformers model to create embeddings for our comment text
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
st_model = SentenceTransformer(model_name)

df["bert_embedding"]= ??

Now we will compute the same cosine similarities as we did with the Word2Vec embeddings above.

**Q**: How do the transformer-based embedding similarities compare to what you found above? Which method would you choose and why?

**A**: 

In [None]:
df["bert_sim"] = [None] + docvec_similarity(df["bert_embedding"][0], df["bert_embedding"][1:]) # None marks self-similarity; docvec_similarity must return a Python list for this concatenation to work

In [None]:
# TODO: Print out the target comment and the two comments with the highest and lowest cosine similarities

print("Target Comment:")
print(df["comment_text"][0])

print("\nLowest Similarities:")
print(??)

print("\nHighest similarities:")
print(??)

## Embedding-based classification

Finally, we will explore how well the document embeddings perform on the classification task of predicting our comment category labels. We will use the two types of embeddings as input to two separate logistic regression models, and then compare their performance.

**TODO**:

- Use the sklearn LogisticRegression class to predict our comment labels using both embedding methods

**Q**: How do the two embedding methods perform? Use the results from the model metrics and the confusion matrices to discuss and compare.

**A**: 

In [None]:
# Train/test split (same split for both)
Xw_train, Xw_test, y_train, y_test = train_test_split(df['w2v_embedding'], df['label'], test_size=0.2, random_state=42, stratify=df['label'])
Xb_train, Xb_test, _, _ = train_test_split(df['bert_embedding'], df['label'], test_size=0.2, random_state=42, stratify=df['label'])

# Standardize 
sc_w2v = StandardScaler(with_mean=False)
Xw_train_s = sc_w2v.fit_transform(np.stack(Xw_train.to_numpy()))
Xw_test_s  = sc_w2v.transform(np.stack(Xw_test.to_numpy()))

sc_bert = StandardScaler(with_mean=False)
Xb_train_s = sc_bert.fit_transform(np.stack(Xb_train.to_numpy()))
Xb_test_s  = sc_bert.transform(np.stack(Xb_test.to_numpy()))

# TODO: Train logistic regression models with both embedding methods
clf_w2v = ??
pred_w2v = ??

clf_bert = ??
pred_bert = ??

# Print logistic regression model performance metrics
print("\n=== Word2Vec Avg → Logistic Regression ===\n")
print(classification_report(y_test, pred_w2v, digits=3))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred_w2v))

print("\n=== Sentence-BERT → Logistic Regression ===\n")
print(classification_report(y_test, pred_bert, digits=3))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred_bert))
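
When reading the confusion matrices printed above, it helps to remember that scikit-learn puts true labels on the rows and predicted labels on the columns, with labels in sorted order by default. The sketch below rebuilds that layout by hand on hypothetical labels and derives per-class recall from it; the data and variable names are illustrative only.

```python
from collections import Counter

# Hypothetical true and predicted labels, for illustration only.
y_true = ['ai', 'ai', 'other', 'other', 'other', 'programming']
y_pred = ['ai', 'other', 'other', 'other', 'ai', 'programming']

labels = sorted(set(y_true) | set(y_pred))
pair_counts = Counter(zip(y_true, y_pred))

# matrix[i][j] = number of items whose true label is labels[i]
# and whose predicted label is labels[j].
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

print(labels)
for label, row in zip(labels, matrix):
    recall = row[labels.index(label)] / sum(row)  # diagonal / row total
    print(f'{label:<12} {row} recall={recall:.3f}')
```

Large off-diagonal counts in the 'other' column usually indicate that a dominant class is absorbing the minority labels, which is worth checking against the label counts printed earlier.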