# Lab 3 - Part 2: Word and Sentence Embeddings

**Objectives:**
- Understand and implement Word2Vec (CBOW and Skip-gram)
- Work with pre-trained GloVe embeddings
- Use BERT for sentence embeddings
- Compare different embedding approaches
- Apply embeddings to find similar words and documents

---

## Instructions

1. Complete all exercises marked with `# YOUR CODE HERE`
2. **Answer all written questions** in the designated markdown cells
3. Save your completed notebook
4. **Push to your Git repository and send the link to: yoroba93@gmail.com**

### Important: This lab continues from Part 1

You will use the same dataset and categories you chose in Part 1.

---

## Setup

In [10]:
# Install required libraries (uncomment if needed)
# Note: installing gensim on Windows can fail building a wheel in some environments.
# If you need gensim, install it separately (or use a prebuilt wheel).
# pip install sentence-transformers transformers torch datasets
# To install gensim manually (may require build tools):
# pip install gensim
# Or install from a wheel if available for your Python version.

In [46]:
import numpy as np
import pandas as pd
import matplotlib
# Force a non-interactive backend early to avoid GUI/backend import errors in headless environments
try:
    matplotlib.use('Agg')
except Exception:
    pass
import matplotlib.pyplot as plt
from collections import Counter
import re
import string
import warnings
warnings.filterwarnings('ignore')

import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import gensim
from gensim.models import Word2Vec, KeyedVectors
import gensim.downloader as api

print(f"Gensim version: {gensim.__version__}")
print("Setup complete!")

Gensim version: 4.4.0
Setup complete!


## Load Dataset (Same as Part 1)

In [47]:
import pandas as pd

# Load the dataset (expects the file to be in the same folder as this notebook)
# Adjust the path if your CSV is elsewhere.
try:
    df = pd.read_csv('20_newsgroups_train.csv')
except FileNotFoundError:
    # Fallback: path relative to the workspace root (if the notebook is run from there)
    df = pd.read_csv('./lab3/20_newsgroups_train.csv')

# Default example categories — replace with the SAME 3 you used in Part 1 if different.
my_categories = ["talk.politics.guns", "talk.politics.mideast", "talk.politics.misc"]

# Filter dataset
df_filtered = df[df['label_text'].isin(my_categories)].copy()
df_filtered = df_filtered.reset_index(drop=True)

print(f"Selected categories: {my_categories}")
print(f"Filtered dataset size: {len(df_filtered)}")

Selected categories: ['talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc']
Filtered dataset size: 1575


In [15]:
# Preprocessing function (same as Part 1)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Preprocess text for embedding training and return tokens."""
    if not isinstance(text, str):
        return []
    # basic cleanup
    text = text.lower()
    text = re.sub(r'\S+@\S+', ' ', text)          # remove emails
    text = re.sub(r'http\S+|www\.\S+', ' ', text)  # remove urls
    text = re.sub(r'\d+', ' ', text)               # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\s+', ' ', text).strip()

    # regex-based tokenizer (avoid extra NLTK downloads) — keep alphabetic tokens >=3 chars
    raw_tokens = re.findall(r"[a-z]{3,}", text)

    # remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(t) for t in raw_tokens if t not in stop_words]
    return tokens  # Return list of tokens for Word2Vec

# Apply preprocessing
df_filtered['tokens'] = df_filtered['text'].apply(preprocess_text)
df_filtered['text_clean'] = df_filtered['tokens'].apply(' '.join)

print('Sample tokens (first document):')
print(df_filtered.iloc[0]['tokens'][:20])

Sample tokens (first document):
['course', 'term', 'must', 'rigidly', 'defined', 'bill', 'doubt', 'us', 'term', 'using', 'quote', 'allegedly', 'back', 'read', 'article', 'presenting', 'first', 'argument', 'weapon', 'mass']


---

## Part A: Word2Vec - Training Your Own Embeddings

Word2Vec learns word representations by predicting context. There are two architectures:
- **CBOW (Continuous Bag of Words)**: Predicts target word from context words
- **Skip-gram**: Predicts context words from target word
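
To make the difference concrete, the sketch below generates the training examples each architecture would see for one tokenized sentence. It is a simplified illustration, not gensim's implementation: `training_pairs` and its `window` argument are hypothetical names, and real Word2Vec training adds details such as negative sampling and randomly shrunk windows.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_pairs, skipgram_pairs) for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` tokens on each side of the target
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:
            continue
        # CBOW: the whole context jointly predicts the target
        cbow.append((context, target))
        # Skip-gram: the target predicts each context word separately
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


cbow_pairs, skipgram_pairs = training_pairs(["gun", "control", "law", "debate"], window=1)
print(cbow_pairs[0])        # (['control'], 'gun')
print(skipgram_pairs[:3])   # [('gun', 'control'), ('control', 'gun'), ('control', 'law')]
```

Skip-gram produces one example per (target, context word) pair, so each occurrence of a rare word contributes several updates to its vector. This is one common explanation for why it tends to do better on infrequent words.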

### A.1 Understanding Word2Vec Architectures

In [16]:
# Prepare corpus for Word2Vec (list of tokenized sentences)
corpus = df_filtered['tokens'].tolist()

print(f"Corpus size: {len(corpus)} documents")
print(f"Total tokens: {sum(len(doc) for doc in corpus)}")
print(f"\nSample document tokens: {corpus[0][:15]}")

Corpus size: 1575 documents
Total tokens: 222083

Sample document tokens: ['course', 'term', 'must', 'rigidly', 'defined', 'bill', 'doubt', 'us', 'term', 'using', 'quote', 'allegedly', 'back', 'read', 'article']


In [17]:
# Train Word2Vec with CBOW (sg=0)
model_cbow = Word2Vec(
    sentences=corpus,
    vector_size=100,      # Embedding dimension
    window=5,             # Context window size
    min_count=5,          # Ignore words with freq < 5
    workers=4,            # Parallel threads
    sg=0,                 # 0 = CBOW, 1 = Skip-gram
    epochs=10             # Training epochs
)

print(f"CBOW Model trained!")
print(f"Vocabulary size: {len(model_cbow.wv)}")

CBOW Model trained!
Vocabulary size: 5629
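
The `min_count=5` setting means that words seen fewer than five times never receive a vector, which is why some domain terms may be missing from the vocabulary. The sketch below mimics that filtering step on a toy corpus; `build_vocab` and `toy_corpus` are hypothetical names used only for illustration, not part of gensim.

```python
from collections import Counter


def build_vocab(corpus, min_count=5):
    """Keep only tokens whose total frequency is at least `min_count`."""
    counts = Counter(tok for doc in corpus for tok in doc)
    return {tok for tok, n in counts.items() if n >= min_count}


toy_corpus = [["gun", "law", "gun"], ["gun", "owner"], ["law", "gun", "law"]]
vocab = build_vocab(toy_corpus, min_count=3)
print(sorted(vocab))  # ['gun', 'law']
```

Here `owner` occurs only once and is dropped, just as rare tokens are dropped from the models trained above.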


In [18]:
# Train Word2Vec with Skip-gram (sg=1)
model_skipgram = Word2Vec(
    sentences=corpus,
    vector_size=100,
    window=5,
    min_count=5,
    workers=4,
    sg=1,                 # Skip-gram
    epochs=10
)

print(f"Skip-gram Model trained!")
print(f"Vocabulary size: {len(model_skipgram.wv)}")

Skip-gram Model trained!
Vocabulary size: 5629


### A.2 Exploring Word Embeddings

In [19]:
# Example: Get word vector
sample_word = "computer"  # Change this to a word relevant to YOUR categories

if sample_word in model_cbow.wv:
    vector = model_cbow.wv[sample_word]
    print(f"Vector for '{sample_word}':")
    print(f"  Shape: {vector.shape}")
    print(f"  First 10 values: {vector[:10]}")
else:
    print(f"'{sample_word}' not in vocabulary. Try another word.")
    print(f"Sample words in vocab: {list(model_cbow.wv.key_to_index.keys())[:20]}")

Vector for 'computer':
  Shape: (100,)
  First 10 values: [-0.09658179  0.24510771  0.1389923   0.2019808   0.0140882  -0.3929469
  0.09952011  0.5654226  -0.23550627 -0.25703624]


In [20]:
# Find similar words
sample_word = "computer"  # Change to a word in YOUR vocabulary

if sample_word in model_cbow.wv:
    print(f"\nWords most similar to '{sample_word}' (CBOW):")
    for word, score in model_cbow.wv.most_similar(sample_word, topn=10):
        print(f"  {word}: {score:.4f}")
    
    print(f"\nWords most similar to '{sample_word}' (Skip-gram):")
    for word, score in model_skipgram.wv.most_similar(sample_word, topn=10):
        print(f"  {word}: {score:.4f}")


Words most similar to 'computer' (CBOW):
  involvement: 0.9949
  restored: 0.9945
  employment: 0.9944
  bureaucrat: 0.9944
  voter: 0.9940
  potential: 0.9939
  alternative: 0.9939
  product: 0.9938
  ignorance: 0.9937
  economics: 0.9935

Words most similar to 'computer' (Skip-gram):
  irs: 0.8689
  entry: 0.8040
  professional: 0.7867
  juris: 0.7825
  adrian: 0.7784
  motor: 0.7747
  hmmm: 0.7742
  arkansas: 0.7728
  sui: 0.7706
  cathy: 0.7640


### Exercise A.1: Compare CBOW vs Skip-gram

Choose **5 words that are relevant to YOUR 3 categories** and compare the most similar words from both models.

In [22]:
# TODO: Choose 5 words relevant to YOUR categories
# These should be domain-specific words (not common words like "good", "make", etc.)

# Example test words relevant to the politics categories used above
my_test_words = ["gun", "israel", "weapon", "policy", "government"]  # YOUR WORDS HERE

comparison_results = []

for word in my_test_words:
    word = word.lower()
    if word in model_cbow.wv and word in model_skipgram.wv:
        cbow_similar = [w for w, s in model_cbow.wv.most_similar(word, topn=5)]
        skipgram_similar = [w for w, s in model_skipgram.wv.most_similar(word, topn=5)]
        
        comparison_results.append({
            'word': word,
            'cbow_top5': cbow_similar,
            'skipgram_top5': skipgram_similar
        })
        
        print(f"\n'{word}':")
        print(f"  CBOW:     {cbow_similar}")
        print(f"  Skip-gram: {skipgram_similar}")
    else:
        print(f"'{word}' not found in vocabulary!")


'gun':
  CBOW:     ['likely', 'criminal', 'machine', 'owner', 'drug']
  Skip-gram: ['owner', 'control', 'handgun', 'kkk', 'stricter']

'israel':
  CBOW:     ['arab', 'israeli', 'lebanon', 'palestinian', 'peace']
  Skip-gram: ['arab', 'lebanon', 'syria', 'syrian', 'plo']

'weapon':
  CBOW:     ['use', 'carry', 'firearm', 'used', 'concealed']
  Skip-gram: ['automatic', 'chemical', 'concealed', 'carry', 'transport']

'policy':
  CBOW:     ['cpr', 'subject', 'research', 'political', 'international']
  Skip-gram: ['cpr', 'subject', 'research', 'echo', 'center']

'government':
  CBOW:     ['religion', 'society', 'minority', 'force', 'nation']
  Skip-gram: ['diverse', 'proven', 'implemented', 'federal', 'govern']


### Written Question A.1 (Personal Interpretation)

Based on your comparison above:

1. **For which words did CBOW and Skip-gram give SIMILAR results?**
2. **For which words did they give DIFFERENT results?**
3. **Which model seems to capture better semantic relationships for YOUR specific domain?** Explain with examples.
4. **Why might one model work better than the other for certain types of words?** (Think about word frequency)

**YOUR ANSWER:**

1. Similar results for: 'policy' (3 of 5 neighbors shared: 'cpr', 'subject', 'research') and, to a lesser extent, 'israel' and 'weapon' (2 of 5 shared each: 'arab'/'lebanon' and 'carry'/'concealed'). These words appear in fairly stable contexts, so both architectures converge on similar neighbors.

2. Different results for: 'government' (no shared neighbors) and 'gun' (only 'owner' is shared). CBOW returns broad, loosely related terms ('religion', 'society', 'likely', 'drug'), while Skip-gram returns more specific collocates ('federal', 'govern', 'control', 'handgun', 'stricter').

3. Better model for my domain: Skip-gram. Political discussion texts contain many specific entities and collocations where fine-grained context matters.
   - Example 1: 'israel'. Skip-gram surfaces regional actors ('syria', 'syrian', 'plo') alongside 'arab' and 'lebanon'.
   - Example 2: 'gun'. Skip-gram returns 'control', 'handgun', and 'stricter', which reflect the gun-control debate, whereas CBOW returns vaguer terms such as 'likely' and 'drug'.

4. Explanation of differences: CBOW predicts a word from its context and effectively averages contexts, making it faster and robust for frequent words but prone to smoothing rare-word distinctions. Skip-gram predicts context words from a target word, which preserves signals for infrequent words and fine-grained semantic relations. The CBOW neighbors of 'computer' all score about 0.99, which suggests that its vectors for infrequent words are barely separated on this small corpus. The choice depends on word frequency, vocabulary size, training data volume, and whether rare or domain-specific terms matter more than speed and stability.
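
One way to make "similar" versus "different" less subjective is to measure the overlap between the two neighbor lists. The sketch below computes the Jaccard overlap for the top-5 lists printed in Exercise A.1. `jaccard` is a helper introduced here for illustration, and the lists are copied from the output above.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# (CBOW top-5, Skip-gram top-5), copied from the Exercise A.1 output
neighbors = {
    "gun": (["likely", "criminal", "machine", "owner", "drug"],
            ["owner", "control", "handgun", "kkk", "stricter"]),
    "israel": (["arab", "israeli", "lebanon", "palestinian", "peace"],
               ["arab", "lebanon", "syria", "syrian", "plo"]),
    "weapon": (["use", "carry", "firearm", "used", "concealed"],
               ["automatic", "chemical", "concealed", "carry", "transport"]),
    "policy": (["cpr", "subject", "research", "political", "international"],
               ["cpr", "subject", "research", "echo", "center"]),
    "government": (["religion", "society", "minority", "force", "nation"],
                   ["diverse", "proven", "implemented", "federal", "govern"]),
}

overlap = {word: jaccard(cbow, sg) for word, (cbow, sg) in neighbors.items()}
for word, score in sorted(overlap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:<12}{score:.2f}")
```

With these lists, 'policy' has the highest overlap and 'government' has none, which is a quick sanity check on the written answer.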

### A.3 Word Analogies

In [23]:
# Example: Word analogies (king - man + woman = queen)
# This works better with larger, pre-trained models, but let's try with our custom model

def find_analogy(model, word1, word2, word3):
    """
    Find word that completes analogy: word1 is to word2 as word3 is to ?
    Uses: word2 - word1 + word3 = ?
    """
    try:
        result = model.wv.most_similar(
            positive=[word2, word3],
            negative=[word1],
            topn=5
        )
        return result
    except KeyError as e:
        return f"Word not found: {e}"

# Test with your domain
# Example: "baseball" is to "bat" as "hockey" is to ?
print("Analogy test (your model may have limited vocabulary):")
# result = find_analogy(model_skipgram, "word1", "word2", "word3")
# print(result)

Analogy test (your model may have limited vocabulary):
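
The arithmetic behind `find_analogy` can be sketched without gensim. The example below builds the query vector `word2 - word1 + word3` and returns the remaining word with the highest cosine similarity. The two-dimensional `toy` vectors are made up for illustration, and `cosine` and `analogy` are hypothetical helpers. gensim's `most_similar` works on unit-normalized vectors, so its scores will not match this sketch exactly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, word1, word2, word3):
    """Return the word closest to word2 - word1 + word3, excluding the inputs."""
    query = [b - a + c for a, b, c in
             zip(vectors[word1], vectors[word2], vectors[word3])]
    candidates = [w for w in vectors if w not in {word1, word2, word3}]
    return max(candidates, key=lambda w: cosine(vectors[w], query))


# Toy vectors (assumed): axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}
print(analogy(toy, "man", "king", "woman"))  # queen
```

The analogy only works here because the toy vectors encode the relation as a consistent offset. Learned vectors approximate this only when each word has been seen often enough.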


### Exercise A.2: Create Domain-Specific Analogies

Try to find **2 analogies** that work with YOUR dataset's vocabulary.

In [26]:
# Try 2 analogies with words from YOUR vocabulary
# Format: word1 is to word2 as word3 is to ?

# Analogy 1 (domain: government-policy mapping)
# Example: 'government' is to 'policy' as 'company' is to ?
analogy1 = find_analogy(model_skipgram, "government", "policy", "company")
print(f"Analogy 1: {analogy1}")

# Analogy 2 (domain: country-capital mapping)
# Example: 'israel' is to 'jerusalem' as 'iraq' is to ?
analogy2 = find_analogy(model_skipgram, "israel", "jerusalem", "iraq")
print(f"Analogy 2: {analogy2}")

Analogy 1: [('research', 0.7120622396469116), ('label', 0.6526555418968201), ('cpr', 0.6345767974853516), ('institute', 0.6122838854789734), ('lewis', 0.5953993201255798)]
Analogy 2: [('coup', 0.7630448937416077), ('condemnation', 0.7506963014602661), ('drawing', 0.7501662373542786), ('pow', 0.744975209236145), ('condemned', 0.7368564605712891)]


### Written Question A.2 (Personal Interpretation)

**Did your analogies work?** 
- If yes, explain why the result makes sense.
- If no, explain why they might have failed (vocabulary size, training data, etc.)

**YOUR ANSWER:**

**Did the analogies work?**
No. Neither analogy produced a sensible answer with the custom Skip-gram model:
- Analogy 1 (government : policy :: company : ?) returned 'research', 'label', 'cpr', 'institute', 'lewis'. These are mostly neighbors of 'policy' itself ('cpr' and 'research' also appear in its nearest-neighbor list in Exercise A.1), not a company-side counterpart of 'policy'.
- Analogy 2 (israel : jerusalem :: iraq : ?) returned 'coup', 'condemnation', 'drawing', 'pow', 'condemned' instead of a capital city.

**Short analysis:**
- Vector arithmetic captures relation offsets only when the word vectors are well estimated, which requires many occurrences in varied contexts. A corpus of 1,575 documents (about 222,000 tokens) is small for that.
- Likely contributing factors: limited training data, low-frequency or domain-specific terms with noisy vectors, `min_count` filtering, too few epochs, or a vector size poorly matched to the data.

**Recommendations / next steps:**
- Use pre-trained embeddings (GloVe or word2vec) for analogy tasks, since they are trained on much larger corpora and produce stronger analogy results.
- If training local embeddings is required: lower `min_count`, increase `epochs`, or increase the training data volume to improve rare-word vectors.
- Prefer Skip-gram when rare or domain-specific terms matter, since it preserves signals for low-frequency words better than CBOW.

---

## Part B: Pre-trained GloVe Embeddings 

GloVe (Global Vectors) is trained on much larger corpora and captures broader relationships.

In [27]:
# Load pre-trained GloVe embeddings (this may take a few minutes)
print("Loading GloVe embeddings (this may take a minute)...")
glove_model = api.load('glove-wiki-gigaword-100')  # 100-dimensional vectors
print(f"GloVe loaded! Vocabulary size: {len(glove_model)}")

Loading GloVe embeddings (this may take a minute)...
GloVe loaded! Vocabulary size: 400000


In [28]:
# Compare: Same word in YOUR model vs GloVe
test_word = "computer"  # Change to a word relevant to your domain

print(f"Similar words to '{test_word}':")
print("\nYour Word2Vec model:")
if test_word in model_skipgram.wv:
    for word, score in model_skipgram.wv.most_similar(test_word, topn=10):
        print(f"  {word}: {score:.4f}")
else:
    print(f"  '{test_word}' not in vocabulary")

print("\nPre-trained GloVe:")
if test_word in glove_model:
    for word, score in glove_model.most_similar(test_word, topn=10):
        print(f"  {word}: {score:.4f}")
else:
    print(f"  '{test_word}' not in vocabulary")

Similar words to 'computer':

Your Word2Vec model:
  irs: 0.8689
  entry: 0.8040
  professional: 0.7867
  juris: 0.7825
  adrian: 0.7784
  motor: 0.7747
  hmmm: 0.7742
  arkansas: 0.7728
  sui: 0.7706
  cathy: 0.7640

Pre-trained GloVe:
  computers: 0.8752
  software: 0.8373
  technology: 0.7642
  pc: 0.7366
  hardware: 0.7290
  internet: 0.7287
  desktop: 0.7234
  electronic: 0.7222
  systems: 0.7198
  computing: 0.7142


### Exercise B.1: Compare Your Model vs GloVe

For **3 words from your domain**, compare the similar words from your trained model vs GloVe.

In [29]:
# Compare 3 domain-specific words (example for political categories)

comparison_words = ["israel", "gun", "policy"]  # YOUR WORDS

for word in comparison_words:
    word = word.lower()
    print(f"\n{'='*50}")
    print(f"Word: '{word}'")
    print(f"{'='*50}")
    
    # Your model
    print("Your Word2Vec:")
    if word in model_skipgram.wv:
        for w, s in model_skipgram.wv.most_similar(word, topn=5):
            print(f"  {w}: {s:.3f}")
    else:
        print("  Not in vocabulary")
    
    # GloVe
    print("GloVe:")
    if word in glove_model:
        for w, s in glove_model.most_similar(word, topn=5):
            print(f"  {w}: {s:.3f}")
    else:
        print("  Not in vocabulary")


Word: 'israel'
Your Word2Vec:
  arab: 0.770
  lebanon: 0.753
  syria: 0.725
  syrian: 0.724
  plo: 0.723
GloVe:
  israeli: 0.855
  palestinians: 0.809
  palestinian: 0.785
  lebanon: 0.781
  syria: 0.778

Word: 'gun'
Your Word2Vec:
  owner: 0.636
  control: 0.621
  handgun: 0.615
  kkk: 0.596
  stricter: 0.592
GloVe:
  guns: 0.816
  handgun: 0.708
  rifle: 0.679
  weapon: 0.664
  pistol: 0.649

Word: 'policy'
Your Word2Vec:
  cpr: 0.756
  subject: 0.717
  research: 0.715
  echo: 0.677
  center: 0.661
GloVe:
  policies: 0.845
  administration: 0.778
  reform: 0.762
  strategy: 0.760
  economic: 0.743


### Written Question B.1 (Personal Interpretation)

Compare your custom-trained Word2Vec model with pre-trained GloVe:

1. **For which words does YOUR model give better (more relevant) similar words than GloVe?** Why?
2. **For which words does GloVe give better results?** Why?
3. **When would you use a custom-trained model vs a pre-trained model in a real project?**

**YOUR ANSWER:**

1. My model is better for: words whose meaning in these newsgroups is debate-specific. For 'gun', my model returns 'owner', 'control', and 'stricter' (the gun-control debate), whereas GloVe returns generic weapon types ('guns', 'rifle', 'pistol'). For 'israel', it surfaces 'arab' and 'plo', which reflect how the forum discusses the conflict.
   - Reason: The custom Word2Vec model was trained on the same corpus and captures the local usage, collocations, and senses that are specific to these political discussion forums. Words that have many examples in the dataset get neighbors that are more relevant to the domain than those from a general-purpose model.

2. GloVe is better for: common general-language words, broad semantic analogies, and words that are rare in my corpus. For 'policy', GloVe returns 'policies', 'administration', and 'reform', while my model returns 'cpr', 'subject', 'research', 'echo', and 'center', which look like corpus artifacts rather than semantic neighbors. The same holds for 'computer' ('software', 'hardware' versus 'irs', 'adrian', 'hmmm').
   - Reason: GloVe is pre-trained on very large text corpora, so its vectors encode wide-ranging semantic information and robust analogical relationships that a smaller, domain-limited model cannot reliably learn.

3. When to use each:
   - Custom model: Use when your task depends on domain-specific vocabulary or subtle local meanings (e.g., analyzing policy debates, named entities, or forum slang). Also useful when you can train on a large, relevant in-domain corpus or when you plan to fine-tune embeddings for downstream tasks.
   - Pre-trained model: Use when you need broad coverage, strong general semantic priors, or when compute/data are limited. Pre-trained embeddings (or a hybrid approach: initialize with pre-trained vectors then fine-tune on domain data) give the best balance for most production workflows.
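
A quick, practical input to this decision is vocabulary coverage: how many of the corpus tokens the candidate model actually knows. The sketch below computes an out-of-vocabulary (OOV) rate for a token list against any vocabulary container. `oov_rate` and the toy data are assumptions made for illustration.

```python
def oov_rate(tokens, vocab):
    """Fraction of tokens that are missing from `vocab` (any container supporting `in`)."""
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in vocab)
    return missing / len(tokens)


toy_tokens = ["gun", "plo", "cpr", "policy"]
toy_vocab = {"gun", "policy", "plo"}
print(oov_rate(toy_tokens, toy_vocab))  # 0.25
```

In this notebook the same function could be called with the flattened `corpus` tokens and `glove_model.key_to_index` or `model_skipgram.wv.key_to_index` as the vocabulary. A high OOV rate for the pre-trained model would be one argument for training in-domain vectors.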

### B.2 GloVe Analogies

In [30]:
# Famous analogy: king - man + woman = queen
result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)
print("king - man + woman = ?")
for word, score in result:
    print(f"  {word}: {score:.4f}")

king - man + woman = ?
  queen: 0.7699
  monarch: 0.6843
  throne: 0.6756
  daughter: 0.6595
  princess: 0.6521


In [31]:
# Try 3 analogies with GloVe (examples related to political categories)
# Analogy 1: israel is to jerusalem as iraq is to ?
result1 = glove_model.most_similar(positive=['jerusalem', 'iraq'], negative=['israel'], topn=3)
print("Analogy 1 (israel:jerusalem :: iraq:?):")
print(result1)

# Analogy 2: government is to policy as company is to ?
result2 = glove_model.most_similar(positive=['policy', 'company'], negative=['government'], topn=3)
print("Analogy 2 (government:policy :: company:?):")
print(result2)

# Analogy 3: democrat is to liberal as republican is to ?
result3 = glove_model.most_similar(positive=['liberal', 'republican'], negative=['democrat'], topn=3)
print("Analogy 3 (democrat:liberal :: republican:?):")
print(result3)

Analogy 1 (israel:jerusalem :: iraq:?):
[('baghdad', 0.7702227830886841), ('najaf', 0.7214539647102356), ('iraqi', 0.6986455917358398)]
Analogy 2 (government:policy :: company:?):
[('marketing', 0.7402899265289307), ('business', 0.7187280058860779), ('firm', 0.6655234694480896)]
Analogy 3 (democrat:liberal :: republican:?):
[('conservative', 0.8900002837181091), ('conservatives', 0.7953249216079712), ('liberals', 0.7295266389846802)]


---

## Part C: BERT Sentence Embeddings

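BERT (Bidirectional Encoder Representations from Transformers) produces contextual embeddings: the same word gets a different vector depending on the sentence it appears in. This part uses `all-MiniLM-L6-v2` from the `sentence-transformers` library, a compact BERT-style model that maps each sentence to a single 384-dimensional vector.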
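
A sentence embedding is usually obtained by pooling the per-token contextual vectors into one fixed-size vector. Mean pooling over the non-padding tokens is a common choice and, to my understanding, the one used by many sentence-transformers models. The sketch below shows the idea with plain Python lists. `mean_pool` and the toy vectors are hypothetical, and this is not the library's actual code.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding has mask 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three token positions with 2-dimensional vectors; the last one is padding
toy_token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_token_vectors, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

Because the token vectors are contextual, two sentences that share a word such as "python" can still end up far apart after pooling.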

In [33]:
! pip install hf_xet



In [34]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
print("Loading BERT-based sentence transformer...")
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')  # Efficient model
print("Model loaded!")

Loading BERT-based sentence transformer...
Model loaded!


In [35]:
# Example: Get sentence embeddings
sample_sentences = [
    "I love programming in Python.",
    "Python is my favorite programming language.",
    "The python snake is very long.",
    "I enjoy coding and software development."
]

# Encode sentences
embeddings = sentence_model.encode(sample_sentences)

print(f"Embedding shape: {embeddings.shape}")
print(f"Each sentence is represented by a {embeddings.shape[1]}-dimensional vector")

Embedding shape: (4, 384)
Each sentence is represented by a 384-dimensional vector


In [36]:
# Compute sentence similarity
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(embeddings)

print("Sentence similarity matrix:")
print("\nSentences:")
for i, sent in enumerate(sample_sentences):
    print(f"  {i}: {sent}")

print("\nSimilarity:")
sim_df = pd.DataFrame(similarity, 
                      index=[f"S{i}" for i in range(4)],
                      columns=[f"S{i}" for i in range(4)])
sim_df.round(3)

Sentence similarity matrix:

Sentences:
  0: I love programming in Python.
  1: Python is my favorite programming language.
  2: The python snake is very long.
  3: I enjoy coding and software development.

Similarity:


|    | S0    | S1    | S2    | S3    |
|----|-------|-------|-------|-------|
| S0 | 1.000 | 0.878 | 0.370 | 0.621 |
| S1 | 0.878 | 1.000 | 0.337 | 0.512 |
| S2 | 0.370 | 0.337 | 1.000 | 0.058 |
| S3 | 0.621 | 0.512 | 0.058 | 1.000 |


### Exercise C.1: Document Similarity with BERT

Use BERT embeddings to find the most similar documents in your dataset.

In [38]:
# Sample 30 documents (10 per category) for BERT embedding
sampled_docs = []
sampled_labels = []

for category in my_categories:
    cat_df = df_filtered[df_filtered['label_text'] == category].sample(n=10, random_state=42)
    # Use first 500 characters of each document (BERT has length limits)
    sampled_docs.extend(cat_df['text'].str[:500].tolist())
    sampled_labels.extend([category] * 10)

print(f"Sampled {len(sampled_docs)} documents")

Sampled 30 documents


In [39]:
# Encode sampled documents with the sentence transformer and compute similarity matrix

# Step 1: Encode all sampled documents (disable progress bar for notebook speed)
doc_embeddings = sentence_model.encode(sampled_docs, show_progress_bar=False)

# Step 2: Compute cosine similarity between document embeddings
from sklearn.metrics.pairwise import cosine_similarity
bert_similarity = cosine_similarity(doc_embeddings)

print(f"Similarity matrix shape: {bert_similarity.shape}")

Similarity matrix shape: (30, 30)


In [41]:
! pip install hf_xet
! pip install seaborn 



In [48]:
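# Visualize the BERT similarity matrix, falling back to plain matplotlib if seaborn is not installed.
# A non-GUI backend avoids interactive backend errors in headless environments.
# If this cell raises "ImportError: cannot import name 'backend_agg'", the matplotlib install is
# probably broken or was upgraded while the kernel was running (for example by a pip install);
# restarting the kernel and rerunning the notebook usually resolves it.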
import matplotlib
try:
    matplotlib.use('Agg')
except Exception:
    pass
try:
    import seaborn as sns
    _HAS_SEABORN = True
except Exception:
    _HAS_SEABORN = False

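# Create short labels for plotting. Use the last component of the category name,
# because every 'talk.politics.*' category shares the same first six characters.
labels_short = [f"{l.split('.')[-1][:8]}_{i % 10}" for i, l in enumerate(sampled_labels)]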

# Ensure bert_similarity is available
if 'bert_similarity' not in globals():
    raise RuntimeError('bert_similarity not found. Run the BERT encoding cell first.')

plt.figure(figsize=(14, 12))
if _HAS_SEABORN:
    sns.heatmap(
        bert_similarity,
        xticklabels=labels_short,
        yticklabels=labels_short,
        cmap='YlOrRd',
        cbar_kws={'shrink': 0.6},
    )
else:
    # Fallback: use matplotlib's imshow
    plt.imshow(bert_similarity, cmap='YlOrRd', aspect='auto')
    plt.xticks(ticks=range(len(labels_short)), labels=labels_short, rotation=90, fontsize=8)
    plt.yticks(ticks=range(len(labels_short)), labels=labels_short, fontsize=8)
    plt.colorbar(shrink=0.6)

plt.title('Document Similarity (BERT Embeddings)')
plt.tight_layout()
out_path = 'bert_similarity_heatmap.png'
plt.savefig(out_path, dpi=150, bbox_inches='tight')
# Display the saved image inline in notebooks that support it
try:
    from IPython.display import Image, display
    display(Image(filename=out_path))
except Exception:
    # If display fails, at least inform the user where the file was saved
    print(f'Heatmap saved to: {out_path}')
plt.close()

ImportError: cannot import name 'backend_agg' from 'matplotlib.backends' (c:\VIsual studio code\New folder\kernal practice\nlp-course\.venv\Lib\site-packages\matplotlib\backends\__init__.py)

### Written Question C.1 (Personal Interpretation)

Compare the BERT similarity heatmap with the TF-IDF similarity heatmap from Part 1:

1. **Do documents cluster better by category with BERT or TF-IDF?**
2. **Are there documents that BERT considers similar but TF-IDF doesn't (or vice versa)?** Why might this happen?
3. **Which method would you use for a document classification task?** Explain your reasoning.

**YOUR ANSWER:**

1. Better clustering with: ...

2. Differences between methods: ...

3. Preferred method for classification: ...
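
To back the answer to question 1 with a number rather than a visual impression, one option is to compare the average similarity of same-category pairs with that of different-category pairs. The sketch below does this for any square similarity matrix. `category_separation` and the toy matrix are assumptions made for illustration, not results from this notebook.

```python
def category_separation(sim, labels):
    """Return (mean same-category similarity, mean cross-category similarity)."""
    intra, inter = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once, diagonal excluded
            (intra if labels[i] == labels[j] else inter).append(sim[i][j])

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)


toy_sim = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
toy_labels = ["guns", "guns", "mideast", "mideast"]
intra, inter = category_separation(toy_sim, toy_labels)
print(round(intra, 2), round(inter, 2))  # 0.75 0.2
```

Calling it with `bert_similarity` and `sampled_labels`, and again with the TF-IDF similarity matrix from Part 1 on the same documents, gives two gaps that can be compared directly. A larger gap between the two means suggests cleaner clustering by category.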

### Exercise C.2: Semantic Search with BERT

In [45]:
# Simple semantic search using BERT embeddings
# Given a query, find the most similar documents (top_k)

def semantic_search(query, documents, model, top_k=5, doc_embeddings=None):
    """Return the top_k (index, cosine similarity) pairs for `query` over `documents`.

    Pass precomputed `doc_embeddings` to avoid re-encoding the documents.
    """
    from sklearn.metrics.pairwise import cosine_similarity

    if not isinstance(query, str) or len(documents) == 0:
        return []

    # Encode the documents unless the caller supplied their embeddings
    if doc_embeddings is None:
        doc_embeddings = model.encode(documents, show_progress_bar=False)

    # Encode the query
    q_emb = model.encode([query], show_progress_bar=False)[0]

    # Cosine similarity between the query and every document
    sims = cosine_similarity([q_emb], doc_embeddings)[0]

    # Indices of the top_k highest similarities, in descending order
    top_idx = sims.argsort()[::-1][:top_k]
    return [(int(i), float(sims[i])) for i in top_idx]

# Test the search function with an example query related to one of the categories
my_query = "gun control policy"  # Example query (change as needed)

results = semantic_search(my_query, sampled_docs, sentence_model, top_k=5)

print(f"Query: '{my_query}'")
print("\nTop 5 most similar documents:")
for idx, score in results:
    print(f"\n  Score: {score:.4f}")
    print(f"  Category: {sampled_labels[idx]}")
    print(f"  Text: {sampled_docs[idx][:200]}...")

Query: 'gun control policy'

Top 5 most similar documents:

  Score: 0.5296
  Category: talk.politics.guns
  Text: 
       Two questions:  When was this, and do you have the relevant
numbers.  (Please note, this is *not* in any way an indication I don't
believe you or that you're not correct, but when the drop occ...

  Score: 0.4240
  Category: talk.politics.guns
  Text: Excuse me but I do know what I safety is supposed to do.  It's basic purpose -
not to let the gun fire until you're ready.  Christ, I've known that since I
had my first Crosman air gun.  You don't kno...

  Score: 0.4214
  Category: talk.politics.guns
  Text: / iftccu:talk.politics.guns / kendall@lds.loral.com (Colin Kendall 6842) /  9:23 am  Apr 13, 1993 /


Follow more than one months posting.  As more than one reader has noted, 
there IS some reporting ...

  Score: 0.4054
  Category: talk.politics.guns
  Text: 
: >: Rate := per capita rate.  The UK is more dangerous.
: >: Though you may be less likely to be killed

### Written Question C.2 (Personal Interpretation)

Evaluate your semantic search results:

1. **Are the results relevant to your query?** Explain.
2. **Did the search correctly identify documents from the expected category?**
3. **Try a query that could match multiple categories. What happens?**

**YOUR ANSWER:**

1. Relevance: ...

2. Category accuracy: ...

3. Ambiguous query test: ...
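
Question 2 can be answered with a simple metric: the share of the top-k results that come from the expected category (precision at k). The sketch below is a minimal version. `precision_at_k` and the toy label list are hypothetical and only illustrate the calculation.

```python
def precision_at_k(result_labels, expected_label, k=5):
    """Fraction of the first k retrieved labels that equal `expected_label`."""
    top = result_labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label == expected_label) / len(top)


toy_result_labels = ["guns", "guns", "misc", "guns", "mideast"]
print(precision_at_k(toy_result_labels, "guns", k=5))  # 0.6
```

For the search above, the label list would be `[sampled_labels[idx] for idx, _ in results]` and the expected label `"talk.politics.guns"`. Running it for an ambiguous query shows how the results split across categories.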

---

## Part D: Embedding Visualization with t-SNE

In [None]:
from sklearn.manifold import TSNE

# Reduce BERT embeddings to 2D for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=10)
embeddings_2d = tsne.fit_transform(doc_embeddings)

# Plot
plt.figure(figsize=(12, 8))

# One color per category, taken from a qualitative colormap
color_map = plt.cm.Set1
labels_arr = np.array(sampled_labels)

for i, category in enumerate(my_categories):
    mask = labels_arr == category  # boolean mask selecting this category's documents
    plt.scatter(
        embeddings_2d[mask, 0],
        embeddings_2d[mask, 1],
        label=category,
        color=color_map(i),
        alpha=0.7,
        s=100
    )

plt.legend()
plt.title('Document Embeddings (BERT + t-SNE)')
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.tight_layout()
plt.savefig('tsne_document_embeddings.png', dpi=150)
plt.show()

### Written Question D.1 (Personal Interpretation)

Look at your t-SNE visualization:

1. **Do the categories form distinct clusters?**
2. **Are there any documents that appear in the "wrong" cluster?** What might explain this?
3. **Based on the visualization, which two categories are most similar?** Does this match your expectations from Part 1?

**YOUR ANSWER:**

1. Cluster quality: ...

2. Misplaced documents: ...

3. Most similar categories: ...
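
Distances between clusters in a t-SNE plot are not always reliable, so it can help to check question 3 in the original embedding space. The sketch below averages the vectors of each category into a centroid and reports the pair of categories whose centroids are closest. `centroid`, `closest_categories`, and the toy points are assumptions made for illustration.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


def closest_categories(embeddings, labels):
    """Return the pair of labels whose centroids have the smallest Euclidean distance."""
    groups = {}
    for vec, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(list(vec))
    centroids = {label: centroid(vecs) for label, vecs in groups.items()}
    return min(
        combinations(sorted(centroids), 2),
        key=lambda pair: math.dist(centroids[pair[0]], centroids[pair[1]]),
    )


toy_embeddings = [[0.0, 0.0], [0.0, 2.0], [1.0, 1.0], [1.0, 3.0], [10.0, 10.0], [12.0, 10.0]]
toy_labels = ["guns", "guns", "misc", "misc", "mideast", "mideast"]
print(closest_categories(toy_embeddings, toy_labels))  # ('guns', 'misc')
```

In the notebook, `closest_categories(doc_embeddings, sampled_labels)` would give the corresponding pair for the 384-dimensional embeddings, which can then be compared with what the plot suggests.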

---

## Part E: Final Comparison and Reflection (10 min)

### Final Written Question (Comprehensive Reflection)

Based on everything you've learned in this lab:

1. **Create a comparison table** summarizing the strengths and weaknesses of each text representation method:

| Method | Strengths | Weaknesses | Best Use Case |
|--------|-----------|------------|---------------|
| BoW | ... | ... | ... |
| TF-IDF | ... | ... | ... |
| Word2Vec | ... | ... | ... |
| GloVe | ... | ... | ... |
| BERT | ... | ... | ... |

2. **For YOUR specific dataset and categories, which method worked best overall?** Support your answer with specific evidence from your experiments.

3. **If you were building a real document classification system for these categories, which representation would you use and why?**

**YOUR ANSWER:**

### 1. Comparison Table

| Method | Strengths | Weaknesses | Best Use Case |
|--------|-----------|------------|---------------|
| BoW | ... | ... | ... |
| TF-IDF | ... | ... | ... |
| Word2Vec | ... | ... | ... |
| GloVe | ... | ... | ... |
| BERT | ... | ... | ... |

### 2. Best Method for My Dataset

*[Write at least 4-5 sentences with specific evidence]*

...

### 3. My Recommendation for a Real System

*[Write your recommendation and justification]*

...

---

## Summary - Lab 3

In this lab, you learned:

**Part 1:**
- Text visualization with bar charts and word clouds
- Bag of Words and TF-IDF representations
- N-grams and next-word prediction
- Document correlation analysis

**Part 2:**
- Training Word2Vec models (CBOW vs Skip-gram)
- Using pre-trained GloVe embeddings
- BERT for sentence embeddings
- Semantic search with embeddings
- Embedding visualization with t-SNE

---

## Final Submission Checklist

- [ ] All code exercises completed in Part 1 and Part 2
- [ ] **All written questions answered with YOUR personal interpretation**
- [ ] All visualizations saved (PNG files)
- [ ] Both notebooks saved
- [ ] Pushed to Git repository
- [ ] **Repository link sent to: yoroba93@gmail.com**

### Reminder: Oral Defense

Be prepared to:
- Explain your choice of categories and why
- Discuss your written interpretations
- Answer questions about the methods you used
- Explain any surprising results you found