# 🧠 NLP 102 – Word Embeddings & Visualizing Meaning

## Featuring The Hitchhiker’s Guide to the Galaxy
### ⏱️ Duration: ~60 minutes
### 🛠️ Requirements: Python 3, Jupyter Notebook or any Python IDE, nltk, gensim, umap-learn, plotly, pandas, numpy, matplotlib

### 🗂️ Overview

Traditional methods like Bag-of-Words or TF-IDF treat every word as an unrelated symbol, so they capture neither word meaning nor context. That’s where embeddings shine.
Embeddings represent words as dense vectors in a multi-dimensional space where semantically similar words lie close together.
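
To make "close together" concrete, here is a minimal sketch of cosine similarity, the closeness measure typically used for word vectors. The three-dimensional vectors are hypothetical values invented for illustration (real embeddings have tens to hundreds of dimensions), and plain Python is used so the snippet runs without any extra packages.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", made up for this example only
toy_vectors = {
    "ship":   [0.9, 0.1, 0.0],
    "rocket": [0.8, 0.2, 0.1],
    "fish":   [0.0, 0.9, 0.4],
}

sim_ship_rocket = cosine_similarity(toy_vectors["ship"], toy_vectors["rocket"])
sim_ship_fish = cosine_similarity(toy_vectors["ship"], toy_vectors["fish"])

print(f"ship vs rocket: {sim_ship_rocket:.3f}")
print(f"ship vs fish:   {sim_ship_fish:.3f}")
```

In this toy setup "ship" and "rocket" point in nearly the same direction, so their similarity is close to 1, while "ship" and "fish" are almost orthogonal and score much lower.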

In this notebook, you will:
- Understand what word embeddings are and why they are powerful
- Learn how to train a simple Word2Vec model using gensim
- Use UMAP to reduce the dimensionality of word vectors
- Create an interactive plot to explore word relationships visually

🧩 By the end, you'll be able to see how similar words cluster together!

## 📦 Setup

In [None]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px


nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

In [None]:
corpus = [
    "Time is an illusion. Lunchtime doubly so.",
    "The ships hung in the sky in much the same way that bricks don’t.",
    "The Hitchhiker’s Guide to the Galaxy is a wholly remarkable book.",
    "The Answer to the Great Question... of Life, the Universe and Everything... is... Forty-two.",
    "Don’t Panic.",
    "So long, and thanks for all the fish."
]

## 🧪 Exercise 0: Prepare your data

**Goal:** Tokenize the corpus

**Optional:** Load the whole book from disk and use it as corpus

### 🧰 Tools:

`word_tokenize` from `nltk.tokenize`

`simple_preprocess` from `gensim.utils`

### 💻 Task:
- Preprocess the data
- Create a corpus
- Optional: Load book from disk
- Optional: Split sentences
- Optional: Create corpus from whole book

In [None]:
# your code goes here

### 📖 Solution

In [None]:
from nltk.tokenize import word_tokenize
from gensim.utils import simple_preprocess

sentences = [simple_preprocess(s) for s in corpus]

In [None]:
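from gensim.utils import simple_preprocess

# Optional: read the whole book (assumes the text is saved at data/guide.txt)
with open('data/guide.txt', encoding='utf-8') as doc:
    text = doc.read()

# Naive sentence split on '.'; skip fragments that yield no tokens
sentences = []
for sentence in text.split('.'):
    tokens = simple_preprocess(sentence)
    if tokens:
        sentences.append(tokens)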
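
Splitting on `'.'` alone misses sentences that end in `!` or `?` and leaves an empty trailing fragment. As a rough illustration of the optional "split sentences" step, the sketch below uses a small regular expression instead. The helper name `split_sentences` is hypothetical, and the pattern is still naive (it will trip over abbreviations and ellipses); `sent_tokenize` from `nltk.tokenize` is likely the more robust choice for the full book.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Don't Panic! Time is an illusion. Lunchtime doubly so."

naive = sample.split(".")              # merges the first two sentences
regex_based = split_sentences(sample)  # keeps all three apart

print(naive)
print(regex_based)
```

Each resulting sentence can then be passed through `simple_preprocess` exactly as in the solution above.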

## 🧪 Exercise 1: Create Model and explore word similarities

**Goal:** Train a Word2Vec Model 

**Optional:** Explore different parameters

### 🧰 Tools:

`Word2Vec` from `gensim.models`

`wv.most_similar` from your trained model

### 💻 Task:
- Create an instance of `Word2Vec`
- Use the tokenized corpus as data
- Test some word similarities
- Optional: Make one with `skip-gram` and one with `CBOW`
- Optional: Explore different parameters

In [None]:
# your code goes here

### 📖 Solution

In [None]:
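from gensim.models import Word2Vec

# sg=1 trains skip-gram (sg=0 would train CBOW); min_count=1 keeps every word, which suits a tiny corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=2, sg=1)
model.save("hitchhiker_word2vec_sg.model")

# Nearest neighbours of "fish" by cosine similarity (top 10 by default)
model.wv.most_similar("fish")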
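
To give a feel for what `window` and `sg=1` mean, here is a conceptual sketch of how skip-gram turns a tokenized sentence into (center word, context word) training pairs. The function `skipgram_pairs` is a hypothetical helper written for this illustration, not part of gensim, and gensim's actual training involves further steps (for example negative sampling), so treat it as an intuition aid rather than a description of the library's internals.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Tokens of one corpus sentence, lowercased with punctuation removed
tokens = ["so", "long", "and", "thanks", "for", "all", "the", "fish"]

pairs = skipgram_pairs(tokens, window=2)
print(len(pairs))
print(pairs[:6])
```

Skip-gram learns to predict each context word from its center word. CBOW works the other way round and predicts the center word from its surrounding context words.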

## 🧪 Exercise 2: Visualize with UMAP

**UMAP (Uniform Manifold Approximation and Projection)** is a dimensionality reduction technique that aims to preserve local neighborhood structure, and often much of the global structure, which makes it well suited to visualizing high-dimensional embeddings. It works by modeling the data as a nearest-neighbor graph and optimizing a low-dimensional representation that keeps those neighbor relationships as intact as possible.

**Goal:** Project the word vectors into 2D with UMAP and plot them

### 🧰 Tools:

`UMAP` from `umap`

`array` from `numpy`

### 💻 Task:
- Extract all words and vectors
- Reduce dimensions
- Plot the result

### ✅ Expected Output (example):

Plot of the reduced word embeddings

### 📖 Solution

In [None]:
import umap

words = list(model.wv.index_to_key)
word_vectors = np.array([model.wv[word] for word in words])

reducer = umap.UMAP(n_neighbors=5, min_dist=0.3, metric='cosine', random_state=42)
embeddings_2d = reducer.fit_transform(word_vectors)

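plt.figure(figsize=(10, 7))
# One vectorized call draws all points at once
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=10)
# Uncomment to label each point (readable only for small vocabularies)
# for (x, y), word in zip(embeddings_2d, words):
#     plt.text(x + 0.1, y + 0.1, word, fontsize=8)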

plt.title("UMAP Projection of Embeddings")
plt.axis('off')
plt.show()
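
The `n_neighbors=5` setting in the solution above controls how many neighbors each word is linked to in the graph that UMAP builds before it optimizes the 2D layout. The sketch below illustrates only that first idea, a k-nearest-neighbor graph under cosine distance, in plain Python. It is not UMAP itself; the helper names and the toy vectors are invented for illustration.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity: 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def knn_graph(vectors, k):
    """Map each word to its k nearest other words by cosine distance."""
    graph = {}
    for name, vec in vectors.items():
        others = [(cosine_distance(vec, v), n) for n, v in vectors.items() if n != name]
        graph[name] = [n for _, n in sorted(others)[:k]]
    return graph

# Hypothetical toy vectors, made up for this example only
toy = {
    "ship":    [0.9, 0.1, 0.0],
    "rocket":  [0.8, 0.2, 0.1],
    "fish":    [0.0, 0.9, 0.4],
    "dolphin": [0.1, 0.8, 0.5],
}

graph = knn_graph(toy, k=1)
print(graph)
```

A small `n_neighbors` makes the projection focus on fine local detail, while a larger value tends to emphasize broader structure, so it is worth trying a few values on the full-book model.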

## 🧪 Exercise 3: Interactive plot with plotly

**Goal:** Create a plot where you can interactively explore the UMAP data

### 🧰 Tools:

`scatter` from `plotly.express`

`DataFrame` from `pandas`


### 💻 Task:
- Prepare a dataframe
- Make an interactive plot

In [None]:
# your code goes here

### 📖 Solution

In [None]:
df = pd.DataFrame(embeddings_2d, columns=["x", "y"])
df["word"] = words

fig = px.scatter(df, x="x", y="y", text="word", title="Word Embeddings (Word2Vec + UMAP)")
fig.update_traces(textposition='top center')
fig.show()

## 🧪 Exercise 4: Use pre-trained Word2Vec

**Goal:** Load a pre-trained embedding model (for example Word2Vec or GloVe) and explore what is different

**Optional:** Load two pre-trained models and explore how they differ

**Super Optional:** Visualize some words from a pretrained Word2Vec model

**Super Super Optional:** Visualize some words from two pretrained models in the same plot

### 🧰 Tools:

`load` from `gensim.downloader`

`info` from `gensim.downloader` (Use this to list available models)

### 💻 Task:
- Load a pretrained model
- Create a list of words you want to test
- Calculate the 3 most similar

### ✅ Expected Output (example):

```python
🔎 Word: computer
3 most similar: [('computers', 0.916504442691803), ('software', 0.8814992904663086), ('technology', 0.852556049823761)]
```

In [None]:
# your code goes here

### 📖 Solution

In [None]:
import gensim.downloader as api

# Load a pre-trained GloVe model (downloaded on first use, then cached)
model_glove = api.load("glove-wiki-gigaword-50")

words_to_test = ["king", "apple", "computer", "music"]

for word in words_to_test:
    print(f"\n🔎 Word: {word}")
    print("3 most similar:", model_glove.most_similar(word, topn=3))

In [None]:
import gensim.downloader as api

# Step 1: Load pretrained models
model_glove = api.load("glove-wiki-gigaword-50")
model_glove_twitter = api.load("glove-twitter-50")

# Step 2: Select common words to compare
words_to_plot = ["king", "queen", "man", "woman", "apple", "orange", "computer", "music", "city", "doctor"]

# Filter only words available in both vocabularies
words_common = [word for word in words_to_plot if word in model_glove_twitter and word in model_glove]

# Step 3: Collect vectors
vectors = []
labels = []
sources = []

for word in words_common:
    vectors.append(model_glove_twitter[word])
    labels.append(word)
    sources.append("GloveTwitter")

    vectors.append(model_glove[word])
    labels.append(word)
    sources.append("GloVe")

vectors = np.array(vectors)

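# Step 4: Reduce dimensions with UMAP
# Note: the two models were trained separately, so their vector spaces are not aligned;
# compare neighbourhoods within each model rather than distances across models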
reducer = umap.UMAP(n_neighbors=5, min_dist=0.3, metric='cosine', random_state=42)
embedding_2d = reducer.fit_transform(vectors)

# Step 5: Prepare dataframe for plotting
df = pd.DataFrame(embedding_2d, columns=["x", "y"])
df["word"] = labels
df["model"] = sources

# Step 6: Plot with Plotly
fig = px.scatter(df, x="x", y="y", text="word", color="model",
                 title="Comparison of Word Embeddings: GloVe Twitter vs. GloVe Wiki-Gigaword")
fig.update_traces(textposition='top center')
fig.update_layout(legend_title_text='Embedding Source')
fig.show()