# 🧠 NLP 102 – Word Embeddings & Visualizing Meaning

## Featuring The Hitchhiker’s Guide to the Galaxy
### ⏱️ Duration: ~30 minutes
### 🛠️ Requirements: Python 3, Jupyter Notebook or any Python IDE, nltk, gensim, numpy, pandas, matplotlib, plotly, umap-learn (optional: scikit-learn for t-SNE)

### 🗂️ Overview

Traditional methods like Bag-of-Words or TF-IDF treat every word as an independent token, so they ignore word meaning and context. That’s where embeddings shine.
Embeddings represent words as dense vectors in a multi-dimensional space where semantically similar words lie close together (closeness is usually measured with cosine similarity).

In this notebook, you will:
- Understand what word embeddings are and why they are powerful
- Learn how to train a simple Word2Vec model using gensim
- Use UMAP to reduce the dimensionality of word vectors
- Create an interactive plot to explore word relationships visually

🧩 By the end, you'll be able to see how similar words cluster together!
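
To make "semantic similarity as spatial closeness" concrete, here is a minimal sketch of cosine similarity with numpy. The three vectors are hand-picked toy values (an assumption for illustration, not learned embeddings), and `cosine_similarity` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction, 0.0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-picked toy vectors, NOT learned embeddings: they only illustrate the idea.
toy_vectors = {
    "ship": np.array([0.9, 0.1, 0.0]),
    "rocket": np.array([0.8, 0.2, 0.1]),
    "fish": np.array([0.0, 0.1, 0.9]),
}

sim_ship_rocket = cosine_similarity(toy_vectors["ship"], toy_vectors["rocket"])
sim_ship_fish = cosine_similarity(toy_vectors["ship"], toy_vectors["fish"])

# Vectors pointing in similar directions score close to 1, unrelated ones close to 0.
print(f"ship vs rocket: {sim_ship_rocket:.3f}")
print(f"ship vs fish:   {sim_ship_fish:.3f}")
```

A trained model does the same comparison, only with vectors it learned from text instead of values chosen by hand.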

## 📦 Setup

In [None]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px


nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

In [None]:
corpus = [
    "Time is an illusion. Lunchtime doubly so.",
    "The ships hung in the sky in much the same way that bricks don’t.",
    "The Hitchhiker’s Guide to the Galaxy is a wholly remarkable book.",
    "The Answer to the Great Question... of Life, the Universe and Everything... is... Forty-two.",
    "Don’t Panic.",
    "So long, and thanks for all the fish."
]

## 🧪 Exercise 0: Prepare your data

**Goal:** Tokenize the corpus

**Optional:** Load the whole book from disk and use it as the corpus

### 🧰 Tools:

`word_tokenize` from `nltk.tokenize`

`simple_preprocess` from `gensim.utils`

### 💻 Task:
- Preprocess the data
- Create a corpus
- Optional: Load book from disk
- Optional: Split sentences
- Optional: Create corpus from whole book

In [None]:
# your code goes here

## 🧪 Exercise 1: Create a model and explore word similarities

**Goal:** Train a Word2Vec model

**Optional:** Explore different parameters

**Super Optional:** Test a word not in the vocabulary

### 🧰 Tools:

`Word2Vec` from `gensim.models`

`wv.most_similar` from your trained model

### 💻 Task:
- Create an instance of `Word2Vec`
- Use the tokenized corpus as data
- Test some word similarities
- Optional: Train one model with skip-gram (`sg=1`) and one with CBOW (`sg=0`, the default)

In [None]:
# your code goes here

## 🧪 Exercise 2: Visualize with UMAP

**UMAP (Uniform Manifold Approximation and Projection)** is a dimensionality reduction technique that focuses on preserving local neighborhoods while often retaining some of the global structure of the data, which makes it a good fit for visualizing high-dimensional embeddings. It works by building a weighted nearest-neighbor graph of the data and then optimizing a low-dimensional layout whose graph matches the original one as closely as possible.

**Goal:** Reduce the word2vec vectors to two dimensions and visualize them.

**Optional:** Use t-SNE as an alternative

**Super Optional:** Experiment with different parameters

### 🧰 Tools:

`UMAP` from `umap`

`array` from `numpy`

Optional: `TSNE` from `sklearn.manifold`

### 💻 Task:
- Extract all words and vectors
- Reduce dimensions
- Plot the result

### ✅ Expected Output (example):

Plot of the reduced word embeddings

In [None]:
# your code goes here

## 🧪 Exercise 3: Interactive plot with plotly

**Goal:** Create a plot where you can interactively explore the UMAP data

### 🧰 Tools:

`scatter` from `plotly.express`

`DataFrame` from `pandas`


### 💻 Task:
- Prepare a dataframe
- Make an interactive plot

In [None]:
# your code goes here

## 🧪 Exercise 4: Use pre-trained Word2Vec

**Goal:** Load a pre-trained Word2Vec model and explore what is different

**Optional:** Load two pre-trained Word2Vec models and explore how they differ

**Super Optional:** Visualize some words from a pre-trained Word2Vec model

**Super Super Optional:** Visualize some words from two pre-trained models in the same plot

### 🧰 Tools:

`load` from `gensim.downloader`

`info` from `gensim.downloader` (Use this to list available models)

### 💻 Task:
- Load a pretrained model
- Create a list of words you want to test
- Find the 3 most similar words for each of them
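
💡 A hedged sketch of the task. `load_pretrained` and `report_most_similar` are hypothetical helper names. The default model name `"glove-wiki-gigaword-50"` is an assumption chosen because it is a small download, and it is a GloVe model rather than Word2Vec. `"word2vec-google-news-300"` is a Word2Vec model available through the same downloader, but it is a much larger download.

```python
def load_pretrained(name="glove-wiki-gigaword-50"):
    """Download (on first call) and load pre-trained vectors via gensim's downloader."""
    # Imported here so the helper below can be used without triggering a download.
    import gensim.downloader as api

    return api.load(name)


def report_most_similar(vectors, words, topn=3):
    """Build a small text report with the topn nearest neighbours of each word."""
    lines = []
    for word in words:
        lines.append(f"🔎 Word: {word}")
        if word in vectors.key_to_index:
            lines.append(f"{topn} most similar: {vectors.most_similar(word, topn=topn)}")
        else:
            lines.append("not in the model's vocabulary")
    return "\n".join(lines)


# In the notebook (needs a network connection the first time):
# vectors = load_pretrained()
# print(report_most_similar(vectors, ["computer", "galaxy", "towel"]))
```

To compare two models, load each one with `load_pretrained` and run the same word list through `report_most_similar`.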

### ✅ Expected Output (example):

```text
🔎 Word: computer
3 most similar: [('computers', 0.916504442691803), ('software', 0.8814992904663086), ('technology', 0.852556049823761)]
```

In [None]:
# your code goes here