<a href="https://colab.research.google.com/github/appliedcode/mthree-c422/blob/main/Exercises/day-4/Conversion_techniques/Word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 Word2Vec Exercise

## 🎯 Objective
- Train a **Word2Vec** model on a small text corpus.
- Convert words into **dense vector representations** (embeddings).
- Explore **semantically similar words** and basic **word relationships**.

---

## 📚 Dataset

Use the following **6 sentences** as your training corpus:

1. *I enjoy walking in the park.*
2. *Walking and running are good exercises.*
3. *I love jogging around the neighborhood.*
4. *Exercise keeps me healthy and energetic.*
5. *Morning walks help clear my mind.*
6. *The park is full of beautiful trees.*

## 🧩 Tasks

### 🔄 Preprocessing
- Convert all sentences to **lowercase**.
- **Tokenize** each sentence into individual words.
- **Remove punctuation** from the tokens.
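
A minimal sketch of these three steps in plain Python, assuming whitespace splitting is an acceptable stand-in for a real tokenizer such as NLTK's `word_tokenize`. The helper name `basic_preprocess` is hypothetical.

```python
import string

def basic_preprocess(sentence):
    """Lowercase, split on whitespace, then strip punctuation from each token."""
    tokens = sentence.lower().split()
    stripped = [tok.strip(string.punctuation) for tok in tokens]
    # Drop tokens that were pure punctuation and are now empty
    return [tok for tok in stripped if tok]

print(basic_preprocess("Walking and running are good exercises."))
# ['walking', 'and', 'running', 'are', 'good', 'exercises']
```

Note that `strip` only removes punctuation at the edges of a token, so a word such as `don't` keeps its apostrophe.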

---

### 🛠️ Train Word2Vec Model
- Use the **skip-gram** architecture (`sg=1`).
- Set:
  - `vector_size = 50`
  - `window = 3`
  - `epochs = 100`
- Input the **preprocessed tokenized sentences** into the model for training.
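
To make the `window` setting concrete, the sketch below lists the (center, context) pairs that skip-gram could draw from one sentence with `window = 3`. It is an illustration only: the helper `skipgram_pairs` is hypothetical, and gensim's actual sampling may differ (for example, it can randomly shrink the effective window per center word), so treat the result as the full set of candidate pairs rather than what is used in every epoch.

```python
def skipgram_pairs(tokens, window=3):
    """Return (center, context) pairs for tokens within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["i", "enjoy", "walking", "in", "the", "park"]
pairs = skipgram_pairs(tokens, window=3)

# Context words paired with the center word "walking"
print([ctx for center, ctx in pairs if center == "walking"])
# ['i', 'enjoy', 'in', 'the', 'park']
```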

---

### 🔍 Explore Embeddings

- **Extract and display** the vector embedding corresponding to the word `"walking"` from the trained Word2Vec model.

- **Identify and print** the top five words most similar to `"walking"`, ranked by cosine similarity.

- **Calculate and report** the cosine similarity score between the words `"walking"` and `"running"`.

- **Solve a word analogy** using vector arithmetic:  
  Determine the word that is most similar to the result of `"running"` + `"morning"` − `"walking"`.
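
The cosine similarity and vector-arithmetic steps can be sketched without any library, assuming small made-up vectors in place of trained embeddings. The names `cosine`, `analogy`, and `toy` are hypothetical; in the exercise itself these operations are done through the trained model's `most_similar` and `similarity` methods, which may normalize vectors differently.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), skipping the input words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Made-up 2-D vectors for illustration only; real embeddings come from the trained model
toy = {
    "walking": [1.0, 0.0],
    "running": [0.8, 0.6],
    "morning": [0.0, 1.0],
    "jogging": [0.6, 0.8],
    "walks": [-0.6, 0.8],
}

print(f"{cosine(toy['walking'], toy['running']):.4f}")    # 0.8000
print(analogy(toy, ["running", "morning"], ["walking"]))  # walks
```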

In [2]:
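import nltk
# Tokenizer data: 'punkt' for older NLTK releases, 'punkt_tab' for newer ones
nltk.download('punkt')
nltk.download('punkt_tab')
# Stopword list (downloaded here but not used by the solution below)
nltk.download('stopwords')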

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
# Install required packages if you have not already
!pip install gensim nltk -q

import string

import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# No-op if the previous cell already downloaded it; newer NLTK releases
# also need 'punkt_tab', which the previous cell fetches
nltk.download('punkt')

# Step 1: Corpus and Preprocessing
sentences = [
    "I enjoy walking in the park.",
    "Walking and running are good exercises.",
    "I love jogging around the neighborhood.",
    "Exercise keeps me healthy and energetic.",
    "Morning walks help clear my mind.",
    "The park is full of beautiful trees."
]

def preprocess(sent):
    # Lowercase
    sent = sent.lower()
    # Remove punctuation
    sent = sent.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    tokens = word_tokenize(sent)
    return tokens

tokenized_sentences = [preprocess(s) for s in sentences]

# Step 2: Train Word2Vec Model
# Note: no seed is fixed here, so the numbers printed below can vary between runs
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=50,
    window=3,
    min_count=1,  # consider all words
    sg=1,         # skip-gram
    epochs=100
)

# Step 3: Vector for 'walking'
print("Vector for 'walking' (first 10 dimensions):")
print(model.wv['walking'][:10])

# Step 4: Top 5 most similar words to 'walking'
print("\nTop 5 words similar to 'walking':")
for word, score in model.wv.most_similar('walking', topn=5):
    print(f"{word}: {score:.4f}")

# Step 5: Similarity between 'walking' and 'running'
similarity = model.wv.similarity('walking', 'running')
print(f"\nSimilarity between 'walking' and 'running': {similarity:.4f}")

# Step 6: Analogy: running + morning - walking
print("\nWords most similar to the analogy 'running' + 'morning' - 'walking':")
for word, score in model.wv.most_similar(positive=['running', 'morning'], negative=['walking'], topn=3):
    print(f"{word}: {score:.4f}")


Vector for 'walking' (first 10 dimensions):
[-0.01759433  0.00767909  0.01061761  0.0114562   0.0146957  -0.0129286
  0.00259938  0.0126877  -0.00618321 -0.01260977]

Top 5 words similar to 'walking':
me: 0.2180
and: 0.1931
exercises: 0.1745
help: 0.1718
the: 0.1668

Similarity between 'walking' and 'running': 0.1529

Words most similar to the analogy 'running' + 'morning' - 'walking':
in: 0.1845
mind: 0.1780
clear: 0.1610


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
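
The similarity scores above are all low (roughly 0.15 to 0.22) and the nearest neighbours include function words such as `me` and `the`. A likely explanation, offered here as an assumption rather than a measured result, is that the corpus gives the model very few co-occurrences per word. The sketch below counts tokens with the standard library only; the helper `tokens_of` is hypothetical and uses whitespace splitting instead of `word_tokenize`.

```python
from collections import Counter
import string

sentences = [
    "I enjoy walking in the park.",
    "Walking and running are good exercises.",
    "I love jogging around the neighborhood.",
    "Exercise keeps me healthy and energetic.",
    "Morning walks help clear my mind.",
    "The park is full of beautiful trees.",
]

def tokens_of(sentence):
    """Lowercase, remove punctuation, and split on whitespace."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

counts = Counter(tok for s in sentences for tok in tokens_of(s))
singletons = sum(1 for c in counts.values() if c == 1)

print(f"{len(counts)} unique words, {sum(counts.values())} tokens")  # 31 unique words, 37 tokens
print(f"{singletons} words occur only once")                         # 26 words occur only once
```

With most words seen a single time, the neighbours and analogy results from this exercise are best read as a demonstration of the API rather than as meaningful semantics.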
