# **🧠 Word Embeddings**

## 📌 What are Word Embeddings?

> **Word Embeddings** are a technique in Natural Language Processing (NLP) that **converts words into numerical vectors**, so that words with **similar meanings** have **similar vectors**.

🧍 Humans understand words by their meaning.  
🤖 Computers only understand numbers.  
✅ So, **word embeddings help machines understand language** by turning words into **meaningful numbers**.

---

## 🧠 Detailed Definition of Word Embeddings
Word Embeddings are a type of word representation that allows words with similar meaning to have similar numerical representations (vectors).

In Natural Language Processing (NLP), word embeddings are used to transform words into dense vectors of real numbers, where the position of each word in the vector space is learned from a large amount of text. The idea is to capture the meaning of words, their syntactic roles, and their semantic relationships with other words.

---

## 🔍 Why are Word Embeddings special?

Unlike Bag of Words or TF-IDF, word embeddings capture the **meaning**, **context**, and **relationship** between words.

| 🧱 Traditional Methods   | 🚀 Word Embeddings                     |
|--------------------------|----------------------------------------|
| Just count word frequency| Capture **meaning and context**        |
| Vectors are large/sparse | Vectors are small/dense/fixed-size     |
| Ignores word meaning     | Understands **similar and related words** |
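
To make the left column of the table concrete, here is a minimal Bag of Words sketch in plain Python. The two example sentences and the helper name `bag_of_words` are made up for illustration. Notice that the vector length equals the vocabulary size, and that "dog" and "puppy" land in unrelated slots, so the counts say nothing about their meaning.

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word appears in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

# Hypothetical mini-corpus; a real vocabulary has tens of thousands of words.
docs = ["the dog barked", "the puppy barked loudly"]

# One slot per distinct word, in alphabetical order.
vocab = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

With a real vocabulary these count vectors become very long and mostly zero, which is the "large/sparse" problem that dense embeddings avoid.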

---

## 🧠 Example:

Let’s take four words:

- **king**
- **queen**
- **man**
- **woman**

Word embeddings understand the relationship:

**king - man + woman ≈ queen**



This works because word embeddings capture **gender and role** relationships in the vector space.
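
The arithmetic can be tried on a toy example. The sketch below uses hand-built 2-D vectors where the first number stands for "royalty" and the second for "gender". These values and the extra words are assumptions chosen for illustration; real embeddings are learned from text and their dimensions are not this easy to label.

```python
# Hand-built 2-D toy vectors: [royalty, gender]. Illustrative values only.
toy = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  1.0],
    "girl":   [0.0, -0.8],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping the inputs."""
    target = [x - y + z for x, y, z in zip(toy[a], toy[b], toy[c])]

    def squared_distance(word):
        return sum((p - q) ** 2 for p, q in zip(toy[word], target))

    candidates = [word for word in toy if word not in (a, b, c)]
    return min(candidates, key=squared_distance)

print(analogy("king", "man", "woman"))  # queen
```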

---

## 📈 How do they represent words?

Each word is turned into a list of numbers (vector):

| Word   | Vector (shortened)         |
|--------|-----------------------------|
| dog    | [0.73, 0.14, -0.59, ...]    |
| puppy  | [0.75, 0.12, -0.55, ...]    |
| table  | [-0.22, 0.91, 0.48, ...]    |

✅ Similar words (dog & puppy) have **similar vectors**  
❌ Unrelated words (dog & table) have **different vectors**
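
A common way to measure how similar two vectors are is **cosine similarity** (close to 1 means the vectors point the same way, around 0 or below means unrelated). The sketch below treats the first three numbers from the table as if they were complete vectors, which is an assumption made only for illustration; real embeddings have many more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First three components from the table above, used as toy 3-D vectors.
dog = [0.73, 0.14, -0.59]
puppy = [0.75, 0.12, -0.55]
table = [-0.22, 0.91, 0.48]

print(f"dog vs puppy: {cosine(dog, puppy):.3f}")  # close to 1
print(f"dog vs table: {cosine(dog, table):.3f}")  # below 0
```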

---

## 📦 Common Types of Word Embeddings:

| Method       | Description                                      |
|--------------|--------------------------------------------------|
| **Word2Vec** | Learns embeddings using neural networks          |
| **GloVe**    | Learns embeddings from global word co-occurrence counts |
| **FastText** | Understands even rare or unknown words (subwords)|
| **BERT**     | Context-aware embeddings (meaning changes by context) |
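
The "subwords" idea behind FastText can be sketched in a few lines. The helper `char_ngrams` below is a hypothetical name, and it only shows the splitting step: a word is wrapped in `<` and `>` boundary markers and cut into character n-grams. FastText itself uses a range of n-gram lengths and combines their vectors, which this sketch does not attempt.

```python
def char_ngrams(word, n=3):
    """Split a word, wrapped in '<' and '>' markers, into character n-grams."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share pieces, so an unseen form can borrow from a known one.
shared = sorted(set(char_ngrams("walk")) & set(char_ngrams("walking")))
print(shared)
# ['<wa', 'alk', 'wal']
```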

---

## 💡 Why are Word Embeddings important?

They are used to:

- Understand word **similarity** and **relationships**
- Improve performance of deep learning models (LSTM, RNN, BERT)
- Power applications like:
  - ✅ Chatbots  
  - ✅ Search engines  
  - ✅ Translation tools  
  - ✅ Sentiment analysis  
  - ✅ Text classification

---

## ✅ Summary in One Line

> **Word Embeddings** turn words into meaningful numbers that reflect their **context and meaning**, helping machines process language like humans.


---

# **🧠 Word2Vec – Easy and Practical Explanation**

## 📌 What is Word2Vec?

> **Word2Vec** is a technique developed by Google that **learns word embeddings**: it turns words into **vectors of numbers** that capture their **meaning, context, and relationships** with other words.

It helps the computer understand that:
- "king" is related to "queen"
- "cat" and "dog" are both animals
- "walk" and "walking" are similar

And it learns all of this automatically from raw text!

---

## 🔹 Why is it called "Word2Vec"?

- Word2Vec = **Word to Vector**
- It **converts a word into a vector** (a list of real numbers such as [0.2, -0.4, 0.7, ...])

---

## **🧠 How does Word2Vec work?**

Word2Vec trains a small **neural network** on a large text corpus using one of two methods:

### 1. **CBOW (Continuous Bag of Words)**

> Predicts a word from its context

**Example:**  
Input: “I ___ NLP”  
Output: “love”

✅ CBOW learns:  
👉 "If the context is 'I' and 'NLP', the likely word is 'love'."
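
Here is a minimal sketch of how CBOW training examples could be built from a sentence. The helper name `cbow_pairs` and the `window=1` setting are assumptions for illustration. It only prepares (context, target) pairs and does not train a network.

```python
def cbow_pairs(tokens, window=1):
    """Build (context_words, target_word) pairs, one per position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # words before the target
        right = tokens[i + 1:i + 1 + window]     # words after the target
        pairs.append((left + right, target))
    return pairs

for context, target in cbow_pairs(["I", "love", "NLP"]):
    print(context, "->", target)
# ['love'] -> I
# ['I', 'NLP'] -> love
# ['love'] -> NLP
```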

---

### 2. **Skip-Gram**

> Predicts context from a word

**Example:**  
Input: “love”  
Output: likely neighbors → “I”, “NLP”

✅ Skip-Gram learns:  
👉 "If I see 'love', the surrounding words might be 'I' or 'NLP'."
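
Skip-Gram flips the direction: each word is paired with every neighbor inside the window. The sketch below uses a hypothetical helper `skipgram_pairs` with `window=1`. As before, it only builds the (center, neighbor) training pairs and does no training.

```python
def skipgram_pairs(tokens, window=1):
    """Build (center_word, neighbor_word) pairs within the context window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own neighbor
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["I", "love", "NLP"]))
# [('I', 'love'), ('love', 'I'), ('love', 'NLP'), ('NLP', 'love')]
```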

---

## **🎯 What does Word2Vec learn?**

It learns vectors (embeddings) where:

- Similar words have similar vector representations
- It captures relationships like:

**king - man + woman ≈ queen**

These vectors are then used in:

- Chatbots 🤖
- Search engines 🔍
- Translators 🌍
- Sentiment analysis 😊😢

---

## 🛠️ Key Parameters

| Parameter     | Description                                             |
|---------------|---------------------------------------------------------|
| `vector_size` | Number of dimensions in word vectors (e.g., 100)        |
| `window`      | Context window size (how many words to the left/right)  |
| `min_count`   | Ignore words that appear fewer times than this value    |

---

## ✅ Summary (One Line)

> **Word2Vec turns words into meaningful vectors using their context**, allowing machines to understand and compare words like humans do.


In [None]:
%pip install gensim wget

import os

import wget
from gensim.models import KeyedVectors

# Download the pretrained Google News vectors (a large file, over 1 GB),
# skipping the download if the file is already present
url = "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
filename = "GoogleNews-vectors-negative300.bin.gz"
if not os.path.exists(filename):
    filename = wget.download(url)

# Load the vectors (this takes some time and several GB of memory)
model = KeyedVectors.load_word2vec_format(filename, binary=True)

# Print the words most similar to 'computer'
print(model.most_similar('computer'))

In [None]:
# Using the model loaded in the previous cell, access word vectors like this:

# Check the shape of a word vector (the Google News vectors have 300 dimensions)
print(model['man'].shape)  # (300,)

# Other examples of using the word vectors:
# Get the vector for a word
vector = model['woman']

# Find most similar words
similar_words = model.most_similar('king', topn=5)
print(similar_words)

# Word analogy: king - man + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)  # Should output something like [('queen', 0.7...)]

In [None]:
print(model['woman'].shape)  # (300,)

In [None]:
# Word similarity examples
similarity_man_woman = model.similarity('man', 'woman')
print(f"Similarity between 'man' and 'woman': {similarity_man_woman:.4f}")
# Output shown: 0.7664

similarity_man_php = model.similarity('man', 'PHP')
print(f"Similarity between 'man' and 'PHP': {similarity_man_php:.4f}")
# Output shown: -0.0330 (shows they're unrelated)

# Odd one out example
odd_one = model.doesnt_match(['PHP', 'java', 'monkey'])
print(f"The odd one out is: {odd_one}")
# Output shown: 'monkey' (as it's not a programming language)

# Word analogy computed by hand: king - man + woman
vector = model['king'] - model['man'] + model['woman']
analogy_results = model.most_similar([vector], topn=3)

print("\nWord analogy results (king - man + woman):")
for word, score in analogy_results:
    print(f"{word}: {score:.4f}")
# Note: a raw query vector does not exclude the input words, so 'king' itself
# is likely to rank first here, with 'queen' close behind

In [None]:
# Currency analogy: INR is to India as GBP is to England
vec = model['INR'] - model['India'] + model['England']
currency_analogy_results = model.most_similar([vec], topn=5)

print("Currency analogy results (INR - India + England):")
for word, score in currency_analogy_results:
    print(f"{word}: {score:.4f}")

# Expected output similar to:
# INR: 0.6442
# GBP: 0.5841
# England: 0.4465
# ... (other related terms)