
# What Are Embeddings?

**Embeddings** are a way to represent text (**words, sentences, or documents**) as **numbers** so that computers can understand **meaning**, not just exact words.

### In Simple Terms
- Embeddings convert text into **vectors** (lists of numbers).
- These vectors **capture the meaning** of the text.


## Why Embeddings Are Needed

- Computers **cannot understand human language directly**.
- They can only process **numbers**.

### Limitations of Older Methods
- Traditional techniques like **TF-IDF** score words by **how often they occur**, weighted by how rare they are across documents.
- They match **exact words only** and **do not capture meaning or context**.

### How Embeddings Help
- **Embeddings** represent the **semantic meaning** of text.
- This allows machines to understand **similarity and context**, not just exact words.


## Simple Example (Human Meaning)

These two sentences mean the same thing to humans:

- *I love AI*  
- *I enjoy artificial intelligence*

**TF-IDF** → treats them as **different**, because they share almost no exact words  
**Embeddings** → understand they are **similar**, because the meanings match

**That’s the power of embeddings.**
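To make the TF-IDF side of this comparison concrete, here is a minimal hand-rolled sketch in plain Python. It is not scikit-learn's `TfidfVectorizer`: the whitespace tokenizer, the smoothed IDF formula, and the helper names `tfidf_vectors` and `cosine` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using a basic TF-IDF scheme."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


docs = ["I love AI", "I enjoy artificial intelligence", "The weather is sunny"]
vecs = tfidf_vectors(docs)

sim_1_2 = cosine(vecs[0], vecs[1])
sim_1_3 = cosine(vecs[0], vecs[2])
print("TF-IDF similarity (sentence 1 & 2):", sim_1_2)
print("TF-IDF similarity (sentence 1 & 3):", sim_1_3)
```

The first pair shares only the word *I*, so its score stays low even though the sentences mean the same thing. The pairs *love*/*enjoy* and *AI*/*artificial intelligence* contribute nothing because the exact words differ.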


## How Embeddings Work (Intuition)

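- Each word or sentence is converted into a fixed-length **vector**  
  (for example, **384** or **768** numbers, depending on the model).
- **Similar meanings** → vectors are **close together**.
- **Different meanings** → vectors are **far apart**.
- Closeness is usually measured with **cosine similarity**, which compares the direction of two vectors.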
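
The "close together" idea can be sketched with cosine similarity in plain Python. The three-dimensional vectors below are made up for illustration; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", invented for this example only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

sim_related = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_unrelated = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print("cat vs kitten:", round(sim_related, 3))
print("cat vs car:", round(sim_unrelated, 3))
```

Vectors pointing in nearly the same direction score close to 1, while vectors pointing in different directions score much lower.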


## Types of Embeddings

### 1. Word Embeddings
- Represent **individual words**.

**Examples:**
- Word2Vec  
- GloVe  

**Limitation:**
- Same word → same vector  
- No understanding of **context**
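
The limitation above follows from how static word embeddings are stored: as a lookup table with one row per word. A minimal sketch, using an invented table and a hypothetical `embed_word` helper rather than a trained model:

```python
# Hypothetical static embedding table: one fixed vector per word.
# The numbers are invented for illustration, not taken from a trained model.
static_embeddings = {
    "bank": [0.21, -0.40, 0.73],
    "river": [0.65, 0.10, -0.22],
    "money": [-0.31, 0.58, 0.44],
}


def embed_word(word, sentence):
    """Look up a word's vector. The sentence is deliberately ignored,
    which is exactly what a static word embedding does."""
    return static_embeddings[word]


finance = embed_word("bank", "I went to the bank to deposit money")
nature = embed_word("bank", "We sat near the bank of the river")

# Both lookups return the same vector, whatever the surrounding words are.
print(finance == nature)
```

Because the surrounding sentence never enters the lookup, both senses of *bank* receive identical vectors.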


### 2. Sentence Embeddings (Most Used Today)
- Represent **full sentences or paragraphs**.

**Examples:**
- Sentence Transformers  
- OpenAI embeddings  
- Qwen / DeepSeek embeddings  

**Used in:**
- Semantic search  
- RAG (Retrieval-Augmented Generation) systems  
- Chatbots

# Word Embeddings

In [1]:
from gensim.models import Word2Vec

In [2]:
sentences = [
    ["I", "went", "to", "the", "bank", "to", "deposit", "money"],
    ["The", "bank", "approved", "my", "loan"],
    ["The", "river", "bank", "was", "flooded"],
    ["We", "sat", "near", "the", "bank", "of", "the", "river"]
]

In [3]:
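# Train a small Word2Vec model on the toy corpus above.
model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of each word vector
    window=5,        # maximum distance between a word and its context words
    min_count=1,     # keep every word, even if it appears only once
    workers=1        # single training thread
)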

In [5]:
bank_vector = model.wv["bank"]

print("Vector length:", len(bank_vector))
print("First 10 values of 'bank' embedding:")
print(bank_vector[:10])

Vector length: 50
First 10 values of 'bank' embedding:
[-0.00107245  0.00047286  0.0102067   0.01801855 -0.0186059  -0.01423362
  0.01291774  0.01794598 -0.01003086 -0.00752674]


In [6]:
model.wv.most_similar("bank")

[('money', 0.21057455241680145),
 ('loan', 0.16704076528549194),
 ('approved', 0.150198832154274),
 ('sat', 0.13204392790794373),
 ('river', 0.12667091190814972),
 ('flooded', 0.09980078041553497),
 ('went', 0.05936766415834427),
 ('I', 0.04979119822382927),
 ('the', 0.042373016476631165),
 ('my', 0.04067763686180115)]

## Why This Happens (Important)

The word **“bank”** appears in training data with **multiple meanings**:

### Financial Bank
- money  
- loan  
- approved  

### River Bank
- river  
- flooded  

### What Word2Vec Does
- Word2Vec creates **one single vector** for the word **“bank”**.
- Both meanings are **merged into that same vector**.

### Result
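The model thinks:

> **“bank” is related to both money and river**

This is why **finance-related** and **river-related** words appear **together** in the `most_similar` output.

Note that with only four training sentences, all the scores are low (at most about 0.21) and largely reflect random initialization, so the exact ranking is not meaningful. The point is structural: “bank” has exactly one vector, no matter which sense is intended.

This is a key limitation of **word embeddings without context**.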


# Sentence Transformers

In [11]:
from sentence_transformers import SentenceTransformer




In [12]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [13]:
sentences = [
    "I love AI",
    "I enjoy artificial intelligence",
    "The weather is sunny"
]

embeddings = model.encode(sentences)

print(embeddings.shape)

(3, 384)


In [None]:
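from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity expects 2D inputs, so each embedding is wrapped in a list.
# Sentences 1 and 2 have the same meaning but different words.
similarity_1_2 = cosine_similarity(
    [embeddings[0]], [embeddings[1]]
)

# Sentences 1 and 3 are about unrelated topics.
similarity_1_3 = cosine_similarity(
    [embeddings[0]], [embeddings[2]]
)

# Each result is a 1x1 matrix, so [0][0] extracts the score.
print("Similarity (sentence 1 & 2):", similarity_1_2[0][0])
print("Similarity (sentence 1 & 3):", similarity_1_3[0][0])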

Similarity (sentence 1 & 2): 0.81963444
Similarity (sentence 1 & 3): -0.0033589536


## Key Takeaway

- **Sentence embeddings** represent **meaning**, not just individual words.
- **Similar meanings** → **similar vectors**.
- They are **essential** for modern **NLP** and **LLM pipelines**.
- **RAG (Retrieval-Augmented Generation) systems** rely heavily on **sentence embeddings**.
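
The retrieval step that semantic search and RAG rely on can be sketched as "embed the query, then rank documents by similarity". The sketch below uses made-up three-dimensional vectors so it runs without a model; in practice the vectors would come from something like `model.encode(...)`. The helper name `rank_documents` and the example documents are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, doc_vectors):
    """Return (document index, score) pairs, most similar first."""
    scored = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(doc_vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


documents = [
    "How to reset a password",
    "Store opening hours",
    "Changing your login credentials",
]

# Invented vectors standing in for real sentence embeddings.
doc_vectors = [
    [0.9, 0.1, 0.2],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.3],
]
# Stands in for the embedding of a query such as "I forgot my password".
query_vector = [0.85, 0.15, 0.25]

ranking = rank_documents(query_vector, doc_vectors)
for index, score in ranking:
    print(round(score, 3), documents[index])
```

A RAG system would pass the top-ranked documents to the language model as context. Production systems usually replace the linear scan with a vector index, but the ranking idea is the same.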

# Why do sentence embeddings work better than word embeddings for semantic search?
Sentence embeddings work better than word embeddings for semantic search because they capture the meaning of the entire sentence in context, while word embeddings represent individual words without context.

Static word embeddings such as Word2Vec assign the same vector to a word everywhere, so different meanings get mixed. Sentence embedding models encode the whole sentence at once, so the resulting vector depends on how the words appear together. This lets them capture intent, synonyms, and context, which leads to more accurate semantic search results.
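
One way to see the gap is to build a sentence vector by averaging static word vectors. This is a common baseline, assumed here for illustration; the notebook above does not use it. The word vectors and the `average_embedding` helper below are hypothetical.

```python
import math

# Hypothetical static word vectors, invented for this example.
word_vectors = {
    "dog": [1.0, 0.0, 0.5],
    "bites": [0.0, 1.0, 0.25],
    "man": [0.5, 0.5, 0.0],
}


def average_embedding(sentence):
    """Average the static vectors of the words in a sentence."""
    vectors = [word_vectors[word] for word in sentence.lower().split()]
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


a = average_embedding("dog bites man")
b = average_embedding("man bites dog")

# Averaging ignores word order, so both sentences get the same vector.
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(same)
```

The two sentences describe opposite events, yet the averaged vectors are identical. A sentence embedding model reads the words in order and in context, so it is not bound to this limitation.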