# Day 62 – Word Embedding Algorithms: BoW, TF-IDF, and Word2Vec

Today, I'm exploring **Word Embedding** algorithms. These techniques are essential because machine learning models only understand numbers, so we need effective ways to convert human language (text) into numerical vectors while preserving the meaning and context.

---

## **Introduction**

In Natural Language Processing (NLP), machines need a way to understand and process **text data** — but computers can’t read text directly.
To make text understandable for algorithms, we must **convert it into numerical form**.

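This process is known as **Text Vectorization**. The term **Word Embedding** is often used loosely for it, although strictly it refers to dense, learned vectors such as those produced by Word2Vec.
Depending on the algorithm, the resulting **vectors (numerical arrays)** capture word frequency, word importance, or meaning and context.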

---

### **Why Word Embeddings?**

* Machine learning models work on **numerical input**, not raw text.
* Word embeddings help convert **text → numbers** while keeping **semantic meaning** (relationships between words).
* They help models understand:

  * How often words appear
  * How important they are in a document
  * How words relate to each other

---

## 1. Bag of Words (BoW)

### **Concept**

The **Bag of Words (BoW)** model is the simplest way to convert text into numbers.
It creates a **vocabulary** of all unique words in the dataset and represents each sentence as a **vector of word counts**.

* Each column represents a word.
* Each row represents a sentence or document.
* The value shows how many times a word appeared in that sentence.

> It ignores grammar, word order, and context — it only focuses on frequency.

### **How it Works**

1. Collect all unique words → create vocabulary
2. For each sentence, count how many times each word appears
3. Represent the counts in a table or matrix

Example:

```
Sentence 1: NLP makes machines understand text  
Sentence 2: Machines learn from text data  
```

| Word       | Sentence 1 | Sentence 2 |
| ---------- | ---------- | ---------- |
| NLP        | 1          | 0          |
| makes      | 1          | 0          |
| machines   | 1          | 1          |
| understand | 1          | 0          |
| text       | 1          | 1          |
| data       | 0          | 1          |
| learn      | 0          | 1          |
| from       | 0          | 1          |
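
To make the three steps concrete, here's a minimal from-scratch sketch for the two example sentences. The lowercase-and-split tokenizer and the alphabetically sorted vocabulary are my own simplifying assumptions, not part of any library.

```python
from collections import Counter

sentences = [
    "NLP makes machines understand text",
    "Machines learn from text data",
]

# Naive tokenizer (assumption): lowercase, then split on whitespace.
tokenized = [s.lower().split() for s in sentences]

# Step 1: the vocabulary is the sorted set of unique tokens.
vocabulary = sorted({tok for sent in tokenized for tok in sent})

# Steps 2-3: one count vector per sentence, aligned with the vocabulary.
bow_vectors = []
for sent in tokenized:
    counts = Counter(sent)
    bow_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vec in bow_vectors:
    print(vec)
```

Each position in a vector corresponds to the word at the same position in `vocabulary`, so the two lists together are the matrix from step 3.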

### **Limitations**

* **Sparsity**: If the vocabulary is large, most entries in each vector are zero, which produces very large, inefficient "sparse" vectors.
* **No Context/Meaning**: BoW ignores the **order** of words (and with it the grammar) and gives every word its own unrelated dimension. It therefore captures no semantic similarity: "king" is exactly as different from "queen" as it is from "apple".
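
A quick way to see the word-order limitation: two sentences with opposite meanings can end up with identical counts. This is a small sketch using a hypothetical `bow` helper and a naive whitespace tokenizer (both are my assumptions for illustration).

```python
from collections import Counter

def bow(sentence):
    """Return word counts for a sentence (naive lowercase whitespace tokenizer)."""
    return Counter(sentence.lower().split())

a = bow("the dog bit the man")
b = bow("the man bit the dog")

# Same words, same counts: BoW cannot tell these sentences apart.
print(a == b)
```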

---

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

sentences = [
    "Natural Language Processing makes machines understand text.",
    "Machines can learn from text data using NLP techniques.",
    "Text preprocessing is an important step in NLP."
]

# Initialize CountVectorizer
cv = CountVectorizer()

# Fit and transform sentences
bow_matrix = cv.fit_transform(sentences)

# Create a DataFrame to view results
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=cv.get_feature_names_out())

print(bow_df)
```

```
   an  can  data  from  important  in  is  language  learn  machines  makes  \
0   0    0     0     0          0   0   0         1      0         1      1
1   0    1     1     1          0   0   0         0      1         1      0
2   1    0     0     0          1   1   1         0      0         0      0

   natural  nlp  preprocessing  processing  step  techniques  text  \
0        1    0              0           1     0           0     1
1        0    1              0           0     0           1     1
2        0    1              1           0     1           0     1

   understand  using
0           1      0
1           0      1
2           0      0
```


---

## 2. TF-IDF (Term Frequency – Inverse Document Frequency)

### **Concept**

TF-IDF improves on BoW by not only counting words but also **weighting them** by importance.
It gives higher weight to **distinctive words** that appear in few documents and lower weight to **common words** like “the”, “is”, etc.

### **How it Works**

1.  **TF (Term Frequency)** – measures how often a word appears in a document.

    $$
    \text{TF} = \frac{\text{Number of times word appears}}{\text{Total words in document}}
    $$

2.  **IDF (Inverse Document Frequency)** – measures how rare a word is across all documents.

    $$
    \text{IDF} = \log\left(\frac{\text{Total documents}}{\text{Documents containing the word}}\right)
    $$

3.  **TF-IDF** is the final weight assigned to a word, calculated as the product of Term Frequency and Inverse Document Frequency:

    $$
    \text{TF-IDF} = \text{TF} \times \text{IDF}
    $$

This means:

* Words appearing often in one document but rarely in others get **high scores**.
* Common words appearing in all documents get **low scores**.
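
Here's a small worked example of the three formulas in plain Python. I'm assuming the natural logarithm (the formula above doesn't fix a base) and a tiny made-up three-document corpus. The helper names `tf`, `idf`, and `tf_idf` are my own, and `idf` assumes the word occurs in at least one document.

```python
import math

# Toy corpus (assumption): three already-tokenized documents.
docs = [
    "nlp makes machines understand text".split(),
    "machines learn from text data".split(),
    "text preprocessing is an important step".split(),
]

def tf(word, doc):
    """Term frequency: occurrences of the word divided by document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency with natural log; word must occur somewhere."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "text" is in all 3 documents, "machines" in 2, "nlp" in only 1.
for word in ["text", "machines", "nlp"]:
    print(word, round(tf_idf(word, docs[0], docs), 3))
```

All three words appear once in the first document, so their TF is the same. The score differences come entirely from IDF: "text" drops to zero because it appears everywhere, while "nlp" scores highest because only one document contains it.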


```python
# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer()

# Fit and transform sentences
tfidf_matrix = tfidf.fit_transform(sentences)

# Create DataFrame to view results
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())

print(tfidf_df.round(3))  # round for better readability
```

```
     an    can   data   from  important    in    is  language  learn  \
0  0.00  0.000  0.000  0.000       0.00  0.00  0.00     0.411  0.000
1  0.00  0.365  0.365  0.365       0.00  0.00  0.00     0.000  0.365
2  0.38  0.000  0.000  0.000       0.38  0.38  0.38     0.000  0.000

   machines  makes  natural    nlp  preprocessing  processing  step  \
0     0.312  0.411    0.411  0.000           0.00       0.411  0.00
1     0.278  0.000    0.000  0.278           0.00       0.000  0.00
2     0.000  0.000    0.000  0.289           0.38       0.000  0.38

   techniques   text  understand  using
0       0.000  0.243       0.411  0.000
1       0.365  0.216       0.000  0.365
2       0.000  0.224       0.000  0.000
```
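
Note that "text" appears in all three sentences yet still gets a non-zero score, which the plain TF × IDF formula would not give. As far as I understand scikit-learn's defaults, `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and then L2-normalizes each row. The sketch below recomputes the first row by hand under those assumptions. The regex mimics the default tokenizer (tokens of two or more word characters), and the helper names are mine.

```python
import math
import re
from collections import Counter

sentences = [
    "Natural Language Processing makes machines understand text.",
    "Machines can learn from text data using NLP techniques.",
    "Text preprocessing is an important step in NLP.",
]

# Lowercase and keep tokens of two or more word characters.
docs = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))

def smoothed_idf(word):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_row(doc):
    """Raw count * smoothed IDF, followed by L2 normalization of the row."""
    counts = Counter(doc)
    raw = {w: c * smoothed_idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

row0 = tfidf_row(docs[0])
print({w: round(v, 3) for w, v in sorted(row0.items())})
```

The rounded values for the first sentence (0.411 for words unique to it, 0.312 for "machines", 0.243 for "text") line up with row 0 of the table above.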


---

## 3. Word2Vec

### **Concept**

Unlike BoW and TF-IDF, which treat words as isolated tokens, **Word2Vec** captures the **semantic relationships** between words.
It converts each word into a **dense vector** (a list of numbers) that represents its **meaning and context**.

The model is a shallow **neural network**. When trained on a large corpus, it learns relationships like:

* “king” - “man” + “woman” ≈ “queen”
* “Paris” - “France” + “Italy” ≈ “Rome”

### **How it Works**

Word2Vec uses one of two main approaches:

1. **CBOW (Continuous Bag of Words)** – predicts a word from its surrounding context.
2. **Skip-gram** – predicts surrounding words given a single target word.
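
The difference between the two approaches is easiest to see in the training examples they generate. This is only a sketch of how the data is framed, with hypothetical helper names and a fixed window. Real Word2Vec implementations add further tricks (such as negative sampling) that are not shown here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context_word) pairs: the target predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """Return (context_words, target) pairs: the neighbours predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["deep", "learning", "is", "fun"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram produces one example per (word, neighbour) pair, while CBOW produces one example per word with all of its neighbours grouped as the input.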

### **Key Features**

* Learns a word's meaning from the **contexts** it appears in (each word still gets a single fixed vector)
* Produces **dense embeddings** (compact numeric representation)
* Words with similar meanings have **similar vectors**
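
"Similar vectors" is usually measured with cosine similarity. Here's a minimal sketch of that calculation. The three-dimensional vectors are hand-made toy values I picked for illustration, not learned embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (assumption), NOT output from a trained model.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(toy["king"], toy["queen"])
sim_king_apple = cosine_similarity(toy["king"], toy["apple"])
print(sim_king_queen, sim_king_apple)
```

Vectors pointing in nearly the same direction score close to 1, and unrelated directions score much lower. Gensim's `most_similar` ranks words by this same measure.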

### **Example with Gensim**

We train Word2Vec on a small custom corpus.
After training:

* We can view the **vocabulary** (all words it learned)
* We can find **similar words** (based on cosine similarity)
* We can check the **vector representation** for any word

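Illustrative result, of the kind a model trained on a large corpus might return (these scores are for illustration only and are not output from the code in this post; the toy corpus used below is far too small for that, which is why its actual scores are much lower and essentially noise):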

```
Words similar to 'learning':
[('machine', 0.89), ('deep', 0.83), ('nlp', 0.78)]
```

```python
from gensim.models import Word2Vec

# Sample corpus (list of tokenized sentences)
corpus = [
    ['machine', 'learning', 'is', 'fun'],
    ['deep', 'learning', 'is', 'a', 'part', 'of', 'machine', 'learning'],
    ['nlp', 'stands', 'for', 'natural', 'language', 'processing'],
    ['text', 'data', 'is', 'used', 'in', 'nlp'],
    ['word2vec', 'creates', 'word', 'embeddings', 'for', 'text', 'data']
]

# Train Word2Vec model (sg=0 selects CBOW; sg=1 would select Skip-gram)
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=0)

# Get vocabulary
print("Vocabulary:", list(model.wv.index_to_key), "\n")

# Find similar words
print("Words similar to 'learning':")
print(model.wv.most_similar('learning'))

# Display word vector for one word
print("\nWord vector for 'machine':")
print(model.wv['machine'])
```

```
Vocabulary: ['learning', 'is', 'for', 'data', 'nlp', 'text', 'machine', 'used', 'processing', 'language', 'natural', 'word', 'stands', 'in', 'of', 'part', 'a', 'deep', 'fun', 'word2vec', 'creates', 'embeddings'] 

Words similar to 'learning':
[('fun', 0.27054551243782043), ('of', 0.2105627804994583), ('word', 0.16703936457633972), ('in', 0.15023992955684662), ('used', 0.13204482197761536), ('for', 0.12669815123081207), ('language', 0.09986130893230438), ('creates', 0.07065954059362411), ('a', 0.059388983994722366), ('deep', 0.04984808713197708)]

Word vector for 'machine':
[-0.01648536  0.01859871 -0.00039532 -0.00393455  0.00920726 -0.00819063
  0.00548623  0.01387993  0.01213085 -0.01502159  0.0187647   0.00934362
  0.00793224 -0.01248701  0.01691996 -0.00430033  0.01765038 -0.01072401
 -0.01625884  0.01364912  0.00334239 -0.00439702  0.0190272   0.01898771
 -0.01954809  0.00501046  0.01231338  0.00774491  0.00404557  0.000861
  0.00134726 -0.00764127 -0.0142805  -0.00417774  0.0078478   0.01763737
  0.0185183  -0.01195187 -0.01880534  0.01952875  0.00685957  0.01033223
  0.01256469 -0.00560853  0.01464541  0.00566054  0.00574201 -0.00476074
 -0.0062565  -0.00474028]
```

---

## **Comparison Summary**

| Feature           | BoW           | TF-IDF                 | Word2Vec                                   |
| ----------------- | ------------- | ---------------------- | ------------------------------------------ |
| Captures meaning  | No            | No (importance only)   | Yes                                        |
| Context awareness | No            | No                     | Local window (one static vector per word)  |
| Output type       | Sparse matrix | Sparse matrix          | Dense vectors                              |
| Based on          | Frequency     | Frequency + Importance | Neural networks                            |
| Common use        | Simple models | Text classification    | Deep NLP models                            |

---

## **Conclusion**

In this session, I explored one of the most important concepts in Natural Language Processing — **Word Embeddings**, which help convert text data into numerical form that machines can understand.

I started by revisiting simple methods like **Bag of Words (BoW)** and **TF-IDF**, which focus on word frequency and importance, and then moved towards **Word2Vec**, which captures the *semantic meaning* and *contextual relationship* between words.

Each algorithm progressively improves the way textual information is represented:

* BoW focuses only on **counting words**.
* TF-IDF adds a measure of **importance** to each word.
* Word2Vec learns from **context**, so words used in similar contexts get similar vectors, helping machines capture word relationships and meanings.

These techniques form the foundation of modern NLP systems. Advanced language models like **BERT**, **GPT**, and **T5** build on the same idea of learned vector representations, but they use *contextual* embeddings: the vector for a word depends on the sentence it appears in, which lets them model language at a much deeper level.

---

## **Key Takeaways**

* **Word embeddings** are numerical representations of words that help machines understand text.

* **Bag of Words (BoW):**

  * Converts text into word count vectors.
  * Ignores grammar and word order.
  * Simple but effective for basic models.

* **TF-IDF (Term Frequency – Inverse Document Frequency):**

  * Weights words based on their frequency and rarity across documents.
  * Common words get lower weights; rare words get higher weights.
  * Useful for text classification and document similarity tasks.

* **Word2Vec:**

  * Neural network-based algorithm that captures semantic meaning.
  * Words appearing in similar contexts have similar vector representations.
  * Forms the basis for modern embedding-based NLP models.

* **Progression Summary:**
  BoW ➜ TF-IDF ➜ Word2Vec represents the evolution from **frequency-based** to **context-based** understanding of text.

* These embedding techniques are crucial before feeding text data into **Machine Learning** or **Deep Learning** models for NLP tasks like sentiment analysis, chatbots, and text summarization.

---