# 📘 NLP Notes - Bag of Words (BoW) 🔡

## 📍 Lecture Overview

- ✅ Review: One-Hot Encoding
- 🚀 Introduction to Bag of Words
- 🛠️ Step-by-step BoW implementation
- 📊 Binary BoW vs Count BoW
- ✅ Advantages & ❌ Disadvantages
- 💡 Real-world considerations
- 🧠 Semantic issues & limitations

---

## 🔁 Recap: From One-Hot Encoding to Bag of Words

In **One-Hot Encoding**, we represented **each word** with a binary vector whose length equals the **vocabulary size**. Every **word** got its own vector, so a sentence became a stack of vectors, one per word.

**Limitations we faced:**
- Sparse matrices
- Variable-length input
- No semantic meaning
- Out-of-vocabulary issues
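
As a quick reminder of where the variable-length and out-of-vocabulary problems come from, here is a minimal sketch of per-word one-hot encoding. The toy vocabulary and the helper names `one_hot` and `encode_sentence` are assumptions made for illustration only.

```python
# Hypothetical toy vocabulary; the order is arbitrary but must stay fixed.
vocab = ["he", "is", "a", "good", "boy"]

def one_hot(word, vocab):
    """Return a binary vector with a 1 at the word's vocabulary position."""
    return [1 if word == v else 0 for v in vocab]

def encode_sentence(sentence, vocab):
    """One-hot encode every word: the result has one row per word."""
    return [one_hot(w, vocab) for w in sentence.lower().split()]

short_matrix = encode_sentence("good boy", vocab)
long_matrix = encode_sentence("He is a good boy", vocab)

# Different sentence lengths give a different number of rows.
print(len(short_matrix), len(long_matrix))

# A word outside the vocabulary maps to an all-zero vector.
print(one_hot("girl", vocab))
```

The number of rows depends on the sentence length, which is why a model that expects fixed-size input cannot consume these matrices directly.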

So we now move to a method that represents an **entire sentence or document** with a single vector: **Bag of Words (BoW)**.

---

## 🧪 Step-by-Step: How Bag of Words Works

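### ✏️ Step 1: Sample Sentences

We start with a simple dataset:

- S1: He is a good boy
- S2: She is a good girl
- S3: Boy and girl are good

We lowercase each sentence and remove stop words (`he`, `she`, `is`, `a`, `and`, `are`), which leaves:

- S1: good boy
- S2: good girl
- S3: boy girl good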
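
The cleaning step (lowercasing and stop-word removal) can be sketched in a few lines. The stop-word set here is an assumption: it contains just enough words for this toy dataset, and a real project would more likely use a library-provided list.

```python
# Assumed stop-word list, just large enough for the three sample sentences.
STOP_WORDS = {"he", "she", "is", "a", "and", "are"}

sentences = [
    "He is a good boy",
    "She is a good girl",
    "Boy and girl are good",
]

def clean(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

cleaned = [clean(s) for s in sentences]

for tokens in cleaned:
    print(" ".join(tokens))
```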


### 🧾 Step 2: Build Vocabulary

From the cleaned sentences, we extract unique words:

`Vocabulary = ['good', 'boy', 'girl']`


We also note **frequencies**:

| Word  | Frequency |
|-------|-----------|
| good  | 3         |
| boy   | 2         |
| girl  | 2         |

These counts help to **prioritize features** or to **limit the vocabulary** (e.g., keep only the 10 most frequent words).
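
A minimal sketch of this step using `collections.Counter` from the standard library. The cleaned token lists are hard-coded here, and the `max_features` cap of 2 is a hypothetical value chosen only to show how a vocabulary limit would work.

```python
from collections import Counter

# Cleaned sentences from the previous step, already tokenized.
cleaned = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

# Count how often each word occurs across the whole dataset.
counts = Counter(word for doc in cleaned for word in doc)

# Rank by descending frequency; words with equal counts keep first-seen order.
vocabulary = [word for word, _ in counts.most_common()]

# Optionally cap the vocabulary at the most frequent words (hypothetical cap).
max_features = 2
limited_vocabulary = vocabulary[:max_features]

print(vocabulary)
print(dict(counts))
print(limited_vocabulary)
```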

---

### 🔢 Step 3: Convert Sentences to Vectors

We now build **BoW vectors** based on vocabulary order.

#### 📄 Sentence 1: good boy → Vector: [1, 1, 0]
#### 📄 Sentence 2: good girl → Vector: [1, 0, 1]
#### 📄 Sentence 3: boy girl good → Vector: [1, 1, 1]

Each vector represents **presence (1) or absence (0)** of each word in the sentence.

> This is also known as **Binary Bag of Words**
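
The three vectors above can be reproduced with a short function. This is only a sketch: the name `binary_bow` is made up for illustration, and the vocabulary order is fixed by hand to match the notes.

```python
vocabulary = ["good", "boy", "girl"]
cleaned = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

def binary_bow(tokens, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0)."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

vectors = [binary_bow(doc, vocabulary) for doc in cleaned]

for doc, vec in zip(cleaned, vectors):
    print(" ".join(doc), "->", vec)
```

Every vector has the same length as the vocabulary, no matter how many words the sentence contains.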

---

## 🆚 Count BoW vs Binary BoW

Let’s take this sentence:

`good girl girl`

### 🧮 Count BoW:

Vector → `[1, 0, 2]  # good (1), boy (0), girl (2)`

### ⚡ Binary BoW:

Vector → `[1, 0, 1]  # Regardless of repetitions`


So,
- **Binary BoW**: Uses 1/0 (present or not)
- **Count BoW**: Uses actual **frequency count**
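
Both variants can be sketched side by side for the sentence `good girl girl`. The function names `count_bow` and `binary_bow` are illustrative, not from any library.

```python
from collections import Counter

vocabulary = ["good", "boy", "girl"]
tokens = "good girl girl".split()

def count_bow(tokens, vocabulary):
    """Count BoW: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def binary_bow(tokens, vocabulary):
    """Binary BoW: clip every count to at most 1."""
    return [min(c, 1) for c in count_bow(tokens, vocabulary)]

count_vec = count_bow(tokens, vocabulary)
binary_vec = binary_bow(tokens, vocabulary)

print(count_vec)   # [1, 0, 2]
print(binary_vec)  # [1, 0, 1]
```

In scikit-learn, the same switch is available through the `binary=True` parameter of `CountVectorizer`.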

---

## ✅ Advantages of Bag of Words

1. **Simple & Intuitive**
   - Easy to implement (e.g., using `CountVectorizer` from `sklearn`).

2. **Fixed-Length Vectors**
   - Unlike per-word one-hot encoding, BoW gives every document a vector of the same length (the vocabulary size), regardless of how many words the document contains.

3. **Effective for Text Classification**
   - Works well for spam/ham detection, sentiment analysis, etc.

---

## ❌ Disadvantages of Bag of Words

1. **Still a Sparse Matrix**
   - For large vocabularies (e.g., 50,000 words), most entries will still be 0 → inefficient.

2. **No Semantic Understanding**
   - “great” and “awesome” are treated as different, unrelated words.

3. **Out-of-Vocabulary (OOV)**
   - New words not seen during training (e.g., “school”) are ignored at test time.

4. **Loss of Word Order**
   - "The food is good" vs "The food is not good" → the vectors differ only in the "not" dimension, so the two look similar in BoW.
   - Can lead to wrong interpretations (e.g., missing negation).

5. **Misleading Similarity**
   - Cosine similarity between:
     - “The food is good”
     - “The food is not good”  
     may be **high**, even though the **sentiments are opposite**.
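
Two of these drawbacks are easy to see numerically. The sketch below assumes plain whitespace tokenization with no stop-word removal; the helpers `count_vector` and `cosine` are illustrative names, not library functions.

```python
import math
from collections import Counter

def count_vector(tokens, vocab):
    """Count BoW vector; tokens outside the vocabulary are silently dropped."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Misleading similarity: the two sentences differ only by "not".
pos = "the food is good".split()
neg = "the food is not good".split()
vocab = sorted(set(pos) | set(neg))

similarity = cosine(count_vector(pos, vocab), count_vector(neg, vocab))
print(round(similarity, 3))  # 0.894

# OOV: "school" is not in the training vocabulary, so it leaves no trace.
train_vocab = ["good", "boy", "girl"]
oov_vec = count_vector("good school".split(), train_vocab)
print(oov_vec)  # [1, 0, 0]
```

A similarity of about 0.89 for two sentences with opposite sentiment shows why BoW vectors alone can mislead a similarity-based model.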

---

## 🔁 Summary Table

| Feature                 | Bag of Words       |
|-------------------------|--------------------|
| Easy to Implement       | ✅ Yes             |
| Sparse Matrix           | ⚠️ Yes             |
| Fixed Input Size        | ✅ Yes             |
| Semantic Understanding  | ❌ No              |
| Word Order Preserved    | ❌ No              |
| Handles OOV             | ❌ Poorly          |
| Frequency-Based Options | ✅ Yes (Count BoW) |

---

## 💻 Python Example: Using `CountVectorizer`

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences
docs = [
    "He is a good boy",
    "She is a good girl",
    "Boy and girl are good"
]

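# CountVectorizer lowercases text by default. Passing stop_words="english"
# also drops common words (he, she, is, and, are), so the vocabulary
# matches the cleaned sentences used in these notes.
cv = CountVectorizer(stop_words="english")

# Learn the vocabulary and build the sparse document-term count matrix
X = cv.fit_transform(docs)

# Columns come back in alphabetical order, not the order used in the notes
print(cv.get_feature_names_out())

# Convert the sparse matrix to a dense array: one row per sentence
print(X.toarray())
```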