In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:

sentences = ["I love NLP", "NLP is fun NLP", "I love fun"]

### Binary Bag of Words

In [3]:
# binary=True records only whether each word is present (1) or absent (0)
vectorizer = CountVectorizer(binary=True)

# Fit and transform the sentences
x = vectorizer.fit_transform(sentences)

In [4]:
one_hot_matrix = x.toarray()
one_hot_matrix

array([[0, 0, 1, 1],
       [1, 1, 0, 1],
       [1, 0, 1, 0]])

In [5]:
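# Vocabulary in column order. The default tokenizer lowercases the text and
# drops single-character tokens, which is why "I" does not appear.
vocab = vectorizer.get_feature_names_out()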

In [6]:
import pandas as pd
pd.DataFrame(one_hot_matrix,columns=vocab)

   fun  is  love  nlp
0    0   0     1    1
1    1   1     0    1
2    1   0     1    0


### Normal Bag of Words

In [7]:
# binary=False (the default) keeps the number of times each word occurs
vectorizer = CountVectorizer(binary=False)

# Fit and transform the sentences
x = vectorizer.fit_transform(sentences)

In [8]:
# These are word counts rather than 0/1 flags, so name the matrix accordingly
count_matrix = x.toarray()
vocab = vectorizer.get_feature_names_out()

In [9]:
pd.DataFrame(count_matrix,columns=vocab)

   fun  is  love  nlp
0    0   0     1    1
1    1   1     0    2
2    1   0     1    0
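
The two tables differ only in the `nlp` column of the second sentence, where "NLP" occurs twice. As a minimal sketch (rebuilding both matrices from the same sentences, with variable names chosen here for illustration), the binary matrix can be checked to be the count matrix with every non-zero entry clipped to 1:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP", "NLP is fun NLP", "I love fun"]

# Same corpus, encoded once with counts and once with presence flags
counts = CountVectorizer(binary=False).fit_transform(sentences).toarray()
binary = CountVectorizer(binary=True).fit_transform(sentences).toarray()

# Clipping the counts at 1 should reproduce the binary encoding
same = np.array_equal(binary, (counts > 0).astype(int))
print(same)
```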




### **Advantages of Bag of Words**

* **Simple and intuitive**: Easy to understand and implement.
* **Fixed-size input**: Every text maps to a vector whose length equals the vocabulary size, regardless of how long the text is, which suits machine learning models that expect fixed-length input.
* **Works well for basic text classification tasks**: Often a solid baseline for tasks such as spam detection or sentiment analysis, where which words appear matters more than their context.
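
To illustrate the fixed-size point, here is a small sketch that fits on the three sentences used earlier and then transforms a short and a long text. The two input texts are made up for this example; both should come out with one row and four columns, one per vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["I love NLP", "NLP is fun NLP", "I love fun"])

# Illustrative inputs of very different lengths
short = vectorizer.transform(["fun"])
long = vectorizer.transform(["NLP is fun and I love that NLP is fun"])

# Both vectors have as many columns as the fitted vocabulary has words
print(short.shape, long.shape)
```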

---

### **Disadvantages of Bag of Words**

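* **Sparse vectors**: Each text uses only a small part of the vocabulary, so most entries are zero. Stored densely (for example after `.toarray()`), this wastes memory and computation on real-world vocabularies, which is why `fit_transform` returns a SciPy sparse matrix.
* **Overfitting**: Every vocabulary word becomes a feature, so with a large vocabulary and a small dataset a model can have more features than examples and overfit.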
* **Word order is ignored**: Loses syntactic structure and context; “dog bites man” and “man bites dog” are treated the same.
* **No semantic meaning captured**: Words like “good” and “great” are treated as entirely unrelated.
* **Out of Vocabulary (OOV) issue**: A word that was not seen during fitting has no column in the vocabulary, so `CountVectorizer` silently drops it when transforming new text.
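
Two of these limitations are easy to see directly. The sketch below fits on a single example sentence; the sentence containing "cat" is a made-up out-of-vocabulary case. It checks that reordering the words leaves the vector unchanged and that the unseen word leaves no trace in the output.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["dog bites man"])  # vocabulary: bites, dog, man

# Word order is ignored: both sentences contain the same words
a = vectorizer.transform(["dog bites man"]).toarray()
b = vectorizer.transform(["man bites dog"]).toarray()
order_lost = bool((a == b).all())

# "cat" was never seen during fitting, so it is dropped without warning
oov = vectorizer.transform(["cat bites man"]).toarray()

print(order_lost)
print(oov)
```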

