In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
sentences = ["I love NLP", "NLP is fun", "I love fun"]

In [None]:
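# binary=True records the presence (1) or absence (0) of each vocabulary word
# in a sentence instead of its count. The default tokenizer lowercases text
# and drops single-character tokens such as "I".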
vectorizer = CountVectorizer(binary=True)

# Fit and transform the sentences
x = vectorizer.fit_transform(sentences)

In [None]:
one_hot_matrix = x.toarray()
one_hot_matrix

array([[0, 0, 1, 1],
       [1, 1, 0, 1],
       [1, 0, 1, 0]], dtype=int64)
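
Before `toarray()` is called, `fit_transform` returns a SciPy sparse matrix that stores only the non-zero entries. The sketch below rebuilds the same matrix and inspects it. The name `X_sparse` is introduced here for illustration and is not part of the original notebook.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP", "NLP is fun", "I love fun"]

# Same setup as the cells above
vectorizer = CountVectorizer(binary=True)
X_sparse = vectorizer.fit_transform(sentences)

print(issparse(X_sparse))   # the result is a SciPy sparse matrix
print(X_sparse.shape)       # (number of sentences, vocabulary size)
print(X_sparse.nnz)         # number of explicitly stored non-zero entries

# toarray() materializes every cell, zeros included
dense = X_sparse.toarray()
print(dense.size)           # total number of cells in the dense array
```

For this toy corpus the difference is negligible, but with a large vocabulary it is usually safer to keep the sparse form and call `toarray()` only for inspection.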

In [None]:
vocab = vectorizer.get_feature_names_out()

In [None]:
import pandas as pd
pd.DataFrame(one_hot_matrix,columns=vocab)

   fun  is  love  nlp
0    0   0     1    1
1    1   1     0    1
2    1   0     1    0


### **Advantages of One-Hot Encoding**

* **Simple to implement**: Easy to understand and apply to small vocabularies.
* **Works well with algorithms** that assume categorical input (e.g., decision trees, naive Bayes).

---

### **Disadvantages of One-Hot Encoding**

* **Sparse Matrix**:
  It creates high-dimensional vectors in which almost every element is 0 (e.g., with a vocabulary of 10,000 words, each word is a 10,000-length vector containing a single 1). Stored densely, this wastes memory and slows down computation.

* **No Semantic Meaning**:
  One-hot vectors don’t capture relationships or meanings between words. For example:

  * "king" and "queen" are just as different as "king" and "banana".
  * No similarity or context is retained.

* **No Fixed Vector Size Across Datasets**:
  The vector size equals the vocabulary size. If new data introduces new words, the vocabulary has to be rebuilt, the data re-vectorized, and the model retrained, which makes the approach hard to scale or generalize.

* **Out of Vocabulary (OOV) Problem**:
  If the model sees a word at inference time that wasn't in the training vocabulary, it can't be encoded. Depending on the implementation, the word is either silently dropped (loss of information) or raises an error.
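
Two of the points above can be checked with a short sketch. It assumes a hypothetical three-word vocabulary (`banana`, `king`, `queen`) for the similarity check and reuses the notebook's sentences for the OOV check. The helper `cosine` is defined here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# --- No semantic meaning ---
# Hypothetical toy vocabulary, used only for illustration
toy_vocab = ["banana", "king", "queen"]
one_hot = np.eye(len(toy_vocab), dtype=int)
banana, king, queen = one_hot

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any two distinct one-hot vectors are orthogonal, so both similarities are equal
sim_king_queen = cosine(king, queen)
sim_king_banana = cosine(king, banana)
print(sim_king_queen, sim_king_banana)

# --- Out-of-vocabulary words ---
vectorizer = CountVectorizer(binary=True)
vectorizer.fit(["I love NLP", "NLP is fun", "I love fun"])

# None of these words were seen during fit
unseen = vectorizer.transform(["transformers are powerful"])
print(unseen.toarray())
```

In scikit-learn's `CountVectorizer`, unseen words are ignored by `transform` rather than raising an error, so a sentence made up entirely of new words becomes an all-zero row.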
