# Vectorization Techniques in NLP

In Natural Language Processing (NLP), text data must be converted into numerical form (vectors) before applying machine learning models.  
This process is called **Vectorization**.

## 1. Count Vectorizer
- Represents text as a **bag of words**: only word counts are kept, not word order.
- Each document becomes a vector in which each element is the number of times a vocabulary word occurs in that document.

### How it works
- Converts a collection of text documents into a **Document-Term Matrix (DTM)**.
- **Rows** = documents  
- **Columns** = unique words in the corpus (vocabulary)  
- **Values** = count of each word in the document  
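
The mapping above can be sketched by hand with only the standard library. This is a minimal illustration, not how scikit-learn implements it: the helper name `build_dtm` is hypothetical, and tokenizing with a lowercase whitespace split is an assumption made to keep the example short.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, matrix) for a list of documents.

    Assumption: tokens are produced by a lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Columns: every unique word in the corpus, sorted for a stable order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words the document does not contain
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["I like NLP", "I like Machine Learning", "NLP is fun"]
vocabulary, matrix = build_dtm(docs)
print(vocabulary)
for row in matrix:  # one row per document
    print(row)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary is large.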


```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
docs = ["I like NLP", "I like Machine Learning", "NLP is fun"]

# Initialize count vectorizer
cv = CountVectorizer()
X = cv.fit_transform(docs)

# print dictionary of words with their index
print("Vocabulary:", cv.vocabulary_)
print("Count Matrix:\n", X.toarray())
```

Output:

```
Vocabulary: {'like': 3, 'nlp': 5, 'machine': 4, 'learning': 2, 'is': 1, 'fun': 0}
Count Matrix:
 [[0 0 0 1 0 1]
 [0 0 1 1 1 0]
 [1 1 0 0 0 1]]
```


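Example corpus:

- Doc1: "I like NLP"
- Doc2: "I like Machine Learning"
- Doc3: "NLP is fun"

Vocabulary = {"I", "like", "NLP", "Machine", "Learning", "is", "fun"}  

This is a conceptual view that keeps every token with its original casing. `CountVectorizer` lowercases text and, by default, ignores single-character tokens such as "I", which is why its output above has six columns rather than seven.

Conceptually, the **matrix** looks like this: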

| Document | I | like | NLP | Machine | Learning | is | fun |
|----------|---|------|-----|---------|----------|----|-----|
| Doc1     | 1 | 1    | 1   | 0       | 0        | 0  | 0   |
| Doc2     | 1 | 1    | 0   | 1       | 1        | 0  | 0   |
| Doc3     | 0 | 0    | 1   | 0       | 0        | 1  | 1   |
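
The seven columns in this table versus the six in the `CountVectorizer` output come down to tokenization. The sketch below mimics both behaviours with the standard library. It assumes that scikit-learn's default token pattern requires at least two word characters and that text is lowercased by default. The helper `collect_vocabulary` is a hypothetical name used only for illustration.

```python
import re

docs = ["I like NLP", "I like Machine Learning", "NLP is fun"]

def collect_vocabulary(docs, pattern, lowercase):
    """Collect unique tokens in first-seen order."""
    vocab = []
    for doc in docs:
        text = doc.lower() if lowercase else doc
        for token in re.findall(pattern, text):
            if token not in vocab:
                vocab.append(token)
    return vocab

# Conceptual table: keep every word and preserve case
keep_all = collect_vocabulary(docs, r"\b\w+\b", lowercase=False)

# Approximation of the default behaviour: lowercase, two or more word characters
default_like = collect_vocabulary(docs, r"\b\w\w+\b", lowercase=True)

print(len(keep_all), keep_all)
print(len(default_like), sorted(default_like))
```

If single-character tokens matter for a task, `CountVectorizer` accepts `token_pattern` and `lowercase` arguments that can be adjusted accordingly.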


### Advantages
- Simple and interpretable.  
- Easy to implement.  
- Provides a clear picture of word frequency in documents.  


### Limitations
1. **High dimensionality**  
   - Vocabulary grows with dataset size → large sparse matrices.  

2. **No semantics**  
   - Every word is its own independent dimension, so "good" is exactly as far from "great" as it is from "bad".  

3. **No word order**  
   - Context is lost: "dog bites man" and "man bites dog" produce identical vectors.  

4. **Common words dominate**  
   - Words like *"the"*, *"is"*, *"and"* occur often in almost every document, so their large counts can drown out more informative terms.  
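
Three of these limitations can be checked in a few lines. The following is a rough sketch under simplifying assumptions: the helper `count_vector` is hypothetical, tokenization is a lowercase whitespace split, and the sample sentences are made up for illustration.

```python
from collections import Counter

def count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

# No word order: both sentences map to the same vector
vocab = ["bites", "dog", "man"]
v1 = count_vector("dog bites man", vocab)
v2 = count_vector("man bites dog", vocab)
print(v1 == v2)

# No semantics: single-word documents are orthogonal, whatever their meaning
sentiment_vocab = ["bad", "good", "great"]
good = count_vector("good", sentiment_vocab)
great = count_vector("great", sentiment_vocab)
bad = count_vector("bad", sentiment_vocab)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(good, great), dot(good, bad))

# Common words dominate: the highest counts belong to function words
top_words = Counter("the cat and the dog and the bird".split()).most_common(2)
print(top_words)
```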

