#### Text Vectorization with CountVectorizer

#### CountVectorizer
- **Purpose**: Converts a collection of text documents into a matrix of token (word) counts.

- **From**: `sklearn.feature_extraction.text`

- **Output**: A SciPy **sparse matrix** (call `.toarray()` to get a dense NumPy array) where:
  - Rows = documents (e.g., reviews)
  - Columns = unique tokens (vocabulary)
  - Values = frequency of the token in the document

- **Basic Features**:
  - Automatically **tokenizes** text into words
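  - Converts all text to **lowercase** (default `lowercase=True`)
  - Treats **punctuation** as a token separator and, with the default `token_pattern`, drops **single-character tokens**; tokens made of digits (e.g., "2023") are kept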
  - Builds a **vocabulary** from all documents
  - Option to **remove stopwords** (like "the", "is", etc.)

- **Key Parameters**:
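  - `stop_words='english'`: removes words found in scikit-learn's built-in English stop-word list
  - `ngram_range=(1,2)`: includes both unigrams and bigrams
  - `max_features=n`: keeps only the `n` most frequent terms across the corpus
  - `min_df`, `max_df`: lower and upper bounds on a term's document frequency (an integer is a document count, a float is a proportion of documents)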

- **Common Methods**:
  - `fit(corpus)`: learns the vocabulary
  - `transform(corpus)`: converts text to count vectors
  - `fit_transform(corpus)`: does both
  - `get_feature_names_out()`: returns the vocabulary as an array, ordered by column index
  - `vocabulary_`: attribute (available after fitting) holding a dictionary that maps tokens to column indices
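
To make the difference between `fit` and `transform` concrete, here is a minimal sketch on a made-up two-document corpus (the documents and variable names are illustrative assumptions, not part of the notebook's data). It learns a vocabulary from one set of documents and then vectorizes a new document containing an unseen word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only for illustration
train_docs = ["easy to use", "very easy setup"]
new_docs = ["easy to break"]          # "break" was never seen during fit

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)            # learns the vocabulary only

# vocabulary_ maps each token to its column index
vocab = {tok: int(idx) for tok, idx in vectorizer.vocabulary_.items()}
print(vocab)

# transform reuses the learned vocabulary; unseen tokens are ignored
X_new = vectorizer.transform(new_docs)
print(X_new.toarray())
```

The new document only gets counts in the columns for "easy" and "to"; "break" has no column because it was not in the fitted vocabulary.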


In [5]:
reviews = [
    "The product is great and easy to use",
    "Easy to use and very effective",
    "Not great, the product broke quickly",
    "Very bad experience, not recommended"
]

In [7]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the vectorizer
vectorizer = CountVectorizer(stop_words='english')

# Transform reviews into count vectors
X = vectorizer.fit_transform(reviews)

# Get the vocabulary
vocab = vectorizer.get_feature_names_out()
print(vocab)
print(len(vocab))

['bad' 'broke' 'easy' 'effective' 'experience' 'great' 'product' 'quickly'
 'recommended' 'use']
10


In [11]:
print(reviews[0])
print()
print(X.toarray())   # Convert the sparse count matrix to a dense array and print it

The product is great and easy to use

[[0 0 1 0 0 1 1 0 0 1]
 [0 0 1 1 0 0 0 0 0 1]
 [0 1 0 0 0 1 1 1 0 0]
 [1 0 0 0 1 0 0 0 1 0]]


In [9]:
# Create DataFrame
df = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"Review {i+1}" for i in range(len(reviews))],
)
df

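| | bad | broke | easy | effective | experience | great | product | quickly | recommended | use |
|---|---|---|---|---|---|---|---|---|---|---|
| Review 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| Review 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| Review 3 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| Review 4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |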


Review 1: The product is great and easy to use

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

reviews = [
    "The product is great and easy to use",
    "Easy to use and very effective",
    "Not great, the product broke quickly",
    "Very bad experience, not recommended"
]

# Unigram + bigram vectorizer (ngram_range=(1, 2))
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X = vectorizer.fit_transform(reviews)
print(X.toarray())
len(vectorizer.get_feature_names_out())

[[0 0 0 0 1 1 0 0 0 1 1 0 1 0 1 0 0 1 0]
 [0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1]
 [0 0 1 1 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0]
 [1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0]]


19

In [33]:
vectorizer.get_feature_names_out()

array(['bad', 'bad experience', 'broke', 'broke quickly', 'easy',
       'easy use', 'effective', 'experience', 'experience recommended',
       'great', 'great easy', 'great product', 'product', 'product broke',
       'product great', 'quickly', 'recommended', 'use', 'use effective'],
      dtype=object)
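
Some of these bigrams, such as "great easy", never appear as adjacent words in the raw text. That is because stop words are removed before n-grams are formed. A minimal sketch using `build_analyzer()`, which returns the function the vectorizer applies to each document, makes this visible for the first review.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')

# The analyzer performs lowercasing, tokenization, stop-word removal
# and n-gram generation; it does not require a fitted vectorizer.
analyze = vectorizer.build_analyzer()
tokens = analyze("The product is great and easy to use")
print(tokens)
```

After "the", "is", "and" and "to" are dropped, the remaining sequence is product, great, easy, use, so "great easy" becomes a bigram even though "and" separated the two words in the original sentence.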

In [15]:
# Create DataFrame
df = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"Review {i+1}" for i in range(len(reviews))],
)
df

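| | bad | bad experience | broke | broke quickly | easy | easy use | effective | experience | experience recommended | great | great easy | great product | product | product broke | product great | quickly | recommended | use | use effective |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Review 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| Review 2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Review 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| Review 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |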


#### Text vectorization using TfidfVectorizer


#### 🔹 `TfidfVectorizer`

- **Purpose**: Converts text documents to a matrix of **TF-IDF features**, capturing both term **frequency** and **importance**.

- **TF-IDF** stands for:
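  - **TF** (Term Frequency): how often a word appears in a document.
  - **IDF** (Inverse Document Frequency): how rare a word is across all documents.
  - Formula:  
    $
    \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left( \frac{N}{\text{DF}(t)} \right)
    $ 
    where:
    - *t* = term  
    - *d* = document  
    - *N* = total number of documents  
    - *DF(t)* = number of documents containing term *t*

    This is the textbook form. With default settings, scikit-learn's `TfidfVectorizer` uses a smoothed variant, $\text{IDF}(t) = \ln\left( \frac{1 + N}{1 + \text{DF}(t)} \right) + 1$, and then L2-normalizes each document vector, so its output values differ from the formula above.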

- **Use Case**: Often preferable to raw counts (as in `CountVectorizer`) because:
  - It **downweights common words** that appear in most documents (e.g., "the", "is")
  - It **upweights terms that are rare across the corpus**, which tend to be more informative
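
To see the downweighting in numbers, here is a small pure-Python sketch of the textbook formula above. It assumes raw counts for TF and the natural logarithm, and it uses a made-up, pre-tokenized three-document corpus; the helper names `doc_freq` and `tf_idf` are illustrative and not part of any library.

```python
import math

# Hypothetical pre-tokenized corpus, for illustration only
docs = [
    ["the", "product", "is", "great"],
    ["the", "product", "broke"],
    ["the", "service", "is", "great"],
]
N = len(docs)

def doc_freq(term):
    """Number of documents that contain the term."""
    return sum(term in doc for doc in docs)

def tf_idf(term, doc):
    """Textbook TF-IDF: raw count times log(N / DF)."""
    return doc.count(term) * math.log(N / doc_freq(term))

print(tf_idf("the", docs[0]))      # appears in every document, so log(3/3) = 0
print(tf_idf("product", docs[0]))  # appears in 2 of 3 documents
print(tf_idf("broke", docs[1]))    # appears in 1 of 3 documents, highest weight
```

A term that occurs in every document gets a weight of exactly zero under this form, while the rarest term gets the largest weight.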

---

##### Key Features:
- **Reduces the dominance of common words**
- Captures **relative importance** of words
- Automatically **normalizes** vectors (L2 norm by default)
- Supports **ngrams**, **stopword removal**, **max_features**, etc.
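
The L2 normalization mentioned above can be checked directly. This sketch, using the same four reviews as the rest of the section, computes the Euclidean length of each row of the TF-IDF matrix; with the default `norm='l2'`, every non-empty document should have length 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "The product is great and easy to use",
    "Easy to use and very effective",
    "Not great, the product broke quickly",
    "Very bad experience, not recommended"
]

X = TfidfVectorizer(stop_words='english').fit_transform(reviews)

# Euclidean (L2) norm of each document vector
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)
```

Because every row has unit length, the dot product of two rows is their cosine similarity.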

---

##### Common Parameters:
- `ngram_range=(1,2)` → unigrams + bigrams
- `stop_words='english'` → remove common English stopwords
- `max_df=0.7` → ignore terms that appear in more than 70% of docs
- `min_df=2` → include only terms that appear in 2+ docs


In [39]:
reviews = [
    "The product is great and easy to use",
    "Easy to use and very effective",
    "Not great, the product broke quickly",
    "Very bad experience, not recommended"
]

### TF-IDF Vectorizer

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(reviews)
print(vectorizer.get_feature_names_out())
print("\n")
print(X.toarray())

['bad' 'broke' 'easy' 'effective' 'experience' 'great' 'product' 'quickly'
 'recommended' 'use']


[[0.         0.         0.5        0.         0.         0.5
  0.5        0.         0.         0.5       ]
 [0.         0.         0.52640543 0.66767854 0.         0.
  0.         0.         0.         0.52640543]
 [0.         0.55528266 0.         0.         0.         0.43779123
  0.43779123 0.55528266 0.         0.        ]
 [0.57735027 0.         0.         0.         0.57735027 0.
  0.         0.         0.57735027 0.        ]]


In [20]:
# Create DataFrame
df = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"R{i+1}" for i in range(len(reviews))],
)
df

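| | bad | broke | easy | effective | experience | great | product | quickly | recommended | use |
|---|---|---|---|---|---|---|---|---|---|---|
| R1 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.5 |
| R2 | 0.0 | 0.0 | 0.526405 | 0.667679 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.526405 |
| R3 | 0.0 | 0.555283 | 0.0 | 0.0 | 0.0 | 0.437791 | 0.437791 | 0.555283 | 0.0 | 0.0 |
| R4 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 |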
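
The values in this table can be reproduced by hand from the raw counts. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw counts as TF), under which the IDF is `ln((1 + N) / (1 + DF)) + 1`; the intermediate variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = [
    "The product is great and easy to use",
    "Easy to use and very effective",
    "Not great, the product broke quickly",
    "Very bad experience, not recommended"
]

# Step 1: raw term counts (TF)
counts = CountVectorizer(stop_words='english').fit_transform(reviews).toarray()

# Step 2: smoothed inverse document frequency
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts, then L2-normalize each row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against TfidfVectorizer with default settings
reference = TfidfVectorizer(stop_words='english').fit_transform(reviews).toarray()
print(np.allclose(manual, reference))
print(manual[1].round(6))   # row for R2
```

In R2, "effective" occurs in one review while "easy" and "use" each occur in two, which is why "effective" ends up with the larger weight in that row.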
