# Day 51: Bag of Words (BoW) and TF-IDF

Yesterday, we learned how to clean text by tokenizing and stemming. Today, we tackle the most crucial step in classical NLP: Feature Extraction. We need to convert that clean list of words into a numerical format (a vector) that machine learning algorithms can actually process. We'll focus on the two foundational methods: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).

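## Topics Covered

- Bag of Words (BoW) and Count Vectorization
- Sparse Matrix: Managing Massive Data
- TF-IDF: Term Weighting and Importance
- Code Example: BoW and TF-IDF with Scikit-learn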

## Bag of Words (BoW) and Count Vectorization

The Bag of Words (BoW) model is a simple way of representing text data where the order of words doesn't matter (it's just a "bag"). Count Vectorization is the process used to implement BoW.

- `How it Works`: 
    - It creates a vocabulary (a list of every unique word across all documents) and then represents each document as a vector where each element is the count of how often that vocabulary word appears in the document.

- `Analogy`: 
    - Shopping List. A bag of groceries might contain 3 apples, 2 bananas, and 1 milk. You care about the counts, not the order you put them in the bag.

- `Example (2 Documents)`:

    - Doc 1: "The quick brown fox"

    - Doc 2: "The quick fox ran"

    - Vocabulary: ['the', 'quick', 'brown', 'fox', 'ran']

        - Vector for Doc 1: [1, 1, 1, 1, 0]

        - Vector for Doc 2: [1, 1, 0, 1, 1]
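
The two-document example above can be reproduced with a few lines of plain Python. This is a minimal sketch, not how scikit-learn implements it: it assumes lowercasing plus whitespace splitting as the tokenizer, and it keeps the vocabulary in order of first appearance so that it matches the list shown above. The names `docs`, `tokenized`, `vocabulary`, and `vectors` are illustrative.

```python
# Two toy documents from the example above
docs = ["The quick brown fox", "The quick fox ran"]

# Assumed tokenizer: lowercase, then split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Build the vocabulary in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One count vector per document: how often each vocabulary word occurs
vectors = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['the', 'quick', 'brown', 'fox', 'ran']
print(vectors)     # [[1, 1, 1, 1, 0], [1, 1, 0, 1, 1]]
```

Note that word order is discarded: "fox quick the brown" would produce the same vector as Doc 1.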

## Sparse Matrix: Managing Massive Data 

In real-world NLP, the vocabulary size can reach tens of thousands of words. If a document only contains 50 words, its vector will have thousands of zeros.

- `Sparse Matrix`: A specialized data structure used to store these BoW or TF-IDF vectors efficiently. Instead of storing every zero, it only records the position and value of the non-zero elements.

- `Why it Matters`: Storing the entire matrix (which is mostly zeros) wastes enormous amounts of memory and computational power. Using a sparse matrix is essential for scaling NLP.
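
To make the idea concrete, here is a minimal sketch that stores a mostly-zero vector as a dictionary mapping position to value. This is an illustration only: scikit-learn returns SciPy sparse matrices, which use more compact array-based layouts. The helper names `to_sparse` and `to_dense` and the vocabulary size of 10,000 are assumptions made for this example.

```python
def to_sparse(dense_vector):
    """Keep only the non-zero entries as {position: value}."""
    return {i: v for i, v in enumerate(dense_vector) if v != 0}


def to_dense(sparse_vector, size):
    """Rebuild the full vector, filling missing positions with 0."""
    return [sparse_vector.get(i, 0) for i in range(size)]


vocab_size = 10_000          # hypothetical vocabulary size
dense = [0] * vocab_size     # a document vector that is almost all zeros
dense[3] = 2
dense[57] = 1
dense[9_120] = 1

sparse = to_sparse(dense)
print(sparse)                   # {3: 2, 57: 1, 9120: 1}
print(len(dense), len(sparse))  # 10000 3

# The sparse form loses no information
assert to_dense(sparse, vocab_size) == dense
```

Three stored entries replace ten thousand, and the gap widens as the vocabulary grows.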

## TF-IDF: Term Weighting and Importance

Term Frequency-Inverse Document Frequency (TF-IDF) is an advanced version of BoW that assigns a numerical weight to each word, reflecting its importance in a document relative to the entire collection of documents (the corpus).

It is calculated by multiplying two components:

1. **Term Frequency (TF)**: How often a word appears in a specific document, relative to the document's length. (This is the BoW count scaled by document size.)

$ TF(t,d) = \frac{\text{Count of term } t \text{ in document } d}{\text{Total number of terms in document } d} $

2. **Inverse Document Frequency (IDF)**: A measure of how rare a word is across all documents. Words like "the" have a low IDF score (they appear in many documents), while unique technical words have a high IDF score.

$ IDF(t) = \log\left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) $

- `Final Score`: $ \text{TF-IDF}(t,d) = TF(t,d) \times IDF(t) $

- `Analogy`: The Expert Witness. A common word like "is" might appear frequently (high TF), but since it appears in every document (low IDF), its TF-IDF weight is low. A unique word like "photosynthesis" might appear only once (low TF), but since it appears in only one document (high IDF), its TF-IDF weight is high. The high-scoring word is the "expert" that defines the document's topic.
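
The sketch below works through the textbook definition on a tiny corpus: TF is the term count divided by document length, and IDF is the natural log of total documents over documents containing the term. The three sentences and the function names `tf`, `idf`, and `tf_idf` are made up for illustration. Library implementations often use smoothed or normalized variants, so their numbers will generally differ from this plain version.

```python
import math

# Hypothetical toy corpus, already tokenized
docs = [
    "the cat is on the mat".split(),
    "the dog is in the yard".split(),
    "photosynthesis is how plants make food".split(),
]


def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """log(total documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# "is" appears in every document, so its IDF is log(3/3) = 0
print(tf_idf("is", docs[2], docs))                        # 0.0
# "photosynthesis" appears in only one document
print(round(tf_idf("photosynthesis", docs[2], docs), 3))  # 0.183
```

Both words occur once in the third document, so they share the same TF. The difference in their scores comes entirely from IDF.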

## Code Example: BoW and TF-IDF with Scikit-learn

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# 1. Input Corpus (Assume pre-processed and cleaned text)
corpus = [
    'i love machine learning and python',
    'python is a great language for learning',
    'i love data science and coding',
    'data analysis is key to data science',
    'machine learning is fun and exciting',
    'coding in python is enjoyable and fun'
]

# --- A. Bag of Words (BoW) using CountVectorizer ---
print("--- Count Vectorization (BoW) ---")
# 1. Initialize the CountVectorizer
count_vectorizer = CountVectorizer()

# 2. Fit the vectorizer to learn the vocabulary and transform the corpus
bow_matrix = count_vectorizer.fit_transform(corpus)

# 3. Get the vocabulary (features/column names)
feature_names = count_vectorizer.get_feature_names_out()
print(f"Vocabulary: {feature_names}")

# 4. Convert the sparse matrix to a dense NumPy array for viewing
print("\nBoW Matrix (Counts):")
print(bow_matrix.toarray())
# Rows = documents, columns = vocabulary terms in alphabetical order.
# Single-character tokens such as 'i' and 'a' are dropped by the default token pattern.

# --- B. TF-IDF Vectorization ---
print("\n--- TF-IDF Vectorization ---")
# 1. Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# 2. Fit and transform the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# 3. Get feature names (same as BoW, but weights are different)
# feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()

# 4. Convert the sparse matrix to a dense NumPy array for viewing (shows the weights)
print("TF-IDF Matrix (Weights):")
print(tfidf_matrix.toarray().round(3))
# The weights are floats. Within a document, terms found in fewer documents
# (e.g. 'love', 'machine') score higher than terms spread across many
# documents (e.g. 'and', 'is').
```

```text
--- Count Vectorization (BoW) ---
Vocabulary: ['analysis' 'and' 'coding' 'data' 'enjoyable' 'exciting' 'for' 'fun'
 'great' 'in' 'is' 'key' 'language' 'learning' 'love' 'machine' 'python'
 'science' 'to']

BoW Matrix (Counts):
[[0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0]
 [0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0]
 [0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0]
 [1 0 0 2 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1]
 [0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0]
 [0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0]]

--- TF-IDF Vectorization ---
TF-IDF Matrix (Weights):
[[0.    0.364 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.    0.425 0.503 0.503 0.425 0.    0.   ]
 [0.    0.    0.    0.    0.    0.    0.482 0.    0.482 0.    0.286 0.
  0.482 0.333 0.    0.    0.333 0.    0.   ]
 [0.    0.34  0.47  0.47  0.    0.    0.    0.    0.    0.    0.    0.
  0.    0.    0.47  0.    0.    0.47  0.   ]
 [0.386 0.    0.    0.633 0.    0.    0.    0.    0.    0.    0.229 0.386
  0.    0.    0.    0.    0.    0.316 0.386]
 [0.
```
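
The weights printed above do not come from the plain TF × IDF formula. As a hedged sketch, the code below reproduces the first row by hand, assuming scikit-learn's documented `TfidfVectorizer` defaults: raw counts as TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each row. The regular expression mimics the default token pattern, which drops one-character tokens. The helper names `smoothed_idf` and `tfidf_row` are illustrative and not part of any library.

```python
import math
import re

corpus = [
    'i love machine learning and python',
    'python is a great language for learning',
    'i love data science and coding',
    'data analysis is key to data science',
    'machine learning is fun and exciting',
    'coding in python is enjoyable and fun',
]

# Keep only tokens of two or more word characters (so 'i' and 'a' are dropped)
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
n_docs = len(tokenized)


def smoothed_idf(term):
    """Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then scale the row to unit L2 length."""
    raw = {term: tokens.count(term) * smoothed_idf(term) for term in set(tokens)}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


row = tfidf_row(tokenized[0])
for term in sorted(row):
    print(term, round(row[term], 3))
# and 0.364
# learning 0.425
# love 0.503
# machine 0.503
# python 0.425
```

These values match the non-zero entries in the first row of the TF-IDF matrix, which suggests the assumed defaults are the ones in effect here.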

## Summary of Day 51

Today, you learned how to convert cleaned text into numerical features. Count Vectorization creates the Bag of Words model based on raw frequency. TF-IDF refines this by applying term weighting, assigning higher scores to words that are rare across the corpus but frequent within a specific document. Both methods produce a sparse matrix representation, which keeps memory use manageable when the vocabulary is large.

## What's Next (Day 52)

BoW and TF-IDF are great, but they treat every word as an independent dimension, ignoring meaning and context. Tomorrow, on Day 52, we will move to the next generation of NLP features: Word Embeddings. You'll learn how models like Word2Vec and GloVe capture the semantic relationships between words, which is fundamental to modern deep learning NLP!