# Notebook 03: Vectorization Techniques

## Learning Objectives
By the end of this session, you will:
1. **Understand text vectorization** - Convert text to numerical representations
2. **Master Bag-of-Words (BoW)** - Build document-term matrices
3. **Implement TF-IDF** - Weight terms by importance
4. **Extract n-grams** - Capture phrase-level patterns
5. **Visualize text features** - Analyze word frequencies and importance
6. **Compare vectorization methods** - Choose the right approach for your task

## Why Vectorize Text?
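Machine learning algorithms operate on numbers, not raw text, so each document must be converted into a numerical vector. The goal is to keep as much task-relevant information as possible. Simple methods such as Bag-of-Words keep word frequencies but discard word order and most semantic meaning.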


In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re

# Sample documents for vectorization
documents = [
    "I love machine learning and natural language processing",
    "Machine learning is a subset of artificial intelligence", 
    "Natural language processing helps computers understand human language",
    "Deep learning is a powerful machine learning technique",
    "Python is great for machine learning and data science",
    "Text mining and natural language processing are related fields",
    "Artificial intelligence will transform many industries",
    "Data science combines statistics, programming, and domain knowledge"
]

print("Sample Documents for Vectorization:")
print("=" * 50)
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")

# Create a simple preprocessing function
def simple_preprocess(text):
    """Basic text preprocessing for vectorization"""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and extra spaces
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Preprocess documents
processed_docs = [simple_preprocess(doc) for doc in documents]
print("\nPreprocessed Documents:")
for i, doc in enumerate(processed_docs, 1):
    print(f"{i}. {doc}")

## Section 1: Bag-of-Words (BoW) Representation

### Concept
Bag-of-Words represents each document as a vector of word counts, ignoring grammar and word order but keeping track of frequency.

**Example**: "I love NLP" → [1, 1, 1, 0, 0, ...] for vocabulary [I, love, NLP, machine, learning, ...]
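
To make the example concrete, here is a minimal from-scratch sketch of the counting step. The helper names (`build_vocabulary`, `bow_vector`) and the toy corpus are illustrative assumptions, not part of scikit-learn, and the tokenizer is a plain whitespace split. `CountVectorizer` behaves differently in places: its default token pattern, for instance, drops single-character tokens such as "i".

```python
from collections import Counter


def build_vocabulary(docs):
    """Collect the unique lowercase tokens across all documents, sorted alphabetically."""
    return sorted({token for doc in docs for token in doc.lower().split()})


def bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in a single document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus, chosen only to illustrate the idea
toy_docs = ["I love NLP", "I love machine learning", "NLP needs machine learning"]

vocabulary = build_vocabulary(toy_docs)
toy_matrix = [bow_vector(doc, vocabulary) for doc in toy_docs]

print(vocabulary)
for doc, row in zip(toy_docs, toy_matrix):
    print(f"{row}  <- {doc!r}")
```

Each row has one entry per vocabulary word, so every document maps to a vector of the same length regardless of how many words it contains.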

### Advantages:
- Simple to understand and implement
- Works well for many text classification tasks
- Uses raw term frequency as a simple, if crude, signal of word importance

### Disadvantages:
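- Loses word order and grammar, so sentences built from the same words are indistinguishable
- Produces sparse, high-dimensional vectors (mostly zeros)
- No semantic understanding: synonyms are treated as unrelated features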
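
The loss of word order can be seen in a small sketch. The `bag_of_words` helper below is a hypothetical plain-Python stand-in for a single BoW row (a whitespace tokenizer plus `Counter`), not the scikit-learn implementation. The two sentences mean opposite things but produce the same counts.

```python
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping; a plain-Python stand-in for one BoW row."""
    return Counter(text.lower().split())


sentence_a = "the dog bites the man"
sentence_b = "the man bites the dog"

bag_a = bag_of_words(sentence_a)
bag_b = bag_of_words(sentence_b)

print(dict(bag_a))
print(dict(bag_b))
print("Same bag-of-words representation:", bag_a == bag_b)
```

Any model that sees only these counts cannot tell the two sentences apart, which is one motivation for adding n-grams later in the notebook.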

In [None]:
# Bag-of-Words with scikit-learn CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer(
    lowercase=True,          # Convert to lowercase
    stop_words='english',    # Remove common English stop words
    max_features=100,        # Limit vocabulary size
    ngram_range=(1, 1)       # Use single words (unigrams)
)

# Fit and transform documents
bow_matrix = vectorizer.fit_transform(processed_docs)
feature_names = vectorizer.get_feature_names_out()

print("BAG-OF-WORDS ANALYSIS")
print("=" * 50)
print(f"Vocabulary size: {len(feature_names)}")
print(f"Matrix shape: {bow_matrix.shape}")
print(f"Matrix density: {bow_matrix.nnz / (bow_matrix.shape[0] * bow_matrix.shape[1]):.3f}")

# Convert to dense array for visualization
bow_dense = bow_matrix.toarray()

# Create DataFrame for better visualization
bow_df = pd.DataFrame(bow_dense, columns=feature_names)
bow_df.index = [f"Doc {i+1}" for i in range(len(documents))]

print("\nFirst 10 features:")
print(bow_df.iloc[:, :10])

# Analyze vocabulary
print("\nVocabulary (first 20 words):")
print(list(feature_names[:20]))

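# Word frequency analysis
# Sum each column to get the total count of every word across all documents.
# np.asarray(...).ravel() works whether the sum comes back as np.matrix or ndarray.
word_freq = np.asarray(bow_matrix.sum(axis=0)).ravel()
word_freq_df = pd.DataFrame({
    'word': feature_names,
    'frequency': word_freq
}).sort_values('frequency', ascending=False)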

print("\nTop 15 most frequent words:")
print(word_freq_df.head(15))

# Visualize word frequencies
plt.figure(figsize=(12, 6))
top_words = word_freq_df.head(15)
plt.bar(top_words['word'], top_words['frequency'])
plt.title('Top 15 Most Frequent Words (BoW)')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Document similarity using BoW
doc_labels = [f"Doc {i+1}" for i in range(len(documents))]
similarity_matrix = cosine_similarity(bow_matrix)
similarity_df = pd.DataFrame(similarity_matrix, index=doc_labels, columns=doc_labels)

print("\nDocument Similarity Matrix (Cosine Similarity):")
print(similarity_df.round(3))
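
To see what the similarity matrix measures, cosine similarity can be sketched by hand as the dot product of two vectors divided by the product of their lengths. The `cosine` function and the three count vectors below are hypothetical illustrations. Returning `0.0` for an all-zero vector is a choice made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # sketch-level choice for empty documents
    return dot / (norm_u * norm_v)


# Hypothetical count vectors over a shared four-word vocabulary
vec_a = [1, 1, 0, 0]
vec_b = [1, 1, 1, 1]
vec_c = [0, 0, 1, 1]

print(round(cosine(vec_a, vec_b), 3))  # vec_a's two words both appear in vec_b
print(round(cosine(vec_a, vec_c), 3))  # no words in common
print(round(cosine(vec_a, vec_a), 3))  # a vector compared with itself
```

Because the score depends on the angle between vectors and not on their length, a long and a short document with the same word proportions score as highly similar.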