# BBC Text Representations - Sparse Methods

**Roll Number:** SE22UARI195

**Tasks:**
1. One-Hot Encoding (OHE) - Top 2000 tokens
2. Bag-of-Words (BoW) - Unigram counts
3. N-grams - Unigrams + Bigrams
4. TF-IDF - With manual verification
5. Calculate health metrics for all methods

---

## 1. Setup & Load Preprocessed Data

In [1]:

import pandas as pd
import numpy as np
import pickle
import time
import sys
from pathlib import Path
from collections import Counter

# Sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize
import scipy.sparse as sp

# Progress bar
from tqdm.notebook import tqdm

print("✅ Imports successful!")

✅ Imports successful!


In [2]:
# Configuration
ROLL = "SE22UARI195"
CACHE_DIR = Path("../cache")
MODELS_DIR = Path("../models")
MODELS_DIR.mkdir(exist_ok=True)

print(f"Roll Number: {ROLL}")
print(f"Cache Directory: {CACHE_DIR}")
print(f"Models Directory: {MODELS_DIR}")

Roll Number: SE22UARI195
Cache Directory: ../cache
Models Directory: ../models


In [3]:
# Load preprocessed data
print("📂 Loading preprocessed data...\n")

with open(CACHE_DIR / 'train_processed.pkl', 'rb') as f:
    train_df = pickle.load(f)
print(f"✅ TRAIN: {len(train_df)} documents")

with open(CACHE_DIR / 'dev_processed.pkl', 'rb') as f:
    dev_df = pickle.load(f)
print(f"✅ DEV: {len(dev_df)} documents")

with open(CACHE_DIR / 'test_processed.pkl', 'rb') as f:
    test_df = pickle.load(f)
print(f"✅ TEST: {len(test_df)} documents")

with open(CACHE_DIR / 'vocab_counter.pkl', 'rb') as f:
    vocab_counter = pickle.load(f)
print(f"✅ Vocabulary: {len(vocab_counter):,} unique tokens")

with open(CACHE_DIR / 'metadata.pkl', 'rb') as f:
    metadata = pickle.load(f)
print(f"✅ Metadata loaded")

📂 Loading preprocessed data...

✅ TRAIN: 1335 documents
✅ DEV: 445 documents
✅ TEST: 445 documents
✅ Vocabulary: 20,404 unique tokens
✅ Metadata loaded


In [4]:
# Extract processed text for vectorization
X_train_text = train_df['text_processed'].values
X_dev_text = dev_df['text_processed'].values
X_test_text = test_df['text_processed'].values

y_train = train_df['label'].values
y_dev = dev_df['label'].values
y_test = test_df['label'].values

print("\n📊 Data shapes:")
print(f"  TRAIN: {len(X_train_text)} documents")
print(f"  DEV: {len(X_dev_text)} documents")
print(f"  TEST: {len(X_test_text)} documents")


📊 Data shapes:
  TRAIN: 1335 documents
  DEV: 445 documents
  TEST: 445 documents


## 2. Helper Functions for Health Metrics

In [5]:
def calculate_health_metrics(X_matrix, vectorizer, X_test_matrix, test_tokens_list, 
                            fit_time, transform_times, method_name):
    """
    Calculate comprehensive health metrics for a representation.
    
    Args:
        X_matrix: Training matrix (sparse or dense)
        vectorizer: Fitted vectorizer object
        X_test_matrix: Test matrix (currently unused; kept for a uniform call signature)
        test_tokens_list: List of token lists for test set
        fit_time: Time taken to fit (seconds)
        transform_times: List of transform times per document (seconds)
        method_name: Name of the method (currently unused inside this function)
    
    Returns:
        Dictionary of metrics
    """
    metrics = {}
    
    # Vocabulary size
    if hasattr(vectorizer, 'vocabulary_'):
        metrics['V'] = len(vectorizer.vocabulary_)
    else:
        metrics['V'] = X_matrix.shape[1]
    
    # Non-zero entries
    if sp.issparse(X_matrix):
        metrics['nnz'] = int(X_matrix.nnz)
    else:
        metrics['nnz'] = int(np.count_nonzero(X_matrix))
    
    # Sparsity
    total_elements = X_matrix.shape[0] * X_matrix.shape[1]
    metrics['sparsity'] = float(1 - metrics['nnz'] / total_elements)
    
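    # OOV rate on test set: share of test tokens (unigrams, counted with repeats)
    # that are absent from the fitted vocabulary. Bigram entries in an n-gram
    # vocabulary never match here because test_tokens_list holds single tokens.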
    if hasattr(vectorizer, 'vocabulary_'):
        train_vocab = set(vectorizer.vocabulary_.keys())
        total_test_tokens = 0
        oov_tokens = 0
        
        for tokens in test_tokens_list:
            total_test_tokens += len(tokens)
            oov_tokens += sum(1 for t in tokens if t not in train_vocab)
        
        metrics['oov'] = float(oov_tokens / total_test_tokens) if total_test_tokens > 0 else 0.0
    else:
        metrics['oov'] = 0.0
    
    # Top-k coverage (only for TF-IDF)
    metrics['topk_100'] = 0.0
    metrics['topk_500'] = 0.0
    
    # Timing
    metrics['fit_s'] = float(fit_time)
    metrics['ms_per_doc'] = float(np.mean(transform_times) * 1000)  # Convert to ms
    
    # Memory
    if sp.issparse(X_matrix):
        # Sparse matrix memory
        mem_bytes = X_matrix.data.nbytes + X_matrix.indices.nbytes + X_matrix.indptr.nbytes
    else:
        # Dense matrix memory
        mem_bytes = X_matrix.nbytes
    
    metrics['mem_mb'] = float(mem_bytes / (1024 * 1024))
    
    return metrics

def print_metrics(metrics, method_name):
    """Pretty print metrics."""
    print(f"\n{'='*60}")
    print(f"📊 {method_name} - Health Metrics")
    print(f"{'='*60}")
    print(f"  Vocabulary size (V):        {metrics['V']:>10,}")
    print(f"  Non-zero entries (nnz):     {metrics['nnz']:>10,}")
    print(f"  Sparsity:                   {metrics['sparsity']:>10.4f}")
    print(f"  OOV rate (TEST):            {metrics['oov']:>10.4f}")
    if metrics['topk_100'] > 0:
        print(f"  Top-100 coverage:           {metrics['topk_100']:>10.4f}")
        print(f"  Top-500 coverage:           {metrics['topk_500']:>10.4f}")
    print(f"  Fit time (s):               {metrics['fit_s']:>10.3f}")
    print(f"  Transform time (ms/doc):    {metrics['ms_per_doc']:>10.3f}")
    print(f"  Memory (MB):                {metrics['mem_mb']:>10.2f}")
    print(f"{'='*60}")

print("✅ Helper functions defined!")

✅ Helper functions defined!
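
The sparsity and memory figures reported by the helper can be checked by hand on a small matrix. The sketch below is illustrative only: the 3x4 matrix is made-up toy data, not taken from the BBC corpus. It assumes the matrix is in CSR format, as returned by `CountVectorizer` and `TfidfVectorizer`.

```python
import numpy as np
import scipy.sparse as sp

# Toy 3x4 document-term matrix (illustrative values, not BBC data)
dense = np.array([
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
], dtype=np.int64)
X = sp.csr_matrix(dense)

# Sparsity = 1 - (stored non-zeros / total cells)
nnz = X.nnz
sparsity = 1 - nnz / (X.shape[0] * X.shape[1])

# CSR keeps three arrays: values, column indices, and row pointers
csr_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print(f"nnz={nnz}, sparsity={sparsity:.4f}")
print(f"CSR bytes={csr_bytes}, dense bytes={dense.nbytes}")
```

On a matrix this small the CSR overhead can outweigh the savings; the benefit shows up at the sparsity levels seen in this notebook.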


## 3. One-Hot Encoding (OHE)

Top 2000 tokens by frequency in TRAIN, binary 0/1 per document.
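
As a minimal sketch of what binary encoding over a fixed vocabulary produces, the example below uses a three-term vocabulary in place of the top-2000 list. The documents and the `toy_vocab` name are hypothetical and only serve as illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "market" is deliberately outside the vocabulary
toy_docs = ["game game film", "film company", "market"]
toy_vocab = ["game", "film", "company"]  # stand-in for the top-2000 token list

vec = CountVectorizer(
    vocabulary=toy_vocab,   # column order follows this list
    binary=True,            # presence/absence instead of counts
    lowercase=False,
    token_pattern=r"(?u)\b\w+\b",
)
X_toy = vec.fit_transform(toy_docs).toarray()
print(X_toy)
```

The repeated "game" in the first document still yields a 1, and the third document maps to an all-zero row because none of its tokens are in the fixed vocabulary.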

In [6]:
print("\n🔧 Building One-Hot Encoding (OHE)...\n")

# Get top 2000 tokens from vocab_counter
top_2000_tokens = [token for token, count in vocab_counter.most_common(2000)]

print(f"Selected top 2000 tokens")
print(f"\nTop 20 tokens: {top_2000_tokens[:20]}")


🔧 Building One-Hot Encoding (OHE)...

Selected top 2000 tokens

Top 20 tokens: ['said', 'year', 'mr', 'would', 'also', 'people', 'new', 'one', 'time', 'could', 'game', 'last', 'two', 'first', 'world', 'say', 'film', 'company', 'firm', 'make']


In [7]:
# Create vectorizer with fixed vocabulary
ohe_vectorizer = CountVectorizer(
    vocabulary=top_2000_tokens,
    binary=True,  # Binary encoding (0/1)
    lowercase=False,  # Already preprocessed
    token_pattern=r'(?u)\b\w+\b'
)

# Fit and transform TRAIN
start_time = time.time()
X_train_ohe = ohe_vectorizer.fit_transform(X_train_text)
fit_time_ohe = time.time() - start_time

print(f"✅ OHE fitted on TRAIN in {fit_time_ohe:.3f}s")
print(f"   Shape: {X_train_ohe.shape}")

✅ OHE fitted on TRAIN in 0.224s
   Shape: (1335, 2000)


In [8]:
# Transform DEV and TEST
transform_times_ohe = []

# DEV
start = time.time()
X_dev_ohe = ohe_vectorizer.transform(X_dev_text)
dev_time = time.time() - start
transform_times_ohe.append(dev_time / len(X_dev_text))

# TEST
start = time.time()
X_test_ohe = ohe_vectorizer.transform(X_test_text)
test_time = time.time() - start
transform_times_ohe.append(test_time / len(X_test_text))

print(f"✅ OHE transformed DEV in {dev_time:.3f}s")
print(f"✅ OHE transformed TEST in {test_time:.3f}s")

✅ OHE transformed DEV in 0.058s
✅ OHE transformed TEST in 0.053s


In [9]:
# Calculate health metrics
ohe_metrics = calculate_health_metrics(
    X_train_ohe, ohe_vectorizer, X_test_ohe, test_df['tokens'].tolist(),
    fit_time_ohe, transform_times_ohe, "OHE"
)

print_metrics(ohe_metrics, "One-Hot Encoding (OHE)")


📊 One-Hot Encoding (OHE) - Health Metrics
  Vocabulary size (V):             2,000
  Non-zero entries (nnz):        134,064
  Sparsity:                       0.9498
  OOV rate (TEST):                0.2863
  Fit time (s):                    0.224
  Transform time (ms/doc):         0.124
  Memory (MB):                      1.54
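
The OOV rate above is the fraction of test tokens, counted with repeats, that do not appear in the fitted vocabulary. A minimal standalone sketch of that calculation follows; the function name `oov_rate` and the toy tokens are hypothetical and not part of the notebook.

```python
def oov_rate(token_lists, vocabulary):
    """Fraction of tokens (with repeats) that are not in `vocabulary`."""
    vocab = set(vocabulary)
    total = sum(len(tokens) for tokens in token_lists)
    missing = sum(1 for tokens in token_lists for t in tokens if t not in vocab)
    return missing / total if total else 0.0

# Hypothetical toy data: 7 tokens, of which "market" and "share" are unseen
toy_vocab = {"game", "film", "company"}
toy_test_tokens = [["game", "market", "film"], ["share", "company", "game", "game"]]

rate = oov_rate(toy_test_tokens, toy_vocab)
print(f"OOV rate: {rate:.4f}")
```

Restricting the vocabulary to the top 2000 tokens is what pushes the OHE figure well above the rates of the larger vocabularies.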


In [10]:
# Save OHE artifacts
with open(MODELS_DIR / 'ohe_vectorizer.pkl', 'wb') as f:
    pickle.dump(ohe_vectorizer, f)

sp.save_npz(MODELS_DIR / 'X_train_ohe.npz', X_train_ohe)
sp.save_npz(MODELS_DIR / 'X_dev_ohe.npz', X_dev_ohe)
sp.save_npz(MODELS_DIR / 'X_test_ohe.npz', X_test_ohe)

print("\n💾 OHE artifacts saved!")


💾 OHE artifacts saved!


## 4. Bag-of-Words (BoW)

Unigram counts with min_df >= 2
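
To illustrate the effect of `min_df=2`, here is a small sketch on a hypothetical three-document corpus. Terms that occur in only one document are dropped from the vocabulary, and the remaining columns hold raw counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "fall" and "price" each occur in a single document
toy_docs = ["market market share rise", "market fall", "share price rise"]

vec = CountVectorizer(min_df=2, lowercase=False, token_pattern=r"(?u)\b\w+\b")
X_toy = vec.fit_transform(toy_docs).toarray()

kept = sorted(vec.vocabulary_)  # columns are in alphabetical order
print("kept terms:", kept)
print(X_toy)
```

Unlike the binary encoding, the repeated "market" in the first document is counted twice.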

In [11]:
print("\n🔧 Building Bag-of-Words (BoW)...\n")

# Create BoW vectorizer
bow_vectorizer = CountVectorizer(
    min_df=2,  # Minimum document frequency
    lowercase=False,  # Already preprocessed
    token_pattern=r'(?u)\b\w+\b'
)

# Fit and transform TRAIN
start_time = time.time()
X_train_bow = bow_vectorizer.fit_transform(X_train_text)
fit_time_bow = time.time() - start_time

print(f"✅ BoW fitted on TRAIN in {fit_time_bow:.3f}s")
print(f"   Shape: {X_train_bow.shape}")
print(f"   Vocabulary size: {len(bow_vectorizer.vocabulary_):,}")


🔧 Building Bag-of-Words (BoW)...

✅ BoW fitted on TRAIN in 0.221s
   Shape: (1335, 11515)
   Vocabulary size: 11,515


In [12]:
# Transform DEV and TEST
transform_times_bow = []

# DEV
start = time.time()
X_dev_bow = bow_vectorizer.transform(X_dev_text)
dev_time = time.time() - start
transform_times_bow.append(dev_time / len(X_dev_text))

# TEST
start = time.time()
X_test_bow = bow_vectorizer.transform(X_test_text)
test_time = time.time() - start
transform_times_bow.append(test_time / len(X_test_text))

print(f"✅ BoW transformed DEV in {dev_time:.3f}s")
print(f"✅ BoW transformed TEST in {test_time:.3f}s")

✅ BoW transformed DEV in 0.069s
✅ BoW transformed TEST in 0.085s


In [13]:
# Calculate health metrics
bow_metrics = calculate_health_metrics(
    X_train_bow, bow_vectorizer, X_test_bow, test_df['tokens'].tolist(),
    fit_time_bow, transform_times_bow, "BoW"
)

print_metrics(bow_metrics, "Bag-of-Words (BoW)")


📊 Bag-of-Words (BoW) - Health Metrics
  Vocabulary size (V):            11,515
  Non-zero entries (nnz):        186,739
  Sparsity:                       0.9879
  OOV rate (TEST):                0.0706
  Fit time (s):                    0.221
  Transform time (ms/doc):         0.173
  Memory (MB):                      2.14


In [14]:
# Save BoW artifacts
with open(MODELS_DIR / 'bow_vectorizer.pkl', 'wb') as f:
    pickle.dump(bow_vectorizer, f)

sp.save_npz(MODELS_DIR / 'X_train_bow.npz', X_train_bow)
sp.save_npz(MODELS_DIR / 'X_dev_bow.npz', X_dev_bow)
sp.save_npz(MODELS_DIR / 'X_test_bow.npz', X_test_bow)

print("\n💾 BoW artifacts saved!")


💾 BoW artifacts saved!


## 5. N-grams (Unigrams + Bigrams)

Unigrams + bigrams with min_df >= 3
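
A small sketch of how `ngram_range=(1, 2)` interacts with a document-frequency cutoff. The corpus is hypothetical, and the toy example uses `min_df=2` rather than 3 so that a bigram survives in only three documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: only the bigram "box office" occurs in two documents
toy_docs = ["box office hit", "box office record", "record hit"]

vec = CountVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=2,            # toy threshold; the notebook uses 3
    lowercase=False,
    token_pattern=r"(?u)\b\w+\b",
)
vec.fit(toy_docs)

terms = sorted(vec.vocabulary_)
unigrams = [t for t in terms if " " not in t]
bigrams = [t for t in terms if " " in t]
print("unigrams:", unigrams)
print("bigrams: ", bigrams)
```

Most candidate bigrams are rare, so the cutoff removes a much larger share of bigrams than of unigrams.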

In [15]:
print("\n🔧 Building N-grams (1,2)...\n")

# Create N-gram vectorizer
ngram_vectorizer = CountVectorizer(
    ngram_range=(1, 2),  # Unigrams + bigrams
    min_df=3,  # Minimum document frequency
    lowercase=False,  # Already preprocessed
    token_pattern=r'(?u)\b\w+\b'
)

# Fit and transform TRAIN
start_time = time.time()
X_train_ngram = ngram_vectorizer.fit_transform(X_train_text)
fit_time_ngram = time.time() - start_time

print(f"✅ N-gram fitted on TRAIN in {fit_time_ngram:.3f}s")
print(f"   Shape: {X_train_ngram.shape}")
print(f"   Vocabulary size: {len(ngram_vectorizer.vocabulary_):,}")


🔧 Building N-grams (1,2)...

✅ N-gram fitted on TRAIN in 0.724s
   Shape: (1335, 18625)
   Vocabulary size: 18,625


In [16]:
# Check unigram vs bigram distribution
vocab = ngram_vectorizer.vocabulary_
unigrams = sum(1 for term in vocab.keys() if ' ' not in term)
bigrams = sum(1 for term in vocab.keys() if ' ' in term)

print(f"\n📊 Vocabulary breakdown:")
print(f"   Unigrams: {unigrams:,} ({unigrams/len(vocab)*100:.1f}%)")
print(f"   Bigrams:  {bigrams:,} ({bigrams/len(vocab)*100:.1f}%)")

# Show some example bigrams
bigram_list = [term for term in vocab.keys() if ' ' in term]
print(f"\n   Example bigrams: {bigram_list[:10]}")


📊 Vocabulary breakdown:
   Unigrams: 8,508 (45.7%)
   Bigrams:  10,117 (54.3%)

   Example bigrams: ['worldcom bos', 'former worldcom', 'bos bernie', 'bernie ebbers', 'accused overseeing', 'overseeing 11bn', 'made comment', 'defence lawyer', 'mr ebbers', 'phone company']


In [17]:
# Transform DEV and TEST
transform_times_ngram = []

# DEV
start = time.time()
X_dev_ngram = ngram_vectorizer.transform(X_dev_text)
dev_time = time.time() - start
transform_times_ngram.append(dev_time / len(X_dev_text))

# TEST
start = time.time()
X_test_ngram = ngram_vectorizer.transform(X_test_text)
test_time = time.time() - start
transform_times_ngram.append(test_time / len(X_test_text))

print(f"\n✅ N-gram transformed DEV in {dev_time:.3f}s")
print(f"✅ N-gram transformed TEST in {test_time:.3f}s")


✅ N-gram transformed DEV in 0.178s
✅ N-gram transformed TEST in 0.155s


In [18]:
# Calculate health metrics
ngram_metrics = calculate_health_metrics(
    X_train_ngram, ngram_vectorizer, X_test_ngram, test_df['tokens'].tolist(),
    fit_time_ngram, transform_times_ngram, "N-gram"
)

print_metrics(ngram_metrics, "N-grams (1,2)")


📊 N-grams (1,2) - Health Metrics
  Vocabulary size (V):            18,625
  Non-zero entries (nnz):        230,909
  Sparsity:                       0.9907
  OOV rate (TEST):                0.0935
  Fit time (s):                    0.724
  Transform time (ms/doc):         0.374
  Memory (MB):                      2.65


In [19]:
# Save N-gram artifacts
with open(MODELS_DIR / 'ngram_vectorizer.pkl', 'wb') as f:
    pickle.dump(ngram_vectorizer, f)

sp.save_npz(MODELS_DIR / 'X_train_ngram.npz', X_train_ngram)
sp.save_npz(MODELS_DIR / 'X_dev_ngram.npz', X_dev_ngram)
sp.save_npz(MODELS_DIR / 'X_test_ngram.npz', X_test_ngram)

print("\n💾 N-gram artifacts saved!")


💾 N-gram artifacts saved!


## 6. TF-IDF

With smoothed IDF: idf(t) = log((N + 1)/(df(t) + 1)) + 1
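
As a worked example of the smoothed formula, take a hypothetical corpus of N = 3 documents. A term found in all three gets idf = ln(4/4) + 1 = 1, and a term found in only one gets idf = ln(4/2) + 1 ≈ 1.693147. The sketch below computes both and compares them with the `idf_` attribute of a fitted `TfidfVectorizer`; the helper name `smoothed_idf` is an assumption for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "market" is in every document, the rest in one each
toy_docs = ["market rise", "market fall", "market film"]
N = len(toy_docs)

def smoothed_idf(df, n_docs):
    """idf(t) = ln((N + 1) / (df(t) + 1)) + 1, using the natural logarithm."""
    return np.log((n_docs + 1) / (df + 1)) + 1

idf_common = smoothed_idf(3, N)  # term in all 3 documents
idf_rare = smoothed_idf(1, N)    # term in 1 document

vec = TfidfVectorizer(smooth_idf=True, lowercase=False,
                      token_pattern=r"(?u)\b\w+\b").fit(toy_docs)
sk_idf = {term: vec.idf_[col] for term, col in vec.vocabulary_.items()}

print(f"manual : common={idf_common:.6f}, rare={idf_rare:.6f}")
print(f"sklearn: market={sk_idf['market']:.6f}, rise={sk_idf['rise']:.6f}")
```

The added 1 keeps terms that occur in every document from being zeroed out entirely.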

In [20]:
print("\n🔧 Building TF-IDF...\n")

# Create TF-IDF vectorizer with sklearn's default smoothing
tfidf_vectorizer = TfidfVectorizer(
    min_df=2,
    lowercase=False,
    token_pattern=r'(?u)\b\w+\b',
    smooth_idf=True,  # Use smoothed IDF
    sublinear_tf=False  # Use raw term frequency
)

# Fit and transform TRAIN
start_time = time.time()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_text)
fit_time_tfidf = time.time() - start_time

print(f"✅ TF-IDF fitted on TRAIN in {fit_time_tfidf:.3f}s")
print(f"   Shape: {X_train_tfidf.shape}")
print(f"   Vocabulary size: {len(tfidf_vectorizer.vocabulary_):,}")


🔧 Building TF-IDF...

✅ TF-IDF fitted on TRAIN in 0.200s
   Shape: (1335, 11515)
   Vocabulary size: 11,515


In [21]:
# Transform DEV and TEST
transform_times_tfidf = []

# DEV
start = time.time()
X_dev_tfidf = tfidf_vectorizer.transform(X_dev_text)
dev_time = time.time() - start
transform_times_tfidf.append(dev_time / len(X_dev_text))

# TEST
start = time.time()
X_test_tfidf = tfidf_vectorizer.transform(X_test_text)
test_time = time.time() - start
transform_times_tfidf.append(test_time / len(X_test_text))

print(f"✅ TF-IDF transformed DEV in {dev_time:.3f}s")
print(f"✅ TF-IDF transformed TEST in {test_time:.3f}s")

✅ TF-IDF transformed DEV in 0.064s
✅ TF-IDF transformed TEST in 0.061s


### Manual TF-IDF Verification

Verify sklearn's TF-IDF matches manual calculation (within 1e-6)
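
A compact version of the same check, sketched on a hypothetical toy corpus. It is a variant rather than the notebook's own procedure: raw counts come from a `CountVectorizer` that reuses the TF-IDF vocabulary, so the columns line up by construction and no reordering is needed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (no empty documents, so row norms are non-zero)
toy_docs = ["market rise rise", "market fall", "film award market"]
pattern = r"(?u)\b\w+\b"

tfidf = TfidfVectorizer(lowercase=False, token_pattern=pattern, smooth_idf=True)
X_sklearn = tfidf.fit_transform(toy_docs).toarray()

# Raw term counts over the same vocabulary and column order
counts = CountVectorizer(vocabulary=tfidf.vocabulary_, lowercase=False,
                         token_pattern=pattern).fit_transform(toy_docs).toarray()

N = counts.shape[0]
df = (counts > 0).sum(axis=0)              # document frequency per term
idf = np.log((N + 1) / (df + 1)) + 1       # smoothed IDF
X_manual = counts * idf                    # tf * idf
X_manual = X_manual / np.linalg.norm(X_manual, axis=1, keepdims=True)  # L2 rows

max_diff = np.abs(X_sklearn - X_manual).max()
print(f"Max difference: {max_diff:.2e}")
```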

In [22]:
print("\n🔍 Manual TF-IDF Verification...\n")

# Take first 3 documents as tiny example
tiny_docs = X_train_text[:3]
print("Documents for verification:")
for i, doc in enumerate(tiny_docs):
    print(f"  Doc {i}: {doc[:80]}...")

# Fit sklearn TF-IDF on tiny corpus
tiny_vectorizer = TfidfVectorizer(
    lowercase=False,
    token_pattern=r'(?u)\b\w+\b',
    smooth_idf=True
)
X_tiny_sklearn = tiny_vectorizer.fit_transform(tiny_docs).toarray()

# Manual calculation
from collections import Counter

# Tokenize documents
tiny_tokens = [doc.split() for doc in tiny_docs]
print(f"\n\nTokenized docs:")
for i, tokens in enumerate(tiny_tokens):
    print(f"  Doc {i}: {tokens[:10]}... ({len(tokens)} tokens)")

# Build vocabulary
vocab = {}
idx = 0
for tokens in tiny_tokens:
    for token in set(tokens):
        if token not in vocab:
            vocab[token] = idx
            idx += 1

print(f"\nVocabulary size: {len(vocab)}")

# Calculate TF (term frequency)
N = len(tiny_docs)
V = len(vocab)

# Initialize TF matrix
tf_matrix = np.zeros((N, V))
for doc_idx, tokens in enumerate(tiny_tokens):
    token_counts = Counter(tokens)
    for token, count in token_counts.items():
        if token in vocab:
            tf_matrix[doc_idx, vocab[token]] = count

print(f"\nTF matrix shape: {tf_matrix.shape}")

# Calculate DF (document frequency)
df = np.zeros(V)
for doc_idx, tokens in enumerate(tiny_tokens):
    for token in set(tokens):
        if token in vocab:
            df[vocab[token]] += 1

# Calculate IDF with smoothing: log((N+1)/(df+1)) + 1
idf = np.log((N + 1) / (df + 1)) + 1

print(f"\nSample IDF values (first 5 terms):")
vocab_list = list(vocab.keys())[:5]
for term in vocab_list:
    term_idx = vocab[term]
    print(f"  {term:15s}: df={df[term_idx]:.0f}, idf={idf[term_idx]:.6f}")

# Calculate TF-IDF
tfidf_manual = tf_matrix * idf

# Normalize rows to unit length (L2 normalization)
row_norms = np.sqrt(np.sum(tfidf_manual**2, axis=1, keepdims=True))
row_norms[row_norms == 0] = 1  # Avoid division by zero
tfidf_manual = tfidf_manual / row_norms

print(f"\nManual TF-IDF matrix shape: {tfidf_manual.shape}")

# Compare with sklearn: align columns by term rather than by position, so the
# comparison stays valid even if the two vocabularies are not identical
shared_terms = [term for term in vocab if term in tiny_vectorizer.vocabulary_]
manual_cols = [vocab[term] for term in shared_terms]
sklearn_cols = [tiny_vectorizer.vocabulary_[term] for term in shared_terms]

# Get matching subset (same term order in both matrices)
X_tiny_sklearn_reordered = X_tiny_sklearn[:, sklearn_cols]
tfidf_manual_subset = tfidf_manual[:, manual_cols]

# Calculate difference
diff = np.abs(X_tiny_sklearn_reordered - tfidf_manual_subset)
max_diff = np.max(diff)

print(f"\n📊 Verification Results:")
print(f"  Max difference: {max_diff:.2e}")
if max_diff < 1e-6:
    print(f"  ✅ Match: Manual and sklearn TF-IDF agree within tolerance!")
else:
    print(f"  ❌ Mismatch: Difference exceeds tolerance")

# Show a sample comparison
print(f"\n  Sample TF-IDF values (Doc 0, first 5 non-zero terms):")
print(f"    {'Term':<15s} {'Manual':<12s} {'Sklearn':<12s} {'Diff':<12s}")
print(f"    {'-'*50}")

sample_indices = np.where(tfidf_manual_subset[0] > 0)[0][:5]
for idx in sample_indices:
    term = shared_terms[idx]
    manual_val = tfidf_manual_subset[0, idx]
    sklearn_val = X_tiny_sklearn_reordered[0, idx]
    diff_val = abs(manual_val - sklearn_val)
    print(f"    {term:<15s} {manual_val:<12.6f} {sklearn_val:<12.6f} {diff_val:<12.2e}")


🔍 Manual TF-IDF Verification...

Documents for verification:
  Doc 0: worldcom bos left book alone former worldcom bos bernie ebbers accused overseein...
  Doc 1: yeading face newcastle fa cup premiership side newcastle united face trip ryman ...
  Doc 2: ocean twelve raid box office ocean twelve crime caper sequel starring george clo...


Tokenized docs:
  Doc 0: ['worldcom', 'bos', 'left', 'book', 'alone', 'former', 'worldcom', 'bos', 'bernie', 'ebbers']... (187 tokens)
  Doc 1: ['yeading', 'face', 'newcastle', 'fa', 'cup', 'premiership', 'side', 'newcastle', 'united', 'face']... (239 tokens)
  Doc 2: ['ocean', 'twelve', 'raid', 'box', 'office', 'ocean', 'twelve', 'crime', 'caper', 'sequel']... (176 tokens)

Vocabulary size: 411

TF matrix shape: (3, 411)

Sample IDF values (first 5 terms):
  loss           : df=1, idf=1.693147
  know           : df=1, idf=1.693147
  admitted       : df=1, idf=1.693147
  alone          : df=1, idf=1.693147
  11bn           : df=1, idf=1.693147

M

In [23]:
# Calculate top-k coverage
print("\n📊 Calculating top-k coverage...\n")

# Get average TF-IDF per term across all documents
term_importance = np.array(X_train_tfidf.mean(axis=0)).flatten()

# Get indices of top-k terms
top_k_indices = {}
top_k_indices[100] = np.argsort(term_importance)[-100:]
top_k_indices[500] = np.argsort(term_importance)[-500:]

# Calculate coverage: sum of TF-IDF mass for top-k terms / total mass
total_mass = X_train_tfidf.sum()

topk_100_mass = X_train_tfidf[:, top_k_indices[100]].sum()
topk_500_mass = X_train_tfidf[:, top_k_indices[500]].sum()

topk_100_coverage = float(topk_100_mass / total_mass)
topk_500_coverage = float(topk_500_mass / total_mass)

print(f"Top-100 coverage: {topk_100_coverage:.4f} ({topk_100_coverage*100:.2f}%)")
print(f"Top-500 coverage: {topk_500_coverage:.4f} ({topk_500_coverage*100:.2f}%)")

# Show top-10 terms
idx_to_term = {idx: term for term, idx in tfidf_vectorizer.vocabulary_.items()}

top_10_indices = np.argsort(term_importance)[-10:][::-1]
print(f"\n🔝 Top-10 most important terms (by avg TF-IDF):")
for rank, idx in enumerate(top_10_indices, 1):
    term = idx_to_term[idx]
    importance = term_importance[idx]
    print(f"  {rank:2d}. {term:15s} : {importance:.6f}")


📊 Calculating top-k coverage...

Top-100 coverage: 0.1195 (11.95%)
Top-500 coverage: 0.3344 (33.44%)

🔝 Top-10 most important terms (by avg TF-IDF):
   1. said            : 0.040767
   2. mr              : 0.028011
   3. year            : 0.023712
   4. would           : 0.019603
   5. game            : 0.018427
   6. film            : 0.018066
   7. people          : 0.017062
   8. new             : 0.016816
   9. also            : 0.016027
  10. one             : 0.015069


In [24]:
# Calculate health metrics with top-k coverage
tfidf_metrics = calculate_health_metrics(
    X_train_tfidf, tfidf_vectorizer, X_test_tfidf, test_df['tokens'].tolist(),
    fit_time_tfidf, transform_times_tfidf, "TF-IDF"
)

# Add top-k coverage
tfidf_metrics['topk_100'] = topk_100_coverage
tfidf_metrics['topk_500'] = topk_500_coverage

print_metrics(tfidf_metrics, "TF-IDF")


📊 TF-IDF - Health Metrics
  Vocabulary size (V):            11,515
  Non-zero entries (nnz):        186,739
  Sparsity:                       0.9879
  OOV rate (TEST):                0.0706
  Top-100 coverage:               0.1195
  Top-500 coverage:               0.3344
  Fit time (s):                    0.200
  Transform time (ms/doc):         0.141
  Memory (MB):                      2.14


In [25]:
# Save TF-IDF artifacts
with open(MODELS_DIR / 'tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

sp.save_npz(MODELS_DIR / 'X_train_tfidf.npz', X_train_tfidf)
sp.save_npz(MODELS_DIR / 'X_dev_tfidf.npz', X_dev_tfidf)
sp.save_npz(MODELS_DIR / 'X_test_tfidf.npz', X_test_tfidf)

print("\n💾 TF-IDF artifacts saved!")


💾 TF-IDF artifacts saved!


## 7. Summary & Comparison

In [26]:
# Create comparison table
comparison_df = pd.DataFrame({
    'Method': ['OHE', 'BoW', 'N-gram', 'TF-IDF'],
    'Vocab Size': [
        ohe_metrics['V'],
        bow_metrics['V'],
        ngram_metrics['V'],
        tfidf_metrics['V']
    ],
    'Sparsity': [
        ohe_metrics['sparsity'],
        bow_metrics['sparsity'],
        ngram_metrics['sparsity'],
        tfidf_metrics['sparsity']
    ],
    'OOV Rate': [
        ohe_metrics['oov'],
        bow_metrics['oov'],
        ngram_metrics['oov'],
        tfidf_metrics['oov']
    ],
    'Fit Time (s)': [
        ohe_metrics['fit_s'],
        bow_metrics['fit_s'],
        ngram_metrics['fit_s'],
        tfidf_metrics['fit_s']
    ],
    'Transform (ms/doc)': [
        ohe_metrics['ms_per_doc'],
        bow_metrics['ms_per_doc'],
        ngram_metrics['ms_per_doc'],
        tfidf_metrics['ms_per_doc']
    ],
    'Memory (MB)': [
        ohe_metrics['mem_mb'],
        bow_metrics['mem_mb'],
        ngram_metrics['mem_mb'],
        tfidf_metrics['mem_mb']
    ]
})

print("\n" + "="*80)
print("📊 SPARSE METHODS COMPARISON")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)


📊 SPARSE METHODS COMPARISON
Method  Vocab Size  Sparsity  OOV Rate  Fit Time (s)  Transform (ms/doc)  Memory (MB)
   OHE        2000  0.949789  0.286281      0.223587            0.124383     1.539337
   BoW       11515  0.987852  0.070593      0.221294            0.172940     2.142155
N-gram       18625  0.990713  0.093533      0.723580            0.374283     2.647640
TF-IDF       11515  0.987852  0.070593      0.200188            0.140581     2.142155


In [27]:
# Save all metrics to cache
sparse_metrics = {
    'ohe': ohe_metrics,
    'bow': bow_metrics,
    'ngram': ngram_metrics,
    'tfidf': tfidf_metrics
}

with open(CACHE_DIR / 'sparse_metrics.pkl', 'wb') as f:
    pickle.dump(sparse_metrics, f)

print("\n💾 All sparse metrics saved to cache/sparse_metrics.pkl")


💾 All sparse metrics saved to cache/sparse_metrics.pkl


## 8. Summary

✅ **Completed:**
- One-Hot Encoding (top 2000 tokens)
- Bag-of-Words (unigram counts, min_df=2)
- N-grams (unigrams + bigrams, min_df=3)
- TF-IDF (with manual verification)
- Calculated all health metrics
- Saved all representations and metrics

**Key Observations:**
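- N-grams have the largest vocabulary (18,625 terms, 54.3% of them bigrams)
- All methods are highly sparse: OHE is the densest at 94.98% sparsity, and the other three exceed 98.7%
- BoW and TF-IDF share one vocabulary (11,515 terms), so their sparsity, OOV rate, and memory are identical
- TF-IDF is often a strong baseline for text classification, but this notebook does not measure classification accuracy
- The top-500 terms account for 33.44% of total TF-IDF mass (top-100: 11.95%)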

**Next Steps:**
- Build dense representations (Word2Vec, GloVe)
- Train classifiers on all representations
- Build retrieval system

In [28]:
print("\n" + "="*60)
print("🎉 NOTEBOOK 02: SPARSE METHODS COMPLETE! 🎉")
print("="*60)
print(f"\n✅ Built 4 sparse representations")
print(f"✅ Calculated health metrics for all methods")
print(f"✅ Verified TF-IDF manual calculation")
print(f"✅ Saved all artifacts to {MODELS_DIR}")
print("\n📝 Ready for next notebook: 03_dense_methods.ipynb")


🎉 NOTEBOOK 02: SPARSE METHODS COMPLETE! 🎉

✅ Built 4 sparse representations
✅ Calculated health metrics for all methods
✅ Verified TF-IDF manual calculation
✅ Saved all artifacts to ../models

📝 Ready for next notebook: 03_dense_methods.ipynb
