# Protest Classification
## Compare TF-IDF vs Embeddings

**Goal**: Train ONE model (Random Forest Classifier) with TWO vectorization methods and compare results.

**Time**: 30 minutes


## Step 1: Load Data (3 minutes)

In [None]:
# import necessary libraries

# Load data
df = # YOUR CODE

# TODO: Combine 'notes' and 'description' into a new column 'text'. 
# You can use string concatenation with a space in between.
df['text'] = # YOUR CODE

# Quick check
print(f"Shape: {df.shape}")
print(f"\nCategories:\n{df['category'].value_counts()}")

In [None]:
# TODO: Encode the 'category' column as numeric labels suitable for ML models (e.g., 0, 1, 2, ...)
# Use LabelEncoder (from sklearn.preprocessing import LabelEncoder)
# and its fit_transform() method to generate the numeric labels.
# Alternatively, you can use pd.factorize() or a dictionary mapping.


# YOUR CODE
y = # YOUR CODE

# TODO: Split the text and labels 80/20 using train_test_split()
# (from sklearn.model_selection import train_test_split)
# YOUR CODE

## Step 2: Model with TF-IDF (10 minutes)

**What is TF-IDF?**
- **TF** (Term Frequency): How often a word appears in a document
- **IDF** (Inverse Document Frequency): How rare/common a word is across all documents
- **Result**: Important words get high scores, common words (like "the") get low scores
- **Example**: "protest" appears often in one document but not all → high TF-IDF score

TF-IDF converts text into numbers that capture word importance, making it possible for ML models to work with text.
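
To make the scoring concrete, here is a minimal hand-rolled sketch on a made-up three-sentence corpus (not the protest dataset). It assumes the plain textbook formula, tf × log(N / df). `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ, but the ranking intuition is the same.

```python
import math

# Toy corpus, invented for illustration only
docs = [
    "the protest blocked the road",
    "the march was peaceful",
    "the police cleared the road",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)


def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / doc_freq)


def tfidf(term, tokens):
    """Textbook TF-IDF score of `term` within one tokenized document."""
    return tf(term, tokens) * idf(term)


first_doc = tokenized[0]
score_the = tfidf("the", first_doc)          # appears in every document
score_road = tfidf("road", first_doc)        # appears in two of three documents
score_protest = tfidf("protest", first_doc)  # appears in only one document

print(f"the:     {score_the:.3f}")
print(f"road:    {score_road:.3f}")
print(f"protest: {score_protest:.3f}")
```

The word found in every document scores zero, while the word unique to one document scores highest, which is the behavior described in the bullets above.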

📚 **Learn more**: [TfidfVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# import other necessary libraries

# TODO: Create TF-IDF features
tfidf = TfidfVectorizer(max_features=1000)
X_train_tfidf = # YOUR CODE (fit_transform)
X_test_tfidf = # YOUR CODE (transform only)

print(f"TF-IDF shape: {X_train_tfidf.shape}")

In [None]:
# TODO: Train Random Forest model
# Initialize model
model_tfidf = # YOUR CODE
# YOUR CODE - fit the model

# TODO: Predict and evaluate
y_pred_tfidf = # YOUR CODE

# Calculate accuracy
acc_tfidf = # YOUR CODE


# Print results
print("\nTF-IDF Results:")
print(f"   Accuracy: {acc_tfidf:.4f}")

## Step 3: Model with Embeddings (12 minutes)

**What are Sentence Embeddings?**
- Convert entire sentences/paragraphs into dense vectors (fixed-size arrays of numbers)
- Capture **semantic meaning**: "protest rally" and "demonstration" will have similar vectors
- Pre-trained on huge datasets, so they understand context and synonyms
- **all-MiniLM-L6-v2**: Lightweight model (384 dimensions, fast, good quality)

**Key difference from TF-IDF:**
- TF-IDF: Word frequency only → "protest" and "demonstration" are completely different
- Embeddings: Semantic meaning → "protest" and "demonstration" are similar
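
One common way to quantify "similar vectors" is cosine similarity. The sketch below uses small hand-written vectors as hypothetical stand-ins: they are not real outputs of all-MiniLM-L6-v2 (which produces 384 dimensions), and the two-word vocabulary for the TF-IDF case is likewise an assumption made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# TF-IDF view: each word is its own vocabulary column, so two different
# words share no non-zero dimension (hypothetical two-word vocabulary).
tfidf_protest = [1.0, 0.0]
tfidf_demonstration = [0.0, 1.0]

# Embedding view: dense vectors where related meanings point in similar
# directions (made-up 4-dimensional values, not real model output).
embed_protest = [0.9, 0.1, 0.3, 0.0]
embed_demonstration = [0.8, 0.2, 0.4, 0.1]
embed_weather = [0.0, 0.9, 0.1, 0.8]

sim_tfidf = cosine_similarity(tfidf_protest, tfidf_demonstration)
sim_related = cosine_similarity(embed_protest, embed_demonstration)
sim_unrelated = cosine_similarity(embed_protest, embed_weather)

print(f"TF-IDF, protest vs demonstration:    {sim_tfidf:.2f}")
print(f"Embedding, protest vs demonstration: {sim_related:.2f}")
print(f"Embedding, protest vs weather:       {sim_unrelated:.2f}")
```

With real embeddings the same function can be applied to the rows returned by `.encode()` to check how close two texts are.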

📚 **Learn more**: [Sentence Transformers documentation](https://www.sbert.net/docs/pretrained_models.html)

In [None]:
from sentence_transformers import SentenceTransformer

# TODO: Load embedding model
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# TODO: Encode training data to get embeddings
print("\nEncoding training data (this takes ~1 minute)...")
X_train_embed = # YOUR CODE (use .encode())

# TODO: Encode test data to get embeddings
print("Encoding test data...")
X_test_embed = # YOUR CODE

print(f"\nEmbedding shape: {X_train_embed.shape}")

In [None]:
# TODO: Train same model with embeddings
model_embed = # YOUR CODE
# YOUR CODE - fit the model

# TODO: Predict and evaluate
y_pred_embed = # YOUR CODE
acc_embed = # YOUR CODE


print(f"\nEmbeddings Results:")
print(f"   Accuracy: {acc_embed:.4f}")

## Step 4: Compare Results (5 minutes)

In [None]:
# TODO: Create comparison table
# Create a DataFrame to compare results
# one column for method, one for accuracy
results = # YOUR CODE

print("\n" + "="*60)
print("FINAL COMPARISON")
print("="*60)
print(results.to_string(index=False))
print("="*60)
