# Protest Classification Exercise - SOLUTION
## Complete Implementation with Two Models

**Model Used**: Random Forest, trained on two feature sets (TF-IDF and sentence embeddings)

---

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sentence_transformers import SentenceTransformer
import warnings
warnings.filterwarnings('ignore')



✅ Imports complete


## Step 1: Load and Prepare Data

In [3]:
from pathlib import Path
# ==================
# SETUP INPUT
# ==================
DIR_DATA = Path.cwd().parents[1] / "data"
FILE_PROTESTS = DIR_DATA / "conflict/protests_filtered.csv"

In [None]:
# Load data
# PTS: 1
df = pd.read_csv(FILE_PROTESTS)

# Combine notes and description (fill missing values so the combined text is never NaN)
# PTS: 2
df['text'] = df['notes'].fillna('') + ' ' + df['description'].fillna('')

# Check data
print(f"Dataset shape: {df.shape}")
print(f"\nCategories:")
print(df['category'].value_counts())
print(f"\nSample text: {df['text'].iloc[0][:150]}...")

In [7]:
# Encode target labels
# PTS: 2
# If the trainer used another method (e.g., pd.factorize() or dictionary mapping), 
# that's acceptable too as long as the labels are correctly encoded.
le = LabelEncoder()
y = le.fit_transform(df['category'])

print(f"Encoded labels: {np.unique(y)}")
print(f"Label mapping: {dict(enumerate(le.classes_))}")

Encoded labels: [0 1 2 3 4 5]
Label mapping: {0: 'Business and legal', 1: 'Climate and environment', 2: 'Livelihood (Prices, jobs and salaries)', 3: 'Political/Security', 4: 'Public service delivery', 5: 'Social'}
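
The mapping above follows from how `LabelEncoder` works: it assigns integer codes to the class names in sorted order. The sketch below reproduces that idea in plain Python with a few hypothetical labels (not rows from the dataset) and shows how codes map back to names, which is what `le.inverse_transform` does for predictions.

```python
# Hypothetical labels for illustration; the real ones come from df['category'].
labels = ["Social", "Political/Security", "Social", "Business and legal"]

# Integer codes follow the sorted order of the unique class names.
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]
decoded = [classes[code] for code in encoded]  # plain-Python analogue of inverse_transform

print(encoded)            # [2, 1, 2, 0]
print(decoded == labels)  # True
```

Decoding is useful when reading predictions, since the model only ever sees the integer codes.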


## Step 2: Model with TF-IDF (10 minutes)

**What is TF-IDF?**
- **TF** (Term Frequency): How often a word appears in a document
- **IDF** (Inverse Document Frequency): How rare a word is across all documents (rarer words get a higher IDF)
- **Result**: Words that are frequent in one document but rare across the corpus get high scores; common words (like "the") get low scores
- **Example**: "protest" appears often in one document but not in all of them → high TF-IDF score

TF-IDF converts text into numbers that capture word importance, making it possible for ML models to work with text.
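
To make the scoring concrete, here is a minimal hand-rolled sketch on a made-up three-sentence corpus. It uses the smoothed IDF formula and L2 normalisation that `TfidfVectorizer` applies by default, but it tokenises with a plain `split()`, so it is an illustration of the idea rather than a replacement for the library.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration (not taken from the protest dataset).
docs = [
    "protest blocked the road",
    "market opened in the morning",
    "council met in the morning",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    # Raw term count times IDF, then L2-normalised per document.
    counts = Counter(tokens)
    raw = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {weight:.3f}")
```

In the first sentence, "the" appears in every document and ends up with the lowest weight, while "protest", "blocked" and "road" appear in only one document and share the highest weight.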

📚 **Learn more**: [TfidfVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [9]:
# Split data (80/20)
# PTS: 2
X_train, X_test, y_train, y_test = train_test_split(
    df['text'].values,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

Training samples: 179
Test samples: 45
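
The `stratify=y` argument keeps the class proportions roughly equal in the training and test sets, which matters when some categories are rare. The plain-Python sketch below illustrates the idea with hypothetical labels. The helper `stratified_split` is invented for this example and is not how scikit-learn implements it internally.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))  # Counter({0: 64, 1: 16})
print(Counter(labels[i] for i in test_idx))   # Counter({0: 16, 1: 4})
```

Both subsets keep the original 80/20 class ratio, so the rare class is not accidentally left out of the test set.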


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
# import other necessary libraries

# TODO: Create TF-IDF features
# PTS: 5
tfidf = TfidfVectorizer(max_features=1000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"TF-IDF shape: {X_train_tfidf.shape}")

TF-IDF shape: (179, 1000)


In [11]:
# TODO: Train Random Forest model
# Initialize model
# PTS: 4
model_tfidf = RandomForestClassifier(random_state=42)
# YOUR CODE - fit the model
model_tfidf.fit(X_train_tfidf, y_train)

# TODO: Predict and evaluate
# PTS: 4
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)
# Calculate accuracy
acc_tfidf = accuracy_score(y_test, y_pred_tfidf)


# Print results
print(f"\n TF-IDF Results:")
print(f"   Accuracy: {acc_tfidf:.4f}")


 TF-IDF Results:
   Accuracy: 1.0000


## Step 3: Model with Embeddings (12 minutes)

**What are Sentence Embeddings?**
- Convert entire sentences/paragraphs into dense vectors (fixed-size arrays of numbers)
- Capture **semantic meaning**: "protest rally" and "demonstration" will have similar vectors
- Pre-trained on huge datasets, so they understand context and synonyms
- **all-MiniLM-L6-v2**: Lightweight model (384 dimensions, fast, good quality)

**Key difference from TF-IDF:**
- TF-IDF: Word frequency only → "protest" and "demonstration" are treated as unrelated features
- Embeddings: Semantic meaning → "protest" and "demonstration" get similar vectors
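
Similarity between vectors is commonly measured with cosine similarity. The sketch below uses made-up low-dimensional vectors (not real model output) to show the difference described above: two distinct words share no dimension in a bag-of-words representation, while dense embeddings of related words can point in nearly the same direction.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Bag-of-words style vectors: each word occupies its own dimension.
bow_protest = [1.0, 0.0]
bow_demonstration = [0.0, 1.0]

# Made-up 3-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
emb_protest = [0.9, 0.1, 0.2]
emb_demonstration = [0.8, 0.2, 0.3]
emb_invoice = [0.1, 0.9, 0.1]

print(cosine_similarity(bow_protest, bow_demonstration))  # 0.0
print(cosine_similarity(emb_protest, emb_demonstration))  # roughly 0.98
print(cosine_similarity(emb_protest, emb_invoice))        # roughly 0.24
```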

📚 **Learn more**: [Sentence Transformers documentation](https://www.sbert.net/docs/pretrained_models.html)

In [None]:
# Load lightweight embedding model
print("Loading embedding model...")
# PTS: 1
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

print("\nEncoding training data...")
# PTS: 2
X_train_embed = embedding_model.encode(X_train, show_progress_bar=True)

print("\nEncoding test data...")
# PTS: 2
X_test_embed = embedding_model.encode(X_test, show_progress_bar=True)

print(f"\nEmbedding shape: {X_train_embed.shape}")
print(f"Embedding dimensions: {X_train_embed.shape[1]}")

Loading embedding model...

Encoding training data...


Batches: 100%|██████████| 6/6 [00:00<00:00, 61.24it/s]



Encoding test data...


Batches: 100%|██████████| 2/2 [00:00<00:00, 79.43it/s]


Embedding shape: (179, 384)
Embedding dimensions: 384





In [None]:
# TODO: Train same model with embeddings
# PTS: 4
model_embed = RandomForestClassifier(random_state=42)
model_embed.fit(X_train_embed, y_train)

# TODO: Predict and evaluate
# PTS: 4
y_pred_embed = model_embed.predict(X_test_embed)
acc_embed = accuracy_score(y_test, y_pred_embed)


print(f"\nEmbeddings Results:")
print(f"   Accuracy: {acc_embed:.4f}")


Embeddings Results:
   Accuracy: 0.9111
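
Accuracy is simply the share of test samples whose predicted label matches the true label. The short sketch below recomputes it by hand on hypothetical labels and checks the arithmetic behind the figure above, assuming the 45 test samples from the split.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, only to show the calculation.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print(accuracy(y_true, y_pred))  # 0.8

# With 45 test samples, 41 correct predictions give an accuracy of 0.9111.
print(round(41 / 45, 4))  # 0.9111
```

On a test set this small, each misclassified sample moves accuracy by about 2.2 percentage points, so small differences between methods should be read with caution.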


## Step 4: Compare Results (5 minutes)

In [17]:
# TODO: Create comparison table
# Create a DataFrame to compare results
# one column for method, one for accuracy
# PTS: 4
results = pd.DataFrame({
    'Method': ['TF-IDF + Random Forest', 'Embeddings + Random Forest'],
    'Accuracy': [acc_tfidf, acc_embed]
})

print("\n" + "="*60)
print("FINAL COMPARISON")
print("="*60)
print(results.to_string(index=False))
print("="*60)


FINAL COMPARISON
                    Method  Accuracy
    TF-IDF + Random Forest  1.000000
Embeddings + Random Forest  0.911111


In [18]:
1 + 2 + 2 + 2 + 5 + 4 + 4 + 1 + 2 + 2 + 4 + 4 + 4

37

## Total Points
1 + 2 + 2 + 2 + 5 + 4 + 4 + 1 + 2 + 2 + 4 + 4 + 4 = 37
