# 03 — Baseline Predictions

**Speech-to-Vote Project**

- **Baseline 1**: Party majority only — what accuracy?
- **Baseline 2**: Party + topic (activiteit_onderwerp) — what accuracy?

These tell us: how much does speech text *add* beyond party/metadata?

In [None]:
import pandas as pd
from pathlib import Path

ANALYSIS_DIR = Path("../data/analysis")
pairs = pd.read_parquet(ANALYSIS_DIR / "speech_vote_pairs.parquet")

# For faster iteration, sample if very large
if len(pairs) > 500_000:
    pairs = pairs.sample(500_000, random_state=42)
    print("Sampled 500k rows for speed")

print(f"Working with {len(pairs):,} pairs")
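
Row-level sampling can thin out the (besluit_id, fractie) groups that the baselines below rely on, and a group reduced to a single row is always predicted correctly by its own majority. One possible alternative, sketched here on a toy frame, is to sample whole decisions so every kept group stays intact. The helper name `sample_decisions`, the toy data, and the `Voor`/`Tegen` vote labels are assumptions for illustration, not part of the project.

```python
import pandas as pd

def sample_decisions(df: pd.DataFrame, n_decisions: int, seed: int = 42) -> pd.DataFrame:
    """Keep every row of a random subset of besluit_id values (hypothetical helper)."""
    ids = pd.Series(df["besluit_id"].unique())
    if len(ids) <= n_decisions:
        return df  # nothing to sample
    keep = ids.sample(n_decisions, random_state=seed)
    return df[df["besluit_id"].isin(keep)]

# Toy data; column names follow the notebook, values are made up
toy = pd.DataFrame({
    "besluit_id": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "fractie":    ["P1", "P1", "P2", "P1", "P2", "P1", "P1", "P2"],
    "vote":       ["Voor", "Voor", "Tegen", "Voor", "Voor", "Tegen", "Voor", "Tegen"],
})

sampled = sample_decisions(toy, n_decisions=2)
print(sampled["besluit_id"].nunique(), "decisions,", len(sampled), "rows")
```

The trade-off is that the number of rows is no longer fixed: it depends on which decisions happen to be drawn.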

## Baseline 1: Party majority

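For each (besluit_id, fractie) group, predict the most common vote in that group. The majority is computed from the same rows it is scored on, so this is an in-sample figure: it measures how uniformly each party votes on each decision rather than held-out accuracy.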

In [None]:
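# Per (besluit_id, fractie): majority (modal) vote.
# Series.mode() returns its values sorted, so ties resolve to the first value in sort order.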
majority = pairs.groupby(['besluit_id', 'fractie'])['vote'].agg(
    lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else x.iloc[0]
).reset_index().rename(columns={'vote': 'pred_party'})

merged = pairs.merge(majority, on=['besluit_id', 'fractie'], how='left')
# .copy() so that adding columns to `merged` later does not raise SettingWithCopyWarning
merged = merged[merged['pred_party'].notna()].copy()

correct = merged['vote'] == merged['pred_party']
acc1 = correct.mean() * 100
print(f"Baseline 1 (party majority): {acc1:.2f}% accuracy")
print(f"Correct: {correct.sum():,} / {len(merged):,}")

## Baseline 2: Party + topic (activiteit_onderwerp)

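If topic data is available, group by (besluit_id, fractie, topic) and predict the majority vote within each group. These groups subdivide the Baseline 1 groups, so when both baselines are scored in-sample on the same rows, this accuracy cannot fall below Baseline 1.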

In [None]:
if 'activiteit_onderwerp' in pairs.columns and pairs['activiteit_onderwerp'].notna().sum() > 1000:
    # Simplified topic key: the first 50 characters of the topic text (a prefix, not a hash)
    merged['topic'] = merged['activiteit_onderwerp'].fillna('').str[:50]
    majority2 = merged.groupby(['besluit_id', 'fractie', 'topic'])['vote'].agg(
        lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else x.iloc[0]
    ).reset_index().rename(columns={'vote': 'pred_party_topic'})
    merged2 = merged.merge(majority2, on=['besluit_id', 'fractie', 'topic'], how='left')
    merged2 = merged2[merged2['pred_party_topic'].notna()]
    acc2 = (merged2['vote'] == merged2['pred_party_topic']).mean() * 100
    print(f"Baseline 2 (party + topic): {acc2:.2f}% accuracy")
else:
    print("Insufficient topic data for Baseline 2")
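
One caveat when comparing the two baselines: splitting groups more finely can only keep or raise an in-sample majority accuracy, even when the extra key carries no information. The sketch below illustrates this with randomly assigned topic labels on synthetic data. Everything in it (the helper, the synthetic votes, the label names) is an assumption for illustration, not project data.

```python
import numpy as np
import pandas as pd

def majority_accuracy(df, keys, target="vote"):
    """In-sample accuracy (%) of predicting each group's modal target value."""
    pred = df.groupby(keys)[target].transform(lambda s: s.mode().iloc[0])
    return float((df[target] == pred).mean() * 100)

rng = np.random.default_rng(0)
n = 2_000
toy = pd.DataFrame({
    "besluit_id": rng.integers(0, 50, n),
    "fractie": rng.choice(["P1", "P2", "P3"], n),
    "vote": rng.choice(["Voor", "Tegen"], n, p=[0.7, 0.3]),
    # Topic labels drawn at random: they say nothing about the vote
    "topic": rng.choice(["t1", "t2", "t3", "t4"], n),
})

acc_party = majority_accuracy(toy, ["besluit_id", "fractie"])
acc_party_topic = majority_accuracy(toy, ["besluit_id", "fractie", "topic"])
print(f"party: {acc_party:.2f}%   party + random topic: {acc_party_topic:.2f}%")
```

A gap between Baseline 1 and Baseline 2 is therefore not, by itself, evidence that topic is predictive. Scoring on rows that were not used to compute the majority would be one way to check.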

## Interpretation

**Key question**: Can speech text *add* predictive value beyond these baselines?

If Baseline 1 is ~95%, only about 5 percentage points of error remain, so a speech model must exceed 95% to be meaningful.
If Baseline 1 is ~70%, there is far more room for speech features to help.
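
One way to compare these scenarios on a common scale is the share of the baseline's errors that a model removes. The sketch below assumes accuracies given in percent; the function name and the numbers are hypothetical and are not results from this project.

```python
def relative_error_reduction(baseline_acc: float, model_acc: float) -> float:
    """Fraction of the baseline's errors removed by the model (accuracies in %)."""
    baseline_err = 100.0 - baseline_acc
    if baseline_err <= 0:
        raise ValueError("baseline is already perfect; no errors left to remove")
    return (model_acc - baseline_acc) / baseline_err

# Hypothetical numbers: the same +1 point gain against two different baselines
high = relative_error_reduction(95.0, 96.0)
low = relative_error_reduction(70.0, 71.0)
print(f"95% -> 96%: {high:.1%} of errors removed")
print(f"70% -> 71%: {low:.1%} of errors removed")
```

The same absolute gain removes a much larger share of the remaining errors when the baseline is already high, which is worth keeping in mind when judging a small improvement over a strong party baseline.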