# Parliament Relevance Classifier
## CSC 4792 Group Project - Team 16

**Topic 2.1.10**: Classify each MP's response as *relevant* or *not relevant* to the motion

**Team Members**:
- Francis Kalunga, 2021518884, francis.kalunga@cs.unza.zm
- Victor Chabunda, 2021422496, victor.chabunda@cs.unza.zm
- Constance Chilamo, 2021517420, chilamo.constance@cs.unza.zm

**Date**: 2025

---

### Problem Statement
Given a motion and the debate transcript for a parliamentary sitting, label every MP utterance as **Relevant** or **NotRelevant** to that motion.

### Approach
This notebook follows the CRISP-DM methodology through all phases:
- **[BU]** Business Understanding
- **[DU]** Data Understanding  
- **[DP]** Data Preparation
- **[MO]** Modeling
- **[EV]** Evaluation
- **[DE]** Deployment


## [BU] Business Understanding

Phase: [BU] Business Understanding  
Date: 2025  
Team: Team 16

### Problem Statement
The Zambian National Assembly generates extensive debate transcripts during parliamentary sittings, but currently lacks an automated system to identify which speaker utterances are directly relevant to the motions being discussed. This creates challenges for:
- **Parliamentary staff** who need to index and search through Hansard records efficiently
- **Researchers and journalists** who want to analyze parliamentary discourse and voting patterns
- **Citizens** who seek to understand how their representatives engage with specific policy issues

The problem is to **classify each speaker turn in parliamentary debates as either "Relevant" or "NotRelevant" to the motion under discussion**, where:
- **Relevant**: utterances that argue for/against the motion, provide supporting evidence, propose amendments, or discuss implementation
- **NotRelevant**: procedural points, greetings, tangential discussions, jokes, or administrative matters

### Business Objectives
1. **Improve Accessibility of Parliamentary Records**  
   Enable parliamentary staff to automatically classify utterances as relevant or not, making it easier to index, search, and retrieve key information.
2. **Support Evidence-Based Research and Journalism**  
   Provide researchers and journalists with structured data on parliamentary debates, improving analysis of policy discussions and representative accountability.
3. **Enhance Transparency and Civic Engagement**  
   Allow citizens and civic organizations to more easily understand how their representatives engage with motions, fostering accountability and democratic participation.
4. **Increase Operational Efficiency**  
   Reduce the time and effort required for manual annotation and indexing of Hansard records by introducing automation.

### Intended Users
1. **Parliamentary Information Services** - for automated indexing and search
2. **Political researchers** - for discourse analysis and policy tracking  
3. **Journalists** - for identifying key arguments and positions on specific motions
4. **Civic organizations** - for monitoring representative engagement with issues

### Key Performance Indicators (KPIs)
**Primary Metrics**
- **Macro-F1 Score** ≥ 0.75 (balanced performance across both classes)
- **Relevant Class Recall** ≥ 0.80 (capture most relevant utterances)
- **Area Under Precision-Recall Curve (AUPRC)** ≥ 0.70 (handle class imbalance)

**Secondary Metrics**
- Balanced Accuracy ≥ 0.75
- Per-sitting performance consistency
- Per-speaker performance analysis
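
The thresholds above can be checked mechanically once predictions exist. The following is a minimal pure-Python sketch, not the project's evaluation code: the function names (`classification_kpis`, `average_precision`), the `KPI_TARGETS` dictionary, and the toy labels and scores are assumptions made for this illustration. AUPRC is approximated here by average precision, one common estimator of the area under the precision-recall curve.

```python
from collections import Counter

# Targets copied from the KPI list; the dictionary name and keys are hypothetical.
KPI_TARGETS = {"macro_f1": 0.75, "relevant_recall": 0.80,
               "auprc": 0.70, "balanced_accuracy": 0.75}

def classification_kpis(y_true, y_pred, positive="Relevant"):
    """Macro-F1, positive-class recall, and balanced accuracy (non-empty input assumed)."""
    pairs = Counter(zip(y_true, y_pred))
    f1s, recalls = {}, {}
    for label in sorted(set(y_true)):
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s[label] = (2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
        recalls[label] = recall
    return {
        "macro_f1": sum(f1s.values()) / len(f1s),
        "relevant_recall": recalls.get(positive, 0.0),
        "balanced_accuracy": sum(recalls.values()) / len(recalls),
    }

def average_precision(y_true, scores, positive="Relevant"):
    """Mean of precision@k over the ranks k where a positive item appears."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == positive)
    if n_pos == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == positive:
            hits += 1
            total += hits / rank
    return total / n_pos

# Toy data, invented for illustration only.
y_true = ["Relevant", "Relevant", "Relevant", "NotRelevant", "NotRelevant", "Relevant"]
y_pred = ["Relevant", "Relevant", "NotRelevant", "NotRelevant", "Relevant", "Relevant"]
scores = [0.9, 0.8, 0.4, 0.2, 0.6, 0.7]  # model score for the Relevant class

kpis = classification_kpis(y_true, y_pred)
kpis["auprc"] = average_precision(y_true, scores)
for name, target in KPI_TARGETS.items():
    status = "meets" if kpis[name] >= target else "misses"
    print(f"{name}: {kpis[name]:.3f} ({status} target {target:.2f})")
```

Grouping the same computation by `sitting_id` or by speaker would give the per-sitting and per-speaker consistency checks listed under the secondary metrics.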

### Scope and Constraints
**In Scope**
- Parliamentary debates and proceedings from Zambian National Assembly
- English language utterances
- Motions from Order Papers (substantive motions, not procedural)
- Speaker turns with clear attribution and timestamps

**Out of Scope**
- Committee proceedings (different format and context)
- Question Time sessions (different interaction patterns)
- Languages other than English
- Real-time classification (batch processing acceptable)

**Data Coverage**
- Target: 6-10 parliamentary sittings from 2023
- Minimum: 1,000 manually labeled utterances for training
- Time range: Representative sample across different motion types

### Risks and Assumptions
**Technical Risks**
- **Low inter-annotator agreement** - Complex cases may be subjectively labeled
- **Class imbalance** - Most utterances may be relevant, creating skewed training data
- **Context dependency** - Relevance may require understanding previous utterances
- **Motion complexity** - Compound motions may have multiple relevant topics

**Business Risks**
- **Annotation quality** - Inconsistent labeling could hurt model performance
- **Generalizability** - Model may not work well on different parliamentary systems
- **Deployment complexity** - Integration with existing parliamentary systems

**Key Assumptions**
- Parliamentary transcripts are accurately transcribed with speaker attribution
- Order Papers correctly identify the motions being debated
- Manual annotation can achieve reasonable consistency (κ ≥ 0.75)
- TF-IDF and transformer features will capture relevance patterns effectively
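
The agreement assumption (κ ≥ 0.75) can be tested on a double-labeled subset. The sketch below assumes two annotators and therefore uses Cohen's kappa; the function name `cohens_kappa` and the toy labels are hypothetical and only illustrate the calculation.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy annotations (R = Relevant, N = NotRelevant), invented for illustration.
annotator_a = ["R", "R", "R", "N", "N", "R", "N", "R", "R", "N"]
annotator_b = ["R", "R", "N", "N", "N", "R", "N", "R", "R", "R"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}; assumption holds: {kappa >= 0.75}")
```

In this toy case the annotators agree on 8 of 10 items, yet kappa falls well below 0.75 because chance agreement is already 0.52. That gap is why raw agreement alone is not a sufficient check.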

### Ethical Considerations
- **Transparency**: Classification decisions should be explainable to users
- **Bias**: Ensure model doesn't discriminate based on speaker identity or political affiliation  
- **Privacy**: No personal information beyond public parliamentary records
- **Accuracy**: False classifications could misrepresent parliamentary discourse

**Extended Ethical Considerations**
- **Transparency and Explainability**: Classification decisions should be explainable to users, enabling parliamentary staff and researchers to understand why utterances are classified as relevant or not relevant. The model should provide confidence scores and feature importance to support decision transparency (Westminster Foundation for Democracy, 2025).
- **Political Neutrality and Non-Discrimination**: Ensure the model doesn't discriminate based on speaker identity, political affiliation, or political orientation, as algorithmic bias against political viewpoints can arise in AI systems (Leerssen, 2022). The model should perform consistently across all political parties and individual representatives.
- **Fairness Across Demographics**: Minimize bias and ensure fairness across different speaker demographics, including gender, age, constituency, and years of service to prevent systematic disadvantaging of any group (Inter-Parliamentary Union, 2025).
- **Data Representativeness**: Features can act as proxies for protected categories or political viewpoints. Reduce this accidental bias by ensuring the training data represents diverse speakers, motion types, and parliamentary contexts (Wikipedia, 2025).
- **Contextual Sensitivity**: Respect the cultural and institutional context of Zambian parliamentary discourse, ensuring the model doesn't impose external definitions of relevance that may not align with local parliamentary traditions and practices.
- **Privacy and Consent**: Maintain appropriate handling of public parliamentary records while respecting speaker attribution and ensuring no personal information beyond publicly available Hansard records is used in the classification system.
- **Accountability and Human Oversight**: Promote human autonomy and decision-making by designing the system as a decision-support tool rather than a replacement for human judgment, with clear protocols for human review of classifications (Westminster Foundation for Democracy, 2025).
- **Impact Assessment**: Monitor for unintended consequences such as potential chilling effects on parliamentary speech or systematic misrepresentation of certain speakers' contributions to debates.

**References:**
- Inter-Parliamentary Union. (2025). Ethical principles: Fairness and non-discrimination. AI Guidelines for Parliaments.
- Leerssen, P. (2022). Algorithmic Political Bias in Artificial Intelligence Systems. Philosophy & Technology, 35(2).
- Westminster Foundation for Democracy. (2025). AI guidelines for parliaments.
- Wikipedia. (2025). Algorithmic bias.

### Success Criteria
The project will be considered successful if:
1. Achieves target KPI thresholds on held-out test data
2. Demonstrates consistent performance across different sittings and speakers
3. Provides interpretable predictions with confidence scores
4. Delivers a working CLI tool for batch classification
5. Completes all CRISP-DM phases with proper documentation



## [DU] Data Understanding

Goal: build a reliable picture of what data exists, what we need to extract, and how it links together so we can prepare high-quality inputs for modeling.

### 1) Data landscape (what exists)
- Debates & Proceedings (index), plus an alternate debates index with separate pagination. ([1], [10])
- Order Papers (index) listing the Order of the Day with the motion text. ([2])
- Votes & Proceedings (index) summarizing timing/outcomes for validation. ([3])
- Debate pages are Drupal node pages; treat node pages as canonical targets. The site includes PDFs (e.g., abstracts) alongside HTML. ([7], [9])

### 2) What we will collect per sitting (artifacts)
- Motion (from Order Paper): `motion_text` and relevant metadata. ([2])
- Debate content (from Debates): segmented utterances `(speaker, timestamp?, utterance_text, stage_marker?)`. ([7])
- Optional validation (from Votes & Proceedings): date/session alignment or outcomes. ([3])

### 3) How we will join (keys and gaps)
- Primary join key: date; use session header when available for disambiguation.
- Expect gaps: some dates appear in one debates index but not the other, or have an Order Paper without a visible debate entry. Crawl both indices; prefer node pages when present. ([1], [10])
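
The join rule and its gaps can be sketched as follows. This is an illustrative assumption about how the crawl results might be held in memory (dictionaries mapping a date string to a URL); the function names and the `example.org` URLs are hypothetical, and treating any URL that contains `/node/` as a node page is a simplification.

```python
def pick_debate_url(candidates):
    """Prefer a node page when one is available, otherwise take the first candidate."""
    node_urls = [url for url in candidates if "/node/" in url]
    return (node_urls or candidates)[0]

def join_by_date(index_a, index_b, order_papers):
    """Join both debates indices with the Order Papers on date and report gaps."""
    joined, gaps = {}, {}
    for date in sorted(set(index_a) | set(index_b) | set(order_papers)):
        candidates = [index[date] for index in (index_a, index_b) if date in index]
        if candidates and date in order_papers:
            joined[date] = {
                "debate_url": pick_debate_url(candidates),
                "order_paper_url": order_papers[date],
            }
        else:
            gaps[date] = "missing debate" if not candidates else "missing order paper"
    return joined, gaps

# Hypothetical crawl results keyed by sitting date (YYYY-MM-DD).
index_a = {"2023-03-01": "https://example.org/node/101",
           "2023-03-02": "https://example.org/debates/2"}
index_b = {"2023-03-02": "https://example.org/node/202",
           "2023-03-03": "https://example.org/node/303"}
order_papers = {"2023-03-01": "https://example.org/op/1",
                "2023-03-02": "https://example.org/op/2",
                "2023-03-07": "https://example.org/op/7"}

joined, gaps = join_by_date(index_a, index_b, order_papers)
print("joined:", sorted(joined))
print("gaps:", gaps)
```

Keeping the gaps as an explicit output makes it easy to review which sittings were dropped and why, instead of losing them silently in an inner join.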

### 4) Collection strategy (practical crawling)
- Pagination: crawl until no more pages (do not hard‑code page counts). The Order Paper index paginates deeper than the Debates index. ([2])
- Attachments: download linked PDFs under `/sites/default/files/...` and text‑extract alongside HTML.
- Polite crawling: throttle requests, set a descriptive user‑agent, use content‑hashing to skip duplicates.
- Storage layout: `data/raw/` (HTML/PDF snapshots), `data/interim/` (parsed text, JSONL).
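
A minimal sketch of this strategy, using only the standard library, is shown below. Every name in it (`USER_AGENT`, `http_fetch`, `crawl_index`, `snapshot`) is an assumption for illustration, and the demo at the bottom runs offline against fake pages so that no request is sent to the parliament site.

```python
import hashlib
import tempfile
import time
import urllib.request
from pathlib import Path

# Hypothetical user-agent string; a real crawl should name a reachable contact.
USER_AGENT = "csc4792-team16-crawler/0.1 (coursework; hypothetical contact)"

def http_fetch(url):
    """Fetch a URL as bytes with a descriptive User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()

def crawl_index(list_page_links, max_pages=10_000):
    """Collect links page by page until a page comes back empty."""
    links, page = [], 0
    while page < max_pages:  # safety cap only, not an expected page count
        found = list_page_links(page)
        if not found:
            break
        links.extend(found)
        page += 1
    return links

def snapshot(url, fetch, raw_dir, seen_hashes, delay_s=1.0):
    """Throttle, fetch, and store the body unless identical content was already saved."""
    time.sleep(delay_s)
    body = fetch(url)
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen_hashes:
        return None  # duplicate content, skip
    seen_hashes.add(digest)
    target = Path(raw_dir) / f"{digest}.bin"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(body)
    return target

# Offline demo with fake index pages and bodies (no network access).
fake_pages = {0: ["u1", "u2"], 1: ["u3"]}
fake_bodies = {"u1": b"<html>sitting A</html>",
               "u2": b"<html>sitting A</html>",  # same content as u1
               "u3": b"<html>sitting B</html>"}

links = crawl_index(lambda page: fake_pages.get(page, []))
seen = set()
with tempfile.TemporaryDirectory() as tmp:
    saved = [snapshot(u, fake_bodies.__getitem__, tmp, seen, delay_s=0) for u in links]
    stored_count = len(list(Path(tmp).iterdir()))
print(f"{len(links)} links crawled, {stored_count} unique snapshots stored")
```

In the project itself, `fetch` would be `http_fetch`, `raw_dir` would be `data/raw/`, and `delay_s` would stay at a second or more to keep the crawl polite.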

### 5) Parsing & intermediate schema
- Segment debate pages into utterances using heading patterns and stage markers.
- Output: `data/interim/utterances.jsonl` with fields:
  - `sitting_id` (e.g., `YYYY-MM-DD`), `assembly_session?`, `speaker`, `timestamp?`, `utterance_text`, `stage_marker?`
- Save `motion_text` to `data/interim/<date>_motion.txt` for conditioning.
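
The segmentation step could look roughly like the sketch below. The speaker pattern (a title, a name, an optional constituency in parentheses, then a colon) and the stage-marker list are assumptions about the Hansard layout and would need checking against real debate pages; `SPEAKER_RE`, `STAGE_MARKERS`, and `segment` are hypothetical names, and the sample text is invented.

```python
import json
import re

# Assumed turn header, e.g. "Mrs Phiri (Kanyama): ..." (not verified against real pages).
SPEAKER_RE = re.compile(
    r"^(?P<speaker>(?:Hon\.?|Mr\.?|Mrs\.?|Ms\.?|Dr\.?|Madam|The)\s[^:\n]{1,80}?):\s*(?P<text>.*)$"
)
# Assumed stage markers that appear on a line of their own.
STAGE_MARKERS = {"Laughter", "Interruptions"}

def segment(page_text, sitting_id):
    """Split plain debate text into utterance records matching the interim schema."""
    utterances, current = [], None
    for raw_line in page_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line in STAGE_MARKERS:
            if current is not None:
                current["stage_marker"] = line  # attach to the turn in progress
            continue
        match = SPEAKER_RE.match(line)
        if match:
            current = {"sitting_id": sitting_id, "speaker": match["speaker"],
                       "timestamp": None, "utterance_text": match["text"],
                       "stage_marker": None}
            utterances.append(current)
        elif current is not None:
            # Continuation line: append to the current speaker's turn.
            current["utterance_text"] = (current["utterance_text"] + " " + line).strip()
    return utterances

sample = """Mr Speaker: Order!
Mrs Phiri (Kanyama): I rise to support the Motion.
The funding will help rural clinics.
Laughter
Mr Banda: On a point of order.
"""

utterances = segment(sample, "2023-03-01")
jsonl_lines = [json.dumps(u, ensure_ascii=False) for u in utterances]
print("\n".join(jsonl_lines))
```

Writing `jsonl_lines` to `data/interim/utterances.jsonl`, one record per line, would produce the file that the Data Preparation cells load. A single regex like this is likely to misfire on older templates, which is one reason to keep parsers versioned.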

### 6) Quick EDA (sanity checks)
- Length distribution of utterances; per‑speaker turn counts; frequency of stage markers.
- Lexical‑overlap heuristic vs motion to estimate a rough prior of Relevant/NotRelevant for inspection.
- Spot‑check one older sitting to verify parser robustness across templates.
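
These sanity checks need very little code. The sketch below is illustrative only: the token-level Jaccard measure, the tiny stopword list, and the 0.1 threshold for the rough prior are assumptions chosen for this example, and the motion and utterances are invented.

```python
import re
from collections import Counter
from statistics import median

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "and", "for", "from", "i", "of", "that", "the", "this", "to"}

def content_tokens(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def jaccard_overlap(utterance, motion):
    """Token-level Jaccard similarity between an utterance and the motion."""
    u, m = content_tokens(utterance), content_tokens(motion)
    if not u or not m:
        return 0.0
    return len(u & m) / len(u | m)

motion = ("That this House urges the Government to increase funding "
          "for rural health clinics.")
rows = [
    {"speaker": "Mrs Phiri", "utterance_text": "I support increased funding for rural clinics."},
    {"speaker": "Mr Speaker", "utterance_text": "Order! Order!"},
    {"speaker": "Mr Banda", "utterance_text": "Rural health posts need more funding from the Government."},
]

lengths = [len(r["utterance_text"].split()) for r in rows]
turns_per_speaker = Counter(r["speaker"] for r in rows)
overlaps = [jaccard_overlap(r["utterance_text"], motion) for r in rows]

THRESHOLD = 0.1  # hypothetical cut-off, used only to eyeball a rough prior
rough_prior = Counter("Relevant" if o >= THRESHOLD else "NotRelevant" for o in overlaps)

print("median length (words):", median(lengths))
print("turns per speaker:", dict(turns_per_speaker))
print("overlaps:", [round(o, 3) for o in overlaps])
print("rough prior:", dict(rough_prior))
```

The heuristic labels are for inspection only. They give a first impression of class balance before annotation and should not be used as training labels.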

### 7) Risks and mitigations
- Template drift (old vs new): maintain versioned parsers per template. ([4])
- HTML/PDF variance: add a PDF extraction path; keep raw snapshots. ([9])
- Ambiguous “relevance” edges: create an annotation guide; double‑label a subset to measure inter‑annotator agreement (κ). ([5])
- Session/date mismatches: join on date; validate with Votes & Proceedings. ([3])
- Dual debates indices: crawl both indices; de‑duplicate node links. ([1], [10])
- Attachment links: fetch and extract PDFs to avoid missing content.

### 8) Ready‑to‑run checklist
- [ ] Crawl 3–5 sittings from both debates indices; store raw HTML/PDF with hashes. ([1], [10])
- [ ] Fetch matching Order Papers; save raw and `<date>_motion.txt`. ([2])
- [ ] Implement `parse_segment.py` → `data/interim/utterances.jsonl`.
- [ ] EDA: length histograms, turns per speaker, marker frequencies; add screenshots of index pages.
- [ ] Draft data card with provenance and verbatim references. ([5])

[1]: https://www.parliament.gov.zm/publications/debates-list "Debates and Proceedings | National Assembly of Zambia"
[2]: https://www.parliament.gov.zm/publications/order-paper-list "Order Paper | National Assembly of Zambia"
[3]: https://www.parliament.gov.zm/publications/votes-proceedings "Votes and Proceedings | National Assembly of Zambia"
[4]: https://www.parliament.gov.zm/ "National Assembly of Zambia"
[5]: https://www.parliament.gov.zm/node/173 "Publications | National Assembly of Zambia"
[7]: https://www.parliament.gov.zm/node/1401 "Debates- Thursday, 4th November, 2010"
[9]: https://www.parliament.gov.zm/sites/default/files/images/publication_docs/Abstract%202%20Debate%20In%20Parliament.pdf "Abstract 2 Debate In Parliament.pdf"
[10]: https://www.parliament.gov.zm/publications/debates-proceedings "Debates & Proceedings (alternate) | National Assembly of Zambia"


## [DP] Data Preparation

Goal: clean, transform, and prepare structured inputs for modeling.  
This section builds on the raw crawled artifacts (`data/raw/`) and the parsed utterances from the DU phase (`data/interim/utterances.jsonl`), and writes processed JSONL/CSV under `data/interim/`.

---

### 1) Data Cleaning
- Remove empty utterances, stray whitespace, and non-UTF-8 characters.  
- Normalize speaker names (strip titles, unify casing).  
- Handle missing values in `speaker`, `timestamp`.  

### 2) Feature Engineering
- Add `utterance_len` (token count).  
- Add `is_stage_marker` (binary).  
- Add `motion_overlap` (rough lexical overlap with motion text).  

### 3) Data Transformation
- Convert categorical features (e.g. `speaker`) into numeric encodings.  
- Ensure output schema matches:  


- Save to `data/interim/utterances_prepared.csv`.  

---


In [None]:
import os
import re
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from difflib import SequenceMatcher

# Paths
RAW_DIR = "data/raw/"
INTERIM_DIR = "data/interim/"
os.makedirs(INTERIM_DIR, exist_ok=True)

# --- 1) LOAD RAW JSONL ---
# assume DU phase produced: data/interim/utterances.jsonl
utterances_path = os.path.join(INTERIM_DIR, "utterances.jsonl")

records = []
with open(utterances_path, "r", encoding="utf-8") as f:
    for line in f:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue

df = pd.DataFrame(records)

print("Raw shape:", df.shape)
df.head()


In [None]:
# --- 2) DATA CLEANING ---

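# Drop empty utterances; .copy() avoids SettingWithCopyWarning on later assignments
df = df[df["utterance_text"].notna() & (df["utterance_text"].str.strip() != "")].copy()

# Normalize speaker names: strip leading titles only, then unify casing.
# The whole group is anchored at the start, so "Mrs" is not cut to "s" and
# letters inside names (e.g. the "dr" in "Andrew") are left alone.
TITLE_RE = re.compile(r"^\s*(?:(?:Hon|Dr|Mrs|Mr|Ms)\b\.?\s*)+", flags=re.I)

def normalize_speaker(s):
    if pd.isna(s):
        return "UNKNOWN"
    name = TITLE_RE.sub("", str(s)).strip().title()
    return name or "UNKNOWN"

df["speaker"] = df["speaker"].apply(normalize_speaker)

# Fill missing timestamps with a placeholder (the column is optional in the DU schema)
df["timestamp"] = df["timestamp"].fillna("NA") if "timestamp" in df.columns else "NA"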

print("After cleaning:", df.shape)


In [None]:
# --- 3) FEATURE ENGINEERING ---

# Length of utterance (tokens)
df["utterance_len"] = df["utterance_text"].apply(lambda x: len(x.split()))

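# Stage marker flag (the column is optional in the DU schema)
if "stage_marker" in df.columns:
    df["is_stage_marker"] = df["stage_marker"].fillna("").astype(str).str.strip() != ""
else:
    df["is_stage_marker"] = False

# Motion text per sitting (DU artifact: data/interim/<date>_motion.txt)
MOTION_SUFFIX = "_motion.txt"
motions = {}
for name in sorted(os.listdir(INTERIM_DIR)):
    if name.endswith(MOTION_SUFFIX):
        with open(os.path.join(INTERIM_DIR, name), "r", encoding="utf-8") as f:
            motions[name[: -len(MOTION_SUFFIX)]] = f.read()

def lexical_overlap(a, b):
    """Character-level similarity in [0, 1] (difflib ratio), a rough proxy for overlap."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare each utterance with the motion of its own sitting, not the first file found
df["motion_overlap"] = [
    lexical_overlap(text, motions.get(sitting, ""))
    for text, sitting in zip(df["utterance_text"], df["sitting_id"])
]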
