# README.md

# A Flexible Measure of Voter Polarization: A Python Implementation

<!-- PROJECT SHIELDS -->
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Type Checking: mypy](https://img.shields.io/badge/type_checking-mypy-blue)](http://mypy-lang.org/)
[![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=flat&logo=pandas&logoColor=white)](https://pandas.pydata.org/)
[![NumPy](https://img.shields.io/badge/numpy-%23013243.svg?style=flat&logo=numpy&logoColor=white)](https://numpy.org/)
[![SciPy](https://img.shields.io/badge/SciPy-%230C55A5.svg?style=flat&logo=scipy&logoColor=white)](https://scipy.org/)
[![Matplotlib](https://img.shields.io/badge/Matplotlib-%23ffffff.svg?style=flat&logo=Matplotlib&logoColor=black)](https://matplotlib.org/)
[![Seaborn](https://img.shields.io/badge/seaborn-%233776AB.svg?style=flat&logo=python&logoColor=white)](https://seaborn.pydata.org/)
[![Jupyter](https://img.shields.io/badge/Jupyter-%23F37626.svg?style=flat&logo=Jupyter&logoColor=white)](https://jupyter.org/)
[![arXiv](https://img.shields.io/badge/arXiv-2507.07770-b31b1b.svg)](https://arxiv.org/abs/2507.07770)
[![DOI](https://img.shields.io/badge/DOI-10.48550/arXiv.2507.07770-blue)](https://doi.org/10.48550/arXiv.2507.07770)
[![Research](https://img.shields.io/badge/Research-Computational%20Social%20Science-green)](https://github.com/chirindaopensource/ginzburg_polarization_model_python_implementation)
[![Discipline](https://img.shields.io/badge/Discipline-Political%20Economy-blue)](https://github.com/chirindaopensource/ginzburg_polarization_model_python_implementation)
[![Methodology](https://img.shields.io/badge/Methodology-Distributional%20Analysis-orange)](https://github.com/chirindaopensource/ginzburg_polarization_model_python_implementation)
[![Data Source](https://img.shields.io/badge/Data%20Source-ANES-lightgrey)](https://electionstudies.org/)
[![Year](https://img.shields.io/badge/Year-2025-purple)](https://github.com/chirindaopensource/ginzburg_polarization_model_python_implementation)

**Repository:** https://github.com/chirindaopensource/ginzburg_polarization_model_python_implementation

**Owner:** 2025 Craig Chirinda (Open Source Projects)

This repository contains an **independent** implementation of the research methodology from the 2025 paper entitled **"A Flexible Measure of Voter Polarization"** by:

*   Boris Ginzburg

The project provides a robust, end-to-end Python pipeline for computing and analyzing the flexible polarization index, `P(F, x*)`. This measure moves beyond traditional, mean-centric metrics like variance to provide a high-resolution diagnostic tool. It allows an analyst to measure polarization around any specified point in the ideological spectrum, enabling the precise identification and tracking of specific fault lines within an electorate.

## Table of Contents

- [Introduction](#introduction)
- [Theoretical Background](#theoretical-background)
- [Features](#features)
- [Methodology Implemented](#methodology-implemented)
- [Core Components (Notebook Structure)](#core-components-notebook-structure)
- [Key Callable: execute_full_research_project](#key-callable-execute_full_research_project)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Input Data Structure](#input-data-structure)
- [Usage](#usage)
- [Output Structure](#output-structure)
- [Project Structure](#project-structure)
- [Customization](#customization)
- [Contributing](#contributing)
- [License](#license)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)

## Introduction

This project provides a Python implementation of the methodologies presented in the 2025 paper "A Flexible Measure of Voter Polarization." The core of this repository is the Jupyter notebook `ginzburg_polarization_index_implementation_draft.ipynb`, which contains a suite of functions to compute the `P(F, x*)` index, analyze its dynamics, and compare it against traditional measures.

Traditional measures of polarization, such as variance, describe the dispersion of an electorate around a single point—the mean. This can mask crucial asymmetric dynamics. For instance, the center-right may be polarizing while the center-left is coalescing, an effect that a single variance number might miss. The Ginzburg index solves this by making the reference point `x*` a flexible parameter, effectively allowing an analyst to "scan" the entire ideological spectrum for polarization.

This codebase enables researchers, political analysts, and data scientists to:
-   Rigorously compute the `P(F, x*)` index for any ideological distribution.
-   Analyze the evolution of polarization at different points on the political spectrum over time.
-   Systematically identify the ideological "cleavage points" where polarization has increased the most.
-   Compare the insights from this flexible measure against traditional metrics.
-   Replicate and extend the findings of the original research paper.

## Theoretical Background

The implemented methods are grounded in distributional analysis and integral calculus, applied to survey data.

**A Flexible Definition of Polarization:** The framework begins by defining what it means for one distribution `F̂` to be more polarized than another `F` around a specific point `x*`. This occurs if the probability mass in *any* interval containing `x*` is smaller under `F̂`. This leads to a practical single-crossing condition on the Cumulative Distribution Functions (CDFs).

**The Polarization Index `P(F, x*)`:** To provide a scalar measure, the paper defines the polarization index. The formula is designed to increase as probability mass shifts away from the central point `x*` towards both tails of the distribution.
$P(F, x^*) := \frac{\int_{\min\{X\}}^{x^*} F(x) dx}{x^* - \min\{X\}} - \frac{\int_{x^*}^{\max\{X\}} F(x) dx}{\max\{X\} - x^*} + 1$
For discrete survey data, the integrals are calculated exactly as the sum of areas of rectangles under the empirical CDF's step function.
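
A minimal sketch of that rectangle-sum calculation, assuming every response lies inside the scale boundaries and the CDF is the usual right-continuous weighted step function. The names `weighted_ecdf`, `integrate_step_cdf`, and `polarization_index` are illustrative and are not taken from the notebook.

```python
import numpy as np


def weighted_ecdf(values, weights):
    """Return the sorted support points and the weighted CDF at each point."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    support, inverse = np.unique(values, return_inverse=True)
    mass = np.bincount(inverse, weights=weights, minlength=support.size)
    return support, np.cumsum(mass) / mass.sum()


def integrate_step_cdf(support, cdf, a, b):
    """Exact integral of the step CDF over [a, b] as a sum of rectangles."""
    # Breakpoints: the interval ends plus every support point strictly inside.
    inner = support[(support > a) & (support < b)]
    knots = np.concatenate(([a], inner, [b]))
    # On [knots[i], knots[i+1]) the CDF equals its value at knots[i].
    idx = np.searchsorted(support, knots[:-1], side="right") - 1
    heights = np.where(idx >= 0, cdf[np.clip(idx, 0, None)], 0.0)
    return float(np.sum(heights * np.diff(knots)))


def polarization_index(values, weights, x_star, lo, hi):
    """P(F, x*) = left average of F - right average of F + 1."""
    if not lo < x_star < hi:
        raise ValueError("x_star must lie strictly between lo and hi")
    support, cdf = weighted_ecdf(values, weights)
    left = integrate_step_cdf(support, cdf, lo, x_star) / (x_star - lo)
    right = integrate_step_cdf(support, cdf, x_star, hi) / (hi - x_star)
    return left - right + 1.0


# Four equally weighted respondents on a 0-10 scale, measured around x* = 3.
p = polarization_index([2, 4, 6, 8], [1, 1, 1, 1], x_star=3, lo=0, hi=10)
print(round(p, 4))  # 0.4048
```

Because the CDF is constant between support points, no numerical quadrature grid is needed; the only inputs are the support points, their cumulative weights, and the two scale boundaries.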

**Cleavage Point Finder:** The framework can be inverted. Instead of specifying `x*` and measuring polarization, the pipeline can perform a grid search across all possible `x*` values to find the one where the percentage increase in `P(F, x*)` between two time periods is maximized. This identifies the most significant emerging ideological fault line.
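
The grid search can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: `find_cleavage_point` is a hypothetical name, the two toy distributions are invented, and `polarization_index` uses an algebraically equivalent form of the formula above (exchanging the order of integration turns each CDF integral into a weighted mean distance from `x*`), which holds when all responses lie within the scale boundaries.

```python
import numpy as np


def polarization_index(values, weights, x_star, lo, hi):
    """P(F, x*) as normalized mean distances to x* on each side."""
    values = np.asarray(values, dtype=float)
    left = np.average(np.clip(x_star - values, 0, None), weights=weights)
    right = np.average(np.clip(values - x_star, 0, None), weights=weights)
    return float(left / (x_star - lo) + right / (hi - x_star))


def find_cleavage_point(before, after, candidates, lo, hi):
    """Return (x*, percentage change) with the largest relative increase."""
    best_x, best_change = None, -np.inf
    for x_star in candidates:
        p0 = polarization_index(*before, x_star, lo, hi)
        p1 = polarization_index(*after, x_star, lo, hi)
        if p0 <= 0:
            continue  # percentage change is undefined at a zero baseline
        change = 100.0 * (p1 - p0) / p0
        if change > best_change:
            best_x, best_change = x_star, change
    return best_x, best_change


# Toy (values, weights) pairs for two periods on a 0-10 scale.
before = ([2, 4, 6, 8], [1, 1, 1, 1])
after = ([3, 5, 9], [2, 1, 1])
x_star, change = find_cleavage_point(before, after, range(1, 10), 0, 10)
print(x_star, round(change, 1))  # 8 41.7
```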

## Features

The provided Jupyter notebook (`ginzburg_polarization_index_implementation_draft.ipynb`) implements the full research pipeline, including:

-   **Parameter Validation:** Rigorous checks for all input data and configurations to ensure methodological compliance.
-   **Data Cleansing:** Robust handling of survey-specific missing value codes.
-   **Weighted Statistics:** Correct application of survey weights for all calculations, including CDF construction and traditional measures.
-   **Exact `P(F, x*)` Calculation:** A numerically stable and mathematically precise implementation of the polarization index for discrete data.
-   **Automated Analysis Suite:** Functions to systematically analyze temporal trends, election-year effects, and comparative performance against traditional metrics.
-   **Cleavage Point Finder:** An algorithm to scan the ideological spectrum and identify points of maximum polarization increase.
-   **Theoretical Extensions:** Computational models for affective polarization and the effects of issue salience.
-   **Robustness Checks:** A framework for testing the sensitivity of results to key methodological choices (e.g., weighted vs. unweighted analysis).
-   **Publication-Quality Visualization:** A suite of functions to generate the key figures from the paper.

## Methodology Implemented

The core analytical steps directly implement the methodology from the paper:

1.  **Data Preparation (Tasks 1-2):** The pipeline ingests raw ANES data, cleans it by handling missing value codes, and constructs a weighted empirical Cumulative Distribution Function (CDF) for each survey year and ideological scale.
2.  **Polarization Calculation (Tasks 3-4):** It computes the `P(F, x*)` index for every specified `(year, scale, x*)` combination using the formula from Definition 2.
3.  **Temporal and Event Analysis (Tasks 5-6):** The pipeline analyzes the evolution of `P(F, x*)` over time and its short-term changes around elections, replicating the analysis in Figures 1, 2, and 3 of the paper.
4.  **Cleavage Point Identification (Tasks 7-8):** It implements the grid search algorithm to find the `x*` that maximizes the percentage increase in polarization between two periods.
5.  **Comparative Analysis (Tasks 9-10):** The pipeline computes traditional measures (e.g., variance) and systematically compares their trends to those of the `P(F, x*)` index at various points, quantifying where the measures converge and diverge.
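
As a rough illustration of the data-preparation step (item 1), the sketch below maps missing-value codes to `NaN` and builds a weighted empirical CDF per survey wave. The helper names `clean_scale` and `weighted_cdf_by_year` are hypothetical, and the codes `98` and `99` are assumed from the example `missing_value_map` used elsewhere in this README; the notebook's own functions may differ.

```python
import pandas as pd


def clean_scale(df, scale, missing_codes):
    """Keep year, scale and weight; drop rows with missing codes or bad weights."""
    out = df[["year", scale, "weight"]].copy()
    # Replace survey missing-value codes with NaN, then drop those rows.
    out[scale] = out[scale].where(~out[scale].isin(missing_codes))
    out = out.dropna(subset=[scale, "weight"])
    return out[out["weight"] > 0]


def weighted_cdf_by_year(df, scale):
    """Weighted empirical CDF of `scale` within each survey wave."""
    mass = df.groupby(["year", scale])["weight"].sum()
    share = mass / mass.groupby(level="year").transform("sum")
    cdf = share.groupby(level="year").cumsum()
    return cdf.rename("cdf").reset_index()


toy = pd.DataFrame(
    {
        "year": ["2016a", "2016a", "2016a", "2020", "2020"],
        "left_right": [0, 10, 98, 5, 5],  # 98 is treated as a missing code
        "weight": [1.0, 3.0, 1.0, 2.0, 2.0],
    }
)
cdf_table = weighted_cdf_by_year(clean_scale(toy, "left_right", [98, 99]), "left_right")
print(cdf_table)
```

Grouping by `(year, scale)` first collapses respondents to one row per scale value, so the cumulative sum within each wave is the CDF evaluated at each support point.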

## Core Components (Notebook Structure)

The `ginzburg_polarization_index_implementation_draft.ipynb` notebook is structured as a logical pipeline with modular functions for each task:

-   **Task 0: `validate_parameters`**: The initial quality gate for all inputs.
-   **Task 1: `clean_anes_data`**: Handles data quality and missing codes.
-   **Task 2: `preprocess_for_polarization`**: The core CDF generation engine.
-   **Task 3-4: `calculate_polarization_index`, `compute_all_polarization_measures`**: The main polarization index calculation.
-   **Task 5-10**: A suite of analysis functions for temporal, election, cleavage, and comparative analysis.
-   **Task 11: `calculate_affective_polarization`, `simulate_issue_salience_effect`**: Implementation of the theoretical models.
-   **Task 12-14**: High-level orchestrators (`run_polarization_pipeline`, `run_robustness_analysis`, `execute_full_research_project`) that run the entire workflow.

## Key Callable: execute_full_research_project

The central function in this project is `execute_full_research_project`. It orchestrates the entire analytical workflow from raw data to final results, including robustness checks and report generation.

```python
def execute_full_research_project(
    anes_df: pd.DataFrame,
    params: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Executes the complete, end-to-end polarization research project.

    This master orchestrator function serves as the single entry point to run
    the entire analysis suite, from raw data to final report assets. It encapsulates
    the full research workflow, including the baseline analysis, robustness checks,
    and the generation of all tables and visualizations.

    Args:
        anes_df (pd.DataFrame): The raw ANES survey data.
        params (Dict[str, Any]): A comprehensive dictionary containing all
            parameters required for every stage of the analysis.

    Returns:
        Dict[str, Any]: A master dictionary containing the complete project results.
    """
    # ... (implementation is in the notebook)
```

## Prerequisites

-   Python 3.9+
-   Core dependencies as listed in `requirements.txt`: `pandas`, `numpy`, `scipy`, `matplotlib`, `seaborn`.

## Installation

1.  **Clone the repository:**
    ```sh
    git clone https://github.com/chirindaopensource/ginzburg_polarization_model_python_implementation.git
    cd ginzburg_polarization_model_python_implementation
    ```

2.  **Create and activate a virtual environment (recommended):**
    ```sh
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    ```

3.  **Install Python dependencies from `requirements.txt`:**
    ```sh
    pip install -r requirements.txt
    ```

## Input Data Structure

The primary input is a `pandas.DataFrame` with the following required columns:
-   `respondent_id`: A unique identifier for each respondent.
-   `year`: A string/object representing the survey wave (e.g., `'2004a'`, `'2016b'`).
-   `left_right`: The respondent's self-placement on the 0-10 left-right scale.
-   `liberal_conservative`: The respondent's self-placement on the 1-7 liberal-conservative scale.
-   `weight`: The full-sample survey weight for that respondent.

## Usage

### **User Guide: Deploying the End-to-End Polarization Analysis Pipeline**

This section provides a practical, step-by-step guide to utilizing the `execute_full_research_project` function. This function is the single entry point to the entire analytical library, designed for robust, reproducible, and comprehensive analysis of voter polarization.

#### **Step 1: Data Acquisition and Preparation**

The pipeline requires a single, consolidated `pandas.DataFrame` as its primary data input. This DataFrame must be prepared *before* calling the main function and must adhere to a strict schema.

*   **Action:** The analyst must first acquire the necessary survey data (e.g., from the ANES Data Center). The data from multiple years and waves should be merged into one file.
*   **Schema:** The resulting DataFrame must contain the following columns with these exact names:
    *   `respondent_id`: A unique identifier for each respondent.
    *   `year`: A string or object representing the survey wave (e.g., `'2004a'`, `'2016b'`).
    *   `left_right`: The respondent's self-placement on the 0-10 left-right scale.
    *   `liberal_conservative`: The respondent's self-placement on the 1-7 liberal-conservative scale.
    *   `weight`: The full-sample survey weight provided by ANES for that specific survey wave.

**Code Snippet: Creating a Mock DataFrame**

For demonstration purposes, we will construct a small, mock `DataFrame` that conforms to this schema. In a real-world scenario, this `anes_df` would be the result of loading and merging actual ANES `.dta` or `.csv` files.

```python
import pandas as pd
import numpy as np

# In a real application, this DataFrame would be loaded from a file.
# df = pd.read_csv("path/to/your/merged_anes_data.csv")
# For this example, we create a mock DataFrame.
# A seeded generator keeps the mock data identical from run to run.
rng = np.random.default_rng(42)

data = {
    'respondent_id': [f'R_{i}' for i in range(100)],
    'year': rng.choice(['2012', '2016a', '2016b', '2020'], 100),
    'left_right': rng.choice(list(range(11)) + [98], 100),  # 98 is a missing-value code
    'liberal_conservative': rng.choice(list(range(1, 8)) + [99], 100),  # 99 is a missing-value code
    'weight': rng.uniform(0.5, 2.5, 100)
}
anes_df = pd.DataFrame(data)

print("Sample of the prepared input DataFrame:")
print(anes_df.head())
```

#### **Step 2: Constructing the Master Parameters Dictionary**

The `execute_full_research_project` function is controlled by a single parameters dictionary. This object holds every tunable setting for the analysis, including the random seed used by the simulation, so that a run can be repeated with the same configuration. We will construct this dictionary section by section, using the default values specified in the original project brief.

**Code Snippet: Defining the `params` Dictionary**

```python
# Initialize the master parameters dictionary.
params = {}

# --- Central Points of Interest (x*) ---
# These are the points around which polarization will be measured for each scale.
params['central_points_params'] = {
    "left_right_x_stars": [1, 2, 3, 4, 5, 6, 7, 8, 9], # Note: Excludes boundaries 0 and 10
    "liberal_conservative_x_stars": [2, 3, 4, 5, 6] # Note: Excludes boundaries 1 and 7
}

# --- Policy Space Boundaries ---
# These define the theoretical min and max of each ideological scale.
params['boundaries_params'] = {
    "left_right_boundaries": {'min': 0, 'max': 10},
    "liberal_conservative_boundaries": {'min': 1, 'max': 7}
}

# --- Integration Parameters ---
# Defines the numerical method for the P(F,x*) calculation.
# Note: The current implementation is optimized for 'trapezoidal' on discrete data.
params['integration_params'] = {
    'method': 'trapezoidal',
    'num_points': 1000
}

# --- Cleavage Finder Parameters ---
# Defines the time periods to scan for the maximum increase in polarization.
params['cleavage_finder_params'] = {
    'time_points': [('2016a', '2020')], # Using years available in our mock data
    'potential_x_stars': params['central_points_params']
}

# --- Analysis-Specific Parameters ---
# Defines midpoints for partitioning temporal plots and centrist definitions.
params['midpoints'] = {
    'left_right': 5,
    'liberal_conservative': 4
}
params['centrist_definitions'] = {
    'left_right': [4, 5, 6],
    'liberal_conservative': [3, 4, 5]
}
params['election_years_for_analysis'] = [2016] # Using year available in mock data

# --- Theoretical Extension Parameters ---
# Defines parameters for the issue salience simulation.
params['salience_simulation_params'] = {
    'salience_alphas': [0.1, 0.3, 0.5, 0.7, 0.9],
    'common_value_dist': {'low': 4.8, 'high': 5.2},
    'divisive_issue_dist': {'low': 0, 'high': 10},
    'polarization_params': {
        'x_star': 5, # The x* to calculate polarization around in the simulation
        'boundaries': params['boundaries_params']['left_right_boundaries']
    },
    'random_seed': 42 # For reproducibility
}

# --- Optional: ANES Missing Value Codes ---
# This can be customized if different codes are used in the data.
params['missing_value_map'] = {
    'left_right': [98, 99],
    'liberal_conservative': [98, 99]
}

print("\nMaster parameters dictionary constructed successfully.")
```

#### **Step 3: Executing the Pipeline and Inspecting Results**

With the input `DataFrame` and the `params` dictionary prepared, the entire research project can be executed with a single function call. The function will print its progress through the various stages of the analysis.

**Code Snippet: Running the Master Orchestrator**

```python
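# First, ensure the full library of functions is loaded in your environment,
# for example by running every cell of ginzburg_polarization_index_implementation_draft.ipynb.
# The import below applies only if the notebook has been exported to an importable module:
# from ginzburg_polarization_model_python_implementation import execute_full_research_project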

# Execute the entire research project.
# This single call runs validation, cleaning, all calculations, analyses,
# robustness checks, and reporting.
project_results = execute_full_research_project(
    anes_df=anes_df,
    params=params
)
```

The returned `project_results` object is a deeply nested dictionary containing every artifact generated during the run. An analyst can now programmatically access any piece of the analysis for inspection, custom plotting, or further work.

**Code Snippet: Accessing and Inspecting Key Outputs**

```python
# --- Inspecting the Main Analysis Results ---
print("\n--- Example: Inspecting Key Outputs from the Main Analysis ---")

# Access the main results dictionary
main_analysis_results = project_results['main_analysis']['results']

# 1. View the summary of the cleavage point analysis
print("\nCleavage Point Analysis Summary:")
cleavage_summary = main_analysis_results['cleavage_analysis']
print(cleavage_summary)

# 2. View the summary of the comparative analysis (correlation)
print("\nCorrelation Summary (Traditional vs. Flexible Measures):")
correlation_summary = main_analysis_results['comparative_framework']['correlation_summary']
print(correlation_summary)

# 3. Access a specific plot (e.g., temporal trends for the left-right scale)
report_assets = project_results['main_analysis']['report_assets']
temporal_plot_lr = report_assets['plots']['temporal_trends_left_right']
# To display the plot in a Jupyter environment:
# temporal_plot_lr.show()
# To save the plot to a file:
# temporal_plot_lr.savefig("temporal_trends_left_right.png", dpi=300)
print("\nTemporal trend plot for 'left_right' scale has been generated.")

# --- Inspecting the Robustness Analysis Results ---
print("\n--- Example: Comparing Robustness Check Results ---")

# Compare the cleavage point found in the weighted vs. unweighted scenarios
try:
    weighted_cleavage = project_results['robustness_analysis']['weighted']['cleavage_analysis']
    unweighted_cleavage = project_results['robustness_analysis']['unweighted']['cleavage_analysis']

    print("\nCleavage points from 'weighted' analysis:")
    print(weighted_cleavage[['cleavage_point', 'relative_position']])

    print("\nCleavage points from 'unweighted' analysis:")
    print(unweighted_cleavage[['cleavage_point', 'relative_position']])
except KeyError:
    print("\nRobustness check for one or more scenarios may have failed.")

```

This example walks through the complete workflow: preparing the data, configuring the analysis, executing the pipeline with a single call, and accessing the structured outputs for interpretation.

## Output Structure

The `execute_full_research_project` function returns a single, comprehensive dictionary with the following top-level keys:

-   `main_analysis`: Contains the results of the primary pipeline run.
    -   `results`: A dictionary of all key data artifacts and analytical DataFrames (e.g., `polarization_results`, `cleavage_analysis`).
    -   `report_assets`: A dictionary containing the generated `matplotlib` figures and formatted LaTeX/HTML tables.
-   `robustness_analysis`: Contains the results from the different robustness scenarios (e.g., 'weighted' vs. 'unweighted'), allowing for direct comparison.
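
Since the returned dictionary is deeply nested, a small helper that lists key paths can make it easier to explore. The sketch below is generic and hypothetical (`describe_keys` is not part of the notebook), and the stand-in dictionary only mirrors the top-level keys listed above.

```python
from typing import Any, List


def describe_keys(obj: Any, prefix: str = "", max_depth: int = 3) -> List[str]:
    """List 'path (type)' entries for nested dictionaries up to max_depth."""
    lines: List[str] = []
    if isinstance(obj, dict) and max_depth > 0:
        for key, value in obj.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            lines.append(f"{path} ({type(value).__name__})")
            lines.extend(describe_keys(value, path, max_depth - 1))
    return lines


# Stand-in for `project_results`; real runs contain DataFrames and figures.
mock_results = {
    "main_analysis": {"results": {}, "report_assets": {}},
    "robustness_analysis": {},
}
key_lines = describe_keys(mock_results)
print("\n".join(key_lines))
```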

## Project Structure

```
ginzburg_polarization_model_python_implementation/
│
├── ginzburg_polarization_index_implementation_draft.ipynb  # Main implementation notebook
├── requirements.txt                                      # Python package dependencies
├── LICENSE                                                 # MIT license file
└── README.md                                               # This documentation file
```

## Customization

The pipeline is highly customizable via the master `params` dictionary. Users can easily modify:
-   The lists of `x_stars` to analyze for each scale.
-   The `time_points` for the cleavage finder.
-   The `midpoints` and `centrist_definitions` for analysis and comparison.
-   The parameters for the theoretical simulations.

## Contributing

Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

## Citation

If you use this code or the methodology in your research, please cite the original paper:

```bibtex
@article{ginzburg2025flexible,
  title={A Flexible Measure of Voter Polarization},
  author={Ginzburg, Boris},
  journal={arXiv preprint arXiv:2507.07770},
  year={2025}
}
```

For the implementation itself, you may cite this repository:
```
Chirinda, C. (2025). A Python Implementation of the Ginzburg Flexible Polarization Model.
GitHub repository: https://github.com/chirindaopensource/ginzburg_polarization_model_python_implementation
```

## Acknowledgments

-   Credit to Boris Ginzburg for the novel theoretical framework and the flexible polarization measure.
-   Thanks to the developers of the `pandas`, `numpy`, `scipy`, `matplotlib`, and `seaborn` libraries, which are the foundational pillars of this analytical pipeline.

---

*This README was generated based on the structure and content of `ginzburg_polarization_index_implementation_draft.ipynb` and follows best practices for research software documentation.*

# Paper

Title: "A Flexible Measure of Voter Polarization"

Link: https://arxiv.org/abs/2507.07770

E-Journal Submission Date: 11th of July 2025

Author: Boris Ginzburg

Abstract:

This paper introduces a definition of ideological polarization of an electorate around a particular central point. By being flexible about the location or width of the center, this measure enables the researcher to analyze polarization around any point of interest. The paper then applies this approach to US voter survey data between 2004 and 2020, showing how polarization between right-of-center voters and the rest of the electorate was increasing gradually, while polarization between left-wingers and the rest was originally constant and then rose steeply. It also shows how, following elections, polarization around left-wing positions decreased while polarization around right-wing positions increased. Furthermore, the paper shows how this measure can be used to find cleavage points around which polarization changed the most. I then show how ideological polarization as defined here is related to other phenomena, such as affective polarization and increased salience of divisive issues.



# Summary

### **Summary of "A Flexible Measure of Voter Polarization"**

This paper introduces a novel theoretical framework and an associated numerical measure for ideological polarization. Its primary innovation is its flexibility, allowing a researcher to analyze polarization around *any* chosen central point, rather than being restricted to traditional measures centered on the mean or median.

#### **Step 1: The Core Problem and Motivation**

The paper begins by identifying a key limitation in the existing literature on mass polarization. While polarization is often described as the "disappearance of the center," common statistical measures used to capture this phenomenon are often inadequate or too restrictive:

*   **Measures like variance or kurtosis** are, by construction, measures of dispersion around the distribution's mean. This implicitly assumes the mean is the only relevant "center," which may not be true in many political or theoretical contexts (e.g., the median voter, a neutral policy point).
*   **Measures based on the share of "centrists"** require the researcher to arbitrarily define the boundaries of the center (e.g., scores of 4-6 on a 10-point scale). The results can be highly sensitive to this choice.
*   **Group-based measures** (e.g., distance between Republican and Democrat means) require pre-defined group identities and are less useful for analyzing polarization within a heterogeneous electorate as a whole.

The author proposes a measure that overcomes these limitations by being agnostic about the location of the center, allowing the researcher to define it based on the specific research question.

#### **Step 2: A Formal Definition of Polarization (The Partial Order)**

The theoretical core of the paper is a new definition of polarization based on a partial ordering of distributions.

*   **Definition 1:** A distribution `F̂` is said to dominate another distribution `F` in polarization around a specific point `x*` if, for *any* interval that contains `x*`, the share of the population within that interval is smaller under `F̂` than under `F`.
*   **Intuition:** This formalizes the idea of the "hollowing out" of the center. If polarization increases, fewer people should be found near the central point `x*`, regardless of how narrowly or broadly one defines "near."
*   **Proposition 1 (The Single-Crossing Condition):** The paper provides a simple and powerful necessary and sufficient condition for this dominance relationship. `F̂` is more polarized around `x*` than `F` if and only if the cumulative distribution function (CDF) of `F̂` is everywhere above the CDF of `F` to the left of `x*`, and everywhere below it to the right of `x*`.
    *   `F̂(x) ≥ F(x)` for all `x < x*`
    *   `F̂(x) ≤ F(x)` for all `x > x*`
    This is equivalent to saying that, among voters to the left of `x*`, `F` first-order stochastically dominates `F̂` (under `F̂` they sit further left, away from `x*`), while among voters to the right of `x*`, `F̂` first-order stochastically dominates `F` (under `F̂` they sit further right, again away from `x*`).
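
The single-crossing condition can be checked directly on two discrete weighted samples. The sketch below is illustrative only: `step_cdf` and `dominates_in_polarization` are hypothetical names, and it assumes right-continuous step CDFs. Under that assumption the CDF keeps the value `F(x*)` on the piece just to the right of `x*`, so the right-hand inequality is also evaluated at `x*` itself.

```python
import numpy as np


def step_cdf(values, weights, points):
    """Evaluate the weighted, right-continuous empirical CDF at `points`."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    sorted_values = values[order]
    cumulative = np.cumsum(weights[order]) / weights.sum()
    idx = np.searchsorted(sorted_values, points, side="right") - 1
    return np.where(idx >= 0, cumulative[np.clip(idx, 0, None)], 0.0)


def dominates_in_polarization(hat, base, x_star, tol=1e-12):
    """True if `hat` is weakly more polarized than `base` around x_star."""
    # Both step CDFs are constant between consecutive knots.
    knots = np.unique(np.concatenate((hat[0], base[0], [x_star]))).astype(float)
    f_hat = step_cdf(*hat, knots)
    f_base = step_cdf(*base, knots)
    left = knots < x_star
    right = knots >= x_star
    return bool(
        np.all(f_hat[left] >= f_base[left] - tol)
        and np.all(f_hat[right] <= f_base[right] + tol)
    )


# Toy (values, weights) pairs: `hat` pushes the outer thirds away from 5.
base = ([4, 5, 6], [1, 1, 1])
hat = ([2, 5, 8], [1, 1, 1])
print(dominates_in_polarization(hat, base, x_star=5))  # True
print(dominates_in_polarization(base, hat, x_star=5))  # False
```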

#### **Step 3: A Numerical Measure of Polarization (The Index)**

The partial ordering from Step 2 cannot compare all pairs of distributions. To address this, the paper introduces a numerical index, `P(F, x*)`, that is consistent with the partial order but allows for the comparison of any two distributions.

*   **Definition 2:** The index `P(F, x*)` is defined based on the normalized integrals of the CDF on either side of the central point `x*`.
*   **Properties (Proposition 2):** This index is well-behaved.
    1.  It is normalized to lie within the `[0, 1]` interval, making it comparable across different central points `x*`.
    2.  Crucially, it is consistent with the partial order: if `F̂` dominates `F` in polarization around `x*`, then `P(F̂, x*) > P(F, x*)`.
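
Two limiting cases give some intuition for the `[0, 1]` range. They are worked here from the index formula $P(F, x^*) = \frac{\int_{\min\{X\}}^{x^*} F(x) dx}{x^* - \min\{X\}} - \frac{\int_{x^*}^{\max\{X\}} F(x) dx}{\max\{X\} - x^*} + 1$ as a sketch, not as a reproduction of the paper's proof.

If the whole electorate sits exactly at `x*`, then $F(x) = 0$ for $x < x^*$ and $F(x) = 1$ for $x \ge x^*$, so

$$P(F, x^*) = \frac{0}{x^* - \min\{X\}} - \frac{\max\{X\} - x^*}{\max\{X\} - x^*} + 1 = 0.$$

If no one sits strictly between the two boundaries, with a share $c$ at $\min\{X\}$ and $1 - c$ at $\max\{X\}$, then $F(x) = c$ on $[\min\{X\}, \max\{X\})$, so

$$P(F, x^*) = \frac{c\,(x^* - \min\{X\})}{x^* - \min\{X\}} - \frac{c\,(\max\{X\} - x^*)}{\max\{X\} - x^*} + 1 = 1.$$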

#### **Step 4: Empirical Application to the US Electorate**

The paper demonstrates the utility of this new framework by applying it to American National Election Studies (ANES) data from 2004 to 2020, analyzing both left-right and liberal-conservative self-placement scales. The findings are more nuanced than those from traditional measures like variance.

*   **Asymmetric Polarization Trends:** While variance shows a general increase in polarization after 2012, the `P(F, x*)` measure reveals a more complex dynamic:
    *   **Right-of-Center:** Polarization around various right-wing and conservative positions showed a *gradual and steady increase* across the entire 2004-2020 period.
    *   **Left-of-Center:** Polarization around various left-wing and liberal positions was relatively stable or even decreased until 2012, after which it began to *increase sharply*.
*   **Analyzing Short-Term Events:** The measure can detect subtle shifts during election cycles. Following the 2004 and 2016 elections, polarization around liberal positions *decreased*, while polarization around conservative positions *increased*. A simple variance measure would miss this asymmetric shift.
*   **Finding Cleavages:** The framework can be used in reverse: instead of specifying a center `x*`, one can search for the `x*` around which polarization has changed the most. The analysis shows that in the post-2012 period, the largest percentage increases in polarization occurred around points slightly to the *left of the center* of the ideological scale, identifying this as a key cleavage.
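
A toy example can make the contrast with variance concrete. The two distributions below are invented for illustration and are not ANES data; the function names are hypothetical. `polarization_index` uses a form that is algebraically equivalent to the integral definition when all responses lie within the scale boundaries (the CDF integrals become weighted mean distances from `x*`).

```python
import numpy as np


def weighted_variance(values, weights):
    """Weighted variance around the weighted mean."""
    values = np.asarray(values, dtype=float)
    mean = np.average(values, weights=weights)
    return float(np.average((values - mean) ** 2, weights=weights))


def polarization_index(values, weights, x_star, lo, hi):
    """P(F, x*) as normalized mean distances to x* on each side."""
    values = np.asarray(values, dtype=float)
    left = np.average(np.clip(x_star - values, 0, None), weights=weights)
    right = np.average(np.clip(values - x_star, 0, None), weights=weights)
    return float(left / (x_star - lo) + right / (hi - x_star))


# Period 2: the left coalesces at 3 while the right spreads out to 5 and 9.
period_1 = ([2, 4, 6, 8], [1, 1, 1, 1])
period_2 = ([3, 5, 9], [2, 1, 1])

for label, (values, weights) in {"period 1": period_1, "period 2": period_2}.items():
    by_point = {x: round(polarization_index(values, weights, x, 0, 10), 3) for x in (3, 7)}
    print(label, weighted_variance(values, weights), by_point)
# period 1 5.0 {3: 0.405, 7: 0.405}
# period 2 6.0 {3: 0.286, 7: 0.524}
```

Variance rises from 5 to 6 and would be read as a general increase in polarization, while the index falls around `x* = 3` and rises around `x* = 7`.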

#### **Step 5: Linking Ideological Polarization to Other Phenomena**

The paper concludes by formally connecting its definition of polarization to two other major concepts in political science.

*   **Affective Polarization (Proposition 3):** The paper presents a model where voters' dislike of the "other side" (affective polarization) is a function of the ideological distance to that group. It proves that an increase in ideological polarization (as defined by the paper's partial order) necessarily leads to an increase in the average level of affective polarization. This provides a formal microfoundation for the empirically observed link between the two.
*   **Salience of Divisive Issues (Proposition 4):** A second model conceptualizes a voter's overall ideology as a weighted average of their positions on a consensual ("common-value") issue and a divisive issue. The paper shows that increasing the salience (weight) of the more divisive issue causes an increase in ideological polarization as defined in the paper.
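
The issue-salience mechanism can be illustrated with a short simulation. This is a sketch under assumptions, not the paper's model in full: ideology is taken to be `alpha * divisive + (1 - alpha) * common`, the two issue positions are drawn from uniform distributions whose ranges mirror the example `salience_simulation_params` in this repository's README, and `simulate_salience` is a hypothetical name. The index is computed in its mean-distance form, which is equivalent to the integral definition when all values lie within the scale boundaries.

```python
import numpy as np


def polarization_index(samples, x_star, lo, hi):
    """Unweighted P(F, x*) as normalized mean distances to x* on each side."""
    samples = np.asarray(samples, dtype=float)
    left = np.mean(np.clip(x_star - samples, 0, None)) / (x_star - lo)
    right = np.mean(np.clip(samples - x_star, 0, None)) / (hi - x_star)
    return float(left + right)


def simulate_salience(alphas, n=50_000, seed=42):
    """Polarization around 5 on a 0-10 scale for each salience weight."""
    rng = np.random.default_rng(seed)
    common = rng.uniform(4.8, 5.2, n)     # consensual issue: tight spread
    divisive = rng.uniform(0.0, 10.0, n)  # divisive issue: wide spread
    results = {}
    for alpha in alphas:
        # Overall ideology as a weighted average of the two issue positions.
        ideology = alpha * divisive + (1 - alpha) * common
        results[alpha] = polarization_index(ideology, 5.0, 0.0, 10.0)
    return results


salience_results = simulate_salience([0.1, 0.3, 0.5, 0.7, 0.9])
for alpha, value in salience_results.items():
    print(f"alpha={alpha:.1f}  P(F, 5)={value:.3f}")
```

The same random draws are reused for every `alpha`, so differences across rows reflect the salience weight rather than sampling noise.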

### **Overall Contribution**

The paper's contribution is both methodological and substantive. It provides a rigorous yet intuitive framework that:
1.  **Generalizes** the concept of polarization as the "disappearance of the center" to any point of interest.
2.  **Enables** a more granular and nuanced empirical analysis, revealing asymmetric dynamics missed by traditional aggregate measures.
3.  **Offers** a tool to endogenously identify the ideological "cleavage points" where political division is growing fastest.
4.  **Builds theoretical bridges** by formally linking this measure of ideological polarization to the distinct concepts of affective polarization and issue salience.

# Import Essential Modules

# ==============================================================================
#
#  A Flexible Measure of Voter Polarization: A Computational Implementation
#
#  This module provides a complete, production-grade implementation of the
#  analytical framework presented in "A Flexible Measure of Voter Polarization"
#  by Boris Ginzburg. It includes functions for data validation, cleansing,
#  preprocessing, core polarization calculation, and a suite of advanced
#  analytical and reporting tools.
#
# ==============================================================================

# --- Consolidated Imports ---
import pandas as pd
import numpy as np
import warnings
from typing import Dict, Any, List, Tuple, Callable, Optional, Set, Union
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import percentileofscore



# Implementation

## Draft 1

### Inputs, Processes and Outputs (IPO) Analysis of Key Callables

### **I-P-O Analysis of the Function Library**

#### **Task 0: Parameter Validation**

*   **Callable:** `validate_parameters` (and its helpers: `_validate_dataframe`, `_validate_boundaries`, `_validate_central_points`, `_validate_integration_params`, `_validate_cleavage_finder_params`)
    *   **Inputs:** Raw `anes_df` DataFrame and all parameter dictionaries (`central_points_params`, `boundaries_params`, etc.).
    *   **Processes:** This function performs no data transformation; it is purely a validation gate. It inspects the structure, types, and values of all inputs against rules derived from the logical and mathematical constraints of the subsequent analysis: required columns, valid years, boundary ordering (`min < max`), and cross-parameter consistency (e.g., that the years specified for cleavage analysis exist in the data).
    *   **Outputs:** None. The function either returns silently, signifying that all inputs are valid, or raises a specific `ValueError` or `TypeError`, halting the pipeline before any computation occurs.
    *   **Role in Research Pipeline:** This callable is the "pre-flight check" for the whole pipeline. It catches malformed data and parameters before they reach downstream calculations, so that failures surface early with an informative message.

---

#### **Task 1: Data Cleansing**

*   **Callable:** `clean_anes_data`
    *   **Inputs:** The raw `anes_df` DataFrame.
    *   **Processes:** The function executes a multi-step data purification process.
        1.  It scans all numeric columns for non-finite values (`np.inf`, `-np.inf`) and transforms them into a standard missing value representation (`np.nan`).
        2.  It also converts survey-specific missing-value codes (e.g., 98, 99) in the ideological columns to `np.nan`.
        3.  It then applies listwise deletion, removing any row (respondent) that has a `np.nan` value in any of the critical analytical columns (`left_right`, `liberal_conservative`, `weight`).
        4.  Finally, it resets the DataFrame's index to be a clean, contiguous integer sequence.
    *   **Outputs:** A new, cleaned `pandas.DataFrame` where the core analytical columns are guaranteed to be free of missing or invalid data.
    *   **Role in Research Pipeline:** This function prepares the raw survey data for rigorous quantitative analysis. It operationalizes the standard practice of handling non-substantive responses ("Don't know," "Refused") and ensures the dataset used for modeling is complete with respect to the variables of interest. This is a prerequisite for constructing valid probability distributions.
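
The four cleaning steps above can be sketched as follows. This is a minimal illustration, not the library's `clean_anes_data`: the function name `clean_sketch` and the default missing-value codes `(98, 99)` are assumptions.

```python
import numpy as np
import pandas as pd


def clean_sketch(df, missing_codes=(98, 99)):
    """Hypothetical stand-in for the cleaning step described above."""
    ideological = ["left_right", "liberal_conservative"]
    critical = ideological + ["weight"]
    out = df.copy()

    # 1. Non-finite values in numeric columns become NaN.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].replace([np.inf, -np.inf], np.nan)

    # 2. Survey missing-value codes in the ideological columns become NaN.
    for col in ideological:
        out[col] = out[col].where(~out[col].isin(list(missing_codes)))

    # 3. Listwise deletion on the critical analytical columns.
    out = out.dropna(subset=critical)

    # 4. Contiguous integer index.
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "respondent_id": ["R_0", "R_1", "R_2", "R_3"],
    "year": ["2016a"] * 4,
    "left_right": [5, 98, 3, 7],
    "liberal_conservative": [4, 2, 99, 6],
    "weight": [1.0, 1.2, 0.8, np.inf],
})
cleaned = clean_sketch(raw)
print(cleaned)
```

Only `R_0` survives here: the other three rows carry a missing-value code or a non-finite weight in a critical column.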

---

#### **Task 2: Data Preprocessing**

*   **Callable:** `preprocess_for_polarization`
    *   **Inputs:** The `cleaned_df` DataFrame from Task 1.
    *   **Processes:** This function transforms the flat, clean data into a hierarchical, computationally efficient structure.
        1.  For each survey `year`, it normalizes the `weight` column so that all weights within that year sum to 1.0, creating probabilistic weights.
        2.  It then iterates through each ideological `scale`. For each `(year, scale)` pair, it sorts the respondents by their ideological position.
        3.  It constructs the empirical Probability Mass Function (PMF) by grouping by unique positions and summing the normalized weights.
        4.  It then constructs the empirical Cumulative Distribution Function (CDF), `F(x)`, by calculating the cumulative sum of the PMF.
    *   **Outputs:** A nested dictionary of the form `{year: {scale: {'positions': np.ndarray, 'weights': np.ndarray, 'cdf': pd.DataFrame}}}`. This structure contains the empirical distribution `F` for every year and scale, ready for analysis.
    *   **Role in Research Pipeline:** This callable is the direct computational counterpart to the concept of an electorate's ideological distribution, denoted by `F` throughout the paper. It takes the raw survey responses and produces the precise mathematical object, the empirical CDF, upon which the entire polarization measurement is based.
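
As a rough sketch of the preprocessing steps above, the following builds the PMF and CDF for one scale. It is a simplification under stated assumptions: the name `empirical_cdf_sketch` is hypothetical, it handles a single scale, and it returns the CDF as a NumPy array rather than the `pd.DataFrame` the library stores.

```python
import numpy as np
import pandas as pd


def empirical_cdf_sketch(df, scale):
    """Hypothetical helper: per-year PMF and CDF for one ideological scale."""
    out = {}
    for year, grp in df.groupby("year"):
        # Normalize weights so they sum to 1.0 within the survey year.
        w = grp["weight"] / grp["weight"].sum()
        # PMF: total normalized weight at each unique position.
        # groupby sorts the positions in ascending order by default.
        pmf = w.groupby(grp[scale]).sum()
        out[year] = {
            "positions": pmf.index.to_numpy(dtype=float),
            "weights": pmf.to_numpy(),
            # CDF: running total of the PMF.
            "cdf": pmf.cumsum().to_numpy(),
        }
    return out


toy = pd.DataFrame({
    "year": ["2016a", "2016a", "2016a", "2020"],
    "left_right": [2, 2, 8, 5],
    "weight": [1.0, 1.0, 2.0, 3.0],
})
dist = empirical_cdf_sketch(toy, "left_right")
print(dist["2016a"])
```

In the toy data, the two respondents at position 2 jointly carry half of the 2016a weight, so the CDF steps to 0.5 at 2 and to 1.0 at 8.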

---

#### **Tasks 3 & 4: Polarization Calculation**

*   **Callables:** `calculate_polarization_index` and `compute_all_polarization_measures`
    *   **Inputs:** The `preprocessed_data` dictionary containing the CDFs, and the `central_points_params` and `boundaries_params` dictionaries.
    *   **Processes:** The `compute_all_polarization_measures` function evaluates the index over the full grid of `(year, scale, x*)` combinations. For each combination it calls `calculate_polarization_index`, which implements the formula for the polarization index from Definition 2 of the paper. Because the empirical CDF is a step function, the integrals are computed exactly as sums of rectangle areas under the steps.
    *   **Outputs:** A single `pandas.DataFrame` with a `(year, scale, x_star)` MultiIndex, containing the computed `polarization_value` for every point in the parameter space.
    *   **Role in Research Pipeline:** This is the central implementation of the paper's primary contribution. It computes the flexible measure of polarization, `P(F, x*)`.
    *   **Equation Implemented:**
        $P(F, x^*) := \frac{\int_{\min\{X\}}^{x^*} F(x) dx}{x^* - \min\{X\}} - \frac{\int_{x^*}^{\max\{X\}} F(x) dx}{\max\{X\} - x^*} + 1$
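
A minimal sketch of this calculation for a discrete distribution follows. It is not the library's `calculate_polarization_index`; the function name and signature are assumptions. It uses an equivalent rearrangement of the rectangle sum: a mass point at `x_i` contributes its weight times the length of the part of `[a, b]` lying at or above `x_i`.

```python
import numpy as np


def polarization_index_sketch(positions, weights, x_star, x_min, x_max):
    """Hypothetical helper: P(F, x*) for a discrete distribution, integrated exactly."""
    positions = np.asarray(positions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if not x_min < x_star < x_max:
        # The formula divides by (x* - min) and (max - x*).
        raise ValueError("x_star must lie strictly inside (x_min, x_max).")
    weights = weights / weights.sum()

    def integral_of_cdf(a, b):
        # Integral of the step-function CDF over [a, b].
        lengths = np.clip(b - np.maximum(positions, a), 0.0, None)
        return float(np.sum(weights * lengths))

    left = integral_of_cdf(x_min, x_star) / (x_star - x_min)
    right = integral_of_cdf(x_star, x_max) / (x_max - x_star)
    return left - right + 1.0


# All voters at the center: no polarization around x* = 5.
print(polarization_index_sketch([5], [1.0], 5, 0, 10))
# Voters split evenly between the two extremes: maximal polarization.
print(polarization_index_sketch([0, 10], [0.5, 0.5], 5, 0, 10))
# Equal mass on each integer 0..10.
print(polarization_index_sketch(range(11), np.ones(11), 5, 0, 10))
```

The first two calls hit the endpoints of the index's range (0 and 1), which is a quick sanity check on the sign conventions in the formula.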

---

#### **Tasks 5, 6, 8 & 10: Analysis Suite**

*   **Callables:** `prepare_temporal_analysis`, `analyze_election_year_effects`, `identify_and_analyze_cleavage_points`, `create_comparative_analysis_framework`
    *   **Inputs:** The `polarization_results` DataFrame and, for comparison, the `traditional_measures` DataFrame.
    *   **Processes:** This suite of functions transforms and analyzes the computed polarization results.
        *   `prepare_temporal_analysis` reshapes the data for time-series plotting, replicating the panel structure of Figures 1 & 2.
        *   `analyze_election_year_effects` isolates pre/post election waves and calculates the change `ΔP`, replicating the analysis of Figure 3.
        *   `identify_and_analyze_cleavage_points` applies the Cleavage Point Finder algorithm to locate the `x*` that maximizes the percentage increase in polarization over a period, as discussed in the "Finding cleavages" section.
        *   `create_comparative_analysis_framework` joins the `P(F, x*)` results with traditional measures (variance, etc.) and computes their correlation to identify areas of analytical divergence and convergence.
    *   **Outputs:** A collection of analysis-ready DataFrames and summary tables.
    *   **Role in Research Pipeline:** These functions execute the various applications of the `P(F, x*)` index demonstrated in the paper. They show how the flexible measure can be used to gain more detailed insights into polarization dynamics over time, around specific events, and in comparison to older methods.
    *   **Algorithm Implemented (`find_cleavage_points`):**
        1.  For a given period `(t1, t2)` and `scale`:
        2.  For each candidate `x*` in the policy space:
        3.  Calculate $\text{increase}(x^*) = 100 \times \frac{P(F_{t_2}, x^*) - P(F_{t_1}, x^*)}{P(F_{t_1}, x^*)}$
        4.  Return the `x*` that maximizes the increase.
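
The algorithm above can be sketched in a few lines. The function name `find_cleavage_point_sketch`, its dictionary-based inputs, and the index values in the example are all made up for illustration; they are not results from the paper or the library.

```python
def find_cleavage_point_sketch(p_t1, p_t2):
    """Hypothetical helper: return (x*, percentage increase) with the largest increase.

    p_t1 and p_t2 map each candidate x* to P(F, x*) at the start and end of the period.
    """
    increases = {}
    for x_star, base in p_t1.items():
        # Skip candidates where the percentage change is undefined.
        if x_star not in p_t2 or base == 0:
            continue
        increases[x_star] = 100.0 * (p_t2[x_star] - base) / base
    if not increases:
        raise ValueError("No candidate x* has a well-defined percentage increase.")
    best = max(increases, key=increases.get)
    return best, increases[best]


# Made-up index values for three candidate points at two time points.
p_start = {4: 0.40, 5: 0.50, 6: 0.60}
p_end = {4: 0.50, 5: 0.55, 6: 0.63}
best_x_star, pct_increase = find_cleavage_point_sketch(p_start, p_end)
print(best_x_star, round(pct_increase, 2))
```

Because the criterion is a percentage change, a point with a low starting value can be selected even when its absolute change is modest, which is why candidates with a zero baseline are excluded here.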

---

#### **Task 9: Traditional Measures**

*   **Callable:** `compute_traditional_measures` (and its helper `_weighted_descriptive_stats`)
    *   **Inputs:** The `cleaned_df` DataFrame.
    *   **Processes:** This function calculates standard, non-flexible measures of polarization. It computes the weighted variance of the ideological distributions and the weighted proportion of "centrist" voters for each year and scale.
    *   **Outputs:** A `pandas.DataFrame` indexed by `(year, scale)` containing the values for these traditional metrics.
    *   **Role in Research Pipeline:** This callable provides the analytical baseline. As the paper argues, traditional measures like variance can mask underlying dynamics. Computing them on the same data gives the comparative analysis (Task 10) a benchmark against which the additional detail captured by the `P(F, x*)` index can be assessed.
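
For reference, the two traditional measures can be sketched as below. This is an illustration only: the function name is hypothetical, and the variance is the population (not Bessel-corrected) weighted variance, which may differ from what `_weighted_descriptive_stats` computes.

```python
import numpy as np


def weighted_variance_and_centrist_share(positions, weights, centrist_values):
    """Hypothetical helper: weighted variance and weighted share of centrists."""
    x = np.asarray(positions, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize weights to sum to 1

    mean = np.sum(w * x)
    variance = np.sum(w * (x - mean) ** 2)
    # Share of total weight held by respondents at a "centrist" position.
    centrist_share = np.sum(w[np.isin(x, centrist_values)])
    return float(variance), float(centrist_share)


variance, centrist_share = weighted_variance_and_centrist_share(
    positions=[0, 5, 10],
    weights=[1.0, 2.0, 1.0],
    centrist_values=[4, 5, 6],
)
print(variance, centrist_share)
```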

---

#### **Task 11: Theoretical Extensions**

*   **Callables:** `calculate_affective_polarization` and `simulate_issue_salience_effect`
    *   **Inputs:** For affective polarization: `positions`, `weights`, a dividing `x*`, and an animosity function `g`. For issue salience: simulation parameters (`alpha` values, distribution definitions).
    *   **Processes:**
        *   `calculate_affective_polarization` implements the model linking ideological distance to animosity, calculating the total weighted animosity in the electorate.
        *   `simulate_issue_salience_effect` implements the model where ideology is a weighted average of two issues. It simulates electorates with varying issue salience (`alpha`) and calculates the resulting `P(F, x*)` for each.
    *   **Outputs:** For affective polarization, a scalar score. For issue salience, a DataFrame showing the relationship between `alpha` and `P(F, x*)`.
    *   **Role in Research Pipeline:** These functions provide computational validation for the theoretical mechanisms discussed in the paper (Propositions 3 and 4). They demonstrate how the `P(F, x*)` measure connects to other important concepts like affective polarization and the salience of divisive issues.
    *   **Equations Implemented:**
        *   Affective Polarization: $A(F) = \sum_{i: x_i < x^*} w_i \cdot g(|x_i - m_R|) + \sum_{i: x_i \ge x^*} w_i \cdot g(|x_i - m_L|)$
        *   Issue Salience: $x = (1-\alpha)c + \alpha d$
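
The affective polarization sum can be sketched as follows. The text above does not define $m_L$ and $m_R$; this sketch assumes they are the weighted mean positions of the groups to the left and right of `x*`, and it uses the identity as a default animosity function `g`. Both choices, and the function name, are assumptions rather than the library's implementation.

```python
import numpy as np


def affective_polarization_sketch(positions, weights, x_star, g=lambda dist: dist):
    """Hypothetical helper: weighted animosity toward the opposing group."""
    x = np.asarray(positions, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    left = x < x_star
    right = ~left
    if not left.any() or not right.any():
        raise ValueError("Both sides of x_star must contain at least one voter.")

    # Assumption: each group is represented by its weighted mean position.
    m_left = np.average(x[left], weights=w[left])
    m_right = np.average(x[right], weights=w[right])

    # Left voters feel animosity toward m_right, right voters toward m_left.
    total = np.sum(w[left] * g(np.abs(x[left] - m_right)))
    total += np.sum(w[right] * g(np.abs(x[right] - m_left)))
    return float(total)


moderate = affective_polarization_sketch([2, 8], [1.0, 1.0], x_star=5)
extreme = affective_polarization_sketch([0, 10], [1.0, 1.0], x_star=5)
print(moderate, extreme)
```

Moving both groups outward raises the score in this toy case, in line with the direction of Proposition 3.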

---

#### **Tasks 12 & 14: Master Orchestrators**

*   **Callables:** `run_polarization_pipeline`, `run_robustness_analysis`, `execute_full_research_project`
    *   **Inputs:** The raw `anes_df` and the master `params` dictionary.
    *   **Processes:** These are high-level orchestrators that execute the entire analytical workflow. `run_polarization_pipeline` runs the baseline analysis from Task 0 to 11. `run_robustness_analysis` repeatedly calls the pipeline under different assumptions (e.g., weighted vs. unweighted). `execute_full_research_project` calls both of these and the reporting function (`generate_report_visuals`) to produce the final, complete project output.
    *   **Outputs:** A single, comprehensive, hierarchically structured dictionary containing all data, results, analyses, and report assets generated by the entire project.
    *   **Role in Research Pipeline:** These functions wrap the entire research project in a single automated workflow, so the analysis can be rerun from start to finish with one call.


### Usage


### **User Guide: Deploying the End-to-End Polarization Analysis Pipeline**

This section provides a practical, step-by-step guide to utilizing the `execute_full_research_project` function. This function is the single entry point to the entire analytical library, designed for robust, reproducible, and comprehensive analysis of voter polarization.

#### **Step 1: Data Acquisition and Preparation**

The pipeline requires a single, consolidated `pandas.DataFrame` as its primary data input. This DataFrame must be prepared *before* calling the main function and must adhere to a strict schema.

*   **Action:** The analyst must first acquire the necessary survey data (e.g., from the ANES Data Center). The data from multiple years and waves should be merged into one file.
*   **Schema:** The resulting DataFrame must contain the following columns with these exact names:
    *   `respondent_id`: A unique identifier for each respondent.
    *   `year`: A string or object representing the survey wave (e.g., `'2004a'`, `'2016b'`).
    *   `left_right`: The respondent's self-placement on the 0-10 left-right scale.
    *   `liberal_conservative`: The respondent's self-placement on the 1-7 liberal-conservative scale.
    *   `weight`: The full-sample survey weight provided by ANES for that specific survey wave.

**Code Snippet: Creating a Mock DataFrame**

For demonstration purposes, we will construct a small, mock `DataFrame` that conforms to this schema. In a real-world scenario, this `anes_df` would be the result of loading and merging actual ANES `.dta` or `.csv` files.

```python
import pandas as pd
import numpy as np

# In a real application, this DataFrame would be loaded from a file.
# df = pd.read_csv("path/to/your/merged_anes_data.csv")
# For this example, we create a mock DataFrame.

# Seed a generator so the mock data is the same on every run.
rng = np.random.default_rng(42)

data = {
    'respondent_id': [f'R_{i}' for i in range(100)],
    'year': rng.choice(['2012', '2016a', '2016b', '2020'], 100),
    'left_right': rng.choice(list(range(11)) + [98], 100), # Include a missing code
    'liberal_conservative': rng.choice(list(range(1, 8)) + [99], 100), # Include a missing code
    'weight': rng.uniform(0.5, 2.5, 100)
}
anes_df = pd.DataFrame(data)

print("Sample of the prepared input DataFrame:")
print(anes_df.head())
```

#### **Step 2: Constructing the Master Parameters Dictionary**

The `execute_full_research_project` function is controlled by a single, comprehensive parameters dictionary. This object holds every tunable setting for the analysis, so a run is fully described by the input data and this dictionary. We will construct it section by section, using the default values specified in the original project brief.

**Code Snippet: Defining the `params` Dictionary**

```python
# Initialize the master parameters dictionary.
params = {}

# --- Central Points of Interest (x*) ---
# These are the points around which polarization will be measured for each scale.
params['central_points_params'] = {
    "left_right_x_stars": [1, 2, 3, 4, 5, 6, 7, 8, 9], # Note: Excludes boundaries 0 and 10
    "liberal_conservative_x_stars": [2, 3, 4, 5, 6] # Note: Excludes boundaries 1 and 7
}

# --- Policy Space Boundaries ---
# These define the theoretical min and max of each ideological scale.
params['boundaries_params'] = {
    "left_right_boundaries": {'min': 0, 'max': 10},
    "liberal_conservative_boundaries": {'min': 1, 'max': 7}
}

# --- Integration Parameters ---
# Defines the numerical method for the P(F,x*) calculation.
# Note: 'trapezoidal' is the only method the parameter validator accepts.
params['integration_params'] = {
    'method': 'trapezoidal',
    'num_points': 1000
}

# --- Cleavage Finder Parameters ---
# Defines the time periods to scan for the maximum increase in polarization.
params['cleavage_finder_params'] = {
    'time_points': [('2016a', '2020')], # Using years available in our mock data
    'potential_x_stars': params['central_points_params']
}

# --- Analysis-Specific Parameters ---
# Defines midpoints for partitioning temporal plots and centrist definitions.
params['midpoints'] = {
    'left_right': 5,
    'liberal_conservative': 4
}
params['centrist_definitions'] = {
    'left_right': [4, 5, 6],
    'liberal_conservative': [3, 4, 5]
}
params['election_years_for_analysis'] = [2016] # Using year available in mock data

# --- Theoretical Extension Parameters ---
# Defines parameters for the issue salience simulation.
params['salience_simulation_params'] = {
    'salience_alphas': [0.1, 0.3, 0.5, 0.7, 0.9],
    'common_value_dist': {'low': 4.8, 'high': 5.2},
    'divisive_issue_dist': {'low': 0, 'high': 10},
    'polarization_params': {
        'x_star': 5, # The x* to calculate polarization around in the simulation
        'boundaries': params['boundaries_params']['left_right_boundaries']
    },
    'random_seed': 42 # For reproducibility
}

# --- Optional: ANES Missing Value Codes ---
# This can be customized if different codes are used in the data.
params['missing_value_map'] = {
    'left_right': [98, 99],
    'liberal_conservative': [98, 99]
}

print("\nMaster parameters dictionary constructed successfully.")
```

#### **Step 3: Executing the Pipeline and Inspecting Results**

With the input `DataFrame` and the `params` dictionary prepared, the entire research project can be executed with a single function call. The function will print its progress through the various stages of the analysis.

**Code Snippet: Running the Master Orchestrator**

```python
# First, ensure the full library of functions is loaded in your environment.
# from ginzburg_polarization_model_python_implementation import execute_full_research_project

# Execute the entire research project.
# This single call runs validation, cleaning, all calculations, analyses,
# robustness checks, and reporting.
project_results = execute_full_research_project(
    anes_df=anes_df,
    params=params
)
```

The returned `project_results` object is a deeply nested dictionary containing every artifact generated during the run. An analyst can now programmatically access any piece of the analysis for inspection, custom plotting, or further work.

**Code Snippet: Accessing and Inspecting Key Outputs**

```python
# --- Inspecting the Main Analysis Results ---
print("\n--- Example: Inspecting Key Outputs from the Main Analysis ---")

# Access the main results dictionary
main_analysis_results = project_results['main_analysis']['results']

# 1. View the summary of the cleavage point analysis
print("\nCleavage Point Analysis Summary:")
cleavage_summary = main_analysis_results['cleavage_analysis']
print(cleavage_summary)

# 2. View the summary of the comparative analysis (correlation)
print("\nCorrelation Summary (Traditional vs. Flexible Measures):")
correlation_summary = main_analysis_results['comparative_framework']['correlation_summary']
print(correlation_summary)

# 3. Access a specific plot (e.g., temporal trends for the left-right scale)
report_assets = project_results['main_analysis']['report_assets']
temporal_plot_lr = report_assets['plots']['temporal_trends_left_right']
# To display the plot in a Jupyter environment:
# temporal_plot_lr.show()
# To save the plot to a file:
# temporal_plot_lr.savefig("temporal_trends_left_right.png", dpi=300)
print("\nTemporal trend plot for 'left_right' scale has been generated.")

# --- Inspecting the Robustness Analysis Results ---
print("\n--- Example: Comparing Robustness Check Results ---")

# Compare the cleavage point found in the weighted vs. unweighted scenarios
try:
    weighted_cleavage = project_results['robustness_analysis']['weighted']['cleavage_analysis']
    unweighted_cleavage = project_results['robustness_analysis']['unweighted']['cleavage_analysis']

    print("\nCleavage points from 'weighted' analysis:")
    print(weighted_cleavage[['cleavage_point', 'relative_position']])

    print("\nCleavage points from 'unweighted' analysis:")
    print(unweighted_cleavage[['cleavage_point', 'relative_position']])
except KeyError:
    print("\nRobustness check for one or more scenarios may have failed.")

```

This example walks through the complete workflow: preparing the data, configuring the analysis, executing the pipeline with one call, and accessing the structured outputs for interpretation.



In [None]:
# Task 0: Parameter Validation

def _validate_dataframe(
    df: pd.DataFrame,
    expected_cols: Set[str],
    valid_years: Set[Any]
) -> None:
    """
    Validates the structure and content of the input ANES DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to validate.
        expected_cols (Set[str]): A set of required column names.
        valid_years (Set[Any]): A set of permissible values for the 'year' column.

    Raises:
        ValueError: If the DataFrame is empty, missing required columns, or contains
                    invalid years or non-positive weights.
        TypeError: If columns that should be numeric are not.
    """
    # Step 1.1: Check if the DataFrame is empty.
    # An empty DataFrame cannot be processed.
    if df.empty:
        # Raise an error indicating the DataFrame is empty.
        raise ValueError("Input DataFrame 'anes_df' cannot be empty.")

    # Step 1.2: Check for the presence of all required columns.
    # The analysis depends on a specific set of columns.
    actual_cols = set(df.columns)
    # Find the set of columns that are missing from the DataFrame.
    missing_cols = expected_cols - actual_cols
    # If any columns are missing, raise an error.
    if missing_cols:
        # Raise an error listing the missing columns.
        raise ValueError(
            f"Input DataFrame 'anes_df' is missing required columns: {sorted(list(missing_cols))}"
        )

    # Step 1.3: Validate the 'year' column against a set of allowed survey waves.
    # The analysis is specific to certain ANES survey years.
    unique_years = set(df['year'].unique())
    # Identify any years present in the data that are not in the allowed list.
    invalid_years = unique_years - valid_years
    # If invalid years are found, raise an error.
    if invalid_years:
        # Raise an error listing the unexpected years.
        raise ValueError(
            f"Input DataFrame 'anes_df' contains invalid or unexpected years: {sorted(list(invalid_years))}. "
            f"Valid years are: {sorted(list(valid_years))}"
        )

    # Step 1.4: Validate data types for numeric columns.
    # Ideological scales and weights must be numeric for calculations.
    for col in ['left_right', 'liberal_conservative', 'weight']:
        # Check if the column's data type is numeric.
        if not pd.api.types.is_numeric_dtype(df[col]):
            # If not numeric, raise a TypeError.
            raise TypeError(
                f"Column '{col}' in 'anes_df' must be of a numeric type, but found {df[col].dtype}."
            )

    # Step 1.5: Ensure all weight values are positive.
    # Weights must be strictly positive for valid weighted statistical calculations.
    if (df['weight'] <= 0).any():
        # Raise an error if any weight is zero or negative.
        raise ValueError("Column 'weight' in 'anes_df' must contain only positive values.")

def _validate_boundaries(
    boundaries_params: Dict[str, Dict[str, float]]
) -> None:
    """
    Validates the policy space boundaries dictionary.

    Args:
        boundaries_params (Dict[str, Dict[str, float]]): The boundaries dictionary.

    Raises:
        ValueError: If the structure is incorrect or if min >= max.
        TypeError: If boundary values are not numeric.
    """
    # Step 3.1: Define the expected structure.
    # The dictionary must contain boundaries for both ideological scales.
    expected_scales = {'left_right_boundaries', 'liberal_conservative_boundaries'}
    # If the top-level keys do not match expectations, raise an error.
    if not expected_scales.issubset(boundaries_params.keys()):
        # Raise an error indicating the missing scale definitions.
        raise ValueError(f"Boundary parameters missing keys. Expected: {expected_scales}")

    # Step 3.2: Iterate through each scale's boundaries to validate them.
    for scale_name, bounds in boundaries_params.items():
        # Check for the presence of 'min' and 'max' keys.
        if not {'min', 'max'}.issubset(bounds.keys()):
            # Raise an error if keys are missing for a scale.
            raise ValueError(f"Boundary definition for '{scale_name}' is missing 'min' or 'max' key.")

        # Extract min and max values.
        min_val, max_val = bounds['min'], bounds['max']

        # Check if min and max are numeric.
        if not all(isinstance(v, (int, float)) for v in [min_val, max_val]):
            # Raise a TypeError if values are not numeric.
            raise TypeError(f"Boundary values for '{scale_name}' must be numeric.")

        # Step 3.3: Enforce the logical constraint that min must be less than max.
        # The policy space must have a positive width.
        if min_val >= max_val:
            # Raise an error if the condition is violated.
            raise ValueError(
                f"Invalid boundaries for '{scale_name}': 'min' ({min_val}) must be strictly less than 'max' ({max_val})."
            )

def _validate_central_points(
    central_points_params: Dict[str, List[float]],
    boundaries_params: Dict[str, Dict[str, float]]
) -> None:
    """
    Validates the central points of interest (x_stars) dictionary.

    Args:
        central_points_params (Dict[str, List[float]]): The central points dictionary.
        boundaries_params (Dict[str, Dict[str, float]]): The validated boundaries dictionary.

    Raises:
        ValueError: If structure is incorrect, values are out of bounds, not sorted, or not unique.
        TypeError: If values are not lists of numbers.
    """
    # Step 2.1: Define the expected structure.
    # The dictionary must contain x_star lists for both ideological scales.
    expected_scales = {'left_right_x_stars', 'liberal_conservative_x_stars'}
    # If the top-level keys do not match expectations, raise an error.
    if not expected_scales.issubset(central_points_params.keys()):
        # Raise an error indicating the missing x_star definitions.
        raise ValueError(f"Central points parameters missing keys. Expected: {expected_scales}")

    # Step 2.2: Iterate through each scale's x_star list to validate it.
    for scale_name, x_stars in central_points_params.items():
        # Check if the value is a list.
        if not isinstance(x_stars, list):
            # Raise a TypeError if it's not a list.
            raise TypeError(f"Central points for '{scale_name}' must be a list.")
        # Check if the list is empty.
        if not x_stars:
            # Raise a ValueError if the list is empty.
            raise ValueError(f"Central points list for '{scale_name}' cannot be empty.")

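        # Convert to a NumPy array for efficient validation.
        x_stars_arr = np.array(x_stars)
        # Reject non-numeric entries explicitly, as the docstring promises a TypeError.
        if not np.issubdtype(x_stars_arr.dtype, np.number):
            raise TypeError(f"Central points for '{scale_name}' must contain only numbers.")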

        # Step 2.3: Check if all values are within the policy space boundaries.
        # The corresponding boundary key is derived from the x_star key name.
        boundary_key = scale_name.replace('_x_stars', '_boundaries')
        # Get the min and max bounds for the current scale.
        min_bound = boundaries_params[boundary_key]['min']
        max_bound = boundaries_params[boundary_key]['max']
        # Check if any x_star is outside the [min, max] range.
        if np.any(x_stars_arr < min_bound) or np.any(x_stars_arr > max_bound):
            # Raise an error specifying the valid range.
            raise ValueError(
                f"Central points for '{scale_name}' are out of bounds. "
                f"All values must be within [{min_bound}, {max_bound}]."
            )

        # Step 2.4: Check for uniqueness and ascending order.
        # The difference between consecutive elements must always be positive.
        if not np.all(np.diff(x_stars_arr) > 0):
            # Raise an error if the list is not strictly sorted.
            raise ValueError(
                f"Central points list for '{scale_name}' must be unique and sorted in ascending order."
            )

def _validate_integration_params(
    integration_params: Dict[str, Any]
) -> None:
    """
    Validates the numerical integration parameters.

    Args:
        integration_params (Dict[str, Any]): The integration parameters dictionary.

    Raises:
        ValueError: If structure is incorrect, method is unsupported, or num_points is invalid.
        TypeError: If num_points is not an integer.
    """
    # Step 4.1: Check for required keys.
    if not {'method', 'num_points'}.issubset(integration_params.keys()):
        # Raise an error if keys are missing.
        raise ValueError("Integration parameters must contain 'method' and 'num_points' keys.")

    # Step 4.2: Validate the integration method.
    # For this project, only 'trapezoidal' is specified, but this allows for future extension.
    supported_methods = {'trapezoidal'}
    # Check if the specified method is in the set of supported methods.
    if integration_params['method'] not in supported_methods:
        # Raise an error listing the supported methods.
        raise ValueError(
            f"Unsupported integration method '{integration_params['method']}'. "
            f"Supported methods are: {supported_methods}"
        )

    # Step 4.3: Validate the number of points for discretization.
    num_points = integration_params['num_points']
    # Check if num_points is an integer.
    if not isinstance(num_points, int):
        # Raise a TypeError if it's not an integer.
        raise TypeError(f"'num_points' must be an integer, but found {type(num_points)}.")
    # Check if num_points is positive and sufficiently large.
    if num_points <= 1:
        # Raise a ValueError for invalid number of points.
        raise ValueError(f"'num_points' must be an integer greater than 1, but found {num_points}.")

def _validate_cleavage_finder_params(
    cleavage_params: Dict[str, Any],
    df_years: Set[Any]
) -> None:
    """
    Validates the cleavage point finder parameters.

    Args:
        cleavage_params (Dict[str, Any]): The cleavage finder parameters dictionary.
        df_years (Set[Any]): The set of unique years available in the input DataFrame.

    Raises:
        ValueError: If structure is incorrect, time points are invalid or not in data.
        TypeError: If time_points is not a list of tuples.
    """
    # Step 5.1: Check for required keys.
    if not {'time_points', 'potential_x_stars'}.issubset(cleavage_params.keys()):
        # Raise an error if keys are missing.
        raise ValueError("Cleavage finder parameters must contain 'time_points' and 'potential_x_stars' keys.")

    # Step 5.2: Validate the 'time_points' list.
    time_points = cleavage_params['time_points']
    # Check if it's a list.
    if not isinstance(time_points, list):
        # Raise a TypeError if it's not a list.
        raise TypeError("'time_points' must be a list of tuples.")

    # A set to collect all years required for the cleavage analysis.
    required_years = set()
    # Iterate through each time period tuple.
    for period in time_points:
        # Check if each item is a tuple of length 2.
        if not isinstance(period, tuple) or len(period) != 2:
            # Raise a ValueError for malformed periods.
            raise ValueError(f"Each item in 'time_points' must be a tuple of length 2, but found {period}.")

        # Unpack the start and end year.
        t1, t2 = period
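        # Step 5.3: Check for chronological order.
        # Wave labels are strings (e.g., '2016a'), so this comparison is lexicographic;
        # it matches chronological order as long as labels start with a four-digit year.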
        if t1 >= t2:
            # Raise a ValueError if the period is not chronological.
            raise ValueError(f"Time periods must be chronological (t1 < t2), but found ({t1}, {t2}).")

        # Add the years to the set of required years.
        required_years.add(t1)
        required_years.add(t2)

    # Step 5.4: Cross-validate required years against years available in the data.
    # Find which of the required years are missing from the DataFrame.
    missing_from_df = required_years - df_years
    # If any required years are not in the data, the analysis cannot proceed.
    if missing_from_df:
        # Raise an error listing the missing years.
        raise ValueError(
            f"Cleavage analysis requires years that are not present in the DataFrame: {sorted(missing_from_df)}. "
            f"Available years: {sorted(df_years)}"
        )

def validate_parameters(
    anes_df: pd.DataFrame,
    central_points_params: Dict[str, List[float]],
    boundaries_params: Dict[str, Dict[str, float]],
    integration_params: Dict[str, Any],
    cleavage_finder_params: Dict[str, Any]
) -> None:
    """
    Performs a comprehensive validation of all input parameters for the polarization analysis.

    This function serves as a single entry point for validating all inputs before any
    computation begins, ensuring data integrity and parameter sanity. It follows the
    "fail-fast" principle, raising descriptive errors immediately upon finding an issue.

    Args:
        anes_df (pd.DataFrame):
            The primary DataFrame containing ANES survey data. Expected columns are
            ['respondent_id', 'year', 'left_right', 'liberal_conservative', 'weight'].
        central_points_params (Dict[str, List[float]]):
            A dictionary specifying the central points (x*) for analysis on each scale.
            Example: {'left_right_x_stars': [0, ..., 10], 'liberal_conservative_x_stars': [1, ..., 7]}
        boundaries_params (Dict[str, Dict[str, float]]):
            A dictionary defining the min and max boundaries of the policy space for each scale.
            Example: {'left_right_boundaries': {'min': 0, 'max': 10}, ...}
        integration_params (Dict[str, Any]):
            Parameters for numerical integration.
            Example: {'method': 'trapezoidal', 'num_points': 1000}
        cleavage_finder_params (Dict[str, Any]):
            Parameters for the cleavage point finder algorithm, including time periods.
            Example: {'time_points': [('2012', '2016a'), ...], ...}

    Raises:
        ValueError: If any parameter has an invalid value or logical inconsistency.
        TypeError: If any parameter has an incorrect data type.
    """
    # Define expected columns and valid years for the DataFrame validation.
    expected_cols = {'respondent_id', 'year', 'left_right', 'liberal_conservative', 'weight'}
    valid_years = {'2000', '2004a', '2004b', '2008', '2012', '2016a', '2016b', '2020'}

    # Task 0, Step 1: Validate the main DataFrame.
    _validate_dataframe(anes_df, expected_cols, valid_years)

    # Task 0, Step 3: Validate policy space boundaries. This must be done before
    # validating central points, as they depend on the boundaries.
    _validate_boundaries(boundaries_params)

    # Task 0, Step 2: Validate central points, using the now-validated boundaries.
    _validate_central_points(central_points_params, boundaries_params)

    # Task 0, Step 4: Validate integration parameters.
    _validate_integration_params(integration_params)

    # Task 0, Step 5: Validate cleavage finder parameters, cross-referencing with the DataFrame.
    df_years = set(anes_df['year'].unique())
    _validate_cleavage_finder_params(cleavage_finder_params, df_years)
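
The time-period checks above can be sketched in plain Python. This is a minimal illustration, not the project's validator: `check_time_points` is a hypothetical helper, and the year labels are assumed to be strings like those in `valid_years`. Because such labels compare lexicographically, `'2016a' < '2016b'` and `'2012' < '2016a'` both hold, so the chronological check still behaves as intended.

```python
from typing import Any, List, Set, Tuple


def check_time_points(
    time_points: List[Tuple[Any, Any]],
    available_years: Set[Any],
) -> List[Any]:
    """Hypothetical helper: return the sorted years the periods need but the data lacks."""
    required = set()
    for period in time_points:
        # Each period must be a (t1, t2) tuple.
        if not isinstance(period, tuple) or len(period) != 2:
            raise ValueError(f"Expected a (t1, t2) tuple, but found {period!r}.")
        t1, t2 = period
        # Periods must run forward in time.
        if t1 >= t2:
            raise ValueError(f"Expected t1 < t2, but found ({t1}, {t2}).")
        required.update(period)
    # Years requested by the periods but absent from the data.
    return sorted(required - available_years)


# Assumed year labels, copied from the valid_years set used in validate_parameters.
years_in_data = {"2000", "2004a", "2004b", "2008", "2012", "2016a", "2016b", "2020"}
missing = check_time_points([("2012", "2016a"), ("2016b", "2024")], years_in_data)
print(missing)  # ['2024']
```

Unlike the validator above, the sketch returns the missing years instead of raising, which makes the cross-check easy to inspect.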


In [None]:
# Task 1: Data Cleansing

def clean_anes_data(
    anes_df: pd.DataFrame,
    missing_value_map: Dict[str, List[Any]] = None
) -> pd.DataFrame:
    """
    Cleans the raw ANES DataFrame by handling infinite values, mapping specific
    missing value codes to NaN, and dropping rows with missing essential data.

    This function executes a rigorous, multi-step cleansing process crucial for
    preparing survey data for quantitative analysis. It ensures that the
    resulting DataFrame is free of non-substantive values in the core
    analytical columns, making it ready for preprocessing and modeling.

    The process is as follows:
    1.  A copy of the input DataFrame is created to prevent side effects.
    2.  Infinite values (np.inf, -np.inf) across the entire DataFrame are
        replaced with np.nan. A warning is issued if any are found.
    3.  User-defined missing value codes (e.g., 98 for "Don't Know", 99 for
        "Refused") are replaced with np.nan for specified ideological columns.
    4.  Rows containing any np.nan in the critical columns ('left_right',
        'liberal_conservative', 'weight') are dropped (listwise deletion).
    5.  The number of rows removed is reported to the user for transparency.
    6.  The index of the cleaned DataFrame is reset to ensure it is contiguous.
    7.  A final validation asserts that the critical columns are free of nulls.

    Args:
        anes_df (pd.DataFrame):
            The raw input DataFrame from ANES. It is expected to have undergone
            initial validation via the `validate_parameters` function.
        missing_value_map (Dict[str, List[Any]], optional):
            A dictionary mapping column names to a list of values that should be
            treated as missing. If None, a default map for common ANES missing
            codes ([98, 99]) is used for the ideological columns.
            Example: {'left_right': [98, 99], 'liberal_conservative': [97, 98, 99]}

    Returns:
        pd.DataFrame:
            A new, cleaned DataFrame with no infinite values or missing data in
            the core analytical columns.

    Raises:
        TypeError: If 'anes_df' is not a pandas DataFrame.
        ValueError: If 'anes_df' is empty.
    """
    # --- Input Validation ---
    # Ensure the primary input is a pandas DataFrame.
    if not isinstance(anes_df, pd.DataFrame):
        # Raise a TypeError if the input is not of the expected type.
        raise TypeError("Input 'anes_df' must be a pandas DataFrame.")

    # Ensure the DataFrame is not empty before proceeding.
    if anes_df.empty:
        # Raise a ValueError as cleansing an empty DataFrame is not meaningful.
        raise ValueError("Input 'anes_df' cannot be empty.")

    # --- Data Cleansing Pipeline ---
    # Create a copy to avoid modifying the original DataFrame in place. This is a
    # critical best practice for robust data pipelines.
    df_cleaned = anes_df.copy()

    # Store the original number of rows for final reporting.
    initial_rows = len(df_cleaned)

    # --- Step 1: Identify and Replace Infinite Values ---
    # Check for the presence of infinite values, which can result from data
    # loading errors or upstream calculations.
    if np.isinf(df_cleaned.select_dtypes(include=np.number)).any().any():
        # Issue a warning to the user that infinite values were found and are being replaced.
        # This alerts them to potential upstream data quality issues.
        warnings.warn(
            "Infinite values found in the DataFrame. Replacing with np.nan.",
            UserWarning
        )
        # Replace all occurrences of numpy.inf and -numpy.inf with numpy.nan.
        df_cleaned.replace([np.inf, -np.inf], np.nan, inplace=True)

    # --- Step 2: Handle Missing Value Codes Specific to ANES Surveys ---
    # If no specific map is provided, use a default for the ideological scales.
    # This provides sensible default behavior while allowing for customization.
    if missing_value_map is None:
        # Default map targets common ANES codes for "Don't Know" and "Refused".
        missing_value_map = {
            'left_right': [98, 99],
            'liberal_conservative': [98, 99]
        }

    # Apply the replacements defined in the map.
    # This converts survey-specific codes into the standard np.nan representation.
    df_cleaned.replace(missing_value_map, np.nan, inplace=True)

    # --- Step 3: Drop Rows with Missing Data in Critical Columns ---
    # Define the columns that are essential for the polarization analysis.
    # A row is only useful if it has valid data for both scales and a weight.
    critical_cols = ['left_right', 'liberal_conservative', 'weight']

    # Drop any row that has a null value in any of the critical columns.
    # This is a listwise deletion strategy focused on the core analytical variables.
    df_cleaned.dropna(subset=critical_cols, inplace=True)

    # --- Step 4: Post-Cleansing Validation and Reporting ---
    # Calculate the number of rows that were dropped during the process.
    rows_dropped = initial_rows - len(df_cleaned)

    # Report the outcome to the user for transparency and reproducibility.
    print(
        f"Data cleansing complete. "
        f"Removed {rows_dropped} of {initial_rows} rows "
        f"({rows_dropped / initial_rows:.2%}) due to missing data in critical columns."
    )

    # If the entire DataFrame becomes empty after cleansing, raise an error.
    if df_cleaned.empty:
        raise ValueError(
            "All rows were removed during data cleansing. "
            "Check data quality and missing value definitions."
        )

    # Reset the index of the DataFrame to be a clean, contiguous sequence from 0.
    # This is crucial for preventing indexing errors in subsequent processing steps.
    df_cleaned.reset_index(drop=True, inplace=True)

    # Final assertion: Verify that the critical columns now contain no null values.
    # This is a self-check to guarantee the function's post-conditions are met.
    assert df_cleaned[critical_cols].isnull().sum().sum() == 0, \
        "Post-cleansing validation failed: Null values remain in critical columns."

    # Return the fully cleaned and validated DataFrame.
    return df_cleaned
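
The listwise deletion rule can be restated without pandas. The sketch below is illustrative only: `is_missing`, `MISSING_CODES`, and the sample rows are made up for the example, and the codes 98 and 99 mirror the default map described in the docstring above.

```python
import math

# Assumed missing-value codes, mirroring the default map in clean_anes_data.
MISSING_CODES = {"left_right": {98, 99}, "liberal_conservative": {98, 99}}
CRITICAL_COLS = ("left_right", "liberal_conservative", "weight")


def is_missing(column, value):
    """Return True for None, NaN, +/-inf, or a survey code listed for this column."""
    if value is None:
        return True
    if isinstance(value, float) and not math.isfinite(value):
        return True
    return value in MISSING_CODES.get(column, set())


# Made-up respondents for illustration.
rows = [
    {"left_right": 4, "liberal_conservative": 3, "weight": 1.2},
    {"left_right": 98, "liberal_conservative": 5, "weight": 0.8},  # "Don't Know" code
    {"left_right": 7, "liberal_conservative": 99, "weight": 1.0},  # "Refused" code
    {"left_right": 6, "liberal_conservative": 6, "weight": float("inf")},
    {"left_right": 2, "liberal_conservative": 2, "weight": 0.9},
]

# Keep a row only if every critical column holds a substantive value.
kept = [row for row in rows if not any(is_missing(col, row[col]) for col in CRITICAL_COLS)]
print(f"Removed {len(rows) - len(kept)} of {len(rows)} rows")  # Removed 3 of 5 rows
```

Note that the codes apply per column: a weight of 98 would be kept, because no codes are registered for `weight`.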


In [None]:
# Task 2: Data Preprocessing

def preprocess_for_polarization(
    cleaned_df: pd.DataFrame
) -> Dict[Any, Dict[str, Dict[str, Union[np.ndarray, pd.DataFrame]]]]:
    """
    Preprocesses cleaned ANES data to create a hierarchical, analysis-ready structure.

    This function transforms a flat, cleaned DataFrame into a nested dictionary
    optimized for the polarization calculations. The structure is:
    {year: {scale: {'positions': array, 'weights': array, 'cdf': DataFrame}}}.
    This pre-computation avoids repeated filtering and sorting during the main
    analysis loop, significantly improving performance.

    The process for each year and ideological scale ('left_right', 'liberal_conservative') is:
    1.  Normalizes survey weights within each year to sum to 1.0, creating
        probabilistic weights.
    2.  For each scale, sorts the data by ideological position. This is a
        prerequisite for correct CDF construction.
    3.  Constructs a weighted Probability Mass Function (PMF) by summing the
        normalized weights at each unique ideological position.
    4.  Constructs the weighted Cumulative Distribution Function (CDF) by
        computing the cumulative sum of the PMF.
    5.  Performs rigorous quality assurance checks on each generated CDF to
        ensure it is mathematically valid (monotonic, bounded [0, 1]).
    6.  Stores the sorted positions, corresponding weights, and the CDF
        (as a DataFrame mapping positions to values) in the final nested dictionary.

    Args:
        cleaned_df (pd.DataFrame):
            A DataFrame that has been processed by `clean_anes_data`. It must
            contain the columns 'year', 'weight', 'left_right', and
            'liberal_conservative'.

    Returns:
        Dict[Any, Dict[str, Dict[str, Union[np.ndarray, pd.DataFrame]]]]:
            A nested dictionary where keys are years, then scales. The innermost
            dictionary contains NumPy arrays for 'positions' and 'weights', and
            a pandas DataFrame for the 'cdf' (index=position, column='cdf_value').

    Raises:
        ValueError: If the input DataFrame is empty. Year-scale combinations
                    with no observations are skipped rather than raising.
    """
    # --- Input Validation ---
    # Ensure the input DataFrame is not empty.
    if cleaned_df.empty:
        # Raise an error as preprocessing an empty DataFrame is not possible.
        raise ValueError("Input 'cleaned_df' cannot be empty.")

    # --- Step 1: Apply ANES Survey Weights (Normalization) ---
    # Create a copy to ensure the original DataFrame is not modified.
    df = cleaned_df.copy()
    # Divide each weight by the total weight of its year (a vectorized group-wise
    # sum broadcast back to the rows). This creates 'normalized_weight'.
    # w_normalized_i = w_i / sum(w_j) for all j in the same year.
    df['normalized_weight'] = df['weight'] / df.groupby('year')['weight'].transform('sum')

    # --- Step 2 & 3: Create Hierarchical Structure and Construct CDFs ---
    # Initialize the top-level dictionary to store the structured results.
    preprocessed_data = {}
    # Define the ideological scales to be processed.
    scales = ['left_right', 'liberal_conservative']

    # Iterate through each unique year present in the DataFrame.
    for year in df['year'].unique():
        # Initialize the dictionary for the current year.
        preprocessed_data[year] = {}
        # Get the subset of the DataFrame corresponding to the current year.
        year_df = df[df['year'] == year]

        # Iterate through each ideological scale.
        for scale in scales:
            # For each scale, select the relevant columns and drop any rows that
            # might have a NaN value for this specific scale.
            scale_df = year_df[[scale, 'normalized_weight']].dropna(subset=[scale])

            # If a specific year-scale combination has no data, skip it.
            if scale_df.empty:
                # This can happen if a survey year did not include a particular question.
                continue

            # Sort the data by the ideological position. This is essential for
            # creating the PMF and CDF correctly.
            scale_df_sorted = scale_df.sort_values(by=scale)

            # Construct the weighted Probability Mass Function (PMF).
            # Group by the unique ideological positions and sum their normalized weights.
            # This correctly handles ties in positions.
            pmf = scale_df_sorted.groupby(scale)['normalized_weight'].sum().reset_index()
            pmf.columns = ['position', 'pmf_value']

            # Construct the weighted Cumulative Distribution Function (CDF).
            # F(x) = sum_{i: x_i <= x} w_normalized_i
            # This is achieved by taking the cumulative sum of the PMF values.
            pmf['cdf_value'] = pmf['pmf_value'].cumsum()

            # Create the final CDF DataFrame, indexed by position for fast lookups.
            cdf_df = pmf[['position', 'cdf_value']].set_index('position')

            # --- Step 4: Data Quality Assurance for the CDF ---
            # Retrieve the computed CDF values as a NumPy array.
            cdf_values = cdf_df['cdf_value'].values
            # Assert that the CDF is monotonically non-decreasing.
            # The difference between consecutive elements must be >= 0.
            assert np.all(np.diff(cdf_values) >= -1e-9), \
                f"CDF for {year}-{scale} is not monotonic."
            # Assert that the last value of the CDF is approximately 1.0.
            # np.isclose handles potential floating-point inaccuracies.
            assert np.isclose(cdf_values[-1], 1.0), \
                f"CDF for {year}-{scale} does not sum to 1.0 (is {cdf_values[-1]})."

            # Store the final, validated results in the hierarchical dictionary.
            # We store the sorted positions and weights from the original sorted
            # DataFrame, and the computed CDF map.
            preprocessed_data[year][scale] = {
                'positions': scale_df_sorted[scale].values,
                'weights': scale_df_sorted['normalized_weight'].values,
                'cdf': cdf_df
            }

    # Return the final, structured, and validated data object.
    return preprocessed_data
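
As a small worked example of the PMF and CDF construction, the sketch below uses plain Python and made-up positions and weights for a single year and scale. It is not the pandas implementation above, only the same arithmetic on four respondents.

```python
from collections import defaultdict
from itertools import accumulate

# Made-up respondents for one year and one scale: positions and raw survey weights.
positions = [3, 1, 3, 5]
weights = [2.0, 1.0, 1.0, 4.0]

# Normalize the weights so they sum to 1 within the year.
total = sum(weights)

# Weighted PMF: sum the normalized weights at each unique position (handles ties).
pmf = defaultdict(float)
for x, w in zip(positions, weights):
    pmf[x] += w / total

# Weighted CDF: cumulative sum of the PMF over the sorted support.
support = sorted(pmf)
cdf = list(accumulate(pmf[x] for x in support))

print(support)  # [1, 3, 5]
print(cdf)      # [0.125, 0.5, 1.0]
```

The two respondents at position 3 are merged into a single PMF entry of 0.375, which is why the CDF has three steps rather than four.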


In [None]:
# Task 3: Implement the Polarization Measure Function

def calculate_polarization_index(
    cdf_df: pd.DataFrame,
    x_star: float,
    boundaries: Dict[str, float]
) -> float:
    """
    Calculates the flexible measure of voter polarization, P(F, x*).

    This function provides a precise, numerically stable implementation of the
    polarization index defined in Definition 2 of Ginzburg's paper. The index
    measures the dispersion of an electorate's ideological distribution (F)
    around a specified central point (x*).

    The formula is:
    P(F, x*) := [A / (x* - min{X})] - [B / (max{X} - x*)] + 1
    where:
    A = Integral from min{X} to x* of F(x) dx
    B = Integral from x* to max{X} of F(x) dx

    For a discrete empirical CDF (a step function), the integrals are calculated
    exactly as the sum of the areas of rectangles under the CDF curve.

    Args:
        cdf_df (pd.DataFrame):
            A DataFrame representing the Cumulative Distribution Function (CDF).
            The index must contain the unique ideological positions, and a column
            named 'cdf_value' must contain the corresponding CDF values. This
            is generated by the `preprocess_for_polarization` function.
        x_star (float):
            The central point of interest around which polarization is measured.
        boundaries (Dict[str, float]):
            A dictionary containing the theoretical minimum and maximum of the
            policy space, e.g., {'min': 0, 'max': 10}.

    Returns:
        float:
            The calculated polarization index P(F, x*), a value in [0, 1].

    Raises:
        ValueError: If x_star is not strictly within the policy space boundaries.
        KeyError: If 'boundaries' or 'cdf_df' have incorrect structure.
    """
    # --- Step 3: Handle Edge Cases and Numerical Stability (Input Validation) ---
    # Extract min and max boundaries for the policy space.
    min_X = boundaries['min']
    max_X = boundaries['max']

    # The polarization measure is mathematically undefined at the boundaries.
    # Ensure x_star is strictly within the interior of the policy space.
    if not (min_X < x_star < max_X):
        # Raise an error if x_star is at or outside the boundaries.
        raise ValueError(
            f"x_star ({x_star}) must be strictly between the boundaries "
            f"min ({min_X}) and max ({max_X})."
        )

    # --- Step 2: Implement Numerical Integration for Step Functions ---
    # Extract the unique positions and CDF values into NumPy arrays for performance.
    positions = cdf_df.index.values
    cdf_values = cdf_df['cdf_value'].values

    # To correctly calculate the area under the step function, we need to define
    # the intervals. We create a full set of points including the boundaries and x_star.
    # This assumes every observed position lies within [min_X, max_X]; a position
    # outside that range would extend the integration past the policy space.
    full_pos = np.union1d(positions, [min_X, x_star, max_X])

    # We need the CDF value at each point in our full set of positions.
    # We use np.searchsorted to efficiently find the CDF value for each point.
    # 'right' gives F(x), the probability of being <= x.
    # We find the index in the original `positions` for each point in `full_pos`.
    indices = np.searchsorted(positions, full_pos, side='right')
    # We create an array of corresponding CDF values. For points before the first
    # actual position, the CDF is 0.
    full_cdf_values = np.concatenate(([0], cdf_values))[indices]

    # Calculate the widths of the rectangles under the CDF curve.
    # This is the distance between each consecutive point in our full set.
    widths = np.diff(full_pos)

    # The heights of the rectangles are the CDF values at the start of each interval.
    heights = full_cdf_values[:-1]

    # The area of each rectangle is height * width.
    areas = heights * widths

    # Now, we sum these areas over the correct domains for integrals A and B.
    # Get the midpoints of our integration intervals.
    midpoints = (full_pos[:-1] + full_pos[1:]) / 2

    # Integral A: Sum of areas where the interval midpoint is less than x_star.
    # This corresponds to ∫_{min{X}}^{x*} F(x) dx
    integral_A = np.sum(areas[midpoints < x_star])

    # Integral B: Sum of areas where the interval midpoint is greater than or equal to x_star.
    # This corresponds to ∫_{x*}^{max{X}} F(x) dx
    integral_B = np.sum(areas[midpoints >= x_star])

    # --- Step 1: Implement Core Polarization Formula P(F,x*) ---
    # Calculate the first normalized term of the equation.
    term1 = integral_A / (x_star - min_X)

    # Calculate the second normalized term of the equation.
    term2 = integral_B / (max_X - x_star)

    # Combine the terms according to the formula from Definition 2.
    polarization_value = term1 - term2 + 1

    # --- Step 4: Validate Mathematical Properties of P(F,x*) ---
    # According to Proposition 2, the result must be in the range [0, 1].
    # This assertion acts as a final self-check on the calculation's correctness.
    # We use a small tolerance to account for potential floating-point inaccuracies.
    assert -1e-9 <= polarization_value <= 1 + 1e-9, \
        f"Calculation resulted in P={polarization_value}, which is outside the theoretical [0, 1] range."

    # Clip the result to the [0, 1] range to absorb any floating-point rounding error
    # that the tolerance in the assertion above let through.
    polarization_value = float(np.clip(polarization_value, 0, 1))

    # Return the final, validated polarization index.
    return polarization_value
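
The formula can be checked by hand on tiny distributions. The sketch below is a plain-Python restatement for a right-continuous step CDF, under the assumption that all support points lie inside the policy space. `step_cdf_polarization` is a hypothetical name for this illustration, not part of the project's API.

```python
from bisect import bisect_right
from typing import Sequence


def step_cdf_polarization(
    support: Sequence[float],
    cdf: Sequence[float],
    x_star: float,
    lo: float,
    hi: float,
) -> float:
    """Hypothetical restatement of P(F, x*) = A/(x* - lo) - B/(hi - x*) + 1."""
    if not lo < x_star < hi:
        raise ValueError("x_star must lie strictly inside (lo, hi).")

    def F(x: float) -> float:
        # F(x) is the cumulative weight of all support points <= x.
        k = bisect_right(support, x)
        return cdf[k - 1] if k else 0.0

    # Breakpoints: both boundaries, x*, and every support point strictly inside.
    grid = sorted({lo, hi, x_star, *(s for s in support if lo < s < hi)})

    area_a = area_b = 0.0
    for left, right in zip(grid, grid[1:]):
        # The CDF is constant on [left, right), so each piece is a rectangle.
        area = F(left) * (right - left)
        if right <= x_star:
            area_a += area  # contributes to the integral from lo to x*
        else:
            area_b += area  # contributes to the integral from x* to hi
    return area_a / (x_star - lo) - area_b / (hi - x_star) + 1.0


# Everyone sits exactly at x* = 5: no polarization.
print(step_cdf_polarization([5], [1.0], 5, 0, 10))  # 0.0
# Half the electorate at each extreme: maximal polarization.
print(step_cdf_polarization([0, 10], [0.5, 1.0], 5, 0, 10))  # 1.0
# Half at 2 and half at 8: roughly 0.6 (subject to floating-point rounding).
print(step_cdf_polarization([2, 8], [0.5, 1.0], 5, 0, 10))
```

The two extreme cases match the bounds of the theoretical [0, 1] range that the function above asserts.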


In [None]:
# Task 4: Compute Polarization Measures

def compute_all_polarization_measures(
    preprocessed_data: Dict[Any, Dict[str, Dict[str, Union[np.ndarray, pd.DataFrame]]]],
    central_points_params: Dict[str, List[float]],
    boundaries_params: Dict[str, Dict[str, float]]
) -> pd.DataFrame:
    """
    Computes the polarization index P(F, x*) for all years, scales, and central points.

    This function orchestrates the core analysis by systematically iterating through
    every combination of survey year, ideological scale, and specified central
    point (x*). It calls the `calculate_polarization_index` function for each
    combination and aggregates the results into a single, structured DataFrame.

    The output is a DataFrame with a MultiIndex ('year', 'scale', 'x_star'),
    which is the canonical data structure for the subsequent temporal and
    comparative analysis tasks.

    Args:
        preprocessed_data (Dict):
            The hierarchical data structure created by `preprocess_for_polarization`.
            Contains the pre-computed CDFs for each year and scale.
        central_points_params (Dict[str, List[float]]):
            A dictionary specifying the central points (x*) for analysis on each scale.
        boundaries_params (Dict[str, Dict[str, float]]):
            A dictionary defining the min and max boundaries of the policy space.

    Returns:
        pd.DataFrame:
            A DataFrame containing all computed polarization values, indexed by
            year, scale, and x_star. The columns are ['polarization_value'].

    Raises:
        ValueError: If no polarization value could be computed for any combination.
        KeyError: If a scale in the preprocessed data has no matching
                  '<scale>_x_stars' or '<scale>_boundaries' entry in the parameters.
    """
    # --- Step 1: Iterate Over All (year, scale, x*) Combinations ---
    # Initialize a list to store the results as dictionaries. This is an
    # efficient way to build up the data before converting to a DataFrame.
    results_list = []

    # Iterate through each year in the preprocessed data.
    for year, year_data in preprocessed_data.items():
        # Iterate through each ideological scale ('left_right', 'liberal_conservative').
        for scale, scale_data in year_data.items():
            # Determine the correct keys for the parameter dictionaries based on the scale.
            x_stars_key = f"{scale}_x_stars"
            boundaries_key = f"{scale}_boundaries"

            # Retrieve the list of x* values and the boundaries for the current scale.
            x_stars = central_points_params[x_stars_key]
            boundaries = boundaries_params[boundaries_key]

            # Retrieve the pre-computed CDF for this specific year and scale.
            cdf_df = scale_data['cdf']

            # Iterate through each central point (x*) for the current scale.
            for x_star in x_stars:
                try:
                    # Call the core function from Task 3 to compute the index.
                    polarization_value = calculate_polarization_index(
                        cdf_df=cdf_df,
                        x_star=x_star,
                        boundaries=boundaries
                    )

                    # Append the result as a dictionary to our list.
                    results_list.append({
                        'year': year,
                        'scale': scale,
                        'x_star': x_star,
                        'polarization_value': polarization_value
                    })
                except ValueError as e:
                    # If calculate_polarization_index raises a ValueError (e.g., x* is
                    # on a boundary), we issue a warning and skip that point.
                    # This makes the process robust to invalid x* points in the config.
                    warnings.warn(
                        f"Skipping calculation for {year}-{scale} at x*={x_star}: {e}",
                        UserWarning
                    )
                    continue

    # If no results were generated, it indicates a problem.
    if not results_list:
        raise ValueError("No polarization measures could be computed. Check input data and parameters.")

    # --- Step 2: Store Results in Structured Multi-Dimensional Format ---
    # Convert the list of dictionaries into a pandas DataFrame.
    # This is a highly efficient and standard way to create a DataFrame from collected data.
    results_df = pd.DataFrame(results_list)

    # Set the hierarchical MultiIndex for efficient slicing and analysis.
    # This is the optimal structure for this kind of multi-dimensional panel data.
    results_df.set_index(['year', 'scale', 'x_star'], inplace=True)

    # Sort the index for performance and clean presentation.
    results_df.sort_index(inplace=True)

    # --- Step 3: Implement Quality Control and Validation ---
    # Assert that there are no null values in the final results. This would
    # indicate a failure during computation that was not caught.
    assert results_df['polarization_value'].isnull().sum() == 0, \
        "Validation failed: Found NaN values in the final results DataFrame."

    # Assert that all computed values fall within the theoretical [0, 1] range.
    # This is a final sanity check on the entire batch of computations.
    assert results_df['polarization_value'].between(0, 1).all(), \
        "Validation failed: Polarization values found outside the theoretical [0, 1] range."

    # Return the final, structured, and validated DataFrame of polarization measures.
    return results_df
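
Because the index is undefined at the boundaries, any x* equal to a scale's minimum or maximum is skipped with a warning by the loop above. The sketch below shows the `<scale>_x_stars` and `<scale>_boundaries` key convention and which points would survive. The parameter values and the helper `usable_x_stars` are assumptions for illustration only.

```python
# Illustrative parameters; the specific values are assumptions for this example.
central_points_params = {
    "left_right_x_stars": [0, 5, 10],
    "liberal_conservative_x_stars": [2, 4, 6],
}
boundaries_params = {
    "left_right_boundaries": {"min": 0, "max": 10},
    "liberal_conservative_boundaries": {"min": 1, "max": 7},
}


def usable_x_stars(scale):
    """Hypothetical helper: x* values strictly inside the scale's policy space."""
    bounds = boundaries_params[f"{scale}_boundaries"]
    candidates = central_points_params[f"{scale}_x_stars"]
    return [x for x in candidates if bounds["min"] < x < bounds["max"]]


for scale in ("left_right", "liberal_conservative"):
    print(scale, usable_x_stars(scale))
# left_right [5]
# liberal_conservative [2, 4, 6]
```

Filtering the configuration up front, as here, is one way to avoid the warnings entirely.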


In [None]:
# Task 5: Temporal Analysis

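def _convert_year_to_numeric(
    year_series: Union[pd.Series, pd.Index]
) -> Union[pd.Series, pd.Index]:
    """Converts ANES year labels (a Series or an Index) to a numeric format for plotting."""
    # Replaces 'a' (pre-election) with .0 and 'b' (post-election) with .5,
    # e.g. '2004a' -> 2004.0 and '2004b' -> 2004.5. regex=False forces literal
    # matching so the behavior does not depend on the pandas version's default.
    return (
        year_series.astype(str)
        .str.replace('a', '.0', regex=False)
        .str.replace('b', '.5', regex=False)
        .astype(float)
    )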

def prepare_temporal_analysis(
    polarization_results: pd.DataFrame,
    midpoints: Dict[str, float]
) -> Dict[str, Dict[str, Dict[str, pd.DataFrame]]]:
    """
    Prepares polarization data for temporal analysis and visualization.

    This function takes the computed polarization measures and transforms them
    into structures optimized for time-series plotting and analysis. It replicates
    the analytical setup used in the source paper (e.g., Figure 1), which
    involves partitioning the data into left/liberal and right/conservative
    panels based on the position of the central point (x*) relative to a scale
    midpoint.

    The process is as follows:
    1.  Converts the 'year' index level to a numeric format to enable
        chronological plotting (e.g., '2004a' -> 2004.0, '2004b' -> 2004.5).
    2.  For each ideological scale ('left_right', 'liberal_conservative'):
        a.  Pivots the data into a "wide" format where the index is the
            numeric year and columns correspond to each x*.
        b.  Partitions this wide DataFrame into two: one for x* values at or
            below the midpoint ('left_panel') and one for x* values above
            the midpoint ('right_panel').
    3.  For each of these partitioned DataFrames, it also creates a "long"
        format version, which is the standard input for many modern plotting
        libraries like Seaborn.
    4.  Returns a nested dictionary containing all four partitioned DataFrames
        in both wide and long formats, ready for direct use in visualization.

    Args:
        polarization_results (pd.DataFrame):
            The MultiIndex DataFrame of polarization values from
            `compute_all_polarization_measures`.
        midpoints (Dict[str, float]):
            A dictionary defining the midpoint for each scale used for partitioning.
            Example: {'left_right': 5, 'liberal_conservative': 4}

    Returns:
        Dict[str, Dict[str, Dict[str, pd.DataFrame]]]:
            A nested dictionary structured as:
            {scale: {'left_panel' | 'right_panel': {'wide' | 'long': DataFrame}}}
            This contains all the necessary data views for temporal plotting.
            A panel key is omitted when no x* falls on that side of the midpoint.
    """
    # --- Input Validation ---
    if not isinstance(polarization_results, pd.DataFrame) or not isinstance(polarization_results.index, pd.MultiIndex):
        raise TypeError("`polarization_results` must be a DataFrame with a MultiIndex.")
    if not {'year', 'scale', 'x_star'}.issubset(polarization_results.index.names):
        raise ValueError("`polarization_results` must have index levels 'year', 'scale', and 'x_star'.")

    # --- Step 1: Organize Time Series Data ---
    # Create a working copy of the results DataFrame.
    df = polarization_results.copy()

    # Convert the 'year' level of the index to a numeric format for plotting.
    # This is a critical step for treating the time axis correctly.
    numeric_year_index = _convert_year_to_numeric(df.index.get_level_values('year'))
    # Replace the original categorical year index with the new numeric one.
    df.index = pd.MultiIndex.from_arrays(
        [numeric_year_index, df.index.get_level_values('scale'), df.index.get_level_values('x_star')],
        names=['year_numeric', 'scale', 'x_star']
    )

    # Initialize the final output dictionary.
    analysis_data = {}
    # Get the unique scales from the data, e.g., ['left_right', 'liberal_conservative'].
    scales = df.index.get_level_values('scale').unique()

    # Process each scale independently.
    for scale in scales:
        # Initialize the dictionary for the current scale.
        analysis_data[scale] = {}
        # Select the data for the current scale. The result is a Series
        # with a MultiIndex of ('year_numeric', 'x_star').
        scale_series = df.xs(scale, level='scale')['polarization_value']

        # Unstack the 'x_star' level to create a wide-format DataFrame.
        # The index will be 'year_numeric', and columns will be the x* values.
        # This structure is ideal for analyzing multiple time series together.
        wide_df = scale_series.unstack(level='x_star')

        # --- Step 2: Implement Left/Right Partitioning Logic ---
        # Retrieve the midpoint for the current scale from the parameters.
        midpoint = midpoints[scale]

        # Define the two panels based on the midpoint.
        panels = {
            'left_panel': wide_df.loc[:, wide_df.columns <= midpoint],
            'right_panel': wide_df.loc[:, wide_df.columns > midpoint]
        }

        # --- Step 3 & 4: Prepare Data Structures for Visualization ---
        # Process each of the two panels (left and right).
        for panel_name, panel_df_wide in panels.items():
            # If a panel is empty (e.g., no x* values > midpoint), skip it.
            if panel_df_wide.empty:
                continue

            # "Melt" the wide-format DataFrame to create a long-format version.
            # This is the standard "tidy" data format preferred by Seaborn.
            panel_df_long = panel_df_wide.reset_index().melt(
                id_vars='year_numeric',
                var_name='x_star',
                value_name='polarization_value'
            )

            # Store both the wide and long formats in the final output structure.
            # This provides maximum flexibility for different plotting/analysis needs.
            analysis_data[scale][panel_name] = {
                'wide': panel_df_wide,
                'long': panel_df_long
            }

    # Return the fully structured dictionary ready for visualization.
    return analysis_data
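
The reshaping steps above can be illustrated on a toy input. This is a minimal sketch, not the function itself: the polarization values, the years, and the midpoint of 5 are made-up assumptions chosen only to show the wide and long layouts.

```python
import pandas as pd

# Toy polarization values indexed by (year_numeric, x_star); numbers are invented.
toy = pd.Series(
    [0.10, 0.20, 0.30, 0.15, 0.25, 0.35],
    index=pd.MultiIndex.from_product(
        [[2000.0, 2004.0], [3, 5, 7]], names=["year_numeric", "x_star"]
    ),
    name="polarization_value",
)

# Wide format: one row per year, one column per x*.
wide = toy.unstack(level="x_star")

# Split the columns at a hypothetical midpoint; the midpoint column goes left.
midpoint = 5
left_wide = wide.loc[:, wide.columns <= midpoint]
right_wide = wide.loc[:, wide.columns > midpoint]

# Long ("tidy") format of the left panel: one row per (year, x*) pair.
left_long = left_wide.reset_index().melt(
    id_vars="year_numeric", var_name="x_star", value_name="polarization_value"
)
print(left_long)
```

The wide frame suits per-column line plots, while the long frame is the layout that Seaborn-style plotting calls typically expect.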


In [None]:
# Task 6: Election Year Analysis

def analyze_election_year_effects(
    polarization_results: pd.DataFrame,
    election_years: List[int]
) -> pd.DataFrame:
    """
    Analyzes the change in polarization before and after specified election years.

    This function isolates the short-term impact of an election by comparing
    polarization measures from pre-election ('a' wave) and post-election
    ('b' wave) surveys. It calculates the absolute change,
    ΔP = P_post - P_pre, for each ideological scale and central point (x*).

    The process is as follows:
    1.  Iterates through a list of specified base election years (e.g., 2004, 2016).
    2.  For each year, it extracts the pre-election (e.g., '2004a') and
        post-election (e.g., '2004b') data from the main results DataFrame.
    3.  It performs a vectorized subtraction to compute the change in polarization
        for every (scale, x*) pair.
    4.  The results (pre-election value, post-election value, and the change)
        are consolidated into a single, tidy DataFrame with a MultiIndex of
        (election_year, scale, x_star) for easy analysis and plotting.

    Args:
        polarization_results (pd.DataFrame):
            The MultiIndex DataFrame of polarization values from
            `compute_all_polarization_measures`. Must have 'year' as a level
            in the index.
        election_years (List[int]):
            A list of integer years to analyze (e.g., [2004, 2016]). The function
            will look for corresponding 'a' and 'b' waves.

    Returns:
        pd.DataFrame:
            A DataFrame containing the pre- and post-election polarization values
            and their difference. The index is ('election_year', 'scale', 'x_star').
            Columns are ['polarization_pre', 'polarization_post', 'polarization_change'].
            Returns an empty DataFrame if no valid election year pairs are found.
    """
    # --- Input Validation ---
    if not isinstance(polarization_results, pd.DataFrame) or not isinstance(polarization_results.index, pd.MultiIndex):
        raise TypeError("`polarization_results` must be a DataFrame with a MultiIndex.")
    if 'year' not in polarization_results.index.names:
        raise ValueError("`polarization_results` must have 'year' as an index level.")

    # A list to hold the results DataFrames for each election year.
    all_election_results = []

    # --- Step 1: Extract Pre- and Post-Election Data Pairs ---
    # Iterate through each specified base election year.
    for year_int in election_years:
        # Construct the string identifiers for the pre ('a') and post ('b') waves.
        pre_election_wave = f"{year_int}a"
        post_election_wave = f"{year_int}b"

        try:
            # Extract the data for the pre-election wave.
            # The .xs() method is efficient for selecting data from a specific
            # level of a MultiIndex.
            pre_df = polarization_results.xs(pre_election_wave, level='year')

            # Extract the data for the post-election wave.
            post_df = polarization_results.xs(post_election_wave, level='year')

            # --- Step 2: Compute Pre-Post Polarization Differences ---
            # Calculate the change: ΔP = P_post - P_pre.
            # Pandas' vectorized subtraction automatically aligns the indices
            # ('scale', 'x_star'), ensuring a correct element-wise calculation.
            diff_df = post_df - pre_df

            # --- Step 3 & 4: Consolidate and Structure Results ---
            # Rename columns for clarity in the final merged DataFrame.
            # Rebinding (rather than renaming in place) avoids mutating the
            # objects returned by .xs(), which may be views of the input.
            pre_df = pre_df.rename(columns={'polarization_value': 'polarization_pre'})
            post_df = post_df.rename(columns={'polarization_value': 'polarization_post'})
            diff_df = diff_df.rename(columns={'polarization_value': 'polarization_change'})

            # Concatenate the three DataFrames (pre, post, change) side-by-side.
            # The result is a single DataFrame for the current election year.
            year_results_df = pd.concat([pre_df, post_df, diff_df], axis=1)

            # Add the integer election year as a new index level for organization.
            year_results_df['election_year'] = year_int
            year_results_df.set_index('election_year', append=True, inplace=True)

            # Reorder index levels to the desired (election_year, scale, x_star) format.
            year_results_df = year_results_df.reorder_levels(['election_year', 'scale', 'x_star'])

            # Add the consolidated results for this year to our list.
            all_election_results.append(year_results_df)

        except KeyError:
            # If either the 'a' or 'b' wave is not found for a year, issue a
            # warning and skip that year. This makes the function robust.
            warnings.warn(
                f"Could not find both pre-election ('{pre_election_wave}') and "
                f"post-election ('{post_election_wave}') data. Skipping year {year_int}.",
                UserWarning
            )
            continue

    # If the list of results is empty, return an empty DataFrame.
    if not all_election_results:
        warnings.warn("No valid election year pairs were found in the data. Returning an empty DataFrame.")
        return pd.DataFrame()

    # Concatenate the results from all processed election years into a single DataFrame.
    final_results_df = pd.concat(all_election_results)

    # Sort the index for clean presentation and efficient lookups.
    final_results_df.sort_index(inplace=True)

    return final_results_df
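
As a rough illustration of the pre/post comparison, the sketch below reproduces the `.xs()` selection and the aligned subtraction on a toy frame. The wave labels follow the `'2004a'`/`'2004b'` convention described in the docstring; the polarization values are invented for the example.

```python
import pandas as pd

# Toy results with the (year, scale, x_star) index layout; values are invented.
idx = pd.MultiIndex.from_product(
    [["2004a", "2004b"], ["left_right"], [4, 5]],
    names=["year", "scale", "x_star"],
)
toy = pd.DataFrame({"polarization_value": [0.20, 0.30, 0.25, 0.28]}, index=idx)

# Selecting one wave drops the 'year' level, leaving (scale, x_star).
pre = toy.xs("2004a", level="year")
post = toy.xs("2004b", level="year")

# Subtraction aligns on (scale, x_star), so each x* is compared with itself.
change = post - pre

summary = pd.concat(
    [
        pre.rename(columns={"polarization_value": "polarization_pre"}),
        post.rename(columns={"polarization_value": "polarization_post"}),
        change.rename(columns={"polarization_value": "polarization_change"}),
    ],
    axis=1,
)
print(summary)
```

In this toy case polarization around x* = 4 rises by 0.05 across the election, while around x* = 5 it falls by 0.02.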


In [None]:
# Task 7: Implement Cleavage Point Finder Algorithm

def find_cleavage_points(
    polarization_results: pd.DataFrame,
    cleavage_finder_params: Dict[str, Any],
    zero_threshold: float = 1e-6
) -> pd.DataFrame:
    """
    Identifies ideological cleavage points by finding the x* with the maximum
    percentage increase in polarization over specified time periods.

    This function implements the "Cleavage Point Finder" algorithm, a form of
    grid search. For each specified time period (t1, t2) and ideological scale,
    it calculates the percentage change in P(F, x*) for all candidate x* values.
    It then identifies the x* at which this increase is maximized, pinpointing
    the primary ideological fault line for that period.

    The algorithm is as follows:
    1.  For each (t1, t2) period and scale, extract the polarization values.
    2.  Calculate the absolute change (P_t2 - P_t1) and percentage change
        (100 * (P_t2 - P_t1) / P_t1).
    3.  To ensure numerical stability, the percentage change is only calculated
        if the initial polarization P_t1 is above a small threshold.
    4.  A deterministic tie-breaking rule is used: candidates are sorted first
        by percentage change (descending), then by absolute change (descending).
        The top result is selected as the cleavage point.
    5.  Results are compiled into a summary DataFrame.

    Args:
        polarization_results (pd.DataFrame):
            The MultiIndex DataFrame of polarization values from
            `compute_all_polarization_measures`.
        cleavage_finder_params (Dict[str, Any]):
            A dictionary specifying the time periods and candidate x* values.
            Expected keys: 'time_points' (List[Tuple]) and 'potential_x_stars' (Dict).
        zero_threshold (float, optional):
            The value below which an initial polarization P_t1 is considered
            too small for a meaningful percentage change calculation. Defaults to 1e-6.

    Returns:
        pd.DataFrame:
            A DataFrame summarizing the analysis, indexed by a string representation
            of the time period and the scale. Columns include the identified
            'cleavage_point' (the winning x*), 'max_perc_change', and the
            'abs_change_at_cleavage'.
    """
    # --- Input Validation ---
    if not isinstance(polarization_results, pd.DataFrame) or not isinstance(polarization_results.index, pd.MultiIndex):
        raise TypeError("`polarization_results` must be a DataFrame with a MultiIndex.")
    if not {'time_points', 'potential_x_stars'}.issubset(cleavage_finder_params.keys()):
        raise ValueError("`cleavage_finder_params` is missing required keys.")

    # A list to store the final summary results for each search.
    summary_results = []

    # --- Step 1: Implement Core Grid Search Algorithm Structure ---
    # Iterate through each specified time period tuple (t1, t2).
    for t1, t2 in cleavage_finder_params['time_points']:
        # Iterate through each ideological scale defined in the parameters.
        for scale, x_stars in cleavage_finder_params['potential_x_stars'].items():
            try:
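                # Extract polarization values for the start (t1) and end (t2) of the period.
                # We select for the specific scale and the set of candidate x* values.
                # Selecting with a list of x* values keeps all three index levels, so the
                # first two (year, scale) are dropped to leave 'x_star' as the only level.
                # Without this, p_t1 and p_t2 carry different year labels and do not align.
                p_t1 = polarization_results.loc[(t1, scale, x_stars), 'polarization_value'].droplevel([0, 1])
                p_t2 = polarization_results.loc[(t2, scale, x_stars), 'polarization_value'].droplevel([0, 1])

                # Combine into a single DataFrame for vectorized calculations.
                # The index is 'x_star'.
                changes_df = pd.DataFrame({'p_t1': p_t1, 'p_t2': p_t2})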

                # Calculate the absolute change in polarization.
                changes_df['abs_change'] = changes_df['p_t2'] - changes_df['p_t1']

                # --- Step 2: Implement Robust Percentage Change Calculation ---
                # To avoid division by zero or near-zero, we only calculate percentage
                # change where the initial value p_t1 is above a threshold.
                # Equation: 100 * (P_t2 - P_t1) / P_t1
                changes_df['perc_change'] = np.where(
                    changes_df['p_t1'] > zero_threshold,
                    100 * changes_df['abs_change'] / changes_df['p_t1'],
                    np.nan  # Assign NaN where p_t1 is too small for meaningful calculation.
                )

                # Drop any rows where percentage change could not be calculated.
                valid_changes = changes_df.dropna(subset=['perc_change'])

                if valid_changes.empty:
                    warnings.warn(f"No valid data to find cleavage point for {scale} in period {t1}-{t2}.")
                    continue

                # --- Step 3: Implement Optimization and Tie-Breaking Logic ---
                # Sort to find the best candidate.
                # Primary sort key: percentage change (descending).
                # Secondary sort key (tie-breaker): absolute change (descending).
                sorted_candidates = valid_changes.sort_values(
                    by=['perc_change', 'abs_change'], ascending=[False, False]
                )

                # The winning cleavage point is the index of the top row after sorting.
                best_candidate = sorted_candidates.iloc[0]
                cleavage_point = best_candidate.name

                # --- Step 4: Store and Validate Algorithm Results ---
                # Append the summary of this search to our results list.
                summary_results.append({
                    'time_period': f"{t1}-{t2}",
                    'scale': scale,
                    'cleavage_point': cleavage_point,
                    'max_perc_change': best_candidate['perc_change'],
                    'abs_change_at_cleavage': best_candidate['abs_change']
                })

            except KeyError:
                warnings.warn(
                    f"Data not available for period {t1}-{t2} and scale '{scale}'. Skipping.",
                    UserWarning
                )
                continue

    # If no results were generated, return an empty DataFrame.
    if not summary_results:
        warnings.warn("Cleavage point analysis yielded no results.")
        return pd.DataFrame()

    # Convert the list of dictionaries to a final summary DataFrame.
    summary_df = pd.DataFrame(summary_results)
    # Set a clean, descriptive index.
    summary_df.set_index(['time_period', 'scale'], inplace=True)

    return summary_df
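
The grid search and its tie-breaking rule can be sketched on a small table. This is an illustrative example under stated assumptions: the candidate x* values and polarization levels are invented, and the numbers are chosen so that two candidates tie on percentage change and one has a zero starting value.

```python
import pandas as pd

# Toy start/end polarization for four candidate x* values (invented numbers).
changes = pd.DataFrame(
    {"p_t1": [0.25, 0.5, 0.0, 0.5], "p_t2": [0.375, 0.75, 0.125, 0.5625]},
    index=pd.Index([2, 4, 6, 8], name="x_star"),
)
changes["abs_change"] = changes["p_t2"] - changes["p_t1"]

# Percentage change is kept only where the starting value exceeds the threshold.
threshold = 1e-6
changes["perc_change"] = (
    100 * changes["abs_change"] / changes["p_t1"]
).where(changes["p_t1"] > threshold)

# Rank by percentage change, then by absolute change to break ties.
valid = changes.dropna(subset=["perc_change"])
ranked = valid.sort_values(
    by=["perc_change", "abs_change"], ascending=[False, False]
)
cleavage_point = ranked.index[0]
print(ranked)
print("cleavage point:", cleavage_point)
```

Here x* = 2 and x* = 4 both show a 50% increase, and the larger absolute change at x* = 4 decides the tie. The candidate x* = 6 is excluded because its starting value is zero.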


In [None]:
# Task 8: Identify Cleavage Points

def identify_and_analyze_cleavage_points(
    polarization_results: pd.DataFrame,
    cleavage_finder_params: Dict[str, Any],
    midpoints: Dict[str, float],
    zero_threshold: float = 1e-6
) -> pd.DataFrame:
    """
    Applies the cleavage finder algorithm and enriches the results with
    contextual analysis of the cleavage points' location and significance.

    This function serves as the primary application layer for Task 8. It first
    calls the `find_cleavage_points` algorithm (from Task 7) to get the raw
    results. It then performs further analysis on these results to provide
    deeper insights, addressing the following:
    1.  **Spatial Analysis:** It determines where each cleavage point lies
        relative to its scale's midpoint (e.g., 'Left of Center').
    2.  **Significance Heuristic:** It calculates the percentile rank of the
        cleavage point's percentage increase relative to the increases at all
        candidate points. Because the cleavage point is by construction the
        maximizer, this rank (computed with `kind='weak'`) equals 100 whenever
        it can be computed, so it acts as a consistency check rather than a
        statistical test of prominence.

    Args:
        polarization_results (pd.DataFrame):
            The MultiIndex DataFrame of polarization values.
        cleavage_finder_params (Dict[str, Any]):
            Parameters specifying time periods and candidate x* values.
        midpoints (Dict[str, float]):
            A dictionary defining the midpoint for each scale.
        zero_threshold (float, optional):
            Threshold for meaningful percentage change calculation.

    Returns:
        pd.DataFrame:
            An enriched DataFrame, indexed by ('time_period', 'scale'), with
            detailed analysis of each identified cleavage point, including its
            location, relative position, and percentile rank of its change.
    """
    # --- Step 1: Apply Cleavage Finder to Specified Time Periods ---
    # Call the core algorithm from Task 7 to get the initial results.
    cleavage_summary_df = find_cleavage_points(
        polarization_results=polarization_results,
        cleavage_finder_params=cleavage_finder_params,
        zero_threshold=zero_threshold
    )

    # If the core algorithm returns no results, exit gracefully.
    if cleavage_summary_df.empty:
        warnings.warn("Initial cleavage point search yielded no results. Cannot perform analysis.")
        return pd.DataFrame()

    # --- Step 2: Analyze Spatial Distribution of Cleavage Points ---
    # Add the midpoint for each scale to the summary DataFrame for context.
    cleavage_summary_df['scale_midpoint'] = cleavage_summary_df.index.get_level_values('scale').map(midpoints)

    # Define a function to categorize the cleavage point's position.
    def get_relative_position(row: pd.Series) -> str:
        # Use np.sign to determine if the point is left, right, or at the center.
        sign = np.sign(row['cleavage_point'] - row['scale_midpoint'])
        if sign < 0:
            return 'Left of Center'
        elif sign > 0:
            return 'Right of Center'
        else:
            return 'Center'

    # Apply the function to create a new 'relative_position' column.
    cleavage_summary_df['relative_position'] = cleavage_summary_df.apply(get_relative_position, axis=1)

    # --- Step 3 & 4: Validate Significance and Interpret in Context ---
    # To calculate the percentile, we need the distribution of *all* changes for each period.
    # We must re-run the change calculation part of the algorithm.
    # Recover the original (t1, t2) keys from the parameters instead of parsing the
    # period label, which would fail for wave labels such as '2004a'.
    period_lookup = {
        f"{t1}-{t2}": (t1, t2) for t1, t2 in cleavage_finder_params['time_points']
    }
    percentiles = []
    for (time_period, scale), row in cleavage_summary_df.iterrows():
        t1, t2 = period_lookup[time_period]

        x_stars = cleavage_finder_params['potential_x_stars'][scale]

        # Re-calculate the changes for this specific period-scale combination.
        # Drop the (year, scale) levels so both Series are indexed by x* and align.
        p_t1 = polarization_results.loc[(t1, scale, x_stars), 'polarization_value'].droplevel([0, 1])
        p_t2 = polarization_results.loc[(t2, scale, x_stars), 'polarization_value'].droplevel([0, 1])
        changes_df = pd.DataFrame({'p_t1': p_t1, 'p_t2': p_t2})
        changes_df['perc_change'] = np.where(
            changes_df['p_t1'] > zero_threshold,
            100 * (changes_df['p_t2'] - changes_df['p_t1']) / changes_df['p_t1'],
            np.nan
        )

        # Get the full distribution of valid percentage changes.
        all_changes = changes_df['perc_change'].dropna().values
        # Get the winning percentage change for the identified cleavage point.
        winning_change = row['max_perc_change']

        # Calculate the percentile rank of the winning change within its distribution.
        # A value of 100 means it was the highest or tied for highest.
        if len(all_changes) > 0:
            percentile = percentileofscore(all_changes, winning_change, kind='weak')
        else:
            percentile = np.nan # Should not happen if cleavage was found.

        percentiles.append(percentile)

    # Add the calculated percentile rank as a new column. Since the cleavage
    # point is the maximizer, this mainly confirms that the winner tops its
    # own distribution of percentage changes.
    cleavage_summary_df['perc_change_percentile'] = percentiles

    # Return the final, enriched DataFrame.
    return cleavage_summary_df
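
To make the percentile rank concrete, the sketch below computes it by hand with NumPy instead of calling SciPy. It assumes the usual definitions: the "weak" rank is the share of values less than or equal to the score, and the "strict" rank is the share strictly below it. The percentage changes are invented, and the strict variant is shown only for contrast; it is not part of the function above.

```python
import numpy as np

# Invented percentage changes for four candidate x* values, with a tie at the top.
perc_changes = np.array([12.5, 50.0, 30.0, 50.0])
winner = perc_changes.max()

# "Weak" rank: share of candidates at or below the winner.
weak_rank = 100.0 * np.mean(perc_changes <= winner)

# "Strict" rank: share of candidates strictly below the winner.
strict_rank = 100.0 * np.mean(perc_changes < winner)

print(weak_rank, strict_rank)
```

For the maximizer the weak rank is 100 by construction, while the strict rank drops when other candidates tie with it (here to 50).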


In [None]:
# Task 9: Compute Traditional Polarization Measures

def _weighted_descriptive_stats(
    values: pd.Series,
    weights: pd.Series
) -> pd.Series:
    """
    Calculates weighted mean, variance, and standard deviation for a Series.

    Args:
        values (pd.Series): A Series of data points (e.g., ideological positions).
        weights (pd.Series): A Series of corresponding survey weights.

    Returns:
        pd.Series: A Series containing the weighted 'mean', 'variance', and 'std_dev'.
    """
    # Calculate the weighted average (mean).
    # μ_w = Σ(w_i * x_i) / Σ(w_i)
    weighted_mean = np.average(values, weights=weights)

    # Calculate the weighted variance.
    # Var_w = Σ(w_i * (x_i - μ_w)²) / Σ(w_i)
    # Note: The denominator is the sum of weights, not N-1.
    weighted_variance = np.average((values - weighted_mean)**2, weights=weights)

    # Weighted standard deviation is the square root of the variance.
    weighted_std_dev = np.sqrt(weighted_variance)

    # Return the results as a pandas Series for easy integration.
    return pd.Series({
        'mean': weighted_mean,
        'variance': weighted_variance,
        'std_dev': weighted_std_dev
    })

def compute_traditional_measures(
    cleaned_df: pd.DataFrame,
    centrist_definitions: Dict[str, List[int]]
) -> pd.DataFrame:
    """
    Computes traditional polarization measures for comparison.

    This function calculates established, mean-centric and distribution-based
    measures of polarization to serve as a baseline for evaluating the insights
    from the flexible P(F, x*) index.

    The computed measures for each year and ideological scale are:
    1.  **Weighted Mean, Variance, and Standard Deviation:** These measure the
        central tendency and dispersion of the entire distribution. Variance is a
        classic, widely used polarization metric.
    2.  **Share of Centrist Voters:** This measures the "hollowing out" of the
        ideological middle by calculating the weighted proportion of respondents
        who place themselves in a predefined centrist range.

    Args:
        cleaned_df (pd.DataFrame):
            A DataFrame processed by `clean_anes_data`, containing columns 'year',
            'weight', 'left_right', and 'liberal_conservative'.
        centrist_definitions (Dict[str, List[int]]):
            A dictionary defining the range of values considered "centrist" for
            each scale. Example: {'left_right': [4, 5, 6], ...}

    Returns:
        pd.DataFrame:
            A DataFrame indexed by ('year', 'scale') containing the computed
            traditional measures as columns.
    """
    # --- Input Validation ---
    if cleaned_df.empty:
        raise ValueError("Input 'cleaned_df' cannot be empty.")
    if not isinstance(centrist_definitions, dict):
        raise TypeError("`centrist_definitions` must be a dictionary.")

    # A list to store the results DataFrames from each scale.
    all_measures = []
    # Define the ideological scales to process.
    scales = ['left_right', 'liberal_conservative']

    for scale in scales:
        # Create a working DataFrame for the current scale, dropping any
        # potential NaNs in the scale or weight columns.
        scale_df = cleaned_df[['year', scale, 'weight']].dropna()

        # --- Step 1 & 3: Calculate Weighted Descriptive Statistics ---
        # Group by year and apply the weighted stats helper function.
        # This computes mean, variance, and std_dev for each year.
        desc_stats = scale_df.groupby('year').apply(
            lambda g: _weighted_descriptive_stats(g[scale], g['weight'])
        )

        # --- Step 2: Calculate Centrist Voter Shares ---
        # Define a helper function to calculate the weighted share of centrists.
        def get_centrist_share(group: pd.DataFrame) -> float:
            # Filter the group to include only respondents in the centrist range.
            centrist_mask = group[scale].isin(centrist_definitions[scale])
            # Sum the weights of the centrist respondents.
            centrist_weight_sum = group.loc[centrist_mask, 'weight'].sum()
            # Sum the weights of all respondents in the group.
            total_weight_sum = group['weight'].sum()
            # The share is the ratio of the two sums.
            return centrist_weight_sum / total_weight_sum if total_weight_sum > 0 else 0.0

        # Group by year and apply the centrist share function.
        centrist_share = scale_df.groupby('year').apply(get_centrist_share)
        # Convert the resulting Series to a DataFrame with a descriptive column name.
        centrist_share_df = centrist_share.to_frame(name='centrist_share')

        # --- Step 4: Organize Measures for Comparative Analysis ---
        # Join the descriptive statistics and the centrist share results.
        # The 'year' index aligns the two DataFrames perfectly.
        combined_measures = desc_stats.join(centrist_share_df)

        # Add a 'scale' column to identify which scale these measures belong to.
        combined_measures['scale'] = scale

        # Append the combined results for this scale to our list.
        all_measures.append(combined_measures)

    # Concatenate the results from all scales into a single DataFrame.
    final_df = pd.concat(all_measures)
    # Set a clean MultiIndex of ('year', 'scale').
    final_df.set_index('scale', append=True, inplace=True)
    final_df = final_df.reorder_levels(['year', 'scale'])

    # Sort the index for clean presentation and efficient lookups.
    final_df.sort_index(inplace=True)

    return final_df
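
The weighted statistics and the centrist share reduce to a few NumPy calls. The following is a small worked example with invented positions and weights, and a hypothetical centrist range of 4 to 6.

```python
import numpy as np

# Three respondents on a 0-10 scale; the middle one carries double weight.
positions = np.array([1.0, 5.0, 9.0])
weights = np.array([1.0, 2.0, 1.0])

# Weighted mean: (1*1 + 2*5 + 1*9) / 4 = 5.0
w_mean = np.average(positions, weights=weights)

# Weighted (population-style) variance: (1*16 + 2*0 + 1*16) / 4 = 8.0
w_var = np.average((positions - w_mean) ** 2, weights=weights)
w_std = np.sqrt(w_var)

# Centrist share: weight of respondents in the centrist range over total weight.
centrist_values = [4, 5, 6]
is_centrist = np.isin(positions, centrist_values)
centrist_share = weights[is_centrist].sum() / weights.sum()

print(w_mean, w_var, w_std, centrist_share)
```

The denominator of the variance is the sum of weights, so no degrees-of-freedom correction is applied.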


In [None]:
# Task 10: Comparative Analysis

def create_comparative_analysis_framework(
    polarization_results: pd.DataFrame,
    traditional_measures: pd.DataFrame
) -> Dict[str, pd.DataFrame]:
    """
    Creates a unified framework for comparing the flexible polarization index
    against traditional measures.

    This function is central to demonstrating the value of the P(F, x*) index.
    It integrates the results from the flexible measure with traditional metrics
    (like variance and centrist share) into a single, analysis-ready DataFrame.
    It then performs a systematic correlation analysis to quantitatively identify
    where the different measures tell similar stories and, more importantly, where
    they diverge, thus highlighting the unique insights offered by the new index.

    The process is as follows:
    1.  Pivots the `polarization_results` DataFrame to a wide format where each
        x* has its own column.
    2.  Joins this wide DataFrame with the `traditional_measures` DataFrame,
        aligning them by 'year' and 'scale'. This creates the master
        comparative framework.
    3.  For each scale, it computes the temporal correlation matrix between all
        traditional measures and all P(F, x*) measures.
    4.  From this matrix, it extracts a summary identifying, for each traditional
        measure, which x* is most and least correlated with it.
    5.  Returns a dictionary containing both the master unified DataFrame and
        the concise correlation summary.

    Args:
        polarization_results (pd.DataFrame):
            The MultiIndex DataFrame of P(F, x*) values from Task 4.
        traditional_measures (pd.DataFrame):
            The DataFrame of traditional measures from Task 9.

    Returns:
        Dict[str, pd.DataFrame]:
            A dictionary with two keys:
            'unified_data': The master DataFrame containing all measures.
            'correlation_summary': A DataFrame summarizing the correlation
                                   analysis.
    """
    # --- Input Validation ---
    if not all(isinstance(df, pd.DataFrame) for df in [polarization_results, traditional_measures]):
        raise TypeError("Both inputs must be pandas DataFrames.")
    if polarization_results.empty or traditional_measures.empty:
        raise ValueError("Input DataFrames cannot be empty.")

    # --- Step 1: Create Integrated Comparison Framework ---
    # Unstack the 'x_star' level to pivot the P(F, x*) results into a wide format.
    # The index will be ('year', 'scale'), and columns will be the x* values.
    p_results_wide = polarization_results['polarization_value'].unstack(level='x_star')
    # Add a prefix to the column names to clearly identify them as P(F, x*) measures.
    p_results_wide.columns = [f"p_xstar_{col}" for col in p_results_wide.columns]

    # Join the wide-format P(F, x*) results with the traditional measures.
    # The join is performed on the shared ('year', 'scale') index, ensuring perfect alignment.
    unified_df = traditional_measures.join(p_results_wide)

    # Drop any rows with NaNs that might result from an incomplete join, though
    # this should not happen with correctly generated inputs.
    unified_df.dropna(inplace=True)

    if unified_df.empty:
        raise ValueError("The join between polarization results and traditional measures yielded an empty DataFrame.")

    # --- Step 2 & 3: Identify Discrepancies and Quantify Value-Added ---
    # Initialize a list to store the correlation summary results.
    correlation_analysis_results = []
    # Get the unique scales from the unified DataFrame's index.
    scales = unified_df.index.get_level_values('scale').unique()
    # Get the column names for the traditional measures.
    trad_measure_cols = traditional_measures.columns.tolist()
    # Get the column names for the P(F, x*) measures.
    p_measure_cols = p_results_wide.columns.tolist()

    # Perform the correlation analysis separately for each scale.
    for scale in scales:
        # Select the data for the current scale.
        scale_df = unified_df.xs(scale, level='scale')

        # Compute the full Pearson correlation matrix for all measures over time.
        corr_matrix = scale_df.corr(method='pearson')

        # We are interested in the correlations between traditional and P(F,x*) measures.
        # Select the relevant block of the correlation matrix.
        cross_corr = corr_matrix.loc[trad_measure_cols, p_measure_cols]

        # For each traditional measure, find the P(F,x*) it correlates most/least with.
        for trad_measure in trad_measure_cols:
            # Get the correlation series for the current traditional measure.
            corr_series = cross_corr.loc[trad_measure]

            # Find the x* with the maximum correlation.
            max_corr_col = corr_series.idxmax()
            max_corr_val = corr_series.max()

            # Find the x* with the minimum correlation.
            min_corr_col = corr_series.idxmin()
            min_corr_val = corr_series.min()

            # Store the results. We extract the numeric x* value from the column name.
            correlation_analysis_results.append({
                'scale': scale,
                'traditional_measure': trad_measure,
                'most_correlated_x_star': int(max_corr_col.split('_')[-1]),
                'correlation_max': max_corr_val,
                'least_correlated_x_star': int(min_corr_col.split('_')[-1]),
                'correlation_min': min_corr_val
            })

    # Convert the list of results into a structured DataFrame.
    correlation_summary_df = pd.DataFrame(correlation_analysis_results)
    # Set a clean index for the summary table.
    correlation_summary_df.set_index(['scale', 'traditional_measure'], inplace=True)

    # --- Step 4: Synthesize and Return the Framework ---
    # The final output is a dictionary containing both the comprehensive
    # unified dataset and the high-level correlation summary.
    return {
        'unified_data': unified_df,
        'correlation_summary': correlation_summary_df
    }
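
The cross-correlation step can be illustrated with one traditional measure and two P(F, x*) columns. This is a sketch on invented data: the column names follow the `p_xstar_` prefix used above, and the series are built so that one rises and one falls with variance.

```python
import pandas as pd

# Toy yearly data for a single scale (invented numbers).
unified = pd.DataFrame(
    {
        "variance": [1.0, 2.0, 3.0, 4.0],
        "p_xstar_3": [0.1, 0.2, 0.3, 0.4],  # moves with variance
        "p_xstar_7": [0.4, 0.3, 0.2, 0.1],  # moves against variance
    },
    index=pd.Index([2000, 2004, 2008, 2012], name="year"),
)

# Full Pearson correlation matrix, then the traditional-vs-P(F, x*) block.
corr = unified.corr(method="pearson")
cross = corr.loc[["variance"], ["p_xstar_3", "p_xstar_7"]]

# Columns with the highest and lowest correlation to variance.
most = cross.loc["variance"].idxmax()
least = cross.loc["variance"].idxmin()
print(cross)
print(most, least)
```

Note that "least correlated" here means the minimum signed correlation, so a strongly negative value is selected ahead of a value near zero.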


In [None]:
# Task 11: Theoretical Extensions

def calculate_affective_polarization(
    positions: np.ndarray,
    weights: np.ndarray,
    x_star: float,
    g_func: Callable[[np.ndarray], np.ndarray]
) -> float:
    """
    Calculates the overall affective polarization of an electorate.

    This function implements the model from Proposition 3, where affective
    polarization is the average animosity felt by members of one group towards
    the opposing group. Animosity is an increasing function `g` of the
    ideological distance to the mean of the other group.

    The formula for the total weighted animosity is:
    A(F) = Σ_{i: x_i < x*} w_i*g(|x_i-m_R|) + Σ_{i: x_i >= x*} w_i*g(|x_i-m_L|)

    Args:
        positions (np.ndarray): Array of ideological positions for all voters.
        weights (np.ndarray): Array of corresponding survey weights, normalized
                              to sum to 1 so that the weighted sum below is a
                              weighted average.
        x_star (float): The ideological point that divides the electorate into
                        a 'left' group (positions < x_star) and a 'right' group.
        g_func (Callable[[np.ndarray], np.ndarray]):
            A vectorized function representing animosity, g(distance). It should
            take an array of distances and return an array of animosities.

    Returns:
        float: The aggregate affective polarization score for the electorate.
    """
    # --- Step 1: Partition Electorate into Left and Right Groups ---
    # Create boolean masks to identify members of the left and right groups.
    left_mask = positions < x_star
    right_mask = positions >= x_star

    # If either group is empty, affective polarization is undefined.
    if not np.any(left_mask) or not np.any(right_mask):
        return np.nan

    # --- Step 2: Calculate Group Mean Positions ---
    # Calculate the weighted mean position of the left group (m_L).
    m_L = np.average(positions[left_mask], weights=weights[left_mask])
    # Calculate the weighted mean position of the right group (m_R).
    m_R = np.average(positions[right_mask], weights=weights[right_mask])

    # --- Step 3: Calculate Animosity for Each Group ---
    # Animosity of the left group towards the right group.
    # This is a weighted sum of the animosity of each left-group member.
    # Distance for each left-group member is |x_i - m_R|.
    distances_L_to_R = np.abs(positions[left_mask] - m_R)
    animosity_L = np.sum(weights[left_mask] * g_func(distances_L_to_R))

    # Animosity of the right group towards the left group.
    # Distance for each right-group member is |x_i - m_L|.
    distances_R_to_L = np.abs(positions[right_mask] - m_L)
    animosity_R = np.sum(weights[right_mask] * g_func(distances_R_to_L))

    # The total affective polarization is the sum of the animosities from both sides.
    total_affective_polarization = animosity_L + animosity_R

    return total_affective_polarization
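
As a quick check of the formula in the docstring above, the following self-contained sketch recomputes A(F) by hand for a four-voter electorate. The helper name `affective_polarization_sketch`, the toy positions, the equal weights, and both animosity functions are assumptions made for this example only. The helper mirrors the logic of `calculate_affective_polarization` but is not a replacement for it.

```python
import numpy as np


def affective_polarization_sketch(positions, weights, x_star, g_func):
    """Hypothetical stand-alone mirror of the weighted-animosity formula."""
    left = positions < x_star
    right = positions >= x_star
    # With an empty group there is no opposing mean, so the score is undefined.
    if not left.any() or not right.any():
        return float("nan")
    # Weighted mean position of each group.
    m_left = np.average(positions[left], weights=weights[left])
    m_right = np.average(positions[right], weights=weights[right])
    # Each voter's animosity depends on the distance to the other group's mean.
    animosity_left = np.sum(weights[left] * g_func(np.abs(positions[left] - m_right)))
    animosity_right = np.sum(weights[right] * g_func(np.abs(positions[right] - m_left)))
    return float(animosity_left + animosity_right)


# Toy electorate: two voters on each side of x* = 4, equal normalized weights.
positions = np.array([1.0, 3.0, 5.0, 7.0])
weights = np.full(4, 0.25)

# Group means are m_L = 2 and m_R = 6, so the distances are 5, 3, 3, 5.
linear_score = affective_polarization_sketch(positions, weights, 4.0, lambda d: d)
quadratic_score = affective_polarization_sketch(positions, weights, 4.0, lambda d: d**2)
print(linear_score, quadratic_score)  # 4.0 17.0
```

With the linear `g`, the score is the mean distance to the opposing group's mean (4.0). The quadratic `g` penalizes the two most distant voters more heavily, which raises the score to 17.0.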


def simulate_issue_salience_effect(
    salience_alphas: List[float],
    common_value_dist: Dict[str, Any],
    divisive_issue_dist: Dict[str, Any],
    polarization_params: Dict[str, Any],
    n_voters: int = 10000,
    random_seed: Optional[int] = None
) -> pd.DataFrame:
    """
    Simulates the effect of issue salience on ideological polarization.

    This function implements the model from Proposition 4, where a voter's
    ideology `x` is a weighted average of a common-value position `c` and a
    divisive position `d`: `x = (1-α)c + αd`. The parameter `α` represents
    the salience of the divisive issue.

    The simulation is intended to illustrate the proposition's prediction that,
    as `α` increases, the overall ideological polarization P(F, x*) of the
    electorate also increases.

    Args:
        salience_alphas (List[float]): A list of salience parameters (α) to test.
        common_value_dist (Dict[str, Any]): Parameters for the common-value
            distribution. E.g., {'low': 4.8, 'high': 5.2} for Uniform.
        divisive_issue_dist (Dict[str, Any]): Parameters for the divisive
            issue distribution. E.g., {'low': 0, 'high': 10} for Uniform.
        polarization_params (Dict[str, Any]): Parameters needed for the
            P(F, x*) calculation, including 'x_star' and 'boundaries'.
        n_voters (int, optional): The number of voters in the simulation.
        random_seed (Optional[int], optional): Seed for the random number
            generator to ensure reproducibility.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'alpha' showing the resulting
                      'polarization_value' for each salience level.
    """
    # --- Step 1: Setup Simulation Framework ---
    # Set the random seed for reproducibility.
    rng = np.random.default_rng(random_seed)

    # Generate the underlying positions for all voters once.
    # c ~ Uniform(low, high)
    c_positions = rng.uniform(
        low=common_value_dist['low'],
        high=common_value_dist['high'],
        size=n_voters
    )
    # d ~ Uniform(low, high)
    d_positions = rng.uniform(
        low=divisive_issue_dist['low'],
        high=divisive_issue_dist['high'],
        size=n_voters
    )

    results = []
    # Iterate through each specified salience level alpha.
    for alpha in salience_alphas:
        # --- Step 2: Generate Ideological Positions ---
        # Calculate the final ideological positions based on the model.
        # x = (1-α)c + αd
        x_positions = (1 - alpha) * c_positions + alpha * d_positions

        # --- Step 3: Calculate Polarization of the Simulated Electorate ---
        # To calculate P(F, x*), we first need the empirical CDF of x.
        # For a simulation, all weights are equal (1/n_voters).
        # First, create the PMF by getting unique values and their counts.
        unique_pos, counts = np.unique(x_positions, return_counts=True)
        pmf_df = pd.DataFrame({'position': unique_pos, 'pmf_value': counts / n_voters})

        # Then, create the CDF from the PMF.
        pmf_df['cdf_value'] = pmf_df['pmf_value'].cumsum()
        cdf_df = pmf_df[['position', 'cdf_value']].set_index('position')

        # Now, calculate the polarization index using the function from Task 3.
        polarization_value = calculate_polarization_index(
            cdf_df=cdf_df,
            x_star=polarization_params['x_star'],
            boundaries=polarization_params['boundaries']
        )

        results.append({'alpha': alpha, 'polarization_value': polarization_value})

    # --- Step 4: Return Structured Results ---
    # Convert the list of results into a DataFrame, indexed by alpha.
    results_df = pd.DataFrame(results).set_index('alpha')
    return results_df
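
The mixing step `x = (1-α)c + αd` can be explored without the rest of the pipeline. The sketch below uses the Uniform parameters given as examples in the docstring above. Because `calculate_polarization_index` is defined elsewhere, it substitutes the mean absolute distance from `x_star` as a stand-in dispersion statistic. That statistic, the choice `x_star = 5.0`, and the helper name `mean_abs_distance` are assumptions for illustration and are not the P(F, x*) measure itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voters = 10_000

# Common-value positions are tightly clustered; divisive positions are spread out.
c_positions = rng.uniform(low=4.8, high=5.2, size=n_voters)
d_positions = rng.uniform(low=0.0, high=10.0, size=n_voters)
x_star = 5.0  # Assumed central point for this toy example.


def mean_abs_distance(alpha: float) -> float:
    """Stand-in dispersion statistic: average |x - x*| at salience alpha."""
    x_positions = (1 - alpha) * c_positions + alpha * d_positions
    return float(np.mean(np.abs(x_positions - x_star)))


alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
dispersion = [mean_abs_distance(alpha) for alpha in alphas]
for alpha, value in zip(alphas, dispersion):
    print(f"alpha={alpha:.2f}  mean |x - x*| = {value:.3f}")
```

Under these assumptions the dispersion grows with `alpha`, because more weight is placed on the widely spread divisive positions.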


In [None]:
# Task 12: Robustness Checks

def run_polarization_pipeline(
    anes_df: pd.DataFrame,
    params: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Executes the complete end-to-end polarization research pipeline.

    This orchestrator function serves as the main entry point for the entire
    analysis. It takes raw ANES data and a comprehensive parameter dictionary,
    and executes the full sequence of validated, rigorous analytical steps:
    1.  Parameter Validation
    2.  Data Cleansing
    3.  Data Preprocessing (CDF Generation)
    4.  Polarization Index Computation
    5.  Temporal, Election, and Cleavage Point Analysis
    6.  Traditional Measure Calculation and Comparative Analysis
    7.  Theoretical Model Extensions

    Args:
        anes_df (pd.DataFrame): The raw ANES survey data.
        params (Dict[str, Any]): A dictionary containing all necessary parameters
            for the analysis, including 'central_points_params',
            'boundaries_params', 'cleavage_finder_params', etc.

    Returns:
        Dict[str, Any]: A comprehensive dictionary containing the key outputs
                        (DataFrames and nested dictionaries) from each major
                        stage of the analysis.
    """
    # --- Task 0: Parameter Validation ---
    # First, validate all inputs to ensure the pipeline can run successfully.
    print("Step 0: Validating parameters...")
    validate_parameters(
        anes_df=anes_df,
        central_points_params=params['central_points_params'],
        boundaries_params=params['boundaries_params'],
        integration_params=params['integration_params'],
        cleavage_finder_params=params['cleavage_finder_params']
    )
    print("...Parameters validated successfully.")

    # --- Task 1: Data Cleansing ---
    print("\nStep 1: Cleansing raw ANES data...")
    cleaned_df = clean_anes_data(
        anes_df=anes_df,
        missing_value_map=params.get('missing_value_map') # Use default if not provided
    )
    print("...Data cleansing complete.")

    # --- Task 2: Data Preprocessing ---
    print("\nStep 2: Preprocessing data and generating CDFs...")
    preprocessed_data = preprocess_for_polarization(cleaned_df=cleaned_df)
    print("...Preprocessing complete.")

    # --- Task 4: Compute Polarization Measures ---
    # Task 3 is the core calculation function, which is called by Task 4.
    print("\nStep 4: Computing flexible polarization measures P(F, x*)...")
    polarization_results = compute_all_polarization_measures(
        preprocessed_data=preprocessed_data,
        central_points_params=params['central_points_params'],
        boundaries_params=params['boundaries_params']
    )
    print("...P(F, x*) computation complete.")

    # --- Task 5: Temporal Analysis ---
    print("\nStep 5: Preparing data for temporal analysis...")
    temporal_analysis_data = prepare_temporal_analysis(
        polarization_results=polarization_results,
        midpoints=params['midpoints']
    )
    print("...Temporal analysis data prepared.")

    # --- Task 6: Election Year Analysis ---
    print("\nStep 6: Analyzing election year effects...")
    election_year_analysis = analyze_election_year_effects(
        polarization_results=polarization_results,
        election_years=params['election_years_for_analysis']
    )
    print("...Election year analysis complete.")

    # --- Task 8: Identify Cleavage Points ---
    # Task 7 is the core algorithm, which is called by Task 8.
    print("\nStep 8: Identifying and analyzing cleavage points...")
    cleavage_analysis = identify_and_analyze_cleavage_points(
        polarization_results=polarization_results,
        cleavage_finder_params=params['cleavage_finder_params'],
        midpoints=params['midpoints']
    )
    print("...Cleavage point analysis complete.")

    # --- Task 9: Compute Traditional Polarization Measures ---
    print("\nStep 9: Computing traditional polarization measures...")
    traditional_measures = compute_traditional_measures(
        cleaned_df=cleaned_df,
        centrist_definitions=params['centrist_definitions']
    )
    print("...Traditional measures computed.")

    # --- Task 10: Comparative Analysis ---
    print("\nStep 10: Creating comparative analysis framework...")
    comparative_framework = create_comparative_analysis_framework(
        polarization_results=polarization_results,
        traditional_measures=traditional_measures
    )
    print("...Comparative framework created.")

    # --- Task 11: Theoretical Extensions ---
    # Note: These are illustrative and not dependent on the main pipeline's flow.
    # They are included here to demonstrate their use.
    print("\nStep 11: Running theoretical extension models...")
    salience_simulation_results = simulate_issue_salience_effect(
        **params['salience_simulation_params']
    )
    print("...Theoretical extensions complete.")

    # --- Final Output Assembly ---
    # Compile all key results into a single output dictionary.
    final_results = {
        "cleaned_data": cleaned_df,
        "polarization_results": polarization_results,
        "temporal_analysis_data": temporal_analysis_data,
        "election_year_analysis": election_year_analysis,
        "cleavage_analysis": cleavage_analysis,
        "traditional_measures": traditional_measures,
        "comparative_framework": comparative_framework,
        "salience_simulation_results": salience_simulation_results
    }
    print("\n--- Polarization Pipeline Finished ---")
    return final_results
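
Because `run_polarization_pipeline` indexes `params` directly, a missing key only surfaces as a `KeyError` partway through the run. A small pre-flight check, sketched below, reports every missing key up front. The key names are the ones the function above reads with `params[...]` (`missing_value_map` is omitted because it is read with `.get`). The helper `find_missing_param_keys` and the constant `REQUIRED_PARAM_KEYS` are hypothetical and not part of the codebase.

```python
from typing import Any, Dict, List

# Keys that run_polarization_pipeline accesses with params[...].
REQUIRED_PARAM_KEYS = (
    "central_points_params",
    "boundaries_params",
    "integration_params",
    "cleavage_finder_params",
    "midpoints",
    "election_years_for_analysis",
    "centrist_definitions",
    "salience_simulation_params",
)


def find_missing_param_keys(params: Dict[str, Any]) -> List[str]:
    """Return the required keys that are absent from params, in a stable order."""
    return [key for key in REQUIRED_PARAM_KEYS if key not in params]


# Deliberately incomplete dictionary with placeholder values.
partial_params = {"central_points_params": {}, "midpoints": {}}
missing = find_missing_param_keys(partial_params)
print(missing)
```

This check only confirms that the keys exist. Validating the contents is left to `validate_parameters`.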

def run_robustness_analysis(
    anes_df: pd.DataFrame,
    params: Dict[str, Any]
) -> Dict[str, Dict[str, Any]]:
    """
    Performs a robustness analysis by running the full pipeline under
    different methodological assumptions.

    A critical step in any rigorous quantitative analysis is to assess the
    sensitivity of the findings to key methodological choices. This function
    facilitates this by running the entire `run_polarization_pipeline` under
    a set of predefined scenarios.

    The primary check implemented here is the sensitivity to survey weights:
    1.  **'Weighted' Scenario:** The baseline analysis using the provided ANES
        survey weights.
    2.  **'Unweighted' Scenario:** The analysis is re-run assuming every
        respondent has an equal weight of 1.0.

    By comparing the outputs of these scenarios (e.g., checking whether the
    identified cleavage points remain the same), one can assess the
    robustness of the conclusions.

    Args:
        anes_df (pd.DataFrame): The raw ANES survey data.
        params (Dict[str, Any]): The comprehensive parameter dictionary for the
                                 analysis pipeline.

    Returns:
        Dict[str, Dict[str, Any]]:
            A dictionary where keys are the scenario names ('weighted',
            'unweighted') and values are the complete results dictionaries
            produced by the pipeline for that scenario.
    """
    # --- Define Robustness Scenarios ---
    # Each scenario is a dictionary defining its name and a function to modify
    # the input DataFrame according to the scenario's assumption.
    scenarios = [
        {
            "name": "weighted",
            "modifier": lambda df: df.copy() # Baseline: use the data as is.
        },
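        {
            "name": "unweighted",
            # Set all weights to 1.0. This assumes the survey-weight column is
            # named 'weight'; otherwise `assign` silently adds a new column.
            "modifier": lambda df: df.assign(weight=1.0)
        }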
    ]

    # Initialize a dictionary to store the results from all scenarios.
    robustness_results = {}

    # --- Iterate Through Scenarios and Run Pipeline ---
    print("--- Starting Robustness Analysis ---")
    for scenario in scenarios:
        scenario_name = scenario['name']
        modifier_func = scenario['modifier']

        print(f"\n--- Running Scenario: {scenario_name.upper()} ---")

        # Apply the scenario's modification to the input DataFrame.
        scenario_df = modifier_func(anes_df)

        # Run the entire end-to-end pipeline with the modified data.
        # A try-except block makes the overall analysis resilient to a failure
        # in a single scenario.
        try:
            scenario_result = run_polarization_pipeline(
                anes_df=scenario_df,
                params=params
            )
            # Store the complete results dictionary under the scenario's name.
            robustness_results[scenario_name] = scenario_result
        except Exception as e:
            # If a pipeline run fails, report the error and continue.
            print(f"!!! SCENARIO '{scenario_name}' FAILED: {e} !!!")
            robustness_results[scenario_name] = {"error": str(e)}

    print("\n--- Robustness Analysis Finished ---")
    # Return the dictionary containing the results from all scenarios.
    return robustness_results
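
The scenario pattern above can be tried on a toy DataFrame. The sketch below applies the same two modifiers and compares a weighted mean across scenarios. The column names `position` and `weight` and the toy values are assumptions for illustration. The weighted mean stands in for the full pipeline, which is far more involved.

```python
import numpy as np
import pandas as pd

# Toy survey data: one heavily weighted respondent on the left.
toy_df = pd.DataFrame({"position": [1.0, 4.0, 7.0], "weight": [3.0, 1.0, 1.0]})

# Same modifiers as in run_robustness_analysis, keyed by scenario name.
scenarios = {
    "weighted": lambda df: df.copy(),
    "unweighted": lambda df: df.assign(weight=1.0),
}

scenario_means = {}
for name, modifier in scenarios.items():
    scenario_df = modifier(toy_df)
    scenario_means[name] = float(
        np.average(scenario_df["position"], weights=scenario_df["weight"])
    )

print(scenario_means)  # {'weighted': 2.8, 'unweighted': 4.0}
print(toy_df["weight"].tolist())  # [3.0, 1.0, 1.0]
```

Both `df.copy()` and `df.assign(...)` return new DataFrames, so the input data is left untouched between scenarios.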


In [None]:
# Task 13: Visualization and Reporting

def _plot_temporal_trends(
    left_panel_df: pd.DataFrame,
    right_panel_df: pd.DataFrame,
    scale_name: str
) -> plt.Figure:
    """Helper function to generate a two-panel temporal trend plot for one scale."""
    # Set a professional plot style.
    sns.set_theme(style="whitegrid")
    # Create a figure with two subplots, side-by-side.
    fig, axes = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
    fig.suptitle(f"Temporal Evolution of Polarization: {scale_name.replace('_', ' ').title()}", fontsize=16)

    # --- Left Panel ---
    # Plot the time series for x* values at or below the midpoint.
    sns.lineplot(
        data=left_panel_df,
        x='year_numeric',
        y='polarization_value',
        hue='x_star',
        style='x_star',
        markers=True,
        palette='viridis',
        ax=axes[0]
    )
    axes[0].set_title('Left-of-Center / Centrist Positions')
    axes[0].set_xlabel('Year')
    axes[0].set_ylabel(r'Polarization Index $P(F, x^*)$')
    axes[0].legend(title=r'$x^*$')

    # --- Right Panel ---
    # Plot the time series for x* values above the midpoint.
    sns.lineplot(
        data=right_panel_df,
        x='year_numeric',
        y='polarization_value',
        hue='x_star',
        style='x_star',
        markers=True,
        palette='plasma',
        ax=axes[1]
    )
    axes[1].set_title('Right-of-Center Positions')
    axes[1].set_xlabel('Year')
    axes[1].set_ylabel('') # Shared Y-axis, so no label needed.
    axes[1].legend(title=r'$x^*$')

    # Improve layout and return the figure object.
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    return fig

def _plot_election_effects(
    election_df: pd.DataFrame,
    election_year: int
) -> plt.Figure:
    """Helper function to generate a pre- vs. post-election plot."""
    sns.set_theme(style="whitegrid")
    # Get the unique scales present in this election year's data.
    scales = election_df.index.get_level_values('scale').unique()
    # Create a figure with one subplot for each scale.
    fig, axes = plt.subplots(1, len(scales), figsize=(8 * len(scales), 6), sharey=True)
    if len(scales) == 1:
        axes = [axes]  # Ensure axes is always iterable.
    fig.suptitle(f"Pre- vs. Post-Election Polarization: {election_year}", fontsize=16)

    for i, scale in enumerate(scales):
        # Select data for the current scale.
        scale_data = election_df.xs(scale, level='scale')
        # Plot pre-election values.
        axes[i].plot(
            scale_data.index, scale_data['polarization_pre'],
            marker='o', linestyle='--', label='Pre-Election', alpha=0.8
        )
        # Plot post-election values.
        axes[i].plot(
            scale_data.index, scale_data['polarization_post'],
            marker='x', linestyle='-', label='Post-Election', alpha=0.8
        )
        axes[i].set_title(f"Scale: {scale.replace('_', ' ').title()}")
        axes[i].set_xlabel(r'Central Point $x^*$')
        axes[i].legend()

    # Set the Y-label only for the first subplot.
    axes[0].set_ylabel(r'Polarization Index $P(F, x^*)$')
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    return fig

def generate_report_visuals(
    pipeline_results: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Generates a report of key visuals and tables from the pipeline results.

    This function takes the comprehensive results from the pipeline orchestrator
    and produces a set of publication-quality outputs, including:
    1.  Temporal Trend Plots: Multi-panel plots showing the evolution of
        P(F, x*) over time, replicating the style of Figures 1 & 2.
    2.  Election Year Plots: Before-and-after plots showing the change in
        polarization around key elections, replicating the style of Figure 3.
    3.  Summary Tables: Formatted versions of the key analytical DataFrames
        (cleavage points, comparative analysis) suitable for inclusion in a
        report (e.g., in LaTeX or HTML format).

    Args:
        pipeline_results (Dict[str, Any]):
            The dictionary of results produced by `run_polarization_pipeline`.

    Returns:
        Dict[str, Any]:
            A dictionary containing the generated matplotlib Figure objects and
            formatted table strings, keyed by descriptive names.
    """
    # Initialize the dictionary to hold all report assets.
    report = {"plots": {}, "tables": {}}

    # --- Step 1: Create Publication-Quality Temporal Trend Plots ---
    print("Generating temporal trend plots...")
    temporal_data = pipeline_results['temporal_analysis_data']
    for scale, panel_data in temporal_data.items():
        fig = _plot_temporal_trends(
            left_panel_df=panel_data['left_panel']['long'],
            right_panel_df=panel_data['right_panel']['long'],
            scale_name=scale
        )
        report['plots'][f'temporal_trends_{scale}'] = fig
        plt.close(fig) # Close plot to prevent it from displaying prematurely.

    # --- Step 2: Generate Election Year Comparison Visualizations ---
    print("Generating election year effect plots...")
    election_data = pipeline_results['election_year_analysis']
    if not election_data.empty:
        for year in election_data.index.get_level_values('election_year').unique():
            year_df = election_data.xs(year, level='election_year')
            fig = _plot_election_effects(year_df, year)
            report['plots'][f'election_effects_{year}'] = fig
            plt.close(fig)

    # --- Step 3: Compile Comprehensive Results Tables ---
    print("Generating summary tables...")
    # Format the cleavage analysis summary table as a LaTeX string.
    cleavage_df = pipeline_results['cleavage_analysis']
    if not cleavage_df.empty:
        report['tables']['cleavage_summary_latex'] = cleavage_df.to_latex(
            caption="Summary of Identified Cleavage Points",
            label="tab:cleavage",
            float_format="%.2f"
        )

    # Format the correlation summary table as a LaTeX string.
    corr_summary_df = pipeline_results['comparative_framework']['correlation_summary']
    if not corr_summary_df.empty:
        report['tables']['correlation_summary_latex'] = corr_summary_df.to_latex(
            caption="Correlation Between Traditional and Flexible Polarization Measures",
            label="tab:correlation",
            float_format="%.3f"
        )

    print("\n--- Report Generation Finished ---")
    return report
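
The figures stored in `report['plots']` have been closed with `plt.close(fig)`, so they do not render automatically, but the `Figure` objects can still be written out with `savefig`. The sketch below shows this with a toy figure and an in-memory buffer. The key `toy_trend`, the plotted numbers, and the use of the non-interactive `Agg` backend are assumptions for illustration.

```python
import io

import matplotlib

matplotlib.use("Agg")  # Non-interactive backend, so no display is required.
import matplotlib.pyplot as plt

# Build a toy figure with arbitrary values and close it, as the report code does.
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0.10, 0.20, 0.15], marker="o")
ax.set_xlabel("Index")
ax.set_ylabel("Value")
plt.close(fig)

# Mimic the structure returned by generate_report_visuals.
report = {"plots": {"toy_trend": fig}, "tables": {}}

# A closed figure is no longer tracked by pyplot but can still be saved.
buffer = io.BytesIO()
report["plots"]["toy_trend"].savefig(buffer, format="png", dpi=100)
png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```

To write to disk instead, pass a file path to `savefig` in place of the buffer.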


In [None]:
# Master Orchestrator

def execute_full_research_project(
    anes_df: pd.DataFrame,
    params: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Executes the complete, end-to-end polarization research project.

    This master orchestrator function serves as the single entry point to run
    the entire analysis suite, from raw data to final report assets. It encapsulates
    the full research workflow, including the baseline analysis, robustness checks,
    and the generation of all tables and visualizations.

    The workflow is as follows:
    1.  **Main Pipeline Run:** Executes the `run_polarization_pipeline` function
        to generate the primary set of findings based on the default parameters
        (typically using survey weights).
    2.  **Report Generation:** Uses the results from the main pipeline run to
        call `generate_report_visuals`, creating all necessary plots and tables
        for the primary report.
    3.  **Robustness Analysis:** Executes the `run_robustness_analysis` function,
        which re-runs the entire pipeline under different methodological
        scenarios (e.g., weighted vs. unweighted) to test the sensitivity
        of the findings.
    4.  **Result Consolidation:** Assembles all outputs into a final, comprehensive,
        and hierarchically structured dictionary, which represents the complete
        output of the research project.

    Args:
        anes_df (pd.DataFrame): The raw ANES survey data.
        params (Dict[str, Any]): A comprehensive dictionary containing all
            parameters required for every stage of the analysis.

    Returns:
        Dict[str, Any]:
            A master dictionary containing the complete project results,
            structured as follows:
            {
                'main_analysis': {
                    'results': Dict[str, Any],          # Output from pipeline
                    'report_assets': Dict[str, Any]     # Output from visuals
                },
                'robustness_analysis': Dict[str, Dict] # Output from robustness
            }
    """
    print("=============================================")
    print("=== EXECUTING FULL POLARIZATION PROJECT ===")
    print("=============================================")

    # --- Step 1: Execute the Main Analysis Pipeline ---
    # This run constitutes the primary findings of the research.
    print("\n>>> STAGE 1: RUNNING MAIN ANALYSIS PIPELINE...")
    main_pipeline_results = run_polarization_pipeline(
        anes_df=anes_df,
        params=params
    )
    print(">>> STAGE 1 COMPLETE.")

    # --- Step 2: Generate Report Visuals for the Main Analysis ---
    print("\n>>> STAGE 2: GENERATING REPORT ASSETS FOR MAIN ANALYSIS...")
    report_assets = generate_report_visuals(
        pipeline_results=main_pipeline_results
    )
    print(">>> STAGE 2 COMPLETE.")

    # --- Step 3: Execute the Robustness Analysis ---
    # This stage re-runs the pipeline under different scenarios to test sensitivity.
    # Note that the 'weighted' scenario repeats the Stage 1 run on a copy of the data.
    print("\n>>> STAGE 3: RUNNING ROBUSTNESS ANALYSIS...")
    robustness_analysis_results = run_robustness_analysis(
        anes_df=anes_df,
        params=params
    )
    print(">>> STAGE 3 COMPLETE.")

    # --- Step 4: Assemble the Final Structured Output Dictionary ---
    # This is the final deliverable, conforming to the specification of Task 14.
    # It contains all analytical products in a clean, hierarchical structure.
    master_results_dictionary = {
        'main_analysis': {
            'results': main_pipeline_results,
            'report_assets': report_assets
        },
        'robustness_analysis': robustness_analysis_results
    }

    print("\n=============================================")
    print("=== PROJECT EXECUTION SUCCESSFULLY COMPLETED ===")
    print("=============================================")

    return master_results_dictionary
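
Since `run_robustness_analysis` records a failed scenario as `{"error": ...}` instead of raising, the final banner can print even when a scenario did not complete. A minimal sketch of a post-run check follows. The helper `summarize_robustness` and the toy dictionary are hypothetical; only the `'error'` convention and the top-level keys come from the code above.

```python
from typing import Any, Dict


def summarize_robustness(robustness_results: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
    """Map each scenario name to 'ok' or to its recorded error message."""
    summary = {}
    for scenario_name, result in robustness_results.items():
        # A failed scenario is stored as a dictionary with a single 'error' key.
        if set(result.keys()) == {"error"}:
            summary[scenario_name] = f"failed: {result['error']}"
        else:
            summary[scenario_name] = "ok"
    return summary


# Toy stand-in for the master dictionary; real values would be pipeline outputs.
toy_master_results = {
    "main_analysis": {"results": {}, "report_assets": {"plots": {}, "tables": {}}},
    "robustness_analysis": {
        "weighted": {"cleaned_data": None, "polarization_results": None},
        "unweighted": {"error": "'weight'"},
    },
}

status = summarize_robustness(toy_master_results["robustness_analysis"])
print(status)  # {'weighted': 'ok', 'unweighted': "failed: 'weight'"}
```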
