# `README.md`

# A General Decomposability Toolkit for Auditing the True Human Impact of Localized Economic Shocks

<!-- PROJECT SHIELDS -->
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)
[![arXiv](https://img.shields.io/badge/arXiv-2510.24225v1-b31b1b.svg)](https://arxiv.org/abs/2510.24225)
[![Journal](https://img.shields.io/badge/Journal-Journal%20of%20Labor%20Economics-003366)](https://www.journals.uchicago.edu/loi/jole)
[![Year](https://img.shields.io/badge/Year-2025-purple)](https://github.com/chirindaopensource/effects_of_immigration_on_places_people)
[![Discipline](https://img.shields.io/badge/Discipline-Labor%20Economics-00529B)](https://github.com/chirindaopensource/effects_of_immigration_on_places_people)
[![Data Sources](https://img.shields.io/badge/Data-German_Social_Security_Records_--_IEB-lightgrey)](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SCIOLX)
[![Core Method](https://img.shields.io/badge/Method-Decomposition%20%7C%20FD--IV%20%7C%202SLS-orange)](https://github.com/chirindaopensource/effects_of_immigration_on_places_people)
[![Analysis](https://img.shields.io/badge/Analysis-Causal%20Inference%20%7C%20Immigration%20Economics-red)](https://github.com/chirindaopensource/effects_of_immigration_on_places_people)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Type Checking: mypy](https://img.shields.io/badge/type%20checking-mypy-blue)](http://mypy-lang.org/)
[![NumPy](https://img.shields.io/badge/numpy-%23013243.svg?style=flat&logo=numpy&logoColor=white)](https://numpy.org/)
[![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=flat&logo=pandas&logoColor=white)](https://pandas.pydata.org/)
[![Matplotlib](https://img.shields.io/badge/matplotlib-%2311557c.svg?style=flat&logo=matplotlib&logoColor=white)](https://matplotlib.org/)
[![linearmodels](https://img.shields.io/badge/linearmodels-003F72-blue)](https://bashtage.github.io/linearmodels/)
[![Pydantic](https://img.shields.io/badge/Pydantic-E92063?logo=pydantic&logoColor=white)](https://pydantic-docs.helpmanual.io/)
[![PyYAML](https://img.shields.io/badge/PyYAML-gray?logo=yaml&logoColor=white)](https://pyyaml.org/)
[![Jupyter](https://img.shields.io/badge/Jupyter-%23F37626.svg?style=flat&logo=Jupyter&logoColor=white)](https://jupyter.org/)

**Repository:** `https://github.com/chirindaopensource/effects_of_immigration_on_places_people`

**Owner:** 2025 Craig Chirinda (Open Source Projects)

This repository contains an **independent**, professional-grade Python implementation of the research methodology from the 2025 paper entitled **"The Effects of Immigration on Places and People - Identification and Interpretation"** by:

*   Christian Dustmann
*   Sebastian Otten
*   Uta Schönberg
*   Jan Stuhler

The project provides a complete, end-to-end computational framework for replicating the paper's findings. It delivers a modular, auditable, and extensible pipeline that executes the entire research workflow: from rigorous data validation and preparation to the core econometric decompositions, heterogeneity analyses, structural parameter recovery, and the final generation of all tables and figures.

## Table of Contents

- [Introduction](#introduction)
- [Theoretical Background](#theoretical-background)
- [Features](#features)
- [Methodology Implemented](#methodology-implemented)
- [Core Components (Notebook Structure)](#core-components-notebook-structure)
- [Key Callable: `execute_decomposition_toolkit_pipeline`](#key-callable-execute_decomposition_toolkit_pipeline)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Input Data Structure](#input-data-structure)
- [Usage](#usage)
- [Output Structure](#output-structure)
- [Project Structure](#project-structure)
- [Customization](#customization)
- [Contributing](#contributing)
- [Recommended Extensions](#recommended-extensions)
- [License](#license)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)

## Introduction

This project provides a Python implementation of the analytical framework presented in Dustmann, Otten, Schönberg, and Stuhler (2025). The core of this repository is the Jupyter notebook `effects_of_immigration_on_places_people_draft.ipynb`, which contains a comprehensive suite of functions to replicate the paper's findings. The pipeline is designed to be a generalizable toolkit for decomposing the aggregate ("places-based") effects of any localized economic shock into their underlying micro-level ("people-based") components.

The paper's central argument is that standard estimates of immigration's regional effects are composites that mask distinct, policy-relevant mechanisms. This codebase operationalizes the paper's unifying framework, allowing users to:
-   Rigorously validate and manage the entire experimental configuration via a single `config.yaml` file.
-   Process raw longitudinal spell data, applying a sequence of cleansing, imputation, and panel construction steps.
-   Decompose the regional employment effect into **displacement**, **crowding-out**, and **relocation** components.
-   Decompose the regional wage effect into a **pure price effect** and **compositional effects** from selective worker flows.
-   Estimate all models using a robust 2SLS framework with a from-scratch **Wild Cluster Bootstrap** for inference.
-   Run a full suite of heterogeneity analyses (e.g., by age, skill) and robustness checks.
-   Recover underlying **structural economic parameters** (labor demand/supply elasticities) from the estimated coefficients.
-   Automatically generate all key tables and figures from the paper.

## Theoretical Background

The implemented methods are grounded in modern labor econometrics and causal inference, extending the canonical supply-and-demand model of a local labor market.

**1. Regional vs. Pure Effects:**
The framework distinguishes between the regional wage effect (`γ^R`), which is what is typically estimated with repeated cross-sections, and the pure wage effect (`γ^W`), which is the true change in the price of labor. The two are linked by the composition effect:
$$
\frac{d\log w^R}{dI^P} = \frac{d\log w}{dI^P} \times (1 + \tilde{\eta}^E - \tilde{\eta}^P)
$$
where $(1 + \tilde{\eta}^E - \tilde{\eta}^P)$ is the "selectivity bias" term, driven by the difference between the efficiency-weighted ($\tilde{\eta}^E$) and population-weighted ($\tilde{\eta}^P$) labor supply elasticities.

**2. Employment Decomposition:**
The change in regional native employment is an accounting identity of three micro-level flows:
$$
\frac{E_{r1} - E_{r0}}{E_{r0}} = \underbrace{-\frac{E_{r,N}}{E_{r0}}}_{\text{Displacement}} + \underbrace{\frac{E_{\{\tilde{r},N\},r}}{E_{r0}}}_{\text{Crowding-Out}} \underbrace{- \frac{E_{r,\tilde{r}}}{E_{r0}}}_{\text{Relocation}}
$$
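Here $E_{r,N}$ counts incumbents who move from employment in region $r$ to non-employment, $E_{\{\tilde{r},N\},r}$ counts entrants into region $r$ from other regions ($\tilde{r}$) or from non-employment, and $E_{r,\tilde{r}}$ counts incumbents who move to jobs in other regions. This implementation estimates the causal effect of the immigration shock on each of these components.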
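
The accounting identity above can be checked on a toy two-period panel. The following is a minimal sketch, not the notebook's implementation: the column names `region_t0` and `region_t1`, the helper `decompose_employment`, and all data values are hypothetical.

```python
import pandas as pd

# Toy two-period panel of native workers (hypothetical data).
# A region of None means the worker is not employed in that period.
workers = pd.DataFrame(
    {
        "worker_id": [1, 2, 3, 4, 5, 6],
        "region_t0": ["A", "A", "A", "A", None, "B"],
        "region_t1": ["A", None, "B", "A", "A", "A"],
    }
)


def decompose_employment(df: pd.DataFrame, region: str) -> dict:
    """Split the employment growth of `region` into three worker flows."""
    in_r0 = df["region_t0"] == region
    in_r1 = df["region_t1"] == region
    e_r0 = in_r0.sum()  # baseline employment E_{r0}

    # Incumbents who move to non-employment (displacement).
    to_nonemp = (in_r0 & df["region_t1"].isna()).sum()
    # Incumbents who take a job in another region (relocation).
    to_other = (in_r0 & df["region_t1"].notna() & ~in_r1).sum()
    # Entrants from non-employment or from other regions (inflows).
    inflows = (~in_r0 & in_r1).sum()

    return {
        "displacement": -to_nonemp / e_r0,
        "inflows": inflows / e_r0,
        "relocation": -to_other / e_r0,
        "total_growth": (in_r1.sum() - e_r0) / e_r0,
    }


parts = decompose_employment(workers, "A")
print(parts)
```

In this toy example the three flow terms sum exactly to total employment growth, which is the property the identity asserts. The causal analysis then regresses each term on the instrumented immigration shock.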

**3. Identification Strategy:**
-   **Instrumental Variable (2SLS):** To address the endogeneity of immigrant inflows (`ΔI_r`), the study uses distance to the border and its square as instrumental variables.
-   **First-Difference IV (FD-IV):** To identify the pure wage effect (`γ^W`), the implementation uses an individual-level FD-IV model on the panel of "stayers" (workers who remain in the same region). This differences out time-invariant individual heterogeneity (`θ_i`).
    $$
    \Delta \log w_{ir} = \gamma^W \Delta I_r + \delta' \Delta X_i + \Delta e_{ir}
    $$
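
To illustrate the identification strategy, here is a minimal 2SLS sketch on simulated data, using distance to the border and its square as instruments for the inflow. It is written with plain NumPy rather than the `linearmodels` estimator the notebook imports, it omits the controls $\Delta X_i$ and any standard errors, and every number in the simulation (including `true_gamma`) is an assumption made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data (all values hypothetical).
distance = rng.uniform(0, 100, n)   # km to the border
confounder = rng.normal(size=n)     # unobserved local demand shock
# The inflow declines with distance and also responds to the confounder.
inflow = (
    0.05 - 0.0008 * distance + 0.000004 * distance**2
    + 0.01 * confounder + rng.normal(0, 0.005, n)
)
true_gamma = -0.2
# First-differenced log wage; the confounder makes the inflow endogenous.
d_log_wage = true_gamma * inflow + 0.02 * confounder + rng.normal(0, 0.01, n)


def two_stage_least_squares(y, x_endog, instruments):
    """Return [intercept, slope] from a just- or over-identified 2SLS fit."""
    const = np.ones((len(y), 1))
    Z = np.column_stack([const, instruments])
    X = np.column_stack([const, x_endog])
    # First stage: project the regressors onto the instrument space.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress the outcome on the fitted regressors.
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]


iv = two_stage_least_squares(
    d_log_wage, inflow, np.column_stack([distance, distance**2])
)
ols = np.linalg.lstsq(
    np.column_stack([np.ones(n), inflow]), d_log_wage, rcond=None
)[0]
print(f"OLS slope: {ols[1]:.3f}, 2SLS slope: {iv[1]:.3f}, truth: {true_gamma}")
```

Because the simulated confounder raises both the inflow and wage growth, the OLS slope is biased upward, while the 2SLS slope should land near the true value.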

## Features

The provided Jupyter notebook (`effects_of_immigration_on_places_people_draft.ipynb`) implements the full research pipeline, including:

-   **Modular, Multi-Task Architecture:** The entire pipeline is broken down into 25 distinct, modular tasks, each with its own orchestrator function.
-   **Configuration-Driven Design:** All study parameters are managed in an external `config.yaml` file, validated by a `Pydantic` schema.
-   **Rigorous Data Validation:** A multi-stage validation process checks the schema, logical integrity, and consistency of all input data.
-   **Advanced Data Preparation:** Includes Tobit-style imputation for censored wages and a high-performance, vectorized method for resolving spell data into a worker-year panel.
-   **Robust Econometric Engine:** All estimations are performed using a master 2SLS function with a from-scratch Wild Cluster Bootstrap for reliable inference.
-   **Complete Decomposition Suite:** Implements the full employment and wage decompositions, including the complex routine-task decomposition with occupational switching.
-   **Comprehensive Heterogeneity & Robustness Analysis:** Includes dedicated modules for analyzing subgroups (older workers, non-employed) and for running a full suite of robustness checks (pre-trends, sensitivity, pseudo-panel).
-   **Structural Parameter Recovery:** Concludes by using the estimated reduced-form coefficients to solve for underlying economic parameters.
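
As a rough illustration of the inference step, the sketch below implements a wild cluster bootstrap for a single coefficient. It is a simplified stand-in under stated assumptions: it uses OLS rather than 2SLS, a CR0 cluster-robust variance, Rademacher weights, and imposes the null hypothesis when generating bootstrap samples. The function names and simulated data are hypothetical, and the notebook's own routine may differ in all of these choices.

```python
import numpy as np


def cluster_robust_t(X, y, clusters, k):
    """OLS coefficient k and its cluster-robust (CR0 sandwich) t-statistic."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        score = X[clusters == g].T @ resid[clusters == g]
        meat += np.outer(score, score)
    V = XtX_inv @ meat @ XtX_inv
    return beta[k], beta[k] / np.sqrt(V[k, k])


def wild_cluster_bootstrap_p(X, y, clusters, k, n_boot=499, seed=0):
    """Two-sided bootstrap p-value for H0: beta_k = 0, null imposed."""
    rng = np.random.default_rng(seed)
    _, t_obs = cluster_robust_t(X, y, clusters, k)
    # Restricted fit: drop column k so the null holds in the bootstrap DGP.
    X_r = np.delete(X, k, axis=1)
    fitted_r = X_r @ np.linalg.lstsq(X_r, y, rcond=None)[0]
    resid_r = y - fitted_r
    ids, cluster_idx = np.unique(clusters, return_inverse=True)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        # One Rademacher draw per cluster, shared by all its observations.
        signs = rng.choice([-1.0, 1.0], size=len(ids))[cluster_idx]
        y_star = fitted_r + signs * resid_r
        _, t_boot[b] = cluster_robust_t(X, y_star, clusters, k)
    return float(np.mean(np.abs(t_boot) >= abs(t_obs)))


# Simulated example: the shock varies only at the cluster level.
rng = np.random.default_rng(1)
n_clusters, per_cluster = 30, 20
clusters = np.repeat(np.arange(n_clusters), per_cluster)
x = np.repeat(rng.normal(size=n_clusters), per_cluster)
cluster_noise = np.repeat(rng.normal(size=n_clusters), per_cluster)
y = 2.0 * x + cluster_noise + rng.normal(size=n_clusters * per_cluster)
X = np.column_stack([np.ones_like(x), x])

p_value = wild_cluster_bootstrap_p(X, y, clusters, k=1)
print(f"bootstrap p-value: {p_value:.3f}")
```

Flipping residual signs at the cluster level preserves the within-cluster correlation of the errors, which is why this bootstrap is commonly preferred when the number of clusters is small.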

## Methodology Implemented

The core analytical steps directly implement the methodology from the paper:

1.  **Validation & Cleansing (Tasks 1-3):** Ingests and validates all inputs, normalizes dtypes, and transforms raw spell data into a canonical worker-year panel.
2.  **Variable Construction (Tasks 4-8):** Imputes censored wages, builds the final analysis panel, aggregates to the regional level, and constructs the immigration shock and instrumental variables.
3.  **Main Analysis (Tasks 9-13):** Prepares the final estimation dataset and runs the main regional employment and wage decompositions.
4.  **Robustness & Heterogeneity (Tasks 14-19):** Executes selection bounding, analyzes key subgroups (non-employed, older workers, by task), and performs the detailed routine employment decomposition and apprenticeship analysis.
5.  **Synthesis & Reporting (Tasks 20-25):** Recovers structural parameters, orchestrates the full pipeline, runs final robustness checks, and compiles all results into publication-quality tables and figures.
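
The wage imputation in Task 4 can be sketched as a draw from a normal distribution truncated below at the censoring point. This is a minimal illustration, not the notebook's routine: the function `impute_censored_log_wages` is hypothetical, and the predicted mean `mu` and scale `sigma` are simply assumed here, whereas in practice they would come from a censored (Tobit) regression.

```python
import numpy as np
from scipy.stats import norm


def impute_censored_log_wages(log_wage, is_censored, mu, sigma, log_cap, seed=0):
    """Replace top-coded log wages with draws from N(mu, sigma^2) above the cap."""
    rng = np.random.default_rng(seed)
    log_wage = np.asarray(log_wage, dtype=float).copy()
    is_censored = np.asarray(is_censored, dtype=bool)
    mu = np.broadcast_to(np.asarray(mu, dtype=float), log_wage.shape)

    # Probability mass below the cap under the assumed normal distribution.
    p_below = norm.cdf((log_cap - mu[is_censored]) / sigma)
    # Inverse-CDF sampling restricted to the upper tail.
    u = rng.uniform(size=p_below.shape)
    z = norm.ppf(p_below + u * (1.0 - p_below))
    log_wage[is_censored] = mu[is_censored] + sigma * z
    return log_wage


# Hypothetical example: two of four observations sit at the cap.
observed = np.array([4.1, 4.5, 3.9, 4.5])
censored = np.array([False, True, False, True])
imputed = impute_censored_log_wages(
    observed, censored, mu=4.0, sigma=0.5, log_cap=4.5
)
print(imputed)
```

Uncensored wages pass through unchanged, and every imputed value lies at or above the cap by construction.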

## Core Components (Notebook Structure)

The `effects_of_immigration_on_places_people_draft.ipynb` notebook is structured as a logical pipeline with modular orchestrator functions for each of the 25 major tasks. All functions are self-contained, fully documented with type hints and docstrings, and designed for professional-grade execution.

## Key Callable: `execute_decomposition_toolkit_pipeline`

The project is designed around a single, top-level user-facing interface function:

-   **`execute_decomposition_toolkit_pipeline`:** This master orchestrator function, located in the final section of the notebook, runs the entire automated research pipeline end to end. A single call to this function reproduces the entire computational portion of the project.

## Prerequisites

-   Python 3.9+
-   Core dependencies: `pandas`, `numpy`, `scipy`, `pyyaml`, `faker`, `statsmodels`, `linearmodels`, `pydantic`, `pyproj`, `matplotlib`, plus a Parquet engine such as `pyarrow` for reading the `.parquet` input.

## Installation

1.  **Clone the repository:**
    ```sh
    git clone https://github.com/chirindaopensource/effects_of_immigration_on_places_people.git
    cd effects_of_immigration_on_places_people
    ```

2.  **Create and activate a virtual environment (recommended):**
    ```sh
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    ```

3.  **Install Python dependencies:**
    ```sh
    pip install pandas numpy scipy pyyaml Faker statsmodels linearmodels pydantic pyproj matplotlib pyarrow
    ```

## Input Data Structure

The pipeline requires a main dataset and several auxiliary files, all with specific schemas that are rigorously validated. A synthetic data generator is included in the notebook for a self-contained demonstration.
1.  **`consolidated_df_raw`**: The primary spell-level dataset (Parquet format).
2.  **Auxiliary Files**: `border_crossings.csv`, `bibb_iab_task_mapping.csv`, `matched_controls.csv`.

All other parameters are controlled by the `config.yaml` file.
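
To make the shape of the spell-level input concrete, the sketch below collapses a few toy spells into a worker-year panel. It is a simple stand-in for the notebook's vectorized routine and rests on assumptions: spells are assigned to a year if they cover a 30 June reference date, ties are resolved by keeping the highest-paying spell, and the `Daily_Wage` column and the helper `spells_to_worker_year` are hypothetical.

```python
import pandas as pd

# Toy spell data (hypothetical values).
spells = pd.DataFrame(
    {
        "Worker_ID": [1, 1, 1, 2],
        "Start_Date": pd.to_datetime(
            ["1990-01-01", "1990-05-01", "1991-01-01", "1990-03-01"]
        ),
        "End_Date": pd.to_datetime(
            ["1990-12-31", "1990-09-30", "1991-12-31", "1991-08-31"]
        ),
        "Daily_Wage": [80.0, 95.0, 85.0, 60.0],
    }
)


def spells_to_worker_year(spells, years, month=6, day=30):
    """Keep one spell per worker and year: the best-paid one on the reference date."""
    frames = []
    for year in years:
        ref = pd.Timestamp(year=year, month=month, day=day)
        active = spells[
            (spells["Start_Date"] <= ref) & (spells["End_Date"] >= ref)
        ].copy()
        active["Year"] = year
        frames.append(active)
    panel = pd.concat(frames, ignore_index=True)
    # Resolve overlapping spells by keeping the highest daily wage.
    panel = panel.sort_values("Daily_Wage", ascending=False)
    panel = panel.drop_duplicates(["Worker_ID", "Year"])
    return panel.sort_values(["Worker_ID", "Year"]).reset_index(drop=True)


panel = spells_to_worker_year(spells, years=[1990, 1991])
print(panel[["Worker_ID", "Year", "Daily_Wage"]])
```

Worker 1 has two overlapping spells on the 1990 reference date, so only the better-paid one survives in the panel.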

## Usage

The `effects_of_immigration_on_places_people_draft.ipynb` notebook provides a complete, step-by-step guide. The primary workflow is to execute the final cell of the notebook, which demonstrates how to use the top-level `execute_decomposition_toolkit_pipeline` orchestrator:

```python
# Final cell of the notebook
import yaml  # PyYAML; needed for yaml.safe_load below

# This block serves as the main entry point for the entire project.
if __name__ == '__main__':
    # 1. Generate a full set of synthetic data files for the demonstration.
    generate_synthetic_data(output_dir="./study_data/")
    
    # 2. Load the master configuration from the YAML file.
    with open('config.yaml', 'r') as f:
        config = yaml.safe_load(f)
    
    # 3. IMPORTANT: Update file paths in the loaded config to point to the new data.
    config['auxiliary_data_artifacts']['BORDER_CROSSING_TABLE']['path_or_handle'] = "./study_data/border_crossings.csv"
    # ... (update other paths as needed) ...
    
    # 4. Define the path to the main raw data file.
    raw_data_file_path = "./study_data/consolidated_df_raw.parquet"
    
    # 5. Execute the entire replication study.
    final_results = execute_decomposition_toolkit_pipeline(
        raw_data_path=raw_data_file_path,
        master_config=config,
        force_rerun_prep=True,  # Force re-run on first execution
    )
```

## Output Structure

The pipeline returns a comprehensive dictionary containing all analytical artifacts, structured as follows:
-   **`main_tables`**: Contains nested dictionaries for each result year, holding the full estimation outputs for all main decomposition and heterogeneity analyses.
-   **`event_studies`**: Contains the full time-series results for analyses run over all years (e.g., apprenticeship uptake).
-   **`robustness_checks`**: Contains results from all sensitivity and validation analyses.
-   **`structural_parameters`**: Contains the final recovered economic parameters.

## Project Structure

```
effects_of_immigration_on_places_people/
│
├── effects_of_immigration_on_places_people_draft.ipynb  # Main implementation notebook
├── config.yaml                                          # Master configuration file
├── requirements.txt                                     # Python package dependencies
│
├── study_data/                                          # Directory for synthetic/real data
│   ├── consolidated_df_raw.parquet
│   └── ...
│
├── .pipeline_cache/                                     # Directory for cached artifacts
│   ├── task3.pkl
│   └── ...
│
├── LICENSE                                              # MIT Project License File
└── README.md                                            # This file
```

## Customization

The pipeline is highly customizable via the `config.yaml` file. Users can modify all study parameters, including file paths, sample selection criteria, and algorithm settings, without altering the core Python code.
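
As an illustration of how a configuration value can be validated before the pipeline runs, here is a minimal Pydantic sketch. The key names `BASE_YEAR` and `BOOTSTRAP_REPLICATIONS` appear in the notebook's documentation, but the model name `StudySettings` and the bounds are assumptions chosen for the example, not the notebook's actual schema.

```python
from pydantic import BaseModel, Field, ValidationError


class StudySettings(BaseModel):
    """Hypothetical subset of the study configuration."""

    BASE_YEAR: int = Field(..., ge=1975, le=2020)  # assumed bounds
    BOOTSTRAP_REPLICATIONS: int = Field(..., ge=1)  # must be positive


# In practice this dictionary would come from yaml.safe_load(config_file).
raw = {"BASE_YEAR": 1990, "BOOTSTRAP_REPLICATIONS": 999}
settings = StudySettings(**raw)
print(settings.BASE_YEAR, settings.BOOTSTRAP_REPLICATIONS)

# An invalid value is rejected before any estimation starts.
try:
    StudySettings(BASE_YEAR=1990, BOOTSTRAP_REPLICATIONS=0)
    rejected = False
except ValidationError as err:
    rejected = True
    print(f"Config rejected: {len(err.errors())} error(s)")
```

Failing fast on a bad configuration is cheaper than discovering the problem after a long data-preparation run.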

## Contributing

Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.

## Recommended Extensions

Future extensions could include:
-   **Generalization:** Abstracting the core decomposition logic to create a generic toolkit that can be applied to other local shocks (e.g., trade, automation) with minimal modification.
-   **Alternative Estimation:** Incorporating alternative estimation techniques, such as difference-in-differences with staggered adoption or synthetic control methods.
-   **Dynamic Panel Models:** Extending the analysis to account for dynamic adjustments and feedback effects over time.

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

## Citation

If you use this code or the methodology in your research, please cite the original paper:

```bibtex
@article{dustmann2025effects,
  title={The Effects of Immigration on Places and People--Identification and Interpretation},
  author={Dustmann, Christian and Otten, Sebastian and Sch{\"o}nberg, Uta and Stuhler, Jan},
  journal={Journal of Labor Economics},
  year={2025},
  note={arXiv:2510.24225}
}
```

For the implementation itself, you may cite this repository:
```
Chirinda, C. (2025). A General Decomposability Toolkit for Auditing the True Human Impact of Localized Economic Shocks: An Implementation of Dustmann et al. (2025).
GitHub repository: https://github.com/chirindaopensource/effects_of_immigration_on_places_people
```

## Acknowledgments

-   Credit to **Christian Dustmann, Sebastian Otten, Uta Schönberg, and Jan Stuhler** for the foundational research that forms the entire basis for this computational replication.
-   This project is built upon the exceptional tools provided by the open-source community. Sincere thanks to the developers of the scientific Python ecosystem, including **Pandas, NumPy, Matplotlib, Statsmodels, linearmodels, and Pydantic**.

---

*This README was generated based on the structure and content of the `effects_of_immigration_on_places_people_draft.ipynb` notebook and follows best practices for research software documentation.*


# Paper

Title: "*The Effects of Immigration on Places and People - Identification and Interpretation*"

Authors: Christian Dustmann, Sebastian Otten, Uta Schönberg, Jan Stuhler

E-Journal Submission Date: 28 October 2025

Journal Reference: Accepted at the Journal of Labor Economics

Paper Link: https://arxiv.org/abs/2510.24225

Replication Dataset Link: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SCIOLX

Abstract:

Most studies on the labor market effects of immigration use repeated cross-sectional data to estimate the effects of immigration on regions. This paper shows that such regional effects are composites of effects that address fundamental questions in the immigration debate but remain unidentified with repeated cross-sectional data. We provide a unifying empirical framework that decomposes the regional effects of immigration into their underlying components and show how these are identifiable from data that track workers over time. Our empirical application illustrates that such analysis yields a far more informative picture of immigration's effects on wages, employment, and occupational upgrading.


# Summary

### **The Core Research Problem and Methodological Contribution**

The paper's central thesis is that the canonical method for estimating the labor market effects of immigration—the "spatial correlations" approach using repeated cross-sectional data—is fundamentally flawed. This standard approach regresses changes in regional outcomes (like average wages or employment levels in a city) on changes in the local immigrant share.

Dustmann et al. argue that the resulting coefficients are **composite effects**. They are amalgams of several distinct, policy-relevant mechanisms that cannot be separately identified with cross-sectional data. The paper's primary contribution is to develop a unifying framework that:
1.  **Decomposes** these regional "place-based" effects into their underlying "people-based" components.
2.  **Demonstrates** how longitudinal data, which tracks individual workers over time, is essential for identifying these separate components.
3.  **Shows** that the distinction is not merely academic, as the magnitudes and interpretations of "place" versus "people" effects can differ dramatically.

### **The Theoretical Framework and Key Decompositions**

The authors extend the canonical supply-and-demand model of a local labor market. Their key innovation is to formalize how a regional effect is constructed from individual-level responses.

**A. Decomposition of Employment Effects:**

The standard approach estimates the **regional employment effect ($\beta^R$)**, which is the total percentage change in native employment in a region following an immigration shock. The authors show through a simple identity (Equation 3) that this regional effect can be exactly decomposed into three components:

1.  **Displacement Effect:** The effect on incumbent workers—natives employed in the region *before* the shock—who lose their jobs. This directly answers the public-policy question: "Do immigrants take jobs from existing workers?"
2.  **Crowding-Out Effect:** The reduction in inflows of native workers into the region's labor market. This captures natives who would have taken jobs in the region (either from non-employment or other regions) but are now "crowded out" by immigrants.
3.  **Relocation Effect:** The effect on incumbent workers who respond to the shock by moving to jobs in other regions.

Critically, a large negative regional effect ($\beta^R$) could be driven by a large displacement effect (incumbents are fired) or a large crowding-out effect (fewer new workers enter). These two scenarios have vastly different implications for the welfare of native workers.

**B. Decomposition of Wage Effects:**

Similarly, the standard approach estimates the **regional wage effect ($\gamma^R$)**, which is the change in the average wage in a region. The authors demonstrate (Equations 6 and 7) that this effect is a composite of:

1.  **The "Pure" Wage Effect ($\gamma^W$):** The change in the price of labor for a worker of constant productivity. This is the theoretically "pure" effect of a labor supply shift along a fixed labor demand curve. It can be identified by observing the wage growth of natives who remain employed in the same region throughout the period ("stayers").
2.  **Compositional Effects:** Changes in the average wage driven by changes in the *composition* of the workforce. Immigration can alter who is employed. If low-wage natives are more likely to exit employment (or high-wage natives are more likely to enter), the average wage in the region can increase, even if the pure wage effect for every individual worker is negative. This compositional change introduces a "selectivity bias" in traditional estimates.
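
A toy calculation makes the composition mechanism concrete. All numbers below are hypothetical and chosen only to show that the regional average can rise while every stayer's wage falls.

```python
# Log wages before the shock (hypothetical values).
before = {"low_wage_worker": 3.0, "mid_wage_worker": 4.0, "high_wage_worker": 5.0}
pure_effect = -0.05  # every stayer's log wage falls by 0.05

# The low-wage worker exits employment; the other two stay.
after = {
    name: wage + pure_effect
    for name, wage in before.items()
    if name != "low_wage_worker"
}

mean_before = sum(before.values()) / len(before)
mean_after = sum(after.values()) / len(after)

regional_effect = mean_after - mean_before          # what cross-sections see
composition_effect = regional_effect - pure_effect  # part due to who is employed

print(f"regional effect:    {regional_effect:+.2f}")
print(f"pure effect:        {pure_effect:+.2f}")
print(f"composition effect: {composition_effect:+.2f}")
```

Here the regional average rises by 0.45 log points even though each remaining worker earns less, because the exit of the lowest earner contributes a composition effect of 0.50.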

### **Empirical Strategy and Identification**

To empirically test their framework, the authors leverage a powerful quasi-experimental setting:

*   **The Shock:** A 1991 policy that allowed Czech workers to commute to jobs in the German border region without granting them residence rights. This represents a clean labor supply shock, as the commuters earned money in Germany but spent most of it in the Czech Republic, minimizing local demand effects.
*   **Identification:** The immigrant inflow was geographically concentrated, declining sharply with distance from the border. This allows the authors to use an **instrumental variable (IV)** strategy, instrumenting the actual immigrant inflow in a municipality with its distance to the border (and its square). This addresses the endogeneity concern that immigrants might be drawn to areas with better economic prospects.
*   **Data:** The analysis relies on German Social Security Records (IEB), a high-quality longitudinal administrative dataset covering the universe of workers in the social security system. This data is crucial as it allows them to track individual workers over time and across different employers, regions, and employment statuses.
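
The distance-based instruments can be sketched as follows. This example uses the haversine great-circle formula in NumPy instead of the `pyproj` transformations the notebook imports, and the coordinates and function names are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def distance_instruments(muni_lat, muni_lon, cross_lat, cross_lon):
    """Return [distance, distance^2] to the nearest crossing per municipality."""
    # Pairwise distances: rows are municipalities, columns are crossings.
    d = haversine_km(
        np.asarray(muni_lat)[:, None],
        np.asarray(muni_lon)[:, None],
        np.asarray(cross_lat)[None, :],
        np.asarray(cross_lon)[None, :],
    )
    nearest = d.min(axis=1)
    return np.column_stack([nearest, nearest**2])


# Hypothetical coordinates: the first municipality sits on a crossing.
muni_lat, muni_lon = [49.0, 49.0], [12.5, 12.0]
cross_lat, cross_lon = [49.0, 50.0], [12.5, 12.5]
Z = distance_instruments(muni_lat, muni_lon, cross_lat, cross_lon)
print(Z)
```

The two columns of `Z` correspond to the linear and squared distance terms used as excluded instruments in the first stage.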

### **Main Empirical Findings on Employment**

The results provide striking evidence for the paper's core thesis. For the period 1990-1993:

*   **Regional Effect:** A 1 percentage point increase in the Czech immigrant share *decreased* total native employment in a municipality by **0.87%**. This is a large, significant effect, consistent with some of the more pessimistic findings in the literature.
*   **Displacement Effect:** The same shock increased the probability that an *incumbent* native worker lost their job by only **0.14%**. This effect is small and fades to zero after five years.
*   **The Source of the Gap:** The massive difference is almost entirely explained by the **crowding-out effect**. The immigrant inflow reduced the rate of new native hiring by **0.77%**.

**Interpretation:** Immigration did not cause widespread job losses for existing native workers. Instead, it primarily reduced job-finding opportunities for natives who were not already employed in those specific local labor markets.

### **Main Empirical Findings on Wages and Elasticities**

The wage results tell a parallel story. For the period 1990-1993:

*   **Regional Effect:** The estimated effect on regional average wages was close to zero (-0.008) and statistically insignificant. A researcher using cross-sectional data would conclude that immigration had no wage impact.
*   **"Pure" Wage Effect:** By tracking incumbent workers, the authors find a statistically significant negative effect. A 1 percentage point increase in the immigrant share *decreased* the wages of continuously employed natives by **0.19%**.
*   **The Source of the Gap:** The regional wage effect is biased toward zero due to **positive compositional selection**. The workers entering the affected labor markets were, on average, of higher "quality" (in terms of productivity) than the incumbents, while those leaving were also positively selected. This change in workforce composition almost perfectly offset the negative pure wage effect.

**Implication for Economic Parameters:** Using the "pure" wage effect and the decomposed employment effects allows for a credible estimation of the inverse labor demand elasticity ($\phi$), which they find to be -1.95 (implying a demand elasticity of -0.51). This is a standard value in the literature. In contrast, using the regional wage effect would lead to a near-infinite and implausible labor demand elasticity.
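
The conversion between the two elasticities reported above is a simple reciprocal. Writing the labor demand elasticity as $\varepsilon^{D}$ (a symbol introduced here for illustration, not taken from the paper):

$$
\varepsilon^{D} = \frac{1}{\phi} = \frac{1}{-1.95} \approx -0.51
$$

The same reciprocal explains the implausible result from the regional estimate: as the estimated inverse elasticity approaches zero, $|1/\phi|$ grows without bound.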

### **Heterogeneity and Occupational Upgrading**

The authors extend the analysis to explore heterogeneous effects and other adjustment margins:

*   **Vulnerable Groups:** The negative effects are concentrated on specific groups. **Non-employed natives** and **older workers (50+)** experience significantly larger displacement and wage loss effects than the average incumbent.
*   **Occupational "Upgrading":** The Czech commuters primarily entered routine-task occupations. While this led to a decline in native routine employment, the authors use their longitudinal data to show this was **not** because incumbent natives were "upgrading" to abstract jobs. Instead, the adjustment came from fewer new native workers entering routine jobs. They find evidence for a different kind of upgrading: young natives in affected regions were more likely to enter apprenticeship training rather than low-skilled employment, an investment in human capital.

### **Conclusion and Broader Implications**

The paper's conclusion is a powerful methodological warning. The effects of economic shocks **on places are not the same as the effects on people.** Regional-level estimates can be deeply misleading because they mask crucial dynamics of worker flows and compositional changes.

The framework is highly portable. The same critique applies to other fields that rely on spatial correlation designs to study the impacts of local shocks, such as import competition (trade), automation (robotics), or fiscal policy. Without longitudinal data, it is difficult to know whether a negative regional employment effect reflects the firing of incumbent workers or a reduction in new hires, two phenomena with vastly different welfare implications.

# Import Essential Modules

#!/usr/bin/env python3
# ==============================================================================#
#
#  A General Decomposability Toolkit for Auditing the True Human Impact of
#  Localized Economic Shocks
#
#  This module provides a complete, production-grade implementation of the
#  analytical framework presented in "The Effects of Immigration on Places
#  and People - Identification and Interpretation" by Dustmann, Otten,
#  Schönberg, and Stuhler (2025). It delivers a robust toolkit for decomposing
#  aggregate, "places-based" economic impacts into their underlying, policy-
#  relevant "people-based" components, such as displacement, crowding-out,
#  and selection effects.
#
#  Core Methodological Components:
#  • Causal effect estimation of local shocks via Instrumental Variables (2SLS).
#  • Decomposition of regional employment effects into displacement, crowding-out,
#    and relocation flows using longitudinal worker data.
#  • Decomposition of regional wage effects into a "pure" price effect (purged
#    of selection) and composition effects from worker inflows and outflows.
#  • First-Difference IV (FD-IV) models on individual panel data to identify
#    pure wage effects and test for occupational upgrading.
#  • Tobit-style imputation for right-censored wage data.
#  • Recovery of structural economic parameters (labor demand/supply elasticities)
#    from reduced-form estimates.
#
#  Technical Implementation Features:
#  • A complete, end-to-end pipeline from raw data validation to final output generation.
#  • Robust inference via from-scratch Wild Cluster Bootstrap implementation.
#  • A modular system of orchestrated, self-contained analysis functions.
#  • Geospatial instrument construction (distance to border).
#  • A comprehensive suite of robustness and sensitivity checks, including
#    pre-trend analysis and pseudo-panel estimation.
#
#  Paper Reference:
#  Dustmann, C., Otten, S., Schönberg, U., & Stuhler, J. (2025). The Effects of
#  Immigration on Places and People - Identification and Interpretation.
#  Journal of Labor Economics.
#  arXiv: https://arxiv.org/abs/2510.24225
#  DOI: https://doi.org/10.1086/739196
#
#  Author: CS Chirinda
#  License: MIT
#  Version: 1.0.0
#
# ==============================================================================#

# ==============================================================================
# Consolidated Imports for the Entire Analysis Pipeline
# ==============================================================================

# --- Core Data Handling and Numerical Computation ---
import numpy as np
import pandas as pd

# --- File System, Caching, and Utilities ---
import time
import pickle
import copy
import shutil
import re
from pathlib import Path
from typing import Dict, Any, List, Tuple, Set

# --- Econometrics and Statistical Modeling ---
import statsmodels.api as sm
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS
from scipy.stats import norm
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

# --- Configuration Validation ---
from pydantic import BaseModel, Field, validator, ValidationError, conint, conlist, confloat
from typing import Literal

# --- Geospatial Analysis ---
from pyproj import CRS, Transformer

# --- Visualization ---
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator


# Implementation

## Draft 1

### Documentation of All Orchestrator Callables

#### **Task 1: `validate_consolidated_df_raw`**

*   **Inputs:**
    *   `consolidated_df_raw` (pd.DataFrame): The raw, unprocessed spell-level data.
    *   `master_config` (Dict): The main configuration dictionary.
*   **Processes:**
    1.  **Schema Validation:** Verifies that the input DataFrame has the correct MultiIndex (`Worker_ID`, `Start_Date`), that all required columns are present, and that each column has a compatible data type (e.g., integer, float, datetime).
    2.  **Integrity Validation:** Checks for internal consistency. It asserts that the MultiIndex is unique, that `Spell_Sequence_ID` is contiguous and monotonic for each worker, that `Start_Date <= End_Date` for all spells, and that the `Is_TopCoded_Wage` flag is logically consistent with the wage and cap values.
    3.  **Geospatial/Policy Validation:** Confirms that coordinate columns are valid numbers and that geographic identifiers (`Municipality_ID`) are consistently mapped to higher-level geographies (`District_ID`). It also cross-validates the `Is_Border_Region` flag in the data against the district lists in the `master_config`.
*   **Outputs:**
    *   `bool`: Returns `True` if all checks pass. Raises a detailed `ValueError` if any check fails.
*   **Role in Research Pipeline:** This function serves as the primary gateway to the entire pipeline. It is the first line of defense against corrupt or malformed input data, ensuring that all subsequent processing steps can rely on a dataset with a known, valid structure and internal consistency. It enforces the foundational assumptions about the raw data's quality.
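
The integrity checks in step 2 can be sketched with plain `pandas`. This is a minimal illustration, not the toolkit's implementation: the helper name `check_spell_integrity` is hypothetical, and it assumes that `Spell_Sequence_ID` starts at 1 for each worker and increases by one in `Start_Date` order.

```python
import pandas as pd


def check_spell_integrity(spells: pd.DataFrame) -> bool:
    """Illustrative integrity checks; raises ValueError on the first failure."""
    # The (Worker_ID, Start_Date) MultiIndex must uniquely identify a spell.
    if not spells.index.is_unique:
        raise ValueError("Duplicate (Worker_ID, Start_Date) index entries.")

    # Every spell must start on or before the day it ends.
    start = spells.index.get_level_values("Start_Date").to_numpy()
    if (start > spells["End_Date"].to_numpy()).any():
        raise ValueError("Found spells with Start_Date after End_Date.")

    # Assumption: within each worker, IDs run 1, 2, 3, ... in date order.
    ordered = spells.sort_index()
    expected = ordered.groupby(level="Worker_ID").cumcount() + 1
    if not (ordered["Spell_Sequence_ID"] == expected).all():
        raise ValueError("Spell_Sequence_ID is not contiguous and monotonic.")
    return True


toy = pd.DataFrame(
    {
        "Worker_ID": [1, 1, 2],
        "Start_Date": pd.to_datetime(["1988-01-01", "1989-03-01", "1990-06-01"]),
        "End_Date": pd.to_datetime(["1989-02-28", "1990-12-31", "1991-05-31"]),
        "Spell_Sequence_ID": [1, 2, 1],
    }
).set_index(["Worker_ID", "Start_Date"])

print(check_spell_integrity(toy))
```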

#### **Task 2: `validate_artifacts_and_config`**

*   **Inputs:**
    *   `master_config` (Dict): The main configuration dictionary.
    *   `consolidated_df_raw` (pd.DataFrame): The raw data, used for cross-validation.
*   **Processes:**
    1.  **Artifact Validation:** Loads auxiliary data files specified in the config (e.g., `BORDER_CROSSING_TABLE`, `ROUTINE_ABSTRACT_MAPPING`). It validates their file paths, schemas, and content. A critical check ensures that the task mapping provides complete coverage for all `Occupation_Code`s found in the main dataset.
    2.  **Configuration Validation:** Uses a `Pydantic`-based schema to rigorously validate the structure, types, and specific values of critical parameters within the `master_config` dictionary (e.g., `BASE_YEAR`, `BOOTSTRAP_REPLICATIONS`).
*   **Outputs:**
    *   `Dict[str, pd.DataFrame]`: A dictionary containing the loaded and validated auxiliary DataFrames (e.g., `{'border_crossings': df, 'task_mapping': df}`).
*   **Role in Research Pipeline:** This function ensures the reproducibility and correctness of the analysis by validating all external dependencies. It guarantees that the parameters governing the analysis are exactly as specified by the replication instructions and that all necessary side-files (like geographic coordinates or occupational mappings) are present and correct.

#### **Task 3: `cleanse_and_canonicalize_spells`**

*   **Inputs:**
    *   `consolidated_df_raw` (pd.DataFrame): The validated raw spell data.
    *   `master_config` (Dict): The main configuration dictionary.
*   **Processes:**
    1.  **Normalization:** Coerces all columns to their canonical data types (e.g., strings to `datetime`) and standardizes string formats (e.g., zero-padding `Municipality_ID`). This step includes immediate self-validation to fail fast if coercion introduces errors.
    2.  **Spell Resolution:** Transforms the spell-level data into a worker-year panel. It uses a vectorized algorithm to identify the single "main job" for each worker at each annual snapshot date (June 30th), applying a deterministic tie-breaking rule (highest wage, then longest duration, then earliest start date).
    3.  **Flagging and Filtering:** Computes worker `age` and creates a comprehensive set of boolean flags (`is_full_time`, `is_older_worker`, `is_apprentice`) on the full panel. It then creates a second, filtered version of the panel that constitutes the main analysis sample by applying age and employment-type restrictions.
*   **Outputs:**
    *   `Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]`: A tuple containing three key DataFrames: the initial normalized spell data, the full unfiltered worker-year panel with all flags, and the filtered main analysis panel.
*   **Role in Research Pipeline:** This is the core data transformation step. It converts the raw, event-based spell data into the clean, structured panel format that is the foundation for nearly all subsequent analyses.
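
The tie-breaking rule in the spell-resolution step maps naturally onto a sort followed by a de-duplication. The sketch below is one way to express it, not the toolkit's implementation. The helper `select_main_job` and the `duration_days` column are hypothetical, and "duration" is assumed to mean the full length of the spell.

```python
import pandas as pd


def select_main_job(spells: pd.DataFrame, year: int) -> pd.DataFrame:
    """Pick one spell per worker among those active on June 30 of `year`."""
    snapshot = pd.Timestamp(year=year, month=6, day=30)
    active = spells[
        (spells["Start_Date"] <= snapshot) & (spells["End_Date"] >= snapshot)
    ].copy()
    # Assumption: duration is measured over the whole spell.
    active["duration_days"] = (active["End_Date"] - active["Start_Date"]).dt.days
    # Tie-break: highest wage, then longest duration, then earliest start date.
    ranked = active.sort_values(
        ["Worker_ID", "Daily_Wage_EUR", "duration_days", "Start_Date"],
        ascending=[True, False, False, True],
    )
    main = ranked.drop_duplicates("Worker_ID", keep="first")
    return main.assign(snapshot_year=year).reset_index(drop=True)


spells = pd.DataFrame(
    {
        "Worker_ID": [1, 1, 2, 2, 3],
        "Start_Date": pd.to_datetime(
            ["1990-01-01", "1990-03-01", "1989-01-01", "1990-06-01", "1988-01-01"]
        ),
        "End_Date": pd.to_datetime(
            ["1990-12-31", "1990-09-30", "1990-12-31", "1990-07-31", "1989-12-31"]
        ),
        "Daily_Wage_EUR": [80.0, 95.0, 70.0, 70.0, 60.0],
    }
)

print(select_main_job(spells, 1990))
```

In this toy input, worker 1 is resolved by wage, worker 2 by duration (the wages tie), and worker 3 has no spell covering the snapshot date and therefore drops out of the employed panel.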

#### **Task 4: `impute_censored_wages`**

*   **Inputs:**
    *   `worker_year_panel` (pd.DataFrame): The cleansed worker-year panel from Task 3.
    *   `master_config` (Dict): The main configuration dictionary.
*   **Processes:**
    1.  **Preparation:** Subsets the data to full-time workers and creates log-transformed wage and censoring cap columns.
    2.  **Tobit Estimation:** For each cell defined by the interaction of `Gender_Code` and `District_ID`, it fits a censored-normal (Tobit) model via Maximum Likelihood Estimation to estimate the mean (`μ`) and standard deviation (`σ`) of the underlying latent log-wage distribution. It includes a robust fallback mechanism for sparse cells.
    3.  **Imputation:** For each censored wage observation, it replaces the log-cap value with the conditional expectation of the wage given that it is above the cap. This is calculated using the estimated `μ` and `σ` for that observation's cell.
        \[ E[\log w \mid \log w \geq c] = \mu + \sigma \frac{\phi\left(\frac{c - \mu}{\sigma}\right)}{1 - \Phi\left(\frac{c - \mu}{\sigma}\right)} \]
*   **Outputs:**
    *   `pd.DataFrame`: The input `worker_year_panel` with a new column, `log_wage_imputed`, containing the final, corrected log wages.
*   **Role in Research Pipeline:** This function corrects for the measurement error introduced by top-coding in administrative data. Under the censored-normal assumption, it replaces capped wages with their conditional expectation, which reduces the downward bias in measured wages and is essential for accurately estimating the wage effects of immigration (`γ^R` and `γ^W`).
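
The conditional expectation in step 3 is straightforward to evaluate once `μ` and `σ` are known. The following sketch uses only the standard library; the function names and the example parameter values are illustrative assumptions, not estimates from the data.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_sf(z: float) -> float:
    """Standard normal survival function, 1 - Phi(z), via erfc for stability."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def expected_log_wage_above_cap(mu: float, sigma: float, log_cap: float) -> float:
    """E[log w | log w >= c] for a normal latent log wage with mean mu and sd sigma."""
    z = (log_cap - mu) / sigma
    return mu + sigma * normal_pdf(z) / normal_sf(z)


# Hypothetical cell parameters and a daily wage cap of 150.
mu, sigma = 4.4, 0.4
log_cap = math.log(150.0)
imputed = expected_log_wage_above_cap(mu, sigma, log_cap)
print(f"log cap = {log_cap:.4f}, imputed log wage = {imputed:.4f}")
```

The imputed value always lies above the log cap, because it is the mean of the latent distribution truncated from below at `c`.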

#### **Task 5: `build_analysis_panel`**

*   **Inputs:**
    *   `analysis_panel_employed` (pd.DataFrame): The panel of employed workers with imputed wages.
    *   `all_spells_cleansed` (pd.DataFrame): The full, normalized spell data.
    *   `validated_artifacts` (Dict): The dictionary of auxiliary data.
    *   `master_config` (Dict): The main configuration dictionary.
*   **Processes:**
    1.  **Grid Construction:** Creates a complete, balanced panel grid of all unique workers across all study years. It merges the employed panel onto this grid to identify non-employed worker-years.
    2.  **Lookback Imputation:** For workers non-employed in 1990, it searches their 1986-1989 spell history to find their last known job characteristics, which are then attached to their 1990 observation.
    3.  **Variable Enrichment:** Adds all remaining analysis variables to the full panel: `age_sq`, task labels (`Routine_or_Abstract_Label`, `Abstract_Intensity`) from the auxiliary mapping, `fte_weight`, nationality flags (`is_native`, `is_czech`), and categorical groups for the pseudo-panel analysis.
*   **Outputs:**
    *   `pd.DataFrame`: The definitive, fully specified `analysis_panel` where each row is a worker-year and all columns needed for any subsequent analysis are present.
*   **Role in Research Pipeline:** This function produces the single, master micro-dataset for the project. All subsequent aggregation and estimation tasks draw from this single source of truth, ensuring consistency.
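
As a rough illustration of the grid-construction step, a balanced worker-year grid can be built with `pd.MultiIndex.from_product` and left-joined to the employed panel. The `is_employed` flag and the toy data are assumptions made for this sketch.

```python
import pandas as pd

# Toy employed panel: worker 2 has no observation in 1990.
employed = pd.DataFrame(
    {
        "Worker_ID": [1, 1, 2],
        "snapshot_year": [1990, 1991, 1991],
        "Municipality_ID": ["08001", "08001", "08002"],
    }
)
study_years = [1990, 1991]

# One row for every (worker, year) combination.
grid = pd.MultiIndex.from_product(
    [employed["Worker_ID"].unique(), study_years],
    names=["Worker_ID", "snapshot_year"],
).to_frame(index=False)

# Rows that exist only in the grid are non-employed worker-years.
panel = grid.merge(
    employed, on=["Worker_ID", "snapshot_year"], how="left", indicator=True
)
panel["is_employed"] = panel["_merge"] == "both"
panel = panel.drop(columns="_merge")

print(panel)
```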

#### **Task 6: `aggregate_to_regional_panel`**

*   **Inputs:**
    *   `analysis_panel` (pd.DataFrame): The final worker-year panel from Task 5.
    *   `master_config` (Dict): The main configuration dictionary.
*   **Processes:**
    1.  **Employment Aggregation:** Groups the `analysis_panel` by `Municipality_ID` and `snapshot_year` to compute total FTE employment (`Total_rt`) and native FTE employment (`E_rt`).
    2.  **Wage Aggregation:** For the subset of native, full-time workers, it computes the mean of `log_wage_imputed` for each municipality-year (`log_w̄_rt`).
    3.  **National Wage Series:** It also computes the national average of `log_wage_imputed` for each year, which is needed for the non-employed imputation in Task 15.
*   **Outputs:**
    *   `Tuple[pd.DataFrame, pd.Series]`: A tuple containing (1) the `regional_panel` DataFrame with all aggregated municipality-year variables, and (2) the `national_wage_series`.
*   **Role in Research Pipeline:** This function transforms the micro-level data into the macro-level (regional) data needed for the "places-based" analyses.
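
The two aggregation steps amount to grouped sums and means. A minimal sketch on toy data follows; the column names `fte_weight`, `is_native`, `is_full_time`, and `log_wage_imputed` follow the descriptions above, while `log_wbar_rt` is an ASCII stand-in for `log_w̄_rt`.

```python
import pandas as pd

panel = pd.DataFrame(
    {
        "Municipality_ID": ["08001", "08001", "08001", "08002"],
        "snapshot_year": [1990, 1990, 1990, 1990],
        "fte_weight": [1.0, 0.5, 1.0, 1.0],
        "is_native": [True, True, False, True],
        "is_full_time": [True, False, True, True],
        "log_wage_imputed": [4.4, 4.0, 4.2, 4.6],
    }
)
keys = ["Municipality_ID", "snapshot_year"]

# Total and native FTE employment per municipality-year.
total = panel.groupby(keys)["fte_weight"].sum().rename("Total_rt")
native = panel[panel["is_native"]].groupby(keys)["fte_weight"].sum().rename("E_rt")

# Mean imputed log wage among native, full-time workers only.
wage_sample = panel[panel["is_native"] & panel["is_full_time"]]
wage = wage_sample.groupby(keys)["log_wage_imputed"].mean().rename("log_wbar_rt")

regional_panel = pd.concat([total, native, wage], axis=1)
print(regional_panel)
```

Municipality-years without any native or full-time native workers would show `NaN` in the corresponding column after the concatenation.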

#### **Task 7: `construct_immigration_shock`**

*   **Inputs:**
    *   `analysis_panel` (pd.DataFrame): The worker-year panel.
    *   `regional_panel` (pd.DataFrame): The municipality-year panel.
    *   `master_config` (Dict): The main configuration dictionary.
*   **Processes:**
    1.  **Numerator Calculation:** Computes the change in the FTE of Czech workers in each municipality between the specified start and end years (1990-1992 for the main shock, 1990-1991 for the 1991 event-year shock).
    2.  **Denominator Calculation:** Extracts the total FTE employment of all workers in each municipality in the baseline year (1990).
    3.  **Shock Calculation:** Divides the numerator by the denominator for each municipality.
        \[ \Delta I_r = \frac{\text{Czech}^{92}_r - \text{Czech}^{90}_r}{\text{Total}^{90}_r} \]
*   **Outputs:**
    *   `pd.DataFrame`: A DataFrame indexed by `Municipality_ID` containing the two shock variables (`shock_main` and `shock_1991`).
*   **Role in Research Pipeline:** This function constructs the key independent variable (the "treatment") for all causal estimations.
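
The shock formula can be reproduced on a small municipality-year table. In this sketch the helper `immigration_shock` and the columns `czech_fte` and `total_fte` are hypothetical names chosen for illustration.

```python
import pandas as pd

fte = pd.DataFrame(
    {
        "Municipality_ID": ["08001"] * 3 + ["08002"] * 3,
        "snapshot_year": [1990, 1991, 1992] * 2,
        "czech_fte": [2.0, 6.0, 12.0, 1.0, 1.0, 2.0],
        "total_fte": [100.0, 104.0, 110.0, 200.0, 199.0, 201.0],
    }
)


def immigration_shock(
    fte: pd.DataFrame, start_year: int, end_year: int, base_year: int = 1990
) -> pd.Series:
    """Change in Czech FTE between two years, scaled by baseline total FTE."""
    czech = fte.pivot(
        index="Municipality_ID", columns="snapshot_year", values="czech_fte"
    )
    total_base = fte.loc[fte["snapshot_year"] == base_year].set_index(
        "Municipality_ID"
    )["total_fte"]
    return (czech[end_year] - czech[start_year]) / total_base


shocks = pd.DataFrame(
    {
        "shock_main": immigration_shock(fte, 1990, 1992),
        "shock_1991": immigration_shock(fte, 1990, 1991),
    }
)
print(shocks)
```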

#### **Task 8: `construct_instrumental_variables`**

*   **Inputs:**
    *   `analysis_panel` (pd.DataFrame): The worker-year panel (to get municipality coordinates).
    *   `validated_artifacts` (Dict): Contains the validated `border_crossings` DataFrame.
    *   `master_config` (Dict): The main configuration dictionary.
*   **Processes:**
    1.  **Coordinate Harmonization:** Transforms the municipality and border crossing coordinates into a consistent system (e.g., WGS84 for great-circle distance) using `pyproj`.
    2.  **Distance Calculation:** For each municipality, it calculates the minimum distance to any border crossing using either the Haversine formula or Euclidean distance on projected coordinates. This is done efficiently using vectorized `numpy` or `scipy` operations.
    3.  **Instrument Construction:** Creates the final instruments: `distance_to_border` and its square, `distance_to_border_sq`.
*   **Outputs:**
    *   `pd.DataFrame`: A DataFrame indexed by `Municipality_ID` containing the instrumental variables.
*   **Role in Research Pipeline:** This function constructs the instrumental variables used to address the endogeneity of immigrant location choices, which is the cornerstone of the study's causal identification strategy.
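
For the Euclidean variant of the distance step, broadcasting gives the full municipality-by-crossing distance matrix in one expression. The sketch assumes both coordinate sets are already in the same projected system with units in metres (for example, a single UTM zone); the helper name `min_distance_km` is hypothetical.

```python
import numpy as np


def min_distance_km(muni_xy: np.ndarray, crossing_xy: np.ndarray) -> np.ndarray:
    """Minimum Euclidean distance (km) from each municipality to any crossing."""
    # Shapes: (n_muni, 1, 2) - (1, n_cross, 2) -> (n_muni, n_cross, 2).
    diff = muni_xy[:, None, :] - crossing_xy[None, :, :]
    dist_m = np.sqrt((diff**2).sum(axis=2))
    return dist_m.min(axis=1) / 1000.0


# Toy coordinates in metres.
municipalities_xy = np.array([[500000.0, 5450000.0], [503000.0, 5454000.0]])
crossings_xy = np.array([[500000.0, 5450000.0], [510000.0, 5450000.0]])

distance_to_border = min_distance_km(municipalities_xy, crossings_xy)
distance_to_border_sq = distance_to_border**2
print(distance_to_border, distance_to_border_sq)
```

For very large inputs the intermediate three-dimensional array can become memory-heavy, in which case `scipy.spatial.distance.cdist` or a chunked loop is a reasonable alternative.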

#### **Task 9: `prepare_event_study_dataset`**

*   **Inputs:** The `regional_panel`, `shock_df`, `instruments_df`, and `analysis_panel`.
*   **Processes:**
    1.  **Outcome Calculation:** Computes the final outcome variables for the regional analyses by differencing the level variables in the `regional_panel` relative to the 1990 baseline.
    2.  **Data Assembly:** Merges the outcomes, shocks, instruments, baseline weights, and cluster IDs (`District_ID`) into a single, long-format DataFrame indexed by `(Municipality_ID, snapshot_year)`.
    3.  **Sample Filtering:** Applies the final sample filter, keeping only municipalities in the treated and matched control districts.
*   **Outputs:**
    *   `pd.DataFrame`: The final, analysis-ready `event_study_df` for all regional 2SLS estimations.
*   **Role in Research Pipeline:** This is the final data assembly step, creating the master dataset for all "places-based" estimations.
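
The outcome-calculation step is a difference against each municipality's own 1990 value. One possible way to write it is shown below; the helper `difference_from_baseline`, the `d_` prefix, and the column `log_wage_mean` are naming assumptions for this sketch.

```python
import pandas as pd

regional = pd.DataFrame(
    {
        "Municipality_ID": ["08001"] * 3 + ["08002"] * 3,
        "snapshot_year": [1989, 1990, 1992] * 2,
        "log_wage_mean": [4.30, 4.32, 4.35, 4.40, 4.41, 4.47],
    }
)


def difference_from_baseline(
    panel: pd.DataFrame, column: str, base_year: int = 1990
) -> pd.DataFrame:
    """Add d_<column>: the level minus the municipality's base-year level."""
    base = panel.loc[panel["snapshot_year"] == base_year].set_index(
        "Municipality_ID"
    )[column]
    out = panel.copy()
    out[f"d_{column}"] = out[column] - out["Municipality_ID"].map(base)
    return out.set_index(["Municipality_ID", "snapshot_year"])


diffs = difference_from_baseline(regional, "log_wage_mean")
print(diffs)
```

Pre-treatment years receive differences as well (here 1989), which is what the pre-trend estimations rely on.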

#### **Task 10: `estimate_regional_effect_2sls`**

*   **Inputs:** The `event_study_df`, `master_config`, an `outcome_variable` name, and an `event_year`.
*   **Processes:**
    1.  **First Stage:** Estimates the weighted first-stage regression of the shock on the instruments and calculates the F-statistic for instrument strength.
        \[ \Delta I_r = \pi_0 + \pi_1 \,\text{Distance}_r + \pi_2 \,\text{DistanceSq}_r + u_r \]
    2.  **Second Stage:** Estimates the weighted second-stage regression of the specified outcome on the predicted shock from the first stage.
        \[ Y_r = \alpha + \beta \,\widehat{\Delta I_r} + \epsilon_r \]
    3.  **Inference:** Computes cluster-robust standard errors and runs a full wild cluster bootstrap to generate a robust confidence interval.
*   **Outputs:**
    *   `Dict[str, Any]`: A dictionary containing the point estimate, standard error, p-value, F-statistic, and bootstrap confidence interval.
*   **Role in Research Pipeline:** This is the master estimation engine for all regional ("places-based") analyses in the paper.
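
To make the two stages concrete, here is a minimal weighted 2SLS point estimator written in plain `numpy`. It is a sketch under simplifying assumptions: it returns coefficients only, omits the F-statistic, cluster-robust standard errors, and the wild cluster bootstrap, and runs on synthetic data. The function name `weighted_2sls` is hypothetical and this is not how the toolkit calls `linearmodels`.

```python
import numpy as np


def weighted_2sls(y, shock, instruments, weights):
    """Return [alpha, beta] from weighted 2SLS with one endogenous regressor."""
    n = len(y)
    root_w = np.sqrt(weights)
    # Weighting is applied by scaling every row by the square root of its weight.
    X = np.column_stack([np.ones(n), shock]) * root_w[:, None]
    Z = np.column_stack([np.ones(n), instruments]) * root_w[:, None]
    y_w = y * root_w
    # First stage: project the regressors on the instruments.
    first_stage, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ first_stage
    # Second stage: regress the outcome on the fitted regressors.
    coef, *_ = np.linalg.lstsq(X_hat, y_w, rcond=None)
    return coef


# Synthetic data: the shock falls with distance and shares a confounder with y.
rng = np.random.default_rng(0)
n = 200
distance = rng.uniform(0.0, 80.0, n)
confounder = rng.normal(0.0, 0.01, n)
shock = 0.10 - 0.002 * distance + 0.00001 * distance**2 + confounder
outcome = 0.5 - 2.0 * shock + 5.0 * confounder + rng.normal(0.0, 0.01, n)
weights = rng.uniform(50.0, 500.0, n)
instruments = np.column_stack([distance, distance**2])

alpha_hat, beta_hat = weighted_2sls(outcome, shock, instruments, weights)
print(f"2SLS estimate of beta: {beta_hat:.3f}")
```

Running the second stage manually on fitted values yields the correct point estimate but not the correct standard errors, which is one reason a dedicated routine such as `IV2SLS` is preferable for inference.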

#### **Task 11: `decompose_regional_employment_effect`**

*   **Inputs:** The `analysis_panel`, `regional_panel`, `event_study_df`, `master_config`, and `event_year`.
*   **Processes:**
    1.  **Component Construction:** Calls a helper to compute the shares of the three micro-level flows (displacement, inflow/crowding-out, relocation) that constitute the total regional employment change.
    2.  **Iterative Estimation:** Systematically calls `estimate_regional_effect_2sls` for the total effect and for each of the three flow components.
    3.  **Identity Verification:** Checks that the estimated coefficients satisfy the decomposition identity: `β̂^R ≈ (-δ̂_displacement) + δ̂_inflows - δ̂_relocation`.
*   **Outputs:**
    *   `Dict[str, Any]`: A nested dictionary containing the estimation results for the total effect and each component.
*   **Role in Research Pipeline:** This function implements the paper's first main contribution: decomposing the aggregate employment effect to distinguish between job loss for incumbents (displacement) and reduced hiring of new entrants (crowding-out).
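
The identity in step 3 rests on a worker-level accounting identity: everyone employed in a region at baseline either stays, becomes non-employed (displacement), or moves to a job elsewhere (relocation), and everyone employed there at the end is either a stayer or an inflow. The toy example below illustrates this for a single region. The flow definitions, column names, and helper are assumptions made for the sketch and may differ in detail from the toolkit's.

```python
import pandas as pd

# Region of employment in each year; None means non-employed.
workers = pd.DataFrame(
    {
        "Worker_ID": [1, 2, 3, 4, 5, 6],
        "region_1990": ["A", "A", "A", "A", None, "B"],
        "region_1992": ["A", "A", None, "B", "A", "A"],
    }
)


def employment_flow_shares(workers: pd.DataFrame, region: str) -> dict:
    """Flows into and out of `region`, each as a share of baseline employment."""
    in_base = workers["region_1990"] == region
    in_end = workers["region_1992"] == region
    employed_end = workers["region_1992"].notna()
    base = in_base.sum()
    return {
        "total": (in_end.sum() - base) / base,
        "displacement": (in_base & ~employed_end).sum() / base,
        "relocation": (in_base & employed_end & ~in_end).sum() / base,
        "inflows": (~in_base & in_end).sum() / base,
    }


shares = employment_flow_shares(workers, "A")
rhs = -shares["displacement"] + shares["inflows"] - shares["relocation"]
print(shares, rhs)
```

Because 2SLS is linear in the outcome, coefficients estimated on these component shares add up in the same way as the shares themselves, provided each regression uses the same sample, weights, and instruments.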

#### **Task 12: `estimate_wage_effects`**

*   **Inputs:** The `event_study_df`, `analysis_panel`, `master_config`, and `event_year`.
*   **Processes:**
    1.  **Regional Effect (`γ^R`):** Calls `estimate_regional_effect_2sls` with `wage_outcome` as the dependent variable.
    2.  **Pure Effect (`γ^W`):** Calls a dedicated helper (`_estimate_pure_wage_effect_stayers`) that implements an individual-level First-Difference IV (FD-IV) model for the sub-sample of "stayers." This model controls for time-invariant individual heterogeneity.
        \[ \Delta \log w_{ir} = \gamma^W \,\Delta I_r + \delta' \Delta X_i + \Delta e_{ir} \]
*   **Outputs:**
    *   `Dict[str, Dict]`: A dictionary containing the full results for both the regional and pure wage effect estimations.
*   **Role in Research Pipeline:** This function implements the paper's second main contribution: empirically demonstrating the divergence between the naive regional wage effect and the true "pure" price effect of immigration.

#### **Task 13: `decompose_regional_wage_effect`**

*   **Inputs:** The `analysis_panel`, `event_study_df`, the results from Task 12, `master_config`, and `event_year`.
*   **Processes:**
    1.  **Component Construction:** Calls a vectorized helper to compute the municipality-level components of wage change due to stayers' wage growth, selective outflows, and selective inflows, as per Equation (6).
    2.  **Iterative Estimation:** Calls `estimate_regional_effect_2sls` for each of these components.
    3.  **Reconciliation:** Reconciles the estimated effects on the components with the previously estimated `γ̂^R` and `γ̂^W` to isolate the "age selection" effect and verify the full decomposition identity.
*   **Outputs:**
    *   `Dict[str, Any]`: A nested dictionary containing the full wage decomposition results, mirroring Table 2.
*   **Role in Research Pipeline:** This function provides the full explanation for *why* `γ̂^R` and `γ̂^W` differ, by quantifying the causal impact of immigration on the selective movement of workers.

#### **Task 14: `bound_pure_wage_effect_selection`**

*   **Inputs:** The `analysis_panel`, `event_study_df`, the results from the pure wage effect estimation, `master_config`, and `event_year`.
*   **Processes:**
    1.  **Probit Estimation:** Estimates a probit model of the probability that a 1990 incumbent "stays" as a function of the immigration shock.
    2.  **Component Calculation:** Calculates the necessary components for the bias formula: the standard deviation of the wage growth residuals (`σ̂_Δe`), the derivative of the inverse Mills ratio, and the probit coefficient.
    3.  **Bound Calculation:** Computes the maximum potential bias under worst-case assumptions about the correlation (`ρ = ±1`) between selection and the time-varying wage shock.
        \[ \text{Bias} = \rho \cdot \hat{\sigma}_{\Delta e} \cdot \left(\frac{\partial \lambda(\pi)}{\partial \pi}\right) \cdot \left(\frac{\partial \pi}{\partial \Delta I_r}\right) \]
*   **Outputs:**
    *   `Dict[str, Any]`: A dictionary containing the point estimate `γ̂^W`, the maximum bias, and the final upper and lower bounds.
*   **Role in Research Pipeline:** This is a key robustness check that addresses the final potential threat to the identification of the pure wage effect, showing that any remaining selection on time-varying unobservables is likely to be negligible.
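
The bias formula can be evaluated numerically once its ingredients are available. The sketch below assumes the inverse Mills ratio takes the common form `λ(z) = φ(z) / Φ(z)`, whose derivative is `-λ(z)(z + λ(z))`; the paper's exact definition may differ. All input values and function names are hypothetical placeholders, not estimates.

```python
import math


def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills(z: float) -> float:
    """Assumed form: lambda(z) = phi(z) / Phi(z)."""
    return normal_pdf(z) / normal_cdf(z)


def inverse_mills_derivative(z: float) -> float:
    """d lambda / dz = -lambda(z) * (z + lambda(z))."""
    lam = inverse_mills(z)
    return -lam * (z + lam)


def selection_bounds(gamma_w, sigma_resid, probit_index, probit_slope):
    """Bounds on the pure wage effect under rho = -1 and rho = +1."""
    max_bias = abs(
        sigma_resid * inverse_mills_derivative(probit_index) * probit_slope
    )
    return gamma_w - max_bias, gamma_w + max_bias


# Hypothetical inputs for illustration only.
lower, upper = selection_bounds(
    gamma_w=-0.10, sigma_resid=0.15, probit_index=0.8, probit_slope=-0.5
)
print(f"bounds on the pure wage effect: [{lower:.4f}, {upper:.4f}]")
```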

#### **Task 15: `analyze_non_employed_entrants`**

*   **Inputs:** The `analysis_panel`, `regional_panel`, `national_wage_series`, `event_study_df`, `master_config`, and `event_year`.
*   **Processes:**
    1.  **Cohort Identification:** Identifies the cohort of non-employed workers in 1990 with prior work history.
    2.  **Counterfactual Wage Imputation:** Calculates their counterfactual 1990 wage.
    3.  **Employment Effect:** Estimates a 2SLS model of the effect of the shock on this cohort's re-employment probability.
    4.  **Wage Effect:** Estimates an FD-IV model of the pure wage effect for the subset of this cohort that successfully finds a job.
*   **Outputs:**
    *   `Dict[str, Dict]`: A dictionary containing the results for both the employment and wage estimations for this subgroup.
*   **Role in Research Pipeline:** This function provides a direct test of the "crowding-out" mechanism on a clearly defined group of labor market entrants, showing they are more adversely affected than incumbents.

#### **Task 16: `analyze_older_workers`**

*   **Inputs:** The `analysis_panel`, `regional_panel`, `event_study_df`, `master_config`, and `event_year`.
*   **Processes:**
    1.  **Cohort Identification:** Identifies the cohort of workers aged 50+ in 1990.
    2.  **Displacement Effect:** Estimates a 2SLS model of the effect of the shock on the displacement rate for this specific cohort.
    3.  **Wage Effect:** Estimates an FD-IV model of the pure wage effect for the subset of this cohort who are "stayers."
*   **Outputs:**
    *   `Dict[str, Dict]`: A dictionary containing the results for both the displacement and wage estimations for this subgroup.
*   **Role in Research Pipeline:** This is a heterogeneity analysis that tests whether older workers are disproportionately affected by the immigration shock, a question that is frequently debated.

#### **Task 17: `analyze_task_heterogeneity`**

*   **Inputs:** The `analysis_panel`, `event_study_df`, `master_config`, and `event_year`.
*   **Processes:**
    1.  **Outcome Construction:** Computes task-specific outcomes (employment and wage changes for Routine vs. Abstract jobs, and the change in the Abstract share).
    2.  **Iterative Estimation:** Systematically re-runs the regional employment (2SLS) and pure wage (FD-IV) estimations for each task group separately, using task-specific weights.
    3.  **Share Estimation:** Estimates the 2SLS effect on the change in the regional share of abstract employment.
*   **Outputs:**
    *   `Dict[str, Any]`: A nested dictionary containing the results for all task-specific estimations.
*   **Role in Research Pipeline:** This heterogeneity analysis tests whether the impact of the (routine-biased) immigration shock was concentrated on the routine-task segment of the labor market.

#### **Task 18: `decompose_routine_employment`**

*   **Inputs:** The `analysis_panel`, `event_study_df`, `master_config`, and `event_year`.
*   **Processes:**
    1.  **Component Construction:** Implements the full, complex decomposition of routine employment change from Equation (8), calculating shares for displacement, crowding-out, relocation, within-region upgrading (R->A), and downgrading (A->R).
    2.  **Iterative Estimation:** Runs a 2SLS estimation for each of the six components.
    3.  **Continuous Upgrading Test:** Runs an FD-IV model on the change in the continuous `Abstract_Intensity` score for regional stayers.
*   **Outputs:**
    *   `Dict[str, Any]`: A nested dictionary containing the results for all seven estimations.
*   **Role in Research Pipeline:** This function provides the definitive "people-based" test of the occupational upgrading hypothesis, showing that the observed regional shift towards abstract tasks is not driven by individual workers switching jobs, but by other flows.

#### **Task 19: `analyze_apprenticeship_uptake`**

*   **Inputs:** The `analysis_panel_full`, `event_study_df`, and `master_config`.
*   **Processes:**
    1.  **Outcome Construction:** Computes the percentage change in native apprenticeship employment for each municipality relative to 1990.
    2.  **Event-Study Estimation:** Calls `estimate_regional_effect_2sls` for every year in the pre- and post-treatment periods.
*   **Outputs:**
    *   `List[Dict]`: A list of result dictionaries, one for each year, suitable for plotting Figure 3.
*   **Role in Research Pipeline:** This function tests an alternative educational upgrading margin, investigating whether young natives respond to increased low-skilled competition by investing more in vocational training.

#### **Task 20: `recover_structural_parameters`**

*   **Inputs:** A dictionary of the key reduced-form coefficients (`β̂^R`, `γ̂^R`, `γ̂^W`), the `analysis_panel`, and `master_config`.
*   **Processes:**
    1.  **Compute `c`:** Calculates the efficiency-to-headcount scaling factor.
    2.  **Recover Elasticities:** Uses the reduced-form coefficients to algebraically solve for the structural parameters `η̄^P`, `η̄^E`, and `φ`.
    3.  **Validation:** Re-calculates `φ` using the naive `γ̂^R` to demonstrate the magnitude of the bias.
*   **Outputs:**
    *   `Dict[str, Any]`: A dictionary containing the recovered structural parameters and the validation results.
*   **Role in Research Pipeline:** This function connects the empirical results back to economic theory, providing estimates of fundamental labor market parameters that are free from the composition bias that plagues much of the literature.

#### **Task 21: `execute_decomposition_toolkit_pipeline`**

*   **Inputs:** `raw_data_path`, `master_config`, and run-time flags.
*   **Processes:** This is the top-level master orchestrator. It executes the entire sequence of tasks from 1 to 20 in the correct order, managing all data dependencies and caching for the data preparation phase. It then proceeds to call the robustness and output compilation orchestrators.
*   **Outputs:**
    *   `Dict[str, Any]`: The final, comprehensive, nested dictionary containing every result generated by the entire project.
*   **Role in Research Pipeline:** It is the main entry point for the entire replication, so that a single call reproduces every result.

#### **Task 22: `run_robustness_checks`**

*   **Inputs:** The `event_study_df`, `analysis_panel`, `main_estimation_results`, and `master_config`.
*   **Processes:**
    1.  **Pre-Trend Estimation:** Calls the master estimation functions for the pre-treatment years (1987-1989) to test for parallel trends.
    2.  **Plotting:** Combines the pre- and post-treatment results and calls a plotting utility to generate the main event-study figures.
*   **Outputs:** None (generates and displays plots).
*   **Role in Research Pipeline:** This function validates the core identifying assumption of the study and visualizes the main dynamic findings.

#### **Task 23: `run_sensitivity_analyses`**

*   **Inputs:** The `event_study_df`, `master_config`, and `raw_data_path`.
*   **Processes:** Orchestrates a series of robustness checks by systematically altering the main analysis: (1) using an alternative first-stage specification, (2) re-running the entire pipeline with perturbed parameters, and (3) re-running the final estimation on restricted samples.
*   **Outputs:**
    *   `Dict[str, Any]`: A dictionary containing the results from each sensitivity check.
*   **Role in Research Pipeline:** This function probes the stability of the main findings to reasonable changes in methodological choices.

#### **Task 25: `compile_final_outputs`**

*   **Inputs:** The `final_results` dictionary from the main orchestrator and all key data artifacts.
*   **Processes:**
    1.  **Table Generation:** Calls a suite of helper functions to format the numerical results into publication-quality tables that replicate those in the paper.
    2.  **Figure Generation:** Calls a helper to run the necessary event-study estimations and then calls a plotting utility to generate the main figures.
    3.  **Summary Reporting:** Calls a helper to print a formatted summary of the recovered structural parameters.
*   **Outputs:** None (prints tables and summaries to the console, and displays plots).
*   **Role in Research Pipeline:** This is the final presentation layer, translating the raw numerical output of the pipeline into human-readable tables and figures.

<br><br>

## Usage Example

### Pre-Implementation Discussion: End-to-End Example

The goal is to create a self-contained, runnable example that demonstrates the full capability of the developed toolkit. This requires three main steps: (1) creating realistic synthetic data for all required inputs, (2) saving these to disk, and (3) writing a script that loads them and executes the main orchestrator.

*   **Data Structure:** The primary structures are the `consolidated_df_raw` DataFrame and the four auxiliary DataFrames. The synthetic data generation must respect the complex correlations and constraints between the variables (e.g., `Start_Date` must precede `End_Date`; wages in border regions might be systematically different; Czech workers are more likely to be in routine jobs).
*   **Implementation Accuracy:**
    1.  **Synthetic Data Generation:** We will use the `faker` library for generating realistic-looking but anonymized identifiers and names. For the core data structure, we will not use a high-level library like SDV, as it can be difficult to enforce the specific longitudinal and causal structures required here. Instead, we will write a procedural generation script. This script will simulate the study's data generating process:
        *   First, define a universe of municipalities (some border, some control) and workers (some German, some Czech).
        *   Simulate a "treatment" effect where border municipalities have a higher probability of receiving Czech workers after 1990.
        *   Generate spell histories for each worker, ensuring `Spell_Sequence_ID` is correct and dates are logical.
        *   Assign wages, occupations, and other characteristics based on plausible rules (e.g., Czech workers are assigned lower wages and higher probability of routine jobs).
        *   This procedural approach gives maximum control and ensures the synthetic data has the necessary variation for the econometric models to run successfully.
    2.  **File I/O:** All generated DataFrames will be saved to a local directory (e.g., `./study_data/`) using the efficient Parquet format for the main dataset and CSV for the smaller auxiliary files.
    3.  **Configuration Loading:** The `config.yaml` file will be loaded using the `PyYAML` library. The file paths within the YAML will be updated to point to the newly created synthetic data files.
    4.  **Pipeline Execution:** The final script will call `execute_decomposition_toolkit_pipeline`, passing the path to the synthetic raw data and the loaded configuration dictionary.
*   **Anticipated Challenges & Solutions:**
    *   **Challenge:** Creating synthetic data realistic enough for the econometric models (especially the 2SLS and FD-IV models) to run without errors and return interpretable estimates is non-trivial.
    *   **Solution:** The procedural generation script will be designed to explicitly build in the key relationships the study aims to uncover. For instance, the inflow of Czech workers will be programmatically linked to the `distance_to_border` of the municipality, ensuring the instrument is relevant in the synthetic data.
*   **Python Modules:** `pandas`, `numpy`, `faker` for data generation; `pyyaml` for loading the config.
*   **Completeness and Best Practices:** The example will be a complete, runnable script. It will start by creating its own data dependencies, making it fully self-contained and an excellent demonstration of the entire toolkit.

### Code Implementation: Full End-to-End Example

The following example is intended to be complete and runnable.

```python
# ==============================================================================
# Full End-to-End Usage Example for the Decomposition Toolkit
# ==============================================================================

import pandas as pd
import numpy as np
import yaml
from faker import Faker
from pathlib import Path
from typing import Dict, Any, List

# Assume all orchestrator functions (e.g., `execute_decomposition_toolkit_pipeline`)
# and their helpers are defined in, or imported into, the current scope.

# ------------------------------------------------------------------------------
# Step 1: Generate High-Fidelity Synthetic Data
# ------------------------------------------------------------------------------

def generate_synthetic_data(
    output_dir: str = "./study_data/",
    num_workers: int = 5000,
    num_municipalities: int = 100,
    num_border_crossings: int = 10
) -> None:
    """
    Generates and saves a complete set of high-fidelity synthetic data files.
    """
    print("--- Generating synthetic data files... ---")
    
    # Initialize Faker for generating random data.
    fake = Faker('de_DE')
    Faker.seed(0)
    np.random.seed(0)
    
    # Create output directory.
    data_path = Path(output_dir)
    data_path.mkdir(parents=True, exist_ok=True)

    # --- a. Generate Auxiliary Data ---

    # Border Crossings
    border_crossings = pd.DataFrame({
        'Crossing_ID': [f'BC_{i}' for i in range(num_border_crossings)],
        'Crossing_Name': [f'Crossing {chr(65+i)}' for i in range(num_border_crossings)],
        'Coord_X_UTM': np.random.uniform(450000, 550000, num_border_crossings),
        'Coord_Y_UTM': np.random.uniform(5400000, 5500000, num_border_crossings)
    })
    border_crossings.to_csv(data_path / "border_crossings.csv", index=False)

    # Task Mapping
    occupations = pd.DataFrame({
        'Occupation_Code_3digit': range(100, 400),
        'Routine_or_Abstract_Label': np.random.choice(['Routine', 'Abstract'], 300, p=[0.7, 0.3]),
        'Abstract_Intensity': np.random.rand(300)
    })
    occupations.to_csv(data_path / "bibb_iab_task_mapping.csv", index=False)

    # Matched Controls (simplified)
    all_districts = [f'D_{101+i}' for i in range(20)]
    matched_controls = pd.DataFrame({
        'District_ID': all_districts[13:],
        'Municipality_ID': [f'{90000+i:05d}' for i in range(7)]
    })
    matched_controls.to_csv(data_path / "matched_controls.csv", index=False)

    # --- b. Generate Main Spell Data (consolidated_df_raw) ---
    
    # Define universe of workers and municipalities.
    workers = pd.DataFrame({
        'Worker_ID': range(1, num_workers + 1),
        'Birth_Year': np.random.randint(1930, 1975, num_workers),
        'Gender_Code': np.random.randint(1, 3, num_workers),
        'Education_Level_Code': np.random.randint(1, 5, num_workers),
        'Nationality_Code': np.random.choice([1, 2, 3], num_workers, p=[0.95, 0.02, 0.03]) # 95% German
    })
    
    municipalities = pd.DataFrame({
        'Municipality_ID': [f'{80000+i:05d}' for i in range(num_municipalities)],
        'Municipality_Name': [fake.city() for _ in range(num_municipalities)],
        'Workplace_Coord_X_UTM': np.random.uniform(400000, 600000, num_municipalities),
        'Workplace_Coord_Y_UTM': np.random.uniform(5300000, 5600000, num_municipalities),
        'District_ID': np.random.choice(all_districts, num_municipalities)
    })
    municipalities['Is_Border_Region'] = municipalities['District_ID'].isin(all_districts[:13])
    municipalities['Is_Matched_Control'] = municipalities['District_ID'].isin(all_districts[13:])

    # Generate spells.
    spells = []
    for worker_id, worker in workers.iterrows():
        num_spells = np.random.randint(1, 8)
        # np.random.randint excludes its upper bound, so 13 is needed to allow December.
        current_date = pd.Timestamp(np.random.randint(1986, 1990), np.random.randint(1, 13), 1)
        for i in range(num_spells):
            start_date = current_date + pd.Timedelta(days=np.random.randint(0, 90))
            end_date = start_date + pd.Timedelta(days=np.random.randint(30, 365*2))
            if end_date.year > 1995: break
            
            muni = municipalities.sample(1).iloc[0]
            
            # Simulate Czech worker inflow to border regions post-1990
            if worker['Nationality_Code'] == 2 and start_date.year > 1990 and not muni['Is_Border_Region']:
                if np.random.rand() > 0.1: # 90% chance to be re-assigned to a border region
                    muni = municipalities[municipalities['Is_Border_Region']].sample(1).iloc[0]

            wage = np.random.normal(80, 20) - 10 * (worker['Nationality_Code']==2) # Lower wage for Czech
            cap = 150.0 # Simplified cap
            
            spells.append({
                'Worker_ID': worker['Worker_ID'],
                'Start_Date': start_date,
                'End_Date': end_date,
                'Spell_Sequence_ID': i + 1,
                'Daily_Wage_EUR': min(wage, cap),
                'Is_TopCoded_Wage': wage >= cap,
                'Social_Security_Cap_EUR': cap,
                'Municipality_ID': muni['Municipality_ID'],
                'District_ID': muni['District_ID'],
                'Is_Border_Region': muni['Is_Border_Region'],
                'Is_Matched_Control': muni['Is_Matched_Control'],
                'Occupation_Code': np.random.choice(occupations['Occupation_Code_3digit']),
                'Employment_Type_Code': np.random.choice([1, 5, 6], p=[0.8, 0.1, 0.1]),
                **worker.to_dict() # Add worker demographics
            })
            current_date = end_date
            
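    df_raw = pd.DataFrame(spells)
    # Attach the municipality attributes generated above so that names and
    # workplace coordinates are populated rather than null.
    df_raw = df_raw.merge(
        municipalities[['Municipality_ID', 'Municipality_Name',
                        'Workplace_Coord_X_UTM', 'Workplace_Coord_Y_UTM']],
        on='Municipality_ID', how='left'
    )
    # Fill the remaining schema columns with typed placeholders (synthetic values
    # chosen only to satisfy the object/int64 dtypes expected by the Task 1 checks).
    placeholder_defaults = {
        'Employer_ID': 'EMP_0', 'Establishment_ID': 'EST_0',
        'Reason_for_Termination': 0, 'Industry_Code': 0,
        'Firm_Size_Code': 0, 'State_ID': 0
    }
    for col, default in placeholder_defaults.items():
        if col not in df_raw.columns: df_raw[col] = default
    if 'Wage_Cap_Year' not in df_raw.columns:
        df_raw['Wage_Cap_Year'] = df_raw['Start_Date'].dt.year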
        
    df_raw = df_raw.set_index(['Worker_ID', 'Start_Date'])
    df_raw.to_parquet(data_path / "consolidated_df_raw.parquet")
    
    print(f"--- Synthetic data generation complete. Files saved in '{output_dir}'. ---")

# ------------------------------------------------------------------------------
# Step 2 & 3: Load Configuration and Execute the Pipeline
# ------------------------------------------------------------------------------

if __name__ == '__main__':
    
    # --- Step i) & ii): Create and save synthetic data files ---
    # This function creates all the necessary input files in a local directory.
    generate_synthetic_data(output_dir="./study_data/")

    # --- Step iii): Read the config.yaml file ---
    # Load the master configuration from the `config.yaml` file on disk.
    with open('config.yaml', 'r') as f:
        config = yaml.safe_load(f)

    # --- IMPORTANT: Update file paths in the loaded config ---
    # The loaded config has placeholder paths. We must update them to point to
    # our newly generated synthetic data files.
    config['auxiliary_data_artifacts']['BORDER_CROSSING_TABLE']['path_or_handle'] = "./study_data/border_crossings.csv"
    config['auxiliary_data_artifacts']['ROUTINE_ABSTRACT_MAPPING']['path_or_handle'] = "./study_data/bibb_iab_task_mapping.csv"
    config['auxiliary_data_artifacts']['MATCHED_CONTROLS_LIST']['path_or_handle'] = "./study_data/matched_controls.csv"
    # We also need to fill in the placeholder CRS and matched control IDs
    config['geographic_policy_parameters']['CRS_UTM_EPSG'] = "EPSG:32632" # Example: UTM Zone 32N
    config['geographic_policy_parameters']['MATCHED_CONTROL_DISTRICT_IDS'] = [f'D_{101+i}' for i in range(13, 20)]


    # --- Step iv): Execute the end-to-end pipeline function ---
    
    # Define the path to the main raw data file we just created.
    raw_data_file_path = "./study_data/consolidated_df_raw.parquet"
    
    # Call the top-level orchestrator to run the entire analysis.
    # We set `force_rerun_prep=True` for the first run to ensure caches are built.
    final_results_from_pipeline = execute_decomposition_toolkit_pipeline(
        raw_data_path=raw_data_file_path,
        master_config=config,
        main_result_years=[1993], # Run for one year for a quicker example
        cache_dir="./.pipeline_cache/",
        force_rerun_prep=True
    )

    # The `final_results_from_pipeline` dictionary now holds all the estimation
    # outputs, tables, and figure data generated by the entire project.
    print("\n--- PIPELINE EXECUTION COMPLETE ---")
    print("Final results dictionary contains the following top-level keys:")
    print(list(final_results_from_pipeline.keys()))
```
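
The path overrides in the example above can also be collected in one small function. The following is a minimal sketch, not part of the toolkit: `point_config_at_data_dir` and `ARTIFACT_FILES` are hypothetical names introduced here, and `loaded_config` is a stand-in for whatever `yaml.safe_load` returns. Only the config keys and file names are taken from the example.

```python
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

# Artifact keys and file names as they appear in the usage example above.
ARTIFACT_FILES = {
    "BORDER_CROSSING_TABLE": "border_crossings.csv",
    "ROUTINE_ABSTRACT_MAPPING": "bibb_iab_task_mapping.csv",
    "MATCHED_CONTROLS_LIST": "matched_controls.csv",
}


def point_config_at_data_dir(config: Dict[str, Any], data_dir: str) -> Dict[str, Any]:
    """Hypothetical helper: copy `config` and redirect artifact paths to `data_dir`."""
    updated = deepcopy(config)  # the caller's dictionary is left unchanged
    artifacts = updated.setdefault("auxiliary_data_artifacts", {})
    for key, file_name in ARTIFACT_FILES.items():
        # Create the artifact entry if it is missing, then set its path.
        entry = artifacts.setdefault(key, {})
        entry["path_or_handle"] = str(PurePosixPath(data_dir) / file_name)
    return updated


# Stand-in for the dictionary that yaml.safe_load would return (assumed shape).
loaded_config = {
    "auxiliary_data_artifacts": {
        "BORDER_CROSSING_TABLE": {"path_or_handle": "PLACEHOLDER"},
    }
}
synthetic_config = point_config_at_data_dir(loaded_config, "./study_data")
print(synthetic_config["auxiliary_data_artifacts"]["BORDER_CROSSING_TABLE"]["path_or_handle"])
```

Working on a deep copy keeps the originally loaded configuration intact, which may be convenient when the same config is reused against real data later.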


# ==============================================================================
# Task 1: Validate consolidated_df_raw schema, integrity, and dtypes
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 1, Step 1: Helper for MultiIndex and Column Schema/Dtype Validation
# ------------------------------------------------------------------------------

def _validate_schema_and_dtypes(
    df: pd.DataFrame,
    task_name: str = "Task 1, Step 1"
) -> List[str]:
    """
    Validates the DataFrame's MultiIndex, column presence, and dtypes.

    This function checks for a two-level MultiIndex [Worker_ID, Start_Date]
    with correct dtypes and verifies that all required columns exist with
    compatible dtypes. It collects all validation errors into a list.

    Args:
        df (pd.DataFrame): The raw, consolidated spell-level DataFrame to validate.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        List[str]: A list of error messages. An empty list indicates success.
    """
    # Initialize a list to collect validation errors.
    errors = []

    # Define the canonical schema with expected column names and their dtypes.
    # This serves as the ground truth for the input DataFrame structure.
    expected_schema = {
        # IEB Spell Data Fields
        'Employer_ID': 'object', 'Spell_Sequence_ID': 'int64',
        'End_Date': 'datetime64[ns]', 'Daily_Wage_EUR': 'float64',
        'Occupation_Code': 'int64', 'Nationality_Code': 'int64',
        'Gender_Code': 'int64',
        # Denormalized Wage/Censoring Fields
        'Is_TopCoded_Wage': 'bool', 'Wage_Cap_Year': 'int64',
        'Social_Security_Cap_EUR': 'float64',
        # Denormalized Geospatial & Policy Fields
        'Municipality_ID': 'object', 'Municipality_Name': 'object',
        'Workplace_Coord_X_UTM': 'float64', 'Workplace_Coord_Y_UTM': 'float64',
        'Is_Border_Region': 'bool',
        # Raw Worker Demographics
        'Birth_Year': 'int64', 'Education_Level_Code': 'int64',
        # Raw Employment Characteristics
        'Employment_Type_Code': 'int64', 'Reason_for_Termination': 'int64',
        # Raw Firm/Establishment Characteristics
        'Establishment_ID': 'object', 'Industry_Code': 'int64',
        'Firm_Size_Code': 'int64',
        # Denormalized Geographic Context
        'District_ID': 'object', 'State_ID': 'int64', 'Is_Matched_Control': 'bool'
    }

    # --- MultiIndex Validation ---
    # Verify that the DataFrame's index is a MultiIndex.
    if not isinstance(df.index, pd.MultiIndex):
        errors.append(f"[{task_name}] DataFrame index is not a MultiIndex.")
    else:
        # Verify the number of levels in the MultiIndex.
        if df.index.nlevels != 2:
            errors.append(
                f"[{task_name}] Index must have 2 levels, but found {df.index.nlevels}."
            )
        # Verify the names of the index levels.
        if list(df.index.names) != ['Worker_ID', 'Start_Date']:
            errors.append(
                f"[{task_name}] Index names must be ['Worker_ID', 'Start_Date'], "
                f"but found {list(df.index.names)}."
            )
        else:
            # Verify the dtypes of the index levels. This is only done once the
            # level names are confirmed, because looking up a missing level by
            # name would raise a KeyError instead of recording an error.
            if not pd.api.types.is_integer_dtype(df.index.get_level_values('Worker_ID')):
                errors.append(
                    f"[{task_name}] Index level 'Worker_ID' must be integer dtype."
                )
            if not pd.api.types.is_datetime64_any_dtype(df.index.get_level_values('Start_Date')):
                errors.append(
                    f"[{task_name}] Index level 'Start_Date' must be datetime64 dtype."
                )

    # --- Column Presence Validation ---
    # Check for any missing columns from the expected schema.
    missing_cols = set(expected_schema.keys()) - set(df.columns)
    if missing_cols:
        errors.append(
            f"[{task_name}] Missing required columns: {sorted(list(missing_cols))}"
        )

    # --- Column Dtype Validation ---
    # Iterate through the schema to check dtypes of existing columns.
    for col, expected_dtype in expected_schema.items():
        if col in df.columns:
            actual_dtype = df[col].dtype
            # Check for compatibility based on the expected dtype category.
            is_compatible = False
            if expected_dtype == 'int64':
                is_compatible = pd.api.types.is_integer_dtype(actual_dtype)
            elif expected_dtype == 'float64':
                is_compatible = pd.api.types.is_float_dtype(actual_dtype)
            elif expected_dtype == 'datetime64[ns]':
                is_compatible = pd.api.types.is_datetime64_any_dtype(actual_dtype)
            elif expected_dtype == 'bool':
                is_compatible = pd.api.types.is_bool_dtype(actual_dtype)
            elif expected_dtype == 'object':
                is_compatible = pd.api.types.is_object_dtype(actual_dtype)

            if not is_compatible:
                errors.append(
                    f"[{task_name}] Column '{col}' has dtype '{actual_dtype}', "
                    f"but a compatible type for '{expected_dtype}' was expected."
                )

    # Return the list of all found errors.
    return errors

# ------------------------------------------------------------------------------
# Task 1, Step 2: Helper for Referential and Logical Integrity Enforcement
# ------------------------------------------------------------------------------

def _validate_logical_integrity(
    df: pd.DataFrame,
    task_name: str = "Task 1, Step 2"
) -> List[str]:
    """
    Enforces referential and logical integrity constraints on the DataFrame.

    This function performs a series of rigorous checks to ensure the internal
    consistency of the spell-level data. It validates:
    1.  **Index Uniqueness**: The ('Worker_ID', 'Start_Date') MultiIndex is a unique primary key.
    2.  **Spell Sequence Integrity**: For each worker, spells are chronologically
        ordered and the 'Spell_Sequence_ID' increments by exactly 1 with no gaps.
    3.  **Date Logic**: For every spell, 'Start_Date' is not after 'End_Date'.
    4.  **Wage Censoring Logic**: The 'Is_TopCoded_Wage' flag is perfectly
        consistent with the 'Daily_Wage_EUR' and 'Social_Security_Cap_EUR' values,
        using floating-point-robust comparisons.

    Args:
        df (pd.DataFrame): The schema-validated, spell-level DataFrame. It is
                           expected to have a MultiIndex of ('Worker_ID', 'Start_Date').
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        List[str]: A list of string descriptions of any validation errors found.
                   An empty list signifies that all integrity checks have passed.
    """
    # Initialize a list to collect all validation error messages.
    errors: List[str] = []

    # --- 1. MultiIndex Uniqueness Validation ---
    # The MultiIndex must serve as a unique primary key for the dataset.
    # We check this property using the efficient `.index.is_unique` attribute.
    if not df.index.is_unique:
        # If duplicates exist, count them to provide a more informative error.
        duplicate_count = df.index.duplicated().sum()
        errors.append(
            f"[{task_name}] MultiIndex is not unique. Found {duplicate_count} "
            "duplicate (Worker_ID, Start_Date) pairs."
        )

    # --- 2. Spell Sequence Integrity Validation ---
    # This check ensures that for each worker, the spell history is coherent.
    # The DataFrame must be sorted by the index for the `.diff()` operation to be meaningful.
    if 'Worker_ID' in df.index.names and 'Spell_Sequence_ID' in df.columns:
        # Sort the DataFrame chronologically for each worker.
        df_sorted = df.sort_index()

        # Calculate the difference between consecutive 'Spell_Sequence_ID's within each worker's history.
        sequence_diffs = df_sorted.groupby('Worker_ID')['Spell_Sequence_ID'].diff().dropna()

        # The difference should always be exactly 1 for a valid, contiguous sequence.
        # Any other value indicates a gap, a duplicate, or an incorrect ordering.
        invalid_sequence_mask = (sequence_diffs != 1)
        if invalid_sequence_mask.any():
            # Identify the specific workers with sequence errors for targeted debugging.
            # Keep only the rows where the mask is True, then read 'Worker_ID' from their index.
            bad_worker_ids = (
                sequence_diffs[invalid_sequence_mask]
                .index.get_level_values('Worker_ID').unique().tolist()
            )
            errors.append(
                f"[{task_name}] 'Spell_Sequence_ID' is not contiguous (has gaps or duplicates) "
                f"for {len(bad_worker_ids)} workers. Example failing Worker_IDs: {bad_worker_ids[:5]}."
            )

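    # --- 3. Date Logic Validation ---
    # A spell's start date cannot be after its end date. This is a fundamental logical constraint.
    # 'Start_Date' is an index level rather than a column, so it is read from the MultiIndex.
    start_dates = df.index.get_level_values('Start_Date').to_numpy()
    invalid_dates_mask = start_dates > df['End_Date'].to_numpy()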
    if invalid_dates_mask.any():
        # Count the number of violations to report the scale of the problem.
        errors.append(
            f"[{task_name}] Found {invalid_dates_mask.sum()} spells where "
            "Start_Date > End_Date."
        )

    # --- 4. Wage Censoring Field Validation ---
    # The 'Is_TopCoded_Wage' flag must be a perfect logical representation of the wage hitting the cap.
    # We use a floating-point-safe comparison for maximum rigor.
    # The condition is True if the wage is strictly greater than the cap, OR if it is numerically close to the cap.
    expected_censoring = (df['Daily_Wage_EUR'] > df['Social_Security_Cap_EUR']) | \
                         np.isclose(df['Daily_Wage_EUR'], df['Social_Security_Cap_EUR'])

    # Compare the expected flag with the actual flag in the data.
    inconsistent_censoring_mask = (expected_censoring != df['Is_TopCoded_Wage'])
    if inconsistent_censoring_mask.any():
        # Report the number of inconsistencies found.
        errors.append(
            f"[{task_name}] Found {inconsistent_censoring_mask.sum()} spells with "
            "an inconsistent 'Is_TopCoded_Wage' flag relative to the wage and cap values."
        )

    # The social security cap itself must be a positive value for the logic to hold.
    if (df['Social_Security_Cap_EUR'] <= 0).any():
        errors.append(
            f"[{task_name}] Found { (df['Social_Security_Cap_EUR'] <= 0).sum() } spells "
            "with a non-positive 'Social_Security_Cap_EUR'."
        )

    # Return the list of all identified error messages.
    return errors

# ------------------------------------------------------------------------------
# Task 1, Step 3: Helper for Geospatial and Policy Field Consistency
# ------------------------------------------------------------------------------

def _validate_geo_policy_consistency(
    df: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 1, Step 3"
) -> List[str]:
    """
    Validates consistency of geospatial and policy-related fields.

    Checks for valid coordinates, consistent geographic hierarchies
    (Municipality -> District -> State), and alignment between policy flags
    in the data and district lists in the master configuration.

    Args:
        df (pd.DataFrame): The integrity-validated DataFrame.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        List[str]: A list of error messages. An empty list indicates success.
    """
    # Initialize a list to collect validation errors.
    errors = []

    # --- Coordinate Validity ---
    # Ensure UTM coordinates are non-null and finite.
    for coord_col in ['Workplace_Coord_X_UTM', 'Workplace_Coord_Y_UTM']:
        if df[coord_col].isnull().any() or not np.isfinite(df[coord_col]).all():
            errors.append(f"[{task_name}] Column '{coord_col}' contains null or non-finite values.")

    # --- Geographic ID Formatting and Hierarchy ---
    # Validate Municipality_ID format (5-digit string).
    municipality_id_regex = re.compile(r'^\d{5}$')
    invalid_municipality_ids = df['Municipality_ID'].dropna().apply(
        lambda x: municipality_id_regex.match(str(x)) is None
    )
    if invalid_municipality_ids.any():
        errors.append(
            f"[{task_name}] Found {invalid_municipality_ids.sum()} 'Municipality_ID' "
            "entries that are not 5-digit strings."
        )

    # Check for consistent mapping from Municipality to District and State.
    geo_hierarchy = df[['Municipality_ID', 'District_ID', 'State_ID']].drop_duplicates()
    if geo_hierarchy['Municipality_ID'].duplicated().any():
        errors.append(
            f"[{task_name}] Inconsistent geographic hierarchy: some municipalities map "
            "to multiple districts or states."
        )

    # --- Policy Flag Consistency ---
    # Extract district sets from the configuration for cross-validation.
    treated_districts = set(config["geographic_policy_parameters"]["TREATED_DISTRICT_IDS"])

    # Check if spells in treated districts have the correct Is_Border_Region flag.
    border_mismatch = df[df['District_ID'].isin(treated_districts) & ~df['Is_Border_Region']]
    if not border_mismatch.empty:
        errors.append(
            f"[{task_name}] Found {len(border_mismatch)} spells in treated districts "
            "(from config) where 'Is_Border_Region' is False."
        )

    # Check if spells flagged as border region are in the configured treated list.
    flag_mismatch = df[df['Is_Border_Region'] & ~df['District_ID'].isin(treated_districts)]
    if not flag_mismatch.empty:
        mismatched_districts = flag_mismatch['District_ID'].unique()
        errors.append(
            f"[{task_name}] Found {len(flag_mismatch)} spells with 'Is_Border_Region'=True "
            f"but their districts are not in the config's treated list. "
            f"Districts: {list(mismatched_districts)}."
        )

    # Return the list of all found errors.
    return errors

# ------------------------------------------------------------------------------
# Task 1, Orchestrator Function
# ------------------------------------------------------------------------------

def validate_consolidated_df_raw(
    consolidated_df_raw: pd.DataFrame,
    master_config: Dict[str, Any]
) -> bool:
    """
    Orchestrates the validation of the raw consolidated spell-level DataFrame.

    This function serves as the main entry point for Task 1. It executes a
    series of validation steps covering the DataFrame's schema, logical
    integrity, and consistency with study parameters. It provides a single,
    comprehensive error report if any validation fails.

    Args:
        consolidated_df_raw (pd.DataFrame): The primary input DataFrame,
            containing spell-level data for all workers. It is expected to
            have a MultiIndex of ('Worker_ID', 'Start_Date').
        master_config (Dict[str, Any]): The master configuration dictionary
            containing all study parameters, file paths, and coding maps.

    Returns:
        bool: True if all validation checks pass successfully.

    Raises:
        TypeError: If input `consolidated_df_raw` is not a pandas DataFrame
                   or `master_config` is not a dictionary.
        ValueError: If any validation check fails. The error message will
                    contain a detailed list of all identified issues.
    """
    # --- Input Type Validation ---
    # Ensure the primary inputs are of the correct type.
    if not isinstance(consolidated_df_raw, pd.DataFrame):
        raise TypeError("`consolidated_df_raw` must be a pandas DataFrame.")
    if not isinstance(master_config, dict):
        raise TypeError("`master_config` must be a dictionary.")

    # --- Execute Validation Steps Sequentially ---
    # Initialize a list to aggregate errors from all validation steps.
    all_errors = []

    # Step 1: Validate schema, column presence, and dtypes.
    all_errors.extend(_validate_schema_and_dtypes(consolidated_df_raw))

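    # Steps 2 and 3 access columns and index levels directly, so they run only
    # when the schema check passed. Otherwise a missing column would raise a
    # KeyError and hide the schema report.
    if not all_errors:
        # Step 2: Enforce referential and logical integrity.
        all_errors.extend(_validate_logical_integrity(consolidated_df_raw))

        # Step 3: Validate geospatial and policy field consistency.
        all_errors.extend(_validate_geo_policy_consistency(consolidated_df_raw, master_config))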

    # --- Final Verdict ---
    # If the error list is not empty, raise a single comprehensive error.
    if all_errors:
        # Combine all error messages into a single, readable report.
        error_report = "\n- ".join(["Input data validation failed with the following issues:"] + all_errors)
        raise ValueError(error_report)

    # If all checks pass, print a success message and return True.
    print("Task 1: Validation of `consolidated_df_raw` completed successfully. "
          "Schema, integrity, and consistency checks passed.")

    return True
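
The wage-censoring rule used in `_validate_logical_integrity` can be illustrated on a toy frame. The following is a minimal sketch under stated assumptions: the three-row DataFrame and its values are invented for illustration, and pandas and NumPy are assumed to be installed. It suggests why the check combines `>` with `np.isclose` rather than relying on `>=` alone.

```python
import numpy as np
import pandas as pd

cap = 150.0
toy = pd.DataFrame({
    # Illustrative wages: well below the cap, one float step below it, exactly at it.
    'Daily_Wage_EUR': [80.0, np.nextafter(cap, 0.0), cap],
    'Social_Security_Cap_EUR': [cap, cap, cap],
    'Is_TopCoded_Wage': [False, True, True],
})

# A plain `>=` misses the second row, whose wage is marginally below the cap.
naive_flag = toy['Daily_Wage_EUR'] >= toy['Social_Security_Cap_EUR']

# Same rule as the validator: strictly above the cap OR numerically close to it.
expected_flag = (toy['Daily_Wage_EUR'] > toy['Social_Security_Cap_EUR']) | np.isclose(
    toy['Daily_Wage_EUR'], toy['Social_Security_Cap_EUR']
)

# Count rows where each rule disagrees with the stored flag.
naive_mismatches = int((naive_flag != toy['Is_TopCoded_Wage']).sum())
robust_mismatches = int((expected_flag != toy['Is_TopCoded_Wage']).sum())
print(naive_mismatches, robust_mismatches)
```

Note that `np.isclose` defaults to `rtol=1e-05` and `atol=1e-08`, so with a cap of 150 any wage within roughly 0.0015 of the cap counts as top-coded. Tighter tolerances can be passed explicitly if that margin is too generous for a given dataset.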



# ==============================================================================
# Task 2: Validate auxiliary data artifacts and master_config
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 2, Step 1: Helper to Load and Validate BORDER_CROSSING_TABLE
# ------------------------------------------------------------------------------

def _validate_border_crossing_table(
    config: Dict[str, Any],
    task_name: str = "Task 2, Step 1"
) -> pd.DataFrame:
    """
    Loads and validates the border crossing auxiliary data artifact.

    This function reads the border crossing data from the path specified in the
    master config, validates its schema and content, and ensures the CRS
    parameter is properly specified for downstream geospatial tasks.

    Args:
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The loaded and validated border crossing DataFrame.

    Raises:
        ValueError: If the file path is invalid, the schema is incorrect,
                    or the data content fails validation checks.
        FileNotFoundError: If the specified file does not exist.
    """
    # Safely access the nested configuration for the border crossing table.
    try:
        artifact_config = config["auxiliary_data_artifacts"]["BORDER_CROSSING_TABLE"]
        file_path_str = artifact_config["path_or_handle"]
        expected_schema = artifact_config["schema"]
    except KeyError as e:
        raise ValueError(f"[{task_name}] Missing required key in master_config: {e}")

    # Resolve the file path and check for existence.
    file_path = Path(file_path_str)
    if not file_path.is_file():
        raise FileNotFoundError(
            f"[{task_name}] Border crossing data file not found at: {file_path}"
        )

    # Load the data, handling potential parsing errors.
    try:
        # Assuming a CSV file for this implementation. This could be extended
        # to handle other formats like Parquet based on file extension.
        df = pd.read_csv(file_path)
    except Exception as e:
        raise ValueError(f"[{task_name}] Failed to load or parse border crossing data from {file_path}. Error: {e}")

    # --- Schema and Content Validation ---
    errors = []
    # Verify column presence.
    missing_cols = set(expected_schema.keys()) - set(df.columns)
    if missing_cols:
        errors.append(f"Missing columns: {sorted(list(missing_cols))}")

    # Verify dtypes and content for existing columns.
    for col, dtype_str in expected_schema.items():
        if col in df.columns:
            try:
                # Coerce to the expected type.
                if 'float' in dtype_str:
                    df[col] = pd.to_numeric(df[col], errors='coerce')
                elif 'int' in dtype_str:
                    df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int64')

                # Check for nulls after coercion, especially for numeric types.
                if df[col].isnull().any():
                     errors.append(f"Column '{col}' contains null or non-coercible values.")

                # For coordinates, ensure they are finite.
                if 'Coord' in col:
                    if not np.isfinite(df[col].dropna()).all():
                        errors.append(f"Column '{col}' contains non-finite values (inf, -inf).")

            except Exception as e:
                errors.append(f"Could not validate or cast column '{col}'. Error: {e}")

    # Validate the Coordinate Reference System (CRS) specification.
    try:
        crs_epsg = config["geographic_policy_parameters"]["CRS_UTM_EPSG"]
        # Cast to str so that a bare integer EPSG code does not raise a TypeError.
        if not crs_epsg or "XXXX" in str(crs_epsg):
            errors.append("`CRS_UTM_EPSG` in config is not specified or is a placeholder.")
    except KeyError:
        errors.append("Missing `CRS_UTM_EPSG` in `geographic_policy_parameters`.")

    # If any errors were found, raise a comprehensive exception.
    if errors:
        error_report = "\n- ".join([f"[{task_name}] Validation of border crossing table failed:"] + errors)
        raise ValueError(error_report)

    return df

# ------------------------------------------------------------------------------
# Task 2, Step 2: Helper to Validate Task Mapping and Matched Controls
# ------------------------------------------------------------------------------

def _validate_task_mapping_artifact(
    config: Dict[str, Any],
    main_df: pd.DataFrame,
    task_name: str = "Task 2, Step 2 (Task Mapping)"
) -> pd.DataFrame:
    """
    Loads and validates the occupation-to-task mapping artifact.

    Ensures the artifact has the correct schema and, critically, provides
    complete coverage for all occupation codes present in the main dataset.

    Args:
        config (Dict[str, Any]): The master configuration dictionary.
        main_df (pd.DataFrame): The main consolidated DataFrame to check against.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The loaded and validated task mapping DataFrame.

    Raises:
        ValueError: If validation fails.
    """
    # Load and perform basic schema validation on the task mapping artifact.
    try:
        artifact_config = config["auxiliary_data_artifacts"]["ROUTINE_ABSTRACT_MAPPING"]
        file_path = Path(artifact_config["path_or_handle"])
        df_map = pd.read_csv(file_path) # Assuming CSV
    except Exception as e:
        raise ValueError(f"[{task_name}] Failed to load task mapping artifact. Error: {e}")

    expected_cols = {"Occupation_Code_3digit", "Routine_or_Abstract_Label", "Abstract_Intensity"}
    if not expected_cols.issubset(df_map.columns):
        raise ValueError(f"[{task_name}] Task mapping artifact missing required columns. "
                         f"Expected: {expected_cols}, Found: {set(df_map.columns)}")

    # --- Coverage Validation ---
    # Ensure every occupation code in the main data has a corresponding mapping.
    # This is critical to prevent data loss or errors in downstream analysis.
    occupation_codes_in_data: Set[int] = set(main_df['Occupation_Code'].unique())
    occupation_codes_in_map: Set[int] = set(df_map['Occupation_Code_3digit'].unique())

    unmapped_codes = occupation_codes_in_data - occupation_codes_in_map
    if unmapped_codes:
        raise ValueError(
            f"[{task_name}] The task mapping is incomplete. "
            f"{len(unmapped_codes)} occupation codes found in the main data "
            f"are missing from the mapping file. Examples: {list(unmapped_codes)[:10]}"
        )

    return df_map

def _validate_matched_controls_artifact(
    config: Dict[str, Any],
    main_df: pd.DataFrame,
    task_name: str = "Task 2, Step 2 (Matched Controls)"
) -> pd.DataFrame:
    """
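    Loads and validates the matched controls list artifact.

    Ensures the artifact has the required columns and that its set of
    districts matches `MATCHED_CONTROL_DISTRICT_IDS` in the master config.

    Args:
        config (Dict[str, Any]): The master configuration dictionary.
        main_df (pd.DataFrame): The main consolidated DataFrame to check against.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The loaded and validated matched controls DataFrame.

    Raises:
        ValueError: If the artifact cannot be loaded, required columns are
                    missing, or the district sets disagree.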
    """
    # Load and perform basic schema validation on the matched controls artifact.
    try:
        artifact_config = config["auxiliary_data_artifacts"]["MATCHED_CONTROLS_LIST"]
        file_path = Path(artifact_config["path_or_handle"])
        df_controls = pd.read_csv(file_path) # Assuming CSV
    except Exception as e:
        raise ValueError(f"[{task_name}] Failed to load matched controls artifact. Error: {e}")

    expected_cols = {"District_ID", "Municipality_ID"}
    if not expected_cols.issubset(df_controls.columns):
        raise ValueError(f"[{task_name}] Matched controls artifact missing required columns. "
                         f"Expected: {expected_cols}, Found: {set(df_controls.columns)}")

    # --- Consistency Validation ---
    # Ensure the list of matched control districts in the config is populated and consistent.
    control_districts_from_config = set(config["geographic_policy_parameters"]["MATCHED_CONTROL_DISTRICT_IDS"])
    control_districts_from_artifact = set(df_controls['District_ID'].astype(str).unique())

    if not control_districts_from_config:
        raise ValueError(f"[{task_name}] `MATCHED_CONTROL_DISTRICT_IDS` in config is empty.")

    if control_districts_from_config != control_districts_from_artifact:
        raise ValueError(
            f"[{task_name}] Mismatch between `MATCHED_CONTROL_DISTRICT_IDS` in config "
            f"and districts found in the matched controls artifact."
        )

    return df_controls

# ------------------------------------------------------------------------------
# Task 2, Step 3: Helper to Validate master_config Parameters
# ------------------------------------------------------------------------------

class EventStudyShockRule(BaseModel):
    """
    Defines the structure for a single event study shock rule, specifying the
    start and end years for calculating the immigration inflow.
    """
    # The start year of the period for shock calculation.
    start_year: int

    # The end year of the period for shock calculation.
    end_year: int

class TemporalParams(BaseModel):
    """
    Validates the 'temporal_parameters' section of the config, ensuring all
    critical dates and periods for the study are correctly specified.
    """
    # The first year of data acquisition, fixed for replication.
    DATA_ACQUISITION_START_YEAR: Literal[1986]

    # The last year of data acquisition, fixed for replication.
    DATA_ACQUISITION_END_YEAR: Literal[1995]

    # The month for the annual employment snapshot.
    ANNUAL_SNAPSHOT_MONTH: Literal[6]

    # The day for the annual employment snapshot.
    ANNUAL_SNAPSHOT_DAY: Literal[30]

    # The pre-treatment baseline year, fixed for replication.
    BASE_YEAR: Literal[1990]

    # The first year the treatment (commuting policy) is in effect.
    TREATMENT_START_YEAR: Literal[1991]

    # A dictionary defining the specific shock calculation windows for the event study.
    EVENT_STUDY_SHOCK_RULES: Dict[Literal["1991", "1992_to_1995"], EventStudyShockRule]

    @validator('EVENT_STUDY_SHOCK_RULES')
    def check_shock_rules(cls, v: Dict) -> Dict:
        """Validator to check the precise content of the event study shock rules."""
        # Define the expected rule for the 1991 event year.
        expected_1991_rule = EventStudyShockRule(start_year=1990, end_year=1991)
        # Define the expected rule for all subsequent event years.
        expected_post_1991_rule = EventStudyShockRule(start_year=1990, end_year=1992)

        # Assert that the configured rules match the expected rules exactly.
        if v.get("1991") != expected_1991_rule:
            raise ValueError("EVENT_STUDY_SHOCK_RULES for '1991' is incorrect.")
        if v.get("1992_to_1995") != expected_post_1991_rule:
            raise ValueError("EVENT_STUDY_SHOCK_RULES for '1992_to_1995' is incorrect.")

        # Return the validated dictionary.
        return v

class AlgorithmConfig(BaseModel):
    """
    Validates the 'algorithm_config_parameters' section of the config,
    ensuring key econometric and computational parameters are correct.
    """
    # The number of bootstrap replications, fixed for replication.
    BOOTSTRAP_REPLICATIONS: Literal[500]

    # The type of distribution for wild bootstrap weights, fixed to Rademacher.
    WILD_BOOTSTRAP_TYPE: Literal["Rademacher"]

    # The geographic level for clustering standard errors, fixed to District_ID.
    CLUSTER_LEVEL: Literal["District_ID"]

    # The mapping of employment type codes to Full-Time Equivalent weights.
    PART_TIME_EQUIVALENCY_WEIGHTS: Dict[int, float]

    @validator('PART_TIME_EQUIVALENCY_WEIGHTS')
    def check_fte_weights(cls, v: Dict) -> Dict:
        """Validator to check the precise content of the FTE weight mapping."""
        # Define the exact dictionary of weights required by the study.
        expected = {1: 1.00, 5: 0.67, 6: 0.50}

        # Assert that the configured weights match the expected weights exactly.
        if v != expected:
            raise ValueError(f"PART_TIME_EQUIVALENCY_WEIGHTS must be exactly {expected}.")

        # Return the validated dictionary.
        return v

class MasterConfigModel(BaseModel):
    """
    The Pydantic model for the entire master_config structure.

    This top-level model composes the nested models and validates the presence
    of all major sections of the configuration dictionary.
    """
    # Validate the nested 'temporal_parameters' dictionary against its specific model.
    temporal_parameters: TemporalParams

    # Validate the nested 'algorithm_config_parameters' dictionary against its specific model.
    algorithm_config_parameters: AlgorithmConfig

    # The remaining sections are only checked for presence and dictionary type;
    # no field-level validation is applied to them here. A fuller implementation
    # would define dedicated models for these sections as well.
    study_metadata: Dict
    sample_selection_parameters: Dict
    geographic_policy_parameters: Dict
    variable_coding_maps: Dict
    shock_and_scaling_parameters: Dict
    wage_imputation_parameters: Dict
    selection_bounding_parameters: Dict
    auxiliary_data_artifacts: Dict

def _validate_config_parameters(
    config: Dict[str, Any],
    task_name: str = "Task 2, Step 3"
) -> List[str]:
    """
    Validates the master_config dictionary against a Pydantic schema.

    The schema (`MasterConfigModel`) declares the expected structure, types, and
    fixed replication values for critical parameters. Parsing the raw dictionary
    into that model runs every type check, `Literal` constraint, and custom
    validator, and reports all failures together with the path to each
    offending key.

    Args:
        config (Dict[str, Any]): The master configuration dictionary to be validated.
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        List[str]: A list of human-readable error messages if validation fails.
                   An empty list signifies that the configuration is valid.
    """
    try:
        # The core of the validation: attempt to parse the raw dictionary
        # into the formal Pydantic model. Pydantic handles all type checks,
        # value constraints (via Literal), and custom validator functions.
        MasterConfigModel.parse_obj(config)

        # If parsing succeeds without raising a ValidationError, the config is valid.
        # Return an empty list to signal success to the orchestrator.
        return []

    except ValidationError as e:
        # If parsing fails, Pydantic's ValidationError contains a detailed,
        # structured list of all issues found throughout the nested structure.

        # We format these structured errors into a simple list of strings
        # as expected by the calling orchestrator function.
        error_messages = []
        for error in e.errors():
            # 'loc' provides the path to the failing key, e.g., ('temporal_parameters', 'BASE_YEAR').
            loc = ".".join(map(str, error['loc']))
            # 'msg' provides a human-readable description of the failure.
            msg = error['msg']
            # Append the formatted error message to our list.
            error_messages.append(f"[{task_name}] Config validation error at '{loc}': {msg}")

        # Return the comprehensive list of all validation failures.
        return error_messages

# ------------------------------------------------------------------------------
# Task 2, Orchestrator Function
# ------------------------------------------------------------------------------

def validate_artifacts_and_config(
    master_config: Dict[str, Any],
    consolidated_df_raw: pd.DataFrame
) -> Dict[str, pd.DataFrame]:
    """
    Orchestrates validation of auxiliary data artifacts and master_config.

    This function is the main entry point for Task 2. It validates all
    external data dependencies and the main configuration dictionary that
    drives the entire analysis pipeline, ensuring reproducibility and correctness.

    Args:
        master_config (Dict[str, Any]): The master configuration dictionary.
        consolidated_df_raw (pd.DataFrame): The main spell-level DataFrame,
            required for cross-validation checks (e.g., occupation code coverage).

    Returns:
        Dict[str, pd.DataFrame]: A dictionary containing the loaded and
            validated auxiliary DataFrames, keyed by their artifact name
            (e.g., "border_crossings", "task_mapping").

    Raises:
        TypeError: If inputs are of the wrong type.
        ValueError: If any validation check fails, with a comprehensive report.
        FileNotFoundError: If an artifact file is not found.
    """
    # --- Input Type Validation ---
    if not isinstance(master_config, dict):
        raise TypeError("`master_config` must be a dictionary.")
    if not isinstance(consolidated_df_raw, pd.DataFrame):
        raise TypeError("`consolidated_df_raw` must be a pandas DataFrame.")

    # --- Execute Validation Steps Sequentially ---
    validated_artifacts = {}
    all_errors = []

    # Step 1: Validate Border Crossing Table.
    try:
        validated_artifacts["border_crossings"] = _validate_border_crossing_table(master_config)
    except (ValueError, FileNotFoundError) as e:
        all_errors.append(str(e))

    # Step 2: Validate Task Mapping and Matched Controls artifacts.
    try:
        validated_artifacts["task_mapping"] = _validate_task_mapping_artifact(master_config, consolidated_df_raw)
    except ValueError as e:
        all_errors.append(str(e))

    try:
        validated_artifacts["matched_controls"] = _validate_matched_controls_artifact(master_config, consolidated_df_raw)
    except ValueError as e:
        all_errors.append(str(e))

    # Step 3: Validate critical parameters within the master_config dictionary.
    config_errors = _validate_config_parameters(master_config)
    if config_errors:
        all_errors.extend(config_errors)

    # --- Final Verdict ---
    if all_errors:
        error_report = "\n- ".join(["Artifact and configuration validation failed:"] + all_errors)
        raise ValueError(error_report)

    print("Task 2: Validation of auxiliary artifacts and master_config completed successfully.")

    return validated_artifacts
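
The orchestrator `validate_artifacts_and_config` collects failures from every validation step and raises them together instead of stopping at the first one. A minimal, self-contained sketch of that accumulate-then-raise pattern follows. The step functions `check_paths` and `check_years` and the helper `run_all_checks` are hypothetical stand-ins for illustration, not part of the toolkit.

```python
from typing import Callable, Dict, List


def check_paths(config: Dict) -> None:
    """Hypothetical step: fails when the artifact path is missing."""
    if not config.get("path"):
        raise ValueError("[Step 1] Artifact path is missing.")


def check_years(config: Dict) -> None:
    """Hypothetical step: fails when the base year is not 1990."""
    if config.get("base_year") != 1990:
        raise ValueError("[Step 2] BASE_YEAR must be 1990.")


def run_all_checks(config: Dict, steps: List[Callable[[Dict], None]]) -> None:
    """Run every step, then raise one ValueError listing all failures."""
    all_errors: List[str] = []
    for step in steps:
        try:
            step(config)
        except ValueError as exc:
            # Record the failure and keep going so later steps still run.
            all_errors.append(str(exc))
    if all_errors:
        header = "Artifact and configuration validation failed:"
        raise ValueError("\n- ".join([header] + all_errors))


try:
    run_all_checks({"path": "", "base_year": 1989}, [check_paths, check_years])
except ValueError as exc:
    report = str(exc)
    print(report)
```

Reporting every problem in one pass saves the user from fixing a configuration one error at a time.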



# ==============================================================================
# Task 3: Cleanse and canonicalize spell-level data
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 3, Step 1: Helper to Normalize Dtypes and Enforce String Formatting
# ------------------------------------------------------------------------------

def _normalize_dtypes_and_formats(
    df: pd.DataFrame,
    task_name: str = "Task 3, Step 1"
) -> pd.DataFrame:
    """
    Normalizes dtypes and standardizes string formats with immediate validation.

    This function creates a cleansed copy of the input DataFrame. It iterates
    through columns, coercing them to their canonical data types (datetime,
    numeric, string). After each coercion on a mandatory field, it immediately
    checks that the column contains no null values, so unparseable entries and
    missing source values both fail fast. It also standardizes the format of
    key string identifiers (e.g., zero-padding for Municipality_ID).

    Args:
        df (pd.DataFrame): The validated, raw spell-level DataFrame from Task 1.
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        pd.DataFrame: A new DataFrame with cleansed and validated dtypes and formats.

    Raises:
        TypeError: If `df` is not a pandas DataFrame.
        ValueError: If a mandatory column contains null values after coercion
                    to its target dtype, either because a value could not be
                    parsed (e.g., 'abc' in an integer column) or because the
                    source data already had a missing entry.
    """
    # --- Input Validation ---
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input `df` must be a pandas DataFrame.")

    # Work on a copy to prevent side effects on the original DataFrame.
    df_clean = df.copy()

    # --- Schema and Mandatory Fields Definition ---
    # Define the target dtypes for all relevant columns.
    schema = {
        'date': ['End_Date'],
        'int': ['Spell_Sequence_ID', 'Occupation_Code', 'Nationality_Code',
                'Gender_Code', 'Wage_Cap_Year', 'Birth_Year',
                'Education_Level_Code', 'Employment_Type_Code',
                'Reason_for_Termination', 'Industry_Code', 'Firm_Size_Code',
                'State_ID'],
        'float': ['Daily_Wage_EUR', 'Social_Security_Cap_EUR',
                  'Workplace_Coord_X_UTM', 'Workplace_Coord_Y_UTM'],
        'str_pad': {'Municipality_ID': 5}, # Columns to be zero-padded
        'str_strip': ['Employer_ID', 'Establishment_ID', 'District_ID'] # Columns to be stripped
    }
    # Every date, integer, and float column in the schema is mandatory: none of
    # them may contain nulls after coercion.
    mandatory_fields = set(schema['date'] + schema['int'] + schema['float'])

    # --- 1. Date Type Canonicalization and Validation ---

    # Reset index to handle 'Start_Date' as a column temporarily.
    df_clean = df_clean.reset_index()
    date_cols = schema['date'] + ['Start_Date']

    for col in date_cols:
        # Coerce column to datetime, turning unparseable values into NaT.
        df_clean[col] = pd.to_datetime(df_clean[col], errors='coerce')

        # Immediately validate that no nulls were introduced in this mandatory field.
        if df_clean[col].isnull().any():
            raise ValueError(
                f"[{task_name}] Failed to parse all values in date column '{col}'. "
                "Null values (NaT) were introduced, indicating corrupt data."
            )

    # --- 2. Numeric Type Canonicalization and Validation ---

    # Process integer columns.
    for col in schema['int']:
        # Coerce column to numeric, turning non-numeric values into NaN.
        df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')

        # Immediately validate that no nulls were introduced.
        if col in mandatory_fields and df_clean[col].isnull().any():
            raise ValueError(
                f"[{task_name}] Failed to parse all values in integer column '{col}'. "
                "Null values were introduced, indicating corrupt data."
            )
        # Cast to pandas' nullable integer dtype ('Int64'); for mandatory fields the
        # check above has already ruled out nulls.
        df_clean[col] = df_clean[col].astype('Int64')

    # Process float columns.
    for col in schema['float']:
        # Coerce column to numeric.
        df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')

        # Immediately validate that no nulls were introduced.
        if col in mandatory_fields and df_clean[col].isnull().any():
            raise ValueError(
                f"[{task_name}] Failed to parse all values in float column '{col}'. "
                "Null values were introduced, indicating corrupt data."
            )

    # --- 3. String Formatting Canonicalization ---

    # Process strings that require zero-padding.
    for col, width in schema['str_pad'].items():
        # Ensure column is string type, strip whitespace, then apply zfill.
        df_clean[col] = df_clean[col].astype(str).str.strip().str.zfill(width)

    # Process strings that require stripping of whitespace.
    for col in schema['str_strip']:
        # Ensure column is string type and strip whitespace.
        df_clean[col] = df_clean[col].astype(str).str.strip()

    # --- Finalization ---

    # Set the canonical MultiIndex after all transformations are complete.
    df_clean = df_clean.set_index(['Worker_ID', 'Start_Date'])

    # Log success message.
    print(f"[{task_name}] Dtypes and string formats normalized and validated successfully.")

    # Return the fully cleansed and validated DataFrame.
    return df_clean
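
The null check in `_normalize_dtypes_and_formats` fires on any null that remains after coercion. When it helps to tell unparseable values apart from entries that were already missing, the null masks before and after coercion can be compared. The sketch below illustrates this on a toy Series. The helper name `coercion_failures` is hypothetical and not part of the toolkit.

```python
import pandas as pd


def coercion_failures(raw: pd.Series) -> pd.Series:
    """Return a boolean mask of values that became null only through coercion."""
    was_null = raw.isnull()
    coerced = pd.to_numeric(raw, errors="coerce")
    # Null after coercion but not before: the value was present yet unparseable.
    return coerced.isnull() & ~was_null


# Toy column: one unparseable entry ('abc') and one entry that is already missing.
raw = pd.Series(["12", "abc", None, "7"], dtype="object")
mask = coercion_failures(raw)
print(raw[mask].tolist())
```

Only `'abc'` is flagged here; the pre-existing `None` is left for a separate missing-data check.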

# ------------------------------------------------------------------------------
# Task 3, Step 2: Helper to Resolve Overlapping Spells at Snapshots
# ------------------------------------------------------------------------------

def _resolve_main_job_at_snapshots(
    df_spells: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 3, Step 2"
) -> pd.DataFrame:
    """
    Transforms spell-level data into a worker-year panel.

    This function resolves concurrent employment spells to identify a single
    "main job" for each worker at each annual snapshot date (June 30). It
    expands each spell into the calendar years it covers (a row-wise `apply`
    followed by `explode`), filters for activity on the snapshot date, and
    then applies a single, globally sorted de-duplication based on the
    study's deterministic tie-breaking rule. Only the year expansion runs
    row by row; the filtering, sorting, and de-duplication are vectorized.

    The tie-breaking rule is:
    1. Highest 'Daily_Wage_EUR'.
    2. Longest spell duration.
    3. Earliest 'Start_Date'.

    Args:
        df_spells (pd.DataFrame): The cleansed, spell-level DataFrame from
                                  the previous normalization step.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        pd.DataFrame: A worker-year panel DataFrame where each row represents
                      the main job of a worker in a given year. The index is
                      set to ['Worker_ID', 'snapshot_year']. Returns an empty
                      DataFrame if no active spells are found.
    """
    # --- Input Validation ---
    if not isinstance(df_spells, pd.DataFrame):
        raise TypeError("Input `df_spells` must be a pandas DataFrame.")
    if not isinstance(config, dict):
        raise TypeError("Input `config` must be a dictionary.")

    # --- 1. Prepare Data and Generate Snapshot Information ---

    # Extract necessary temporal parameters from the configuration.
    start_year = config["temporal_parameters"]["DATA_ACQUISITION_START_YEAR"]
    end_year = config["temporal_parameters"]["DATA_ACQUISITION_END_YEAR"]
    snap_month = config["temporal_parameters"]["ANNUAL_SNAPSHOT_MONTH"]
    snap_day = config["temporal_parameters"]["ANNUAL_SNAPSHOT_DAY"]

    # Create a DataFrame mapping each year in the study to its snapshot date.
    all_years = range(start_year, end_year + 1)
    snapshots = pd.DataFrame({
        'snapshot_year': all_years,
        'snapshot_date': [pd.Timestamp(year, snap_month, snap_day) for year in all_years]
    })

    # Work with a copy of the spell data, resetting the index to access date columns.
    spells = df_spells.reset_index()

    # --- 2. Expand Spells to Spell-Year Level ---

    # For each spell, create a list of all calendar years it spans. Note that
    # `apply(axis=1)` iterates over rows in Python; the explode, merge, filter,
    # and sort steps that follow are vectorized.
    spells['snapshot_year'] = spells.apply(
        lambda row: list(range(row['Start_Date'].year, row['End_Date'].year + 1)),
        axis=1
    )

    # Use .explode() to transform the DataFrame into a long format, creating one
    # row for each year a spell is potentially active.
    spell_year_panel = spells.explode('snapshot_year')

    # Convert the exploded year to an integer type.
    spell_year_panel['snapshot_year'] = pd.to_numeric(spell_year_panel['snapshot_year'], errors='coerce').astype('Int64')
    spell_year_panel.dropna(subset=['snapshot_year'], inplace=True)

    # --- 3. Filter to Spells Active on the Snapshot Date ---

    # Merge the snapshot date for each year into the panel.
    active_spells = spell_year_panel.merge(snapshots, on='snapshot_year', how='left')

    # Apply the precise filter: a spell is active if the snapshot date is within its start and end dates.
    active_spells = active_spells[
        (active_spells['Start_Date'] <= active_spells['snapshot_date']) &
        (active_spells['End_Date'] >= active_spells['snapshot_date'])
    ].copy()

    # If no spells are active on any snapshot date, return an empty DataFrame.
    if active_spells.empty:
        print(f"[{task_name}] No active spells found on any snapshot date.")
        return pd.DataFrame()

    # --- 4. Apply Deterministic Tie-Breaking Rule ---

    # a. Compute the spell duration, which is the second tie-breaker criterion.
    active_spells['spell_duration'] = (active_spells['End_Date'] - active_spells['Start_Date']).dt.days

    # b. Sort the entire DataFrame according to the tie-breaking hierarchy.
    # This single sort operation is highly efficient.
    sorted_spells = active_spells.sort_values(
        by=[
            'Worker_ID',
            'snapshot_year',
            'Daily_Wage_EUR',      # 1. Highest wage first
            'spell_duration',      # 2. Longest duration first
            'Start_Date'           # 3. Earliest start date first
        ],
        ascending=[
            True,
            True,
            False,                 # Descending for wage
            False,                 # Descending for duration
            True                   # Ascending for start date
        ]
    )

    # c. Select the first row for each worker-year group. This is the main job.
    main_jobs = sorted_spells.drop_duplicates(subset=['Worker_ID', 'snapshot_year'], keep='first')

    # --- 5. Finalize the Panel DataFrame ---

    # Set the canonical index for the worker-year panel.
    worker_year_panel = main_jobs.set_index(['Worker_ID', 'snapshot_year'])

    # Remove temporary columns used for the tie-breaking process.
    worker_year_panel = worker_year_panel.drop(columns=['snapshot_date', 'spell_duration'])

    # Log success message.
    print(f"[{task_name}] Worker-year panel with main jobs created successfully using vectorized approach.")

    # Return the final, clean panel.
    return worker_year_panel

# ------------------------------------------------------------------------------
# Task 3, Step 3: Helper to Apply Filters and Create Sample Indicators
# ------------------------------------------------------------------------------

def _apply_filters_and_create_flags(
    worker_year_panel: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 3, Step 3"
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Applies sample selection filters and creates analysis flags.

    This function takes the raw worker-year panel of main jobs and performs two
    critical operations:
    1.  **Enrichment**: It computes worker age and creates a comprehensive set of
        boolean flags for key subpopulations (e.g., full-time, older workers,
        apprentices). These flags are created on the full, unfiltered dataset
        to preserve information for all auxiliary analyses.
    2.  **Filtering**: It creates a second, filtered DataFrame that represents the
        core analysis sample by applying the study's main inclusion criteria
        (age range and exclusion of irregular/marginal/seasonal employment).

    Args:
        worker_year_panel (pd.DataFrame): The worker-year panel of main jobs,
            output from the spell resolution step.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
        - panel_with_flags (pd.DataFrame): The complete, unfiltered worker-year
          panel enriched with age and all boolean indicator columns.
        - panel_filtered (pd.DataFrame): A filtered copy of the panel representing
          the main analysis sample.
    """
    # --- Input Validation ---
    if not isinstance(worker_year_panel, pd.DataFrame):
        raise TypeError("Input `worker_year_panel` must be a pandas DataFrame.")
    if worker_year_panel.empty:
        print(f"[{task_name}] Input worker-year panel is empty. Returning two empty DataFrames.")
        return pd.DataFrame(), pd.DataFrame()

    # Work on a copy to prevent side effects.
    panel_with_flags = worker_year_panel.copy()

    # --- 1. Compute Age ---
    # Age is defined as the calendar year of the snapshot minus the worker's birth year.
    # This is a fundamental demographic variable for filtering and controls.
    panel_with_flags['age'] = panel_with_flags.index.get_level_values('snapshot_year') - panel_with_flags['Birth_Year']

    # --- 2. Create Sample Indicator Flags (on the full panel) ---
    # It is critical to create flags before filtering, as some analyses (e.g., on
    # apprentices) rely on data that will be excluded from the main sample.

    # Flag for full-time workers, used in all wage analyses.
    panel_with_flags['is_full_time'] = panel_with_flags['Employment_Type_Code'].isin(
        config["sample_selection_parameters"]["WAGE_ANALYSIS_EMPLOYMENT_CODES"]
    )

    # Flag for older workers, used in the heterogeneity analysis (Task 16).
    panel_with_flags['is_older_worker'] = panel_with_flags['age'] >= config["sample_selection_parameters"]["OLDER_WORKER_AGE_THRESHOLD"]

    # Flag for apprentices, used in the training uptake analysis (Task 19).
    # This flag is based on an employment code that is typically excluded from the main sample.
    panel_with_flags['is_apprentice'] = panel_with_flags['Employment_Type_Code'] == config["variable_coding_maps"]["APPRENTICESHIP_CODE"]

    # --- 3. Apply Core Sample Selection Filters ---

    # Record the initial size of the panel for auditing and logging purposes.
    initial_size = len(panel_with_flags)

    # a. Filter by employment type to exclude irregular, marginal, and seasonal work
    # from the main analysis sample, as specified in the paper.
    excluded_emp_types = config["sample_selection_parameters"]["EXCLUDED_EMPLOYMENT_TYPE_CODES"]
    main_sample_mask = ~panel_with_flags['Employment_Type_Code'].isin(excluded_emp_types)

    # b. Filter by the valid age range for the study population.
    min_age = config["sample_selection_parameters"]["WORKER_AGE_MIN"]
    max_age = config["sample_selection_parameters"]["WORKER_AGE_MAX"]
    age_mask = (panel_with_flags['age'] >= min_age) & (panel_with_flags['age'] <= max_age)

    # Combine the masks to define the final filtered sample.
    panel_filtered = panel_with_flags[main_sample_mask & age_mask].copy()

    # Report on the filtering process to provide a clear audit trail.
    final_size = len(panel_filtered)
    print(
        f"[{task_name}] Sample filtering complete. "
        f"Initial size: {initial_size}, Main analysis sample size: {final_size} "
        f"({initial_size - final_size} observations excluded from main sample)."
    )

    # Return both the fully enriched panel and the main filtered panel.
    return panel_with_flags, panel_filtered

# ------------------------------------------------------------------------------
# Task 3, Orchestrator Function
# ------------------------------------------------------------------------------

def cleanse_and_canonicalize_spells(
    consolidated_df_raw: pd.DataFrame,
    master_config: Dict[str, Any]
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Orchestrates the cleansing and canonicalization of spell-level data.

    This function serves as the master orchestrator for Task 3. It executes a
    three-step pipeline that transforms the raw, spell-level data into the
    foundational datasets for the entire analysis. It returns three distinct
    DataFrames so that each downstream task receives the dataset it depends on.

    The pipeline is as follows:
    1.  **Normalization**: Standardizes all data types and string formats of the
        raw spell data (`_normalize_dtypes_and_formats`).
    2.  **Spell Resolution**: Transforms the spell data into a worker-year panel
        by identifying a unique "main job" for each worker at each annual
        snapshot (`_resolve_main_job_at_snapshots`).
    3.  **Flagging & Filtering**: Enriches the panel with computed age and all
        necessary boolean flags, then creates a second, filtered version of
        the panel for the main analysis sample (`_apply_filters_and_create_flags`).

    Args:
        consolidated_df_raw (pd.DataFrame): The validated, raw spell-level
            DataFrame from Task 1.
        master_config (Dict[str, Any]): The validated master configuration
            dictionary from Task 2.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: A tuple containing:
        - **df_normalized**: The full, spell-level DataFrame after dtype and
          format normalization. This is required by Task 5 to know the
          universe of all workers.
        - **panel_full_with_flags**: The complete, unfiltered worker-year panel
          containing all workers with main jobs, enriched with age and all
          boolean indicator columns.
        - **panel_main_analysis**: A filtered copy of the panel representing the
          core analysis sample.
    """
    # --- Input Validation ---
    # Ensure the primary inputs are of the correct type before proceeding.
    if not isinstance(consolidated_df_raw, pd.DataFrame):
        raise TypeError("`consolidated_df_raw` must be a pandas DataFrame.")
    if not isinstance(master_config, dict):
        raise TypeError("`master_config` must be a dictionary.")

    # --- Step 1: Normalize dtypes and enforce string formatting ---
    # This step produces the first critical output: the cleansed spell data.
    df_normalized = _normalize_dtypes_and_formats(
        df=consolidated_df_raw,
        task_name="Task 3, Step 1"
    )

    # --- Step 2: Resolve overlapping spells and create the worker-year panel ---
    # This step transforms the cleansed spell data into a raw panel of main jobs.
    worker_year_panel = _resolve_main_job_at_snapshots(
        df_spells=df_normalized,
        config=master_config,
        task_name="Task 3, Step 2"
    )

    # --- Step 3: Apply sample filters and create indicator flags ---
    # This step takes the raw panel and produces the final two panel outputs.
    panel_full_with_flags, panel_main_analysis = _apply_filters_and_create_flags(
        worker_year_panel=worker_year_panel,
        config=master_config,
        task_name="Task 3, Step 3"
    )

    # Log the successful completion of the entire task.
    print("\nTask 3: Cleansing and canonicalization of spell-level data completed successfully.")

    # Return all three essential DataFrames for the downstream pipeline.
    return df_normalized, panel_full_with_flags, panel_main_analysis
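
The tie-breaking rule in `_resolve_main_job_at_snapshots` (highest wage, then longest spell, then earliest start) can be checked on a toy example. The sketch below uses made-up spells for a single worker-year. The data and the `pick_main_job` helper are illustrative assumptions, not part of the toolkit.

```python
import pandas as pd


def pick_main_job(active: pd.DataFrame) -> pd.DataFrame:
    """Keep one spell per worker-year: highest wage, longest spell, earliest start."""
    ranked = active.assign(
        spell_duration=(active["End_Date"] - active["Start_Date"]).dt.days
    ).sort_values(
        by=["Worker_ID", "snapshot_year", "Daily_Wage_EUR", "spell_duration", "Start_Date"],
        ascending=[True, True, False, False, True],
    )
    # After sorting, the first row of each worker-year group is the main job.
    return ranked.drop_duplicates(subset=["Worker_ID", "snapshot_year"], keep="first")


# Three made-up spells, all active on 30 June 1990 for the same worker.
active = pd.DataFrame({
    "Worker_ID": ["w1", "w1", "w1"],
    "snapshot_year": [1990, 1990, 1990],
    "Employer_ID": ["A", "B", "C"],
    "Daily_Wage_EUR": [80.0, 95.0, 95.0],
    "Start_Date": pd.to_datetime(["1989-01-01", "1990-03-01", "1989-06-01"]),
    "End_Date": pd.to_datetime(["1991-12-31", "1990-12-31", "1990-12-31"]),
})

main_job = pick_main_job(active)
print(main_job["Employer_ID"].tolist())
```

Spells B and C tie on wage, so the longer spell C is selected; spell A loses on wage despite being the longest.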



# ==============================================================================
# Task 4: Impute censored wages (Tobit-style, for full-time spells)
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 4, Step 1: Helper to Prepare Data for Wage Imputation
# ------------------------------------------------------------------------------

def _prepare_wage_imputation_data(
    worker_year_panel: pd.DataFrame,
    task_name: str = "Task 4, Step 1"
) -> pd.DataFrame:
    """
    Subsets to full-time workers and prepares columns for Tobit imputation.

    This function filters the panel to full-time observations, computes raw log
    wages, and creates essential flags and log-transformed cap values needed
    for the censored regression model.

    Args:
        worker_year_panel (pd.DataFrame): The cleansed worker-year panel from Task 3.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: A DataFrame of full-time workers with columns required
                      for imputation ('log_wage_raw', 'is_censored', 'log_wage_cap').
    """
    # Filter the panel to include only full-time workers, as wage analysis is
    # restricted to this group.
    full_time_df = worker_year_panel[worker_year_panel['is_full_time']].copy()

    # Input validation: Ensure daily wages are strictly positive before taking logs.
    if (full_time_df['Daily_Wage_EUR'] <= 0).any():
        raise ValueError(f"[{task_name}] Found non-positive 'Daily_Wage_EUR' values, "
                         "which cannot be log-transformed.")

    # Create the raw log wage column from the daily wage.
    full_time_df['log_wage_raw'] = np.log(full_time_df['Daily_Wage_EUR'])

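    # Create a definitive boolean flag for censored observations.
    # This relies on the 'Is_TopCoded_Wage' flag validated in Task 1. The explicit
    # bool cast keeps the later `~is_censored` negation a logical NOT even if the
    # flag is stored as 0/1 integers.
    full_time_df['is_censored'] = full_time_df['Is_TopCoded_Wage'].astype(bool)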

    # Create the log of the censoring cap.
    full_time_df['log_wage_cap'] = np.log(full_time_df['Social_Security_Cap_EUR'])

    print(f"[{task_name}] Prepared {len(full_time_df)} full-time observations for wage imputation.")
    return full_time_df

# ------------------------------------------------------------------------------
# Task 4, Step 2: Helper to Fit Censored-Normal (Tobit) Models by Group
# ------------------------------------------------------------------------------

def _estimate_tobit_parameters_for_group(
    group_df: pd.DataFrame
) -> pd.Series:
    """
    Estimates parameters (mu, sigma) for a single group using MLE.

    This function defines and minimizes the negative log-likelihood for a
    right-censored normal distribution.

    Args:
        group_df (pd.DataFrame): A subset of the data for a single imputation group.

    Returns:
        pd.Series: A series containing the estimated 'mu' and 'sigma'.
    """
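    # Separate censored and uncensored observations within the group.
    uncensored_wages = group_df.loc[~group_df['is_censored'], 'log_wage_raw']
    censored_caps = group_df.loc[group_df['is_censored'], 'log_wage_cap']

    # With fewer than two uncensored observations, (mu, sigma) cannot be
    # estimated reliably; return NaNs so the caller's fallback takes over.
    if len(uncensored_wages) < 2:
        return pd.Series({'mu': np.nan, 'sigma': np.nan})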

    # Define the negative log-likelihood function for the Tobit model.
    def neg_log_likelihood(params: np.ndarray) -> float:
        mu, log_sigma = params[0], params[1]
        sigma = np.exp(log_sigma) # Ensure sigma is positive via reparameterization.

        # Log-likelihood for uncensored observations.
        ll_uncensored = norm.logpdf(uncensored_wages, loc=mu, scale=sigma).sum()

        # Log-likelihood for censored observations (using log of survival function for stability).
        ll_censored = norm.logsf(censored_caps, loc=mu, scale=sigma).sum()

        # Total negative log-likelihood.
        return -(ll_uncensored + ll_censored)

    # Provide initial guesses for the optimization.
    # Use moments from the uncensored part of the data as a starting point.
    mu_initial = uncensored_wages.mean() if not uncensored_wages.empty else 5.0
    sigma_initial = uncensored_wages.std() if len(uncensored_wages) > 1 else 1.0
    # Handle cases where std is zero or NaN.
    if not np.isfinite(sigma_initial) or sigma_initial <= 0:
        sigma_initial = 1.0

    initial_params = np.array([mu_initial, np.log(sigma_initial)])

    # Perform the optimization to find the MLE parameters.
    result = minimize(
        neg_log_likelihood,
        initial_params,
        method='L-BFGS-B' # A robust quasi-Newton method.
    )

    # If optimization fails, return NaNs to be handled by the fallback mechanism.
    if not result.success:
        return pd.Series({'mu': np.nan, 'sigma': np.nan})

    # Extract and return the estimated parameters.
    mu_hat, log_sigma_hat = result.x
    sigma_hat = np.exp(log_sigma_hat)
    return pd.Series({'mu': mu_hat, 'sigma': sigma_hat})
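
# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the pipeline): censored-normal MLE on toy data
# ------------------------------------------------------------------------------
# The likelihood above can be sanity-checked on synthetic data. This sketch draws
# log wages from a known normal distribution, top-codes them at a hypothetical
# cap, and refits (mu, sigma) with the same logpdf/logsf likelihood. The function
# name, seed, sample size, and parameter values are assumptions made for this
# example only; the likelihood is divided by n purely to keep its scale small.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def _demo_censored_normal_mle(seed: int = 0) -> tuple:
    rng = np.random.default_rng(seed)
    true_mu, true_sigma, log_cap = 4.5, 0.4, 4.9

    # Simulate latent log wages and apply right-censoring at the cap.
    latent = rng.normal(true_mu, true_sigma, size=5000)
    is_censored = latent >= log_cap
    observed = np.where(is_censored, log_cap, latent)
    uncensored = observed[~is_censored]

    def mean_neg_log_likelihood(params: np.ndarray) -> float:
        mu, log_sigma = params
        sigma = np.exp(log_sigma)  # Keeps sigma positive.
        ll_uncensored = norm.logpdf(uncensored, loc=mu, scale=sigma).sum()
        ll_censored = is_censored.sum() * norm.logsf(log_cap, loc=mu, scale=sigma)
        return -(ll_uncensored + ll_censored) / observed.size

    # Start from the (downward-biased) moments of the uncensored part.
    start = np.array([uncensored.mean(), np.log(uncensored.std())])
    result = minimize(mean_neg_log_likelihood, start, method='L-BFGS-B')
    return float(result.x[0]), float(np.exp(result.x[1]))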

def _estimate_tobit_parameters_by_group(
    full_time_df: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 4, Step 2"
) -> pd.DataFrame:
    """
    Estimates Tobit model parameters for each imputation group.

    Groups data as specified in the config, applies MLE to each group, and
    handles small groups or convergence failures with a fallback mechanism.

    Args:
        full_time_df (pd.DataFrame): The prepared full-time wage data.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The input DataFrame with 'mu' and 'sigma' columns merged in.
    """
    # Define the primary grouping keys from the configuration.
    grouping_keys = config["wage_imputation_parameters"]["IMPUTATION_GROUPING"]

    # --- Primary Estimation ---
    # Estimate parameters for each group defined by the grouping keys.
    print(f"[{task_name}] Estimating Tobit parameters for {grouping_keys} groups...")
    tobit_params = full_time_df.groupby(grouping_keys).apply(_estimate_tobit_parameters_for_group)

    # Attach the estimated parameters to the main DataFrame. `DataFrame.join`
    # with `on=` matches the grouping columns against the index of
    # `tobit_params` and preserves the caller's row index, which the
    # orchestrator relies on when aligning results with the original panel.
    df_with_params = full_time_df.join(
        tobit_params.rename(columns={'mu': 'mu_est', 'sigma': 'sigma_est'}),
        on=grouping_keys
    )

    # --- Fallback Mechanism ---
    # Identify observations where estimation failed (e.g., small groups, non-convergence).
    failed_estimation_mask = df_with_params['mu_est'].isnull()
    if failed_estimation_mask.any():
        print(f"[{task_name}] Primary estimation failed for {failed_estimation_mask.sum()} observations. "
              "Applying fallback estimation (pooling by gender).")

        # Fallback 1: Pool by the first grouping key (e.g., 'Gender_Code').
        # The pooled model is fitted on every observation sharing that key, not
        # only on the rows whose primary estimation failed.
        fallback_key = grouping_keys[0]
        fallback_params = df_with_params.groupby(fallback_key).apply(_estimate_tobit_parameters_for_group)

        # Look up the pooled estimates for the failed rows by key value. `map`
        # keeps the row index, so the assignment below aligns correctly.
        failed_keys = df_with_params.loc[failed_estimation_mask, fallback_key]
        df_with_params.loc[failed_estimation_mask, 'mu_est'] = failed_keys.map(fallback_params['mu'])
        df_with_params.loc[failed_estimation_mask, 'sigma_est'] = failed_keys.map(fallback_params['sigma'])

    # Final check for any remaining nulls (e.g., if fallback also failed).
    if df_with_params['mu_est'].isnull().any():
        raise RuntimeError(f"[{task_name}] Wage imputation failed. Could not estimate "
                         "Tobit parameters even after fallback.")

    print(f"[{task_name}] Tobit parameter estimation complete.")
    return df_with_params
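
# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the pipeline): attaching group-level estimates
# ------------------------------------------------------------------------------
# Group-level parameters have to be broadcast back onto individual rows without
# losing the ('Worker_ID', 'snapshot_year') index that later steps align on.
# This toy example uses `DataFrame.join(..., on=...)`, which keeps the caller's
# index; rows whose group has no estimate end up with NaN and are the ones a
# fallback would need to fill. All values and the function name are hypothetical.

import numpy as np
import pandas as pd


def _demo_attach_group_parameters() -> pd.DataFrame:
    rows = pd.DataFrame(
        {'Gender_Code': [1, 1, 2, 2], 'log_wage_raw': [4.2, 4.6, 4.1, 4.4]},
        index=pd.MultiIndex.from_tuples(
            [(101, 1990), (102, 1990), (103, 1990), (104, 1990)],
            names=['Worker_ID', 'snapshot_year'],
        ),
    )
    # Hypothetical per-group estimates; the fit for group 2 is assumed to have failed.
    group_params = pd.DataFrame(
        {'mu_est': [4.4, np.nan], 'sigma_est': [0.3, np.nan]},
        index=pd.Index([1, 2], name='Gender_Code'),
    )
    # Match the 'Gender_Code' column of `rows` against the index of `group_params`.
    return rows.join(group_params, on='Gender_Code')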

# ------------------------------------------------------------------------------
# Task 4, Step 3: Helper to Replace Censored Wages with Conditional Expectation
# ------------------------------------------------------------------------------

def _impute_censored_wages(
    df_with_params: pd.DataFrame,
    task_name: str = "Task 4, Step 3"
) -> pd.DataFrame:
    """
    Imputes censored log wages using the conditional expectation formula.

    Calculates E[log_w | log_w >= cap] for censored observations using the
    estimated Tobit parameters and creates the final imputed wage column.

    Args:
        df_with_params (pd.DataFrame): DataFrame with raw log wages and estimated
                                       Tobit parameters ('mu_est', 'sigma_est').
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The DataFrame with the new 'log_wage_imputed' column.
    """
    # Isolate the data for censored observations.
    censored_df = df_with_params[df_with_params['is_censored']].copy()

    # --- Calculate Conditional Expectation ---
    # E[log_w | log_w >= c] = mu + sigma * (phi(z) / (1 - Phi(z)))
    # where z = (c - mu) / sigma
    mu = censored_df['mu_est']
    sigma = censored_df['sigma_est']
    log_cap = censored_df['log_wage_cap']

    # Standardized value z for the inverse Mills ratio calculation.
    z = (log_cap - mu) / sigma

    # Calculate the inverse Mills ratio (phi(z) / (1 - Phi(z))).
    # Working in logs avoids a 0/0 when z is large and norm.sf(z) underflows.
    inverse_mills_ratio = np.exp(norm.logpdf(z) - norm.logsf(z))

    # Calculate the imputed value.
    imputed_values = mu + sigma * inverse_mills_ratio

    # --- Create the Final Imputed Wage Column ---
    # Initialize the new column with the raw log wages.
    df_with_params['log_wage_imputed'] = df_with_params['log_wage_raw']

    # Update the values for censored observations with the imputed values.
    df_with_params.loc[df_with_params['is_censored'], 'log_wage_imputed'] = imputed_values

    num_imputed = df_with_params['is_censored'].sum()
    print(f"[{task_name}] Imputed {num_imputed} censored wage observations.")

    return df_with_params
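
# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the pipeline): checking the conditional mean
# ------------------------------------------------------------------------------
# The closed form E[log_w | log_w >= c] = mu + sigma * phi(z) / (1 - Phi(z)) can
# be cross-checked against the mean of a lower-truncated normal from
# scipy.stats.truncnorm. The parameter values and the function name are
# assumptions for this example. With mu=4.5, sigma=0.4 and c=4.9 (so z=1), both
# routes give a value of about 5.11, above the cap as expected.

import numpy as np
from scipy.stats import norm, truncnorm


def _demo_conditional_expectation(
    mu: float = 4.5, sigma: float = 0.4, log_cap: float = 4.9
) -> tuple:
    z = (log_cap - mu) / sigma

    # Closed form via the inverse Mills ratio, computed in logs for stability.
    inverse_mills = np.exp(norm.logpdf(z) - norm.logsf(z))
    closed_form = mu + sigma * inverse_mills

    # Reference value: mean of a normal truncated to [log_cap, inf).
    # truncnorm takes its bounds in standardized units.
    reference = truncnorm.mean(z, np.inf, loc=mu, scale=sigma)
    return float(closed_form), float(reference)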

# ------------------------------------------------------------------------------
# Task 4, Orchestrator Function
# ------------------------------------------------------------------------------

def impute_censored_wages(
    worker_year_panel: pd.DataFrame,
    master_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Orchestrates the Tobit-style imputation of right-censored wages.

    This function executes a three-step pipeline for full-time workers:
    1. Prepares the data by creating log wages and censoring flags.
    2. Estimates parameters (mu, sigma) of a censored normal distribution for
       each group specified in the configuration (e.g., by gender and district).
    3. Replaces censored wage values with the conditional expectation based on
       the estimated model parameters.

    The final imputed log wage is added as a new column 'log_wage_imputed' to
    the input worker-year panel.

    Args:
        worker_year_panel (pd.DataFrame): The cleansed worker-year panel from Task 3.
        master_config (Dict[str, Any]): The validated master configuration dictionary.

    Returns:
        pd.DataFrame: The worker-year panel with the 'log_wage_imputed' column added.
                      Censored full-time observations hold the imputed value,
                      uncensored full-time observations hold their original log
                      wage, and non-full-time observations are NaN.
    """
    # --- Input Validation ---
    if not isinstance(worker_year_panel, pd.DataFrame):
        raise TypeError("`worker_year_panel` must be a pandas DataFrame.")
    if not isinstance(master_config, dict):
        raise TypeError("`master_config` must be a dictionary.")

    # Step 1: Subset to full-time workers and prepare data.
    full_time_df = _prepare_wage_imputation_data(worker_year_panel)

    # If there are no full-time workers, there's nothing to impute. Return a
    # copy with an all-NaN column instead of mutating the caller's DataFrame.
    if full_time_df.empty:
        print("Task 4: No full-time workers found. Skipping wage imputation.")
        return worker_year_panel.assign(log_wage_imputed=np.nan)

    # Step 2: Estimate Tobit parameters for each group with fallbacks.
    df_with_params = _estimate_tobit_parameters_by_group(full_time_df, master_config)

    # Step 3: Calculate conditional expectation and create the imputed column.
    df_imputed = _impute_censored_wages(df_with_params)

    # --- Merge Imputed Wages Back into the Main Panel ---
    # We only need the final imputed wage column.
    final_panel = worker_year_panel.merge(
        df_imputed[['log_wage_imputed']],
        left_index=True,
        right_index=True,
        how='left'
    )

    print("Task 4: Censored wage imputation completed successfully.")
    return final_panel


# Task 5: Build annual worker-year panel at June 30 snapshots

# ==============================================================================
# Task 5: Build annual worker-year panel at June 30 snapshots (Enrichment)
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 5, Step 1: Helper to Construct Full Panel Grid and Handle Non-Employment
# ------------------------------------------------------------------------------

def _construct_full_panel_grid(
    employed_panel: pd.DataFrame,
    all_spells_df: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 5, Step 1"
) -> pd.DataFrame:
    """
    Constructs a complete worker-year panel grid and handles non-employment.

    This function expands the panel of employed workers to a full, balanced grid
    covering all unique workers and all study years. This process correctly
    identifies non-employed periods as rows with missing job information.

    Crucially, it also performs a lookback search for workers who were non-employed
    in the baseline year (1990). For this cohort, it finds their last known job
    characteristics from the 1986-1989 period and assigns this information
    directly and efficiently to their 1990 observation in the panel.

    Args:
        employed_panel (pd.DataFrame): The worker-year panel of main jobs,
            enriched with flags (output from Task 3).
        all_spells_df (pd.DataFrame): The fully cleansed spell-level data
            (from Task 3, Step 1), used to identify the universe of all workers
            and for the non-employed lookback.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        pd.DataFrame: A comprehensive worker-year panel indexed by
                      ['Worker_ID', 'snapshot_year'], including both employed
                      and non-employed observations, with lookback information
                      attached for the relevant 1990 non-employed cohort.
    """
    # --- Input Validation ---
    if not isinstance(employed_panel, pd.DataFrame):
        raise TypeError("`employed_panel` must be a pandas DataFrame.")
    if not isinstance(all_spells_df, pd.DataFrame):
        raise TypeError("`all_spells_df` must be a pandas DataFrame.")

    # --- 1. Create Full Panel Grid ---

    # Define the universe of all workers (from the original spell data) and all years.
    all_workers = all_spells_df.index.get_level_values('Worker_ID').unique()
    start_year = config["temporal_parameters"]["DATA_ACQUISITION_START_YEAR"]
    end_year = config["temporal_parameters"]["DATA_ACQUISITION_END_YEAR"]
    all_years = range(start_year, end_year + 1)

    # Create a MultiIndex representing every possible worker-year observation.
    full_grid_index = pd.MultiIndex.from_product(
        [all_workers, all_years], names=['Worker_ID', 'snapshot_year']
    )

    # Create the full panel DataFrame from this grid.
    full_panel = pd.DataFrame(index=full_grid_index)

    # --- 2. Merge Employment Data ---

    # Join the employed panel onto the full grid. A left join ensures all
    # worker-year observations are kept. Non-matches will have NaNs in job-
    # related columns, correctly identifying them as non-employed periods.
    full_panel = full_panel.join(employed_panel, how='left')

    # Create an explicit boolean flag for employment status.
    full_panel['is_employed'] = full_panel['Municipality_ID'].notna()

    print(f"[{task_name}] Full worker-year grid created with {len(full_panel)} observations.")

    # --- 3. Handle Non-Employed in Baseline Year (1990) ---

    # Extract parameters for the lookback operation.
    base_year = config["temporal_parameters"]["BASE_YEAR"]
    lookback_start = config["sample_selection_parameters"]["NON_EMPLOYED_LOOKBACK_START_YEAR"]
    lookback_end = config["sample_selection_parameters"]["NON_EMPLOYED_LOOKBACK_END_YEAR"]

    # Identify the Worker_IDs of the target cohort: non-employed in 1990.
    non_employed_1990_workers = full_panel.loc[
        (full_panel.index.get_level_values('snapshot_year') == base_year) &
        (~full_panel['is_employed'])
    ].index.get_level_values('Worker_ID')

    # Proceed only if such workers exist.
    if not non_employed_1990_workers.empty:
        # a. Perform the lookback search on a separate, small DataFrame.
        # Filter all historical spells to the lookback window for this specific cohort.
        lookback_spells = all_spells_df.loc[
            all_spells_df.index.get_level_values('Worker_ID').isin(non_employed_1990_workers)
        ].reset_index()

        # Add a year column for filtering.
        lookback_spells['spell_year'] = lookback_spells['Start_Date'].dt.year

        # Apply the lookback window filter.
        lookback_spells = lookback_spells[
            (lookback_spells['spell_year'] >= lookback_start) &
            (lookback_spells['spell_year'] <= lookback_end)
        ]

        if not lookback_spells.empty:
            # b. Find the last job for each worker in the lookback period.
            # Sort by date to ensure 'last' is the most recent spell.
            lookback_spells = lookback_spells.sort_values(by=['Worker_ID', 'Start_Date'])

            # Use drop_duplicates to efficiently get the last spell per worker.
            last_jobs = lookback_spells.drop_duplicates(subset=['Worker_ID'], keep='last')

            # Prepare the lookback data for assignment.
            last_jobs = last_jobs.set_index('Worker_ID')[[
                'Municipality_ID', 'spell_year', 'Daily_Wage_EUR'
            ]].rename(columns={
                'Municipality_ID': 'last_pre1990_municipality_id',
                'spell_year': 'last_pre1990_spell_year',
                'Daily_Wage_EUR': 'last_pre1990_daily_wage'
            })

            # c. Use direct assignment with .loc for a precise and efficient update.
            # Create the MultiIndex that targets the exact rows to be updated.
            target_idx = pd.MultiIndex.from_product(
                [last_jobs.index, [base_year]], names=['Worker_ID', 'snapshot_year']
            )

            # Align the data to be assigned with the target index order.
            values_to_assign = last_jobs.reindex(target_idx.get_level_values('Worker_ID'))

            # Assign the values from the lookback directly to the target rows.
            for col in values_to_assign.columns:
                full_panel.loc[target_idx, col] = values_to_assign[col].values

            print(f"[{task_name}] Lookback job history attached for {len(last_jobs)} non-employed 1990 workers.")

    # Return the fully constructed and enriched panel.
    return full_panel
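
# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the pipeline): balanced grid via from_product
# ------------------------------------------------------------------------------
# The grid construction above can be seen on a toy panel: every worker is crossed
# with every year, the employed rows are left-joined onto that grid, and any
# worker-year without a match is flagged as non-employed. Worker IDs, years, and
# municipality codes here are made up for the example.

import pandas as pd


def _demo_balanced_grid() -> pd.DataFrame:
    employed = pd.DataFrame(
        {'Municipality_ID': [10, 10, 20]},
        index=pd.MultiIndex.from_tuples(
            [(1, 1990), (1, 1991), (2, 1991)],
            names=['Worker_ID', 'snapshot_year'],
        ),
    )
    # Cartesian product of all workers and all years.
    grid = pd.MultiIndex.from_product(
        [[1, 2], range(1990, 1992)], names=['Worker_ID', 'snapshot_year']
    )
    panel = pd.DataFrame(index=grid).join(employed, how='left')

    # Worker 2 has no job in 1990, so that row has a missing municipality.
    panel['is_employed'] = panel['Municipality_ID'].notna()
    return panel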

# ------------------------------------------------------------------------------
# Task 5, Steps 2 & 3: Helper to Add Econometric Variables and Group Indicators
# ------------------------------------------------------------------------------

def _add_econometric_variables(
    full_panel: pd.DataFrame,
    task_mapping_df: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 5, Steps 2 & 3"
) -> pd.DataFrame:
    """
    Adds all remaining analysis variables to the full worker-year panel.

    This includes FTE weights, age controls, task assignments, nationality labels,
    and categorical groups for pseudo-panel analysis.

    Args:
        full_panel (pd.DataFrame): The comprehensive panel from the previous step.
        task_mapping_df (pd.DataFrame): The validated task mapping artifact.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The fully enriched, analysis-ready panel.
    """
    panel = full_panel.copy()

    # --- Step 2: Construct Age Controls and Task Assignments ---
    # Compute squared age for wage regressions.
    panel['age_sq'] = panel['age'] ** 2

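    # Merge task information (Routine/Abstract label and intensity).
    # This will add NaNs for non-employed observations, which is correct.
    # A column-on-column merge discards the row index, so the panel index is
    # moved into columns first and restored afterwards.
    index_names = list(panel.index.names)
    panel = panel.reset_index().merge(
        task_mapping_df,
        left_on='Occupation_Code',
        right_on='Occupation_Code_3digit',
        how='left'
    ).drop(columns=['Occupation_Code_3digit']).set_index(index_names)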

    # --- Step 3: Construct FTE Weights, Nationality, and Group Indicators ---
    # Map employment type to Full-Time Equivalent (FTE) weights.
    fte_map = config["algorithm_config_parameters"]["PART_TIME_EQUIVALENCY_WEIGHTS"]
    panel['fte_weight'] = panel['Employment_Type_Code'].map(fte_map).fillna(0) # Non-employed get 0.

    # Map nationality code to labels and create boolean flags.
    nat_map = config["variable_coding_maps"]["NATIONALITY_MAP"]
    panel['nationality'] = panel['Nationality_Code'].map(nat_map)
    panel['is_native'] = (panel['nationality'] == 'German')
    panel['is_czech'] = (panel['nationality'] == 'Czech')

    # Create categorical groups for pseudo-panel analysis.
    # Age groups. With right-closed bins, the edges 29 and 50 yield
    # (-inf, 29], (29, 50], (50, inf), matching the labels below.
    age_bins = [-np.inf, 29, 50, np.inf]
    age_labels = ['<30', '30-50', '>50']
    panel['age_group'] = pd.cut(panel['age'], bins=age_bins, labels=age_labels, right=True)

    # Education groups.
    edu_map = {1: "No vocational/apprenticeship", 2: "Vocational/apprenticeship",
               3: "University", 4: "University"} # Grouping 3 and 4
    panel['education_group'] = panel['Education_Level_Code'].map(edu_map)

    # Gender groups.
    gender_map = config["variable_coding_maps"]["GENDER_MAP"]
    panel['gender_group'] = panel['Gender_Code'].map(gender_map)

    print(f"[{task_name}] All econometric variables and group indicators added.")
    return panel
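
# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the pipeline): boundary ages under pd.cut
# ------------------------------------------------------------------------------
# With `right=True`, each bin includes its upper edge, so the choice of edge
# decides where a boundary age such as 50 lands. This toy function bins a few
# ages for a given upper edge of the middle group; the ages and the function
# name are assumptions for the example.

import numpy as np
import pandas as pd


def _demo_age_bins(upper_edge_of_middle_bin: int) -> list:
    ages = pd.Series([29, 30, 49, 50, 51])
    bins = [-np.inf, 29, upper_edge_of_middle_bin, np.inf]
    groups = pd.cut(ages, bins=bins, labels=['<30', '30-50', '>50'], right=True)
    return groups.astype(str).tolist()

# An upper edge of 49 sends age 50 to '>50'; an upper edge of 50 keeps it in '30-50'.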

# ------------------------------------------------------------------------------
# Task 5, Orchestrator Function
# ------------------------------------------------------------------------------

def build_analysis_panel(
    analysis_panel_employed: pd.DataFrame,
    all_spells_cleansed: pd.DataFrame,
    validated_artifacts: Dict[str, pd.DataFrame],
    master_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Orchestrates the construction of the final, fully enriched analysis panel.

    This function takes the panel of employed workers (from Task 3) and expands
    it to include non-employed observations. It then enriches this complete
    panel with all variables required for the study's econometric analyses,
    including FTE weights, age controls, task assignments, nationality flags,
    and pseudo-panel group identifiers.

    Args:
        analysis_panel_employed (pd.DataFrame): The filtered worker-year panel
            of main jobs for employed individuals, output from Task 3.
        all_spells_cleansed (pd.DataFrame): The full, cleansed spell-level data
            from Task 3, Step 1, used for identifying all workers and for the
            non-employed lookback.
        validated_artifacts (Dict[str, pd.DataFrame]): A dictionary containing
            the validated auxiliary data, including the 'task_mapping' DataFrame.
        master_config (Dict[str, Any]): The validated master configuration dictionary.

    Returns:
        pd.DataFrame: The definitive, fully specified worker-year panel dataset,
                      ready for all subsequent aggregation and estimation tasks.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel_employed, pd.DataFrame):
        raise TypeError("`analysis_panel_employed` must be a pandas DataFrame.")
    if not isinstance(all_spells_cleansed, pd.DataFrame):
        raise TypeError("`all_spells_cleansed` must be a pandas DataFrame.")
    if 'task_mapping' not in validated_artifacts:
        raise KeyError("Validated 'task_mapping' artifact not found in `validated_artifacts`.")

    # Step 1: Construct the full panel grid, including non-employed observations.
    full_panel = _construct_full_panel_grid(
        employed_panel=analysis_panel_employed,
        all_spells_df=all_spells_cleansed,
        config=master_config
    )

    # Step 2 & 3: Add all remaining econometric variables and group indicators.
    final_panel = _add_econometric_variables(
        full_panel=full_panel,
        task_mapping_df=validated_artifacts['task_mapping'],
        config=master_config
    )

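    # Merge the imputed log wage from the employed panel.
    # The imputed wage only exists for employed, full-time workers. The Step 1
    # join already carries over every column of the employed panel, so merge
    # only when the column is still missing; merging it a second time would
    # produce suffixed duplicates ('log_wage_imputed_x', 'log_wage_imputed_y').
    if ('log_wage_imputed' in analysis_panel_employed.columns
            and 'log_wage_imputed' not in final_panel.columns):
        final_panel = final_panel.merge(
            analysis_panel_employed[['log_wage_imputed']],
            left_index=True,
            right_index=True,
            how='left'
        )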

    print("Task 5: Final analysis panel built successfully.")
    return final_panel
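
# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the pipeline): overlapping columns in a merge
# ------------------------------------------------------------------------------
# A reminder of how pandas treats a column that exists on both sides of an
# index-on-index merge, which matters whenever 'log_wage_imputed' has already
# reached the panel through the Step 1 join. The toy frames and the function
# name are hypothetical.

import pandas as pd


def _demo_overlapping_merge() -> list:
    idx = pd.MultiIndex.from_tuples(
        [(1, 1990), (2, 1990)], names=['Worker_ID', 'snapshot_year']
    )
    panel = pd.DataFrame({'log_wage_imputed': [4.3, 4.7]}, index=idx)
    extra = pd.DataFrame({'log_wage_imputed': [4.3, 4.7]}, index=idx)

    # Both frames carry the same non-key column, so pandas disambiguates the
    # two copies with its default '_x' / '_y' suffixes.
    merged = panel.merge(extra, left_index=True, right_index=True, how='left')
    return list(merged.columns)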


# Task 6: Aggregate to municipality-year level for regional analyses

# ==============================================================================
# Task 6: Aggregate to municipality-year level for regional analyses
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 6, Step 1: Helper to Compute Native and Total FTE Employment Stocks
# ------------------------------------------------------------------------------

def _aggregate_employment_stocks(
    analysis_panel: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 6, Step 1"
) -> pd.DataFrame:
    """
    Aggregates worker-level data to compute municipality-year employment stocks.

    This function calculates Full-Time Equivalent (FTE) employment for native
    workers and all workers combined for each municipality and year. It also
    extracts the baseline native employment to be used as regression weights.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: A municipality-year panel with employment stock columns.
    """
    # Filter to only employed observations for aggregation.
    employed_df = analysis_panel[analysis_panel['is_employed']].copy()

    # Define the quantities to aggregate.
    # E_native_rt: Sum of FTE weights for native workers.
    # Total_rt: Sum of FTE weights for all workers.
    # N_native_fulltime: Count of full-time native workers for wage averaging.
    # Pre-computing the native-only columns avoids a per-group lookup inside
    # lambda functions and allows plain named aggregation.
    employed_df['native_fte'] = employed_df['fte_weight'].where(employed_df['is_native'], 0.0)
    employed_df['native_full_time'] = (
        employed_df['is_native'] & employed_df['is_full_time'].eq(True)
    ).astype(int)

    # Perform the aggregation at the municipality-year level.
    regional_panel = employed_df.groupby(['Municipality_ID', 'snapshot_year']).agg(
        native_fte_employment=('native_fte', 'sum'),
        total_fte_employment=('fte_weight', 'sum'),
        native_fulltime_headcount=('native_full_time', 'sum')
    )

    # --- Ensure a balanced panel structure ---
    # Create a full grid of all municipalities and years.
    all_municipalities = analysis_panel['Municipality_ID'].dropna().unique()
    start_year = config["temporal_parameters"]["DATA_ACQUISITION_START_YEAR"]
    end_year = config["temporal_parameters"]["DATA_ACQUISITION_END_YEAR"]
    all_years = range(start_year, end_year + 1)
    full_grid_index = pd.MultiIndex.from_product(
        [all_municipalities, all_years], names=['Municipality_ID', 'snapshot_year']
    )

    # Reindex the panel to the full grid, filling missing employment with 0.
    regional_panel = regional_panel.reindex(full_grid_index, fill_value=0)

    # --- Extract Baseline Weights ---
    # The weight for all municipality-level regressions is the native FTE
    # employment in the baseline year.
    baseline_year = config["temporal_parameters"]["BASELINE_WEIGHT_YEAR"]
    weight_col_name = config["algorithm_config_parameters"]["WEIGHT_COLUMN_MUNICIPALITY"]

    # xs() selects the baseline-year cross-section and drops the year level,
    # leaving one weight per municipality.
    baseline_weights = regional_panel.xs(
        baseline_year, level='snapshot_year'
    )['native_fte_employment'].rename(weight_col_name)

    # Broadcast the baseline weight to every municipality-year row while
    # keeping the ('Municipality_ID', 'snapshot_year') index intact.
    municipality_ids = regional_panel.index.get_level_values('Municipality_ID')
    regional_panel[weight_col_name] = baseline_weights.reindex(municipality_ids).to_numpy()
    regional_panel[weight_col_name] = regional_panel[weight_col_name].fillna(0)

    print(f"[{task_name}] Aggregated employment stocks for {len(all_municipalities)} municipalities.")
    return regional_panel
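
# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the pipeline): FTE stocks and baseline weights
# ------------------------------------------------------------------------------
# A toy version of the aggregation above: native and total FTE employment are
# summed per municipality-year, and the baseline-year native stock is then
# repeated on every row of the same municipality. The data, the baseline year
# 1990, and the column name 'baseline_weight' are assumptions for this example.

import pandas as pd


def _demo_fte_stocks() -> pd.DataFrame:
    workers = pd.DataFrame({
        'Municipality_ID': [10, 10, 10, 20],
        'snapshot_year': [1990, 1990, 1991, 1990],
        'fte_weight': [1.0, 0.5, 1.0, 0.5],
        'is_native': [True, False, True, True],
    })
    # Zero out the FTE weight of non-natives so a plain sum gives the native stock.
    workers['native_fte'] = workers['fte_weight'].where(workers['is_native'], 0.0)
    stocks = workers.groupby(['Municipality_ID', 'snapshot_year']).agg(
        native_fte_employment=('native_fte', 'sum'),
        total_fte_employment=('fte_weight', 'sum'),
    )
    # One baseline value per municipality, broadcast to all of its years.
    baseline = stocks.xs(1990, level='snapshot_year')['native_fte_employment']
    ids = stocks.index.get_level_values('Municipality_ID')
    stocks['baseline_weight'] = baseline.reindex(ids).to_numpy()
    return stocks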

# ------------------------------------------------------------------------------
# Task 6, Step 2: Helper to Compute Mean Log Wages for Full-Time Natives
# ------------------------------------------------------------------------------

def _aggregate_regional_wages(
    analysis_panel: pd.DataFrame,
    regional_panel: pd.DataFrame,
    task_name: str = "Task 6, Step 2"
) -> pd.DataFrame:
    """
    Aggregates worker-level wages to compute municipality-year mean log wages.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        regional_panel (pd.DataFrame): The municipality-year panel with employment stocks.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The regional panel augmented with mean log wage columns.
    """
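    # Filter to the sample relevant for wage analysis. Non-employed rows carry
    # NaN in 'is_full_time' after the grid join, so compare against True to
    # obtain a clean boolean mask.
    wage_sample = analysis_panel[
        analysis_panel['is_native'] & analysis_panel['is_full_time'].eq(True)
    ].copy()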

    # Calculate the mean log wage for each municipality-year.
    mean_log_wages = wage_sample.groupby(['Municipality_ID', 'snapshot_year'])['log_wage_imputed'].mean()
    mean_log_wages.name = 'mean_log_wage'

    # Merge the mean log wages into the main regional panel.
    # Municipalities with no full-time natives in a year will have NaN, which is correct.
    regional_panel_out = regional_panel.merge(
        mean_log_wages, on=['Municipality_ID', 'snapshot_year'], how='left'
    )

    print(f"[{task_name}] Aggregated mean log wages for full-time natives.")
    return regional_panel_out

# ------------------------------------------------------------------------------
# Task 6, Step 3: Helper to Compute National Mean Log Wages by Year
# ------------------------------------------------------------------------------

def _compute_national_mean_wages(
    analysis_panel: pd.DataFrame,
    task_name: str = "Task 6, Step 3"
) -> pd.Series:
    """
    Computes the national mean log wage for full-time natives for each year.

    This is a required input for the counterfactual wage imputation for
    non-employed workers in Task 15.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.Series: A Series of national mean log wages, indexed by year.
    """
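    # Filter to the sample relevant for wage analysis. Non-employed rows carry
    # NaN in 'is_full_time' after the grid join, so compare against True to
    # obtain a clean boolean mask.
    wage_sample = analysis_panel[
        analysis_panel['is_native'] & analysis_panel['is_full_time'].eq(True)
    ].copy()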

    # Group by year only and calculate the mean log wage.
    national_wages = wage_sample.groupby('snapshot_year')['log_wage_imputed'].mean()
    national_wages.name = 'national_mean_log_wage'

    print(f"[{task_name}] Computed national mean log wages for years {national_wages.index.min()}-{national_wages.index.max()}.")
    return national_wages

# ------------------------------------------------------------------------------
# Task 6, Orchestrator Function
# ------------------------------------------------------------------------------

def aggregate_to_regional_panel(
    analysis_panel: pd.DataFrame,
    master_config: Dict[str, Any]
) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Orchestrates the aggregation of the worker panel to the regional (municipality-year) level.

    This function performs three key aggregations:
    1. Computes native and total Full-Time Equivalent (FTE) employment stocks.
    2. Computes the mean imputed log wage for full-time native workers.
    3. Computes the national average log wage for each year, needed for later tasks.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        master_config (Dict[str, Any]): The validated master configuration dictionary.

    Returns:
        Tuple[pd.DataFrame, pd.Series]:
        - A DataFrame indexed by ('Municipality_ID', 'snapshot_year') containing
          all aggregated regional variables.
        - A Series indexed by 'snapshot_year' containing the national mean log wage.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(master_config, dict):
        raise TypeError("`master_config` must be a dictionary.")

    # Step 1: Compute native and total FTE employment stocks and baseline weights.
    regional_panel = _aggregate_employment_stocks(analysis_panel, master_config)

    # Step 2: Compute and merge mean log wages for full-time natives.
    regional_panel_with_wages = _aggregate_regional_wages(analysis_panel, regional_panel)

    # Step 3: Compute the national mean log wage series as a separate artifact.
    national_wage_series = _compute_national_mean_wages(analysis_panel)

    print("Task 6: Aggregation to municipality-year panel completed successfully.")
    return regional_panel_with_wages, national_wage_series
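
As a rough illustration of the three aggregations performed in Task 6, the sketch below runs them on a four-row toy panel. The column names mirror those used in this section (`fte_weight`, `log_wage_imputed`, `is_native`, `is_full_time`), but the data are made up and the direct `groupby` calls are an assumption for illustration, not the repository's helper functions.

```python
import pandas as pd

# Hypothetical toy worker-year panel; all values are invented for illustration.
toy = pd.DataFrame({
    "Municipality_ID": ["A", "A", "B", "B"],
    "snapshot_year": [1990, 1990, 1990, 1990],
    "is_native": [True, False, True, True],
    "is_full_time": [True, True, True, False],
    "fte_weight": [1.0, 1.0, 1.0, 0.5],
    "log_wage_imputed": [4.0, 3.5, 4.2, 3.0],
})

keys = ["Municipality_ID", "snapshot_year"]

# 1. Total FTE employment stock per municipality-year (all nationalities).
total_fte = toy.groupby(keys)["fte_weight"].sum()

# 2. Mean log wage of full-time natives per municipality-year.
ft_natives = toy[toy["is_native"] & toy["is_full_time"]]
regional_wage = ft_natives.groupby(keys)["log_wage_imputed"].mean()

# 3. National mean log wage of full-time natives per year.
national_wage = ft_natives.groupby("snapshot_year")["log_wage_imputed"].mean()
```

Part-time workers enter the employment stock through their FTE weight but are excluded from both wage means, which is why municipality `B` contributes only one wage observation here.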


In [None]:
# Task 7: Construct the immigration shock ΔI_r per municipality

# ==============================================================================
# Task 7: Construct the immigration shock ΔI_r per municipality
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 7, Step 1: Helper to Compute Czech FTE Counts by Year
# ------------------------------------------------------------------------------

def _compute_czech_fte_by_year(
    analysis_panel: pd.DataFrame,
    task_name: str = "Task 7, Step 1"
) -> pd.DataFrame:
    """
    Computes the Full-Time Equivalent (FTE) count of Czech workers by municipality and year.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'Municipality_ID' with columns
                      for Czech FTE counts in each relevant year (e.g., 'fte_1990').
    """
    # Filter to Czech workers who are employed.
    czech_workers = analysis_panel[analysis_panel['is_czech'] & analysis_panel['is_employed']].copy()

    # If no Czech workers are found, return an empty DataFrame with the expected structure.
    if czech_workers.empty:
        print(f"[{task_name}] No Czech workers found in the panel. Returning empty FTE counts.")
        all_municipalities = analysis_panel['Municipality_ID'].dropna().unique()
        return pd.DataFrame(index=all_municipalities)

    # Aggregate FTE weights by municipality and year.
    czech_fte = czech_workers.groupby(['Municipality_ID', 'snapshot_year'])['fte_weight'].sum()

    # Pivot to wide format (municipalities as rows, years as columns), filling
    # municipality-year cells with no Czech workers with 0 instead of NaN.
    czech_fte_wide = czech_fte.unstack(level='snapshot_year', fill_value=0).add_prefix('fte_')

    # Ensure all municipalities from the original panel are present, filling missing with 0.
    all_municipalities = analysis_panel['Municipality_ID'].dropna().unique()
    czech_fte_wide = czech_fte_wide.reindex(all_municipalities, fill_value=0)

    print(f"[{task_name}] Computed Czech FTE counts for {len(czech_fte_wide)} municipalities.")
    return czech_fte_wide

# ------------------------------------------------------------------------------
# Task 7, Steps 2 & 3: Helper to Compute the Immigration Shock Variables
# ------------------------------------------------------------------------------

def _compute_immigration_shocks(
    czech_fte_df: pd.DataFrame,
    regional_panel: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 7, Steps 2 & 3"
) -> pd.DataFrame:
    """
    Computes the main and event-year specific immigration shock variables.

    Args:
        czech_fte_df (pd.DataFrame): Wide-format DataFrame of Czech FTEs by year.
        regional_panel (pd.DataFrame): The municipality-year panel from Task 6.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'Municipality_ID' with columns
                      'shock_main' and 'shock_1991'.
    """
    # --- Step 2: Extract Baseline Denominator ---
    # The denominator is the total FTE employment (all nationalities) in the baseline year.
    denom_year = config["shock_and_scaling_parameters"]["SHOCK_DENOMINATOR_YEAR"]

    # Extract the denominator series from the regional panel.
    denominator = regional_panel.loc[
        (slice(None), denom_year), 'total_fte_employment'
    ].droplevel('snapshot_year')

    # Validate the denominator to prevent division by zero.
    if (denominator <= 0).any():
        zero_denom_munis = denominator[denominator <= 0].index.tolist()
        raise ValueError(
            f"[{task_name}] Found {len(zero_denom_munis)} municipalities with zero or "
            f"negative total employment in baseline year {denom_year}. "
            f"Cannot compute shock. Example: {zero_denom_munis[:5]}"
        )

    # --- Step 3: Compute Shock Regressors ---
    # Extract start and end years for shock definitions from config.
    main_shock_years = config["temporal_parameters"]["EVENT_STUDY_SHOCK_RULES"]["1992_to_1995"]
    y_start_main, y_end_main = main_shock_years['start_year'], main_shock_years['end_year']

    event_1991_years = config["temporal_parameters"]["EVENT_STUDY_SHOCK_RULES"]["1991"]
    y_start_1991, y_end_1991 = event_1991_years['start_year'], event_1991_years['end_year']

    # Ensure required year columns exist in the Czech FTE data.
    required_cols = {f'fte_{y}' for y in [y_start_main, y_end_main, y_start_1991, y_end_1991]}
    if not required_cols.issubset(czech_fte_df.columns):
        missing = required_cols - set(czech_fte_df.columns)
        raise KeyError(f"[{task_name}] Missing required Czech FTE columns: {missing}")

    # Compute the numerator for the main shock (1990-1992).
    # ΔCzech_r = Czech^92_r - Czech^90_r
    numerator_main = czech_fte_df[f'fte_{y_end_main}'] - czech_fte_df[f'fte_{y_start_main}']

    # Compute the main shock variable.
    # ΔI_r = (Czech^92_r - Czech^90_r) / Total^90_r
    shock_main = numerator_main / denominator

    # Compute the numerator for the 1991 event-year shock (1990-1991).
    # ΔCzech_r^1991 = Czech^91_r - Czech^90_r
    numerator_1991 = czech_fte_df[f'fte_{y_end_1991}'] - czech_fte_df[f'fte_{y_start_1991}']

    # Compute the 1991 event-year shock variable.
    # ΔI^1991_r = (Czech^91_r - Czech^90_r) / Total^90_r
    shock_1991 = numerator_1991 / denominator

    # Combine into a single DataFrame.
    shock_df = pd.DataFrame({
        'shock_main': shock_main,
        'shock_1991': shock_1991
    })

    # NaNs arise when a municipality is missing from either the Czech FTE table or
    # the baseline-year denominator (index misalignment); treat these as zero shock.
    shock_df = shock_df.fillna(0)

    print(f"[{task_name}] Immigration shock variables computed successfully.")
    return shock_df

# ------------------------------------------------------------------------------
# Task 7, Orchestrator Function
# ------------------------------------------------------------------------------

def construct_immigration_shock(
    analysis_panel: pd.DataFrame,
    regional_panel: pd.DataFrame,
    master_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Orchestrates the construction of the immigration shock variables.

    This function calculates the change in Czech Full-Time Equivalent (FTE)
    employment as a share of total baseline employment for each municipality.
    It produces two versions of the shock variable as specified in the paper's
    event-study design:
    1. `shock_main`: Based on the 1990-1992 change, used for post-1991 years.
    2. `shock_1991`: Based on the 1990-1991 change, used for the 1991 event year.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        regional_panel (pd.DataFrame): The municipality-year panel from Task 6.
        master_config (Dict[str, Any]): The validated master configuration dictionary.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'Municipality_ID' containing the
                      'shock_main' and 'shock_1991' variables.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(regional_panel, pd.DataFrame):
        raise TypeError("`regional_panel` must be a pandas DataFrame.")
    if not isinstance(master_config, dict):
        raise TypeError("`master_config` must be a dictionary.")

    # Step 1: Compute Czech FTE counts for all relevant years in a wide format.
    czech_fte_df = _compute_czech_fte_by_year(analysis_panel)

    # Steps 2 & 3: Extract the denominator and compute the final shock variables.
    shock_df = _compute_immigration_shocks(
        czech_fte_df=czech_fte_df,
        regional_panel=regional_panel,
        config=master_config
    )

    print("Task 7: Construction of immigration shock variables completed successfully.")
    return shock_df
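
The shock definition used in Task 7, ΔI_r = (Czech_r^1992 − Czech_r^1990) / Total_r^1990, can be checked by hand on a small example. The sketch below is a minimal illustration under assumed data: the two municipalities, their FTE counts, and the `total_1990` series are hypothetical, and the years are hard-coded rather than read from the configuration.

```python
import pandas as pd

# Hypothetical Czech worker-year rows; municipality "B" has no Czech workers in 1990.
czech = pd.DataFrame({
    "Municipality_ID": ["A", "A", "B"],
    "snapshot_year": [1990, 1992, 1992],
    "fte_weight": [2.0, 7.0, 3.0],
})

# Hypothetical total FTE employment (all nationalities) in the 1990 baseline year.
total_1990 = pd.Series({"A": 100.0, "B": 50.0}, name="total_fte_employment")

# Sum FTE weights and pivot to wide format. fill_value=0 turns absent
# municipality-year cells into zeros rather than NaN.
wide = (
    czech.groupby(["Municipality_ID", "snapshot_year"])["fte_weight"]
    .sum()
    .unstack("snapshot_year", fill_value=0)
    .add_prefix("fte_")
)

# Shock: change in Czech FTE employment as a share of baseline total employment.
shock_main = (wide["fte_1992"] - wide["fte_1990"]) / total_1990
```

Municipality `A` gains 5 Czech FTEs on a base of 100 (shock 0.05), and `B` gains 3 on a base of 50 (shock 0.06). Without `fill_value=0`, the missing 1990 cell for `B` would propagate as NaN into the numerator.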


In [None]:
# Task 8: Construct instrumental variables (distance to border)

# ==============================================================================
# Task 8: Construct instrumental variables (distance to border)
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 8, Step 1: Helper to Harmonize Coordinate Systems and Units
# ------------------------------------------------------------------------------

def _prepare_coordinates(
    analysis_panel: pd.DataFrame,
    border_crossings_df: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 8, Step 1"
) -> Tuple[pd.DataFrame, np.ndarray, np.ndarray, float]:
    """
    Prepares and harmonizes municipality and border crossing coordinates.

    Extracts unique municipality coordinates, validates CRS, and transforms
    coordinates to the required system (e.g., WGS84 for great-circle) based
    on the configuration.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        border_crossings_df (pd.DataFrame): Validated border crossing data.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Tuple containing:
        - pd.DataFrame: Unique municipalities with their original coordinates.
        - np.ndarray: N x 2 array of harmonized municipality coordinates.
        - np.ndarray: M x 2 array of harmonized border crossing coordinates.
        - float: Scaling factor to convert distance results to kilometers.
    """
    # Extract unique municipality coordinates to avoid redundant calculations.
    muni_coords = analysis_panel[['Municipality_ID', 'Workplace_Coord_X_UTM', 'Workplace_Coord_Y_UTM']]
    muni_coords = muni_coords.drop_duplicates(subset=['Municipality_ID']).set_index('Municipality_ID')

    # Extract coordinate arrays.
    muni_coords_arr = muni_coords[['Workplace_Coord_X_UTM', 'Workplace_Coord_Y_UTM']].to_numpy()
    border_coords_arr = border_crossings_df[['Coord_X_UTM', 'Coord_Y_UTM']].to_numpy()

    # Get geospatial parameters from config.
    calc_method = config["algorithm_config_parameters"]["DISTANCE_CALCULATION_METHOD"]
    units = config["algorithm_config_parameters"]["DISTANCE_UNITS"]
    source_crs_str = config["geographic_policy_parameters"]["CRS_UTM_EPSG"]

    # Initialize scaling factor (default is no scaling).
    scaling_factor = 1.0

    if calc_method == "great_circle":
        # For Haversine distance, coordinates must be in latitude/longitude (WGS84).
        print(f"[{task_name}] Transforming coordinates from {source_crs_str} to WGS84 (EPSG:4326) for great-circle distance.")
        try:
            source_crs = CRS(source_crs_str)
            target_crs = CRS("EPSG:4326") # WGS84
            transformer = Transformer.from_crs(source_crs, target_crs, always_xy=True)

            # Transform coordinates. Note: pyproj expects (x, y) order.
            muni_lon, muni_lat = transformer.transform(muni_coords_arr[:, 0], muni_coords_arr[:, 1])
            border_lon, border_lat = transformer.transform(border_coords_arr[:, 0], border_coords_arr[:, 1])

            # Many Haversine implementations expect (lat, lon). Ours expects (lon, lat)
            # in radians, so we keep pyproj's (x, y) output order.
            muni_coords_harmonized = np.radians(np.column_stack([muni_lon, muni_lat]))
            border_coords_harmonized = np.radians(np.column_stack([border_lon, border_lat]))

        except Exception as e:
            raise ValueError(f"[{task_name}] Failed to transform coordinates. Error: {e}") from e

    elif calc_method == "euclidean":
        # For Euclidean distance, assume coordinates are in a projected CRS (e.g., UTM).
        print(f"[{task_name}] Using Euclidean distance on projected coordinates from {source_crs_str}.")
        muni_coords_harmonized = muni_coords_arr
        border_coords_harmonized = border_coords_arr
        # If the source CRS is in meters, we need to scale the result to km.
        if units == "km":
            scaling_factor = 0.001
    else:
        raise ValueError(f"[{task_name}] Invalid DISTANCE_CALCULATION_METHOD: '{calc_method}'")

    return muni_coords.reset_index(), muni_coords_harmonized, border_coords_harmonized, scaling_factor

# ------------------------------------------------------------------------------
# Task 8, Step 2: Helper to Compute Minimum Distance to Border
# ------------------------------------------------------------------------------

def _compute_min_distances(
    muni_coords: np.ndarray,
    border_coords: np.ndarray,
    method: str,
    scaling_factor: float,
    task_name: str = "Task 8, Step 2"
) -> np.ndarray:
    """
    Computes the minimum distance from each municipality to any border crossing.

    Args:
        muni_coords (np.ndarray): Harmonized municipality coordinates.
        border_coords (np.ndarray): Harmonized border crossing coordinates.
        method (str): The calculation method ('euclidean' or 'great_circle').
        scaling_factor (float): Factor to convert distance to desired units (km).
        task_name (str): The name of the calling task for error reporting.

    Returns:
        np.ndarray: A 1D array of minimum distances for each municipality.
    """
    if method == "euclidean":
        # Use scipy's highly optimized cdist for Euclidean distance.
        dist_matrix = cdist(muni_coords, border_coords, metric='euclidean')

    elif method == "great_circle":
        # Implement vectorized Haversine formula for great-circle distance.
        # Coords are expected in radians: (lon, lat).
        lon1, lat1 = muni_coords[:, 0][:, np.newaxis], muni_coords[:, 1][:, np.newaxis]
        lon2, lat2 = border_coords[:, 0], border_coords[:, 1]

        dlon = lon2 - lon1
        dlat = lat2 - lat1

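        a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
        # Clip to [0, 1] so floating-point overshoot cannot produce NaN in arcsin.
        c = 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

        # Earth radius in kilometers.
        R = 6371.0
        dist_matrix = R * c

    else:
        raise ValueError(f"[{task_name}] Invalid distance method: '{method}'")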

    # Find the minimum distance for each municipality (row-wise minimum).
    min_distances = np.min(dist_matrix, axis=1)

    # Apply the scaling factor (e.g., for meters to km).
    min_distances_scaled = min_distances * scaling_factor

    print(f"[{task_name}] Minimum distances computed for {len(min_distances_scaled)} municipalities.")
    return min_distances_scaled

# ------------------------------------------------------------------------------
# Task 8, Orchestrator Function
# ------------------------------------------------------------------------------

def construct_instrumental_variables(
    analysis_panel: pd.DataFrame,
    validated_artifacts: Dict[str, Any],
    master_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Orchestrates the construction of the instrumental variables.

    This function calculates the distance from each municipality to the nearest
    border crossing and its square. These variables serve as instruments for
    the immigration shock in the 2SLS regressions.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        validated_artifacts (Dict[str, Any]): Dictionary of validated auxiliary data,
                                              including the 'border_crossings' DataFrame.
        master_config (Dict[str, Any]): The validated master configuration dictionary.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'Municipality_ID' containing the
                      instrumental variables 'distance_to_border' and
                      'distance_to_border_sq', and optional interactions.
    """
    # --- Input Validation ---
    if 'border_crossings' not in validated_artifacts:
        raise KeyError("Validated 'border_crossings' artifact not found in `validated_artifacts`.")

    # Step 1: Harmonize coordinate systems.
    unique_munis, muni_coords_h, border_coords_h, scale = _prepare_coordinates(
        analysis_panel=analysis_panel,
        border_crossings_df=validated_artifacts['border_crossings'],
        config=master_config
    )

    # Step 2: Compute the minimum distance for each municipality.
    min_distances = _compute_min_distances(
        muni_coords=muni_coords_h,
        border_coords=border_coords_h,
        method=master_config["algorithm_config_parameters"]["DISTANCE_CALCULATION_METHOD"],
        scaling_factor=scale
    )

    # Step 3: Assemble the final instruments DataFrame.
    instruments_df = pd.DataFrame({
        'Municipality_ID': unique_munis['Municipality_ID'],
        'distance_to_border': min_distances
    }).set_index('Municipality_ID')

    # Construct the squared distance term.
    instruments_df['distance_to_border_sq'] = instruments_df['distance_to_border'] ** 2

    # Construct interaction terms if specified in the config.
    if master_config["algorithm_config_parameters"]["IV_INTERACTION_WITH_BORDER_DUMMY"]:
        # Get the time-invariant 'Is_Border_Region' flag for each municipality.
        border_region_flag = analysis_panel.groupby('Municipality_ID')['Is_Border_Region'].first()
        instruments_df = instruments_df.merge(border_region_flag, on='Municipality_ID', how='left')

        # Create interaction terms.
        instruments_df['dist_x_border'] = instruments_df['distance_to_border'] * instruments_df['Is_Border_Region']
        instruments_df['dist_sq_x_border'] = instruments_df['distance_to_border_sq'] * instruments_df['Is_Border_Region']
        print("Task 8, Step 3: Interaction-term instruments constructed.")

    print("Task 8: Construction of instrumental variables completed successfully.")
    return instruments_df
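
The great-circle branch of Task 8 relies on the Haversine formula. The following standard-library sketch applies the same formula to a single point so the result can be verified by hand; the function names and the coordinates are hypothetical and are not part of the pipeline, which works on vectorized NumPy arrays in radians instead.

```python
import math

EARTH_RADIUS_KM = 6371.0  # Mean Earth radius, the same constant used in this section.

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    # Clamp to [0, 1] to guard against floating-point overshoot before asin.
    a = min(1.0, max(0.0, a))
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def min_distance_km(point, crossings):
    """Distance from one (lon, lat) point to the nearest of several crossings."""
    return min(haversine_km(point[0], point[1], lon, lat) for lon, lat in crossings)

# Hypothetical coordinates, chosen so the answer is easy to check:
# the nearest crossing lies one degree of longitude away along the equator.
municipality = (0.0, 0.0)
crossings = [(1.0, 0.0), (0.0, 3.0)]
nearest = min_distance_km(municipality, crossings)
```

One degree of arc on a sphere of radius 6371 km is 6371 · π / 180 km, which is what `nearest` should equal up to floating-point error.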


In [None]:
# Task 9: Prepare event-study datasets (outcomes, shock, instruments, weights, clusters)

# ==============================================================================
# Task 9: Prepare event-study datasets
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 9, Step 1: Helper to Define Outcome Variables Relative to Base Year
# ------------------------------------------------------------------------------

def _compute_event_study_outcomes(
    regional_panel: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 9, Step 1"
) -> pd.DataFrame:
    """
    Computes event-study outcome variables relative to the base year (1990).

    This function calculates the relative (proportional) change in native
    employment and the absolute change in mean log wages for each
    municipality-year relative to its 1990 baseline value.

    Args:
        regional_panel (pd.DataFrame): The aggregated municipality-year panel from Task 6.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The regional panel augmented with 'emp_outcome' and 'wage_outcome'.
    """
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # Create a DataFrame containing only the baseline year values.
    baseline_values = regional_panel.loc[(slice(None), base_year), :].copy()
    baseline_values = baseline_values.reset_index(level='snapshot_year', drop=True)
    baseline_values = baseline_values.rename(columns={
        'native_fte_employment': 'native_fte_employment_base',
        'mean_log_wage': 'mean_log_wage_base'
    })

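    # Join baseline values back onto the full regional panel. Joining on the index
    # level (rather than merging) keeps the ('Municipality_ID', 'snapshot_year') index.
    panel_with_base = regional_panel.join(
        baseline_values[['native_fte_employment_base', 'mean_log_wage_base']],
        on='Municipality_ID',
        how='left'
    )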

    # --- Compute Employment Outcome ---
    # Outcome: (E_rt - E_r,1990) / E_r,1990
    # Suppress NumPy warnings for zero baselines: x/0 yields +/-inf and 0/0 yields NaN.
    with np.errstate(divide='ignore', invalid='ignore'):
        panel_with_base['emp_outcome'] = np.divide(
            panel_with_base['native_fte_employment'] - panel_with_base['native_fte_employment_base'],
            panel_with_base['native_fte_employment_base']
        )
    # Replace inf/-inf (a nonzero change over a zero baseline) with NaN.
    # Assign the result back instead of using inplace=True on a column selection.
    panel_with_base['emp_outcome'] = panel_with_base['emp_outcome'].replace([np.inf, -np.inf], np.nan)

    # --- Compute Wage Outcome ---
    # Outcome: log_w_bar_rt - log_w_bar_r,1990
    panel_with_base['wage_outcome'] = panel_with_base['mean_log_wage'] - panel_with_base['mean_log_wage_base']

    # Set outcomes for the base year to exactly 0.
    panel_with_base.loc[panel_with_base.index.get_level_values('snapshot_year') == base_year, ['emp_outcome', 'wage_outcome']] = 0.0

    print(f"[{task_name}] Event-study outcome variables computed relative to {base_year}.")
    return panel_with_base

# ------------------------------------------------------------------------------
# Task 9, Step 2: Helper to Assign Shock Regressors and Instruments
# ------------------------------------------------------------------------------

def _assign_shocks_and_instruments(
    panel_with_outcomes: pd.DataFrame,
    shock_df: pd.DataFrame,
    instruments_df: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 9, Step 2"
) -> pd.DataFrame:
    """
    Merges and assigns the correct shock and instrument variables for each event year.

    Args:
        panel_with_outcomes (pd.DataFrame): The regional panel with outcome variables.
        shock_df (pd.DataFrame): DataFrame with 'shock_main' and 'shock_1991'.
        instruments_df (pd.DataFrame): DataFrame with distance-based instruments.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The panel augmented with shock and instrument columns.
    """
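    # Join shock and instrument variables onto the panel by municipality. Joining on
    # the index level keeps the ('Municipality_ID', 'snapshot_year') index intact.
    panel = panel_with_outcomes.join(shock_df, on='Municipality_ID', how='left')
    panel = panel.join(instruments_df, on='Municipality_ID', how='left')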

    # --- Assign the Correct Event-Year Shock ---
    # The 'shock' column will be the endogenous regressor in all 2SLS models.
    # Its value depends on the year, per the paper's event-study design.
    panel['shock'] = np.where(
        panel.index.get_level_values('snapshot_year') == 1991,
        panel['shock_1991'],  # Use 1990-1991 shock for 1991 event year.
        panel['shock_main']   # Use 1990-1992 shock for all other years (pre and post).
    )

    print(f"[{task_name}] Shock and instrument variables assigned.")
    return panel

# ------------------------------------------------------------------------------
# Task 9, Step 3: Helper to Attach Identifiers and Apply Sample Filters
# ------------------------------------------------------------------------------

def _apply_final_sample_filters(
    full_event_panel: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 9, Step 3"
) -> pd.DataFrame:
    """
    Attaches cluster IDs and applies the final sample filter for the analysis.

    Args:
        full_event_panel (pd.DataFrame): The fully assembled event-study panel.
        analysis_panel (pd.DataFrame): The worker-year panel, used to get District_ID.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: The final, analysis-ready event-study DataFrame.
    """
    # --- Attach Cluster Identifier ---
    # District_ID is assumed to be time-invariant for each municipality, so keep one
    # row per municipality to avoid duplicating panel rows in the join.
    district_map = (
        analysis_panel[['Municipality_ID', 'District_ID']]
        .drop_duplicates(subset=['Municipality_ID'])
        .set_index('Municipality_ID')
    )
    panel = full_event_panel.join(district_map, on='Municipality_ID', how='left')

    # --- Apply Sample Filter ---
    # The analysis sample includes only treated border districts and matched controls.
    geo_params = config["geographic_policy_parameters"]
    treated_districts: Set[str] = set(geo_params["TREATED_DISTRICT_IDS"])
    control_districts: Set[str] = set(geo_params["MATCHED_CONTROL_DISTRICT_IDS"])
    analysis_districts: Set[str] = treated_districts.union(control_districts)

    # Excluded districts should not be in the analysis sample.
    excluded_districts: Set[str] = set(geo_params["EXCLUDED_BORDER_DISTRICT_IDS"])
    if not analysis_districts.isdisjoint(excluded_districts):
        raise ValueError(f"[{task_name}] Overlap found between analysis districts and excluded districts.")

    # Municipality_ID lives in the index, not in a column.
    initial_munis = panel.index.get_level_values('Municipality_ID').nunique()

    # Filter the panel to the defined analysis sample.
    final_panel = panel[panel['District_ID'].isin(analysis_districts)].copy()

    final_munis = final_panel.index.get_level_values('Municipality_ID').nunique()
    print(f"[{task_name}] Final sample filter applied. Kept {final_munis} of {initial_munis} municipalities.")

    return final_panel

# ------------------------------------------------------------------------------
# Task 9, Orchestrator Function
# ------------------------------------------------------------------------------

def prepare_event_study_dataset(
    regional_panel: pd.DataFrame,
    shock_df: pd.DataFrame,
    instruments_df: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    master_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Orchestrates the assembly of the final event-study dataset.

    This function combines regional outcomes, shock variables, instruments,
    weights, and cluster identifiers into a single, analysis-ready DataFrame
    in a long format (one row per municipality-year).

    Args:
        regional_panel (pd.DataFrame): Aggregated municipality-year panel from Task 6.
        shock_df (pd.DataFrame): Immigration shock variables from Task 7.
        instruments_df (pd.DataFrame): Instrumental variables from Task 8.
        analysis_panel (pd.DataFrame): The main worker-year panel, used for District_ID mapping.
        master_config (Dict[str, Any]): The validated master configuration dictionary.

    Returns:
        pd.DataFrame: The final, analysis-ready dataset for all 2SLS regressions.
                      The DataFrame is indexed by ('Municipality_ID', 'snapshot_year').
    """
    # --- Input Validation ---
    required_inputs = {
        'regional_panel': regional_panel, 'shock_df': shock_df,
        'instruments_df': instruments_df, 'analysis_panel': analysis_panel
    }
    for name, df in required_inputs.items():
        if not isinstance(df, pd.DataFrame):
            raise TypeError(f"`{name}` must be a pandas DataFrame.")
    if not isinstance(master_config, dict):
        raise TypeError("`master_config` must be a dictionary.")

    # Step 1: Compute outcome variables relative to the base year.
    panel_with_outcomes = _compute_event_study_outcomes(regional_panel, master_config)

    # Step 2: Assign the correct shock regressor for each year and merge instruments.
    panel_with_ivs = _assign_shocks_and_instruments(
        panel_with_outcomes, shock_df, instruments_df, master_config
    )

    # Step 3: Attach cluster IDs and apply the final sample filter.
    final_event_study_df = _apply_final_sample_filters(
        panel_with_ivs, analysis_panel, master_config
    )

    # Final check for required columns before returning.
    required_cols = [
        'emp_outcome', 'wage_outcome', 'shock', 'distance_to_border',
        'distance_to_border_sq', master_config["algorithm_config_parameters"]["WEIGHT_COLUMN_MUNICIPALITY"],
        'District_ID'
    ]
    missing_cols = [col for col in required_cols if col not in final_event_study_df.columns]
    if missing_cols:
        raise RuntimeError(f"Final event study DataFrame is missing required columns: {missing_cols}")

    print("Task 9: Preparation of event-study dataset completed successfully.")
    return final_event_study_df
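
The base-year outcome construction in Task 9 can be sketched on a two-municipality toy panel. The example below is an illustration under assumed data: the employment figures are invented, and `DataFrame.join` on the `Municipality_ID` index level is used here as one way to attach baseline values while keeping the municipality-year index.

```python
import pandas as pd

# Hypothetical municipality-year panel indexed like the regional panel in this section.
idx = pd.MultiIndex.from_product(
    [["A", "B"], [1990, 1991]], names=["Municipality_ID", "snapshot_year"]
)
panel = pd.DataFrame({"native_fte_employment": [100.0, 110.0, 50.0, 45.0]}, index=idx)

# Baseline (1990) values, indexed by municipality only.
base = panel.xs(1990, level="snapshot_year").add_suffix("_base")

# Join on the index level so the two-level index survives.
out = panel.join(base, on="Municipality_ID", how="left")

# Employment outcome: (E_rt - E_r,1990) / E_r,1990
out["emp_outcome"] = (
    out["native_fte_employment"] - out["native_fte_employment_base"]
) / out["native_fte_employment_base"]
```

In this toy panel municipality `A` grows by 10 percent and `B` shrinks by 10 percent between 1990 and 1991, while both base-year outcomes are zero by construction.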


In [None]:
# Task 10: Estimate regional employment effect (2SLS; Equation (2))

# ==============================================================================
# Task 10: Estimate regional employment effect (2SLS; Equation (2))
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 10, Step 1: Helper to Estimate Weighted First Stage
# ------------------------------------------------------------------------------

def _estimate_first_stage(
    data: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 10, Step 1"
) -> Tuple[Any, pd.Series]:
    """
    Estimates the weighted first-stage regression of the 2SLS model.

    Regresses the endogenous shock variable on the instrumental variables,
    using baseline native employment as weights. It also performs a weak
    instrument test (F-statistic).

    Args:
        data (pd.DataFrame): The analysis-ready data for a single event year.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Tuple[Any, pd.Series]:
        - The fitted statsmodels results object for the first stage.
        - A Series of predicted values for the endogenous variable.
    """
    # Define the formula for the first-stage regression.
    # shock ~ 1 + distance_to_border + distance_to_border_sq
    iv_names = list(config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys())
    formula = f"shock ~ 1 + {' + '.join(iv_names)}"

    # Get the column name for weights.
    weight_col = config["algorithm_config_parameters"]["WEIGHT_COLUMN_MUNICIPALITY"]

    # Estimate the model using Weighted Least Squares (WLS).
    first_stage_model = smf.wls(
        formula=formula,
        data=data,
        weights=data[weight_col]
    ).fit()

    # --- Weak Instrument Test ---
    # Perform a joint F-test on the instruments' coefficients.
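    f_test_result = first_stage_model.f_test(iv_names)
    # Depending on the statsmodels version, `fvalue` is a scalar or a 1x1 array.
    f_statistic = float(np.squeeze(f_test_result.fvalue))

    # The paper reports F ~ 52. We check against the conventional rule-of-thumb threshold of 10.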
    if f_statistic < 10:
        # This should be a warning, not an error, as it's a diagnostic.
        print(f"[{task_name}] WARNING: Weak instrument detected. "
              f"First-stage F-statistic is {f_statistic:.2f}, which is below 10.")
    else:
        print(f"[{task_name}] First-stage F-statistic: {f_statistic:.2f} (Instruments are strong).")

    # Predict the values of the endogenous variable.
    predicted_shock = first_stage_model.predict(data)
    predicted_shock.name = 'shock_hat'

    return first_stage_model, predicted_shock

# ------------------------------------------------------------------------------
# Task 10, Step 2: Helper to Estimate Weighted Second Stage
# ------------------------------------------------------------------------------

def _estimate_second_stage(
    data: pd.DataFrame,
    outcome_var: str,
    config: Dict[str, Any],
    task_name: str = "Task 10, Step 2"
) -> Any:
    """
    Estimates the weighted second-stage regression with cluster-robust SEs.

    Regresses the outcome variable on the predicted shock from the first stage.

    Args:
        data (pd.DataFrame): The analysis data, including 'shock_hat'.
        outcome_var (str): The name of the dependent variable column.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Any: The fitted statsmodels results object with cluster-robust errors.
    """
    # Define the formula for the second-stage regression.
    # outcome ~ 1 + shock_hat
    formula = f"{outcome_var} ~ 1 + shock_hat"

    # Get column names for weights and clusters.
    weight_col = config["algorithm_config_parameters"]["WEIGHT_COLUMN_MUNICIPALITY"]
    cluster_col = config["algorithm_config_parameters"]["CLUSTER_LEVEL"]

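    # Estimate the model using Weighted Least Squares (WLS) with cluster-robust
    # standard errors. Requesting the covariance inside `fit()` keeps the
    # formula-aware results wrapper, so `params['shock_hat']` can be read by name.
    # Note: `shock_hat` is a generated regressor, so these two-step standard
    # errors are not the conventional 2SLS standard errors.
    robust_results = smf.wls(
        formula=formula,
        data=data,
        weights=data[weight_col]
    ).fit(cov_type='cluster', cov_kwds={'groups': data[cluster_col]})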

    print(f"[{task_name}] Second-stage estimation for outcome '{outcome_var}' complete.")
    return robust_results
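
The two-step procedure above gives the same point estimate as closed-form IV, but the residuals from a plain second-stage regression are not the 2SLS residuals. The following is a small NumPy sketch on simulated data; the data-generating process and all variable names are assumptions for illustration only. It compares the two residual vectors, which is why standard errors taken directly from the second stage should be read with care.

```python
import numpy as np

# Simulated data: z is an instrument, u an unobserved confounder.
rng = np.random.default_rng(1)
n = 2_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)           # endogenous regressor
y = 1.0 + 2.0 * x + 3.0 * u + rng.normal(size=n)


def ols(X, target):
    """Least-squares coefficients of target on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta


const = np.ones(n)
Z = np.column_stack([const, z])
X = np.column_stack([const, x])

# Stage 1: project x on the instruments. Stage 2: regress y on the projection.
x_hat = Z @ ols(Z, x)
X_hat = np.column_stack([const, x_hat])
beta_2sls = ols(X_hat, y)

# Closed-form IV estimator for the just-identified case.
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

# Residuals a plain second-stage OLS reports vs. the proper 2SLS residuals,
# which use the observed regressor x rather than its projection x_hat.
resid_naive = y - X_hat @ beta_2sls
resid_2sls = y - X @ beta_2sls

print("two-step slope:", round(float(beta_2sls[1]), 3))
print("closed-form IV slope:", round(float(beta_iv[1]), 3))
print("naive residual variance:", round(float(resid_naive.var()), 2))
print("2SLS residual variance:", round(float(resid_2sls.var()), 2))
```

Dedicated IV routines, such as the `IV2SLS` estimator used for the stayers model in Task 12, compute the covariance from the second kind of residual.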

# ------------------------------------------------------------------------------
# Task 10, Step 3: Helper to Implement the Wild Cluster Bootstrap
# ------------------------------------------------------------------------------

def _run_wild_cluster_bootstrap(
    data: pd.DataFrame,
    outcome_var: str,
    second_stage_results: Any,
    config: Dict[str, Any],
    task_name: str = "Task 10, Step 3"
) -> Tuple[float, float]:
    """
    Implements the wild cluster bootstrap for robust inference.

    Args:
        data (pd.DataFrame): The analysis data for a single event year.
        outcome_var (str): The name of the dependent variable column.
        second_stage_results (Any): The initial second-stage results object.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Tuple[float, float]: The lower and upper bounds of the bootstrap confidence interval.
    """
    # Extract necessary parameters and data.
    n_reps = config["algorithm_config_parameters"]["BOOTSTRAP_REPLICATIONS"]
    seed = config["algorithm_config_parameters"]["RANDOM_SEED"]
    cluster_col = config["algorithm_config_parameters"]["CLUSTER_LEVEL"]
    conf_level = config["algorithm_config_parameters"]["BOOTSTRAP_CONF_LEVEL"]

    # Set seed for reproducibility.
    rng = np.random.default_rng(seed)

    # Get unique cluster identifiers.
    clusters = data[cluster_col].unique()

    # Get fitted values and residuals from the original second-stage model as
    # plain arrays, so the arithmetic below is positional rather than index-aligned.
    fitted_values = np.asarray(second_stage_results.fittedvalues)
    residuals = np.asarray(second_stage_results.resid)

    # The bootstrap perturbs only the outcome. The first stage regresses the
    # shock on the instruments, so 'shock_hat' is identical in every replication
    # and does not need to be re-estimated inside the loop.
    weight_col = config["algorithm_config_parameters"]["WEIGHT_COLUMN_MUNICIPALITY"]
    formula = f"{outcome_var}_boot ~ 1 + shock_hat"
    bootstrap_data = data.copy()

    # Store bootstrap coefficient estimates.
    bootstrap_coeffs = []

    print(f"[{task_name}] Starting wild cluster bootstrap with {n_reps} replications...")
    for _ in range(n_reps):
        # 1. Generate Rademacher weights at the cluster level.
        rademacher_weights = rng.choice([-1, 1], size=len(clusters))
        cluster_to_weight = dict(zip(clusters, rademacher_weights))

        # 2. Map the weights to rows and create bootstrap residuals.
        # ε_r^(b) = v_g(r) * ε̂_r
        # Using `map` keeps the row order of `data`; a merge would reset the
        # index and misalign the weights with the residuals.
        v = data[cluster_col].map(cluster_to_weight).to_numpy()
        bootstrap_residuals = v * residuals

        # 3. Generate bootstrap outcome variable.
        # Y_r^(b) = Ŷ_r + ε_r^(b)
        bootstrap_data[f'{outcome_var}_boot'] = fitted_values + bootstrap_residuals

        # 4. Re-estimate the second stage without robust SEs, as we are only
        # building the distribution of the coefficient.
        boot_model = smf.wls(formula, data=bootstrap_data, weights=bootstrap_data[weight_col]).fit()

        # Store the coefficient of interest.
        bootstrap_coeffs.append(boot_model.params['shock_hat'])

    # 5. Compute the percentile confidence interval.
    alpha = 1 - conf_level
    lower_bound = np.percentile(bootstrap_coeffs, 100 * alpha / 2)
    upper_bound = np.percentile(bootstrap_coeffs, 100 * (1 - alpha / 2))

    print(f"[{task_name}] Bootstrap complete.")
    return lower_bound, upper_bound
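
The resampling scheme above can be isolated from the statsmodels machinery. The block below is a minimal NumPy sketch of a percentile wild cluster bootstrap for a single OLS slope. The helper name `wild_cluster_bootstrap_ci`, its defaults, and the simulated data are hypothetical and not part of the pipeline. One Rademacher draw is shared by every observation in a cluster, which preserves within-cluster dependence in the perturbed residuals.

```python
import numpy as np


def wild_cluster_bootstrap_ci(x, y, cluster_ids, n_reps=499, conf_level=0.95, seed=0):
    """Percentile wild cluster bootstrap CI for the slope of y on x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted

    # Position of each observation's cluster among the unique cluster labels.
    _, cluster_pos = np.unique(cluster_ids, return_inverse=True)
    n_clusters = int(cluster_pos.max()) + 1

    slopes = np.empty(n_reps)
    for b in range(n_reps):
        # One Rademacher draw per cluster, broadcast to its observations.
        v = rng.choice([-1.0, 1.0], size=n_clusters)
        y_boot = fitted + v[cluster_pos] * resid
        beta_b, *_ = np.linalg.lstsq(X, y_boot, rcond=None)
        slopes[b] = beta_b[1]

    alpha = 1 - conf_level
    lower, upper = np.percentile(slopes, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(beta[1]), (float(lower), float(upper))


# Simulated clustered data (illustrative assumptions only).
rng = np.random.default_rng(42)
n_clusters, per_cluster = 30, 20
cluster_ids = np.repeat(np.arange(n_clusters), per_cluster)
cluster_shock = rng.normal(0, 1.0, n_clusters)[cluster_ids]
x = rng.normal(size=cluster_ids.size)
y = 0.5 + 1.5 * x + cluster_shock + rng.normal(0, 0.5, cluster_ids.size)

slope, (ci_lower, ci_upper) = wild_cluster_bootstrap_ci(x, y, cluster_ids)
print(f"slope = {slope:.3f}, CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
```

This sketch resamples around the unrestricted fit and reports a percentile interval, mirroring the function above; variants that impose the null hypothesis or use bootstrap-t statistics are common alternatives.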

# ------------------------------------------------------------------------------
# Task 10, Orchestrator Function
# ------------------------------------------------------------------------------

def estimate_regional_effect_2sls(
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    outcome_variable: str,
    event_year: int
) -> Dict[str, Any]:
    """
    Orchestrates the 2SLS estimation for a given outcome and event year.

    This function performs the full 2SLS estimation pipeline:
    1. Estimates the weighted first stage and checks instrument strength.
    2. Estimates the weighted second stage with cluster-robust standard errors.
    3. Runs a wild cluster bootstrap to compute a robust confidence interval.

    Args:
        event_study_df (pd.DataFrame): The final, analysis-ready dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        outcome_variable (str): The name of the outcome column to use (e.g., 'emp_outcome').
        event_year (int): The specific year of the event study to estimate.

    Returns:
        Dict[str, Any]: A dictionary containing the key estimation results:
                        'point_estimate', 'cluster_robust_se', 'p_value',
                        'f_statistic', 'bootstrap_ci', and 'n_obs'.
    """
    # --- Prepare Data for the Specific Event Year ---
    # Filter data to the specified year and drop any rows with missing values
    # in the variables required for the model.
    model_vars = [
        outcome_variable, 'shock', 'shock_hat',
        master_config["algorithm_config_parameters"]["WEIGHT_COLUMN_MUNICIPALITY"],
        master_config["algorithm_config_parameters"]["CLUSTER_LEVEL"]
    ] + list(master_config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys())

    year_data = event_study_df.loc[
        (slice(None), event_year), :
    ].copy().reset_index(level='snapshot_year')

    # Temporarily add 'shock_hat' to ensure dropna works correctly.
    year_data['shock_hat'] = 0
    year_data.dropna(subset=model_vars, inplace=True)

    if year_data.empty:
        print(f"WARNING: No valid observations for year {event_year} and outcome {outcome_variable}. Skipping.")
        return {}

    # Step 1: Estimate the first stage and get predicted shock.
    first_stage_res, predicted_shock = _estimate_first_stage(year_data, master_config)
    year_data['shock_hat'] = predicted_shock

    # Step 2: Estimate the second stage with cluster-robust SEs.
    second_stage_res_robust = _estimate_second_stage(year_data, outcome_variable, master_config)

    # Step 3: Run the wild cluster bootstrap for a robust CI.
    ci_lower, ci_upper = _run_wild_cluster_bootstrap(
        data=year_data,
        outcome_var=outcome_variable,
        second_stage_results=second_stage_res_robust,
        config=master_config
    )

    # --- Assemble and Return Final Results ---
    point_estimate = second_stage_res_robust.params['shock_hat']
    cluster_robust_se = second_stage_res_robust.bse['shock_hat']
    p_value = second_stage_res_robust.pvalues['shock_hat']
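    iv_names = list(master_config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys())
    # Coerce to float: `fvalue` can be a 1x1 array in some statsmodels versions,
    # which would break the `:.2f` formatting below.
    f_statistic = float(np.squeeze(first_stage_res.f_test(iv_names).fvalue))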

    results = {
        'event_year': event_year,
        'outcome': outcome_variable,
        'point_estimate': point_estimate,
        'cluster_robust_se': cluster_robust_se,
        'p_value': p_value,
        'f_statistic': f_statistic,
        'bootstrap_ci': (ci_lower, ci_upper),
        'n_obs': int(second_stage_res_robust.nobs)
    }

    print(f"\n--- Results for {event_year} | {outcome_variable} ---")
    print(f"Point Estimate (β^R): {point_estimate:.4f}")
    print(f"Cluster-Robust SE: {cluster_robust_se:.4f}")
    print(f"Bootstrap 95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
    print(f"First-Stage F-stat: {f_statistic:.2f}")
    print("----------------------------------------\n")

    return results


# Task 11: Decompose regional employment into flow components (Equation (3))

# ==============================================================================
# Task 11: Decompose regional employment into flow components (Equation (3))
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 11, Step 1: Helper to Compute Employment Flow Shares
# ------------------------------------------------------------------------------

def _compute_employment_flow_shares(
    analysis_panel: pd.DataFrame,
    regional_panel: pd.DataFrame,
    event_year: int,
    config: Dict[str, Any],
    task_name: str = "Task 11, Step 1"
) -> pd.DataFrame:
    """
    Computes displacement, inflow, and relocation shares for a given event year.

    This function tracks native workers between the base year (1990) and the
    specified event year to construct the flow components needed for the
    employment decomposition, normalized by baseline native employment.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        regional_panel (pd.DataFrame): The aggregated municipality-year panel from Task 6.
        event_year (int): The post-treatment year to compare against the base year.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'Municipality_ID' with columns for
                      'displacement_share', 'inflow_share', and 'relocation_share'.
    """
    # Extract the base year from the configuration for comparison.
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # Select only native workers and the two relevant years for flow analysis.
    # This is the foundational dataset for tracking individual transitions.
    native_panel = analysis_panel[
        analysis_panel['is_native'] &
        analysis_panel.index.get_level_values('snapshot_year').isin([base_year, event_year])
    ].copy()

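    # Pivot the panel to create a wide-format DataFrame. Each row represents a unique
    # worker, with columns for their status in both the base and event years.
    # `aggfunc='first'` avoids the default mean aggregation, which is not meaningful
    # for identifiers and flags; it assumes one record per worker-year.
    flow_df = native_panel.reset_index().pivot_table(
        index='Worker_ID',
        columns='snapshot_year',
        values=['is_employed', 'Municipality_ID', 'fte_weight'],
        aggfunc='first'
    )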
    # Flatten the multi-level column index for easier access.
    flow_df.columns = [f'{val}_{year}' for val, year in flow_df.columns]

    # --- Classify Worker Flows ---
    # Define boolean masks for each flow category, evaluated over all workers at once.

    # Condition for being employed in the base year.
    employed_base = flow_df[f'is_employed_{base_year}'].fillna(False).astype(bool)
    # Condition for being employed in the event year.
    employed_event = flow_df[f'is_employed_{event_year}'].fillna(False).astype(bool)

    # Displacement: Employed in base year -> Not employed in event year.
    # Contribution is the FTE weight from the base year.
    flow_df['displacement_fte'] = np.where(
        employed_base & ~employed_event,
        flow_df[f'fte_weight_{base_year}'],
        0
    )

    # Relocation: Employed in base year -> Employed in a DIFFERENT municipality in event year.
    # Contribution is the FTE weight from the base year.
    flow_df['relocation_fte'] = np.where(
        employed_base & employed_event &
        (flow_df[f'Municipality_ID_{base_year}'] != flow_df[f'Municipality_ID_{event_year}']),
        flow_df[f'fte_weight_{base_year}'],
        0
    )

    # Inflow: Employed in event year -> but was NOT employed in that SAME municipality in base year.
    # This includes those who were non-employed or employed elsewhere in the base year.
    # Contribution is the FTE weight from the event year.
    flow_df['inflow_fte'] = np.where(
        employed_event & (
            ~employed_base |
            (flow_df[f'Municipality_ID_{base_year}'] != flow_df[f'Municipality_ID_{event_year}'])
        ),
        flow_df[f'fte_weight_{event_year}'],
        0
    )

    # --- Aggregate Flows by Municipality ---
    # For displacement and relocation, we group by the origin municipality (base year).
    displacement_agg = flow_df.groupby(f'Municipality_ID_{base_year}')['displacement_fte'].sum()
    relocation_agg = flow_df.groupby(f'Municipality_ID_{base_year}')['relocation_fte'].sum()

    # For inflows, we group by the destination municipality (event year).
    inflow_agg = flow_df.groupby(f'Municipality_ID_{event_year}')['inflow_fte'].sum()

    # --- Assemble and Normalize Flow Shares ---
    # Get the baseline native employment for normalization from the regional panel.
    weight_col = config["algorithm_config_parameters"]["WEIGHT_COLUMN_MUNICIPALITY"]
    baseline_employment = regional_panel.groupby('Municipality_ID')[weight_col].first()

    # Create the final DataFrame of flow shares, indexed by Municipality_ID.
    flow_shares = pd.DataFrame(index=baseline_employment.index)
    flow_shares['displacement_share'] = displacement_agg
    flow_shares['relocation_share'] = relocation_agg
    flow_shares['inflow_share'] = inflow_agg

    # Normalize by the baseline employment, handling division by zero.
    with np.errstate(divide='ignore', invalid='ignore'):
        for col in flow_shares.columns:
            flow_shares[col] = np.divide(flow_shares[col], baseline_employment)

    # Replace inf/-inf with NaN and then fill all NaNs with 0.
    flow_shares.replace([np.inf, -np.inf], np.nan, inplace=True)
    flow_shares.fillna(0, inplace=True)

    print(f"[{task_name}] Employment flow shares computed for year {event_year}.")
    return flow_shares

# ------------------------------------------------------------------------------
# Task 11, Orchestrator Function
# ------------------------------------------------------------------------------

def decompose_regional_employment_effect(
    analysis_panel: pd.DataFrame,
    regional_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int
) -> Dict[str, Any]:
    """
    Orchestrates the decomposition of the regional employment effect.

    This function first computes the micro-level employment flow shares
    (displacement, inflow, relocation) and then estimates the causal effect
    of the immigration shock on each component using 2SLS. It verifies that
    the sum of the component effects reconciles with the total regional effect.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        regional_panel (pd.DataFrame): The aggregated municipality-year panel from Task 6.
        event_study_df (pd.DataFrame): The analysis-ready dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to estimate.

    Returns:
        Dict[str, Any]: A dictionary containing the estimation results for the
                        total effect and each of the three flow components.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(regional_panel, pd.DataFrame):
        raise TypeError("`regional_panel` must be a pandas DataFrame.")
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")
    if not isinstance(master_config, dict):
        raise TypeError("`master_config` must be a dictionary.")

    # Step 1: Compute the employment flow shares for the given event year.
    flow_shares_df = _compute_employment_flow_shares(
        analysis_panel, regional_panel, event_year, master_config
    )

    # --- Prepare a consistent analysis sample for all regressions ---
    # Merge flow shares into the main event study dataset for the specified year.
    # The merge is done on columns and the (Municipality_ID, snapshot_year)
    # MultiIndex is restored afterwards, because `estimate_regional_effect_2sls`
    # slices its input on that index.
    analysis_data_year = (
        event_study_df.loc[(slice(None), event_year), :]
        .reset_index()
        .merge(
            flow_shares_df.rename_axis('Municipality_ID').reset_index(),
            on='Municipality_ID',
            how='left'
        )
        .set_index(['Municipality_ID', 'snapshot_year'])
        .sort_index()
    )

    # Define all outcome variables for this task.
    outcomes_to_estimate = {
        'total_effect': 'emp_outcome',
        'displacement': 'displacement_share',
        'inflow': 'inflow_share',
        'relocation': 'relocation_share'
    }

    # Drop rows with missing values in any of the outcomes to ensure a consistent sample.
    # This is the critical step for ensuring the decomposition identity holds.
    analysis_data_year = analysis_data_year.dropna(subset=list(outcomes_to_estimate.values()))

    # --- Step 2: Estimate 2SLS for each component ---
    results = {}
    for name, outcome_col in outcomes_to_estimate.items():
        print(f"\n--- Estimating effect on: {name} ({outcome_col}) for year {event_year} ---")
        # Reuse the robust 2SLS estimation function from Task 10.
        results[name] = estimate_regional_effect_2sls(
            event_study_df=analysis_data_year,
            master_config=master_config,
            outcome_variable=outcome_col,
            event_year=event_year
        )

    # --- Step 3: Verify the summation identity ---
    # Identity: β^R ≈ (-δ_displacement) + δ_inflow - δ_relocation
    total_beta = results['total_effect']['point_estimate']
    sum_of_components = (
        -results['displacement']['point_estimate'] +
        results['inflow']['point_estimate'] -
        results['relocation']['point_estimate']
    )

    # Use a tight absolute tolerance for the verification.
    if not np.isclose(total_beta, sum_of_components, atol=1e-8):
        raise AssertionError(
            "Decomposition identity check FAILED! "
            f"Total Effect (β^R) = {total_beta:.8f}, but "
            f"Sum of Components = {sum_of_components:.8f}. "
            "This indicates a logical error in flow construction or sample inconsistency."
        )
    else:
        print("\n" + "="*80)
        print("Decomposition Identity Verification PASSED")
        print(f"Total Effect (β^R): {total_beta:.6f}")
        print(f"Sum of Components:  {sum_of_components:.6f}")
        print("="*80)

    print(f"\nTask 11: Decomposition of regional employment effect for year {event_year} completed.")
    return results
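
The accounting behind the summation identity can be checked on a toy example before any regression is run. The sketch below uses a hypothetical seven-worker table with unit weights (the column names `muni_base` and `muni_event` are assumptions, and FTE weighting is omitted for brevity). It classifies workers into the same three flows and confirms that, for each municipality, the net change in native employment equals inflows minus displacement minus relocation.

```python
import numpy as np
import pandas as pd

# One row per worker: municipality of employment in each year (NaN = not employed).
workers = pd.DataFrame({
    "muni_base":  ["A", "A", "A", "B", "B", np.nan, np.nan],
    "muni_event": ["A", np.nan, "B", "B", "A", "A", "A"],
})

employed_base = workers["muni_base"].notna()
employed_event = workers["muni_event"].notna()
moved = employed_base & employed_event & (workers["muni_base"] != workers["muni_event"])

displaced = employed_base & ~employed_event          # employed -> not employed
inflow = employed_event & (~employed_base | moved)   # new to the destination

munis = ["A", "B"]


def count_by(mask, col):
    """Number of workers selected by `mask`, counted by municipality in `col`."""
    return workers.loc[mask, col].value_counts().reindex(munis, fill_value=0)


displacement = count_by(displaced, "muni_base")   # attributed to origin
relocation = count_by(moved, "muni_base")         # attributed to origin
inflows = count_by(inflow, "muni_event")          # attributed to destination
emp_base = count_by(employed_base, "muni_base")
emp_event = count_by(employed_event, "muni_event")

net_change = emp_event - emp_base
identity_rhs = inflows - displacement - relocation

# Shares are normalized by baseline employment, as in the function above.
shares = pd.DataFrame({
    "displacement_share": displacement / emp_base,
    "relocation_share": relocation / emp_base,
    "inflow_share": inflows / emp_base,
})
print(shares)
print((net_change == identity_rhs).all())
```

With FTE weights that change between the base and event years for workers who stay put, the identity holds only approximately, which is worth keeping in mind when interpreting a strict tolerance check.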


# Task 12: Estimate regional and pure wage effects (Equations (5) and (4))

# ==============================================================================
# Task 12: Estimate regional and pure wage effects (Equations (5) and (4))
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 12, Step 2: Helper to Estimate Pure Wage Effect (FD-IV for Stayers)
# ------------------------------------------------------------------------------

def _estimate_pure_wage_effect_stayers(
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int,
    task_name: str = "Task 12, Step 2"
) -> Dict[str, Any]:
    """
    Estimates the pure wage effect (γ^W) for stayers using a FD-IV model.

    This function identifies "stayers" (workers in the same municipality at
    base and event year), constructs a first-differenced dataset of their
    wages and age controls, and estimates the effect of the immigration shock
    using an individual-level 2SLS model with a wild cluster bootstrap.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        event_study_df (pd.DataFrame): The analysis-ready dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to estimate.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Dict[str, Any]: A dictionary containing the key estimation results.
    """
    base_year = master_config["temporal_parameters"]["BASE_YEAR"]

    # --- 1. Identify Stayers and Prepare Data ---
    # Filter to native, full-time workers in the two relevant years.
    stayers_panel = analysis_panel[
        analysis_panel['is_native'] &
        analysis_panel['is_full_time'] &
        analysis_panel.index.get_level_values('snapshot_year').isin([base_year, event_year])
    ].copy()

    # Pivot to wide format to identify stayers.
    stayers_wide = stayers_panel.reset_index().pivot_table(
        index='Worker_ID',
        columns='snapshot_year',
        values=['Municipality_ID', 'log_wage_imputed', 'age', 'age_sq']
    )
    stayers_wide.columns = [f'{val}_{year}' for val, year in stayers_wide.columns]

    # A stayer is employed in the same municipality in both years.
    stayers_wide = stayers_wide.dropna(
        subset=[f'Municipality_ID_{base_year}', f'Municipality_ID_{event_year}']
    )
    stayers_df = stayers_wide[
        stayers_wide[f'Municipality_ID_{base_year}'] == stayers_wide[f'Municipality_ID_{event_year}']
    ].copy()
    stayers_df.rename(columns={f'Municipality_ID_{base_year}': 'Municipality_ID'}, inplace=True)

    # --- 2. Construct First-Differenced Variables ---
    # Δlog_w_ir = log_w_irt - log_w_ir,1990
    stayers_df['delta_log_wage'] = stayers_df[f'log_wage_imputed_{event_year}'] - stayers_df[f'log_wage_imputed_{base_year}']
    # Δage_i = age_it - age_i,1990
    stayers_df['delta_age'] = stayers_df[f'age_{event_year}'] - stayers_df[f'age_{base_year}']
    # Δage_i^2 = (age_it)^2 - (age_i,1990)^2
    stayers_df['delta_age_sq'] = stayers_df[f'age_sq_{event_year}'] - stayers_df[f'age_sq_{base_year}']

    # --- 3. Merge Shocks, Instruments, and Cluster IDs ---
    # Get the single event-year slice from the main event study df.
    event_data_year = event_study_df.loc[(slice(None), event_year), :].reset_index()

    # Merge the municipality-level variables onto the individual-level stayer data.
    iv_cols = ['shock'] + list(master_config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys()) + ['District_ID']
    stayers_df = stayers_df.merge(
        event_data_year[['Municipality_ID'] + iv_cols],
        on='Municipality_ID',
        how='left'
    )
    stayers_df.dropna(inplace=True) # Drop if any merge failed or data is missing.

    # --- 4. Estimate the FD-IV Model ---
    # Prepare data for linearmodels.
    dependent = stayers_df['delta_log_wage']
    exog = sm.add_constant(stayers_df[['delta_age', 'delta_age_sq']])
    endog = stayers_df[['shock']]
    instruments = stayers_df[list(master_config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys())]
    clusters = stayers_df['District_ID']

    # Fit the 2SLS model with cluster-robust standard errors.
    model = IV2SLS(dependent, exog, endog, instruments).fit(cov_type='clustered', clusters=clusters)

    # --- 5. Implement Wild Cluster Bootstrap ---
    n_reps = master_config["algorithm_config_parameters"]["BOOTSTRAP_REPLICATIONS"]
    seed = master_config["algorithm_config_parameters"]["RANDOM_SEED"]
    rng = np.random.default_rng(seed)

    unique_clusters = clusters.unique()
    fitted_values = model.predict()
    residuals = model.resids

    bootstrap_coeffs = []
    print(f"[{task_name}] Starting wild cluster bootstrap for pure wage effect...")
    for _ in range(n_reps):
        # Generate Rademacher weights at the cluster level.
        rademacher_weights = rng.choice([-1, 1], size=len(unique_clusters))
        cluster_weights = pd.DataFrame({'District_ID': unique_clusters, 'v': rademacher_weights})

        # Create bootstrap residuals.
        temp_df = stayers_df.merge(cluster_weights, on='District_ID', how='left')
        boot_residuals = temp_df['v'].values * residuals.values

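        # Create bootstrap outcome, reusing the regressors' index so that all
        # inputs passed to linearmodels are consistently indexed.
        boot_dependent = pd.Series(
            fitted_values.values.flatten() + boot_residuals,
            index=stayers_df.index,
            name='delta_log_wage_boot'
        )

        # Re-estimate the model. Only the point estimate is used here, so the
        # cheaper unadjusted covariance is sufficient inside the loop.
        boot_model = IV2SLS(boot_dependent, exog, endog, instruments).fit(cov_type='unadjusted')
        bootstrap_coeffs.append(boot_model.params['shock'])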

    # Compute percentile confidence interval.
    alpha = 1 - master_config["algorithm_config_parameters"]["BOOTSTRAP_CONF_LEVEL"]
    ci_lower = np.percentile(bootstrap_coeffs, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_coeffs, 100 * (1 - alpha / 2))

    # --- 6. Assemble and Return Results ---
    results = {
        'event_year': event_year,
        'outcome': 'pure_wage_effect (stayers)',
        'point_estimate': model.params['shock'],
        'cluster_robust_se': model.std_errors['shock'],
        'p_value': model.pvalues['shock'],
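        # First-stage F-statistic for the endogenous regressor, read from the
        # linearmodels first-stage diagnostics table.
        'f_statistic': float(model.first_stage.diagnostics.loc['shock', 'f.stat']),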
        'bootstrap_ci': (ci_lower, ci_upper),
        'n_obs': int(model.nobs)
    }

    print(f"\n--- Results for {event_year} | Pure Wage Effect (γ^W) ---")
    print(f"Point Estimate (γ^W): {results['point_estimate']:.4f}")
    print(f"Cluster-Robust SE: {results['cluster_robust_se']:.4f}")
    print(f"Bootstrap 95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
    print(f"First-Stage F-stat: {results['f_statistic']:.2f}")
    print("--------------------------------------------------\n")

    return results

# ------------------------------------------------------------------------------
# Task 12, Orchestrator Function
# ------------------------------------------------------------------------------

def estimate_wage_effects(
    event_study_df: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int
) -> Dict[str, Dict[str, Any]]:
    """
    Orchestrates the estimation of both regional and pure wage effects.

    This function serves as the main entry point for Task 12. It estimates:
    1. The regional wage effect (γ^R) using a municipality-level 2SLS model.
    2. The pure wage effect (γ^W) using an individual-level FD-IV model on stayers.

    It returns a dictionary containing the results for both estimations,
    allowing for a direct comparison that reveals the impact of workforce
    composition changes.

    Args:
        event_study_df (pd.DataFrame): The final, analysis-ready dataset from Task 9.
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to estimate.

    Returns:
        Dict[str, Dict[str, Any]]: A dictionary with two keys, 'regional_wage_effect'
                                   and 'pure_wage_effect', each containing a
                                   detailed dictionary of estimation results.
    """
    # --- Input Validation ---
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")

    # --- Step 1: Estimate Regional Wage Effect (γ^R) ---
    # Reuse the robust 2SLS estimation function from Task 10.
    regional_results = estimate_regional_effect_2sls(
        event_study_df=event_study_df,
        master_config=master_config,
        outcome_variable='wage_outcome',
        event_year=event_year
    )

    # --- Step 2: Estimate Pure Wage Effect (γ^W) ---
    pure_results = _estimate_pure_wage_effect_stayers(
        analysis_panel=analysis_panel,
        event_study_df=event_study_df,
        master_config=master_config,
        event_year=event_year
    )

    # --- Step 3: Reconcile and Assemble Final Output ---
    final_results = {
        'regional_wage_effect': regional_results,
        'pure_wage_effect': pure_results
    }

    # Print a summary comparison.
    gamma_R = regional_results.get('point_estimate', np.nan)
    gamma_W = pure_results.get('point_estimate', np.nan)
    composition_effect = gamma_R - gamma_W

    print("\n" + "="*80)
    print(f"Wage Effect Reconciliation Summary for {event_year}")
    print(f"Regional Wage Effect (γ^R): {gamma_R:.4f}")
    print(f"Pure Wage Effect (γ^W):     {gamma_W:.4f}")
    print(f"Implied Composition Effect: {composition_effect:.4f}")
    print("="*80 + "\n")

    print("Task 12: Estimation of regional and pure wage effects completed successfully.")
    return final_results
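
The gap between the regional and the pure wage effect comes from who is in the workforce, not only from what incumbents earn. The toy calculation below is a descriptive sketch with made-up numbers and hypothetical column names, not an IV estimate. It shows how the change in a municipality's mean log wage can exceed the wage growth of its stayers when a low-wage worker leaves and a high-wage worker arrives.

```python
import numpy as np
import pandas as pd

# Wide-format toy data for one municipality "A" (None / NaN = not employed there).
wide = pd.DataFrame({
    "muni_base":      ["A", "A", "A", None],
    "muni_event":     ["A", "A", None, "A"],
    "log_wage_base":  [3.0, 3.2, 2.6, np.nan],
    "log_wage_event": [3.1, 3.3, np.nan, 3.5],
})

# Stayers are employed in the same municipality in both years.
is_stayer = wide["muni_base"].notna() & (wide["muni_base"] == wide["muni_event"])

# Regional change: difference in mean log wage of everyone employed in A each year.
mean_event = wide.loc[wide["muni_event"] == "A", "log_wage_event"].mean()
mean_base = wide.loc[wide["muni_base"] == "A", "log_wage_base"].mean()
regional_change = mean_event - mean_base

# Pure change: mean first difference of log wages among stayers only.
stayer_change = (
    wide.loc[is_stayer, "log_wage_event"] - wide.loc[is_stayer, "log_wage_base"]
).mean()

composition_change = regional_change - stayer_change
print(f"regional: {regional_change:.4f}")
print(f"stayers:  {stayer_change:.4f}")
print(f"implied composition: {composition_change:.4f}")
```

In the pipeline the analogous difference is taken between two estimated coefficients, so it inherits sampling uncertainty from both regressions.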


# Task 13: Decompose regional wage change into pure and composition effects (Equations (6)–(7))

# ==============================================================================
# Task 13: Decompose regional wage change into pure and composition effects
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 13, Step 1: Helper to Compute Wage Decomposition Components
# ------------------------------------------------------------------------------

def _compute_wage_decomposition_components(
    analysis_panel: pd.DataFrame,
    event_year: int,
    config: Dict[str, Any],
    task_name: str = "Task 13, Step 1"
) -> pd.DataFrame:
    """
    Computes wage decomposition components using a vectorized approach.

    This function implements the decomposition of regional wage changes from
    Equation (6) of the paper. Group means and proportions are computed with
    pandas `groupby().transform()`, which broadcasts each municipality-level
    statistic back to the worker rows without explicit Python loops.

    The decomposition is:
    Δlog(w_r) ≈ (E[w_t|stayer] - E[w_0|stayer])  (Stayers' Growth)
                + (E[w_0|stayer] - E[w_0|outflower]) * Pr(outflow) (Outflow Comp.)
                - (E[w_t|stayer] - E[w_t|inflower]) * Pr(inflow)   (Inflow Comp.)

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        event_year (int): The post-treatment year to compare against the base year.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'Municipality_ID' with columns for
                      each decomposition component ('stayers_wage_growth',
                      'outflow_composition', 'inflow_composition').
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")

    # --- 1. Prepare Wide-Format Data for Flow Analysis ---
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # Filter to native, full-time workers in the two relevant years.
    wage_panel = analysis_panel[
        analysis_panel['is_native'] &
        analysis_panel['is_full_time'] &
        analysis_panel.index.get_level_values('snapshot_year').isin([base_year, event_year])
    ].copy()

    # Pivot to a wide format for individual-level flow classification.
    flow_df = wage_panel.reset_index().pivot_table(
        index='Worker_ID',
        columns='snapshot_year',
        values=['Municipality_ID', 'log_wage_imputed']
    )
    flow_df.columns = [f'{val}_{year}' for val, year in flow_df.columns]

    # Define key columns for convenience.
    muni_base_col, muni_event_col = f'Municipality_ID_{base_year}', f'Municipality_ID_{event_year}'
    wage_base_col, wage_event_col = f'log_wage_imputed_{base_year}', f'log_wage_imputed_{event_year}'

    # --- 2. Define Flow Group Masks ---
    is_incumbent = flow_df[muni_base_col].notna()
    is_employed_event = flow_df[muni_event_col].notna()
    is_stayer = is_incumbent & is_employed_event & (flow_df[muni_base_col] == flow_df[muni_event_col])
    is_outflower = is_incumbent & ~is_stayer
    is_inflower = is_employed_event & ~is_stayer

    # --- 3. Broadcast Group Means and Proportions using transform() ---

    # a. Calculate conditional means for stayers and outflowers at origin (base year muni).
    flow_df['mean_wage_stayers_base'] = flow_df.where(is_stayer)[wage_base_col].groupby(flow_df[muni_base_col]).transform('mean')
    flow_df['mean_wage_outflowers_base'] = flow_df.where(is_outflower)[wage_base_col].groupby(flow_df[muni_base_col]).transform('mean')

    # b. Calculate conditional means for stayers and inflowers at destination (event year muni).
    flow_df['mean_wage_stayers_event'] = flow_df.where(is_stayer)[wage_event_col].groupby(flow_df[muni_event_col]).transform('mean')
    flow_df['mean_wage_inflowers_event'] = flow_df.where(is_inflower)[wage_event_col].groupby(flow_df[muni_event_col]).transform('mean')

    # c. Calculate proportions.
    flow_df['n_incumbents'] = is_incumbent.groupby(flow_df[muni_base_col]).transform('sum')
    flow_df['n_outflowers'] = is_outflower.groupby(flow_df[muni_base_col]).transform('sum')
    flow_df['pr_outflow'] = flow_df['n_outflowers'] / flow_df['n_incumbents']

    flow_df['n_employees_event'] = is_employed_event.groupby(flow_df[muni_event_col]).transform('sum')
    flow_df['n_inflowers'] = is_inflower.groupby(flow_df[muni_event_col]).transform('sum')
    flow_df['pr_inflow'] = flow_df['n_inflowers'] / flow_df['n_employees_event']

    # --- 4. Calculate Components at the Individual Level ---

    # Component (i): Stayers' wage growth. This is calculated directly on the stayer subset.
    stayers_df = flow_df[is_stayer].copy()
    stayers_df['stayers_wage_growth'] = stayers_df[wage_event_col] - stayers_df[wage_base_col]

    # Component (ii): Outflow composition term.
    flow_df['outflow_composition'] = (flow_df['mean_wage_stayers_base'] - flow_df['mean_wage_outflowers_base']) * flow_df['pr_outflow']

    # Component (iii): Inflow composition term.
    flow_df['inflow_composition'] = (flow_df['mean_wage_stayers_event'] - flow_df['mean_wage_inflowers_event']) * flow_df['pr_inflow']

    # --- 5. Final Aggregation to Municipality Level ---

    # Aggregate the stayers' growth component.
    stayers_agg = stayers_df.groupby(muni_base_col)[['stayers_wage_growth']].mean()

    # Aggregate the composition components (they are constant within a municipality, so 'first' is efficient).
    outflow_agg = flow_df.groupby(muni_base_col)[['outflow_composition']].first()
    inflow_agg = flow_df.groupby(muni_event_col)[['inflow_composition']].first()

    # Combine all components into a single DataFrame.
    components_df = pd.concat([stayers_agg, outflow_agg, inflow_agg], axis=1)

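    # Ensure all municipalities from the analysis are present, filling missing with 0.
    # The index is named so that downstream merges on 'Municipality_ID' can find it.
    all_munis = pd.Index(analysis_panel['Municipality_ID'].dropna().unique(), name='Municipality_ID')
    components_df = components_df.reindex(all_munis).fillna(0)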

    print(f"[{task_name}] Wage decomposition components computed for year {event_year} using vectorized approach.")
    return components_df

# ------------------------------------------------------------------------------
# Task 13, Orchestrator Function
# ------------------------------------------------------------------------------

def decompose_regional_wage_effect(
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    wage_effect_results: Dict[str, Dict[str, Any]],
    master_config: Dict[str, Any],
    event_year: int
) -> Dict[str, Any]:
    """
    Orchestrates the decomposition of the regional wage effect.

    This function implements the decomposition from Equations (6) and (7),
    estimating the causal effect of immigration on each component of wage
    change (stayers' growth, outflow selection, inflow selection) and
    reconciling them with the total regional and pure wage effects.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        wage_effect_results (Dict): The results from Task 12, containing the
                                    estimated γ^R and γ^W.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to estimate.

    Returns:
        Dict[str, Any]: A dictionary containing the estimation results for
                        each component of the wage decomposition.
    """
    # Step 1: Compute the wage decomposition components at the municipality level.
    components_df = _compute_wage_decomposition_components(
        analysis_panel, event_year, master_config
    )

    # Prepare a consistent analysis sample.
    analysis_data_year = event_study_df.loc[(slice(None), event_year), :].copy()
    analysis_data_year = analysis_data_year.merge(
        components_df, on='Municipality_ID', how='left'
    )

    outcomes_to_estimate = {
        'stayers_growth': 'stayers_wage_growth',
        'outflow_comp': 'outflow_composition',
        'inflow_comp': 'inflow_composition'
    }
    analysis_data_year.dropna(subset=list(outcomes_to_estimate.values()), inplace=True)

    # Step 2: Estimate 2SLS for each component.
    results = {}
    for name, outcome_col in outcomes_to_estimate.items():
        print(f"\n--- Estimating effect on: {name} ({outcome_col}) for year {event_year} ---")
        results[name] = estimate_regional_effect_2sls(
            event_study_df=analysis_data_year.set_index('Municipality_ID', append=True).swaplevel(0,1),
            master_config=master_config,
            outcome_variable=outcome_col,
            event_year=event_year
        )

    # Step 3: Reconcile and interpret the full decomposition per Equation (7).
    gamma_R = wage_effect_results['regional_wage_effect']['point_estimate']
    gamma_W = wage_effect_results['pure_wage_effect']['point_estimate']

    delta_stayers = results['stayers_growth']['point_estimate']
    delta_outflow = results['outflow_comp']['point_estimate']
    delta_inflow = results['inflow_comp']['point_estimate']

    # The "age selection" effect is the difference between the raw stayers' growth
    # effect and the age-controlled pure wage effect.
    delta_age_selection = delta_stayers - gamma_W

    # The total composition effect is the sum of its parts.
    total_composition_effect = delta_outflow - delta_inflow + delta_age_selection

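    # Final identity check: γ^R ≈ γ^W + Total Composition Effect.
    # γ^W cancels algebraically, so this compares γ^R with
    # delta_stayers + delta_outflow - delta_inflow.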
    reconstructed_gamma_R = gamma_W + total_composition_effect

    if not np.isclose(gamma_R, reconstructed_gamma_R, atol=1e-8):
        raise AssertionError(
            "Wage decomposition identity check FAILED! "
            f"Regional Effect (γ^R) = {gamma_R:.8f}, but "
            f"Reconstructed Effect (γ^W + Composition) = {reconstructed_gamma_R:.8f}."
        )
    else:
        print("\n" + "="*80)
        print("Wage Decomposition Identity Verification PASSED")
        print(f"Regional Effect (γ^R): {gamma_R:.6f}")
        print(f"Reconstructed (γ^W + Comp): {reconstructed_gamma_R:.6f}")
        print("="*80)

    # Assemble final results dictionary in the format of Table 2.
    final_decomposition = {
        'regional_wage_effect': wage_effect_results['regional_wage_effect'],
        'pure_wage_effect': wage_effect_results['pure_wage_effect'],
        'composition_effect_total': total_composition_effect,
        'composition_components': {
            'outflow_selection': results['outflow_comp'],
            'inflow_selection': results['inflow_comp'], # Note: sign is flipped in Table 2
            'age_selection': delta_age_selection
        }
    }

    print(f"\nTask 13: Decomposition of regional wage effect for year {event_year} completed.")
    return final_decomposition
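
The identity behind Equation (6) can be checked by hand on a toy example. The sketch below uses made-up log wages for a single hypothetical municipality (the numbers and variable names are illustrative assumptions, not values from the IEB data) and confirms that the three components add up to the change in the mean log wage.

```python
from statistics import fmean

# Hypothetical log wages for one municipality (illustrative values only).
stayers_base = [4.0, 4.2]      # stayers, base year
stayers_event = [4.1, 4.3]     # same stayers, event year
outflowers_base = [3.6]        # incumbents who leave, base year
inflowers_event = [3.8, 3.9]   # new arrivals, event year

# Shares used as weights in the decomposition.
pr_outflow = len(outflowers_base) / (len(stayers_base) + len(outflowers_base))
pr_inflow = len(inflowers_event) / (len(stayers_event) + len(inflowers_event))

# The three components of Equation (6).
stayers_growth = fmean(stayers_event) - fmean(stayers_base)
outflow_comp = (fmean(stayers_base) - fmean(outflowers_base)) * pr_outflow
inflow_comp = (fmean(stayers_event) - fmean(inflowers_event)) * pr_inflow

# Change in the mean log wage of everyone employed in the municipality.
total_change = (
    fmean(stayers_event + inflowers_event)
    - fmean(stayers_base + outflowers_base)
)
reconstructed = stayers_growth + outflow_comp - inflow_comp

print(f"total change:  {total_change:.6f}")
print(f"reconstructed: {reconstructed:.6f}")
```

With raw group means the identity holds up to floating-point error. The inflow term enters with a negative sign, which is why `decompose_regional_wage_effect` subtracts `delta_inflow` when it rebuilds the regional effect.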


In [None]:
# ==============================================================================
# Task 14: Selection-bounding for the pure wage effect (Card-style bounds)
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 14, Step 1: Helper to Estimate the Probit Model for Staying
# ------------------------------------------------------------------------------

def _estimate_staying_probit(
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    event_year: int,
    config: Dict[str, Any],
    task_name: str = "Task 14, Step 1"
) -> Tuple[float, float]:
    """
    Estimates a probit model for the probability of an incumbent staying.

    This function identifies 1990 incumbents and models their probability of
    staying in the same municipality by the event year as a function of the
    local immigration shock.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        event_year (int): The specific year of the event study.
        config (Dict[str, Any]): The master configuration dictionary.

    Returns:
        Tuple[float, float]: A tuple containing:
                             - b_hat (float): The coefficient on the shock variable (∂π/∂ΔI_r).
                             - pi_hat (float): The average latent index π.
    """
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # Identify incumbents: native, full-time workers employed in 1990.
    incumbents = analysis_panel[
        (analysis_panel.index.get_level_values('snapshot_year') == base_year) &
        analysis_panel['is_native'] &
        analysis_panel['is_full_time'] &
        analysis_panel['is_employed']
    ].copy()

    # Collect the event-year records of the base-year incumbents.
    # Whether each incumbent actually stayed is determined in the next step.
    stayers = analysis_panel[
        (analysis_panel.index.get_level_values('snapshot_year') == event_year) &
        analysis_panel.index.get_level_values('Worker_ID').isin(incumbents.index.get_level_values('Worker_ID'))
    ]

    # Create the binary 'is_stayer' dependent variable.
    # A stayer is an incumbent who is employed in the same municipality in the event year.
    stayer_status = incumbents.reset_index().merge(
        stayers.reset_index()[['Worker_ID', 'Municipality_ID']],
        on='Worker_ID',
        how='left',
        suffixes=('_base', '_event')
    )
    stayer_status['is_stayer'] = (
        stayer_status['Municipality_ID_base'] == stayer_status['Municipality_ID_event']
    ).astype(int)

    # Merge the appropriate shock variable for the event year.
    event_data_year = event_study_df.loc[(slice(None), event_year), ['shock']].reset_index()
    probit_data = stayer_status.merge(
        event_data_year.drop_duplicates(subset=['Municipality_ID']),
        left_on='Municipality_ID_base',
        right_on='Municipality_ID',
        how='left'
    ).dropna(subset=['is_stayer', 'shock'])

    # Estimate the probit model: Pr(is_stayer=1) = Φ(a + b*shock).
    X = sm.add_constant(probit_data['shock'])
    y = probit_data['is_stayer']
    probit_model = sm.Probit(y, X).fit(disp=0) # disp=0 suppresses convergence messages.

    # Extract the key coefficient b_hat = ∂π/∂ΔI_r.
    b_hat = probit_model.params['shock']

    # Calculate the average latent index π_hat from the overall stayer rate.
    # π = Φ⁻¹(η), where η is the share of stayers.
    stayer_rate_eta = probit_data['is_stayer'].mean()
    pi_hat = norm.ppf(stayer_rate_eta)

    print(f"[{task_name}] Probit model for staying estimated. b_hat = {b_hat:.4f}, pi_hat = {pi_hat:.4f}.")
    return b_hat, pi_hat

# ------------------------------------------------------------------------------
# Task 14, Steps 2 & 3: Helper to Compute Bias Components and Final Bounds
# ------------------------------------------------------------------------------

def _compute_selection_bounds(
    pure_wage_results: Dict[str, Any],
    b_hat: float,
    pi_hat: float,
    config: Dict[str, Any],
    task_name: str = "Task 14, Steps 2 & 3"
) -> Tuple[float, float, float]:
    """
    Computes the components of selection bias and the final bounds.

    Args:
        pure_wage_results (Dict): The results dictionary from the pure wage effect estimation.
        b_hat (float): The probit coefficient on the shock variable.
        pi_hat (float): The average latent index from the probit model.
        config (Dict[str, Any]): The master configuration dictionary.

    Returns:
        Tuple[float, float, float]: A tuple containing:
                                    - max_bias (float): The maximum absolute potential bias.
                                    - lower_bound (float): The lower bound on γ^W.
                                    - upper_bound (float): The upper bound on γ^W.
    """
    # --- Step 2: Compute Residual Variance and IMR Derivative ---
    # The FD-IV model object must be passed within the results dictionary.
    if 'model_object' not in pure_wage_results:
        raise KeyError("`model_object` not found in `pure_wage_results`. Cannot access residuals.")

    # σ̂_Δe: Standard deviation of the time-varying wage shocks (residuals).
    residuals = pure_wage_results['model_object'].resids
    sigma_delta_e = np.std(residuals)

    # λ(π): Inverse Mills Ratio at π_hat.
    lambda_pi = norm.pdf(pi_hat) / norm.cdf(pi_hat)

    # ∂λ(π)/∂π: Derivative of the Inverse Mills Ratio at π_hat.
    # Formula: -λ(π) * (λ(π) + π)
    d_lambda_d_pi = -lambda_pi * (lambda_pi + pi_hat)

    print(f"[{task_name}] Bias components: σ_Δe={sigma_delta_e:.4f}, ∂λ/∂π={d_lambda_d_pi:.4f}")

    # --- Step 3: Bound the Selection Bias ---
    # Bias = ρ * σ̂_Δe * (∂λ/∂π) * (∂π/∂ΔI_r)
    # We calculate the maximum absolute bias by setting ρ = ±1.
    max_bias = np.abs(sigma_delta_e * d_lambda_d_pi * b_hat)

    # Extract the point estimate for the pure wage effect.
    gamma_W_hat = pure_wage_results['point_estimate']

    # The bounds are the point estimate ± the maximum bias.
    lower_bound = gamma_W_hat - max_bias
    upper_bound = gamma_W_hat + max_bias

    return max_bias, lower_bound, upper_bound

# ------------------------------------------------------------------------------
# Task 14, Orchestrator Function
# ------------------------------------------------------------------------------

def bound_pure_wage_effect_selection(
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    pure_wage_results: Dict[str, Any],
    master_config: Dict[str, Any],
    event_year: int
) -> Dict[str, Any]:
    """
    Orchestrates the selection-bounding exercise for the pure wage effect.

    This function implements the Card-style bounding approach to assess the
    potential bias in the pure wage effect estimate (γ^W) arising from
    selection on time-varying unobservables.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        pure_wage_results (Dict[str, Any]): The detailed results from the pure
            wage effect estimation in Task 12. Must include the fitted model
            object to access residuals.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to analyze.

    Returns:
        Dict[str, Any]: A dictionary containing the original point estimate,
                        the maximum potential bias, and the resulting bounds.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if 'point_estimate' not in pure_wage_results:
        raise KeyError("`pure_wage_results` dictionary is missing key 'point_estimate'.")

    # Step 1: Estimate the probit model for the staying decision.
    b_hat, pi_hat = _estimate_staying_probit(
        analysis_panel, event_study_df, event_year, master_config
    )

    # Steps 2 & 3: Compute the bias components and the final bounds.
    max_bias, lower_bound, upper_bound = _compute_selection_bounds(
        pure_wage_results, b_hat, pi_hat, master_config
    )

    # --- Assemble and Report Final Results ---
    gamma_W_hat = pure_wage_results['point_estimate']

    results = {
        'event_year': event_year,
        'point_estimate_gamma_W': gamma_W_hat,
        'max_potential_bias': max_bias,
        'lower_bound': lower_bound,
        'upper_bound': upper_bound
    }

    print("\n" + "="*80)
    print(f"Selection Bounding Results for Pure Wage Effect (γ^W) - Year {event_year}")
    print(f"Original Point Estimate: {gamma_W_hat:.4f}")
    print(f"Maximum Potential Bias (Δγ̂): ±{max_bias:.4f}")
    print(f"Resulting Worst-Case Bounds for γ^W (ρ = ±1): [{lower_bound:.4f}, {upper_bound:.4f}]")
    print("="*80 + "\n")

    print("Task 14: Selection-bounding for the pure wage effect completed successfully.")
    return results
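
The bias formula used in `_compute_selection_bounds` can be sketched in isolation. The example below relies only on the standard library; the inputs (`stayer_rate`, `sigma_delta_e`, `b_hat`, `gamma_w_hat`) are hypothetical values chosen for illustration, and the finite-difference check is an addition that verifies the closed-form derivative of the inverse Mills ratio.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def inverse_mills(pi: float) -> float:
    """λ(π) = φ(π) / Φ(π)."""
    return STD_NORMAL.pdf(pi) / STD_NORMAL.cdf(pi)


def inverse_mills_derivative(pi: float) -> float:
    """Closed form: ∂λ/∂π = -λ(π) * (λ(π) + π)."""
    lam = inverse_mills(pi)
    return -lam * (lam + pi)


# Hypothetical inputs (assumptions for illustration, not estimates from the paper).
stayer_rate = 0.80     # share of incumbents who stay (η)
sigma_delta_e = 0.15   # standard deviation of the FD-IV residuals
b_hat = -0.50          # probit coefficient on the shock
gamma_w_hat = 0.10     # point estimate of the pure wage effect

pi_hat = STD_NORMAL.inv_cdf(stayer_rate)  # π = Φ⁻¹(η)
d_lambda = inverse_mills_derivative(pi_hat)

# Central finite difference as a sanity check on the closed form.
h = 1e-6
numeric = (inverse_mills(pi_hat + h) - inverse_mills(pi_hat - h)) / (2 * h)

# The worst case |ρ| = 1 gives the maximum absolute bias.
max_bias = abs(sigma_delta_e * d_lambda * b_hat)
lower, upper = gamma_w_hat - max_bias, gamma_w_hat + max_bias

print(f"dλ/dπ closed form = {d_lambda:.6f}, numeric = {numeric:.6f}")
print(f"bounds on γ^W: [{lower:.4f}, {upper:.4f}]")
```

Because ρ is unknown and set to its extreme values, the resulting interval is a worst-case range for the selection bias rather than a confidence interval.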


In [None]:
# ==============================================================================
# Task 15: Non-employed entrants analysis (employment entry and pure wages)
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 15, Step 1: Helper to Prepare Non-Employed Entrants Data
# ------------------------------------------------------------------------------

def _prepare_non_employed_entrants_data(
    analysis_panel: pd.DataFrame,
    national_wage_series: pd.Series,
    config: Dict[str, Any],
    task_name: str = "Task 15, Step 1"
) -> pd.DataFrame:
    """
    Identifies the 1990 non-employed cohort and imputes their counterfactual baseline wage.

    This function selects native workers who were not employed in 1990 but had a
    prior job between 1986-1989. It then calculates their counterfactual 1990
    log wage based on their last observed wage, adjusted for aggregate wage growth.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        national_wage_series (pd.Series): Series of national mean log wages, indexed by year.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'Worker_ID' containing the cohort of
                      non-employed entrants from 1990, with their counterfactual
                      1990 log wage and origin municipality.
    """
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # Identify the cohort: native workers, non-employed in 1990, with a valid pre-1990 job record.
    # The lookback information (`last_pre1990_*`) was attached in Task 5.
    base_panel = analysis_panel.loc[(slice(None), base_year), :]
    entrants_cohort_1990 = base_panel[
        (~base_panel['is_employed']) &
        base_panel['is_native'] &
        base_panel['last_pre1990_municipality_id'].notna()
    ].copy()

    if entrants_cohort_1990.empty:
        print(f"[{task_name}] No non-employed entrants with valid lookback history found.")
        return pd.DataFrame()

    # Validate that last observed wages are positive before log transformation.
    if (entrants_cohort_1990['last_pre1990_daily_wage'] <= 0).any():
        raise ValueError(f"[{task_name}] Found non-positive 'last_pre1990_daily_wage' values.")

    # --- Impute Counterfactual 1990 Log Wage ---
    # Equation: log_w_tilde_ir,1990 = log_w_ir,t* + (log_w_bar_1990 - log_w_bar_t*)

    # Get national average log wage for the base year.
    log_w_bar_1990 = national_wage_series.loc[base_year]

    # Map the national average log wage for the year of the last spell (t*).
    entrants_cohort_1990['log_w_bar_t_star'] = entrants_cohort_1990['last_pre1990_spell_year'].map(national_wage_series)

    # Calculate the imputed wage.
    entrants_cohort_1990['log_wage_imputed_1990'] = (
        np.log(entrants_cohort_1990['last_pre1990_daily_wage']) +
        (log_w_bar_1990 - entrants_cohort_1990['log_w_bar_t_star'])
    )

    # Select and rename columns for clarity, setting Worker_ID as the index.
    entrants_df = entrants_cohort_1990.reset_index()[
        ['Worker_ID', 'last_pre1990_municipality_id', 'log_wage_imputed_1990']
    ].rename(columns={'last_pre1990_municipality_id': 'municipality_id_1990'})
    entrants_df = entrants_df.set_index('Worker_ID')

    print(f"[{task_name}] Prepared data for {len(entrants_df)} non-employed entrants from 1990.")
    return entrants_df

# ------------------------------------------------------------------------------
# Task 15, Step 2: Helper to Estimate Effect on Employment Entry
# ------------------------------------------------------------------------------

def _estimate_entrant_employment_effect(
    entrants_df: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    event_year: int,
    config: Dict[str, Any],
    task_name: str = "Task 15, Step 2"
) -> Dict[str, Any]:
    """
    Estimates the 2SLS effect of immigration on the employment entry rate.

    This function performs a complete 2SLS estimation to determine the causal
    impact of the local immigration shock on the probability that a member of
    the 1990 non-employed cohort finds employment by a given event year. The
    estimation is conducted at the municipality level, weighted by the number
    of non-employed entrants in each 1990 origin municipality. Inference uses
    cluster-robust standard errors and a percentile confidence interval from a
    wild cluster bootstrap with Rademacher weights.

    Args:
        entrants_df (pd.DataFrame): A DataFrame of the 1990 non-employed cohort,
            indexed by 'Worker_ID', with their 1990 origin municipality.
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel, used
            to determine employment status in the event year.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9,
            containing shocks, instruments, and cluster IDs.
        event_year (int): The specific year of the event study to analyze.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Dict[str, Any]: A dictionary containing the key estimation results:
                        'point_estimate', 'cluster_robust_se', 'p_value',
                        'f_statistic', 'bootstrap_ci', and 'n_obs'.
    """
    # --- Input Validation ---
    if not isinstance(entrants_df, pd.DataFrame):
        raise TypeError("`entrants_df` must be a pandas DataFrame.")
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")

    # --- 1. Construct the Outcome Variable: Employment Entry Share ---

    # Determine which entrants are employed in the event year by merging their
    # 1990 cohort data with their status in the event year. `Worker_ID` is moved
    # from the index to a column so it can be counted in the aggregation below.
    entrants_status_event_year = entrants_df.reset_index().merge(
        analysis_panel.loc[(slice(None), event_year), 'is_employed'].reset_index(),
        on='Worker_ID',
        how='left'
    )

    # Assume entrants not found in the event year panel are not employed.
    # The cast restores a boolean dtype after the left merge introduced missing values.
    entrants_status_event_year['is_employed'] = (
        entrants_status_event_year['is_employed'].fillna(False).astype(bool)
    )

    # Aggregate at the 1990 origin municipality level to get the outcome and weight.
    entry_rate_agg = entrants_status_event_year.groupby('municipality_id_1990').agg(
        # The weight for this regression is the number of entrants from each municipality.
        weight_entrants=('Worker_ID', 'count'),
        # The numerator for the share is the count of those who became employed.
        n_reemployed=('is_employed', 'sum')
    )

    # Calculate the outcome variable: Entry_Share_r = #{re-employed} / #{total entrants}.
    entry_rate_agg['entry_share'] = entry_rate_agg['n_reemployed'] / entry_rate_agg['weight_entrants']

    # --- 2. Prepare the Final Analysis DataFrame ---

    # Start with the main event study data for the specified year.
    analysis_data = event_study_df.loc[(slice(None), event_year), :].copy()

    # Merge the newly computed outcome and weight variables.
    analysis_data = analysis_data.merge(
        entry_rate_agg,
        left_on='Municipality_ID',
        right_index=True,
        how='left'
    ).fillna({'entry_share': 0, 'weight_entrants': 0}) # Fill municipalities with no entrants.

    # Filter to municipalities that had non-employed entrants in 1990 to form the analysis sample.
    analysis_data = analysis_data[analysis_data['weight_entrants'] > 0].copy()

    if analysis_data.empty:
        print(f"[{task_name}] No municipalities with non-employed entrants found. Cannot estimate.")
        return {}

    # --- 3. Full 2SLS Estimation with Custom Weights ---

    # Define the list of instrumental variables from the configuration.
    iv_names = list(config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys())

    # Define the R-style formula for the first-stage regression.
    formula_1st = f"shock ~ 1 + {' + '.join(iv_names)}"

    # Estimate the first stage using Weighted Least Squares (WLS).
    first_stage = smf.wls(formula_1st, data=analysis_data, weights=analysis_data['weight_entrants']).fit()

    # Predict the value of the endogenous variable from the first stage.
    analysis_data['shock_hat'] = first_stage.predict(analysis_data)

    # Define the R-style formula for the second-stage regression.
    formula_2nd = "entry_share ~ 1 + shock_hat"

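    # Estimate the second stage using WLS with the same custom weights.
    # Note: standard errors from a manual second stage treat `shock_hat` as data
    # and do not account for the first stage having been estimated.
    second_stage = smf.wls(formula_2nd, data=analysis_data, weights=analysis_data['weight_entrants']).fit()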

    # --- 4. Compute Cluster-Robust Standard Errors ---

    # Get the cluster identifier column from the configuration.
    cluster_col = config["algorithm_config_parameters"]["CLUSTER_LEVEL"]

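    # Re-fit the second stage with a covariance that is robust to within-cluster
    # correlation. Fitting with `cov_type='cluster'` returns pandas-labelled
    # results, which the name-based lookups in step 7 rely on.
    robust_results = smf.wls(
        formula_2nd, data=analysis_data, weights=analysis_data['weight_entrants']
    ).fit(cov_type='cluster', cov_kwds={'groups': analysis_data[cluster_col]})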

    # --- 5. Implement Wild Cluster Bootstrap for Inference ---

    # Extract bootstrap parameters from the configuration.
    n_reps = config["algorithm_config_parameters"]["BOOTSTRAP_REPLICATIONS"]
    seed = config["algorithm_config_parameters"]["RANDOM_SEED"]
    conf_level = config["algorithm_config_parameters"]["BOOTSTRAP_CONF_LEVEL"]

    # Initialize a seeded random number generator for reproducibility.
    rng = np.random.default_rng(seed)

    # Get the array of unique cluster identifiers.
    clusters = analysis_data[cluster_col].unique()

    # Store the coefficient estimate from each bootstrap replication.
    bootstrap_coeffs = []

    # Main bootstrap loop.
    for _ in range(n_reps):
        # a. Generate Rademacher weights (-1 or 1) for each cluster.
        rademacher_weights = rng.choice([-1, 1], size=len(clusters))
        cluster_weights = pd.DataFrame({cluster_col: clusters, 'v': rademacher_weights})

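        # b. Create bootstrap residuals by merging weights and multiplying.
        # The column merge discards the original index, so values are combined
        # positionally (a left merge preserves the row order of `analysis_data`).
        boot_data = analysis_data.merge(cluster_weights, on=cluster_col, how='left')
        boot_residuals = boot_data['v'].to_numpy() * second_stage.resid.to_numpy()

        # c. Generate the bootstrap dependent variable: Y_boot = Y_hat + residual_boot.
        boot_data['entry_share_boot'] = second_stage.fittedvalues.to_numpy() + boot_residuals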

        # d. Re-estimate the full 2SLS model on the bootstrapped data.
        # First stage on bootstrapped data.
        fs_boot = smf.wls(formula_1st, data=boot_data, weights=boot_data['weight_entrants']).fit()
        boot_data['shock_hat_boot'] = fs_boot.predict(boot_data)

        # Second stage on bootstrapped data.
        ss_boot = smf.wls("entry_share_boot ~ 1 + shock_hat_boot", data=boot_data, weights=boot_data['weight_entrants']).fit()

        # e. Store the coefficient of interest.
        bootstrap_coeffs.append(ss_boot.params['shock_hat_boot'])

    # --- 6. Compute Percentile Confidence Interval ---

    # Calculate the alpha for the specified confidence level.
    alpha = 1 - conf_level

    # Compute the lower and upper bounds from the bootstrap distribution.
    ci_lower = np.percentile(bootstrap_coeffs, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_coeffs, 100 * (1 - alpha / 2))

    # --- 7. Assemble and Return Final Results ---

    # Extract all key statistics from the estimation objects.
    results = {
        'point_estimate': robust_results.params['shock_hat'],
        'cluster_robust_se': robust_results.bse['shock_hat'],
        'p_value': robust_results.pvalues['shock_hat'],
        'f_statistic': first_stage.f_test(iv_names).fvalue,
        'bootstrap_ci': (ci_lower, ci_upper),
        'n_obs': int(robust_results.nobs)
    }

    # Print a summary of the final result.
    print(f"[{task_name}] Estimated employment entry effect: {results['point_estimate']:.4f}")

    # Return the structured dictionary of results.
    return results

# ------------------------------------------------------------------------------
# Task 15, Step 3: Helper to Estimate Pure Wage Effect for Re-Employed Entrants
# ------------------------------------------------------------------------------

def _estimate_entrant_wage_effect(
    entrants_df: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    event_year: int,
    config: Dict[str, Any],
    task_name: str = "Task 15, Step 3"
) -> Dict[str, Any]:
    """
    Estimates the pure wage effect for non-employed entrants who find a job.

    This function implements a First-Difference Instrumental Variable (FD-IV)
    model for the cohort of workers who were non-employed in 1990 but became
    re-employed by the specified event year. The dependent variable is the
    change in log wage relative to the imputed counterfactual 1990 wage. The
    model controls for changes in age and instruments the local immigration
    shock with distance to the border. Inference is based on a full wild
    cluster bootstrap procedure.

    Args:
        entrants_df (pd.DataFrame): A DataFrame of the 1990 non-employed cohort,
            indexed by 'Worker_ID', with their imputed 1990 counterfactual wage.
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel, used
            to identify re-employment status and event-year characteristics.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9,
            containing shocks, instruments, and cluster IDs.
        event_year (int): The specific year of the event study to analyze.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Dict[str, Any]: A dictionary containing the key estimation results:
                        'point_estimate', 'cluster_robust_se', 'p_value',
                        'f_statistic', 'bootstrap_ci', and 'n_obs'. Returns an
                        empty dictionary if no re-employed entrants are found.
    """
    # --- Input Validation ---
    if not isinstance(entrants_df, pd.DataFrame):
        raise TypeError("`entrants_df` must be a pandas DataFrame.")
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")

    # --- 1. Identify the Analysis Sample: Re-employed Entrants ---

    # Isolate the panel data for the specified event year.
    panel_event_year = analysis_panel.loc[(slice(None), event_year), :].reset_index()

    # Filter to workers who are employed and full-time in the event year.
    reemployed_in_event_year = panel_event_year[
        panel_event_year['is_employed'] & panel_event_year['is_full_time']
    ]

    # Perform an inner merge with the 1990 entrants cohort to get the final sample.
    reemployed_entrants = entrants_df.merge(
        reemployed_in_event_year,
        on='Worker_ID',
        how='inner'
    )

    # If no re-employed entrants are found, exit gracefully.
    if reemployed_entrants.empty:
        print(f"[{task_name}] No re-employed entrants found for wage analysis in year {event_year}.")
        return {}

    # --- 2. Construct First-Differenced (FD) Variables ---

    # Dependent variable: Δlog_w = log_w_event - log_w_imputed_1990
    reemployed_entrants['delta_log_wage'] = reemployed_entrants['log_wage_imputed'] - reemployed_entrants['log_wage_imputed_1990']

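    # To get Δage, we need the age in the base year for this cohort.
    base_year = config["temporal_parameters"]["BASE_YEAR"]
    # Take the base-year cross-section so the lookup is keyed by 'Worker_ID' alone;
    # a Series indexed by (Worker_ID, snapshot_year) would not match scalar Worker_ID keys.
    base_year_age_map = analysis_panel.xs(base_year, level='snapshot_year')['age']
    base_year_age = reemployed_entrants['Worker_ID'].map(base_year_age_map)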

    # Control variable 1: Δage = age_event - age_base
    reemployed_entrants['delta_age'] = reemployed_entrants['age'] - base_year_age

    # Control variable 2: Δage² = age_event² - age_base²
    reemployed_entrants['delta_age_sq'] = reemployed_entrants['age']**2 - base_year_age**2

    # --- 3. Merge Municipality-Level IVs and Cluster IDs ---

    # Extract the relevant slice from the event study data.
    event_data_year_slice = event_study_df.loc[(slice(None), event_year), :].reset_index()

    # Define the columns to merge: shock, instruments, and cluster ID.
    iv_cols = ['shock'] + list(config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys()) + [config["algorithm_config_parameters"]["CLUSTER_LEVEL"]]

    # Merge these variables onto the individual-level FD data based on the event-year municipality.
    fd_data = reemployed_entrants.merge(
        event_data_year_slice[['Municipality_ID'] + iv_cols],
        on='Municipality_ID',
        how='left'
    ).dropna(subset=['delta_log_wage', 'delta_age', 'delta_age_sq'] + iv_cols)  # Drop only rows missing a model variable.

    if fd_data.empty:
        print(f"[{task_name}] No valid observations remain after merging IVs. Cannot estimate.")
        return {}

    # --- 4. Estimate the FD-IV Model ---

    # Prepare data arrays for linearmodels.
    dependent = fd_data['delta_log_wage']
    exog = sm.add_constant(fd_data[['delta_age', 'delta_age_sq']])
    endog = fd_data[['shock']]
    instruments = fd_data[list(config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys())]
    clusters = fd_data[config["algorithm_config_parameters"]["CLUSTER_LEVEL"]]

    # Fit the 2SLS model with cluster-robust standard errors.
    model = IV2SLS(dependent, exog, endog, instruments).fit(cov_type='clustered', clusters=clusters)

    # --- 5. Implement Wild Cluster Bootstrap ---

    # Extract bootstrap parameters.
    n_reps = config["algorithm_config_parameters"]["BOOTSTRAP_REPLICATIONS"]
    seed = config["algorithm_config_parameters"]["RANDOM_SEED"]
    conf_level = config["algorithm_config_parameters"]["BOOTSTRAP_CONF_LEVEL"]

    # Initialize seeded random number generator.
    rng = np.random.default_rng(seed)

    # Get unique clusters and model residuals.
    unique_clusters = clusters.unique()
    fitted_values = model.predict()
    residuals = model.resids

    # Store bootstrap coefficients.
    bootstrap_coeffs = []

    # Main bootstrap loop.
    for _ in range(n_reps):
        # a. Generate Rademacher weights for each cluster.
        rademacher_weights = rng.choice([-1, 1], size=len(unique_clusters))
        cluster_weights = pd.DataFrame({config["algorithm_config_parameters"]["CLUSTER_LEVEL"]: unique_clusters, 'v': rademacher_weights})

        # b. Create bootstrap residuals.
        temp_df = fd_data.merge(cluster_weights, on=config["algorithm_config_parameters"]["CLUSTER_LEVEL"], how='left')
        boot_residuals = temp_df['v'].values * residuals.values

        # c. Create bootstrap dependent variable.
        boot_dependent = fitted_values.values.flatten() + boot_residuals

        # d. Re-estimate the IV model. Only the coefficient is needed here, so the
        #    cheaper unadjusted covariance is requested instead of the robust default.
        boot_model = IV2SLS(boot_dependent, exog, endog, instruments).fit(cov_type='unadjusted')

        # e. Store the coefficient of interest.
        bootstrap_coeffs.append(boot_model.params['shock'])

    # --- 6. Compute Percentile Confidence Interval ---

    # Calculate alpha.
    alpha = 1 - conf_level

    # Compute bounds from the bootstrap distribution.
    ci_lower = np.percentile(bootstrap_coeffs, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_coeffs, 100 * (1 - alpha / 2))

    # --- 7. Assemble and Return Final Results ---

    results = {
        'point_estimate': model.params['shock'],
        'cluster_robust_se': model.std_errors['shock'],
        'p_value': model.pvalues['shock'],
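        # Partial F-statistic of the excluded instruments, read from the first-stage diagnostics table.
        'f_statistic': model.first_stage.diagnostics.loc['shock', 'f.stat'],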
        'bootstrap_ci': (ci_lower, ci_upper),
        'n_obs': int(model.nobs)
    }

    # Print a summary of the final result.
    print(f"[{task_name}] Estimated pure wage effect for entrants: {results['point_estimate']:.4f}")

    # Return the structured dictionary of results.
    return results

# ------------------------------------------------------------------------------
# Task 15, Orchestrator Function
# ------------------------------------------------------------------------------

def analyze_non_employed_entrants(
    analysis_panel: pd.DataFrame,
    regional_panel: pd.DataFrame,
    national_wage_series: pd.Series,
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int
) -> Dict[str, Dict[str, Any]]:
    """
    Orchestrates the analysis of non-employed entrants from 1990.

    This function estimates the impact of immigration on two key outcomes for
    this group: their probability of re-employment and the pure wage effect
    for those who successfully re-enter the labor market.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        regional_panel (pd.DataFrame): The aggregated municipality-year panel.
        national_wage_series (pd.Series): Series of national mean log wages.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to analyze.

    Returns:
        Dict[str, Dict[str, Any]]: A dictionary containing the results for both
                                   the employment entry and pure wage effect estimations.
    """
    # Step 1: Identify the 1990 non-employed cohort and impute counterfactual wages.
    entrants_df = _prepare_non_employed_entrants_data(
        analysis_panel, national_wage_series, master_config
    )

    if entrants_df.empty:
        print("Task 15: No valid non-employed entrants to analyze. Skipping.")
        return {'employment_entry_effect': {}, 'pure_wage_effect': {}}

    # Step 2: Estimate the effect on the employment entry rate.
    employment_results = _estimate_entrant_employment_effect(
        entrants_df, analysis_panel, event_study_df, event_year, master_config
    )

    # Step 3: Estimate the pure wage effect for re-employed entrants.
    wage_results = _estimate_entrant_wage_effect(
        entrants_df, analysis_panel, event_study_df, event_year, master_config
    )

    final_results = {
        'employment_entry_effect': employment_results,
        'pure_wage_effect': wage_results
    }

    print(f"\nTask 15: Analysis of non-employed entrants for year {event_year} completed.")
    return final_results
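
The FD-IV helpers above draw one Rademacher sign per cluster and apply it to every residual in that cluster. The following is a minimal, self-contained sketch of that resampling step on synthetic data. It uses plain OLS instead of 2SLS to stay short, and the function name `wild_cluster_bootstrap_ci` and the toy data are illustrative assumptions, not part of the toolkit.

```python
import numpy as np
import pandas as pd


def wild_cluster_bootstrap_ci(y, X, clusters, n_reps=199, conf_level=0.95, seed=0):
    """Percentile CI for OLS coefficients via a wild cluster bootstrap (illustrative)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    # Integer code per observation; one code per distinct cluster label.
    codes, uniques = pd.factorize(np.asarray(clusters))
    rng = np.random.default_rng(seed)

    # Fit OLS once to obtain fitted values and residuals.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta
    resid = y - fitted

    draws = np.empty((n_reps, X.shape[1]))
    for b in range(n_reps):
        # One Rademacher sign per cluster, broadcast to its observations.
        signs = rng.choice([-1.0, 1.0], size=len(uniques))[codes]
        y_boot = fitted + signs * resid
        draws[b] = np.linalg.lstsq(X, y_boot, rcond=None)[0]

    # Percentile interval from the bootstrap distribution of each coefficient.
    alpha = 1.0 - conf_level
    lower = np.percentile(draws, 100 * alpha / 2, axis=0)
    upper = np.percentile(draws, 100 * (1 - alpha / 2), axis=0)
    return beta, lower, upper


# Toy data (assumed): 20 clusters of 10 observations with a cluster-level error component.
rng = np.random.default_rng(42)
cluster_ids = np.repeat(np.arange(20), 10)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + rng.normal(size=20)[cluster_ids] + rng.normal(size=200)
X = np.column_stack([np.ones(200), x])

beta, lower, upper = wild_cluster_bootstrap_ci(y, X, cluster_ids)
print(f"slope = {beta[1]:.3f}, 95% CI = ({lower[1]:.3f}, {upper[1]:.3f})")
```

Mapping the signs through integer cluster codes avoids a DataFrame merge inside the loop, which can matter when the number of replications is large.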


In [None]:
# Task 16: Older workers (50+) analysis (displacement and pure wages)

# ==============================================================================
# Task 16: Older workers (50+) analysis (displacement and pure wages)
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 16, Step 1: Helper to Identify the Older Worker Cohort
# ------------------------------------------------------------------------------

def _identify_older_worker_cohort(
    analysis_panel: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 16, Step 1"
) -> Set[int]:
    """
    Identifies the cohort of "older workers" based on their age in the base year.

    An older worker is defined as a native worker who was employed in 1990 and
    was aged 50 or over in that year. This function returns the set of their
    unique Worker_IDs for consistent filtering in subsequent analyses.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Set[int]: A set of unique Worker_IDs for the older worker cohort.
    """
    # Extract parameters from the configuration.
    base_year = config["temporal_parameters"]["BASE_YEAR"]
    age_threshold = config["sample_selection_parameters"]["OLDER_WORKER_AGE_THRESHOLD"]

    # Filter the panel to the base year.
    panel_1990 = analysis_panel.loc[(slice(None), base_year), :]

    # Identify the cohort based on the specified conditions.
    older_worker_cohort = panel_1990[
        (panel_1990['is_native']) &
        (panel_1990['is_employed']) &
        (panel_1990['age'] >= age_threshold)
    ]

    # Extract the unique Worker_IDs of this cohort.
    older_worker_ids = set(older_worker_cohort.index.get_level_values('Worker_ID'))

    print(f"[{task_name}] Identified {len(older_worker_ids)} older workers (age >= {age_threshold} in {base_year}).")

    return older_worker_ids

# ------------------------------------------------------------------------------
# Task 16, Step 2: Helper to Estimate Displacement Effect for Older Incumbents
# ------------------------------------------------------------------------------

def _estimate_older_worker_displacement(
    older_worker_ids: Set[int],
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    event_year: int,
    config: Dict[str, Any],
    task_name: str = "Task 16, Step 2"
) -> Dict[str, Any]:
    """
    Estimates the 2SLS effect on the displacement rate of older incumbents.

    This function calculates the displacement rate for the pre-defined cohort
    of older workers (age 50+ in 1990). The displacement rate for a municipality
    is the share of its 1990 older incumbents who are non-employed by the event
    year. It then estimates the causal effect of the immigration shock on this
    rate using a weighted 2SLS model with a full wild cluster bootstrap.

    Args:
        older_worker_ids (Set[int]): Set of Worker_IDs for the older worker cohort.
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        event_year (int): The specific year of the event study to analyze.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Dict[str, Any]: A dictionary of estimation results, including point
                        estimate, standard error, p-value, F-statistic, and
                        bootstrap confidence interval. Returns an empty
                        dictionary if the analysis cannot be run.
    """
    # --- Input Validation ---
    if not isinstance(older_worker_ids, set):
        raise TypeError("`older_worker_ids` must be a set.")
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")

    # --- 1. Construct Outcome Variable: Displacement Share for Older Workers ---

    # Extract the base year for comparison.
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # Filter the main panel to only the older worker cohort and the relevant years.
    older_worker_panel = analysis_panel[
        analysis_panel.index.get_level_values('Worker_ID').isin(older_worker_ids) &
        analysis_panel.index.get_level_values('snapshot_year').isin([base_year, event_year])
    ]

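    # Pivot to a wide format to track individual flows. `aggfunc='first'` keeps the
    # original values (the default mean would turn booleans and IDs into floats).
    flow_df = older_worker_panel.reset_index().pivot_table(
        index='Worker_ID',
        columns='snapshot_year',
        values=['is_employed', 'Municipality_ID', 'fte_weight'],
        aggfunc='first'
    )
    flow_df.columns = [f'{val}_{year}' for val, year in flow_df.columns]

    # Identify displacement: employed in base year and not employed in event year.
    # Workers without a record in a given year are treated as not employed.
    employed_base = flow_df[f'is_employed_{base_year}'].fillna(False).astype(bool)
    employed_event = flow_df[f'is_employed_{event_year}'].fillna(False).astype(bool)
    flow_df['is_displacement'] = employed_base & ~employed_event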

    # Assign the FTE weight from the base year to displaced workers.
    flow_df['displacement_fte'] = np.where(flow_df['is_displacement'], flow_df[f'fte_weight_{base_year}'], 0)

    # Aggregate at the 1990 origin municipality level.
    displacement_agg = flow_df.groupby(f'Municipality_ID_{base_year}').agg(
        # The weight is the total number of older incumbents from that municipality.
        weight_older_incumbents=('displacement_fte', 'size'),
        # The numerator is the sum of FTEs of displaced older workers.
        displaced_fte=('displacement_fte', 'sum')
    )

    # Calculate the outcome: Displacement_Share_50+_r = #{displaced FTE} / #{total older incumbents}.
    displacement_agg['displacement_share_50plus'] = displacement_agg['displaced_fte'] / displacement_agg['weight_older_incumbents']

    # --- 2. Prepare Data for 2SLS Estimation ---

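    # Merge the outcome and custom weight into the main event study data.
    analysis_data = event_study_df.loc[(slice(None), event_year), :].reset_index()
    analysis_data = analysis_data.merge(
        displacement_agg, left_on='Municipality_ID', right_index=True, how='left'
    )

    # Municipalities without older incumbents get zero outcome and weight. Only the merged
    # columns are filled, so missing shocks or instruments are not silently set to zero.
    merged_cols = list(displacement_agg.columns)
    analysis_data[merged_cols] = analysis_data[merged_cols].fillna(0)

    # Filter to municipalities that had older incumbents in 1990 and complete model inputs.
    algo_params = config["algorithm_config_parameters"]
    required_cols = ['shock', algo_params["CLUSTER_LEVEL"]] + list(algo_params["INSTRUMENT_SPECIFICATION"].keys())
    analysis_data = analysis_data[analysis_data['weight_older_incumbents'] > 0]
    analysis_data = analysis_data.dropna(subset=required_cols).reset_index(drop=True)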

    if analysis_data.empty:
        print(f"[{task_name}] No municipalities with older incumbents found. Cannot estimate.")
        return {}

    # --- 3. Full 2SLS + Bootstrap Estimation ---

    # Define instruments and model formulas.
    iv_names = list(config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys())
    formula_1st = f"shock ~ 1 + {' + '.join(iv_names)}"
    formula_2nd = "displacement_share_50plus ~ 1 + shock_hat"

    # Estimate the first stage with custom weights.
    first_stage = smf.wls(formula_1st, data=analysis_data, weights=analysis_data['weight_older_incumbents']).fit()
    analysis_data['shock_hat'] = first_stage.predict(analysis_data)

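    # Estimate the second stage with custom weights.
    # Note: standard errors from a manually run second stage treat `shock_hat` as data.
    second_stage_model = smf.wls(formula_2nd, data=analysis_data, weights=analysis_data['weight_older_incumbents'])
    second_stage = second_stage_model.fit()

    # Compute cluster-robust standard errors. Refitting with `cov_type='cluster'` returns
    # pandas-labelled results, so the coefficient can be looked up by name below.
    cluster_col = config["algorithm_config_parameters"]["CLUSTER_LEVEL"]
    robust_results = second_stage_model.fit(cov_type='cluster', cov_kwds={'groups': analysis_data[cluster_col]})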

    # --- 4. Wild Cluster Bootstrap ---

    # Extract bootstrap parameters.
    n_reps = config["algorithm_config_parameters"]["BOOTSTRAP_REPLICATIONS"]
    rng = np.random.default_rng(config["algorithm_config_parameters"]["RANDOM_SEED"])
    clusters = analysis_data[cluster_col].unique()

    # Main bootstrap loop.
    bootstrap_coeffs = []
    for _ in range(n_reps):
        # a. Generate Rademacher weights at the cluster level.
        rademacher_weights = rng.choice([-1, 1], size=len(clusters))
        cluster_weights = pd.DataFrame({cluster_col: clusters, 'v': rademacher_weights})

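        # b. Create bootstrap residuals and outcome. The merge resets the index, so the
        #    arrays are combined by position rather than by index label.
        boot_data = analysis_data.merge(cluster_weights, on=cluster_col, how='left')
        boot_residuals = boot_data['v'].to_numpy() * second_stage.resid.to_numpy()
        boot_data['outcome_boot'] = second_stage.fittedvalues.to_numpy() + boot_residuals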

        # c. Re-estimate the full 2SLS model.
        fs_boot = smf.wls(formula_1st, data=boot_data, weights=boot_data['weight_older_incumbents']).fit()
        boot_data['shock_hat_boot'] = fs_boot.predict(boot_data)
        ss_boot = smf.wls("outcome_boot ~ 1 + shock_hat_boot", data=boot_data, weights=boot_data['weight_older_incumbents']).fit()

        # d. Store the coefficient of interest.
        bootstrap_coeffs.append(ss_boot.params['shock_hat_boot'])

    # --- 5. Compute Confidence Interval and Assemble Results ---

    # Compute percentile confidence interval.
    alpha = 1 - config["algorithm_config_parameters"]["BOOTSTRAP_CONF_LEVEL"]
    ci_lower = np.percentile(bootstrap_coeffs, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_coeffs, 100 * (1 - alpha / 2))

    # Assemble the final results dictionary.
    results = {
        'point_estimate': robust_results.params['shock_hat'],
        'cluster_robust_se': robust_results.bse['shock_hat'],
        'p_value': robust_results.pvalues['shock_hat'],
        'f_statistic': float(np.squeeze(first_stage.f_test(iv_names).fvalue)),
        'bootstrap_ci': (ci_lower, ci_upper),
        'n_obs': int(robust_results.nobs)
    }

    # Print a summary of the final result.
    print(f"[{task_name}] Estimated displacement effect for older workers: {results['point_estimate']:.4f}")

    # Return the structured dictionary of results.
    return results

# ------------------------------------------------------------------------------
# Task 16, Step 3: Helper to Estimate Pure Wage Effect for Older Stayers
# ------------------------------------------------------------------------------

def _estimate_older_stayer_wage_effect(
    older_worker_ids: Set[int],
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    event_year: int,
    config: Dict[str, Any],
    task_name: str = "Task 16, Step 3"
) -> Dict[str, Any]:
    """
    Estimates the pure wage effect for older stayers using an FD-IV model.

    This function identifies "stayers" within the older worker cohort, constructs
    a first-differenced dataset of their wages and age controls, and estimates
    the effect of the immigration shock using an individual-level 2SLS model
    with a full wild cluster bootstrap for robust inference.

    Args:
        older_worker_ids (Set[int]): Set of Worker_IDs for the older worker cohort.
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        event_year (int): The specific year of the event study to analyze.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for error reporting.

    Returns:
        Dict[str, Any]: A dictionary of estimation results. Returns an empty
                        dictionary if the analysis cannot be run.
    """
    # --- Input Validation ---
    if not isinstance(older_worker_ids, set):
        raise TypeError("`older_worker_ids` must be a set.")
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")

    # --- 1. Identify the Analysis Sample: Older Stayers ---

    # Extract the base year for comparison.
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # Filter panel to relevant years.
    stayers_panel = analysis_panel[
        analysis_panel.index.get_level_values('snapshot_year').isin([base_year, event_year])
    ]

    # Pivot to identify stayers (workers in the same municipality in both years).
    # `aggfunc='first'` keeps municipality IDs unchanged instead of averaging them.
    stayers_wide = stayers_panel.reset_index().pivot_table(
        index='Worker_ID', columns='snapshot_year', values=['Municipality_ID', 'log_wage_imputed', 'age', 'age_sq'],
        aggfunc='first'
    )
    stayers_wide.columns = [f'{val}_{year}' for val, year in stayers_wide.columns]
    stayers_df = stayers_wide[
        (stayers_wide[f'Municipality_ID_{base_year}'] == stayers_wide[f'Municipality_ID_{event_year}'])
    ].dropna()

    # Filter the stayers to include only those from the older worker cohort.
    older_stayers_df = stayers_df[stayers_df.index.isin(older_worker_ids)].copy()
    older_stayers_df.rename(columns={f'Municipality_ID_{base_year}': 'Municipality_ID'}, inplace=True)

    if older_stayers_df.empty:
        print(f"[{task_name}] No older stayers found for wage analysis.")
        return {}

    # --- 2. Construct First-Differenced (FD) Variables ---

    # Dependent variable: Δlog_w = log_w_event - log_w_base
    older_stayers_df['delta_log_wage'] = older_stayers_df[f'log_wage_imputed_{event_year}'] - older_stayers_df[f'log_wage_imputed_{base_year}']

    # Control variable 1: Δage = age_event - age_base
    older_stayers_df['delta_age'] = older_stayers_df[f'age_{event_year}'] - older_stayers_df[f'age_{base_year}']

    # Control variable 2: Δage² = age_event² - age_base²
    older_stayers_df['delta_age_sq'] = older_stayers_df[f'age_sq_{event_year}'] - older_stayers_df[f'age_sq_{base_year}']

    # --- 3. Merge IVs and Cluster IDs ---

    # Extract municipality-level variables for the event year.
    event_data_year = event_study_df.loc[(slice(None), event_year), :].reset_index()
    iv_cols = ['shock'] + list(config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys()) + [config["algorithm_config_parameters"]["CLUSTER_LEVEL"]]

    # Merge onto the individual-level FD data.
    fd_data = older_stayers_df.merge(
        event_data_year[['Municipality_ID'] + iv_cols], on='Municipality_ID', how='left'
    ).dropna()

    if fd_data.empty:
        print(f"[{task_name}] No valid observations remain after merging IVs. Cannot estimate.")
        return {}

    # --- 4. Estimate FD-IV Model and Bootstrap ---

    # Prepare data arrays for linearmodels.
    dependent = fd_data['delta_log_wage']
    exog = sm.add_constant(fd_data[['delta_age', 'delta_age_sq']])
    endog = fd_data[['shock']]
    instruments = fd_data[list(config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys())]
    clusters = fd_data[config["algorithm_config_parameters"]["CLUSTER_LEVEL"]]

    # Fit the 2SLS model with cluster-robust standard errors.
    model = IV2SLS(dependent, exog, endog, instruments).fit(cov_type='clustered', clusters=clusters)

    # Wild cluster bootstrap (same procedure as in the earlier FD-IV helpers).
    cluster_col = config["algorithm_config_parameters"]["CLUSTER_LEVEL"]
    n_reps = config["algorithm_config_parameters"]["BOOTSTRAP_REPLICATIONS"]
    rng = np.random.default_rng(config["algorithm_config_parameters"]["RANDOM_SEED"])
    unique_clusters = clusters.unique()

    # Residuals and fitted values do not change across replications, so compute them once.
    residuals = model.resids.values
    fitted_values = model.predict().values.flatten()

    bootstrap_coeffs = []
    for _ in range(n_reps):
        # a. Draw one Rademacher weight per cluster.
        rademacher_weights = rng.choice([-1, 1], size=len(unique_clusters))
        cluster_weights = pd.DataFrame({cluster_col: unique_clusters, 'v': rademacher_weights})

        # b. Map the weights to observations and build the bootstrap outcome.
        temp_df = fd_data.merge(cluster_weights, on=cluster_col, how='left')
        boot_dependent = fitted_values + temp_df['v'].values * residuals

        # c. Re-estimate the IV model and store the coefficient of interest.
        boot_model = IV2SLS(boot_dependent, exog, endog, instruments).fit()
        bootstrap_coeffs.append(boot_model.params['shock'])

    # --- 5. Compute Confidence Interval and Assemble Results ---

    alpha = 1 - config["algorithm_config_parameters"]["BOOTSTRAP_CONF_LEVEL"]
    ci_lower = np.percentile(bootstrap_coeffs, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_coeffs, 100 * (1 - alpha / 2))

    # Assemble the final results dictionary.
    results = {
        'point_estimate': model.params['shock'],
        'cluster_robust_se': model.std_errors['shock'],
        'p_value': model.pvalues['shock'],
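        # Partial F-statistic of the excluded instruments, read from the first-stage diagnostics table.
        'f_statistic': model.first_stage.diagnostics.loc['shock', 'f.stat'],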
        'bootstrap_ci': (ci_lower, ci_upper),
        'n_obs': int(model.nobs)
    }

    # Print a summary of the final result.
    print(f"[{task_name}] Estimated pure wage effect for older stayers: {results['point_estimate']:.4f}")

    # Return the structured dictionary of results.
    return results

# ------------------------------------------------------------------------------
# Task 16, Orchestrator Function
# ------------------------------------------------------------------------------

def analyze_older_workers(
    analysis_panel: pd.DataFrame,
    regional_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int
) -> Dict[str, Dict[str, Any]]:
    """
    Orchestrates the heterogeneity analysis for older workers (age 50+).

    This function estimates the impact of immigration on two key outcomes for
    this group: their displacement rate and the pure wage effect for those
    who remain employed (stayers).

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        regional_panel (pd.DataFrame): The aggregated municipality-year panel.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to analyze.

    Returns:
        Dict[str, Dict[str, Any]]: A dictionary containing the results for both
                                   the displacement and pure wage effect estimations.
    """
    # Step 1: Identify the cohort of older workers based on their status in 1990.
    older_worker_ids = _identify_older_worker_cohort(analysis_panel, master_config)

    if not older_worker_ids:
        print("Task 16: No older worker cohort found. Skipping analysis.")
        return {'displacement_effect': {}, 'pure_wage_effect': {}}

    # Step 2: Estimate the displacement effect for this cohort.
    displacement_results = _estimate_older_worker_displacement(
        older_worker_ids, analysis_panel, event_study_df, event_year, master_config
    )

    # Step 3: Estimate the pure wage effect for the stayers within this cohort.
    wage_results = _estimate_older_stayer_wage_effect(
        older_worker_ids, analysis_panel, event_study_df, event_year, master_config
    )

    final_results = {
        'displacement_effect': displacement_results,
        'pure_wage_effect': wage_results
    }

    print(f"\nTask 16: Analysis of older workers for year {event_year} completed.")
    return final_results
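
To make the displacement outcome concrete, the sketch below computes it for a toy panel of five hypothetical workers. Column names mirror those used in `_estimate_older_worker_displacement`, while the data and the 1990/1995 years are made up for illustration. As in the helper, the numerator sums base-year FTE weights of displaced workers and the denominator counts incumbents.

```python
import numpy as np
import pandas as pd

base_year, event_year = 1990, 1995  # hypothetical years for this example

# Toy worker-year panel; worker 3 has no record in the event year.
panel = pd.DataFrame(
    {
        "Worker_ID": [1, 1, 2, 2, 3, 4, 4, 5, 5],
        "snapshot_year": [1990, 1995, 1990, 1995, 1990, 1990, 1995, 1990, 1995],
        "Municipality_ID": [10, 10, 10, 10, 10, 20, 20, 20, 20],
        "is_employed": [True, True, True, False, True, True, True, True, True],
        "fte_weight": [1.0, 1.0, 1.0, 0.0, 0.5, 1.0, 1.0, 1.0, 0.5],
    }
).set_index(["Worker_ID", "snapshot_year"])

# Wide layout: one row per worker, one column per (variable, year).
wide = panel.unstack("snapshot_year")

# NaN (no record) compares unequal to True, so missing records count as not employed.
employed_base = wide[("is_employed", base_year)].eq(True)
employed_event = wide[("is_employed", event_year)].eq(True)
displaced = employed_base & ~employed_event

flows = pd.DataFrame(
    {
        "origin": wide[("Municipality_ID", base_year)].astype(int),
        "displaced_fte": np.where(displaced, wide[("fte_weight", base_year)], 0.0),
    }
)

# Aggregate by base-year municipality: headcount and displaced FTE.
agg = flows.groupby("origin").agg(
    n_incumbents=("displaced_fte", "size"),
    displaced_fte=("displaced_fte", "sum"),
)
agg["displacement_share"] = agg["displaced_fte"] / agg["n_incumbents"]
print(agg)
```

In this toy panel, municipality 10 has a share of 0.5 (1.5 displaced FTE over 3 incumbents) and municipality 20 has a share of 0, since both of its workers remain employed.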


In [None]:
# Task 17: Routine vs. abstract occupational analyses (employment and wages by task group)

# ==============================================================================
# Task 17: Routine vs. abstract occupational analyses
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 17, Step 1: Helper to Compute Task-Specific Outcomes
# ------------------------------------------------------------------------------

def _compute_task_specific_outcomes(
    analysis_panel: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 17, Step 1"
) -> pd.DataFrame:
    """
    Aggregates data to compute task-specific outcomes at the municipality-year level.

    This function calculates employment levels and mean wages separately for
    'Routine' and 'Abstract' occupations, computes their change relative to the
    base year, and also calculates the change in the share of abstract employment.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for status reporting.

    Returns:
        pd.DataFrame: A DataFrame indexed by ('Municipality_ID', 'snapshot_year')
                      containing all task-specific outcome variables.
    """
    # Extract the base year for creating change-from-baseline outcomes.
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # Filter to native, employed workers who have a valid task classification.
    task_panel = analysis_panel[
        analysis_panel['is_native'] &
        analysis_panel['is_employed'] &
        analysis_panel['Routine_or_Abstract_Label'].notna()
    ].copy()

    # --- 1. Aggregate Employment and Wages by Task Group ---
    # Define the aggregation operations to be performed on each group.
    agg_funcs = {
        'fte_weight': 'sum',
        # For mean wage, we only consider full-time workers within each group.
        'log_wage_imputed': lambda x: x[task_panel.loc[x.index, 'is_full_time']].mean()
    }

    # Group by municipality, year, and task label, then apply the aggregations.
    task_agg = task_panel.groupby(
        ['Municipality_ID', 'snapshot_year', 'Routine_or_Abstract_Label']
    ).agg(agg_funcs).rename(columns={'log_wage_imputed': 'mean_log_wage'})

    # --- 2. Reshape Data to Wide Format ---
    # Unstack the task label level to create separate columns for 'Routine' and 'Abstract'.
    task_regional = task_agg.unstack(level='Routine_or_Abstract_Label')

    # Flatten the multi-level column index for easier access.
    task_regional.columns = [f'{val}_{task}' for val, task in task_regional.columns]

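    # Ensure a balanced panel by reindexing. Missing employment is a true zero;
    # missing mean wages stay NaN, because a log wage of 0 is not a valid placeholder.
    full_idx = pd.MultiIndex.from_product(
        [analysis_panel['Municipality_ID'].dropna().unique(), analysis_panel.index.get_level_values('snapshot_year').unique()],
        names=['Municipality_ID', 'snapshot_year']
    )
    task_regional = task_regional.reindex(full_idx)
    fte_cols = [c for c in task_regional.columns if c.startswith('fte_weight_')]
    task_regional[fte_cols] = task_regional[fte_cols].fillna(0)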

    # --- 3. Compute Outcomes Relative to Base Year ---

    # Extract the baseline (1990) values for each task-specific measure.
    baseline_values = task_regional.loc[(slice(None), base_year), :].copy()
    baseline_values = baseline_values.reset_index(level='snapshot_year', drop=True).add_suffix('_base')

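    # Join the baseline values back onto the panel. `join` matches the single
    # 'Municipality_ID' index of the baseline to the same-named level of the panel
    # and keeps the ('Municipality_ID', 'snapshot_year') MultiIndex intact, which
    # the base-year reset further below relies on.
    outcomes_df = task_regional.join(baseline_values, how='left')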

    # Calculate the change-from-baseline outcomes for each task group.
    with np.errstate(divide='ignore', invalid='ignore'):
        for task in ['Routine', 'Abstract']:
            # Employment Outcome: (E_t - E_0) / E_0
            outcomes_df[f'emp_outcome_{task}'] = np.divide(
                outcomes_df[f'fte_weight_{task}'] - outcomes_df[f'fte_weight_{task}_base'],
                outcomes_df[f'fte_weight_{task}_base']
            )
            # Wage Outcome: log(w_t) - log(w_0)
            outcomes_df[f'wage_outcome_{task}'] = outcomes_df[f'mean_log_wage_{task}'] - outcomes_df[f'mean_log_wage_{task}_base']

    # --- 4. Compute Abstract Share Outcome ---

    # Calculate the share of abstract employment in total task-based employment for each year.
    outcomes_df['abstract_share'] = np.divide(
        outcomes_df['fte_weight_Abstract'],
        outcomes_df['fte_weight_Abstract'] + outcomes_df['fte_weight_Routine']
    )
    # Calculate the baseline abstract share.
    outcomes_df['abstract_share_base'] = np.divide(
        outcomes_df['fte_weight_Abstract_base'],
        outcomes_df['fte_weight_Abstract_base'] + outcomes_df['fte_weight_Routine_base']
    )
    # The outcome is the change in this share from the baseline.
    outcomes_df['abstract_share_outcome'] = outcomes_df['abstract_share'] - outcomes_df['abstract_share_base']

    # Clean up any inf/-inf values and set base year outcomes to exactly 0.
    outcomes_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    outcomes_df.loc[outcomes_df.index.get_level_values('snapshot_year') == base_year,
                    [c for c in outcomes_df.columns if 'outcome' in c]] = 0.0

    print(f"[{task_name}] Task-specific outcomes computed successfully.")
    return outcomes_df
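
# ------------------------------------------------------------------------------
# Illustrative sketch (hypothetical, not part of the pipeline)
# ------------------------------------------------------------------------------
# A minimal example of the change-from-baseline outcome, (E_t - E_0) / E_0, on a
# municipality-year MultiIndex. The function name, the toy values, and the single
# 'fte_weight' column are assumptions made purely for illustration.

def _demo_change_from_baseline():
    import pandas as pd

    # Toy panel: two municipalities observed in a base year (1990) and one later year.
    idx = pd.MultiIndex.from_product(
        [['A', 'B'], [1990, 1995]], names=['Municipality_ID', 'snapshot_year']
    )
    toy = pd.DataFrame({'fte_weight': [100.0, 120.0, 50.0, 40.0]}, index=idx)

    # Baseline rows, indexed by municipality only.
    base = toy.xs(1990, level='snapshot_year').add_suffix('_base')

    # Joining on the shared 'Municipality_ID' level keeps the two-level index.
    out = toy.join(base, how='left')

    # Relative employment change with respect to the base year.
    out['emp_outcome'] = (out['fte_weight'] - out['fte_weight_base']) / out['fte_weight_base']
    return out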

# ------------------------------------------------------------------------------
# Task 17, Steps 2 & 3: Orchestrator for Task Heterogeneity Analysis
# ------------------------------------------------------------------------------

def analyze_task_heterogeneity(
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int
) -> Dict[str, Any]:
    """
    Orchestrates the heterogeneity analysis by occupation task type.

    This function estimates the causal impact of immigration separately for
    Routine and Abstract occupations on both regional employment and pure wages.
    It also estimates the effect on the regional share of abstract employment.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to analyze.

    Returns:
        Dict[str, Any]: A nested dictionary containing the estimation results
                        for each task-specific analysis.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")

    # --- Step 1: Compute Task-Specific Outcomes ---
    task_outcomes_df = _compute_task_specific_outcomes(analysis_panel, master_config)

    # Prepare a consistent analysis dataset for the event year by merging outcomes.
    analysis_data_year = event_study_df.loc[(slice(None), event_year), :].copy()
    analysis_data_year = analysis_data_year.merge(
        task_outcomes_df, on=['Municipality_ID', 'snapshot_year'], how='left'
    )

    # Initialize the results dictionary.
    results: Dict[str, Any] = {'Routine': {}, 'Abstract': {}}

    # --- Step 2: Estimate Task-Specific Regional Employment and Pure Wage Effects ---
    for task in ['Routine', 'Abstract']:

        # --- Regional Employment Effect ---
        print(f"\n--- Estimating Regional Employment Effect for '{task}' tasks ---")

        # Define the specific outcome and weight columns for this regression.
        outcome_col = f'emp_outcome_{task}'
        weight_col = f'fte_weight_{task}_base'

        # Prepare the data sample for this specific estimation.
        temp_df = analysis_data_year.dropna(subset=[outcome_col, weight_col])
        temp_df = temp_df[temp_df[weight_col] > 0].copy()

        # Create a temporary config to pass the custom weight column name.
        temp_config = master_config.copy()
        temp_config['algorithm_config_parameters'] = master_config['algorithm_config_parameters'].copy()
        temp_config['algorithm_config_parameters']['WEIGHT_COLUMN_MUNICIPALITY'] = weight_col

        # Call the master 2SLS estimator.
        results[task]['regional_employment'] = estimate_regional_effect_2sls(
            event_study_df=temp_df.set_index('Municipality_ID', append=True).swaplevel(0,1),
            master_config=temp_config,
            outcome_variable=outcome_col,
            event_year=event_year
        )

        # --- Pure Wage Effect ---
        print(f"\n--- Estimating Pure Wage Effect for '{task}' tasks ---")

        # Define the population: workers who were in the specified task group in the base year.
        base_year = master_config["temporal_parameters"]["BASE_YEAR"]
        task_cohort_ids = set(
            analysis_panel[
                (analysis_panel.index.get_level_values('snapshot_year') == base_year) &
                (analysis_panel['Routine_or_Abstract_Label'] == task)
            ].index.get_level_values('Worker_ID')
        )

        # Call the FD-IV estimator for older workers, which is generic enough to be reused here.
        # We pass the IDs of the relevant cohort to filter the stayer sample.
        results[task]['pure_wage'] = _estimate_older_stayer_wage_effect(
            older_worker_ids=task_cohort_ids, # Reusing the argument name, but it's just a set of IDs
            analysis_panel=analysis_panel,
            event_study_df=event_study_df,
            event_year=event_year,
            config=master_config,
            task_name=f"Task 17, Pure Wage ({task})"
        )

    # --- Step 3: Estimate Effect on Abstract Share ---
    print("\n--- Estimating Effect on Abstract Employment Share ---")

    # Define the outcome and use the main analysis weight.
    outcome_col = 'abstract_share_outcome'
    weight_col = master_config["algorithm_config_parameters"]["WEIGHT_COLUMN_MUNICIPALITY"]

    # Prepare the data sample.
    temp_df = analysis_data_year.dropna(subset=[outcome_col, weight_col])
    temp_df = temp_df[temp_df[weight_col] > 0].copy()

    # Call the master 2SLS estimator with the original config.
    results['abstract_share'] = estimate_regional_effect_2sls(
        event_study_df=temp_df.set_index('Municipality_ID', append=True).swaplevel(0,1),
        master_config=master_config,
        outcome_variable=outcome_col,
        event_year=event_year
    )

    print(f"\nTask 17: Analysis of task heterogeneity for year {event_year} completed.")
    return results
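
# ------------------------------------------------------------------------------
# Illustrative sketch (hypothetical, not part of the pipeline)
# ------------------------------------------------------------------------------
# A minimal example of why the temporary config copies the nested
# 'algorithm_config_parameters' dictionary before overriding the weight column:
# `dict.copy()` is shallow, so without the second copy the override would also
# change the master config. The function name and the values 'w_main' and
# 'w_task' are assumptions made purely for illustration.

def _demo_config_override():
    master = {'algorithm_config_parameters': {'WEIGHT_COLUMN_MUNICIPALITY': 'w_main'}}

    # Shallow copy of the outer dict, then an explicit copy of the nested dict.
    temp = master.copy()
    temp['algorithm_config_parameters'] = master['algorithm_config_parameters'].copy()

    # The override now affects only the temporary config.
    temp['algorithm_config_parameters']['WEIGHT_COLUMN_MUNICIPALITY'] = 'w_task'
    return master, temp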


In [None]:
# Task 18: Routine employment decomposition with within-region occupational switches (Equation (8))

# ==============================================================================
# Task 18: Routine employment decomposition with occupational switches
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 18, Step 1: Helper to Compute Routine Employment Decomposition Components
# ------------------------------------------------------------------------------

def _compute_routine_employment_decomposition(
    analysis_panel: pd.DataFrame,
    event_year: int,
    config: Dict[str, Any],
    task_name: str = "Task 18, Step 1"
) -> pd.DataFrame:
    """
    Computes the full decomposition of routine employment change per Equation (8).

    This function tracks workers' transitions in both location and task type
    between the base year and the event year to construct all flow components,
    including within-region upgrading (Routine -> Abstract) and downgrading
    (Abstract -> Routine).

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        event_year (int): The post-treatment year to compare against the base year.
        config (Dict[str, Any]): The master configuration dictionary.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'Municipality_ID' with columns for
                      each decomposition component share and the baseline routine
                      employment count used for weighting.
    """
    # Extract the base year from the configuration.
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # --- 1. Prepare Wide-Format Data for Flow Analysis ---

    # Filter to native workers and the two relevant years.
    native_panel = analysis_panel[
        analysis_panel['is_native'] &
        analysis_panel.index.get_level_values('snapshot_year').isin([base_year, event_year])
    ].copy()

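    # Pivot to a wide format for individual-level flow classification.
    # aggfunc='first' keeps non-numeric columns such as the task label, which the
    # default mean aggregation cannot handle. This assumes at most one row per
    # worker-year.
    flow_df = native_panel.reset_index().pivot_table(
        index='Worker_ID',
        columns='snapshot_year',
        values=['is_employed', 'Municipality_ID', 'fte_weight', 'Routine_or_Abstract_Label'],
        aggfunc='first'
    )
    # Flatten the multi-level column index.
    flow_df.columns = [f'{val}_{year}' for val, year in flow_df.columns]

    # For clarity and robustness, explicitly fill NaNs in status columns.
    # Plain assignment is used because an in-place fill on a column selection
    # does not reliably modify the parent DataFrame.
    for year in [base_year, event_year]:
        flow_df[f'is_employed_{year}'] = flow_df[f'is_employed_{year}'].fillna(False).astype(bool)
        flow_df[f'Routine_or_Abstract_Label_{year}'] = flow_df[f'Routine_or_Abstract_Label_{year}'].fillna('NonEmployed')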

    # --- 2. Classify Flows and Assign FTE Contributions ---

    # Define convenient column name variables.
    muni_base, muni_event = f'Municipality_ID_{base_year}', f'Municipality_ID_{event_year}'
    task_base, task_event = f'Routine_or_Abstract_Label_{base_year}', f'Routine_or_Abstract_Label_{event_year}'
    fte_base, fte_event = f'fte_weight_{base_year}', f'fte_weight_{event_year}'

    # a. Displacement (from Routine): R_r,0 -> N_1
    # A worker is displaced from a routine job if they were in a routine job in 1990 and non-employed in the event year.
    # The contribution is their 1990 FTE weight.
    flow_df['displacement_R_fte'] = np.where(
        (flow_df[task_base] == 'Routine') & (flow_df[task_event] == 'NonEmployed'),
        flow_df[fte_base], 0
    )
    # b. Relocation (from Routine): R_r,0 -> (R/A)_r',1
    # A worker relocates from a routine job if they were in a routine job in 1990 and employed in a different municipality in the event year.
    flow_df['relocation_R_fte'] = np.where(
        (flow_df[task_base] == 'Routine') & (flow_df[task_event] != 'NonEmployed') & (flow_df[muni_base] != flow_df[muni_event]),
        flow_df[fte_base], 0
    )
    # c. Upgrading: R_r,0 -> A_r,1
    # A worker upgrades if they were in a routine job in 1990 and an abstract job in the event year, within the same municipality.
    flow_df['upgrade_R_to_A_fte'] = np.where(
        (flow_df[task_base] == 'Routine') & (flow_df[task_event] == 'Abstract') & (flow_df[muni_base] == flow_df[muni_event]),
        flow_df[fte_base], 0
    )
    # d. Downgrading: A_r,0 -> R_r,1
    # A worker downgrades if they were in an abstract job in 1990 and a routine job in the event year, within the same municipality.
    flow_df['downgrade_A_to_R_fte'] = np.where(
        (flow_df[task_base] == 'Abstract') & (flow_df[task_event] == 'Routine') & (flow_df[muni_base] == flow_df[muni_event]),
        flow_df[fte_event], 0
    )
    # e. Inflows to Routine: (N/A/R_r')_0 -> R_r,1
    # A worker is an inflower to a routine job if they are in a routine job in the event year but were not in a routine job in that same municipality in 1990.
    flow_df['inflow_to_R_fte'] = np.where(
        (flow_df[task_event] == 'Routine') &
        ((flow_df[task_base] != 'Routine') | (flow_df[muni_base] != flow_df[muni_event])),
        flow_df[fte_event], 0
    )

    # --- 3. Aggregate Flows and Normalize ---

    # The denominator for all shares is the baseline (1990) native routine employment in each municipality.
    base_routine_employment = analysis_panel[
        (analysis_panel.index.get_level_values('snapshot_year') == base_year) &
        (analysis_panel['Routine_or_Abstract_Label'] == 'Routine') &
        (analysis_panel['is_native'])
    ].groupby('Municipality_ID')['fte_weight'].sum()
    base_routine_employment.name = 'routine_fte_employment_base'

    # Aggregate the FTE contributions for each flow type by the relevant municipality.
    displacement_agg = flow_df.groupby(muni_base)['displacement_R_fte'].sum()
    relocation_agg = flow_df.groupby(muni_base)['relocation_R_fte'].sum()
    upgrade_agg = flow_df.groupby(muni_base)['upgrade_R_to_A_fte'].sum()
    inflow_agg = flow_df.groupby(muni_event)['inflow_to_R_fte'].sum()
    downgrade_agg = flow_df.groupby(muni_base)['downgrade_A_to_R_fte'].sum()

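    # The "crowding-out" effect is defined as the total inflow to routine jobs minus the within-municipality downgrades.
    # fill_value=0 treats a municipality that appears in only one of the two series as having a zero flow in the other.
    crowding_out_agg = inflow_agg.sub(downgrade_agg, fill_value=0)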

    # Assemble all aggregated components into a single DataFrame.
    components_df = pd.DataFrame({
        'displacement_share': displacement_agg,
        'relocation_share': relocation_agg,
        'upgrade_share': upgrade_agg,
        'crowding_out_share': crowding_out_agg,
        'downgrade_share': downgrade_agg
    })

    # Normalize all flow shares by the baseline routine employment.
    components_df = components_df.divide(base_routine_employment, axis=0)

    # Add the baseline routine employment itself, which will be used as the weight in regressions.
    components_df['routine_fte_employment_base'] = base_routine_employment

    # --- 4. Internal Consistency Check ---

    # Calculate the total change in routine employment directly from the panel.
    total_routine_employment = analysis_panel[
        (analysis_panel['Routine_or_Abstract_Label'] == 'Routine') & (analysis_panel['is_native'])
    ].groupby(['Municipality_ID', 'snapshot_year'])['fte_weight'].sum().unstack().fillna(0)

    # Calculate the percentage change relative to the baseline.
    components_df['total_routine_emp_change'] = (
        (total_routine_employment[event_year] - total_routine_employment[base_year])
    ).divide(base_routine_employment)

    # Reconstruct the total change from the components based on the identity in Equation (8).
    components_df['identity_check'] = (
        -components_df['displacement_share'] +
        components_df['crowding_out_share'] -
        components_df['relocation_share'] -
        components_df['upgrade_share'] +
        components_df['downgrade_share']
    )

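    # Assert that the identity holds within a small tolerance for floating-point errors.
    # Note: the flow components do not capture FTE changes of workers who remain in a
    # routine job in the same municipality, so the identity is exact only if their
    # FTE weights are unchanged between the base year and the event year.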
    if not np.allclose(
        components_df['total_routine_emp_change'].fillna(0),
        components_df['identity_check'].fillna(0),
        atol=1e-8,
        equal_nan=True
    ):
        raise AssertionError(f"[{task_name}] Internal decomposition identity check failed.")

    print(f"[{task_name}] Routine employment decomposition components computed and verified.")

    # Return the final components, dropping the temporary check column and filling NaNs.
    return components_df.drop(columns=['identity_check']).fillna(0)
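
# ------------------------------------------------------------------------------
# Illustrative sketch (hypothetical, not part of the pipeline)
# ------------------------------------------------------------------------------
# A minimal example of the flow accounting behind the identity check, for one
# municipality ('A') and unit FTE weights in both years, so that workers who stay
# in a routine job contribute no change. The function name and the six-worker
# toy table are assumptions made purely for illustration.

def _demo_routine_flow_identity() -> dict:
    import pandas as pd

    # One row per worker: task label and municipality in the base and event year.
    toy = pd.DataFrame({
        'task_base':  ['Routine', 'Routine', 'Routine', 'Routine', 'Abstract', 'NonEmployed'],
        'task_event': ['Routine', 'NonEmployed', 'Abstract', 'Routine', 'Routine', 'Routine'],
        'muni_base':  ['A', 'A', 'A', 'A', 'A', None],
        'muni_event': ['A', None, 'A', 'B', 'A', 'A'],
    })

    r0 = toy['task_base'] == 'Routine'
    r1 = toy['task_event'] == 'Routine'
    in_a_base = toy['muni_base'] == 'A'
    in_a_event = toy['muni_event'] == 'A'
    same_muni = toy['muni_base'] == toy['muni_event']

    # Routine headcounts in municipality 'A' in each year.
    base_routine = int((r0 & in_a_base).sum())
    event_routine = int((r1 & in_a_event).sum())

    # Outflows from routine jobs in 'A' (attributed to the base-year municipality).
    displacement = int((r0 & in_a_base & (toy['task_event'] == 'NonEmployed')).sum())
    relocation = int((r0 & in_a_base & (toy['task_event'] != 'NonEmployed') & ~same_muni).sum())
    upgrade = int((r0 & in_a_base & (toy['task_event'] == 'Abstract') & same_muni).sum())

    # Inflows to routine jobs in 'A' (attributed to the event-year municipality).
    downgrade = int(((toy['task_base'] == 'Abstract') & r1 & in_a_event & same_muni).sum())
    inflow = int((r1 & in_a_event & ~(r0 & same_muni)).sum())
    crowding_out = inflow - downgrade

    # Both sides of the identity, normalized by baseline routine employment.
    total_change = (event_routine - base_routine) / base_routine
    identity = (-displacement + crowding_out - relocation - upgrade + downgrade) / base_routine
    return {'total_change': total_change, 'identity': identity}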


# --------------------------------------------------------------------------------
# Task 18, Helper for estimating change in abstract intensity for regional stayers
# --------------------------------------------------------------------------------

def _estimate_regional_stayer_intensity_effect(
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    event_year: int,
    config: Dict[str, Any],
    task_name: str = "Task 18, Step 3"
) -> Dict[str, Any]:
    """
    Estimates the effect on abstract intensity for regional stayers (FD-IV).

    This function provides a continuous measure of occupational upgrading. It
    focuses on "regional stayers" (workers employed in the same municipality
    in the base and event years, regardless of occupation) and estimates the
    causal effect of the immigration shock on the change in their occupation's
    abstract intensity score.

    The estimation uses a First-Difference Instrumental Variable (FD-IV) model,
    instrumenting the shock with distance variables. Inference is made robust
    through a full wild cluster bootstrap procedure.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        event_year (int): The specific year of the event study to analyze.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        Dict[str, Any]: A dictionary of estimation results, including point
                        estimate, standard error, p-value, F-statistic, and
                        bootstrap confidence interval. Returns an empty
                        dictionary if the analysis cannot be run.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")

    # --- 1. Identify Regional Stayers and Prepare Data ---

    # Extract the base year for comparison.
    base_year = config["temporal_parameters"]["BASE_YEAR"]

    # Filter panel to native workers in the two relevant years.
    stayers_panel = analysis_panel[
        analysis_panel['is_native'] &
        analysis_panel.index.get_level_values('snapshot_year').isin([base_year, event_year])
    ]

    # Pivot to wide format to identify regional stayers.
    # aggfunc='first' keeps 'Municipality_ID' intact even if it is stored as a
    # string, which the default mean aggregation cannot handle. This assumes at
    # most one row per worker-year.
    stayers_wide = stayers_panel.reset_index().pivot_table(
        index='Worker_ID',
        columns='snapshot_year',
        values=['Municipality_ID', 'Abstract_Intensity'],
        aggfunc='first'
    )
    stayers_wide.columns = [f'{val}_{year}' for val, year in stayers_wide.columns]

    # A regional stayer was employed in the same municipality in both years.
    # We also require non-missing intensity scores in both periods.
    stayers_df = stayers_wide[
        (stayers_wide[f'Municipality_ID_{base_year}'] == stayers_wide[f'Municipality_ID_{event_year}'])
    ].dropna(subset=[
        f'Municipality_ID_{base_year}',
        f'Abstract_Intensity_{base_year}',
        f'Abstract_Intensity_{event_year}'
    ])

    # Rename the municipality column for merging. A non-inplace rename returns a
    # new DataFrame and avoids modifying a selection of `stayers_wide`.
    stayers_df = stayers_df.rename(columns={f'Municipality_ID_{base_year}': 'Municipality_ID'})

    if stayers_df.empty:
        print(f"[{task_name}] No regional stayers with valid intensity scores found.")
        return {}

    # --- 2. Construct First-Differenced (FD) Outcome Variable ---

    # The dependent variable is the change in the abstract intensity score.
    # ΔAbstract_Intensity = Abstract_Intensity_event - Abstract_Intensity_base
    stayers_df['delta_abstract_intensity'] = stayers_df[f'Abstract_Intensity_{event_year}'] - stayers_df[f'Abstract_Intensity_{base_year}']

    # --- 3. Merge Municipality-Level IVs and Cluster IDs ---

    # Extract the relevant slice from the event study data.
    event_data_year = event_study_df.loc[(slice(None), event_year), :].reset_index()

    # Define the columns to merge: shock, instruments, and cluster ID.
    cluster_col = config["algorithm_config_parameters"]["CLUSTER_LEVEL"]
    iv_names = list(config["algorithm_config_parameters"]["INSTRUMENT_SPECIFICATION"].keys())
    cols_to_merge = ['Municipality_ID', 'shock'] + iv_names + [cluster_col]

    # Merge these variables onto the individual-level FD data.
    fd_data = stayers_df.merge(
        event_data_year[cols_to_merge], on='Municipality_ID', how='left'
    ).dropna()

    if fd_data.empty:
        print(f"[{task_name}] No valid observations remain after merging IVs. Cannot estimate.")
        return {}

    # --- 4. Estimate the FD-IV Model ---

    # Prepare data arrays for linearmodels. The only exogenous regressor is an intercept,
    # built directly so that it does not depend on the values of any other column.
    dependent = fd_data['delta_abstract_intensity']
    exog = pd.DataFrame({'const': 1.0}, index=fd_data.index)
    endog = fd_data[['shock']]
    instruments = fd_data[iv_names]
    clusters = fd_data[cluster_col]

    # Fit the 2SLS model with cluster-robust standard errors.
    model = IV2SLS(dependent, exog, endog, instruments).fit(cov_type='clustered', clusters=clusters)

    # --- 5. Implement Wild Cluster Bootstrap ---

    # Extract bootstrap parameters.
    n_reps = config["algorithm_config_parameters"]["BOOTSTRAP_REPLICATIONS"]
    rng = np.random.default_rng(config["algorithm_config_parameters"]["RANDOM_SEED"])

    # Get unique clusters and model residuals.
    unique_clusters = clusters.unique()
    fitted_values = model.predict()
    residuals = model.resids

    # Store bootstrap coefficients.
    bootstrap_coeffs = []

    # Main bootstrap loop.
    for _ in range(n_reps):
        # a. Draw one Rademacher weight (+1 or -1) per cluster.
        rademacher_weights = pd.Series(
            rng.choice([-1.0, 1.0], size=len(unique_clusters)), index=unique_clusters
        )

        # b. Map the cluster weights to observations and build the bootstrap outcome.
        #    Keeping the original index aligns it with exog, endog, and instruments.
        obs_weights = clusters.map(rademacher_weights).to_numpy()
        boot_dependent = pd.Series(
            fitted_values.to_numpy().ravel() + obs_weights * residuals.to_numpy(),
            index=fd_data.index,
            name='delta_abstract_intensity'
        )

        # c. Re-estimate the IV model.
        boot_model = IV2SLS(boot_dependent, exog, endog, instruments).fit()

        # d. Store the coefficient of interest.
        bootstrap_coeffs.append(boot_model.params['shock'])

    # --- 6. Compute Confidence Interval and Assemble Results ---

    alpha = 1 - config["algorithm_config_parameters"]["BOOTSTRAP_CONF_LEVEL"]
    ci_lower = np.percentile(bootstrap_coeffs, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_coeffs, 100 * (1 - alpha / 2))

    # Assemble the final results dictionary.
    results = {
        'point_estimate': model.params['shock'],
        'cluster_robust_se': model.std_errors['shock'],
        'p_value': model.pvalues['shock'],
        'f_statistic': model.first_stage.f_statistic.stat,
        'bootstrap_ci': (ci_lower, ci_upper),
        'n_obs': int(model.nobs)
    }

    # Print a summary of the final result.
    print(f"[{task_name}] Estimated abstract intensity effect for stayers: {results['point_estimate']:.4f}")

    # Return the structured dictionary of results.
    return results
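
# ------------------------------------------------------------------------------
# Illustrative sketch (hypothetical, not part of the pipeline)
# ------------------------------------------------------------------------------
# A minimal example of the cluster-level sign flips used in a wild cluster
# bootstrap: every observation in a cluster shares one Rademacher weight, so the
# within-cluster dependence of the residuals is preserved. The function name,
# the cluster labels, and the residual values are assumptions made purely for
# illustration.

def _demo_wild_cluster_residuals(seed: int = 0):
    import numpy as np
    import pandas as pd

    clusters = pd.Series(['c1', 'c1', 'c2', 'c2', 'c3'], name='cluster')
    residuals = np.array([0.5, -0.2, 0.1, 0.3, -0.4])

    # One weight in {-1, +1} per unique cluster.
    rng = np.random.default_rng(seed)
    unique_clusters = clusters.unique()
    weights = pd.Series(
        rng.choice([-1.0, 1.0], size=len(unique_clusters)), index=unique_clusters
    )

    # Broadcast the cluster weights to observations and flip the residual signs.
    v = clusters.map(weights).to_numpy()
    return pd.DataFrame({'cluster': clusters, 'v': v, 'boot_resid': v * residuals})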


# ------------------------------------------------------------------------------
# Task 18, Orchestrator Function
# ------------------------------------------------------------------------------

def decompose_routine_employment(
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int
) -> Dict[str, Any]:
    """
    Orchestrates the full decomposition of the routine employment effect.

    This function provides a comprehensive analysis of how regional routine
    employment adjusts to an immigration shock, based on the decomposition in
    Equation (8) of the paper. It performs three main operations:

    1.  **Component Construction**: It first calls a helper to compute the shares
        of all micro-level flows that constitute the change in routine employment.
        This includes displacement, crowding-out, relocation, and, critically,
        within-region occupational upgrading (Routine -> Abstract) and
        downgrading (Abstract -> Routine).

    2.  **Causal Estimation (Discrete Flows)**: It systematically estimates the
        causal effect of the immigration shock on each of these discrete flow
        components using a robust 2SLS procedure with wild cluster bootstrapping.
        These estimations are weighted by the baseline routine employment in each
        municipality.

    3.  **Causal Estimation (Continuous Upgrading)**: As a complementary test, it
        estimates the effect of the shock on the change in the continuous
        'Abstract_Intensity' measure for regional stayers, using an individual-level
        FD-IV model.

    Finally, it performs a rigorous identity check to ensure that the sum of the
    estimated effects on the components (with appropriate signs) equals the
    estimated total effect on routine employment.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel from Task 5.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to analyze.

    Returns:
        Dict[str, Any]: A nested dictionary containing the detailed estimation
                        results for the total routine effect, each of the five
                        decomposition components, and the abstract intensity change.

    Raises:
        TypeError: If input arguments are not of the expected type.
        AssertionError: If the final decomposition identity check on the
                        estimated coefficients fails, indicating a logical or
                        sample inconsistency in the analysis.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")
    if not isinstance(master_config, dict):
        raise TypeError("`master_config` must be a dictionary.")

    # --- Step 1: Compute all flow components based on Equation (8) ---
    # This helper function performs the complex data wrangling to create the outcome variables.
    components_df = _compute_routine_employment_decomposition(analysis_panel, event_year, master_config)

    # --- Step 2: Prepare a consistent analysis dataset for the event year ---
    # Merge the newly computed component shares into the main event study data.
    analysis_data_year = event_study_df.loc[(slice(None), event_year), :].copy()
    analysis_data_year = analysis_data_year.merge(
        components_df, on='Municipality_ID', how='left'
    ).fillna(0)

    # --- Step 3: Estimate 2SLS for each discrete flow component ---

    # Initialize a dictionary to store the results of each estimation.
    results: Dict[str, Any] = {}

    # Define the mapping from the result key to the outcome column name in the DataFrame.
    outcomes_to_estimate = {
        'total_routine_effect': 'total_routine_emp_change',
        'displacement': 'displacement_share',
        'crowding_out': 'crowding_out_share',
        'relocation': 'relocation_share',
        'upgrading': 'upgrade_share',
        'downgrading': 'downgrade_share'
    }

    # The weight for all these regressions is the baseline ROUTINE employment.
    weight_col = 'routine_fte_employment_base'

    # Create a temporary copy of the config to modify the weight column for the estimator.
    # This is a clean way to parameterize the reusable estimation function.
    temp_config = master_config.copy()
    temp_config['algorithm_config_parameters'] = master_config['algorithm_config_parameters'].copy()
    temp_config['algorithm_config_parameters']['WEIGHT_COLUMN_MUNICIPALITY'] = weight_col

    # Loop through each outcome and run the full 2SLS estimation.
    for name, outcome_col in outcomes_to_estimate.items():
        # Define the analysis sample for this specific regression.
        # We only use municipalities that had routine workers at baseline.
        temp_df = analysis_data_year[analysis_data_year[weight_col] > 0].copy()

        # Call the master 2SLS estimation function from Task 10.
        results[name] = estimate_regional_effect_2sls(
            event_study_df=temp_df.set_index('Municipality_ID', append=True).swaplevel(0,1),
            master_config=temp_config,
            outcome_variable=outcome_col,
            event_year=event_year
        )

    # --- Step 4: Estimate change in abstract intensity for regional stayers ---

    # This provides a continuous measure of upgrading.
    results['abstract_intensity'] = _estimate_regional_stayer_intensity_effect(
        analysis_panel=analysis_panel,
        event_study_df=event_study_df,
        event_year=event_year,
        config=master_config
    )

    # --- Step 5: Final Identity Check on Estimated Coefficients ---
    # This is a critical validation step to ensure the entire decomposition is consistent.
    # Equation (8): ΔE^R ≈ -Disp + Crowd-Out - Reloc - Upgrade + Downgrade

    # Extract the point estimate for the total effect on routine employment.
    total_effect = results['total_routine_effect']['point_estimate']

    # Reconstruct the total effect from the sum of the component effects.
    sum_of_components = (
        -results['displacement']['point_estimate'] +
        results['crowding_out']['point_estimate'] -
        results['relocation']['point_estimate'] -
        results['upgrading']['point_estimate'] +
        results['downgrading']['point_estimate']
    )

    # Assert that the directly estimated total effect is numerically close to the reconstructed sum.
    if not np.isclose(total_effect, sum_of_components, atol=1e-6):
        # If this fails, it indicates a serious logical error in the component construction or sample definition.
        raise AssertionError(
            "Routine decomposition coefficient identity FAILED! "
            f"Directly Estimated Total Effect = {total_effect:.6f}, but "
            f"Sum of Component Effects = {sum_of_components:.6f}."
        )
    else:
        # Log the successful validation.
        print("\nRoutine decomposition coefficient identity PASSED.")

    # Return the comprehensive dictionary of results.
    return results


In [None]:
# Task 19: Apprenticeship uptake analysis (Figure 3)

# ==============================================================================
# Task 19: Apprenticeship uptake analysis (Figure 3)
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 19, Steps 1 & 2: Helper to Compute Apprenticeship Outcomes
# ------------------------------------------------------------------------------

def _compute_apprenticeship_outcomes(
    analysis_panel_full: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 19, Steps 1 & 2"
) -> pd.DataFrame:
    """
    Computes the change in native apprenticeship employment relative to baseline.

    This function first aggregates the FTE count of native apprentices for each
    municipality-year from the full, unfiltered worker panel. It then calculates
    the percentage change of this count for each year relative to the 1990 baseline.

    Args:
        analysis_panel_full (pd.DataFrame): The complete, unfiltered worker-year
            panel from Task 3, which includes apprentice flags.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        pd.DataFrame: A DataFrame indexed by ('Municipality_ID', 'snapshot_year')
                      containing the 'apprenticeship_outcome' variable.
    """
    # --- 1. Compute Municipality-Year Apprentice Counts ---

    # Filter the full panel to native apprentices. The 'is_apprentice' flag
    # was created on the unfiltered panel in Task 3, which is critical here.
    apprentice_panel = analysis_panel_full[
        analysis_panel_full['is_native'] & analysis_panel_full['is_apprentice']
    ].copy()

    # Group by municipality and year, and sum the FTE weights.
    apprentice_counts = apprentice_panel.groupby(
        ['Municipality_ID', 'snapshot_year']
    )['fte_weight'].sum().rename('apprenticeship_fte')

    # --- 2. Compute Percent Changes Relative to Base Year ---

    # Create a balanced panel of apprentice counts by reindexing.
    municipality_ids = analysis_panel_full['Municipality_ID'].dropna().unique()
    snapshot_years = analysis_panel_full.index.get_level_values('snapshot_year').unique()
    full_idx = pd.MultiIndex.from_product(
        [municipality_ids, snapshot_years],
        names=['Municipality_ID', 'snapshot_year']
    )
    apprentice_counts = apprentice_counts.reindex(full_idx, fill_value=0)

    # Extract the baseline apprentice counts per municipality. `xs` selects the
    # base year and drops the 'snapshot_year' level in one step.
    base_year = config["temporal_parameters"]["BASE_YEAR"]
    baseline_counts = apprentice_counts.xs(
        base_year, level='snapshot_year'
    ).rename('apprenticeship_fte_base')

    # Attach the baseline counts to every municipality-year. `join` keeps the
    # full (Municipality_ID, snapshot_year) index that the base-year assignment
    # below relies on.
    outcomes_df = apprentice_counts.to_frame().join(
        baseline_counts, on='Municipality_ID', how='left'
    ).fillna(0)

    # Calculate the outcome: (Apprenticeships_rt - Apprenticeships_r,1990) / Apprenticeships_r,1990
    with np.errstate(divide='ignore', invalid='ignore'):
        outcomes_df['apprenticeship_outcome'] = np.divide(
            outcomes_df['apprenticeship_fte'] - outcomes_df['apprenticeship_fte_base'],
            outcomes_df['apprenticeship_fte_base']
        )

    # Clean up inf/-inf values and set the base year outcome to exactly 0.
    outcomes_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    outcomes_df.loc[(slice(None), base_year), 'apprenticeship_outcome'] = 0.0

    print(f"[{task_name}] Apprenticeship outcomes computed successfully.")

    return outcomes_df[['apprenticeship_outcome']]
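
The baseline-relative outcome built by the helper above can be illustrated on a toy panel. This is a minimal sketch rather than the pipeline's own code: the municipality IDs, years, and FTE values are made up, and masking a zero baseline to NaN before dividing is an assumed shortcut that gives the same result as the inf clean-up in the helper.

```python
import numpy as np
import pandas as pd

# Toy apprentice FTE counts (hypothetical). Municipality "B" has no
# apprentices in 1990 or 1992, so those cells are missing before balancing.
counts = pd.Series(
    {("A", 1990): 10.0, ("A", 1991): 12.0, ("A", 1992): 9.0, ("B", 1991): 4.0},
    name="apprenticeship_fte",
)
counts.index.names = ["Municipality_ID", "snapshot_year"]

base_year = 1990

# Balance the panel: every municipality gets a row for every year, filled with 0.
full_idx = pd.MultiIndex.from_product(
    [["A", "B"], [1990, 1991, 1992]],
    names=["Municipality_ID", "snapshot_year"],
)
balanced = counts.reindex(full_idx, fill_value=0.0).to_frame()

# Baseline level per municipality, broadcast back onto every year.
baseline = balanced["apprenticeship_fte"].xs(base_year, level="snapshot_year")
balanced["apprenticeship_fte_base"] = (
    balanced.index.get_level_values("Municipality_ID").map(baseline).to_numpy()
)

# Percent change relative to the baseline. A zero baseline has no defined
# percent change, so it is masked to NaN before dividing.
base = balanced["apprenticeship_fte_base"]
balanced["apprenticeship_outcome"] = (
    (balanced["apprenticeship_fte"] - base) / base.where(base != 0)
)

# By construction the outcome is exactly zero in the base year.
balanced.loc[pd.IndexSlice[:, base_year], "apprenticeship_outcome"] = 0.0

print(balanced)
```

In this toy panel municipality "B" keeps a zero outcome in the base year but has NaN elsewhere, so it would drop out of any regression that requires a non-missing outcome.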

# ------------------------------------------------------------------------------
# Task 19, Orchestrator Function
# ------------------------------------------------------------------------------

def analyze_apprenticeship_uptake(
    analysis_panel_full: pd.DataFrame,
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """
    Orchestrates the apprenticeship uptake analysis for all event-study years.

    This function estimates the causal effect of the immigration shock on the
    change in native apprenticeship employment for each pre- and post-treatment
    year. The results from this function can be used directly to plot Figure 3
    from the paper.

    Args:
        analysis_panel_full (pd.DataFrame): The complete, unfiltered worker-year
            panel from Task 3, which is necessary to correctly identify apprentices.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.

    Returns:
        List[Dict[str, Any]]: A list of result dictionaries, with each dictionary
                              containing the full 2SLS estimation output for one
                              event year.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel_full, pd.DataFrame):
        raise TypeError("`analysis_panel_full` must be a pandas DataFrame.")
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")

    # --- Step 1 & 2: Compute the apprenticeship outcome variable ---
    apprenticeship_outcomes = _compute_apprenticeship_outcomes(analysis_panel_full, master_config)

    # --- Step 3: Prepare data and run 2SLS for each event year ---

    # Merge the new outcome into the main event study dataset.
    analysis_data = event_study_df.merge(
        apprenticeship_outcomes, on=['Municipality_ID', 'snapshot_year'], how='left'
    )

    # Define the full range of years for the event-study plot.
    event_years = master_config["temporal_parameters"]["PRE_TREATMENT_YEARS_FOR_TESTING"] + \
                  master_config["temporal_parameters"]["POST_TREATMENT_YEARS"]

    # Initialize a list to store the results for each year.
    all_results = []

    # Loop through each event year and run the estimation.
    for year in sorted(event_years):
        print(f"\n--- Estimating Apprenticeship Effect for Year: {year} ---")

        # Call the master 2SLS estimation function from Task 10.
        # The standard weight (total native baseline employment) is used.
        year_results = estimate_regional_effect_2sls(
            event_study_df=analysis_data,
            master_config=master_config,
            outcome_variable='apprenticeship_outcome',
            event_year=year
        )

        # Store the results for this year.
        if year_results:
            all_results.append(year_results)

    print("\nTask 19: Analysis of apprenticeship uptake completed for all event years.")

    return all_results


# Task 20: Structural parameter recovery (η̄^P, η̄^E, φ, c; Equations (1a)–(1c))

# ==============================================================================
# Task 20: Structural parameter recovery
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 20, Step 1: Helper to Compute Efficiency-to-Headcount Scaling 'c'
# ------------------------------------------------------------------------------

def _compute_efficiency_scaling_c(
    analysis_panel: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 20, Step 1"
) -> float:
    """
    Computes the efficiency-to-headcount scaling parameter 'c'.

    'c' is the ratio of the immigration shock measured in efficiency units (wage
    bill share) to the shock measured in headcounts (FTE share).

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        config (Dict[str, Any]): The master configuration dictionary.

    Returns:
        float: The estimated value of the scaling parameter 'c'.
    """
    # Define the start and end years for the main shock calculation.
    shock_years = config["temporal_parameters"]["EVENT_STUDY_SHOCK_RULES"]["1992_to_1995"]
    start_year, end_year = shock_years['start_year'], shock_years['end_year']

    # Create a 'wage_bill' proxy for each employed worker.
    # Assumption: fte_weight is a valid proxy for the share of the year worked.
    panel = analysis_panel[analysis_panel['is_employed']].copy()
    panel['wage_bill'] = panel['Daily_Wage_EUR'] * panel['fte_weight']

    # Build the year and nationality masks once. Combining them into a single
    # boolean indexer avoids chained indexing with a mask whose index differs
    # from the already-filtered Series.
    years = panel.index.get_level_values('snapshot_year')
    in_start, in_end = years == start_year, years == end_year
    czech_start = panel['is_czech'] & in_start
    czech_end = panel['is_czech'] & in_end

    # Base-period totals are the denominators of both shock measures.
    wage_bill_total_base = panel.loc[in_start, 'wage_bill'].sum()
    fte_total_base = panel.loc[in_start, 'fte_weight'].sum()
    if np.isclose(wage_bill_total_base, 0) or np.isclose(fte_total_base, 0):
        raise ValueError(f"[{task_name}] Wage bill or FTE total in {start_year} is zero, cannot compute 'c'.")

    # --- Numerator: Efficiency Shock (based on wage bill) ---
    wage_bill_czech_start = panel.loc[czech_start, 'wage_bill'].sum()
    wage_bill_czech_end = panel.loc[czech_end, 'wage_bill'].sum()
    efficiency_shock = (wage_bill_czech_end - wage_bill_czech_start) / wage_bill_total_base

    # --- Denominator: Headcount Shock (based on FTE) ---
    fte_czech_start = panel.loc[czech_start, 'fte_weight'].sum()
    fte_czech_end = panel.loc[czech_end, 'fte_weight'].sum()
    headcount_shock = (fte_czech_end - fte_czech_start) / fte_total_base

    # Validate the denominator to prevent division by zero.
    if np.isclose(headcount_shock, 0):
        raise ValueError(f"[{task_name}] Headcount shock is zero, cannot compute 'c'.")

    # c = (Efficiency Shock) / (Headcount Shock)
    c_val = efficiency_shock / headcount_shock

    print(f"[{task_name}] Efficiency scaling parameter 'c' computed: {c_val:.4f}")
    return c_val
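
A small worked example may help with reading 'c'. The sketch below uses made-up aggregate numbers and a hypothetical helper name (`scaling_factor_c`); it only reproduces the ratio of the two shocks defined above and is not part of the pipeline.

```python
import math


def scaling_factor_c(
    czech_wage_bill_change: float,
    total_wage_bill_base: float,
    czech_fte_change: float,
    total_fte_base: float,
) -> float:
    """Ratio of the wage-bill-share shock to the FTE-share shock (illustrative)."""
    efficiency_shock = czech_wage_bill_change / total_wage_bill_base
    headcount_shock = czech_fte_change / total_fte_base
    if math.isclose(headcount_shock, 0.0, abs_tol=1e-12):
        raise ValueError("Headcount shock is zero, cannot compute 'c'.")
    return efficiency_shock / headcount_shock


# Hypothetical region: Czech workers add 5 FTE to a base of 100 FTE, and
# 40 wage-bill units to a base wage bill of 1000.
c_toy = scaling_factor_c(40.0, 1000.0, 5.0, 100.0)

# Equivalent reading: average wage of the added Czech FTE (40 / 5) relative to
# the base-period average wage (1000 / 100).
relative_wage = (40.0 / 5.0) / (1000.0 / 100.0)

print(f"c = {c_toy:.4f}, relative wage = {relative_wage:.4f}")
```

Under these assumed numbers a value of 'c' below one means the incoming workers earn less than the base-period average per FTE, so the shock is smaller in efficiency units than in headcounts.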

# ------------------------------------------------------------------------------
# Task 20, Step 2: Helper to Recover Labor Supply Elasticities
# ------------------------------------------------------------------------------

def _recover_supply_elasticities(
    beta_R: float,
    gamma_R: float,
    gamma_W: float,
    task_name: str = "Task 20, Step 2"
) -> Tuple[float, float]:
    """
    Recovers the population- and efficiency-weighted labor supply elasticities.

    Args:
        beta_R (float): The estimated regional employment effect (β^R).
        gamma_R (float): The estimated regional wage effect (γ^R).
        gamma_W (float): The estimated pure wage effect (γ^W).

    Returns:
        Tuple[float, float]: A tuple containing (eta_P, eta_E).
    """
    # Ensure the pure wage effect is not zero to avoid division errors.
    if np.isclose(gamma_W, 0):
        raise ValueError("Pure wage effect (gamma_W) is zero, cannot recover elasticities.")

    # Recover population-weighted elasticity (η̄^P) from the ratio of employment to pure wage effects.
    # Formula: η̄^P = β^R / γ^W
    eta_P = beta_R / gamma_W

    # Recover efficiency-weighted elasticity (η̄^E) by rearranging Eq. (1c).
    # Formula: η̄^E = (γ^R / γ^W) - 1 + η̄^P
    eta_E = (gamma_R / gamma_W) - 1 + eta_P

    print(f"[{task_name}] Recovered supply elasticities: η̄^P = {eta_P:.4f}, η̄^E = {eta_E:.4f}")
    return eta_P, eta_E

# ------------------------------------------------------------------------------
# Task 20, Step 3: Helper to Recover Inverse Demand Elasticity
# ------------------------------------------------------------------------------

def _recover_demand_elasticity(
    gamma_W: float,
    beta_R: float,
    eta_P: float,
    eta_E: float,
    c_val: float,
    task_name: str = "Task 20, Step 3"
) -> float:
    """
    Recovers the inverse labor demand elasticity (φ).

    Args:
        gamma_W (float): The estimated pure wage effect (γ^W).
        beta_R (float): The estimated regional employment effect (β^R).
        eta_P (float): The recovered population-weighted supply elasticity.
        eta_E (float): The recovered efficiency-weighted supply elasticity.
        c_val (float): The efficiency-to-headcount scaling parameter.

    Returns:
        float: The estimated inverse labor demand elasticity (φ).
    """
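    # The ratio η̄^E / η̄^P is undefined when η̄^P is zero, so guard it first.
    if np.isclose(eta_P, 0):
        raise ValueError(f"[{task_name}] η̄^P is zero, cannot form the ratio η̄^E / η̄^P.")

    # The denominator of the formula for φ.
    denominator = c_val + (eta_E / eta_P) * beta_R

    if np.isclose(denominator, 0):
        raise ValueError("Denominator for φ calculation is zero, cannot recover demand elasticity.")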

    # Recover the inverse demand elasticity (φ).
    # Formula: φ = γ^W / (c + (η̄^E / η̄^P) * β^R)
    phi = gamma_W / denominator

    print(f"[{task_name}] Recovered inverse demand elasticity φ = {phi:.4f}")
    return phi
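
The three recovery formulas used in Steps 2 and 3 can be traced with a short numerical sketch. The coefficient values below are purely hypothetical (they are not estimates from the paper), and `recover_parameters` is an illustrative name that simply chains the formulas as they are coded in the helpers above.

```python
import math
from typing import Tuple


def recover_parameters(
    beta_R: float, gamma_R: float, gamma_W: float, c: float
) -> Tuple[float, float, float]:
    """Chain the formulas from the helpers above (illustrative only)."""
    eta_P = beta_R / gamma_W                         # population-weighted elasticity
    eta_E = gamma_R / gamma_W - 1 + eta_P            # efficiency-weighted elasticity
    phi = gamma_W / (c + (eta_E / eta_P) * beta_R)   # inverse demand elasticity
    return eta_P, eta_E, phi


# Hypothetical reduced-form inputs, chosen only to make the arithmetic easy.
beta_R, gamma_R, gamma_W, c = -0.3, -0.1, -0.2, 0.8
eta_P, eta_E, phi = recover_parameters(beta_R, gamma_R, gamma_W, c)

# Since eta_P = beta_R / gamma_W, the term (eta_E / eta_P) * beta_R equals
# eta_E * gamma_W, so the denominator reduces to c + gamma_R - gamma_W + beta_R.
simplified_denominator = c + gamma_R - gamma_W + beta_R
phi_check = gamma_W / simplified_denominator

print(f"eta_P = {eta_P:.4f}, eta_E = {eta_E:.4f}, phi = {phi:.4f}")
print(f"phi via simplified denominator = {phi_check:.4f}")
```

The simplified denominator is a convenient cross-check on the coded formula: both routes should agree up to floating-point error for any inputs where the ratio η̄^E / η̄^P is defined.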

# ------------------------------------------------------------------------------
# Task 20, Orchestrator Function
# ------------------------------------------------------------------------------

def recover_structural_parameters(
    reduced_form_estimates: Dict[str, float],
    analysis_panel: pd.DataFrame,
    master_config: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Orchestrates the recovery of structural labor market parameters.

    This function takes the key reduced-form coefficients from the 2SLS
    estimations and uses the paper's theoretical model (Equations 1a-1c)
    to solve for the underlying structural parameters: the labor supply
    elasticities (η̄^P, η̄^E) and the inverse labor demand elasticity (φ).

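    It also runs a comparison exercise: the parameters are re-derived using the
    regional wage effect (γ^R), which is confounded by composition effects, in
    place of the pure wage effect (γ^W), so that the sensitivity of the
    structural estimates to this choice can be inspected.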

    Args:
        reduced_form_estimates (Dict[str, float]): A dictionary with the point
            estimates for 'beta_R', 'gamma_R', and 'gamma_W'.
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        master_config (Dict[str, Any]): The master configuration dictionary.

    Returns:
        Dict[str, Any]: A dictionary containing all recovered structural
                        parameters and the results of the validation exercise.
    """
    # --- Input Validation ---
    required_keys = {'beta_R', 'gamma_R', 'gamma_W'}
    if not required_keys.issubset(reduced_form_estimates.keys()):
        raise KeyError(f"Input `reduced_form_estimates` is missing required keys: {required_keys - set(reduced_form_estimates.keys())}")

    # Extract reduced-form coefficients.
    beta_R = reduced_form_estimates['beta_R']
    gamma_R = reduced_form_estimates['gamma_R']
    gamma_W = reduced_form_estimates['gamma_W']

    # Step 1: Compute the efficiency-to-headcount scaling parameter 'c'.
    c_val = _compute_efficiency_scaling_c(analysis_panel, master_config)

    # Step 2: Recover the population- and efficiency-weighted labor supply elasticities.
    eta_P, eta_E = _recover_supply_elasticities(beta_R, gamma_R, gamma_W)

    # Step 3: Recover the inverse labor demand elasticity.
    phi = _recover_demand_elasticity(gamma_W, beta_R, eta_P, eta_E, c_val)
    demand_elasticity = 1 / phi if not np.isclose(phi, 0) else np.inf

    # --- Validation / Sanity Check ---
    # Re-calculate parameters under the naive assumption that γ^R = γ^W.
    # This implies η̄^E = η̄^P, as the composition effect is assumed to be zero.
    print("\n--- Performing validation using naive (biased) regional wage effect ---")
    eta_P_naive = beta_R / gamma_R if not np.isclose(gamma_R, 0) else np.inf
    eta_E_naive = eta_P_naive # By assumption
    phi_naive = _recover_demand_elasticity(gamma_R, beta_R, eta_P_naive, eta_E_naive, c_val, task_name="Validation")
    demand_elasticity_naive = 1 / phi_naive if not np.isclose(phi_naive, 0) else np.inf

    # --- Assemble Final Results ---
    results = {
        'structural_parameters': {
            'c_scaling_factor': c_val,
            'eta_P_population_supply_elasticity': eta_P,
            'eta_E_efficiency_supply_elasticity': eta_E,
            'phi_inverse_demand_elasticity': phi,
            'demand_elasticity': demand_elasticity
        },
        'validation_with_regional_effect': {
            'eta_P_naive': eta_P_naive,
            'phi_naive': phi_naive,
            'demand_elasticity_naive': demand_elasticity_naive
        }
    }

    print("\n" + "="*80)
    print("Structural Parameter Recovery Summary")
    print(f"  Inverse Demand Elasticity (φ): {phi:.4f}")
    print(f"  Implied Demand Elasticity (1/φ): {demand_elasticity:.4f}")
    print("-" * 40)
    print(f"  Naive (Biased) Inverse Demand Elasticity (φ_naive): {phi_naive:.4f}")
    print(f"  Naive Implied Demand Elasticity: {demand_elasticity_naive:.4f}")
    print("="*80)

    print("\nTask 20: Structural parameter recovery completed successfully.")
    return results


# Task 21: Build orchestrator function for end-to-end pipeline execution

# ==============================================================================
# Task 21: Build orchestrator function for end-to-end pipeline execution
# ==============================================================================

def run_full_analysis_pipeline(
    raw_data_path: str,
    master_config: Dict[str, Any],
    main_result_years: List[int] = [1993, 1995],
    cache_dir: str = "./.cache/",
    force_rerun_prep: bool = False
) -> Dict[str, Any]:
    """
    Orchestrates the entire end-to-end research pipeline with caching.

    This master function executes the full sequence of tasks required to replicate
    the paper's findings, from raw data validation to the final estimation of
    structural parameters. It passes data artifacts between tasks and caches the
    output of each data preparation step as a pickle file in `cache_dir`. Cached
    results are reused in order; once any step has to be re-run, every later
    preparation step is re-run as well.

    Args:
        raw_data_path (str): The file path to the raw, consolidated spell-level
                             DataFrame (e.g., a Parquet file).
        master_config (Dict[str, Any]): The master configuration dictionary that
                                        governs the entire analysis.
        main_result_years (List[int]): A list of the primary event years for which
                                       to generate the main decomposition tables.
        cache_dir (str): A directory path for storing and retrieving intermediate
                         data artifacts to speed up subsequent runs.
        force_rerun_prep (bool): If True, all data preparation steps (Tasks 3-9)
                                 will be re-computed, ignoring any existing cache.

    Returns:
        Dict[str, Any]: A nested dictionary containing the final results from all
                        estimation tasks, organized by analysis type and year.
    """
    # --- 1. Setup: Caching, Logging, and Initial Data Loading ---

    # Create the cache directory if it doesn't exist. This is where intermediate
    # data preparation artifacts will be stored.
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    # Start a timer for the entire pipeline and log the start message.
    start_time = time.time()
    print("="*80 + "\nSTARTING END-TO-END ANALYSIS PIPELINE\n" + "="*80)

    # Define a simple helper function for logging task completion and duration.
    def log_task_completion(task_name: str, task_start_time: float):
        duration = time.time() - task_start_time
        print(f"--- Task '{task_name}' completed in {duration:.2f} seconds. ---\n")

    # Load the raw data from the specified path. Assuming Parquet for efficiency.
    print(f"Loading raw data from: {raw_data_path}...")
    consolidated_df_raw = pd.read_parquet(raw_data_path)

    # --- 2. Validation Phase (Always runs, no caching) ---

    # Start the timer for the validation phase.
    task_start = time.time()
    # Task 1: Validate the raw data schema, dtypes, and logical integrity.
    validate_consolidated_df_raw(consolidated_df_raw, master_config)
    log_task_completion("1: Validate Raw Data", task_start)

    task_start = time.time()
    # Task 2: Validate the master config and load/validate all auxiliary data artifacts.
    validated_artifacts = validate_artifacts_and_config(master_config, consolidated_df_raw)
    log_task_completion("2: Validate Config & Artifacts", task_start)

    # --- 3. Data Preparation Phase (with Caching) ---

    # This dictionary will hold all major data artifacts as they are created.
    artifacts = validated_artifacts
    # This flag controls the cache waterfall; if one step is re-run, all subsequent steps must also be re-run.
    is_cache_valid = not force_rerun_prep

    # Define the full data preparation pipeline as a list of tasks.
    # Each tuple contains: (name, function, input_map, output_keys, cache_filename)
    prep_pipeline = [
        ("3: Cleanse & Canonicalize", cleanse_and_canonicalize_spells,
         {'consolidated_df_raw': consolidated_df_raw},
         ['df_normalized', 'panel_full_with_flags', 'panel_main_analysis'], 'task3.pkl'),
        ("4: Impute Wages", impute_censored_wages,
         {'worker_year_panel': 'panel_main_analysis'},
         'panel_main_with_wages', 'task4.pkl'),
        ("5: Build Analysis Panel", build_analysis_panel,
         {'analysis_panel_employed': 'panel_main_with_wages', 'all_spells_cleansed': 'df_normalized'},
         'analysis_panel', 'task5.pkl'),
        ("6: Aggregate Regional Panel", aggregate_to_regional_panel,
         {'analysis_panel': 'analysis_panel'},
         ['regional_panel', 'national_wage_series'], 'task6.pkl'),
        ("7: Construct Shock", construct_immigration_shock,
         {'analysis_panel': 'analysis_panel', 'regional_panel': 'regional_panel'},
         'shock_df', 'task7.pkl'),
        ("8: Construct Instruments", construct_instrumental_variables,
         {'analysis_panel': 'analysis_panel'},
         'instruments_df', 'task8.pkl'),
        ("9: Prepare Event Study DF", prepare_event_study_dataset,
         {'regional_panel': 'regional_panel', 'shock_df': 'shock_df',
          'instruments_df': 'instruments_df', 'analysis_panel': 'analysis_panel'},
         'event_study_df', 'task9.pkl')
    ]

    # Execute the preparation pipeline.
    for name, func, kwargs_map, out_keys, cache_file in prep_pipeline:
        task_start = time.time()
        cache_filepath = cache_path / cache_file

        # Check if a valid cached result exists.
        if is_cache_valid and cache_filepath.exists():
            print(f"Loading cached result for Task '{name}'...")
            with open(cache_filepath, 'rb') as f:
                result = pickle.load(f)
        else:
            print(f"Running Task '{name}'...")
            # If we run any task, all subsequent tasks must also be run.
            is_cache_valid = False
            # Resolve the function's arguments from the artifacts dictionary.
            kwargs = {
                k: artifacts.get(v, v) if isinstance(v, str) else v
                for k, v in kwargs_map.items()
            }
            kwargs['master_config'] = master_config
            # Execute the task function.
            result = func(**kwargs)
            # Save the result to the cache.
            with open(cache_filepath, 'wb') as f:
                pickle.dump(result, f)

        # Unpack the results into the artifacts dictionary for downstream use.
        if isinstance(out_keys, list):
            for i, key in enumerate(out_keys):
                artifacts[key] = result[i]
        else:
            artifacts[out_keys] = result
        log_task_completion(name, task_start)

    # --- 4. Estimation Phase (No Caching) ---

    # Initialize the final results dictionary.
    final_results: Dict[str, Any] = {'main_tables': {}, 'event_studies': {}}

    # Run main decomposition and heterogeneity analyses for the specified years.
    for year in main_result_years:
        year_results: Dict[str, Any] = {}
        print(f"\n{'='*30} RUNNING ESTIMATIONS FOR YEAR: {year} {'='*30}\n")

        # Task 10 (as part of Task 11) & 11: Decompose regional employment.
        year_results['employment_decomposition'] = decompose_regional_employment_effect(
            artifacts['analysis_panel'], artifacts['regional_panel'], artifacts['event_study_df'], master_config, year
        )
        # Task 12 & 13: Estimate and decompose wage effects.
        wage_effects = estimate_wage_effects(
            artifacts['event_study_df'], artifacts['analysis_panel'], master_config, year
        )
        year_results['wage_decomposition'] = decompose_regional_wage_effect(
            artifacts['analysis_panel'], artifacts['event_study_df'], wage_effects, master_config, year
        )
        # Task 14: Run selection bounding on the pure wage effect.
        year_results['selection_bounds'] = bound_pure_wage_effect_selection(
            artifacts['analysis_panel'], artifacts['event_study_df'], wage_effects['pure_wage_effect'], master_config, year
        )
        # Task 15: Analyze non-employed entrants.
        year_results['non_employed_entrants'] = analyze_non_employed_entrants(
            artifacts['analysis_panel'], artifacts['regional_panel'], artifacts['national_wage_series'], artifacts['event_study_df'], master_config, year
        )
        # Task 16: Analyze older workers.
        year_results['older_workers'] = analyze_older_workers(
            artifacts['analysis_panel'], artifacts['regional_panel'], artifacts['event_study_df'], master_config, year
        )
        # Task 17 & 18: Analyze task heterogeneity and decompose routine employment.
        year_results['task_heterogeneity'] = analyze_task_heterogeneity(
            artifacts['analysis_panel'], artifacts['event_study_df'], master_config, year
        )
        year_results['routine_decomposition'] = decompose_routine_employment(
            artifacts['analysis_panel'], artifacts['event_study_df'], master_config, year
        )
        # Store all results for this year.
        final_results['main_tables'][year] = year_results

    # Run event study analyses that loop over all years.
    final_results['event_studies']['apprenticeship_uptake'] = analyze_apprenticeship_uptake(
        artifacts['panel_full_with_flags'], artifacts['event_study_df'], master_config
    )

    # Run final structural parameter recovery using results from the first main year.
    main_year = main_result_years[0]
    final_results['structural_parameters'] = recover_structural_parameters(
        reduced_form_estimates={
            'beta_R': final_results['main_tables'][main_year]['employment_decomposition']['total_effect']['point_estimate'],
            'gamma_R': final_results['main_tables'][main_year]['wage_decomposition']['regional_wage_effect']['point_estimate'],
            'gamma_W': final_results['main_tables'][main_year]['wage_decomposition']['pure_wage_effect']['point_estimate']
        },
        analysis_panel=artifacts['analysis_panel'],
        master_config=master_config
    )

    # --- 5. Finalization ---

    # Calculate and log the total runtime of the pipeline.
    total_duration = time.time() - start_time
    print("="*80 + f"\nENTIRE PIPELINE COMPLETED in {total_duration / 60:.2f} minutes.\n" + "="*80)

    # Return the comprehensive dictionary of all results.
    return final_results
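
The cache waterfall used in the data preparation phase (reuse cached results in order until one step has to run, then re-run everything after it) can be sketched in isolation. The step names, the toy functions, and the helper `run_with_cache_waterfall` are all hypothetical; the sketch only mirrors the control flow of the loop above.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[..., Any], List[str]]


def run_with_cache_waterfall(
    steps: List[Step], cache_dir: Path, force_rerun: bool = False
) -> Tuple[Dict[str, Any], List[str]]:
    """Run steps in order, reusing pickled results until one step must run."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    artifacts: Dict[str, Any] = {}
    executed: List[str] = []
    cache_valid = not force_rerun
    for name, func, input_keys in steps:
        cache_file = cache_dir / f"{name}.pkl"
        if cache_valid and cache_file.exists():
            with open(cache_file, "rb") as fh:
                result = pickle.load(fh)
        else:
            # Once one step runs, later caches may be stale, so stop trusting them.
            cache_valid = False
            result = func(*(artifacts[key] for key in input_keys))
            with open(cache_file, "wb") as fh:
                pickle.dump(result, fh)
            executed.append(name)
        artifacts[name] = result
    return artifacts, executed


# Three toy steps, each consuming the previous step's output.
steps: List[Step] = [
    ("raw", lambda: [1, 2, 3], []),
    ("doubled", lambda raw: [2 * x for x in raw], ["raw"]),
    ("total", lambda doubled: sum(doubled), ["doubled"]),
]

tmp_dir = Path(tempfile.mkdtemp())
_, first_run = run_with_cache_waterfall(steps, tmp_dir)    # nothing cached yet
_, second_run = run_with_cache_waterfall(steps, tmp_dir)   # everything cached
(tmp_dir / "doubled.pkl").unlink()                         # invalidate the middle step
artifacts, third_run = run_with_cache_waterfall(steps, tmp_dir)

print(first_run, second_run, third_run, artifacts["total"])
```

Deleting one cache file forces that step and every later step to run again, while earlier steps are still loaded from disk. Pickle files should only be loaded from a cache directory you trust, since unpickling can execute arbitrary code.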


# Task 22: Robustness: Event-study pre-trends and placebo tests

# ==============================================================================
# Task 22: Robustness: Event-study pre-trends and placebo tests
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 22, Step 1: Helper to Estimate Pre-Period Coefficients
# ------------------------------------------------------------------------------

def _estimate_pre_trends(
    event_study_df: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    master_config: Dict[str, Any],
    outcomes_to_test: Dict[str, str],
    task_name: str = "Task 22, Step 1"
) -> pd.DataFrame:
    """
    Estimates placebo effects for key outcomes in the pre-treatment period.

    This function iterates through the pre-treatment years and runs the
    appropriate 2SLS estimation for each specified outcome to test the
    parallel trends assumption.

    Args:
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        analysis_panel (pd.DataFrame): The full worker-year panel for individual-level models.
        master_config (Dict[str, Any]): The master configuration dictionary.
        outcomes_to_test (Dict[str, str]): A dictionary mapping outcome names
            (e.g., 'emp_outcome') to the estimation type ('regional' or 'pure_wage').
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        pd.DataFrame: A DataFrame containing the full estimation results for
                      each outcome in each pre-treatment year.
    """
    # Extract the list of pre-treatment years to test.
    pre_treatment_years = master_config["temporal_parameters"]["PRE_TREATMENT_YEARS_FOR_TESTING"]

    # Store results from each estimation.
    all_results = []

    # Loop through each outcome and each pre-treatment year.
    for outcome_var, model_type in outcomes_to_test.items():
        for year in pre_treatment_years:
            print(f"\n--- Estimating Pre-Trend for '{outcome_var}' ({model_type}) in Year: {year} ---")

            if model_type == 'regional':
                # Call the master regional 2SLS estimation function.
                year_results = estimate_regional_effect_2sls(
                    event_study_df=event_study_df,
                    master_config=master_config,
                    outcome_variable=outcome_var,
                    event_year=year
                )
            elif model_type == 'pure_wage':
                # Call the master individual-level FD-IV estimation function.
                year_results = _estimate_pure_wage_effect_stayers(
                    analysis_panel=analysis_panel,
                    event_study_df=event_study_df,
                    master_config=master_config,
                    event_year=year
                )
            else:
                raise ValueError(f"Unknown model_type '{model_type}' for pre-trend estimation.")

            # Store the results if the estimation was successful.
            if year_results:
                all_results.append(year_results)

    print(f"[{task_name}] Pre-trend estimation completed.")

    # Convert the list of result dictionaries into a DataFrame.
    return pd.DataFrame(all_results)

# ------------------------------------------------------------------------------
# Task 22, Step 2: Helper to Construct Dynamic Event-Study Plots
# ------------------------------------------------------------------------------

def _plot_event_study(
    results_df: pd.DataFrame,
    title: str,
    y_label: str,
    base_year: int = 1990
) -> None:
    """
    Generates a publication-quality event-study plot from estimation results.

    This function takes a tidy DataFrame of regression results (containing point
    estimates and confidence intervals for multiple years) and creates a
    standard event-study plot. It visualizes the dynamic treatment effects over
    time, clearly demarcating the pre- and post-treatment periods.

    Args:
        results_df (pd.DataFrame): A DataFrame where each row contains the
            estimation results for a specific 'event_year'. Must contain the
            columns 'event_year', 'point_estimate', and 'bootstrap_ci'.
        title (str): The main title for the plot.
        y_label (str): The label for the y-axis, describing the coefficient's
                       interpretation.
        base_year (int): The base year of the study (the year before treatment
                         starts), used to draw a demarcation line.

    Raises:
        ValueError: If `results_df` is missing required columns.
    """
    # --- Input Validation ---
    # Check if the input DataFrame is valid and contains the necessary columns.
    if not isinstance(results_df, pd.DataFrame) or results_df.empty:
        # If data is missing, print a warning and exit gracefully.
        print(f"Warning: No data provided for plotting '{title}'. Skipping plot.")
        return

    # Define the set of columns required for plotting.
    required_cols = {'event_year', 'point_estimate', 'bootstrap_ci'}
    # Assert that all required columns are present.
    if not required_cols.issubset(results_df.columns):
        raise ValueError(f"Input results_df is missing required columns. Expected: {required_cols}")

    # --- Data Preparation ---

    # Sort the data by year to ensure lines are plotted chronologically.
    plot_data = results_df.sort_values('event_year').copy()

    # Unpack the confidence interval tuple into separate 'ci_lower' and 'ci_upper' columns.
    plot_data[['ci_lower', 'ci_upper']] = pd.DataFrame(plot_data['bootstrap_ci'].tolist(), index=plot_data.index)

    # --- Plotting ---

    # Set a professional plot style.
    plt.style.use('seaborn-v0_8-whitegrid')

    # Create the figure and axes objects for plotting.
    fig, ax = plt.subplots(figsize=(12, 7))

    # Plot the point estimates for each year as a line with markers.
    ax.plot(plot_data['event_year'], plot_data['point_estimate'], marker='s', linestyle='-', color='black', label='Point Estimate')

    # Plot the 95% confidence interval as a shaded region for visual clarity.
    ax.fill_between(plot_data['event_year'], plot_data['ci_lower'], plot_data['ci_upper'], color='blue', alpha=0.15, label='95% Bootstrap CI')

    # Plot the bounds of the confidence interval as dashed lines.
    ax.plot(plot_data['event_year'], plot_data['ci_lower'], linestyle='--', color='blue', linewidth=0.8)
    ax.plot(plot_data['event_year'], plot_data['ci_upper'], linestyle='--', color='blue', linewidth=0.8)

    # Add a horizontal line at y=0, which serves as the null hypothesis reference.
    ax.axhline(0, color='black', linestyle='-', linewidth=0.8)

    # Add a vertical line to clearly separate the pre- and post-treatment periods.
    ax.axvline(base_year + 0.5, color='red', linestyle='--', linewidth=1.2, label=f'Treatment Start (Post-{base_year})')

    # --- Formatting and Finalization ---

    # Set labels and title with appropriate font sizes.
    ax.set_xlabel("Year", fontsize=12)
    ax.set_ylabel(y_label, fontsize=12)
    ax.set_title(title, fontsize=14, weight='bold')

    # Add a legend to identify the plot elements.
    ax.legend(fontsize=10)

    # Customize the grid for a cleaner look.
    ax.grid(True, which='both', linestyle='--', linewidth=0.5)

    # Customize tick label sizes.
    ax.tick_params(axis='both', which='major', labelsize=10)

    # Ensure the x-axis ticks are integers representing years.
    ax.xaxis.set_major_locator(MaxNLocator(integer=True))

    # Adjust layout to prevent labels from overlapping.
    plt.tight_layout()

    # Display the final plot.
    plt.show()
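

# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the original pipeline)
# ------------------------------------------------------------------------------
# The plotting helper above appears to expect one row per year with the columns
# 'event_year', 'point_estimate', and a 'bootstrap_ci' (lower, upper) tuple.
# The function below is a hypothetical, plot-free illustration of the sorting
# and tuple-unpacking step only; its name is an assumption, not an existing API.

import pandas as pd


def _example_unpack_bootstrap_ci(results_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sort by year and split 'bootstrap_ci' into two columns."""
    # Sort chronologically on a copy so the caller's frame is untouched.
    out = results_df.sort_values('event_year').copy()
    # Build a two-column frame from the tuples, aligned on the original index.
    bounds = pd.DataFrame(
        out['bootstrap_ci'].tolist(), index=out.index, columns=['ci_lower', 'ci_upper']
    )
    out['ci_lower'] = bounds['ci_lower']
    out['ci_upper'] = bounds['ci_upper']
    return out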


# ------------------------------------------------------------------------------
# Task 22, Orchestrator Function
# ------------------------------------------------------------------------------

def run_robustness_checks(
    event_study_df: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    main_estimation_results: Dict[str, Any],
    master_config: Dict[str, Any]
) -> None:
    """
    Orchestrates the execution of key robustness checks for the main results.

    This function performs two main actions to validate the identification strategy:
    1.  **Pre-Trend Estimation**: It estimates "placebo" effects for the main
        outcomes in the pre-treatment period (1987-1989). The identifying
        assumption of parallel trends requires these coefficients to be
        statistically indistinguishable from zero.
    2.  **Event-Study Visualization**: It combines the pre-trend results with the
        post-treatment results and generates dynamic event-study plots (like
        Figure 1 in the paper) to visualize the full time path of the estimated
        effects.

    Args:
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        analysis_panel (pd.DataFrame): The full worker-year panel, required for
            any individual-level pre-trend estimations.
        main_estimation_results (Dict[str, Any]): The results from the main
            estimation tasks (e.g., from the Task 21 orchestrator), which contain
            the post-treatment coefficients.
        master_config (Dict[str, Any]): The master configuration dictionary.
    """
    # --- Input Validation ---
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")
    if not isinstance(main_estimation_results, dict):
        raise TypeError("`main_estimation_results` must be a dictionary.")

    # --- Step 1: Estimate Pre-Period Coefficients ---

    # Define the main outcomes for which to test for pre-trends.
    outcomes_to_test = {
        'emp_outcome': 'regional',
        'wage_outcome': 'regional',
    }

    # Call the helper to run the 2SLS regressions for each pre-treatment year.
    pre_trend_results_df = _estimate_pre_trends(
        event_study_df, analysis_panel, master_config, outcomes_to_test
    )

    # --- Step 2: Construct Dynamic Event-Study Plots ---

    # Extract the corresponding post-treatment results from the main results dictionary.
    post_trend_results = []
    for year, year_data in main_estimation_results.get('main_tables', {}).items():
        # Extract the total regional employment effect.
        if 'employment_decomposition' in year_data:
            res = year_data['employment_decomposition']['total_effect'].copy()
            res['outcome'] = 'emp_outcome'  # Add an outcome identifier for plotting.
            # The plotting helper requires 'event_year'; fall back to the table key if absent.
            if 'event_year' not in res:
                res['event_year'] = year
            post_trend_results.append(res)
        # Extract the regional wage effect.
        if 'wage_decomposition' in year_data:
            res = year_data['wage_decomposition']['regional_wage_effect'].copy()
            res['outcome'] = 'wage_outcome'  # Add an outcome identifier for plotting.
            # The plotting helper requires 'event_year'; fall back to the table key if absent.
            if 'event_year' not in res:
                res['event_year'] = year
            post_trend_results.append(res)

    # Combine the pre-trend and post-trend results into a single DataFrame.
    full_event_study_results = pd.concat(
        [pre_trend_results_df, pd.DataFrame(post_trend_results)],
        ignore_index=True
    )

    # Generate the plot for the regional employment effect (replicating Figure 1A logic).
    _plot_event_study(
        results_df=full_event_study_results[full_event_study_results['outcome'] == 'emp_outcome'],
        title='Event Study: Impact of Immigration on Regional Native Employment',
        y_label='Coefficient (Effect on % Change in Native Employment)',
        base_year=master_config["temporal_parameters"]["BASE_YEAR"]
    )

    # Generate the plot for the regional wage effect (replicating Figure 1B logic).
    _plot_event_study(
        results_df=full_event_study_results[full_event_study_results['outcome'] == 'wage_outcome'],
        title='Event Study: Impact of Immigration on Regional Native Wages',
        y_label='Coefficient (Effect on Change in Mean Log Wages)',
        base_year=master_config["temporal_parameters"]["BASE_YEAR"]
    )

    # Log the successful completion of the task.
    print("\nTask 22: Robustness checks completed successfully.")
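

# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the original pipeline)
# ------------------------------------------------------------------------------
# The docstring above states that pre-period coefficients should be
# statistically indistinguishable from zero. A simple numerical companion to
# the plots could list pre-period rows whose bootstrap interval excludes zero.
# The helper name is hypothetical, and it assumes the same result schema used
# for plotting ('event_year' plus a 'bootstrap_ci' tuple per row).

import pandas as pd


def _example_flag_pre_trends(results_df: pd.DataFrame, base_year: int) -> pd.DataFrame:
    """Hypothetical helper: return pre-period rows whose CI does not contain zero."""
    # Keep only years strictly before the base year.
    pre = results_df[results_df['event_year'] < base_year].copy()
    # Pull the interval bounds out of the (lower, upper) tuples.
    lower = pre['bootstrap_ci'].map(lambda ci: ci[0])
    upper = pre['bootstrap_ci'].map(lambda ci: ci[1])
    # An interval excludes zero when it lies entirely above or entirely below it.
    pre['ci_excludes_zero'] = (lower > 0) | (upper < 0)
    return pre[pre['ci_excludes_zero']]

# An empty return value is consistent with the parallel-trends assumption;
# any returned row is a year worth inspecting before trusting the main results.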


# Task 23: Robustness: Alternative specifications and sensitivity

# ==============================================================================
# Task 23: Robustness: Alternative specifications and sensitivity
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 23, Step 1: Helper for First-Stage Interaction Test
# ------------------------------------------------------------------------------

def test_first_stage_interactions(
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int,
    task_name: str = "Task 23, Step 1"
) -> Dict[str, Any]:
    """
    Tests an alternative first-stage specification with interactions.

    This function re-estimates the main regional employment effect using an
    alternative first stage where the instruments are interacted with a
    border region indicator. This is a robustness check on the instrument spec.

    Args:
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year to run the test for.
        task_name (str): The name of the calling task for clear log messages.

    Returns:
        Dict[str, Any]: The full estimation results from the 2SLS model with
                        the interactive first stage.
    """
    print(f"\n--- {task_name}: Running First-Stage Interaction Test for {event_year} ---")

    # Create a copy of the data to modify.
    data = event_study_df.copy()

    # Create the 'Is_Border_Region' flag.
    treated_districts = set(master_config["geographic_policy_parameters"]["TREATED_DISTRICT_IDS"])
    data['Is_Border_Region'] = data['District_ID'].isin(treated_districts).astype(int)

    # Create the interaction term instruments.
    data['dist_x_border'] = data['distance_to_border'] * data['Is_Border_Region']
    data['dist_sq_x_border'] = data['distance_to_border_sq'] * data['Is_Border_Region']

    # Create a temporary config with the modified instrument specification.
    temp_config = copy.deepcopy(master_config)
    temp_config['algorithm_config_parameters']['INSTRUMENT_SPECIFICATION'] = {
        'dist_x_border': 1,
        'dist_sq_x_border': 2
    }

    # Re-run the main 2SLS estimation with the new instrument set.
    results = estimate_regional_effect_2sls(
        event_study_df=data,
        master_config=temp_config,
        outcome_variable='emp_outcome',
        event_year=event_year
    )

    return results

# ------------------------------------------------------------------------------
# Task 23, Step 2: Helper for Parameter Perturbation Sensitivity
# ------------------------------------------------------------------------------

def test_parameter_perturbation(
    perturbation_scenario: Dict[str, Any],
    raw_data_path: str,
    master_config: Dict[str, Any],
    event_year: int,
    base_cache_dir: str,
    task_name: str = "Task 23, Step 2"
) -> Dict[str, Any]:
    """
    Re-runs the entire pipeline with a perturbed configuration to test sensitivity.

    This function provides a framework for testing the robustness of the main
    results to changes in key methodological parameters (e.g., FTE weights).
    It works by creating a modified copy of the master configuration, clearing
    any old cache for this specific scenario, and then executing the entire
    end-to-end analysis pipeline from scratch. This is computationally intensive
    but provides the most rigorous test of parameter sensitivity.

    Args:
        perturbation_scenario (Dict[str, Any]): A dictionary defining the test case.
            It should have one key (the scenario name) and a value which is another
            dictionary mapping dot-separated config paths to their new values.
            Example: {'fte_low': {'algorithm_config_parameters.PART_TIME_EQUIVALENCY_WEIGHTS': new_weights}}
        raw_data_path (str): The file path to the raw data, needed to re-run the pipeline.
        master_config (Dict[str, Any]): The baseline master configuration dictionary.
        event_year (int): The specific event year to extract the final result for.
        base_cache_dir (str): The root directory for caching, from which a unique
                              subdirectory for this scenario will be created.
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        Dict[str, Any]: The key estimation result (e.g., the regional employment
                        effect dictionary) from the pipeline run with the
                        perturbed configuration.
    """
    # --- 1. Prepare Perturbed Configuration ---

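    # Extract the scenario name and the dictionary of parameter changes.
    # The scenario dictionary must contain exactly one entry.
    if len(perturbation_scenario) != 1:
        raise ValueError(
            f"[{task_name}] `perturbation_scenario` must contain exactly one scenario, "
            f"got {len(perturbation_scenario)}."
        )
    scenario_name, param_changes = next(iter(perturbation_scenario.items()))
    print(f"\n--- {task_name}: Testing Sensitivity to Parameter Perturbation: {scenario_name} ---")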

    # Create a deep copy of the master config to avoid modifying the original.
    perturbed_config = copy.deepcopy(master_config)

    # Apply the specified changes to the copied config.
    # This loop navigates the nested dictionary using a dot-separated path.
    for path, value in param_changes.items():
        keys = path.split('.')
        d = perturbed_config
        for key in keys[:-1]:
            d = d[key]
        d[keys[-1]] = value

    # --- 2. Set Up and Run the Pipeline ---

    # Define a unique cache directory for this sensitivity run to avoid conflicts.
    scenario_cache_dir = Path(base_cache_dir) / f"sensitivity_{scenario_name}"

    # Clear any old cache for this specific scenario to ensure a clean run.
    if scenario_cache_dir.exists():
        shutil.rmtree(scenario_cache_dir)

    # Re-run the entire end-to-end analysis pipeline with the modified config.
    # Force a rerun of the data preparation so that the parameter change propagates.
    full_results = run_full_analysis_pipeline(
        raw_data_path=raw_data_path,
        master_config=perturbed_config,
        main_result_years=[event_year],
        cache_dir=str(scenario_cache_dir),
        force_rerun_prep=True
    )

    # --- 3. Extract and Return Key Result ---

    # Extract the key result of interest (the total regional employment effect)
    # from the comprehensive results dictionary returned by the pipeline.
    key_result = full_results['main_tables'][event_year]['employment_decomposition']['total_effect']

    # Return the result for this sensitivity scenario.
    return key_result
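

# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the original pipeline)
# ------------------------------------------------------------------------------
# The dot-separated path update used above can be isolated into a small helper.
# This hypothetical variant is deliberately stricter than the inline loop: it
# refuses to create keys that do not already exist, so a typo in a path fails
# loudly instead of silently adding an unused parameter. Both the helper name
# and the stricter behaviour are assumptions made for illustration.

import copy
from typing import Any, Dict


def _example_set_by_dot_path(config: Dict[str, Any], path: str, value: Any) -> Dict[str, Any]:
    """Hypothetical helper: return a deep copy of `config` with `path` set to `value`."""
    new_config = copy.deepcopy(config)
    node = new_config
    # Split 'a.b.c' into the parent keys ['a', 'b'] and the leaf key 'c'.
    *parents, leaf = path.split('.')
    for key in parents:
        if not isinstance(node.get(key), dict):
            raise KeyError(f"Config path '{path}' is invalid at key '{key}'.")
        node = node[key]
    if leaf not in node:
        raise KeyError(f"Config path '{path}' has no existing entry '{leaf}'.")
    node[leaf] = value
    return new_config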

# ------------------------------------------------------------------------------
# Task 23, Step 3: Helper for Sample Restriction Sensitivity
# ------------------------------------------------------------------------------

def test_sample_restriction_sensitivity(
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int,
    restriction: Dict[str, Any],
    task_name: str = "Task 23, Step 3"
) -> Dict[str, Any]:
    """
    Re-runs the main estimation on a programmatically restricted sample.

    This function tests the sensitivity of the main findings to the sample
    definition by applying additional filters to the final analysis dataset
    before re-running the 2SLS estimation.

    Args:
        event_study_df (pd.DataFrame): The main, fully prepared analysis dataset.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific event year to run the estimation for.
        restriction (Dict[str, Any]): A dictionary defining the restrictions to apply.
            Supported keys: 'distance_band' (e.g., [10, 60]) or
            'min_muni_size' (e.g., 100).
        task_name (str): The name of the calling task for clear error reporting.

    Returns:
        Dict[str, Any]: The estimation results dictionary for the restricted sample.
    """
    # --- Input Validation ---
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")

    print(f"\n--- {task_name}: Testing Sensitivity to Sample Restriction: {restriction} ---")

    # --- 1. Apply Sample Restrictions ---

    # Work on a copy of the DataFrame to avoid modifying the original.
    restricted_df = event_study_df.copy()

    # Apply a distance band filter if specified.
    if 'distance_band' in restriction:
        min_dist, max_dist = restriction['distance_band']
        restricted_df = restricted_df[
            (restricted_df['distance_to_border'] >= min_dist) &
            (restricted_df['distance_to_border'] <= max_dist)
        ]

    # Apply a minimum municipality size filter if specified.
    if 'min_muni_size' in restriction:
        weight_col = master_config["algorithm_config_parameters"]["WEIGHT_COLUMN_MUNICIPALITY"]
        restricted_df = restricted_df[restricted_df[weight_col] >= restriction['min_muni_size']]

    # --- 2. Re-run Estimation ---

    # Re-run the main regional employment effect estimation on the newly restricted data.
    results = estimate_regional_effect_2sls(
        event_study_df=restricted_df,
        master_config=master_config,
        outcome_variable='emp_outcome',
        event_year=event_year
    )

    # Return the results from this sensitivity run.
    return results

# ------------------------------------------------------------------------------
# Task 23, Orchestrator Function
# ------------------------------------------------------------------------------

def run_sensitivity_analyses(
    raw_data_path: str,
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int = 1993
) -> Dict[str, Any]:
    """
    Orchestrates the execution of alternative specification and sensitivity checks.

    This function runs a series of robustness checks to test the stability of
    the main findings. It includes:
    1.  An alternative first-stage specification with instrument interactions.
    2.  Sensitivity to parameter choices (e.g., FTE weights) by re-running the
        entire pipeline with a modified configuration.
    3.  Sensitivity to sample restrictions based on distance and municipality size
        by re-running the final estimation on a filtered dataset.

    Args:
        raw_data_path (str): Path to the raw data, needed for parameter perturbation tests.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The primary event year for which to run the checks.

    Returns:
        Dict[str, Any]: A nested dictionary containing the results from all
                        sensitivity analyses performed.
    """
    # --- Input Validation ---
    if not isinstance(event_study_df, pd.DataFrame):
        raise TypeError("`event_study_df` must be a pandas DataFrame.")

    # Initialize the main results dictionary.
    sensitivity_results: Dict[str, Any] = {}

    # --- Step 1: First-Stage Interaction Test ---
    # This tests if the instrument's power is robust to a more flexible spec.
    sensitivity_results['first_stage_interaction'] = test_first_stage_interactions(
        event_study_df, master_config, event_year
    )

    # --- Step 2: Parameter Perturbation ---
    # This is computationally expensive as it re-runs the entire pipeline for each scenario.
    param_scenarios = [
        {'fte_weights_low': {'algorithm_config_parameters.PART_TIME_EQUIVALENCY_WEIGHTS': {1: 1.0, 5: 0.60, 6: 0.40}}},
        {'fte_weights_high': {'algorithm_config_parameters.PART_TIME_EQUIVALENCY_WEIGHTS': {1: 1.0, 5: 0.75, 6: 0.60}}}
    ]
    param_results = {}
    for scenario in param_scenarios:
        scenario_name = list(scenario.keys())[0]
        param_results[scenario_name] = test_parameter_perturbation(
            scenario, raw_data_path, master_config, event_year, master_config.get('cache_dir', './.cache/')
        )
    sensitivity_results['parameter_perturbations'] = param_results

    # --- Step 3: Sample Restriction Sensitivity ---
    # This tests if the results are driven by municipalities that are very close/far or very small.
    sample_restriction_scenarios = {
        'distance_band_10_60km': {'distance_band': [10, 60]},
        'min_size_100FTE': {'min_muni_size': 100}
    }

    restriction_results = {}
    for name, restriction in sample_restriction_scenarios.items():
        restriction_results[name] = test_sample_restriction_sensitivity(
            event_study_df, master_config, event_year, restriction
        )
    sensitivity_results['sample_restrictions'] = restriction_results

    # Log the successful completion of the task.
    print("\nTask 23: Sensitivity analyses completed successfully.")

    # Return the comprehensive dictionary of all sensitivity results.
    return sensitivity_results
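

# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the original pipeline)
# ------------------------------------------------------------------------------
# The nested dictionary returned above mixes a single result
# ('first_stage_interaction') with dictionaries of named scenarios. For a
# side-by-side comparison it can help to flatten it into one tidy table.
# This is a hypothetical helper; it assumes every leaf result is a dictionary
# carrying 'point_estimate' and 'cluster_robust_se', as `_format_result` does.

import pandas as pd
from typing import Any, Dict


def _example_flatten_sensitivity_results(sensitivity_results: Dict[str, Any]) -> pd.DataFrame:
    """Hypothetical helper: one row per sensitivity scenario."""
    rows = []
    for family, content in sensitivity_results.items():
        # A family is either a single result dict or a dict of named scenarios.
        scenarios = {family: content} if 'point_estimate' in content else content
        for name, res in scenarios.items():
            rows.append({
                'family': family,
                'scenario': name,
                'point_estimate': res.get('point_estimate'),
                'cluster_robust_se': res.get('cluster_robust_se'),
            })
    return pd.DataFrame(
        rows, columns=['family', 'scenario', 'point_estimate', 'cluster_robust_se']
    )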


# Task 24: Robustness: Pseudo-panel cross-check (Equations (C5.1)–(C5.2))

# ==============================================================================
# Task 24: Robustness: Pseudo-panel cross-check
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 24, Step 1: Helper to Construct the Pseudo-Panel
# ------------------------------------------------------------------------------

def _construct_pseudo_panel(
    analysis_panel: pd.DataFrame,
    config: Dict[str, Any],
    task_name: str = "Task 24, Step 1"
) -> pd.DataFrame:
    """
    Constructs a pseudo-panel by aggregating individual data into cells.

    Cells are defined by the interaction of municipality, year, and observable
    worker characteristics (education, age group, gender). The function
    calculates the mean log wage and cell size for each cell.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        config (Dict[str, Any]): The master configuration dictionary.
        task_name (str): The name of the calling task for clear log messages.

    Returns:
        pd.DataFrame: A DataFrame where each row represents a pseudo-panel cell,
                      containing the mean log wage and cell size.
    """
    # Define the demographic groups for cell construction.
    group_cols = ['education_group', 'age_group', 'gender_group']

    # Filter to the relevant sample: native, full-time, employed workers.
    pseudo_panel_sample = analysis_panel[
        analysis_panel['is_native'] &
        analysis_panel['is_full_time'] &
        analysis_panel['is_employed']
    ].copy()

    # Group by cell identifiers and aggregate.
    cell_df = pseudo_panel_sample.groupby(
        ['Municipality_ID', 'snapshot_year'] + group_cols
    ).agg(
        mean_log_wage=('log_wage_imputed', 'mean'),
        cell_size=('Worker_ID', 'count')
    )

    # --- Filter out small cells based on baseline year size ---
    base_year = config["temporal_parameters"]["BASE_YEAR"]
    min_cell_size = config["algorithm_config_parameters"]["MIN_PSEUDO_PANEL_CELL_SIZE"]

    # Get the cell sizes for the baseline year.
    baseline_cell_sizes = cell_df.loc[
        (slice(None), base_year, slice(None), slice(None), slice(None)), 'cell_size'
    ].rename('baseline_cell_size')

    # Merge baseline sizes back to the main cell DataFrame.
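    # Drop the baseline 'snapshot_year' first: both frames would otherwise carry
    # the column, and the merge would rename it to 'snapshot_year_x'/'_y'.
    cell_df = cell_df.reset_index().merge(
        baseline_cell_sizes.reset_index().drop(columns='snapshot_year'),
        on=['Municipality_ID'] + group_cols,
        how='left'
    )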

    # Filter to keep only cells that meet the minimum size requirement in the baseline year.
    initial_cells = len(cell_df)
    cell_df_filtered = cell_df[cell_df['baseline_cell_size'] >= min_cell_size].copy()
    final_cells = len(cell_df_filtered)

    print(f"[{task_name}] Constructed pseudo-panel. Kept {final_cells} of {initial_cells} "
          f"cell-year observations after applying min size filter of {min_cell_size}.")

    return cell_df_filtered.set_index(['Municipality_ID', 'snapshot_year'] + group_cols)
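

# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the original pipeline)
# ------------------------------------------------------------------------------
# The baseline-size filter above keeps every year of a cell whenever that cell
# was large enough in the base year. The toy helper below shows the same idea
# on a flat frame with a single grouping column. The helper name and the
# reduced column set are assumptions for illustration; cells that do not exist
# in the base year receive a missing baseline size and are therefore dropped.

import pandas as pd


def _example_baseline_cell_filter(
    cells: pd.DataFrame, base_year: int, min_cell_size: int
) -> pd.DataFrame:
    """Hypothetical helper: keep all years of cells with base-year size >= minimum."""
    id_cols = ['Municipality_ID', 'education_group']
    # Base-year sizes only, without the year column, to avoid a name clash on merge.
    baseline = cells.loc[cells['snapshot_year'] == base_year, id_cols + ['cell_size']]
    baseline = baseline.rename(columns={'cell_size': 'baseline_cell_size'})
    # Attach the base-year size to every year of the same cell.
    merged = cells.merge(baseline, on=id_cols, how='left')
    # Comparisons against a missing baseline size evaluate to False.
    return merged[merged['baseline_cell_size'] >= min_cell_size]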

# ------------------------------------------------------------------------------
# Task 24, Orchestrator Function
# ------------------------------------------------------------------------------

def run_pseudo_panel_check(
    analysis_panel: pd.DataFrame,
    event_study_df: pd.DataFrame,
    master_config: Dict[str, Any],
    event_year: int
) -> Dict[str, Any]:
    """
    Orchestrates the pseudo-panel robustness check as per Appendix C.5.

    This function estimates the wage effect of immigration using a pseudo-panel
    of demographic cells, which mimics an analysis using repeated cross-sectional
    data where only observable characteristics can be controlled for. The result
    (γ^PP) is then compared to the regional (γ^R) and pure (γ^W) wage effects.

    Args:
        analysis_panel (pd.DataFrame): The fully enriched worker-year panel.
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        master_config (Dict[str, Any]): The master configuration dictionary.
        event_year (int): The specific year of the event study to analyze.

    Returns:
        Dict[str, Any]: The full 2SLS estimation results for the pseudo-panel model.
    """
    # --- Input Validation ---
    if not isinstance(analysis_panel, pd.DataFrame):
        raise TypeError("`analysis_panel` must be a pandas DataFrame.")

    # --- Step 1: Construct the Pseudo-Panel ---
    pseudo_panel_df = _construct_pseudo_panel(analysis_panel, master_config)

    # --- Step 2: First-Difference and Prepare for Estimation ---
    base_year = master_config["temporal_parameters"]["BASE_YEAR"]

    # Extract data for the base and event years.
    pseudo_panel_base = pseudo_panel_df.loc[(slice(None), base_year), :]
    pseudo_panel_event = pseudo_panel_df.loc[(slice(None), event_year), :]

    # Align the two time periods for differencing.
    # `dict.keys()` returns a view, which cannot be concatenated to a list directly.
    merged_panel = pseudo_panel_base.merge(
        pseudo_panel_event,
        on=['Municipality_ID'] + list(master_config["algorithm_config_parameters"]["PSEUDO_PANEL_GROUPS"].keys()),
        suffixes=('_base', '_event'),
        how='inner' # Keep only cells that exist in both years.
    )

    # Outcome: Δlog_w̄_krt = log_w̄_krt - log_w̄_kr,1990
    merged_panel['pseudo_panel_outcome'] = merged_panel['mean_log_wage_event'] - merged_panel['mean_log_wage_base']

    # The weight is the baseline cell size. Both periods carry this column, so the
    # merge suffixes it; the base-year copy is used as the weight.
    weight_col = 'baseline_cell_size_base'

    # Merge municipality-level shocks, instruments, and cluster IDs.
    # The unit of observation is now the cell (Municipality_ID + group).
    event_data_year = event_study_df.loc[(slice(None), event_year), :].reset_index()

    analysis_data = merged_panel.reset_index().merge(
        event_data_year, on='Municipality_ID', how='left'
    ).dropna()

    # --- Step 3: Estimate 2SLS at the Cell Level ---

    # Create a temporary config to pass the custom weight column name.
    temp_config = copy.deepcopy(master_config)
    temp_config['algorithm_config_parameters']['WEIGHT_COLUMN_MUNICIPALITY'] = weight_col

    # Call the master 2SLS estimator. The data must be temporarily reshaped
    # to match the expected input format (MultiIndex with year).
    analysis_data['snapshot_year'] = event_year
    analysis_data = analysis_data.set_index(['Municipality_ID', 'snapshot_year'])

    print(f"\n--- Estimating Pseudo-Panel Model (γ^PP) for Year: {event_year} ---")
    results = estimate_regional_effect_2sls(
        event_study_df=analysis_data,
        master_config=temp_config,
        outcome_variable='pseudo_panel_outcome',
        event_year=event_year
    )

    print("\nTask 24: Pseudo-panel cross-check completed successfully.")
    return results


# Task 25: Compile final outputs, tables, and figures

# ==============================================================================
# Task 25: Compile final outputs, tables, and figures
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 25, Step 1: Helpers to Format and Create Publication Tables
# ------------------------------------------------------------------------------

def _format_result(
    result: Dict[str, Any],
    flip_sign: bool = False
) -> Tuple[str, str]:
    """
    Formats a single estimation result into a (point_estimate, std_error) string tuple.

    This utility takes a dictionary of estimation results, formats the point
    estimate and standard error to three decimal places, and adds significance
    stars. Stars are shown only when the bootstrap confidence interval excludes
    zero; their number (***, **, *) then follows the conventional p-value
    thresholds (0.01, 0.05, 0.10).

    Args:
        result (Dict[str, Any]): A dictionary containing at least 'point_estimate',
                                 'cluster_robust_se', 'bootstrap_ci', and 'p_value'.
        flip_sign (bool): If True, the sign of the point estimate is flipped
                          before formatting. This is required for components like
                          displacement that enter the decomposition identity with
                          a negative sign.

    Returns:
        Tuple[str, str]: A tuple containing two strings:
                         - The formatted point estimate with significance stars.
                         - The formatted standard error enclosed in parentheses.
    """
    # --- Input Validation ---
    # Handle cases where the result dictionary is empty or incomplete, returning placeholders.
    if not result or 'point_estimate' not in result or 'cluster_robust_se' not in result:
        return "-", "(-)"

    # --- Data Extraction ---

    # Extract the point estimate and optionally flip its sign for decomposition tables.
    pe = result['point_estimate'] * (-1 if flip_sign else 1)

    # Extract the cluster-robust standard error.
    se = result['cluster_robust_se']

    # Safely extract the bootstrap confidence interval.
    ci = result.get('bootstrap_ci')

    # --- Significance Star Determination ---

    # Initialize the significance stars string.
    stars = ""

    # Determine significance only if a valid bootstrap confidence interval is provided.
    if ci and len(ci) == 2:
        # The effect is statistically significant if the confidence interval does not include zero.
        # This is a more robust check than relying solely on the p-value.
        if (ci[0] > 0 and ci[1] > 0) or (ci[0] < 0 and ci[1] < 0):
            # Use conventional p-value thresholds for the star notation (*** for p<0.01, etc.).
            p_val = result.get('p_value', 1.0)
            if p_val < 0.01:
                stars = "***"
            elif p_val < 0.05:
                stars = "**"
            elif p_val < 0.10:
                stars = "*"

    # --- Formatting and Return ---

    # Return the formatted point estimate and standard error as a tuple of strings.
    return f"{pe:.3f}{stars}", f"({se:.3f})"
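

# ------------------------------------------------------------------------------
# Illustrative sketch (not part of the original pipeline)
# ------------------------------------------------------------------------------
# The star rule above combines two criteria: the bootstrap interval must
# exclude zero, and only then does the p-value decide how many stars appear.
# The standalone function below restates that rule so the interaction is easy
# to inspect. Its name is hypothetical and it is not called by the pipeline.

from typing import Optional, Sequence


def _example_significance_stars(p_value: float, ci: Optional[Sequence[float]]) -> str:
    """Hypothetical helper: stars require a CI that excludes zero AND a small p-value."""
    # Without a usable two-element interval, no stars are awarded.
    if not ci or len(ci) != 2:
        return ""
    # Same test as above: both bounds strictly positive or both strictly negative.
    excludes_zero = (ci[0] > 0 and ci[1] > 0) or (ci[0] < 0 and ci[1] < 0)
    if not excludes_zero:
        return ""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return ""

# Note the asymmetry: a small p-value paired with an interval that covers zero
# yields no stars, so the bootstrap interval acts as a gate on the p-value.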

def create_table_1(
    results: Dict[str, Any],
    year: int
) -> pd.DataFrame:
    """
    Creates a pandas DataFrame that replicates Table 1 (Employment Decomposition).

    This function navigates the nested results dictionary to find the employment
    decomposition estimates for a specific year. It then uses the `_format_result`
    utility to format each component and assembles them into a publication-quality
    DataFrame that mirrors the structure of the paper's Table 1.

    Args:
        results (Dict[str, Any]): The comprehensive results dictionary from the
                                  main analysis pipeline.
        year (int): The specific event year for which to generate the table.

    Returns:
        pd.DataFrame: A formatted DataFrame ready for display or export.

    Raises:
        KeyError: If the required 'employment_decomposition' results are not
                  found for the specified year in the input dictionary.
    """
    # --- Data Extraction ---

    # Navigate to the relevant section of the results dictionary.
    # A KeyError will be raised if the path is invalid, providing a clear error.
    try:
        data = results['main_tables'][year]['employment_decomposition']
    except KeyError as exc:
        raise KeyError(f"Could not find 'employment_decomposition' results for year {year}.") from exc

    # --- Table Construction ---

    # Construct the table data by formatting each component.
    # Note the sign flips for displacement and relocation, as per the paper's
    # decomposition identity: ΔE ≈ -Disp + CrowdOut - Reloc.
    table_data = {
        "Regional Effect": _format_result(data.get('total_effect')),
        "Displacement (-)": _format_result(data.get('displacement'), flip_sign=True),
        "Crowding-Out (+)": _format_result(data.get('inflow')),
        "Relocation (-)": _format_result(data.get('relocation'), flip_sign=True),
    }

    # Create the pandas DataFrame from the formatted data.
    df = pd.DataFrame(table_data, index=['Coefficient', 'Std. Error'])

    # Set a descriptive name for the columns index.
    df.columns.name = f"Table 1 (Replica): Employment Decomposition ({year} vs 1990)"

    # Return the final formatted table.
    return df

def create_table_2(
    results: Dict[str, Any],
    year: int
) -> pd.DataFrame:
    """
    Creates a pandas DataFrame that replicates Table 2 (Wage Decomposition).

    This function navigates the nested results dictionary to find the wage
    decomposition estimates for a specific year. It formats each component
    (regional effect, pure effect, and composition terms) and assembles them
    into a publication-quality DataFrame mirroring the paper's Table 2.

    Args:
        results (Dict[str, Any]): The comprehensive results dictionary from the
                                  main analysis pipeline.
        year (int): The specific event year for which to generate the table.

    Returns:
        pd.DataFrame: A formatted DataFrame ready for display or export.

    Raises:
        KeyError: If the required 'wage_decomposition' results are not
                  found for the specified year in the input dictionary.
    """
    # --- Data Extraction ---

    # Navigate to the relevant section of the results dictionary.
    try:
        data = results['main_tables'][year]['wage_decomposition']
        comp_components = data.get('composition_components', {})
    except KeyError as exc:
        raise KeyError(f"Could not find 'wage_decomposition' results for year {year}.") from exc

    # --- Table Construction ---

    # Construct the table data by formatting each component.
    # Note the sign flip for the inflow selection term as per the paper's presentation.
    table_data = {
        "Regional Wage Effect (γ^R)": _format_result(data.get('regional_wage_effect')),
        "Pure Wage Effect (γ^W)": _format_result(data.get('pure_wage_effect')),
        "Compositional Effect": (f"{data.get('composition_effect_total', 0.0):.3f}", ""),
        "  - Outflow Selection": _format_result(comp_components.get('outflow_selection')),
        "  - Inflow Selection": _format_result(comp_components.get('inflow_selection'), flip_sign=True),
        "  - Age Selection": (f"{comp_components.get('age_selection', 0.0):.3f}", ""),
    }

    # Create the pandas DataFrame from the formatted data.
    df = pd.DataFrame(table_data, index=['Coefficient', 'Std. Error'])

    # Set a descriptive name for the columns index.
    df.columns.name = f"Table 2 (Replica): Wage Decomposition ({year} vs 1990)"

    # Return the final formatted table.
    return df
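
The two-row layout used by `create_table_2` comes from passing a dict of `(coefficient, std_error)` tuples to `pd.DataFrame` together with a two-element index. The sketch below illustrates that layout in isolation. The helper `format_result_sketch`, the `std_error` key, and all numbers are hypothetical placeholders for illustration only; they are not the toolkit's `_format_result` and not estimates from the paper.

```python
import pandas as pd


def format_result_sketch(res, flip_sign=False):
    """Hypothetical stand-in for `_format_result`: returns (coef, se) strings."""
    if not res:
        return ("N/A", "")
    sign = -1.0 if flip_sign else 1.0
    coef = f"{sign * res['point_estimate']:.3f}"
    se = f"({res['std_error']:.3f})"
    return (coef, se)


# Placeholder inputs; each dict value becomes one table column.
table_data = {
    "Regional Wage Effect": format_result_sketch(
        {"point_estimate": -0.19, "std_error": 0.05}
    ),
    "Inflow Selection": format_result_sketch(
        {"point_estimate": 0.04, "std_error": 0.02}, flip_sign=True
    ),
    # Terms without a standard error carry an empty second row.
    "Age Selection": ("0.010", ""),
}

# The index labels the two tuple positions as table rows.
df = pd.DataFrame(table_data, index=["Coefficient", "Std. Error"])
print(df.to_string())
```

Because every cell is already a formatted string, the resulting frame is meant for display or export rather than further arithmetic.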

# ------------------------------------------------------------------------------
# Task 25, Step 2: Helpers to Generate and Plot Event Studies
# ------------------------------------------------------------------------------

def _run_full_event_study(
    outcomes_to_run: Dict[str, str],
    event_study_df: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    master_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Runs estimations for multiple outcomes over all event study years.

    This is a computationally intensive helper function that generates the full
    time-series of coefficients needed for creating event-study plots. It iterates
    through a given set of outcomes and all pre- and post-treatment years,
    calling the appropriate master estimation function for each combination.

    Args:
        outcomes_to_run (Dict[str, str]): A dictionary mapping the name of an
            outcome column in `event_study_df` (e.g., 'emp_outcome') to the
            required model type ('regional' for municipality-level 2SLS or
            'pure_wage' for individual-level FD-IV).
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        analysis_panel (pd.DataFrame): The full worker-year panel, required for
            individual-level models like the pure wage effect.
        master_config (Dict[str, Any]): The master configuration dictionary.

    Returns:
        pd.DataFrame: A tidy DataFrame where each row contains the complete
                      estimation results for a single outcome in a single year.
                      Includes an 'outcome_name' column for easy filtering.

    Raises:
        ValueError: If an unknown model_type is provided in `outcomes_to_run`.
    """
    # --- Input Validation ---
    if not isinstance(outcomes_to_run, dict):
        raise TypeError("`outcomes_to_run` must be a dictionary.")

    # --- 1. Define the Time Period ---

    # Define the full range of years for the event-study plot from the config.
    event_years = master_config["temporal_parameters"]["PRE_TREATMENT_YEARS_FOR_TESTING"] + \
                  master_config["temporal_parameters"]["POST_TREATMENT_YEARS"]

    # --- 2. Iterative Estimation ---

    # Initialize a list to store the dictionary of results from each estimation run.
    all_results: List[Dict[str, Any]] = []

    # Loop through each outcome and each year specified for the analysis.
    for outcome, model_type in outcomes_to_run.items():
        for year in sorted(event_years):
            # Select the appropriate estimation function based on the required model type.
            if model_type == 'regional':
                # For regional outcomes, call the municipality-level 2SLS estimator.
                res = estimate_regional_effect_2sls(event_study_df, master_config, outcome, year)
            elif model_type == 'pure_wage':
                # For the pure wage effect, call the individual-level FD-IV estimator.
                res = _estimate_pure_wage_effect_stayers(analysis_panel, event_study_df, master_config, year)
            else:
                # Raise an error for unsupported model types.
                raise ValueError(f"Unknown model_type '{model_type}' for event-study estimation.")

            # If the estimation was successful (returned a non-empty result),
            # add an identifier for the outcome and append it to the results list.
            if res:
                res['outcome_name'] = outcome
                all_results.append(res)

    # --- 3. Finalization ---

    # Convert the list of result dictionaries into a single, tidy DataFrame.
    return pd.DataFrame(all_results)
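
The outcome-by-year loop in `_run_full_event_study` can be sketched without the real estimators. The version below is a variant that swaps the `if/elif` chain for a dictionary lookup and validates all model types before any estimation runs. The stub estimators, the `ESTIMATORS` registry, the outcome label `pure_wage_outcome`, and the years are hypothetical and exist only to make the sketch runnable.

```python
from typing import Any, Callable, Dict, List

import pandas as pd


def stub_regional(outcome: str, year: int) -> Dict[str, Any]:
    """Hypothetical stand-in for the municipality-level 2SLS estimator."""
    return {"event_year": year, "point_estimate": 0.0, "model": "regional"}


def stub_pure_wage(outcome: str, year: int) -> Dict[str, Any]:
    """Hypothetical stand-in for the individual-level FD-IV estimator."""
    if year == 1986:
        return {}  # Mimic a year in which estimation yields no result.
    return {"event_year": year, "point_estimate": 0.0, "model": "pure_wage"}


# Registry mapping each model type to its estimator (an assumption of this sketch).
ESTIMATORS: Dict[str, Callable[[str, int], Dict[str, Any]]] = {
    "regional": stub_regional,
    "pure_wage": stub_pure_wage,
}


def run_event_study_sketch(
    outcomes_to_run: Dict[str, str], event_years: List[int]
) -> pd.DataFrame:
    # Fail fast on unknown model types before any costly estimation starts.
    unknown = set(outcomes_to_run.values()) - set(ESTIMATORS)
    if unknown:
        raise ValueError(f"Unknown model_type(s): {sorted(unknown)}")

    rows: List[Dict[str, Any]] = []
    for outcome, model_type in outcomes_to_run.items():
        for year in sorted(event_years):
            res = ESTIMATORS[model_type](outcome, year)
            if res:  # Empty results are skipped, as in the loop above.
                rows.append({**res, "outcome_name": outcome})
    return pd.DataFrame(rows)


tidy = run_event_study_sketch(
    {"emp_outcome": "regional", "pure_wage_outcome": "pure_wage"},
    [1993, 1986, 1988, 1995],
)
print(tidy)
```

Validating up front means a typo in `outcomes_to_run` surfaces immediately instead of after the earlier outcomes have already been estimated.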

def _plot_multi_series_event_study(
    results_map: Dict[str, pd.DataFrame],
    title: str,
    y_label: str,
    base_year: int = 1990
) -> None:
    """
    Generates a publication-quality event-study plot with multiple series.

    This function takes a dictionary mapping series names to their corresponding
    results DataFrames and plots them on the same axes for comparison. It is
    designed to replicate the style of figures like Figure 1B in the paper.

    Args:
        results_map (Dict[str, pd.DataFrame]): A dictionary where keys are the
            legend labels for each series (e.g., 'Regional Wage Effect') and
            values are the DataFrames containing the estimation results for that
            series over time.
        title (str): The main title for the plot.
        y_label (str): The label for the y-axis.
        base_year (int): The base year, used to draw the treatment start line.

    Raises:
        ValueError: If a DataFrame in `results_map` is missing required columns.
    """
    # --- Plotting Setup ---

    # Set a professional plot style.
    plt.style.use('seaborn-v0_8-whitegrid')

    # Create the figure and axes objects.
    fig, ax = plt.subplots(figsize=(12, 7))

    # Define color and marker cycles for plotting multiple series.
    colors = ['black', 'blue', 'green', 'purple']
    markers = ['s', 'o', '^', 'D']

    # --- Loop and Plot Each Series ---

    # Iterate through the provided dictionary of result series.
    for i, (series_name, df) in enumerate(results_map.items()):
        # Skip if the DataFrame for a series is empty.
        if df.empty:
            print(f"Warning: No data for series '{series_name}'. Skipping.")
            continue

        # Validate that the DataFrame has the required columns.
        required_cols = {'event_year', 'point_estimate', 'bootstrap_ci'}
        if not required_cols.issubset(df.columns):
            raise ValueError(f"DataFrame for series '{series_name}' is missing required columns.")

        # Prepare data for plotting.
        df = df.sort_values('event_year').copy()
        df[['ci_lower', 'ci_upper']] = pd.DataFrame(df['bootstrap_ci'].tolist(), index=df.index)

        # Plot the point estimates for the current series.
        ax.plot(df['event_year'], df['point_estimate'], marker=markers[i % len(markers)],
                linestyle='-', color=colors[i % len(colors)], label=series_name)

        # Plot the 95% confidence interval as a shaded region.
        ax.fill_between(df['event_year'], df['ci_lower'], df['ci_upper'], color=colors[i % len(colors)], alpha=0.1)

    # --- Formatting and Finalization ---

    # Add a horizontal reference line at y=0 (null effect).
    ax.axhline(0, color='black', linestyle='-', linewidth=0.8)

    # Add a vertical reference line to demarcate the start of the treatment.
    ax.axvline(base_year + 0.5, color='red', linestyle='--', linewidth=1.2, label='Treatment Start')

    # Set labels and title with appropriate font sizes.
    ax.set_xlabel("Year", fontsize=12)
    ax.set_ylabel(y_label, fontsize=12)
    ax.set_title(title, fontsize=14, weight='bold')

    # Add a legend.
    ax.legend(fontsize=10)

    # Ensure the x-axis ticks are integers representing years.
    ax.xaxis.set_major_locator(MaxNLocator(integer=True))

    # Adjust layout and display the plot.
    plt.tight_layout()
    plt.show()
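
The plotting helper expects each row's `bootstrap_ci` to be a `(lower, upper)` pair and expands it into two numeric columns before shading the band. The sketch below isolates that preparation step on a small frame with placeholder numbers (not estimates from the paper). It uses `join` rather than multi-column assignment, which is a stylistic choice of this sketch.

```python
import pandas as pd

# Placeholder results, deliberately out of chronological order.
results = pd.DataFrame(
    {
        "event_year": [1995, 1988, 1993],
        "point_estimate": [0.30, 0.00, 0.20],
        "bootstrap_ci": [(0.10, 0.50), (-0.10, 0.10), (0.05, 0.35)],
    }
)

# Sort by year so the plotted line runs left to right.
results = results.sort_values("event_year").copy()

# Expand the tuple column into two named columns, aligned on the row index.
bounds = pd.DataFrame(
    results["bootstrap_ci"].tolist(),
    index=results.index,
    columns=["ci_lower", "ci_upper"],
)
results = results.join(bounds)
print(results[["event_year", "point_estimate", "ci_lower", "ci_upper"]])
```

Passing `index=results.index` matters after sorting: without it the new columns would be aligned on a fresh 0..n-1 index and attached to the wrong years.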

# ------------------------------------------------------------------------------
# Task 25, Step 3: Helper to Document Structural Parameters
# ------------------------------------------------------------------------------

def _summarize_structural_parameters(
    results: Dict[str, Any]
) -> None:
    """
    Prints a formatted summary of the recovered structural parameters from Task 20.

    This function extracts the final structural estimates from the results
    dictionary and presents them in a clear, human-readable format. It also
    includes the crucial validation check that contrasts the main results with
    the biased estimates that would be obtained using the naive regional wage effect.

    Args:
        results (Dict[str, Any]): The comprehensive results dictionary from the
                                  main analysis pipeline.
    """
    # --- Data Extraction ---

    # Safely navigate to the structural parameters section of the results.
    params = results.get('structural_parameters')

    # If the parameters are not found, print a message and exit.
    if not params:
        print("Structural parameters not found in results dictionary.")
        return

    # --- Print Formatted Summary ---

    print("\n" + "="*80 + "\nSummary of Recovered Structural Parameters (Task 20)\n" + "="*80)
    print("Main Results (using Pure Wage Effect γ^W):")
    print(f"  Efficiency Scaling Factor (c):                                {params['structural_parameters']['c_scaling_factor']:.3f}")
    print(f"  Population-Weighted Supply Elasticity (η̄^P):                  {params['structural_parameters']['eta_P_population_supply_elasticity']:.3f}")
    print(f"  Efficiency-Weighted Supply Elasticity (η̄^E):                  {params['structural_parameters']['eta_E_efficiency_supply_elasticity']:.3f}")
    print(f"  Inverse Labor Demand Elasticity (φ):                          {params['structural_parameters']['phi_inverse_demand_elasticity']:.3f}")
    print(f"  IMPLIED LABOR DEMAND ELASTICITY (1/φ):                        {params['structural_parameters']['demand_elasticity']:.3f}")
    print("-" * 80)
    print("Validation Check (using naive Regional Wage Effect γ^R):")
    print(f"  Naive Inverse Demand Elasticity (φ_naive):                    {params['validation_with_regional_effect']['phi_naive']:.3f}")
    print(f"  NAIVE IMPLIED DEMAND ELASTICITY:                              {params['validation_with_regional_effect']['demand_elasticity_naive']:.3f}")
    print("="*80)

# ------------------------------------------------------------------------------
# Task 25, Step 2 (continued): Helper to Generate All Figure Data
# ------------------------------------------------------------------------------

def _generate_all_figure_data(
    event_study_df: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    regional_panel: pd.DataFrame,
    master_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Generates the full time-series of estimation results needed for all event-study figures.

    This is a computationally intensive helper that runs the key estimations for
    every pre- and post-treatment year to produce a tidy DataFrame of results
    suitable for plotting.

    Args:
        event_study_df (pd.DataFrame): The main analysis dataset from Task 9.
        analysis_panel (pd.DataFrame): The full worker-year panel.
        regional_panel (pd.DataFrame): The aggregated municipality-year panel.
        master_config (Dict[str, Any]): The master configuration dictionary.

    Returns:
        pd.DataFrame: A tidy DataFrame with all estimation results across all years.
    """
    # Define the full range of years for the event-study plot.
    event_years = master_config["temporal_parameters"]["PRE_TREATMENT_YEARS_FOR_TESTING"] + \
                  master_config["temporal_parameters"]["POST_TREATMENT_YEARS"]

    all_results = []

    # Loop through each year to generate the full time series of coefficients.
    for year in sorted(event_years):
        print(f"\n--- Generating Figure Data for Year: {year} ---")

        # --- Regional Employment Effect ---
        res_emp = estimate_regional_effect_2sls(event_study_df, master_config, 'emp_outcome', year)
        if res_emp:
            res_emp['outcome_name'] = 'Regional Employment'
            all_results.append(res_emp)

        # --- Regional Wage Effect ---
        res_wage = estimate_regional_effect_2sls(event_study_df, master_config, 'wage_outcome', year)
        if res_wage:
            res_wage['outcome_name'] = 'Regional Wage'
            all_results.append(res_wage)

        # --- Pure Wage Effect ---
        res_pure_wage = _estimate_pure_wage_effect_stayers(analysis_panel, event_study_df, master_config, year)
        if res_pure_wage:
            res_pure_wage['outcome_name'] = 'Pure Wage'
            all_results.append(res_pure_wage)

        # --- Displacement Effect ---
        # This requires re-computing the displacement share for each year.
        disp_df = _compute_employment_flow_shares(analysis_panel, regional_panel, year, master_config)
        temp_event_df = event_study_df.merge(disp_df[['displacement_share']], on='Municipality_ID', how='left')
        res_disp = estimate_regional_effect_2sls(temp_event_df, master_config, 'displacement_share', year)
        if res_disp:
            res_disp['outcome_name'] = 'Displacement'
            all_results.append(res_disp)

    return pd.DataFrame(all_results)

# ------------------------------------------------------------------------------
# Task 25, Orchestrator Function
# ------------------------------------------------------------------------------

def compile_final_outputs(
    final_results: Dict[str, Any],
    event_study_df: pd.DataFrame,
    analysis_panel: pd.DataFrame,
    regional_panel: pd.DataFrame,
    master_config: Dict[str, Any]
) -> None:
    """
    Orchestrates the generation of all final tables, figures, and summaries.

    This function serves as the final reporting layer of the entire analysis
    pipeline. It takes the comprehensive results dictionary from the main
    orchestrator and uses a series of specialized helper functions to generate
    and display publication-quality outputs that replicate the key exhibits
    from the paper.

    Args:
        final_results (Dict[str, Any]): The nested dictionary of results from
            the `run_full_analysis_pipeline` function.
        event_study_df (pd.DataFrame): The main analysis dataset, needed to
            generate full event-study series for plotting.
        analysis_panel (pd.DataFrame): The full worker-year panel, needed for
            individual-level estimations for plotting.
        regional_panel (pd.DataFrame): The aggregated municipality-year panel.
        master_config (Dict[str, Any]): The master configuration dictionary.
    """
    # --- Step 1: Reproduce Main Tables ---
    print("\n" + "="*80 + "\nGENERATING PUBLICATION TABLES\n" + "="*80)
    for year in final_results.get('main_tables', {}).keys():
        print(create_table_1(final_results, year).to_string())
        print("\n" + "-"*80 + "\n")
        print(create_table_2(final_results, year).to_string())
        # ... calls to create other tables (e.g., Table 3, 4) would follow here ...

    # --- Step 2: Reproduce Main Figures ---
    print("\n" + "="*80 + "\nGENERATING EVENT-STUDY FIGURES\n" + "="*80)

    # a. Generate all time-series data required for the figures in one go.
    figure_data_df = _generate_all_figure_data(
        event_study_df, analysis_panel, regional_panel, master_config
    )

    # b. Plot Figure 1A (Regional Employment vs. Displacement).
    # The paper plots -displacement, so we flip the sign of the point estimate and CIs.
    disp_plot_data = figure_data_df[figure_data_df['outcome_name'] == 'Displacement'].copy()
    disp_plot_data['point_estimate'] *= -1
    disp_plot_data['bootstrap_ci'] = disp_plot_data['bootstrap_ci'].apply(lambda x: (-x[1], -x[0]))

    _plot_multi_series_event_study(
        results_map={
            'Regional Employment Effect': figure_data_df[figure_data_df['outcome_name'] == 'Regional Employment'],
            'Displacement Effect (-)': disp_plot_data
        },
        title='Figure 1A (Replica): Impact on Regional Employment and Displacement',
        y_label='Coefficient'
    )

    # c. Plot Figure 1B (Regional Wage vs. Pure Wage).
    _plot_multi_series_event_study(
        results_map={
            'Regional Wage Effect': figure_data_df[figure_data_df['outcome_name'] == 'Regional Wage'],
            'Pure Wage Effect': figure_data_df[figure_data_df['outcome_name'] == 'Pure Wage']
        },
        title='Figure 1B (Replica): Regional vs. Pure Wage Effects',
        y_label='Coefficient'
    )

    # d. Plot Figure 3 (Apprenticeship Uptake).
    fig3_results_df = pd.DataFrame(final_results.get('event_studies', {}).get('apprenticeship_uptake', []))
    _plot_multi_series_event_study(
        results_map={'Apprenticeship Uptake': fig3_results_df},
        title='Figure 3 (Replica): Impact on Native Apprenticeships',
        y_label='Coefficient'
    )

    # --- Step 3: Document Structural Parameters ---
    _summarize_structural_parameters(final_results)

    print("\nTask 25: Final output compilation completed successfully.")
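
When `compile_final_outputs` plots the negative of the displacement effect, the confidence bounds must be swapped as well as negated, because multiplying an interval by -1 reverses its order. A minimal sketch of that transformation follows; `negate_interval` is a hypothetical helper name and the numbers are placeholders.

```python
import pandas as pd


def negate_interval(ci):
    """Return the interval for -x given the (lower, upper) interval for x."""
    lower, upper = ci
    # The old upper bound becomes the new lower bound, and vice versa.
    return (-upper, -lower)


# Placeholder displacement estimates with (lower, upper) intervals.
disp = pd.DataFrame(
    {
        "point_estimate": [0.25, 0.40],
        "bootstrap_ci": [(0.10, 0.40), (0.20, 0.60)],
    }
)

flipped = disp.copy()
flipped["point_estimate"] = -flipped["point_estimate"]
flipped["bootstrap_ci"] = flipped["bootstrap_ci"].apply(negate_interval)
print(flipped)
```

Negating the bounds without swapping them would leave `ci_lower` above `ci_upper`, and the shaded band would no longer bracket the point estimate correctly.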


# Top-Level Orchestrator

# ==============================================================================
# Final Task: Top-Level Orchestrator for Full Project Execution
# ==============================================================================

# ------------------------------------------------------------------------------
# Top-Level Orchestrator, helper function for data preparation
# ------------------------------------------------------------------------------

def _run_and_cache_prep_pipeline(
    initial_artifacts: Dict[str, Any],
    master_config: Dict[str, Any],
    cache_path: Path,
    force_rerun_prep: bool
) -> Dict[str, Any]:
    """
    Executes the full data preparation pipeline (Tasks 3-9) with robust caching.

    This function manages the sequential execution of all data preparation tasks,
    from cleansing raw data to creating the final event-study DataFrame. It
    implements a waterfall caching logic: if a step is re-run, all subsequent
    steps are also re-run to ensure data consistency.

    Args:
        initial_artifacts (Dict[str, Any]): Dictionary containing the initial
            validated artifacts from Task 1 & 2 (e.g., raw data, auxiliary tables).
        master_config (Dict[str, Any]): The master configuration dictionary.
        cache_path (Path): The path to the cache directory.
        force_rerun_prep (bool): If True, ignores all caches and re-runs all steps.

    Returns:
        Dict[str, Any]: The dictionary of artifacts, now populated with all
                        DataFrames from the preparation pipeline.
    """
    # This dictionary will hold all major data artifacts.
    artifacts = initial_artifacts.copy()

    # This flag controls the cache waterfall. If it becomes False, all subsequent steps must re-run.
    is_cache_valid = not force_rerun_prep

    # Define the full data preparation pipeline as a list of tasks.
    # Each tuple: (name, function, input_map, output_keys, cache_filename)
    prep_pipeline = [
        ("3: Cleanse & Canonicalize", cleanse_and_canonicalize_spells,
         {'consolidated_df_raw': artifacts['consolidated_df_raw']},
         ['df_normalized', 'panel_full_with_flags', 'panel_main_analysis'], 'task3.pkl'),
        ("4: Impute Wages", impute_censored_wages,
         {'worker_year_panel': 'panel_main_analysis'},
         'panel_main_with_wages', 'task4.pkl'),
        ("5: Build Analysis Panel", build_analysis_panel,
         {'analysis_panel_employed': 'panel_main_with_wages', 'all_spells_cleansed': 'df_normalized', 'validated_artifacts': artifacts},
         'analysis_panel', 'task5.pkl'),
        ("6: Aggregate Regional Panel", aggregate_to_regional_panel,
         {'analysis_panel': 'analysis_panel'},
         ['regional_panel', 'national_wage_series'], 'task6.pkl'),
        ("7: Construct Shock", construct_immigration_shock,
         {'analysis_panel': 'analysis_panel', 'regional_panel': 'regional_panel'},
         'shock_df', 'task7.pkl'),
        ("8: Construct Instruments", construct_instrumental_variables,
         {'analysis_panel': 'analysis_panel', 'validated_artifacts': artifacts},
         'instruments_df', 'task8.pkl'),
        ("9: Prepare Event Study DF", prepare_event_study_dataset,
         {'regional_panel': 'regional_panel', 'shock_df': 'shock_df',
          'instruments_df': 'instruments_df', 'analysis_panel': 'analysis_panel'},
         'event_study_df', 'task9.pkl')
    ]

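    # Make sure the cache directory exists before any step reads from or writes to it.
    cache_path.mkdir(parents=True, exist_ok=True)

    # Execute the preparation pipeline.
    for name, func, kwargs_map, out_keys, cache_file in prep_pipeline:
        task_start = time.time()
        cache_filepath = cache_path / cache_file

        if is_cache_valid and cache_filepath.exists():
            print(f"Loading cached result for Task '{name}'...")
            with open(cache_filepath, 'rb') as f:
                result = pickle.load(f)
        else:
            print(f"Running Task '{name}'...")
            is_cache_valid = False  # Invalidate cache for all subsequent steps
            # String values in the input map are keys into `artifacts`. Any other
            # value (e.g., the raw DataFrame or the artifacts dict itself) is passed
            # through unchanged; using it as a dictionary key would raise a TypeError.
            kwargs = {
                k: (artifacts[v] if isinstance(v, str) else v)
                for k, v in kwargs_map.items()
            }
            kwargs['master_config'] = master_config
            result = func(**kwargs)
            with open(cache_filepath, 'wb') as f:
                pickle.dump(result, f)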

        if isinstance(out_keys, list):
            # Multi-output tasks return a sequence aligned with `out_keys`.
            for key, value in zip(out_keys, result):
                artifacts[key] = value
        else:
            artifacts[out_keys] = result

        duration = time.time() - task_start
        print(f"--- Task '{name}' completed in {duration:.2f} seconds. ---\n")

    return artifacts
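
The waterfall rule described in the docstring above (once one step re-runs, every later step re-runs too) can be demonstrated on a toy pipeline. The sketch below is a simplified illustration under stated assumptions: the function `run_with_waterfall_cache`, the three toy steps, and the single-input step signature are hypothetical and much simpler than the toolkit's task tuples.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any], Optional[str]]


def run_with_waterfall_cache(
    steps: List[Step], cache_dir: Path, force_rerun: bool = False
) -> Tuple[Dict[str, Any], List[str]]:
    """Run steps in order, reusing caches until the first miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_valid = not force_rerun
    artifacts: Dict[str, Any] = {}
    executed: List[str] = []  # Names of steps that were actually recomputed.

    for name, func, input_key in steps:
        cache_file = cache_dir / f"{name}.pkl"
        if cache_valid and cache_file.exists():
            with open(cache_file, "rb") as fh:
                result = pickle.load(fh)
        else:
            # First miss: downstream caches may be stale, so stop trusting them.
            cache_valid = False
            arg = artifacts[input_key] if input_key is not None else None
            result = func(arg)
            with open(cache_file, "wb") as fh:
                pickle.dump(result, fh)
            executed.append(name)
        artifacts[name] = result
    return artifacts, executed


# Toy three-step pipeline: load -> double -> total.
steps: List[Step] = [
    ("load", lambda _: [1, 2, 3], None),
    ("double", lambda xs: [2 * x for x in xs], "load"),
    ("total", lambda xs: sum(xs), "double"),
]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    _, first_run = run_with_waterfall_cache(steps, cache)   # Cold cache.
    _, second_run = run_with_waterfall_cache(steps, cache)  # Warm cache.
    (cache / "double.pkl").unlink()                         # Drop a middle step.
    artifacts, third_run = run_with_waterfall_cache(steps, cache)

print(first_run, second_run, third_run, artifacts["total"])
```

In the third run, `total` is recomputed even though `total.pkl` still exists, because the cache for the step it depends on was missing. The trade-off is that the rule is conservative: it also re-runs later steps that do not depend on the invalidated one.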

# ------------------------------------------------------------------------------
# Top-Level Orchestrator: Main Function
# ------------------------------------------------------------------------------

def execute_decomposition_toolkit_pipeline(
    raw_data_path: str,
    master_config: Dict[str, Any],
    main_result_years: List[int] = [1993, 1995],
    cache_dir: str = "./.cache/",
    force_rerun_prep: bool = False
) -> Dict[str, Any]:
    """
    Executes the complete, end-to-end analysis pipeline for the study.

    This top-level orchestrator serves as the main entry point for the entire
    replication project. It manages the full workflow, from initial data
    validation to the final generation of tables and figures. The pipeline is
    divided into four distinct phases:

    1.  **Validation**: All input data, auxiliary files, and the master
        configuration are rigorously validated before any processing begins.
    2.  **Data Preparation**: A sequence of tasks (3-9) that transform the raw
        spell data into the analysis-ready artifacts (`analysis_panel`,
        `event_study_df`, etc.). This phase is cached to accelerate re-runs.
    3.  **Main Analysis**: The core estimation tasks (10-20) are run to produce
        the main findings for the specified result years.
    4.  **Robustness & Output**: A comprehensive suite of robustness checks
        (Tasks 22-24) is executed, and the final results are compiled into
        publication-quality tables and figures (Task 25).

    Args:
        raw_data_path (str): The file path to the raw, consolidated spell-level
                             DataFrame (e.g., a Parquet file).
        master_config (Dict[str, Any]): The master configuration dictionary that
                                        governs the entire analysis.
        main_result_years (List[int]): A list of the primary event years for which
                                       to generate the main decomposition tables.
        cache_dir (str): A directory path for storing and retrieving intermediate
                         data artifacts to speed up subsequent runs.
        force_rerun_prep (bool): If True, all data preparation steps will be
                                 re-computed, ignoring any existing cache.

    Returns:
        Dict[str, Any]: A comprehensive, nested dictionary containing all generated
                        artifacts and estimation results from the entire pipeline.

    Raises:
        Exception: Propagates any exception that occurs during the pipeline
                   execution after logging a failure message.
    """
    # --- Phase 0: Setup ---

    # Record the start time to measure total execution duration.
    start_time = time.time()

    # Print a banner to indicate the start of the pipeline.
    print("="*80 + "\nSTARTING TOP-LEVEL ORCHESTRATOR\n" + "="*80)

    # This dictionary will hold all major data artifacts as they are created.
    artifacts: Dict[str, Any] = {}

    # This dictionary will hold all final estimation results.
    all_results: Dict[str, Any] = {'main_tables': {}, 'event_studies': {}, 'robustness_checks': {}}

    try:
        # --- Phase 1: Validation ---

        # Start timer for this phase.
        phase_start = time.time()
        print("--- Phase 1: Validating Inputs ---")

        # Load the raw data from the specified path.
        consolidated_df_raw = pd.read_parquet(raw_data_path)
        artifacts['consolidated_df_raw'] = consolidated_df_raw

        # Run Task 1: Validate the raw data schema and integrity.
        validate_consolidated_df_raw(consolidated_df_raw, master_config)

        # Run Task 2: Validate the config and load/validate all auxiliary data artifacts.
        artifacts.update(validate_artifacts_and_config(master_config, consolidated_df_raw))

        # Log completion of the phase.
        print(f"\n>>> PHASE 'Validation' completed in {time.time() - phase_start:.2f} seconds. <<<\n")

        # --- Phase 2: Data Preparation ---

        # Start timer for this phase.
        phase_start = time.time()
        print("--- Phase 2: Preparing Data Artifacts (with Caching) ---")

        # Execute the full data preparation pipeline (Tasks 3-9) with caching.
        # This populates the `artifacts` dictionary with all necessary DataFrames.
        artifacts.update(_run_and_cache_prep_pipeline(
            artifacts, master_config, Path(cache_dir), force_rerun_prep
        ))

        # Log completion of the phase.
        print(f"\n>>> PHASE 'Data Preparation' completed in {time.time() - phase_start:.2f} seconds. <<<\n")

        # --- Phase 3: Main Analysis ---

        # Start timer for this phase.
        phase_start = time.time()
        print("--- Phase 3: Running Main Analysis ---")

        # Loop through the specified years to generate the main table results.
        for year in main_result_years:
            # Initialize a dictionary to hold results for the current year.
            year_results: Dict[str, Any] = {}
            print(f"\n{'='*30} RUNNING ESTIMATIONS FOR YEAR: {year} {'='*30}\n")

            # Task 12: Estimate regional and pure wage effects.
            wage_effects = estimate_wage_effects(
                artifacts['event_study_df'], artifacts['analysis_panel'], master_config, year
            )
            year_results['wage_effects'] = wage_effects

            # Task 11: Decompose regional employment effect.
            year_results['employment_decomposition'] = decompose_regional_employment_effect(
                artifacts['analysis_panel'], artifacts['regional_panel'], artifacts['event_study_df'], master_config, year
            )
            # Task 13: Decompose regional wage effect (depends on Task 12 results).
            year_results['wage_decomposition'] = decompose_regional_wage_effect(
                artifacts['analysis_panel'], artifacts['event_study_df'], wage_effects, master_config, year
            )
            # Task 14: Run selection bounding on the pure wage effect.
            year_results['selection_bounds'] = bound_pure_wage_effect_selection(
                artifacts['analysis_panel'], artifacts['event_study_df'], wage_effects['pure_wage_effect'], master_config, year
            )
            # Task 15: Analyze non-employed entrants.
            year_results['non_employed_entrants'] = analyze_non_employed_entrants(
                artifacts['analysis_panel'], artifacts['regional_panel'], artifacts['national_wage_series'], artifacts['event_study_df'], master_config, year
            )
            # Task 16: Analyze older workers.
            year_results['older_workers'] = analyze_older_workers(
                artifacts['analysis_panel'], artifacts['regional_panel'], artifacts['event_study_df'], master_config, year
            )
            # Task 17: Analyze task heterogeneity.
            year_results['task_heterogeneity'] = analyze_task_heterogeneity(
                artifacts['analysis_panel'], artifacts['event_study_df'], master_config, year
            )
            # Task 18: Decompose routine employment.
            year_results['routine_decomposition'] = decompose_routine_employment(
                artifacts['analysis_panel'], artifacts['event_study_df'], master_config, year
            )
            # Store all results for this year.
            all_results['main_tables'][year] = year_results

        # Task 19: Run event study for apprenticeship uptake (runs over all years).
        all_results['event_studies']['apprenticeship_uptake'] = analyze_apprenticeship_uptake(
            artifacts['panel_full_with_flags'], artifacts['event_study_df'], master_config
        )

        # Task 20: Recover structural parameters using results from the first main year.
        main_year = main_result_years[0]
        # Alias the main-year result tables once to keep the lookups below readable.
        main_year_tables = all_results['main_tables'][main_year]
        all_results['structural_parameters'] = recover_structural_parameters(
            reduced_form_estimates={
                'beta_R': main_year_tables['employment_decomposition']['total_effect']['point_estimate'],
                'gamma_R': main_year_tables['wage_effects']['regional_wage_effect']['point_estimate'],
                'gamma_W': main_year_tables['wage_effects']['pure_wage_effect']['point_estimate'],
            },
            analysis_panel=artifacts['analysis_panel'],
            master_config=master_config
        )

        # Log completion of the phase.
        print(f"\n>>> PHASE 'Main Analysis' completed in {time.time() - phase_start:.2f} seconds. <<<\n")

        # --- Phase 4: Robustness & Output ---

        # Start timer for this phase.
        phase_start = time.time()
        print("--- Phase 4: Running Robustness Checks and Compiling Outputs ---")

        # Task 22: Run pre-trend analysis and generate event-study plots.
        run_robustness_checks(
            artifacts['event_study_df'], artifacts['analysis_panel'], all_results, master_config
        )
        # Task 23: Run sensitivity analyses for the main result year.
        all_results['robustness_checks']['sensitivity'] = run_sensitivity_analyses(
            raw_data_path, artifacts['event_study_df'], master_config, main_year
        )
        # Task 24: Run pseudo-panel cross-check against the main-year wage effects.
        all_results['robustness_checks']['pseudo_panel'] = run_pseudo_panel_check(
            artifacts['analysis_panel'],
            artifacts['event_study_df'],
            all_results['main_tables'][main_year]['wage_effects'],
            master_config,
            main_year,
        )
        # Task 25: Compile all results into final tables and figures.
        compile_final_outputs(
            all_results, artifacts['event_study_df'], artifacts['analysis_panel'], artifacts['regional_panel'], master_config
        )

        # Log completion of the phase.
        print(f"\n>>> PHASE 'Robustness & Output' completed in {time.time() - phase_start:.2f} seconds. <<<\n")

    except Exception as e:
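        # Catch any exception raised during the pipeline, log its type and message, and re-raise.
        # Including the type keeps errors such as a bare KeyError (whose message is only the key) readable.
        print(f"\n\nPIPELINE FAILED WITH ERROR: {type(e).__name__}: {e}")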
        raise

    finally:
        # --- Finalization ---
        # This block runs whether the pipeline succeeded or failed, so the message
        # reports elapsed time only and does not claim successful completion.
        total_duration = time.time() - start_time
        print("=" * 80 + f"\nTOP-LEVEL ORCHESTRATOR FINISHED in {total_duration / 60:.2f} minutes.\n" + "=" * 80)

    # Return the comprehensive dictionary of all generated results.
    return all_results
