# README.md

# Brexit-Related Uncertainty Index (BRUI)

<!-- PROJECT SHIELDS -->
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Type Checking: mypy](https://img.shields.io/badge/type_checking-mypy-blue)](http://mypy-lang.org/)
[![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=flat&logo=pandas&logoColor=white)](https://pandas.pydata.org/)
[![NumPy](https://img.shields.io/badge/numpy-%23013243.svg?style=flat&logo=numpy&logoColor=white)](https://numpy.org/)
[![SciPy](https://img.shields.io/badge/SciPy-%230C55A5.svg?style=flat&logo=scipy&logoColor=white)](https://scipy.org/)
[![Statsmodels](https://img.shields.io/badge/Statsmodels-150458.svg?style=flat&logo=python&logoColor=white)](https://www.statsmodels.org/stable/index.html)
[![Matplotlib](https://img.shields.io/badge/Matplotlib-%23ffffff.svg?style=flat&logo=Matplotlib&logoColor=black)](https://matplotlib.org/)
[![NLTK](https://img.shields.io/badge/NLTK-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.nltk.org/)
[![spaCy](https://img.shields.io/badge/spaCy-09A3D5?style=flat&logo=spacy&logoColor=white)](https://spacy.io/)
[![Jupyter](https://img.shields.io/badge/Jupyter-%23F37626.svg?style=flat&logo=Jupyter&logoColor=white)](https://jupyter.org/)
[![arXiv](https://img.shields.io/badge/arXiv-2507.02439-b31b1b.svg)](https://arxiv.org/abs/2507.02439)
[![DOI](https://img.shields.io/badge/DOI-10.48550/arXiv.2507.02439-blue)](https://doi.org/10.48550/arXiv.2507.02439)
[![Research](https://img.shields.io/badge/Research-Quantitative%20Finance-green)](https://github.com/chirindaopensource/brexit_uncertainty_index_covid_uncertainty_index)
[![Discipline](https://img.shields.io/badge/Discipline-Econometrics-blue)](https://github.com/chirindaopensource/brexit_uncertainty_index_covid_uncertainty_index)
[![Methodology](https://img.shields.io/badge/Methodology-NLP%20%26%20VAR-orange)](https://github.com/chirindaopensource/brexit_uncertainty_index_covid_uncertainty_index)
[![Text Processing](https://img.shields.io/badge/Text-Processing-blue)](https://spacy.io/)
[![Time Series](https://img.shields.io/badge/Time%20Series-Analysis-red)](https://www.statsmodels.org/stable/index.html)
[![Data Source](https://img.shields.io/badge/Data%20Source-EIU%20Reports-lightgrey)](https://www.eiu.com/)
[![Year](https://img.shields.io/badge/Year-2025-purple)](https://github.com/chirindaopensource/brexit_uncertainty_index_covid_uncertainty_index)

**Repository:** https://github.com/chirindaopensource/brexit_uncertainty_index_covid_uncertainty_index

**Owner:** 2025 Craig Chirinda (Open Source Projects)

This repository contains an **independent** implementation of the research methodology from the 2025 paper entitled **"Introducing a New Brexit-Related Uncertainty Index: Its Evolution and Economic Consequences"** by:

*   Ismet Gocer
*   Julia Darby
*   Serdar Ongan

The project provides an end-to-end Python pipeline for constructing a monthly, text-based measure of geopolitical risk, specifically "Brexit uncertainty." It turns this abstract concept into a quantifiable metric and then analyzes its macroeconomic impact using Vector Autoregression (VAR) models.

## Table of Contents

- [Introduction](#introduction)
- [Theoretical Background](#theoretical-background)
- [Features](#features)
- [Methodology Implemented](#methodology-implemented)
- [Core Components (Notebook Structure)](#core-components-notebook-structure)
- [Key Callable: run_brexit_uncertainty_analysis](#key-callable-run_brexit_uncertainty_analysis)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Input Data Structure](#input-data-structure)
- [Usage](#usage)
- [Output Structure](#output-structure)
- [Project Structure](#project-structure)
- [Customization](#customization)
- [Contributing](#contributing)
- [License](#license)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)

## Introduction

This project provides a Python implementation of the methodologies presented in the 2025 paper "Introducing a New Brexit-Related Uncertainty Index: Its Evolution and Economic Consequences." The core of this repository is the Jupyter notebook `brexit_related_uncertainty_index_draft.ipynb`, which contains the functions needed to construct the Brexit-Related Uncertainty Index (BRUI) and its complementary COVID-19 Related Uncertainty Index (CRUI), and to analyze their economic consequences.

Measuring geopolitical risk is a critical challenge in modern finance and economics. Abstract concepts like "uncertainty" must be transformed into quantifiable metrics to be useful for risk pricing, capital allocation, and policy formulation. This framework provides a rigorous, data-driven approach to this problem.

This codebase enables researchers, policymakers, and financial analysts to:
-   Construct a monthly, text-based measure of Brexit uncertainty from raw documents.
-   Methodologically disentangle Brexit-related uncertainty from concurrent shocks like the COVID-19 pandemic.
-   Analyze the dynamic impact of uncertainty shocks on key macroeconomic variables.
-   Replicate and extend the findings of the original research paper.

## Theoretical Background

The implemented methods are grounded in a combination of advanced Natural Language Processing (NLP) and standard time-series econometrics:

**Context-Aware Uncertainty Attribution:** The core of the index construction is a novel algorithm that moves beyond simple keyword counting. It requires the co-occurrence of an "uncertainty" term and a "Brexit" term within a small, defined text window (10 words on either side). This ensures that the measured uncertainty is contextually relevant.

**Proportional Allocation for Disentanglement:** In periods where both Brexit and COVID-19 are discussed together, the algorithm does not discard the data. Instead, it uses a proportional allocation mechanism based on the relative frequency of "pure" Brexit and "pure" COVID-19 uncertainty mentions in the same document to disentangle the two effects.

**Vector Autoregression (VAR) Modeling:** To assess the economic impact of the newly constructed index, the pipeline employs a standard VAR model. This multivariate time-series model captures the dynamic interdependencies between the BRUI and key macroeconomic variables (GDP, CPI, trade, etc.).

**Cholesky Decomposition for Identification:** To identify an uncertainty shock, a Cholesky decomposition is applied to the covariance matrix of the VAR residuals. By ordering the BRUI first in the system, the model imposes the standard recursive assumption that uncertainty shocks are contemporaneously exogenous: they can affect the other variables within the same month, but are not themselves affected by those variables within that month.
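
The recursive identification described above can be illustrated with a minimal NumPy sketch. The two-variable VAR(1), its coefficient matrix `A`, and the residual covariance `sigma_u` are hypothetical numbers chosen for illustration; they are not estimates from the paper or from the notebook, which relies on `statsmodels` for estimation.

```python
import numpy as np

# Hypothetical bivariate VAR(1): y_t = A @ y_{t-1} + u_t, ordered [BRUI, GDP].
A = np.array([[0.5, 0.0],
              [-0.2, 0.3]])
# Hypothetical residual covariance matrix (illustrative numbers only).
sigma_u = np.array([[1.0, 0.3],
                    [0.3, 0.5]])

# Lower-triangular Cholesky factor, so that sigma_u = P @ P.T.
P = np.linalg.cholesky(sigma_u)


def orthogonalized_irf(A, P, horizon):
    """Return Theta_h = A^h @ P for h = 0..horizon, stacked along axis 0."""
    responses = []
    A_power = np.eye(A.shape[0])
    for _ in range(horizon + 1):
        responses.append(A_power @ P)
        A_power = A_power @ A
    return np.stack(responses)


# irf[h, i, j]: response of variable i at horizon h to a one-s.d. shock in j.
irf = orthogonalized_irf(A, P, horizon=12)
print(irf[0])  # impact matrix: the upper-right entry is zero by construction
```

Because `P` is lower triangular, the variable ordered first (here the BRUI) responds on impact only to its own shock, which is exactly the contemporaneous exogeneity assumption.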

## Features

The provided Jupyter notebook (`brexit_related_uncertainty_index_draft.ipynb`) implements the full research pipeline, including:

-   **Parameter Validation:** Rigorous checks for all input data and configurations to ensure methodological compliance.
-   **Data Cleansing:** Robust handling of missing and non-finite values, and precise temporal filtering.
-   **Advanced NLP Pipeline:** Text normalization, tokenization, context-aware stopword removal, and n-gram analysis.
-   **Model-Based Entity Recognition:** Use of spaCy's pretrained `en_core_web_lg` pipeline to prepare for entity-based analysis.
-   **Context-Aware Attribution Algorithm:** The core algorithm for identifying and classifying uncertainty mentions.
-   **Index Construction:** Proportional allocation, standardization, and max-normalization to create the final BRUI and CRUI.
-   **Econometric Data Preparation:** Systematic stationarity testing (ADF) and data transformations (log, differencing).
-   **VAR Modeling:** Automated optimal lag selection, model estimation, and diagnostic testing.
-   **Post-Estimation Analysis:** Calculation of Impulse Response Functions (IRFs), Forecast Error Variance Decompositions (FEVDs), and bootstrapped confidence intervals.
-   **Publication-Quality Visualization:** A suite of functions to generate the key figures from the paper.

## Methodology Implemented

The core analytical steps directly implement the methodology from the paper:

1.  **Text Corpus Processing (Steps 1-8):** The pipeline ingests monthly EIU reports, cleans the text, and processes it using a standard NLP workflow (normalization, tokenization, stopword removal).
2.  **Context-Aware Attribution (Step 9):** For each "uncertainty" keyword, a 21-word context window is analyzed. The presence of Brexit and/or COVID-19 keywords within this window determines the classification of the uncertainty mention.
3.  **Proportional Allocation (Step 10):** For "mixed" contexts containing both Brexit and COVID-19 keywords, the uncertainty count is allocated proportionally to the BRUI and CRUI based on the relative prevalence of "pure" mentions in the same document.
4.  **Index Finalization (Step 11):** The raw uncertainty counts are standardized by the total word count of the report and then normalized so that the maximum value of the index over the entire sample period is 100.
5.  **Econometric Analysis:** The final BRUI is integrated into a VAR model with key UK macroeconomic variables. The model is used to generate IRFs that trace the economic impact of a one-standard-deviation shock to the BRUI.
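
Step 9 above can be sketched in a few lines of plain Python. This is a simplified illustration, not the notebook's `attribute_uncertainty_in_corpus`: the function name is hypothetical, it matches single-word keywords only (no multi-word phrases), and the tiny lexicons in the example are placeholders.

```python
from typing import Dict, List, Set


def classify_uncertainty_mentions(
    tokens: List[str],
    uncertainty: Set[str],
    brexit: Set[str],
    covid: Set[str],
    half_window: int = 10,
) -> Dict[str, int]:
    """Count uncertainty mentions by the keywords found in their context window."""
    counts = {"pure_brexit": 0, "pure_covid": 0, "mixed": 0, "unattributed": 0}
    for i, token in enumerate(tokens):
        if token not in uncertainty:
            continue
        # half_window tokens on each side plus the term itself (21 tokens by default).
        window = tokens[max(0, i - half_window): i + half_window + 1]
        has_brexit = any(t in brexit for t in window)
        has_covid = any(t in covid for t in window)
        if has_brexit and has_covid:
            counts["mixed"] += 1
        elif has_brexit:
            counts["pure_brexit"] += 1
        elif has_covid:
            counts["pure_covid"] += 1
        else:
            counts["unattributed"] += 1
    return counts


# Placeholder text and lexicons, for illustration only.
tokens = "brexit risk remains high while pandemic uncertainty weighs on output".split()
print(classify_uncertainty_mentions(
    tokens, {"risk", "uncertainty"}, {"brexit"}, {"pandemic"}, half_window=2
))
```

With a narrow window the two mentions are attributed separately; widening the window to 10 makes both of them "mixed", which is the case the proportional allocation in Step 10 is designed to handle.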

## Core Components (Notebook Structure)

The `brexit_related_uncertainty_index_draft.ipynb` notebook is structured as a logical pipeline with modular functions for each task:

-   **Task 0: `validate_parameters`**: The initial quality gate for all inputs.
-   **Task 1: `cleanse_data`**: Handles data quality and temporal scoping.
-   **Task 2: `prepare_lexicons`**: Optimizes keyword lists for high-performance matching.
-   **Task 3: `preprocess_text_corpus`**: The foundational NLP pipeline.
-   **Task 4: `load_spacy_model_for_ner`, `apply_ner_to_corpus`**: Entity extraction with spaCy's pretrained `en_core_web_lg` pipeline.
-   **Task 5: `attribute_uncertainty_in_corpus`**: The core uncertainty attribution algorithm.
-   **Task 6 & 7: `calculate_brui`, `calculate_crui`**: Final index construction.
-   **Task 8: `prepare_data_for_var`**: Prepares data for econometric modeling.
-   **Task 9: `estimate_var_model`**: Estimates and identifies the VAR model.
-   **Task 10: `run_post_estimation_analysis`**: Computes IRFs, FEVDs, and confidence intervals.
-   **Task 11: `plot_...` functions**: The visualization suite.
-   **Main Orchestrator: `run_brexit_uncertainty_analysis`**: Executes the entire pipeline.

## Key Callable: run_brexit_uncertainty_analysis

The central function in this project is `run_brexit_uncertainty_analysis`. It orchestrates the entire analytical workflow from raw data to final results.

```python
def run_brexit_uncertainty_analysis(
    df_input: pd.DataFrame,
    uncertainty_lexicon: List[str],
    brexit_lexicon: List[str],
    covid_lexicon: List[str],
    index_construction_config: Dict[str, Any],
    econometric_analysis_config: Dict[str, Any],
    brexit_events_for_plotting: Dict[str, str],
    comparison_indices_df: Optional[pd.DataFrame] = None
) -> Dict[str, Any]:
    """
    Executes the end-to-end research pipeline for the Brexit Uncertainty Index.

    This orchestrator function serves as the master controller for the entire
    analysis. It sequentially executes all tasks from parameter validation to
    final visualization, ensuring a rigorous, reproducible, and auditable
    workflow. Each step's outputs are programmatically passed to the next, and
    all significant results, data, and logs are compiled into a comprehensive
    final dictionary.

    Args:
        df_input (pd.DataFrame): The raw input DataFrame containing monthly
            macroeconomic data and the EIU text corpus.
        uncertainty_lexicon (List[str]): The raw list of uncertainty keywords.
        brexit_lexicon (List[str]): The raw list of Brexit-related keywords.
        covid_lexicon (List[str]): The raw list of COVID-19 related keywords.
        index_construction_config (Dict[str, Any]): Configuration for the
            text-based index construction.
        econometric_analysis_config (Dict[str, Any]): Configuration for the
            econometric VAR analysis.
        brexit_events_for_plotting (Dict[str, str]): A dictionary of key Brexit
            events for annotating the final BRUI time-series plot.
        comparison_indices_df (pd.DataFrame, optional): A DataFrame containing
            alternative indices (e.g., BRUI_B, BRUI_C) for validation plotting.
            Must have a DatetimeIndex. Defaults to None.

    Returns:
        Dict[str, Any]: A comprehensive dictionary containing all results.
    """
    # ... (implementation)
```

## Prerequisites

-   Python 3.9+
-   Core dependencies: `pandas`, `numpy`, `statsmodels`, `matplotlib`, `scipy`, `nltk`, `spacy`.
-   NLTK data packages: `punkt`, `stopwords`.
-   SpaCy model: `en_core_web_lg`.

## Installation

1.  **Clone the repository:**
    ```sh
    git clone https://github.com/chirindaopensource/brexit_uncertainty_index_covid_uncertainty_index.git
    cd brexit_uncertainty_index_covid_uncertainty_index
    ```

2.  **Install Python dependencies:**
    ```sh
    pip install pandas numpy statsmodels matplotlib scipy nltk spacy
    ```

3.  **Download required NLP data:**
    ```sh
    python -m nltk.downloader punkt stopwords
    python -m spacy download en_core_web_lg
    ```

## Input Data Structure

The primary input is a `pandas.DataFrame` with the following structure:
-   **Index:** A `DatetimeIndex` with monthly frequency ('MS'), covering the full sample period (e.g., '2012-05-01' to '2025-01-01').
-   **Columns:**
    -   `BRUI`: A placeholder column (e.g., filled with zeros).
    -   `GDP`, `CPI`, `PPI`, `X`, `M`, `GBP_EUR`, `GBP_USD`, `EMP`, `UEMP`: Numeric columns containing the macroeconomic data.
    -   `EIU`: An object/string column containing the full text of the monthly EIU report.

See the usage example for a template of how to construct this DataFrame.
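
As a rough template, a DataFrame with this layout could be assembled as follows. The random numbers and the placeholder report text are stand-ins for real macroeconomic series and EIU reports, which are not distributed with this repository.

```python
import numpy as np
import pandas as pd

# Monthly index at month start, covering the sample period named above.
index = pd.date_range("2012-05-01", "2025-01-01", freq="MS")

macro_columns = ["GDP", "CPI", "PPI", "X", "M", "GBP_EUR", "GBP_USD", "EMP", "UEMP"]

# Synthetic values only; replace with real macroeconomic data.
rng = np.random.default_rng(0)
df_input = pd.DataFrame(
    rng.normal(size=(len(index), len(macro_columns))),
    index=index,
    columns=macro_columns,
)

# Placeholder BRUI column, filled with zeros as described above.
df_input.insert(0, "BRUI", 0.0)

# Placeholder text; each cell should hold the full text of that month's report.
df_input["EIU"] = "placeholder report text"

print(df_input.shape)
```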

## Usage

1.  **Prepare Inputs:** Construct the input DataFrame, lexicons, and configuration dictionaries as shown in the detailed usage example within the `brexit_related_uncertainty_index_draft.ipynb` notebook.
2.  **Open and Run Notebook:** Open the notebook in a Jupyter environment.
3.  **Execute All Cells:** Run all cells in the notebook to define the functions and prepare the example data.
4.  **Invoke the Orchestrator:** The final cells of the notebook demonstrate how to call the main `run_brexit_uncertainty_analysis` function with all the prepared inputs.
5.  **Analyze Outputs:** The returned dictionary will contain all results, logs, and figures generated by the pipeline.

## Output Structure

The `run_brexit_uncertainty_analysis` function returns a single, comprehensive dictionary with the following top-level keys:

-   `audit_logs`: Contains detailed logs from the data cleansing, VAR data preparation, and VAR estimation steps.
-   `final_data`: Contains key data artifacts, including the `prepared_lexicons` object, the final DataFrame with the `BRUI`, `CRUI`, and all intermediate calculation columns, and the stationary dataset used for the VAR.
-   `fitted_model`: Contains the fitted `VARResults` object (`statsmodels.tsa.vector_ar.var_model.VARResults`), which is the complete fitted VAR model.
-   `analysis_results`: Contains the results of the post-estimation analysis, including the IRF point estimates, confidence intervals, and FEVD summary tables.
-   `visualizations`: Contains the `matplotlib.figure.Figure` objects for the generated plots.

## Project Structure

```
brexit_uncertainty_index_covid_uncertainty_index/
│
├── brexit_related_uncertainty_index_draft.ipynb  # Main implementation notebook
├── LICENSE                                       # MIT license file
└── README.md                                     # This documentation file
```

## Customization

The pipeline is highly customizable via the `index_construction_config` and `econometric_analysis_config` dictionaries. Users can easily modify:
-   **Lexicons:** Add or remove keywords from the input lists.
-   **Context Window Size:** Change `context_window_size` in the configuration.
-   **VAR Model:** Add or remove variables, change the lag selection criterion, or modify the Cholesky ordering in the `econometric_analysis_config`.
-   **Post-Estimation:** Adjust the `horizon` for IRFs/FEVDs or the `confidence_interval_level`.
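
A configuration along these lines might look like the sketch below. Only `context_window_size`, `horizon`, and `confidence_interval_level` are key names taken from this README; the remaining key names and all values are hypothetical placeholders, so check the notebook's usage example for the schema that `validate_parameters` actually enforces.

```python
# Key named in this README; the value and its exact meaning are assumptions.
index_construction_config = {
    "context_window_size": 10,  # assumed: words on each side of the uncertainty term
}

econometric_analysis_config = {
    "horizon": 24,                      # key from this README; value is illustrative
    "confidence_interval_level": 0.95,  # key from this README; value is illustrative
    "lag_selection_criterion": "aic",   # hypothetical key name
    "cholesky_ordering": ["BRUI", "GDP", "CPI"],  # hypothetical key name; BRUI first
}
```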

## Contributing

Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

## Citation

If you use this code or the methodology in your research, please cite the original paper:

```bibtex
@article{gocer2025introducing,
  title={Introducing a New Brexit-Related Uncertainty Index: Its Evolution and Economic Consequences},
  author={Gocer, Ismet and Darby, Julia and Ongan, Serdar},
  journal={arXiv preprint arXiv:2507.02439},
  year={2025}
}
```

For the implementation itself, you may cite this repository:
```
Chirinda, C. (2025). A Python Implementation of the Brexit-Related Uncertainty Index (BRUI).
GitHub repository: https://github.com/chirindaopensource/brexit_uncertainty_index_covid_uncertainty_index
```

## Acknowledgments

-   Credit to Ismet Gocer, Julia Darby, and Serdar Ongan for their novel methodology in constructing a context-aware uncertainty index.
-   Thanks to the developers of the `statsmodels`, `pandas`, `spacy`, and `nltk` libraries, which are the foundational pillars of this analytical pipeline.

---

*This README was generated based on the structure and content of `brexit_related_uncertainty_index_draft.ipynb` and follows best practices for research software documentation.*


# Paper

Title: "Introducing a New Brexit-Related Uncertainty Index: Its Evolution and Economic Consequences"

Link: https://arxiv.org/abs/2507.02439

Authors: Ismet Gocer, Julia Darby, Serdar Ongan

Submission Date: 3 Jul 2025

Abstract:

Important game-changer economic events and transformations cause uncertainties that may affect investment decisions, capital flows, international trade, and macroeconomic variables. One such major transformation is Brexit, which refers to the multiyear process through which the UK withdrew from the EU. This study develops and uses a new Brexit-Related Uncertainty Index (BRUI). In creating this index, we apply Text Mining, Context Window, Natural Language Processing (NLP), and Large Language Models (LLMs) from Deep Learning techniques to analyse the monthly country reports of the Economist Intelligence Unit from May 2012 to January 2025. Additionally, we employ a standard vector autoregression (VAR) analysis to examine the model-implied responses of various macroeconomic variables to BRUI shocks. While developing the BRUI, we also create a complementary COVID-19 Related Uncertainty Index (CRUI) to distinguish the uncertainties stemming from these distinct events. Empirical findings and comparisons of BRUI with other earlier-developed uncertainty indexes demonstrate the robustness of the new index. This new index can assist British policymakers in measuring and understanding the impacts of Brexit-related uncertainties, enabling more effective policy formulation.

# Summary

#### **Step 1: The Core Problem and the Paper's Stated Contribution**

The fundamental problem the authors address is the difficulty in quantifying the *specific* economic uncertainty generated by the multi-year Brexit process. Previous attempts, they argue, have several limitations:
*   **Static Measures:** Using a simple dummy variable for "post-referendum" is too crude; it doesn't capture the evolving nature and intensity of uncertainty.
*   **Conflated Shocks:** Existing uncertainty indices (like the general Economic Policy Uncertainty index) struggle to disentangle Brexit-related anxiety from other major shocks, most notably the COVID-19 pandemic.
*   **Limited Scope:** Some specialized Brexit indices are based on surveys (which can be subjective), end too early (e.g., 2016), or haven't been updated.

The authors' primary contribution is the creation of a new, dynamic **Brexit-Related Uncertainty Index (BRUI)**. Its key purported advantages are:
1.  **Methodological Sophistication:** It uses modern Natural Language Processing (NLP) and a "Context Window" approach to ensure uncertainty keywords are directly related to Brexit.
2.  **Disentanglement:** It explicitly creates a parallel **COVID-19 Related Uncertainty Index (CRUI)** to methodologically separate and control for pandemic-induced uncertainty.
3.  **Temporal Coverage:** The index runs from May 2012 (when the term "Brexit" first appeared) to January 2025, covering the entire lifecycle from conception through implementation and its aftermath.
4.  **Data Source:** It relies on the Economist Intelligence Unit (EIU) country reports, which are more standardized and analytical than newspaper articles, potentially reducing sensationalism and editorial bias.

#### **Step 2: The Methodological Pipeline – From Text to Index**

This is the computational core of the paper. The process for creating the BRUI is a multi-stage data processing and analysis pipeline.

*   **a. Data Ingestion and Preparation:**
    *   The raw data is a corpus of monthly EIU reports for the UK.
    *   Using Python libraries (like `PyMuPDF`), they extract the raw text from PDFs and convert it to lowercase for consistency.
    *   Standard NLP pre-processing is applied using the Natural Language Toolkit (`NLTK`):
        *   **Tokenization:** Breaking text into individual words (tokens).
        *   **Stopword Removal:** Eliminating common, non-informative words ('the', 'a', 'is').

*   **b. The "Context Window" Algorithm (The Key Innovation):**
    This is the cleverest part of their methodology. Instead of just counting keyword frequencies in a document, they enforce a *contextual link*.
    1.  They define three distinct keyword lexicons (word lists), as shown in their **Table 1**:
        *   **Uncertainty Terms (U):** `uncertainty`, `volatile`, `risk`, `instability`, etc.
        *   **Brexit-Related Terms (BRK):** `brexit`, `article 50`, `customs union`, `leave the EU`, etc.
        *   **COVID-19-Related Terms (CRK):** `covid`, `pandemic`, `lockdown`, `vaccine`, etc.
    2.  The algorithm iterates through the text. Whenever it finds an **Uncertainty term (U)**, it opens a "Context Window" of 10 words before and 10 words after it.
    3.  It then applies a classification logic within this window:
        *   If a **BRK** is found in the window but no **CRK** is present, the uncertainty is counted as **purely Brexit-related**.
        *   If a **CRK** is found in the window but no **BRK** is present, the uncertainty is counted as **purely COVID-19-related** and feeds the CRUI.
        *   If both a **BRK** and a **CRK** are found, the uncertainty is classified as **mixed**. The authors then allocate this count proportionally based on the monthly frequency of pure Brexit vs. pure COVID-19 uncertainty mentions. This is a pragmatic solution to the attribution problem.

*   **c. Index Construction and Normalization:**
    1.  For each month, they calculate the **Total Brexit-Related Uncertainty Keyword Number (TBRUKN)**.
    2.  To account for varying report lengths, they perform **standardization**:
        `BRUI_raw = (TBRUKN for that month) / (Total number of words in that month's report)`
    3.  Finally, they perform **normalization**: The entire time series of `BRUI_raw` is scaled so that the peak value is 100. This makes the index easy to interpret and compare over time.
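
The allocation, standardization, and normalization steps can be sketched as below. The function names are hypothetical, and the even split used when a month has no pure mentions at all is an assumption of this sketch, not something stated in the paper.

```python
from typing import List, Tuple


def allocate_mixed(pure_brexit: int, pure_covid: int, mixed: int) -> Tuple[float, float]:
    """Split mixed mentions in proportion to the pure Brexit and pure COVID counts."""
    total_pure = pure_brexit + pure_covid
    # Assumption of this sketch: split evenly when there are no pure mentions.
    brexit_share = pure_brexit / total_pure if total_pure else 0.5
    brexit_total = pure_brexit + mixed * brexit_share
    covid_total = pure_covid + mixed * (1.0 - brexit_share)
    return brexit_total, covid_total


def build_index(keyword_counts: List[float], word_counts: List[int]) -> List[float]:
    """Standardize by report length, then rescale so the peak month equals 100."""
    raw = [k / w if w else 0.0 for k, w in zip(keyword_counts, word_counts)]
    peak = max(raw, default=0.0)
    return [100.0 * (r / peak) if peak else 0.0 for r in raw]


# Illustrative month: 6 pure Brexit, 2 pure COVID, 4 mixed mentions.
tbrukn, tcrukn = allocate_mixed(6, 2, 4)
# Illustrative three-month series of keyword totals and report lengths.
brui = build_index([tbrukn, 3.0, 0.0], [1000, 1000, 500])
print(tbrukn, tcrukn, brui)
```

Dividing by the peak before multiplying by 100 keeps the maximum of the series at exactly 100, which matches the normalization described in step 3.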

#### **Step 3: Empirical Findings – What the Index Reveals**

Having constructed the index, the authors use it in two ways: descriptive analysis and econometric modeling.

*   **a. The Evolution of Brexit Uncertainty (Figure 2):**
    The time-series plot of the BRUI is the paper's central visual. It clearly shows three phases:
    1.  **Pre-Brexit Period (2012-2016):** Low-level, rising uncertainty as the referendum is promised and approaches.
    2.  **Transition Period (2016-2020):** Extreme volatility and the highest peaks in uncertainty, corresponding to the referendum result, failed withdrawal agreements, the "Irish backstop" crisis, and general political chaos.
    3.  **Post-Brexit Period (2020-onward):** A lower, but still elevated and persistent, level of uncertainty related to the implementation of new trade rules, the Northern Ireland Protocol, and ongoing economic friction.

*   **b. Robustness Checks (Figures 3, 4, 5):**
    To validate their index, they correlate it with the other existing Brexit indices.
    *   **High Correlation (0.82 and 0.75)** with the Bloom et al. (BRUI_B) and Chung et al. (BRUI_C) indices suggests their BRUI is capturing a similar underlying phenomenon, which lends it credibility.
    *   **Low Correlation (0.35)** with the Baker et al. (BRUI_Baker) index is expected and logical, as that index ends in 2016, capturing only the initial, low-level phase of uncertainty.

*   **c. Econometric Impact Analysis (The VAR Model):**
    This is where they connect the index to the real economy. They use a standard **Vector Autoregression (VAR)** model, which examines the dynamic interrelationships between a set of variables.
    *   **Impulse-Response Functions (Figure 6):** This analysis answers the question: "What happens to the economy when there is a sudden, unexpected shock (a spike) in Brexit uncertainty?" The results are consistent with economic theory: a positive shock to BRUI leads to a statistically significant *decline* in GDP, exports, imports, and employment, and a *depreciation* of the British Pound against the Euro.
    *   **Forecast-Error Variance Decomposition (Table 3):** This quantifies the importance of BRUI shocks. It shows that a meaningful percentage of the forecast errors in key variables like GDP (3.4%), GBP/USD (4.41%), and Imports (2.49%) are explained by fluctuations in the BRUI. This demonstrates that Brexit uncertainty is not just noise but a tangible driver of macroeconomic outcomes.
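
To make the variance decomposition concrete, the following NumPy sketch computes FEVD shares from orthogonalized impulse responses. The VAR(1) coefficients and covariance matrix are hypothetical, so the resulting shares are illustrative and unrelated to the values in the paper's Table 3.

```python
import numpy as np


def fevd_from_orth_irf(orth_irf: np.ndarray) -> np.ndarray:
    """Return shares[h, i, j]: fraction of variable i's (h+1)-step forecast
    error variance attributable to orthogonalized shock j."""
    squared = orth_irf ** 2
    cumulative = np.cumsum(squared, axis=0)         # sum over horizons 0..h
    totals = cumulative.sum(axis=2, keepdims=True)  # sum over all shocks
    return cumulative / totals


# Hypothetical bivariate VAR(1), ordered [BRUI, GDP]; numbers are illustrative.
A = np.array([[0.5, 0.0], [-0.2, 0.3]])
P = np.linalg.cholesky(np.array([[1.0, 0.3], [0.3, 0.5]]))
orth_irf = np.stack([np.linalg.matrix_power(A, h) @ P for h in range(12)])

shares = fevd_from_orth_irf(orth_irf)
# Share of the GDP forecast error variance explained by BRUI shocks at 12 steps.
print(shares[-1, 1, 0])
```

For each variable and horizon the shares across shocks sum to one, so each entry reads directly as a percentage contribution.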

#### **Step 4: Conclusion and Critical Assessment**

The paper concludes that the BRUI is a robust and valuable tool. It confirms that Brexit was not a one-off event but a long-term source of structural uncertainty with persistent negative effects on the UK economy.

**My professorial critique:**
*   **Strengths:** The methodology for disentangling COVID-19 and Brexit uncertainty is the paper's strongest feature. The use of EIU reports is a sound choice for data. The VAR analysis provides a solid econometric foundation for their claims.
*   **Limitations (as acknowledged by the authors):** The index is, by definition, sensitive to the initial choice of keywords. While their list is comprehensive, it's not exhaustive. The reliance on a single data source (EIU) could introduce a specific institutional viewpoint, though it aids consistency.
*   **Avenues for Future Research:** The proportional allocation for "mixed" uncertainty is a reasonable heuristic, but more advanced machine learning models (e.g., topic modeling with attribution) could refine this. Applying this methodology to create sector-specific indices (e.g., for finance, manufacturing) would be a valuable extension.

In summary, this is a strong piece of applied econometric and computational work. It provides a tangible, data-driven tool that improves upon existing measures and offers clear, actionable insights into the economic consequences of a major geopolitical event. It's an excellent example of how modern data science techniques can be leveraged to answer pressing economic questions.

# Import Essential Modules

```python
"""
Professional Python imports following PEP-8 standards for Brexit uncertainty analysis.

This module provides all necessary imports for implementing the Brexit-Related
Uncertainty Index (BRUI) methodology, including text processing, statistical
analysis, and econometric modeling capabilities.

Standard: PEP-8 (https://pep8.org/)
Author: CS Chirinda
Date: 2025-07-06
"""

# =============================================================================
# STANDARD LIBRARY IMPORTS
# =============================================================================
import re
import unicodedata
from typing import Any, Dict, List, Optional, Set, Tuple, TypedDict

# =============================================================================
# THIRD-PARTY LIBRARY IMPORTS
# =============================================================================
# Data manipulation and numerical computing
import numpy as np
import pandas as pd

# Statistical analysis
from scipy.stats import pearsonr

# Natural language processing
import nltk
import spacy
from spacy.language import Language

# Econometric and time series analysis
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.vector_ar.var_model import VARResults

# Visualization
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
```


# Implementation

## Draft 1

### Main Constituents of the Draft

### **Exegesis of the Research Pipeline Components**

#### **1. `validate_parameters` (and its helpers)**
*   **Inputs:** Raw user-provided data and configurations: `df_input` (DataFrame), `uncertainty_lexicon` (list), `brexit_lexicon` (list), `covid_lexicon` (list), `index_construction_config` (dict), `econometric_analysis_config` (dict).
*   **Process:** This function and its helpers (`_validate_dataframe_structure`, `_validate_lexicons`, etc.) perform a comprehensive, non-destructive validation of all inputs. It traverses the data structures, comparing their properties (e.g., shape, size, data types, values) against a predefined schema derived directly from the research methodology. It aggregates any and all deviations into a list of errors.
*   **Output:** `None`. Its successful execution without raising a `ValueError` is a binary signal that all inputs are compliant.
*   **Data Transformation:** No data is transformed. This is a pure validation and assertion step.
*   **Role in Research Pipeline:** This callable serves as the **Methodological Compliance Gateway**. It ensures that the entire subsequent analysis is built upon a foundation that strictly adheres to the paper's specified sample periods, data formats, lexicons, and algorithmic parameters. It directly enforces the constraints mentioned throughout the methodology, such as the sample period ("May 2012 and January 2025"), the keyword lists from **Table 1**, and the algorithmic parameters.
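
The aggregate-then-raise pattern described above might look like the following sketch. It covers only lexicons and one configuration key, the helper names are hypothetical, and the real `validate_parameters` checks considerably more (including the DataFrame structure).

```python
from typing import Any, Dict, List


def collect_errors(lexicons: Dict[str, List[str]], config: Dict[str, Any]) -> List[str]:
    """Gather every validation problem instead of stopping at the first one."""
    errors: List[str] = []
    for name, keywords in lexicons.items():
        if not isinstance(keywords, list) or not keywords:
            errors.append(f"{name}: expected a non-empty list of keywords")
        elif any(not isinstance(k, str) or not k.strip() for k in keywords):
            errors.append(f"{name}: every keyword must be a non-blank string")
    window = config.get("context_window_size")
    if not isinstance(window, int) or window <= 0:
        errors.append("context_window_size: expected a positive integer")
    return errors


def validate_inputs(lexicons: Dict[str, List[str]], config: Dict[str, Any]) -> None:
    """Return None when inputs are compliant; otherwise raise one ValueError."""
    errors = collect_errors(lexicons, config)
    if errors:
        raise ValueError("Input validation failed:\n- " + "\n- ".join(errors))


# Compliant inputs pass silently.
validate_inputs({"brexit_lexicon": ["brexit"]}, {"context_window_size": 10})
```

Reporting all deviations in a single exception lets a user fix every input problem in one pass rather than rerunning the pipeline after each error.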

#### **2. `cleanse_data`**
*   **Inputs:** A validated `pd.DataFrame` and the `index_construction_config` dictionary.
*   **Process:** This function executes a three-part data integrity workflow.
    1.  **Cleansing:** It scans all non-text columns for a predefined set of string-based null representations (e.g., 'N/A', 'NULL') and replaces them with a standard `np.nan`. It also scans all columns for non-finite numeric values (`np.inf`, `-np.inf`) and replaces them with `np.nan`.
    2.  **Filtering:** It filters the DataFrame rows to conform to the precise temporal window specified in the configuration ('2012-05' to '2025-01').
    3.  **Verification:** It counts the number of non-null documents in the 'EIU' text column and verifies this count against the expected number from the configuration, raising an error if coverage is insufficient.
*   **Output:** A tuple containing: (1) a cleansed, temporally-scoped `pd.DataFrame`, and (2) a detailed audit log (`dict`) of all actions performed.
*   **Data Transformation:** The input DataFrame is transformed by (a) value replacement (e.g., 'N/A' -> `np.nan`) and (b) row filtering (slicing by date).
*   **Role in Research Pipeline:** This callable implements the **Data Integrity and Scoping** phase. It directly enforces the sample period from **Step 1** of the methodology and prepares the dataset for analysis by standardizing data quality, a crucial but often unstated prerequisite in academic papers.
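
A minimal sketch of the three-part workflow, assuming a small illustrative list of null markers and a hypothetical helper name `cleanse_sketch`. The actual marker list and audit-log layout used by `cleanse_data` may differ.

```python
import numpy as np
import pandas as pd

# Assumed null markers for illustration only.
NULL_MARKERS = ["N/A", "NULL", ""]


def cleanse_sketch(df, text_col, start, end):
    """Standardize nulls, drop non-finite values, and slice to [start, end]."""
    out = df.copy()
    for col in out.columns:
        if col == text_col:
            continue
        # Replace string null markers, then coerce the column back to numeric.
        out[col] = pd.to_numeric(out[col].replace(NULL_MARKERS, np.nan))
        # Replace +/- infinity with NaN.
        out[col] = out[col].replace([np.inf, -np.inf], np.nan)
    # Label-based slicing on a sorted DatetimeIndex includes both endpoints.
    out = out.loc[start:end]
    audit = {
        "rows_in": len(df),
        "rows_out": len(out),
        "documents": int(out[text_col].notna().sum()),
    }
    return out, audit


idx = pd.date_range("2012-03-01", periods=6, freq="MS")  # March to August 2012
demo = pd.DataFrame(
    {
        "GDP": [100.0, "N/A", 101.0, np.inf, 102.0, 103.0],
        "EIU": ["a", "b", "c", None, "e", "f"],
    },
    index=idx,
)
clean, audit = cleanse_sketch(demo, "EIU", "2012-05", "2012-07")
print(clean)
print(audit)
```

The returned `documents` count is what a coverage check would compare against `expected_document_count`.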

#### **3. `prepare_lexicons`**
*   **Inputs:** Three raw `list`s of strings: `uncertainty_lexicon`, `brexit_lexicon`, `covid_lexicon`.
*   **Process:** The function transforms the flat lists of keywords into a highly optimized, structured dictionary. For each lexicon, it:
    1.  Normalizes all keywords (lowercase, strip whitespace).
    2.  Separates them into unigrams (single words) stored in a `set` for O(1) lookup, and n-grams (multi-word phrases).
    3.  Organizes the n-grams into a lookup dictionary keyed by their first token. The phrases under each key are sorted by length in descending order to facilitate greedy matching.
    4.  It also codifies the "Scottish referendum" disambiguation rule from **Table 1, Note 2** into a machine-readable format.
*   **Output:** A `PreparedLexicons` dictionary containing the optimized data structures for all three lexicons and the disambiguation rule.
*   **Data Transformation:** The input `list`s are transformed into a nested dictionary containing `set`s and other dictionaries, structured for fast keyword lookup in the subsequent text analysis.
*   **Role in Research Pipeline:** This callable implements the **Lexicon Optimization and Rule Codification**. It is a direct implementation of the n-gram matching requirement ("'n-grams' used to analyse word sequences, examining bi-grams... and three-grams...") and the disambiguation rule from **Table 1, Note 2**.
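
The data structure described above can be sketched in plain Python. The names `prepare_lexicon_sketch` and `longest_match` are hypothetical, and whitespace splitting is an assumption, so hyphenated keywords such as "post-brexit" stay single tokens here.

```python
from typing import Dict, List, Optional, Set, Tuple

Phrase = Tuple[str, ...]


def prepare_lexicon_sketch(keywords: List[str]) -> Tuple[Set[str], Dict[str, List[Phrase]]]:
    """Split keywords into a unigram set and an n-gram lookup keyed by first token."""
    unigrams: Set[str] = set()
    ngrams: Dict[str, List[Phrase]] = {}
    for kw in keywords:
        tokens = tuple(kw.lower().strip().split())
        if len(tokens) == 1:
            unigrams.add(tokens[0])
        elif len(tokens) > 1:
            ngrams.setdefault(tokens[0], []).append(tokens)
    # Longest phrases first, so the first hit during matching is the greedy one.
    for phrases in ngrams.values():
        phrases.sort(key=len, reverse=True)
    return unigrams, ngrams


def longest_match(tokens: List[str], i: int, ngrams: Dict[str, List[Phrase]]) -> Optional[Phrase]:
    """Return the longest phrase that starts at position i, if any."""
    for phrase in ngrams.get(tokens[i], []):
        if tuple(tokens[i:i + len(phrase)]) == phrase:
            return phrase
    return None


unigrams, ngrams = prepare_lexicon_sketch(
    ["Brexit", " exit the EU ", "exit deal", "exit from the eu", "single market"]
)
match = longest_match(["exit", "the", "eu", "now"], 0, ngrams)
print(unigrams, ngrams["exit"], match)
```

Keying the phrases by their first token means only a handful of candidates are compared at each position, instead of the whole lexicon.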

#### **4. `preprocess_text_corpus`**
*   **Inputs:** The cleansed `pd.DataFrame` from `cleanse_data` and the `PreparedLexicons` object from `prepare_lexicons`.
*   **Process:** This function applies a sequential NLP pipeline to the raw text in the 'EIU' column of the DataFrame.
    1.  **Normalization:** Converts all text to lowercase and applies Unicode normalization.
    2.  **Tokenization:** Splits the normalized text into a list of tokens using `nltk.word_tokenize`.
    3.  **Stopword Removal:** Removes common function words while preserving any word that appears in the project's keyword lexicons, so that multi-word keywords containing stopwords (e.g., "exit the eu") remain matchable.
    4.  **N-gram Generation:** Generates lists of bigrams and trigrams from the cleaned tokens.
*   **Output:** The input DataFrame augmented with new columns for each stage of processing: `EIU_lowercase`, `EIU_tokens`, `EIU_cleaned_tokens`, `bigrams`, `trigrams`.
*   **Data Transformation:** A string column is transformed into a series of new columns containing progressively processed lists of strings and tuples.
*   **Role in Research Pipeline:** This callable implements the **Foundational NLP Pipeline**, directly executing **Steps 3, 4, 5, and 6** of the published methodology.
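
The four stages can be illustrated with the standard library alone. In this sketch a regular expression stands in for `nltk.word_tokenize`, a tiny hand-written set stands in for the NLTK stopword list, and NFKC is an assumed normalization form; `preprocess_sketch` is a hypothetical name.

```python
import re
import unicodedata
from typing import List, Set, Tuple

# Tiny illustrative stopword list; the pipeline itself uses nltk.corpus.stopwords.
STOPWORDS: Set[str] = {"the", "is", "on", "of", "a", "to", "no", "there", "with"}


def preprocess_sketch(text: str, protected: Set[str]):
    """Normalize, tokenize, drop unprotected stopwords, and build n-grams."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    # Regex stand-in for nltk.word_tokenize: words with internal hyphens or apostrophes.
    tokens: List[str] = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", normalized)
    # Keep a stopword only if it also occurs in a lexicon keyword.
    cleaned = [t for t in tokens if t not in STOPWORDS or t in protected]
    bigrams: List[Tuple[str, str]] = list(zip(cleaned, cleaned[1:]))
    trigrams: List[Tuple[str, str, str]] = list(zip(cleaned, cleaned[1:], cleaned[2:]))
    return cleaned, bigrams, trigrams


# "the" is protected because it occurs inside keywords such as "exit the eu".
cleaned, bigrams, trigrams = preprocess_sketch(
    "There is no clarity on the future trade arrangement with the EU.",
    protected={"the"},
)
print(cleaned)
print(bigrams)
```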

#### **5. `load_spacy_model_for_ner` & `apply_ner_to_corpus`**
*   **Inputs:** The preprocessed DataFrame and the model name string (`'en_core_web_lg'`).
*   **Process:** `load_spacy_model_for_ner` loads the specified SpaCy model and optimizes it by disabling unused components. `apply_ner_to_corpus` then takes this model and efficiently applies it to the entire text corpus using batch processing (`nlp.pipe`). It extracts the text, label, and character offsets for each named entity found in each document.
*   **Output:** The DataFrame augmented with a new `ner_entities` column, where each entry is a list of dictionaries representing the entities found in that document.
*   **Data Transformation:** A string column is processed to generate a new column containing structured entity data (lists of dictionaries).
*   **Role in Research Pipeline:** These callables implement the **LLM-based Feature Enhancement**, as described in **Step 8** of the methodology: "This study employs SpaCy's 'en\_core\_web\_lg' module of Large Language Models (LLMs) to elucidate the contextual occurrence of 'uncertainty words (U)'... SpaCy... uses Named Entity Recognition (NER) to identify entities...". While the paper is vague on the exact use, these functions prepare the necessary data for a concrete enhancement strategy.

#### **6. `attribute_uncertainty_in_corpus`**
*   **Inputs:** The DataFrame from the previous step, the `PreparedLexicons` object, and the `index_construction_config`.
*   **Process:** This is the algorithmic core. It iterates through each document's cleaned tokens. Upon finding an uncertainty keyword, it constructs a context window around it. It then searches this window for Brexit and COVID-19 keywords (using the optimized lexicon structures). Based on the findings, it classifies the uncertainty instance as "Pure Brexit," "Pure COVID," or "Mixed" and increments the corresponding counter for that document.
*   **Output:** The DataFrame augmented with three new integer columns: `pure_brexit_count`, `pure_covid_count`, and `mixed_count`.
*   **Data Transformation:** A column of token lists is transformed into three new columns of numerical counts.
*   **Role in Research Pipeline:** This callable implements the **Context-Aware Uncertainty Attribution Algorithm**. It is the direct execution of the logic described in **Step 9** and embodies the following equations:
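    *   Context Window Construction:
        $\text{CW} = \{x_{-10}, \dots, x_{-1}, U, x_{+1}, \dots, x_{+10}\}$
    *   Conditional Counting Logic: an uncertainty term $U$ increments `pure_brexit_count` when $\text{CW}$ contains at least one Brexit keyword and no COVID-19 keyword. The logic for `pure_covid_count` (CRUKN) is symmetric, and a window containing keywords from both lexicons increments `mixed_count`.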
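
A simplified sketch of the attribution loop, restricted to single-word keywords for brevity (multi-word phrase matching is omitted). The function name `attribute_sketch` is hypothetical; the output keys mirror the column names listed above.

```python
from typing import Dict, List, Set


def attribute_sketch(
    tokens: List[str],
    uncertainty: Set[str],
    brexit: Set[str],
    covid: Set[str],
    window: int = 10,
) -> Dict[str, int]:
    """Classify each uncertainty term by the topic keywords in its context window."""
    counts = {"pure_brexit_count": 0, "pure_covid_count": 0, "mixed_count": 0}
    for i, tok in enumerate(tokens):
        if tok not in uncertainty:
            continue
        # Up to `window` tokens on each side, clipped at the document boundaries.
        context = tokens[max(0, i - window): i + window + 1]
        has_brexit = any(t in brexit for t in context)
        has_covid = any(t in covid for t in context)
        if has_brexit and has_covid:
            counts["mixed_count"] += 1
        elif has_brexit:
            counts["pure_brexit_count"] += 1
        elif has_covid:
            counts["pure_covid_count"] += 1
    return counts


tokens = "brexit uncertainty persists while pandemic volatility and referendum tension rise".split()
counts = attribute_sketch(
    tokens,
    uncertainty={"uncertainty", "volatility", "tension"},
    brexit={"brexit", "referendum"},
    covid={"pandemic"},
    window=2,  # small window so the three cases are easy to trace by hand
)
print(counts)
```

With `window=2`, "uncertainty" and "tension" each see only a Brexit keyword, while "volatility" sees both "pandemic" and "referendum" and is counted as mixed.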

#### **7. `calculate_brui` & `calculate_crui`**
*   **Inputs:** The DataFrame from the attribution step and the `index_construction_config`.
*   **Process:** These functions perform the final index calculations.
    1.  **Aggregation:** They calculate the total effective uncertainty count for each topic (TBRUKN and TCRUKN) by applying proportional allocation to the `mixed_count`. For BRUI, the formula is `TBRUKN_t = pure_brexit_count_t + (mixed_count_t * brexit_weight_t)`.
    2.  **Standardization:** They divide the total count by the document's total word count to get a raw density score.
    3.  **Normalization:** They scale the raw density score so that the maximum value of the entire series is 100.
*   **Output:** The DataFrame augmented with the final `BRUI` and `CRUI` columns, as well as their intermediate calculation columns (`TBRUKN`, `BRUI_raw`, etc.) for auditability.
*   **Data Transformation:** Columns of integer counts are transformed into final, normalized floating-point index values.
*   **Role in Research Pipeline:** These callables implement the **Index Finalization**. They directly execute the logic from **Step 10** and **Step 11**, specifically the standardization formula
    $\text{BRUI}_t = \frac{\text{TBRUKN}_t}{(\text{Total Number of Words Per Report})_t}$,
    followed by the max-to-100 normalization.
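
The three calculation steps can be sketched with pandas. Two things here are assumptions rather than details confirmed by the text: the weight is taken to be Brexit's share of the pure counts (falling back to 0.5 when both pure counts are zero, as the `fallback` configuration value suggests), and `total_words` is a hypothetical column name for the per-report word count.

```python
import pandas as pd


def brui_sketch(df: pd.DataFrame, fallback: float = 0.5) -> pd.DataFrame:
    """Aggregate, standardize by word count, and rescale so the maximum is 100."""
    out = df.copy()
    pure_total = out["pure_brexit_count"] + out["pure_covid_count"]
    # Assumed proportional weight; NaN where both pure counts are zero, then fallback.
    weight = (out["pure_brexit_count"] / pure_total.where(pure_total > 0)).fillna(fallback)
    out["TBRUKN"] = out["pure_brexit_count"] + out["mixed_count"] * weight
    out["BRUI_raw"] = out["TBRUKN"] / out["total_words"]
    out["BRUI"] = 100 * out["BRUI_raw"] / out["BRUI_raw"].max()
    return out


demo = pd.DataFrame(
    {
        "pure_brexit_count": [3, 0, 6],
        "pure_covid_count": [1, 0, 0],
        "mixed_count": [4, 2, 0],
        "total_words": [1000, 500, 1000],
    }
)
result = brui_sketch(demo)
print(result[["TBRUKN", "BRUI_raw", "BRUI"]])
```

Note that the rescaling step is undefined when every raw value is zero; a production implementation would need to guard against that case.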

#### **8. `prepare_data_for_var`**
*   **Inputs:** The DataFrame containing the final indices and macroeconomic data, and the `econometric_analysis_config`.
*   **Process:** This function prepares the data for econometric modeling. It selects the 10 variables for the VAR, interpolates any missing data, applies log transformations to specified variables, and then applies first-differencing to all variables to induce stationarity. It performs and logs ADF tests both before and after transformation to verify the process.
*   **Output:** A tuple containing: (1) the final, stationary `pd.DataFrame` ready for estimation, and (2) a detailed audit log of all tests and transformations.
*   **Data Transformation:** The input time series are transformed via logarithmic and differencing operations (`ln(x_t)` and `x_t - x_{t-1}`).
*   **Role in Research Pipeline:** This callable implements the **Econometric Data Preparation**. This is a standard but critical step in time-series econometrics, ensuring the data meets the assumptions of the VAR model. It addresses the need for stationarity, a core concept in the field.
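
A minimal sketch of the transformation chain (interpolate, log, first difference), assuming linear interpolation; the interpolation method actually used is not stated above. The ADF tests are left out here; in practice they could be run with `statsmodels.tsa.stattools.adfuller` before and after the transformation. `transform_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def transform_sketch(df: pd.DataFrame, log_cols: list) -> pd.DataFrame:
    """Fill gaps, take logs of the selected columns, then first-difference everything."""
    out = df.interpolate(method="linear")
    out[log_cols] = np.log(out[log_cols])
    # Differencing loses the first observation, which is dropped.
    return out.diff().dropna()


idx = pd.date_range("2012-05-01", periods=3, freq="MS")
demo = pd.DataFrame({"GDP": [100.0, np.nan, 121.0], "UEMP": [4.0, 4.5, 4.2]}, index=idx)
stationary = transform_sketch(demo, log_cols=["GDP"])
print(stationary)
```

For the logged columns, the first difference `ln(x_t) - ln(x_{t-1})` approximates the period-on-period growth rate, which is why this pairing is common for level variables such as GDP.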

#### **9. `estimate_var_model`**
*   **Inputs:** The stationary DataFrame and the `econometric_analysis_config`.
*   **Process:** This function estimates the VAR model. It first selects the optimal lag length `p` using information criteria (AIC, BIC, HQIC). It then fits the VAR(p) model to the data, which must be ordered according to the Cholesky specification. Finally, it performs and logs critical diagnostic tests (for stability, serial correlation, and normality) and computes the Cholesky decomposition of the residual covariance matrix.
*   **Output:** A tuple containing: (1) the fitted `statsmodels.VARResults` object, and (2) a detailed log of the estimation process.
*   **Data Transformation:** The DataFrame of time-series data is transformed into a fitted statistical model object.
*   **Role in Research Pipeline:** This callable implements the **VAR Model Estimation and Identification (Task 9)**. It directly addresses the need for model specification (lag selection) and estimation. Most importantly, it implements the **Cholesky decomposition** identification strategy, which is central to the paper's causal analysis. The ordering of variables, with BRUI first, imposes the economic assumption that uncertainty shocks are contemporaneously exogenous.
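
To make the ordering assumption concrete, the sketch below factorizes an illustrative residual covariance matrix with NumPy. The numbers are invented for the example and the three-variable ordering `[BRUI, GDP, CPI]` is a shortened stand-in for the full ten-variable order.

```python
import numpy as np

# Illustrative residual covariance for variables ordered [BRUI, GDP, CPI].
sigma_u = np.array(
    [
        [4.0, 2.0, 0.4],
        [2.0, 5.0, 1.0],
        [0.4, 1.0, 3.0],
    ]
)

# Lower-triangular factor P with P @ P.T == sigma_u.
P = np.linalg.cholesky(sigma_u)

# Column j holds the impact (horizon 0) responses to orthogonal shock j.
impact_of_first_shock = P[:, 0]   # moves every variable on impact
impact_of_last_shock = P[:, -1]   # moves only the last variable on impact

print(P)
```

Because `P` is lower triangular, the first variable responds on impact only to its own shock, while its shock can move every other variable immediately. Reordering the variables changes `P` and therefore the identified shocks.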

#### **10. `run_post_estimation_analysis`**
*   **Inputs:** The fitted `VARResults` object, the estimation log (containing the Cholesky matrix), and the `econometric_analysis_config`.
*   **Process:** This function uses the fitted model to derive economic insights. It computes the Impulse Response Functions (IRFs), the Forecast Error Variance Decompositions (FEVDs), and the percentile bootstrapped confidence intervals for the IRFs.
*   **Output:** A dictionary containing the structured results of the IRF, FEVD, and confidence interval calculations.
*   **Data Transformation:** The estimated model parameters are transformed into time-path responses (IRFs), variance proportions (FEVDs), and statistical bounds.
*   **Role in Research Pipeline:** This callable implements the **Post-Estimation Analysis**. It generates the core analytical outputs of the econometric model, as described in the "Empirical findings" section of the paper, which are used to understand the economic consequences of a Brexit uncertainty shock.
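
For intuition, orthogonalized impulse responses of a VAR(1) can be computed directly as $\Theta_h = A^h P$, where $P$ is the Cholesky factor of the residual covariance. The sketch below uses invented coefficients and a hypothetical helper `orthogonalized_irf`; it illustrates the textbook formula and is not the project's routine, which works from the fitted `VARResults` object.

```python
import numpy as np


def orthogonalized_irf(A: np.ndarray, sigma_u: np.ndarray, horizon: int) -> np.ndarray:
    """Return responses[h, i, j]: response of variable i at horizon h to shock j."""
    P = np.linalg.cholesky(sigma_u)
    k = A.shape[0]
    responses = np.empty((horizon + 1, k, k))
    power = np.eye(k)  # holds A**h
    for h in range(horizon + 1):
        responses[h] = power @ P
        power = A @ power
    return responses


# Illustrative two-variable VAR(1); values chosen so the arithmetic is easy to check.
A = np.array([[0.5, 0.0], [0.2, 0.3]])
sigma_u = np.array([[1.0, 0.5], [0.5, 1.25]])
irf = orthogonalized_irf(A, sigma_u, horizon=10)
print(irf[0])
print(irf[1])
```

A percentile bootstrap interval would repeat this calculation on models re-estimated from resampled residuals and take, for a 90% level, the 5th and 95th percentiles of each response across replications.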

#### **11. The Visualization Suite (`plot_...` functions)**
*   **Inputs:** Various data components from the results dictionary (e.g., the `BRUI` series, the IRF results).
*   **Process:** Each function takes specific data and uses the `matplotlib` library to generate a publication-quality plot, styled and annotated according to the figures in the research paper (e.g., Figure 2, Figure 6).
*   **Output:** A `matplotlib.figure.Figure` object for each plot.
*   **Data Transformation:** Numerical and time-series data are transformed into visual representations (lines, shaded regions, text annotations).
*   **Role in Research Pipeline:** These callables implement the **Scientific Visualization**. They are responsible for creating the final, interpretable outputs that communicate the paper's key findings, such as the evolution of the BRUI over time and the dynamic response of the economy to its shocks.

#### **12. `run_brexit_uncertainty_analysis`**
*   **Inputs:** All raw data and configuration files.
*   **Process:** This is the master orchestrator. It calls every other callable in the correct sequence, managing the flow of data from one step to the next. It initializes a master results dictionary and progressively populates it with the outputs and audit logs from each task.
*   **Output:** A single, comprehensive dictionary containing all artifacts of the entire research pipeline, from intermediate data to final figures.
*   **Data Transformation:** This function orchestrates the entire chain of data transformations.
*   **Role in Research Pipeline:** This callable represents the **End-to-End Research Pipeline** itself. It encapsulates the entire methodology of the paper into a single, executable, and reproducible function.



## Usage Example


### **Implementation Example: Executing the End-to-End Brexit Uncertainty Pipeline**

This guide demonstrates how to call `run_brexit_uncertainty_analysis`, the function that executes the end-to-end research pipeline. A successful run requires every input parameter to be prepared carefully, so each one is constructed step by step below.

#### **Step 1: Assembling the Input Data (`df_input`)**

The primary input is a `pandas.DataFrame` with a `DatetimeIndex` and precisely defined columns. For this example, we will construct a synthetic DataFrame that mimics the required structure. In a real-world scenario, this DataFrame would be the result of a comprehensive data ingestion process, sourcing macroeconomic data from providers like the ONS and the EIU text reports from their respective archives.

```python
# Import necessary libraries for data creation.
import pandas as pd
import numpy as np

# Define the precise date range for the analysis as per the methodology.
# This covers the period from May 2012 to January 2025 (153 months).
date_range = pd.date_range(start="2012-05-01", end="2025-01-01", freq='MS')

# Create a dictionary to hold the synthetic data.
# The keys must match the required column names exactly.
data = {
    # The 'BRUI' column is a placeholder; it will be recalculated by the pipeline.
    "BRUI": np.zeros(len(date_range)),
    # Macroeconomic variables are populated with random data for this example.
    "GDP": np.random.uniform(95, 105, size=len(date_range)),
    "CPI": np.random.uniform(100, 120, size=len(date_range)),
    "PPI": np.random.uniform(100, 115, size=len(date_range)),
    "X": np.random.uniform(25, 35, size=len(date_range)),
    "M": np.random.uniform(30, 40, size=len(date_range)),
    "GBP_EUR": np.random.uniform(1.1, 1.3, size=len(date_range)),
    "GBP_USD": np.random.uniform(1.2, 1.5, size=len(date_range)),
    "EMP": np.random.uniform(30, 33, size=len(date_range)),
    "UEMP": np.random.uniform(3.5, 5.5, size=len(date_range)),
    # The 'EIU' column holds the text corpus. Here, we use sample sentences.
    # In a real application, this would contain the full text of 153 monthly reports.
    "EIU": [
        "The economic outlook is filled with uncertainty due to the upcoming referendum.",
        "Discussions surrounding the customs union and single market access are creating tension.",
        "There is no clarity on the future trade arrangement with the EU.",
        "The withdrawal agreement faces significant political hurdles, leading to volatility.",
        "Post-Brexit trade negotiations are precarious. Separately, the pandemic and covid-19 lockdown add to the instability."
    ] * 30 + ["Sample text."] * 3 # Ensure the list has the correct length.
}

# Construct the final input DataFrame with the DatetimeIndex.
df_input = pd.DataFrame(data, index=date_range)

# Display the head and info of the created DataFrame to verify its structure.
print("--- Input DataFrame Head ---")
print(df_input.head())
print("\n--- Input DataFrame Info ---")
df_input.info()
```

#### **Step 2: Defining the Keyword Lexicons**

The next set of inputs are the three keyword lexicons, defined as Python lists of strings. These must match the specifications from Table 1 of the research paper.

```python
# Define the Uncertainty Lexicon as per the methodology.
uncertainty_lexicon = [
    "fear", "indecision", "instability", "jittery", "nervousness",
    "precarious", "tense", "tension", "uncertain", "uncertainly",
    "uncertainty", "unclear", "unknown", "unpredictable", "unsettled",
    "unstable", "volatile", "volatility", "worry"
]

# Define the Brexit Lexicon as per the methodology.
brexit_lexicon = [
    "article 50", "brexit", "brexit-related", "customs union", "eu exit",
    "eu membership", "eu withdrawal", "exit deal", "exit from the eu",
    "exit the eu", "exit time", "exiting", "exiting the eu",
    "exiting the european union", "free movement", "internal market bill",
    "leave the eu", "northern ireland protocol", "post-brexit", "pre-brexit",
    "referendum", "regulatory alignment", "regulatory framework", "single market",
    "trade arrangement", "trade negotiations", "transition period", "uk exits",
    "uk-eu relations", "uk-eu trade deal", "uk's withdrawal",
    "withdrawal agreement", "withdrawal from the eu"
]

# Define the COVID-19 Lexicon as per the methodology.
covid_lexicon = [
    "coronavirus", "covid", "covid-19", "lockdown", "outbreak",
    "pandemic", "quarantine", "vaccination", "vaccine"
]
```

#### **Step 3: Defining the Configuration Dictionaries**

Two detailed configuration dictionaries are required to control the behavior of the index construction and econometric analysis phases. These parameters are critical for ensuring reproducibility.

```python
# Define the configuration for the text-based index construction.
# This dictionary governs every step from preprocessing to normalization.
index_construction_config = {
    "corpus_parameters": {
        "date_range_start": "2012-05",
        "date_range_end": "2025-01",
        "expected_document_count": 153
    },
    "preprocessing_parameters": {
        "text_normalization": "lowercase",
        "tokenizer": "nltk.word_tokenize",
        "stopword_language": "english",
        "stopword_source": "nltk.corpus.stopwords"
    },
    "algorithm_parameters": {
        "name": "Context-Aware Uncertainty Attribution",
        "context_window_size": 10,
        "mixed_count_allocation": {
            "method": "proportional",
            "fallback": 0.5
        }
    },
    "llm_parameters": {
        "library": "spacy",
        "model_identifier": "en_core_web_lg",
        "purpose": "Named Entity Recognition (NER) to enhance keyword identification"
    },
    "finalization_parameters": {
        "standardization_method": "division_by_total_words",
        "normalization_method": "scale_to_max",
        "normalization_target_max": 100
    }
}

# Define the configuration for the econometric VAR analysis.
# This dictionary controls model specification, data transformation, and post-estimation.
econometric_analysis_config = {
    "model_specification": {
        "model_type": "Vector Autoregression (VAR)",
        "variables": [
            "BRUI", "GDP", "CPI", "PPI", "X", "M",
            "GBP_EUR", "GBP_USD", "EMP", "UEMP"
        ],
        "lag_length_selection_criterion": ["AIC", "BIC", "HQIC"],
        "include_constant": True
    },
    "data_transformation_parameters": {
        "log_transform_variables": ["GDP", "CPI", "PPI", "X", "M", "EMP"],
        "differencing_transform_variables": "all",
        "differencing_order": 1
    },
    "impulse_response_parameters": {
        "identification_strategy": "Cholesky Decomposition",
        "cholesky_variable_order": [
            "BRUI", "GDP", "CPI", "PPI", "X", "M",
            "GBP_EUR", "GBP_USD", "EMP", "UEMP"
        ],
        "shock_size": "one_standard_deviation",
        "horizon": 10,
        "confidence_interval_level": 0.90,
        "confidence_interval_method": "standard_percentile_bootstrap",
        "bootstrap_repetitions": 999
    },
    "fevd_parameters": {
        "horizon": 10
    }
}
```

#### **Step 4: Defining Ancillary Data for Visualization**

The pipeline requires a dictionary of key events for plotting and can optionally accept a DataFrame of other indices for comparative validation.

```python
# Define the dictionary of key Brexit events for annotating the BRUI timeline plot.
# The keys are date strings, and the values are the descriptions.
brexit_events_for_plotting = {
    '2016-06-23': 'EU Referendum',
    '2017-03-29': 'Article 50 Triggered',
    '2019-01-15': "May's Deal Defeated",
    '2020-01-31': 'UK Leaves EU',
    '2020-12-24': 'UK-EU Trade Deal Agreed'
}

# Define an optional DataFrame for comparative validation plots.
# In a real scenario, this data would be loaded from external sources.
# For this example, we create synthetic data for BRUI_B and BRUI_C.
comparison_data = {
    "BRUI_B": pd.Series(np.random.uniform(0, 1, size=len(date_range)) * 80, index=date_range),
    "BRUI_C": pd.Series(np.random.uniform(0, 1, size=len(date_range)) * 90, index=date_range)
}
# Introduce some NaNs to simulate different temporal coverages.
comparison_data["BRUI_B"].iloc[:30] = np.nan
comparison_data["BRUI_C"].iloc[-20:] = np.nan
comparison_indices_df = pd.DataFrame(comparison_data)
```

#### **Step 5: Executing the Pipeline**

With all inputs meticulously prepared, the final step is to call the master orchestrator function. The function will execute the entire research pipeline and return a comprehensive dictionary containing all results.

```python
# Before running, ensure all required functions from the previous steps are defined
# or imported in the current execution environment. This includes:
# validate_parameters, cleanse_data, prepare_lexicons, preprocess_text_corpus,
# load_spacy_model_for_ner, apply_ner_to_corpus, attribute_uncertainty_in_corpus,
# calculate_brui, calculate_crui, prepare_data_for_var, estimate_var_model,
# run_post_estimation_analysis, plot_brui_with_events, plot_comparative_validation,
# and plot_impulse_response_functions.

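# Also ensure NLTK and SpaCy data packages are downloaded:
# import nltk
# nltk.download('punkt')
# nltk.download('punkt_tab')  # may be required by newer NLTK releases
# nltk.download('stopwords')
# !python -m spacy download en_core_web_lg

# matplotlib.pyplot is needed for the plt.show() call below.
import matplotlib.pyplot as plt

# Execute the end-to-end analysis pipeline.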
# The 'try-except' block is a robust way to catch any validation or runtime errors.
try:
    # Call the main orchestrator function with all prepared inputs.
    pipeline_results = run_brexit_uncertainty_analysis(
        df_input=df_input,
        uncertainty_lexicon=uncertainty_lexicon,
        brexit_lexicon=brexit_lexicon,
        covid_lexicon=covid_lexicon,
        index_construction_config=index_construction_config,
        econometric_analysis_config=econometric_analysis_config,
        brexit_events_for_plotting=brexit_events_for_plotting,
        comparison_indices_df=comparison_indices_df
    )

    # --- Post-Execution: Accessing Results ---
    # The 'pipeline_results' dictionary now contains all artifacts.
    
    # Example: Access the final DataFrame with the calculated indices.
    final_df = pipeline_results['final_data']['indices_and_components']
    print("\n--- Final DataFrame with BRUI and CRUI ---")
    print(final_df[['BRUI', 'CRUI']].head())

    # Example: Access the VAR estimation log.
    var_estimation_log = pipeline_results['audit_logs']['var_estimation']
    print("\n--- VAR Model Optimal Lag (BIC) ---")
    print(var_estimation_log['lag_selection']['optimal_lag_p'])

    # Example: Display one of the generated figures.
    print("\n--- Displaying Generated BRUI Timeline Plot ---")
    brui_timeline_figure = pipeline_results['visualizations']['brui_timeline']
    # In an interactive environment (like Jupyter), this will display the plot.
    # To save it, you would use: brui_timeline_figure.savefig('brui_timeline.png')
    plt.show()

except (ValueError, KeyError, OSError, LookupError) as e:
    # Catch potential errors and print them in a structured way.
    print("\n--- AN ERROR OCCURRED ---")
    print(f"Error Type: {type(e).__name__}")
    print(f"Error Message: {e}")

```

This example provides a complete template for using the end-to-end pipeline. It shows how to structure each required input and how to invoke the main function, so that the full multi-stage analysis runs from a single function call. Because the macroeconomic columns are drawn with `np.random.uniform` without a fixed seed, the numerical outputs of this synthetic run will differ between executions.

# Task 0: Parameter Validation

def _validate_dataframe_structure(
    df: pd.DataFrame,
    config: Dict[str, Any]
) -> List[str]:
    """
    Validates the structure, temporal coverage, and dtypes of the input DataFrame.

    This function performs a series of checks to ensure the input DataFrame
    conforms to the strict requirements of the research methodology. It verifies
    the index type, date range, row count, column names, and data types for
    each column.

    Args:
        df (pd.DataFrame): The input DataFrame containing macroeconomic and EIU data.
        config (Dict[str, Any]): The index construction configuration dictionary.

    Returns:
        List[str]: A list of error messages. An empty list indicates success.
    """
    # Initialize a list to aggregate validation error messages.
    errors: List[str] = []

    # --- Step 1.1: Validate DataFrame Index ---
    # The index must be a pandas DatetimeIndex for time-series operations.
    if not isinstance(df.index, pd.DatetimeIndex):
        errors.append("DataFrame index is not a DatetimeIndex.")
        # If the index is not a datetime object, further date checks are invalid.
        return errors

    # --- Step 1.2: Validate Temporal Coverage ---
    try:
        # Extract expected start and end dates from the configuration.
        expected_start_str = config['corpus_parameters']['date_range_start']
        expected_end_str = config['corpus_parameters']['date_range_end']

        # Convert configuration strings to pandas Timestamp objects for comparison.
        expected_start = pd.to_datetime(expected_start_str)
        expected_end = pd.to_datetime(expected_end_str)

        # Compare the DataFrame's actual date range with the expected range.
        if df.index.min() != expected_start:
            errors.append(
                f"DataFrame start date {df.index.min().strftime('%Y-%m')} "
                f"does not match expected start date {expected_start_str}."
            )
        if df.index.max() != expected_end:
            errors.append(
                f"DataFrame end date {df.index.max().strftime('%Y-%m')} "
                f"does not match expected end date {expected_end_str}."
            )
    except KeyError as e:
        errors.append(f"Missing required date key in config: {e}")
    except Exception as e:
        errors.append(f"Error processing date range from config: {e}")

    # --- Step 1.3: Validate Document/Row Count ---
    try:
        # The number of rows must match the expected number of monthly reports.
        expected_count = config['corpus_parameters']['expected_document_count']
        if len(df) != expected_count:
            errors.append(
                f"DataFrame has {len(df)} rows, but expected document "
                f"count is {expected_count}."
            )
    except KeyError as e:
        errors.append(f"Missing 'expected_document_count' in config: {e}")

    # --- Step 1.4: Validate Column Structure ---
    # Define the exact set of required columns as per the methodology.
    expected_columns: Set[str] = {
        "BRUI", "GDP", "CPI", "PPI", "X", "M", "GBP_EUR",
        "GBP_USD", "EMP", "UEMP", "EIU"
    }
    # Compare the set of actual columns to the expected set.
    if set(df.columns) != expected_columns:
        errors.append(
            f"DataFrame columns mismatch. Expected: {sorted(expected_columns)}, "
            f"Got: {sorted(map(str, df.columns))}."
        )

    # --- Step 1.5: Validate Column Data Types ---
    # Define the expected data type for each column.
    expected_dtypes: Dict[str, str] = {
        "BRUI": "numeric", "GDP": "numeric", "CPI": "numeric", "PPI": "numeric",
        "X": "numeric", "M": "numeric", "GBP_EUR": "numeric", "GBP_USD": "numeric",
        "EMP": "numeric", "UEMP": "numeric", "EIU": "object"
    }
    # Iterate through expected columns to check their types.
    for col, expected_type in expected_dtypes.items():
        # Check if the column exists before trying to access its dtype.
        if col in df.columns:
            # Check for numeric types.
            if expected_type == "numeric":
                if not pd.api.types.is_numeric_dtype(df[col]):
                    errors.append(
                        f"Column '{col}' is not numeric. "
                        f"Found dtype: {df[col].dtype}."
                    )
            # Check for object type (for strings).
            elif expected_type == "object":
                if not pd.api.types.is_object_dtype(df[col]):
                    errors.append(
                        f"Column '{col}' is not of object type (for strings). "
                        f"Found dtype: {df[col].dtype}."
                    )

    # Return the list of all found errors.
    return errors

def _validate_lexicons(
    uncertainty_lexicon: List[str],
    brexit_lexicon: List[str],
    covid_lexicon: List[str],
    expected_counts: Dict[str, int] = {'Uncertainty': 19,
                                       'Brexit': 33,
                                       'COVID-19': 9}) -> List[str]:
    """
    Validates the keyword lexicons against specified counts and structural rules.

    This function performs a series of rigorous checks to ensure the keyword
    lexicons are correctly structured. It validates:
    1.  The size of each lexicon against explicitly provided expected counts.
    2.  The data type of each element, ensuring all are strings.
    3.  The internal uniqueness of each lexicon (i.e., no duplicate keywords).
    4.  The external uniqueness between lexicons (i.e., no overlapping keywords).
    This function is designed to be a pure validator, decoupled from hard-coded
    project data.

    Args:
        uncertainty_lexicon (List[str]): The list of uncertainty-related keywords.
        brexit_lexicon (List[str]): The list of Brexit-related keywords.
        covid_lexicon (List[str]): The list of COVID-19 related keywords.
        expected_counts (Dict[str, int]): A dictionary specifying the exact
            expected number of keywords for each lexicon (e.g.,
            {'Uncertainty': 19, 'Brexit': 33, 'COVID-19': 9}).

    Returns:
        List[str]: A list of string error messages. An empty list indicates
            that all validation checks passed successfully.
    """
    # Initialize a list to aggregate all validation error messages found.
    errors: List[str] = []

    # Create a dictionary to map lexicon names to their list objects for iteration.
    lexicons = {
        "Uncertainty": uncertainty_lexicon,
        "Brexit": brexit_lexicon,
        "COVID-19": covid_lexicon
    }

    # Initialize a dictionary to hold the processed sets of keywords for overlap checks.
    processed_sets: Dict[str, Set[str]] = {}

    # --- Iterate through each lexicon to perform validation checks ---
    for name, lexicon in lexicons.items():
        # Verify that the input object is a list.
        if not isinstance(lexicon, list):
            # If not a list, append a fatal error for this lexicon and skip further checks.
            errors.append(f"Lexicon '{name}' is not a list, but type {type(lexicon).__name__}.")
            # Continue to the next lexicon in the loop.
            continue

        # Check if the size of the lexicon matches the expected count passed as a parameter.
        if len(lexicon) != expected_counts.get(name, -1):
            # Append an error message if the counts do not match.
            errors.append(
                f"Lexicon '{name}' has {len(lexicon)} items, but "
                f"expected {expected_counts.get(name)}."
            )

        # --- Process and validate each keyword within the lexicon ---
        # Use a temporary list to store processed keywords for this lexicon.
        processed_keywords: List[str] = []
        # Use enumerate to track the index for precise error reporting.
        for i, kw in enumerate(lexicon):
            # Verify that the keyword is a string before attempting string operations.
            if not isinstance(kw, str):
                # If not a string, log a specific error with the index and value.
                errors.append(
                    f"Lexicon '{name}' contains a non-string element at index {i}: '{kw}' "
                    f"(type: {type(kw).__name__})."
                )
                # Skip processing for this invalid element.
                continue

            # Normalize the keyword by converting to lowercase and stripping whitespace.
            processed_kw = kw.lower().strip()
            # Add the cleaned keyword to our list for this lexicon.
            processed_keywords.append(processed_kw)

        # Check for internal duplicates by comparing the length of the list of
        # processed keywords to the length of a set made from the same list.
        if len(set(processed_keywords)) != len(processed_keywords):
            # Append an error if duplicates are found within the lexicon.
            errors.append(f"Lexicon '{name}' contains duplicate entries after normalization.")

        # Store the final, clean set of keywords for inter-lexicon overlap checks.
        processed_sets[name] = set(processed_keywords)

    # --- Perform inter-lexicon overlap checks if all lexicons were valid lists ---
    # This check is only meaningful if all three lexicons were successfully processed into sets.
    if len(processed_sets) == 3:
        # Check for any common keywords between the Uncertainty and Brexit sets.
        if not processed_sets["Uncertainty"].isdisjoint(processed_sets["Brexit"]):
            # If intersection is not empty, log an error.
            overlap = processed_sets["Uncertainty"] & processed_sets["Brexit"]
            errors.append(f"Overlap found between Uncertainty and Brexit lexicons: {overlap}")

        # Check for any common keywords between the Uncertainty and COVID-19 sets.
        if not processed_sets["Uncertainty"].isdisjoint(processed_sets["COVID-19"]):
            # If intersection is not empty, log an error.
            overlap = processed_sets["Uncertainty"] & processed_sets["COVID-19"]
            errors.append(f"Overlap found between Uncertainty and COVID-19 lexicons: {overlap}")

        # Check for any common keywords between the Brexit and COVID-19 sets.
        if not processed_sets["Brexit"].isdisjoint(processed_sets["COVID-19"]):
            # If intersection is not empty, log an error.
            overlap = processed_sets["Brexit"] & processed_sets["COVID-19"]
            errors.append(f"Overlap found between Brexit and COVID-19 lexicons: {overlap}")

    # Return the final, aggregated list of all validation errors.
    return errors

def _validate_index_construction_config(
    config: Dict[str, Any]
) -> List[str]:
    """
    Validates the index construction config against methodological requirements.

    This function performs a deep check of the nested configuration dictionary
    to ensure all parameters for text processing, algorithm choice, and
    index finalization align with the research paper's steps.

    Args:
        config (Dict[str, Any]): The index construction configuration dictionary.

    Returns:
        List[str]: A list of error messages. An empty list indicates success.
    """
    # Initialize a list to aggregate validation error messages.
    errors: List[str] = []

    # Define a schema of expected values for validation.
    expected_schema = {
        "corpus_parameters": {
            "date_range_start": "2012-05",
            "date_range_end": "2025-01",
            "expected_document_count": 153
        },
        "preprocessing_parameters": {
            "text_normalization": "lowercase",
            "tokenizer": "nltk.word_tokenize",
            "stopword_language": "english",
            "stopword_source": "nltk.corpus.stopwords"
        },
        "algorithm_parameters": {
            "name": "Context-Aware Uncertainty Attribution",
            "context_window_size": 10,
            "mixed_count_allocation": {
                "method": "proportional",
                "fallback": 0.5
            }
        },
        "llm_parameters": {
            "library": "spacy",
            "model_identifier": "en_core_web_lg",
            "purpose": "Named Entity Recognition (NER) to enhance keyword identification"
        },
        "finalization_parameters": {
            "standardization_method": "division_by_total_words",
            "normalization_method": "scale_to_max",
            "normalization_target_max": 100
        }
    }

    # --- Step 3.1: Validate the configuration structure and values ---
    # Iterate through the expected schema to check each parameter.
    for key, sub_schema in expected_schema.items():
        # Check if the top-level key exists in the provided config.
        if key not in config:
            errors.append(f"Missing top-level key '{key}' in index_construction_config.")
            continue

        # Get the sub-dictionary from the user's config.
        sub_config = config[key]
        # Iterate through the nested schema.
        for sub_key, expected_value in sub_schema.items():
            # Check for nested dictionaries.
            if isinstance(expected_value, dict):
                if sub_key not in sub_config:
                    errors.append(f"Missing nested key '{sub_key}' in '{key}'.")
                    continue
                # Iterate through the second level of nesting.
                for nested_key, nested_expected_value in expected_value.items():
                    if nested_key not in sub_config[sub_key]:
                        errors.append(f"Missing key '{nested_key}' in '{key}.{sub_key}'.")
                        continue
                    # Compare the actual value with the expected value.
                    actual_value = sub_config[sub_key][nested_key]
                    if actual_value != nested_expected_value:
                        errors.append(
                            f"Config mismatch at '{key}.{sub_key}.{nested_key}'. "
                            f"Expected: {nested_expected_value}, Got: {actual_value}."
                        )
            else:
                # Check for keys in the first level of nesting.
                if sub_key not in sub_config:
                    errors.append(f"Missing key '{sub_key}' in '{key}'.")
                    continue
                # Compare the actual value with the expected value.
                actual_value = sub_config[sub_key]
                if actual_value != expected_value:
                    errors.append(
                        f"Config mismatch at '{key}.{sub_key}'. "
                        f"Expected: {expected_value}, Got: {actual_value}."
                    )

    # Return the list of all found errors.
    return errors

def _validate_econometric_analysis_config(
    config: Dict[str, Any]
) -> List[str]:
    """
    Validates the econometric analysis config against VAR methodology.

    This function ensures the VAR model specification, data transformations,
    and identification strategy (especially the Cholesky ordering) are
    consistent with the research paper's econometric design.

    Args:
        config (Dict[str, Any]): The econometric analysis configuration dictionary.

    Returns:
        List[str]: A list of error messages. An empty list indicates success.
    """
    # Initialize a list to aggregate validation error messages.
    errors: List[str] = []

    # --- Step 4.1: Validate Model Specification ---
    try:
        spec = config['model_specification']
        # Check model type.
        if spec['model_type'] != "Vector Autoregression (VAR)":
            errors.append("model_type must be 'Vector Autoregression (VAR)'.")
        # Check variable list.
        expected_vars = {
            "BRUI", "GDP", "CPI", "PPI", "X", "M",
            "GBP_EUR", "GBP_USD", "EMP", "UEMP"
        }
        if set(spec['variables']) != expected_vars:
            errors.append("Variable list in model_specification is incorrect.")
        # Check lag selection criteria.
        if set(spec['lag_length_selection_criterion']) != {"AIC", "BIC", "HQIC"}:
            errors.append("lag_length_selection_criterion is incorrect.")
        # Check for constant term.
        if spec['include_constant'] is not True:
            errors.append("include_constant must be True.")
    except KeyError as e:
        errors.append(f"Missing key in model_specification: {e}")

    # --- Step 4.2: Validate Data Transformation Parameters ---
    try:
        trans = config['data_transformation_parameters']
        # Check log transform variables.
        expected_log_vars = {"GDP", "CPI", "PPI", "X", "M", "EMP"}
        if set(trans['log_transform_variables']) != expected_log_vars:
            errors.append("log_transform_variables list is incorrect.")
        # Check differencing specification.
        if trans['differencing_transform_variables'] != "all":
            errors.append("differencing_transform_variables must be 'all'.")
        # Check differencing order.
        if trans['differencing_order'] != 1:
            errors.append("differencing_order must be 1.")
    except KeyError as e:
        errors.append(f"Missing key in data_transformation_parameters: {e}")

    # --- Step 4.3: Validate Impulse Response Parameters ---
    try:
        ir_params = config['impulse_response_parameters']
        # Check identification strategy.
        if ir_params['identification_strategy'] != "Cholesky Decomposition":
            errors.append("identification_strategy must be 'Cholesky Decomposition'.")

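        # CRITICAL: Validate Cholesky variable ordering.
        cholesky_order = ir_params['cholesky_variable_order']
        # Check that BRUI is the first variable. This is a core assumption.
        if not cholesky_order or cholesky_order[0] != "BRUI":
            errors.append("Cholesky order is invalid: 'BRUI' must be the first variable.")
        # Check that the set of variables in the order matches the model spec.
        # Use .get() so that a missing model_specification (already reported in
        # Step 4.1) is not misreported as a missing impulse-response key, and
        # fall back to an empty list so a None ordering does not raise TypeError.
        spec_variables = config.get('model_specification', {}).get('variables', [])
        if set(cholesky_order or []) != set(spec_variables):
            errors.append("Cholesky order variables do not match model_specification variables.")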

        # Check other IRF parameters.
        if ir_params['shock_size'] != "one_standard_deviation":
            errors.append("shock_size must be 'one_standard_deviation'.")
        if ir_params['horizon'] != 10:
            errors.append("horizon must be 10.")
        if ir_params['confidence_interval_level'] != 0.90:
            errors.append("confidence_interval_level must be 0.90.")
        if ir_params['confidence_interval_method'] != "standard_percentile_bootstrap":
            errors.append("confidence_interval_method must be 'standard_percentile_bootstrap'.")
        if ir_params['bootstrap_repetitions'] != 999:
            errors.append("bootstrap_repetitions must be 999.")
    except KeyError as e:
        errors.append(f"Missing key in impulse_response_parameters: {e}")

    # --- Step 4.4: Validate FEVD Parameters ---
    try:
        fevd_params = config['fevd_parameters']
        # Check FEVD horizon.
        if fevd_params['horizon'] != 10:
            errors.append("FEVD horizon must be 10.")
    except KeyError as e:
        errors.append(f"Missing key in fevd_parameters: {e}")

    # Return the list of all found errors.
    return errors

def validate_parameters(
    df: pd.DataFrame,
    uncertainty_lexicon: List[str],
    brexit_lexicon: List[str],
    covid_lexicon: List[str],
    index_construction_config: Dict[str, Any],
    econometric_analysis_config: Dict[str, Any]
) -> None:
    """
    Orchestrates the validation of all input parameters for the BRUI project.

    This function serves as the main entry point for parameter validation. It
    calls specialized helper functions to validate each component: the input
    DataFrame, the keyword lexicons, the index construction configuration, and
    the econometric analysis configuration. If any validation check fails, it
    aggregates all error messages and raises a single, comprehensive
    ValueError, allowing the user to correct all issues at once.

    Args:
        df (pd.DataFrame): The primary DataFrame with monthly data.
        uncertainty_lexicon (List[str]): List of uncertainty-related keywords.
        brexit_lexicon (List[str]): List of Brexit-related keywords.
        covid_lexicon (List[str]): List of COVID-19 related keywords.
        index_construction_config (Dict[str, Any]): Configuration for BRUI calculation.
        econometric_analysis_config (Dict[str, Any]): Configuration for VAR analysis.

    Raises:
        TypeError: If any input parameter is of an incorrect type.
        ValueError: If any validation check fails, containing a detailed
                    list of all identified issues.

    Returns:
        None: The function returns None if all validations pass.
    """
    # --- Type Checking for main inputs ---
    # Ensure all inputs are of the expected high-level type.
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input 'df' must be a pandas DataFrame.")
    if not isinstance(uncertainty_lexicon, list):
        raise TypeError("Input 'uncertainty_lexicon' must be a list.")
    if not isinstance(brexit_lexicon, list):
        raise TypeError("Input 'brexit_lexicon' must be a list.")
    if not isinstance(covid_lexicon, list):
        raise TypeError("Input 'covid_lexicon' must be a list.")
    if not isinstance(index_construction_config, dict):
        raise TypeError("Input 'index_construction_config' must be a dictionary.")
    if not isinstance(econometric_analysis_config, dict):
        raise TypeError("Input 'econometric_analysis_config' must be a dictionary.")

    # --- Aggregate errors from all validation steps ---
    # Initialize a list to hold all error messages from all checks.
    all_errors: List[str] = []

    # Execute DataFrame validation.
    all_errors.extend(_validate_dataframe_structure(df, index_construction_config))

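    # Execute lexicon validation.
    # `_validate_lexicons` documents an `expected_counts` argument; the counts
    # below follow the example given in its docstring and should be adjusted
    # if the lexicons change.
    all_errors.extend(_validate_lexicons(
        uncertainty_lexicon, brexit_lexicon, covid_lexicon,
        expected_counts={"Uncertainty": 19, "Brexit": 33, "COVID-19": 9}
    ))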

    # Execute index construction config validation.
    all_errors.extend(_validate_index_construction_config(index_construction_config))

    # Execute econometric analysis config validation.
    all_errors.extend(_validate_econometric_analysis_config(econometric_analysis_config))

    # --- Final Error Reporting ---
    # If the list of errors is not empty, raise a single, comprehensive ValueError.
    if all_errors:
        # Format the error messages for clear presentation.
        error_header = "Parameter validation failed with the following errors:"
        formatted_errors = "\n".join([f"  - {error}" for error in all_errors])
        # Raise the exception.
        raise ValueError(f"{error_header}\n{formatted_errors}")

    # If no errors were found, the function completes successfully.
    # A print statement can be used for explicit confirmation in an interactive session.
    # print("All parameters successfully validated against the research methodology.")
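
For reference, a configuration that would satisfy the checks in `_validate_econometric_analysis_config` might look like the sketch below. The keys and required values are read off the validator itself. The ordering of the variables after `BRUI` in `cholesky_variable_order` is an assumption made purely for illustration: the validator only requires that `BRUI` comes first and that the set of names matches `variables`.

```python
from typing import Any, Dict, List

# Variable names required by the validator's expected_vars set.
variables: List[str] = [
    "BRUI", "GDP", "CPI", "PPI", "X", "M",
    "GBP_EUR", "GBP_USD", "EMP", "UEMP",
]

example_econometric_config: Dict[str, Any] = {
    "model_specification": {
        "model_type": "Vector Autoregression (VAR)",
        "variables": variables,
        "lag_length_selection_criterion": ["AIC", "BIC", "HQIC"],
        "include_constant": True,
    },
    "data_transformation_parameters": {
        "log_transform_variables": ["GDP", "CPI", "PPI", "X", "M", "EMP"],
        "differencing_transform_variables": "all",
        "differencing_order": 1,
    },
    "impulse_response_parameters": {
        "identification_strategy": "Cholesky Decomposition",
        # ASSUMPTION: only "BRUI first" is enforced by the validator;
        # the rest of this ordering is illustrative.
        "cholesky_variable_order": list(variables),
        "shock_size": "one_standard_deviation",
        "horizon": 10,
        "confidence_interval_level": 0.90,
        "confidence_interval_method": "standard_percentile_bootstrap",
        "bootstrap_repetitions": 999,
    },
    "fevd_parameters": {"horizon": 10},
}

# Quick self-check mirroring two of the validator's rules.
order = example_econometric_config["impulse_response_parameters"]["cholesky_variable_order"]
assert order[0] == "BRUI"
assert set(order) == set(example_econometric_config["model_specification"]["variables"])
```

Passing a dictionary of this shape as `econometric_analysis_config` should leave the econometric portion of the aggregated error list empty, assuming the validator is unchanged from the version shown above.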


In [None]:
# Task 1: Data Cleansing

def cleanse_data(
    df: pd.DataFrame,
    index_construction_config: Dict[str, Any],
    missing_doc_threshold: float = 0.95,
    text_column: str = 'EIU'
) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """
    Performs systematic data cleansing, temporal filtering, and corpus validation.

    This function executes a rigorous, multi-step data cleansing pipeline. It
    is designed to be non-destructive to the original text corpus while
    thoroughly cleaning all other columns. The pipeline includes:
    1.  A targeted replacement strategy to standardize missing and non-finite
        values, applying string-based null replacements only to non-text columns.
    2.  Precise temporal filtering to align the DataFrame with the research
        methodology's sample period (May 2012 - Jan 2025).
    3.  Strict verification of the EIU text corpus completeness, raising an
        error if coverage falls below a critical threshold.

    Args:
        df (pd.DataFrame): The input DataFrame, assumed to have passed the
            validations in Task 0.
        index_construction_config (Dict[str, Any]): The configuration dictionary
            containing corpus parameters like date ranges and expected counts.
        missing_doc_threshold (float): The minimum required proportion of
            documents. If coverage falls below this, a ValueError is raised.
            Defaults to 0.95.
        text_column (str): The name of the column containing the raw text corpus,
            which will be excluded from string-based null replacements.
            Defaults to 'EIU'.

    Returns:
        Tuple[pd.DataFrame, Dict[str, Any]]: A tuple containing:
            - The cleansed, temporally filtered pandas DataFrame.
            - A detailed audit log dictionary documenting all actions taken.

    Raises:
        ValueError: If the EIU document coverage is below the specified
            `missing_doc_threshold`.
        KeyError: If required keys are missing from the configuration dictionary.
    """
    # Create a deep copy to avoid modifying the original DataFrame.
    df_cleaned = df.copy()

    # Initialize a dictionary to serve as a detailed audit log for all operations.
    audit_log: Dict[str, Any] = {
        "data_quality_assessment": {},
        "temporal_filtering": {},
        "document_count_verification": {}
    }

    # --- Step 1: Comprehensive Data Quality Assessment and Targeted Replacement ---
    # Apply a column-targeted replacement strategy so the text corpus is left untouched.

    # Define lists of values to replace, separated by type.
    numeric_nulls = [np.inf, -np.inf]
    string_nulls = ['', 'N/A', 'NULL', 'null', 'None']

    # Identify the columns to which string replacements will be applied.
    # This excludes the specified text_column to preserve its integrity.
    non_text_columns = [col for col in df_cleaned.columns if col != text_column]

    # --- Audit and Replace ---
    # Initialize a dictionary to store the detailed audit of replacements.
    replacement_audit = {}

    # Stage 1: Audit and replace numeric nulls across the entire DataFrame.
    for value in numeric_nulls:
        # Find where the value occurs.
        locations = df_cleaned == value
        # Count occurrences per column.
        counts_per_column = locations.sum()
        # Filter to include only columns with one or more occurrences.
        affected_columns = counts_per_column[counts_per_column > 0].to_dict()
        # If any columns are affected, log the details.
        if affected_columns:
            replacement_audit[str(value)] = {
                "total_replacements": int(locations.sum().sum()),
                "columns_affected": affected_columns
            }
    # Perform the replacement for numeric nulls.
    df_cleaned.replace(numeric_nulls, np.nan, inplace=True)

    # Stage 2: Audit and replace string-based nulls ONLY in non-text columns.
    for value in string_nulls:
        # Find where the value occurs, but only in the target columns.
        locations = df_cleaned[non_text_columns] == value
        # Count occurrences per column.
        counts_per_column = locations.sum()
        # Filter to include only columns with one or more occurrences.
        affected_columns = counts_per_column[counts_per_column > 0].to_dict()
        # If any columns are affected, log the details.
        if affected_columns:
            replacement_audit[str(value)] = {
                "total_replacements": int(locations.sum().sum()),
                "columns_affected": affected_columns
            }
    # Perform the targeted replacement for string nulls.
    df_cleaned[non_text_columns] = df_cleaned[non_text_columns].replace(string_nulls, np.nan)

    # Store the complete replacement audit in the main audit log.
    audit_log["data_quality_assessment"]["replacements_made"] = replacement_audit

    # Log the total number of NaN values per column after all cleansing operations.
    audit_log["data_quality_assessment"]["nan_counts_post_cleansing"] = \
        df_cleaned.isna().sum().to_dict()

    # --- Step 2: Temporal Filtering Implementation ---
    # Log the original shape and date range of the DataFrame before filtering.
    audit_log["temporal_filtering"]["original_row_count"] = len(df_cleaned)
    audit_log["temporal_filtering"]["original_date_range"] = {
        "start": df_cleaned.index.min().strftime('%Y-%m-%d'),
        "end": df_cleaned.index.max().strftime('%Y-%m-%d')
    }

    # Extract the required date range from the configuration dictionary.
    start_date_str = index_construction_config['corpus_parameters']['date_range_start']
    end_date_str = index_construction_config['corpus_parameters']['date_range_end']

    # Convert date strings to pandas Timestamp objects for robust filtering.
    start_date = pd.to_datetime(start_date_str)
    end_date = pd.to_datetime(end_date_str)

    # Ensure the DataFrame index is sorted for efficient and correct slicing.
    df_cleaned.sort_index(inplace=True)

    # Apply the temporal filter using .loc for an inclusive date range slice.
    df_filtered = df_cleaned.loc[start_date:end_date].copy()

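    # Log the new shape and date range after the filtering operation.
    audit_log["temporal_filtering"]["filtered_row_count"] = len(df_filtered)
    # Guard against an empty slice: index.min() would be NaT, which does not
    # support strftime. The coverage check in Step 3 then reports the problem.
    has_rows = len(df_filtered) > 0
    audit_log["temporal_filtering"]["filtered_date_range"] = {
        "start": df_filtered.index.min().strftime('%Y-%m-%d') if has_rows else None,
        "end": df_filtered.index.max().strftime('%Y-%m-%d') if has_rows else None
    }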

    # --- Step 3: Document Count Verification ---
    # Extract the expected document count from the configuration.
    expected_count = index_construction_config['corpus_parameters']['expected_document_count']

    # Count the actual number of non-null EIU reports in the filtered data.
    actual_count = df_filtered[text_column].count()

    # Calculate the completion percentage of the text corpus.
    completion_pct = actual_count / expected_count if expected_count > 0 else 0

    # Generate the full expected monthly date range for the sample period.
    expected_index = pd.date_range(
        start=start_date, end=end_date, freq='MS'
    )

    # Identify the index of documents that are actually present and not null.
    actual_index = df_filtered[df_filtered[text_column].notna()].index

    # Find the difference between the expected and actual indices to get missing months.
    missing_months = expected_index.difference(actual_index).strftime('%Y-%m').tolist()

    # Log the detailed verification results.
    verification_results = {
        "expected_document_count": expected_count,
        "actual_document_count": int(actual_count),
        "completion_percentage": round(completion_pct, 4),
        "missing_document_months": missing_months,
        "is_corpus_complete": actual_count == expected_count
    }
    audit_log["document_count_verification"] = verification_results

    # Raise a critical error if the document coverage is below the specified threshold.
    if completion_pct < missing_doc_threshold:
        # The error message is specific and actionable for the user.
        raise ValueError(
            f"EIU document coverage is {completion_pct:.2%}, which is below the "
            f"required threshold of {missing_doc_threshold:.2%}. "
            f"Missing {len(missing_months)} documents for months: {missing_months}"
        )

    # Return the fully cleansed and filtered DataFrame and the detailed audit log.
    return df_filtered, audit_log
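
The missing-month detection in Step 3 can be illustrated on a toy frame. This is a minimal sketch using made-up data rather than the EIU corpus. Unlike `cleanse_data`, it divides by the length of the expected index instead of a configured `expected_document_count`.

```python
import pandas as pd

# Toy monthly frame with one missing report (hypothetical data).
idx = pd.date_range("2012-05-01", "2012-10-01", freq="MS")
toy = pd.DataFrame({"EIU": ["a", "b", None, "d", "e", "f"]}, index=idx)

# Every month-start date that should carry a document.
expected_index = pd.date_range(start="2012-05-01", end="2012-10-01", freq="MS")

# Dates that actually carry a non-null document.
actual_index = toy[toy["EIU"].notna()].index

# Months that are expected but absent, formatted as YYYY-MM.
missing_months = expected_index.difference(actual_index).strftime("%Y-%m").tolist()

# Share of expected months that have a document.
coverage = toy["EIU"].count() / len(expected_index)

print(missing_months)       # ['2012-07']
print(round(coverage, 4))   # 0.8333
```

Note that this comparison only works when the DataFrame index uses month-start dates, which is what `freq='MS'` generates. A month-end index would make every month appear missing.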


In [None]:
# Task 2: Keyword List Preparation
# Define a TypedDict for the structure of a single processed lexicon.
# This enhances readability and allows for static type checking.
class ProcessedLexicon(TypedDict):
    unigrams: Set[str]
    ngrams_by_first_token: Dict[str, List[Tuple[str, ...]]]
    max_ngram_len: int

# Define a TypedDict for the overall output structure.
class PreparedLexicons(TypedDict):
    uncertainty: ProcessedLexicon
    brexit: ProcessedLexicon
    covid: ProcessedLexicon
    disambiguation_rules: List[Dict[str, Any]]

def prepare_lexicons(
    uncertainty_lexicon: List[str],
    brexit_lexicon: List[str],
    covid_lexicon: List[str]
) -> PreparedLexicons:
    """
    Processes and optimizes keyword lexicons for high-performance text matching.

    This function transforms raw lists of keywords into a structured and highly
    optimized format suitable for the Context-Aware Uncertainty Attribution
    Algorithm. It performs three key steps:
    1.  Normalizes all keywords (lowercase, strip whitespace) and separates them
        into unigrams (single words) and n-grams (multi-word phrases).
    2.  Implements the specific disambiguation rule from the research paper to
        handle the "Scottish referendum" case by creating a machine-readable rule.
    3.  Builds an optimized lookup structure for n-grams, indexed by their first
        token and sorted by length (longest first), to enable efficient and
        greedy matching within context windows.

    Args:
        uncertainty_lexicon (List[str]): The raw list of uncertainty keywords.
        brexit_lexicon (List[str]): The raw list of Brexit-related keywords.
        covid_lexicon (List[str]): The raw list of COVID-19 related keywords.

    Returns:
        PreparedLexicons: A TypedDict containing the processed and optimized
            lexicons and any disambiguation rules. The structure is designed
            for direct use in subsequent NLP tasks.

    Raises:
        ValueError: If any of the input lexicons are empty, as this would
                    indicate a critical configuration error.
    """
    # --- Input Validation ---
    # Ensure that lexicons are not empty, which would be a critical error.
    if not all([uncertainty_lexicon, brexit_lexicon, covid_lexicon]):
        raise ValueError("One or more input lexicons are empty.")

    # --- Helper Function for processing a single lexicon ---
    def _process_single_lexicon(lexicon: List[str]) -> ProcessedLexicon:
        """Processes a single raw lexicon list into an optimized structure."""
        # Initialize containers for unigrams and n-grams.
        unigrams: Set[str] = set()
        # This dictionary will hold n-grams, keyed by their first word.
        ngrams_by_first_token: Dict[str, List[Tuple[str, ...]]] = {}
        # Track the maximum n-gram length for setting search bounds later.
        max_ngram_len = 1

        # --- Step 1: Multi-word Phrase Processing ---
        # Iterate through each raw keyword in the provided list.
        for raw_keyword in lexicon:
            # Normalize the keyword: lowercase and strip leading/trailing whitespace.
            # This ensures consistency with the text processing pipeline.
            keyword = raw_keyword.lower().strip()

            # Skip if the keyword becomes empty after stripping.
            if not keyword:
                continue

            # Tokenize the keyword by splitting on one or more whitespace characters.
            # This correctly handles phrases with multiple spaces.
            tokens = tuple(re.split(r'\s+', keyword))

            # Classify as a unigram or n-gram based on the number of tokens.
            if len(tokens) == 1:
                # Add the single token to the set of unigrams for O(1) lookup.
                unigrams.add(tokens[0])
            else:
                # This is a multi-word phrase (n-gram).
                # Get the first token to use as a key in our lookup dictionary.
                first_token = tokens[0]

                # If this is the first time we see this starting token, initialize a list.
                if first_token not in ngrams_by_first_token:
                    ngrams_by_first_token[first_token] = []

                # Append the tokenized phrase to the list for this starting token.
                ngrams_by_first_token[first_token].append(tokens)

                # Update the maximum n-gram length found in this lexicon.
                if len(tokens) > max_ngram_len:
                    max_ngram_len = len(tokens)

        # --- Step 3: Lexicon Optimization for Context Window Processing ---
        # For each starting token, sort its associated n-grams by length in
        # descending order. This is crucial for implementing a "greedy" matching
        # strategy that finds the longest possible phrase first.
        # E.g., it will match "exiting the european union" before "exiting the eu".
        for first_token in ngrams_by_first_token:
            ngrams_by_first_token[first_token].sort(key=len, reverse=True)

        # Return the fully processed and optimized lexicon structure.
        return {
            "unigrams": unigrams,
            "ngrams_by_first_token": ngrams_by_first_token,
            "max_ngram_len": max_ngram_len
        }

    # --- Main Function Logic ---
    # Process each of the three main lexicons using the helper function.
    processed_uncertainty = _process_single_lexicon(uncertainty_lexicon)
    processed_brexit = _process_single_lexicon(brexit_lexicon)
    processed_covid = _process_single_lexicon(covid_lexicon)

    # --- Step 2: Disambiguation Rule Implementation ---
    # As per Table 1, Note 2: "any context containing 'referendum' alongside
    # 'Scotland' or 'Scottish' was excluded from the analysis".
    # This is encoded as a machine-readable rule for the downstream processor.
    disambiguation_rules = [
        {
            "target_keyword": "referendum",
            "exclusion_keywords": {"scotland", "scottish"},
            "applies_to_lexicon": "brexit"
        }
    ]

    # Assemble the final, comprehensive `PreparedLexicons` object.
    # This structure is self-contained and ready for the text analysis pipeline.
    prepared_lexicons: PreparedLexicons = {
        "uncertainty": processed_uncertainty,
        "brexit": processed_brexit,
        "covid": processed_covid,
        "disambiguation_rules": disambiguation_rules
    }

    return prepared_lexicons
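
To show why the n-grams are keyed by first token and sorted longest first, the sketch below shows one way a downstream matcher might consume the structure. The function `match_at` is hypothetical and not part of the pipeline shown here; the lexicon entries are hand-built for illustration and follow the `ProcessedLexicon` shape.

```python
from typing import Dict, List, Optional, Set, Tuple


def match_at(
    tokens: List[str],
    pos: int,
    unigrams: Set[str],
    ngrams_by_first_token: Dict[str, List[Tuple[str, ...]]],
) -> Optional[Tuple[str, ...]]:
    """Return the longest lexicon entry starting at tokens[pos], or None."""
    # Candidates are assumed to be pre-sorted longest first, as prepare_lexicons does,
    # so the first hit is the greedy (longest) match.
    for candidate in ngrams_by_first_token.get(tokens[pos], []):
        if tuple(tokens[pos:pos + len(candidate)]) == candidate:
            return candidate
    # Fall back to a single-token match.
    if tokens[pos] in unigrams:
        return (tokens[pos],)
    return None


# Hand-built structure in the ProcessedLexicon shape (illustrative entries only).
unigrams = {"brexit"}
ngrams = {
    "exiting": [
        ("exiting", "the", "european", "union"),
        ("exiting", "the", "eu"),
    ]
}

tokens = "talks on exiting the european union continue".split()
print(match_at(tokens, 2, unigrams, ngrams))  # ('exiting', 'the', 'european', 'union')
print(match_at(tokens, 0, unigrams, ngrams))  # None
```

Keying by first token means only phrases that could start at the current position are compared, rather than the whole lexicon.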


In [None]:
# Task 3: Text Preprocessing

# ProcessedLexicon and PreparedLexicons are the TypedDicts defined in Task 2 above.

def preprocess_text_corpus(
    df: pd.DataFrame,
    prepared_lexicons: PreparedLexicons,
    language: str = 'english'
) -> pd.DataFrame:
    """
    Applies a sequential, configurable NLP preprocessing pipeline to a text corpus.

    This function executes the text processing pipeline as defined in Steps 3-6
    of the research methodology. It takes a cleansed DataFrame and prepared
    lexicons, and returns a DataFrame augmented with new columns representing
    each stage of preprocessing. This ensures a fully traceable and auditable
    data transformation process.

    The pipeline includes:
    1.  Unicode and Lowercase Normalization: Standardizes text for matching.
    2.  NLTK Tokenization: Splits text into words and punctuation.
    3.  Stopword Removal: Removes common function words for a specified language,
        while preserving any keywords from the lexicons.
    4.  N-gram Generation: Creates bigram and trigram sequences.

    Args:
        df (pd.DataFrame): The cleansed DataFrame from Task 1. Must contain an
            'EIU' column with the raw text.
        prepared_lexicons (PreparedLexicons): The optimized lexicon structure
            from Task 2, used to prevent keyword removal.
        language (str): The language for stopword removal, corresponding to
            NLTK's available corpus languages. Defaults to 'english'.

    Returns:
        pd.DataFrame: The input DataFrame augmented with the following new
            columns: 'EIU_lowercase', 'EIU_tokens', 'EIU_cleaned_tokens',
            'bigrams', and 'trigrams'.

    Raises:
        LookupError: If required NLTK data packages (e.g., 'punkt', 'stopwords')
            for the specified language are not found on the system.
        KeyError: If the input DataFrame is missing the required 'EIU' column.
    """
    # --- Input Validation ---
    # Verify that the essential 'EIU' column exists in the DataFrame.
    if 'EIU' not in df.columns:
        # Raise an error if the required column is not found.
        raise KeyError("Input DataFrame must contain an 'EIU' column.")

    # Create a deep copy of the DataFrame to ensure the original is not modified.
    df_processed = df.copy()

    # --- Step 1: Lowercase Normalization (Methodology Step 3) ---
    # Define a helper function for robust text normalization.
    def normalize_text(text: Any) -> Any:
        """Applies Unicode normalization and lowercasing, handling NaNs."""
        # Check if the input is a valid string; otherwise, return it as is (e.g., NaN).
        if not isinstance(text, str):
            return text
        # Apply NFKC Unicode normalization to handle compatibility characters (e.g., ligatures).
        normalized_text = unicodedata.normalize('NFKC', text)
        # Convert the entire normalized string to lowercase for case-insensitive matching.
        return normalized_text.lower()

    # Apply the normalization function to the 'EIU' column using pandas .apply method.
    df_processed['EIU_lowercase'] = df_processed['EIU'].apply(normalize_text)

    # --- Step 2: NLTK Tokenization (Methodology Step 4) ---
    # Define a helper function for tokenization with error handling.
    def tokenize_text(text: Any) -> List[str]:
        """Tokenizes text using NLTK, handling NaNs and missing data errors."""
        # Return an empty list for non-string inputs (e.g., NaN), representing zero tokens.
        if not isinstance(text, str):
            return []
        # Use a try-except block to handle cases where NLTK data is not downloaded.
        try:
            # Use the NLTK word tokenizer as specified in the methodology.
            return nltk.word_tokenize(text)
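        # Catch the specific error for a missing NLTK data package.
        except LookupError as exc:
            # Re-raise with an actionable instruction, keeping the original cause.
            # Newer NLTK releases look for 'punkt_tab' rather than 'punkt'.
            raise LookupError(
                "NLTK tokenizer data not found. Please run: "
                "import nltk; nltk.download('punkt') "
                "(or nltk.download('punkt_tab') on newer NLTK releases)."
            ) from exc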

    # Apply the tokenization function to the newly created lowercase column.
    df_processed['EIU_tokens'] = df_processed['EIU_lowercase'].apply(tokenize_text)

    # --- Step 3: Stopword Removal (Methodology Step 5) ---
    # Use a try-except block for robustly fetching the stopwords list.
    try:
        # Fetch the stopword list for the requested language.
        stopwords_base = set(nltk.corpus.stopwords.words(language))
    # Catch the specific error for a missing NLTK data package.
    except LookupError as exc:
        # Re-raise with a message that names the requested language.
        raise LookupError(
            f"NLTK 'stopwords' data for language '{language}' not found. "
            "Please run: import nltk; nltk.download('stopwords')"
        ) from exc

    # --- Advanced Stopword Handling: Preserve Lexicon Keywords ---
    # Aggregate all unigram keywords from all prepared lexicons into a single set.
    all_lexicon_unigrams = (
        prepared_lexicons['uncertainty']['unigrams'] |
        prepared_lexicons['brexit']['unigrams'] |
        prepared_lexicons['covid']['unigrams']
    )
    # Create the final stopword set by removing any words that also appear in our lexicons.
    final_stopwords = stopwords_base - all_lexicon_unigrams

    # Define a helper function for the stopword removal process.
    def remove_stopwords(tokens: List[str]) -> List[str]:
        """Removes stopwords from a list of tokens using the final stopword set."""
        # Use a list comprehension for an efficient filter operation.
        return [token for token in tokens if token not in final_stopwords]

    # Apply the stopword removal function to the tokenized column.
    df_processed['EIU_cleaned_tokens'] = df_processed['EIU_tokens'].apply(remove_stopwords)

    # --- Step 4: N-gram Analysis (Methodology Step 6) ---
    # Define a helper function to generate n-grams robustly.
    def generate_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
        """Generates n-grams from a list of tokens, handling lists shorter than n."""
        # N-grams can only be generated if the number of tokens is at least n.
        if len(tokens) < n:
            # Return an empty list if not enough tokens are available.
            return []
        # nltk.ngrams generalizes nltk.bigrams (n=2) and nltk.trigrams (n=3),
        # so other values of n no longer silently yield an empty list.
        return list(nltk.ngrams(tokens, n))

    # Apply the n-gram generation function to create a 'bigrams' column.
    df_processed['bigrams'] = df_processed['EIU_cleaned_tokens'].apply(
        lambda tokens: generate_ngrams(tokens, 2)
    )
    # Apply the n-gram generation function to create a 'trigrams' column.
    df_processed['trigrams'] = df_processed['EIU_cleaned_tokens'].apply(
        lambda tokens: generate_ngrams(tokens, 3)
    )

    # Return the fully augmented DataFrame with all preprocessing columns.
    return df_processed
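
The stopword step above can be illustrated with a minimal sketch. It assumes a tiny hand-written stopword set in place of NLTK's list and a hypothetical two-word lexicon; `toy_stopwords` and `toy_lexicon_unigrams` are illustrative names, not part of the pipeline. It shows why lexicon keywords are subtracted from the stopword set before filtering, and how bigrams are then built from the cleaned tokens.

```python
from typing import List, Set, Tuple

# Hypothetical stand-ins for NLTK's stopword list and the prepared lexicons.
toy_stopwords: Set[str] = {"the", "of", "is", "about", "may"}
toy_lexicon_unigrams: Set[str] = {"may", "brexit"}  # "may" doubles as a keyword here

# Keywords that are also stopwords must survive filtering.
final_stopwords = toy_stopwords - toy_lexicon_unigrams

def clean(tokens: List[str]) -> List[str]:
    """Drop every token that is in the final stopword set."""
    return [t for t in tokens if t not in final_stopwords]

def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each token with its successor; empty for fewer than two tokens."""
    return list(zip(tokens, tokens[1:]))

tokens = ["the", "outcome", "of", "brexit", "may", "is", "uncertain"]
cleaned = clean(tokens)
pairs = bigrams(cleaned)
print(cleaned)  # ['outcome', 'brexit', 'may', 'uncertain']
print(pairs)    # [('outcome', 'brexit'), ('brexit', 'may'), ('may', 'uncertain')]
```

Without the set difference, "may" would be removed as a stopword and could never be matched later.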


In [None]:
# Task 4: Named Entity Recognition
# Define a TypedDict for the structure of a single extracted entity.
# This provides clarity and enables static analysis of the output.
class NerEntity(TypedDict):
    text: str
    label: str
    start_char: int
    end_char: int

def load_spacy_model_for_ner(model_name: str = "en_core_web_lg") -> Language:
    """
    Loads the specified SpaCy model, optimized for Named Entity Recognition.

    This function loads a SpaCy language model and disables all pipeline
    components except for the Named Entity Recognizer ('ner'). This is a
    critical performance optimization that focuses computational resources on
    the required task. It includes robust error handling for cases where the
    model is not installed.

    Args:
        model_name (str): The name of the SpaCy model to load. Defaults to
            "en_core_web_lg" as specified in the research methodology.

    Returns:
        Language: The loaded and optimized SpaCy Language object.

    Raises:
        OSError: If the specified SpaCy model is not found on the system,
            providing a clear installation instruction.
    """
    try:
        # Load the specified SpaCy model, disabling all components not needed for NER.
        # This significantly speeds up processing and reduces memory usage.
        nlp = spacy.load(
            model_name,
            disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]
        )
        return nlp
    except OSError:
        # Catch the specific error for a missing model and raise a more
        # informative exception with actionable advice for the user.
        error_message = (
            f"SpaCy model '{model_name}' not found. Please install it by "
            f"running the following command in your terminal:\n"
            f"python -m spacy download {model_name}"
        )
        raise OSError(error_message)

def apply_ner_to_corpus(
    df: pd.DataFrame,
    nlp: Language,
    text_column: str = "EIU_lowercase",
    batch_size: int = 50
) -> pd.DataFrame:
    """
    Applies Named Entity Recognition to a corpus of texts using batch processing.

    This function uses a pre-loaded, optimized SpaCy model to perform NER on
    an entire column of text from a DataFrame. It leverages SpaCy's `nlp.pipe`
    for efficient, memory-safe batch processing. It also dynamically adjusts
    the model's `max_length` to handle potentially very long documents without
    error.

    Args:
        df (pd.DataFrame): The DataFrame containing the text data. It is
            assumed to have passed through the preprocessing steps of Task 3.
        nlp (Language): The pre-loaded and optimized SpaCy Language object from
            `load_spacy_model_for_ner`.
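        text_column (str): The name of the column containing the text to be
            processed. Defaults to "EIU_lowercase". Note that SpaCy's
            statistical NER models are trained on cased text, so lowercased
            input may reduce entity recall; pass "EIU" to use original casing.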
        batch_size (int): The number of documents to process in each batch.
            Tuning this can affect the trade-off between speed and memory usage.
            Defaults to 50.

    Returns:
        pd.DataFrame: The input DataFrame augmented with a new 'ner_entities'
            column. Each entry in this column is a list of dictionaries, where
            each dictionary represents a single named entity.

    Raises:
        KeyError: If the specified `text_column` does not exist in the DataFrame.
    """
    # --- Input Validation ---
    # Verify that the specified text column exists.
    if text_column not in df.columns:
        raise KeyError(f"The specified text column '{text_column}' was not found in the DataFrame.")

    # Create a deep copy to avoid side effects on the original DataFrame.
    df_ner = df.copy()

    # --- Step 2: Batch Processing Implementation ---
    # Convert the text column to a list, replacing any NaN values with empty
    # strings to ensure nlp.pipe receives a clean list of strings.
    texts = df_ner[text_column].fillna('').tolist()

    # --- Robustness: Handle potentially long documents ---
    # Calculate the maximum document length in the corpus.
    max_len = max(len(text) for text in texts) if texts else 0
    # If the longest document exceeds the model's default max_length, increase it.
    # This prevents a ValueError for long texts. Add a small buffer.
    if max_len > nlp.max_length:
        nlp.max_length = max_len + 100

    # Initialize a list to store the extracted entities for each document.
    all_entities: List[List[NerEntity]] = []

    # Process the texts using nlp.pipe, which streams documents through the
    # pipeline in batches and is typically faster than calling nlp() per text.
    # Output order matches input order, so results align with `texts`.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # For each processed document, extract the entities.
        # Create a list of structured dictionaries for the current document's entities.
        doc_entities: List[NerEntity] = [
            {
                "text": ent.text,
                "label": ent.label_,
                "start_char": ent.start_char,
                "end_char": ent.end_char
            }
            for ent in doc.ents
        ]
        # Append the list of entities for this document to the main list.
        all_entities.append(doc_entities)

    # Add the list of extracted entities as a new column to the DataFrame.
    # The list `all_entities` has the same order and length as the input `texts`
    # list, ensuring perfect alignment with the DataFrame's index.
    df_ner['ner_entities'] = all_entities

    # --- Step 3: Entity Integration Strategy ---
    # This function only produces the structured 'ner_entities' column. How the
    # entities are used is left to downstream code: for example, the text of
    # entities with relevant labels (e.g., ORG, GPE, EVENT) could augment the
    # keyword search within context windows. The Task 5 function shown below
    # does not read this column.

    return df_ner
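
One way downstream code might consume the `ner_entities` column is sketched here. The helper `entity_texts_by_label` and the sample entities are hypothetical and written by hand in the shape of `NerEntity`; nothing here calls SpaCy.

```python
from typing import Dict, List, Set, Union

Entity = Dict[str, Union[str, int]]

def entity_texts_by_label(entities: List[Entity], labels: Set[str]) -> Set[str]:
    """Return the lowercased text of every entity whose label is in `labels`."""
    return {str(e["text"]).lower() for e in entities if e["label"] in labels}

# Hand-written sample in the shape produced by apply_ner_to_corpus.
sample: List[Entity] = [
    {"text": "Bank of England", "label": "ORG", "start_char": 4, "end_char": 19},
    {"text": "Scotland", "label": "GPE", "start_char": 40, "end_char": 48},
    {"text": "2016", "label": "DATE", "start_char": 60, "end_char": 64},
]

relevant = entity_texts_by_label(sample, {"ORG", "GPE", "EVENT"})
print(sorted(relevant))  # ['bank of england', 'scotland']
```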


In [None]:
# Task 5: Context-Aware Uncertainty Attribution

class ProcessedLexicon(TypedDict):
    unigrams: Set[str]
    ngrams_by_first_token: Dict[str, List[Tuple[str, ...]]]
    max_ngram_len: int

class PreparedLexicons(TypedDict):
    uncertainty: ProcessedLexicon
    brexit: ProcessedLexicon
    covid: ProcessedLexicon
    disambiguation_rules: List[Dict[str, Any]]

def _find_keywords_in_window(
    window: List[str],
    lexicon: ProcessedLexicon
) -> Set[str]:
    """
    Finds all matching unigram and n-gram keywords from a lexicon in a token window.
    This is a high-performance helper using pre-computed lexicon structures.
    """
    found_keywords: Set[str] = set()
    i = 0
    while i < len(window):
        token = window[i]
        matched = False
        # Check for n-gram matches first. Treating the first match as the longest
        # assumes each candidate list is ordered longest-first.
        if token in lexicon['ngrams_by_first_token']:
            for ngram in lexicon['ngrams_by_first_token'][token]:
                # Check if the n-gram fits within the rest of the window.
                if i + len(ngram) <= len(window):
                    # Compare the slice of the window with the n-gram tuple.
                    if tuple(window[i : i + len(ngram)]) == ngram:
                        # Add the matched phrase (joined back to a string).
                        found_keywords.add(" ".join(ngram))
                        # Advance the index by the length of the matched n-gram.
                        i += len(ngram)
                        matched = True
                        # Break after the first (longest) match for this position.
                        break
        # If no n-gram was matched at this position, check for a unigram.
        if not matched:
            if token in lexicon['unigrams']:
                found_keywords.add(token)
            # Advance the index by one.
            i += 1
    return found_keywords

def attribute_uncertainty_in_corpus(
    df: pd.DataFrame,
    prepared_lexicons: PreparedLexicons,
    config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Executes the Context-Aware Uncertainty Attribution algorithm on the corpus.

    This function is the core of the BRUI calculation. It iterates through each
    document's cleaned tokens, identifies every instance of an uncertainty
    keyword, and analyzes its surrounding context window (as per Equation 1).
    For each window, it detects the presence of Brexit and COVID-19 keywords
    and classifies the uncertainty accordingly. The final output is a DataFrame
    augmented with the raw counts needed for the index construction in Task 6.

    The process for each document is:
    1.  Iterate through tokens to find uncertainty keywords.
    2.  For each uncertainty keyword, construct the context window of up to
        `context_window_size` tokens on each side (10 in Equation 1).
    3.  Search the window for Brexit and COVID-19 keywords using an optimized
        matching algorithm that handles n-grams and disambiguation rules.
    4.  Classify the uncertainty instance based on the findings (Pure Brexit,
        Pure COVID, or Mixed) and increment the appropriate counter.
    5.  Store the final counts for the document.

    Args:
        df (pd.DataFrame): The preprocessed DataFrame from Task 3, containing
            the 'EIU_cleaned_tokens' column.
        prepared_lexicons (PreparedLexicons): The optimized lexicon structure
            from Task 2.
        config (Dict[str, Any]): The main `index_construction_config` dictionary,
            used to get the context window size.

    Returns:
        pd.DataFrame: The input DataFrame augmented with three new columns:
            'pure_brexit_count', 'pure_covid_count', and 'mixed_count',
            containing the raw attribution counts for each document.

    Raises:
        KeyError: If required columns or configuration keys are missing.
    """
    # --- Input Validation and Setup ---
    required_column = 'EIU_cleaned_tokens'
    if required_column not in df.columns:
        raise KeyError(f"Input DataFrame must contain the '{required_column}' column.")

    # Create a deep copy to avoid modifying the original DataFrame.
    df_attributed = df.copy()

    # Extract algorithm parameters from the configuration.
    try:
        window_size = config['algorithm_parameters']['context_window_size']
    except KeyError as e:
        raise KeyError(f"Missing required configuration key: {e}")

    # Extract the specific lexicons for easier access.
    uncertainty_lex = prepared_lexicons['uncertainty']
    brexit_lex = prepared_lexicons['brexit']
    covid_lex = prepared_lexicons['covid']

    # Extract disambiguation rule details.
    # This makes the code cleaner and assumes one rule for now as per the paper.
    disambiguation_rule = prepared_lexicons['disambiguation_rules'][0]
    dis_target = disambiguation_rule['target_keyword']
    dis_exclusions = disambiguation_rule['exclusion_keywords']

    # Initialize lists to store the results for each document.
    pure_brexit_counts: List[int] = []
    pure_covid_counts: List[int] = []
    mixed_counts: List[int] = []

    # --- Main Loop: Iterate through each document (row) in the DataFrame ---
    for tokens in df_attributed[required_column]:
        # Initialize counters for the current document.
        doc_pure_brexit = 0
        doc_pure_covid = 0
        doc_mixed = 0

        # If the document has no tokens, append 0 counts and continue.
        if not tokens:
            pure_brexit_counts.append(0)
            pure_covid_counts.append(0)
            mixed_counts.append(0)
            continue

        # --- Step 1: Context Window Construction (based on finding U) ---
        # Iterate through each token's index to check for uncertainty keywords.
        for i, token in enumerate(tokens):
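            # Check if the current token is an uncertainty unigram, or if a full
            # uncertainty n-gram starts at this position (not just its first token).
            is_uncertainty_keyword = (
                token in uncertainty_lex['unigrams'] or
                any(
                    tuple(tokens[i : i + len(ngram)]) == ngram
                    for ngram in uncertainty_lex['ngrams_by_first_token'].get(token, [])
                )
            )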

            if is_uncertainty_keyword:
                # An uncertainty keyword 'U' is found. Construct the context window.
                # Equation 1: CW = {x_{-10}, ..., U, ..., x_{+10}}
                start = max(0, i - window_size)
                end = i + window_size + 1 # Slice end is exclusive.
                context_window = tokens[start:end]

                # --- Step 2: Keyword Detection within the Window ---
                # Find all Brexit and COVID keywords present in the window.
                brexit_words_found = _find_keywords_in_window(context_window, brexit_lex)
                covid_words_found = _find_keywords_in_window(context_window, covid_lex)

                # Apply the "Scottish referendum" disambiguation rule.
                # If the target keyword was found, check whether any exclusion
                # keyword appears in the same window. Exclusion keywords need not
                # belong to the Brexit lexicon, so the raw window tokens are
                # searched as well as the matched Brexit keywords.
                if dis_target in brexit_words_found:
                    window_terms = set(context_window) | brexit_words_found
                    if not window_terms.isdisjoint(dis_exclusions):
                        # An exclusion keyword is present: drop the target keyword
                        # from the found set, ignoring it for this window.
                        brexit_words_found.remove(dis_target)

                # Determine the final boolean flags after disambiguation.
                brexit_found = bool(brexit_words_found)
                covid_found = bool(covid_words_found)

                # --- Step 3: Conditional Logic Implementation (Equation 2) ---
                # Classify the uncertainty instance and increment the correct counter.
                if brexit_found and not covid_found:
                    # Case 1: Pure Brexit-related uncertainty.
                    doc_pure_brexit += 1
                elif covid_found and not brexit_found:
                    # Case 2: Pure COVID-related uncertainty.
                    doc_pure_covid += 1
                elif brexit_found and covid_found:
                    # Case 3: Mixed uncertainty, to be allocated proportionally later.
                    doc_mixed += 1

        # After processing all tokens in the document, append the final counts.
        pure_brexit_counts.append(doc_pure_brexit)
        pure_covid_counts.append(doc_pure_covid)
        mixed_counts.append(doc_mixed)

    # Add the new count columns to the DataFrame.
    df_attributed['pure_brexit_count'] = pure_brexit_counts
    df_attributed['pure_covid_count'] = pure_covid_counts
    df_attributed['mixed_count'] = mixed_counts

    return df_attributed
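
The window classification above can be shown on a toy example. This is a simplified sketch under stated assumptions: it matches unigrams only, skips the disambiguation rule, and uses made-up keyword sets. `classify_windows` is a hypothetical helper, not the function defined above.

```python
from typing import Dict, List, Set

def classify_windows(
    tokens: List[str],
    uncertainty: Set[str],
    brexit: Set[str],
    covid: Set[str],
    window_size: int,
) -> Dict[str, int]:
    """Count pure-Brexit, pure-COVID and mixed windows around uncertainty tokens."""
    counts = {"pure_brexit": 0, "pure_covid": 0, "mixed": 0}
    for i, token in enumerate(tokens):
        if token not in uncertainty:
            continue
        # Window of up to window_size tokens on each side of position i.
        window = set(tokens[max(0, i - window_size): i + window_size + 1])
        has_brexit = not window.isdisjoint(brexit)
        has_covid = not window.isdisjoint(covid)
        if has_brexit and has_covid:
            counts["mixed"] += 1
        elif has_brexit:
            counts["pure_brexit"] += 1
        elif has_covid:
            counts["pure_covid"] += 1
    return counts

tokens = ["brexit", "talks", "create", "uncertainty", "firms", "delay",
          "investment", "pandemic", "adds", "risk"]
keywords = ({"uncertainty", "risk"}, {"brexit"}, {"pandemic"})

narrow = classify_windows(tokens, *keywords, window_size=3)
wide = classify_windows(tokens, *keywords, window_size=4)
print(narrow)  # {'pure_brexit': 1, 'pure_covid': 1, 'mixed': 0}
print(wide)    # {'pure_brexit': 0, 'pure_covid': 1, 'mixed': 1}
```

Widening the window by one token pulls "pandemic" into the window around "uncertainty", which turns a pure Brexit instance into a mixed one. This is why the window size is read from the configuration rather than hard-coded.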



In [None]:
# Task 6: Brexit-Related Uncertainty Index (BRUI) Calculation

def calculate_brui(
    df: pd.DataFrame,
    config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Calculates the final Brexit-Related Uncertainty Index (BRUI).

    This function executes the final four steps of the index construction
    methodology, transforming the raw attribution counts into the standardized
    and normalized BRUI. It operates with high numerical precision and directly
    implements the specified equations and logic from the research paper.

    The calculation pipeline is as follows:
    1.  TBRUKN Aggregation: Calculates the Total Brexit-Related Uncertainty
        Keyword Number by applying proportional allocation to mixed-count contexts.
    2.  Total Word Count: Computes the total number of cleaned tokens for each
        document to be used as a normalization factor.
    3.  Standardization (Equation 3): Calculates the raw uncertainty density
        (BRUI_raw) by dividing the TBRUKN by the total word count.
    4.  Max-Normalization (Step 11): Scales the raw index so that its maximum
        value over the entire period is 100.

    Args:
        df (pd.DataFrame): The DataFrame from Task 5, containing the columns
            'pure_brexit_count', 'pure_covid_count', 'mixed_count', and
            'EIU_cleaned_tokens'.
        config (Dict[str, Any]): The main `index_construction_config` dictionary,
            used to get the fallback weight for proportional allocation.

    Returns:
        pd.DataFrame: The input DataFrame augmented with the final 'BRUI' column
            as well as intermediate calculation columns ('TBRUKN',
            'total_word_count', 'BRUI_raw') for full auditability.

    Raises:
        KeyError: If required columns or configuration keys are missing.
    """
    # --- Input Validation and Setup ---
    required_cols = [
        'pure_brexit_count', 'pure_covid_count', 'mixed_count', 'EIU_cleaned_tokens'
    ]
    if not all(col in df.columns for col in required_cols):
        raise KeyError(f"Input DataFrame is missing one or more required columns: {required_cols}")

    # Create a deep copy to avoid modifying the original DataFrame.
    df_final = df.copy()

    # Extract the fallback weight from the configuration for proportional allocation.
    try:
        fallback_weight = config['algorithm_parameters']['mixed_count_allocation']['fallback']
    except KeyError as e:
        raise KeyError(f"Missing required configuration key for fallback weight: {e}")

    # --- Step 1: Total Brexit Related Uncertainty Keyword Number (TBRUKN) Aggregation ---
    # This step implements the proportional allocation for mixed-uncertainty contexts.

    # Calculate the denominator for the weight calculation: pure_brexit + pure_covid.
    pure_sum = df_final['pure_brexit_count'] + df_final['pure_covid_count']

    # Calculate the Brexit weight. Use the fallback where the sum of pure counts is zero.
    # brexit_weight = pure_brexit_count / (pure_brexit_count + pure_covid_count)
    brexit_weight = np.divide(
        df_final['pure_brexit_count'],
        pure_sum,
        out=np.full_like(pure_sum, fill_value=fallback_weight, dtype=float),
        where=(pure_sum != 0)
    )

    # Calculate the final TBRUKN for each month (t).
    # TBRUKN_t = pure_brexit_count_t + (mixed_count_t * brexit_proportion_weight_t)
    df_final['TBRUKN'] = df_final['pure_brexit_count'] + (df_final['mixed_count'] * brexit_weight)

    # --- Step 2: Total Word Count Calculation for Standardization ---
    # The word count must be based on the cleaned tokens to match the analysis basis.
    df_final['total_word_count'] = df_final['EIU_cleaned_tokens'].apply(len)

    # --- Step 3: BRUI Standardization Implementation (Equation 3) ---
    # This step calculates the raw uncertainty density for each document.
    # BRUI_t = TBRUKN_t / (Total Number of Words Per Report)_t

    # Perform safe division, returning 0.0 where the word count is 0.
    df_final['BRUI_raw'] = np.divide(
        df_final['TBRUKN'],
        df_final['total_word_count'],
        out=np.zeros_like(df_final['TBRUKN'], dtype=float),
        where=(df_final['total_word_count'] != 0)
    )

    # --- Step 4: Max-Normalization Implementation (Methodology Step 11) ---
    # This step scales the entire series so that the peak uncertainty month is 100.

    # Find the maximum value of the raw, standardized BRUI series.
    max_raw_brui = df_final['BRUI_raw'].max()

    # Get the normalization target from the configuration.
    norm_target = config['finalization_parameters']['normalization_target_max']

    # Normalize the series. If the max value is 0, the result is an all-zero series.
    # normalized_BRUI = (BRUI_raw / max(BRUI_raw)) * 100
    if max_raw_brui > 0:
        df_final['BRUI'] = (df_final['BRUI_raw'] / max_raw_brui) * norm_target
    else:
        # If no uncertainty was ever detected, the index is zero everywhere.
        df_final['BRUI'] = 0.0

    return df_final
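
A worked example of the arithmetic above, written in plain Python. The counts, word totals and the fallback of 0.5 are made-up illustrative values, and `brexit_share` and `brui_series` are hypothetical helpers that mirror the steps rather than reuse the function above.

```python
from typing import List, Sequence

def brexit_share(pure_brexit: int, pure_covid: int, fallback: float) -> float:
    """Brexit weight for mixed windows; fallback when no pure counts exist."""
    total = pure_brexit + pure_covid
    return pure_brexit / total if total else fallback

def brui_series(
    pure_brexit: Sequence[int],
    pure_covid: Sequence[int],
    mixed: Sequence[int],
    words: Sequence[int],
    fallback: float = 0.5,   # assumed value, not taken from the config
    target: float = 100.0,
) -> List[float]:
    raw = []
    for b, c, m, w in zip(pure_brexit, pure_covid, mixed, words):
        tbrukn = b + m * brexit_share(b, c, fallback)  # proportional allocation
        raw.append(tbrukn / w if w else 0.0)           # density per report
    peak = max(raw, default=0.0)
    if peak <= 0:
        return [0.0] * len(raw)
    return [r / peak * target for r in raw]            # peak month equals target

# Month 1: share 6/8 = 0.75, TBRUKN = 6 + 4 * 0.75 = 9, raw = 9 / 1800 = 0.005
# Month 2: no pure counts, fallback 0.5, TBRUKN = 1, raw = 1 / 1000 = 0.001
# Month 3: TBRUKN = 3, raw = 3 / 1200 = 0.0025
brui = brui_series([6, 0, 3], [2, 0, 0], [4, 2, 0], [1800, 1000, 1200])
print([round(v, 6) for v in brui])  # [100.0, 20.0, 50.0]
```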


In [None]:
# Task 7: COVID-19 Related Uncertainty Index (CRUI) Calculation

def calculate_crui(
    df: pd.DataFrame,
    config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Calculates the final COVID-19 Related Uncertainty Index (CRUI).

    This function constructs the CRUI by applying the identical, symmetric
    methodology used for the BRUI. It leverages the same raw attribution counts
    ('pure_brexit_count', 'pure_covid_count', 'mixed_count') to ensure perfect
    consistency in the disentanglement of the two uncertainty sources.

    The calculation pipeline mirrors the BRUI construction:
    1.  TCRUKN Aggregation: Calculates the Total COVID-19 Related Uncertainty
        Keyword Number using a proportional allocation symmetric to the BRUI's.
    2.  Standardization: Calculates the raw CRUI density by dividing the TCRUKN
        by the same total word count used for the BRUI.
    3.  Max-Normalization: Scales the raw CRUI series independently so that its
        own maximum value over the entire period is 100.

    Args:
        df (pd.DataFrame): The DataFrame from Task 6, which must contain the
            raw attribution counts and the 'total_word_count' column.
        config (Dict[str, Any]): The main `index_construction_config` dictionary,
            used to get the fallback weight for proportional allocation.

    Returns:
        pd.DataFrame: The input DataFrame augmented with the final 'CRUI' column
            and its intermediate calculation columns ('TCRUKN', 'CRUI_raw')
            for full auditability.

    Raises:
        KeyError: If required columns from previous tasks or configuration
            keys are missing.
    """
    # --- Input Validation and Setup ---
    # Verify that all required columns from previous tasks are present.
    required_cols = [
        'pure_brexit_count', 'pure_covid_count', 'mixed_count', 'total_word_count'
    ]
    if not all(col in df.columns for col in required_cols):
        raise KeyError(f"Input DataFrame is missing one or more required columns: {required_cols}")

    # Create a deep copy to avoid modifying the original DataFrame.
    df_final = df.copy()

    # Extract the fallback weight from the configuration. This must be the same
    # as used for the BRUI to ensure consistency.
    try:
        fallback_weight = config['algorithm_parameters']['mixed_count_allocation']['fallback']
    except KeyError as e:
        raise KeyError(f"Missing required configuration key for fallback weight: {e}")

    # --- Step 1 & 2: TCRUKN Calculation with Proportional Allocation Consistency ---
    # This step calculates the Total COVID-19 Related Uncertainty Keyword Number.

    # The denominator is the same as for the BRUI calculation.
    pure_sum = df_final['pure_brexit_count'] + df_final['pure_covid_count']

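    # Calculate the COVID weight, which is the symmetric counterpart to the Brexit weight.
    # covid_weight = pure_covid_count / (pure_brexit_count + pure_covid_count)
    # Where the pure counts sum to zero, use the complement of the Brexit fallback
    # so the two weights always sum to 1 (identical when the fallback is 0.5).
    covid_weight = np.divide(
        df_final['pure_covid_count'],
        pure_sum,
        out=np.full_like(pure_sum, fill_value=1.0 - fallback_weight, dtype=float),
        where=(pure_sum != 0)
    )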

    # Calculate the final TCRUKN for each month (t).
    # TCRUKN_t = pure_covid_count_t + (mixed_count_t * covid_proportion_weight_t)
    df_final['TCRUKN'] = df_final['pure_covid_count'] + (df_final['mixed_count'] * covid_weight)

    # --- Step 3: CRUI Standardization and Normalization ---
    # The process is identical to the BRUI but uses the TCRUKN as the numerator.

    # Standardization: Calculate the raw CRUI density.
    # CRUI_raw_t = TCRUKN_t / (Total Number of Words Per Report)_t
    # We reuse the 'total_word_count' column calculated for the BRUI to ensure
    # the normalization base is identical, which is critical for comparing densities.
    df_final['CRUI_raw'] = np.divide(
        df_final['TCRUKN'],
        df_final['total_word_count'],
        out=np.zeros_like(df_final['TCRUKN'], dtype=float),
        where=(df_final['total_word_count'] != 0)
    )

    # Normalization: Scale the CRUI series to its own maximum value of 100.
    # This is done independently of the BRUI's normalization.

    # Find the maximum value of the raw, standardized CRUI series.
    max_raw_crui = df_final['CRUI_raw'].max()

    # Get the normalization target from the configuration.
    norm_target = config['finalization_parameters']['normalization_target_max']

    # Normalize the series. If the max value is 0, the result is an all-zero series.
    # normalized_CRUI = (CRUI_raw / max(CRUI_raw)) * 100
    if max_raw_crui > 0:
        df_final['CRUI'] = (df_final['CRUI_raw'] / max_raw_crui) * norm_target
    else:
        # If no COVID-related uncertainty was ever detected, the index is zero everywhere.
        df_final['CRUI'] = 0.0

    return df_final
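
The symmetry between the two indices can be checked with a short sketch. It assumes the COVID weight is the complement of the Brexit weight, including in the fallback case, so that every mixed window is allocated exactly once. `split_mixed` is a hypothetical helper and the numbers are illustrative.

```python
from typing import Tuple

def split_mixed(
    pure_brexit: int,
    pure_covid: int,
    mixed: int,
    brexit_fallback: float = 0.5,  # assumed value, not taken from the config
) -> Tuple[float, float]:
    """Return (TBRUKN, TCRUKN) with mixed windows split by complementary weights."""
    total = pure_brexit + pure_covid
    brexit_w = pure_brexit / total if total else brexit_fallback
    covid_w = 1.0 - brexit_w
    tbrukn = pure_brexit + mixed * brexit_w
    tcrukn = pure_covid + mixed * covid_w
    return tbrukn, tcrukn

tbrukn, tcrukn = split_mixed(6, 2, 4)
print(tbrukn, tcrukn)  # 9.0 3.0

# With no pure counts, the fallback decides the split; the total is preserved.
fb_brexit, fb_covid = split_mixed(0, 0, 2, brexit_fallback=0.7)
print(round(fb_brexit + fb_covid, 9))  # 2.0
```

Because the weights sum to 1, TBRUKN plus TCRUKN equals the total number of attributed windows (here 6 + 2 + 4 = 12).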


In [None]:
# Task 8: Data Preparation for VAR Analysis

def prepare_data_for_var(
    df: pd.DataFrame,
    config: Dict[str, Any]
) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """
    Prepares the dataset for Vector Autoregression (VAR) analysis.

    This function executes a rigorous, multi-step pipeline to transform the
    raw data into a stationary format suitable for VAR modeling, strictly
    adhering to the specified econometric methodology.

    The pipeline includes:
    1.  Integration: Selects the 10 specified VAR variables and handles any
        missing macroeconomic data via linear interpolation.
    2.  Stationarity Testing (Pre-Transform): Conducts and logs ADF tests on
        the initial data to establish a baseline.
    3.  Logarithmic Transformation: Applies natural logs to specified variables
        to stabilize variance, with robust checks for non-positive values.
    4.  First-Differencing: Applies first-order differencing to all variables
        to induce stationarity.
    5.  Stationarity Testing (Post-Transform): Conducts and logs a final round
        of ADF tests to confirm that all variables are stationary before estimation.

    Args:
        df (pd.DataFrame): The fully constructed DataFrame containing the BRUI
            and all macroeconomic variables.
        config (Dict[str, Any]): The `econometric_analysis_config` dictionary
            that specifies all parameters for the VAR analysis.

    Returns:
        Tuple[pd.DataFrame, Dict[str, Any]]: A tuple containing:
            - The final, stationary DataFrame ready for VAR estimation.
            - A detailed audit log documenting every test and transformation.

    Raises:
        KeyError: If required variables or configuration keys are missing.
        ValueError: If a variable intended for log transformation contains
            non-positive values.
    """
    # --- Initialize Audit Log ---
    audit_log: Dict[str, Any] = {
        "integration": {},
        "stationarity_pre_transform": {},
        "transformations": {},
        "stationarity_post_transform": {}
    }

    # --- Step 1: BRUI Integration with Macroeconomic Variables ---
    try:
        # Select the exact list of variables required for the VAR model.
        var_list = config['model_specification']['variables']
        df_var = df[var_list].copy()
    except KeyError as e:
        raise KeyError(f"A required variable for the VAR model is missing from the DataFrame or config: {e}")

    # Handle missing values in the macroeconomic data via interpolation.
    # This is a standard approach for filling small gaps in monthly series.
    initial_nan_counts = df_var.isna().sum()
    audit_log['integration']['initial_nan_counts'] = initial_nan_counts[initial_nan_counts > 0].to_dict()

    # Apply linear interpolation. With limit_direction='both', leading and
    # trailing gaps are filled as well (using the nearest valid observation).
    df_var.interpolate(method='linear', limit_direction='both', inplace=True)

    final_nan_counts = df_var.isna().sum().sum()
    if final_nan_counts > 0:
        # This should not happen with the above method but is a critical check.
        raise ValueError("Data interpolation failed. NaN values still exist in the VAR dataset.")

    audit_log['integration']['final_row_count'] = len(df_var)

    # --- Step 2: Comprehensive Stationarity Testing (Pre-Transformation) ---
    def _run_adf_test(data: pd.DataFrame) -> Dict[str, Any]:
        """Helper to run and format ADF tests on all columns of a DataFrame."""
        results = {}
        for name, series in data.items():
            # Perform the ADF test with automatic lag selection based on AIC.
            adf_result = adfuller(series.dropna(), autolag='AIC')
            # Format the results into a clean dictionary.
            results[name] = {
                'test_statistic': adf_result[0],
                'p_value': adf_result[1],
                'lags_used': adf_result[2],
                'critical_values': adf_result[4],
                # The unit-root null is rejected (series treated as stationary)
                # when the p-value is below the 0.05 threshold.
                'is_stationary_at_5%': bool(adf_result[1] < 0.05)
            }
        return results

    audit_log['stationarity_pre_transform'] = _run_adf_test(df_var)

    # --- Step 3: Logarithmic Transformation ---
    try:
        log_vars = config['data_transformation_parameters']['log_transform_variables']
    except KeyError as e:
        raise KeyError(f"Missing 'log_transform_variables' key in config: {e}")

    # Pre-flight check: Ensure all values in columns to be logged are positive.
    for col in log_vars:
        if (df_var[col] <= 0).any():
            raise ValueError(
                f"Column '{col}' contains non-positive values and cannot be "
                "log-transformed. Please clean the data."
            )

    # Apply the natural logarithm transformation.
    df_var[log_vars] = np.log(df_var[log_vars])
    audit_log['transformations']['log_transformed_variables'] = log_vars

    # --- Step 4: First-Differencing ---
    try:
        diff_order = config['data_transformation_parameters']['differencing_order']
    except KeyError as e:
        raise KeyError(f"Missing 'differencing_order' key in config: {e}")

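    # Difference every column `diff_order` times to induce stationarity.
    # `DataFrame.diff(periods=n)` takes a lag-n difference, not an n-th order
    # difference, so the first difference is applied repeatedly instead.
    df_stationary = df_var
    for _ in range(diff_order):
        df_stationary = df_stationary.diff()

    # Differencing creates NaNs in the first row(s); these must be removed.
    df_stationary = df_stationary.dropna()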

    audit_log['transformations']['differencing_order'] = diff_order
    audit_log['transformations']['final_sample_size'] = len(df_stationary)
    audit_log['transformations']['final_date_range'] = {
        "start": df_stationary.index.min().strftime('%Y-%m-%d'),
        "end": df_stationary.index.max().strftime('%Y-%m-%d')
    }

    # --- Step 5: Stationarity Testing (Post-Transformation) ---
    # This final check confirms that the data is ready for VAR estimation.
    audit_log['stationarity_post_transform'] = _run_adf_test(df_stationary)

    # Final validation: ensure all series are now stationary.
    for var, result in audit_log['stationarity_post_transform'].items():
        if not result['is_stationary_at_5%']:
            # This is a critical failure of the preparation process.
            raise ValueError(
                f"Stationarity not achieved for variable '{var}' after "
                f"transformation (p-value: {result['p_value']:.4f}). "
                "Review data or consider alternative transformations."
            )

    return df_stationary, audit_log
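
Steps 3 and 4 above combine a log transform with differencing. The sketch below, on made-up monthly values, shows that the first difference of logs equals the log of one plus the period growth rate, and that a lag-2 difference is not the same thing as differencing twice. The series `levels` and its dates are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly level series (illustrative values only).
idx = pd.date_range("2020-01-01", periods=5, freq="MS")
levels = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0], index=idx)

# Log first difference: log(x_t) - log(x_{t-1}) = log(1 + growth rate).
log_diff = np.log(levels).diff().dropna()
growth = levels.pct_change().dropna()

# Second-order differencing (difference of the difference) ...
second_order = levels.diff().diff().dropna()
# ... versus a single difference taken at lag 2.
lag_two = levels.diff(periods=2).dropna()
```

Both transformed series lose their first observations, which is why the pipeline drops the leading NaN rows before testing for stationarity.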


In [None]:
# Task 9: Vector Autoregression (VAR) Analysis

def estimate_var_model(
    df_stationary: pd.DataFrame,
    config: Dict[str, Any],
    max_lags: int = 12,
    lag_selection_criterion: str = 'bic'
) -> Tuple[VARResults, Dict[str, Any]]:
    """
    Selects optimal lag, estimates the VAR model, and prepares for identification.

    This function executes the core econometric modeling pipeline (Task 9) by:
    1.  Determining the optimal lag order 'p' for the VAR model using the
        specified information criteria (AIC, BIC, HQIC).
    2.  Estimating the VAR(p) model using the correctly ordered stationary data,
        ensuring the 'BRUI' variable is first as required for Cholesky identification.
    3.  Performing and logging critical post-estimation diagnostic checks for
        residual autocorrelation, normality, and model stability.
    4.  Computing the Cholesky decomposition of the residual covariance matrix,
        which provides the identification matrix for structural analysis.

    Args:
        df_stationary (pd.DataFrame): The stationary DataFrame from Task 8,
            ready for VAR estimation.
        config (Dict[str, Any]): The `econometric_analysis_config` dictionary.
        max_lags (int): The maximum number of lags to test for lag selection.
            Defaults to 12, appropriate for monthly data.
        lag_selection_criterion (str): The primary criterion to use for selecting
            the optimal lag ('aic', 'bic', or 'hqic'). Defaults to 'bic' for
            its tendency towards parsimony.

    Returns:
        Tuple[VARResults, Dict[str, Any]]: A tuple containing:
            - The fitted `statsmodels.tsa.vector_ar.var_model.VARResults` object.
            - A detailed log dictionary containing lag selection results,
              diagnostic test outcomes, and the Cholesky identification matrix.

    Raises:
        ValueError: If the VAR model is found to be unstable or if the Cholesky
            decomposition fails due to a non-positive definite covariance matrix.
        RuntimeError: If model estimation fails with a linear algebra error.
        KeyError: If required configuration keys are missing.
    """
    # --- Initialize Audit Log ---
    estimation_log: Dict[str, Any] = {
        "lag_selection": {},
        "estimation_summary": {},
        "diagnostics": {},
        "identification": {}
    }

    # --- Step 0: Ensure Correct Variable Ordering ---
    # The Cholesky decomposition's validity depends on the variable order
    # during estimation. We must enforce the order specified in the config.
    try:
        cholesky_order = config['impulse_response_parameters']['cholesky_variable_order']
        df_ordered = df_stationary[cholesky_order].copy()
    except KeyError as e:
        raise KeyError(f"A required variable for Cholesky ordering is missing or config is invalid: {e}")

    # --- Step 1: Optimal Lag Length Selection ---
    # Instantiate a VAR model on the ordered data to select the lag order.
    model_for_lag_selection = VAR(df_ordered)

    # Use the select_order method to test lags up to max_lags.
    lag_selection_results = model_for_lag_selection.select_order(maxlags=max_lags)

    # Store the full results table in the log for auditability.
    estimation_log['lag_selection']['full_results'] = lag_selection_results.summary().as_html()

    # Select the optimal lag based on the chosen criterion. The lag chosen by
    # each criterion is exposed through the `selected_orders` dictionary.
    optimal_lag = lag_selection_results.selected_orders[lag_selection_criterion]
    estimation_log['lag_selection']['selected_criterion'] = lag_selection_criterion
    estimation_log['lag_selection']['optimal_lag_p'] = optimal_lag

    # --- Step 2: VAR Model Estimation ---
    # Instantiate the final VAR model with the correctly ordered data.
    model = VAR(df_ordered)

    # Fit the model using the optimal lag order determined in the previous step.
    # Y_t = c + A_1*Y_{t-1} + ... + A_p*Y_{t-p} + u_t
    try:
        results = model.fit(optimal_lag)
    except np.linalg.LinAlgError as e:
        raise RuntimeError(f"VAR model estimation failed due to a linear algebra error: {e}") from e

    # Log the summary of the estimated model. `VARResults.summary()` returns a
    # text-only summary object, so it is stored as a plain string.
    estimation_log['estimation_summary']['model_summary'] = str(results.summary())

    # --- Post-Estimation Diagnostics ---
    # a) Test for residual autocorrelation (multivariate portmanteau test).
    # H0: No residual autocorrelation up to the tested lag. We want high p-values.
    corr_test = results.test_whiteness(nlags=optimal_lag + 5)
    estimation_log['diagnostics']['serial_correlation_p_value'] = corr_test.pvalue

    # b) Test for residual normality (Jarque-Bera test).
    # H0: Residuals are normally distributed.
    norm_test = results.test_normality()
    estimation_log['diagnostics']['normality_p_value'] = norm_test.pvalue

    # c) Test for model stability.
    # The model is stable if all roots of the companion matrix have modulus < 1.
    is_stable = results.is_stable()
    estimation_log['diagnostics']['is_stable'] = is_stable
    if not is_stable:
        raise ValueError("Estimated VAR model is unstable. Cannot proceed with analysis.")

    # --- Step 3: Cholesky Decomposition for Structural Identification ---
    # This is the key to identifying structural shocks from reduced-form residuals.
    # u_t = P * epsilon_t, where P is the lower-triangular Cholesky factor.

    # Get the residual covariance matrix (Sigma_u) from the results.
    sigma_u = results.sigma_u

    try:
        # Compute the lower-triangular Cholesky factor P.
        identification_matrix_p = np.linalg.cholesky(sigma_u)
    except np.linalg.LinAlgError:
        raise ValueError(
            "Cholesky decomposition failed. The residual covariance matrix "
            "is not positive definite, indicating a severe model issue."
        )

    # Store the identification matrix in the log, with clear labels.
    estimation_log['identification']['cholesky_matrix_P'] = pd.DataFrame(
        identification_matrix_p,
        index=df_ordered.columns,
        columns=df_ordered.columns
    )
    estimation_log['identification']['strategy'] = "Cholesky Decomposition"
    estimation_log['identification']['ordering_assumption'] = cholesky_order

    return results, estimation_log
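
The stability check and the Cholesky identification used above can be illustrated with NumPy alone. The sketch below assumes a two-variable VAR(1); the coefficient matrix `A1` and covariance matrix `sigma_u` are made-up numbers, not estimates from the paper.

```python
import numpy as np

# Hypothetical VAR(1) coefficient matrix and residual covariance matrix.
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])
sigma_u = np.array([[4.0, 2.0],
                    [2.0, 3.0]])

# Stability: for a VAR(1) the companion matrix is A1 itself, and the model
# is stable when every eigenvalue has modulus strictly below one.
moduli = np.abs(np.linalg.eigvals(A1))
is_stable = bool((moduli < 1).all())

# Identification: P is the lower-triangular factor with P @ P.T == sigma_u.
P = np.linalg.cholesky(sigma_u)

# Structural shocks eps_t = P^{-1} u_t have an identity covariance matrix.
P_inv = np.linalg.inv(P)
structural_cov = P_inv @ sigma_u @ P_inv.T
```

Because `P` is lower triangular, the variable ordered first is not affected on impact by shocks to the variables ordered after it, which is why the ordering taken from the config matters.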


In [None]:
# Task 10: Post-Estimation Analysis

def run_post_estimation_analysis(
    var_results: VARResults,
    estimation_log: Dict[str, Any],
    config: Dict[str, Any],
    random_seed: int = 42
) -> Dict[str, Any]:
    """
    Conducts post-estimation analysis on a fitted VAR model.

    This function performs the three key post-estimation analyses required by
    the research methodology:
    1.  Impulse Response Functions (IRFs): Calculates the dynamic response of
        each variable to an identified one-standard-deviation structural shock
        in every other variable, using the pre-computed Cholesky identification.
    2.  Forecast Error Variance Decompositions (FEVDs): Quantifies the
        proportion of movement in each variable that is attributable to shocks
        from the other variables over a given horizon.
    3.  Bootstrapped Confidence Intervals: Generates 90% confidence intervals
        for the IRFs using the specified percentile bootstrap method with 999
        repetitions to provide a measure of statistical uncertainty.

    Args:
        var_results (VARResults): The fitted `statsmodels` VAR results object
            from Task 9.
        estimation_log (Dict[str, Any]): The log from Task 9, containing the
            Cholesky identification matrix.
        config (Dict[str, Any]): The `econometric_analysis_config` dictionary.
        random_seed (int): A seed for the random number generator to ensure
            reproducibility of the bootstrap. Defaults to 42.

    Returns:
        Dict[str, Any]: A dictionary containing the structured results for
            'irf', 'fevd', and 'irf_confidence_intervals'.

    Raises:
        KeyError: If required configuration keys or the Cholesky matrix
            are missing.
    """
    # --- Setup and Parameter Extraction ---
    try:
        # Extract parameters from the configuration dictionary.
        horizon = config['impulse_response_parameters']['horizon']
        ci_level = config['impulse_response_parameters']['confidence_interval_level']
        bootstrap_reps = config['impulse_response_parameters']['bootstrap_repetitions']

        # Extract the pre-computed Cholesky identification matrix.
        cholesky_matrix = estimation_log['identification']['cholesky_matrix_P'].values
    except KeyError as e:
        raise KeyError(f"A required configuration or log key is missing: {e}")

    # The variable names in the correct, estimated order.
    var_names = var_results.names

    # --- Step 1: Impulse Response Function (IRF) Calculation ---
    # Instantiate the IRF analysis object from the VAR results.
    # The formula IRF(h) = Phi_h * P is computed internally.
    irf_analysis = var_results.irf(periods=horizon, var_decomp=cholesky_matrix)

    # Extract the point estimates of the orthogonalized IRFs.
    # Shape: (horizon+1, k, k); element [h, i, j] is the response of variable i
    # at horizon h to a shock in variable j.
    irf_point_estimates = irf_analysis.orth_irfs

    # --- Step 2: Forecast Error Variance Decomposition (FEVD) ---
    # Compute the FEVDs using the fitted model and the same identification matrix.
    fevd_results = var_results.fevd(periods=horizon + 1, var_decomp=cholesky_matrix)

    # The result is an FEVD object whose `summary()` method only prints and
    # returns None, so the tables are built from its `decomp` array instead.
    # `decomp[i, h, j]` is the share of variable i's forecast error variance at
    # step h + 1 that is explained by shocks to variable j.
    fevd_summary = [
        pd.DataFrame(fevd_results.decomp[i], columns=var_names)
        for i in range(len(var_names))
    ]

    # --- Step 3: Percentile Bootstrap Confidence Interval Estimation ---
    # This is a computationally intensive step.
    # Set the random seed for reproducibility.
    np.random.seed(random_seed)

    # The significance level alpha is 1 - confidence_level, so the band spans
    # the alpha/2 and 1 - alpha/2 percentiles of the resampled responses.
    alpha = 1 - ci_level

    # `IRAnalysis.errband_mc` re-simulates the fitted VAR `repl` times,
    # re-estimates the model on each simulated sample, and returns the lower
    # and upper percentile bands of the resulting orthogonalized IRFs.
    irf_lower_bounds, irf_upper_bounds = irf_analysis.errband_mc(
        orth=True,
        repl=bootstrap_reps,
        signif=alpha,
        seed=random_seed,
        burn=0 # No simulated observations are discarded as burn-in
    )

    # --- Structure and Return Results ---
    # Organize the results into a clean, well-structured dictionary.
    post_estimation_results = {
        'irf': {
            'point_estimates': irf_point_estimates,
            'description': "3D array (horizon, responding_var, impulse_var) of IRF point estimates."
        },
        'irf_confidence_intervals': {
            'lower_bound': irf_lower_bounds,
            'upper_bound': irf_upper_bounds,
            'confidence_level': ci_level,
            'bootstrap_repetitions': bootstrap_reps,
            'description': "3D arrays for lower and upper CI bounds, matching the IRF point estimates."
        },
        'fevd': {
            'summary_tables': fevd_summary,
            'description': "List of DataFrames, one for each variable's FEVD."
        },
        'metadata': {
            'horizon': horizon,
            'variable_names': var_names,
            'random_seed_used': random_seed
        }
    }

    return post_estimation_results
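
To make the array layout of the results above concrete, the sketch below computes orthogonalized impulse responses and a forecast error variance decomposition by hand for a two-variable VAR(1). The matrices are made-up numbers, and the `(horizon, response, shock)` layout chosen for `fevd` is a convention of this sketch rather than the library's.

```python
import numpy as np

# Hypothetical VAR(1) coefficients and residual covariance (illustrative).
A1 = np.array([[0.5, 0.1],
               [0.3, 0.4]])
sigma_u = np.array([[1.0, 0.5],
                    [0.5, 1.25]])
P = np.linalg.cholesky(sigma_u)
horizon = 4

# For a VAR(1), Phi_h = A1^h, so the orthogonalized IRF is A1^h @ P.
# orth_irfs[h, i, j]: response of variable i at horizon h to shock j.
orth_irfs = np.stack(
    [np.linalg.matrix_power(A1, h) @ P for h in range(horizon + 1)]
)

# FEVD: cumulative squared responses to each shock, divided by the total
# forecast error variance of the responding variable at that step.
cum_sq = (orth_irfs ** 2).cumsum(axis=0)
fevd = cum_sq / cum_sq.sum(axis=2, keepdims=True)

# Response of variable 1 (index 1) to a shock in variable 0 over time.
response_path = orth_irfs[:, 1, 0]
```

Slicing `orth_irfs[:, response, shock]` yields the path plotted in one IRF panel, and each row of `fevd[h]` sums to one because the shares across shocks exhaust the variance.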


In [None]:
# Task 11: Visualization

def plot_brui_with_events(
    brui_series: pd.Series,
    events: Dict[str, str]
) -> plt.Figure:
    """
    Generates a time series plot of the BRUI annotated with key events.

    This function plots the final, normalized BRUI over its entire sample
    period. Event labels are placed with a collision-aware heuristic that tries
    a fixed set of vertical offsets and keeps the first one that does not
    overlap an already placed label, so that chronologically close events stay
    readable. The plot follows the style and content of Figure 2 from the
    research paper.

    Args:
        brui_series (pd.Series): A pandas Series containing the final BRUI values,
            with a DatetimeIndex.
        events (Dict[str, str]): A dictionary where keys are date strings
            (e.g., 'YYYY-MM-DD') and values are the event descriptions.

    Returns:
        plt.Figure: The matplotlib Figure object containing the plot.

    Raises:
        ValueError: If the input Series does not have a DatetimeIndex.
    """
    # --- Input Validation ---
    # Check if the index of the input Series is a DatetimeIndex.
    if not isinstance(brui_series.index, pd.DatetimeIndex):
        # Raise an error if the index is not of the correct type.
        raise ValueError("Input 'brui_series' must have a DatetimeIndex.")

    # --- Plot Styling Setup ---
    # Set a professional and clean plot style.
    plt.style.use('seaborn-v0_8-whitegrid')
    # Create a Figure and a single Axes object with a specified size and resolution.
    fig, ax = plt.subplots(figsize=(16, 9), dpi=200)

    # --- Plotting the Main Time Series ---
    # Plot the BRUI data with a professional color and line width.
    ax.plot(brui_series.index, brui_series, color='#003366', lw=2.5, label='BRUI')

    # --- Collision-Aware Annotation Placement ---
    # Define a list of preferred vertical offsets for annotations (in data coordinates).
    y_offsets = [brui_series.max() * offset for offset in [0.1, 0.25, 0.4, 0.18, 0.33]]
    # Initialize a list to store the bounding boxes of placed annotations.
    placed_annotations_bboxes = []

    # Sort events by date to process them chronologically.
    sorted_events = sorted(events.items(), key=lambda item: pd.to_datetime(item[0]))

    # Iterate through each event to be annotated on the plot.
    for date_str, description in sorted_events:
        # Convert the event's date string to a pandas Timestamp object.
        event_date = pd.to_datetime(date_str)

        # Proceed only if the event date falls within the plotted data's range.
        if brui_series.index.min() <= event_date <= brui_series.index.max():
            # Draw a vertical line on the plot to mark the event's date.
            ax.axvline(x=event_date, color='red', linestyle='--', linewidth=1.0, alpha=0.75)

            # Find the y-position on the BRUI line corresponding to the event date.
            y_pos = brui_series.asof(event_date)

            # Initialize the best position for the annotation text.
            best_y = y_pos + y_offsets[0]

            # Create the annotation at the first preferred offset. It must stay
            # visible while it is measured: matplotlib skips invisible artists
            # during a draw, so their bounding-box patch would never be updated.
            annotation = ax.annotate(
                description,
                xy=(event_date, y_pos),
                xytext=(event_date, best_y), # Start with the first preferred offset.
                arrowprops=dict(facecolor='black', shrink=0.05, width=1, headwidth=4),
                ha='center',
                fontsize=9,
                bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", lw=0.5, alpha=0.95)
            )

            # --- Iterative Placement Logic ---
            # Assume the initial position is not ideal until proven otherwise.
            is_placed = False
            # Iterate through the preferred vertical offsets.
            for offset in y_offsets:
                # Set the hypothetical y-position for the annotation text.
                annotation.set_position((event_date, y_pos + offset))
                # Redraw so the text box is laid out at the new position.
                fig.canvas.draw()
                # Get the bounding box of the annotation in display coordinates.
                hypothetical_bbox = annotation.get_bbox_patch().get_window_extent()

                # Check if this hypothetical box collides with any already placed boxes.
                is_colliding = any(
                    hypothetical_bbox.overlaps(placed_bbox) for placed_bbox in placed_annotations_bboxes
                )

                # If there is no collision, this is a good position.
                if not is_colliding:
                    # Make the annotation visible at this position.
                    annotation.set_visible(True)
                    # Add its bounding box to the list of placed annotations.
                    placed_annotations_bboxes.append(hypothetical_bbox)
                    # Mark as placed and break the inner loop.
                    is_placed = True
                    break

            # If all preferred positions resulted in a collision, place it at the default.
            if not is_placed:
                # Set the position to the first (closest) offset as a fallback.
                annotation.set_position((event_date, y_pos + y_offsets[0]))
                # Make the annotation visible.
                annotation.set_visible(True)

    # --- Final Plot Formatting ---
    # Set the main title of the plot with appropriate styling.
    ax.set_title('Brexit-Related Uncertainty Index (BRUI) and Major Events', fontsize=18, weight='bold', pad=20)
    # Set the label for the y-axis.
    ax.set_ylabel('BRUI (Normalized, Max=100)', fontsize=14)
    # Set the label for the x-axis.
    ax.set_xlabel('Year', fontsize=14)
    # Set the y-axis limits to provide padding for annotations.
    ax.set_ylim(0, brui_series.max() * 1.5)
    # Display the plot legend.
    ax.legend(loc='upper left', fontsize=12)

    # Configure the x-axis to show years clearly.
    ax.xaxis.set_major_locator(mdates.YearLocator(base=2))
    # Set the format of the year display on the x-axis.
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    # Rotate the x-axis tick labels for better readability.
    plt.xticks(rotation=45, ha='right')
    # Adjust the plot layout to prevent labels from being cut off.
    fig.tight_layout()

    # Return the final Figure object.
    return fig

def plot_comparative_validation(
    df_indices: pd.DataFrame,
    brui_col: str,
    comparison_cols: List[str]
) -> plt.Figure:
    """
    Generates a multi-panel plot comparing the BRUI to other indices.

    This function replicates Figures 3, 4, and 5 from the paper, creating
    subplots to visually compare the calculated BRUI against other established
    Brexit uncertainty indices. It calculates and displays the Pearson
    correlation coefficient for the overlapping period in each subplot.

    Args:
        df_indices (pd.DataFrame): A DataFrame containing the BRUI and all
            comparison indices, with a DatetimeIndex.
        brui_col (str): The name of the column containing the calculated BRUI.
        comparison_cols (List[str]): A list of column names for the indices
            to be compared against the BRUI.

    Returns:
        plt.Figure: The matplotlib Figure object containing the grid of plots.
    """
    # --- Plot Styling and Layout Setup ---
    # Set a professional and clean plot style.
    plt.style.use('seaborn-v0_8-whitegrid')
    # Get the number of comparisons to determine the number of subplots.
    n_comparisons = len(comparison_cols)
    # Create a figure and a grid of subplots.
    fig, axes = plt.subplots(
        nrows=n_comparisons, ncols=1, figsize=(12, 6 * n_comparisons),
        sharex=True, dpi=150
    )
    # Ensure 'axes' is always a list-like object for consistent indexing, even with one subplot.
    if n_comparisons == 1:
        axes = [axes]

    # --- Loop Through Each Comparison ---
    # Iterate through the axes and the names of the comparison columns simultaneously.
    for ax, comp_col in zip(axes, comparison_cols):
        # Create a temporary DataFrame with only the two series to be compared.
        df_comp = df_indices[[brui_col, comp_col]].copy()

        # Drop rows with any NaN values to find the overlapping time period.
        df_overlap = df_comp.dropna()

        # Check if there is enough overlapping data to calculate a correlation.
        if len(df_overlap) > 1:
            # Calculate the Pearson correlation coefficient and the p-value.
            corr, p_value = pearsonr(df_overlap[brui_col], df_overlap[comp_col])
            # Format the correlation text to be displayed on the plot.
            corr_text = f'Correlation: {corr:.2f} (p-value: {p_value:.3f})'
        else:
            # Set a default text if there is no overlapping data.
            corr_text = 'Correlation: N/A (No Overlap)'

        # Plot the BRUI series on the current subplot.
        ax.plot(df_comp.index, df_comp[brui_col], label=brui_col, color='#003366', lw=2)
        # Plot the comparison index series on the same subplot with a different style.
        ax.plot(df_comp.index, df_comp[comp_col], label=comp_col, color='#D55E00', lw=2, linestyle='--')

        # --- Formatting for Each Subplot ---
        # Set the title for the subplot, including the calculated correlation.
        ax.set_title(f'Comparison of {brui_col} with {comp_col}\n({corr_text})', fontsize=14, weight='bold')
        # Set the y-axis label for the subplot.
        ax.set_ylabel('Index Value', fontsize=10)
        # Display the legend for the subplot.
        ax.legend(loc='upper left')
        # Add a grid for better readability.
        ax.grid(True, which='both', linestyle='--', linewidth=0.5)

    # --- Final Global Formatting ---
    # Set the x-axis label only on the bottom-most subplot.
    axes[-1].set_xlabel('Year', fontsize=12)
    # Set a main title for the entire figure.
    fig.suptitle('BRUI Validation Against Alternative Indices', fontsize=18, weight='bold', y=1.02)
    # Adjust the plot layout to prevent titles/labels from overlapping.
    fig.tight_layout(rect=[0, 0, 1, 1])

    # Return the final Figure object.
    return fig

def plot_impulse_response_functions(
    irf_results: Dict[str, Any],
    shock_variable: str = 'BRUI'
) -> plt.Figure:
    """
    Generates a grid plot of Impulse Response Functions with confidence bands.

    This function visualizes the dynamic responses of all macroeconomic variables
    to a one-standard-deviation shock in the specified shock variable (typically
    BRUI), replicating the style of Figure 6 from the paper.

    Args:
        irf_results (Dict[str, Any]): The post-estimation results dictionary
            from Task 10, containing IRF point estimates and confidence intervals.
        shock_variable (str): The name of the variable whose shock is being analyzed.

    Returns:
        plt.Figure: The matplotlib Figure object containing the grid of IRF plots.
    """
    # --- Extract Data from Results Dictionary ---
    # Get the 3D array of IRF point estimates.
    point_estimates = irf_results['irf']['point_estimates']
    # Get the 3D array for the lower bound of the confidence interval.
    lower_bounds = irf_results['irf_confidence_intervals']['lower_bound']
    # Get the 3D array for the upper bound of the confidence interval.
    upper_bounds = irf_results['irf_confidence_intervals']['upper_bound']
    # Get the list of variable names in the correct order.
    var_names = irf_results['metadata']['variable_names']
    # Get the number of periods for the IRF horizon.
    horizon = irf_results['metadata']['horizon']

    # Find the integer index of the variable that is originating the shock.
    try:
        shock_idx = var_names.index(shock_variable)
    except ValueError:
        # Raise an error if the specified shock variable is not in the model.
        raise ValueError(f"Shock variable '{shock_variable}' not found in the model variables.")

    # --- Plot Styling and Layout Setup ---
    # Set a professional and clean plot style.
    plt.style.use('seaborn-v0_8-whitegrid')
    # Identify the variables that will be responding to the shock.
    response_vars = [v for v in var_names] # Plot all responses, including to itself.
    # Get the total number of response variables.
    n_responses = len(response_vars)
    # Define the grid layout for the subplots (e.g., 4x3 for 10 variables).
    n_cols = 3
    # Calculate the required number of rows for the grid.
    n_rows = int(np.ceil(n_responses / n_cols))

    # Create a figure and a grid of subplots.
    fig, axes = plt.subplots(
        nrows=n_rows, ncols=n_cols, figsize=(15, 4 * n_rows),
        sharex=True, dpi=200
    )
    # Flatten the 2D array of axes into a 1D array for easy iteration.
    axes = axes.flatten()

    # --- Loop Through Each Response Variable ---
    # Iterate through the flattened axes and the list of response variables.
    for i, response_var in enumerate(response_vars):
        # Select the current axis object for plotting.
        ax = axes[i]
        # Get the integer index of the current response variable.
        response_idx = var_names.index(response_var)

        # Extract the specific 1D IRF series for this shock-response pair.
        irf_series = point_estimates[:, response_idx, shock_idx]
        # Extract the corresponding lower confidence bound series.
        lower_series = lower_bounds[:, response_idx, shock_idx]
        # Extract the corresponding upper confidence bound series.
        upper_series = upper_bounds[:, response_idx, shock_idx]

        # Define the x-axis as the horizon in months (from 0 to H).
        x_axis = range(horizon + 1)

        # Plot the point estimate of the impulse response.
        ax.plot(x_axis, irf_series, color='#003366', lw=2.5, label='Point Estimate')

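        # Shade the area between the lower and upper bounds, labelling the band
        # with the confidence level stored alongside the intervals.
        ci_level = irf_results['irf_confidence_intervals']['confidence_level']
        ax.fill_between(x_axis, lower_series, upper_series, color='#0072B2', alpha=0.2, label=f'{ci_level:.0%} CI')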

        # Plot a horizontal line at zero, which is the critical reference for significance.
        ax.axhline(0, color='black', linestyle='--', linewidth=1)

        # --- Formatting for Each Subplot ---
        # Set the title for the subplot, indicating the response variable.
        ax.set_title(f'Response of {response_var}', fontsize=12, weight='bold')
        # Set the x-axis limits to the specified horizon.
        ax.set_xlim(0, horizon)
        # Add a grid for better readability.
        ax.grid(True, which='both', linestyle='--', linewidth=0.5)
        # Add a y-axis label only to the plots in the first column.
        if i % n_cols == 0:
            ax.set_ylabel('Response Magnitude', fontsize=10)
        # Add an x-axis label only to the plots in the bottom row.
        if i >= (n_rows - 1) * n_cols:
            ax.set_xlabel('Months After Shock', fontsize=10)

    # --- Final Global Formatting ---
    # Hide any unused subplots if the number of responses is not a multiple of n_cols.
    for i in range(n_responses, len(axes)):
        # Make the unused axis invisible.
        axes[i].set_visible(False)

    # Set a main title for the entire figure.
    fig.suptitle(f'Impulse Responses to a One S.D. Shock in {shock_variable}', fontsize=18, weight='bold')
    # Adjust the plot layout to make space for the main title.
    fig.tight_layout(rect=[0, 0.03, 1, 0.95])

    # Return the final Figure object.
    return fig
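
The dashed zero line in each panel is the visual reference for significance: a response is conventionally read as significant at horizons where the shaded band lies entirely above or below zero. The sketch below makes that reading explicit on synthetic arrays. It assumes the same `(horizon, response, shock)` array layout used in the plotting loop above; the helper name `significant_horizons` and all numbers are hypothetical and serve only as illustration.

```python
import numpy as np


def significant_horizons(lower_bounds, upper_bounds, response_idx, shock_idx):
    """Return the horizons at which the confidence band excludes zero.

    Hypothetical helper: assumes arrays shaped (horizon + 1, n_vars, n_vars),
    indexed as [horizon, response, shock].
    """
    # Slice the 1D bound series for one shock-response pair.
    lower = lower_bounds[:, response_idx, shock_idx]
    upper = upper_bounds[:, response_idx, shock_idx]
    # The band excludes zero when it lies wholly above or wholly below it.
    excludes_zero = (lower > 0) | (upper < 0)
    return np.flatnonzero(excludes_zero)


# Synthetic example: 2 variables, horizons 0..4 (illustrative values only).
horizon, n_vars = 4, 2
point = np.zeros((horizon + 1, n_vars, n_vars))
point[:, 1, 0] = [-0.5, -0.4, -0.1, 0.0, 0.0]
lower = point - 0.2
upper = point + 0.2

sig = significant_horizons(lower, upper, response_idx=1, shock_idx=0)
print(sig.tolist())
```

In this toy case only the first two horizons have a band strictly below zero; from horizon 2 onward the band straddles the zero line.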



In [None]:
# Pipeline

def run_brexit_uncertainty_analysis(
    df_input: pd.DataFrame,
    uncertainty_lexicon: List[str],
    brexit_lexicon: List[str],
    covid_lexicon: List[str],
    index_construction_config: Dict[str, Any],
    econometric_analysis_config: Dict[str, Any],
    brexit_events_for_plotting: Dict[str, str],
    comparison_indices_df: pd.DataFrame = None
) -> Dict[str, Any]:
    """
    Executes the end-to-end research pipeline for the Brexit Uncertainty Index.

    This orchestrator runs every task in sequence, from parameter validation
    to final visualization. Each step's outputs are passed programmatically
    to the next, and all significant results, data, and logs are collected
    in a single results dictionary.

    Args:
        df_input (pd.DataFrame): The raw input DataFrame containing monthly
            macroeconomic data and the EIU text corpus.
        uncertainty_lexicon (List[str]): The raw list of uncertainty keywords.
        brexit_lexicon (List[str]): The raw list of Brexit-related keywords.
        covid_lexicon (List[str]): The raw list of COVID-19 related keywords.
        index_construction_config (Dict[str, Any]): Configuration for the
            text-based index construction.
        econometric_analysis_config (Dict[str, Any]): Configuration for the
            econometric VAR analysis.
        brexit_events_for_plotting (Dict[str, str]): A dictionary of key Brexit
            events for annotating the final BRUI time-series plot.
        comparison_indices_df (pd.DataFrame, optional): A DataFrame containing
            alternative indices (e.g., BRUI_B, BRUI_C) for validation plotting.
            Must have a DatetimeIndex. Defaults to None.

    Returns:
        Dict[str, Any]: A comprehensive dictionary containing all results,
            including final data, audit logs, fitted models, post-estimation
            analysis, and generated figures.
    """
    # Initialize the master results dictionary to store all pipeline artifacts.
    results: Dict[str, Any] = {
        "audit_logs": {},
        "final_data": {},
        "fitted_model": {},
        "analysis_results": {},
        "visualizations": {}
    }

    # --- Task 0: Parameter Validation ---
    # Provide user feedback that the process is starting.
    print("Executing Task 0: Parameter Validation...")
    # Call the validation function as a critical quality gate before any computation.
    validate_parameters(
        df=df_input,
        uncertainty_lexicon=uncertainty_lexicon,
        brexit_lexicon=brexit_lexicon,
        covid_lexicon=covid_lexicon,
        index_construction_config=index_construction_config,
        econometric_analysis_config=econometric_analysis_config
    )
    # Confirm successful validation to the user.
    print("...Validation successful.")

    # --- Task 1: Data Cleansing ---
    # Announce the start of the data cleansing task.
    print("Executing Task 1: Data Cleansing...")
    # Clean the input data; the function also returns an audit log of its changes.
    df_clean, audit_log_1 = cleanse_data(df_input, index_construction_config)
    # Store the detailed audit log from the cleansing process.
    results["audit_logs"]["data_cleansing"] = audit_log_1
    # Confirm task completion.
    print("...Data cleansing complete.")

    # --- Task 2: Keyword List Preparation ---
    # Announce the start of the lexicon preparation task.
    print("Executing Task 2: Keyword List Preparation...")
    # Call the function to process raw keyword lists into an optimized structure.
    prepared_lexicons = prepare_lexicons(uncertainty_lexicon, brexit_lexicon, covid_lexicon)
    # Store the prepared lexicon object for later review.
    results["final_data"]["prepared_lexicons"] = prepared_lexicons
    # Confirm task completion.
    print("...Lexicon preparation complete.")

    # --- Task 3: Text Preprocessing ---
    # Announce the start of the NLP preprocessing pipeline.
    print("Executing Task 3: Text Preprocessing...")
    # Read the stopword language from the config.
    language = index_construction_config['preprocessing_parameters']['stopword_language']
    # Run the preprocessing function with the explicit language parameter.
    df_preprocessed = preprocess_text_corpus(df_clean, prepared_lexicons, language=language)
    # Confirm task completion.
    print("...Text preprocessing complete.")

    # --- Task 4: Named Entity Recognition ---
    # Announce the start of the NER task.
    print("Executing Task 4: Named Entity Recognition...")
    # Load the spaCy model named in the config once to avoid redundant loading.
    nlp_model = load_spacy_model_for_ner(
        model_name=index_construction_config['llm_parameters']['model_identifier']
    )
    # Apply the loaded model to the text corpus using efficient batch processing.
    df_ner = apply_ner_to_corpus(df_preprocessed, nlp_model)
    # Confirm task completion.
    print("...NER complete.")

    # --- Task 5: Context-Aware Uncertainty Attribution ---
    # Announce the start of the core attribution algorithm.
    print("Executing Task 5: Context-Aware Uncertainty Attribution...")
    # Call the function to apply the context window algorithm and classify uncertainty.
    df_attributed = attribute_uncertainty_in_corpus(df_ner, prepared_lexicons, index_construction_config)
    # Confirm task completion.
    print("...Uncertainty attribution complete.")

    # --- Task 6 & 7: Index Calculation (BRUI & CRUI) ---
    # Announce the start of the final index calculation phase.
    print("Executing Tasks 6 & 7: BRUI and CRUI Calculation...")
    # Calculate the BRUI based on the attributed counts.
    df_with_brui = calculate_brui(df_attributed, index_construction_config)
    # Calculate the CRUI using the same data for methodological consistency.
    df_indices = calculate_crui(df_with_brui, index_construction_config)
    # Store the final DataFrame containing both indices and all intermediate components.
    results["final_data"]["indices_and_components"] = df_indices
    # Confirm task completion.
    print("...Index calculation complete.")

    # --- Task 8: Data Preparation for VAR Analysis ---
    # Announce the start of the econometric data preparation.
    print("Executing Task 8: Data Preparation for VAR Analysis...")
    # Call the function to transform the data into a stationary format for the VAR.
    df_stationary, audit_log_8 = prepare_data_for_var(df_indices, econometric_analysis_config)
    # Store the detailed audit log from the transformation process.
    results["audit_logs"]["var_data_preparation"] = audit_log_8
    # Store the final, stationary dataset used for estimation.
    results["final_data"]["stationary_var_dataset"] = df_stationary
    # Confirm task completion.
    print("...VAR data preparation complete.")

    # --- Task 9: Vector Autoregression (VAR) Analysis ---
    # Announce the start of the VAR model estimation.
    print("Executing Task 9: VAR Model Estimation...")
    # Call the function to select lag order, estimate the model, and run diagnostics.
    var_results, estimation_log = estimate_var_model(df_stationary, econometric_analysis_config)
    # Store the fitted statsmodels VAR results object.
    results["fitted_model"]["var_results_object"] = var_results
    # Store the detailed log of the estimation and identification process.
    results["audit_logs"]["var_estimation"] = estimation_log
    # Confirm task completion.
    print("...VAR estimation complete.")

    # --- Task 10: Post-Estimation Analysis ---
    # Announce the start of the post-estimation analysis.
    print("Executing Task 10: Post-Estimation Analysis (IRF, FEVD, Bootstrap)...")
    # Call the function to compute IRFs, FEVDs, and bootstrapped confidence intervals.
    post_estimation_results = run_post_estimation_analysis(var_results, estimation_log, econometric_analysis_config)
    # Store the complete set of analytical results.
    results["analysis_results"] = post_estimation_results
    # Confirm task completion.
    print("...Post-estimation analysis complete.")

    # --- Task 11: Visualization ---
    # Announce the start of the final visualization phase.
    print("Executing Task 11: Generating Visualizations...")
    # Generate Figure 1: The BRUI time series annotated with key events.
    fig1 = plot_brui_with_events(df_indices['BRUI'], brexit_events_for_plotting)

    # Initialize the variable for the second figure to None.
    fig2 = None
    # Check if the optional DataFrame for comparison indices was provided.
    if comparison_indices_df is not None:
        # If so, left-join the external comparison indices onto the BRUI by date index.
        df_for_comparison = df_indices[['BRUI']].join(comparison_indices_df)
        # Generate Figure 2: The comparative validation plots.
        fig2 = plot_comparative_validation(
            df_for_comparison,
            brui_col='BRUI',
            comparison_cols=list(comparison_indices_df.columns)
        )

    # Generate Figure 3: The grid of Impulse Response Functions.
    fig3 = plot_impulse_response_functions(
        post_estimation_results,
        shock_variable='BRUI'
    )

    # Store the generated matplotlib Figure objects in the results dictionary.
    results["visualizations"] = {
        "brui_timeline": fig1,
        "comparative_validation": fig2,
        "impulse_response_functions": fig3
    }
    # Confirm task completion.
    print("...Visualization generation complete.")

    # Announce the successful completion of the entire pipeline.
    print("\nEnd-to-end analysis pipeline executed successfully.")
    # Return the master dictionary containing all artifacts of the analysis.
    return results
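
The comparison step in Task 11 relies on `DataFrame.join`, which by default performs a left join on the index. The sketch below shows what that implies, using synthetic values: months missing from the comparison frame come through as `NaN`, and a comparison column that is also named `BRUI` would make `join` raise, so an explicit check is shown as one possible guard. The column name `BRUI_B` follows the docstring's example; the data and the guard are assumptions, not part of the pipeline.

```python
import pandas as pd

# Synthetic monthly data (illustrative values only).
months = pd.date_range("2016-05-01", periods=4, freq="MS")
df_indices = pd.DataFrame({"BRUI": [1.0, 3.5, 2.0, 1.5]}, index=months)

# Hypothetical comparison index that covers only the first three months.
comparison_indices_df = pd.DataFrame({"BRUI_B": [0.8, 3.0, 2.2]}, index=months[:3])

# Optional guard (an assumption, not in the pipeline): join() raises a
# ValueError on overlapping column names unless suffixes are supplied.
overlap = comparison_indices_df.columns.intersection(["BRUI"])
if len(overlap) > 0:
    raise ValueError(f"Comparison columns clash with BRUI: {list(overlap)}")

# Left join on the DatetimeIndex: every BRUI month is kept.
df_for_comparison = df_indices[["BRUI"]].join(comparison_indices_df)

# Months absent from the comparison frame appear as NaN.
n_missing = int(df_for_comparison["BRUI_B"].isna().sum())
print(df_for_comparison)
print("Months without a comparison value:", n_missing)
```

Because the join keeps every BRUI month, any gaps in the comparison series reach `plot_comparative_validation` as missing values rather than dropped rows.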

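The returned dictionary groups artifacts under five top-level keys. A minimal sketch of how a caller might inspect it is given below; `summarize_results` is a hypothetical helper, and the stub dictionary only mirrors the key names assigned in the function above, with placeholder values instead of real DataFrames, models, or figures.

```python
from typing import Any, Dict, List


def summarize_results(results: Dict[str, Any]) -> List[str]:
    """Return one line per top-level section listing the keys it holds."""
    lines = []
    for section, content in results.items():
        if isinstance(content, dict):
            # Sort the keys for a stable listing; flag empty sections.
            keys = ", ".join(sorted(content)) or "(empty)"
        else:
            keys = type(content).__name__
        lines.append(f"{section}: {keys}")
    return lines


# Stub with the same key names as the pipeline output (placeholder values).
stub_results: Dict[str, Any] = {
    "audit_logs": {"data_cleansing": {}, "var_data_preparation": {}, "var_estimation": {}},
    "final_data": {
        "prepared_lexicons": None,
        "indices_and_components": None,
        "stationary_var_dataset": None,
    },
    "fitted_model": {"var_results_object": None},
    "analysis_results": {},
    "visualizations": {
        "brui_timeline": None,
        "comparative_validation": None,
        "impulse_response_functions": None,
    },
}

summary = summarize_results(stub_results)
for line in summary:
    print(line)
```

Note that `results["visualizations"]["comparative_validation"]` is `None` whenever `comparison_indices_df` is not supplied, so callers should check for that before saving or displaying the figure.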
