# **`README.md`**

## Quantifying Semantic Shift in Financial NLP: A Robust Evaluation Framework

<!-- PROJECT SHIELDS -->
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)
[![arXiv](https://img.shields.io/badge/arXiv-2510.00205-b31b1b.svg)](https://arxiv.org/abs/2510.00205v1)
[![Conference](https://img.shields.io/badge/Conference-ICAIF%20'25-9cf)](https://icaif.acm.org/2025/)
[![Year](https://img.shields.io/badge/Year-2025-purple)](https://github.com/chirindaopensource/quantifying_semantic_shift_financial_nlp)
[![Discipline](https://img.shields.io/badge/Discipline-Financial%20NLP-00529B)](https://github.com/chirindaopensource/quantifying_semantic_shift_financial_nlp)
[![Primary Data](https://img.shields.io/badge/Data-Financial%20News%20%7C%20Stock%20Returns-lightgrey)](https://github.com/chirindaopensource/quantifying_semantic_shift_financial_nlp)
[![Core Method](https://img.shields.io/badge/Method-Regime--Based%20Robustness%20Testing-orange)](https://github.com/chirindaopensource/quantifying_semantic_shift_financial_nlp)
[![Key Metrics](https://img.shields.io/badge/Metrics-FCAS%20%7C%20PCS%20%7C%20TSV%20%7C%20NLICS-red)](https://github.com/chirindaopensource/quantifying_semantic_shift_financial_nlp)
[![Models](https://img.shields.io/badge/Models-LSTM%20%7C%20Transformers-blueviolet)](https://github.com/chirindaopensource/quantifying_semantic_shift_financial_nlp)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Type Checking: mypy](https://img.shields.io/badge/type%20checking-mypy-blue)](http://mypy-lang.org/)
[![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=flat&logo=pandas&logoColor=white)](https://pandas.pydata.org/)
[![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=flat&logo=PyTorch&logoColor=white)](https://pytorch.org/)
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/)
[![Scikit-learn](https://img.shields.io/badge/scikit--learn-%23F7931E.svg?style=flat&logo=scikit-learn&logoColor=white)](https://scikit-learn.org/)
[![OpenAI](https://img.shields.io/badge/OpenAI-412991.svg?style=flat&logo=OpenAI&logoColor=white)](https://openai.com/)
[![Jupyter](https://img.shields.io/badge/Jupyter-%23F37626.svg?style=flat&logo=Jupyter&logoColor=white)](https://jupyter.org/)
---

**Repository:** `https://github.com/chirindaopensource/quantifying_semantic_shift_financial_nlp`

**Owner:** 2025 Craig Chirinda (Open Source Projects)

This repository contains an **independent**, professional-grade Python implementation of the research methodology from the 2025 paper entitled **"Quantifying Semantic Shift in Financial NLP: Robust Metrics for Market Prediction Stability"** by:

*   Zhongtian Sun
*   Chenghao Xiao
*   Anoushka Harit
*   Jongmin Yu

The project provides a complete, end-to-end computational framework for replicating the paper's novel evaluation suite for financial NLP models. It delivers a modular, auditable, and extensible pipeline that executes the entire research workflow: from rigorous data validation and regime-based partitioning, through multi-architecture model training and feature engineering, to the computation of four novel diagnostic metrics and a comprehensive suite of analytical studies.

## Table of Contents

- [Introduction](#introduction)
- [Theoretical Background](#theoretical-background)
- [Features](#features)
- [Methodology Implemented](#methodology-implemented)
- [Core Components (Notebook Structure)](#core-components-notebook-structure)
- [Key Callables](#key-callables)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Input Data Structure](#input-data-structure)
- [Usage](#usage)
- [Output Structure](#output-structure)
- [Project Structure](#project-structure)
- [Customization](#customization)
- [Contributing](#contributing)
- [Recommended Extensions](#recommended-extensions)
- [License](#license)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)

## Introduction

This project provides a Python implementation of the methodologies presented in the 2025 paper "Quantifying Semantic Shift in Financial NLP: Robust Metrics for Market Prediction Stability." The core of this repository is the Jupyter notebook `quantifying_semantic_shift_financial_nlp_draft.ipynb`, which contains a comprehensive suite of functions to replicate the paper's findings, from initial data validation to the final generation of all analytical tables and figures.

The paper introduces a structured evaluation framework to quantify the robustness of financial NLP models under the stress of macroeconomic regime shifts. It argues that standard metrics like MSE are insufficient and proposes four complementary diagnostic metrics to provide a multi-faceted view of model stability. This codebase operationalizes this advanced evaluation suite, allowing users to:
-   Rigorously validate and cleanse time-series financial news and market data.
-   Systematically partition data into distinct macroeconomic regimes (e.g., Pre-COVID, COVID).
-   Perform chronological train-validation-test splits to prevent lookahead bias.
-   Train multiple model architectures (LSTM, Text Transformer, Feature-Enhanced MLP) on a per-regime basis.
-   Compute the four novel diagnostic metrics: **FCAS**, **PCS**, **TSV**, and **NLICS**.
-   Quantify semantic drift between regimes using **Jensen-Shannon Divergence**.
-   Conduct a full suite of analyses, including case studies, ablation studies, and cross-sector generalization tests.

## Theoretical Background

The implemented methods are grounded in time-series econometrics, natural language processing, and deep learning.

**1. Regime-Based Evaluation:**
The framework's foundation is the acknowledgment that financial markets are non-stationary. The data-generating process changes over time, particularly during major economic events. The methodology explicitly partitions the data into distinct macroeconomic regimes, $R = \{r_1, \dots, r_K\}$, and evaluates models within each regime $r_k$. This allows for a precise measurement of performance degradation under structural breaks.
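
As a rough illustration of regime partitioning, the sketch below assigns dated records to named regimes using only the standard library. The regime names and boundary dates in `REGIMES` are hypothetical placeholders (the real ones come from `config.yaml`), and `assign_regime` and `partition_by_regime` are illustrative helpers, not functions from the notebook. With a `pandas.DataFrame`, the same idea would be a boolean mask on the `date` index level.

```python
from datetime import date

# Hypothetical regime boundaries, for illustration only.
REGIMES = {
    "pre_covid": (date(2018, 1, 1), date(2020, 2, 29)),
    "covid": (date(2020, 3, 1), date(2021, 6, 30)),
}


def assign_regime(day, regimes):
    """Return the name of the regime whose closed interval contains `day`."""
    for name, (start, end) in regimes.items():
        if start <= day <= end:
            return name
    return None  # the date falls outside every configured regime


def partition_by_regime(records, regimes):
    """Group (date, payload) records by regime, dropping unassigned dates."""
    parts = {name: [] for name in regimes}
    for day, payload in records:
        name = assign_regime(day, regimes)
        if name is not None:
            parts[name].append((day, payload))
    return parts
```

Each model is then trained and evaluated on one entry of the returned dictionary at a time, which is what makes cross-regime degradation measurable.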

**2. The Four Diagnostic Metrics:**
The paper introduces four metrics to create a "Robustness Profile" beyond simple prediction error:
-   **Financial Causal Attribution Score (FCAS):** Measures if a model's prediction direction aligns with simple causal keywords in the source text.
    $$
    \text{FCAS} = \mathbb{E}[\mathbb{I}(\text{sign}(\text{prediction}) = \text{sign}(\text{causal\_cue}))]
    $$
-   **Patent Cliff Sensitivity (PCS):** Measures the magnitude of change in a model's prediction when the input text is subjected to a controlled semantic perturbation (e.g., "growth" -> "decline").
    $$
    \text{PCS} = \mathbb{E}[|f_\theta(\mathbf{x}) - f_\theta(\tilde{\mathbf{x}})|]
    $$
-   **Temporal Semantic Volatility (TSV):** Measures the drift in the underlying meaning of the text corpus over time, calculated as the average Euclidean distance between embeddings of consecutive news articles.
    $$
    \text{TSV} = \frac{1}{N-1} \sum_{i=1}^{N-1} \|\phi(\mathbf{x}_{i+1}) - \phi(\mathbf{x}_i)\|_2
    $$
-   **NLI-based Logical Consistency Score (NLICS):** Uses a large language model (LLM) to perform Natural Language Inference, assessing whether the model's prediction is a logical entailment of the source news text.
    $$
    \text{NLICS} = \mathbb{E}[\text{EntailmentScore}(\text{text}, \text{Hypothesis}(\text{prediction}))]
    $$
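
The first three formulas above can be sketched in a few lines of plain Python. This is a toy reading of the definitions, not the notebook's implementation: the cue lexicon, the `growth` to `decline` swap table, and the choice to skip texts with no detectable cue in FCAS are all assumptions made for the example, and `model` stands for any callable mapping text to a predicted return.

```python
import math
from statistics import mean

# Hypothetical cue lexicon and perturbation table, for illustration only.
POSITIVE_CUES = {"growth", "beat", "surge"}
NEGATIVE_CUES = {"decline", "miss", "plunge"}
SWAPS = (("growth", "decline"),)


def sign(x):
    """Return -1, 0 or +1."""
    return (x > 0) - (x < 0)


def causal_cue(text):
    """Sign of (positive cue count - negative cue count) in the text."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE_CUES for t in tokens) - sum(t in NEGATIVE_CUES for t in tokens)
    return sign(score)


def fcas(texts, predictions):
    """Share of cue-bearing texts whose prediction sign matches the cue sign."""
    scored = [(causal_cue(t), p) for t, p in zip(texts, predictions)]
    scored = [(c, p) for c, p in scored if c != 0]  # assumption: skip texts without a cue
    if not scored:
        return float("nan")
    return mean(1.0 if sign(p) == c else 0.0 for c, p in scored)


def perturb(text):
    """Apply the controlled word swaps to produce the perturbed input."""
    for old, new in SWAPS:
        text = text.replace(old, new)
    return text


def pcs(model, texts):
    """Mean absolute change in the prediction under perturbation."""
    return mean(abs(model(t) - model(perturb(t))) for t in texts)


def tsv(embeddings):
    """Mean Euclidean distance between consecutive embeddings."""
    if len(embeddings) < 2:
        raise ValueError("TSV needs at least two embeddings")
    return mean(math.dist(a, b) for a, b in zip(embeddings, embeddings[1:]))
```

For instance, `tsv([(0, 0), (3, 4), (3, 4)])` averages the distances 5 and 0, giving 2.5.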

**3. Semantic Drift Quantification:**
The linguistic shift between any two regimes is quantified using the **Jensen-Shannon (J-S) Divergence** between their respective vocabulary probability distributions. This provides a formal measure of how much the language used in financial news has changed.
$$
D_{JS}(P, Q) = \frac{1}{2}D_{KL}(P \,\|\, M) + \frac{1}{2}D_{KL}(Q \,\|\, M), \quad M = \frac{1}{2}(P+Q)
$$
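
A minimal sketch of the J-S computation follows. It assumes whitespace tokenization, a shared vocabulary, and base-2 logarithms (so the divergence lies in [0, 1]); the notebook may tokenize, smooth, or choose the log base differently. `vocab_distribution` is an illustrative helper, not a name from the codebase.

```python
import math
from collections import Counter


def vocab_distribution(texts, vocab):
    """Relative frequency of each vocabulary word across a list of texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts[w] for w in vocab)
    if total == 0:
        raise ValueError("no vocabulary words found in texts")
    return [counts[w] / total for w in vocab]


def kl_divergence(p, q):
    """D_KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Because the mixture `m` is positive wherever `p` or `q` is, the KL terms never divide by zero, which is one practical reason to prefer J-S over raw KL for sparse vocabulary distributions.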

## Features

The provided Jupyter notebook (`quantifying_semantic_shift_financial_nlp_draft.ipynb`) implements the full research pipeline, including:

-   **Modular, Multi-Phase Architecture:** The entire pipeline is broken down into 35 distinct, modular tasks, each with its own orchestrator function, covering validation, partitioning, feature engineering, training, inference, and a full suite of analyses.
-   **Configuration-Driven Design:** All experimental parameters are managed in an external `config.yaml` file, allowing for easy customization and replication without code changes.
-   **Multi-Architecture Support:** Complete training and evaluation pipelines for three distinct model types: a baseline LSTM, a fine-tuned Text Transformer (DistilBERT), and a hybrid Feature-Enhanced MLP.
-   **Idempotent and Resumable Pipelines:** Computationally expensive steps, such as model training and LLM-based evaluations, are designed to be idempotent (resumable), saving checkpoints and caching results to prevent loss of progress and redundant computation.
-   **Production-Grade Metric Implementation:** Includes a highly performant, asynchronous, and cached implementation for the NLICS metric and a full-pipeline replication for the computationally intensive PCS metric.
-   **Comprehensive Analysis Suite:** Implements all analyses from the paper, including J-S divergence, t-SNE visualization, stock-specific case studies, control experiments, and a full N x N cross-sector generalization matrix.
-   **Automated Reporting:** Programmatic generation of all key tables and figures from the paper, as well as a final, synthesized analytical report.
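
The idempotent, resumable behaviour described above can be approximated with a small disk cache keyed on a hash of the step's inputs. The sketch below is one possible pattern under that assumption, not the notebook's actual checkpointing code; `cache_key` and `run_once` are hypothetical names, and the payload is assumed to be JSON-serializable.

```python
import hashlib
import json
import pickle
from pathlib import Path


def cache_key(payload):
    """Stable SHA-256 digest of a JSON-serializable payload."""
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def run_once(payload, compute, cache_dir):
    """Return the cached result for `payload`, computing it only if missing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(payload)}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute(payload)
    # Write to a temporary file first so an interrupted run never leaves
    # a half-written cache entry behind.
    tmp = path.with_suffix(".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(result, fh)
    tmp.replace(path)
    return result
```

Re-running the pipeline with the same payload then skips the expensive call, which matters most for training runs and paid LLM requests.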

## Methodology Implemented

The core analytical steps directly implement the methodology from the paper:

1.  **Validation & Cleansing (Tasks 1-3):** Ingests and rigorously validates the raw data and `config.yaml`, performs a deep data quality audit, and standardizes all data.
2.  **Data Partitioning (Tasks 4-6):** Partitions the data by macroeconomic regime and performs chronological train/val/test splits.
3.  **Feature Engineering (Tasks 7-9):** Generates TF-IDF, sentence embedding, and combined feature sets.
4.  **Model Training (Tasks 10-15):** Orchestrates the training of all 12 model-regime pairs with early stopping and checkpointing.
5.  **Inference & Evaluation (Tasks 16-24):** Generates predictions on all test sets and computes the full suite of five performance and diagnostic metrics.
6.  **Analysis & Ablation (Tasks 25-35):** Executes all higher-level analyses, including semantic drift calculation, visualizations, case studies, and ablation studies.
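
The chronological split in step 2 can be sketched as follows. The default ratios are placeholders rather than the values in `config.yaml`, and `chronological_split` is an illustrative helper; a production version would likely also cut on date boundaries so that one trading day never straddles two splits.

```python
def chronological_split(records, train=0.7, val=0.15):
    """Split (date, payload) records into train/val/test in time order.

    Records are sorted by date first, so every training row precedes every
    validation row, which in turn precedes every test row.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    n_train = int(n * train)
    n_val = int(n * val)
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_val],
        ordered[n_train + n_val:],  # the remainder becomes the test set
    )
```

Unlike a shuffled split, this ordering keeps future observations out of the training set, which is the lookahead bias the pipeline is designed to avoid.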

## Core Components (Notebook Structure)

The `quantifying_semantic_shift_financial_nlp_draft.ipynb` notebook is structured as a logical pipeline with modular orchestrator functions for each of the major tasks. All functions are self-contained, fully documented with type hints and docstrings, and designed for professional-grade execution.

## Key Callables

The project is designed around a single, top-level user-facing interface function:

-   **`execute_quantifying_semantic_shift_study`:** This master orchestrator function runs the entire automated research pipeline from end-to-end. It handles all data processing, model training, and analysis. It also generates the necessary files for the optional, human-in-the-loop entailment model comparison. A single call to this function reproduces the entire computational portion of the project.

## Prerequisites

-   Python 3.9+
-   An OpenAI API key set as an environment variable (`OPENAI_API_KEY`) for the NLICS metric.
-   Core dependencies: `pandas`, `numpy`, `scipy`, `scikit-learn`, `pyyaml`, `torch`, `transformers`, `sentence-transformers`, `openai`, `matplotlib`, `seaborn`, `tqdm`, `ipython`.

## Installation

1.  **Clone the repository:**
    ```sh
    git clone https://github.com/chirindaopensource/quantifying_semantic_shift_financial_nlp.git
    cd quantifying_semantic_shift_financial_nlp
    ```

2.  **Create and activate a virtual environment (recommended):**
    ```sh
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    ```

3.  **Install Python dependencies:**
    ```sh
    pip install -r requirements.txt
    ```

4.  **Set Environment Variable:**
    ```sh
    export OPENAI_API_KEY="your_api_key_here"
    ```

## Input Data Structure

The pipeline requires a single `pandas.DataFrame` and a `config.yaml` file. The script includes a helper function to generate a synthetic, structurally correct DataFrame for testing purposes. The required schema is:
-   **Index:** A `pandas.MultiIndex` with three levels:
    -   `date` (`DatetimeIndex`): The trading date.
    -   `ticker` (`object`): The stock ticker.
    -   `sector` (`object`): The GICS sector.
-   **Columns:**
    -   `Open`, `High`, `Low`, `Close`, `Adj Close` (`float64`): Standard market data.
    -   `Volume` (`int64`): Daily trading volume.
    -   `aggregated_text` (`object`/`str`): Concatenated daily news text. An empty string is a valid value.
    -   `target_return` (`float64`): The forward-looking, next-day adjusted close return.
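
For orientation, here is a small sketch that builds a toy frame with this schema and checks it. The tickers, sector labels, and values are made up, and `make_toy_frame` and `schema_problems` are hypothetical helpers rather than the notebook's own synthetic-data generator or validator.

```python
import pandas as pd

EXPECTED_INDEX = ["date", "ticker", "sector"]
EXPECTED_COLUMNS = ["Open", "High", "Low", "Close", "Adj Close",
                    "Volume", "aggregated_text", "target_return"]


def make_toy_frame():
    """Build a four-row frame with the three-level index and required columns."""
    dates = pd.to_datetime(["2024-01-02", "2024-01-03"])
    listings = [("AAA", "Information Technology"), ("BBB", "Health Care")]  # made up
    index = pd.MultiIndex.from_tuples(
        [(d, t, s) for d in dates for t, s in listings],
        names=EXPECTED_INDEX,
    )
    n = len(index)
    df = pd.DataFrame(
        {
            "Open": [100.0] * n,
            "High": [101.0] * n,
            "Low": [99.0] * n,
            "Close": [100.5] * n,
            "Adj Close": [100.5] * n,
            "Volume": [1_000_000] * n,
            # An empty string is allowed: no news for that ticker on that day.
            "aggregated_text": ["Company reports strong growth.", "", "Guidance cut.", ""],
            "target_return": [0.01, -0.002, -0.015, 0.0],
        },
        index=index,
    )
    return df.astype({"Volume": "int64"})


def schema_problems(df):
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    if list(df.index.names) != EXPECTED_INDEX:
        problems.append(f"index levels are {list(df.index.names)}")
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    return problems
```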

## Usage

The `quantifying_semantic_shift_financial_nlp_draft.ipynb` notebook provides a complete, step-by-step guide. The primary workflow is to call the top-level orchestrator from a `main.py` script or the final cell of the notebook:

```python
# main.py
import pandas as pd
import yaml

# Assuming all pipeline functions are in `pipeline.py`
from pipeline import execute_quantifying_semantic_shift_study

# Load configuration
with open("config.yaml", 'r') as f:
    study_config = yaml.safe_load(f)

# Load data (or generate synthetic data)
raw_df = pd.read_pickle("data/financial_data.pkl")

# Run the entire study
final_artifacts = execute_quantifying_semantic_shift_study(
    raw_df=raw_df,
    study_config=study_config
)
```

## Output Structure

The `execute_quantifying_semantic_shift_study` function creates a `results/` directory and returns a dictionary of artifact paths:

```
{
    'data_splits': Path('results/data_splits.pkl'),
    'training_results': Path('results/training_results.pkl'),
    'enriched_predictions': Path('results/enriched_predictions.pkl'),
    'robustness_profile': Path('results/robustness_profile.csv'),
    'js_divergence_matrix': Path('results/js_divergence_matrix.csv'),
    'nli_benchmark_for_annotation': Path('results/nli_benchmark_for_annotation.csv'),
    ...
}
```

## Project Structure

```
quantifying_semantic_shift_financial_nlp/
│
├── quantifying_semantic_shift_financial_nlp_draft.ipynb # Main implementation notebook
├── config.yaml                                          # Master configuration file
├── requirements.txt                                     # Python package dependencies
├── LICENSE                                              # MIT license file
└── README.md                                            # This documentation file
```

## Customization

The pipeline is highly customizable via the `config.yaml` file. Users can easily modify all experimental parameters, including regime dates, model architectures, feature engineering settings, and LLM prompts, without altering the core Python code.

## Contributing

Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.

## Recommended Extensions

Future extensions could include:
-   **Additional Model Architectures:** Integrating other models like FinBERT or more advanced transformer architectures.
-   **Alternative Diagnostic Metrics:** Implementing other measures of model robustness, such as influence functions or prediction confidence calibration.
-   **Automated Retraining Triggers:** Building a system that uses the computed drift metrics (like TSV or J-S Divergence) to automatically trigger model retraining when a significant regime shift is detected.
-   **Dynamic Feature Selection:** Exploring methods for dynamically adjusting feature importance based on the detected market regime.

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

## Citation

If you use this code or the methodology in your research, please cite the original paper:

```bibtex
@inproceedings{sun2025quantifying,
  author    = {Sun, Zhongtian and Xiao, Chenghao and Harit, Anoushka and Yu, Jongmin},
  title     = {Quantifying Semantic Shift in Financial NLP: Robust Metrics for Market Prediction Stability},
  booktitle = {Proceedings of the 6th ACM International Conference on AI in Finance},
  series    = {ICAIF '25},
  year      = {2025},
  publisher = {ACM}
}
```

For the implementation itself, you may cite this repository:
```
Chirinda, C. (2025). A Professional-Grade Implementation of the "Quantifying Semantic Shift" Framework.
GitHub repository: https://github.com/chirindaopensource/quantifying_semantic_shift_financial_nlp
```

## Acknowledgments

-   Credit to **Zhongtian Sun, Chenghao Xiao, Anoushka Harit, and Jongmin Yu** for the foundational research that forms the entire basis for this computational replication.
-   This project is built upon the exceptional tools provided by the open-source community. Sincere thanks to the developers of the scientific Python ecosystem, including **Pandas, NumPy, SciPy, Scikit-learn, PyTorch, HuggingFace, and Jupyter**, whose work makes complex computational analysis accessible and robust.

---

*This README was generated based on the structure and content of `quantifying_semantic_shift_financial_nlp_draft.ipynb` and follows best practices for research software documentation.*

# Paper

Title: "*Quantifying Semantic Shift in Financial NLP: Robust Metrics for Market Prediction Stability*"

Authors: Zhongtian Sun, Chenghao Xiao, Anoushka Harit, Jongmin Yu

E-Journal Submission Date: 30 September 2025

Link: https://arxiv.org/abs/2510.00205v1

Paper's Conference Affiliation: The 6th ACM International Conference on AI in Finance

Abstract:

Financial news is essential for accurate market prediction, but evolving narratives across macroeconomic regimes introduce semantic and causal drift that weaken model reliability. We present an evaluation framework to quantify robustness in financial NLP under regime shifts. The framework defines four metrics: (1) Financial Causal Attribution Score (FCAS) for alignment with causal cues, (2) Patent Cliff Sensitivity (PCS) for sensitivity to semantic perturbations, (3) Temporal Semantic Volatility (TSV) for drift in latent text representations, and (4) NLI-based Logical Consistency Score (NLICS) for entailment coherence. Applied to LSTM and Transformer models across four economic periods (pre-COVID, COVID, post-COVID, and rate hike), the metrics reveal performance degradation during crises. Semantic volatility and Jensen-Shannon divergence correlate with prediction error. Transformers are more affected by drift, while feature-enhanced variants improve generalisation. A GPT-4 case study confirms that alignment-aware models better preserve causal and logical consistency. The framework supports auditability, stress testing, and adaptive retraining in financial AI systems.

# Summary

### **The Core Problem and Motivation**

The authors begin by correctly identifying the Achilles' heel of many financial prediction models: **regime shifts**. Econometricians call these structural breaks, a form of non-stationarity. Financial markets are not a stationary process; their underlying dynamics change due to macroeconomic events, policy shifts, and crises.

The novel contribution here is framing this problem through the lens of **Natural Language Processing (NLP)**. The paper posits that these macroeconomic regime shifts induce a corresponding **"semantic and causal drift"** in the language used in financial news.

*   **Semantic Drift:** The meaning and connotation of words change. For example, the word "unprecedented" had a different weight and context pre-COVID than it did during the pandemic's peak.
*   **Causal Drift:** The relationship between a described event and its market impact changes. A central bank announcing "liquidity injections" might be interpreted as a strong bullish signal in a stable market, but as a desperate, bearish signal during a financial panic.

Existing models, often optimized for predictive accuracy (e.g., minimizing Mean Squared Error), are blind to this drift. They learn statistical correlations from a specific regime and fail, sometimes catastrophically, when that regime ends. The paper's motivation is to move beyond simple accuracy metrics and create a diagnostic framework to measure model *robustness* in the face of these shifts.

### **The Proposed Methodological Framework**

This is the heart of the paper. Instead of proposing a new state-of-the-art predictive model, the authors propose a framework for *evaluating* existing models. They introduce a suite of four complementary metrics, designed to form a "Robustness Profile."

Let's break down each metric:

1.  **Financial Causal Attribution Score (FCAS):** This metric attempts to measure if the model's prediction aligns with explicit causal statements in the source text. For a news article stating "strong earnings cause stock to rise," FCAS checks if the model's prediction is positive. This is a crucial step towards interpretability, moving from pure correlation to a semblance of causal alignment. From a computer science perspective, it's a form of model audit against ground-truth reasoning.

2.  **Patent Cliff Sensitivity (PCS):** This is a classic perturbation analysis, akin to adversarial testing. The authors perturb the input text with small, semantically meaningful changes (e.g., "growth" to "decline") and measure the change in the model's output. A robust model should be sensitive to meaningful changes but not overly volatile. This tests the local stability of the function the model has learned.

3.  **Temporal Semantic Volatility (TSV):** This metric operates in the model's latent space. It measures the geometric distance (L2 norm) between the embeddings of texts from consecutive time periods. High TSV indicates that the model's internal representation of the world is changing rapidly, signaling instability and potential semantic drift. It's a direct probe into the model's representational geometry over time.

4.  **NLI-based Logical Consistency Score (NLICS):** This is perhaps the most modern of the metrics. It uses a powerful Large Language Model (LLM), like GPT-4, as an external "umpire." The process is:
    *   Take the input news article (the *premise*).
    *   Take the model's prediction and convert it to a natural language statement, e.g., "The stock price will increase" (the *hypothesis*).
    *   Ask the LLM umpire if the premise entails the hypothesis.
    This metric assesses the high-level logical and semantic coherence between the input data and the model's output.
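
The three-step NLICS process can be sketched without calling any external service by passing the judge in as a function. In the sketch below, `prediction_to_hypothesis` and `nlics` are illustrative names, the "decrease" and "stay flat" hypothesis wordings are assumptions modelled on the "increase" example above, and `keyword_judge` is a deliberately crude stand-in for the LLM umpire so the example runs offline.

```python
from statistics import mean


def prediction_to_hypothesis(predicted_return):
    """Turn a numeric prediction into a natural-language hypothesis."""
    if predicted_return > 0:
        return "The stock price will increase."
    if predicted_return < 0:
        return "The stock price will decrease."
    return "The stock price will stay flat."


def nlics(texts, predictions, entailment_score):
    """Average entailment score of (premise, hypothesis) pairs.

    `entailment_score(premise, hypothesis)` must return a float in [0, 1];
    in the real pipeline this role is played by an LLM-based judge.
    """
    return mean(
        entailment_score(text, prediction_to_hypothesis(pred))
        for text, pred in zip(texts, predictions)
    )


def keyword_judge(premise, hypothesis):
    """Toy stand-in for the umpire: looks for a single supporting keyword."""
    premise = premise.lower()
    if "increase" in hypothesis:
        return 1.0 if "beat" in premise else 0.0
    if "decrease" in hypothesis:
        return 1.0 if "miss" in premise else 0.0
    return 0.0
```

Keeping the judge as an injected callable also makes it straightforward to compare different entailment models on the same predictions.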

Together, these four metrics provide a multi-faceted view of a model's behavior that goes far beyond a single loss value.

### **Experimental Design**

The authors' empirical setup is sound and well-designed to test their framework.

*   **Models:** They wisely choose three representative architectures:
    *   **LSTM:** A classic recurrent neural network, good at capturing sequences but less context-aware than modern architectures.
    *   **Transformer (DistilBERT):** The modern standard for NLP, with a powerful attention mechanism to capture complex relationships in text.
    *   **Feature-based Transformer:** A hybrid model that combines dense semantic embeddings with sparse, traditional features (TF-IDF). This allows them to test the value of feature engineering.
*   **Data & Task:** They use a standard financial news dataset (FNSPID) to predict next-day stock returns for 110 S&P 500 companies. This is a canonical and notoriously difficult task in quantitative finance.
*   **Regimes:** Crucially, they partition their data into four distinct, economically meaningful periods: Pre-COVID, COVID, Post-COVID, and the recent Rate-Hike cycle. This partitioning is what allows them to explicitly measure performance degradation across structural breaks.

### **Key Empirical Findings**

The results validate both the severity of the problem and the utility of their proposed metrics.

1.  **Performance Confirms the Problem:** As expected, model performance (measured by MSE) degrades significantly during the volatile COVID and Rate-Hike periods. The standard Transformer is particularly brittle, showing a massive spike in error during the COVID crisis.

2.  **LSTM vs. Transformer Dynamics:** The LSTM, while perhaps less powerful, proves to be more stable across regimes. The Transformer is more expressive and performs better in stable periods but is highly sensitive to drift. This highlights a classic bias-variance tradeoff: the Transformer's high capacity allows it to overfit to the narrative of a specific regime.

3.  **Semantic Drift Correlates with Error:** The authors show that the Jensen-Shannon (JS) divergence—a measure of how different the vocabulary distributions are between regimes—peaks between the most dissimilar periods (e.g., COVID vs. Rate-Hike). This peak in linguistic drift coincides with the peak in model error, providing strong evidence for their core hypothesis.

4.  **Value of Feature Augmentation:** The hybrid Transformer model, which uses both modern embeddings and traditional TF-IDF features, demonstrates improved generalization and lower semantic volatility. This suggests that grounding powerful semantic models with more stable, sparse features can enhance robustness.

### **Critical Assessment and Concluding Remarks**

As a professor, I would give this work high marks for its methodological contribution and thoughtful experimental design. It moves the field in the right direction—away from a myopic focus on accuracy and towards a more holistic understanding of model reliability.

**Strengths:**

*   **Problem Formulation:** Excellent framing of a classic econometrics problem (structural breaks) within a modern NLP context.
*   **Methodological Rigor:** The four-metric framework is comprehensive, well-motivated, and probes different, complementary facets of model failure.
*   **Practical Relevance:** This framework is directly applicable for financial institutions looking to stress-test and audit their AI/ML systems before deployment. It provides the tools for responsible AI development in a high-stakes domain.

**Areas for Future Inquiry**

*   **Actionability:** The framework is diagnostic. The next logical step is to make it prescriptive. How can these metrics be integrated into the training loop? Could we, for instance, use TSV as a regularization term to encourage more stable representations?
*   **The Oracle Problem:** The NLICS metric relies on an external LLM (GPT-4). This introduces a dependency on a costly, opaque, and potentially biased "oracle." How do we ensure the umpire itself is robust and unbiased across these same regimes?
*   **Endogeneity of Regimes:** The regime boundaries were defined ex-post. A more advanced system might employ online changepoint detection algorithms to identify regime shifts automatically, triggering model recalibration or the activation of a more robust, crisis-specific model.

**Final Verdict:**

This paper is a significant contribution to the *metrology* of financial AI—the science of measurement. It provides a much-needed toolkit for quantifying the risks associated with semantic and causal drift. By developing robust diagnostics, the authors pave the way for building more adaptive and trustworthy AI systems for finance, which is an absolute necessity for moving these technologies from the research lab to live, mission-critical deployment. An excellent piece of work.

# Import Essential Modules

#!/usr/bin/env python3
# ==============================================================================
#
#  Quantifying Semantic Shift in Financial NLP: A Robust Evaluation Framework
#
#  This module provides a complete, production-grade implementation of the
#  analytical framework presented in "Quantifying Semantic Shift in Financial
#  NLP: Robust Metrics for Market Prediction Stability" by Sun et al. (2025).
#  It delivers a comprehensive suite of tools for evaluating the robustness of
#  financial NLP models under macroeconomic regime shifts, enabling rigorous
#  auditing, stress testing, and governance of AI systems in non-stationary
#  market environments.
#
#  Core Methodological Components:
#  • Regime-based data partitioning and chronological cross-validation.
#  • Feature engineering pipeline for sparse (TF-IDF) and dense (Transformer) text representations.
#  • Training and inference pipelines for LSTM, Text Transformer, and Feature-Enhanced models.
#  • Implementation of four novel diagnostic metrics for model robustness:
#    1. Financial Causal Attribution Score (FCAS): Alignment with textual causal cues.
#    2. Patent Cliff Sensitivity (PCS): Stability under semantic perturbation.
#    3. Temporal Semantic Volatility (TSV): Drift in latent text representations.
#    4. NLI-based Logical Consistency Score (NLICS): Coherence via language models.
#  • Quantitative analysis of semantic drift using Jensen-Shannon Divergence.
#  • Comprehensive ablation and cross-domain generalization studies.
#
#  Technical Implementation Features:
#  • Modular, reusable, and idempotent orchestrators for managing complex experiments.
#  • Robust data validation, cleaning, and artifact management with checksums.
#  • High-performance deep learning pipeline with GPU acceleration and mixed precision.
#  • Asynchronous, cached, and rate-limited API interaction for LLM-based metrics.
#  • Professional-grade visualization and reporting of all analytical results.
#
#  Paper Reference:
#  Sun, Z., Xiao, C., Harit, A., & Yu, J. (2025). Quantifying Semantic Shift
#  in Financial NLP: Robust Metrics for Market Prediction Stability.
#  In Proceedings of the 6th ACM International Conference on AI in Finance (ICAIF '25).
#  arXiv preprint arXiv:2510.00205. https://arxiv.org/abs/2510.00205v1
#
#  Author: CS Chirinda
#  License: MIT
#  Version: 1.0.0
#
# ==============================================================================

# --- Standard Library Imports ---
import asyncio
import hashlib
import json
import logging
import math
import os
import pickle
import random
import re
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from datetime import datetime
from itertools import combinations, permutations
from pathlib import Path
from typing import (Any, Callable, Dict, List, Optional, Set, Tuple, Union)

# --- Third-Party Library Imports ---

# Core Scientific Computing
import numpy as np
import pandas as pd
from scipy import stats
from scipy.sparse import csr_matrix
from scipy.stats import entropy, spearmanr

# Machine Learning (Scikit-learn)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# Deep Learning (PyTorch and associated)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import DataLoader, Dataset

# Natural Language Processing (HuggingFace and SentenceTransformers)
from sentence_transformers import SentenceTransformer
from transformers import (AutoTokenizer, DistilBertModel, PreTrainedModel,
                          PreTrainedTokenizerBase, pipeline, Pipeline)

# LLM API Interaction
from openai import AsyncOpenAI, RateLimitError

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Progress Bars
from tqdm import tqdm
from tqdm.auto import tqdm as auto_tqdm
from tqdm.asyncio import tqdm as async_tqdm

# Define type aliases for clarity and maintainability.
DataSplits = Dict[str, Dict[str, pd.DataFrame]]
TfidfFeatures = Dict[str, Dict[str, csr_matrix]]
EmbeddingFeatures = Dict[str, Dict[str, np.ndarray]]
CombinedFeatures = Dict[str, Dict[str, np.ndarray]]

# Configure a logger for clear, standardized output.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)


# Implementation

## Draft 1

### **Functional Specification Document**

#### **Phase 1: Configuration Validation and Data Quality Assurance**

*   **Callable:** `run_config_validation_suite`
    *   **Inputs:** The main `study_config` dictionary.
    *   **Processes:**
        1.  Calls `validate_regime_definitions` to parse date strings, ensure they are in chronological order, and verify that the time intervals `[start_date, end_date]` are non-overlapping.
        2.  Calls `validate_data_split_ratios` to confirm that the `training`, `validation`, and `testing` ratios are valid probabilities that sum to 1.0, using a numerically stable comparison.
        3.  Calls `validate_feature_engineering_config` to check that `max_features` is a positive integer, that the specified HuggingFace models are accessible, and that the configured dimensions match the models' actual output dimensions.
    *   **Outputs:** None. Raises a `ValueError` with a consolidated report if any validation fails.
    *   **Transformation:** This function validates the configuration but does not transform data.
    *   **Research Role:** Implements the preliminary sanity checks on all experimental parameters defined in **Sections 4.2 and 5.1 and Table 1**. It ensures the entire study is based on a valid and coherent set of hyperparameters and design choices before any computation begins.
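
As a rough illustration of the checks in this phase, the sketch below validates split ratios with `math.isclose` and tests closed date intervals for overlap. The helper names `check_split_ratios` and `check_regimes`, and the `(name, start, end)` tuple layout, are assumptions made for this example rather than the repository's actual signatures.

```python
import math
from datetime import date
from typing import Dict, List, Tuple


def check_split_ratios(ratios: Dict[str, float]) -> List[str]:
    """Return a list of problems found in the split ratios (empty if valid)."""
    errors = []
    for name, value in ratios.items():
        if not 0.0 < value < 1.0:
            errors.append(f"ratio '{name}'={value} is not in (0, 1)")
    # math.isclose avoids false failures caused by floating-point rounding.
    total = sum(ratios.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        errors.append(f"ratios sum to {total}, expected 1.0")
    return errors


def check_regimes(regimes: List[Tuple[str, str, str]]) -> List[str]:
    """Check (name, start, end) ISO-date triples for ordering and overlap."""
    errors = []
    parsed = [(n, date.fromisoformat(s), date.fromisoformat(e)) for n, s, e in regimes]
    for name, start, end in parsed:
        if start > end:
            errors.append(f"regime '{name}' starts after it ends")
    ordered = sorted(parsed, key=lambda r: r[1])
    for (n1, _, e1), (n2, s2, _) in zip(ordered, ordered[1:]):
        # Closed intervals overlap when the next start is not after the previous end.
        if s2 <= e1:
            errors.append(f"regimes '{n1}' and '{n2}' overlap")
    return errors
```

Collecting messages in a list instead of raising on the first failure makes it easy to emit the consolidated report described above.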

*   **Callable:** `run_dataframe_validation_suite`
    *   **Inputs:** The raw `pd.DataFrame`.
    *   **Processes:**
        1.  Calls `validate_multi_index_integrity` to verify the DataFrame has a unique, sorted, 3-level MultiIndex of `(date, ticker, sector)` with the correct data types.
        2.  Calls `validate_column_schema` to ensure the presence and correct `dtype` of all required columns and checks for non-positive values in price/volume columns.
        3.  Calls `validate_target_return_construction` to rigorously verify that the `target_return` column was correctly computed as a forward-looking return, preventing lookahead bias. It mathematically confirms the relationship `target_return[t] = (Adj_Close[t+1] - Adj_Close[t]) / Adj_Close[t]` on a per-ticker basis.
    *   **Outputs:** None. Raises a `ValueError` if validation fails.
    *   **Transformation:** This function validates the input data structure but does not transform it.
    *   **Research Role:** Implements the critical data integrity checks required for any quantitative financial study. It ensures the raw data conforms to the expected panel data structure and that the target variable is constructed without methodological flaws, as described in **Section 5.1 ("News-Price Alignment")**.
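
The forward-return relationship above can be checked with a per-ticker shift. This is a minimal sketch, assuming the `(date, ticker, ...)` index is already sorted and that `target_return` is stored as a plain fraction; if returns are stored in percent, the expected values would need the same scaling. The function name `forward_return_matches` is hypothetical.

```python
import numpy as np
import pandas as pd


def forward_return_matches(df: pd.DataFrame, tol: float = 1e-8) -> bool:
    """Check target_return[t] == (P[t+1] - P[t]) / P[t] within each ticker."""
    # groupby(...).shift(-1) looks one row ahead *within* each ticker only.
    next_price = df.groupby(level="ticker")["Adj Close"].shift(-1)
    expected = (next_price - df["Adj Close"]) / df["Adj Close"]
    # The last date of every ticker has no next price, so skip those rows.
    mask = expected.notna()
    return bool(np.allclose(df.loc[mask, "target_return"], expected[mask], atol=tol))
```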

*   **Callable:** `run_data_quality_and_cleansing_suite`
    *   **Inputs:** The validated raw `pd.DataFrame`.
    *   **Processes:**
        1.  Calls `assess_missing_data` to generate a report distinguishing between `NaN` values and meaningful empty strings (`''`).
        2.  Calls `validate_financial_data_consistency` to check for valid OHLC relationships and flag unrealistic daily returns.
        3.  Calls `clean_and_standardize_text_data` to perform cleansing operations: it replaces `NaN`s in the text column with `''`, standardizes encoding to UTF-8, and removes null bytes.
    *   **Outputs:** A cleaned `pd.DataFrame`.
    *   **Transformation:** Transforms the input DataFrame by cleaning and standardizing its `aggregated_text` column.
    *   **Research Role:** Implements the data pre-processing steps described in **Section 5.1 ("Text Processing")**. It prepares the raw data for the subsequent feature engineering and modeling phases.

#### **Phase 2: Data Partitioning**

*   **Callable:** `run_regime_assignment_suite`
    *   **Inputs:** The cleaned `pd.DataFrame`, `study_config`.
    *   **Processes:**
        1.  Calls `assign_regime_labels` which maps each row's date to a macroeconomic regime (`Pre-COVID`, `COVID`, etc.) based on the intervals in `study_config`.
        2.  Filters out any data points that do not fall within a defined regime.
        3.  Calls `validate_regime_assignment` to confirm that the resulting DataFrame contains data for all expected regimes and that the sample counts are adequate.
    *   **Outputs:** A `pd.DataFrame` with an added `regime` column, containing only data from the specified periods.
    *   **Transformation:** Transforms the time-series DataFrame by annotating each row with a regime label and filtering it temporally.
    *   **Research Role:** Implements the "Regime Assignment" step described in **Section 5.1** and **Table 1**, which is the core of the paper's experimental design for studying model performance under structural breaks.
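
A minimal sketch of the labelling and filtering step follows. The regime names and date windows in `REGIMES` are placeholders invented for this example (the real intervals live in `study_config`), and `assign_regimes` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical regime windows, for illustration only.
REGIMES = {
    "regime_a": ("2020-01-01", "2020-06-30"),
    "regime_b": ("2020-07-01", "2020-12-31"),
}


def assign_regimes(df: pd.DataFrame, regimes: dict) -> pd.DataFrame:
    """Label each row by the regime containing its date; drop unmatched rows."""
    dates = df.index.get_level_values("date")
    labels = np.full(len(df), None, dtype=object)
    for name, (start, end) in regimes.items():
        # Closed interval [start, end], matching the config convention.
        in_window = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
        labels[in_window] = name
    out = df.assign(regime=labels)
    # Rows outside every window keep a missing label and are filtered out.
    return out[out["regime"].notna()]
```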

*   **Callable:** `run_chronological_splitting_suite`
    *   **Inputs:** The regime-assigned `pd.DataFrame`, `study_config`.
    *   **Processes:**
        1.  Calls `create_regime_data_subsets`, which iterates through each unique regime.
        2.  For each regime, it calls `perform_chronological_split`, which sorts the regime's data by date and partitions it into non-overlapping `training` (first 60%), `validation` (next 20%), and `testing` (final 20%) sets.
        3.  Calls `validate_split_quality` to perform a final, critical check ensuring there is no temporal overlap between the splits, guaranteeing the absence of lookahead bias.
    *   **Outputs:** A nested dictionary `DataSplits` containing the 12 DataFrame subsets.
    *   **Transformation:** Transforms the regime-annotated DataFrame into a structured collection of 12 distinct train/validation/test sets.
    *   **Research Role:** Implements the "Chronological Splits" methodology described in **Section 5.1**. This is the fundamental procedure for correctly evaluating time-series models.
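
One way to realize a leak-free 60/20/20 split is to cut on unique dates rather than on rows, so that all tickers for a given day land in the same partition. The sketch below takes that approach; cutting on dates is a design assumption of this example, and `chronological_split` is a hypothetical name.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, train: float = 0.6, val: float = 0.2) -> dict:
    """Split a (date, ticker) panel into time-ordered, non-overlapping parts."""
    dates = df.index.get_level_values("date")
    unique_dates = dates.unique().sort_values()
    n_train = round(len(unique_dates) * train)
    n_val = round(len(unique_dates) * val)
    # Whatever remains after training and validation becomes the test block.
    cuts = {
        "training": unique_dates[:n_train],
        "validation": unique_dates[n_train:n_train + n_val],
        "testing": unique_dates[n_train + n_val:],
    }
    return {name: df[dates.isin(block)] for name, block in cuts.items()}
```

Because every partition is defined by a disjoint block of dates, the latest training date is always earlier than the earliest validation date, which is the property the final validation step is meant to confirm.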

*   **Callable:** `run_cross_regime_consistency_suite`
    *   **Inputs:** The regime-assigned `pd.DataFrame`.
    *   **Processes:**
        1.  Calls `analyze_ticker_consistency` to check for changes in the asset universe across regimes.
        2.  Calls `analyze_return_distribution_stability` to compute the first four statistical moments of returns in each regime and performs pairwise Kolmogorov-Smirnov tests to formally check if the return distributions are statistically different.
        3.  Calls `analyze_news_coverage_consistency` to quantify the availability and volume of text data in each regime.
    *   **Outputs:** None. Prints a detailed diagnostic report to the console.
    *   **Transformation:** This is a pure analysis function; it does not transform data.
    *   **Research Role:** This function provides the quantitative justification for the regime-based approach. By demonstrating that the statistical properties of both the target variable (`target_return`) and the input features (`aggregated_text`) are significantly different across regimes, it validates the paper's core premise that the data generating process is non-stationary.
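
The moment summary and pairwise Kolmogorov-Smirnov tests can be sketched with `scipy.stats`. The function names and the dictionary-of-arrays input are assumptions for this example; the actual suite works on the regime-assigned DataFrame.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def describe_moments(returns: np.ndarray) -> dict:
    """First four moments of a return sample."""
    return {
        "mean": float(np.mean(returns)),
        "variance": float(np.var(returns, ddof=1)),
        "skewness": float(stats.skew(returns)),
        "kurtosis": float(stats.kurtosis(returns)),  # excess kurtosis
    }


def pairwise_ks(returns_by_regime: dict) -> dict:
    """Two-sample KS statistic and p-value for every pair of regimes."""
    results = {}
    for a, b in combinations(sorted(returns_by_regime), 2):
        res = stats.ks_2samp(returns_by_regime[a], returns_by_regime[b])
        results[(a, b)] = (float(res.statistic), float(res.pvalue))
    return results
```

A small p-value for a pair suggests the two regimes' return distributions differ, which is the kind of evidence this suite reports.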

#### **Phase 3: Feature Engineering**

*   **Callable:** `run_tfidf_vectorization_suite`
    *   **Inputs:** `DataSplits` dictionary, `study_config`.
    *   **Processes:**
        1.  Calls `create_and_fit_tfidf_vectorizer`, which first assembles a global corpus from all *training* splits across all regimes.
        2.  It fits a single `TfidfVectorizer` on this global corpus to create a unified vocabulary.
        3.  It then uses this single fitted vectorizer to `.transform()` the text from all 12 data subsets.
        4.  Calls `validate_tfidf_features` to check the shape, sparsity, and L2 normalization of the resulting sparse matrices.
    *   **Outputs:** The fitted `TfidfVectorizer` object and a nested dictionary `TfidfFeatures` containing the 12 sparse feature matrices.
    *   **Transformation:** Transforms the raw text in the `DataSplits` into sparse numerical feature matrices.
    *   **Research Role:** Implements the TF-IDF feature engineering described in **Section 4.2** and **5.1**. The use of a single global vocabulary is a critical, rigorous step to ensure feature comparability across all regimes.
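
The fit-once, transform-everywhere pattern might look like the following. The toy `splits` dictionary and the `max_features` value are invented for this example; only the use of scikit-learn's `TfidfVectorizer` comes from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the nested splits: {regime: {split: list of texts}}.
splits = {
    "regime_a": {
        "training": ["profit growth strong", "revenue decline"],
        "testing": ["profit warning"],
    },
    "regime_b": {
        "training": ["lockdown hits revenue"],
        "testing": ["vaccine rally growth"],
    },
}

# Fit once on the pooled *training* text only, so no test vocabulary leaks in.
train_corpus = [text for parts in splits.values() for text in parts["training"]]
vectorizer = TfidfVectorizer(max_features=5000)
vectorizer.fit(train_corpus)

# Reuse the same fitted vectorizer everywhere: every matrix shares one column space.
features = {
    regime: {name: vectorizer.transform(texts) for name, texts in parts.items()}
    for regime, parts in splits.items()
}
```

Terms that appear only in held-out text (here, "vaccine") are simply ignored at transform time, which keeps the feature columns identical across regimes.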

*   **Callable:** `run_embedding_extraction_suite`
    *   **Inputs:** `DataSplits` dictionary, `study_config`.
    *   **Processes:**
        1.  Calls `initialize_sentence_transformer` to load the pre-trained `all-MiniLM-L6-v2` model and move it to the GPU.
        2.  Calls `extract_sentence_embeddings`, which iterates through all 12 data subsets and uses the model's `.encode()` method in batches to convert the text into dense embedding vectors.
        3.  Calls `validate_embedding_features` to check the shape, numerical integrity, and L2 normalization of the resulting NumPy arrays.
    *   **Outputs:** A nested dictionary `EmbeddingFeatures` containing the 12 dense feature matrices.
    *   **Transformation:** Transforms the raw text in the `DataSplits` into dense numerical feature matrices.
    *   **Research Role:** Implements the sentence embedding feature engineering described in **Section 4.2** and **5.1**.

*   **Callable:** `run_feature_concatenation_suite`
    *   **Inputs:** `TfidfFeatures` dictionary, `EmbeddingFeatures` dictionary.
    *   **Processes:**
        1.  Calls `concatenate_features`, which iterates through all 12 splits. For each split, it converts the sparse TF-IDF matrix to a dense array and horizontally stacks it with the corresponding dense embedding array.
        2.  The resulting matrix implements the formulation $\mathbf{x}_{\text{combined}} = [\mathbf{v}_{\text{tfidf}} \mid\mid \mathbf{e}_{\text{sentence}}]$.
        3.  Calls `validate_combined_features` to check the shape and numerical integrity of the final concatenated matrices.
    *   **Outputs:** A nested dictionary `CombinedFeatures` containing the 12 combined feature matrices.
    *   **Transformation:** Transforms two sets of feature matrices into a single, unified set of feature matrices for the hybrid model.
    *   **Research Role:** Implements the feature set for the "Feature-based Transformer" model described in **Section 4.2**.
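
The concatenation itself is a one-liner once the sparse matrix is densified. This sketch assumes both inputs have one row per sample in the same order; `combine_features` is a hypothetical name.

```python
import numpy as np
from scipy.sparse import csr_matrix


def combine_features(tfidf: csr_matrix, embeddings: np.ndarray) -> np.ndarray:
    """Return [v_tfidf || e_sentence] as one dense matrix."""
    if tfidf.shape[0] != embeddings.shape[0]:
        raise ValueError("TF-IDF and embedding matrices have different row counts")
    # toarray() materializes the sparse matrix; memory grows with vocabulary size.
    return np.hstack([tfidf.toarray(), embeddings])
```

Densifying is convenient for a dense model input but can be memory-hungry for large vocabularies, so the `max_features` cap matters here.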

#### **Phases 4 & 5: Model Training**

*   **Callable:** `run_regime_specific_training_pipeline`
    *   **Inputs:** All data splits and feature set dictionaries, `study_config`.
    *   **Processes:** This is a master orchestrator that iterates through all 12 regime-model pairs. For each pair, it:
        1.  Calls `_get_model_and_features_for_run` to instantiate the correct model class (`LSTMRegressionModel`, `TextTransformerRegressionModel`, or `FeatureEnhancedMLP`) and select the corresponding feature set.
        2.  Calls `create_dataloaders` to prepare the data for batch processing.
        3.  Calls `setup_optimization_components` to create the `MSELoss` criterion, `Adam` optimizer, and `ReduceLROnPlateau` scheduler.
        4.  Calls `run_training_orchestrator`, which executes the main training loop, incorporating `train_epoch` and `validate_epoch` helpers, early stopping, and checkpointing of the best model.
        5.  It is idempotent, checking for existing checkpoints to allow for resumption.
    *   **Outputs:** A dictionary `training_results` containing the file paths to the 12 best-performing model checkpoints and their training histories.
    *   **Transformation:** This is the core computational step of the project, transforming the feature sets and untrained model architectures into a set of 12 trained, specialized predictive models.
    *   **Research Role:** Implements the entire model training workflow described across **Sections 4.2, 5.3, and 5.5**.

#### **Phase 6: Prediction and Evaluation**

*   **Callable:** `run_inference_pipeline`
    *   **Inputs:** `training_results` dictionary, all data splits and feature sets, `study_config`.
    *   **Processes:** Iterates through the 12 trained model-regime pairs. For each pair, it:
        1.  Loads the appropriate model architecture and its saved checkpoint weights.
        2.  Prepares the `test` set `DataLoader` for that pair.
        3.  Calls `generate_predictions_for_split`, which runs the model in inference mode (`model.eval()`, `torch.no_grad()`) to generate predictions on the unseen test data.
        4.  Collates all predictions into a single, tidy `pd.DataFrame`.
    *   **Outputs:** A single `pd.DataFrame` containing predictions, ground truth, and metadata for all 12 test sets.
    *   **Transformation:** Transforms the test set features into model predictions using the trained models.
    *   **Research Role:** Implements the model evaluation step, generating the raw predictions that are the basis for all subsequent analysis.

*   **Callable:** `enrich_and_store_predictions`
    *   **Inputs:** The tidy predictions `pd.DataFrame`.
    *   **Processes:** Adds per-sample diagnostic columns (`squared_error`, `absolute_error`, `directional_accuracy`), embeds metadata, and saves the resulting enriched DataFrame to a file with a corresponding SHA256 checksum file for integrity validation.
    *   **Outputs:** The enriched `pd.DataFrame`.
    *   **Transformation:** Augments the predictions DataFrame with additional analytical columns.
    *   **Research Role:** Creates a robust, auditable, and comprehensive artifact of the model's predictive output, which is a best practice for reproducible research.
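
A checksum sidecar of the kind described can be produced with the standard library alone. The `.sha256` suffix and the helper names below are assumptions for this sketch, not necessarily what the pipeline writes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path: Path) -> Path:
    """Write '<hex digest>  <file name>' next to the artifact."""
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(f"{sha256_of(path)}  {path.name}\n")
    return sidecar


def verify_checksum(path: Path) -> bool:
    """Recompute the digest and compare it with the stored one."""
    sidecar = path.with_name(path.name + ".sha256")
    expected = sidecar.read_text().split()[0]
    return sha256_of(path) == expected
```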

*   **Callable:** `run_mse_evaluation_suite`
    *   **Inputs:** The enriched predictions `pd.DataFrame`.
    *   **Processes:**
        1.  Calls `compute_and_validate_mse`, which calculates the Mean Squared Error for each of the 12 experimental cells by implementing the equation $\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$ on a per-group basis. It also computes bootstrap confidence intervals for these estimates.
        2.  Calls `generate_mse_summary_table` to format the results into a publication-quality table.
    *   **Outputs:** The raw MSE `pd.DataFrame`. Displays a styled table.
    *   **Transformation:** Transforms the per-sample predictions into an aggregate performance summary.
    *   **Research Role:** Implements the baseline performance evaluation and generates the results for **Table 3**.
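
For a single experimental cell, the MSE and a percentile bootstrap interval can be sketched as follows. The resample count, the 95% level, and the function name are illustrative assumptions; the source does not state which bootstrap variant is used.

```python
import numpy as np


def mse_with_bootstrap_ci(y_true, y_pred, n_boot: int = 1000,
                          alpha: float = 0.05, seed: int = 0):
    """Return (mse, lower, upper) using a percentile bootstrap over samples."""
    sq_err = (np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)) ** 2
    rng = np.random.default_rng(seed)
    # Each row of idx is one resample (with replacement) of the sample indices.
    idx = rng.integers(0, len(sq_err), size=(n_boot, len(sq_err)))
    boot_means = sq_err[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(sq_err.mean()), float(lower), float(upper)
```

Applying this per `(regime, model)` group, for example inside a `groupby(...).apply(...)`, would give one row per cell of the summary table.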

#### **Phases 7-11: Full Analysis Suite**

*   **Callable:** `run_diagnostic_metrics_orchestrator`
    *   **Inputs:** All major data artifacts.
    *   **Processes:** This is a master orchestrator for the four diagnostic metrics. It calls:
        1.  `compute_fcas`: Implements $\text{FCAS} = \mathbb{E}[\mathbb{I}(\text{sign}(\hat{y}) = \text{sign}(c))]$ by counting keyword matches to determine the sign of the causal cue $c$.
        2.  `compute_pcs`: Implements $\text{PCS} = \mathbb{E}[|f_\theta(\mathbf{x}) - f_\theta(\tilde{\mathbf{x}})|]$ by creating perturbed texts $\tilde{\mathbf{x}}$, re-running the entire feature and inference pipeline to get $f_\theta(\tilde{\mathbf{x}})$, and averaging the absolute differences.
        3.  `compute_tsv`: Implements $\text{TSV} = \frac{1}{N-1} \sum_{i=1}^{N-1} \|\phi(\mathbf{x}_{i+1}) - \phi(\mathbf{x}_i)\|_2$ by calculating the mean Euclidean distance between consecutive, chronologically sorted sentence embeddings.
        4.  `compute_nlics`: Implements $\text{NLICS} = \mathbb{E}[\text{EntailmentScore}(\mathbf{x}, H(\hat{y}))]$ by using the OpenAI API to evaluate the logical entailment between the news text $\mathbf{x}$ and a hypothesis $H$ generated from the prediction $\hat{y}$.
    *   **Outputs:** A single `pd.DataFrame`, the `robustness_profile`, containing all five metrics (MSE plus the four diagnostics) for all 12 experimental cells.
    *   **Transformation:** Transforms the predictions and source data into the final, multi-faceted performance profile.
    *   **Research Role:** Implements the core contribution of the paper: the calculation of the four novel diagnostic metrics as defined in **Section 3, Equations (2-5)**.
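
Two of the four metrics, FCAS and TSV, need no model or API call and can be sketched directly. The cue lexicons below are hypothetical (the actual keyword lists are not shown here), and skipping texts with no net cue is an assumption of this example.

```python
import numpy as np

# Hypothetical cue lexicons; the real keyword lists are not shown in this document.
POSITIVE = {"growth", "profit", "beat"}
NEGATIVE = {"decline", "loss", "miss"}


def cue_sign(text: str) -> int:
    """Sign of (positive keyword hits - negative keyword hits)."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return int(np.sign(score))


def fcas(texts, predictions) -> float:
    """Share of cue-bearing samples whose prediction sign matches the cue sign."""
    pairs = [(cue_sign(t), np.sign(p)) for t, p in zip(texts, predictions)]
    scored = [(c, s) for c, s in pairs if c != 0]  # assumption: skip texts with no net cue
    if not scored:
        return float("nan")
    return float(np.mean([c == s for c, s in scored]))


def tsv(embeddings: np.ndarray) -> float:
    """Mean L2 distance between consecutive, chronologically ordered embeddings."""
    steps = np.diff(embeddings, axis=0)
    return float(np.linalg.norm(steps, axis=1).mean())
```

`np.diff` along axis 0 produces the $N-1$ consecutive differences, so the mean of their norms matches the TSV definition.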

*   **Callable:** `run_diagnostic_validation_suite`
    *   **Inputs:** The `robustness_profile` DataFrame.
    *   **Processes:** Performs sanity checks on the final metrics, ensuring their values fall within their theoretical ranges (e.g., FCAS in `[0, 1]`). It also computes a diagnostic correlation matrix and generates a styled, publication-quality summary table.
    *   **Outputs:** The validated `robustness_profile` and a styled table object.
    *   **Transformation:** This is a validation and reporting function.
    *   **Research Role:** Serves as the final quality assurance step on the computed results.

*   **Callable:** `compute_js_divergence_matrix`
    *   **Inputs:** `DataSplits` dictionary, `study_config`.
    *   **Processes:** Implements the Jensen-Shannon Divergence calculation to quantify semantic shift. It creates a unified vocabulary, builds a probability distribution of term usage for each regime, and then calculates the pairwise J-S divergence using the formula $D_{JS}(P, Q) = 0.5 \times D_{KL}(P \| M) + 0.5 \times D_{KL}(Q \| M)$, where $M = 0.5(P + Q)$.
    *   **Outputs:** A symmetric `pd.DataFrame` containing the pairwise J-S divergence values.
    *   **Transformation:** Transforms the raw text corpora of the regimes into a matrix of statistical distances.
    *   **Research Role:** Implements the "Vocabulary Shift" analysis described in **Section 6.2** and generates the data for **Figure 2**.
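
The J-S formula can be written out in a few lines of NumPy. Whitespace tokenization and base-2 logarithms are assumptions of this sketch (base 2 bounds the divergence in $[0, 1]$); the repository may use a different tokenizer or log base.

```python
from collections import Counter

import numpy as np


def term_distribution(texts, vocab) -> np.ndarray:
    """Relative frequency of each vocabulary term across a list of texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    freq = np.array([counts[w] for w in vocab], dtype=float)
    return freq / freq.sum()


def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(P || Q) in bits; terms with p == 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_JS(P, Q) = 0.5 * D_KL(P || M) + 0.5 * D_KL(Q || M), M = 0.5 (P + Q)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Because $M$ is positive wherever $P$ or $Q$ is, the KL terms never divide by zero, which is one practical reason to prefer J-S over raw KL for sparse vocabularies.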

*   **Callable:** `run_tsne_visualization_suite`
    *   **Inputs:** `DataSplits` and `EmbeddingFeatures` dictionaries.
    *   **Processes:** Aggregates all test set sentence embeddings, performs stratified subsampling for efficiency, and applies the t-SNE algorithm to project the 384-dimensional vectors into a 2D space. It then generates scatter plots colored by regime and sector.
    *   **Outputs:** A `pd.DataFrame` of the 2D coordinates. Displays the plots.
    *   **Transformation:** Transforms high-dimensional semantic vectors into a low-dimensional visualization.
    *   **Research Role:** Implements the visualization of the "Regime-Aware Representations" described in **Section 6.4** and reproduces **Figures 3 and 4**.

*   **Callable:** `run_cross_regime_analysis_suite`
    *   **Inputs:** `js_divergence_matrix`, `robustness_profile`.
    *   **Processes:** Aligns the J-S divergence values with the absolute difference in MSE between corresponding regime pairs. It then computes the Spearman rank correlation and p-value to test the association between semantic drift and model performance degradation.
    *   **Outputs:** A `pd.DataFrame` summarizing the correlation results. Displays a plot.
    *   **Transformation:** Synthesizes two different analytical results into a single statistical test and visualization.
    *   **Research Role:** Implements the core hypothesis test of the paper, as discussed in **Section 6.2**, linking semantic drift to prediction error.
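
The alignment and rank-correlation step could be sketched as below. The input shapes (a square divergence DataFrame and a per-regime MSE Series) and the name `drift_vs_error` are assumptions for this example.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import spearmanr


def drift_vs_error(js: pd.DataFrame, mse: pd.Series):
    """Spearman correlation between pairwise J-S divergence and |MSE difference|."""
    # Use each unordered regime pair once; the matrix is symmetric.
    pairs = list(combinations(js.index, 2))
    drift = [js.loc[a, b] for a, b in pairs]
    gap = [abs(mse[a] - mse[b]) for a, b in pairs]
    rho, pvalue = spearmanr(drift, gap)
    return float(rho), float(pvalue)
```

With four regimes there are only six regime pairs, so the resulting p-value should be read with that small sample size in mind.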

*   **Callable:** `run_stock_specific_case_study`
    *   **Inputs:** All major data artifacts.
    *   **Processes:** Filters all data artifacts for a specific list of tickers (e.g., 'JPM', 'AAPL') and then re-runs the entire diagnostic metric computation pipeline (`compute_fcas`, `compute_pcs`, etc.) on these small, single-stock subsets.
    *   **Outputs:** A `pd.DataFrame` containing the diagnostic metrics for the specified stocks across all regimes.
    *   **Transformation:** Filters the entire dataset and re-applies the full analysis to produce a granular, stock-level view.
    *   **Research Role:** Implements the case study analysis presented in **Section 6.6** and **Table 6**.

*   **Callable:** `perform_metric_ablation_analysis`, `perform_feature_augmentation_ablation`, `perform_entailment_model_ablation`
    *   **Inputs:** Various result artifacts.
    *   **Processes:** These functions implement the ablation studies described in **Section 7** and presented in **Tables 7, 8, and 9**. They involve re-running parts of the pipeline on different data slices (e.g., cross-sector) or with different components (e.g., BART-NLI) and presenting the comparative results.
    *   **Outputs:** Formatted `pd.DataFrame` summary tables.
    *   **Transformation:** These are high-level analytical functions that orchestrate new, smaller experiments to answer specific questions about the components of the main framework.
    *   **Research Role:** Implements the full suite of ablation studies from **Section 7**.

*   **Callable:** `run_control_experiment`
    *   **Inputs:** All major data artifacts.
    *   **Processes:** Identifies a specific event type (earnings reports) across all regimes. It then re-runs the PCS and TSV metric calculations on these homogeneous subsets, both within and across regimes, to disentangle situational from linguistic drift.
    *   **Outputs:** A `pd.DataFrame` summary table.
    *   **Transformation:** Performs highly specific data filtering and re-applies metric computations.
    *   **Research Role:** Implements the sophisticated control experiment described in **Section 6.5** and **Table 5**.

*   **Callable:** `run_multi_sector_robustness_validation`
    *   **Inputs:** All major data artifacts.
    *   **Processes:** Orchestrates a large-scale experiment by iterating through all possible pairs of GICS sectors. For each pair, it trains a model on the source sector and evaluates it on the target sector, reusing the core training and inference functions. It calculates a "transferability ratio" to normalize the results.
    *   **Outputs:** A dictionary of `pd.DataFrame` transferability matrices.
    *   **Transformation:** Orchestrates a massive set of training and evaluation runs to produce a comprehensive summary of model generalization.
    *   **Research Role:** Implements the "Sector Transferability" analysis described in **Section 6.3** and **Table 4**, but extends it to a full, comprehensive matrix.

#### **Top-Level Orchestrators**

*   **Callable:** `run_full_research_pipeline`
    *   **Inputs:** Raw `pd.DataFrame`, `study_config`.
    *   **Processes:** Serves as the master orchestrator for all *automated* computational tasks. It calls the orchestrator for each phase in the correct sequence, from data validation to the final analyses, managing the flow of artifacts between them. It is designed to be idempotent and resumable.
    *   **Outputs:** A dictionary of file paths to all major generated artifacts.
    *   **Transformation:** Orchestrates the entire end-to-end transformation of raw data into a complete set of research results.
    *   **Research Role:** Represents the full, reproducible script to run the entire computational portion of the paper.

*   **Callable:** `execute_quantifying_semantic_shift_study`
    *   **Inputs:** Raw `pd.DataFrame`, `study_config`, and user flags.
    *   **Processes:** Acts as the ultimate user-facing entry point. It calls `run_full_research_pipeline` to perform the main computation. It then manages the optional, human-dependent `run_entailment_ablation_analysis` step by verifying the existence of the required annotated file. Finally, it calls the master reporting and synthesis functions to generate the final outputs.
    *   **Outputs:** A dictionary containing all key final artifacts and reports.
    *   **Transformation:** Orchestrates the entire study lifecycle, including the interface with the manual annotation step.
    *   **Research Role:** Represents the complete, user-runnable script that reproduces the entire study, including its human-in-the-loop component.


<br><br>

### **Usage Example: Final `main.py`**

The following Python script shows how to call the top-level orchestrator callables:

```python
#!/usr/bin/env python3
# ==============================================================================
#
#  Main Execution Script for "Quantifying Semantic Shift in Financial NLP"
#
#  This script serves as the main entry point to run the entire research
#  pipeline. It includes a synthetic data generator to ensure the script is
#  fully executable out-of-the-box for demonstration and validation purposes.
#
#  Workflow:
#  1. Generates a synthetic, structurally correct illustrative DataFrame.
#  2. Loads the main study configuration from `config.yaml`.
#  3. Resolves secrets (like API keys) from environment variables.
#  4. Calls the main pipeline orchestrator `execute_quantifying_semantic_shift_study`.
#  5. Provides a commented-out example for running the separate human-in-the-loop
#     analysis after manual annotation is complete.
#
# ==============================================================================

import logging
from pathlib import Path
from typing import Dict, Any
import os

import numpy as np
import pandas as pd
import yaml

# Assume all previously defined orchestrator functions are in a module
# called `pipeline` for clean importing.
# from pipeline import (
#     execute_quantifying_semantic_shift_study,
#     run_entailment_ablation_analysis
# )

# For this self-contained example, we assume the functions are in the same file.


def _create_synthetic_input_data(
    output_path: Path,
    num_tickers: int = 10,
    days: int = 500
) -> pd.DataFrame:
    """
    Generates a realistic, synthetic DataFrame matching the required input schema.

    This function creates a mock dataset for demonstrating and testing the full
    pipeline. The data is structurally and methodologically sound, but the
    values are synthetic and not suitable for actual model training.

    Args:
        output_path: The path to save the generated DataFrame.
        num_tickers: The number of synthetic tickers to create.
        days: The number of trading days to generate.

    Returns:
        The generated pandas DataFrame.
    """
    logging.info("Generating synthetic input data...")
    
    # --- 1. Define Universe ---
    tickers = [f"TICK{i}" for i in range(num_tickers)]
    sectors = ['Technology', 'Financials', 'Health Care']
    ticker_to_sector = {t: sectors[i % len(sectors)] for i, t in enumerate(tickers)}
    
    # --- 2. Create Multi-Index ---
    dates = pd.to_datetime(pd.date_range(end='2023-01-01', periods=days, freq='B'))
    index = pd.MultiIndex.from_product([dates, tickers], names=['date', 'ticker'])
    
    # --- 3. Generate Synthetic Data ---
    df = pd.DataFrame(index=index)
    df['sector'] = df.index.get_level_values('ticker').map(ticker_to_sector)
    df = df.set_index('sector', append=True) # Add sector to the index

    # Generate a random walk for adjusted close prices for each ticker
    np.random.seed(42)
    # Use cumsum on random returns to simulate a price series
    returns = np.random.randn(len(dates), num_tickers) * 0.02
    price_series = 100 * np.exp(np.cumsum(returns, axis=0))
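    # Rows are ordered date-major (all tickers for date 0, then date 1, ...),
    # which matches a row-major (C-order) flatten of the (dates, tickers) array.
    df['Adj Close'] = price_series.flatten(order='C')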

    # Create consistent OHLC data
    noise = np.random.rand(len(df))
    df['Open'] = df['Adj Close'] * (1 - 0.01 * noise)
    df['Close'] = df['Adj Close'] * (1 + 0.01 * noise)
    df['High'] = pd.concat([df['Open'], df['Close'], df['Adj Close'] * (1 + 0.02 * noise)], axis=1).max(axis=1)
    df['Low'] = pd.concat([df['Open'], df['Close'], df['Adj Close'] * (1 - 0.02 * noise)], axis=1).min(axis=1)
    df['Volume'] = np.random.randint(1_000_000, 10_000_000, size=len(df))

    # --- 4. Generate Synthetic Text Data ---
    sentences = [
        "The company reported strong growth and record earnings this quarter.",
        "Concerns about a decline in revenue led to a stock price fall.",
        "Quarterly results show a significant profit increase, beating all forecasts.",
        "The firm announced a major expansion, signaling positive future guidance.",
        "A weak economic outlook may cause a contraction in the next fiscal period.",
        "The stock will likely miss its earnings target.",
        "" # CRITICAL: Include empty strings to simulate no-news days
    ]
    df['aggregated_text'] = np.random.choice(sentences, size=len(df), p=[0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.3])

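    # --- 5. Generate Target Variable with Correct Methodology ---
    # This is the crucial step: create the forward-looking return per ticker.
    # The shift must happen inside each ticker group; shifting the whole column
    # would pull in the next row, which belongs to a different ticker.
    next_close = df.groupby(level='ticker')['Adj Close'].shift(-1)
    df['target_return'] = (next_close / df['Adj Close'] - 1) * 100  # in percent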
    
    # Sort the index for pipeline compatibility.
    df.sort_index(inplace=True)
    
    # Save the synthetic data to the specified path.
    df.to_pickle(output_path)
    logging.info(f"Synthetic data with shape {df.shape} saved to '{output_path}'.")
    
    return df


def load_configuration(config_path: Path) -> Dict[str, Any]:
    """
    Loads the YAML configuration file and resolves environment variables.
    """
    # Open and read the YAML configuration file.
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    
    # --- Rigorous Secret Management ---
    api_key_placeholder = config['diagnostics']['nlics_metric']['api_key']
    if api_key_placeholder == "ENV_OPENAI_API_KEY":
        api_key = os.environ.get("OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "The 'OPENAI_API_KEY' environment variable is not set. "
                "Please set it to run the NLICS metric computation."
            )
        config['diagnostics']['nlics_metric']['api_key'] = api_key
        
    return config


def main():
    """
    Main function to execute the entire research study.
    """
    # --- 1. Configure Logging ---
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    
    # --- 2. Define File Paths ---
    config_path = Path("config.yaml")
    data_dir = Path("data")
    data_dir.mkdir(exist_ok=True)
    raw_data_path = data_dir / "synthetic_financial_data.pkl"
    results_dir = Path("results")

    # --- 3. Load Inputs: Configuration and Data ---
    logging.info(f"Loading configuration from '{config_path}'...")
    if not config_path.exists():
        raise FileNotFoundError(f"Configuration file not found at '{config_path}'. Please create it.")
    study_config = load_configuration(config_path)

    # Generate synthetic data if it doesn't exist.
    if not raw_data_path.exists():
        _create_synthetic_input_data(raw_data_path)
    
    logging.info(f"Loading data from '{raw_data_path}'...")
    raw_df = pd.read_pickle(raw_data_path)

    # --- 4. Execute the Main Automated Pipeline ---
    try:
        # This function runs all computational tasks (Tasks 1-33), generating
        # all models, predictions, and analyses. It is idempotent and will skip
        # steps if artifacts already exist.
        artifact_paths = execute_quantifying_semantic_shift_study(
            raw_df=raw_df,
            study_config=study_config,
            results_dir=results_dir,
            # NOTE: The main pipeline now handles the human-in-the-loop step
            # by generating the required file and providing instructions.
            # The `run_entailment_ablation` flag is part of the top-level orchestrator.
            run_entailment_ablation=False,
            force_rerun_main_training=False,
            force_rerun_cross_sector=False
        )
        logging.info("\nMain research pipeline completed successfully.")
        logging.info("Generated artifacts are described by the following paths:")
        for name, path in artifact_paths.items():
            logging.info(f"  - {name}: {path}")

    except Exception:  # the traceback is logged via exc_info=True
        logging.error("The main research pipeline failed.", exc_info=True)
        return

    # --- 5. (Optional) Execute the Human-in-the-Loop Analysis ---
    # This section demonstrates how a user would run the final piece of the
    # analysis AFTER they have manually annotated the benchmark file.
    
    # To run this, a user would:
    # 1. Find the file 'results/nli_benchmark_for_annotation.csv'.
    # 2. Open it and replace all "ANNOTATE_HERE" with 1.0, 0.5, or 0.0.
    # 3. Save it as 'results/nli_benchmark_annotated.csv'.
    # 4. Uncomment and run the code block below.
    
    # --------------------------------------------------------------------------
    # UNCOMMENT THE BLOCK BELOW TO RUN THE ENTAILMENT ABLATION ANALYSIS
    # --------------------------------------------------------------------------
    # try:
    #     logging.info("\n" + "="*80 + "\nATTEMPTING TO RUN ENTAILMENT ABLATION ANALYSIS\n" + "="*80)
    #     
    #     # Define the path to the now-annotated file.
    #     annotated_file = results_dir / "nli_benchmark_annotated.csv"
    #     
    #     # Run the dedicated analysis function.
    #     entailment_results = run_entailment_ablation_analysis(
    #         enriched_predictions_df_path=artifact_paths['enriched_predictions'],
    #         annotated_benchmark_path=annotated_file,
    #         study_config=study_config
    #     )
    #     
    #     logging.info("Entailment ablation analysis completed successfully.")
    #     logging.info("Results:\n" + entailment_results.to_string())
    #     
    # except (FileNotFoundError, ValueError) as e:
    #     logging.error(f"Could not run entailment ablation analysis: {e}")
    # --------------------------------------------------------------------------


if __name__ == "__main__":
    # This block makes the script executable from the command line.
    main()
```
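
The script above stops early if `config.yaml` is missing, and the validators that follow read a small set of keys from the loaded configuration. As a rough sketch only, the dictionary below mirrors the keys those validators access. Every value in it (regime names, dates, the model identifier, the dimensions) is a hypothetical placeholder, not the configuration used in the study.

```python
# Hypothetical study configuration mirroring the keys the validators read.
# All values are illustrative placeholders, not the settings used in the study.
study_config = {
    "experimental_design": {
        "regime_definitions": [
            {"regime_name": "regime_a", "start_date": "2018-01-01", "end_date": "2019-12-31"},
            {"regime_name": "regime_b", "start_date": "2020-01-01", "end_date": "2021-12-31"},
        ],
        "data_split_ratios": {"training": 0.6, "validation": 0.2, "testing": 0.2},
    },
    "feature_engineering": {
        "tfidf": {"max_features": 5000},
        "sentence_embeddings": {
            "model_identifier": "example-org/example-sentence-model",  # placeholder ID
            "embedding_dimension": 384,
        },
    },
    "model_training": {
        "architectures": {
            # Must equal tfidf.max_features + embedding_dimension (5000 + 384).
            "feature_transformer": {"input_size": 5384},
        },
    },
}

# Recompute the expected input size from the two feature dimensions.
features = study_config["feature_engineering"]
expected_input_size = (
    features["tfidf"]["max_features"]
    + features["sentence_embeddings"]["embedding_dimension"]
)
configured = study_config["model_training"]["architectures"]["feature_transformer"]["input_size"]
print(expected_input_size == configured)
```

Serialising a dictionary of this shape to YAML would give a starting point for `config.yaml`, assuming `load_configuration` returns the parsed file as a plain nested dictionary.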

In [None]:
# Task 1: Configuration Parameter Validation

def validate_regime_definitions(
    regime_definitions: List[Dict[str, str]]
) -> Tuple[bool, List[str]]:
    """
    Validates the temporal consistency of regime definitions.

    This function performs a series of rigorous checks on the provided regime
    definitions to ensure they are temporally coherent and correctly specified.
    It verifies:
    1.  Each regime dictionary contains the required 'start_date' and 'end_date' keys.
    2.  All date strings are in the valid ISO 8601 format ('YYYY-MM-DD').
    3.  Regimes are chronologically ordered and do not overlap. The end date of
        one regime must be strictly before the start date of the next.

    Args:
        regime_definitions: A list of dictionaries, where each dictionary
                            defines a regime with 'regime_name', 'start_date',
                            and 'end_date'.

    Returns:
        A tuple containing:
        - A boolean indicating if all validation checks passed (True) or not (False).
        - A list of strings describing any validation errors that were found.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

    # --- Input Structure and Type Validation ---
    # Ensure the input is a non-empty list.
    if not isinstance(regime_definitions, list) or not regime_definitions:
        # If the structure is invalid, return immediately.
        return False, ["'regime_definitions' must be a non-empty list."]

    # --- DataFrame Transformation for Vectorized Validation ---
    try:
        # Convert the list of dictionaries into a pandas DataFrame for efficient processing.
        regimes_df = pd.DataFrame(regime_definitions)

        # Check for the presence of required columns.
        required_cols = {'regime_name', 'start_date', 'end_date'}
        if not required_cols.issubset(regimes_df.columns):
            # Identify and report missing columns.
            missing = required_cols - set(regimes_df.columns)
            errors.append(f"Missing required keys in regime definitions: {missing}")
            # Return immediately as further checks are not possible.
            return False, errors

        # --- Step 1: Validate Date Format and Convert to Timestamp ---
        # Attempt to convert date strings to timezone-aware (UTC) timestamps.
        # Using a strict format and raising errors ensures format compliance.
        regimes_df['start_date'] = pd.to_datetime(
            regimes_df['start_date'], format='%Y-%m-%d', errors='raise', utc=True
        )
        regimes_df['end_date'] = pd.to_datetime(
            regimes_df['end_date'], format='%Y-%m-%d', errors='raise', utc=True
        )

    # Catch errors from DataFrame creation or date conversion.
    except (ValueError, TypeError, KeyError) as e:
        # Report specific parsing or structural errors.
        errors.append(f"Invalid date format or structure in regime definitions: {e}")
        return False, errors

    # --- Step 2: Validate Chronological Order within each Regime ---
    # Ensure that for every regime, the start_date is before the end_date.
    invalid_intervals = regimes_df[regimes_df['start_date'] >= regimes_df['end_date']]
    # Check if any such invalid intervals exist.
    if not invalid_intervals.empty:
        # Report each regime where the start date is not before the end date.
        for _, row in invalid_intervals.iterrows():
            errors.append(
                f"In regime '{row['regime_name']}', start_date "
                f"({row['start_date'].date()}) must be before end_date "
                f"({row['end_date'].date()})."
            )

    # --- Step 3: Validate Chronological Order and Non-Overlap between Regimes ---
    # Sort the DataFrame by start_date to ensure regimes are in chronological order.
    regimes_df = regimes_df.sort_values(by='start_date').reset_index(drop=True)

    # Check for temporal overlaps between consecutive regimes.
    # The end date of regime `i` must be strictly less than the start date of regime `i+1`.
    # We use `shift(-1)` to get the start date of the next regime for comparison.
    next_start_dates = regimes_df['start_date'].shift(-1)

    # Identify overlaps where the current end date is on or after the next start date.
    # The last regime will have a NaT for `next_start_dates`, so we exclude it.
    overlaps = regimes_df[regimes_df['end_date'] >= next_start_dates]

    # Check if any overlaps were found.
    if not overlaps.empty:
        # Report each detected overlap for clear debugging.
        for index, row in overlaps.iterrows():
            # Get the name of the next regime that is causing the overlap.
            next_regime_name = regimes_df.loc[index + 1, 'regime_name']
            errors.append(
                f"Overlap detected: Regime '{row['regime_name']}' "
                f"(ends {row['end_date'].date()}) overlaps with or touches "
                f"regime '{next_regime_name}' "
                f"(starts {next_start_dates.loc[index].date()})."
            )

    # --- Final Validation Result ---
    # The validation is successful if and only if the errors list is empty.
    is_valid = not errors
    # Return the overall validity status and the list of detailed errors.
    return is_valid, errors
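
The overlap check in Step 3 above relies on `shift(-1)` to line each regime up with its successor. The sketch below isolates that idea on a tiny, hypothetical set of regimes (the names and dates are made up) in which the second regime starts on the day the first one ends.

```python
import pandas as pd

# Hypothetical regimes; "b" starts on the same day that "a" ends.
regimes = pd.DataFrame(
    {
        "regime_name": ["a", "b", "c"],
        "start_date": pd.to_datetime(["2018-01-01", "2019-06-30", "2020-01-01"]),
        "end_date": pd.to_datetime(["2019-06-30", "2019-12-31", "2020-12-31"]),
    }
).sort_values("start_date").reset_index(drop=True)

# Start date of the following regime; NaT for the last row.
next_start = regimes["start_date"].shift(-1)

# Comparisons against NaT evaluate to False, so the last regime never matches.
overlapping = regimes.loc[regimes["end_date"] >= next_start, "regime_name"].tolist()
print(overlapping)
```

Because the comparison is `>=`, two regimes that merely touch on a shared date are reported as well, which matches the "strictly before" rule in the docstring.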


def validate_data_split_ratios(
    data_split_ratios: Dict[str, float]
) -> Tuple[bool, List[str]]:
    """
    Validates the mathematical consistency of data split ratios.

    This function ensures that the data split ratios for training, validation,
    and testing are mathematically sound. It verifies:
    1.  The dictionary contains exactly 'training', 'validation', and 'testing' keys.
    2.  All ratio values are numeric (float or int).
    3.  Each ratio is strictly greater than 0 and less than 1.
    4.  The sum of the ratios is equal to 1.0, within a small tolerance
        to account for floating-point arithmetic inaccuracies.

    Args:
        data_split_ratios: A dictionary containing the split ratios, e.g.,
                           {'training': 0.6, 'validation': 0.2, 'testing': 0.2}.

    Returns:
        A tuple containing:
        - A boolean indicating if all validation checks passed (True) or not (False).
        - A list of strings describing any validation errors that were found.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

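    # --- Input Structure and Type Validation ---
    # Guard against non-dict input, which would otherwise raise on `.keys()`.
    if not isinstance(data_split_ratios, dict):
        return False, ["'data_split_ratios' must be a dictionary."]

    # Define the expected keys for the dictionary.
    expected_keys = {'training', 'validation', 'testing'}
    # Check if the provided keys match the expected keys.
    if set(data_split_ratios.keys()) != expected_keys:
        # Report the discrepancy, including the keys that were actually found.
        errors.append(
            f"data_split_ratios must contain exactly these keys: {expected_keys}, "
            f"but found {list(data_split_ratios.keys())}."
        )
        # Return immediately as further checks are invalid.
        return False, errors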

    # --- Step 1: Validate Individual Ratio Properties ---
    # Iterate through each split type and its corresponding ratio.
    for split_name, ratio in data_split_ratios.items():
        # Check if the ratio is a valid number (int or float).
        if not isinstance(ratio, (int, float)):
            # Report a type error if the ratio is not numeric.
            errors.append(f"Ratio for '{split_name}' must be a number, but got {type(ratio)}.")
            # Continue to the next item to check all ratios.
            continue

        # Check if the ratio is within the valid range (0, 1).
        if not (0 < ratio < 1):
            # Report an error if the ratio is outside the valid bounds.
            errors.append(
                f"Ratio for '{split_name}' must be between 0 and 1 (exclusive), but got {ratio}."
            )

    # If there were errors with individual ratios, return now.
    if errors:
        return False, errors

    # --- Step 2: Validate the Sum of Ratios ---
    # Calculate the sum of all ratio values.
    total_ratio = sum(data_split_ratios.values())

    # Use numpy.isclose for robust floating-point comparison.
    # This checks if `total_ratio` is approximately equal to 1.0.
    # Equation: |total_ratio - 1.0| <= atol + rtol * |1.0|
    if not np.isclose(total_ratio, 1.0, atol=1e-9):
        # Report an error if the sum is not 1.0.
        errors.append(
            f"The sum of data split ratios must be 1.0, but it is {total_ratio}."
        )

    # --- Final Validation Result ---
    # The validation is successful if and only if the errors list is empty.
    is_valid = not errors
    # Return the overall validity status and the list of detailed errors.
    return is_valid, errors
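
The tolerance rule quoted in Step 2 above has two terms, and `numpy.isclose` uses a default `rtol` of `1e-5` unless told otherwise. The following sketch, with made-up numbers, shows how that default interacts with a small `atol`. It illustrates the library's behaviour and is not a proposed change to the validator.

```python
import numpy as np

# Hypothetical split ratios that sum to 1.0 up to floating-point error.
ratios = {"training": 0.6, "validation": 0.2, "testing": 0.2}
total = sum(ratios.values())
sums_to_one = bool(np.isclose(total, 1.0, atol=1e-9))

# With the default rtol=1e-5, a sum that is off by 5e-6 is still accepted,
# because the allowed gap is atol + rtol * |1.0|, roughly 1e-5.
loose = bool(np.isclose(1.0 + 5e-6, 1.0, atol=1e-9))

# Setting rtol=0.0 leaves only the absolute tolerance in play.
strict = bool(np.isclose(1.0 + 5e-6, 1.0, rtol=0.0, atol=1e-9))

print(sums_to_one, loose, strict)
```

Whether the looser or the stricter behaviour is preferable depends on how the ratios are produced. Hand-written values such as `0.6 / 0.2 / 0.2` pass either way.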


def validate_feature_engineering_config(
    feature_config: Dict[str, Any],
    model_arch_config: Dict[str, Any]
) -> Tuple[bool, List[str]]:
    """
    Validates the feature engineering and related model parameters.

    This function checks the integrity of the feature engineering configuration.
    It verifies:
    1.  TF-IDF `max_features` is a positive integer.
    2.  The sentence-transformer model identifier is valid and the model can be
        loaded from HuggingFace Hub.
    3.  The configured `embedding_dimension` matches the actual output dimension
        of the loaded sentence-transformer model.
    4.  The `input_size` for the `feature_transformer` architecture correctly
        matches the sum of TF-IDF features and embedding dimensions.

    Args:
        feature_config: The 'feature_engineering' section of the study config.
        model_arch_config: The 'architectures' section of the model training config.

    Returns:
        A tuple containing:
        - A boolean indicating if all validation checks passed (True) or not (False).
        - A list of strings describing any validation errors that were found.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

    # --- Step 1: Validate TF-IDF Configuration ---
    try:
        # Retrieve the max_features parameter for TF-IDF.
        max_features = feature_config['tfidf']['max_features']
        # Check if it's an integer and is positive.
        if not isinstance(max_features, int) or max_features <= 0:
            # Report an error if the validation fails.
            errors.append(f"'tfidf.max_features' must be a positive integer, but got {max_features}.")
    # Catch errors if the keys are missing in the config dictionary.
    except KeyError:
        errors.append("Missing 'tfidf' or 'max_features' configuration.")
    except TypeError:
        errors.append("'tfidf' configuration is not correctly structured.")

    # --- Step 2: Validate Sentence Embedding Configuration ---
    try:
        # Retrieve sentence embedding parameters.
        model_id = feature_config['sentence_embeddings']['model_identifier']
        config_dim = feature_config['sentence_embeddings']['embedding_dimension']

        # --- Sub-step 2a: Validate Model Accessibility ---
        try:
            # Attempt to load the sentence transformer model. This is the definitive
            # test of whether the model identifier is valid and accessible.
            logging.info(f"Attempting to load sentence transformer model: '{model_id}'...")
            model = SentenceTransformer(model_id)
            logging.info("Model loaded successfully.")

            # --- Sub-step 2b: Validate Embedding Dimension ---
            # Get the actual embedding dimension from the loaded model.
            actual_dim = model.get_sentence_embedding_dimension()
            # Check if the configured dimension matches the actual dimension.
            if actual_dim != config_dim:
                errors.append(
                    f"Configured 'embedding_dimension' ({config_dim}) does not match "
                    f"the actual dimension of model '{model_id}' ({actual_dim})."
                )
        # Catch exceptions related to model loading (e.g., not found, network error).
        except Exception as e:
            errors.append(
                f"Failed to load sentence transformer model '{model_id}'. "
                f"Check model identifier and network connection. Error: {e}"
            )
            # Set actual_dim to None as it couldn't be determined.
            actual_dim = None

    # Catch errors if the keys are missing in the config dictionary.
    except KeyError:
        errors.append("Missing 'sentence_embeddings' configuration.")
    except TypeError:
        errors.append("'sentence_embeddings' configuration is not correctly structured.")

    # --- Step 3: Cross-Validate Feature Transformer Input Size ---
    try:
        # Retrieve the configured input size for the feature-enhanced transformer.
        feature_transformer_input_size = model_arch_config['feature_transformer']['input_size']

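        # This check can only be performed if previous steps were successful.
        # Requiring integers here keeps a malformed `max_features` (e.g. a string)
        # from raising a TypeError that would be misreported by the handler below.
        if (
            isinstance(locals().get('max_features'), int)
            and isinstance(locals().get('actual_dim'), int)
        ):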
            # Calculate the expected input size from the sum of feature dimensions.
            expected_input_size = max_features + actual_dim
            # Compare the expected size with the configured size.
            if feature_transformer_input_size != expected_input_size:
                errors.append(
                    f"'feature_transformer.input_size' is configured as "
                    f"{feature_transformer_input_size}, but the expected sum of "
                    f"TF-IDF features ({max_features}) and embedding dimension "
                    f"({actual_dim}) is {expected_input_size}."
                )
    # Catch errors if the keys are missing in the config dictionary.
    except KeyError:
        errors.append("Missing 'feature_transformer' or 'input_size' configuration.")
    except TypeError:
        errors.append("'feature_transformer' configuration is not correctly structured.")

    # --- Final Validation Result ---
    # The validation is successful if and only if the errors list is empty.
    is_valid = not errors
    # Return the overall validity status and the list of detailed errors.
    return is_valid, errors


def run_config_validation_suite(
    study_config: Dict[str, Any]
) -> None:
    """
    Executes the full suite of configuration validation checks.

    This orchestrator function runs all validation steps for Task 1 and
    provides a consolidated report of the outcome. It is the main entry
    point for validating the entire study configuration dictionary.

    Args:
        study_config: The complete study configuration dictionary.

    Raises:
        ValueError: If any of the validation checks fail, this exception is
                    raised with a detailed report of all errors.
    """
    # Initialize a list to aggregate all errors from the validation suite.
    all_errors = []

    # --- Execute Task 1, Step 1 ---
    logging.info("--- Running Task 1, Step 1: Regime Definition Validation ---")
    # Validate the temporal consistency of regime definitions.
    regime_valid, regime_errors = validate_regime_definitions(
        study_config['experimental_design']['regime_definitions']
    )
    # Check if the validation passed.
    if not regime_valid:
        # If not, add the errors to the master list.
        all_errors.extend(regime_errors)
        logging.error("Regime definition validation FAILED.")
    else:
        # Log success if validation passed.
        logging.info("Regime definition validation PASSED.")

    # --- Execute Task 1, Step 2 ---
    logging.info("\n--- Running Task 1, Step 2: Data Split Ratio Validation ---")
    # Validate the mathematical consistency of data split ratios.
    split_valid, split_errors = validate_data_split_ratios(
        study_config['experimental_design']['data_split_ratios']
    )
    # Check if the validation passed.
    if not split_valid:
        # If not, add the errors to the master list.
        all_errors.extend(split_errors)
        logging.error("Data split ratio validation FAILED.")
    else:
        # Log success if validation passed.
        logging.info("Data split ratio validation PASSED.")

    # --- Execute Task 1, Step 3 ---
    logging.info("\n--- Running Task 1, Step 3: Feature Engineering Config Validation ---")
    # Validate the feature engineering parameters and their cross-dependencies.
    feature_valid, feature_errors = validate_feature_engineering_config(
        study_config['feature_engineering'],
        study_config['model_training']['architectures']
    )
    # Check if the validation passed.
    if not feature_valid:
        # If not, add the errors to the master list.
        all_errors.extend(feature_errors)
        logging.error("Feature engineering config validation FAILED.")
    else:
        # Log success if validation passed.
        logging.info("Feature engineering config validation PASSED.")

    # --- Final Report ---
    # Check if any errors were collected during the entire suite.
    if all_errors:
        # If there are errors, format them into a single report.
        error_report = "\n".join([f"- {error}" for error in all_errors])
        # Raise a ValueError with the consolidated report.
        raise ValueError(f"Configuration validation failed with the following errors:\n{error_report}")
    else:
        # If no errors were found, log the overall success.
        logging.info("\n>>> All configuration parameters validated successfully. <<<")


In [None]:
# Task 2: DataFrame Structure and Index Validation

def validate_multi_index_integrity(
    df: pd.DataFrame
) -> Tuple[bool, List[str]]:
    """
    Validates the structural integrity of the DataFrame's MultiIndex.

    This function performs a rigorous, production-grade validation of the
    DataFrame's index to ensure it conforms to the expected panel data
    structure: a 3-level MultiIndex of ('date', 'ticker', 'sector').
    It verifies:
    1.  The index is a MultiIndex with exactly three levels.
    2.  The names of the levels are exactly ['date', 'ticker', 'sector'].
    3.  The data types of the levels are correct: DatetimeIndex for 'date',
        and string/object for 'ticker' and 'sector'.
    4.  The index is unique, ensuring no duplicate (date, ticker, sector) entries.
    5.  The index is monotonically increasing, which is critical for correct
        time-series operations and preventing lookahead bias.

    Args:
        df: The pandas DataFrame to be validated.

    Returns:
        A tuple containing:
        - A boolean, True if the index is valid, False otherwise.
        - A list of strings, containing detailed descriptions of any validation failures.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

    # --- Step 1: Validate MultiIndex Presence and Level Count ---
    # Check if the index is an instance of pandas MultiIndex.
    if not isinstance(df.index, pd.MultiIndex):
        # If not, this is a fundamental structural failure.
        errors.append("DataFrame index is not a MultiIndex.")
        # Return immediately as subsequent checks are invalid.
        return False, errors

    # Check if the number of levels in the MultiIndex is exactly 3.
    if df.index.nlevels != 3:
        # Report the actual number of levels found.
        errors.append(
            f"Expected 3 index levels, but found {df.index.nlevels}."
        )
        # Return immediately.
        return False, errors

    # --- Step 2: Validate Index Level Names ---
    # Define the exact, ordered list of expected index level names.
    expected_names = ['date', 'ticker', 'sector']
    # Compare the actual names with the expected names.
    if list(df.index.names) != expected_names:
        # Report the discrepancy.
        errors.append(
            f"Expected index names {expected_names}, but found {list(df.index.names)}."
        )

    # --- Step 3: Validate Index Level Data Types ---
    # Check if the first level ('date') is a DatetimeIndex.
    if not isinstance(df.index.levels[0], pd.DatetimeIndex):
        errors.append(
            f"Expected index level 0 ('date') to be a DatetimeIndex, "
            f"but found {type(df.index.levels[0])}."
        )

    # Check if the second level ('ticker') has a string/object dtype.
    if not pd.api.types.is_string_dtype(df.index.levels[1]):
        errors.append(
            f"Expected index level 1 ('ticker') to have a string dtype, "
            f"but found {df.index.levels[1].dtype}."
        )

    # Check if the third level ('sector') has a string/object dtype.
    if not pd.api.types.is_string_dtype(df.index.levels[2]):
        errors.append(
            f"Expected index level 2 ('sector') to have a string dtype, "
            f"but found {df.index.levels[2].dtype}."
        )

    # --- Step 4: Validate Index Uniqueness ---
    # The `is_unique` property provides a highly optimized check for duplicates.
    if not df.index.is_unique:
        # Report that duplicate index entries exist.
        errors.append("Duplicate entries found in the MultiIndex. Index must be unique.")

    # --- Step 5: Validate Index Sorting ---
    # The `is_monotonic_increasing` property checks if the index is sorted.
    # This is crucial for time-series integrity.
    if not df.index.is_monotonic_increasing:
        # Report the sorting failure and suggest a remedy.
        errors.append(
            "Index is not monotonically increasing. Please sort the DataFrame "
            "using `df.sort_index(inplace=True)` before proceeding."
        )

    # --- Final Validation Result ---
    # The validation is successful if and only if the errors list is empty.
    is_valid = not errors
    # Return the overall validity status and the list of detailed errors.
    return is_valid, errors
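
To make the index requirements above concrete, here is a minimal sketch that builds a small panel with the expected `('date', 'ticker', 'sector')` MultiIndex and probes the same properties. The tickers, sectors, and prices are hypothetical and serve only to illustrate the layout.

```python
import pandas as pd

# Hypothetical tickers and sectors, used only to illustrate the index layout.
index = pd.MultiIndex.from_arrays(
    [
        pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]),
        ["AAA", "BBB", "AAA", "BBB"],
        ["Tech", "Energy", "Tech", "Energy"],
    ],
    names=["date", "ticker", "sector"],
)
panel = pd.DataFrame({"Close": [10.0, 20.0, 10.5, 19.5]}, index=index)

# The same structural properties that the validator inspects.
checks = {
    "three_levels": panel.index.nlevels == 3,
    "names": list(panel.index.names) == ["date", "ticker", "sector"],
    "date_level_is_datetime": isinstance(panel.index.levels[0], pd.DatetimeIndex),
    "unique": bool(panel.index.is_unique),
    "sorted": bool(panel.index.is_monotonic_increasing),
}
print(checks)

# Reversing the rows breaks the ordering check; sort_index() restores it.
shuffled = panel.iloc[::-1]
sorted_again = shuffled.sort_index()
print(shuffled.index.is_monotonic_increasing, sorted_again.index.is_monotonic_increasing)
```

Sorting is lexicographic over `(date, ticker, sector)`, so rows for different tickers on the same date sit next to each other. Any per-security calculation therefore has to group by ticker first.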


def validate_column_schema(
    df: pd.DataFrame
) -> Tuple[bool, List[str]]:
    """
    Validates the DataFrame's column schema, data types, and basic data integrity.

    This function ensures the DataFrame has the exact required columns and that
    each column conforms to its specified data type. It also performs a crucial
    integrity check for financial data: ensuring all price and volume figures
    are positive, as negative values are nonsensical and indicate data corruption.

    It verifies:
    1.  The set of column names exactly matches the expected schema.
    2.  Each column has the correct, precise data type (e.g., float64, int64).
    3.  All values in numeric price and volume columns are strictly positive.

    Args:
        df: The pandas DataFrame to be validated.

    Returns:
        A tuple containing:
        - A boolean, True if the schema is valid, False otherwise.
        - A list of strings, containing detailed descriptions of any validation failures.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

    # --- Step 1: Validate Column Presence and Names ---
    # Define the exact set of expected column names. Using a set is robust to order.
    expected_columns: Set[str] = {
        'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
        'aggregated_text', 'target_return'
    }
    # Compare the set of actual columns to the expected set.
    if set(df.columns) != expected_columns:
        # Report missing and/or unexpected columns for precise debugging.
        missing = expected_columns - set(df.columns)
        extra = set(df.columns) - expected_columns
        if missing:
            errors.append(f"Missing required columns: {sorted(list(missing))}")
        if extra:
            errors.append(f"Found unexpected columns: {sorted(list(extra))}")
        # Return immediately if the column set is wrong.
        return False, errors

    # --- Step 2: Validate Column Data Types ---
    # Define the precise expected data type for each column.
    expected_dtypes: Dict[str, np.dtype] = {
        'Open': np.dtype('float64'),
        'High': np.dtype('float64'),
        'Low': np.dtype('float64'),
        'Close': np.dtype('float64'),
        'Adj Close': np.dtype('float64'),
        'Volume': np.dtype('int64'),
        'aggregated_text': np.dtype('object'),
        'target_return': np.dtype('float64')
    }
    # Iterate through the expected dtypes and compare with the actual dtypes.
    for col, expected_dtype in expected_dtypes.items():
        # Check for exact dtype match.
        if df[col].dtype != expected_dtype:
            errors.append(
                f"Column '{col}' has incorrect dtype. Expected {expected_dtype}, "
                f"but found {df[col].dtype}."
            )

    # --- Step 3: Validate Data Integrity (Positivity of Numeric Columns) ---
    # Define the columns that must contain only positive values.
    positive_cols = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
    # Perform a highly efficient, vectorized check for non-positive values.
    # This check ignores NaNs, which are handled separately.
    non_positive_mask = (df[positive_cols] <= 0)
    # Check if any non-positive values were found in any of these columns.
    if non_positive_mask.any().any():
        # Identify which columns contain non-positive values.
        cols_with_issues = non_positive_mask.any()[non_positive_mask.any()].index.tolist()
        errors.append(
            f"Non-positive values found in the following columns which should "
            f"be strictly positive: {cols_with_issues}."
        )

    # --- Final Validation Result ---
    # The validation is successful if and only if the errors list is empty.
    is_valid = not errors
    # Return the overall validity status and the list of detailed errors.
    return is_valid, errors
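
The positivity check in Step 3 above can be tried on a toy frame. The sketch below uses a hypothetical three-row table with one zero price and one missing value. It shows that the comparison flags the zero and leaves the `NaN` alone.

```python
import pandas as pd

# Hypothetical data: one zero in 'Open' and one NaN in 'Close'.
frame = pd.DataFrame(
    {
        "Open": [10.0, 0.0, 12.0],
        "Close": [10.5, 11.0, float("nan")],
        "Volume": [100, 200, 300],
    }
)

# Element-wise comparison; NaN <= 0 evaluates to False, so NaNs are not flagged.
non_positive_mask = frame[["Open", "Close", "Volume"]] <= 0

# Reduce per column, then keep only the columns that contain a violation.
column_has_issue = non_positive_mask.any()
flagged = column_has_issue[column_has_issue].index.tolist()
print(flagged)
```

Missing values therefore need their own check, which is why the comment in the validator says they are handled separately.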


def validate_target_return_construction(
    df: pd.DataFrame
) -> Tuple[bool, List[str]]:
    """
    Validates the temporal alignment and construction of the 'target_return'.

    This function provides a critical check against lookahead bias by verifying
    that the `target_return` for day `t` is correctly calculated using the
    `Adj Close` prices of day `t` and day `t+1`. It performs this check on a
    per-ticker basis to respect the time-series nature of the data.

    It verifies:
    1.  The `target_return` value at each row mathematically corresponds to the
        percentage change in `Adj Close` from the current day to the next.
    2.  The last observation for each ticker has a `NaN` `target_return`, as
        its future return cannot be known.

    Args:
        df: The pandas DataFrame to be validated. Must have a sorted MultiIndex.

    Returns:
        A tuple containing:
        - A boolean, True if the target construction is valid, False otherwise.
        - A list of strings, containing tickers for which validation failed.
    """
    # Initialize a list to collect tickers with validation failures.
    failed_tickers = []

    # --- Input Validation ---
    # Ensure the DataFrame is not empty and has the required columns.
    if df.empty or 'Adj Close' not in df.columns or 'target_return' not in df.columns:
        return False, ["Input DataFrame is empty or missing required columns."]

    # --- Step 1: Re-calculate the Target Return in a Vectorized, Grouped Manner ---
    # Group by the 'ticker' level of the index to perform calculations per security.
    # This is essential to prevent data leakage across securities.
    # Equation: target_return[t] = (Adj_Close[t+1] - Adj_Close[t]) / Adj_Close[t] * 100
    # The shift must happen inside the groupby: an ungrouped `.shift(-1)` moves
    # values between adjacent rows, which belong to different tickers when the
    # index is sorted by date first.
    next_adj_close = df.groupby(level='ticker')['Adj Close'].shift(-1)
    recalculated_return = (next_adj_close / df['Adj Close'] - 1) * 100

    # --- Step 2: Compare Recalculated Return with Existing Target ---
    # Use `numpy.isclose` for a robust floating-point comparison.
    # `equal_nan=True` is critical, as we expect NaNs to match at the end of each series.
    is_correct = np.isclose(
        df['target_return'],
        recalculated_return,
        atol=1e-9,
        equal_nan=True
    )

    # --- Step 3: Identify Tickers with Mismatches ---
    # If not all values are correct, identify the specific tickers that failed.
    if not is_correct.all():
        # Find the rows where the comparison failed.
        mismatch_df = df[~is_correct]
        # Get the unique tickers from these mismatched rows.
        failed_tickers = mismatch_df.index.get_level_values('ticker').unique().tolist()

    # --- Final Validation Result ---
    # The validation is successful if and only if no tickers failed.
    is_valid = not failed_tickers
    # Return the overall validity status and the list of failed tickers.
    return is_valid, failed_tickers
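
The alignment described in the docstring above (the return for day `t` uses the prices of day `t` and day `t+1`, per ticker) can be sketched on a two-ticker toy panel. The tickers and prices are hypothetical, and the sector level is left out for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical two-ticker panel, sorted by (date, ticker).
index = pd.MultiIndex.from_arrays(
    [
        pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]),
        ["AAA", "BBB", "AAA", "BBB"],
    ],
    names=["date", "ticker"],
)
prices = pd.Series([100.0, 50.0, 110.0, 45.0], index=index, name="Adj Close")

# Next-day price within each ticker; the last row of every ticker becomes NaN.
next_price = prices.groupby(level="ticker").shift(-1)

# Forward return in percent: (P[t+1] - P[t]) / P[t] * 100.
forward_return = (next_price / prices - 1.0) * 100.0
print(forward_return)
```

Shifting within each group keeps one ticker's price from landing in another ticker's row. The final observation of every ticker ends up as `NaN`, in line with the second property listed in the docstring.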


def run_dataframe_validation_suite(
    df: pd.DataFrame
) -> None:
    """
    Executes the full suite of DataFrame validation checks for Task 2.

    This orchestrator function runs all validation steps for the input DataFrame
    and provides a consolidated report. It is the main entry point for
    validating the data structure before any processing.

    Args:
        df: The complete input pandas DataFrame.

    Raises:
        ValueError: If any of the validation checks fail, this exception is
                    raised with a detailed report of all errors.
    """
    # Initialize a list to aggregate all errors from the validation suite.
    all_errors = []

    # --- Execute Task 2, Step 1 ---
    logging.info("--- Running Task 2, Step 1: Multi-Index Integrity Validation ---")
    index_valid, index_errors = validate_multi_index_integrity(df)
    if not index_valid:
        all_errors.extend(index_errors)
        logging.error("Multi-Index integrity validation FAILED.")
    else:
        logging.info("Multi-Index integrity validation PASSED.")

    # --- Execute Task 2, Step 2 ---
    logging.info("\n--- Running Task 2, Step 2: Column Schema Validation ---")
    schema_valid, schema_errors = validate_column_schema(df)
    if not schema_valid:
        all_errors.extend(schema_errors)
        logging.error("Column schema validation FAILED.")
    else:
        logging.info("Column schema validation PASSED.")

    # --- Execute Task 2, Step 3 ---
    # This check is only meaningful if the index is valid (which includes being
    # sorted) and the two columns it reads are present.
    required_for_target = {'Adj Close', 'target_return'}
    if index_valid and required_for_target.issubset(df.columns):
        logging.info("\n--- Running Task 2, Step 3: Target Return Construction Validation ---")
        target_valid, target_errors = validate_target_return_construction(df)
        if not target_valid:
            # Format the error message for failed tickers.
            error_msg = f"Target return construction is incorrect for tickers: {target_errors}"
            all_errors.append(error_msg)
            logging.error("Target return construction validation FAILED.")
        else:
            logging.info("Target return construction validation PASSED.")
    else:
        # Skip this check, as its result would not be meaningful.
        logging.warning(
            "\n--- SKIPPING Task 2, Step 3: Target Return Construction Validation --- "
            "Reason: Index is not valid or required columns are missing."
        )

    # --- Final Report ---
    # Check if any errors were collected during the entire suite.
    if all_errors:
        # Format them into a single, readable report.
        error_report = "\n".join([f"- {error}" for error in all_errors])
        # Raise a ValueError with the consolidated report.
        raise ValueError(f"DataFrame validation failed with the following errors:\n{error_report}")
    else:
        # If no errors were found, log the overall success.
        logging.info("\n>>> All DataFrame structure and integrity checks passed successfully. <<<")

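The orchestrator above follows a collect-then-raise pattern: each check returns a validity flag and a list of messages, and a single `ValueError` reports every failure at once. A minimal sketch of that pattern follows. The two checks (`check_not_empty`, `check_has_column`) and the `run_checks` helper are hypothetical stand-ins for illustration, not functions from this project.

```python
from typing import Callable, List, Tuple

import pandas as pd

# A check takes a DataFrame and returns (is_valid, list_of_error_messages).
Check = Callable[[pd.DataFrame], Tuple[bool, List[str]]]


def check_not_empty(df: pd.DataFrame) -> Tuple[bool, List[str]]:
    """Hypothetical check: the frame must contain at least one row."""
    errors = [] if len(df) > 0 else ["DataFrame is empty."]
    return not errors, errors


def check_has_column(df: pd.DataFrame) -> Tuple[bool, List[str]]:
    """Hypothetical check: an 'aggregated_text' column must exist."""
    errors = [] if "aggregated_text" in df.columns else ["Missing column: aggregated_text."]
    return not errors, errors


def run_checks(df: pd.DataFrame, checks: List[Check]) -> None:
    """Run every check, then raise one ValueError listing all failures."""
    all_errors: List[str] = []
    for check in checks:
        is_valid, errors = check(df)
        if not is_valid:
            all_errors.extend(errors)
    if all_errors:
        report = "\n".join(f"- {error}" for error in all_errors)
        raise ValueError(f"DataFrame validation failed with the following errors:\n{report}")


# An empty frame fails both checks; both messages end up in one exception.
try:
    run_checks(pd.DataFrame(), [check_not_empty, check_has_column])
except ValueError as exc:
    message = str(exc)
    print(message)
```

Collecting all failures before raising lets a caller fix several data problems in one pass instead of discovering them one exception at a time.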

In [None]:
# Task 3: Data Quality Assessment and Cleansing

def assess_missing_data(
    df: pd.DataFrame
) -> pd.DataFrame:
    """
    Generates a detailed report on missing and empty data patterns.

    This function provides a diagnostic overview of data completeness. Crucially,
    it distinguishes between truly missing values (NaN) and intentionally empty
    strings (''), which is vital for the 'aggregated_text' column where an
    empty string signifies a day with no news, a meaningful data point.

    Args:
        df: The input pandas DataFrame to assess.

    Returns:
        A pandas DataFrame summarizing the missing data statistics for each column,
        including counts and percentages of NaN values and empty strings.
    """
    # --- Input Validation ---
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")

    # --- Step 1: Calculate NaN Statistics ---
    # Calculate the number of NaN values in each column.
    nan_counts = df.isnull().sum()
    # Calculate the percentage of NaN values.
    nan_percentages = (nan_counts / len(df)) * 100

    # --- Step 2: Calculate Empty String Statistics ---
    # Initialize a series for empty string counts with zeros.
    empty_string_counts = pd.Series(0, index=df.columns, dtype=int)
    # Iterate through columns with object dtype to check for empty strings.
    for col in df.select_dtypes(include=['object']).columns:
        # This vectorized operation is highly efficient.
        empty_string_counts[col] = (df[col] == '').sum()
    # Calculate the percentage of empty strings.
    empty_string_percentages = (empty_string_counts / len(df)) * 100

    # --- Step 3: Assemble the Report ---
    # Create a DataFrame from the calculated statistics.
    report = pd.DataFrame({
        'nan_count': nan_counts,
        'nan_percentage': nan_percentages,
        'empty_string_count': empty_string_counts,
        'empty_string_percentage': empty_string_percentages
    })
    # Return the comprehensive report.
    return report


def validate_financial_data_consistency(
    df: pd.DataFrame,
    return_threshold: float = 50.0
) -> Dict[str, pd.Series]:
    """
    Performs consistency checks specific to financial market data.

    This function validates the integrity of financial data by checking for
    logical impossibilities, such as incorrect price relationships or extreme,
    likely erroneous, daily returns.

    It validates:
    1.  OHLC consistency: Low <= {Open, Close} <= High for each row.
    2.  Unrealistic returns: Flags any daily adjusted close returns exceeding a
        specified percentage threshold (e.g., +/- 50%).

    Args:
        df: The input DataFrame with a valid MultiIndex.
        return_threshold: The percentage threshold to define an unrealistic
                          daily return. Defaults to 50.0.

    Returns:
        A dictionary where keys are validation check names and values are
        boolean pandas Series. A `True` value in a series indicates the
        corresponding row passed that specific check.
    """
    # --- Input Validation ---
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")
    required_cols = {'Open', 'High', 'Low', 'Close', 'Adj Close'}
    if not required_cols.issubset(df.columns):
        raise ValueError(f"DataFrame is missing required financial columns: {required_cols - set(df.columns)}")

    # --- Step 1: Validate OHLC (Open, High, Low, Close) Relationships ---
    # This vectorized boolean mask efficiently checks all rows at once.
    # A row is valid if all conditions are met.
    ohlc_valid_mask = (df['Low'] <= df['Open']) & \
                      (df['Low'] <= df['Close']) & \
                      (df['High'] >= df['Open']) & \
                      (df['High'] >= df['Close'])

    # --- Step 2: Validate for Unrealistic Returns ---
    # Calculate daily returns per ticker so that the first observation of one
    # ticker is never compared with the last observation of another.
    # Assumes rows are sorted by date within each ticker.
    daily_returns = df.groupby(level='ticker')['Adj Close'].pct_change() * 100
    # Create a mask where `True` indicates a realistic return.
    # Comparisons against NaN evaluate to False, so `.fillna(True)` on the
    # comparison result would have no effect. Undefined returns (e.g., the first
    # day of each ticker) are passed explicitly via `isna()`, as they are not
    # "unrealistic".
    realistic_return_mask = (daily_returns.abs() <= return_threshold) | daily_returns.isna()

    # --- Step 3: Assemble and Return Results ---
    # Return a dictionary of boolean masks for detailed, row-level reporting.
    return {
        'ohlc_valid': ohlc_valid_mask,
        'realistic_return': realistic_return_mask
    }


def clean_and_standardize_text_data(
    df: pd.DataFrame
) -> pd.DataFrame:
    """
    Cleans and standardizes the 'aggregated_text' column.

    This function performs essential text cleansing operations to prepare the
    data for NLP processing. It operates on a copy of the DataFrame to ensure
    the original data remains untouched (functional purity).

    The cleansing pipeline is:
    1.  Replace any `NaN` values in 'aggregated_text' with empty strings ('').
    2.  Ensure consistent UTF-8 encoding to prevent downstream errors.
    3.  Remove any null bytes ('\\x00'), which can corrupt text processing pipelines.

    Args:
        df: The input pandas DataFrame.

    Returns:
        A new pandas DataFrame with the 'aggregated_text' column cleaned.
    """
    # --- Input Validation ---
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")
    if 'aggregated_text' not in df.columns:
        raise ValueError("DataFrame is missing the 'aggregated_text' column.")

    # --- Operate on a copy to avoid side effects ---
    df_clean = df.copy()

    # --- Step 1: Handle NaN values ---
    # Replace any NaN values specifically in this column with an empty string.
    # This standardizes the representation of "no text available".
    df_clean['aggregated_text'] = df_clean['aggregated_text'].fillna('')

    # --- Step 2: Ensure Consistent UTF-8 Encoding ---
    # This two-step process robustly handles potential encoding issues.
    # It encodes to bytes (ignoring errors) and decodes back to a clean UTF-8 string.
    df_clean['aggregated_text'] = df_clean['aggregated_text'].str.encode(
        'utf-8', 'ignore'
    ).str.decode('utf-8')

    # --- Step 3: Remove Null Bytes ---
    # Null bytes (\x00) are problematic for many tools. This removes them.
    # `regex=False` ensures this is a fast, literal string replacement.
    df_clean['aggregated_text'] = df_clean['aggregated_text'].str.replace(
        '\x00', '', regex=False
    )

    # Return the cleaned DataFrame.
    return df_clean


def run_data_quality_and_cleansing_suite(
    df: pd.DataFrame,
    return_threshold: float = 50.0
) -> pd.DataFrame:
    """
    Orchestrates the full data quality assessment and cleansing pipeline.

    This function executes the complete suite of checks and cleansing operations
    from Task 3. It first generates diagnostic reports and then applies the
    necessary cleaning steps, returning a new, validated, and cleaned DataFrame.

    Args:
        df: The raw input pandas DataFrame.
        return_threshold: The percentage threshold for flagging unrealistic returns.

    Returns:
        A new, cleaned, and validated pandas DataFrame ready for the next phase.

    Raises:
        ValueError: If critical data consistency checks fail.
    """
    # --- Input Validation ---
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")

    logging.info("--- Running Task 3: Data Quality Assessment and Cleansing Suite ---")

    # --- Step 1: Assess Missing Data ---
    logging.info("\nStep 1: Assessing missing data patterns...")
    # Generate the missing data report.
    missing_data_report = assess_missing_data(df)
    logging.info("Missing Data Report:\n" + missing_data_report.to_string())
    # The key takeaway is understanding the data; cleansing happens in Step 3.

    # --- Step 2: Validate Financial Data Consistency ---
    logging.info("\nStep 2: Validating financial data consistency...")
    # Get the boolean masks for financial data validation.
    validation_results = validate_financial_data_consistency(df, return_threshold)

    # Aggregate any failures from the consistency checks.
    all_errors = []
    # Check for OHLC violations.
    if not validation_results['ohlc_valid'].all():
        # Count the number of rows with invalid OHLC data.
        num_invalid_ohlc = (~validation_results['ohlc_valid']).sum()
        all_errors.append(f"{num_invalid_ohlc} rows failed OHLC consistency check.")

    # Check for unrealistic returns.
    if not validation_results['realistic_return'].all():
        # Count the number of rows with unrealistic returns.
        num_unrealistic_returns = (~validation_results['realistic_return']).sum()
        # This is treated as a warning, not a fatal error, but is reported.
        logging.warning(
            f"{num_unrealistic_returns} rows have unrealistic daily returns "
            f"(> +/-{return_threshold}%)."
        )

    # If there are critical errors, raise an exception.
    if all_errors:
        error_report = "\n".join([f"- {error}" for error in all_errors])
        raise ValueError(f"Critical data consistency validation failed:\n{error_report}")
    else:
        logging.info("Financial data consistency validation PASSED.")

    # --- Step 3: Clean and Standardize Text Data ---
    logging.info("\nStep 3: Cleaning and standardizing text data...")
    # Apply the text cleaning pipeline. This returns a new, cleaned DataFrame.
    df_cleaned = clean_and_standardize_text_data(df)
    logging.info("Text data has been standardized (NaNs -> '', UTF-8, null bytes removed).")

    # --- Final Output ---
    logging.info("\n>>> Data quality assessment and cleansing suite completed successfully. <<<")
    # Return the final, cleaned DataFrame.
    return df_cleaned

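One subtlety in the return check is how missing returns interact with a threshold comparison: in pandas, comparing `NaN` against a number yields `False`, not `NaN`, so the first observation of each ticker needs explicit handling. A minimal sketch on toy data follows. The tickers, dates, and prices are made up for illustration.

```python
import pandas as pd

# Toy panel: three dates, two hypothetical tickers, (date, ticker) MultiIndex.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]), ["AAA", "BBB"]],
    names=["date", "ticker"],
)
adj_close = pd.Series([100.0, 50.0, 101.0, 80.0, 102.0, 81.0], index=index)

# Per-ticker percentage returns; the first row of each ticker is NaN.
returns = adj_close.groupby(level="ticker").pct_change() * 100
threshold = 50.0

# The comparison already turns NaN into False, so fillna has nothing to fill.
within_threshold = returns.abs() <= threshold
naive_mask = within_threshold.fillna(True)

# Treat undefined returns as passing by OR-ing with an explicit NaN mask.
robust_mask = within_threshold | returns.isna()

print("rows flagged (naive): ", int((~naive_mask).sum()))
print("rows flagged (robust):", int((~robust_mask).sum()))
```

In this toy panel only the BBB jump from 50 to 80 is a genuine outlier; the naive mask additionally flags the first day of each ticker.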

In [None]:
# Task 4: Temporal Regime Assignment

def _create_regime_classifier(
    regime_definitions: List[Dict[str, str]]
) -> callable:
    """
    Internal helper to create a regime classification function.

    This helper pre-processes the regime definitions from the config once,
    into a structure that is cheap to scan, and returns a closure that
    classifies a single timestamp.

    Args:
        regime_definitions: The list of regime definition dictionaries from the
                            study configuration.

    Returns:
        A callable function that takes a pandas Timestamp and returns the
        corresponding regime name as a string, or None if it's outside all regimes.
    """
    # --- Pre-processing for Efficiency ---
    # Convert the list of dicts to a DataFrame and parse dates to timezone-aware UTC.
    # This is done once, making the returned classifier function highly efficient.
    processed_regimes = pd.DataFrame(regime_definitions)
    processed_regimes['start_date'] = pd.to_datetime(
        processed_regimes['start_date'], utc=True
    )
    processed_regimes['end_date'] = pd.to_datetime(
        processed_regimes['end_date'], utc=True
    )

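    # Convert to a list of (name, start, end) tuples for fast iteration.
    # The three columns are selected explicitly so that extra config keys or a
    # different key order cannot break the tuple unpacking in `classify_date`.
    regime_tuples = list(
        processed_regimes[['regime_name', 'start_date', 'end_date']].itertuples(
            index=False, name=None
        )
    )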

    def classify_date(ts: pd.Timestamp) -> Optional[str]:
        """Classifies a single timestamp into a regime."""
        # --- Timezone Canonicalization ---
        # Ensure the input timestamp is timezone-aware (UTC) for correct comparison.
        if ts.tzinfo is None:
            # If timezone-naive, localize to UTC.
            ts_utc = ts.tz_localize('utc')
        else:
            # If already timezone-aware, convert to UTC.
            ts_utc = ts.tz_convert('utc')

        # --- Linear Scan for Classification ---
        # For a small number of regimes (like 4), a linear scan is fast enough.
        # The boundary condition is inclusive: start_date <= ts <= end_date.
        for name, start, end in regime_tuples:
            if start <= ts_utc <= end:
                # Return the name of the first matching regime.
                return name

        # If no regime matches, return None.
        return None

    # Return the configured classification function.
    return classify_date


def assign_regime_labels(
    df: pd.DataFrame,
    regime_definitions: List[Dict[str, str]]
) -> pd.DataFrame:
    """
    Assigns a 'regime' label to each row and filters out non-regime data.

    This function applies a classification function to the DataFrame's date
    index to label each row with its corresponding macroeconomic regime. It then
    removes any rows that do not fall within one of the defined regimes.

    Args:
        df: The input DataFrame with a DatetimeIndex at level 0.
        regime_definitions: The list of regime definition dictionaries.

    Returns:
        A new DataFrame with an added 'regime' column, containing only the
        data within the specified regimes. The 'regime' column is set to a
        memory-efficient categorical dtype.
    """
    # --- Input Validation ---
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")
    if not isinstance(df.index, pd.MultiIndex) or not isinstance(df.index.levels[0], pd.DatetimeIndex):
        raise ValueError("DataFrame must have a MultiIndex with a DatetimeIndex at level 0.")

    # --- Operate on a copy to avoid side effects ---
    df_processed = df.copy()

    # --- Step 1: Create the Classification Function ---
    # This helper pre-processes the definitions for efficient application.
    regime_classifier = _create_regime_classifier(regime_definitions)

    # --- Step 2: Apply Regime Labels via Index Mapping ---
    # `.map()` calls the classifier once per entry of the date index level.
    # This avoids building a full row object per call, as row-wise `.apply()` would.
    date_index = df_processed.index.get_level_values('date')
    regime_labels = date_index.map(regime_classifier)

    # Assign the generated labels to a new 'regime' column.
    df_processed['regime'] = regime_labels

    # --- Step 3: Filter Out Data Outside Regimes ---
    # Keep only the rows where a regime label was successfully assigned.
    rows_before = len(df_processed)
    # The explicit copy avoids a SettingWithCopyWarning in Step 4.
    df_processed = df_processed[df_processed['regime'].notna()].copy()
    rows_after = len(df_processed)
    logging.info(
        f"Filtered out {rows_before - rows_after} rows that are outside "
        "any defined regime."
    )

    # --- Step 4: Optimize Memory Usage ---
    # Convert the 'regime' column to a categorical dtype.
    # This is highly memory-efficient and can speed up subsequent group-by operations.
    df_processed['regime'] = df_processed['regime'].astype('category')

    # Return the processed DataFrame.
    return df_processed


def validate_regime_assignment(
    df_regimes: pd.DataFrame,
    regime_definitions: List[Dict[str, str]]
) -> Tuple[bool, List[str]]:
    """
    Validates the outcome of the regime assignment process.

    This function performs a post-assignment audit to ensure that the data
    has been correctly partitioned. It verifies:
    1.  All defined regimes are present in the resulting DataFrame.
    2.  The observed date ranges for each regime in the data are consistent
        with the configuration.
    3.  Each regime contains a sufficient number of samples for analysis.

    Args:
        df_regimes: The DataFrame after regime labels have been assigned.
        regime_definitions: The original list of regime definition dictionaries.

    Returns:
        A tuple containing:
        - A boolean, True if the assignment is valid, False otherwise.
        - A list of strings, containing detailed descriptions of any validation failures.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

    # --- Pre-process config for easy lookup ---
    config_df = pd.DataFrame(regime_definitions)
    config_df['start_date'] = pd.to_datetime(config_df['start_date'], utc=True)
    config_df['end_date'] = pd.to_datetime(config_df['end_date'], utc=True)
    config_df = config_df.set_index('regime_name')

    # --- Step 1: Validate Presence of All Regimes ---
    # Get the set of regime names from the config and the data.
    expected_regimes = set(config_df.index)
    found_regimes = set(df_regimes['regime'].unique())

    # Check that the regimes in the data match the config exactly.
    if expected_regimes != found_regimes:
        missing = expected_regimes - found_regimes
        unexpected = found_regimes - expected_regimes
        if missing:
            errors.append(f"Regimes defined in config but not found in data: {missing}")
        if unexpected:
            errors.append(f"Regimes found in data but not defined in config: {unexpected}")
        # Return early on a mismatch, as further checks are invalid.
        return False, errors

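    # --- Step 2: Validate Temporal Boundaries and Sample Counts per Regime ---
    # 'date' is an index level, not a column, so named aggregation cannot
    # reference it directly. Build a small frame of (regime, date) pairs, with
    # dates canonicalized to UTC to match the timezone-aware config boundaries.
    dates = df_regimes.index.get_level_values('date')
    dates_utc = dates.tz_localize('UTC') if dates.tz is None else dates.tz_convert('UTC')
    regime_dates = pd.DataFrame({
        'regime': df_regimes['regime'].to_numpy(),
        'date': dates_utc
    })
    # Use `.agg()` to compute all required stats per regime in one call.
    regime_stats = regime_dates.groupby('regime').agg(
        observed_start=('date', 'min'),
        observed_end=('date', 'max'),
        sample_count=('date', 'size')
    )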

    # Iterate through the calculated stats for each regime.
    for regime_name, stats in regime_stats.iterrows():
        # Retrieve the configured boundaries for the current regime.
        config_start = config_df.loc[regime_name, 'start_date']
        config_end = config_df.loc[regime_name, 'end_date']

        # Check if the observed date range is within the configured boundaries.
        if not (stats['observed_start'] >= config_start and stats['observed_end'] <= config_end):
            errors.append(
                f"Regime '{regime_name}' has data outside its defined boundaries. "
                f"Expected: [{config_start.date()}-{config_end.date()}], "
                f"Found: [{stats['observed_start'].date()}-{stats['observed_end'].date()}]."
            )

        # Check for sufficient sample size (e.g., minimum of 100 observations).
        min_samples = 100
        if stats['sample_count'] < min_samples:
            # This is a warning rather than a fatal error, but important to note.
            logging.warning(
                f"Regime '{regime_name}' has a low sample count "
                f"({stats['sample_count']}), which may affect model robustness."
            )

    # --- Final Validation Result ---
    is_valid = not errors
    return is_valid, errors


def run_regime_assignment_suite(
    df: pd.DataFrame,
    study_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Orchestrates the full temporal regime assignment and validation pipeline.

    This function executes the complete workflow for Task 4. It takes the raw
    DataFrame and the study configuration, assigns regime labels, filters the
    data, validates the result, and returns the final, regime-partitioned DataFrame.

    Args:
        df: The raw input pandas DataFrame.
        study_config: The complete study configuration dictionary.

    Returns:
        A new DataFrame containing only data from the defined regimes, with a
        validated 'regime' column.

    Raises:
        ValueError: If the post-assignment validation fails.
    """
    logging.info("--- Running Task 4: Temporal Regime Assignment Suite ---")

    # Retrieve regime definitions from the configuration.
    regime_definitions = study_config['experimental_design']['regime_definitions']

    # --- Step 1 & 2: Assign Regime Labels and Filter Data ---
    logging.info("\nStep 1 & 2: Assigning regime labels and filtering data...")
    df_with_regimes = assign_regime_labels(df, regime_definitions)
    logging.info("Regime assignment and filtering complete.")
    logging.info(f"DataFrame shape after regime assignment: {df_with_regimes.shape}")
    logging.info("Sample counts per regime:\n" + str(df_with_regimes['regime'].value_counts()))

    # --- Step 3: Validate the Assignment ---
    logging.info("\nStep 3: Validating regime assignment...")
    is_valid, errors = validate_regime_assignment(df_with_regimes, regime_definitions)

    # Check the validation outcome.
    if not is_valid:
        # If validation fails, construct a detailed error report and raise an exception.
        error_report = "\n".join([f"- {error}" for error in errors])
        raise ValueError(f"Post-assignment validation failed:\n{error_report}")
    else:
        logging.info("Regime assignment validation PASSED.")

    # --- Final Output ---
    logging.info("\n>>> Temporal regime assignment suite completed successfully. <<<")
    # Return the final, validated, and regime-partitioned DataFrame.
    return df_with_regimes

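The regime lookup above relies on two conventions: timestamps are canonicalized to UTC before comparison, and both boundaries are inclusive. A minimal, self-contained sketch of that lookup follows. The regime names and dates are hypothetical placeholders, and `to_utc` and `classify` are illustrative helpers rather than functions from this project. Only the `regime_name`, `start_date`, and `end_date` keys mirror the configuration used above.

```python
from typing import Dict, List, Optional

import pandas as pd

# Hypothetical regime definitions, for illustration only.
regimes: List[Dict[str, str]] = [
    {"regime_name": "regime_a", "start_date": "2019-01-01", "end_date": "2019-12-31"},
    {"regime_name": "regime_b", "start_date": "2020-01-01", "end_date": "2020-06-30"},
]


def to_utc(value) -> pd.Timestamp:
    """Return a UTC timestamp; naive inputs are assumed to already be in UTC."""
    ts = pd.Timestamp(value)
    return ts.tz_localize("UTC") if ts.tzinfo is None else ts.tz_convert("UTC")


def classify(value, regime_definitions: List[Dict[str, str]]) -> Optional[str]:
    """Return the first regime whose inclusive [start, end] range contains the date."""
    ts_utc = to_utc(value)
    for regime in regime_definitions:
        if to_utc(regime["start_date"]) <= ts_utc <= to_utc(regime["end_date"]):
            return regime["regime_name"]
    return None


dates = pd.to_datetime(
    ["2018-12-31", "2019-01-01", "2019-12-31", "2020-03-15", "2020-07-01"]
)
labels = [classify(d, regimes) for d in dates]
print(labels)
```

Because an end date such as `2019-12-31` parses to midnight, the inclusive upper bound covers daily (midnight) timestamps only. Intraday timestamps on the final day would fall outside the regime unless the end boundary is extended.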

In [None]:
# Task 5: Chronological Data Splitting Within Regimes

# Define a type alias for clarity and maintainability.
DataSplits = Dict[str, Dict[str, pd.DataFrame]]


def perform_chronological_split(
    regime_df: pd.DataFrame,
    split_ratios: Dict[str, float]
) -> Dict[str, pd.DataFrame]:
    """
    Splits a single regime's DataFrame into chronological train/val/test sets.

    This function is the core of the temporal splitting logic. It ensures that
    the data is partitioned without lookahead bias by first sorting the data
    by date and then slicing it based on the provided ratios.

    Args:
        regime_df: A DataFrame containing data for a single, isolated regime.
        split_ratios: A dictionary specifying the ratios for 'training',
                      'validation', and 'testing'.

    Returns:
        A dictionary containing the 'training', 'validation', and 'testing'
        DataFrames.

    Raises:
        ValueError: If the input DataFrame is too small to be split meaningfully.
    """
    # --- Input Validation ---
    min_samples_for_split = 10
    if len(regime_df) < min_samples_for_split:
        raise ValueError(
            f"Cannot split regime data with fewer than {min_samples_for_split} "
            f"samples. Found: {len(regime_df)}."
        )

    # --- Step 1: Ensure Chronological Order ---
    # Sort the DataFrame by the 'date' level of the index. This is non-negotiable
    # for preventing lookahead bias. `sort_index` returns a new, sorted DataFrame.
    sorted_df = regime_df.sort_index(level='date')
    n_samples = len(sorted_df)

    # --- Step 2: Calculate Split Boundaries on Unique Dates ---
    # With several tickers per date, slicing by row position can place rows from
    # one date in two splits, which the strict temporal-separation check in
    # `validate_split_quality` rejects. Boundaries are therefore computed over
    # the unique dates, so every date belongs to exactly one split.
    train_ratio = split_ratios['training']
    val_ratio = split_ratios['validation']
    row_dates = sorted_df.index.get_level_values('date')
    unique_dates = row_dates.unique()
    n_dates = len(unique_dates)

    # Use math.floor to get integer positions in the sequence of unique dates.
    train_end_idx = math.floor(train_ratio * n_dates)
    val_end_idx = train_end_idx + math.floor(val_ratio * n_dates)

    # --- Step 3: Slice by Date Membership ---
    # Create deep copies to ensure the resulting splits are independent objects.
    train_df = sorted_df[row_dates.isin(unique_dates[:train_end_idx])].copy(deep=True)
    val_df = sorted_df[row_dates.isin(unique_dates[train_end_idx:val_end_idx])].copy(deep=True)
    test_df = sorted_df[row_dates.isin(unique_dates[val_end_idx:])].copy(deep=True)

    # Log the size of each created split for auditability.
    logging.info(
        f"Split {n_samples} samples -> "
        f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}"
    )

    # Return the dictionary of data splits.
    return {'training': train_df, 'validation': val_df, 'testing': test_df}


def create_regime_data_subsets(
    df_regimes: pd.DataFrame,
    split_ratios: Dict[str, float]
) -> DataSplits:
    """
    Orchestrates the splitting of the main DataFrame into 12 subsets.

    This function iterates through each unique regime found in the data,
    isolates the data for that regime, and then applies the chronological
    splitting logic to generate train, validation, and test sets.

    Args:
        df_regimes: The DataFrame after regime labels have been assigned.
        split_ratios: A dictionary specifying the split ratios.

    Returns:
        A nested dictionary of the structure:
        {regime_name: {'training': df, 'validation': df, 'testing': df}}
    """
    # --- Input Validation ---
    if 'regime' not in df_regimes.columns:
        raise ValueError("Input DataFrame must contain a 'regime' column.")

    # Get the list of unique regimes to process.
    regimes = df_regimes['regime'].unique()

    # Initialize the nested dictionary to hold all 12 data subsets.
    all_splits: DataSplits = {}

    # --- Iterate and Split Each Regime ---
    for regime_name in regimes:
        logging.info(f"--- Processing and splitting regime: {regime_name} ---")
        # Isolate the DataFrame for the current regime.
        regime_specific_df = df_regimes[df_regimes['regime'] == regime_name]

        # Delegate the actual splitting to the specialized function.
        regime_splits = perform_chronological_split(regime_specific_df, split_ratios)

        # Store the resulting splits in the main dictionary.
        all_splits[regime_name] = regime_splits

    return all_splits


def validate_split_quality(
    data_splits: DataSplits
) -> Tuple[bool, List[str]]:
    """
    Performs a comprehensive quality audit on the generated data splits.

    This is a critical validation step to ensure the integrity of the
    experimental setup. It verifies:
    1.  Absolute temporal separation between train/val/test sets within each regime.
    2.  No new tickers appear in validation or test sets that were not in the
        training set of the same regime.
    3.  Each split is non-empty.

    Args:
        data_splits: The nested dictionary of data splits.

    Returns:
        A tuple containing:
        - A boolean, True if all quality checks pass, False otherwise.
        - A list of strings, containing detailed descriptions of any validation failures.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

    # Iterate through each regime and its corresponding splits.
    for regime_name, splits in data_splits.items():
        train_df = splits['training']
        val_df = splits['validation']
        test_df = splits['testing']

        # --- Step 1: Check for Empty Splits ---
        if val_df.empty:
            # A warning is more appropriate for val/test as they can be small.
            logging.warning(f"Regime '{regime_name}': Validation set is empty.")
        if test_df.empty:
            logging.warning(f"Regime '{regime_name}': Test set is empty.")
        if train_df.empty:
            errors.append(f"Regime '{regime_name}': Training set is empty.")
            # If training set is empty, further checks are meaningless.
            continue

        # --- Step 2: Validate Temporal Separation (No Lookahead Bias) ---
        # Get the last date in training and first date in validation.
        last_train_date = train_df.index.get_level_values('date').max()

        if not val_df.empty:
            first_val_date = val_df.index.get_level_values('date').min()
            # The last training day must be strictly before the first validation day.
            if not last_train_date < first_val_date:
                errors.append(
                    f"Regime '{regime_name}': Temporal overlap detected between "
                    f"training (ends {last_train_date.date()}) and validation "
                    f"(starts {first_val_date.date()})."
                )

        if not val_df.empty and not test_df.empty:
            last_val_date = val_df.index.get_level_values('date').max()
            first_test_date = test_df.index.get_level_values('date').min()
            # The last validation day must be strictly before the first test day.
            if not last_val_date < first_test_date:
                errors.append(
                    f"Regime '{regime_name}': Temporal overlap detected between "
                    f"validation (ends {last_val_date.date()}) and test "
                    f"(starts {first_test_date.date()})."
                )

        # --- Step 3: Validate Ticker Representation ---
        # A model should not be evaluated on tickers it has never seen.
        train_tickers = set(train_df.index.get_level_values('ticker').unique())

        if not val_df.empty:
            val_tickers = set(val_df.index.get_level_values('ticker').unique())
            # Check for tickers in validation that are not in training.
            unseen_in_val = val_tickers - train_tickers
            if unseen_in_val:
                errors.append(
                    f"Regime '{regime_name}': Unseen tickers found in validation set: "
                    f"{unseen_in_val}."
                )

        if not test_df.empty:
            test_tickers = set(test_df.index.get_level_values('ticker').unique())
            # Check for tickers in test that are not in training.
            unseen_in_test = test_tickers - train_tickers
            if unseen_in_test:
                errors.append(
                    f"Regime '{regime_name}': Unseen tickers found in test set: "
                    f"{unseen_in_test}."
                )

    # --- Final Validation Result ---
    is_valid = not errors
    return is_valid, errors


def run_chronological_splitting_suite(
    df_regimes: pd.DataFrame,
    study_config: Dict[str, Any]
) -> DataSplits:
    """
    Orchestrates the full chronological data splitting and validation pipeline.

    This function executes the complete workflow for Task 5. It takes the
    regime-partitioned DataFrame, splits it into 12 train/val/test subsets,
    performs a rigorous quality audit on the splits, and returns the final
    structured data.

    Args:
        df_regimes: The DataFrame after regime labels have been assigned.
        study_config: The complete study configuration dictionary.

    Returns:
        A nested dictionary containing the 12 validated data subsets.

    Raises:
        ValueError: If the post-splitting validation fails.
    """
    logging.info("--- Running Task 5: Chronological Data Splitting Suite ---")

    # Retrieve split ratios from the configuration.
    split_ratios = study_config['experimental_design']['data_split_ratios']

    # --- Step 1 & 2: Create Regime-Specific Data Subsets ---
    logging.info("\nStep 1 & 2: Creating 12 chronological data subsets (4 regimes x 3 splits)...")
    data_splits = create_regime_data_subsets(df_regimes, split_ratios)
    logging.info("Data splitting complete.")

    # --- Step 3: Validate Split Quality ---
    logging.info("\nStep 3: Validating split quality (temporal separation, ticker consistency)...")
    is_valid, errors = validate_split_quality(data_splits)

    # Check the validation outcome.
    if not is_valid:
        # If validation fails, construct a detailed error report and raise an exception.
        error_report = "\n".join([f"- {error}" for error in errors])
        raise ValueError(f"Data split quality validation failed:\n{error_report}")
    else:
        logging.info("Data split quality validation PASSED.")

    # --- Final Output ---
    logging.info("\n>>> Chronological data splitting suite completed successfully. <<<")
    # Return the final, validated data splits.
    return data_splits
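
The splitting and ticker checks above can be illustrated on a toy panel. The following is only a sketch: `chronological_split` is a hypothetical helper (not the project's `create_regime_data_subsets`), and the 70/15/15 ratios and the `date` index level name are assumptions made for the example.

```python
import pandas as pd


def chronological_split(df, ratios=(0.70, 0.15, 0.15)):
    """Hypothetical helper: cut the sorted unique dates into three blocks."""
    dates = df.index.get_level_values("date").unique().sort_values()
    train_end = round(len(dates) * ratios[0])
    val_end = train_end + round(len(dates) * ratios[1])
    blocks = (dates[:train_end], dates[train_end:val_end], dates[val_end:])
    row_dates = df.index.get_level_values("date")
    # Rows are assigned by date only, so no shuffling ever happens.
    return tuple(df[row_dates.isin(block)] for block in blocks)


# Toy panel: 20 business days x 2 tickers with a (date, ticker) MultiIndex.
idx = pd.MultiIndex.from_product(
    [pd.bdate_range("2020-01-01", periods=20), ["AAA", "BBB"]],
    names=["date", "ticker"],
)
toy = pd.DataFrame({"target_return": 0.0}, index=idx)

train, val, test = chronological_split(toy)

# Temporal separation: every training date precedes every validation date, etc.
last_train = train.index.get_level_values("date").max()
first_val = val.index.get_level_values("date").min()
last_val = val.index.get_level_values("date").max()
first_test = test.index.get_level_values("date").min()

# Ticker consistency: tickers in test that never appear in training.
unseen_in_test = set(test.index.get_level_values("ticker")) - set(
    train.index.get_level_values("ticker")
)

print(last_train < first_val <= last_val < first_test, unseen_in_test)
```

Splitting on unique dates rather than on row positions keeps all tickers for a given day in the same subset, which is one simple way to avoid same-day leakage between splits.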


# Task 6: Cross-Regime Data Consistency Validation

def analyze_ticker_consistency(
    df_regimes: pd.DataFrame
) -> pd.DataFrame:
    """
    Analyzes the consistency of the ticker universe across different regimes.

    This function assesses the stability of the asset pool over time by
    reporting how many unique tickers are present in each regime and what
    percentage of the total historical universe this represents.

    Args:
        df_regimes: The DataFrame after regime labels have been assigned.

    Returns:
        A pandas DataFrame indexed by regime name, detailing the ticker count,
        universe coverage, and a list of tickers absent from that regime.
    """
    # --- Input Validation ---
    if 'regime' not in df_regimes.columns:
        raise ValueError("Input DataFrame must contain a 'regime' column.")
    if not isinstance(df_regimes.index, pd.MultiIndex) or 'ticker' not in df_regimes.index.names:
        raise ValueError("DataFrame must have a MultiIndex with a 'ticker' level.")

    # --- Step 1: Define the Master Universe of Tickers ---
    # This is the set of all unique tickers ever observed in the dataset.
    master_universe = set(df_regimes.index.get_level_values('ticker').unique())
    n_universe = len(master_universe)

    # --- Step 2: Calculate Per-Regime Ticker Inventories ---
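    # Group by regime and collect the unique tickers within each group.
    # A dict comprehension avoids relying on groupby.apply returning Python sets.
    regime_tickers = {
        regime_name: set(group.index.get_level_values('ticker').unique())
        for regime_name, group in df_regimes.groupby('regime')
    }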

    # --- Step 3: Generate the Consistency Report ---
    # Create a report by iterating through the regime inventories.
    report_data = []
    for regime_name, tickers in regime_tickers.items():
        # Calculate the number of tickers present in the regime.
        ticker_count = len(tickers)
        # Calculate the percentage of the master universe covered.
        coverage_pct = (ticker_count / n_universe) * 100 if n_universe > 0 else 0
        # Identify which tickers from the master universe are missing.
        missing_tickers = sorted(list(master_universe - tickers))

        # Append the structured data for this regime to our report list.
        report_data.append({
            'regime_name': regime_name,
            'ticker_count': ticker_count,
            'universe_coverage_pct': coverage_pct,
            'missing_tickers': missing_tickers
        })

    # Convert the list of dictionaries to a DataFrame and set the index.
    report_df = pd.DataFrame(report_data).set_index('regime_name')
    return report_df
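
As a worked illustration of the coverage arithmetic, the sketch below rebuilds the same report on a five-row toy frame. The regime labels (`calm`, `crisis`), tickers, and dates are invented for the example.

```python
import pandas as pd

# Toy data: 4 tickers overall, unevenly present across two made-up regimes.
idx = pd.MultiIndex.from_tuples(
    [
        ("2019-01-02", "AAA"),
        ("2019-01-02", "BBB"),
        ("2020-04-01", "AAA"),
        ("2020-04-01", "CCC"),
        ("2020-04-02", "DDD"),
    ],
    names=["date", "ticker"],
)
toy = pd.DataFrame(
    {"regime": ["calm", "calm", "crisis", "crisis", "crisis"]}, index=idx
)

# Master universe: every ticker ever observed.
universe = set(toy.index.get_level_values("ticker"))

rows = []
for regime_name, group in toy.groupby("regime"):
    present = set(group.index.get_level_values("ticker"))
    rows.append(
        {
            "regime_name": regime_name,
            "ticker_count": len(present),
            "universe_coverage_pct": 100 * len(present) / len(universe),
            "missing_tickers": sorted(universe - present),
        }
    )

report = pd.DataFrame(rows).set_index("regime_name")
print(report)
```

Here `calm` covers 2 of 4 tickers (50%) and `crisis` covers 3 of 4 (75%); a low coverage figure signals that cross-regime comparisons are being made on different asset pools.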


def analyze_return_distribution_stability(
    df_regimes: pd.DataFrame
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Analyzes the stability of the 'target_return' distribution across regimes.

    This function provides a quantitative justification for regime-based
    modeling by showing how the statistical properties of stock returns change
    across different macroeconomic periods. It calculates:
    1.  The first four statistical moments (mean, std, skew, kurtosis) for each regime.
    2.  A matrix of p-values from pairwise Kolmogorov-Smirnov tests to formally
        assess if the return distributions are statistically different.

    Args:
        df_regimes: The DataFrame after regime labels have been assigned.

    Returns:
        A tuple containing:
        - A DataFrame of statistical moments for each regime.
        - A DataFrame (p-value matrix) from pairwise KS tests between regimes.
    """
    # --- Input Validation ---
    if 'regime' not in df_regimes.columns or 'target_return' not in df_regimes.columns:
        raise ValueError("Input DataFrame must contain 'regime' and 'target_return' columns.")

    # --- Step 1: Calculate Statistical Moments per Regime ---
    # Use .agg() for a single, efficient pass to compute all required statistics.
    # We explicitly use scipy's functions for skew and kurtosis for consistency.
    moment_stats = df_regimes.groupby('regime')['target_return'].agg(
        mean='mean',
        std='std',
        skew=lambda x: stats.skew(x.dropna()),
        kurtosis=lambda x: stats.kurtosis(x.dropna()) # Fisher's definition (normal=0)
    )

    # --- Step 2: Perform Pairwise Kolmogorov-Smirnov Tests ---
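    # Get the unique regime names for pairwise comparisons, skipping unlabeled rows
    # so the matrix matches the groups used for the moment statistics above.
    regimes = df_regimes['regime'].dropna().unique().tolist()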
    # Initialize a DataFrame to store the p-values of the KS tests.
    ks_matrix = pd.DataFrame(np.nan, index=regimes, columns=regimes)

    # Iterate through all unique pairs of regimes.
    for regime1, regime2 in combinations(regimes, 2):
        # Extract the non-NaN return series for each regime in the pair.
        returns1 = df_regimes[df_regimes['regime'] == regime1]['target_return'].dropna()
        returns2 = df_regimes[df_regimes['regime'] == regime2]['target_return'].dropna()

        # Perform the two-sample KS test.
        ks_statistic, p_value = stats.ks_2samp(returns1, returns2)

        # Store the p-value in the matrix (it's symmetric).
        ks_matrix.loc[regime1, regime2] = p_value
        ks_matrix.loc[regime2, regime1] = p_value

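    # Set the diagonal to 1.0, as a distribution is identical to itself.
    # Assign through .loc rather than mutating ks_matrix.values in place, since
    # .values may be a read-only view or a copy depending on the pandas version.
    for regime in regimes:
        ks_matrix.loc[regime, regime] = 1.0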

    return moment_stats, ks_matrix
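
To make the interpretation of the KS p-value matrix concrete, here is a small sketch on synthetic returns. The two samples, their volatilities (1% and 3%), and the seed are assumptions chosen purely for illustration; they are not results from the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic daily returns: same mean, but the "crisis" sample is 3x more volatile.
calm = rng.normal(loc=0.0, scale=0.01, size=2000)
crisis = rng.normal(loc=0.0, scale=0.03, size=2000)

# Two-sample KS test between different regimes and of a sample against itself.
stat_diff, p_diff = stats.ks_2samp(calm, crisis)
stat_same, p_same = stats.ks_2samp(calm, calm)

# Fisher (excess) kurtosis is roughly 0 for a normal sample.
excess_kurtosis = stats.kurtosis(calm)

print(f"calm vs crisis: D={stat_diff:.3f}, p={p_diff:.3g}")
print(f"calm vs calm:   D={stat_same:.3f}, p={p_same:.3g}")
print(f"excess kurtosis (calm): {excess_kurtosis:.3f}")
```

With samples as large as a multi-year panel, the KS test can reject for very small differences, so it may be worth reading the KS statistic `D` alongside the p-value as a rough effect size.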


def analyze_news_coverage_consistency(
    df_regimes: pd.DataFrame
) -> pd.DataFrame:
    """
    Analyzes the consistency of news coverage across different regimes.

    This function quantifies the volume and availability of the text data,
    which is crucial for understanding the reliability of NLP models.

    It calculates for each regime:
    1.  News Coverage %: The percentage of observations that have non-empty text.
    2.  Avg. Text Length: The average character length of the news articles,
        calculated only on non-empty entries to avoid skew.

    Args:
        df_regimes: The DataFrame after regime labels have been assigned.

    Returns:
        A pandas DataFrame indexed by regime name, detailing the news coverage metrics.
    """
    # --- Input Validation ---
    if 'regime' not in df_regimes.columns or 'aggregated_text' not in df_regimes.columns:
        raise ValueError("Input DataFrame must contain 'regime' and 'aggregated_text' columns.")

    # --- Define Custom Aggregation Functions ---
    # This function calculates the percentage of non-empty strings in a Series.
    # Missing values (NaN/None) are treated as "no news", like empty strings.
    def coverage_percentage(series: pd.Series) -> float:
        if len(series) == 0:
            return 0.0
        non_empty_count = (series.fillna('') != '').sum()
        return (non_empty_count / len(series)) * 100

    # This function calculates the mean length of only the non-empty strings.
    def average_text_length(series: pd.Series) -> float:
        non_empty_series = series[series.fillna('') != '']
        if len(non_empty_series) == 0:
            return 0.0
        return non_empty_series.str.len().mean()

    # --- Step 1: Calculate News Metrics per Regime ---
    # Use .agg() with the custom functions for a clean and efficient calculation.
    news_stats = df_regimes.groupby('regime')['aggregated_text'].agg(
        news_coverage_pct=coverage_percentage,
        avg_text_length_chars=average_text_length
    )

    return news_stats
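
One subtlety in coverage metrics of this kind is how missing values are counted. The sketch below, on a three-element toy Series, shows that a plain `!= ''` comparison counts `None` as "has news", while filling missing values first does not. The sample headline is invented.

```python
import pandas as pd

# One real headline, one empty string, one missing value.
texts = pd.Series(["Fed holds rates", "", None], dtype=object)

# A plain inequality test treats the missing value as non-empty text.
naive_pct = (texts != "").mean() * 100

# Filling missing values first counts only genuine text.
has_text = texts.fillna("") != ""
guarded_pct = has_text.mean() * 100

# Average length is computed over genuine text only.
avg_len = texts[has_text].str.len().mean()

print(f"naive: {naive_pct:.1f}%, guarded: {guarded_pct:.1f}%, avg length: {avg_len}")
```

Which convention is right depends on how upstream aggregation encodes "no articles"; if empty strings are guaranteed, both computations agree.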


def run_cross_regime_consistency_suite(
    df_regimes: pd.DataFrame
) -> None:
    """
    Orchestrates the full suite of cross-regime data consistency checks.

    This function executes all validation steps from Task 6 and prints a
    comprehensive diagnostic report to the console, justifying the need for
    a regime-based modeling approach.

    Args:
        df_regimes: The DataFrame after regime labels have been assigned.
    """
    logging.info("--- Running Task 6: Cross-Regime Data Consistency Validation Suite ---")

    # --- Step 1: Analyze Ticker Consistency ---
    logging.info("\nStep 1: Analyzing Ticker Universe Consistency...")
    ticker_report = analyze_ticker_consistency(df_regimes)
    logging.info("Ticker Consistency Report:\n" + ticker_report.to_string(
        formatters={'universe_coverage_pct': '{:.2f}%'.format}
    ))

    # --- Step 2: Analyze Target Return Distribution Stability ---
    logging.info("\nStep 2: Analyzing Target Return Distribution Stability...")
    moment_stats, ks_matrix = analyze_return_distribution_stability(df_regimes)
    logging.info("Statistical Moments of Target Return per Regime:\n" + moment_stats.to_string(float_format='{:.4f}'.format))
    logging.info("\nPairwise KS-Test p-values (H0: Distributions are identical):\n" + ks_matrix.to_string(float_format='{:.4f}'.format))
    # Interpretation note for the user.
    logging.info("(Note: Low p-values, e.g., < 0.05, suggest distributions are significantly different)")

    # --- Step 3: Analyze News Coverage Consistency ---
    logging.info("\nStep 3: Analyzing News Coverage Consistency...")
    news_report = analyze_news_coverage_consistency(df_regimes)
    logging.info("News Coverage Report:\n" + news_report.to_string(
        formatters={
            'news_coverage_pct': '{:.2f}%'.format,
            'avg_text_length_chars': '{:.1f}'.format
        }
    ))

    logging.info("\n>>> Cross-regime consistency validation suite completed successfully. <<<")


# Task 7: TF-IDF Vectorization Implementation

def create_and_fit_tfidf_vectorizer(
    data_splits: DataSplits,
    tfidf_params: Dict[str, Any]
) -> Tuple[TfidfVectorizer, TfidfFeatures]:
    """
    Initializes, fits, and transforms text data using a TF-IDF vectorizer.

    This function implements the methodologically critical "fit on train,
    transform all" paradigm. It creates a single, unified vocabulary by
    fitting the vectorizer exclusively on the combined training data from all
    regimes. This ensures a consistent feature space across the entire
    experiment, which is paramount for valid cross-regime model comparison.

    Args:
        data_splits: The nested dictionary of data splits from Task 5.
        tfidf_params: A dictionary of parameters for the TfidfVectorizer,
                      e.g., {'max_features': 2000}.

    Returns:
        A tuple containing:
        - The single, fitted TfidfVectorizer object.
        - A nested dictionary containing the transformed sparse matrices for
          all 12 data subsets, mirroring the input structure.
    """
    # --- Input Validation ---
    if not data_splits:
        raise ValueError("data_splits dictionary cannot be empty.")
    if 'max_features' not in tfidf_params:
        raise ValueError("tfidf_params must contain 'max_features'.")

    # --- Step 1: Configure TF-IDF Vectorizer ---
    # Instantiate the vectorizer with the parameters from the config.
    # The defaults are derived from the paper (Sections 5.1 and 7).
    vectorizer = TfidfVectorizer(
        max_features=tfidf_params.get('max_features', 2000),
        # Coerce to a tuple: configs loaded from JSON/YAML yield a list,
        # while scikit-learn expects a tuple here.
        ngram_range=tuple(tfidf_params.get('ngram_range', (1, 2))),
        min_df=tfidf_params.get('min_df', 2),
        max_df=tfidf_params.get('max_df', 0.95),
        stop_words='english',
        lowercase=True,
        norm='l2',
        use_idf=True,
        smooth_idf=True,
        sublinear_tf=False
    )

    # --- Step 2: Assemble the Global Training Corpus ---
    # Concatenate the 'aggregated_text' from all training splits across all regimes.
    # This is the ONLY data the vectorizer's vocabulary will be based on.
    logging.info("Assembling global training corpus for TF-IDF fitting...")
    training_corpora = [
        splits['training']['aggregated_text']
        for regime, splits in data_splits.items()
        if not splits['training'].empty
    ]
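    if not training_corpora:
        raise ValueError("All training splits are empty; cannot fit the TF-IDF vectorizer.")
    global_training_corpus = pd.concat(training_corpora, ignore_index=True)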
    logging.info(f"Fitting TF-IDF on a corpus of {len(global_training_corpus)} training documents.")

    # --- Step 3: Fit the Vectorizer ---
    # Fit the vectorizer exclusively on the assembled training data.
    vectorizer.fit(global_training_corpus)

    # --- Step 4: Transform All 12 Data Subsets ---
    # Initialize the nested dictionary to store the output sparse matrices.
    tfidf_features: TfidfFeatures = {regime: {} for regime in data_splits}

    # Iterate through every regime and every split.
    for regime_name, splits in data_splits.items():
        for split_name, df in splits.items():
            logging.info(f"Transforming text for {regime_name} - {split_name} split...")
            # Use the already-fitted vectorizer to transform the data.
            # This ensures the same feature mapping is used everywhere.
            if not df.empty:
                # Transform the text and store the resulting sparse matrix.
                transformed_matrix = vectorizer.transform(df['aggregated_text'])
                tfidf_features[regime_name][split_name] = transformed_matrix
            else:
                # Handle empty splits by creating an empty sparse matrix whose width
                # matches the fitted vocabulary (which can be smaller than max_features).
                tfidf_features[regime_name][split_name] = csr_matrix(
                    (0, len(vectorizer.vocabulary_))
                )

    return vectorizer, tfidf_features
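
The "fit on train, transform all" paradigm can be seen in miniature below. This is a sketch with four invented headlines and scikit-learn defaults (unigrams, no stop-word list), not the study's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "rates rise as inflation persists",
    "inflation cools and rates fall",
]
test_docs = [
    "tariffs rattle markets",  # no overlap with the training vocabulary
    "rates fall again",        # partial overlap
]

# The vocabulary and IDF weights come from the training documents only.
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
vectorizer.fit(train_docs)

# Held-out text is projected onto that fixed feature space.
X_test = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(X_test.shape, X_test[0].nnz, X_test[1].nnz)
```

Terms that first appear in a later split (here `tariffs`) are silently dropped, so a held-out document with no known terms becomes an all-zero row. That is the price of a leakage-free vocabulary, and it is why an L2-norm audit has to accept zero-norm rows.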


def validate_tfidf_features(
    tfidf_features: TfidfFeatures,
    data_splits: DataSplits,
    expected_features: int
) -> Tuple[bool, List[str]]:
    """
    Performs a quality and integrity audit on the generated TF-IDF features.

    This function validates the output of the vectorization process to ensure
    that the resulting feature matrices are correctly shaped, normalized, and
    consistent with the input data.

    Args:
        tfidf_features: The nested dictionary of transformed sparse matrices.
        data_splits: The original nested dictionary of data splits.
        expected_features: The expected number of features (e.g., 2000).

    Returns:
        A tuple containing:
        - A boolean, True if all checks pass, False otherwise.
        - A list of strings detailing any validation failures.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

    # Iterate through every generated feature matrix.
    for regime_name, splits in tfidf_features.items():
        for split_name, matrix in splits.items():
            # Get the corresponding original DataFrame for row count comparison.
            original_df = data_splits[regime_name][split_name]

            # --- Step 1: Validate Matrix Shape ---
            # Check number of rows.
            if matrix.shape[0] != len(original_df):
                errors.append(
                    f"[{regime_name}][{split_name}]: Row count mismatch. "
                    f"Expected {len(original_df)}, found {matrix.shape[0]}."
                )
            # Check number of columns (features).
            if matrix.shape[1] != expected_features:
                errors.append(
                    f"[{regime_name}][{split_name}]: Feature count mismatch. "
                    f"Expected {expected_features}, found {matrix.shape[1]}."
                )

            # Skip further checks for empty matrices.
            if matrix.shape[0] == 0:
                continue

            # --- Step 2: Validate L2 Normalization ---
            # Calculate the L2 norm for each row directly on the sparse matrix.
            # This is memory-efficient as it avoids creating a dense matrix.
            row_norms = np.sqrt(matrix.power(2).sum(axis=1))
            # Check if all norms are close to 1.0 (or 0.0 for empty documents).
            # A document with no terms from the vocabulary will have a zero vector.
            if not np.all(np.isclose(row_norms, 1.0) | np.isclose(row_norms, 0.0)):
                errors.append(
                    f"[{regime_name}][{split_name}]: L2 normalization validation failed. "
                    "Not all rows are unit vectors."
                )

            # --- Step 3: Report on Sparsity ---
            # Sparsity = 1 - (non-zero elements / total elements)
            sparsity = 1.0 - (matrix.nnz / (matrix.shape[0] * matrix.shape[1]))
            logging.info(
                f"[{regime_name}][{split_name}]: Matrix shape: {matrix.shape}, "
                f"Sparsity: {sparsity:.2%}"
            )

    is_valid = not errors
    return is_valid, errors
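
The row-norm and sparsity checks can be reproduced on a tiny hand-built matrix. The values below are chosen so the arithmetic is easy to follow (0.6² + 0.8² = 1); they are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three "documents": a unit vector, an all-zero row, and another unit vector.
matrix = csr_matrix(
    np.array(
        [
            [0.6, 0.8, 0.0],
            [0.0, 0.0, 0.0],
            [1.0, 0.0, 0.0],
        ]
    )
)

# Element-wise square and row sum stay sparse; only the (n_rows,) result is dense.
squared_sums = np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel()
row_norms = np.sqrt(squared_sums)

# A row passes if it is a unit vector or an empty (zero) document.
rows_ok = np.isclose(row_norms, 1.0) | np.isclose(row_norms, 0.0)

# Sparsity = 1 - (stored non-zeros / total cells) = 1 - 3/9.
sparsity = 1.0 - matrix.nnz / (matrix.shape[0] * matrix.shape[1])

print(row_norms, rows_ok.all(), f"{sparsity:.2%}")
```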


def run_tfidf_vectorization_suite(
    data_splits: DataSplits,
    study_config: Dict[str, Any]
) -> Tuple[TfidfVectorizer, TfidfFeatures]:
    """
    Orchestrates the full TF-IDF feature engineering and validation pipeline.

    Args:
        data_splits: The nested dictionary of data splits from Task 5.
        study_config: The complete study configuration dictionary.

    Returns:
        A tuple containing the fitted vectorizer and the dictionary of features.

    Raises:
        ValueError: If the post-vectorization validation fails.
    """
    logging.info("--- Running Task 7: TF-IDF Vectorization Suite ---")

    # Retrieve TF-IDF parameters from the configuration.
    tfidf_params = study_config['feature_engineering']['tfidf']

    # --- Step 1 & 2: Configure, Fit, and Transform ---
    logging.info("\nStep 1 & 2: Fitting a single global vectorizer and transforming all splits...")
    vectorizer, tfidf_features = create_and_fit_tfidf_vectorizer(data_splits, tfidf_params)
    logging.info(f"TF-IDF fitting and transformation complete. Vocabulary size: {len(vectorizer.vocabulary_)}")

    # --- Step 3: Validate Feature Quality ---
    logging.info("\nStep 3: Validating TF-IDF feature quality...")
    is_valid, errors = validate_tfidf_features(
        tfidf_features,
        data_splits,
        tfidf_params['max_features']
    )

    if not is_valid:
        error_report = "\n".join([f"- {error}" for error in errors])
        raise ValueError(f"TF-IDF feature validation failed:\n{error_report}")
    else:
        logging.info("TF-IDF feature validation PASSED.")

    logging.info("\n>>> TF-IDF vectorization suite completed successfully. <<<")
    return vectorizer, tfidf_features


# Task 8: Sentence Embedding Extraction

def initialize_sentence_transformer(
    model_identifier: str
) -> SentenceTransformer:
    """
    Initializes a SentenceTransformer model and moves it to the optimal device.

    This function handles the loading of a pre-trained model from the
    HuggingFace Hub, detects GPU availability, and sets the model to
    evaluation mode for deterministic inference.

    Args:
        model_identifier: The HuggingFace identifier for the model,
                          e.g., 'sentence-transformers/all-MiniLM-L6-v2'.

    Returns:
        The initialized SentenceTransformer model object.

    Raises:
        RuntimeError: If the model cannot be loaded due to network issues or
                      an invalid identifier.
    """
    # --- Step 1: Determine Optimal Computing Device ---
    # Check for CUDA-enabled GPU availability for accelerated processing.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    logging.info(f"Initializing sentence transformer on device: '{device}'")

    # --- Step 2: Load the Pre-trained Model ---
    try:
        # Instantiate the model. This will download it from the Hub if not cached.
        model = SentenceTransformer(model_identifier, device=device)

        # --- Step 3: Set Model to Evaluation Mode ---
        # This is a critical step to ensure deterministic outputs by disabling
        # layers like dropout that behave differently during training.
        model.eval()

        logging.info(f"Successfully loaded model '{model_identifier}'.")
        logging.info(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

        return model

    except Exception as e:
        # Catch potential errors (e.g., network, invalid name) and raise a specific
        # exception, chaining the original so its traceback is preserved.
        raise RuntimeError(
            f"Failed to initialize SentenceTransformer model '{model_identifier}'. "
            f"Please check the model name and your network connection. Original error: {e}"
        ) from e


def extract_sentence_embeddings(
    data_splits: DataSplits,
    model: SentenceTransformer,
    batch_size: int = 64
) -> EmbeddingFeatures:
    """
    Generates sentence embeddings for all text data in the data splits.

    This function uses a pre-initialized SentenceTransformer model to encode
    the 'aggregated_text' from all 12 data subsets into dense numerical vectors.
    It leverages efficient batch processing for high throughput.

    Args:
        data_splits: The nested dictionary of data splits from Task 5.
        model: The initialized SentenceTransformer model.
        batch_size: The number of sentences to process in a single batch.

    Returns:
        A nested dictionary containing the NumPy arrays of embeddings,
        mirroring the input structure.
    """
    # Initialize the nested dictionary to store the output embedding arrays.
    embedding_features: EmbeddingFeatures = {regime: {} for regime in data_splits}

    # Iterate through every regime and every split.
    for regime_name, splits in data_splits.items():
        for split_name, df in splits.items():
            logging.info(f"Generating embeddings for {regime_name} - {split_name} split...")

            # Extract the text corpus for the current split.
            corpus = df['aggregated_text'].tolist()

            if not corpus:
                # Handle empty splits by creating an empty array with the correct shape.
                embedding_dim = model.get_sentence_embedding_dimension()
                embeddings = np.empty((0, embedding_dim), dtype=np.float32)
            else:
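                # --- Use model.encode for efficient, batched inference ---
                embeddings = model.encode(
                    corpus,
                    batch_size=batch_size,
                    show_progress_bar=True,  # Provides valuable user feedback.
                    convert_to_numpy=True,   # Directly output as a NumPy array.
                    # Return unit-length vectors so the L2-norm audit in
                    # validate_embedding_features holds for any model.
                    normalize_embeddings=True
                )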

            # Store the resulting embedding matrix.
            embedding_features[regime_name][split_name] = embeddings

    return embedding_features


def validate_embedding_features(
    embedding_features: EmbeddingFeatures,
    data_splits: DataSplits,
    expected_dimension: int
) -> Tuple[bool, List[str]]:
    """
    Performs a quality and integrity audit on the generated sentence embeddings.

    This function validates the output of the embedding process to ensure
    that the resulting matrices are correctly shaped, numerically stable, and
    properly normalized.

    Args:
        embedding_features: The nested dictionary of embedding matrices.
        data_splits: The original nested dictionary of data splits.
        expected_dimension: The expected embedding dimension of the model.

    Returns:
        A tuple containing:
        - A boolean, True if all checks pass, False otherwise.
        - A list of strings detailing any validation failures.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

    # Iterate through every generated embedding matrix.
    for regime_name, splits in embedding_features.items():
        for split_name, embeddings in splits.items():
            original_df = data_splits[regime_name][split_name]

            # --- Step 1: Validate Matrix Shape ---
            if embeddings.shape[0] != len(original_df):
                errors.append(
                    f"[{regime_name}][{split_name}]: Row count mismatch. "
                    f"Expected {len(original_df)}, found {embeddings.shape[0]}."
                )
            if embeddings.shape[1] != expected_dimension and embeddings.shape[0] > 0:
                errors.append(
                    f"[{regime_name}][{split_name}]: Dimension mismatch. "
                    f"Expected {expected_dimension}, found {embeddings.shape[1]}."
                )

            # Skip further checks for empty matrices.
            if embeddings.shape[0] == 0:
                continue

            # --- Step 2: Validate Numerical Integrity ---
            if np.isnan(embeddings).any():
                errors.append(f"[{regime_name}][{split_name}]: Found NaN values in embeddings.")
            if np.isinf(embeddings).any():
                errors.append(f"[{regime_name}][{split_name}]: Found infinite values in embeddings.")

            # --- Step 3: Validate L2 Normalization ---
            # Calculate the L2 norm for each embedding vector.
            norms = np.linalg.norm(embeddings, axis=1)
            # Check if all norms are approximately equal to 1.0.
            if not np.all(np.isclose(norms, 1.0, atol=1e-6)):
                errors.append(
                    f"[{regime_name}][{split_name}]: L2 normalization validation failed. "
                    "Not all embedding vectors are unit vectors."
                )

    is_valid = not errors
    return is_valid, errors
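
The normalization check above assumes the encoder returns unit-length vectors. The NumPy-only sketch below shows what that property means and how unnormalized output would fail the audit. The random matrix stands in for real embeddings, and 384 is the output dimension of `all-MiniLM-L6-v2`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for raw encoder output: 5 vectors of dimension 384, not normalized.
raw = rng.normal(size=(5, 384)).astype(np.float32)

# L2-normalize each row, guarding against division by zero.
norms = np.linalg.norm(raw, axis=1, keepdims=True)
unit = raw / np.clip(norms, 1e-12, None)

# The same check used in the audit: are all row norms approximately 1?
raw_passes = np.all(np.isclose(np.linalg.norm(raw, axis=1), 1.0, atol=1e-6))
unit_passes = np.all(np.isclose(np.linalg.norm(unit, axis=1), 1.0, atol=1e-6))

print(raw_passes, unit_passes)
```

Unit-length embeddings also make dot products equal to cosine similarities, which is convenient if the vectors are later compared across regimes.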


def run_embedding_extraction_suite(
    data_splits: DataSplits,
    study_config: Dict[str, Any]
) -> EmbeddingFeatures:
    """
    Orchestrates the full sentence embedding extraction and validation pipeline.

    Args:
        data_splits: The nested dictionary of data splits from Task 5.
        study_config: The complete study configuration dictionary.

    Returns:
        A nested dictionary containing the 12 validated embedding matrices.

    Raises:
        ValueError: If the post-extraction validation fails.
    """
    logging.info("--- Running Task 8: Sentence Embedding Extraction Suite ---")

    # Retrieve embedding parameters from the configuration.
    embedding_params = study_config['feature_engineering']['sentence_embeddings']
    model_id = embedding_params['model_identifier']
    expected_dim = embedding_params['embedding_dimension']

    # --- Step 1: Initialize the Model ---
    logging.info("\nStep 1: Initializing Sentence Transformer model...")
    model = initialize_sentence_transformer(model_id)

    # --- Step 2: Extract Embeddings ---
    logging.info("\nStep 2: Extracting embeddings for all 12 data splits...")
    # A batch size of 64 is a reasonable default for models of this size.
    embedding_features = extract_sentence_embeddings(data_splits, model, batch_size=64)
    logging.info("Embedding extraction complete.")

    # --- Step 3: Validate Feature Quality ---
    logging.info("\nStep 3: Validating embedding feature quality...")
    is_valid, errors = validate_embedding_features(
        embedding_features,
        data_splits,
        expected_dim
    )

    if not is_valid:
        error_report = "\n".join([f"- {error}" for error in errors])
        raise ValueError(f"Embedding feature validation failed:\n{error_report}")
    else:
        logging.info("Embedding feature validation PASSED.")

    logging.info("\n>>> Sentence embedding extraction suite completed successfully. <<<")
    return embedding_features


# Task 9: Feature Concatenation and Validation

def concatenate_features(
    tfidf_features: TfidfFeatures,
    embedding_features: EmbeddingFeatures
) -> CombinedFeatures:
    """
    Concatenates sparse TF-IDF features and dense sentence embeddings.

    This function synthesizes the two feature types into a single, unified
    feature matrix for each of the 12 data subsets. It verifies row alignment
    before concatenating and returns float32 arrays to limit memory use.

    The concatenation follows the precise formulation:
    x_combined = [v_tfidf || e_sentence]

    Args:
        tfidf_features: Nested dictionary of TF-IDF sparse matrices.
        embedding_features: Nested dictionary of sentence embedding NumPy arrays.

    Returns:
        A new nested dictionary containing the concatenated dense feature
        matrices as NumPy arrays of dtype float32.

    Raises:
        ValueError: If there is a row count mismatch between corresponding
                    TF-IDF and embedding matrices.
        MemoryError: If converting a sparse matrix to dense exceeds available memory.
    """
    # Initialize the nested dictionary to store the output combined matrices.
    combined_features: CombinedFeatures = {regime: {} for regime in tfidf_features}

    # Iterate through every regime and split.
    for regime_name, splits in tfidf_features.items():
        for split_name, tfidf_matrix in splits.items():
            # Retrieve the corresponding embedding matrix.
            embedding_matrix = embedding_features[regime_name][split_name]

            # --- Critical Safety Check: Row Alignment ---
            # Assert that the number of samples is identical before concatenation.
            if tfidf_matrix.shape[0] != embedding_matrix.shape[0]:
                raise ValueError(
                    f"[{regime_name}][{split_name}]: Row count mismatch. "
                    f"TF-IDF has {tfidf_matrix.shape[0]} rows, but embeddings "
                    f"have {embedding_matrix.shape[0]} rows."
                )

            # Handle empty splits gracefully.
            if tfidf_matrix.shape[0] == 0:
                # If empty, create an empty combined matrix with the correct number of columns.
                combined_dim = tfidf_matrix.shape[1] + embedding_matrix.shape[1]
                combined_features[regime_name][split_name] = np.empty(
                    (0, combined_dim), dtype=np.float32
                )
                continue

            # --- Step 1: Convert Sparse to Dense and Concatenate ---
            try:
                # Cast to float32 before densifying so the dense copy is half the
                # size of a float64 one. This is still a potential memory
                # bottleneck for extremely large datasets.
                tfidf_dense = tfidf_matrix.astype(np.float32).toarray()

                # Concatenate the two dense matrices horizontally.
                concatenated_matrix = np.hstack([tfidf_dense, embedding_matrix])

                # --- Step 2: Standardize Data Type ---
                # Ensure float32 output for memory efficiency and GPU compatibility
                # (copy=False skips a redundant copy when the dtype already matches).
                combined_features[regime_name][split_name] = concatenated_matrix.astype(
                    np.float32, copy=False
                )

            except MemoryError as e:
                # Provide a specific, actionable error message if densification fails.
                raise MemoryError(
                    f"[{regime_name}][{split_name}]: Failed to convert sparse TF-IDF "
                    f"matrix of shape {tfidf_matrix.shape} to dense. "
                    "The dataset may be too large for available RAM."
                ) from e

    return combined_features
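
The concatenation `x_combined = [v_tfidf || e_sentence]` can be checked on a toy example. The sketch assumes 2,000 TF-IDF features and 384-dimensional embeddings (giving 2,384 columns); the matrix contents are placeholders.

```python
import numpy as np
from scipy.sparse import csr_matrix

n_docs, n_tfidf, n_embed = 3, 2000, 384

# Sparse TF-IDF block with a single non-zero entry at row 0, column 5.
tfidf = csr_matrix(([1.0], ([0], [5])), shape=(n_docs, n_tfidf), dtype=np.float32)

# Dense embedding block (all ones, so the two blocks are easy to tell apart).
embeddings = np.ones((n_docs, n_embed), dtype=np.float32)

# Densify the sparse block, then stack the two blocks side by side.
combined = np.hstack([tfidf.toarray(), embeddings])

# Columns [0, 2000) hold TF-IDF weights; columns [2000, 2384) hold the embedding.
print(combined.shape, combined.dtype, combined[0, 5], combined[0, n_tfidf])
```

As a rough sizing guide, a dense float32 matrix needs `n_docs * 2384 * 4` bytes, so 100,000 documents would take about 0.95 GB for this one array.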


def validate_combined_features(
    combined_features: CombinedFeatures,
    data_splits: DataSplits,
    expected_dimension: int
) -> Tuple[bool, List[str]]:
    """
    Performs a quality and integrity audit on the combined feature matrices.

    This function validates the output of the concatenation process to ensure
    the final matrices are correctly shaped and numerically stable.

    Args:
        combined_features: The nested dictionary of combined feature matrices.
        data_splits: The original nested dictionary of data splits.
        expected_dimension: The expected total number of features after
                            concatenation (e.g., 2384).

    Returns:
        A tuple containing:
        - A boolean, True if all checks pass, False otherwise.
        - A list of strings detailing any validation failures.
    """
    # Initialize a list to collect detailed error messages.
    errors = []

    # Iterate through every generated combined feature matrix.
    for regime_name, splits in combined_features.items():
        for split_name, matrix in splits.items():
            original_df = data_splits[regime_name][split_name]

            # --- Step 1: Validate Matrix Shape ---
            if matrix.shape[0] != len(original_df):
                errors.append(
                    f"[{regime_name}][{split_name}]: Row count mismatch. "
                    f"Expected {len(original_df)}, found {matrix.shape[0]}."
                )
            if matrix.shape[1] != expected_dimension and matrix.shape[0] > 0:
                errors.append(
                    f"[{regime_name}][{split_name}]: Feature dimension mismatch. "
                    f"Expected {expected_dimension}, found {matrix.shape[1]}."
                )

            # Skip further checks for empty matrices.
            if matrix.shape[0] == 0:
                continue

            # --- Step 2: Validate Numerical Integrity ---
            if np.isnan(matrix).any():
                errors.append(f"[{regime_name}][{split_name}]: Found NaN values in combined features.")
            if np.isinf(matrix).any():
                errors.append(f"[{regime_name}][{split_name}]: Found infinite values in combined features.")

    is_valid = not errors
    return is_valid, errors
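
The checks in `validate_combined_features` can be tried out on a toy matrix. The following is a minimal, NumPy-only sketch for a single matrix; the function name `audit_matrix` is hypothetical, and it merges the NaN and infinity checks into one `np.isfinite` pass, which differs slightly from the two separate messages produced above.

```python
import numpy as np


def audit_matrix(matrix: np.ndarray, expected_rows: int, expected_dim: int) -> list:
    """Collect shape and numerical-integrity problems for one feature matrix."""
    errors = []
    if matrix.shape[0] != expected_rows:
        errors.append(f"Row count mismatch: expected {expected_rows}, found {matrix.shape[0]}.")
    # Empty matrices are exempt from the dimension and value checks.
    if matrix.shape[0] == 0:
        return errors
    if matrix.shape[1] != expected_dim:
        errors.append(f"Feature dimension mismatch: expected {expected_dim}, found {matrix.shape[1]}.")
    # np.isfinite is False for NaN, +inf and -inf, so one pass covers both checks.
    if not np.isfinite(matrix).all():
        errors.append("Found NaN or infinite values.")
    return errors


good = np.zeros((3, 4), dtype=np.float32)
bad = good.copy()
bad[1, 2] = np.nan

print(audit_matrix(good, expected_rows=3, expected_dim=4))
print(audit_matrix(bad, expected_rows=3, expected_dim=5))
```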


def run_feature_concatenation_suite(
    tfidf_features: TfidfFeatures,
    embedding_features: EmbeddingFeatures,
    data_splits: DataSplits,
    study_config: Dict[str, Any]
) -> CombinedFeatures:
    """
    Orchestrates the full feature concatenation and validation pipeline.

    Args:
        tfidf_features: Nested dictionary of TF-IDF sparse matrices.
        embedding_features: Nested dictionary of sentence embedding NumPy arrays.
        data_splits: The original nested dictionary of data splits.
        study_config: The complete study configuration dictionary.

    Returns:
        A nested dictionary containing the 12 validated combined feature matrices.

    Raises:
        ValueError: If the post-concatenation validation fails.
    """
    logging.info("--- Running Task 9: Feature Concatenation and Validation Suite ---")

    # --- Step 1: Concatenate Features ---
    logging.info("\nStep 1: Concatenating TF-IDF and embedding features for all 12 splits...")
    combined_features = concatenate_features(tfidf_features, embedding_features)
    logging.info("Feature concatenation complete.")

    # --- Step 2: Validate Combined Features ---
    logging.info("\nStep 2: Validating combined feature matrices...")
    # Retrieve expected dimension from the configuration for validation.
    expected_dim = study_config['model_training']['architectures']['feature_transformer']['input_size']
    is_valid, errors = validate_combined_features(
        combined_features,
        data_splits,
        expected_dim
    )

    if not is_valid:
        error_report = "\n".join([f"- {error}" for error in errors])
        raise ValueError(f"Combined feature validation failed:\n{error_report}")
    else:
        logging.info("Combined feature validation PASSED.")

    logging.info("\n>>> Feature concatenation suite completed successfully. <<<")
    return combined_features
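
The suite reads the expected dimension through a chain of four dictionary lookups, so a missing key surfaces as a bare `KeyError` naming only the innermost key. A small helper can report the full path instead. This is a sketch under assumptions: `get_nested` is a hypothetical name, and the toy `study_config` below contains only the one entry used here.

```python
from typing import Any, Dict, Sequence


def get_nested(config: Dict[str, Any], path: Sequence[str]) -> Any:
    """Walk a nested dictionary and raise a KeyError that names the failing path."""
    node: Any = config
    for depth, key in enumerate(path):
        if not isinstance(node, dict) or key not in node:
            missing = " -> ".join(path[: depth + 1])
            raise KeyError(f"Missing configuration entry: {missing}")
        node = node[key]
    return node


# Toy configuration mirroring the key path used by the suite.
study_config = {
    "model_training": {
        "architectures": {"feature_transformer": {"input_size": 2384}}
    }
}
expected_dim = get_nested(
    study_config,
    ["model_training", "architectures", "feature_transformer", "input_size"],
)
print(expected_dim)
```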


In [None]:
# Task 10: LSTM Architecture Specification

class LSTMRegressionModel(nn.Module):
    """
    An LSTM-based regression model for predicting stock returns from TF-IDF features.

    This model architecture is designed as specified in the research paper. It
    treats the static TF-IDF vector for a given day as a sequence of length 1.
    The architecture consists of a single LSTM layer followed by a dropout layer
    and a final linear layer for regression output.

    Attributes:
        input_size (int): The dimensionality of the input features (e.g., 2000 for TF-IDF).
        hidden_size (int): The number of features in the LSTM hidden state.
        dropout_rate (float): The dropout probability.
        lstm (nn.LSTM): The core LSTM layer.
        dropout (nn.Dropout): The dropout layer for regularization.
        regressor (nn.Linear): The final linear layer for outputting a single regression value.
    """
    def __init__(self, model_config: Dict[str, Any]):
        """
        Initializes the LSTMRegressionModel with parameters from a configuration dictionary.

        Args:
            model_config: A dictionary containing the model's hyperparameters:
                          'input_size', 'hidden_size', and 'dropout'.

        Raises:
            KeyError: If a required key is missing from the model_config.
        """
        # Call the constructor of the parent class (nn.Module).
        super().__init__()

        # --- Parameter Extraction and Validation ---
        try:
            # The number of features in the input TF-IDF vector.
            self.input_size: int = model_config['input_size']
            # The dimensionality of the LSTM's hidden state.
            self.hidden_size: int = model_config['hidden_size']
            # The probability for the dropout layer.
            self.dropout_rate: float = model_config['dropout']
        except KeyError as e:
            raise KeyError(f"Missing required key in model_config: {e}")

        # --- Step 1: Define LSTM Network Components ---
        # The core LSTM layer.
        # `input_size`: The number of expected features in the input x.
        # `hidden_size`: The number of features in the hidden state h.
        # `num_layers`: Number of recurrent layers.
        # `batch_first=True`: input and output tensors are laid out as
        # (batch, seq, feature), matching the batch-major tensors that a
        # PyTorch DataLoader produces.
        self.lstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=1,
            batch_first=True
        )

        # A dropout layer for regularization, applied to the LSTM's output.
        self.dropout = nn.Dropout(p=self.dropout_rate)

        # The final linear layer (regression head).
        # It maps the LSTM's hidden state to a single output value (the predicted return).
        self.regressor = nn.Linear(
            in_features=self.hidden_size,
            out_features=1
        )

        # --- Step 3: Configure Weight Initialization ---
        # Apply a custom weight initialization scheme for better training stability.
        self._init_weights()

        logging.info("LSTMRegressionModel initialized successfully.")
        logging.info(f"  - Input Size: {self.input_size}")
        logging.info(f"  - Hidden Size: {self.hidden_size}")
        logging.info(f"  - Dropout Rate: {self.dropout_rate}")

    def _init_weights(self) -> None:
        """
        Applies a custom weight initialization scheme to the model's layers.
        """
        # Iterate through all modules (layers) in the network.
        for module in self.modules():
            # Check if the module is a Linear layer.
            if isinstance(module, nn.Linear):
                # Apply Xavier uniform initialization to the weights. This is a
                # standard practice for layers with linear or no activation.
                nn.init.xavier_uniform_(module.weight)
                # Initialize the bias to zero.
                if module.bias is not None:
                    nn.init.constant_(module.bias, 0)
            # Check if the module is an LSTM layer.
            elif isinstance(module, nn.LSTM):
                # Iterate through all named parameters of the LSTM.
                for name, param in module.named_parameters():
                    # Initialize weight matrices (e.g., 'weight_ih_l0').
                    if 'weight' in name:
                        nn.init.xavier_uniform_(param)
                    # Initialize bias vectors (e.g., 'bias_hh_l0').
                    elif 'bias' in name:
                        nn.init.constant_(param, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Defines the forward pass of the model.

        Args:
            x: The input tensor of TF-IDF features.
               Shape: (batch_size, input_size), e.g., (64, 2000).

        Returns:
            The output tensor of predicted returns.
               Shape: (batch_size, 1), e.g., (64, 1).
        """
        # --- Input Shape Validation ---
        # A robust check to ensure the input data has the correct dimensions.
        if x.dim() != 2 or x.shape[1] != self.input_size:
            raise ValueError(
                f"Expected input of shape (batch_size, {self.input_size}), "
                f"but got {x.shape}."
            )

        # --- Step 2: Implement LSTM Forward Pass Logic ---
        # 1. Reshape the input for the LSTM layer.
        # The LSTM expects a 3D tensor: (batch, seq_len, features).
        # Since our TF-IDF vector is a static feature set for one time step,
        # we treat it as a sequence of length 1.
        # Shape change: (batch_size, 2000) -> (batch_size, 1, 2000).
        x_reshaped = x.unsqueeze(1)

        # 2. Pass the reshaped tensor through the LSTM.
        # We don't need the final hidden/cell states `(h_n, c_n)` for this architecture.
        # `lstm_out` will contain the hidden state for each element in the sequence.
        # Shape of lstm_out: (batch_size, 1, hidden_size), e.g., (64, 1, 256).
        lstm_out, _ = self.lstm(x_reshaped)

        # 3. Extract the final hidden state.
        # Since our sequence length is 1, we take the output from the first (and only) time step.
        # The slice `[:, -1, :]` is a robust way to get the last time step's output.
        # Shape change: (batch_size, 1, 256) -> (batch_size, 256).
        last_hidden_state = lstm_out[:, -1, :]

        # 4. Apply dropout for regularization.
        # Shape remains: (batch_size, 256).
        regularized_hidden_state = self.dropout(last_hidden_state)

        # 5. Pass the result through the final linear regressor.
        # Shape change: (batch_size, 256) -> (batch_size, 1).
        prediction = self.regressor(regularized_hidden_state)

        return prediction
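
To get a feel for the size of this architecture, the trainable parameters can be counted by hand. The sketch below assumes the example sizes quoted in the comments above (2000 input features, 256 hidden units); the function name is hypothetical, and the formula applies to a single-layer, unidirectional `nn.LSTM` followed by a one-output linear layer.

```python
def lstm_regressor_param_count(input_size: int, hidden_size: int) -> int:
    """Count trainable parameters of a 1-layer LSTM plus a Linear(hidden, 1) head."""
    # Four gates, each with an input-to-hidden and a hidden-to-hidden matrix.
    lstm_weights = 4 * hidden_size * (input_size + hidden_size)
    # Two bias vectors (bias_ih and bias_hh), each covering the four gates.
    lstm_biases = 2 * 4 * hidden_size
    # Regression head: one weight per hidden unit plus a single bias.
    head = hidden_size + 1
    return lstm_weights + lstm_biases + head


total = lstm_regressor_param_count(input_size=2000, hidden_size=256)
print(f"Trainable parameters: {total:,}")
```

Because the sequence length is 1 and the default initial hidden state is zero, the hidden-to-hidden weight matrices multiply a zero vector in the forward pass, so in practice the layer behaves like a gated feed-forward transformation of the TF-IDF vector.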


In [None]:
# Task 11: Text Transformer Architecture Specification

class TextTransformerRegressionModel(nn.Module):
    """
    A Transformer-based regression model for predicting stock returns from raw text.

    This model leverages a pre-trained DistilBERT model as its core feature
    extractor. It fine-tunes the entire transformer and adds a custom two-layer
    MLP regression head on top of the [CLS] token representation to predict a
    single continuous value.

    Attributes:
        base_model_identifier (str): The HuggingFace identifier for the base model.
        dropout_rate (float): The dropout probability for the regression head.
        distilbert (PreTrainedModel): The pre-trained DistilBERT base model.
        pre_classifier (nn.Linear): The first linear layer of the regression head.
        relu (nn.ReLU): The ReLU activation function.
        dropout (nn.Dropout): The dropout layer for regularization.
        regressor (nn.Linear): The final output linear layer.
    """
    def __init__(self, model_config: Dict[str, Any]):
        """
        Initializes the TextTransformerRegressionModel.

        Args:
            model_config: A dictionary containing the model's hyperparameters:
                          'base_model_identifier', 'hidden_size', and 'dropout'.

        Raises:
            KeyError: If a required key is missing from the model_config.
            RuntimeError: If the pre-trained model cannot be loaded.
        """
        # Call the constructor of the parent class (nn.Module).
        super().__init__()

        # --- Parameter Extraction and Validation ---
        try:
            # The HuggingFace identifier for the pre-trained model.
            self.base_model_identifier: str = model_config['base_model_identifier']
            # The dropout probability for the regression head.
            self.dropout_rate: float = model_config['dropout']
            # The hidden size of the transformer's output.
            self.transformer_hidden_size: int = model_config['hidden_size']
        except KeyError as e:
            raise KeyError(f"Missing required key in model_config: {e}")

        # --- Step 1: Define Transformer Network Components ---
        try:
            # Load the pre-trained DistilBERT model from HuggingFace.
            # This downloads and caches the model weights and configuration.
            self.distilbert: PreTrainedModel = DistilBertModel.from_pretrained(
                self.base_model_identifier
            )
        except Exception as e:
            # Provide a clear error if the model fails to load, chaining the
            # original exception so its traceback is preserved.
            raise RuntimeError(
                f"Failed to load pre-trained model '{self.base_model_identifier}'. "
                f"Check model name and network connection. Original error: {e}"
            ) from e

        # --- Define the custom regression head ---
        # A two-layer MLP is a robust choice for mapping the complex text
        # representation to a single regression output.

        # First linear layer: maps DistilBERT output to an intermediate dimension.
        self.pre_classifier = nn.Linear(self.transformer_hidden_size, 256)

        # ReLU activation function.
        self.relu = nn.ReLU()

        # Dropout layer for regularization, applied to the [CLS] token representation.
        self.dropout = nn.Dropout(p=self.dropout_rate)

        # Final linear layer: maps the intermediate dimension to the single output value.
        self.regressor = nn.Linear(256, 1)

        # Note: For fine-tuning, we do not apply a custom weight initialization
        # to the pre-trained distilbert part. We only initialize the new layers
        # (the regression head).
        self._init_head_weights()

        logging.info("TextTransformerRegressionModel initialized successfully.")
        logging.info(f"  - Base Model: {self.base_model_identifier}")
        logging.info(f"  - Transformer Hidden Size: {self.transformer_hidden_size}")
        logging.info(f"  - Dropout Rate: {self.dropout_rate}")

    def _init_head_weights(self) -> None:
        """
        Initializes the weights of the custom regression head layers.
        """
        # Apply Xavier uniform initialization to the first linear layer.
        nn.init.xavier_uniform_(self.pre_classifier.weight)
        # Initialize its bias to zero.
        nn.init.constant_(self.pre_classifier.bias, 0)

        # Apply Xavier uniform initialization to the final regressor layer.
        nn.init.xavier_uniform_(self.regressor.weight)
        # Initialize its bias to zero.
        nn.init.constant_(self.regressor.bias, 0)

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor
    ) -> torch.Tensor:
        """
        Defines the forward pass of the model.

        Args:
            input_ids: A tensor of token IDs.
                       Shape: (batch_size, sequence_length).
            attention_mask: A tensor of attention masks (1 for real tokens,
                            0 for padding).
                            Shape: (batch_size, sequence_length).

        Returns:
            The output tensor of predicted returns.
               Shape: (batch_size, 1).
        """
        # --- Step 3: Implement Transformer Forward Pass ---
        # 1. Pass inputs through the base DistilBERT model.
        # The attention_mask ensures that the model does not perform attention
        # on padded tokens, which is critical for correctness.
        distilbert_output = self.distilbert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # 2. Extract the [CLS] token representation.
        # The `last_hidden_state` contains the output embeddings for all tokens
        # in the sequence. For classification/regression, we use the state of
        # the first token, the [CLS] token.
        # Shape of last_hidden_state: (batch_size, sequence_length, hidden_size).
        # We slice `[:, 0, :]` to get the [CLS] token for each item in the batch.
        # Shape of cls_token_state: (batch_size, hidden_size), e.g., (64, 768).
        cls_token_state = distilbert_output.last_hidden_state[:, 0, :]

        # 3. Apply dropout to the [CLS] token representation for regularization.
        # Shape remains: (batch_size, 768).
        regularized_cls_token = self.dropout(cls_token_state)

        # 4. Pass the representation through the custom regression head.
        # Shape change: (batch_size, 768) -> (batch_size, 256).
        intermediate_output = self.pre_classifier(regularized_cls_token)
        # Apply ReLU activation.
        intermediate_output = self.relu(intermediate_output)

        # Shape change: (batch_size, 256) -> (batch_size, 1).
        prediction = self.regressor(intermediate_output)

        return prediction
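
The shape bookkeeping in this forward pass can be checked without downloading a model. The sketch below is a NumPy stand-in, not the actual network: random arrays take the place of the DistilBERT output and of the head weights, dropout is omitted (as it would be in evaluation mode), and all variable names are illustrative.

```python
import numpy as np

batch_size, seq_len, hidden_size, head_dim = 4, 16, 768, 256
rng = np.random.default_rng(0)

# Stand-in for distilbert_output.last_hidden_state: (batch, seq_len, hidden).
last_hidden_state = rng.standard_normal((batch_size, seq_len, hidden_size))

# Slice position 0 of every sequence, i.e. the [CLS] token: (batch, hidden).
cls_state = last_hidden_state[:, 0, :]

# Random stand-ins for the two linear layers of the regression head.
w1 = rng.standard_normal((hidden_size, head_dim)) * 0.01
b1 = np.zeros(head_dim)
w2 = rng.standard_normal((head_dim, 1)) * 0.01
b2 = np.zeros(1)

hidden = np.maximum(cls_state @ w1 + b1, 0.0)  # linear + ReLU: (batch, 256)
prediction = hidden @ w2 + b2                  # final linear: (batch, 1)

print(cls_state.shape, hidden.shape, prediction.shape)
```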


In [None]:
# Task 12: Feature-Enhanced Transformer Architecture Specification

class FeatureEnhancedMLP(nn.Module):
    """
    An MLP model for predicting returns from combined TF-IDF and embedding features.

    This model, referred to as the "Feature-based Transformer" in the paper,
    is implemented as a Multi-Layer Perceptron (MLP). It processes the static,
    concatenated vector of sparse (TF-IDF) and dense (sentence embedding)
    features through a simple but effective neural network architecture.

    The architecture consists of:
    1. A hidden layer mapping the combined features to a smaller dimension.
    2. A ReLU activation function.
    3. A dropout layer for regularization.
    4. A final linear layer for regression output.

    Attributes:
        input_size (int): The dimensionality of the concatenated input features (e.g., 2384).
        hidden_size (int): The number of neurons in the hidden layer.
        dropout_rate (float): The dropout probability.
        network (nn.Sequential): The full model as a sequential container holding,
            in order, the hidden linear layer, the ReLU activation, the dropout
            layer, and the final output linear layer.
    """
    def __init__(self, model_config: Dict[str, Any]):
        """
        Initializes the FeatureEnhancedMLP with parameters from a configuration dictionary.

        Args:
            model_config: A dictionary containing the model's hyperparameters:
                          'input_size', 'hidden_size', and 'dropout'.

        Raises:
            KeyError: If a required key is missing from the model_config.
        """
        # Call the constructor of the parent class (nn.Module).
        super().__init__()

        # --- Parameter Extraction and Validation ---
        try:
            # The number of features in the concatenated input vector.
            self.input_size: int = model_config['input_size']
            # The number of neurons in the hidden layer.
            self.hidden_size: int = model_config['hidden_size']
            # The probability for the dropout layer.
            self.dropout_rate: float = model_config['dropout']
        except KeyError as e:
            raise KeyError(f"Missing required key in model_config: {e}")

        # --- Step 1: Define MLP Architecture Components ---
        # A sequential container for a clean, readable architecture definition.
        self.network = nn.Sequential(
            # First linear layer: maps the input vector to the hidden dimension.
            nn.Linear(self.input_size, self.hidden_size),
            # ReLU activation function to introduce non-linearity.
            nn.ReLU(),
            # Dropout layer for regularization to prevent overfitting.
            nn.Dropout(self.dropout_rate),
            # Final linear layer: maps the hidden representation to a single output value.
            nn.Linear(self.hidden_size, 1)
        )

        # --- Step 3: Configure Weight Initialization ---
        # Apply a custom weight initialization scheme.
        self._init_weights()

        logging.info("FeatureEnhancedMLP initialized successfully.")
        logging.info(f"  - Input Size: {self.input_size}")
        logging.info(f"  - Hidden Size: {self.hidden_size}")
        logging.info(f"  - Dropout Rate: {self.dropout_rate}")

    def _init_weights(self) -> None:
        """
        Applies a custom weight initialization scheme to the model's layers.
        """
        # The first and last entries of the Sequential are the two Linear layers.
        # Selecting them by position (rather than by `out_features`) keeps the
        # scheme correct even in the edge case where hidden_size == 1.
        hidden_layer = self.network[0]
        output_layer = self.network[-1]

        # Use He (Kaiming) initialization for the hidden layer, which is
        # followed by a ReLU activation. This scheme is designed to preserve
        # the activation variance through the ReLU.
        nn.init.kaiming_uniform_(hidden_layer.weight, nonlinearity='relu')

        # Use Xavier (Glorot) initialization for the final output layer,
        # which has no non-linear activation.
        nn.init.xavier_uniform_(output_layer.weight)

        # Initialize all biases to zero.
        for layer in (hidden_layer, output_layer):
            if layer.bias is not None:
                nn.init.constant_(layer.bias, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Defines the forward pass of the model.

        Args:
            x: The input tensor of concatenated features.
               Shape: (batch_size, input_size), e.g., (64, 2384).

        Returns:
            The output tensor of predicted returns.
               Shape: (batch_size, 1), e.g., (64, 1).
        """
        # --- Input Shape Validation ---
        # A robust check to ensure the input data has the correct dimensions.
        if x.dim() != 2 or x.shape[1] != self.input_size:
            raise ValueError(
                f"Expected input of shape (batch_size, {self.input_size}), "
                f"but got {x.shape}."
            )

        # --- Step 2: Implement MLP Forward Pass ---
        # The forward pass is a simple, sequential execution of the defined network.
        # Shape flow: (batch, 2384) -> (batch, 256) -> (batch, 1).
        prediction = self.network(x)

        return prediction
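
For intuition about the two initialization schemes used in `_init_weights`, the sampling bounds can be computed by hand. Both schemes draw weights from a uniform distribution U(-b, b); the sketch below evaluates b for the example layer sizes mentioned above (2384 inputs, 256 hidden units). The function names are illustrative, and the formulas assume the fan-in mode and ReLU gain for Kaiming and a gain of 1 for Xavier.

```python
import math


def kaiming_uniform_bound(fan_in: int) -> float:
    """Bound of U(-b, b) for He/Kaiming uniform init with ReLU gain, fan-in mode."""
    gain = math.sqrt(2.0)  # recommended gain for ReLU
    return gain * math.sqrt(3.0 / fan_in)


def xavier_uniform_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Bound of U(-b, b) for Xavier/Glorot uniform init."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


hidden_bound = kaiming_uniform_bound(fan_in=2384)            # Linear(2384, 256)
output_bound = xavier_uniform_bound(fan_in=256, fan_out=1)   # Linear(256, 1)
print(f"hidden layer bound: {hidden_bound:.4f}")
print(f"output layer bound: {output_bound:.4f}")
```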


In [None]:
# Task 13: Training Infrastructure Setup

def setup_training_environment(seed: int = 42) -> torch.device:
    """
    Configures the environment for reproducible and optimized PyTorch training.

    This function performs two critical setup tasks:
    1.  Sets random seeds for all relevant libraries (torch, numpy, random) to
        ensure run-to-run reproducibility.
    2.  Configures cuDNN for deterministic behavior and detects the optimal
        compute device (GPU if available, otherwise CPU).

    Args:
        seed: The integer seed for all random number generators.

    Returns:
        The configured torch.device object ('cuda' or 'cpu').
    """
    # --- Step 1: Set Seeds for Reproducibility ---
    # Set seed for the `random` module.
    random.seed(seed)
    # Set seed for NumPy.
    np.random.seed(seed)
    # Set seed for PyTorch on both CPU and GPU.
    torch.manual_seed(seed)
    # Ensure reproducibility for CUDA operations.
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

    # --- Step 2: Configure for Deterministic Operations ---
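    # Deterministic cuDNN kernels can be slower but are important for
    # reproducibility. `deterministic=True` restricts cuDNN to deterministic
    # algorithms; `benchmark=False` disables the cuDNN auto-tuner, which could
    # otherwise select different algorithms from run to run. Operations outside
    # cuDNN may still be non-deterministic unless
    # `torch.use_deterministic_algorithms(True)` is also enabled.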
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # --- Step 3: Select Compute Device ---
    # Select GPU if available, otherwise fall back to CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    logging.info(f"Training environment configured. Seed: {seed}. Device: {device}.")
    logging.info(f"cuDNN deterministic: {torch.backends.cudnn.deterministic}, benchmark: {torch.backends.cudnn.benchmark}")

    return device
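
A quick way to confirm that seeding behaves as intended is to draw the same sequence twice. The sketch below uses only the standard-library `random` module to keep it dependency-free; the same check can be repeated with NumPy and PyTorch generators. The helper name `draw_sequence` is hypothetical.

```python
import random


def draw_sequence(seed: int, n: int = 5) -> list:
    """Re-seed the global generator and draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw_sequence(42)
second = draw_sequence(42)
other = draw_sequence(43)

print("same seed, same draws:", first == second)
print("different seed, same draws:", first == other)
```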


class FinancialDataset(Dataset):
    """
    A PyTorch Dataset for standard numerical feature matrices.

    This class serves as a robust interface between NumPy-based feature arrays
    (e.g., TF-IDF, combined features) and the PyTorch DataLoader. It is designed
    for models that expect a single feature tensor per sample, such as the LSTM
    and MLP architectures in this study. The class handles the conversion of
    NumPy arrays to PyTorch tensors with the appropriate data types.

    Attributes:
        features (torch.Tensor): A tensor holding all feature data.
        targets (torch.Tensor): A tensor holding all target data.
    """

    def __init__(self, features: np.ndarray, targets: np.ndarray) -> None:
        """
        Initializes the FinancialDataset.

        Args:
            features (np.ndarray): A 2D NumPy array of input features, where
                                   each row is a sample.
                                   Shape: (num_samples, num_features).
            targets (np.ndarray): A 1D NumPy array of target values.
                                  Shape: (num_samples,).

        Raises:
            ValueError: If the number of samples in features and targets
                        do not match.
            TypeError: If inputs are not NumPy arrays.
        """
        # --- Input Type and Integrity Validation ---
        # Ensure features are a NumPy array.
        if not isinstance(features, np.ndarray):
            raise TypeError(f"Features must be a NumPy array, but got {type(features)}.")

        # Ensure targets are a NumPy array.
        if not isinstance(targets, np.ndarray):
            raise TypeError(f"Targets must be a NumPy array, but got {type(targets)}.")

        # Ensure the number of samples is consistent between features and targets.
        if features.shape[0] != targets.shape[0]:
            raise ValueError(
                f"Inconsistent number of samples. Features have {features.shape[0]} "
                f"samples, but targets have {targets.shape[0]} samples."
            )

        # --- Data Conversion and Storage ---
        # Convert the NumPy feature array to a PyTorch tensor of type float32.
        # float32 is the standard precision for features in most deep learning tasks.
        self.features: torch.Tensor = torch.tensor(features, dtype=torch.float32)

        # Convert the NumPy target array to a PyTorch tensor of type float32.
        self.targets: torch.Tensor = torch.tensor(targets, dtype=torch.float32)

    def __len__(self) -> int:
        """
        Returns the total number of samples in the dataset.

        This method is required by the PyTorch Dataset API and is used by the
        DataLoader to determine the size of the dataset.

        Returns:
            int: The total number of samples.
        """
        # Return the number of rows in the features tensor.
        return self.features.shape[0]

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Retrieves a single sample (features and target) from the dataset.

        This method is required by the PyTorch Dataset API. It is called by the
        DataLoader to fetch a single data point at the specified index.

        Args:
            idx (int): The index of the sample to retrieve.

        Returns:
            Tuple[torch.Tensor, torch.Tensor]: A tuple containing:
                - The feature tensor for the sample. Shape: (num_features,).
                - The target tensor for the sample. Shape: (1,). The target is
                  unsqueezed to ensure a consistent shape for loss calculation.
        """
        # Retrieve the feature tensor at the specified index.
        feature_sample = self.features[idx]

        # Retrieve the target tensor at the specified index.
        target_sample = self.targets[idx]

        # Unsqueeze the target tensor to add a dimension, changing its shape
        # from a scalar-like tensor to a 1D tensor of shape (1,).
        # This is best practice for regression tasks to align with model outputs
        # that typically have shape (batch_size, 1).
        return feature_sample, target_sample.unsqueeze(0)
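
The reason for unsqueezing the target is easiest to see with a small broadcasting experiment. The sketch below uses NumPy, whose broadcasting rules PyTorch follows; the values are made up for illustration. Subtracting a `(batch,)` target from a `(batch, 1)` prediction silently produces a `(batch, batch)` matrix, which would corrupt a mean-squared-error loss.

```python
import numpy as np

predictions = np.array([[0.1], [0.2], [0.3], [0.4]], dtype=np.float32)  # (4, 1)
targets_flat = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)         # (4,)

# Mismatched shapes broadcast to (4, 4): every prediction vs. every target.
bad_diff = predictions - targets_flat
# Adding a trailing axis, as unsqueeze does, gives the intended (4, 1).
good_diff = predictions - targets_flat[:, None]

bad_mse = float((bad_diff ** 2).mean())
good_mse = float((good_diff ** 2).mean())

print(bad_diff.shape, good_diff.shape)
print(f"MSE with broadcasting bug: {bad_mse:.4f}, correct MSE: {good_mse:.4f}")
```

Here the predictions equal the targets, so the correct loss is zero while the broadcast version is not.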


class TextTransformerDataset(Dataset):
    """
    A PyTorch Dataset for tokenized text data for Transformer models.

    This class is specifically designed to work with the output of a HuggingFace
    tokenizer. It handles dictionary-based inputs (e.g., 'input_ids',
    'attention_mask') required by Transformer models like DistilBERT.

    Attributes:
        encodings (Dict[str, torch.Tensor]): A dictionary of tensors, where keys
            are tokenizer-generated names (e.g., 'input_ids') and values are
            the corresponding tensors for the entire dataset.
        targets (torch.Tensor): A tensor holding all target data.
    """

    def __init__(self, encodings: Dict[str, np.ndarray], targets: np.ndarray) -> None:
        """
        Initializes the TextTransformerDataset.

        Args:
            encodings (Dict[str, np.ndarray]): A dictionary where keys are strings
                (e.g., 'input_ids', 'attention_mask') and values are NumPy arrays.
                All arrays must have the same first dimension (num_samples).
            targets (np.ndarray): A 1D NumPy array of target values.
                                  Shape: (num_samples,).

        Raises:
            ValueError: If the number of samples in encodings and targets
                        do not match, or if 'input_ids' is missing.
            TypeError: If inputs are not of the expected types.
        """
        # --- Input Type and Integrity Validation ---
        # Ensure encodings is a dictionary.
        if not isinstance(encodings, dict):
            raise TypeError(f"Encodings must be a dictionary, but got {type(encodings)}.")

        # Ensure 'input_ids' is present, as it's the minimum requirement.
        if 'input_ids' not in encodings:
            raise ValueError("Encodings dictionary must contain the key 'input_ids'.")

        # Ensure targets is a NumPy array.
        if not isinstance(targets, np.ndarray):
            raise TypeError(f"Targets must be a NumPy array, but got {type(targets)}.")

        # Ensure every encoding value is a NumPy array before inspecting shapes,
        # so that a wrong type raises the documented TypeError.
        for key, value in encodings.items():
            if not isinstance(value, np.ndarray):
                raise TypeError(f"Value for key '{key}' in encodings must be a NumPy array.")

        # Get the number of samples from the 'input_ids' array.
        num_samples = encodings['input_ids'].shape[0]

        # Ensure the number of samples is consistent across all encoding arrays.
        for key, value in encodings.items():
            if value.shape[0] != num_samples:
                raise ValueError(
                    f"Inconsistent number of samples in encodings. 'input_ids' has "
                    f"{num_samples} samples, but '{key}' has {value.shape[0]}."
                )

        # Ensure the number of samples in targets matches.
        if targets.shape[0] != num_samples:
            raise ValueError(
                f"Inconsistent number of samples. Encodings have {num_samples} "
                f"samples, but targets have {targets.shape[0]} samples."
            )

        # --- Data Conversion and Storage ---
        # Convert each NumPy array in the encodings dictionary to a PyTorch tensor.
        # The tensor type is inferred from the NumPy array (usually int64 for IDs).
        self.encodings: Dict[str, torch.Tensor] = {
            key: torch.tensor(val) for key, val in encodings.items()
        }

        # Convert the NumPy target array to a PyTorch tensor of type float32.
        self.targets: torch.Tensor = torch.tensor(targets, dtype=torch.float32)

    def __len__(self) -> int:
        """
        Returns the total number of samples in the dataset.

        Returns:
            int: The total number of samples.
        """
        # Return the number of rows in the targets tensor.
        return self.targets.shape[0]

    def __getitem__(self, idx: int) -> Tuple[Dict[str, torch.Tensor], torch.Tensor]:
        """
        Retrieves a single sample (encoding dictionary and target) from the dataset.

        Args:
            idx (int): The index of the sample to retrieve.

        Returns:
            Tuple[Dict[str, torch.Tensor], torch.Tensor]: A tuple containing:
                - A dictionary where keys are 'input_ids', 'attention_mask', etc.,
                  and values are the corresponding tensors for the single sample.
                - The target tensor for the sample. Shape: (1,).
        """
        # Create a dictionary for the single sample by slicing each tensor
        # in the main encodings dictionary at the specified index.
        item = {key: val[idx] for key, val in self.encodings.items()}

        # Retrieve the target tensor at the specified index.
        target = self.targets[idx]

        # Unsqueeze the target tensor to shape (1,) for consistency.
        return item, target.unsqueeze(0)


def create_dataloaders(
    features: Union[np.ndarray, Dict[str, np.ndarray]],
    targets: np.ndarray,
    model_type: str,
    batch_size: int,
    shuffle: bool
) -> DataLoader:
    """
    Factory function to create a PyTorch DataLoader for a given dataset.

    This function selects the appropriate Dataset class based on the model type
    and configures a DataLoader with specified parameters.

    Args:
        features: The input features (either a NumPy array or a dictionary of tokenized data).
        targets: The corresponding target values.
        model_type: The type of model ('lstm', 'feature_transformer', or 'text_transformer').
        batch_size: The size of each batch.
        shuffle: Whether to shuffle the data (True for training, False for val/test).

    Returns:
        A configured PyTorch DataLoader instance.
    """
    # --- Select the appropriate Dataset class based on model type ---
    if model_type in ['lstm', 'feature_transformer']:
        # For the LSTM and the feature-enhanced MLP ('feature_transformer'),
        # use the standard numerical dataset.
        dataset = FinancialDataset(features, targets)
    elif model_type == 'text_transformer':
        # For the text transformer, use the specialized dictionary-based dataset.
        dataset = TextTransformerDataset(features, targets)
    else:
        raise ValueError(f"Unknown model_type: {model_type}")

    # --- Configure and return the DataLoader ---
    # `num_workers` > 0 enables multi-process data loading.
    # `pin_memory` speeds up CPU-to-GPU transfers, so it is enabled only when a
    # CUDA device is available (on CPU-only machines it brings no benefit and warns).
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=4,
        pin_memory=torch.cuda.is_available()
    )


def setup_optimization_components(
    model: nn.Module,
    train_config: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Configures the loss function, optimizer, and learning rate scheduler.

    Args:
        model: The PyTorch model whose parameters will be optimized.
        train_config: The 'global_params' section of the study configuration.

    Returns:
        A dictionary containing the instantiated 'criterion' (loss function),
        'optimizer', and 'scheduler'.
    """
    # --- Step 1: Configure Loss Function ---
    # The paper specifies Mean Squared Error for this regression task.
    # Equation: L_MSE = 1/N * sum((y_pred - y_true)^2)
    if train_config['loss_function'] == 'MeanSquaredError':
        criterion = nn.MSELoss()
    else:
        raise ValueError(f"Unsupported loss function: {train_config['loss_function']}")

    # --- Step 2: Configure Optimizer ---
    # The paper specifies the Adam optimizer.
    if train_config['optimizer'] == 'Adam':
        optimizer = optim.Adam(
            model.parameters(),
            lr=train_config['learning_rate']
        )
    else:
        raise ValueError(f"Unsupported optimizer: {train_config['optimizer']}")

    # --- Step 3: Configure Learning Rate Scheduler ---
    # ReduceLROnPlateau lowers the learning rate when the validation loss stops
    # improving, so the step size shrinks only once progress has stalled.
    # Note: the `verbose` argument is deprecated in recent PyTorch releases, so it
    # is not passed here; the current rate can be read from
    # `optimizer.param_groups[0]['lr']` if it needs to be logged.
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode='min',      # Reduce LR when the monitored quantity has stopped decreasing.
        factor=0.5,      # Factor by which the learning rate will be reduced.
        patience=3       # Number of epochs with no improvement tolerated before LR is reduced.
    )

    logging.info("Optimization components configured (MSE Loss, Adam Optimizer, ReduceLROnPlateau Scheduler).")

    return {'criterion': criterion, 'optimizer': optimizer, 'scheduler': scheduler}
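

# --- Illustrative sketch (not part of the pipeline) ---
# The plateau rule configured above (mode='min', factor=0.5, patience=3) can be
# traced by hand with a simplified pure-Python model. The helper name
# `_sketch_plateau_schedule` is hypothetical, and the comparison is a plain
# strict improvement: PyTorch's ReduceLROnPlateau also applies a relative
# threshold, a cooldown, and a minimum learning rate, all omitted here.
from typing import List


def _sketch_plateau_schedule(
    val_losses: List[float],
    initial_lr: float,
    factor: float = 0.5,
    patience: int = 3
) -> List[float]:
    """Returns the learning rate in effect after each epoch's validation loss."""
    lr = initial_lr
    best = float('inf')
    bad_epochs = 0
    lrs = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember the new best and reset the counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # The rate is cut once the count of non-improving epochs exceeds `patience`.
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs

# Under this rule, with patience=3, the first halving happens on the fourth
# consecutive epoch without improvement.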


# Task 14: Model Training Loop Implementation

def _handle_batch_input(
    batch: Tuple[Union[torch.Tensor, Dict[str, torch.Tensor]], torch.Tensor],
    device: torch.device
) -> Tuple[Union[torch.Tensor, Dict[str, torch.Tensor]], torch.Tensor]:
    """
    Internal helper to move a batch of data to the correct device.
    Handles both standard tensor and dictionary-based (for transformers) inputs.
    """
    # Unpack the batch into features and targets.
    features, targets = batch

    # Move the targets tensor to the specified device.
    targets = targets.to(device, non_blocking=True)

    # Check if the features are a dictionary (for transformer models).
    if isinstance(features, dict):
        # If so, move each tensor within the dictionary to the device.
        features = {key: val.to(device, non_blocking=True) for key, val in features.items()}
    else:
        # Otherwise, move the single feature tensor to the device.
        features = features.to(device, non_blocking=True)

    return features, targets


def train_epoch(
    model: nn.Module,
    dataloader: DataLoader,
    optimizer: optim.Optimizer,
    criterion: nn.Module,
    device: torch.device,
    scaler: GradScaler,
    max_grad_norm: float = 1.0
) -> float:
    """
    Executes a single training epoch for the given model.

    This function iterates over the training dataloader, performs the forward
    and backward passes, and updates the model weights. It incorporates best
    practices such as gradient clipping and automatic mixed precision (AMP) for
    stable and efficient training.

    Args:
        model: The PyTorch model to be trained.
        dataloader: The DataLoader for the training set.
        optimizer: The optimizer for updating model weights.
        criterion: The loss function.
        device: The device to perform computations on ('cuda' or 'cpu').
        scaler: The GradScaler for mixed-precision training.
        max_grad_norm: The maximum norm for gradient clipping.

    Returns:
        The average training loss for the epoch.
    """
    # --- Set the model to training mode ---
    # This enables layers like Dropout and BatchNorm to function correctly.
    model.train()

    # Initialize running loss for the epoch.
    running_loss = 0.0

    # --- Iterate over the training data ---
    # Use tqdm for a descriptive progress bar.
    for batch in tqdm(dataloader, desc="Training"):
        # Move the batch of data to the configured device.
        features, targets = _handle_batch_input(batch, device)

        # --- Core Training Steps ---
        # 1. Zero the gradients from the previous iteration.
        optimizer.zero_grad()

        # 2. Forward pass with Automatic Mixed Precision (AMP).
        # `autocast` automatically casts operations to lower-precision dtypes (like float16)
        # to speed up computation and reduce memory usage on compatible GPUs.
        with autocast(enabled=(device.type == 'cuda')):
            # Get model predictions.
            if isinstance(features, dict):
                predictions = model(**features)
            else:
                predictions = model(features)
            # Calculate the loss.
            loss = criterion(predictions, targets)

        # 3. Backward pass with the GradScaler.
        # `scaler.scale(loss)` multiplies the loss by a scaling factor to prevent
        # gradients from underflowing (becoming zero) in mixed precision.
        scaler.scale(loss).backward()

        # 4. Gradient Clipping.
        # Unscales the gradients before clipping to view their true values.
        scaler.unscale_(optimizer)
        # Clips the norm of the gradients to prevent them from exploding, which
        # can destabilize training.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

        # 5. Optimizer step.
        # Because `scaler.unscale_` was already called above, `scaler.step` does not
        # unscale again; it calls `optimizer.step()` unless the gradients contain
        # NaNs or Infs, in which case the step is skipped.
        scaler.step(optimizer)

        # 6. Update the GradScaler for the next iteration.
        scaler.update()

        # Accumulate the loss for the epoch.
        running_loss += loss.item()

    # Calculate the average loss for the epoch.
    avg_epoch_loss = running_loss / len(dataloader)
    return avg_epoch_loss
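

# --- Illustrative sketch (not part of the pipeline) ---
# Step 4 of the loop above rescales all gradients by one common factor when their
# combined L2 norm exceeds `max_grad_norm`. The rule can be sketched in pure
# Python, assuming gradients are plain lists of floats (one list per parameter).
# The helper name `_sketch_clip_by_global_norm` and the epsilon value are
# assumptions for illustration; this is not PyTorch's implementation.
import math
from typing import List, Tuple


def _sketch_clip_by_global_norm(
    grads: List[List[float]],
    max_norm: float,
    eps: float = 1e-6
) -> Tuple[List[List[float]], float]:
    """Returns the rescaled gradients and the total norm measured before clipping."""
    # Global L2 norm over every gradient entry of every parameter.
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Scale factor is capped at 1.0, so small gradients are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# Because every entry is multiplied by the same factor, the gradient direction
# is preserved and only its magnitude is limited.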


def validate_epoch(
    model: nn.Module,
    dataloader: DataLoader,
    criterion: nn.Module,
    device: torch.device
) -> float:
    """
    Executes a single validation epoch for the given model.

    This function iterates over the validation dataloader to compute the
    model's loss on unseen data. It runs in inference mode, disabling gradients
    for efficiency and correctness.

    Args:
        model: The PyTorch model to be validated.
        dataloader: The DataLoader for the validation set.
        criterion: The loss function.
        device: The device to perform computations on ('cuda' or 'cpu').

    Returns:
        The average validation loss for the epoch.
    """
    # --- Set the model to evaluation mode ---
    # This is critical: it disables Dropout and sets other layers to inference mode.
    model.eval()

    # Initialize running loss for the epoch.
    running_loss = 0.0

    # --- Disable gradient computation ---
    # This context manager reduces memory usage and speeds up computations, as
    # no gradients need to be stored for the backward pass.
    with torch.no_grad():
        # --- Iterate over the validation data ---
        for batch in tqdm(dataloader, desc="Validating"):
            # Move the batch of data to the configured device.
            features, targets = _handle_batch_input(batch, device)

            # Perform the forward pass.
            if isinstance(features, dict):
                predictions = model(**features)
            else:
                predictions = model(features)

            # Calculate the loss.
            loss = criterion(predictions, targets)

            # Accumulate the loss.
            running_loss += loss.item()

    # Calculate the average loss for the epoch.
    avg_epoch_loss = running_loss / len(dataloader)
    return avg_epoch_loss
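

# --- Illustrative sketch (not part of the pipeline) ---
# Both epoch functions above divide the summed batch losses by `len(dataloader)`,
# i.e. they report the mean of the per-batch means. When the final batch is
# smaller than the others, this differs slightly from the per-sample mean.
# A pure-Python sketch of the two quantities, assuming a list of per-sample
# losses; the helper name `_sketch_epoch_losses` is hypothetical.
from typing import List, Tuple


def _sketch_epoch_losses(
    per_sample_losses: List[float],
    batch_size: int
) -> Tuple[float, float]:
    """Returns (mean of batch means, per-sample mean) for one epoch."""
    if not per_sample_losses:
        raise ValueError("per_sample_losses must not be empty.")
    # Split the samples into consecutive batches; the last one may be smaller.
    batches = [
        per_sample_losses[i:i + batch_size]
        for i in range(0, len(per_sample_losses), batch_size)
    ]
    batch_means = [sum(batch) / len(batch) for batch in batches]
    mean_of_batch_means = sum(batch_means) / len(batch_means)
    per_sample_mean = sum(per_sample_losses) / len(per_sample_losses)
    return mean_of_batch_means, per_sample_mean

# The two values coincide whenever the dataset size is a multiple of the batch size.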


def run_training_orchestrator(
    model: nn.Module,
    optimization_components: Dict[str, Any],
    train_loader: DataLoader,
    val_loader: DataLoader,
    device: torch.device,
    num_epochs: int,
    patience: int,
    checkpoint_path: Union[str, Path]
) -> Dict[str, Any]:
    """
    Orchestrates the complete model training loop.

    This function manages the entire training process over multiple epochs,
    integrating training, validation, learning rate scheduling, early stopping,
    and model checkpointing.

    Args:
        model: The PyTorch model to train.
        optimization_components: A dictionary containing the optimizer, criterion, and scheduler.
        train_loader: The DataLoader for the training set.
        val_loader: The DataLoader for the validation set.
        device: The compute device.
        num_epochs: The maximum number of epochs to train for.
        patience: The number of epochs to wait for validation loss improvement
                  before stopping early.
        checkpoint_path: The file path to save the best model weights.

    Returns:
        A dictionary containing the training history and the path to the best model.
    """
    # --- Unpack optimization components ---
    optimizer = optimization_components['optimizer']
    criterion = optimization_components['criterion']
    scheduler = optimization_components['scheduler']

    # --- Initialize training state variables ---
    best_val_loss = float('inf')
    epochs_no_improve = 0
    history = {'train_loss': [], 'val_loss': []}

    # Initialize the GradScaler for mixed-precision training.
    scaler = GradScaler(enabled=(device.type == 'cuda'))

    # Ensure the directory for the checkpoint exists.
    Path(checkpoint_path).parent.mkdir(parents=True, exist_ok=True)

    # --- Main Training Loop ---
    for epoch in range(1, num_epochs + 1):
        start_time = time.time()

        # --- Run Training and Validation for one epoch ---
        train_loss = train_epoch(model, train_loader, optimizer, criterion, device, scaler)
        val_loss = validate_epoch(model, val_loader, criterion, device)

        end_time = time.time()
        epoch_duration = end_time - start_time

        logging.info(
            f"Epoch {epoch}/{num_epochs} | "
            f"Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | "
            f"Duration: {epoch_duration:.2f}s"
        )

        # --- Record history ---
        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss)

        # --- Learning Rate Scheduling ---
        # The scheduler adjusts the LR based on the validation loss.
        scheduler.step(val_loss)

        # --- Checkpointing and Early Stopping ---
        if val_loss < best_val_loss:
            # If validation loss has improved, save the model and reset counter.
            logging.info(f"Validation loss improved ({best_val_loss:.4f} -> {val_loss:.4f}). Saving model...")
            best_val_loss = val_loss
            torch.save(model.state_dict(), checkpoint_path)
            epochs_no_improve = 0
        else:
            # If no improvement, increment the patience counter.
            epochs_no_improve += 1
            logging.info(f"No improvement in validation loss for {epochs_no_improve} epoch(s).")

        # Check for early stopping.
        if epochs_no_improve >= patience:
            logging.info(f"Early stopping triggered after {patience} epochs with no improvement.")
            break

    logging.info(f"Training finished. Best validation loss: {best_val_loss:.4f}")

    return {
        'history': history,
        # Returned as a string, matching the entry stored when training is skipped.
        'best_model_path': str(checkpoint_path)
    }
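

# --- Illustrative sketch (not part of the pipeline) ---
# The early-stopping rule used by the orchestrator above can be replayed on a
# plain list of validation losses. This pure-Python helper (hypothetical name
# `_sketch_early_stopping`) mirrors the counter logic only; it does no training
# and saves no checkpoints.
from typing import List, Tuple


def _sketch_early_stopping(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Returns (last_epoch_run, best_epoch), both 1-indexed; (0, 0) if no epochs."""
    best_val_loss = float('inf')
    best_epoch = 0
    epochs_no_improve = 0
    last_epoch = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        last_epoch = epoch
        if val_loss < best_val_loss:
            # Improvement: this is where the orchestrator saves a checkpoint.
            best_val_loss = val_loss
            best_epoch = epoch
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
        # Stop once `patience` consecutive epochs have failed to improve.
        if epochs_no_improve >= patience:
            break
    return last_epoch, best_epoch

# The saved checkpoint corresponds to `best_epoch`, not to the last epoch run.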


# Task 15: Regime-Specific Training Orchestration

def _get_model_and_features_for_run(
    model_type: str,
    regime_name: str,
    split_name: str,
    study_config: Dict[str, Any],
    data_splits: DataSplits,
    tfidf_features: TfidfFeatures,
    embedding_features: EmbeddingFeatures,
    combined_features: CombinedFeatures,
    tokenizer: PreTrainedTokenizerBase
) -> Tuple[nn.Module, Union[np.ndarray, Dict[str, np.ndarray]], np.ndarray]:
    """
    Internal helper to select the correct model class and feature set for a training run.
    This acts as a factory for models and their corresponding data.
    """
    # --- Select Model Class ---
    model_classes = {
        'lstm': LSTMRegressionModel,
        'text_transformer': TextTransformerRegressionModel,
        'feature_transformer': FeatureEnhancedMLP
    }
    if model_type not in model_classes:
        raise ValueError(f"Unknown model_type: {model_type}")

    # Instantiate the correct model with its specific configuration.
    model_config = study_config['model_training']['architectures'][model_type]
    model = model_classes[model_type](model_config)

    # --- Select Corresponding Features and Targets ---
    # Get the target values for the current split.
    targets = data_splits[regime_name][split_name]['target_return'].values

    # Select the feature set based on the model type.
    if model_type == 'lstm':
        # LSTM model uses TF-IDF features.
        features = tfidf_features[regime_name][split_name].toarray()
    elif model_type == 'feature_transformer':
        # Feature-enhanced model uses the combined features.
        features = combined_features[regime_name][split_name]
    elif model_type == 'text_transformer':
        # Text transformer requires tokenized text.
        text_corpus = data_splits[regime_name][split_name]['aggregated_text'].tolist()
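        # The tokenizer returns a `BatchEncoding`, which is not a `dict` subclass.
        # Convert it to a plain dict so it passes the type check in
        # `TextTransformerDataset`.
        features = dict(tokenizer(
            text_corpus,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_tensors='np' # Return NumPy arrays for the Dataset class
        ))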
    else:
        # This case is already handled, but included for logical completeness.
        raise ValueError(f"Feature selection logic not implemented for model_type: {model_type}")

    return model, features, targets


def run_regime_specific_training_pipeline(
    data_splits: DataSplits,
    tfidf_features: TfidfFeatures,
    embedding_features: EmbeddingFeatures,
    combined_features: CombinedFeatures,
    study_config: Dict[str, Any],
    base_checkpoint_dir: str = "checkpoints",
    force_retrain: bool = False
) -> Dict[str, Any]:
    """
    Orchestrates the entire regime-specific training pipeline for all models.

    This master function iterates through each regime and each model architecture
    defined in the study configuration. For each of the 12 combinations, it:
    1. Sets up a reproducible training environment.
    2. Selects the correct model, features, and data.
    3. Configures DataLoaders, optimizer, and loss function.
    4. Executes the full training loop with validation, checkpointing, and early stopping.
    5. Stores the results (path to best model and training history).

    Args:
        data_splits: The nested dictionary of data splits.
        tfidf_features: The nested dictionary of TF-IDF features.
        embedding_features: The nested dictionary of sentence embedding features.
        combined_features: The nested dictionary of combined features.
        study_config: The complete study configuration dictionary.
        base_checkpoint_dir: The root directory to save model checkpoints.
        force_retrain: If True, retrains models even if a checkpoint exists.

    Returns:
        A nested dictionary containing the results of all training runs.
        Structure: {regime: {model_type: {'best_model_path': ..., 'history': ...}}}
    """
    logging.info("====== Starting Full Regime-Specific Training Pipeline ======")

    # --- Initialize environment and results dictionary ---
    # Set up a reproducible environment. The seed is set once here, before the loop,
    # so each run's initialization also depends on which runs were executed before it.
    device = setup_training_environment(seed=42)
    # This dictionary will store the final results of all 12 training runs.
    all_training_results = {regime: {} for regime in data_splits.keys()}

    # --- Pre-load the tokenizer for the text transformer model ---
    # This is done once to avoid repeated loading inside the loop.
    tokenizer_id = study_config['model_training']['architectures']['text_transformer']['base_model_identifier']
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

    # --- Main Experimental Loop: Iterate through Regimes and Models ---
    for regime_name in data_splits.keys():
        for model_type in study_config['model_training']['architectures'].keys():

            logging.info(f"\n--- Starting Training Run: [Regime: {regime_name}] - [Model: {model_type}] ---")

            # --- Step 1: Setup Checkpoint Path and Check for Existing Model ---
            # Define a systematic path for saving the model checkpoint.
            checkpoint_path = Path(base_checkpoint_dir) / regime_name / f"{model_type}_best.pth"

            # Idempotency: Skip training if the model already exists and not forcing retrain.
            if checkpoint_path.exists() and not force_retrain:
                logging.info(f"Checkpoint found at '{checkpoint_path}'. Skipping training.")
                # Store the existing path and a placeholder for history.
                all_training_results[regime_name][model_type] = {
                    'best_model_path': str(checkpoint_path),
                    'history': 'Skipped; loaded from checkpoint.'
                }
                continue

            # --- Step 2: Instantiate Model and Prepare Data ---
            # Use the factory helper to get the correct model and features.
            model, train_features, train_targets = _get_model_and_features_for_run(
                model_type, regime_name, 'training', study_config, data_splits,
                tfidf_features, embedding_features, combined_features, tokenizer
            )
            _, val_features, val_targets = _get_model_and_features_for_run(
                model_type, regime_name, 'validation', study_config, data_splits,
                tfidf_features, embedding_features, combined_features, tokenizer
            )

            # Move the model to the configured compute device.
            model.to(device)

            # --- Step 3: Create DataLoaders ---
            # Configure DataLoaders for training and validation sets.
            train_loader = create_dataloaders(
                train_features, train_targets, model_type,
                batch_size=study_config['model_training']['global_params']['batch_size'],
                shuffle=True
            )
            val_loader = create_dataloaders(
                val_features, val_targets, model_type,
                batch_size=study_config['model_training']['global_params']['batch_size'],
                shuffle=False
            )

            # --- Step 4: Setup Optimization Components ---
            # Configure the loss function, optimizer, and scheduler for this model.
            optimization_components = setup_optimization_components(
                model, study_config['model_training']['global_params']
            )

            # --- Step 5: Execute the Training Orchestrator ---
            # Delegate the actual training process to the function from Task 14.
            training_result = run_training_orchestrator(
                model=model,
                optimization_components=optimization_components,
                train_loader=train_loader,
                val_loader=val_loader,
                device=device,
                num_epochs=50,  # A reasonable maximum number of epochs.
                patience=5,     # Standard patience for early stopping.
                checkpoint_path=checkpoint_path
            )

            # --- Step 6: Store Results ---
            # Save the training history and the path to the best model.
            all_training_results[regime_name][model_type] = training_result

            logging.info(f"--- Finished Training Run: [Regime: {regime_name}] - [Model: {model_type}] ---")

    logging.info("\n====== Full Regime-Specific Training Pipeline Finished. ======")
    return all_training_results
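

# --- Illustrative sketch (not part of the pipeline) ---
# The checkpoint-based idempotency used by the pipeline above can be shown in
# isolation with the standard library. `_sketch_run_or_skip` and the `train_fn`
# callback are hypothetical stand-ins for the real training call; the path
# layout (<base>/<regime>/<model_type>_best.pth) follows the code above.
from pathlib import Path
from typing import Callable, Dict, List


def _sketch_run_or_skip(
    base_dir: str,
    regimes: List[str],
    model_types: List[str],
    train_fn: Callable[[Path], None],
    force_retrain: bool = False
) -> Dict[str, Dict[str, str]]:
    """Runs `train_fn` only for (regime, model) pairs without a checkpoint."""
    results: Dict[str, Dict[str, str]] = {regime: {} for regime in regimes}
    for regime in regimes:
        for model_type in model_types:
            checkpoint_path = Path(base_dir) / regime / f"{model_type}_best.pth"
            # Skip the run when a checkpoint exists and retraining is not forced.
            if checkpoint_path.exists() and not force_retrain:
                results[regime][model_type] = 'skipped'
                continue
            checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
            train_fn(checkpoint_path)
            results[regime][model_type] = 'trained'
    return results

# One consequence of this design: a checkpoint left behind by an interrupted or
# outdated run is reused silently unless `force_retrain=True` is passed.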


# Task 16: Model Loading and Inference Setup

def generate_predictions_for_split(
    model: nn.Module,
    dataloader: DataLoader,
    device: torch.device
) -> np.ndarray:
    """
    Generates predictions for a given dataset using a trained model.

    This function runs a model in inference mode over a dataloader, collects
    all batch predictions, and returns them as a single NumPy array.

    Args:
        model: The trained PyTorch model (already in eval mode).
        dataloader: The DataLoader for the dataset to be predicted.
        device: The compute device ('cuda' or 'cpu').

    Returns:
        A 1D NumPy array containing the model's predictions.
    """
    # --- Set model to evaluation mode and disable gradients ---
    # This is a redundant safety check; the orchestrator should already do this.
    model.eval()

    # List to store predictions from each batch.
    all_predictions = []

    # The `torch.no_grad()` context manager is critical for inference.
    with torch.no_grad():
        # Iterate over the data with a progress bar.
        for batch in tqdm(dataloader, desc="Generating Predictions"):
            # Move the batch of data to the configured device.
            features, _ = _handle_batch_input(batch, device)

            # --- Forward Pass ---
            # Get model predictions for the batch.
            if isinstance(features, dict):
                predictions = model(**features)
            else:
                predictions = model(features)

            # --- Collect Predictions ---
            # Move predictions to the CPU and convert to a NumPy array before appending.
            # This frees up GPU memory.
            all_predictions.append(predictions.cpu().numpy())

    # Concatenate the list of batch predictions into a single NumPy array.
    predictions_array = np.vstack(all_predictions)

    # Flatten the array from (num_samples, 1) to (num_samples,) for easier use.
    return predictions_array.flatten()


def run_inference_pipeline(
    training_results: Dict[str, Any],
    data_splits: DataSplits,
    tfidf_features: TfidfFeatures,
    embedding_features: EmbeddingFeatures,
    combined_features: CombinedFeatures,
    study_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Orchestrates the entire inference pipeline for all trained models.

    This master function iterates through each trained model from the training
    results, loads its checkpoint, generates predictions on its corresponding
    test set, and collates all predictions into a single, comprehensive DataFrame.

    Args:
        training_results: The results dictionary from the training pipeline.
        data_splits: The nested dictionary of data splits.
        tfidf_features: The nested dictionary of TF-IDF features.
        embedding_features: The nested dictionary of sentence embedding features.
        combined_features: The nested dictionary of combined features.
        study_config: The complete study configuration dictionary.

    Returns:
        A single, tidy pandas DataFrame containing predictions and ground truth
        for all model-regime test sets.
    """
    logging.info("====== Starting Full Inference Pipeline ======")

    # --- Setup environment ---
    # Device selection should be consistent with the training environment.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Pre-load the tokenizer once.
    tokenizer_id = study_config['model_training']['architectures']['text_transformer']['base_model_identifier']
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

    # List to store DataFrame chunks for each test set.
    all_results_dfs = []

    # --- Main Inference Loop: Iterate through Regimes and Models ---
    for regime_name, model_runs in training_results.items():
        for model_type, run_result in model_runs.items():

            logging.info(f"\n--- Generating Predictions for: [Regime: {regime_name}] - [Model: {model_type}] ---")

            # --- Step 1: Load the Best Model from Checkpoint ---
            checkpoint_path = run_result['best_model_path']
            if not Path(checkpoint_path).exists():
                logging.warning(f"Checkpoint not found for {regime_name}-{model_type} at {checkpoint_path}. Skipping.")
                continue

            # Instantiate a new model of the correct architecture.
            model_config = study_config['model_training']['architectures'][model_type]
            model_classes = {
                'lstm': LSTMRegressionModel,
                'text_transformer': TextTransformerRegressionModel,
                'feature_transformer': FeatureEnhancedMLP
            }
            model = model_classes[model_type](model_config)

            # Load the saved weights from the checkpoint file.
            model.load_state_dict(torch.load(checkpoint_path, map_location=device))

            # Move the model to the compute device and set to evaluation mode.
            model.to(device)
            model.eval()

            # --- Step 2: Prepare the Test Data ---
            # Get the specific test DataFrame for this run.
            test_df = data_splits[regime_name]['test']
            if test_df.empty:
                logging.warning(f"Test set for {regime_name}-{model_type} is empty. Skipping.")
                continue

            # Use the factory helper to get the correct features and targets for the test set.
            _, test_features, test_targets = _get_model_and_features_for_run(
                model_type, regime_name, 'test', study_config, data_splits,
                tfidf_features, embedding_features, combined_features, tokenizer
            )

            # Create the DataLoader for the test set. `shuffle=False` is critical.
            test_loader = create_dataloaders(
                test_features, test_targets, model_type,
                batch_size=study_config['model_training']['global_params']['batch_size'] * 2, # Use larger batch for inference
                shuffle=False
            )

            # --- Step 3: Generate and Store Predictions ---
            # Generate predictions for the entire test set.
            predictions = generate_predictions_for_split(model, test_loader, device)

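            # Create a temporary DataFrame from the test set to store results.
            result_df = test_df.copy()
            # Tag each row with its regime if the split does not already carry it;
            # the MSE computation in Task 17 groups by 'regime' and 'model_type'.
            if 'regime' not in result_df.columns:
                result_df['regime'] = regime_name
            result_df['model_type'] = model_type
            result_df['prediction'] = predictions
            # The ground truth is already in the 'target_return' column.
            result_df = result_df.rename(columns={'target_return': 'ground_truth'})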

            # Append this chunk to the master list.
            all_results_dfs.append(result_df)

            logging.info(f"Generated {len(predictions)} predictions.")

    # --- Final Assembly ---
    # Concatenate all the individual result DataFrames into one large DataFrame.
    if not all_results_dfs:
        logging.warning("No predictions were generated. Returning an empty DataFrame.")
        return pd.DataFrame()

    final_predictions_df = pd.concat(all_results_dfs)

    logging.info("\n====== Full Inference Pipeline Finished. ======")
    logging.info(f"Total predictions generated: {len(final_predictions_df)}")

    return final_predictions_df


# Task 17: Mean Squared Error (MSE) Computation

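def _bootstrap_mse_ci(
    group: pd.DataFrame,
    n_bootstraps: int = 1000,
    alpha: float = 0.05,
    seed: int = 42
) -> Tuple[float, float]:
    """
    Internal helper to calculate a percentile-bootstrap confidence interval for MSE.

    The resampling is driven by a seeded random state so that the reported
    interval is reproducible across calls.
    """
    # Initialize an array to store the MSE from each bootstrap sample.
    bootstrap_mses = np.zeros(n_bootstraps)
    # Get the number of samples in the group.
    n_samples = len(group)
    # Seeded random state; reusing one object across iterations advances its
    # state, so every resample is different but the sequence is repeatable.
    rng = np.random.RandomState(seed)

    # Perform the bootstrapping.
    for i in range(n_bootstraps):
        # Create a bootstrap sample by sampling with replacement.
        resample = group.sample(n=n_samples, replace=True, random_state=rng)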
        # Calculate the squared error for the resampled data.
        squared_errors = (resample['prediction'] - resample['ground_truth']) ** 2
        # Calculate and store the MSE for this bootstrap sample.
        bootstrap_mses[i] = squared_errors.mean()

    # Calculate the lower and upper bounds of the confidence interval.
    lower_bound = np.percentile(bootstrap_mses, (alpha / 2) * 100)
    upper_bound = np.percentile(bootstrap_mses, (1 - alpha / 2) * 100)

    return lower_bound, upper_bound


def compute_and_validate_mse(
    predictions_df: pd.DataFrame,
    n_bootstraps: int = 1000
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Computes Mean Squared Error (MSE) and its confidence intervals.

    This function calculates the MSE for each model-regime pair from the
    provided predictions DataFrame. It also performs bootstrap resampling to
    estimate a 95% confidence interval for each MSE value, providing a measure
    of statistical uncertainty.

    Args:
        predictions_df: A tidy DataFrame containing predictions, ground truth,
                        regime, and model_type.
        n_bootstraps: The number of bootstrap samples to generate for
                      confidence intervals.

    Returns:
        A tuple containing:
        - A DataFrame of MSE values, pivoted for presentation.
        - A DataFrame of confidence intervals (lower and upper bounds).
    """
    # --- Input Validation ---
    required_cols = {'prediction', 'ground_truth', 'regime', 'model_type'}
    if not required_cols.issubset(predictions_df.columns):
        raise ValueError(f"Input DataFrame is missing required columns: {required_cols - set(predictions_df.columns)}")

    # --- Step 1: Calculate Squared Error ---
    # This is the core calculation, performed once in a vectorized manner.
    # Equation: SE_i = (prediction_i - ground_truth_i)^2
    df = predictions_df.copy()
    df['squared_error'] = (df['prediction'] - df['ground_truth']) ** 2

    # --- Step 2: Calculate Mean Squared Error per Group ---
    # Group by regime and model type and calculate the mean of the squared errors.
    # Equation: MSE = 1/N * sum(SE_i) for each group.
    mse_series = df.groupby(['regime', 'model_type'])['squared_error'].mean()

    # Pivot the resulting series into the desired wide-format DataFrame.
    mse_table = mse_series.unstack(level='model_type')

    # --- Step 3: Calculate Bootstrap Confidence Intervals ---
    logging.info(f"Calculating bootstrap CIs with {n_bootstraps} samples...")
    # Group by regime and model type and apply the bootstrap helper function.
    ci_series = df.groupby(['regime', 'model_type']).apply(
        _bootstrap_mse_ci, n_bootstraps=n_bootstraps
    )

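    # Format the confidence intervals into a readable DataFrame.
    # A (regime, model_type) combination with no rows becomes NaN after
    # unstacking, so only tuple-valued cells are formatted.
    ci_df = ci_series.unstack(level='model_type')
    for col in ci_df.columns:
        ci_df[col] = ci_df[col].apply(
            lambda x: f"[{x[0]:.2f}, {x[1]:.2f}]" if isinstance(x, tuple) else "n/a"
        )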

    return mse_table, ci_df
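

# --- Illustrative sketch (not part of the original pipeline) ---
# A minimal worked example of Steps 1 and 2 above on hypothetical toy data:
# squared errors are averaged per (regime, model_type) group and pivoted so
# that regimes are rows and models are columns. The function name, regime
# labels, and values are assumptions made for illustration only.
import pandas as pd


def _example_mse_pivot() -> pd.DataFrame:
    toy = pd.DataFrame({
        'regime': ['pre', 'pre', 'post', 'post'],
        'model_type': ['lstm', 'lstm', 'lstm', 'lstm'],
        'prediction': [1.0, 3.0, 0.0, 2.0],
        'ground_truth': [0.0, 1.0, 0.0, 0.0],
    })
    toy['squared_error'] = (toy['prediction'] - toy['ground_truth']) ** 2
    # 'pre': (1 + 4) / 2 = 2.5; 'post': (0 + 4) / 2 = 2.0
    grouped = toy.groupby(['regime', 'model_type'])['squared_error'].mean()
    return grouped.unstack(level='model_type')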


def generate_mse_summary_table(
    mse_table: pd.DataFrame
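) -> "pd.io.formats.style.Styler":  # Quoted so the annotation is not evaluated at definition time; pandas may not have imported the Styler module yet.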
    """
    Formats the MSE table for publication-quality presentation.

    This function takes the raw MSE table and applies styling to match the
    format of Table 3 in the paper, including number formatting and
    highlighting the best-performing model in each regime.

    Args:
        mse_table: The pivoted DataFrame of MSE values.

    Returns:
        A pandas Styler object, which can be rendered in environments like
        Jupyter notebooks or exported to HTML/LaTeX.
    """
    # --- Apply Styling using the pandas Style API ---
    styled_table = mse_table.style \
        .set_caption("Table 3 (Reproduced): Mean Squared Error (MSE) of models across economic regimes") \
        .format("{:.2f}") \
        .highlight_min(axis=1, props='font-weight: bold;')

    return styled_table


def run_mse_evaluation_suite(
    predictions_df: pd.DataFrame,
    study_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Orchestrates the full MSE evaluation and reporting pipeline.

    Args:
        predictions_df: The tidy DataFrame of all model predictions.
        study_config: The complete study configuration dictionary.

    Returns:
        The raw, unstyled DataFrame of MSE values for programmatic use.
    """
    logging.info("--- Running Task 17: Mean Squared Error (MSE) Evaluation Suite ---")

    # --- Step 1 & 2: Compute MSE and Confidence Intervals ---
    logging.info("\nStep 1 & 2: Computing MSE and validating with bootstrap confidence intervals...")
    mse_table, ci_table = compute_and_validate_mse(predictions_df)

    logging.info("\nMean Squared Error (MSE) Table (raw values):\n" + mse_table.to_string(float_format='{:.4f}'.format))
    logging.info("\n95% Confidence Intervals for MSE:\n" + ci_table.to_string())

    # --- Step 3: Generate Formatted Summary Table ---
    logging.info("\nStep 3: Generating publication-quality summary table...")
    styled_mse_table = generate_mse_summary_table(mse_table)

    # In a script, you might save this to HTML. In a notebook, `display` is used.
    # styled_mse_table.to_html("mse_summary_table.html")
    logging.info("Displaying formatted MSE summary table (equivalent to Table 3):")
    display(styled_mse_table)

    logging.info("\n>>> MSE evaluation suite completed successfully. <<<")

    # Return the raw data table for further programmatic use.
    return mse_table


# Task 18: Prediction Storage and Metadata Management

def enrich_and_store_predictions(
    predictions_df: pd.DataFrame,
    export_path: Union[str, Path],
    metadata: Dict[str, Any]
) -> pd.DataFrame:
    """
    Enriches a predictions DataFrame with diagnostic metrics and metadata,
    and provides a robust method for storage and validation.

    This function serves as the final step in the prediction generation pipeline.
    It takes the raw predictions and transforms them into a comprehensive,
    auditable artifact by:
    1.  Calculating per-sample error metrics (Squared Error, Absolute Error).
    2.  Calculating directional accuracy.
    3.  Embedding user-provided metadata (e.g., experiment ID, timestamp)
        into the DataFrame.
    4.  Saving the enriched DataFrame to a pickle file (highest available protocol).
    5.  Creating a checksum file to validate the integrity of the stored artifact.

    Args:
        predictions_df: The tidy DataFrame of all model predictions from the
                        inference pipeline.
        export_path: The file path (e.g., 'results/predictions.pkl') where the
                     enriched DataFrame will be saved.
        metadata: A dictionary containing metadata to be added to the DataFrame.

    Returns:
        The enriched pandas DataFrame with added diagnostic and metadata columns.

    Raises:
        ValueError: If the input DataFrame is missing required columns.
    """
    logging.info("--- Running Task 18: Prediction Storage and Metadata Management ---")

    # --- Input Validation ---
    required_cols = {'prediction', 'ground_truth'}
    if not required_cols.issubset(predictions_df.columns):
        raise ValueError(f"Input DataFrame is missing required columns: {required_cols - set(predictions_df.columns)}")

    # --- Operate on a copy to ensure purity ---
    enriched_df = predictions_df.copy()

    # --- Step 1: Generate Prediction Quality Diagnostics ---
    logging.info("Step 1: Generating per-sample prediction quality diagnostics...")

    # Calculate Squared Error (SE), the contribution of each sample to MSE.
    # Equation: SE_i = (prediction_i - ground_truth_i)^2
    enriched_df['squared_error'] = (enriched_df['prediction'] - enriched_df['ground_truth']) ** 2

    # Calculate Absolute Error (AE), the contribution to MAE.
    # Equation: AE_i = |prediction_i - ground_truth_i|
    enriched_df['absolute_error'] = (enriched_df['prediction'] - enriched_df['ground_truth']).abs()

    # Calculate Directional Accuracy.
    # This is True if the signs of the prediction and ground truth match.
    # np.sign maps values to -1, 0, or 1, so a zero on one side only matches
    # an exact zero on the other.
    enriched_df['directional_accuracy'] = np.sign(enriched_df['prediction']) == np.sign(enriched_df['ground_truth'])

    logging.info("Added 'squared_error', 'absolute_error', and 'directional_accuracy' columns.")

    # --- Step 2: Embed Metadata ---
    logging.info("Step 2: Embedding metadata into the DataFrame...")
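    # Add a timezone-aware UTC timestamp for when the artifact was created.
    # (datetime.utcnow() is deprecated since Python 3.12 and returns a naive datetime.)
    enriched_df['artifact_timestamp_utc'] = pd.Timestamp.now(tz='UTC')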

    # Add all user-provided metadata as new columns.
    # This makes every prediction row self-describing and fully auditable.
    for key, value in metadata.items():
        enriched_df[key] = value

    logging.info(f"Added metadata columns: {list(metadata.keys()) + ['artifact_timestamp_utc']}")

    # --- Step 3: Create Prediction Export and Validation System ---
    logging.info(f"Step 3: Exporting enriched DataFrame to '{export_path}'...")

    # Ensure the parent directory exists.
    export_path = Path(export_path)
    export_path.parent.mkdir(parents=True, exist_ok=True)

    # Save the DataFrame using pickle with a modern, efficient protocol.
    # Pickle is chosen as it perfectly preserves dtypes and index structure.
    with open(export_path, "wb") as f:
        pickle.dump(enriched_df, f, protocol=pickle.HIGHEST_PROTOCOL)

    # --- Create a checksum for data integrity validation ---
    # Read the file back in binary mode to compute the hash.
    with open(export_path, "rb") as f:
        file_bytes = f.read()
        # Compute the SHA256 hash of the file content.
        checksum = hashlib.sha256(file_bytes).hexdigest()

    # Define the path for the checksum file.
    checksum_path = export_path.with_suffix(export_path.suffix + '.sha256')

    # Write the checksum to its corresponding file.
    with open(checksum_path, "w") as f:
        f.write(checksum)

    logging.info(f"Successfully saved DataFrame. Integrity checksum saved to '{checksum_path}'.")

    logging.info("\n>>> Prediction enrichment and storage suite completed successfully. <<<")

    return enriched_df


def load_and_validate_predictions(
    import_path: Union[str, Path]
) -> pd.DataFrame:
    """
    Loads a prediction artifact and validates its integrity using a checksum.

    Args:
        import_path: The file path of the prediction artifact to load.

    Returns:
        The loaded and validated pandas DataFrame.

    Raises:
        FileNotFoundError: If the artifact or its checksum file is not found.
        ValueError: If the checksum validation fails, indicating data corruption.
    """
    import_path = Path(import_path)
    checksum_path = import_path.with_suffix(import_path.suffix + '.sha256')

    # --- Check for file existence ---
    if not import_path.exists():
        raise FileNotFoundError(f"Prediction artifact not found at '{import_path}'")
    if not checksum_path.exists():
        raise FileNotFoundError(f"Checksum file not found at '{checksum_path}'")

    # --- Validate Integrity ---
    # Read the expected checksum, ignoring surrounding whitespace such as a
    # trailing newline added by an editor or an external tool.
    with open(checksum_path, "r") as f:
        expected_checksum = f.read().strip()

    # Read the artifact file once and compute its actual checksum.
    with open(import_path, "rb") as f:
        file_bytes = f.read()
    actual_checksum = hashlib.sha256(file_bytes).hexdigest()

    # Compare the checksums.
    if expected_checksum != actual_checksum:
        raise ValueError(
            f"Data integrity check failed for '{import_path}'. "
            "The file may be corrupted."
        )

    logging.info(f"Checksum validation passed for '{import_path}'.")

    # --- Load the DataFrame ---
    # Deserialize the exact bytes that were just validated. The checksum only
    # detects accidental corruption; unpickling can execute arbitrary code, so
    # load artifacts from trusted sources only.
    predictions_df = pickle.loads(file_bytes)

    logging.info(f"Successfully loaded prediction artifact with shape {predictions_df.shape}.")

    return predictions_df
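

# --- Illustrative sketch (not part of the original pipeline) ---
# The functions above read the whole artifact into memory in order to hash it.
# For very large artifacts, the same SHA-256 digest can be computed
# incrementally. The helper name and the 1 MiB default chunk size are
# assumptions made for illustration.
import hashlib
from pathlib import Path
from typing import Union


def _example_sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops at the empty bytes object returned at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()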


# Task 19: Diagnostic Metric Orchestrator Function Creation
# Task 20: Financial Causal Attribution Score (FCAS) Implementation
# Task 21: Patent Cliff Sensitivity (PCS) Implementation
# Task 22: Temporal Semantic Volatility (TSV) Computation
# Task 23: NLI-based Logical Consistency Score (NLICS) Implementation

def _extract_causal_polarity(
    text: str,
    positive_keywords: List[str],
    negative_keywords: List[str]
) -> int:
    """
    Internal helper to extract causal polarity from text via keyword counting.

    This function performs a simple keyword match to determine the overall
    sentiment polarity of a given text. It is case-insensitive and matches
    whole words only.

    Args:
        text: The input string to analyze.
        positive_keywords: A list of keywords indicating positive sentiment.
        negative_keywords: A list of keywords indicating negative sentiment.

    Returns:
        An integer representing the polarity: 1 for positive, -1 for negative,
        and 0 for neutral or ambiguous.
    """
    # Ensure the input is a string.
    if not isinstance(text, str):
        return 0

    # Count occurrences of positive keywords using a case-insensitive, whole-word regex search.
    # Keywords are escaped so that any regex metacharacters in them are matched literally.
    pos_count = sum(len(re.findall(r'\b' + re.escape(kw) + r'\b', text, re.IGNORECASE)) for kw in positive_keywords)

    # Count occurrences of negative keywords in the same way.
    neg_count = sum(len(re.findall(r'\b' + re.escape(kw) + r'\b', text, re.IGNORECASE)) for kw in negative_keywords)

    # Determine the final polarity based on the counts.
    if pos_count > neg_count:
        return 1
    elif neg_count > pos_count:
        return -1
    else:
        return 0
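

# --- Illustrative sketch (not part of the original pipeline) ---
# The helper above runs one regex search per keyword. For a list of distinct
# single-word keywords, the same total can be obtained with one precompiled
# alternation. The function name is hypothetical.
import re
from typing import List


def _example_count_keywords(text: str, keywords: List[str]) -> int:
    if not keywords:
        return 0
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(kw) for kw in keywords) + r')\b',
        re.IGNORECASE,
    )
    # The group is non-capturing, so findall returns one entry per whole-word match.
    return len(pattern.findall(text))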


def compute_fcas(predictions_df: pd.DataFrame, **kwargs) -> pd.Series:
    """
    Computes the Financial Causal Attribution Score (FCAS).

    FCAS measures the alignment between a model's prediction direction (positive
    or negative return) and the causal sentiment polarity implied by the input
    text. A higher score indicates that the model's predictions are more
    consistent with the simple causal cues in the news.

    The calculation follows the equation from the paper:
    FCAS = E[I(sign(prediction) == sign(causal_cue))]
    where the expectation is taken over all samples with non-zero causal cues.

    Args:
        predictions_df: The enriched DataFrame containing predictions, ground
                        truth, and the 'aggregated_text' for each sample.
        **kwargs: Catches unused arguments from the orchestrator.

    Returns:
        A pandas Series containing the FCAS for each (regime, model_type) pair,
        indexed by a MultiIndex.
    """
    # Log the start of the computation.
    logging.info("Computing Financial Causal Attribution Score (FCAS)...")

    # Work on a copy to avoid modifying the original DataFrame.
    df = predictions_df.copy()

    # Define comprehensive, financially relevant keywords for sentiment polarity.
    positive_keywords = ["growth", "surge", "increase", "rise", "gain", "profit", "beat", "strong", "positive", "expansion", "boost", "outperform", "record", "upbeat"]
    negative_keywords = ["decline", "fall", "decrease", "drop", "loss", "miss", "weak", "negative", "contraction", "plunge", "crash", "underperform", "risk", "concern"]

    # Apply the helper function to each row to determine the causal polarity of the text.
    df['causal_polarity'] = df['aggregated_text'].apply(
        _extract_causal_polarity, args=(positive_keywords, negative_keywords)
    )

    # Filter out samples where the text was neutral or ambiguous (polarity = 0).
    df_filtered = df[df['causal_polarity'] != 0].copy()

    # Handle the case where no samples have a clear causal direction.
    if df_filtered.empty:
        logging.warning("No samples with clear causal polarity found for FCAS calculation.")
        return pd.Series(name='FCAS', dtype=float)

    # Determine the alignment: True if the sign of the prediction matches the sign of the polarity.
    df_filtered['alignment'] = (np.sign(df_filtered['prediction']) == np.sign(df_filtered['causal_polarity']))

    # Calculate the final FCAS by taking the mean of the boolean 'alignment' series for each group.
    fcas_series = df_filtered.groupby(['regime', 'model_type'])['alignment'].mean()

    # Return the resulting Series.
    return fcas_series


def compute_tsv(embedding_features: Dict[str, Dict[str, np.ndarray]], data_splits: Dict[str, Dict[str, pd.DataFrame]], **kwargs) -> pd.Series:
    """
    Computes the Temporal Semantic Volatility (TSV).

    TSV quantifies the average semantic drift between consecutive news articles
    within the test set of each regime. It is calculated as the mean Euclidean
    distance between the sentence embeddings of chronologically adjacent texts.
    A higher TSV indicates greater narrative change over time.

    The calculation follows the equation from the paper:
    TSV = (1/(N-1)) * sum(||embedding_{i+1} - embedding_i||_2)

    Args:
        embedding_features: A nested dictionary containing the sentence embedding
                            NumPy arrays for all data splits.
        data_splits: The nested dictionary of data splits, used to access the
                     date index for chronological sorting.
        **kwargs: Catches unused arguments from the orchestrator.

    Returns:
        A pandas Series containing the TSV for each regime, indexed by regime name.
    """
    # Log the start of the computation.
    logging.info("Computing Temporal Semantic Volatility (TSV)...")

    # Dictionary to store the final TSV score for each regime.
    tsv_scores = {}

    # Iterate through each regime defined in the data splits.
    for regime_name in data_splits.keys():
        # Get the test set DataFrame for the current regime.
        test_df = data_splits[regime_name]['test']

        # A minimum of two data points are required to compute a difference.
        if len(test_df) < 2:
            tsv_scores[regime_name] = np.nan
            continue

        # Retrieve the corresponding sentence embeddings for the test set.
        test_embeddings = embedding_features[regime_name]['test']

        # Create a temporary DataFrame to align embeddings with their dates.
        temp_df = pd.DataFrame(test_embeddings, index=test_df.index)

        # Sort this DataFrame chronologically by the 'date' level of the index.
        temp_df = temp_df.sort_index(level='date')

        # Extract the sorted embeddings as a NumPy array.
        sorted_embeddings = temp_df.values

        # Calculate the vector difference between each consecutive embedding.
        # `np.diff` computes the difference between adjacent elements.
        differences = np.diff(sorted_embeddings, axis=0)

        # Calculate the Euclidean (L2) norm of each difference vector.
        distances = np.linalg.norm(differences, axis=1)

        # The TSV is the mean of these distances.
        tsv_scores[regime_name] = distances.mean()

    # Convert the dictionary of scores to a pandas Series and return it.
    return pd.Series(tsv_scores, name='TSV')
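

# --- Illustrative sketch (not part of the original pipeline) ---
# A worked instance of the TSV formula on hypothetical 2-D "embeddings" that
# are already in chronological order. For the rows (0, 0), (3, 4), (3, 4) the
# consecutive distances are 5 and 0, so TSV = (5 + 0) / 2 = 2.5.
import numpy as np


def _example_tsv(sorted_embeddings: np.ndarray) -> float:
    # Difference between each embedding and its predecessor.
    differences = np.diff(sorted_embeddings, axis=0)
    # Mean Euclidean length of those difference vectors.
    return float(np.linalg.norm(differences, axis=1).mean())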


def _perturb_text(text: str, perturbation_map: Dict[str, str]) -> Union[str, None]:
    """
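    Internal helper to apply a single keyword perturbation.

    This function walks the perturbation map in insertion order and, for the
    first keyword that occurs anywhere in the text, replaces that keyword's
    first occurrence with its counterpart. At most one replacement is made,
    and the replacement is inserted exactly as written in the map (the
    capitalization of the matched word is not preserved).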

    Args:
        text: The input string to perturb.
        perturbation_map: A dictionary mapping keywords to their replacements.

    Returns:
        The perturbed string, or None if no keywords were found to replace.
    """
    # Ensure the input is a string.
    if not isinstance(text, str):
        return None

    # Iterate through the keyword-replacement pairs.
    for word, replacement in perturbation_map.items():
        # Use a case-insensitive, whole-word regex pattern.
        pattern = r'\b' + re.escape(word) + r'\b'
        # Check if the keyword exists in the text.
        if re.search(pattern, text, re.IGNORECASE):
            # If found, perform a single replacement and return immediately.
            return re.sub(pattern, replacement, text, count=1, flags=re.IGNORECASE)

    # If the loop completes without finding any keywords, return None.
    return None
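

# --- Illustrative sketch (not part of the original pipeline) ---
# _perturb_text picks the first keyword in map order, which need not be the
# first keyword in the text. A hypothetical variant that instead replaces
# whichever keyword appears earliest in the text is sketched here. It is an
# alternative design, not the behaviour used for PCS, and it assumes the
# keywords are distinct whole words.
import re
from typing import Dict, Optional


def _example_perturb_earliest(text: str, perturbation_map: Dict[str, str]) -> Optional[str]:
    if not isinstance(text, str) or not perturbation_map:
        return None
    # Lookups are done in lower case so that "Rise" finds the entry for "rise".
    lowered = {word.lower(): repl for word, repl in perturbation_map.items()}
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(word) for word in lowered) + r')\b',
        re.IGNORECASE,
    )
    # search() returns the leftmost match, i.e. the earliest keyword in the text.
    match = pattern.search(text)
    if match is None:
        return None
    return text[:match.start()] + lowered[match.group(0).lower()] + text[match.end():]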


def compute_pcs(
    predictions_df: pd.DataFrame,
    training_results: Dict[str, Any],
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    study_config: Dict[str, Any],
    vectorizer: TfidfVectorizer,
    st_model: SentenceTransformer,
    tokenizer: PreTrainedTokenizerBase,
    **kwargs
) -> pd.Series:
    """
    Computes the Patent Cliff Sensitivity (PCS).

    PCS measures a model's output sensitivity to small, controlled semantic
    changes in the input text. It is calculated as the average absolute
    difference between a model's original prediction and its prediction on a
    semantically perturbed version of the input text.

    The calculation follows the equation from the paper:
    PCS = E[|f_theta(x) - f_theta(x_perturbed)|]

    This is a computationally intensive process as it requires re-running
    feature engineering and inference for many samples.

    Args:
        predictions_df: DataFrame with original predictions.
        training_results: Dictionary with paths to trained model checkpoints.
        data_splits: The original data splits to get the raw text.
        study_config: The main study configuration dictionary.
        vectorizer: The globally fitted TfidfVectorizer.
        st_model: The initialized SentenceTransformer model.
        tokenizer: The initialized HuggingFace tokenizer.
        **kwargs: Catches unused arguments.

    Returns:
        A pandas Series of PCS scores for each (regime, model_type) pair.
    """
    # Log the start of this intensive computation.
    logging.info("Computing Patent Cliff Sensitivity (PCS)...")

    # Determine the compute device.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Define the map for semantic perturbations. Most pairs are bidirectional;
    # "gain" and "profit" both map to "loss", which has no reverse entry.
    perturbation_map = {
        "growth": "decline", "increase": "decrease", "rise": "fall", "gain": "loss",
        "profit": "loss", "beat": "miss", "strong": "weak", "positive": "negative",
        "expansion": "contraction", "boost": "drop", "outperform": "underperform",
        "decline": "growth", "decrease": "increase", "fall": "rise",
        "miss": "beat", "weak": "strong", "negative": "positive",
        "contraction": "expansion", "drop": "boost", "underperform": "outperform"
    }

    # List to store the results (absolute differences) from all runs.
    all_diffs = []

    # --- Main Loop: Iterate through each of the 12 experimental cells ---
    for regime_name, model_runs in tqdm(training_results.items(), desc="PCS Regimes"):
        for model_type, run_result in model_runs.items():

            # --- 1. Load the specific model for this cell ---
            checkpoint_path = run_result['best_model_path']
            model_config = study_config['model_training']['architectures'][model_type]
            model_classes = {'lstm': LSTMRegressionModel, 'text_transformer': TextTransformerRegressionModel, 'feature_transformer': FeatureEnhancedMLP}
            model = model_classes[model_type](model_config)
            model.load_state_dict(torch.load(checkpoint_path, map_location=device))
            model.to(device).eval()

            # --- 2. Prepare data for this cell ---
            test_df = data_splits[regime_name]['test']
            if test_df.empty: continue

            # Get the original predictions for this cell's test set.
            original_preds = predictions_df[
                (predictions_df['regime'] == regime_name) & (predictions_df['model_type'] == model_type)
            ]

            # --- 3. Perturb texts and identify valid samples ---
            perturbed_texts = test_df['aggregated_text'].apply(lambda x: _perturb_text(x, perturbation_map))
            valid_indices = perturbed_texts.notna()

            if not valid_indices.any(): continue

            perturbed_texts_list = perturbed_texts[valid_indices].tolist()

            # --- 4. Re-run feature engineering and inference on perturbed data ---
            with torch.no_grad():
                # This block dynamically creates the correct feature set for the perturbed text.
                if model_type == 'text_transformer':
                    encodings = tokenizer(perturbed_texts_list, max_length=512, padding='max_length', truncation=True, return_tensors='pt')
                    # Use a dummy target array for the Dataset.
                    dataloader = DataLoader(TextTransformerDataset(encodings, np.zeros(len(perturbed_texts_list))), batch_size=128)
                else:
                    perturbed_tfidf = vectorizer.transform(perturbed_texts_list).toarray()
                    if model_type == 'lstm':
                        features = perturbed_tfidf
                    else: # feature_transformer
                        # Sentence embeddings are only needed for the feature-enhanced model.
                        perturbed_emb = st_model.encode(perturbed_texts_list, batch_size=128, show_progress_bar=False)
                        features = np.hstack([perturbed_tfidf, perturbed_emb])
                    dataloader = DataLoader(FinancialDataset(features, np.zeros(len(features))), batch_size=128)

                # --- 5. Generate predictions in batches ---
                perturbed_predictions = []
                for batch in dataloader:
                    features_batch, _ = _handle_batch_input(batch, device)
                    preds = model(**features_batch) if isinstance(features_batch, dict) else model(features_batch)
                    perturbed_predictions.append(preds.cpu().numpy())

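            # Flatten each batch before concatenating so that both (batch,) and
            # (batch, 1) outputs are handled, including a smaller final batch.
            perturbed_predictions = np.concatenate([p.reshape(-1) for p in perturbed_predictions])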

            # --- 6. Calculate and store absolute differences ---
            diffs_df = original_preds.loc[valid_indices].copy()
            diffs_df['perturbed_prediction'] = perturbed_predictions
            diffs_df['abs_diff'] = (diffs_df['prediction'] - diffs_df['perturbed_prediction']).abs()
            all_diffs.append(diffs_df[['regime', 'model_type', 'abs_diff']])

    # --- 7. Aggregate final PCS scores ---
    if not all_diffs:
        return pd.Series(name='PCS', dtype=float)

    final_diffs_df = pd.concat(all_diffs)
    pcs_series = final_diffs_df.groupby(['regime', 'model_type'])['abs_diff'].mean()
    return pcs_series
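

# --- Illustrative sketch (not part of the original pipeline) ---
# A worked instance of the PCS formula on hypothetical numbers: the mean
# absolute gap between original and perturbed predictions. For originals
# (0.5, -0.25) and perturbed (0.0, -0.25) the gaps are 0.5 and 0.0, so
# PCS = (0.5 + 0.0) / 2 = 0.25.
import numpy as np


def _example_pcs(original: np.ndarray, perturbed: np.ndarray) -> float:
    return float(np.abs(original - perturbed).mean())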


async def _get_nlics_score_for_row_cached(
    session: AsyncOpenAI,
    row: pd.Series,
    config: Dict[str, Any],
    cache: Dict[str, float],
    cache_path: Path
) -> float:
    """
    Async helper to get NLICS score for a single row with retry and caching.

    This function generates a unique key for the input, checks a cache,
    and only makes an API call if the result is not already cached. It includes
    robust error handling with exponential backoff for API calls.

    Args:
        session: The active AsyncOpenAI client session.
        row: A row from the predictions DataFrame.
        config: The NLICS configuration dictionary.
        cache: The in-memory cache dictionary.
        cache_path: The path to the persistent cache file.

    Returns:
        The calculated NLICS score (1.0, 0.5, or 0.0), or np.nan on failure.
    """
    # Generate the prediction hypothesis string.
    hypothesis = "The stock price will increase" if row['prediction'] > 0 else "The stock price will decrease"
    # Truncate text to a reasonable length to manage token costs and context windows.
    text_excerpt = row['aggregated_text'][:2000]

    # Create a unique, deterministic hash key for this text-hypothesis pair.
    cache_key = hashlib.sha256((text_excerpt + hypothesis).encode('utf-8')).hexdigest()

    # First, check the in-memory cache for the result.
    if cache_key in cache:
        return cache[cache_key]

    # If not in cache, format the prompt for the API call.
    user_prompt = config['prompt_template']['user'].format(news_excerpt=text_excerpt, prediction_hypothesis=hypothesis)

    # Implement a retry loop with exponential backoff for API call robustness.
    for attempt in range(3):
        try:
            # Make the asynchronous API call.
            response = await session.chat.completions.create(
                model=config['llm_model_identifier'],
                messages=[
                    {"role": "system", "content": config['prompt_template']['system']},
                    {"role": "user", "content": user_prompt}
                ],
                **config['llm_settings']
            )

            # Read the first generated token and its log-probability. This requires
            # log-probabilities to be enabled through `llm_settings`.
            top_logprob = response.choices[0].logprobs.content[0]
            token = top_logprob.token.strip()
            confidence = np.exp(top_logprob.logprob)

            # Apply the precise scoring rule from the paper.
            score = 0.5  # Default to 'Uncertain'
            if token == "Yes" and confidence > config['scoring_rules']['confidence_threshold']:
                score = 1.0
            elif token == "No":
                score = 0.0

            # Update the in-memory cache.
            cache[cache_key] = score
            # Append the new result to the persistent cache file for future runs.
            with open(cache_path, 'a') as f:
                f.write(json.dumps({'key': cache_key, 'score': score}) + '\n')
            return score
        except RateLimitError as e:
            # Back off before retrying. A 'retry_after' attribute is used when the
            # exception provides one; otherwise fall back to exponential backoff.
            wait_seconds = getattr(e, 'retry_after', None) or 2 ** attempt
            logging.warning(f"Rate limit hit. Waiting for {wait_seconds}s...")
            await asyncio.sleep(wait_seconds)
        except Exception as e:
            # Handle other transient errors.
            logging.warning(f"API call failed on attempt {attempt + 1}: {e}")
            if attempt < 2: await asyncio.sleep(2 ** attempt)

    # If all retries fail, return NaN.
    return np.nan
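

# --- Illustrative sketch (not part of the original pipeline) ---
# The scoring rule applied inside the retry loop above, isolated as a pure
# function so it can be checked without an API call. The function name is
# hypothetical; the rule mirrors the code above: a confident "Yes" scores 1.0,
# any "No" scores 0.0, and everything else is treated as uncertain (0.5).
import math


def _example_nlics_score(token: str, logprob: float, confidence_threshold: float) -> float:
    confidence = math.exp(logprob)
    if token == 'Yes' and confidence > confidence_threshold:
        return 1.0
    if token == 'No':
        return 0.0
    return 0.5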


async def _compute_nlics_async_cached(
    df: pd.DataFrame,
    config: Dict[str, Any],
    cache_path: Path
) -> pd.Series:
    """
    Asynchronous orchestrator for NLICS computation with persistent caching.

    This function manages the high-throughput, asynchronous execution of NLI
    evaluations using an external API (e.g., OpenAI). It is designed for
    efficiency and robustness by:
    1.  Loading a persistent, file-based cache to avoid re-processing and
        re-incurring costs for previously seen text-hypothesis pairs.
    2.  Creating an asynchronous API client.
    3.  Generating a list of concurrent tasks, one for each row in the input
        DataFrame.
    4.  Executing these tasks in parallel using `asyncio.gather`, with a
        progress bar for user feedback.

    Args:
        df (pd.DataFrame): The DataFrame containing the samples to be evaluated.
                           Must include 'aggregated_text' and 'prediction' columns.
        config (Dict[str, Any]): The NLICS configuration dictionary, containing
                                 API keys, model identifiers, and prompt templates.
        cache_path (Path): The file path to the JSONL cache file for reading
                           and writing results.

    Returns:
        pd.Series: A pandas Series containing the calculated NLICS score for each
                   sample, indexed identically to the input DataFrame. Failed
                   evaluations will be represented by `np.nan`.
    """
    # --- 1. Load Existing Cache from Disk ---
    # Initialize an in-memory dictionary to hold the cache for fast lookups.
    cache: Dict[str, float] = {}

    # Check if the cache file already exists.
    if cache_path.exists():
        # Open the cache file for reading.
        with open(cache_path, 'r') as f:
            # Iterate through each line in the JSONL file.
            for line in f:
                try:
                    # Attempt to load the JSON object from the line.
                    entry = json.loads(line)
                    # Populate the in-memory cache with the key and score.
                    cache[entry['key']] = entry['score']
                except (json.JSONDecodeError, KeyError):
                    # If a line is corrupted or malformed, log a warning and skip it.
                    logging.warning(f"Skipping corrupted or malformed line in cache file: {line.strip()}")
                    continue

    # Log the number of entries successfully loaded from the cache.
    logging.info(f"Loaded {len(cache)} entries from NLICS cache at '{cache_path}'.")

    # --- 2. Initialize Asynchronous API Client ---
    # Create an instance of the AsyncOpenAI client using the API key from the config.
    client = AsyncOpenAI(api_key=config['api_key'])

    # --- 3. Create Concurrent Tasks ---
    # Create a list of coroutine tasks. Each task is a call to the helper
    # function that will process one row of the DataFrame.
    tasks = [
        _get_nlics_score_for_row_cached(client, row, config, cache, cache_path)
        for _, row in df.iterrows()
    ]

    # --- 4. Execute Tasks Concurrently ---
    # `async_tqdm.gather` is a wrapper around `asyncio.gather` that provides a
    # real-time progress bar, which is essential for monitoring long-running
    # asynchronous operations. It runs all tasks in the `tasks` list concurrently.
    scores = await async_tqdm.gather(*tasks, desc="Computing NLICS Scores")

    # --- 5. Return Results ---
    # Convert the list of returned scores into a pandas Series.
    # The index of the Series is explicitly set to the index of the input
    # DataFrame to ensure perfect alignment.
    return pd.Series(scores, index=df.index)

def compute_nlics(predictions_df: pd.DataFrame, study_config: Dict[str, Any], **kwargs) -> pd.Series:
    """
    Computes the NLI-based Logical Consistency Score (NLICS).

    NLICS uses a Large Language Model (GPT-4) to assess whether a model's
    prediction is logically supported by the input news text. API calls are
    issued asynchronously, and a persistent, file-based cache avoids
    re-computation and repeated API costs on subsequent runs.

    Args:
        predictions_df: The enriched DataFrame of predictions.
        study_config: The main study configuration dictionary.
        **kwargs: Catches unused arguments.

    Returns:
        A pandas Series of NLICS scores for each (regime, model_type) pair.
    """
    # Log the start of the computation.
    logging.info("Computing NLICS (with async and caching)...")

    # Get the specific configuration for the NLICS metric.
    nlics_config = study_config['diagnostics']['nlics_metric']

    # Define the path for the persistent cache and ensure its directory exists.
    cache_path = Path("results/nlics_cache.jsonl")
    cache_path.parent.mkdir(parents=True, exist_ok=True)

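    # Run the main asynchronous logic. Note: `asyncio.run` raises a RuntimeError
    # if an event loop is already running (as in Jupyter); in that setting,
    # `await _compute_nlics_async_cached(...)` directly instead.
    scores = asyncio.run(_compute_nlics_async_cached(predictions_df, nlics_config, cache_path))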

    # Add the calculated scores to a copy of the DataFrame.
    df = predictions_df.copy()
    df['nlics_score'] = scores

    # Calculate the final NLICS by taking the mean score for each group.
    nlics_series = df.groupby(['regime', 'model_type'])['nlics_score'].mean()
    return nlics_series


def run_diagnostic_metrics_orchestrator(
    predictions_df: pd.DataFrame,
    study_config: Dict[str, Any],
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    embedding_features: Dict[str, Dict[str, np.ndarray]],
    training_results: Dict[str, Any],
    vectorizer: TfidfVectorizer,
    st_model: SentenceTransformer,
    tokenizer: PreTrainedTokenizerBase,
    max_workers: int = 4
) -> pd.DataFrame:
    """
    Orchestrates the computation of all diagnostic metrics.

    This master function manages the execution of the entire diagnostic suite,
    including the computationally intensive PCS and the I/O-bound NLICS. It is
    designed for robustness and performance, handling failures in individual
    metric calculations gracefully and structuring the workflow logically.

    The workflow is as follows:
    1.  MSE is calculated directly and synchronously as it is fast.
    2.  Simpler, CPU-bound metrics (FCAS, TSV) are run in parallel using a
        ProcessPoolExecutor.
    3.  The complex, computationally intensive PCS metric is run sequentially
        as it is a major pipeline in itself.
    4.  The I/O-bound, asynchronous NLICS metric is run sequentially to manage
        its async event loop and caching mechanism.
    5.  All results are aggregated into a final, comprehensive "Robustness Profile"
        DataFrame.

    Args:
        predictions_df: The enriched DataFrame containing predictions, ground
                        truth, and all necessary metadata.
        study_config: The complete study configuration dictionary.
        data_splits: The original nested dictionary of data splits.
        embedding_features: The nested dictionary of sentence embedding features.
        training_results: The dictionary containing paths to trained model checkpoints.
        vectorizer: The globally fitted TfidfVectorizer.
        st_model: The initialized SentenceTransformer model.
        tokenizer: The initialized HuggingFace tokenizer.
        max_workers: The maximum number of parallel processes to use for
                     the simpler metric computations.

    Returns:
        A single DataFrame containing the complete Robustness Profile for all
        model-regime pairs, with metrics as columns.
    """
    # Log the start of the master orchestration.
    logging.info("====== Starting Definitive Diagnostic Metrics Orchestrator ======")

    # --- 1. Synchronous MSE Calculation ---
    # MSE is fast to compute directly from the 'squared_error' column.
    logging.info("Computing MSE...")
    # The calculation is a simple groupby-mean operation.
    mse_series = predictions_df.groupby(['regime', 'model_type'])['squared_error'].mean()
    # Name the resulting series for later joining.
    mse_series.name = 'MSE'

    # --- 2. Parallel Computation of Simpler Metrics (FCAS, TSV) ---
    # These metrics are CPU-bound and independent, making them ideal for parallelization.
    simple_metrics_to_run: Dict[str, Callable] = {
        'FCAS': compute_fcas,
        'TSV': compute_tsv
    }
    # This dictionary bundles all necessary artifacts to pass to each function.
    shared_args = {
        'predictions_df': predictions_df,
        'data_splits': data_splits,
        'embedding_features': embedding_features
    }

    # List to hold the resulting pandas Series from the parallel jobs.
    parallel_metric_results = []
    # Use a ProcessPoolExecutor for true parallelism on CPU-bound tasks.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # Submit each function call to the executor pool.
        future_to_metric = {
            executor.submit(func, **shared_args): name
            for name, func in simple_metrics_to_run.items()
        }

        # Process results as they complete, with a progress bar.
        for future in tqdm(as_completed(future_to_metric), total=len(future_to_metric), desc="Computing FCAS & TSV"):
            metric_name = future_to_metric[future]
            try:
                # Retrieve the result from the completed job.
                result_series = future.result()
                # Assign the correct name to the series.
                result_series.name = metric_name
                parallel_metric_results.append(result_series)
                logging.info(f"Successfully computed metric: {metric_name}")
            except Exception as e:
                # Log any exceptions gracefully without crashing the orchestrator.
                logging.error(f"Metric calculation for '{metric_name}' FAILED: {e}", exc_info=True)

    # --- 3. Sequential Computation of Complex Metrics (PCS, NLICS) ---
    # These are run sequentially because they are either extremely complex pipelines
    # (PCS) or manage their own concurrency (NLICS).

    # Compute PCS.
    try:
        pcs_series = compute_pcs(
            predictions_df=predictions_df,
            training_results=training_results,
            data_splits=data_splits,
            study_config=study_config,
            vectorizer=vectorizer,
            st_model=st_model,
            tokenizer=tokenizer
        )
        pcs_series.name = 'PCS'
    except Exception as e:
        logging.error(f"Metric 'PCS' failed entirely: {e}", exc_info=True)
        pcs_series = None

    # Compute NLICS.
    try:
        nlics_series = compute_nlics(
            predictions_df=predictions_df,
            study_config=study_config
        )
        nlics_series.name = 'NLICS'
    except Exception as e:
        logging.error(f"Metric 'NLICS' failed entirely: {e}", exc_info=True)
        nlics_series = None

    # --- 4. Aggregate All Results into a Single DataFrame ---
    logging.info("Aggregating all metric results into a final Robustness Profile table...")

    # Start with the MSE results as the base DataFrame.
    robustness_profile = mse_series.to_frame()

    # Join the results from the parallel computations.
    for result_series in parallel_metric_results:
        # Special handling for TSV, which is indexed only by 'regime'.
        if result_series.name == 'TSV':
            robustness_profile = robustness_profile.join(result_series, on='regime')
        else:
            # Join other metrics on the full (regime, model_type) index.
            robustness_profile = robustness_profile.join(result_series, how='outer')

    # Join the results from the sequential computations if they were successful.
    if pcs_series is not None:
        robustness_profile = robustness_profile.join(pcs_series)
    if nlics_series is not None:
        robustness_profile = robustness_profile.join(nlics_series)

    # Sort the index for a clean, canonical ordering.
    robustness_profile = robustness_profile.sort_index()

    # Log the final, completed table.
    logging.info("\n====== Diagnostic Metrics Orchestration Finished. ======")
    logging.info("Final Robustness Profile:\n" + robustness_profile.to_string(float_format='{:.4f}'.format))

    # Return the final DataFrame.
    return robustness_profile
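
The persistent-cache pattern that `_compute_nlics_async_cached` relies on can be illustrated in isolation. The following is a minimal sketch, not the project's implementation: `make_key`, `load_cache`, `score_all`, and the `scorer` callable are hypothetical names introduced here, and the SHA-256 key scheme is an assumption. A semaphore is added to bound the number of in-flight requests; the listing above creates one task per row, and whether the row-level helper throttles requests is not shown here.

```python
import asyncio
import hashlib
import json
from pathlib import Path
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical scorer type: an async callable mapping (text, hypothesis) to a score.
Scorer = Callable[[str, str], Awaitable[float]]


def make_key(text: str, hypothesis: str) -> str:
    """Derive a stable cache key from a text-hypothesis pair (assumed scheme)."""
    return hashlib.sha256(f"{text}\x1f{hypothesis}".encode("utf-8")).hexdigest()


def load_cache(path: Path) -> Dict[str, float]:
    """Load a JSONL cache, skipping corrupted or incomplete lines."""
    cache: Dict[str, float] = {}
    if path.exists():
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    entry = json.loads(line)
                    cache[entry["key"]] = entry["score"]
                except (json.JSONDecodeError, KeyError, TypeError):
                    continue
    return cache


async def score_all(
    pairs: List[Tuple[str, str]],
    scorer: Scorer,
    cache_path: Path,
    max_concurrency: int = 8,
) -> List[float]:
    """Score every pair, calling `scorer` only for pairs missing from the cache."""
    cache = load_cache(cache_path)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(text: str, hypothesis: str) -> float:
        key = make_key(text, hypothesis)
        if key in cache:
            return cache[key]
        # At most `max_concurrency` scorer calls are in flight at once.
        async with semaphore:
            score = await scorer(text, hypothesis)
        cache[key] = score
        # The append contains no `await`, so writes from different tasks
        # cannot interleave within a line.
        with open(cache_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "score": score}) + "\n")
        return score

    # `asyncio.gather` returns results in the order of its arguments.
    return await asyncio.gather(*(score_one(t, h) for t, h in pairs))
```

Running `score_all` twice over the same pairs with the same cache file should invoke the scorer only during the first run. Duplicate pairs within a single batch may still be scored more than once, because both tasks can miss the cache before either finishes.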


In [None]:
# Task 24: Diagnostic Metric Validation and Quality Assurance

def run_diagnostic_validation_suite(
    robustness_profile: pd.DataFrame,
    study_config: Dict[str, Any]
) -> Tuple[pd.DataFrame, "pd.io.formats.style.Styler"]:
    """
    Performs a final quality assurance audit on the computed diagnostic metrics.

    This function takes the aggregated robustness profile, validates its
    completeness and the integrity of its values, computes a diagnostic
    correlation matrix, and generates a publication-quality summary table.

    Args:
        robustness_profile: The DataFrame containing the complete set of
                            computed diagnostic metrics.
        study_config: The complete study configuration dictionary, used to
                      verify all expected regimes and models are present.

    Returns:
        A tuple containing:
        - The validated (and potentially cleaned) robustness profile DataFrame.
        - A pandas Styler object representing the formatted, publication-quality
          summary report.

    Raises:
        ValueError: If a critical validation check (e.g., a metric value
                    being outside its theoretical range) fails.
    """
    logging.info("--- Running Task 24: Diagnostic Metric Validation and Quality Assurance ---")

    # --- Operate on a copy ---
    profile_df = robustness_profile.copy()

    # Initialize a list to collect critical validation errors.
    errors = []

    # --- Step 1: Cross-Metric Consistency and Completeness Validation ---
    logging.info("Step 1: Validating completeness and numerical integrity...")

    # Define expected rows (index) and columns.
    expected_regimes = [r['regime_name'] for r in study_config['experimental_design']['regime_definitions']]
    expected_models = list(study_config['model_training']['architectures'].keys())
    expected_index = pd.MultiIndex.from_product([expected_regimes, expected_models], names=['regime', 'model_type'])
    expected_columns = ['MSE', 'FCAS', 'PCS', 'TSV', 'NLICS']

    # Check for missing rows.
    missing_rows = expected_index.difference(profile_df.index)
    if not missing_rows.empty:
        logging.warning(f"Robustness profile is missing results for the following (regime, model) pairs: {missing_rows.tolist()}")

    # Check for missing columns.
    for col in expected_columns:
        if col not in profile_df.columns:
            errors.append(f"Robustness profile is missing the required metric column: '{col}'")
            # Add a placeholder column with NaNs to prevent downstream errors.
            profile_df[col] = np.nan

    # Check for unexpected NaN values, which indicate a failed computation.
    if profile_df.isnull().values.any():
        nan_report = profile_df.isnull().sum()
        nan_report = nan_report[nan_report > 0]
        logging.warning(f"Found NaN values, indicating failed metric computations:\n{nan_report}")

    # --- Define theoretical ranges for each metric ---
    metric_ranges = {
        'MSE': (0, float('inf')),
        'FCAS': (0, 1),
        'PCS': (0, float('inf')),
        'TSV': (0, 2),  # Max L2 distance between two unit vectors is 2.
        'NLICS': (0, 1),
    }

    # Validate that all computed values fall within their theoretical ranges.
    for metric, (min_val, max_val) in metric_ranges.items():
        if metric in profile_df.columns:
            # Drop NaNs for validation, as they were already reported.
            values = profile_df[metric].dropna()
            if not values.empty:
                if not (values >= min_val).all() or not (values <= max_val).all():
                    errors.append(
                        f"Metric '{metric}' has values outside its theoretical range "
                        f"of [{min_val}, {max_val}]."
                    )

    # If any critical errors were found, raise an exception.
    if errors:
        error_report = "\n".join([f"- {error}" for error in errors])
        raise ValueError(f"Diagnostic metric validation failed:\n{error_report}")

    logging.info("Completeness and numerical integrity validation PASSED.")

    # --- Step 2: Assemble Robustness Profile and Compute Diagnostics ---
    # The profile is already assembled, but we can compute a correlation matrix as a diagnostic.
    logging.info("\nStep 2: Computing diagnostic correlation matrix of metrics...")
    metric_correlations = profile_df.corr()
    logging.info("Metric Correlation Matrix:\n" + metric_correlations.to_string(float_format='{:.3f}'.format))

    # --- Step 3: Generate Diagnostic Metric Summary Report ---
    logging.info("\nStep 3: Generating publication-quality summary report...")

    # Use the pandas Styler API for professional formatting.
    styled_profile = profile_df.style \
        .set_caption("Robustness Profile: Diagnostic Metrics Across Regimes and Models") \
        .format("{:.3f}", na_rep="N/A") \
        .set_table_styles([
            {'selector': 'th', 'props': [('text-align', 'center')]},
            {'selector': 'td', 'props': [('text-align', 'center')]},
        ]) \
        .background_gradient(cmap='viridis', subset=['MSE', 'PCS', 'TSV']) \
        .background_gradient(cmap='viridis_r', subset=['FCAS', 'NLICS'])

    # Display the styled table in a notebook environment.
    display(styled_profile)

    logging.info("\n>>> Diagnostic metric validation and reporting suite completed successfully. <<<")

    # Return the validated DataFrame and the Styler object.
    return profile_df, styled_profile
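
The range check at the heart of the validation suite can be tried on a toy table. This is a minimal sketch under stated assumptions: `find_out_of_range` is a hypothetical helper, the regime and model names are placeholders, and every number is made up for illustration rather than taken from the study. NaN entries are dropped before the comparison, matching the suite's choice to report missing values separately from range violations.

```python
import numpy as np
import pandas as pd

# Theoretical ranges for a subset of the metrics, as in the suite above.
METRIC_RANGES = {"MSE": (0.0, np.inf), "FCAS": (0.0, 1.0), "NLICS": (0.0, 1.0)}


def find_out_of_range(profile: pd.DataFrame, ranges: dict) -> list:
    """Return the metric names whose non-NaN values leave their allowed range."""
    violations = []
    for metric, (low, high) in ranges.items():
        if metric not in profile.columns:
            continue
        # NaNs signal failed computations and are handled elsewhere.
        values = profile[metric].dropna()
        # `Series.between` is inclusive on both ends by default.
        if not values.between(low, high).all():
            violations.append(metric)
    return violations


# Placeholder index and made-up values, for illustration only.
index = pd.MultiIndex.from_product(
    [["regime_a", "regime_b"], ["lstm", "transformer"]],
    names=["regime", "model_type"],
)
toy_profile = pd.DataFrame(
    {
        "MSE": [0.10, 0.12, 0.30, 0.25],
        "FCAS": [0.9, 1.2, 0.7, np.nan],  # 1.2 exceeds the upper bound of 1
        "NLICS": [0.8, np.nan, 0.6, 0.5],
    },
    index=index,
)

violations = find_out_of_range(toy_profile, METRIC_RANGES)
print(violations)  # ['FCAS']
```

Only `FCAS` is flagged: the NaN entries in `FCAS` and `NLICS` are ignored by the range check, and `MSE` has no finite upper bound.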


In [None]:
# Task 25: Jensen-Shannon Divergence Computation

def compute_js_divergence_matrix(
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    study_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Computes the pairwise Jensen-Shannon (J-S) Divergence between all regimes.

    This function quantifies the semantic shift between different macroeconomic
    regimes by treating their entire text corpora as documents and calculating
    the J-S Divergence between their vocabulary distributions.

    The process is as follows:
    1.  A unified vocabulary is created from the text of all regimes combined.
    2.  For each regime, its text corpus is transformed into a term frequency
        vector based on the unified vocabulary.
    3.  These vectors are averaged and normalized to create a probability
        distribution for each regime.
    4.  The pairwise J-S Divergence is calculated for all regime pairs.

    Args:
        data_splits: The nested dictionary of data splits, used to access the
                     raw text for all samples in each regime.
        study_config: The complete study configuration dictionary, used to get
                      TF-IDF parameters.

    Returns:
        A pandas DataFrame representing the symmetric J-S Divergence matrix,
        where both the index and columns are the regime names.
    """
    logging.info("--- Running Task 25: Jensen-Shannon Divergence Computation ---")

    # --- Step 1: Extract Corpora and Create Unified Vocabulary ---
    logging.info("Step 1: Creating a unified vocabulary across all regimes...")

    # Get TF-IDF parameters from the config.
    tfidf_params = study_config['feature_engineering']['tfidf']

    # Assemble the full text corpus from all splits and all regimes.
    all_texts = []
    regime_corpora = {}
    regime_names = sorted(data_splits.keys()) # Sort for deterministic order.

    for regime_name in regime_names:
        # Concatenate text from train, val, and test splits for the current regime.
        regime_corpus_series = pd.concat([
            splits_df['aggregated_text']
            for split_name, splits_df in data_splits[regime_name].items()
        ])
        all_texts.append(regime_corpus_series)
        regime_corpora[regime_name] = regime_corpus_series

    global_corpus = pd.concat(all_texts)

    # Create a vectorizer to define the common feature space (vocabulary).
    # TfidfVectorizer with use_idf=False yields term frequencies without IDF
    # weighting; its default norm='l2' still scales each document vector.
    vocab_vectorizer = TfidfVectorizer(
        max_features=tfidf_params.get('max_features', 2000),
        ngram_range=tuple(tfidf_params.get('ngram_range', (1, 2))),
        min_df=tfidf_params.get('min_df', 2),
        max_df=tfidf_params.get('max_df', 0.95),
        stop_words='english',
        use_idf=False # We only need term frequencies.
    )

    # Fit on the global corpus to establish the unified vocabulary.
    vocab_vectorizer.fit(global_corpus)
    unified_vocabulary = vocab_vectorizer.vocabulary_

    # --- Step 2: Create Probability Distributions for Each Regime ---
    logging.info("Step 2: Generating probability distributions for each regime...")

    # Create a new vectorizer that will use the fixed, unified vocabulary.
    dist_vectorizer = TfidfVectorizer(
        ngram_range=tuple(tfidf_params.get('ngram_range', (1, 2))),
        stop_words='english',
        vocabulary=unified_vocabulary,
        use_idf=False
    )

    regime_distributions = {}
    epsilon = 1e-10  # Smoothing constant to avoid log(0)

    for regime_name, corpus in regime_corpora.items():
        # Transform the regime's corpus into term frequency vectors.
        tf_vectors = dist_vectorizer.fit_transform(corpus)

        # Compute the average term frequency vector for the "average document".
        avg_tf_vector = np.array(tf_vectors.mean(axis=0)).flatten()

        # Add smoothing constant.
        avg_tf_vector += epsilon

        # Normalize the vector to create a valid probability distribution.
        distribution = avg_tf_vector / avg_tf_vector.sum()
        regime_distributions[regime_name] = distribution

    # --- Step 3: Calculate Pairwise Jensen-Shannon Divergence ---
    logging.info("Step 3: Calculating the pairwise J-S Divergence matrix...")

    # Initialize a DataFrame to store the results.
    js_matrix = pd.DataFrame(
        np.zeros((len(regime_names), len(regime_names))),
        index=regime_names,
        columns=regime_names
    )

    # Iterate over all unique pairs of regimes.
    for r1, r2 in combinations(regime_names, 2):
        # Get the corresponding probability distributions.
        p = regime_distributions[r1]
        q = regime_distributions[r2]

        # --- J-S Divergence Calculation ---
        # 1. Calculate the mixture distribution M.
        m = 0.5 * (p + q)

        # 2. Calculate the two KL divergences using scipy's stable entropy function.
        # D_KL(P || M) = sum(p_i * log2(p_i / m_i))
        kl_p_m = entropy(pk=p, qk=m, base=2)
        kl_q_m = entropy(pk=q, qk=m, base=2)

        # 3. Calculate the J-S Divergence.
        # D_JS(P, Q) = 0.5 * D_KL(P || M) + 0.5 * D_KL(Q || M)
        js_divergence = 0.5 * kl_p_m + 0.5 * kl_q_m

        # Store the symmetric result in the matrix.
        js_matrix.loc[r1, r2] = js_divergence
        js_matrix.loc[r2, r1] = js_divergence

    # --- Step 4: Validate and Visualize Results ---
    logging.info("\nJensen-Shannon Divergence Matrix (raw values):\n" + js_matrix.to_string(float_format='{:.4f}'.format))

    # Create a heatmap for visualization, as shown in Figure 2 of the paper.
    plt.figure(figsize=(8, 6))
    sns.heatmap(
        js_matrix,
        annot=True,
        fmt=".2f",
        cmap="viridis",
        linewidths=.5
    )
    plt.title("Vocabulary Shift (Jensen-Shannon Divergence) Across Regimes")
    plt.show()

    logging.info("\n>>> Jensen-Shannon Divergence computation completed successfully. <<<")

    return js_matrix
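
As a worked check of the divergence formula used in Step 3, the sketch below recomputes it with the standard library only. The function names are hypothetical and the three toy distributions are invented for illustration. With base-2 logarithms the J-S Divergence is bounded between 0 (identical distributions) and 1 (disjoint support), so the entries of the matrix above should fall in that interval.

```python
import math
from typing import Sequence


def kl_divergence_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """D_JS(P, Q) = 0.5 * D_KL(P || M) + 0.5 * D_KL(Q || M), with M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    # Wherever p_i > 0 the mixture m_i is also > 0, so no division by zero occurs.
    return 0.5 * kl_divergence_bits(p, m) + 0.5 * kl_divergence_bits(q, m)


# Toy distributions over a three-term vocabulary (illustrative values).
same = js_divergence_bits([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
overlap = js_divergence_bits([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
disjoint = js_divergence_bits([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])

print(same, overlap, disjoint)  # 0.0 0.5 1.0
```

Note that `scipy.spatial.distance.jensenshannon` returns the J-S distance, which is the square root of this divergence, so the two are not interchangeable without squaring.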


In [None]:
# Task 26: Dimensionality Reduction and Visualization

def _prepare_data_for_tsne(
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    embedding_features: Dict[str, Dict[str, np.ndarray]],
    max_samples: int = 5000
) -> Tuple[np.ndarray, pd.DataFrame]:
    """
    Internal helper to aggregate test set embeddings and metadata for t-SNE.

    This function extracts all embeddings from the 'test' splits, aligns them
    with their regime and sector labels, and performs stratified subsampling
    if the total number of samples exceeds a given threshold.

    Args:
        data_splits: The nested dictionary of data splits.
        embedding_features: The nested dictionary of sentence embedding features.
        max_samples: The maximum number of samples to use for t-SNE. If the
                     total exceeds this, stratified sampling is performed.

    Returns:
        A tuple containing:
        - A NumPy array of the selected embeddings.
        - A pandas DataFrame containing the corresponding metadata (labels).
    """
    # --- 1. Aggregate all test set embeddings and metadata ---
    all_embeddings = []
    all_metadata = []

    # Iterate through each regime in a sorted order for determinism.
    for regime_name in sorted(data_splits.keys()):
        # Get the test split DataFrame and its corresponding embeddings.
        test_df = data_splits[regime_name]['test']
        test_embeddings = embedding_features[regime_name]['test']

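        # Skip regimes whose test split is empty or misaligned with its embeddings.
        if len(test_df) != test_embeddings.shape[0] or test_df.empty:
            logging.warning(f"Skipping regime '{regime_name}' for t-SNE: empty or misaligned test split.")
            continue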

        # Append the embeddings to the master list.
        all_embeddings.append(test_embeddings)

        # Create and append metadata for each sample.
        metadata_df = pd.DataFrame({
            'regime': test_df['regime'],
            'sector': test_df.index.get_level_values('sector')
        })
        all_metadata.append(metadata_df)

    # Concatenate all embeddings and metadata into single structures.
    if not all_embeddings:
        raise ValueError("No test set embeddings found to process for t-SNE.")

    full_embeddings_array = np.vstack(all_embeddings)
    full_metadata_df = pd.concat(all_metadata)

    # --- 2. Perform Stratified Subsampling if Necessary ---
    n_total_samples = full_embeddings_array.shape[0]
    if n_total_samples > max_samples:
        logging.warning(
            f"Total samples ({n_total_samples}) exceed max_samples ({max_samples}). "
            "Performing stratified subsampling for t-SNE efficiency."
        )

        # Use train_test_split as a convenient way to perform stratified sampling.
        # We create a combined stratification key to preserve joint distribution.
        stratify_key = full_metadata_df['regime'].astype(str) + "_" + full_metadata_df['sector'].astype(str)

        # We only need the 'train' part of the split, which will be our subsample.
        sampled_embeddings, _, sampled_metadata, _ = train_test_split(
            full_embeddings_array,
            full_metadata_df,
            train_size=max_samples,
            stratify=stratify_key,
            random_state=42
        )
        return sampled_embeddings, sampled_metadata
    else:
        # If below the threshold, use all data.
        return full_embeddings_array, full_metadata_df


def _create_tsne_plot(
    tsne_df: pd.DataFrame,
    hue_column: str,
    title: str,
    palette: str = "viridis"
) -> None:
    """
    Internal helper to generate and display a single t-SNE scatter plot.
    """
    # Create a new figure for the plot.
    plt.figure(figsize=(12, 10))

    # Generate the scatter plot using seaborn for aesthetics.
    ax = sns.scatterplot(
        data=tsne_df,
        x='tsne_dim_1',
        y='tsne_dim_2',
        hue=hue_column,
        palette=palette,
        alpha=0.7,
        s=50, # Marker size
        legend='full'
    )

    # Set plot titles and labels.
    ax.set_title(title, fontsize=16, fontweight='bold')
    ax.set_xlabel("t-SNE Component 1", fontsize=12)
    ax.set_ylabel("t-SNE Component 2", fontsize=12)

    # Improve legend placement.
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
    plt.tight_layout(rect=[0, 0, 0.85, 1]) # Adjust layout to make space for legend

    # Display the plot.
    plt.show()


def run_tsne_visualization_suite(
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    embedding_features: Dict[str, Dict[str, np.ndarray]],
    tsne_params: Dict[str, Any] = None
) -> pd.DataFrame:
    """
    Orchestrates the t-SNE dimensionality reduction and visualization pipeline.

    This function reproduces the analysis for Figures 3 and 4 from the paper.
    It aggregates all test set embeddings, performs t-SNE to project them into
    2D space, and generates two scatter plots: one colored by macroeconomic
    regime and another colored by industry sector.

    Args:
        data_splits: The nested dictionary of data splits.
        embedding_features: The nested dictionary of sentence embedding features.
        tsne_params: Optional dictionary of parameters to pass to the
                     sklearn.manifold.TSNE constructor.

    Returns:
        A pandas DataFrame containing the 2D t-SNE coordinates and the
        corresponding metadata for each sample, which can be used for further
        interactive analysis.
    """
    logging.info("--- Running Task 26: Dimensionality Reduction and Visualization Suite ---")

    # --- Step 1: Prepare Embeddings and Metadata for t-SNE ---
    logging.info("Step 1: Aggregating test set embeddings and metadata...")
    try:
        # This helper function handles aggregation and stratified subsampling.
        embeddings_to_process, metadata_df = _prepare_data_for_tsne(
            data_splits, embedding_features
        )
        logging.info(f"Prepared {embeddings_to_process.shape[0]} samples for t-SNE.")
    except ValueError as e:
        logging.error(f"Failed to prepare data for t-SNE: {e}")
        return pd.DataFrame()

    # --- Step 2: Execute t-SNE Dimensionality Reduction ---
    logging.info("Step 2: Executing t-SNE algorithm... (This may take several minutes)")

    # Define default t-SNE parameters, which can be overridden.
    default_params = {
        'n_components': 2,
        'perplexity': 30,
        'learning_rate': 'auto',
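        # Iterations stay at scikit-learn's default of 1000; the keyword was
        # renamed from 'n_iter' to 'max_iter' in scikit-learn 1.5.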
        'random_state': 42,
        'n_jobs': -1 # Use all available CPU cores
    }
    if tsne_params:
        default_params.update(tsne_params)

    # Initialize and run the t-SNE algorithm.
    tsne = TSNE(**default_params)
    tsne_results = tsne.fit_transform(embeddings_to_process)

    # Create a DataFrame to hold the results and metadata.
    tsne_df = pd.DataFrame(
        tsne_results,
        columns=['tsne_dim_1', 'tsne_dim_2'],
        index=metadata_df.index
    )
    tsne_df = pd.concat([tsne_df, metadata_df], axis=1)

    logging.info("t-SNE computation complete.")

    # --- Step 3: Generate Visualization Plots ---
    logging.info("Step 3: Generating visualization plots...")

    # Plot 1: Colored by Economic Regime (reproducing Figure 3)
    _create_tsne_plot(
        tsne_df=tsne_df,
        hue_column='regime',
        title='t-SNE Visualization of Financial News Embeddings by Economic Regime',
        palette='viridis'
    )

    # Plot 2: Colored by Industry Sector (reproducing Figure 4)
    _create_tsne_plot(
        tsne_df=tsne_df,
        hue_column='sector',
        title='t-SNE Visualization of Financial News Embeddings by Industry Sector',
        palette='tab20' # A palette suitable for many categories
    )

    logging.info("\n>>> t-SNE visualization suite completed successfully. <<<")

    return tsne_df
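
The stratified subsampling step in `_prepare_data_for_tsne` preserves the joint regime-sector distribution when the sample count exceeds `max_samples`. The sketch below illustrates the idea with the standard library only. It is not the method used above, which delegates to scikit-learn's `train_test_split`; here a simple proportional allocation per stratum is swapped in. The name `stratified_subsample` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence


def stratified_subsample(
    labels: Sequence[Hashable], max_samples: int, seed: int = 42
) -> List[int]:
    """Return sorted row indices of an approximately proportional subsample."""
    if len(labels) <= max_samples:
        # Below the threshold, keep every row.
        return list(range(len(labels)))

    # Group row indices by stratum label.
    by_stratum = defaultdict(list)
    for idx, label in enumerate(labels):
        by_stratum[label].append(idx)

    rng = random.Random(seed)
    fraction = max_samples / len(labels)
    chosen: List[int] = []
    for label in sorted(by_stratum, key=str):
        members = by_stratum[label]
        # Keep at least one member so that no stratum disappears entirely.
        k = max(1, round(len(members) * fraction))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


# Toy stratification keys in the "<regime>_<sector>" form used above.
toy_labels = ["a_tech"] * 60 + ["b_energy"] * 30 + ["c_health"] * 10
subset = stratified_subsample(toy_labels, max_samples=20)
counts = Counter(toy_labels[i] for i in subset)
print(len(subset), dict(counts))
```

Because each stratum's quota is rounded independently, the subsample size can differ slightly from `max_samples` in general; in this toy case the 60/30/10 split maps exactly onto 12/6/2.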


In [None]:
# Task 27: Cross-Regime Analysis Integration

def run_cross_regime_analysis_suite(
    js_divergence_matrix: pd.DataFrame,
    robustness_profile: pd.DataFrame
) -> pd.DataFrame:
    """
    Integrates and analyzes the relationship between semantic drift and model performance.

    This function tests the paper's central hypothesis by correlating the
    Jensen-Shannon (J-S) Divergence between regime pairs with the corresponding
    degradation in model performance (measured by the change in MSE).

    The process is as follows:
    1.  Create a tidy DataFrame aligning each unique regime pair with its
        J-S divergence.
    2.  For each model, calculate the absolute difference in MSE between the
        regimes in each pair.
    3.  For each model, compute the Spearman rank correlation between the
        J-S divergences and the MSE differences.
    4.  Generate scatter plots to visualize this relationship for each model.

    Args:
        js_divergence_matrix: The symmetric DataFrame of pairwise J-S divergences.
        robustness_profile: The DataFrame containing the complete set of
                            diagnostic metrics, including MSE.

    Returns:
        A pandas DataFrame summarizing the Spearman correlation coefficient (rho)
        and p-value for each model type, quantifying the strength and
        significance of the relationship between semantic drift and error.
    """
    logging.info("--- Running Task 27: Cross-Regime Analysis Integration Suite ---")

    # --- Step 1: Align Semantic Drift and Performance Degradation Data ---
    logging.info("Step 1: Aligning J-S Divergence with MSE degradation...")

    # Get the names of all regimes.
    regime_names = js_divergence_matrix.index.tolist()
    # Get the names of all model types.
    model_types = robustness_profile.index.get_level_values('model_type').unique().tolist()

    # Create a list of all unique, unordered regime pairs.
    regime_pairs = list(combinations(regime_names, 2))

    # This list will hold the structured data for our analysis.
    analysis_data = []

    # Iterate through each model type to create a separate analysis.
    for model_type in model_types:
        # Filter the robustness profile for the current model.
        model_mse = robustness_profile.xs(model_type, level='model_type')['MSE']

        # Iterate through each unique regime pair.
        for r1, r2 in regime_pairs:
            # Look up the J-S divergence for the pair.
            js_div = js_divergence_matrix.loc[r1, r2]

            # Calculate the absolute difference in MSE for the model between the two regimes.
            mse_degradation = abs(model_mse.loc[r1] - model_mse.loc[r2])

            # Append the aligned data point.
            analysis_data.append({
                'model_type': model_type,
                'regime_pair': f"{r1}-{r2}",
                'js_divergence': js_div,
                'mse_degradation': mse_degradation
            })

    # Create a single, tidy DataFrame from the collected data.
    analysis_df = pd.DataFrame(analysis_data)

    logging.info("Successfully created aligned analysis DataFrame.")

    # --- Step 2: Correlate Semantic Drift with Model Performance ---
    logging.info("\nStep 2: Computing Spearman correlation for each model...")

    correlation_results = []

    # Iterate through each model type to compute its correlation.
    for model_type in model_types:
        # Filter the DataFrame for the current model.
        model_df = analysis_df[analysis_df['model_type'] == model_type]

        # --- Spearman Rank Correlation ---
        # A non-parametric, rank-based measure of how well the relationship
        # between two variables can be described by a monotonic function.
        # Because it works on ranks, it is less sensitive to outliers than
        # Pearson correlation and does not assume a linear relationship.
        rho, p_value = spearmanr(model_df['js_divergence'], model_df['mse_degradation'])

        # Store the results.
        correlation_results.append({
            'model_type': model_type,
            'spearman_rho': rho,
            'p_value': p_value
        })

    # Create a DataFrame to summarize the correlation results.
    correlation_df = pd.DataFrame(correlation_results).set_index('model_type')

    logging.info("Correlation Analysis Results:\n" + correlation_df.to_string(float_format='{:.4f}'.format))
    logging.warning(
        "Note: p-values should be interpreted with caution due to the small "
        f"sample size (N={len(regime_pairs)} regime pairs)."
    )

    # --- Step 3: Generate Visualization Plots ---
    logging.info("\nStep 3: Generating visualization plots...")

    # Use seaborn's `lmplot` to create a scatter plot with a regression line
    # for each model type, faceted by model.
    g = sns.lmplot(
        data=analysis_df,
        x='js_divergence',
        y='mse_degradation',
        col='model_type',
        hue='model_type',
        height=5,
        aspect=1.2,
        scatter_kws={'s': 100, 'alpha': 0.8},
        ci=95 # Show 95% confidence interval for the regression line.
    )

    # Set a comprehensive title for the entire figure.
    g.fig.suptitle("Relationship Between Semantic Drift (J-S Divergence) and Model Error (MSE Degradation)", y=1.03, fontsize=16, fontweight='bold')

    # Improve axis labels.
    g.set_axis_labels("Jensen-Shannon Divergence", "Absolute Difference in MSE")

    # Display the plot.
    plt.show()

    logging.info("\n>>> Cross-regime analysis suite completed successfully. <<<")

    return correlation_df
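
The alignment and correlation steps above can be sketched on toy data. The regime labels, divergences, and MSE values below are invented for illustration and are not results from the study. The sketch also shows what Spearman's rho measures: it is the Pearson correlation of the ranks, computed here with pandas alone.

```python
from itertools import combinations

import pandas as pd

# Hypothetical regimes and values, chosen only to illustrate the mechanics.
regimes = ["A", "B", "C", "D"]
js = pd.DataFrame(
    [[0.0, 0.1, 0.3, 0.6],
     [0.1, 0.0, 0.2, 0.5],
     [0.3, 0.2, 0.0, 0.4],
     [0.6, 0.5, 0.4, 0.0]],
    index=regimes, columns=regimes,
)
mse = pd.Series({"A": 1.0, "B": 1.2, "C": 1.7, "D": 2.9})

# One data point per unordered regime pair: (J-S divergence, |MSE difference|).
pairs = list(combinations(regimes, 2))
js_values = pd.Series([js.loc[r1, r2] for r1, r2 in pairs])
mse_gaps = pd.Series([abs(mse[r1] - mse[r2]) for r1, r2 in pairs])

# Spearman's rho equals the Pearson correlation of the two rank vectors.
rho = js_values.rank().corr(mse_gaps.rank())
print(f"pairs={len(pairs)}, rho={rho:.3f}")
```

With four regimes there are only six pairs, which is why the function above warns that p-values should be read with caution.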


In [None]:
# Task 28: Stock-Specific Case Study Implementation

def _filter_artifacts_for_tickers(
    target_tickers: List[str],
    predictions_df: pd.DataFrame,
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    embedding_features: Dict[str, Dict[str, np.ndarray]]
) -> Tuple[pd.DataFrame, Dict[str, Dict[str, pd.DataFrame]], Dict[str, Dict[str, np.ndarray]]]:
    """
    Internal helper that filters all data artifacts down to specific tickers.

    This function builds a self-consistent subset of the data artifacts
    (predictions, data splits, and embedding features) that contains only the
    specified target tickers. The subset is the input to a stock-specific
    case study.

    Args:
        target_tickers: A list of stock tickers to isolate.
        predictions_df: The full DataFrame of predictions.
        data_splits: The full dictionary of data splits.
        embedding_features: The full dictionary of embedding features.

    Returns:
        A tuple containing the filtered versions of:
        - ticker_predictions_df
        - ticker_data_splits
        - ticker_embedding_features
    """
    # --- 1. Filter the main predictions DataFrame ---
    # This is a straightforward boolean indexing operation on the 'ticker' level of the index.
    ticker_mask = predictions_df.index.get_level_values('ticker').isin(target_tickers)
    ticker_predictions_df = predictions_df[ticker_mask].copy()

    # --- 2. Filter the nested dictionaries of DataFrames and NumPy arrays ---
    ticker_data_splits = {}
    ticker_embedding_features = {}

    # Iterate through all 12 splits.
    for regime, splits in data_splits.items():
        ticker_data_splits[regime] = {}
        ticker_embedding_features[regime] = {}
        for split_name, df in splits.items():
            if df.empty:
                # If the original split is empty, the filtered split is also empty.
                ticker_data_splits[regime][split_name] = pd.DataFrame(columns=df.columns, index=df.index)
                # Create an empty embedding array with the correct number of columns.
                emb_dim = embedding_features[regime][split_name].shape[1]
                ticker_embedding_features[regime][split_name] = np.empty((0, emb_dim), dtype=np.float32)
                continue

            # Filter the DataFrame for the target tickers.
            split_ticker_mask = df.index.get_level_values('ticker').isin(target_tickers)
            filtered_df = df[split_ticker_mask].copy()
            ticker_data_splits[regime][split_name] = filtered_df

            # --- 3. Filter the parallel embedding array ---
            # The embedding array is row-aligned with `df`, so the same boolean
            # mask gives the integer positions to keep. Unlike a label-based
            # lookup, this stays correct when the index contains duplicates.
            positions_to_keep = np.flatnonzero(split_ticker_mask)
            # Use these integer positions to slice the embedding array.
            ticker_embedding_features[regime][split_name] = embedding_features[regime][split_name][positions_to_keep]

    return ticker_predictions_df, ticker_data_splits, ticker_embedding_features
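
The row-alignment requirement between a split DataFrame and its parallel embedding array can be illustrated on toy data. This is a minimal sketch, assuming the array's row `i` corresponds to the DataFrame's row `i`; the dates, tickers, and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical split: a (date, ticker) MultiIndex and a row-aligned embedding array.
index = pd.MultiIndex.from_tuples(
    [("2022-01-03", "JPM"), ("2022-01-03", "AAPL"),
     ("2022-01-04", "JPM"), ("2022-01-04", "MSFT")],
    names=["date", "ticker"],
)
split_df = pd.DataFrame({"target_return": [0.01, -0.02, 0.03, 0.00]}, index=index)
embeddings = np.arange(8, dtype=np.float32).reshape(4, 2)  # row i <-> split_df.iloc[i]

# A single boolean mask drives both the DataFrame filter and the array slice.
mask = split_df.index.get_level_values("ticker").isin(["JPM"])
positions = np.flatnonzero(mask)

filtered_df = split_df[mask]
filtered_embeddings = embeddings[positions]
print(positions, filtered_embeddings.shape)
```

Deriving the positions from the mask, rather than from index labels, keeps the two structures aligned even if the same `(date, ticker)` label appears more than once.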


def run_stock_specific_case_study(
    target_tickers: List[str],
    predictions_df: pd.DataFrame,
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    embedding_features: Dict[str, Dict[str, np.ndarray]],
    training_results: Dict[str, Any],
    study_config: Dict[str, Any],
    vectorizer: TfidfVectorizer,
    st_model: SentenceTransformer,
    tokenizer: PreTrainedTokenizerBase
) -> pd.DataFrame:
    """
    Performs a deep-dive case study on a specific list of stocks.

    This function isolates all data related to the target tickers and re-computes
    the full suite of diagnostic metrics (FCAS, PCS, TSV, NLICS) on these
    subsets. This provides a granular view of model performance and robustness
    at the individual company level.

    Args:
        target_tickers: A list of stock tickers for the case study (e.g., ['JPM', 'AAPL']).
        predictions_df: The full DataFrame of all predictions.
        data_splits: The full dictionary of all data splits.
        embedding_features: The full dictionary of all embedding features.
        training_results: The dictionary with paths to trained model checkpoints.
        study_config: The complete study configuration dictionary.
        vectorizer: The globally fitted TfidfVectorizer.
        st_model: The initialized SentenceTransformer model.
        tokenizer: The initialized HuggingFace tokenizer.

    Returns:
        A pandas DataFrame formatted to be comparable to Table 6 in the paper,
        showing the diagnostic metric values for each target stock across all regimes.
    """
    logging.info(f"--- Running Definitive Stock-Specific Case Study for {target_tickers} ---")

    # --- Step 1: Create Filtered, Stock-Specific Data Artifacts ---
    # This helper function robustly filters all complex data structures.
    logging.info(f"Filtering all data artifacts for tickers: {target_tickers}...")
    ticker_predictions_df, ticker_data_splits, ticker_embedding_features = _filter_artifacts_for_tickers(
        target_tickers, predictions_df, data_splits, embedding_features
    )

    if ticker_predictions_df.empty:
        logging.warning(f"No data found for tickers {target_tickers}. Cannot run case study.")
        return pd.DataFrame()

    # --- Step 2: Re-compute All Diagnostic Metrics for Each Ticker ---
    # The metrics are computed one ticker at a time so that every row of the
    # final table reflects only that company's data. A combined profile over
    # all target tickers is not needed for the output, so it is not computed.
    # NLICS leverages caching.
    logging.info("Re-computing diagnostic metrics on filtered data...")

    all_results = []
    for ticker in target_tickers:
        # Re-run the filtering for this single ticker.
        t_preds, t_splits, t_embeds = _filter_artifacts_for_tickers([ticker], predictions_df, data_splits, embedding_features)

        # Skip tickers that have no predictions.
        if t_preds.empty:
            continue

        # Re-compute metrics for this single ticker
        fcas = compute_fcas(t_preds).rename('FCAS')
        tsv = compute_tsv(t_embeds, t_splits).rename('TSV')
        pcs = compute_pcs(t_preds, training_results, t_splits, study_config, vectorizer, st_model, tokenizer).rename('PCS')
        nlics = compute_nlics(t_preds, study_config).rename('NLICS')

        # Join all metrics for this ticker
        ticker_res = pd.concat([fcas, pcs, nlics], axis=1)
        ticker_res = ticker_res.join(tsv, on='regime')
        ticker_res['Stock'] = ticker
        all_results.append(ticker_res.reset_index())

    if not all_results:
        logging.warning("No case study results were generated.")
        return pd.DataFrame()

    # Concatenate results from all tickers into a single tidy DataFrame.
    final_tidy_table = pd.concat(all_results)

    # --- Format the table to resemble Table 6 from the paper ---
    # Table 6 shows metrics for one model at a time; here all models are kept.
    # The index is (Stock, regime, model_type) and the metrics are the columns.
    final_table = final_tidy_table.set_index(['Stock', 'regime', 'model_type']).sort_index()

    logging.info("\n--- Case Study Results (Tidy Format) ---")
    # Display the full, tidy table with all models.
    logging.info("\n" + final_table.to_string(float_format='{:.4f}'.format))

    # To exactly reproduce a table like Table 6, you would filter this final table, e.g.:
    # table_6_like = final_table.xs('feature_transformer', level='model_type')

    logging.info("\n>>> Stock-specific case study completed successfully. <<<")

    return final_table


In [None]:
# Task 29: Ablation Study Implementation

def perform_metric_ablation_analysis(
    robustness_profile: pd.DataFrame,
    target_regime: str = "Rate-Hike",
    target_model: str = "text_transformer"
) -> pd.DataFrame:
    """
    Performs and presents a metric ablation analysis based on computed results.

    This function reproduces the analysis shown in Table 7 of the paper. It
    isolates the results for a specific model and regime (e.g., Text Transformer
    during the Rate-Hike period) and presents a table showing how the
    robustness profile appears when each metric is individually excluded.

    Note on Interpretation: In our modular pipeline, metrics are computed
    independently. Therefore, "excluding" a metric does not change the values
    of the other metrics. This analysis serves as a reporting tool to visualize
    the complete profile and compare it to profiles missing one component.

    Args:
        robustness_profile: The complete DataFrame of diagnostic metrics from
                            the evaluation orchestrator.
        target_regime: The specific regime to focus on for the ablation study.
        target_model: The specific model to focus on for the ablation study.

    Returns:
        A pandas DataFrame representing the metric ablation table, which can be
        styled or displayed.

    Raises:
        KeyError: If the specified target_regime or target_model is not found
                  in the robustness_profile index.
    """
    logging.info("--- Running Task 29, Step 1: Metric Ablation Analysis ---")
    logging.info(f"Focusing on Regime: '{target_regime}', Model: '{target_model}'")

    # --- Step 1: Isolate the Data for the Target Configuration ---
    try:
        # Use .xs() to select the row for the target (regime, model_type) pair.
        # .iloc[0] turns the single-row result into a Series of metric values
        # (and takes the first row if the index happens to contain duplicates).
        source_metrics = robustness_profile.xs(
            (target_regime, target_model),
            level=('regime', 'model_type')
        ).iloc[0]

    except (KeyError, IndexError) as err:
        # An absent pair surfaces either as a KeyError from .xs() or as an
        # IndexError from .iloc[0] on an empty result; report both the same way.
        raise KeyError(
            f"The combination of regime='{target_regime}' and model_type='{target_model}' "
            "was not found in the robustness_profile DataFrame."
        ) from err

    # --- Step 2: Define Ablation Configurations and Build the Table ---
    # The metrics to be ablated, in the desired order.
    # Note: MSE is not ablated in the paper's table, so we exclude it from this list.
    metrics_to_ablate = ['FCAS', 'PCS', 'TSV', 'NLICS']

    # The full set of columns for the final table.
    table_columns = ['FCAS', 'PCS', 'TSV', 'NLICS']

    # This list will hold the data for each row of the final table.
    ablation_data = []

    # --- Create the "Full Evaluation" row ---
    # This is simply the original set of metrics.
    full_eval_row = source_metrics[table_columns].to_dict()
    full_eval_row['Model Variant'] = 'Full Evaluation'
    ablation_data.append(full_eval_row)

    # --- Create a row for each ablated metric ---
    for metric in metrics_to_ablate:
        # Create a copy of the original metric values.
        ablated_row_data = source_metrics[table_columns].to_dict()
        # Set the value of the ablated metric to NaN.
        ablated_row_data[metric] = np.nan
        # Define the name for this configuration.
        ablated_row_data['Model Variant'] = f'No {metric}'
        # Append the new row to our data list.
        ablation_data.append(ablated_row_data)

    # --- Step 3: Assemble and Format the Final DataFrame ---
    # Create the DataFrame from the list of dictionaries.
    ablation_df = pd.DataFrame(ablation_data)

    # Set 'Model Variant' as the index.
    ablation_df = ablation_df.set_index('Model Variant')

    # Reorder the columns to match the desired output.
    ablation_df = ablation_df[table_columns]

    logging.info("\nMetric Ablation Analysis Table (Reproducing Table 7):\n" + ablation_df.to_string(float_format='{:.2f}'.format, na_rep='N/A'))

    logging.info("\n>>> Metric ablation analysis completed successfully. <<<")

    return ablation_df
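
The table-building logic above can be shown in isolation. This is a minimal sketch with hypothetical metric values for a single (regime, model) pair; the numbers are placeholders, not figures from Table 7.

```python
import numpy as np
import pandas as pd

# Hypothetical metric values for one (regime, model) pair.
source_metrics = pd.Series({"FCAS": 0.60, "PCS": 0.80, "TSV": 0.30, "NLICS": 0.70})
metrics = list(source_metrics.index)

# First row: the complete profile. Then one row per excluded metric.
rows = [{"Model Variant": "Full Evaluation", **source_metrics.to_dict()}]
for metric in metrics:
    row = {"Model Variant": f"No {metric}", **source_metrics.to_dict()}
    row[metric] = np.nan  # mark the excluded metric as missing
    rows.append(row)

ablation_table = pd.DataFrame(rows).set_index("Model Variant")[metrics]
print(ablation_table.to_string(float_format="{:.2f}".format, na_rep="N/A"))
```

Because the metrics are computed independently, each "No X" row differs from the full row only in the one cell that is blanked out.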


def execute_single_training_run(
    model_type: str,
    train_df: pd.DataFrame,
    val_df: pd.DataFrame,
    study_config: Dict[str, Any],
    checkpoint_path: Path,
    vectorizer: TfidfVectorizer,
    st_model: SentenceTransformer,
    tokenizer: PreTrainedTokenizerBase,
    device: torch.device
) -> Dict[str, Any]:
    """
    Executes a single, isolated training run for a given model and dataset.

    This refactored, granular function is the core atomic unit of training. It
    takes a specific training and validation set and handles the entire training
    process for one model, including data prep, optimization setup, and execution.

    Args:
        model_type: The type of model to train ('lstm', 'text_transformer', etc.).
        train_df: The DataFrame for training.
        val_df: The DataFrame for validation.
        study_config: The main study configuration.
        checkpoint_path: The path to save the best model.
        vectorizer: The fitted TfidfVectorizer.
        st_model: The initialized SentenceTransformer.
        tokenizer: The initialized AutoTokenizer.
        device: The torch.device to train on.

    Returns:
        A dictionary containing the training history and the path to the best model.
    """
    # --- 1. Instantiate Model ---
    model_config = study_config['model_training']['architectures'][model_type]
    model_classes = {'lstm': LSTMRegressionModel, 'text_transformer': TextTransformerRegressionModel, 'feature_transformer': FeatureEnhancedMLP}
    model = model_classes[model_type](model_config)
    model.to(device)

    # --- 2. Prepare Features and DataLoaders ---
    # This block dynamically prepares the correct feature format for the given model.
    if model_type == 'text_transformer':
        train_features = tokenizer(train_df['aggregated_text'].tolist(), max_length=512, padding='max_length', truncation=True, return_tensors='np')
        val_features = tokenizer(val_df['aggregated_text'].tolist(), max_length=512, padding='max_length', truncation=True, return_tensors='np')
    else:
        train_tfidf = vectorizer.transform(train_df['aggregated_text']).toarray()
        val_tfidf = vectorizer.transform(val_df['aggregated_text']).toarray()
        if model_type == 'lstm':
            train_features, val_features = train_tfidf, val_tfidf
        else: # feature_transformer
            train_emb = st_model.encode(train_df['aggregated_text'].tolist(), batch_size=64)
            val_emb = st_model.encode(val_df['aggregated_text'].tolist(), batch_size=64)
            train_features = np.hstack([train_tfidf, train_emb])
            val_features = np.hstack([val_tfidf, val_emb])

    train_loader = create_dataloaders(train_features, train_df['target_return'].values, model_type, study_config['model_training']['global_params']['batch_size'], shuffle=True)
    val_loader = create_dataloaders(val_features, val_df['target_return'].values, model_type, study_config['model_training']['global_params']['batch_size'], shuffle=False)

    # --- 3. Setup Optimization and Run Training ---
    optimization_components = setup_optimization_components(model, study_config['model_training']['global_params'])

    training_result = run_training_orchestrator(
        model=model,
        optimization_components=optimization_components,
        train_loader=train_loader,
        val_loader=val_loader,
        device=device,
        num_epochs=50,
        patience=5,
        checkpoint_path=checkpoint_path
    )
    return training_result


def execute_single_inference_run(
    model_type: str,
    checkpoint_path: Path,
    test_df: pd.DataFrame,
    study_config: Dict[str, Any],
    vectorizer: TfidfVectorizer,
    st_model: SentenceTransformer,
    tokenizer: PreTrainedTokenizerBase,
    device: torch.device
) -> float:
    """
    Executes a single, isolated inference run and returns the MSE.

    Args:
        model_type: The type of model being evaluated.
        checkpoint_path: Path to the trained model weights.
        test_df: The DataFrame for testing.
        study_config: The main study configuration.
        ... (feature engineering artifacts) ...
        device: The torch.device to run inference on.

    Returns:
        The calculated Mean Squared Error on the test set.
    """
    # --- 1. Load Model ---
    model_config = study_config['model_training']['architectures'][model_type]
    model_classes = {'lstm': LSTMRegressionModel, 'text_transformer': TextTransformerRegressionModel, 'feature_transformer': FeatureEnhancedMLP}
    model = model_classes[model_type](model_config)
    model.load_state_dict(torch.load(checkpoint_path, map_location=device))
    model.to(device).eval()

    # --- 2. Prepare Test Data ---
    if model_type == 'text_transformer':
        test_features = tokenizer(test_df['aggregated_text'].tolist(), max_length=512, padding='max_length', truncation=True, return_tensors='np')
    else:
        test_tfidf = vectorizer.transform(test_df['aggregated_text']).toarray()
        if model_type == 'lstm':
            test_features = test_tfidf
        else:
            test_emb = st_model.encode(test_df['aggregated_text'].tolist(), batch_size=64)
            test_features = np.hstack([test_tfidf, test_emb])

    test_loader = create_dataloaders(test_features, test_df['target_return'].values, model_type, study_config['model_training']['global_params']['batch_size'] * 2, shuffle=False)

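    # --- 3. Generate Predictions and Calculate MSE ---
    predictions = generate_predictions_for_split(model, test_loader, device)
    # Flatten both arrays so that an (N, 1) prediction array cannot broadcast
    # against the (N,) targets into an (N, N) matrix.
    predictions = np.asarray(predictions, dtype=float).reshape(-1)
    targets = test_df['target_return'].to_numpy(dtype=float).reshape(-1)
    squared_errors = (predictions - targets) ** 2
    # Return a plain Python float to match the annotated return type.
    mse = float(np.mean(squared_errors))

    return mse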
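
One shape pitfall in the MSE step is worth illustrating. The output shape of `generate_predictions_for_split` is not shown in this section, so the sketch below only assumes that a regression head may emit predictions of shape `(N, 1)`; the numbers are made up.

```python
import numpy as np

# Hypothetical model output of shape (N, 1) and targets of shape (N,).
preds_2d = np.array([[0.1], [0.2], [0.3]])
targets = np.array([0.1, 0.0, 0.3])

# Broadcasting (N, 1) against (N,) silently produces an (N, N) matrix.
wrong = (preds_2d - targets) ** 2

# Flattening first gives the intended element-wise squared errors.
right = (preds_2d.reshape(-1) - targets) ** 2

print(wrong.shape, right.shape, right.mean())
```

Averaging the `(N, N)` matrix would still return a number, which is why this kind of error is easy to miss without an explicit reshape.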


def perform_feature_augmentation_ablation(
    robustness_profile: pd.DataFrame,
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    study_config: Dict[str, Any],
    vectorizer: TfidfVectorizer,
    st_model: SentenceTransformer,
    force_retrain_cross_sector: bool = False
) -> pd.DataFrame:
    """
    Performs a definitive feature augmentation ablation study.

    This analysis compares a "Text Only" model with a "Feature Enhanced" model
    across three dimensions: TSV, NLICS, and Cross-Sector MSE. This version
    includes a full, non-placeholder implementation of the cross-sector
    training and evaluation experiment.

    Args:
        robustness_profile: The complete DataFrame of diagnostic metrics.
        data_splits: The nested dictionary of data splits.
        study_config: The main study configuration.
        vectorizer: The globally fitted TfidfVectorizer.
        st_model: The initialized SentenceTransformer model.
        force_retrain_cross_sector: If True, retrains cross-sector models
                                    even if checkpoints exist.

    Returns:
        A pandas DataFrame summarizing the feature augmentation ablation results.
    """
    logging.info("--- Running Definitive Feature Augmentation Ablation Analysis ---")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(study_config['model_training']['architectures']['text_transformer']['base_model_identifier'])

    # --- 1. Calculate Average TSV and NLICS from existing results ---
    logging.info("Step 1: Calculating average TSV and NLICS from main results...")
    avg_tsv = robustness_profile['TSV'].mean()
    avg_nlics = robustness_profile.groupby(level='model_type')['NLICS'].mean()
    nlics_text_only = avg_nlics.get('text_transformer', np.nan)
    nlics_feature_enhanced = avg_nlics.get('feature_transformer', np.nan)

    # --- 2. Run Full Cross-Sector MSE Experiment ---
    logging.info("Step 2: Running full Cross-Sector MSE experiment (Financials -> Health Care)...")

    # --- a. Prepare sector-specific data slices ---
    source_sector, target_sector = 'Financials', 'Health Care'

    source_train_dfs = [df[df.index.get_level_values('sector') == source_sector] for r, s in data_splits.items() for sn, df in s.items() if sn == 'training']
    source_val_dfs = [df[df.index.get_level_values('sector') == source_sector] for r, s in data_splits.items() for sn, df in s.items() if sn == 'validation']
    target_test_dfs = [df[df.index.get_level_values('sector') == target_sector] for r, s in data_splits.items() for sn, df in s.items() if sn == 'testing']

    source_train_df = pd.concat(source_train_dfs)
    source_val_df = pd.concat(source_val_dfs)
    target_test_df = pd.concat(target_test_dfs)

    cross_sector_results = {}

    for model_type in ['text_transformer', 'feature_transformer']:
        logging.info(f"  - Running cross-sector experiment for model: {model_type}")

        # --- b. Train model on source sector (if needed) ---
        checkpoint_dir = Path("checkpoints/cross_sector")
        checkpoint_dir.mkdir(parents=True, exist_ok=True)
        checkpoint_path = checkpoint_dir / f"{model_type}_trained_on_{source_sector}.pth"

        if not checkpoint_path.exists() or force_retrain_cross_sector:
            logging.info(f"    - Training model on '{source_sector}' data...")
            execute_single_training_run(
                model_type=model_type,
                train_df=source_train_df,
                val_df=source_val_df,
                study_config=study_config,
                checkpoint_path=checkpoint_path,
                vectorizer=vectorizer,
                st_model=st_model,
                tokenizer=tokenizer,
                device=device
            )
        else:
            logging.info(f"    - Found existing checkpoint: {checkpoint_path}")

        # --- c. Evaluate model on target sector ---
        logging.info(f"    - Evaluating on '{target_sector}' data...")
        mse = execute_single_inference_run(
            model_type=model_type,
            checkpoint_path=checkpoint_path,
            test_df=target_test_df,
            study_config=study_config,
            vectorizer=vectorizer,
            st_model=st_model,
            tokenizer=tokenizer,
            device=device
        )
        cross_sector_results[model_type] = mse
        logging.info(f"    - Cross-Sector MSE: {mse:.4f}")

    # --- 3. Assemble the Final Table ---
    logging.info("Step 3: Assembling final ablation table...")

    ablation_results = pd.DataFrame([
        {
            'Model Type': 'Text Only',
            'TSV': avg_tsv,
            'NLICS': nlics_text_only,
            'Cross-Sector MSE': cross_sector_results.get('text_transformer', np.nan)
        },
        {
            'Model Type': 'Feature Enhanced',
            'TSV': avg_tsv,
            'NLICS': nlics_feature_enhanced,
            'Cross-Sector MSE': cross_sector_results.get('feature_transformer', np.nan)
        }
    ]).set_index('Model Type')

    logging.info("\nDefinitive Feature Augmentation Ablation Analysis:\n" + ablation_results.to_string(float_format='{:.3f}'.format))

    logging.info("\n>>> Feature augmentation ablation analysis completed successfully. <<<")

    return ablation_results


def create_nli_benchmark_file(
    predictions_df: pd.DataFrame,
    output_path: Path,
    n_samples: int = 100,
    random_state: int = 42
) -> None:
    """
    Creates and saves a reproducible benchmark dataset for human annotation.

    This function generates a random, stratified sample of text-hypothesis pairs
    and saves it to a CSV file. This file is intended to be given to a human
    expert for annotation. It includes a 'human_label' column pre-filled with
    placeholders that must be manually replaced.

    Args:
        predictions_df: The full DataFrame of all predictions.
        output_path: The path to save the CSV file for annotation.
        n_samples: The number of samples to include in the benchmark.
        random_state: The random seed for reproducibility.
    """
    logging.info("--- Creating NLI Benchmark File for Annotation ---")

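    # --- 1. Create a reproducible random sample ---
    # Stratifying by regime keeps the sample representative. Per-group rounding
    # of `frac` can return fewer than n_samples rows, which would make the final
    # `.sample(n=n_samples)` fail, so each regime is slightly oversampled first
    # and the result is then trimmed to the requested size.
    grouped = predictions_df.groupby('regime')
    frac = min(1.0, (n_samples + grouped.ngroups) / len(predictions_df))
    stratified_df = grouped.sample(frac=frac, random_state=random_state)
    benchmark_df = stratified_df.sample(
        n=min(n_samples, len(stratified_df)),
        random_state=random_state
    ).copy()  # Copy so the columns added below go to an independent frame.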

    # --- 2. Generate hypothesis strings ---
    benchmark_df['hypothesis'] = benchmark_df['prediction'].apply(
        lambda p: "The stock price will increase" if p > 0 else "The stock price will decrease"
    )

    # --- 3. Add placeholder for human labels ---
    # The annotator should replace this with 1.0 (Entailment), 0.5 (Neutral), or 0.0 (Contradiction).
    benchmark_df['human_label'] = "ANNOTATE_HERE"

    # --- 4. Select relevant columns and save ---
    # We only need the text, hypothesis, and the column to be annotated.
    # The index is preserved for later merging.
    columns_to_save = ['aggregated_text', 'hypothesis', 'human_label']
    benchmark_df[columns_to_save].to_csv(output_path)

    logging.info(f"Successfully saved benchmark file for annotation to '{output_path}'.")
    logging.warning(
        "ACTION REQUIRED: Please manually edit this CSV file and replace "
        "'ANNOTATE_HERE' with numerical labels (1.0, 0.5, 0.0) before "
        "running the entailment model ablation."
    )
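
One way to draw a regime-stratified sample of a fixed size is to oversample each group slightly and then trim. This is a sketch under stated assumptions: the regime labels `A`, `B`, `C` and the prediction values are hypothetical, and `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical predictions with an uneven number of rows per regime.
predictions = pd.DataFrame({
    "regime": ["A"] * 50 + ["B"] * 30 + ["C"] * 20,
    "prediction": [0.01 * (i % 7 - 3) for i in range(100)],
})
n_samples = 10

# Oversample by one row per group (on average) so that rounding inside each
# group cannot leave fewer than n_samples rows overall.
grouped = predictions.groupby("regime")
frac = min(1.0, (n_samples + grouped.ngroups) / len(predictions))
stratified = grouped.sample(frac=frac, random_state=42)

# Trim to the requested size.
benchmark = stratified.sample(n=min(n_samples, len(stratified)), random_state=42).copy()

# Hypothesis strings follow the sign of the prediction, as in the function above.
benchmark["hypothesis"] = benchmark["prediction"].apply(
    lambda p: "The stock price will increase" if p > 0 else "The stock price will decrease"
)
print(len(stratified), len(benchmark))
```

The final trim is a plain random draw, so the benchmark is approximately, not exactly, proportional to regime size.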


def perform_entailment_model_ablation(
    predictions_df: pd.DataFrame,
    annotated_benchmark_path: Path,
    study_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Performs a definitive ablation study comparing GPT-4 and BART-NLI.

    This analysis loads a human-annotated benchmark dataset and evaluates both
    GPT-4 and a local BART-NLI model against it. It compares the models on:
    1.  NLICS: The average score produced by the model on the benchmark set.
    2.  Human Agreement %: The percentage of exact matches between the model's
        score and the expert human's score.

    Args:
        predictions_df: The full DataFrame of all predictions (used for context).
        annotated_benchmark_path: The file path to the CSV file that has been
                                  manually annotated by a human expert.
        study_config: The complete study configuration dictionary.

    Returns:
        A pandas DataFrame summarizing the comparison results.

    Raises:
        FileNotFoundError: If the annotated benchmark file does not exist.
        ValueError: If the annotated file is invalid (e.g., missing columns,
                    invalid labels).
    """
    logging.info("--- Running Definitive Entailment Model Comparison Ablation ---")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # --- 1. Load and Validate the Annotated Benchmark Dataset ---
    logging.info(f"Step 1: Loading and validating annotated benchmark data from '{annotated_benchmark_path}'...")
    if not annotated_benchmark_path.exists():
        raise FileNotFoundError(
            f"Annotated benchmark file not found at '{annotated_benchmark_path}'. "
            "Please create and annotate it first using `create_nli_benchmark_file`."
        )

    # Load the annotated data.
    benchmark_df = pd.read_csv(annotated_benchmark_path, index_col=[0, 1, 2]) # Assuming original multi-index

    # --- Validation of the annotated file ---
    if 'human_label' not in benchmark_df.columns:
        raise ValueError("Annotated file is missing the 'human_label' column.")
    if benchmark_df['human_label'].isnull().any() or (benchmark_df['human_label'] == "ANNOTATE_HERE").any():
        raise ValueError("Annotated file contains missing or un-annotated 'ANNOTATE_HERE' values.")
    try:
        benchmark_df['human_label'] = pd.to_numeric(benchmark_df['human_label'])
    except (ValueError, TypeError) as err:
        # Chain the original error so the offending value remains visible.
        raise ValueError("The 'human_label' column must contain only numeric values (1.0, 0.5, 0.0).") from err
    valid_labels = {1.0, 0.5, 0.0}
    if not set(benchmark_df['human_label'].unique()).issubset(valid_labels):
        raise ValueError(f"Invalid values found in 'human_label' column. Only {valid_labels} are allowed.")

    # Join with original predictions_df to get all necessary columns like 'prediction'.
    benchmark_df = predictions_df.join(benchmark_df[['human_label', 'hypothesis']], how='inner')
    logging.info(f"Successfully loaded and validated {len(benchmark_df)} annotated samples.")

    # --- 2. Evaluate with BART-NLI ---
    logging.info("\nStep 2: Evaluating benchmark data with BART-NLI...")
    bart_scores = _evaluate_with_bart_nli(benchmark_df, device)
    bart_nlics = bart_scores.mean()
    # Use np.isclose for robust floating-point comparison.
    bart_agreement = np.isclose(bart_scores, benchmark_df['human_label']).mean() * 100

    # --- 3. Evaluate with GPT-4 ---
    logging.info("\nStep 3: Evaluating benchmark data with GPT-4...")
    nlics_config = study_config['diagnostics']['nlics_metric']
    cache_path = Path("results/nlics_cache.jsonl")
    gpt4_scores = asyncio.run(_compute_nlics_async_cached(benchmark_df, nlics_config, cache_path))
    gpt4_nlics = gpt4_scores.mean()
    # Use np.isclose for robust floating-point comparison.
    gpt4_agreement = np.isclose(gpt4_scores, benchmark_df['human_label']).mean() * 100

    # --- 4. Assemble the Final Table ---
    logging.info("\nStep 4: Assembling final comparison table...")

    comparison_results = pd.DataFrame([
        {
            'Entailment Model': 'BART-NLI',
            'NLICS': bart_nlics,
            'Human Agreement (%)': bart_agreement
        },
        {
            'Entailment Model': 'GPT-4',
            'NLICS': gpt4_nlics,
            'Human Agreement (%)': gpt4_agreement
        }
    ]).set_index('Entailment Model')

    logging.info("\nDefinitive Entailment Model Comparison (Reproducing Table 9):\n" + comparison_results.to_string(float_format='{:.2f}'.format))

    logging.info("\n>>> Entailment model comparison ablation completed successfully. <<<")

    return comparison_results
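
The "Human Agreement (%)" column above is the share of samples whose model score equals the human label under a floating-point tolerance. The following is a minimal sketch of that calculation on made-up scores; the helper name `human_agreement_pct` and the toy values are illustrative assumptions, not part of the pipeline.

```python
import numpy as np


def human_agreement_pct(model_scores, human_labels, atol=1e-8):
    """Return the percentage of samples where the score matches the label."""
    model = np.asarray(model_scores, dtype=float)
    human = np.asarray(human_labels, dtype=float)
    if model.shape != human.shape:
        raise ValueError("Scores and labels must have the same shape.")
    # np.isclose avoids brittle exact equality on floating-point values.
    return float(np.isclose(model, human, atol=atol).mean() * 100)


# Toy data restricted to the three allowed label values (1.0, 0.5, 0.0).
toy_human = [1.0, 0.5, 0.0, 1.0]
toy_model = [1.0, 0.5, 1.0, 1.0]
agreement = human_agreement_pct(toy_model, toy_human)
print(f"Human agreement: {agreement:.2f}%")
```

Three of the four toy scores match their labels, so the sketch reports 75% agreement.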


# Task 30: Control Experiment Implementation

def _identify_earnings_events(
    predictions_df: pd.DataFrame,
    keywords: List[str]
) -> pd.DataFrame:
    """
    Filters the predictions DataFrame to identify rows related to earnings events.
    """
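    # Build a regex that matches any keyword as a whole word. A non-capturing
    # group avoids the pandas warning about match groups in `str.contains`;
    # case-insensitivity is applied through `case=False` in the call below.
    pattern = r'\b(?:' + '|'.join(keywords) + r')\b'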

    # Apply the filter to the 'aggregated_text' column.
    earnings_mask = predictions_df['aggregated_text'].str.contains(pattern, case=False, na=False, regex=True)

    return predictions_df[earnings_mask]
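
To illustrate how whole-word keyword matching behaves, here is a small sketch on invented headlines. The keyword subset and the texts are assumptions chosen for the example; only `Series.str.contains` with `case=False`, `na=False` and `regex=True` is taken from the function above.

```python
import pandas as pd

# A short, illustrative subset of keywords.
keywords = ["earnings", "q1", "guidance"]
pattern = r"\b(?:" + "|".join(keywords) + r")\b"

# Invented headlines; the None entry stands in for a missing text.
texts = pd.Series([
    "Acme beats Q1 earnings estimates",
    "Learnings from the product launch",
    "Company raises full-year guidance",
    None,
])

# Word boundaries stop 'earnings' from matching inside 'Learnings',
# and na=False maps missing texts to False instead of NaN.
mask = texts.str.contains(pattern, case=False, na=False, regex=True)
print(mask.tolist())
```

Broad keywords such as `revenue` or `forecast` can also match news that is not an earnings report, so the resulting subset is best read as "earnings-related" rather than strictly "earnings announcements".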


def run_control_experiment(
    predictions_df: pd.DataFrame,
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    embedding_features: Dict[str, Dict[str, np.ndarray]],
    training_results: Dict[str, Any],
    study_config: Dict[str, Any],
    vectorizer: TfidfVectorizer,
    st_model: SentenceTransformer,
    tokenizer: PreTrainedTokenizerBase
) -> pd.DataFrame:
    """
    Performs the control experiment to disentangle situational vs. linguistic drift.

    This function reproduces the analysis from Table 5 of the paper. It isolates
    a specific, consistent event type (quarterly earnings reports) and computes
    PCS and TSV metrics both within single regimes and across different regimes
    to measure the impact of narrative framing.

    Args:
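        predictions_df: The full DataFrame of all predictions.
        data_splits: The nested dictionary of all 12 data splits.
        embedding_features: Nested dictionary of embedding arrays, keyed by
                            regime and then by split.
        training_results: The results dictionary produced by the training pipeline.
        study_config: The complete study configuration dictionary.
        vectorizer: The globally fitted TfidfVectorizer.
        st_model: The initialized SentenceTransformer model.
        tokenizer: The tokenizer passed through to the PCS computation.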

    Returns:
        A pandas DataFrame summarizing the control experiment results, formatted
        to match Table 5.
    """
    logging.info("--- Running Task 30: Control Experiment Implementation ---")

    # --- 1. Identify all earnings-related events in the test sets ---
    logging.info("Step 1: Identifying all earnings-related events...")

    # Define a comprehensive list of keywords to identify earnings reports.
    earnings_keywords = ['earnings', 'quarterly', 'q1', 'q2', 'q3', 'q4', 'revenue', 'profit', 'guidance', 'forecast']

    # We need to work with the full test set data.
    test_dfs = [splits['test'] for regime, splits in data_splits.items()]
    full_test_df = pd.concat(test_dfs)

    # Filter the full test set to get only earnings-related rows.
    earnings_df = _identify_earnings_events(full_test_df, earnings_keywords)

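    # Keep only the predictions whose index matches an earnings event. A boolean
    # `isin` mask (rather than `.loc`) avoids a KeyError when some test rows
    # have no corresponding prediction.
    earnings_predictions_df = predictions_df[predictions_df.index.isin(earnings_df.index)]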

    if earnings_predictions_df.empty:
        logging.error("No earnings events found in any test set. Cannot run control experiment.")
        return pd.DataFrame()

    logging.info(f"Identified {len(earnings_predictions_df)} total earnings-related samples in test sets.")

    # --- 2. Define the experimental conditions ---
    # These are the specific comparisons we need to make for Table 5.
    experiments = {
        "Within-Regime (pre-COVID)": ("Pre-COVID", "Pre-COVID"),
        "Within-Regime (COVID)": ("COVID", "COVID"),
        "Cross-Regime (pre-COVID vs. COVID)": ("Pre-COVID", "COVID"),
        "Cross-Regime (post-COVID vs. rate-hike)": ("Post-COVID", "Rate-Hike")
    }

    results = []

    # --- 3. Run analysis for each experimental condition ---
    for name, (regime1, regime2) in experiments.items():
        logging.info(f"\n--- Analyzing Experiment: {name} ---")

        # --- a. Isolate data for the current experiment ---
        regimes_to_include = list(set([regime1, regime2]))

        # Filter the earnings predictions for the relevant regimes.
        exp_preds_df = earnings_predictions_df[earnings_predictions_df['regime'].isin(regimes_to_include)]

        if exp_preds_df.empty:
            logging.warning(f"No earnings data for experiment '{name}'. Skipping.")
            results.append({'Event Type': name, 'PCS': np.nan, 'TSV': np.nan})
            continue

        # Create the filtered artifacts needed for the metric computations.
        _, exp_splits, exp_embeds = _filter_artifacts_for_tickers(
            exp_preds_df.index.get_level_values('ticker').unique().tolist(),
            exp_preds_df, # Pass the already filtered preds
            data_splits,
            embedding_features
        )

        # --- b. Re-compute metrics on the isolated data ---

        # Compute PCS. This will average the PCS across all models in the specified regimes.
        pcs_series = compute_pcs(
            exp_preds_df, training_results, exp_splits, study_config,
            vectorizer, st_model, tokenizer
        )
        # We take the overall mean PCS for this experimental condition.
        pcs_score = pcs_series.mean()

        # Compute TSV.
        if regime1 == regime2: # Within-Regime TSV
            tsv_series = compute_tsv(exp_embeds, exp_splits)
            tsv_score = tsv_series.get(regime1, np.nan)
        else: # Cross-Regime TSV
            # Concatenate the sorted embeddings from both regimes' test sets.
            embeds1 = pd.DataFrame(exp_embeds[regime1]['test'], index=exp_splits[regime1]['test'].index).sort_index(level='date')
            embeds2 = pd.DataFrame(exp_embeds[regime2]['test'], index=exp_splits[regime2]['test'].index).sort_index(level='date')
            combined_embeds = np.vstack([embeds1.values, embeds2.values])
            if len(combined_embeds) < 2:
                tsv_score = np.nan
            else:
                distances = np.linalg.norm(np.diff(combined_embeds, axis=0), axis=1)
                tsv_score = distances.mean()

        logging.info(f"  - Computed PCS: {pcs_score:.4f}")
        logging.info(f"  - Computed TSV: {tsv_score:.4f}")

        results.append({'Event Type': name, 'PCS': pcs_score, 'TSV': tsv_score})

    # --- 4. Assemble and display the final table ---
    final_table = pd.DataFrame(results).set_index('Event Type')

    # Reorder to match the paper's Table 5.
    final_order = [
        "Within-Regime (pre-COVID)",
        "Within-Regime (COVID)",
        "Cross-Regime (pre-COVID vs. COVID)",
        "Cross-Regime (post-COVID vs. rate-hike)"
    ]
    final_table = final_table.reindex(final_order)

    logging.info("\nControl Experiment Results (Reproducing Table 5):\n" + final_table.to_string(float_format='{:.2f}'.format))

    logging.info("\n>>> Control experiment completed successfully. <<<")

    return final_table
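
The cross-regime TSV branch above stacks two date-sorted embedding matrices and averages the Euclidean distance between consecutive rows. The sketch below reproduces only that arithmetic on tiny hand-made vectors; the function name `mean_step_distance` and the toy embeddings are assumptions, and `compute_tsv` itself may differ in detail.

```python
import numpy as np


def mean_step_distance(embeddings):
    """Mean Euclidean distance between consecutive embedding rows."""
    emb = np.asarray(embeddings, dtype=float)
    if emb.ndim != 2 or len(emb) < 2:
        # With fewer than two rows there is no step to measure.
        return float("nan")
    steps = np.diff(emb, axis=0)  # row[i + 1] - row[i]
    return float(np.linalg.norm(steps, axis=1).mean())


# Two toy "regimes" with 2-D embeddings, already in chronological order.
regime_a = np.array([[0.0, 0.0], [3.0, 4.0]])   # one step of length 5
regime_b = np.array([[3.0, 4.0], [3.0, 10.0]])  # one step of length 6
combined = np.vstack([regime_a, regime_b])      # boundary step has length 0

cross_regime_score = mean_step_distance(combined)
print(round(cross_regime_score, 4))
```

The stacked sequence also contains the single step that crosses the regime boundary, so the cross-regime value mixes within-regime steps with one boundary step.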


# Task 31: Cross-Sector Data Preparation

def prepare_cross_sector_datasets(
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    source_sector: str,
    target_sector: str
) -> Dict[str, pd.DataFrame]:
    """
    Prepares all necessary data subsets for a cross-sector generalization experiment.

    This function takes the full, regime-based data splits and re-aggregates
    them into sector-specific training, validation, and testing sets. It prepares
    data for two distinct experiments:
    1.  Cross-Sector: Training on the source sector, testing on the target sector.
    2.  In-Sector Baseline: Training on the target sector, testing on the target sector.

    Args:
        data_splits: The nested dictionary of all 12 data splits.
        source_sector: The name of the sector to train on (e.g., 'Financials').
        target_sector: The name of the sector to test on (e.g., 'Health Care').

    Returns:
        A dictionary containing the five key aggregated DataFrames:
        - 'source_train': All training data from the source sector.
        - 'source_val': All validation data from the source sector.
        - 'target_train': All training data from the target sector (for baseline).
        - 'target_val': All validation data from the target sector (for baseline).
        - 'target_test': All testing data from the target sector.

    Raises:
        ValueError: If the source and target sectors are the same, or if
                    insufficient data is found for any key subset.
    """
    logging.info("--- Running Task 31: Cross-Sector Data Preparation ---")
    logging.info(f"Source Sector: '{source_sector}', Target Sector: '{target_sector}'")

    # --- Input Validation ---
    if source_sector == target_sector:
        raise ValueError("Source and target sectors must be different for a cross-sector experiment.")

    # --- 1. Isolate and Aggregate Data for Each Required Subset ---

    # This dictionary will hold lists of DataFrame chunks before concatenation.
    subset_chunks: Dict[str, List[pd.DataFrame]] = {
        'source_train': [], 'source_val': [],
        'target_train': [], 'target_val': [], 'target_test': []
    }

    # Iterate through all 12 original data splits.
    for regime, splits in data_splits.items():
        for split_name, df in splits.items():
            if df.empty:
                continue

            # Filter for the source sector.
            source_mask = df.index.get_level_values('sector') == source_sector
            source_df_chunk = df[source_mask]

            # Filter for the target sector.
            target_mask = df.index.get_level_values('sector') == target_sector
            target_df_chunk = df[target_mask]

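            # Normalise the split name before dispatching. Other parts of the
            # pipeline index the splits with short keys (e.g. `splits['test']`),
            # so both spellings are accepted here; the short forms 'train' and
            # 'val' are assumed by analogy with 'test'.
            split_kind = {'train': 'training', 'val': 'validation', 'test': 'testing'}.get(split_name, split_name)

            # Append the filtered chunks to the correct lists.
            if not source_df_chunk.empty:
                if split_kind == 'training':
                    subset_chunks['source_train'].append(source_df_chunk)
                elif split_kind == 'validation':
                    subset_chunks['source_val'].append(source_df_chunk)

            if not target_df_chunk.empty:
                if split_kind == 'training':
                    subset_chunks['target_train'].append(target_df_chunk)
                elif split_kind == 'validation':
                    subset_chunks['target_val'].append(target_df_chunk)
                elif split_kind == 'testing':
                    subset_chunks['target_test'].append(target_df_chunk)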

    # --- 2. Concatenate Chunks and Validate ---

    # This dictionary will hold the final, aggregated DataFrames.
    sector_datasets: Dict[str, pd.DataFrame] = {}

    for name, chunks in subset_chunks.items():
        if not chunks:
            # Raise an error if a critical subset has no data.
            raise ValueError(f"No data found for '{name}' subset. Cannot proceed with cross-sector analysis.")

        # Concatenate all chunks for the given subset into a single DataFrame.
        agg_df = pd.concat(chunks)

        # Sort by date as a best practice.
        agg_df.sort_index(level='date', inplace=True)

        # Store the final aggregated DataFrame.
        sector_datasets[name] = agg_df

        logging.info(f"Created '{name}' dataset with {len(agg_df)} samples.")

    # --- 3. Final Sanity Check ---
    # Ensure there is no ticker overlap between the source training set and target test set.
    source_tickers = set(sector_datasets['source_train'].index.get_level_values('ticker'))
    target_tickers = set(sector_datasets['target_test'].index.get_level_values('ticker'))

    if not source_tickers.isdisjoint(target_tickers):
        logging.warning(
            f"Ticker overlap found between source train and target test sets: "
            f"{source_tickers.intersection(target_tickers)}. This is unexpected "
            "for distinct sectors but may occur with multi-sector companies."
        )

    logging.info("\n>>> Cross-sector dataset preparation completed successfully. <<<")

    return sector_datasets
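
The re-aggregation described above filters each regime's split by the `sector` index level and concatenates the matching rows. The sketch below shows that pattern on a toy structure. The regime names, tickers, dates and the helper `collect_sector_rows` are hypothetical. The index levels (`date`, `ticker`, `sector`) and the split names (`training`, `testing`) follow the function above.

```python
import pandas as pd


def make_split(rows):
    """Build a toy split indexed by (date, ticker, sector)."""
    index = pd.MultiIndex.from_tuples(
        [(pd.Timestamp(date), ticker, sector) for date, ticker, sector in rows],
        names=["date", "ticker", "sector"],
    )
    return pd.DataFrame({"aggregated_text": ["headline"] * len(rows)}, index=index)


# Hypothetical regimes, tickers and dates, for illustration only.
toy_splits = {
    "RegimeB": {
        "training": make_split([("2021-03-01", "BANK1", "Financials"),
                                ("2021-03-02", "PHARMA1", "Health Care")]),
        "testing": make_split([("2021-06-01", "PHARMA1", "Health Care")]),
    },
    "RegimeA": {
        "training": make_split([("2019-03-01", "BANK1", "Financials")]),
        "testing": make_split([("2019-06-01", "PHARMA1", "Health Care")]),
    },
}


def collect_sector_rows(splits, sector, split_name):
    """Gather one sector's rows for one split name across all regimes."""
    chunks = []
    for regime_splits in splits.values():
        df = regime_splits.get(split_name)
        if df is None or df.empty:
            continue
        mask = df.index.get_level_values("sector") == sector
        if mask.any():
            chunks.append(df[mask])
    if not chunks:
        raise ValueError(f"No '{split_name}' data found for sector '{sector}'.")
    # Sort chronologically, since regimes may be visited in any order.
    return pd.concat(chunks).sort_index(level="date")


source_train = collect_sector_rows(toy_splits, "Financials", "training")
target_test = collect_sector_rows(toy_splits, "Health Care", "testing")
print(len(source_train), len(target_test))
```

Pooling across regimes means the aggregated sets span all market regimes, so a cross-sector comparison built this way measures sector transfer rather than regime transfer.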


# Task 32: Cross-Sector Performance Analysis

def run_cross_sector_performance_analysis(
    sector_datasets: Dict[str, pd.DataFrame],
    study_config: Dict[str, Any],
    vectorizer: TfidfVectorizer,
    st_model: SentenceTransformer,
    force_retrain: bool = False
) -> pd.DataFrame:
    """
    Orchestrates the cross-sector performance analysis (reproducing Table 4).

    This function executes a series of experiments to measure the generalization
    capability of different model architectures when trained on one domain
    (source sector) and evaluated on another (target sector). It compares
    this cross-sector performance to an in-sector baseline.

    Args:
        sector_datasets: A dictionary of DataFrames prepared by
                         `prepare_cross_sector_datasets`.
        study_config: The complete study configuration dictionary.
        vectorizer: The globally fitted TfidfVectorizer.
        st_model: The initialized SentenceTransformer model.
        force_retrain: If True, retrains models even if checkpoints exist.

    Returns:
        A pandas DataFrame summarizing the cross-sector and in-sector MSE results,
        formatted to be comparable to Table 4.
    """
    logging.info("--- Running Task 32: Cross-Sector Performance Analysis ---")

    # --- 1. Setup Environment and Artifacts ---
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(
        study_config['model_training']['architectures']['text_transformer']['base_model_identifier']
    )

    # Define the model types to be analyzed. The paper's Table 4 is slightly
    # ambiguous. We interpret "Text-Only" as our `text_transformer` and
    # "Feature-Based" as our `feature_transformer`.
    model_types_to_analyze = ['text_transformer', 'feature_transformer']

    # Define sectors
    source_sector = 'Financials'
    target_sector = 'Health Care'

    # This dictionary will store the final MSE results.
    results: Dict[str, Dict[str, float]] = {
        'Cross-Sector MSE': {},
        'In-Sector MSE (Baseline)': {}
    }

    # --- 2. Execute Training and Evaluation Runs ---
    # We need to run four training experiments in total.

    for model_type in model_types_to_analyze:
        logging.info(f"\n--- Processing Model Type: {model_type} ---")

        # --- a. Cross-Sector Experiment (Train on Source, Test on Target) ---
        logging.info(f"  - Running Cross-Sector Experiment (Train: {source_sector}, Test: {target_sector})")

        # Define a unique checkpoint for this experiment.
        checkpoint_dir = Path(f"checkpoints/cross_sector/{source_sector}_trained")
        checkpoint_dir.mkdir(parents=True, exist_ok=True)
        cross_sector_checkpoint_path = checkpoint_dir / f"{model_type}.pth"

        # Train the model on the source sector data if needed.
        if not cross_sector_checkpoint_path.exists() or force_retrain:
            logging.info(f"    - Training model on '{source_sector}' data...")
            execute_single_training_run(
                model_type=model_type,
                train_df=sector_datasets['source_train'],
                val_df=sector_datasets['source_val'],
                study_config=study_config,
                checkpoint_path=cross_sector_checkpoint_path,
                vectorizer=vectorizer, st_model=st_model, tokenizer=tokenizer, device=device
            )
        else:
            logging.info(f"    - Found existing checkpoint: {cross_sector_checkpoint_path}")

        # Evaluate the source-trained model on the target sector test data.
        cross_sector_mse = execute_single_inference_run(
            model_type=model_type,
            checkpoint_path=cross_sector_checkpoint_path,
            test_df=sector_datasets['target_test'],
            study_config=study_config,
            vectorizer=vectorizer, st_model=st_model, tokenizer=tokenizer, device=device
        )
        results['Cross-Sector MSE'][model_type] = cross_sector_mse
        logging.info(f"    - Cross-Sector MSE: {cross_sector_mse:.4f}")

        # --- b. In-Sector Baseline Experiment (Train on Target, Test on Target) ---
        logging.info(f"  - Running In-Sector Baseline (Train: {target_sector}, Test: {target_sector})")

        # Define a unique checkpoint for this experiment.
        checkpoint_dir = Path(f"checkpoints/cross_sector/{target_sector}_trained")
        checkpoint_dir.mkdir(parents=True, exist_ok=True)
        in_sector_checkpoint_path = checkpoint_dir / f"{model_type}.pth"

        # Train the model on the target sector data if needed.
        if not in_sector_checkpoint_path.exists() or force_retrain:
            logging.info(f"    - Training model on '{target_sector}' data...")
            execute_single_training_run(
                model_type=model_type,
                train_df=sector_datasets['target_train'],
                val_df=sector_datasets['target_val'],
                study_config=study_config,
                checkpoint_path=in_sector_checkpoint_path,
                vectorizer=vectorizer, st_model=st_model, tokenizer=tokenizer, device=device
            )
        else:
            logging.info(f"    - Found existing checkpoint: {in_sector_checkpoint_path}")

        # Evaluate the target-trained model on the target sector test data.
        in_sector_mse = execute_single_inference_run(
            model_type=model_type,
            checkpoint_path=in_sector_checkpoint_path,
            test_df=sector_datasets['target_test'],
            study_config=study_config,
            vectorizer=vectorizer, st_model=st_model, tokenizer=tokenizer, device=device
        )
        results['In-Sector MSE (Baseline)'][model_type] = in_sector_mse
        logging.info(f"    - In-Sector MSE: {in_sector_mse:.4f}")

    # --- 3. Assemble and Format the Final Table ---
    logging.info("\nAssembling final cross-sector analysis table...")

    # Create the DataFrame from the results dictionary.
    results_df = pd.DataFrame(results)

    # Rename index for clarity in the final table.
    results_df.index = results_df.index.map({
        'text_transformer': 'Text-Only',
        'feature_transformer': 'Feature-Based'
    })
    results_df.index.name = 'Model'

    # The paper's Table 4 has a slightly different structure. We will present
    # our more direct comparison.

    logging.info("\nCross-Sector Performance Analysis:\n" + results_df.to_string(float_format='{:.4f}'.format))

    # Add a transferability ratio for more insight.
    results_df['Transferability Ratio'] = results_df['Cross-Sector MSE'] / results_df['In-Sector MSE (Baseline)']

    logging.info("\nAnalysis with Transferability Ratio (Lower is Better):\n" + results_df.to_string(float_format='{:.4f}'.format))

    logging.info("\n>>> Cross-sector performance analysis completed successfully. <<<")

    return results_df
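
As a quick worked example of the transferability ratio computed at the end of the function, the sketch below uses invented MSE values (they are not results from the paper). A ratio of 1.0 means the source-trained model matches the in-sector baseline on the target sector, and values above 1.0 indicate a loss from transfer.

```python
import pandas as pd

# Invented MSE values, for illustration only.
toy_results = {
    "Cross-Sector MSE": {"text_transformer": 0.030, "feature_transformer": 0.024},
    "In-Sector MSE (Baseline)": {"text_transformer": 0.020, "feature_transformer": 0.020},
}

# Outer keys become columns, inner keys become the row index.
toy_df = pd.DataFrame(toy_results)
toy_df["Transferability Ratio"] = (
    toy_df["Cross-Sector MSE"] / toy_df["In-Sector MSE (Baseline)"]
)
print(toy_df.to_string(float_format="{:.4f}".format))
```

In this toy case the text model's error grows by roughly 50% under transfer and the feature model's by roughly 20%.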


# Task 33: Multi-Sector Robustness Validation

def run_multi_sector_robustness_validation(
    data_splits: Dict[str, Dict[str, pd.DataFrame]],
    study_config: Dict[str, Any],
    vectorizer: TfidfVectorizer,
    st_model: SentenceTransformer,
    force_retrain: bool = False
) -> Dict[str, pd.DataFrame]:
    """
    Orchestrates a large-scale, multi-sector robustness validation experiment.

    This function systematically evaluates the generalization capability of models
    across all available GICS sectors. For each model type, it generates a full
    N x N transferability matrix where N is the number of sectors. Each cell (i, j)
    in the matrix represents the performance of a model trained on sector i and
    evaluated on sector j.

    Args:
        data_splits: The nested dictionary of all 12 data splits.
        study_config: The complete study configuration dictionary.
        vectorizer: The globally fitted TfidfVectorizer.
        st_model: The initialized SentenceTransformer model.
        force_retrain: If True, retrains models even if checkpoints exist.

    Returns:
        A dictionary where keys are model types and values are the corresponding
        pandas DataFrame transferability matrices.
    """
    logging.info("--- Running Task 33: Multi-Sector Robustness Validation ---")

    # --- 1. Setup Environment and Artifacts ---
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(
        study_config['model_training']['architectures']['text_transformer']['base_model_identifier']
    )

    # Identify all unique sectors present in the dataset.
    all_sectors = sorted(list(
        pd.concat([df for r in data_splits.values() for df in r.values()])
        .index.get_level_values('sector').unique()
    ))
    logging.info(f"Found {len(all_sectors)} unique sectors for analysis: {all_sectors}")

    model_types_to_analyze = ['text_transformer', 'feature_transformer']

    # This list will store the long-format results before pivoting.
    all_results_long = []

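    # --- 2. Main Experiment Loop: Iterate over all Sector Pairs ---
    # `permutations(all_sectors, 2)` yields only ordered pairs with source != target,
    # so the in-sector pairs (A, A) needed for the baseline diagonal are appended explicitly.
    sector_pairs = list(permutations(all_sectors, 2)) + [(s, s) for s in all_sectors]
    for source_sector, target_sector in tqdm(sector_pairs, desc="Sector Pairs"):
        for model_type in model_types_to_analyze:

            logging.info(f"\n--- Processing: [Model: {model_type}] | [Train: {source_sector}] -> [Test: {target_sector}] ---")

            try:
                # --- a. Prepare the specific data slices for this pair ---
                if source_sector == target_sector:
                    # `prepare_cross_sector_datasets` rejects identical sectors, so borrow
                    # another sector as a placeholder source and use the target-side subsets.
                    placeholder = next((s for s in all_sectors if s != target_sector), None)
                    if placeholder is None:
                        raise ValueError("At least two sectors are required to build an in-sector baseline.")
                    sector_datasets = prepare_cross_sector_datasets(data_splits, placeholder, target_sector)
                    train_df, val_df = sector_datasets['target_train'], sector_datasets['target_val']
                else:
                    sector_datasets = prepare_cross_sector_datasets(data_splits, source_sector, target_sector)
                    train_df, val_df = sector_datasets['source_train'], sector_datasets['source_val']

                # --- b. Train model on source sector (leveraging idempotency) ---
                checkpoint_dir = Path(f"checkpoints/multi_sector/{source_sector}_trained")
                checkpoint_dir.mkdir(parents=True, exist_ok=True)
                checkpoint_path = checkpoint_dir / f"{model_type}.pth"

                if not checkpoint_path.exists() or force_retrain:
                    logging.info(f"    - Training model on '{source_sector}' data...")
                    execute_single_training_run(
                        model_type=model_type,
                        train_df=train_df,
                        val_df=val_df,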
                        study_config=study_config,
                        checkpoint_path=checkpoint_path,
                        vectorizer=vectorizer, st_model=st_model, tokenizer=tokenizer, device=device
                    )
                else:
                    logging.info(f"    - Found existing checkpoint for '{source_sector}'-trained model.")

                # --- c. Evaluate the trained model on the target sector ---
                logging.info(f"    - Evaluating on '{target_sector}' data...")
                mse = execute_single_inference_run(
                    model_type=model_type,
                    checkpoint_path=checkpoint_path,
                    test_df=sector_datasets['target_test'],
                    study_config=study_config,
                    vectorizer=vectorizer, st_model=st_model, tokenizer=tokenizer, device=device
                )

                # Append the result to our long-format list.
                all_results_long.append({
                    'model_type': model_type,
                    'source_sector': source_sector,
                    'target_sector': target_sector,
                    'mse': mse
                })

            except ValueError as e:
                # Gracefully handle cases where a sector has insufficient data.
                logging.warning(f"Skipping pair ({source_sector}, {target_sector}) for model {model_type} due to data issue: {e}")
                all_results_long.append({
                    'model_type': model_type,
                    'source_sector': source_sector,
                    'target_sector': target_sector,
                    'mse': np.nan
                })

    # --- 3. Assemble and Process the Final Transferability Matrices ---
    if not all_results_long:
        logging.error("No cross-sector results were generated.")
        return {}

    results_df = pd.DataFrame(all_results_long)

    final_matrices = {}
    for model_type in model_types_to_analyze:
        logging.info(f"\n--- Assembling Transferability Matrix for: {model_type} ---")

        model_results_df = results_df[results_df['model_type'] == model_type]

        # Pivot the long-format data to create the MSE matrix.
        mse_matrix = model_results_df.pivot_table(
            index='source_sector',
            columns='target_sector',
            values='mse'
        )

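        # Align rows and columns to the same sector order so that the diagonal
        # holds the in-sector baseline MSE (model trained and tested on one sector).
        mse_matrix = mse_matrix.reindex(index=all_sectors, columns=all_sectors)
        baseline_mses = pd.Series(np.diag(mse_matrix.to_numpy()), index=mse_matrix.columns)

        # Calculate the Transferability Ratio matrix.
        # Ratio = Cross-Sector MSE / In-Sector Baseline MSE
        # `axis=1` matches the baseline Series to the columns, so each column
        # (target sector) is divided by that sector's baseline MSE.
        transferability_matrix = mse_matrix.div(baseline_mses, axis=1)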

        final_matrices[model_type] = transferability_matrix

        # --- 4. Visualize the Matrix ---
        plt.figure(figsize=(14, 10))
        sns.heatmap(
            transferability_matrix,
            annot=True,
            fmt=".2f",
            cmap="coolwarm",
            linewidths=.5,
            center=1.0 # Center the colormap at 1.0 (perfect transfer)
        )
        plt.title(f"Cross-Sector Transferability Matrix (Ratio) for {model_type}", fontsize=16)
        plt.xlabel("Target Sector (Evaluation)", fontsize=12)
        plt.ylabel("Source Sector (Training)", fontsize=12)
        plt.show()

    logging.info("\n>>> Multi-sector robustness validation completed successfully. <<<")

    return final_matrices
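
The transferability matrix depends on having a value for every (source, target) pair, including the in-sector pairs on the diagonal. Note that `itertools.permutations(sectors, 2)` yields only pairs of distinct sectors, whereas `itertools.product(sectors, repeat=2)` also includes (A, A). The sketch below builds a small ratio matrix from invented MSE values to show the pivot and the column-wise division; the sector list and the numbers are assumptions for illustration.

```python
from itertools import permutations, product

import numpy as np
import pandas as pd

sectors = ["Energy", "Financials", "Health Care"]

# permutations omits the in-sector pairs; product with repeat=2 includes them.
off_diagonal_pairs = list(permutations(sectors, 2))
all_pairs = list(product(sectors, repeat=2))

# Invented MSE values: 2.0 in-sector, 3.0 cross-sector.
long_df = pd.DataFrame(
    [
        {"source_sector": s, "target_sector": t, "mse": 2.0 if s == t else 3.0}
        for s, t in all_pairs
    ]
)

mse_matrix = long_df.pivot_table(
    index="source_sector", columns="target_sector", values="mse"
).reindex(index=sectors, columns=sectors)

# The diagonal holds the in-sector baselines, keyed by target sector.
baseline = pd.Series(np.diag(mse_matrix.to_numpy()), index=mse_matrix.columns)

# axis=1 aligns the baseline with the columns, so each target-sector column
# is divided by that sector's own baseline.
ratio_matrix = mse_matrix.div(baseline, axis=1)
print(ratio_matrix)
```

With these toy numbers every diagonal entry is 1.0 and every off-diagonal entry is 1.5, which is the pattern the heatmap centers on with `center=1.0`.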


# Orchestrator Function

def run_full_research_pipeline(
    raw_df: pd.DataFrame,
    study_config: Dict[str, Any],
    results_dir: Path = Path("results"),
    force_rerun: bool = False
) -> Dict[str, Path]:
    """
    Executes the complete end-to-end research pipeline from data validation
    to final analysis, reproducing the study "Quantifying Semantic Shift".

    This master orchestrator function integrates all 33 previously defined tasks
    into a single, sequential, and auditable workflow. It is designed to be
    idempotent, leveraging checkpointing in the training-intensive steps to
    allow for resumption without losing progress.

    Args:
        raw_df: The initial, raw pandas DataFrame containing all market and
                news data.
        study_config: The complete study configuration dictionary.
        results_dir: The root directory to save all artifacts.
        force_rerun: If True, ignores existing artifacts and re-runs all steps.

    Returns:
        A dictionary mapping the names of key final artifacts to their file paths.
    """
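    # --- Initialize the containers for final results and artifact paths ---
    # `artifact_paths` is written to throughout the pipeline, so it must exist up front.
    final_artifacts: Dict[str, Any] = {}
    artifact_paths: Dict[str, Path] = {}

    # Make sure the output directory exists before any artifact is written.
    results_dir.mkdir(parents=True, exist_ok=True)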

    # ==========================================================================
    # PHASE 1: VALIDATION AND CLEANSING (Tasks 1-3)
    # ==========================================================================
    logging.info("\n" + "="*80 + "\nPHASE 1: CONFIGURATION VALIDATION AND DATA QUALITY ASSURANCE\n" + "="*80)

    # Task 1: Validate the study configuration dictionary.
    run_config_validation_suite(study_config)

    # Task 2: Validate the structure and integrity of the raw DataFrame.
    run_dataframe_validation_suite(raw_df)

    # Task 3: Perform data quality checks and cleanse the data.
    df_cleaned = run_data_quality_and_cleansing_suite(raw_df)

    # ==========================================================================
    # PHASE 2 & 3: PARTITIONING AND FEATURE ENGINEERING (Tasks 4-9)
    # ==========================================================================
    logging.info("\n" + "="*80 + "\nPHASE 2 & 3: DATA PARTITIONING AND FEATURE ENGINEERING\n" + "="*80)

    # Define path for a key intermediate artifact.
    data_splits_path = results_dir / "data_splits.pkl"
    artifact_paths['data_splits'] = data_splits_path

    if not data_splits_path.exists() or force_rerun:
        df_regimes = run_regime_assignment_suite(df_cleaned, study_config)
        data_splits = run_chronological_splitting_suite(df_regimes, study_config)
        pd.to_pickle(data_splits, data_splits_path)
    else:
        logging.info(f"Loading existing data splits from '{data_splits_path}'...")
        data_splits = pd.read_pickle(data_splits_path)

    # Feature engineering steps are fast, so we can re-run them.
    vectorizer, tfidf_features = run_tfidf_vectorization_suite(data_splits, study_config)
    embedding_features = run_embedding_extraction_suite(data_splits, study_config)
    combined_features = run_feature_concatenation_suite(tfidf_features, embedding_features, data_splits, study_config)

    # ==========================================================================
    # PHASE 4 & 5: MODEL TRAINING (Tasks 10-15)
    # ==========================================================================
    logging.info("\n" + "="*80 + "\nPHASE 4 & 5: MODEL ARCHITECTURE AND TRAINING\n" + "="*80)
    # Tasks 10-12 are the class definitions, which are implicitly used here.

    # Task 15: Run the master training orchestrator for all 12 models.
    training_results = run_regime_specific_training_pipeline(
        data_splits=data_splits,
        tfidf_features=tfidf_features,
        embedding_features=embedding_features,
        combined_features=combined_features,
        study_config=study_config,
        force_retrain=force_rerun
    )
    training_results_path = results_dir / "training_results.pkl"
    pd.to_pickle(training_results, training_results_path)
    artifact_paths['training_results'] = training_results_path

    # ==========================================================================
    # PHASE 6: PREDICTION AND BASELINE EVALUATION (Tasks 16-18)
    # ==========================================================================
    logging.info("\n" + "="*80 + "\nPHASE 6: PREDICTION GENERATION AND BASELINE EVALUATION\n" + "="*80)

    enriched_predictions_path = results_dir / "enriched_predictions.pkl"
    artifact_paths['enriched_predictions'] = enriched_predictions_path

    if not enriched_predictions_path.exists() or force_rerun:
        predictions_df = run_inference_pipeline(
            training_results=training_results,
            data_splits=data_splits,
            tfidf_features=tfidf_features,
            embedding_features=embedding_features,
            combined_features=combined_features,
            study_config=study_config
        )
        enriched_predictions_df = enrich_and_store_predictions(
            predictions_df=predictions_df, export_path=enriched_predictions_path,
            metadata={'experiment_id': 'full_run_v1', 'run_timestamp': datetime.utcnow().isoformat()}
        )
    else:
        logging.info(f"Loading existing enriched predictions from '{enriched_predictions_path}'...")
        enriched_predictions_df = load_and_validate_predictions(enriched_predictions_path)

    # Task 17: Compute and display the main MSE results table.
    mse_table = run_mse_evaluation_suite(enriched_predictions_df, study_config)
    final_artifacts['mse_table'] = mse_table

    # ==========================================================================
    # PHASE 7-11: FULL ANALYSIS SUITE (Tasks 19-33)
    # ==========================================================================
    logging.info("\n" + "="*80 + "\nPHASE 7-11: FULL ANALYSIS SUITE\n" + "="*80)

    # --- Load artifacts needed for the full analysis suite ---
    st_model = initialize_sentence_transformer(study_config['feature_engineering']['sentence_embeddings']['model_identifier'])
    tokenizer = AutoTokenizer.from_pretrained(study_config['model_training']['architectures']['text_transformer']['base_model_identifier'])

    # --- Run all analyses ---
    # Each of these functions is an orchestrator for a major analytical task.

    # Task 19-24: Main diagnostic metrics
    robustness_profile = run_diagnostic_metrics_orchestrator(
        predictions_df=enriched_predictions_df, study_config=study_config, data_splits=data_splits,
        embedding_features=embedding_features, training_results=training_results, vectorizer=vectorizer,
        st_model=st_model, tokenizer=tokenizer
    )
    robustness_profile_path = results_dir / "robustness_profile.csv"
    robustness_profile.to_csv(robustness_profile_path)
    artifact_paths['robustness_profile'] = robustness_profile_path

    # Task 25: J-S Divergence
    js_matrix = compute_js_divergence_matrix(data_splits, study_config)
    js_matrix_path = results_dir / "js_divergence_matrix.csv"
    js_matrix.to_csv(js_matrix_path)
    artifact_paths['js_divergence_matrix'] = js_matrix_path

    # Task 26: t-SNE
    run_tsne_visualization_suite(data_splits, embedding_features)

    # Task 27: Correlation Analysis
    correlation_analysis = run_cross_regime_analysis_suite(js_matrix, robustness_profile)
    correlation_path = results_dir / "correlation_analysis.csv"
    correlation_analysis.to_csv(correlation_path)
    artifact_paths['correlation_analysis'] = correlation_path

    # Task 28: Case Study
    case_study = run_stock_specific_case_study(
        target_tickers=['JPM', 'AAPL'], predictions_df=enriched_predictions_df, data_splits=data_splits,
        embedding_features=embedding_features, training_results=training_results, study_config=study_config,
        vectorizer=vectorizer, st_model=st_model, tokenizer=tokenizer
    )
    case_study_path = results_dir / "case_study_results.csv"
    case_study.to_csv(case_study_path)
    artifact_paths['case_study_results'] = case_study_path

    # Task 29: Ablation Studies (computational parts)
    metric_ablation = perform_metric_ablation_analysis(robustness_profile)
    metric_ablation_path = results_dir / "metric_ablation.csv"
    metric_ablation.to_csv(metric_ablation_path)
    artifact_paths['metric_ablation'] = metric_ablation_path

    # Generate the file needed for the manual part of the entailment ablation
    benchmark_file_path = results_dir / "nli_benchmark_for_annotation.csv"
    create_nli_benchmark_file(enriched_predictions_df, benchmark_file_path)
    artifact_paths['nli_benchmark_for_annotation'] = benchmark_file_path
    logging.warning(f"ACTION REQUIRED: The file '{benchmark_file_path}' has been created. It must be manually annotated before running the entailment ablation analysis.")

    # Task 31-33: Cross-Sector Analysis
    transferability_matrices = run_multi_sector_robustness_validation(
        data_splits=data_splits, study_config=study_config, vectorizer=vectorizer,
        st_model=st_model, force_retrain=force_rerun
    )
    transferability_path = results_dir / "transferability_matrices.pkl"
    pd.to_pickle(transferability_matrices, transferability_path)
    artifact_paths['transferability_matrices'] = transferability_path

    logging.info("\n" + "="*80 + "\n====== FULL AUTOMATED RESEARCH PIPELINE COMPLETED SUCCESSFULLY ======\n" + "="*80)

    return artifact_paths


def run_entailment_ablation_analysis(
    enriched_predictions_df_path: Path,
    annotated_benchmark_path: Path,
    study_config: Dict[str, Any]
) -> pd.DataFrame:
    """
    Executes the entailment model comparison using a human-annotated file.

    This function should be run ONLY AFTER the main pipeline is complete and
    the benchmark CSV file it generates has been manually filled out by a
    human expert.

    Args:
        enriched_predictions_df_path: Path to the saved 'enriched_predictions.pkl' artifact.
        annotated_benchmark_path: Path to the CSV file after human annotation.
        study_config: The complete study configuration dictionary.

    Returns:
        A pandas DataFrame summarizing the entailment model comparison.
    """
    logging.info("--- Running Task 29, Step 3: Entailment Model Comparison ---")

    # Load the necessary prediction data.
    predictions_df = load_and_validate_predictions(enriched_predictions_df_path)

    # Run the ablation study using the annotated file.
    entailment_ablation_results = perform_entailment_model_ablation(
        predictions_df=predictions_df,
        annotated_benchmark_path=annotated_benchmark_path,
        study_config=study_config
    )

    return entailment_ablation_results
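
The Phase 6 block above follows a compute-or-load pattern: an artifact is rebuilt only when its file is missing or a rerun is forced, and is otherwise read from disk. A minimal, standard-library sketch of that pattern follows. The helper name `load_or_compute` and the demo values are hypothetical, and `pickle` stands in for the `pd.to_pickle` / `pd.read_pickle` calls the pipeline uses.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(path: Path, compute: Callable[[], Any], force_rerun: bool = False) -> Any:
    """Return the artifact stored at `path`, recomputing it when missing or forced."""
    if force_rerun or not path.exists():
        result = compute()
        # Create the results directory on first use, then persist the artifact.
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    # Cache hit: read the stored artifact instead of recomputing it.
    with open(path, "rb") as f:
        return pickle.load(f)


# Usage: count how often the (hypothetical) expensive step actually runs.
calls = []


def expensive_step() -> dict:
    calls.append(1)
    return {"score": 1.23}  # illustrative value only


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "results" / "demo.pkl"
    first = load_or_compute(artifact, expensive_step)                     # computed and stored
    second = load_or_compute(artifact, expensive_step)                    # served from disk
    third = load_or_compute(artifact, expensive_step, force_rerun=True)   # recomputed
    n_calls = len(calls)
```

Checking the file rather than an in-memory flag is what makes an interrupted run resumable: a restarted session skips every stage whose artifact already exists.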


In [None]:
# Task 34: Comprehensive Results Assembly

def _style_table(
    df: pd.DataFrame,
    caption: str,
    precision: int = 2,
    highlight_min_axis: Optional[int] = None
) -> "pd.io.formats.style.Styler":
    """
    Applies professional, publication-quality styling to a pandas DataFrame.

    This generic helper function takes a DataFrame and returns a pandas Styler
    object with a consistent set of presentation-focused formatting rules. It is
    used to generate the final tables for the research report.

    The applied styles include:
    - A descriptive caption for the table.
    - Uniform numerical precision for all cells.
    - A standard representation for missing values ('N/A').
    - Centered text alignment for headers and data cells.
    - Optionally, highlighting the minimum value in each row or column.

    Args:
        df (pd.DataFrame): The input DataFrame to be styled.
        caption (str): The title caption to be displayed above the table.
        precision (int, optional): The number of decimal places to format
                                   numerical values to. Defaults to 2.
        highlight_min_axis (Optional[int], optional): The axis along which to
            highlight the minimum value. `0` for columns, `1` for rows. If
            `None`, no highlighting is applied. Defaults to None.

    Returns:
        pd.io.formats.style.Styler: A pandas Styler object. This object can be
            rendered directly in environments like Jupyter notebooks or can be
            further processed to generate HTML or LaTeX output.
    """
    # --- Input Validation ---
    # Ensure the input is a pandas DataFrame.
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"Input `df` must be a pandas DataFrame, but got {type(df)}.")

    # --- 1. Initialize the Styler and Set Basic Formatting ---
    # `df.style` creates the Styler object.
    styler = df.style

    # Set the main title for the table.
    styler = styler.set_caption(caption)

    # Apply global formatting to all cells. This sets the number of decimal
    # places for floats and defines how NaN values should be displayed.
    styler = styler.format(precision=precision, na_rep="N/A")

    # --- 2. Apply CSS Styles for Professional Appearance ---
    # `set_table_styles` allows for applying custom CSS to different parts of the table.
    styler = styler.set_table_styles([
        # Style for table headers (<th> tags in HTML).
        {'selector': 'th', 'props': [
            ('text-align', 'center'),
            ('font-weight', 'bold'),
            ('background-color', '#f2f2f2')
        ]},
        # Style for data cells (<td> tags in HTML).
        {'selector': 'td', 'props': [
            ('text-align', 'center'),
            ('padding', '5px')
        ]},
        # Style for the table caption.
        {'selector': 'caption', 'props': [
            ('font-size', '1.2em'),
            ('font-weight', 'bold'),
            ('margin', '10px 0px')
        ]}
    ])

    # --- 3. Apply Conditional Highlighting ---
    # Check if the `highlight_min_axis` argument was provided.
    if highlight_min_axis is not None:
        # `highlight_min` is a built-in Styler method that finds the minimum
        # value along the specified axis (0 for columns, 1 for rows) and
        # applies the given CSS properties to that cell.
        styler = styler.highlight_min(
            axis=highlight_min_axis,
            props='font-weight: bold; color: #0055A4;' # e.g., bold and blue
        )

    # Return the final, configured Styler object.
    return styler

def run_comprehensive_results_assembly(
    artifact_paths: Dict[str, Path],
    entailment_ablation_results: Optional[pd.DataFrame] = None
) -> Dict[str, Any]:
    """
    Assembles, validates, and presents all computed results in publication-quality tables.

    This master reporting function loads all key artifacts generated by the main
    research pipeline, validates their presence, and then generates a series of
    styled tables that reproduce the main findings presented in the paper.

    Args:
        artifact_paths: A dictionary mapping artifact names to their file paths,
                        as returned by `run_full_research_pipeline`.
        entailment_ablation_results: An optional DataFrame containing the results
                                     from the human-in-the-loop entailment model
                                     comparison.

    Returns:
        A "master results database" dictionary containing all the key loaded
        DataFrames for further interactive analysis.

    Raises:
        FileNotFoundError: If a required artifact is not found at its specified path.
    """
    logging.info("--- Running Task 34: Comprehensive Results Assembly ---")

    # --- Step 1: Aggregate All Computed Metrics and Results (by loading) ---
    logging.info("Step 1: Loading all computed result artifacts...")

    master_results_db: Dict[str, Any] = {}

    # Load each required artifact, checking for existence first.
    for name, path in artifact_paths.items():
        if not path.exists():
            raise FileNotFoundError(f"Required artifact '{name}' not found at path: {path}")

        if path.suffix == '.pkl':
            master_results_db[name] = pd.read_pickle(path)
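        elif path.suffix == '.csv':
            # `read_csv` generally does not raise when `index_col=[0, 1]` is
            # applied to a single-index file; it silently absorbs the first data
            # column into the index. Select the index layout explicitly instead.
            # Only the artifacts accessed below by their ('regime', 'model_type')
            # style index levels are read with a two-level index.
            multi_index_artifacts = {'robustness_profile', 'case_study_results'}
            index_col = [0, 1] if name in multi_index_artifacts else 0
            master_results_db[name] = pd.read_csv(path, index_col=index_col)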

    # Add the optional, externally computed entailment ablation results.
    if entailment_ablation_results is not None:
        master_results_db['entailment_ablation'] = entailment_ablation_results

    logging.info(f"Successfully loaded {len(master_results_db)} artifacts into the master results database.")

    # --- Step 2: Create Master Results Database (already done by loading) ---
    # The `master_results_db` dictionary is our implementation of this concept.
    # Persist the whole collection as a single artifact next to the other results.
    if 'robustness_profile' in artifact_paths:
        results_root = artifact_paths['robustness_profile'].parent
    else:
        results_root = Path("results")
    master_db_path = results_root / "master_results_database.pkl"
    pd.to_pickle(master_results_db, master_db_path)
    logging.info(f"Step 2: Master results database saved to '{master_db_path}'.")

    # --- Step 3: Generate Publication-Quality Tables and Figures ---
    logging.info("\nStep 3: Generating publication-quality tables...")

    # --- Table 3: MSE Comparison ---
    mse_table = master_results_db['robustness_profile'][['MSE']].reset_index().pivot(
        index='regime', columns='model_type', values='MSE'
    )
    display(_style_table(mse_table, "Table 3 (Reproduced): Mean Squared Error (MSE)", highlight_min_axis=1))

    # --- Full Robustness Profile Table ---
    display(_style_table(master_results_db['robustness_profile'], "Full Robustness Profile", precision=3))

    # --- Table 6: Case Study ---
    case_study_df = master_results_db['case_study_results']
    # To match the paper's format, we can display one table per model.
    for model_type in case_study_df.index.get_level_values('model_type').unique():
        table_6_like = case_study_df.xs(model_type, level='model_type')
        display(_style_table(table_6_like, f"Table 6 (Reproduced): Case Study for Model '{model_type}'", precision=3))

    # --- Table 7: Metric Ablation ---
    metric_ablation_df = master_results_db['metric_ablation']
    # `_style_table` already renders missing values as 'N/A' and accepts no `na_rep` argument.
    display(_style_table(metric_ablation_df, "Table 7 (Reproduced): Metric Ablation Analysis", precision=2))

    # --- Table 8: Feature Augmentation Ablation ---
    feature_ablation_df = master_results_db.get('feature_ablation')
    if feature_ablation_df is not None:
        display(_style_table(feature_ablation_df, "Table 8 (Reproduced): Feature Augmentation Ablation", precision=3))

    # --- Table 9: Entailment Model Comparison ---
    entailment_ablation_df = master_results_db.get('entailment_ablation')
    if entailment_ablation_df is not None:
        display(_style_table(entailment_ablation_df, "Table 9 (Reproduced): Entailment Model Comparison", precision=2))
    else:
        logging.warning("Entailment ablation results not provided; skipping Table 9.")

    # --- Transferability Matrix Visualization ---
    transfer_matrices = master_results_db.get('transferability_matrices')
    if transfer_matrices is not None:
        for model_type, matrix in transfer_matrices.items():
            plt.figure(figsize=(14, 10))
            sns.heatmap(matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=.5, center=1.0)
            plt.title(f"Cross-Sector Transferability Matrix (Ratio) for {model_type}", fontsize=16)
            plt.xlabel("Target Sector (Evaluation)", fontsize=12)
            plt.ylabel("Source Sector (Training)", fontsize=12)
            plt.show()

    logging.info("\n>>> Comprehensive results assembly and reporting completed successfully. <<<")

    return master_results_db
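
The Table 3 step reshapes the long-format robustness profile, indexed by `('regime', 'model_type')`, into a regime-by-model grid before styling it. The sketch below tries that reshaping in isolation. The regime names, model names, and MSE values are made up for illustration, and `idxmin(axis=1)` is used here only to show which cell `highlight_min_axis=1` would emphasise in each row.

```python
import pandas as pd

# Hypothetical long-format profile; the values are illustrative, not study results.
profile = pd.DataFrame(
    {"MSE": [1.0, 2.0, 4.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [
            ("regime_a", "model_x"),
            ("regime_a", "model_y"),
            ("regime_b", "model_x"),
            ("regime_b", "model_y"),
        ],
        names=["regime", "model_type"],
    ),
)

# Same reshaping as the Table 3 step: one row per regime, one column per model.
mse_table = profile[["MSE"]].reset_index().pivot(
    index="regime", columns="model_type", values="MSE"
)

# The per-row minimum is the cell that row-wise highlighting would mark.
best_model_per_regime = mse_table.idxmin(axis=1)
```

`pivot` raises a `ValueError` if any `(regime, model_type)` pair occurs more than once, which doubles as a cheap integrity check on the profile.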


In [None]:
# Task 35: Results Validation and Quality Assurance

def run_results_validation_and_synthesis(
    master_results_db: Dict[str, Any],
    enriched_predictions_df: pd.DataFrame
) -> str:
    """
    Performs a final validation, statistical testing, and synthesis of all results.

    This master analytical function serves as the final step of the research
    pipeline. It takes the database of all computed results and:
    1.  Validates key findings against benchmarks from the source paper.
    2.  Performs statistical significance tests on key comparisons.
    3.  Generates a comprehensive, structured text report summarizing the
        study's findings, limitations, and implications.

    Args:
        master_results_db: The dictionary containing all result DataFrames.
        enriched_predictions_df: The DataFrame with per-sample predictions
                                 and errors, required for t-tests.

    Returns:
        A formatted string containing the full analytical report.
    """
    logging.info("--- Running Task 35: Results Validation and Quality Assurance ---")

    # Initialize a list to build the report string.
    report_parts = ["="*80, "FINAL ANALYTICAL REPORT: Quantifying Semantic Shift in Financial NLP", "="*80 + "\n"]

    # --- Step 1: Validate Results Against Paper Benchmarks ---
    report_parts.append("\n--- 1. Validation Against Paper Benchmarks ---\n")

    # Define benchmark values from the paper's tables.
    benchmarks = {
        'MSE': {
            ('Pre-COVID', 'lstm'): 3.08,
            ('COVID', 'text_transformer'): 40.95,
        },
        'Feature_Ablation_Cross_MSE': {
            'Text Only': 0.501,
            'Feature Enhanced': 0.469
        }
    }
    tolerance = 0.15 # Allow for 15% tolerance due to stochasticity.

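    # Validate MSE benchmarks. The MSE table is only present in the database if
    # it was stored as an artifact; otherwise rebuild it from the robustness profile.
    mse_table = master_results_db.get('mse_table')
    if mse_table is None:
        mse_table = master_results_db['robustness_profile'][['MSE']].reset_index().pivot(
            index='regime', columns='model_type', values='MSE'
        )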
    for (regime, model), bench_val in benchmarks['MSE'].items():
        try:
            comp_val = mse_table.loc[regime, model]
            is_close = np.isclose(comp_val, bench_val, rtol=tolerance)
            status = "PASSED" if is_close else "FAILED"
            report_parts.append(
                f"  - MSE Benchmark Check for ({regime}, {model}): "
                f"Computed={comp_val:.2f}, Paper={bench_val}, Tolerance={tolerance:.0%}. Status: {status}"
            )
        except KeyError:
            report_parts.append(f"  - MSE Benchmark Check for ({regime}, {model}): FAILED (Computed value not found)")

    # Validate Feature Ablation pattern
    feature_ablation = master_results_db.get('feature_ablation')
    if feature_ablation is not None:
        text_only_mse = feature_ablation.loc['Text Only', 'Cross-Sector MSE']
        feature_enhanced_mse = feature_ablation.loc['Feature Enhanced', 'Cross-Sector MSE']
        pattern_holds = feature_enhanced_mse < text_only_mse
        status = "PASSED" if pattern_holds else "FAILED"
        report_parts.append(
            f"  - Feature Ablation Pattern Check (Feature Enhanced < Text Only): "
            f"{feature_enhanced_mse:.3f} < {text_only_mse:.3f}. Status: {status}"
        )

    # --- Step 2: Implement Statistical Significance Testing ---
    report_parts.append("\n--- 2. Statistical Significance Analysis ---\n")

    # --- Paired t-tests for MSE comparisons within a regime ---
    report_parts.append("  - Paired t-tests on MSE differences (p < 0.05 is significant):")
    regimes = enriched_predictions_df['regime'].unique()
    models = enriched_predictions_df['model_type'].unique()

    for regime in regimes:
        regime_df = enriched_predictions_df[enriched_predictions_df['regime'] == regime]
        if len(models) > 1:
            # Compare the first two models as an example.
            model1, model2 = models[0], models[1]
            errors1 = regime_df[regime_df['model_type'] == model1]['squared_error']
            errors2 = regime_df[regime_df['model_type'] == model2]['squared_error']

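            # Pair samples by index label. This yields matched pairs only if both
            # models' rows share the same sample index; unmatched rows are dropped.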
            aligned_errors = pd.DataFrame({'m1': errors1, 'm2': errors2}).dropna()
            if len(aligned_errors) > 10:
                t_stat, p_value = stats.ttest_rel(aligned_errors['m1'], aligned_errors['m2'])
                report_parts.append(
                    f"    - In '{regime}', {model1} vs. {model2}: p-value = {p_value:.4f}"
                )

    # --- Correlation Significance ---
    corr_analysis = master_results_db.get('correlation_analysis')
    if corr_analysis is not None:
        report_parts.append("\n  - Significance of Semantic Drift Correlation (Spearman's Rho):")
        for model_type, row in corr_analysis.iterrows():
            report_parts.append(
                f"    - For '{model_type}': rho = {row['spearman_rho']:.3f}, p-value = {row['p_value']:.4f}"
            )

    # --- Step 3: Create Results Interpretation Guidelines & Summary ---
    report_parts.append("\n--- 3. Synthesis and Interpretation ---\n")

    # --- Key Findings ---
    report_parts.append("  - Key Findings:")
    # Programmatically find the most volatile regime and most sensitive model.
    most_volatile_regime = mse_table.unstack().idxmax()[1]
    most_sensitive_model = mse_table.std(axis=0).idxmax()
    most_stable_model = mse_table.std(axis=0).idxmin()

    report_parts.append(f"    1. Model performance degrades significantly during crisis periods, with the '{most_volatile_regime}' regime showing the highest MSE for the '{mse_table.unstack().idxmax()[0]}' model.")
    report_parts.append(f"    2. The '{most_sensitive_model}' model exhibits the highest variance in performance across regimes, making it the most sensitive to semantic drift.")
    report_parts.append(f"    3. The '{most_stable_model}' model shows the most consistent performance, making it the most robust to regime shifts.")
    if corr_analysis is not None and ((corr_analysis['p_value'] < 0.1) & (corr_analysis['spearman_rho'] > 0)).any():
        report_parts.append("    4. There is a positive correlation between semantic drift (J-S Divergence) and model error (MSE degradation) that is significant at the 10% level for at least one model, consistent with the paper's central hypothesis.")
    if feature_ablation is not None and pattern_holds:
        report_parts.append("    5. Feature enhancement (combining TF-IDF and embeddings) improves cross-sector generalization compared to using text features alone.")

    # --- Limitations ---
    report_parts.append("\n  - Limitations:")
    report_parts.append("    1. The correlation analysis is based on a small sample of regime pairs (N=6), limiting its statistical power. Results should be seen as indicative.")
    report_parts.append("    2. Causal metrics (FCAS, PCS) rely on simplified keyword-based methods, which may not capture the full nuance of financial narratives.")
    report_parts.append("    3. The study is limited to a specific set of models and time periods; findings may not generalize to all market conditions or architectures.")

    # --- Implications ---
    report_parts.append("\n  - Implications for Model Governance:")
    report_parts.append("    1. Static models trained on historical data are unreliable in non-stationary financial markets. Continuous monitoring of performance and data drift is essential.")
    report_parts.append("    2. The diagnostic metric suite (FCAS, PCS, TSV, NLICS) provides a powerful toolkit for model validation, stress testing, and identifying failure modes beyond simple accuracy metrics.")
    report_parts.append("    3. An effective governance system must include triggers for model retraining or recalibration based on quantitative measures of semantic drift (like TSV or J-S Divergence).")

    # --- Final Assembly ---
    final_report = "\n".join(report_parts)
    logging.info("\n" + final_report)

    # Save the report to a file in the default results directory.
    report_path = Path("results") / "final_analytical_report.txt"
    report_path.parent.mkdir(parents=True, exist_ok=True)
    with open(report_path, "w") as f:
        f.write(final_report)
    logging.info(f"Final analytical report saved to '{report_path}'")

    return final_report
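
The benchmark check above relies on `np.isclose(comp_val, bench_val, rtol=tolerance)`, which tests `|a - b| <= atol + rtol * |b|`. The tolerance is therefore measured relative to the second argument, here the paper's value. A plain-Python sketch of the same inequality makes that asymmetry visible. The function name and the numbers are illustrative assumptions, not values from the study.

```python
def within_relative_tolerance(
    computed: float, reference: float, rtol: float = 0.15, atol: float = 1e-8
) -> bool:
    """Mirror the inequality used by np.isclose: |a - b| <= atol + rtol * |b|."""
    return abs(computed - reference) <= atol + rtol * abs(reference)


# With a reference of 10.0 and rtol=0.15, the accepted band is roughly [8.5, 11.5].
passes = within_relative_tolerance(11.4, 10.0)   # difference 1.4 is inside the 1.5 band
fails = within_relative_tolerance(11.6, 10.0)    # difference 1.6 is outside the 1.5 band

# The check is not symmetric: swapping the arguments changes the band width.
relative_to_paper = within_relative_tolerance(8.6, 10.0)     # band is 0.15 * 10.0
relative_to_computed = within_relative_tolerance(10.0, 8.6)  # band is 0.15 * 8.6
```

Keeping the paper's value as the reference, as the validation code does, means a given `tolerance` defines the same acceptance band on every run, however far the computed value drifts.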


In [None]:
# Top-Level Orchestrator

def execute_quantifying_semantic_shift_study(
    raw_df: pd.DataFrame,
    study_config: Dict[str, Any],
    results_dir: Path = Path("results"),
    run_entailment_ablation: bool = False,
    annotated_benchmark_filename: str = "nli_benchmark_annotated.csv",
    force_rerun_main_training: bool = False,
    force_rerun_cross_sector: bool = False
) -> Dict[str, Any]:
    """
    Executes the complete, end-to-end "Quantifying Semantic Shift" research study.

    This top-level orchestrator serves as the main entry point for the entire
    project. It manages the full workflow from initial data validation to the
    generation of the final analytical report. It is designed to be robust,
    resumable, and methodologically rigorous.

    The workflow is as follows:
    1.  Calls `run_full_research_pipeline` to perform all automated computational
        tasks, from data validation and feature engineering to model training,
        inference, and all subsequent analyses (Tasks 1-33). This step
        leverages extensive caching and idempotency for robustness.
    2.  Optionally, if `run_entailment_ablation` is True, it proceeds with the
        human-in-the-loop analysis. It verifies the existence of a properly
        annotated benchmark file and runs the entailment model comparison.
    3.  Calls `run_comprehensive_results_assembly` to load all generated artifacts
        and produce publication-quality tables and figures.
    4.  Calls `run_results_validation_and_synthesis` to perform final statistical
        tests and generate a comprehensive, text-based analytical summary.

    Args:
        raw_df: The initial, raw pandas DataFrame.
        study_config: The complete study configuration dictionary.
        results_dir: The root directory to save all artifacts.
        run_entailment_ablation: If True, the function will attempt to run the
            final entailment model ablation study, which requires the
            human-annotated benchmark file to be present and valid.
        annotated_benchmark_filename: The filename of the human-annotated
            benchmark CSV file, expected to be inside the `results_dir`.
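        force_rerun_main_training: If True, forces retraining of the 12 main models.
        force_rerun_cross_sector: If True, forces retraining of cross-sector models.
            Note: both flags are currently combined into the single `force_rerun`
            argument of `run_full_research_pipeline`, so setting either one
            reruns both training stages.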

    Returns:
        A dictionary containing the key final outputs of the study.
    """
    logging.info("="*100)
    logging.info("STARTING TOP-LEVEL ORCHESTRATOR: QUANTIFYING SEMANTIC SHIFT STUDY")
    logging.info("="*100)

    # This dictionary will hold all final results.
    final_study_outputs: Dict[str, Any] = {}

    # --- Step i: Run the Full Automated Research Pipeline ---
    # This function handles Tasks 1-33 (excluding the manual part of Task 29).
    # It is idempotent and will use cached artifacts where possible.
    artifact_paths = run_full_research_pipeline(
        raw_df=raw_df,
        study_config=study_config,
        results_dir=results_dir,
        force_rerun=(force_rerun_main_training or force_rerun_cross_sector)
    )
    final_study_outputs['artifact_paths'] = artifact_paths

    # --- Step ii: Optionally Run the Entailment Ablation Analysis ---
    entailment_ablation_results = None
    if run_entailment_ablation:
        logging.info("\n--- Attempting to run Entailment Ablation Analysis ---")
        # This is the rigorous, non-interactive check. We verify, we don't ask.
        annotated_benchmark_path = results_dir / annotated_benchmark_filename
        enriched_predictions_path = artifact_paths['enriched_predictions']

        try:
            # This function contains its own validation for the annotated file.
            entailment_ablation_results = run_entailment_ablation_analysis(
                enriched_predictions_df_path=enriched_predictions_path,
                annotated_benchmark_path=annotated_benchmark_path,
                study_config=study_config
            )
            final_study_outputs['entailment_ablation_results'] = entailment_ablation_results
        except (FileNotFoundError, ValueError) as e:
            # If the file is missing or invalid, log the error and continue so
            # that the remaining reporting steps still run.
            logging.error(f"Could not run entailment ablation analysis: {e}")
            logging.warning("Please ensure the benchmark file has been correctly annotated and saved.")
    else:
        logging.info("\nSkipping Entailment Ablation Analysis as per configuration.")

    # --- Step iii & iv: Prepare and Run Comprehensive Results Assembly ---
    # This function loads all artifacts and generates all final tables/figures.
    master_results_db = run_comprehensive_results_assembly(
        artifact_paths=artifact_paths,
        entailment_ablation_results=entailment_ablation_results
    )
    final_study_outputs['master_results_database'] = master_results_db

    # --- Step v: Run Final Validation and Synthesis ---
    # This function performs statistical tests and generates the final text report.
    final_analytical_report = run_results_validation_and_synthesis(
        master_results_db=master_results_db,
        enriched_predictions_df=load_and_validate_predictions(artifact_paths['enriched_predictions'])
    )
    final_study_outputs['final_analytical_report'] = final_analytical_report

    # --- Step vi: Return Final Outputs ---
    logging.info("\n" + "="*100)
    logging.info("TOP-LEVEL ORCHESTRATOR FINISHED SUCCESSFULLY.")
    logging.info("="*100)

    return final_study_outputs
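
Step 2 of the orchestrator depends on the annotated benchmark file being present and completely filled in, and it treats `FileNotFoundError` and `ValueError` as recoverable. The sketch below shows one way such a pre-check could look using only the standard library. The column name `human_label` and the function `validate_annotated_benchmark` are assumptions made for illustration; the validation performed inside `perform_entailment_model_ablation` is not shown in this section and may differ.

```python
import csv
import tempfile
from pathlib import Path


def validate_annotated_benchmark(path: Path, label_column: str = "human_label") -> int:
    """Return the number of annotated rows, raising if the file is unusable."""
    if not path.exists():
        raise FileNotFoundError(f"Annotated benchmark not found: {path}")
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or label_column not in reader.fieldnames:
            raise ValueError(f"Missing required column '{label_column}'.")
        rows = list(reader)
    # A row counts as annotated only if its label cell is non-empty.
    blank = [i for i, row in enumerate(rows) if not (row[label_column] or "").strip()]
    if not rows or blank:
        raise ValueError(f"{len(blank)} of {len(rows)} rows are not annotated.")
    return len(rows)


# Usage with two throwaway files: one fully annotated, one with an empty label.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "nli_benchmark_annotated.csv"
    good.write_text("premise,hypothesis,human_label\na,b,entailment\nc,d,neutral\n")
    n_rows = validate_annotated_benchmark(good)

    incomplete = Path(tmp) / "incomplete.csv"
    incomplete.write_text("premise,hypothesis,human_label\na,b,\n")
    try:
        validate_annotated_benchmark(incomplete)
        incomplete_rejected = False
    except ValueError:
        incomplete_rejected = True
```

Raising the same two exception types that the orchestrator already catches lets a failed check fall through to its existing log-and-continue branch.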

