# `README.md`

# Identifying and Quantifying Financial Bubbles with the Hyped Log-Periodic Power Law Model

<!-- PROJECT SHIELDS -->
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)
[![arXiv](https://img.shields.io/badge/arXiv-2510.10878-b31b1b.svg)](https://arxiv.org/abs/2510.10878)
[![Year](https://img.shields.io/badge/Year-2025-purple)](https://github.com/chirindaopensource/identifying_quantifying_financial_bubbles_hyped_log_period_power_law)
[![Discipline](https://img.shields.io/badge/Discipline-Quantitative%20Finance%20%7C%20NLP%20%7C%20Econophysics-00529B)](https://github.com/chirindaopensource/identifying_quantifying_financial_bubbles_hyped_log_period_power_law)
[![Data Sources](https://img.shields.io/badge/Data-CRSP%20%7C%20Compustat%20%7C%20News%20API-lightgrey)](https://github.com/chirindaopensource/identifying_quantifying_financial_bubbles_hyped_log_period_power_law)
[![Core Method](https://img.shields.io/badge/Method-LPPL%20%7C%20Transformer-orange)](https://github.com/chirindaopensource/identifying_quantifying_financial_bubbles_hyped_log_period_power_law)
[![NLP Models](https://img.shields.io/badge/NLP-FinBERT%20%7C%20BERTopic-red)](https://github.com/chirindaopensource/identifying_quantifying_financial_bubbles_hyped_log_period_power_law)
[![Deep Learning](https://img.shields.io/badge/Deep%20Learning-Dual--Stream%20Transformer-blueviolet)](https://github.com/chirindaopensource/identifying_quantifying_financial_bubbles_hyped_log_period_power_law)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Type Checking: mypy](https://img.shields.io/badge/type%20checking-mypy-blue)](http://mypy-lang.org/)
[![NumPy](https://img.shields.io/badge/numpy-%23013243.svg?style=flat&logo=numpy&logoColor=white)](https://numpy.org/)
[![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=flat&logo=pandas&logoColor=white)](https://pandas.pydata.org/)
[![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=flat&logo=PyTorch&logoColor=white)](https://pytorch.org/)
[![Scipy](https://img.shields.io/badge/SciPy-%238CAAE6.svg?style=flat&logo=SciPy&logoColor=white)](https://scipy.org/)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue?style=flat)](https://huggingface.co/)
[![Jupyter](https://img.shields.io/badge/Jupyter-%23F37626.svg?style=flat&logo=Jupyter&logoColor=white)](https://jupyter.org/)
[![YAML](https://img.shields.io/badge/config-YAML-ffdd00.svg)](https://yaml.org/)

---

**Repository:** `https://github.com/chirindaopensource/identifying_quantifying_financial_bubbles_hyped_log_period_power_law`

**Owner:** 2025 Craig Chirinda (Open Source Projects)

This repository contains an **independent**, professional-grade Python implementation of the research methodology from the 2025 paper entitled **"Identifying and Quantifying Financial Bubbles with the Hyped Log-Periodic Power Law Model"** by:

*   Zheng Cao
*   Xingran Shao
*   Yuheng Yan
*   Helyette Geman

The project provides a complete, end-to-end computational framework for replicating the paper's findings. It delivers a modular, auditable, and extensible pipeline that executes the entire research workflow: from rigorous data validation and NLP feature engineering to LPPL model fitting, deep learning, and backtesting.

## Table of Contents

- [Introduction](#introduction)
- [Theoretical Background](#theoretical-background)
- [Features](#features)
- [Methodology Implemented](#methodology-implemented)
- [Core Components (Notebook Structure)](#core-components-notebook-structure)
- [Key Callable: `execute_full_study`](#key-callable-execute_full_study)
- [Workflow Diagram](#workflow-diagram)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Input Data Structure](#input-data-structure)
- [Usage](#usage)
- [Output Structure](#output-structure)
- [Project Structure](#project-structure)
- [Customization](#customization)
- [Contributing](#contributing)
- [Recommended Extensions](#recommended-extensions)
- [License](#license)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)

## Introduction

This project provides a Python implementation of the methodologies presented in the 2025 paper "Identifying and Quantifying Financial Bubbles with the Hyped Log-Periodic Power Law Model." The core of this repository is the Jupyter notebook `identifying_quantifying_financial_bubbles_hyped_log_period_power_law_draft.ipynb`, which contains a comprehensive suite of functions to replicate the paper's findings, from initial data validation to the final generation of all analytical tables and figures.

The paper proposes a framework, the Hyped LPPL (HLPPL) model, that combines three domains (econophysics, natural language processing, and deep learning) to produce a real-time indicator of financial asset mispricing. This codebase operationalizes that framework, allowing users to:
-   Rigorously validate and manage the entire experimental configuration via a `config.yaml` file.
-   Process raw market data and news text through a multi-stage feature engineering pipeline.
-   Fit the Log-Periodic Power Law (LPPL) model at scale using a robust, multi-start optimization strategy.
-   Construct the novel `BubbleScore` by fusing technical and behavioral signals.
-   Train a state-of-the-art Dual-Stream Transformer model to forecast the `BubbleScore`.
-   Run a complete, event-driven backtest to evaluate the trading performance of the generated signals.
-   Automatically conduct ablation and sensitivity studies to validate the model's robustness.

## Theoretical Background

The implemented methods are grounded in econophysics, behavioral finance, and deep learning.

**1. Log-Periodic Power Law (LPPL) Model:**
Originating from the physics of critical phenomena, the LPPL model describes the super-exponential growth of an asset price leading up to a crash (a critical point). The implementation fits the 7-parameter model defined in Equation (1):
$$
\ln p(t) = A + B(t_c - t)^m + C(t_c - t)^m \cos(\omega \ln(t_c - t) + \phi)
$$
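Here $t_c$ is the critical time, $m$ is the power-law exponent, $\omega$ is the log-periodic angular frequency, $\phi$ is a phase, and $A$, $B$, $C$ are the linear coefficients. The normalized residual from this fit, $\epsilon_{\text{norm}}(t)$, serves as the primary technical indicator of deviation from the theoretical bubble path.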
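
As a minimal sketch of the equation above, the following evaluates the LPPL log-price and scales the residuals into `[-1, 1]`. The function names, the parameter values, and the scaling rule (division by the largest absolute residual in the window) are assumptions made for illustration; the notebook's own normalization may differ.

```python
import math


def lppl_log_price(t, A, B, C, t_c, m, omega, phi):
    """Evaluate ln p(t) under the 7-parameter LPPL form; requires t < t_c."""
    dt = t_c - t
    if dt <= 0:
        raise ValueError("LPPL is only defined for t < t_c")
    power = dt ** m
    return A + B * power + C * power * math.cos(omega * math.log(dt) + phi)


def normalized_residuals(times, log_prices, params):
    """Observed minus fitted log-price, scaled to lie in [-1, 1].

    Assumption: scale by the largest absolute residual in the window.
    """
    raw = [y - lppl_log_price(t, **params) for t, y in zip(times, log_prices)]
    scale = max(abs(r) for r in raw)
    if scale == 0.0:
        return [0.0 for _ in raw]
    return [r / scale for r in raw]


# Hypothetical parameter values, chosen only to make the example run.
params = dict(A=5.0, B=-0.5, C=0.05, t_c=120.0, m=0.5, omega=8.0, phi=0.0)
times = [float(t) for t in range(100)]
fitted = [lppl_log_price(t, **params) for t in times]
# Perturb the fitted path so that the residuals are non-trivial.
observed = [y + 0.01 * math.sin(t / 5.0) for t, y in zip(times, fitted)]
eps_norm = normalized_residuals(times, observed, params)
```

A positive entry in `eps_norm` means the observed log-price sits above the fitted trajectory at that date, and a negative entry means it sits below.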

**2. Behavioral Finance Signals (NLP):**
Two NLP-derived features are constructed to capture market psychology:
-   **Hype Index ($H_{i,t}$):** The share of media attention a stock receives on a given day, measuring intensity. (Equation 11)
-   **Sentiment Score ($S_{i,t}$):** The confidence-weighted average sentiment (positive, neutral, negative) of news articles, measuring tone. (Equation 9)
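
The two signals can be sketched as follows. This is a simplified reading, not the paper's Equations 9 and 11: the Hype Index is taken as a ticker's share of the day's article count, and the Sentiment Score maps the three classes to +1, 0, and -1 before taking a confidence-weighted mean. All names and inputs are hypothetical.

```python
# Assumed polarity mapping for the three sentiment classes.
POLARITY = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def hype_index(article_counts):
    """Map {ticker: article count on one day} to each ticker's coverage share."""
    total = sum(article_counts.values())
    if total == 0:
        return {ticker: 0.0 for ticker in article_counts}
    return {ticker: n / total for ticker, n in article_counts.items()}


def sentiment_score(classified_articles):
    """Confidence-weighted mean polarity over (label, confidence) pairs."""
    weight = sum(conf for _, conf in classified_articles)
    if weight == 0:
        return 0.0
    signed = sum(POLARITY[label] * conf for label, conf in classified_articles)
    return signed / weight


counts = {"AAA": 6, "BBB": 3, "CCC": 1}
hype = hype_index(counts)
score = sentiment_score([("positive", 0.9), ("negative", 0.6), ("neutral", 0.5)])
```

Under these assumptions the shares in `hype` sum to one across tickers, and `score` stays within `[-1, 1]`.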

**3. Hyped LPPL (HLPPL) `BubbleScore`:**
The paper's core innovation is the fusion of the technical and behavioral signals into a single `BubbleScore`. The formula is regime-dependent, with the Hype Index acting as an amplifier in both positive and negative deviations. (Equation 14)
$$
\text{BubbleScore}_{i}(t) =
\begin{cases}
\epsilon_{\text{norm}}(t) + \alpha_1 H_{i,t} + \alpha_2 S_{i,t}, & \text{if } \epsilon_{\text{norm}}(t) > 0 \\
\epsilon_{\text{norm}}(t) - \alpha_1 H_{i,t} + \alpha_2 S_{i,t}, & \text{if } \epsilon_{\text{norm}}(t) \le 0
\end{cases}
$$
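
The piecewise formula translates directly into a small function. The sketch below assumes scalar inputs, and the weights `alpha_1` and `alpha_2` are illustrative values, not the ones used in the paper or in `config.yaml`.

```python
def bubble_score(eps_norm, hype, sentiment, alpha_1, alpha_2):
    """Fuse the normalized LPPL residual with the hype and sentiment signals.

    The hype term takes the sign of the residual regime, so it pushes the
    score further from zero in both the positive and the negative case.
    """
    hype_term = alpha_1 * hype if eps_norm > 0 else -alpha_1 * hype
    return eps_norm + hype_term + alpha_2 * sentiment


# Illustrative weights only.
positive_regime = bubble_score(0.5, hype=0.4, sentiment=0.1, alpha_1=0.3, alpha_2=0.2)
negative_regime = bubble_score(-0.5, hype=0.4, sentiment=0.1, alpha_1=0.3, alpha_2=0.2)
```

With identical hype and sentiment inputs, the two regimes differ by more than the residual gap alone, because the hype term flips sign along with the residual.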

**4. Dual-Stream Transformer:**
A deep learning model is trained to forecast the `BubbleScore`. Its architecture is designed to process stock-specific features and market-wide features in parallel, allowing them to interact via a bi-directional cross-attention mechanism before making a final prediction.
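
A compact PyTorch sketch of this idea is shown below. It is not the notebook's `DualStreamTransformer`: the class name, layer sizes, last-step pooling, and the omission of positional encodings are all simplifying assumptions made to keep the example short.

```python
import torch
import torch.nn as nn


class DualStreamSketch(nn.Module):
    """Two encoders joined by bi-directional cross-attention (illustrative)."""

    def __init__(self, stock_dim, market_dim, d_model=32, n_heads=4, horizons=5):
        super().__init__()
        self.stock_proj = nn.Linear(stock_dim, d_model)
        self.market_proj = nn.Linear(market_dim, d_model)
        # Self-attention within each stream.
        self.stock_enc = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64, batch_first=True)
        self.market_enc = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64, batch_first=True)
        # Cross-attention in both directions.
        self.stock_to_market = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.market_to_stock = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One independent linear head per forecast horizon.
        self.heads = nn.ModuleList(
            [nn.Linear(2 * d_model, 1) for _ in range(horizons)])

    def forward(self, stock_x, market_x):
        s = self.stock_enc(self.stock_proj(stock_x))
        m = self.market_enc(self.market_proj(market_x))
        # Stock queries attend to market keys/values, and vice versa.
        s_ctx, _ = self.stock_to_market(query=s, key=m, value=m)
        m_ctx, _ = self.market_to_stock(query=m, key=s, value=s)
        # Pool by taking the last time step of each residual-updated stream.
        pooled = torch.cat([(s + s_ctx)[:, -1], (m + m_ctx)[:, -1]], dim=-1)
        return torch.cat([head(pooled) for head in self.heads], dim=-1)


model = DualStreamSketch(stock_dim=6, market_dim=3)
model.eval()
with torch.no_grad():
    # Batch of 8 samples, 20 time steps, 6 stock and 3 market features.
    preds = model(torch.randn(8, 20, 6), torch.randn(8, 20, 3))
```

The output has one column per forecast horizon, so `preds` holds five predictions for each of the eight samples.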

## Features

The provided Jupyter notebook (`identifying_quantifying_financial_bubbles_hyped_log_period_power_law_draft.ipynb`) implements the full research pipeline, including:

-   **Modular, Multi-Task Architecture:** The entire pipeline is broken down into 33 distinct, modular tasks, each with its own orchestrator function for maximum clarity and testability.
-   **Configuration-Driven Design:** All study parameters are managed in an external `config.yaml` file, allowing for easy customization and replication.
-   **Idempotent & Resumable Pipeline:** Computationally expensive steps (e.g., NLP processing, LPPL fitting, model training) create checkpoint files, allowing the pipeline to be resumed efficiently.
-   **Robust LPPL Fitting:** Implements a multi-start, constrained non-linear least squares optimization to robustly fit the 7-parameter LPPL model across thousands of rolling windows.
-   **State-of-the-Art Deep Learning:** Implements a Dual-Stream Transformer in PyTorch with modern training techniques (`AdamW`, `OneCycleLR`, gradient clipping, early stopping) and a custom multi-component loss function.
-   **Realistic Event-Driven Backtester:** Simulates trading performance with daily stop-loss checks and transaction costs.
-   **Automated Ablation & Sensitivity Analysis:** Includes a top-level orchestrator to automatically re-run the entire pipeline under different configurations to test the contribution of each model component.
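
The checkpointing pattern behind a resumable pipeline can be sketched with the standard library alone. The helper name, the JSON format, and the file layout below are assumptions for illustration; the notebook may persist its intermediate results differently.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(checkpoint_path, compute):
    """Return the cached result if the checkpoint exists; otherwise compute,
    persist, and return it."""
    path = Path(checkpoint_path)
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file and rename it into place, so an interrupted
    # write does not leave a truncated checkpoint behind.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(path)
    return result


calls = []


def expensive_step():
    """Stand-in for a costly stage such as LPPL fitting."""
    calls.append(1)
    return {"n_windows": 3, "status": "fitted"}


with tempfile.TemporaryDirectory() as workdir:
    ckpt = Path(workdir) / "data_intermediate" / "lppl_fits.json"
    first = run_with_checkpoint(ckpt, expensive_step)
    second = run_with_checkpoint(ckpt, expensive_step)  # served from disk
```

The second call returns the stored result without invoking `expensive_step` again, which is what makes a re-run of the pipeline cheap.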

## Methodology Implemented

The notebook is a direct, sequential implementation of the paper's methodology:

1.  **Validation & Cleansing (Tasks 1-6):** Ingests and validates the `config.yaml` and raw data, cleanses the data, adjusts for corporate actions, and engineers primary features.
2.  **NLP Feature Engineering (Tasks 7-12):** Uses BERTopic and FinBERT to process news text and generate the `Sentiment_Score` and `Hype_Index`.
3.  **LPPL Signal Generation (Tasks 13-18):** Defines rolling windows, fits the LPPL model, computes normalized residuals, fuses them into the `BubbleScore`, and labels discrete episodes.
4.  **ML Data Preparation (Tasks 19-22):** Normalizes features (with leakage protection), constructs fixed-length sequences for the stock and market streams, creates multi-horizon targets, and performs a strict chronological split.
5.  **Deep Learning (Tasks 23-28):** Defines the `DualStreamTransformer` architecture and `CompositeLoss`, trains the model with early stopping, persists the final artifact, and evaluates its out-of-sample predictive performance.
6.  **Backtesting (Tasks 29-31):** Converts predictions into discrete trading signals, runs the event-driven backtest, and computes a full suite of performance metrics.
7.  **Final Orchestration (Tasks 32-33):** Provides top-level functions to run the entire baseline pipeline and the full suite of ablation studies.
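
Step 4 combines a chronological split with leakage protection. One common way to achieve both, sketched here under the assumption of a single date-sorted feature series and hypothetical split percentages, is to estimate the scaling statistics on the training segment only and reuse them on later segments.

```python
from statistics import mean, pstdev


def chronological_split(rows, train_pct=70, val_pct=15):
    """Split date-sorted rows into train/val/test without shuffling."""
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def fit_scaler(values):
    """Estimate mean and standard deviation on the training segment only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant feature
    return mu, sigma


def apply_scaler(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


series = [float(i) for i in range(20)]  # stand-in for one feature over time
train, val, test = chronological_split(series)
mu, sigma = fit_scaler(train)
train_z = apply_scaler(train, mu, sigma)
test_z = apply_scaler(test, mu, sigma)  # reuses training statistics
```

Because `mu` and `sigma` never see validation or test observations, no information from the future enters the normalized training inputs.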

## Core Components (Notebook Structure)

The `identifying_quantifying_financial_bubbles_hyped_log_period_power_law_draft.ipynb` notebook is structured as a logical pipeline with modular orchestrator functions for each of the 33 major tasks. All functions are self-contained, fully documented with type hints and docstrings, and designed for professional-grade execution.

## Key Callable: `execute_full_study`

The project is designed around a single, top-level user-facing interface function:

-   **`execute_full_study`:** This master orchestrator function, located in the final section of the notebook, runs the entire automated research pipeline end to end. A single call to this function reproduces the entire computational portion of the project, from data validation to the final report.

## Workflow Diagram

The following diagram illustrates the high-level workflow orchestrated by the `run_hlppl_pipeline` function, which is the core engine called by `execute_full_study`.

```mermaid
graph TD
    A[Start: Raw Data & Config] --> B(Phase I: Validation);
    B --> C(Phase II: Data Cleansing & Feature Eng.);
    C --> D(Phase III: NLP Signal Generation);
    D --> E(Phase IV: LPPL Signal Generation);
    E --> F(Phase V: ML Data Preparation);
    F --> G(Phase VI: Model Training & Validation);
    G --> H(Phase VII: Inference & Backtesting);
    H --> I[End: Performance Report];

    subgraph Phase III
        D -- Hype Index & Sentiment --> E;
    end

    subgraph Phase IV
        E -- BubbleScore --> F;
    end

    subgraph Phase VI
        G -- Trained Model --> H;
    end
```

## Prerequisites

-   Python 3.9+
-   A CUDA-enabled GPU is highly recommended for the deep learning and NLP components.
-   Core dependencies: `pandas`, `numpy`, `torch`, `transformers`, `sentence-transformers`, `bertopic`, `umap-learn`, `hdbscan`, `scipy`, `pyyaml`, `matplotlib`, `seaborn`, `tqdm`.

## Installation

1.  **Clone the repository:**
    ```sh
    git clone https://github.com/chirindaopensource/identifying_quantifying_financial_bubbles_hyped_log_period_power_law.git
    cd identifying_quantifying_financial_bubbles_hyped_log_period_power_law
    ```

2.  **Create and activate a virtual environment (recommended):**
    ```sh
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    ```

3.  **Install Python dependencies:**
    ```sh
    pip install -r requirements.txt
    ```

## Input Data Structure

The pipeline requires a `pandas.DataFrame` (`df_raw`) with a `MultiIndex` of `('Date', 'TICKER')` and the following columns and dtypes:
-   `PERMNO`: `int64`
-   `SIC_Code`: `int64`
-   `Close_Price_Raw`: `float64`
-   `Volume_Raw`: `int64`
-   `CFACSHR`: `float64`
-   `PE_Ratio`: `float64`
-   `PB_Ratio`: `float64`
-   `VIX_Close`: `float64`
-   `News_Articles`: `object` (containing `list` of `str`)

All other parameters are controlled by the `config.yaml` file.

## Usage

The `identifying_quantifying_financial_bubbles_hyped_log_period_power_law_draft.ipynb` notebook provides a complete, step-by-step guide. The primary workflow is to execute the final cell of the notebook, which contains the main execution block:

```python
# Final cell of the notebook
import yaml  # PyYAML; required for yaml.safe_load below

# This function generates a sample DataFrame for demonstration.
# In a real run, you would load your own data here.
df_raw = create_sample_dataframe()

# Load the master configuration from the YAML file.
with open("config.yaml", 'r') as f:
    base_config = yaml.safe_load(f)

# --- Execute the entire study ---
# To run only the baseline model (faster):
final_results = execute_full_study(
    df_raw=df_raw,
    base_config=base_config,
    run_ablation=False
)

# To run the baseline AND all ablation/sensitivity studies (very slow):
# final_results = execute_full_study(
#     df_raw=df_raw,
#     base_config=base_config,
#     run_ablation=True
# )

# The `final_results` dictionary will contain the key outputs.
print(final_results['baseline_performance'])
```

## Output Structure

The `execute_full_study` function creates a `study_results/` directory with the following structure:

```
study_results/
│
├── baseline/
│   ├── data_intermediate/
│   ├── logs/
│   ├── models/
│   └── reports/
│       └── performance_summary.csv
│
└── ablation_studies/
    ├── ablation_no_hype/
    │   ├── data_intermediate/
    │   ├── logs/
    │   ├── models/
    │   └── reports/
    ├── ... (other experiments)
    │
    ├── ablation_comparison_summary.csv
    └── ablation_core_performance.png
```

## Project Structure

```
identifying_quantifying_financial_bubbles_hyped_log_period_power_law/
│
├── identifying_quantifying_financial_bubbles_hyped_log_period_power_law_draft.ipynb # Main implementation notebook
├── config.yaml                                                                      # Master configuration file
├── requirements.txt                                                                 # Python package dependencies
├── LICENSE                                                                          # MIT license file
└── README.md                                                                        # This documentation file
```

## Customization

The pipeline is highly customizable via the `config.yaml` file. Users can easily modify all study parameters, including LPPL window size, `BubbleScore` weights, Transformer architecture, and backtesting thresholds, without altering the core Python code.

## Contributing

Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.

## Recommended Extensions

Future extensions could include:
-   **Alternative Architectures:** Replacing the Transformer with other sequence models like LSTMs or state-space models (e.g., Mamba).
-   **Dynamic Alpha Weights:** Making the `alpha_1` and `alpha_2` weights in the `BubbleScore` dynamic, perhaps dependent on market volatility.
-   **Advanced Backtesting:** Integrating a more sophisticated backtesting engine that handles portfolio-level constraints, realistic order execution, and market impact.
-   **Cross-Asset Analysis:** Applying the HLPPL framework to other asset classes like cryptocurrencies, commodities, or fixed income.

## License

This project is licensed under the MIT License.

## Citation

If you use this code or the methodology in your research, please cite the original paper:

```bibtex
@article{cao2025identifying,
  title   = {Identifying and Quantifying Financial Bubbles with the Hyped Log-Periodic Power Law Model},
  author  = {Cao, Zheng and Shao, Xingran and Yan, Yuheng and Geman, Helyette},
  journal = {arXiv preprint arXiv:2510.10878},
  year    = {2025}
}
```

For the implementation itself, you may cite this repository:
```
Chirinda, C. (2025). A Professional-Grade Implementation of the "Hyped Log-Periodic Power Law Model" Framework.
GitHub repository: https://github.com/chirindaopensource/identifying_quantifying_financial_bubbles_hyped_log_period_power_law
```

## Acknowledgments

-   Credit to **Zheng Cao, Xingran Shao, Yuheng Yan, and Helyette Geman** for the foundational research that forms the entire basis for this computational replication.
-   This project is built upon the exceptional tools provided by the open-source community. Sincere thanks to the developers of the scientific Python ecosystem, including **Pandas, NumPy, Scikit-learn, PyTorch, Hugging Face, SciPy, and Jupyter**.

---

*This README was generated based on the structure and content of the `identifying_quantifying_financial_bubbles_hyped_log_period_power_law_draft.ipynb` notebook and follows best practices for research software documentation.*

# Paper

Title: "*Identifying and Quantifying Financial Bubbles with the Hyped Log-Periodic Power Law Model*"

Authors: Zheng Cao, Xingran Shao, Yuheng Yan, Helyette Geman

E-Journal Submission Date: 13 October 2025

Link: https://arxiv.org/abs/2510.10878

Abstract:

We propose a novel model, the Hyped Log-Periodic Power Law Model (HLPPL), to the problem of quantifying and detecting financial bubbles, an ever-fascinating one for academics and practitioners alike. Bubble labels are generated using a Log-Periodic Power Law (LPPL) model, sentiment scores, and a hype index we introduced in previous research on NLP forecasting of stock return volatility. Using these tools, a dual-stream transformer model is trained with market data and machine learning methods, resulting in a time series of confidence scores as a Bubble Score. A distinctive feature of our framework is that it captures phases of extreme overpricing and underpricing within a unified structure.

We achieve an average yield of 34.13 percentage annualized return when backtesting U.S. equities during the period 2018 to 2024, while the approach exhibits a remarkable generalization ability across industry sectors. Its conservative bias in predicting bubble periods minimizes false positives, a feature which is especially beneficial for market signaling and decision-making. Overall, this approach utilizes both theoretical and empirical advances for real-time positive and negative bubble identification and measurement with HLPPL signals.

# Summary

### **The Core Problem and Proposed Contribution**

The paper addresses a well-known, difficult problem in quantitative finance: the timely detection and quantification of financial bubbles. The authors correctly identify two primary limitations of existing models, particularly the classic Log-Periodic Power Law (LPPL) model:

1.  **Asymmetry:** Traditional models focus almost exclusively on positive bubbles (speculative manias) and largely ignore "negative bubbles" or anti-bubbles (protracted periods of undervaluation and oversold conditions).
2.  **Technical Isolation:** The LPPL model is a purely technical, price-based framework. It describes the *what* (super-exponential price acceleration) but not the *why*. It is agnostic to the underlying behavioral drivers like investor sentiment and media hype, which are central to the formation of bubbles.

The paper's core contribution is to create a unified framework, the **Hyped Log-Periodic Power Law (HLPPL) Model**, that addresses both limitations by integrating behavioral signals directly into the technical model.

### **Deconstructing the "Bubble Score" - The HLPPL Foundation**

The first major innovation is the creation of a descriptive indicator they call the **Bubble Score**. This is not a machine learning output; rather, it's a composite score engineered from three components.

1.  **LPPL Residuals (The Technical Base):** Instead of just using the LPPL model's crash prediction, the authors cleverly focus on the *residuals*—the difference between the actual log-price and the LPPL-fitted price trajectory.
    *   A positive residual (`ε(t) > 0`) signifies that the price is running *ahead* of its super-exponential trend, indicating a "bubble behavior."
    *   A negative residual (`ε(t) < 0`) signifies the price is lagging, indicating a "negative behavior" or oversold state.
    *   These residuals are normalized to lie within `[-1, 1]` for comparability.

2.  **Hype Index (The Attention Metric):** Drawing on their previous work, they incorporate a novel NLP-derived metric. The Hype Index measures the *volume* of media attention a stock receives relative to its peers and its economic size (market capitalization). It quantifies disproportionate attention, which is a key catalyst for herding behavior.

3.  **Sentiment Score (The Polarity Metric):** This is a more standard NLP metric, derived using FinBERT to classify financial news as positive, neutral, or negative. It captures the *tone* of the media coverage.

These three components are combined into a single **Bubble Score** (Equation 14). The formula is intuitive: the normalized LPPL residual forms the base, which is then amplified or dampened by the Hype and Sentiment scores. Crucially, the Hype Index is designed to always amplify the extremity (i.e., it makes positive scores more positive and negative scores more negative), reflecting the idea that high attention exacerbates any market move.

### **The Predictive Engine - The Dual-Stream Transformer**

Having created a rich, descriptive `Bubble Score`, the authors pivot to prediction. They frame the problem as a supervised learning task: can we train a model to forecast the `Bubble Score` over the next five trading days?

This is where their computer science expertise comes into play. They design a sophisticated **Dual-Stream Transformer Architecture**:

*   **Stream 1 (Stock-Level):** This stream processes features specific to an individual asset, such as its historical prices, volume, and valuation ratios (P/E, P/B).
*   **Stream 2 (Market-Level):** This stream processes market-wide and behavioral features, including the VIX, aggregate sentiment scores, and the Hype Index.

The architecture uses self-attention within each stream to capture temporal dynamics and then employs **cross-attention** between the streams. This is a powerful design choice, as it allows the model to learn complex interactions, such as how a shift in market-wide sentiment (Stream 2) might influence the price trajectory of a specific stock (Stream 1).

The model is trained to predict the `Bubble Score` at horizons of 1 to 5 days, using the score constructed in the previous section as the "ground truth" label.

### **Empirical Validation and Backtesting**

The authors conduct a two-stage backtest on U.S. real estate stocks from 2018 to 2024.

1.  **"Traditional" Strategy:** A simple rules-based strategy that trades directly on the calculated `Bubble Score` (for example, go short if the score exceeds 0.7 and go long if it falls below -0.7). This strategy yielded an average annualized return of **16.64%** with a Sharpe ratio of **0.72**.

2.  **"ML-Enhanced" Strategy:** This strategy trades on the *forecasted* `Bubble Score` from the Transformer model. This is the main test of their predictive framework. The results are substantially stronger:
    *   Average Annualized Return: **34.13%**
    *   Average Sharpe Ratio: **1.19**
    *   Average Win Rate: **72.30%**

They highlight an exceptional case study, HOUS, where the ML-enhanced strategy achieved an annualized return of **85.80%** with a Sharpe ratio of **2.49**, demonstrating the model's potential when a stock's dynamics align perfectly with the HLPPL framework.
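
The threshold rule of the "Traditional" strategy can be sketched as below. The 0.7 threshold and the 0.1% cost come from the text; going flat inside the band, holding each position for one day, and charging cost in proportion to the change in position are assumptions of this sketch, and the input numbers are made up.

```python
def to_signal(score, threshold=0.7):
    """Return -1 (short a positive bubble), +1 (long a negative bubble), or 0."""
    if score > threshold:
        return -1
    if score < -threshold:
        return 1
    return 0


def strategy_returns(scores, next_day_returns, cost=0.001):
    """Daily strategy return: position times asset return, minus trading cost."""
    position, out = 0, []
    for score, ret in zip(scores, next_day_returns):
        target = to_signal(score)
        turnover = abs(target - position)  # 2 when flipping long <-> short
        position = target
        out.append(position * ret - cost * turnover)
    return out


# Made-up inputs for illustration.
scores = [0.2, 0.85, 0.9, -0.75, 0.1]
next_day_returns = [0.01, -0.02, 0.01, 0.03, -0.01]
signals = [to_signal(s) for s in scores]
pnl = strategy_returns(scores, next_day_returns)
```

Flipping directly from short to long costs twice as much as opening a position from flat, which is one reason rapidly changing signals erode returns.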

---

### **Critical Analysis and Discussion**

This is a highly commendable and methodologically sophisticated paper. The synthesis of ideas is novel and the empirical results are, on the surface, extremely impressive.

**Strengths:**

*   **Conceptual Elegance:** The idea of using LPPL residuals as a "technical anchor" for behavioral metrics is elegant and powerful. It grounds the abstract concepts of hype and sentiment in the mathematical reality of price dynamics.
*   **Bidirectional Analysis:** The explicit modeling of both positive and negative bubbles is a significant and practical contribution, opening up a richer set of trading opportunities.
*   **Sophisticated ML Architecture:** The choice of a dual-stream transformer with cross-attention is well-justified and perfectly suited to the problem of disentangling idiosyncratic and systematic drivers.

**Potential Weaknesses and Questions for the Authors:**

1.  **The Self-Referential Labeling Problem:** This is the most significant concern. The Transformer model is trained to predict a `Bubble Score` that is, itself, a model output. The "ground truth" is not an objective market fact but a construct of the authors' HLPPL formula. While the strategy's profitability provides external validation, it's possible the Transformer is simply becoming very good at approximating the complex HLPPL function, rather than predicting an objective, underlying economic phenomenon.
2.  **LPPL Model Instability:** The fitting of the LPPL model is notoriously sensitive to the choice of the time window and initial parameters. The paper does not detail how this fitting was performed (e.g., rolling window size, parameter constraints). The stability of the LPPL fit is paramount, as unstable fits would lead to noisy residuals and, consequently, noisy training labels for the Transformer.
3.  **Risk of Overfitting and Data Snooping:** The model is complex, and the backtest is confined to a single sector (real estate) during a specific, highly volatile period (2018-2024, which includes the COVID-19 crash and recovery). The stellar performance could be regime-dependent. The model must be tested on different asset classes (e.g., tech stocks, cryptocurrencies, commodities) and across different market regimes (e.g., the low-volatility bull market of 2013-2017) to prove its robustness.
4.  **Practical Implementation Costs:** While a 0.1% transaction cost is included, the model relies on signals that can appear and disappear quickly. The backtest does not account for slippage or market impact, which could be significant when trying to execute trades based on rapidly changing news sentiment.

### **Conclusion and Verdict**

This paper presents a state-of-the-art framework for bubble detection that pushes the research frontier forward. The authors have successfully built a bridge between econophysics, behavioral finance, and deep learning. The HLPPL model and its resulting `Bubble Score` are a significant conceptual advance.

The empirical results are compelling, suggesting that this integrated approach can generate substantial alpha. However, due to concerns about the self-referential nature of the training labels and the need for more extensive out-of-sample and cross-asset validation, the findings should be considered promising but preliminary.

The natural next steps are a rigorous sensitivity analysis of the LPPL fitting process and backtests across a much wider universe of assets and time periods. If the results hold, this methodology could become a valuable tool for both asset managers and financial regulators.


# Import Essential Modules

#!/usr/bin/env python3
# ==============================================================================
#
#  Identifying and Quantifying Financial Bubbles with the Hyped Log-Periodic
#  Power Law (HLPPL) Model
#
#  This module provides a complete, production-grade implementation of the
#  analytical framework presented in "Identifying and Quantifying Financial
#  Bubbles with the Hyped Log-Periodic Power Law Model" by Cao, Shao, Yan, and
#  Geman (2025). It delivers a real-time, data-driven system for the dynamic
#  assessment of asset mispricing by fusing econophysics, natural language
#  processing, and deep learning into a single, actionable metric for
#  optimizing investment timing and managing risk exposure.
#
#  Core Methodological Components:
#  • Log-Periodic Power Law (LPPL) model fitting via constrained non-linear
#    optimization to generate a technical measure of deviation from a theoretical
#    bubble trajectory.
#  • NLP-driven behavioral feature engineering, including a Hype Index (attention
#    share) and a confidence-weighted sentiment score derived from FinBERT.
#  • A novel "Bubble Score" signal that fuses the technical LPPL residual with
#    the behavioral NLP signals in a regime-dependent manner.
#  • A Dual-Stream Transformer architecture that processes stock-specific and
#    market-level features in parallel, with bi-directional cross-attention
#    to learn their interactions.
#  • Multi-horizon forecasting of the Bubble Score using independent prediction heads.
#  • A composite loss function combining Huber, correlation, R-squared, and
#    temporal consistency objectives for robust training.
#  • An event-driven backtesting engine with realistic risk controls (stop-loss,
#    transaction costs) to evaluate strategy performance.
#
#  Technical Implementation Features:
#  • Idempotent, checkpointed pipeline for all computationally intensive tasks.
#  • Rigorous, leakage-proof data validation, cleansing, and normalization.
#  • BERTopic for thematic filtering of large news corpora.
#  • Multi-start optimization with a seeded random number generator for LPPL fits.
#  • Modern deep learning training regimen including AdamW, OneCycleLR, gradient
#    clipping, and early stopping.
#  • Automated ablation and sensitivity analysis framework.
#
#  Paper Reference:
#  Cao, Z., Shao, X., Yan, Y., & Geman, H. (2025). Identifying and Quantifying
#  Financial Bubbles with the Hyped Log-Periodic Power Law Model. arXiv
#  preprint arXiv:2510.10878.
#  https://arxiv.org/abs/2510.10878
#
#  Author: CS Chirinda
#  License: MIT
#  Version: 1.0.0
#
# ==============================================================================
# ==============================================================================
# Consolidated Imports for the HLPPL End-to-End Pipeline
# ==============================================================================

# --- Standard Library ---
import copy
import json
import logging
import math
import pickle
import subprocess
import tarfile
from collections.abc import Mapping
from datetime import datetime
from itertools import chain
from pathlib import Path
from typing import (Any, Dict, List, NamedTuple, Optional, Set, Tuple, Union)

# --- Core Scientific Computing ---
import numpy as np
import pandas as pd

# --- Machine Learning & Deep Learning (PyTorch) ---
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# --- NLP Libraries ---
# Note: These are substantial dependencies.
# pip install transformers sentence-transformers bertopic umap-learn hdbscan
import hdbscan
import umap
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          pipeline)

# --- Scientific & Numerical Optimization ---
# pip install scipy
from scipy.optimize import least_squares, OptimizeResult

# --- Visualization & Progress ---
# pip install matplotlib seaborn tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm


# Configure a basic logger
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)


# Implementation

## Draft 1

### **Functional and Methodological Breakdown of Pipeline Orchestrators**

#### **Task 1: `validate_and_parse_config`**

*   **Inputs:** A Python dictionary (`study_parameters`) containing the nested configuration for the entire study.
*   **Processes:**
    1.  Recursively traverses the input dictionary to verify that its structure matches a predefined schema, ensuring all required keys and sections are present.
    2.  Performs a series of specific validation checks on key numerical and string parameters to ensure they are within valid, sensible ranges and match expected values (e.g., model names).
    3.  Creates a timestamped JSON snapshot of the validated configuration.
*   **Outputs:** The original, validated configuration dictionary. A `KeyError`, `ValueError`, or `TypeError` is raised if any check fails.
*   **Data Transformation:** The primary transformation is one of state validation and persistence. The dictionary is validated, and a serializable copy is written to disk.
*   **Role in Research Pipeline:** This callable serves as the **Gatekeeper** of the entire pipeline. It implements the foundational step of ensuring that the experiment is configured correctly and reproducibly before any computation begins. It enforces methodological consistency by validating parameters (e.g., LPPL constraints, backtesting thresholds) that are specified in the paper's text and are critical for a faithful replication.
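
A minimal sketch of the validate-then-snapshot pattern described above. The schema, the key names (`lppl.window_length`, `backtest.stop_loss`), and the helper names are hypothetical placeholders, not the project's actual configuration.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict

# Hypothetical schema: a nested dict whose leaves are the expected Python types.
SCHEMA: Dict[str, Any] = {
    "lppl": {"window_length": int, "n_starts": int},
    "backtest": {"stop_loss": float},
}


def check_structure(config: Dict[str, Any], schema: Dict[str, Any], path: str = "") -> None:
    """Recursively verify that every schema key exists and has the expected type."""
    for key, expected in schema.items():
        here = f"{path}.{key}" if path else key
        if key not in config:
            raise KeyError(f"Missing required key: {here}")
        if isinstance(expected, dict):
            if not isinstance(config[key], dict):
                raise TypeError(f"{here} must be a dict")
            check_structure(config[key], expected, here)
        elif not isinstance(config[key], expected):
            raise TypeError(f"{here} must be of type {expected.__name__}")


def validate_config_sketch(config: Dict[str, Any], snapshot_dir: Path) -> Dict[str, Any]:
    """Structure check, range checks, then a timestamped JSON snapshot."""
    check_structure(config, SCHEMA)
    if config["lppl"]["window_length"] <= 0:
        raise ValueError("lppl.window_length must be positive")
    if not 0.0 < config["backtest"]["stop_loss"] < 1.0:
        raise ValueError("backtest.stop_loss must lie in (0, 1)")
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    (snapshot_dir / f"config_{stamp}.json").write_text(json.dumps(config, indent=2))
    return config
```

Raising on the first failed check keeps the gatekeeper behavior simple: nothing downstream runs on a partially valid configuration.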

#### **Task 2: `validate_input_dataframe`**

*   **Inputs:** A raw `pandas.DataFrame` (`df_raw`) with a `MultiIndex`.
*   **Processes:**
    1.  Validates that the index is a `pd.MultiIndex` with the correct names (`'Date'`, `'TICKER'`), dtypes (`datetime64[ns]`, `object`), and is chronologically sorted.
    2.  Verifies the presence and correct dtypes of all required columns (e.g., `Close_Price_Raw`, `CFACSHR`, `News_Articles`).
    3.  Checks that the data coverage meets minimum requirements for the study period (2018-2024), number of trading days, and number of unique tickers.
*   **Outputs:** A validated and sorted copy of the input DataFrame.
*   **Data Transformation:** The DataFrame is sorted by its index if it was not already sorted. Otherwise the function only validates and leaves the data unchanged.
*   **Role in Research Pipeline:** This callable is the **Data Schema Enforcer**. It ensures that the raw input data conforms to the precise structure and content requirements of the pipeline, preventing a vast category of potential downstream errors related to incorrect data types, missing columns, or unsorted time series.

#### **Task 3: `cleanse_raw_data`**

*   **Inputs:** A validated `pandas.DataFrame`.
*   **Processes:**
    1.  Removes rows with missing or non-positive `Close_Price_Raw`.
    2.  Conditionally forward-fills `NaN`s in `Volume_Raw` if the missing percentage is below a threshold.
    3.  Asserts completeness of `CFACSHR` and forward-fills `VIX_Close`.
    4.  Logs extreme single-day log returns ($|r_t| > 0.5$) as potential outliers.
    5.  Filters the entire DataFrame to retain only stocks within the specified real estate `SIC_Code` universe.
*   **Outputs:** A cleansed and filtered `pandas.DataFrame`.
*   **Data Transformation:** The DataFrame is transformed by removing rows (price cleaning, SIC filtering) and imputing a limited set of missing values (volume, VIX).
*   **Role in Research Pipeline:** This callable is the **Data Purifier and Universe Selector**. It implements the initial data cleaning and universe selection steps, ensuring the data is free of critical errors and represents the specific sector (real estate) analyzed in the paper.

#### **Task 4: `adjust_for_corporate_actions`**

*   **Inputs:** A cleansed `pandas.DataFrame`.
*   **Processes:**
    1.  Computes a new `Close_Price_Adj` column using the formula:
        $$
        \text{Close\_Price\_Adj} = \text{Close\_Price\_Raw} \times \text{CFACSHR}
        $$
    2.  Computes a new `Volume_Adj` column using the formula:
        $$
        \text{Volume\_Adj} = \frac{\text{Volume\_Raw}}{\text{CFACSHR}}
        $$
    3.  Performs a programmatic spot-check to verify adjustment consistency.
    4.  Drops the `CFACSHR` column to prevent data leakage.
*   **Outputs:** A `pandas.DataFrame` with adjusted price and volume columns.
*   **Data Transformation:** The DataFrame is transformed by adding two new columns (`Close_Price_Adj`, `Volume_Adj`) and removing one (`CFACSHR`).
*   **Role in Research Pipeline:** This callable is the **Time Series Normalizer**. Its role is to create a continuous and comparable time series for price and volume by removing the artificial jumps caused by corporate actions such as stock splits and stock dividends. This is a prerequisite for any meaningful time-series analysis, including the LPPL fitting and return calculations.
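
The two adjustment formulas can be sketched on a toy frame as follows. The function name and the toy values are illustrative assumptions; the spot-check shown here (adjusted price times adjusted volume equals the raw product) is one possible consistency test, not necessarily the one the pipeline uses.

```python
import numpy as np
import pandas as pd


def adjust_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the price and volume formulas above, then drop the factor column."""
    out = df.copy()
    out["Close_Price_Adj"] = out["Close_Price_Raw"] * out["CFACSHR"]
    out["Volume_Adj"] = out["Volume_Raw"] / out["CFACSHR"]
    # Assumed spot-check: the factor cancels in price * volume.
    raw_product = out["Close_Price_Raw"] * out["Volume_Raw"]
    adj_product = out["Close_Price_Adj"] * out["Volume_Adj"]
    assert np.allclose(raw_product, adj_product)
    return out.drop(columns=["CFACSHR"])


# Toy data: the raw price halves while the factor doubles.
toy = pd.DataFrame(
    {
        "Close_Price_Raw": [100.0, 50.0],
        "Volume_Raw": [1000.0, 2000.0],
        "CFACSHR": [1.0, 2.0],
    }
)
adjusted = adjust_sketch(toy)
```

On this toy input the adjusted price and volume series are flat, which is the continuity the adjustment is meant to restore.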

#### **Task 5: `derive_engineered_features`**

*   **Inputs:** A `pandas.DataFrame` with adjusted price and volume.
*   **Processes:**
    1.  Computes log-transformed price and volume:
        $$
        \text{Log\_Price} = \ln(\text{Close\_Price\_Adj})
        $$
        $$
        \text{Log\_Volume} = \ln(\text{Volume\_Adj} + 1)
        $$
    2.  Computes daily log returns on a per-ticker basis:
        $$
        r_t = \text{Log\_Price}_t - \text{Log\_Price}_{t-1}
        $$
    3.  Extracts integer `Month` and `Day` features from the `Date` index.
*   **Outputs:** A `pandas.DataFrame` enriched with the new feature columns.
*   **Data Transformation:** The DataFrame is transformed by adding five new feature columns: log price, log volume, daily log return, month, and day.
*   **Role in Research Pipeline:** This callable is the **Primary Feature Engineer**. It derives the foundational technical and calendar features that will be used as inputs for both the LPPL model (via `Log_Price`) and the Transformer model.
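
A small sketch of these feature derivations on a two-ticker panel. The column name `Log_Return` and the function name are assumptions made for illustration; the key point is that the difference is taken within each ticker, so the first observation of a ticker never borrows another ticker's price.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-02", "2020-01-03"]), ["AAA", "BBB"]],
    names=["Date", "TICKER"],
)
panel = pd.DataFrame(
    {"Close_Price_Adj": [10.0, 20.0, 11.0, 19.0], "Volume_Adj": [0.0, 99.0, 9.0, 99.0]},
    index=idx,
)


def derive_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["Log_Price"] = np.log(out["Close_Price_Adj"])
    # The +1 keeps zero-volume days finite.
    out["Log_Volume"] = np.log(out["Volume_Adj"] + 1.0)
    # Per-ticker first difference of the log price.
    out["Log_Return"] = out.groupby(level="TICKER")["Log_Price"].diff()
    dates = out.index.get_level_values("Date")
    out["Month"] = np.asarray(dates.month)
    out["Day"] = np.asarray(dates.day)
    return out


features = derive_features_sketch(panel)
```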

#### **Task 6: `align_and_validate_calendar`**

*   **Inputs:** A `pandas.DataFrame` with engineered features.
*   **Processes:**
    1.  Constructs a master `DatetimeIndex` of all unique trading days and checks it for large gaps.
    2.  Analyzes the number of tickers per day to assess if the panel is balanced or unbalanced.
    3.  Enforces consistency on market-wide features (e.g., `VIX_Close`) to ensure every ticker has the identical value for a given day.
*   **Outputs:** A `pandas.DataFrame` with a validated temporal structure.
*   **Data Transformation:** The `VIX_Close` column is potentially modified to enforce consistency.
*   **Role in Research Pipeline:** This callable is the **Temporal Integrity Validator**. It ensures the time dimension of the panel data is coherent and complete before any time-sensitive operations (like NLP aggregation or model fitting) are performed.

#### **Task 7: `setup_topic_model`**

*   **Inputs:** The main `pandas.DataFrame` and the study configuration.
*   **Processes:**
    1.  Extracts all unique news articles from the `News_Articles` column into a single corpus.
    2.  Uses a pre-trained `SentenceTransformer` model (e.g., `"all-MiniLM-L6-v2"`) to convert each unique article into a high-dimensional vector embedding.
    3.  Initializes and fits a `BERTopic` model on these embeddings, using the specified `UMAP` and `HDBSCAN` hyperparameters for dimensionality reduction and clustering.
*   **Outputs:** A tuple containing the unique corpus (list of strings), the embeddings (numpy array), and the fitted `BERTopic` model object.
*   **Data Transformation:** A list of lists of strings is transformed into a flat list of unique strings, a dense numerical matrix of embeddings, and a trained topic model object.
*   **Role in Research Pipeline:** This callable implements the first major step of the NLP pipeline described in Section 4.2.1. It is the **Thematic Structure Extractor**, responsible for discovering the underlying topics within the entire news corpus.

#### **Task 8: `apply_topic_filter`**

*   **Inputs:** The main `DataFrame`, the unique corpus, and the fitted `BERTopic` model.
*   **Processes:**
    1.  Extracts the keyword representations for each topic discovered by the model.
    2.  Identifies a subset of topics as "real estate-relevant" by matching their keywords against a predefined set (e.g., `"housing"`, `"REIT"`).
    3.  Filters the `News_Articles` column in the main DataFrame, removing any article that does not belong to one of the identified relevant topics.
*   **Outputs:** A `pandas.DataFrame` with the filtered `News_Articles` column and a `set` of the retained article texts.
*   **Data Transformation:** The lists within the `News_Articles` column are modified (shortened or emptied).
*   **Role in Research Pipeline:** This callable is the **Domain-Specific Information Filter**. It implements the second step of the NLP pipeline in Section 4.2.1, ensuring that the subsequent sentiment analysis is performed only on text that is thematically relevant to the study's universe (real estate), thereby reducing noise.

#### **Task 9: `classify_article_sentiment`**

*   **Inputs:** The set of filtered, relevant articles and the study configuration.
*   **Processes:**
    1.  Loads the pre-trained `ProsusAI/finbert` model and tokenizer.
    2.  Runs batched inference on the entire corpus of relevant articles to classify each as 'positive', 'neutral', or 'negative' and obtain a confidence score.
    3.  Maps the string labels to numerical polarity scores (`+1.0`, `0.0`, `-1.0`).
*   **Outputs:** A dictionary mapping each unique article text to its sentiment class, confidence, and numerical polarity.
*   **Data Transformation:** A set of strings is transformed into a dictionary mapping strings to structured sentiment data.
*   **Role in Research Pipeline:** This callable is the **Affective Classifier**. It implements the core sentiment analysis step from Section 4.2.1, quantifying the emotional tone of each relevant news article.

#### **Task 10: `aggregate_stock_day_sentiment`**

*   **Inputs:** The `DataFrame` with filtered `News_Articles` and the dictionary of sentiment results.
*   **Processes:** For each row (i.e., for each stock `i` on day `t`), it calculates the confidence-weighted average of the polarity scores of all articles in its `News_Articles` list. The formula implemented is from Section 3.3, Equation (9):
    $$
    S_{i,t} = \frac{\sum_{k=1}^{N_{i,t}} w_{i,t,k} s_{i,t,k}}{\sum_{k=1}^{N_{i,t}} w_{i,t,k}}
    $$
    where the weight $w_{i,t,k}$ is the FinBERT confidence score.
*   **Outputs:** A `pandas.DataFrame` with a new `Sentiment_Score` column.
*   **Data Transformation:** The `News_Articles` column (list of strings) is transformed into a new `Sentiment_Score` column (float).
*   **Role in Research Pipeline:** This callable is the **Stock-Level Sentiment Aggregator**. It bridges the gap from individual article sentiments to a structured, time-series feature, $S_{i,t}$, representing the daily sentiment for a specific stock.
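
The confidence-weighted average can be sketched in plain Python. The dictionary keys (`confidence`, `polarity`) and the choice to return `0.0` for a stock-day with no scored articles are assumptions for this example.

```python
from typing import Dict, List


def stock_day_sentiment(articles: List[str], results: Dict[str, dict]) -> float:
    """Confidence-weighted mean polarity of the scored articles on one stock-day."""
    scored = [results[a] for a in articles if a in results]
    weight_sum = sum(r["confidence"] for r in scored)
    if weight_sum <= 0:
        return 0.0  # assumption: treat a day without scored articles as neutral
    return sum(r["confidence"] * r["polarity"] for r in scored) / weight_sum


# Hypothetical FinBERT outputs for three articles.
sentiment_results = {
    "article a": {"confidence": 0.9, "polarity": 1.0},
    "article b": {"confidence": 0.6, "polarity": -1.0},
    "article c": {"confidence": 0.5, "polarity": 0.0},
}
score = stock_day_sentiment(["article a", "article b", "article c"], sentiment_results)
```

Here the weighted sum is 0.9 - 0.6 = 0.3 and the weights total 2.0, so the score is approximately 0.15: mildly positive, because the positive article carries more confidence than the negative one.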

#### **Task 11: `aggregate_market_level_sentiment`**

*   **Inputs:** The main `DataFrame` and the dictionary of sentiment results.
*   **Processes:** For each day `t`, it aggregates the confidence-weighted one-hot vectors of *all* articles published on that day across *all* tickers. It then normalizes these sums to get the daily market-wide share of positive, neutral, and negative sentiment. The formula implemented is from Section 4.2.1, Equation (17):
    $$
    S_t^{c} = \frac{\sum_{i,k \in D_t} s_{i,t,k}^{c}}{\sum_{i,k \in D_t} p_{i,t,k}}
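    $$
    where $s_{i,t,k}^{c}$ is the class-$c$ entry of an article's confidence-weighted one-hot vector, $p_{i,t,k}$ is that article's confidence, and $D_t$ is the set of articles published on day $t$.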
*   **Outputs:** A `pandas.DataFrame` with three new columns: `Market_Sentiment_Neg`, `Market_Sentiment_Neu`, `Market_Sentiment_Pos`.
*   **Data Transformation:** The `News_Articles` column is transformed into three new market-wide feature columns.
*   **Role in Research Pipeline:** This callable is the **Market-Level Sentiment Aggregator**. It creates the macro-behavioral features required for the market stream of the Dual-Stream Transformer.

#### **Task 12: `construct_hype_index`**

*   **Inputs:** The `DataFrame` with the filtered `News_Articles` column.
*   **Processes:**
    1.  Counts the number of articles for each stock-day, $N_{i,t}$.
    2.  Sums these counts across all tickers for each day to get the total market-wide article count, $N_{\text{mkt},t}$.
    3.  Calculates the Hype Index as the ratio, as defined in Section 3.4, Equation (11):
        $$
        H_{i,t} = \frac{N_{i,t}}{N_{\text{mkt},t}}
        $$
*   **Outputs:** A `pandas.DataFrame` with a new `Hype_Index` column.
*   **Data Transformation:** The `News_Articles` column is transformed into a new `Hype_Index` column (float).
*   **Role in Research Pipeline:** This callable is the **Attention Share Calculator**. It implements the Hype Index, a key behavioral feature that measures the intensity of media attention, distinct from its tone.
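
The three steps above can be sketched with a grouped transform. The function name is hypothetical, and assigning a Hype Index of 0 on days with no articles anywhere in the market is an assumption; the source does not say how that case is handled.

```python
import pandas as pd

days = pd.to_datetime(["2020-01-02", "2020-01-03"])
idx = pd.MultiIndex.from_tuples(
    [(days[0], "AAA"), (days[0], "BBB"), (days[1], "AAA"), (days[1], "BBB")],
    names=["Date", "TICKER"],
)
news = pd.DataFrame({"News_Articles": [["x", "y", "z"], ["w"], [], []]}, index=idx)


def hype_index_sketch(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    n_it = out["News_Articles"].apply(len)                 # N_{i,t}
    n_mkt = n_it.groupby(level="Date").transform("sum")    # N_{mkt,t}
    # Mask zero denominators so 0/0 becomes NaN, then fill with 0 (assumption).
    out["Hype_Index"] = (n_it / n_mkt.where(n_mkt > 0)).fillna(0.0)
    return out


hyped = hype_index_sketch(news)
```

On the first day the shares are 3/4 and 1/4, so the index sums to one across tickers whenever the market has at least one article.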

#### **Task 13: `define_lppl_calibration_windows`**

*   **Inputs:** The `DataFrame` with the `Log_Price` column and the study configuration.
*   **Processes:** For each ticker, it generates a list of all possible overlapping windows of `Log_Price` data with a fixed length `W` (e.g., 250 days). It validates that each window is complete (no `NaN`s) and prepares it with an integer time index from 1 to `W`.
*   **Outputs:** A list of `LPPLWindow` objects, where each object contains the data and metadata for one window.
*   **Data Transformation:** The continuous `Log_Price` time series for each stock is transformed into a discrete list of fixed-length segments.
*   **Role in Research Pipeline:** This callable is the **LPPL Problem Definer**. It prepares the individual, well-defined data segments that will be fed into the LPPL optimization algorithm.
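
A sketch of the window generation for a single ticker, using plain tuples in place of the `LPPLWindow` objects (whose fields are not shown here). The function name and the toy series are illustrative.

```python
import numpy as np


def rolling_windows(log_price: np.ndarray, W: int):
    """Return (start_position, time_index, segment) for every complete window."""
    t = np.arange(1, W + 1)  # integer time index 1..W shared by all windows
    windows = []
    for start in range(len(log_price) - W + 1):
        segment = log_price[start:start + W]
        if np.isnan(segment).any():
            continue  # skip incomplete windows
        windows.append((start, t, segment))
    return windows


series = np.array([1.0, 1.1, 1.2, 1.3, np.nan, 1.5])
wins = rolling_windows(series, W=3)
```

With a length-6 series and `W=3` there are four candidate windows; the two that overlap the `NaN` are dropped, leaving the windows that start at positions 0 and 1.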

#### **Task 14: `initialize_lppl_fitter`**

*   **Inputs:** The study configuration.
*   **Processes:**
    1.  Defines the strict mathematical bounds for each of the 7 LPPL parameters (e.g., $0 < m < 1$, $B < 0$).
    2.  Defines the strategy for generating multiple random starting points (seeds) for the optimization.
    3.  Documents this entire initialization strategy in a metadata file.
*   **Outputs:** A tuple containing the parameter bounds and the configuration.
*   **Data Transformation:** The configuration parameters are transformed into a structured set of bounds and a documented strategy for the optimizer.
*   **Role in Research Pipeline:** This callable is the **Optimizer Initializer**. It sets up the constrained, multi-start optimization framework required for robustly fitting the notoriously difficult LPPL model, as described in Section 2.1.1.

#### **Task 15: `fit_lppl_model_to_windows`**

*   **Inputs:** The list of `LPPLWindow` objects and the parameter bounds.
*   **Processes:** For each window, it runs a constrained non-linear least squares optimization (`scipy.optimize.least_squares`) multiple times, starting from each of the random seeds. It seeks to find the parameter vector $\theta = (A, B, C, m, \omega, \phi, t_c)$ that minimizes the sum of squared errors:
    $$
    J(\theta) = \sum_{i=1}^{W} \big[\ln p_i - \ln \hat{p}(t_i; \theta)\big]^2
    $$
    It selects the best valid fit (lowest error) from the multiple starts.
*   **Outputs:** A `pandas.DataFrame` where each row contains the best-fit 7 parameters for a single window.
*   **Data Transformation:** The list of data windows is transformed into a table of fitted model parameters.
*   **Role in Research Pipeline:** This callable is the **LPPL Calibration Engine**. It is the computational core of the econophysics analysis, performing the thousands of optimizations required to fit the LPPL model across the entire dataset.
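
To make the objective concrete, the sketch below evaluates $J(\theta)$ for the commonly used LPPL form $\ln \hat{p}(t) = A + B (t_c - t)^m + C (t_c - t)^m \cos(\omega \ln(t_c - t) - \phi)$. This functional form and its phase sign convention are assumptions here, since the section does not spell out the formula; the parameter values are arbitrary. In the pipeline each random start would be passed to `scipy.optimize.least_squares`, which this sketch omits.

```python
import numpy as np


def lppl_log_price(t, A, B, C, m, omega, phi, tc):
    """Assumed standard LPPL log-price; requires tc > max(t)."""
    dt = tc - t
    power = dt ** m
    return A + B * power + C * power * np.cos(omega * np.log(dt) - phi)


def sse(theta, t, log_p):
    """Sum of squared errors J(theta) between observed and fitted log prices."""
    return float(np.sum((log_p - lppl_log_price(t, *theta)) ** 2))


t = np.arange(1, 251, dtype=float)               # W = 250, time index 1..W
true_theta = (5.0, -0.5, 0.05, 0.5, 8.0, 1.0, 260.0)
log_p = lppl_log_price(t, *true_theta)           # synthetic, noise-free window

candidates = [
    (5.0, -0.4, 0.05, 0.5, 8.0, 1.0, 260.0),     # perturbed B
    true_theta,
    (5.0, -0.5, 0.05, 0.5, 9.0, 1.0, 260.0),     # perturbed omega
]
# Multi-start selection reduces to keeping the candidate with the lowest J.
best = min(candidates, key=lambda th: sse(th, t, log_p))
```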

#### **Task 16: `compute_and_merge_lppl_residuals`**

*   **Inputs:** The main `DataFrame` and the `lppl_fits` DataFrame.
*   **Processes:**
    1.  For each fitted window, it calculates the raw residuals: $\epsilon(t) = \ln p(t) - \ln \hat{p}(t)$, as defined in Section 3.2, Equation (5).
    2.  It combines the residuals from all overlapping windows.
    3.  It then computes the normalized residual $\epsilon_{\text{norm}}(t)$ on a per-ticker basis using a running maximum, as defined in Equation (8):
        $$
        \epsilon_{\text{norm}}(t) = \frac{\epsilon(t)}{\max_{s \le t} |\epsilon(s)|}
        $$
    4.  Merges the final `Residual_Norm` series back into the main DataFrame.
*   **Outputs:** A `pandas.DataFrame` with a new `Residual_Norm` column.
*   **Data Transformation:** The table of fitted parameters is transformed into a single, normalized time-series feature.
*   **Role in Research Pipeline:** This callable is the **Technical Signal Extractor**. It translates the results of the LPPL fits into the primary technical indicator, $\epsilon_{\text{norm}}(t)$, which quantifies the price deviation from the theoretical bubble path.
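
The running-maximum normalization in step 3 can be sketched with `np.maximum.accumulate` for one ticker's residual series. Returning 0 while the running maximum is still 0 is an assumption made to avoid division by zero.

```python
import numpy as np


def normalize_residuals(eps: np.ndarray) -> np.ndarray:
    """Divide each residual by the largest absolute residual seen so far."""
    eps = np.asarray(eps, dtype=float)
    running_max = np.maximum.accumulate(np.abs(eps))
    out = np.zeros_like(eps)
    # Only divide where the denominator is positive (assumption: 0 otherwise).
    np.divide(eps, running_max, out=out, where=running_max > 0)
    return out


eps = np.array([0.1, -0.2, 0.1, 0.4, -0.2])
eps_norm = normalize_residuals(eps)
```

Because the denominator only uses values up to time $t$, the result stays in $[-1, 1]$ and does not look ahead. For the toy series the running maxima are 0.1, 0.2, 0.2, 0.4, 0.4.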

#### **Task 17: `construct_bubblescore`**

*   **Inputs:** The `DataFrame` containing `Residual_Norm`, `Hype_Index`, and `Sentiment_Score`.
*   **Processes:** It implements the core HLPPL fusion logic from Section 3.5, Equation (14), a piecewise formula that depends on the sign of the residual:
    $$
    \text{BubbleScore}_{i}(t) =
    \begin{cases}
    \epsilon_{\text{norm}}(t) + \alpha_1 H_{i,t} + \alpha_2 S_{i,t}, & \text{if } \epsilon_{\text{norm}}(t) > 0 \\
    \epsilon_{\text{norm}}(t) - \alpha_1 H_{i,t} + \alpha_2 S_{i,t}, & \text{if } \epsilon_{\text{norm}}(t) \le 0
    \end{cases}
    $$
*   **Outputs:** A `pandas.DataFrame` with a new `BubbleScore` column.
*   **Data Transformation:** Three feature columns are fused into a single, more powerful feature column.
*   **Role in Research Pipeline:** This callable is the **Signal Fusion Engine**. It implements the central hypothesis of the paper: that combining a technical signal with behavioral signals in a regime-dependent way creates a superior measure of mispricing. The output, `BubbleScore`, is the paper's key methodological innovation.
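
The piecewise formula reduces to a sign flip on the hype term, which the sketch below expresses with `np.where`. The values of $\alpha_1$ and $\alpha_2$ used here are arbitrary placeholders, not the paper's settings.

```python
import numpy as np


def bubble_score(eps_norm, hype, sentiment, alpha1, alpha2):
    """Hype amplifies the residual in its own direction; sentiment is added as-is."""
    eps_norm = np.asarray(eps_norm, dtype=float)
    sign = np.where(eps_norm > 0, 1.0, -1.0)  # the <= 0 branch uses -alpha1 * H
    return eps_norm + sign * alpha1 * np.asarray(hype) + alpha2 * np.asarray(sentiment)


scores = bubble_score(
    eps_norm=[0.5, -0.5, 0.0],
    hype=[0.2, 0.2, 0.2],
    sentiment=[0.1, 0.1, 0.1],
    alpha1=0.5,   # placeholder weight
    alpha2=0.5,   # placeholder weight
)
```

With these inputs the three scores are approximately 0.65, -0.55, and -0.05: identical hype pushes a positive residual up and a non-positive residual down.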

#### **Task 18: `label_bubble_episodes`**

*   **Inputs:** The `DataFrame` with the `BubbleScore` column.
*   **Processes:** It implements the thresholding and persistence algorithm from Section 4.2.2. It identifies contiguous periods where $|\text{BubbleScore}(t)| > \tau$ and the duration is at least $d_{\min}$ days.
*   **Outputs:** A `pandas.DataFrame` with two new binary indicator columns (`In_Bubble_Episode`, `In_Negative_Episode`).
*   **Data Transformation:** The continuous `BubbleScore` series is transformed into discrete event labels.
*   **Role in Research Pipeline:** This callable is the **Event Labeler**. While the `BubbleScore` itself is the target for the regression model, these discrete labels are useful for descriptive analysis and for understanding the model's output, as shown in the paper's case study figures.
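
A sketch of threshold-plus-persistence labeling for one ticker. It assumes that `In_Bubble_Episode` marks runs with score above $\tau$ and `In_Negative_Episode` marks runs below $-\tau$; the values of $\tau$ and $d_{\min}$ below are placeholders.

```python
import numpy as np


def label_runs(mask, d_min: int) -> np.ndarray:
    """Keep only runs of consecutive True values lasting at least d_min steps."""
    out = np.zeros(len(mask), dtype=bool)
    start = None
    # A trailing False sentinel closes a run that reaches the end of the series.
    for i, flag in enumerate(list(mask) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= d_min:
                out[start:i] = True
            start = None
    return out


score = np.array([0.1, 0.6, 0.7, 0.8, 0.2, 0.9, -0.6, -0.7, -0.9])
tau, d_min = 0.5, 3  # placeholder threshold and minimum duration
in_bubble = label_runs(score > tau, d_min)
in_negative = label_runs(score < -tau, d_min)
```

The isolated spike at position 5 exceeds the threshold but lasts one day, so it is not labeled; only the three-day runs qualify.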

#### **Task 19: `engineer_stock_level_sequences`**

*   **Inputs:** The fully featured `DataFrame`.
*   **Processes:**
    1.  Selects the stock-specific features (`Log_Price`, `PE_Ratio`, etc.).
    2.  Performs Z-score normalization, crucially fitting the scaler *only* on the training data to prevent data leakage.
    3.  Handles missing fundamental data by dropping rows.
    4.  For each ticker, it constructs a set of overlapping sequences of length `L` (e.g., 60 days).
*   **Outputs:** A list of numpy arrays (the sequences), a `MultiIndex` of their corresponding anchor points, and the fitted scaler objects.
*   **Data Transformation:** The 2D panel DataFrame is transformed into a 3D-like dataset of sequences suitable for a Transformer model.
*   **Role in Research Pipeline:** This callable is the **Stock-Stream Data Preparer**. It prepares the first of the two input streams for the Dual-Stream Transformer, as described in Section 4.1.
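
The leakage-proof scaling and sequence construction can be sketched for a single ticker as below. The helper names, the toy feature matrix, and the training cut-off are assumptions; the point is that the mean and standard deviation come from the training rows only and are then applied to every row.

```python
import numpy as np


def fit_zscore(train: np.ndarray):
    """Estimate per-feature mean and std on the training rows only."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma = np.where(sigma > 0, sigma, 1.0)  # guard against constant features
    return mu, sigma


def make_sequences(x: np.ndarray, L: int) -> np.ndarray:
    """Stack overlapping windows of length L into shape (n_windows, L, n_features)."""
    return np.stack([x[i:i + L] for i in range(len(x) - L + 1)])


raw = np.arange(20, dtype=float).reshape(10, 2)   # 10 days, 2 features
n_train = 8                                        # hypothetical training cut-off
mu, sigma = fit_zscore(raw[:n_train])
scaled = (raw - mu) / sigma                        # same statistics for all rows
sequences = make_sequences(scaled, L=4)
```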

#### **Task 20: `engineer_market_level_sequences`**

*   **Inputs:** The fully featured `DataFrame`.
*   **Processes:**
    1.  Selects the market-wide features (`VIX_Close`, `Market_Sentiment_*`, etc.).
    2.  Performs leakage-proof normalization on the continuous features.
    3.  Constructs a set of overlapping sequences of length `L` for the market-wide data.
*   **Outputs:** A dictionary mapping each anchor date to its corresponding market sequence array.
*   **Data Transformation:** The market-level columns of the panel DataFrame are transformed into a lookup map of sequence arrays.
*   **Role in Research Pipeline:** This callable is the **Market-Stream Data Preparer**. It prepares the second input stream for the Dual-Stream Transformer, providing the market context for each prediction.

#### **Task 21: `construct_and_align_targets`**

*   **Inputs:** The `DataFrame` with the `BubbleScore` column and the anchor indices from the sequence generation.
*   **Processes:** For each anchor point at time `t`, it looks up the future values of the `BubbleScore` at times $t+1, t+2, \ldots, t+5$. It uses a leakage-proof `groupby().shift()` operation. It drops any samples for which a complete set of future targets is not available.
*   **Outputs:** A numpy array of target vectors and the final, filtered list of valid anchor indices.
*   **Data Transformation:** The `BubbleScore` time series is transformed into a matrix of multi-horizon targets, perfectly aligned with the input sequences.
*   **Role in Research Pipeline:** This callable is the **Supervised Learning Target Constructor**. It creates the ground-truth labels (`y`) that the Transformer model will be trained to predict.
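
A sketch of the per-ticker `groupby().shift()` alignment on a toy panel. The target column names (`y_t+1`, ...) and the function name are illustrative assumptions.

```python
import pandas as pd

dates = pd.date_range("2020-01-01", periods=8, freq="D")
idx = pd.MultiIndex.from_tuples(
    [(d, "AAA") for d in dates] + [(d, "BBB") for d in dates[:6]],
    names=["Date", "TICKER"],
)
scores = pd.DataFrame(
    {"BubbleScore": [float(i) for i in range(8)] + [float(i) for i in range(10, 16)]},
    index=idx,
)


def build_targets(df: pd.DataFrame, horizons=(1, 2, 3, 4, 5)) -> pd.DataFrame:
    grouped = df.groupby(level="TICKER")["BubbleScore"]
    # shift(-h) inside each ticker pulls the value h steps ahead onto row t,
    # so one ticker's future never leaks into another ticker's targets.
    targets = pd.DataFrame(
        {f"y_t+{h}": grouped.shift(-h) for h in horizons}, index=df.index
    )
    return targets.dropna()  # drop anchors without a full set of future values


targets = build_targets(scores)
```

Ticker `AAA` has eight days, so only its first three anchors have all five future values; `BBB` has six days and keeps one anchor.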

#### **Task 22: `split_dataset_chronologically`**

*   **Inputs:** All the prepared data components (stock sequences, market sequences, targets, anchor indices).
*   **Processes:** It determines date-based boundaries based on the specified ratios (e.g., 80-10-10) and partitions all data components into training, validation, and test sets. The split is strictly chronological to prevent look-ahead bias.
*   **Outputs:** A dictionary containing the three partitioned `ModelDataset` objects.
*   **Data Transformation:** The single, unified dataset is split into three disjoint subsets.
*   **Role in Research Pipeline:** This callable is the **Cross-Validation Manager**. It implements the essential step of splitting data for model training and evaluation in a way that is valid for time-series forecasting.
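
One way to sketch a date-based chronological split is to place the boundaries on the sorted unique anchor dates, so that all samples sharing a date land in the same partition. The function name and the boundary rounding rule are assumptions.

```python
import numpy as np


def chronological_split(anchor_dates: np.ndarray, ratios=(0.8, 0.1, 0.1)):
    """Return boolean masks (train, val, test) split on date boundaries."""
    unique_dates = np.unique(anchor_dates)  # np.unique returns sorted values
    n = len(unique_dates)
    train_end = unique_dates[int(round(n * ratios[0])) - 1]
    val_end = unique_dates[int(round(n * (ratios[0] + ratios[1]))) - 1]
    train = anchor_dates <= train_end
    val = (anchor_dates > train_end) & (anchor_dates <= val_end)
    test = anchor_dates > val_end
    return train, val, test


# Ten trading dates, two samples (tickers) per date.
anchor_dates = np.repeat(
    np.arange("2020-01-01", "2020-01-11", dtype="datetime64[D]"), 2
)
train_mask, val_mask, test_mask = chronological_split(anchor_dates)
```

The same masks can then index the stock sequences, market sequences, and targets, which keeps all components aligned.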

#### **Task 23: `DualStreamTransformer` (Class Definition)**

*   **Inputs:** A configuration dictionary and the number of input features for each stream.
*   **Processes:** This is not a data processing function but a class that *defines* the neural network architecture. It constructs the PyTorch modules for the input projectors, positional encodings, parallel Transformer encoders, the bi-directional cross-attention layer, the fusion layer, and the multi-horizon prediction heads, as described in Section 4.1.
*   **Outputs:** An instantiated `torch.nn.Module` object representing the model.
*   **Data Transformation:** Not applicable.
*   **Role in Research Pipeline:** This is the **Model Architect**. It provides the blueprint for the predictive engine of the entire system.

#### **Task 24: `CompositeLoss` (Class Definition)**

*   **Inputs:** A configuration dictionary with loss weights.
*   **Processes:** This class defines the custom, multi-component loss function. Its `forward` method takes model predictions and targets and computes the weighted sum of the five loss components, as specified in Section "Combined Loss Function", Equation (15):
    $$
    \mathcal{L} = \lambda_1 \mathcal{L}_{\text{Huber}} + \lambda_2 \mathcal{L}_{\text{Corr}} + \lambda_3 \mathcal{L}_{R^2} + \lambda_4 \mathcal{L}_{\text{Cons}} + \lambda_5 \mathcal{L}_{\text{Smooth}}
    $$
*   **Outputs:** An instantiated `torch.nn.Module` object that can be called to compute a scalar loss.
*   **Data Transformation:** Not applicable.
*   **Role in Research Pipeline:** This is the **Objective Function Definer**. It specifies the exact mathematical objective that the model training process will seek to minimize, tailoring the optimization towards producing robust, stable, and accurate financial forecasts.

#### **Tasks 25-26: `train_and_validate_model`**

*   **Inputs:** The instantiated model, the partitioned datasets, the loss function, and the configuration.
*   **Processes:** This function orchestrates the entire model training process. It sets up the `AdamW` optimizer and the `OneCycleLR` scheduler, then runs the main training loop: in each epoch it iterates through the training data, performs the forward and backward passes, and updates the model weights. It applies the regularization techniques mentioned in the paper, namely gradient clipping and dropout (the latter handled by the model architecture). At the end of each epoch it runs a validation loop and implements early stopping to limit overfitting and keep the best-performing model.
*   **Outputs:** The trained model loaded with its best weights, and a `DataFrame` of the training history.
*   **Data Transformation:** The training data is used to transform the model from a randomly initialized state to a trained state.
*   **Role in Research Pipeline:** This callable is the **Model Training Engine**. It is the computational core of the deep learning phase, responsible for optimizing the parameters of the Transformer model.
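
The early-stopping logic can be sketched independently of PyTorch. The class name, the `patience` value, and the validation losses below are hypothetical; in a real loop the `improved` flag would be the point at which the best checkpoint is saved.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0
        self.improved = False

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        self.improved = val_loss < self.best - self.min_delta
        if self.improved:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [1.0, 0.8, 0.85, 0.9, 0.7]  # made-up validation curve
stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With a patience of 2, training stops at epoch 3 and never sees the later improvement at epoch 4, which illustrates the trade-off in choosing `patience`.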

#### **Task 27: `persist_model_and_metadata`**

*   **Inputs:** The study configuration and the training history.
*   **Processes:**
    1.  Validates the `best_model.pth` checkpoint saved by the training function.
    2.  Creates a detailed `model_metadata.json` file, including hyperparameters and the Git commit hash of the code.
    3.  Bundles all critical artifacts into a single `.tar.gz` archive.
*   **Outputs:** A compressed archive file.
*   **Data Transformation:** A collection of model artifacts and logs are transformed into a single, portable file.
*   **Role in Research Pipeline:** This callable is the **Reproducibility Manager**. It ensures that the result of the computationally expensive training process is saved in a complete, auditable, and reproducible format.

#### **Task 28: `run_inference_pipeline`**

*   **Inputs:** The partitioned datasets, configuration, and model artifacts.
*   **Processes:**
    1.  Loads the best trained model from the saved checkpoint.
    2.  Runs the model in evaluation mode on the held-out test set to generate multi-horizon predictions.
    3.  Computes a suite of performance metrics (Correlation, MSE, RMSE, MAE) by comparing the predictions to the ground-truth targets.
*   **Outputs:** A `DataFrame` of predictions and a `DataFrame` of performance metrics.
*   **Data Transformation:** The test set input sequences are transformed into a `DataFrame` of predictions and then into a summary table of metrics.
*   **Role in Research Pipeline:** This callable is the **Out-of-Sample Evaluator**. It provides the final, unbiased assessment of the model's predictive performance, generating the key results reported in the paper's Table 3 and Figure 11.

#### **Task 29: `generate_trading_signals`**

*   **Inputs:** The `DataFrame` of model predictions and the configuration.
*   **Processes:** It translates the continuous numerical predictions into discrete trading signals (`LONG_ENTRY`, `SHORT_EXIT`, etc.) by applying the threshold-based rules and the prediction reversal rule described in Section 6.1.
*   **Outputs:** A long-form `DataFrame` where each row is a specific trading signal for a given ticker, date, and horizon.
*   **Data Transformation:** A `DataFrame` of continuous predictions is transformed into a `DataFrame` of discrete events.
*   **Role in Research Pipeline:** This callable is the **Strategy Logic Interpreter**. It bridges the gap between the model's forecasts and an actionable trading strategy.

#### **Task 30: `simulate_all_strategies`**

*   **Inputs:** The `DataFrame` of signals, a `DataFrame` of prices, and the configuration.
*   **Processes:** It runs an event-driven backtest for each unique strategy (ticker-horizon pair). It iterates day-by-day, manages position state, applies transaction costs, and enforces the stop-loss rule specified in Section 6.1.
*   **Outputs:** Dictionaries containing the daily equity curves and detailed trade logs for every strategy.
*   **Data Transformation:** The discrete signals and price series are transformed into a continuous equity curve and a log of realized trades.
*   **Role in Research Pipeline:** This callable is the **Backtesting Engine**. It simulates the performance of the trading strategy in a realistic historical context, generating the raw PnL data required for performance evaluation.

#### **Task 31: `report_backtest_performance`**

*   **Inputs:** The dictionaries of equity curves and trade logs.
*   **Processes:** For each simulated strategy, it calculates a comprehensive set of performance metrics (Annualized Return, Sharpe Ratio, Max Drawdown, Win Rate). It then aggregates these results into a summary table and computes the high-level findings reported in the paper, such as the top-5 performers (Table 4), overall success rate (Table 5), and the distribution of optimal horizons (Table 7).
*   **Outputs:** A final summary `DataFrame` of performance metrics.
*   **Data Transformation:** The raw simulation outputs (equity curves, trade logs) are transformed into a final, interpretable table of performance statistics.
*   **Role in Research Pipeline:** This callable is the **Performance Analyst**. It distills the results of the backtest into the key metrics that determine the success or failure of the research project.
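
The four headline metrics can be sketched from an equity curve and a list of per-trade profits. The conventions used here (252 periods per year, a zero risk-free rate, sample standard deviation, a win defined as strictly positive PnL) are assumptions and may differ from the pipeline's definitions.

```python
import numpy as np


def performance_metrics(equity, trade_pnls, periods_per_year: int = 252) -> dict:
    equity = np.asarray(equity, dtype=float)
    returns = equity[1:] / equity[:-1] - 1.0
    years = len(returns) / periods_per_year
    annualized_return = (equity[-1] / equity[0]) ** (1.0 / years) - 1.0
    vol = returns.std(ddof=1)
    # Sharpe ratio with an assumed risk-free rate of zero.
    sharpe = np.sqrt(periods_per_year) * returns.mean() / vol if vol > 0 else float("nan")
    # Drawdown is measured against the running peak of the equity curve.
    running_peak = np.maximum.accumulate(equity)
    max_drawdown = float(np.max(1.0 - equity / running_peak))
    pnls = np.asarray(trade_pnls, dtype=float)
    win_rate = float((pnls > 0).mean()) if len(pnls) else float("nan")
    return {
        "annualized_return": float(annualized_return),
        "sharpe_ratio": float(sharpe),
        "max_drawdown": max_drawdown,
        "win_rate": win_rate,
    }


metrics = performance_metrics(equity=[100.0, 110.0, 99.0, 120.0], trade_pnls=[5.0, -2.0, 3.0, 1.0])
```

In the toy curve the only drawdown is the fall from 110 to 99, a 10% drop from the peak, and three of the four trades are winners.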

#### **Task 32: `run_hlppl_pipeline`**

*   **Inputs:**
    1.  A raw `pandas.DataFrame` (`df_raw`) containing the complete, unprocessed dataset.
    2.  A Python dictionary (`study_parameters`) containing the configuration for a single experimental run.
    3.  A set of `pathlib.Path` objects (`intermediate_dir`, `model_dir`, `log_dir`, `report_dir`) specifying the isolated output directories for this specific run.
*   **Processes:** This function is the **Master Orchestrator** for a single, end-to-end execution of the entire research methodology. It does not implement any novel algorithms itself but is responsible for the critical task of sequencing the calls to all previously defined task-level orchestrators (from Task 1 to Task 31). Its internal process is a direct, linear sequence:
    1.  It begins by calling `validate_and_parse_config` (Task 1) and `validate_input_dataframe` (Task 2).
    2.  It then proceeds through the data cleansing, feature engineering, and NLP/LPPL signal generation phases by calling the orchestrators from Tasks 3 through 18.
    3.  It passes the fully featured DataFrame to the machine learning data preparation functions (Tasks 19-22) to create the final `partitioned_datasets`.
    4.  It orchestrates the deep learning phase by defining the model and loss function (Tasks 23-24) and then calling `train_and_validate_model` (Task 26).
    5.  It finalizes the modeling phase by calling `persist_model_and_metadata` (Task 27).
    6.  Finally, it orchestrates the evaluation phase by calling the inference, signal generation, backtesting, and reporting functions (Tasks 28-31).
    7.  The entire sequence is wrapped in a top-level `try...except` block to catch and log any failure at any stage of the pipeline.
*   **Outputs:** The primary output is the final `performance_summary` `pandas.DataFrame` generated by the last step (Task 31). As side effects, it populates the specified output directories with all intermediate artifacts, logs, models, and reports generated during the run.
*   **Data Transformation:** This function converts the raw input `df_raw` and `study_parameters` into the final `performance_summary` DataFrame, managing the flow of data through dozens of intermediate states.
*   **Role in Research Pipeline:** This callable is the **Execution Engine**. While other functions define *what* to do, this function defines *in what order* to do it. Its role is to ensure that the logical and causal dependencies between tasks are respected (e.g., residuals cannot be calculated before the model is fitted). Because it accepts parameterized output directories, it is a reusable, modular workflow, which is essential for the ablation studies in Task 33. It represents the complete, reproducible recipe for a single experimental run.
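
The sequencing pattern described for Task 32 can be sketched as follows. This is a minimal illustration, not the repository's implementation: the three stage functions, the `run_pipeline_sketch` name, and the returned status dictionary are hypothetical stand-ins for the real task-level orchestrators, and plain Python lists stand in for DataFrames.

```python
import logging
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger("hlppl_sketch")

State = Dict[str, Any]


# Hypothetical stand-ins for the task-level orchestrators (Tasks 1-31).
def validate_stage(state: State) -> State:
    if state["df_raw"] is None:
        raise ValueError("df_raw must not be None")
    return state


def feature_stage(state: State) -> State:
    return {**state, "features": [x * 2 for x in state["df_raw"]]}


def report_stage(state: State) -> State:
    return {**state, "performance_summary": {"n_rows": len(state["features"])}}


# The order of this list encodes the dependencies between stages.
STAGES: List[Tuple[str, Callable[[State], State]]] = [
    ("validate", validate_stage),
    ("features", feature_stage),
    ("report", report_stage),
]


def run_pipeline_sketch(
    df_raw: Any, study_parameters: Dict[str, Any], report_dir: Path
) -> Dict[str, Any]:
    """Run the stages in a fixed order; log and report the first failure."""
    state: State = {"df_raw": df_raw, "config": study_parameters}
    stage_name = "setup"
    try:
        # Each run writes only into the directory it was given.
        report_dir.mkdir(parents=True, exist_ok=True)
        for stage_name, stage in STAGES:
            logger.info("Running stage: %s", stage_name)
            state = stage(state)
        summary = state["performance_summary"]
        (report_dir / "performance_summary.txt").write_text(repr(summary))
    except Exception as exc:
        # Top-level handler: record which stage failed and why.
        logger.exception("Pipeline failed in stage '%s'", stage_name)
        return {"status": "FAILED", "failed_stage": stage_name, "error": str(exc)}
    return {"status": "SUCCESS", "performance_summary": summary}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline_sketch([1, 2, 3], {}, Path(tmp) / "reports"))
        print(run_pipeline_sketch(None, {}, Path(tmp) / "reports"))
```

Keeping the stage order in a single list makes the dependency chain explicit, and passing the output directory as an argument is what allows the same function to be reused for isolated runs.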

#### **Task 33: `run_ablation_studies`**

*   **Inputs:** The raw `DataFrame` and the base configuration.
*   **Processes:**
    1.  Systematically generates a list of modified configurations to test different hypotheses (e.g., model performance without the Hype Index).
    2.  For each modified configuration, it calls the master `run_hlppl_pipeline` function, executing the entire research process in an isolated directory.
    3.  It collects the final performance summary from each run and aggregates them into a final comparison report and a set of visualizations.
*   **Outputs:** A collection of reports and plots saved to disk.
*   **Data Transformation:** A base configuration is transformed into a comprehensive analysis of the model's robustness and component contributions.
*   **Role in Research Pipeline:** This callable is the **Robustness and Sensitivity Analyst**. It moves beyond a single point estimate of performance to explore the model's behavior under different conditions, providing a deeper understanding of what drives its success and how sensitive it is to key parameters.
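
A minimal sketch of the ablation loop described above, under stated assumptions: `with_override`, `run_ablations_sketch`, `stub_pipeline`, and the `features.use_hype_index` flag are all hypothetical names chosen for illustration, and the stub returns bookkeeping values rather than real performance metrics.

```python
import copy
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Sequence, Tuple

Config = Dict[str, Any]
Override = Tuple[Sequence[str], Any]  # (path to a nested key, replacement value)


def with_override(config: Config, path: Sequence[str], value: Any) -> Config:
    """Return a deep copy of `config` with the value at `path` replaced."""
    new_config = copy.deepcopy(config)
    node = new_config
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    return new_config


def run_ablations_sketch(
    base_config: Config,
    ablations: Dict[str, Override],
    run_pipeline: Callable[[Config, Path], Dict[str, Any]],
    root_dir: Path,
) -> List[Dict[str, Any]]:
    """Run the baseline plus one pipeline execution per ablation."""
    variants: Dict[str, Config] = {"baseline": copy.deepcopy(base_config)}
    for name, (path, value) in ablations.items():
        variants[name] = with_override(base_config, path, value)

    rows: List[Dict[str, Any]] = []
    for name, config in variants.items():
        # One isolated output directory per run.
        run_dir = root_dir / name
        run_dir.mkdir(parents=True, exist_ok=True)
        summary = run_pipeline(config, run_dir)
        rows.append({"run": name, **summary})
    return rows


def stub_pipeline(config: Config, run_dir: Path) -> Dict[str, Any]:
    """Hypothetical stand-in for `run_hlppl_pipeline`; returns no real metrics."""
    return {
        "uses_hype_index": config["features"]["use_hype_index"],
        "output_dir": run_dir.name,
    }


# Hypothetical configuration flag, used only for this illustration.
BASE_CONFIG: Config = {"features": {"use_hype_index": True}}
ABLATIONS: Dict[str, Override] = {
    "no_hype_index": (("features", "use_hype_index"), False),
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for row in run_ablations_sketch(BASE_CONFIG, ABLATIONS, stub_pipeline, Path(tmp)):
            print(row)
```

Deep-copying the base configuration before each override keeps the runs independent, so one ablation cannot leak a modified setting into the next.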

<br><br>

### **Usage Example**

### Example of End-to-End Pipeline Execution

Below is a high-fidelity example of how to execute the end-to-end pipeline using synthetic data:
<br>

```python
# ==============================================================================
# Main Execution Script for the HLPPL Research Pipeline
# ==============================================================================
# This script serves as the main entry point for running the entire end-to-end
# research pipeline developed to replicate the "Identifying and Quantifying
# Financial Bubbles with the Hyped Log-Periodic Power Law Model" study.
#
# It demonstrates the three core steps of a professional quantitative project:
#   1. Configuration Loading: Loading all study parameters from an external,
#      human-readable YAML file.
#   2. Data Loading: Preparing the raw input DataFrame that matches the exact
#      schema required by the pipeline. In this example, we generate a small,
#      synthetic but structurally correct dataset for demonstration purposes.
#   3. Pipeline Execution: Calling the top-level orchestrator function to run
#      the baseline model and, optionally, the full suite of ablation studies.
#
# To run this script, ensure all previously defined functions (from Task 1 to 33)
# are available in the execution scope (e.g., in preceding notebook cells or
# imported from modules).
# ==============================================================================

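import logging
import pandas as pd
import numpy as np
import yaml  # Requires PyYAML: pip install pyyaml
from pathlib import Path
from typing import Dict, Any

# Emit INFO-level messages; by default, logging.info() output is suppressed.
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")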

# Assume all orchestrator functions are defined in the current scope.
# Assume all imports of the requisite Python modules have been made.
# For example:
# from .pipeline import execute_full_study

def create_sample_dataframe() -> pd.DataFrame:
    """
    Generates a small, synthetic but structurally correct DataFrame for demonstration.
    
    In a real-world scenario, this function would be replaced with a data loader
    that reads and merges data from actual sources like CRSP, Compustat, and news APIs.
    """
    # Define the time range and tickers for the sample data.
    dates = pd.date_range(start='2018-01-01', end='2024-12-31', freq='B')
    tickers = ['HOUS', 'AMTX']
    
    # Create a MultiIndex from the product of dates and tickers.
    index = pd.MultiIndex.from_product([dates, tickers], names=['Date', 'TICKER'])
    
    # Create the DataFrame with the correct index.
    df = pd.DataFrame(index=index)
    
    # --- Generate Realistic Synthetic Data ---
    # Generate random walks for the closing prices for each ticker.
    log_returns = np.random.normal(loc=0.0005, scale=0.02, size=(len(dates), len(tickers)))
    log_prices = np.log(100) + np.cumsum(log_returns, axis=0)
    df['Close_Price_Raw'] = pd.DataFrame(np.exp(log_prices), index=dates, columns=tickers).stack()

    # Generate other columns with the correct dtypes and realistic values.
    df['PERMNO'] = df.index.get_level_values('TICKER').map({'HOUS': 10001, 'AMTX': 10002}).astype('int64')
    df['SIC_Code'] = df.index.get_level_values('TICKER').map({'HOUS': 6531, 'AMTX': 2869}).astype('int64') # HOUS is in target universe
    df['Volume_Raw'] = np.random.randint(100_000, 5_000_000, size=len(df)).astype('int64')
    df['CFACSHR'] = 1.0 # Assume no corporate actions for simplicity in this sample.
    df['PE_Ratio'] = np.random.uniform(10, 30, size=len(df))
    df['PB_Ratio'] = np.random.uniform(1, 5, size=len(df))
    
    # VIX is a market-wide feature, so it should be the same for all tickers on a given day.
    vix_series = pd.Series(np.random.uniform(12, 40, size=len(dates)), index=dates)
    df['VIX_Close'] = df.index.get_level_values('Date').map(vix_series)
    
    # News articles: mostly empty lists, with some sample text.
    news_list = [[] for _ in range(len(df))]
    # Sprinkle in some sample articles.
    for i in np.random.choice(len(df), size=50, replace=False):
        news_list[i] = ["This is a sample news article about real estate trends.", "Another article discusses market volatility."]
    df['News_Articles'] = news_list
    
    return df

def main():
    """
    Main entry point to run the HLPPL study.
    """
    # --- 1. Configuration Loading ---
    # Define the path to the configuration file.
    config_path = Path("config.yaml")
    
    # Check if the configuration file exists.
    if not config_path.exists():
        logging.error(f"Configuration file not found at '{config_path}'. Please create it.")
        return

    # Load the study parameters from the YAML file.
    logging.info(f"Loading base configuration from '{config_path}'...")
    with open(config_path, 'r') as f:
        try:
            # Use safe_load to prevent execution of arbitrary code.
            base_config: Dict[str, Any] = yaml.safe_load(f)
            logging.info("Base configuration loaded successfully.")
        except yaml.YAMLError as e:
            logging.error(f"Error parsing YAML configuration file: {e}")
            return

    # --- 2. Data Loading / Generation ---
    # In a real project, you would load your data from CSV, Parquet, or a database here.
    # For this example, we generate a structurally correct synthetic DataFrame.
    logging.info("Generating synthetic raw data for demonstration...")
    df_raw = create_sample_dataframe()
    logging.info(f"Generated sample DataFrame with {len(df_raw):,} rows.")
    
    # --- 3. Pipeline Execution ---
    # This is the main call to the top-level orchestrator.
    # It will run the entire end-to-end pipeline.
    
    # For a full run including sensitivity analysis (takes a very long time):
    # execute_full_study(df_raw=df_raw, base_config=base_config, run_ablation=True)
    
    # For a faster, baseline-only run:
    results = execute_full_study(
        df_raw=df_raw,
        base_config=base_config,
        run_ablation=False  # Set to False to skip the expensive ablation studies
    )
    
    # --- 4. Display Final Results ---
    # Print the key results returned by the orchestrator.
    if results.get('status') == 'SUCCESS':
        logging.info("\n\n--- FINAL BASELINE PERFORMANCE SUMMARY ---")
        # The baseline_performance DataFrame is the final output of the main run.
        print(results.get('baseline_performance'))
        logging.info(f"All baseline outputs saved in: {results.get('baseline_output_directory')}")
    else:
        logging.error(f"Pipeline execution failed. Error: {results.get('error')}")

if __name__ == '__main__':
    # This block ensures the main function is called only when the script is executed directly.
    main()
```
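
The script above expects a `config.yaml` that passes the Task 1 validators. The sketch below lists the sections and parameters those validators check, expressed as a Python dictionary. The key names and allowed ranges come from the validation code; every numeric value and the loss-weight names (`bce`, `mse`) are illustrative assumptions, not the settings used in the study, and later tasks may require further keys.

```python
from typing import Any, Dict

# Values are placeholders chosen only to satisfy the Task 1 range checks.
SAMPLE_CONFIG: Dict[str, Any] = {
    "descriptive_model": {
        "lppl_fitting": {
            "rolling_window_size": 250,  # int in [100, 500]
            "parameter_constraints": {
                "m": {"min": 0.1, "max": 0.9},  # 0 < min < max < 1
                "omega": {"min": 2.0},  # float > 0
            },
        },
        "bubble_score_synthesis": {
            "alpha_1_hype_weight": 0.5,  # float > 0
            "alpha_2_sentiment_weight": 0.5,  # float > 0
        },
        "episode_labeling": {
            "significance_threshold_tau": 0.5,  # float in (0, 1)
            "min_duration_d_min": 5,  # int >= 1
        },
    },
    "predictive_model": {
        "data_preparation": {"sequence_length": 60},  # int in [20, 252]
        "architecture": {},  # contents are not checked by Task 1
        "training": {
            "dropout_rate": 0.1,  # float in (0, 1)
            "loss_function_weights": {"bce": 1.0, "mse": 0.5},  # hypothetical names
        },
        "optimizer": {
            "learning_rate": 0.0001,  # float > 0
            "weight_decay": 0.01,  # float > 0
            "gradient_clipping_threshold": 1.0,  # float > 0
            "scheduler": "OneCycleLR",  # exact string required
        },
        "early_stopping": {},  # contents are not checked by Task 1
    },
    "nlp_settings": {
        "sentiment_model": {"huggingface_model_name": "ProsusAI/finbert"},
        "topic_model": {"embedding_model": "all-MiniLM-L6-v2"},
    },
    "backtesting": {
        "strategy_rules": {
            "entry_threshold_theta_1": 0.7,  # float in (0, 1)
            "exit_threshold_theta_2": 0.3,  # float in (0, 1), below theta_1
        },
        "risk_management": {
            "stop_loss_percentage": 0.05,  # float in (0, 1)
            "max_position_size_percentage": 0.1,  # float in (0, 1)
        },
        "market_assumptions": {
            "transaction_cost_per_trade": 0.001,  # float in [0, 0.01]
            "risk_free_rate_annual": 0.02,  # float in [0, 0.10]
        },
    },
}

# The float checks use isinstance(x, float), so an integer literal is rejected.
# In YAML, write `gradient_clipping_threshold: 1.0`, not `1`.
INT_PASSES_FLOAT_CHECK = isinstance(1, float)

# To write the dictionary out as YAML (requires PyYAML):
#     with open("config.yaml", "w") as f:
#         yaml.safe_dump(SAMPLE_CONFIG, f)
```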

In [None]:
# Task 1: Validate and parse the study configuration dictionary

# ==============================================================================
# Task 1: Validate and parse the study configuration dictionary
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 1, Step 1:  Load the `study_parameters` dictionary and verify structural
#                  completeness.
# ------------------------------------------------------------------------------

def _validate_config_structure(
    config: Dict[str, Any],
    schema: Dict[str, Any],
    path: str = ""
) -> None:
    """
    Recursively validates the nested structure of a configuration dictionary
    against a schema.

    This function ensures that all keys defined in the schema exist in the
    configuration at the correct level of nesting. It raises a detailed
    KeyError if a structural mismatch is found.

    Args:
        config (Dict[str, Any]): The configuration dictionary to validate.
        schema (Dict[str, Any]): A dictionary representing the expected
                                 structure. Nested dictionaries in the schema
                                 indicate expected sub-sections.
        path (str): The current nested path, used for generating precise
                    error messages. Should not be set by the user.

    Raises:
        KeyError: If a key from the schema is missing in the configuration.
        TypeError: If the configuration object is not a dictionary.
    """
    # Input validation: Ensure the config object is a dictionary.
    if not isinstance(config, Mapping):
        # Raise a TypeError if the object at the current path is not a dict-like
        # object where a nested structure is expected.
        raise TypeError(f"Configuration error at path '{path}': Expected a dictionary, but found type {type(config)}.")

    # Iterate through all keys defined in the schema for the current level.
    for key in schema:
        # Construct the full path for the current key for clear error reporting.
        current_path = f"{path}.{key}" if path else key

        # Check if the key exists in the configuration dictionary.
        if key not in config:
            # If a required key is missing, raise a KeyError with the exact path.
            raise KeyError(f"Missing required configuration key at path: '{current_path}'")

        # If the schema expects a nested dictionary for this key, recurse.
        if isinstance(schema[key], Mapping):
            # Recursively call the validation function for the nested sub-dictionary.
            _validate_config_structure(config[key], schema[key], path=current_path)


# ------------------------------------------------------------------------------
# Task 1, Step 2: Validate numerical parameter ranges and types.
# ------------------------------------------------------------------------------

def _validate_numerical_params(config: Dict[str, Any]) -> None:
    """
    Validates the types and ranges of critical numerical parameters within the
    configuration dictionary.

    This function performs specific checks on key parameters to ensure they are
    within sensible and theoretically sound bounds. It raises a ValueError or
    TypeError if any parameter fails validation.

    Args:
        config (Dict[str, Any]): The configuration dictionary to validate.

    Raises:
        ValueError: If a numerical parameter is outside its allowed range.
        TypeError: If a parameter has an incorrect data type.
    """
    # A list to aggregate all validation errors found.
    errors = []

    # Define a validation map: path -> (validation_function, error_message)
    # This declarative approach makes it easy to add or modify checks.
    validation_map = {
        ("descriptive_model", "lppl_fitting", "rolling_window_size"):
            (lambda x: isinstance(x, int) and 100 <= x <= 500, "must be an integer between 100 and 500."),
        ("descriptive_model", "lppl_fitting", "parameter_constraints", "m", "min"):
            (lambda x: isinstance(x, float) and 0 < x, "for 'm' min bound must be a float > 0."),
        ("descriptive_model", "lppl_fitting", "parameter_constraints", "m", "max"):
            (lambda x: isinstance(x, float) and x < 1, "for 'm' max bound must be a float < 1."),
        ("descriptive_model", "lppl_fitting", "parameter_constraints", "omega", "min"):
            (lambda x: isinstance(x, float) and x > 0, "for 'omega' min bound must be a float > 0."),
        ("descriptive_model", "bubble_score_synthesis", "alpha_1_hype_weight"):
            (lambda x: isinstance(x, float) and x > 0, "must be a positive float."),
        ("descriptive_model", "bubble_score_synthesis", "alpha_2_sentiment_weight"):
            (lambda x: isinstance(x, float) and x > 0, "must be a positive float."),
        ("descriptive_model", "episode_labeling", "significance_threshold_tau"):
            (lambda x: isinstance(x, float) and 0 < x < 1, "must be a float between 0 and 1."),
        ("descriptive_model", "episode_labeling", "min_duration_d_min"):
            (lambda x: isinstance(x, int) and x >= 1, "must be an integer >= 1."),
        ("predictive_model", "data_preparation", "sequence_length"):
            (lambda x: isinstance(x, int) and 20 <= x <= 252, "must be an integer between 20 and 252."),
        ("predictive_model", "training", "dropout_rate"):
            (lambda x: isinstance(x, float) and 0 < x < 1, "must be a float between 0 and 1."),
        ("predictive_model", "optimizer", "learning_rate"):
            (lambda x: isinstance(x, float) and x > 0, "must be a positive float."),
        ("predictive_model", "optimizer", "weight_decay"):
            (lambda x: isinstance(x, float) and x > 0, "must be a positive float."),
        ("predictive_model", "optimizer", "gradient_clipping_threshold"):
            (lambda x: isinstance(x, float) and x > 0, "must be a positive float."),
        ("backtesting", "strategy_rules", "entry_threshold_theta_1"):
            (lambda x: isinstance(x, float) and 0 < x < 1, "must be a float between 0 and 1."),
        ("backtesting", "strategy_rules", "exit_threshold_theta_2"):
            (lambda x: isinstance(x, float) and 0 < x < 1, "must be a float between 0 and 1."),
        ("backtesting", "risk_management", "stop_loss_percentage"):
            (lambda x: isinstance(x, float) and 0 < x < 1, "must be a float between 0 and 1."),
        ("backtesting", "risk_management", "max_position_size_percentage"):
            (lambda x: isinstance(x, float) and 0 < x < 1, "must be a float between 0 and 1."),
        ("backtesting", "market_assumptions", "transaction_cost_per_trade"):
            (lambda x: isinstance(x, float) and 0 <= x <= 0.01, "must be a float between 0 and 0.01."),
        ("backtesting", "market_assumptions", "risk_free_rate_annual"):
            (lambda x: isinstance(x, float) and 0 <= x <= 0.10, "must be a float between 0 and 0.10."),
    }

    # Iterate through the validation map to perform checks.
    for path_tuple, (validator, msg) in validation_map.items():
        try:
            # Traverse the dictionary to get the value.
            value = config
            for key in path_tuple:
                value = value[key]
            # Apply the validator function.
            if not validator(value):
                # If validation fails, append a detailed error message.
                errors.append(f"Parameter '{'.'.join(path_tuple)}' (value: {value}) is invalid: {msg}")
        except KeyError:
            # The structure validator only checks section names, so a missing
            # leaf parameter is reported here.
            errors.append(f"Parameter '{'.'.join(path_tuple)}' is missing.")
        except Exception as e:
            # Catch any other unexpected errors during validation.
            errors.append(f"Error validating parameter '{'.'.join(path_tuple)}': {e}")

    # Additional, more complex relational checks.
    try:
        # Check m_min < m_max for LPPL constraints.
        m_min = config["descriptive_model"]["lppl_fitting"]["parameter_constraints"]["m"]["min"]
        m_max = config["descriptive_model"]["lppl_fitting"]["parameter_constraints"]["m"]["max"]
        if not m_min < m_max:
            errors.append("LPPL constraint error: 'm' min bound must be less than max bound.")

        # Check theta_2 < theta_1 for backtesting thresholds.
        theta_1 = config["backtesting"]["strategy_rules"]["entry_threshold_theta_1"]
        theta_2 = config["backtesting"]["strategy_rules"]["exit_threshold_theta_2"]
        if not theta_2 < theta_1:
            errors.append("Backtesting threshold error: 'exit_threshold_theta_2' must be less than 'entry_threshold_theta_1'.")

        # Check all loss function weights are non-negative.
        loss_weights = config["predictive_model"]["training"]["loss_function_weights"]
        for name, weight in loss_weights.items():
            if not (isinstance(weight, (int, float)) and weight >= 0):
                errors.append(f"Loss weight '{name}' must be a non-negative number.")

    except KeyError as e:
        errors.append(f"Missing key for relational check: {e}")

    # If any errors were collected, raise a single, comprehensive ValueError.
    if errors:
        raise ValueError("Configuration validation failed with the following errors:\n" + "\n".join(errors))


# ------------------------------------------------------------------------------
# Task 1, Step 3: Validate string-based model identifiers and create a
#                 configuration snapshot.
# ------------------------------------------------------------------------------

def _make_json_serializable(obj: Any) -> Any:
    """
    Recursively converts non-JSON-serializable types (like numpy types) in a
    nested structure to their Python native equivalents.

    Args:
        obj (Any): The object (e.g., dict, list, value) to process.

    Returns:
        Any: A version of the object with all values converted to be
             JSON-serializable.
    """
    # If the object is a dictionary, recurse on its values.
    if isinstance(obj, dict):
        return {k: _make_json_serializable(v) for k, v in obj.items()}
    # If the object is a list or tuple, recurse on its elements.
    elif isinstance(obj, (list, tuple)):
        return [_make_json_serializable(i) for i in obj]
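    # Convert infinite floats (Python or numpy) to a string representation.
    # The isinstance guard avoids ambiguous element-wise comparisons if an
    # array-like object is passed in.
    elif isinstance(obj, (float, np.floating)) and np.isinf(obj):
        return "Infinity" if obj > 0 else "-Infinity"
    # Convert other numpy scalar types to Python native types.
    elif isinstance(obj, np.generic):
        return obj.item()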
    # Return the object as is if it's already serializable.
    return obj


def _validate_identifiers_and_snapshot(
    config: Dict[str, Any],
    log_dir: Union[str, Path] = "logs"
) -> Path:
    """
    Validates string-based identifiers and saves a timestamped JSON snapshot
    of the configuration.

    Args:
        config (Dict[str, Any]): The validated configuration dictionary.
        log_dir (Union[str, Path]): The directory to save the snapshot in.
                                    Defaults to "logs".

    Returns:
        Path: The path to the saved configuration snapshot file.

    Raises:
        ValueError: If a string identifier does not match its expected value.
    """
    # A list to aggregate all validation errors found.
    errors = []

    # Define a map for exact string value checks.
    identifier_map = {
        ("nlp_settings", "sentiment_model", "huggingface_model_name"): "ProsusAI/finbert",
        ("nlp_settings", "topic_model", "embedding_model"): "all-MiniLM-L6-v2",
        ("predictive_model", "optimizer", "scheduler"): "OneCycleLR",
    }

    # Perform the string identifier checks.
    for path_tuple, expected_value in identifier_map.items():
        try:
            # Traverse the dictionary to get the value.
            value = config
            for key in path_tuple:
                value = value[key]
            # Compare the actual value with the expected value.
            if value != expected_value:
                errors.append(f"Identifier '{'.'.join(path_tuple)}' is incorrect. Expected '{expected_value}', found '{value}'.")
        except KeyError:
            errors.append(f"Identifier '{'.'.join(path_tuple)}' is missing.")

    # If any errors were found, raise a comprehensive ValueError.
    if errors:
        raise ValueError("Identifier validation failed:\n" + "\n".join(errors))

    # --- Create Configuration Snapshot ---
    # Ensure the log directory exists.
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)

    # Generate a timestamp for the filename.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    snapshot_filename = f"config_snapshot_{timestamp}.json"
    snapshot_filepath = log_path / snapshot_filename

    # Prepare the configuration for JSON serialization.
    serializable_config = _make_json_serializable(config)

    # Write the snapshot to a JSON file with indentation for readability.
    try:
        with open(snapshot_filepath, 'w') as f:
            json.dump(serializable_config, f, indent=4)
    except (IOError, TypeError) as e:
        # Handle potential file writing or serialization errors.
        logging.error(f"Failed to write configuration snapshot to {snapshot_filepath}: {e}")
        raise

    # Return the path to the created snapshot file.
    return snapshot_filepath


# ------------------------------------------------------------------------------
# Task 1, Orchestrator Function
# ------------------------------------------------------------------------------

def validate_and_parse_config(
    study_parameters: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Orchestrates the complete validation of the study configuration dictionary.

    This function serves as the entry point for configuration validation. It
    sequentially performs three critical validation steps:
    1.  Structural Validation: Ensures all required keys and nested sections
        are present.
    2.  Numerical Validation: Checks that key numerical parameters are within
        their valid and sensible ranges.
    3.  Identifier Validation & Snapshotting: Verifies specific string
        identifiers (e.g., model names) and creates a timestamped JSON
        snapshot of the configuration for reproducibility.

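    The function fails fast across steps: the first step that fails raises
    an informative exception and later steps do not run. Within the numerical
    and identifier steps, all violations are collected and reported together
    in a single ValueError.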

    Args:
        study_parameters (Dict[str, Any]): The main configuration dictionary
                                           for the entire study.

    Returns:
        Dict[str, Any]: The original, validated study_parameters dictionary,
                        returned if all checks pass.

    Raises:
        KeyError: If the configuration is structurally incomplete.
        ValueError: If any numerical or string parameter is invalid.
        TypeError: If a part of the configuration has an incorrect type.
    """
    # Log the start of the validation process.
    logging.info("Initiating validation of the study configuration dictionary...")

    # Define the expected schema for structural validation.
    # An empty dict `{}` signifies a section with parameters to be validated later.
    schema = {
        "descriptive_model": {
            "lppl_fitting": {},
            "bubble_score_synthesis": {},
            "episode_labeling": {}
        },
        "predictive_model": {
            "data_preparation": {},
            "architecture": {},
            "training": {"loss_function_weights": {}},
            "optimizer": {},
            "early_stopping": {}
        },
        "nlp_settings": {
            "sentiment_model": {},
            "topic_model": {}
        },
        "backtesting": {
            "strategy_rules": {},
            "risk_management": {},
            "market_assumptions": {}
        }
    }

    # --- Step 1: Validate the overall structure of the dictionary. ---
    # This ensures all required sections and subsections are present before
    # checking their contents.
    _validate_config_structure(study_parameters, schema)
    logging.info("Step 1/3: Configuration structure is complete and valid.")

    # --- Step 2: Validate numerical parameter types and ranges. ---
    # This checks the actual values of key parameters for correctness.
    _validate_numerical_params(study_parameters)
    logging.info("Step 2/3: Numerical parameters are within specified ranges.")

    # --- Step 3: Validate string identifiers and create a snapshot. ---
    # This verifies model names and creates a reproducible record of the config.
    snapshot_path = _validate_identifiers_and_snapshot(study_parameters)
    logging.info(f"Step 3/3: String identifiers are correct. Configuration snapshot saved to '{snapshot_path}'.")

    # Log the successful completion of all validation steps.
    logging.info("Configuration validation successful.")

    # Return the validated configuration dictionary for use in the pipeline.
    return study_parameters


In [None]:
# Task 2: Validate the input DataFrame structure and schema

# ==============================================================================
# Task 2: Validate the input DataFrame structure and schema
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 2, Step 1:  Validate the MultiIndex structure.
# ------------------------------------------------------------------------------

def _validate_dataframe_index(df: pd.DataFrame) -> pd.DataFrame:
    """
    Validates the MultiIndex structure of the input DataFrame.

    This function ensures the DataFrame's index is a two-level MultiIndex
    named ('Date', 'TICKER') with correct dtypes (datetime64[ns], object)
    and is monotonically increasing. If the index is not sorted, it sorts
    the DataFrame and returns a sorted copy.

    Args:
        df (pd.DataFrame): The input DataFrame to validate.

    Returns:
        pd.DataFrame: A validated and sorted copy of the input DataFrame.

    Raises:
        TypeError: If the index is not a pandas MultiIndex.
        ValueError: If the index does not meet the structural, naming, or
                    dtype requirements.
    """
    # Create a copy to avoid modifying the original DataFrame in place.
    df_validated = df.copy()

    # --- MultiIndex Type Check ---
    # Verify that the index is indeed a MultiIndex.
    if not isinstance(df_validated.index, pd.MultiIndex):
        raise TypeError("Input DataFrame index is not a pandas MultiIndex.")

    # --- MultiIndex Level Count Check ---
    # Verify that the MultiIndex has exactly two levels.
    if df_validated.index.nlevels != 2:
        raise ValueError(f"Index must have 2 levels, but found {df_validated.index.nlevels}.")

    # --- MultiIndex Level Names Check ---
    # Verify the names of the index levels are 'Date' and 'TICKER' in order.
    expected_names = ['Date', 'TICKER']
    if list(df_validated.index.names) != expected_names:
        raise ValueError(f"Index names must be {expected_names}, but found {list(df_validated.index.names)}.")

    # --- MultiIndex Level Dtypes Check ---
    # Verify the dtype of the first level ('Date') is datetime.
    if not pd.api.types.is_datetime64_any_dtype(df_validated.index.levels[0]):
        raise ValueError(f"Index level 'Date' must be datetime64, but found {df_validated.index.levels[0].dtype}.")

    # Verify the dtype of the second level ('TICKER') is string/object.
    if not pd.api.types.is_string_dtype(df_validated.index.levels[1]):
        raise ValueError(f"Index level 'TICKER' must be object/string, but found {df_validated.index.levels[1].dtype}.")

    # --- Index Sorting Check ---
    # Verify that the index is sorted for efficient time-series operations.
    if not df_validated.index.is_monotonic_increasing:
        # If not sorted, log a warning and sort it. This can be a costly operation.
        logging.warning("DataFrame index is not sorted. Sorting index... For better performance, provide pre-sorted data.")
        # Sort the index in place on the copied DataFrame.
        df_validated.sort_index(inplace=True)

    # Return the validated and sorted DataFrame copy.
    return df_validated


# ------------------------------------------------------------------------------
# Task 2, Step 2: Validate required columns and their data types.
# ------------------------------------------------------------------------------

def _validate_dataframe_columns(df: pd.DataFrame) -> None:
    """
    Validates the presence and data types of required columns in the DataFrame.

    This function checks against a predefined schema for column names and
    their expected dtypes. It performs a special validation on the
    'News_Articles' column by sampling to ensure its contents are lists of
    strings.

    Args:
        df (pd.DataFrame): The input DataFrame with a validated index.

    Raises:
        ValueError: If required columns are missing.
        TypeError: If a column has an incorrect dtype, or if the
                   'News_Articles' column contains invalid data types.
    """
    # Define the required schema: column name -> expected dtype.
    REQUIRED_SCHEMA = {
        "PERMNO": "int64",
        "SIC_Code": "int64",
        "Close_Price_Raw": "float64",
        "Volume_Raw": "int64", # Target dtype, may be float if NaNs exist
        "CFACSHR": "float64",
        "PE_Ratio": "float64",
        "PB_Ratio": "float64",
        "VIX_Close": "float64",
        "News_Articles": "object",
    }

    # --- Column Presence Check ---
    # Get the set of actual columns and required columns.
    actual_columns = set(df.columns)
    required_columns = set(REQUIRED_SCHEMA.keys())

    # Check if all required columns are present in the DataFrame.
    if not required_columns.issubset(actual_columns):
        # Identify and report missing columns.
        missing_cols = required_columns - actual_columns
        raise ValueError(f"DataFrame is missing required columns: {sorted(list(missing_cols))}")

    # --- Column Dtype Check ---
    # Iterate through the schema to validate each column's dtype.
    for col, expected_dtype in REQUIRED_SCHEMA.items():
        actual_dtype = str(df[col].dtype)

        # Special handling for Volume_Raw, which might be float due to NaNs.
        if col == "Volume_Raw" and actual_dtype == 'float64':
            # If it's float but contains no NaNs, it's a problem. It should be int.
            if df[col].isnull().sum() == 0:
                raise TypeError(f"Column '{col}' is float64 but contains no NaNs. It should be int64.")
            # If it contains NaNs, we accept float64 for now; Task 3 will handle cleansing.
            continue

        # For all other columns, perform a strict dtype check.
        if actual_dtype != expected_dtype:
            raise TypeError(f"Column '{col}' has incorrect dtype. Expected '{expected_dtype}', found '{actual_dtype}'.")

    # --- Special Validation for 'News_Articles' Column ---
    # Sample up to 100 non-null rows for efficient validation.
    news_col_non_null = df['News_Articles'].dropna()
    sample_size = min(100, len(news_col_non_null))
    if sample_size > 0:
        # Use a fixed random_state for reproducible sampling.
        sample = news_col_non_null.sample(n=sample_size, random_state=42)
        # Iterate through the sampled cells to check their content.
        for item in sample:
            # Each cell must be a list.
            if not isinstance(item, list):
                raise TypeError(f"Column 'News_Articles' contains a non-list element of type {type(item)}.")
            # All elements within a non-empty list must be strings.
            if item and not all(isinstance(article, str) for article in item):
                raise TypeError("A list in 'News_Articles' contains non-string elements.")


# ------------------------------------------------------------------------------
# Task 2, Step 3: Validate temporal and cross-sectional coverage.
# ------------------------------------------------------------------------------

def _validate_dataframe_coverage(df: pd.DataFrame) -> None:
    """
    Validates the temporal and cross-sectional coverage of the DataFrame.

    This function checks if the data spans the required study period and
    contains a minimum number of trading days and unique tickers for a
    meaningful analysis.

    Args:
        df (pd.DataFrame): The input DataFrame with validated index and columns.

    Raises:
        ValueError: If the data coverage does not meet the minimum requirements.
    """
    # --- Temporal Coverage Validation ---
    # Extract unique dates from the index.
    unique_dates = df.index.get_level_values('Date').unique()

    # Define the required start and end dates for the study.
    required_start_date = pd.Timestamp('2018-01-01')
    required_end_date = pd.Timestamp('2024-12-31')

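    # Check if the data's date range covers the required study period.
    # The boundary dates can fall on non-trading days (2018-01-01 is a market
    # holiday), so a strict comparison would reject complete trading-day data.
    # Allow a few calendar days of slack at each end (7 days is an assumed value).
    boundary_tolerance = pd.Timedelta(days=7)
    if (unique_dates.min() > required_start_date + boundary_tolerance
            or unique_dates.max() < required_end_date - boundary_tolerance):
        raise ValueError(f"Data does not cover the required study period from {required_start_date.date()} to {required_end_date.date()}. "
                         f"Actual range: {unique_dates.min().date()} to {unique_dates.max().date()}.")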

    # Check for the minimum number of trading days.
    n_days = len(unique_dates)
    min_days = 1700
    if n_days < min_days:
        raise ValueError(f"Insufficient trading days. Found {n_days}, but require at least {min_days}.")

    # --- Cross-Sectional Coverage Validation ---
    # Extract unique tickers from the index.
    unique_tickers = df.index.get_level_values('TICKER').unique()

    # Check for the minimum number of unique tickers.
    n_tickers = len(unique_tickers)
    min_tickers = 10
    if n_tickers < min_tickers:
        raise ValueError(f"Insufficient unique tickers. Found {n_tickers}, but require at least {min_tickers}.")

    # Log summary statistics upon successful validation.
    logging.info(f"DataFrame coverage validated: {len(df)} rows, {n_days} trading days, {n_tickers} unique tickers.")


# ------------------------------------------------------------------------------
# Task 2, Orchestrator Function
# ------------------------------------------------------------------------------

def validate_input_dataframe(df_raw: pd.DataFrame) -> pd.DataFrame:
    """
    Orchestrates the complete validation of the input DataFrame's schema.

    This function serves as the entry point for data validation. It executes
    a sequence of checks to ensure the input DataFrame is correctly structured
    and has adequate data coverage for the study. The steps are:
    1.  Index Validation: Verifies the MultiIndex structure, names, dtypes,
        and sortedness. Returns a sorted copy if necessary.
    2.  Column Validation: Checks for the presence and correct dtypes of all
        required columns.
    3.  Coverage Validation: Ensures the data spans the required time period
        and includes a minimum number of assets and trading days.

    Args:
        df_raw (pd.DataFrame): The raw input DataFrame for the study.

    Returns:
        pd.DataFrame: A validated, sorted copy of the input DataFrame, ready
                      for the next processing step.

    Raises:
        TypeError: If the input is not a pandas DataFrame or if dtypes are
                   incorrect.
        ValueError: If the structure, columns, or data coverage are invalid.
    """
    # --- Input Type Check ---
    # Ensure the input is a pandas DataFrame.
    if not isinstance(df_raw, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")

    # Log the start of the validation process.
    logging.info("Initiating validation of the input DataFrame...")

    # --- Step 1: Validate the DataFrame's MultiIndex. ---
    # This step returns a sorted copy of the DataFrame.
    df_validated = _validate_dataframe_index(df_raw)
    logging.info("Step 1/3: DataFrame index structure is valid and sorted.")

    # --- Step 2: Validate the required columns and their dtypes. ---
    # This step operates on the validated (and possibly sorted) DataFrame.
    _validate_dataframe_columns(df_validated)
    logging.info("Step 2/3: DataFrame columns and dtypes are valid.")

    # --- Step 3: Validate the temporal and cross-sectional coverage. ---
    _validate_dataframe_coverage(df_validated)
    logging.info("Step 3/3: DataFrame temporal and cross-sectional coverage is sufficient.")

    # Log the successful completion of all validation steps.
    logging.info("DataFrame validation successful.")

    # Return the fully validated and sorted DataFrame.
    return df_validated
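
The `Volume_Raw` special case in the column check above exists because a NumPy-backed integer column cannot hold `NaN`. The following is a minimal sketch on toy data (the values are made up for illustration and are not drawn from the study's dataset) showing the upcast to `float64` and how the target dtype can be restored once gaps are filled:

```python
import numpy as np
import pandas as pd

# Toy volume column with no gaps: pandas keeps the integer dtype.
complete = pd.Series([100, 250, 0], name="Volume_Raw")

# Same column with one gap: NaN is a float, so the whole column is upcast.
with_gap = pd.Series([100, np.nan, 0], name="Volume_Raw")

print(complete.dtype)
print(with_gap.dtype)

# Once the gap is filled, an explicit cast restores the target dtype.
restored = with_gap.ffill().astype("int64")
print(restored.tolist())
```

This is why the validator tolerates `float64` only when `NaN` values are actually present: a gap-free float column suggests an upstream loading problem rather than missing data.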


# Task 3: Cleanse the raw data for missing values, outliers, and inconsistencies

# ==============================================================================
# Task 3: Cleanse the raw data for missing values, outliers, and
#         inconsistencies
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 3, Step 1: Handle missing values in price and volume fields.
# ------------------------------------------------------------------------------

def _cleanse_price_volume_fields(
    df: pd.DataFrame,
    volume_ffill_threshold: float = 0.01
) -> pd.DataFrame:
    """
    Cleanses missing and invalid data in price and volume columns.

    This function performs the following actions:
    1. Removes rows with missing or non-positive 'Close_Price_Raw'.
    2. Forward-fills missing 'Volume_Raw' within each ticker group if the
       proportion of missing data is below a specified threshold.
    3. Asserts that the 'CFACSHR' column has no missing values.

    Args:
        df (pd.DataFrame): The input DataFrame.
        volume_ffill_threshold (float): The maximum proportion of missing
                                        'Volume_Raw' data to allow for
                                        forward-filling. Defaults to 0.01 (1%).

    Returns:
        pd.DataFrame: A cleansed copy of the input DataFrame.

    Raises:
        ValueError: If 'CFACSHR' contains missing values.
    """
    # Work on a copy to avoid side effects.
    df_cleansed = df.copy()
    initial_rows = len(df_cleansed)

    # --- Cleanse 'Close_Price_Raw' ---
    # Identify rows with missing or non-positive closing prices.
    invalid_price_mask = df_cleansed['Close_Price_Raw'].isna() | (df_cleansed['Close_Price_Raw'] <= 0)
    num_invalid_prices = invalid_price_mask.sum()

    # If invalid prices are found, log and remove them.
    if num_invalid_prices > 0:
        logging.info(f"Found {num_invalid_prices} rows ({num_invalid_prices / initial_rows:.2%}) with missing or non-positive 'Close_Price_Raw'. Removing them.")
        # Keep only the rows that do not match the invalid mask.
        df_cleansed = df_cleansed[~invalid_price_mask]

    # --- Cleanse 'Volume_Raw' ---
    # Calculate the number and proportion of missing volume data.
    missing_volume_count = df_cleansed['Volume_Raw'].isna().sum()
    if missing_volume_count > 0:
        missing_volume_pct = missing_volume_count / len(df_cleansed)
        logging.info(f"Found {missing_volume_count} missing 'Volume_Raw' values ({missing_volume_pct:.2%}).")

        # Conditionally forward-fill based on the threshold.
        if missing_volume_pct < volume_ffill_threshold:
            logging.info(f"Missing volume percentage is below threshold of {volume_ffill_threshold:.2%}. Forward-filling within each ticker group.")
            # Group by ticker to prevent filling across different securities.
            df_cleansed['Volume_Raw'] = df_cleansed.groupby(level='TICKER')['Volume_Raw'].ffill()
            # Log remaining NaNs, which could exist at the start of a series.
            remaining_nans = df_cleansed['Volume_Raw'].isna().sum()
            if remaining_nans > 0:
                logging.warning(f"{remaining_nans} 'Volume_Raw' NaNs remain at the beginning of time series after forward-filling.")
        else:
            # If above threshold, retain NaNs and warn the user.
            logging.warning("Missing volume percentage exceeds threshold. Retaining NaNs in 'Volume_Raw' for later treatment.")

    # --- Validate 'CFACSHR' ---
    # This is a critical field for price adjustments; it cannot have missing data.
    if df_cleansed['CFACSHR'].isna().any():
        raise ValueError("'CFACSHR' column contains missing values, which is not permissible for price/volume adjustments.")

    # Log the final count of rows removed.
    final_rows = len(df_cleansed)
    logging.info(f"Price/volume cleansing complete. Total rows removed: {initial_rows - final_rows}.")

    return df_cleansed


# ------------------------------------------------------------------------------
# Task 3, Step 2: Handle missing values in fundamental and macro fields.
# ------------------------------------------------------------------------------

def _cleanse_fundamental_macro_fields(df: pd.DataFrame) -> pd.DataFrame:
    """
    Cleanses or validates missing data in fundamental and macro columns.

    This function applies distinct rules:
    1. 'PE_Ratio', 'PB_Ratio': Missing values are permitted and counted.
    2. 'VIX_Close': Missing values are forward-filled; a missing value at the
       start of the dataset cannot be filled and raises an error.

    Args:
        df (pd.DataFrame): The input DataFrame.

    Returns:
        pd.DataFrame: A cleansed copy of the input DataFrame.

    Raises:
        ValueError: If 'VIX_Close' has a missing value at the very start of
                    the series that cannot be forward-filled.
    """
    # Work on a copy.
    df_cleansed = df.copy()

    # --- Handle Fundamental Ratios ('PE_Ratio', 'PB_Ratio') ---
    # These are allowed to be NaN. We just log their presence.
    for col in ['PE_Ratio', 'PB_Ratio']:
        missing_count = df_cleansed[col].isna().sum()
        if missing_count > 0:
            missing_pct = missing_count / len(df_cleansed)
            logging.info(f"Column '{col}' contains {missing_count} NaNs ({missing_pct:.2%}). These are permitted and will be retained.")

    # --- Handle Macro Indicator ('VIX_Close') ---
    # VIX should be complete. We forward-fill any gaps.
    if df_cleansed['VIX_Close'].isna().any():
        # Check for the critical edge case: NaN at the beginning of the series.
        if pd.isna(df_cleansed['VIX_Close'].iloc[0]):
            raise ValueError("Missing 'VIX_Close' value at the start of the dataset. Cannot forward-fill.")

        # Log and perform the forward-fill.
        logging.info("Found missing values in 'VIX_Close'. Forward-filling...")
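        # Assign the result back instead of calling ffill(inplace=True) on the
        # column selection, which does not update the DataFrame under Copy-on-Write.
        df_cleansed['VIX_Close'] = df_cleansed['VIX_Close'].ffill()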

        # Final assertion to guarantee completeness.
        if df_cleansed['VIX_Close'].isna().any():
             # This should not be reached if the initial check passes, but is a safeguard.
             raise RuntimeError("Forward-filling 'VIX_Close' failed unexpectedly.")

    return df_cleansed


# ------------------------------------------------------------------------------
# Task 3, Step 3: Detect and handle outliers and data quality issues.
# ------------------------------------------------------------------------------

def _filter_and_screen_data(
    df: pd.DataFrame,
    target_sic_codes: Set[int],
    return_outlier_threshold: float = 0.5
) -> pd.DataFrame:
    """
    Filters the DataFrame to the target universe and screens for outliers.

    This function performs two main actions:
    1. Calculates 1-day log returns on raw prices to detect extreme single-day
       movements, logging them as potential data errors.
    2. Filters the DataFrame to include only stocks belonging to a specified
       set of SIC codes (the real estate sector).

    Args:
        df (pd.DataFrame): The input DataFrame.
        target_sic_codes (Set[int]): A set of SIC codes for the target universe.
        return_outlier_threshold (float): The absolute log return value to
                                          flag as an outlier. Defaults to 0.5.

    Returns:
        pd.DataFrame: A filtered copy of the DataFrame containing only the
                      target universe.

    Raises:
        ValueError: If filtering by SIC code results in an empty DataFrame.
    """
    # Work on a copy.
    df_screened = df.copy()

    # --- Outlier Detection based on Log Returns ---
    # Calculate log returns on raw prices within each ticker group.
    # Equation: r_t = ln(P_t / P_{t-1})
    log_returns = np.log(
        df_screened.groupby(level='TICKER')['Close_Price_Raw'].pct_change() + 1
    )

    # Identify outliers where the absolute log return exceeds the threshold.
    outlier_mask = log_returns.abs() > return_outlier_threshold
    num_outliers = outlier_mask.sum()

    # If outliers are found, log them for manual review.
    if num_outliers > 0:
        logging.warning(f"Found {num_outliers} potential outliers with absolute 1-day log return > {return_outlier_threshold:.2f}.")
        # Log the details of the first few outliers for quick inspection.
        outlier_details = df_screened[outlier_mask].copy()
        outlier_details['Log_Return'] = log_returns[outlier_mask]
        logging.warning("Outlier examples:\n" + outlier_details[['Log_Return']].head().to_string())
        # Note: No automatic removal/winsorization is performed to preserve data integrity.

    # --- Filter by Target SIC Codes ---
    initial_rows = len(df_screened)
    logging.info(f"Filtering DataFrame to target SIC codes: {target_sic_codes}.")
    # Create a boolean mask for rows with a target SIC code.
    sic_filter_mask = df_screened['SIC_Code'].isin(target_sic_codes)
    # Apply the filter.
    df_screened = df_screened[sic_filter_mask]
    final_rows = len(df_screened)

    # Check if the filtering resulted in an empty DataFrame.
    if df_screened.empty:
        raise ValueError("Filtering by SIC codes resulted in an empty DataFrame. Check input data and target SIC codes.")

    # Log the result of the filtering operation.
    logging.info(f"Retained {final_rows} rows ({final_rows / initial_rows:.2%}) after SIC code filtering.")

    return df_screened


# ------------------------------------------------------------------------------
# Task 3, Orchestrator Function
# ------------------------------------------------------------------------------

def cleanse_raw_data(
    df_validated: pd.DataFrame,
    target_sic_codes: Set[int] = {6500, 6512, 6513, 6519, 6531, 6541, 6552, 6798}
) -> pd.DataFrame:
    """
    Orchestrates the complete data cleansing pipeline.

    This function applies a series of cleansing and filtering steps to the
    validated input DataFrame to prepare it for feature engineering. The
    pipeline includes:
    1.  Handling missing values in core price and volume fields.
    2.  Handling missing values in fundamental and macroeconomic indicators.
    3.  Screening for extreme return outliers and filtering the dataset to the
        specified target universe based on SIC codes.

    Args:
        df_validated (pd.DataFrame): The DataFrame that has passed the schema
                                     validation checks from Task 2.
        target_sic_codes (Set[int]): A set of SIC codes defining the study's
                                     investment universe. Defaults to the real
                                     estate sector codes from the paper.

    Returns:
        pd.DataFrame: A cleansed and filtered DataFrame ready for the next
                      stage of processing (corporate action adjustments).
    """
    # Log the start of the cleansing process.
    logging.info("Initiating data cleansing and filtering pipeline...")

    # --- Step 1: Cleanse price and volume fields. ---
    df_step1 = _cleanse_price_volume_fields(df_validated)
    logging.info("Step 1/3: Cleansing of price and volume fields complete.")

    # --- Step 2: Cleanse fundamental and macro fields. ---
    df_step2 = _cleanse_fundamental_macro_fields(df_step1)
    logging.info("Step 2/3: Cleansing of fundamental and macro fields complete.")

    # --- Step 3: Filter to target universe and screen for outliers. ---
    df_final = _filter_and_screen_data(df_step2, target_sic_codes)
    logging.info("Step 3/3: Outlier screening and SIC code filtering complete.")

    # Log the successful completion of the cleansing pipeline.
    logging.info("Data cleansing pipeline finished successfully.")

    # Return the fully cleansed and filtered DataFrame.
    return df_final
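
The volume cleansing step forward-fills within each ticker group rather than over the whole column. A small sketch of the difference, assuming a toy two-ticker frame (the tickers, dates, values, and the `TICKER`-then-`Date` index ordering are all hypothetical and chosen only to make the effect visible):

```python
import numpy as np
import pandas as pd

# Toy panel: two tickers, three dates each, sorted by ticker then date.
idx = pd.MultiIndex.from_product(
    [["AAA", "BBB"], pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])],
    names=["TICKER", "Date"],
)
toy = pd.DataFrame(
    {"Volume_Raw": [500.0, np.nan, 700.0, np.nan, 40.0, np.nan]},
    index=idx,
)

# A plain forward-fill carries AAA's last volume into BBB's first row.
global_fill = toy["Volume_Raw"].ffill()

# A group-wise forward-fill never crosses the ticker boundary, so BBB's
# leading gap stays NaN and is left for later treatment.
grouped_fill = toy.groupby(level="TICKER")["Volume_Raw"].ffill()

print(pd.DataFrame({"global": global_fill, "grouped": grouped_fill}))
```

The leading `NaN` that survives the group-wise fill corresponds to the "NaNs remain at the beginning of time series" warning emitted by `_cleanse_price_volume_fields`.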


# Task 4: Adjust prices and volumes for corporate actions using CFACSHR

# ------------------------------------------------------------------------------
# Task 4, Step 1: Compute split- and dividend-adjusted prices.
# ------------------------------------------------------------------------------

def _compute_adjusted_prices(df: pd.DataFrame) -> pd.DataFrame:
    """
    Computes split- and dividend-adjusted closing prices.

    This function applies the Cumulative Factor to Adjust Shares (CFACSHR) from
    CRSP to the raw closing prices to create a continuous, comparable time
    series. It performs rigorous pre- and post-computation validation.

    Equation: Close_Price_Adj = Close_Price_Raw * CFACSHR

    Args:
        df (pd.DataFrame): The cleansed DataFrame containing 'Close_Price_Raw'
                           and 'CFACSHR' columns.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with the new
                      'Close_Price_Adj' column.

    Raises:
        ValueError: If 'CFACSHR' contains non-positive values.
    """
    # Work on a copy to avoid modifying the original DataFrame.
    df_adj = df.copy()

    # --- Pre-computation Validation ---
    # The adjustment factor must be strictly positive.
    if not (df_adj['CFACSHR'] > 0).all():
        raise ValueError("'CFACSHR' column contains non-positive values, which is invalid for price adjustment.")

    # --- Price Adjustment Calculation ---
    # Apply the vectorized multiplication to compute the adjusted price.
    df_adj['Close_Price_Adj'] = df_adj['Close_Price_Raw'] * df_adj['CFACSHR']

    # --- Post-computation Validation ---
    # Adjusted prices, like raw prices, must be positive.
    if not (df_adj['Close_Price_Adj'] > 0).all():
        # This would indicate an issue with raw prices that was missed in cleansing.
        logging.warning("Post-adjustment check found non-positive adjusted prices. Review raw price and CFACSHR data.")

    return df_adj


# ------------------------------------------------------------------------------
# Task 4, Step 2: Compute adjusted volumes.
# ------------------------------------------------------------------------------

def _compute_adjusted_volumes(df: pd.DataFrame) -> pd.DataFrame:
    """
    Computes split- and dividend-adjusted trading volumes.

    This function adjusts the raw trading volume by dividing by the CFACSHR.
    This is the inverse operation to the price adjustment, ensuring that the
    total value traded (price * volume) remains consistent before and after
    adjustment.

    Equation: Volume_Adj = Volume_Raw / CFACSHR

    Args:
        df (pd.DataFrame): The DataFrame from the price adjustment step.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with the new
                      'Volume_Adj' column.
    """
    # Work on a copy.
    df_adj = df.copy()

    # --- Volume Adjustment Calculation ---
    # Apply the vectorized division to compute the adjusted volume.
    # Since we already validated CFACSHR > 0, division by zero is not a risk.
    df_adj['Volume_Adj'] = df_adj['Volume_Raw'] / df_adj['CFACSHR']

    # --- Post-computation Validation ---
    # Adjusted volume must be non-negative. Test for negatives directly so that
    # NaNs retained from 'Volume_Raw' do not trigger a false warning.
    if (df_adj['Volume_Adj'] < 0).any():
        logging.warning("Post-adjustment check found negative adjusted volumes. Review raw volume and CFACSHR data.")

    return df_adj


# ------------------------------------------------------------------------------
# Task 4, Step 3: Verify adjustment consistency and persistence.
# ------------------------------------------------------------------------------

def _verify_adjustments_and_finalize(
    df: pd.DataFrame,
    num_tickers_to_spot_check: int = 5,
    random_state: int = 42
) -> pd.DataFrame:
    """
    Performs a programmatic spot-check of the adjustments and removes the
    CFACSHR column.

    The verification logic identifies corporate action dates (where CFACSHR
    changes) for a sample of tickers. It checks if the adjusted return on
    these dates is significantly smaller in magnitude than the raw return,
    which indicates a successful adjustment. Finally, it drops the CFACSHR
    column to prevent data leakage.

    Args:
        df (pd.DataFrame): The DataFrame with both raw and adjusted columns.
        num_tickers_to_spot_check (int): The number of random tickers to verify.
        random_state (int): Seed for the random sampler for reproducibility.

    Returns:
        pd.DataFrame: The final adjusted DataFrame with the 'CFACSHR' column
                      removed.
    """
    # Work on a copy.
    df_final = df.copy()

    # --- Programmatic Spot-Check ---
    # Get a list of all unique tickers in the DataFrame.
    all_tickers = df_final.index.get_level_values('TICKER').unique()

    # Ensure the number of tickers to check is not more than available tickers.
    num_to_check = min(num_tickers_to_spot_check, len(all_tickers))

    # Select a random sample of tickers for verification.
    if num_to_check > 0:
        spot_check_tickers = np.random.RandomState(random_state).choice(
            all_tickers, size=num_to_check, replace=False
        )
        logging.info(f"Performing programmatic spot-check on {num_to_check} tickers: {list(spot_check_tickers)}")

        # Group by ticker to perform time-series operations.
        grouped = df_final.groupby(level='TICKER')

        for ticker in spot_check_tickers:
            # Get the data for the specific ticker.
            ticker_df = grouped.get_group(ticker)

            # Identify corporate action dates by finding where CFACSHR changes.
            cfacshr_change = ticker_df['CFACSHR'].diff().abs() > 1e-8 # Use tolerance for float comparison
            event_dates = cfacshr_change[cfacshr_change].index

            if not event_dates.empty:
                # Calculate raw and adjusted returns for the entire series.
                raw_returns = ticker_df['Close_Price_Raw'].pct_change()
                adj_returns = ticker_df['Close_Price_Adj'].pct_change()

                # Check the returns on the identified event dates.
                raw_event_returns = raw_returns.loc[event_dates].abs()
                adj_event_returns = adj_returns.loc[event_dates].abs()

                # A successful adjustment should make the adjusted return much smaller than the raw return.
                # Heuristic: Adjusted return magnitude should be less than 1/5th of raw return magnitude.
                if (adj_event_returns > 0.2 * raw_event_returns).any():
                    logging.warning(f"Ticker {ticker}: Adjustment consistency check failed. "
                                    f"Adjusted return is not significantly smaller than raw return on a corporate action date. "
                                    f"Manual review recommended.")

    # --- Finalize by Dropping the CFACSHR Column ---
    # This is a critical step to prevent data leakage in downstream models.
    df_final.drop(columns=['CFACSHR'], inplace=True)
    logging.info("'CFACSHR' column dropped to prevent data leakage.")

    return df_final


# ------------------------------------------------------------------------------
# Task 4, Orchestrator Function
# ------------------------------------------------------------------------------

def adjust_for_corporate_actions(df_cleansed: pd.DataFrame) -> pd.DataFrame:
    """
    Orchestrates the full pipeline for adjusting price and volume data.

    This function takes a cleansed DataFrame and applies corporate action
    adjustments using the CRSP CFACSHR factor. The process includes:
    1.  Computing adjusted prices.
    2.  Computing adjusted volumes.
    3.  Performing a programmatic spot-check to verify the consistency of the
        adjustments.
    4.  Dropping the adjustment factor column ('CFACSHR') to finalize the
        dataset for the next stage.

    Args:
        df_cleansed (pd.DataFrame): The DataFrame that has passed the cleansing
                                    and filtering steps from Task 3.

    Returns:
        pd.DataFrame: A DataFrame with new 'Close_Price_Adj' and 'Volume_Adj'
                      columns, and the 'CFACSHR' column removed.
    """
    # Log the start of the adjustment process.
    logging.info("Initiating corporate action adjustment pipeline...")

    # --- Step 1: Compute split- and dividend-adjusted prices. ---
    df_adj_price = _compute_adjusted_prices(df_cleansed)
    logging.info("Step 1/3: Adjusted prices computed successfully.")

    # --- Step 2: Compute adjusted volumes. ---
    df_adj_volume = _compute_adjusted_volumes(df_adj_price)
    logging.info("Step 2/3: Adjusted volumes computed successfully.")

    # --- Step 3: Verify adjustments and finalize the DataFrame. ---
    df_final = _verify_adjustments_and_finalize(df_adj_volume)
    logging.info("Step 3/3: Adjustments verified and 'CFACSHR' column removed.")

    # Log the successful completion of the pipeline.
    logging.info("Corporate action adjustment pipeline finished successfully.")

    return df_final
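
As a worked illustration of the two adjustment equations and the spot-check heuristic used above, the sketch below applies them to a single toy ticker around a hypothetical 2-for-1 split. All numbers, including the factor values, are invented for the example; how the factor is oriented in a real vendor file should be confirmed against that vendor's documentation.

```python
import numpy as np
import pandas as pd

# Toy series: the raw price halves and raw volume doubles on 2024-03-05.
toy = pd.DataFrame(
    {
        "Close_Price_Raw": [100.0, 102.0, 51.0, 52.0],
        "Volume_Raw": [1000.0, 1000.0, 2000.0, 2100.0],
        "CFACSHR": [1.0, 1.0, 2.0, 2.0],  # hypothetical factor values
    },
    index=pd.to_datetime(["2024-03-01", "2024-03-04", "2024-03-05", "2024-03-06"]),
)

# The two equations from this task: multiply prices, divide volumes.
toy["Close_Price_Adj"] = toy["Close_Price_Raw"] * toy["CFACSHR"]
toy["Volume_Adj"] = toy["Volume_Raw"] / toy["CFACSHR"]

# The factor cancels in price * volume, so traded value is unchanged.
raw_value = toy["Close_Price_Raw"] * toy["Volume_Raw"]
adj_value = toy["Close_Price_Adj"] * toy["Volume_Adj"]
value_preserved = bool(np.allclose(raw_value, adj_value))

# Spot-check heuristic: on dates where the factor changes, the adjusted
# return should be at most one fifth of the raw return in magnitude.
event = toy["CFACSHR"].diff().abs() > 1e-8
raw_ret = toy["Close_Price_Raw"].pct_change()[event].abs()
adj_ret = toy["Close_Price_Adj"].pct_change()[event].abs()
passes = bool((adj_ret <= 0.2 * raw_ret).all())

print(toy[["Close_Price_Adj", "Volume_Adj"]])
print(value_preserved, passes)
```

On the event date the raw return is a 50% drop while the adjusted return is zero, which is the pattern the spot-check treats as a successful adjustment.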



# Task 5: Derive engineered features from adjusted prices and volumes

# ==============================================================================
# Task 5: Derive engineered features from adjusted prices and volumes
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 5, Step 1: Compute log-transformed price and volume series.
# ------------------------------------------------------------------------------

def _compute_log_transforms(df: pd.DataFrame) -> pd.DataFrame:
    """
    Computes the natural logarithm of adjusted prices and volumes.

    Logarithmic transformation is a standard technique in finance to stabilize
    variance and convert multiplicative relationships into additive ones.

    Equations:
    1. Log_Price = ln(Close_Price_Adj)
    2. Log_Volume = ln(Volume_Adj + 1)

    Args:
        df (pd.DataFrame): The DataFrame with adjusted price and volume columns.
                           Must contain 'Close_Price_Adj' and 'Volume_Adj'.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with new 'Log_Price' and
                      'Log_Volume' columns.

    Raises:
        ValueError: If 'Close_Price_Adj' contains non-positive values.
    """
    # Work on a copy to avoid side effects on the original object.
    df_featured = df.copy()

    # --- Pre-computation Validation ---
    # The argument to the logarithm for price must be strictly positive.
    if not (df_featured['Close_Price_Adj'] > 0).all():
        raise ValueError("Cannot compute log price: 'Close_Price_Adj' column contains non-positive values.")

    # --- Log Price Calculation ---
    # Equation: Log_Price = ln(Close_Price_Adj)
    # Compute the natural logarithm of the adjusted closing price.
    df_featured['Log_Price'] = np.log(df_featured['Close_Price_Adj'])

    # --- Log Volume Calculation ---
    # Equation: Log_Volume = ln(Volume_Adj + 1)
    # np.log1p(x) computes ln(1 + x), so zero-volume days map to 0 instead of
    # log(0) = -inf, and the result is non-negative for non-negative volumes.
    df_featured['Log_Volume'] = np.log1p(df_featured['Volume_Adj'])

    return df_featured


# ------------------------------------------------------------------------------
# Task 5, Step 2: Compute daily log returns.
# ------------------------------------------------------------------------------

def _compute_log_returns(df: pd.DataFrame) -> pd.DataFrame:
    """
    Computes daily logarithmic returns from the log-transformed price series.

    Log returns are the first difference of the log price series and are a
    cornerstone of quantitative financial analysis due to their desirable
    statistical properties (e.g., time-additivity).

    Equation: r_t = Log_Price_t - Log_Price_{t-1}

    Args:
        df (pd.DataFrame): The DataFrame containing the 'Log_Price' column.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with the new 'Log_Return'
                      column. The first entry for each ticker will be NaN.
    """
    # Work on a copy.
    df_featured = df.copy()

    # --- Log Return Calculation ---
    # Group by ticker to ensure returns are calculated only within each security's
    # time series. This is critical to prevent data leakage across tickers.
    # The .diff() method calculates the difference from the previous row in the group.
    df_featured['Log_Return'] = df_featured.groupby(level='TICKER')['Log_Price'].diff()

    # --- Post-computation Logging ---
    # Log the number of NaNs created, which should equal the number of unique tickers.
    num_tickers = df_featured.index.get_level_values('TICKER').nunique()
    num_nans = df_featured['Log_Return'].isna().sum()
    if num_nans == num_tickers:
        logging.info(f"Successfully computed log returns. {num_nans} NaNs created for the first observation of each ticker, as expected.")
    else:
        logging.warning(f"Log return calculation resulted in {num_nans} NaNs, while there are {num_tickers} tickers. Review for potential data gaps.")

    return df_featured


# ------------------------------------------------------------------------------
# Task 5, Step 3: Extract calendar features from the Date index.
# ------------------------------------------------------------------------------

def _extract_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Extracts month and day-of-month features from the DataFrame's Date index.

    These features can capture potential seasonal or calendar-based anomalies
    in market behavior.

    Args:
        df (pd.DataFrame): The DataFrame with a datetime index level named 'Date'.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with new 'Month' and 'Day'
                      columns.
    """
    # Work on a copy.
    df_featured = df.copy()

    # --- Feature Extraction ---
    # Access the 'Date' level of the MultiIndex.
    date_index = df_featured.index.get_level_values('Date')

    # Extract the month (integer 1-12) using the .month accessor.
    df_featured['Month'] = date_index.month

    # Extract the day of the month (integer 1-31) using the .day accessor.
    df_featured['Day'] = date_index.day

    return df_featured


# ------------------------------------------------------------------------------
# Task 5, Orchestrator Function
# ------------------------------------------------------------------------------

def derive_engineered_features(df_adjusted: pd.DataFrame) -> pd.DataFrame:
    """
    Orchestrates the derivation of primary engineered features from adjusted data.

    This function takes a DataFrame with adjusted prices and volumes and creates
    several key features required for the downstream LPPL model and the
    Transformer model. The pipeline includes:
    1.  Computing log-transformed price and volume series.
    2.  Calculating daily log returns.
    3.  Extracting calendar features (month and day) from the index.

    Args:
        df_adjusted (pd.DataFrame): The DataFrame that has passed the corporate
                                    action adjustment steps from Task 4.

    Returns:
        pd.DataFrame: A DataFrame enriched with the new engineered features:
                      'Log_Price', 'Log_Volume', 'Log_Return', 'Month', 'Day'.
    """
    # Log the start of the feature engineering process.
    logging.info("Initiating derivation of engineered features...")

    # --- Input Validation ---
    # Ensure required adjusted columns are present.
    required_cols = ['Close_Price_Adj', 'Volume_Adj']
    if not all(col in df_adjusted.columns for col in required_cols):
        raise ValueError(f"Input DataFrame is missing required columns: {required_cols}")

    # --- Step 1: Compute log-transformed price and volume series. ---
    df_step1 = _compute_log_transforms(df_adjusted)
    logging.info("Step 1/3: Log-transformed price and volume features computed.")

    # --- Step 2: Compute daily log returns. ---
    df_step2 = _compute_log_returns(df_step1)
    logging.info("Step 2/3: Daily log returns computed.")

    # --- Step 3: Extract calendar features. ---
    df_final = _extract_calendar_features(df_step2)
    logging.info("Step 3/3: Calendar features (Month, Day) extracted.")

    # Log the successful completion of the pipeline.
    logging.info("Engineered feature derivation pipeline finished successfully.")

    return df_final


# Task 6: Align and validate the temporal calendar across all tickers

# ------------------------------------------------------------------------------
# Task 6, Step 1: Construct the master trading calendar.
# ------------------------------------------------------------------------------

def _construct_and_validate_master_calendar(
    df: pd.DataFrame,
    max_gap_days: int = 10
) -> pd.DatetimeIndex:
    """
    Constructs and validates a master trading calendar from the DataFrame index.

    This function extracts all unique dates from the DataFrame's index, sorts
    them, and performs validation checks for monotonicity and large gaps, which
    might indicate data outages.

    Args:
        df (pd.DataFrame): The input DataFrame with a 'Date' level in its index.
        max_gap_days (int): The maximum number of calendar days allowed between
                            consecutive dates before logging a warning.

    Returns:
        pd.DatetimeIndex: A sorted, unique DatetimeIndex of all trading days.

    Raises:
        ValueError: If the constructed calendar is not monotonic.
    """
    # --- Calendar Construction ---
    # Extract all unique dates from the 'Date' index level.
    master_calendar = df.index.get_level_values('Date').unique()
    # Sort the dates to ensure chronological order.
    master_calendar = master_calendar.sort_values()

    # --- Monotonicity Validation ---
    # A trading calendar must be strictly increasing. `is_monotonic_increasing`
    # alone permits repeated dates, so uniqueness is checked as well.
    if not (master_calendar.is_monotonic_increasing and master_calendar.is_unique):
        raise ValueError("Master calendar is not strictly increasing. Check for duplicate or out-of-order dates.")

    # --- Gap Analysis ---
    # Calculate the difference in calendar days between consecutive trading days.
    gaps = np.diff(master_calendar).astype('timedelta64[D]').astype(int)

    # Check if the median gap is reasonable (e.g., <= 3 for weekends/holidays).
    if pd.Series(gaps).median() > 3:
        logging.warning(f"Median gap between trading days is {pd.Series(gaps).median()} days, which is unusually high.")

    # Check for any single gap larger than the specified maximum.
    if (gaps > max_gap_days).any():
        # Find the locations of large gaps to provide informative warnings.
        large_gap_indices = np.where(gaps > max_gap_days)[0]
        for idx in large_gap_indices:
            gap_start = master_calendar[idx].date()
            gap_end = master_calendar[idx + 1].date()
            logging.warning(f"Large data gap detected: {gaps[idx]} calendar days between {gap_start} and {gap_end}.")

    return master_calendar
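

# Illustrative sketch (not part of the pipeline): the gap analysis above on a
# hand-made calendar. The helper name and dates are hypothetical; the 17-day
# jump is the only gap that should exceed the default threshold.
def _demo_calendar_gaps(max_gap_days=10):
    import numpy as np
    import pandas as pd

    calendar = pd.DatetimeIndex(['2024-01-02', '2024-01-03', '2024-01-05', '2024-01-22'])
    # Differences between consecutive dates, expressed in whole calendar days.
    gaps = np.diff(calendar.values).astype('timedelta64[D]').astype(int)
    # Collect (start, end, length) for every gap above the threshold.
    large_gaps = [
        (calendar[i].date(), calendar[i + 1].date(), int(gaps[i]))
        for i in np.where(gaps > max_gap_days)[0]
    ]
    return gaps.tolist(), large_gaps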


# ------------------------------------------------------------------------------
# Task 6, Step 2: Verify cross-sectional completeness (balanced vs unbalanced).
# ------------------------------------------------------------------------------

def _verify_cross_sectional_completeness(
    df: pd.DataFrame,
    min_obs_per_ticker: int = 100
) -> None:
    """
    Analyzes the DataFrame to assess its cross-sectional completeness.

    This function performs two checks:
    1. Assesses if the panel is balanced or unbalanced by analyzing the number
       of tickers present each day.
    2. Identifies and logs tickers with very few observations.

    Args:
        df (pd.DataFrame): The input DataFrame.
        min_obs_per_ticker (int): The minimum number of observations a ticker
                                  should have. Tickers below this are flagged.
    """
    # --- Panel Balance Assessment ---
    # Count the number of tickers with data for each day.
    daily_ticker_counts = df.groupby(level='Date').size()
    mean_count = daily_ticker_counts.mean()
    std_count = daily_ticker_counts.std()

    # Check if the standard deviation is large relative to the mean.
    if std_count > 0.3 * mean_count:
        logging.warning(f"Unbalanced panel detected. Daily ticker count has mean={mean_count:.2f} and std={std_count:.2f}. "
                        "Missing data will be handled per-ticker.")
    else:
        logging.info(f"Panel is relatively balanced. Daily ticker count: mean={mean_count:.2f}, std={std_count:.2f}.")

    # --- Per-Ticker Observation Count ---
    # Count the total number of observations for each ticker.
    obs_per_ticker = df.groupby(level='TICKER').size()
    # Identify tickers with fewer observations than the minimum threshold.
    sparse_tickers = obs_per_ticker[obs_per_ticker < min_obs_per_ticker]

    if not sparse_tickers.empty:
        logging.warning(f"Found {len(sparse_tickers)} tickers with fewer than {min_obs_per_ticker} observations. "
                        f"These may be unsuitable for models requiring long lookbacks. "
                        f"Example sparse tickers: {sparse_tickers.head(5).index.tolist()}")
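

# Illustrative sketch (not part of the pipeline): the balance heuristic above
# on a tiny unbalanced panel. The helper name and tickers are hypothetical.
# 'BBB' appears on the last day only, so the daily counts are [1, 1, 2].
def _demo_panel_balance():
    import pandas as pd

    dates = pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04'])
    rows = [(d, 'AAA') for d in dates] + [(dates[2], 'BBB')]
    toy = pd.DataFrame(
        {'x': range(len(rows))},
        index=pd.MultiIndex.from_tuples(rows, names=['Date', 'TICKER']),
    )
    daily_counts = toy.groupby(level='Date').size()
    # Same rule as above: flag the panel when std exceeds 30% of the mean.
    is_unbalanced = bool(daily_counts.std() > 0.3 * daily_counts.mean())
    obs_per_ticker = toy.groupby(level='TICKER').size().to_dict()
    return daily_counts.tolist(), is_unbalanced, obs_per_ticker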


# ------------------------------------------------------------------------------
# Task 6, Step 3: Forward-fill market-wide features consistently.
# ------------------------------------------------------------------------------

def _enforce_market_feature_consistency(
    df: pd.DataFrame,
    master_calendar: pd.DatetimeIndex
) -> pd.DataFrame:
    """
    Ensures market-wide features are consistent and complete across all tickers.

    This function takes a market-wide feature (e.g., 'VIX_Close'), creates a
    single authoritative time series for it on the master calendar, and then
    maps this series back to the entire DataFrame. This corrects any potential
    inconsistencies where the same feature might have different values for
    different tickers on the same day.

    Args:
        df (pd.DataFrame): The input DataFrame.
        master_calendar (pd.DatetimeIndex): The master list of all trading days.

    Returns:
        pd.DataFrame: A copy of the DataFrame with the 'VIX_Close' column
                      made consistent and complete.
    """
    # Work on a copy.
    df_consistent = df.copy()
    market_feature = 'VIX_Close'

    # --- Create Authoritative Time Series ---
    # Extract the series for the first ticker as a reference.
    first_ticker = df_consistent.index.get_level_values('TICKER')[0]
    # Use .xs to select by level name. This does not depend on the order of the
    # index levels and returns a Series indexed by the remaining 'Date' level.
    authoritative_series = df_consistent.xs(first_ticker, level='TICKER')[market_feature]

    # Reindex to the master calendar to fill any missing dates with NaN.
    # Then forward-fill to propagate the last known value.
    complete_series = authoritative_series.reindex(master_calendar).ffill()

    # --- Final Validation of the Complete Series ---
    if complete_series.isna().any():
        raise ValueError(f"Could not create a complete series for '{market_feature}'. NaNs remain after reindexing and forward-filling, likely due to missing data at the start.")

    # --- Map Back to the Full DataFrame ---
    # Create a mapping from the Date index to the complete VIX values.
    date_to_vix_map = complete_series.to_dict()
    # Use the .map() function on the Date index level for an efficient update.
    df_consistent[market_feature] = df_consistent.index.get_level_values('Date').map(date_to_vix_map)

    # --- Final Assertion ---
    if df_consistent[market_feature].isna().any():
        raise RuntimeError(f"Enforcing consistency for '{market_feature}' failed unexpectedly. NaNs are still present.")

    return df_consistent
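

# Illustrative sketch (not part of the pipeline): the reindex, forward-fill,
# and map-back steps above on toy data. The helper name, tickers, and VIX
# values are hypothetical. The reference series is missing 2024-01-03, which
# the forward fill should populate with the 2024-01-02 value.
def _demo_market_feature_fill():
    import pandas as pd

    master_calendar = pd.DatetimeIndex(
        ['2024-01-02', '2024-01-03', '2024-01-04'], name='Date'
    )
    reference = pd.Series(
        [13.2, 14.1],
        index=pd.DatetimeIndex(['2024-01-02', '2024-01-04'], name='Date'),
    )
    # Align to the full calendar, then carry the last known value forward.
    complete = reference.reindex(master_calendar).ffill()

    index = pd.MultiIndex.from_product(
        [master_calendar, ['AAA', 'BBB']], names=['Date', 'TICKER']
    )
    toy = pd.DataFrame(index=index)
    # Every ticker receives the same value for a given date.
    toy['VIX_Close'] = toy.index.get_level_values('Date').map(complete.to_dict())
    return toy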


# ------------------------------------------------------------------------------
# Task 6, Orchestrator Function
# ------------------------------------------------------------------------------

def align_and_validate_calendar(df_featured: pd.DataFrame) -> pd.DataFrame:
    """
    Orchestrates the validation and alignment of the dataset's temporal structure.

    This function ensures the time dimension of the panel data is coherent,
    complete, and consistent before proceeding to modeling. The pipeline includes:
    1.  Constructing a master trading calendar and checking for gaps.
    2.  Assessing the cross-sectional completeness (panel balance).
    3.  Enforcing consistency and completeness of market-wide features like VIX.

    Args:
        df_featured (pd.DataFrame): The DataFrame with engineered features from
                                    Task 5.

    Returns:
        pd.DataFrame: A DataFrame with a validated temporal structure and
                      consistent market-wide features.
    """
    # Log the start of the calendar validation process.
    logging.info("Initiating temporal calendar alignment and validation...")

    # --- Step 1: Construct and validate the master trading calendar. ---
    master_calendar = _construct_and_validate_master_calendar(df_featured)
    logging.info(f"Step 1/3: Master calendar constructed, containing {len(master_calendar)} unique trading days.")

    # --- Step 2: Verify cross-sectional completeness. ---
    _verify_cross_sectional_completeness(df_featured)
    logging.info("Step 2/3: Cross-sectional completeness verified.")

    # --- Step 3: Enforce consistency for market-wide features. ---
    df_final = _enforce_market_feature_consistency(df_featured, master_calendar)
    logging.info("Step 3/3: Consistency of market-wide features enforced.")

    # Log the successful completion of the pipeline.
    logging.info("Temporal calendar alignment and validation finished successfully.")

    return df_final


# ==============================================================================
# Task 7: Set up BERTopic and generate sentence embeddings for the news corpus
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 7, Step 1: Extract and deduplicate the full news corpus.
# ------------------------------------------------------------------------------

def _extract_and_deduplicate_corpus(df: pd.DataFrame) -> List[str]:
    """
    Extracts all news articles from the DataFrame, normalizes them, and
    returns a list of unique articles.

    Args:
        df (pd.DataFrame): The DataFrame containing the 'News_Articles' column,
                           where each cell is a list of strings.

    Returns:
        List[str]: A deduplicated list of all news articles.
    """
    # --- Corpus Extraction ---
    # Use dropna() to skip any rows that might have nulls in this column.
    # chain.from_iterable is a highly efficient way to flatten a list of lists.
    logging.info("Extracting all news articles from the DataFrame...")
    corpus_raw = list(chain.from_iterable(df['News_Articles'].dropna()))
    logging.info(f"Extracted a total of {len(corpus_raw):,} articles.")

    # --- Normalization and Deduplication ---
    # Normalize by stripping whitespace and converting to lowercase so that
    # articles differing only in case or surrounding whitespace collapse into
    # a single entry. A set handles the deduplication efficiently.
    logging.info("Normalizing and deduplicating corpus...")
    unique_articles_set = {article.strip().lower() for article in corpus_raw if article.strip()}
    # Sort so the corpus order is deterministic across runs. Set iteration order
    # for strings is not, and cached embeddings are matched to the corpus by position.
    corpus_unique = sorted(unique_articles_set)
    logging.info(f"Found {len(corpus_unique):,} unique articles after deduplication.")

    return corpus_unique
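

# Illustrative sketch (not part of the pipeline): the flatten, normalize, and
# deduplicate steps above on toy cells. The helper name and headlines are
# hypothetical. The result is sorted here only to make the output
# deterministic, which is an assumption of this sketch.
def _demo_normalize_and_deduplicate():
    from itertools import chain

    cells = [
        ["Housing starts rise ", "REIT yields fall"],
        [" housing starts rise", ""],
        None,  # Stands in for a row with no articles.
    ]
    # Flatten the per-row lists, skipping missing cells.
    raw = list(chain.from_iterable(cell for cell in cells if cell is not None))
    # Case and surrounding whitespace are ignored; empty strings are dropped.
    unique = sorted({article.strip().lower() for article in raw if article.strip()})
    return raw, unique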


# ------------------------------------------------------------------------------
# Task 7, Step 2: Load the sentence-transformer model and generate embeddings.
# ------------------------------------------------------------------------------

def _generate_corpus_embeddings(
    corpus: List[str],
    config: Dict[str, Any],
    output_path: Path,
    batch_size: int = 64
) -> np.ndarray:
    """
    Generates sentence embeddings for a corpus of documents using a pre-trained
    transformer model.

    This is a computationally expensive step. The function is designed to be
    idempotent by checking if the output file already exists.

    Args:
        corpus (List[str]): The list of unique documents to embed.
        config (Dict[str, Any]): The study configuration dictionary, containing
                                 the model name.
        output_path (Path): The file path to save the embeddings to.
        batch_size (int): The batch size to use for encoding, for memory
                          efficiency.

    Returns:
        np.ndarray: A 2D numpy array of shape (n_articles, embedding_dim).
    """
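    # --- Idempotency Check ---
    # If the embeddings file already exists, load it to avoid re-computation,
    # but only reuse it when its row count matches the current corpus.
    if output_path.exists():
        logging.info(f"Found existing embeddings at '{output_path}'. Loading from file.")
        cached_embeddings = np.load(output_path)
        if cached_embeddings.shape[0] == len(corpus):
            return cached_embeddings
        logging.warning(f"Cached embeddings have {cached_embeddings.shape[0]:,} rows but the corpus has {len(corpus):,} articles. Recomputing.")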

    # --- Model Loading ---
    # Determine the device to use (GPU if available, otherwise CPU).
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model_name = config['nlp_settings']['topic_model']['embedding_model']
    logging.info(f"Loading sentence-transformer model '{model_name}' onto device '{device}'.")
    # Load the specified pre-trained model.
    model = SentenceTransformer(model_name, device=device)

    # --- Embedding Generation ---
    logging.info(f"Generating embeddings for {len(corpus):,} articles... (This may take a while)")
    # The .encode() method handles tokenization, inference, and pooling.
    # Using a progress bar provides crucial feedback for this long-running task.
    embeddings = model.encode(
        corpus,
        show_progress_bar=True,
        batch_size=batch_size,
        normalize_embeddings=True # Often improves clustering performance
    )

    # --- Persistence ---
    # Ensure the parent directory for the output file exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Save the numpy array to the specified path.
    logging.info(f"Saving embeddings to '{output_path}'.")
    np.save(output_path, embeddings)

    return embeddings
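

# Illustrative sketch (not part of the pipeline): the load-or-compute caching
# pattern above, with a hypothetical stand-in for model.encode() so it runs
# without sentence-transformers. All names here are assumptions made for the
# sketch. The encoder should be called once; the second call reads the file.
def _demo_embedding_cache():
    import tempfile
    from pathlib import Path

    import numpy as np

    encode_calls = []

    def fake_encode(corpus):
        # Hypothetical encoder: deterministic rows scaled to unit length.
        encode_calls.append(len(corpus))
        vectors = np.arange(1, len(corpus) * 3 + 1, dtype=float).reshape(len(corpus), 3)
        return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def load_or_compute(corpus, output_path):
        # Reuse the saved array when present; otherwise compute and persist it.
        if output_path.exists():
            return np.load(output_path)
        embeddings = fake_encode(corpus)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        np.save(output_path, embeddings)
        return embeddings

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "cache" / "corpus_embeddings.npy"
        first = load_or_compute(["a", "b"], path)
        second = load_or_compute(["a", "b"], path)
    return first, second, len(encode_calls)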


# ------------------------------------------------------------------------------
# Task 7, Step 3: Initialize BERTopic with the specified hyperparameters.
# ------------------------------------------------------------------------------

def _fit_bertopic_model(
    corpus: List[str],
    embeddings: np.ndarray,
    config: Dict[str, Any],
    output_path: Path
) -> BERTopic:
    """
    Initializes and fits a BERTopic model using pre-computed embeddings.

    This function configures the underlying UMAP and HDBSCAN models with
    parameters from the configuration to ensure reproducibility. The fitted
    model is saved to disk to avoid re-fitting.

    Args:
        corpus (List[str]): The list of unique documents.
        embeddings (np.ndarray): The pre-computed embeddings for the corpus.
        config (Dict[str, Any]): The study configuration dictionary.
        output_path (Path): The file path to save the fitted BERTopic model.

    Returns:
        BERTopic: The fitted BERTopic model instance.
    """
    # --- Idempotency Check ---
    if output_path.exists():
        logging.info(f"Found existing BERTopic model at '{output_path}'. Loading from file.")
        return BERTopic.load(output_path)

    # --- Component Configuration ---
    # Extract hyperparameters from the configuration.
    topic_config = config['nlp_settings']['topic_model']
    umap_params = {'n_neighbors': topic_config['umap_n_neighbors'], 'n_components': 5, 'min_dist': 0.0, 'metric': 'cosine', 'random_state': 42}
    hdbscan_params = {'min_cluster_size': topic_config['hdbscan_min_cluster_size'], 'metric': 'euclidean', 'cluster_selection_method': 'eom'}

    # Instantiate the components with the specified parameters for reproducibility.
    umap_model = UMAP(**umap_params)
    hdbscan_model = HDBSCAN(**hdbscan_params)

    # --- BERTopic Initialization ---
    # Initialize BERTopic with the custom components.
    logging.info("Initializing BERTopic model with custom UMAP and HDBSCAN parameters.")
    topic_model = BERTopic(
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        verbose=True
    )

    # --- Model Fitting ---
    logging.info(f"Fitting BERTopic model on {len(corpus):,} documents...")
    # Fit the model using the pre-computed embeddings.
    # This is more efficient and ensures consistency with the embedding step.
    topics, _ = topic_model.fit_transform(corpus, embeddings)

    # --- Persistence ---
    # Ensure the parent directory exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Save the entire fitted model object.
    logging.info(f"Saving fitted BERTopic model to '{output_path}'.")
    topic_model.save(output_path)

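    # Log a summary of the topic modeling results, excluding the outlier topic (-1).
    topic_ids = topic_model.get_topic_info()['Topic']
    num_topics = int((topic_ids != -1).sum())
    logging.info(f"BERTopic fitting complete. Found {num_topics} topics (excluding outliers).")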

    return topic_model


# ------------------------------------------------------------------------------
# Task 7, Orchestrator Function
# ------------------------------------------------------------------------------

def setup_topic_model(
    df_aligned: pd.DataFrame,
    study_parameters: Dict[str, Any],
    intermediate_data_dir: Union[str, Path] = "data_intermediate",
    model_dir: Union[str, Path] = "models"
) -> Tuple[List[str], np.ndarray, BERTopic]:
    """
    Orchestrates the end-to-end process of setting up the BERTopic model.

    This function prepares the textual data for thematic analysis by:
    1.  Extracting and deduplicating all news articles into a clean corpus.
    2.  Generating high-quality sentence embeddings for the corpus.
    3.  Initializing and fitting a BERTopic model with specified hyperparameters.

    The function is designed to be idempotent, leveraging saved intermediate
    artifacts (embeddings, fitted model) to avoid expensive re-computation.

    Args:
        df_aligned (pd.DataFrame): The DataFrame from Task 6 with a validated
                                   temporal structure.
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        intermediate_data_dir (Union[str, Path]): Directory to save/load
                                                   intermediate data like embeddings.
        model_dir (Union[str, Path]): Directory to save/load the fitted model.

    Returns:
        Tuple[List[str], np.ndarray, BERTopic]: A tuple containing:
            - The unique, normalized corpus of articles.
            - The corresponding sentence embeddings.
            - The fitted BERTopic model instance.
    """
    logging.info("Initiating NLP setup for topic modeling...")

    # Define paths for intermediate artifacts.
    data_path = Path(intermediate_data_dir)
    model_path = Path(model_dir)
    embeddings_filepath = data_path / "corpus_embeddings.npy"
    bertopic_model_filepath = model_path / "bertopic_model.pkl"

    # --- Step 1: Extract and deduplicate the news corpus. ---
    corpus = _extract_and_deduplicate_corpus(df_aligned)
    logging.info("Step 1/3: News corpus extracted and deduplicated successfully.")

    # --- Step 2: Generate sentence embeddings for the corpus. ---
    embeddings = _generate_corpus_embeddings(corpus, study_parameters, embeddings_filepath)
    logging.info("Step 2/3: Sentence embeddings generated successfully.")

    # --- Step 3: Initialize and fit the BERTopic model. ---
    topic_model = _fit_bertopic_model(corpus, embeddings, study_parameters, bertopic_model_filepath)
    logging.info("Step 3/3: BERTopic model fitted successfully.")

    logging.info("NLP setup for topic modeling finished successfully.")

    return corpus, embeddings, topic_model


# ==============================================================================
# Task 8: Apply BERTopic clustering and filter to real-estate-relevant articles
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 8, Step 1: Extract topic assignments and representations.
# ------------------------------------------------------------------------------

def _extract_topic_assignments(
    corpus: List[str],
    topic_model: BERTopic
) -> Tuple[Dict[str, int], pd.DataFrame]:
    """
    Extracts topic assignments for each document and retrieves topic representations.

    Args:
        corpus (List[str]): The unique, normalized corpus of articles that the
                            model was fitted on.
        topic_model (BERTopic): The fitted BERTopic model instance.

    Returns:
        Tuple[Dict[str, int], pd.DataFrame]: A tuple containing:
            - A dictionary mapping each article text to its assigned topic ID.
            - A DataFrame with information about each topic (keywords, count).
    """
    # --- Get Topic Information ---
    # get_topic_info() returns a DataFrame with details for each topic,
    # including the default c-TF-IDF keyword representation.
    logging.info("Extracting topic representations from the BERTopic model.")
    topic_info_df = topic_model.get_topic_info()

    # --- Create Article-to-Topic Mapping ---
    # The `topic_model.topics_` attribute stores the topic assignment for each
    # document in the same order as the input corpus.
    if len(corpus) != len(topic_model.topics_):
        raise ValueError("Mismatch between corpus size and topic assignment count. The model may have been fitted on a different corpus.")

    # Create a dictionary for efficient lookup of an article's topic.
    logging.info("Creating a map from articles to their assigned topic IDs.")
    article_to_topic_map = dict(zip(corpus, topic_model.topics_))

    return article_to_topic_map, topic_info_df


# ------------------------------------------------------------------------------
# Task 8, Step 2: Identify real-estate-relevant topics via keyword matching.
# ------------------------------------------------------------------------------

def _identify_relevant_topics(
    topic_info_df: pd.DataFrame,
    keywords: Set[str],
    metadata_path: Path
) -> Set[int]:
    """
    Identifies relevant topics based on a set of keywords and saves metadata.

    This function implements a deterministic, keyword-based heuristic to flag
    topics related to the real estate sector. It creates an audit trail by
    saving the selected topics and their keywords to a JSON file.

    Args:
        topic_info_df (pd.DataFrame): DataFrame containing topic information.
        keywords (Set[str]): A set of keywords to identify relevant topics.
        metadata_path (Path): Path to save the topic filter metadata JSON file.

    Returns:
        Set[int]: A set of topic IDs identified as relevant.
    """
    logging.info(f"Identifying real estate-related topics using keywords: {keywords}")
    relevant_topic_ids = set()
    # This dictionary will store the metadata for the audit trail.
    selection_metadata = {}

    # Iterate over each topic returned by the model.
    # The outlier topic (-1) is always ignored.
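    for _, row in topic_info_df[topic_info_df.Topic != -1].iterrows():
        # Cast to a built-in int: pandas may yield a NumPy integer, which
        # json.dump rejects as a dictionary key.
        topic_id = int(row['Topic'])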
        # 'Representation' column contains the top keywords for the topic.
        topic_keywords = row['Representation']

        # Check if any of the target keywords appear in the topic's representation.
        if any(keyword in topic_keywords for keyword in keywords):
            # If a match is found, add the topic ID to our set.
            relevant_topic_ids.add(topic_id)
            # Record the reason for selection in the metadata.
            selection_metadata[topic_id] = {
                "keywords": topic_keywords,
                "reason": "Matched one or more target keywords."
            }

    logging.info(f"Identified {len(relevant_topic_ids)} relevant topics: {sorted(relevant_topic_ids)}")

    # --- Persistence of Metadata ---
    # Save the selection metadata for reproducibility and review.
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    with metadata_path.open('w') as f:
        json.dump(selection_metadata, f, indent=4)
    logging.info(f"Topic selection metadata saved to '{metadata_path}'.")

    return relevant_topic_ids
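

# Illustrative sketch (not part of the pipeline): the keyword heuristic above
# on a hand-made topic table. The helper name, topics, and keywords are
# hypothetical. Note that `keyword in representation` is an exact match
# against list elements, so a multi-word keyword only matches when the
# representation contains that exact phrase as one term.
def _demo_keyword_topic_selection():
    import json

    import pandas as pd

    topic_info = pd.DataFrame({
        'Topic': [-1, 0, 1],
        'Representation': [
            ['the', 'of'],
            ['mortgage', 'rates', 'housing'],
            ['oil', 'opec', 'crude'],
        ],
    })
    keywords = {'housing', 'mortgage', 'reit'}

    selected = {}
    # Skip the outlier topic (-1) and keep topics sharing any keyword.
    for _, row in topic_info[topic_info.Topic != -1].iterrows():
        if any(keyword in row['Representation'] for keyword in keywords):
            # Built-in int keys keep the mapping JSON-serializable.
            selected[int(row['Topic'])] = row['Representation']
    return selected, json.dumps(selected)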


# ------------------------------------------------------------------------------
# Task 8, Step 3: Filter the corpus and update the DataFrame.
# ------------------------------------------------------------------------------

def _filter_dataframe_articles(
    df: pd.DataFrame,
    article_to_topic_map: Dict[str, int],
    relevant_topic_ids: Set[int]
) -> Tuple[pd.DataFrame, Set[str]]:
    """
    Filters the 'News_Articles' column in the DataFrame to retain only
    articles belonging to relevant topics.

    Args:
        df (pd.DataFrame): The DataFrame to be filtered.
        article_to_topic_map (Dict[str, int]): Mapping from article text to topic ID.
        relevant_topic_ids (Set[int]): Set of topic IDs to keep.

    Returns:
        Tuple[pd.DataFrame, Set[str]]: A tuple containing:
            - A copy of the DataFrame with the 'News_Articles' lists filtered.
            - A set containing the texts of all articles that were kept.
    """
    # Work on a copy to avoid side effects.
    df_filtered = df.copy()

    # --- Build the Set of Relevant Articles ---
    # This is more efficient than checking topics for every article in every row.
    logging.info("Building the set of all relevant articles for efficient filtering.")
    corpus_filtered = {
        article for article, topic_id in article_to_topic_map.items()
        if topic_id in relevant_topic_ids
    }

    # --- Define the Filtering Function ---
    # This function will be applied to each cell in the 'News_Articles' column.
    def filter_article_list(articles: List[str]) -> List[str]:
        # Ensure the input is a list before processing.
        if not isinstance(articles, list):
            return []
        # Normalize articles in the same way (strip/lower) and check for membership
        # in the pre-computed set of relevant articles.
        return [
            article for article in articles
            if article.strip().lower() in corpus_filtered
        ]

    # --- Apply the Filter to the DataFrame ---
    logging.info("Applying topic filter to the 'News_Articles' column in the DataFrame...")
    df_filtered['News_Articles'] = df_filtered['News_Articles'].apply(filter_article_list)

    return df_filtered, corpus_filtered
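

# Illustrative sketch (not part of the pipeline): the per-cell filter above on
# toy data. The helper name, headlines, and topic IDs are hypothetical.
# Matching uses the normalized text, but the original (un-normalized) string
# is what stays in the list; non-list cells become empty lists.
def _demo_filter_article_lists():
    import pandas as pd

    article_to_topic = {'housing starts rise': 0, 'oil prices jump': 1}
    relevant_topic_ids = {0}
    keep = {
        article for article, topic_id in article_to_topic.items()
        if topic_id in relevant_topic_ids
    }

    toy = pd.DataFrame({
        'News_Articles': [['Housing starts rise ', 'Oil prices jump'], None, []]
    })

    def filter_article_list(articles):
        if not isinstance(articles, list):
            return []
        return [a for a in articles if a.strip().lower() in keep]

    toy['News_Articles'] = toy['News_Articles'].apply(filter_article_list)
    return toy['News_Articles'].tolist()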


# ------------------------------------------------------------------------------
# Task 8, Orchestrator Function
# ------------------------------------------------------------------------------

def apply_topic_filter(
    df_aligned: pd.DataFrame,
    corpus: List[str],
    topic_model: BERTopic,
    log_dir: Union[str, Path] = "logs"
) -> Tuple[pd.DataFrame, Set[str]]:
    """
    Orchestrates the filtering of news articles based on thematic relevance.

    This function uses a fitted BERTopic model to:
    1.  Extract topic assignments for all unique articles.
    2.  Identify topics relevant to a specific domain (real estate) using a
        keyword-based heuristic.
    3.  Filter the 'News_Articles' column in the main DataFrame, removing any
        articles that do not belong to the identified relevant topics.

    Args:
        df_aligned (pd.DataFrame): The main DataFrame.
        corpus (List[str]): The unique, normalized corpus of articles.
        topic_model (BERTopic): The fitted BERTopic model from Task 7.
        log_dir (Union[str, Path]): Directory to save the topic filter metadata.

    Returns:
        Tuple[pd.DataFrame, Set[str]]: A tuple containing:
            - The DataFrame with its 'News_Articles' column filtered.
            - A set of the unique, normalized article texts that were retained.
    """
    logging.info("Initiating article filtering based on topic modeling...")

    # Define the keywords for identifying the real estate domain.
    real_estate_keywords = {
        "real estate", "property", "housing", "reit", "mortgage",
        "construction", "leasing", "landlord", "tenant", "zoning"
    }
    metadata_filepath = Path(log_dir) / "topic_filter_metadata.json"

    # --- Step 1: Extract topic assignments and representations. ---
    article_to_topic_map, topic_info_df = _extract_topic_assignments(corpus, topic_model)
    logging.info("Step 1/3: Extracted topic assignments and representations.")

    # --- Step 2: Identify real-estate-relevant topics. ---
    relevant_topic_ids = _identify_relevant_topics(topic_info_df, real_estate_keywords, metadata_filepath)
    logging.info("Step 2/3: Identified relevant topics using keyword matching.")

    # --- Step 3: Filter the corpus and update the DataFrame. ---
    # Get the total number of articles before filtering for comparison.
    total_articles_before = sum(df_aligned['News_Articles'].apply(lambda x: len(x) if isinstance(x, list) else 0))

    df_filtered, corpus_retained = _filter_dataframe_articles(df_aligned, article_to_topic_map, relevant_topic_ids)

    # Calculate and log retention statistics.
    total_articles_after = sum(df_filtered['News_Articles'].apply(lambda x: len(x) if isinstance(x, list) else 0))
    retention_pct = (total_articles_after / total_articles_before) * 100 if total_articles_before > 0 else 0
    logging.info("Step 3/3: DataFrame's 'News_Articles' column filtered successfully.")
    logging.info(f"Article retention rate: {total_articles_after:,} / {total_articles_before:,} ({retention_pct:.2f}%).")

    logging.info("Article filtering based on topics finished successfully.")

    return df_filtered, corpus_retained


In [None]:
# Task 9: Apply FinBERT sentiment classification to each filtered article

# ==============================================================================
# Task 9: Apply FinBERT sentiment classification to each filtered article
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 9, Step 1: Load the FinBERT model and tokenizer.
# ------------------------------------------------------------------------------


def _load_sentiment_pipeline(config: Dict[str, Any]) -> pipeline:
    """
    Loads the pre-trained FinBERT model and tokenizer into a Hugging Face
    text-classification pipeline, with the model explicitly in evaluation mode.

    This function handles device detection (GPU/CPU), loads the model and
    tokenizer, calls ``model.eval()`` so that training-only behavior such as
    dropout stays disabled, and then initializes the inference pipeline.
    ``from_pretrained`` already returns models in evaluation mode, so the
    explicit call mainly documents the intended state and guards against it
    being changed before inference.

    Args:
        config (Dict[str, Any]): The study configuration dictionary, containing
                                 the sentiment model name.

    Returns:
        transformers.Pipeline: An initialized text-classification pipeline
                               whose model is in evaluation mode.

    Raises:
        IOError: If the model or tokenizer cannot be loaded for any reason
                 (for example a wrong model name or a failed download). The
                 original exception is chained as the cause.
    """
    # --- Step 1: Device Selection ---
    # Determine the appropriate device for PyTorch operations. Use CUDA if a
    # compatible GPU is available for significant performance gains, otherwise
    # default to the CPU. The device_id is used by the pipeline constructor.
    device_id = 0 if torch.cuda.is_available() else -1
    device = torch.device("cuda:0" if device_id == 0 else "cpu")

    # --- Step 2: Model and Tokenizer Loading with Explicit Eval Mode ---
    # Extract the model identifier from the configuration dictionary.
    model_name = config['nlp_settings']['sentiment_model']['huggingface_model_name']
    logging.info(f"Loading sentiment model '{model_name}' and tokenizer for device '{device}'.")

    try:
        # Load the tokenizer associated with the pre-trained model.
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Load the pre-trained model weights and architecture.
        model = AutoModelForSequenceClassification.from_pretrained(model_name)

        # Explicitly set the model to evaluation mode. `from_pretrained` already
        # does this by default, so the call is a safeguard that keeps layers such
        # as Dropout disabled regardless of how the model object was obtained.
        model.eval()

        # Move the model to the selected device (GPU or CPU).
        model.to(device)

    except Exception as e:
        # Catch potential network errors or other issues during download/loading.
        logging.error(f"Failed to load model '{model_name}' from Hugging Face. Check model name and internet connection.")
        # Re-raise the exception to halt execution, as this is a critical failure.
        raise IOError(f"Could not load model '{model_name}'.") from e

    # --- Step 3: Pipeline Initialization ---
    # Instantiate the text-classification pipeline, passing the pre-loaded and
    # pre-configured model and tokenizer objects. This removes any ambiguity
    # about the model's state.
    sentiment_pipeline = pipeline(
        task="text-classification",
        model=model,
        tokenizer=tokenizer,
        device=device_id, # The pipeline API uses the integer ID.
        framework="pt"
    )

    logging.info("FinBERT sentiment analysis pipeline loaded successfully in evaluation mode.")

    # Return the fully configured pipeline object.
    return sentiment_pipeline

# ------------------------------------------------------------------------------
# Task 9, Step 2: Run inference on each article in the filtered corpus.
# ------------------------------------------------------------------------------

def _run_batch_sentiment_inference(
    corpus: Union[List[str], Set[str]],
    sentiment_pipeline: pipeline,
    batch_size: int = 64
) -> Dict[str, Dict[str, Any]]:
    """
    Runs batched sentiment analysis inference on a corpus of articles.

    Args:
        corpus (Union[List[str], Set[str]]): A list or set of unique,
                                             normalized article texts.
        sentiment_pipeline (pipeline): The initialized Hugging Face pipeline.
        batch_size (int): The number of articles to process in each batch.

    Returns:
        Dict[str, Dict[str, Any]]: A dictionary mapping each article text to
                                   its predicted label (key 'class', lowercased)
                                   and its score (key 'confidence').
    """
    # Convert corpus to list if it's a set, as pipelines expect a sequence.
    corpus_list = list(corpus)
    logging.info(f"Running sentiment inference on {len(corpus_list):,} articles with batch size {batch_size}...")

    # --- Batch Inference ---
    # Passing the entire list to the pipeline with a batch_size lets it group
    # articles into batches per forward pass, which mainly pays off on a GPU.
    # Truncation caps each input at 512 tokens, so any text beyond that limit
    # in longer articles is not scored.
    try:
        results = sentiment_pipeline(
            corpus_list,
            batch_size=batch_size,
            truncation=True,
            max_length=512
        )
    except Exception as e:
        logging.error(f"An error occurred during batch sentiment inference: {e}")
        raise

    # --- Result Structuring ---
    # Combine the input articles with their corresponding results into a dictionary
    # for efficient O(1) lookup by article text.
    sentiment_results = {
        text: {'class': result['label'].lower(), 'confidence': result['score']}
        for text, result in zip(corpus_list, results)
    }

    logging.info("Batch sentiment inference completed.")
    return sentiment_results


# ------------------------------------------------------------------------------
# Task 9, Step 3: Map FinBERT classes to numerical polarity scores.
# ------------------------------------------------------------------------------

def _map_sentiment_to_polarity(
    sentiment_results: Dict[str, Dict[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """
    Adds a numerical polarity score to the sentiment results dictionary.

    Args:
        sentiment_results (Dict[str, Dict[str, Any]]): The dictionary of
                                                       inference results.

    Returns:
        Dict[str, Dict[str, Any]]: The same dictionary, with each value dict
                                   augmented with a 'polarity' key.
    """
    # Define the canonical mapping from class label to numerical score.
    POLARITY_MAP = {'positive': 1.0, 'neutral': 0.0, 'negative': -1.0}

    logging.info("Mapping sentiment classes to numerical polarity scores...")
    # Iterate through the results and add the 'polarity' score.
    for article_text, result in sentiment_results.items():
        sentiment_class = result.get('class')
        # Use .get() for safety; if an unexpected class appears, default to neutral (0.0).
        polarity = POLARITY_MAP.get(sentiment_class)

        if polarity is None:
            logging.warning(f"Unexpected sentiment class '{sentiment_class}' found for an article. Assigning neutral polarity (0.0).")
            polarity = 0.0

        result['polarity'] = polarity

    return sentiment_results


# ------------------------------------------------------------------------------
# Task 9, Orchestrator Function
# ------------------------------------------------------------------------------

def classify_article_sentiment(
    corpus_filtered: Set[str],
    study_parameters: Dict[str, Any],
    output_path: Union[str, Path]
) -> Dict[str, Dict[str, Any]]:
    """
    Orchestrates the end-to-end sentiment classification of a news corpus.

    This function is idempotent: it will load results from the output path if
    the file already exists, avoiding re-computation. Otherwise, it performs:
    1.  Loading the pre-trained FinBERT model into an efficient pipeline.
    2.  Running batched inference on the entire filtered corpus.
    3.  Mapping the resulting sentiment labels to numerical polarity scores.
    4.  Persisting the final, comprehensive results to disk.

    Args:
        corpus_filtered (Set[str]): A set of unique, normalized, and
                                    thematically relevant article texts.
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        output_path (Union[str, Path]): The file path to save/load the final
                                        sentiment results dictionary.

    Returns:
        Dict[str, Dict[str, Any]]: A dictionary mapping each article to its
                                   sentiment class, confidence, and polarity.
    """
    logging.info("Initiating sentiment classification pipeline...")
    output_path = Path(output_path)

    # --- Idempotency Check ---
    # If results already exist, load and return them immediately.
    if output_path.exists():
        logging.info(f"Found existing sentiment results at '{output_path}'. Loading from file.")
        with open(output_path, 'rb') as f:
            return pickle.load(f)

    # --- Step 1: Load the FinBERT model and tokenizer into a pipeline. ---
    sentiment_pipeline = _load_sentiment_pipeline(study_parameters)
    logging.info("Step 1/3: FinBERT sentiment pipeline loaded successfully.")

    # --- Step 2: Run batched inference on the filtered corpus. ---
    # The pipeline is most efficient when processing many documents at once.
    sentiment_results = _run_batch_sentiment_inference(corpus_filtered, sentiment_pipeline)
    logging.info("Step 2/3: Batch inference completed for all articles.")

    # --- Step 3: Map sentiment classes to numerical polarity scores. ---
    final_results = _map_sentiment_to_polarity(sentiment_results)
    logging.info("Step 3/3: Polarity scores mapped successfully.")

    # --- Persistence ---
    # Save the final dictionary to disk for future runs.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    logging.info(f"Saving final sentiment results to '{output_path}'.")
    with open(output_path, 'wb') as f:
        pickle.dump(final_results, f)

    logging.info("Sentiment classification pipeline finished successfully.")
    return final_results
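

The result structuring (Step 2) and polarity mapping (Step 3) above can be sketched without loading FinBERT. This is a minimal illustration, not the pipeline itself: the sample texts, labels, and scores are made up, and `structure_and_map` is a hypothetical helper. Only the `{'label': ..., 'score': ...}` shape mirrors what a Hugging Face text-classification pipeline returns.

```python
from typing import Any, Dict, List

# Hypothetical inputs: normalized article texts and hand-written,
# pipeline-style outputs (these are NOT real FinBERT predictions).
sample_texts: List[str] = [
    "reit earnings beat expectations",
    "mortgage defaults rise sharply",
    "zoning board meets on tuesday",
]
sample_outputs: List[Dict[str, Any]] = [
    {"label": "Positive", "score": 0.93},
    {"label": "Negative", "score": 0.88},
    {"label": "Mixed", "score": 0.51},  # a label outside the expected three
]

# Same class-to-polarity mapping as in the code above.
POLARITY_MAP = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def structure_and_map(
    texts: List[str], outputs: List[Dict[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: key results by text and attach a polarity score."""
    results: Dict[str, Dict[str, Any]] = {}
    for text, out in zip(texts, outputs):
        label = out["label"].lower()
        results[text] = {
            "class": label,
            "confidence": out["score"],
            # Unknown labels fall back to neutral polarity (0.0).
            "polarity": POLARITY_MAP.get(label, 0.0),
        }
    return results


demo_results = structure_and_map(sample_texts, sample_outputs)
for text, res in demo_results.items():
    print(f"{res['polarity']:+.1f}  ({res['class']}, {res['confidence']:.2f})  {text}")
```

Keying the dictionary by article text means duplicate texts collapse to one entry, which is why the corpus is expected to be unique and normalized before inference.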


In [None]:
# Task 10: Aggregate article-level sentiment into stock-day sentiment scores

# ==============================================================================
# Task 10: Aggregate article-level sentiment into stock-day sentiment scores S_i,t
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 10, Steps 1 & 2: Define weighting and compute the weighted sentiment
#                       score per stock-day.
# ------------------------------------------------------------------------------

def _calculate_weighted_sentiment(
    articles: List[str],
    sentiment_map: Dict[str, Dict[str, Any]]
) -> float:
    """
    Calculates the confidence-weighted average sentiment score for a list of articles.

    This function implements the aggregation logic defined in the paper's
    Equation 9, using the FinBERT model's confidence score as the weight.

    Equation: S_i,t = (Σ p_k * s_k) / (Σ p_k)
    where p_k is the confidence and s_k is the polarity of article k.

    Args:
        articles (List[str]): A list of raw article texts for a single stock-day.
        sentiment_map (Dict[str, Dict[str, Any]]): A dictionary mapping
            normalized article texts to their sentiment analysis results
            (including 'polarity' and 'confidence').

    Returns:
        float: The aggregated sentiment score, bounded between -1.0 and 1.0.
               Returns 0.0 if there are no articles or if total confidence is zero.
    """
    # Initialize numerator and denominator for the weighted average calculation.
    numerator = 0.0
    denominator = 0.0

    # Input validation: ensure the cell content is a list.
    if not isinstance(articles, list):
        return 0.0

    # Iterate through each article for the given stock-day.
    for article_text in articles:
        # Normalize the article text in the same way as the sentiment_map keys
        # to ensure a successful lookup.
        normalized_text = article_text.strip().lower()

        # Look up the sentiment results for the article.
        result = sentiment_map.get(normalized_text)

        # If the article has a sentiment score, incorporate it.
        if result:
            # Retrieve the polarity (s_k) and confidence (p_k).
            polarity = result.get('polarity', 0.0)
            confidence = result.get('confidence', 0.0)

            # Add to the weighted sum.
            # Numerator: Σ (confidence * polarity)
            numerator += confidence * polarity
            # Denominator: Σ confidence
            denominator += confidence

    # --- Final Score Calculation ---
    # Avoid division by zero. If there are no articles or all have zero
    # confidence, the sentiment is defined as neutral (0.0).
    if denominator == 0:
        return 0.0
    else:
        # Compute the final weighted average.
        return numerator / denominator


# ------------------------------------------------------------------------------
# Task 10, Step 3: Add the sentiment score as a new column in the DataFrame.
# ------------------------------------------------------------------------------

def _add_sentiment_score_to_dataframe(
    df: pd.DataFrame,
    sentiment_map: Dict[str, Dict[str, Any]]
) -> pd.DataFrame:
    """
    Applies the sentiment aggregation to the entire DataFrame and adds the
    result as a new column.

    Args:
        df (pd.DataFrame): The DataFrame containing the 'News_Articles' column.
        sentiment_map (Dict[str, Dict[str, Any]]): The map of article sentiment results.

    Returns:
        pd.DataFrame: A copy of the DataFrame with the new 'Sentiment_Score' column.
    """
    # Work on a copy to avoid side effects.
    df_with_sentiment = df.copy()

    logging.info("Aggregating article-level sentiment into stock-day scores...")
    # Use the .apply() method to run the aggregation function on each row's
    # 'News_Articles' list. This is the most direct way to perform this
    # custom row-wise operation.
    sentiment_scores = df_with_sentiment['News_Articles'].apply(
        _calculate_weighted_sentiment,
        args=(sentiment_map,)
    )

    # Assign the resulting Series to a new column in the DataFrame.
    df_with_sentiment['Sentiment_Score'] = sentiment_scores

    # --- Post-computation Validation ---
    # The final score must be mathematically bounded between -1 and 1.
    # We use a small tolerance for floating-point comparisons.
    tolerance = 1e-9
    if not df_with_sentiment['Sentiment_Score'].between(-1 - tolerance, 1 + tolerance).all():
        raise ValueError("Calculated 'Sentiment_Score' is out of the expected [-1, 1] bounds.")

    # Log summary statistics for the newly created feature.
    logging.info("'Sentiment_Score' column added successfully. Summary statistics:")
    logging.info(df_with_sentiment['Sentiment_Score'].describe().to_string())

    return df_with_sentiment


# ------------------------------------------------------------------------------
# Task 10, Orchestrator Function
# ------------------------------------------------------------------------------

def aggregate_stock_day_sentiment(
    df_filtered: pd.DataFrame,
    sentiment_results: Dict[str, Dict[str, Any]]
) -> pd.DataFrame:
    """
    Orchestrates the aggregation of article-level sentiment into a stock-day
    level time series feature.

    This function takes the raw sentiment scores for individual articles and
    computes a single, confidence-weighted sentiment score for each
    stock-day observation in the main DataFrame.

    Args:
        df_filtered (pd.DataFrame): The DataFrame with the filtered
                                    'News_Articles' column from Task 8.
        sentiment_results (Dict[str, Dict[str, Any]]): The dictionary mapping
            each unique article to its sentiment analysis results, including
            'polarity' and 'confidence', from Task 9.

    Returns:
        pd.DataFrame: The DataFrame, now enriched with a 'Sentiment_Score'
                      column representing the aggregated sentiment for each
                      stock-day.
    """
    logging.info("Initiating stock-day sentiment aggregation pipeline...")

    # --- Input Validation ---
    if 'News_Articles' not in df_filtered.columns:
        raise ValueError("Input DataFrame is missing the 'News_Articles' column.")
    if not isinstance(sentiment_results, dict) or not sentiment_results:
        raise ValueError("'sentiment_results' must be a non-empty dictionary.")

    # --- Steps 1, 2, and 3 are combined in this single, efficient function call ---
    # The weighting logic (Step 1) is inside _calculate_weighted_sentiment.
    # The computation (Step 2) and column addition (Step 3) are handled by
    # the _add_sentiment_score_to_dataframe function.
    df_with_sentiment = _add_sentiment_score_to_dataframe(df_filtered, sentiment_results)

    logging.info("Stock-day sentiment aggregation pipeline finished successfully.")

    return df_with_sentiment
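

A small worked example may help make the confidence-weighted average S_i,t = (Σ p_k * s_k) / (Σ p_k) concrete. The sketch below is standalone and assumes a toy sentiment map with invented polarities and confidences. `weighted_sentiment` is a hypothetical, simplified stand-in for `_calculate_weighted_sentiment`.

```python
from typing import Any, Dict, List


def weighted_sentiment(
    articles: List[str], sentiment_map: Dict[str, Dict[str, Any]]
) -> float:
    """Confidence-weighted mean polarity; 0.0 when nothing can be scored."""
    numerator = 0.0
    denominator = 0.0
    for text in articles:
        # Normalize the lookup key the same way as the code above.
        result = sentiment_map.get(text.strip().lower())
        if result:
            numerator += result["confidence"] * result["polarity"]
            denominator += result["confidence"]
    return numerator / denominator if denominator else 0.0


# Toy map with invented values: one positive, one negative, one neutral article.
toy_map = {
    "a": {"polarity": 1.0, "confidence": 0.9},
    "b": {"polarity": -1.0, "confidence": 0.6},
    "c": {"polarity": 0.0, "confidence": 0.5},
}

# "unseen" is not in the map, so it contributes to neither sum.
score = weighted_sentiment(["A ", "b", "c", "unseen"], toy_map)

# Numerator: 0.9 - 0.6 + 0.0 = 0.3; denominator: 0.9 + 0.6 + 0.5 = 2.0.
print(f"S = {score:.4f}")  # approximately 0.15
```

The neutral article adds nothing to the numerator but still adds its confidence to the denominator, so confident neutral coverage pulls the score toward zero.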


In [None]:
# Task 11: Aggregate article-level sentiment into market-level sentiment shares

# ==============================================================================
# Task 11: Aggregate article-level sentiment into market-level sentiment shares S_t^c
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 11, Step 1: Construct confidence-weighted one-hot vectors per article.
# ------------------------------------------------------------------------------

def _augment_with_one_hot_vectors(
    sentiment_results: Dict[str, Dict[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """
    Augments the sentiment results dictionary with confidence-weighted one-hot vectors.

    For each article, this function computes a vector representing the sentiment
    distribution, where the predicted class's position holds the confidence
    score and all others are zero.

    Equation: s^c = p if ĉ = c, else 0

    Args:
        sentiment_results (Dict[str, Dict[str, Any]]): The dictionary of
            inference results from Task 9.

    Returns:
        Dict[str, Dict[str, Any]]: The augmented dictionary with an added
                                   'one_hot_weighted' key for each article.
    """
    # Define the universe of possible sentiment classes.
    sentiment_classes = ['positive', 'neutral', 'negative']

    # Iterate through each article's sentiment result to add the one-hot vector.
    for result in sentiment_results.values():
        # Initialize a one-hot vector with zeros.
        one_hot_vector = {s_class: 0.0 for s_class in sentiment_classes}

        # Get the predicted class and confidence score.
        predicted_class = result.get('class')
        confidence = result.get('confidence', 0.0)

        # If the predicted class is valid, place the confidence score in the
        # corresponding position of the one-hot vector.
        if predicted_class in one_hot_vector:
            one_hot_vector[predicted_class] = confidence
        else:
            logging.warning(f"Found unexpected sentiment class '{predicted_class}'. It will be ignored in one-hot vector creation.")

        # Add the computed vector to the results dictionary.
        result['one_hot_weighted'] = one_hot_vector

    return sentiment_results


# ------------------------------------------------------------------------------
# Task 11, Step 2: Aggregate by date to obtain market-level sentiment shares.
# ------------------------------------------------------------------------------

def _aggregate_daily_market_sentiment(
    df: pd.DataFrame,
    sentiment_map: Dict[str, Dict[str, Any]],
    master_calendar: pd.DatetimeIndex
) -> pd.DataFrame:
    """
    Aggregates article sentiments to compute daily market-wide sentiment shares.

    This function creates a long-form DataFrame of all article occurrences,
    joins the sentiment data, and then performs a highly efficient groupby
    operation to calculate the daily shares.

    Equation: S_t^c = (Σ s_k^c) / (Σ p_k) for all articles k on day t.

    Args:
        df (pd.DataFrame): The main DataFrame with the 'News_Articles' column.
        sentiment_map (Dict[str, Dict[str, Any]]): The augmented map of article
                                                   sentiment results.
        master_calendar (pd.DatetimeIndex): The master list of all trading days.

    Returns:
        pd.DataFrame: A DataFrame indexed by date with columns for the share
                      of negative, neutral, and positive sentiment.
    """
    logging.info("Creating long-form DataFrame of all article occurrences for efficient aggregation.")

    # --- Create a mapping from normalized text to sentiment data ---
    # This avoids repeated normalization during the main loop.
    normalized_sentiment_map = {
        text.strip().lower(): data
        for text, data in sentiment_map.items()
    }

    # --- Build a list of records for all article occurrences ---
    article_records = []
    # Iterate through each row of the main DataFrame.
    for (date, _), row in df.iterrows():
        articles = row['News_Articles']
        if isinstance(articles, list):
            for article_text in articles:
                # For each article, create a record with its date and normalized text.
                article_records.append({
                    'Date': date,
                    'normalized_text': article_text.strip().lower()
                })

    if not article_records:
        logging.warning("No articles found in the DataFrame to aggregate. Returning neutral sentiment shares.")
        # Return a correctly structured DataFrame that uses the same neutral
        # default applied to no-news days, so the shares still sum to 1.0.
        return pd.DataFrame(
            {'Market_Sentiment_Neg': 0.0, 'Market_Sentiment_Neu': 1.0, 'Market_Sentiment_Pos': 0.0},
            index=master_calendar
        )

    # Convert the list of records into a DataFrame.
    articles_df = pd.DataFrame(article_records)

    # --- Map sentiment data to the articles DataFrame ---
    # Extract one-hot vectors and confidence scores using the map.
    articles_df['sentiment_data'] = articles_df['normalized_text'].map(normalized_sentiment_map)
    articles_df.dropna(subset=['sentiment_data'], inplace=True) # Drop articles not in map

    articles_df['confidence'] = articles_df['sentiment_data'].apply(lambda x: x['confidence'])
    # Create columns for each component of the one-hot vector.
    one_hot_df = pd.DataFrame(articles_df['sentiment_data'].apply(lambda x: x['one_hot_weighted']).tolist(), index=articles_df.index)

    # Combine into a single DataFrame for aggregation.
    aggregation_df = pd.concat([articles_df[['Date', 'confidence']], one_hot_df], axis=1)

    # --- Perform GroupBy Aggregation ---
    # Group by date and sum the confidence and one-hot vector components.
    daily_sums = aggregation_df.groupby('Date').sum()

    # --- Calculate Sentiment Shares ---
    # The denominator is the total confidence for the day.
    total_daily_confidence = daily_sums['confidence']

    # The numerators are the sums of the weighted one-hot vectors.
    market_sentiment_df = pd.DataFrame(index=daily_sums.index)
    market_sentiment_df['Market_Sentiment_Neg'] = daily_sums['negative'].div(total_daily_confidence).fillna(0)
    market_sentiment_df['Market_Sentiment_Neu'] = daily_sums['neutral'].div(total_daily_confidence).fillna(0)
    market_sentiment_df['Market_Sentiment_Pos'] = daily_sums['positive'].div(total_daily_confidence).fillna(0)

    # --- Finalize the DataFrame ---
    # Ensure all days from the master calendar are present, filling missing days with a neutral default.
    market_sentiment_df = market_sentiment_df.reindex(master_calendar)
    market_sentiment_df.fillna({'Market_Sentiment_Neu': 1.0, 'Market_Sentiment_Neg': 0.0, 'Market_Sentiment_Pos': 0.0}, inplace=True)

    return market_sentiment_df


# ------------------------------------------------------------------------------
# Task 11, Step 3: Merge market-level sentiment shares into the DataFrame.
# ------------------------------------------------------------------------------

def _merge_market_sentiment_shares(
    df: pd.DataFrame,
    market_sentiment_df: pd.DataFrame
) -> pd.DataFrame:
    """
    Merges the daily market sentiment shares into the main DataFrame.

    Args:
        df (pd.DataFrame): The main DataFrame.
        market_sentiment_df (pd.DataFrame): Date-indexed DataFrame of sentiment shares.

    Returns:
        pd.DataFrame: A copy of the main DataFrame with three new market
                      sentiment columns.
    """
    # Work on a copy.
    df_merged = df.copy()

    logging.info("Merging daily market sentiment shares into the main DataFrame.")
    # Perform a left merge. This broadcasts the daily market sentiment values
    # to all tickers present on that day.
    df_merged = df_merged.merge(
        market_sentiment_df,
        left_on='Date',
        right_index=True,
        how='left'
    )

    # --- Post-merge Validation ---
    # Check that the merge did not introduce NaNs.
    new_cols = ['Market_Sentiment_Neg', 'Market_Sentiment_Neu', 'Market_Sentiment_Pos']
    if df_merged[new_cols].isna().any().any():
        raise RuntimeError("Merging market sentiment shares resulted in unexpected NaNs.")

    # Check that the shares on each day sum to 1.0 (within a tolerance).
    daily_sums = df_merged.groupby('Date')[new_cols].first().sum(axis=1)
    if not np.allclose(daily_sums, 1.0, atol=1e-6):
        logging.warning("Market sentiment shares do not sum to 1.0 on all days. Check aggregation logic.")

    return df_merged


# ------------------------------------------------------------------------------
# Task 11, Orchestrator Function
# ------------------------------------------------------------------------------

def aggregate_market_level_sentiment(
    df_with_sentiment: pd.DataFrame,
    sentiment_results: Dict[str, Dict[str, Any]],
    master_calendar: pd.DatetimeIndex
) -> pd.DataFrame:
    """
    Orchestrates the aggregation of article sentiment into market-level shares.

    This function computes the daily proportion of negative, neutral, and
    positive sentiment across all articles in the corpus and merges these
    features into the main DataFrame. The pipeline includes:
    1.  Augmenting sentiment results with confidence-weighted one-hot vectors.
    2.  Aggregating these vectors on a daily basis to compute sentiment shares.
    3.  Merging these daily shares back into the main panel DataFrame.

    Args:
        df_with_sentiment (pd.DataFrame): The DataFrame from Task 10.
        sentiment_results (Dict[str, Dict[str, Any]]): The dictionary of
            sentiment results from Task 9.
        master_calendar (pd.DatetimeIndex): The master list of all trading days
                                            from Task 6.

    Returns:
        pd.DataFrame: The DataFrame enriched with three new columns:
                      'Market_Sentiment_Neg', 'Market_Sentiment_Neu',
                      'Market_Sentiment_Pos'.
    """
    logging.info("Initiating market-level sentiment aggregation pipeline...")

    # --- Step 1: Construct confidence-weighted one-hot vectors. ---
    augmented_sentiment_map = _augment_with_one_hot_vectors(sentiment_results)
    logging.info("Step 1/3: Augmented sentiment results with one-hot vectors.")

    # --- Step 2: Aggregate by date to get market sentiment shares. ---
    market_sentiment_df = _aggregate_daily_market_sentiment(
        df_with_sentiment, augmented_sentiment_map, master_calendar
    )
    logging.info("Step 2/3: Aggregated daily market sentiment shares.")

    # --- Step 3: Merge the market-level shares into the main DataFrame. ---
    df_final = _merge_market_sentiment_shares(df_with_sentiment, market_sentiment_df)
    logging.info("Step 3/3: Merged market sentiment shares into the main DataFrame.")

    logging.info("Market-level sentiment aggregation pipeline finished successfully.")

    return df_final
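

The share computation S_t^c = (Σ s_k^c) / (Σ p_k) can be illustrated on a handful of records. The sketch below is dependency-free and uses plain dictionaries instead of the pandas `groupby` in the code above. The dates, classes, and confidences are invented, and `daily_shares` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CLASSES = ("negative", "neutral", "positive")

# Hypothetical article occurrences: (date, predicted class, confidence).
records: List[Tuple[str, str, float]] = [
    ("2024-01-02", "positive", 0.8),
    ("2024-01-02", "negative", 0.4),
    ("2024-01-02", "positive", 0.8),
    ("2024-01-03", "neutral", 0.7),
]


def daily_shares(
    rows: List[Tuple[str, str, float]]
) -> Dict[str, Dict[str, float]]:
    """Sum confidence per class and per day, then divide by the daily total."""
    class_sums: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {c: 0.0 for c in CLASSES}
    )
    totals: Dict[str, float] = defaultdict(float)
    for date, predicted_class, confidence in rows:
        # One-hot weighting: only the predicted class receives the confidence.
        class_sums[date][predicted_class] += confidence
        totals[date] += confidence
    return {
        date: {
            c: (class_sums[date][c] / totals[date] if totals[date] else 0.0)
            for c in CLASSES
        }
        for date in class_sums
    }


shares = daily_shares(records)
for date, by_class in shares.items():
    print(date, {c: round(v, 3) for c, v in by_class.items()})
```

Because every article's confidence enters exactly one class sum and the daily total, the three shares sum to 1.0 on any day with at least one scored article.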


In [None]:
# Task 12: Construct the Hype Index

# ==============================================================================
# Task 12: Construct the Hype Index H_i,t (attention share)
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 12, Step 1: Count articles per stock-day N_i,t.
# ------------------------------------------------------------------------------

def _count_articles_per_stock_day(df: pd.DataFrame) -> pd.DataFrame:
    """
    Counts the number of articles for each stock-day observation.

    Args:
        df (pd.DataFrame): The DataFrame containing the 'News_Articles' column,
                           where each cell is a list of strings.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with a new 'N_Articles'
                      column of integer counts.
    """
    # Work on a copy to avoid side effects.
    df_with_counts = df.copy()

    logging.info("Counting articles for each stock-day...")
    # Define a safe length function that returns 0 for non-list inputs.
    def safe_len(item):
        return len(item) if isinstance(item, list) else 0

    # Apply the safe length function to the 'News_Articles' column.
    df_with_counts['N_Articles'] = df_with_counts['News_Articles'].apply(safe_len)

    # Ensure the new column is of integer type.
    df_with_counts['N_Articles'] = df_with_counts['N_Articles'].astype('int64')

    return df_with_counts


# ------------------------------------------------------------------------------
# Task 12, Step 2: Compute the market-wide article count N_mkt,t per day.
# ------------------------------------------------------------------------------

def _compute_daily_market_article_counts(df: pd.DataFrame) -> pd.Series:
    """
    Aggregates article counts to get the total number of articles per day.

    Equation: N_mkt,t = Σ_i N_i,t

    Args:
        df (pd.DataFrame): The DataFrame containing the 'N_Articles' column.

    Returns:
        pd.Series: A Series indexed by 'Date' containing the total article
                   count for each day.
    """
    logging.info("Computing total market-wide article counts per day...")
    # Group by the 'Date' level of the index and sum the 'N_Articles'.
    # This is a highly efficient, vectorized operation.
    market_counts = df.groupby(level='Date')['N_Articles'].sum()

    return market_counts


# ------------------------------------------------------------------------------
# Task 12, Step 3: Compute the Hype Index and add it to the DataFrame.
# ------------------------------------------------------------------------------

def _compute_and_validate_hype_index(
    df: pd.DataFrame,
    market_counts: pd.Series
) -> pd.DataFrame:
    """
    Computes the Hype Index and validates its properties as a probability measure.

    Equation: H_i,t = N_i,t / N_mkt,t

    Args:
        df (pd.DataFrame): The DataFrame with 'N_Articles'.
        market_counts (pd.Series): The date-indexed series of total daily counts.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with the new 'Hype_Index'
                      column.
    """
    # Work on a copy.
    df_with_hype = df.copy()

    logging.info("Computing the Hype Index (daily attention share)...")
    # --- Hype Index Calculation ---
    # Map the daily market counts to each row based on its date.
    # This creates a column where each row has the total for its day.
    df_with_hype['N_mkt'] = df_with_hype.index.get_level_values('Date').map(market_counts)

    # Compute the Hype Index via vectorized division.
    df_with_hype['Hype_Index'] = df_with_hype['N_Articles'] / df_with_hype['N_mkt']

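    # Handle division by zero: on days with no news (N_mkt = 0) the ratio is
    # 0/0 = NaN, so map NaN (and any inf) to 0 in the 'Hype_Index' column only.
    # Assigning the result back avoids chained in-place modification and leaves
    # every other column of the DataFrame untouched.
    df_with_hype['Hype_Index'] = (
        df_with_hype['Hype_Index'].replace([np.inf, -np.inf], np.nan).fillna(0.0)
    )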

    # Clean up the intermediate market count column.
    df_with_hype.drop(columns=['N_mkt'], inplace=True)

    # --- Post-computation Validation ---
    # The Hype Index for each day must sum to 1.0 (or 0.0 on no-news days).
    # This confirms it is a valid probability measure.
    daily_hype_sums = df_with_hype.groupby(level='Date')['Hype_Index'].sum()
    # Use np.isclose for an element-wise, tolerance-aware comparison per date.
    # (np.allclose returns a single boolean, which cannot be used as a mask.)
    is_valid_measure = (
        np.isclose(daily_hype_sums, 1.0, atol=1e-9)
        | np.isclose(daily_hype_sums, 0.0, atol=1e-9)
    )

    if not is_valid_measure.all():
        # Find and log specific dates that fail the check for easier debugging.
        invalid_dates = daily_hype_sums[~is_valid_measure]
        logging.warning(f"Hype Index validation failed. The sum across tickers does not equal 1.0 on {len(invalid_dates)} days.")
        logging.warning(f"Example invalid daily sums:\n{invalid_dates.head().to_string()}")
    else:
        logging.info("Hype Index validated successfully as a daily probability measure.")

    return df_with_hype


# ------------------------------------------------------------------------------
# Task 12, Orchestrator Function
# ------------------------------------------------------------------------------

def construct_hype_index(df_market_sentiment: pd.DataFrame) -> pd.DataFrame:
    """
    Orchestrates the construction of the Hype Index feature.

    This function calculates a measure of relative media attention for each
    stock on each day. The pipeline includes:
    1.  Counting the number of articles per stock-day (N_i,t).
    2.  Aggregating these to find the total number of articles market-wide
        each day (N_mkt,t).
    3.  Computing the Hype Index as the ratio H_i,t = N_i,t / N_mkt,t and
        validating its properties.

    Args:
        df_market_sentiment (pd.DataFrame): The DataFrame from Task 11, which
            contains the filtered 'News_Articles' column.

    Returns:
        pd.DataFrame: The DataFrame enriched with the 'Hype_Index' and
                      'N_Articles' columns.
    """
    logging.info("Initiating Hype Index construction pipeline...")

    # --- Input Validation ---
    if 'News_Articles' not in df_market_sentiment.columns:
        raise ValueError("Input DataFrame is missing the 'News_Articles' column.")

    # --- Step 1: Count articles per stock-day. ---
    df_with_counts = _count_articles_per_stock_day(df_market_sentiment)
    logging.info("Step 1/3: Article counts per stock-day computed.")

    # --- Step 2: Compute market-wide article counts per day. ---
    market_counts = _compute_daily_market_article_counts(df_with_counts)
    logging.info("Step 2/3: Market-wide daily article counts computed.")

    # --- Step 3: Compute and validate the Hype Index. ---
    df_final = _compute_and_validate_hype_index(df_with_counts, market_counts)
    logging.info("Step 3/3: Hype Index computed and validated.")

    logging.info("Hype Index construction pipeline finished successfully.")

    return df_final
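
The Hype Index pipeline above can be illustrated on a toy panel. This is a minimal sketch, not the project's implementation: the tickers, dates, and counts are made up, and it assumes a `(Date, TICKER)` MultiIndex like the one the functions above group on. It uses `groupby(...).transform("sum")` to broadcast the daily total back to each row, which is one alternative to the `map` step used above.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical data): two dates, three tickers.
# The second date has no news at all.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2024-01-02", "2024-01-03"]), ["AAA", "BBB", "CCC"]],
    names=["Date", "TICKER"],
)
toy = pd.DataFrame({"N_Articles": [3, 1, 0, 0, 0, 0]}, index=index)

# N_mkt,t broadcast back to every stock-day row.
n_mkt = toy.groupby(level="Date")["N_Articles"].transform("sum")

# H_i,t = N_i,t / N_mkt,t; a zero denominator becomes NaN, then 0.
toy["Hype_Index"] = (toy["N_Articles"] / n_mkt.replace(0, np.nan)).fillna(0.0)

# Per-day sums: 1.0 on days with news, 0.0 on no-news days.
daily_sums = toy.groupby(level="Date")["Hype_Index"].sum()
print(toy)
print(daily_sums)
```

Replacing the zero denominator with NaN before dividing keeps infinities out of the result, so only a single `fillna` is needed afterwards.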


In [None]:
# Task 13: Define rolling windows for LPPL calibration

# ==============================================================================
# Task 13: Define rolling windows for LPPL calibration
# ==============================================================================

# Define a structured data container for window metadata for clarity and type safety.
class LPPLWindow(NamedTuple):
    """
    A structured data container for storing the metadata and data of a single
    rolling window intended for LPPL (Log-Periodic Power Law) model calibration.

    This class uses a NamedTuple for immutability and clarity, ensuring that
    each window object is a lightweight, self-contained package of information
    required by the downstream optimization routines.

    Attributes:
        ticker (str):
            The unique identifier (e.g., stock ticker) for the time series
            from which the window was extracted. This is essential for tracking
            and aggregating results on a per-security basis.

        end_date (pd.Timestamp):
            The calendar date corresponding to the last observation in this
            window. This acts as the anchor point in time for the window,
            allowing results of the LPPL fit to be mapped back to the correct
            point in the master DataFrame.

        log_price_series (pd.Series):
            A pandas Series containing the log-price data for the window.
            Crucially, the index of this Series is not a DatetimeIndex but a
            simple integer index ranging from 1 to W (where W is the window
            size). This numerical time index `t` is required for the mathematical
            formulation of the LPPL model during the non-linear optimization process.
    """
    # The unique identifier for the security (e.g., 'AAPL').
    ticker: str

    # The timestamp of the last data point in the window.
    end_date: pd.Timestamp

    # The series of log-prices for the window, indexed from 1 to W.
    log_price_series: pd.Series

# ------------------------------------------------------------------------------
# Task 13, Step 1: Extract the rolling window size from configuration.
# ------------------------------------------------------------------------------

def _get_lppl_window_size(config: Dict[str, Any]) -> int:
    """
    Retrieves and validates the rolling window size for LPPL fitting from the
    configuration.

    Args:
        config (Dict[str, Any]): The study configuration dictionary.

    Returns:
        int: The validated rolling window size.

    Raises:
        KeyError: If the required key is missing from the configuration.
    """
    try:
        # Access the nested key for the window size.
        window_size = config['descriptive_model']['lppl_fitting']['rolling_window_size']
        # The type and range of this parameter were already validated in Task 1,
        # so we can be confident in its value here.
        logging.info(f"LPPL rolling window size set to {window_size} trading days.")
        return window_size
    except KeyError as e:
        # This error should not occur if the config was validated by Task 1.
        logging.error("Configuration key for LPPL rolling window size is missing.")
        raise KeyError("Missing 'descriptive_model.lppl_fitting.rolling_window_size' in configuration.") from e


# ------------------------------------------------------------------------------
# Task 13, Step 2: Generate window start and end indices for each ticker.
# ------------------------------------------------------------------------------

def _generate_lppl_windows(df: pd.DataFrame, window_size: int) -> List[LPPLWindow]:
    """
    Generates a list of valid rolling windows for LPPL calibration.

    For each ticker, this function creates overlapping windows of a specified
    size. A window is considered valid only if it is full (exactly W
    observations) and contains no NaNs (e.g., from data gaps). Each window's
    data is given a simple integer time index (1 to W), as required by the
    LPPL optimization algorithm.

    Args:
        df (pd.DataFrame): The DataFrame containing the 'Log_Price' column.
        window_size (int): The number of observations in each rolling window.

    Returns:
        List[LPPLWindow]: A list of structured window metadata objects.
    """
    logging.info(f"Generating rolling windows of size {window_size} for LPPL fitting...")

    # This list will store the metadata for all valid windows across all tickers.
    all_windows: List[LPPLWindow] = []

    # Group the DataFrame by ticker to process each time series independently.
    grouped_by_ticker = df.groupby(level='TICKER')['Log_Price']

    for ticker, log_price_series in grouped_by_ticker:
        # Check if the ticker has enough data points to form at least one window.
        if len(log_price_series) < window_size:
            continue # Skip tickers with insufficient history.

        # Use the .rolling() method to create an iterator of windows.
        # This is highly memory-efficient as it doesn't load all windows at once.
        for window in log_price_series.rolling(window=window_size):
            # The rolling object yields partial windows at the start. We only want full ones.
            if len(window) < window_size:
                continue

            # A window is invalid if it contains any NaNs (e.g., from data gaps).
            if window.isna().any():
                continue

            # --- Prepare the window data for LPPL fitting ---
            # The LPPL formula requires a simple integer time index from 1 to W.
            # We reset the index of the window Series to create this.
            window_data = window.reset_index(drop=True)
            window_data.index = window_data.index + 1 # Index from 1 to W

            # The window keeps the original MultiIndex, so read the end date from
            # the 'Date' level by name rather than relying on level position.
            end_date = window.index.get_level_values('Date')[-1]

            # Create the structured metadata object for this valid window.
            lppl_window = LPPLWindow(
                ticker=ticker,
                end_date=end_date,
                log_price_series=window_data
            )
            all_windows.append(lppl_window)

    return all_windows


# ------------------------------------------------------------------------------
# Task 13, Step 3: Persist window metadata and log statistics.
# ------------------------------------------------------------------------------

def _persist_windows_and_log_stats(
    windows: List[LPPLWindow],
    output_path: Path
) -> None:
    """
    Saves the list of generated LPPL windows to disk and logs summary statistics.

    Args:
        windows (List[LPPLWindow]): The list of window metadata objects.
        output_path (Path): The file path to save the pickled list to.
    """
    # --- Persistence ---
    # Ensure the parent directory for the output file exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)

    logging.info(f"Saving {len(windows):,} generated LPPL windows to '{output_path}'...")
    # Use pickle to serialize the list of NamedTuple objects.
    with open(output_path, 'wb') as f:
        pickle.dump(windows, f)

    # --- Logging Statistics ---
    if windows:
        # Calculate statistics on the generated windows.
        num_windows = len(windows)
        tickers_covered = {w.ticker for w in windows}
        num_tickers = len(tickers_covered)

        logging.info(f"Successfully generated and saved {num_windows:,} valid windows.")
        logging.info(f"Coverage: {num_tickers} tickers.")
    else:
        logging.warning("No valid LPPL windows were generated. Check data for sufficient length and continuity.")


# ------------------------------------------------------------------------------
# Task 13, Orchestrator Function
# ------------------------------------------------------------------------------

def define_lppl_calibration_windows(
    df_features: pd.DataFrame,
    study_parameters: Dict[str, Any],
    output_path: Union[str, Path] = "data_intermediate/lppl_windows.pkl"
) -> List[LPPLWindow]:
    """
    Orchestrates the definition and generation of rolling windows for LPPL fitting.

    This function is idempotent: it will load the window definitions from disk
    if they have been previously generated. Otherwise, it performs:
    1.  Extracting the rolling window size from the configuration.
    2.  Generating all valid, complete rolling windows for each ticker.
    3.  Persisting the list of window metadata to disk for reproducibility and
        to avoid re-computation.

    Args:
        df_features (pd.DataFrame): The DataFrame containing all engineered
                                    features, including 'Log_Price'.
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        output_path (Union[str, Path]): The file path to save/load the list
                                        of LPPL windows.

    Returns:
        List[LPPLWindow]: A list of structured objects, each containing the
                          metadata and data for a single window to be fitted.
    """
    logging.info("Initiating LPPL calibration window definition pipeline...")
    output_path = Path(output_path)

    # --- Idempotency Check ---
    # If the window file already exists, load and return it to skip computation.
    if output_path.exists():
        logging.info(f"Found existing LPPL windows at '{output_path}'. Loading from file.")
        with open(output_path, 'rb') as f:
            return pickle.load(f)

    # --- Input Validation ---
    if 'Log_Price' not in df_features.columns:
        raise ValueError("Input DataFrame is missing the required 'Log_Price' column.")

    # --- Step 1: Extract the rolling window size. ---
    window_size = _get_lppl_window_size(study_parameters)
    logging.info("Step 1/3: LPPL rolling window size extracted.")

    # --- Step 2: Generate all valid rolling windows. ---
    windows = _generate_lppl_windows(df_features, window_size)
    logging.info("Step 2/3: All valid rolling windows generated.")

    # --- Step 3: Persist the windows and log summary statistics. ---
    _persist_windows_and_log_stats(windows, output_path)
    logging.info("Step 3/3: Window metadata persisted and statistics logged.")

    logging.info("LPPL calibration window definition pipeline finished successfully.")

    return windows
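
The window preparation above (keep only full, NaN-free windows and re-index them from 1 to W) can be sketched on a single toy series. This is an illustration under assumptions: the prices, dates, and `W = 4` are invented, and the series has a plain `DatetimeIndex` rather than the project's MultiIndex. Iterating over a `Rolling` object requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical log-price series for one ticker on six business days.
dates = pd.bdate_range("2024-01-01", periods=6)
log_price = pd.Series(np.log([100, 101, 103, 102, 105, 107]), index=dates)
W = 4  # illustrative window size

windows = []
for window in log_price.rolling(window=W):
    # Skip the partial windows yielded at the start, and any with gaps.
    if len(window) < W or window.isna().any():
        continue
    # Replace the date index with the integer time index t = 1..W.
    data = window.reset_index(drop=True)
    data.index = data.index + 1
    # Keep the last calendar date as the window's anchor.
    windows.append((window.index[-1], data))

for end_date, data in windows:
    print(end_date.date(), list(data.index))
```

Six observations with `W = 4` give three full windows, one ending on each of the last three dates.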


In [None]:
# Task 14: Initialize LPPL parameter bounds and multi-start seeds

# ==============================================================================
# Task 14: Initialize LPPL parameter bounds and multi-start seeds
# ==============================================================================

# Standardized order of LPPL parameters to be used throughout the fitting process.
LPPL_PARAM_ORDER = ['A', 'B', 'C', 'm', 'omega', 'phi', 't_c']

# ------------------------------------------------------------------------------
# Task 14, Step 1: Initialize LPPL parameter bounds.
# ------------------------------------------------------------------------------

def _define_lppl_parameter_bounds(
    config: Dict[str, Any],
    window_size: int
) -> Tuple[List[float], List[float]]:
    """
    Defines the lower and upper bounds for the 7 LPPL model parameters.

    The bounds are derived from the study configuration and theoretical
    constraints of the LPPL model. The bounds for the critical time `t_c` are
    dynamically calculated based on the window size.

    Args:
        config (Dict[str, Any]): The study configuration dictionary.
        window_size (int): The size of the fitting window (W).

    Returns:
        Tuple[List[float], List[float]]: A tuple containing two lists:
            - The lower bounds for all 7 parameters in a standard order.
            - The upper bounds for all 7 parameters in the same standard order.
            This format is directly compatible with `scipy.optimize.least_squares`.
    """
    # Retrieve the static parameter constraints from the configuration.
    constraints = config['descriptive_model']['lppl_fitting']['parameter_constraints']

    # Define bounds for all 7 parameters. Note that t_c is relative to the
    # window's integer index (1 to W). The end of the window is at t=W.
    bounds_map = {
        'A': (-np.inf, np.inf),
        'B': (constraints['B']['min'], constraints['B']['max']),
        'C': (-np.inf, np.inf), # Unconstrained amplitude
        'm': (constraints['m']['min'], constraints['m']['max']),
        'omega': (constraints['omega']['min'], constraints['omega']['max']),
        'phi': (-2 * np.pi, 2 * np.pi), # Phase bounded to [-2*pi, 2*pi], wider than one full cycle
        't_c': (window_size + 5, window_size + 250) # t_c must be in the future
    }

    # Create the lower and upper bound lists in the standardized order.
    lower_bounds = [bounds_map[param][0] for param in LPPL_PARAM_ORDER]
    upper_bounds = [bounds_map[param][1] for param in LPPL_PARAM_ORDER]

    return lower_bounds, upper_bounds


# ------------------------------------------------------------------------------
# Task 14, Step 2: Generate multi-start initialization seeds.
# ------------------------------------------------------------------------------

def generate_lppl_initial_seeds(
    log_price_series: pd.Series,
    window_size: int,
    num_seeds: int = 10,
    seed: int = 42
) -> List[np.ndarray]:
    """
    Generates a set of random initial parameter guesses (seeds) for the LPPL
    optimization, using a multi-start strategy to avoid local minima.

    Args:
        log_price_series (pd.Series): The log-price series for a single window.
        window_size (int): The size of the fitting window (W).
        num_seeds (int): The number of different initial guesses to generate.
        seed (int): The seed for the random number generator for reproducibility.

    Returns:
        List[np.ndarray]: A list of 7-element numpy arrays, where each array
                          is a complete set of initial parameter guesses.
    """
    # Initialize a seeded random number generator for reproducible results.
    rng = np.random.default_rng(seed)

    # This list will store the generated seed vectors.
    initial_seeds = []

    # Data-driven parameters for initialization.
    mean_log_price = log_price_series.mean()

    for _ in range(num_seeds):
        # Generate one set of random initial parameters based on the specified distributions.
        seed_params = {
            'A': rng.normal(loc=mean_log_price, scale=0.1),
            'B': rng.uniform(-1.0, -0.01),
            'C': rng.uniform(-0.5, 0.5),
            'm': rng.uniform(0.1, 0.9),
            'omega': rng.uniform(2.0, 20.0),
            'phi': rng.uniform(-np.pi, np.pi),
            't_c': rng.uniform(window_size + 10, window_size + 100)
        }

        # Assemble the seed vector in the standardized order.
        seed_vector = np.array([seed_params[param] for param in LPPL_PARAM_ORDER])
        initial_seeds.append(seed_vector)

    return initial_seeds


# ------------------------------------------------------------------------------
# Task 14, Step 3: Document the initialization strategy.
# ------------------------------------------------------------------------------

def _document_initialization_strategy(
    bounds: Tuple[List[float], List[float]],
    config: Dict[str, Any],
    output_path: Path
) -> None:
    """
    Saves a JSON file documenting the LPPL initialization strategy.

    This creates a reproducible record of the parameter bounds, number of seeds,
    sampling distributions, and convergence tolerances used in the optimization.

    Args:
        bounds (Tuple[List[float], List[float]]): The lower and upper parameter bounds.
        config (Dict[str, Any]): The study configuration dictionary.
        output_path (Path): The file path for the JSON metadata log.
    """
    # Unpack the bounds tuple.
    lower_bounds, upper_bounds = bounds

    # Construct a dictionary of the metadata to be saved.
    metadata = {
        "parameter_order": LPPL_PARAM_ORDER,
        "parameter_bounds": {
            param: (lower, upper)
            for param, lower, upper in zip(LPPL_PARAM_ORDER, lower_bounds, upper_bounds)
        },
        "multi_start_seeds": {
            "num_seeds": 10, # Matches the default of generate_lppl_initial_seeds
            "sampling_distributions": {
                'A': "Normal(mean(log_price), 0.1)",
                'B': "Uniform(-1.0, -0.01)",
                'C': "Uniform(-0.5, 0.5)",
                'm': "Uniform(0.1, 0.9)",
                'omega': "Uniform(2.0, 20.0)",
                'phi': "Uniform(-pi, pi)",
                't_c': "Uniform(W+10, W+100)"
            }
        },
        "convergence_tolerances": {
            "ftol": 1e-8, # Standard tolerance for function value change
            "xtol": 1e-8, # Standard tolerance for parameter change
            "max_nfev": 1000 # Max number of function evaluations
        }
    }

    # Ensure the parent directory exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # Convert any numpy types to be JSON serializable.
    serializable_metadata = _make_json_serializable(metadata)

    # Write the metadata to the JSON file.
    logging.info(f"Saving LPPL initialization metadata to '{output_path}'.")
    with open(output_path, 'w') as f:
        json.dump(serializable_metadata, f, indent=4)


# ------------------------------------------------------------------------------
# Task 14, Orchestrator Function
# ------------------------------------------------------------------------------

def initialize_lppl_fitter(
    study_parameters: Dict[str, Any],
    log_dir: Union[str, Path] = "logs"
) -> Tuple[Tuple[List[float], List[float]], Dict[str, Any]]:
    """
    Orchestrates the setup of parameters for the LPPL fitting process.

    This function performs the three setup steps required for a robust,
    constrained, multi-start optimization:
    1.  Defines the lower and upper bounds for each of the 7 LPPL parameters.
    2.  Establishes the strategy for creating multiple random starting points
        (seeds) to mitigate the risk of converging to local minima.
    3.  Documents this entire strategy in a metadata file for reproducibility.

    Args:
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        log_dir (Union[str, Path]): Directory to save the metadata log.

    Returns:
        Tuple[Tuple[List[float], List[float]], Dict[str, Any]]: A tuple containing:
            - A tuple of (lower_bounds, upper_bounds) for the optimizer.
            - The original study_parameters dictionary (passed through).
    """
    logging.info("Initiating LPPL fitter initialization pipeline...")

    # --- Step 1: Define parameter bounds. ---
    # This requires the window size from the config.
    window_size = study_parameters['descriptive_model']['lppl_fitting']['rolling_window_size']
    bounds = _define_lppl_parameter_bounds(study_parameters, window_size)
    logging.info("Step 1/3: LPPL parameter bounds defined successfully.")

    # --- Step 2: The seed generation logic is encapsulated in its own function. ---
    # This function (`generate_lppl_initial_seeds`) will be called within the
    # main fitting loop (Task 15) for each specific window, as it is data-dependent.
    # This step is therefore a conceptual preparation.
    logging.info("Step 2/3: Multi-start seed generation strategy is defined.")

    # --- Step 3: Document the entire initialization strategy. ---
    metadata_path = Path(log_dir) / "lppl_initialization_metadata.json"
    _document_initialization_strategy(bounds, study_parameters, metadata_path)
    logging.info("Step 3/3: Initialization strategy documented successfully.")

    logging.info("LPPL fitter initialization pipeline finished successfully.")

    # Return the bounds and config needed for the next step.
    return bounds, study_parameters
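
One interaction between the bounds and the seeds is worth checking: `scipy.optimize.least_squares` raises a `ValueError` when the starting point lies outside the bounds, and the seed distributions above are fixed while the `B`, `m`, and `omega` bounds come from the configuration. The sketch below draws one seed from the same distributions and clips it into the box. The bound values and the mean log-price of 4.6 are hypothetical, chosen only for illustration.

```python
import numpy as np

PARAMS = ["A", "B", "C", "m", "omega", "phi", "t_c"]
W = 120  # hypothetical window size

# Hypothetical bounds; the real B, m, and omega limits come from the config.
lower = np.array([-np.inf, -5.0, -np.inf, 0.1, 6.0, -2 * np.pi, W + 5])
upper = np.array([np.inf, 0.0, np.inf, 0.9, 13.0, 2 * np.pi, W + 250])

rng = np.random.default_rng(42)
raw_seed = np.array([
    rng.normal(4.6, 0.1),            # A: around an assumed mean log-price
    rng.uniform(-1.0, -0.01),        # B
    rng.uniform(-0.5, 0.5),          # C
    rng.uniform(0.1, 0.9),           # m
    rng.uniform(2.0, 20.0),          # omega: may fall outside [6, 13]
    rng.uniform(-np.pi, np.pi),      # phi
    rng.uniform(W + 10, W + 100),    # t_c
])

# Clip each component into [lower, upper]; infinite bounds leave it unchanged.
feasible_seed = np.clip(raw_seed, lower, upper)

for name, raw, clipped in zip(PARAMS, raw_seed, feasible_seed):
    print(f"{name:>5}: raw={raw:9.4f}  clipped={clipped:9.4f}")
```

Clipping keeps every seed usable, at the cost of piling some starting points onto a bound. Resampling until the seed is feasible would be an alternative design.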


In [None]:
# Task 15: Fit the LPPL model to each window via constrained nonlinear least squares

# ==============================================================================
# Task 15: Fit the LPPL model to each window via constrained nonlinear
#          least squares
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 15, Step 1: Define the LPPL objective function.
# ------------------------------------------------------------------------------

def lppl_objective_func(
    theta: np.ndarray,
    t: np.ndarray,
    log_price: np.ndarray
) -> np.ndarray:
    """
    Calculates the residuals for the LPPL model given a set of parameters.

    This function is designed to be used with `scipy.optimize.least_squares`,
    so it returns the vector of residuals (observed - predicted), not the
    sum of squared errors.

    Equation:
    ln_p_hat = A + B*(t_c - t)^m + C*(t_c - t)^m * cos(omega*ln(t_c - t) + phi)
    Residuals = log_price - ln_p_hat

    Args:
        theta (np.ndarray): A 7-element array of the LPPL parameters in the
                            standard order: [A, B, C, m, omega, phi, t_c].
        t (np.ndarray): A 1D array of time indices (e.g., 1 to W).
        log_price (np.ndarray): A 1D array of observed log-prices.

    Returns:
        np.ndarray: A 1D array of the residuals.
    """
    # Unpack the parameter vector for clarity.
    A, B, C, m, omega, phi, t_c = theta

    # --- Robustness Check ---
    # The optimizer might test t_c values that are inside the window.
    # This would cause a domain error in np.log. We must prevent this.
    # If t_c is not greater than all t, the parameters are invalid.
    if t_c <= t.max():
        # Return a large residual vector to penalize this invalid region.
        return np.full(t.shape, 1e12)

    # --- LPPL Equation Implementation ---
    # Calculate the (t_c - t) term, which is used multiple times.
    dt = t_c - t

    # Calculate the predicted log-price using the LPPL formula.
    # This is performed using vectorized numpy operations for efficiency.
    log_p_hat = A + B * (dt**m) + C * (dt**m) * np.cos(omega * np.log(dt) + phi)

    # Return the vector of residuals.
    return log_price - log_p_hat


# ------------------------------------------------------------------------------
# Task 15, Step 2 & 3: Run optimization and select the best fit.
# ------------------------------------------------------------------------------

def _fit_single_window(
    window: LPPLWindow,
    bounds: Tuple[List[float], List[float]],
    num_seeds: int = 10
) -> Optional[Dict[str, Any]]:
    """
    Fits the LPPL model to a single window of data using a multi-start,
    constrained non-linear least squares optimization.

    Args:
        window (LPPLWindow): The metadata and data for the window to fit.
        bounds (Tuple[List[float], List[float]]): Lower and upper bounds for the
                                                  7 LPPL parameters.
        num_seeds (int): The number of initial guesses to try.

    Returns:
        Optional[Dict[str, Any]]: A dictionary containing the best-fit parameters
                                  and final SSE if a valid fit is found,
                                  otherwise None.
    """
    # Extract the data needed for fitting from the window object.
    log_price_series = window.log_price_series
    t_vector = log_price_series.index.to_numpy()
    log_price_vector = log_price_series.to_numpy()
    window_size = len(log_price_series)

    # Generate a set of random initial guesses for this specific window.
    initial_seeds = generate_lppl_initial_seeds(log_price_series, window_size, num_seeds)

    # This list will store the results of all successful optimization runs.
    successful_fits: List[OptimizeResult] = []

    # --- Multi-Start Optimization Loop ---
    lower_bounds, upper_bounds = bounds
    for seed in initial_seeds:
        # least_squares raises ValueError for an x0 outside the bounds, which
        # the except below would swallow; clip each seed into the feasible box.
        x0 = np.clip(seed, lower_bounds, upper_bounds)
        try:
            # Run the constrained non-linear least squares optimization.
            result = least_squares(
                fun=lppl_objective_func,
                x0=x0,
                args=(t_vector, log_price_vector),
                bounds=bounds,
                method='trf', # Trust Region Reflective is suitable for bounds.
                ftol=1e-8,
                xtol=1e-8,
                max_nfev=1000
            )
            # If the optimizer reports success, add the result to our list.
            if result.success:
                successful_fits.append(result)
        except Exception:
            # Silently ignore optimization failures for a single seed.
            # This is expected as some seeds may be in difficult regions.
            continue

    # --- Best Fit Selection ---
    # If no optimization runs converged, this window has failed.
    if not successful_fits:
        return None

    # Find the best fit by selecting the one with the lowest cost (residual sum of squares).
    # The `cost` from least_squares is 0.5 * sum(residuals^2).
    best_fit = min(successful_fits, key=lambda r: r.cost)

    # The Sum of Squared Errors (SSE) is 2 * cost.
    sse = 2 * best_fit.cost

    # Structure the final results into a dictionary.
    fit_results = {param: value for param, value in zip(LPPL_PARAM_ORDER, best_fit.x)}
    fit_results['SSE'] = sse
    fit_results['Ticker'] = window.ticker
    fit_results['End_Date'] = window.end_date

    return fit_results


# ------------------------------------------------------------------------------
# Task 15, Orchestrator Function
# ------------------------------------------------------------------------------

def fit_lppl_model_to_windows(
    windows: List[LPPLWindow],
    bounds: Tuple[List[float], List[float]],
    study_parameters: Dict[str, Any],
    output_path: Union[str, Path] = "data_intermediate/lppl_fit_parameters.csv"
) -> pd.DataFrame:
    """
    Orchestrates the fitting of the LPPL model to all generated rolling windows.

    This function is idempotent. If the output file exists, it loads the results.
    Otherwise, it iterates through each window, performs a multi-start
    constrained optimization, selects the best fit, and aggregates all results.

    Args:
        windows (List[LPPLWindow]): The list of all windows to be fitted.
        bounds (Tuple[List[float], List[float]]): Parameter bounds from Task 14.
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        output_path (Union[str, Path]): The path to save the final CSV of fit
                                        parameters.

    Returns:
        pd.DataFrame: A DataFrame containing the best-fit LPPL parameters and
                      SSE for each successfully fitted window.
    """
    logging.info("Initiating LPPL model fitting pipeline...")
    output_path = Path(output_path)

    # --- Idempotency Check ---
    if output_path.exists():
        logging.info(f"Found existing LPPL fit results at '{output_path}'. Loading from file.")
        return pd.read_csv(output_path, parse_dates=['End_Date'])

    if not windows:
        logging.warning("Window list is empty. No LPPL fitting will be performed.")
        return pd.DataFrame()

    # This list will store the dictionary of results for each successful fit.
    all_fit_results = []

    # --- Main Fitting Loop ---
    # Use tqdm for a progress bar, as this is a very long-running process.
    for window in tqdm(windows, desc="Fitting LPPL to windows"):
        # Fit the model to the current window.
        best_fit = _fit_single_window(window, bounds)

        # If a successful fit was found, add it to our results list.
        if best_fit is not None:
            all_fit_results.append(best_fit)

    # --- Result Aggregation and Persistence ---
    if not all_fit_results:
        logging.warning("LPPL fitting completed, but no windows converged successfully.")
        return pd.DataFrame()

    # Convert the list of dictionaries into a pandas DataFrame.
    results_df = pd.DataFrame(all_fit_results)

    # Reorder columns for clarity.
    column_order = ['Ticker', 'End_Date'] + LPPL_PARAM_ORDER + ['SSE']
    results_df = results_df[column_order]

    # Save the final DataFrame to a CSV file.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    logging.info(f"Saving {len(results_df)} successful LPPL fits to '{output_path}'.")
    results_df.to_csv(output_path, index=False)

    # Log summary statistics.
    success_rate = len(results_df) / len(windows)
    logging.info(f"LPPL fitting pipeline finished. Success rate: {success_rate:.2%}")

    return results_df
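

# --- Illustrative sketch (not part of the pipeline) ---------------------------
# The load-or-compute pattern used above, reduced to a toy example. The function
# name `_example_load_or_compute` and the toy columns are hypothetical. The
# sketch shares one trade-off with the function above: once the file exists it
# is returned as-is, so stale results must be deleted manually to force a refit.
from pathlib import Path

import pandas as pd


def _example_load_or_compute(output_path: Path) -> pd.DataFrame:
    """Returns cached results if present; otherwise computes and saves them."""
    if output_path.exists():
        # Cache hit: skip the (expensive) computation entirely.
        return pd.read_csv(output_path, parse_dates=["End_Date"])

    # Cache miss: stand-in for the expensive fitting step.
    results = pd.DataFrame(
        {"Ticker": ["AAA"], "End_Date": pd.to_datetime(["2024-01-31"]), "SSE": [0.01]}
    )
    output_path.parent.mkdir(parents=True, exist_ok=True)
    results.to_csv(output_path, index=False)
    return results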



# ==============================================================================
# Task 16: Compute residuals and normalize to obtain ε_norm(t)
# ==============================================================================

# Standardized order of LPPL parameters.
LPPL_PARAM_ORDER = ['A', 'B', 'C', 'm', 'omega', 'phi', 't_c']

# ------------------------------------------------------------------------------
# Task 16, Step 1: Compute the raw residuals ε(t) for each window.
# ------------------------------------------------------------------------------

def _lppl_predict(theta: np.ndarray, t: np.ndarray) -> np.ndarray:
    """
    Calculates the predicted log-price using the LPPL model formula.

    This is a helper function to reconstruct the fitted LPPL trajectory.

    Args:
        theta (np.ndarray): The 7-element array of fitted LPPL parameters.
        t (np.ndarray): A 1D array of time indices (1 to W).

    Returns:
        np.ndarray: A 1D array of the predicted log-prices.
    """
    # Unpack the parameter vector.
    A, B, C, m, omega, phi, t_c = theta

    # Defensive check for numerical stability.
    if t_c <= t.max():
        return np.full(t.shape, np.nan) # Return NaNs if t_c is invalid

    # Calculate the (t_c - t) term.
    dt = t_c - t

    # Calculate the predicted log-price using the LPPL formula.
    log_p_hat = A + B * (dt**m) + C * (dt**m) * np.cos(omega * np.log(dt) + phi)

    return log_p_hat
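

# --- Illustrative sketch (not part of the pipeline) ---------------------------
# A minimal numeric check of the LPPL formula above on toy parameters. The
# function name and every numeric value are hypothetical, chosen only for
# illustration. With C = 0 the oscillatory term vanishes, so the curve reduces
# to the pure power law A + B * (t_c - t)**m.
import numpy as np


def _example_lppl_curve() -> np.ndarray:
    """Evaluates the LPPL log-price formula on a short toy time grid."""
    A, B, C, m, omega, phi, t_c = 5.0, -0.5, 0.0, 0.5, 8.0, 0.0, 10.0
    t = np.arange(1, 7)   # time indices 1..6, all strictly below t_c
    dt = t_c - t          # remaining time to the critical point
    return A + B * dt**m + C * dt**m * np.cos(omega * np.log(dt) + phi)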


def _compute_raw_residuals(
    df_features: pd.DataFrame,
    lppl_fits: pd.DataFrame,
    window_size: int
) -> pd.Series:
    """
    Computes the raw LPPL residuals for all successfully fitted windows.

    Equation: ε(t) = ln p(t) - ln p̂(t)

    Args:
        df_features (pd.DataFrame): The main DataFrame with 'Log_Price'.
        lppl_fits (pd.DataFrame): DataFrame of fitted LPPL parameters from Task 15.
        window_size (int): The size of the fitting window (W).

    Returns:
        pd.Series: A Series containing all raw residuals, indexed by the
                   original (Date, Ticker) MultiIndex.
    """
    logging.info("Computing raw residuals for all fitted windows...")

    # This list will hold Series objects of residuals for each window.
    all_residuals: List[pd.Series] = []

    # Group the main DataFrame by ticker for efficient slicing.
    grouped_df = df_features.groupby(level='TICKER')

    # Iterate through each successful fit in the parameters DataFrame.
    for _, fit in tqdm(lppl_fits.iterrows(), total=len(lppl_fits), desc="Calculating Residuals"):
        # Extract the ticker and end date to identify the window.
        ticker, end_date = fit['Ticker'], fit['End_Date']

        # Retrieve the original log price series for this window.
        try:
            ticker_series = grouped_df.get_group(ticker)['Log_Price']
            # Find the integer position of the end date.
            end_idx_pos = ticker_series.index.get_loc((end_date, ticker))
            # Slice the window using integer positions for speed and accuracy.
            window_series = ticker_series.iloc[end_idx_pos - window_size + 1 : end_idx_pos + 1]
        except (KeyError, IndexError):
            continue # Skip if the window can't be reconstructed.

        # Ensure the reconstructed window is valid.
        if len(window_series) != window_size:
            continue

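        # Extract the fitted parameters for this window as a float array
        # (rows from iterrows() have object dtype because of the string columns).
        theta = fit[LPPL_PARAM_ORDER].to_numpy(dtype=float)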

        # Reconstruct the time vector (1 to W).
        t_vector = np.arange(1, window_size + 1)

        # Calculate the predicted log prices for the window.
        log_p_hat = _lppl_predict(theta, t_vector)

        # Compute the raw residuals.
        raw_residuals = window_series.values - log_p_hat

        # Create a Series for these residuals with the original MultiIndex.
        residuals_series = pd.Series(raw_residuals, index=window_series.index)
        all_residuals.append(residuals_series)

    if not all_residuals:
        logging.warning("No raw residuals were computed. Check LPPL fits.")
        return pd.Series(dtype=np.float64)

    # Concatenate all individual residual Series into a single Series.
    # This may have duplicate index entries due to overlapping windows.
    combined_residuals = pd.concat(all_residuals)

    # For overlapping windows, multiple residual values exist for the same
    # (Date, Ticker). We take the mean as a simple and robust way to resolve this.
    # Grouping by index levels keeps the MultiIndex structure and level names.
    final_residuals = combined_residuals.groupby(level=[0, 1]).mean()

    return final_residuals
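

# --- Illustrative sketch (not part of the pipeline) ---------------------------
# How overlapping windows are reconciled: two toy windows share one
# (Date, TICKER) key, and the duplicate residuals are averaged. The function
# name, ticker, dates, and values are hypothetical.
import pandas as pd


def _example_average_overlapping_residuals() -> pd.Series:
    """Averages residuals that appear in more than one window."""
    idx_a = pd.MultiIndex.from_tuples(
        [(pd.Timestamp("2024-01-02"), "AAA"), (pd.Timestamp("2024-01-03"), "AAA")],
        names=["Date", "TICKER"],
    )
    idx_b = pd.MultiIndex.from_tuples(
        [(pd.Timestamp("2024-01-03"), "AAA"), (pd.Timestamp("2024-01-04"), "AAA")],
        names=["Date", "TICKER"],
    )
    window_a = pd.Series([0.10, 0.30], index=idx_a)
    window_b = pd.Series([0.50, -0.20], index=idx_b)

    # After concatenation, 2024-01-03 appears twice (once per window).
    combined = pd.concat([window_a, window_b])

    # Grouping by both index levels collapses duplicates to their mean.
    return combined.groupby(level=["Date", "TICKER"]).mean()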


# ------------------------------------------------------------------------------
# Task 16, Step 2: Normalize residuals to the range [-1, 1].
# ------------------------------------------------------------------------------

def _normalize_residuals(raw_residuals: pd.Series) -> pd.Series:
    """
    Normalizes raw residuals using a running maximum of their absolute value.

    Equations:
    1. M(t) = max_{s <= t} |ε(s)|
    2. ε_norm(t) = ε(t) / M(t)

    Args:
        raw_residuals (pd.Series): A Series of raw residuals indexed by
                                   (Date, Ticker).

    Returns:
        pd.Series: A Series of normalized residuals in the range [-1, 1].
    """
    logging.info("Normalizing residuals using a running maximum...")

    # Ensure the series is sorted by date for the expanding window to work correctly.
    residuals_sorted = raw_residuals.sort_index()

    # --- Calculate Running Maximum M(t) ---
    # Group by ticker to compute the running max only within each security's history.
    # .expanding() creates a window that grows from the beginning of each group.
    running_max_abs_residual = residuals_sorted.abs().groupby(level='TICKER').expanding().max()

    # The result of expanding is a MultiIndex Series. We need to drop the extra level
    # to align it with our original residuals Series.
    running_max_abs_residual = running_max_abs_residual.droplevel(0)

    # --- Normalize ---
    # Guard against a zero running maximum (all residuals so far are exactly
    # zero), which would give 0/0 = NaN. Dividing by 1 instead maps those points
    # to 0 while leaving genuinely missing residuals as NaN.
    safe_running_max = running_max_abs_residual.where(running_max_abs_residual > 0, 1.0)

    # Divide the raw residuals by the running maximum.
    normalized_residuals = residuals_sorted / safe_running_max

    return normalized_residuals
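

# --- Illustrative sketch (not part of the pipeline) ---------------------------
# Running-maximum normalization on a single toy series. This sketch uses
# Series.cummax() as a shortcut; for one NaN-free series it matches the
# .expanding().max() call above. Names and values are hypothetical. The first
# non-zero residual always maps to +1 or -1 because M(t) equals |ε(t)| there.
import pandas as pd


def _example_running_max_normalization() -> pd.Series:
    """Divides each residual by the largest absolute residual seen so far."""
    eps = pd.Series([0.5, -1.0, 0.5, 2.0])
    running_max = eps.abs().cummax()   # M(t): 0.5, 1.0, 1.0, 2.0
    return eps / running_max           # ε_norm(t): 1.0, -1.0, 0.5, 1.0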


# ------------------------------------------------------------------------------
# Task 16, Step 3: Merge normalized residuals into the main DataFrame.
# ------------------------------------------------------------------------------

def compute_and_merge_lppl_residuals(
    df_features: pd.DataFrame,
    lppl_fits: pd.DataFrame,
    study_parameters: Dict[str, Any]
) -> pd.DataFrame:
    """
    Orchestrates the computation and normalization of LPPL residuals and merges
    them into the main feature DataFrame.

    Args:
        df_features (pd.DataFrame): The main DataFrame with all prior features.
        lppl_fits (pd.DataFrame): The DataFrame of fitted LPPL parameters.
        study_parameters (Dict[str, Any]): The main configuration dictionary.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with the new
                      'Residual_Norm' column.
    """
    logging.info("Initiating LPPL residual computation pipeline...")
    # Work on a copy.
    df_final = df_features.copy()

    # --- Step 1: Compute raw residuals for all windows. ---
    window_size = study_parameters['descriptive_model']['lppl_fitting']['rolling_window_size']
    raw_residuals = _compute_raw_residuals(df_final, lppl_fits, window_size)
    logging.info(f"Step 1/3: Computed raw residuals for {len(raw_residuals)} data points.")

    if raw_residuals.empty:
        logging.warning("Raw residuals Series is empty. Adding a NaN column for 'Residual_Norm'.")
        df_final['Residual_Norm'] = np.nan
        return df_final

    # --- Step 2: Normalize the raw residuals. ---
    normalized_residuals = _normalize_residuals(raw_residuals)
    logging.info("Step 2/3: Normalized residuals successfully.")

    # --- Step 3: Merge the normalized residuals into the main DataFrame. ---
    # Assigning the Series as a new column will automatically align by the index.
    # Points in the DataFrame that don't have a residual will get NaN, which is correct.
    df_final['Residual_Norm'] = normalized_residuals
    logging.info("Step 3/3: Merged normalized residuals into the main DataFrame.")

    # --- Final Logging ---
    coverage = df_final['Residual_Norm'].notna().sum() / len(df_final)
    logging.info(f"LPPL residual computation pipeline finished. Coverage: {coverage:.2%}")

    return df_final



# ==============================================================================
# Task 17: Construct the BubbleScore by fusing residuals with behavioral signals
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 17, Step 1: Extract the BubbleScore weights from configuration.
# ------------------------------------------------------------------------------

def _get_bubblescore_weights(config: Dict[str, Any]) -> Tuple[float, float]:
    """
    Retrieves the weights for the Hype and Sentiment components of the BubbleScore.

    Args:
        config (Dict[str, Any]): The study configuration dictionary.

    Returns:
        Tuple[float, float]: A tuple containing (alpha_1_hype_weight,
                             alpha_2_sentiment_weight).

    Raises:
        KeyError: If the required weight keys are missing from the configuration.
    """
    try:
        # Access the nested dictionary for bubble score synthesis parameters.
        synthesis_config = config['descriptive_model']['bubble_score_synthesis']

        # Extract the weight for the Hype Index component.
        alpha_1 = synthesis_config['alpha_1_hype_weight']

        # Extract the weight for the Sentiment Score component.
        alpha_2 = synthesis_config['alpha_2_sentiment_weight']

        # The validity of these parameters was confirmed in Task 1.
        logging.info(f"BubbleScore weights extracted: alpha_1 (Hype) = {alpha_1}, alpha_2 (Sentiment) = {alpha_2}.")

        return alpha_1, alpha_2
    except KeyError as e:
        # This error indicates a problem with the configuration structure.
        logging.error(f"Missing a required BubbleScore weight in the configuration: {e}")
        raise


# ------------------------------------------------------------------------------
# Task 17, Step 2: Compute the BubbleScore using the regime-aware formula.
# ------------------------------------------------------------------------------

def _compute_bubblescore(
    df: pd.DataFrame,
    alpha_1: float,
    alpha_2: float
) -> pd.DataFrame:
    """
    Computes the BubbleScore by fusing the LPPL residual with behavioral signals.

    This function implements the core piecewise formula from the paper, which
    treats the amplifying effect of hype asymmetrically based on the sign of
    the LPPL residual (the "regime").

    Equation (14):
    - If ε_norm > 0: BubbleScore = ε_norm + α1*Hype + α2*Sentiment
    - If ε_norm <= 0: BubbleScore = ε_norm - α1*Hype + α2*Sentiment

    Args:
        df (pd.DataFrame): DataFrame containing 'Residual_Norm', 'Hype_Index',
                           and 'Sentiment_Score'.
        alpha_1 (float): The weight for the Hype Index.
        alpha_2 (float): The weight for the Sentiment Score.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with the new 'BubbleScore'
                      column.
    """
    # Work on a copy to avoid side effects.
    df_scored = df.copy()

    # --- Identify the Regime ---
    # Create a boolean mask to identify the "bubble" regime (positive residual).
    # Where this is False, it's the "negative behavior" regime.
    is_bubble_regime = df_scored['Residual_Norm'] > 0

    # --- Calculate Components ---
    # For clarity, calculate the value of each term in the equation.
    residual_term = df_scored['Residual_Norm']
    hype_term = alpha_1 * df_scored['Hype_Index']
    sentiment_term = alpha_2 * df_scored['Sentiment_Score']

    # --- Apply Piecewise Formula using np.where ---
    # This is a highly efficient, vectorized way to apply conditional logic.
    # np.where(condition, value_if_true, value_if_false)
    bubble_score = np.where(
        is_bubble_regime,
        # Formula for the positive regime (ε_norm > 0)
        residual_term + hype_term + sentiment_term,
        # Formula for the negative regime (ε_norm <= 0)
        residual_term - hype_term + sentiment_term
    )

    # Assign the computed array to the new 'BubbleScore' column.
    # NaNs in any component will correctly propagate to the result.
    df_scored['BubbleScore'] = bubble_score

    return df_scored
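

# --- Illustrative sketch (not part of the pipeline) ---------------------------
# A worked instance of the piecewise formula above. The weights (0.4, 0.1) and
# all inputs are hypothetical. The sign of the hype term follows the regime:
# +α1*Hype when ε_norm > 0, and -α1*Hype otherwise (including ε_norm == 0).
import numpy as np


def _example_bubblescore() -> np.ndarray:
    """Computes the regime-dependent score for three toy observations."""
    eps_norm = np.array([0.6, -0.6, 0.0])
    hype = np.array([0.5, 0.5, 0.5])
    sentiment = np.array([0.2, 0.2, 0.2])
    alpha_1, alpha_2 = 0.4, 0.1

    # +1 in the positive regime, -1 in the non-positive regime.
    hype_sign = np.where(eps_norm > 0, 1.0, -1.0)
    return eps_norm + hype_sign * alpha_1 * hype + alpha_2 * sentiment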


# ------------------------------------------------------------------------------
# Task 17, Step 3: Validate and persist the BubbleScore series.
# ------------------------------------------------------------------------------

def _validate_and_persist_bubblescore(
    df: pd.DataFrame,
    output_path: Path
) -> None:
    """
    Validates the computed BubbleScore and saves a snapshot for auditing.

    Args:
        df (pd.DataFrame): The DataFrame containing the 'BubbleScore' column.
        output_path (Path): The file path to save the CSV snapshot.
    """
    # --- Validation ---
    # Check for any non-finite values (inf, -inf) that could indicate errors.
    if not np.all(np.isfinite(df['BubbleScore'].dropna())):
        raise ValueError("BubbleScore column contains non-finite values (inf/-inf). Check input data and weights.")

    # Log summary statistics for the final signal.
    logging.info("BubbleScore computed successfully. Summary statistics:")
    logging.info(df['BubbleScore'].describe().to_string())

    # --- Persistence for Auditing ---
    # Define the columns relevant to the BubbleScore calculation for the snapshot.
    snapshot_cols = ['BubbleScore', 'Residual_Norm', 'Hype_Index', 'Sentiment_Score']

    # Ensure the parent directory exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # Save the snapshot to a CSV file.
    logging.info(f"Saving BubbleScore component snapshot to '{output_path}' for audit.")
    df[snapshot_cols].to_csv(output_path)


# ------------------------------------------------------------------------------
# Task 17, Orchestrator Function
# ------------------------------------------------------------------------------

def construct_bubblescore(
    df_residuals: pd.DataFrame,
    study_parameters: Dict[str, Any],
    output_dir: Union[str, Path] = "data_intermediate"
) -> pd.DataFrame:
    """
    Orchestrates the construction of the final BubbleScore signal.

    This function fuses the technical LPPL residual with the behavioral Hype
    and Sentiment signals according to the paper's regime-dependent formula.

    Args:
        df_residuals (pd.DataFrame): The DataFrame from Task 16, containing
            'Residual_Norm' and all prior behavioral features.
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        output_dir (Union[str, Path]): Directory to save the audit snapshot.

    Returns:
        pd.DataFrame: The DataFrame enriched with the final 'BubbleScore' column.
    """
    logging.info("Initiating BubbleScore construction pipeline...")

    # --- Input Validation ---
    required_cols = ['Residual_Norm', 'Hype_Index', 'Sentiment_Score']
    if not all(col in df_residuals.columns for col in required_cols):
        raise ValueError(f"Input DataFrame is missing one or more required columns for BubbleScore construction: {required_cols}")

    # --- Step 1: Extract the BubbleScore weights from configuration. ---
    alpha_1, alpha_2 = _get_bubblescore_weights(study_parameters)
    logging.info("Step 1/3: BubbleScore weights extracted.")

    # --- Step 2: Compute the BubbleScore using the regime-aware formula. ---
    df_scored = _compute_bubblescore(df_residuals, alpha_1, alpha_2)
    logging.info("Step 2/3: BubbleScore computed using regime-aware formula.")

    # --- Step 3: Validate and persist a snapshot of the BubbleScore. ---
    snapshot_path = Path(output_dir) / "bubblescore_snapshot.csv"
    _validate_and_persist_bubblescore(df_scored, snapshot_path)
    logging.info("Step 3/3: BubbleScore validated and snapshot persisted.")

    logging.info("BubbleScore construction pipeline finished successfully.")

    return df_scored



# ==============================================================================
# Task 18: Label bubble and negative-bubble episodes via thresholding and
#          persistence
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 18, Step 1: Extract episode detection parameters from configuration.
# ------------------------------------------------------------------------------

def _get_episode_detection_params(config: Dict[str, Any]) -> Tuple[float, int]:
    """
    Retrieves the parameters for episode detection from the configuration.

    Args:
        config (Dict[str, Any]): The study configuration dictionary.

    Returns:
        Tuple[float, int]: A tuple containing (tau_threshold, d_min_duration).

    Raises:
        KeyError: If the required keys are missing from the configuration.
    """
    try:
        # Access the nested dictionary for episode labeling parameters.
        labeling_config = config['descriptive_model']['episode_labeling']

        # Extract the significance threshold 'tau'.
        tau = labeling_config['significance_threshold_tau']

        # Extract the minimum duration 'd_min'.
        d_min = labeling_config['min_duration_d_min']

        logging.info(f"Episode detection parameters extracted: tau = {tau}, d_min = {d_min} days.")

        return tau, d_min
    except KeyError as e:
        # This error indicates a problem with the configuration structure.
        logging.error(f"Missing a required episode labeling parameter in the configuration: {e}")
        raise


# ------------------------------------------------------------------------------
# Task 18, Step 2: Identify raw episodes by thresholding.
# ------------------------------------------------------------------------------

def _identify_episodes_vectorized(
    df: pd.DataFrame,
    tau: float,
    d_min: int
) -> pd.DataFrame:
    """
    Identifies bubble and negative-bubble episodes using a vectorized approach.

    This function uses an efficient pandas algorithm based on detecting state
    changes to identify contiguous blocks of significant BubbleScore values that
    meet the minimum duration requirement.

    Args:
        df (pd.DataFrame): DataFrame containing the 'BubbleScore' column.
        tau (float): The significance threshold for the BubbleScore.
        d_min (int): The minimum number of consecutive days for an episode.

    Returns:
        pd.DataFrame: A DataFrame where each row represents a valid episode,
                      with columns for Ticker, Start_Date, End_Date, Type,
                      and Intensity.
    """
    logging.info("Identifying bubble episodes using vectorized state-change detection...")

    # Work on a temporary DataFrame with only the necessary data.
    temp_df = df[['BubbleScore']].copy()

    # --- Step A: Define the state for each day ---
    # State is 1 for a positive bubble, -1 for a negative bubble, 0 otherwise.
    temp_df['State'] = np.where(
        temp_df['BubbleScore'] > tau, 1,
        np.where(temp_df['BubbleScore'] < -tau, -1, 0)
    )

    # --- Step B: Detect state changes ---
    # A change occurs if the current state is different from the previous day's state for the same ticker.
    # .ne(0) converts the diff result (e.g., 1, -1, 2, -2) to a boolean; the first row of
    # each ticker has a NaN diff, so it is also flagged and starts a new block.
    temp_df['State_Change'] = temp_df.groupby(level='TICKER')['State'].diff().ne(0)

    # --- Step C: Assign a unique ID to each contiguous episode block ---
    # The cumulative sum of state changes is taken per ticker, so a state change in
    # one ticker cannot split a block of another ticker when rows are interleaved by date.
    temp_df['Episode_ID'] = temp_df.groupby(level='TICKER')['State_Change'].cumsum()

    # --- Step D: Aggregate by episode ID ---
    # Group by Ticker and the newly created Episode_ID.
    episodes = temp_df.groupby(['TICKER', 'Episode_ID']).agg(
        Start_Date=('BubbleScore', lambda x: x.index[0][0]), # Get date from MultiIndex
        End_Date=('BubbleScore', lambda x: x.index[-1][0]),
        Duration=('BubbleScore', 'size'),
        Type=('State', 'first'), # State is constant within the group
        Intensity=('BubbleScore', lambda x: x.abs().max())
    )

    # --- Step E: Filter for valid episodes ---
    # A valid episode must have a non-zero type (i.e., not a neutral period)
    # and a duration greater than or equal to the minimum requirement.
    valid_episodes = episodes[
        (episodes['Type'] != 0) & (episodes['Duration'] >= d_min)
    ].reset_index()

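    # Clean up the final DataFrame. The 'TICKER' index level is renamed to 'Ticker'
    # to match the documented output schema used by downstream labeling.
    valid_episodes = valid_episodes.drop(columns=['Episode_ID']).rename(columns={'TICKER': 'Ticker'})
    # Convert Type from number to a more descriptive string.
    valid_episodes['Type'] = valid_episodes['Type'].map({1: 'Normal', -1: 'Negative'})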

    return valid_episodes
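

# --- Illustrative sketch (not part of the pipeline) ---------------------------
# The state-change technique above, reduced to one toy series with an integer
# position instead of a (Date, TICKER) index. The function name, scores, and
# the defaults tau=0.5 and d_min=2 are hypothetical.
import numpy as np
import pandas as pd


def _example_run_lengths(tau: float = 0.5, d_min: int = 2) -> pd.DataFrame:
    """Finds contiguous runs above +tau or below -tau lasting >= d_min rows."""
    score = pd.Series([0.1, 0.7, 0.9, 0.2, -0.8, -0.6, -0.9, 0.6])
    frame = pd.DataFrame({"pos": np.arange(len(score)), "score": score})
    frame["abs_score"] = frame["score"].abs()

    # State per row: +1 above tau, -1 below -tau, 0 otherwise.
    frame["state"] = np.where(score > tau, 1, np.where(score < -tau, -1, 0))

    # A new block starts wherever the state differs from the previous row.
    block_id = frame["state"].diff().ne(0).cumsum()

    runs = frame.groupby(block_id).agg(
        start=("pos", "first"),
        end=("pos", "last"),
        duration=("pos", "size"),
        state=("state", "first"),
        intensity=("abs_score", "max"),
    )
    # Keep non-neutral runs that meet the minimum duration.
    keep = (runs["state"] != 0) & (runs["duration"] >= d_min)
    return runs[keep].reset_index(drop=True)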


# ------------------------------------------------------------------------------
# Task 18, Step 3: Persist episode labels and create binary indicators.
# ------------------------------------------------------------------------------

def _persist_episodes_and_create_indicators(
    df: pd.DataFrame,
    episodes_df: pd.DataFrame,
    output_path: Path
) -> pd.DataFrame:
    """
    Saves the detected episodes to a CSV and adds binary indicator columns
    to the main DataFrame.

    Args:
        df (pd.DataFrame): The main DataFrame to add indicators to.
        episodes_df (pd.DataFrame): The DataFrame of detected episodes.
        output_path (Path): The file path to save the episodes CSV.

    Returns:
        pd.DataFrame: A copy of the main DataFrame with two new binary
                      indicator columns.
    """
    # Work on a copy.
    df_labeled = df.copy()

    # --- Persistence ---
    output_path.parent.mkdir(parents=True, exist_ok=True)
    logging.info(f"Saving {len(episodes_df)} detected episodes to '{output_path}'.")
    episodes_df.to_csv(output_path, index=False)

    # --- Create Binary Indicators ---
    logging.info("Creating binary episode indicators in the main DataFrame...")
    # Initialize the new columns with 0.
    df_labeled['In_Bubble_Episode'] = 0
    df_labeled['In_Negative_Episode'] = 0

    # Iterate through the (much smaller) episodes DataFrame to label the main DataFrame.
    for _, episode in episodes_df.iterrows():
        # Define the slice for the current episode using the MultiIndex.
        # The slice selects all dates between Start_Date and End_Date (both inclusive)
        # for the specific Ticker; label-based slicing requires a sorted MultiIndex.
        idx_slice = (slice(episode['Start_Date'], episode['End_Date']), episode['Ticker'])

        # Assign 1 to the appropriate indicator column for the sliced rows.
        if episode['Type'] == 'Normal':
            df_labeled.loc[idx_slice, 'In_Bubble_Episode'] = 1
        elif episode['Type'] == 'Negative':
            df_labeled.loc[idx_slice, 'In_Negative_Episode'] = 1

    return df_labeled
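

# --- Illustrative sketch (not part of the pipeline) ---------------------------
# Labeling one episode with a MultiIndex slice, as in the loop above. The toy
# tickers, dates, and function name are hypothetical. The index is sorted first
# because label-based slicing on a MultiIndex requires a sorted index.
import pandas as pd


def _example_label_episode() -> pd.DataFrame:
    """Flags 2024-01-02..2024-01-03 (inclusive) for ticker 'AAA' only."""
    dates = pd.date_range("2024-01-01", periods=4, freq="D")
    index = pd.MultiIndex.from_product([dates, ["AAA", "BBB"]], names=["Date", "TICKER"])
    df = pd.DataFrame({"In_Bubble_Episode": 0}, index=index).sort_index()

    # (date slice, ticker) selects the rows of one episode for one ticker.
    idx_slice = (slice(pd.Timestamp("2024-01-02"), pd.Timestamp("2024-01-03")), "AAA")
    df.loc[idx_slice, "In_Bubble_Episode"] = 1
    return df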


# ------------------------------------------------------------------------------
# Task 18, Orchestrator Function
# ------------------------------------------------------------------------------

def label_bubble_episodes(
    df_scored: pd.DataFrame,
    study_parameters: Dict[str, Any],
    output_dir: Union[str, Path] = "data_intermediate"
) -> pd.DataFrame:
    """
    Orchestrates the process of identifying and labeling bubble episodes.

    This function translates the continuous BubbleScore into discrete event
    windows (episodes) based on magnitude and duration thresholds.

    Args:
        df_scored (pd.DataFrame): The DataFrame from Task 17, containing the
                                  'BubbleScore' column.
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        output_dir (Union[str, Path]): Directory to save the episodes CSV.

    Returns:
        pd.DataFrame: The DataFrame enriched with two new binary indicator
                      columns: 'In_Bubble_Episode' and 'In_Negative_Episode'.
    """
    logging.info("Initiating bubble episode labeling pipeline...")

    # --- Input Validation ---
    if 'BubbleScore' not in df_scored.columns:
        raise ValueError("Input DataFrame is missing the required 'BubbleScore' column.")

    # --- Step 1: Extract episode detection parameters. ---
    tau, d_min = _get_episode_detection_params(study_parameters)
    logging.info("Step 1/3: Episode detection parameters extracted.")

    # --- Step 2: Identify all valid episodes using the vectorized method. ---
    episodes_df = _identify_episodes_vectorized(df_scored, tau, d_min)
    logging.info(f"Step 2/3: Identified {len(episodes_df)} valid bubble/negative-bubble episodes.")

    # --- Step 3: Persist episodes and create binary indicators. ---
    episodes_filepath = Path(output_dir) / "bubble_episodes.csv"
    df_final = _persist_episodes_and_create_indicators(df_scored, episodes_df, episodes_filepath)
    logging.info("Step 3/3: Episode list persisted and binary indicators created.")

    logging.info("Bubble episode labeling pipeline finished successfully.")

    return df_final



# ==============================================================================
# Task 19: Engineer stock-level feature sequences for the Transformer
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 19, Step 1: Select and normalize stock-level features.
# ------------------------------------------------------------------------------

def _normalize_stock_features(
    df: pd.DataFrame,
    train_end_date: pd.Timestamp
) -> Tuple[pd.DataFrame, Dict[str, Dict[str, float]]]:
    """
    Selects and normalizes stock-level features using Z-score scaling.

    Crucially, the scaling parameters (mean and standard deviation) are
    calculated ONLY on the training portion of the data to prevent data leakage.
    These parameters are then applied to the entire dataset.

    Equation: x_norm = (x - μ_train) / σ_train

    Args:
        df (pd.DataFrame): The full feature DataFrame.
        train_end_date (pd.Timestamp): The last date of the training period.

    Returns:
        Tuple[pd.DataFrame, Dict[str, Dict[str, float]]]: A tuple containing:
            - A copy of the DataFrame with normalized features.
            - A dictionary of the fitted scalers (mean and std for each column).
    """
    # Work on a copy to avoid modifying the original DataFrame.
    df_normalized = df.copy()

    # Define the list of stock-level features to be processed.
    stock_features = [
        'Log_Price', 'Log_Volume', 'Log_Return',
        'PE_Ratio', 'PB_Ratio', 'Month', 'Day'
    ]
    # Define which of these features require normalization (i.e., are continuous).
    continuous_features = ['Log_Price', 'Log_Volume', 'Log_Return', 'PE_Ratio', 'PB_Ratio']

    # --- Isolate the Training Set for Fitting Scalers ---
    # This is the critical step to prevent data leakage from the validation/test sets.
    train_df = df_normalized.loc[df_normalized.index.get_level_values('Date') <= train_end_date]

    logging.info(f"Fitting normalization scalers using training data up to {train_end_date.date()}...")

    # This dictionary will store the calculated mean and std for each feature.
    scalers: Dict[str, Dict[str, float]] = {}

    for col in continuous_features:
        # Calculate mean and standard deviation from the training data.
        mean = train_df[col].mean()
        std = train_df[col].std()

        # --- Robustness Check for Zero Standard Deviation ---
        # If std is zero or very close to it, the feature is constant.
        if std < 1e-8:
            logging.warning(f"Feature '{col}' has near-zero standard deviation in the training set. It will be mapped to zeros.")
            # Use an infinite std so the transform maps every value to zero.
            std = np.inf  # Division by inf -> 0

        # Store the scalers actually used for the transformation.
        scalers[col] = {'mean': mean, 'std': std}

        # --- Apply the Transformation to the Entire Dataset ---
        # Apply the z-score normalization using the training set parameters.
        df_normalized[col] = (df_normalized[col] - mean) / std

    logging.info("Stock-level features normalized successfully.")
    return df_normalized, scalers
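

# --- Illustrative sketch (not part of the pipeline) ---------------------------
# Train-only z-scoring on a toy series: the mean and (sample) standard deviation
# come from the first three dates only and are then applied to every date. The
# cut-off date, values, and function name are hypothetical. Later values can
# land far outside the training range, which is expected with this scheme.
import pandas as pd


def _example_train_only_zscore() -> pd.Series:
    """Scales a series with statistics fitted on the training period only."""
    dates = pd.date_range("2024-01-01", periods=5, freq="D")
    x = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0], index=dates)

    # Fit on the training slice: mean = 2.0, std (ddof=1) = 1.0.
    train = x.loc[x.index <= pd.Timestamp("2024-01-03")]
    mu, sigma = train.mean(), train.std()

    # Apply the training statistics to the full series.
    return (x - mu) / sigma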


# ------------------------------------------------------------------------------
# Task 19, Step 2: Handle missing values in fundamental features.
# ------------------------------------------------------------------------------

def _handle_missing_fundamental_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Handles missing values in fundamental feature columns by dropping rows.

    This function implements the "drop if missing" policy for 'PE_Ratio' and
    'PB_Ratio' as specified in the paper's methodology.

    Args:
        df (pd.DataFrame): The DataFrame with normalized features.

    Returns:
        pd.DataFrame: A DataFrame with rows containing NaNs in fundamental
                      columns removed.
    """
    # Columns to check for missing values.
    fundamental_cols = ['PE_Ratio', 'PB_Ratio']

    initial_rows = len(df)

    # Drop rows where any of the specified fundamental columns have NaN values.
    df_cleaned = df.dropna(subset=fundamental_cols)

    final_rows = len(df_cleaned)
    rows_dropped = initial_rows - final_rows

    if rows_dropped > 0:
        logging.info(f"Dropped {rows_dropped:,} rows ({rows_dropped/initial_rows:.2%}) due to missing fundamental data ('PE_Ratio', 'PB_Ratio').")

    return df_cleaned


# ------------------------------------------------------------------------------
# Task 19, Step 3: Construct stock-level sequences.
# ------------------------------------------------------------------------------

def _construct_sequences_for_ticker(
    ticker_df: pd.DataFrame,
    feature_cols: List[str],
    sequence_length: int
) -> Tuple[List[np.ndarray], pd.MultiIndex]:
    """
    Constructs all possible fixed-length sequences for a single ticker.

    Args:
        ticker_df (pd.DataFrame): The DataFrame for a single ticker, sorted by date.
        feature_cols (List[str]): The names of the columns to include in the sequences.
        sequence_length (int): The desired length of each sequence (L).

    Returns:
        Tuple[List[np.ndarray], pd.MultiIndex]: A tuple containing:
            - A list of 2D numpy arrays, each of shape (L, d_s).
            - The MultiIndex corresponding to the end-date of each sequence.
    """
    # Convert the relevant feature columns to a numpy array for efficient slicing.
    feature_matrix = ticker_df[feature_cols].to_numpy()
    num_obs, num_features = feature_matrix.shape

    # This list will store the generated sequence arrays.
    sequences = []

    # The number of possible sequences is num_obs - sequence_length + 1.
    for i in range(num_obs - sequence_length + 1):
        # Slice the feature matrix to create a sequence of length L.
        sequence = feature_matrix[i : i + sequence_length]
        sequences.append(sequence)

    # Get the index labels for the end-points of each sequence.
    anchor_indices = ticker_df.index[sequence_length - 1:]

    return sequences, anchor_indices
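
The windowing loop above can also be expressed without an explicit Python loop. The sketch below, on a hypothetical 5x2 feature matrix, compares the loop with NumPy's `sliding_window_view` (available in NumPy 1.20 and later). It is an alternative formulation for illustration, not the implementation used in this repository.

```python
import numpy as np

# Hypothetical feature matrix: 5 observations, 2 features.
toy_matrix = np.arange(10, dtype=float).reshape(5, 2)
seq_len = 3

# Explicit loop, mirroring the function above.
loop_windows = [
    toy_matrix[i : i + seq_len]
    for i in range(len(toy_matrix) - seq_len + 1)
]

# Vectorized alternative: the window axis is appended last, so move it to
# position 1 to obtain windows of shape (seq_len, n_features).
view = np.lib.stride_tricks.sliding_window_view(toy_matrix, seq_len, axis=0)
vectorized_windows = np.moveaxis(view, -1, 1)

print(vectorized_windows.shape)
```

The vectorized result is a read-only view that shares memory with the source array, so copy it before modifying any window in place.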


# ------------------------------------------------------------------------------
# Task 19, Orchestrator Function
# ------------------------------------------------------------------------------

def engineer_stock_level_sequences(
    df_labeled: pd.DataFrame,
    study_parameters: Dict[str, Any]
) -> Tuple[List[np.ndarray], pd.MultiIndex, Dict[str, Dict[str, float]]]:
    """
    Orchestrates the full pipeline for creating stock-level feature sequences.

    This function prepares the primary input for the stock-specific stream of
    the Dual-Stream Transformer. The pipeline includes:
    1.  Selecting and normalizing features, crucially using only training data
        to fit the scalers to prevent data leakage.
    2.  Handling missing values in fundamental ratio columns by dropping rows.
    3.  Constructing fixed-length sequences for each ticker.

    Args:
        df_labeled (pd.DataFrame): The full feature DataFrame from Task 18.
        study_parameters (Dict[str, Any]): The main configuration dictionary.

    Returns:
        Tuple[List[np.ndarray], pd.MultiIndex, Dict[str, Dict[str, float]]]:
            - A list of all generated stock-level sequences (numpy arrays).
            - A pandas MultiIndex aligning each sequence to its anchor (Ticker, Date).
            - A dictionary containing the fitted normalization scalers.
    """
    logging.info("Initiating stock-level sequence engineering pipeline...")

    # --- Define Split Date for Normalization ---
    # This is essential to prevent data leakage.
    all_dates = df_labeled.index.get_level_values('Date').unique().sort_values()
    split_ratio = study_parameters['predictive_model']['data_preparation']['dataset_split_ratio']['train']
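    # Clamp to the last valid position so a train ratio of 1.0 cannot index out of range.
    train_end_idx = min(int(len(all_dates) * split_ratio), len(all_dates) - 1)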
    train_end_date = all_dates[train_end_idx]

    # --- Step 1: Select and normalize stock-level features. ---
    df_normalized, scalers = _normalize_stock_features(df_labeled, train_end_date)
    logging.info("Step 1/3: Stock-level features normalized.")

    # --- Step 2: Handle missing values in fundamental features. ---
    df_cleaned = _handle_missing_fundamental_features(df_normalized)
    logging.info("Step 2/3: Missing fundamental data handled.")

    # --- Step 3: Construct fixed-length sequences for each ticker. ---
    sequence_length = study_parameters['predictive_model']['data_preparation']['sequence_length']
    feature_cols = [
        'Log_Price', 'Log_Volume', 'Log_Return', 'PE_Ratio', 'PB_Ratio', 'Month', 'Day'
    ]

    all_sequences: List[np.ndarray] = []
    all_anchor_indices: List[pd.MultiIndex] = []

    # Group by ticker and apply the sequence generation function to each group.
    logging.info(f"Constructing sequences of length {sequence_length} for each ticker...")
    grouped = df_cleaned[feature_cols].groupby(level='TICKER')

    for _, ticker_df in tqdm(grouped, desc="Generating Stock Sequences"):
        if len(ticker_df) >= sequence_length:
            sequences, anchor_indices = _construct_sequences_for_ticker(
                ticker_df, feature_cols, sequence_length
            )
            all_sequences.extend(sequences)
            all_anchor_indices.append(anchor_indices)

    if not all_anchor_indices:
        raise ValueError("No sequences could be generated. Check sequence_length and data availability.")

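    # Concatenate the indices from all tickers into a single MultiIndex,
    # re-attaching the level names that tuple flattening would otherwise drop.
    final_anchor_indices = pd.MultiIndex.from_tuples(
        [idx for multi_idx in all_anchor_indices for idx in multi_idx],
        names=df_cleaned.index.names
    )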

    logging.info(f"Step 3/3: Successfully generated {len(all_sequences):,} stock-level sequences.")
    logging.info("Stock-level sequence engineering pipeline finished successfully.")

    return all_sequences, final_anchor_indices, scalers
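
Later tasks look up the anchor dates by level name (`get_level_values('Date')`), so the combined index needs named levels. The sketch below uses two invented per-ticker indices to show that flattening to tuples discards level names unless they are passed back explicitly.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-02", "2020-01-03"])
idx_a = pd.MultiIndex.from_product([["AAA"], dates], names=["TICKER", "Date"])
idx_b = pd.MultiIndex.from_product([["BBB"], dates], names=["TICKER", "Date"])

# Without `names=`, the combined index would have unnamed levels.
unnamed = pd.MultiIndex.from_tuples([tup for idx in (idx_a, idx_b) for tup in idx])

# Passing the original names keeps name-based lookups working.
combined = pd.MultiIndex.from_tuples(
    [tup for idx in (idx_a, idx_b) for tup in idx],
    names=idx_a.names,
)
anchor_dates = combined.get_level_values("Date")
print(list(unnamed.names), list(combined.names))
```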


# Task 20: Engineer market-level feature sequences for the Transformer

# ==============================================================================
# Task 20: Engineer market-level feature sequences for the Transformer
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 20, Step 1: Select and normalize market-level features.
# ------------------------------------------------------------------------------

def _normalize_market_features(
    df: pd.DataFrame,
    train_end_date: pd.Timestamp
) -> Tuple[pd.DataFrame, Dict[str, Dict[str, float]]]:
    """
    Selects, de-duplicates, and normalizes market-level features.

    This function first creates a compact, date-indexed DataFrame of market
    features. It then applies Z-score normalization to continuous variables
    (like VIX), fitting the scaler ONLY on the training data to prevent leakage.
    Probabilistic features (sentiment shares, Hype Index) are not scaled.

    Args:
        df (pd.DataFrame): The full feature DataFrame.
        train_end_date (pd.Timestamp): The last date of the training period.

    Returns:
        Tuple[pd.DataFrame, Dict[str, Dict[str, float]]]: A tuple containing:
            - A date-indexed DataFrame of normalized market features.
            - A dictionary of the fitted scalers.
    """
    # Define the list of market-level features.
    market_features_cols = [
        'VIX_Close', 'Hype_Index', 'Market_Sentiment_Neg',
        'Market_Sentiment_Neu', 'Market_Sentiment_Pos'
    ]
    # Define which of these are continuous and require scaling.
    continuous_market_features = ['VIX_Close']

    # --- Create a Compact, Date-Indexed DataFrame of Market Features ---
    # Market features are expected to repeat across tickers on a given date, so
    # one row per date is enough. We select the columns, drop the 'TICKER' index
    # level, keep the first row for each date, and sort chronologically so that
    # positional slicing downstream yields time-ordered sequences.
    market_features_df = df[market_features_cols].copy()
    market_features_df = market_features_df.reset_index(level='TICKER', drop=True)
    market_features_df = market_features_df[~market_features_df.index.duplicated(keep='first')]
    market_features_df = market_features_df.sort_index()

    # --- Fit Scalers on Training Data Only ---
    train_market_df = market_features_df.loc[market_features_df.index <= train_end_date]
    logging.info(f"Fitting market feature scalers using training data up to {train_end_date.date()}...")

    scalers: Dict[str, Dict[str, float]] = {}

    for col in continuous_market_features:
        # Calculate mean and standard deviation from the training data.
        mean = train_market_df[col].mean()
        std = train_market_df[col].std()

        # Handle constant features (or an undefined spread, e.g. a single training
        # row) to avoid division by zero or NaN.
        if not np.isfinite(std) or std < 1e-8:
            logging.warning(f"Market feature '{col}' has near-zero or undefined standard deviation. It will not be scaled.")
            scalers[col] = {'mean': mean, 'std': np.inf}
        else:
            scalers[col] = {'mean': mean, 'std': std}
            # Apply the z-score transformation to the entire market feature DataFrame.
            market_features_df[col] = (market_features_df[col] - mean) / std

    logging.info("Market-level features selected and normalized successfully.")
    return market_features_df, scalers
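
The leakage-avoidance idea in this step is that the mean and standard deviation come from the training window only, while the transformation is applied to every date. A minimal sketch with invented VIX-like numbers (the series and the split date are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical daily closes; the values are made up for illustration.
toy_dates = pd.date_range("2020-01-01", periods=6, freq="D")
toy_vix = pd.Series([10.0, 12.0, 14.0, 16.0, 30.0, 40.0], index=toy_dates)
toy_train_end = toy_dates[3]

# Fit the statistics on the training window only.
train_part = toy_vix.loc[toy_vix.index <= toy_train_end]
mu, sigma = train_part.mean(), train_part.std()

# Apply the same transformation to every date, including later ones.
toy_vix_scaled = (toy_vix - mu) / sigma
print(toy_vix_scaled.round(2))
```

The training portion ends up with mean 0 and unit standard deviation, while later observations can land far outside that range. That is the intended behaviour: the scaler has no knowledge of the out-of-sample period.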


# ------------------------------------------------------------------------------
# Task 20, Step 2: Construct market-level sequences.
# ------------------------------------------------------------------------------

def _construct_market_sequences(
    market_features_df: pd.DataFrame,
    sequence_length: int
) -> Dict[pd.Timestamp, np.ndarray]:
    """
    Constructs fixed-length sequences from the date-indexed market features.

    This creates a dictionary mapping each valid anchor date to its
    corresponding market sequence. This lookup map is highly efficient for
    aligning with stock-level sequences later.

    Args:
        market_features_df (pd.DataFrame): Date-indexed DataFrame of market features.
        sequence_length (int): The desired length of each sequence (L).

    Returns:
        Dict[pd.Timestamp, np.ndarray]: A dictionary mapping the anchor date
                                        (end of sequence) to the sequence array.
    """
    logging.info(f"Constructing market-level sequences of length {sequence_length}...")

    # Convert to numpy for efficient slicing.
    feature_matrix = market_features_df.to_numpy()
    dates = market_features_df.index
    num_obs = len(dates)

    market_sequence_map: Dict[pd.Timestamp, np.ndarray] = {}

    # Iterate through all possible end-points of a sequence.
    for i in range(sequence_length - 1, num_obs):
        # The anchor date is the date at the end of the sequence.
        anchor_date = dates[i]
        # Slice the feature matrix to get the sequence.
        sequence = feature_matrix[i - sequence_length + 1 : i + 1]
        # Store the sequence in the map with its anchor date as the key.
        market_sequence_map[anchor_date] = sequence

    return market_sequence_map
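
As a small illustration of the lookup map, the sketch below builds windows of length 3 over five invented dates. The first two dates have no full window behind them, so they never appear as keys. Column values are hypothetical.

```python
import numpy as np
import pandas as pd

toy_dates = pd.date_range("2020-01-01", periods=5, freq="D")
toy_market = pd.DataFrame(
    {"VIX_Close": np.arange(5.0), "Hype_Index": np.arange(5.0) / 10},
    index=toy_dates,
)
seq_len = 3

# Map each anchor date (window end) to the seq_len rows ending on that date.
matrix = toy_market.to_numpy()
toy_map = {
    toy_market.index[i]: matrix[i - seq_len + 1 : i + 1]
    for i in range(seq_len - 1, len(toy_market))
}

print(sorted(toy_map))
```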


# ------------------------------------------------------------------------------
# Task 20, Orchestrator Function (Step 3 is part of the final dataset assembly)
# ------------------------------------------------------------------------------

def engineer_market_level_sequences(
    df_labeled: pd.DataFrame,
    study_parameters: Dict[str, Any]
) -> Tuple[Dict[pd.Timestamp, np.ndarray], Dict[str, Dict[str, float]]]:
    """
    Orchestrates the creation of market-level feature sequences.

    This function prepares the second input stream for the Dual-Stream
    Transformer. The pipeline includes:
    1.  Selecting, de-duplicating, and normalizing market-level features,
        using only training data to fit scalers.
    2.  Constructing fixed-length sequences for each valid anchor date and
        storing them in an efficient lookup map.

    Args:
        df_labeled (pd.DataFrame): The full feature DataFrame from Task 18.
        study_parameters (Dict[str, Any]): The main configuration dictionary.

    Returns:
        Tuple[Dict[pd.Timestamp, np.ndarray], Dict[str, Dict[str, float]]]:
            - A dictionary mapping each anchor date to its market sequence array.
            - A dictionary containing the fitted normalization scalers.
    """
    logging.info("Initiating market-level sequence engineering pipeline...")

    # --- Define Split Date for Normalization (Consistent with Task 19) ---
    all_dates = df_labeled.index.get_level_values('Date').unique().sort_values()
    split_ratio = study_parameters['predictive_model']['data_preparation']['dataset_split_ratio']['train']
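    # Clamp to the last valid position so a train ratio of 1.0 cannot index out of range.
    train_end_idx = min(int(len(all_dates) * split_ratio), len(all_dates) - 1)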
    train_end_date = all_dates[train_end_idx]

    # --- Step 1: Select and normalize market-level features. ---
    market_features_df, market_scalers = _normalize_market_features(df_labeled, train_end_date)
    logging.info("Step 1/2: Market-level features normalized.")

    # --- Step 2: Construct market-level sequences. ---
    sequence_length = study_parameters['predictive_model']['data_preparation']['sequence_length']
    market_sequence_map = _construct_market_sequences(market_features_df, sequence_length)
    logging.info(f"Step 2/2: Successfully generated {len(market_sequence_map):,} market-level sequences.")

    logging.info("Market-level sequence engineering pipeline finished successfully.")

    return market_sequence_map, market_scalers


# Task 21: Construct target sequences for multi-horizon BubbleScore forecasting

# ==============================================================================
# Task 21: Construct target sequences for multi-horizon BubbleScore forecasting
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 21, Step 1: Define the prediction horizons.
# ------------------------------------------------------------------------------

def _get_prediction_horizons(config: Dict[str, Any]) -> List[int]:
    """
    Retrieves and validates the list of prediction horizons from the configuration.

    Args:
        config (Dict[str, Any]): The study configuration dictionary.

    Returns:
        List[int]: A validated list of positive integer prediction horizons.

    Raises:
        KeyError: If the required key is missing from the configuration.
        ValueError: If the horizons are not positive integers.
    """
    try:
        # Access the nested key for the prediction horizons.
        horizons = config['backtesting']['strategy_rules']['prediction_horizons_to_test']

        # --- Validation ---
        # Ensure it's a non-empty list and all elements are positive integers.
        if (
            not isinstance(horizons, list)
            or not horizons
            or not all(isinstance(h, int) and h > 0 for h in horizons)
        ):
            raise ValueError("'prediction_horizons_to_test' must be a non-empty list of positive integers.")

        logging.info(f"Prediction horizons extracted and validated: {horizons}")
        return sorted(horizons) # Return sorted for consistent ordering

    except KeyError as e:
        logging.error(f"Missing prediction horizons in the configuration: {e}")
        raise
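
One subtlety of `isinstance(h, int)` is that `bool` is a subclass of `int` in Python, so a value such as `True` would pass as the horizon `1`. The helper below, `validate_horizons`, is a hypothetical stricter variant for illustration and is not part of this repository.

```python
from typing import Any, List


def validate_horizons(horizons: Any) -> List[int]:
    """Hypothetical stricter check: non-empty list of positive ints, no bools."""
    if not isinstance(horizons, list) or not horizons:
        raise ValueError("horizons must be a non-empty list")
    for h in horizons:
        # bool is a subclass of int, so it has to be excluded explicitly.
        if isinstance(h, bool) or not isinstance(h, int) or h <= 0:
            raise ValueError(f"invalid horizon: {h!r}")
    return sorted(horizons)


print(validate_horizons([5, 1, 3]))
```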


# ------------------------------------------------------------------------------
# Task 21, Steps 2 & 3: Handle edge cases and structure the dataset.
# ------------------------------------------------------------------------------

def _construct_multi_horizon_targets(
    df: pd.DataFrame,
    anchor_indices: pd.MultiIndex,
    horizons: List[int]
) -> Tuple[pd.MultiIndex, np.ndarray]:
    """
    Constructs a matrix of multi-horizon targets for each valid anchor point.

    This function uses efficient, vectorized `shift` operations to look up
    future BubbleScore values. It handles edge cases by dropping any anchor
    points for which a complete set of future targets is not available.

    Args:
        df (pd.DataFrame): The main DataFrame containing the 'BubbleScore' column.
        anchor_indices (pd.MultiIndex): The (Date, Ticker) indices corresponding
                                        to the end of each input sequence.
        horizons (List[int]): The list of forecast horizons (e.g., [1, 2, 3, 4, 5]).

    Returns:
        Tuple[pd.MultiIndex, np.ndarray]: A tuple containing:
            - The final, filtered MultiIndex of valid anchor points.
            - A 2D numpy array of shape (n_valid_samples, n_horizons)
              containing the corresponding target values.
    """
    logging.info(f"Constructing multi-horizon targets for horizons: {horizons}...")

    # --- Create Shifted Target Columns ---
    # Create a temporary DataFrame to hold the shifted target columns.
    target_df = pd.DataFrame(index=df.index)

    # Group by ticker to prevent data leakage across securities during shifting.
    grouped = df.groupby(level='TICKER')['BubbleScore']

    for h in horizons:
        # Use a negative shift to pull future values into the present.
        # df.shift(-h) at time t gives the value from time t+h.
        target_df[f'Target_H{h}'] = grouped.shift(-h)

    # --- Align Targets with Anchor Points ---
    # Select only the rows corresponding to our sequence anchor points.
    aligned_targets = target_df.loc[anchor_indices]

    # --- Handle Edge Cases by Dropping NaNs ---
    # Any row with a NaN value indicates that at least one of its future targets
    # is unavailable: either the horizon runs past the end of the ticker's data
    # or the BubbleScore at that future observation is itself missing.
    initial_samples = len(aligned_targets)
    final_targets = aligned_targets.dropna()
    final_samples = len(final_targets)

    samples_dropped = initial_samples - final_samples
    if samples_dropped > 0:
        logging.info(f"Dropped {samples_dropped:,} samples ({samples_dropped/initial_samples:.2%}) due to insufficient forward data for targets.")

    # The index of this cleaned DataFrame is our final set of valid anchor points.
    final_valid_anchor_indices = final_targets.index

    # Convert the final target DataFrame to a numpy array for use in ML models.
    target_matrix = final_targets.to_numpy()

    return final_valid_anchor_indices, target_matrix
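
The per-ticker negative shift is the core of the target construction. The sketch below uses two invented tickers with three dates each to show that the shift never pulls a value across the ticker boundary, and that anchors too close to the end of a ticker's history are dropped.

```python
import numpy as np
import pandas as pd

toy_index = pd.MultiIndex.from_product(
    [["AAA", "BBB"], pd.date_range("2020-01-01", periods=3, freq="D")],
    names=["TICKER", "Date"],
)
toy_scores = pd.DataFrame(
    {"BubbleScore": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]}, index=toy_index
)

# shift(-h) within each ticker pulls the value h observations ahead.
toy_targets = pd.DataFrame(index=toy_scores.index)
grouped_scores = toy_scores.groupby(level="TICKER")["BubbleScore"]
for h in (1, 2):
    toy_targets[f"Target_H{h}"] = grouped_scores.shift(-h)

# Only anchors with a full set of future values survive.
complete = toy_targets.dropna()
print(complete)
```

Note that the shift counts observations, not calendar days, so a horizon of `h` means "the `h`-th next available row for that ticker".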


# ------------------------------------------------------------------------------
# Task 21, Orchestrator Function
# ------------------------------------------------------------------------------

def construct_and_align_targets(
    df_bubblescore: pd.DataFrame,
    anchor_indices: pd.MultiIndex,
    study_parameters: Dict[str, Any]
) -> Tuple[pd.MultiIndex, np.ndarray]:
    """
    Orchestrates the construction of multi-horizon forecast targets.

    This function prepares the target variable (y) for the supervised learning
    problem. It ensures that for every input sequence, there is a corresponding
    vector of future BubbleScore values to predict.

    Args:
        df_bubblescore (pd.DataFrame): The DataFrame from Task 17, containing
                                       the 'BubbleScore' column.
        anchor_indices (pd.MultiIndex): The (Date, Ticker) indices from Task 19
                                        that mark the end of each input sequence.
        study_parameters (Dict[str, Any]): The main configuration dictionary.

    Returns:
        Tuple[pd.MultiIndex, np.ndarray]: A tuple containing:
            - The final, filtered MultiIndex of valid anchor points for which
              both inputs and a full set of targets exist.
            - A 2D numpy array of the corresponding multi-horizon targets.
    """
    logging.info("Initiating multi-horizon target construction pipeline...")

    # --- Input Validation ---
    if 'BubbleScore' not in df_bubblescore.columns:
        raise ValueError("Input DataFrame is missing the required 'BubbleScore' column.")

    # --- Step 1: Define the prediction horizons. ---
    horizons = _get_prediction_horizons(study_parameters)
    logging.info("Step 1/2: Prediction horizons defined.")

    # --- Step 2 & 3: Construct targets and handle edge cases. ---
    final_anchor_indices, target_matrix = _construct_multi_horizon_targets(
        df_bubblescore, anchor_indices, horizons
    )
    logging.info(f"Step 2/2: Constructed target matrix with shape {target_matrix.shape}.")

    logging.info("Multi-horizon target construction pipeline finished successfully.")

    return final_anchor_indices, target_matrix


# Task 22: Split the dataset into training, validation, and test sets chronologically

# ==============================================================================
# Task 22: Split the dataset into training, validation, and test sets
#          chronologically
# ==============================================================================

# Define a simple data structure to hold the partitioned datasets for clarity.
class ModelDataset(NamedTuple):
    """
    A structured data container for holding a complete, partitioned dataset
    ready for a deep learning model like the Dual-Stream Transformer.

    This class uses a NamedTuple to group the different input streams (stock,
    market) and the corresponding targets into a single, immutable object. This
    improves code clarity and makes passing partitioned data between functions
    less error-prone. Each instance of this class represents one full data split
    (e.g., training, validation, or test set).

    Attributes:
        stock_sequences (np.ndarray):
            A 3D numpy array containing the stock-specific feature sequences.
            The shape is (n_samples, sequence_length, n_stock_features), where
            `n_samples` is the number of observations in this particular data split.
            This array forms the input to the stock-specific stream of the
            Transformer model.

        market_sequences (np.ndarray):
            A 3D numpy array containing the market-level feature sequences.
            The shape is (n_samples, sequence_length, n_market_features).
            Each sequence `market_sequences[i]` corresponds to the same time
            period as `stock_sequences[i]`, providing the market context for
            that specific sample. This array forms the input to the market-level
            stream of the Transformer model.

        targets (np.ndarray):
            A 2D numpy array containing the multi-horizon forecast targets.
            The shape is (n_samples, n_horizons). Each row `targets[i]` is a
            vector of future BubbleScore values `[y_{t+1}, y_{t+2}, ..., y_{t+H}]`
            corresponding to the input sequences `stock_sequences[i]` and
            `market_sequences[i]`, where `t` is the anchor date of the sequences.
    """
    # A 3D array of stock-specific input sequences.
    stock_sequences: np.ndarray

    # A 3D array of market-level input sequences, aligned with the stock sequences.
    market_sequences: np.ndarray

    # A 2D array of multi-horizon target values, aligned with the input sequences.
    targets: np.ndarray
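
A short usage sketch for a container of this shape follows. `ToyDataset` is a stand-in defined here so the snippet runs on its own; the sizes (4 samples, window of 3, 7 stock features, 5 market features, 2 horizons) are hypothetical.

```python
from typing import NamedTuple

import numpy as np


class ToyDataset(NamedTuple):
    """Stand-in with the same three fields as ModelDataset (illustration only)."""
    stock_sequences: np.ndarray
    market_sequences: np.ndarray
    targets: np.ndarray


toy_split = ToyDataset(
    stock_sequences=np.zeros((4, 3, 7)),
    market_sequences=np.zeros((4, 3, 5)),
    targets=np.zeros((4, 2)),
)

# Fields can be read by name or unpacked positionally.
x_stock, x_market, y = toy_split
n_samples = len(toy_split.targets)
print(n_samples, x_stock.shape, x_market.shape, y.shape)
```

The tuple itself is immutable (fields cannot be reassigned), although the NumPy arrays it holds can still be modified in place.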

# ------------------------------------------------------------------------------
# Task 22, Step 1: Extract the split ratios from configuration.
# ------------------------------------------------------------------------------

def _get_and_validate_split_ratios(config: Dict[str, Any]) -> Dict[str, float]:
    """
    Retrieves and validates the dataset split ratios from the configuration.

    Args:
        config (Dict[str, Any]): The study configuration dictionary.

    Returns:
        Dict[str, float]: A dictionary of validated split ratios for
                          'train', 'validation', and 'test'.

    Raises:
        KeyError: If the split ratio configuration is missing.
        ValueError: If the ratios do not sum to 1.0.
    """
    try:
        # Access the nested dictionary for split ratios.
        ratios = config['predictive_model']['data_preparation']['dataset_split_ratio']

        # --- Validation ---
        # Ensure all required keys are present.
        if not all(k in ratios for k in ['train', 'validation', 'test']):
            raise ValueError("Split ratios must contain 'train', 'validation', and 'test' keys.")

        # Ensure the ratios sum to 1.0, using a tolerance for floating-point math.
        if not np.isclose(sum(ratios.values()), 1.0):
            raise ValueError(f"Split ratios must sum to 1.0, but sum to {sum(ratios.values())}.")

        logging.info(f"Dataset split ratios validated: {ratios}")
        return ratios

    except KeyError as e:
        logging.error(f"Missing dataset split ratio configuration: {e}")
        raise
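
The tolerance-based check matters because decimal ratios are not always exactly representable in binary floating point, so their sum may differ from 1.0 by a rounding error. A minimal sketch with hypothetical ratios:

```python
import numpy as np

example_ratios = {"train": 0.6, "validation": 0.3, "test": 0.1}  # hypothetical values
total = sum(example_ratios.values())

# Exact equality may fail because of rounding; np.isclose tolerates it.
exact_match = total == 1.0
tolerant_match = bool(np.isclose(total, 1.0))
print(repr(total), exact_match, tolerant_match)
```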


# ------------------------------------------------------------------------------
# Task 22, Step 2: Determine the chronological split dates.
# ------------------------------------------------------------------------------

def _determine_chronological_split_dates(
    anchor_indices: pd.MultiIndex,
    ratios: Dict[str, float]
) -> Tuple[pd.Timestamp, pd.Timestamp]:
    """
    Determines the date boundaries for the train, validation, and test sets.

    Args:
        anchor_indices (pd.MultiIndex): The MultiIndex of all valid samples.
        ratios (Dict[str, float]): The validated split ratios.

    Returns:
        Tuple[pd.Timestamp, pd.Timestamp]: A tuple containing:
            - The end date of the training set.
            - The end date of the validation set.
    """
    # Get the unique, sorted timeline of all anchor dates in the dataset.
    unique_dates = anchor_indices.get_level_values('Date').unique().sort_values()

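    # Calculate the integer index for the end of the training period, clamped
    # to the last available date so extreme ratios cannot index out of range.
    last_idx = len(unique_dates) - 1
    train_end_idx = min(int(len(unique_dates) * ratios['train']), last_idx)

    # Calculate the integer index for the end of the validation period.
    validation_end_idx = min(
        train_end_idx + int(len(unique_dates) * ratios['validation']), last_idx
    )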

    # Retrieve the actual timestamp for the training set boundary.
    train_end_date = unique_dates[train_end_idx]

    # Retrieve the actual timestamp for the validation set boundary.
    validation_end_date = unique_dates[validation_end_idx]

    logging.info("Chronological split dates determined:")
    logging.info(f"  - Training ends on:   {train_end_date.date()}")
    logging.info(f"  - Validation ends on: {validation_end_date.date()}")

    return train_end_date, validation_end_date
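
A worked example helps when reasoning about where the boundaries fall. The sketch below uses eight invented dates and hypothetical ratios chosen to be exact in binary. Because the boundary dates are treated as inclusive upper bounds in the partitioning step, the training set receives one more date than `len * ratio`.

```python
import pandas as pd

toy_dates = pd.date_range("2020-01-01", periods=8, freq="D")
toy_ratios = {"train": 0.5, "validation": 0.25, "test": 0.25}  # hypothetical

train_idx = int(len(toy_dates) * toy_ratios["train"])                 # 4
val_idx = train_idx + int(len(toy_dates) * toy_ratios["validation"])  # 6
toy_train_end, toy_val_end = toy_dates[train_idx], toy_dates[val_idx]

# Inclusive upper bounds, as in the partitioning step.
n_train = int((toy_dates <= toy_train_end).sum())
n_val = int(((toy_dates > toy_train_end) & (toy_dates <= toy_val_end)).sum())
n_test = int((toy_dates > toy_val_end).sum())
print(n_train, n_val, n_test)
```

Here the split is 5/2/1 dates rather than 4/2/2. If exact proportions matter, one option (an assumption, not the behaviour of the code above) would be to take the boundary at position `train_idx - 1` instead.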


# ------------------------------------------------------------------------------
# Task 22, Step 3: Partition the dataset.
# ------------------------------------------------------------------------------

def _partition_datasets(
    stock_sequences: List[np.ndarray],
    market_sequence_map: Dict[pd.Timestamp, np.ndarray],
    target_matrix: np.ndarray,
    anchor_indices: pd.MultiIndex,
    train_end_date: pd.Timestamp,
    validation_end_date: pd.Timestamp
) -> Dict[str, ModelDataset]:
    """
    Partitions the complete dataset into training, validation, and test sets.

    Args:
        stock_sequences (List[np.ndarray]): List of all stock-level sequences.
        market_sequence_map (Dict[pd.Timestamp, np.ndarray]): Map of market sequences.
        target_matrix (np.ndarray): Matrix of all multi-horizon targets.
        anchor_indices (pd.MultiIndex): The anchor index for all samples.
        train_end_date (pd.Timestamp): The training set boundary date.
        validation_end_date (pd.Timestamp): The validation set boundary date.

    Returns:
        Dict[str, ModelDataset]: A dictionary containing the partitioned
                                 'train', 'validation', and 'test' datasets.
    """
    logging.info("Partitioning data into train, validation, and test sets...")

    # Extract the date component of the anchor index for masking.
    anchor_dates = anchor_indices.get_level_values('Date')

    # --- Create Boolean Masks for Each Set ---
    train_mask = anchor_dates <= train_end_date
    validation_mask = (anchor_dates > train_end_date) & (anchor_dates <= validation_end_date)
    test_mask = anchor_dates > validation_end_date

    # --- Assemble Final Datasets ---
    datasets: Dict[str, ModelDataset] = {}

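    # Convert list of arrays to a single 3D numpy array for easier slicing.
    stock_sequences_arr = np.array(stock_sequences)

    # The boolean masks below are only valid if every component is aligned
    # one-to-one with the anchor index (e.g. after targets have been filtered).
    if not (len(stock_sequences_arr) == len(target_matrix) == len(anchor_indices)):
        raise ValueError(
            "stock_sequences, target_matrix, and anchor_indices must have the same length; "
            f"got {len(stock_sequences_arr)}, {len(target_matrix)}, and {len(anchor_indices)}."
        )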

    for name, mask in [('train', train_mask), ('validation', validation_mask), ('test', test_mask)]:
        # Apply the mask to get the anchor indices for the current split.
        split_indices = anchor_indices[mask]
        split_dates = split_indices.get_level_values('Date')

        # Slice the stock sequences and targets using the boolean mask.
        split_stock_sequences = stock_sequences_arr[mask]
        split_targets = target_matrix[mask]

        # Retrieve the corresponding market sequences using the date map.
        # This is efficient as we only construct the array we need.
        split_market_sequences = np.array([market_sequence_map[date] for date in split_dates])

        # Store the partitioned data in the results dictionary.
        datasets[name] = ModelDataset(
            stock_sequences=split_stock_sequences,
            market_sequences=split_market_sequences,
            targets=split_targets
        )
        logging.info(f"  - {name.capitalize()} set created with {len(split_targets):,} samples.")

    # --- Final Sanity Check ---
    total_samples = sum(len(ds.targets) for ds in datasets.values())
    if total_samples != len(anchor_indices):
        raise RuntimeError("Sample count mismatch after partitioning. Check split logic.")

    return datasets
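
The partitioning logic can be sketched on toy data: arrays aligned with the anchor index are sliced with a boolean mask, while market windows are looked up per anchor date. All names and values below are hypothetical, and only the training mask is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical aligned components: 2 tickers x 4 anchor dates = 8 samples.
toy_days = pd.date_range("2020-01-01", periods=4, freq="D")
toy_anchor = pd.MultiIndex.from_product(
    [["AAA", "BBB"], toy_days], names=["TICKER", "Date"]
)
toy_stock = np.zeros((len(toy_anchor), 3, 7))  # (n_samples, L, n_stock_features)
toy_targets = np.arange(len(toy_anchor), dtype=float).reshape(-1, 1)
toy_market_map = {day: np.full((3, 5), float(k)) for k, day in enumerate(toy_days)}

anchor_dates = toy_anchor.get_level_values("Date")
train_mask = anchor_dates <= toy_days[1]  # first two dates, both tickers

# Arrays are sliced with the mask; market windows are looked up by date.
stock_train = toy_stock[train_mask]
targets_train = toy_targets[train_mask]
market_train = np.array([toy_market_map[day] for day in anchor_dates[train_mask]])
print(stock_train.shape, market_train.shape, targets_train.shape)
```

Because the same market window is shared by every ticker on a given date, the lookup duplicates it once per sample. This keeps the three arrays row-aligned at the cost of extra memory.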


# ------------------------------------------------------------------------------
# Task 22, Orchestrator Function
# ------------------------------------------------------------------------------

def split_dataset_chronologically(
    # Inputs from previous tasks
    stock_sequences: List[np.ndarray],
    market_sequence_map: Dict[pd.Timestamp, np.ndarray],
    target_matrix: np.ndarray,
    final_anchor_indices: pd.MultiIndex,
    # Configuration
    study_parameters: Dict[str, Any]
) -> Dict[str, ModelDataset]:
    """
    Orchestrates the chronological splitting of the dataset for time-series modeling.

    This function is critical for preventing look-ahead bias. It ensures that
    the training, validation, and test sets represent distinct, sequential
    periods of time.

    Args:
        stock_sequences (List[np.ndarray]): All stock-level feature sequences.
        market_sequence_map (Dict[pd.Timestamp, np.ndarray]): Map of market sequences.
        target_matrix (np.ndarray): All corresponding multi-horizon targets.
        final_anchor_indices (pd.MultiIndex): The (Date, Ticker) index for all samples.
        study_parameters (Dict[str, Any]): The main configuration dictionary.

    Returns:
        Dict[str, ModelDataset]: A dictionary containing the 'train',
                                 'validation', and 'test' ModelDataset objects.
    """
    logging.info("Initiating chronological dataset splitting pipeline...")

    # --- Step 1: Get and validate the split ratios from the configuration. ---
    ratios = _get_and_validate_split_ratios(study_parameters)
    logging.info("Step 1/3: Dataset split ratios validated.")

    # --- Step 2: Determine the chronological date boundaries for the splits. ---
    train_end_date, validation_end_date = _determine_chronological_split_dates(
        final_anchor_indices, ratios
    )
    logging.info("Step 2/3: Chronological split boundaries determined.")

    # --- Step 3: Partition all data components into three sets. ---
    partitioned_datasets = _partition_datasets(
        stock_sequences,
        market_sequence_map,
        target_matrix,
        final_anchor_indices,
        train_end_date,
        validation_end_date
    )
    logging.info("Step 3/3: All data components partitioned successfully.")

    logging.info("Chronological dataset splitting pipeline finished successfully.")

    return partitioned_datasets


# Task 23: Define the Dual-Stream Transformer architecture

# ==============================================================================
# Task 23: Define the Dual-Stream Transformer architecture
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 23, Component: Positional Encoding
# ------------------------------------------------------------------------------

class PositionalEncoding(nn.Module):
    """
    Injects positional information into sequence embeddings.

    This module implements the fixed sinusoidal positional encoding described in
    the "Attention Is All You Need" paper. It generates a matrix of sine and
    cosine functions of different frequencies, which are then added to the
    input embeddings. Self-attention on its own is insensitive to the order of
    its inputs, so this encoding is what lets the model use the relative or
    absolute position of elements in a sequence.

    The encoding is not a trainable parameter but is registered as a buffer,
    meaning it is part of the model's state and will be moved to the correct
    device (e.g., GPU) along with the model.
    """

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        """
        Initializes the PositionalEncoding module.

        Args:
            d_model (int): The dimensionality of the input embeddings.
            dropout (float): The dropout rate to apply to the final output.
            max_len (int): The maximum possible sequence length.
        """
        # Call the parent class constructor.
        super().__init__()

        # Initialize a dropout layer for regularization.
        self.dropout = nn.Dropout(p=dropout)

        # Create a tensor representing the positions in the sequence (0, 1, ..., max_len-1).
        # Shape: (max_len, 1)
        position = torch.arange(max_len).unsqueeze(1)

        # Calculate the division term for the sinusoidal functions. This creates
        # a geometric progression of frequencies.
        # Shape: (ceil(d_model / 2),)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

        # Initialize the positional encoding matrix with zeros.
        # Shape: (max_len, 1, d_model)
        pe = torch.zeros(max_len, 1, d_model)

        # Apply the sine function to even indices in the embedding dimension.
        pe[:, 0, 0::2] = torch.sin(position * div_term)

        # Apply the cosine function to odd indices in the embedding dimension.
        # Slicing `div_term` keeps the shapes aligned when `d_model` is odd, where
        # there is one fewer odd index than even index.
        pe[:, 0, 1::2] = torch.cos(position * div_term[: d_model // 2])

        # Register 'pe' as a buffer. This makes it part of the model's state_dict
        # but not a parameter that is updated by the optimizer.
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Adds positional encoding to the input tensor.

        Args:
            x (torch.Tensor): The input tensor of sequence embeddings.
                              Shape: (seq_len, batch_size, d_model).

        Returns:
            torch.Tensor: The output tensor with positional information added.
                          Shape: (seq_len, batch_size, d_model).
        """
        # Add the pre-computed positional encodings to the input embeddings.
        # We slice `self.pe` to match the length of the input sequence `x`.
        x = x + self.pe[:x.size(0)]

        # Apply dropout to the combined embeddings for regularization.
        return self.dropout(x)

# ------------------------------------------------------------------------------
# Task 23, Component: Bi-directional Cross-Attention
# ------------------------------------------------------------------------------

class BiDirectionalCrossAttention(nn.Module):
    """
    A module for bi-directional cross-attention between two parallel sequences.

    This module is a key component of the Dual-Stream architecture. It contains
    two multi-head attention layers:
    1.  One where the stock sequence acts as the query and the market sequence
        acts as the key and value.
    2.  One where the market sequence acts as the query and the stock sequence
        acts as the key and value.

    This allows information to flow between the two streams, enabling the model
    to learn context-dependent representations. Each attention operation is
    followed by a residual connection and layer normalization, as is standard
    in Transformer architectures.
    """

    def __init__(self, d_model: int, nhead: int, dropout: float = 0.1):
        """
        Initializes the BiDirectionalCrossAttention module.

        Args:
            d_model (int): The embedding dimension of the sequences.
            nhead (int): The number of attention heads.
            dropout (float): The dropout rate for the attention layers.
        """
        # Call the parent class constructor.
        super().__init__()

        # Initialize the attention layer for the stock stream to attend to the market stream.
        # `batch_first=False` is specified because we will be working with tensors of
        # shape (seq_len, batch_size, d_model).
        self.stock_to_market_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=False)

        # Initialize the attention layer for the market stream to attend to the stock stream.
        self.market_to_stock_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=False)

        # Initialize layer normalization for the output of each attention block.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # Initialize dropout layers for regularization within the residual connections.
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, stock_seq: torch.Tensor, market_seq: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Performs the forward pass for bi-directional cross-attention.

        Args:
            stock_seq (torch.Tensor): The encoded stock sequence tensor.
                                      Shape: (seq_len, batch_size, d_model).
            market_seq (torch.Tensor): The encoded market sequence tensor.
                                       Shape: (seq_len, batch_size, d_model).

        Returns:
            Tuple[torch.Tensor, torch.Tensor]: A tuple containing the updated
                stock and market sequence tensors after cross-attention.
        """
        # --- Stock-to-Market Attention Block ---
        # The stock sequence is the query (Q); the market sequence is the key (K) and value (V).
        attended_stock, _ = self.stock_to_market_attn(query=stock_seq, key=market_seq, value=market_seq)

        # Apply a residual connection (Add) and layer normalization (Norm).
        # This is a standard building block of Transformer architectures.
        stock_seq = self.norm1(stock_seq + self.dropout1(attended_stock))

        # --- Market-to-Stock Attention Block ---
        # The market sequence is the query (Q); the stock sequence is the key (K) and value (V).
        attended_market, _ = self.market_to_stock_attn(query=market_seq, key=stock_seq, value=stock_seq)

        # Apply a second residual connection and layer normalization.
        market_seq = self.norm2(market_seq + self.dropout2(attended_market))

        # Return the two updated sequences.
        return stock_seq, market_seq

# ------------------------------------------------------------------------------
# Task 23, Main Architecture: Dual-Stream Transformer
# ------------------------------------------------------------------------------

class DualStreamTransformer(nn.Module):
    """
    Implements the full Dual-Stream Transformer architecture for bubble prediction.

    This model is designed to process two parallel time-series inputs:
    1.  A stock-specific stream containing features like price, volume, and ratios.
    2.  A market-level stream containing features like VIX and market sentiment.

    The architecture consists of the following stages:
    - Input Projection: Each stream's features are projected into the `d_model`-dimensional embedding space.
    - Positional Encoding: Sinusoidal encodings are added to inform the model of sequence order.
    - Parallel Self-Attention: Each stream is processed by a separate Transformer encoder.
    - Bi-Directional Cross-Attention: The two streams interact, allowing them to share information.
    - Pooling & Fusion: The sequence representations are pooled and fused into a single vector.
    - Multi-Horizon Prediction: A set of independent MLP heads predicts the BubbleScore at different future horizons.
    """

    def __init__(self, config: Dict[str, Any], num_stock_features: int, num_market_features: int):
        """
        Initializes the DualStreamTransformer model.

        Args:
            config (Dict[str, Any]): The main study configuration dictionary.
            num_stock_features (int): The number of features in the stock-specific input stream.
            num_market_features (int): The number of features in the market-level input stream.

        Raises:
            ValueError: If the embedding dimension is not divisible by the number of attention heads.
        """
        # Call the parent class constructor.
        super().__init__()

        # --- Step 1: Hyperparameter Extraction and Validation ---
        # Extract architectural hyperparameters from the configuration dictionary.
        arch_config = config['architecture']
        d_model = arch_config['embedding_dim']
        nhead = arch_config['num_attention_heads']
        # Cast to int because the feed-forward width must be an integer even if the ratio is a float.
        d_hid = int(d_model * arch_config['mlp_hidden_dim_ratio'])
        nlayers = arch_config['num_encoder_layers']
        dropout = config['training']['dropout_rate']
        pred_head_hid_dim = arch_config['prediction_head_hidden_dim']
        self.num_horizons = len(config['backtesting']['strategy_rules']['prediction_horizons_to_test'])

        # Validate a critical architectural constraint for multi-head attention.
        if d_model % nhead != 0:
            raise ValueError(f"embedding_dim ({d_model}) must be divisible by num_attention_heads ({nhead}).")

        # --- Step 2 & 3: Architecture Definition ---
        # A linear layer to project the raw stock features into the model's embedding space.
        self.stock_projector = nn.Linear(num_stock_features, d_model)
        # A separate linear layer for the market features.
        self.market_projector = nn.Linear(num_market_features, d_model)

        # The positional encoding module.
        self.pos_encoder = PositionalEncoding(d_model, dropout)

        # Define a template Transformer encoder layer. `nn.TransformerEncoder` deep-copies
        # the layer it is given, so the two encoder stacks below do not share parameters.
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, d_hid, dropout, batch_first=False)

        # Create the Transformer encoder for the stock stream.
        self.stock_transformer_encoder = nn.TransformerEncoder(encoder_layer, nlayers)
        # Create a separate, independent Transformer encoder for the market stream.
        self.market_transformer_encoder = nn.TransformerEncoder(encoder_layer, nlayers)

        # The custom bi-directional cross-attention module.
        self.cross_attention = BiDirectionalCrossAttention(d_model, nhead, dropout)

        # A small MLP to fuse the representations from the two streams after pooling.
        self.fusion_layer = nn.Sequential(
            nn.Linear(d_model * 2, d_model),
            nn.ReLU(),
            nn.LayerNorm(d_model)
        )

        # Create a list of independent prediction heads, one for each forecast horizon.
        self.prediction_heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, pred_head_hid_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(pred_head_hid_dim, 1),
                nn.Tanh()  # Bounds each prediction to the open interval (-1, 1).
            ) for _ in range(self.num_horizons)
        ])

    def forward(self, stock_seq: torch.Tensor, market_seq: torch.Tensor) -> torch.Tensor:
        """
        Defines the forward pass of the model from inputs to final predictions.

        Args:
            stock_seq (torch.Tensor): A batch of stock-specific sequences.
                                      Shape: (batch_size, seq_len, num_stock_features).
            market_seq (torch.Tensor): A batch of market-level sequences.
                                       Shape: (batch_size, seq_len, num_market_features).

        Returns:
            torch.Tensor: A tensor of multi-horizon predictions.
                          Shape: (batch_size, num_horizons).
        """
        # --- 1. Input Projection ---
        # Map the input features of each stream to the model's embedding dimension `d_model`.
        stock_emb = self.stock_projector(stock_seq)
        market_emb = self.market_projector(market_seq)

        # --- 2. Reshape and Add Positional Encoding ---
        # With `batch_first=False`, PyTorch's Transformer modules expect shape
        # (seq_len, batch_size, d_model), so we permute from (batch, seq, feat) to (seq, batch, feat).
        stock_emb = stock_emb.permute(1, 0, 2)
        market_emb = market_emb.permute(1, 0, 2)

        # Add positional information to the embeddings.
        stock_pos = self.pos_encoder(stock_emb)
        market_pos = self.pos_encoder(market_emb)

        # --- 3. Independent Self-Attention (Encoders) ---
        # Each stream is processed independently by its own Transformer encoder.
        stock_encoded = self.stock_transformer_encoder(stock_pos)
        market_encoded = self.market_transformer_encoder(market_pos)

        # --- 4. Bi-directional Cross-Attention ---
        # The two streams interact, sharing contextual information.
        stock_cross, market_cross = self.cross_attention(stock_encoded, market_encoded)

        # --- 5. Pooling and Fusion ---
        # Aggregate the sequence information by taking the mean across the time dimension (dim=0).
        stock_pooled = stock_cross.mean(dim=0)
        market_pooled = market_cross.mean(dim=0)

        # Concatenate the two resulting vectors into a single, larger representation.
        fused = torch.cat((stock_pooled, market_pooled), dim=1)

        # Pass the concatenated vector through a final fusion layer.
        final_repr = self.fusion_layer(fused)

        # --- 6. Multi-Horizon Prediction ---
        # Apply each independent prediction head to the final fused representation.
        predictions = [head(final_repr) for head in self.prediction_heads]

        # Concatenate the scalar outputs from each head into a single prediction tensor.
        # The result is a tensor of shape (batch_size, num_horizons).
        return torch.cat(predictions, dim=1)
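
For readers who want to check the positional encoding by hand, the same formula can be written without PyTorch. The sketch below is illustrative only: `sinusoidal_table` is a hypothetical helper, it returns nested Python lists rather than a registered buffer, and it omits dropout. Even slots hold `sin(pos * freq)` and odd slots hold `cos(pos * freq)`, with `freq = exp(k * -ln(10000) / d_model)` for each even index `k`.

```python
import math
from typing import List


def sinusoidal_table(max_len: int, d_model: int) -> List[List[float]]:
    """Build a (max_len x d_model) table of sinusoidal positional encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, d_model, 2):
            # Same frequency as exp(k * -ln(10000) / d_model) in the module above.
            freq = math.exp(k * (-math.log(10000.0) / d_model))
            table[pos][k] = math.sin(pos * freq)          # even index: sine
            if k + 1 < d_model:
                table[pos][k + 1] = math.cos(pos * freq)  # odd index: cosine
    return table


pe = sinusoidal_table(max_len=4, d_model=4)
# Position 0 encodes as sin(0) = 0 and cos(0) = 1 in alternating slots.
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Because the table depends only on `max_len` and `d_model`, it can be computed once and reused for both streams, which is what the shared `pos_encoder` does.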

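To make the direction of information flow in the cross-attention module concrete, the sketch below runs each direction as plain scaled dot-product attention. It is a deliberate simplification and not the module's implementation: a single head, no learned projections, no batch dimension, and no residual connection or layer normalization, all of which `nn.MultiheadAttention` and `BiDirectionalCrossAttention` add. `cross_attend` is a hypothetical helper name, and the input values are made up.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Matrix, key_value: Matrix) -> Matrix:
    """Each query row becomes a weighted average of the key/value rows."""
    d_model = len(query[0])
    output: Matrix = []
    for q in query:
        # Similarity of this query step to every step of the other stream.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_model)
                  for k in key_value]
        weights = softmax(scores)
        # Mix the other stream's rows using the attention weights.
        output.append([sum(w * v[j] for w, v in zip(weights, key_value))
                       for j in range(d_model)])
    return output


stock = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 time steps, d_model = 2
market = [[2.0, 0.0], [0.0, 2.0]]              # 2 time steps, d_model = 2

stock_attended = cross_attend(stock, market)   # stock queries, market keys/values
market_attended = cross_attend(market, stock)  # market queries, stock keys/values
```

The output always has the query stream's length, which is why the module can add the attended result back onto the query through a residual connection.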

In [None]:
# Task 24: Implement the composite training loss function

# ==============================================================================
# Task 24: Implement the composite training loss function
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 24, Steps 1, 2, & 3: Define and combine all loss components.
# ------------------------------------------------------------------------------

class CompositeLoss(nn.Module):
    """
    Implements the composite, multi-component loss function described in Eq. (15).

    This loss function is a weighted sum of five distinct components, each
    targeting a different desirable property for the model's predictions:
    1. Huber Loss: Robustness to outliers (point-wise accuracy).
    2. Correlation Loss: Encourages a linear relationship between predictions and targets.
    3. R-squared Loss: Directly optimizes the coefficient of determination.
    4. Consistency Loss: Penalizes mismatches in the changes between consecutive
       prediction horizons.
    5. Smoothness Loss: Regularizes the predictions to prevent abrupt changes
       across horizons.

    Equation (15):
    L = λ1*L_Huber + λ2*L_Corr + λ3*L_R2 + λ4*L_Cons + λ5*L_Smooth
    """

    def __init__(self, config: Dict[str, Any], epsilon: float = 1e-8):
        """
        Initializes the CompositeLoss module and extracts weights from config.

        Args:
            config (Dict[str, Any]): The main study configuration dictionary.
            epsilon (float): A small value to add to denominators for
                             numerical stability.

        Raises:
            KeyError: If the loss function weights are not found in the config.
        """
        # Call the parent class constructor.
        super().__init__()

        # --- Step 1: Extract loss component weights ---
        try:
            # Access the nested dictionary of loss weights.
            weights = config['predictive_model']['training']['loss_function_weights']
            # Store each weight as an attribute of the class.
            self.lambda_huber = weights['lambda_1_huber']
            self.lambda_corr = weights['lambda_2_corr']
            self.lambda_r_squared = weights['lambda_3_r_squared']
            self.lambda_cons = weights['lambda_4_cons']
            self.lambda_smooth = weights['lambda_5_smooth']
        except KeyError as e:
            logging.error(f"Missing a required loss weight in the configuration: {e}")
            raise

        # Store the epsilon value for numerical stability.
        self.epsilon = epsilon

    def forward(self, predictions: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        """
        Computes the total composite loss.

        Args:
            predictions (torch.Tensor): The model's output tensor.
                                        Shape: (batch_size, num_horizons).
            targets (torch.Tensor): The ground-truth target tensor.
                                    Shape: (batch_size, num_horizons).

        Returns:
            torch.Tensor: A single scalar tensor representing the total loss.
        """
        # --- Step 2: Define each loss component ---

        # 1. Huber Loss (L_Huber): Robust to outliers.
        # PyTorch's built-in Huber loss is efficient and stable.
        loss_huber = F.huber_loss(predictions, targets, reduction='mean')

        # 2. Correlation Loss (L_Corr): 1 - Pearson Correlation.
        # We compute this manually for full control over stability.
        # Center the predictions and targets by subtracting their means.
        pred_mean = predictions.mean()
        targ_mean = targets.mean()
        pred_centered = predictions - pred_mean
        targ_centered = targets - targ_mean
        # Calculate the covariance.
        covariance = (pred_centered * targ_centered).mean()
        # Calculate the standard deviations.
        pred_std = torch.sqrt((pred_centered**2).mean())
        targ_std = torch.sqrt((targ_centered**2).mean())
        # Calculate the Pearson correlation coefficient.
        correlation = covariance / (pred_std * targ_std + self.epsilon)
        # The loss is 1 minus the correlation.
        loss_corr = 1 - correlation

        # 3. R-squared Loss (L_R2): 1 - R^2, which simplifies to SS_res / SS_tot.
        # This directly encourages the model to explain variance.
        # Calculate the residual sum of squares.
        ss_res = torch.sum((targets - predictions)**2)
        # Calculate the total sum of squares.
        ss_tot = torch.sum((targets - targets.mean())**2)
        # The loss is the ratio. Add epsilon for stability.
        loss_r_squared = ss_res / (ss_tot + self.epsilon)

        # 4. Consistency Loss (L_Cons): MSE of the first differences.
        # This encourages the model to match the direction and magnitude of changes.
        # 5. Smoothness Loss (L_Smooth): MSE of the prediction's first differences.
        # This acts as a regularizer to penalize overly volatile predictions.
        if predictions.size(1) > 1:
            # Calculate the difference between consecutive horizons.
            delta_pred = torch.diff(predictions, dim=1)
            delta_targ = torch.diff(targets, dim=1)
            # Calculate the Mean Squared Error of these differences.
            loss_cons = F.mse_loss(delta_pred, delta_targ)
            loss_smooth = F.mse_loss(delta_pred, torch.zeros_like(delta_pred))
        else:
            # With a single horizon there are no differences. A mean over an empty
            # tensor would be NaN, so both terms are set to zero instead.
            loss_cons = predictions.new_zeros(())
            loss_smooth = predictions.new_zeros(())

        # --- Step 3: Combine into the total loss ---
        # Calculate the final weighted sum of all loss components.
        total_loss = (
            self.lambda_huber * loss_huber +
            self.lambda_corr * loss_corr +
            self.lambda_r_squared * loss_r_squared +
            self.lambda_cons * loss_cons +
            self.lambda_smooth * loss_smooth
        )

        return total_loss
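
The correlation and R-squared terms are easy to conflate, so a small worked case may help. The sketch below recomputes both in plain Python for predictions that are exactly twice the targets. It is an illustration under simplifying assumptions (no epsilon, a single flat series, hypothetical helper names), not the class above. The predictions are perfectly correlated with the targets, so the correlation term vanishes, yet the R-squared term is large because the scale is wrong.

```python
import math
from typing import Sequence


def correlation_loss(pred: Sequence[float], targ: Sequence[float]) -> float:
    """1 - Pearson correlation; insensitive to the scale and offset of pred."""
    p_mean = sum(pred) / len(pred)
    t_mean = sum(targ) / len(targ)
    cov = sum((p - p_mean) * (t - t_mean) for p, t in zip(pred, targ))
    p_norm = math.sqrt(sum((p - p_mean) ** 2 for p in pred))
    t_norm = math.sqrt(sum((t - t_mean) ** 2 for t in targ))
    return 1.0 - cov / (p_norm * t_norm)


def r_squared_loss(pred: Sequence[float], targ: Sequence[float]) -> float:
    """SS_res / SS_tot, i.e. 1 - R^2; penalizes scale and offset errors."""
    t_mean = sum(targ) / len(targ)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, targ))
    ss_tot = sum((t - t_mean) ** 2 for t in targ)
    return ss_res / ss_tot


targets = [1.0, 2.0, 3.0]
predictions = [2.0, 4.0, 6.0]  # perfectly correlated, wrong scale

corr_term = correlation_loss(predictions, targets)  # approximately 0
r2_term = r_squared_loss(predictions, targets)      # 14 / 2 = 7.0
```

This is one reason to weight the two terms separately: the correlation term rewards getting the shape right, while the R-squared and Huber terms hold the model to the correct level and scale.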


In [None]:
# Task 25: Train the Dual-Stream Transformer model and Task 26: Validate the model and implement early stopping.

# ==============================================================================
# Task 25: Train the Dual-Stream Transformer model without validation or
#          early stopping
# ==============================================================================

def train_model(
    model: DualStreamTransformer,
    partitioned_datasets: Dict[str, ModelDataset],
    loss_function: CompositeLoss,
    study_parameters: Dict[str, Any],
    device: torch.device,
    output_dir: Union[str, Path] = "models"
) -> Tuple[DualStreamTransformer, List[Dict[str, float]]]:
    """
    Orchestrates the end-to-end training of the Dual-Stream Transformer model.

    This function manages the entire training process, including:
    1.  Initializing the AdamW optimizer and the OneCycleLR learning rate scheduler.
    2.  Running the main training loop over the specified number of epochs.
    3.  Implementing the complete forward/backward pass, including gradient clipping.
    4.  Logging training progress (loss, learning rate) and saving periodic
        checkpoints for resumability.

    Args:
        model (DualStreamTransformer): The instantiated model to be trained.
        partitioned_datasets (Dict[str, ModelDataset]): The dictionary containing
            the 'train', 'validation', and 'test' data splits.
        loss_function (CompositeLoss): The instantiated composite loss function.
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        device (torch.device): The device (CPU or CUDA) to train on.
        output_dir (Union[str, Path]): The directory to save checkpoints.

    Returns:
        Tuple[DualStreamTransformer, List[Dict[str, float]]]: A tuple containing:
            - The model with its trained weights.
            - A list of dictionaries logging the training history per epoch.
    """
    logging.info("Initiating model training pipeline...")

    # --- Extract Configuration ---
    # Retrieve training and optimizer parameters from the configuration.
    train_config = study_parameters['predictive_model']['training']
    optim_config = study_parameters['predictive_model']['optimizer']

    num_epochs = train_config['num_epochs']
    batch_size = train_config['batch_size']
    lr = optim_config['learning_rate']
    weight_decay = optim_config['weight_decay']
    clip_threshold = optim_config['gradient_clipping_threshold']

    # --- Prepare DataLoader ---
    # Convert numpy arrays from the training set into PyTorch tensors.
    train_data = partitioned_datasets['train']
    train_tensors = [
        torch.from_numpy(train_data.stock_sequences).float(),
        torch.from_numpy(train_data.market_sequences).float(),
        torch.from_numpy(train_data.targets).float()
    ]
    # Create a TensorDataset and a DataLoader for efficient batching and shuffling.
    train_dataset = TensorDataset(*train_tensors)
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # --- Step 1: Initialize Optimizer and Scheduler ---
    # Instantiate the AdamW optimizer, which is well-suited for Transformer models.
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    # Instantiate the OneCycleLR scheduler, which warms the learning rate up to
    # `max_lr` and then anneals it. The configured `lr` is therefore the peak rate,
    # not the starting rate. The scheduler needs the total number of training steps.
    total_steps = num_epochs * len(train_dataloader)
    scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr, total_steps=total_steps)

    logging.info(f"Optimizer (AdamW) and Scheduler (OneCycleLR) initialized. Total training steps: {total_steps}.")

    # --- Step 2 & 3: Implement Training Loop with Logging and Checkpointing ---
    # Move the model to the designated training device.
    model.to(device)

    # This list will store the loss and learning rate for each epoch.
    training_history = []

    # Create directories for saving checkpoints.
    checkpoint_dir = Path(output_dir) / "checkpoints"
    checkpoint_dir.mkdir(parents=True, exist_ok=True)

    logging.info(f"Starting training for {num_epochs} epochs...")
    # The main training loop iterates over epochs.
    for epoch in range(num_epochs):
        # Set the model to training mode. This enables layers like Dropout.
        model.train()

        # Accumulator for the loss over an epoch.
        running_loss = 0.0

        # The inner loop iterates over batches of data.
        for batch in tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{num_epochs}"):
            # Unpack the batch and move all tensors to the training device.
            stock_seq, market_seq, targets = [b.to(device) for b in batch]

            # --- Forward Pass ---
            # Get model predictions for the current batch.
            predictions = model(stock_seq, market_seq)

            # --- Loss Calculation ---
            # Compute the composite loss between predictions and targets.
            loss = loss_function(predictions, targets)

            # --- Backward Pass and Optimization ---
            # 1. Reset the gradients from the previous step.
            optimizer.zero_grad()

            # 2. Compute gradients of the loss with respect to model parameters.
            loss.backward()

            # 3. Clip the global gradient norm to guard against exploding gradients.
            # This is a stability safeguard rather than a form of regularization.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_threshold)

            # 4. Update the model's weights based on the computed gradients.
            optimizer.step()

            # 5. Update the learning rate. For OneCycleLR, this is done after every batch.
            scheduler.step()

            # Accumulate the loss for this batch.
            running_loss += loss.item()

        # --- End of Epoch Logging ---
        # Calculate the average loss for the epoch.
        avg_loss = running_loss / len(train_dataloader)
        # Get the current learning rate from the optimizer.
        current_lr = optimizer.param_groups[0]['lr']

        # Log the epoch's performance.
        logging.info(f"Epoch {epoch + 1} Complete | Average Loss: {avg_loss:.6f} | Current LR: {current_lr:.8f}")

        # Store the metrics in the history log.
        training_history.append({'epoch': epoch + 1, 'loss': avg_loss, 'lr': current_lr})

        # --- Step 3: Checkpointing ---
        # Save a checkpoint periodically (e.g., every 10 epochs).
        if (epoch + 1) % 10 == 0:
            checkpoint_path = checkpoint_dir / f"epoch_{epoch + 1}.pth"
            logging.info(f"Saving checkpoint to '{checkpoint_path}'...")
            # Save a comprehensive state dictionary for full resumability.
            torch.save({
                'epoch': epoch + 1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'loss': avg_loss,
            }, checkpoint_path)

    logging.info("Model training pipeline finished successfully.")

    # Return the trained model and its training history.
    return model, training_history


# ==============================================================================
# Task 26: Train the Dual-Stream Transformer model with validation and
#          Early stopping
# ==============================================================================

def _run_validation_epoch(
    model: DualStreamTransformer,
    val_dataloader: DataLoader,
    loss_function: CompositeLoss,
    device: torch.device
) -> Dict[str, float]:
    """
    Runs a full evaluation pass on the validation dataset for a single epoch.

    This function is a critical part of the training loop. It sets the model
    to evaluation mode, disables gradient calculations for efficiency, and
    iterates through the entire validation set to compute performance metrics.
    Correlation, MSE, RMSE, and MAE are calculated on the full validation set
    (not per-batch); the validation loss is the mean of the per-batch losses.

    Args:
        model (DualStreamTransformer):
            The model instance to be evaluated.
        val_dataloader (DataLoader):
            The PyTorch DataLoader for the validation dataset.
        loss_function (CompositeLoss):
            The instantiated loss function used to calculate the validation loss.
        device (torch.device):
            The device (e.g., 'cuda' or 'cpu') on which to perform the evaluation.

    Returns:
        Dict[str, float]:
            A dictionary containing key validation metrics for the epoch,
            including 'val_loss', 'val_corr', 'val_mse', 'val_rmse', and 'val_mae'.
    """
    # Set the model to evaluation mode. This disables Dropout, which makes the
    # forward pass deterministic. (LayerNorm behaves identically in training and
    # evaluation mode; only layers such as BatchNorm switch behavior.)
    model.eval()

    # Initialize a variable to accumulate the loss over all batches.
    running_val_loss = 0.0
    # Initialize lists to store all predictions and targets from the validation set.
    all_predictions: List[torch.Tensor] = []
    all_targets: List[torch.Tensor] = []

    # Use the `torch.no_grad()` context manager to disable gradient computation.
    # This significantly speeds up the forward pass and reduces memory consumption.
    with torch.no_grad():
        # Iterate over all batches provided by the validation DataLoader.
        for batch in val_dataloader:
            # Unpack the batch and move all tensors to the specified evaluation device.
            stock_seq, market_seq, targets = [b.to(device) for b in batch]

            # --- Forward Pass ---
            # Pass the input sequences through the model to get predictions.
            predictions = model(stock_seq, market_seq)

            # --- Loss Calculation ---
            # Compute the loss for the current batch using the provided loss function.
            loss = loss_function(predictions, targets)
            # Add the scalar loss value of the batch to the running total.
            running_val_loss += loss.item()

            # Append the batch's predictions and targets to the master lists.
            # Move them to the CPU to free up GPU memory.
            all_predictions.append(predictions.cpu())
            all_targets.append(targets.cpu())

    # Concatenate the lists of batch tensors into single, large tensors.
    all_predictions_tensor = torch.cat(all_predictions)
    all_targets_tensor = torch.cat(all_targets)

    # --- Calculate Full-Set Metrics ---
    # Calculate the average validation loss across all batches.
    avg_val_loss = running_val_loss / len(val_dataloader)

    # Manually calculate the Pearson correlation coefficient for the entire validation set,
    # pooling all horizons. This is more accurate than averaging per-batch correlations.
    # Center the prediction and target tensors.
    vx = all_predictions_tensor - torch.mean(all_predictions_tensor)
    vy = all_targets_tensor - torch.mean(all_targets_tensor)
    # Compute correlation as sum(vx * vy) / (||vx|| * ||vy||), which equals
    # cov(X, Y) / (std(X) * std(Y)). The small constant avoids a NaN when either
    # series is constant.
    corr = torch.sum(vx * vy) / (torch.sqrt(torch.sum(vx ** 2)) * torch.sqrt(torch.sum(vy ** 2)) + 1e-8)

    # Calculate Mean Squared Error (MSE) using PyTorch's functional implementation.
    mse = F.mse_loss(all_predictions_tensor, all_targets_tensor)
    # Calculate Mean Absolute Error (MAE).
    mae = F.l1_loss(all_predictions_tensor, all_targets_tensor)

    # Return a dictionary containing all computed metrics for this epoch.
    return {
        'val_loss': avg_val_loss,
        'val_corr': corr.item(),
        'val_mse': mse.item(),
        'val_rmse': torch.sqrt(mse).item(),
        'val_mae': mae.item()
    }


def train_and_validate_model(
    model: DualStreamTransformer,
    partitioned_datasets: Dict[str, ModelDataset],
    loss_function: CompositeLoss,
    study_parameters: Dict[str, Any],
    device: torch.device,
    output_dir: Union[str, Path] = "models"
) -> Tuple[DualStreamTransformer, pd.DataFrame]:
    """
    Orchestrates the complete model training and validation pipeline, including
    early stopping and checkpointing of the best model.

    This function integrates a training loop with a validation loop at the end
    of each epoch. It monitors the validation loss to prevent overfitting by
    stopping the training process when performance on the validation set ceases
    to improve. It saves the model state at the point of best performance.

    Args:
        model (DualStreamTransformer):
            The instantiated model to be trained.
        partitioned_datasets (Dict[str, ModelDataset]):
            A dictionary containing the 'train', 'validation', and 'test' data splits.
        loss_function (CompositeLoss):
            The instantiated composite loss function.
        study_parameters (Dict[str, Any]):
            The main configuration dictionary for the study.
        device (torch.device):
            The device (e.g., 'cuda' or 'cpu') on which to perform training.
        output_dir (Union[str, Path]):
            The directory where model checkpoints and training logs will be saved.

    Returns:
        Tuple[DualStreamTransformer, pd.DataFrame]: A tuple containing:
            - The model instance loaded with the weights from the best performing epoch.
            - A pandas DataFrame logging the complete training and validation history.
    """
    # Announce the start of the training process.
    logging.info("Initiating model training and validation pipeline...")

    # --- Configuration Extraction ---
    # Retrieve all necessary hyperparameters from the configuration dictionary.
    train_config = study_parameters['predictive_model']['training']
    optim_config = study_parameters['predictive_model']['optimizer']
    early_stop_config = study_parameters['predictive_model']['early_stopping']

    num_epochs = train_config['num_epochs']
    batch_size = train_config['batch_size']
    lr = optim_config['learning_rate']
    weight_decay = optim_config['weight_decay']
    clip_threshold = optim_config['gradient_clipping_threshold']
    patience = early_stop_config['patience']

    # --- Prepare DataLoaders ---
    # Create a PyTorch TensorDataset from the training data numpy arrays.
    train_data = partitioned_datasets['train']
    train_dataset = TensorDataset(
        torch.from_numpy(train_data.stock_sequences).float(),
        torch.from_numpy(train_data.market_sequences).float(),
        torch.from_numpy(train_data.targets).float()
    )
    # Create a DataLoader to handle batching and shuffling of the training data.
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # Create a TensorDataset and DataLoader for the validation data (no shuffling).
    val_data = partitioned_datasets['validation']
    val_dataset = TensorDataset(
        torch.from_numpy(val_data.stock_sequences).float(),
        torch.from_numpy(val_data.market_sequences).float(),
        torch.from_numpy(val_data.targets).float()
    )
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    # --- Initialize Optimizer and Scheduler ---
    # Instantiate the AdamW optimizer with the model's parameters and configured hyperparameters.
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    # Calculate the total number of training steps (batches) for the scheduler.
    total_steps = num_epochs * len(train_dataloader)
    # Instantiate the OneCycleLR scheduler, which manages the learning rate over the training run.
    scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr, total_steps=total_steps)

    # --- Initialize Early Stopping and Logging Variables ---
    # Track the best validation loss seen so far, initialized to infinity.
    best_val_loss = float('inf')
    # Counter for epochs without improvement in validation loss.
    epochs_no_improve = 0
    # List to store the metrics from each epoch.
    history = []
    # Define paths for saving model and log files.
    output_path = Path(output_dir)
    best_model_path = output_path / "best_model.pth"
    # Ensure the output directory exists.
    output_path.mkdir(parents=True, exist_ok=True)

    # --- Main Training & Validation Loop ---
    # Move the model to the designated training device.
    model.to(device)
    logging.info(f"Starting training for up to {num_epochs} epochs with early stopping patience of {patience}...")

    # The main loop iterates over the specified number of epochs.
    for epoch in range(num_epochs):
        # --- Training Step ---
        # Set the model to training mode to enable dropout, etc.
        model.train()
        # Initialize a variable to accumulate the training loss for the epoch.
        running_train_loss = 0.0
        # Iterate over batches from the training DataLoader.
        for batch in tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{num_epochs} [Train]"):
            # Move the batch of data to the training device.
            stock_seq, market_seq, targets = [b.to(device) for b in batch]
            # Reset gradients to zero before the backward pass.
            optimizer.zero_grad()
            # Perform the forward pass to get model predictions.
            predictions = model(stock_seq, market_seq)
            # Calculate the loss.
            loss = loss_function(predictions, targets)
            # Perform the backward pass to compute gradients.
            loss.backward()
            # Clip the norm of the gradients to prevent them from exploding.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_threshold)
            # Update the model's weights using the optimizer.
            optimizer.step()
            # Update the learning rate according to the scheduler's policy.
            scheduler.step()
            # Add the batch loss to the running total for the epoch.
            running_train_loss += loss.item()

        # Calculate the average training loss for the epoch.
        avg_train_loss = running_train_loss / len(train_dataloader)

        # --- Step 1: Validation Step ---
        # Run a full evaluation on the validation set.
        val_metrics = _run_validation_epoch(model, val_dataloader, loss_function, device)

        # --- Logging ---
        # Get the current learning rate for logging.
        current_lr = optimizer.param_groups[0]['lr']
        # Format and print a comprehensive log message for the epoch.
        log_message = (
            f"Epoch {epoch + 1} | Train Loss: {avg_train_loss:.6f} | "
            f"Val Loss: {val_metrics['val_loss']:.6f} | Val Corr: {val_metrics['val_corr']:.4f} | "
            f"LR: {current_lr:.8f}"
        )
        logging.info(log_message)

        # Store all metrics for this epoch in the history log.
        epoch_history = {'epoch': epoch + 1, 'train_loss': avg_train_loss, 'lr': current_lr, **val_metrics}
        history.append(epoch_history)

        # --- Step 2: Early Stopping Logic ---
        # Check if the current validation loss is the best seen so far.
        if val_metrics['val_loss'] < best_val_loss:
            # If so, update the best loss, save the model state, and reset the patience counter.
            logging.info(f"Validation loss improved from {best_val_loss:.6f} to {val_metrics['val_loss']:.6f}. Saving best model...")
            best_val_loss = val_metrics['val_loss']
            epochs_no_improve = 0
            # Save a checkpoint containing the model's state dictionary and other useful info.
            torch.save({
                'epoch': epoch + 1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'val_loss': best_val_loss,
            }, best_model_path)
        else:
            # If validation loss did not improve, increment the patience counter.
            epochs_no_improve += 1
            logging.info(f"Validation loss did not improve. Patience: {epochs_no_improve}/{patience}.")

        # If the patience counter reaches the configured limit, stop training.
        if epochs_no_improve >= patience:
            logging.info(f"Early stopping triggered after {patience} epochs with no improvement.")
            break

    # --- Finalization ---
    logging.info("Model training pipeline finished.")

    # Load the state dictionary from the best saved model to ensure the returned
    # model object has the best performing weights.
    logging.info(f"Loading best model from epoch with validation loss: {best_val_loss:.6f}")
    checkpoint = torch.load(best_model_path, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])

    # Convert the history list of dictionaries to a pandas DataFrame.
    history_df = pd.DataFrame(history)
    # Define the path for the history log file.
    history_path = output_path / "training_history.csv"
    # Save the history DataFrame to a CSV file for later analysis.
    history_df.to_csv(history_path, index=False)
    logging.info(f"Full training history saved to '{history_path}'.")

    # Return the best model and the training history.
    return model, history_df
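
The early-stopping rule used in `train_and_validate_model` can be examined in isolation. The following is a minimal sketch, assuming a plain list of validation losses; the helper name `early_stopping_trace` and the loss values are hypothetical and are not part of the pipeline.

```python
from typing import List, Tuple


def early_stopping_trace(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run), both 1-indexed, for a loss sequence."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_no_improve = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        last_epoch = epoch
        if loss < best_loss:
            # Strict improvement: record it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_no_improve = 0
        else:
            # No improvement: one more epoch counted against the patience budget.
            epochs_no_improve += 1
        if epochs_no_improve >= patience:
            # Stop as soon as the counter reaches the patience limit.
            break
    return best_epoch, last_epoch


# Hypothetical validation losses, chosen only for illustration.
losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64, 0.60]
print(early_stopping_trace(losses, patience=2))  # (3, 5)
print(early_stopping_trace(losses, patience=3))  # (7, 7)
```

With a patience of 2 the run halts at epoch 5 and never sees the later improvement at epoch 6, which is why the patience value trades training time against the risk of stopping on a temporary plateau.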


In [None]:
# Task 27: Persist the trained model and architecture metadata

# ==============================================================================
# Task 27: Persist the trained model and architecture metadata
# ==============================================================================

def persist_model_and_metadata(
    study_parameters: Dict[str, Any],
    training_history: pd.DataFrame,
    model_dir: Union[str, Path] = "models",
    log_dir: Union[str, Path] = "logs"
) -> Path:
    """
    Orchestrates the persistence of the final model and all related metadata,
    culminating in a single, reproducible bundle.

    This function performs three steps:
    1.  Validates the existence and integrity of the best model checkpoint saved
        during training.
    2.  Creates a comprehensive JSON metadata file detailing the model's
        architecture, training hyperparameters, performance, and the exact
        code version (Git hash) used.
    3.  Bundles all critical artifacts (model weights, metadata, config, logs)
        into a single compressed .tar.gz archive for portability and
        reproducibility.

    Args:
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        training_history (pd.DataFrame): The DataFrame logging the training and
                                         validation metrics for each epoch.
        model_dir (Union[str, Path]): The directory where the best model was saved.
        log_dir (Union[str, Path]): The directory containing configuration snapshots
                                    and where training logs are saved.

    Returns:
        Path: The file path to the final reproducibility bundle (.tar.gz).

    Raises:
        FileNotFoundError: If a required artifact for bundling is missing.
    """
    logging.info("Initiating model persistence and metadata creation pipeline...")

    model_path = Path(model_dir)
    log_path = Path(log_dir)

    # --- Step 1: Validate the final trained model state ---
    # The best model was already saved by the training function. Here, we verify it.
    best_model_filepath = model_path / "best_model.pth"
    if not best_model_filepath.exists():
        raise FileNotFoundError(f"Best model checkpoint not found at '{best_model_filepath}'. Training may have failed.")

    try:
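        # Load the checkpoint onto the CPU so it can be verified on any machine.
        checkpoint = torch.load(best_model_filepath, map_location="cpu")
        # Check for essential keys explicitly (assert statements are skipped under `python -O`).
        if 'epoch' not in checkpoint or 'model_state_dict' not in checkpoint:
            raise KeyError("Checkpoint is missing 'epoch' or 'model_state_dict'.")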
        best_epoch = checkpoint['epoch']
        logging.info(f"Step 1/3: Verified best model checkpoint from epoch {best_epoch} at '{best_model_filepath}'.")
    except Exception as e:
        logging.error(f"Failed to load or validate the best model checkpoint: {e}")
        raise

    # --- Step 2: Document the model architecture and hyperparameters ---
    logging.info("Creating comprehensive model metadata file...")

    # Find the performance metrics from the best epoch.
    best_epoch_metrics = training_history.loc[training_history['val_loss'].idxmin()].to_dict()

    # Get the current Git commit hash to record the code version (uncommitted changes are not captured).
    try:
        git_hash = subprocess.check_output(['git', 'rev-parse', 'HEAD']).strip().decode('utf-8')
    except (subprocess.CalledProcessError, FileNotFoundError):
        git_hash = "N/A (Not a git repository or git is not installed)"
        logging.warning("Could not retrieve Git commit hash.")

    # Construct the complete metadata dictionary.
    model_metadata = {
        "model_architecture": study_parameters['predictive_model']['architecture'],
        "training_parameters": study_parameters['predictive_model']['training'],
        "optimizer_parameters": study_parameters['predictive_model']['optimizer'],
        "training_completion_timestamp": datetime.now().isoformat(),
        "code_version_git_hash": git_hash,
        "best_epoch_performance": {
            "best_epoch": best_epoch,
            "metrics": best_epoch_metrics
        }
    }

    # Save the metadata to a JSON file.
    metadata_filepath = model_path / "model_metadata.json"
    serializable_metadata = _make_json_serializable(model_metadata)
    with metadata_filepath.open('w') as f:
        json.dump(serializable_metadata, f, indent=4)
    logging.info(f"Step 2/3: Model metadata saved to '{metadata_filepath}'.")

    # --- Step 3: Create a reproducibility bundle ---
    logging.info("Creating reproducibility bundle...")

    # Find the most recent configuration snapshot in the log directory.
    config_snapshots = sorted(log_path.glob("config_snapshot_*.json"), reverse=True)
    if not config_snapshots:
        raise FileNotFoundError(f"No configuration snapshot found in '{log_path}'.")
    config_snapshot_path = config_snapshots[0]

    # Define all files to be included in the bundle.
    files_to_bundle = {
        "best_model.pth": best_model_filepath,
        "model_metadata.json": metadata_filepath,
        "config_snapshot.json": config_snapshot_path,
        # train_and_validate_model writes the history CSV to its output_dir (the model directory).
        "training_history.csv": model_path / "training_history.csv"
    }

    # Verify that all required files exist before creating the archive.
    for name, path in files_to_bundle.items():
        if not path.exists():
            raise FileNotFoundError(f"Cannot create bundle: required artifact '{name}' not found at '{path}'.")

    # Create the final .tar.gz archive.
    bundle_filename = f"model_bundle_{datetime.now().strftime('%Y%m%d_%H%M%S')}.tar.gz"
    bundle_filepath = model_path / bundle_filename

    with tarfile.open(bundle_filepath, "w:gz") as tar:
        for name_in_archive, path_on_disk in files_to_bundle.items():
            # Add each file to the archive with a clean name.
            tar.add(path_on_disk, arcname=name_in_archive)

    logging.info(f"Step 3/3: Reproducibility bundle created at '{bundle_filepath}'.")
    logging.info("Model persistence and metadata pipeline finished successfully.")

    return bundle_filepath
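
The bundling step relies on the `arcname` argument of `tarfile.TarFile.add` to store each artifact under a clean name regardless of where it lives on disk. Below is a minimal sketch of that idea using only the standard library; the helper name `bundle_files` and the temporary file layout are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import Dict, List


def bundle_files(files: Dict[str, Path], bundle_path: Path) -> List[str]:
    """Write each file into a .tar.gz under its dictionary key and return the member names."""
    # Fail early, before the archive is created, if any artifact is missing.
    missing = [name for name, path in files.items() if not path.exists()]
    if missing:
        raise FileNotFoundError(f"Missing artifacts: {missing}")
    with tarfile.open(bundle_path, "w:gz") as tar:
        for name_in_archive, path_on_disk in files.items():
            # arcname replaces the on-disk path with the clean archive name.
            tar.add(path_on_disk, arcname=name_in_archive)
    # Re-open the archive to list what was actually stored.
    with tarfile.open(bundle_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "deep" / "nested").mkdir(parents=True)
    source = root / "deep" / "nested" / "config_snapshot_20250101.json"
    source.write_text("{}")
    names = bundle_files({"config_snapshot.json": source}, root / "bundle.tar.gz")
    print(names)  # ['config_snapshot.json']
```

Storing members under fixed names means a consumer of the bundle does not need to know the timestamped snapshot name or the original directory structure.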


In [None]:
# Task 28: Implement the inference pipeline for generating multi-horizon BubbleScore forecasts

# ==============================================================================
# Task 28: Implement the inference pipeline for generating multi-horizon
#          BubbleScore forecasts
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 28, Step 1: Load the trained model for inference.
# ------------------------------------------------------------------------------

def _load_model_for_inference(
    model_path: Path,
    config: Dict[str, Any],
    num_stock_features: int,
    num_market_features: int,
    device: torch.device
) -> DualStreamTransformer:
    """
    Loads the best trained model from a checkpoint for inference.

    This function first instantiates the model architecture using the exact
    configuration from the study, then loads the saved state dictionary from
    the best checkpoint, and finally sets the model to evaluation mode.

    Args:
        model_path (Path): Path to the '.pth' model checkpoint file.
        config (Dict[str, Any]): The main study configuration dictionary.
        num_stock_features (int): The number of features in the stock stream.
        num_market_features (int): The number of features in the market stream.
        device (torch.device): The device to load the model onto.

    Returns:
        DualStreamTransformer: The trained model, ready for inference.
    """
    logging.info(f"Loading model for inference from '{model_path}'...")

    # --- Instantiate Architecture ---
    # The model must be created with the same architecture as during training.
    model = DualStreamTransformer(config, num_stock_features, num_market_features)

    # --- Load State Dictionary ---
    # Load the checkpoint from the specified path.
    checkpoint = torch.load(model_path, map_location=device)
    # Load the learned weights into the model instance.
    model.load_state_dict(checkpoint['model_state_dict'])

    # --- Configure for Inference ---
    # Move the model to the specified device.
    model.to(device)
    # Set the model to evaluation mode so that dropout is disabled during inference.
    model.eval()

    logging.info(f"Model successfully loaded from epoch {checkpoint.get('epoch', 'N/A')} and set to evaluation mode.")
    return model


# ------------------------------------------------------------------------------
# Task 28, Step 2: Generate predictions for the test set.
# ------------------------------------------------------------------------------

def _generate_test_set_predictions(
    model: DualStreamTransformer,
    test_dataset: ModelDataset,
    anchor_indices: pd.MultiIndex,
    device: torch.device,
    batch_size: int
) -> pd.DataFrame:
    """
    Generates multi-horizon predictions for the entire test set.

    Args:
        model (DualStreamTransformer): The trained model in evaluation mode.
        test_dataset (ModelDataset): The test data split.
        anchor_indices (pd.MultiIndex): The anchor indices for the test set.
        device (torch.device): The device to run inference on.
        batch_size (int): The batch size for inference.

    Returns:
        pd.DataFrame: A DataFrame of predictions, indexed by (Date, Ticker).
    """
    logging.info(f"Generating predictions for {len(test_dataset.targets):,} test samples...")

    # --- Create DataLoader ---
    # Create a DataLoader for the test set. Shuffling must be False to maintain order.
    dataset = TensorDataset(
        torch.from_numpy(test_dataset.stock_sequences).float(),
        torch.from_numpy(test_dataset.market_sequences).float()
    )
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

    all_predictions: List[torch.Tensor] = []

    # --- Inference Loop ---
    # Disable gradient computation for efficiency.
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Generating Test Predictions"):
            # Move input tensors to the device.
            stock_seq, market_seq = [b.to(device) for b in batch]
            # Get model predictions.
            predictions = model(stock_seq, market_seq)
            # Append predictions to the list, moving them to CPU.
            all_predictions.append(predictions.cpu())

    # Concatenate all batch predictions into a single tensor.
    predictions_tensor = torch.cat(all_predictions)

    # --- Structure the Output ---
    # Create a DataFrame from the predictions tensor.
    num_horizons = predictions_tensor.shape[1]
    pred_cols = [f'Pred_H{h+1}' for h in range(num_horizons)]
    predictions_df = pd.DataFrame(
        predictions_tensor.numpy(),
        index=anchor_indices,
        columns=pred_cols
    )

    return predictions_df


# ------------------------------------------------------------------------------
# Task 28, Step 3: Compute test metrics.
# ------------------------------------------------------------------------------

def _compute_test_metrics(
    predictions: np.ndarray,
    targets: np.ndarray
) -> pd.DataFrame:
    """
    Computes and reports performance metrics for the test set predictions.

    Args:
        predictions (np.ndarray): 2D array of model predictions.
        targets (np.ndarray): 2D array of ground-truth targets.

    Returns:
        pd.DataFrame: A DataFrame summarizing performance metrics per horizon.
    """
    metrics = []
    num_horizons = predictions.shape[1]

    # Calculate metrics for each forecast horizon independently.
    for h in range(num_horizons):
        preds_h = predictions[:, h]
        targs_h = targets[:, h]

        # Pearson Correlation
        corr = np.corrcoef(preds_h, targs_h)[0, 1]
        # Mean Squared Error (MSE)
        mse = np.mean((preds_h - targs_h)**2)
        # Root Mean Squared Error (RMSE)
        rmse = np.sqrt(mse)
        # Mean Absolute Error (MAE)
        mae = np.mean(np.abs(preds_h - targs_h))

        metrics.append({
            'Horizon': h + 1,
            'Correlation': corr,
            'MSE': mse,
            'RMSE': rmse,
            'MAE': mae
        })

    # Create a DataFrame from the list of metric dictionaries.
    metrics_df = pd.DataFrame(metrics).set_index('Horizon')

    return metrics_df


# ------------------------------------------------------------------------------
# Task 28, Orchestrator Function
# ------------------------------------------------------------------------------

def run_inference_pipeline(
    partitioned_datasets: Dict[str, ModelDataset],
    study_parameters: Dict[str, Any],
    final_anchor_indices: pd.MultiIndex,
    validation_end_date: pd.Timestamp,
    model_dir: Union[str, Path] = "models",
    log_dir: Union[str, Path] = "logs"
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Orchestrates the full inference and evaluation pipeline on the test set.

    This function takes the trained model and the held-out test data to
    generate out-of-sample predictions and compute final performance metrics.
    This provides the out-of-sample estimate of the model's generalization performance.

    Args:
        partitioned_datasets (Dict[str, ModelDataset]):
            The dictionary of data splits ('train', 'validation', 'test').
        study_parameters (Dict[str, Any]):
            The main configuration dictionary for the study.
        final_anchor_indices (pd.MultiIndex):
            The complete MultiIndex of all valid samples before splitting.
        validation_end_date (pd.Timestamp):
            The precise timestamp marking the end of the validation period. This
            is used to define the start of the test set, ensuring a clean
            chronological split and making the function's dependencies explicit.
        model_dir (Union[str, Path]):
            The directory where the best trained model checkpoint is saved.
        log_dir (Union[str, Path]):
            The directory where the final metrics report will be saved.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
            - A DataFrame of predictions for the test set, indexed by (Date, Ticker).
            - A DataFrame summarizing the performance metrics per forecast horizon.
    """
    # Announce the start of the inference process.
    logging.info("Initiating inference and evaluation pipeline...")

    # --- Setup ---
    # Determine the device for running inference (prefer GPU if available).
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Define the path to the best model checkpoint.
    model_path = Path(model_dir) / "best_model.pth"
    # Extract the test dataset from the partitioned datasets.
    test_dataset = partitioned_datasets['test']

    # Determine the number of input features from the shape of the test data arrays.
    num_stock_features = test_dataset.stock_sequences.shape[2]
    num_market_features = test_dataset.market_sequences.shape[2]

    # --- Select the test-period anchor indices ---
    # The test set comprises all samples dated strictly after `validation_end_date`.
    # Create a boolean mask to select only the anchor indices that fall within the test period.
    test_mask = final_anchor_indices.get_level_values('Date') > validation_end_date
    # Apply the mask to get the precise anchor indices for the test set.
    test_anchor_indices = final_anchor_indices[test_mask]

    # --- Step 1: Load the trained model for inference. ---
    # Instantiate the model architecture and load the best weights from the checkpoint.
    model = _load_model_for_inference(
        model_path, study_parameters, num_stock_features, num_market_features, device
    )
    logging.info("Step 1/3: Trained model loaded successfully.")

    # --- Step 2: Generate predictions for the test set. ---
    # Retrieve the batch size from the configuration.
    batch_size = study_parameters['predictive_model']['training']['batch_size']
    # Run the model on the test data to get out-of-sample predictions.
    predictions_df = _generate_test_set_predictions(
        model, test_dataset, test_anchor_indices, device, batch_size
    )
    logging.info("Step 2/3: Test set predictions generated.")

    # --- Step 3: Align predictions and compute test metrics. ---
    # Compare the generated predictions against the ground-truth targets.
    metrics_df = _compute_test_metrics(
        predictions_df.to_numpy(),
        test_dataset.targets
    )
    logging.info("Step 3/3: Performance metrics computed on the test set.")

    # --- Reporting ---
    # Log the detailed per-horizon performance summary to the console.
    logging.info("Test Set Performance Summary:\n" + metrics_df.to_string())

    # Log the average performance across all horizons, as reported in the paper.
    avg_metrics = metrics_df.mean()
    logging.info("Average Performance Across Horizons:\n" + avg_metrics.to_string())

    # Define the path for the metrics report.
    metrics_path = Path(log_dir) / "test_metrics.csv"
    # Ensure the output directory exists.
    metrics_path.parent.mkdir(parents=True, exist_ok=True)
    # Save the metrics DataFrame to a CSV file for a persistent record.
    metrics_df.to_csv(metrics_path)
    logging.info(f"Test metrics report saved to '{metrics_path}'.")

    logging.info("Inference and evaluation pipeline finished successfully.")

    # Return the predictions and the performance metrics.
    return predictions_df, metrics_df
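
The per-horizon evaluation in `_compute_test_metrics` can be checked by hand on a tiny example. The sketch below assumes two forecast horizons and four samples; the arrays and the helper name `per_horizon_metrics` are hypothetical and chosen so the results are easy to verify.

```python
from typing import Dict, List

import numpy as np


def per_horizon_metrics(predictions: np.ndarray, targets: np.ndarray) -> List[Dict[str, float]]:
    """Compute correlation, MSE, RMSE and MAE independently for each column (horizon)."""
    results = []
    for h in range(predictions.shape[1]):
        p, t = predictions[:, h], targets[:, h]
        mse = float(np.mean((p - t) ** 2))
        results.append({
            "horizon": h + 1,
            "corr": float(np.corrcoef(p, t)[0, 1]),
            "mse": mse,
            "rmse": float(np.sqrt(mse)),
            "mae": float(np.mean(np.abs(p - t))),
        })
    return results


# Hypothetical data: horizon 1 is perfectly correlated but biased,
# horizon 2 has small errors but no linear relationship.
predictions = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
targets = np.array([[2.0, 0.0], [4.0, 1.0], [6.0, 1.0], [8.0, 0.0]])

metrics = per_horizon_metrics(predictions, targets)
for row in metrics:
    print(row)
```

Horizon 1 has a correlation of 1 with an MSE of 7.5, while horizon 2 has a correlation of 0 with an MSE of 0.5. Correlation and error metrics capture different properties of a forecast, so it is worth reporting both for each horizon.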


In [None]:
# Task 29: Convert multi-horizon BubbleScore forecasts into trading signals

# ==============================================================================
# Task 29: Convert multi-horizon BubbleScore forecasts into trading signals
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 29, Step 1: Extract trading thresholds from configuration.
# ------------------------------------------------------------------------------

def _get_trading_thresholds(config: Dict[str, Any]) -> Tuple[float, float]:
    """
    Retrieves and validates the trading signal thresholds from the configuration.

    This function extracts the entry threshold (theta_1) and exit threshold
    (theta_2) from the study parameters. It also performs a validation check
    to ensure the thresholds are logically consistent (0 < exit < entry < 1).

    Args:
        config (Dict[str, Any]):
            The main study configuration dictionary.

    Returns:
        Tuple[float, float]:
            A tuple containing (theta_1_entry, theta_2_exit).

    Raises:
        KeyError:
            If the required threshold keys are missing from the configuration.
        ValueError:
            If the thresholds are not valid or logically inconsistent.
    """
    try:
        # Access the nested dictionary for backtesting strategy rules.
        rules_config = config['backtesting']['strategy_rules']

        # Extract the entry threshold (theta_1), the level at which a position is initiated.
        theta_1 = rules_config['entry_threshold_theta_1']
        # Extract the exit threshold (theta_2), the level at which a position is closed.
        theta_2 = rules_config['exit_threshold_theta_2']

        # --- Validation ---
        # Re-assert the logical consistency of the thresholds. This was checked in
        # Task 1 but is a critical safety check at the point of use.
        if not (0 < theta_2 < theta_1 < 1):
            raise ValueError(f"Trading thresholds are invalid. Must satisfy 0 < theta_2 < theta_1 < 1, but got theta_1={theta_1}, theta_2={theta_2}.")

        # Log the extracted parameters for auditability.
        logging.info(f"Trading thresholds extracted: Entry (theta_1) = {theta_1}, Exit (theta_2) = {theta_2}.")

        # Return the validated thresholds.
        return theta_1, theta_2

    except KeyError as e:
        # If a key is missing, raise an error with a specific message.
        logging.error(f"Missing a required trading threshold in the configuration: {e}")
        raise

# ------------------------------------------------------------------------------
# Task 29, Step 2: Generate entry and exit signals for each horizon.
# ------------------------------------------------------------------------------

def _generate_signals_for_group(
    group: pd.DataFrame,
    theta_1: float,
    theta_2: float
) -> pd.DataFrame:
    """
    Generates trading signals for a single time series (one ticker/horizon pair).

    This function implements a state machine that tracks the current position
    (flat, long, or short) and iterates through the time series of predictions.
    It emits entry or exit signals based on the crossing of the `theta_1` and
    `theta_2` thresholds.

    Args:
        group (pd.DataFrame):
            A DataFrame for a single ticker and forecast horizon, sorted by date,
            with a 'Prediction' column.
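        theta_1 (float):
            The entry threshold. A long position is opened when Prediction <= -theta_1
            and a short position is opened when Prediction >= theta_1.
        theta_2 (float):
            The exit threshold. A long position is closed when Prediction >= -theta_2
            and a short position is closed when Prediction <= theta_2.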

    Returns:
        pd.DataFrame:
            A DataFrame containing the generated signals ('Date', 'Signal') for this group.
    """
    # Initialize the position state: 0 for flat, 1 for long, -1 for short.
    position = 0
    # Initialize a list to store the generated signal events.
    signals = []

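    # Iterate through the rows in chronological order. After melting, the group's index
    # is a plain integer range, so the date is read from the 'Date' column.
    for _, row in group.sort_values('Date').iterrows():
        date = row['Date']
        # Get the model's prediction for the current day.
        prediction = row['Prediction']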

        # --- Exit Logic ---
        # Exit logic is checked before entry logic to allow for a position to be
        # closed and a new one opened on the same day if signals permit.

        # If currently in a long position, check for a long exit signal.
        if position == 1 and prediction >= -theta_2:
            signals.append({'Date': date, 'Signal': 'LONG_EXIT'})
            position = 0 # Reset position to flat.
        # If currently in a short position, check for a short exit signal.
        elif position == -1 and prediction <= theta_2:
            signals.append({'Date': date, 'Signal': 'SHORT_EXIT'})
            position = 0 # Reset position to flat.

        # --- Entry Logic ---
        # Entry signals are only considered if the current position is flat.
        if position == 0:
            # Check for a long entry signal (prediction is very negative).
            if prediction <= -theta_1:
                signals.append({'Date': date, 'Signal': 'LONG_ENTRY'})
                position = 1 # Set position to long.
            # Check for a short entry signal (prediction is very positive).
            elif prediction >= theta_1:
                signals.append({'Date': date, 'Signal': 'SHORT_ENTRY'})
                position = -1 # Set position to short.

    # Convert the list of signal dictionaries into a DataFrame.
    return pd.DataFrame(signals)


def _generate_threshold_signals(
    predictions_df: pd.DataFrame,
    theta_1: float,
    theta_2: float
) -> pd.DataFrame:
    """
    Generates all threshold-based entry and exit signals for all tickers and horizons.

    This function first transforms the wide-format prediction DataFrame into a
    long format, then applies the stateful signal generation logic to each
    ticker-horizon group.

    Args:
        predictions_df (pd.DataFrame):
            The wide-format DataFrame of predictions from the model.
        theta_1 (float): The entry threshold.
        theta_2 (float): The exit threshold.

    Returns:
        pd.DataFrame:
            A long-form DataFrame containing all generated threshold-based signals.
    """
    # --- Step A: Melt DataFrame ---
    # Convert the wide prediction DataFrame (columns Pred_H1, Pred_H2, ...)
    # to a long format with columns ['Date', 'TICKER', 'Horizon', 'Prediction'].
    # This simplifies group-wise operations.
    long_predictions = predictions_df.reset_index().melt(
        id_vars=['Date', 'TICKER'],
        var_name='Horizon_Str',
        value_name='Prediction'
    )
    # Convert the 'Horizon_Str' (e.g., 'Pred_H1') to a simple integer.
    long_predictions['Horizon'] = long_predictions['Horizon_Str'].str.replace('Pred_H', '').astype(int)
    long_predictions.drop(columns=['Horizon_Str'], inplace=True)

    # --- Step B & C: Apply state machine logic to each group ---
    logging.info("Generating threshold-based entry/exit signals for all ticker-horizon pairs...")
    # Group by ticker and horizon, then apply the state machine function to each group.
    # group_keys=True keeps 'TICKER' and 'Horizon' in the result's index so that
    # reset_index() restores them as columns; the helper itself only returns
    # 'Date' and 'Signal'.
    signals = long_predictions.groupby(['TICKER', 'Horizon'], group_keys=True).apply(
        _generate_signals_for_group, theta_1=theta_1, theta_2=theta_2
    ).reset_index()

    # The per-group row position comes back as an unnamed 'level_2' column; drop it.
    if 'level_2' in signals.columns:
        signals.drop(columns=['level_2'], inplace=True)

    return signals
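
The wide-to-long reshape in Step A can be seen on a toy frame. The sketch below assumes, as the function does, a `(Date, TICKER)` MultiIndex and columns named `Pred_H<k>`; the tickers and values are made up for illustration.

```python
import pandas as pd

# Toy wide-format predictions: one row per (Date, TICKER), one column per horizon.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(['2024-01-02', '2024-01-03']), ['AAA', 'BBB']],
    names=['Date', 'TICKER'],
)
wide = pd.DataFrame(
    {'Pred_H1': [0.1, -0.2, 0.3, -0.4], 'Pred_H2': [0.5, -0.6, 0.7, -0.8]},
    index=index,
)

# Melt to one row per (Date, TICKER, Horizon).
long = wide.reset_index().melt(
    id_vars=['Date', 'TICKER'], var_name='Horizon_Str', value_name='Prediction'
)
# 'Pred_H2' -> 2; regex=False treats the pattern as a literal string.
long['Horizon'] = long['Horizon_Str'].str.replace('Pred_H', '', regex=False).astype(int)
long = long.drop(columns=['Horizon_Str'])

# Each (TICKER, Horizon) pair is now its own small time series.
group_sizes = long.groupby(['TICKER', 'Horizon']).size()
```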


def _generate_reversal_signals(predictions_df: pd.DataFrame) -> pd.DataFrame:
    """
    Generates exit signals based on the prediction reversal rule.

    This rule provides an additional layer of risk management by forcing an exit
    if the model's forecast flips sign between consecutive horizons for the same
    anchor date, indicating high uncertainty.

    Args:
        predictions_df (pd.DataFrame):
            The wide-format DataFrame of model predictions.

    Returns:
        pd.DataFrame:
            A DataFrame containing all generated reversal exit signals.
    """
    logging.info("Generating prediction reversal exit signals...")
    # Initialize a list to store reversal signal events.
    reversal_signals = []

    # Define the pairs of consecutive horizons to check (e.g., (1, 2), (2, 3), ...).
    num_horizons = len(predictions_df.columns)
    horizon_pairs = [(h, h + 1) for h in range(1, num_horizons)]

    # Iterate through each pair of consecutive horizons.
    for h1, h2 in horizon_pairs:
        # Select the prediction columns for the two horizons.
        pred_h1 = predictions_df[f'Pred_H{h1}']
        pred_h2 = predictions_df[f'Pred_H{h2}']

        # --- Step D: Reversal Protection Logic ---
        # A reversal occurs if the product of the two predictions is negative.
        # Equation: B_t+h * B_t+h+1 < 0
        reversal_mask = (pred_h1 * pred_h2) < 0

        # If any reversals are detected for this horizon pair...
        if reversal_mask.any():
            # Get the (Date, Ticker) MultiIndex for the rows where reversals occurred.
            reversal_indices = predictions_df.index[reversal_mask]
            # For each reversal event, create a signal record.
            # The signal is for horizon `h1`, as it's the earlier of the pair.
            for date, ticker in reversal_indices:
                reversal_signals.append({
                    'Date': date,
                    'TICKER': ticker,
                    'Horizon': h1,
                    'Signal': 'REVERSAL_EXIT'
                })

    # Convert the list of dictionaries into a DataFrame.
    return pd.DataFrame(reversal_signals)
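
To make the reversal rule concrete, the sketch below applies the same sign-product test to two toy rows. The tickers and prediction values are illustrative assumptions.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2024-01-02'), 'AAA'), (pd.Timestamp('2024-01-02'), 'BBB')],
    names=['Date', 'TICKER'],
)
preds = pd.DataFrame(
    {'Pred_H1': [0.4, 0.3], 'Pred_H2': [-0.1, 0.2], 'Pred_H3': [-0.2, -0.5]},
    index=index,
)

records = []
horizons = [1, 2, 3]
for h1, h2 in zip(horizons[:-1], horizons[1:]):
    # A negative product means the two forecasts have opposite signs.
    flipped = (preds[f'Pred_H{h1}'] * preds[f'Pred_H{h2}']) < 0
    for date, ticker in preds.index[flipped.to_numpy()]:
        records.append(
            {'Date': date, 'TICKER': ticker, 'Horizon': h1, 'Signal': 'REVERSAL_EXIT'}
        )

reversals = pd.DataFrame(records)
```

Here `AAA` flips between horizons 1 and 2, and `BBB` flips between horizons 2 and 3, so each produces one `REVERSAL_EXIT` tagged with the earlier horizon of the pair.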


# ------------------------------------------------------------------------------
# Task 29, Orchestrator Function
# ------------------------------------------------------------------------------

def generate_trading_signals(
    predictions_df: pd.DataFrame,
    study_parameters: Dict[str, Any],
    output_path: Union[str, Path] = "data_final/trading_signals.csv"
) -> pd.DataFrame:
    """
    Orchestrates the conversion of model predictions into discrete trading signals.

    This function implements the full signal generation logic from the paper by
    first generating primary entry/exit signals based on threshold crossings,
    then generating additional risk-management signals based on the prediction
    reversal rule. All signals are combined, sorted, and persisted to a file.

    Args:
        predictions_df (pd.DataFrame):
            The DataFrame of out-of-sample predictions from the inference pipeline.
        study_parameters (Dict[str, Any]):
            The main configuration dictionary.
        output_path (Union[str, Path]):
            The file path to save the final signals CSV.

    Returns:
        pd.DataFrame:
            A long-form DataFrame containing all generated trading signals,
            ready for use in an event-driven backtester.
    """
    # Announce the start of the signal generation process.
    logging.info("Initiating trading signal generation pipeline...")
    output_path = Path(output_path)

    # --- Step 1: Extract and validate trading thresholds. ---
    theta_1, theta_2 = _get_trading_thresholds(study_parameters)
    logging.info("Step 1/3: Trading thresholds extracted.")

    # --- Step 2: Generate signals from both rules. ---
    # Generate the primary entry/exit signals based on the stateful threshold logic.
    threshold_signals = _generate_threshold_signals(predictions_df, theta_1, theta_2)

    # Generate the secondary, risk-management exit signals from the reversal rule.
    reversal_signals = _generate_reversal_signals(predictions_df)
    logging.info("Step 2/3: Threshold and reversal signals generated.")

    # --- Step 3: Combine, persist, and log statistics. ---
    # Concatenate the signals from both sources into a single DataFrame.
    all_signals = pd.concat([threshold_signals, reversal_signals], ignore_index=True)

    # Sort the signals chronologically for each ticker and horizon. This is
    # essential for correct processing in a sequential backtester. The guard
    # avoids a KeyError when no signals were generated at all.
    if not all_signals.empty:
        all_signals.sort_values(by=['TICKER', 'Horizon', 'Date'], inplace=True)

    # Ensure the output directory exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Save the final, sorted signals DataFrame to a CSV file.
    all_signals.to_csv(output_path, index=False)

    # Log summary statistics of the generated signals for a high-level overview.
    logging.info(f"Step 3/3: Generated a total of {len(all_signals):,} signals. Signals saved to '{output_path}'.")
    if not all_signals.empty:
        logging.info("Signal distribution:\n" + all_signals['Signal'].value_counts().to_string())

    logging.info("Trading signal generation pipeline finished successfully.")

    # Return the final DataFrame of signals.
    return all_signals


# ==============================================================================
# Task 30: Simulate position management, risk controls, and compute PnL
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 30, Steps 1 & 2: Initialize and simulate a single strategy.
# ------------------------------------------------------------------------------

def _run_single_backtest(
    strategy_signals: pd.DataFrame,
    price_series: pd.Series,
    risk_params: Dict[str, float]
) -> Tuple[pd.Series, List[Dict[str, Any]]]:
    """
    Runs an event-driven backtest for a single ticker-horizon strategy.

    This function simulates trading on a day-by-day basis. It processes signals
    chronologically, manages the trading position state (flat, long, short),
    applies transaction costs when a position is closed, and enforces a daily
    stop-loss rule. All state variables, including the entry date, are set and
    reset together on every entry and exit. The equity curve records realized
    capital only; open positions are not marked to market.

    Args:
        strategy_signals (pd.DataFrame):
            A DataFrame of trading signals for a single ticker-horizon pair,
            with 'Date' and 'Signal' columns.
        price_series (pd.Series):
            A Series of adjusted closing prices for the corresponding ticker,
            indexed by date.
        risk_params (Dict[str, float]):
            A dictionary containing risk parameters: 'stop_loss' (e.g., 0.15)
            and 'txn_cost' (e.g., 0.001).

    Returns:
        Tuple[pd.Series, List[Dict[str, Any]]]: A tuple containing:
            - The daily equity curve as a pandas Series, indexed by date.
            - A log of all executed trades as a list of dictionaries.
    """
    # --- Step 1: State Initialization ---
    # Initialize all state variables for the simulation to a clean "flat" state.
    position: int = 0  # 0: flat, 1: long, -1: short
    entry_price: float = 0.0
    entry_price_date: Optional[pd.Timestamp] = None  # None while flat
    capital: float = 1.0

    # Prepare data structures for storing the simulation results.
    equity_curve = pd.Series(index=price_series.index, dtype=float)
    trade_log: List[Dict[str, Any]] = []

    # Group signals by date, preserving their original order. Several signals can
    # share a date (e.g., an exit followed by a new entry), so each date maps to
    # a list rather than to a single value.
    signals_by_date = (
        strategy_signals.groupby('Date', sort=False)['Signal'].apply(list).to_dict()
    )

    # --- Simulation Loop ---
    # Iterate through each trading day in the chronological price series.
    for date, price in price_series.items():
        # The capital at the start of the day is the capital from the end of the previous day.
        equity_curve[date] = capital

        # Get all signals for the current day (an empty list if there are none).
        day_signals = signals_by_date.get(date, [])

        # --- 1. Daily Risk Management (Stop-Loss Check) ---
        # This check is performed every day there is an open position.
        if position != 0:
            # Calculate the current unrealized return on the open position.
            unrealized_return = position * (price - entry_price) / entry_price
            # Check if the stop-loss threshold has been breached.
            if unrealized_return < -risk_params['stop_loss']:
                # If breached, an exit is forced.
                net_return = unrealized_return - risk_params['txn_cost']
                # Update the capital based on the losing trade.
                capital *= (1 + net_return)
                # Log the details of the trade that was stopped out.
                trade_log.append({
                    'entry_date': entry_price_date, 'exit_date': date,
                    'position': 'long' if position == 1 else 'short',
                    'return': net_return, 'exit_reason': 'STOP_LOSS'
                })
                # Reset all state variables to 'flat' together.
                position = 0
                entry_price = 0.0
                entry_price_date = None
                # Update the equity curve for the current day to reflect the capital change.
                equity_curve[date] = capital

        # --- 2. Process Trading Signals ---
        # Signals are processed in order after the stop-loss check. A stop-out
        # leaves the position flat, so a same-day entry signal can still open a
        # new position.
        for signal in day_signals:
            # Define exit conditions based on the signal and current position.
            is_long_exit = (signal in ['LONG_EXIT', 'REVERSAL_EXIT']) and position == 1
            is_short_exit = (signal in ['SHORT_EXIT', 'REVERSAL_EXIT']) and position == -1

            if is_long_exit or is_short_exit:
                # Calculate the return on the closed trade based on the current price.
                trade_return = position * (price - entry_price) / entry_price
                # Subtract transaction costs to get the net return.
                net_return = trade_return - risk_params['txn_cost']
                # Update capital with the result of the trade.
                capital *= (1 + net_return)
                # Log the trade details.
                trade_log.append({
                    'entry_date': entry_price_date, 'exit_date': date,
                    'position': 'long' if position == 1 else 'short',
                    'return': net_return, 'exit_reason': signal
                })
                # Reset all state variables to 'flat' together.
                position = 0
                entry_price = 0.0
                entry_price_date = None
                # Update the equity curve for the current day.
                equity_curve[date] = capital

            # Entry signals are only processed if the position is currently flat.
            # This allows an exit and a new entry on the same day.
            if position == 0:
                if signal == 'LONG_ENTRY':
                    # Set all state variables for a new long position together.
                    position = 1
                    entry_price = price
                    entry_price_date = date
                elif signal == 'SHORT_ENTRY':
                    # Set all state variables for a new short position together.
                    position = -1
                    entry_price = price
                    entry_price_date = date

    # Return the completed equity curve and the detailed log of all trades.
    return equity_curve, trade_log
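
The stop-loss arithmetic used in the loop can be verified by hand. In this sketch the helper `close_position`, the prices, and the parameter values (15% stop-loss, 0.1% cost) are illustrative assumptions chosen to match the examples in the docstring.

```python
def close_position(position: int, entry_price: float, price: float, txn_cost: float) -> float:
    """Net return of closing a long (+1) or short (-1) position at `price`."""
    gross = position * (price - entry_price) / entry_price
    return gross - txn_cost


stop_loss, txn_cost = 0.15, 0.001
capital = 1.0

# Long entered at 100; the price falls to 84, an unrealized loss of 16%.
unrealized = 1 * (84.0 - 100.0) / 100.0
stopped_out = unrealized < -stop_loss
if stopped_out:
    # The forced exit also pays the transaction cost: -0.16 - 0.001 = -0.161.
    capital *= 1 + close_position(1, 100.0, 84.0, txn_cost)

# A short entered at 100 and closed at 90 gains 10% before costs.
short_net = close_position(-1, 100.0, 90.0, txn_cost)
```

Capital compounds multiplicatively, so the stopped-out trade leaves 0.839 of the starting capital.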

# ------------------------------------------------------------------------------
# Task 30, Orchestrator Function
# ------------------------------------------------------------------------------

def simulate_all_strategies(
    signals_df: pd.DataFrame,
    prices_df: pd.DataFrame,
    study_parameters: Dict[str, Any],
    output_dir: Union[str, Path] = "data_final"
) -> Tuple[Dict[Tuple[str, int], pd.Series], Dict[Tuple[str, int], List[Dict[str, Any]]]]:
    """
    Orchestrates the backtesting simulation for all ticker-horizon strategies.

    This function iterates through each unique strategy defined in the signals
    DataFrame, runs an event-driven backtest for each, and persists the
    resulting equity curves and trade logs.

    Args:
        signals_df (pd.DataFrame): The long-form DataFrame of all trading signals.
        prices_df (pd.DataFrame): DataFrame indexed by (Date, TICKER) containing 'Close_Price_Adj' for all tickers.
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        output_dir (Union[str, Path]): The root directory to save results.

    Returns:
        Tuple[Dict, Dict]: A tuple containing:
            - A dictionary mapping (Ticker, Horizon) to its equity curve Series.
            - A dictionary mapping (Ticker, Horizon) to its trade log list.
    """
    logging.info("Initiating backtesting simulation for all strategies...")

    # --- Extract Risk Parameters ---
    risk_params = {
        'stop_loss': study_parameters['backtesting']['risk_management']['stop_loss_percentage'],
        'txn_cost': study_parameters['backtesting']['market_assumptions']['transaction_cost_per_trade']
    }

    # --- Prepare Output Directories (Step 3) ---
    equity_curve_dir = Path(output_dir) / "equity_curves"
    trade_log_dir = Path(output_dir) / "trade_logs"
    equity_curve_dir.mkdir(parents=True, exist_ok=True)
    trade_log_dir.mkdir(parents=True, exist_ok=True)

    # --- Group Signals by Strategy ---
    # Each (Ticker, Horizon) pair is a unique strategy to be backtested.
    strategy_groups = signals_df.groupby(['TICKER', 'Horizon'])

    # Dictionaries to hold the in-memory results.
    all_equity_curves: Dict[Tuple[str, int], pd.Series] = {}
    all_trade_logs: Dict[Tuple[str, int], List[Dict[str, Any]]] = {}

    # --- Main Simulation Loop ---
    # Iterate through each strategy group.
    for (ticker, horizon), group_signals in tqdm(strategy_groups, desc="Backtesting Strategies"):
        # Get the price series for the current ticker.
        if ticker not in prices_df.index.get_level_values('TICKER'):
            logging.warning(f"No price data found for ticker {ticker}. Skipping backtest.")
            continue

        # Select the price series for the specific ticker.
        price_series = prices_df.loc[(slice(None), ticker), 'Close_Price_Adj'].droplevel('TICKER')

        # Run the backtest simulation for this single strategy.
        equity_curve, trade_log = _run_single_backtest(group_signals, price_series, risk_params)

        # Store the results in the dictionaries.
        strategy_key = (ticker, horizon)
        all_equity_curves[strategy_key] = equity_curve
        all_trade_logs[strategy_key] = trade_log

        # --- Persistence (Step 3) ---
        # Save the equity curve to a CSV file.
        equity_curve.to_csv(equity_curve_dir / f"{ticker}_H{horizon}.csv", header=['Equity'])
        # Save the trade log to a CSV file.
        if trade_log:
            pd.DataFrame(trade_log).to_csv(trade_log_dir / f"{ticker}_H{horizon}.csv", index=False)

    logging.info(f"Backtesting simulation complete. Results for {len(all_equity_curves)} strategies saved to '{output_dir}'.")

    return all_equity_curves, all_trade_logs
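
The per-ticker price selection relies on a `(Date, TICKER)` MultiIndex. The sketch below shows the same selection on a toy frame, next to an equivalent `xs` call; the tickers and prices are made-up values.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.to_datetime(['2024-01-02', '2024-01-03']), ['AAA', 'BBB']],
    names=['Date', 'TICKER'],
)
prices = pd.DataFrame({'Close_Price_Adj': [10.0, 20.0, 11.0, 21.0]}, index=index)

# Select every date for one ticker, then drop the now-constant ticker level.
series_aaa = prices.loc[(slice(None), 'AAA'), 'Close_Price_Adj'].droplevel('TICKER')

# An equivalent cross-section; xs drops the selected level by default.
series_xs = prices['Close_Price_Adj'].xs('AAA', level='TICKER')
```

Both forms return a Series indexed by date only, which is the shape `_run_single_backtest` expects for `price_series`.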


# ==============================================================================
# Task 31: Compute and report performance metrics for each strategy
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 31, Steps 1 & 2: Compute all performance metrics for a single strategy.
# ------------------------------------------------------------------------------

def _calculate_strategy_metrics(
    equity_curve: pd.Series,
    trade_log: List[Dict[str, Any]],
    risk_free_rate: float,
    trading_days_per_year: int = 252
) -> Dict[str, Any]:
    """
    Calculates a comprehensive set of performance metrics for a single strategy.

    Args:
        equity_curve (pd.Series): The daily equity curve of the strategy.
        trade_log (List[Dict[str, Any]]): The log of executed trades.
        risk_free_rate (float): The annualized risk-free rate.
        trading_days_per_year (int): The number of trading days in a year.

    Returns:
        Dict[str, Any]: A dictionary containing all calculated performance metrics.
    """
    # --- Annualized Return ---
    # Equation: AR = (E_T / E_0)^(252 / N) - 1
    num_days = len(equity_curve)
    if num_days < 2:
        return {}  # Not enough data to calculate metrics
    total_return = equity_curve.iloc[-1] / equity_curve.iloc[0] - 1
    annualized_return = (1 + total_return) ** (trading_days_per_year / num_days) - 1

    # --- Sharpe Ratio ---
    # Equation: Sharpe = mean(r_excess) / std(r_excess) * sqrt(252)
    daily_returns = equity_curve.pct_change().dropna()
    if len(daily_returns) > 1 and daily_returns.std() > 1e-8:
        daily_risk_free = (1 + risk_free_rate)**(1/trading_days_per_year) - 1
        excess_returns = daily_returns - daily_risk_free
        sharpe_ratio = (excess_returns.mean() / excess_returns.std()) * np.sqrt(trading_days_per_year)
    else:
        sharpe_ratio = 0.0 # If no volatility, Sharpe is zero.

    # --- Maximum Drawdown (MDD) ---
    # Equation: MDD = max(1 - E_u / M_u) where M_u is running max equity.
    running_max = equity_curve.expanding().max()
    drawdowns = 1 - equity_curve / running_max
    max_drawdown = drawdowns.max()

    # --- Win Rate and Trade Count ---
    num_trades = len(trade_log)
    if num_trades > 0:
        winning_trades = sum(1 for trade in trade_log if trade['return'] > 0)
        win_rate = winning_trades / num_trades
    else:
        win_rate = np.nan # Undefined if no trades were made.

    return {
        'Annualized_Return': annualized_return,
        'Sharpe_Ratio': sharpe_ratio,
        'Max_Drawdown': max_drawdown,
        'Win_Rate': win_rate,
        'Num_Trades': num_trades
    }
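
A short worked example of the same formulas on a four-point equity curve. The equity values are invented for illustration, and the risk-free rate is assumed to be zero to keep the Sharpe calculation short.

```python
import numpy as np
import pandas as pd

equity = pd.Series([1.00, 1.10, 0.99, 1.21])
trading_days = 252

# Annualized return: (E_T / E_0)^(252 / N) - 1, with N = number of observations.
total_return = equity.iloc[-1] / equity.iloc[0] - 1
annualized = (1 + total_return) ** (trading_days / len(equity)) - 1

# Sharpe ratio on daily returns (risk-free rate assumed to be zero here).
daily = equity.pct_change().dropna()
sharpe = daily.mean() / daily.std() * np.sqrt(trading_days)

# Maximum drawdown: largest fractional drop from the running peak.
running_max = equity.expanding().max()
max_drawdown = (1 - equity / running_max).max()
```

The curve peaks at 1.10 before dipping to 0.99, so the maximum drawdown is 1 - 0.99 / 1.10 = 0.10.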


# ------------------------------------------------------------------------------
# Task 31, Step 3: Aggregate and report performance.
# ------------------------------------------------------------------------------

def report_backtest_performance(
    all_equity_curves: Dict[Tuple[str, int], pd.Series],
    all_trade_logs: Dict[Tuple[str, int], List[Dict[str, Any]]],
    study_parameters: Dict[str, Any],
    output_path: Union[str, Path] = "reports/performance_summary.csv"
) -> pd.DataFrame:
    """
    Aggregates performance metrics for all strategies and generates a summary report.

    This function iterates through the results of all backtest simulations,
    calculates performance metrics for each, and compiles them into a single
    DataFrame. It also computes and logs high-level summary statistics as
    described in the paper.

    Args:
        all_equity_curves (Dict): Dictionary mapping (Ticker, Horizon) to equity curves.
        all_trade_logs (Dict): Dictionary mapping (Ticker, Horizon) to trade logs.
        study_parameters (Dict[str, Any]): The main configuration dictionary.
        output_path (Union[str, Path]): The path to save the summary CSV report.

    Returns:
        pd.DataFrame: A DataFrame summarizing the performance of all strategies.
    """
    logging.info("Aggregating and reporting backtest performance metrics...")
    output_path = Path(output_path)

    # --- Parameter Extraction ---
    risk_free_rate = study_parameters['backtesting']['market_assumptions']['risk_free_rate_annual']

    # --- Metric Calculation Loop ---
    # This list will store the dictionary of metrics for each strategy.
    all_metrics = []
    # Iterate through all simulated strategies.
    for strategy_key, equity_curve in all_equity_curves.items():
        # Retrieve the corresponding trade log.
        trade_log = all_trade_logs.get(strategy_key, [])
        # Calculate all performance metrics for this strategy.
        metrics = _calculate_strategy_metrics(equity_curve, trade_log, risk_free_rate)

        if metrics:
            # Add the strategy identifiers (Ticker, Horizon) to the metrics dict.
            metrics['Ticker'] = strategy_key[0]
            metrics['Horizon'] = strategy_key[1]
            all_metrics.append(metrics)

    if not all_metrics:
        logging.warning("No performance metrics could be calculated. No valid strategies found.")
        return pd.DataFrame()

    # --- Create and Persist Summary DataFrame ---
    # Convert the list of dictionaries into a DataFrame.
    summary_df = pd.DataFrame(all_metrics)
    # Set a MultiIndex for easy lookup and analysis.
    summary_df.set_index(['Ticker', 'Horizon'], inplace=True)

    # Save the detailed summary report to a CSV file.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    summary_df.to_csv(output_path)
    logging.info(f"Full performance summary saved to '{output_path}'.")

    # --- High-Level Analysis and Reporting ---
    # Log the overall summary statistics across all strategies.
    logging.info("--- Overall Strategy Performance Summary ---")
    logging.info(summary_df.describe().to_string())

    # Identify and log the top 5 performing strategies by annualized return.
    top_5 = summary_df.sort_values(by='Annualized_Return', ascending=False).head(5)
    logging.info("\n--- Top 5 Performing Strategies (by Annualized Return) ---")
    logging.info(top_5.to_string())

    # Calculate and log the overall success rate (percentage of strategies with positive return).
    success_rate = (summary_df['Annualized_Return'] > 0).mean()
    logging.info(f"\nOverall Success Rate (Positive Ann. Return): {success_rate:.2%}")

    # Calculate and log the distribution of optimal horizons.
    # For each ticker, find the horizon that yielded the highest annualized return.
    optimal_horizons = summary_df.loc[summary_df.groupby('Ticker')['Annualized_Return'].idxmax()]
    horizon_distribution = optimal_horizons.index.get_level_values('Horizon').value_counts(normalize=True).sort_index()
    logging.info("\n--- Distribution of Optimal Prediction Horizons ---")
    logging.info(horizon_distribution.to_string())

    logging.info("Performance reporting pipeline finished successfully.")

    return summary_df
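
The optimal-horizon distribution and success rate can be reproduced on a toy summary table. The tickers and returns below are illustrative assumptions, not results from the study.

```python
import pandas as pd

summary = pd.DataFrame(
    {
        'Ticker': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'CCC'],
        'Horizon': [1, 2, 1, 2, 1, 2],
        'Annualized_Return': [0.05, 0.12, 0.08, 0.02, -0.01, 0.03],
    }
).set_index(['Ticker', 'Horizon'])

# For each ticker, idxmax returns the (Ticker, Horizon) label of its best row.
best_labels = summary.groupby('Ticker')['Annualized_Return'].idxmax().tolist()
best_rows = summary.loc[best_labels]

# Share of tickers whose best strategy falls on each horizon.
distribution = (
    best_rows.index.get_level_values('Horizon').value_counts(normalize=True).sort_index()
)

# Share of all strategies with a positive annualized return.
success_rate = (summary['Annualized_Return'] > 0).mean()
```

Two of the three tickers do best at horizon 2, and five of the six strategies have a positive return.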


# Task 32: Create an orchestrator function for the end-to-end research pipeline

def run_hlppl_pipeline(
    df_raw: pd.DataFrame,
    study_parameters: Dict[str, Any],
    intermediate_dir: Path,
    model_dir: Path,
    log_dir: Path,
    report_dir: Path
) -> pd.DataFrame:
    """
    Executes the complete end-to-end research pipeline for the HLPPL model.

    This master orchestrator function serves as the engine for a single,
    isolated experimental run. It manages the entire workflow, from raw data
    validation to final performance reporting, by calling a sequence of modular,
    task-specific functions. All outputs (intermediate data, models, logs,
    and reports) are directed to the specified directories, ensuring that runs
    from different experiments do not interfere with each other.

    The pipeline is structured as follows:
    -   Phase I: Setup and Validation (Tasks 1-2)
    -   Phase II: Data Cleansing and Feature Engineering (Tasks 3-6)
    -   Phase III: NLP Signal Generation (Tasks 7-12)
    -   Phase IV: LPPL Signal Generation (Tasks 13-18)
    -   Phase V: Deep Learning Data Preparation (Tasks 19-22)
    -   Phase VI: Model Training and Evaluation (Tasks 23-26)
    -   Phase VII: Persistence, Inference, and Backtesting (Tasks 27-31)

    Args:
        df_raw (pd.DataFrame):
            The raw input DataFrame containing market, fundamental, and news data.
        study_parameters (Dict[str, Any]):
            The configuration dictionary for this specific experimental run.
        intermediate_dir (Path):
            The directory for saving/loading intermediate data artifacts (e.g.,
            embeddings, window definitions, signals).
        model_dir (Path):
            The directory for saving/loading model-related artifacts (e.g.,
            checkpoints, metadata).
        log_dir (Path):
            The directory for saving logs and configuration snapshots.
        report_dir (Path):
            The directory for saving final output reports (e.g., performance summary).

    Returns:
        pd.DataFrame:
            The final performance summary DataFrame, detailing the results of
            all backtested strategies for this experimental run.

    Raises:
        Exception:
            Propagates any exception from a failed pipeline step after logging
            the error, causing the run for this experiment to halt.
    """
    # Create a logger specific to this pipeline run, named after the experiment.
    logger = logging.getLogger(f"HLPPL_Pipeline_{model_dir.parent.name}")
    logger.info(f"--- STARTING PIPELINE RUN. OUTPUTS DIRECTED TO: {model_dir.parent} ---")

    try:
        # --- PHASE I: SETUP AND VALIDATION ---
        logger.info("\n--- Phase I: Setup and Validation ---")

        # Task 1: Validate the configuration and create a snapshot in the experiment's log directory.
        config = validate_and_parse_config(study_parameters, log_dir=log_dir)

        # Task 2: Validate the raw DataFrame's schema and structure.
        df_validated = validate_input_dataframe(df_raw)

        # --- PHASE II: DATA CLEANSING AND FEATURE ENGINEERING ---
        logger.info("\n--- Phase II: Data Cleansing and Feature Engineering ---")

        # Tasks 3-6: Sequentially clean, adjust, and engineer the base features.
        df_cleansed = cleanse_raw_data(df_validated)
        df_adjusted = adjust_for_corporate_actions(df_cleansed)
        df_features = derive_engineered_features(df_adjusted)
        df_aligned = align_and_validate_calendar(df_features)
        master_calendar = df_aligned.index.get_level_values('Date').unique().sort_values()

        # --- PHASE III: NLP SIGNAL GENERATION ---
        logger.info("\n--- Phase III: NLP Signal Generation ---")

        # Tasks 7-12: Process all text data to generate sentiment and hype signals.
        corpus, _, topic_model = setup_topic_model(df_aligned, config, intermediate_data_dir=intermediate_dir, model_dir=model_dir)
        df_filtered, corpus_retained = apply_topic_filter(df_aligned, corpus, topic_model, log_dir=log_dir)
        sentiment_results = classify_article_sentiment(corpus_retained, config, output_path=intermediate_dir / "finbert_sentiment_results.pkl")
        df_stock_sentiment = aggregate_stock_day_sentiment(df_filtered, sentiment_results)
        df_market_sentiment = aggregate_market_level_sentiment(df_stock_sentiment, sentiment_results, master_calendar)
        df_hype = construct_hype_index(df_market_sentiment)

        # --- PHASE IV: LPPL SIGNAL GENERATION ---
        logger.info("\n--- Phase IV: LPPL Signal Generation ---")

        # Tasks 13-18: Perform the LPPL analysis to generate the final BubbleScore.
        windows = define_lppl_calibration_windows(df_hype, config, output_path=intermediate_dir / "lppl_windows.pkl")
        bounds, _ = initialize_lppl_fitter(config, log_dir=log_dir)
        lppl_fits = fit_lppl_model_to_windows(windows, bounds, config, output_path=intermediate_dir / "lppl_fit_parameters.csv")
        df_residuals = compute_and_merge_lppl_residuals(df_hype, lppl_fits, config)
        df_bubblescore = construct_bubblescore(df_residuals, config, output_dir=intermediate_dir)
        df_labeled = label_bubble_episodes(df_bubblescore, config, output_dir=intermediate_dir)

        # --- PHASE V: DEEP LEARNING DATA PREPARATION ---
        logger.info("\n--- Phase V: Deep Learning Data Preparation ---")

        # Tasks 19-22: Engineer and align all features into sequences and split the data.
        stock_sequences, anchor_indices, _ = engineer_stock_level_sequences(df_labeled, config)
        market_sequence_map, _ = engineer_market_level_sequences(df_labeled, config)
        final_anchor_indices, target_matrix = construct_and_align_targets(df_labeled, anchor_indices, config)

        # Align the stock sequences with the final valid anchor indices.
        valid_indices_set = set(final_anchor_indices)
        stock_sequences = [seq for idx, seq in zip(anchor_indices, stock_sequences) if idx in valid_indices_set]

        partitioned_datasets = split_dataset_chronologically(stock_sequences, market_sequence_map, target_matrix, final_anchor_indices, config)

        # --- PHASE VI: MODEL TRAINING AND EVALUATION ---
        logger.info("\n--- Phase VI: Model Training and Evaluation ---")

        # Determine the device for training.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Tasks 23-26: Define, train, and validate the Transformer model.
        num_stock_features = partitioned_datasets['train'].stock_sequences.shape[2]
        num_market_features = partitioned_datasets['train'].market_sequences.shape[2]
        model = DualStreamTransformer(config, num_stock_features, num_market_features)
        loss_function = CompositeLoss(config)
        model, training_history = train_and_validate_model(model, partitioned_datasets, loss_function, config, device, output_dir=model_dir)

        # --- PHASE VII: PERSISTENCE, INFERENCE, AND BACKTESTING ---
        logger.info("\n--- Phase VII: Persistence, Inference, and Backtesting ---")

        # Task 27: Persist the final trained model and all associated metadata.
        persist_model_and_metadata(config, training_history, model_dir=model_dir, log_dir=log_dir)

        # Task 28: Run inference on the held-out test set.
        _, val_end_date = _determine_chronological_split_dates(final_anchor_indices, config['predictive_model']['data_preparation']['dataset_split_ratio'])
        predictions_df, _ = run_inference_pipeline(partitioned_datasets, config, final_anchor_indices, val_end_date, model_dir=model_dir, log_dir=log_dir)

        # Task 29: Convert model predictions into discrete trading signals.
        signals_df = generate_trading_signals(predictions_df, config, output_path=intermediate_dir / "trading_signals.csv")

        # Task 30: Simulate the trading strategy based on the signals.
        test_prices = df_aligned.loc[df_aligned.index.get_level_values('Date') > val_end_date]
        all_equity_curves, all_trade_logs = simulate_all_strategies(signals_df, test_prices, config, output_dir=intermediate_dir)

        # Task 31: Compute and report the final performance metrics.
        performance_summary = report_backtest_performance(all_equity_curves, all_trade_logs, config, output_path=report_dir / "performance_summary.csv")

        # Log the successful completion of this experimental run.
        logger.info(f"--- PIPELINE RUN SUCCEEDED FOR: {model_dir.parent.name} ---")

        # Return the final performance summary.
        return performance_summary

    except Exception as e:
        # Catch any exception from any step in the pipeline.
        logger.critical(f"!!! PIPELINE FAILED FOR: {model_dir.parent.name} !!!", exc_info=True)
        # Re-raise the exception to halt the execution of the ablation study for this run.
        raise e
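
The Phase V alignment step above can be looked at in isolation. The sketch below is a minimal illustration, not the project's implementation: the helper name `align_sequences_to_anchors`, the `(ticker, date)` anchors, and the string placeholders standing in for sequence arrays are all assumptions made for the example. It shows how filtering by a set of surviving anchors keeps sequences in their original order.

```python
from typing import Hashable, List, Sequence, Tuple


def align_sequences_to_anchors(
    sequences: Sequence[object],
    anchor_indices: Sequence[Hashable],
    final_anchor_indices: Sequence[Hashable],
) -> Tuple[List[object], List[Hashable]]:
    """Hypothetical helper: keep only sequences whose anchor survived target construction."""
    if len(sequences) != len(anchor_indices):
        raise ValueError("sequences and anchor_indices must have the same length")

    # A set gives constant-time membership checks for each anchor.
    valid = set(final_anchor_indices)

    # Filter pairs together so each kept sequence stays attached to its anchor.
    kept = [(idx, seq) for idx, seq in zip(anchor_indices, sequences) if idx in valid]
    kept_anchors = [idx for idx, _ in kept]
    kept_sequences = [seq for _, seq in kept]
    return kept_sequences, kept_anchors


# Toy data (assumed for illustration only).
anchors = [("AAA", "2020-01-02"), ("AAA", "2020-01-03"), ("BBB", "2020-01-02")]
sequences = ["seq_0", "seq_1", "seq_2"]
final_anchors = [("AAA", "2020-01-02"), ("BBB", "2020-01-02")]

aligned_sequences, aligned_anchors = align_sequences_to_anchors(
    sequences, anchors, final_anchors
)
print(aligned_sequences)  # ['seq_0', 'seq_2']
```

The filtered output follows the order of the original anchors. It lines up row-for-row with a target matrix only if the final anchor list preserves that same order, which this sketch assumes rather than checks.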


In [None]:
# Task 33: Conduct robustness and ablation studies

# ==============================================================================
# Task 33: Conduct robustness and ablation studies
# ==============================================================================

# ------------------------------------------------------------------------------
# Task 33, Step 1: Define ablation test configurations.
# ------------------------------------------------------------------------------

def _define_ablation_configurations(
    base_config: Dict[str, Any]
) -> List[Tuple[str, Dict[str, Any]]]:
    """
    Generates a list of modified configuration dictionaries for ablation studies.

    Args:
        base_config (Dict[str, Any]): The original, validated study configuration.

    Returns:
        List[Tuple[str, Dict[str, Any]]]: A list of tuples, where each tuple is
            an (experiment_name, modified_config_dict).
    """
    experiments = []

    # --- Baseline Experiment ---
    # The first experiment is the full model with the base configuration.
    experiments.append(("baseline_full_model", copy.deepcopy(base_config)))

    # --- Ablation: Residual-only (No Behavioral Features) ---
    config_res_only = copy.deepcopy(base_config)
    config_res_only['descriptive_model']['bubble_score_synthesis']['alpha_1_hype_weight'] = 0.0
    config_res_only['descriptive_model']['bubble_score_synthesis']['alpha_2_sentiment_weight'] = 0.0
    experiments.append(("ablation_residual_only", config_res_only))

    # --- Ablation: No Hype ---
    config_no_hype = copy.deepcopy(base_config)
    config_no_hype['descriptive_model']['bubble_score_synthesis']['alpha_1_hype_weight'] = 0.0
    experiments.append(("ablation_no_hype", config_no_hype))

    # --- Ablation: No Sentiment ---
    config_no_sentiment = copy.deepcopy(base_config)
    config_no_sentiment['descriptive_model']['bubble_score_synthesis']['alpha_2_sentiment_weight'] = 0.0
    experiments.append(("ablation_no_sentiment", config_no_sentiment))

    # --- Sensitivity: Vary LPPL Window Size ---
    for window_size in [150, 200, 250, 300]:
        config_ws = copy.deepcopy(base_config)
        config_ws['descriptive_model']['lppl_fitting']['rolling_window_size'] = window_size
        experiments.append((f"sensitivity_lppl_window_{window_size}", config_ws))

    logging.info(f"Defined {len(experiments)} experiments for ablation and sensitivity analysis.")
    return experiments


# ------------------------------------------------------------------------------
# Task 33, Step 3: Compare ablation results and report sensitivity.
# ------------------------------------------------------------------------------

def _analyze_and_report_ablation_results(
    all_results: Dict[str, pd.DataFrame],
    output_dir: Path
) -> None:
    """
    Analyzes results from all experiments and generates a summary report and plots.

    Args:
        all_results (Dict[str, pd.DataFrame]): A dictionary mapping experiment
            names to their performance summary DataFrames.
        output_dir (Path): The directory to save the final report and plots.
    """
    summary_metrics = []
    # Calculate aggregate metrics for each experiment's result.
    for name, result_df in all_results.items():
        if not result_df.empty:
            summary_metrics.append({
                'Experiment': name,
                'Mean_Annualized_Return': result_df['Annualized_Return'].mean(),
                'Mean_Sharpe_Ratio': result_df['Sharpe_Ratio'].mean(),
                'Mean_Max_Drawdown': result_df['Max_Drawdown'].mean(),
                'Mean_Win_Rate': result_df['Win_Rate'].mean()
            })

    if not summary_metrics:
        logging.warning("No results to analyze for ablation study.")
        return

    # Create a comparison DataFrame.
    comparison_df = pd.DataFrame(summary_metrics).set_index('Experiment')

    # Save the final comparison table.
    report_path = output_dir / "ablation_comparison_summary.csv"
    comparison_df.to_csv(report_path)
    logging.info(f"Ablation comparison summary saved to '{report_path}'.")
    logging.info("\n--- Ablation Study Summary ---")
    logging.info(comparison_df.to_string())

    # --- Generate Plots ---
    # Set plot style.
    sns.set_theme(style="whitegrid")

    # Plot 1: Bar chart for core ablation study.
    ablation_names = ['baseline_full_model', 'ablation_residual_only', 'ablation_no_hype', 'ablation_no_sentiment']
    # Failed experiments are excluded from comparison_df, so select only the
    # names that are present; indexing a missing label with .loc raises a KeyError.
    plot_data = comparison_df.loc[[n for n in ablation_names if n in comparison_df.index]]

    if plot_data.empty:
        logging.warning("No core ablation results available; skipping the ablation bar chart.")
    else:
        fig, ax = plt.subplots(2, 1, figsize=(10, 12))
        sns.barplot(x=plot_data.index, y='Mean_Annualized_Return', data=plot_data, ax=ax[0])
        ax[0].set_title('Ablation Study: Mean Annualized Return')
        ax[0].set_ylabel('Mean Annualized Return')
        ax[0].tick_params(axis='x', rotation=15)

        sns.barplot(x=plot_data.index, y='Mean_Sharpe_Ratio', data=plot_data, ax=ax[1])
        ax[1].set_title('Ablation Study: Mean Sharpe Ratio')
        ax[1].set_ylabel('Mean Sharpe Ratio')
        ax[1].tick_params(axis='x', rotation=15)

        plt.tight_layout()
        plot_path = output_dir / "ablation_core_performance.png"
        fig.savefig(plot_path)
        logging.info(f"Core ablation performance plot saved to '{plot_path}'.")
        plt.close(fig)

    # Plot 2: Line plot for LPPL window size sensitivity.
    # Take an explicit copy so the new column is not assigned to a slice of comparison_df.
    sensitivity_data = comparison_df[comparison_df.index.str.startswith('sensitivity_lppl_window_')].copy()

    if sensitivity_data.empty:
        logging.warning("No LPPL window sensitivity results available; skipping the sensitivity plot.")
    else:
        # Recover the window size from the numeric suffix of each experiment name.
        sensitivity_data['Window_Size'] = sensitivity_data.index.str.split('_').str[-1].astype(int)
        sensitivity_data = sensitivity_data.sort_values('Window_Size')

        fig, ax = plt.subplots(1, 1, figsize=(10, 6))
        sns.lineplot(x='Window_Size', y='Mean_Sharpe_Ratio', data=sensitivity_data, marker='o', ax=ax)
        ax.set_title('Sensitivity Analysis: Mean Sharpe Ratio vs. LPPL Window Size')
        ax.set_xlabel('LPPL Rolling Window Size (days)')
        ax.set_ylabel('Mean Sharpe Ratio')

        plot_path = output_dir / "sensitivity_lppl_window.png"
        fig.savefig(plot_path)
        logging.info(f"LPPL window sensitivity plot saved to '{plot_path}'.")
        plt.close(fig)


# ------------------------------------------------------------------------------
# Task 33, Orchestrator Function
# ------------------------------------------------------------------------------

def run_ablation_studies(
    df_raw: pd.DataFrame,
    base_config: Dict[str, Any],
    root_output_dir: Union[str, Path] = "results"
) -> None:
    """
    Orchestrates the execution of all ablation and sensitivity analysis experiments.

    This function serves as the controller for the ablation and sensitivity study. It
    systematically modifies the base configuration to create different
    experimental setups (e.g., removing behavioral features, varying
    hyperparameters). For each setup, it runs the entire end-to-end pipeline
    in an isolated directory. Finally, it aggregates and analyzes the results
    from all experiments to produce a final comparison report and visualizations,
    quantifying the contribution of each model component.

    Args:
        df_raw (pd.DataFrame):
            The raw input DataFrame, which will be used for all experimental runs.
        base_config (Dict[str, Any]):
            The baseline configuration for the study, which will be modified for
            each experiment.
        root_output_dir (Union[str, Path]):
            The root directory where results for all experiments will be saved
            in separate, named subdirectories.
    """
    # Get a logger specific to this master orchestrator.
    logger = logging.getLogger("AblationOrchestrator")
    logger.info("==========================================================")
    logger.info("=== STARTING ABLATION AND SENSITIVITY ANALYSIS ===")
    logger.info("==========================================================")

    # Define the root path for all experimental outputs.
    root_path = Path(root_output_dir)

    # --- Step 1: Define all experimental configurations. ---
    # This helper function generates a list of (name, config_dict) tuples.
    experiments = _define_ablation_configurations(base_config)

    # This dictionary will store the final performance summary DataFrame from each successful run.
    all_results: Dict[str, pd.DataFrame] = {}

    # --- Step 2: Run the full pipeline for each configuration. ---
    # This loop executes the entire, computationally intensive research pipeline for each defined experiment.
    for name, config in experiments:
        # Log the start of a new experimental run.
        logger.info(f"\n{'='*80}\n--- Running Experiment: {name} ---\n{'='*80}")

        # Define a unique, isolated output directory for this specific experiment
        # to prevent interference between runs.
        experiment_root_dir = root_path / name
        # Define subdirectories for organized output within the experiment's folder.
        intermediate_dir = experiment_root_dir / "data_intermediate"
        model_dir = experiment_root_dir / "models"
        log_dir = experiment_root_dir / "logs"
        report_dir = experiment_root_dir / "reports"

        # Create all necessary directories for the current experiment.
        for d in [intermediate_dir, model_dir, log_dir, report_dir]:
            d.mkdir(parents=True, exist_ok=True)

        try:
            # --- Execute the full end-to-end pipeline ---
            # Call the master orchestrator, passing the specific config
            # and isolated directories for this experimental run.
            performance_summary = run_hlppl_pipeline(
                df_raw=df_raw,
                study_parameters=config,
                intermediate_dir=intermediate_dir,
                model_dir=model_dir,
                log_dir=log_dir,
                report_dir=report_dir
            )
            # If the pipeline succeeds, store the final performance summary for later analysis.
            all_results[name] = performance_summary

        except Exception:
            # If any part of the pipeline fails for one experiment, log the error
            # and continue to the next one. This makes the overall study robust.
            logger.error(f"!!! Experiment '{name}' FAILED and will be excluded from the final analysis. !!!", exc_info=False)
            # Store an empty DataFrame to signify a failed run.
            all_results[name] = pd.DataFrame()

    # --- Step 3: Compare results and generate final report. ---
    logger.info(f"\n{'='*80}\n--- All experiments complete. Analyzing and reporting results. ---\n{'='*80}")
    # This function takes the dictionary of all results and produces the
    # final comparison tables and plots.
    _analyze_and_report_ablation_results(all_results, root_path)

    logger.info("==========================================================")
    logger.info("=== ABLATION AND SENSITIVITY ANALYSIS COMPLETE ===")
    logger.info("==========================================================")
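
The ablation configurations above are all derived from one nested dictionary, so each variant has to be an independent deep copy. The sketch below illustrates why, under stated assumptions: `make_variant` is a hypothetical helper, and the weights `0.5` and `0.3` are placeholder values, not the study's settings. Only the key names are taken from the configuration paths used above.

```python
import copy
from typing import Any, Dict, Tuple


def make_variant(base: Dict[str, Any], path: Tuple[str, ...], value: Any) -> Dict[str, Any]:
    """Hypothetical helper: return a deep copy of `base` with one nested key overridden."""
    variant = copy.deepcopy(base)
    node = variant
    for key in path[:-1]:
        node = node[key]  # Raises KeyError if the path does not exist.
    if path[-1] not in node:
        raise KeyError(f"Unknown configuration key: {path[-1]!r}")
    node[path[-1]] = value
    return variant


# Placeholder weights, assumed for illustration only.
toy_base = {
    "descriptive_model": {
        "bubble_score_synthesis": {
            "alpha_1_hype_weight": 0.5,
            "alpha_2_sentiment_weight": 0.3,
        }
    }
}
hype_path = ("descriptive_model", "bubble_score_synthesis", "alpha_1_hype_weight")

# Deep copy: the variant changes, the base does not.
no_hype = make_variant(toy_base, hype_path, 0.0)

# Shallow copy: the nested dictionaries are shared, so the "base" is mutated too.
shallow_base = copy.deepcopy(toy_base)
shallow_variant = dict(shallow_base)
shallow_variant["descriptive_model"]["bubble_score_synthesis"]["alpha_1_hype_weight"] = 0.0

print(toy_base["descriptive_model"]["bubble_score_synthesis"]["alpha_1_hype_weight"])      # 0.5
print(shallow_base["descriptive_model"]["bubble_score_synthesis"]["alpha_1_hype_weight"])  # 0.0
```

With a shallow copy, an override made for one experiment would leak into every later experiment built from the same base, which would make the ablation comparison meaningless.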


In [None]:
# Top-Level Orchestrator

# Ensure the logger is configured for the entire run.
# This top-level configuration will be inherited by all modules.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        # Log to a file for a persistent record of the entire study.
        logging.FileHandler("main_orchestrator.log", mode='w'),
        # Also log to the console for real-time feedback.
        logging.StreamHandler()
    ]
)

def execute_full_study(
    df_raw: pd.DataFrame,
    base_config: Dict[str, Any],
    run_ablation: bool = True,
    root_output_dir: Union[str, Path] = "study_results"
) -> Dict[str, Any]:
    """
    Serves as the master orchestrator for the entire HLPPL research study.

    This top-level function provides a single entry point to execute the complete
    research pipeline. It performs two major operations:
    1.  Executes a full end-to-end run of the baseline HLPPL model using the
        provided base configuration. The results of this run are stored in a
        dedicated 'baseline' subdirectory.
    2.  If requested, it then proceeds to run the comprehensive ablation and
        sensitivity analysis, which re-runs the entire pipeline once per
        experiment configuration, including a repeat of the unmodified
        baseline under the name 'baseline_full_model'.

    This function is designed for reproducibility and manages all I/O to ensure
    that results from different experimental runs are kept isolated and organized.

    Args:
        df_raw (pd.DataFrame):
            The raw input DataFrame containing all necessary market, fundamental,
            and news data for the study.
        base_config (Dict[str, Any]):
            The baseline configuration dictionary for the study. This will be
            used for the main run and as a template for the ablation studies.
        run_ablation (bool):
            A flag to control whether the computationally expensive ablation
            studies should be executed after the baseline run. Defaults to True.
        root_output_dir (Union[str, Path]):
            The root directory where all outputs from the entire study (baseline
            and ablation runs) will be saved.

    Returns:
        Dict[str, Any]:
            A dictionary containing the key results of the study, including the
            performance summary of the baseline model and paths to the detailed
            output directories.
    """
    # Get a logger for this top-level orchestrator.
    logger = logging.getLogger("MainOrchestrator")
    logger.info("==========================================================")
    logger.info("====== EXECUTING FULL HLPPL REPLICATION STUDY ======")
    logger.info("==========================================================")

    # Define the root path for all outputs of this study.
    root_path = Path(root_output_dir)
    root_path.mkdir(parents=True, exist_ok=True)

    # This dictionary will store the final results to be returned.
    results_summary: Dict[str, Any] = {}

    # --- 1. Execute the Baseline Pipeline Run ---
    # Define the isolated output directories for the baseline experiment.
    baseline_dir = root_path / "baseline"
    logger.info(f"\n--- Starting Baseline Pipeline Run (Outputs in: {baseline_dir}) ---")

    try:
        # Call the main pipeline orchestrator for the baseline configuration.
        baseline_performance = run_hlppl_pipeline(
            df_raw=df_raw,
            study_parameters=base_config,
            intermediate_dir=baseline_dir / "data_intermediate",
            model_dir=baseline_dir / "models",
            log_dir=baseline_dir / "logs",
            report_dir=baseline_dir / "reports"
        )
        # Store the key result from the baseline run.
        results_summary['baseline_performance'] = baseline_performance
        results_summary['baseline_output_directory'] = str(baseline_dir.resolve())
        logger.info("--- Baseline Pipeline Run Completed Successfully ---")

    except Exception as e:
        # If the baseline run fails, it's a critical error. Log and terminate.
        logger.critical("!!! BASELINE PIPELINE FAILED. ABORTING STUDY. !!!", exc_info=True)
        results_summary['status'] = 'FAILED'
        results_summary['error'] = str(e)
        return results_summary

    # --- 2. Conditionally Execute Ablation Studies ---
    # Check the flag to see if the user requested the ablation studies.
    if run_ablation:
        # The ablation orchestrator will manage its own subdirectories within the root path.
        ablation_dir = root_path / "ablation_studies"
        logger.info(f"\n--- Starting Ablation Studies (Outputs in: {ablation_dir}) ---")

        # Call the ablation study orchestrator. It handles all sub-runs internally.
        # Note: This function has its own internal error handling for individual
        # experiment failures and does not need to be wrapped in a try/except here.
        run_ablation_studies(
            df_raw=df_raw,
            base_config=base_config,
            root_output_dir=ablation_dir
        )
        # Store the path to the ablation results for easy access.
        results_summary['ablation_output_directory'] = str(ablation_dir.resolve())
        logger.info("--- Ablation Studies Completed ---")
    else:
        # If ablation studies are skipped, log this information.
        logger.info("\n--- Skipping Ablation Studies as per configuration (run_ablation=False) ---")

    # --- Finalization ---
    # Log the successful completion of the entire study.
    logger.info("==========================================================")
    logger.info("========= FULL HLPPL REPLICATION STUDY COMPLETE =========")
    logger.info("==========================================================")

    # Add a final success status to the results summary.
    results_summary['status'] = 'SUCCESS'

    # Return the dictionary containing the key results and output paths.
    return results_summary
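
The dictionary returned by `execute_full_study` has a different shape on each path: a failed baseline run sets only `status` and `error`, while a successful run sets `status`, the baseline keys, and, when `run_ablation=True`, `ablation_output_directory`. The sketch below shows one way a caller might inspect it. The helper name `describe_study_result` and the example dictionaries are assumptions for illustration; only the key names come from the function above.

```python
from typing import Any, Dict


def describe_study_result(results_summary: Dict[str, Any]) -> str:
    """Hypothetical helper: build a one-line description of a study result dictionary."""
    status = results_summary.get("status", "UNKNOWN")
    if status == "FAILED":
        # On failure, only 'status' and 'error' are populated.
        return f"Study failed: {results_summary.get('error', 'no error message recorded')}"

    parts = [f"Study status: {status}"]
    if "baseline_output_directory" in results_summary:
        parts.append(f"baseline outputs in {results_summary['baseline_output_directory']}")
    if "ablation_output_directory" in results_summary:
        parts.append(f"ablation outputs in {results_summary['ablation_output_directory']}")
    return "; ".join(parts)


# Example dictionaries (assumed values, mirroring the keys set above).
failed = {"status": "FAILED", "error": "example error"}
succeeded = {
    "status": "SUCCESS",
    "baseline_output_directory": "/tmp/study_results/baseline",
}

print(describe_study_result(failed))     # Study failed: example error
print(describe_study_result(succeeded))  # Study status: SUCCESS; baseline outputs in /tmp/study_results/baseline
```

A `SUCCESS` status reflects only the baseline run and the completion of the ablation loop. Individual ablation experiments that failed are recorded in the logs and left out of the comparison summary rather than reported in this dictionary.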
