# Deep Dive: Time Series Cross-Validation Implementation
## Complete Line-by-Line Analysis with Theory, Logic, and Interconnected Reasoning

### Overview
This notebook provides a comprehensive, deeply reasoned explanation of our improved time series cross-validation implementation. We'll examine every design choice, understand the theoretical foundations, and see how each decision connects to previous ones in a logical chain of reasoning.

### What We'll Cover:
1. **Theoretical Foundation**: Why traditional CV fails for time series
2. **Import Strategy**: Each library choice and its purpose
3. **Data Structures**: Every field, type hint, and their interconnections
4. **Class Architecture**: Design patterns and their justifications
5. **Split Generation Logic**: Mathematical foundation and implementation
6. **Boundary Calculations**: Precise temporal alignment reasoning
7. **Index Extraction**: Memory efficiency and accuracy trade-offs
8. **Model Training Pipeline**: Isolation principles and validation
9. **Loss Calculation**: Per-origin tracking and aggregation strategies
10. **Statistical Testing**: Preparation for rigorous model comparison

### Learning Philosophy
- **Every choice has a reason** - we'll explain the "why" behind each decision
- **Connections matter** - we'll show how each step builds on previous ones
- **Context is key** - we'll relate everything to practical time series forecasting
- **No shortcuts** - we'll dive deep into the mathematical and logical foundations

## 1. Theoretical Foundation: Why Traditional Cross-Validation Fails

### The Core Problem
Traditional k-fold cross-validation assumes **exchangeable data** - that any subset can represent the population. This assumption is **fundamentally violated** in time series because:

1. **Temporal Dependencies**: Future values depend on past values
2. **Data Leakage**: Using future data to predict the past creates unrealistic performance estimates
3. **Non-Stationarity**: Statistical properties change over time
4. **Seasonality**: Patterns repeat at regular intervals
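
The leakage point can be made concrete with a small experiment. The sketch below uses only the standard library and hypothetical helper names (`shuffled_kfold`, `leaks_future`): it partitions 100 time-ordered indices into 5 shuffled folds and counts the folds whose training set contains observations later than the earliest test observation.

```python
import random


def shuffled_kfold(n_samples, n_folds, seed=0):
    """Yield (train, test) index lists from a shuffled k-fold partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        test = sorted(indices[k::n_folds])      # every n_folds-th shuffled index
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test


def leaks_future(train, test):
    """True when any training index lies after the earliest test index."""
    return max(train) > min(test)


folds = list(shuffled_kfold(n_samples=100, n_folds=5))
leaky_folds = sum(leaks_future(train, test) for train, test in folds)
print(f"{leaky_folds} of {len(folds)} shuffled folds train on data after the test start")
```

A fold avoids leakage only if its test set happens to be the final block of the series, which at most one fold of a partition can be, so shuffled k-fold leaks in almost every fold by construction.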

### What We Need Instead
For time series, we need **time-aware validation** that:
- Respects temporal order (never train on future to predict past)
- Maintains realistic forecasting scenarios
- Tracks performance across different time periods
- Enables statistical comparison between models

### Our Solution: Rolling Origin Cross-Validation
We implement a variant that:
1. **Preserves temporal order**: Training always precedes testing
2. **Creates realistic scenarios**: Each fold simulates real forecasting
3. **Tracks per-origin performance**: Enables detailed analysis
4. **Supports statistical testing**: Provides data for model comparison
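
As a rough illustration of the rolling-origin idea in its simplest generic form (not the even-spacing scheme implemented later in this notebook), the following sketch yields expanding training windows followed by a fixed-length test window. The function name and its parameters (`first_origin`, `horizon`, `step`) are assumptions made for this example.

```python
def rolling_origin_splits(n_obs, first_origin, horizon=1, step=1):
    """Yield (train_range, test_range) pairs with an expanding training window."""
    origin = first_origin
    while origin + horizon <= n_obs:
        # Train on everything strictly before the origin, test on the next `horizon` points
        yield range(0, origin), range(origin, origin + horizon)
        origin += step


splits = list(rolling_origin_splits(n_obs=10, first_origin=6, horizon=2, step=2))
for train, test in splits:
    print(f"train 0..{train[-1]}  ->  test {list(test)}")
```

Every training range ends before its test range begins, which is the temporal-order property the list above asks for.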

This foundation drives every design choice we'll examine next.

## 2. Import Strategy: Every Library Choice Has a Purpose

Let's examine each import and understand why it's essential for our implementation:

In [1]:
# Let's examine our imports and understand each choice
import numpy as np
import pandas as pd
import torch
from dataclasses import dataclass, field
from typing import List, Tuple, Dict, Optional, Any, Union
from collections import defaultdict

print("Import Analysis:")
print("================")
print("numpy (as np): Mathematical operations, array handling")
print("  - Why: Fast numerical computation for index calculations")
print("  - Connection: Foundation for efficient data manipulation")
print()
print("pandas (as pd): Data structure and time series handling")
print("  - Why: DatetimeIndex operations, data alignment")
print("  - Connection: Builds on numpy for structured time series data")
print()
print("torch: Deep learning framework")
print("  - Why: Model training, loss computation, device management")
print("  - Connection: Our models are PyTorch-based, needs tensor operations")
print()
print("dataclasses: Structured data containers")
print("  - Why: Typed, self-documenting data structures")
print("  - Connection: Replaces error-prone dictionaries with explicit, named fields")
print()
print("typing: Type hints and annotations")
print("  - Why: Code clarity, IDE support, static type checking")
print("  - Connection: Essential for maintainable, large-scale projects")
print()
print("defaultdict: Automatic dictionary initialization")
print("  - Why: Simplifies nested data structure creation")
print("  - Connection: Reduces boilerplate in loss aggregation")

Import Analysis:
================
numpy (as np): Mathematical operations, array handling
  - Why: Fast numerical computation for index calculations
  - Connection: Foundation for efficient data manipulation

pandas (as pd): Data structure and time series handling
  - Why: DatetimeIndex operations, data alignment
  - Connection: Builds on numpy for structured time series data

torch: Deep learning framework
  - Why: Model training, loss computation, device management
  - Connection: Our models are PyTorch-based, needs tensor operations

dataclasses: Structured data containers
  - Why: Typed, self-documenting data structures
  - Connection: Replaces error-prone dictionaries with explicit, named fields

typing: Type hints and annotations
  - Why: Code clarity, IDE support, static type checking
  - Connection: Essential for maintainable, large-scale projects

defaultdict: Automatic dictionary initialization
  - Why: Simplifies nested data structure creation
  - Connection: Reduces boilerplate in loss aggregation

## 3. Data Structures: The Foundation of Type Safety and Clarity

Our implementation uses three key dataclasses. Let's understand why each field exists and how they interconnect:

In [None]:
# Data Structure 1: CVFold - Represents a single cross-validation fold
@dataclass
class CVFold:
    """
    A single cross-validation fold with complete information about train/test splits.
    
    WHY THIS STRUCTURE:
    - Encapsulates all information needed for one fold
    - Type safety prevents common indexing errors
    - Self-documenting through field names
    """
    
    # Core split information - the fundamental data
    train_indices: np.ndarray  # WHERE to train (indices into original data)
    test_indices: np.ndarray   # WHERE to test (indices into original data)
    
    # Temporal context - connects this fold to the time dimension
    origin_date: pd.Timestamp  # WHEN this fold's forecast originates
    test_start_date: pd.Timestamp  # WHEN the test period begins
    test_end_date: pd.Timestamp    # WHEN the test period ends
    
    # Metadata - helps with debugging and analysis
    fold_number: int           # WHICH fold this is (for ordering/tracking)

print("CVFold Design Rationale:")
print("========================")
print("✓ train_indices: np.ndarray - Fast integer indexing, memory efficient")
print("✓ test_indices: np.ndarray - Consistent with train_indices type")
print("✓ origin_date: pd.Timestamp - Precise temporal reference point")
print("✓ test_start/end_date: pd.Timestamp - Define exact test window")
print("✓ fold_number: int - Simple ordering for analysis")
print()
print("KEY INSIGHT: Each fold is completely self-contained")
print("- No dependencies on external state")
print("- Can be processed independently")
print("- Contains all information needed for reproducible training")
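
Dataclass type hints are not enforced at runtime, so a fold built with inconsistent indices would be accepted silently. If runtime checks are wanted, a `__post_init__` hook is one option. The sketch below is an assumption-laden illustration: `CheckedFold` is a hypothetical stand-in for `CVFold` that uses plain lists and `datetime` instead of numpy arrays and pandas timestamps so that it runs without third-party packages.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class CheckedFold:
    """Hypothetical stand-in for CVFold that checks its own invariants."""
    train_indices: List[int]
    test_indices: List[int]
    origin_date: datetime
    test_start_date: datetime
    test_end_date: datetime
    fold_number: int

    def __post_init__(self):
        # Runs right after the generated __init__; only reads fields, so frozen=True is fine
        if not self.train_indices or not self.test_indices:
            raise ValueError("train and test indices must be non-empty")
        if max(self.train_indices) >= min(self.test_indices):
            raise ValueError("training indices must precede test indices")
        if not (self.origin_date < self.test_start_date <= self.test_end_date):
            raise ValueError("expected origin < test_start <= test_end")


fold = CheckedFold(
    train_indices=[0, 1, 2],
    test_indices=[3],
    origin_date=datetime(2024, 1, 3),
    test_start_date=datetime(2024, 1, 4),
    test_end_date=datetime(2024, 1, 4),
    fold_number=0,
)
print(fold.fold_number, fold.test_indices)
```

Whether such checks belong in the container or in the code that builds it is a design choice; the notebook's own implementation validates indices in `_generate_aligned_indices` instead.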

In [None]:
# Data Structure 2: CVResults - Comprehensive results from cross-validation
@dataclass
class CVResults:
    """
    Complete results from cross-validation with multi-level loss tracking.
    
    WHY THIS HIERARCHICAL STRUCTURE:
    - Different analysis levels need different aggregations
    - Statistical testing requires per-origin data
    - Debugging needs fold-level detail
    """
    
    # Per-origin losses - the finest granularity
    per_origin_losses: Dict[str, List[float]] = field(default_factory=dict)
    # Structure: {origin_date_str: [loss_step1, loss_step2, ...]}
    # WHY: Enables per-timepoint analysis and statistical testing
    
    # Per-fold losses - intermediate aggregation
    per_fold_losses: Dict[str, float] = field(default_factory=dict) 
    # Structure: {fold_identifier: average_loss_for_fold}
    # WHY: Fold-level performance comparison and debugging
    
    # Overall metrics - highest level aggregation  
    overall_metrics: Dict[str, float] = field(default_factory=dict)
    # Structure: {metric_name: aggregated_value}
    # WHY: Single numbers for model comparison and reporting

print("CVResults Design Rationale:")
print("===========================")
print("✓ Three-tier structure matches analysis needs:")
print("  - per_origin_losses: Statistical testing, temporal analysis")
print("  - per_fold_losses: Cross-validation diagnostics")  
print("  - overall_metrics: Model comparison, reporting")
print()
print("✓ field(default_factory=dict) gives each instance its own dict (no shared mutable default)")
print("✓ String keys enable flexible metric naming")
print("✓ Hierarchical design allows drilling down from summary to detail")
print()
print("CRITICAL CONNECTION: This structure directly supports")
print("the Diebold-Mariano test for statistical significance!")
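
To show how the `per_origin_losses` layout feeds a Diebold-Mariano style comparison, here is a simplified sketch. It assumes one-step-ahead forecasts, so the variance of the loss differential is estimated without autocovariance terms, and it omits small-sample corrections. The function names and the toy loss values are invented for illustration only.

```python
import math
from statistics import mean, pvariance


def loss_differentials(losses_a, losses_b):
    """Return the common origins and per-origin mean-loss differences (A minus B)."""
    common = sorted(set(losses_a) & set(losses_b))   # only origins both models forecast
    diffs = [mean(losses_a[o]) - mean(losses_b[o]) for o in common]
    return common, diffs


def dm_statistic_h1(diffs):
    """Simplified DM statistic for one-step forecasts: mean(d) / sqrt(var(d) / n)."""
    return mean(diffs) / math.sqrt(pvariance(diffs) / len(diffs))


# Toy per-origin losses in the {origin_date_str: [loss_step1, ...]} layout
model_a = {"2024-01-01": [1.0], "2024-01-02": [2.0], "2024-01-03": [1.5], "2024-01-04": [2.5]}
model_b = {"2024-01-01": [0.5], "2024-01-02": [1.0], "2024-01-03": [1.5], "2024-01-05": [9.0]}

origins, diffs = loss_differentials(model_a, model_b)
stat = dm_statistic_h1(diffs)
print(origins, diffs, round(stat, 3))
```

Origins that appear in only one model are dropped before differencing, because the test needs matched pairs of losses.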

## 4. Class Architecture: The ImprovedTimeSeriesCrossValidator

Now we examine the main class. Every design choice here builds on our previous decisions:

In [None]:
# Class Definition and Initialization
class ImprovedTimeSeriesCrossValidator:
    """
    Time series cross-validator with proper temporal splits and comprehensive tracking.
    
    DESIGN PHILOSOPHY:
    - Explicit is better than implicit (all parameters are clear)
    - Fail fast (validation in __init__)
    - Single responsibility (this class only does CV splitting and tracking)
    """
    
    def __init__(self, 
                 n_splits: int = 5,
                 train_size: Optional[int] = None,
                 test_size: int = 1,
                 gap: int = 0,
                 expanding_window: bool = True):
        """
        WHY EACH PARAMETER EXISTS:
        
        n_splits: int = 5
        - DEFAULT RATIONALE: 5 is the most common CV choice, balances bias/variance
        - CONNECTION: More splits = less bias but more variance in performance estimates
        - PRACTICAL: With daily data, 5 splits gives good temporal coverage
        
        train_size: Optional[int] = None  
        - DEFAULT RATIONALE: None means use all available data (expanding window)
        - CONNECTION: Fixed size creates walk-forward window, None creates expanding
        - TRADE-OFF: Fixed size = consistent train data volume, expanding = more data
        
        test_size: int = 1
        - DEFAULT RATIONALE: Single-step ahead is most common forecasting scenario
        - CONNECTION: Matches our per-origin loss tracking (1 prediction per origin)
        - FLEXIBILITY: Can increase for multi-step forecasting
        
        gap: int = 0
        - DEFAULT RATIONALE: No gap assumes immediate forecasting capability
        - CONNECTION: Real-world may need gaps for data availability delays
        - SAFETY: Prevents accidental data leakage if processing delays exist
        
        expanding_window: bool = True
        - DEFAULT RATIONALE: More training data generally improves performance
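        - CONNECTION: In the methods shown in this notebook, train_size alone selects the window
          (None = expanding, int = fixed); this flag is stored but never read by them
        - CHOICE: Intended meaning: expanding=True uses all past data, False uses a fixed window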
        """
        
        # VALIDATION LOGIC: Fail fast with clear error messages
        if n_splits < 2:
            raise ValueError("n_splits must be at least 2 for meaningful cross-validation")
        if test_size < 1:
            raise ValueError("test_size must be at least 1")
        if gap < 0:
            raise ValueError("gap cannot be negative")
            
        # Store parameters for later use
        self.n_splits = n_splits
        self.train_size = train_size  
        self.test_size = test_size
        self.gap = gap
        self.expanding_window = expanding_window

print("Class Initialization Design Rationale:")
print("======================================")
print("✓ All parameters have sensible defaults based on common use cases")
print("✓ Validation prevents silent failures later in the pipeline")  
print("✓ Parameter storage enables method access without globals")
print("✓ Type hints make usage clear and enable IDE support")
print()
print("KEY INSIGHT: This initialization establishes the 'contract'")
print("for what kind of cross-validation we'll perform.")

## 5. Split Generation Logic: The Heart of Time Series Cross-Validation

The `get_rolling_origin_aligned_splits` method is where theory meets practice. Let's understand every line:

In [None]:
def get_rolling_origin_aligned_splits(self, data: pd.DataFrame) -> List[CVFold]:
    """
    Generate time series cross-validation splits aligned with RollingOrigin evaluation.
    
    MATHEMATICAL FOUNDATION:
    For time series with T observations, we need splits that:
    1. Preserve temporal order: train_end < test_start
    2. Have consistent test window size
    3. Align with forecast origins for proper evaluation
    
    WHY "ROLLING ORIGIN ALIGNED":
    - Rolling: Each fold advances the origin point forward in time
    - Origin: The point in time from which we make forecasts  
    - Aligned: Test periods align with evaluation framework
    """
    
    # STEP 1: Input validation - fail fast with clear messages
    if not isinstance(data.index, pd.DatetimeIndex):
        raise ValueError("Data must have DatetimeIndex for time series CV")
    # WHY: We need temporal information for proper splitting
    # CONNECTION: This validates our assumption from theoretical foundation
    
    if len(data) < self.n_splits + self.test_size + self.gap:
        raise ValueError(f"Data too short: need at least {self.n_splits + self.test_size + self.gap} points")
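    # WHY: Necessary condition for creating the requested number of splits
    # MATH: n_splits origins + test_size per split + gap between train/test
    # CAVEAT: Not sufficient when gap > 0; if gap >= the spacing between test starts,
    #         the earliest folds end up with an empty training window
    # CONNECTION: Builds on our initialization parameter validation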
    
    print("Step 1 Analysis:")
    print("===============")
    print("✓ DatetimeIndex validation ensures we can work with temporal data")
    print("✓ Length validation prevents impossible split requests")  
    print("✓ Both connect back to our theoretical foundation of time-aware CV")
    
    # STEP 2: Calculate fold boundaries - the core mathematical logic
    fold_boundaries = self._calculate_fold_boundaries(len(data))
    # WHY: Separates complex boundary logic into testable function
    # CONNECTION: This is where our n_splits parameter gets translated to actual indices
    
    print("\nStep 2 Analysis:")
    print("===============")
    print("✓ Fold boundaries define WHERE each split occurs in time")
    print("✓ Separation of concerns: this method orchestrates, _calculate_fold_boundaries computes")
    print("✓ Mathematical logic isolated for testing and validation")
    
    # STEP 3: Generate actual indices for each fold
    folds = []
    for i, (train_start, train_end, test_start, test_end) in enumerate(fold_boundaries):
        # Extract indices using our boundary calculations
        train_indices, test_indices = self._generate_aligned_indices(
            data, train_start, train_end, test_start, test_end
        )
        # WHY: Another separation of concerns - boundaries vs actual index extraction
        # CONNECTION: Uses our CVFold structure to encapsulate results
        
        # Create fold with complete temporal context
        fold = CVFold(
            train_indices=train_indices,
            test_indices=test_indices,
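            origin_date=data.index[test_start - 1],  # Last observation before the test window
            test_start_date=data.index[test_start],
            test_end_date=data.index[test_end - 1],
            fold_number=i
        )
        # WHY: origin_date is test_start - 1 because we forecast FROM the last observation before the test window
        # NOTE: This is the last training point only when gap == 0; with gap > 0 training ends at test_start - gap - 1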
        # CONNECTION: This aligns with RollingOrigin evaluation framework
        
        folds.append(fold)
    
    print("\nStep 3 Analysis:")
    print("===============")
    print("✓ Index generation handles the actual data slicing")
    print("✓ CVFold creation packages all information together")
    print("✓ origin_date calculation aligns with forecasting reality")
    print("✓ Complete temporal context enables detailed analysis")
    
    return folds

print("Overall Method Design Rationale:")
print("===============================")
print("✓ Three-step process: validate, calculate, generate")
print("✓ Each step builds on previous parameter validation")
print("✓ Clear separation of concerns enables testing and debugging")
print("✓ Mathematical rigor ensures temporal consistency")
print("✓ Rich metadata supports downstream analysis")
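
One subtlety in the `origin_date` indexing deserves a quick check. Because the boundary logic sets `train_end = test_start - gap`, the row at `test_start - 1` is the last training row only when the gap is zero. The sketch below uses a made-up ten-day calendar and hypothetical values for `test_start` and `gap` to show the difference.

```python
from datetime import date, timedelta

# Hypothetical daily index: 2024-01-01 through 2024-01-10
dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(10)]

test_start, gap = 5, 2
train_end = test_start - gap                 # exclusive end of the training window

origin_as_indexed = dates[test_start - 1]    # what origin_date would hold
last_training_date = dates[train_end - 1]    # last row the model actually saw

print("origin_date:       ", origin_as_indexed)
print("last training date:", last_training_date)
```

With a gap of two days the two dates differ by two days, so any analysis that treats `origin_date` as "the last information available to the model" should account for the gap.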

## 6. Boundary Calculations: The Mathematical Core

The `_calculate_fold_boundaries` method contains the mathematical logic that defines our cross-validation strategy:

In [None]:
def _calculate_fold_boundaries(self, data_length: int) -> List[Tuple[int, int, int, int]]:
    """
    Calculate fold boundaries for time series cross-validation.
    
    MATHEMATICAL LOGIC:
    Given data of length N, we need to determine (train_start, train_end, test_start, test_end)
    for each fold such that:
    1. No temporal leakage: train_end + gap ≤ test_start
    2. Consistent test windows: test_end - test_start = test_size
    3. Proper spacing: test windows are evenly distributed
    
    Returns: List[(train_start, train_end, test_start, test_end)]
    """
    
    # STEP 1: Calculate available space for cross-validation
    available_length = data_length - self.test_size - self.gap
    # WHY: Reserve room at the end for the last test window, plus `gap` extra points
    # MATH: Test starts never exceed N - test_size - gap, so every test window ends by index N - gap
    # CONNECTION: This builds on our initialization validation
    
    print("Boundary Calculation Step 1:")
    print("===========================")
    print(f"Data length: {data_length}")
    print(f"Test size: {self.test_size}")  
    print(f"Gap: {self.gap}")
    print(f"Available for CV: {available_length}")
    print("LOGIC: We reserve space at the end for the final test window")
    
    # STEP 2: Determine test start positions
    if self.n_splits == 1:
        # Special case: single split uses all available data
        # NOTE: Unreachable as written, because __init__ rejects n_splits < 2
        test_starts = [available_length]
        # WHY: With one split, we want to use maximum training data
        # CONNECTION: Connects to expanding_window philosophy
    else:
        # Multiple splits: distribute test windows evenly
        step_size = available_length // self.n_splits
        # WHY: Even distribution gives balanced temporal coverage
        # MATH: Integer division ensures we don't exceed available space
        
        test_starts = [step_size * (i + 1) for i in range(self.n_splits)]
        # WHY: Starts at step_size, not 0, to ensure minimum training data
        # PATTERN: [step_size, 2*step_size, 3*step_size, ...]
        # CONNECTION: This creates the "rolling" in rolling origin
    
    print("\nBoundary Calculation Step 2:")
    print("===========================")
    print(f"Test start positions: {test_starts}")
    print("LOGIC: Even spacing ensures representative temporal coverage")
    
    # STEP 3: Calculate complete boundaries for each fold
    boundaries = []
    for i, test_start in enumerate(test_starts):
        # Test window: fixed size starting at test_start
        test_end = test_start + self.test_size
        # WHY: Consistent test window size for fair comparison
        # MATH: test_end is exclusive (Python slice convention)
        
        # Apply gap before test window
        train_end = test_start - self.gap
        # WHY: Gap prevents data leakage in realistic scenarios
        # CONNECTION: Links to our initialization gap parameter
        
        # Determine training window start
        if self.train_size is None:
            # Expanding window: use all available data
            train_start = 0
            # WHY: More training data generally improves performance
            # PHILOSOPHY: "All available information" approach
        else:
            # Fixed window: use only recent data
            train_start = max(0, train_end - self.train_size)
            # WHY: Consistent training data volume across folds
            # SAFETY: max(0, ...) prevents negative indices
            # TRADE-OFF: Consistency vs maximum information
        
        boundaries.append((train_start, train_end, test_start, test_end))
        
        print(f"\nFold {i} boundary analysis:")
        print(f"  Train: [{train_start}:{train_end}] (size: {train_end - train_start})")
        print(f"  Gap: [{train_end}:{test_start}] (size: {test_start - train_end})")
        print(f"  Test: [{test_start}:{test_end}] (size: {test_end - test_start})")
    
    print("\nBoundary Calculation Step 3:")
    print("===========================")
    print("✓ Test windows have consistent size")
    print("✓ Gap prevents temporal leakage")
    print("✓ Training windows follow expanding/fixed strategy")
    print("✓ All indices are valid (non-negative, within bounds)")
    
    return boundaries

print("Overall Boundary Calculation Rationale:")
print("======================================")
print("✓ Mathematical rigor ensures temporal consistency")
print("✓ Three-step process: available space → test positions → complete boundaries")
print("✓ Handles both expanding and fixed window strategies")
print("✓ Gap parameter provides realistic forecasting scenarios")
print("✓ Boundary validation prevents index errors downstream")
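
A worked example may help make the arithmetic tangible. The standalone function below is a sketch that mirrors the boundary logic shown above (without the print statements or the single-split branch); the name `fold_boundaries` and the chosen inputs are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def fold_boundaries(data_length: int, n_splits: int, test_size: int = 1,
                    gap: int = 0, train_size: Optional[int] = None
                    ) -> List[Tuple[int, int, int, int]]:
    """Return (train_start, train_end, test_start, test_end) per fold, ends exclusive."""
    available = data_length - test_size - gap
    step = available // n_splits
    bounds = []
    for i in range(n_splits):
        test_start = step * (i + 1)
        test_end = test_start + test_size
        train_end = test_start - gap
        train_start = 0 if train_size is None else max(0, train_end - train_size)
        bounds.append((train_start, train_end, test_start, test_end))
    return bounds


expanding = fold_boundaries(20, n_splits=3, test_size=2, gap=1)
fixed = fold_boundaries(20, n_splits=3, test_size=2, gap=1, train_size=6)
tight = fold_boundaries(6, n_splits=3, test_size=2, gap=1)  # shortest length the check allows

print("expanding:", expanding)
print("fixed:    ", fixed)
print("tight:    ", tight)
```

With 20 points the available length is 17 and the step is 5, so tests start at 5, 10, and 15. The `tight` case uses the minimum length accepted by the length check and produces a first fold whose training window `[0:0]` is empty, which suggests the check is too permissive when `gap > 0`.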

## 7. Index Extraction: From Boundaries to Actual Data

The `_generate_aligned_indices` method converts our mathematical boundaries into actual data indices:

In [None]:
def _generate_aligned_indices(self, data: pd.DataFrame, 
                            train_start: int, train_end: int,
                            test_start: int, test_end: int) -> Tuple[np.ndarray, np.ndarray]:
    """
    Generate actual indices for train/test splits from boundary positions.
    
    WHY SEPARATE FROM BOUNDARY CALCULATION:
    - Boundaries work with abstract positions
    - Indices work with actual data
    - Separation enables testing boundary logic independently
    - Memory efficiency: create arrays only when needed
    """
    
    # STEP 1: Generate training indices
    train_indices = np.arange(train_start, train_end)
    # WHY np.arange: Memory efficient, fast integer sequence generation
    # CONVENTION: train_end is exclusive (standard Python slicing)
    # CONNECTION: Builds directly on boundary calculations
    
    print("Index Extraction Analysis:")
    print("=========================")
    print(f"Train boundaries: [{train_start}:{train_end}]")
    print(f"Train indices shape: {train_indices.shape}")
    print(f"Train indices preview: {train_indices[:5]}...{train_indices[-5:]}")
    
    # STEP 2: Generate test indices  
    test_indices = np.arange(test_start, test_end)
    # WHY: Consistent with training index generation
    # MEMORY: Small arrays for typical test sizes (1-7 days)
    # CONNECTION: test_end from boundary calculation
    
    print(f"Test boundaries: [{test_start}:{test_end}]")
    print(f"Test indices shape: {test_indices.shape}")
    print(f"Test indices: {test_indices}")
    
    # STEP 3: Validation and safety checks
    # Reject empty windows first: .min()/.max() raise ValueError on empty arrays
    assert train_indices.size > 0, f"Empty training window [{train_start}:{train_end}]"
    assert test_indices.size > 0, f"Empty test window [{test_start}:{test_end}]"
    
    # Ensure no negative indices (shouldn't happen with proper boundaries)
    assert train_indices.min() >= 0, f"Negative training index: {train_indices.min()}"
    assert test_indices.min() >= 0, f"Negative test index: {test_indices.min()}"
    
    # Ensure indices don't exceed data length
    assert train_indices.max() < len(data), f"Training index {train_indices.max()} >= data length {len(data)}"
    assert test_indices.max() < len(data), f"Test index {test_indices.max()} >= data length {len(data)}"
    
    # Ensure temporal order (no training data after test data)
    assert train_indices.max() < test_indices.min(), "Training data cannot come after test data"
    
    print("Validation Results:")
    print("==================")
    print("✓ All indices are non-negative")
    print("✓ All indices are within data bounds")
    print("✓ Temporal order preserved (train < test)")
    print("✓ No data leakage possible")
    
    return train_indices, test_indices

print("Index Extraction Design Rationale:")
print("==================================")
print("✓ np.ndarray for memory efficiency and fast operations")
print("✓ Boundary-to-index separation enables independent testing")
print("✓ Comprehensive validation prevents runtime errors")
print("✓ Clear temporal ordering enforced at index level")
print("✓ Connection: Indices are ready for direct pandas/numpy slicing")

## 8. Model Training Pipeline: Putting It All Together

The `cross_validate_model` method orchestrates the entire cross-validation process:

In [None]:
def cross_validate_model(self, 
                        model_class,
                        data: pd.DataFrame,
                        target_col: str,
                        feature_cols: List[str],
                        model_params: Optional[Dict] = None,
                        loss_fn=None) -> CVResults:
    """
    Perform complete cross-validation with per-origin loss tracking.
    
    ORCHESTRATION PHILOSOPHY:
    - This method coordinates but delegates specific tasks
    - Each step builds on previous infrastructure
    - Comprehensive tracking enables detailed analysis
    - Clean separation of model training from CV logic
    """
    
    # STEP 1: Parameter setup and defaults
    if model_params is None:
        model_params = {}
    # WHY: Mutable default argument avoidance (Python best practice)
    # CONNECTION: Enables flexible model configuration
    
    if loss_fn is None:
        loss_fn = torch.nn.MSELoss()
    # WHY: MSE is most common regression loss, good default
    # FLEXIBILITY: Can override for custom loss functions
    # CONNECTION: torch.nn.MSELoss works with our PyTorch models
    
    print("Cross-Validation Pipeline Step 1:")
    print("================================")
    print("✓ Default parameters established")
    print("✓ Loss function configured") 
    print("✓ Ready for split generation")
    
    # STEP 2: Generate splits using our carefully designed logic
    folds = self.get_rolling_origin_aligned_splits(data)
    # WHY: This leverages all our previous work (boundaries, indices, validation)
    # CONNECTION: Returns List[CVFold] with complete temporal context
    # DELEGATION: Uses our proven split generation logic
    
    print(f"\nStep 2: Generated {len(folds)} folds")
    print("✓ Each fold has complete temporal context")
    print("✓ All validation passed in split generation")
    
    # STEP 3: Initialize results tracking
    results = CVResults()
    # WHY: Uses our structured results container
    # CONNECTION: Supports three-tier loss tracking (origin/fold/overall)
    # PREPARATION: Ready for statistical testing extraction
    
    # STEP 4: Train and evaluate each fold
    for fold in folds:
        print(f"\nProcessing Fold {fold.fold_number}:")
        print(f"  Origin: {fold.origin_date}")
        print(f"  Train size: {len(fold.train_indices)}")
        print(f"  Test size: {len(fold.test_indices)}")
        
        # Extract data for this fold
        X_train = data.iloc[fold.train_indices][feature_cols]
        y_train = data.iloc[fold.train_indices][target_col]
        X_test = data.iloc[fold.test_indices][feature_cols] 
        y_test = data.iloc[fold.test_indices][target_col]
        # WHY: iloc with our validated indices ensures correct data extraction
        # SAFETY: No risk of temporal leakage due to our boundary calculations
        # CONNECTION: Uses indices from our CVFold structure
        
        # Train model in isolation
        model = model_class(**model_params)
        model.fit(X_train, y_train)
        # WHY: Fresh model instance prevents contamination between folds
        # ISOLATION: Each fold gets independently trained model
        # CONNECTION: Requires only a fit/predict interface (sklearn-style);
        #             a raw PyTorch nn.Module needs a thin wrapper exposing both methods
        
        # Generate predictions
        predictions = model.predict(X_test)
        # WHY: Standard prediction interface
        # TEMPORAL: Predictions align with test period
        
        # Calculate per-step losses for this fold
        per_origin_losses = self._calculate_per_origin_losses(
            y_test, predictions, fold, loss_fn
        )
        # WHY: Delegates detailed loss calculation to specialized method
        # TRACKING: Maintains per-origin granularity for statistical testing
        # CONNECTION: Uses our fold metadata for proper labeling
        
        # Store results in our structured format
        results.per_origin_losses.update(per_origin_losses)
        # WHY: Accumulates across all folds for complete picture
        # STRUCTURE: Maintains per-origin detail for analysis
        
        # Calculate and store fold-level loss
        # Flatten first: each value is a list of per-step losses
        fold_loss = float(np.mean([loss for losses in per_origin_losses.values() for loss in losses]))
        results.per_fold_losses[f"fold_{fold.fold_number}"] = fold_loss
        # WHY: Fold-level aggregation enables fold comparison
        # CONNECTION: Builds intermediate summary from detailed data
    
    print("\nStep 4 Complete:")
    print("================")
    print("✓ All folds processed independently")
    print("✓ Per-origin losses tracked")
    print("✓ Fold-level summaries calculated")
    
    # STEP 5: Calculate overall metrics
    all_losses = []
    for losses_list in results.per_origin_losses.values():
        all_losses.extend(losses_list)
    # WHY: Flatten all per-origin losses for overall statistics
    # CONNECTION: Builds on our per-origin tracking
    
    results.overall_metrics = {
        'mean_loss': np.mean(all_losses),
        'std_loss': np.std(all_losses),
        'n_origins': len(results.per_origin_losses),
        'total_predictions': len(all_losses)
    }
    # WHY: Standard statistical summaries for model comparison
    # RICHNESS: Multiple metrics for different analysis needs
    # CONNECTION: Derived from our comprehensive tracking
    
    print("\nStep 5: Overall Metrics Calculated")
    print("=================================")
    print(f"✓ Mean loss: {results.overall_metrics['mean_loss']:.4f}")
    print(f"✓ Std loss: {results.overall_metrics['std_loss']:.4f}")
    print(f"✓ Origins: {results.overall_metrics['n_origins']}")
    print(f"✓ Predictions: {results.overall_metrics['total_predictions']}")
    
    return results

print("Cross-Validation Pipeline Design Rationale:")
print("==========================================")
print("✓ Orchestration pattern: coordinate but delegate specialized tasks")
print("✓ Five-step process: setup → splits → initialize → process → summarize")
print("✓ Each step builds on previous infrastructure")
print("✓ Comprehensive tracking enables multiple analysis levels")
print("✓ Clean model isolation prevents fold contamination")
print("✓ Structured results support downstream statistical testing")
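
The pipeline only asks the model for `fit` and `predict`, and it builds a new instance for every fold. The sketch below illustrates both points with a hypothetical naive forecaster (`LastValueModel`) and a stripped-down fold loop (`evaluate_folds`) that works on plain lists; it is not the notebook's `cross_validate_model`, just the same pattern in miniature.

```python
from statistics import mean


class LastValueModel:
    """Hypothetical naive forecaster exposing the fit/predict interface."""

    def fit(self, X, y):
        self.last_value_ = y[-1]              # remember the final training target
        return self

    def predict(self, X):
        return [self.last_value_] * len(X)    # repeat it for every test row


def evaluate_folds(model_class, X, y, boundaries):
    """Train a fresh model per fold and return the mean squared error per fold."""
    fold_losses = {}
    for k, (tr_start, tr_end, te_start, te_end) in enumerate(boundaries):
        model = model_class()                 # new instance: no state carries across folds
        model.fit(X[tr_start:tr_end], y[tr_start:tr_end])
        preds = model.predict(X[te_start:te_end])
        actuals = y[te_start:te_end]
        fold_losses[f"fold_{k}"] = mean((p - a) ** 2 for p, a in zip(preds, actuals))
    return fold_losses


y = [float(t) for t in range(12)]             # toy upward trend
X = [[t] for t in range(12)]                  # placeholder feature: the time index
losses = evaluate_folds(LastValueModel, X, y, [(0, 4, 4, 5), (0, 8, 8, 9)])
print(losses)
```

On a series that rises by one per step, the last-value forecast is always one unit low, so each single-step fold has a squared error of 1.0.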

## 9. Loss Calculation and Statistical Testing Preparation

The final pieces of our implementation focus on detailed loss tracking and preparing data for rigorous statistical comparison:

In [None]:
# Loss Calculation: Per-Origin Tracking
def _calculate_per_origin_losses(self, y_true: pd.Series, y_pred: np.ndarray, 
                               fold: CVFold, loss_fn) -> Dict[str, List[float]]:
    """
    Calculate the loss at each test step of a fold, keyed by the fold's forecast origin.
    
    WHY PER-ORIGIN TRACKING:
    - Statistical testing requires matched predictions across models
    - Temporal analysis needs origin-specific performance
    - Diebold-Mariano test specifically needs per-origin loss differences
    """
    
    per_origin_losses: Dict[str, List[float]] = {}
    
    # Every prediction in this fold shares the fold's forecast origin,
    # so the key is computed once rather than on every iteration.
    origin_key = fold.origin_date.strftime('%Y-%m-%d')
    # WHY: String keys for JSON serialization and human readability
    # CONNECTION: Uses origin_date from our CVFold structure
    
    # Calculate one loss per test observation and store it under the origin
    for i, actual in enumerate(y_true.values):
        # Calculate loss for this prediction
        pred_tensor = torch.tensor([y_pred[i]], dtype=torch.float32)
        actual_tensor = torch.tensor([actual], dtype=torch.float32)
        loss_value = loss_fn(pred_tensor, actual_tensor).item()
        # WHY: Tensor conversion enables custom PyTorch loss functions
        # CONSISTENCY: .item() extracts scalar value for storage
        
        per_origin_losses.setdefault(origin_key, []).append(loss_value)
    
    return per_origin_losses

# Statistical Testing Preparation
def extract_cv_results_for_statistical_testing(cv_results_dict: Dict[str, CVResults]) -> pd.DataFrame:
    """
    Extract and align CV results for statistical testing (e.g., Diebold-Mariano test).
    
    DIEBOLD-MARIANO TEST REQUIREMENTS:
    - Matched predictions: Same forecast origins across models
    - Loss differences: Can calculate model_A_loss - model_B_loss
    - Temporal alignment: Origins correspond to same time points
    
    WHY THIS FUNCTION EXISTS:
    - Statistical tests need specific data format
    - Cross-validation results need alignment across models
    - Automation reduces manual error in test setup
    """
    
    print("Statistical Testing Preparation:")
    print("===============================")
    
    # STEP 1: Collect all unique origins across all models
    all_origins = set()
    for model_name, results in cv_results_dict.items():
        all_origins.update(results.per_origin_losses.keys())
    # WHY: Start from the union of origins; STEP 2 keeps only those present in every model
    # CONNECTION: Uses our per-origin tracking structure
    
    print(f"✓ Found {len(all_origins)} unique forecast origins")
    
    # STEP 2: Build aligned DataFrame
    aligned_data = []
    for origin in sorted(all_origins):  # Sort for consistent ordering
        row = {'origin_date': origin}
        
        # Check if this origin exists in all models
        origin_in_all_models = True
        for model_name, results in cv_results_dict.items():
            if origin not in results.per_origin_losses:
                origin_in_all_models = False
                break
        
        if origin_in_all_models:
            # Add loss data for each model
            for model_name, results in cv_results_dict.items():
                losses = results.per_origin_losses[origin]
                row[f'{model_name}_loss'] = np.mean(losses)  # Average if multiple predictions per origin
                row[f'{model_name}_loss_count'] = len(losses)
            aligned_data.append(row)
    
    print(f"✓ Aligned {len(aligned_data)} common origins across all models")
    print("✓ Ready for Diebold-Mariano test")
    print("✓ Each row represents matched predictions from same forecast origin")
    
    df = pd.DataFrame(aligned_data)
    if not df.empty:
        # GUARD: With no common origins there is no 'origin_date' column to convert
        df['origin_date'] = pd.to_datetime(df['origin_date'])
    # WHY: Convert back to datetime for temporal analysis
    # CONNECTION: Enables time-based filtering and analysis
    
    return df

print("Loss Calculation and Statistical Testing Rationale:")
print("=================================================")
print("✓ Per-origin tracking enables statistical testing")
print("✓ String keys provide human-readable origin identification")
print("✓ Tensor operations support custom PyTorch loss functions")
print("✓ Alignment function automates test preparation")
print("✓ DataFrame output integrates with statistical testing libraries")
print("✓ Temporal information preserved for advanced analysis")
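
The union-then-filter logic in `extract_cv_results_for_statistical_testing` amounts to a set intersection over each model's origin keys. The sketch below shows that view using only the standard library. `common_origins` and `align_mean_losses` are hypothetical names, and the input is assumed to be a plain mapping from model name to that model's `per_origin_losses` dictionary.

```python
from statistics import fmean
from typing import Dict, List

PerOriginLosses = Dict[str, List[float]]


def common_origins(per_model_losses: Dict[str, PerOriginLosses]) -> List[str]:
    """Return the origins present in every model, sorted by key."""
    key_sets = [set(losses) for losses in per_model_losses.values()]
    if not key_sets:
        return []
    return sorted(set.intersection(*key_sets))


def align_mean_losses(per_model_losses: Dict[str, PerOriginLosses]) -> List[dict]:
    """Build one row per common origin with each model's mean loss."""
    rows = []
    for origin in common_origins(per_model_losses):
        row = {"origin_date": origin}
        for model_name, losses in per_model_losses.items():
            row[f"{model_name}_loss"] = fmean(losses[origin])
        rows.append(row)
    return rows


per_model = {
    "model_a": {"2023-01-01": [1.0, 2.0], "2023-02-01": [3.0]},
    "model_b": {"2023-02-01": [5.0], "2023-03-01": [1.0]},
}
rows = align_mean_losses(per_model)
print(rows)
```

Only `2023-02-01` appears for both models, so it is the single row that survives alignment.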

## 10. Practical Demonstration: Seeing It All Work Together

Let's create a synthetic example to demonstrate how all these components work together in practice:

In [None]:
# Create synthetic time series data for demonstration
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from sklearn.linear_model import LinearRegression

# Generate synthetic time series
np.random.seed(42)  # For reproducible results
dates = pd.date_range(start='2020-01-01', end='2023-12-31', freq='D')
n_points = len(dates)

# Create realistic time series patterns
trend = np.linspace(100, 150, n_points)  # Long-term trend
seasonal = 10 * np.sin(2 * np.pi * np.arange(n_points) / 365.25)  # Annual seasonality
noise = np.random.normal(0, 5, n_points)  # Random noise
target = trend + seasonal + noise

# Create some correlated features
feature1 = target + np.random.normal(0, 2, n_points)  # Highly correlated
feature2 = 0.5 * target + np.random.normal(0, 10, n_points)  # Moderately correlated
feature3 = np.random.normal(50, 15, n_points)  # Uncorrelated

# Build DataFrame
demo_data = pd.DataFrame({
    'date': dates,
    'target': target,
    'feature1': feature1,
    'feature2': feature2,
    'feature3': feature3
}).set_index('date')

print("Synthetic Data Overview:")
print("=======================")
print(f"Date range: {demo_data.index.min()} to {demo_data.index.max()}")
print(f"Data points: {len(demo_data)}")
print(f"Features: {demo_data.columns.tolist()}")
print("\nFirst 5 rows:")
print(demo_data.head())
print("\nLast 5 rows:")
print(demo_data.tail())

# Demonstrate our cross-validator
print("\n" + "="*50)
print("CROSS-VALIDATION DEMONSTRATION")
print("="*50)

# Initialize our improved cross-validator
cv = ImprovedTimeSeriesCrossValidator(
    n_splits=5,
    train_size=None,  # Expanding window
    test_size=30,     # 30-day forecast horizon
    gap=1,            # 1-day gap for realistic scenarios
    expanding_window=True
)

print("Cross-Validator Configuration:")
print(f"✓ Number of splits: {cv.n_splits}")
print(f"✓ Training window: {'Expanding' if cv.train_size is None else f'Fixed ({cv.train_size})'}")
print(f"✓ Test window size: {cv.test_size} days")
print(f"✓ Gap between train/test: {cv.gap} days")

# Generate splits
folds = cv.get_rolling_origin_aligned_splits(demo_data)

print(f"\nGenerated {len(folds)} folds:")
print("-" * 40)
for fold in folds:
    print(f"Fold {fold.fold_number}:")
    print(f"  Origin Date: {fold.origin_date.strftime('%Y-%m-%d')}")
    print(f"  Test Period: {fold.test_start_date.strftime('%Y-%m-%d')} to {fold.test_end_date.strftime('%Y-%m-%d')}")
    print(f"  Training Size: {len(fold.train_indices)} days")
    print(f"  Test Size: {len(fold.test_indices)} days")
    print()

# Demonstrate the complete cross-validation process
print("Running Complete Cross-Validation:")
print("=" * 40)

# Use LinearRegression as our model (sklearn-compatible)
results = cv.cross_validate_model(
    model_class=LinearRegression,
    data=demo_data,
    target_col='target',
    feature_cols=['feature1', 'feature2', 'feature3'],
    model_params={},  # Use defaults for LinearRegression
    loss_fn=None  # Use default MSE loss
)

print("\nResults Summary:")
print("=" * 20)
print("Overall Metrics:")
for metric, value in results.overall_metrics.items():
    if isinstance(value, float):
        print(f"  {metric}: {value:.4f}")
    else:
        print(f"  {metric}: {value}")

print("\nPer-Fold Performance:")
for fold_name, loss in results.per_fold_losses.items():
    print(f"  {fold_name}: {loss:.4f}")

print("\nPer-Origin Tracking:")
print(f"  Tracked origins: {len(results.per_origin_losses)}")
print(f"  First 3 origins: {list(results.per_origin_losses.keys())[:3]}")

# Demonstrate statistical testing preparation
print("\n" + "="*50)
print("STATISTICAL TESTING PREPARATION")
print("="*50)

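# Simulate having results from multiple models
# NOTE: Both entries reuse the same results, so every loss difference is zero.
# This demonstrates the alignment format only, not a meaningful statistical test.
mock_results = {
    'LinearRegression': results,
    'MockModel2': results  # In practice, this would be a different model's results
}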

# Extract for statistical testing
statistical_df = extract_cv_results_for_statistical_testing(mock_results)

print("\nStatistical Testing DataFrame:")
print(statistical_df.head())
print(f"\nShape: {statistical_df.shape}")
print(f"Columns: {statistical_df.columns.tolist()}")

print("\n" + "="*50)
print("KEY INSIGHTS FROM DEMONSTRATION")
print("="*50)
print("✓ Complete temporal alignment maintained")
print("✓ No data leakage - training always precedes testing")
print("✓ Per-origin losses enable statistical testing")
print("✓ Fold-level tracking enables performance analysis")
print("✓ Overall metrics provide model comparison summary")
print("✓ Statistical testing preparation automates complex alignment")
print("✓ All design choices contribute to rigorous time series evaluation")
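
The "training always precedes testing" property can be checked mechanically for each fold instead of being asserted in a print statement. The following is a minimal sketch under stated assumptions: `has_no_leakage` is a hypothetical helper, the indices are integer row positions, and `gap` is taken to mean the number of rows skipped between the last training row and the first test row. The validator's own definition of `gap` may differ.

```python
from typing import Sequence


def has_no_leakage(
    train_indices: Sequence[int], test_indices: Sequence[int], gap: int = 0
) -> bool:
    """Return True if all training rows end at least `gap` rows before the test rows."""
    if len(train_indices) == 0 or len(test_indices) == 0:
        raise ValueError("both index sets must be non-empty")
    last_train = max(train_indices)
    first_test = min(test_indices)
    # Adjacent windows differ by exactly 1; each skipped row adds 1 more.
    return first_test - last_train > gap


# Train on rows 0..9, skip row 10, test on rows 11..40: valid for gap=1.
ok = has_no_leakage(range(0, 10), range(11, 41), gap=1)
# Train on rows 0..10, test starts at row 11: no row is skipped, so gap=1 is violated.
violated = has_no_leakage(range(0, 11), range(11, 41), gap=1)
print(ok, violated)
```

Applied to the demonstration, the check would be run once per fold, for example over `fold.train_indices` and `fold.test_indices` with `cv.gap`.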

## Summary: The Interconnected Web of Design Choices

### How Everything Connects

Our improved time series cross-validation implementation is a carefully crafted system where each decision builds on previous ones:

#### 1. **Theoretical Foundation → Data Structures**
- Time series temporal dependencies → CVFold with origin_date tracking
- Need for statistical testing → CVResults with per-origin loss storage
- Comprehensive analysis needs → Three-tier result hierarchy

#### 2. **Data Structures → Class Architecture**
- Type safety requirements → Comprehensive type hints and validation
- Temporal context needs → Explicit date handling in initialization
- Flexibility demands → Optional parameters with sensible defaults
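
The link between optional parameters and validation can be illustrated with a small check that fails early on impossible configurations. This is a sketch, not the class's real constructor logic: `validate_cv_params` is a hypothetical function, the parameter names mirror those passed to `ImprovedTimeSeriesCrossValidator` in the demonstration, and the specific rules (for example, that a fixed window requires `train_size`) are assumptions.

```python
from typing import Optional


def validate_cv_params(
    n_splits: int,
    test_size: int,
    gap: int = 0,
    train_size: Optional[int] = None,
    expanding_window: bool = True,
) -> None:
    """Raise ValueError for configurations that cannot produce valid folds."""
    if n_splits < 1:
        raise ValueError(f"n_splits must be >= 1, got {n_splits}")
    if test_size < 1:
        raise ValueError(f"test_size must be >= 1, got {test_size}")
    if gap < 0:
        raise ValueError(f"gap must be >= 0, got {gap}")
    if train_size is not None and train_size < 1:
        raise ValueError(f"train_size must be >= 1 when given, got {train_size}")
    # Assumed rule: a fixed (non-expanding) window needs an explicit length.
    if not expanding_window and train_size is None:
        raise ValueError("a fixed window requires an explicit train_size")


# The configuration used in the demonstration passes silently.
validate_cv_params(n_splits=5, test_size=30, gap=1, train_size=None)

try:
    validate_cv_params(n_splits=5, test_size=30, expanding_window=False)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Failing at construction time means a bad configuration surfaces before any model is trained, rather than as an empty or misaligned fold later on.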

#### 3. **Class Architecture → Split Generation**
- Parameter validation → Boundary calculation safety checks
- Temporal order requirements → Mathematical boundary logic
- Index efficiency needs → Separate boundary/index generation
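
Separating boundary calculation from index generation can be sketched with two small functions: one that does pure integer arithmetic and one that expands the result into row positions. The names `FoldBoundaries`, `compute_boundaries`, and `boundaries_to_indices` are hypothetical, and the layout is an assumption: test windows are packed back to back against the end of the data, and `gap` rows are skipped before each test window. The notebook's own boundary logic may place folds differently.

```python
from typing import List, NamedTuple, Optional, Tuple


class FoldBoundaries(NamedTuple):
    train_start: int  # inclusive
    train_end: int    # exclusive
    test_start: int   # inclusive
    test_end: int     # exclusive


def compute_boundaries(
    n_samples: int,
    n_splits: int,
    test_size: int,
    gap: int = 0,
    train_size: Optional[int] = None,
) -> List[FoldBoundaries]:
    """Pure arithmetic: no data access, so it can be unit tested in isolation."""
    boundaries = []
    for k in range(n_splits):
        test_end = n_samples - (n_splits - 1 - k) * test_size
        test_start = test_end - test_size
        train_end = test_start - gap
        # Expanding window starts at row 0; fixed window looks back train_size rows.
        train_start = 0 if train_size is None else train_end - train_size
        if train_start < 0 or train_end <= train_start:
            raise ValueError(f"fold {k}: not enough data before the test window")
        boundaries.append(FoldBoundaries(train_start, train_end, test_start, test_end))
    return boundaries


def boundaries_to_indices(b: FoldBoundaries) -> Tuple[List[int], List[int]]:
    """Expand boundaries into explicit row positions only when they are needed."""
    return list(range(b.train_start, b.train_end)), list(range(b.test_start, b.test_end))


bounds = compute_boundaries(n_samples=100, n_splits=3, test_size=10, gap=1)
train_idx, test_idx = boundaries_to_indices(bounds[0])
print(bounds[0], train_idx[-1], test_idx[0])
```

In the first fold, training ends at row 68, row 69 is the gap, and testing starts at row 70.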

#### 4. **Split Generation → Loss Tracking**
- Per-origin alignment → Origin-based loss dictionary keys
- Statistical testing needs → Matched prediction tracking
- Temporal analysis → Date-string keys for human readability
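
One practical consequence of the `'%Y-%m-%d'` key format is that plain string sorting is also chronological sorting, which is what `sorted(all_origins)` relies on. The short illustration below uses made-up dates and contrasts the ISO format with a month-first format that does not have this property.

```python
from datetime import date

# Illustrative origins, deliberately out of order.
origins = [date(2023, 10, 1), date(2023, 2, 1), date(2022, 12, 1)]

iso_keys = [d.strftime("%Y-%m-%d") for d in origins]
us_keys = [d.strftime("%m/%d/%Y") for d in origins]

# ISO keys: lexicographic order matches chronological order.
iso_sorted = sorted(iso_keys)
# Month-first keys: the 2022 date sorts last because "12" > "10" > "02".
us_sorted = sorted(us_keys)

# ISO keys also round-trip back to date objects without a format string.
round_trip = [date.fromisoformat(key) for key in iso_sorted]

print(iso_sorted)
print(us_sorted)
```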

#### 5. **Loss Tracking → Statistical Testing**
- Diebold-Mariano requirements → Per-origin loss differences
- Model comparison needs → Aligned DataFrame output
- Automation goals → Extract function for complex alignment
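
To make the Diebold-Mariano connection concrete, here is a minimal sketch of the test statistic computed from two matched loss series. It is not part of the notebook's implementation: `diebold_mariano` is a hypothetical function, it uses the basic form of the test (rectangular kernel with `horizon - 1` autocovariance lags, normal approximation for the p-value), and it relies only on the standard library. Treating each row of the aligned DataFrame as one observation with `horizon=1` is an assumption.

```python
import math
from typing import Sequence, Tuple


def diebold_mariano(
    losses_a: Sequence[float], losses_b: Sequence[float], horizon: int = 1
) -> Tuple[float, float]:
    """Return (DM statistic, two-sided p-value) for matched per-origin losses."""
    if len(losses_a) != len(losses_b):
        raise ValueError("loss series must be matched origin by origin")
    n = len(losses_a)
    if n < 2 or not 1 <= horizon < n:
        raise ValueError("need at least 2 origins and 1 <= horizon < n")

    # Loss differential d_t = loss_A(t) - loss_B(t) and its mean.
    d = [a - b for a, b in zip(losses_a, losses_b)]
    d_bar = sum(d) / n

    def autocov(lag: int) -> float:
        return sum((d[t] - d_bar) * (d[t - lag] - d_bar) for t in range(lag, n)) / n

    # Long-run variance: gamma_0 plus twice the first (horizon - 1) autocovariances.
    long_run_var = autocov(0) + 2.0 * sum(autocov(lag) for lag in range(1, horizon))
    if long_run_var <= 0:
        raise ValueError("non-positive variance estimate (e.g. identical loss series)")

    dm_stat = d_bar / math.sqrt(long_run_var / n)
    # Two-sided p-value under N(0, 1): 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(dm_stat) / math.sqrt(2.0))
    return dm_stat, p_value


losses_a = [1.0, 2.0, 3.0, 4.0]
losses_b = [0.5, 1.0, 2.5, 2.0]
dm_stat, p_value = diebold_mariano(losses_a, losses_b)
print(dm_stat, p_value)
```

A positive statistic means model A has the higher average loss. With only a handful of origins the normal approximation is rough, so small-sample corrections such as Harvey-Leybourne-Newbold are worth considering. Two identical loss series, as in a mock comparison of a model with itself, trigger the variance error because every difference is zero.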

### Key Insights

1. **A single field can serve several purposes**: The `origin_date` calculation aligns CV splits with the evaluation framework, enables statistical testing, and provides temporal context for analysis.

2. **Separation of concerns enables testing**: By separating boundary calculation from index generation, we can test mathematical logic independently of data handling.

3. **Type safety prevents silent failures**: Comprehensive type hints and validation catch errors early rather than producing misleading results.

4. **Rich metadata enables deep analysis**: Storing complete temporal context in CVFold enables debugging, visualization, and advanced analysis patterns.

5. **Structured results support multiple use cases**: The three-tier CVResults structure serves immediate model comparison, detailed fold analysis, and rigorous statistical testing.

### Why This Matters

Traditional cross-validation approaches for time series often:
- Lose temporal context
- Enable data leakage
- Provide insufficient information for statistical testing
- Lack rigorous validation

Our implementation addresses each of these issues through:
- **Explicit temporal tracking** in every data structure
- **Mathematical rigor** in boundary calculations
- **Comprehensive loss tracking** for statistical testing
- **Extensive validation** at every step

This creates a foundation for **rigorous, reproducible, and statistically sound** time series model evaluation.

### Next Steps

With this implementation, you can:
1. **Confidently compare models** using Diebold-Mariano tests
2. **Analyze temporal performance patterns** using per-origin data
3. **Debug cross-validation issues** using fold-level tracking
4. **Extend the framework** for custom loss functions and model types
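
As one example of analyzing temporal performance patterns, per-origin losses can be reduced to a chronological series of mean losses and then smoothed. This is a sketch with made-up numbers: `origin_mean_losses` and `trailing_mean` are hypothetical helpers, and the input is assumed to have the same shape as `results.per_origin_losses` (date-string keys mapping to lists of losses).

```python
from statistics import fmean
from typing import Dict, List, Sequence, Tuple


def origin_mean_losses(per_origin_losses: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Mean loss per origin, ordered chronologically via the ISO date keys."""
    return [(origin, fmean(per_origin_losses[origin])) for origin in sorted(per_origin_losses)]


def trailing_mean(values: Sequence[float], window: int) -> List[float]:
    """Mean over the current and up to (window - 1) preceding values."""
    if window < 1:
        raise ValueError("window must be >= 1")
    return [fmean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


losses = {
    "2023-03-01": [4.0, 6.0],
    "2023-01-01": [1.0, 3.0],
    "2023-02-01": [3.0],
}
series = origin_mean_losses(losses)
smoothed = trailing_mean([mean for _, mean in series], window=2)
print(series)
print(smoothed)
```

A rising smoothed series would suggest the model degrades on later origins, which is the kind of pattern a single overall mean hides.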

The deep interconnections between components mean that modifications should be made carefully, understanding how each change propagates through the system.