## Project Framework

### 1. Target Definition
- News signals are generally short-lived; therefore, the focus is on **short-horizon SPY returns**.
- The prediction horizon is **determined during the feature analysis stage**, prior to model training and portfolio backtesting.

### 2. Data Preparation & Alignment
- Ingest the news dataset and aggregate headlines to a **daily frequency**.
- Align news features with SPY returns using a **one-day lag** to avoid look-ahead bias, given the absence of intraday timestamps and the possibility that some headlines are released after market close.
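
The lag step can be sketched as follows. This is a minimal illustration, not the notebook's `align_features_with_spy` implementation (whose internals are not shown here): the toy frames, the column names `headline` and `headline_count`, and the choice to lag by one trading-day row are all assumptions.

```python
import pandas as pd

# Hypothetical toy inputs; all column names here are illustrative assumptions.
news = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-01-03", "2017-01-03", "2017-01-04",
         "2017-01-05", "2017-01-05", "2017-01-05"]
    ),
    "headline": ["a", "b", "c", "d", "e", "f"],
})
spy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-03", "2017-01-04", "2017-01-05"]),
    "spy_return": [0.010, -0.005, 0.002],
})

# 1. Aggregate headlines to daily frequency.
daily = news.groupby("date").size().rename("headline_count").reset_index()

# 2. Join on trading days, then shift the feature down by one row so the
#    return on day t only sees news from the previous trading day.
merged = spy.merge(daily, on="date", how="left").sort_values("date")
merged["headline_count_lag"] = merged["headline_count"].shift(1)

print(merged)
```

The first row has no lagged value and would normally be dropped before modeling.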

### 3. Feature Construction & Economic Intuition
- **News volume:** Daily headline counts as a proxy for information flow and market activity.
- **Sentiment:**
  - Counts of positive, negative, and neutral headlines
  - Average sentiment score
  - Constructed at both the **aggregate level** and the **category level** (category counts and category-level sentiment)
- **Complexity:** Measures of headline complexity, where unusually complex or long headlines may signal more impactful information.
- All features are **normalized using rolling time-series z-scores** with a 90-day window and **winsorized at the 1st and 99th percentiles**.
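
A rough sketch of the normalization described above, assuming the z-score is computed first and winsorization second. The helper names `rolling_zscore` and `winsorize` are hypothetical, and the quantiles here are taken over the whole series for brevity; a production version would likely estimate them on the training window only to avoid look-ahead.

```python
import numpy as np
import pandas as pd

def rolling_zscore(series, window=90, min_periods=30):
    """Z-score each value against the trailing window that ends on that day."""
    mean = series.rolling(window, min_periods=min_periods).mean()
    std = series.rolling(window, min_periods=min_periods).std()
    # A zero standard deviation would divide by zero, so treat it as missing.
    return (series - mean) / std.replace(0.0, np.nan)

def winsorize(series, lower=0.01, upper=0.99):
    """Clip values to the given quantiles (assumption: full-sample quantiles)."""
    lo, hi = series.quantile(lower), series.quantile(upper)
    return series.clip(lower=lo, upper=hi)

# Synthetic stand-in for one daily feature.
rng = np.random.default_rng(0)
raw_feature = pd.Series(rng.normal(size=300))
normalized = winsorize(rolling_zscore(raw_feature))
print(normalized.describe())
```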

### 4. Modeling Choices
- **Sentiment:** Use `twitter-roberta-base-sentiment`, a well-validated model optimized for short, headline-style text, as a strong and reproducible baseline.
- **Complexity:** Use readability-based proxies (e.g., Gunning Fog, Dale–Chall, Flesch) and finance-oriented complexity lexicons where applicable.

### 5. Feature-Level Analysis
- Evaluate the standalone behavior of individual features using **Sharpe ratios** and **correlation-based metrics**.

### 6. Modeling Process
- Apply **time-series-based cross-validation** throughout.
- Evaluate a focused set of models, including **XGBoost**, **Random Forest**, and **LightGBM**.
- Explore **model ensembling** to combine complementary signals.
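
To illustrate what time-series cross-validation means here, the sketch below builds expanding-window folds in which training data always precedes validation data. It is written in the spirit of scikit-learn's `TimeSeriesSplit`; the function name `expanding_window_splits` is hypothetical, and the splitter actually used inside `NewsSentimentModeler` is not shown in this notebook.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs; training rows always come first in time."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last fold absorbs any remainder rows.
        test_end = fold * (k + 1) if k < n_splits else n_samples
        yield np.arange(0, train_end), np.arange(train_end, test_end)

splits = list(expanding_window_splits(n_samples=120, n_splits=5))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}, validate {test_idx[0]}..{test_idx[-1]}")
```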

### 7. Prediction Assessment
- Analyze and visualize predictive performance across models using **R²** and **RMSE**.
- Evaluate **risk-neutralized prediction performance** by regressing out common equity risk factors using data from the **Kenneth French Data Library**.
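
For reference, the prediction metrics can be computed as in the sketch below. The function name `prediction_metrics` is hypothetical, and directional accuracy (reported in the results later in this notebook) is assumed here to be the share of days on which the predicted and realized signs agree.

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    """R², RMSE, and sign-agreement rate for a vector of return forecasts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        # R² is negative whenever the forecast is worse than the sample mean.
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean((y_true - y_pred) ** 2))),
        "directional_accuracy": float(np.mean(np.sign(y_true) == np.sign(y_pred))),
    }

realized = np.array([0.01, -0.02, 0.03, -0.01])
mean_forecast = np.full_like(realized, realized.mean())
print(prediction_metrics(realized, mean_forecast))
```

Forecasting the sample mean gives an R² of zero by construction, which is why the near-zero and negative values reported later should be read against that baseline.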

### 8. Portfolio Construction & Backtesting
- Construct **long–short portfolios** based on individual model predictions as well as the ensemble signal.
- Analyze backtest performance (e.g., Sharpe ratio, drawdown, turnover).
- Assess **factor-neutral portfolio performance** after neutralizing common risk factors using data from the Kenneth French Data Library.

In [None]:
# Install required packages
# Run this cell first if you encounter ModuleNotFoundError
# Option 1: Install from requirements.txt (recommended)
# !pip install -r requirements.txt

# Option 2: Install packages individually (including safetensors to avoid torch version issues)
# Uncomment the line below to install all packages:
# !pip install pandas "numpy>=1.23.0,<2.0.0" scikit-learn transformers torch safetensors textstat matplotlib seaborn yfinance xgboost lightgbm

# Option 3: Install via terminal/command line:
# pip install -r requirements.txt

# Quick install for missing packages:
# !pip install yfinance        # For downloading SPY data
# !pip install xgboost         # For XGBoost model
# !pip install lightgbm        # For LightGBM model

# IMPORTANT NOTES:
# 1. NumPy Compatibility: PyTorch 2.2.2 requires NumPy < 2.0.0. If you encounter NumPy 2.x compatibility errors,
#    downgrade NumPy: !pip install "numpy<2.0.0"
# 2. If you encounter a ValueError about torch.load requiring torch>=2.6,
#    install safetensors: !pip install safetensors
#    The notebook will automatically use safetensors if available to bypass the torch version requirement.
# 3. PyArrow is required for sentiment caching (parquet file support):
#    If using conda: !conda install -c conda-forge pyarrow
#    If using pip: !pip install pyarrow
#    Or install from requirements.txt which includes pyarrow
# 4. yfinance is required for downloading SPY data:
#    !pip install yfinance
# 5. xgboost and lightgbm are required for model training:
#    !pip install xgboost lightgbm


In [None]:
# News Sentiment Analysis for SPY Returns Prediction
# This notebook implements a comprehensive pipeline for predicting next-day SPY returns
# using news sentiment, volume, complexity, and uncertainty features

# Standard library imports
import json
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Date/time utilities
from datetime import datetime, timedelta

# Set style for plots
try:
    plt.style.use('seaborn-v0_8-darkgrid')
except OSError:
    # Older matplotlib releases ship this style under its previous name
    try:
        plt.style.use('seaborn-darkgrid')
    except OSError:
        plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")


## 0. Function Definitions

All analysis classes are imported below from local modules for modularity and reusability.


In [None]:
# Data Loading and Preparation
from data_loader import DataLoader
print("DataLoader class imported!")

# DATA EXPLORATION AND STATISTICS FUNCTIONS
from data_explorer import DataExplorer
print("DataExplorer class imported!")

# FEATURE ENGINEERING FUNCTIONS (on individual articles)
from feature_extractor import ArticleFeatureExtractor
print("Feature extraction class imported!")

# FEATURE AGGREGATION, NORMALIZATION, AND ANALYSIS
from feature_analyzer import FeatureAnalyzer
print("FeatureAnalyzer class imported!")

# MODELING FUNCTIONS
from model import NewsSentimentModeler
print("NewsSentimentModeler class imported!")

# BACKTESTING FUNCTIONS
from strategy_backtest import StrategyBacktester
print("StrategyBacktester class imported!")

## 1. Data Loading and Preparation

### 1.1 Load News Dataset


In [None]:
# Load news dataset using DataLoader class
data_loader = DataLoader()
df_news = data_loader.load_news_dataset()
df_news.head()


In [None]:
# Compute and display statistics
data_explorer = DataExplorer()
df_news, daily_counts, monthly_counts, category_counts = data_explorer.compute_news_statistics(df_news)


In [None]:
# Create visualizations
data_explorer.plot_news_statistics(df_news, daily_counts, monthly_counts, category_counts)


## Observation Summary

### 1. News Volume Over Time
- From **2012 to mid-2018**, the dataset exhibits a **stable and high daily volume** of news articles, averaging roughly **80–100 articles per day**, with moderate short-term fluctuations.
- Around **mid-2018**, there is a **sharp structural break**, after which daily article counts drop dramatically to fewer than **10 articles per day**.
- This discontinuity likely reflects a **data collection or coverage change**, rather than a genuine collapse in news production.
- **Implication:** Post-2018 data may not be directly comparable to earlier periods, requiring regime-aware modeling or robustness checks.

---

### 2. Category Composition Over Time
- Prior to 2018, **Politics, Wellness, Entertainment, and Lifestyle-related categories** dominate overall article volume.
- The category mix is relatively stable over time, with cyclical variation within individual categories.
- After 2018, article counts across **all categories decline simultaneously**, reinforcing the interpretation of a **dataset-level structural break**.
- **Implication:** Category-level features are more informative pre-2018 and may suffer from sparsity afterward.

---

### 3. Category Distribution (Cross-Section)
- **Politics** is by far the most frequent category, followed by **Wellness** and **Entertainment**.
- Business-related articles represent a **smaller fraction** of total headlines.
- This skew suggests that predictive signals extracted from the dataset are likely driven by **broad risk sentiment and macro narratives**, rather than firm-level fundamentals.

---

### 4. Headline Length Distribution
- Headline length is tightly distributed, with:
  - **Mean ≈ 9.6 tokens**
  - **Median ≈ 10 tokens**
- The narrow distribution indicates headlines are **short and standardized**.
- Extreme outliers are rare, reducing noise from anomalously long text.

---

### 5. Headline Length Over Time (Complexity Proxy)
- Average headline length increases gradually from **2012 to 2018**, rising from approximately **8.7 to 11 tokens**.
- After 2018, headline length stabilizes around **11 tokens**, with modest month-to-month variation.
- This suggests a **slow shift in editorial style** toward more descriptive or nuanced headlines.
- **Implication:** Raw complexity measures are non-stationary; relative or demeaned features are preferred.

---

### 6. Implications for Feature Engineering and Modeling
- The **2018 structural break** is the most significant characteristic of the dataset and should be explicitly addressed.
- Volume-based features require **normalization or regime-aware interpretation**.
- Sentiment, complexity, and uncertainty features are well-defined given the consistency in headline length.
- Relative (“surprise”) features are likely more informative than absolute levels due to long-term stylistic drift.

---

### Summary Takeaway
Overall, the dataset exhibits **stable structure and rich variation prior to 2018**, making it well-suited for extracting short-horizon news-based signals. In contrast, the **post-2018 period shows a clear structural shift**, which may partly reflect **data sourcing or coverage changes** rather than genuine regime effects. 

In [None]:
# Load SPY returns data using DataLoader class
df_spy = data_loader.load_spy_returns()
print(f"\nDataset shape: {df_spy.shape}")
print(f"Columns: {list(df_spy.columns)}")
df_spy.head()


### 1.2 Compute Article-Level Features


In [None]:
extractor = ArticleFeatureExtractor(data_loader=data_loader)

In [None]:
# Compute features on individual articles
# (the ArticleFeatureExtractor created above initializes the sentiment model automatically)
df_features = extractor.compute_all_features(df_news, reload_cache=True)
df_features.head()

## 2. Feature Aggregation, Normalization, and Analysis

This section aggregates article-level features to daily frequency, merges with category-specific features, applies normalization, aligns with SPY returns, and performs feature analysis.


In [None]:
# Aggregate and normalize features (daily + by category)
# This function:
# 1. Aggregates features to daily frequency (overall)
# 2. Aggregates features to daily frequency by category  
# 3. Merges both with category_feature_name columns
# 4. Applies rolling window z-score normalization
feature_analyzer = FeatureAnalyzer()
daily_features = feature_analyzer.aggregate_and_normalize_features(
    df_features,
    window_size=90,
    min_periods=30
)
print(f"\nFeatures shape: {daily_features.shape}")
print(f"Feature columns: {len([c for c in daily_features.columns if c != 'date'])}")
print(f"\nFirst few rows:")
daily_features.head()


In [None]:
# Align features with SPY returns using one-day lag
# Features from day t-1 are used to predict returns on day t
df_merged = feature_analyzer.align_features_with_spy(daily_features, df_spy)
print(f"\nMerged dataset shape: {df_merged.shape}")
print(f"Date range: {df_merged['date'].min()} to {df_merged['date'].max()}")



In [None]:
# Keep only *_normalized_lag features and remove the '_normalized_lag' suffix from column names
# This keeps only the normalized lagged features for modeling
normalized_lag_cols = [col for col in df_merged.columns if col.endswith('_normalized_lag')]

# Create a new dataframe with only normalized lagged features
df_features_clean = df_merged[['date', 'spy_return', 'spy_return_next'] + normalized_lag_cols].copy()

# Remove '_normalized_lag' suffix from feature column names
rename_dict = {col: col.replace('_normalized_lag', '') for col in normalized_lag_cols}
df_features_clean = df_features_clean.rename(columns=rename_dict)

print(f"Kept {len(normalized_lag_cols)} normalized lagged features")
print(f"Final dataset shape: {df_features_clean.shape}")

### Horizon Analysis

Evaluate multiple forward return horizons (e.g., 1D, 2D, 3D, 4D, 5D, 7D, 10D, 14D) to determine the most appropriate prediction target for news-derived signals.  
For each horizon, we assess signal alignment and tradeability **in the following priority order**:

1. **Sharpe (primary):** long–short strategy Sharpe constructed from each feature/signal group  
2. **Spearman (secondary):** rank-based association between features and forward returns  
3. **Pearson (tertiary):** linear correlation between features and forward returns  

Results are summarized by signal category (sentiment, complexity, token-length, volume) to capture differences in information decay across feature families. The selected horizon is fixed before model training and backtesting.
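
The horizon comparison can be sketched on synthetic data as below. This is an illustrative assumption-laden example, not the logic of `analyze_horizon_selection`: the helper names are hypothetical, the signal is a simple sign-following rule, and the Sharpe figure divides each h-day return by h and ignores the overlap between windows, so it is only indicative.

```python
import numpy as np
import pandas as pd

def forward_return(daily_returns, horizon):
    """Compounded return over days t+1..t+horizon, aligned to day t."""
    log_ret = np.log1p(daily_returns)
    return np.expm1(log_ret.rolling(horizon).sum().shift(-horizon))

def spearman(x, y):
    """Rank correlation on rows where both series are present."""
    pair = pd.concat([x, y], axis=1).dropna()
    return pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank())

# Synthetic data: the feature on day t moves the return on day t+1 only.
rng = np.random.default_rng(0)
n = 1000
feature = pd.Series(rng.normal(size=n))
noise = pd.Series(rng.normal(size=n))
returns = 0.005 * feature.shift(1).fillna(0.0) + 0.005 * noise

rows = []
for h in [1, 2, 5, 10]:
    fwd = forward_return(returns, h)
    pnl = (np.sign(feature) * fwd / h).dropna()  # naive sign-following P&L per day
    rows.append({
        "horizon": h,
        "spearman": spearman(feature, fwd),
        "sharpe": np.sqrt(252) * pnl.mean() / pnl.std(),
    })
table = pd.DataFrame(rows).set_index("horizon")
print(table)
```

Because the synthetic signal only lasts one day, both columns fade as the horizon grows, which is the decay pattern the analysis is designed to detect.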

In [None]:
feature_cols = [col for col in df_features_clean.columns 
                       if col not in ['date', 'spy_return', 'spy_return_next']]
feature_groups = feature_analyzer.generate_feature_groups(feature_cols)
horizons = [1, 2, 3, 4, 5, 7, 10, 14]
results = feature_analyzer.analyze_horizon_selection(df_features_clean, feature_groups, horizons)
feature_analyzer.plot_horizon_analysis(results)


### Horizon Selection Summary (Sharpe-First)


- The **1-day horizon** ranks highest overall, driven by the strongest median Sharpe and broad participation across features.
- **2–4 day horizons** form a secondary tier, showing moderate but decaying signal strength.
- Horizons **beyond 5 days** consistently underperform, with median Sharpe turning negative across most signal groups.

By signal type:
- **Complexity signals** exhibit strong short-horizon behavior (1–3 days) with rapid decay.
- **Sentiment signals** are more persistent, remaining positive across short to medium horizons.
- **Token-length and volume features** show weak performance across all horizons.

Based on these results, the **1-day forward return** is selected as the primary prediction target.
**Token-length and volume-based features are excluded from the modeling stage.**

In [None]:
# Remove all features that end with _token_length or _headline_count
drop_cols = [col for col in df_features_clean.columns
             if col.endswith(('_token_length', '_headline_count'))]
df_features_clean = df_features_clean.drop(columns=drop_cols)
feature_cols_clean = [c for c in df_features_clean.columns
                      if c not in ['date', 'spy_return', 'spy_return_next']]


### 2.1 Feature-Level Analysis

Using in-sample data from 2012 to 2019, evaluate the standalone behavior of individual features with correlation and information coefficients, complemented by simple long–short return analyses, to assess their directional consistency.


In [None]:
# Analyze all features and create summary table

# In-sample window: the 2012-2019 training period (test and out-of-sample years are held out)
df_features_is = df_features_clean[df_features_clean['date'] < '2020-01-01'].copy()
feature_analysis_results = feature_analyzer.analyze_all_features(df_features_is)


### 2.2 Feature Visualization

Visualize the top features ranked by different performance measures.


In [None]:
# Visualize top 15 features by Sharpe ratio
feature_analyzer.plot_top_features(feature_analysis_results, measure='sharpe_ratio', top_n=15)


In [None]:
# Visualize top 15 features by Spearman correlation
feature_analyzer.plot_top_features(feature_analysis_results, measure='spearman_r', top_n=15)


In [None]:
# Visualize top 15 features by Pearson correlation
feature_analyzer.plot_top_features(feature_analysis_results, measure='pearson_r', top_n=15)


### 2.3 Feature Pairwise Correlation Analysis

Analyze correlations between features to identify multicollinearity and redundant features.
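
As a rough illustration of the redundancy check, the sketch below lists feature pairs whose absolute Pearson correlation exceeds a cutoff. The function name `top_correlated_pairs`, the 0.8 threshold, and the synthetic data are assumptions for illustration; the notebook's own `analyze_pairwise_correlations` may work differently.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.8):
    """Return feature pairs with |Pearson r| >= threshold, strongest first."""
    corr = df.corr(method="pearson")
    # Keep the strict upper triangle so each pair is listed once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=np.abs, ascending=False)

# Synthetic stand-ins for three daily features.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "sentiment_score": base,
    "sentiment_ratio": base + 0.1 * rng.normal(size=500),  # near-duplicate
    "complexity": rng.normal(size=500),                     # independent
})
pairs = top_correlated_pairs(demo, threshold=0.8)
print(pairs)
```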


In [None]:
# Analyze pairwise correlations between features
corr_matrix, top_correlations = feature_analyzer.analyze_pairwise_correlations(
    df_features_is, 
    feature_cols=feature_cols_clean,
    method='pearson',
    figsize=(16, 14),
    top_pairs=20
)


In [None]:
# Analyze feature distributions
stats_df = feature_analyzer.analyze_feature_distributions(df_features_is, feature_cols_clean)

## Feature Analysis Summary (Exploratory Diagnostics)

This section summarizes what is learned from basic distribution and correlation diagnostics on the engineered daily news features (overall + per-category).

### Dataset coverage
- **~210 engineered features** are created over **~1,992 daily observations**.

### Scale and normalization
- Core aggregated features such as `sentiment_score`, `sentiment_ratio`, `complexity` appear **roughly standardized** (means near 0, standard deviations near 1).  

### Feature distributions: non-Gaussian + sparsity effects
- Many features are **approximately symmetric** (skewness near 0), but a meaningful subset exhibits **heavy tails / outliers** (high kurtosis).

### High correlation of some features
- Pairwise correlations are high for some feature pairs, such as sentiment ratio and sentiment score. This multicollinearity makes the feature set unsuitable for a standard OLS model.

### Feature-Level Performance Summary (Sharpe & Spearman R & Pearson R)

The strongest positive Pearson correlations with next-day S&P 500 returns are observed in sentiment-based features.

In contrast, the most negative Pearson correlations are associated with complexity-related features.

### Cross-Metric Observations

- Sentiment-based features (both aggregate and category-level) appear frequently among the top-ranked features in **both Pearson correlation and Sharpe ratio**.
- Complexity features exhibit mixed behavior: some perform strongly on a Sharpe basis despite weaker linear correlation, while others consistently rank near the bottom across both metrics.

## 3. Model Training and Evaluation

This section trains predictive models using Random Forest, XGBoost, and LightGBM with time series cross-validation. Models are evaluated on test (2020-2021) and out-of-sample (2022) sets. An ensemble model (weighted: 80% XGBoost, 10% LightGBM, 10% Random Forest) is also included.


In [None]:
# Run the complete modeling pipeline
# This will:
# 1. Split data into train (2012-2019), test (2020-2021), and out-of-sample (2022) sets
# 2. Check if hyperparameters file exists - if yes, load models; if no, train with TSCV
# 3. Evaluate all models (including ensemble) on test and out-of-sample sets
# 4. Return results summary

modeler = NewsSentimentModeler()
# Drop token-length and headline-count features (a no-op if they were already removed above)
drop_cols = [col for col in df_features_clean.columns
             if col.endswith(('_token_length', '_headline_count'))]
df_features_clean = df_features_clean.drop(columns=drop_cols)
results_summary = modeler.run_full_pipeline(
    df_features_clean,
    cv_folds=5,
    hyperparameters_file='model_hyperparameters.json', 
    overwrite=False
)

print("\n" + "="*60)
print("MODEL PERFORMANCE SUMMARY")
print("="*60)
print(results_summary.to_string(index=False))


In [None]:
# Visualize model performance across all models
# Shows R² and RMSE comparisons across CV, test, and out-of-sample sets
modeler.visualize_model_performance(figsize=(14, 8))


## Model Prediction Results Summary

We evaluate three individual models (Random Forest, XGBoost, LightGBM) and a weighted ensemble (80% XGBoost, 10% LightGBM, 10% Random Forest) using time-series cross-validation, a held-out test set (2020–2021), and a fully out-of-sample period (2022).

### Cross-Validation Performance
- All models exhibit **negative cross-validated R²**, indicating limited stable predictive power in rolling training windows.
- Tree-based models (Random Forest, XGBoost, LightGBM) show variability across folds.

### Test Set Performance (2020–2021)
- Test-set R² values are **near zero across all models**, with some models achieving marginally positive R².
- RMSE and MAE are very similar across models, suggesting comparable error magnitudes.
- Directional accuracy on the test set is consistently **above 56%** for all models, indicating some ability to capture short-term return direction despite weak linear fit.

### Out-of-Sample Performance (2022)
- All models show **negative out-of-sample R²**, reflecting deterioration in explanatory power when evaluated on unseen data.
- RMSE remains stable relative to the test period, indicating no significant increase in prediction error magnitude.
- Directional accuracy converges to **~43.7% across all models**, consistent with reduced signal strength in the most recent period.

### Ensemble Results
- The weighted ensemble (80% XGBoost, 10% LightGBM, 10% Random Forest) combines the individual model predictions but does not materially improve performance relative to the individual models.
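
The weighting itself is a fixed linear blend, which can be sketched as follows. The dictionary keys and the function name `weighted_ensemble` are hypothetical; only the 80/10/10 weights come from this notebook.

```python
import numpy as np

# Weights stated in this notebook; the key names are illustrative assumptions.
WEIGHTS = {"xgboost": 0.8, "lightgbm": 0.1, "random_forest": 0.1}

def weighted_ensemble(predictions, weights=WEIGHTS):
    """Blend per-model prediction arrays using fixed weights."""
    total = sum(weights.values())
    blended = sum(
        weights[name] * np.asarray(predictions[name], dtype=float)
        for name in weights
    )
    return blended / total  # normalizing guards against weights that do not sum to 1

demo_predictions = {
    "xgboost": [1.0, 0.0],
    "lightgbm": [0.0, 1.0],
    "random_forest": [0.0, 0.0],
}
blended = weighted_ensemble(demo_predictions)
print(blended)
```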

### Overall Takeaways
- Predictive performance is **weak in terms of R²**, which is typical for next-day index return forecasting.
- Directional accuracy above 50% in-sample does not consistently persist out-of-sample.
- The results motivate:
  - Careful signal normalization and portfolio construction,
  - Risk-controlled trading strategies,
  - And further exploration of non-linear interactions or regime-dependent behavior rather than reliance on raw point forecasts.

## 4. Strategy Construction and Backtesting

This section constructs long-short trading strategies based on model predictions, analyzes backtest performance, and assesses factor-neutral portfolio performance.

### Strategy Design

The trading strategy is designed as a **directional long-short portfolio** on SPY using model predictions. Since SPY is a single asset, "long-short" here means directional exposure that can flip sign (long, short, or flat), not cross-sectional ranking. The strategy construction follows these steps:

#### Step 1: Signal Normalization
Model predictions are normalized using a **rolling window z-score** to ensure stable position sizing over time:
- **Z-score normalization**: `z_t = (prediction_t - μ_t^(L)) / σ_t^(L)`
  - `μ_t^(L)`, `σ_t^(L)`: Rolling mean and standard deviation over a **60-day window**
  - Minimum periods required: 30 days
  - This removes trends and makes signals comparable across time periods
  - Early periods with insufficient history are filled with 0 (neutral position)

#### Step 2: Position Construction
Positions are computed from the normalized signal using **discrete exposure**:

**Discrete Position Rule**:
- `w_t = +1` (100% long) if `z_t > threshold`
- `w_t = -1` (100% short) if `z_t < -threshold`
- `w_t = 0` (flat/neutral) otherwise
- **Threshold used: 1.5** (z-score units)

This means positions are only taken when the normalized signal is strong enough (more than 1.5 standard deviations from the rolling mean). This conservative threshold reduces trading frequency and focuses on high-conviction signals.
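
Steps 1 and 2 can be sketched together as below. This is a minimal illustration under the parameters stated above (60-day window, 30 minimum periods, threshold 1.5); the function names are hypothetical and the code is not taken from `StrategyBacktester`.

```python
import numpy as np
import pandas as pd

def zscore_signal(predictions, window=60, min_periods=30):
    """Rolling z-score of model predictions; early or degenerate days map to 0."""
    mean = predictions.rolling(window, min_periods=min_periods).mean()
    std = predictions.rolling(window, min_periods=min_periods).std()
    z = (predictions - mean) / std.replace(0.0, np.nan)
    return z.fillna(0.0)  # 0 means a neutral (flat) position

def discrete_positions(z, threshold=1.5):
    """+1 above the threshold, -1 below the negative threshold, else 0."""
    raw = np.where(z > threshold, 1.0, np.where(z < -threshold, -1.0, 0.0))
    return pd.Series(raw, index=z.index)

# Synthetic predictions standing in for a model's output.
rng = np.random.default_rng(0)
demo_predictions = pd.Series(rng.normal(scale=0.001, size=250))
demo_z = zscore_signal(demo_predictions)
demo_positions = discrete_positions(demo_z)
print(demo_positions.value_counts())
```

With a 1.5 threshold, most days end up flat, which is the intended conservative behavior.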

#### Step 3: Portfolio Return Calculation
Daily portfolio returns are computed as:
- **Gross return**: `r_port_t+1 = w_t · r_SPY_t+1`
- **Net return**: the gross return less trading costs, which are charged only when the position changes (from flat to long/short or vice versa)
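
A minimal sketch of the return calculation follows. The cost rate `cost_per_unit=0.0005` is a placeholder assumption, not a value from this notebook, and the sketch charges costs in proportion to the size of the position change (so a flip from long to short costs twice a move from flat), which may differ from the backtester's actual cost rule.

```python
import numpy as np

def strategy_returns(positions, next_returns, cost_per_unit=0.0005):
    """Gross and net daily returns for positions w_t held over return r_{t+1}."""
    positions = np.asarray(positions, dtype=float)
    next_returns = np.asarray(next_returns, dtype=float)
    gross = positions * next_returns
    # Size of each day's position change, starting from a flat book.
    trades = np.abs(np.diff(positions, prepend=0.0))
    net = gross - cost_per_unit * trades
    return gross, net

demo_w = [0.0, 1.0, 1.0, -1.0]
demo_r = [0.01, 0.02, -0.01, 0.03]
gross, net = strategy_returns(demo_w, demo_r)
print(gross, net)
```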

#### Step 4: Performance Evaluation
The strategy is evaluated using comprehensive metrics:
- **Return metrics**: Annualized return, total return
- **Risk metrics**: Annualized volatility, max drawdown, Calmar ratio
- **Risk-adjusted**: Sharpe ratio
- **Trading metrics**: Turnover, hit rate, win rate, profit factor
- **Factor-neutral**: Alpha and risk-adjusted returns after neutralizing Fama-French factors (MKT-RF, SMB, HML)
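
Several of these metrics can be computed as in the sketch below. The definitions are common conventions chosen for illustration (252 trading days per year, hit rate measured on days with a non-zero position, turnover as mean absolute position change); they are assumptions and may not match the formulas inside `StrategyBacktester`.

```python
import numpy as np

def performance_metrics(returns, positions, periods_per_year=252):
    """Summary statistics for a daily strategy return series."""
    returns = np.asarray(returns, dtype=float)
    positions = np.asarray(positions, dtype=float)
    equity = np.cumprod(1.0 + returns)
    years = len(returns) / periods_per_year
    ann_return = equity[-1] ** (1.0 / years) - 1.0
    vol = returns.std(ddof=1)
    # Running peak includes the starting capital of 1.0.
    peak = np.maximum.accumulate(np.concatenate([[1.0], equity]))[1:]
    max_dd = float((equity / peak - 1.0).min())
    active = positions != 0
    return {
        "annualized_return": ann_return,
        "annualized_volatility": vol * np.sqrt(periods_per_year),
        "sharpe_ratio": np.sqrt(periods_per_year) * returns.mean() / vol if vol > 0 else np.nan,
        "max_drawdown": max_dd,
        "calmar_ratio": ann_return / abs(max_dd) if max_dd < 0 else np.nan,
        "turnover": float(np.abs(np.diff(positions, prepend=0.0)).mean()),
        "hit_rate": float((returns[active] > 0).mean()) if active.any() else np.nan,
    }

demo_metrics = performance_metrics([0.10, -0.50, 0.20], [1.0, 1.0, 0.0])
print(demo_metrics)
```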

#### Key Design Principles
1. **No look-ahead bias**: Only uses information available at time `t` to predict returns at `t+1`
2. **Stable position sizing**: Rolling normalization prevents signal drift
3. **Conservative threshold**: Using threshold=1.5 ensures positions are only taken on strong signals, reducing false positives
4. **Discrete positions**: Binary positions (+1, -1, 0) simplify execution and reduce sensitivity to signal magnitude
5. **Factor-neutral analysis**: Isolates alpha from common risk factors



In [None]:
# Prepare data for strategy backtesting
# Use the test period (2020-2021) for backtesting; 2022+ is evaluated separately in Section 5
df_backtest = df_features_clean[
    (df_features_clean['date'] >= '2020-01-01') & 
    (df_features_clean['date'] <= '2021-12-31')
].copy()

print(f"Backtest period: {df_backtest['date'].min()} to {df_backtest['date'].max()}")
print(f"Number of observations: {len(df_backtest):,}")

# Prepare feature matrix and get SPY returns
X_backtest, y_backtest, feature_cols_backtest = modeler.prepare_features_target(df_backtest)
spy_returns_backtest = df_backtest['spy_return_next'].values
dates_backtest = df_backtest['date'].values

print(f"\nFeature matrix shape: {X_backtest.shape}")
print(f"SPY returns shape: {spy_returns_backtest.shape}")


In [None]:
# Generate predictions from all models and ensemble
predictions = modeler.generate_predictions(X_backtest)


In [None]:
# Initialize backtester
backtester = StrategyBacktester()

# Backtest each strategy
print("\n" + "="*60)
print("BACKTESTING STRATEGIES")
print("="*60)

for model_name, pred in predictions.items():
    print(f"\nBacktesting {model_name.upper().replace('_', ' ')} strategy...")
    
    try:
        results = backtester.backtest_strategy(
            predictions=pred,
            spy_returns=spy_returns_backtest,
            dates=dates_backtest,
            normalization='zscore',
            window=60,
            k=0.5,
            w_max=1.0,
            position_type='discrete',
            threshold=1.5,
            vol_targeting=False,
            strategy_name=model_name
        )
        
        # Print key metrics
        metrics = results['metrics']
        print(f"  Sharpe Ratio: {metrics['sharpe_ratio']:.3f}")
        print(f"  Annualized Return: {metrics['annualized_return']:.2%}")
        print(f"  Max Drawdown: {metrics['max_drawdown']:.2%}")
        print(f"  Turnover: {metrics['turnover']:.4f}")
        
    except Exception as e:
        print(f"  Error backtesting {model_name}: {e}")

print("\n" + "="*60)
print("BACKTESTING COMPLETE")
print("="*60)


In [None]:
# Compare all strategies
print("\n" + "="*60)
print("STRATEGY PERFORMANCE COMPARISON")
print("="*60)

comparison_df = backtester.compare_strategies()
print(comparison_df.to_string(index=False))

# Visualize strategy comparison
backtester.plot_strategy_comparison(figsize=(16, 10))


In [None]:
# Plot detailed performance for each strategy
for strategy_name in backtester.strategies.keys():
    print(f"\n{'='*60}")
    print(f"Detailed Performance: {strategy_name.upper().replace('_', ' ')}")
    print(f"{'='*60}")
    backtester.plot_strategy_performance(strategy_name, figsize=(16, 10))


### 4.1 Factor-Neutral Analysis

Assess portfolio performance after neutralizing common risk factors using Fama-French factors from the Kenneth French Data Library.
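
Conceptually, factor neutralization is a time-series regression of strategy returns on factor returns, as sketched below with ordinary least squares. The function name `factor_neutralize` and the synthetic data are assumptions; the sketch also does not subtract the risk-free rate from strategy returns, which `run_factor_neutral_analysis` may or may not do.

```python
import numpy as np

def factor_neutralize(strategy_returns, factor_returns):
    """Regress strategy returns on factors; return alpha, betas, residual returns."""
    y = np.asarray(strategy_returns, dtype=float)
    factors = np.asarray(factor_returns, dtype=float)
    design = np.column_stack([np.ones(len(y)), factors])  # intercept + factors
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    alpha, betas = coef[0], coef[1:]
    # Factor-neutral returns: what is left after removing factor exposure.
    neutral = y - factors @ betas
    return alpha, betas, neutral

# Synthetic example with known exposures to three factors (MKT-RF, SMB, HML stand-ins).
rng = np.random.default_rng(0)
demo_factors = rng.normal(scale=0.01, size=(500, 3))
true_betas = np.array([0.5, -0.2, 0.1])
demo_strategy = 0.001 + demo_factors @ true_betas + rng.normal(scale=1e-4, size=500)

alpha, betas, neutral = factor_neutralize(demo_strategy, demo_factors)
print(f"daily alpha: {alpha:.5f}, betas: {np.round(betas, 3)}")
```

The mean of the factor-neutral series is close to the estimated alpha when the factor means are small, which is how the two reported quantities relate.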


In [None]:
# Run factor-neutral analysis for all strategies
# Note: To perform factor-neutral analysis, download Fama-French factors from:
# https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
# Save as CSV with columns: date, MKT-RF, SMB, HML, RF (or similar format)

french_factors_path = 'data/F-F_Research_Data_5_Factors_2x3_daily.csv'  # Update this path if you have French factors file
# Example: french_factors_path = 'data/french_factors.csv'

# Run complete factor-neutral analysis
factor_results = backtester.run_factor_neutral_analysis(
    french_factors_path=french_factors_path,
    factors=['MKT-RF', 'SMB', 'HML']
)


## 5. Pure Out-of-Sample Analysis (2022+)

This section evaluates the ensemble model and its backtested strategy on pure out-of-sample data (2022-01-01 onwards).


In [None]:
# Prepare pure out-of-sample data (2022-01-01 and onwards)
df_oos_pure = df_features_clean[
    df_features_clean['date'] >= '2022-01-01'
].copy()

print(f"Pure out-of-sample period: {df_oos_pure['date'].min()} to {df_oos_pure['date'].max()}")
print(f"Number of observations: {len(df_oos_pure):,}")

# Prepare feature matrix and get SPY returns
X_oos_pure, y_oos_pure, feature_cols_oos = modeler.prepare_features_target(df_oos_pure)
spy_returns_oos_pure = df_oos_pure['spy_return_next'].values
dates_oos_pure = df_oos_pure['date'].values

print(f"\nFeature matrix shape: {X_oos_pure.shape}")
print(f"SPY returns shape: {spy_returns_oos_pure.shape}")


In [None]:
# Generate predictions from ensemble model on pure out-of-sample data
print("\n" + "="*60)
print("GENERATING ENSEMBLE PREDICTIONS ON PURE OUT-OF-SAMPLE DATA")
print("="*60)

predictions_oos_pure = modeler.generate_predictions(X_oos_pure)

# Focus on ensemble predictions
if 'ensemble' in predictions_oos_pure:
    ensemble_predictions_oos = predictions_oos_pure['ensemble']
    print(f"\nEnsemble predictions shape: {ensemble_predictions_oos.shape}")
    print(f"Ensemble prediction stats:")
    print(f"  Mean: {np.mean(ensemble_predictions_oos):.6f}")
    print(f"  Std: {np.std(ensemble_predictions_oos):.6f}")
    print(f"  Min: {np.min(ensemble_predictions_oos):.6f}")
    print(f"  Max: {np.max(ensemble_predictions_oos):.6f}")


In [None]:
# Initialize backtester for out-of-sample analysis
backtester_oos = StrategyBacktester()

# Backtest ensemble strategy on pure out-of-sample data
print("\n" + "="*60)
print("BACKTESTING ENSEMBLE STRATEGY ON PURE OUT-OF-SAMPLE DATA")
print("="*60)

if 'ensemble' in predictions_oos_pure:
    ensemble_results_oos = backtester_oos.backtest_strategy(
        predictions=ensemble_predictions_oos,
        spy_returns=spy_returns_oos_pure,
        dates=dates_oos_pure,
        normalization='zscore',
        window=60,
        k=0.5,
        w_max=1.0,
        position_type='continuous',  # NOTE: differs from Section 4, which uses 'discrete' with threshold=1.5
        vol_targeting=False,
        strategy_name='ensemble_oos_pure'
    )
    
    # Print key metrics
    metrics_oos = ensemble_results_oos['metrics']
    print("\n" + "="*60)
    print("PURE OUT-OF-SAMPLE PERFORMANCE (Ensemble Strategy)")
    print("="*60)
    print(f"Annualized Return: {metrics_oos['annualized_return']:.2%}")
    print(f"Annualized Volatility: {metrics_oos['annualized_volatility']:.2%}")
    print(f"Sharpe Ratio: {metrics_oos['sharpe_ratio']:.3f}")
    print(f"Max Drawdown: {metrics_oos['max_drawdown']:.2%}")
    print(f"Calmar Ratio: {metrics_oos['calmar_ratio']:.3f}")
    print(f"Hit Rate: {metrics_oos['hit_rate']:.2%}")
    print(f"Turnover: {metrics_oos['turnover']:.4f}")
    print("="*60)


In [None]:
# Visualize ensemble strategy performance on pure out-of-sample data
if 'ensemble_oos_pure' in backtester_oos.strategies:
    print("\n" + "="*60)
    print("VISUALIZING ENSEMBLE STRATEGY PERFORMANCE")
    print("="*60)
    backtester_oos.plot_strategy_performance('ensemble_oos_pure', figsize=(16, 10))


In [None]:
# Factor-neutral analysis on pure out-of-sample data
french_factors_path_oos = 'data/F-F_Research_Data_5_Factors_2x3_daily.csv'

factor_results_oos = backtester_oos.run_factor_neutral_analysis(
    french_factors_path=french_factors_path_oos,
    factors=['MKT-RF', 'SMB', 'HML']
)


The pure out-of-sample period shows negative performance.



### Performance Summary and Conclusions

#### Model Performance
- **Cross-Validation**: Models show limited predictive power (negative R²) in rolling training windows, which is typical for next-day index return forecasting
- **Test Period (2020-2021)**: Models achieve near-zero R² but maintain directional accuracy above 56%, indicating some ability to capture short-term return direction
- **Out-of-Sample (2022+)**: Performance deteriorates with negative R² and directional accuracy converging to ~43.7%, consistent with reduced signal strength in recent periods
- **Ensemble Model**: The weighted ensemble (80% XGBoost, 10% LightGBM, 10% Random Forest) combines model predictions but does not materially improve performance relative to individual models

#### Strategy Performance
- **Backtest Results**: Long-short strategies constructed from model predictions show varying performance across models
  - Portfolio return: `r_port_t+1 = w_t · r_SPY_t+1`
- **Risk Metrics**: Strategies exhibit different risk-return profiles, with some showing better Sharpe ratios and lower drawdowns
- **Out-of-Sample**: Pure out-of-sample performance (2022+) shows negative returns, highlighting the challenge of maintaining predictive power in unseen market conditions

#### Factor-Neutral Analysis
- **Factor Exposures**: Strategies show minimal exposure to common risk factors (MKT-RF, SMB, HML)
- **Alpha**: Factor-neutral alpha is negative in the out-of-sample period, indicating that the losses are not explained by exposure to common risk factors
- **Risk-Adjusted Returns**: After neutralizing factors, strategies fail to generate positive risk-adjusted returns

#### Key Takeaways
1. **Predictive Power**: While models show weak R² performance, directional accuracy above 50% in-sample suggests some signal exists
2. **Out-of-Sample Deterioration**: Performance degradation in out-of-sample period highlights the importance of robust validation and the challenges of time-varying market regimes
3. **Transaction Costs**: Realistic cost assumptions reveal that high-turnover strategies may not be profitable after costs
4. **Factor Exposure**: Minimal factor exposure suggests strategies are not simply capturing common risk premia
5. **Future Directions**: 
   - Explore regime-dependent models that adapt to changing market conditions
   - Investigate alternative feature engineering approaches
   - Consider longer prediction horizons or different normalization schemes
   - Evaluate non-linear interactions and feature combinations

#### Limitations
- **Data Quality**: News data may have timing issues (some headlines published after market close)
- **Feature Engineering**: Current features may not capture all relevant information
- **Market Regimes**: Models trained on 2012-2019 data may not generalize to post-2020 market conditions
- **Transaction Costs**: Assumed costs may not reflect actual implementation costs for all market participants

#### Conclusion
This analysis demonstrates a systematic approach to news sentiment-based trading strategy development. While the results show limited out-of-sample profitability, the framework provides a solid foundation for further research and improvement. The negative performance in recent periods underscores the importance of continuous model validation and adaptation to changing market conditions.
