# Hedge Fund Time Series Forecasting - Data Exploration

## Competition Overview
This notebook performs a comprehensive exploration of the competition dataset to understand:
- Data structure and types
- Feature distributions and statistics
- Missing values and data quality
- Target variable characteristics
- Temporal patterns
- Category/entity relationships
- Correlation analysis

### Key Competition Details
- **Evaluation Metric**: Weighted RMSE Skill Score
- **Important Rule**: Predict `ts_index` *t* using only data from `ts_index` 0 to *t* (no look-ahead)
- **Features**: 86 anonymized features (feature_a to feature_ch)
- **Horizons**: 1 (short), 3 (medium), 10 (long), 25 (extra-long)
- **Prize**: $10,000 total + potential job interview at hedge fund
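
The exact formula behind the "Weighted RMSE Skill Score" is not stated in this notebook, so the sketch below rests on an assumption: the score is taken to be 1 minus the ratio of the model's weighted RMSE to that of an all-zero baseline. The function names `weighted_rmse` and `skill_score` are hypothetical; check the competition's evaluation page before relying on this definition.

```python
import numpy as np

def weighted_rmse(y_true, y_pred, weight):
    """Square root of the weight-averaged squared error."""
    y_true, y_pred, weight = (np.asarray(a, dtype=float) for a in (y_true, y_pred, weight))
    return float(np.sqrt(np.sum(weight * (y_true - y_pred) ** 2) / np.sum(weight)))

def skill_score(y_true, y_pred, weight):
    """ASSUMED definition: 1 - wRMSE(model) / wRMSE(all-zero baseline)."""
    baseline = weighted_rmse(y_true, np.zeros(len(y_true)), weight)
    return 1.0 - weighted_rmse(y_true, y_pred, weight) / baseline

# Made-up values, for illustration only
y = np.array([1.0, -2.0, 0.5])
w = np.array([1.0, 2.0, 1.0])
print(skill_score(y, y, w))            # perfect predictions -> 1.0
print(skill_score(y, np.zeros(3), w))  # the zero baseline itself -> 0.0
```

Under this assumed definition, a positive score means the model beats the zero baseline, and rows with larger `weight` dominate the error.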

## 1. Environment Setup

In [None]:
# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

# Configure display options
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.4f}'.format)
warnings.filterwarnings('ignore')

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [14, 6]
plt.rcParams['font.size'] = 12
sns.set_palette('husl')

# Define paths
DATA_DIR = Path('../data')
print(f"Data directory: {DATA_DIR.resolve()}")

## 2. Load Data

In [None]:
# Load training and test data
print("Loading data...")
train = pd.read_parquet(DATA_DIR / 'train.parquet')
test = pd.read_parquet(DATA_DIR / 'test.parquet')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print(f"\nTrain memory usage: {train.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"Test memory usage: {test.memory_usage(deep=True).sum() / 1e6:.2f} MB")

In [None]:
# Display first few rows
print("=" * 80)
print("TRAIN DATA - First 5 Rows")
print("=" * 80)
train.head()

In [None]:
# Display test data
print("=" * 80)
print("TEST DATA - First 5 Rows")
print("=" * 80)
test.head()

## 3. Data Structure Analysis

In [None]:
# Column information
print("=" * 80)
print("TRAIN COLUMNS INFO")
print("=" * 80)
train.info()

In [None]:
# Identify column types
train_columns = train.columns.tolist()

# Separate column groups
id_cols = ['id']
categorical_cols = ['code', 'sub_code', 'sub_category']
temporal_cols = ['ts_index']
horizon_col = ['horizon']
weight_col = ['weight']
target_col = ['target'] if 'target' in train_columns else []
feature_cols = [col for col in train_columns if col.startswith('feature_')]

print(f"ID columns: {id_cols}")
print(f"Categorical columns: {categorical_cols}")
print(f"Temporal columns: {temporal_cols}")
print(f"Horizon column: {horizon_col}")
print(f"Weight column: {weight_col}")
print(f"Target column: {target_col}")
print(f"Number of feature columns: {len(feature_cols)}")
print(f"\nFeature columns (first 10): {feature_cols[:10]}")
print(f"Feature columns (last 10): {feature_cols[-10:]}")

In [None]:
# Check what columns are different between train and test
train_cols_set = set(train.columns)
test_cols_set = set(test.columns)

print("Columns in train but not in test:")
print(train_cols_set - test_cols_set)

print("\nColumns in test but not in train:")
print(test_cols_set - train_cols_set)

## 4. Categorical Variables Analysis

In [None]:
# Code analysis
print("=" * 80)
print("CODE ANALYSIS")
print("=" * 80)
print(f"Number of unique codes in train: {train['code'].nunique()}")
print(f"Number of unique codes in test: {test['code'].nunique()}")

# Check overlap
train_codes = set(train['code'].unique())
test_codes = set(test['code'].unique())
print(f"\nCodes only in train: {len(train_codes - test_codes)}")
print(f"Codes only in test: {len(test_codes - train_codes)}")
print(f"Codes in both: {len(train_codes & test_codes)}")

In [None]:
# Sub-code analysis
print("=" * 80)
print("SUB_CODE ANALYSIS")
print("=" * 80)
print(f"Number of unique sub_codes in train: {train['sub_code'].nunique()}")
print(f"Number of unique sub_codes in test: {test['sub_code'].nunique()}")

# Check overlap
train_subcodes = set(train['sub_code'].unique())
test_subcodes = set(test['sub_code'].unique())
print(f"\nSub_codes only in train: {len(train_subcodes - test_subcodes)}")
print(f"Sub_codes only in test: {len(test_subcodes - train_subcodes)}")
print(f"Sub_codes in both: {len(train_subcodes & test_subcodes)}")

In [None]:
# Sub-category analysis
print("=" * 80)
print("SUB_CATEGORY ANALYSIS")
print("=" * 80)
print(f"Number of unique sub_categories in train: {train['sub_category'].nunique()}")
print(f"Number of unique sub_categories in test: {test['sub_category'].nunique()}")

# Check overlap
train_subcats = set(train['sub_category'].unique())
test_subcats = set(test['sub_category'].unique())
print(f"\nSub_categories only in train: {len(train_subcats - test_subcats)}")
print(f"Sub_categories only in test: {len(test_subcats - train_subcats)}")
print(f"Sub_categories in both: {len(train_subcats & test_subcats)}")

print("\nAll sub_categories in train:")
print(sorted(train_subcats))

In [None]:
# Visualize categorical distributions
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Code distribution
code_counts = train['code'].value_counts()
axes[0].bar(range(len(code_counts)), code_counts.values)
axes[0].set_title(f'Code Distribution (n={len(code_counts)})')
axes[0].set_xlabel('Code Index')
axes[0].set_ylabel('Count')

# Sub-code distribution
subcode_counts = train['sub_code'].value_counts()
axes[1].bar(range(len(subcode_counts)), subcode_counts.values)
axes[1].set_title(f'Sub-code Distribution (n={len(subcode_counts)})')
axes[1].set_xlabel('Sub-code Index')
axes[1].set_ylabel('Count')

# Sub-category distribution
subcat_counts = train['sub_category'].value_counts()
axes[2].bar(range(len(subcat_counts)), subcat_counts.values)
axes[2].set_title(f'Sub-category Distribution (n={len(subcat_counts)})')
axes[2].set_xlabel('Sub-category Index')
axes[2].set_ylabel('Count')

plt.tight_layout()
plt.show()

In [None]:
# Horizon distribution
print("=" * 80)
print("HORIZON DISTRIBUTION")
print("=" * 80)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

train_horizon_counts = train['horizon'].value_counts().sort_index()
test_horizon_counts = test['horizon'].value_counts().sort_index()

axes[0].bar(train_horizon_counts.index.astype(str), train_horizon_counts.values, color='steelblue')
axes[0].set_title('Train - Horizon Distribution')
axes[0].set_xlabel('Horizon')
axes[0].set_ylabel('Count')
for i, v in enumerate(train_horizon_counts.values):
    axes[0].text(i, v, f'{v:,}', ha='center', va='bottom')

axes[1].bar(test_horizon_counts.index.astype(str), test_horizon_counts.values, color='coral')
axes[1].set_title('Test - Horizon Distribution')
axes[1].set_xlabel('Horizon')
axes[1].set_ylabel('Count')
for i, v in enumerate(test_horizon_counts.values):
    axes[1].text(i, v, f'{v:,}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\nTrain horizon value counts:")
print(train_horizon_counts)
print("\nTest horizon value counts:")
print(test_horizon_counts)

## 5. Temporal Analysis (ts_index)

In [None]:
# ts_index analysis
print("=" * 80)
print("TS_INDEX (TEMPORAL) ANALYSIS")
print("=" * 80)

print(f"\nTrain ts_index range: {train['ts_index'].min()} to {train['ts_index'].max()}")
print(f"Train ts_index unique values: {train['ts_index'].nunique()}")

print(f"\nTest ts_index range: {test['ts_index'].min()} to {test['ts_index'].max()}")
print(f"Test ts_index unique values: {test['ts_index'].nunique()}")

# Check for temporal overlap
train_ts = set(train['ts_index'].unique())
test_ts = set(test['ts_index'].unique())
print(f"\nts_index overlap between train and test: {len(train_ts & test_ts)}")

In [None]:
# Visualize ts_index distribution
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Train ts_index histogram
axes[0, 0].hist(train['ts_index'], bins=100, color='steelblue', alpha=0.7)
axes[0, 0].set_title('Train ts_index Distribution')
axes[0, 0].set_xlabel('ts_index')
axes[0, 0].set_ylabel('Count')

# Test ts_index histogram
axes[0, 1].hist(test['ts_index'], bins=100, color='coral', alpha=0.7)
axes[0, 1].set_title('Test ts_index Distribution')
axes[0, 1].set_xlabel('ts_index')
axes[0, 1].set_ylabel('Count')

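# Combined view (shared bin edges so the two histograms are directly comparable)
ts_bins = np.linspace(min(train['ts_index'].min(), test['ts_index'].min()),
                      max(train['ts_index'].max(), test['ts_index'].max()), 101)
axes[1, 0].hist(train['ts_index'], bins=ts_bins, color='steelblue', alpha=0.5, label='Train')
axes[1, 0].hist(test['ts_index'], bins=ts_bins, color='coral', alpha=0.5, label='Test')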
axes[1, 0].set_title('Train vs Test ts_index Distribution')
axes[1, 0].set_xlabel('ts_index')
axes[1, 0].set_ylabel('Count')
axes[1, 0].legend()

# Rows per ts_index
train_ts_counts = train.groupby('ts_index').size()
axes[1, 1].plot(train_ts_counts.index, train_ts_counts.values, alpha=0.7)
axes[1, 1].set_title('Number of Rows per ts_index (Train)')
axes[1, 1].set_xlabel('ts_index')
axes[1, 1].set_ylabel('Count')

plt.tight_layout()
plt.show()

## 6. Target Variable Analysis

In [None]:
if 'target' in train.columns:
    print("=" * 80)
    print("TARGET VARIABLE ANALYSIS")
    print("=" * 80)
    
    print("\nTarget Statistics:")
    print(train['target'].describe())
    
    print(f"\nTarget missing values: {train['target'].isna().sum()} ({100*train['target'].isna().mean():.2f}%)")
    print(f"Target zeros: {(train['target'] == 0).sum()} ({100*(train['target'] == 0).mean():.2f}%)")
else:
    print("No 'target' column found in training data.")

In [None]:
if 'target' in train.columns:
    # Target distribution visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    
    # Raw distribution
    axes[0, 0].hist(train['target'].dropna(), bins=100, color='steelblue', alpha=0.7)
    axes[0, 0].set_title('Target Distribution (Raw)')
    axes[0, 0].set_xlabel('Target')
    axes[0, 0].set_ylabel('Count')
    
    # Log-transformed distribution (if applicable)
    target_positive = train.loc[train['target'] > 0, 'target']
    if len(target_positive) > 0:
        axes[0, 1].hist(np.log1p(target_positive), bins=100, color='seagreen', alpha=0.7)
        axes[0, 1].set_title('Target Distribution (log1p, positive only)')
        axes[0, 1].set_xlabel('log1p(Target)')
        axes[0, 1].set_ylabel('Count')
    
    # Box plot
    axes[0, 2].boxplot(train['target'].dropna())
    axes[0, 2].set_title('Target Box Plot')
    axes[0, 2].set_ylabel('Target')
    
    # Target by horizon
    train.boxplot(column='target', by='horizon', ax=axes[1, 0])
    axes[1, 0].set_title('Target by Horizon')
    axes[1, 0].set_xlabel('Horizon')
    plt.suptitle('')
    
    # Target mean over time
    target_by_ts = train.groupby('ts_index')['target'].mean()
    axes[1, 1].plot(target_by_ts.index, target_by_ts.values, alpha=0.7)
    axes[1, 1].set_title('Mean Target over ts_index')
    axes[1, 1].set_xlabel('ts_index')
    axes[1, 1].set_ylabel('Mean Target')
    
    # Target std over time
    target_std_by_ts = train.groupby('ts_index')['target'].std()
    axes[1, 2].plot(target_std_by_ts.index, target_std_by_ts.values, alpha=0.7, color='coral')
    axes[1, 2].set_title('Std of Target over ts_index')
    axes[1, 2].set_xlabel('ts_index')
    axes[1, 2].set_ylabel('Std Target')
    
    plt.tight_layout()
    plt.show()

In [None]:
if 'target' in train.columns:
    # Target by category
    print("\nTarget Statistics by Sub-Category:")
    target_by_subcat = train.groupby('sub_category')['target'].agg(['mean', 'std', 'median', 'count'])
    print(target_by_subcat.sort_values('mean', ascending=False))

## 7. Weight Analysis

In [None]:
print("=" * 80)
print("WEIGHT ANALYSIS (Important for evaluation!)")
print("=" * 80)

print("\nTrain Weight Statistics:")
print(train['weight'].describe())

print("\nTest Weight Statistics:")
print(test['weight'].describe())

In [None]:
# Weight distribution visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Train weight distribution
axes[0, 0].hist(train['weight'], bins=100, color='steelblue', alpha=0.7)
axes[0, 0].set_title('Train Weight Distribution')
axes[0, 0].set_xlabel('Weight')
axes[0, 0].set_ylabel('Count')

# Test weight distribution
axes[0, 1].hist(test['weight'], bins=100, color='coral', alpha=0.7)
axes[0, 1].set_title('Test Weight Distribution')
axes[0, 1].set_xlabel('Weight')
axes[0, 1].set_ylabel('Count')

# Weight by horizon
train.boxplot(column='weight', by='horizon', ax=axes[1, 0])
axes[1, 0].set_title('Weight by Horizon (Train)')
plt.suptitle('')

# Weight over time
weight_by_ts = train.groupby('ts_index')['weight'].mean()
axes[1, 1].plot(weight_by_ts.index, weight_by_ts.values, alpha=0.7)
axes[1, 1].set_title('Mean Weight over ts_index')
axes[1, 1].set_xlabel('ts_index')
axes[1, 1].set_ylabel('Mean Weight')

plt.tight_layout()
plt.show()

In [None]:
# Weight patterns
print("\nWeight by Horizon:")
print(train.groupby('horizon')['weight'].describe())

print("\nWeight by Sub-Category:")
print(train.groupby('sub_category')['weight'].describe())

## 8. Feature Analysis

In [None]:
print("=" * 80)
print("FEATURE ANALYSIS")
print("=" * 80)

# Feature statistics
feature_stats = train[feature_cols].describe().T
feature_stats['missing'] = train[feature_cols].isna().sum()
feature_stats['missing_pct'] = 100 * train[feature_cols].isna().mean()
feature_stats['zeros'] = (train[feature_cols] == 0).sum()
feature_stats['zeros_pct'] = 100 * (train[feature_cols] == 0).mean()

print("\nFeature Statistics Summary:")
print(feature_stats.head(20))

In [None]:
# Missing values analysis
print("\nFeatures with highest missing values:")
missing_features = feature_stats.sort_values('missing_pct', ascending=False)[['missing', 'missing_pct']]
print(missing_features.head(20))

# Visualize missing values
fig, ax = plt.subplots(figsize=(14, 6))
missing_pcts = train[feature_cols].isna().mean() * 100
ax.bar(range(len(missing_pcts)), missing_pcts.values)
ax.set_title('Missing Value Percentage by Feature')
ax.set_xlabel('Feature Index')
ax.set_ylabel('Missing %')
plt.tight_layout()
plt.show()

In [None]:
# Feature distributions (sample)
sample_features = feature_cols[:9]  # First 9 features

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
for idx, feat in enumerate(sample_features):
    ax = axes[idx // 3, idx % 3]
    ax.hist(train[feat].dropna(), bins=50, alpha=0.7)
    ax.set_title(f'{feat}')
    ax.set_xlabel('Value')
    ax.set_ylabel('Count')

plt.suptitle('Sample Feature Distributions', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Feature-Target correlations
if 'target' in train.columns:
    print("\nFeature-Target Correlations (Pearson):")
    correlations = train[feature_cols + ['target']].corr()['target'].drop('target').sort_values(key=abs, ascending=False)
    print("\n20 Most Correlated Features (largest |r|):")
    print(correlations.head(20))
    print("\n20 Least Correlated Features (smallest |r|):")
    print(correlations.tail(20))
    
    # Visualize
    fig, ax = plt.subplots(figsize=(16, 6))
    correlations_sorted = correlations.sort_values()
    colors = ['coral' if x < 0 else 'steelblue' for x in correlations_sorted.values]
    ax.bar(range(len(correlations_sorted)), correlations_sorted.values, color=colors)
    ax.set_title('Feature-Target Correlations')
    ax.set_xlabel('Feature (sorted by correlation)')
    ax.set_ylabel('Correlation')
    ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    plt.tight_layout()
    plt.show()

## 9. Feature Correlation Matrix

In [None]:
# Full feature correlation matrix (can be slow when the training set has many rows)
print("Computing feature correlation matrix...")
feature_corr = train[feature_cols].corr()

# Find highly correlated feature pairs
upper_triangle = feature_corr.where(np.triu(np.ones(feature_corr.shape), k=1).astype(bool))
high_corr_pairs = []
for col in upper_triangle.columns:
    for idx in upper_triangle.index:
        corr_val = upper_triangle.loc[idx, col]
        if pd.notna(corr_val) and abs(corr_val) > 0.9:
            high_corr_pairs.append((idx, col, corr_val))

print(f"\nHighly correlated feature pairs (|r| > 0.9): {len(high_corr_pairs)}")
if high_corr_pairs:
    for f1, f2, r in sorted(high_corr_pairs, key=lambda x: abs(x[2]), reverse=True)[:20]:
        print(f"  {f1} <-> {f2}: {r:.4f}")

In [None]:
# Correlation heatmap (using clustering)
# sns.clustermap creates its own figure, so a separate plt.figure() call would only add an empty one
g = sns.clustermap(feature_corr, cmap='RdBu_r', center=0, figsize=(16, 14),
                   dendrogram_ratio=0.1, cbar_pos=(0.02, 0.8, 0.03, 0.15))
g.fig.suptitle('Feature Correlation Clustermap', y=1.02)
plt.show()

## 10. Entity-Level Analysis

In [None]:
# Analyze entity combinations
print("=" * 80)
print("ENTITY-LEVEL ANALYSIS")
print("=" * 80)

# Unique entity combinations
entity_combo_train = train.groupby(['code', 'sub_code', 'sub_category']).size().reset_index(name='count')
entity_combo_test = test.groupby(['code', 'sub_code', 'sub_category']).size().reset_index(name='count')

print(f"Unique (code, sub_code, sub_category) combinations in train: {len(entity_combo_train)}")
print(f"Unique (code, sub_code, sub_category) combinations in test: {len(entity_combo_test)}")

# Check overlap
train_entities = set(zip(entity_combo_train['code'], entity_combo_train['sub_code'], entity_combo_train['sub_category']))
test_entities = set(zip(entity_combo_test['code'], entity_combo_test['sub_code'], entity_combo_test['sub_category']))

print(f"\nEntity combinations only in train: {len(train_entities - test_entities)}")
print(f"Entity combinations only in test: {len(test_entities - train_entities)}")
print(f"Entity combinations in both: {len(train_entities & test_entities)}")

In [None]:
# Time series length per entity
entity_ts_length = train.groupby(['code', 'sub_code', 'sub_category'])['ts_index'].nunique().reset_index(name='ts_length')

print("\nTime series length per entity:")
print(entity_ts_length['ts_length'].describe())

plt.figure(figsize=(12, 5))
plt.hist(entity_ts_length['ts_length'], bins=50, alpha=0.7, color='steelblue')
plt.title('Distribution of Time Series Length per Entity')
plt.xlabel('Number of ts_index values')
plt.ylabel('Count of entities')
plt.tight_layout()
plt.show()
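
The cells above treat `(code, sub_code, sub_category)` as the entity key, but they do not establish whether these columns form a strict hierarchy. One way to check is sketched below on a toy frame with made-up values; the helper `is_nested` is hypothetical and not part of the notebook.

```python
import pandas as pd

def is_nested(df, child, parent):
    """True if every value of `child` appears under exactly one `parent` value."""
    return bool((df.groupby(child)[parent].nunique() == 1).all())

# Toy frame with made-up values; on the real data, pass `train` instead.
toy = pd.DataFrame({
    'code':         ['A', 'A', 'B', 'B'],
    'sub_code':     ['a1', 'a2', 'b1', 'a1'],   # 'a1' appears under both A and B
    'sub_category': ['x', 'x', 'y', 'y'],
})

print(is_nested(toy, 'sub_code', 'code'))           # False: 'a1' has two parents
print(is_nested(toy.iloc[:3], 'sub_code', 'code'))  # True once the offending row is dropped
```

If a check like this returns `False` on the real data, `sub_code` alone does not identify an entity and the full triple should be kept as the key.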

## 11. Time Series Patterns

In [None]:
# Sample time series visualization
if 'target' in train.columns:
    # Get a sample entity
    sample_entity = train.groupby(['code', 'sub_code', 'sub_category']).size().idxmax()
    sample_data = train[(train['code'] == sample_entity[0]) & 
                        (train['sub_code'] == sample_entity[1]) &
                        (train['sub_category'] == sample_entity[2])].sort_values('ts_index')
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 10))
    
    # Sort horizons so the panels appear in ascending horizon order
    for idx, horizon in enumerate(sorted(sample_data['horizon'].unique())[:4]):
        ax = axes[idx // 2, idx % 2]
        horizon_data = sample_data[sample_data['horizon'] == horizon]
        ax.plot(horizon_data['ts_index'], horizon_data['target'], marker='o', alpha=0.7, markersize=2)
        ax.set_title(f'Sample Entity - Horizon {horizon}')
        ax.set_xlabel('ts_index')
        ax.set_ylabel('Target')
    
    plt.suptitle(f'Sample Time Series: {sample_entity}', fontsize=12)
    plt.tight_layout()
    plt.show()

## 12. Data Quality Summary

In [None]:
print("=" * 80)
print("DATA QUALITY SUMMARY")
print("=" * 80)

# Overall missing values
print("\n--- Missing Values ---")
missing_train = train.isna().sum()
missing_test = test.isna().sum()

for col in train.columns:
    train_miss = missing_train[col]
    if train_miss > 0:
        print(f"Train - {col}: {train_miss} ({100*train_miss/len(train):.2f}%)")

print("")
for col in test.columns:
    test_miss = missing_test[col]
    if test_miss > 0:
        print(f"Test - {col}: {test_miss} ({100*test_miss/len(test):.2f}%)")

# Data types
print("\n--- Data Types ---")
print(train.dtypes.value_counts())

# Duplicates
print("\n--- Duplicate Checks ---")
print(f"Duplicate IDs in train: {train['id'].duplicated().sum()}")
print(f"Duplicate IDs in test: {test['id'].duplicated().sum()}")
print(f"Duplicate rows in train (all columns): {train.duplicated().sum()}")

## 13. Key Findings and Recommendations

In [None]:
print("=" * 80)
print("KEY FINDINGS")
print("=" * 80)

print("""
Based on the data exploration, here are the key findings:

1. DATA STRUCTURE:
   - Training data: {} rows, {} columns
   - Test data: {} rows, {} columns
   - 86 anonymized features (feature_a to feature_ch)
   - 4 forecast horizons: 1, 3, 10, 25

2. TEMPORAL STRUCTURE:
   - Train ts_index range: {} to {}
   - Test ts_index range: {} to {}
   - Test data is expected to come AFTER training data; confirm with the ts_index overlap check in Section 5

3. ENTITY STRUCTURE:
   - Number of unique codes: {}
   - Number of unique sub_codes: {}
   - Number of unique sub_categories: {}

4. RECOMMENDATIONS FOR MODELING:
   - Use time-based validation (train on early ts_index, validate on later)
   - Consider weighting recent data more heavily
   - Build separate models or features for different horizons
   - Handle entity hierarchies (code -> sub_code -> sub_category)
   - Address missing values appropriately
   - Use the weight column for weighted loss functions

5. POTENTIAL APPROACHES:
   - Gradient Boosting (LightGBM, XGBoost, CatBoost)
   - Neural Networks with entity embeddings
   - Time series specific models
   - Ensemble methods
""".format(
    len(train), len(train.columns),
    len(test), len(test.columns),
    train['ts_index'].min(), train['ts_index'].max(),
    test['ts_index'].min(), test['ts_index'].max(),
    train['code'].nunique(),
    train['sub_code'].nunique(),
    train['sub_category'].nunique()
))
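
The time-based validation recommended above can be sketched as follows. This is a minimal illustration on a toy frame with made-up values, not a tuned validation scheme; the helper `time_split` and its `valid_frac` parameter are hypothetical names.

```python
import numpy as np
import pandas as pd

def time_split(df, valid_frac=0.2, ts_col='ts_index'):
    """Hold out the latest `valid_frac` of distinct ts_index values for validation."""
    ts_values = np.sort(df[ts_col].unique())
    n_valid = max(1, int(round(len(ts_values) * valid_frac)))
    cutoff = ts_values[-n_valid]  # first ts_index that belongs to the validation fold
    return df[df[ts_col] < cutoff], df[df[ts_col] >= cutoff]

# Toy frame with made-up values standing in for `train`
toy = pd.DataFrame({
    'ts_index': np.repeat(np.arange(10), 3),
    'target':   np.linspace(-1.0, 1.0, 30),
    'weight':   np.tile([1.0, 2.0, 3.0], 10),
})
train_part, valid_part = time_split(toy, valid_frac=0.2)
print(train_part['ts_index'].max(), valid_part['ts_index'].min())  # 7 8
```

The `weight` column of each part can then be passed as sample weights to the learner, for example through the `sample_weight` argument of `LGBMRegressor.fit`. If targets at longer horizons depend on values inside the validation period, leaving a gap between the two folds may be worth considering.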

## 14. Save Exploration Results

In [None]:
# Save key statistics for later use
import json

exploration_results = {
    'train_shape': train.shape,
    'test_shape': test.shape,
    'n_features': len(feature_cols),
    'feature_cols': feature_cols,
    'n_codes': train['code'].nunique(),
    'n_sub_codes': train['sub_code'].nunique(),
    'n_sub_categories': train['sub_category'].nunique(),
    'horizons': sorted(train['horizon'].unique().tolist()),
    'train_ts_range': [int(train['ts_index'].min()), int(train['ts_index'].max())],
    'test_ts_range': [int(test['ts_index'].min()), int(test['ts_index'].max())],
    'target_stats': train['target'].describe().to_dict() if 'target' in train.columns else None,
}

output_path = Path('../exploration_results.json')
with open(output_path, 'w') as f:
    json.dump(exploration_results, f, indent=2, default=str)

print(f"Exploration results saved to {output_path.resolve()}")

---
## Next Steps

1. **Feature Engineering**: Create time-based features, lag features, rolling statistics
2. **Baseline Model**: Build a simple LightGBM baseline
3. **Validation Strategy**: Implement proper time-based cross-validation
4. **Advanced Models**: Try different architectures and ensembles
5. **Hyperparameter Tuning**: Optimize model parameters
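
As a starting point for step 1, the sketch below builds lag and rolling-mean features per entity while respecting the no-look-ahead rule. It assumes that one series is identified by `(code, sub_code, sub_category, horizon)`, which this notebook has not verified; `add_lag_features` and `ENTITY_KEYS` are hypothetical names, and the toy values are made up.

```python
import pandas as pd

ENTITY_KEYS = ['code', 'sub_code', 'sub_category', 'horizon']  # assumed series key

def add_lag_features(df, col, lags=(1, 2), window=3):
    """Add per-entity lags and a rolling mean computed over strictly earlier rows."""
    out = df.sort_values(ENTITY_KEYS + ['ts_index']).copy()
    grouped = out.groupby(ENTITY_KEYS, sort=False)[col]
    for lag in lags:
        out[f'{col}_lag{lag}'] = grouped.shift(lag)
    # shift(1) before rolling so the window never includes the current row
    out[f'{col}_rollmean{window}'] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return out

# Toy single-entity frame with made-up values
toy = pd.DataFrame({
    'code': ['A'] * 4, 'sub_code': ['a1'] * 4, 'sub_category': ['x'] * 4,
    'horizon': [1] * 4, 'ts_index': [0, 1, 2, 3],
    'feature_a': [10.0, 20.0, 30.0, 40.0],
})
feats = add_lag_features(toy, 'feature_a')
print(feats[['ts_index', 'feature_a_lag1', 'feature_a_rollmean3']])
```

The lags here are row-based, so an entity with gaps in `ts_index` gets its previous observation rather than the value exactly one step earlier. Lagging `target` instead of a feature needs extra care, because a target at horizon *h* may not yet be known at prediction time.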