# Training Data Preparation: Triple-Barrier (Bollinger Mean Reversion)

This notebook prepares the final training dataset for the triple-barrier approach, using a Bollinger Band mean-reversion primary strategy, by:

1. **Loading features** from `features_triple_barrier.csv`
2. **Loading labels** from the triple-barrier events CSV (`EURUSD_triple_barrier_events_unfiltered.csv`, generated by `meta_labeling.ipynb`)
3. **Merging** features with labels on timestamp
4. **Assigning sample weights** (uniform placeholders for now; concurrency and return-attribution weights can be computed later)
5. **Preprocessing features** with MinMax normalization
6. **Time-based train/test split** (no shuffling to preserve temporal order)
7. **Saving ready-to-use datasets** for model training

**Strategy Context:**
- Primary model: Bollinger Band mean reversion (window=20, num_std=2.0)
- Entry filter: CUSUM filter on volatility
- Triple-barrier settings: pt_sl=[1, 2], vertical_barrier=50 bars
- The primary model can later be switched to an MA crossover (in `meta_labeling.ipynb`) for comparison with the trend-scanning approach
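
As a rough illustration of the primary model named above, the following is a minimal sketch of a Bollinger Band mean-reversion side signal with `window=20` and `num_std=2.0`. The function name `bollinger_side` and the exact entry rule (long below the lower band, short above the upper band) are assumptions made for illustration; the actual signal logic lives in `meta_labeling.ipynb` and may differ.

```python
import pandas as pd

def bollinger_side(close: pd.Series, window: int = 20, num_std: float = 2.0) -> pd.Series:
    """Hypothetical helper: +1 (long) below the lower band, -1 (short) above the upper band."""
    mid = close.rolling(window).mean()
    std = close.rolling(window).std()
    upper = mid + num_std * std
    lower = mid - num_std * std

    # Bars without a full window have NaN bands; the comparisons are False there,
    # so those bars keep a side of 0 (no signal).
    side = pd.Series(0, index=close.index, dtype=int)
    side[close < lower] = 1
    side[close > upper] = -1
    return side

# Toy price series: small oscillation around 1.0, then a sharp drop on the last bar
close_demo = pd.Series([1.0 + 0.001 * (-1) ** i for i in range(24)] + [0.9])
side_demo = bollinger_side(close_demo)
```

In this toy series only the final bar falls below the lower band, so it is the only bar that receives a long signal.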

In [None]:
# Imports
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully")

## 1. Load Features and Labels

In [None]:
# Load features
features_file = 'data/features_triple_barrier.csv'
print(f"Loading features from {features_file}...")
features = pd.read_csv(features_file, index_col=0, parse_dates=True)

print(f"✓ Features loaded: {features.shape}")
print(f"  Date range: {features.index[0]} to {features.index[-1]}")
print(f"  Columns: {len(features.columns)}")
features.head(3)

In [None]:
# Find and load the triple-barrier events file
data_dir = Path('data')
label_files = list(data_dir.glob('EURUSD_triple_barrier_events_unfiltered.csv'))

if not label_files:
    raise FileNotFoundError("No triple-barrier events file found! Run meta_labeling.ipynb first.")

# If several files match, use the last one in sorted (lexicographic) order
labels_file = sorted(label_files)[-1]
print(f"Loading labels from {labels_file.name}...")
labels = pd.read_csv(labels_file, index_col=0, parse_dates=True)

print(f"✓ Labels loaded: {labels.shape}")
print(f"  Date range: {labels.index[0]} to {labels.index[-1]}")
print(f"  Columns: {list(labels.columns)}")
print(f"\nLabel distribution:")
if 'bin' in labels.columns:
    label_dist = labels['bin'].value_counts().sort_index()
    for label, count in label_dist.items():
        pct = count / len(labels) * 100
        label_name = {-1: 'STOP_LOSS', 0: 'TIMEOUT', 1: 'PROFIT_TARGET'}.get(label, 'UNKNOWN')
        print(f"  {label_name:15s} ({label:2d}): {count:6,} ({pct:5.2f}%)")
labels.head(3)

## 2. Merge Features with Labels
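
An index-based inner join silently multiplies rows when either side contains duplicate timestamps, and silently drops rows when the two indexes do not line up. The sketch below shows one way to inspect the join keys before merging. `check_join_keys` and the `*_demo` frames are hypothetical names used only for illustration.

```python
import pandas as pd

def check_join_keys(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Summarize index properties that affect an index-based inner join."""
    return {
        "left_duplicates": int(left.index.duplicated().sum()),
        "right_duplicates": int(right.index.duplicated().sum()),
        "shared_timestamps": len(left.index.intersection(right.index)),
    }

# Toy data: three feature rows, two of which have a label
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"])
features_demo = pd.DataFrame({"f1": [0.1, 0.2, 0.3]}, index=idx)
labels_demo = pd.DataFrame({"bin": [1, 0]}, index=idx[[0, 2]])

report = check_join_keys(features_demo, labels_demo)
merged_demo = features_demo.join(labels_demo, how="inner")
```

With no duplicates on either side, the merged row count should equal `shared_timestamps`.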

In [None]:
# Merge on timestamp (inner join to keep only observations with both features and labels)
print("Merging features with labels...")
print(f"  Features: {len(features):,} rows")
print(f"  Labels: {len(labels):,} rows")

# Inner join on index (timestamp)
data = features.join(labels, how='inner', rsuffix='_label')

print(f"\n✓ Merged dataset: {data.shape}")
print(f"  Date range: {data.index[0]} to {data.index[-1]}")
print(f"  Lost {len(features) - len(data):,} observations due to missing labels")

# Verify no missing values in critical columns
print(f"\nMissing values check:")
print(f"  Features: {data[features.columns].isnull().sum().sum()}")
print(f"  Labels (bin): {data['bin'].isnull().sum()}")

data.head(3)

## 3. Prepare Features, Labels, and Metadata

In [None]:
# Separate features, labels, and metadata
feature_cols = features.columns.tolist()
X = data[feature_cols].copy()
y = data['bin'].copy()

# Extract metadata columns (may vary depending on triple_barrier output)
metadata_cols = [col for col in labels.columns if col != 'bin']
label_metadata = data[metadata_cols].copy()

print("="*80)
print("DATASET COMPOSITION")
print("="*80)
print(f"Features (X): {X.shape}")
print(f"Labels (y): {y.shape}")
print(f"Label metadata: {label_metadata.shape}")
print(f"  Metadata columns: {list(label_metadata.columns)}")

print(f"\nLabel distribution:")
label_counts = y.value_counts().sort_index()
for label, count in label_counts.items():
    pct = count / len(y) * 100
    label_name = {-1: 'STOP_LOSS', 0: 'TIMEOUT', 1: 'PROFIT_TARGET'}.get(label, 'UNKNOWN')
    print(f"  {label_name:15s} ({label:2d}): {count:6,} ({pct:5.2f}%)")

# Check if return information is available
if 'ret' in label_metadata.columns:
    print(f"\nReturn statistics:")
    print(f"  Mean: {label_metadata['ret'].mean():.6f}")
    print(f"  Median: {label_metadata['ret'].median():.6f}")
    print(f"  Std: {label_metadata['ret'].std():.6f}")
    print(f"  Min: {label_metadata['ret'].min():.6f}")
    print(f"  Max: {label_metadata['ret'].max():.6f}")

## 4. Compute Sample Weights

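**Sample weights can account for:**
- **Concurrency**: how many labels overlap at each timestamp (overlapping labels share information and should be down-weighted)
- **Return attribution**: importance proportional to the magnitude of the returns attributed to each label
- **Time decay**: optional decay to emphasize recent observations

**Note:** This notebook does not compute these weights. It uses uniform weights as placeholders; concurrency-based weights can be computed afterwards in `concurrency_weights.ipynb` or in the model notebook.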
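
For reference, the concurrency idea can be sketched as an average-uniqueness weight: count how many labels are active on each bar, then give each label the mean of `1 / concurrency` over its lifespan. This is a minimal sketch, not the weighting used in this project. It assumes each event has a known end time (called `t1` here, a hypothetical name) and uses integer bar positions for brevity.

```python
import pandas as pd

def average_uniqueness(t1: pd.Series, bar_index: pd.Index) -> pd.Series:
    """t1 maps each event start (index) to its end bar (value), both inclusive."""
    # Count how many labels are active on each bar
    concurrency = pd.Series(0.0, index=bar_index)
    for start, end in t1.items():
        concurrency.loc[start:end] += 1

    # Weight each label by the mean of 1/concurrency over the bars it spans
    weights = pd.Series(index=t1.index, dtype=float)
    for start, end in t1.items():
        weights.loc[start] = (1.0 / concurrency.loc[start:end]).mean()
    return weights

# Toy example: events 0 and 1 overlap on bars 1-2; event 4 overlaps nothing
bars_demo = pd.RangeIndex(6)
t1_demo = pd.Series({0: 2, 1: 3, 4: 5})
uniqueness_demo = average_uniqueness(t1_demo, bars_demo)
```

The isolated event receives weight 1.0, while the two overlapping events each receive 2/3 because they share two of their three bars.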

## 5. Feature Normalization (MinMax Scaling)

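**Important:** We normalize features to the [0, 1] range to:
- Put all features on a common scale
- Help scale-sensitive models (for example linear or distance-based models) train reliably; tree-based models such as Random Forest are largely unaffected by this kind of rescaling
- Make feature values easier to compare across columns

**Note:** This section only runs pre-normalization checks. The scaler itself is fit in Section 7, after the split, on training data only, and then applied to both train and test to prevent data leakage.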

In [None]:
# Check for any extreme values or issues before normalization
print("Feature statistics check:")
print(f"  Features with inf: {np.isinf(X).sum().sum()}")
print(f"  Features with NaN: {X.isnull().sum().sum()}")

# Check for constant columns (would cause issues in normalization)
constant_cols = [col for col in X.columns if X[col].nunique() <= 1]
if constant_cols:
    print(f"\n⚠ WARNING - Constant columns detected ({len(constant_cols)}): {constant_cols}")
    print(f"  These will be removed before normalization")
    X = X.drop(columns=constant_cols)
    feature_cols = X.columns.tolist()
else:
    print(f"  ✓ No constant columns detected")

print(f"\nFinal feature count: {len(feature_cols)}")

## 6. Time-Based Train/Test Split

**Critical for time-series:**
- No shuffling (preserves temporal order)
- Train on earlier data, test on later data
- Simulates real-world deployment scenario
- Split used here: 70% train, 30% test
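
Because a triple-barrier label can span up to 50 bars, training events that start just before the split point may end inside the test period. One common remedy is to purge those events from the training set. The sketch below assumes a series `t1` holding each event's end time (a hypothetical name; the actual column in the events file may differ) and is not part of the split performed in this notebook.

```python
import pandas as pd

def purged_time_split(t1: pd.Series, split_ratio: float = 0.70):
    """Return (train_index, test_index), dropping train events that end in the test period."""
    split_idx = int(len(t1) * split_ratio)
    test_start = t1.index[split_idx]

    train_idx = t1.index[:split_idx]
    # Keep only training events whose label is resolved before the test period begins
    keep = (t1.iloc[:split_idx] < test_start).to_numpy()
    return train_idx[keep], t1.index[split_idx:]

# Toy example: 8 daily events, each lasting 2 days
starts_demo = pd.date_range("2024-01-01", periods=8, freq="D")
t1_split_demo = pd.Series(starts_demo + pd.Timedelta(days=2), index=starts_demo)
train_idx_demo, test_idx_demo = purged_time_split(t1_split_demo, split_ratio=0.75)
```

Here the test period starts on 2024-01-07, so the two training events that end on or after that date are dropped, leaving four training events and two test events.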

In [None]:
# Time-based split (70/30)
split_ratio = 0.70
split_idx = int(len(data) * split_ratio)

# Split data
X_train = X.iloc[:split_idx].copy()
X_test = X.iloc[split_idx:].copy()
y_train = y.iloc[:split_idx].copy()
y_test = y.iloc[split_idx:].copy()

metadata_train = label_metadata.iloc[:split_idx].copy()
metadata_test = label_metadata.iloc[split_idx:].copy()

print("="*80)
print("TRAIN/TEST SPLIT (Time-Based, No Shuffling)")
print("="*80)
print(f"Split ratio: {split_ratio:.0%} train / {1-split_ratio:.0%} test")
print(f"Split index: {split_idx:,}")
print(f"\nTrain set:")
print(f"  Shape: {X_train.shape}")
print(f"  Date range: {X_train.index[0]} to {X_train.index[-1]}")
print(f"  Label distribution:")
for label, count in y_train.value_counts().sort_index().items():
    pct = count / len(y_train) * 100
    label_name = {-1: 'STOP_LOSS', 0: 'TIMEOUT', 1: 'PROFIT_TARGET'}.get(label, 'UNKNOWN')
    print(f"    {label_name:15s} ({label:2d}): {count:6,} ({pct:5.2f}%)")

print(f"\nTest set:")
print(f"  Shape: {X_test.shape}")
print(f"  Date range: {X_test.index[0]} to {X_test.index[-1]}")
print(f"  Label distribution:")
for label, count in y_test.value_counts().sort_index().items():
    pct = count / len(y_test) * 100
    label_name = {-1: 'STOP_LOSS', 0: 'TIMEOUT', 1: 'PROFIT_TARGET'}.get(label, 'UNKNOWN')
    print(f"    {label_name:15s} ({label:2d}): {count:6,} ({pct:5.2f}%)")

## 7. Normalize Features with MinMax Scaler

**Fit on train, transform both train and test** to prevent data leakage.
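
One consequence of fitting on the training set only is that test values outside the training range are mapped outside [0, 1]. The toy example below illustrates this with made-up numbers, and shows the `clip=True` option of `MinMaxScaler` (available in scikit-learn 0.24 and later) as one possible way to bound the output. Whether clipping is appropriate depends on the model.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: the training range is [0, 10]; the test values fall outside it
X_train_demo = np.array([[0.0], [5.0], [10.0]])
X_test_demo = np.array([[-2.0], [12.0]])

# Fit on train only, then transform test with the train min/max
scaler_demo = MinMaxScaler().fit(X_train_demo)
test_scaled_demo = scaler_demo.transform(X_test_demo)

# Optional: clip transformed values to the [0, 1] feature range
clipped_demo = MinMaxScaler(clip=True).fit(X_train_demo).transform(X_test_demo)
```

Without clipping, the two test values map to -0.2 and 1.2, so a test-set minimum below 0 or maximum above 1 is expected and is not a sign of leakage.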

In [None]:
# Initialize scaler
scaler = MinMaxScaler()

# Fit scaler on training data only
print("Fitting MinMaxScaler on training data...")
scaler.fit(X_train)

# Transform both train and test
print("Transforming features to [0, 1] range...")
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train),
    index=X_train.index,
    columns=X_train.columns
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    index=X_test.index,
    columns=X_test.columns
)

print(f"✓ Normalization complete")
print(f"\nScaled feature statistics (train):")
print(f"  Min: {X_train_scaled.min().min():.6f}")
print(f"  Max: {X_train_scaled.max().max():.6f}")
print(f"  Mean: {X_train_scaled.mean().mean():.6f}")
print(f"  Std: {X_train_scaled.std().mean():.6f}")

print(f"\nScaled feature statistics (test):")
print(f"  Min: {X_test_scaled.min().min():.6f}")
print(f"  Max: {X_test_scaled.max().max():.6f}")
print(f"  Mean: {X_test_scaled.mean().mean():.6f}")
print(f"  Std: {X_test_scaled.std().mean():.6f}")

# Verify no NaN or inf after scaling
print(f"\nData quality check (post-scaling):")
print(f"  Train - NaN: {X_train_scaled.isnull().sum().sum()}, Inf: {np.isinf(X_train_scaled).sum().sum()}")
print(f"  Test - NaN: {X_test_scaled.isnull().sum().sum()}, Inf: {np.isinf(X_test_scaled).sum().sum()}")

## 8. Save Processed Datasets

Save **ready-to-use** datasets for model training:
- Features (normalized)
- Labels
- Sample weights (uniform for now, can be updated later)
- Label metadata (for analysis)
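
The saved CSVs can be read back in the model notebook with the timestamp index restored. The round trip below is a minimal sketch using a temporary directory and a hypothetical file name (`y_train_demo.csv`); the real files carry a run timestamp in their names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy label series with a datetime index
idx_demo = pd.date_range("2024-01-01", periods=3, freq="D", name="timestamp")
y_demo = pd.Series([1, 0, 1], index=idx_demo, name="bin")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train_demo.csv"
    # Write with a header so the column name survives the round trip
    y_demo.to_csv(path, header=True)
    # Restore the datetime index and select the label column as a Series
    y_loaded = pd.read_csv(path, index_col=0, parse_dates=True)["bin"]
```

Passing `index_col=0` and `parse_dates=True` mirrors how the features and labels are loaded earlier in this notebook.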

In [None]:
# Create output directory
output_dir = Path('data/training')
output_dir.mkdir(parents=True, exist_ok=True)

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

print("="*80)
print("SAVING PROCESSED DATASETS")
print("="*80)

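# Sample weights: uniform placeholders until concurrency weights are computed
weights_train = pd.Series(1.0, index=X_train_scaled.index, name='weight')
weights_test = pd.Series(1.0, index=X_test_scaled.index, name='weight')

# Save train set
train_features_file = output_dir / f'X_train_triple_barrier_{timestamp}.csv'
train_labels_file = output_dir / f'y_train_triple_barrier_{timestamp}.csv'
train_weights_file = output_dir / f'weights_train_triple_barrier_{timestamp}.csv'
train_metadata_file = output_dir / f'metadata_train_triple_barrier_{timestamp}.csv'

X_train_scaled.to_csv(train_features_file)
y_train.to_csv(train_labels_file, header=True)
weights_train.to_csv(train_weights_file, header=True)
metadata_train.to_csv(train_metadata_file)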

print(f"Train set saved:")
print(f"  Features: {train_features_file.name}")
print(f"  Labels: {train_labels_file.name}")
print(f"  Weights: {train_weights_file.name}")
print(f"  Metadata: {train_metadata_file.name}")

# Save test set
test_features_file = output_dir / f'X_test_triple_barrier_{timestamp}.csv'
test_labels_file = output_dir / f'y_test_triple_barrier_{timestamp}.csv'
test_weights_file = output_dir / f'weights_test_triple_barrier_{timestamp}.csv'
test_metadata_file = output_dir / f'metadata_test_triple_barrier_{timestamp}.csv'

X_test_scaled.to_csv(test_features_file)
y_test.to_csv(test_labels_file, header=True)
weights_test.to_csv(test_weights_file, header=True)
metadata_test.to_csv(test_metadata_file)

print(f"\nTest set saved:")
print(f"  Features: {test_features_file.name}")
print(f"  Labels: {test_labels_file.name}")
print(f"  Weights: {test_weights_file.name}")
print(f"  Metadata: {test_metadata_file.name}")

# Save feature names for reference
feature_names_file = output_dir / f'feature_names_triple_barrier_{timestamp}.txt'
with open(feature_names_file, 'w') as f:
    f.write("Triple-Barrier Feature Names (Bollinger Mean Reversion)\n")
    f.write("="*80 + "\n\n")
    for i, col in enumerate(X_train_scaled.columns, 1):
        f.write(f"{i}. {col}\n")

print(f"\nFeature names: {feature_names_file.name}")

print("\n" + "="*80)
print("✓ ALL DATASETS SAVED SUCCESSFULLY")
print("="*80)
print(f"\nDatasets ready for model training in: {output_dir}")
print(f"\nNext steps:")
print(f"  1. Compute concurrency weights (optional) using concurrency_weights.ipynb")
print(f"  2. Train Random Forest with weighted samples in models.ipynb")
print(f"  3. Compare with trend-scanning approach")
print(f"\nNote: Currently using Bollinger mean reversion strategy")
print(f"      Can change to MA crossover in meta_labeling.ipynb for comparison")

## 9. Final Dataset Summary

In [None]:
print("="*80)
print("FINAL DATASET SUMMARY")
print("="*80)

print(f"\nTrain Set:")
print(f"  Observations: {len(X_train_scaled):,}")
print(f"  Features: {X_train_scaled.shape[1]}")
print(f"  Date range: {X_train_scaled.index[0]} to {X_train_scaled.index[-1]}")
print(f"  Label distribution:")
for label, count in y_train.value_counts().sort_index().items():
    pct = count / len(y_train) * 100
    label_name = {-1: 'STOP_LOSS', 0: 'TIMEOUT', 1: 'PROFIT_TARGET'}.get(label, 'UNKNOWN')
    print(f"    {label_name}: {count:,} ({pct:.2f}%)")
print(f"  Sample weights: mean={weights_train.mean():.4f}, std={weights_train.std():.4f}")

print(f"\nTest Set:")
print(f"  Observations: {len(X_test_scaled):,}")
print(f"  Features: {X_test_scaled.shape[1]}")
print(f"  Date range: {X_test_scaled.index[0]} to {X_test_scaled.index[-1]}")
print(f"  Label distribution:")
for label, count in y_test.value_counts().sort_index().items():
    pct = count / len(y_test) * 100
    label_name = {-1: 'STOP_LOSS', 0: 'TIMEOUT', 1: 'PROFIT_TARGET'}.get(label, 'UNKNOWN')
    print(f"    {label_name}: {count:,} ({pct:.2f}%)")
print(f"  Sample weights: mean={weights_test.mean():.4f}, std={weights_test.std():.4f}")

print(f"\nFeature Normalization:")
print(f"  Method: MinMaxScaler")
print(f"  Range: [0, 1]")
print(f"  Fitted on: Train set only")
print(f"  Applied to: Both train and test")

print(f"\nPrimary Strategy:")
print(f"  Strategy: Bollinger Band Mean Reversion")
print(f"  Window: 20, Std: 2.0")
print(f"  Entry filter: CUSUM filter on volatility")
print(f"  Triple-barrier: pt_sl=[1, 2], vertical_barrier=50 bars")

print(f"\n✓ Data preparation complete!")
print(f"✓ Ready for Random Forest training with triple-barrier labels")

In [None]:
# Quick preview of prepared data
print("Sample of prepared training data:")
print("\nFeatures (first 3 rows, first 10 columns):")
print(X_train_scaled.iloc[:3, :10])
print("\nLabels (first 10):")
print(y_train.head(10))
print("\nSample weights (first 10):")
print(weights_train.head(10))