# Training Data Preparation: Trend-Scanning

This notebook prepares the final training dataset for the trend-scanning approach by:

1. **Loading features** from `features_trend_scanning.csv`
2. **Loading labels** from trend-scanning labels CSV
3. **Merging** features with labels on timestamp
4. **Extracting sample weights** from t-values (statistical significance)
5. **Preprocessing features** with MinMax normalization
6. **Time-based train/test split** (no shuffling to preserve temporal order)
7. **Saving ready-to-use datasets** for model training

**Key Insight from Article:**
- Use **t-values as sample weights** to emphasize statistically stronger trends
- Weighted models significantly outperform unweighted (Sharpe: 2.62 vs varies wildly)
- This approach delivered +37% Sharpe improvement and +88% Sortino improvement in out-of-sample testing

In [1]:
# Imports
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler
from datetime import datetime
from strategies import MACrossoverStrategy
from volatility import get_daily_vol
from signal_processing import get_entries

from load_data import load_bars
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully")

Libraries loaded successfully


## 1. Load Features and Labels

In [2]:
# Load features
features_file = 'data/features_trend_scanning.csv'
print(f"Loading features from {features_file}...")
features = pd.read_csv(features_file, index_col=0, parse_dates=True)

print(f"✓ Features loaded: {features.shape}")
print(f"  Date range: {features.index[0]} to {features.index[-1]}")
print(f"  Columns: {len(features.columns)}")
features.head(3)

Loading features from data/features_trend_scanning.csv...
✓ Features loaded: (332916, 96)
  Date range: 2022-09-02 00:05:00 to 2025-10-31 23:00:00
  Columns: 96


Unnamed: 0_level_0,ma_10,ma_20,ma_50,ma_100,ma_200,ma_10_20_cross,ma_20_50_cross,ma_50_200_cross,ma_spread_10_20,ma_spread_20_50,...,inside_bar,outside_bar,near_recent_high,near_recent_low,fractal_trend_strength,fractal_trend_direction,fractal_ma_ratio,fractal_trend_confirmation,distance_to_fractal_resistance,distance_to_fractal_support
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2022-09-02 00:05:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022-09-02 00:10:00,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022-09-02 00:15:00,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
# Find and load the most recent trend labels file
data_dir = Path('data')
label_files = list(data_dir.glob('EURUSD_trend_labels.csv'))

if not label_files:
    raise FileNotFoundError("No trend labels file found! Run trend_scanning.ipynb first.")

labels = pd.read_csv(label_files[0], index_col=0, parse_dates=True)

print(f"✓ Labels loaded: {labels.shape}")
print(f"  Date range: {labels.index[0]} to {labels.index[-1]}")
print(f"  Columns: {list(labels.columns)}")
print(f"\nLabel distribution:")
print(labels['bin'].value_counts().sort_index())
labels.head(3)

✓ Labels loaded: (332817, 7)
  Date range: 2022-09-02 00:05:00 to 2025-10-31 14:45:00
  Columns: ['t1', 'window', 'slope', 't_value', 'rsquared', 'ret', 'bin']

Label distribution:
bin
-1    114356
 0     96788
 1    121673
Name: count, dtype: int64


Unnamed: 0,t1,window,slope,t_value,rsquared,ret,bin
2022-09-02 00:05:00,2022-09-02 00:25:00,5,0.0,0.0,0.0,0.000372,0
2022-09-02 00:10:00,2022-09-02 00:30:00,5,0.0,0.0,0.0,0.000307,0
2022-09-02 00:15:00,2022-09-02 00:35:00,5,0.0,0.0,0.0,0.000312,0


In [4]:
# Load EURUSD M5 time bars for trend-scanning experiment
# Following the article: EURUSD M5 from 2018-01-01 to 2021-12-31 (train/validation)
print("Loading EURUSD M5 time bars...")
df = load_bars('EURUSD', 'time')

print(f"✓ Bars loaded: {len(df):,} observations")
print(f"  Date range: {df.index[0]} to {df.index[-1]}")
print(f"  Columns: {list(df.columns)}")

# Calculate 20-day EWM volatility (as per article)
target = get_daily_vol(df['close'], lookback=20)
print(f"\n✓ Volatility calculated (20-day EWM)")
print(f"  Mean volatility: {target.mean():.6f}")
print(f"  Median volatility: {target.median():.6f}")

# Set CUSUM filter threshold (as per article methodology)
cusum_filter_threshold = target.mean()
print(f"\n✓ CUSUM filter threshold set to mean volatility: {cusum_filter_threshold:.6f}")


INFO: Loading time bars from: EURUSD_time_bars_20251102_145959.csv


Loading EURUSD M5 time bars...


INFO: Loaded 332,916 time bars
INFO:   Start: 2022-09-02 00:05:00
INFO:   End: 2025-10-31 23:00:00
INFO:   Columns: ['open', 'high', 'low', 'close', 'tick_volume', 'bid_open', 'bid_high', 'bid_low', 'bid_close', 'ask_open', 'ask_high', 'ask_low', 'ask_close']


✓ Bars loaded: 332,916 observations
  Date range: 2022-09-02 00:05:00 to 2025-10-31 23:00:00
  Columns: ['open', 'high', 'low', 'close', 'tick_volume', 'bid_open', 'bid_high', 'bid_low', 'bid_close', 'ask_open', 'ask_high', 'ask_low', 'ask_close']

✓ Volatility calculated (20-day EWM)
  Mean volatility: 0.000581
  Median volatility: 0.000437

✓ CUSUM filter threshold set to mean volatility: 0.000581


In [15]:
# Create a strategy instance
strategy = MACrossoverStrategy(fast_window=20, slow_window=50)

# Generate entry signals
sides, t_events = get_entries(
    strategy=strategy,
    data=df,
    filter_events=True,
    filter_threshold=cusum_filter_threshold,
    on_crossover=True,
)

print(f"Generated {len(t_events):,} entry events")
print(f"\nSide distribution:")
print(sides.value_counts())
print(f"\nFirst few events:")
print(t_events[:5])

INFO: 30,899 CUSUM-filtered events


[32m2025-11-02 17:13:24.979[0m | [1mINFO    [0m | [36msignal_processing[0m:[36mget_entries[0m:[36m148[0m - [1mGenerated 830 trade events generated by crossover + CUSUM filter.[0m


Generated 830 entry events

Side distribution:
side
-1    180850
 1    152017
 0        49
Name: count, dtype: int64

First few events:
DatetimeIndex(['2022-09-02 04:10:00', '2022-09-05 00:10:00',
               '2022-09-05 05:25:00', '2022-09-05 10:40:00',
               '2022-09-06 01:15:00'],
              dtype='datetime64[ns]', name='time', freq=None)


In [17]:
# Step 3: Create meta-labels by combining MA side with trend-scanning bin
# Find intersection of t_events and available labels
common_timestamps = t_events.intersection(labels.index)

print(f"MA crossover events: {len(t_events):,}")
print(f"Trend-scanning labels: {len(labels):,}")
print(f"Common timestamps: {len(common_timestamps):,}")

if len(common_timestamps) == 0:
    raise ValueError("No common timestamps between t_events and labels! Check data alignment.")

# Filter to only common timestamps
meta_labels = labels.loc[common_timestamps].copy()

# Create meta-labels: 1 if MA side agrees with trend-scanning bin, 0 otherwise
# MA side: 1 (long) or -1 (short)
# Trend bin: 1 (up), -1 (down), 0 (neutral)
# Agreement: side * bin > 0 means they agree in direction
meta_labels['side'] = sides.loc[common_timestamps].values
meta_labels['meta_label'] = ((meta_labels['side'] * meta_labels['bin']) > 0).astype(int)

print(f"\n✓ Meta-labels created: {len(meta_labels):,}")
print(f"\nMeta-label distribution:")
print(meta_labels['meta_label'].value_counts().sort_index())
print(f"\nBreakdown by strategy side:")
for side_val in [-1, 1]:
    side_name = "SHORT" if side_val == -1 else "LONG"
    side_data = meta_labels[meta_labels['side'] == side_val]
    if len(side_data) > 0:
        agree_pct = (side_data['meta_label'] == 1).sum() / len(side_data) * 100
        print(f"  {side_name}: {len(side_data):,} events, {agree_pct:.1f}% agree with trend-scanning")

meta_labels.head(10)

MA crossover events: 830
Trend-scanning labels: 332,817
Common timestamps: 829

✓ Meta-labels created: 829

Meta-label distribution:
meta_label
0    420
1    409
Name: count, dtype: int64

Breakdown by strategy side:
  SHORT: 445 events, 46.5% agree with trend-scanning
  LONG: 384 events, 52.6% agree with trend-scanning


Unnamed: 0,t1,window,slope,t_value,rsquared,ret,bin,side,meta_label
2022-09-02 04:10:00,2022-09-02 11:55:00,94,3.6e-05,16.660364,0.751234,0.00247,1,1,1
2022-09-05 00:10:00,2022-09-05 08:20:00,99,-2.3e-05,-10.172517,0.516391,-0.004134,-1,-1,1
2022-09-05 05:25:00,2022-09-05 06:25:00,13,0.000116,8.646405,0.871797,0.001404,1,-1,0
2022-09-05 10:40:00,2022-09-05 11:10:00,7,-0.000277,-6.244016,0.886383,-0.001391,-1,1,0
2022-09-06 01:15:00,2022-09-06 04:45:00,43,7.6e-05,16.366613,0.867309,0.002278,1,1,1
2022-09-07 02:05:00,2022-09-07 03:15:00,15,7.4e-05,9.322426,0.869982,0.000743,1,-1,0
2022-09-07 08:10:00,2022-09-07 11:55:00,46,6.6e-05,7.88899,0.585894,0.001515,1,1,1
2022-09-07 17:10:00,2022-09-08 01:20:00,99,9.3e-05,23.884285,0.854737,0.009876,1,1,1
2022-09-08 02:30:00,2022-09-08 04:05:00,20,-6.8e-05,-7.304482,0.747843,-0.00077,-1,-1,1
2022-09-09 10:30:00,2022-09-09 18:40:00,99,-7.7e-05,-20.979967,0.819501,-0.006965,-1,1,0


In [None]:
# Merge features with meta-labels (on timestamp)
print("Merging features with meta-labels...")
print(f"  Features: {len(features):,} rows")
print(f"  Meta-labels: {len(meta_labels):,} rows")

# Inner join on index (timestamp) - only keep rows with both features and meta-labels
data = features.join(meta_labels, how='inner', rsuffix='_meta')

print(f"\n✓ Merged dataset: {data.shape}")
print(f"  Date range: {data.index[0]} to {data.index[-1]}")
print(f"  Lost {len(features) - len(data):,} observations due to missing meta-labels")

# Verify no missing values in critical columns
print(f"\nMissing values check:")
print(f"  Features: {data[features.columns].isnull().sum().sum()}")
print(f"  Meta-label (meta_label): {data['meta_label'].isnull().sum()}")
print(f"  Weights (t_value): {data['t_value'].isnull().sum()}")
print(f"  Side: {data['side'].isnull().sum()}")

print(f"\nColumns in merged dataset:")
print(f"  Total: {len(data.columns)}")
print(f"  Features: {len(features.columns)}")
print(f"  Meta-label columns: {[col for col in data.columns if col not in features.columns]}")

data.head(3)

Merging features with labels...
  Features: 332,916 rows
  Labels: 332,817 rows

✓ Merged dataset: (332817, 103)
  Date range: 2022-09-02 00:05:00 to 2025-10-31 14:45:00
  Lost 99 observations due to missing labels

Missing values check:
  Features: 0
  Labels (bin): 0
  Weights (t_value): 0


Unnamed: 0,ma_10,ma_20,ma_50,ma_100,ma_200,ma_10_20_cross,ma_20_50_cross,ma_50_200_cross,ma_spread_10_20,ma_spread_20_50,...,fractal_trend_confirmation,distance_to_fractal_resistance,distance_to_fractal_support,t1,window,slope,t_value,rsquared,ret,bin
2022-09-02 00:05:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2022-09-02 00:25:00,5,0.0,0.0,0.0,0.000372,0
2022-09-02 00:10:00,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,0.0,...,0.0,0.0,0.0,2022-09-02 00:30:00,5,0.0,0.0,0.0,0.000307,0
2022-09-02 00:15:00,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,0.0,...,0.0,0.0,0.0,2022-09-02 00:35:00,5,0.0,0.0,0.0,0.000312,0


## 3. Prepare Features, Labels, and Weights

In [None]:
# Separate features, labels, and sample weights
feature_cols = features.columns.tolist()
X = data[feature_cols].copy()
y = data['meta_label'].copy()  # Use meta_label (0 or 1) instead of bin (-1, 0, 1)
sample_weights = data['t_value'].abs().copy()  # Use absolute t-value as weight
label_metadata = data[['t1', 'window', 'slope', 't_value', 'rsquared', 'ret', 'bin', 'side']].copy()

print("="*80)
print("META-LABELING DATASET COMPOSITION")
print("="*80)
print(f"Features (X): {X.shape}")
print(f"Meta-labels (y): {y.shape}")
print(f"Sample weights: {sample_weights.shape}")
print(f"Label metadata: {label_metadata.shape}")

print(f"\nMeta-label distribution:")
label_counts = y.value_counts().sort_index()
for label, count in label_counts.items():
    pct = count / len(y) * 100
    label_name = {0: 'SKIP (disagree)', 1: 'TAKE (agree)'}.get(label, 'UNKNOWN')
    print(f"  {label_name:20s} ({label:2d}): {count:6,} ({pct:5.2f}%)")

print(f"\nOriginal trend-scanning bin distribution:")
bin_counts = data['bin'].value_counts().sort_index()
for bin_val, count in bin_counts.items():
    pct = count / len(data) * 100
    bin_name = {-1: 'DOWN', 0: 'NEUTRAL', 1: 'UP'}.get(bin_val, 'UNKNOWN')
    print(f"  {bin_name:8s} ({bin_val:2d}): {count:6,} ({pct:5.2f}%)")

print(f"\nStrategy side distribution:")
side_counts = data['side'].value_counts().sort_index()
for side_val, count in side_counts.items():
    pct = count / len(data) * 100
    side_name = {-1: 'SHORT', 1: 'LONG'}.get(side_val, 'UNKNOWN')
    print(f"  {side_name:8s} ({side_val:2d}): {count:6,} ({pct:5.2f}%)")

print(f"\nSample weights statistics:")
print(f"  Mean: {sample_weights.mean():.4f}")
print(f"  Median: {sample_weights.median():.4f}")
print(f"  Min: {sample_weights.min():.4f}")
print(f"  Max: {sample_weights.max():.4f}")
print(f"  Std: {sample_weights.std():.4f}")

DATASET COMPOSITION
Features (X): (332817, 96)
Labels (y): (332817,)
Sample weights: (332817,)
Label metadata: (332817, 6)

Label distribution:
  DOWN     (-1): 114,356 (34.36%)
  NEUTRAL  ( 0): 96,788 (29.08%)
  UP       ( 1): 121,673 (36.56%)

Sample weights statistics:
  Mean: 10.4319
  Median: 10.4436
  Min: 0.0000
  Max: 205.0355
  Std: 8.8540


## 4. Feature Normalization (MinMax Scaling)

**Important:** We normalize features to [0, 1] range to:
- Ensure all features contribute equally to the model
- Improve convergence for tree-based models
- Make feature importance more interpretable

**Note:** We fit the scaler on training data and transform both train and test to prevent data leakage.

In [8]:
# We'll normalize after train/test split to prevent leakage
# For now, just check for any extreme values or issues

print("Feature statistics check:")
print(f"  Features with inf: {np.isinf(X).sum().sum()}")
print(f"  Features with NaN: {X.isnull().sum().sum()}")

# Check for constant columns (would cause issues in normalization)
constant_cols = [col for col in X.columns if X[col].nunique() <= 1]
if constant_cols:
    print(f"\n⚠ WARNING - Constant columns detected ({len(constant_cols)}): {constant_cols}")
    print(f"  These will be removed before normalization")
    X = X.drop(columns=constant_cols)
    feature_cols = X.columns.tolist()
else:
    print(f"  ✓ No constant columns detected")

print(f"\nFinal feature count: {len(feature_cols)}")

Feature statistics check:
  Features with inf: 0
  Features with NaN: 0

  These will be removed before normalization

Final feature count: 95


## 5. Time-Based Train/Test Split

**Critical for time-series:**
- No shuffling (preserves temporal order)
- Train on earlier data, test on later data
- Simulates real-world deployment scenario
- Standard split: 70% train, 30% test

In [9]:
# Time-based split (70/30)
split_ratio = 0.70
split_idx = int(len(data) * split_ratio)

# Split data
X_train = X.iloc[:split_idx].copy()
X_test = X.iloc[split_idx:].copy()
y_train = y.iloc[:split_idx].copy()
y_test = y.iloc[split_idx:].copy()
weights_train = sample_weights.iloc[:split_idx].copy()
weights_test = sample_weights.iloc[split_idx:].copy()
metadata_train = label_metadata.iloc[:split_idx].copy()
metadata_test = label_metadata.iloc[split_idx:].copy()

print("="*80)
print("TRAIN/TEST SPLIT (Time-Based, No Shuffling)")
print("="*80)
print(f"Split ratio: {split_ratio:.0%} train / {1-split_ratio:.0%} test")
print(f"Split index: {split_idx:,}")
print(f"\nTrain set:")
print(f"  Shape: {X_train.shape}")
print(f"  Date range: {X_train.index[0]} to {X_train.index[-1]}")
print(f"  Label distribution:")
for label, count in y_train.value_counts().sort_index().items():
    pct = count / len(y_train) * 100
    label_name = {-1: 'DOWN', 0: 'NEUTRAL', 1: 'UP'}.get(label, 'UNKNOWN')
    print(f"    {label_name:8s} ({label:2d}): {count:6,} ({pct:5.2f}%)")

print(f"\nTest set:")
print(f"  Shape: {X_test.shape}")
print(f"  Date range: {X_test.index[0]} to {X_test.index[-1]}")
print(f"  Label distribution:")
for label, count in y_test.value_counts().sort_index().items():
    pct = count / len(y_test) * 100
    label_name = {-1: 'DOWN', 0: 'NEUTRAL', 1: 'UP'}.get(label, 'UNKNOWN')
    print(f"    {label_name:8s} ({label:2d}): {count:6,} ({pct:5.2f}%)")

TRAIN/TEST SPLIT (Time-Based, No Shuffling)
Split ratio: 70% train / 30% test
Split index: 232,971

Train set:
  Shape: (232971, 95)
  Date range: 2022-09-02 00:05:00 to 2024-11-18 22:15:00
  Label distribution:
    DOWN     (-1): 80,002 (34.34%)
    NEUTRAL  ( 0): 67,858 (29.13%)
    UP       ( 1): 85,111 (36.53%)

Test set:
  Shape: (99846, 95)
  Date range: 2024-11-18 22:20:00 to 2025-10-31 14:45:00
  Label distribution:
    DOWN     (-1): 34,354 (34.41%)
    NEUTRAL  ( 0): 28,930 (28.97%)
    UP       ( 1): 36,562 (36.62%)


## 6. Normalize Features with MinMax Scaler

**Fit on train, transform both train and test** to prevent data leakage.

In [10]:
# Initialize scaler
scaler = MinMaxScaler()

# Fit scaler on training data only
print("Fitting MinMaxScaler on training data...")
scaler.fit(X_train)

# Transform both train and test
print("Transforming features to [0, 1] range...")
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train),
    index=X_train.index,
    columns=X_train.columns
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    index=X_test.index,
    columns=X_test.columns
)

print(f"✓ Normalization complete")
print(f"\nScaled feature statistics (train):")
print(f"  Min: {X_train_scaled.min().min():.6f}")
print(f"  Max: {X_train_scaled.max().max():.6f}")
print(f"  Mean: {X_train_scaled.mean().mean():.6f}")
print(f"  Std: {X_train_scaled.std().mean():.6f}")

print(f"\nScaled feature statistics (test):")
print(f"  Min: {X_test_scaled.min().min():.6f}")
print(f"  Max: {X_test_scaled.max().max():.6f}")
print(f"  Mean: {X_test_scaled.mean().mean():.6f}")
print(f"  Std: {X_test_scaled.std().mean():.6f}")

# Verify no NaN or inf after scaling
print(f"\nData quality check (post-scaling):")
print(f"  Train - NaN: {X_train_scaled.isnull().sum().sum()}, Inf: {np.isinf(X_train_scaled).sum().sum()}")
print(f"  Test - NaN: {X_test_scaled.isnull().sum().sum()}, Inf: {np.isinf(X_test_scaled).sum().sum()}")

Fitting MinMaxScaler on training data...
Transforming features to [0, 1] range...
✓ Normalization complete

Scaled feature statistics (train):
  Min: 0.000000
  Max: 1.000000
  Mean: 0.395174
  Std: 0.199742

Scaled feature statistics (test):
  Min: -0.355090
  Max: 1.365021
  Mean: 0.404283
  Std: 0.204104

Data quality check (post-scaling):
  Train - NaN: 0, Inf: 0
  Test - NaN: 0, Inf: 0


## 7. Save Processed Datasets

Save **ready-to-use** datasets for model training:
- Features (normalized)
- Labels
- Sample weights (t-values)
- Label metadata (for analysis)

In [11]:
# Create output directory
output_dir = Path('data/training')
output_dir.mkdir(exist_ok=True)

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

print("="*80)
print("SAVING PROCESSED DATASETS")
print("="*80)

# Save train set
train_features_file = output_dir / f'X_train_trend_scanning_{timestamp}.csv'
train_labels_file = output_dir / f'y_train_trend_scanning_{timestamp}.csv'
train_weights_file = output_dir / f'weights_train_trend_scanning_{timestamp}.csv'
train_metadata_file = output_dir / f'metadata_train_trend_scanning_{timestamp}.csv'

X_train_scaled.to_csv(train_features_file)
y_train.to_csv(train_labels_file, header=True)
weights_train.to_csv(train_weights_file, header=True)
metadata_train.to_csv(train_metadata_file)

print(f"Train set saved:")
print(f"  Features: {train_features_file.name}")
print(f"  Labels: {train_labels_file.name}")
print(f"  Weights: {train_weights_file.name}")
print(f"  Metadata: {train_metadata_file.name}")

# Save test set
test_features_file = output_dir / f'X_test_trend_scanning_{timestamp}.csv'
test_labels_file = output_dir / f'y_test_trend_scanning_{timestamp}.csv'
test_weights_file = output_dir / f'weights_test_trend_scanning_{timestamp}.csv'
test_metadata_file = output_dir / f'metadata_test_trend_scanning_{timestamp}.csv'

X_test_scaled.to_csv(test_features_file)
y_test.to_csv(test_labels_file, header=True)
weights_test.to_csv(test_weights_file, header=True)
metadata_test.to_csv(test_metadata_file)

print(f"\nTest set saved:")
print(f"  Features: {test_features_file.name}")
print(f"  Labels: {test_labels_file.name}")
print(f"  Weights: {test_weights_file.name}")
print(f"  Metadata: {test_metadata_file.name}")

# Save feature names for reference
feature_names_file = output_dir / f'feature_names_trend_scanning_{timestamp}.txt'
with open(feature_names_file, 'w') as f:
    f.write("Trend-Scanning Feature Names\n")
    f.write("="*80 + "\n\n")
    for i, col in enumerate(X_train_scaled.columns, 1):
        f.write(f"{i}. {col}\n")

print(f"\nFeature names: {feature_names_file.name}")

print("\n" + "="*80)
print("✓ ALL DATASETS SAVED SUCCESSFULLY")
print("="*80)
print(f"\nDatasets ready for model training in: {output_dir}")
print(f"\nNext step: Create models.ipynb to train Random Forest with:")
print(f"  - Weighted samples (t-values)")
print(f"  - PurgedKFold cross-validation")
print(f"  - BaggingClassifier for ensemble")

SAVING PROCESSED DATASETS
Train set saved:
  Features: X_train_trend_scanning_20251102_171209.csv
  Labels: y_train_trend_scanning_20251102_171209.csv
  Weights: weights_train_trend_scanning_20251102_171209.csv
  Metadata: metadata_train_trend_scanning_20251102_171209.csv

Test set saved:
  Features: X_test_trend_scanning_20251102_171209.csv
  Labels: y_test_trend_scanning_20251102_171209.csv
  Weights: weights_test_trend_scanning_20251102_171209.csv
  Metadata: metadata_test_trend_scanning_20251102_171209.csv

Feature names: feature_names_trend_scanning_20251102_171209.txt

✓ ALL DATASETS SAVED SUCCESSFULLY

Datasets ready for model training in: data\training

Next step: Create models.ipynb to train Random Forest with:
  - Weighted samples (t-values)
  - PurgedKFold cross-validation
  - BaggingClassifier for ensemble


## 8. Final Dataset Summary

In [12]:
print("="*80)
print("FINAL DATASET SUMMARY")
print("="*80)

print(f"\nTrain Set:")
print(f"  Observations: {len(X_train_scaled):,}")
print(f"  Features: {X_train_scaled.shape[1]}")
print(f"  Date range: {X_train_scaled.index[0]} to {X_train_scaled.index[-1]}")
print(f"  Label distribution:")
for label, count in y_train.value_counts().sort_index().items():
    pct = count / len(y_train) * 100
    label_name = {-1: 'DOWN', 0: 'NEUTRAL', 1: 'UP'}.get(label, 'UNKNOWN')
    print(f"    {label_name}: {count:,} ({pct:.2f}%)")
print(f"  Sample weights: mean={weights_train.mean():.4f}, std={weights_train.std():.4f}")

print(f"\nTest Set:")
print(f"  Observations: {len(X_test_scaled):,}")
print(f"  Features: {X_test_scaled.shape[1]}")
print(f"  Date range: {X_test_scaled.index[0]} to {X_test_scaled.index[-1]}")
print(f"  Label distribution:")
for label, count in y_test.value_counts().sort_index().items():
    pct = count / len(y_test) * 100
    label_name = {-1: 'DOWN', 0: 'NEUTRAL', 1: 'UP'}.get(label, 'UNKNOWN')
    print(f"    {label_name}: {count:,} ({pct:.2f}%)")
print(f"  Sample weights: mean={weights_test.mean():.4f}, std={weights_test.std():.4f}")

print(f"\nFeature Normalization:")
print(f"  Method: MinMaxScaler")
print(f"  Range: [0, 1]")
print(f"  Fitted on: Train set only")
print(f"  Applied to: Both train and test")

print(f"\nSample Weighting:")
print(f"  Method: Absolute t-values from trend-scanning")
print(f"  Purpose: Emphasize statistically significant trends")
print(f"  Range: [{sample_weights.min():.4f}, {sample_weights.max():.4f}]")

print(f"\n✓ Data preparation complete!")
print(f"✓ Ready for Random Forest training with weighted samples")

FINAL DATASET SUMMARY

Train Set:
  Observations: 232,971
  Features: 95
  Date range: 2022-09-02 00:05:00 to 2024-11-18 22:15:00
  Label distribution:
    DOWN: 80,002 (34.34%)
    NEUTRAL: 67,858 (29.13%)
    UP: 85,111 (36.53%)
  Sample weights: mean=10.4378, std=8.8548

Test Set:
  Observations: 99,846
  Features: 95
  Date range: 2024-11-18 22:20:00 to 2025-10-31 14:45:00
  Label distribution:
    DOWN: 34,354 (34.41%)
    NEUTRAL: 28,930 (28.97%)
    UP: 36,562 (36.62%)
  Sample weights: mean=10.4182, std=8.8522

Feature Normalization:
  Method: MinMaxScaler
  Range: [0, 1]
  Fitted on: Train set only
  Applied to: Both train and test

Sample Weighting:
  Method: Absolute t-values from trend-scanning
  Purpose: Emphasize statistically significant trends
  Range: [0.0000, 205.0355]

✓ Data preparation complete!
✓ Ready for Random Forest training with weighted samples


In [13]:
# Quick preview of prepared data
print("Sample of prepared training data:")
print("\nFeatures (first 3 rows, first 10 columns):")
print(X_train_scaled.iloc[:3, :10])
print("\nLabels (first 10):")
print(y_train.head(10))
print("\nSample weights (first 10):")
print(weights_train.head(10))

Sample of prepared training data:

Features (first 3 rows, first 10 columns):
                     ma_10  ma_20  ma_50  ma_100  ma_200  ma_10_20_cross  \
2022-09-02 00:05:00    0.0    0.0    0.0     0.0     0.0             0.5   
2022-09-02 00:10:00    0.0    0.0    0.0     0.0     0.0             0.0   
2022-09-02 00:15:00    0.0    0.0    0.0     0.0     0.0             0.0   

                     ma_20_50_cross  ma_50_200_cross  ma_spread_10_20  \
2022-09-02 00:05:00             0.5              0.5         0.023328   
2022-09-02 00:10:00             0.0              0.0         0.023328   
2022-09-02 00:15:00             0.0              0.0         0.023328   

                     ma_spread_20_50  
2022-09-02 00:05:00         0.909814  
2022-09-02 00:10:00         0.909814  
2022-09-02 00:15:00         0.909814  

Labels (first 10):
2022-09-02 00:05:00    0
2022-09-02 00:10:00    0
2022-09-02 00:15:00    0
2022-09-02 00:20:00    1
2022-09-02 00:25:00    1
2022-09-02 00:30:00    