# Feature Engineering Experiments

Test which features actually predict future price direction.

**Key insight from r/algotrading**: Indicators (RSI, MAs, BBs, Donchian, Keltner) all do the same thing: they use the past to make a statement about the present. Don't obsess over which ones; obsess over whether they predict anything.
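
To make the point above concrete, here is a minimal sketch on a synthetic random walk (not real prices). It computes a simple-moving-average variant of RSI and the distance of price from its moving average, then checks how correlated the two are. The 14-bar window, the seed, and all variable names are assumptions chosen for illustration, and the RSI here uses plain rolling means rather than Wilder's smoothing.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk "prices" (assumption: illustration only, not market data)
rng = np.random.default_rng(42)
close = pd.Series(100 + rng.normal(size=2000).cumsum())

window = 14
delta = close.diff()

# RSI variant built from simple rolling means of gains and losses
avg_gain = delta.clip(lower=0).rolling(window).mean()
avg_loss = (-delta.clip(upper=0)).rolling(window).mean()
rsi = 100 * avg_gain / (avg_gain + avg_loss)

# Distance of price from its simple moving average
sma_distance = close - close.rolling(window).mean()

# Both are functions of the same recent price changes, so they move together
indicator_corr = rsi.corr(sma_distance)
print(f'Correlation between RSI and SMA distance: {indicator_corr:.2f}')
```

If two indicators are this closely related, adding both tells a model little more than adding one, which is why the rest of this notebook focuses on predictive power rather than indicator choice.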

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.insert(0, '../..')

from utils.features import compute_features, forward_return_direction
from utils.data_loaders import load_yahoo

plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', 50)

## 1. Load Sample Data

In [None]:
# Download diverse sector stocks
tickers = ['AAPL', 'MSFT', 'JPM', 'JNJ', 'XOM', 'AMZN', 'NVDA', 'KO', 'CAT', 'META']
df = load_yahoo(tickers, start='2019-01-01', end='2024-12-31')
print(f'Loaded: {len(df)} rows, {df["ticker"].nunique()} tickers')
df.head()

## 2. Compute Features

In [None]:
# Compute features per ticker
featured_dfs = []
for ticker, group in df.groupby('ticker'):
    group = group.sort_values('date').reset_index(drop=True)
    featured = compute_features(group)
    featured['ticker'] = ticker
    # Add target
    featured['target'] = forward_return_direction(featured['close'], periods=5)  # 5-day direction
    featured_dfs.append(featured)

featured_df = pd.concat(featured_dfs, ignore_index=True).dropna()
print(f'Feature matrix: {featured_df.shape}')
print(f'Target distribution: {featured_df["target"].value_counts().to_dict()}')
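
The implementation of `forward_return_direction` lives in `utils.features` and is not shown here. As a rough sketch of what a 5-day direction label could look like, the hypothetical helper below (`forward_direction_sketch`, not the project's function) marks a row 1.0 when the close is higher `periods` rows ahead and leaves the final rows as NaN because their future is unknown. Computing it per ticker, as the loop above does, keeps the shift from crossing ticker boundaries.

```python
import pandas as pd

def forward_direction_sketch(close: pd.Series, periods: int = 5) -> pd.Series:
    """Hypothetical stand-in: 1.0 if close is higher `periods` rows ahead, else 0.0.

    The last `periods` rows have no future price and are returned as NaN,
    so a later dropna() removes them instead of labelling them 0.
    """
    future_return = close.shift(-periods) / close - 1.0
    return (future_return > 0).astype(float).where(future_return.notna())

demo_close = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0, 98.0, 104.0])
demo_target = forward_direction_sketch(demo_close, periods=2)
print(demo_target.tolist())  # [0.0, 1.0, 1.0, 0.0, 1.0, nan, nan]
```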

## 3. Feature-Target Correlation

Which features actually correlate with future direction?

In [None]:
exclude = ['date', 'ticker', 'target', 'open', 'high', 'low', 'close', 'volume', 'adj_close']
feature_cols = [c for c in featured_df.columns if c not in exclude]

# Correlation with target
correlations = featured_df[feature_cols + ['target']].corr()['target'].drop('target').sort_values()

fig, ax = plt.subplots(figsize=(10, 8))
correlations.plot(kind='barh', ax=ax)
ax.set_title('Feature Correlation with 5-day Forward Direction')
ax.axvline(x=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()

print('\nTop positive correlations:')
print(correlations.tail(10))
print('\nTop negative correlations:')
print(correlations.head(10))
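
Correlations with a noisy binary target tend to be small, so it helps to ask whether a given value is distinguishable from chance. One possible check is a permutation test: shuffle the target many times and see how often a shuffled correlation is at least as large as the observed one. The sketch below runs on synthetic data; `permutation_pvalue` and the demo arrays are illustrative names, not part of this project. Note that shuffling treats rows as independent, which overlapping 5-day targets are not, so p-values on the real data would likely be optimistic.

```python
import numpy as np

def permutation_pvalue(feature, target, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of feature and target."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(feature, target)[0, 1]
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffling the target breaks any real link to the feature
        null[i] = np.corrcoef(feature, rng.permutation(target))[0, 1]
    # The +1 terms count the observed statistic itself and keep p-values above zero
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_permutations + 1)
    return observed, p_value

# Synthetic demo: one feature with a weak real signal, one pure-noise feature
rng = np.random.default_rng(0)
n_rows = 2000
signal_feature = rng.normal(size=n_rows)
noise_feature = rng.normal(size=n_rows)
demo_direction = (0.3 * signal_feature + rng.normal(size=n_rows) > 0).astype(float)

signal_corr, signal_p = permutation_pvalue(signal_feature, demo_direction)
noise_corr, noise_p = permutation_pvalue(noise_feature, demo_direction)
print(f'signal: corr={signal_corr:.3f}, p={signal_p:.4f}')
print(f'noise:  corr={noise_corr:.3f}, p={noise_p:.4f}')
```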

## 4. Feature Predictive Power (LightGBM Importance)

In [None]:
from sklearn.model_selection import TimeSeriesSplit
import lightgbm as lgb

X = featured_df[feature_cols].values
y = featured_df['target'].values

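# Train a quick model on the full sample to get (in-sample) feature importance.
# importance_type='gain' makes feature_importances_ report total gain; the default is split counts.
model = lgb.LGBMClassifier(n_estimators=200, max_depth=6, learning_rate=0.05,
                           importance_type='gain', verbose=-1)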
model.fit(X, y)

importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

fig, ax = plt.subplots(figsize=(10, 8))
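# Reverse the order so the most important feature is drawn at the top of the bar chart
importance.head(20).iloc[::-1].plot(x='feature', y='importance', kind='barh', ax=ax, legend=False)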
ax.set_title('Top 20 Feature Importances (LightGBM gain)')
plt.tight_layout()
plt.show()
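
The model above is fit on the full sample, so its importances say what the trees used, not what generalizes. A walk-forward check with `TimeSeriesSplit` (imported above) is one way to test out-of-sample. The sketch below uses synthetic data and scikit-learn's `LogisticRegression` as a lightweight stand-in for the LightGBM model; both are assumptions made to keep the example small. On the real data, `featured_df` would first need sorting by `date`, because `TimeSeriesSplit` splits by row position and the concatenated frame is ordered by ticker. The `gap` of 50 rows is also an assumption: roughly 5 days times 10 tickers, to keep overlapping 5-day targets out of the training fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered demo data: column 0 carries a weak signal, the rest is noise
rng = np.random.default_rng(0)
n_rows = 3000
X_demo = rng.normal(size=(n_rows, 3))
y_demo = (0.5 * X_demo[:, 0] + rng.normal(size=n_rows) > 0).astype(int)

splitter = TimeSeriesSplit(n_splits=5, gap=50)
fold_accuracy = []
for train_idx, test_idx in splitter.split(X_demo):
    # Each fold trains only on rows that come before the test block
    clf = LogisticRegression()
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_accuracy.append(clf.score(X_demo[test_idx], y_demo[test_idx]))

print('Out-of-sample accuracy per fold:', np.round(fold_accuracy, 3))
print('Mean:', round(float(np.mean(fold_accuracy)), 3))
```

Accuracy that stays near 50% across folds would suggest the features carry little usable signal, whatever the in-sample importance chart shows.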

## 5. Key Takeaways

Fill in after running:
- Which features have predictive power?
- Are momentum features (returns) more useful than level features (RSI, MACD)?
- Do cross-stock features add value?
- How stable is feature importance across time?
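
For the stability question above, one simple starting point is to recompute each feature's correlation with the target year by year and check whether the sign and size hold up. The sketch below uses a synthetic frame with hypothetical column names (`stable_feature`, `noise_feature`); on the real data, `featured_df` and `feature_cols` would take their place, assuming `date` is a datetime column.

```python
import numpy as np
import pandas as pd

# Synthetic demo frame (assumption: stands in for featured_df)
rng = np.random.default_rng(1)
dates = pd.bdate_range('2019-01-01', '2022-12-30')
demo = pd.DataFrame({
    'date': dates,
    'stable_feature': rng.normal(size=len(dates)),
    'noise_feature': rng.normal(size=len(dates)),
})
demo['target'] = (0.8 * demo['stable_feature'] + rng.normal(size=len(dates)) > 0).astype(int)

feature_names = ['stable_feature', 'noise_feature']

# One row per calendar year, one column per feature
yearly_corr = pd.DataFrame({
    year: group[feature_names].corrwith(group['target'])
    for year, group in demo.groupby(demo['date'].dt.year)
}).T

print(yearly_corr.round(3))
# A feature whose correlation keeps the same sign every year is a better candidate
print((np.sign(yearly_corr).nunique() == 1).rename('same_sign_every_year'))
```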