# 6. Create Feature View for Model Training

Join all engineered feature groups and create targets for QQQ prediction.

**Pipeline**: Hopsworks FGs (engineered) → Feature View with Targets

**Key Components**:
- **Feature Join**: Time-aligned merge of all feature groups on date
- **Target Creation**: Next-day return (regression) and direction (classification)
- **Missing Value Handling**: Drop NaN from rolling windows, fill sentiment with 0
- **Feature View**: Hopsworks object for training data access

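**Critical**: Targets use `.shift(-1)`, which pulls the next day's close into the current row. This is intentional and applies to the labels only. The last row has no next-day close, so its target is NaN and the row is dropped.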

In [None]:
import sys
sys.path.append('..')

import pandas as pd
from utils.hopsworks_helpers import get_feature_store, create_feature_group
from dotenv import load_dotenv
import yaml

load_dotenv()

# Load config
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

## Connect to Hopsworks and Load Feature Groups

In [None]:
# Connect to Hopsworks
print("Connecting to Hopsworks...")
fs = get_feature_store()
print(f"✓ Connected to feature store: {fs.name}")

In [None]:
# Load required feature groups
print("\nLoading feature groups...")

qqq_tech_fg = fs.get_feature_group('qqq_technical_features', version=1)
xlk_sector_fg = fs.get_feature_group('xlk_sector_features', version=1)
vix_vol_fg = fs.get_feature_group('vix_volatility_features', version=1)
macro_fg = fs.get_feature_group('macro_features', version=1)
qqq_raw_fg = fs.get_feature_group('qqq_raw', version=1)

print("✓ Required feature groups loaded")

# Try to load optional sentiment features
has_sentiment = False
try:
    sentiment_fg = fs.get_feature_group('sentiment_features', version=1)
    has_sentiment = True
    print("✓ Sentiment features loaded (optional)")
except Exception as e:
    print(f"⚠ Sentiment features not found (optional): {e}")
    print("  Proceeding without sentiment data...")

## Read and Preview Feature Groups

In [None]:
# Read all feature groups
print("\nReading feature groups...")

qqq_tech = qqq_tech_fg.read()
xlk_sector = xlk_sector_fg.read()
vix_vol = vix_vol_fg.read()
macro = macro_fg.read()
qqq_raw = qqq_raw_fg.read()

print(f"✓ QQQ technical features: {qqq_tech.shape}")
print(f"✓ XLK sector features: {xlk_sector.shape}")
print(f"✓ VIX volatility features: {vix_vol.shape}")
print(f"✓ Macro features: {macro.shape}")
print(f"✓ QQQ raw (for target): {qqq_raw.shape}")

if has_sentiment:
    sentiment = sentiment_fg.read()
    print(f"✓ Sentiment features: {sentiment.shape}")

In [None]:
# Preview feature groups
print("\nQQQ Technical Features:")
print(qqq_tech.head())
print(f"\nColumns: {qqq_tech.columns.tolist()}")

## Join All Feature Groups

Perform time-aligned left joins on the date column:
- Start with QQQ technical features (base)
- Left join all other feature groups
- Sentiment columns left empty by the join are filled with 0 in the missing-value step (no news that day)
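
As a hedged illustration of the join pattern described above, the sketch below merges two toy frames on `date`. The column names and values are made up for the example and are not taken from the real feature groups. The `validate="one_to_one"` argument is an optional safeguard that this notebook does not use: it makes pandas raise an error if either side has duplicate dates, which would otherwise silently multiply rows in a left join.

```python
import pandas as pd

# Toy frames standing in for two feature groups (illustrative values only)
base = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
    "return_1d": [0.010, -0.004, 0.007],
})
other = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-04"]),
    "return_1d": [0.012, 0.003],      # same name as a column in `base`
    "sector_corr": [0.91, 0.88],
})

# Left join keeps every row of `base`; overlapping column names from the
# right side get the "_xlk" suffix, and unmatched dates become NaN.
joined = base.merge(
    other, on="date", how="left", suffixes=("", "_xlk"), validate="one_to_one"
)

print(joined)
print("Row count preserved:", len(joined) == len(base))
```

Dates that exist only in the base frame (here 2024-01-03) end up with NaN in the joined columns, which is why the missing-value step comes after the join.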

In [None]:
print("Joining feature groups...")

# Start with QQQ technical features as base
features = qqq_tech.copy()

# Join XLK sector features
features = features.merge(xlk_sector, on='date', how='left', suffixes=('', '_xlk'))

# Join VIX volatility features
features = features.merge(vix_vol, on='date', how='left', suffixes=('', '_vix'))

# Join macro features
features = features.merge(macro, on='date', how='left', suffixes=('', '_macro'))

# Join sentiment features if available
if has_sentiment:
    features = features.merge(sentiment, on='date', how='left', suffixes=('', '_sent'))

# Add raw QQQ close for target calculation
features = features.merge(qqq_raw[['date', 'qqq_close']], on='date', how='left')

# Sort by date
features = features.sort_values('date').reset_index(drop=True)

print(f"\n✓ Combined features shape: {features.shape}")
print(f"  Total columns: {len(features.columns)}")
print(f"  Date range: {features['date'].min()} to {features['date'].max()}")

In [None]:
# Preview combined features
print("\nCombined features preview:")
features.head(10)

## Create Target Variables

**Target 1**: `target_return` (regression)
- Next-day percentage return of QQQ
- Calculated as: `close_{t+1} / close_t - 1`
- Uses `.shift(-1)` to align tomorrow's close with today's row; the last row has no next-day close and is dropped

**Target 2**: `target_direction` (classification)
- Binary: 1 if next-day return > 0, else 0
- For up/down prediction

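**Important**: The last row has a NaN `target_return` (no future data), so we drop it. Its `target_direction` is computed as 0 rather than NaN because `NaN > 0` evaluates to `False`, so the drop must key on `target_return`.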
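
A minimal sketch of the target construction on a toy price series, assuming made-up closes. It drops the unlabeled last row before deriving the direction label, which is one way to keep a NaN return from being recorded as a down day. The frame name `prices` and the values are illustrative only.

```python
import pandas as pd

# Toy closing prices (illustrative values only), sorted by date
prices = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]
    ),
    "qqq_close": [100.0, 102.0, 101.0, 101.0],
}).sort_values("date").reset_index(drop=True)

# shift(-1) aligns tomorrow's close with today's row
prices["target_return"] = prices["qqq_close"].shift(-1) / prices["qqq_close"] - 1

# The last row has no next-day close, so its return is NaN: drop it first
labeled = prices.dropna(subset=["target_return"]).copy()

# Direction is 1 only for strictly positive returns; flat days count as 0
labeled["target_direction"] = (labeled["target_return"] > 0).astype(int)

print(labeled)
```

Sorting by date before calling `.shift(-1)` matters: the shift operates on row order, not on the date values.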

In [None]:
print("Creating target variables...")

# Target 1: Next-day return (regression)
# shift(-1) gets tomorrow's close, then calculate return
features['next_day_close'] = features['qqq_close'].shift(-1)
features['target_return'] = (features['next_day_close'] / features['qqq_close']) - 1

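# Target 2: Binary up/down (classification)
# Note: NaN > 0 is False, so the last row is labeled 0 here; it is removed
# later when rows with a NaN target_return are dropped
features['target_direction'] = (features['target_return'] > 0).astype(int)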

# Drop temporary column
features = features.drop(columns=['next_day_close'])

print(f"\n✓ Targets created")
print(f"\nTarget statistics:")
print(features['target_return'].describe())

In [None]:
# Analyze target distribution
# Exclude the last row: its target_return is NaN, yet it is labeled 0 (NaN > 0 is False)
valid_targets = features.dropna(subset=['target_return'])
up_share = (valid_targets['target_direction'] == 1).mean()
print("\nTarget distribution (classification):")
print(valid_targets['target_direction'].value_counts())
print(f"\nUp days: {(valid_targets['target_direction'] == 1).sum()} ({up_share*100:.1f}%)")
print(f"Down days: {(valid_targets['target_direction'] == 0).sum()} ({(1 - up_share)*100:.1f}%)")

## Handle Missing Values

Missing values come from:
1. **Rolling windows**: The first rows of each N-day feature are NaN until the window has enough history
2. **Sentiment**: Some days may have no news articles
3. **Target**: Last row has NaN (no future data)

Strategy:
- Fill sentiment features with 0 (no news = neutral sentiment)
- Drop rows with NaN targets
- Drop remaining NaN (from rolling windows at start)
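
The three-step strategy above can be sketched on a toy frame. The column names and values here are illustrative assumptions, not the real feature schema. The order of the steps matters: sentiment is filled first so that days without news are not discarded by the final `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy frame with the three sources of missing values (illustrative only)
df = pd.DataFrame({
    "date": pd.date_range("2024-01-02", periods=5, freq="B"),
    "volatility_3d": [np.nan, np.nan, 0.011, 0.012, 0.010],  # rolling warm-up
    "sentiment_mean": [0.2, np.nan, np.nan, -0.1, 0.3],      # days without news
    "target_return": [0.01, -0.02, 0.005, 0.003, np.nan],    # last row unlabeled
})

# Step 1: no news is treated as neutral sentiment
sentiment_cols = [c for c in df.columns if "sentiment" in c.lower()]
df[sentiment_cols] = df[sentiment_cols].fillna(0)

# Step 2: drop rows without a target (the last row)
with_targets = df.dropna(subset=["target_return"])

# Step 3: drop whatever is still NaN (rolling-window warm-up rows)
clean = with_targets.dropna()

print(f"Dropped for missing target: {len(df) - len(with_targets)}")
print(f"Dropped for rolling warm-up: {len(with_targets) - len(clean)}")
print(clean)
```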

In [None]:
# Check missing values before handling
print("Missing values per column:")
missing = features.isnull().sum()
missing_cols = missing[missing > 0].sort_values(ascending=False)
print(missing_cols.head(20))
print(f"\nTotal rows with any missing: {features.isnull().any(axis=1).sum()}")

In [None]:
# Handle sentiment features (fill with 0 if no news)
if has_sentiment:
    extra_sentiment_cols = ['article_count', 'positive_mean', 'negative_mean', 'neutral_mean']
    sentiment_cols = [col for col in features.columns
                      if 'sentiment' in col.lower() or col in extra_sentiment_cols]
    print(f"\nFilling {len(sentiment_cols)} sentiment columns with 0 (no news)...")
    features[sentiment_cols] = features[sentiment_cols].fillna(0)
    print("✓ Sentiment features filled")

# Drop rows with missing targets (last row)
features_with_targets = features.dropna(subset=['target_return', 'target_direction'])
print(f"\n✓ Dropped {len(features) - len(features_with_targets)} rows with missing targets")

# Drop remaining rows with NaN (rolling window initialization)
features_clean = features_with_targets.dropna()
print(f"✓ Dropped {len(features_with_targets) - len(features_clean)} rows with NaN from rolling windows")

print(f"\nFinal clean dataset shape: {features_clean.shape}")
print(f"Date range: {features_clean['date'].min()} to {features_clean['date'].max()}")
print(f"Total training samples: {len(features_clean)}")

In [None]:
# Verify no missing values remain
remaining_missing = features_clean.isnull().sum().sum()
print(f"\nRemaining missing values: {remaining_missing}")
if remaining_missing > 0:
    print("WARNING: Still have missing values!")
    print(features_clean.isnull().sum()[features_clean.isnull().sum() > 0])
else:
    print("✓ No missing values in clean dataset")

## Analyze Feature Correlations with Target

In [None]:
# Calculate correlations with target_return
feature_cols = [col for col in features_clean.columns 
                if col not in ['date', 'qqq_close', 'target_return', 'target_direction']]

correlations = features_clean[feature_cols].corrwith(features_clean['target_return']).sort_values(ascending=False)

print("Top 10 features positively correlated with next-day return:")
print(correlations.head(10))
print("\nTop 10 features negatively correlated with next-day return:")
print(correlations.tail(10))

In [None]:
# Visualize top correlations
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Top positive correlations
top_positive = correlations.head(10)
ax1.barh(range(len(top_positive)), top_positive.values)
ax1.set_yticks(range(len(top_positive)))
ax1.set_yticklabels(top_positive.index)
ax1.set_xlabel('Correlation with Next-Day Return')
ax1.set_title('Top 10 Positive Correlations')
ax1.grid(True, alpha=0.3)

# Top negative correlations
top_negative = correlations.tail(10).sort_values()
ax2.barh(range(len(top_negative)), top_negative.values)
ax2.set_yticks(range(len(top_negative)))
ax2.set_yticklabels(top_negative.index)
ax2.set_xlabel('Correlation with Next-Day Return')
ax2.set_title('Top 10 Negative Correlations')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Feature correlations analyzed")

## Upload Combined Feature Group to Hopsworks

In [None]:
# Create combined feature group with all features and targets
print("\nUploading combined feature group to Hopsworks...")

combined_fg = create_feature_group(
    fs,
    name='qqq_combined_features',
    df=features_clean,
    primary_key=['date'],
    description='Combined feature set with all engineered features and prediction targets (next-day return and direction)'
)

print(f"✓ Created feature group: qqq_combined_features (version {combined_fg.version})")
print(f"  Total features: {len(feature_cols)}")
print(f"  Total samples: {len(features_clean)}")

## Create Feature View for Model Training

A Feature View in Hopsworks:
- Defines which features to use for training
- Specifies target labels
- Enables time-series splits
- Provides consistent train/test data access

In [None]:
# Select features and labels for feature view
# Exclude: date (not a feature), qqq_close (raw price), targets (labels)
feature_names = [col for col in features_clean.columns 
                 if col not in ['date', 'qqq_close', 'target_return', 'target_direction']]

print(f"\nFeature View configuration:")
print(f"  Features: {len(feature_names)}")
print(f"  Labels: target_return (regression), target_direction (classification)")
print(f"\nFeature list:")
for i, feat in enumerate(feature_names, 1):
    print(f"  {i:2d}. {feat}")

In [None]:
# Create query for feature view
query = combined_fg.select(feature_names + ['target_return', 'target_direction'])

# Create or get feature view
print("\nCreating feature view...")

try:
    # Try to get existing feature view
    feature_view = fs.get_feature_view(name='qqq_prediction_fv', version=1)
    print("✓ Feature view already exists: qqq_prediction_fv (version 1)")
except Exception:
    # Lookup failed (typically because the view does not exist yet): create it
    feature_view = fs.create_feature_view(
        name='qqq_prediction_fv',
        version=1,
        query=query,
        labels=['target_return', 'target_direction']
    )
    print("✓ Created new feature view: qqq_prediction_fv (version 1)")

print(f"\nFeature View Summary:")
print(f"  Name: {feature_view.name}")
print(f"  Version: {feature_view.version}")
print(f"  Features: {len(feature_names)}")
print(f"  Labels: {len(feature_view.labels)}")

## Summary

**✅ Feature View created successfully**:

**Feature Groups Joined**:
- `qqq_technical_features`: Returns, volatility, RSI, MA ratios
- `xlk_sector_features`: Sector returns and correlation
- `vix_volatility_features`: Volatility regime indicators
- `macro_features`: DGS10 and CPI (point-in-time correct)
- `sentiment_features`: News sentiment (optional)

**Targets Created**:
- `target_return`: Next-day percentage return (regression)
- `target_direction`: Binary up/down (classification)

**Data Quality**:
- All missing values handled (sentiment filled with 0, rolling window NaN dropped)
- No look-ahead bias (features use only data available on each date; only the targets look one day ahead, and the unlabeled last row is dropped)
- Time-aligned features (all joined on date)

**Hopsworks Assets Created**:
- Feature Group: `qqq_combined_features`
- Feature View: `qqq_prediction_fv` (version 1)

**Next steps**:
- Notebook 7: Train XGBoost models using time-series splits
- Use Feature View for consistent train/test data access
- Implement proper backtesting with purge gaps
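
To make the idea of a purge gap concrete, here is a minimal sketch of expanding-window splits over row positions. It is not the implementation used in Notebook 7; the function name `purged_walk_forward` and its parameters are hypothetical. Because the target on each row depends on the next day's close, the label of the last training row would otherwise be computed from the first test day, so at least one row is purged between train and test.

```python
def purged_walk_forward(n_samples, n_splits, test_size, purge):
    """Return (train_idx, test_idx) pairs for expanding-window splits.

    Rows must already be sorted by date. The `purge` rows immediately
    before each test block are excluded from training.
    """
    splits = []
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_end = test_start - purge  # exclusive upper bound of training rows
        if train_end <= 0:
            raise ValueError("Not enough samples for the requested splits")
        train_idx = list(range(train_end))
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits


# Example: 20 rows, 3 test blocks of 4 rows, 1 purged row before each block
splits = purged_walk_forward(n_samples=20, n_splits=3, test_size=4, purge=1)
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]} | test {test_idx[0]}..{test_idx[-1]}")
```

The returned positions can be passed to `DataFrame.iloc` on a date-sorted frame to select the rows for each fold.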