# Homework 09: Feature Engineering

**Assignment**: Create 2-3 engineered features from a financial dataset, with clear rationale and documentation.

## Objectives
- Implement engineered features based on EDA insights
- Document reasoning for each feature
- Test correlation with target variables
- Create reusable feature engineering functions

In [None]:
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import utils
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

print("🔧 Homework 09: Feature Engineering")

## 1. Load and Prepare Dataset

In [None]:
# Load financial dataset
symbols = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA']
print(f"Loading data for: {symbols}")

raw_data = utils.fetch_multiple_stocks(symbols, prefer_alphavantage=False, period='2y')

if not raw_data.empty:
    # Basic preprocessing
    processed_data = []
    
    for symbol in symbols:
        symbol_data = raw_data[raw_data['symbol'] == symbol].copy()
        symbol_data = symbol_data.sort_values('date')
        
        # Basic features
        symbol_data['daily_return'] = symbol_data['close'].pct_change()
        symbol_data['log_return'] = np.log(symbol_data['close'] / symbol_data['close'].shift(1))
        symbol_data['price_range'] = symbol_data['high'] - symbol_data['low']
        
        # Moving averages for feature engineering
        symbol_data['sma_5'] = symbol_data['close'].rolling(5).mean()
        symbol_data['sma_20'] = symbol_data['close'].rolling(20).mean()
        symbol_data['sma_50'] = symbol_data['close'].rolling(50).mean()
        
        # Volume moving average
        symbol_data['volume_ma_20'] = symbol_data['volume'].rolling(20).mean()
        
        processed_data.append(symbol_data)
    
    df = pd.concat(processed_data, ignore_index=True)
    df = df.dropna().reset_index(drop=True)  # contiguous index for later index-based alignment
    
    print(f"✅ Dataset prepared: {df.shape}")
    print(f"Base features: {list(df.columns)}")
else:
    df = pd.DataFrame()  # keep `df` defined so later `df.empty` checks do not raise NameError
    print("❌ Failed to load data")

## 2. Feature 1: Volatility-Adjusted Return Ratio

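**Rationale**: The EDA showed that return magnitude and volatility move together. This feature divides each daily return by the stock's recent rolling volatility, so a move is measured relative to what is typical for that stock at that time. It is similar in spirit to a Sharpe ratio, but it uses a single day's return over a 20-day rolling standard deviation rather than a mean excess return.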
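
To make the normalization concrete, here is a minimal sketch on made-up numbers. The return values and the 4-day window are assumptions chosen for illustration, not values from the dataset. The same 2% daily return scores much higher right after a calm stretch than after a volatile one.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns: a calm stretch, then a volatile stretch.
returns = pd.Series([0.001, -0.001, 0.001, -0.001, 0.02,
                     0.03, -0.03, 0.03, -0.03, 0.02])

window = 4  # short window so the toy series has enough rows
rolling_vol = returns.rolling(window).std().replace(0, np.nan)  # guard against division by zero
vol_adj = returns / rolling_vol

calm_score = vol_adj.iloc[4]      # 2% move following the calm stretch
volatile_score = vol_adj.iloc[9]  # 2% move following the volatile stretch

print(f"2% return after calm period:     {calm_score:.2f}")
print(f"2% return after volatile period: {volatile_score:.2f}")
```

As in the notebook's function, the rolling window includes the current day's return, and the first `window - 1` rows are NaN.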

In [None]:
def create_volatility_adjusted_return(df, return_col='daily_return', window=20):
    """
    Create volatility-adjusted return feature.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe with return data
    return_col : str
        Column name for returns
    window : int
        Rolling window for volatility calculation
    
    Returns:
    --------
    pd.Series
        Volatility-adjusted returns
    """
    # Calculate rolling volatility
    rolling_vol = df.groupby('symbol')[return_col].rolling(window).std().reset_index(0, drop=True)
    
    # Avoid division by zero
    rolling_vol = rolling_vol.replace(0, np.nan)
    
    # Calculate volatility-adjusted return
    vol_adj_return = df[return_col] / rolling_vol
    
    return vol_adj_return

# Apply feature engineering
if not df.empty:
    df['vol_adj_return'] = create_volatility_adjusted_return(df)
    
    print("📊 Feature 1: Volatility-Adjusted Return")
    print(f"Description: Current return normalized by 20-day rolling volatility")
    print(f"Range: {df['vol_adj_return'].min():.3f} to {df['vol_adj_return'].max():.3f}")
    print(f"Mean: {df['vol_adj_return'].mean():.3f}")
    print(f"Std: {df['vol_adj_return'].std():.3f}")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Distribution
    axes[0].hist(df['vol_adj_return'].dropna(), bins=50, alpha=0.7, edgecolor='black')
    axes[0].set_title('Volatility-Adjusted Return Distribution')
    axes[0].set_xlabel('Vol-Adjusted Return')
    axes[0].set_ylabel('Frequency')
    
    # Comparison with raw returns
    axes[1].scatter(df['daily_return'], df['vol_adj_return'], alpha=0.5, s=10)
    axes[1].set_xlabel('Raw Daily Return')
    axes[1].set_ylabel('Volatility-Adjusted Return')
    axes[1].set_title('Raw vs Volatility-Adjusted Returns')
    
    plt.tight_layout()
    plt.show()

## 3. Feature 2: Volume Momentum Indicator

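**Rationale**: The EDA showed a strong correlation between volume and price movements. This feature compares current volume to its recent average and scales the result by signed price momentum over the same window, so it is large when a sustained price move is accompanied by unusually heavy trading. Heavy volume behind a move is often read as a sign of institutional participation, although this feature does not measure that directly.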
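
A small worked example may help here. The prices, volumes, and the 3-day window below are hypothetical values picked so the arithmetic is easy to follow: a volume spike on a 10% rally gives a positive score, and the same spike on a 10% sell-off gives the mirror-image negative score.

```python
import pandas as pd

# Hypothetical single-symbol data: flat volume with two spikes,
# one on a rally (row 4) and one on a sell-off (row 8).
toy = pd.DataFrame({
    "close":  [100.0, 100.0, 100.0, 100.0, 110.0, 100.0, 100.0, 100.0, 90.0],
    "volume": [1000, 1000, 1000, 1000, 3000, 1000, 1000, 1000, 3000],
})

window = 3
volume_ratio = toy["volume"] / toy["volume"].rolling(window).mean()
price_momentum = toy["close"].pct_change(window)  # signed rate of change
volume_momentum = volume_ratio * price_momentum

print(volume_momentum.round(3))
```

In both spike rows the volume ratio is 1.8 (3000 against a 3-day mean of about 1667), so the scores are 1.8 × 0.10 and 1.8 × (−0.10).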

In [None]:
def create_volume_momentum(df, volume_col='volume', price_col='close', window=10):
    """
    Create volume momentum indicator.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe
    volume_col : str
        Volume column name
    price_col : str
        Price column name
    window : int
        Rolling window for calculations
    
    Returns:
    --------
    pd.Series
        Volume momentum indicator
    """
    # Volume ratio (current vs average)
    volume_ma = df.groupby('symbol')[volume_col].rolling(window).mean().reset_index(0, drop=True)
    volume_ratio = df[volume_col] / volume_ma
    
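    # Price momentum (rate of change over `window` days, per symbol).
    # groupby(...).pct_change() already returns a Series aligned to df.index,
    # so no reset_index is needed (it would break the alignment).
    price_momentum = df.groupby('symbol')[price_col].pct_change(window)
    
    # Combine volume and price momentum.
    # sign(x) * abs(x) is just x, so multiply by the signed momentum directly.
    volume_momentum = volume_ratio * price_momentum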
    
    return volume_momentum

# Apply feature engineering
if not df.empty:
    df['volume_momentum'] = create_volume_momentum(df)
    
    print("\n📊 Feature 2: Volume Momentum Indicator")
    print(f"Description: Volume ratio weighted by price momentum direction and magnitude")
    print(f"Range: {df['volume_momentum'].min():.3f} to {df['volume_momentum'].max():.3f}")
    print(f"Mean: {df['volume_momentum'].mean():.3f}")
    print(f"Std: {df['volume_momentum'].std():.3f}")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Distribution
    axes[0].hist(df['volume_momentum'].dropna(), bins=50, alpha=0.7, edgecolor='black')
    axes[0].set_title('Volume Momentum Distribution')
    axes[0].set_xlabel('Volume Momentum')
    axes[0].set_ylabel('Frequency')
    
    # Relationship with next-day returns
    df_temp = df.copy()
    df_temp['next_return'] = df_temp.groupby('symbol')['daily_return'].shift(-1)
    
    momentum_data = df_temp[['volume_momentum', 'next_return']].dropna()
    axes[1].scatter(momentum_data['volume_momentum'], momentum_data['next_return'], alpha=0.5, s=10)
    axes[1].set_xlabel('Volume Momentum')
    axes[1].set_ylabel('Next Day Return')
    axes[1].set_title('Volume Momentum vs Future Returns')
    
    plt.tight_layout()
    plt.show()

## 4. Feature 3: Technical Divergence Signal

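**Rationale**: Technical analysis suggests that divergences between price and momentum indicators can signal trend reversals. This feature measures the gap between the 20-day price rate of change and an RSI-like momentum score rescaled to the range -1 to 1. Large gaps flag periods where the price move is not confirmed by the momentum indicator, which may point to a potential reversal.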
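
The RSI-like score is easier to interpret with two extreme cases. The sketch below uses a hypothetical helper, `rsi_like_momentum`, and made-up price series. It follows the same steps as the notebook's divergence function but is not part of `utils`.

```python
import pandas as pd

def rsi_like_momentum(close: pd.Series, window: int) -> pd.Series:
    """Illustrative helper: RSI-style score rescaled from 0..100 to -1..1."""
    returns = close.pct_change()
    gains = returns.where(returns > 0, 0)    # positive returns, else 0
    losses = -returns.where(returns < 0, 0)  # magnitude of negative returns, else 0
    rs = gains.rolling(window).mean() / losses.rolling(window).mean()
    rsi = 100 - 100 / (1 + rs)
    return (rsi - 50) / 50

rising = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0])  # up every day
choppy = pd.Series([100.0, 110.0, 99.0, 108.9, 98.01])   # alternating +10% / -10%

rising_score = rsi_like_momentum(rising, window=3).iloc[-1]  # no losses in the window
choppy_score = rsi_like_momentum(choppy, window=4).iloc[-1]  # gains and losses balance

# Divergence for the rising series: 3-day rate of change minus the RSI score
price_roc = rising.pct_change(3).iloc[-1]
divergence = price_roc - rising_score

print(f"rising: score={rising_score:.2f}, roc={price_roc:.4f}, divergence={divergence:.4f}")
print(f"choppy: score={choppy_score:.2f}")
```

In the rising example the 3-day rate of change is about 0.03 while the RSI score is 1, so the difference is dominated by the RSI term. The two inputs are on different scales, which is worth keeping in mind when interpreting `tech_divergence`.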

In [None]:
def create_technical_divergence(df, price_col='close', window=20):
    """
    Create technical divergence signal.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe
    price_col : str
        Price column name
    window : int
        Lookback window for divergence calculation
    
    Returns:
    --------
    pd.Series
        Technical divergence signal
    """
    divergence_signals = []
    
    for symbol in df['symbol'].unique():
        symbol_data = df[df['symbol'] == symbol].copy().sort_values('date')
        
        # Price momentum (ROC)
        price_roc = symbol_data[price_col].pct_change(window)
        
        # RSI-like momentum indicator
        returns = symbol_data[price_col].pct_change()
        gains = returns.where(returns > 0, 0)
        losses = -returns.where(returns < 0, 0)
        
        avg_gains = gains.rolling(window).mean()
        avg_losses = losses.rolling(window).mean()
        
        rs = avg_gains / avg_losses
        rsi = 100 - (100 / (1 + rs))
        
        # Normalize RSI to momentum scale
        rsi_momentum = (rsi - 50) / 50  # Convert to -1 to 1 scale
        
        # Calculate divergence: price momentum vs RSI momentum
        divergence = price_roc - rsi_momentum
        
        # Keep the original row index so the result aligns with df
        divergence_signals.append(divergence.rename('divergence'))
    
    # Combine all symbols and restore the original row order of df.
    # (Merging on symbol/date would reset the index and misalign the
    # result when df's index is not 0..n-1.)
    all_divergence = pd.concat(divergence_signals).reindex(df.index)
    
    return all_divergence

# Apply feature engineering
if not df.empty:
    df['tech_divergence'] = create_technical_divergence(df)
    
    print("\n📊 Feature 3: Technical Divergence Signal")
    print(f"Description: Divergence between price momentum and RSI-based momentum")
    print(f"Range: {df['tech_divergence'].min():.3f} to {df['tech_divergence'].max():.3f}")
    print(f"Mean: {df['tech_divergence'].mean():.3f}")
    print(f"Std: {df['tech_divergence'].std():.3f}")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Distribution
    axes[0].hist(df['tech_divergence'].dropna(), bins=50, alpha=0.7, edgecolor='black')
    axes[0].set_title('Technical Divergence Distribution')
    axes[0].set_xlabel('Divergence Signal')
    axes[0].set_ylabel('Frequency')
    
    # Time series example for one symbol
    sample_symbol = df['symbol'].iloc[0]
    sample_data = df[df['symbol'] == sample_symbol].sort_values('date').tail(100)
    
    ax2 = axes[1]
    ax2_twin = ax2.twinx()
    
    ax2.plot(sample_data['date'], sample_data['close'], 'b-', label='Price', alpha=0.7)
    ax2_twin.plot(sample_data['date'], sample_data['tech_divergence'], 'r-', label='Divergence', alpha=0.7)
    
    ax2.set_xlabel('Date')
    ax2.set_ylabel('Price', color='b')
    ax2_twin.set_ylabel('Divergence Signal', color='r')
    ax2.set_title(f'Price vs Divergence Signal ({sample_symbol})')
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

## 5. Feature Correlation Analysis

In [None]:
if not df.empty:
    # Create target variable (next-day return)
    df['target_return'] = df.groupby('symbol')['daily_return'].shift(-1)
    
    # Select features for correlation analysis
    feature_cols = ['vol_adj_return', 'volume_momentum', 'tech_divergence']
    target_col = 'target_return'
    
    # Calculate correlations with target
    correlations = []
    for feature in feature_cols:
        corr = df[feature].corr(df[target_col])
        correlations.append({'Feature': feature, 'Correlation': corr})
    
    corr_df = pd.DataFrame(correlations)
    print("\n🔗 Feature Correlations with Next-Day Returns:")
    print(corr_df.round(4))
    
    # Feature correlation matrix
    feature_data = df[feature_cols + [target_col]].dropna()
    feature_corr_matrix = feature_data.corr()
    
    plt.figure(figsize=(8, 6))
    sns.heatmap(feature_corr_matrix, annot=True, cmap='coolwarm', center=0,
                square=True, fmt='.3f')
    plt.title('Engineered Features Correlation Matrix')
    plt.tight_layout()
    plt.show()
    
    # Statistical significance tests
    print("\n📊 Statistical Significance Tests:")
    for feature in feature_cols:
        feature_data = df[[feature, target_col]].dropna()
        if len(feature_data) > 30:  # Minimum sample size
            corr, p_value = stats.pearsonr(feature_data[feature], feature_data[target_col])
            significance = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else ""
            print(f"{feature}: r={corr:.4f}, p={p_value:.4f} {significance}")

## 6. Feature Validation and Testing

In [None]:
if not df.empty:
    # Test feature stability across different symbols
    print("\n🧪 Feature Stability Analysis:")
    
    stability_results = []
    for symbol in df['symbol'].unique():
        symbol_data = df[df['symbol'] == symbol]
        
        for feature in feature_cols:
            if len(symbol_data) > 50:  # Minimum data points
                corr = symbol_data[feature].corr(symbol_data[target_col])
                stability_results.append({
                    'Symbol': symbol,
                    'Feature': feature,
                    'Correlation': corr
                })
    
    stability_df = pd.DataFrame(stability_results)
    
    # Pivot for better visualization
    stability_pivot = stability_df.pivot(index='Symbol', columns='Feature', values='Correlation')
    print(stability_pivot.round(4))
    
    # Feature importance ranking
    print("\n🏆 Feature Importance Ranking:")
    feature_importance = stability_df.groupby('Feature')['Correlation'].agg([
        'mean', 'std', 'count'
    ]).round(4)
    feature_importance['abs_mean'] = feature_importance['mean'].abs()
    feature_importance = feature_importance.sort_values('abs_mean', ascending=False)
    
    print(feature_importance)
    
    # Visualize feature performance across symbols
    fig, axes = plt.subplots(1, len(feature_cols), figsize=(15, 5))
    
    for i, feature in enumerate(feature_cols):
        feature_data = stability_df[stability_df['Feature'] == feature]
        axes[i].bar(feature_data['Symbol'], feature_data['Correlation'], alpha=0.7)
        axes[i].set_title(f'{feature}\nCorrelation by Symbol')
        axes[i].set_ylabel('Correlation')
        axes[i].tick_params(axis='x', rotation=45)
        axes[i].axhline(y=0, color='black', linestyle='-', alpha=0.3)
    
    plt.tight_layout()
    plt.show()

## 7. Feature Engineering Summary

### Implemented Features

**1. Volatility-Adjusted Return (`vol_adj_return`)**
- **Formula**: `daily_return / rolling_volatility_20d`
- **Rationale**: Normalizes returns by recent volatility to capture risk-adjusted performance
- **Use Case**: Identifies periods of unusual return magnitude relative to recent volatility

**2. Volume Momentum Indicator (`volume_momentum`)**
- **Formula**: `(volume / volume_ma_10d) * price_momentum_10d` (signed 10-day rate of change)
- **Rationale**: Combines volume activity with price momentum direction and magnitude
- **Use Case**: Highlights price moves backed by unusually heavy volume, which may reflect institutional participation

**3. Technical Divergence Signal (`tech_divergence`)**
- **Formula**: `price_momentum_20d - rsi_momentum_normalized`
- **Rationale**: Identifies divergences between price and momentum indicators
- **Use Case**: Signals potential trend reversals when price and momentum diverge
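
Written out, the three formulas above correspond to the following definitions, as read from the notebook code. The notation is introduced here for illustration only: $r_t$, $P_t$, and $V_t$ are the daily return, close, and volume of one symbol on day $t$.

$$
\text{vol\_adj\_return}_t = \frac{r_t}{\sigma_t^{(20)}}, \qquad \sigma_t^{(20)} = \operatorname{std}\left(r_{t-19}, \dots, r_t\right)
$$

$$
\text{volume\_momentum}_t = \frac{V_t}{\frac{1}{10}\sum_{i=0}^{9} V_{t-i}} \cdot \left(\frac{P_t}{P_{t-10}} - 1\right)
$$

$$
\text{tech\_divergence}_t = \left(\frac{P_t}{P_{t-20}} - 1\right) - \frac{\text{RSI}_t - 50}{50}, \qquad \text{RSI}_t = 100 - \frac{100}{1 + \bar{G}_t / \bar{L}_t}
$$

Here $\bar{G}_t$ and $\bar{L}_t$ are the 20-day rolling means of the positive returns and of the magnitudes of the negative returns, with zeros on days that do not qualify.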

### Feature Performance

Based on correlation analysis with next-day returns:
- Features show varying predictive power across different symbols
- Statistical significance varies by feature, which suggests that effectiveness may depend on the stock and the period sampled
- The per-symbol stability analysis in Section 6 shows which features have consistent correlations across stocks

### Connection to EDA Insights

1. **Volume-Return Relationship**: The volume momentum feature directly addresses the strong volume-price correlation observed in the EDA
2. **Non-Normal Returns**: Dividing returns by rolling volatility can reduce the part of the fat tails that comes from volatility clustering, although the scaled returns need not be normal
3. **Technical Patterns**: The divergence signal is intended to capture the mean-reversion tendencies identified in the EDA

### Next Steps

1. **Feature Selection**: Use statistical tests and cross-validation to select most predictive features
2. **Feature Scaling**: Standardize features for machine learning models
3. **Interaction Terms**: Consider feature interactions and polynomial terms
4. **Time-Based Features**: Add seasonal and cyclical components identified in EDA
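
For the feature scaling step, one point worth sketching is that the scaling statistics should come from the training period only, so that no information from later dates leaks into earlier rows. The example below is a minimal sketch on synthetic data. The column values, the 80/20 chronological split, and the variable names are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the engineered feature table (values are random, not real data)
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=100, freq="D")
features = pd.DataFrame({
    "date": dates,
    "vol_adj_return": rng.normal(0, 1.5, size=100),
    "volume_momentum": rng.normal(0, 0.05, size=100),
})
feature_cols = ["vol_adj_return", "volume_momentum"]

# Chronological split: first 80 days for training, last 20 for testing
cutoff = dates[79]
train = features[features["date"] <= cutoff]
test = features[features["date"] > cutoff]

# Fit the scaling statistics on the training rows only
mu = train[feature_cols].mean()
sigma = train[feature_cols].std(ddof=0)

# Apply the same statistics to both splits
train_scaled = (train[feature_cols] - mu) / sigma
test_scaled = (test[feature_cols] - mu) / sigma

print(train_scaled.describe().round(3))
```

With several symbols, the same idea could be applied per symbol by grouping before computing `mu` and `sigma`.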

## 8. Save Engineered Features

In [None]:
if not df.empty:
    # Save dataset with engineered features
    feature_dataset = df[['symbol', 'date', 'close', 'volume', 'daily_return'] + feature_cols + ['target_return']].copy()
    
    # Save to file
    output_path = utils.save_with_timestamp(
        df=feature_dataset,
        prefix="engineered_features",
        source="homework",
        ext="csv"
    )
    
    print(f"\n💾 Engineered features saved to: {output_path}")
    
    print("\n🎯 Feature Engineering Summary:")
    print(f"✅ {len(feature_cols)} new features created")
    print(f"✅ Features tested for correlation with target")
    print(f"✅ Stability analysis across {df['symbol'].nunique()} symbols")
    print(f"✅ Dataset ready for modeling: {feature_dataset.shape}")
    
    print("\n📊 Final Feature Set:")
    for i, feature in enumerate(feature_cols, 1):
        print(f"{i}. {feature}: {feature_dataset[feature].describe().round(4).to_dict()}")