# Fine-Tuning Our Recession Prediction Features

Now that we've got a working model, let's optimize it! We'll experiment with different feature engineering approaches to squeeze out every bit of predictive power we can get from our economic data.

**What we'll try:**
- Test different lag periods for consumer sentiment (how far back should we look?)
- Create interaction terms between sentiment and economic indicators
- Transform sentiment data in various ways to capture different patterns
- Use feature selection to find the cream of the crop
- Compare models with and without sentiment to see whether it adds value

In [None]:
# Import notebook utilities
from notebook_utils import init_notebook, load_data, display_data_info, save_figure
import os
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

# Initialize notebook environment
init_notebook()

# Import from econ_downturn package
from econ_downturn import (
    engineer_features, normalize_data, apply_mda, create_discriminant_time_series,
    plot_indicator_with_recessions, plot_correlation_matrix,
    plot_feature_importance, plot_discriminant_time_series,
    
    # Advanced feature engineering functions
    engineer_features_with_custom_lags, create_interaction_terms,
    apply_sentiment_transformations, select_features
)

## Getting Our Data Ready

Let's start by loading all our data sources and getting everything set up for our optimization experiments.

In [None]:
# Load all data using the utility function
merged_data = load_data(use_cached=False)  # Force reload from original sources

# Display information about the dataset
display_data_info(merged_data)

## Experimenting with Sentiment Lag Periods

How far back should we look at consumer sentiment to predict recessions? Let's try sentiment lags of 1, 3, 6, 12, 18, and 24 months (1 to 12 months for the other indicators) and see which ones give us the best signals.

In [None]:
# Engineer features with different lag periods for consumer sentiment
data_with_custom_lags = engineer_features_with_custom_lags(
    merged_data,
    sentiment_lags=[1, 3, 6, 12, 18, 24],
    other_lags=[1, 3, 6, 12]
)

# Save the dataset with custom lags
from econ_downturn import get_data_paths
data_paths = get_data_paths()
output_dir = data_paths['processed_dir']
os.makedirs(output_dir, exist_ok=True)
custom_lags_path = os.path.join(output_dir, 'data_with_custom_sentiment_lags.csv')
data_with_custom_lags.to_csv(custom_lags_path)
print(f"Saved dataset with custom lags to {custom_lags_path}")
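
The internals of `engineer_features_with_custom_lags` aren't shown in this notebook, so here is only a rough sketch of what a lag feature is, assuming monthly data. The helper `add_lags` and the column name `SENTIMENT` are placeholders for illustration, not part of `econ_downturn`.

```python
import pandas as pd

def add_lags(df, column, lags):
    """Return a copy of df with one shifted copy of `column` per lag (hypothetical helper)."""
    out = df.copy()
    for lag in lags:
        # shift(lag) moves values down, so row t holds the value from t - lag
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    return out

# Toy monthly series; SENTIMENT is a placeholder column name
idx = pd.date_range("2020-01-01", periods=6, freq="MS")
toy = pd.DataFrame({"SENTIMENT": [90.0, 92.0, 88.0, 85.0, 87.0, 91.0]}, index=idx)

lagged = add_lags(toy, "SENTIMENT", [1, 3])
print(lagged)
```

Each lag leaves that many missing values at the start of the series, so a 24-month lag costs the first two years of rows if incomplete rows are dropped later.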

## Building Interaction Features

Sometimes the magic happens when variables work together. Let's create interaction terms between consumer sentiment and economic indicators to capture how they amplify each other's effects.

In [None]:
# Create interaction terms
data_with_interactions = create_interaction_terms(data_with_custom_lags)

# Save the dataset with interaction terms
interactions_path = os.path.join(output_dir, 'data_with_interactions.csv')
data_with_interactions.to_csv(interactions_path)
print(f"Saved dataset with interaction terms to {interactions_path}")
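
How `create_interaction_terms` builds its features isn't shown here. One common form of interaction term is a simple product of two columns, sketched below under that assumption. The helper `add_products` and the column names `SENTIMENT`, `UNRATE`, and `FEDFUNDS` are placeholders.

```python
import pandas as pd

def add_products(df, base_col, other_cols):
    """Add base_col * other_col for each other column (hypothetical helper)."""
    out = df.copy()
    for col in other_cols:
        # The product is large only when both variables are far from zero together
        out[f"{base_col}_x_{col}"] = out[base_col] * out[col]
    return out

# Toy values standing in for already-normalized indicators
toy = pd.DataFrame({
    "SENTIMENT": [1.0, -0.5, 2.0],
    "UNRATE": [0.5, 1.5, -1.0],
    "FEDFUNDS": [2.0, 4.0, 1.0],
})

with_products = add_products(toy, "SENTIMENT", ["UNRATE", "FEDFUNDS"])
print(with_products)
```

Products are easier to interpret when the inputs are centered first, since the sign of the term then reflects whether both variables sit on the same side of their averages.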

## Transforming Sentiment Data

Raw sentiment data is good, but transformed sentiment data might be even better. Let's try different mathematical transformations to see if we can extract more predictive power.

In [None]:
# Apply transformations to sentiment data
data_with_transformations = apply_sentiment_transformations(data_with_interactions)

# Save the dataset with transformations
transformations_path = os.path.join(output_dir, 'data_with_transformations.csv')
data_with_transformations.to_csv(transformations_path)
print(f"Saved dataset with transformations to {transformations_path}")
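
The notebook doesn't list which transformations `apply_sentiment_transformations` applies. As a hedged illustration of the kind of thing that's typical for a time series, the sketch below computes differences, percentage changes, and rolling statistics with pandas. The helper `transform_series` and the 3-period window are assumptions.

```python
import pandas as pd

def transform_series(s, window=3):
    """Return common time-series transforms of s side by side (hypothetical helper)."""
    return pd.DataFrame({
        "level": s,
        "diff": s.diff(),                         # change since the previous period
        "pct_change": s.pct_change(),             # relative change since the previous period
        "rolling_mean": s.rolling(window).mean(), # smoothed level
        "rolling_std": s.rolling(window).std(),   # recent volatility
    })

# Toy sentiment readings
s = pd.Series([100.0, 110.0, 99.0, 99.0, 108.0], name="SENTIMENT")
transformed = transform_series(s)
print(transformed)
```

Changes and rolling statistics describe momentum and volatility, which the raw level alone doesn't expose.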

## Finding the Best Features

We've created a lot of features - maybe too many! Let's use two selection methods, ANOVA F-values and recursive feature elimination (RFE), to keep the 20 features each method ranks as most useful for predicting recessions.

In [None]:
# Normalize the data
data_normalized, _ = normalize_data(data_with_transformations)

# Separate features and target
X = data_normalized.drop(columns=['recession'])
y = data_normalized['recession']

# Select features using ANOVA F-value
X_selected_anova, selected_features_anova = select_features(X, y, method='anova', k=20)

# Select features using Recursive Feature Elimination
X_selected_rfe, selected_features_rfe = select_features(X, y, method='rfe', k=20)
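
For readers who want to see the two methods without the wrapper, here is a minimal sketch using scikit-learn directly. It's an assumption that `select_features` does something similar. The synthetic data, the logistic regression estimator used for RFE, and `k=2` are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: six features, only f0 and f1 are related to the target
rng = np.random.default_rng(0)
n = 200
y = pd.Series(rng.integers(0, 2, size=n), name="recession")
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
X["f0"] += 2.0 * y
X["f1"] -= 1.5 * y

# ANOVA: score each feature on its own with a one-way F-test, keep the top k
anova = SelectKBest(score_func=f_classif, k=2).fit(X, y)
anova_features = list(X.columns[anova.get_support()])

# RFE: fit a model, drop the weakest feature, and repeat until k remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
rfe_features = list(X.columns[rfe.support_])

print("ANOVA:", anova_features)
print("RFE:  ", rfe_features)
```

ANOVA scores features one at a time, so it can keep several near-duplicates such as neighboring lags. RFE judges features jointly through the model, which is slower but takes redundancy into account.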

## The Sentiment Showdown

Time for the ultimate test - does consumer sentiment actually make our model better? Let's run the same analysis with and without sentiment features to see the difference.

In [None]:
def compare_mda_models(data, with_sentiment=True):
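    """
    Fit and report one MDA model, optionally dropping sentiment features.

    Call this twice (with_sentiment=True and with_sentiment=False) and
    compare the returned results to see what sentiment contributes.
    
    Parameters
    ----------
    data : pandas.DataFrame
        Input dataset, including the 'recession' target column
    with_sentiment : bool
        Whether to keep columns whose name contains 'SENTIMENT'
        
    Returns
    -------
    dict
        MDA results (model, accuracy, confusion matrix, CV scores, ...)
    """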
    # Make a copy of the data
    df = data.copy()
    
    # Identify sentiment-related columns (case-sensitive substring match)
    sentiment_cols = [col for col in df.columns if 'SENTIMENT' in col]
    
    # Remove sentiment features if not including them
    if not with_sentiment:
        if not sentiment_cols:
            print("Warning: no column name contains 'SENTIMENT'; nothing was removed")
        df = df.drop(columns=sentiment_cols)
        print(f"Removed {len(sentiment_cols)} sentiment-related features")
    
    # Normalize the data
    df_normalized, _ = normalize_data(df)
    
    # Separate features and target
    X = df_normalized.drop(columns=['recession'])
    y = df_normalized['recession']
    
    # Apply MDA
    mda_results = apply_mda(X, y)
    
    # Print model performance metrics
    print(f"Accuracy: {mda_results['accuracy']:.4f}")
    print("\nConfusion Matrix:")
    print(mda_results['conf_matrix'])
    print("\nClassification Report:")
    print(mda_results['class_report'])
    print(f"\nCross-Validation Scores: {mda_results['cv_scores']}")
    print(f"Mean CV Score: {mda_results['cv_scores'].mean():.4f}")
    
    # Plot feature importances (save before plt.show(), after which some backends discard the figure)
    if mda_results['feature_importance'] is not None:
        fig = plot_feature_importance(mda_results['feature_importance'])
        plt.title(f"Feature Importance {'With' if with_sentiment else 'Without'} Sentiment", fontsize=16)
        save_figure(fig, f"feature_importance_{'with' if with_sentiment else 'without'}_sentiment.png")
        plt.show()
    
    # Create and plot discriminant time series
    discriminant_df = create_discriminant_time_series(mda_results['model'], X, y)
    fig = plot_discriminant_time_series(discriminant_df)
    plt.title(f"Discriminant Function {'With' if with_sentiment else 'Without'} Sentiment", fontsize=16)
    save_figure(fig, f"discriminant_time_series_{'with' if with_sentiment else 'without'}_sentiment.png")
    plt.show()
    
    return mda_results

# Compare MDA models with and without sentiment features
print("\nMDA Model WITH Sentiment Features:")
mda_with_sentiment = compare_mda_models(data_with_transformations, with_sentiment=True)

print("\nMDA Model WITHOUT Sentiment Features:")
mda_without_sentiment = compare_mda_models(data_with_transformations, with_sentiment=False)
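
Scrolling between two long printouts makes the comparison hard to read. A small summary table helps, sketched below. The helper `summarize_runs` is hypothetical and relies only on the `'accuracy'` and `'cv_scores'` keys printed above. The numbers in the example are placeholders, not results from this notebook.

```python
import numpy as np
import pandas as pd

def summarize_runs(runs):
    """Build one row per run from result dicts with 'accuracy' and 'cv_scores' (hypothetical helper)."""
    rows = {
        name: {
            "accuracy": res["accuracy"],
            "cv_mean": float(np.mean(res["cv_scores"])),
            "cv_std": float(np.std(res["cv_scores"])),
        }
        for name, res in runs.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")

# Placeholder numbers for illustration only, NOT results from this notebook
example = {
    "with_sentiment": {"accuracy": 0.9, "cv_scores": np.array([0.8, 0.9, 1.0])},
    "without_sentiment": {"accuracy": 0.8, "cv_scores": np.array([0.7, 0.8, 0.9])},
}

summary = summarize_runs(example)
print(summary)
```

In this notebook the call would be `summarize_runs({"with_sentiment": mda_with_sentiment, "without_sentiment": mda_without_sentiment})`. Recessions are rare, so accuracy alone can look good for a model that never predicts one. The confusion matrix and classification report are worth reading alongside the table.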

## Testing Our Selected Features

Now let's see if our feature selection actually worked. Do the 20 features picked by ANOVA or RFE perform better than using everything we've got?

In [None]:
# Add recession indicator to selected features
X_selected_anova_with_target = X_selected_anova.copy()
X_selected_anova_with_target['recession'] = y

X_selected_rfe_with_target = X_selected_rfe.copy()
X_selected_rfe_with_target['recession'] = y

# Apply MDA to ANOVA-selected features
print("\nMDA Model with ANOVA-Selected Features:")
mda_anova = compare_mda_models(X_selected_anova_with_target, with_sentiment=True)

# Apply MDA to RFE-selected features
print("\nMDA Model with RFE-Selected Features:")
mda_rfe = compare_mda_models(X_selected_rfe_with_target, with_sentiment=True)
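
One caveat: the features above were selected using every row of the dataset, so any cross-validation score computed afterwards has already "seen" the held-out rows through the selection step and may look too optimistic. A common remedy is to put selection inside the cross-validation loop. The sketch below shows this with scikit-learn on synthetic data. `LinearDiscriminantAnalysis` stands in for the notebook's MDA model, and none of this describes how `apply_mda` or `select_features` work internally.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only the first two carry signal
rng = np.random.default_rng(0)
n_samples, n_features = 200, 30
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, n_features))
X[:, 0] += 2.0 * y
X[:, 1] -= 1.5 * y

# Scaling and selection are re-fit on the training part of every fold,
# so the held-out part never influences which features are kept
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=5),
    LinearDiscriminantAnalysis(),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

Shuffled folds are fine for this toy data, but for real monthly indicators a time-ordered split such as scikit-learn's `TimeSeriesSplit` is usually a better fit, provided every training window contains both recession and non-recession months.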