# Building Better Features for Recession Prediction

Now that we've explored our data, it's time to get it ready for machine learning. We'll handle missing values, create lag variables (many economic indicators start moving months before a downturn shows up), normalize everything so features measured on very different scales contribute comparably, and finally use PCA to trim down the feature set.

In [None]:
# Import notebook utilities
from notebook_utils import (
    # Setup functions
    setup_notebook, load_data, display_data_info, save_figure,
    
    # Import from econ_downturn package
    engineer_features, normalize_data, apply_pca
)

# Import other libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from IPython.display import display

# Set up the notebook environment
setup_notebook()

## Loading Our Clean Dataset

Let's grab the merged dataset we put together in our exploration phase.

In [None]:
# Load all data using the utility function
merged_data = load_data(use_cached=True)

# Display information about the dataset
display_data_info(merged_data)

## Creating Smart Features

Time to engineer some features that will help our model spot recession patterns. We'll create lag variables, moving averages, and other transformations that capture how economic indicators behave over time.
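
To make the idea concrete, here is a minimal sketch of lag and moving-average features in plain pandas. It is not the implementation of `engineer_features`; the `add_lags_and_rolling` helper, the column name, and the toy values are all hypothetical and only illustrate the kind of transformations described above.

```python
import pandas as pd

# Hypothetical monthly indicator; the values are made up for illustration
dates = pd.date_range("2020-01-01", periods=6, freq="MS")
toy = pd.DataFrame({"unemployment": [3.5, 3.6, 4.4, 14.7, 13.2, 11.0]}, index=dates)


def add_lags_and_rolling(df, column, lags=(1, 3), window=3):
    """Return a copy of df with lagged, rolling-mean and percent-change columns."""
    out = df.copy()
    for lag in lags:
        # shift(lag) gives each row the value observed `lag` periods earlier
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    # Trailing mean over the current row and the previous window - 1 rows
    out[f"{column}_ma{window}"] = out[column].rolling(window=window).mean()
    # Period-over-period relative change
    out[f"{column}_pct_change"] = out[column].pct_change()
    return out


toy_features = add_lags_and_rolling(toy, "unemployment")
print(toy_features)
```

The first rows of the lagged and rolling columns are `NaN` because there is no earlier history to draw from, so a step like this usually goes hand in hand with dropping or filling those rows.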

In [None]:
# Engineer features using the package function
data_with_features = engineer_features(merged_data)

print(f"Data shape after feature engineering: {data_with_features.shape}")
print(f"Number of columns (features plus target): {data_with_features.shape[1]}")

# Display the first few rows of the engineered data
display(data_with_features.head())

## Getting Everything on the Same Scale

Different economic indicators have wildly different scales: unemployment might be around 5% while GDP is in the trillions of dollars. Let's normalize everything so that no feature dominates the model just because of its units.
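
As a rough illustration of what normalization does, the sketch below standardizes each column to mean 0 and standard deviation 1 (a z-score). This is one common choice and an assumption on our part; the notebook does not show which scaling `normalize_data` applies. The `zscore` helper and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical values on very different scales
toy = pd.DataFrame({
    "unemployment_rate": [3.5, 5.0, 9.5, 4.0],    # percent
    "gdp": [19.0e12, 20.5e12, 18.2e12, 21.4e12],  # dollars
})


def zscore(df):
    """Standardize each column to mean 0 and population standard deviation 1."""
    means = df.mean()
    stds = df.std(ddof=0)  # ddof=0: population standard deviation
    return (df - means) / stds, means, stds


toy_scaled, means, stds = zscore(toy)
print(toy_scaled.round(3))
```

Returning `means` and `stds` alongside the scaled frame mirrors the idea of keeping a fitted scaler around: the same statistics can then be reused to transform new data consistently.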

In [None]:
# Normalize the data
data_normalized, scaler = normalize_data(data_with_features)

print(f"Data shape after normalization: {data_normalized.shape}")

# Display the first few rows of the normalized data
display(data_normalized.head())

## Reducing Complexity with PCA

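With all these features, we might have too much of a good thing, especially since lags and moving averages of the same indicator are strongly correlated with each other. PCA combines correlated variables into a smaller set of uncorrelated components. Here we pass `n_components=0.95`, asking for enough components to retain about 95% of the variance.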
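
The sketch below shows the idea behind a variance threshold using only NumPy: center the data, take an SVD, and keep the smallest number of components whose cumulative explained variance reaches the threshold. It is an illustration under our own assumptions, not the code inside `apply_pca`; the `pca_by_variance` helper and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 5 correlated features driven by 2 underlying factors plus noise
factors = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 5))
X_toy = factors @ loadings + 0.05 * rng.normal(size=(200, 5))


def pca_by_variance(X, threshold=0.95):
    """Project X onto the fewest components reaching `threshold` explained variance."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions; S holds the singular values
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    cumulative = np.cumsum(explained)
    # First position where the cumulative ratio reaches the threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    k = min(k, len(S))  # guard against rounding when threshold is close to 1
    scores = X_centered @ Vt[:k].T
    return scores, explained, k


scores, explained, k = pca_by_variance(X_toy, threshold=0.95)
print(f"Kept {k} of {X_toy.shape[1]} components")
print(f"Cumulative explained variance: {np.cumsum(explained)[k - 1]:.4f}")
```

Because the five toy features are built from only two factors, a couple of components are enough to cross the threshold, which is the same effect we hope to see on the correlated lag features.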

In [None]:
# Separate features and target
X = data_normalized.drop(columns=['recession'])
y = data_normalized['recession']

# Apply PCA
X_pca_df, pca = apply_pca(X, n_components=0.95)

# Calculate explained variance
explained_variance = np.sum(pca.explained_variance_ratio_)

print(f"Data shape after PCA: {X_pca_df.shape}")
print(f"Number of PCA components: {len(pca.explained_variance_ratio_)}")  # One ratio per retained component
print(f"Cumulative explained variance: {explained_variance:.4f}")

# Plot explained variance (keep a handle to the figure so it can be saved)
fig = plt.figure(figsize=(10, 6))
components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.bar(components, pca.explained_variance_ratio_, label='Individual')
plt.plot(components, np.cumsum(pca.explained_variance_ratio_), 'r-', label='Cumulative')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Principal Components')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()

# Save before showing: in notebooks plt.show() typically closes the figure,
# so calling plt.gcf() afterwards would return a new, empty figure
save_figure(fig, "pca_explained_variance.png")
plt.show()

## Saving Our Work

Time to save all these processed datasets so we can use them in our modeling phase.

In [None]:
# Get data paths for saving processed data
from econ_downturn import get_data_paths
data_paths = get_data_paths()
output_dir = data_paths['processed_dir']
os.makedirs(output_dir, exist_ok=True)

# Save the dataset with features
data_path = os.path.join(output_dir, 'data_with_features.csv')
data_with_features.to_csv(data_path)
print(f"Saved dataset with features to {data_path}")

# Save the normalized dataset
normalized_path = os.path.join(output_dir, 'data_normalized.csv')
data_normalized.to_csv(normalized_path)
print(f"Saved normalized dataset to {normalized_path}")

# Save the PCA dataset
pca_path = os.path.join(output_dir, 'data_pca.csv')
X_pca_df.to_csv(pca_path)
print(f"Saved PCA dataset to {pca_path}")