# Building Better Features for Recession Prediction

Now that we've explored our data, it's time to get it ready for machine learning. We'll clean up missing values, create some lag variables (because economic indicators often predict the future), and normalize everything so our model doesn't get confused by different scales.

In [None]:
# Import notebook utilities
from notebook_utils import init_notebook, load_data, display_data_info, save_figure
import os
import numpy as np
import matplotlib.pyplot as plt

# Initialize notebook environment
init_notebook()

# Import from econ_downturn package
from econ_downturn import engineer_features, normalize_data, apply_pca

## Loading Our Clean Dataset

Let's grab the merged dataset we put together in our exploration phase.

In [None]:
# Load all data using the utility function
merged_data = load_data(use_cached=True)

# Display information about the dataset
display_data_info(merged_data)

## Creating Smart Features

Our goal here is to transform raw recession-related indicators into informative model inputs. The engineer_features function automates this process, generating lagged variables, moving averages, and other transformations that capture how each indicator changes over time.

The engineered features are designed to reflect the time-dependent nature of economic signals. For example, a spike in unemployment may not indicate a recession today, but it could signal one several months later. By including lagged versions of each indicator, we allow the model to detect such delays. The result is a richer, more predictive dataset that reflects both short-term shifts and longer-term trends.

The engineer_features function:

1. Creates lagged versions of macroeconomic indicators.

2. Calculates percentage changes over time to quantify shifts.

3. Combines existing features (e.g. GDP & Unemployment) to measure joint effects.
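The three steps above can be sketched with plain pandas. This is only an illustration under assumptions, not the actual body of engineer_features: the helper name `sketch_features`, the column names, the lag choices, and the sample values are all hypothetical.

```python
import pandas as pd

# Hypothetical monthly indicators; names and values are illustrative only
dates = pd.date_range("2020-01-01", periods=6, freq="MS")
raw = pd.DataFrame(
    {
        "unemployment": [3.5, 3.6, 4.4, 14.7, 13.2, 11.0],
        "gdp": [100.0, 101.0, 99.0, 90.0, 92.0, 95.0],
    },
    index=dates,
)


def sketch_features(df, lags=(1, 3)):
    """Return a copy of df with lagged, percentage-change, and interaction columns."""
    out = df.copy()
    for col in df.columns:
        for lag in lags:
            # Value of the indicator `lag` periods earlier
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
        # Period-over-period percentage change
        out[f"{col}_pct_change"] = df[col].pct_change() * 100
    # Simple interaction term to capture a joint effect
    out["gdp_x_unemployment"] = df["gdp"] * df["unemployment"]
    return out


features = sketch_features(raw)
```

Lagging and differencing leave NaN values in the first few rows, so a real pipeline has to decide whether to drop or fill them before modeling.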

In [None]:
# Engineer features using the package function
data_with_features = engineer_features(merged_data)

print(f"Data shape after feature engineering: {data_with_features.shape}")
print(f"Number of features: {data_with_features.shape[1]}")

# Display the first few rows of the engineered data
display(data_with_features.head())

## Getting Everything on the Same Scale

One limitation of our data is that the indicators have wildly different scales. For instance, the unemployment rate might be 5% while GDP is in the trillions. Because these metrics are weighted and combined inside the model, we need to put them on a common scale before building the MDA model. If we left the features unscaled, variables with larger numeric values would dominate the learning process.

To avoid this, we normalize the engineered dataset using the normalize_data() function from our econ_downturn package. The normalized output keeps the original structure (same rows and columns) while giving each variable comparable influence on the MDA model.

With our normalize_data function:

1. A scaler is created to standardize the features.

2. This scaler is fit to our data and then used to transform all feature columns.

3. The new DataFrame is saved in the data_normalized variable for future use.
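As a rough sketch of those steps, the following assumes the scaler is scikit-learn's StandardScaler (zero mean, unit variance per column). The real normalize_data may use a different scaler or treat some columns differently; `sketch_normalize` and the sample values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical indicators on very different scales
df = pd.DataFrame(
    {
        "unemployment": [3.5, 5.0, 9.5],
        "gdp": [1.9e13, 2.1e13, 2.0e13],
    }
)


def sketch_normalize(data):
    """Standardize every column and return the scaled DataFrame plus the fitted scaler."""
    scaler = StandardScaler()
    scaled = scaler.fit_transform(data)
    # Rebuild a DataFrame so column names and the index are preserved
    scaled_df = pd.DataFrame(scaled, columns=data.columns, index=data.index)
    return scaled_df, scaler


scaled_df, fitted_scaler = sketch_normalize(df)
```

Returning the fitted scaler alongside the data makes it possible to apply the same transformation to new observations later with `fitted_scaler.transform(...)`.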

In [None]:
# Normalize the data
data_normalized, scaler = normalize_data(data_with_features)

print(f"Data shape after normalization: {data_normalized.shape}")

# Display the first few rows of the normalized data
display(data_normalized.head())

## Reducing Complexity with PCA

Next, we will use Principal Component Analysis (PCA) to reduce the dimensionality of our dataset. The goal is to simplify the feature space while keeping the most important information. This step helps reduce noise, redundancy, and multicollinearity in the engineered dataset, which we hope will improve the performance of our MDA model.

There are a few important steps to note in this section:

1. Separating the target variable:

    We remove the recession indicator column so that PCA is applied only to the predictor variables.


2. Applying PCA:

    Principal Component Analysis transforms our original features into a set of uncorrelated features called principal components. Each component is a linear combination of the original variables and captures a portion of the overall variance in the dataset. By passing n_components=0.95, we tell PCA to retain just enough components to explain 95% of the total variance, keeping the dominant patterns while dropping low-variance directions that mostly carry noise or redundancy.


3. Understanding the Output:

    We have created a new dataset with these principal components and added the recession indicator back into it for future modeling. This final dataset has fewer columns than the engineered dataset, but retains about 95% of the variance in the predictors.

4. Our Plot:

    The code outputs a combined bar and line chart showing how much variance each principal component captures. The bars show the individual contribution of each component, while the line shows the cumulative total. This is a visual way to verify that the retained components reach the 95% threshold.
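To make the 95% threshold concrete, here is a small sketch on synthetic data using scikit-learn's PCA directly. It assumes apply_pca wraps something similar; the data, the seed, and the variable names are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 6 correlated columns driven by 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# A float between 0 and 1 asks PCA to keep the fewest components whose
# cumulative explained variance reaches that fraction
pca_demo = PCA(n_components=0.95)
X_reduced = pca_demo.fit_transform(X_demo)

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"Kept {X_reduced.shape[1]} of {X_demo.shape[1]} dimensions")
print(f"Cumulative explained variance: {cumulative[-1]:.4f}")
```

Because the six columns are mostly mixtures of two underlying factors, PCA needs far fewer than six components to reach the threshold, which is the same redundancy we expect among correlated economic indicators.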

In [None]:
# Separate features and target
X = data_normalized.drop(columns=['recession'])
y = data_normalized['recession']

# Apply PCA, keeping enough components to explain 95% of the variance
X_pca_df, pca = apply_pca(X, n_components=0.95)

# Calculate explained variance
explained_variance = np.sum(pca.explained_variance_ratio_)

print(f"Data shape after PCA: {X_pca_df.shape}")
print(f"Number of PCA components: {len(pca.explained_variance_ratio_)}")
print(f"Cumulative explained variance: {explained_variance:.4f}")

# Add the target back so the PCA dataset is ready for modeling
# (assumes apply_pca returns a DataFrame with one row per row of X)
X_pca_df['recession'] = y.to_numpy()

# Plot explained variance: bars for each component, line for the cumulative total
fig = plt.figure(figsize=(10, 6))
components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.bar(components, pca.explained_variance_ratio_, label='Individual')
plt.plot(components, np.cumsum(pca.explained_variance_ratio_), 'r-', label='Cumulative')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Principal Components')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()

# Save before showing: after plt.show(), plt.gcf() can return a new, empty figure
save_figure(fig, "pca_explained_variance.png")
plt.show()

## Saving Our Work

After preparing the data through feature engineering, normalization, and PCA, we save each version of the dataset to a file. We can then reuse these three datasets in our modeling steps without rerunning the processing code.

Specifically, we save:

1. Data with Features:

    This is our basic transformed data, including all original indicators with the engineered lagged and smoothed variables.


2. Normalized Data:

    As discussed earlier, putting each of our macroeconomic indicators on the same numeric scale is key for our PCA and MDA steps. This data is ready to be fed into our PCA step.


3. PCA Data:

    This saved file will include only the principal components identified from PCA, as well as the target recession indicator. This is the most compact and analysis-ready version of our data, and is optimized for training our MDA model.


Saving these files streamlines our later steps, since the modeling notebooks can load the transformed data directly.
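For reference, a saved CSV can be read back with its index intact as sketched below. The example writes a small stand-in frame to a temporary directory, so the file name, column names, and the assumption of a date index are hypothetical rather than taken from our actual files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the PCA dataset, indexed by month
dates = pd.date_range("2020-01-01", periods=3, freq="MS")
demo = pd.DataFrame({"PC1": [0.1, -0.2, 0.3], "recession": [0, 1, 1]}, index=dates)
demo.index.name = "date"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_pca_demo.csv")
    # to_csv writes the index as the first column by default
    demo.to_csv(path)
    # index_col=0 restores that column as the index; parse_dates=True converts it back to dates
    reloaded = pd.read_csv(path, index_col=0, parse_dates=True)
```

Without `index_col=0`, the saved index would come back as an ordinary extra column.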

In [None]:
# Get data paths for saving processed data
from econ_downturn import get_data_paths
data_paths = get_data_paths()
output_dir = data_paths['processed_dir']
os.makedirs(output_dir, exist_ok=True)

# Save the dataset with features
data_path = os.path.join(output_dir, 'data_with_features.csv')
data_with_features.to_csv(data_path)
print(f"Saved dataset with features to {data_path}")

# Save the normalized dataset
normalized_path = os.path.join(output_dir, 'data_normalized.csv')
data_normalized.to_csv(normalized_path)
print(f"Saved normalized dataset to {normalized_path}")

# Save the PCA dataset
pca_path = os.path.join(output_dir, 'data_pca.csv')
X_pca_df.to_csv(pca_path)
print(f"Saved PCA dataset to {pca_path}")