# Add peer-based features using learned embeddings

This notebook will:

1. Load a dataset and train a neural network to learn embeddings for categorical features.
2. Extract the embeddings.
3. Construct peer-based features using the learned embeddings.
4. Export the new features for use in other models.
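
To make step 3 concrete, here is a minimal sketch of one way a distance-weighted peer feature could be computed from learned embeddings. It is not the project's `distance_weighted_feature` implementation, which this notebook does not show. The function name `distance_weighted_peer_feature`, the inverse-Euclidean-distance weighting, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def distance_weighted_peer_feature(
    embeddings: pd.DataFrame, feature: pd.Series, eps: float = 1e-8
) -> pd.Series:
    """Average each entity's peers' feature values, weighting closer peers more.

    Hypothetical helper: `embeddings` has one row per entity (at least two rows),
    and `feature` holds one value per entity, indexed like `embeddings`.
    """
    E = embeddings.to_numpy(dtype=float)
    x = feature.reindex(embeddings.index).to_numpy(dtype=float)

    # Pairwise Euclidean distances between embedding vectors, shape (n, n).
    diff = E[:, None, :] - E[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Inverse-distance weights (an assumed weighting scheme).
    weights = 1.0 / (dist + eps)
    np.fill_diagonal(weights, 0.0)  # an entity is not its own peer
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1

    return pd.Series(
        weights @ x, index=embeddings.index, name=f"peer_based_{feature.name}"
    )


# Toy data: three entities with 2-dimensional embeddings.
toy_embeddings = pd.DataFrame(
    [[0.0, 0.0], [0.0, 1.0], [0.0, 3.0]], index=["A", "B", "C"]
)
toy_feature = pd.Series([1.0, 2.0, 4.0], index=["A", "B", "C"], name="BM")
peer_bm = distance_weighted_peer_feature(toy_embeddings, toy_feature)
print(peer_bm)
```

In this sketch, entity `A` sits closer to `B` than to `C` in embedding space, so `B`'s value dominates `A`'s peer feature. Other choices, such as a softmax over negative distances or a k-nearest-neighbour cutoff, would fit the same pattern.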

In [1]:
import json
import pandas as pd
from pathlib import Path
from src.ml_backend import load_processed
from src.ml_backend import NeuralNetworkWithEmbeddings
from src.ml_backend import split_data

## Preparation

- load preprocessed data
- load data split configs
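
The split configuration drives a rolling window: each iteration validates on `val_years` years starting at `val_start`, tests on the following `test_years` years, and then shifts forward by `test_years`. The sketch below mirrors the stopping rule of the training loop later in this notebook. The generator name `rolling_windows` and the example values are hypothetical and are not read from `configs.json`.

```python
from typing import Iterator, Tuple


def rolling_windows(
    val_start: int, val_years: int, test_years: int, test_end: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (val_start, test_start, test_stop) per iteration, years inclusive.

    Hypothetical helper that follows the same update and stopping rule as the
    notebook's training loop; the first window is always produced.
    """
    while True:
        test_start = val_start + val_years
        test_stop = test_start + test_years - 1
        yield val_start, test_start, test_stop

        # Shift the window forward by the length of the test period.
        val_start += test_years
        # Stop once the next test period would run past `test_end`.
        if val_start + val_years + test_years - 1 > test_end:
            break


# Example values (assumptions, chosen only for illustration).
windows = list(rolling_windows(val_start=2013, val_years=2, test_years=1, test_end=2017))
for v_start, t_start, t_stop in windows:
    print(f"validation starts {v_start}, test period {t_start} to {t_stop}")
```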

In [None]:
# Load the data-split configuration; these values are passed to `split_data` below
with open(Path("configs.json"), "r") as f:
    configs = json.load(f)["train_iteration"]

train_start: int = configs["train_start"]
val_start: int = configs["val_start"]
val_years: int = configs["val_years"]
test_years: int = configs["test_years"]
test_end: int = configs["test_end"]

# Load the processed data
preprocessed_data = load_processed(reprocesse=False)  # set to True to force reprocessing

Processed data not found or reprocesse is True. Loading and preprocessing raw data...
Preprocessing: Handling missing values...
Preprocessing: Converting binary variables...
Preprocessing: Converting categorical variables to dummies...
Preprocessing: Rank-transforming continuous variables...
Preprocessing: Shifting target variable...
Preprocessing: Filling remaining missing values with 0...
Saving processed data...


In [3]:
# Report every column whose dtype is not float64
for col in preprocessed_data.columns:
    if preprocessed_data[col].dtype != "float64":
        print(f"Column {col} has datatype {preprocessed_data[col].dtype}")

Column AbnormalAccruals has datatype float32
Column Accruals has datatype float32
Column Activism1 has datatype float32
Column Activism2 has datatype float32
Column AdExp has datatype float32
Column AgeIPO has datatype float32
Column AM has datatype float32
Column AnalystRevision has datatype float32
Column AnalystValue has datatype float32
Column AnnouncementReturn has datatype float32
Column AOP has datatype float32
Column AssetGrowth has datatype float32
Column Beta has datatype float32
Column BetaFP has datatype float32
Column BetaLiquidityPS has datatype float32
Column BetaTailRisk has datatype float32
Column betaVIX has datatype float32
Column BidAskSpread has datatype float32
Column BM has datatype float32
Column BMdec has datatype float32
Column BookLeverage has datatype float32
Column BPEBM has datatype float32
Column BrandInvest has datatype float32
Column Cash has datatype float32
Column CashProd has datatype float32
Column CBOperProf has datatype float32
Column CF has datat

## Train neural network with embeddings

The model is defined in `src/ml_backend/models.py` as `NeuralNetworkWithEmbeddings`.

All models defined in this project have a consistent interface for training and prediction.

- `model.auto_tune()` tunes the hyperparameters (defined in `configs.json`).
- `model.train_final()` trains the model on the training and validation data combined.
- `model.predict()` generates predictions on the test data.
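
One way to express this shared interface is a `typing.Protocol`. The sketch below is an illustration, not code from `src/ml_backend`: the names `TunableModel`, `run_iteration`, and `ConstantModel` are hypothetical, and the return types are assumptions (the notebook only shows that `predict()` returns an object with `.columns`, which suggests a `DataFrame`).

```python
from typing import Protocol, runtime_checkable

import pandas as pd


@runtime_checkable
class TunableModel(Protocol):
    """Assumed shape of the shared model interface."""

    def auto_tune(self) -> None: ...

    def train_final(self) -> None: ...

    def predict(self) -> pd.DataFrame: ...


def run_iteration(model: TunableModel) -> pd.DataFrame:
    """Run the three steps in the order the notebook uses them."""
    model.auto_tune()
    model.train_final()
    return model.predict()


class ConstantModel:
    """Toy stand-in that satisfies the interface; not part of the project."""

    def __init__(self, value: float, n_rows: int) -> None:
        self.value = value
        self.n_rows = n_rows
        self.calls = []  # records the order in which methods are called

    def auto_tune(self) -> None:
        self.calls.append("auto_tune")

    def train_final(self) -> None:
        self.calls.append("train_final")

    def predict(self) -> pd.DataFrame:
        self.calls.append("predict")
        return pd.DataFrame({"prediction": [self.value] * self.n_rows})


toy_model = ConstantModel(value=0.5, n_rows=3)
toy_predictions = run_iteration(toy_model)
print(toy_predictions)
```

Because every model exposes the same three methods, the training loop can swap one model class for another without further changes.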

In [None]:
# loop for training iterations
predictions_list = []
while True:
    train_df, val_df, test_df = split_data(
        df=preprocessed_data,
        train_start=train_start,
        val_start=val_start,
        val_years=val_years,
        test_years=test_years
    )
    
    model = NeuralNetworkWithEmbeddings(
        train_df=train_df,
        val_df=val_df,
        test_df=test_df
    )

    # Human-readable test period for this iteration, reused in the log messages
    test_period = f"{val_start + val_years} to {val_start + val_years + test_years - 1}"
    print(f"{'='*20} New Iteration {'='*20}")
    print(f"start auto-tuning for test period {test_period}")
    model.auto_tune()
    print(f"start final training for test period {test_period}")
    model.train_final()
    print(f"start prediction for test period {test_period}")
    predictions = model.predict()
    predictions_list.append(predictions)

    # Build peer-based features and save them to CSV.
    # Example: only the first two columns of the predictions are used here.
    all_new_features = []
    for col in predictions.columns[:2]:
        new_feature_name = f"peer_based_{col}"
        new_feature: pd.Series = model.distance_weighted_feature(col)
        all_new_features.append(new_feature.rename(new_feature_name))
    new_features_df = pd.concat(all_new_features, axis=1)
    # Note: the file name carries the validation window (val_start to val_start + val_years - 1)
    new_features_df.to_csv(Path(f"data/test/peer_based_features_{val_start}_{val_start + val_years - 1}.csv"), index=False)

    print(f"Completed predictions for test period {test_period}")
    # Update for next iteration
    val_start += test_years
    if val_start + val_years + test_years - 1 > test_end:
        break

# Combine all predictions
all_predictions = pd.concat(predictions_list, ignore_index=True)
all_predictions.to_csv(Path("data/test/predictions_with_embeddings.csv"), index=False)


[I 2025-11-26 10:50:26,172] A new study created in memory with name: no-name-8ce8d819-4246-4b0d-b7e0-39a67d55ba7a


start auto-tuning for test period 2015 to 2015


2025-11-26 10:50:26.188082: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Max
2025-11-26 10:50:26.195040: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2025-11-26 10:50:26.195396: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2025-11-26 10:50:26.195441: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-11-26 10:50:26.195470: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2025-11-26 10:50:27.053642: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
