# **Training a CO2 Emissions Prediction Model**

## **1. Introduction**

This notebook marks the third stage of our MLOps pipeline: model training. Leveraging the preprocessed and segregated data from the previous step, our goal is to train a Multilayer Perceptron (MLP) model to predict CO2 emissions.

The main activities in this notebook are:
* Loading the versioned training and testing datasets from Weights & Biases (Wandb).
* Preparing the data for the model by separating features and the target variable.
* Conducting a systematic hyperparameter sweep to find the best model configuration.
* Training the model with each configuration.
* Evaluating the model's performance on both training and testing data.
* Logging the hyperparameters and performance metrics of every run, and saving each trained model's files as artifacts in Wandb.

## **2. Library Imports**

We start by importing the necessary libraries for our tasks.

In [None]:
import wandb
import os
import pandas as pd
import numpy as np
import time
import warnings

# Suppress warnings for a cleaner output
warnings.filterwarnings("ignore")

# tensorflores is a custom library for the Multilayer Perceptron.
# Make sure it is correctly installed and accessible in your environment.
from tensorflores.models.multilayer_perceptron import MultilayerPerceptron


# To run this notebook, you need a Wandb account and an API key.
# Create a file named my_key.py containing the line: WANDB_KEY = 'your_api_key_here'
# and keep that file out of version control.
from my_key import WANDB_KEY

# Authenticate this session with the imported key.
wandb.login(key=WANDB_KEY)
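
As an alternative to a key file, the key can come from the environment. The sketch below assumes the key is stored in the `WANDB_API_KEY` environment variable, which the Wandb client also reads on its own. `resolve_wandb_key` is a hypothetical helper written for this illustration, not part of the Wandb API.

```python
import os


def resolve_wandb_key(env=os.environ, fallback=None):
    """Return the key from WANDB_API_KEY, else the fallback, else None.

    Hypothetical helper for illustration only.
    """
    key = env.get("WANDB_API_KEY") or fallback
    return key.strip() if key else None


key = resolve_wandb_key()
if key is None:
    print("No API key found; set WANDB_API_KEY or create my_key.py.")
else:
    print("API key found.")
```

The `fallback` argument lets the value imported from `my_key.py` serve as a backup when the environment variable is not set.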

## **3. Loading Versioned Datasets**

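To ensure reproducibility, we do not read local files. Instead, we pull the `train_dataset` and `test_dataset` artifacts from our Wandb project, so the model is trained on the data produced by the previous stage. Note that the `latest` alias resolves to the most recent version of each artifact; to reproduce a specific run later, reference an explicit version (for example `train_dataset:v0`) instead.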

In [None]:
# Initialize a temporary Wandb run to download the artifacts.
# This run will be closed once the data is loaded.
with wandb.init(project="SBAI 2025", job_type="data-loading") as run:
    
    # --- Load Training Data ---
    train_artifact = run.use_artifact("train_dataset:latest")
    train_path = train_artifact.download()
    train_csv_path = os.path.join(train_path, os.listdir(train_path)[0])
    df_train = pd.read_csv(train_csv_path)
    print("Training dataset loaded successfully.")
    display(df_train.head())

    # --- Load Testing Data ---
    test_artifact = run.use_artifact("test_dataset:latest")
    test_path = test_artifact.download()
    test_csv_path = os.path.join(test_path, os.listdir(test_path)[0])
    df_test = pd.read_csv(test_csv_path)
    print("Testing dataset loaded successfully.")
    display(df_test.head())
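
The loading cell picks the first entry returned by `os.listdir`, which works when the artifact holds exactly one file. A more defensive variant could look like the sketch below. `find_single_csv` is a hypothetical helper, and the temporary directory only stands in for a downloaded artifact folder.

```python
import os
import tempfile


def find_single_csv(directory):
    """Return the path of the only CSV file in `directory`.

    Raises ValueError when there is no CSV or more than one,
    instead of silently picking whichever file is listed first.
    """
    csv_files = sorted(
        name for name in os.listdir(directory) if name.lower().endswith(".csv")
    )
    if len(csv_files) != 1:
        raise ValueError(
            f"Expected exactly one CSV in {directory!r}, found {len(csv_files)}"
        )
    return os.path.join(directory, csv_files[0])


# Demo: a temporary folder standing in for an artifact download directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train.csv", "README.txt"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("placeholder\n")
    found = os.path.basename(find_single_csv(tmp))

print(found)
```

In the loading cell, `find_single_csv(train_path)` would take the place of `os.path.join(train_path, os.listdir(train_path)[0])`.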

## **4. Data Preparation**

With the datasets loaded, we need to separate them into input features (X) and the target variable (y). We also convert the pandas DataFrames into NumPy arrays, which is the expected input format for our `MultilayerPerceptron` model.

In [None]:
# Define the target and input feature columns
target_column = ['CO2 (g/s) [estimated maf]']
feature_columns = ['intake_pressure', 'intake_temperature', 'rpm', 'speed']

# --- Prepare Training Data ---
X_train = df_train[feature_columns].values
y_train = df_train[target_column].values

# --- Prepare Testing Data ---
X_test = df_test[feature_columns].values
y_test = df_test[target_column].values

print("Data prepared for training and testing.")
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
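
Because `target_column` is a list, `y_train` and `y_test` come out two-dimensional, with shape `(n_samples, 1)`, which is what allows `y_train.shape[1]` to be used as the output size later. The sketch below shows the difference on a tiny DataFrame with made-up placeholder values (not taken from the real dataset).

```python
import pandas as pd

# Placeholder rows for illustration only; not real measurements.
df = pd.DataFrame({
    "rpm": [800, 1500, 2200],
    "speed": [0, 30, 60],
    "CO2 (g/s) [estimated maf]": [0.5, 1.2, 2.0],
})

# Selecting with a list of column names keeps two dimensions.
y_2d = df[["CO2 (g/s) [estimated maf]"]].values

# Selecting with a single column name yields a flat array.
y_1d = df["CO2 (g/s) [estimated maf]"].values

print(y_2d.shape, y_1d.shape)  # (3, 1) (3,)
```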

## **5. Hyperparameter Sweep (Model Training Loop)**

Instead of training a single model, we perform a **hyperparameter sweep**: we train one model for every combination of activation functions, hidden layer architecture, and learning rate, then compare the results. The grid below has 3 activation sets × 3 architectures × 2 learning rates, for 18 runs in total. This is a manual grid search implemented with nested loops, not the Wandb Sweeps feature.

Each combination of these hyperparameters will constitute a single, tracked experiment in Wandb.
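
The full grid can also be enumerated with `itertools.product`, which visits the combinations in the same order as three nested loops. The sketch below reuses the values from this section's grid and only counts the combinations; it does not train anything.

```python
from itertools import product

activation_sets = [
    ['relu', 'relu', 'linear'],
    ['tanh', 'relu', 'linear'],
    ['sigmoid', 'sigmoid', 'linear'],
]
hidden_layer_sets = [[16, 8], [32, 16], [64, 32]]
learning_rates = [0.01, 0.001]

# Each element is an (activations, hidden_layers, learning_rate) tuple.
grid = list(product(activation_sets, hidden_layer_sets, learning_rates))

print(len(grid))  # 18
for act_funcs, hidden_layers, lr in grid[:2]:
    print(act_funcs, hidden_layers, lr)
```

Flattening the loops this way keeps the training body at a single indentation level and makes the number of runs easy to check before starting the sweep.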

In [None]:
os.makedirs("cpp_models", exist_ok=True)
os.makedirs("json_models", exist_ok=True)

In [None]:
# Define the hyperparameter grid for our sweep
activation_sets = [
    ['relu', 'relu', 'linear'],
    ['tanh', 'relu', 'linear'],
    ['sigmoid', 'sigmoid', 'linear']
]
hidden_layer_sets = [
    [16, 8],
    [32, 16],
    [64, 32]
]
learning_rates = [0.01, 0.001]

print("Starting hyperparameter sweep...")

# Loop through each combination of hyperparameters
for act_funcs in activation_sets:
    for hidden_layers in hidden_layer_sets:
        for lr in learning_rates:

            # 1. Define the configuration for the current run
            config = {
                'input_size': X_train.shape[1],
                'output_size': y_train.shape[1],
                'hidden_layer_sizes': hidden_layers,
                'activation_functions': act_funcs,
                'weight_bias_init': 'RandomNormal',
                'training_with_quantization': False,
                'epochs': 100,  # A fixed number of epochs for this sweep
                'learning_rate': lr,
                'loss_function': 'mean_squared_error',
                'optimizer': 'adamax',
                'batch_size': 36,
                'validation_split': 0.2
            }

            # 2. Initialize a new Wandb run for this specific configuration
            with wandb.init(project="SBAI 2025", job_type="training", config=config, save_code=True) as run:
                
                # 3. Instantiate and train the model
                print(f"Training with config: layers={hidden_layers}, activations={act_funcs}, lr={lr}")
                nn = MultilayerPerceptron(
                    input_size=config['input_size'],
                    output_size=config['output_size'],
                    hidden_layer_sizes=config['hidden_layer_sizes'],
                    activation_functions=config['activation_functions'],
                    weight_bias_init=config['weight_bias_init'],
                    training_with_quantization=config['training_with_quantization']
                )

                start_time = time.time()
                nn.train(
                    X=X_train,
                    y=y_train,
                    epochs=config['epochs'],
                    learning_rate=config['learning_rate'],
                    loss_function=config['loss_function'],
                    optimizer=config['optimizer'],
                    batch_size=config['batch_size'],
                    validation_split=config['validation_split']
                )
                train_time = time.time() - start_time
                print(f"Training finished in {train_time:.2f} seconds.")

                # 4. Evaluate the model and calculate metrics
                y_pred_test = nn.predict(X_test)
                mse_test = np.mean(np.square(y_pred_test - y_test))
                mae_test = np.mean(np.abs(y_pred_test - y_test))

                y_pred_train = nn.predict(X_train)
                mse_train = np.mean(np.square(y_pred_train - y_train))
                mae_train = np.mean(np.abs(y_pred_train - y_train))

                # 5. Save the model in different formats
                model_base_name = f'model_{run.id}'
                nn.save_model_as_cpp('./cpp_models/' + model_base_name)
                nn.save_model_as_json('./json_models/' + model_base_name)

                # 6. Log metrics and model artifacts to Wandb
                metrics_to_log = {
                    'train_time': train_time,
                    'mse_train': mse_train,
                    'mae_train': mae_train,
                    'mse_test': mse_test,
                    'mae_test': mae_test
                }
                run.log(metrics_to_log)

                cpp_artifact = wandb.Artifact(name=f"model-cpp-{run.id}", type="model")
                cpp_artifact.add_file(f'./cpp_models/{model_base_name}.h')
                run.log_artifact(cpp_artifact)

                json_artifact = wandb.Artifact(name=f"model-json-{run.id}", type="model")
                json_artifact.add_file(f'./json_models/{model_base_name}.json')
                run.log_artifact(json_artifact)
                
                print(f"Run {run.id} finished and logged.")

print("Hyperparameter sweep completed.")
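
One detail worth checking in the evaluation step is the shape of the predictions. The metrics in the loop subtract `y_test`, which has shape `(n, 1)`, from the output of `nn.predict`. If the prediction array were flat, with shape `(n,)`, NumPy would broadcast the subtraction to an `(n, n)` matrix and the metrics would be silently wrong. We have not verified which shape `predict` returns, so the sketch below is only a defensive variant for the single-output case. `regression_metrics` is a hypothetical helper and the numbers are placeholders.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE and MAE after flattening both arrays.

    Assumes a single output variable, so flattening is safe.
    """
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"Shape mismatch: {y_true.shape} vs {y_pred.shape}")
    err = y_pred - y_true
    return {
        "mse": float(np.mean(np.square(err))),
        "mae": float(np.mean(np.abs(err))),
    }


# Placeholder values chosen to expose the broadcasting issue.
y_true = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1)
y_pred = np.array([1.5, 2.0, 2.0])        # shape (3,)

print((y_pred - y_true).shape)  # (3, 3): broadcast, not element-wise
print(regression_metrics(y_true, y_pred))
```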

## **6. Conclusion**

This notebook has automated the process of training and evaluating multiple models. By performing a hyperparameter sweep, we can systematically explore different model architectures and training parameters. Each run is logged in Weights & Biases, providing a comprehensive overview of performance metrics and linking them to the specific configurations that produced them.

After the sweep is complete, you can analyze the results in the Wandb dashboard to select the best-performing model for deployment or further analysis. The saved model artifacts (`.h` and `.json` files) are versioned and ready to be used in the next stage of the pipeline.
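
Model selection can also be done programmatically once the metrics of each run are collected. The sketch below assumes the metrics have been gathered into a list of dictionaries. The run names and numbers are placeholders for illustration, not results from this sweep.

```python
# Placeholder records; in practice these would come from the logged runs.
results = [
    {"run_id": "run_a", "hidden_layer_sizes": [16, 8], "learning_rate": 0.01,
     "mse_train": 0.25, "mse_test": 0.30},
    {"run_id": "run_b", "hidden_layer_sizes": [32, 16], "learning_rate": 0.001,
     "mse_train": 0.20, "mse_test": 0.22},
    {"run_id": "run_c", "hidden_layer_sizes": [64, 32], "learning_rate": 0.01,
     "mse_train": 0.10, "mse_test": 0.35},
]

# Rank by test error, since a low training error alone can hide overfitting.
best = min(results, key=lambda r: r["mse_test"])

print(best["run_id"], best["mse_test"])
```

Ranking on `mse_test` rather than `mse_train` matters here: in the placeholder data, `run_c` has the lowest training error but the highest test error.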