# H2O Benchmark: HTGR Micro-Core Quadrant Power

**Input**

- `theta1`: Angle of control drum in quadrant 1 (radians) 
- `theta2`: Angle of control drum in quadrant 1 (radians) 
- `theta3`: Angle of control drum in quadrant 2 (radians)  
- `theta4`: Angle of control drum in quadrant 2 (radians)
- `theta5`: Angle of control drum in quadrant 3 (radians)
- `theta6`: Angle of control drum in quadrant 3 (radians)
- `theta7`: Angle of control drum in quadrant 4 (radians)  
- `theta8`: Angle of control drum in quadrant 4 (radians)  

**Output** 

- `fluxQ1`: Neutron flux in quadrant 1 ($\frac{\text{neutrons}}{\text{cm}^{2}\,\text{s}}$)
- `fluxQ2`: Neutron flux in quadrant 2 ($\frac{\text{neutrons}}{\text{cm}^{2}\,\text{s}}$)
- `fluxQ3`: Neutron flux in quadrant 3 ($\frac{\text{neutrons}}{\text{cm}^{2}\,\text{s}}$)
- `fluxQ4`: Neutron flux in quadrant 4 ($\frac{\text{neutrons}}{\text{cm}^{2}\,\text{s}}$)


We benchmark the complete HTGR dataset of 3004 samples with H2O ML (version 3.46.0.5) to compare pyMAISE against other industry-standard ML benchmarking frameworks. We follow the same procedure as in the original HTGR example: first we extend the dataset to 3004 samples using symmetry, then we train and evaluate models to compare results.

In [1]:
# Importing Packages
import time
import numpy as np
import pandas as pd

# Set display option to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Set the width of the columns
pd.set_option('display.width', None)

# See the full content of each column
pd.set_option('display.max_colwidth', None)

import xarray as xr
import matplotlib.pyplot as plt
from scipy.stats import uniform, randint
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer, MinMaxScaler
# Plot settings
matplotlib_settings = {
    "font.size": 12,
    "legend.fontsize": 11,
    "figure.figsize": (8, 8)
}
plt.rcParams.update(**matplotlib_settings)

## Processing the data

First, we will load the raw data into a dataframe and print it out.

In [2]:
import os

cwd = os.getcwd()
new_cwd = cwd.replace("/docs/source/benchmarks", "/pyMAISE/datasets")

# Define the full path to the microreactor.csv file
csv_path = os.path.join(new_cwd, 'microreactor.csv')

# Load the CSV file into a pandas DataFrame
raw_data = pd.read_csv(csv_path)
raw_data.head()

Unnamed: 0,sample number,cpu_time,runtime,k,fluxQ1,fluxQ2,fluxQ3,fluxQ4,k_uncert,flux_runcertQ1,flux_runcertQ2,flux_runcertQ3,flux_runcertQ4,fissQ1,fissQ2,fissQ3,fissQ4,fissEQ1,fissEQ2,fissEQ3,fissEQ4,fiss_runcertQ1,fiss_runcertQ2,fiss_runcertQ3,fiss_runcertQ4,fissE_runcertQ1,fissE_runcertQ2,fissE_runcertQ3,fissE_runcertQ4,theta1,theta2,theta3,theta4,theta5,theta6,theta7,theta8
0,sample_00000,4260.0,200.0,0.998328,2.58e+19,2.59e+19,2.67e+19,2.56e+19,0.00019,0.00112,0.00111,0.00111,0.00108,8.49e+16,8.49e+16,8.48e+16,8.49e+16,2751290,2751060,2749270,2750450,0.0006,0.0006,0.00063,0.00062,0.0006,0.0006,0.00063,0.00062,5.919526,2.369503,2.923656,4.488987,3.683212,4.008905,4.970368,2.987966
1,sample_00001,2570.0,130.0,0.988522,2.55e+19,2.53e+19,2.51e+19,2.51e+19,0.00025,0.00142,0.00148,0.00154,0.0015,8.49e+16,8.49e+16,8.49e+16,8.49e+16,2750610,2750210,2750150,2750110,0.00076,0.00077,0.00084,0.00074,0.00076,0.00077,0.00084,0.00074,2.16238,0.273624,0.927741,4.595586,2.598824,0.170167,2.124048,4.980209
2,sample_00002,2590.0,130.0,1.00461,2.57e+19,2.58e+19,2.52e+19,2.52e+19,0.00025,0.00167,0.00163,0.00161,0.00165,8.48e+16,8.48e+16,8.49e+16,8.49e+16,2748870,2749690,2752250,2751840,0.00076,0.00077,0.00086,0.0008,0.00076,0.00077,0.00086,0.0008,0.4501,0.006301,2.512217,3.313864,1.913458,3.582252,0.280764,4.888595
3,sample_00003,2580.0,129.0,0.991892,2.57e+19,2.58e+19,2.52e+19,2.56e+19,0.00025,0.00197,0.00193,0.00195,0.002,8.48e+16,8.49e+16,8.48e+16,8.47e+16,2748920,2750720,2749330,2746220,0.00082,0.00076,0.0008,0.00078,0.00082,0.00076,0.0008,0.00078,0.461105,4.825628,3.771356,2.599278,2.056019,0.007332,1.106786,5.504671
4,sample_00004,2570.0,129.0,0.985047,2.54e+19,2.62e+19,2.58e+19,2.52e+19,0.00025,0.00167,0.00167,0.00172,0.00169,8.48e+16,8.49e+16,8.48e+16,8.49e+16,2748910,2753130,2747870,2752420,0.0008,0.00081,0.00082,0.00083,0.0008,0.00081,0.00082,0.00083,5.248202,3.549416,3.333632,3.90731,2.095312,5.585145,3.774253,2.48012


We are then going to create input and output dataframes by defining our input and output variables.

In [3]:
# Create the input DataFrame with theta values
input_columns = ['theta1', 'theta2', 'theta3', 'theta4', 'theta5', 'theta6', 'theta7', 'theta8']
inputs = raw_data[input_columns]

# Create the output DataFrame with flux values
output_columns = ['fluxQ1', 'fluxQ2', 'fluxQ3', 'fluxQ4']
outputs = raw_data[output_columns]

Below, we display the first rows of the input and output dataframes and then combine them into a single dataset.

In [4]:
inputs.head()

Unnamed: 0,theta1,theta2,theta3,theta4,theta5,theta6,theta7,theta8
0,5.919526,2.369503,2.923656,4.488987,3.683212,4.008905,4.970368,2.987966
1,2.16238,0.273624,0.927741,4.595586,2.598824,0.170167,2.124048,4.980209
2,0.4501,0.006301,2.512217,3.313864,1.913458,3.582252,0.280764,4.888595
3,0.461105,4.825628,3.771356,2.599278,2.056019,0.007332,1.106786,5.504671
4,5.248202,3.549416,3.333632,3.90731,2.095312,5.585145,3.774253,2.48012


In [5]:
outputs.head()

Unnamed: 0,fluxQ1,fluxQ2,fluxQ3,fluxQ4
0,2.58e+19,2.59e+19,2.67e+19,2.56e+19
1,2.55e+19,2.53e+19,2.51e+19,2.51e+19
2,2.57e+19,2.58e+19,2.52e+19,2.52e+19
3,2.57e+19,2.58e+19,2.52e+19,2.56e+19
4,2.54e+19,2.62e+19,2.58e+19,2.52e+19


In [6]:
combined_df = pd.concat([inputs, outputs], axis=1)
print(combined_df.head())

     theta1    theta2    theta3    theta4    theta5    theta6    theta7  \
0  5.919526  2.369503  2.923656  4.488987  3.683212  4.008905  4.970368   
1  2.162380  0.273624  0.927741  4.595586  2.598824  0.170167  2.124048   
2  0.450100  0.006301  2.512217  3.313864  1.913458  3.582252  0.280764   
3  0.461105  4.825628  3.771356  2.599278  2.056019  0.007332  1.106786   
4  5.248202  3.549416  3.333632  3.907310  2.095312  5.585145  3.774253   

     theta8        fluxQ1        fluxQ2        fluxQ3        fluxQ4  
0  2.987966  2.580000e+19  2.590000e+19  2.670000e+19  2.560000e+19  
1  4.980209  2.550000e+19  2.530000e+19  2.510000e+19  2.510000e+19  
2  4.888595  2.570000e+19  2.580000e+19  2.520000e+19  2.520000e+19  
3  5.504671  2.570000e+19  2.580000e+19  2.520000e+19  2.560000e+19  
4  2.480120  2.540000e+19  2.620000e+19  2.580000e+19  2.520000e+19  


Now we extend the dataset to 3004 samples, following the same steps as in the original HTGR example.

In [7]:
# Credit to mult_sym and g21 from https://github.com/deanrp2/MicroControl/blob/main/pmdata/utils.py#L51
theta_cols = [f"theta{i + 1}" for i in range(8)]
flux_cols = [f"fluxQ{i + 1}" for i in range(4)]

def mult_samples(data):
    # Create empty arrays
    ht = xr.DataArray(
        np.zeros(data.shape), 
        coords={
            "index": [f"{idx}_h" for idx in data.coords["index"].values],
            "variable": data.coords["variable"],
        },
    )
    vt = xr.DataArray(
        np.zeros(data.shape), 
        coords={
            "index": [f"{idx}_v" for idx in data.coords["index"].values],
            "variable": data.coords["variable"],
        },
    )
    rt = xr.DataArray(
        np.zeros(data.shape),     
        coords={
            "index": [f"{idx}_r" for idx in data.coords["index"].values],
            "variable": data.coords["variable"],
        },
    )

    # Swap drum positions
    hkey = [f"theta{i}" for i in np.array([3, 2, 1, 0, 7, 6, 5, 4], dtype=int) + 1]
    vkey = [f"theta{i}" for i in np.array([7, 6, 5, 4, 3, 2, 1, 0], dtype=int) + 1]
    rkey = [f"theta{i}" for i in np.array([4, 5, 6, 7, 0, 1, 2, 3], dtype=int) + 1]

    ht.loc[:, hkey] = data.loc[:, theta_cols].values
    vt.loc[:, vkey] = data.loc[:, theta_cols].values
    rt.loc[:, rkey] = data.loc[:, theta_cols].values

    # Adjust angles
    ht.loc[:, hkey] = (3 * np.pi - ht.loc[:, hkey].loc[:, hkey]) % (2 * np.pi)
    vt.loc[:, vkey] = (2 * np.pi - vt.loc[:, hkey].loc[:, vkey]) % (2 * np.pi)
    rt.loc[:, rkey] = (np.pi + rt.loc[:, hkey].loc[:, rkey]) % (2 * np.pi)

    # Fill quadrant tallies
    hkey = [2, 1, 4, 3]
    vkey = [4, 3, 2, 1]
    rkey = [3, 4, 1, 2]

    ht.loc[:, [f"fluxQ{i}" for i in hkey]] = data.loc[:, flux_cols].values
    vt.loc[:, [f"fluxQ{i}" for i in vkey]] = data.loc[:, flux_cols].values
    rt.loc[:, [f"fluxQ{i}" for i in rkey]] = data.loc[:, flux_cols].values

    sym_data = xr.concat([data, ht, vt, rt], dim="index").sortby("index")
    
    # Normalize fluxes
    sym_data.loc[:, flux_cols].values = Normalizer().transform(sym_data.loc[:, flux_cols].values)
    
    # Convert global coordinate system to local
    loc_offsets = np.array(
        [3.6820187359906447, 4.067668586955522, 2.2155167202240653 - np.pi, 2.6011665711889425 - np.pi, 
         0.5404260824008517, 0.9260759333657285, 5.3571093738138575 - np.pi, 5.742759224778734 - np.pi]
    )

    # Apply correct 0 point
    sym_data.loc[:, theta_cols] = sym_data.loc[:, theta_cols] - loc_offsets + 2 * np.pi

    # Reverse necessary angles
    sym_data.loc[:, [f"theta{i}" for i in [3,4,5,6]]] *= -1

    # Scale all to [0, 2 * np.pi]
    sym_data.loc[:, theta_cols] = sym_data.loc[:, theta_cols] % (2 * np.pi)
        
    return sym_data
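
To make the drum swap in `mult_samples` easier to follow, here is a minimal sketch of the horizontal reflection alone, applied to one sample with plain Python lists. The permutations and the `(3 * pi - theta) % (2 * pi)` angle rule are copied from the `hkey` lines above; the function name `reflect_horizontal` is hypothetical, and the sketch leaves out the later conversion to local drum coordinates.

```python
import math

TWO_PI = 2 * math.pi

# 0-based drum permutation and 1-based quadrant permutation for the
# horizontal reflection, copied from the hkey lists in mult_samples.
H_DRUMS = [3, 2, 1, 0, 7, 6, 5, 4]
H_QUADS = [2, 1, 4, 3]


def reflect_horizontal(thetas, fluxes):
    """Return the horizontally mirrored copy of one sample (hypothetical helper)."""
    new_thetas = [0.0] * 8
    for src, dst in enumerate(H_DRUMS):
        # Drum `src` lands in position `dst`, with its angle mirrored.
        new_thetas[dst] = (3 * math.pi - thetas[src]) % TWO_PI
    new_fluxes = [0.0] * 4
    for src, dst in enumerate(H_QUADS):
        # Quadrant tallies swap along with the drums.
        new_fluxes[dst - 1] = fluxes[src]
    return new_thetas, new_fluxes


# First row of the raw data shown earlier.
thetas = [5.919526, 2.369503, 2.923656, 4.488987,
          3.683212, 4.008905, 4.970368, 2.987966]
fluxes = [2.58e19, 2.59e19, 2.67e19, 2.56e19]

mirrored_thetas, mirrored_fluxes = reflect_horizontal(thetas, fluxes)
restored_thetas, restored_fluxes = reflect_horizontal(mirrored_thetas, mirrored_fluxes)
```

Reflecting twice returns the original sample, which is a quick sanity check that the permutation and the angle rule are consistent with each other.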

In [8]:
train_data, test_data = train_test_split(combined_df, test_size=0.3)

# Convert to xarray DataArray and specify the index as a coordinate
train_data_xr = xr.DataArray(
    train_data.values,
    coords={"index": train_data.index, "variable": train_data.columns},
    dims=["index", "variable"]
)
test_data_xr = xr.DataArray(
    test_data.values,
    coords={"index": test_data.index, "variable": test_data.columns},
    dims=["index", "variable"]
)

In [9]:
sym_train_data = mult_samples(train_data_xr)
sym_test_data = mult_samples(test_data_xr)
print(f"Multiplied training shape: {sym_train_data.shape}, Multiplied testing shape: {sym_test_data.shape}")

Multiplied training shape: (2100, 12), Multiplied testing shape: (904, 12)
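
The shapes above can be checked with a little arithmetic. The sketch below assumes a raw sample count of 751, which is inferred from the printed shapes rather than stated in this notebook, and relies on scikit-learn rounding a fractional test size up to the next whole sample.

```python
import math

n_samples = 751          # assumed raw sample count, inferred from the shapes above
test_fraction = 0.3      # test_size passed to train_test_split
copies_per_sample = 4    # original + horizontal, vertical and rotational images

# scikit-learn rounds the test size up; the rest goes to training.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

sym_train = copies_per_sample * n_train
sym_test = copies_per_sample * n_test
print(sym_train, sym_test, sym_train + sym_test)  # prints: 2100 904 3004
```

Because the split happens before the symmetry extension, all four images of a given raw sample stay on the same side of the train/test boundary.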


As seen above, the extended data has the same size as in the original HTGR example. Next, we min-max scale the inputs (the `theta` columns) and normalize the outputs (the `flux` columns).

In [10]:
# Fit the given scaler on the training data and apply it to both splits
def scale_data(train_data, test_data, scaler):
    train_data.values = scaler.fit_transform(
        train_data.values.reshape(-1, train_data.shape[-1])
    ).reshape(train_data.shape)
    test_data.values = scaler.transform(
        test_data.values.reshape(-1, test_data.shape[-1])
    ).reshape(test_data.shape)
    
    # Return data
    return train_data, test_data, scaler

xtrain_arr, xtest_arr , _ = scale_data(sym_train_data.loc[:, theta_cols], sym_test_data.loc[:, theta_cols], MinMaxScaler())
ytrain_arr, ytest_arr, _ = scale_data(sym_train_data.loc[:, flux_cols], sym_test_data.loc[:, flux_cols], Normalizer(norm="l1"))
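
As a rough illustration of what the two transforms do, the sketch below re-implements them by hand in plain Python. The helper names are hypothetical, and the code mirrors what `Normalizer(norm="l1")` and `MinMaxScaler()` compute on small inputs rather than calling scikit-learn. The angle rows are raw values used only for illustration.

```python
def l1_normalize_row(row):
    """Divide a row by the sum of its absolute values (row-wise L1 norm)."""
    total = sum(abs(v) for v in row)
    return [v / total for v in row]


def min_max_columns(rows):
    """Rescale each column of a list of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Quadrant fluxes from the first raw sample: after L1 normalization each
# value is that quadrant's share of the total, so the row sums to 1.
fluxes = [2.58e19, 2.59e19, 2.67e19, 2.56e19]
shares = l1_normalize_row(fluxes)

# Two angle columns from the first three raw samples (illustrative only).
angles = [[5.919526, 2.369503], [2.162380, 0.273624], [0.450100, 0.006301]]
scaled = min_max_columns(angles)
```

Since each normalized target is a share close to 0.25, error metrics on the order of 1e-4 to 1e-3 correspond to small fractions of a quadrant's share. In `scale_data` the column minima and maxima come from the training split only and are then reused for the test split.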

In [11]:
xtrain = xtrain_arr.to_pandas()
xtest = xtest_arr.to_pandas()
ytrain = ytrain_arr.to_pandas()
ytest = ytest_arr.to_pandas()

## Benchmark with H2O ML

Now that the data is preprocessed in the same way as in the original HTGR example, we can use H2O ML to obtain results. Below, we import the necessary libraries from H2O and initialize the H2O instance.

In [19]:
import h2o
from h2o.automl import H2OAutoML

# Step 1: Initialize H2O
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.24" 2024-07-16; OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu322.04); OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu322.04, mixed mode, sharing)
  Starting server from /home/schidige/anaconda3/envs/h2oML/lib/python3.8/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpqy9lzok0
  JVM stdout: /tmp/tmpqy9lzok0/h2o_schidige_started_from_python.out
  JVM stderr: /tmp/tmpqy9lzok0/h2o_schidige_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.5
H2O_cluster_version_age:,1 month and 22 days
H2O_cluster_name:,H2O_from_python_schidige_rvz83h
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,29.97 Gb
H2O_cluster_total_cores:,32
H2O_cluster_allowed_cores:,32


Next, we upload each of our dataset splits to the H2O cluster as an `H2OFrame`. This lets H2O access our data and use it for training and testing.

In [20]:
# Assuming xtrain, xtest, ytrain, ytest are Pandas DataFrames
xtrain_h2o = h2o.H2OFrame(xtrain)
xtest_h2o = h2o.H2OFrame(xtest)
ytrain_h2o = h2o.H2OFrame(ytrain)
ytest_h2o = h2o.H2OFrame(ytest)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


We then column-bind the features and targets into combined train and test frames, and store the target and feature names in lists.

In [21]:
# Combine features and targets for training
train_h2o = xtrain_h2o.cbind(ytrain_h2o)
test_h2o = xtest_h2o.cbind(ytest_h2o)

# Specify the column names for the targets
targets = ytrain.columns.tolist()  # List of target columns
features = xtrain.columns.tolist()  # List of feature columns

It is now time to train our models. H2O is not natively multi-output, while this problem requires predicting 4 variables simultaneously, so we cannot use H2O out of the box on this dataset. Instead, we naively extend H2O by having it train a separate model for each target. Below, we train an H2OAutoML model on each target independently and store it in a dictionary. It is also important to note that H2O performs cross-validation during training, with a 5-fold split.

In [22]:
# Dictionary to store models
aml_models = {}

for target in targets:
    # Initialize AutoML
    aml = H2OAutoML(max_models=20, seed=1234)

    # Train AutoML for each target
    aml.train(x=features, y=target, training_frame=train_h2o)

    # Store the model
    aml_models[target] = aml


AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
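
The one-model-per-target pattern used above does not depend on H2O. The sketch below shows the same idea in plain Python with a hypothetical `MeanRegressor` standing in for an AutoML run; the class and helper names are illustrative assumptions, not part of H2O or pyMAISE.

```python
class MeanRegressor:
    """Hypothetical stand-in for an AutoML run: predicts the training mean."""

    def fit(self, rows, target):
        values = [row[target] for row in rows]
        self.mean_ = sum(values) / len(values)
        return self

    def predict(self, rows):
        return [self.mean_ for _ in rows]


def fit_per_target(rows, targets, make_model):
    """Train one independent single-output model per target column."""
    return {t: make_model().fit(rows, t) for t in targets}


def predict_all(models, rows):
    """Stitch the single-output predictions back into multi-output rows."""
    per_target = {t: m.predict(rows) for t, m in models.items()}
    return [{t: per_target[t][i] for t in models} for i in range(len(rows))]


train_rows = [
    {"fluxQ1": 0.24, "fluxQ2": 0.26},
    {"fluxQ1": 0.26, "fluxQ2": 0.24},
]
models = fit_per_target(train_rows, ["fluxQ1", "fluxQ2"], MeanRegressor)
predictions = predict_all(models, train_rows)
```

One consequence of this design is that each model is fit in isolation, so nothing forces the four predicted shares to sum to 1 the way the normalized targets do.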


After training, we evaluate the leading model for each target on the testing dataset we set aside earlier. The $R^2$ values are quite similar to the FNN's performance on HTGR, and the same can be said of RMSE and MAE. Note that there is one set of results per target variable, four in total.

In [25]:
# Evaluate models for each target
for target in targets:
    # Make predictions
    predictions = aml_models[target].leader.predict(test_h2o)

    # Evaluate model performance
    performance = aml_models[target].leader.model_performance(test_h2o)
    print(f"Evaluation for target: {target}")
    print(performance)
    print(f"R² on test set: {performance.r2()}")
    print('-' * 50)  

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Evaluation for target: fluxQ1
ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 4.008380111192609e-07
RMSE: 0.0006331176913649317
MAE: 0.0005025689597728681
RMSLE: 0.0005058321307694414
Mean Residual Deviance: 4.008380111192609e-07
R² on test set: 0.9776496656942096
--------------------------------------------------
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Evaluation for target: fluxQ2
ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 3.8885122709052564e-07
RMSE: 0.0006235793671141835
MAE: 0.000492043480224217
RMSLE: 0.0004982041393286098
Mean Residual Deviance: 3.8885122709052564e-07
R² on test set: 0.9783180370134837
--------------------------------------------------
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Evaluation for target: fluxQ3
ModelMetricsRegression
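
For reference, the regression metrics reported above can be computed by hand as in the following sketch. The function name and the sample values are hypothetical and chosen only to resemble normalized flux shares. The last lines check that the RMSE printed for `fluxQ1` is the square root of the printed MSE.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Hypothetical normalized flux shares and predictions, for illustration only.
y_true = [0.248, 0.252, 0.250, 0.246, 0.254]
y_pred = [0.2485, 0.2515, 0.2500, 0.2465, 0.2535]
metrics = regression_metrics(y_true, y_pred)

# Consistency check on the fluxQ1 values printed above.
reported_mse = 4.008380111192609e-07
reported_rmse = 0.0006331176913649317
consistent = math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9)
```

Because the targets vary only slightly around 0.25, the total variance is small, so even residuals of a few 1e-4 pull $R^2$ noticeably below 1.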

All that is left now is to print the leaderboard of trained models for each target, ranked from best to worst, along with the details of the best model. The models H2O selected are mainly Gradient Boosting and XGBoost models. We print one leaderboard per target variable, and the leaderboards give fairly consistent results across all the targets.

In [26]:
for target, aml in aml_models.items():
    print(f"\nLeaderboard for target: {target}")
    print(aml.leaderboard)
    print(f"Best model for {target}: {aml.leader}")
    print('-' * 100)  


Leaderboard for target: fluxQ1
model_id                                           rmse          mse          mae        rmsle    mean_residual_deviance
GBM_1_AutoML_1_20241023_32520               0.000685361  4.69719e-07  0.000542796  0.000547656               4.69719e-07
GBM_grid_1_AutoML_1_20241023_32520_model_2  0.000766287  5.87195e-07  0.00060316   0.000612268               5.87195e-07
GBM_2_AutoML_1_20241023_32520               0.000945019  8.93061e-07  0.000742318  0.000755141               8.93061e-07
GBM_3_AutoML_1_20241023_32520               0.000973695  9.48082e-07  0.000769139  0.000778094               9.48082e-07
XGBoost_3_AutoML_1_20241023_32520           0.000985956  9.72109e-07  0.000778139  0.000788122               9.72109e-07
GBM_5_AutoML_1_20241023_32520               0.000996767  9.93544e-07  0.000779383  0.000796456               9.93544e-07
GBM_4_AutoML_1_20241023_32520               0.00102334   1.04723e-06  0.000806216  0.000817664               1.04723e-06


All that is left is to shut down the cluster we initialized earlier.

In [None]:
h2o.cluster().shutdown()