## ML Ops Homework 3

**Disclaimer:** ChatGPT and Claude were used to support the completion of this assignment.

## 1. Setup and Data Loading

In [7]:
# !pip install h2o
# !pip install "flaml[automl]"
# !pip install numpy==1.26.4
# !pip install catboost

In [2]:
# Data manipulation
import pandas as pd
import numpy as np
import time

# H2O AutoML
import h2o
from h2o.automl import H2OAutoML

# FLAML AutoML
from flaml import AutoML

# Train-test split
from sklearn.model_selection import train_test_split

# Evaluation
from sklearn.metrics import mean_squared_error, r2_score


In [3]:
data = pd.read_csv("/Users/brunamedeiros/Documents/University of Chicago/Summer 2025 - ML Ops/HW3/athletes.csv")
print(f"Original dataset shape: {data.shape}")
data.head()

Original dataset shape: (423006, 27)


Unnamed: 0,athlete_id,name,region,team,affiliate,gender,age,height,weight,fran,...,snatch,deadlift,backsq,pullups,eat,train,background,experience,schedule,howlong
0,2554.0,Pj Ablang,South West,Double Edge,Double Edge CrossFit,Male,24.0,70.0,166.0,,...,,400.0,305.0,,,I workout mostly at a CrossFit Affiliate|I hav...,I played youth or high school level sports|I r...,I began CrossFit with a coach (e.g. at an affi...,I do multiple workouts in a day 2x a week|,4+ years|
1,3517.0,Derek Abdella,,,,Male,42.0,70.0,190.0,,...,,,,,,I have a coach who determines my programming|I...,I played youth or high school level sports|,I began CrossFit with a coach (e.g. at an affi...,I do multiple workouts in a day 2x a week|,4+ years|
2,4691.0,,,,,,,,,,...,,,,,,,,,,
3,5164.0,Abo Brandon,Southern California,LAX CrossFit,LAX CrossFit,Male,40.0,67.0,,211.0,...,200.0,375.0,325.0,25.0,I eat 1-3 full cheat meals per week|,I workout mostly at a CrossFit Affiliate|I hav...,I played youth or high school level sports|,I began CrossFit by trying it alone (without a...,I usually only do 1 workout a day|,4+ years|
4,5286.0,Bryce Abbey,,,,Male,32.0,65.0,149.0,206.0,...,150.0,,325.0,50.0,I eat quality foods but don't measure the amount|,I workout mostly at a CrossFit Affiliate|I inc...,I played college sports|,I began CrossFit by trying it alone (without a...,I usually only do 1 workout a day|I strictly s...,1-2 years|


## 2. Data Cleaning

In [12]:
data_clean = data.copy()

# Drop NA in key columns
data_clean = data_clean.dropna(subset=[
    'region','age','weight','height','howlong','gender','eat',
    'train','background','experience','schedule',
    'deadlift','candj','snatch','backsq'
])

# Drop irrelevant columns
data_clean = data_clean.drop(columns=[
    'affiliate','team','name','fran','helen','grace','filthy50','fgonebad','run400','run5k','pullups','train'
])

# Remove outliers
data_clean = data_clean[data_clean['weight'] < 1500]
data_clean = data_clean[data_clean['gender'] != '--']
data_clean = data_clean[data_clean['age'] >= 18]
data_clean = data_clean[(data_clean['height'] < 96) & (data_clean['height'] > 48)]

data_clean = data_clean[
    ((data_clean['gender'] == 'Male') & (data_clean['deadlift'] > 0) & (data_clean['deadlift'] <= 1105)) |
    ((data_clean['gender'] == 'Female') & (data_clean['deadlift'] > 0) & (data_clean['deadlift'] <= 636))
]
data_clean = data_clean[(data_clean['candj'] > 0) & (data_clean['candj'] <= 395)]
data_clean = data_clean[(data_clean['snatch'] > 0) & (data_clean['snatch'] <= 496)]
data_clean = data_clean[(data_clean['backsq'] > 0) & (data_clean['backsq'] <= 1069)]

# Clean survey data
decline_dict = {'Decline to answer|': np.nan}
data_clean = data_clean.replace(decline_dict)
data_clean = data_clean.dropna(subset=['background','experience','schedule','howlong','eat'])

# Create target variable
data_clean['total_lift'] = (
    data_clean['deadlift'] + data_clean['candj'] +
    data_clean['snatch'] + data_clean['backsq']
)

# Do same feature engineering from HW2 so I can compare answers:
data_clean['BMI'] = (data_clean['weight'] * 703) / (data_clean['height'] ** 2)
data_clean['howlong'] = data_clean['howlong'].str.rstrip('|').str.strip()
data_clean = data_clean[~data_clean['howlong'].str.contains(r'\|', na=False)]
howlong_map = {
    "Less than 6 months": 0.5,
    "6-12 months": 1,
    "1-2 years": 1.5,
    "2-4 years": 3,
    "4+ years": 5
}
data_clean['howlong_numeric'] = data_clean['howlong'].map(howlong_map)

# I am not one-hot encoding gender. AutoML should be able to do this


data_leakage_cols = ['deadlift', 'candj', 'snatch', 'backsq', 'howlong'] # Remove data leakage columns
unnecessary_columns = ['background', 'experience', 'schedule', 'eat', 'region', 'athlete_id'] # Remove unnecessary columns
data_clean = data_clean.drop(columns=data_leakage_cols + unnecessary_columns)

print(f"Cleaned dataset shape: {data_clean.shape}")
print(f"Target variable (total_lift) statistics:")
print(data_clean['total_lift'].describe())

Cleaned dataset shape: (29444, 7)
Target variable (total_lift) statistics:
count    29444.000000
mean      1016.511581
std        277.449516
min          4.000000
25%        804.000000
50%       1036.000000
75%       1221.000000
max       2135.000000
Name: total_lift, dtype: float64


In [13]:
data_clean.columns

Index(['gender', 'age', 'height', 'weight', 'total_lift', 'BMI',
       'howlong_numeric'],
      dtype='object')

## 3. H2O AutoML
- Uses cross-validation internally during model selection
- The leaderboard shows cross-validation scores, not test scores
- The test set is evaluated manually with `.model_performance(hf_test)`
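
To make the difference between cross-validation scores and test scores concrete, here is a minimal sketch in plain Python. It does not use H2O: the synthetic targets, the mean-predictor "model", and the helper names `rmse` and `kfold_indices` are all assumptions made up for illustration.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Synthetic targets (assumption): roughly the scale of total_lift.
rng = random.Random(32)
y = [rng.gauss(1000, 275) for _ in range(100)]
y_train, y_test = y[:80], y[80:]

# Cross-validation: every score is computed on a fold held out of the TRAINING data.
cv_scores = []
for fold in kfold_indices(len(y_train), k=5):
    held_out = set(fold)
    fit_part = [v for i, v in enumerate(y_train) if i not in held_out]
    val_part = [y_train[i] for i in fold]
    prediction = sum(fit_part) / len(fit_part)  # toy "model": predict the mean
    cv_scores.append(rmse(val_part, [prediction] * len(val_part)))
cv_rmse = sum(cv_scores) / len(cv_scores)

# Test evaluation: fit on all training data, score on data never seen during selection.
test_prediction = sum(y_train) / len(y_train)
test_rmse = rmse(y_test, [test_prediction] * len(y_test))

print(f"CV RMSE:   {cv_rmse:.2f}")
print(f"Test RMSE: {test_rmse:.2f}")
```

The two numbers generally differ, which is why the leaderboard values and the manually computed test values in this section are not identical.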

### 3.1 Initialize + Prepare Data

In [14]:
# Initialize H2O
h2o.init()

# Train/test split
train_df, test_df = train_test_split(data_clean, test_size=0.2, random_state=32)
print(f"Training set size: {train_df.shape}")
print(f"Test set size: {test_df.shape}")

# Convert to H2O frames
hf_train = h2o.H2OFrame(train_df)
hf_test = h2o.H2OFrame(test_df)

# Set target and features
target = "total_lift"
features_all = [col for col in data_clean.columns if col != "total_lift"]
print(f"Number of features: {len(features_all)}")
print(f"Features: {features_all}")

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.23" 2024-04-16 LTS; OpenJDK Runtime Environment Zulu11.72+19-CA (build 11.0.23+9-LTS); OpenJDK 64-Bit Server VM Zulu11.72+19-CA (build 11.0.23+9-LTS, mixed mode)
  Starting server from /opt/anaconda3/lib/python3.12/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/qf/l8sfrcpd6vq2564v1_qt31fm0000gn/T/tmpz_on9lbg
  JVM stdout: /var/folders/qf/l8sfrcpd6vq2564v1_qt31fm0000gn/T/tmpz_on9lbg/h2o_brunamedeiros_started_from_python.out
  JVM stderr: /var/folders/qf/l8sfrcpd6vq2564v1_qt31fm0000gn/T/tmpz_on9lbg/h2o_brunamedeiros_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html


0,1
H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,America/Chicago
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.7
H2O_cluster_version_age:,3 months and 26 days
H2O_cluster_name:,H2O_from_python_brunamedeiros_0cax3g
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,4.500 Gb
H2O_cluster_total_cores:,11
H2O_cluster_allowed_cores:,11


Training set size: (23555, 7)
Test set size: (5889, 7)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Number of features: 6
Features: ['gender', 'age', 'height', 'weight', 'BMI', 'howlong_numeric']


### 3.2 H2O - All Features

In [17]:
# Run AutoML with all features
print("Running H2O AutoML with all features...")
start_time = time.time()

aml_all = H2OAutoML(max_runtime_secs=300,
                    seed=32,
                    exclude_algos=["StackedEnsemble", "DeepLearning"]) # exclude black box models
aml_all.train(x=features_all,
              y=target,
              training_frame=hf_train)

h2o_all_runtime = time.time() - start_time

Running H2O AutoML with all features...
AutoML progress: |
12:46:08.209: AutoML: XGBoost is not available; skipping it.

███████████████████████████████████████████████████████████████| (done) 100%


In [18]:
# Leaderboard
print('='*60)
print("H2O AutoML Leaderboard - All Features:")
print('='*60)
leaderboard_all = aml_all.leaderboard.as_data_frame()
display(leaderboard_all.head(10))


all_model_ids = leaderboard_all.head(3)['model_id'].tolist()

print('='*60)
print("Top 3 Models Performance (All Features):")
print('='*60)

# Loop through top 3 models and print performance for each
for i, model_id in enumerate(all_model_ids, 1):
    model = h2o.get_model(model_id)
    perf = model.model_performance(hf_test)

    print(f"\n--- Rank {i} Model ---")
    print(f"Model Name: {model.model_id}")
    print(f"Algorithm: {model.algo}")
    print(f"RMSE: {perf.rmse():.4f}")
    print(f"R²: {perf.r2():.4f}")
    print(f"Training time: {h2o_all_runtime:.2f} seconds")

H2O AutoML Leaderboard - All Features:





Unnamed: 0,model_id,rmse,mse,mae,rmsle,mean_residual_deviance
0,GBM_grid_1_AutoML_1_20250724_124608_model_3,154.068752,23737.180442,118.742915,0.181279,23737.180442
1,GBM_grid_1_AutoML_1_20250724_124608_model_156,154.127977,23755.433228,118.851322,0.181334,23755.433228
2,GBM_grid_1_AutoML_1_20250724_124608_model_131,154.152319,23762.937436,118.742795,0.181266,23762.937436
3,GBM_grid_1_AutoML_1_20250724_124608_model_2,154.156008,23764.074878,118.874591,0.181388,23764.074878
4,GBM_grid_1_AutoML_1_20250724_124608_model_170,154.167271,23767.547555,118.878197,0.181319,23767.547555
5,GBM_grid_1_AutoML_1_20250724_124608_model_122,154.217446,23783.020598,118.83645,0.18138,23783.020598
6,GBM_grid_1_AutoML_1_20250724_124608_model_186,154.221274,23784.201349,118.810109,0.181346,23784.201349
7,GBM_grid_1_AutoML_1_20250724_124608_model_138,154.232574,23787.686866,118.783441,0.181412,23787.686866
8,GBM_grid_1_AutoML_1_20250724_124608_model_95,154.24006,23789.996143,118.835463,0.181361,23789.996143
9,GBM_grid_1_AutoML_1_20250724_124608_model_126,154.254215,23794.362853,118.874698,0.181416,23794.362853


Top 3 Models Performance (All Features):

--- Rank 1 Model ---
Model Name: GBM_grid_1_AutoML_1_20250724_124608_model_3
Algorithm: gbm
RMSE: 152.9106
R²: 0.6946
Training time: 174.29 seconds

--- Rank 2 Model ---
Model Name: GBM_grid_1_AutoML_1_20250724_124608_model_156
Algorithm: gbm
RMSE: 153.2226
R²: 0.6934
Training time: 174.29 seconds

--- Rank 3 Model ---
Model Name: GBM_grid_1_AutoML_1_20250724_124608_model_131
Algorithm: gbm
RMSE: 153.2206
R²: 0.6934
Training time: 174.29 seconds





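**Important:** H2O's leaderboard ranks models by their cross-validation RMSE, while the per-model numbers printed above are test-set RMSE. The test-set ordering can therefore differ from the leaderboard order. Here, for example, the rank 3 model has a slightly lower test RMSE (153.2206) than the rank 2 model (153.2226).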
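
The point about ordering can be checked directly with the numbers printed in this section. The short sketch below copies the leaderboard RMSE and test RMSE of the top three models from the outputs above and sorts them both ways; the shortened model names and the variable names are just for illustration.

```python
# (short model name, leaderboard RMSE, test RMSE), copied from the outputs above
results = [
    ("model_3", 154.068752, 152.9106),
    ("model_156", 154.127977, 153.2226),
    ("model_131", 154.152319, 153.2206),
]

# Order by the leaderboard metric versus order by the held-out test metric
leaderboard_order = [name for name, lb_rmse, _ in sorted(results, key=lambda r: r[1])]
test_order = [name for name, _, test_rmse in sorted(results, key=lambda r: r[2])]

print("Leaderboard order:", leaderboard_order)
print("Test-set order:   ", test_order)
```

The best model is the same under both orderings, but the second and third places swap on the test set.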

### 3.3 Importance and Top Features

In [19]:
feature_importance = aml_all.leader.varimp(use_pandas=True)
print('='*60)
print("Feature Importance")
print('='*60)
display(feature_importance)

top_5_features = feature_importance.head(5)['variable'].tolist()
print(f"Top 5 Features: {top_5_features}")

top_3_features = feature_importance.head(3)['variable'].tolist()
print(f"\nTop 3 Features for next experiment: {top_3_features}")

Feature Importance


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,gender,3603383000.0,1.0,0.552209
1,weight,1331014000.0,0.369379,0.203974
2,BMI,541733800.0,0.15034,0.083019
3,age,421219200.0,0.116895,0.064551
4,howlong_numeric,364085500.0,0.10104,0.055795
5,height,263964700.0,0.073255,0.040452


Top 5 Features: ['gender', 'weight', 'BMI', 'age', 'howlong_numeric']

Top 3 Features for next experiment: ['gender', 'weight', 'BMI']
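
As a sanity check on how to read the table above, the sketch below recomputes the `scaled_importance` and `percentage` columns from `relative_importance`. It assumes that the scaled value is each feature's importance divided by the largest one and that the percentage is its share of the total; the printed values agree with that reading up to rounding.

```python
# relative_importance values copied from the feature importance table above
relative = {
    "gender": 3603383000.0,
    "weight": 1331014000.0,
    "BMI": 541733800.0,
    "age": 421219200.0,
    "howlong_numeric": 364085500.0,
    "height": 263964700.0,
}

largest = max(relative.values())
total = sum(relative.values())

# Assumed definitions: scaled = value / largest, percentage = value / total
scaled = {name: value / largest for name, value in relative.items()}
percentage = {name: value / total for name, value in relative.items()}

for name in relative:
    print(f"{name:16s} scaled={scaled[name]:.6f} percentage={percentage[name]:.6f}")
```

Read this way, gender alone accounts for a little over half of the total importance in the leader model.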


### 3.4 H2O - Top 3 Features

In [20]:
print(f"Running H2O AutoML with top 3 features ({top_3_features})...")
start_time = time.time()

aml_top3 = H2OAutoML(max_runtime_secs=300,
                    seed=32,
                    exclude_algos=["StackedEnsemble", "DeepLearning"]) # exclude black box models
aml_top3.train(x=top_3_features,
              y=target,
              training_frame=hf_train)

h2o_top3_runtime = time.time() - start_time

Running H2O AutoML with top 3 features (['gender', 'weight', 'BMI'])...
AutoML progress: |
12:49:02.620: AutoML: XGBoost is not available; skipping it.

███████████████████████████████████████████████████████████████| (done) 100%


In [21]:
# Leaderboard
print('='*60)
print("H2O AutoML Leaderboard - Top 3 Features")
print('='*60)
leaderboard_top3 = aml_top3.leaderboard.as_data_frame()
display(leaderboard_top3.head(3))

top_3_model_ids = leaderboard_top3.head(3)['model_id'].tolist()

print('='*60)
print("Top 3 Models Performance (Top 3 Features):")
print('='*60)

# Loop through top 3 models and print performance for each
for i, model_id in enumerate(top_3_model_ids, 1):
    model = h2o.get_model(model_id)
    perf = model.model_performance(hf_test)

    print(f"\n--- Rank {i} Model ---")
    print(f"Model Name: {model.model_id}")
    print(f"Algorithm: {model.algo}")
    print(f"RMSE: {perf.rmse():.4f}")
    print(f"R²: {perf.r2():.4f}")
    print(f"Training time: {h2o_top3_runtime:.2f} seconds")

H2O AutoML Leaderboard - Top 3 Features





Unnamed: 0,model_id,rmse,mse,mae,rmsle,mean_residual_deviance
0,GBM_grid_1_AutoML_2_20250724_124902_model_131,175.008063,30627.822019,136.863551,0.200758,30627.822019
1,GBM_grid_1_AutoML_2_20250724_124902_model_99,175.01233,30629.315827,136.840827,0.200776,30629.315827
2,GBM_grid_1_AutoML_2_20250724_124902_model_115,175.020346,30632.121672,136.823486,0.200766,30632.121672


Top 3 Models Performance (Top 3 Features):

--- Rank 1 Model ---
Model Name: GBM_grid_1_AutoML_2_20250724_124902_model_131
Algorithm: gbm
RMSE: 173.2301
R²: 0.6081
Training time: 160.17 seconds

--- Rank 2 Model ---
Model Name: GBM_grid_1_AutoML_2_20250724_124902_model_99
Algorithm: gbm
RMSE: 173.5611
R²: 0.6066
Training time: 160.17 seconds

--- Rank 3 Model ---
Model Name: GBM_grid_1_AutoML_2_20250724_124902_model_115
Algorithm: gbm
RMSE: 173.3062
R²: 0.6078
Training time: 160.17 seconds





---
## 4. FLAML AutoML
- Uses cross-validation internally during hyperparameter optimization
- `automl_all.best_loss` = validation loss from cross-validation (here `1 - r2`, since the metric is `r2`)
- The test set is evaluated manually with `.predict(X_test)`
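
Because the logs in this section report "Minimizing error metric: 1-r2", every logged error has to be converted back before it can be compared with an R² score. A minimal sketch of that conversion follows; the helper name `loss_to_r2` is made up for illustration, and the two errors are taken from the first log lines shown in section 4.2.

```python
def loss_to_r2(loss: float) -> float:
    """Convert a '1 - r2' loss back into an R² score."""
    return 1.0 - loss


# Errors copied from the first two logged iterations of the all-features run
logged_errors = {"rf": 0.4077, "sgd": 2.6202}

for estimator, error in logged_errors.items():
    print(f"{estimator}: error={error:.4f} -> validation R2={loss_to_r2(error):.4f}")
```

An error above 1 (as for `sgd` here) corresponds to a negative R², meaning that configuration did worse than always predicting the mean.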

### 4.1 Prepare Data

**Notice:** I realized I did not remove the `athlete_id` column for FLAML like I did for H2O, but it ended up being the feature with the smallest importance, so in the interest of time I decided to continue the assignment without rerunning the AutoML.

In [23]:
target = "total_lift"
features_all_flaml = [col for col in data_clean.columns if col != "total_lift"]

X_train = train_df[features_all_flaml]
X_test = test_df[features_all_flaml]
y_train = train_df[target]
y_test = test_df[target]

print(f"FLAML Training set shape: {X_train.shape}")
print(f"FLAML Test set shape: {X_test.shape}")

FLAML Training set shape: (23555, 6)
FLAML Test set shape: (5889, 6)


### 4.2 FLAML AutoML - All Features

In [24]:
# %pip uninstall lightgbm -y
# %pip install lightgbm==3.3.5

In [25]:
automl_all = AutoML()
automl_all_settings = {
    "time_budget": 300,  # Reduced time for stability
    "metric": 'r2',
    "task": 'regression',
    "seed": 32,
    "estimator_list": ['rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd', 'catboost'] # had to skip LightGBM because it was giving errors
}

start_time = time.time()

automl_all.fit(X_train=X_train, y_train=y_train, **automl_all_settings)
flaml_all_runtime = time.time() - start_time
y_pred_all = automl_all.predict(X_test)

[flaml.automl.logger: 07-24 12:52:45] {1752} INFO - task = regression
[flaml.automl.logger: 07-24 12:52:45] {1763} INFO - Evaluation method: cv
[flaml.automl.logger: 07-24 12:52:45] {1862} INFO - Minimizing error metric: 1-r2
[flaml.automl.logger: 07-24 12:52:45] {1979} INFO - List of ML learners in AutoML Run: ['rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd', 'catboost']
[flaml.automl.logger: 07-24 12:52:45] {2282} INFO - iteration 0, current learner rf
[flaml.automl.logger: 07-24 12:52:45] {2417} INFO - Estimated sufficient time budget=1845s. Estimated necessary time budget=8s.
[flaml.automl.logger: 07-24 12:52:45] {2466} INFO -  at 0.2s,	estimator rf's best error=0.4077,	best estimator rf's best error=0.4077
[flaml.automl.logger: 07-24 12:52:45] {2282} INFO - iteration 1, current learner sgd
[flaml.automl.logger: 07-24 12:52:49] {2466} INFO -  at 4.1s,	estimator sgd's best error=2.6202,	best estimator rf's best error=0.4077
[flaml.automl.logger: 07-24 12:52:49] {2282} INFO - 

In [40]:
print('='*60)
print("FLAML Top 1 Model by Test Performance (All):")
print('='*60)

print('Best ML Learner:', automl_all.best_estimator)
print('Best hyperparameter config:', automl_all.best_config)
print('Best r2 on validation data: {0:.4g}'.format(1-automl_all.best_loss))
print('Training duration of best run: {0:.4f} s'.format(automl_all.best_config_train_time))

FLAML Top 1 Model by Test Performance (All):
Best ML Learner: catboost
Best hyperparameter config: {'early_stopping_rounds': 18, 'learning_rate': 0.07618189423410547, 'n_estimators': 8192}
Best r2 on validation data: 0.6935
Training duration of best run: 0.1982 s


**Important:** FLAML does not automatically store any model besides the single best-performing one. Information about the other candidates has to be recovered from the search history.
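
One rough way to recover part of the search history is to parse the console log. The sketch below only handles lines in the format shown in this notebook; the regular expression and the variable names are assumptions based on those two lines, not an official FLAML interface.

```python
import re

# Two log lines copied from the all-features run above (tabs written as \t)
log_lines = [
    "[flaml.automl.logger: 07-24 12:52:45] {2466} INFO -  at 0.2s,\testimator rf's best error=0.4077,\tbest estimator rf's best error=0.4077",
    "[flaml.automl.logger: 07-24 12:52:49] {2466} INFO -  at 4.1s,\testimator sgd's best error=2.6202,\tbest estimator rf's best error=0.4077",
]

# Capture elapsed seconds, the learner tried, and that learner's best error so far
pattern = re.compile(r"at ([\d.]+)s,\s+estimator (\w+)'s best error=([\d.]+)")

best_error = {}
for line in log_lines:
    match = pattern.search(line)
    if match is None:
        continue  # skip lines that are not progress reports
    _elapsed, estimator, error = match.groups()
    error = float(error)
    best_error[estimator] = min(error, best_error.get(estimator, float("inf")))

# The logged error is 1 - r2, so convert back for readability
for estimator, error in sorted(best_error.items(), key=lambda item: item[1]):
    print(f"{estimator}: best error={error:.4f}, validation R2={1 - error:.4f}")
```

This gives one best validation score per learner, which is the closest equivalent here to a leaderboard for the FLAML runs.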

### 4.3 FLAML Feature Importance and Top Features

In [27]:
top_5_features_flaml = pd.DataFrame({
   'feature': X_train.columns,
   'importance': automl_all.model.estimator.feature_importances_
}).sort_values('importance', ascending=False).head(5)['feature'].tolist()

print(f"FLAML Top 5 Features: {top_5_features_flaml}")

top_3_features_flaml = pd.DataFrame({
   'feature': X_train.columns,
   'importance': automl_all.model.estimator.feature_importances_
}).sort_values('importance', ascending=False).head(3)['feature'].tolist()

print(f"Top 3 Features for next model: {top_3_features_flaml}")

FLAML Top 5 Features: ['gender', 'age', 'BMI', 'howlong_numeric', 'weight']
Top 3 Features for next model: ['gender', 'age', 'BMI']


### 4.4 FLAML - Top 3 Features

In [28]:
print("Running FLAML AutoML with top 3 features...")
start_time = time.time()

X_train_top3 = X_train[top_3_features_flaml]
X_test_top3 = X_test[top_3_features_flaml]

automl_top3 = AutoML()
automl_top3_settings = {
    "time_budget": 300,
    "metric": 'r2',
    "task": 'regression',
    "seed": 32,
    "estimator_list": ['rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd', 'catboost']
}

automl_top3.fit(X_train=X_train_top3, y_train=y_train, **automl_top3_settings)
flaml_top3_runtime = time.time() - start_time

Running FLAML AutoML with top 3 features...
[flaml.automl.logger: 07-24 12:57:45] {1752} INFO - task = regression
[flaml.automl.logger: 07-24 12:57:45] {1763} INFO - Evaluation method: cv
[flaml.automl.logger: 07-24 12:57:45] {1862} INFO - Minimizing error metric: 1-r2
[flaml.automl.logger: 07-24 12:57:45] {1979} INFO - List of ML learners in AutoML Run: ['rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd', 'catboost']
[flaml.automl.logger: 07-24 12:57:45] {2282} INFO - iteration 0, current learner rf
[flaml.automl.logger: 07-24 12:57:45] {2417} INFO - Estimated sufficient time budget=1748s. Estimated necessary time budget=7s.
[flaml.automl.logger: 07-24 12:57:45] {2466} INFO -  at 0.2s,	estimator rf's best error=0.4077,	best estimator rf's best error=0.4077
[flaml.automl.logger: 07-24 12:57:45] {2282} INFO - iteration 1, current learner sgd
[flaml.automl.logger: 07-24 12:57:49] {2466} INFO -  at 3.6s,	estimator sgd's best error=2.6403,	best estimator rf's best error=0.4077
[flaml.a

In [39]:
y_pred_top3 = automl_top3.predict(X_test_top3)

print('='*60)
print("FLAML Top 1 Model by Test Performance (Top 3 Features):")
print('='*60)
print('Best ML Learner:', automl_top3.best_estimator)
print('Best hyperparameter config:', automl_top3.best_config)
print('Best r2 on validation data: {0:.4g}'.format(1-automl_top3.best_loss))
print('Training duration of best model: {0:.4f} s'.format(automl_top3.best_config_train_time))
print(f"Total training time: {flaml_top3_runtime:.2f} seconds")

FLAML Top 1 Model by Test Performance (Top 3 Features):
Best ML Learner: xgboost
Best hyperparameter config: {'n_estimators': 461, 'max_leaves': 4, 'min_child_weight': 1.3527820602007294, 'learning_rate': 0.10276236833603367, 'subsample': 0.9647327788976525, 'colsample_bylevel': 0.543933125705768, 'colsample_bytree': 1.0, 'reg_alpha': 0.001535032177772235, 'reg_lambda': 31.136985009903125}
Best r2 on validation data: 0.6368
Training duration of best run: 0.0866 s
Total training time: 299.98 seconds


## 5. Results

In [None]:
results_summary = []

# H2O All Features - Best Model
h2o_all_best = h2o.get_model(aml_all.leaderboard.as_data_frame().iloc[0]['model_id'])
h2o_all_perf = h2o_all_best.model_performance(hf_test)
results_summary.append({
    'Framework': 'H2O',
    'Features': 'All (6 features)',
    'Algorithm': h2o_all_best.algo,
    'Test_RMSE': h2o_all_perf.rmse(),
    'Test_R2': h2o_all_perf.r2(),
    'Total Training Time (All Models)': h2o_all_runtime,
    'Training Time of Best Model': aml_all.leader.run_time / 1000.0
})


# H2O Top 3 Features - Best Model
h2o_top3_best = h2o.get_model(aml_top3.leaderboard.as_data_frame().iloc[0]['model_id'])
h2o_top3_perf = h2o_top3_best.model_performance(hf_test)
results_summary.append({
    'Framework': 'H2O',
    'Features': 'Top 3 (gender, weight, BMI)',
    'Algorithm': h2o_top3_best.algo,
    'Test_RMSE': h2o_top3_perf.rmse(),
    'Test_R2': h2o_top3_perf.r2(),
    'Total Training Time (All Models)': h2o_top3_runtime,
    'Training Time of Best Model': aml_top3.leader.run_time / 1000.0
})

# FLAML All Features
y_pred_flaml_all = automl_all.predict(X_test)
flaml_all_rmse = np.sqrt(mean_squared_error(y_test, y_pred_flaml_all))  # RMSE without the deprecated squared=False argument
flaml_all_r2 = r2_score(y_test, y_pred_flaml_all)
results_summary.append({
    'Framework': 'FLAML',
    'Features': 'All (6 features)',
    'Algorithm': automl_all.best_estimator,
    'Test_RMSE': flaml_all_rmse,
    'Test_R2': flaml_all_r2,
    'Total Training Time (All Models)': flaml_all_runtime,
    'Training Time of Best Model': automl_all.best_config_train_time
})

# FLAML Top 3 Features
y_pred_flaml_top3 = automl_top3.predict(X_test_top3)
flaml_top3_rmse = np.sqrt(mean_squared_error(y_test, y_pred_flaml_top3))  # RMSE without the deprecated squared=False argument
flaml_top3_r2 = r2_score(y_test, y_pred_flaml_top3)
results_summary.append({
    'Framework': 'FLAML',
    'Features': 'Top 3 (gender, age, BMI)',
    'Algorithm': automl_top3.best_estimator,
    'Test_RMSE': flaml_top3_rmse,
    'Test_R2': flaml_top3_r2,
    'Total Training Time (All Models)': flaml_top3_runtime,
    'Training Time of Best Model': automl_top3.best_config_train_time
})

# Create results DataFrame
results_df = pd.DataFrame(results_summary)
results_df = results_df.round({'Test_RMSE': 4, 'Test_R2': 4})

print("="*80)
print("FINAL RESULTS SUMMARY")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)

# Find best performing model
best_model_idx = results_df['Test_R2'].idxmax()
best_model = results_df.loc[best_model_idx]  # idxmax() returns an index label, so use .loc

print(f"\nBEST PERFORMING MODEL:")
print(f"Framework: {best_model['Framework']}")
print(f"Features: {best_model['Features']}")
print(f"Algorithm: {best_model['Algorithm']}")
print(f"Test R²: {best_model['Test_R2']:.4f}")
print(f"Test RMSE: {best_model['Test_RMSE']:.4f}")
print(f"Training Time: {best_model['Training Time of Best Model']:.2f} seconds")

# Feature importance comparison
print(f"\n" + "="*60)
print("FEATURE IMPORTANCE COMPARISON")
print("="*60)

print("\nH2O Top Features (from all features model):")
h2o_importance = aml_all.leader.varimp(use_pandas=True)
for i, row in h2o_importance.head(5).iterrows():
    print(f"{i+1}. {row['variable']}: {row['scaled_importance']:.3f}")

print(f"\nFLAML Top Features (from all features model):")
flaml_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': automl_all.model.estimator.feature_importances_
}).sort_values('importance', ascending=False)

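# After sorting, the index no longer reflects the rank, so number the rows explicitly
for rank, (_, row) in enumerate(flaml_importance.head(5).iterrows(), 1):
    print(f"{rank}. {row['feature']}: {row['importance']:.3f}")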

print(f"\n" + "="*60)
print("FRAMEWORK COMPARISON")
print("="*60)
print(f"H2O Best R²: {results_df[results_df['Framework'] == 'H2O']['Test_R2'].max():.4f}")
print(f"FLAML Best R²: {results_df[results_df['Framework'] == 'FLAML']['Test_R2'].max():.4f}")
print(f"H2O Avg Training Time: {results_df[results_df['Framework'] == 'H2O']['Training Time of Best Model'].mean():.2f}s")
print(f"FLAML Avg Training Time: {results_df[results_df['Framework'] == 'FLAML']['Training Time of Best Model'].mean():.2f}s")

FINAL RESULTS SUMMARY
Framework                    Features Algorithm  Test_RMSE  Test_R2  Total Training Time (All Models)  Training Time of Best Model
      H2O            All (6 features)       gbm   152.9106   0.6946                        174.286040                     0.110000
      H2O Top 3 (gender, weight, BMI)       gbm   173.2301   0.6081                        160.174460                     0.065000
    FLAML            All (6 features)  catboost   152.7933   0.6951                        300.180503                     0.198187
    FLAML    Top 3 (gender, age, BMI)   xgboost   166.8939   0.6362                        299.981470                     0.086600

BEST PERFORMING MODEL:
Framework: FLAML
Features: All (6 features)
Algorithm: catboost
Test R²: 0.6951
Test RMSE: 152.7933
Training Time: 0.20 seconds

FEATURE IMPORTANCE COMPARISON

H2O Top Features (from all features model):
1. gender: 1.000
2. weight: 0.369
3. BMI: 0.150
4. age: 0.117
5. howlong_numeric: 0.101

FLAML 





In [68]:
results_summary = []

# HW1 RESULTS (Manually Added)
results_summary += [
    {
      'Framework': 'HW1',
      'Features':  'DVC-V1',
      'Algorithm': 'RandomForestRegressor',
      'Test_RMSE': None,
      'Test_R2':   -143413.14511,
      'Total Training Time (All Models)': None,
      'Training Time of Best Model':      None
    },
    {
      'Framework': 'HW1',
      'Features':  'DVC-V2',
      'Algorithm': 'RandomForestRegressor',
      'Test_RMSE': None,
      'Test_R2':   0.6026758,
      'Total Training Time (All Models)': None,
      'Training Time of Best Model':      None
    },
    {
      'Framework': 'HW1',
      'Features':  'DVC-V3',
      'Algorithm': 'DPKerasSGDOptimizer',
      'Test_RMSE': None,
      'Test_R2':   0.603250,
      'Total Training Time (All Models)': None,
      'Training Time of Best Model':      None
    },
    {
      'Framework': 'HW1',
      'Features':  'lakefs-v1',
      'Algorithm': 'RandomForestRegressor',
      'Test_RMSE': None,
      'Test_R2':   -143413.1451140,
      'Total Training Time (All Models)': None,
      'Training Time of Best Model':      None
    },
    {
      'Framework': 'HW1',
      'Features':  'lakefs-v2',
      'Algorithm': 'RandomForestRegressor',
      'Test_RMSE': None,
      'Test_R2':   0.60267582,
      'Total Training Time (All Models)': None,
      'Training Time of Best Model':      None
    },
    {
      'Framework': 'HW1',
      'Features':  'lakefs-v3',
      'Algorithm': 'DPKerasSGDOptimizer',
      'Test_RMSE': None,
      'Test_R2':   0.610,
      'Total Training Time (All Models)': None,
      'Training Time of Best Model':      None
    }
]

# HW2 RESULTS (Manually Added)
results_summary += [
    {
      'Framework': 'HW2',
      'Features':  'Version v1',
      'Algorithm': 'RandomForestRegressor',
      'Test_RMSE': 173.574468,
      'Test_R2':   0.603250,
      'Total Training Time (All Models)': None,
      'Training Time of Best Model':      None
    },
    {
      'Framework': 'HW2',
      'Features':  'Version v2',
      'Algorithm': 'RandomForestRegressor',
      'Test_RMSE': 167.046094,
      'Test_R2':   0.630611,
      'Total Training Time (All Models)': None,
      'Training Time of Best Model':      None
    }
]

# H2O All Features - Best Model
h2o_all_best = h2o.get_model(aml_all.leaderboard.as_data_frame().iloc[0]['model_id'])
h2o_all_perf = h2o_all_best.model_performance(hf_test)
results_summary.append({
    'Framework': 'H2O',
    'Features': 'All (6 features)',
    'Algorithm': h2o_all_best.algo,
    'Test_RMSE': h2o_all_perf.rmse(),
    'Test_R2': h2o_all_perf.r2(),
    'Total Training Time (All Models)': h2o_all_runtime,
    'Training Time of Best Model': aml_all.leader.run_time / 1000.0
})


# H2O Top 3 Features - Best Model
h2o_top3_best = h2o.get_model(aml_top3.leaderboard.as_data_frame().iloc[0]['model_id'])
h2o_top3_perf = h2o_top3_best.model_performance(hf_test)
results_summary.append({
    'Framework': 'H2O',
    'Features': 'Top 3 (gender, weight, BMI)',
    'Algorithm': h2o_top3_best.algo,
    'Test_RMSE': h2o_top3_perf.rmse(),
    'Test_R2': h2o_top3_perf.r2(),
    'Total Training Time (All Models)': h2o_top3_runtime,
    'Training Time of Best Model': aml_top3.leader.run_time / 1000.0
})

# FLAML All Features
y_pred_flaml_all = automl_all.predict(X_test)
flaml_all_rmse = np.sqrt(mean_squared_error(y_test, y_pred_flaml_all))  # RMSE without the deprecated squared=False argument
flaml_all_r2 = r2_score(y_test, y_pred_flaml_all)
results_summary.append({
    'Framework': 'FLAML',
    'Features': 'All (6 features)',
    'Algorithm': automl_all.best_estimator,
    'Test_RMSE': flaml_all_rmse,
    'Test_R2': flaml_all_r2,
    'Total Training Time (All Models)': flaml_all_runtime,
    'Training Time of Best Model': automl_all.best_config_train_time
})

# FLAML Top 3 Features
y_pred_flaml_top3 = automl_top3.predict(X_test_top3)
flaml_top3_rmse = np.sqrt(mean_squared_error(y_test, y_pred_flaml_top3))  # RMSE without the deprecated squared=False argument
flaml_top3_r2 = r2_score(y_test, y_pred_flaml_top3)
results_summary.append({
    'Framework': 'FLAML',
    'Features': 'Top 3 (gender, age, BMI)',
    'Algorithm': automl_top3.best_estimator,
    'Test_RMSE': flaml_top3_rmse,
    'Test_R2': flaml_top3_r2,
    'Total Training Time (All Models)': flaml_top3_runtime,
    'Training Time of Best Model': automl_top3.best_config_train_time
})



results_df = pd.DataFrame(results_summary)

# Round numeric columns
results_df['Test_RMSE'] = results_df['Test_RMSE'].round(4)
results_df['Test_R2']   = results_df['Test_R2'].round(4)

# Times may be None → NaN after to_numeric
for col in ['Total Training Time (All Models)', 'Training Time of Best Model']:
    results_df[col] = pd.to_numeric(results_df[col], errors='coerce').round(2)

print("="*80)
print("FINAL RESULTS SUMMARY (HW2 & HW3)")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)

FINAL RESULTS SUMMARY (HW2 & HW3)
Framework                    Features             Algorithm  Test_RMSE      Test_R2  Total Training Time (All Models)  Training Time of Best Model
      HW1                      DVC-V1 RandomForestRegressor        NaN -143413.1451                               NaN                          NaN
      HW1                      DVC-V2 RandomForestRegressor        NaN       0.6027                               NaN                          NaN
      HW1                      DVC-V3   DPKerasSGDOptimizer        NaN       0.6032                               NaN                          NaN
      HW1                   lakefs-v1 RandomForestRegressor        NaN -143413.1451                               NaN                          NaN
      HW1                   lakefs-v2 RandomForestRegressor        NaN       0.6027                               NaN                          NaN
      HW1                   lakefs-v3   DPKerasSGDOptimizer        NaN       0.6100 





**Notes:** Some important notes regarding the NaNs in the comparison table:
- HW1 reported MAE instead of RMSE. When I reran the first assignment's notebook, the RMSE scores were plainly unreasonable, so in the interest of time I decided to compare only R².
- For assignments 1 and 2, I was not able to record how long the models took to run.
- DVC-V1 and lakefs-v1 had very poor R² scores because the data was not cleaned, which was a requirement of that assignment.
- **Missing models:** Despite my attempts to solve the issue, H2O AutoML consistently failed to use XGBoost, and FLAML failed to use LightGBM. I had to discard both models in order to complete the assignment.
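
Since the notes above mix MAE, RMSE, and R², the sketch below shows how the three are computed and why they are not interchangeable. The toy values and helper names are assumptions for illustration only and are not taken from any of the assignments.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets and two sets of predictions (made up for illustration)
y_true = [800.0, 1000.0, 1200.0]
good_pred = [850.0, 1000.0, 1100.0]
bad_pred = [5000.0, -3000.0, 9000.0]

print(f"good: MAE={mae(y_true, good_pred):.2f} RMSE={rmse(y_true, good_pred):.2f} R2={r2(y_true, good_pred):.4f}")
print(f"bad:  MAE={mae(y_true, bad_pred):.2f} RMSE={rmse(y_true, bad_pred):.2f} R2={r2(y_true, bad_pred):.4f}")
```

MAE and RMSE are on the same scale as the target but weight errors differently, so an MAE from one assignment cannot stand in for an RMSE in another. R² has no lower bound, which is how predictions far from the targets can produce a large negative score.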

## 6. Conclusions

**Results:** Both AutoML models performed better than the models used in assignments 1 and 2. Note that I applied the same data manipulation as in assignment 2. Both the H2O and FLAML models achieved a higher R²: there was an improvement from HW1 to HW2, and a further one from HW2 to HW3.

**Is your platform low-code?:** In my opinion, both H2O and FLAML can be considered low-code, mainly because we don't have to test many models by hand, which is exactly what AutoML automates. However, each AutoML platform handled data cleaning differently, and FLAML in particular struggled with the categorical columns. Because of that, I had to clean the data manually so that the comparison would only test how well each platform picks the best model. One thing I disliked about FLAML is how limited it is for inspecting model performance: you can only analyze the single best model. That is not the case for H2O.

Overall, I preferred working with H2O because it required less data cleaning and allowed me to analyze performance across the top X models, not just one.