# Tune Experiment Results Summary
*Classical ML, XGBoost & LightGBM, 3600 s time budget, 8 concurrent trials, 3 seeds*
By Antoni Baum

tl;dr HyperOpt wins.

In [41]:
import pandas as pd
import plotly.express as px

In [42]:
results = pd.read_csv("https://raw.githubusercontent.com/Yard1/tune_experiment/master/results/results_3600s.csv")
results = results.groupby(["concurrency", "problem", "dataset", "searcher"]).mean()

In [43]:
cummax_searcher_mean = pd.read_csv("https://raw.githubusercontent.com/Yard1/tune_experiment/master/results/results_3600s_cummax_mean.csv").drop("Unnamed: 0", axis=1, errors="ignore")
cummax_searcher_median = pd.read_csv("https://raw.githubusercontent.com/Yard1/tune_experiment/master/results/results_3600s_cummax_median.csv").drop("Unnamed: 0", axis=1, errors="ignore")

In [56]:
cols_to_drop = ["cv", "seed", "time_budget_s", "best_result", "worst_result"]

## Aggregate results

### Normalization
All calculations have been performed on normalized ROC-AUC. The normalization process used min-max scaling for every model-dataset pair, with the worst result for that pair becoming 0 and the best becoming 1.
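
A minimal sketch of the per-pair min-max scaling described above. The column names (`model`, `dataset`, `roc_auc`) and all values are made up for illustration; they are assumptions, not taken from the experiment's code.

```python
import pandas as pd

# Hypothetical raw scores: two model-dataset pairs, three searchers each.
raw = pd.DataFrame({
    "model": ["xgb", "xgb", "xgb", "lgbm", "lgbm", "lgbm"],
    "dataset": ["a", "a", "a", "a", "a", "a"],
    "searcher": ["s1", "s2", "s3", "s1", "s2", "s3"],
    "roc_auc": [0.80, 0.85, 0.90, 0.70, 0.72, 0.78],
})

# Min and max ROC-AUC within each model-dataset pair, aligned to the rows.
grouped = raw.groupby(["model", "dataset"])["roc_auc"]
lo = grouped.transform("min")
hi = grouped.transform("max")

# The worst result in a pair maps to 0 and the best to 1.
# A pair whose scores are all equal would divide by zero and need special handling.
raw["roc_auc_normalized"] = (raw["roc_auc"] - lo) / (hi - lo)
print(raw)
```

Because the scaling is done per pair, a normalized score only says how a searcher did relative to the other results on the same model and dataset, not how good the raw ROC-AUC was.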

### Mean results

In [57]:
mean_agg = results.groupby(["searcher"]).mean().sort_values("best_result_normalized", ascending=False)
mean_agg.drop(cols_to_drop, axis=1)

| searcher | best_result_normalized | worst_result_normalized | num_trials | avg_time_per_iter | best_result_normalized_1m | best_result_normalized_5m | best_result_normalized_10m | best_result_normalized_30m |
|---|---|---|---|---|---|---|---|---|
| ASHASkOptSearcher | 0.857625 | 0.277924 | 436.767094 | 51.060401 | 0.62646 | 0.827668 | 0.841078 | 0.852638 |
| ASHAHyperOptSearcher | 0.849066 | 0.263384 | 489.230769 | 25.53569 | 0.624692 | 0.822303 | 0.833348 | 0.843929 |
| HyperOptSearcher | 0.84857 | 0.281186 | 398.497863 | 36.355854 | 0.667066 | 0.826508 | 0.837147 | 0.845225 |
| ASHACFOSearcher | 0.846047 | 0.319986 | 314.356838 | 30.738369 | 0.578953 | 0.806404 | 0.822925 | 0.838514 |
| ASHAHEBOSearcher | 0.844482 | 0.272238 | 331.606838 | 55.39558 | 0.659015 | 0.817189 | 0.828849 | 0.840134 |
| ASHAAxSearcher | 0.841381 | 0.32422 | 298.474359 | 22.199362 | 0.604409 | 0.809015 | 0.822314 | 0.837076 |
| ASHARandomSearch | 0.838188 | 0.239683 | 878.113248 | 47.506311 | 0.598949 | 0.807704 | 0.822591 | 0.833406 |
| HEBOSearcher | 0.838172 | 0.218041 | 347.630342 | 18.437384 | 0.587305 | 0.819103 | 0.825848 | 0.833797 |
| BlendSearchSearcher | 0.833056 | 0.294173 | 417.760684 | 40.736241 | 0.622293 | 0.804169 | 0.818223 | 0.829241 |
| BOHBSearcher | 0.833048 | 0.262354 | 403.989316 | 15.026873 | 0.627234 | 0.802633 | 0.815887 | 0.827902 |


For mean results, surprisingly, ASHASkOpt provided the best results, despite not making use of pruned trials. ASHAHyperOpt and HyperOpt trail close behind. For most searchers, aside from BlendSearch and Optuna, combining them with ASHA increases both their mean best result and the number of trials.

ASHAOptuna may underperform so severely because it is one of the few searchers that actually use pruned trials when suggesting new ones. Since BlendSearch also uses Optuna under the hood, it is not surprising that its ASHA version underperforms as well.

Ax, CFO and SkOpt are underperforming, being unable to get ahead of Random Search. It is unknown why Ax is giving such poor results (the defaults were used).

Optuna (without ASHA) is also trailing behind HyperOpt despite both using TPE. I know that Optuna uses different TPE parameters - perhaps that is the reason for the discrepancy.

We can see that the leaderboard quickly becomes relatively stable over time. The best results for 30 minutes do not differ much from the ones after 60 minutes.
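
As a rough check of this claim, the sketch below computes how much normalized score was gained in the second half of the time budget. The three rows are copied from the mean table above; the DataFrame name `snapshot` and the `gain_after_30m` column are introduced here for illustration only.

```python
import pandas as pd

# Three rows copied from the mean table: final score and score after 30 minutes.
snapshot = pd.DataFrame(
    {
        "best_result_normalized": [0.857625, 0.848570, 0.833048],
        "best_result_normalized_30m": [0.852638, 0.845225, 0.827902],
    },
    index=["ASHASkOptSearcher", "HyperOptSearcher", "BOHBSearcher"],
)

# Gain in normalized score between the 30-minute mark and the end of the run.
snapshot["gain_after_30m"] = (
    snapshot["best_result_normalized"] - snapshot["best_result_normalized_30m"]
)
print(snapshot["gain_after_30m"].round(6))
```

For these three rows the gain is below 0.01 on the normalized scale, which is small compared with the gap between the 1-minute and 5-minute columns.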

### Median results

In [58]:
median_agg = results.groupby(["searcher"]).median().sort_values("best_result_normalized", ascending=False)
median_agg.drop(cols_to_drop, axis=1)

| searcher | best_result_normalized | worst_result_normalized | num_trials | avg_time_per_iter | best_result_normalized_1m | best_result_normalized_5m | best_result_normalized_10m | best_result_normalized_30m |
|---|---|---|---|---|---|---|---|---|
| RandomSearch | 0.887581 | 0.215728 | 437.0 | 8.291066 | 0.641686 | 0.848242 | 0.855207 | 0.871049 |
| ASHAHyperOptSearcher | 0.887156 | 0.214658 | 335.166667 | 8.39255 | 0.662487 | 0.860923 | 0.872448 | 0.881582 |
| ASHASkOptSearcher | 0.880653 | 0.260799 | 301.666667 | 11.779178 | 0.634953 | 0.833478 | 0.858994 | 0.871792 |
| BlendSearchSearcher | 0.874253 | 0.282306 | 260.333333 | 11.266937 | 0.651208 | 0.850131 | 0.861991 | 0.873565 |
| HyperOptSearcher | 0.872751 | 0.293715 | 296.833333 | 10.51926 | 0.664138 | 0.861415 | 0.86569 | 0.872329 |
| ASHAAxSearcher | 0.86887 | 0.281048 | 165.666667 | 10.098225 | 0.617643 | 0.843026 | 0.851786 | 0.866054 |
| CFOSearcher | 0.86814 | 0.225674 | 223.333333 | 11.850166 | 0.555361 | 0.829063 | 0.844187 | 0.851061 |
| HEBOSearcher | 0.867818 | 0.195828 | 226.083333 | 7.598375 | 0.610164 | 0.83846 | 0.850214 | 0.861667 |
| ASHARandomSearch | 0.864497 | 0.222499 | 623.666667 | 8.664884 | 0.60999 | 0.817386 | 0.837526 | 0.85507 |
| ASHACFOSearcher | 0.85775 | 0.354596 | 254.5 | 10.432908 | 0.587322 | 0.815426 | 0.827204 | 0.844945 |


The median results differ somewhat from the mean results. Random Search tops the chart - I think it is because its median number of trials was among the highest. ASHASkOpt and HyperOpt (both with and without ASHA) are still very strong contenders on median, while the worst searchers are more or less the same as for the mean. BlendSearch, while performing poorly on average, has quite good median results.

For most searchers aside from SkOpt, Ax and HyperOpt, ASHA hurts median results but increases the number of trials run (the latter being expected).

As with the mean results, the scores change little between the 30-minute mark and the end of the run.

### Standard deviations of results

In [59]:
std_agg = results.groupby(["searcher"]).std().sort_values("best_result_normalized", ascending=True)
std_agg.drop(cols_to_drop, axis=1)

| searcher | best_result_normalized | worst_result_normalized | num_trials | avg_time_per_iter | best_result_normalized_1m | best_result_normalized_5m | best_result_normalized_10m | best_result_normalized_30m |
|---|---|---|---|---|---|---|---|---|
| ASHACFOSearcher | 0.094638 | 0.188437 | 233.802319 | 89.785047 | 0.183678 | 0.099646 | 0.096872 | 0.094568 |
| ASHAHEBOSearcher | 0.104993 | 0.204161 | 318.011694 | 212.682718 | 0.155661 | 0.108823 | 0.106547 | 0.104293 |
| ASHASkOptSearcher | 0.108941 | 0.204551 | 376.228763 | 146.370878 | 0.183629 | 0.115168 | 0.113237 | 0.109849 |
| ASHAAxSearcher | 0.112695 | 0.218318 | 344.517117 | 39.684171 | 0.165757 | 0.120101 | 0.11773 | 0.112722 |
| SkOptSearcher | 0.117373 | 0.169182 | 360.958318 | 76.386199 | 0.181553 | 0.125113 | 0.119798 | 0.116904 |
| HyperOptSearcher | 0.119011 | 0.201655 | 331.783463 | 127.213096 | 0.189508 | 0.128467 | 0.123467 | 0.120156 |
| ASHARandomSearch | 0.124522 | 0.178074 | 727.051549 | 150.875391 | 0.198759 | 0.131587 | 0.12955 | 0.125398 |
| ASHAHyperOptSearcher | 0.127748 | 0.207154 | 378.741866 | 79.757141 | 0.208986 | 0.133222 | 0.132154 | 0.128966 |
| ASHAOptunaTPESearcher | 0.130253 | 0.174743 | 844.305923 | 231.859847 | 0.185745 | 0.134851 | 0.134387 | 0.131431 |
| BOHBSearcher | 0.13272 | 0.183405 | 449.924219 | 17.578398 | 0.157482 | 0.13677 | 0.133705 | 0.13154 |


For most searchers, ASHA helps to make results more consistent - which is interesting, considering that ASHA usually hurt their median results.

## Number of wins
*A win is defined as having the best result for a model-dataset pair.*

In [60]:
winners = pd.DataFrame(results.groupby(["concurrency", "problem", "dataset"], sort=False)['best_result_normalized'].idxmax())
winners["best_result_normalized"] = winners["best_result_normalized"].apply(lambda x: x[-1] if isinstance(x, tuple) else None)
winners.value_counts()

best_result_normalized 
BOHBSearcher               11
RandomSearch                8
HyperOptSearcher            7
ASHAHyperOptSearcher        7
ASHARandomSearch            7
ASHASkOptSearcher           6
ASHAAxSearcher              4
ASHAHEBOSearcher            4
ASHAOptunaTPESearcher       4
ASHABlendSearchSearcher     4
OptunaTPESearcher           3
SkOptSearcher               3
CFOSearcher                 3
HEBOSearcher                2
ASHACFOSearcher             2
AxSearcher                  2
BlendSearchSearcher         1
dtype: int64

Contrary to the aggregate results, BOHB tops the leaderboard, followed by Random Search. HyperOpt still performs strongly, matching the aggregate results. For most searchers, ASHA variants have more wins.

Given the discrepancy between the number of wins and the aggregate results, it is most likely that the differences between searchers were small on most datasets but much larger on a few.
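
A toy illustration of how this can happen. The scores below are invented, not taken from the experiment, and the searcher names `s1` to `s3` are placeholders: `s2` narrowly wins two of three datasets but fails badly on the third, so it has the most wins and the lowest mean.

```python
import pandas as pd

# Hypothetical normalized scores for three searchers on three datasets.
scores = pd.DataFrame(
    {
        "dataset": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
        "searcher": ["s1", "s2", "s3"] * 3,
        "best_result_normalized": [0.90, 0.91, 0.89, 0.80, 0.81, 0.79, 0.95, 0.60, 0.94],
    }
)

# Row label of the best result in each dataset, then the searcher on that row.
winner_rows = scores.groupby("dataset")["best_result_normalized"].idxmax()
wins = scores.loc[winner_rows, "searcher"].value_counts()

# Mean score per searcher across all datasets.
means = scores.groupby("searcher")["best_result_normalized"].mean()

print(wins)
print(means.sort_values(ascending=False))
```

Counting wins ignores the size of the margin, while the mean is sensitive to a single bad dataset, so the two leaderboards can disagree.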

In [61]:
losers = pd.DataFrame(results.groupby(["concurrency", "problem", "dataset"], sort=False)['best_result_normalized'].idxmin())
losers["best_result_normalized"] = losers["best_result_normalized"].apply(lambda x: x[-1] if isinstance(x, tuple) else None)
losers.value_counts()

best_result_normalized 
BOHBSearcher               8
AxSearcher                 7
ASHAHyperOptSearcher       7
ASHAOptunaTPESearcher      7
CFOSearcher                7
ASHABlendSearchSearcher    7
BlendSearchSearcher        5
HEBOSearcher               5
RandomSearch               4
OptunaTPESearcher          4
SkOptSearcher              3
ASHASkOptSearcher          3
ASHARandomSearch           3
ASHACFOSearcher            3
HyperOptSearcher           2
ASHAAxSearcher             2
ASHAHEBOSearcher           1
dtype: int64

BOHB tops both the number of winners and losers, meaning it is highly inconsistent. HyperOpt without ASHA is more consistent than with.

## Average rank
*Smaller is better*

In [62]:
ranks = pd.DataFrame(results.groupby(["concurrency", "problem", "dataset"], sort=False)["best_result"].rank(ascending=False))
# Average the per-pair ranks for each (concurrency, searcher), then keep the 8-concurrent-trials slice.
# groupby(level=...) replaces the deprecated mean(level=...) call.
avg_rank = ranks.groupby(level=[0, 3]).mean().sort_values(by="best_result").loc[8]
avg_rank


| searcher | best_result |
|---|---|
| ASHAHyperOptSearcher | 7.910256 |
| ASHASkOptSearcher | 8.012821 |
| HyperOptSearcher | 8.461538 |
| HEBOSearcher | 8.474359 |
| OptunaTPESearcher | 8.794872 |
| ASHAHEBOSearcher | 8.794872 |
| BOHBSearcher | 8.794872 |
| ASHARandomSearch | 8.807692 |
| RandomSearch | 8.858974 |
| ASHACFOSearcher | 8.974359 |


The top and bottom of the average rank leaderboard look very similar to those of the mean results leaderboard. Random Search is squarely in the middle. HEBO performs best among the GP-based searchers, whether or not ASHA is used. Ax, CFO and BlendSearch are underperforming. The difference between SkOpt with and without ASHA is significant again, as has been the theme in previous leaderboards.

## Plots

### Violin plot

In [64]:
df = results.loc[8].reset_index()
df["searcher"] = df["searcher"].astype("category")
# Order the categories by average rank; the result must be assigned back to take effect.
df["searcher"] = df["searcher"].cat.set_categories(list(avg_rank.index))
df = df.sort_values(["searcher"])

px.violin(df, y="best_result_normalized", color="searcher", box=True, points="suspectedoutliers", category_orders={"searcher":sorted(median_agg.index)}).show()

### Average cumulative maximum score over time
*Each step is one minute*

In [65]:
px.line(cummax_searcher_mean).update_traces(mode='lines+markers').show()

### Median cumulative maximum score over time
*Each step is one minute*

In [66]:
px.line(cummax_searcher_median).update_traces(mode='lines+markers').show()
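
The two CSV files plotted above are precomputed. As a sketch of how one such per-minute cumulative-maximum curve could be derived from a raw trial log, assuming hypothetical columns `finished_at_s` and `score` (not the experiment's actual schema):

```python
import pandas as pd

# Hypothetical trial log for one run: finish time in seconds and normalized score.
trials = pd.DataFrame({
    "finished_at_s": [20, 50, 70, 190, 230],
    "score": [0.50, 0.40, 0.65, 0.60, 0.80],
})

# Assign each trial to the one-minute step in which it finished (step 1 = first minute).
trials["minute"] = trials["finished_at_s"] // 60 + 1

# Best score within each minute; minutes with no finished trial become NaN.
best_per_minute = trials.groupby("minute")["score"].max().reindex(range(1, 5))

# Running maximum across minutes, carrying the last value over empty minutes.
cummax_curve = best_per_minute.cummax().ffill()
print(cummax_curve)
```

Averaging such curves across runs (or taking their median) at each minute would give curves of the kind plotted in the two sections above.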

## Summary

The only searcher that consistently topped all the leaderboards and worked well both with and without ASHA was HyperOpt. ASHA with SkOpt also gave very good results. Optuna did not work well when paired with ASHA, which is contrary to expectations, as it is the only algorithm that makes use of information from pruned trials. Random Search has also shown good performance - this may be due to the parallel nature of the experiment, which would penalize most algorithms that expect sequential results while having no such effect on RS. For most searchers, rank varied significantly between the different scoring methods. It is possible that 3 runs with different seeds were not enough to reduce the impact of randomness on the results.

All in all, given the results, it is my opinion that HyperOpt (with or without ASHA) is the best choice for tuning GBDTs.