First, install and import the H2O Python module and the H2OAutoML class, as you would any other library, then initialize H2O.

In [9]:
import h2o
import pandas as pd
from sklearn.preprocessing import LabelEncoder 
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O_cluster_uptime:,1 min 33 secs
H2O_cluster_timezone:,Europe/Moscow
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.7
H2O_cluster_version_age:,3 months and 7 days
H2O_cluster_name:,H2O_from_python_zhigu_4q5tvk
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,6.758 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


Next we need to load the data. This can be done either directly into an H2OFrame (the commented-out example) or in the usual way with pandas, so that we can encode the columns (including the target variable) and then convert the result to an H2OFrame.

Like many functions and objects in H2O, H2OFrame resembles a pandas DataFrame but has a few differences and its own syntax. For example, H2O can handle categorical features on its own, so there is no need to perform this step separately.

In [10]:
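# Load with pandas so the columns can be label-encoded before conversion
df = pd.read_csv("data/mushrooms.zip")
# Alternative: load straight into an H2OFrame
# df = h2o.import_file("mushrooms.csv", header=1)

# Replace every categorical value with an integer code, column by column
label_encoder = LabelEncoder()
for column in df.columns:
    df[column] = label_encoder.fit_transform(df[column])
# Convert to an H2OFrame and mark all columns as categorical (factor)
df = h2o.H2OFrame(df)
df = df.asfactor()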

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
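
To make the encoding step concrete, here is a minimal pure-Python sketch of what a label encoder does to a single column: it sorts the distinct values and replaces each one with its index. The function name `label_encode` and the sample values are illustrative assumptions, not part of scikit-learn or H2O.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[value] for value in values], classes


# Hypothetical single-letter categories, similar in spirit to the mushroom data
codes, classes = label_encode(["p", "e", "e", "p", "x"])
print(codes)    # [1, 0, 0, 1, 2]
print(classes)  # ['e', 'p', 'x']
```

This would also explain why names such as odor.5 in the model output refer to integer codes rather than to the original category letters.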


So with AutoML we can skip manual preprocessing steps like these and move straight on to training!

As with the functions in sklearn, we can create a train/test split so that the model's performance can be checked on unseen data.

In [11]:
# Split the data into train (about 70%) and test (about 30%); split_frame ratios are approximate
train_df, test_df = df.split_frame(ratios=[.7])
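
`split_frame` assigns rows to splits at random, so the resulting sizes only approximate the requested ratios, and a list of k ratios produces k + 1 frames. The sketch below imitates that idea in plain Python; `probabilistic_split` is a hypothetical helper written for illustration, not H2O's implementation.

```python
import random


def probabilistic_split(n_rows, ratios, seed=None):
    """Assign row indices to len(ratios) + 1 splits by independent random draws."""
    if any(r <= 0 for r in ratios) or sum(ratios) >= 1:
        raise ValueError("ratios must be positive and sum to less than 1")
    rng = random.Random(seed)
    # Cumulative upper bounds, e.g. [0.7, 0.15] -> [0.7, 0.85]
    bounds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        bounds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        draw = rng.random()
        # The first bound exceeding the draw wins; otherwise the row goes to the last split
        target = next((k for k, bound in enumerate(bounds) if draw < bound), len(ratios))
        splits[target].append(row)
    return splits


train, valid, test = probabilistic_split(1000, [0.7, 0.15], seed=1)
print(len(train), len(valid), len(test))  # close to 700/150/150, but not exact
```

With H2O itself, a three-way train/validation/test split would be requested as `df.split_frame(ratios=[0.7, 0.15], seed=1)`, which returns three frames.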

Next we need the column names so that we can pass them to the training function.
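
One way to build the predictor list without mutating the frame's own column list is a small helper like the one below. The name `predictor_columns` and its `exclude` argument are assumptions made for illustration; the column names in the example come from the mushroom data used here.

```python
def predictor_columns(columns, target, exclude=()):
    """Return the columns to pass as x: everything except the target and exclusions."""
    if target not in columns:
        raise ValueError(f"target column {target!r} not found")
    skipped = {target, *exclude}
    return [name for name in columns if name not in skipped]


columns = ["class", "odor", "gill-size", "veil-type"]
# veil-type is constant in this dataset, so it can be left out explicitly
print(predictor_columns(columns, "class", exclude=["veil-type"]))  # ['odor', 'gill-size']
```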

AutoML takes three data parameters: x, y and training_frame. Of these, y and training_frame are required, while x is optional. You can also set the stopping parameters max_runtime_secs and max_models here.

max_runtime_secs caps the total run time and max_models caps the number of models built. The H2O documentation asks for at least one of these stopping criteria; if neither is set, AutoML falls back to a one-hour time limit.
Most other parameters default to None (NULL) when they are not passed.
The x parameter is the list of predictor columns from training_frame. If you do not want to use every column of the frame as a predictor, pass only the ones you need in x; if x is omitted, every column except y is used.
For this dataset we will use all the features (except the target) and set max_runtime_secs to 500 seconds, a little over 8 minutes (without a limit, some of the models can take a long time to train). Let's run AutoML:

In [12]:
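y = "class"
# Use every column except the target as a predictor
train_columns = train_df.columns
train_columns.remove(y)

# Stop after at most 500 seconds; the seed fixes the random number generator
aml_model = H2OAutoML(max_runtime_secs=500, seed=1)
aml_model.train(x=train_columns, y=y, training_frame=train_df)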

AutoML progress: |█
17:53:27.609: AutoML: XGBoost is not available; skipping it.
17:53:27.670: _train param, Dropping bad and constant columns: [veil-type]
17:53:32.274: _train param, Dropping bad and constant columns: [veil-type]

███████
17:54:10.920: _train param, Dropping unused columns: [veil-type]
17:54:13.947: _train param, Dropping bad and constant columns: [veil-type]
17:54:16.15: _train param, Dropping bad and constant columns: [veil-type]

███
17:54:36.283: _train param, Dropping bad and constant columns: [veil-type]

██
17:54:55.187: _train param, Dropping bad and constant columns: [veil-type]

███
17:55:17.498: _train param, Dropping unused columns: [veil-type]


17:55:19.823: _train param, Dropping unused columns: [veil-type]
17:55:23.213: _train param, Dropping bad and constant columns: [veil-type]

█
17:55:25.467: _train param, Dropping bad and constant columns: [veil-type]

██
17:55:41.650: _train param, Dropping bad and constant columns: [veil-type]
17:55:44.518: _tra

,family,link,regularization,lambda_search,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,binomial,logit,Ridge ( lambda = 1.947E-5 ),"nlambda = 30, lambda.max = 19.466, lambda.min = 1.947E-5, lambda.1se = 1.947E-5",116,116,79,AutoML_1_20250704_175327_training_py_3_sid_9953

actual/predicted,0,1,Error,Rate
0,2964.0,0.0,0.0,(0.0/2964.0)
1,0.0,2752.0,0.0,(0.0/2752.0)
Total,2964.0,2752.0,0.0,(0.0/5716.0)

metric,threshold,value,idx
max f1,0.8891314,1.0,192.0
max f2,0.8891314,1.0,192.0
max f0point5,0.8891314,1.0,192.0
max accuracy,0.8891314,1.0,192.0
max precision,0.999998,1.0,0.0
max recall,0.8891314,1.0,192.0
max specificity,0.999998,1.0,0.0
max absolute_mcc,0.8891314,1.0,192.0
max min_per_class_accuracy,0.8891314,1.0,192.0
max mean_per_class_accuracy,0.8891314,1.0,192.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.010147,0.9999996,2.0770349,2.0770349,1.0,0.9999997,1.0,0.9999997,0.0210756,0.0210756,107.7034884,107.7034884,0.0210756
2,0.020119,0.9999994,2.0770349,2.0770349,1.0,0.9999995,1.0,0.9999996,0.0207122,0.0417878,107.7034884,107.7034884,0.0417878
3,0.030091,0.9999989,2.0770349,2.0770349,1.0,0.9999991,1.0,0.9999995,0.0207122,0.0625,107.7034884,107.7034884,0.0625
4,0.040063,0.9999983,2.0770349,2.0770349,1.0,0.9999986,1.0,0.9999992,0.0207122,0.0832122,107.7034884,107.7034884,0.0832122
5,0.050035,0.9999976,2.0770349,2.0770349,1.0,0.9999979,1.0,0.999999,0.0207122,0.1039244,107.7034884,107.7034884,0.1039244
6,0.10007,0.999993,2.0770349,2.0770349,1.0,0.9999955,1.0,0.9999972,0.1039244,0.2078488,107.7034884,107.7034884,0.2078488
7,0.150105,0.999985,2.0770349,2.0770349,1.0,0.9999896,1.0,0.9999947,0.1039244,0.3117733,107.7034884,107.7034884,0.3117733
8,0.20014,0.9999665,2.0770349,2.0770349,1.0,0.9999769,1.0,0.9999903,0.1039244,0.4156977,107.7034884,107.7034884,0.4156977
9,0.300035,0.9997862,2.0770349,2.0770349,1.0,0.9999073,1.0,0.9999626,0.2074855,0.6231831,107.7034884,107.7034884,0.6231831
10,0.400105,0.9988264,2.0770349,2.0770349,1.0,0.9994096,1.0,0.9998243,0.2078488,0.831032,107.7034884,107.7034884,0.831032

actual/predicted,0,1,Error,Rate
0,2964.0,0.0,0.0,(0.0/2964.0)
1,0.0,2752.0,0.0,(0.0/2752.0)
Total,2964.0,2752.0,0.0,(0.0/5716.0)

metric,threshold,value,idx
max f1,0.7345617,1.0,190.0
max f2,0.7345617,1.0,190.0
max f0point5,0.7345617,1.0,190.0
max accuracy,0.7345617,1.0,190.0
max precision,0.9999962,1.0,0.0
max recall,0.7345617,1.0,190.0
max specificity,0.9999962,1.0,0.0
max absolute_mcc,0.7345617,1.0,190.0
max min_per_class_accuracy,0.7345617,1.0,190.0
max mean_per_class_accuracy,0.7345617,1.0,190.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.010147,0.9999994,2.0770349,2.0770349,1.0,0.9999996,1.0,0.9999996,0.0210756,0.0210756,107.7034884,107.7034884,0.0210756
2,0.020119,0.999999,2.0770349,2.0770349,1.0,0.9999992,1.0,0.9999994,0.0207122,0.0417878,107.7034884,107.7034884,0.0417878
3,0.0302659,0.9999984,2.0770349,2.0770349,1.0,0.9999987,1.0,0.9999992,0.0210756,0.0628634,107.7034884,107.7034884,0.0628634
4,0.040063,0.9999975,2.0770349,2.0770349,1.0,0.999998,1.0,0.9999989,0.0203488,0.0832122,107.7034884,107.7034884,0.0832122
5,0.0505598,0.9999965,2.0770349,2.0770349,1.0,0.999997,1.0,0.9999985,0.0218023,0.1050145,107.7034884,107.7034884,0.1050145
6,0.1002449,0.9999905,2.0770349,2.0770349,1.0,0.9999936,1.0,0.9999961,0.1031977,0.2082122,107.7034884,107.7034884,0.2082122
7,0.150105,0.9999802,2.0770349,2.0770349,1.0,0.9999857,1.0,0.9999926,0.103561,0.3117733,107.7034884,107.7034884,0.3117733
8,0.20014,0.9999564,2.0770349,2.0770349,1.0,0.9999696,1.0,0.9999869,0.1039244,0.4156977,107.7034884,107.7034884,0.4156977
9,0.300035,0.9997297,2.0770349,2.0770349,1.0,0.999881,1.0,0.9999516,0.2074855,0.6231831,107.7034884,107.7034884,0.6231831
10,0.400105,0.998587,2.0770349,2.0770349,1.0,0.9992728,1.0,0.9997818,0.2078488,0.831032,107.7034884,107.7034884,0.831032

metric,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,1.0,0.0,1.0,1.0,1.0,1.0,1.0
aic,237.00415,1.2426684,236.58124,237.89026,237.356,235.0544,238.13884
auc,1.0,0.0,1.0,1.0,1.0,1.0,1.0
err,0.0,0.0,0.0,0.0,0.0,0.0,0.0
err_count,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f0point5,1.0,0.0,1.0,1.0,1.0,1.0,1.0
f1,1.0,0.0,1.0,1.0,1.0,1.0,1.0
f2,1.0,0.0,1.0,1.0,1.0,1.0,1.0
lift_top_group,2.0773273,0.0275513,2.076225,2.048387,2.1011028,2.0520647,2.1088562
loglikelihood,0.0,0.0,0.0,0.0,0.0,0.0,0.0

,timestamp,duration,iteration,lambda,predictors,deviance_train,deviance_xval,deviance_se,alpha,iterations,training_rmse,training_logloss,training_r2,training_auc,training_pr_auc,training_lift,training_classification_error
,2025-07-04 17:53:31,0.000 sec,2,.19E2,117,1.3536899,1.3598748,0.0005280,0.0,,,,,,,,
,2025-07-04 17:53:31,0.015 sec,4,.12E2,117,1.3356217,1.3451896,0.0006041,0.0,,,,,,,,
,2025-07-04 17:53:31,0.032 sec,6,.75E1,117,1.3079809,1.3225023,0.0007347,0.0,,,,,,,,
,2025-07-04 17:53:31,0.047 sec,8,.47E1,117,1.2668933,1.2882714,0.0009516,0.0,,,,,,,,
,2025-07-04 17:53:31,0.059 sec,10,.29E1,117,1.2083515,1.2384257,0.0012943,0.0,,,,,,,,
,2025-07-04 17:53:31,0.072 sec,12,.18E1,117,1.1297256,1.1694527,0.0018049,0.0,,,,,,,,
,2025-07-04 17:53:31,0.087 sec,14,.11E1,117,1.0318037,1.0803047,0.0025127,0.0,,,,,,,,
,2025-07-04 17:53:31,0.100 sec,16,.69E0,117,0.9199827,0.9741553,0.0034088,0.0,,,,,,,,
,2025-07-04 17:53:31,0.115 sec,18,.43E0,117,0.8023207,0.8584074,0.0044202,0.0,,,,,,,,
,2025-07-04 17:53:31,0.128 sec,20,.27E0,117,0.6866634,0.7411846,0.0054132,0.0,,,,,,,,

variable,relative_importance,scaled_importance,percentage
spore-print-color.5,5.6340294,1.0,0.0473956
odor.5,5.4199610,0.9620044,0.0455948
odor.0,4.4657154,0.7926326,0.0375673
odor.1,4.3782506,0.7771082,0.0368315
odor.3,4.3038187,0.7638971,0.0362054
odor.2,3.7647421,0.6682149,0.0316705
stalk-root.1,3.3628807,0.5968873,0.0282899
odor.6,3.1497993,0.5590669,0.0264973
gill-size.0,3.0331221,0.5383575,0.0255158
gill-size.1,2.9752145,0.5280793,0.0250287


We capped the run with max_runtime_secs at 500 seconds (a little over 8 minutes), but there is also a parameter, max_models, that limits the number of models explored rather than the time. Beyond that, AutoML offers further options for tuning the process and many additional parameters that you can set.

You can read more about the parameters in the official documentation.

Once H2OAutoML has been run and trained, you can see which models performed best in our case and pick them for further investigation.
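
As a hedged sketch of the model-count alternative, the helper below assembles constructor arguments for `H2OAutoML` using `max_models` and, optionally, `max_runtime_secs`. The helper name `automl_stopping_args` is hypothetical; `max_models`, `max_runtime_secs` and `seed` are the library's own parameter names.

```python
def automl_stopping_args(max_models=None, max_runtime_secs=None, seed=1):
    """Build keyword arguments for H2OAutoML with at least one stopping criterion."""
    if max_models is None and max_runtime_secs is None:
        raise ValueError("set max_models, max_runtime_secs, or both")
    args = {"seed": seed}
    if max_models is not None:
        args["max_models"] = int(max_models)
    if max_runtime_secs is not None:
        args["max_runtime_secs"] = int(max_runtime_secs)
    return args


args = automl_stopping_args(max_models=10)
print(args)  # {'seed': 1, 'max_models': 10}
# With H2O running, the arguments would be used as:
# aml_model = H2OAutoML(**args)
```

When both limits are given, the run stops at whichever one is reached first.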

In [13]:
leaderboard = aml_model.leaderboard
leaderboard.head()

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
GLM_1_AutoML_1_20250704_175327,1,0.00148887,1,0,0.00711334,5.05996e-05
DeepLearning_grid_1_AutoML_1_20250704_175327_model_1,1,0.00024599,1,0,0.00678681,4.60608e-05
StackedEnsemble_BestOfFamily_1_AutoML_1_20250704_175327,1,0.000847209,1,0,0.000847485,7.1823e-07
GBM_grid_1_AutoML_1_20250704_175327_model_1,1,1.54317e-13,1,0,1.13466e-11,1.28746e-22
GBM_grid_1_AutoML_1_20250704_175327_model_3,1,1.60042e-06,1,0,0.00010158,1.03185e-08
StackedEnsemble_AllModels_1_AutoML_1_20250704_175327,1,0.000677898,1,0,0.000679677,4.61961e-07
GBM_grid_1_AutoML_1_20250704_175327_model_7,1,3.14266e-14,1,0,1.77794e-12,3.16108e-24
StackedEnsemble_AllModels_3_AutoML_1_20250704_175327,1,0.000647937,1,0,0.000648353,4.20362e-07
GBM_grid_1_AutoML_1_20250704_175327_model_8,1,9.93531e-07,1,0,3.46542e-06,1.20092e-11
StackedEnsemble_AllModels_2_AutoML_1_20250704_175327,1,0.000667657,1,0,0.000668159,4.46436e-07
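
Every model on this leaderboard reaches an AUC of 1, so the ranking is effectively a tie and it can be worth looking at several candidates rather than only the first row. A minimal sketch, assuming `aml` is a trained `H2OAutoML` object whose `leaderboard.as_data_frame()` gives a pandas table with a `model_id` column; `top_model_ids` is a hypothetical helper name.

```python
def top_model_ids(aml, n=3):
    """Return the ids of the n highest-ranked models on the AutoML leaderboard."""
    table = aml.leaderboard.as_data_frame()
    return list(table["model_id"])[:n]


# With H2O running, an individual model can then be fetched by its id:
# import h2o
# model = h2o.get_model(top_model_ids(aml_model)[1])
# best = aml_model.leader  # the model in the first leaderboard row
```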


Let's check the predictions on the test data and see whether the model has overfitted:

In [15]:
prediction = aml_model.predict(test_df)

glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
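
The prediction call alone does not show whether the model overfits; for that, the test metrics need to be compared with the training and cross-validation metrics. A minimal sketch, assuming a binary classifier such as `aml_model.leader`: `model_performance`, `auc` and `logloss` are H2O methods, while `overfit_report` is a hypothetical helper name.

```python
def overfit_report(model, test_frame):
    """Collect train, cross-validation and test AUC for a binomial H2O model."""
    test_perf = model.model_performance(test_frame)
    return {
        "train_auc": model.auc(train=True),
        "xval_auc": model.auc(xval=True),
        "test_auc": test_perf.auc(),
        "test_logloss": test_perf.logloss(),
    }


# With H2O running:
# report = overfit_report(aml_model.leader, test_df)
# A test AUC well below the training AUC would point to overfitting.
```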
