## Using AutoGluon in Kaggle Competitions

This notebook walks through a basic workflow for entering Kaggle competitions with AutoGluon on tabular data stored in CSV files.

In [52]:
# Full installation for better performance with AutoGluon + Kaggle
# (the package spec is quoted so the shell does not interpret the square brackets)
!pip install kaggle python-dotenv "autogluon.tabular[all]==1.5.0" -q


### 1. Installing the Kaggle Client

To interact with the platform from a local or cloud environment, install the official Kaggle client. The install cell above covers this with `pip install kaggle`.

In [55]:
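from dotenv import load_dotenv
import os

# Load variables from a local .env file into os.environ
load_dotenv()

# Fail early with a clear message if the token is missing
token = os.getenv("KAGGLE_API_TOKEN")
if token is None:
    raise RuntimeError("KAGGLE_API_TOKEN not found in the environment or .env file")

print("Token OK:", token is not None)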

Token OK: True


### 2. Configuring Credentials

Accessing the API requires generating a token from your Kaggle account and downloading the resulting `kaggle.json` file.

This file must be placed in the directory the client expects on your system (by default `~/.kaggle/`).
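
As a rough illustration of this step, the sketch below copies a downloaded `kaggle.json` into a configuration directory and restricts its permissions. The helper name `install_kaggle_json` and its arguments are hypothetical, and the demo writes placeholder credentials to a temporary directory so it never touches a real credential file.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def install_kaggle_json(src: Path, config_dir: Path) -> Path:
    """Copy a kaggle.json file into config_dir and make it owner-only (hypothetical helper)."""
    creds = json.loads(src.read_text(encoding="utf-8"))
    # A kaggle.json file carries these two keys; fail early if they are missing.
    missing = {"username", "key"} - creds.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing keys: {sorted(missing)}")

    config_dir.mkdir(parents=True, exist_ok=True)
    dest = config_dir / "kaggle.json"
    dest.write_text(json.dumps(creds), encoding="utf-8")
    # Owner read/write only (chmod 600); this has limited effect on Windows.
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)
    return dest


# Demo with placeholder credentials in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    downloaded = Path(tmp) / "downloads" / "kaggle.json"
    downloaded.parent.mkdir()
    downloaded.write_text(json.dumps({"username": "demo_user", "key": "not-a-real-key"}))
    installed = install_kaggle_json(downloaded, Path(tmp) / ".kaggle")
    installed_keys = sorted(json.loads(installed.read_text()))
    print(installed.name, installed_keys)
```

In practice `config_dir` would be `Path.home() / ".kaggle"`. The notebook itself takes a different route and reads the token from an environment variable loaded with `python-dotenv`.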

In [56]:
# Download the competition data
COMP = "playground-series-s6e2"

!kaggle competitions download -c {COMP}

playground-series-s6e2.zip: Skipping, found more recently modified local copy (use --force to force download)


### 3. Downloading the Data

Once the credentials are configured, the data can be downloaded programmatically from the notebook.

The downloaded archives must then be extracted before use.

In [10]:
# Extract the downloaded archive
import zipfile

with zipfile.ZipFile(f"{COMP}.zip", "r") as z:
    z.extractall("data")
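
A small sketch of a slightly more defensive extraction step: it lists the archive's members and only extracts those not already present in the target folder. The helper name `extract_if_needed` is an assumption for illustration, and the demo builds a throwaway archive instead of using the competition file.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_if_needed(zip_path: Path, out_dir: Path) -> list[str]:
    """Extract members that are not already present in out_dir; return their names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as z:
        missing = [name for name in z.namelist() if not (out_dir / name).exists()]
        z.extractall(out_dir, members=missing)
    return missing


# Demo on a temporary archive with two tiny CSV files.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(archive, "w") as z:
        z.writestr("train.csv", "id,x\n0,1\n")
        z.writestr("test.csv", "id,x\n1,2\n")

    first_run = extract_if_needed(archive, Path(tmp) / "data")
    second_run = extract_if_needed(archive, Path(tmp) / "data")
    print(first_run, second_run)
```

The second call extracts nothing because both files already exist, which avoids rewriting large CSV files every time the notebook is re-run.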

### 4. Manual Download (Alternative)

Alternatively, the datasets can be downloaded directly from the competition's official page after accepting its rules.


In [21]:
from autogluon.tabular import TabularDataset, TabularPredictor
from sklearn.model_selection import train_test_split

# Load the raw data
train_raw = TabularDataset("data/train.csv")
test_raw = TabularDataset("data/test.csv")
sample = TabularDataset("data/sample_submission.csv")

# Keep the test IDs for the submission file
test_ids = test_raw["id"].copy()

# Drop the ID column before training
train = train_raw.drop(columns=["id"])
test = test_raw.drop(columns=["id"])

# Define the target column
label = "Heart Disease"

# Stratified split: keep the class balance in both parts
train_set, val_set = train_test_split(
    train,
    test_size=0.2,
    stratify=train[label],
    random_state=42
)

X_val = val_set.drop(columns=[label])
y_val = val_set[label]

# Inspect the result
train_set.head()


Loaded data from: data/train.csv | Columns = 15 / 15 | Rows = 630000 -> 630000
Loaded data from: data/test.csv | Columns = 14 / 14 | Rows = 270000 -> 270000
Loaded data from: data/sample_submission.csv | Columns = 2 / 2 | Rows = 270000 -> 270000


Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
539041,48,1,4,140,249,0,2,144,0,2.5,2,1,7,Presence
211140,61,1,3,130,208,0,0,174,0,1.6,2,0,3,Absence
325129,69,1,4,130,226,0,0,132,1,1.4,2,0,6,Presence
91177,67,1,4,140,269,0,2,182,1,1.2,1,3,3,Presence
346105,48,1,3,120,275,0,0,171,1,0.4,1,0,3,Absence
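
To illustrate what the `stratify` argument does in the split above, here is a minimal, dependency-free sketch of a stratified split on synthetic labels. It is not scikit-learn's implementation; the functions `stratified_split` and `class_share` and the toy class counts are made up for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


def class_share(labels, indices, target):
    """Fraction of the selected rows whose label equals target."""
    counts = Counter(labels[i] for i in indices)
    return counts[target] / len(indices)


# Toy labels: 450 "Presence" and 550 "Absence" (made-up proportions).
labels = ["Presence"] * 450 + ["Absence"] * 550
train_idx, val_idx = stratified_split(labels)

train_share = class_share(labels, train_idx, "Presence")
val_share = class_share(labels, val_idx, "Presence")
print(f"train: {train_share:.3f}  val: {val_share:.3f}")
```

On the notebook's real data, the equivalent check would be comparing `train_set[label].value_counts(normalize=True)` with the same call on `val_set`.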


In [17]:
import torch
torch.cuda.empty_cache()

import psutil
print(psutil.virtual_memory())


svmem(total=16873545728, available=5454082048, percent=67.7, used=11419463680, free=5454082048)
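
The raw `svmem` values are in bytes. A short conversion, using the numbers printed above, makes them easier to compare with the GB figures that AutoGluon logs during training. The helper name `to_gib` exists only for this sketch.

```python
def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# Values copied from the psutil output above.
total_gib = to_gib(16873545728)
available_gib = to_gib(5454082048)
print(f"total: {total_gib:.2f} GiB, available: {available_gib:.2f} GiB")
```

With roughly a third of the machine's memory free, memory-hungry models may be skipped or trained with reduced resources, so freeing memory before calling `fit` is a reasonable precaution.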


In [18]:
predictor = TabularPredictor(
    label=label,
    eval_metric="roc_auc",
    path="models/"
).fit(
    train_set,
    presets='best',
    time_limit=3600
)


Preset alias specified: 'best' maps to 'best_quality'.
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.5.0
Python Version:     3.13.2
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.26200
CPU Count:          28
Pytorch Version:    2.7.1+cu128
CUDA Version:       12.8
GPU Memory:         GPU 0: 8.00/8.00 GB
Total GPU Memory:   Free: 8.00 GB, Allocated: 0.00 GB, Total: 8.00 GB
GPU Count:          1
Memory Avail:       5.13 GB / 15.71 GB (32.7%)
Disk Space Avail:   163.22 GB / 951.65 GB (17.2%)
Presets specified: ['best']
Using hyperparameters preset: hyperparameters='zeroshot'
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
DyStack is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disabl

[1000]	valid_set's binary_logloss: 0.299041
[2000]	valid_set's binary_logloss: 0.298871
[1000]	valid_set's binary_logloss: 0.295506
[1000]	valid_set's binary_logloss: 0.294834
[2000]	valid_set's binary_logloss: 0.294594
[1000]	valid_set's binary_logloss: 0.293079
[2000]	valid_set's binary_logloss: 0.292857
[1000]	valid_set's binary_logloss: 0.294914
[1000]	valid_set's binary_logloss: 0.297277
[1000]	valid_set's binary_logloss: 0.291697
[2000]	valid_set's binary_logloss: 0.291453
[1000]	valid_set's binary_logloss: 0.297668
[2000]	valid_set's binary_logloss: 0.297401


	0.9455	 = Validation score   (roc_auc)
	66.7s	 = Training   runtime
	2.26s	 = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 529.11s of the 829.09s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy (sequential: cpus=20, gpus=0)


[1000]	valid_set's binary_logloss: 0.265324


	0.9549	 = Validation score   (roc_auc)
	22.26s	 = Training   runtime
	0.71s	 = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ... Training model for up to 505.39s of the 805.37s of remaining time.
		To set the same value for all models, do the following when calling predictor.fit: `predictor.fit(..., ag_args_fit={"ag.max_memory_usage_ratio": VALUE})`
		Setting "ag.max_memory_usage_ratio" to values above 1 may result in out-of-memory errors. You may consider using a machine with more memory as a safer alternative.
	Fitting 1 model on all data (use_child_oof=True) | Fitting with cpus=28, gpus=0, mem=2.1/5.0 GB
	0.9521	 = Validation score   (roc_auc)
	9.43s	 = Training   runtime
	9.41s	 = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ... Training model for up to 485.96s of the 785.94s of remaining time.
		To set the same value for all models, do the following when calling predictor.fit: `predictor.fit(..., ag_args_fit={"ag.max_memory_usage_ratio": VALUE})`
		Settin

[1000]	valid_set's binary_logloss: 0.292877
[2000]	valid_set's binary_logloss: 0.292557
[1000]	valid_set's binary_logloss: 0.293533
[2000]	valid_set's binary_logloss: 0.293432
[1000]	valid_set's binary_logloss: 0.291366
[2000]	valid_set's binary_logloss: 0.291025
[1000]	valid_set's binary_logloss: 0.297219
[2000]	valid_set's binary_logloss: 0.296802
[3000]	valid_set's binary_logloss: 0.296848
[1000]	valid_set's binary_logloss: 0.295909
[2000]	valid_set's binary_logloss: 0.295636
[1000]	valid_set's binary_logloss: 0.299362
[2000]	valid_set's binary_logloss: 0.299156
[1000]	valid_set's binary_logloss: 0.293928
[2000]	valid_set's binary_logloss: 0.293784
[1000]	valid_set's binary_logloss: 0.297405
[2000]	valid_set's binary_logloss: 0.297096


	0.9456	 = Validation score   (roc_auc)
	88.66s	 = Training   runtime
	2.9s	 = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 2564.12s of the 2564.11s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy (sequential: cpus=20, gpus=0)
	0.9549	 = Validation score   (roc_auc)
	24.19s	 = Training   runtime
	0.8s	 = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ... Training model for up to 2538.38s of the 2538.37s of remaining time.
		To set the same value for all models, do the following when calling predictor.fit: `predictor.fit(..., ag_args_fit={"ag.max_memory_usage_ratio": VALUE})`
		Setting "ag.max_memory_usage_ratio" to values above 1 may result in out-of-memory errors. You may consider using a machine with more memory as a safer alternative.
	Fitting 1 model on all data (use_child_oof=True) | Fitting with cpus=28, gpus=0, mem=2.3/5.7 GB
	0.9522	 = Validation score   (roc_auc)
	10.8s	 = T

In [22]:
from autogluon.tabular import TabularPredictor

# Reload the trained predictor from disk
path = "models/"
predictor = TabularPredictor.load(path)

predictor.leaderboard(val_set)

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.956262,0.95524,roc_auc,1.60665,0.928721,255.493837,0.005095,0.064776,17.821565,2,True,12
1,CatBoost_BAG_L1,0.95626,0.955234,roc_auc,0.14834,0.065166,213.481292,0.14834,0.065166,213.481292,1,True,5
2,LightGBM_BAG_L1,0.956063,0.954938,roc_auc,1.453214,0.798779,24.19098,1.453214,0.798779,24.19098,1,True,2
3,XGBoost_BAG_L1,0.956049,0.95483,roc_auc,0.903435,0.43219,19.601634,0.903435,0.43219,19.601634,1,True,9
4,LightGBMLarge_BAG_L1,0.955627,0.954486,roc_auc,1.426778,0.847872,32.282502,1.426778,0.847872,32.282502,1,True,11
5,NeuralNetFastAI_BAG_L1,0.954185,0.953198,roc_auc,4.140781,1.853964,1486.066169,4.140781,1.853964,1486.066169,1,True,8
6,NeuralNetTorch_BAG_L1,0.953615,0.952512,roc_auc,2.360407,1.204813,693.888154,2.360407,1.204813,693.888154,1,True,10
7,RandomForestEntr_BAG_L1,0.953351,0.95227,roc_auc,0.828827,11.158037,11.320064,0.828827,11.158037,11.320064,1,True,4
8,RandomForestGini_BAG_L1,0.953183,0.952216,roc_auc,0.878507,10.257609,10.795136,0.878507,10.257609,10.795136,1,True,3
9,ExtraTreesEntr_BAG_L1,0.952034,0.95096,roc_auc,0.802146,11.242105,8.730854,0.802146,11.242105,8.730854,1,True,7


In [23]:
print(f'Prior to calibration (predictor.decision_threshold={predictor.decision_threshold}):')
scores = predictor.evaluate(val_set)

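# Note: the eval metric here is roc_auc, which does not depend on a decision
# threshold, so the threshold is expected to stay at 0.5 (see the output below).
calibrated_decision_threshold = predictor.calibrate_decision_threshold()
predictor.set_decision_threshold(calibrated_decision_threshold)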

print(f'After calibration (predictor.decision_threshold={predictor.decision_threshold}):')
scores_calibrated = predictor.evaluate(val_set)

Prior to calibration (predictor.decision_threshold=0.5):




After calibration (predictor.decision_threshold=0.5):
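
The threshold is unchanged after calibration, which is consistent with ROC AUC being a ranking metric that does not depend on any cutoff. The dependency-free sketch below illustrates this on made-up scores: AUC is computed from pairwise rankings, whereas accuracy changes with the threshold. The function names and the toy data are assumptions for illustration, not AutoGluon internals.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(y_true, scores, threshold):
    """Accuracy after turning scores into class predictions with a cutoff."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Toy labels and predicted probabilities (made up).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]

auc = roc_auc(y_true, scores)
acc_050 = accuracy_at(y_true, scores, 0.50)
acc_038 = accuracy_at(y_true, scores, 0.38)
print(f"AUC={auc:.4f}  acc@0.50={acc_050:.3f}  acc@0.38={acc_038:.3f}")
```

To tune the cutoff for a threshold-dependent metric instead, `calibrate_decision_threshold` accepts a `metric` argument, for example `metric="f1"`.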


In [26]:
# The threshold stayed at 0.5 after calibration, so the calibrated scores are
# identical to these; only one score per metric is printed.
decision_threshold = predictor.decision_threshold
for metric_name, metric_score in scores.items():
    print(f'decision_threshold={decision_threshold:.3f}\t| metric="{metric_name}"'
          f'\n\ttest_score: {metric_score:.4f}')

decision_threshold=0.500	| metric="roc_auc"
	test_score: 0.9563
decision_threshold=0.500	| metric="accuracy"
	test_score: 0.8901
decision_threshold=0.500	| metric="balanced_accuracy"
	test_score: 0.8882
decision_threshold=0.500	| metric="mcc"
	test_score: 0.7777
decision_threshold=0.500	| metric="f1"
	test_score: 0.8764
decision_threshold=0.500	| metric="precision"
	test_score: 0.8840
decision_threshold=0.500	| metric="recall"
	test_score: 0.8690


In [27]:
predictor.features()

['Age',
 'Sex',
 'Chest pain type',
 'BP',
 'Cholesterol',
 'FBS over 120',
 'EKG results',
 'Max HR',
 'Exercise angina',
 'ST depression',
 'Slope of ST',
 'Number of vessels fluro',
 'Thallium']

In [28]:
datapoint = X_val.iloc[[0]]  # Note: .iloc[0] won't work because it returns pandas Series instead of DataFrame
datapoint

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium
307256,41,0,1,140,283,0,0,162,0,0.0,1,0,3


In [29]:
predictor.predict(datapoint)


307256    Absence
Name: Heart Disease, dtype: object

In [30]:
predictor.predict_proba(datapoint)  # returns a DataFrame that shows which probability corresponds to which class


Unnamed: 0,Absence,Presence
307256,0.997601,0.002399


In [31]:
predictor.model_best

'WeightedEnsemble_L2'

In [32]:
predictor.leaderboard(val_set)

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.956262,0.95524,roc_auc,1.886279,0.928721,255.493837,0.004984,0.064776,17.821565,2,True,12
1,CatBoost_BAG_L1,0.95626,0.955234,roc_auc,0.409669,0.065166,213.481292,0.409669,0.065166,213.481292,1,True,5
2,LightGBM_BAG_L1,0.956063,0.954938,roc_auc,1.471625,0.798779,24.19098,1.471625,0.798779,24.19098,1,True,2
3,XGBoost_BAG_L1,0.956049,0.95483,roc_auc,0.819648,0.43219,19.601634,0.819648,0.43219,19.601634,1,True,9
4,LightGBMLarge_BAG_L1,0.955627,0.954486,roc_auc,1.445695,0.847872,32.282502,1.445695,0.847872,32.282502,1,True,11
5,NeuralNetFastAI_BAG_L1,0.954185,0.953198,roc_auc,4.247618,1.853964,1486.066169,4.247618,1.853964,1486.066169,1,True,8
6,NeuralNetTorch_BAG_L1,0.953615,0.952512,roc_auc,2.409047,1.204813,693.888154,2.409047,1.204813,693.888154,1,True,10
7,RandomForestEntr_BAG_L1,0.953351,0.95227,roc_auc,2.037335,11.158037,11.320064,2.037335,11.158037,11.320064,1,True,4
8,RandomForestGini_BAG_L1,0.953183,0.952216,roc_auc,0.812468,10.257609,10.795136,0.812468,10.257609,10.795136,1,True,3
9,ExtraTreesEntr_BAG_L1,0.952034,0.95096,roc_auc,2.068323,11.242105,8.730854,2.068323,11.242105,8.730854,1,True,7


In [33]:
predictor.leaderboard(extra_info=True)

Unnamed: 0,model,score_val,eval_metric,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order,...,hyperparameters,hyperparameters_fit,ag_args_fit,features,compile_time,child_hyperparameters,child_hyperparameters_fit,child_ag_args_fit,ancestors,descendants
0,WeightedEnsemble_L2,0.95524,roc_auc,0.928721,255.493837,0.064776,17.821565,2,True,12,...,"{'use_orig_features': False, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[CatBoost_BAG_L1, LightGBM_BAG_L1]",,"{'ensemble_size': 25, 'subsample_size': 1000000}",{'ensemble_size': 8},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[CatBoost_BAG_L1, LightGBM_BAG_L1]",[]
1,CatBoost_BAG_L1,0.955234,roc_auc,0.065166,213.481292,0.065166,213.481292,1,True,5,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[Sex, Exercise angina, Cholesterol, Age, Slope of ST, ST depression, Chest pain type, EKG results, Max HR, BP, Thallium, FBS over 120, Number of vessels fluro]",,"{'iterations': 10000, 'learning_rate': 0.05, 'allow_writing_files': False, 'eval_metric': 'Logloss', 'random_seed': 0}",{'iterations': 1050},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[WeightedEnsemble_L2]
2,LightGBM_BAG_L1,0.954938,roc_auc,0.798779,24.19098,0.798779,24.19098,1,True,2,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[Sex, Exercise angina, Cholesterol, Age, Slope of ST, ST depression, Chest pain type, EKG results, Max HR, BP, Thallium, FBS over 120, Number of vessels fluro]",,"{'learning_rate': 0.05, 'seed': 0}",{'num_boost_round': 583},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[WeightedEnsemble_L2]
3,XGBoost_BAG_L1,0.95483,roc_auc,0.43219,19.601634,0.43219,19.601634,1,True,9,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[Sex, Exercise angina, Cholesterol, Age, Slope of ST, ST depression, Chest pain type, EKG results, Max HR, BP, Thallium, FBS over 120, Number of vessels fluro]",,"{'n_estimators': 10000, 'learning_rate': 0.1, 'n_jobs': -1, 'proc.max_category_levels': 100, 'objective': 'binary:logistic', 'booster': 'gbtree', 'seed': 0}",{'n_estimators': 234},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[]
4,LightGBMLarge_BAG_L1,0.954486,roc_auc,0.847872,32.282502,0.847872,32.282502,1,True,11,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[Sex, Exercise angina, Cholesterol, Age, Slope of ST, ST depression, Chest pain type, EKG results, Max HR, BP, Thallium, FBS over 120, Number of vessels fluro]",,"{'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'seed': 0}",{'num_boost_round': 331},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[]
5,NeuralNetFastAI_BAG_L1,0.953198,roc_auc,1.853964,1486.066169,1.853964,1486.066169,1,True,8,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[Sex, Exercise angina, Cholesterol, Age, Slope of ST, ST depression, Chest pain type, EKG results, Max HR, BP, Thallium, FBS over 120, Number of vessels fluro]",,"{'layers': None, 'emb_drop': 0.1, 'ps': 0.1, 'bs': 'auto', 'lr': 0.01, 'epochs': 'auto', 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'random_seed': 0}","{'epochs': 30, 'best_epoch': 27}","{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': ['text_ngram', 'text_as_category'], 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[]
6,NeuralNetTorch_BAG_L1,0.952512,roc_auc,1.204813,693.888154,1.204813,693.888154,1,True,10,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[Sex, Exercise angina, Cholesterol, Age, Slope of ST, ST depression, Chest pain type, EKG results, Max HR, BP, Thallium, FBS over 120, Number of vessels fluro]",,"{'num_epochs': 1000, 'epochs_wo_improve': None, 'activation': 'relu', 'embedding_size_factor': 1.0, 'embed_exponent': 0.56, 'max_embedding_dim': 100, 'y_range': None, 'y_range_extend': 0.05, 'dropout_prob': 0.1, 'optimizer': 'adam', 'learning_rate': 0.0003, 'weight_decay': 1e-06, 'proc.embed_min_categories': 4, 'proc.impute_strategy': 'median', 'proc.max_category_levels': 100, 'proc.skew_threshold': 0.99, 'use_ngram_features': False, 'num_layers': 4, 'hidden_size': 128, 'max_batch_size': 512, 'use_batchnorm': False, 'loss_function': 'auto', 'seed_value': 0}","{'batch_size': 256, 'num_epochs': 10}","{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': ['text_ngram', 'text_as_category'], 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[]
7,RandomForestEntr_BAG_L1,0.95227,roc_auc,11.158037,11.320064,11.158037,11.320064,1,True,4,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'use_child_oof': True, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[Sex, Exercise angina, Cholesterol, Age, Slope of ST, ST depression, Chest pain type, EKG results, Max HR, BP, Thallium, FBS over 120, Number of vessels fluro]",,"{'n_estimators': 300, 'max_leaf_nodes': 15000, 'n_jobs': -1, 'bootstrap': True, 'criterion': 'entropy', 'random_state': 0}",{'n_estimators': 300},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[]
8,RandomForestGini_BAG_L1,0.952216,roc_auc,10.257609,10.795136,10.257609,10.795136,1,True,3,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'use_child_oof': True, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[Sex, Exercise angina, Cholesterol, Age, Slope of ST, ST depression, Chest pain type, EKG results, Max HR, BP, Thallium, FBS over 120, Number of vessels fluro]",,"{'n_estimators': 300, 'max_leaf_nodes': 15000, 'n_jobs': -1, 'bootstrap': True, 'criterion': 'gini', 'random_state': 0}",{'n_estimators': 300},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[]
9,ExtraTreesEntr_BAG_L1,0.95096,roc_auc,11.242105,8.730854,11.242105,8.730854,1,True,7,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None, 'vary_seed_across_folds': False, 'use_child_oof': True, 'model_random_seed': 0}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[Sex, Exercise angina, Cholesterol, Age, Slope of ST, ST depression, Chest pain type, EKG results, Max HR, BP, Thallium, FBS over 120, Number of vessels fluro]",,"{'n_estimators': 300, 'max_leaf_nodes': 15000, 'n_jobs': -1, 'bootstrap': True, 'criterion': 'entropy', 'random_state': 0}",{'n_estimators': 300},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[]


In [34]:
predictor.leaderboard(val_set, extra_metrics=['roc_auc', 'accuracy', 'balanced_accuracy', 'log_loss'])

Unnamed: 0,model,score_test,roc_auc,accuracy,balanced_accuracy,log_loss,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.956262,0.956262,0.890143,0.888167,-0.265254,0.95524,roc_auc,1.642989,0.928721,255.493837,0.004988,0.064776,17.821565,2,True,12
1,CatBoost_BAG_L1,0.95626,0.95626,0.890151,0.888164,-0.265248,0.955234,roc_auc,0.155883,0.065166,213.481292,0.155883,0.065166,213.481292,1,True,5
2,LightGBM_BAG_L1,0.956063,0.956063,0.889754,0.887839,-0.266102,0.954938,roc_auc,1.482118,0.798779,24.19098,1.482118,0.798779,24.19098,1,True,2
3,XGBoost_BAG_L1,0.956049,0.956049,0.88981,0.887856,-0.265997,0.95483,roc_auc,0.922643,0.43219,19.601634,0.922643,0.43219,19.601634,1,True,9
4,LightGBMLarge_BAG_L1,0.955627,0.955627,0.889135,0.887195,-0.267381,0.954486,roc_auc,1.494972,0.847872,32.282502,1.494972,0.847872,32.282502,1,True,11
5,NeuralNetFastAI_BAG_L1,0.954185,0.954185,0.887341,0.885376,-0.271341,0.953198,roc_auc,4.11546,1.853964,1486.066169,4.11546,1.853964,1486.066169,1,True,8
6,NeuralNetTorch_BAG_L1,0.953615,0.953615,0.886611,0.884425,-0.273133,0.952512,roc_auc,2.516288,1.204813,693.888154,2.516288,1.204813,693.888154,1,True,10
7,RandomForestEntr_BAG_L1,0.953351,0.953351,0.88681,0.884766,-0.274732,0.95227,roc_auc,0.77042,11.158037,11.320064,0.77042,11.158037,11.320064,1,True,4
8,RandomForestGini_BAG_L1,0.953183,0.953183,0.886667,0.884628,-0.275583,0.952216,roc_auc,0.699015,10.257609,10.795136,0.699015,10.257609,10.795136,1,True,3
9,ExtraTreesEntr_BAG_L1,0.952034,0.952034,0.884317,0.882216,-0.281408,0.95096,roc_auc,0.780325,11.242105,8.730854,0.780325,11.242105,8.730854,1,True,7


In [35]:
i = 0  # index of model to use
all_models = predictor.model_names()
model_to_use = all_models[i]
model_pred = predictor.predict(datapoint, model=model_to_use)
print("Prediction from %s model: %s" % (model_to_use, model_pred.iloc[0]))

Prediction from LightGBMXT_BAG_L1 model: Absence


In [36]:
# Objects defined below are dicts of various information (not printed here as they are quite large):
predictor_information = predictor.info()  # access info about the predictor
model_info = predictor.model_info(model_to_use)  # access info about a model
model_info_alternative = predictor._trainer.load_model(model_to_use).get_info()  # load the inner model and access its info directly

In [37]:
y_pred_proba = predictor.predict_proba(X_val)
predictor.evaluate_predictions(y_true=y_val, y_pred=y_pred_proba)

{'roc_auc': np.float64(0.9562621337306625),
 'accuracy': 0.8901428571428571,
 'balanced_accuracy': np.float64(0.8881668335037967),
 'mcc': 0.7776546964114195,
 'f1': 0.8764416038847431,
 'precision': 0.8839692457280732,
 'recall': 0.8690410861907206}

In [38]:
predictor.feature_importance(val_set)

Computing feature importance via permutation shuffling for 13 features using 5000 rows with 5 shuffle sets...
	15.3s	= Expected runtime (3.06s per shuffle set)
	5.73s	= Actual runtime (Completed 5 of 5 shuffle sets)


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
Thallium,0.034511,0.001564,5.050899e-07,5,0.037731,0.03129
Max HR,0.027891,0.002339,5.877515e-06,5,0.032706,0.023075
Chest pain type,0.027066,0.00208,4.148721e-06,5,0.031348,0.022784
Number of vessels fluro,0.012652,0.000854,2.477903e-06,5,0.014411,0.010894
Exercise angina,0.007424,0.00128,0.0001019552,5,0.010059,0.004788
Sex,0.007005,0.001007,4.994216e-05,5,0.009079,0.004931
Slope of ST,0.005422,0.000665,2.663744e-05,5,0.006792,0.004053
ST depression,0.004632,0.000934,0.000187915,5,0.006555,0.002709
Age,0.003919,0.000675,0.0001015174,5,0.005308,0.002529
EKG results,0.001114,0.000247,0.0002710661,5,0.001622,0.000606
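
The log above mentions permutation shuffling. As a rough, dependency-free sketch of that idea (not AutoGluon's implementation), the example below shuffles one column at a time and measures how much a toy model's accuracy drops. The model, the data, and the helper names are all invented for illustration.

```python
import random


def toy_model(row):
    """Stand-in classifier that only looks at feature 0 (hypothetical)."""
    return 1 if row[0] > 0.5 else 0


def accuracy(rows, labels):
    """Share of rows where the toy model's prediction matches the label."""
    return sum(toy_model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(rows, labels, n_features, n_shuffles=5, seed=0):
    """Mean drop in accuracy when one feature column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_shuffles):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(shuffled, labels))
        importances.append(sum(drops) / n_shuffles)
    return importances


# Synthetic data: the label depends on feature 0 only; feature 1 is noise.
rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(500)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

importances = permutation_importance(rows, labels, n_features=2)
print([round(v, 3) for v in importances])
```

A feature the model ignores gets an importance of zero, while a feature it relies on shows a clear drop. The `importance` column in the table above can be read the same way, with the drop measured in the predictor's evaluation metric.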


In [49]:
proba = predictor.predict_proba(test)

# Critical sanity checks: row order must match across features, IDs, and predictions
assert (test.index == test_ids.index).all()
assert (test.index == proba.index).all()

In [50]:
positive_class = predictor.positive_class
print("Positive class:", positive_class)

y_pred = proba[positive_class]

Positive class: Presence


In [51]:
import pandas as pd

submission = pd.DataFrame({
    "id": test_ids,
    "Heart Disease": y_pred
})

submission.to_csv("submission.csv", index=False)
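
Before submitting, it can help to check the file against `sample_submission.csv`. The following sketch uses only the standard library; the helper `check_submission` is hypothetical, and it assumes the second column holds probabilities in [0, 1] and that IDs appear in the same order as in the sample file. The demo runs on tiny temporary files with made-up values.

```python
import csv
import tempfile
from pathlib import Path


def check_submission(submission_path: Path, sample_path: Path) -> list[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    with open(submission_path, newline="") as f:
        sub = list(csv.reader(f))
    with open(sample_path, newline="") as f:
        sample = list(csv.reader(f))

    problems = []
    if sub[0] != sample[0]:
        problems.append(f"header {sub[0]} != expected {sample[0]}")
    if len(sub) != len(sample):
        problems.append(f"{len(sub) - 1} rows != expected {len(sample) - 1}")
    if [r[0] for r in sub[1:]] != [r[0] for r in sample[1:]]:
        problems.append("id column differs from the sample file")
    if any(not 0.0 <= float(r[1]) <= 1.0 for r in sub[1:]):
        problems.append("prediction outside [0, 1]")
    return problems


# Demo with made-up IDs and probabilities.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "sample_submission.csv"
    good_file = Path(tmp) / "submission.csv"
    bad_file = Path(tmp) / "bad_submission.csv"
    sample_file.write_text("id,Heart Disease\n10,0\n11,0\n")
    good_file.write_text("id,Heart Disease\n10,0.91\n11,0.03\n")
    bad_file.write_text("id,target\n10,1.7\n")

    good_problems = check_submission(good_file, sample_file)
    bad_problems = check_submission(bad_file, sample_file)
    print(good_problems, len(bad_problems))
```

For this notebook the call would be `check_submission(Path("submission.csv"), Path("data/sample_submission.csv"))`.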


In [57]:
!kaggle competitions submit -c {COMP} -f submission.csv -m "AutoGluon baseline ROC 0.956"


Successfully submitted to Predicting Heart Disease



