# AutoGluon TabularPredictor

* AutoGluon for tabular data
* https://aws.amazon.com/jp/builders-flash/202201/autogluon-tabular-tutorials/?awsf.filter-name=*all
* Currently supported models  
LightGBM, CatBoost, XGBoost, random forest, extremely randomized trees, k-nearest neighbors, linear regression, neural network with MXNet backend, neural network with FastAI backend

<a href="https://colab.research.google.com/github/fuyu-quant/Data_Science/blob/main/Tabel_Data/AutoML/AutoGluon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install autogluon

In [56]:
from autogluon.tabular import TabularPredictor
import pprint

import pandas as pd
import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

In [35]:
iris_dataset = load_iris()
df = pd.DataFrame(data = iris_dataset.data, columns = iris_dataset.feature_names)
df['target'] = iris_dataset['target']
df.head()


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [36]:
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=3699)

# Assign each row the index of the fold in which it serves as test data
df['fold'] = 0
fold_col = df.columns.get_loc('fold')
for fold, (train_idx, test_idx) in enumerate(skf.split(df, df['target'])):
    df.iloc[test_idx, fold_col] = fold
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,fold
0,5.1,3.5,1.4,0.2,0,3
1,4.9,3.0,1.4,0.2,0,3
2,4.7,3.2,1.3,0.2,0,1
3,4.6,3.1,1.5,0.2,0,1
4,5.0,3.6,1.4,0.2,0,3
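
As a quick sanity check on the fold assignment above, the class counts per fold can be tabulated. This is a minimal sketch that rebuilds the same DataFrame with the same `random_state`; the name `fold_class_counts` is introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

# Rebuild the iris DataFrame used in the notebook
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Same splitter settings as the cell above
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=3699)
df["fold"] = -1
for fold, (_, test_idx) in enumerate(skf.split(df, df["target"])):
    df.loc[df.index[test_idx], "fold"] = fold

# Rows: fold index, columns: class label, values: number of samples
fold_class_counts = pd.crosstab(df["fold"], df["target"])
print(fold_class_counts)
```

With 50 samples per class and 4 folds, each cell should hold 12 or 13 samples, which is what stratification is meant to guarantee.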


## Training AutoGluon

In [50]:
train_df = df[df['fold'] != 0]
test_df = df[df['fold'] == 0]
train_df = train_df.drop('fold', axis = 1)
test_df = test_df.drop('fold', axis = 1)

# Directory where the model artifacts are saved
save_path = '/content/Autogluon/'

# Evaluation metric used to score the models
# Pass the metric name as a quoted string; leaving the quotes off raises an error
metric = "accuracy"
# Classification
# 'accuracy', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro',
# 'f1_weighted', 'roc_auc', 'roc_auc_ovo_macro', 'average_precision',
# 'precision', 'precision_macro', 'precision_micro', 'precision_weighted',
# 'recall', 'recall_macro', 'recall_micro', 'recall_weighted', 'log_loss', 'pac_score'

# Regression
# 'root_mean_squared_error', 'mean_squared_error', 'mean_absolute_error',
# 'median_absolute_error', 'mean_absolute_percentage_error', 'r2'
predictor = TabularPredictor(label='target',
                             path=save_path,
                             eval_metric=metric,
                             # Logging verbosity (higher values print more detail)
                             verbosity=3
                             ).fit(train_data=train_df,
                                   # Preset to use when accuracy is the priority; pass only one
                                   #presets="medium_quality",
                                   #presets="good_quality",
                                   presets='best_quality',
                                   # Time limit for training, in seconds
                                   time_limit=120
                                   )

Presets specified: ['best_quality']
User Specified kwargs:
{'auto_stack': True}
Full kwargs:
{'_feature_generator_kwargs': None,
 '_save_bag_folds': None,
 'ag_args': None,
 'ag_args_ensemble': None,
 'ag_args_fit': None,
 'auto_stack': True,
 'calibrate': 'auto',
 'excluded_model_types': None,
 'feature_generator': 'auto',
 'feature_prune_kwargs': None,
 'holdout_frac': None,
 'hyperparameter_tune_kwargs': None,
 'keep_only_best': False,
 'name_suffix': None,
 'num_bag_folds': None,
 'num_bag_sets': None,
 'num_stack_levels': None,
 'pseudo_data': None,
 'refit_full': False,
 'save_space': False,
 'set_best_to_refit_full': False,
 'unlabeled_data': None,
 'use_bag_holdout': False,
 'verbosity': 3}
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=20
Saving /content/Autogluon/learner.pkl
Saving /content/Autogluon/predictor.pkl
Beginning AutoGluon training ... Time limit = 120s
AutoGluon will save models to "/content/Autogluon/"
AutoGluon Version: 

## Loading an AutoGluon model

In [None]:
predictor = TabularPredictor.load(save_path)

## Displaying the training settings

In [42]:
print("Problem type inferred by AutoGluon:", predictor.problem_type)
print("Data type inferred by AutoGluon for each feature:")
pprint.pprint(predictor.feature_metadata.to_dict())

Problem type inferred by AutoGluon: multiclass
Data type inferred by AutoGluon for each feature:
{'petal length (cm)': ('float', ()),
 'petal width (cm)': ('float', ()),
 'sepal length (cm)': ('float', ()),
 'sepal width (cm)': ('float', ())}


## Displaying the training process

In [39]:
results = predictor.fit_summary()
results

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0              CatBoost   1.000000       0.001342  0.237177                0.001342           0.237177            1       True          8
1              LightGBM   1.000000       0.002587  0.216856                0.002587           0.216856            1       True          5
2        KNeighborsUnif   1.000000       0.003995  0.014968                0.003995           0.014968            1       True          1
3        KNeighborsDist   1.000000       0.004338  0.012241                0.004338           0.012241            1       True          2
4            LightGBMXT   1.000000       0.005383  0.200131                0.005383           0.200131            1       True          4
5         LightGBMLarge   1.000000       0.006411  0.285025                0.006411           0.285025        

{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestGini': 'RFModel',
  'RandomForestEntr': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesGini': 'XTModel',
  'ExtraTreesEntr': 'XTModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': 1.0,
  'KNeighborsDist': 1.0,
  'NeuralNetFastAI': 1.0,
  'LightGBMXT': 1.0,
  'LightGBM': 1.0,
  'RandomForestGini': 1.0,
  'RandomForestEntr': 1.0,
  'CatBoost': 1.0,
  'ExtraTreesGini': 1.0,
  'ExtraTreesEntr': 1.0,
  'XGBoost': 0.9565217391304348,
  'NeuralNetTorch': 1.0,
  'LightGBMLarge': 1.0,
  'WeightedEnsemble_L2': 1.0},
 'model_best': 'WeightedEnsemble_L2',
 'model_paths': {'KNeighborsUnif': '/content/Autogluon/models/KNeighborsUnif/',
  'KNeighbo

## Inspecting the ensemble model

In [67]:
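# Inspect the contents of the ensemble model
# The 'model_weights' entry on the last line of the output shows the ensemble's composition
# Note: `_trainer` is a private attribute, so this access may break between AutoGluon versions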
ensemble = predictor._trainer.load_model("WeightedEnsemble_L2")
display(ensemble.get_info())

Loading: /content/Autogluon/models/WeightedEnsemble_L2/model.pkl


{'name': 'WeightedEnsemble_L2',
 'model_type': 'WeightedEnsembleModel',
 'problem_type': 'multiclass',
 'eval_metric': 'accuracy',
 'stopping_metric': 'accuracy',
 'fit_time': 0.564812183380127,
 'num_classes': 3,
 'quantile_levels': None,
 'predict_time': 0.0008554458618164062,
 'val_score': 0.9821428571428571,
 'hyperparameters': {'use_orig_features': False,
  'max_base_models': 25,
  'max_base_models_per_type': 5,
  'save_bag_folds': True},
 'hyperparameters_fit': {},
 'hyperparameters_nondefault': ['save_bag_folds'],
 'ag_args_fit': {'max_memory_usage_ratio': 1.0,
  'max_time_limit_ratio': 1.0,
  'max_time_limit': None,
  'min_time_limit': 0,
  'valid_raw_types': None,
  'valid_special_types': None,
  'ignored_type_group_special': None,
  'ignored_type_group_raw': None,
  'get_features_kwargs': None,
  'get_features_kwargs_extra': None,
  'predict_1_batch_size': None,
  'temperature_scalar': None,
  'drop_unique': False},
 'num_features': 3,
 'features': ['NeuralNetFastAI_BAG_L1_1'
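
A weighted ensemble of this kind combines the class probabilities of its base models as a weighted average. The sketch below illustrates only that idea with NumPy; the model names, probabilities, and weights are hypothetical and are not taken from the (truncated) output above.

```python
import numpy as np

# Hypothetical class probabilities from two base models (3 rows, 3 classes)
proba_model_a = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.6, 0.3],
                          [0.2, 0.3, 0.5]])
proba_model_b = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.3, 0.5],
                          [0.1, 0.2, 0.7]])

# Hypothetical ensemble weights that sum to 1
model_weights = {"model_a": 0.75, "model_b": 0.25}

# Weighted average of the base-model probabilities
ensemble_proba = (model_weights["model_a"] * proba_model_a
                  + model_weights["model_b"] * proba_model_b)

# Predicted class = column with the highest averaged probability
ensemble_pred = ensemble_proba.argmax(axis=1)
print(ensemble_proba)
print(ensemble_pred)
```

Because the weights sum to 1, each row of `ensemble_proba` still sums to 1. A model with weight 0 contributes nothing to the prediction.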

## Inference with AutoGluon

In [54]:
model_perf = predictor.leaderboard(test_df, silent=True)
display(model_perf)

Loading: /content/Autogluon/models/KNeighborsUnif_BAG_L1/model.pkl
Loading: /content/Autogluon/models/KNeighborsDist_BAG_L1/model.pkl
Loading: /content/Autogluon/models/NeuralNetFastAI_BAG_L1/model.pkl
Loading: /content/Autogluon/models/LightGBMXT_BAG_L1/model.pkl
Loading: /content/Autogluon/models/LightGBM_BAG_L1/model.pkl
Loading: /content/Autogluon/models/RandomForestGini_BAG_L1/model.pkl
Loading: /content/Autogluon/models/RandomForestEntr_BAG_L1/model.pkl
Loading: /content/Autogluon/models/CatBoost_BAG_L1/model.pkl
Loading: /content/Autogluon/models/ExtraTreesGini_BAG_L1/model.pkl
Loading: /content/Autogluon/models/ExtraTreesEntr_BAG_L1/model.pkl
Loading: /content/Autogluon/models/XGBoost_BAG_L1/model.pkl
Loading: /content/Autogluon/models/NeuralNetTorch_BAG_L1/model.pkl
Loading: /content/Autogluon/models/LightGBMLarge_BAG_L1/model.pkl
Loading: /content/Autogluon/models/WeightedEnsemble_L2/model.pkl


Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,KNeighborsDist_BAG_L1,0.973684,0.946429,0.009198,0.054613,0.00556,0.009198,0.054613,0.00556,1,True,2
1,CatBoost_BAG_L1,0.973684,0.955357,0.012703,0.004531,12.83194,0.012703,0.004531,12.83194,1,True,8
2,KNeighborsUnif_BAG_L1,0.973684,0.946429,0.016341,0.057223,0.00535,0.016341,0.057223,0.00535,1,True,1
3,LightGBM_BAG_L1,0.973684,0.9375,0.040314,0.018457,11.50572,0.040314,0.018457,11.50572,1,True,5
4,NeuralNetTorch_BAG_L1,0.973684,0.973214,0.050511,0.041404,13.509427,0.050511,0.041404,13.509427,1,True,12
5,RandomForestEntr_BAG_L1,0.973684,0.946429,0.070827,0.112441,0.718504,0.070827,0.112441,0.718504,1,True,7
6,ExtraTreesGini_BAG_L1,0.973684,0.946429,0.070955,0.171776,0.681035,0.070955,0.171776,0.681035,1,True,9
7,ExtraTreesEntr_BAG_L1,0.973684,0.946429,0.088629,0.207772,0.866243,0.088629,0.207772,0.866243,1,True,10
8,RandomForestGini_BAG_L1,0.973684,0.955357,0.091566,0.123966,0.618788,0.091566,0.123966,0.618788,1,True,6
9,NeuralNetFastAI_BAG_L1,0.973684,0.982143,0.098201,0.076033,16.314769,0.098201,0.076033,16.314769,1,True,3


In [64]:
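# Passing a model name as the second argument returns that model's predicted probabilities
# If omitted, the best model is used
y_pred = predictor.predict_proba(test_df, model="KNeighborsDist_BAG_L1")

# argmax returns column positions; they match the class labels here because the labels are 0, 1, 2
y_pred = np.argmax(y_pred.to_numpy(), axis = 1)
true = test_df['target']
# Display the confusion matrix
confusion_matrix(true, y_pred)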

Loading: /content/Autogluon/models/KNeighborsDist_BAG_L1/model.pkl


array([[13,  0,  0],
       [ 0, 11,  1],
       [ 0,  0, 13]])
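
The confusion matrix above can be reduced to summary metrics directly. The sketch below copies the matrix from the output and assumes the scikit-learn convention (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[13, 0, 0],
               [0, 11, 1],
               [0, 0, 13]])

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal over row sums; per-class precision: diagonal over column sums
recall = np.diag(cm) / cm.sum(axis=1)
precision = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy = {accuracy:.6f}")
print("recall    =", recall.round(4))
print("precision =", precision.round(4))
```

The accuracy is 37/38, which rounds to 0.973684 and agrees with the `score_test` reported for `KNeighborsDist_BAG_L1` in the leaderboard.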

## Displaying feature importance
* Importance is computed as the drop in performance when a feature's values are randomly shuffled.
* A negative score suggests that the feature may be hurting the model.
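
The shuffling idea described above can be sketched by hand. This is a minimal illustration using a scikit-learn random forest rather than AutoGluon, so the model, the train/test split, and the seed values are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in model and data split (not the AutoGluon predictor from this notebook)
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Accuracy on the untouched test data
baseline = model.score(X_test, y_test)

rng = np.random.default_rng(0)
n_shuffles = 5
importance = {}
for col in X_test.columns:
    drops = []
    for _ in range(n_shuffles):
        # Shuffle one column to break its relationship with the target
        X_shuffled = X_test.copy()
        X_shuffled[col] = rng.permutation(X_shuffled[col].to_numpy())
        drops.append(baseline - model.score(X_shuffled, y_test))
    # Importance = mean drop in accuracy over the shuffle sets
    importance[col] = float(np.mean(drops))

for col, value in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {value:.4f}")
```

A value near zero means the model barely relies on that feature, and a negative value means the score improved after shuffling.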

In [51]:
feature_importances = predictor.feature_importance(test_df)
print("Feature importance:")
display(feature_importances)

Loading: /content/Autogluon/models/WeightedEnsemble_L2/model.pkl
Computing feature importance via permutation shuffling for 4 features using 38 rows with 5 shuffle sets...
Loading: /content/Autogluon/models/NeuralNetFastAI_BAG_L1/model.pkl
Loading: /content/Autogluon/models/WeightedEnsemble_L2/model.pkl
	3.07s	= Expected runtime (0.61s per shuffle set)
Loading: /content/Autogluon/models/NeuralNetFastAI_BAG_L1/model.pkl
Loading: /content/Autogluon/models/WeightedEnsemble_L2/model.pkl
Loading: /content/Autogluon/models/NeuralNetFastAI_BAG_L1/model.pkl
Loading: /content/Autogluon/models/WeightedEnsemble_L2/model.pkl
Loading: /content/Autogluon/models/NeuralNetFastAI_BAG_L1/model.pkl
Loading: /content/Autogluon/models/WeightedEnsemble_L2/model.pkl
Loading: /content/Autogluon/models/NeuralNetFastAI_BAG_L1/model.pkl
Loading: /content/Autogluon/models/WeightedEnsemble_L2/model.pkl
Loading: /content/Autogluon/models/NeuralNetFastAI_BAG_L1/model.pkl
Loading: /content/Autogluon/models/WeightedEn

Feature importance:


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
petal width (cm),0.263158,0.076723,0.000777,5,0.421132,0.105184
petal length (cm),0.247368,0.106245,0.003244,5,0.466129,0.028608
sepal width (cm),0.026316,0.018608,0.017055,5,0.06463,-0.011998
sepal length (cm),0.021053,0.022017,0.04965,5,0.066387,-0.024281
