<a href="https://colab.research.google.com/github/applejxd/colaboratory/blob/master/ml/PyCaretRegressor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression Demo with PyCaret

Based on [Regression Tutorial - Level Beginner](https://github.com/pycaret/pycaret/blob/master/tutorials/Regression%20Tutorial%20Level%20Beginner%20-%20REG101.ipynb) and
[Regression - Level Intermediate](https://github.com/pycaret/pycaret/blob/master/tutorials/Regression%20Tutorial%20Level%20Intermediate%20-%20REG102.ipynb).

## Installation

[Install PyCaret via pip](https://pycaret.gitbook.io/docs/get-started/installation).

[Pinning pandas-profiling==3.1.0 avoids a jinja2-related error.](https://teratail.com/questions/5b01vplewor7kl)

In [1]:
!pip install pycaret pandas-profiling==3.1.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Preparation

Load the dataset for the Diamonds task.
It is returned as a DataFrame.

In [2]:
from pycaret.datasets import get_data
import pandas as pd

dataset: pd.DataFrame = get_data('diamond')
print(f"Table size = {dataset.shape}")

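| | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.1 | Ideal | H | SI1 | VG | EX | GIA | 5169 |
| 1 | 0.83 | Ideal | H | VS1 | ID | ID | AGSL | 3470 |
| 2 | 0.85 | Ideal | H | SI1 | EX | EX | GIA | 3183 |
| 3 | 0.91 | Ideal | E | SI1 | VG | VG | GIA | 4370 |
| 4 | 0.83 | Ideal | G | SI1 | EX | EX | GIA | 3171 |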


Table size = (6000, 8)


For the regression task, split the data into a modeling set (used for training) and a held-out set of unseen data.

In [3]:
# Randomly sample 90% of the rows; fix the seed (random_state) for reproducibility
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)

# Reset the row indices
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print(f'Data for Modeling: {data.shape}')
print(f'Unseen Data For Predictions: {data_unseen.shape}')

Data for Modeling: (5400, 8)
Unseen Data For Predictions: (600, 8)
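
The split above can be sanity-checked on a small table. The following is a minimal sketch, assuming a hypothetical 10-row DataFrame named `toy` (not part of the original notebook). It shows that `sample` followed by `drop` yields two disjoint parts that together cover every row, and that `drop` relies on the original index labels, so it has to run before `reset_index`.

```python
import pandas as pd

# Hypothetical toy table standing in for the diamond dataset
toy = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})

# Same pattern as the notebook cell: sample 90% of the rows,
# then drop the sampled rows by their index labels
modeling = toy.sample(frac=0.9, random_state=786)
unseen = toy.drop(modeling.index)

# The two parts share no index labels ...
overlap = modeling.index.intersection(unseen.index)
# ... and together they cover every original row
covered = sorted(modeling.index.union(unseen.index))

print(len(modeling), len(unseen), len(overlap))
```

Resetting the indices only after both parts exist keeps the `drop` step correct.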


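## Preprocessing

In PyCaret, a single call to regression.setup() takes care of preprocessing.
See the [API reference](https://pycaret.readthedocs.io/en/latest/api/regression.html) for details.

After running the cell, check the inferred data types it displays and press Enter if they look correct.
The preprocessing details are shown once you press Enter.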

In [4]:
from pycaret import regression

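exp_reg102 = regression.setup(
    # Configure the learning task and fix the random seed
    data = data, target = 'Price', session_id=123,
    # Standardize numeric features and apply a Yeo-Johnson (nonlinear) transform
    # to make distributions more Gaussian-like; also transform the target
    normalize = True, transformation = True, transform_target = True,
    # Merge infrequent levels of categorical features
    combine_rare_levels = True, rare_level_threshold = 0.05,
    # Drop numeric features that are redundant because of high correlation
    remove_multicollinearity = True, multicollinearity_threshold = 0.95,
    # Bin a numeric feature into a categorical one
    bin_numeric_features = ['Carat Weight'],
    # Log the experiment under the given name
    log_experiment = True, experiment_name = 'diamond1')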

| | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target | Price |
| 2 | Original Data | (5400, 8) |
| 3 | Missing Values | False |
| 4 | Numeric Features | 1 |
| 5 | Categorical Features | 6 |
| 6 | Ordinal Features | False |
| 7 | High Cardinality Features | False |
| 8 | High Cardinality Method | |
| 9 | Transformed Train Set | (3779, 39) |
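
To give a feel for what the power transform and standardization steps do to a skewed feature, here is a minimal pure-Python sketch. It is not PyCaret's implementation: the sample values are hypothetical, the standardization is assumed to be a z-score, and λ is fixed by hand, whereas libraries normally estimate it from the data.

```python
import math
from statistics import mean, pstdev


def yeo_johnson(x: float, lam: float) -> float:
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if x >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-x)
    return -(((-x + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


def standardize(values):
    """Z-score standardization: zero mean, unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical right-skewed sample (not taken from the diamond data)
skewed = [0.2, 0.3, 0.5, 0.8, 1.0, 1.5, 2.5, 6.0]

# Illustrative fixed lambda; lambda = 0 reduces to log(1 + x) for x >= 0
lam = 0.0
transformed = standardize([yeo_johnson(v, lam) for v in skewed])

print([round(v, 3) for v in transformed])
```

With λ = 1 the transform is the identity, so the estimated λ effectively controls how strongly the long tail is compressed.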


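## Training

Compare model performance using cross-validation.

RANSAC is excluded from the comparison. The return value is a list of the top three models.

[The tutorial](https://github.com/pycaret/pycaret/blob/master/tutorials/Regression%20Tutorial%20Level%20Intermediate%20-%20REG102.ipynb)
notes that the preprocessing improved the scores.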

In [5]:
from typing import List
from pycaret.internal.meta_estimators import PowerTransformedTargetRegressor

top3: List[PowerTransformedTargetRegressor] = \
    regression.compare_models(exclude = ['ransac'], n_select = 3)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 766.0853 | 3116467.0 | 1704.0975 | 0.9704 | 0.0799 | 0.0576 | 0.109 |
| rf | Random Forest Regressor | 850.1194 | 3267554.0 | 1770.6698 | 0.9686 | 0.0904 | 0.0657 | 1.036 |
| huber | Huber Regressor | 940.6199 | 3651906.0 | 1891.7125 | 0.964 | 0.0972 | 0.0708 | 0.152 |
| ridge | Ridge Regression | 952.2538 | 3846278.0 | 1934.6314 | 0.9624 | 0.0971 | 0.0715 | 0.041 |
| br | Bayesian Ridge | 956.6502 | 3999160.0 | 1967.8153 | 0.9608 | 0.0972 | 0.0716 | 0.063 |
| lr | Linear Regression | 960.2937 | 4046533.0 | 1978.6945 | 0.9604 | 0.0973 | 0.0717 | 0.589 |
| et | Extra Trees Regressor | 964.4979 | 4410739.0 | 2062.2772 | 0.9569 | 0.1055 | 0.0759 | 1.172 |
| dt | Decision Tree Regressor | 1000.25 | 4685153.0 | 2136.9863 | 0.9539 | 0.1082 | 0.0778 | 0.036 |
| gbr | Gradient Boosting Regressor | 1107.4885 | 5269003.0 | 2255.3276 | 0.9486 | 0.11 | 0.0832 | 0.277 |
| par | Passive Aggressive Regressor | 1341.4005 | 7149373.0 | 2588.2842 | 0.9288 | 0.1282 | 0.0964 | 0.03 |
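
The metric columns in the comparison table can be read with the help of a small sketch. The function below computes the usual textbook definitions of MAE, MSE, RMSE, R2, RMSLE, and MAPE for one set of predictions. The function name `regression_metrics` and the sample values are hypothetical, and PyCaret's own implementation may differ in edge cases (for example, in how it handles non-positive values for RMSLE).

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics for equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R2 compares the squared error against a predict-the-mean baseline
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot

    # RMSLE works on log(1 + value), so it measures relative error
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2
            for t, p in zip(y_true, y_pred)) / n
    )

    # MAPE as a fraction (0.05 means 5%), assuming no true value is zero
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "R2": r2, "RMSLE": rmsle, "MAPE": mape}


# Hypothetical prices and predictions
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MAE, MSE, and RMSE are in the units of the target (or its square), while R2, RMSLE, and MAPE are scale-free, which makes them easier to compare across datasets.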


In [8]:
print(top3[0])

PowerTransformedTargetRegressor(boosting_type='gbdt', class_weight=None,
                                colsample_bytree=1.0, importance_type='split',
                                learning_rate=0.1, max_depth=-1,
                                min_child_samples=20, min_child_weight=0.001,
                                min_split_gain=0.0, n_estimators=100, n_jobs=-1,
                                num_leaves=31, objective=None,
                                power_transformer_method='box-cox',
                                power_transformer_standardize=True,
                                random_state=1...
                                                        importance_type='split',
                                                        learning_rate=0.1,
                                                        max_depth=-1,
                                                        min_child_samples=20,
                                                        min_child_weigh