<a href="https://colab.research.google.com/github/applejxd/colaboratory/blob/master/ml/PyCaretRegressor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression Analysis Demo with PyCaret

Based on [「Regression Tutorial (REG101) - Level Beginner」](https://github.com/pycaret/pycaret/blob/master/tutorials/Regression%20Tutorial%20Level%20Beginner%20-%20REG101.ipynb).

## Installation

[Install PyCaret with pip](https://pycaret.gitbook.io/docs/get-started/installation).

[Pinning pandas-profiling==3.1.0 works around a jinja2-related error.](https://teratail.com/questions/5b01vplewor7kl)

In [1]:
!pip install pycaret pandas-profiling==3.1.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycaret
  Downloading pycaret-2.3.10-py3-none-any.whl (320 kB)
Collecting pandas-profiling==3.1.0
  Downloading pandas_profiling-3.1.0-py2.py3-none-any.whl (261 kB)
Collecting joblib~=1.0.1
  Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
Collecting pydantic>=1.8.1
  Downloading pydantic-1.9.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.1 MB)
Collecting PyYAML>=5.0.0
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting visions[type_image_path]==

## Preparation

Load the dataset for the Diamonds task.
It is returned as a pandas DataFrame.

In [2]:
from pycaret.datasets import get_data
import pandas as pd

dataset: pd.DataFrame = get_data('diamond')
print(f"Table size = {dataset.shape}")

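|   | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.1 | Ideal | H | SI1 | VG | EX | GIA | 5169 |
| 1 | 0.83 | Ideal | H | VS1 | ID | ID | AGSL | 3470 |
| 2 | 0.85 | Ideal | H | SI1 | EX | EX | GIA | 3183 |
| 3 | 0.91 | Ideal | E | SI1 | VG | VG | GIA | 4370 |
| 4 | 0.83 | Ideal | G | SI1 | EX | EX | GIA | 3171 |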


Table size = (6000, 8)


Split the dataset into training data and held-out validation data for the regression task.

In [3]:
# Randomly sample 90% of the rows; fix the random_state seed for reproducibility
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)

# Reset the row indices
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print(f'Data for Modeling: {data.shape}')
print(f'Unseen Data For Predictions: {data_unseen.shape}')

Data for Modeling: (5400, 8)
Unseen Data For Predictions: (600, 8)
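
As a quick sanity check on the split above, the following minimal sketch repeats the same `sample`/`drop` pattern on a small synthetic DataFrame and verifies that the two parts are disjoint and together cover every row. The `toy` frame and its column names are made up for illustration and are not part of the diamond dataset.

```python
import pandas as pd

# Hypothetical stand-in for the diamond dataset: 100 rows, 2 columns.
toy = pd.DataFrame({"x": range(100), "y": [v * 2 for v in range(100)]})

# Same pattern as above: sample 90% with a fixed seed, keep the rest as unseen data.
toy_data = toy.sample(frac=0.9, random_state=786)
toy_unseen = toy.drop(toy_data.index)

# The two index sets must not overlap and must cover the original frame.
overlap = toy_data.index.intersection(toy_unseen.index)
covered = toy_data.index.union(toy_unseen.index)

print(len(toy_data), len(toy_unseen))
print(len(overlap) == 0, len(covered) == len(toy))
```

A check like this has to run before `reset_index`, because resetting discards the original row labels that link each part back to the source frame.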


## Preprocessing

In PyCaret, a single call to regression.setup() handles all of the preprocessing.

See the [API reference](https://pycaret.readthedocs.io/en/latest/api/regression.html) for details.

In [None]:
from pycaret import regression

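exp_reg102 = regression.setup(
    # Configure the learning task and fix the seed
    data = data, target = 'Price', session_id=123,
    # Standardize numeric features and apply the Yeo-Johnson transform
    # (a nonlinear transform) to make distributions more Gaussian-like
    normalize = True, transformation = True, transform_target = True,
    # Combine infrequent levels of categorical features
    combine_rare_levels = True, rare_level_threshold = 0.05,
    # Remove redundant numeric features based on their correlation
    remove_multicollinearity = True, multicollinearity_threshold = 0.95,
    # Convert numeric features into categorical ones by binning
    bin_numeric_features = ['Carat Weight'],
    # Log the run under the given experiment name
    log_experiment = True, experiment_name = 'diamond1')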

Following data types have been inferred automatically, if they are correct press enter to continue…

|   | Data Type |
|---|---|
| Carat Weight | Numeric |
| Cut | Categorical |
| Color | Categorical |
| Clarity | Categorical |
| Polish | Categorical |
| Symmetry | Categorical |
| Report | Categorical |
| Price | Label |
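
To make two of the `setup()` options above more concrete, here is a rough pandas-only sketch of the ideas behind `combine_rare_levels` / `rare_level_threshold` and `remove_multicollinearity` / `multicollinearity_threshold`. It is not PyCaret's implementation: the helper names `combine_rare` and `drop_collinear`, the `"rare"` label, and the toy data are all assumptions made for illustration, and PyCaret's own rule for choosing which correlated feature to drop may differ.

```python
import numpy as np
import pandas as pd


def combine_rare(col: pd.Series, threshold: float = 0.05, label: str = "rare") -> pd.Series:
    """Replace levels whose relative frequency is below `threshold` with `label`."""
    freq = col.value_counts(normalize=True)
    rare_levels = freq[freq < threshold].index
    return col.where(~col.isin(rare_levels), label)


def drop_collinear(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of each pair whose absolute correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy categorical column: 'C' appears in 2% of the rows, so it gets merged.
cut = pd.Series(["A"] * 60 + ["B"] * 38 + ["C"] * 2)
merged = combine_rare(cut, threshold=0.05)
print(merged.value_counts().to_dict())

# Toy numeric frame: 'b' is an exact linear function of 'a', 'c' is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
num = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": rng.normal(size=200)})
print(list(drop_collinear(num, threshold=0.95).columns))
```

With the thresholds used in the `setup()` call (0.05 and 0.95), the rare level `C` is folded into a single bucket and the redundant column `b` is dropped, while `a` and `c` survive.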
