**Introduction to AutoML OSS (6)**

# PyCaret, a low-code machine learning library

Please read this notebook together with the article that introduces it.
- [@IT series "Introduction to AutoML OSS" (6): Part 6, "PyCaret, a low-code machine learning library"](https://atmarkit.itmedia.co.jp/ait/articles/2111/16/news004.html)

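For how to work with the notebooks and for details on the Titanic survival prediction data, see Part 1 of the series.
- [@IT series "Introduction to AutoML OSS" (1): Part 1, "What is AutoML, which removes the tedium of building machine learning models? History, trends, and benefits"](https://www.atmarkit.co.jp/ait/articles/2107/02/news006.html)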

## AutoML on the Titanic data (improved version)
We predict Titanic survival with PyCaret.<BR>
Note: this notebook is an improved version of `PyCaret_Titanic.ipynb`.

### Setup

PyCaret can be installed with a single `pip` command.

In [None]:
!pip install pycaret



When using Google Colab, run the following code to enable interactive output.

In [None]:
from pycaret.utils import enable_colab
enable_colab()

Colab mode enabled.


Since this is a classification problem, importing `pycaret.classification` is generally sufficient.

In [None]:
from pycaret.classification import *

### Loading the data

PyCaret's [`get_data()`](https://pycaret.org/get-data/) can fetch a variety of sample datasets, including the Titanic survival prediction data.
```python
from pycaret.datasets import get_data
train_df = get_data('titanic') 
```
However, that dataset contains only the training data (`train.csv`), so we download the archive hosted on GitHub and unzip it.

In [None]:
!wget -N https://github.com/aiq2020-tw/automl-notebooks/raw/main/titanic.zip
!unzip titanic.zip

--2021-09-22 16:04:02--  https://github.com/aiq2020-tw/automl-notebooks/raw/main/titanic.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/aiq2020-tw/automl-notebooks/main/titanic.zip [following]
--2021-09-22 16:04:02--  https://raw.githubusercontent.com/aiq2020-tw/automl-notebooks/main/titanic.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34877 (34K) [application/zip]
Saving to: ‘titanic.zip’


Last-modified header missing -- time-stamps turned off.
2021-09-22 16:04:02 (10.9 MB/s) - ‘titanic.zip’ saved [34877/34877]

Archive:  titanic.zip
replace gender_submission.csv? [y]es, [n]o, [A]l

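Load the training data (`train.csv`) and the test data (`test.csv`), then display the first five rows of the former.<BR>Note: because `from pycaret.classification import *` has already been run, `pd` is available and omitting `import pandas as pd` does not raise an error.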

In [None]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


### EDA and data preprocessing

`setup()` runs the data preprocessing. Setting `profile=True` also displays the EDA (exploratory data analysis) report produced by Pandas Profiling.

In [None]:
setup(data=train_df,
      target='Survived',                # Target variable
      ignore_features=['PassengerId'],  # Exclude PassengerId, which is useless for training
      normalize=True,                   # Normalize numeric features
      profile=True,                     # Also show the Pandas Profiling EDA report
      silent=True,                      # Skip the data type confirmation prompt
      session_id=42)                    # Ensure reproducibility (same purpose as random_state)



(False,
 'clf-default-name',
               Age      Fare  Pclass_1  ...  Embarked_C  Embarked_Q  Embarked_S
 445 -1.940356e+00  0.980998       1.0  ...         0.0         0.0         1.0
 650  4.140770e-08 -0.469634       0.0  ...         0.0         0.0         1.0
 172 -2.170835e+00 -0.406136       0.0  ...         0.0         0.0         1.0
 450  5.180904e-01 -0.080232       0.0  ...         0.0         0.0         1.0
 314  1.055876e+00 -0.109651       0.0  ...         0.0         0.0         1.0
 ..            ...       ...       ...  ...         ...         ...         ...
 106 -6.343062e-01 -0.474455       0.0  ...         0.0         0.0         1.0
 270  4.140770e-08 -0.016489       1.0  ...         0.0         0.0         1.0
 860  9.022226e-01 -0.347787       0.0  ...         0.0         0.0         1.0
 435 -1.172091e+00  1.729074       1.0  ...         0.0         0.0         1.0
 102 -6.343062e-01  0.891351       1.0  ...         0.0         0.0         1.0
 
 [623 row

When preprocessing finishes, the data is split 7:3 into 623 training rows and 268 evaluation (hold-out) rows. The training data can be retrieved with `get_config('X_train')`.
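
The 623/268 row counts can be checked by hand. The sketch below assumes the hold-out size is obtained by rounding the test share up, as scikit-learn's `train_test_split` does; that PyCaret follows exactly this rule is an assumption, although the result agrees with the output above. `split_sizes` is a hypothetical helper, not a PyCaret function.

```python
import math

def split_sizes(n_rows, train_fraction=0.7):
    """Return (n_train, n_test) for a hold-out split (hypothetical helper)."""
    # Assumption: the test share is rounded up and the rest is used for training.
    n_test = math.ceil(n_rows * (1.0 - train_fraction))
    return n_rows - n_test, n_test

n_train, n_test = split_sizes(891)  # train.csv has 891 rows
print(n_train, n_test)
```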

In [None]:
get_config('X_train')

Unnamed: 0,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Name_Abbott Mr. Rossmore Edward,Name_Abbott Mrs. Stanton (Rosa Hunt),Name_Abelson Mr. Samuel,Name_Abelson Mrs. Samuel (Hannah Wizosky),Name_Ahlin Mrs. Johan (Johanna Persdotter Larsson),Name_Albimona Mr. Nassef Cassem,Name_Allen Miss. Elisabeth Walton,Name_Allen Mr. William Henry,Name_Allison Miss. Helen Loraine,Name_Allison Mrs. Hudson J C (Bessie Waldo Daniels),Name_Allum Mr. Owen George,Name_Anderson Mr. Harry,Name_Andersson Master. Sigvard Harald Elias,Name_Andersson Miss. Ellis Anna Maria,Name_Andersson Miss. Erna Alexandra,Name_Andersson Mr. Anders Johan,Name_Andersson Mrs. Anders Johan (Alfrida Konstantia Brogren),Name_Andreasson Mr. Paul Edvin,Name_Appleton Mrs. Edward Dale (Charlotte Lamson),Name_Arnold-Franchi Mr. Josef,Name_Asplund Master. Clarence Gustaf Hugo,Name_Asplund Master. Edvin Rojj Felix,Name_Asplund Miss. Lillian Gertrud,Name_Astor Mrs. John Jacob (Madeleine Talmadge Force),Name_Aubart Mme. Leontine Pauline,Name_Ayoub Miss. Banoura,Name_Backstrom Mr. Karl Alfred,Name_Backstrom Mrs. Karl Alfred (Maria Mathilda Gustafsson),Name_Baclini Miss. Eugenie,Name_Baclini Miss. Helene Barbara,Name_Baclini Mrs. Solomon (Latifa Qurban),Name_Bailey Mr. Percy Andrew,Name_Banfield Mr. Frederick James,Name_Barbara Miss. Saiide,Name_Bateman Rev. Robert James,...,Cabin_B77,Cabin_B80,Cabin_B94,Cabin_C125,Cabin_C2,Cabin_C22 C26,Cabin_C32,Cabin_C47,Cabin_C50,Cabin_C54,Cabin_C65,Cabin_C68,Cabin_C7,Cabin_C70,Cabin_C78,Cabin_C85,Cabin_C92,Cabin_C93,Cabin_C99,Cabin_D17,Cabin_D30,Cabin_D33,Cabin_D36,Cabin_D37,Cabin_D49,Cabin_E101,Cabin_E121,Cabin_E24,Cabin_E31,Cabin_E33,Cabin_E67,Cabin_E8,Cabin_F G63,Cabin_F G73,Cabin_F33,Cabin_G6,Cabin_not_available,Embarked_C,Embarked_Q,Embarked_S
445,-1.940356e+00,0.980998,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
650,4.140770e-08,-0.469634,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
172,-2.170835e+00,-0.406136,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
450,5.180904e-01,-0.080232,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
314,1.055876e+00,-0.109651,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106,-6.343062e-01,-0.474455,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
270,4.140770e-08,-0.016489,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
860,9.022226e-01,-0.347787,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
435,-1.172091e+00,1.729074,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


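You can see that the "Age" and "Fare" columns have been normalized, and that the "Pclass" column, which holds the values `1`, `2`, and `3`, has been one-hot encoded into `Pclass_1`, `Pclass_2`, and `Pclass_3`, each taking `0` or `1`. As a result of this preprocessing, the column count has grown far beyond the 12 columns of `train.csv`.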
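
To make the two transformations concrete, here is a minimal pure-Python sketch. It assumes z-score scaling for normalization, and it is an illustration of the idea rather than PyCaret's implementation. `zscore` and `one_hot` are hypothetical helpers; the input values are taken from the first five rows of `train.csv` shown earlier.

```python
from statistics import mean, pstdev

def zscore(values):
    """Scale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def one_hot(values, prefix):
    """Return one 0/1 column per distinct value, named '<prefix>_<value>'."""
    levels = sorted(set(values))
    return {f"{prefix}_{lv}": [1.0 if v == lv else 0.0 for v in values]
            for lv in levels}

fares = [7.25, 71.2833, 7.925, 53.1, 8.05]  # Fare, first five rows
pclass = [3, 1, 3, 1, 3]                    # Pclass, first five rows

scaled = zscore(fares)
encoded = one_hot(pclass, "Pclass")
print([round(s, 3) for s in scaled])
print(encoded)
```

Because one-hot encoding creates one column per distinct value, a column with many distinct values produces many new columns.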

However, even the "Name" column, whose values are all distinct, has been one-hot encoded, so we revise `ignore_features` (adding `Name` and `Ticket`) and run `setup()` again.

In [None]:
setup(data=train_df,
      target='Survived',
      ignore_features=['PassengerId', 'Name', 'Ticket'],
      normalize=True,
      silent=True,
      session_id=42)

| | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target | Survived |
| 2 | Target Type | Binary |
| 3 | Label Encoded | 0: 0, 1: 1 |
| 4 | Original Data | (891, 12) |
| 5 | Missing Values | True |
| 6 | Numeric Features | 2 |
| 7 | Categorical Features | 6 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |


(False,
 'clf-default-name',
           Age      Fare  Pclass_1  ...  Embarked_C  Embarked_Q  Embarked_S
 416  0.323710 -0.015230       0.0  ...         0.0         0.0         1.0
 801  0.087673 -0.130344       0.0  ...         0.0         0.0         1.0
 512  0.481069 -0.129653       1.0  ...         0.0         0.0         1.0
 455 -0.069685 -0.468397       0.0  ...         1.0         0.0         0.0
 757 -0.935156 -0.402014       0.0  ...         0.0         0.0         1.0
 ..        ...       ...       ...  ...         ...         ...         ...
 98   0.323710 -0.190203       0.0  ...         0.0         0.0         1.0
 322  0.008994 -0.386358       0.0  ...         0.0         1.0         0.0
 382  0.166352 -0.467859       0.0  ...         0.0         0.0         1.0
 365  0.008994 -0.480292       0.0  ...         0.0         0.0         1.0
 510 -0.069685 -0.471083       0.0  ...         0.0         1.0         0.0
 
 [623 rows x 131 columns],
 {'Bagging': <pycaret.containe

Check the training data again with `get_config('X_train')`. This time it has 131 columns.

In [None]:
get_config('X_train')

Unnamed: 0,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_male,SibSp_0,SibSp_1,SibSp_2,SibSp_3,SibSp_4,SibSp_5,SibSp_8,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Cabin_A10,Cabin_A14,Cabin_A16,Cabin_A19,Cabin_A23,Cabin_A24,Cabin_A26,Cabin_A34,Cabin_A36,Cabin_A5,Cabin_A6,Cabin_A7,Cabin_B101,Cabin_B102,Cabin_B20,Cabin_B22,Cabin_B28,Cabin_B3,Cabin_B35,Cabin_B37,Cabin_B39,...,Cabin_D37,Cabin_D46,Cabin_D47,Cabin_D49,Cabin_D50,Cabin_D56,Cabin_D6,Cabin_D7,Cabin_D9,Cabin_E10,Cabin_E101,Cabin_E12,Cabin_E121,Cabin_E17,Cabin_E24,Cabin_E25,Cabin_E31,Cabin_E34,Cabin_E36,Cabin_E38,Cabin_E40,Cabin_E44,Cabin_E46,Cabin_E49,Cabin_E63,Cabin_E67,Cabin_E68,Cabin_E77,Cabin_E8,Cabin_F E69,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Cabin_not_available,Embarked_C,Embarked_Q,Embarked_S
416,0.323710,-0.015230,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
801,0.087673,-0.130344,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
512,0.481069,-0.129653,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
455,-0.069685,-0.468397,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
757,-0.935156,-0.402014,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,0.323710,-0.190203,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
322,0.008994,-0.386358,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
382,0.166352,-0.467859,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
365,0.008994,-0.480292,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


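### Building a baseline model

Before building and comparing multiple models, first create a baseline model. The main purposes of a baseline model are:

- To confirm that the data is in a form a machine learning model can train on
- To provide a reference point for measuring the effect of later improvements

`models()` lists the models available in PyCaret's model library.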


In [None]:
models()

| ID | Name | Reference | Turbo |
|---|---|---|---|
| lr | Logistic Regression | sklearn.linear_model._logistic.LogisticRegression | True |
| knn | K Neighbors Classifier | sklearn.neighbors._classification.KNeighborsCl... | True |
| nb | Naive Bayes | sklearn.naive_bayes.GaussianNB | True |
| dt | Decision Tree Classifier | sklearn.tree._classes.DecisionTreeClassifier | True |
| svm | SVM - Linear Kernel | sklearn.linear_model._stochastic_gradient.SGDC... | True |
| rbfsvm | SVM - Radial Kernel | sklearn.svm._classes.SVC | False |
| gpc | Gaussian Process Classifier | sklearn.gaussian_process._gpc.GaussianProcessC... | False |
| mlp | MLP Classifier | sklearn.neural_network._multilayer_perceptron.... | False |
| ridge | Ridge Classifier | sklearn.linear_model._ridge.RidgeClassifier | True |
| rf | Random Forest Classifier | sklearn.ensemble._forest.RandomForestClassifier | True |


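A baseline model should be quick to train, so here we choose `svm`, the linear-kernel SVM (Support Vector Machine). As the output shows, PyCaret implements it with scikit-learn's `SGDClassifier` using hinge loss.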

In [None]:
create_model('svm')

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8254 | 0.0 | 0.8 | 0.7692 | 0.7843 | 0.6377 | 0.6381 |
| 1 | 0.746 | 0.0 | 0.88 | 0.6286 | 0.7333 | 0.5034 | 0.5296 |
| 2 | 0.7778 | 0.0 | 0.48 | 0.9231 | 0.6316 | 0.4943 | 0.5485 |
| 3 | 0.7258 | 0.0 | 0.64 | 0.6667 | 0.6531 | 0.4266 | 0.4268 |
| 4 | 0.7742 | 0.0 | 0.7083 | 0.7083 | 0.7083 | 0.5241 | 0.5241 |
| 5 | 0.7742 | 0.0 | 0.5833 | 0.7778 | 0.6667 | 0.5011 | 0.513 |
| 6 | 0.7419 | 0.0 | 0.5 | 0.75 | 0.6 | 0.4206 | 0.4394 |
| 7 | 0.8065 | 0.0 | 0.75 | 0.75 | 0.75 | 0.5921 | 0.5921 |
| 8 | 0.5 | 0.0 | 0.875 | 0.4286 | 0.5753 | 0.1159 | 0.1653 |
| 9 | 0.8065 | 0.0 | 0.6667 | 0.8 | 0.7273 | 0.5792 | 0.585 |


SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.001, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty='l2',
              power_t=0.5, random_state=123, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

Treat the SVM accuracy obtained here (the `Mean` of `Accuracy`) as the baseline, and let's see how far we can improve on it.
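
The `create_model('svm')` output shown above lists only the ten per-fold rows, so the mean can be recomputed from them. This sketch simply averages the `Accuracy` values copied from that output; the variable names are illustrative.

```python
from statistics import mean

# Accuracy of each of the 10 folds, copied from the create_model('svm') output.
fold_accuracy = [0.8254, 0.746, 0.7778, 0.7258, 0.7742,
                 0.7742, 0.7419, 0.8065, 0.5, 0.8065]

baseline = mean(fold_accuracy)
print(f"baseline mean accuracy: {baseline:.4f}")
```

The single weak fold (0.5) pulls the average down noticeably, which is one reason to look at the mean across folds rather than at any one fold.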

### Feature engineering

PyCaret provides the following six general-purpose feature engineering techniques, which do not depend on the problem setting.

 - Feature Interaction
 - Polynomial Features
 - Trigonometry Features
 - Group Features
 - Bin Numeric Features
 - Combine Rare Levels

Reference: https://pycaret.org/feature-interaction/

Here we try Bin Numeric Features, which is specified through an argument of `setup()`.
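
Binning replaces a continuous value with the index of the interval it falls into, and each interval can then be one-hot encoded like any other category. The sketch below uses simple equal-width bins as an illustration; this is an assumption made for clarity, and PyCaret's own rule for choosing the number and boundaries of bins may differ. `bin_equal_width` is a hypothetical helper, and the ages come from the first five rows of `train.csv`.

```python
def bin_equal_width(values, n_bins):
    """Map each value to an integer bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    bins = []
    for v in values:
        idx = int((v - lo) / width)
        bins.append(min(idx, n_bins - 1))  # the maximum falls into the last bin
    return bins

ages = [22.0, 38.0, 26.0, 35.0, 35.0]  # Age, first five rows
age_bins = bin_equal_width(ages, n_bins=4)
print(age_bins)
```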

In [None]:
setup(data=train_df,
      target='Survived',
      ignore_features=['PassengerId', 'Name', 'Ticket'],
      normalize=True,
      silent=True,
      bin_numeric_features=['Age'],
      session_id=42)

| | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target | Survived |
| 2 | Target Type | Binary |
| 3 | Label Encoded | 0: 0, 1: 1 |
| 4 | Original Data | (891, 12) |
| 5 | Missing Values | True |
| 6 | Numeric Features | 2 |
| 7 | Categorical Features | 6 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |


(False,
 'clf-default-name',
          Fare  Pclass_1  Pclass_2  Pclass_3  ...  Age_6.0  Age_7.0  Age_8.0  Age_9.0
 416 -0.015230       0.0       1.0       0.0  ...      0.0      0.0      0.0      0.0
 801 -0.130344       0.0       1.0       0.0  ...      0.0      0.0      0.0      0.0
 512 -0.129653       1.0       0.0       0.0  ...      0.0      0.0      0.0      0.0
 455 -0.468397       0.0       0.0       1.0  ...      0.0      0.0      0.0      0.0
 757 -0.402014       0.0       1.0       0.0  ...      0.0      0.0      0.0      0.0
 ..        ...       ...       ...       ...  ...      ...      ...      ...      ...
 98  -0.190203       0.0       1.0       0.0  ...      0.0      0.0      0.0      0.0
 322 -0.386358       0.0       1.0       0.0  ...      0.0      0.0      0.0      0.0
 382 -0.467859       0.0       0.0       1.0  ...      0.0      0.0      0.0      0.0
 365 -0.480292       0.0       0.0       1.0  ...      0.0      0.0      0.0      0.0
 510 -0.471083       0.0 

Overdoing feature engineering increases the number of columns (features) and lengthens training time, so use it with care. Let's check the accuracy with the SVM again.

In [None]:
create_model('svm')

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.873 | 0.0 | 0.76 | 0.9048 | 0.8261 | 0.7273 | 0.7341 |
| 1 | 0.7937 | 0.0 | 0.76 | 0.7308 | 0.7451 | 0.5719 | 0.5722 |
| 2 | 0.7778 | 0.0 | 0.72 | 0.72 | 0.72 | 0.5358 | 0.5358 |
| 3 | 0.7742 | 0.0 | 0.52 | 0.8667 | 0.65 | 0.4983 | 0.5337 |
| 4 | 0.6774 | 0.0 | 0.625 | 0.5769 | 0.6 | 0.3305 | 0.3312 |
| 5 | 0.8065 | 0.0 | 0.7083 | 0.7727 | 0.7391 | 0.5857 | 0.5871 |
| 6 | 0.8387 | 0.0 | 0.7917 | 0.7917 | 0.7917 | 0.6601 | 0.6601 |
| 7 | 0.8226 | 0.0 | 0.7083 | 0.8095 | 0.7556 | 0.6173 | 0.6207 |
| 8 | 0.7419 | 0.0 | 0.7917 | 0.6333 | 0.7037 | 0.4801 | 0.4895 |
| 9 | 0.8548 | 0.0 | 0.8333 | 0.8 | 0.8163 | 0.6964 | 0.6968 |


SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.001, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty='l2',
              power_t=0.5, random_state=123, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

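Accuracy improved: averaged over the ten folds shown, it rose from about 0.748 for the baseline to about 0.796.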

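### Building and comparing models

In PyCaret, a single call to `compare_models()` builds multiple models and evaluates their accuracy. By default the results are sorted by `Accuracy`, and specifying `n_select=2` returns the top two models.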

In [None]:
top2 = compare_models(n_select=2)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| gbc | Gradient Boosting Classifier | 0.8298 | 0.8521 | 0.692 | 0.8466 | 0.76 | 0.6304 | 0.6392 | 0.03 |
| lr | Logistic Regression | 0.8155 | 0.8483 | 0.7418 | 0.7775 | 0.7589 | 0.6096 | 0.6104 | 0.103 |
| ridge | Ridge Classifier | 0.8122 | 0.0 | 0.7295 | 0.7764 | 0.7517 | 0.601 | 0.6022 | 0.007 |
| lda | Linear Discriminant Analysis | 0.809 | 0.8314 | 0.7257 | 0.7722 | 0.7479 | 0.5945 | 0.5955 | 0.01 |
| svm | SVM - Linear Kernel | 0.7961 | 0.0 | 0.7218 | 0.7606 | 0.7348 | 0.5703 | 0.5761 | 0.009 |
| lightgbm | Light Gradient Boosting Machine | 0.793 | 0.8421 | 0.68 | 0.7751 | 0.7206 | 0.5574 | 0.5636 | 0.051 |
| ada | Ada Boost Classifier | 0.7881 | 0.8102 | 0.75 | 0.7227 | 0.7352 | 0.5588 | 0.5599 | 0.021 |
| rf | Random Forest Classifier | 0.7658 | 0.8436 | 0.6808 | 0.7128 | 0.6938 | 0.5047 | 0.5075 | 0.122 |
| knn | K Neighbors Classifier | 0.7577 | 0.7952 | 0.632 | 0.7252 | 0.6708 | 0.4813 | 0.4879 | 0.126 |
| et | Extra Trees Classifier | 0.7577 | 0.8213 | 0.6725 | 0.7016 | 0.6837 | 0.488 | 0.491 | 0.111 |


Four models (`gbc`, `lr`, `ridge`, and `lda`) are more accurate than the SVM. The most accurate model and its parameters are as follows.

In [None]:
top2[0]

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=123, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
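
The selection performed by `compare_models(n_select=2)` amounts to ranking the models by a metric and keeping the first few. The sketch below reproduces that ranking from the accuracy values in the output above; it is a conceptual illustration with a hypothetical helper `top_n`, whereas the real function returns fitted estimator objects rather than model IDs.

```python
# Mean accuracy per model ID, copied from the compare_models() output.
accuracy = {
    "gbc": 0.8298, "lr": 0.8155, "ridge": 0.8122, "lda": 0.809,
    "svm": 0.7961, "lightgbm": 0.793, "ada": 0.7881, "rf": 0.7658,
    "knn": 0.7577, "et": 0.7577,
}

def top_n(scores, n_select):
    """Return the n_select model IDs with the highest score, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_select]

print(top_n(accuracy, n_select=2))
```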

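### Tuning the models

`tune_model()` tunes a model's hyperparameters. Note that the default `n_iter` is 10; unless you set it to a somewhat larger value, the tuned model may end up less accurate than the untuned one.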
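
The effect of `n_iter` can be illustrated with a toy random search. This sketch assumes the tuner samples `n_iter` random candidates and keeps the best one; the search space, the scoring function `toy_objective`, and the helper `random_search` are all hypothetical and unrelated to the real Titanic models.

```python
import random

def random_search(objective, n_iter, seed=42):
    """Try n_iter random candidates and return the best (score, candidate)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        candidate = {"learning_rate": rng.uniform(0.01, 0.5),
                     "max_depth": rng.randint(1, 10)}
        score = objective(candidate)
        if best is None or score > best[0]:
            best = (score, candidate)
    return best

def toy_objective(params):
    """Hypothetical score that peaks at learning_rate=0.1, max_depth=4."""
    return (1.0 - abs(params["learning_rate"] - 0.1)
            - 0.05 * abs(params["max_depth"] - 4))

score_10, _ = random_search(toy_objective, n_iter=10)
score_200, _ = random_search(toy_objective, n_iter=200)
print(f"n_iter=10: {score_10:.3f}, n_iter=200: {score_200:.3f}")
```

With the same seed, the 200-candidate run contains the 10 candidates of the shorter run, so its best score can only be equal or higher. With few candidates, none of them may beat the untuned defaults.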

In [None]:
tuned_top2 = [tune_model(i, n_iter=200) for i in top2]

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.873 | 0.8895 | 0.8 | 0.8696 | 0.8333 | 0.7311 | 0.7327 |
| 1 | 0.8095 | 0.8568 | 0.76 | 0.76 | 0.76 | 0.6021 | 0.6021 |
| 2 | 0.746 | 0.7974 | 0.64 | 0.6957 | 0.6667 | 0.4621 | 0.4632 |
| 3 | 0.8387 | 0.8162 | 0.76 | 0.8261 | 0.7917 | 0.6605 | 0.662 |
| 4 | 0.7419 | 0.71 | 0.625 | 0.6818 | 0.6522 | 0.4477 | 0.4487 |
| 5 | 0.8226 | 0.8745 | 0.75 | 0.7826 | 0.766 | 0.6232 | 0.6236 |
| 6 | 0.8548 | 0.8904 | 0.7917 | 0.8261 | 0.8085 | 0.6917 | 0.6921 |
| 7 | 0.8548 | 0.9112 | 0.8333 | 0.8 | 0.8163 | 0.6964 | 0.6968 |
| 8 | 0.8065 | 0.8399 | 0.75 | 0.75 | 0.75 | 0.5921 | 0.5921 |
| 9 | 0.8548 | 0.8569 | 0.8333 | 0.8 | 0.8163 | 0.6964 | 0.6968 |


In [None]:
print(top2[0])
print(tuned_top2[0])

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=123, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=4,
                           max_features=1.0, max_leaf_nodes=None,
                           min_impurity_decrease=0.1, min_impurity_split=None,
    

### Evaluating the model

`evaluate_model()` evaluates the model you have built.

In [None]:
evaluate_model(tuned_top2[0])

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Hyperparameters', 'param…

The plots offered under "Plot Type" above can also be drawn with `plot_model()`.



### Ensemble learning

Passing a trained model object to `ensemble_model()` is all it takes to apply bagging or boosting. Blending and stacking are provided by `blend_models()` and `stack_models()`.

  - Bagging:
  ```python
  ensemble_model(tuned_top2[0], method='Bagging')
  ```
  - Boosting:
  ```python
  ensemble_model(tuned_top2[0], method='Boosting')
  ```
  - Blending:
  ```python
  blend_models(estimator_list=tuned_top2, method='hard')
  ```
  - Stacking:
  ```python
  stack_models(estimator_list=tuned_top2[1:], meta_model=tuned_top2[0])
  ```

Here we blend multiple models with `blend_models()`.
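
With `method='hard'`, blending is a majority vote over the class labels predicted by each model. The following is a minimal sketch of that idea, not PyCaret's implementation; `hard_vote` is a hypothetical helper and the predictions are made up for illustration.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label for each sample across models."""
    voted = []
    for sample_preds in zip(*predictions_per_model):
        label, _ = Counter(sample_preds).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical 0/1 predictions from three models for five passengers.
model_a = [0, 1, 1, 0, 1]
model_b = [0, 1, 0, 0, 1]
model_c = [1, 1, 0, 0, 0]

print(hard_vote([model_a, model_b, model_c]))
```

A hard vote yields labels only, not probabilities, which would explain an `AUC` of 0.0 in the results of a hard-voting blend. With an even number of models, ties are possible and need a tie-breaking rule.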

In [None]:
blended_model = blend_models(estimator_list=tuned_top2, method='hard')

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8889 | 0.0 | 0.76 | 0.95 | 0.8444 | 0.7597 | 0.7711 |
| 1 | 0.7937 | 0.0 | 0.68 | 0.7727 | 0.7234 | 0.5599 | 0.5628 |
| 2 | 0.746 | 0.0 | 0.6 | 0.7143 | 0.6522 | 0.4545 | 0.4588 |
| 3 | 0.871 | 0.0 | 0.72 | 0.9474 | 0.8182 | 0.721 | 0.7374 |
| 4 | 0.7903 | 0.0 | 0.5417 | 0.8667 | 0.6667 | 0.5253 | 0.5562 |
| 5 | 0.8387 | 0.0 | 0.6667 | 0.8889 | 0.7619 | 0.6437 | 0.6589 |
| 6 | 0.8226 | 0.0 | 0.6667 | 0.8421 | 0.7442 | 0.6112 | 0.6209 |
| 7 | 0.9032 | 0.0 | 0.7917 | 0.95 | 0.8636 | 0.7896 | 0.7975 |
| 8 | 0.8387 | 0.0 | 0.7083 | 0.85 | 0.7727 | 0.6493 | 0.6558 |
| 9 | 0.871 | 0.0 | 0.75 | 0.9 | 0.8182 | 0.7195 | 0.7266 |


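### Prediction

Up to this point, only 70% of the data has been used for training. Once you have decided which model to use, `finalize_model()` retrains it on the full dataset. Finally, `predict_model()` makes the predictions.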

In [None]:
finalized_model = finalize_model(blended_model)
submission = predict_model(finalized_model, data=test_df)
submission

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Label
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S,1
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,0
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,1
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,0
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,0


When it finishes, a "Label" column containing the predicted result is added.

This completes the data analysis flow with PyCaret. To submit to Kaggle, we write the result to a CSV file.

In [None]:
submission = submission.rename(columns={'Label': 'Survived'})
submission[['PassengerId', 'Survived']].to_csv('submission.csv', index=False)
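
Before uploading, it can be worth checking the file's layout. The sketch below assumes the submission should contain exactly the two columns written above, `PassengerId` and `Survived`, with 0/1 labels and one row per test passenger. `check_submission` is a hypothetical helper, and the sample text mirrors the first rows of the prediction output.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return True if the CSV has exactly PassengerId,Survived and 0/1 labels."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows or list(rows[0].keys()) != ["PassengerId", "Survived"]:
        return False
    if len(rows) != expected_rows:
        return False
    return all(r["Survived"] in ("0", "1") for r in rows)

# First rows of the prediction output, in the submission layout.
sample = "PassengerId,Survived\n892,0\n893,1\n894,0\n"
print(check_submission(sample, expected_rows=3))
```

For the real file, the text would come from reading `submission.csv`, and `expected_rows` would be `len(test_df)`.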

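## Other features and applications

This section introduces the `automl()` function and MLflow.

### The `automl()` function

This function returns the best model among all models created so far, judged by the metric given in the `optimize` parameter.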

In [None]:
automl()

VotingClassifier(estimators=[('gbc',
                              GradientBoostingClassifier(ccp_alpha=0.0,
                                                         criterion='friedman_mse',
                                                         init=None,
                                                         learning_rate=0.1,
                                                         loss='deviance',
                                                         max_depth=4,
                                                         max_features=1.0,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.1,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=5,
                                                         min_samples_split=7,
                                              

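### MLflow

If you pass `log_experiment=True` and `experiment_name=<any name>` to `setup()`, the parameters, accuracy, and other details of the models built afterwards are logged. You can inspect the logged results in a GUI by starting the MLflow server.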

In [None]:
# Print the URL for reaching the MLflow server (started at http://localhost:5000 on the Colab machine) through Google's proxy
from google.colab.output import eval_js
print(eval_js('google.colab.kernel.proxyPort(5000)'))

https://pi2yatlvi0q-496ff2e9c6d22116-5000-colab.googleusercontent.com/


In [None]:
setup(data=train_df,
      target='Survived',
      ignore_features=['PassengerId', 'Name', 'Ticket'],
      silent=True,
      log_experiment=True,
      experiment_name='test1')
compare_models(n_select=5) 
!mlflow ui

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| gbc | Gradient Boosting Classifier | 0.8202 | 0.8704 | 0.7127 | 0.7971 | 0.7504 | 0.6109 | 0.6152 | 0.026 |
| lr | Logistic Regression | 0.8122 | 0.8619 | 0.7214 | 0.7774 | 0.7445 | 0.5968 | 0.6012 | 0.027 |
| ridge | Ridge Classifier | 0.8121 | 0.0 | 0.717 | 0.7798 | 0.7417 | 0.5952 | 0.6012 | 0.007 |
| lightgbm | Light Gradient Boosting Machine | 0.8106 | 0.8585 | 0.7299 | 0.7674 | 0.7434 | 0.5941 | 0.5998 | 0.041 |
| lda | Linear Discriminant Analysis | 0.8009 | 0.8479 | 0.7002 | 0.7657 | 0.7259 | 0.5709 | 0.5773 | 0.01 |
| rf | Random Forest Classifier | 0.7978 | 0.8642 | 0.7094 | 0.7568 | 0.7242 | 0.5661 | 0.5758 | 0.124 |
| et | Extra Trees Classifier | 0.7929 | 0.8496 | 0.7045 | 0.7487 | 0.7203 | 0.5569 | 0.5635 | 0.121 |
| dt | Decision Tree Classifier | 0.7817 | 0.7592 | 0.6837 | 0.7326 | 0.7014 | 0.531 | 0.5376 | 0.007 |
| ada | Ada Boost Classifier | 0.7801 | 0.8406 | 0.7045 | 0.7176 | 0.7073 | 0.532 | 0.5358 | 0.019 |
| knn | K Neighbors Classifier | 0.7173 | 0.7387 | 0.5337 | 0.6639 | 0.5879 | 0.3771 | 0.3845 | 0.03 |


[2021-09-22 16:06:56 +0000] [154944] [INFO] Starting gunicorn 20.1.0
[2021-09-22 16:06:56 +0000] [154944] [INFO] Listening at: http://127.0.0.1:5000 (154944)
[2021-09-22 16:06:56 +0000] [154944] [INFO] Using worker: sync
[2021-09-22 16:06:56 +0000] [154947] [INFO] Booting worker with pid: 154947

Aborted!
[2021-09-22 16:33:00 +0000] [154944] [INFO] Handling signal: int
[2021-09-22 16:33:00 +0000] [154947] [INFO] Worker exiting (pid: 154947)
[2021-09-22 16:33:00 +0000] [154944] [INFO] Shutting down: Master


Once `!mlflow ui` has started the MLflow server, open the URL printed earlier by `eval_js()`.


This concludes the introduction to PyCaret using the Titanic survival prediction data.