**Introduction to AutoML OSS (11)**

# Comparing AutoML OSS and AutoML Cloud Services

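Read this notebook alongside the article that introduces it.
- [@IT series "Introduction to AutoML OSS" (11): Part 11, "Which AutoML OSS Is the Most Popular? Plus a Look at Notable AutoML Cloud Services"](https://broom.itmedia.co.jp/ait/articles/2203/24/news004.html)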

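For how to work with this notebook and the data it uses, see Part 1 of the series.
- [@IT series "Introduction to AutoML OSS" (1): Part 1, "What Is AutoML, the Technology That Takes the Tedium out of Building Machine Learning Models? A Look at Its History, Trends, and Benefits"](https://www.atmarkit.co.jp/ait/articles/2107/02/news006.html)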

## Minimal code for predicting Titanic survival

Downloading and unzipping the Titanic survival prediction archive is common to every OSS below, so that step is done once up front.

In [None]:
!wget -N https://github.com/aiq2020-tw/automl-notebooks/raw/main/titanic.zip
!unzip titanic.zip

--2022-02-10 07:58:26--  https://github.com/aiq2020-tw/automl-notebooks/raw/main/titanic.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/aiq2020-tw/automl-notebooks/main/titanic.zip [following]
--2022-02-10 07:58:26--  https://raw.githubusercontent.com/aiq2020-tw/automl-notebooks/main/titanic.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34877 (34K) [application/zip]
Saving to: ‘titanic.zip’


Last-modified header missing -- time-stamps turned off.
2022-02-10 07:58:26 (12.8 MB/s) - ‘titanic.zip’ saved [34877/34877]

Archive:  titanic.zip
  inflating: gender_submission.csv   
  inflatin
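
Each snippet below ends by writing a `submission.csv`. As a quick sanity check on that file, here is a minimal sketch that assumes the Kaggle Titanic layout used by `gender_submission.csv` (a `PassengerId` column and a 0/1 `Survived` column). The helper name `check_submission` and the demo file are hypothetical and are not part of the original notebook.

```python
import csv
import os
import tempfile

# Assumed column layout (same as gender_submission.csv)
EXPECTED_COLUMNS = {"PassengerId", "Survived"}


def check_submission(path):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        header = set(reader.fieldnames or [])
        missing = EXPECTED_COLUMNS - header
        if missing:
            # Column names are case-sensitive, so 'PassengerID' would be reported here
            problems.append(f"missing columns: {sorted(missing)}")
            return problems
        for line_no, row in enumerate(reader, start=2):
            if row["Survived"] not in {"0", "1"}:
                problems.append(f"line {line_no}: Survived={row['Survived']!r}")
    return problems


# Demo on a throwaway file so the sketch runs without the Titanic data
demo_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
with open(demo_path, "w", newline="") as f:
    csv.writer(f).writerows([["PassengerId", "Survived"], [892, 0], [893, 1]])

print(check_submission(demo_path))
```

In the notebook itself, calling `check_submission('submission.csv')` after any of the cells below would be the natural use.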

### Using auto-sklearn

In [None]:
!pip install auto-sklearn
# Restart the runtime after installing, then run the lines below
import pandas as pd
from autosklearn.classification import AutoSklearnClassifier
X_train = pd.read_csv('train.csv')
X_test = pd.read_csv('test.csv')
y_train = X_train.pop('Survived')
X_train = X_train.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)
X_test = X_test.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)
cls = AutoSklearnClassifier(time_left_for_this_task=120, seed=42)
cls.fit(X_train, y_train)
X_test['Survived'] = cls.predict(X_test)
X_test[['PassengerId','Survived']].to_csv('submission.csv', index=None)

### Using TPOT

In [None]:
!pip install tpot
from tpot import TPOTClassifier
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
X_train = train_df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)
y_train = X_train.pop('Survived')
X_test = test_df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)
tpot = TPOTClassifier(verbosity=2, generations=10, population_size=5, random_state=42)
tpot.fit(X_train, y_train)
predictions = tpot.predict(X_test)
output = pd.DataFrame({'PassengerId': test_df.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
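
To get a feel for how much work `generations=10, population_size=5` asks TPOT to do: the TPOT documentation states that it evaluates `population_size + generations × offspring_size` pipelines in total, with `offspring_size` defaulting to `population_size`. The helper below is a hypothetical illustration of that arithmetic, not part of TPOT.

```python
def tpot_pipeline_budget(generations, population_size, offspring_size=None):
    """Total pipelines TPOT evaluates, per the formula in its documentation."""
    # offspring_size falls back to population_size when not given
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


total = tpot_pipeline_budget(generations=10, population_size=5)
print(total)  # 5 + 10 * 5 = 55
```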

### Using AutoGluon

In [None]:
!pip install -U "mxnet<2.0.0"
!pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd
train_df = TabularDataset('train.csv').drop(labels=['PassengerId'], axis=1)
test_df = TabularDataset('test.csv')
test_df_tmp = pd.DataFrame(test_df['PassengerId'])
test_df = test_df.drop(labels=['PassengerId'], axis=1)
predictor = TabularPredictor(label='Survived', eval_metric='accuracy').fit(train_df)
y_pred = predictor.predict(test_df)
submit_df = pd.DataFrame(y_pred)
submit_df['PassengerId'] = test_df_tmp
submit_df.to_csv('submission.csv', index=False)

### Using H2O

In [None]:
!pip install requests tabulate future h2o
import h2o
h2o.init()
# Use the GUI to load the data, then build and evaluate the model
pred = h2o.import_file('./submission.csv')
submission = pred['PassengerId']
submission['Survived'] = pred['predict']
h2o.export_file(frame=submission, path='submission.csv', force=True)

### Using PyCaret

In [None]:
!pip install pycaret
from pycaret.classification import *
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
setup(data=train_df, target='Survived', silent=True, session_id=42)
best_model = compare_models()
submission = predict_model(best_model, data=test_df)
submission = submission.rename(columns={'Label': 'Survived'})
submission[['PassengerId', 'Survived']].to_csv('submission.csv', index=False)

| | Description | Value |
|---|---|---|
| 0 | session_id | 42 |
| 1 | Target | Survived |
| 2 | Target Type | Binary |
| 3 | Label Encoded | |
| 4 | Original Data | (891, 12) |
| 5 | Missing Values | True |
| 6 | Numeric Features | 3 |
| 7 | Categorical Features | 8 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |



| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.817 | 0.8588 | 0.7016 | 0.7817 | 0.7383 | 0.5986 | 0.6016 | 0.284 |
| knn | K Neighbors Classifier | 0.7062 | 0.7068 | 0.5332 | 0.6191 | 0.5686 | 0.3497 | 0.3544 | 0.207 |


### Using AutoKeras

In [None]:
!pip install autokeras
import autokeras as ak
import pandas as pd
clf = ak.StructuredDataClassifier(overwrite=True, max_trials=10, seed=42)
clf.fit('train.csv', 'Survived')
test_df = pd.read_csv('test.csv')
test_df['Survived'] = clf.predict(test_df)
print(test_df['Survived'].astype(int))
test_df[['PassengerId','Survived']].astype(int).to_csv('submission.csv', index=None)

### Using Ludwig

In [None]:
!pip install ludwig
!pip install petastorm
import ludwig
from ludwig.api import LudwigModel
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# The config is written inline as a dict here. In typical Ludwig usage it is more
# common to prepare it beforehand as a separate config file (e.g. JSON).
config = {
    # Set the data split ratios
    'preprocessing': {'split_probabilities': [0.7, 0.2, 0.1]},
    # Set the input features (explanatory variables)
    'input_features': [{'name': 'Pclass', 'type': 'category'},
                       {'name': 'Sex', 'type': 'category'},
                       {'name': 'Age',
                        'preprocessing': {
                            'missing_value_strategy': 'fill_with_mean'},
                        'type': 'numerical'},
                       {'name': 'SibSp', 'type': 'numerical'},
                       {'name': 'Parch', 'type': 'numerical'},
                       {'name': 'Fare',
                        'preprocessing': {
                            'missing_value_strategy': 'fill_with_mean'},
                        'type': 'numerical'},
                       {'name': 'Embarked', 'type': 'category'}],
    # Set the output feature (target variable)
    'output_features': [{'name': 'Survived', 'type': 'binary'}],
    # Set training parameters (optional)
    'training': {
        'batch_size': 128,
        'epochs': 300,
        'early_stop': 5,
        'learning_rate': 0.001}
    }

model = LudwigModel(config)
train_stats = model.train(train_df, random_seed=42)
predictions = model.predict(test_df)
output = pd.DataFrame({'PassengerId': test_df.PassengerId,
                       'Survived': predictions[0]['Survived_predictions'].astype('int32')})
output.to_csv('submission.csv', index=False)
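
The config dict in the cell above can also be kept in a separate file. The following is a minimal sketch using only the standard `json` module: it writes a shortened version of the config to disk and reads it back into a dict, which could then be passed to `LudwigModel` in the same way as above. The file name `titanic_config.json` is a hypothetical choice, and only a subset of the features is shown.

```python
import json
import os
import tempfile

# Shortened copy of the config above (subset of features, for illustration)
config = {
    'input_features': [{'name': 'Pclass', 'type': 'category'},
                       {'name': 'Sex', 'type': 'category'}],
    'output_features': [{'name': 'Survived', 'type': 'binary'}],
}

# Hypothetical file name; a temp directory keeps the sketch side-effect free
config_path = os.path.join(tempfile.mkdtemp(), 'titanic_config.json')
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)

# Load it back into a plain dict
with open(config_path) as f:
    loaded_config = json.load(f)

print(loaded_config == config)
```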

### Using NNI

In [None]:
!pip install nni
import nni
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
test_df_tmp = test_df.pop('PassengerId')
X_train = train_df.drop(['PassengerId', 'Name'], axis=1)
X_test = test_df.drop(['Name'], axis=1)
y_train = X_train.pop('Survived')
 
list_cols = ['Sex', 'Ticket', 'Cabin', 'Embarked']
for col in list_cols:
    target_column = pd.concat([X_train[col], X_test[col]])
    le = LabelEncoder()
    le.fit(target_column)
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])

!wget -N https://github.com/aiq2020-tw/automl-notebooks/raw/main/09_NNI/files/search_space.json
!wget -N https://github.com/aiq2020-tw/automl-notebooks/raw/main/09_NNI/files/nni_xgb.py
!wget -N https://github.com/aiq2020-tw/automl-notebooks/raw/main/09_NNI/files/config.yml
from google.colab.output import eval_js
!nnictl create --config config.yml --port 7000 &
print('NNI_URL:' + eval_js('google.colab.kernel.proxyPort(7000)'))
!nnictl experiment show | sed '/^\[/d' > experiment.json
experiment_json = pd.read_json('experiment.json')
import time
time.sleep(180)
experiment_id = experiment_json['id']['experimentName']
!nnictl experiment export $experiment_id --filename nni_output.csv --type csv --intermediate
param_df = pd.read_csv('nni_output.csv')
best_param_df = param_df.loc[[param_df['reward'].idxmax()]]
import xgboost as xgb
clf = xgb.XGBClassifier(
    learning_rate=best_param_df['learning_rate'].values[0],
    # XGBoost's keyword is colsample_bytree; the column name follows the exported CSV
    colsample_bytree=best_param_df['colsample_btree'].values[0],
    max_depth=best_param_df['max_depth'].values[0],
    subsample=best_param_df['subsample'].values[0])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
submit_df = pd.DataFrame(y_pred, columns=['Survived'])
submit_df['PassengerId'] = test_df_tmp
submit_df.to_csv('submission.csv', index=False)

Collecting nni
  Downloading nni-2.6-py3-none-manylinux1_x86_64.whl (60.1 MB)
Collecting responses
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting json-tricks>=3.15.5
  Downloading json_tricks-3.15.5-py2.py3-none-any.whl (26 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting schema
  Downloading schema-0.7.5-py2.py3-none-any.whl (17 kB)
Collecting PythonWebHDFS
  Downloading PythonWebHDFS-0.2.3-py3-none-any.whl (10 kB)
Collecting websockets>=10.1
  Downloading websockets-10.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (111 kB)
Collectin
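
The NNI cell above picks the trial with the highest `reward` from the exported CSV and feeds its hyperparameters to XGBoost. The sketch below shows that selection step on its own, using only the standard library. The CSV contents here are made up for illustration (real files come from `nnictl experiment export`), and the column set is an assumption based on the names used in the cell above.

```python
import csv
import io

# Hypothetical export contents, for illustration only
sample_csv = """trialJobId,learning_rate,max_depth,subsample,reward
a1,0.10,3,0.8,0.79
b2,0.05,5,0.9,0.83
c3,0.20,4,0.7,0.81
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Equivalent in spirit to param_df.loc[[param_df['reward'].idxmax()]]
best = max(rows, key=lambda r: float(r["reward"]))

# CSV values are strings, so cast them before handing them to a model
best_params = {
    "learning_rate": float(best["learning_rate"]),
    "max_depth": int(best["max_depth"]),
    "subsample": float(best["subsample"]),
}
print(best["trialJobId"], best_params)
```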

### Using Model Search

In [None]:
!git clone https://github.com/google/model_search
!pip install -r model_search/requirements.txt
%cd model_search
!protoc --python_out=./ model_search/proto/*.proto
import sys
from absl import app, flags
FLAGS = flags.FLAGS
sys.argv = sys.argv[:1]
try:
    app.run(lambda argv: None)
except:
    pass
import model_search
from model_search import single_trainer
from model_search.data import csv_data
import tensorflow as tf
import pandas as pd
train_df = pd.read_csv('train.csv')
X_train = train_df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)
X_train.to_csv('train.csv', index=False)
trainer = single_trainer.SingleTrainer(
    data=csv_data.Provider(
        label_index=0,
        logits_dimension=2,
        record_defaults=X_train.mean(),
        filename='train.csv'),
    spec='model_search/configs/dnn_config.pbtxt')
trainer.try_models(
    number_models=40,
    train_steps=100,
    eval_steps=10,
    root_dir='titanic_model',
    batch_size=32,
    experiment_name='titanic',
    experiment_owner='model_search_user')
import os
model_dir = os.listdir(f'titanic_model/tuner-1/40/saved_model/')[0]
test_df = pd.read_csv('test.csv')
submit_df = test_df[['PassengerId']].copy()
test_df = test_df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)
test_df = test_df.fillna(test_df.mean())
# Feature keys are the 1-based column positions as strings (the label sits at index 0)
X_test = {
    str(i + 1): tf.convert_to_tensor(test_df[col].values, dtype=tf.float32)
    for i, col in enumerate(test_df.columns)
}
trained_model = tf.keras.models.load_model(f'titanic_model/tuner-1/40/saved_model/{model_dir}')
result = trained_model.signatures['serving_default'](**X_test)
preds = tf.keras.backend.get_value(result['predictions'])
submit_df['Survived'] = preds
submit_df.to_csv('submission.csv', index=False)