# ISID: Demo of Automated ML and machine learning model explainability/interpretability on Azure

## Prerequisites

This demo uses an Azure ML VM. Connect to the created VM over SSH (for example with TeraTerm) and set up Jupyter Notebook.

First, check which virtual environments exist:

conda info -e

Then activate py36:

source activate py36

Set a password for Jupyter Notebook:

jupyter notebook password

Start Jupyter Notebook:

jupyter notebook

In TeraTerm, enable SSH port forwarding for port 8888, then open the URL that is displayed:
http://localhost:8888/tree




# 0. Setting up the execution environment

In [1]:
# First, install and enable the visualization widget extension
!jupyter nbextension install --py --sys-prefix azureml.contrib.explain.model.visualize
!jupyter nbextension enable --py --sys-prefix azureml.contrib.explain.model.visualize

Installing /data/anaconda/envs/py36/lib/python3.6/site-packages/azureml/contrib/explain/model/visualize/static -> microsoft-mli-widget
Up to date: /data/anaconda/envs/py36/share/jupyter/nbextensions/microsoft-mli-widget/index.js.map
Up to date: /data/anaconda/envs/py36/share/jupyter/nbextensions/microsoft-mli-widget/index.js
Up to date: /data/anaconda/envs/py36/share/jupyter/nbextensions/microsoft-mli-widget/extension.js.map
Up to date: /data/anaconda/envs/py36/share/jupyter/nbextensions/microsoft-mli-widget/extension.js
- Validating: OK

    To initialize this nbextension in the browser every time the notebook (or other app) loads:
    
          jupyter nbextension enable azureml.contrib.explain.model.visualize --py --sys-prefix
    
Enabling notebook extension microsoft-mli-widget/extension...
      - Validating: OK


In [2]:
# Suppress warnings that do not affect execution
import warnings
warnings.filterwarnings('ignore')


In [3]:
# Fix the random seeds
seed_value = 1234  # arbitrary seed value

# 1. Python hash seed (read at interpreter startup, so this only affects processes started later)
import os
os.environ['PYTHONHASHSEED'] = str(seed_value)

# 2. Seed the random module
import random
random.seed(seed_value)

# 3. Seed NumPy
import numpy as np
np.random.seed(seed_value)
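As a quick check of what the seeding above provides, the following minimal sketch re-seeds `random` and NumPy and confirms that the same draws come back. The helper name `draw` is hypothetical and not part of the notebook.

```python
import random

import numpy as np


def draw(seed):
    """Re-seed both generators, then draw a few numbers from each."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3).tolist()


first = draw(1234)
second = draw(1234)
other = draw(4321)

print(first == second)  # compares two runs with the same seed
print(first == other)   # compares runs with different seeds
```

Libraries that keep their own random state are not covered by these two calls and need their own seed or `random_state` argument.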
 

In [4]:
# Import packages
import pandas as pd
import sklearn

# Azure-related
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
import logging
from azureml.train.automl import AutoMLConfig

W0816 09:15:06.605536 139822054012672 deprecation_wrapper.py:119] From /data/anaconda/envs/py36/lib/python3.6/site-packages/azureml/automl/core/_vendor/automl/client/core/common/tf_wrappers.py:36: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0816 09:15:06.606585 139822054012672 deprecation_wrapper.py:119] From /data/anaconda/envs/py36/lib/python3.6/site-packages/azureml/automl/core/_vendor/automl/client/core/common/tf_wrappers.py:36: The name tf.logging.ERROR is deprecated. Please use tf.compat.v1.logging.ERROR instead.



# 1. Loading the Titanic data and preprocessing

In [5]:
# Fetch the Titanic data
# Reference:
# https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/explain-model/tabular-data/advanced-feature-transformations-explain-local.ipynb

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)


In [6]:
# Build the datasets
target_feature = ['survived']
numeric_features = ['age', 'fare', 'sibsp', 'parch']
categorical_features = ['embarked', 'sex', 'pclass']

df = data[target_feature + categorical_features + numeric_features]
y = data[target_feature].values
# Copy so that the column assignments below modify X itself, not a view of data
X = data[categorical_features + numeric_features].copy()


# 2. About the Titanic data

In [7]:
print(df.shape)
df.head()

(1309, 8)


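| | survived | embarked | sex | pclass | age | fare | sibsp | parch |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | S | female | 1 | 29.0 | 211.34 | 0 | 0 |
| 1 | 1 | S | male | 1 | 0.92 | 151.55 | 1 | 2 |
| 2 | 0 | S | female | 1 | 2.0 | 151.55 | 1 | 2 |
| 3 | 0 | S | male | 1 | 30.0 | 151.55 | 1 | 2 |
| 4 | 0 | S | female | 1 | 25.0 | 151.55 | 1 | 2 |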


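- survived: target variable (1 = survived)

- embarked: port of embarkation, one of Cherbourg, Queenstown, or Southampton
- sex: male or female
- pclass: passenger ticket class (1 is the highest)
- age: age
- fare: ticket fare
- sibsp: number of siblings and spouses aboard
- parch: number of parents and children aboard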



# 3. Data preprocessing

In [8]:
# Check the column dtypes
X.dtypes

embarked     object
sex          object
pclass        int64
age         float64
fare        float64
sibsp         int64
parch         int64
dtype: object

In [9]:
# Convert pclass to a string so it is treated as categorical
X["pclass"] = X["pclass"].astype(str)
X.dtypes

# age is converted to int later

embarked     object
sex          object
pclass       object
age         float64
fare        float64
sibsp         int64
parch         int64
dtype: object

In [10]:
# Check which columns contain missing values
X.isnull().any(axis=0)

embarked     True
sex         False
pclass      False
age          True
fare         True
sibsp       False
parch       False
dtype: bool

In [11]:
# Fill missing values in embarked
X['embarked'] = X['embarked'].fillna("missing")
X.isnull().any(axis=0)

embarked    False
sex         False
pclass      False
age          True
fare         True
sibsp       False
parch       False
dtype: bool

In [12]:
# Fill missing values in age and fare
# Split the data first
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Compute the medians on the training data only
age_median = x_train["age"].median()
fare_median = x_train["fare"].median()

X["age"] = X["age"].fillna(age_median)
X["fare"] = X["fare"].fillna(fare_median)
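The cell above takes the medians from the training rows only and then uses them to fill every row. A minimal sketch of that pattern on a hypothetical toy column (the data values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic "age" column.
train = pd.DataFrame({"age": [20.0, np.nan, 40.0, 30.0]})
test = pd.DataFrame({"age": [np.nan, 80.0]})

# The median is computed from the training rows only (NaN is skipped).
age_median = train["age"].median()

# The same training median fills the gaps in both splits.
train_filled = train["age"].fillna(age_median)
test_filled = test["age"].fillna(age_median)

print(age_median)
print(test_filled.tolist())
```

Taking the statistic from the training rows keeps information about the test rows out of the preprocessing.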


In [13]:
X.isnull().any(axis=0)

embarked    False
sex         False
pclass      False
age         False
fare        False
sibsp       False
parch       False
dtype: bool

In [14]:
# Convert age to an integer dtype
X["age"] = X["age"].astype(np.int64)
X.dtypes

embarked     object
sex          object
pclass       object
age           int64
fare        float64
sibsp         int64
parch         int64
dtype: object

# 4. Preparing the preprocessing model and data for machine learning

## 4.1 Set up the preprocessing pipeline


In [15]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures


transformations = ColumnTransformer([
    ("categorical",  Pipeline(steps=[
        ('encoder', OneHotEncoder(sparse=False))]), categorical_features),
    ("numeric",  Pipeline(steps=[
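        ('scaler', PolynomialFeatures(1))]), numeric_features),  # degree 1: values pass through unchanged, plus a constant bias column
])

# Missing values could also be handled inside the preprocessing pipeline with an imputer,
# but they must be filled before the dtypes can be fixed,
# so they are handled beforehand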
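One detail of the `numeric` branch is worth checking: `PolynomialFeatures(1)` keeps the original values but, with its default `include_bias=True`, also prepends a constant column of ones. The sketch below illustrates this on two hand-written rows (the values are copied from the transformed output shown later in this notebook).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two rows of numeric features in the order age, fare, sibsp, parch.
numeric = np.array([[25.0, 7.925, 0.0, 0.0],
                    [41.0, 134.5, 0.0, 0.0]])

expanded = PolynomialFeatures(1).fit_transform(numeric)

print(expanded.shape)
print(expanded[0])
```

This is why the transformed data has 14 columns: 9 one-hot columns plus 4 numeric columns plus 1 bias column.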


In [16]:
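# Fit the preprocessing
transformations.fit(X)

# The preprocessing is fit on all of X, which is acceptable here because it only performs one-hot encoding.
# Steps such as mean-based scaling of numeric data should be fit on the training data only,
# after splitting into training and test sets, so take care in that case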


ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('categorical', Pipeline(memory=None,
     steps=[('encoder', OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=False))]), ['embarked', 'sex', 'pclass']), ('numeric', Pipeline(memory=None,
     steps=[('scaler', PolynomialFeatures(degree=1, include_bias=True, interaction_only=False))]), ['age', 'fare', 'sibsp', 'parch'])])
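The caveat in the comments above (fit statistics-based steps on the training split only) can be sketched as follows. `StandardScaler` and the toy data are illustrative choices, not something the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

values = np.arange(20, dtype=float).reshape(-1, 1)
x_tr, x_te = train_test_split(values, test_size=0.2, random_state=0)

# Fit on the training split only, then apply the same transform to both splits.
scaler = StandardScaler().fit(x_tr)
x_tr_scaled = scaler.transform(x_tr)
x_te_scaled = scaler.transform(x_te)

print(scaler.mean_[0], x_tr.mean())
```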

## 4.2 Split the data into training and test sets

In [17]:
# Split the data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
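This split repeats the one used earlier for the medians. Because the row count and `random_state` are the same, both calls select the same rows. A minimal sketch using row indices in place of the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(1309)  # same number of rows as the Titanic data

train_a, test_a = train_test_split(row_ids, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(row_ids, test_size=0.2, random_state=0)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))
```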

In [18]:
# Apply the preprocessing
x_train_transformed = transformations.transform(x_train)
x_test_transformed = transformations.transform(x_test)


In [19]:
# Inspect the preprocessed data
print(x_train_transformed.shape)
print(x_train_transformed[0:2,:])


(1047, 14)
[[  0.      0.      1.      0.      0.      1.      0.      0.      1.
    1.     25.      7.925   0.      0.   ]
 [  1.      0.      0.      0.      1.      0.      1.      0.      0.
    1.     41.    134.5     0.      0.   ]]


In [20]:
# Check the one-hot column names
transformations.named_transformers_["categorical"].steps[0][1].get_feature_names()

array(['x0_C', 'x0_Q', 'x0_S', 'x0_missing', 'x1_female', 'x1_male',
       'x2_1', 'x2_2', 'x2_3'], dtype=object)
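The generated names `x0_…`, `x1_…`, `x2_…` refer to the categorical columns by position. A minimal sketch, assuming a hypothetical toy frame with the same categories, of building more readable names from the fitted encoder's `categories_` attribute:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame containing every category seen in the notebook output.
toy = pd.DataFrame({
    "embarked": ["S", "C", "Q", "missing"],
    "sex": ["female", "male", "female", "male"],
    "pclass": ["1", "2", "3", "1"],
})

encoder = OneHotEncoder().fit(toy)

# categories_ holds one sorted array of categories per input column.
readable_names = [
    f"{column}_{category}"
    for column, categories in zip(toy.columns, encoder.categories_)
    for category in categories
]
print(readable_names)
```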

## 4.3 Processing the target variable

In [21]:
# Flatten y to a 1-D NumPy array
y_train = y_train.reshape(-1)
y_test = y_test.reshape(-1)
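The reshape is needed because selecting a single column with a list (`data[['survived']].values`) yields a two-dimensional array. A small illustration with made-up labels:

```python
import numpy as np

y_column = np.array([[1], [0], [1]])  # shape (3, 1), like DataFrame[[col]].values
y_flat = y_column.reshape(-1)          # shape (3,)

print(y_column.shape, y_flat.shape)
```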

# 5. Running Automated ML

## 5.1 Connect to the Azure ML service workspace (WS) and create an experiment

In [22]:
# Connect to the workspace
ws = Workspace.from_config(path='./ws_config.json')

# Create the experiment
experiment_name = 'automl-classification6'
experiment = Experiment(ws, experiment_name)

# When this cell runs, a message like the following is displayed:
#  open the page https://microsoft.com/devicelogin and enter the code ＜・・・＞ to authenticate.
# Enter the displayed code ＜・・・＞ at https://microsoft.com/devicelogin

## 5.2 Settings for running Automated ML locally

In [23]:
# local_run = experiment.submit(automl_config, show_output = True)

automl_config = AutoMLConfig(task = 'classification',
                             verbosity=logging.INFO,
                             primary_metric = 'accuracy',
                             X = x_train_transformed, 
                             y = y_train,
                             n_cross_validations = 8,
                             enable_voting_ensemble=True,
                             enable_stack_ensemble=True,
                             iterations = 50,
                            )


# Metrics available for primary_metric:
# https://docs.microsoft.com/ja-jp/azure/machine-learning/service/how-to-understand-automated-ml#classification-metrics
# Other configuration options:
# https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlconfig?view=azure-ml-py
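To make `n_cross_validations = 8` with `primary_metric = 'accuracy'` more concrete, here is a minimal scikit-learn sketch of 8-fold cross-validated accuracy for a single model. It only illustrates the evaluation idea; the synthetic data and `LogisticRegression` are assumptions for the example, and this is not how Automated ML is implemented internally.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only).
features, labels = make_classification(n_samples=200, n_features=6, random_state=0)

# One accuracy score per fold; the mean is the cross-validated accuracy.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), features, labels, cv=8, scoring="accuracy"
)

print(len(scores))
print(scores.mean())
```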


## 5.3 Settings for running Automated ML in the cloud

This code is currently commented out.

In [24]:
'''
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

# Name of the compute cluster
amlcompute_cluster_name = "automlcl"  

provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",
                                                            # for GPU, use "STANDARD_NC6"
                                                            # vm_priority = 'lowpriority', # optional
                                                            max_nodes=2)
# Create the VMs (compute cluster)
compute_target = ComputeTarget.create(
    ws, amlcompute_cluster_name, provisioning_config)

compute_target.wait_for_completion(
    show_output=True, min_node_count=None, timeout_in_minutes=20)

# -----------------------------------
# Configure the remote VM
# -----------------------------------
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
import pkg_resources

# RunConfiguration settings
conda_run_config = RunConfiguration(framework="python")

# Use the compute_target created above
conda_run_config.target = compute_target

# Other settings
conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE
dprep_dependency = 'azureml-dataprep==' + pkg_resources.get_distribution("azureml-dataprep").version
cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]', dprep_dependency], conda_packages=['numpy','py-xgboost<=0.80'])
conda_run_config.environment.python.conda_dependencies = cd

'''


In [25]:
'''
automl_config = AutoMLConfig(task = 'classification',
                             verbosity=logging.INFO,
                             primary_metric = 'accuracy',
                             X = x_train_transformed, 
                             y = y_train,
                             n_cross_validations = 10,
                             enable_voting_ensemble=True,
                             enable_stack_ensemble=True,
                             iterations = 30,
                             enable_early_stopping=False,
                             run_configuration=conda_run_config  # added for remote execution
                            )

# Details:
# https://docs.microsoft.com/ja-jp/azure/machine-learning/service/how-to-auto-train-remote
# https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/remote-amlcompute/auto-ml-remote-amlcompute.ipynb
'''

## 5.4 Start Automated ML

In [26]:
my_run = experiment.submit(automl_config, show_output = True)


Running on local machine
Parent Run ID: AutoML_30ebc59a-9a4e-47ce-bdf3-be045c67572e
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   StandardScalerWrapper SGD                      0:00:22       0.7851    0.7851
         1   StandardScalerWrapper SGD                      0:00:08       0.7689    0.7851
         2   MinMaxScaler SGD                           

## 5.5 Check the Automated ML results

In [27]:
from azureml.widgets import RunDetails
RunDetails(my_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [28]:
# Get the best-performing model
best_run, fitted_model = my_run.get_output()
print(best_run)

Run(Experiment: automl-classification6,
Id: AutoML_30ebc59a-9a4e-47ce-bdf3-be045c67572e_48,
Type: None,
Status: Completed)


In [29]:
# Details of the best-performing model
from pprint import pprint

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0]+ ' - ')
        elif hasattr(step[1], '_base_learners') and hasattr(step[1], '_meta_learner'):
            print("\nMeta Learner")
            pprint(step[1]._meta_learner)
            print()
            for estimator in step[1]._base_learners:
                print_model(estimator[1], estimator[0]+ ' - ')
        else:
            pprint(step[1].get_params())
            print()
            
print_model(fitted_model)

prefittedsoftvotingclassifier
{'estimators': ['41', '14', '30', '11', '39', '34', '28'],
 'weights': [0.13333333333333333,
             0.2,
             0.13333333333333333,
             0.06666666666666667,
             0.26666666666666666,
             0.13333333333333333,
             0.06666666666666667]}

41 - StandardScalerWrapper
{'class_name': 'StandardScaler',
 'copy': True,
 'module_name': 'sklearn.preprocessing.data',
 'with_mean': True,
 'with_std': False}

41 - LightGBMClassifier
{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 0.6933333333333332,
 'importance_type': 'split',
 'learning_rate': 0.036848421052631586,
 'max_bin': 130,
 'max_depth': 3,
 'min_child_samples': 19,
 'min_child_weight': 10,
 'min_split_gain': 0.10526315789473684,
 'n_estimators': 800,
 'n_jobs': 1,
 'num_leaves': 8,
 'objective': None,
 'random_state': None,
 'reg_alpha': 0.9473684210526315,
 'reg_lambda': 0.3684210526315789,
 'silent': True,
 'subsample': 0.49526315789473685,
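The best model is a soft-voting ensemble of seven pipelines with the weights listed above. Soft voting usually means taking a weighted average of each model's class probabilities; the sketch below assumes that behavior (it has not been checked against the `PreFittedSoftVotingClassifier` implementation) and uses hypothetical per-model probabilities.

```python
import numpy as np

# Ensemble weights as printed in the output above.
weights = np.array([0.13333333333333333, 0.2, 0.13333333333333333,
                    0.06666666666666667, 0.26666666666666666,
                    0.13333333333333333, 0.06666666666666667])

# Hypothetical P(survived) from each of the seven models for one passenger.
p_survived = np.array([0.9, 0.8, 0.7, 0.4, 0.85, 0.6, 0.3])
per_model_proba = np.column_stack([1.0 - p_survived, p_survived])

# Weighted average over the models gives the ensemble probabilities.
ensemble_proba = np.average(per_model_proba, axis=0, weights=weights)
predicted_class = int(np.argmax(ensemble_proba))

print(ensemble_proba)
print(predicted_class)
```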

In [30]:
# Accuracy on the training data (first line) and on the test data (second line)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_train, fitted_model.predict(x_train_transformed)))
print(accuracy_score(y_test, fitted_model.predict(transformations.transform(x_test))))


0.8567335243553008
0.8282442748091603
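The two accuracies can be converted back into counts of correctly classified passengers. The split sizes below follow from the 80/20 split of 1309 rows; the rest is plain arithmetic.

```python
n_train, n_test = 1047, 262          # 80/20 split of 1309 rows
train_accuracy = 0.8567335243553008  # values printed above
test_accuracy = 0.8282442748091603

# accuracy = correct / total, so correct = accuracy * total
train_correct = round(train_accuracy * n_train)
test_correct = round(test_accuracy * n_test)

print(train_correct, n_train - train_correct)
print(test_correct, n_test - test_correct)
```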


In [31]:
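# Save the model
# Note: sklearn.externals.joblib was removed in scikit-learn 0.23; on newer versions use `import joblib`
from sklearn.externals import joblib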

filename = './automated_best_model.pkl'
joblib.dump(fitted_model, filename)

['./automated_best_model.pkl']
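A minimal sketch of the same save and load round trip using the standalone `joblib` package, which is the replacement for `sklearn.externals.joblib` on newer scikit-learn versions. The toy model and temporary path are illustrative stand-ins for `fitted_model` and `./automated_best_model.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the AutoML fitted_model.
features = np.array([[0.0], [1.0], [2.0], [3.0]])
labels = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(features, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)   # serialize to disk
    restored = joblib.load(path)   # load it back

same_predictions = np.array_equal(toy_model.predict(features),
                                  restored.predict(features))
print(same_predictions)
```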

In [52]:
# To save the model from a specific iteration instead
'''
iteration = 8
iter_run, iter_model = my_run.get_output(iteration = iteration)
print(iter_run)
print("\n--Model information below--\n")
print_model(iter_model)

filename = './iter_model.pkl'
joblib.dump(iter_model, filename)
'''


# Using interpretability and explainability techniques

In [53]:
# Load the model
from sklearn.externals import joblib

filename = './automated_best_model.pkl'
#filename = './iter_model.pkl'

best_model = joblib.load(filename)



In [54]:
best_model

Pipeline(memory=None,
     steps=[('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('41', Pipeline(memory=None,
     steps=[('StandardScalerWrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x7f2a441b4358>), ('LightGBMClass...333333333333, 0.06666666666666667, 0.26666666666666666, 0.13333333333333333, 0.06666666666666667]))])

In [55]:
# Chain the preprocessing and the model
from sklearn.pipeline import Pipeline

model = Pipeline(steps=[('preprocessor', transformations),
                        ('classifier', best_model)])
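Wrapping the preprocessing and the classifier in one `Pipeline` lets the model accept raw, untransformed rows. A self-contained sketch of the same structure with a toy dataset and `LogisticRegression` as a stand-in classifier (both are assumptions for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one categorical and one numeric column.
raw = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male"],
    "age": [29, 40, 2, 30, 25, 48],
})
survived = [1, 0, 1, 0, 1, 0]

preprocessor = ColumnTransformer(
    [("categorical", OneHotEncoder(handle_unknown="ignore"), ["sex"])],
    remainder="passthrough",  # keep the numeric column as is
)

toy_pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                               ("classifier", LogisticRegression(max_iter=1000))])
toy_pipeline.fit(raw, survived)

# The pipeline accepts the raw columns directly.
probabilities = toy_pipeline.predict_proba(raw.iloc[0:1, :])
print(probabilities.shape)
```

Unlike this sketch, the notebook combines two components that are already fitted, so it does not call `fit` on the combined pipeline.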

In [56]:
# Feature names
feature_names = categorical_features + numeric_features
feature_names

['embarked', 'sex', 'pclass', 'age', 'fare', 'sibsp', 'parch']

In [57]:
# Names of the predicted classes
classes =["not-Survived", "Survived"]
classes

['not-Survived', 'Survived']

# Making a prediction

In [71]:
# Focus on the sample at index 1
instance_num = 1
x_test_local = x_test.iloc[instance_num:instance_num+1,:]
print("Actual: survived? ->", y_test[instance_num])
print("Predicted: survived? ->", model.predict(x_test_local))
print("Predicted: probability ->", model.predict_proba(x_test_local))
print("Data")
x_test_local

Actual: survived? -> 1
Predicted: survived? -> [1]
Predicted: probability -> [[0.1583832 0.8416168]]
Data


| | embarked | sex | pclass | age | fare | sibsp | parch |
|---|---|---|---|---|---|---|---|
| 533 | S | female | 2 | 21 | 21.0 | 0 | 1 |
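The slice `iloc[instance_num:instance_num+1, :]` is used instead of `iloc[instance_num]` so that the single row stays a two-dimensional DataFrame, which is the shape `predict` expects. A small illustration with made-up data:

```python
import pandas as pd

frame = pd.DataFrame({"sex": ["male", "female", "male"], "age": [30, 21, 45]})

as_series = frame.iloc[1]       # one row as a 1-D Series
as_frame = frame.iloc[1:2, :]   # the same row as a 1-row DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```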


# Explaining the model and data

In [74]:
# Create the explainer object
from azureml.explain.model.tabular_explainer import TabularExplainer
from azureml.contrib.explain.model.visualize import ExplanationDashboard

tabular_explainer = TabularExplainer(model=model.steps[-1][1], 
                                     initialization_examples=x_train,
                                     features=feature_names, 
                                     classes=classes,
                                     transformations=transformations)

# [Important]
# The one-hot transformation is passed in via the transformations argument
# See the documentation:
# https://docs.microsoft.com/ja-jp/python/api/azureml-explain-model/azureml.explain.model.tabularexplainer?view=azure-ml-py


In [75]:
# Compute the global explanation (on the first 15 test rows)
x_test = x_test.iloc[0:15, :]
global_explanation = tabular_explainer.explain_global(x_test)


100%|██████████| 15/15 [00:06<00:00,  2.41it/s]
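For readers without access to the Azure explainer, the idea of a global feature importance can be sketched with a different, simpler technique: permutation importance from scikit-learn. This is explicitly not SHAP and not what `TabularExplainer` computes; the synthetic data and random forest are assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: with shuffle=False the first two columns are the informative ones.
features, labels = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, labels)

# Shuffle each column in turn and measure how much the score drops.
result = permutation_importance(forest, features, labels, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
print(ranking)
```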


# Making a prediction

In [76]:
# Focus on the sample at index 1
instance_num = 1
x_test_local = x_test.iloc[instance_num:instance_num+1,:]
print("Actual: survived? ->", y_test[instance_num])
print("Predicted: survived? ->", model.predict(x_test_local))
print("Predicted: probability ->", model.predict_proba(x_test_local))
print("Data")
x_test_local


Actual: survived? -> 1
Predicted: survived? -> [1]
Predicted: probability -> [[0.1583832 0.8416168]]
Data


| | embarked | sex | pclass | age | fare | sibsp | parch |
|---|---|---|---|---|---|---|---|
| 533 | S | female | 2 | 21 | 21.0 | 0 | 1 |


## Explaining the model and data

In [77]:
from azureml.contrib.explain.model.visualize import ExplanationDashboard
ExplanationDashboard(global_explanation, model, x_test)

ExplanationWidget(value={'localExplanations': [[[0.006007390069643202, 0.05175036604097401, 0.0459983068783129…

<azureml.contrib.explain.model.visualize.ExplanationDashboard.ExplanationDashboard at 0x7f2a2c6cb908>

# How to output the explanation information

In [None]:
# Global explanation information

# Sorted SHAP values
print('ranked global importance values: {}'.format(global_explanation.get_ranked_global_values()))
# Corresponding feature names
print('ranked global importance names: {}'.format(global_explanation.get_ranked_global_names()))
# feature ranks (based on original order of features)
print('global importance rank: {}'.format(global_explanation.global_importance_rank))
# per class feature names
print('ranked per class feature names: {}'.format(global_explanation.get_ranked_per_class_names()))
# per class feature importance values
print('ranked per class feature values: {}'.format(global_explanation.get_ranked_per_class_values()))


In [None]:
# Global importances as a feature name -> value mapping
dict(zip(global_explanation.get_ranked_global_names(), global_explanation.get_ranked_global_values()))
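The `dict(zip(...))` idiom pairs each ranked name with its value. A tiny sketch with hypothetical names and values (not results from this notebook):

```python
ranked_names = ["sex", "pclass", "age"]   # hypothetical ranking
ranked_values = [0.30, 0.12, 0.05]        # hypothetical importances

# zip pairs the i-th name with the i-th value; dict keeps that order.
importance_by_name = dict(zip(ranked_names, ranked_values))
top_feature = max(importance_by_name, key=importance_by_name.get)

print(importance_by_name)
print(top_feature)
```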

In [None]:
# Local explanation information
instance_num = 1  # the passenger at index 1
local_explanation = tabular_explainer.explain_local(x_test.iloc[instance_num:instance_num+1,:])

prediction_value = model.predict(x_test)[instance_num]

sorted_local_importance_values = local_explanation.get_ranked_local_values()[prediction_value]
sorted_local_importance_names = local_explanation.get_ranked_local_names()[prediction_value]

prediction_value,dict(zip(sorted_local_importance_names[0], sorted_local_importance_values[0]))
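The indexing in the cell above implies a nested result: the outer index selects the class and the next index selects the explained row. The sketch below mimics that layout with hypothetical nested lists. The structure is inferred from how the notebook indexes the result, not from the library documentation, and all values are made up.

```python
# Hypothetical ranked output: outer index = class, middle index = row.
ranked_local_names = [
    [["pclass", "age", "sex"]],   # class 0 ("not-Survived")
    [["sex", "age", "pclass"]],   # class 1 ("Survived")
]
ranked_local_values = [
    [[0.02, -0.03, -0.25]],
    [[0.25, 0.03, -0.02]],
]

prediction_value = 1  # predicted class for the single explained row

# Select the predicted class first, then the first (and only) row.
names_for_row = ranked_local_names[prediction_value][0]
values_for_row = ranked_local_values[prediction_value][0]
local_importance = dict(zip(names_for_row, values_for_row))

print(local_importance)
```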

End.