This section shows an example of the hERG inhibitory activity prediction carried out in the intermediate course.

In [23]:
import numpy as np
import pandas as pd
import sys

from sklearn import svm
from sklearn.datasets import load_iris, load_digits
from sklearn.model_selection import (
    KFold, ShuffleSplit, GroupKFold,
    StratifiedKFold, StratifiedShuffleSplit,
    LeaveOneOut, LeavePOut,
    cross_val_predict, cross_val_score, GridSearchCV,
)


print(sys.version_info)

sys.version_info(major=3, minor=6, micro=2, releaselevel='final', serial=0)


### 1. Loading the hERG activity data extracted from ChEMBL

Load the hERG-all.csv data in the advanced/data-set/training directory.  
It contains information on compounds whose activity was measured in hERG-related assays, extracted from ChEMBL.  
For how the data were extracted, please refer to [Extract-hERG-assay-data-from-ChEMBL.ipynb](https://gist.github.com/yamasakih/62ef4b7396e1681f58a1c06543984c43).

In [24]:
df = pd.read_csv('data-set/training/hERG-all.csv', sep='\t')
df.head()

,Unnamed: 0,molregno,assay_chembl_id,description,assay_organism,assay_tissue,assay_cell_type,standard_relation,published_value,published_units,standard_value,standard_units,standard_type,activity_comment,published_type,data_validity_comment,published_relation,pchembl_value,standard_inchi,canonical_smiles
0,0,112651,CHEMBL656604,K+ channel blocking activity in COS-7 African ...,Homo sapiens,,COS-7,=,550.0,nM,550.0,nM,IC50,,IC50,,=,6.26,InChI=1S/C24H34N2O/c1-21(2)19-27-20-24(25-15-9...,CC(C)COCC(CN(Cc1ccccc1)c2ccccc2)N3CCCC3
1,1,65351,CHEMBL875385,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,34400.0,nM,34400.0,nM,IC50,,IC50,,=,4.46,InChI=1S/C19H22F2N4O3/c1-8-5-24(6-9(2)23-8)17-...,C[C@@H]1CN(C[C@H](C)N1)c2c(F)c(N)c3C(=O)C(=CN(...
2,2,6216,CHEMBL662311,K+ channel blocking activity in Chinese hamste...,Homo sapiens,,,=,1470.0,nM,1470.0,nM,IC50,,IC50,,=,5.83,InChI=1S/C17H19ClN2S/c1-19(2)10-5-11-20-14-6-3...,CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13
3,3,1543376,CHEMBL691013,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,5950.0,nM,5950.0,nM,IC50,,IC50,,=,5.23,InChI=1S/C19H20N2O3/c22-18-10-21-12-5-11(18)6-...,O=C(O[C@@H]1C[C@@H]2C[C@H]3C[C@H](C1)N2CC3=O)c...
4,4,65605,CHEMBL691013,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,0.9,nM,0.9,nM,IC50,,IC50,,=,9.05,InChI=1S/C28H31FN4O/c1-34-25-12-8-21(9-13-25)1...,COc1ccc(CCN2CCC(CC2)Nc3nc4ccccc4n3Cc5ccc(F)cc5...
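
A column named `Unnamed: 0` usually appears when a DataFrame was saved together with its index and then read back without declaring which column holds that index. As a minimal sketch (using a small in-memory, tab-separated sample in place of hERG-all.csv, so the values are a hypothetical stand-in), passing `index_col=0` to `pd.read_csv` reads that column as the row index instead:

```python
import io

import pandas as pd

# Hypothetical stand-in for the tab-separated hERG-all.csv file:
# the first header cell is empty because it held the saved index.
sample_tsv = "\tmolregno\tpchembl_value\n0\t112651\t6.26\n1\t65351\t4.46\n"

# Default read: the unnamed first column is loaded as 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(sample_tsv), sep='\t')

# With index_col=0 the first column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(sample_tsv), sep='\t', index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

This is only an alternative to consider; the walkthrough here keeps the default read and removes the column explicitly.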


Check the number of samples and the number of columns with `shape`.

In [25]:
df.shape

(7016, 20)

There are 7016 samples and 20 columns.  
Next, the first column is not needed, so it will be removed. First, check the column names again.

In [26]:
df.columns

Index(['Unnamed: 0', 'molregno', 'assay_chembl_id', 'description',
       'assay_organism', 'assay_tissue', 'assay_cell_type',
       'standard_relation', 'published_value', 'published_units',
       'standard_value', 'standard_units', 'standard_type', 'activity_comment',
       'published_type', 'data_validity_comment', 'published_relation',
       'pchembl_value', 'standard_inchi', 'canonical_smiles'],
      dtype='object')

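The first column is named `Unnamed: 0`. Remove it with the `drop` method.  
Pass the column name to `drop`, together with `axis=1` so that the name is interpreted as a column label rather than a row label.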

In [27]:
df.drop('Unnamed: 0', axis=1).head()

,molregno,assay_chembl_id,description,assay_organism,assay_tissue,assay_cell_type,standard_relation,published_value,published_units,standard_value,standard_units,standard_type,activity_comment,published_type,data_validity_comment,published_relation,pchembl_value,standard_inchi,canonical_smiles
0,112651,CHEMBL656604,K+ channel blocking activity in COS-7 African ...,Homo sapiens,,COS-7,=,550.0,nM,550.0,nM,IC50,,IC50,,=,6.26,InChI=1S/C24H34N2O/c1-21(2)19-27-20-24(25-15-9...,CC(C)COCC(CN(Cc1ccccc1)c2ccccc2)N3CCCC3
1,65351,CHEMBL875385,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,34400.0,nM,34400.0,nM,IC50,,IC50,,=,4.46,InChI=1S/C19H22F2N4O3/c1-8-5-24(6-9(2)23-8)17-...,C[C@@H]1CN(C[C@H](C)N1)c2c(F)c(N)c3C(=O)C(=CN(...
2,6216,CHEMBL662311,K+ channel blocking activity in Chinese hamste...,Homo sapiens,,,=,1470.0,nM,1470.0,nM,IC50,,IC50,,=,5.83,InChI=1S/C17H19ClN2S/c1-19(2)10-5-11-20-14-6-3...,CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13
3,1543376,CHEMBL691013,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,5950.0,nM,5950.0,nM,IC50,,IC50,,=,5.23,InChI=1S/C19H20N2O3/c22-18-10-21-12-5-11(18)6-...,O=C(O[C@@H]1C[C@@H]2C[C@H]3C[C@H](C1)N2CC3=O)c...
4,65605,CHEMBL691013,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,0.9,nM,0.9,nM,IC50,,IC50,,=,9.05,InChI=1S/C28H31FN4O/c1-34-25-12-8-21(9-13-25)1...,COc1ccc(CCN2CCC(CC2)Nc3nc4ccccc4n3Cc5ccc(F)cc5...
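
Note that `drop` returns a new DataFrame and leaves the original untouched unless the result is assigned. The sketch below illustrates this on a small toy DataFrame (the values are for illustration only). It also shows the `columns=` keyword, which is equivalent to `axis=1` and is available in pandas 0.21 and later; whether the environment used here supports it is an assumption.

```python
import pandas as pd

# Toy data standing in for the real table.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'molregno': [112651, 65351],
    'pchembl_value': [6.26, 4.46],
})

# drop returns a new DataFrame; `toy` itself is not modified.
dropped = toy.drop('Unnamed: 0', axis=1)

# Equivalent spelling with the columns= keyword (pandas >= 0.21).
dropped_kw = toy.drop(columns='Unnamed: 0')

print(list(toy.columns))      # still contains 'Unnamed: 0'
print(list(dropped.columns))  # 'Unnamed: 0' is gone
```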


The column can also be removed by specifying it as follows, which may be more convenient when removing several columns at once.

In [28]:
df.columns[[0]]

Index(['Unnamed: 0'], dtype='object')

In [29]:
df.drop(df.columns[[0]], axis=1).head()

,molregno,assay_chembl_id,description,assay_organism,assay_tissue,assay_cell_type,standard_relation,published_value,published_units,standard_value,standard_units,standard_type,activity_comment,published_type,data_validity_comment,published_relation,pchembl_value,standard_inchi,canonical_smiles
0,112651,CHEMBL656604,K+ channel blocking activity in COS-7 African ...,Homo sapiens,,COS-7,=,550.0,nM,550.0,nM,IC50,,IC50,,=,6.26,InChI=1S/C24H34N2O/c1-21(2)19-27-20-24(25-15-9...,CC(C)COCC(CN(Cc1ccccc1)c2ccccc2)N3CCCC3
1,65351,CHEMBL875385,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,34400.0,nM,34400.0,nM,IC50,,IC50,,=,4.46,InChI=1S/C19H22F2N4O3/c1-8-5-24(6-9(2)23-8)17-...,C[C@@H]1CN(C[C@H](C)N1)c2c(F)c(N)c3C(=O)C(=CN(...
2,6216,CHEMBL662311,K+ channel blocking activity in Chinese hamste...,Homo sapiens,,,=,1470.0,nM,1470.0,nM,IC50,,IC50,,=,5.83,InChI=1S/C17H19ClN2S/c1-19(2)10-5-11-20-14-6-3...,CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13
3,1543376,CHEMBL691013,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,5950.0,nM,5950.0,nM,IC50,,IC50,,=,5.23,InChI=1S/C19H20N2O3/c22-18-10-21-12-5-11(18)6-...,O=C(O[C@@H]1C[C@@H]2C[C@H]3C[C@H](C1)N2CC3=O)c...
4,65605,CHEMBL691013,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,0.9,nM,0.9,nM,IC50,,IC50,,=,9.05,InChI=1S/C28H31FN4O/c1-34-25-12-8-21(9-13-25)1...,COc1ccc(CCN2CCC(CC2)Nc3nc4ccccc4n3Cc5ccc(F)cc5...
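
To make the "several columns at once" case concrete, here is a minimal sketch on toy data. Indexing `columns` with a list of positions returns the matching labels, which can be passed straight to `drop`. The positions chosen below are arbitrary and only for illustration.

```python
import pandas as pd

# Toy data; the column names mirror a few of those in the real table.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'molregno': [112651, 65351],
    'standard_units': ['nM', 'nM'],
    'pchembl_value': [6.26, 4.46],
})

# Select the labels at positions 0 and 2, then drop them together.
to_remove = toy.columns[[0, 2]]
trimmed = toy.drop(to_remove, axis=1)

print(list(to_remove))
print(list(trimmed.columns))
```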


Store the result of either approach under the name `df` again.

In [31]:
df = df.drop(df.columns[[0]], axis=1)
df.head()

,assay_chembl_id,description,assay_organism,assay_tissue,assay_cell_type,standard_relation,published_value,published_units,standard_value,standard_units,standard_type,activity_comment,published_type,data_validity_comment,published_relation,pchembl_value,standard_inchi,canonical_smiles
0,CHEMBL656604,K+ channel blocking activity in COS-7 African ...,Homo sapiens,,COS-7,=,550.0,nM,550.0,nM,IC50,,IC50,,=,6.26,InChI=1S/C24H34N2O/c1-21(2)19-27-20-24(25-15-9...,CC(C)COCC(CN(Cc1ccccc1)c2ccccc2)N3CCCC3
1,CHEMBL875385,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,34400.0,nM,34400.0,nM,IC50,,IC50,,=,4.46,InChI=1S/C19H22F2N4O3/c1-8-5-24(6-9(2)23-8)17-...,C[C@@H]1CN(C[C@H](C)N1)c2c(F)c(N)c3C(=O)C(=CN(...
2,CHEMBL662311,K+ channel blocking activity in Chinese hamste...,Homo sapiens,,,=,1470.0,nM,1470.0,nM,IC50,,IC50,,=,5.83,InChI=1S/C17H19ClN2S/c1-19(2)10-5-11-20-14-6-3...,CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13
3,CHEMBL691013,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,5950.0,nM,5950.0,nM,IC50,,IC50,,=,5.23,InChI=1S/C19H20N2O3/c22-18-10-21-12-5-11(18)6-...,O=C(O[C@@H]1C[C@@H]2C[C@H]3C[C@H](C1)N2CC3=O)c...
4,CHEMBL691013,K+ channel blocking activity in human embryoni...,Homo sapiens,,,=,0.9,nM,0.9,nM,IC50,,IC50,,=,9.05,InChI=1S/C28H31FN4O/c1-34-25-12-8-21(9-13-25)1...,COc1ccc(CCN2CCC(CC2)Nc3nc4ccccc4n3Cc5ccc(F)cc5...
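
One caveat with the positional form: `df = df.drop(df.columns[[0]], axis=1)` removes whatever column is currently first, so executing the cell a second time removes the next column as well. The output above no longer lists `molregno`, which would be consistent with that. A minimal sketch on toy data contrasts this with a name-based variant that is safe to re-run; `errors='ignore'` makes `drop` skip labels that are already gone.

```python
import pandas as pd

def make_toy():
    # Toy data standing in for the real table.
    return pd.DataFrame({
        'Unnamed: 0': [0, 1],
        'molregno': [112651, 65351],
        'pchembl_value': [6.26, 4.46],
    })

# Positional drop executed twice: the second run also removes 'molregno'.
by_position = make_toy()
for _ in range(2):
    by_position = by_position.drop(by_position.columns[[0]], axis=1)

# Name-based drop executed twice: the second run is a no-op.
by_name = make_toy()
for _ in range(2):
    by_name = by_name.drop('Unnamed: 0', axis=1, errors='ignore')

print(list(by_position.columns))
print(list(by_name.columns))
```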


### 2. Narrowing down the activity data

Perform autoscaling.
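
Autoscaling is commonly understood as standardizing each variable to zero mean and unit standard deviation. The sketch below assumes that definition and uses a small toy array; which columns are scaled in this analysis, and whether the sample (`ddof=1`) or population standard deviation is intended, are assumptions here.

```python
import numpy as np

# Toy matrix: rows are samples, columns are variables.
X = np.array([
    [550.0, 6.26],
    [34400.0, 4.46],
    [1470.0, 5.83],
])

# Column-wise mean and sample standard deviation (ddof=1 is an assumption).
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)

# Autoscale: center each column, then divide by its standard deviation.
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0, ddof=1))
```

`sklearn.preprocessing.StandardScaler` performs the same kind of transformation, but it uses the population standard deviation (`ddof=0`), so its results differ slightly from the sketch above on small samples.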