- (c) Ken-ichi Fukui, 2020/01/17
- MIT License

## Classifying the Breast Cancer data with an SVM
- Search for the best parameters with nested cross-validation

In [1]:
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.preprocessing import scale

### Loading the Breast Cancer data

In [2]:
df = load_breast_cancer()
X = df.data
y = df.target

# z-score standardization (each feature gets zero mean and unit variance)
X = scale(X)
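
As a quick check of what the z-score standardization step does, the sketch below verifies that every column of the scaled matrix has a mean of approximately 0 and a standard deviation of approximately 1. The names `X_raw`, `X_scaled`, `col_means`, and `col_stds` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import scale

# Load the raw feature matrix (569 samples x 30 features)
X_raw = load_breast_cancer().data

# scale() standardizes each column independently
X_scaled = scale(X_raw)

# Per-feature statistics after scaling
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print(X_scaled.shape)
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```

Note that scaling the whole dataset before splitting lets the test folds contribute to the scaling statistics. Fitting a `StandardScaler` inside a `Pipeline` would confine the scaling to each training split, if that matters for the evaluation.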

In [3]:
print(df.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

### Grid search over hyperparameters with nested cross-validation
- See the following for how to write the grid-search parameter list
- http://scikit-learn.org/stable/modules/grid_search.html#grid-search
- See the following for the parameters that SVC accepts
- http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
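
As an illustration of the parameter-list syntax, the sketch below compares a single dictionary grid with a list of per-kernel dictionaries. It assumes the gamma candidates are 0.01 through 100 with 0.2 as the fifth value. In `SVC`, `degree` is used only by the `poly` kernel and `coef0` only by `poly` and `sigmoid`, so a per-kernel list avoids re-fitting combinations that differ only in ignored parameters. The names `single_grid` and `per_kernel_grid` are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

gammas = [0.01, 0.02, 0.05, 0.1, 0.2, 1, 10, 100]
degrees = [1, 2, 3, 4]
coef0s = [1, 3, 5, 10, 100]

# One dictionary: the full Cartesian product over all four parameters
single_grid = {
    'kernel': ['poly', 'rbf', 'sigmoid'],
    'gamma': gammas,
    'degree': degrees,
    'coef0': coef0s,
}

# A list of dictionaries: each kernel searches only the parameters it uses
per_kernel_grid = [
    {'kernel': ['poly'], 'gamma': gammas, 'degree': degrees, 'coef0': coef0s},
    {'kernel': ['rbf'], 'gamma': gammas},
    {'kernel': ['sigmoid'], 'gamma': gammas, 'coef0': coef0s},
]

# ParameterGrid enumerates the candidates that GridSearchCV would try
n_single = len(ParameterGrid(single_grid))
n_per_kernel = len(ParameterGrid(per_kernel_grid))
print(n_single, n_per_kernel)
```

Either form can be passed as the second argument of `GridSearchCV`; the list form simply enumerates fewer candidates.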

In [4]:
# Cross-validation splitter for the outer loop
kfold = StratifiedKFold(n_splits=10, shuffle=True)

acc_trn_list = []  # accuracy on the training data for each outer-loop fold
acc_tst_list = []  # accuracy on the test data for each outer-loop fold

# Parameter grid for the grid search
parameters = {'kernel': ['poly', 'rbf', 'sigmoid'],
              'gamma': [0.01, 0.02, 0.05, 0.1, 0.2, 1, 10, 100],
              'degree': [1, 2, 3, 4],
              'coef0': [1, 3, 5, 10, 100]
              }
# Cross-validation instance that runs the grid search in the inner loop
gs = GridSearchCV(SVC(), parameters, cv=2)

k=0
# Outer loop: run the inner-loop grid search on each training split
for train_itr, test_itr in kfold.split(X, y):
    gs.fit(X[train_itr], y[train_itr])
    print('Fold #{:2d}; Best Parameter: {}, Accuracy: {:.3f}'\
        .format(k+1,gs.best_params_,gs.best_score_))
    acc_trn_list.append(gs.score(X[train_itr],y[train_itr]))
    acc_tst_list.append(gs.score(X[test_itr],y[test_itr]))
    k=k+1

Fold # 1; Best Parameter: {'coef0': 1, 'degree': 4, 'gamma': 0.02, 'kernel': 'poly'}, Accuracy: 0.977
Fold # 2; Best Parameter: {'coef0': 3, 'degree': 4, 'gamma': 0.01, 'kernel': 'poly'}, Accuracy: 0.971
Fold # 3; Best Parameter: {'coef0': 1, 'degree': 2, 'gamma': 0.05, 'kernel': 'poly'}, Accuracy: 0.977
Fold # 4; Best Parameter: {'coef0': 1, 'degree': 3, 'gamma': 0.02, 'kernel': 'poly'}, Accuracy: 0.975
Fold # 5; Best Parameter: {'coef0': 3, 'degree': 3, 'gamma': 0.01, 'kernel': 'poly'}, Accuracy: 0.979
Fold # 6; Best Parameter: {'coef0': 1, 'degree': 4, 'gamma': 0.05, 'kernel': 'poly'}, Accuracy: 0.984
Fold # 7; Best Parameter: {'coef0': 3, 'degree': 3, 'gamma': 0.01, 'kernel': 'poly'}, Accuracy: 0.980
Fold # 8; Best Parameter: {'coef0': 3, 'degree': 2, 'gamma': 0.01, 'kernel': 'poly'}, Accuracy: 0.979
Fold # 9; Best Parameter: {'coef0': 1, 'degree': 4, 'gamma': 0.02, 'kernel': 'poly'}, Accuracy: 0.982
Fold #10; Best Parameter: {'coef0': 5, 'degree': 2, 'gamma': 0.02, 'kernel': 'poly

### Mean accuracy

In [5]:
print('Training data: %1.3f' % np.mean(acc_trn_list))
print('Test data: %1.3f' % np.mean(acc_tst_list))

Training data: 0.988
Test data: 0.975
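
The explicit outer loop used in this notebook can also be written more compactly by passing the `GridSearchCV` object to `cross_val_score`, which refits the grid search on each outer training split and scores it on the held-out split. The following is a minimal sketch, not the notebook's own procedure: it assumes a deliberately reduced grid (`small_grid`), 5 outer folds, and a fixed `random_state` so that it runs quickly and reproducibly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.preprocessing import scale
from sklearn.svm import SVC

data = load_breast_cancer()
X, y = scale(data.data), data.target

# Reduced grid (assumption: chosen only to keep the example fast)
small_grid = {'kernel': ['poly', 'rbf'], 'gamma': [0.01, 0.1]}

# Inner loop: 2-fold grid search, as in the notebook
inner_search = GridSearchCV(SVC(), small_grid, cv=2)

# Outer loop: stratified folds (5 folds and the seed are assumptions)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One test accuracy per outer fold
outer_scores = cross_val_score(inner_search, X, y, cv=outer_cv)
print('Test data: %1.3f' % np.mean(outer_scores))
```

This form returns only the test-fold scores. The explicit loop is still needed when the per-fold best parameters or training accuracies should be reported as well.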
