## Use sklearn's GridSearchCV to find the optimal hyperparameters, then build and interpret the final random forest model.

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import graphviz
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
import warnings
warnings.filterwarnings('ignore')

In [3]:
# set the display option in pandas
pd.set_option('display.float_format', lambda x: '%.3f' % x)	
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('max_info_columns', 1001)

# Loading csv

In [4]:
data = pd.read_csv("./csvfiles/유방암.csv",encoding='cp949')

In [5]:
data.head()

Unnamed: 0,diagnosis,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,0,17.99,10.38,122.8,1001.0,0.118,0.278,0.3,0.147,0.242,0.079,1.095,0.905,8.589,153.4,0.006,0.049,0.054,0.016,0.03,0.006,25.38,17.33,184.6,2019.0,0.162,0.666,0.712,0.265,0.46,0.119
1,0,20.57,17.77,132.9,1326.0,0.085,0.079,0.087,0.07,0.181,0.057,0.543,0.734,3.398,74.08,0.005,0.013,0.019,0.013,0.014,0.004,24.99,23.41,158.8,1956.0,0.124,0.187,0.242,0.186,0.275,0.089
2,0,19.69,21.25,130.0,1203.0,0.11,0.16,0.197,0.128,0.207,0.06,0.746,0.787,4.585,94.03,0.006,0.04,0.038,0.021,0.022,0.005,23.57,25.53,152.5,1709.0,0.144,0.424,0.45,0.243,0.361,0.088
3,0,11.42,20.38,77.58,386.1,0.142,0.284,0.241,0.105,0.26,0.097,0.496,1.156,3.445,27.23,0.009,0.075,0.057,0.019,0.06,0.009,14.91,26.5,98.87,567.7,0.21,0.866,0.687,0.258,0.664,0.173
4,0,20.29,14.34,135.1,1297.0,0.1,0.133,0.198,0.104,0.181,0.059,0.757,0.781,5.438,94.44,0.011,0.025,0.057,0.019,0.018,0.005,22.54,16.67,152.2,1575.0,0.137,0.205,0.4,0.163,0.236,0.077


## check target

In [6]:
data['diagnosis'].describe()

count   569.000
mean      0.627
std       0.484
min       0.000
25%       0.000
50%       1.000
75%       1.000
max       1.000
Name: diagnosis, dtype: float64

In [7]:
# 0 is malignant
print('Malignant (0) count:', len(data[data['diagnosis']==0]))

# 1 is benign
print('Benign (1) count:', len(data[data['diagnosis']==1]))

Malignant (0) count: 212
Benign (1) count: 357


The target contains only 0s and 1s, so this is a binary classification problem.
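Since the two classes are not equally frequent, it is worth knowing what accuracy a trivial classifier would reach. The sketch below rebuilds a stand-in for `data['diagnosis']` from the printed counts (212 and 357); the `diagnosis_demo` series is hypothetical and only mirrors those counts.

```python
import pandas as pd

# Hypothetical stand-in for data['diagnosis'], built from the counts printed above
diagnosis_demo = pd.Series([0] * 212 + [1] * 357, name="diagnosis")

# normalize=True returns proportions instead of raw counts
class_share = diagnosis_demo.value_counts(normalize=True).sort_index()
print(class_share.round(3))

# Accuracy of always predicting the majority class: a floor any model should beat
majority_baseline = class_share.max()
print("Majority-class baseline accuracy: {:.3f}".format(majority_baseline))
```

Any accuracy reported later should be read against this floor rather than against 50%.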

In [8]:
data.isna().any()

diagnosis                  False
mean radius                False
mean texture               False
mean perimeter             False
mean area                  False
mean smoothness            False
mean compactness           False
mean concavity             False
mean concave points        False
mean symmetry              False
mean fractal dimension     False
radius error               False
texture error              False
perimeter error            False
area error                 False
smoothness error           False
compactness error          False
concavity error            False
concave points error       False
symmetry error             False
fractal dimension error    False
worst radius               False
worst texture              False
worst perimeter            False
worst area                 False
worst smoothness           False
worst compactness          False
worst concavity            False
worst concave points       False
worst symmetry             False
worst frac

There are no missing values.

In [9]:
from sklearn.model_selection import train_test_split

y = data["diagnosis"]
x = data.drop("diagnosis", axis = 1)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=42)
print("train data X size:", train_x.shape)
print("train data y size:", train_y.shape)
print("test data X size:", test_x.shape)
print("test data y size:", test_y.shape)


train data X size: (398, 30)
train data y size: (398,)
test data X size: (171, 30)
test data y size: (171,)
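The split above does not pass `stratify`, so the class ratio in the train and test sets can drift slightly from the full data. A minimal sketch of a stratified split follows. It assumes that sklearn's bundled `load_breast_cancer` dataset can stand in for the CSV (the column names and class counts shown above match it); the `strat_*` names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: load_breast_cancer stands in for the CSV used in this notebook
x_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)

# stratify=y_demo keeps the 0/1 ratio (nearly) identical in both subsets
strat_train_x, strat_test_x, strat_train_y, strat_test_y = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("train shape:", strat_train_x.shape, "test shape:", strat_test_x.shape)
print("benign share, train: {:.3f}".format(strat_train_y.mean()))
print("benign share, test:  {:.3f}".format(strat_test_y.mean()))
```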


In [10]:
tree_uncustom = RandomForestClassifier(random_state=1234, criterion='entropy', max_depth=2, min_samples_leaf=30, n_estimators=9,
                                      min_samples_split=2)
tree_uncustom.fit(train_x, train_y)
# Accuracy on the training data
print("Accuracy on training set:{:.3f}".format(tree_uncustom.score(train_x, train_y)))

# Accuracy on the test data
print("Accuracy on test set:{:.3f}".format(tree_uncustom.score(test_x, test_y)))

Accuracy on training set:0.937
Accuracy on test set:0.953


The baseline model reaches 95.3% accuracy on the test set.

# Finding the optimal parameters with GridSearchCV (Random Forest)

In [11]:
from sklearn.model_selection import GridSearchCV

In [12]:
# Set the search range for each parameter
criterion = ['gini', 'entropy']
max_depth = list(range(1,10))
list_min_leaf_size = [i * 10 for i in range (1,6)]
n_est = list(range(1, 15))

In [13]:
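# Note: for RandomForestClassifier, 'auto' means the same as 'sqrt', so the two
# max_features values are duplicates; newer scikit-learn releases no longer accept 'auto'.
parameters = {'n_estimators':n_est,'criterion':criterion, 'max_depth': max_depth,
             'min_samples_leaf':list_min_leaf_size, 'max_features':['auto','sqrt']}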
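Before launching an exhaustive search, it can help to estimate its cost. The sketch below repeats the ranges defined in the cells above and multiplies their lengths; `n_folds = 10` is taken from the `cv = 10` argument of the GridSearchCV call in this notebook, and the names `param_grid_demo`, `n_candidates`, and `n_fits` are illustrative.

```python
from math import prod

# Same search ranges as the cells above
param_grid_demo = {
    "n_estimators": list(range(1, 15)),
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 10)),
    "min_samples_leaf": [i * 10 for i in range(1, 6)],
    "max_features": ["auto", "sqrt"],
}

# Grid search tries every combination: 14 * 2 * 9 * 5 * 2
n_candidates = prod(len(values) for values in param_grid_demo.values())

# Each candidate is fit once per cross-validation fold
n_folds = 10
n_fits = n_candidates * n_folds

print("candidates:", n_candidates)  # candidates: 2520
print("model fits:", n_fits)        # model fits: 25200
```

Because `'auto'` and `'sqrt'` select the same setting for a classifier, half of these candidates are duplicates; listing only `'sqrt'` would halve the number of fits.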

In [14]:
dt_random = GridSearchCV(estimator = tree_uncustom, param_grid = parameters, 
                               cv = 10, n_jobs=-1)

In [15]:
dt_random.fit(train_x, train_y)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=30, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=9, n_jobs=None,
            oob_score=False, random_state=1234, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 'criterion': ['gini', 'entropy'], 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'min_samples_leaf': [10, 20, 30, 40, 50], 'max_features': ['auto', 'sqrt']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [16]:
params = dt_random.best_params_

In [17]:
params

{'criterion': 'gini',
 'max_depth': 4,
 'max_features': 'auto',
 'min_samples_leaf': 10,
 'n_estimators': 4}
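Besides `best_params_`, a fitted GridSearchCV object exposes `best_score_` (the mean cross-validated score of the winning candidate) and `cv_results_` (scores for every candidate). The sketch below shows how these can be inspected. It is not the notebook's search: it assumes `load_breast_cancer` as a stand-in for the CSV and uses a deliberately tiny, hypothetical grid so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: load_breast_cancer stands in for the CSV used in this notebook
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Hypothetical small grid (4 candidates), much smaller than the notebook's grid
small_grid = {"max_depth": [2, 4], "n_estimators": [5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=1234), small_grid, cv=5)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("mean CV accuracy of best candidate: {:.3f}".format(search.best_score_))

# One row per candidate; rank_test_score == 1 marks the best one
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results.sort_values("rank_test_score")[columns])
```

Comparing `mean_test_score` against `std_test_score` shows whether the top candidates are meaningfully different or within fold-to-fold noise of each other.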

In [18]:
print("Accuracy on test set:{:.3f}".format(dt_random.score(test_x, test_y)))

Accuracy on test set:0.965


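### After parameter optimization, the Random Forest model predicted the target feature with 96.5% accuracy on the test set (baseline model: 95.3%).
* GridSearchCV selected the following RandomForestClassifier parameters:
    * {'criterion': 'gini', 'max_depth': 4, 'max_features': 'auto','min_samples_leaf': 10, 'n_estimators': 4}
* On the 171-sample test set, this corresponds to 165 versus 163 correctly classified samples, so the improvement is small.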
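One common way to interpret a fitted random forest is its impurity-based feature importances. The sketch below refits a forest with the parameters reported above and ranks the features. Two assumptions are made: `load_breast_cancer` stands in for the CSV, and `max_features='sqrt'` replaces `'auto'` (the two are equivalent for a classifier, and recent scikit-learn versions reject `'auto'`). The resulting ranking may differ from what the notebook's own fitted model would give.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: load_breast_cancer stands in for the CSV used in this notebook
X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_all, y_all, test_size=0.3, random_state=42)

# Parameters reported by the grid search; 'sqrt' is used in place of 'auto'
final_model = RandomForestClassifier(
    criterion="gini", max_depth=4, max_features="sqrt",
    min_samples_leaf=10, n_estimators=4, random_state=1234,
)
final_model.fit(X_fit, y_fit)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(final_model.feature_importances_, index=X_fit.columns)
print(importances.sort_values(ascending=False).head(10).round(3))
```

With only four shallow trees, many features receive zero importance simply because no tree happened to split on them, so the ranking should be read as indicative rather than stable.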