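### Cross-validation and hyperparameter tuning
**Hyperparameters**
- Settings of a machine learning algorithm that are chosen before training rather than learned from the data
- Tuning their values can improve predictive performance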

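### GridSearchCV constructor parameters
- estimator = a classifier, regressor, or pipeline
- param_grid = a dictionary mapping each key to a list of values (the hyperparameters to tune for the estimator)
    - key = parameter name, list = candidate values for that parameter
- scoring = the evaluation metric used to measure predictive performance
    - usually a string naming the metric
    - e.g. 'accuracy'
- cv = number of folds for cross-validation
- refit = whether to retrain the estimator on the full training data with the best hyperparameters once they are found
    - default = True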
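
To make the `param_grid` and `cv` descriptions above concrete, the sketch below counts how many candidates a grid expands to. It uses `ParameterGrid` from `sklearn.model_selection`, which is not part of the original notes; the grid values and `n_folds` are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid (assumed values): 3 depths x 2 split thresholds
param_grid = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}

# ParameterGrid enumerates every combination of the listed values
candidates = list(ParameterGrid(param_grid))
n_folds = 3  # assumed value for cv

print(len(candidates))            # number of hyperparameter combinations
print(len(candidates) * n_folds)  # number of cross-validation fits
for params in candidates:
    print(params)
```

With refit enabled, one extra fit on the full training data follows the cross-validation fits.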

In [7]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data,
                                                    iris.target,
                                                    test_size=0.2,
                                                    random_state=121)
dt_clf = DecisionTreeClassifier()

# The hyperparameter grid is given as a dictionary:
# key = name of a decision tree hyperparameter
# value = list of candidate values to try
parameter = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}


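**min_samples_split**
= the minimum number of samples a node must contain before it can be split into child nodes
- With min_samples_split = 4
    - a node with fewer than 4 samples is not split (a node with exactly 4 samples can still be split)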
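
A minimal sketch of this threshold on toy data (the arrays and variable names below are made up for illustration): a root node with 3 samples stays unsplit when `min_samples_split=4`, while a root with exactly 4 samples is split.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one feature, two classes
X3, y3 = [[0], [1], [2]], [0, 0, 1]            # root node holds 3 samples
X4, y4 = [[0], [1], [2], [3]], [0, 0, 1, 1]    # root node holds 4 samples

# 3 samples < min_samples_split=4, so the root remains a single leaf
tree_small = DecisionTreeClassifier(min_samples_split=4, random_state=0).fit(X3, y3)

# 4 samples meets the threshold, so the root is split into two leaves
tree_exact = DecisionTreeClassifier(min_samples_split=4, random_state=0).fit(X4, y4)

print(tree_small.tree_.node_count)
print(tree_exact.tree_.node_count)
```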

In [8]:
grid_tree = GridSearchCV(dt_clf, param_grid=parameter, cv=3, refit=True,
                         return_train_score=True)

grid_tree.fit(X_train, y_train)

scores_df = pd.DataFrame(grid_tree.cv_results_)

In [9]:
scores_df

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_min_samples_split | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00133 | 0.000469 | 0.000332 | 0.0004695721 | 1 | 2 | {'max_depth': 1, 'min_samples_split': 2} | 0.7 | 0.7 | 0.7 | 0.7 | 1.110223e-16 | 5 | 0.7 | 0.7 | 0.7 | 0.7 | 1.110223e-16 |
| 1 | 0.000996 | 1e-06 | 0.000998 | 2.973602e-07 | 1 | 3 | {'max_depth': 1, 'min_samples_split': 3} | 0.7 | 0.7 | 0.7 | 0.7 | 1.110223e-16 | 5 | 0.7 | 0.7 | 0.7 | 0.7 | 1.110223e-16 |
| 2 | 0.001332 | 0.000472 | 0.000332 | 0.0004688977 | 2 | 2 | {'max_depth': 2, 'min_samples_split': 2} | 0.925 | 1.0 | 0.95 | 0.958333 | 0.03118048 | 3 | 0.975 | 0.9375 | 0.9625 | 0.958333 | 0.01559024 |
| 3 | 0.001328 | 0.000461 | 0.00066 | 0.0004666603 | 2 | 3 | {'max_depth': 2, 'min_samples_split': 3} | 0.925 | 1.0 | 0.95 | 0.958333 | 0.03118048 | 3 | 0.975 | 0.9375 | 0.9625 | 0.958333 | 0.01559024 |
| 4 | 0.001995 | 1e-06 | 0.000666 | 0.0009425159 | 3 | 2 | {'max_depth': 3, 'min_samples_split': 2} | 0.975 | 1.0 | 0.95 | 0.975 | 0.02041241 | 1 | 0.9875 | 0.9625 | 0.9875 | 0.979167 | 0.01178511 |
| 5 | 0.001994 | 0.000814 | 0.004654 | 0.005891373 | 3 | 3 | {'max_depth': 3, 'min_samples_split': 3} | 0.975 | 1.0 | 0.95 | 0.975 | 0.02041241 | 1 | 0.9875 | 0.9625 | 0.9875 | 0.979167 | 0.01178511 |


In [12]:
scores_df[['params', 'mean_test_score', 'rank_test_score']]

| | params | mean_test_score | rank_test_score |
|---|---|---|---|
| 0 | {'max_depth': 1, 'min_samples_split': 2} | 0.7 | 5 |
| 1 | {'max_depth': 1, 'min_samples_split': 3} | 0.7 | 5 |
| 2 | {'max_depth': 2, 'min_samples_split': 2} | 0.958333 | 3 |
| 3 | {'max_depth': 2, 'min_samples_split': 3} | 0.958333 | 3 |
| 4 | {'max_depth': 3, 'min_samples_split': 2} | 0.975 | 1 |
| 5 | {'max_depth': 3, 'min_samples_split': 3} | 0.975 | 1 |
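
Several candidates in the results above share the same rank. The sketch below is one way to inspect such ties and to see which row `GridSearchCV` reports as the best. It re-runs a search of the same shape under a different variable name (`grid`), and the `random_state=0` on the tree is an assumption added for reproducibility, so scores may differ from the output above. As far as I can tell, `best_index_` points at the first row holding the lowest rank.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=121)

# random_state=0 on the tree is an assumption, not part of the original notes
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [1, 2, 3],
                                'min_samples_split': [2, 3]},
                    cv=3)
grid.fit(X_train, y_train)

results = pd.DataFrame(grid.cv_results_)

# Every candidate that shares the top rank
tied = results[results['rank_test_score'] == 1]
print(tied[['params', 'mean_test_score']])

# The row that best_params_ and best_score_ are taken from
print(grid.best_index_, grid.best_params_)
```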


In [15]:
print(f'Best parameters: {grid_tree.best_params_}')
print(f'Best mean CV accuracy: {grid_tree.best_score_}')

Best parameters: {'max_depth': 3, 'min_samples_split': 2}
Best mean CV accuracy: 0.975


Evaluate the estimator refit with the best parameters on the held-out test set

In [18]:
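best_dt = grid_tree.best_estimator_  # estimator refit with the best hyperparameters

# refit=True already retrained it on the full training set, so predict directly
pred = best_dt.predict(X_test)
accuracy_score(y_test, pred)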

0.9666666666666667
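
Because refit is enabled, the fitted search object can also be used for prediction directly, without pulling out `best_estimator_` first. The sketch below checks that both routes give the same predictions. It rebuilds the search under the name `grid`, and the `random_state=0` on the tree is an assumption added for reproducibility, so the accuracy it prints may differ from the value above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=121)

# random_state=0 on the tree is an assumption, not part of the original notes
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [1, 2, 3],
                                'min_samples_split': [2, 3]},
                    cv=3, refit=True)
grid.fit(X_train, y_train)

# With refit=True, predict() on the search object delegates to best_estimator_
pred_via_grid = grid.predict(X_test)
pred_via_best = grid.best_estimator_.predict(X_test)
same = np.array_equal(pred_via_grid, pred_via_best)

test_accuracy = accuracy_score(y_test, pred_via_grid)
print(same)
print(test_accuracy)
```

Note that `best_score_` is a mean over cross-validation folds of the training data, so it does not have to match the test-set accuracy.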