# Hyperparameter Tuning & Cross-Validation

- k-fold cross validation

- (1-1) GridSearchCV

- (1-2) RandomizedSearchCV

---
(혼공, pp. 242-262)

Goal: understand cross-validation, which yields a stable validation score, and try GridSearchCV and RandomizedSearchCV, which perform hyperparameter tuning and cross-validation together.

### Read CSV File

In [28]:
import pandas as pd

df = pd.read_csv('https://bit.ly/wine_csv_data')
df.head()

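|   | alcohol | sugar | pH | class |
|---|---------|-------|------|-------|
| 0 | 9.4 | 1.9 | 3.51 | 0.0 |
| 1 | 9.8 | 2.6 | 3.2 | 0.0 |
| 2 | 9.8 | 2.3 | 3.26 | 0.0 |
| 3 | 9.8 | 1.9 | 3.16 | 0.0 |
| 4 | 9.4 | 1.9 | 3.51 | 0.0 |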


In [29]:
X=df.drop('class', axis=1)
y=df['class']

print(X.shape, y.shape)

(6497, 3) (6497,)


### Train Test Split

In [30]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

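### K-fold cross-validation
Obtain a stable validation score by averaging the scores over several folds.

cross_validate

The cv parameter accepts either the number of folds or a splitter object (for example, one that shuffles the data). The default is 5 folds.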
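
To make the fold count concrete, the sketch below shows how five unshuffled folds partition ten samples. The toy array X_toy is an assumption for illustration and is not the wine data. Note that for a classifier, an integer cv uses a stratified split rather than plain KFold, so this only illustrates the basic idea.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (hypothetical): 10 samples with a single feature
X_toy = np.arange(10).reshape(-1, 1)

# Without shuffling, each validation fold is a contiguous block of samples
kf = KFold(n_splits=5)
folds = [(train_idx, val_idx) for train_idx, val_idx in kf.split(X_toy)]

for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

Every sample lands in exactly one validation fold, so each of the five scores is computed on data the corresponding model never saw during fitting.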


(0-1)

In [31]:
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

cv = cross_validate(DecisionTreeClassifier(random_state=42), X_train, y_train)
cv

{'fit_time': array([0.01420569, 0.0133574 , 0.01267648, 0.01244092, 0.01243973]),
 'score_time': array([0.00467014, 0.00398588, 0.00459313, 0.00402069, 0.00397682]),
 'test_score': array([0.85128205, 0.84820513, 0.8788501 , 0.85112936, 0.84394251])}

In [32]:
import numpy as np

# The mean of the fold scores is the cross-validation score
print(np.mean(cv['test_score']))

0.8546818301479492


(0-2) splitter
> Regression: KFold(n_splits=k, shuffle=True)<br/>
> Classification: StratifiedKFold
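
The reason for preferring StratifiedKFold in classification can be sketched with an imbalanced toy label vector. The labels y_toy and the helper positive_ratio_per_fold are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (hypothetical): 80 samples of class 0 followed by 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))  # feature values do not affect how the folds are built

def positive_ratio_per_fold(splitter):
    """Return the fraction of class-1 samples in each validation fold."""
    return [float(y_toy[val_idx].mean()) for _, val_idx in splitter.split(X_toy, y_toy)]

plain = positive_ratio_per_fold(KFold(n_splits=5))
stratified = positive_ratio_per_fold(StratifiedKFold(n_splits=5))

print("KFold          :", plain)
print("StratifiedKFold:", stratified)
```

With sorted labels and no shuffling, plain KFold produces folds that contain only one class, while StratifiedKFold keeps the 80/20 class ratio in every fold.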

In [33]:
from sklearn.model_selection import StratifiedKFold

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv = cross_validate(DecisionTreeClassifier(random_state=42), X_train, y_train, cv=splitter)
cv

{'fit_time': array([0.01385069, 0.00985122, 0.00859833, 0.0092051 , 0.00852776]),
 'score_time': array([0.00266409, 0.00261641, 0.00235343, 0.00230908, 0.00216603]),
 'test_score': array([0.85538462, 0.86153846, 0.87268994, 0.85318275, 0.85626283])}

In [34]:
print(np.mean(cv['test_score']))

0.859811720107408


Next, look at GridSearchCV and RandomizedSearchCV, which perform hyperparameter tuning and cross-validation together.

### 1-1. GridSearchCV

Pass a dictionary of the parameters to optimize.

It finds the best parameters by evaluating **every combination** in the grid.
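
As a rough illustration of what "every combination" costs, the sketch below enumerates a grid with ParameterGrid from scikit-learn. The two-parameter grid and its values are assumptions chosen for the example, not the grid used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical two-parameter grid (values are illustrative only)
param_grid = {
    'min_impurity_decrease': [0.0001, 0.0002, 0.0003],
    'max_depth': [5, 10],
}

candidates = list(ParameterGrid(param_grid))
n_folds = 5  # default number of folds in GridSearchCV

print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # number of model fits during the search
for candidate in candidates:
    print(candidate)
```

The number of fits grows multiplicatively with each added parameter, which is the main motivation for sampling combinations at random when the search space is large.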

In [35]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
params = {'min_impurity_decrease': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}

# Hyperparameter tuning with cross-validation
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1)

# Fit
gs.fit(X_train, y_train)

# Best model (refit on the whole training set); print its training accuracy
model = gs.best_estimator_
print(model.score(X_train, y_train))

# Best parameters
print(gs.best_params_)

# Mean cross-validation score for each candidate
scores = gs.cv_results_['mean_test_score']
print(scores)

# Parameters at the best mean score (should match gs.best_params_)
best_index = np.argmax(scores)
print(gs.cv_results_['params'][best_index])

0.9137931034482759
{'min_impurity_decrease': 0.0003}
[0.86843111 0.86925267 0.87315179 0.87212531 0.87130627]
{'min_impurity_decrease': 0.0003}


### 1-2. RandomizedSearchCV

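When a hyperparameter is numeric, use the probability distributions provided by the scipy library.
> randint: integers, uniform: real numbers

It finds the best parameters by evaluating **random combinations** sampled from those distributions.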
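
The two scipy distributions are easy to misread, so here is a small sketch of the ranges they actually sample from. The sample size and the random_state value are arbitrary choices for the illustration.

```python
from scipy.stats import randint, uniform

# randint(low, high) draws integers from [low, high), so high is excluded
int_samples = randint(20, 50).rvs(size=1000, random_state=42)

# uniform(loc, scale) draws floats from [loc, loc + scale], not [loc, scale]
float_samples = uniform(0.0001, 0.001).rvs(size=1000, random_state=42)

print(int_samples.min(), int_samples.max())
print(float_samples.min(), float_samples.max())
```

In particular, uniform(0.0001, 0.001) covers 0.0001 to 0.0011, because the second argument is the width of the interval rather than its upper bound.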

In [36]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define parameter distributions; uniform(loc, scale) samples from [loc, loc + scale]
params = {
    'min_impurity_decrease' : uniform(0.0001, 0.001),
    'max_depth' : randint(20, 50),
    'min_samples_split' : randint(2, 25),
    'min_samples_leaf' : randint(1, 25),
}

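# Hyperparameter tuning with cross-validation
# Sample 100 parameter combinations; no random_state is set on the search, so results vary between runs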
rs = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), params, n_iter=100, n_jobs=-1)

# train
rs.fit(X_train, y_train)

# best model
dt = rs.best_estimator_
print(dt.score(X_test, y_test))

# best parameter
print(rs.best_params_)

# Best mean cross-validation score among the sampled candidates
print(np.max(rs.cv_results_['mean_test_score']))

0.8572307692307692
{'max_depth': 46, 'min_impurity_decrease': 0.00010656421036454747, 'min_samples_leaf': 11, 'min_samples_split': 2}
0.8754113620807665


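The validation score is usually slightly higher than the test score, because the hyperparameters were chosen to maximize the validation score.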