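### Cross-validation

- Overfitting: the model is tuned so closely to the training data that it fits that data's quirks rather than the underlying pattern
- As a result, predictive performance drops sharply on data the model has not seen
- To detect this, the data is split into a training set and a test set; the model is fit on the former and evaluated on the latter
- But is a single split enough to guarantee that the reported score is trustworthy?
- Cross-validation addresses that question: it evaluates the model on several different splits of the available data, giving a more reliable estimate of its performance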
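
The concern about relying on one split can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification` (an assumption; it is not the wine data used later), and the names `X_demo`, `y_demo`, and `split_scores` are hypothetical. It refits the same tree on several random splits and prints how far the test accuracy moves.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the wine dataset from this notebook)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

split_scores = []
for seed in range(5):
    # Only the split changes between iterations; the model settings stay fixed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    tree = DecisionTreeClassifier(max_depth=2, random_state=13)
    tree.fit(X_tr, y_tr)
    split_scores.append(accuracy_score(y_te, tree.predict(X_te)))

print(split_scores)
print('spread :', max(split_scores) - min(split_scores))
```

If the spread is noticeable, a score from any single split says as much about the split as about the model.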

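### KFold cross-validation

- Split the training data into k folds (typically 3 to 5); each fold serves once as the validation set while the model is trained on the remaining folds
- The mean of the k validation accuracies (accuracy_score) is reported as the final accuracy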


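### Stratified KFold cross-validation
- Split the training data into k folds (typically 3 to 5) and validate on each fold in turn
- The mean of the k validation accuracies (accuracy_score) is reported as the final accuracy
- When the classes are unevenly distributed, each fold must preserve the class proportions of the full data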
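
The difference between the two splitters can be seen on a toy example. The sketch below uses hypothetical labels (`y_toy`, 8 samples of class 0 followed by 4 of class 1) and a hypothetical helper `class1_ratio_per_fold`. It reports the share of class 1 in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels (hypothetical): 8 samples of class 0, then 4 of class 1
y_toy = np.array([0] * 8 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect how folds are formed

def class1_ratio_per_fold(splitter):
    """Return the fraction of class-1 samples in each validation fold."""
    return [float(y_toy[val_idx].mean()) for _, val_idx in splitter.split(X_toy, y_toy)]

kfold_ratios = class1_ratio_per_fold(KFold(n_splits=4))
skfold_ratios = class1_ratio_per_fold(StratifiedKFold(n_splits=4))

print('KFold           :', kfold_ratios)
print('StratifiedKFold :', skfold_ratios)
```

Because the labels are sorted, unshuffled KFold produces validation folds containing only one class, while StratifiedKFold keeps the overall 1/3 share of class 1 in every fold.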

In [1]:
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

In [2]:
X

array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])

In [3]:
y

array([1, 2, 3, 4])

In [4]:
kf = KFold(n_splits=2)
# n_splits: number of folds to split the data into

In [5]:
print(kf.get_n_splits(X))
print(kf)

2
KFold(n_splits=2, random_state=None, shuffle=False)


In [7]:
for train_idx, validation_idx in kf.split(X):
    print('---idx')
    print('train idx :', train_idx) 
    print('validation idx :', validation_idx)
    print('---train data')
    print(X[train_idx])
    print('---validation data')
    print(X[validation_idx])
    print('-'*20)


---idx
train idx : [2 3]
validation idx : [0 1]
---train data
[[1 2]
 [3 4]]
---validation data
[[1 2]
 [3 4]]
--------------------
---idx
train idx : [0 1]
validation idx : [2 3]
---train data
[[1 2]
 [3 4]]
---validation data
[[1 2]
 [3 4]]
--------------------


### Rebuilding wine_tree

In [18]:
import pandas as pd

red_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-red.csv'
white_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-white.csv'

red_wine = pd.read_csv(red_url, sep=';')
white_wine = pd.read_csv(white_url, sep=';')

red_wine['color'] = 1
white_wine['color'] = 0

wine = pd.concat([red_wine, white_wine])

In [21]:
wine['taste'] = [1.0 if grade > 5 else 0.0 for grade in wine['quality']]

X = wine.drop(['taste', 'quality'], axis=1) # feature
y = wine['taste'] # label

### Is the accuracy of a wine_tree built from a single train/test split trustworthy?

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print('Train ACC : ', accuracy_score(y_train, y_pred_tr))
print('Test ACC : ', accuracy_score(y_test, y_pred_test))
# accuracy_score(y_true, y_pred): compares the true labels with the predictions and returns the accuracy

Train ACC :  0.7294593034442948
Test ACC :  0.7161538461538461


### Checking with KFold

In [23]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

### KFold returns indices

In [24]:
for train_idx, validation_idx in kfold.split(X):
    print(len(train_idx), len(validation_idx))

5197 1300
5197 1300
5198 1299
5198 1299
5198 1299


### accuracy_score after training on each of the 5 folds (n_splits=5)

In [35]:
cv_accuracy = []

for train_idx, validation_idx in kfold.split(X):
    X_train, X_validation = X.iloc[train_idx], X.iloc[validation_idx]
    y_train, y_validation = y.iloc[train_idx], y.iloc[validation_idx]
    wine_tree_cv.fit(X_train, y_train)
    pred = wine_tree_cv.predict(X_validation)
    cv_accuracy.append(accuracy_score(y_validation, pred))

cv_accuracy

[0.6007692307692307,
 0.6884615384615385,
 0.7090069284064665,
 0.7628945342571208,
 0.7867590454195535]
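
One likely reason the fold scores differ so much is that `wine` was built by concatenating red and white rows without shuffling, and `KFold` does not shuffle by default, so each fold covers a contiguous block of rows. The sketch below shows that behavior on hypothetical ordered data (`groups`, `X_ordered`) and compares it with `shuffle=True`. Whether shuffling is appropriate for the wine data is a judgment call this sketch does not settle.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical ordered data: the first 6 rows come from one source, the last 4
# from another, mimicking a frame built with pd.concat and left unshuffled
groups = np.array([1] * 6 + [0] * 4)
X_ordered = np.arange(len(groups)).reshape(-1, 1)

plain = KFold(n_splits=5)  # shuffle=False by default
shuffled = KFold(n_splits=5, shuffle=True, random_state=13)

plain_folds = [val_idx.tolist() for _, val_idx in plain.split(X_ordered)]
shuffled_folds = [val_idx.tolist() for _, val_idx in shuffled.split(X_ordered)]

print('plain    :', plain_folds)     # contiguous blocks of row positions
print('shuffled :', shuffled_folds)  # row positions drawn from across the data
```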

### Use the mean of the fold accuracies (accuracy_score) as the representative value

- Only on the condition that the variance across folds is not large

In [28]:
np.mean(cv_accuracy)

0.709578255462782
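
To judge whether the variance condition holds, the spread can be checked alongside the mean. The sketch below copies the five KFold accuracies listed above into a hypothetical array `kfold_scores` and prints their mean, standard deviation, and range.

```python
import numpy as np

# Fold accuracies copied from the KFold run above
kfold_scores = np.array([
    0.6007692307692307,
    0.6884615384615385,
    0.7090069284064665,
    0.7628945342571208,
    0.7867590454195535,
])

print('mean  :', kfold_scores.mean())
print('std   :', kfold_scores.std())
print('range :', kfold_scores.max() - kfold_scores.min())
```

A range of nearly 0.19 between the worst and best fold suggests the mean alone should be read with caution here.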

### Stratified KFold

In [37]:
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

cv_accuracy = []

for train_idx, validation_idx in skfold.split(X, y):
    X_train, X_validation = X.iloc[train_idx], X.iloc[validation_idx]
    y_train, y_validation = y.iloc[train_idx], y.iloc[validation_idx]
    wine_tree_cv.fit(X_train, y_train)
    pred = wine_tree_cv.predict(X_validation)
    cv_accuracy.append(accuracy_score(y_validation, pred))

cv_accuracy

[0.5523076923076923,
 0.6884615384615385,
 0.7143956889915319,
 0.7321016166281755,
 0.7567359507313318]

In [38]:
np.mean(cv_accuracy)

0.6888004974240539

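### Cross-validation in one call: cross_val_score
- cross_val_score(estimator (here, a decision tree), feature, label, scoring=, cv=)
- scoring=: the evaluation metric (e.g. 'accuracy'); None uses the estimator's default scorer, which is accuracy for classifiers
- cv=: the number of folds, or a splitter object such as StratifiedKFold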
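
The `scoring` and `cv` arguments can be explored with a small sketch. It uses synthetic data from `make_classification` (an assumption; the names `X_demo`, `y_demo`, and `scores_*` are hypothetical). For a classifier, passing an integer as `cv` makes scikit-learn use stratified folds, so the three calls below are expected to agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the wine dataset from this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=13)

# scoring=None falls back to the estimator's own score method (accuracy for classifiers)
scores_default = cross_val_score(tree, X_demo, y_demo, scoring=None, cv=5)
# An explicit metric name
scores_int = cross_val_score(tree, X_demo, y_demo, scoring='accuracy', cv=5)
# An explicit splitter object instead of an integer
scores_skf = cross_val_score(
    tree, X_demo, y_demo, scoring='accuracy', cv=StratifiedKFold(n_splits=5)
)

print(scores_int)
print(np.allclose(scores_default, scores_int), np.allclose(scores_int, scores_skf))
```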

In [39]:
from sklearn.model_selection import cross_val_score

X = wine.drop(['taste', 'quality'], axis=1) # feature
y = wine['taste'] # label

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

cross_val_score(wine_tree_cv, X, y, scoring=None, cv=skfold)

array([0.55230769, 0.68846154, 0.71439569, 0.73210162, 0.75673595])

### Increasing depth does not always improve ACC

In [40]:
wine_tree_cv = DecisionTreeClassifier(max_depth=5, random_state=13)

cross_val_score(wine_tree_cv, X, y, scoring=None, cv=skfold)

array([0.50076923, 0.62615385, 0.69745958, 0.7582756 , 0.74903772])
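
Rather than comparing two depths by hand, several values of `max_depth` can be scored in a loop. This is a minimal sketch on synthetic data (an assumption; `X_demo`, `y_demo`, and `depth_scores` are hypothetical names), so the numbers it prints say nothing about the wine data. It only shows the pattern.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the wine dataset from this notebook)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
skf = StratifiedKFold(n_splits=5)

depth_scores = {}
for depth in [2, 3, 5, 8]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=13)
    # Mean validation accuracy over the 5 folds for this depth
    depth_scores[depth] = cross_val_score(tree, X_demo, y_demo, cv=skf).mean()

for depth, score in depth_scores.items():
    print(depth, round(score, 4))
```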

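### Viewing train scores as well: cross_validate
- cross_validate(estimator (here, a decision tree), feature, label, scoring=, cv=, return_train_score=True)
- scoring=: the evaluation metric (e.g. 'accuracy'); None uses the estimator's default scorer, which is accuracy for classifiers
- cv=: the number of folds, or a splitter object such as StratifiedKFold
- return_train_score=True: also return the training score for each fold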

In [42]:
from sklearn.model_selection import cross_validate

X = wine.drop(['taste', 'quality'], axis=1) # feature
y = wine['taste'] # label

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

cross_validate(wine_tree_cv, X, y, scoring=None, cv=skfold, return_train_score=True)

{'fit_time': array([0.00863981, 0.00699806, 0.00700164, 0.00799966, 0.00955439]),
 'score_time': array([0.00112414, 0.00199986, 0.00108504, 0.00117445, 0.00269341]),
 'test_score': array([0.55230769, 0.68846154, 0.71439569, 0.73210162, 0.75673595]),
 'train_score': array([0.74773908, 0.74696941, 0.74317045, 0.73509042, 0.73258946])}
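
The returned dictionary is easier to read when the train and test scores are placed side by side. The sketch below copies the two score arrays from the output above into hypothetical variables and prints the per-fold gap and the means.

```python
import numpy as np

# Scores copied from the cross_validate output above
test_score = np.array([0.55230769, 0.68846154, 0.71439569, 0.73210162, 0.75673595])
train_score = np.array([0.74773908, 0.74696941, 0.74317045, 0.73509042, 0.73258946])

# A large positive gap means the model did much better on the data it was trained on
gap = train_score - test_score

print('gap per fold :', np.round(gap, 4))
print('train mean   :', train_score.mean())
print('test mean    :', test_score.mean())
```

The train scores stay close to 0.74 in every fold while the test scores range from about 0.55 to 0.76, so most of the instability comes from which rows land in the validation fold.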