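## [ Cross-validation - cross_val_score() / cross_validate() ]
- A method for evaluating a model stably and reliably when the dataset is small
- Split the training dataset into K folds and validate on a different fold each time
- The averaged cross-validation score is treated as the model's generalization performance
- The functions live in the sklearn.model_selection submodule
    * cross_val_score() / cross_val_predict(): return the per-fold scores / the out-of-fold predictions for the cv folds (only part of the available information)
    * cross_validate(): returns a wider range of information and is widely used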
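
As a minimal sketch of the two simpler helpers listed above, the following uses scikit-learn's bundled iris data (`load_iris`, an assumption made only to keep the example self-contained; the notebook itself reads a CSV file). The values `cv=5` and `n_neighbors=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Bundled iris data stands in for the CSV used later (assumption for illustration)
X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# cross_val_score: one accuracy value per fold, so an array of length cv
scores = cross_val_score(model, X, y, cv=5)
print(scores.shape, scores.mean())

# cross_val_predict: one out-of-fold prediction per sample, so an array of length len(y)
preds = cross_val_predict(model, X, y, cv=5)
print(preds.shape)
```

Averaging `scores` gives the single number usually reported as the cross-validated accuracy, while `preds` is useful for building a confusion matrix from predictions that were never made on training rows.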

[1] Load modules and prepare data <hr>

In [29]:
## [1-1] Load modules
import numpy as np
import pandas as pd

## -> ML-related modules
from sklearn.model_selection import cross_val_score, cross_validate  ## cross-validation
from sklearn.neighbors import KNeighborsClassifier                   ## learning algorithm

In [30]:
## [1-2] Prepare data
DATA_FILE = '../DATA/iris.csv'

irisDF = pd.read_csv(DATA_FILE)
irisDF.head(3)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa


[2] Data preprocessing and training preparation: skipped for lack of time => do this individually <hr>

In [31]:
## [2-1] Convert the variety column to a categorical dtype
pd.options.mode.copy_on_write = True

## Bracket indexing is required here: attribute-style assignment to a new name
## only sets a Python attribute and leaves the DataFrame columns unchanged
irisDF['variety'] = irisDF['variety'].astype('category')
irisDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   sepal.length  150 non-null    float64 
 1   sepal.width   150 non-null    float64 
 2   petal.length  150 non-null    float64 
 3   petal.width   150 non-null    float64 
 4   variety       150 non-null    category
dtypes: category(1), float64(4)
memory usage: 5.0 KB



In [32]:
## [2-2] Separate features and target
featureDF = irisDF[ irisDF.columns[:-1] ]
targetSR = irisDF[ irisDF.columns[-1] ]

print(f'featureDF : {featureDF.shape}, targetSR : {targetSR.shape}')

featureDF : (150, 4), targetSR : (150,)


In [33]:
## [2-3] Numeric columns => scale them depending on the learning algorithm
##                 => KNN is distance-based, so feature scaling is needed !!
## e.g. StandardScaler or a similar scaler can be used
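
One way to apply the scaling mentioned above without leaking validation data is to wrap the scaler and the model in a pipeline and pass the pipeline to `cross_validate()`. This is a sketch under assumptions: it uses the bundled `load_iris` data instead of the notebook's CSV, and the name `scaled_knn` and the choice `n_neighbors=5` are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled iris data stands in for featureDF / targetSR (assumption for illustration)
X, y = load_iris(return_X_y=True)

# The scaler is re-fit on the training part of every fold, so the
# validation part never influences the scaling statistics.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

result = cross_validate(scaled_knn, X, y, cv=3, return_train_score=True)
print(result['test_score'].mean(), result['train_score'].mean())
```

Scaling the whole dataset once before cross-validation would also run, but the validation folds would then contribute to the mean and standard deviation used for scaling, which makes the scores slightly optimistic.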

[3] Cross-validation <hr>

In [37]:
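## ========================================================
## cross_validate(): chooses the splitter automatically from the
##                   estimator and target when cv is an integer
##   - classifier with a binary/multiclass target => StratifiedKFold
##   - otherwise                                  => KFold
## It runs K-fold cross-validation and returns the train/validation
## scores and the fit/score times in one dict (plus the fitted models
## when return_estimator=True)
##
## - Key parameters
##   estimator           : model instance (required)
##   cv                  : defaults to 5 folds; accepts an int or a KFold / StratifiedKFold instance
##   return_train_score  : whether to also return the training-set scores
##   return_estimator    : whether to also return the model fitted on each fold
## ========================================================
## Empty DataFrame that collects one summary row per n_neighbors value
resultDF = pd.DataFrame(columns=['fit_time','score_time','test_score','train_score','neighbors'])

for neighbors in range(1,21):
    ## Create the model instance
    kModel = KNeighborsClassifier(n_neighbors=neighbors)

    ## Call the function
    resultDict = cross_validate(kModel,                      # model to evaluate (KNN, SVM, RF, ...)
                                featureDF,                   # input features
                                targetSR,                    # target (labels)
                                return_train_score=True,     # True to also get the train scores
                                # return_estimator=True,     # True to also get the model fitted on each fold
                                cv = 3)                      # number of folds, or a splitter object such as KFold

    ## Mean over the folds, read in the same order as the resultDF columns
    ret = [ resultDict[k].mean().item() for k in resultDF.columns[:-1] ]
    ret.append(neighbors)
    resultDF.loc[resultDF.shape[0]] = ret

resultDF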

Unnamed: 0,fit_time,score_time,test_score,train_score,neighbors
0,0.00295,0.00468,0.96,1.0,1.0
1,0.003381,0.004908,0.946667,0.976667,2.0
2,0.002508,0.002675,0.973333,0.963333,3.0
3,0.002209,0.002689,0.986667,0.96,4.0
4,0.002346,0.002144,0.98,0.97,5.0
5,0.00201,0.002346,0.973333,0.966667,6.0
6,0.003095,0.001816,0.973333,0.97,7.0
7,0.0,0.008172,0.966667,0.966667,8.0
8,0.001231,0.0,0.973333,0.97,9.0
9,0.001082,0.0,0.966667,0.973333,10.0
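
The `cv` and `return_estimator` parameters described in the cell above can also be set explicitly. The sketch below passes a `StratifiedKFold` instance instead of an integer and asks for the fitted models back. The bundled `load_iris` data, the seed `random_state=42`, and the names `splitter` and `fold_models` are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Bundled iris data stands in for featureDF / targetSR (assumption for illustration)
X, y = load_iris(return_X_y=True)

# Explicit splitter: shuffling with a fixed seed makes the folds reproducible
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

result = cross_validate(KNeighborsClassifier(n_neighbors=5), X, y,
                        cv=splitter,
                        return_train_score=True,
                        return_estimator=True)

print(sorted(result.keys()))

# One fitted model per fold is stored under the 'estimator' key
fold_models = result['estimator']
print(len(fold_models), type(fold_models[0]).__name__)
```

Passing a splitter object is mainly useful when the rows are ordered by class, as they often are in iris files, and shuffling before splitting is wanted.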


In [38]:
resultDF['diff'] = abs(resultDF['test_score']-resultDF['train_score'])
resultDF

Unnamed: 0,fit_time,score_time,test_score,train_score,neighbors,diff
0,0.00295,0.00468,0.96,1.0,1.0,0.04
1,0.003381,0.004908,0.946667,0.976667,2.0,0.03
2,0.002508,0.002675,0.973333,0.963333,3.0,0.01
3,0.002209,0.002689,0.986667,0.96,4.0,0.026667
4,0.002346,0.002144,0.98,0.97,5.0,0.01
5,0.00201,0.002346,0.973333,0.966667,6.0,0.006667
6,0.003095,0.001816,0.973333,0.97,7.0,0.003333
7,0.0,0.008172,0.966667,0.966667,8.0,0.0
8,0.001231,0.0,0.973333,0.97,9.0,0.003333
9,0.001082,0.0,0.966667,0.973333,10.0,0.006667
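
One possible way to read the table above is to rank the candidates by validation score and use the train/validation gap as a tie-breaker. The sketch below copies the displayed values into a new DataFrame so that it runs on its own. The name `summaryDF` and the ranking rule are hypothetical; the notebook does not prescribe a selection rule.

```python
import pandas as pd

# Values copied from the table above (mean test/train score per n_neighbors)
summaryDF = pd.DataFrame({
    'neighbors':   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'test_score':  [0.96, 0.946667, 0.973333, 0.986667, 0.98,
                    0.973333, 0.973333, 0.966667, 0.973333, 0.966667],
    'train_score': [1.0, 0.976667, 0.963333, 0.96, 0.97,
                    0.966667, 0.97, 0.966667, 0.97, 0.973333],
})
summaryDF['diff'] = (summaryDF['test_score'] - summaryDF['train_score']).abs()

# Hypothetical rule: highest validation score first, smallest gap as tie-breaker
ranked = summaryDF.sort_values(['test_score', 'diff'], ascending=[False, True])
best = ranked.iloc[0]
print(int(best['neighbors']), best['test_score'], round(best['diff'], 6))
```

With 150 samples and 3 folds, a difference of 0.006667 in the mean test score corresponds to a single sample, so neighbouring rows in the ranking should not be treated as meaningfully different.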
