### Iris Species Classification
- Goal: classify iris flowers into 3 species
- Dataset: built-in dataset
- Features: 4
- Target: 1 (species)
- Learning algorithm: decision tree (DecisionTreeClassifier)

[1] Data preparation

In [135]:
# Import modules
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [136]:
# Load the built-in dataset
data=load_iris(as_frame=True)

In [137]:
# Bunch instance => behaves much like a dict
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [138]:
featureDF=data['data']
targetSR=data['target']

In [139]:
featureDF.shape,targetSR.shape

((150, 4), (150,))

[2] Prepare the datasets for training => train, validation, and test sets

In [140]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(featureDF,targetSR,
                                                    stratify=targetSR # keep the species ratio balanced across splits
                                                    )

In [141]:
# Split the train set again into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,
                                                    stratify=y_train # keep the species ratio balanced across splits
                                                    )

In [142]:
print(f'Train DS : {X_train.shape[0]} {X_train.shape[0]/featureDF.shape[0]*100:.2f}%')
print(f'Val DS : {X_val.shape[0]} {X_val.shape[0]/featureDF.shape[0]*100:.2f}%')
print(f'Test DS : {X_test.shape[0]} {X_test.shape[0]/featureDF.shape[0]*100:.2f}%')

Train DS : 84 56.00%
Val DS : 28 18.67%
Test DS : 38 25.33%
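The 56/19/25 proportions above come from the default `test_size` of 0.25 applied twice, and they change from run to run because no seed is set. A minimal sketch of a reproducible 60/20/20 stratified split follows; the `test_size` and `random_state` values are illustrative assumptions, not settings taken from this notebook.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# Hold out the test set first, then carve the validation set out of the rest.
# test_size and random_state are illustrative choices (assumptions).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

split_sizes = {"train": len(X_train), "val": len(X_val), "test": len(X_test)}
class_counts = {name: dict(Counter(part.tolist()))
                for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]}

print(split_sizes)   # number of rows per split
print(class_counts)  # per-species counts inside each split
```

Because 0.25 of the remaining 120 rows is 30 rows, the second split yields the same size as the test set, and `stratify` keeps the three species equally represented in every part.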


[3] Cross-validation approaches

In [143]:
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

In [144]:
# Create the model instance
dtc_model=DecisionTreeClassifier()

[3-1] KFold-based

In [145]:
# List for storing the accuracy of each fold
accuracys=[]

# Create the KFold instance
kfold=KFold() # with no arguments, n_splits defaults to 5

In [146]:
# Train K times, each time on a different train/validation partition
# -> split() yields the training-set indices and validation-set indices of each fold
for idx,(train_index,val_index) in enumerate(kfold.split(featureDF)):

    print(f'train_index : {train_index.tolist()}')

    # Select this fold's training and validation data by position
    X_train, y_train = featureDF.iloc[train_index], targetSR.iloc[train_index]
    X_val, y_val = featureDF.iloc[val_index], targetSR.iloc[val_index]

    # Fit the model
    dtc_model.fit(X_train, y_train)

    # Evaluate => for classifiers, score() returns the accuracy
    train_accuracy=dtc_model.score(X_train,y_train)
    accuracy=dtc_model.score(X_val,y_val)
    accuracys.append([train_accuracy,accuracy])
    print(f'[Fold {idx}] Train accuracy : {train_accuracy} Val accuracy: {accuracy}')

train_index : [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]
[Fold 0] Train accuracy : 1.0 Val accuracy: 1.0
train_index : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 

- Folds 2, 3, and 4 are overfit! (training accuracy is 1.0 but validation accuracy is lower)
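One likely contributor to the weak validation scores is that the iris rows are sorted by species and `KFold` does not shuffle by default, so each validation fold is a contiguous block dominated by one species. The sketch below counts the species in every validation fold, with and without shuffling; the helper `val_class_counts` and `random_state=0` are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)


def val_class_counts(splitter):
    """Return, per fold, how many samples of each species fall in the validation part."""
    return [np.bincount(y[val_idx], minlength=3).tolist()
            for _, val_idx in splitter.split(X)]


# Default KFold: contiguous blocks of the class-sorted data
plain = val_class_counts(KFold(n_splits=5))
# Shuffled KFold: rows are permuted before being cut into folds
shuffled = val_class_counts(KFold(n_splits=5, shuffle=True, random_state=0))

print("no shuffle:", plain)
print("shuffled  :", shuffled)
```

Without shuffling, some folds validate on a species that is under-represented in the corresponding training part, which makes the validation accuracy a poor estimate. Shuffling mixes the species but does not guarantee equal proportions.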

In [155]:
# Average the per-fold train and validation accuracies
train_mean=sum([i[0] for i in accuracys])/kfold.n_splits
val_mean=sum([i[1] for i in accuracys])/kfold.n_splits
print(f'train accuracy : {train_mean}, val accuracy : {val_mean}')

train accuracy : 1.0, val accuracy : 0.9133333333333333


[3-2] StratifiedKFold
- Splits the data while preserving the class (label/target) ratio in every fold
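To see what preserving the class ratio means in practice, the following sketch counts the species in each validation fold produced by `StratifiedKFold`. It is a small illustration on the same iris data; the variable name `strat_counts` is an assumption introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5)

# split() needs the labels as well, because the folds are built per class
strat_counts = [np.bincount(y[val_idx], minlength=3).tolist()
                for _, val_idx in skf.split(X, y)]

print(strat_counts)
```

With 50 samples per species and 5 folds, each validation fold receives 10 samples of every species, so every fold sees the same class balance as the full dataset.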

In [159]:
# List for storing the accuracy of each fold
accuracys=[]

# Create the StratifiedKFold instance
skfold=StratifiedKFold() # with no arguments, n_splits defaults to 5

In [163]:
# Train K times, each time on a different train/validation partition
# -> split() yields the training-set indices and validation-set indices of each fold
for idx,(train_index,val_index) in enumerate(skfold.split(featureDF,targetSR),1):

    print(f'train_index : {train_index.tolist()}')

    # Select this fold's training and validation data by position
    X_train, y_train = featureDF.iloc[train_index], targetSR.iloc[train_index]
    X_val, y_val = featureDF.iloc[val_index], targetSR.iloc[val_index]

    # Fit the model
    dtc_model.fit(X_train, y_train)

    # Evaluate => for classifiers, score() returns the accuracy
    train_accuracy=dtc_model.score(X_train,y_train)
    accuracy=dtc_model.score(X_val,y_val)
    accuracys.append([train_accuracy,accuracy])
    print(f'[Fold {idx}] Train accuracy : {train_accuracy} Val accuracy: {accuracy}')

train_index : [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]
[Fold 1] Train accuracy : 1.0 Val accuracy: 0.9666666666666667
train_index : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 120, 121, 122, 123, 124, 125, 126, 127, 

In [164]:
# Average the per-fold train and validation accuracies
train_mean=sum([i[0] for i in accuracys])/skfold.n_splits
val_mean=sum([i[1] for i in accuracys])/skfold.n_splits
print(f'train accuracy : {train_mean}, val accuracy : {val_mean}')

train accuracy : 1.0, val accuracy : 0.9533333333333334


- Functions that run cross-validation and performance evaluation in a single call
    - cross_val_score, cross_val_predict
    - cross_validate

In [169]:
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate

In [174]:
### cross_val_predict
predict=cross_val_predict(dtc_model,featureDF,targetSR,cv=3)

In [175]:
print(f'predict:{predict}')

predict:[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
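The out-of-fold predictions returned by `cross_val_predict` can be compared against the true labels to see where the misclassifications occur. The sketch below does this with `accuracy_score` and `confusion_matrix`; fixing `random_state=0` on the tree is an assumption made only so the result is repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)

# random_state is fixed here only for repeatability (assumption)
model = DecisionTreeClassifier(random_state=0)

# Every sample is predicted by a model that did not see it during training
pred = cross_val_predict(model, X, y, cv=3)

acc = accuracy_score(y, pred)
cm = confusion_matrix(y, pred)  # rows: true species, columns: predicted species

print(f"out-of-fold accuracy: {acc:.4f}")
print(cm)
```

Off-diagonal entries of the matrix show which species are confused with each other. In the prediction array above, the errors appear only between classes 1 and 2.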


In [176]:
### cross_val_score
cross_val_score(dtc_model,featureDF,targetSR)

array([0.96666667, 0.96666667, 0.9       , 1.        , 1.        ])
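When `cv` is left unset and the estimator is a classifier, `cross_val_score` uses a 5-fold `StratifiedKFold`, which is why these scores resemble the manual stratified loop rather than the plain `KFold` one. The sketch below checks that equivalence; `random_state=0` is an assumption added so both runs fit identical trees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)

# Fixed seed so that both calls train the same trees (assumption)
model = DecisionTreeClassifier(random_state=0)

default_scores = cross_val_score(model, X, y)
explicit_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))

print(default_scores)
print(explicit_scores)
print("same scores:", np.array_equal(default_scores, explicit_scores))
```

Passing a splitter object explicitly is useful when a different scheme is wanted, for example a shuffled `StratifiedKFold` with its own seed.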

In [177]:
### cross_validate
result=cross_validate(dtc_model,featureDF,targetSR)

In [178]:
result

{'fit_time': array([0.00205994, 0.00150013, 0.00199389, 0.00295258, 0.00200009]),
 'score_time': array([0.00141382, 0.00099659, 0.00199413, 0.00199366, 0.00198579]),
 'test_score': array([0.96666667, 0.96666667, 0.9       , 1.        , 1.        ])}

In [179]:
### cross_validate
result=cross_validate(dtc_model,featureDF,targetSR,return_train_score=True)

In [180]:
result

{'fit_time': array([0.00268793, 0.00343561, 0.00309467, 0.00201011, 0.00199771]),
 'score_time': array([0.00100875, 0.00146198, 0.00232959, 0.00100183, 0.00099635]),
 'test_score': array([0.96666667, 0.96666667, 0.9       , 1.        , 1.        ]),
 'train_score': array([1., 1., 1., 1., 1.])}

In [181]:
### cross_validate
result=cross_validate(dtc_model,featureDF,targetSR,
                      return_train_score=True, 
                      return_estimator=True)

In [189]:
resultDF=pd.DataFrame(result).loc[:,['test_score','train_score']]
resultDF

| | test_score | train_score |
|---|---|---|
| 0 | 0.966667 | 1.0 |
| 1 | 0.966667 | 1.0 |
| 2 | 0.9 | 1.0 |
| 3 | 0.966667 | 1.0 |
| 4 | 1.0 | 1.0 |


The `estimator` entry in `result` holds the fitted model from each fold.  
You can compare the scores and pull out the best-performing model to use.

In [196]:
bestmodel=result['estimator'][4]
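Rather than hard-coding the fold index, the best fold can be looked up from `test_score`. The sketch below also holds out a test set before cross-validating, so the selected model is checked on rows it never saw. The names `X_dev`, `best_idx`, and `test_accuracy`, and the `random_state` values, are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)

# Keep a test set aside; cross-validation only sees the development part
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

result = cross_validate(DecisionTreeClassifier(random_state=0),
                        X_dev, y_dev,
                        return_train_score=True,
                        return_estimator=True)

# Index of the fold with the highest validation score (first one on ties)
best_idx = int(np.argmax(result["test_score"]))
best_model = result["estimator"][best_idx]

test_accuracy = best_model.score(X_test, y_test)
print(f"best fold: {best_idx}, test accuracy: {test_accuracy:.4f}")
```

Each fold's estimator was trained on only part of the development data, so a common alternative is to refit one model on all of `X_dev` once the configuration has been chosen.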