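### Ensemble - ExtraTrees
- Bagging-style ensemble => random samples drawn with replacement + the same base model (DecisionTree)
    - Representative algorithm : RandomForest C/R (Classifier/Regressor)
- Pasting-style ensemble => random samples drawn without replacement + the same base model (DecisionTree)
    - Representative algorithm : ExtraTrees C/R (in scikit-learn, `bootstrap=False` by default, so each tree sees the full training set and the extra randomness comes from randomly drawn split thresholds)

- Wine classification => classes 0 and 1, binary classification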
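As a quick check of how the two ensembles differ in scikit-learn, the sketch below only reads default parameters; no data is involved. The variable names are illustrative and not part of the notebook.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

# Default resampling behavior of each ensemble:
# bootstrap=True means each tree gets a sample drawn with replacement.
rf_bootstrap = RandomForestClassifier().get_params()["bootstrap"]
et_bootstrap = ExtraTreesClassifier().get_params()["bootstrap"]

# Default split strategy of the single trees used inside each ensemble.
dt_splitter = DecisionTreeClassifier().get_params()["splitter"]
xt_splitter = ExtraTreeClassifier().get_params()["splitter"]

print(f"RandomForest bootstrap : {rf_bootstrap}")
print(f"ExtraTrees   bootstrap : {et_bootstrap}")
print(f"DecisionTree splitter  : {dt_splitter}")
print(f"ExtraTree    splitter  : {xt_splitter}")
```

With default settings, RandomForest resamples the rows and searches for the best threshold, while ExtraTrees keeps all rows and draws thresholds at random.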

[1] Load modules and prepare data

In [80]:
# Load modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [81]:
# Data file
DATA_FILE='../data/wine.csv'

# CSV >> DataFrame
wineDF=pd.read_csv(DATA_FILE)

In [82]:
# Inspect the data
wineDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   alcohol  6497 non-null   float64
 1   sugar    6497 non-null   float64
 2   pH       6497 non-null   float64
 3   class    6497 non-null   float64
dtypes: float64(4)
memory usage: 203.2 KB


In [83]:
wineDF.head()

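|   | alcohol | sugar | pH | class |
|---|---------|-------|------|-------|
| 0 | 9.4 | 1.9 | 3.51 | 0.0 |
| 1 | 9.8 | 2.6 | 3.2 | 0.0 |
| 2 | 9.8 | 2.3 | 3.26 | 0.0 |
| 3 | 9.8 | 1.9 | 3.16 | 0.0 |
| 4 | 9.4 | 1.9 | 3.51 | 0.0 |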


In [84]:
# Class distribution of the target/label
wineDF['class'].value_counts()

class
1.0    4898
0.0    1599
Name: count, dtype: int64

In [85]:
wineDF.describe()

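|       | alcohol | sugar | pH | class |
|-------|---------|-------|----|-------|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 10.491801 | 5.443235 | 3.218501 | 0.753886 |
| std | 1.192712 | 4.757804 | 0.160787 | 0.430779 |
| min | 8.0 | 0.6 | 2.72 | 0.0 |
| 25% | 9.5 | 1.8 | 3.11 | 1.0 |
| 50% | 10.3 | 3.0 | 3.21 | 1.0 |
| 75% | 11.3 | 8.1 | 3.32 | 1.0 |
| max | 14.9 | 65.8 | 4.01 | 1.0 |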


[2] Training preparation

In [86]:

from sklearn.model_selection import train_test_split

In [87]:
# Separate the features (independent variables) from the target/label (dependent variable)
featureDF=wineDF.drop(columns='class')
targetSR=wineDF['class']

print(f'featureDF : {featureDF.shape}, targetSR : {targetSR.shape}')

featureDF : (6497, 3), targetSR : (6497,)


In [88]:
# Split into training and test datasets
X_train,X_test,y_train,y_test=train_test_split(featureDF,targetSR,
                                               test_size=0.2,
                                               stratify=targetSR,
                                               random_state=1)

In [89]:
print(f'X_train : {X_train.shape}, y_train : {y_train.shape}')
print(f'X_test : {X_test.shape}, y_test : {y_test.shape}')

X_train : (5197, 3), y_train : (5197,)
X_test : (1300, 3), y_test : (1300,)
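The split above passes `stratify=targetSR`, which keeps the class ratio roughly the same in both parts. A minimal sketch of that effect, using hypothetical labels (about 75% ones, similar to the wine data) rather than the real file:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 750 ones and 250 zeros.
y = np.array([1] * 750 + [0] * 250)
# Dummy single-column feature matrix, only needed to satisfy the API.
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Share of class 1 in each part; stratification keeps both near 0.75.
train_share = y_tr.mean()
test_share = y_te.mean()
print(f"train share of class 1 : {train_share:.3f}")
print(f"test  share of class 1 : {test_share:.3f}")
```

Without `stratify`, the two shares can drift apart by chance, which matters more as the classes get more imbalanced.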


[3] Training

In [90]:
# Learning type : supervised learning > classification
# Algorithm     : ensemble > ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [91]:
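# Create the instance => an ensemble of 100 internal ExtraTree models (default n_estimators=100)
#                        random_state fixes the randomly drawn splits, so results are reproducible
#                     => with the default bootstrap=False every tree is trained on the full training set,
#                        so oob_score (validation on the rows left out of a sample) needs bootstrap=True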
lf_model=ExtraTreesClassifier(random_state=7)

# Train
lf_model.fit(X_train,y_train)

In [92]:
# Attributes of the fitted model
print(f'lf_model.classes : {lf_model.classes_}')
print(f'lf_model.n_classes : {lf_model.n_classes_}')
print()
print(f'lf_model.feature_names_in_ : {lf_model.feature_names_in_}')
print(f'lf_model.feature_importances_ : {lf_model.feature_importances_}')
print(f'lf_model.n_features_in_ : {lf_model.n_features_in_}')

lf_model.classes : [0. 1.]
lf_model.n_classes : 2

lf_model.feature_names_in_ : ['alcohol' 'sugar' 'pH']
lf_model.feature_importances_ : [0.18992806 0.53030305 0.2797689 ]
lf_model.n_features_in_ : 3


In [93]:
print(f'lf_model.estimator_ : {lf_model.estimator_}')
print(f'lf_model.estimators_ : {lf_model.estimators_}')

lf_model.estimator_ : ExtraTreeClassifier()
lf_model.estimators_ : [ExtraTreeClassifier(random_state=327741615), ExtraTreeClassifier(random_state=976413892), ExtraTreeClassifier(random_state=1202242073), ExtraTreeClassifier(random_state=1369975286), ExtraTreeClassifier(random_state=1882953283), ExtraTreeClassifier(random_state=2053951699), ExtraTreeClassifier(random_state=959775639), ExtraTreeClassifier(random_state=1956722279), ExtraTreeClassifier(random_state=2052949340), ExtraTreeClassifier(random_state=1322904761), ExtraTreeClassifier(random_state=165338510), ExtraTreeClassifier(random_state=1133316631), ExtraTreeClassifier(random_state=4812360), ExtraTreeClassifier(random_state=372560217), ExtraTreeClassifier(random_state=309457262), ExtraTreeClassifier(random_state=1801189930), ExtraTreeClassifier(random_state=1152936666), ExtraTreeClassifier(random_state=68334472), ExtraTreeClassifier(random_state=2146978983), ExtraTreeClassifier(random_state=119248870), ExtraTreeClassifier(rand

In [94]:
#print(f'lf_model.estimators_samples_ : {lf_model.estimators_samples_}')

In [95]:
#print(f'oob score : {lf_model.oob_score_}')
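The `oob_score_` line above is commented out because the attribute is not set on a model built with the defaults. A minimal sketch of how it becomes available, assuming synthetic data and illustrative variable names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (not the wine dataset).
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=1, random_state=0
)

# Default model: bootstrap=False, so there are no out-of-bag rows to score.
default_model = ExtraTreesClassifier(n_estimators=100, random_state=7).fit(X, y)
has_oob_default = hasattr(default_model, "oob_score_")

# bootstrap=True draws rows with replacement for each tree;
# oob_score=True then scores each row with the trees that did not see it.
oob_model = ExtraTreesClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=7
).fit(X, y)
oob = oob_model.oob_score_

print(f"default model has oob_score_ : {has_oob_default}")
print(f"oob score with bootstrap=True : {oob:.3f}")
```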

[4] Performance evaluation

In [96]:
train_score=lf_model.score(X_train,y_train)
test_score=lf_model.score(X_test,y_test)

In [97]:
print(f'train_score : {train_score}, test_score : {test_score}')

train_score : 0.9973061381566288, test_score : 0.8961538461538462
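The gap between the train score (about 0.997) and the test score (about 0.896) suggests the fully grown trees fit the training rows much more closely than unseen rows. One way to estimate that gap without touching the test set is cross-validation with train scores returned. A minimal sketch on synthetic data, so the numbers will differ from the notebook's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in data (not the wine dataset).
X, y = make_classification(
    n_samples=400, n_features=3, n_informative=2, n_redundant=1,
    weights=[0.25, 0.75], random_state=0,
)

model = ExtraTreesClassifier(n_estimators=50, random_state=7)

# 5-fold CV; return_train_score=True also reports the score on each training fold.
cv = cross_validate(model, X, y, cv=5, return_train_score=True)
train_mean = cv["train_score"].mean()
test_mean = cv["test_score"].mean()

print(f"mean train score      : {train_mean:.3f}")
print(f"mean validation score : {test_mean:.3f}")
```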


[5] Tuning

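- `RandomizedSearchCV` : hyperparameter optimization class
    - Well suited to hyperparameters with wide ranges
    - Draws hyperparameter combinations from the given ranges a fixed number of times (`n_iter`, default 10) and evaluates each one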
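The sampling behavior described above can be sketched as follows. This assumes synthetic data, `scipy.stats.randint` distributions in place of `range` objects, and a small `n_iter` and `cv` to keep it fast; none of these choices come from the notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the wine dataset).
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=1, random_state=0
)

# randint(low, high) samples integers in [low, high).
param_distributions = {
    "max_depth": randint(2, 16),
    "min_samples_leaf": randint(5, 16),
}

search = RandomizedSearchCV(
    ExtraTreesClassifier(n_estimators=30, random_state=7),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,              # folds per combination
    random_state=0,    # makes the sampled combinations reproducible
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
best = search.best_params_
print(f"candidates tried : {n_candidates}")
print(f"best params      : {best}")
```

Setting `random_state` on both the estimator and the search keeps the sampled candidates and their scores repeatable between runs.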

In [98]:
# Load module
from sklearn.model_selection import RandomizedSearchCV

In [99]:
# ExtraTreesClassifier hyperparameter search space
params={'max_depth':range(2,16),'min_samples_leaf':range(5,16)}

In [100]:
rf_model=ExtraTreesClassifier(n_estimators=300)
# default : n_estimators=100

In [101]:
searchCV=RandomizedSearchCV(rf_model,
                            param_distributions=params,
                            verbose=4)

In [102]:
searchCV.fit(X_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END ..max_depth=15, min_samples_leaf=8;, score=0.754 total time=   0.2s
[CV 2/5] END ..max_depth=15, min_samples_leaf=8;, score=0.754 total time=   0.2s
[CV 3/5] END ..max_depth=15, min_samples_leaf=8;, score=0.755 total time=   0.2s
[CV 4/5] END ..max_depth=15, min_samples_leaf=8;, score=0.754 total time=   0.2s
[CV 5/5] END ..max_depth=15, min_samples_leaf=8;, score=0.754 total time=   0.2s
[CV 1/5] END ...max_depth=3, min_samples_leaf=7;, score=0.754 total time=   0.2s
[CV 2/5] END ...max_depth=3, min_samples_leaf=7;, score=0.754 total time=   0.2s
[CV 3/5] END ...max_depth=3, min_samples_leaf=7;, score=0.755 total time=   0.2s
[CV 4/5] END ...max_depth=3, min_samples_leaf=7;, score=0.754 total time=   0.2s
[CV 5/5] END ...max_depth=3, min_samples_leaf=7;, score=0.754 total time=   0.2s
[CV 1/5] END ...max_depth=8, min_samples_leaf=7;, score=0.754 total time=   0.2s
[CV 2/5] END ...max_depth=8, min_samples_leaf=7;

In [103]:
print(f'[ SearchCV.best_score_] {searchCV.best_score_}')
print(f'[ SearchCV.best_estimator_] {searchCV.best_estimator_}')
print(f'[ SearchCV.best_params_] {searchCV.best_params_}')

cv_resultDF=pd.DataFrame(searchCV.cv_results_)
cv_resultDF

[ SearchCV.best_score_] 0.75389649811209
[ SearchCV.best_estimator_] ExtraTreesClassifier(max_depth=15, min_samples_leaf=8, n_estimators=300)
[ SearchCV.best_params_] {'min_samples_leaf': 8, 'max_depth': 15}


|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_min_samples_leaf | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.261239 | 0.007511 | 0.027683 | 0.001452 | 8 | 15 | {'min_samples_leaf': 8, 'max_depth': 15} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 1 | 0.230361 | 0.002381 | 0.023419 | 0.000748 | 7 | 3 | {'min_samples_leaf': 7, 'max_depth': 3} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 2 | 0.248591 | 0.008826 | 0.027025 | 0.003739 | 7 | 8 | {'min_samples_leaf': 7, 'max_depth': 8} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 3 | 0.240823 | 0.00428 | 0.025735 | 0.00145 | 14 | 8 | {'min_samples_leaf': 14, 'max_depth': 8} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 4 | 0.271002 | 0.029761 | 0.027459 | 0.001647 | 5 | 9 | {'min_samples_leaf': 5, 'max_depth': 9} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 5 | 0.257355 | 0.008786 | 0.027521 | 0.00148 | 9 | 14 | {'min_samples_leaf': 9, 'max_depth': 14} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 6 | 0.245077 | 0.015862 | 0.025168 | 0.001811 | 15 | 4 | {'min_samples_leaf': 15, 'max_depth': 4} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 7 | 0.279688 | 0.010482 | 0.028659 | 0.001818 | 10 | 12 | {'min_samples_leaf': 10, 'max_depth': 12} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 8 | 0.256153 | 0.0093 | 0.039879 | 0.027428 | 14 | 5 | {'min_samples_leaf': 14, 'max_depth': 5} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 9 | 0.24119 | 0.005574 | 0.022932 | 0.00121 | 14 | 2 | {'min_samples_leaf': 14, 'max_depth': 2} | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
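All ten candidates share the same fold scores, and the mean (about 0.7539) is almost exactly the share of class 1.0 in the data (4898 of 6497). That pattern is what a model predicting class 1.0 for every row would produce, so these settings may not be improving on a majority-class baseline. A minimal sketch of that baseline, rebuilding only the label counts reported earlier and using a placeholder feature column:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the reported class counts (1.0: 4898, 0.0: 1599).
y = np.array([1.0] * 4898 + [0.0] * 1599)
# Placeholder features; the dummy model ignores them.
X = np.zeros((len(y), 1))

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_score = baseline.score(X, y)

print(f"majority-class baseline accuracy : {baseline_score:.6f}")
```

Comparing `best_score_` against such a baseline is a cheap sanity check before trusting a search result on imbalanced labels; the untuned model in [4] scored well above it on the test set.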
