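### Ensemble - RandomForest & ExtraTrees
- Bagging-style ensemble ==> random samples drawn with replacement + the same base model (DT)
    * Representative algorithm: RandomForestC/R
- Pasting-style ensemble ==> random samples drawn without replacement + the same base model (DT)
    * Representative algorithm: ExtraTreesC/R (by default each tree sees the full training set, with no bootstrap, and split thresholds are randomized)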
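
The difference between the two sampling schemes can be sketched with the standard library alone. This is a minimal illustration, not how scikit-learn implements it; the helper names `bagging_sample` and `pasting_sample` and the row counts are made up for this example.

```python
import random

def bagging_sample(n_rows, rng):
    """Draw n_rows row indices WITH replacement (bootstrap)."""
    return rng.choices(range(n_rows), k=n_rows)

def pasting_sample(n_rows, n_draw, rng):
    """Draw n_draw distinct row indices WITHOUT replacement."""
    return rng.sample(range(n_rows), k=n_draw)

rng = random.Random(0)   # fixed seed only for reproducibility
n_rows = 1000            # hypothetical dataset size

bag = bagging_sample(n_rows, rng)
paste = pasting_sample(n_rows, 800, rng)

# Bagging repeats some rows, so it covers fewer distinct rows than it draws
print(f'bagging : {len(bag)} draws, {len(set(bag))} distinct rows')
print(f'pasting : {len(paste)} draws, {len(set(paste))} distinct rows')
```

In scikit-learn the same switch is the `bootstrap` parameter of `BaggingClassifier` (`True` for bagging, `False` for pasting).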

[Goal] Wine classification => binary classification into classes 0 and 1

[1] Load modules and prepare data

In [1]:
# Load modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Prepare the data
DATA_FILE = '../Data/wine.csv'

# CSV => DataFrame
wineDF = pd.read_csv(DATA_FILE)

In [3]:
# Inspect the data
wineDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   alcohol  6497 non-null   float64
 1   sugar    6497 non-null   float64
 2   pH       6497 non-null   float64
 3   class    6497 non-null   float64
dtypes: float64(4)
memory usage: 203.2 KB


In [4]:
wineDF.head(2)

   alcohol  sugar    pH  class
0      9.4    1.9  3.51    0.0
1      9.8    2.6  3.20    0.0


In [5]:
# Class distribution of the target/label
wineDF['class'].value_counts()

class
1.0    4898
0.0    1599
Name: count, dtype: int64

In [8]:
wineDF.describe()

           alcohol        sugar           pH        class
count  6497.000000  6497.000000  6497.000000  6497.000000
mean     10.491801     5.443235     3.218501     0.753886
std       1.192712     4.757804     0.160787     0.430779
min       8.000000     0.600000     2.720000     0.000000
25%       9.500000     1.800000     3.110000     1.000000
50%      10.300000     3.000000     3.210000     1.000000
75%      11.300000     8.100000     3.320000     1.000000
max      14.900000    65.800000     4.010000     1.000000


[2] Prepare for training

In [10]:
featureDF = wineDF.iloc[:,:3] # wineDF[wineDF.columns[:-1]]
targetSR = wineDF['class'] # wineDF[wineDF.columns[-1]]

In [11]:
featureDF.shape, targetSR.shape

((6497, 3), (6497,))

In [9]:
from sklearn.model_selection import train_test_split

In [12]:
# Split into training and test datasets
x_train, x_test, y_train, y_test = train_test_split(featureDF,
                                                    targetSR,
                                                    test_size=0.2,
                                                    stratify=targetSR,
                                                    random_state=1)
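
To see what `stratify` buys here, the sketch below repeats the split on synthetic labels that reproduce the class counts shown earlier (1599 vs. 4898). The feature column is a dummy and the variable names are illustrative; only the label ratio matters for this check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the wine data (assumption:
# only the label ratio matters for this illustration, so features are dummies).
y = np.array([0.0] * 1599 + [1.0] * 4898)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# With stratify, the share of class 1 is nearly identical in every subset
ratio_all = y.mean()
ratio_train = y_tr.mean()
ratio_test = y_te.mean()
print(f'class-1 ratio  all: {ratio_all:.4f}  train: {ratio_train:.4f}  test: {ratio_test:.4f}')
```

Without `stratify`, the class ratio of the test set would vary from one random split to another, which matters more as the classes become more imbalanced.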

In [13]:
print(f'x_train : {x_train.shape}, {x_train.ndim}D')
print(f'x_test : {x_test.shape}, {x_test.ndim}D')

print(f'y_train : {y_train.shape}, {y_train.ndim}D')
print(f'y_test : {y_test.shape}, {y_test.ndim}D')

x_train : (5197, 3), 2D
x_test : (1300, 3), 2D
y_train : (5197,), 1D
y_test : (1300,), 1D


[3] Training

In [16]:
# Learning type : supervised learning > classification
# Algorithm     : ensemble > bagging - RandomForestClassifier

In [17]:
from sklearn.ensemble import RandomForestClassifier

In [29]:
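# Create the instance => by default 100 internal DT models, each trained on its own bootstrap sample
#                 random_state : fixes the random sampling so the result is reproducible
#                 oob_score    : validates on the rows left out of each bootstrap sample (out-of-bag)
lf_model = RandomForestClassifier(random_state=7,
                                  oob_score=True)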

# 학습
lf_model.fit(x_train, y_train)
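
The out-of-bag idea rests on simple arithmetic: when n rows are drawn with replacement n times, each row is missed with probability (1 - 1/n)^n, which approaches 1/e. The sketch below checks this for the training-set size used here; it is a standalone illustration and does not touch the fitted model.

```python
import math
import random

n = 5197  # number of training rows in this notebook

# Probability that a given row is never drawn in n draws with replacement
p_oob = (1 - 1 / n) ** n
print(f'analytic OOB fraction : {p_oob:.4f} (1/e = {math.exp(-1):.4f})')

# Simulate one bootstrap sample and count the rows that were left out
rng = random.Random(7)   # fixed seed only for reproducibility
drawn = set(rng.choices(range(n), k=n))
oob_fraction = (n - len(drawn)) / n
print(f'simulated OOB fraction: {oob_fraction:.4f}')
```

Roughly a third of the rows are therefore unseen by any given tree, which is why `oob_score_` can act as a validation estimate without a separate hold-out set.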

In [30]:
# Inspect fitted model attributes

print(f'lf_model.classes_            : {lf_model.classes_}')
print(f'lf_model.n_classes_          : {lf_model.n_classes_}')

print(f'lf_model.feature_names_in_   : {lf_model.feature_names_in_}')
print(f'lf_model.n_features_in_      : {lf_model.n_features_in_}')
print(f'lf_model.feature_importances_: {lf_model.feature_importances_}')

lf_model.classes_            : [0. 1.]
lf_model.n_classes_          : 2
lf_model.feature_names_in_   : ['alcohol' 'sugar' 'pH']
lf_model.n_features_in_      : 3
lf_model.feature_importances_: [0.23572103 0.49995154 0.26432743]
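
The importance array is easier to read once it is paired with the feature names and sorted. The sketch below uses the values printed above, copied by hand, rather than the fitted model, so it runs on its own.

```python
# Values copied from the output above
feature_names = ['alcohol', 'sugar', 'pH']
importances = [0.23572103, 0.49995154, 0.26432743]

# Pair each feature with its importance and sort from most to least important
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f'{name:<8} {score:.4f}')

# Impurity-based importances are normalized, so they sum to 1
total = sum(importances)
print(f'sum = {total:.4f}')
```

Here `sugar` accounts for about half of the total importance, with `pH` and `alcohol` sharing the rest.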


In [31]:
# Inspect fitted model attributes (2)

print(f'lf_model.estimator_            : {lf_model.estimator_}')
for est in lf_model.estimators_: print(est)

lf_model.estimator_            : DecisionTreeClassifier()
DecisionTreeClassifier(max_features='sqrt', random_state=327741615)
DecisionTreeClassifier(max_features='sqrt', random_state=976413892)
DecisionTreeClassifier(max_features='sqrt', random_state=1202242073)
DecisionTreeClassifier(max_features='sqrt', random_state=1369975286)
DecisionTreeClassifier(max_features='sqrt', random_state=1882953283)
DecisionTreeClassifier(max_features='sqrt', random_state=2053951699)
DecisionTreeClassifier(max_features='sqrt', random_state=959775639)
DecisionTreeClassifier(max_features='sqrt', random_state=1956722279)
DecisionTreeClassifier(max_features='sqrt', random_state=2052949340)
DecisionTreeClassifier(max_features='sqrt', random_state=1322904761)
DecisionTreeClassifier(max_features='sqrt', random_state=165338510)
DecisionTreeClassifier(max_features='sqrt', random_state=1133316631)
DecisionTreeClassifier(max_features='sqrt', random_state=4812360)
DecisionTreeClassifier(max_features='sqrt', random_s
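
Each entry in `estimators_` is an ordinary fitted tree, and scikit-learn documents the forest's predicted probabilities as the mean of the trees' predicted probabilities. The sketch below checks that on synthetic data; the dataset from `make_classification`, the forest size, and the variable names are assumptions made for this example, not part of the wine workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wine features (assumption: 3 numeric features, 2 classes)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=7).fit(X, y)

# Average the class probabilities predicted by every individual tree
tree_probas = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
forest_probas = forest.predict_proba(X)

# The two arrays should agree up to floating-point rounding
print(np.allclose(tree_probas, forest_probas))
```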

In [36]:
# Inspect fitted model attributes (3)

# print(f'lf_model.estimators_samples_            : {lf_model.estimators_samples_}')

In [33]:
print(f'lf_model.oob_score_ : {lf_model.oob_score_}')

lf_model.oob_score_ : 0.89532422551472


[4] Performance evaluation

In [34]:
train_score = lf_model.score(x_train, y_train)
test_score = lf_model.score(x_test, y_test)

print(f'train_score : test_score = {train_score} : {test_score}')

train_score : test_score = 0.9973061381566288 : 0.9
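
Two quick checks help put these scores in context. Because the classes are imbalanced, always predicting the majority class already gives a non-trivial accuracy. The gap between the training and test scores is also a common, though not conclusive, sign of overfitting. The sketch below uses numbers copied from the outputs above.

```python
# Numbers copied from the outputs above
class_counts = {1.0: 4898, 0.0: 1599}
train_score = 0.9973061381566288
test_score = 0.9

# Accuracy of always predicting the most frequent class
baseline = max(class_counts.values()) / sum(class_counts.values())
gap = train_score - test_score

print(f'majority-class baseline : {baseline:.4f}')
print(f'train - test gap        : {gap:.4f}')
```

A test accuracy of 0.9 is well above the baseline of about 0.75, while the gap of roughly 0.10 suggests the fully grown trees fit the training data more closely than they generalize.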


[5] Tuning

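- RandomizedSearchCV: a hyperparameter optimization class
    - Well suited to hyperparameters with wide search ranges
    - Draws the specified number of hyperparameter combinations (n_iter) from the given ranges and evaluates each one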
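
The sampling step can be seen in isolation with `ParameterSampler`, the scikit-learn utility that draws candidate combinations from a search space. This is a sketch under assumptions: the `space` dictionary mirrors the ranges used later in this section, and `n_iter=5` is chosen only to keep the output short.

```python
from sklearn.model_selection import ParameterSampler

# Illustrative search space (assumption: the same ranges used later in this section)
space = {'max_depth': list(range(2, 16)),
         'min_samples_leaf': list(range(5, 16)),
         'criterion': ['gini', 'entropy', 'log_loss']}

# Draw 5 random combinations; random_state is fixed only for reproducibility
candidates = list(ParameterSampler(space, n_iter=5, random_state=0))
for candidate in candidates:
    print(candidate)
```

Each printed dictionary is one candidate that a randomized search would fit and cross-validate.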

In [37]:
# Load module
from sklearn.model_selection import RandomizedSearchCV

In [49]:
# RandomForestClassifier hyperparameter search space
params = {'max_depth':range(2, 16),
          'min_samples_leaf':range(5, 16),
          'criterion': ['gini', 'entropy', 'log_loss']}
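
A little arithmetic shows why random sampling is attractive for this search space. The sketch below counts the combinations in the full grid and compares the number of model fits, using the candidate and fold counts reported in the search log in this section.

```python
from math import prod

params = {'max_depth': range(2, 16),
          'min_samples_leaf': range(5, 16),
          'criterion': ['gini', 'entropy', 'log_loss']}

# Size of the exhaustive grid: product of the number of values per parameter
grid_size = prod(len(values) for values in params.values())
n_iter, n_folds = 50, 5   # values taken from the search log in this section

print(f'full grid          : {grid_size} combinations')
print(f'randomized search  : {n_iter * n_folds} fits')
print(f'exhaustive search  : {grid_size * n_folds} fits')
```

Sampling 50 of the 462 combinations cuts the work from 2310 fits to 250, at the cost of possibly missing the single best combination.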

In [39]:
rf_model = RandomForestClassifier(random_state=7)

In [50]:
searchCV = RandomizedSearchCV(rf_model,
                              param_distributions=params,
                              n_iter=50,   # 50 sampled candidates, matching the log below
                              verbose=4)

In [51]:
searchCV.fit(x_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END criterion=entropy, max_depth=13, min_samples_leaf=14;, score=0.875 total time=   0.2s
[CV 2/5] END criterion=entropy, max_depth=13, min_samples_leaf=14;, score=0.837 total time=   0.2s
[CV 3/5] END criterion=entropy, max_depth=13, min_samples_leaf=14;, score=0.878 total time=   0.2s
[CV 4/5] END criterion=entropy, max_depth=13, min_samples_leaf=14;, score=0.882 total time=   0.2s
[CV 5/5] END criterion=entropy, max_depth=13, min_samples_leaf=14;, score=0.873 total time=   0.2s
[CV 1/5] END criterion=log_loss, max_depth=10, min_samples_leaf=6;, score=0.873 total time=   0.2s
[CV 2/5] END criterion=log_loss, max_depth=10, min_samples_leaf=6;, score=0.846 total time=   0.2s
[CV 3/5] END criterion=log_loss, max_depth=10, min_samples_leaf=6;, score=0.876 total time=   0.2s
[CV 4/5] END criterion=log_loss, max_depth=10, min_samples_leaf=6;, score=0.881 total time=   0.2s
[CV 5/5] END criterion=log_loss, max_depth=10, 

In [52]:
# Inspect search results
print(f'searchCV.best_score_ : {searchCV.best_score_}')
print(f'searchCV.best_params_ : {searchCV.best_params_}')
print(f'searchCV.best_estimator_ : {searchCV.best_estimator_}')

cv_resultDF = pd.DataFrame(searchCV.cv_results_)
cv_resultDF

searchCV.best_score_ : 0.8760866587695268
searchCV.best_params_ : {'min_samples_leaf': 5, 'max_depth': 12, 'criterion': 'log_loss'}
searchCV.best_estimator_ : RandomForestClassifier(criterion='log_loss', max_depth=12, min_samples_leaf=5,
                       random_state=7)


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_min_samples_leaf,param_max_depth,param_criterion,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.252325,0.011395,0.019906,0.003943,14,13,entropy,"{'min_samples_leaf': 14, 'max_depth': 13, 'cri...",0.875,0.836538,0.877767,0.881617,0.872955,0.868775,0.016378,11
1,0.271436,0.008006,0.018325,0.003488,6,10,log_loss,"{'min_samples_leaf': 6, 'max_depth': 10, 'crit...",0.873077,0.846154,0.875842,0.880654,0.873917,0.869929,0.012174,8
2,0.149168,0.014908,0.008554,0.005136,6,2,log_loss,"{'min_samples_leaf': 6, 'max_depth': 2, 'crite...",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,48
3,0.136295,0.009543,0.010415,0.006275,6,2,gini,"{'min_samples_leaf': 6, 'max_depth': 2, 'crite...",0.755769,0.767308,0.770934,0.753609,0.753609,0.760246,0.007379,46
4,0.208142,0.020029,0.013506,0.00709,6,5,log_loss,"{'min_samples_leaf': 6, 'max_depth': 5, 'crite...",0.85,0.842308,0.860443,0.871992,0.860443,0.857037,0.010132,36
5,0.251798,0.003457,0.016851,0.002524,10,15,gini,"{'min_samples_leaf': 10, 'max_depth': 15, 'cri...",0.872115,0.835577,0.879692,0.882579,0.872955,0.868584,0.016972,13
6,0.224272,0.011628,0.01529,0.008592,10,7,entropy,"{'min_samples_leaf': 10, 'max_depth': 7, 'crit...",0.861538,0.836538,0.882579,0.876805,0.864293,0.864351,0.01593,30
7,0.141195,0.007152,0.007984,0.005402,12,2,gini,"{'min_samples_leaf': 12, 'max_depth': 2, 'crit...",0.755769,0.767308,0.770934,0.753609,0.753609,0.760246,0.007379,46
8,0.256957,0.006396,0.012861,0.006748,6,8,log_loss,"{'min_samples_leaf': 6, 'max_depth': 8, 'crite...",0.871154,0.838462,0.87873,0.879692,0.871992,0.868006,0.015167,18
9,0.24286,0.007573,0.018315,0.006227,14,13,log_loss,"{'min_samples_leaf': 14, 'max_depth': 13, 'cri...",0.875,0.836538,0.877767,0.881617,0.872955,0.868775,0.016378,11
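
A natural follow-up is to score the tuned model on the held-out test set, since `best_score_` is only a cross-validation average over the training data. The sketch below shows the pattern end to end on synthetic data; the dataset, the small forest size, and the reduced `n_iter` and `cv` values are assumptions chosen to keep it fast, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the wine data (assumption: 3 features, 2 classes)
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=1)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=7),
    param_distributions={'max_depth': list(range(2, 16)),
                         'min_samples_leaf': list(range(5, 16))},
    n_iter=5, cv=3, random_state=0)
search.fit(X_tr, y_tr)

# refit=True (the default) retrains the best candidate on all of X_tr
best_model = search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)
print(f'best params   : {search.best_params_}')
print(f'cv score      : {search.best_score_:.3f}')
print(f'test accuracy : {test_accuracy:.3f}')
```

Applied to this notebook, the equivalent call would be `searchCV.best_estimator_.score(x_test, y_test)`, which can then be compared with the untuned model's test score.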
