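[Ensemble - RandomForest & ExtraTrees]
- Bagging: random sampling with replacement; every estimator is the same model type (Decision Tree)
    * Representative algorithm: RandomForest
- Pasting: random sampling without replacement; every estimator is the same model type (Decision Tree)
    * Related algorithm: ExtraTrees (note: scikit-learn's ExtraTreesClassifier defaults to bootstrap=False, so each tree is fit on the full training set and the extra randomness comes from random split thresholds)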
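
The difference between sampling with and without replacement can be checked directly. The following is a minimal sketch, assuming scikit-learn's BaggingClassifier (whose default base model is a decision tree) and a synthetic dataset from make_classification; neither appears in the notebook itself, and the sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in data (an assumption; not the wine data used below).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Bagging: each tree is fit on rows drawn WITH replacement.
bagging = BaggingClassifier(
    n_estimators=5, max_samples=0.8, bootstrap=True, random_state=0
).fit(X, y)

# Pasting: each tree is fit on rows drawn WITHOUT replacement.
pasting = BaggingClassifier(
    n_estimators=5, max_samples=0.8, bootstrap=False, random_state=0
).fit(X, y)


def has_duplicates(indices):
    """Return True if any row index was drawn more than once."""
    return len(np.unique(indices)) < len(indices)


# estimators_samples_ lists the row indices each tree was trained on.
bagging_dupes = [has_duplicates(idx) for idx in bagging.estimators_samples_]
pasting_dupes = [has_duplicates(idx) for idx in pasting.estimators_samples_]
print("bagging has repeated rows:", bagging_dupes)
print("pasting has repeated rows:", pasting_dupes)
```

Only the bootstrap flag differs between the two ensembles, so any repeated row index comes from sampling with replacement.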

[Wine classification]
- Binary classification into two classes (0 or 1)

1) Load modules and prepare the data

In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [64]:
data_file='../data/wine.csv'

In [65]:
wine_df=pd.read_csv(data_file)

In [66]:
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   alcohol  6497 non-null   float64
 1   sugar    6497 non-null   float64
 2   pH       6497 non-null   float64
 3   class    6497 non-null   float64
dtypes: float64(4)
memory usage: 203.2 KB


In [67]:
wine_df.head(3)

   alcohol  sugar    pH  class
0      9.4    1.9  3.51    0.0
1      9.8    2.6  3.2     0.0
2      9.8    2.3  3.26    0.0


In [68]:
wine_df['class'].value_counts()

class
1.0    4898
0.0    1599
Name: count, dtype: int64

In [69]:
wine_df.describe()

         alcohol     sugar        pH     class
count     6497.0    6497.0    6497.0    6497.0
mean   10.491801  5.443235  3.218501  0.753886
std     1.192712  4.757804  0.160787  0.430779
min          8.0       0.6      2.72       0.0
25%          9.5       1.8      3.11       1.0
50%         10.3       3.0      3.21       1.0
75%         11.3       8.1      3.32       1.0
max         14.9      65.8      4.01       1.0


2) Prepare for training

In [70]:
feature_df=wine_df[wine_df.columns[:-1]]
target_sr=wine_df[wine_df.columns[-1]]

print('[feature_df]',feature_df.shape, '[target_sr]', target_sr.shape)

[feature_df] (6497, 3) [target_sr] (6497,)


In [71]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split

In [72]:
x_train,x_test,y_train,y_test=train_test_split(feature_df,target_sr,
                                                random_state=1,
                                                test_size=0.2,
                                                stratify=target_sr)

In [73]:
print('x_train:', x_train.shape, 'y_train:', y_train.shape)
print('x_test:',x_test.shape, 'y_test:',y_test.shape)

x_train: (5197, 3) y_train: (5197,)
x_test: (1300, 3) y_test: (1300,)
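
Because the classes are imbalanced (4898 vs 1599), the split above passes stratify so that both subsets keep roughly the same class ratio. A minimal sketch of that effect, assuming hypothetical labels with a 75/25 split instead of the actual wine data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (75% ones), standing in for the wine target.
y = np.array([1] * 750 + [0] * 250)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single-column feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# With stratify, the share of class 1 should match in both subsets.
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print("class-1 share in train:", train_ratio)
print("class-1 share in test: ", test_ratio)
```

Without stratify the two shares can drift apart by chance, which makes the test score harder to interpret on imbalanced data.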


3) Train the model
- Learning type: supervised learning, classification
- Algorithm: Ensemble-bagging (RandomForestClassifier)

In [74]:
from sklearn.ensemble import RandomForestClassifier

In [75]:
# Create the model instance (oob_score=True also computes the out-of-bag score)
lf_model=RandomForestClassifier(random_state=7, oob_score=True)

# Fit on the training set
lf_model.fit(x_train,y_train)

In [76]:
# Inspect the fitted model's attributes
print('[classes]',lf_model.classes_)
print('[n_classes]',lf_model.n_classes_)
print()
print('[feature names]',lf_model.feature_names_in_)
print('[n features in]',lf_model.n_features_in_)
print('[feature importances]',lf_model.feature_importances_)

[classes] [0. 1.]
[n_classes] 2

[feature names] ['alcohol' 'sugar' 'pH']
[n features in] 3
[feature importances] [0.23572103 0.49995154 0.26432743]


In [77]:
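# estimator_ is the base estimator template that each tree is created from, not a "best" model
print('[base estimator]',lf_model.estimator_)

# estimators_ holds the fitted trees
for i in lf_model.estimators_:
    print(i)

[base estimator] DecisionTreeClassifier()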
DecisionTreeClassifier(max_features='sqrt', random_state=327741615)
DecisionTreeClassifier(max_features='sqrt', random_state=976413892)
DecisionTreeClassifier(max_features='sqrt', random_state=1202242073)
DecisionTreeClassifier(max_features='sqrt', random_state=1369975286)
DecisionTreeClassifier(max_features='sqrt', random_state=1882953283)
DecisionTreeClassifier(max_features='sqrt', random_state=2053951699)
DecisionTreeClassifier(max_features='sqrt', random_state=959775639)
DecisionTreeClassifier(max_features='sqrt', random_state=1956722279)
DecisionTreeClassifier(max_features='sqrt', random_state=2052949340)
DecisionTreeClassifier(max_features='sqrt', random_state=1322904761)
DecisionTreeClassifier(max_features='sqrt', random_state=165338510)
DecisionTreeClassifier(max_features='sqrt', random_state=1133316631)
DecisionTreeClassifier(max_features='sqrt', random_state=4812360)
DecisionTreeClassifier(max_features='sqrt', random_state=372560217)


In [78]:
print('[sample]', lf_model.estimators_samples_)

AttributeError: 'RandomForestClassifier' object has no attribute 'estimators_samples_'
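
The error above means the installed scikit-learn build does not provide estimators_samples_ on RandomForestClassifier. Whether it exists appears to depend on the library version, so treat its presence as an assumption. A minimal sketch of a guarded lookup, using synthetic data rather than the wine set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the wine data).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=7).fit(X, y)

# getattr with a default returns None instead of raising AttributeError
# when the attribute is missing in the installed version.
samples = getattr(forest, "estimators_samples_", None)

if samples is None:
    print("estimators_samples_ is not available in this scikit-learn version")
else:
    print("first tree was trained on", len(samples[0]), "sampled rows")
```

If the per-tree sample indices are needed on a version without the attribute, BaggingClassifier exposes estimators_samples_ and can wrap a decision tree instead.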

4) Evaluate performance

In [None]:
train_score=lf_model.score(x_train,y_train)
test_score=lf_model.score(x_test,y_test)

In [None]:
print('[train_score]',train_score)
print('[test_score]',test_score)

[train_score] 0.9973061381566288
[test_score] 0.9


In [None]:
print('[lf_model.oob_score_]',lf_model.oob_score_)

[lf_model.oob_score_] 0.89532422551472
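
The out-of-bag score (0.895) is close to the test score (0.9), while the train score (0.997) is much higher. The OOB estimate works because each bootstrap sample leaves out a share of the rows, and those rows act as a built-in validation set for that tree. A minimal sketch with NumPy that measures this share for one bootstrap draw; the sample size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # arbitrary size chosen for illustration

# One bootstrap sample: draw n_samples row indices with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Mark every row that was drawn at least once.
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True

# Rows never drawn are "out of bag" for this tree.
oob_fraction = 1.0 - in_bag.mean()
print("out-of-bag fraction:", round(oob_fraction, 3))
```

For large samples the fraction approaches 1/e, roughly 37%, so each tree has about a third of the training rows available for evaluation.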


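5) Tuning
- RandomizedSearchCV: a hyperparameter optimization class
    * Well suited to wide hyperparameter search spaces
    * Draws the specified number of hyperparameter combinations (n_iter) at random from the specified ranges and evaluates each one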
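
The sampling step can be looked at on its own with ParameterSampler from sklearn.model_selection. The following is a minimal sketch, assuming a reduced search space and n_iter=5 chosen only for illustration:

```python
from sklearn.model_selection import ParameterSampler

# Reduced search space for illustration (an assumption, not the notebook's full grid).
param_ranges = {
    "max_depth": list(range(2, 16)),
    "min_samples_leaf": list(range(5, 16)),
}

# Draw 5 random combinations; random_state makes the draw repeatable.
candidates = list(ParameterSampler(param_ranges, n_iter=5, random_state=0))
for candidate in candidates:
    print(candidate)
```

The full space here has 14 x 11 = 154 combinations, but only the sampled ones would be fitted. That is why a randomized search stays affordable when the ranges are wide.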

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [79]:
params={'max_depth':range(2,16),
        'min_samples_leaf': range(5,16),
        'criterion':['gini','entropy','log_loss']}

In [80]:
rf_model=RandomForestClassifier(random_state=7)

In [81]:
cv=RandomizedSearchCV(rf_model, param_distributions=params,n_iter=50,verbose=4)  # verbose controls how much progress output is printed

In [82]:
cv.fit(x_train,y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END criterion=log_loss, max_depth=5, min_samples_leaf=13;, score=0.855 total time=   0.1s
[CV 2/5] END criterion=log_loss, max_depth=5, min_samples_leaf=13;, score=0.835 total time=   0.1s
[CV 3/5] END criterion=log_loss, max_depth=5, min_samples_leaf=13;, score=0.862 total time=   0.1s
[CV 4/5] END criterion=log_loss, max_depth=5, min_samples_leaf=13;, score=0.876 total time=   0.1s
[CV 5/5] END criterion=log_loss, max_depth=5, min_samples_leaf=13;, score=0.856 total time=   0.1s
[CV 1/5] END criterion=gini, max_depth=13, min_samples_leaf=13;, score=0.872 total time=   0.2s
[CV 2/5] END criterion=gini, max_depth=13, min_samples_leaf=13;, score=0.833 total time=   0.2s
[CV 3/5] END criterion=gini, max_depth=13, min_samples_leaf=13;, score=0.873 total time=   0.2s
[CV 4/5] END criterion=gini, max_depth=13, min_samples_leaf=13;, score=0.881 total time=   0.1s
[CV 5/5] END criterion=gini, max_depth=13, min_samples_leaf

In [83]:
print('[cv.best_score_]',cv.best_score_)
print('[cv.best_params_]',cv.best_params_)
print('[cv.best_estimator_]',cv.best_estimator_)

[cv.best_score_] 0.8735855482342488
[cv.best_params_] {'min_samples_leaf': 6, 'max_depth': 13, 'criterion': 'log_loss'}
[cv.best_estimator_] RandomForestClassifier(criterion='log_loss', max_depth=13, min_samples_leaf=6,
                       random_state=7)
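
The best cross-validation score is an average over validation folds, so the tuned model still needs a check on the held-out test set. A minimal sketch of that step, assuming synthetic data and a much smaller search (n_iter=5, cv=3, 20 trees) so it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the wine data).
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=7),
    param_distributions={
        "max_depth": list(range(2, 8)),
        "min_samples_leaf": list(range(5, 16)),
    },
    n_iter=5,
    cv=3,
    random_state=0,  # fixes which combinations are sampled
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is refit on the full training set.
tuned_test_score = search.best_estimator_.score(X_te, y_te)
print(search.best_params_)
print("test accuracy of tuned model:", tuned_test_score)
```

Passing random_state to RandomizedSearchCV keeps the sampled combinations the same between runs. Without it, the best parameters can change from one run to the next.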


In [84]:
cv_result_df=pd.DataFrame(cv.cv_results_)
cv_result_df

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_min_samples_leaf,param_max_depth,param_criterion,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.183572,0.014854,0.011469,0.000508,13,5,log_loss,"{'min_samples_leaf': 13, 'max_depth': 5, 'crit...",0.854808,0.834615,0.862368,0.875842,0.85563,0.856653,0.013349,40
1,0.236552,0.003153,0.014267,0.000411,13,13,gini,"{'min_samples_leaf': 13, 'max_depth': 13, 'cri...",0.872115,0.832692,0.872955,0.880654,0.87103,0.865889,0.016942,25
2,0.267957,0.002913,0.015076,0.000217,8,14,log_loss,"{'min_samples_leaf': 8, 'max_depth': 14, 'crit...",0.879808,0.840385,0.87873,0.883542,0.871992,0.870891,0.015703,7
3,0.237084,0.00152,0.01454,0.000508,15,13,log_loss,"{'min_samples_leaf': 15, 'max_depth': 13, 'cri...",0.873077,0.838462,0.875842,0.884504,0.87103,0.868583,0.015746,16
4,0.259992,0.002642,0.014538,0.000479,8,11,log_loss,"{'min_samples_leaf': 8, 'max_depth': 11, 'crit...",0.875,0.839423,0.876805,0.885467,0.876805,0.8707,0.016058,8
5,0.217253,0.021484,0.014072,0.000683,15,7,gini,"{'min_samples_leaf': 15, 'max_depth': 7, 'crit...",0.869231,0.834615,0.872955,0.883542,0.86333,0.864735,0.016436,29
6,0.268018,0.002321,0.014683,0.000384,7,11,entropy,"{'min_samples_leaf': 7, 'max_depth': 11, 'crit...",0.881731,0.844231,0.875842,0.883542,0.87488,0.872045,0.014297,3
7,0.235542,0.005104,0.014864,0.000692,14,9,entropy,"{'min_samples_leaf': 14, 'max_depth': 9, 'crit...",0.873077,0.838462,0.872955,0.883542,0.87103,0.867813,0.015319,19
8,0.217292,0.00336,0.014064,0.000189,11,7,entropy,"{'min_samples_leaf': 11, 'max_depth': 7, 'crit...",0.857692,0.832692,0.876805,0.877767,0.866218,0.862235,0.016509,30
9,0.148809,0.001846,0.010038,0.000496,8,3,entropy,"{'min_samples_leaf': 8, 'max_depth': 3, 'crite...",0.8,0.800962,0.807507,0.833494,0.818094,0.812011,0.012531,44
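
The full results table is wide, so sorting by rank and keeping a few columns makes the candidates easier to compare. A minimal sketch, assuming a small search on synthetic data; the column names are the ones shown in the table above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a small search (assumptions made for speed).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=7),
    param_distributions={
        "max_depth": list(range(2, 8)),
        "min_samples_leaf": list(range(5, 16)),
    },
    n_iter=5,
    cv=3,
    random_state=0,
).fit(X, y)

results = pd.DataFrame(search.cv_results_)

# Keep the columns that matter for comparison and put the best candidate first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = results.sort_values("rank_test_score")[columns].head(3)
print(top)
```

Reading std_test_score next to mean_test_score shows whether the top candidates really differ or only vary within fold-to-fold noise.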
