#### Ensemble - RandomForest & ExtraTree
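- Bagging ensemble ==> random samples drawn with replacement + the same base model (DT)
    * Representative algorithm: RandomForestClassifier / RandomForestRegressor
- Pasting ensemble ==> random samples drawn without replacement + the same base model (DT)
    * Representative algorithm: ExtraTreesClassifier / ExtraTreesRegressor (in scikit-learn, `bootstrap=False` by default, so rows are not resampled with replacement)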

- Wine classification => binary classification into two classes, 0 and 1
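
The difference between bagging and pasting comes down to how rows are drawn for each base learner. The sketch below illustrates it with the standard library only; `draw_indices` is a hypothetical helper written for this illustration, not part of scikit-learn, and the sample sizes are arbitrary assumptions.

```python
import random


def draw_indices(n_samples, n_draw, replace, seed):
    """Draw row indices for one base learner (hypothetical helper).

    replace=True  -> bagging: bootstrap sampling, a row may be drawn repeatedly
    replace=False -> pasting: each row is drawn at most once
    """
    rng = random.Random(seed)
    if replace:
        return [rng.randrange(n_samples) for _ in range(n_draw)]
    return rng.sample(range(n_samples), n_draw)


n_samples = 1000  # assumed dataset size, for illustration only
bagging_idx = draw_indices(n_samples, n_samples, replace=True, seed=0)
pasting_idx = draw_indices(n_samples, 800, replace=False, seed=0)

print(f'bagging: {len(set(bagging_idx))} unique rows out of {len(bagging_idx)} draws')
print(f'pasting: {len(set(pasting_idx))} unique rows out of {len(pasting_idx)} draws')
```

A bootstrap sample of size n contains roughly 63% distinct rows on average; the rows that were never drawn are the out-of-bag rows for that learner.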

[1] Load modules and prepare the data

In [None]:
# Load modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Data file
DATA_FILE = '../data/wine.csv'
# Load the data
wineDF = pd.read_csv(DATA_FILE)

In [None]:
# Inspect the data
wineDF.info()

In [None]:
# Class distribution of the target (label)
wineDF['class'].value_counts()
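
When checking the class distribution, the relative frequencies are often easier to read than raw counts. A minimal sketch, assuming a made-up label Series (`labels` stands in for `wineDF['class']` and does not reflect the real data):

```python
import pandas as pd

# Hypothetical labels standing in for wineDF['class']
labels = pd.Series([0] * 3 + [1] * 9, name='class')

counts = labels.value_counts()                 # absolute counts per class
ratios = labels.value_counts(normalize=True)   # relative frequencies per class

print(pd.DataFrame({'count': counts, 'ratio': ratios}))
```

If the ratios are far from balanced, a stratified split helps keep the same proportions in the training and test sets.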

In [None]:
# describe() summarizes the basic statistics of a DataFrame or Series.
# It is provided by pandas and gives a quick overview of the data.
wineDF.describe()

[2] Prepare for training

In [None]:
# Split into training and test datasets
from sklearn.model_selection import train_test_split

In [None]:
# Separate the features (independent variables) from the target/label (dependent variable)
featureDF = wineDF[wineDF.columns[:-1]]
targetSR = wineDF[wineDF.columns[-1]]

print(f'featureDF : {featureDF.shape}, targetSR : {targetSR.shape}')

In [None]:
X_train, X_test, y_trian , y_test= train_test_split(featureDF,targetSR,test_size=0.2,
                                                    stratify=targetSR,
                                                    random_state=1)
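
The `stratify` argument used in the split keeps the class proportions of the target the same in both subsets. The sketch below shows this on synthetic labels; the 75/25 class ratio and the array names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (assumption: 75% class 0, 25% class 1)
y = np.array([0] * 750 + [1] * 250)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# The share of class 1 should be (almost) identical in every subset
print(f'all   : {y.mean():.3f}')
print(f'train : {y_tr.mean():.3f}')
print(f'test  : {y_te.mean():.3f}')
```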

In [None]:
print(f'X_train : {X_train.shape}, y_train : {y_trian.shape}')
print(f'X_test : {X_test.shape}, y_test : {y_test.shape}')

[3] Training

In [None]:
# Learning method : supervised learning > classification
# Algorithm       : ensemble > bagging - RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
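# Create the instance -> bootstrap datasets are generated for the 100 internal DT models
# Setting random_state fixes those datasets, so the result is reproducible
# oob_score=True is needed so that the oob_score_ attribute exists after fitting
lf_model = RandomForestClassifier(oob_score=True, random_state=7)

# Train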
lf_model.fit(X_train, y_trian)
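
The two comments above (100 internal trees, fixed datasets through `random_state`) can be checked directly. A minimal sketch, assuming a synthetic dataset from `make_classification` as a stand-in for the wine data (row and feature counts are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wine data (assumption: 500 rows, 3 features)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

model_a = RandomForestClassifier(random_state=7).fit(X, y)
model_b = RandomForestClassifier(random_state=7).fit(X, y)

# n_estimators defaults to 100, so 100 decision trees are fitted
print(f'number of trees : {len(model_a.estimators_)}')

# The same random_state gives the same bootstrap samples and the same trees
same = bool((model_a.predict_proba(X) == model_b.predict_proba(X)).all())
print(f'identical predictions : {same}')
```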

In [None]:
# Attributes learned during fitting
print(f'classes_ : {lf_model.classes_}')
print(f'n_classes_ : {lf_model.n_classes_} classes')
print(f'feature_names_in_ : {lf_model.feature_names_in_}')
print(f'n_features_in_ : {lf_model.n_features_in_} features')

In [None]:
# Base estimator used to build each tree (named base_estimator_ before scikit-learn 1.2)
print(f'estimator_ : {lf_model.estimator_}')

In [None]:
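# oob_score_ exists only when the model was created with oob_score=True
print(f'oob_score_ : {getattr(lf_model, "oob_score_", "not available (oob_score=False)")}')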
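
The out-of-bag score is computed only on request. The sketch below, again on an assumed synthetic dataset, shows that a forest fitted with the defaults has no `oob_score_` attribute, while one created with `oob_score=True` does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wine data (assumption: 500 rows, 3 features)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

without_oob = RandomForestClassifier(random_state=7).fit(X, y)
with_oob = RandomForestClassifier(oob_score=True, random_state=7).fit(X, y)

# Default (oob_score=False): the attribute is never created
print(f'default model has oob_score_ : {hasattr(without_oob, "oob_score_")}')

# oob_score=True: accuracy estimated on rows each tree did not see
print(f'oob_score_ : {with_oob.oob_score_:.3f}')
```

Because each tree is scored on rows left out of its bootstrap sample, the OOB score gives a validation-like estimate without a separate hold-out set.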

[4] Performance evaluation

In [None]:
train_score = lf_model.score(X_train, y_trian)
test_score = lf_model.score(X_test, y_test)
print(f'train score : {train_score:.4f}, test score : {test_score:.4f}')
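
Comparing the training and test scores is a quick overfitting check, and the same comparison can be run for the ExtraTrees counterpart named in the title. A sketch on assumed synthetic data; the numbers it prints say nothing about the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption: 1000 rows, 3 features)
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

scores = {}
for cls in (RandomForestClassifier, ExtraTreesClassifier):
    model = cls(random_state=7).fit(X_tr, y_tr)
    # score() returns mean accuracy for classifiers
    scores[cls.__name__] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for name, (train_acc, test_acc) in scores.items():
    print(f'{name:<24} train={train_acc:.3f} test={test_acc:.3f} '
          f'gap={train_acc - test_acc:.3f}')
```

A large gap between the two scores suggests the trees are fitting the training data too closely, which is what the tuning step tries to reduce.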

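[5] Tuning
- RandomizedSearchCV: a hyperparameter optimization class
* Well suited to hyperparameters with wide search ranges
* Samples hyperparameter combinations from the specified ranges a specified number of times (`n_iter`, 10 by default) and evaluates each one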
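
The sampling behaviour described above can be sketched as follows. This is an illustration under assumptions, not the notebook's own configuration: the data are synthetic, the forest is kept small (`n_estimators=20`) for speed, and `scipy.stats.randint` is used to describe the ranges as distributions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the wine data (assumption: 300 rows, 3 features)
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# randint(low, high) samples integers from [low, high)
param_distributions = {'max_depth': randint(2, 15),
                       'min_samples_leaf': randint(5, 16)}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=7),
    param_distributions=param_distributions,
    n_iter=5,         # number of sampled combinations
    cv=3,             # 3-fold cross-validation per combination
    random_state=0)   # makes the sampled combinations reproducible
search.fit(X, y)

print(f'candidates tried : {len(search.cv_results_["params"])}')
print(f'best params      : {search.best_params_}')
```

Only `n_iter` combinations are evaluated regardless of how large the ranges are, which is why this approach scales better than an exhaustive grid.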

In [25]:
from sklearn.model_selection import RandomizedSearchCV

In [26]:
# RandomForestClassifier hyperparameters to search
params = {'max_depth': range(2,15),
          'min_samples_leaf' : range(5,16)}

In [27]:
rf_model= RandomForestClassifier(random_state=7)

In [28]:
searchCV =  RandomizedSearchCV(rf_model,
                               param_distributions=params,
                               verbose=4)

In [29]:
searchCV.fit(X_train, y_trian)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END ...max_depth=9, min_samples_leaf=6;, score=0.875 total time=   0.2s
[CV 2/5] END ...max_depth=9, min_samples_leaf=6;, score=0.841 total time=   0.2s
[CV 3/5] END ...max_depth=9, min_samples_leaf=6;, score=0.878 total time=   0.2s
[CV 4/5] END ...max_depth=9, min_samples_leaf=6;, score=0.880 total time=   0.2s
[CV 5/5] END ...max_depth=9, min_samples_leaf=6;, score=0.876 total time=   0.2s
[CV 1/5] END ..max_depth=9, min_samples_leaf=15;, score=0.871 total time=   0.2s
[CV 2/5] END ..max_depth=9, min_samples_leaf=15;, score=0.838 total time=   0.2s
[CV 3/5] END ..max_depth=9, min_samples_leaf=15;, score=0.880 total time=   0.2s
[CV 4/5] END ..max_depth=9, min_samples_leaf=15;, score=0.881 total time=   0.2s
[CV 5/5] END ..max_depth=9, min_samples_leaf=15;, score=0.871 total time=   0.2s
[CV 1/5] END ..max_depth=11, min_samples_leaf=8;, score=0.875 total time=   0.2s
[CV 2/5] END ..max_depth=11, min_samples_leaf=8;

In [32]:
# Search results
print(f'[searchCV.best_score_]{searchCV.best_score_}')
print(f'[searchCV.best_params_]{searchCV.best_params_}')
print(f'[searchCV.best_estimator_]{searchCV.best_estimator_}')

cv_resultDF =pd.DataFrame(searchCV.cv_results_)
cv_resultDF

[searchCV.best_score_]0.8699294810098467
[searchCV.best_params_]{'min_samples_leaf': 6, 'max_depth': 9}
[searchCV.best_estimator_]RandomForestClassifier(max_depth=9, min_samples_leaf=6, random_state=7)


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_min_samples_leaf | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.282648 | 0.014238 | 0.015705 | 0.001275 | 6 | 9 | {'min_samples_leaf': 6, 'max_depth': 9} | 0.875 | 0.841346 | 0.877767 | 0.879692 | 0.875842 | 0.869929 | 0.014383 | 1 |
| 1 | 0.246728 | 0.01268 | 0.016728 | 0.001451 | 15 | 9 | {'min_samples_leaf': 15, 'max_depth': 9} | 0.871154 | 0.8375 | 0.879692 | 0.880654 | 0.87103 | 0.868006 | 0.015787 | 3 |
| 2 | 0.289315 | 0.015469 | 0.01667 | 0.001169 | 8 | 11 | {'min_samples_leaf': 8, 'max_depth': 11} | 0.875 | 0.840385 | 0.877767 | 0.880654 | 0.873917 | 0.869545 | 0.014766 | 2 |
| 3 | 0.263004 | 0.016849 | 0.022999 | 0.015677 | 13 | 11 | {'min_samples_leaf': 13, 'max_depth': 11} | 0.874038 | 0.832692 | 0.871992 | 0.880654 | 0.87488 | 0.866851 | 0.01732 | 5 |
| 4 | 0.153396 | 0.019877 | 0.010871 | 0.000883 | 13 | 2 | {'min_samples_leaf': 13, 'max_depth': 2} | 0.755769 | 0.767308 | 0.770934 | 0.753609 | 0.753609 | 0.760246 | 0.007379 | 9 |
| 5 | 0.2202 | 0.019024 | 0.013112 | 0.000784 | 14 | 5 | {'min_samples_leaf': 14, 'max_depth': 5} | 0.852885 | 0.834615 | 0.866218 | 0.87488 | 0.854668 | 0.856653 | 0.01362 | 7 |
| 6 | 0.148293 | 0.002541 | 0.010053 | 0.000548 | 14 | 2 | {'min_samples_leaf': 14, 'max_depth': 2} | 0.755769 | 0.767308 | 0.770934 | 0.753609 | 0.753609 | 0.760246 | 0.007379 | 9 |
| 7 | 0.177975 | 0.004377 | 0.011661 | 0.001127 | 11 | 4 | {'min_samples_leaf': 11, 'max_depth': 4} | 0.840385 | 0.838462 | 0.857555 | 0.858518 | 0.839269 | 0.846838 | 0.009169 | 8 |
| 8 | 0.278976 | 0.01097 | 0.01581 | 0.001505 | 6 | 8 | {'min_samples_leaf': 6, 'max_depth': 8} | 0.874038 | 0.836538 | 0.87488 | 0.876805 | 0.868142 | 0.866081 | 0.015052 | 6 |
| 9 | 0.249981 | 0.011065 | 0.015818 | 0.002305 | 11 | 8 | {'min_samples_leaf': 11, 'max_depth': 8} | 0.871154 | 0.840385 | 0.875842 | 0.882579 | 0.865255 | 0.867043 | 0.014488 | 4 |
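
The full `cv_results_` table is wide, so it can help to keep only the columns needed for comparison and sort by rank. A minimal sketch on assumed synthetic data with a small search (`n_estimators=20`, `n_iter=5`, `cv=3` are illustrative choices, not the settings used above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the wine data (assumption: 300 rows, 3 features)
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=7),
    param_distributions={'max_depth': range(2, 15),
                         'min_samples_leaf': range(5, 16)},
    n_iter=5, cv=3, random_state=0).fit(X, y)

# Keep the columns that matter for comparing candidates, best rank first
columns = ['param_max_depth', 'param_min_samples_leaf',
           'mean_test_score', 'std_test_score', 'rank_test_score']
summary = (pd.DataFrame(search.cv_results_)[columns]
           .sort_values('rank_test_score')
           .reset_index(drop=True))
print(summary)
```

The first row of `summary` corresponds to `best_params_`, and its `mean_test_score` equals `best_score_`.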
