## Fine-Tuning Models

### 1) Key concept: after selecting several candidate models, tune each model's detailed parameters (hyperparameters) or otherwise refine the model to push its performance further

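### 2) Main methods
###   a. Grid Search CV: the most common approach; evaluates every combination of the candidate hyperparameter values with cross-validation and keeps the best-scoring combination
###   b. Random Search CV: on each iteration, draws random values for the hyperparameters and evaluates them; runs for a fixed number of iterations, so each iteration explores a different hyperparameter combination
###   c. Hyperopt: searches for hyperparameter values using a Bayesian optimization approach. Bayesian optimization looks for the input that maximizes or minimizes an objective function; it fits a surrogate model to the (hyperparameters, objective value) pairs observed so far and updates it sequentially after each evaluation to home in on the best combination
###   d. Optuna: a framework for hyperparameter tuning
(I used to just use GSCV myself, but these days I use this one.)
###   e. Ensemble methods: combine several models to obtain a better one. This can work well when the individual models make different kinds of errors, but it can overfit when, for example, the dataset is small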
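
As a rough illustration of how grid search and random search generate their candidates, the sketch below enumerates grid candidates with `itertools.product` and draws random candidates with `random`. The `space` dictionary and its values are made up for illustration and are not taken from the example that follows.

```python
import itertools
import random

# Hypothetical search space (illustrative values only).
space = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
    "class_weight": [None, "balanced"],
}

# Grid search: every combination is a candidate (4 * 2 * 2 = 16).
names = list(space)
grid_candidates = [dict(zip(names, combo))
                   for combo in itertools.product(*space.values())]

# Random search: a fixed budget of candidates, each drawn independently.
rng = random.Random(0)
n_iter = 5
random_candidates = [{name: rng.choice(values) for name, values in space.items()}
                     for _ in range(n_iter)]

print(len(grid_candidates))    # 16
print(len(random_candidates))  # 5
```

The grid grows multiplicatively with every added hyperparameter, while the random-search budget stays at whatever `n_iter` is set to.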


                        

### Example) The code below uses Grid Search CV to find hyperparameters and compare model performance (Random Search CV is omitted because it is used almost the same way as GSCV)

In [211]:
import pandas as pd
import numpy as np

In [212]:
# !pip install feature_engine

In [213]:
df = pd.read_csv('./WA_Fn-UseC_-Telco-Customer-Churn.csv')
df = df.set_index('customerID')  ## the customer ID is not needed for analysis, so use it as the DataFrame index

In [214]:
pd.set_option('display.max_columns', None)  ## display all columns
df.head()

customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [215]:
df.shape

(7043, 20)

In [216]:
df.isnull().sum()  ## looks like there are no missing values... or so it seems

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [217]:
## To separate categorical from continuous features, first split the columns into two groups by dtype
cat_columns = df.select_dtypes(['int', 'object']).columns 
ct_columns = df.select_dtypes('float').columns

#### Columns in cat_columns are not necessarily categorical, so they need a separate check

In [218]:
## inspect cat_columns
cat_columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'TotalCharges', 'Churn'],
      dtype='object')

#### -. 'TotalCharges' is oddly of object type => a dtype conversion looks necessary
#### -. 'Churn' is the dependent variable we need to predict
#### -. Explore the remaining features further

In [219]:
## check the number of unique values of the features assumed to be categorical
import operator
cat_col_info = {}
for cat_col in cat_columns:
    cat_col_info[cat_col] = len(df[cat_col].unique())
    
sorted(cat_col_info.items(), key=operator.itemgetter(1), reverse=True) ## sort features by number of unique values, descending

[('TotalCharges', 6531),
 ('tenure', 73),
 ('PaymentMethod', 4),
 ('MultipleLines', 3),
 ('InternetService', 3),
 ('OnlineSecurity', 3),
 ('OnlineBackup', 3),
 ('DeviceProtection', 3),
 ('TechSupport', 3),
 ('StreamingTV', 3),
 ('StreamingMovies', 3),
 ('Contract', 3),
 ('gender', 2),
 ('SeniorCitizen', 2),
 ('Partner', 2),
 ('Dependents', 2),
 ('PhoneService', 2),
 ('PaperlessBilling', 2),
 ('Churn', 2)]

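#### - As expected, 'TotalCharges' has a very large number of unique values, so it is reasonable to treat it as continuous; by its nature it should be a float, so we convert its dtype
#### - 'tenure' also has many unique values, so it is better treated as continuous
#### - The remaining features have only 2 to 4 unique values each, so we treat them as categorical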

In [220]:
## Convert the TotalCharges dtype; this raises an error because some entries contain only a blank string (' ')
df['TotalCharges'] = df['TotalCharges'].astype('float64') 

ValueError: ignored

In [221]:
## Replace the blank (' ') TotalCharges entries with NaN, then convert the dtype
df.loc[df['TotalCharges'] == ' ', 'TotalCharges'] = np.nan
df['TotalCharges'] = df['TotalCharges'].astype('float64')
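
An alternative that avoids matching the exact placeholder string is `pd.to_numeric` with `errors='coerce'`, which turns any unparseable entry into NaN. A minimal sketch on a made-up Series (the values are illustrative stand-ins, not a claim about how the dataset should be processed):

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numbers stored as strings, one blank entry.
total_charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" converts anything that cannot be parsed as a number to NaN.
converted = pd.to_numeric(total_charges, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

The trade-off is that coercion also hides unexpected garbage values, so it is worth checking the NaN count afterwards.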

In [222]:
df.isnull().sum() ### TotalCharges turns out to have 11 missing values

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [223]:
## Missing values make up only about 0.16% of all rows, so drop them
print((len(df.loc[df['TotalCharges'].isnull()]) / df.shape[0]) * 100)
df = df.loc[~df['TotalCharges'].isnull()].copy()  ## .copy() avoids SettingWithCopyWarning on later column assignments; the column is already float64

0.1561834445548772


In [224]:
## Redefine the categorical and continuous feature lists
cat_columns = [x for x in cat_columns if x not in ('TotalCharges', 'Churn', 'tenure')]
ct_columns = list(df.select_dtypes('float').columns) + ['tenure']

In [225]:
print("Categorical features : \n", cat_columns)
print("Continuous features : \n", ct_columns)

Categorical features : 
 ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
Continuous features : 
 ['MonthlyCharges', 'TotalCharges', 'tenure']


In [226]:
## change the dtype of one column (SeniorCitizen is a 0/1 flag, so treat it as categorical)
df['SeniorCitizen'] = df['SeniorCitizen'].astype(object)

In [227]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7032 non-null   object 
 1   SeniorCitizen     7032 non-null   object 
 2   Partner           7032 non-null   object 
 3   Dependents        7032 non-null   object 
 4   tenure            7032 non-null   int64  
 5   PhoneService      7032 non-null   object 
 6   MultipleLines     7032 non-null   object 
 7   InternetService   7032 non-null   object 
 8   OnlineSecurity    7032 non-null   object 
 9   OnlineBackup      7032 non-null   object 
 10  DeviceProtection  7032 non-null   object 
 11  TechSupport       7032 non-null   object 
 12  StreamingTV       7032 non-null   object 
 13  StreamingMovies   7032 non-null   object 
 14  Contract          7032 non-null   object 
 15  PaperlessBilling  7032 non-null   object 
 16  PaymentMethod     7032 non-null 

In [228]:
### separate the independent variables (X) from the dependent variable (y)
X = df.drop('Churn', axis=1)
y = df['Churn']

In [229]:
## split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y, random_state=34)

In [230]:
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(5625, 19) (1407, 19) (5625,) (1407,)


In [231]:
## check the label distribution (mild class imbalance)
y_train.value_counts(normalize=True)

No     0.734222
Yes    0.265778
Name: Churn, dtype: float64

In [232]:
# convert the labels to 0/1
y_train = y_train.apply(lambda x : 0 if x == 'No' else 1)
y_test = y_test.apply(lambda x : 0 if x == 'No' else 1)

In [233]:
## outlier check using the 1.5 * IQR rule
def IQR_rule(val_list):
    Q1 = np.quantile(val_list, 0.25)
    Q3 = np.quantile(val_list, 0.75)
    IQR = Q3-Q1
    
    ## True for values inside the fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    not_outlier_condition = (Q3 + 1.5*IQR > val_list) & (Q1 - 1.5*IQR < val_list)
    return not_outlier_condition

In [234]:
len(x_train) - x_train[ct_columns].apply(IQR_rule).sum(axis=0) ## no outliers found

MonthlyCharges    0
TotalCharges      0
tenure            0
dtype: int64
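
For reference, the conventional 1.5 × IQR rule places the lower fence at Q1 − 1.5·IQR and the upper fence at Q3 + 1.5·IQR. A small worked sketch on made-up numbers (the helper name `iqr_fences` is hypothetical):

```python
import numpy as np

def iqr_fences(values):
    """Return the (lower, upper) fences of the 1.5 * IQR rule."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
lower, upper = iqr_fences(values)

# Values outside the fences are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper)  # -3.0 13.0
print(outliers)      # [100.]
```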

In [235]:
## check the skewness of the continuous variables
for col in ct_columns:
    print("{} : {:.4f}".format(col, x_train[col].skew()))

MonthlyCharges : -0.2141
TotalCharges : 0.9741
tenure : 0.2483


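#### All three skewness values lie within ±1 (TotalCharges is the most right-skewed at 0.97), so the variables can be treated as not severely skewed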

In [236]:
## scale the continuous variables (fit on the training set only, then apply to the test set)
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
sc = StandardScaler()
x_train[ct_columns] = sc.fit_transform(x_train[ct_columns])
x_test[ct_columns] = sc.transform(x_test[ct_columns])

In [237]:
## one-hot encode the categorical variables
x_train = pd.get_dummies(x_train, columns=cat_columns, drop_first=True)
x_test = pd.get_dummies(x_test, columns=cat_columns, drop_first=True)
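
Because `pd.get_dummies` is applied to the training and test sets separately, a category that happens to be missing from one split would produce mismatched columns. One common safeguard is to reindex the test frame to the training columns. A sketch on made-up frames (the toy data is an assumption for illustration):

```python
import pandas as pd

train = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})
test = pd.DataFrame({"Contract": ["Month-to-month", "One year"]})  # "Two year" is absent

train_enc = pd.get_dummies(train, columns=["Contract"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["Contract"], drop_first=True)

# Align the test columns to the training columns; missing dummies are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))  # ['Contract_One year', 'Contract_Two year']
print(test_enc.shape)           # (2, 2)
```

An encoder fitted on the training set only, such as scikit-learn's `OneHotEncoder`, is another way to get the same guarantee.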

In [238]:
x_train = x_train.reset_index(drop=True)
x_test = x_test.reset_index(drop=True)

In [239]:
print(x_train.shape, x_test.shape)

(5625, 30) (1407, 30)


## Model Fine-Tuning Techniques (main topic)

In [240]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

In [76]:
## Use StratifiedKFold for cross-validation (so that the target classes are not concentrated in particular folds)
folds = StratifiedKFold(n_splits=20, shuffle=True, random_state=42)
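
To see what stratification buys, the sketch below splits a made-up, imbalanced label vector and prints the positive rate in each validation fold. The toy labels and the choice of 5 folds are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels with 25% positives, a similar imbalance to the churn labels.
y_toy = np.array([1] * 25 + [0] * 75)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = [y_toy[val_idx].mean() for _, val_idx in skf.split(X_toy, y_toy)]
print(fold_rates)  # every fold keeps the 25% positive rate
```

With a plain `KFold`, the per-fold positive rate would vary with the shuffle, which makes fold scores noisier on imbalanced targets.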

In [77]:
### Estimate the best hyperparameters for the Logistic Regression model (GSCV)
lr = LogisticRegression(max_iter=1000)

## dictionary of hyperparameter values to combine
parameter_grid = {'class_weight' : ['balanced', None],
                  'penalty' : ['l2', 'l1'],
                  'C' : [0.001, 0.05, 0.08, 0.01, 0.1, 1.0, 10.0],
                  'solver': ['liblinear']
                 }

## fit Logistic Regression over every hyperparameter combination and keep the one with the best cross-validated roc_auc
grid_search = GridSearchCV(lr, param_grid=parameter_grid, cv=folds, scoring='roc_auc', n_jobs=-1)  
grid_search.fit(x_train, y_train)
print(f'Best score of GridSearchCV: {grid_search.best_score_}')  ### best cross-validated score
print(f'Best parameters: {grid_search.best_params_}')  ## best hyperparameters

Best score of GridSearchCV: 0.8490748469304213
Best parameters: {'C': 10.0, 'class_weight': None, 'penalty': 'l2', 'solver': 'liblinear'}
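
The random-search variant has the same interface. A minimal sketch with `RandomizedSearchCV` on synthetic data (the dataset from `make_classification`, the `loguniform` range for `C`, and `n_iter=10` are assumptions for illustration, not settings from this notebook):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

param_distributions = {
    "C": loguniform(1e-3, 1e1),        # sampled continuously instead of from a fixed list
    "penalty": ["l1", "l2"],
    "class_weight": ["balanced", None],
}

random_search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,                         # number of sampled combinations
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
    random_state=42,
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_)
print(random_search.best_score_)
```

Unlike the grid, the cost here is fixed by `n_iter` regardless of how many hyperparameters are searched.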


In [78]:
pd.DataFrame(grid_search.cv_results_)[:5]  ## show the cross-validation results as a DataFrame (first 5 rows)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_class_weight,param_penalty,param_solver,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,split10_test_score,split11_test_score,split12_test_score,split13_test_score,split14_test_score,split15_test_score,split16_test_score,split17_test_score,split18_test_score,split19_test_score,mean_test_score,std_test_score,rank_test_score
0,0.025596,0.005659,0.003734,0.00051,0.001,balanced,l2,liblinear,"{'C': 0.001, 'class_weight': 'balanced', 'pena...",0.803092,0.841739,0.841159,0.851367,0.770759,0.869903,0.855987,0.854563,0.826537,0.85068,0.843042,0.82712,0.821618,0.866537,0.85877,0.805049,0.835275,0.827314,0.826602,0.874175,0.837564,0.024648,23
1,0.015718,0.002906,0.003334,0.000178,0.001,balanced,l1,liblinear,"{'C': 0.001, 'class_weight': 'balanced', 'pena...",0.704734,0.74496,0.744219,0.739653,0.713464,0.75754,0.751424,0.749676,0.747152,0.815631,0.735987,0.757638,0.752039,0.762362,0.741553,0.732654,0.747282,0.693786,0.745534,0.738285,0.743779,0.023726,27
2,0.021013,0.003318,0.003789,0.001412,0.001,,l2,liblinear,"{'C': 0.001, 'class_weight': None, 'penalty': ...",0.794589,0.84161,0.830081,0.842041,0.769992,0.85877,0.860518,0.84822,0.830162,0.840065,0.836052,0.821748,0.81644,0.85521,0.846602,0.794628,0.832751,0.821294,0.80699,0.870939,0.830935,0.024545,25
3,0.015343,0.003048,0.003581,0.001379,0.001,,l1,liblinear,"{'C': 0.001, 'class_weight': None, 'penalty': ...",0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.0,28
4,0.028838,0.001632,0.003654,0.000247,0.05,balanced,l2,liblinear,"{'C': 0.05, 'class_weight': 'balanced', 'penal...",0.810113,0.851014,0.860483,0.860692,0.784108,0.879935,0.866796,0.860388,0.830291,0.872168,0.850097,0.821359,0.841812,0.875728,0.864531,0.81411,0.84246,0.845631,0.8389,0.878447,0.847453,0.02475,15


In [79]:
### Predict on the test set with the best model and check the resulting roc_auc
### ROC-AUC needs a score for the positive label rather than a hard class prediction, so use the predict_proba method
y_pred_test_lr = grid_search.best_estimator_.predict_proba(x_test)[:, 1]
print(f'test score: {roc_auc_score(y_test, y_pred_test_lr)}')  ## roc_auc_score of the best model

test score: 0.8246332005868793
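
ROC-AUC can be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, which is why it is computed from scores such as `predict_proba` output. A small sketch with made-up scores (the helper `pairwise_auc` is hypothetical and quadratic in the number of samples, so it is meant for illustration only):

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, hence 0.75.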


In [80]:
### Estimate the best hyperparameters for the SVC model (GSCV)
svc = SVC(probability=True, gamma='scale')  ## enable the SVC probability option so that predict_proba is available for ROC-AUC

parameter_grid = {'C': [0.01, 0.1, 1.0, 10.0],
                  'kernel': ['linear', 'poly', 'rbf'],
                 }

grid_search = GridSearchCV(svc, param_grid=parameter_grid, cv=folds, scoring='roc_auc', n_jobs=-1)
grid_search.fit(x_train, y_train)
print(f'Best score of GridSearchCV: {grid_search.best_score_}')
print(f'Best parameters: {grid_search.best_params_}')

Best score of GridSearchCV: 0.8393338916330741
Best parameters: {'C': 1.0, 'kernel': 'linear'}


In [81]:
pd.DataFrame(grid_search.cv_results_)[:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,split10_test_score,split11_test_score,split12_test_score,split13_test_score,split14_test_score,split15_test_score,split16_test_score,split17_test_score,split18_test_score,split19_test_score,mean_test_score,std_test_score,rank_test_score
0,6.019084,0.09878,0.037874,0.001105,0.01,linear,"{'C': 0.01, 'kernel': 'linear'}",0.815266,0.843221,0.850113,0.850856,0.765457,0.874304,0.85657,0.847638,0.822913,0.86932,0.83877,0.812621,0.825437,0.871133,0.860259,0.804854,0.839029,0.839482,0.828867,0.87055,0.839333,0.026161,2
1,7.094319,0.042118,0.047812,0.003385,0.01,poly,"{'C': 0.01, 'kernel': 'poly'}",0.796329,0.823253,0.846441,0.836357,0.736714,0.844595,0.849515,0.840259,0.795858,0.834628,0.82712,0.773981,0.797994,0.8411,0.841359,0.799288,0.81521,0.825502,0.785825,0.854951,0.818314,0.029603,6
2,9.715111,0.032591,0.077508,0.005326,0.01,rbf,"{'C': 0.01, 'kernel': 'rbf'}",0.801224,0.83401,0.841932,0.848556,0.757218,0.864919,0.849773,0.82822,0.813528,0.848997,0.822201,0.790615,0.788932,0.858058,0.845049,0.7989,0.823236,0.82712,0.797735,0.860194,0.825021,0.027986,5
3,5.899014,0.039423,0.037007,0.003929,0.1,linear,"{'C': 0.1, 'kernel': 'linear'}",0.81372,0.842705,0.852174,0.852644,0.765585,0.874628,0.85877,0.843689,0.82343,0.873528,0.834175,0.808932,0.827184,0.871068,0.856893,0.806537,0.837411,0.845631,0.825372,0.870227,0.839215,0.026641,4
4,6.842065,0.098516,0.041446,0.000701,0.1,poly,"{'C': 0.1, 'kernel': 'poly'}",0.781965,0.82132,0.846763,0.835271,0.764755,0.838511,0.854239,0.825566,0.801683,0.840841,0.813398,0.7611,0.786084,0.82479,0.832751,0.793398,0.802718,0.833463,0.788479,0.854628,0.815086,0.027849,8


In [82]:
### Predict on the test set with the best model and check the resulting roc_auc
y_pred_test_svc = grid_search.best_estimator_.predict_proba(x_test)[:, 1]
print(f'test score: {roc_auc_score(y_test, y_pred_test_svc)}')  ## roc_auc_score of the best model

test score: 0.8098538042339132


In [83]:
### Estimate the best hyperparameters for the RandomForestClassifier model (GSCV)
rfc = RandomForestClassifier()
parameter_grid = {
    'n_estimators' : [100,200,300], 
    'max_features' : [2,4,6,8],
    'criterion' : ['gini', 'entropy'], 
    'max_depth' : [6,8,10, 16, 20]
}

grid_search = GridSearchCV(rfc, param_grid=parameter_grid, cv=folds, scoring='roc_auc', n_jobs=-1)
grid_search.fit(x_train, y_train)
print(f'Best score of GridSearchCV: {grid_search.best_score_}')
print(f'Best parameters: {grid_search.best_params_}')

Best score of GridSearchCV: 0.8499261075687672
Best parameters: {'criterion': 'gini', 'max_depth': 8, 'max_features': 6, 'n_estimators': 200}


In [84]:
pd.DataFrame(grid_search.cv_results_)[:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_max_features,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,split10_test_score,split11_test_score,split12_test_score,split13_test_score,split14_test_score,split15_test_score,split16_test_score,split17_test_score,split18_test_score,split19_test_score,mean_test_score,std_test_score,rank_test_score
0,0.462264,0.0102,0.024293,0.002906,gini,6,2,100,"{'criterion': 'gini', 'max_depth': 6, 'max_fea...",0.81475,0.848663,0.844638,0.865355,0.784044,0.873463,0.87877,0.864725,0.838252,0.862071,0.849061,0.827346,0.828026,0.869838,0.854563,0.790032,0.843333,0.857055,0.825793,0.876828,0.84483,0.026087,66
1,0.903722,0.009818,0.041601,0.001829,gini,6,2,200,"{'criterion': 'gini', 'max_depth': 6, 'max_fea...",0.812045,0.854396,0.845604,0.86708,0.784875,0.871521,0.878447,0.866667,0.837735,0.855793,0.850032,0.826667,0.832039,0.873981,0.850032,0.792362,0.843883,0.856505,0.819806,0.87521,0.844734,0.02606,67
2,1.358856,0.01135,0.060288,0.003219,gini,6,2,300,"{'criterion': 'gini', 'max_depth': 6, 'max_fea...",0.810886,0.853366,0.845926,0.872317,0.783725,0.875469,0.8789,0.865696,0.838835,0.855081,0.848608,0.826019,0.826278,0.872168,0.852816,0.79055,0.841748,0.853528,0.822913,0.877217,0.844602,0.026902,68
3,0.550721,0.012035,0.021384,0.000718,gini,6,4,100,"{'criterion': 'gini', 'max_depth': 6, 'max_fea...",0.810564,0.859291,0.847601,0.868102,0.788196,0.878641,0.87767,0.867314,0.842977,0.866667,0.850809,0.825113,0.833981,0.873269,0.858835,0.79288,0.844919,0.859061,0.826602,0.873139,0.847282,0.026277,45
4,1.086942,0.010929,0.039638,0.002814,gini,6,4,200,"{'criterion': 'gini', 'max_depth': 6, 'max_fea...",0.817456,0.853816,0.849823,0.868613,0.791518,0.880388,0.876699,0.867832,0.846731,0.869256,0.84822,0.82822,0.831456,0.876828,0.855793,0.798706,0.843754,0.856181,0.82754,0.870744,0.847979,0.024811,33


In [85]:
### Predict on the test set with the best model and check the resulting roc_auc
y_pred_test_rfc = grid_search.best_estimator_.predict_proba(x_test)[:, 1]
print(f'test score: {roc_auc_score(y_test, y_pred_test_rfc)}')  ## roc_auc_score of the best model

test score: 0.8244773108363027


### To improve performance further, consider feature extraction, feature selection, choosing a better model, building a stacking ensemble, and so on
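
As one possible next step, a stacking ensemble can be sketched with scikit-learn's `StackingClassifier`. The synthetic data, the choice of base models, and their settings below are assumptions for illustration; they are not tuned values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Base models produce out-of-fold predictions; the final estimator learns to combine them.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(stack_auc)
```

Stacking tends to help most when the base models make different kinds of errors; with little data, the extra layer is one more place to overfit.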

## [Extra] Hyperparameter tuning with Optuna

### Repeat the modeling above using an XGBoost classifier

In [291]:
!pip install optuna

Collecting optuna
  Downloading optuna-2.9.1-py3-none-any.whl (302 kB)

In [292]:
# import the required modules
import xgboost
from xgboost import XGBClassifier
from sklearn.metrics import log_loss, roc_auc_score

In [293]:
import optuna
from optuna import Trial
from optuna.samplers import TPESampler

In [311]:
def objective(trial: Trial) -> float:
    params_xgb = {
        "random_state": 42,
        "learning_rate": trial.suggest_categorical("learning_rate", [0.01, 0.05, 0.1]),
        "n_estimators": trial.suggest_categorical("n_estimators", [100, 300, 500, 1000]),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 3e-5),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 9e-2),
        "max_depth": trial.suggest_int("max_depth", 6, 10),
        ## NOTE: num_leaves, subsample_freq and min_child_samples are LightGBM-style names;
        ## XGBoost's own parameters are max_leaves and min_child_weight, so these three
        ## are likely ignored by XGBClassifier
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "subsample": trial.suggest_float("subsample", 0.3, 1.0),
        "subsample_freq": trial.suggest_int("subsample_freq", 1, 10),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "scale_pos_weight" : trial.suggest_int("scale_pos_weight", 1, 3)
    }
    
    x_train_1, x_val_1, y_train_1, y_val_1 = train_test_split(x_train, y_train, test_size=0.2)

    model = XGBClassifier(**params_xgb)
    model.fit(
        x_train_1,
        y_train_1,
        eval_set=[(x_train_1, y_train_1), (x_val_1, y_val_1)],
        early_stopping_rounds=20,
        verbose=False,
    )

    xgb_pred = model.predict_proba(x_val_1)[:, 1]
    roc_auc_sc = roc_auc_score(y_val_1, xgb_pred)
    
    return roc_auc_sc

In [312]:
sampler = TPESampler(seed=42) ## use TPESampler to sample from the defined hyperparameter search space
study = optuna.create_study(
    study_name="xgb_parameter_opt",
    direction="maximize",  ## the metric is roc_auc, so maximize it
    sampler=sampler,   
)
study.optimize(objective, n_trials=50)
print("Best Score:", study.best_value)
print("Best trial:", study.best_trial.params)

[I 2021-08-31 05:53:43,209] A new study created in memory with name: xgb_parameter_opt
[I 2021-08-31 05:53:44,366] Trial 0 finished with value: 0.8128808080808081 and parameters: {'learning_rate': 0.05, 'n_estimators': 100, 'reg_alpha': 2.598662261179031e-05, 'reg_lambda': 0.05410035504573868, 'max_depth': 14, 'num_leaves': 7, 'colsample_bytree': 0.9819459112971965, 'subsample': 0.8827098485602951, 'subsample_freq': 3, 'min_child_samples': 22, 'scale_pos_weight': 1}. Best is trial 0 with value: 0.8128808080808081.
[I 2021-08-31 05:53:45,596] Trial 1 finished with value: 0.8266396200277063 and parameters: {'learning_rate': 0.05, 'n_estimators': 300, 'reg_alpha': 1.0997191680377813e-05, 'reg_lambda': 0.04104630401883339, 'max_depth': 15, 'num_leaves': 52, 'colsample_bytree': 0.708540663048167, 'subsample': 0.7146901982034297, 'subsample_freq': 1, 'min_child_samples': 63, 'scale_pos_weight': 1}. Best is trial 1 with value: 0.8266396200277063.
...

Best Score: 0.8628882688772054
Best trial: {'learning_rate': 0.01, 'n_estimators': 500, 'reg_alpha': 1.5697834791707956e-06, 'reg_lambda': 0.08945464691863866, 'max_depth': 8, 'num_leaves': 156, 'colsample_bytree': 0.7501920281425067, 'subsample': 0.3060859432912322, 'subsample_freq': 10, 'min_child_samples': 11, 'scale_pos_weight': 1}


In [313]:
study.best_params  ## check the best parameters

{'colsample_bytree': 0.7501920281425067,
 'learning_rate': 0.01,
 'max_depth': 8,
 'min_child_samples': 11,
 'n_estimators': 500,
 'num_leaves': 156,
 'reg_alpha': 1.5697834791707956e-06,
 'reg_lambda': 0.08945464691863866,
 'scale_pos_weight': 1,
 'subsample': 0.3060859432912322,
 'subsample_freq': 10}

In [315]:
best_model = XGBClassifier(**study.best_params)  ## define the model with the best parameters
best_model

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7501920281425067, gamma=0,
              learning_rate=0.01, max_delta_step=0, max_depth=8,
              min_child_samples=11, min_child_weight=1, missing=None,
              n_estimators=500, n_jobs=1, nthread=None, num_leaves=156,
              objective='binary:logistic', random_state=0,
              reg_alpha=1.5697834791707956e-06, reg_lambda=0.08945464691863866,
              scale_pos_weight=1, seed=None, silent=None,
              subsample=0.3060859432912322, subsample_freq=10, verbosity=1)

In [317]:
best_model.fit(x_train, y_train) ## train the model with the best parameters on the full training set

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7501920281425067, gamma=0,
              learning_rate=0.01, max_delta_step=0, max_depth=8,
              min_child_samples=11, min_child_weight=1, missing=None,
              n_estimators=500, n_jobs=1, nthread=None, num_leaves=156,
              objective='binary:logistic', random_state=0,
              reg_alpha=1.5697834791707956e-06, reg_lambda=0.08945464691863866,
              scale_pos_weight=1, seed=None, silent=None,
              subsample=0.3060859432912322, subsample_freq=10, verbosity=1)

In [318]:
### Predict on the test set with the best model and check the resulting roc_auc
y_pred_test_xgb = best_model.predict_proba(x_test)[:, 1]
print(f'test score: {roc_auc_score(y_test, y_pred_test_xgb)}')  ## roc_auc_score of the best model

test score: 0.8492333217719015
