## **Jobcare Recommendation Algorithm Competition with CatBoost (Team TAVE)**


## Library & Data Load

### Compute environment
* All work was done in Google Colab.
### Library versions
* catboost : 1.0.4
* eli5 : 0.11.0
* optuna : 2.10.0
* numpy : 1.19.5
* pandas : 1.1.5
* sklearn : 1.0.2

### Training time
* optuna : 2 hours 30 minutes
* catboost : 12 minutes

### Workflow

1. Data preprocessing
  * Preprocessing based on eli5 permutation feature importance

2. Optuna
  * Search for the best parameters

3. Catboost
  * Model fitting

In [None]:
!pip install catboost
!pip install eli5
!pip install optuna

Collecting catboost
  Downloading catboost-1.0.4-cp37-none-manylinux1_x86_64.whl (76.1 MB)
Installing collected packages: catboost
Successfully installed catboost-1.0.4
Collecting eli5
  Downloading eli5-0.11.0-py2.py3-none-any.whl (106 kB)
Installing collected packages: eli5
Successfully installed eli5-0.11.0
Collecting optuna
  Downloading optuna-2.10.0-py3-none-any.whl (308 kB)
Collecting cliff
  Downloading cliff-3.10.0-py3-none-any.whl (80 kB)
Collecting alembic
  Downloading alembic-1.7.5-py3-none-any.whl (209 kB)
Collecting colorlog
  Downloading colorlog-6.6.0-py2.py3-none-any.whl (11 kB)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Collecti

In [None]:
import os
import sys
import platform
import random
import math
from typing import List, Dict, Tuple

import pandas as pd
import numpy as np
import catboost
import eli5
import optuna
import sklearn
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import f1_score

from catboost import Pool, CatBoostClassifier

print(f"- os: {platform.platform()}")
print(f"- python: {sys.version}")
print(f"- pandas: {pd.__version__}")
print(f"- numpy: {np.__version__}")
print(f"- sklearn: {sklearn.__version__}")
print(f"- catboost: {catboost.__version__}")
print(f"- eli5: {eli5.__version__}")
print(f"- optuna: {optuna.__version__}")

- os: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
- python: 3.7.12 (default, Sep 10 2021, 00:21:48) 
[GCC 7.5.0]
- pandas: 1.1.5
- numpy: 1.19.5
- sklearn: 1.0.2
- catboost: 1.0.4
- eli5: 0.11.0
- optuna: 2.10.0


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
DATA_PATH = '/content/drive/MyDrive/데이콘/Jobcare_data/'
SEED = 43

# Competition train/test data
train_data = pd.read_csv(f'{DATA_PATH}train.csv')
test_data = pd.read_csv(f'{DATA_PATH}test.csv')

# Code lookup tables for attributes D, H and L (the last column of the D table is dropped)
code_d = pd.read_csv(f'{DATA_PATH}속성_D_코드.csv').iloc[:, :-1]
code_h = pd.read_csv(f'{DATA_PATH}속성_H_코드.csv')
code_l = pd.read_csv(f'{DATA_PATH}속성_L_코드.csv')

train_data.shape, test_data.shape

((501951, 35), (46404, 34))

## Data Preprocess
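- We merged 속성_D_코드.csv, 속성_L_코드.csv, and 속성_H_코드.csv into the training data, and excluded from the training features the columns that eli5 flagged as hurting training.


- We use the permutation feature importance from the eli5 package to select, in advance, the columns to exclude when fitting the CatBoostClassifier model. We also collect the categorical columns in a list, cat_features, to make training easier.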


In [None]:
code_d.columns= ["attribute_d","attribute_d_d","attribute_d_s","attribute_d_m"]
code_h.columns= ["attribute_h","attribute_h_m","attribute_h_l"]
code_l.columns= ["attribute_l","attribute_l_d","attribute_l_s","attribute_l_m","attribute_l_l"]

In [None]:
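def merge_codes(df: pd.DataFrame, df_code: pd.DataFrame, col: str) -> pd.DataFrame:
    df = df.copy()
    # Prefix every code-table column with the key column name
    df_code = df_code.add_prefix(f"{col}_")
    # Rename the first (key) column back to `col` so it can serve as the join key
    df_code = df_code.rename(columns={df_code.columns[0]: col})
    # Left join keeps every row of df, even when its code is missing from the table
    return pd.merge(df, df_code, how="left", on=col)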
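
To make the naming of the merged columns concrete, here is a minimal sketch on toy data. The values in `toy` and `toy_code_h` are made up for illustration and do not come from the competition files; the helper follows the same logic as the cell above.

```python
import pandas as pd


def merge_codes(df: pd.DataFrame, df_code: pd.DataFrame, col: str) -> pd.DataFrame:
    # Same logic as the notebook's helper: prefix the code table, restore the key name, left join
    df_code = df_code.add_prefix(f"{col}_")
    df_code = df_code.rename(columns={df_code.columns[0]: col})
    return pd.merge(df, df_code, how="left", on=col)


# Hypothetical member data with one preference column
toy = pd.DataFrame({"person_prefer_h_1": [1, 2, 9]})

# Hypothetical H code table: code, middle category, large category
toy_code_h = pd.DataFrame({
    "attribute_h": [1, 2],
    "attribute_h_m": [10, 20],
    "attribute_h_l": [100, 200],
})

merged = merge_codes(toy, toy_code_h, "person_prefer_h_1")
print(merged.columns.tolist())
print(merged)
```

The code 9 has no entry in the toy table, so its merged columns are NaN. That is the effect of `how="left"`.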

In [None]:
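def preprocess_data(
    df: pd.DataFrame,
    is_train: bool = True,
    cols_merge: List[Tuple[str, pd.DataFrame]] = [],
    cols_equi: List[Tuple[str, str]] = [],
    cols_drop: List[str] = ["id", "person_prefer_f", "person_prefer_g", "contents_open_dt"],
) -> Tuple[pd.DataFrame, np.ndarray]:
    """Build the feature frame. Returns (features, target); target is None for test data."""
    df = df.copy()

    # Split off the target when processing training data
    y_data = None
    if is_train:
        y_data = df["target"].to_numpy()
        df = df.drop(columns="target")

    # Attach the code-table columns for each (column, code table) pair
    for col, df_code in cols_merge:
        df = merge_codes(df, df_code, col)

    # Cast boolean columns to 0/1 integers
    cols = df.select_dtypes(bool).columns.tolist()
    df[cols] = df[cols].astype(int)

    # Add a 0/1 flag for each pair of columns: 1 when both hold the same code
    for col1, col2 in cols_equi:
        df[f"{col1}_{col2}"] = (df[col1] == df[col2]).astype(int)

    df = df.drop(columns=cols_drop)
    return (df, y_data)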

In [None]:
cols_merge = [
              ("person_prefer_d_1" , code_d),
              ("person_prefer_d_2" , code_d),
              ("person_prefer_d_3" , code_d),
              ("contents_attribute_d" , code_d),
              ("person_prefer_h_1" , code_h),
              ("person_prefer_h_2" , code_h),
              ("person_prefer_h_3" , code_h),
              ("contents_attribute_h" , code_h),
              ("contents_attribute_l" , code_l),
]

# Column-name pairs used to flag whether a member attribute and a content attribute share the same code
cols_equi = [

    ("contents_attribute_c","person_prefer_c"),
    ("contents_attribute_e","person_prefer_e"),

    ("person_prefer_d_2_attribute_d_s" , "contents_attribute_d_attribute_d_s"),
    ("person_prefer_d_2_attribute_d_m" , "contents_attribute_d_attribute_d_m"),
    ("person_prefer_d_2_attribute_d_d" , "contents_attribute_d_attribute_d_d"),
    ("person_prefer_d_3_attribute_d_s" , "contents_attribute_d_attribute_d_s"),
    ("person_prefer_d_3_attribute_d_m" , "contents_attribute_d_attribute_d_m"),
    ("person_prefer_d_3_attribute_d_d" , "contents_attribute_d_attribute_d_d"),

    ("person_prefer_h_1_attribute_h_m" , "contents_attribute_h_attribute_h_m"),
    ("person_prefer_h_1_attribute_h_l" , "contents_attribute_h_attribute_h_l"),
    ("person_prefer_h_2_attribute_h_m" , "contents_attribute_h_attribute_h_m"),
    ("person_prefer_h_3_attribute_h_m" , "contents_attribute_h_attribute_h_m"),
    ("person_prefer_h_2_attribute_h_l" , "contents_attribute_h_attribute_h_l"),
    ("person_prefer_h_3_attribute_h_l" , "contents_attribute_h_attribute_h_l"),

]
cols_drop = ["id","person_prefer_f","person_prefer_g", "contents_open_dt"]

## Use eli5's permutation feature importance to pick out the features with the lowest importance
- "contents_open_dt" is dropped beforehand so that permutation feature importance can run.

### Splitting off validation data

In [None]:
x_train, y_train = preprocess_data(train_data, cols_merge = cols_merge , cols_equi= cols_equi , cols_drop = cols_drop)
x_test, _ = preprocess_data(test_data,is_train = False, cols_merge = cols_merge , cols_equi= cols_equi  , cols_drop = cols_drop)
x_train.shape , y_train.shape , x_test.shape

((501951, 68), (501951,), (46404, 68))

In [None]:
from sklearn.model_selection import train_test_split
# Seeded so that the train/validation split is reproducible
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.2, random_state=SEED)

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

model=CatBoostClassifier(silent=True, random_state=0).fit(x_train, y_train)

In [None]:
perm = PermutationImportance(model, random_state=3).fit(x_valid, y_valid)
eli5.show_weights(perm, feature_names = x_valid.columns.tolist(), top=100)

Weight,Feature
0.0114  ± 0.0011,contents_attribute_d_attribute_d_d
0.0100  ± 0.0007,contents_attribute_e
0.0096  ± 0.0004,contents_attribute_d
0.0081  ± 0.0017,person_prefer_d_1
0.0068  ± 0.0017,contents_attribute_h
0.0066  ± 0.0011,person_prefer_e
0.0064  ± 0.0003,contents_attribute_j_1
0.0059  ± 0.0006,d_m_match_yn
0.0054  ± 0.0014,d_l_match_yn
0.0053  ± 0.0010,contents_attribute_a
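
What these weights measure can be sketched by hand: shuffle one column at a time and record how much the score drops. The snippet below is a simplified illustration in plain Python, not eli5's implementation. `toy_model`, the random data, and the use of accuracy as the score are all assumptions made for the example.

```python
import random


def accuracy(model, rows, labels):
    # Fraction of rows for which the model's prediction matches the label
    hits = sum(model(r) == y for r, y in zip(rows, labels))
    return hits / len(labels)


def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    # For each feature: shuffle its column, re-score, and average the score drop
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(base - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    # Hypothetical fitted model that only looks at feature 0
    return int(row[0] > 0.5)


data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

weights = permutation_importance(toy_model, rows, labels)
print(weights)
```

Feature 1 is never used by the toy model, so shuffling it cannot change the score and its weight is zero. Features whose weights are near zero or negative are the candidates for removal.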


## We filter out five features, including "id", and get ready to train the model

In [None]:
cols_drop = ["id","person_prefer_f","person_prefer_g",
             "person_prefer_d_3_attribute_d_m_contents_attribute_d_attribute_d_m", "person_prefer_h_3_attribute_h_l"]

In [None]:
x_train, y_train = preprocess_data(train_data, cols_merge = cols_merge , cols_equi= cols_equi , cols_drop = cols_drop)
x_test, _ = preprocess_data(test_data,is_train = False, cols_merge = cols_merge , cols_equi= cols_equi  , cols_drop = cols_drop)
x_train.shape , y_train.shape , x_test.shape

((501951, 67), (501951,), (46404, 67))

We list the categorical columns in cat_features to make training easier.

In [None]:
# Built from the regenerated x_train so that every listed column exists in the final feature set
cat_features = x_train.columns[x_train.nunique() > 2].tolist()

# OPTUNA
## Search for the best parameters with the optuna framework

### 1) Define the objective function
- The function defines the search space in params, trains a model with the sampled parameter values, and returns the f1_score computed on the validation set.

In [None]:
# OPTUNA_OPTIMIZATION = True

# def objective(trial):
#     train_x, valid_x, train_y, valid_y = train_test_split(x_train,y_train, test_size=0.3)
    
#     #define parameters
#     params = {
#         'iterations':trial.suggest_int("iterations", 500, 3000),
#         'objective':trial.suggest_categorical('objective',['CrossEntropy','Logloss']),
#         'bootstrap_type':trial.suggest_categorical('bootstrap_type', ['Bayesian', 'Bernoulli', 'MVS']),
#         'od_wait':trial.suggest_int('od_wait', 500, 1000),
#         'learning_rate' : trial.suggest_uniform('learning_rate',0.01,1),
#         'reg_lambda': trial.suggest_uniform('reg_lambda',1e-5,100),
#         'random_strength': trial.suggest_uniform('random_strength',20,50),
#         'depth': trial.suggest_int('depth',1,15),
#         'min_data_in_leaf': trial.suggest_int('min_data_in_leaf',1,20),
#         'leaf_estimation_iterations': trial.suggest_int('leaf_estimation_iterations',1,15),
#         'verbose': False,
#         "eval_metric":'F1',
#         "cat_features" : cat_features,
#         "one_hot_max_size":trial.suggest_int("one_hot_max_size",1,5),
#         'task_type' : 'GPU',
#     }
    
#     if params['bootstrap_type'] == 'Bayesian':
#         params['bagging_temperature'] = trial.suggest_float('bagging_temperature', 0, 10)
#     elif params['bootstrap_type'] == 'Bernoulli':
#         params['subsample'] = trial.suggest_float('subsample', 0.1, 1)
    
#     # model fit
#     model = CatBoostClassifier(**params)
#     model.fit(
#         train_x, train_y, eval_set=[(valid_x, valid_y)],
#         use_best_model=True
#     )
    
#     # validation prediction

#     preds = model.predict(valid_x)
#     pred_labels = np.rint(preds)
#     score = f1_score(valid_y, pred_labels)
#     return score


### 2) Run Optuna
- Create a study with optuna.create_study(), set to maximize the F1 score (direction='maximize').
- The number of trials (n_trials) is set to 20.

In [None]:
# study = optuna.create_study(
#     direction='maximize',
#     study_name='CatbClf'
# )

# study.optimize(
#     objective, 
#     n_trials=20
# )
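
The pattern of an objective plus a study loop can be illustrated without Optuna. The sketch below uses plain random search as a simpler stand-in for Optuna's sampler, and `toy_score` is a made-up function standing in for "train a model and return validation F1". It shows how conditional parameters such as bagging_temperature are only sampled for the matching bootstrap_type, and how the best trial is tracked.

```python
import random


def sample_params(rng):
    # Sample one candidate configuration (ranges loosely mirror the search space above)
    params = {
        "depth": rng.randint(1, 15),
        "learning_rate": rng.uniform(0.01, 1.0),
        "bootstrap_type": rng.choice(["Bayesian", "Bernoulli", "MVS"]),
    }
    # Conditional parameters only exist for the matching bootstrap type
    if params["bootstrap_type"] == "Bayesian":
        params["bagging_temperature"] = rng.uniform(0, 10)
    elif params["bootstrap_type"] == "Bernoulli":
        params["subsample"] = rng.uniform(0.1, 1)
    return params


def toy_score(params):
    # Hypothetical score; a real objective would fit a model and return validation F1
    return 1.0 - abs(params["depth"] - 3) / 15 - abs(params["learning_rate"] - 0.5)


def random_search(n_trials, seed=0):
    # Keep the best-scoring configuration seen so far (direction: maximize)
    rng = random.Random(seed)
    best, best_value = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        value = toy_score(params)
        if value > best_value:
            best, best_value = params, value
    return best, best_value


toy_best_params, toy_best_value = random_search(n_trials=20)
print(toy_best_value, toy_best_params)
```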

### 3) Print the best parameter values
- The best params stored in study.best_trial.params are assigned to Best_params.

In [None]:
# Best_params = study.best_trial.params
# print(f"Best Trial: {study.best_trial.value}")
# print(f"Best Params: {study.best_trial.params}")

**Best Trial: 0.677971389388986**

Best Params: {'iterations': 1422, 'objective': 'CrossEntropy', 'bootstrap_type': 'Bayesian', 'od_wait': 666, 'learning_rate': 0.9782109291187356, 'reg_lambda': 70.72533306533951, 'random_strength': 47.81900485462368, 'depth': 3, 'min_data_in_leaf': 20, 'leaf_estimation_iterations': 5, 'one_hot_max_size': 1, 'bagging_temperature': 0.07799233624102353}

In [None]:
best_params = {'iterations': 1422, 'objective': 'CrossEntropy',
               'bootstrap_type': 'Bayesian', 'od_wait': 666,
               'learning_rate': 0.9782109291187356, 'reg_lambda': 70.72533306533951,
               'random_strength': 47.81900485462368, 'depth': 3,
               'min_data_in_leaf': 20, 'leaf_estimation_iterations': 5,
               'one_hot_max_size': 1, 'bagging_temperature': 0.07799233624102353,
               'cat_features': cat_features,
               'eval_metric': 'F1',
               'task_type': 'GPU'}

# CatBoost modeling
### Apply the best parameters (best params) selected with the optuna framework to the CatBoost model

### 1) K-Fold
- Set up the training parameters for training the model with K-Fold validation.
- Run 5-fold CV.

In [None]:
is_holdout = False
n_splits = 5
cv = KFold(n_splits=n_splits, shuffle=True, random_state=SEED)
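
As a rough illustration of what a shuffled K-fold split produces, the sketch below generates the index folds in plain Python. It is not scikit-learn's implementation: `kfold_indices` is a hypothetical helper, and its shuffle will not reproduce the same folds as `KFold`. The properties are the same, though: every sample lands in exactly one validation fold, and the same seed yields the same folds again.

```python
import random


def kfold_indices(n_samples, n_splits, seed):
    # Shuffle the indices once, then cut them into n_splits contiguous validation folds
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds receive one extra sample
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


folds = list(kfold_indices(n_samples=11, n_splits=5, seed=43))
for train_idx, valid_idx in folds:
    print(len(train_idx), sorted(valid_idx))
```

Because the split is deterministic for a fixed seed, iterating over the splitter a second time yields the same folds. The later cell relies on this when it re-scores each saved model on its own validation fold.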

### 2) Train the CatBoost model
- Apply the best parameters (best params) found with the optuna framework to the CatBoost model.

In [None]:
scores = []
models = []

for tri, vai in cv.split(x_train):
    # Fit one model per fold, using the held-out fold as the evaluation set
    model = CatBoostClassifier(**best_params)
    model.fit(
        x_train.iloc[tri], y_train[tri],
        eval_set=[(x_train.iloc[vai], y_train[vai])],
    )
    models.append(model)
    # Best validation F1 reached during training
    scores.append(model.get_best_score()["validation"]["F1"])
    if is_holdout:
        break

0:	learn: 0.6237145	test: 0.6253576	best: 0.6253576 (0)	total: 101ms	remaining: 2m 23s
1:	learn: 0.6042099	test: 0.6553898	best: 0.6553898 (1)	total: 227ms	remaining: 2m 41s
2:	learn: 0.6111030	test: 0.6619709	best: 0.6619709 (2)	total: 330ms	remaining: 2m 36s
3:	learn: 0.6195072	test: 0.6649779	best: 0.6649779 (3)	total: 429ms	remaining: 2m 32s
4:	learn: 0.6228882	test: 0.6629948	best: 0.6649779 (3)	total: 524ms	remaining: 2m 28s
5:	learn: 0.6259080	test: 0.6643816	best: 0.6649779 (3)	total: 628ms	remaining: 2m 28s
6:	learn: 0.6251730	test: 0.6655572	best: 0.6655572 (6)	total: 721ms	remaining: 2m 25s
7:	learn: 0.6262198	test: 0.6665788	best: 0.6665788 (7)	total: 825ms	remaining: 2m 25s
8:	learn: 0.6290584	test: 0.6679065	best: 0.6679065 (8)	total: 959ms	remaining: 2m 30s
9:	learn: 0.6291707	test: 0.6676764	best: 0.6679065 (8)	total: 1.1s	remaining: 2m 35s
10:	learn: 0.6303621	test: 0.6692347	best: 0.6692347 (10)	total: 1.27s	remaining: 2m 43s
11:	learn: 0.6319123	test: 0.6694523	best:

### 3) Check the CV results
- Print the 5-fold CV scores and their mean.

In [None]:
print(scores)
print(np.mean(scores))

[0.6720965429345229, 0.6729934598691973, 0.6756877235039475, 0.6742797358784804, 0.6697146302250805]
0.6729544184822458


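## Submitting the results
- Because the final prediction is the average over the K-fold models, the extreme values of the predicted probabilities inevitably shrink.

- We therefore tuned the threshold and found the best value to be 0.33792.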


In [None]:
threshold = 0.33792
pred_list = []
scores = []
for i,(tri, vai) in enumerate( cv.split(x_train) ):
    pred = models[i].predict_proba(x_train.iloc[vai])[:, 1]
    pred = np.where(pred >= threshold , 1, 0)
    score = f1_score(y_train[vai],pred)
    scores.append(score)
    pred = models[i].predict_proba(x_test)[:, 1]
    pred_list.append(pred) 
print(scores)
print(np.mean(scores))

[0.7179341932769029, 0.7164733856504062, 0.7162156705992555, 0.7141664123853727, 0.7145710216576358]
0.7158721367139146
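
The notebook does not show how the threshold was searched. One possible way, sketched here under that assumption, is to sweep a grid of candidate thresholds over out-of-fold probabilities and keep the one with the highest F1. The labels, probabilities, and grid below are made up for illustration.

```python
def f1(y_true, y_pred):
    # F1 for the positive class, computed from true/false positives and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, proba, candidates):
    # Score every candidate threshold and return the one with the highest F1
    scored = [(f1(y_true, [int(p >= t) for p in proba]), t) for t in candidates]
    best_score, best_t = max(scored, key=lambda s: s[0])
    return best_t, best_score


# Hypothetical validation labels and predicted probabilities
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
proba = [0.9, 0.6, 0.4, 0.45, 0.2, 0.35, 0.3, 0.1]
candidates = [i / 100 for i in range(5, 100, 5)]

t, score = best_threshold(y_true, proba, candidates)
print(t, score)
```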


In [None]:
pred = np.mean( pred_list , axis = 0 )
pred = np.where(pred >= threshold , 1, 0)
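
The effect of averaging can be seen on a small made-up example: the mean of several models' probabilities always lies between the lowest and highest individual prediction, so it is less extreme than the most confident model. The numbers in `fold_probs` are hypothetical; only the threshold value is taken from the notebook.

```python
from statistics import mean

# Rows: one list of predicted probabilities per fold model; columns: test samples
fold_probs = [
    [0.95, 0.10, 0.40],
    [0.70, 0.30, 0.35],
    [0.85, 0.05, 0.30],
]
threshold = 0.33792

# Average across models for each sample, then apply the tuned threshold
avg = [mean(col) for col in zip(*fold_probs)]
labels = [int(p >= threshold) for p in avg]

print(avg)
print(labels)
```

With the default cutoff of 0.5, the third sample would be labelled 0. With the lower tuned threshold it is labelled 1.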

In [None]:
# Save the submission
sample_submission = pd.read_csv(f'{DATA_PATH}sample_submission.csv')
sample_submission['target'] = pred
sample_submission.to_csv('/content/drive/MyDrive/submission_final.csv', index=False)