# ML0615_LGBM_HyperParameterTuning

#### Hyperparameter tuning is performed using the LGBMRegressor results obtained from feature_data_4.


- Summary of LightGBM
- Brief explanation of grid search
- Summary of how the hyperparameter tuning was carried out
- Kaggle result images
    - Overfitting
    - Final submission

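LightGBM is a gradient boosting framework that uses tree-based learning algorithms.<br>
Most other tree-based algorithms grow trees horizontally, whereas LightGBM grows trees vertically.<br>
In other words, LightGBM grows trees leaf-wise, while other algorithms grow them level-wise.<br>
Together with the sampling and bundling techniques described below, this makes LightGBM faster than other GBM implementations. LightGBM can handle large datasets and uses little memory at run time.<br>
Another reason for LightGBM's popularity is its focus on the accuracy of the results. LightGBM also supports GPU training.<br>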
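
The difference between the two growth strategies can be sketched with a toy example. The gain values here are made up for illustration (`toy_gain` is a hypothetical stand-in, not LightGBM's split gain), and no real split finding is performed; the sketch only shows which leaf is split next under each strategy and how deep the resulting leaves are.

```python
import heapq

def grow_level_wise(num_leaves):
    """Split every leaf at the current depth before going deeper."""
    leaves = [0]  # depth of each leaf; start with the root only
    while len(leaves) * 2 <= num_leaves:
        leaves = [d + 1 for d in leaves for _ in range(2)]
    return sorted(leaves)

def grow_leaf_wise(num_leaves, gain):
    """Always split the single leaf with the largest (hypothetical) gain."""
    # Heap items are (-gain, node_id, depth); children of node i are 2i and 2i+1.
    heap = [(-gain(1), 1, 0)]
    while len(heap) < num_leaves:
        _, node, depth = heapq.heappop(heap)
        for child in (2 * node, 2 * node + 1):
            heapq.heappush(heap, (-gain(child), child, depth + 1))
    return sorted(depth for _, _, depth in heap)

# Hypothetical gain: left children (even ids) are always the more promising ones.
toy_gain = lambda node: 1.0 if node % 2 == 0 else 0.1

level_depths = grow_level_wise(4)
leaf_depths = grow_leaf_wise(4, toy_gain)
print(level_depths)  # depths of the 4 leaves in a balanced tree
print(leaf_depths)   # depths of the 4 leaves in a leaf-wise tree
```

With the same budget of four leaves, the level-wise tree stays balanced while the leaf-wise tree keeps following the most promising branch, which is why a leaf-wise tree can become much deeper than a level-wise tree with the same number of leaves.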


### Conventional GBM
The algorithm scans every feature and every data instance to evaluate split candidates.

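### LightGBM
#### Gradient-based One-Side Sampling (GOSS)
- In general, each data instance has a different gradient.
- So instead of scanning all the data, GOSS keeps the instances with very large gradients and randomly drops a portion of the instances with very small gradients.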
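
A minimal NumPy sketch of this sampling idea follows. It is not LightGBM's internal implementation: the function name `goss_sample` is hypothetical, the parameter names `top_rate` and `other_rate` are borrowed from LightGBM's parameters of the same name, and the re-weighting factor `(1 - top_rate) / other_rate` follows the description in the LightGBM paper.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the instances kept by a GOSS-style sampler."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Sort instances by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]        # large-gradient instances are always kept
    rest_idx = order[n_top:]
    # Keep only a random subset of the small-gradient instances.
    other_idx = rng.choice(rest_idx, size=n_other, replace=False)

    idx = np.concatenate([top_idx, other_idx])
    weights = np.ones(len(idx))
    # Up-weight the sampled small-gradient instances to compensate for the dropped ones.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return idx, weights

grads = np.array([0.9, -0.05, 0.02, -1.2, 0.01, 0.03, -0.04, 0.7, 0.06, -0.02])
idx, w = goss_sample(grads)
print(idx, w)
```

In this toy array the two largest absolute gradients (indices 3 and 0) are always kept, and one of the remaining eight instances is sampled and given a larger weight.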


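#### Exclusive Feature Bundling (EFB)
- A method for scanning the features efficiently: features that are mutually exclusive (rarely nonzero at the same time) are bundled into a single feature, so fewer features need to be scanned.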
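
The bundling idea can be illustrated with two sparse columns that are never nonzero on the same row. This is a toy sketch under simplifying assumptions (non-negative integer features, strict exclusivity, a hypothetical helper `bundle_exclusive`); LightGBM itself works on histogram bins and tolerates a small rate of conflicts.

```python
import numpy as np

def bundle_exclusive(f1, f2):
    """Merge two features that are never nonzero on the same row into one column."""
    assert not np.any((f1 != 0) & (f2 != 0)), "features are not mutually exclusive"
    offset = f1.max()  # shift f2 so its values land above f1's value range
    return np.where(f2 != 0, f2 + offset, f1)

f1 = np.array([0, 3, 0, 1, 0])
f2 = np.array([2, 0, 0, 0, 1])
bundled = bundle_exclusive(f1, f2)
print(bundled)
```

Values 1 to 3 in the bundled column still come from `f1`, and values above 3 come from `f2`, so a split on the single bundled column can separate the same rows as splits on the two original columns.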



Reference : <a href = 'https://nurilee.com/2020/04/03/lightgbm-definition-parameter-tuning/'>What is LightGBM? And Parameter Tuning (blog)</a> <br>
Reference : <a href = 'https://youtu.be/4C8SUZJPlMY'>04-8: Ensemble Learning - LightGBM (YouTube)</a><br>
Reference : <a href = 'https://greeksharifa.github.io/machine_learning/2019/12/09/Light-GBM/'>Light GBM explanation and usage (blog)</a><br>

### Training

In [17]:
import numpy as np
import pandas as pd

data = pd.read_csv('/Users/krc/Documents/이어드림 수업/머신러닝_프로젝트/modelingPUBG/data/featured_data/featured_train_4.csv')
data.head()

Unnamed: 0,assists_mean,boosts_mean,damageDealt_mean,DBNOs_mean,headshotKills_mean,heals_mean,killPlace_mean,killPoints_mean,kills_mean,killStreaks_mean,...,revives_mean_rank,rideDistance_mean_rank,roadKills_mean_rank,swimDistance_mean_rank,teamKills_mean_rank,vehicleDestroys_mean_rank,walkDistance_mean_rank,weaponsAcquired_mean_rank,winPoints_mean_rank,winPlacePerc
0,0.0,0.5,109.675,1.0,0.0,0.5,41.0,1242.0,1.0,0.5,...,0.339286,0.375,0.517857,0.410714,0.5,0.482143,0.285714,0.160714,0.357143,0.3333
1,0.0,0.0,47.988333,0.333333,0.0,0.0,90.5,1355.5,0.0,0.0,...,0.339286,0.375,0.517857,0.410714,0.5,0.482143,0.071429,0.107143,0.142857,0.037
2,0.0,0.0,0.0,0.0,0.0,0.0,94.5,1382.0,0.0,0.0,...,0.339286,0.375,0.517857,0.410714,0.5,0.482143,0.035714,0.035714,0.571429,0.0
3,0.0,0.5,11.7,0.0,0.0,0.0,59.5,1178.0,0.0,0.0,...,0.339286,0.375,0.517857,0.410714,0.5,0.482143,0.392857,1.0,0.107143,0.3704
4,1.0,3.5,340.95,2.5,1.0,1.0,14.0,1504.0,3.0,1.5,...,0.339286,0.821429,0.517857,0.892857,0.5,0.482143,0.964286,0.625,0.785714,1.0


In [18]:
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df


reduce_mem_usage(data)

Memory usage of dataframe is 726.75 MB
Memory usage after optimization is: 181.69 MB
Decreased by 75.0%


Unnamed: 0,assists_mean,boosts_mean,damageDealt_mean,DBNOs_mean,headshotKills_mean,heals_mean,killPlace_mean,killPoints_mean,kills_mean,killStreaks_mean,...,revives_mean_rank,rideDistance_mean_rank,roadKills_mean_rank,swimDistance_mean_rank,teamKills_mean_rank,vehicleDestroys_mean_rank,walkDistance_mean_rank,weaponsAcquired_mean_rank,winPoints_mean_rank,winPlacePerc
0,0.000000,0.500000,109.687500,1.000000,0.000000,0.500000,41.000000,1242.0,1.000000,0.500000,...,0.339355,0.375000,0.518066,0.410645,0.500000,0.482178,0.285645,0.160767,0.357178,0.333252
1,0.000000,0.000000,48.000000,0.333252,0.000000,0.000000,90.500000,1356.0,0.000000,0.000000,...,0.339355,0.375000,0.518066,0.410645,0.500000,0.482178,0.071411,0.107117,0.142822,0.036987
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,94.500000,1382.0,0.000000,0.000000,...,0.339355,0.375000,0.518066,0.410645,0.500000,0.482178,0.035706,0.035706,0.571289,0.000000
3,0.000000,0.500000,11.703125,0.000000,0.000000,0.000000,59.500000,1178.0,0.000000,0.000000,...,0.339355,0.375000,0.518066,0.410645,0.500000,0.482178,0.392822,1.000000,0.107117,0.370361
4,1.000000,3.500000,341.000000,2.500000,1.000000,1.000000,14.000000,1504.0,3.000000,1.500000,...,0.339355,0.821289,0.518066,0.893066,0.500000,0.482178,0.964355,0.625000,0.785645,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2026739,0.000000,0.000000,16.953125,0.000000,0.000000,0.000000,29.000000,0.0,1.000000,1.000000,...,0.362061,0.724121,0.482666,0.482666,0.482666,0.500000,0.896484,0.413818,0.517090,0.643066
2026740,0.666504,2.666016,205.625000,1.333008,0.666504,6.667969,19.328125,0.0,1.333008,1.000000,...,1.000000,0.931152,0.965332,0.482666,0.482666,0.500000,0.689453,0.844727,0.517090,0.928711
2026741,0.000000,0.000000,25.953125,0.166626,0.000000,0.000000,82.000000,0.0,0.166626,0.166626,...,0.362061,0.258545,0.482666,0.482666,0.482666,0.500000,0.034485,0.068970,0.517090,0.000000
2026742,0.000000,0.500000,59.750000,0.750000,0.000000,0.500000,60.750000,0.0,0.250000,0.250000,...,0.362061,0.258545,0.482666,0.482666,0.482666,0.500000,0.206909,0.344727,0.517090,0.250000
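
The output above shows a side effect of the down-casting: `killPoints_mean` in row 1 changed from 1355.5 to 1356.0 and `damageDealt_mean` in row 0 changed from 109.675 to 109.6875. The short check below reproduces this with NumPy alone; it only illustrates float16 rounding and is not part of the original pipeline.

```python
import numpy as np

# Values taken from the first rows of the table before the conversion.
for value in (109.675, 1355.5, 0.3333):
    as_f16 = float(np.float16(value))
    print(value, "->", as_f16)

# Gap between neighbouring float16 values around 1355.
f16_step = float(np.spacing(np.float16(1355.0)))
print("float16 step near 1355:", f16_step)
```

Because float16 can only represent whole numbers in this range, large-valued columns such as `killPoints_mean` lose their fractional part. If that precision matters, one option is to stop the down-casting at float32.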


In [20]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from lightgbm.sklearn import LGBMRegressor

df = data

X = df.drop(columns='winPlacePerc')
y = df['winPlacePerc']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
                                            
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(1621395, 46) (405349, 46) (1621395,) (405349,)


In [21]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val1 = scaler.transform(X_val)
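
What `fit_transform` and `transform` do in the cell above can be sketched in plain NumPy: the mean and standard deviation are estimated on the training split only and then reused for any other data. The arrays below are toy values, not the PUBG features.

```python
import numpy as np

X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

# Statistics come from the training data only.
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

X_tr_scaled = (X_tr - mean) / std
X_new_scaled = (X_new - mean) / std  # new data reuses the training statistics
print(X_tr_scaled)
print(X_new_scaled)
```

Since the model in the following cells is fit on the scaled `X_train`, any data passed to it later is expected to go through the same `scaler.transform` call. In the cell above the scaled validation set is stored as `X_val1`, not `X_val`.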

In [22]:
reg_lgbm = LGBMRegressor()
# Train
reg_lgbm.fit(X_train, y_train) # add eval_set=[(X_train, y_train),(X_val, y_val)] (and eval_metric) when using the test.csv data
# Predict
pred_train_lgbm = reg_lgbm.predict(X_train)
# Error (mean absolute error on the training set)
mae_train_lgbm = mean_absolute_error(y_train, pred_train_lgbm)

print("LGBMRegressor Regression\t train = %.4f" % (mae_train_lgbm))

LGBMRegressor Regression	 train = 0.0466
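
The score printed above is the mean absolute error (MAE) on the training set. As a reminder of what the metric measures, here is a small NumPy version with toy numbers; the helper `mae` is written for illustration and is meant to match `sklearn.metrics.mean_absolute_error` for 1-D inputs.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

y_true_demo = [0.0, 0.5, 1.0, 0.25]
y_pred_demo = [0.1, 0.4, 0.8, 0.25]
demo_mae = mae(y_true_demo, y_pred_demo)
print(round(demo_mae, 4))
```

To judge generalization, the same metric would also be computed on the held-out split, for example on `reg_lgbm.predict(X_val1)` against `y_val`.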


### Hyperparameter Tuning

In [None]:
# https://www.kaggle.com/code/sabomasa/pubg-simple-lightgbm

# params = {
#     'num_leaves': 144,
#     'learning_rate': 0.1,
#     'n_estimators': 800,
#     'max_depth':12,
#     'max_bin':55,
#     'bagging_fraction':0.8,
#     'bagging_freq':5,
#     'feature_fraction':0.9,
#     'verbose':50, 
#     'early_stopping_rounds':100
#     }

In [None]:
# https://www.kaggle.com/code/mgiraygokirmak/lightgbm-with-gridsearch-and-feat-importance-att


# def find_best_hyperparameters(model):
#     # Grid parameters for using in Gridsearch while tuning
#     gridParams = {
#         'learning_rate'         : [0.1, 0.01 , 0.05],
#         'n_estimators'          : [1000, 10000, 20000],
#         'bagging_fraction'      : [0.5, 0.6 ,0.7],
#         'feature_fraction'      : [0.5, 0.6 ,0.7],
#         'num_leaves'            : [31, 80, 140]
#     }
#     # Create the grid
#     grid = GridSearchCV(model, 
#                         gridParams,
#                         verbose=5,
#                         cv=3)
#     # Run the grid
#     grid.fit(X_train, y_train)
#     print('Best parameters: %s' % grid.best_params_)
#     print('Accuracy: %.2f' % grid.best_score_)
#     return

<a href = 'https://smecsm.tistory.com/133' >For a leaf-wise tree, there are three important parameters.</a>

1. num_leaves : the main parameter controlling the complexity of the tree model.
    - In theory, num_leaves = 2^(max_depth) gives the same number of leaves as a depth-wise tree, so num_leaves should be set lower than this to reduce overfitting.
    - For example, if max_depth = 7 performed well, a num_leaves value of 70 to 80, fewer than 127, may perform better.
    - default = 31
<br>
<br>
2. min_data_in_leaf
    - An important parameter for preventing overfitting.
    - A large value avoids growing trees that are too deep, but may cause underfitting.
    - For a very large dataset (at least 10,000 rows), a value of 100 to 1000 is enough.
    - default = 20
<br>
<br>
3. max_depth
    - Sets the limit on tree depth.
    - default = -1 (no limit; to do: look up how to find the actual depth of a model trained with -1)
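
The relation between max_depth and num_leaves is simple arithmetic, sketched below. The helper name `max_leaves_for_depth` is made up for this example.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the number of leaves of a binary tree of the given depth."""
    return 2 ** max_depth

for depth in (4, 7, 9, 13):
    print(depth, max_leaves_for_depth(depth))

default_num_leaves = 31                # LightGBM default listed above
depth_9_cap = max_leaves_for_depth(9)  # leaf budget implied by max_depth=9
print(default_num_leaves < depth_9_cap)
```

One consequence worth keeping in mind: when only max_depth is tuned and num_leaves is left at its default of 31, every tree is capped at 31 leaves, far below what a depth of 9 or 13 would allow, so num_leaves is probably the binding constraint in that setting.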

1. Find a suitable max_depth value

    - When searching for max_depth, first train with a large value; at some point the metric becomes the same as with the default. Once that value is found, search below it with grid search or by hand to find a suitable max_depth.
    - With about 280 features and about 10,000 rows, max_depth seems to be about 20 to 30. Set it lower than that, then tune num_leaves.
<br>
<br>

2. Tune num_leaves to match the max_depth value

3. Tune min_data_in_leaf
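
The three steps above amount to tuning one parameter at a time and carrying the best value forward. A minimal sketch of that loop follows, assuming a hypothetical `score(params)` callable that trains a model and returns a validation error. Here it is replaced by a toy function so the example runs on its own.

```python
def tune_in_stages(score, stages, base_params):
    """Tune one parameter at a time, keeping the best value before moving on.

    score  : callable(params) -> validation error, lower is better (hypothetical)
    stages : list of (param_name, candidate_values) in tuning order
    """
    best = dict(base_params)
    for name, candidates in stages:
        trials = [(score({**best, name: value}), value) for value in candidates]
        _, best_value = min(trials)
        best[name] = best_value
    return best

# Toy stand-in for "train a model and return the validation MAE".
def toy_score(p):
    return (abs(p["max_depth"] - 9)
            + abs(p["num_leaves"] - 80) / 100
            + abs(p["min_data_in_leaf"] - 100) / 1000)

stages = [
    ("max_depth", [7, 9, 13]),
    ("num_leaves", [31, 80, 140]),
    ("min_data_in_leaf", [20, 100, 1000]),
]
base = {"max_depth": -1, "num_leaves": 31, "min_data_in_leaf": 20}
best_params = tune_in_stages(toy_score, stages, base)
print(best_params)
```

Tuning in stages needs far fewer fits than a full grid over all three parameters, at the cost of possibly missing combinations that only work well together.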




In [None]:
# param = {
#     'max_depth' : [13, 14, 15], 
#     #  'n_estimators' : [10000, 25000, 50000],
#     'learning_rate' : [0.95, 0.9, 0.8]
# }

# gcv = GridSearchCV(reg_lgbm, param, scoring='neg_mean_absolute_error', verbose=1)   # verbose=2 prints the training log
# gcv.fit(X_train, y_train)

# print(gcv.best_params_)

In [95]:
k

Unnamed: 0,max_depth,learning_rate
0,0.0,0.8
1,1.0,0.0
2,2.0,81.0
3,3.0,0.82
4,4.0,0.83
5,5.0,0.84
6,7.0,0.85
7,8.0,0.86
8,9.0,0.87
9,10.0,0.88


In [103]:
from sklearn.model_selection import GridSearchCV

param = {
    'max_depth' : [9], 
    'n_estimators' : [300, 100],
    'learning_rate' : [0.37, 0.35, 0.4]
}

gcv = GridSearchCV(reg_lgbm, param, scoring='neg_mean_absolute_error', verbose=1)   # verbose=2 prints the training log
gcv.fit(X_train, y_train)

print(gcv.best_params_)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
{'learning_rate': 0.4, 'max_depth': 9, 'n_estimators': 300}
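
The "6 candidates, 30 fits" in the log above follows directly from the grid: every combination of the listed values is one candidate, and each candidate is fit once per cross-validation fold. The sketch below reproduces the count with the standard library only; it enumerates the grid and does not train anything.

```python
from itertools import product

param = {
    'max_depth': [9],
    'n_estimators': [300, 100],
    'learning_rate': [0.37, 0.35, 0.4],
}

# Every combination of the listed values is one candidate.
candidates = [dict(zip(param, values)) for values in product(*param.values())]
n_folds = 5  # number of folds reported in the log above
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
```

The count grows multiplicatively with every value added to any list, which is why the searches recorded below keep each grid small and move the ranges step by step.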


'learning_rate' : [0.01, 0.05, 0.1], 'max_depth' : [10, 11, 12], 'n_estimators' : [100, 130, 150]<br>
{'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 150}<br>

'learning_rate' : [0.1, 0.15, 0.2], 'max_depth' : [8, 9, 10], 'n_estimators' : [150, 170, 200]<br>
{'learning_rate': 0.2, 'max_depth': 8, 'n_estimators': 200}<br>

'learning_rate' : [0.2, 0.5, 0.7], 'max_depth' : [6, 7, 8], 'n_estimators' : [100, 300, 500]<br>
{'learning_rate': 0.7, 'max_depth': 6, 'n_estimators': 300}<br>

'learning_rate' : [1, 1.5], 'max_depth' : [4, 5, 6], 'n_estimators' : [100, 300, 500]<br>
{'learning_rate': 1, 'max_depth': 4, 'n_estimators': 500}<br>

'learning_rate' : [1, 1.2], 'max_depth' : [2, 3, 4], 'n_estimators' : [500, 700, 1000]<br>
{'learning_rate': 1, 'max_depth': 2, 'n_estimators': 1000}<br>

'learning_rate' : [1], 'max_depth' : [1, 2], 'n_estimators' : [1000, 1500, 2000]<br>
{'learning_rate': 1, 'max_depth': 2, 'n_estimators': 2000}<br>

'learning_rate' : [1], 'max_depth' : [2], 'n_estimators' : [2000, 5000, 10000]<br>
{'learning_rate': 1, 'max_depth': 2, 'n_estimators': 10000}<br>

'max_depth' : [5, 10, 13], 'learning_rate' : [1,0.9, 0.91]<br>
{'learning_rate': 0.9, 'max_depth': 13}<br><br>


'learning_rate' : [1], 'max_depth' : [13, 1, 7], 'n_estimators' : [100]<br>
{'learning_rate': 0.9, 'max_depth': 13, 'n_estimators': 100}<br>

'learning_rate' : [2, 0.9, 0.5], 'max_depth' : [13, 1, 7], 'n_estimators' : [300]<br>
{'learning_rate': 0.5, 'max_depth': 7, 'n_estimators': 300}<br>

'learning_rate' : [0.9, 0.7, 0.4], 'max_depth' : [13, 9, 7], 'n_estimators' : [300]<br>
{'learning_rate': 0.4, 'max_depth': 9, 'n_estimators': 300}<br>

'learning_rate' : [0.5, 0.45, 0.4], 'max_depth' : [8, 9, 7], 'n_estimators' : [300]<br>
{'learning_rate': 0.4, 'max_depth': 9, 'n_estimators': 300} (this seems to be the best setting)<br>

'learning_rate' : [0.5, 0.45, 0.4], 'max_depth' : [8, 9, 7], 'n_estimators' : [100]<br>
{'learning_rate': 0.4, 'max_depth': 9, 'n_estimators': 100}<br>

'learning_rate' : [0.37, 0.35, 0.4], 'max_depth' : [9], 'n_estimators' : [100, 300]<br>
{'learning_rate': 0.4, 'max_depth': 9, 'n_estimators': 300}<br>

In [104]:
def func(a, b, c):
    reg_lgbm = LGBMRegressor(learning_rate=a, max_depth=b, n_estimators=c)
    # Train
    reg_lgbm.fit(X_train, y_train) # add eval_set=[(X_train, y_train),(X_val, y_val)] (and eval_metric) when using the test.csv data
    # Predict
    pred_train_lgbm = reg_lgbm.predict(X_train)
    # Error (mean absolute error on the training set)
    mae_train_lgbm = mean_absolute_error(y_train, pred_train_lgbm)

    # Only the training MAE is computed here, despite the "train/val" label.
    print("LGBMRegressor Regression\t train/val = %.4f" % (mae_train_lgbm))

In [105]:
func(0.4, 9, 300)

LGBMRegressor Regression	 train/val = 0.0417


In [None]:
# params = [(0.9, ,), (0.9, ,), (0.9, ,), (0.9, ,), (0.9, ,), (0.9, ,)]

(max_depth=2, n_estimators=102, num_leaves=3) :  train = 0.0612

(learning_rate=0.1, max_depth=10, n_estimators=150) : train = 0.0451

(learning_rate=0.2, max_depth=8, n_estimators=200) : train = 0.0431

(learning_rate=0.7, max_depth=6, n_estimators=300) : 0.0419

(learning_rate=1, max_depth=4, n_estimators=500) : 0.0427

(learning_rate=1, max_depth=2, n_estimators=1000) : 0.0444

(learning_rate=1, max_depth=2, n_estimators=2000) : 0.0432

(learning_rate=0.9, max_depth=13, n_estimators=5000) : 0.0272

(learning_rate=0.9, max_depth=13, n_estimators=10000) : 0.0193, overfitting: LGBMRegressor Regression	 train/val = 0.0160, 0.0524
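
The overfitting in the last run can be put into numbers by comparing the training and validation MAE reported there. The snippet uses only those two reported values and nothing else is measured here.

```python
# Values reported above for learning_rate=0.9, max_depth=13, n_estimators=10000.
train_mae, val_mae = 0.0160, 0.0524

gap = val_mae - train_mae    # absolute difference between the two errors
ratio = val_mae / train_mae  # how many times larger the validation error is

print("gap = %.4f, ratio = %.2f" % (gap, ratio))
```

A validation error more than three times the training error suggests the model has largely memorized the training set, so the lower training MAE of this configuration does not make it the better model.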

## Test

In [106]:
from timeit import default_timer as timer
from sklearn import preprocessing

import gc, sys
gc.enable()

In [107]:
test = pd.read_csv('/Users/krc/Documents/이어드림 수업/머신러닝_프로젝트/modelingPUBG/data/featured_data/test_V2.csv')
test = reduce_mem_usage(test)

def state(message,start = True, time = 0):
    if(start):
        print(f'Working on {message} ... ')
    else :
        print(f'Working on {message} took ({round(time , 3)}) Sec \n')

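def feature_engineering(is_train=True, train_df=None):
    if is_train: 
        # Expects the raw train_V2.csv frame as train_df; it is not loaded in this notebook.
        print("processing train_V2.csv")
        df = train_df[train_df['maxPlace'] > 1]
    else:
        print("processing test_V2.csv")
        df = test
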

    state('rankPoints')
    s = timer()
    # Replace non-positive 'rankPoints' values (such as the -1 placeholder) with 0 :
    df['rankPoints'] = np.where(df['rankPoints'] <= 0 ,0 , df['rankPoints'])
    e = timer()                                  
    state('rankPoints', False, e-s)


    target = 'winPlacePerc'
    # Get a list of the features to be used
    features = list(df.columns)
    
    # Remove some features from the features list :
    features.remove("Id")
    features.remove("matchId")
    features.remove("groupId")
    features.remove("matchDuration")
    features.remove("matchType")
    
    y = None

    if is_train: 
        y = np.array(df.groupby(['matchId','groupId'])[target].agg('mean'), dtype=np.float64)
        # Remove the target from the features list :
        features.remove(target)

    print("get group mean feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('mean')
    # Put the new features into a rank form ( max value will have the highest rank)
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()


    if is_train: 
        df_out = agg.reset_index()[['matchId','groupId']]
    # If we are processing the test data let df_out = 'matchId' and 'groupId' without grouping 
    else: 
        df_out = df[['matchId','groupId']]

    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_mean", "_mean_rank"], how='left', on=['matchId', 'groupId'])

    df_out.drop(["matchId", "groupId"], axis=1, inplace=True)

    X = np.array(df_out, dtype=np.float64)

    del df, agg, agg_rank
    gc.collect()

#     df_out['winPlacePerc'] = y

    return df_out

Memory usage of dataframe is 413.18 MB
Memory usage after optimization is: 121.74 MB
Decreased by 70.5%
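
The core of `feature_engineering` is two group-by steps: a per-team mean of each feature, followed by a percentile rank of that mean within each match. The toy frame below (made-up values, one match with three teams) shows what the two steps produce.

```python
import pandas as pd

toy = pd.DataFrame({
    'matchId': ['m1', 'm1', 'm1', 'm1', 'm1'],
    'groupId': ['g1', 'g1', 'g2', 'g3', 'g3'],
    'kills':   [2, 0, 3, 0, 0],
})

# Step 1: per-team mean of each feature.
agg = toy.groupby(['matchId', 'groupId'])[['kills']].agg('mean')
# Step 2: percentile rank of each team's mean within its match.
agg_rank = agg.groupby('matchId')[['kills']].rank(pct=True)

kills_mean = agg['kills'].tolist()
kills_rank = agg_rank['kills'].tolist()
print(agg)
print(agg_rank)
```

The team with the highest mean in a match gets rank 1.0, which puts features from matches with different numbers of teams on a comparable 0 to 1 scale.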


In [108]:
test = feature_engineering(False)
test.head()

processing test_V2.csv
Working on rankPoints ... 
Working on rankPoints took (0.004) Sec 

get group mean feature


Unnamed: 0,assists_mean,boosts_mean,damageDealt_mean,DBNOs_mean,headshotKills_mean,heals_mean,killPlace_mean,killPoints_mean,kills_mean,killStreaks_mean,...,rankPoints_mean_rank,revives_mean_rank,rideDistance_mean_rank,roadKills_mean_rank,swimDistance_mean_rank,teamKills_mean_rank,vehicleDestroys_mean_rank,walkDistance_mean_rank,weaponsAcquired_mean_rank,winPoints_mean_rank
0,0.25,0.0,31.40625,0.25,0.0,0.0,71.5,0.0,0.0,0.0,...,0.357143,0.267857,0.428571,0.5,0.446429,0.464286,0.464286,0.392857,0.178571,0.517857
1,0.5,5.0,388.25,2.0,0.5,3.5,7.0,0.0,3.5,1.5,...,0.574468,0.989362,0.93617,0.510638,0.489362,0.489362,0.5,0.787234,0.87234,0.510638
2,0.75,2.25,372.5,2.0,1.0,3.25,25.75,0.0,3.0,0.75,...,0.962963,0.314815,0.703704,0.5,0.425926,0.481481,0.481481,0.555556,0.537037,0.518519
3,0.0,0.0,82.75,0.5,0.0,0.0,53.5,0.0,0.0,0.0,...,0.420455,0.386364,0.295455,0.511364,0.511364,0.465909,0.5,0.636364,0.545455,0.511364
4,0.333333,2.333333,220.625,1.333333,1.0,1.666667,12.0,0.0,2.333333,1.0,...,0.888889,0.703704,0.407407,0.518519,0.462963,0.462963,0.518519,0.962963,0.962963,0.518519


In [113]:
result = reg_lgbm.predict(test)
result

array([0.34507238, 0.60803918, 0.57846065, ..., 0.54385831, 0.52883659,
       0.29571118])

In [114]:
len(result)

1934174

In [115]:
submission = pd.read_csv('/Users/krc/Documents/이어드림 수업/머신러닝_프로젝트/modelingPUBG/data/featured_data/sample_submission_V2.csv')

In [116]:
submission.winPlacePerc = result

submission.winPlacePerc.mean()

0.4311119968728811
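
Since `winPlacePerc` lies between 0 and 1 but a regressor's output is not constrained to that range, one optional post-processing step is to clip the predictions before writing the submission. This is a suggestion, not something the notebook does, and the array below is a toy example.

```python
import numpy as np

raw_pred = np.array([-0.03, 0.25, 0.98, 1.04])  # toy predictions, two out of range
clipped = np.clip(raw_pred, 0.0, 1.0)
print(clipped)
```

Applied to the notebook's variables, this would be `np.clip(result, 0.0, 1.0)` before the assignment to `submission.winPlacePerc`.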