# Algorithms

In the previous section we obtained the preprocessed dataset. In this section we use several algorithms to predict the target. The experiment setup is summarized in the following table:

|Name|Description|
|-----|-----|
|Algorithm|logistic regression, SVM, random forest, LightGBM|
|Metrics|precision, recall, f1, roc_auc|

Due to time constraints, instead of submitting to Kaggle, we only evaluate performance on the validation dataset.
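The cells in this notebook only report ROC AUC. As a minimal sketch of how all four listed metrics could be computed with scikit-learn, the example below uses made-up labels and scores. The 0.5 decision threshold is an assumption for illustration, not a value taken from this experiment. Precision, recall, and F1 need hard class predictions, while ROC AUC is computed from the raw scores.

```python
import numpy as np
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             roc_auc_score)

# Toy data for illustration only (not from the experiment)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Threshold-based metrics need hard predictions; 0.5 is an assumed cut-off
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC AUC is rank-based and uses the scores directly
auc = roc_auc_score(y_true, y_score)

print("precision {:.4f}|recall {:.4f}|f1 {:.4f}|roc_auc {:.4f}".format(
    precision, recall, f1, auc))
```

On an imbalanced validation set, the chosen threshold can move precision and recall substantially, so it would normally be tuned rather than fixed.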

The following algorithms will not be used:

|Algorithm|Reason|
|-----|-----|
|Decision tree|needs the continuous features to be discretized|
|KNN|computationally expensive|
|Neural networks|computationally expensive|

In [1]:
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

## 1. Load the data

In [3]:
def load_data(train_data_path, train_label_path, val_data_path, val_label_path):
    """Load the train/validation features and labels from CSV files."""
    # Every file has an extra 'Unnamed: 0' column (typically a saved index); drop it.
    val_data = pd.read_csv(val_data_path)
    val_data.drop(columns=['Unnamed: 0'], inplace=True)
    val_label = pd.read_csv(val_label_path)
    val_label.drop(columns=['Unnamed: 0'], inplace=True)
    train_data = pd.read_csv(train_data_path)
    train_data.drop(columns=['Unnamed: 0'], inplace=True)
    # Reuse the validation column names so both sets share the same schema.
    train_data.columns = val_data.columns
    train_label = pd.read_csv(train_label_path)
    train_label.drop(columns=['Unnamed: 0'], inplace=True)
    train_label.columns = val_label.columns

    print("Train: {},{}".format(train_data.shape, train_label.shape))
    print("Val: {}, {}".format(val_data.shape, val_label.shape))

    return train_data, train_label, val_data, val_label
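
The `Unnamed: 0` column that `load_data` drops is what pandas produces when a CSV was written together with its index and then read back without declaring it. Assuming that is how these files were created (the preprocessing step is not shown here), a small sketch of an alternative is to pass `index_col=0` to `read_csv`. The tiny frame and the in-memory buffer are illustrative only.

```python
import io

import pandas as pd

# Illustrative frame; the values are copied from the first rows shown in this notebook
df = pd.DataFrame({"var_0": [10.0791, 12.6703], "var_1": [-5.8477, -0.9864]})

# to_csv writes the index as an unnamed first column by default
buf = io.StringIO()
df.to_csv(buf)
csv_text = buf.getvalue()

# Reading it back naively yields an extra 'Unnamed: 0' column
naive = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the later drop()
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```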

In [5]:
train_data_path = "data/train_data_aug.csv"
train_label_path = "data/train_label_aug.csv"
val_data_path = "data/val_data.csv"
val_label_path = "data/val_label.csv"
train_data, train_label, val_data, val_label = load_data(train_data_path, train_label_path, val_data_path, val_label_path)

Train: (172159, 608),(172159, 1)
Val: (40000, 608), (40000, 1)


In [6]:
train_data.head(3)

Unnamed: 0,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,...,r1_var_195,r2_var_195,r1_var_196,r2_var_196,r1_var_197,r2_var_197,r1_var_198,r2_var_198,r1_var_199,r2_var_199
0,10.0791,-5.8477,8.8131,4.6344,8.1044,-10.5568,5.0763,14.5844,3.341,6.1917,...,2.6,2.56,-2.5,-2.5,11.3,11.34,16.1,16.12,6.3,6.29
1,12.6703,-0.9864,9.4683,5.1137,8.415,-6.8208,4.9258,18.6177,-2.2627,5.5176,...,-0.6,-0.59,-1.1,-1.06,7.9,7.88,19.4,19.39,0.1,0.06
2,13.2814,-1.598,4.8137,7.023,11.9859,-1.7557,4.3975,20.6678,3.5098,9.0079,...,-1.7,-1.69,6.1,6.07,10.3,10.27,17.3,17.32,-10.3,-10.33


In [13]:
train_label = train_label.astype('int64')

In [14]:
print(train_label.groupby('target').size())
train_label.head(3)

target
0    107963
1     64196
dtype: int64


Unnamed: 0,target
0,0
1,0
2,0


In [15]:
print(val_label.groupby('target').size())
val_label.head(3)

target
0    35951
1     4049
dtype: int64


Unnamed: 0,target
0,0
1,0
2,0
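
The two `groupby` outputs above show different class balances in the training and validation sets. A short sketch that turns the printed counts into positive rates is given below; the counts are copied from the outputs, and the helper name `positive_rate` is hypothetical.

```python
# Class counts copied from the groupby outputs above
counts = {
    "train": {0: 107963, 1: 64196},
    "val": {0: 35951, 1: 4049},
}


def positive_rate(class_counts):
    """Fraction of samples with target == 1."""
    total = class_counts[0] + class_counts[1]
    return class_counts[1] / total


rates = {name: positive_rate(c) for name, c in counts.items()}
for name, rate in rates.items():
    print("{}: {:.4f}".format(name, rate))
```

The training set has roughly 37% positives against roughly 10% in validation, which is consistent with the `_aug` suffix on the training files. ROC AUC is rank-based and so is less sensitive to this shift than threshold-based metrics such as precision.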


## 2. LGBM

In [19]:
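def train_lgbm(param, train_data, train_label, val_data, val_label):
    """Train a LightGBM model with early stopping and print train/val AUROC."""
    print("===== Build dataset for lgbm")
    lgbm_train_data = lgb.Dataset(train_data, label=train_label)
    lgbm_val_data = lgb.Dataset(val_data, label=val_label)

    print("===== Start training")
    start_time = time()
    # Train for up to 1,000,000 rounds; stop once the validation AUC has not
    # improved for 3000 rounds, logging every 1000 rounds.
    # NOTE: verbose_eval and early_stopping_rounds were removed from lgb.train()
    # in LightGBM 4.0; newer versions take the lgb.early_stopping() and
    # lgb.log_evaluation() callbacks instead.
    clf = lgb.train(param,
                    lgbm_train_data,
                    num_boost_round=1000000,
                    valid_sets=[lgbm_train_data, lgbm_val_data],
                    verbose_eval=1000,
                    early_stopping_rounds=3000)
    training_time = (time() - start_time) / 60.
    print("===== Training time: {:.2f}min".format(training_time))

    # Score both sets using the best iteration found by early stopping
    print("===== Get prediction")
    pred_tr = clf.predict(train_data, num_iteration=clf.best_iteration)
    pred_cv = clf.predict(val_data, num_iteration=clf.best_iteration)
    # Flatten the single-column label frames before computing AUROC
    print("===== Build metrics")
    train_label, val_label = np.ravel(train_label), np.ravel(val_label)
    a_tr = roc_auc_score(train_label, pred_tr)
    a_cv = roc_auc_score(val_label, pred_cv)
    progress = "auroc|train {:.4f}|val {:.4f}".format(a_tr, a_cv)
    print("Final result")
    print(progress)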

In [21]:
param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average':'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,  
    'metric':'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 22,
    'tree_learner': 'serial',
    'objective': 'binary', 
    'verbosity': 1
}
train_lgbm(param, train_data, train_label, val_data, val_label)

===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.967496	valid_1's auc: 0.865195
[2000]	training's auc: 0.9744	valid_1's auc: 0.885277
[3000]	training's auc: 0.977985	valid_1's auc: 0.894038
[4000]	training's auc: 0.980366	valid_1's auc: 0.898738
[5000]	training's auc: 0.982165	valid_1's auc: 0.900716
[6000]	training's auc: 0.983676	valid_1's auc: 0.901769
[7000]	training's auc: 0.985035	valid_1's auc: 0.901904
[8000]	training's auc: 0.986303	valid_1's auc: 0.902023
[9000]	training's auc: 0.987514	valid_1's auc: 0.90209
[10000]	training's auc: 0.988661	valid_1's auc: 0.901965
[11000]	training's auc: 0.989754	valid_1's auc: 0.901906
[12000]	training's auc: 0.990765	valid_1's auc: 0.901767
Early stopping, best iteration is:
[9031]	training's auc: 0.987549	valid_1's auc: 0.902114
===== Training time: 2.09min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9875|val 0.9021


In [28]:
# LGBM round 1: baseline; appears to overfit (train AUC approaches 1.0 while val AUC declines)
param = {
    'boosting_type': 'gbdt', 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 15, 
    'max_depth': -1,
    }
# train_lgbm(param, train_data, train_label, val_data, val_label)

# [1000]	training's auc: 0.996931	valid_1's auc: 0.893899
# [2000]	training's auc: 0.99992	valid_1's auc: 0.892059

In [25]:
# LGBM round 2: sweep max_depth from 3 to 15 (15 is the value kept in later rounds)
param_base = {
    'boosting_type': 'gbdt', 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, # smaller later 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 15, 
    'max_depth': -1,
    }
param_list = range(3,16,2)
for curr_param in param_list:
    print("max_depth: {}".format(curr_param))
    param_base['max_depth'] = curr_param
    train_lgbm(param_base, train_data, train_label, val_data, val_label)
    print("===========")
    
# train 0.9921|val 0.8946

max_depth: 3
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.987355	valid_1's auc: 0.895526
[2000]	training's auc: 0.993022	valid_1's auc: 0.896852
[3000]	training's auc: 0.99625	valid_1's auc: 0.894902
[4000]	training's auc: 0.998186	valid_1's auc: 0.89313
Early stopping, best iteration is:
[1891]	training's auc: 0.992566	valid_1's auc: 0.897038
===== Training time: 1.84min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9926|val 0.8970
max_depth: 5
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.996023	valid_1's auc: 0.891726
[2000]	training's auc: 0.999684	valid_1's auc: 0.890093
[3000]	training's auc: 0.99999	valid_1's auc: 0.88843
Early stopping, best iteration is:
[813]	training's auc: 0.994059	valid_1's auc: 0.892079
===== Training time: 1.62min
===== Get prediction
===== Build me
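
The tuning rounds in this notebook all follow the same pattern: change one parameter, retrain, and read the AUC from the log. A possible way to collect the scores programmatically is sketched below. It assumes a training function that returns the validation AUC, which `train_lgbm` as written does not (it only prints). The names `sweep` and `fake_fit` are hypothetical, and `fake_fit` is a stand-in scorer, not a model.

```python
import copy


def sweep(param_base, name, values, fit_fn):
    """Evaluate fit_fn once per candidate value without mutating param_base."""
    results = []
    for value in values:
        params = copy.deepcopy(param_base)  # keep the base dict untouched
        params[name] = value
        results.append((value, fit_fn(params)))
    # Pick the candidate with the highest score
    best_value, best_score = max(results, key=lambda item: item[1])
    return results, best_value, best_score


def fake_fit(params):
    """Stand-in for a real training run; returns a made-up score."""
    return 1.0 - 0.01 * abs(params["max_depth"] - 3)


base = {"num_leaves": 15, "max_depth": -1}
results, best_value, best_score = sweep(base, "max_depth", range(3, 16, 2), fake_fit)
print(best_value, base["max_depth"])
```

Copying the dictionary for each run also avoids carrying the last swept value into the next round, which happens when `param_base` is modified in place.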

In [27]:
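# LGBM round 3: sweep max_depth from 17 to 25
# Performance does not change: with num_leaves = 15 a tree has at most 14 splits,
# so a max_depth above that is never the binding constraint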
param_base = {
    'boosting_type': 'gbdt', 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, # smaller later 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 15, 
    'max_depth': -1,
    }
param_list = range(17,26,2)
for curr_param in param_list:
    print("max_depth: {}".format(curr_param))
    param_base['max_depth'] = curr_param
    train_lgbm(param_base, train_data, train_label, val_data, val_label)
    print("===========")

max_depth: 17
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.996931	valid_1's auc: 0.893899
[2000]	training's auc: 0.99992	valid_1's auc: 0.892059
[3000]	training's auc: 1	valid_1's auc: 0.891054
Early stopping, best iteration is:
[635]	training's auc: 0.992148	valid_1's auc: 0.894558
===== Training time: 1.73min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9921|val 0.8946
max_depth: 19
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.996931	valid_1's auc: 0.893899
[2000]	training's auc: 0.99992	valid_1's auc: 0.892059
[3000]	training's auc: 1	valid_1's auc: 0.891054
Early stopping, best iteration is:
[635]	training's auc: 0.992148	valid_1's auc: 0.894558
===== Training time: 1.71min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9921|val 0.8946
max_depth: 21
====

KeyboardInterrupt: 
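
The identical logs for `max_depth` 17 and 19 (which also match the round 1 numbers with `max_depth = -1`) can be explained by simple arithmetic: a tree with `num_leaves` leaves is built with `num_leaves - 1` splits, so no root-to-leaf path can be longer than that. The sketch below encodes this check. It assumes depth is counted as the number of splits from the root and that a non-positive `max_depth` means no limit; both function names are hypothetical.

```python
def max_reachable_depth(num_leaves):
    """Longest possible root-to-leaf path (in splits) for a tree with num_leaves leaves."""
    return num_leaves - 1


def depth_limit_is_binding(num_leaves, max_depth):
    """True if max_depth can actually restrict a tree with num_leaves leaves."""
    if max_depth <= 0:  # non-positive means unlimited
        return False
    return max_depth < max_reachable_depth(num_leaves)


for depth in (3, 5, 15, 17, 19, -1):
    print(depth, depth_limit_is_binding(15, depth))
```

Under these assumptions only the smaller depths constrain the model when `num_leaves = 15`, so sweeping `max_depth` beyond 14 would not be expected to change the result.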

In [29]:
# LGBM round 4: sweep num_leaves from 5 to 15 (5 is the value kept in later rounds)
param_base = {
    'boosting_type': 'gbdt', 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, # smaller later 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 15, 
    'max_depth': 15,
    }
param_list = range(5,16,2)
for curr_param in param_list:
    print("num_leaves: {}".format(curr_param))
    param_base['num_leaves'] = curr_param
    train_lgbm(param_base, train_data, train_label, val_data, val_label)
    print("===========")

num_leaves: 5
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.982036	valid_1's auc: 0.897806
[2000]	training's auc: 0.98875	valid_1's auc: 0.897783
[3000]	training's auc: 0.992822	valid_1's auc: 0.897007
[4000]	training's auc: 0.995527	valid_1's auc: 0.896069
Early stopping, best iteration is:
[1538]	training's auc: 0.986005	valid_1's auc: 0.898442
===== Training time: 1.60min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9860|val 0.8984
num_leaves: 7
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.986802	valid_1's auc: 0.896958
[2000]	training's auc: 0.994286	valid_1's auc: 0.896658
[3000]	training's auc: 0.997716	valid_1's auc: 0.895635
[4000]	training's auc: 0.999223	valid_1's auc: 0.894608
Early stopping, best iteration is:
[1316]	training's auc: 0.989787	valid_1's auc: 0.897596
==

In [31]:
# LGBM round 5: min_data_in_leaf --> 120
param_base = {
    'boosting_type': 'gbdt', 
    'boost_from_average': False, 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, # smaller later 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 5, 
    'max_depth': 15,
    'min_data_in_leaf': 120,
    'min_sum_hessian_in_leaf': 10
    }
param_list = range(50,151,10)
for curr_param in param_list:
    print("min_data_in_leaf: {}".format(curr_param))
    param_base['min_data_in_leaf'] = curr_param
    train_lgbm(param_base, train_data, train_label, val_data, val_label)
    print("===========")

min_data_in_leaf: 50
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.981822	valid_1's auc: 0.898178
[2000]	training's auc: 0.987556	valid_1's auc: 0.898728
[3000]	training's auc: 0.991369	valid_1's auc: 0.897917
[4000]	training's auc: 0.994144	valid_1's auc: 0.896635
Early stopping, best iteration is:
[1626]	training's auc: 0.985716	valid_1's auc: 0.899024
===== Training time: 1.59min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9857|val 0.8990
min_data_in_leaf: 60
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.981822	valid_1's auc: 0.898178
[2000]	training's auc: 0.987556	valid_1's auc: 0.898728
[3000]	training's auc: 0.991369	valid_1's auc: 0.897917
[4000]	training's auc: 0.994144	valid_1's auc: 0.896635
Early stopping, best iteration is:
[1626]	training's auc: 0.985716	valid_1's a

In [36]:
# LGBM round 6: bagging_freq = 1
param_base = {
    'boosting_type': 'gbdt', 
    'boost_from_average': False, 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, # smaller later 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 5, 
    'max_depth': 15,
    'min_data_in_leaf': 150,
    'min_sum_hessian_in_leaf': 10,
    
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'feature_fraction': 0.05
    }
param_list = range(1,10,2)
for curr_param in param_list:
    print("bagging_freq: {}".format(curr_param))
    param_base['bagging_freq'] = curr_param
    train_lgbm(param_base, train_data, train_label, val_data, val_label)
    print("===========")

bagging_freq: 1
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.978532	valid_1's auc: 0.89872
[2000]	training's auc: 0.982814	valid_1's auc: 0.899432
[3000]	training's auc: 0.986379	valid_1's auc: 0.896501
[4000]	training's auc: 0.989274	valid_1's auc: 0.894899
Early stopping, best iteration is:
[1591]	training's auc: 0.981213	valid_1's auc: 0.900466
===== Training time: 0.84min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9812|val 0.9005
bagging_freq: 3
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.97855	valid_1's auc: 0.898497
[2000]	training's auc: 0.982863	valid_1's auc: 0.899571
[3000]	training's auc: 0.986376	valid_1's auc: 0.89669
[4000]	training's auc: 0.989295	valid_1's auc: 0.893794
Early stopping, best iteration is:
[1442]	training's auc: 0.98059	valid_1's auc: 0.900408
=

In [37]:
# LGBM round 6: bagging_fraction = 0.6
param_base = {
    'boosting_type': 'gbdt', 
    'boost_from_average': False, 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, # smaller later 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 5, 
    'max_depth': 15,
    'min_data_in_leaf': 150,
    'min_sum_hessian_in_leaf': 10,
    
    'bagging_freq': 1,
    'bagging_fraction': 0.4,
    'feature_fraction': 0.05
    }
param_list = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6]
for curr_param in param_list:
    print("bagging_fraction: {}".format(curr_param))
    param_base['bagging_fraction'] = curr_param
    train_lgbm(param_base, train_data, train_label, val_data, val_label)
    print("===========")
    
# auroc|train 0.9823|val 0.9015

bagging_fraction: 0.1
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.977545	valid_1's auc: 0.896133
[2000]	training's auc: 0.98026	valid_1's auc: 0.891286
[3000]	training's auc: 0.982414	valid_1's auc: 0.885124
Early stopping, best iteration is:
[790]	training's auc: 0.976635	valid_1's auc: 0.897013
===== Training time: 0.67min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9766|val 0.8970
bagging_fraction: 0.15
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.978083	valid_1's auc: 0.898306
[2000]	training's auc: 0.981378	valid_1's auc: 0.89416
[3000]	training's auc: 0.984141	valid_1's auc: 0.890239
Early stopping, best iteration is:
[864]	training's auc: 0.977468	valid_1's auc: 0.898369
===== Training time: 0.68min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.977

In [38]:
# LGBM round 6: bagging_fraction part 2
param_base = {
    'boosting_type': 'gbdt', 
    'boost_from_average': False, 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, # smaller later 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 5, 
    'max_depth': 15,
    'min_data_in_leaf': 150,
    'min_sum_hessian_in_leaf': 10,
    
    'bagging_freq': 1,
    'bagging_fraction': 0.4,
    'feature_fraction': 0.05
    }
param_list = [0.65, 0.7, 0.75, 0.8]
for curr_param in param_list:
    print("bagging_fraction: {}".format(curr_param))
    param_base['bagging_fraction'] = curr_param
    train_lgbm(param_base, train_data, train_label, val_data, val_label)
    print("===========")

bagging_fraction: 0.65
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.978805	valid_1's auc: 0.898715
[2000]	training's auc: 0.98328	valid_1's auc: 0.9008
[3000]	training's auc: 0.986955	valid_1's auc: 0.899824
[4000]	training's auc: 0.990037	valid_1's auc: 0.89851
Early stopping, best iteration is:
[1706]	training's auc: 0.982026	valid_1's auc: 0.901172
===== Training time: 0.70min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9820|val 0.9012
bagging_fraction: 0.7
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.978797	valid_1's auc: 0.89829
[2000]	training's auc: 0.983387	valid_1's auc: 0.90017
[3000]	training's auc: 0.987077	valid_1's auc: 0.898556
[4000]	training's auc: 0.990148	valid_1's auc: 0.897369
Early stopping, best iteration is:
[1604]	training's auc: 0.981715	valid_1's auc:

In [39]:
# LGBM round 6: feature_fraction, still 0.05
param_base = {
    'boosting_type': 'gbdt', 
    'boost_from_average': False, 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, # smaller later 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 5, 
    'max_depth': 15,
    'min_data_in_leaf': 150,
    'min_sum_hessian_in_leaf': 10,
    
    'bagging_freq': 1,
    'bagging_fraction': 0.6,
    'feature_fraction': 0.05
    }
param_list = [0.01,0.03,0.05,0.07,0.09,0.1,0.15,0.2,0.25,0.3,0.35,0.4]
for curr_param in param_list:
    print("feature_fraction: {}".format(curr_param))
    param_base['feature_fraction'] = curr_param
    train_lgbm(param_base, train_data, train_label, val_data, val_label)
    print("===========")

feature_fraction: 0.01
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.964989	valid_1's auc: 0.897969
[2000]	training's auc: 0.975375	valid_1's auc: 0.90155
[3000]	training's auc: 0.979725	valid_1's auc: 0.901026
[4000]	training's auc: 0.983015	valid_1's auc: 0.899786
Early stopping, best iteration is:
[1960]	training's auc: 0.97523	valid_1's auc: 0.901603
===== Training time: 0.70min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9752|val 0.9016
feature_fraction: 0.03
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.977905	valid_1's auc: 0.898022
[2000]	training's auc: 0.982139	valid_1's auc: 0.900115
[3000]	training's auc: 0.985523	valid_1's auc: 0.898812
[4000]	training's auc: 0.988425	valid_1's auc: 0.89719
Early stopping, best iteration is:
[1788]	training's auc: 0.981355	valid_1's 

In [40]:
# LGBM round 7: final version, learning rate = 0.01
param_base = {
    'boosting_type': 'gbdt', 
    'boost_from_average': False, 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.1, 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 5, 
    'max_depth': 15,
    'min_data_in_leaf': 150,
    'min_sum_hessian_in_leaf': 10,
    
    'bagging_freq': 1,
    'bagging_fraction': 0.6,
    'feature_fraction': 0.05
    }
param_list = [0.1, 0.05, 0.01, 0.005, 0.002, 0.001]
for curr_param in param_list:
    print("learning_rate: {}".format(curr_param))
    param_base['learning_rate'] = curr_param
    train_lgbm(param_base, train_data, train_label, val_data, val_label)
    print("===========")

learning_rate: 0.1
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.978803	valid_1's auc: 0.8992
[2000]	training's auc: 0.983269	valid_1's auc: 0.90135
[3000]	training's auc: 0.986954	valid_1's auc: 0.899709
[4000]	training's auc: 0.989994	valid_1's auc: 0.897423
Early stopping, best iteration is:
[1748]	training's auc: 0.982256	valid_1's auc: 0.901476
===== Training time: 0.74min
===== Get prediction
===== Build metrics
Final result
auroc|train 0.9823|val 0.9015
learning_rate: 0.05
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.974564	valid_1's auc: 0.888411
[2000]	training's auc: 0.978983	valid_1's auc: 0.899222
[3000]	training's auc: 0.981355	valid_1's auc: 0.901133
[4000]	training's auc: 0.983504	valid_1's auc: 0.901363
[5000]	training's auc: 0.985468	valid_1's auc: 0.90106
[6000]	training's auc: 0.9872

[51000]	training's auc: 0.979383	valid_1's auc: 0.900877
[52000]	training's auc: 0.979489	valid_1's auc: 0.901013
[53000]	training's auc: 0.979596	valid_1's auc: 0.901146
[54000]	training's auc: 0.979697	valid_1's auc: 0.901261
[55000]	training's auc: 0.979798	valid_1's auc: 0.901359
[56000]	training's auc: 0.979898	valid_1's auc: 0.901469
[57000]	training's auc: 0.979998	valid_1's auc: 0.901549
[58000]	training's auc: 0.980095	valid_1's auc: 0.901653
[59000]	training's auc: 0.980189	valid_1's auc: 0.90173
[60000]	training's auc: 0.980283	valid_1's auc: 0.901813
[61000]	training's auc: 0.98038	valid_1's auc: 0.901882
[62000]	training's auc: 0.980476	valid_1's auc: 0.901946
[63000]	training's auc: 0.980569	valid_1's auc: 0.90201
[64000]	training's auc: 0.980665	valid_1's auc: 0.902064
[65000]	training's auc: 0.980759	valid_1's auc: 0.902115
[66000]	training's auc: 0.980848	valid_1's auc: 0.902159
[67000]	training's auc: 0.980941	valid_1's auc: 0.902197
[68000]	training's auc: 0.981028	v

KeyboardInterrupt: 

## Candidate final version of LGBM

In [41]:
param_base = {
    'boosting_type': 'gbdt', 
    'boost_from_average': False, 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.01, 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 5, 
    'max_depth': 15,
    'min_data_in_leaf': 150,
    'min_sum_hessian_in_leaf': 10,
    
    'bagging_freq': 1,
    'bagging_fraction': 0.6,
    'feature_fraction': 0.05
    }
train_lgbm(param_base, train_data, train_label, val_data, val_label)

===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.953697	valid_1's auc: 0.835956
[2000]	training's auc: 0.966186	valid_1's auc: 0.86362
[3000]	training's auc: 0.970569	valid_1's auc: 0.877017
[4000]	training's auc: 0.973157	valid_1's auc: 0.884793
[5000]	training's auc: 0.974875	valid_1's auc: 0.890089
[6000]	training's auc: 0.976221	valid_1's auc: 0.893831
[7000]	training's auc: 0.977218	valid_1's auc: 0.896502
[8000]	training's auc: 0.978028	valid_1's auc: 0.898348
[9000]	training's auc: 0.97871	valid_1's auc: 0.899692
[10000]	training's auc: 0.979289	valid_1's auc: 0.900762
[11000]	training's auc: 0.979804	valid_1's auc: 0.901392
[12000]	training's auc: 0.980293	valid_1's auc: 0.90177
[13000]	training's auc: 0.980766	valid_1's auc: 0.902057
[14000]	training's auc: 0.98123	valid_1's auc: 0.902212
[15000]	training's auc: 0.98168	valid_1's auc: 0.902383
[16000]	training's auc: 0.982132	valid_1's au