# Run LGBM on an alternative training set

In the previous sections, I tuned LGBM's parameters on a training set and achieved good performance on the validation set.

In this section, I build a different form of the training set and check whether the performance on the validation set improves.

**Specifically, I reduce the dimensionality with PCA.**

In [0]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from imblearn.combine import SMOTETomek
from time import time
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

## 1. Load data

In [0]:
def load_data(train_data_path, train_label_path, val_data_path, val_label_path):
    """Load train/validation features and labels from CSV files."""
    # Every CSV carries a leftover index column ('Unnamed: 0') that is dropped.
    val_data = pd.read_csv(val_data_path)
    val_data.drop(columns=['Unnamed: 0'], inplace=True)
    val_label = pd.read_csv(val_label_path)
    val_label.drop(columns=['Unnamed: 0'], inplace=True)
    train_data = pd.read_csv(train_data_path)
    train_data.drop(columns=['Unnamed: 0'], inplace=True)
    # Align the training column names with the validation set.
    train_data.columns = val_data.columns
    train_label = pd.read_csv(train_label_path)
    train_label.drop(columns=['Unnamed: 0'], inplace=True)
    train_label.columns = val_label.columns

    print("Train: {},{}".format(train_data.shape, train_label.shape))
    print("Val: {}, {}".format(val_data.shape, val_label.shape))

    return train_data, train_label, val_data, val_label

In [0]:
train_data_path = "data/train_data_aug.csv"
train_label_path = "data/train_label_aug.csv"
val_data_path = "data/val_data.csv"
val_label_path = "data/val_label.csv"
train_data, train_label, val_data, val_label = load_data(train_data_path, train_label_path, val_data_path, val_label_path)
train_label = train_label.astype('int64')

Train: (172159, 608),(172159, 1)
Val: (40000, 608), (40000, 1)


## 2. LGBM on different kinds of training sets

In the previous sections, without dimensionality reduction, the result was

auroc|train 0.9830|val 0.9026

Now we use PCA to reduce the dimensionality and compare against this baseline.
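Before trying several values of `n_components`, it can help to look at how much variance a given number of components retains. The sketch below is illustrative only: it runs on a small synthetic matrix rather than the notebook's data, and the helper name `components_for_variance` and the 95% threshold are assumptions, not part of the original workflow.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(data, threshold=0.95):
    """Hypothetical helper: smallest number of principal components whose
    cumulative explained variance ratio reaches `threshold`."""
    pca = PCA().fit(data)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First position where the cumulative ratio is >= threshold (1-based count).
    n_needed = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against thresholds that are never reached due to rounding.
    return min(n_needed, len(cumulative)), cumulative


# Synthetic demo: 40 observed features driven by 5 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 40))
demo = latent @ mixing + 0.01 * rng.normal(size=(500, 40))

n_needed, cumulative = components_for_variance(demo, threshold=0.95)
print("components needed for 95% of the variance:", n_needed)
```

Explained variance only measures how well the features are reconstructed, not how useful the components are for the classifier, so the validation AUROC remains the deciding metric.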

In [0]:
param_base = {
    'boosting_type': 'gbdt', 
    'boost_from_average': False, 
    'objective': 'binary', 
    'tree_learner': 'serial', 
    'verbosity': 1,
    'learning_rate': 0.01, 
    'num_threads': 22,
    'metric':'auc',
    
    'num_leaves': 5, 
    'max_depth': 15,
    'min_data_in_leaf': 150,
    'min_sum_hessian_in_leaf': 10,
    
    'bagging_freq': 1,
    'bagging_fraction': 0.6,
    'feature_fraction': 0.05
    }

In [0]:
def train_lgbm(param, train_data, train_label, val_data, val_label):
    """Train LightGBM with early stopping and print the train/validation AUROC."""
    print("===== Build dataset for lgbm")
    lgbm_train_data = lgb.Dataset(train_data, label=train_label)
    lgbm_val_data = lgb.Dataset(val_data, label=val_label)

    print("===== Start training")
    start_time = time()
    # Note: verbose_eval and early_stopping_rounds are keyword arguments of
    # lgb.train in LightGBM < 4.0; later releases expect the
    # lgb.log_evaluation and lgb.early_stopping callbacks instead.
    clf = lgb.train(param,
                    lgbm_train_data,
                    1000000,  # upper bound on boosting rounds
                    valid_sets=[lgbm_train_data, lgbm_val_data],
                    verbose_eval=1000,  # log the metric every 1000 rounds
                    early_stopping_rounds=3000)
    training_time = (time() - start_time) / 60.
    print("===== Training time: {:.2f}min".format(training_time))

    # Predict with the best iteration found by early stopping
    print("===== Get prediction")
    pred_tr = clf.predict(train_data, num_iteration=clf.best_iteration)
    pred_cv = clf.predict(val_data, num_iteration=clf.best_iteration)
    # Compute AUROC on both sets (labels flattened to 1-D arrays)
    print("===== Build metrics")
    train_label, val_label = np.ravel(train_label), np.ravel(val_label)
    a_tr = roc_auc_score(train_label, pred_tr)
    a_cv = roc_auc_score(val_label, pred_cv)
    progress = "auroc|train {:.4f}|val {:.4f}".format(a_tr, a_cv)
    print("Final result")
    print(progress)

# Baseline without PCA (result from the previous sections):
# auroc|train 0.9830|val 0.9026
# train_lgbm(param_base, train_data, train_label, val_data, val_label)

In [0]:
def dimension_reduction(train_data, val_data, n_components=200):
    """Project train and validation features onto the first n_components
    principal components. PCA is fit on the train and validation rows together."""
    print("Train: {}\nVal: {}".format(train_data.shape, val_data.shape))
    # Remember where the training rows end so the result can be split again.
    split_index = train_data.shape[0]
    pca = PCA(n_components=n_components)
    combined_data = pd.concat([train_data, val_data])
    print("Combined: {}".format(combined_data.shape))
    combined_data_pca = pca.fit_transform(combined_data)
    combined_data_pca = pd.DataFrame(combined_data_pca, columns=['var_pca_{}'.format(i) for i in range(n_components)])
    print("After PCA: {}".format(combined_data_pca.shape))
    # Split back into train and validation parts by row position.
    train_data_pca = combined_data_pca[:split_index]
    val_data_pca = combined_data_pca[split_index:]
    print("Train: {}\nVal: {}".format(train_data_pca.shape, val_data_pca.shape))
    return train_data_pca, val_data_pca
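`dimension_reduction` fits PCA on the concatenated training and validation rows, so the validation features influence the projection. A common alternative is to fit the transform on the training rows only and then apply it to the validation rows. The sketch below shows that variant on synthetic data. It is not what the runs in this notebook used: the name `dimension_reduction_train_only` is hypothetical, and the `StandardScaler` step is an added assumption (PCA is sensitive to feature scale).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def dimension_reduction_train_only(train_data, val_data, n_components=200):
    """Hypothetical variant: fit scaler and PCA on the training rows only,
    then apply the fitted transforms to the validation rows."""
    scaler = StandardScaler().fit(train_data)
    pca = PCA(n_components=n_components)
    train_pca = pca.fit_transform(scaler.transform(train_data))
    val_pca = pca.transform(scaler.transform(val_data))
    columns = ['var_pca_{}'.format(i) for i in range(n_components)]
    return (pd.DataFrame(train_pca, columns=columns, index=train_data.index),
            pd.DataFrame(val_pca, columns=columns, index=val_data.index))


# Small synthetic demo (20 features reduced to 5 components).
rng = np.random.default_rng(0)
feature_names = ['f{}'.format(i) for i in range(20)]
demo_train = pd.DataFrame(rng.normal(size=(200, 20)), columns=feature_names)
demo_val = pd.DataFrame(rng.normal(size=(50, 20)), columns=feature_names)

demo_train_pca, demo_val_pca = dimension_reduction_train_only(
    demo_train, demo_val, n_components=5)
print(demo_train_pca.shape, demo_val_pca.shape)
```

The return values have the same layout as those of `dimension_reduction`, so they could be passed to `train_lgbm` in the same way.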

In [0]:
# round 1: dim = 300
train_data_pca, val_data_pca = dimension_reduction(train_data, val_data, 300)

Train: (172159, 608)
Val: (40000, 608)
Combined: (212159, 608)
After PCA: (212159, 300)
Train: (172159, 300)
Val: (40000, 300)


In [0]:
train_lgbm(param_base, train_data_pca, train_label, val_data_pca, val_label)

===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.946762	valid_1's auc: 0.776749
[2000]	training's auc: 0.95646	valid_1's auc: 0.811801
[3000]	training's auc: 0.960945	valid_1's auc: 0.827428
[4000]	training's auc: 0.963768	valid_1's auc: 0.836371
[5000]	training's auc: 0.965776	valid_1's auc: 0.842404
[6000]	training's auc: 0.967317	valid_1's auc: 0.846597
[7000]	training's auc: 0.968545	valid_1's auc: 0.849658
[8000]	training's auc: 0.969533	valid_1's auc: 0.851772
[9000]	training's auc: 0.970381	valid_1's auc: 0.853485
[10000]	training's auc: 0.971168	valid_1's auc: 0.85475
[11000]	training's auc: 0.971894	valid_1's auc: 0.855725
[12000]	training's auc: 0.972551	valid_1's auc: 0.856444
[13000]	training's auc: 0.973167	valid_1's auc: 0.857041
[14000]	training's auc: 0.97377	valid_1's auc: 0.857619
[15000]	training's auc: 0.974365	valid_1's auc: 0.857891
[16000]	training's auc: 0.974931	valid_1's 

In [0]:
# round 2: dim = 200
train_data_pca, val_data_pca = dimension_reduction(train_data, val_data, 200)
train_lgbm(param_base, train_data_pca, train_label, val_data_pca, val_label)

Train: (172159, 608)
Val: (40000, 608)
Combined: (212159, 608)
After PCA: (212159, 200)
Train: (172159, 200)
Val: (40000, 200)
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.945249	valid_1's auc: 0.770803
[2000]	training's auc: 0.954326	valid_1's auc: 0.802697
[3000]	training's auc: 0.958504	valid_1's auc: 0.817061
[4000]	training's auc: 0.961215	valid_1's auc: 0.825547
[5000]	training's auc: 0.963143	valid_1's auc: 0.831152
[6000]	training's auc: 0.964644	valid_1's auc: 0.835423
[7000]	training's auc: 0.965828	valid_1's auc: 0.838282
[8000]	training's auc: 0.966799	valid_1's auc: 0.840429
[9000]	training's auc: 0.967598	valid_1's auc: 0.841971
[10000]	training's auc: 0.96833	valid_1's auc: 0.843103
[11000]	training's auc: 0.969014	valid_1's auc: 0.843932
[12000]	training's auc: 0.969671	valid_1's auc: 0.844588
[13000]	training's auc: 0.970287	valid_1's auc: 0.845157
[14000]	training's auc: 0.97

In [0]:
# round 3: dim = 150
train_data_pca, val_data_pca = dimension_reduction(train_data, val_data, 150)
train_lgbm(param_base, train_data_pca, train_label, val_data_pca, val_label)

Train: (172159, 608)
Val: (40000, 608)
Combined: (212159, 608)
After PCA: (212159, 150)
Train: (172159, 150)
Val: (40000, 150)
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.941165	valid_1's auc: 0.751078
[2000]	training's auc: 0.951436	valid_1's auc: 0.788309
[3000]	training's auc: 0.955301	valid_1's auc: 0.8013
[4000]	training's auc: 0.957711	valid_1's auc: 0.808902
[5000]	training's auc: 0.959476	valid_1's auc: 0.813995
[6000]	training's auc: 0.960874	valid_1's auc: 0.817785
[7000]	training's auc: 0.961999	valid_1's auc: 0.820328
[8000]	training's auc: 0.962955	valid_1's auc: 0.822261
[9000]	training's auc: 0.963776	valid_1's auc: 0.823688
[10000]	training's auc: 0.96451	valid_1's auc: 0.824833
[11000]	training's auc: 0.965181	valid_1's auc: 0.825603
[12000]	training's auc: 0.96583	valid_1's auc: 0.826178
[13000]	training's auc: 0.966437	valid_1's auc: 0.826692
[14000]	training's auc: 0.96703

In [0]:
# round 4: dim = 100
train_data_pca, val_data_pca = dimension_reduction(train_data, val_data, 100)
train_lgbm(param_base, train_data_pca, train_label, val_data_pca, val_label)

Train: (172159, 608)
Val: (40000, 608)
Combined: (212159, 608)
After PCA: (212159, 100)
Train: (172159, 100)
Val: (40000, 100)
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.915639	valid_1's auc: 0.6503
[2000]	training's auc: 0.929271	valid_1's auc: 0.699622
[3000]	training's auc: 0.939095	valid_1's auc: 0.736353
[4000]	training's auc: 0.944588	valid_1's auc: 0.756413
[5000]	training's auc: 0.947958	valid_1's auc: 0.768363
[6000]	training's auc: 0.950179	valid_1's auc: 0.775534
[7000]	training's auc: 0.95188	valid_1's auc: 0.780546
[8000]	training's auc: 0.953222	valid_1's auc: 0.784051
[9000]	training's auc: 0.954347	valid_1's auc: 0.786538
[10000]	training's auc: 0.955344	valid_1's auc: 0.788561
[11000]	training's auc: 0.956232	valid_1's auc: 0.790089
[12000]	training's auc: 0.957058	valid_1's auc: 0.791336
[13000]	training's auc: 0.957814	valid_1's auc: 0.792258
[14000]	training's auc: 0.9585

In [0]:
# round 5: dim = 80
train_data_pca, val_data_pca = dimension_reduction(train_data, val_data, 80)
train_lgbm(param_base, train_data_pca, train_label, val_data_pca, val_label)

Train: (172159, 608)
Val: (40000, 608)
Combined: (212159, 608)
After PCA: (212159, 80)
Train: (172159, 80)
Val: (40000, 80)
===== Build dataset for lgbm
===== Start training
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.909063	valid_1's auc: 0.627087
[2000]	training's auc: 0.921082	valid_1's auc: 0.671811
[3000]	training's auc: 0.930211	valid_1's auc: 0.704858
[4000]	training's auc: 0.935837	valid_1's auc: 0.725037
[5000]	training's auc: 0.939495	valid_1's auc: 0.737437
[6000]	training's auc: 0.942029	valid_1's auc: 0.74524
[7000]	training's auc: 0.943959	valid_1's auc: 0.750695
[8000]	training's auc: 0.94553	valid_1's auc: 0.754671
[9000]	training's auc: 0.946841	valid_1's auc: 0.757686
[10000]	training's auc: 0.947971	valid_1's auc: 0.760079
[11000]	training's auc: 0.948947	valid_1's auc: 0.761897
[12000]	training's auc: 0.949864	valid_1's auc: 0.763181
[13000]	training's auc: 0.950706	valid_1's auc: 0.764125
[14000]	training's auc: 0.951516
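
To compare the runs at a glance, the sketch below collects the last complete validation AUC visible in each log above and measures its gap to the baseline without PCA (validation AUROC 0.9026). The logs are truncated, so these are the last fully logged values (round 15000 for dim = 300, round 13000 for the others), not final best scores. The variable names are illustrative.

```python
# Baseline validation AUROC without PCA, as reported earlier in this notebook.
baseline_val_auc = 0.9026

# Last complete validation AUC visible in each (truncated) training log.
last_logged_val_auc = {
    300: 0.857891,  # round 15000
    200: 0.845157,  # round 13000
    150: 0.826692,  # round 13000
    100: 0.792258,  # round 13000
    80: 0.764125,   # round 13000
}

for dim, auc in sorted(last_logged_val_auc.items(), reverse=True):
    gap = baseline_val_auc - auc
    print("dim={:>3}  val auc={:.4f}  gap to baseline={:.4f}".format(dim, auc, gap))

best_dim = max(last_logged_val_auc, key=last_logged_val_auc.get)
print("Best PCA setting among the logged values: dim = {}".format(best_dim))
```

In these logs every PCA setting stays below the baseline, and the validation AUC drops as fewer components are kept.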