![](https://mma.prnewswire.com/media/1429854/MoneyLion_Logo.jpg?p=facebook)

# <a id='8'>8. Modeling </a>

# LightGBM

In [141]:
data = final_data.copy()
X = data.drop(['Target'], axis=1)
Xbest=X[embeded_lgb_feature] #best features
target = data['Target']
# Split into train and test sets.
# stratify=target keeps the class proportions the same in the training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(Xbest, target, test_size = 0.3, stratify = target)
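
To make the effect of `stratify` concrete, here is a minimal standard-library sketch. The `stratified_split` helper and the 1% positive rate are hypothetical illustrations, not scikit-learn's implementation or this dataset's values; the point is only that each class contributes the same share of rows to the test set.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=0):
    """Split row indices so every class sends the same fraction to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical, heavily imbalanced labels: 10 positives among 1,000 rows
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels, test_size=0.3)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

With a rare positive class, a purely random split could leave the test set with very few positives, which is why the stratified variant is used here.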

To implement k-fold cross validation, we will use LightGBM's **cross validation** function, `cv`, because it lets us use early stopping, a critical technique for training a GBM.  
The `cv` function expects the data as a LightGBM `Dataset`, so we build one first.
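
As a rough check on what 5-fold cross validation does to the data, the sketch below computes the training-set size of each fold for an even 5-way partition. It assumes 27,287 training rows, the count LightGBM reports in the training log later in this section; `kfold_train_sizes` is a hypothetical helper, and LightGBM's own (stratified) fold assignment may differ in detail.

```python
def kfold_train_sizes(n_rows, n_folds):
    """Training-set size for each fold when rows are split as evenly as possible."""
    base, remainder = divmod(n_rows, n_folds)
    # The first `remainder` folds hold one extra validation row
    fold_sizes = [base + 1 if i < remainder else base for i in range(n_folds)]
    return [n_rows - size for size in fold_sizes]


n_rows = 27287  # assumption: taken from the training log in this section
train_sizes = kfold_train_sizes(n_rows, 5)
print(train_sizes)  # [21829, 21829, 21830, 21830, 21830]
```

These sizes are consistent with the `Number of data points in the train set: 21829` and `21830` lines that appear in the cross-validation output.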

In [166]:
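# To use the cv function, we first need to make a LightGBM dataset.
train_set = lgb.Dataset(X_train.values, label=y_train.values,
                        feature_name=X_train.columns.tolist())
# Validation set built from the held-out split (X_test, y_test);
# reference=train_set makes it reuse the feature binning of the training set.
test_set = lgb.Dataset(X_test.values, label=y_test.values,
                       feature_name=X_train.columns.tolist(),
                       reference=train_set)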

In the `cv` call, `num_boost_round` is set to 10,000 (`num_boost_round` is the same as `n_estimators`), but this number won't actually be reached because we are using early stopping.

The code below carries out both cross validation with 5 folds and early stopping with 100 early stopping rounds.
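
The early stopping rule itself is simple enough to sketch without LightGBM. The helper below, `early_stopping_round`, is a hypothetical illustration of the idea (stop once the validation metric has not improved for `patience` rounds), not LightGBM's internal code, and the scores are made up.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, rounds_run), 1-indexed, for a metric we want to maximize."""
    best_score = float("-inf")
    best_round = 0
    for round_number, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_number
        elif round_number - best_round >= patience:
            # No improvement for `patience` rounds: stop training here
            return best_round, round_number
    return best_round, len(scores)


# Hypothetical validation AUC per boosting round
validation_auc = [0.80, 0.85, 0.87, 0.86, 0.869, 0.868, 0.90]
best_round, rounds_run = early_stopping_round(validation_auc, patience=3)
print(best_round, rounds_run)  # 3 6
```

Training stops at round 6 and never sees the later value of 0.90, which illustrates both why the upper limit of 10,000 rounds is not reached and why the patience should not be set too small.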

In [3]:

# Get default hyperparameters
model = lgb.LGBMClassifier()
default_params = model.get_params()

# Remove the number of estimators because we set this to 10000 in the cv call
del default_params['n_estimators']

# Cross validation with early stopping
cv_results = lgb.cv(default_params, train_set, num_boost_round = 10000, early_stopping_rounds = 100, 
                    metrics = 'auc', nfold = 5, seed = 42)

from sklearn.metrics import roc_auc_score

# Optimal number of estimators found in cv
model.n_estimators = len(cv_results['auc-mean'])

# Train and make predictions with the model
model.fit(X_train, y_train)
preds = model.predict_proba(X_test)[:, 1]
baseline_auc = roc_auc_score(y_test, preds)

# print('The baseline model scores {:.5f} ROC AUC on the test set.'.format(baseline_auc))

**Objective Function**  

The objective function takes in hyperparameters and outputs a value representing a score. Here our score will be the `ROC AUC`, which we want to maximize. Later, when we need a value to minimize, we can use `1 − ROC AUC` as the score. We will use cross validation with the specified model hyperparameters to get the cross-validation ROC AUC. This score will then be used to select the best model hyperparameter values.
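
To make the score and its "loss" form concrete, the following sketch computes ROC AUC directly from its ranking definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The `roc_auc` helper and the toy data are illustrative assumptions; in the notebook the value comes from LightGBM and `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)  # value to maximize
loss = 1 - auc                 # equivalent value to minimize
print(auc, loss)  # 0.75 0.25
```

Three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75 and the corresponding loss is 0.25.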

In [172]:
def objective(hyperparameters, iteration):
    """Objective function for grid search. Returns
       the cross validation score from a set of hyperparameters."""
    
    # Number of estimators will be found using early stopping
    if 'n_estimators' in hyperparameters.keys():
        del hyperparameters['n_estimators']
    
    # Perform n_folds cross validation
    cv_results = lgb.cv(hyperparameters, train_set, num_boost_round = 10000, nfold = 5, 
                        early_stopping_rounds = 100, metrics = 'auc', seed = 42)
    
    # Results to return
    score = cv_results['auc-mean'][-1]
    estimators = len(cv_results['auc-mean'])
    hyperparameters['n_estimators'] = estimators 
    
    return [score, hyperparameters, iteration]


In [163]:
score, params, iteration = objective(default_params, 1)

print('The cross-validation ROC AUC was {:.5f}.'.format(score))

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1015
[LightGBM] [Info] Number of data points in the train set: 21829, number of used features: 19
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1015
[LightGBM] [Info] Number of data points in the train set: 21829, number of used features: 19
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1015
[LightGBM] [Info] Number of data points in the train set: 21830, number of used features: 19
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1015
[LightGBM] [Info] Number of data points in the train set: 21830, number of used features: 19
You can set `force_row_wise=true` to remove the overhead.
And if mem

### Hyperparameter Tuning Implementation
We will use cross validation to measure the performance of each set of model hyperparameters, together with early stopping, so that we do not have to tune the number of estimators. The basic strategy for both grid and random search is simple: for each hyperparameter value combination, evaluate the cross-validation score and record the result along with the hyperparameters. Then, at the end of the search, choose the hyperparameters that yielded the highest cross-validation score, train the model on all the training data, and make predictions on the test data.
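
The strategy described above can be sketched for the random search case in a few lines. Everything here is illustrative: `random_search`, `toy_grid`, and `toy_objective` are hypothetical names, and the toy objective is a stand-in for a real cross-validation score.

```python
import random


def random_search(param_grid, objective, max_evals=5, seed=42):
    """Sample hyperparameter combinations at random and rank them by score."""
    rng = random.Random(seed)
    results = []
    for iteration in range(max_evals):
        # Draw one value per hyperparameter, independently of the others
        hyperparameters = {name: rng.choice(values) for name, values in param_grid.items()}
        score = objective(hyperparameters)
        results.append((score, hyperparameters, iteration))
    # Best score first
    results.sort(key=lambda row: row[0], reverse=True)
    return results


toy_grid = {
    "num_leaves": list(range(20, 100)),
    "learning_rate": [0.005, 0.05, 0.5],
}


def toy_objective(hyperparameters):
    """Stand-in score (not a real cv result): prefers num_leaves near 50."""
    return -abs(hyperparameters["num_leaves"] - 50) - hyperparameters["learning_rate"]


results = random_search(toy_grid, toy_objective)
best_score, best_params, best_iteration = results[0]
print(best_score, best_params)
```

Because each draw is independent, every hyperparameter is varied from the first evaluation onward, which is the main practical difference from walking through a grid in order.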

In [149]:
# Create a default model
model = lgb.LGBMModel()
model.get_params()

{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 100,
 'n_jobs': -1,
 'num_leaves': 31,
 'objective': None,
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': 'warn',
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0}

Some of these we do not need to tune, such as `silent`, `objective`, `random_state`, and `n_jobs`, and we will use early stopping to determine perhaps the most important hyperparameter, the number of individual learners trained, `n_estimators` (also referred to as `num_boost_round` or the number of iterations). Some of the hyperparameters do not need to be tuned if others are: for example, `min_child_samples` and `min_child_weight` both limit the complexity of individual decision trees by adjusting the minimum leaf observation requirements, so we will only adjust one of them. However, there are still many hyperparameters to optimize.

Each of the values in the dictionary must be a list, so we use `list` combined with `range`, `np.linspace`, and `np.logspace` to define the range of values for each hyperparameter.

In [156]:
# Hyperparameter grid
param_grid = {
    'boosting_type': ['gbdt', 'goss'],
    'num_leaves': list(range(20, 100)),
    'learning_rate': list(np.logspace(np.log10(0.005), np.log10(0.5), base = 10, num = 100)),
    'subsample_for_bin': list(range(20000, 30000, 200)),
    'min_child_samples': list(range(20, 500, 5)),
    'reg_alpha': list(np.linspace(0, 1)),
    'reg_lambda': list(np.linspace(0, 1)),
    'colsample_bytree': list(np.linspace(0.6, 1, 10)),
    'subsample': list(np.linspace(0.5, 1, 10))
    
}
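
It is worth counting how many combinations the grid above contains. The sketch below multiplies the number of candidate values per hyperparameter; the counts for the `np.logspace` and `np.linspace` entries are written out by hand (100 from `num = 100`, 50 from the `np.linspace` default, 10 where `10` is passed).

```python
import math

# Number of candidate values per hyperparameter in param_grid
grid_sizes = {
    "boosting_type": 2,
    "num_leaves": len(range(20, 100)),
    "learning_rate": 100,                                 # np.logspace(..., num = 100)
    "subsample_for_bin": len(range(20000, 30000, 200)),
    "min_child_samples": len(range(20, 500, 5)),
    "reg_alpha": 50,                                      # np.linspace default num=50
    "reg_lambda": 50,                                     # np.linspace default num=50
    "colsample_bytree": 10,
    "subsample": 10,
}

total = math.prod(grid_sizes.values())
print(f"{total:,} combinations")  # 19,200,000,000,000 combinations
```

At roughly 1.9 × 10^13 combinations, an exhaustive grid search is out of reach, which is why the search has to be capped at a small number of evaluations.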

In [187]:
def grid_search(param_grid, MAX_EVALS = 5):
    """Grid search algorithm (with limit on max evals)"""
    
    # Dataframe to store results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                              index = list(range(MAX_EVALS)))
    
    keys, values = zip(*param_grid.items()) # unpack the values in the hyperparameter grid dictionary
    
    i = 0
    
    # Iterate through every possible combination of hyperparameters
    for v in itertools.product(*values):
        
        # Create a hyperparameter dictionary
        hyperparameters = dict(zip(keys, v))
        
        # Set the subsample ratio accounting for boosting type
        hyperparameters['subsample'] = 1.0 if hyperparameters['boosting_type'] == 'goss' else hyperparameters['subsample']
        
        #The objective function returns the cross validation score from the hyperparameters which we record in the dataframe.
        eval_results = objective(hyperparameters, i)
        
        results.loc[i, :] = eval_results
        
        i += 1
        
        # Stop after MAX_EVALS evaluations to limit the run time
        if i >= MAX_EVALS:
            break
       
    # Sort with best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.reset_index(inplace = True)
    
    return results   

grid_results = grid_search(param_grid)

print('The best validation score was {:.5f}'.format(grid_results.loc[0, 'score']))
print('\nThe best hyperparameters were:')

pprint.pprint(grid_results.loc[0, 'params'])

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 995
[LightGBM] [Info] Number of data points in the train set: 21829, number of used features: 19
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 995
[LightGBM] [Info] Number of data points in the train set: 21829, number of used features: 19
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 995
[LightGBM] [Info] Number of data points in the train set: 21830, number of used features: 19
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 995
[LightGBM] [Info] Number of data points in the train set: 21830, number of used features: 19
You can 

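
One property of capped grid search is worth illustrating: `itertools.product` varies the last key fastest, so the first few combinations differ only in the final hyperparameter. The small grid below is a hypothetical stand-in for `param_grid`.

```python
import itertools

# Hypothetical miniature grid with the same key order idea as param_grid
toy_grid = {
    "boosting_type": ["gbdt", "goss"],
    "num_leaves": [20, 21, 22],
    "subsample": [0.5, 0.75, 1.0],
}

keys, values = zip(*toy_grid.items())

# Take only the first three combinations, as a capped grid search would
first_three = [
    dict(zip(keys, combo))
    for combo in itertools.islice(itertools.product(*values), 3)
]
for combo in first_three:
    print(combo)
```

All three combinations share `boosting_type='gbdt'` and `num_leaves=20`; only `subsample` changes. With a small evaluation cap, a grid search therefore explores a very narrow corner of the grid, and the values it reports as best should be read with that in mind.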

In [208]:
param = {
    'boosting_type': 'gbdt',
    'colsample_bytree': 0.6,
    'learning_rate': 0.004999999999999999,
    'min_child_samples': 20,
    'num_leaves': 20,
    'objective': 'binary',
    'save_binary': True,
    'verbose': 1,
    # Upper limit on boosting rounds; early stopping picks the actual number.
    # (A dict keeps only the last value of a repeated key, so this is the value in effect.)
    'n_estimators': 1000,
    'metric': 'auc',
    'is_unbalance': True,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0,
    'subsample': 0.5,
    'subsample_for_bin': 20000}

In [210]:
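# 'n_estimators' in param is an alias of num_boost_round and takes precedence over
# the argument, so both are set to 1000 here; early stopping ends training sooner.
classifier = lgb.train(param, train_set, num_boost_round=1000,
                       valid_sets=[train_set, test_set],
                       early_stopping_rounds=50)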


[LightGBM] [Info] Number of positive: 228, number of negative: 27059
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 995
[LightGBM] [Info] Number of data points in the train set: 27287, number of used features: 19
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.008356 -> initscore=-4.776429
[LightGBM] [Info] Start training from score -4.776429
[1]	training's auc: 0.832723	valid_1's auc: 0.810064
Training until validation scores don't improve for 50 rounds
[2]	training's auc: 0.863834	valid_1's auc: 0.83701
[3]	training's auc: 0.875344	valid_1's auc: 0.848847
[4]	training's auc: 0.883197	valid_1's auc: 0.853995
[5]	training's auc: 0.890316	valid_1's auc: 0.862422
[6]	training's auc: 0.896771	valid_1's auc: 0.865076
[7]	training's auc: 0.899682	valid_1's auc: 0.868088
[8]	training's auc: 0.901516	valid_1's auc: 0.869421
[9]	training's auc: 0.903958	valid_1's auc: 0.872009
[10]	trainin

[184]	training's auc: 0.954669	valid_1's auc: 0.913627
[185]	training's auc: 0.954721	valid_1's auc: 0.913613
[186]	training's auc: 0.954668	valid_1's auc: 0.913548
[187]	training's auc: 0.954833	valid_1's auc: 0.913548
[188]	training's auc: 0.954889	valid_1's auc: 0.913659
[189]	training's auc: 0.954816	valid_1's auc: 0.913524
[190]	training's auc: 0.95476	valid_1's auc: 0.913522
[191]	training's auc: 0.955244	valid_1's auc: 0.913714
[192]	training's auc: 0.955284	valid_1's auc: 0.913751
[193]	training's auc: 0.955272	valid_1's auc: 0.913777
[194]	training's auc: 0.955649	valid_1's auc: 0.914007
[195]	training's auc: 0.955904	valid_1's auc: 0.914229
[196]	training's auc: 0.956011	valid_1's auc: 0.914213
[197]	training's auc: 0.955966	valid_1's auc: 0.914218
[198]	training's auc: 0.956004	valid_1's auc: 0.914308
[199]	training's auc: 0.956152	valid_1's auc: 0.914304
[200]	training's auc: 0.95647	valid_1's auc: 0.914465
[201]	training's auc: 0.95684	valid_1's auc: 0.914673
[202]	trainin

[347]	training's auc: 0.967357	valid_1's auc: 0.921223
[348]	training's auc: 0.967343	valid_1's auc: 0.921252
[349]	training's auc: 0.967374	valid_1's auc: 0.921222
[350]	training's auc: 0.967327	valid_1's auc: 0.921197
[351]	training's auc: 0.967476	valid_1's auc: 0.921359
[352]	training's auc: 0.967699	valid_1's auc: 0.921505
[353]	training's auc: 0.96783	valid_1's auc: 0.921601
[354]	training's auc: 0.967874	valid_1's auc: 0.921624
[355]	training's auc: 0.967908	valid_1's auc: 0.921621
[356]	training's auc: 0.967936	valid_1's auc: 0.921673
[357]	training's auc: 0.96816	valid_1's auc: 0.921802
[358]	training's auc: 0.96825	valid_1's auc: 0.921864
[359]	training's auc: 0.968323	valid_1's auc: 0.921928
[360]	training's auc: 0.968323	valid_1's auc: 0.921886
[361]	training's auc: 0.968576	valid_1's auc: 0.921991
[362]	training's auc: 0.968736	valid_1's auc: 0.922081
[363]	training's auc: 0.968832	valid_1's auc: 0.922195
[364]	training's auc: 0.969056	valid_1's auc: 0.922366
[365]	trainin

[508]	training's auc: 0.978987	valid_1's auc: 0.928444
[509]	training's auc: 0.979018	valid_1's auc: 0.928477
[510]	training's auc: 0.979111	valid_1's auc: 0.928553
[511]	training's auc: 0.979173	valid_1's auc: 0.928496
[512]	training's auc: 0.979216	valid_1's auc: 0.928533
[513]	training's auc: 0.979296	valid_1's auc: 0.928566
[514]	training's auc: 0.979332	valid_1's auc: 0.928602
[515]	training's auc: 0.979394	valid_1's auc: 0.928658
[516]	training's auc: 0.979378	valid_1's auc: 0.928674
[517]	training's auc: 0.979407	valid_1's auc: 0.928705
[518]	training's auc: 0.979476	valid_1's auc: 0.928774
[519]	training's auc: 0.97948	valid_1's auc: 0.928769
[520]	training's auc: 0.979572	valid_1's auc: 0.928833
[521]	training's auc: 0.979592	valid_1's auc: 0.928888
[522]	training's auc: 0.979606	valid_1's auc: 0.928906
[523]	training's auc: 0.979677	valid_1's auc: 0.928959
[524]	training's auc: 0.979747	valid_1's auc: 0.928981
[525]	training's auc: 0.979825	valid_1's auc: 0.928999
[526]	train

[674]	training's auc: 0.985878	valid_1's auc: 0.931003
[675]	training's auc: 0.985891	valid_1's auc: 0.931003
[676]	training's auc: 0.985933	valid_1's auc: 0.931022
[677]	training's auc: 0.985954	valid_1's auc: 0.931015
[678]	training's auc: 0.985986	valid_1's auc: 0.931004
[679]	training's auc: 0.98601	valid_1's auc: 0.930992
[680]	training's auc: 0.986023	valid_1's auc: 0.931008
[681]	training's auc: 0.986052	valid_1's auc: 0.93103
[682]	training's auc: 0.986082	valid_1's auc: 0.931007
Early stopping, best iteration is:
[632]	training's auc: 0.984612	valid_1's auc: 0.931098
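
The `pavg` and `initscore` values near the top of the training log can be reproduced by hand from the class counts it reports (228 positives, 27,059 negatives). The sketch assumes the usual starting point for a binary objective, the log-odds of the average label; the variable names are illustrative.

```python
import math

n_positive = 228     # from the log: "Number of positive: 228"
n_negative = 27059   # from the log: "number of negative: 27059"

# Average label, i.e. the share of positive rows in the training set
pavg = n_positive / (n_positive + n_negative)

# Log-odds of the average label, used as the initial raw score
initscore = math.log(pavg / (1 - pavg))

print(f"pavg={pavg:.6f} -> initscore={initscore:.4f}")
```

A positive rate below 1% is why the notebook sets `is_unbalance` and evaluates with AUC rather than accuracy: a model that predicts "no default" for every row would already be more than 99% accurate.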


The penalty for mislabeling a loan that will default as legitimate is that the organization loses the money it lent. To protect the company's finances, we therefore try to flag as many loan defaults as possible, while accepting that no matter how many loan applications are rejected, some debtors will still default.  
AUC is therefore a good measure here: it reflects how well the model ranks defaults above non-defaults across all decision thresholds, which matters when the cost of a misclassification is high.

