# Model Tuning Practice

By `Atwine Mugume Twinamatsiko`

I have been looking at some kernels on model tuning using hyperopt and other implementations, so I want to try it out myself and see what it can do.

This is a great [article](https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc) to read on LightGBM.

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Modeling
import lightgbm as lgb

# Splitting data
from sklearn.model_selection import train_test_split


N_FOLDS = 5
MAX_EVALS = 5

This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``brew install libomp``.


In [2]:
#let's read in the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
#for rapid prototyping, let's take a few rows of the training data
train_sample = train.sample(n = 20000, random_state = 42)

In [4]:
train.head(4)

Unnamed: 0.1,Unnamed: 0,BatchId_0,BatchId_1,BatchId_2,BatchId_3,BatchId_4,BatchId_5,BatchId_6,BatchId_7,BatchId_8,...,TransactionStartTimeDayofweek,TransactionStartTimeDayofyear,TransactionStartTimeIs_month_end,TransactionStartTimeIs_month_start,TransactionStartTimeIs_quarter_end,TransactionStartTimeIs_quarter_start,TransactionStartTimeIs_year_end,TransactionStartTimeIs_year_start,TransactionStartTimeElapsed,FraudResult
0,0,0,0,1,0,1,1,0,1,1,...,3,319,False,False,False,False,False,False,1542248329,0
1,1,0,0,0,1,1,1,1,1,0,...,3,319,False,False,False,False,False,False,1542248348,0
2,2,0,0,1,1,1,0,1,0,1,...,3,319,False,False,False,False,False,False,1542249861,0
3,3,0,0,0,0,0,0,0,1,1,...,3,319,False,False,False,False,False,False,1542252775,0


In [5]:
#let's take out the label of the data
label = train_sample['FraudResult']
train_sample = train_sample.drop(columns=['FraudResult'])

In [6]:
train.shape, test.shape

((95662, 92), (45019, 91))

In [7]:
#let's split the data and get to it already
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_sample,
                                                    label)

In [8]:
#we have an imbalance in the data so we are going to do SMOTE
from imblearn.over_sampling import SMOTE

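# Note: recent imbalanced-learn releases use `sampling_strategy` and `fit_resample`;
# older releases spelled these `ratio` and `fit_sample`.
smote = SMOTE(random_state=27, sampling_strategy='minority')
X_train, y_train = smote.fit_resample(X_train, y_train)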

#Let's see the shape
X_train.shape, y_train.shape

((29934, 91), (29934,))

In [9]:
print("After OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train==0)))

After OverSampling, counts of label '1': 14967
After OverSampling, counts of label '0': 14967


In [10]:
#here we want to see what the shapes of the data look like.
print("Training features shape: ", X_train.shape)
print("Testing features shape: ", X_test.shape)

Training features shape:  (29934, 91)
Testing features shape:  (5000, 91)


### Cross Validation

Since we have a small dataset, we can't carve out a separate validation set, so we are going to use cross-validation instead.

We are also going to implement early stopping. As the model builds more trees, the training loss keeps falling, but the model also becomes more complex and eventually stops improving on held-out data.

In the case of the GBM, this means training more decision trees, and in this example, we will use early stopping with 100 rounds, meaning that the training will continue until validation error has not decreased for 100 rounds. Then, the number of estimators that yielded the best score on the validation data will be chosen as the number of estimators to use in the final model.
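The stopping rule described above can be sketched in plain Python. This is a minimal illustration, not LightGBM's internal implementation: the function name `best_round_with_early_stopping`, the `patience` argument, and the toy score list are all assumptions made for the example, and the score is treated as higher-is-better (as with AUC).

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, best_score, rounds_run) for a higher-is-better metric.

    scores   : validation score after each boosting round (hypothetical values)
    patience : stop once this many rounds have passed without an improvement
    """
    best_score = float("-inf")
    best_round = 0
    rounds_run = 0
    for round_number, score in enumerate(scores, start=1):
        rounds_run = round_number
        if score > best_score:
            # New best: remember the score and the round that produced it
            best_score = score
            best_round = round_number
        elif round_number - best_round >= patience:
            # No improvement for `patience` rounds: stop training
            break
    return best_round, best_score, rounds_run


toy_scores = [0.90, 0.93, 0.95, 0.95, 0.94, 0.95, 0.96, 0.92]
best_round, best_score, rounds_run = best_round_with_early_stopping(toy_scores, patience=3)
print(best_round, best_score, rounds_run)
```

With `patience=3`, the loop stops at round 6 and keeps 3 rounds. The later improvement at round 7 is never seen, which is the trade-off a small patience makes; the notebook uses 100 rounds.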

In [11]:
#in order to use cross validation with LightGBM we need to create a LightGBM Dataset
# Create a training and testing dataset
train_set = lgb.Dataset(data = X_train, label = y_train)
test_set = lgb.Dataset(data = X_test, label = y_test)

In [12]:
#in the code below we carry out cross validation with 100 rounds early stopping

model = lgb.LGBMClassifier() #initialize the model
default_params = model.get_params() #get the default hyperparameters

# Remove the number of estimators because we set this to 10000 in the cv call
del default_params['n_estimators'] #remove the default estimators
#please note: num_boost_round is the same as n_estimators

# Cross validation with early stopping
cv_results = lgb.cv(default_params, train_set, num_boost_round = 10000, early_stopping_rounds = 100, 
                    metrics = 'auc', nfold = N_FOLDS, seed = 42)



In [13]:
#I need the f1_score metric also because that is what the competition is based on
def print_f1(model):
    from sklearn.metrics import f1_score
    model.fit(X_train,y_train)
    #what ever model we choose we run a predict test on it
    pred = model.predict(X_test)
    
    #print the f1 score
    return f1_score(y_test,pred)
    

In [14]:
#the f1 score of the model above is here.
print_f1(model)

0.75

In [15]:
#let's have a look at the dictionary with the cv_results
print('The maximum validation ROC AUC was: {:.5f} with a standard deviation of {:.5f}.'.format(cv_results['auc-mean'][-1], cv_results['auc-stdv'][-1]))
print('The optimal number of boosting rounds (estimators) was {}.'.format(len(cv_results['auc-mean'])))

The maximum validation ROC AUC was: 1.00000 with a standard deviation of 0.00000.
The optimal number of boosting rounds (estimators) was 62.


Normally we would use the result above as a baseline to beat, which makes more sense when the baseline is lower. However, we are going to continue and implement a working solution for hyperparameter tuning using hyperopt (an automatic tuner).

In [16]:
#let's check how our model is doing on our test data
from sklearn.metrics import roc_auc_score,f1_score

# Optimal number of estimators found in cv
model.n_estimators = len(cv_results['auc-mean'])

# Train and make predictions with model
model.fit(X_train, y_train)
preds = model.predict_proba(X_test)[:, 1]
baseline_auc = roc_auc_score(y_test, preds)

#here I compute the f1 score manually.
preds_ = model.predict(X_test)
baseline_f1 = f1_score(y_test, preds_)

print('The baseline model scores {:.5f} ROC AUC on the test set.'.format(baseline_auc))
print('The baseline model scores {:.5f} F1_score on the test set.'.format(baseline_f1))

The baseline model scores 0.99991 ROC AUC on the test set.
The baseline model scores 0.82353 F1_score on the test set.


## Hyperparameter tuning implementation

### Four parts of Hyperparameter tuning

It's helpful to think of hyperparameter tuning as having four parts (these four parts also will form the basis of Bayesian Optimization):

1. Objective function: a function that takes in hyperparameters and returns a score we are trying to minimize or maximize
2. Domain: the set of hyperparameter values over which we want to search. 
3. Algorithm: method for selecting the next set of hyperparameters to evaluate in the objective function.
4. Results history: data structure containing each set of hyperparameters and the resulting score from the objective function.

Switching from grid to random search to Bayesian optimization will only require making minor modifications to these four parts. 
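As a rough sketch of how the four parts fit together, the toy search below uses a made-up objective (a quadratic with its peak at `x = 3`) rather than a real model. The names `toy_objective`, `domain`, `choose_next`, and `history` are hypothetical and exist only for this illustration.

```python
import random

# 1. Objective function: takes hyperparameters, returns a score to maximize.
def toy_objective(params):
    return -(params["x"] - 3) ** 2

# 2. Domain: the values we are willing to try.
domain = {"x": list(range(0, 11))}

# 3. Algorithm: how the next candidate is chosen (here: uniformly at random).
rng = random.Random(42)

def choose_next(domain, rng):
    return {name: rng.choice(values) for name, values in domain.items()}

# 4. Results history: every candidate and the score it received.
history = []
for iteration in range(20):
    params = choose_next(domain, rng)
    history.append({"score": toy_objective(params), "params": params, "iteration": iteration})

# Best result first, mirroring how the search functions in this notebook sort.
history.sort(key=lambda row: row["score"], reverse=True)
best = history[0]
print(best)
```

Swapping random search for grid search or Bayesian optimization only changes `choose_next`; the other three parts stay the same.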

#### Objective Function

The objective function takes in hyperparameters and outputs a value representing a score. Traditionally in optimization, this is a score to minimize, but here our score will be the cross-validation ROC AUC, which of course we want to maximize. (The competition metric, F1, is checked separately on the held-out split.) Later, when we get to Bayesian Optimization, we will have to use a value to minimize, so we can take $1 - \text{ROC AUC}$ as the score. What occurs in the middle of the objective function will vary according to the problem, but for this problem, we will use cross validation with the specified model hyperparameters to get the cross-validation ROC AUC. This score will then be used to select the best model hyperparameter values.

In addition to returning the value to maximize, our objective function will return the hyperparameters and the iteration of the search. These results will let us go back and inspect what occurred during a search. The code below implements a simple objective function which we can use for both grid and random search.

In [17]:
def objective(hyperparameters, iteration):
    """Objective function for grid and random search. Returns
       the cross validation score from a set of hyperparameters."""
    
    # Number of estimators will be found using early stopping
    if 'n_estimators' in hyperparameters.keys():
        del hyperparameters['n_estimators']
    
    # Perform n_folds cross validation
    cv_results = lgb.cv(hyperparameters, train_set, num_boost_round = 10000, nfold = N_FOLDS, 
                        early_stopping_rounds = 100, metrics = 'auc', seed = 42)
    
    # results to return
    score = cv_results['auc-mean'][-1]
    estimators = len(cv_results['auc-mean'])
    hyperparameters['n_estimators'] = estimators 
    
    return [score, hyperparameters, iteration]

#### Domain

The domain is the set of hyperparameter values that the algorithm searches through in order to find the best results.

In [18]:
# Create a default model to show hyper parameters
# We don't need to tune all the parameters, just some of them.
model = lgb.LGBMModel()
model.get_params()

{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 100,
 'n_jobs': -1,
 'num_leaves': 31,
 'objective': None,
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0}

In [19]:
# Hyperparameter grid
param_grid = {
    'boosting_type': ['gbdt', 'goss', 'dart'],
    'num_leaves': list(range(20, 150)),
    'learning_rate': list(np.logspace(np.log10(0.005), np.log10(0.5), base = 10, num = 1000)),
    'subsample_for_bin': list(range(20000, 300000, 20000)),
    'min_child_samples': list(range(20, 500, 5)),
    'reg_alpha': list(np.linspace(0, 1)),
    'reg_lambda': list(np.linspace(0, 1)),
    'colsample_bytree': list(np.linspace(0.6, 1, 10)),
    'subsample': list(np.linspace(0.5, 1, 100)),
    'is_unbalance': [True, False]
}

#### Results History

The results history is a data structure that contains the hyperparameter combinations and the resulting score on the objective function. When we get to Bayesian Optimization, the model actually _uses the past results to decide on the next hyperparameters_ to evaluate. Random and grid search are _uninformed_ methods that do not use the past history, but we still need the history so we can find out which hyperparameters worked the best!

A dataframe is a useful data structure to hold the results.

In [20]:
# Dataframes for random and grid search
random_results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                              index = list(range(MAX_EVALS)))

grid_results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                              index = list(range(MAX_EVALS)))

# Grid Search Implementation

Grid search tries every combination of values in the grid, which makes it computationally expensive: building all the combinations up front would take a lot of memory, and evaluating them all would take far too long.

A good way to avoid the memory cost is to unpack the parameter grid dictionary and walk through the combinations one at a time with `itertools.product`, so that at each iteration a single set of values is evaluated and its result is stored. We also cap the number of evaluations at `MAX_EVALS`.
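To make the cost concrete, the sketch below counts the combinations in a grid and shows that `itertools.product` yields them one at a time instead of building the full list. The list lengths are read off the `param_grid` cell above (for example, `np.linspace(0, 1)` defaults to 50 points); `toy_grid` is a hypothetical miniature grid used only to show the iteration order.

```python
import itertools
import math

# Number of candidate values per hyperparameter in param_grid
grid_lengths = {
    "boosting_type": 3,
    "num_leaves": 130,          # range(20, 150)
    "learning_rate": 1000,
    "subsample_for_bin": 14,    # range(20000, 300000, 20000)
    "min_child_samples": 96,    # range(20, 500, 5)
    "reg_alpha": 50,
    "reg_lambda": 50,
    "colsample_bytree": 10,
    "subsample": 100,
    "is_unbalance": 2,
}
total_combinations = math.prod(grid_lengths.values())
print(f"{total_combinations:,} combinations")

# A miniature grid to show the order in which combinations are produced
toy_grid = {"boosting_type": ["gbdt", "goss"], "num_leaves": [20, 21], "is_unbalance": [True, False]}
keys, values = zip(*toy_grid.items())

# islice takes the first few combinations without generating the rest
first_three = [dict(zip(keys, combo)) for combo in itertools.islice(itertools.product(*values), 3)]
for combo in first_three:
    print(combo)
```

Because the last key varies fastest, a capped grid search only ever explores one corner of the grid: the first few combinations all share the same `boosting_type`.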

In [21]:
import itertools

def grid_search(param_grid, max_evals = MAX_EVALS):
    """Grid search algorithm (with limit on max evals)"""
    
    # Dataframe to store results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                              index = list(range(MAX_EVALS)))
    
    # https://codereview.stackexchange.com/questions/171173/list-all-possible-permutations-from-a-python-dictionary-of-lists
    
    keys, values = zip(*param_grid.items())#this bundles the dictionary so that items match
    
    i = 0
    
    # Iterate through every possible combination of hyperparameters
    for v in itertools.product(*values):
        
        # Create a hyperparameter dictionary
        hyperparameters = dict(zip(keys, v))
        
        # Set the subsample ratio accounting for boosting type
        hyperparameters['subsample'] = 1.0 if hyperparameters['boosting_type'] == 'goss' else hyperparameters['subsample']
        
        # Evaluate the hyperparameters
        eval_results = objective(hyperparameters, i)
        
        results.loc[i, :] = eval_results #this adds the input into the dataframe which will keep results
        
        i += 1
        
        # Stop once MAX_EVALS combinations have been evaluated
        # (normally grid search would not limit iterations)
        if i >= MAX_EVALS:
            break
       
    # Sort with best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.reset_index(inplace = True)
    
    return results    

In [22]:
grid_results = grid_search(param_grid)

print('The best validation score was {:.5f}'.format(grid_results.loc[0, 'score']))
print('\nThe best hyperparameters were:')

import pprint
pprint.pprint(grid_results.loc[0, 'params']) #since we sorted these in the function above, the top row has the best score

The best validation score was 1.00000

The best hyperparameters were:
{'boosting_type': 'gbdt',
 'colsample_bytree': 0.6,
 'is_unbalance': True,
 'learning_rate': 0.004999999999999999,
 'min_child_samples': 20,
 'n_estimators': 58,
 'num_leaves': 20,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'subsample': 0.5,
 'subsample_for_bin': 20000}


Now that we have some parameters, let us evaluate them and see what scores they return.

In [23]:
# Get the best parameters
grid_search_params = grid_results.loc[0, 'params']

# Create, train, test model
model = lgb.LGBMClassifier(**grid_search_params, random_state=42)
model.fit(X_train, y_train)

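#check the AUC and f1_score of the new model.
preds = model.predict_proba(X_test)[:, 1]
pred_ = model.predict(X_test)

# F1 must use this model's class predictions (`pred_`). Passing `preds_`, which still holds the
# baseline predictions from cell 16, reproduces the baseline F1 of 0.82353 shown in the output below.
print('The best model from grid search scores {:.5f} ROC AUC on the test set.'.format(roc_auc_score(y_test, preds)))
print('The best model from grid search scores {:.5f} F1_score on the test set.'.format(f1_score(y_test, pred_)))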

The best model from grid search scores 0.99994 ROC AUC on the test set.
The best model from grid search scores 0.82353 ROC AUC on the test set.


# Random Search

This method is better than grid search in terms of exploring the sample space of the variables that we have. It helps us come close to the best parameters through random choice.
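A small experiment illustrates the claim. Under the same budget of evaluations, the sketch below counts how many distinct values of each hyperparameter a capped grid search and a random search actually try. The grid `toy_grid`, the budget of 5, and the seed are arbitrary choices for the example.

```python
import itertools
import random

toy_grid = {
    "num_leaves": list(range(20, 30)),
    "learning_rate": [0.005, 0.01, 0.05, 0.1, 0.5],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}
budget = 5
keys, values = zip(*toy_grid.items())

# Capped grid search: the first `budget` combinations in product order
grid_trials = [dict(zip(keys, combo)) for combo in itertools.islice(itertools.product(*values), budget)]

# Random search: draw each hyperparameter independently
rng = random.Random(0)
random_trials = [{k: rng.choice(v) for k, v in toy_grid.items()} for _ in range(budget)]

def distinct_values(trials):
    """Count the distinct values tried for each hyperparameter."""
    return {k: len({trial[k] for trial in trials}) for k in keys}

grid_coverage = distinct_values(grid_trials)
random_coverage = distinct_values(random_trials)
print("grid:  ", grid_coverage)
print("random:", random_coverage)
```

The capped grid search varies only `subsample`, while random search will usually touch several values of every hyperparameter with the same budget.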

Let's reimplement it below.

In [24]:
def random_search(param_grid, max_evals = MAX_EVALS):
    """Random search for hyperparameter optimization"""
    
    import random #from this function we will be able to choose randomly from the parameters 
    
    # Dataframe for results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                                  index = list(range(MAX_EVALS)))
    
    # Keep searching until reach max evaluations
    for i in range(MAX_EVALS):
        
        # Choose random hyperparameters
        hyperparameters = {k: random.sample(v, 1)[0] for k, v in param_grid.items()}
        hyperparameters['subsample'] = 1.0 if hyperparameters['boosting_type'] == 'goss' else hyperparameters['subsample']

        # Evaluate randomly selected hyperparameters
        eval_results = objective(hyperparameters, i)
        
        results.loc[i, :] = eval_results
    
    # Sort with best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.reset_index(inplace = True)
    return results 

In [25]:
#let's take it for a spin
random_results = random_search(param_grid)

print('The best validation score was {:.5f}'.format(random_results.loc[0, 'score']))
print('The best hyperparameters were:')

import pprint
pprint.pprint(random_results.loc[0, 'params'])



The best validation score was 1.00000
The best hyperparameters were:
{'boosting_type': 'gbdt',
 'colsample_bytree': 0.7333333333333333,
 'is_unbalance': True,
 'learning_rate': 0.26711466497691755,
 'min_child_samples': 425,
 'n_estimators': 534,
 'num_leaves': 80,
 'reg_alpha': 0.2857142857142857,
 'reg_lambda': 0.8163265306122448,
 'subsample': 0.9040404040404041,
 'subsample_for_bin': 120000}


In [26]:
# Hyperparameters are hard-coded here; note that they differ from the best set printed above

# Create, train, test model
model = lgb.LGBMClassifier(boosting_type='dart', colsample_bytree=0.77777777778, is_unbalance=True,
                           learning_rate=0.4404685952236995, min_child_samples=400, n_estimators=10000,
                           num_leaves=91, reg_alpha=0.26530612244897955, reg_lambda=0.22448979591836732,
                           subsample=0.83333333333334, subsample_for_bin=160000, random_state=42)
model.fit(X_train, y_train)

preds = model.predict_proba(X_test)[:, 1]

pred_ = model.predict(X_test)

print('The best model from random search scores {:.5f} ROC AUC on the test set.'.format(roc_auc_score(y_test, preds)))
print('The best model from random search scores {:.5f} F1_Score on the test set.'.format(f1_score(y_test, pred_)))

The best model from random search scores 0.99997 ROC AUC on the test set.
The best model from random search scores 0.85714 F1_Score on the test set.


In [27]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, pred_))

[[4992    1]
 [   1    6]]
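
As a check, the F1 score reported above can be recomputed by hand from this confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Counts taken from the confusion matrix printed above
tn, fp = 4992, 1
fn, tp = 1, 6

precision = tp / (tp + fp)   # of the predicted frauds, how many were real
recall = tp / (tp + fn)      # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision = {precision:.5f}")
print(f"recall    = {recall:.5f}")
print(f"F1        = {f1:.5f}")
print(f"accuracy  = {accuracy:.5f}")
```

F1 comes out at 0.85714, matching the printed score, while accuracy is 0.99960. With only 7 positive cases in 5,000 rows, accuracy barely moves when a fraud case is missed, which is why F1 is the more informative metric here.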


### Next Steps

For random search, we would like to run as many iterations as possible, to give it enough chances to randomly pick parameter combinations that yield good results on our dataset.

To keep a record of the iterations and their results, we are going to write each result to a CSV file.

### Extremely Important Note about Checking Files

When you want to check the csv file, __do not open it in Excel while the search is ongoing__. This will cause a permission error in Python and the search will be terminated. Instead, you can view the end of the file by typing `tail out_file.csv` from Bash where `out_file.csv` is the name of the file being written to. There are also some text editors, such as notepad or Sublime Text, where you can open the results safely while the search is occurring. However, __do not use Excel to open a file that is being written to in Python__. This is a mistake I've made several times so you do not have to! 

In [28]:
#this will help us create the file we are going to use
import csv

# Create file and open connection
out_file = 'random_search_trials.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

# Write column names
headers = ['score', 'hyperparameters', 'iteration']
writer.writerow(headers)
of_connection.close()

Now we must slightly modify `random_search` and `grid_search` to write to this file every time. We do this by opening a connection, this time using the `"a"` option for append (the first time we used the `"w"` option for write) and writing a line with the desired information (which in this case is the cross validation score, the hyperparameters, and the number of the iteration). Then we close the connection until the function is called again.
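The write-then-append pattern can also be sketched with a context manager, which closes the file even if an iteration raises. This is an illustrative variant rather than the code used in this notebook; `write_header`, `append_row`, and the temporary file are hypothetical.

```python
import csv
import os
import tempfile

def write_header(path, headers):
    """Create (or overwrite) the file and write the column names."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(headers)

def append_row(path, row):
    """Append one result; the file is closed again as soon as the block exits."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)

# Demonstrate on a throwaway file with made-up scores
out_path = os.path.join(tempfile.mkdtemp(), "trials.csv")
write_header(out_path, ["score", "hyperparameters", "iteration"])
for i, score in enumerate([0.91, 0.94]):
    append_row(out_path, [score, {"num_leaves": 20 + i}, i])

# Read the file back to confirm one header row plus one row per iteration
with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Because the file is only open for the duration of each write, the results on disk are complete up to the last finished iteration if the search is interrupted.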

In [29]:
def random_search(param_grid, out_file, max_evals = MAX_EVALS):
    """Random search for hyperparameter optimization. 
       Writes result of search to csv file every search iteration."""
    
    
    import random  # needed for random.sample below

    # Dataframe for results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                                  index = list(range(MAX_EVALS)))
    for i in range(MAX_EVALS):
        
        # Choose random hyperparameters
        random_params = {k: random.sample(v, 1)[0] for k, v in param_grid.items()}
        random_params['subsample'] = 1.0 if random_params['boosting_type'] == 'goss' else random_params['subsample']

        # Evaluate randomly selected hyperparameters
        eval_results = objective(random_params, i)
        results.loc[i, :] = eval_results

        # open connection (append option) and write results
        of_connection = open(out_file, 'a')
        writer = csv.writer(of_connection)
        writer.writerow(eval_results)
        
        # make sure to close connection
        of_connection.close()
        
    # Sort with best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.reset_index(inplace = True)

    return results 

In [30]:
def grid_search(param_grid, out_file, max_evals = MAX_EVALS):
    """Grid search algorithm (with limit on max evals)
       Writes result of search to csv file every search iteration."""
    
    # Dataframe to store results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                              index = list(range(MAX_EVALS)))
    
    # https://codereview.stackexchange.com/questions/171173/list-all-possible-permutations-from-a-python-dictionary-of-lists
    keys, values = zip(*param_grid.items())
    
    i = 0
    
    # Iterate through every possible combination of hyperparameters
    for v in itertools.product(*values):
        # Select the hyperparameters
        parameters = dict(zip(keys, v))
        
        # Set the subsample ratio accounting for boosting type
        parameters['subsample'] = 1.0 if parameters['boosting_type'] == 'goss' else parameters['subsample']
        
        # Evaluate the hyperparameters
        eval_results = objective(parameters, i)
        
        results.loc[i, :] = eval_results
        
        i += 1
        
        # open connection (append option) and write results
        of_connection = open(out_file, 'a')
        writer = csv.writer(of_connection)
        writer.writerow(eval_results)
        
        # make sure to close connection
        of_connection.close()
        
        # Stop once MAX_EVALS combinations have been evaluated
        # (normally grid search would not limit iterations)
        if i >= MAX_EVALS:
            break
       
    # Sort with best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.reset_index(inplace = True)
    
    return results    

Now that we have the functions set up, let's run them and see which one gives us the best results.

In [31]:
MAX_EVALS = 10

import random

# Create file and open connection
out_file = 'grid_search_trials_1.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

# Write column names
headers = ['score', 'hyperparameters', 'iteration']
writer.writerow(headers)
of_connection.close()

grid_results = grid_search(param_grid, out_file)


# Create file and open connection
out_file = 'random_search_trials_1.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

# Write column names
headers = ['score', 'hyperparameters', 'iteration']
writer.writerow(headers)
of_connection.close()

random_results = random_search(param_grid, out_file)

print('All done here!')



All done here!


# Results on Limited Data

We can now examine the search iterations recorded above on the reduced dataset (the 20,000-row sample). Later, we can try the hyperparameters that worked best on the sample on the complete training set (95,662 rows, roughly five times larger) to see whether they still perform well. A much longer search, such as 1000 iterations, was not run in a kernel, although it might be able to finish (no guarantees) in the 12 hour time limit.

First we can find out which method returned the best results. 

In [32]:
random_results = pd.read_csv('random_search_trials_1.csv')
grid_results = pd.read_csv('grid_search_trials_1.csv')

When we save the results to a CSV file, the dictionaries are written out as strings, because a CSV file stores only text. Therefore we need to convert them back to dictionaries after reading in the results, using the `ast.literal_eval` function.
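
A short round trip shows the effect. The sketch writes a dictionary to an in-memory CSV, reads it back as a string, and restores it with `ast.literal_eval`; the example values are made up.

```python
import ast
import csv
import io

params = {"boosting_type": "gbdt", "num_leaves": 20, "is_unbalance": True}

# Write one row to an in-memory CSV file
buffer = io.StringIO()
csv.writer(buffer).writerow([0.99, params, 0])

# Read it back: every field is now a string
buffer.seek(0)
score_text, params_text, iteration_text = next(csv.reader(buffer))
print(type(params_text).__name__, params_text)

# literal_eval parses Python literals (dicts, lists, numbers, strings, booleans)
# without executing arbitrary code
restored = ast.literal_eval(params_text)
print(type(restored).__name__, restored == params)
```

This works as long as the stored values are plain Python literals.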

In [33]:
import ast

# Convert strings to dictionaries
grid_results['hyperparameters'] = grid_results['hyperparameters'].map(ast.literal_eval)
random_results['hyperparameters'] = random_results['hyperparameters'].map(ast.literal_eval)

Now let's make a function to parse the results from the hyperparameter searches. This returns a dataframe where each column is a hyperparameter and each row has one search result (so taking the dictionary of hyperparameters and mapping it into a row in a dataframe).

In [41]:
def evaluate(results, name):
    """Evaluate model on test data using hyperparameters in results
       Return dataframe of hyperparameters"""
    from sklearn.metrics import f1_score
    
    # Sort with best values on top
    results = results.sort_values('score', ascending = False).reset_index(drop = True)
    
    # Print out cross validation high score
    print('The highest cross validation score from {} was {:.5f} found on iteration {}.'.format(name, results.loc[0, 'score'], results.loc[0, 'iteration']))
    
    # Use best hyperparameters to create a model
    hyperparameters = results.loc[0, 'hyperparameters']
    model = lgb.LGBMClassifier(**hyperparameters)
    
    # Train and make predictions
    model.fit(X_train, y_train)
    preds = model.predict_proba(X_test)[:, 1]
    
    # class predictions for the f1 score
    preds_ = model.predict(X_test)
    
    print('ROC AUC from {} on test data = {:.5f}.'.format(name, roc_auc_score(y_test, preds)))
    print('F1 from {} on test data = {:.5f}.'.format(name, f1_score(y_test, preds_)))
    
    # Create dataframe of hyperparameters: one row per evaluated set, one column per hyperparameter
    # (DataFrame.append was removed in pandas 2.0, so build the frame in one step)
    hyp_df = pd.DataFrame(list(results['hyperparameters']))
        
    # Put the iteration and score in the hyperparameter dataframe
    hyp_df['iteration'] = results['iteration']
    hyp_df['score'] = results['score']
    
    return hyp_df

In [42]:
grid_hyp = evaluate(grid_results, name = 'grid search')

The highest cross validation score from grid search was 1.00000 found on iteration 0.
ROC AUC from grid search on test data = 0.99969.
F1 from grid search on test data = 0.73684.


In [43]:
random_hyp = evaluate(random_results, name = 'random search')

The highest cross validation score from random search was 1.00000 found on iteration 9.
ROC AUC from random search on test data = 0.99997.
F1 from random search on test data = 0.92308.
