# Optimization As A Service (OPTaaS)

In this notebook, we test Mind Foundry's OPTimization as a Service (OPTaaS). We'll use OPTaaS to optimize the hyperparameters of a gradient boosting machine for a supervised multiclass machine learning problem. The features were created using automated feature engineering in Featuretools on the Data Science for Good Costa Rican Household Poverty Prediction competition dataset.

## Roadmap

1. Load in Data
    * Dataset has already been formatted
2. Define objective function for optimization
    * Optimization function takes in hyperparameters and returns a score
    * 5-fold cross-validation Macro F1 score of a gradient boosting machine
    * Write a custom scorer for LightGBM
3. Define search space for OPTaaS
    * Set up hyperparameter distributions
4. Create a task
    * Goal is to maximize score
    * Add parameters and constraint(s)
5. Run optimization
    * Currently using 100 iterations
    * Option for resuming task with saved results
6. Inspect results

These results will be compared to Bayesian Optimization using Hyperopt and SMAC (coming soon).

In [1]:
import pandas as pd
import numpy as np

import lightgbm as lgb

# Evaluation of the model
from sklearn.model_selection import KFold, train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score, f1_score

# Data 

This set of features was created using automated feature engineering in Featuretools. The original dataset is part of the Costa Rican Household poverty prediction competition where the objective is to predict poverty at a household level given individual and household information. This is a supervised multiclass machine learning task.

In [2]:
features = pd.read_csv('data/ft_2000_important.csv')
features.shape


(10307, 2016)

In [4]:
submit_base = pd.read_csv('data/test.csv')[['Id', 'idhogar']]

# Separate out training and testing
train = features[features['Target'].notnull()].copy()
test = features[features['Target'].isnull()].copy()

train_labels = np.array(train.pop('Target'))
test_ids = list(test.pop('idhogar'))

train, test = train.align(test, join = 'inner', axis = 1)

# Deal with data type issues
for c in train:
    if train[c].dtype == 'object':
        train[c] = train[c].astype(np.float32)
        test[c] = test[c].astype(np.float32)

Let's make sure all the data is numeric (as it should be).

In [6]:
print('Train objects: ', train.columns[np.where(train.dtypes == 'object')])
print('Test objects: ', test.columns[np.where(test.dtypes == 'object')])

Train objects:  Index([], dtype='object')
Test objects:  Index([], dtype='object')


Everything looks good with the data. Next we'll import the required classes from the `mindfoundry` package. You'll need to use your own API key!

In [7]:
from mindfoundry.optaas.client.client import OPTaaSClient, Goal
from mindfoundry.optaas.client.parameter import (Distribution, CategoricalParameter,
                                                 IntParameter, ChoiceParameter, 
                                                 NumericParameter, FloatParameter)

from mindfoundry.optaas.client.constraint import Constraint

# Read API key (strip the trailing newline that text files usually end with)
with open('C:/Users/willk/OneDrive/Desktop/optaas_key.txt', 'r') as f:
    api_key = f.read().strip()
    
# Set up a client 
client = OPTaaSClient('https://optaas.mindfoundry.ai', api_key)

# Objective Function

The objective function takes in hyperparameters and returns a score to maximize (or minimize). For this problem, the metric is the Macro F1 score over the four classes. We first write a custom evaluation metric for LightGBM, and then an objective function that returns the 5-fold cross-validation Macro F1 score for a given set of hyperparameters.

In [9]:
def macro_f1_score(labels, predictions):
    """Custom Macro F1 Score for Light GBM"""
    
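    # Predictions arrive flattened and grouped by class, so reshape them to
    # (n_classes, n_samples) and take the highest-scoring class for each sample
    predictions = predictions.reshape(len(np.unique(labels)), -1).argmax(axis = 0)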
    
    metric_value = f1_score(labels, predictions, average = 'macro')
    
    # Return is name, value, is_higher_better
    return 'macro_f1', metric_value, True

def objective(num_leaves, learning_rate, boosting_type,
                      subsample, subsample_for_bin, min_child_samples,
                      reg_alpha, reg_lambda, colsample_bytree, nfolds=5):
    """Return validation score from hyperparameters for LightGBM"""

    # Using stratified kfold cross validation
    strkfold = StratifiedKFold(n_splits = nfolds, shuffle = True)
    
    # Convert to arrays for indexing
    features = np.array(train)
    labels = np.array(train_labels).reshape((-1 ))
    
    valid_scores = []
    best_estimators = []
    
    # Create model with hyperparameters
    model = lgb.LGBMClassifier(num_leaves=num_leaves, learning_rate=learning_rate,
                               boosting_type=boosting_type, subsample=subsample,
                               subsample_for_bin=subsample_for_bin, 
                               min_child_samples=min_child_samples,
                               reg_alpha=reg_alpha, reg_lambda=reg_lambda, 
                               colsample_bytree=colsample_bytree,
                               class_weight = 'balanced',
                               n_jobs=-1, n_estimators=10000)
    
    # Iterate through the folds
    for i, (train_indices, valid_indices) in enumerate(strkfold.split(features, labels)):
        
        # Training and validation data
        X_train = features[train_indices]
        X_valid = features[valid_indices]
        y_train = labels[train_indices]
        y_valid = labels[valid_indices]
        
        # Train with early stopping
        model.fit(X_train, y_train, early_stopping_rounds = 100, 
                  eval_metric = macro_f1_score,
                  eval_set = [(X_train, y_train), (X_valid, y_valid)],
                  eval_names = ['train', 'valid'],
                  verbose = -1)
        
        # Record the validation fold score
        valid_scores.append(model.best_score_['valid']['macro_f1'])
        best_estimators.append(model.best_iteration_)
        
    best_estimators = np.array(best_estimators)
    valid_scores = np.array(valid_scores)
    
#     return valid_scores, best_estimators

    # Write to the csv file ('a' means append)
#     of_connection = open(OUT_FILE, 'a')
#     writer = csv.writer(of_connection)
#     writer.writerow([loss, hyperparameters, ITERATION, run_time, best_score, best_std])
#     of_connection.close()

    # Dictionary with information for evaluation
#     return {'loss': loss, 'hyperparameters': hyperparameters, 'iteration': ITERATION,
#             'train_time': run_time, 'status': STATUS_OK}

    return valid_scores.mean()
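
To make the metric concrete, here is a minimal pure-Python sketch of what "macro" averaging means: compute F1 separately for each class, then take the unweighted mean. The helper name `macro_f1_by_hand` and the toy labels are illustrative assumptions, not part of the original notebook; the objective above relies on scikit-learn's `f1_score(..., average='macro')`.

```python
def macro_f1_by_hand(labels, predictions):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    classes = sorted(set(labels) | set(predictions))
    f1_values = []
    for cls in classes:
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        # F1 = 2TP / (2TP + FP + FN); the denominator is never zero because
        # every class here appears in the labels or in the predictions
        f1_values.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_values) / len(f1_values)

# Toy data with four classes (made up for illustration)
toy_labels      = [1, 1, 2, 2, 3, 3, 4, 4]
toy_predictions = [1, 2, 2, 2, 3, 4, 4, 4]

toy_score = macro_f1_by_hand(toy_labels, toy_predictions)
print(round(toy_score, 4))  # 0.7333
```

Because every class counts equally, a rare class that is predicted poorly pulls the score down as much as a common one, which is one reason to pair this metric with `class_weight = 'balanced'`.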

# Configuration

Next we define the hyperparameter distributions. I haven't figured out everything about the library and service, but I tried to recreate the hyperparameter distributions I've used for both random search and Bayesian optimization in Hyperopt. The `learning_rate` is log-uniform, and the `subsample` depends on the `boosting_type`, as we'll see.

In [10]:
boosting_type = CategoricalParameter('boosting_type', 
                                     values = ['gbdt', 'dart', 'goss'], 
                                     id='boosting_type')

num_leaves = IntParameter('num_leaves', minimum=3, 
                          maximum=50, id='num_leaves')

learning_rate = FloatParameter('learning_rate', minimum=0.025, 
                               maximum=0.25, id='learning_rate',
                               distribution=Distribution.LOGUNIFORM)

subsample = FloatParameter('subsample', minimum=0.5, 
                           maximum=1.0, id='subsample')

subsample_for_bin = IntParameter('subsample_for_bin', minimum=2000, 
                                 maximum=100000, id='subsample_for_bin')

min_child_samples = IntParameter('min_child_samples', minimum=5, 
                                 maximum=80, id='min_child_samples')

reg_alpha = FloatParameter('reg_alpha', minimum=0.0, 
                           maximum=1.0, id='reg_alpha')

reg_lambda = FloatParameter('reg_lambda', minimum=0.0, 
                            maximum=1.0, id='reg_lambda')

colsample_bytree = FloatParameter('colsample_bytree', minimum=0.5, 
                                  maximum=1.0, id='colsample_bytree')
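
As a rough illustration of what a log-uniform distribution implies for `learning_rate`, the sketch below draws values whose logarithm is uniform between the two bounds. It uses only the standard library; `sample_log_uniform` is a hypothetical helper and says nothing about how OPTaaS samples internally.

```python
import math
import random

def sample_log_uniform(minimum, maximum, rng):
    """Draw a value whose logarithm is uniform on [log(minimum), log(maximum)]."""
    return math.exp(rng.uniform(math.log(minimum), math.log(maximum)))

rng = random.Random(42)
samples = [sample_log_uniform(0.025, 0.25, rng) for _ in range(10_000)]

# The geometric mean of the bounds is the median of a log-uniform distribution,
# so about half of the draws should fall below it
geometric_mean = math.sqrt(0.025 * 0.25)
share_below = sum(s < geometric_mean for s in samples) / len(samples)
print(f"share of samples below {geometric_mean:.4f}: {share_below:.3f}")
```

Compared with a plain uniform distribution on the same range, this places far more of the search effort on small learning rates.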


## Create a Task

We can use these hyperparameters to create a task for OPTaaS. The constraint ensures that `subsample = 1` when `boosting_type = "goss"`, which is necessary because "goss" cannot use subsampling. Everything else is straightforward: the goal is to maximize the Macro F1 score.

In [12]:
task = client.create_task(
        title = 'Light GBM Opt',
        goal = Goal.max,
        parameters = [num_leaves, learning_rate, boosting_type,
                      subsample, subsample_for_bin, min_child_samples,
                      reg_alpha, reg_lambda, colsample_bytree],
         constraints = [ Constraint(when=boosting_type=='goss', 
                                    then=subsample==1)]
)
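
The constraint is an implication: it only restricts `subsample` when `boosting_type` is `'goss'`. The following sketch expresses the same rule on plain dictionaries, which can be handy for checking configurations by hand or for enforcing the rule in a search that has no constraint support. Both helper names are hypothetical and are not part of the OPTaaS client.

```python
def satisfies_goss_constraint(values):
    """Return True unless boosting_type is 'goss' with subsample other than 1."""
    if values['boosting_type'] == 'goss':
        return values['subsample'] == 1
    return True

def enforce_goss_constraint(values):
    """Return a copy of `values` with subsample forced to 1.0 for 'goss'."""
    fixed = dict(values)
    if fixed['boosting_type'] == 'goss':
        fixed['subsample'] = 1.0
    return fixed

candidate = {'boosting_type': 'goss', 'subsample': 0.8}
print(satisfies_goss_constraint(candidate))                           # False
print(satisfies_goss_constraint(enforce_goss_constraint(candidate)))  # True
print(satisfies_goss_constraint({'boosting_type': 'dart', 'subsample': 0.8}))  # True
```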

# Run Optimization

The last step is to run the optimization. We'll start with 100 iterations. The `%%capture` magic hides the LightGBM output (which cannot be suppressed), but it also hides the optimization progress reported by OPTaaS.

In [13]:
%%capture 
best_result, best_configuration = task.run(objective, max_iterations = 100)
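
The internals of `task.run` are not shown in this notebook. As a rough mental model only, an optimization loop proposes a configuration, evaluates the objective, and remembers the best score. The sketch below uses plain random search as a stand-in for the proposal step and a toy objective with a known maximum; it is not the algorithm OPTaaS uses, and `toy_objective` and `random_search` are hypothetical names.

```python
import random

def toy_objective(learning_rate, num_leaves):
    """Stand-in objective with its maximum (0.0) at learning_rate=0.1, num_leaves=30."""
    return -((learning_rate - 0.1) ** 2) - ((num_leaves - 30) / 100) ** 2

def random_search(objective, max_iterations, seed=0):
    """Propose, evaluate, and keep the best configuration (maximization)."""
    rng = random.Random(seed)
    best_score, best_values = float('-inf'), None
    for _ in range(max_iterations):
        # Proposal step: sample uniformly inside the search space
        values = {'learning_rate': rng.uniform(0.025, 0.25),
                  'num_leaves': rng.randint(3, 50)}
        score = objective(**values)
        if score > best_score:
            best_score, best_values = score, values
    return best_score, best_values

toy_best_score, toy_best_values = random_search(toy_objective, max_iterations=100)
print(toy_best_score, toy_best_values)
```

A model-based optimizer differs mainly in the proposal step, where it uses the scores seen so far to decide which configuration to try next.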

## Show Results

After 100 iterations, how did the optimization do? `task.run` returned both `best_configuration` and `best_result`, so we can inspect them directly.

In [14]:
best_configuration

{ 'id': '132dc8d6-bab1-4695-b058-6fcbdbf21684',
  'type': 'exploitation',
  'values': { 'boosting_type': 'dart',
              'colsample_bytree': 0.9843467236959204,
              'learning_rate': 0.11598629586769524,
              'min_child_samples': 44,
              'num_leaves': 49,
              'reg_alpha': 0.35397370408131534,
              'reg_lambda': 0.5904910774606467,
              'subsample': 0.6299872254632797,
              'subsample_for_bin': 60611}}

In [15]:
best_result

{ 'configuration': '132dc8d6-bab1-4695-b058-6fcbdbf21684',
  'id': 3250,
  'score': 0.4629755551376399,
  'user_defined_data': None}

We can now use these results to build a final model and make predictions on the test data. It would also be a good idea to examine the full set of results to see where the best hyperparameter values concentrated.

We'll save all the results for inspection. These can also be used to resume optimization from iteration 100 rather than starting a new task.
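
As a sketch of the first step toward a final model, the tuned values can be merged with the fixed settings used inside `objective`. The helper `final_model_params` is hypothetical, and `n_estimators=100` is only a placeholder: the objective relies on early stopping and does not return the best iteration, so the real value would have to be recorded separately.

```python
def final_model_params(best_values, n_estimators):
    """Merge tuned hyperparameters with the fixed settings used in `objective`."""
    params = dict(best_values)  # copy so the original dict is left untouched
    params.update(class_weight='balanced', n_jobs=-1, n_estimators=n_estimators)
    return params

# Values copied from the best configuration printed above
best_values = {'boosting_type': 'dart',
               'colsample_bytree': 0.9843467236959204,
               'learning_rate': 0.11598629586769524,
               'min_child_samples': 44,
               'num_leaves': 49,
               'reg_alpha': 0.35397370408131534,
               'reg_lambda': 0.5904910774606467,
               'subsample': 0.6299872254632797,
               'subsample_for_bin': 60611}

# n_estimators is a placeholder assumption, not a tuned value
params = final_model_params(best_values, n_estimators=100)

# In the notebook, these could then be passed to the model, for example:
# model = lgb.LGBMClassifier(**params)
# model.fit(train, train_labels)
# predictions = model.predict(test)
print(params)
```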

# Get Results into a Dataframe

Of course we need our results in a dataframe, the data structure of choice for data scientists.

In [62]:
c = task.get_configurations()
with open('configurations.txt', 'w') as f:
    f.write(str(c))

In [63]:
r = task.get_results()
with open('results.txt', 'w') as f:
    f.write(str(r))
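
Text written with `str(...)` is awkward to parse back later. A possible alternative is JSON, sketched below under the assumption that each configuration exposes `id` and `values` and each result exposes `configuration` and `score` (the attributes accessed elsewhere in this notebook). The helper names and the `SimpleNamespace` stand-ins are illustrative, and the demo values are made up.

```python
import json
from types import SimpleNamespace

def configuration_to_record(configuration):
    """Flatten one configuration into a JSON-friendly dict."""
    return {'config_id': configuration.id, **configuration.values}

def result_to_record(result):
    """Keep the configuration id and score of one result."""
    return {'config_id': result.configuration, 'score': result.score}

# Stand-ins for the objects returned by the task (made-up values)
demo_configurations = [SimpleNamespace(id='demo-1',
                                       values={'num_leaves': 26, 'boosting_type': 'gbdt'})]
demo_results = [SimpleNamespace(configuration='demo-1', score=0.5)]

payload = json.dumps({
    'configurations': [configuration_to_record(x) for x in demo_configurations],
    'results': [result_to_record(x) for x in demo_results],
}, indent=2)

# Round trip: the saved text loads back into plain Python structures
restored = json.loads(payload)
print(restored['results'][0]['score'])  # 0.5
```

In the notebook, the same payload could be written with `json.dump` to a file and later loaded directly into a dataframe.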

In [64]:
c[0]

{ 'id': '8b0bc65b-3e9c-4b93-a9b2-ef108ed152e8',
  'type': 'default',
  'values': { 'boosting_type': 'gbdt',
              'colsample_bytree': 0.75,
              'learning_rate': 0.1375,
              'min_child_samples': 42,
              'num_leaves': 26,
              'reg_alpha': 0.5,
              'reg_lambda': 0.5,
              'subsample': 0.75,
              'subsample_for_bin': 51000}}

In [65]:
r[0]

{ 'configuration': '8b0bc65b-3e9c-4b93-a9b2-ef108ed152e8',
  'id': 3208,
  'score': 0.44335890144866374,
  'user_defined_data': None}

In [66]:
config = pd.DataFrame(columns = [x for x in c[0].values.keys()])

In [67]:
c[0].values

{'boosting_type': 'gbdt',
 'colsample_bytree': 0.75,
 'learning_rate': 0.1375,
 'min_child_samples': 42,
 'num_leaves': 26,
 'reg_alpha': 0.5,
 'reg_lambda': 0.5,
 'subsample': 0.75,
 'subsample_for_bin': 51000}

In [68]:
# Build one row of hyperparameter values per configuration.
# DataFrame.append was removed in pandas 2.0, so create the frame in one step.
ids = [configuration.id for configuration in c]
config = pd.DataFrame([configuration.values for configuration in c])

config['config_id'] = ids

In [70]:
len(c)

101

In [71]:
len(r)

100

In [72]:
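# Map each configuration id to its score (assumes one result per configuration)
score_by_config = {result.configuration: result.score for result in r}

scores = []

for id_ in config['config_id']:
    if id_ in score_by_config:
        scores.append(score_by_config[id_])
    else:
        # Print the id of any configuration that has no recorded result
        print(id_)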

dabd3954-3699-47f7-b788-5040d6c543e4
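
There are 101 configurations but only 100 results, so one configuration has no score. The same matching can be sketched as a pandas merge, where `indicator=True` adds a `_merge` column that flags rows without a partner. The two small frames below are made-up stand-ins, not the notebook's data.

```python
import pandas as pd

# Made-up stand-ins for the configuration and result tables
demo_config = pd.DataFrame({'config_id': ['a', 'b', 'c'],
                            'num_leaves': [26, 41, 34]})
demo_scores = pd.DataFrame({'config_id': ['a', 'b'],
                            'score': [0.44, 0.43]})

# A left join keeps every configuration; '_merge' records where each row came from
merged = demo_config.merge(demo_scores, on='config_id', how='left', indicator=True)

# Configurations with no matching result
unscored = merged.loc[merged['_merge'] == 'left_only', 'config_id'].tolist()

# Configurations that do have a score, without the helper column
scored = merged[merged['_merge'] == 'both'].drop(columns='_merge')

print(unscored)  # ['c']
print(scored)
```

This avoids hard-coding the id of the unscored configuration when filtering the dataframe.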


In [76]:
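# Drop the configuration that has no recorded result (the id printed above);
# .copy() avoids a SettingWithCopyWarning when the score column is added
config = config[config['config_id'] != 'dabd3954-3699-47f7-b788-5040d6c543e4'].copy()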
config['score'] = scores
config.head()

index,num_leaves,learning_rate,boosting_type,subsample,subsample_for_bin,min_child_samples,reg_alpha,reg_lambda,colsample_bytree,config_id,score
0,26,0.1375,gbdt,0.75,51000,42,0.5,0.5,0.75,8b0bc65b-3e9c-4b93-a9b2-ef108ed152e8,0.443359
1,41,0.066103,gbdt,0.9591,34856,11,0.671155,0.990424,0.826631,e0a3e3b8-4dd6-426c-afed-65b3991bc9bb,0.435779
2,34,0.039836,gbdt,0.625744,71666,49,0.451237,0.378928,0.58892,06f46a30-c8e7-4467-bf5f-d4fcde9e2b0c,0.440514
3,44,0.075335,dart,0.987832,93713,72,0.973434,0.92715,0.609935,133ada9b-896a-4dbb-ada9-5c0777ee9cf4,0.440388
4,41,0.132886,dart,0.847583,45135,47,0.713148,0.197536,0.708709,c5d6b6e1-52be-432a-8694-c577e71709db,0.442698


In [77]:
config.sort_values('score', ascending = False, inplace = True)
config.head()

index,num_leaves,learning_rate,boosting_type,subsample,subsample_for_bin,min_child_samples,reg_alpha,reg_lambda,colsample_bytree,config_id,score
42,49,0.115986,dart,0.629987,60611,44,0.353974,0.590491,0.984347,132dc8d6-bab1-4695-b058-6fcbdbf21684,0.462976
32,49,0.115619,dart,0.637333,60173,44,0.354967,0.285429,0.983128,a4851af4-ed57-4a3c-9828-9d9fc887655a,0.455598
33,49,0.115797,dart,0.59513,60378,44,0.353974,0.610409,0.984347,067e8a88-fbfc-4d57-b9c8-1cdcecc76b48,0.45396
67,49,0.114474,dart,0.629987,2557,44,0.431246,0.608502,0.984347,22d3235a-d529-471f-ba51-231bec471d9d,0.451359
11,49,0.115619,dart,0.657001,60173,45,0.354967,0.285429,0.983128,33c3098d-5201-4e27-964d-9e3a3a47a96d,0.45122
