# Intro

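Grid search and random search do not use the results of earlier trials to choose the next set of hyperparameters, which makes them inefficient: they can keep spending evaluations in regions that have already scored poorly. Bayesian hyperparameter optimization does use those earlier results.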

When tuning hyperparameters, we generally want to minimize some objective function (score), such as root mean squared error. In Bayesian optimization, we:

1. Create a surrogate probability model of the objective function that gives the probability of a score given the hyperparameters, P(score|hyperparameters).
2. Find the hyperparameters that perform best on the surrogate model.
3. Evaluate the real objective function with those hyperparameters.
4. Update the surrogate model to account for the new result from step 3.
5. Repeat steps 2-4 until the maximum number of iterations is reached.
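The five steps above can be sketched as a loop. The following is a minimal, standard-library-only illustration under stated assumptions: the objective is a toy one-dimensional function standing in for an expensive score, and the surrogate is a simple kernel-weighted average of past scores rather than a real probability model such as TPE. All names (`toy_objective`, `surrogate`, `optimize`) are hypothetical and are not part of hyperopt.

```python
import math
import random


def toy_objective(x):
    """Toy stand-in for an expensive score such as validation RMSE."""
    return (x - 2.0) ** 2


def surrogate(x, history, bandwidth=1.0):
    """Cheap estimate of the score at x: kernel-weighted average of observed scores."""
    weights = [math.exp(-((x - xi) / bandwidth) ** 2) for xi, _ in history]
    total = sum(weights)
    if total < 1e-12:
        # No observation is close to x, so fall back to the mean observed score
        return sum(score for _, score in history) / len(history)
    return sum(w * score for w, (_, score) in zip(weights, history)) / total


def optimize(max_iterations=30, n_initial=3, seed=0):
    rng = random.Random(seed)
    # Step 1: build the surrogate's data from a few random evaluations
    history = []
    for _ in range(n_initial):
        x = rng.uniform(-5.0, 5.0)
        history.append((x, toy_objective(x)))
    # Step 5: repeat steps 2-4 until the iteration budget is used up
    for _ in range(max_iterations):
        # Step 2: pick the candidate the surrogate predicts is best
        candidates = [rng.uniform(-5.0, 5.0) for _ in range(200)]
        x_next = min(candidates, key=lambda c: surrogate(c, history))
        # Step 3: evaluate the real objective at that candidate
        score = toy_objective(x_next)
        # Step 4: update the surrogate with the new observation
        history.append((x_next, score))
    return history


history = optimize()
best_x, best_score = min(history, key=lambda pair: pair[1])
print(f"best x = {best_x:.3f}, score = {best_score:.4f}")
```

This surrogate only exploits what it has already seen, so it can stall near an early good point. Practical methods also weigh how uncertain the surrogate is in unexplored regions.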

In [8]:
# Import packages
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor
from timeit import default_timer as timer
from hyperopt import hp, tpe, fmin, Trials, STATUS_OK
from collections import OrderedDict

# Simulate a regression dataset (Friedman #1) to fit a random forest on;
# its hyperparameters are tuned with hyperopt below

reg_prob = datasets.make_friedman1(n_samples=110, n_features=10, noise=0.0, random_state=None)
# First 100 rows for training, remaining 10 rows held out
x_train = reg_prob[0][0:100]
y_train = reg_prob[1][0:100]
x_test = reg_prob[0][100:110]
y_test = reg_prob[1][100:110]


# Objective Function

We first need to define our objective function and the space over which we will evaluate it. The objective function is the expensive part, because every call trains and scores a random forest, so we want to call it as few times as possible. Only the first few sets of hyperparameters are chosen without guidance; after that, the surrogate function proposes the most promising set to evaluate next.

In [13]:

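# Define the region over which to evaluate the objective function

#HPO_PARAMS = {'max_evals':100
#             }  

hyperparameter_grid = {
    'n_estimators': [100, 200, 300, 400, 500, 600],
    'max_depth': [2, 5, 10, 15, 20, 25, 30, 35, 40],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8]
}

# hyperopt needs a search space made of hp.* expressions;
# hp.choice samples one value from each list above
SPACE = {name: hp.choice(name, values) for name, values in hyperparameter_grid.items()}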


print(hyperparameter_grid)

# Define the random forest objective function
def objective(hyperparameters):
    # Start the timer so we can record how long this evaluation takes
    start = timer()

    # Unpack the sampled hyperparameters as keyword arguments
    rf = RandomForestRegressor(**hyperparameters)

    # Training
    rf.fit(x_train, y_train)

    # Making predictions and evaluating on the held-out rows
    predictions = rf.predict(x_test)
    rmse = np.sqrt(np.mean(np.square(predictions - y_test)))

    # Calculate time to evaluate
    time_elapsed = timer() - start

    # hyperopt minimizes the value stored under the 'loss' key
    results = {'loss': rmse, 'rmse': rmse, 'status': STATUS_OK,
               'hyperparameters': hyperparameters, 'time': time_elapsed}

    print(results)
    # Return dictionary
    return results
    

    
#print(objective(SPACE))


# We want to return the 
#y = objective(hyperparameters)
#miny = min(y)

{'n_estimators': [100, 200, 300, 400, 500, 600], 'max_depth': [2, 5, 10, 15, 20, 25, 30, 35, 40], 'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8]}
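
For a sense of scale, the grid printed above contains 6 x 9 x 8 = 432 combinations. The sketch below uses only the standard library and copies the grid values from the cell above. It enumerates the combinations the way grid search would and draws a few at random the way random search would; neither looks at any score when picking the next combination. The names `all_combinations` and `random_picks` are illustrative.

```python
import itertools
import random

# Same values as hyperparameter_grid in the cell above
hyperparameter_grid = {
    'n_estimators': [100, 200, 300, 400, 500, 600],
    'max_depth': [2, 5, 10, 15, 20, 25, 30, 35, 40],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8],
}

names = list(hyperparameter_grid)

# Grid search: every combination, visited in a fixed order
all_combinations = [
    dict(zip(names, values))
    for values in itertools.product(*hyperparameter_grid.values())
]
print(len(all_combinations))

# Random search: independent draws, seeded here only for repeatability
rng = random.Random(0)
random_picks = [
    {name: rng.choice(options) for name, options in hyperparameter_grid.items()}
    for _ in range(3)
]
print(random_picks)
```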


# Surrogate Function

We'll use the Tree-structured Parzen Estimator (TPE) model in the hyperopt package to build our surrogate function. TPE models P(hyperparameters|score) and uses Bayes' rule to relate it to the probability model we want:

P(score|hyperparameters) = P(hyperparameters|score) * P(score) / P(hyperparameters)
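
To make the idea concrete: TPE describes P(hyperparameters|score) with two densities, l(x) fitted to the hyperparameters of the best-scoring trials and g(x) fitted to the rest, and it proposes the candidate with the highest ratio l(x)/g(x). The sketch below is a simplified one-dimensional illustration and not hyperopt's implementation. The fixed bandwidth, the `gamma` split fraction of 0.25, and the names `kde` and `suggest` are assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, bandwidth):
    z = (x - mean) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2.0 * math.pi))


def kde(x, centers, bandwidth=1.0):
    """Parzen (kernel density) estimate: average of Gaussians centered on past samples."""
    return sum(gaussian_pdf(x, c, bandwidth) for c in centers) / len(centers)


def suggest(observations, candidates, gamma=0.25):
    """Return the candidate with the highest l(x) / g(x) ratio.

    Needs at least two observations so that both groups are non-empty.
    """
    ranked = sorted(observations, key=lambda pair: pair[1])  # lower score is better
    n_good = max(1, int(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]  # samples behind l(x)
    bad = [x for x, _ in ranked[n_good:]]   # samples behind g(x)
    return max(candidates, key=lambda x: kde(x, good) / kde(x, bad))


# Toy history: scores of (x - 2)^2 observed at the integers from -4 to 4
observations = [(float(x), (x - 2.0) ** 2) for x in range(-4, 5)]
candidates = [-4.0 + 0.5 * i for i in range(17)]  # -4.0, -3.5, ..., 4.0
x_next = suggest(observations, candidates)
print(x_next)
```

With this toy history the two best observations are at x = 2 and x = 1, and the suggested candidate falls between them, away from the poorly scoring points.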

In [14]:
#trials = Trials()
#_ = fmin(objective, SPACE, trials=trials, algo=tpe.suggest, **HPO_PARAMS)

#best_auc = -1.0 * trials.best_trial['result']['loss']
#best_params = trials.best_trial['misc']['vals']

# log metrics
#print('Best Validation AUC: {}'.format(best_auc))
#print('Best Params: {}'.format(best_params))


# New trials object
trials = Trials()

# Run 2000 evals with the tpe algorithm
best = fmin(fn=objective, space=SPACE, algo=tpe.suggest, trials=trials, 
                max_evals=2000)

#results = trials.results
#results[:2]

#hypopt_trials = Trials()
 
#best_params = fmin(objective, SPACE, algo=tpe.suggest, 
#max_evals=1000, trials= hypopt_trials)
 
#print(best_params)
#print(hypopt_trials.best_trial['result']['loss'])


In [None]:
# Build a random forest with default settings on a simulated regression problem and test it
rf = RandomForestRegressor()
reg_prob = datasets.make_regression(n_samples=110, n_features=5, n_informative=5)
# First 100 rows for training, remaining 10 rows held out
x_train = reg_prob[0][0:100]
y_train = reg_prob[1][0:100]
x_test = reg_prob[0][100:110]
y_test = reg_prob[1][100:110]
rf.fit(x_train, y_train)
# score() returns R^2; compare 10 training rows against the 10 held-out rows
print(rf.score(x_train[0:10], y_train[0:10]))
print(rf.score(x_test, y_test))
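
Note that `rf.score` reports R^2 (higher is better), while the objective function above reports RMSE (lower is better). As a small worked example of how the two differ, the standard-library sketch below computes both on three hand-picked values. The helper names `rmse` and `r_squared` are illustrative and are not scikit-learn functions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(rmse(y_true, y_pred))       # sqrt(5/3), about 1.291
print(r_squared(y_true, y_pred))  # 1 - 5/8 = 0.375
```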