## Appendix: Hyperparameter Optimization

#### Unlike real life, all that matters in a Kaggle competition is the metric under consideration. Even the smallest improvement in the metric counts. One way to squeeze the last bit of performance out of a model is hyperparameter optimization.

I was getting pretty good results with XGBoost. I now wanted to find the optimal hyperparameters. To do this, I decided to use [Bayesian hyperparameter optimization methods](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Bayesian_optimization). After looking around for a bit, I couldn't find a readily available R package for hyperparameter optimization. So I decided to use a Python package called [hyperopt](https://github.com/hyperopt/hyperopt). 

I structured the optimization as a Python wrapper around my R code. The wrapper uses [rpy2](http://rpy2.bitbucket.org/) to execute the R code.

### Python Wrapper

#### [Full source](hyperopt/xgbDirect-hyperopt.py)
#### The code below cannot be executed as is. It is for illustrative purposes only.

#### Let's import some modules first

```python
import rpy2.robjects as ro
import pandas as pd
from rpy2.robjects import pandas2ri
from hyperopt import fmin, tpe, hp
```

#### The function below is the objective function for the optimizer. Bayesian optimization treats the model as a black box: it suggests new hyperparameters to explore based on the results obtained from the previous trials.

```python
def objective (params):
    eta, max_depth, subsample, col_sample_bytree, min_child_weight, gamma = params

    # Pass the parameters to R as a named list
    rParams = ro.ListVector ({
        'eta': ro.FloatVector ([eta]),
        'max_depth': ro.IntVector ([int (max_depth)]),
        'subsample': ro.FloatVector ([subsample]),
        'col_sample_bytree': ro.FloatVector ([col_sample_bytree]),
        'min_child_weight': ro.FloatVector ([min_child_weight]),
        'gamma': ro.FloatVector ([gamma])})

    ro.globalenv['params'] = rParams
    # fitFunc, fitFormula and trainData are R-side names (see the R program)
    rmse = ro.r('fitFunc (fitFormula, trainData, params)')

    # Additional code to log the trial is omitted for simplicity.
    # All that portion does is log the trial to a file.

    # fmin minimizes the value returned here.
    # Assumes fitFunc returns a single numeric value.
    return float (rmse[0])
```
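The black-box contract is easier to see without R in the way. The sketch below is neither hyperopt nor TPE: it uses plain random search as a stand-in, and `toy_objective`, `suggest` and `random_search` are made-up names, with a simple quadratic in place of an XGBoost fit. It only illustrates that the optimizer sees a tuple of parameters going in and a single loss coming out.

```python
import random

def toy_objective(params):
    # Hypothetical stand-in for the R call: tuple in, scalar loss out.
    eta, max_depth = params
    return (eta - 0.1) ** 2 + (max_depth - 5) ** 2

def suggest(rng):
    # Propose one candidate: eta in [0, 1), max_depth in 2..7.
    return rng.random(), rng.randint(2, 7)

def random_search(objective, suggest, max_evals, seed=0):
    # Plain random search, used here only as a stand-in for tpe.suggest.
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = suggest(rng)
        loss = objective(params)  # the only feedback the optimizer gets
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search(toy_objective, suggest, max_evals=200)
print(best_params, best_loss)
```

A Bayesian optimizer keeps the same loop but replaces the blind `suggest` step with one that uses the losses seen so far to decide where to look next.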

#### Define the hyperparameter search space. The optimizer needs to know the bounds of the space to explore and the type of each parameter.

```python
space = (
    hp.loguniform ('eta', -6, 0),
    hp.randint ('max_depth', 6) + 2,
    hp.uniform ('subsample', 0.5, 1),
    hp.uniform ('col_sample_bytree', 0.5, 1),
    hp.uniform ('min_child_weight', 0, 3), 
    hp.uniform ('gamma', 0, 3)
    )
    
pandas2ri.activate()
```
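To get a feel for the ranges those expressions cover, the sketch below mimics them with the standard library only. It does not call hyperopt; `sample_space` is a hypothetical helper. It assumes that `hp.loguniform (label, low, high)` draws `exp(uniform(low, high))` and that `hp.randint (label, n)` draws an integer in `[0, n)`, which is how the hyperopt documentation describes them.

```python
import math
import random

def sample_space(rng):
    # Hypothetical helper that imitates the hyperopt space above.
    return (
        math.exp(rng.uniform(-6, 0)),  # eta: between exp(-6) ~ 0.0025 and 1
        rng.randrange(6) + 2,          # max_depth: integer in 2..7
        rng.uniform(0.5, 1),           # subsample
        rng.uniform(0.5, 1),           # col_sample_bytree
        rng.uniform(0, 3),             # min_child_weight
        rng.uniform(0, 3),             # gamma
    )

rng = random.Random(42)
samples = [sample_space(rng) for _ in range(1000)]
print("eta range:", min(s[0] for s in samples), max(s[0] for s in samples))
print("max_depth values:", sorted({s[1] for s in samples}))
```

Because `eta` is sampled on a log scale, about half of the draws fall below `exp(-3)`, roughly 0.05, so small learning rates get as much attention as large ones.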

#### Source the R file with the actual XGBoost code, then run the optimizer

```python
ro.r('source ("xgbDirect-hyperopt.R")')
       
best = fmin (objective,
             space = space,
             algo = tpe.suggest,
             max_evals= 100)
```
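One detail worth knowing when reading `best`: `fmin` returns the raw value drawn for each labelled `hp` node, so `best['max_depth']` is the `hp.randint` draw in `0..5`, before the `+ 2` is applied. hyperopt's `space_eval (space, best)` maps the result back through the space. The helper below is a hand-written equivalent for this particular space, as a sketch only; `decode_best` is a hypothetical name, and the example dictionary holds made-up values, not results from the competition.

```python
def decode_best(best):
    # Copy so the dictionary returned by fmin is left untouched.
    decoded = dict(best)
    # The space defines max_depth as hp.randint ('max_depth', 6) + 2,
    # so shift the raw draw (0..5) to the actual tree depth (2..7).
    decoded["max_depth"] = int(best["max_depth"]) + 2
    return decoded

# Made-up values, shaped like the dictionary fmin returns for this space.
example_best = {
    "eta": 0.05,
    "max_depth": 3,
    "subsample": 0.8,
    "col_sample_bytree": 0.7,
    "min_child_weight": 1.5,
    "gamma": 0.2,
}

decoded = decode_best(example_best)
print(decoded["max_depth"])
```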

### R Program

#### [Full source](hyperopt/xgbDirect-hyperopt.R)