# Creating an XGBoost model

This follows our earlier approach of training 55 separate models, one for each of the 55 wavelengths.

Apart from switching to an XGBoost model, the main difference is that decision trees do **not** require normalized data: a split depends only on the ordering of a feature's values, not on their scale. Hence we can use the data as it is.
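As a small check of the scale-invariance claim, the sketch below fits the same tree on raw and on rescaled features and compares the predictions. It is an illustration under assumptions, not our training code: scikit-learn's `DecisionTreeRegressor` stands in for XGBoost, and the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (placeholder for the real features).
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(50, 5))

# Rescale every feature by a positive constant.
# A power of two is used so the rescaling is exact in floating point.
scale = 1024.0

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X * scale, y)

# Each tree is queried in the feature space it was trained on.
pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(X_new * scale)

same_predictions = np.allclose(pred_raw, pred_scaled)
print("Predictions unchanged by rescaling:", same_predictions)
```

Both trees partition the samples identically, so rescaling the inputs should not change what the model predicts. A linear model with regularization or a neural network would not share this property.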

Since our data is pre-cleaned, we also do not need to wrap the model in a preprocessing pipeline.

It also helps that decision trees (and hence XGBoost) work best when the features are either a mix of categorical and numerical features or purely numerical ones. Ours are purely numerical.

It is also a plus when the number of features is far smaller than the number of training samples. We can also randomly drop features later, during training.
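One way to drop features randomly during training is XGBoost's column subsampling, controlled by parameters such as `colsample_bytree`. The sketch below assumes that this is the mechanism we would use. The parameter values are placeholders, and the helper `sample_columns` is a hypothetical NumPy imitation of the idea, not XGBoost's internal implementation.

```python
import numpy as np

# Placeholder settings, not tuned results.
xgb_params = {
    "colsample_bytree": 0.5,  # fraction of feature columns sampled for each tree
}


def sample_columns(n_features, fraction, rng):
    """Pick a random subset of column indices, without replacement."""
    n_keep = max(1, int(round(fraction * n_features)))
    return np.sort(rng.choice(n_features, size=n_keep, replace=False))


rng = np.random.default_rng(0)
n_features = 20  # placeholder feature count

# Each tree would see a different random half of the columns.
columns_per_tree = [
    sample_columns(n_features, xgb_params["colsample_bytree"], rng)
    for _ in range(3)
]
for i, cols in enumerate(columns_per_tree):
    print(f"tree {i}: columns {cols.tolist()}")
```

Because every tree sees only part of the feature set, the trees are less correlated with one another, which can help when there are many features relative to the number of samples.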

However, since we are not familiar with XGBoost and we have many features, tuning is effectively a requirement. We will do Bayesian optimization ourselves. Even though wandb (Weights and Biases) offers pre-configured jobs that are easy to submit, we learn more by implementing it ourselves. Besides, I do not know how to retrieve the best parameters from Weights and Biases.

In [2]:
storage_name = "baseline_xgboost_pred_1.txt"

PROJECT_ID = "sunlit-analyst-309609"
%env GCLOUD_PROJECT = $PROJECT_ID
%load_ext google.cloud.bigquery

env: GCLOUD_PROJECT=sunlit-analyst-309609


In [4]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tempfile

import numpy as np
import pandas as pd
from tqdm import tqdm

import xgboost as xgb 
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score, train_test_split

import tensorflow as tf 

from google.cloud import bigquery
LOCATION = "us"

Examples taken from https://github.com/fmfn/BayesianOptimization/blob/master/examples/sklearn_example.py and https://www.kdnuggets.com/2019/07/xgboost-random-forest-bayesian-optimisation.html

If you look at their examples, you will find that they only offer a `maximize` function and nothing for minimizing. This means that using RMSE directly as the objective would not be useful, because the optimizer would push the error up rather than down. There are two ways around this. First, implement the Ariel Score, which we want to maximize anyway. Second, use the negative (root) mean squared error, which can be maximized as well.

We decided that `neg_mean_squared_error` is a good choice.
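To make the sign flip concrete, here is a minimal sketch of an objective that returns the mean `neg_mean_squared_error` from `cross_val_score`, so that larger values (closer to zero) mean a better model. Everything in it is an assumption for illustration: the data is synthetic, `DecisionTreeRegressor` stands in for the XGBoost model, `objective` is a hypothetical name, and a plain random search stands in for the Bayesian optimizer.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (placeholder for the real dataset).
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)


def objective(max_depth, min_samples_leaf):
    """Return the mean negative MSE over 3 folds; higher is better."""
    model = DecisionTreeRegressor(
        # The optimizer proposes floats, so integer parameters are rounded.
        max_depth=int(round(max_depth)),
        min_samples_leaf=int(round(min_samples_leaf)),
        random_state=0,
    )
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=3)
    return scores.mean()


# Stand-in for the optimizer: try random points and keep the maximum.
best_score, best_params = -np.inf, None
for _ in range(10):
    candidate = {
        "max_depth": rng.uniform(1, 8),
        "min_samples_leaf": rng.uniform(1, 20),
    }
    score = objective(**candidate)
    if score > best_score:
        best_score, best_params = score, candidate

print("best negative MSE:", best_score)
print("best parameters:", best_params)
```

An optimizer that only knows how to maximize can consume `objective` unchanged, since maximizing the negative MSE is the same as minimizing the MSE.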

In [None]:
def bayesian_optimization(dataset, function, parameters, target="label", n_iter=10, init_points=3, **kwargs):
    """
    Bayesian Optimization Algorithm. 

    Parameters:
    :var dataset: (Pandas.DataFrame) The dataset to use, including the target column. 
    :var function: (Python Function) Function containing the model to optimize. 
    :var parameters: (Python Dict) The dictionary containing the parameters (or their ranges) to
            optimize. 
    :var target: (str) The column name of the target. Defaults to "label". 
    :var n_iter: (int) How many steps of Bayesian Optimization to go through. The more steps, the
            more likely we are to find a good maximum. 
    :var init_points: (int) How many steps of random exploration to perform. Random exploration
            can help in diversifying the exploration space. 
    :var **kwargs: other BayesianOptimization.maximize() parameters. 
    """
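    # Separate the target column from the features.
    # Note: pop() removes the column from the caller's DataFrame in place.
    y = dataset.pop(target)
    X = dataset

    # Hold out 5% of the rows for evaluation.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)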