# Creating an XGBoost model

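This follows the approach we used earlier of training 55 separate models, one for each of the 55 wavelengths. 

The main difference, apart from switching to XGBoost, is that decision trees do **not** require normalized data, because a split depends only on the ordering of the feature values and not on their scale. Hence we can use the data as it is. 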
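To illustrate why scaling does not matter for tree splits, here is a minimal sketch of a single-feature regression stump in plain Python. The helper `best_split_position` and the toy data are made up for this illustration and are not part of XGBoost; the point is only that the best split separates the same samples before and after an affine rescaling of the feature.

```python
def best_split_position(xs, ys):
    """Return how many of the smallest-x samples go left in the best stump split."""
    # Sort the targets by feature value; only this ordering matters for the split.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    sorted_y = [ys[i] for i in order]

    def sse(values):
        # Sum of squared errors around the mean of one side of the split.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_pos, best_cost = None, float("inf")
    for pos in range(1, len(sorted_y)):
        cost = sse(sorted_y[:pos]) + sse(sorted_y[pos:])
        if cost < best_cost:
            best_pos, best_cost = pos, cost
    return best_pos


# Toy data (hypothetical): small x values go with small y values.
x = [0.2, 1.5, 0.7, 3.1, 2.4, 0.9]
y = [1.0, 3.0, 1.2, 3.2, 2.9, 1.1]
x_scaled = [1000.0 * v + 5.0 for v in x]  # affine rescaling keeps the ordering

print(best_split_position(x, y), best_split_position(x_scaled, y))
```

Both calls pick the same partition of the samples, which is why we skip normalization here.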

Since our data is already cleaned, we also do not need a preprocessing pipeline. 

It is a plus that decision trees (and hence XGBoost) work best when the features are either a mix of categorical and numerical features or purely numerical. Ours are purely numerical. 

It also helps if the number of features is far smaller than the number of training samples. We can also randomly drop features during training later on. 

However, since we are not familiar with XGBoost and we have many features, tuning is effectively a requirement. We will run Bayesian optimization ourselves. Although wandb offers pre-configured jobs that are easy to submit, we learn more by implementing it ourselves, and I do not know how to retrieve the best parameters from Weights and Biases. 

Also note that we might not use the whole dataset for hyperparameter tuning if it takes too long. 

In [1]:
storage_name = "baseline_xgboost_pred_1.txt"

PROJECT_ID = "sunlit-analyst-309609"
%env GCLOUD_PROJECT = $PROJECT_ID
%load_ext google.cloud.bigquery

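# `!export` only affects a throwaway subshell, so set the variable in this process instead.
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/workspace/ariel_ml_2021/sunlit-analyst-309609-77b8e2f94cb5.json"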

env: GCLOUD_PROJECT=sunlit-analyst-309609


In [34]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tempfile

import copy
import numpy as np
import pandas as pd
from tqdm import tqdm

import xgboost as xgb 
from bayes_opt import BayesianOptimization
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

import tensorflow as tf 

from google.cloud import bigquery, bigquery_storage
LOCATION = "us"

Examples taken from https://github.com/fmfn/BayesianOptimization/blob/master/examples/sklearn_example.py and https://www.kdnuggets.com/2019/07/xgboost-random-forest-bayesian-optimisation.html

If you look at their examples, you'll find that the optimizer only provides `maximize`, with nothing for minimizing. This means that passing in RMSE or a similar error directly would not give anything useful, because the optimizer would push the error up. There are two ways around this. First, implement the Ariel score, since that is something we want to maximize. Second, use the negative (root) mean squared error, which can be maximized as well. 

We settled on `neg_mean_squared_error`. 
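As a small illustration of why negating the error works, the sketch below computes a negated MSE by hand and picks the best of two candidate prediction sets with `max`. The function and the numbers are invented for this example; scikit-learn's `neg_mean_squared_error` scorer follows the same sign convention.

```python
def neg_mean_squared_error(y_true, y_pred):
    """Negated MSE: higher is better, so a maximizer can use it directly."""
    n = len(y_true)
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


# Hypothetical targets and two sets of predictions.
y_true = [0.025, 0.030, 0.022]
candidates = {
    "model_a": [0.026, 0.031, 0.020],  # close to the targets
    "model_b": [0.040, 0.010, 0.030],  # far from the targets
}

scores = {name: neg_mean_squared_error(y_true, preds) for name, preds in candidates.items()}

# Maximizing the negated error selects the candidate with the smallest error.
best = max(scores, key=scores.get)
print(best, scores)
```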

Hyperparameters to tune for tree-based learners, based on the DataCamp course on Intro to XGBoost: 

- Learning rate (eta). 
- Gamma: minimum loss reduction required to create a new tree split. 
- Lambda: L2 regularization. XGBoost accepts a float here; this notebook casts it to an integer. 
- Alpha: L1 regularization. XGBoost accepts a float here; this notebook casts it to an integer. 
- max_depth: (positive integer) how deep a tree can grow. 
- subsample: (0, 1]. Fraction of the total training set that can be used for any given boosting round. A low value means little training data is used per round, which may lead to underfitting. A high value might lead to overfitting. 
- colsample_bytree: (0, 1]. The fraction of **features** that can be selected from during any given boosting round. A large value means (almost) all features can be used to build a tree. A smaller value adds regularization by restricting the number of features. Using all columns might result in overfitting. 
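To make `subsample` and `colsample_bytree` more concrete, here is a rough sketch of the idea: for each boosting round, draw a fraction of the row indices and a fraction of the column indices, and build the tree only from those. The helper `sample_round` is hypothetical and only illustrates the concept; XGBoost performs its own sampling internally and may round differently.

```python
import random


def sample_round(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Pick the rows and columns one boosting round would be allowed to see."""
    n_row_pick = max(1, int(subsample * n_rows))
    n_col_pick = max(1, int(colsample_bytree * n_cols))
    rows = sorted(rng.sample(range(n_rows), n_row_pick))
    cols = sorted(rng.sample(range(n_cols), n_col_pick))
    return rows, cols


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows, cols = sample_round(n_rows=10, n_cols=6, subsample=0.5, colsample_bytree=0.5, rng=rng)

# Half of the 10 rows and half of the 6 columns are available for this round.
print(len(rows), len(cols))
```

Lower fractions mean each tree sees less of the data, which is the regularizing effect described in the list above.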

In [24]:
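def xgb_cv(X, y, target="label", **kwargs):
    """
    XGBoost Regressor Cross Validation Function.

    Parameters: 
        :var X: (Pandas.DataFrame) The feature columns of our dataset. 
        :var y: (Pandas.Series) The target column of our dataset. 
        :var target: (str) The column name of the target. Default to "label". 
                Not used inside this function; kept so existing calls still work. 
        :var kwargs: (dict) A dictionary for the optimizer to pass in as params for XGBoost
                Regressor. 

    Returns:
        (float) The mean negative MSE over the folds (higher is better). 
    """
    kwargs["objective"] = "reg:squarederror"
    # Unpack the hyperparameters. Passing `params=kwargs` would hand XGBoost a single
    # unrecognized `params` entry instead of the individual hyperparameters.
    estimator = xgb.XGBRegressor(**kwargs)

    # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)

    # Using 4-fold validation. 
    cval = cross_val_score(estimator, X, y, scoring="neg_mean_squared_error",
                            cv=4)

    return cval.mean()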

In [25]:
def bayesian_optimization(dataset, target, parameters, n_iter=10, init_points=3):
    """
    Bayesian Optimization Algorithm. 

    Parameters:
        :var dataset: (Pandas.DataFrame) A Pandas DataFrame of our used dataset. 
        :var target: (str) The column name of the target. It is removed from `dataset` in place.
        :var parameters: (Python Dict) The dictionary containing the parameters (or its range) to
                optimize on. 
        :var n_iter: (int) How many steps of Bayesian Optimization to go through. The more steps 
                the more likely to find a good maximum. 
        :var init_points: (int) How many steps of random exploration to perform. Random 
                exploration can help in diversifying the exploration space. 

    Returns:
        (dict) The best target value and its parameters, as reported by `optimizer.max`. 
    """
    y = dataset.pop(target)
    X = dataset
    
    def our_crossval(**kwargs):
        """
        Wrapper function for cross validation. 

        This might require us to ensure that integers are cast correctly, that the values
        passed in are valid, etc. This includes clamping the learning rate to the (0, 1) range. 
        """
        return xgb_cv(
            X, y, target,
            learning_rate = max(min(kwargs["learning_rate"], 0.999), 1e-4),
            reg_lambda = int(kwargs["reg_lambda"]),
            reg_alpha = int(kwargs["reg_alpha"]),
            gamma = int(kwargs["gamma"]),
            max_depth = int(kwargs["max_depth"]),
            subsample = max(min(kwargs["subsample"], 0.999), 1e-3),
            colsample_bytree = max(min(kwargs["colsample_bytree"], 0.999), 1e-3),
        )

    optimizer = BayesianOptimization(
        f = our_crossval,
        pbounds = parameters,
        verbose = 1
    )

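    # Pass init_points through as well; otherwise the argument of this function is ignored.
    optimizer.maximize(init_points=init_points, n_iter=n_iter)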

    return optimizer.max

So far I do not know whether fixed values can be passed in through the bounds, since the official Bayesian optimization GitHub page has no example of that, so we will continue with a range for every parameter. 
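One possible workaround, sketched below under assumptions, is to keep fixed values out of the bounds entirely and bind them to the objective with a small wrapper, so the optimizer only ever sees the parameters it should search. The names `with_fixed_params` and `toy_objective` are hypothetical, and the toy objective merely stands in for the cross-validation score; this has not been checked against the `bayes_opt` internals.

```python
def with_fixed_params(objective, fixed):
    """Wrap `objective` so the `fixed` values are always supplied and never searched."""
    def wrapped(**searched):
        # Searched values come from the optimizer, fixed values from the closure.
        return objective(**searched, **fixed)
    return wrapped


def toy_objective(learning_rate, max_depth, subsample):
    # Stand-in for a cross-validation score; it peaks at learning_rate == 0.1.
    return -(learning_rate - 0.1) ** 2 - 0.01 * max_depth + subsample


# Only learning_rate would be listed in the bounds given to the optimizer.
search_bounds = {"learning_rate": (1e-4, 0.3)}
fixed = {"max_depth": 6, "subsample": 0.8}

objective = with_fixed_params(toy_objective, fixed)
print(objective(learning_rate=0.1))
```

The wrapped function accepts only keyword arguments named in `search_bounds`, which matches how `our_crossval` above receives its values.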

In [5]:
parameters = {
    "learning_rate": (1e-4, 0.3),
    "reg_lambda": (1, 1000),
    "reg_alpha": (1, 1000),
    "gamma": (1, 1000),
    "max_depth": (1, 25),
    "subsample": (1e-3, 0.999),
    "colsample_bytree": (1e-3, 0.999),
}

target = "label"

Set up bigquery reading. 

In [7]:
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/workspace/ariel_ml_2021/sunlit-analyst-309609-77b8e2f94cb5.json",
    scopes = ["https://www.googleapis.com/auth/cloud-platform"],
)

bqclient = bigquery.Client(credentials=credentials, project=credentials.project_id)
bqstorageclient = bigquery_storage.BigQueryReadClient(credentials=credentials)

We will try just one file here to check that everything runs, and integrate this into the real workflow later on. 

In [8]:
# %%bigquery df --use_bqstorage_api
# SELECT * EXCEPT (AAAA, BB, CC)
# FROM `sunlit-analyst-309609.training_set.train_table_19` a
# LEFT JOIN (
#     SELECT *
#     FROM `sunlit-analyst-309609.training_set.noisy_train_extra_params`
# ) b
# ON a.AAAA = b.AAAA

In [59]:
query_string = """
SELECT * EXCEPT (AAAA, BB, CC)
FROM `sunlit-analyst-309609.training_set.train_table_19`
"""

df = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)

In [26]:
returned_value = bayesian_optimization(copy.deepcopy(df), target=target, parameters=parameters)
print(returned_value)

ameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/workspace/.pip-modules/lib/python3.8/site-packages/bayes_opt/target_space.py", line 191, in probe
    target = self._cache[_hashable(x)]
KeyError: (0.7838390843112499, 336.60789118492244, 0.12195455990607709, 17.16334692765465, 508.45330966595867, 459.1508871987576, 0.8795131357941941)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/.pip-modules/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/workspace/.pip-modules/lib/python3.8/site-packages/xgboost/core.py", line 436, in inner_f
    return f(**kwargs)
  File "/workspace/.pip-modules/lib/python3.8/site-packages/xgboost/sklearn.py", line 711, in fit
    train_dmatrix, evals = _wrap_evaluation_matrices(
  File "/workspace/.pip-modules/lib/python3.8/site-packages/xgboost/sklea

KeyboardInterrupt: 

In [62]:
m = copy.deepcopy(df)

y = m.pop("label")
X = m

In [77]:
my_param = {
    "learning_rate": 1e-3,
    "max_depth": 25,
    "subsample": 0.2,
    "colsample_bytree": 0.3,
    "objective": "reg:squarederror",
    "n_estimators": 500
}

# Note: `my_param` above is not passed in here; this quick run uses its own smaller settings.
xg_reg = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=15, max_depth=10, verbosity=2)

xg_reg.fit(X, y)

[12:09:01] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 88 extra nodes, 0 pruned nodes, max_depth=7
[12:09:13] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 108 extra nodes, 0 pruned nodes, max_depth=7
[12:09:27] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 170 extra nodes, 0 pruned nodes, max_depth=10
[12:09:49] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 460 extra nodes, 0 pruned nodes, max_depth=10
[12:10:17] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 686 extra nodes, 0 pruned nodes, max_depth=10
[12:10:48] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 842 extra nodes, 0 pruned nodes, max_depth=10
[12:11:22] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 988 extra nodes, 0 pruned nodes, max_depth=10
[12:11:57] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 1098 extra nodes, 0 pruned nodes, max_depth=10
[12:12:30] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 1248 extra no

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=10,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=15, n_jobs=16, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=2)

In [78]:
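# Caution: this reuses the training data, so the RMSE below is a training error,
# not an estimate of performance on unseen data.
X_test, y_test = X, y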

preds = xg_reg.predict(X_test)

In [82]:
rmse = np.sqrt(mean_squared_error(y_test, preds))

print(f"RMSE: {rmse}")

RMSE: 0.00866371649129658
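Since the RMSE above is measured on the same rows the model was fitted on, a held-out split would give a more honest number. The sketch below shows the idea in plain Python with hypothetical helpers `rmse` and `holdout_indices`; in the notebook itself, the already imported `train_test_split` would be the natural tool.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def holdout_indices(n_samples, test_fraction, seed):
    """Shuffle the row indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(test_fraction * n_samples))
    return indices[n_test:], indices[:n_test]


# Hypothetical sizes: 20 rows, a quarter of them held out.
train_idx, test_idx = holdout_indices(n_samples=20, test_fraction=0.25, seed=0)
print(len(train_idx), len(test_idx))

# Toy check of the metric on two made-up predictions.
print(rmse([0.025, 0.030], [0.025, 0.034]))
```

The model would then be fitted on the `train_idx` rows only and scored on the `test_idx` rows.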


# To be done

Add `n_estimators` as a tuning parameter to xgboost? 
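A possible way to do this, sketched under assumptions: add an `n_estimators` range to the bounds and cast the sampled float to an integer before it reaches the regressor, the same way `max_depth` is handled. The range `(50, 500)` and the helper `cast_integer_params` are placeholders, not tuned or tested values. Note that this sketch rounds, whereas the wrapper above truncates with `int()`.

```python
# Parameters that XGBoost expects as integers (assumed list for this sketch).
INTEGER_PARAMS = {"max_depth", "n_estimators"}


def cast_integer_params(sampled):
    """Round the integer-valued hyperparameters; leave the others untouched."""
    return {
        name: int(round(value)) if name in INTEGER_PARAMS else value
        for name, value in sampled.items()
    }


extended_bounds = {
    "learning_rate": (1e-4, 0.3),
    "max_depth": (1, 25),
    "n_estimators": (50, 500),  # hypothetical range, to be adjusted
}

# The optimizer proposes floats for every parameter, so cast before use.
sampled = {"learning_rate": 0.05, "max_depth": 7.6, "n_estimators": 212.3}
print(cast_integer_params(sampled))
```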

In [83]:
preds

array([0.0251145 , 0.0338163 , 0.0292234 , ..., 0.04004366, 0.03397508,
       0.03676901], dtype=float32)

In [81]:
y_test

0         0.025124
1         0.025124
2         0.025124
3         0.025124
4         0.025124
            ...   
125595    0.021874
125596    0.021874
125597    0.021874
125598    0.021874
125599    0.021874
Name: label, Length: 125600, dtype: float64