# Ray Tune - Comparison of GridSearchCV and TuneSearchCV with XGBoost

© 2019-2021, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademyLogo.png)

<img src = "../images/tune/driver.png" align="center" height=300 width=300>

[Porto Seguro](https://www.portoseguro.com.br/) is one of Brazil’s largest auto and homeowner insurance companies. Inaccuracies in an insurance company’s claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.

A data set from Porto Seguro was used in a [Kaggle machine learning competition](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction). The data set is used to build a classification model that predicts the probability that a driver will initiate an auto insurance claim in the next year. The predictions can be used to further tailor insurance prices, and hopefully make auto insurance coverage more accessible to more drivers.

In this exercise we show two things:

1. Composability: combining different algorithms and hyperparameter tuning using sklearn and xgboost.
2. Tune's drop-in replacements (wrappers) for sklearn's GridSearchCV and RandomizedSearchCV, and the ability to use Optuna as the search algorithm.

Although the drop-in replacements were introduced earlier in [03-Ray-Tune-with-Sklearn](03-Ray-Tune-with-Sklearn.ipynb), this exercise demonstrates them
with a larger dataset and a real-life use case.

We need some Python packages, so let's install them:

In [1]:
!pip install boto3 plotly xgboost optuna tune-sklearn



In [2]:
import boto3
from io import BytesIO
import joblib
import numpy as np
import pandas as pd
import plotly.express as px
import ray
from ray import tune
from ray.tune.callback import Callback
from ray.tune.suggest.bohb import TuneBOHB
from ray.tune.schedulers import HyperBandForBOHB
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
# tqdm.notebook provides the notebook-friendly progress bar used by TqdmCallback below
from tqdm.notebook import trange, tqdm
from tune_sklearn import TuneSearchCV
from xgboost import XGBClassifier

In [3]:
import logging
# Suppress log messages at WARNING level and below (this also covers INFO)
logging.disable(logging.WARNING)
import warnings
warnings.filterwarnings("ignore")
import xgboost as xgb
xgb.set_config(verbosity=1)

In [4]:
import os
os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"

class TqdmCallback(Callback):
    def setup(self,
              stop = None,
              num_samples = None,
              total_num_samples = None,
              **info):
        self.pbar = tqdm(total=total_num_samples)

    def on_trial_complete(self, **info):
        self.pbar.update(1)

    def on_experiment_end(self, **info):
        self.pbar.close()

In [5]:
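def plot_cv_score(analysis):
    # Copy the slice so the column assignments below do not modify a view of results_df
    df = analysis.results_df[["average_test_score", "timestamp"]].copy()
    df["timestamp"] = pd.to_datetime(df['timestamp'], unit='s')
    df.set_index("timestamp", inplace=True)
    df.sort_index(inplace=True)
    # Best CV score seen so far, in order of trial completion time
    df["cummax_cv_score"] = df["average_test_score"].cummax()
    df = df[~df.index.duplicated(keep="last")]
    # Resample to one point per second so the line plot has an even time axis
    df = df["cummax_cv_score"].resample("1S").bfill()
    fig = px.line(df, y="cummax_cv_score")
    fig.show()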
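
The plotting helper above tracks the best cross-validation score seen so far as trials complete. A minimal standard-library sketch of that running-maximum step follows; the `scores` values are made up for illustration and are not results from this notebook.

```python
from itertools import accumulate

# Hypothetical per-trial CV scores, in the order the trials finished.
scores = [0.61, 0.59, 0.63, 0.62, 0.64]

# Running maximum: the best score observed up to and including each trial.
best_so_far = list(accumulate(scores, max))
print(best_so_far)
```

The resulting curve never decreases, which is why it is a convenient way to compare how quickly different search strategies reach a good score.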

### Step 1: Read our data from S3

In [6]:
%%time
DATA_URL = "https://ray-ci-higgs.s3.us-west-2.amazonaws.com/" \
                      "safe_driver.csv"
print("Reading data from S3...")
train_df = pd.read_csv(DATA_URL, dtype={'id': np.int32, 'target': np.int8})

y = train_df['target'].values
X = train_df.drop(['target', 'id'], axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=1234)
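# Note: this is the shape of the full data frame (before the train/test split),
# and the column count still includes 'id' and 'target'
t, f = train_df.shape
print(f'training set: {t}, features:{f}')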

Reading data from S3...
training set: 595212, features:31
CPU times: user 1.53 s, sys: 336 ms, total: 1.87 s
Wall time: 4.19 s
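
The `stratify=y` argument keeps the share of positive labels roughly the same in the training and test sets, which matters when claims are rare. The sketch below illustrates the idea with the standard library only. It is not sklearn's implementation, and `stratified_split` and the toy `labels` are hypothetical names and data introduced for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=1234):
    """Return (train_idx, test_idx) with roughly the same class ratio in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then carve off the test share of that class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced labels: 10% positives (made up for illustration).
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
pos_rate_train = sum(labels[i] for i in train_idx) / len(train_idx)
pos_rate_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(pos_rate_train, pos_rate_test)
```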


#### Define some utility functions

In [7]:
def print_test_score(model, X_test, y_test):
    y_pred = model.predict_proba(X_test)
    roc_auc = roc_auc_score(y_test, y_pred[:,1])
    print("**************** roc_auc score: {} ****************".format(roc_auc))

def train_model_and_print_test_score(model, X_train, y_train, X_test, y_test):
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)
    run_cv = RandomizedSearchCV(model, 
                                param_distributions= {},  # no parameters distribution
                                n_iter=1, 
                                scoring='roc_auc', 
                                n_jobs=-1, 
                                cv=skf.split(X_train,y_train), 
                                verbose=1, 
                                random_state=1001)
    run_cv.fit(X_train, y_train)
    print_test_score(run_cv.best_estimator_, X_test, y_test)
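
The utility functions above report ROC AUC. One way to read that number is as the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. The sketch below computes it that way on a tiny made-up example; `roc_auc_pairwise` is a hypothetical helper for illustration, and its quadratic loop is far too slow for the full data set, where `roc_auc_score` should be used.

```python
def roc_auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
auc = roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```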

### Step 2: Define our XGBoost classifier 

In [8]:
model = XGBClassifier(objective='binary:logistic', 
                      n_jobs=1, 
                      eval_metric='auc', 
                      random_state=1234, 
                      verbosity=1, 
                      use_label_encoder=False)

### Step 3: Use vanilla sklearn RandomizedSearchCV without a parameter search space

In [None]:
%%time
train_model_and_print_test_score(model, X_train, y_train, X_test, y_test)

### Step 4: Use vanilla sklearn RandomizedSearchCV with a parameter search space

In [10]:
params = {
        "max_depth": randint(1, 5),
        "min_child_weight": loguniform(0.001, 128),
        # scipy's uniform(loc, scale) samples from [loc, loc + scale],
        # so the scale is chosen to keep these fractions at or below 1.0
        "subsample": uniform(0.1, 0.9),
        "colsample_bylevel": uniform(0.01, 0.99),
        "colsample_bytree": uniform(0.01, 0.99),
        "reg_alpha": loguniform(1 / 1024, 10.0),
        "reg_lambda": loguniform(1 / 1024, 10.0),
        "scale_pos_weight": [1, 26],
}
number_of_cv_splits = 3
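
A randomized search draws each candidate independently from distributions like these. The sketch below shows, with the standard library only, what one draw might look like and how log-uniform sampling spreads values evenly across orders of magnitude. It is not how scipy or sklearn implement sampling; `sample_loguniform` and `sample_candidate` are hypothetical helpers, and only a subset of the parameters is included.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_candidate(rng):
    """Draw one hypothetical candidate from a space shaped like the one above."""
    return {
        "max_depth": rng.randrange(1, 5),  # integers 1..4, upper bound excluded
        "min_child_weight": sample_loguniform(rng, 0.001, 128),
        "reg_alpha": sample_loguniform(rng, 1 / 1024, 10.0),
        "scale_pos_weight": rng.choice([1, 26]),
    }


rng = random.Random(0)
candidates = [sample_candidate(rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```

With `n_iter=100` and 3 CV folds, a search of this kind trains 300 models, one per candidate and fold.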

In [11]:
%%time
gs = RandomizedSearchCV(
    model, 
    params,
    cv=number_of_cv_splits,
    n_iter=100, 
    scoring='roc_auc', 
    n_jobs=-1, # use all cores in a single node
    verbose=1,
)

gs.fit(X_train, y_train)                    

Fitting 3 folds for each of 100 candidates, totalling 300 fits
CPU times: user 28.9 s, sys: 452 ms, total: 29.4 s
Wall time: 10min 59s


RandomizedSearchCV(cv=3,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None,
                                           enable_categorical=False,
                                           eval_metric='auc', gamma=None,
                                           gpu_id=None, importance_type=None,
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           min_child_weight=None, missing=nan,
                                           mo...
                                        'min_child_weight': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fc330fd2c10>,
     

In [12]:
# Report some time and performance statistics
# mean_fit_time holds one value per candidate: its fit time averaged over the CV folds
total_tuning_compute_time = np.sum(gs.cv_results_['mean_fit_time'])
average_train_time = np.mean(gs.cv_results_['mean_fit_time'])
print(f'Sklearn sum of mean fit times over all candidates: {total_tuning_compute_time:.2f} seconds')
print(f'Sklearn average fit time per candidate: {average_train_time:.2f} seconds')
print(f'Best score for AUC: {gs.best_score_:.3f}') 

Sklearn sum of mean fit times over all candidates: 2455.72 seconds
Sklearn average fit time per candidate: 24.56 seconds
Best score for AUC: 0.637


### Step 5: Let's try with Ray Tune

Taking an existing scikit-learn program and converting it to Ray Tune, using its drop-in replacement,
takes only a few lines of code changes.

In [13]:
# ray.init(address="auto", ignore_reinit_error=True)     # Connects to a Ray cluster   
ray.init(ignore_reinit_error=True)                       # Runs locally on my laptop

{'node_ip_address': '127.0.0.1',
 'raylet_ip_address': '127.0.0.1',
 'redis_address': '127.0.0.1:6379',
 'object_store_address': '/tmp/ray/session_2021-11-27_07-46-01_629220_20667/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-11-27_07-46-01_629220_20667/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2021-11-27_07-46-01_629220_20667',
 'metrics_export_port': 60864,
 'node_id': '1d08912e77a1838f2a91274696bdfcfd1839bb17989653add1b5c531'}

In [14]:
ray.cluster_resources()

{'object_store_memory': 4919230464.0,
 'memory': 9838460928.0,
 'CPU': 12.0,
 'node:127.0.0.1': 1.0}

### Define our hyperparameter config space using tune

In [15]:
tune_config_params = {
        "max_depth": tune.randint(1, 5),
        "min_child_weight": tune.loguniform(0.001, 128),
        "subsample": tune.uniform(0.1, 1.0),
        "colsample_bylevel": tune.uniform(0.01, 1.0),
        "colsample_bytree": tune.uniform(0.01, 1.0),
        "reg_alpha": tune.loguniform(1 / 1024, 10.0),
        "reg_lambda": tune.loguniform(1 / 1024, 10.0),
        "scale_pos_weight": tune.choice([1, 26]),
}
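
One detail to watch when porting a scipy search space to Tune: `scipy.stats.uniform(loc, scale)` covers the interval [loc, loc + scale], whereas `tune.uniform(lower, upper)` covers [lower, upper]. The second argument therefore means different things in the two libraries. The helpers below are a small sketch of the conversion; `bounds_to_loc_scale` and `loc_scale_to_bounds` are hypothetical names introduced here and are not part of either library.

```python
def bounds_to_loc_scale(lower, upper):
    """Convert [lower, upper] bounds into the (loc, scale) pair scipy.stats.uniform expects."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    return lower, upper - lower


def loc_scale_to_bounds(loc, scale):
    """Return the interval actually covered by scipy.stats.uniform(loc, scale)."""
    return loc, loc + scale


# Interval covered when (0.1, 1.0) is read as (loc, scale): the upper end exceeds 1.
print(loc_scale_to_bounds(0.1, 1.0))

# (loc, scale) arguments that reproduce the Tune-style bounds [0.1, 1.0].
print(bounds_to_loc_scale(0.1, 1.0))
```

This matters for parameters such as `subsample` and the `colsample_*` family, which XGBoost expects to be fractions no greater than 1.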

### Use Tune's drop-in replacement: TuneSearchCV

In [16]:
%%time
tune_gs = TuneSearchCV(
    model, 
    tune_config_params,
    cv=3,
    n_trials=100, 
    scoring='roc_auc', 
    n_jobs=-1,  # change to 40 if running on a ray cluster 
                # or equal to total number of CPUs 
    verbose=1,
    # Custom keyword arguments
    early_stopping=True,  # Uses the ASHA scheduler to stop unpromising trials early
    max_iters=10,   # Equivalent to epochs in a neural network
    loggers = ["tensorboard"],
    search_optimization="optuna", # Default is "random"
    name="tune-experiment"
)

tune_gs.fit(X_train, y_train, tune_params=dict(callbacks=[TqdmCallback()]))

CPU times: user 1min 46s, sys: 25.7 s, total: 2min 12s
Wall time: 44min 50s


TuneSearchCV(cv=3,
             early_stopping=<ray.tune.schedulers.async_hyperband.AsyncHyperBandScheduler object at 0x7fc341b8fd50>,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     enable_categorical=False,
                                     eval_metric='auc', gamma=None, gpu_id=None,
                                     importance_type=None,
                                     interaction_constraints=No...
                                  'reg_alpha': <ray.tune.sample.Float object at 0x7fc341ba2d90>,
                                  'reg_lambda': <ray.tune.sample.Float object at 0x7fc341ba2e10>,
                                  'scale_pos_weight': <ray.tune.sample.Categorical object at 0x7fc341ba2e90>,
                                  'subsam
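
With `early_stopping=True`, the output above shows an `AsyncHyperBandScheduler` (ASHA), which stops poorly performing trials before they use their full `max_iters` budget. The sketch below illustrates the underlying idea with a simplified, synchronous form of successive halving. It is not Ray Tune's implementation, and `successive_halving`, `fake_score`, and the toy configurations are hypothetical names and data introduced for illustration.

```python
def successive_halving(configs, score_fn, max_iters=10, reduction_factor=2):
    """Evaluate survivors in rungs; after each rung keep the top 1/reduction_factor."""
    survivors = list(configs)
    iters = 1
    while len(survivors) > 1 and iters <= max_iters:
        # Score every surviving config at the current training budget.
        ranked = sorted(survivors, key=lambda cfg: score_fn(cfg, iters), reverse=True)
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
        # Give the remaining configs a larger budget in the next rung.
        iters *= reduction_factor
    return survivors


def fake_score(cfg, iters):
    # Hypothetical score: improves with budget and peaks when max_depth is 3.
    return 0.6 + 0.01 * min(iters, 10) - 0.01 * abs(cfg["max_depth"] - 3)


configs = [{"max_depth": depth} for depth in range(1, 9)]
best = successive_halving(configs, fake_score)
print(best)
```

Most of the budget goes to the few configurations that survive the early rungs, which is the trade-off early stopping makes: it can discard a configuration that would only have become good later.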

In [17]:
print(f'Best parameters: {tune_gs.best_params_}')
print(f'Best AUC score : {tune_gs.best_score_}')

Best parameters: {'max_depth': 3, 'min_child_weight': 0.06926412977452452, 'subsample': 0.9299617438456829, 'colsample_bylevel': 0.027348326980587125, 'colsample_bytree': 0.12074232732777815, 'reg_alpha': 8.066616375199791, 'reg_lambda': 0.004162026998429759, 'scale_pos_weight': 1}
Best AUC score : 0.637886010407827


#### Plot the best CV score over time

In [None]:
plot_cv_score(tune_gs.analysis_)

In [None]:
ray.shutdown()