## XGBoost on a single node 

Code example from [Anyscale blog]()

Let's start with a simple single-node, non-distributed setup using core XGBoost.

In [1]:
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
import time

### Load the scikit-learn data and convert it to an XGBoost DMatrix data structure

In [2]:
train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = xgb.DMatrix(train_x, train_y)

In [3]:
# Train the model with the required arguments to the XGBoost trainer

evals_result = {}
start = time.time()
bst = xgb.train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    train_set,
    evals_result=evals_result,    # filled with the per-round metric values
    evals=[(train_set, "train")],  # evaluate on the training set itself
    verbose_eval=False,
)

print("XGBoost train time: {:.2f} secs".format(time.time() - start))

# Save the model
bst.save_model("model.xgb")

XGBoost train time: 0.03 secs
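
To make the two entries in `eval_metric` concrete, here is a rough hand-written sketch of what `logloss` and `error` measure for binary classification. It is not XGBoost's implementation: the function names, the clipping constant `eps`, and the sample values are assumptions made for illustration.

```python
import math


def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an illustrative choice
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def error_rate(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    wrong = sum(int(p > threshold) != y for y, p in zip(labels, probs))
    return wrong / len(labels)


# Made-up labels and predicted probabilities, for illustration only
sample_labels = [1, 0, 1, 0]
sample_probs = [0.9, 0.2, 0.4, 0.6]

print("logloss: {:.4f}".format(logloss(sample_labels, sample_probs)))
print("error:   {:.2f}".format(error_rate(sample_labels, sample_probs)))
```

After training, `evals_result["train"]` holds one list of values per metric, with one entry per boosting round.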


### Do predictions

In [12]:
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

data, labels = load_breast_cancer(return_X_y=True)

# Labels are not needed for prediction, so build the DMatrix from the features only
dpred = xgb.DMatrix(data)

# Load the model saved above and predict with the Booster's own predict method
bst = xgb.Booster(model_file="model.xgb")
predictions = bst.predict(dpred)

print(predictions)
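
With the `binary:logistic` objective, predictions are probabilities of the positive class rather than class labels. A minimal sketch of turning them into labels follows; the probabilities are made up, the 0.5 threshold is a common default rather than something the notebook prescribes, and the name mapping relies on scikit-learn's encoding of this dataset (0 = malignant, 1 = benign).

```python
# Made-up probabilities standing in for the output of a binary:logistic model
sample_predictions = [0.97, 0.03, 0.51, 0.49]

# Threshold at 0.5 to obtain hard class labels (illustrative choice)
threshold = 0.5
predicted_labels = [int(p > threshold) for p in sample_predictions]

# scikit-learn's breast cancer dataset encodes 0 as malignant and 1 as benign
class_names = ["malignant", "benign"]
predicted_names = [class_names[label] for label in predicted_labels]

print(predicted_labels)
print(predicted_names)
```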

### XGBoost-Ray on multiple cores

Import the Ray-integrated XGBoost data matrix, parameters, and trainer.

In [9]:
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

evals_result = {}
start = time.time()
bst = train(
   {
       "objective": "binary:logistic",
       "eval_metric": ["logloss", "error"],
   },
   train_set,
   evals_result=evals_result,
   evals=[(train_set, "train")],
   verbose_eval=False,
   ray_params=RayParams(num_actors=2, cpus_per_actor=1))

print("XGBoost train time: {:.2f} secs".format(time.time() - start))

bst.save_model("model_ray.xgb")

2021-08-21 13:05:11,901 INFO main.py:892 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.
2021-08-21 13:05:12,722 INFO main.py:937 -- [RayXGBoost] Starting XGBoost training.
(pid=36640) [13:05:12] task [xgboost.ray]:140695871277280 got new rank 0
(pid=36635) [13:05:12] task [xgboost.ray]:140203359316192 got new rank 1
2021-08-21 13:05:13,604 INFO main.py:1408 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 1.76 seconds (0.88 pure XGBoost training time).


XGBoost train time: 1.77 secs


### Scikit-learn API

XGBoost-Ray can also act as a drop-in replacement for sklearn-style models, such as XGBRegressor or XGBClassifier.

In [14]:
from xgboost_ray import RayXGBClassifier, RayParams
from sklearn.datasets import load_breast_cancer
import mlflow

X, y = load_breast_cancer(return_X_y=True)

clf = RayXGBClassifier(
    n_jobs=4,  # Number of distributed actors
)
start = time.time()
mlflow.autolog()  # called after the timer starts, so the reported time includes autologging setup
clf.fit(X, y)
print("XGBoost train time: {:.2f} secs".format(time.time() - start))

2021/08/21 17:15:37 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.
2021/08/21 17:15:41 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2021-08-21 17:15:42,918 INFO main.py:892 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
2021-08-21 17:15:44,242 INFO main.py:937 -- [RayXGBoost] Starting XGBoost training.
(pid=37446) [17:15:44] task [xgboost.ray]:140441235322864 got new rank 0
(pid=37453) [17:15:44] task [xgboost.ray]:140395525882496 got new rank 1
(pid=37449) [17:15:44] task [xgboost.ray]:140316194740160 got new rank 3
(pid=37447) [17:15:44] task [xgboost.ray]:140195935647776 got new rank 2
2021-08-21 17:15:45,551 INFO main.py:1408 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 3.91 seconds (1.30 pure XGBoost training time).


XGBoost train time: 7.86 secs
