This notebook uses the diamonds dataset from seaborn to demonstrate a minimal XGBoost workflow. It assumes that your environment is already set up for XGBoost. Some recommended use cases for this notebook:
  - Testing whether your environment is ready to run XGBoost
  - Learning a basic XGBoost setup
  - Testing XGBoost on both CPU and GPU

Note: this notebook runs XGBoost on both CPU and GPU. You can skip the GPU part if no GPU is available.

In [1]:
import time
import seaborn as sns
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
import xgboost as xgb
seed = 123

In [2]:
sns.__version__, np.__version__, sklearn.__version__, xgb.__version__

('0.13.2', '1.26.4', '1.5.2', '2.1.1')

In [3]:
# use diamonds dataset in seaborn
diamonds = sns.load_dataset('diamonds')
diamonds.head(), diamonds.dtypes, diamonds.shape

(   carat      cut color clarity  depth  table  price     x     y     z
 0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
 1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
 2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
 3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
 4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75,
 carat       float64
 cut        category
 color      category
 clarity    category
 depth       float64
 table       float64
 price         int64
 x           float64
 y           float64
 z           float64
 dtype: object,
 (53940, 10))

In [4]:
# three columns hold non-numerical features: cut, color, clarity
# their dtype is already category
diamonds.describe(exclude=np.number)

          cut  color clarity
count   53940  53940   53940
unique      5      7       8
top     Ideal      G     SI1
freq    21551  11292   13065


In [5]:
# a category dtype can be ordered (ordinal) or unordered (nominal)
# cut is currently unordered (ordered=False), i.e. nominal
diamonds['cut'].dtype

CategoricalDtype(categories=['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], ordered=False, categories_dtype=object)
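
If `cut` should be treated as ordinal on the pandas side, the dtype can be re-declared with an explicit order. The following is a minimal sketch, not part of the original notebook: the worst-to-best order in `cut_order` is an assumption based on common diamond grading, and a small stand-in series replaces `diamonds['cut']` so that the snippet runs without downloading data. It only shows the pandas declaration; how XGBoost treats the ordering is not covered here.

```python
import pandas as pd

# Assumed quality order for cut, from worst to best.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Small stand-in for diamonds['cut'] (unordered category dtype).
cut = pd.Series(["Ideal", "Premium", "Good", "Premium", "Good"], dtype="category")

# Re-declare the same values with an explicit, ordered set of categories.
cut_ordered = cut.astype(pd.CategoricalDtype(categories=cut_order, ordered=True))

print(cut_ordered.dtype.ordered)         # True once the dtype is ordinal
print(cut_ordered.cat.codes.tolist())    # integer codes follow cut_order
print((cut_ordered > "Good").tolist())   # comparisons require an ordered dtype
```

In the notebook itself, the same `astype` call could be applied to `diamonds['cut']` before splitting the data.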

In [6]:
# Split X, Y
Y = diamonds['price']
X = diamonds.drop('price', axis=1)

In [7]:
# split train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=seed, train_size=0.8)
Y_train.shape, Y_test.shape

((43152,), (10788,))

In [8]:
# Make DMatrix for xgboost
# note the argument enable_categorical=True
dtrain = xgb.DMatrix(X_train, Y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, Y_test, enable_categorical=True)

In [9]:
# xgb.train
# using cpu
params_cpu = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "tree_method": "hist",
    "device": "cpu",
}
n = 5000
evals = [(dtrain, "train"), (dtest, "test")]

model = xgb.train(
   params=params_cpu,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=50,
   early_stopping_rounds=50,
)

[0]	train-rmse:2861.85524	test-rmse:2848.74215
[50]	train-rmse:439.96632	test-rmse:545.91459
[100]	train-rmse:378.66674	test-rmse:545.62390
[126]	train-rmse:360.34374	test-rmse:548.42863


In [10]:
# using gpu
params_gpu = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "tree_method": "hist",
    "device": "gpu",
}

model = xgb.train(
   params=params_gpu,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=50,
   early_stopping_rounds=50,
)

[0]	train-rmse:2861.57928	test-rmse:2848.91516
[50]	train-rmse:430.67545	test-rmse:545.64382
[100]	train-rmse:378.38766	test-rmse:543.54183
[139]	train-rmse:346.25134	test-rmse:545.59116


In [11]:
# compare running time
# early stopping is disabled here so that both runs train all n rounds
tic = time.time()
model_cpu = xgb.train(
   params=params_cpu,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=0,
#    early_stopping_rounds=50,
)
toc = time.time()
cputime = toc - tic

tic = time.time()
model_gpu = xgb.train(
   params=params_gpu,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=0,
#    early_stopping_rounds=50,
)
toc = time.time()
gputime = toc - tic

print(f'cputime {cputime} \n gputime {gputime}')
# in the run recorded below, the GPU was not faster than the CPU
# note: ran on 1 gpu

cputime 16.291748762130737 
 gputime 17.354901552200317
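
A single timed run can be noisy, and the first GPU run may include one-time setup costs. One way to get a steadier comparison is to repeat each measurement and keep the shortest time. The helper below is a hypothetical sketch, not part of the original notebook: `best_of` is an illustrative name, and a cheap stand-in workload replaces the training call so that the snippet runs on its own.

```python
import time


def best_of(fn, repeats=3):
    """Run fn() `repeats` times and return the shortest duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations)


# Stand-in workload; in the notebook this would be a call such as
# lambda: xgb.train(params=params_cpu, dtrain=dtrain, num_boost_round=n)
elapsed = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"best of 3: {elapsed:.4f} s")
```

Taking the minimum rather than the mean reduces the influence of runs that were slowed down by unrelated activity on the machine.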


This notebook demonstrates a minimal XGBoost setup. The workflow can be summarized as follows:
  - Data preprocessing: the diamonds dataset from seaborn is used for the demonstration. Since this notebook aims for a minimal setup, the preprocessing is not thorough. One thing to be mindful of is how each feature's dtype is declared, especially when there are categorical features, which can be either ordinal or nominal. Also, if any feature has the category dtype, enable_categorical=True must be set when creating the DMatrix.
  - Making the DMatrix: DMatrix is XGBoost's own data container class. DMatrix objects are what get passed to xgb.train() for training and evaluation.
  - Parameter setup: this step tells XGBoost how to learn. There are many parameters, and only a few important ones are demonstrated here. Since the demo's task is price prediction, the model is set to perform regression by minimizing the squared-error objective (reg:squarederror).
  - Training: in this demonstration, the model is trained through the xgb.train() API. XGBoost also provides other regression interfaces, such as the scikit-learn-style xgb.XGBRegressor.

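Additionally, the runtimes on CPU and on 1 GPU are compared. In the recorded run, the GPU was not faster than the CPU (about 17.4 s versus 16.3 s), possibly because the dataset is small enough that GPU overhead offsets any speedup. Note that a proper environment setup is needed to use the GPU.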