This notebook uses the diamonds dataset from seaborn to demonstrate how to use xgboost. It assumes your environment is already set up for xgboost. Some recommended use cases for this notebook:
  - Learning the basic xgboost workflow
  - Learning to represent categorical data of both nominal and ordinal types
  - Testing some hypotheses:
    - Does model performance differ when the 'cut' variable is represented as ordinal rather than nominal?
    - Does the 'cut' variable help improve model performance?

Note: this notebook is set to use the GPU. You can switch to the CPU by changing the 'device' entry in the parameter-setting cell.

In [86]:
import copy
import seaborn as sns
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
import xgboost as xgb
seed = 123

In [88]:
sns.__version__, np.__version__, pd.__version__, sklearn.__version__, xgb.__version__

('0.13.2', '1.26.4', '2.2.2', '1.5.2', '2.1.1')

In [89]:
# use diamonds dataset in seaborn
diamonds = sns.load_dataset('diamonds')
diamonds.head(), diamonds.dtypes, diamonds.shape

(   carat      cut color clarity  depth  table  price     x     y     z
 0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
 1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
 2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
 3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
 4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75,
 carat       float64
 cut        category
 color      category
 clarity    category
 depth       float64
 table       float64
 price         int64
 x           float64
 y           float64
 z           float64
 dtype: object,
 (53940, 10))

In [90]:
# three columns hold non-numerical features: cut, color, clarity
# their dtype is already category
diamonds.describe(exclude=np.number)

          cut  color clarity
count   53940  53940   53940
unique      5      7       8
top     Ideal      G     SI1
freq    21551  11292   13065


In [91]:
# note that a category dtype can be either ordinal (ordered=True) or nominal (ordered=False)
# cut, color, and clarity are currently nominal.
diamonds['cut'].dtype, diamonds['color'].dtype, diamonds['clarity'].dtype

(CategoricalDtype(categories=['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], ordered=False, categories_dtype=object),
 CategoricalDtype(categories=['D', 'E', 'F', 'G', 'H', 'I', 'J'], ordered=False, categories_dtype=object),
 CategoricalDtype(categories=['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'], ordered=False, categories_dtype=object))
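
To make the nominal/ordinal distinction concrete, here is a minimal sketch on a toy Series (the `values` list is hypothetical data, not taken from the dataset). In pandas, an ordered categorical supports order comparisons and `max()`, while an unordered one raises `TypeError` on an order comparison.

```python
import pandas as pd

levels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
values = ["Good", "Ideal", "Fair"]  # hypothetical toy data

nominal = pd.Series(pd.Categorical(values, categories=levels, ordered=False))
ordinal = pd.Series(pd.Categorical(values, categories=levels, ordered=True))

# Ordered categoricals can be compared against one of their categories.
is_better_than_good = ordinal > "Good"
best = ordinal.max()

# Unordered categoricals only support equality checks.
try:
    nominal > "Good"
    nominal_comparable = True
except TypeError:
    nominal_comparable = False

print(is_better_than_good.tolist(), best, nominal_comparable)
```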

In [92]:
# Representing cut as ordinal may be more accurate: its categories (Ideal, Premium, ...) imply an order
# and are likely associated with the price we are trying to predict.
# Create another dataset in which cut is ordinal,
# and explicitly re-order the categories from low to high.
diamonds_ordinal = copy.deepcopy(diamonds)
diamonds_ordinal['cut'] = pd.Categorical(diamonds_ordinal['cut'], 
                                         categories=['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], 
                                         ordered=True)
diamonds_ordinal['cut'].dtype

CategoricalDtype(categories=['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], ordered=True, categories_dtype=object)
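
Re-ordering the categories also changes the integer code pandas stores for each label. The sketch below uses a few toy values (hypothetical, not rows from the dataset) to show the codes under both category orders. As far as I understand, XGBoost's categorical support works from these codes, so it seems safest to evaluate a model only on data encoded with the same category order it was trained on.

```python
import pandas as pd

values = ["Ideal", "Premium", "Good"]  # hypothetical toy data

# Category order as shipped with the seaborn dataset
original = pd.Categorical(
    values, categories=["Ideal", "Premium", "Very Good", "Good", "Fair"]
)
# Category order after the explicit low-to-high re-ordering
reordered = pd.Categorical(
    values,
    categories=["Fair", "Good", "Very Good", "Premium", "Ideal"],
    ordered=True,
)

original_codes = original.codes.tolist()
reordered_codes = reordered.codes.tolist()
print(original_codes, reordered_codes)
```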

In [93]:
# Split X, Y
Y, Y_ordinal = diamonds['price'], diamonds_ordinal['price']
X, X_ordinal = diamonds.drop('price', axis=1), diamonds_ordinal.drop('price', axis=1)

In [94]:
# split train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=seed, train_size=0.8)
Xo_train, Xo_test, Yo_train, Yo_test = train_test_split(X_ordinal, Y_ordinal, random_state=seed, train_size=0.8)
Y_train.shape, Y_test.shape

((43152,), (10788,))
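
Because both calls share the same `random_state` and the two frames have the same number of rows, they should select the same rows, which keeps the nominal and ordinal results comparable. A minimal sketch on toy data, where `frame_a` and `frame_b` are hypothetical stand-ins for `X` and `X_ordinal`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two frames with the same rows but different values in one column
frame_a = pd.DataFrame({"value": range(10)})
frame_b = frame_a.assign(value=lambda d: d["value"] * 10)

a_train, a_test = train_test_split(frame_a, random_state=123, train_size=0.8)
b_train, b_test = train_test_split(frame_b, random_state=123, train_size=0.8)

# The row labels chosen for train and test match across the two splits.
same_rows = a_train.index.equals(b_train.index) and a_test.index.equals(b_test.index)
print(same_rows)
```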

In [95]:
# Make DMatrix for xgboost
# note the argument enable_categorical=True
dtrain = xgb.DMatrix(X_train, Y_train, enable_categorical=True)
dtrain_o = xgb.DMatrix(Xo_train, Yo_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, Y_test, enable_categorical=True)
dtest_o = xgb.DMatrix(Xo_test, Yo_test, enable_categorical=True)

In [96]:
# use the GPU
# set 'device': 'cpu' to train on the CPU instead
params_gpu = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "tree_method": "hist",
    "device": "gpu",
}
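n = 5000  # num_boost_round
vb = 50  # verbose_eval
es = 50  # early_stopping_rounds
# note: when evals has several entries, xgb.train uses the last one (here test_o) for early stopping
evals = [(dtrain, "train"), (dtest, "test"), (dtrain_o, "train_o"), (dtest_o, "test_o")]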
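
One way to keep CPU and GPU settings side by side is to build both dictionaries from a shared base. This is only a sketch: `base_params` and `with_device` are hypothetical names introduced here, and `"cuda"` is the device string I would expect to work on a CUDA build of xgboost 2.x.

```python
# Shared settings, copied from the parameter cell
base_params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "tree_method": "hist",
}

def with_device(params, device):
    """Return a copy of params with the device entry set."""
    return {**params, "device": device}

example_params_cpu = with_device(base_params, "cpu")
example_params_gpu = with_device(base_params, "cuda")
print(example_params_cpu["device"], example_params_gpu["device"])
```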


In [97]:
model = xgb.train(
   params=params_gpu,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=vb,
   early_stopping_rounds=es,
)

[0]	train-rmse:2861.57928	test-rmse:2848.91516	train_o-rmse:2861.57928	test_o-rmse:2848.91516
[50]	train-rmse:430.67545	test-rmse:545.64382	train_o-rmse:514.94236	test_o-rmse:601.69674
[100]	train-rmse:378.38766	test-rmse:543.54183	train_o-rmse:482.34675	test_o-rmse:600.59899
[138]	train-rmse:346.41113	test-rmse:545.63702	train_o-rmse:461.35018	test_o-rmse:601.45831
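
The log stops at round 138 with `early_stopping_rounds=50`. Under a simple patience rule, that is consistent with the monitored metric last improving at round 88. The function below is a simplified illustration of such a rule on a toy score list; `early_stop_round` is a hypothetical helper, not part of xgboost.

```python
def early_stop_round(scores, patience):
    """Return (best_round, stop_round) for a metric that should decrease.

    Training stops at the first round that is `patience` rounds past the
    best round without any improvement.
    """
    best_round = 0
    for i, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = i
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(scores) - 1

# Toy RMSE values: the metric improves until round 2, then drifts upward.
toy_scores = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8]
best_round, stop_round = early_stop_round(toy_scores, patience=2)
print(best_round, stop_round)
```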


In [98]:
model_o = xgb.train(
   params=params_gpu,
   dtrain=dtrain_o,
   num_boost_round=n,
   evals=evals,
   verbose_eval=vb,
   early_stopping_rounds=es,
)

[0]	train-rmse:2861.57928	test-rmse:2848.91516	train_o-rmse:2861.57928	test_o-rmse:2848.91516
[50]	train-rmse:514.94236	test-rmse:601.69674	train_o-rmse:430.67545	test_o-rmse:545.64382
[100]	train-rmse:482.34675	test-rmse:600.59899	train_o-rmse:378.38766	test_o-rmse:543.54183
[138]	train-rmse:461.35018	test-rmse:601.45831	train_o-rmse:346.41113	test_o-rmse:545.63702


In [101]:
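# Each model scores the same on its own test set (about 545.6 at the last logged round:
# test-rmse for model, test_o-rmse for model_o).
# So representing cut as nominal or ordinal makes no visible difference to model performance here.
# The higher RMSE (about 601) appears only when a model is evaluated on the other dataset,
# whose sole difference is the category order of cut.
# Does cut have any explanatory power at all?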

In [102]:
# Investigate the effect of cut
# make another dataset dropping cut, and train a model
dtrain_c = xgb.DMatrix(X_train.drop('cut', axis=1), Y_train, enable_categorical=True)
dtest_c = xgb.DMatrix(X_test.drop('cut', axis=1), Y_test, enable_categorical=True)
dtrain_co = xgb.DMatrix(Xo_train.drop('cut', axis=1), Yo_train, enable_categorical=True)
dtest_co = xgb.DMatrix(Xo_test.drop('cut', axis=1), Yo_test, enable_categorical=True)
evals = [(dtrain_c, "train_c"), (dtest_c, "test_c"), (dtrain_co, "train_co"), (dtest_co, "test_co")]
model_c = xgb.train(
   params=params_gpu,
   dtrain=dtrain_c,
   num_boost_round=n,
   evals=evals,
   verbose_eval=vb,
   early_stopping_rounds=es,
)

[0]	train_c-rmse:2861.57928	test_c-rmse:2848.91516	train_co-rmse:2861.57928	test_co-rmse:2848.91516
[50]	train_c-rmse:444.82757	test_c-rmse:560.36292	train_co-rmse:444.82757	test_co-rmse:560.36292
[100]	train_c-rmse:388.12758	test_c-rmse:562.73100	train_co-rmse:388.12758	test_co-rmse:562.73100
[128]	train_c-rmse:365.41517	test_c-rmse:563.64819	train_co-rmse:365.41517	test_co-rmse:563.64819


In [103]:
# Without cut, the test RMSE at the last logged round rises from 545.64 to 563.65, so including cut improves the test RMSE by roughly 3%.
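
The percentage can be checked directly from the logged values. This sketch takes the last logged test RMSE of each run (round 138 with cut, round 128 without), which is an assumption about which rounds to compare; using each model's best iteration might give slightly different numbers.

```python
rmse_with_cut = 545.63702     # test-rmse at round 138, model trained with cut
rmse_without_cut = 563.64819  # test_c-rmse at round 128, model trained without cut

# Relative increase in error when cut is dropped
increase = (rmse_without_cut - rmse_with_cut) / rmse_with_cut
# Relative reduction in error when cut is included
reduction = (rmse_without_cut - rmse_with_cut) / rmse_without_cut

print(f"increase without cut: {increase:.1%}")
print(f"reduction with cut:   {reduction:.1%}")
```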

This notebook demonstrates a minimal xgboost workflow and how to prepare your data so that a variable is represented as a nominal or ordinal type.

Additionally, using the diamonds dataset, we tested some hypotheses about the 'cut' variable. This shows how small changes to the setup let you pursue a deeper understanding with xgboost.