
Continuous features ignored when training with categorical features #7227

Closed
adamreeve opened this issue Sep 14, 2021 · 1 comment · Fixed by #7228

Comments

@adamreeve

I was testing the new categorical data support and found that if I have a mix of continuous and categorical features, trees always split on the categorical features and ignore the continuous features. Is this expected behaviour or a known limitation, or am I doing something wrong here?

I'm testing with commit 3515931 from the master branch, and built with -DUSE_CUDA=ON -DGOOGLE_TEST=on on Fedora 34.

Example reproduction:

import xgboost as xgb
import numpy as np
import pandas as pd

def generate_data(num_rows, num_continuous_features, num_categorical_features, num_categories):
    data = {}
    label = np.zeros(num_rows)
    # Continuous features: uniform in [0, 1), each added to the label.
    for i in range(num_continuous_features):
        feat = np.random.rand(num_rows)
        data['cont{0}'.format(i)] = feat
        label += feat
    # Categorical features: random integer codes, also added to the label,
    # so every feature carries signal and should be split on.
    for i in range(num_categorical_features):
        feat = np.random.randint(0, num_categories, num_rows)
        data['cat{0}'.format(i)] = pd.Series(feat, dtype='category')
        label += feat
    return pd.DataFrame(data), label

X_train, y_train = generate_data(1000, 5, 1, 4)
d_train = xgb.DMatrix(data=X_train, label=y_train, enable_categorical=True)

params = {
  'objective': 'reg:squarederror',
  'max_depth': 4,
  'tree_method': 'gpu_hist',
}

model = xgb.train(
    params=params,
    dtrain=d_train,
    num_boost_round=3,
)

model.dump_model('model.txt')
with open('model.txt', 'r') as model_file:
    print(model_file.read())

This outputs a model like the following, in which every split uses 'cat0' and no 'cont*' feature appears:

booster[0]:
0:[cat0:{0}] yes=2,no=1,missing=2
	1:[cat0:{1}] yes=4,no=3,missing=3
		3:[cat0:{2}] yes=6,no=5,missing=5
			5:leaf=1.49071383
			6:leaf=1.19833255
		4:leaf=0.898031175
	2:leaf=0.592973948
booster[1]:
0:[cat0:{0}] yes=2,no=1,missing=2
	1:[cat0:{1}] yes=4,no=3,missing=3
		3:[cat0:{2}] yes=6,no=5,missing=5
			5:leaf=1.04521966
			6:leaf=0.840265036
		4:leaf=0.629819095
	2:leaf=0.415745467
booster[2]:
0:[cat0:{0}] yes=2,no=1,missing=1
	1:[cat0:{1}] yes=4,no=3,missing=3
		3:[cat0:{2}] yes=6,no=5,missing=6
			5:leaf=0.732859731
			6:leaf=0.589189708
		4:leaf=0.441713214
	2:leaf=0.291487336
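A quick way to confirm which features the dumped trees actually split on, without re-running training, is to count the feature names in the dump text. The sketch below parses the first booster from the dump above; `count_split_features` is a hypothetical helper written for this issue, not part of xgboost:

```python
import re
from collections import Counter

# First tree from the model dump above, pasted verbatim.
dump = """0:[cat0:{0}] yes=2,no=1,missing=2
\t1:[cat0:{1}] yes=4,no=3,missing=3
\t\t3:[cat0:{2}] yes=6,no=5,missing=5
\t\t\t5:leaf=1.49071383
\t\t\t6:leaf=1.19833255
\t\t4:leaf=0.898031175
\t2:leaf=0.592973948"""

def count_split_features(dump_text):
    """Count how often each feature name appears in split nodes of a dump.

    Split nodes look like `0:[cat0:{0}] ...` for categorical splits or
    `0:[cont3<0.5] ...` for continuous ones; leaves have no bracket.
    """
    return Counter(re.findall(r'\[(\w+)[:<]', dump_text))

print(count_split_features(dump))  # only cat0 appears; no cont* feature is used
```

With the trained Booster in hand, `model.get_score(importance_type='weight')` should give the same per-feature split counts directly.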
@trivialfis (Member)

Thanks for opening the issue! Will open a PR for the fix.
