
Continuous features ignored when training with categorical features #7227

Closed
adamreeve opened this issue Sep 14, 2021 · 1 comment · Fixed by #7228

Comments

@adamreeve

I was testing the new categorical data support and found that if I have a mix of continuous and categorical features, trees always split on the categorical features and ignore the continuous features. Is this expected behaviour or a known limitation, or am I doing something wrong here?

I'm testing with commit 3515931 from the master branch, and built with -DUSE_CUDA=ON -DGOOGLE_TEST=on on Fedora 34.

Example reproduction:

import xgboost as xgb
import numpy as np
import pandas as pd

def generate_data(num_rows, num_continuous_features, num_categorical_features, num_categories):
    data = {}
    label = np.zeros(num_rows)
    # Continuous features: uniform in [0, 1), each added to the label.
    for i in range(num_continuous_features):
        feat = np.random.rand(num_rows)
        data['cont{0}'.format(i)] = feat
        label += feat
    # Categorical features: random integer codes, also added to the label,
    # so every feature carries signal and should be split on.
    for i in range(num_categorical_features):
        feat = np.random.randint(0, num_categories, num_rows)
        data['cat{0}'.format(i)] = pd.Series(feat, dtype='category')
        label += feat
    return pd.DataFrame(data), label

X_train, y_train = generate_data(1000, 5, 1, 4)
d_train = xgb.DMatrix(data=X_train, label=y_train, enable_categorical=True)

params = {
  'objective': 'reg:squarederror',
  'max_depth': 4,
  'tree_method': 'gpu_hist',
}

model = xgb.train(
    params=params,
    dtrain=d_train,
    num_boost_round=3,
)

model.dump_model('model.txt')
with open('model.txt', 'r') as model_file:
    print(model_file.read())

This outputs a model like the following, in which every split uses 'cat0' and no 'cont*' feature appears:

booster[0]:
0:[cat0:{0}] yes=2,no=1,missing=2
	1:[cat0:{1}] yes=4,no=3,missing=3
		3:[cat0:{2}] yes=6,no=5,missing=5
			5:leaf=1.49071383
			6:leaf=1.19833255
		4:leaf=0.898031175
	2:leaf=0.592973948
booster[1]:
0:[cat0:{0}] yes=2,no=1,missing=2
	1:[cat0:{1}] yes=4,no=3,missing=3
		3:[cat0:{2}] yes=6,no=5,missing=5
			5:leaf=1.04521966
			6:leaf=0.840265036
		4:leaf=0.629819095
	2:leaf=0.415745467
booster[2]:
0:[cat0:{0}] yes=2,no=1,missing=1
	1:[cat0:{1}] yes=4,no=3,missing=3
		3:[cat0:{2}] yes=6,no=5,missing=6
			5:leaf=0.732859731
			6:leaf=0.589189708
		4:leaf=0.441713214
	2:leaf=0.291487336
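A quick way to confirm which features the dumped trees actually split on, without re-running training, is to count the feature names in the dump text. The sketch below parses the first booster from the dump above; `count_split_features` is a hypothetical helper written for this issue, not part of xgboost:

```python
import re
from collections import Counter

# First tree from the model dump above, pasted verbatim.
dump = """0:[cat0:{0}] yes=2,no=1,missing=2
\t1:[cat0:{1}] yes=4,no=3,missing=3
\t\t3:[cat0:{2}] yes=6,no=5,missing=5
\t\t\t5:leaf=1.49071383
\t\t\t6:leaf=1.19833255
\t\t4:leaf=0.898031175
\t2:leaf=0.592973948"""

def count_split_features(dump_text):
    """Count how often each feature name appears in split nodes of a dump.

    Split nodes look like `0:[cat0:{0}] ...` for categorical splits or
    `0:[cont3<0.5] ...` for continuous ones; leaves have no bracket.
    """
    return Counter(re.findall(r'\[(\w+)[:<]', dump_text))

print(count_split_features(dump))  # only cat0 appears; no cont* feature is used
```

With the trained Booster in hand, `model.get_score(importance_type='weight')` should give the same per-feature split counts directly.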
@trivialfis (Member)

Thanks for opening the issue! Will open a PR for the fix.
