Better handling for categorical variables #1721

Closed
AbdealiLoKo opened this issue Oct 30, 2016 · 1 comment

Comments

@AbdealiLoKo
Contributor

This is a feature request.

Decision trees lend themselves naturally to categorical variables. It would be nice if xgboost could handle categorical variables natively. For example, it could read the feature map to identify which features are categorical and then treat them as categories rather than as numbers.
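To illustrate what handling a categorical feature directly could look like, here is a minimal sketch of a single regression-tree split on categories. It is not how xgboost works; it uses the well-known trick of ordering categories by their mean target and scanning the resulting prefixes, with squared error as the assumed split criterion. The function name `best_categorical_split` and the toy airport data are hypothetical.

```python
from collections import defaultdict


def best_categorical_split(categories, targets):
    """Find a binary partition of categories that minimises squared error.

    Categories are ordered by mean target, then each of the k-1 prefixes of
    that ordering is tried as the left branch. Returns (left_set, sse), or
    None when there are fewer than two distinct categories.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1

    # Order categories by their mean target value.
    ordered = sorted(sums, key=lambda c: sums[c] / counts[c])

    def sse(group):
        # Sum of squared errors around the mean for rows whose category is in group.
        rows = [y for c, y in zip(categories, targets) if c in group]
        if not rows:
            return 0.0
        mean = sum(rows) / len(rows)
        return sum((y - mean) ** 2 for y in rows)

    best = None
    for i in range(1, len(ordered)):
        left = set(ordered[:i])
        right = set(ordered[i:])
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (left, score)
    return best


# Hypothetical toy data: airport code vs. some numeric target.
airports = ["JFK", "LAX", "SFO", "JFK", "LAX", "SFO"]
delays = [1.0, 10.0, 1.2, 0.8, 11.0, 1.0]
left, score = best_categorical_split(airports, delays)
print(left, score)
```

The split groups airports by how they behave with respect to the target, without needing any numeric order on the airport codes themselves.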

This is important because it is sometimes difficult to encode categorical variables as numerical values. For large data, one-hot encoding creates far too many columns, making the data blow up in size. Ordinal encoding, on the other hand, assumes an implicit order in the data that may not exist - for example, airport codes.
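As a rough illustration of the two encodings mentioned above, here is a minimal pure-Python sketch. The helper names `ordinal_encode` and `one_hot_encode` and the sample airport codes are hypothetical; in practice a library encoder would normally be used instead.

```python
def ordinal_encode(values):
    """Map each distinct value to an integer code (alphabetical order)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def one_hot_encode(values):
    """Map each value to a 0/1 row with one column per distinct value."""
    levels = sorted(set(values))
    index = {v: i for i, v in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return rows, levels


airports = ["SFO", "JFK", "LAX", "JFK"]

# Ordinal: one column, but it implies JFK < LAX < SFO, which is meaningless.
ordinal, codes = ordinal_encode(airports)
print(ordinal)  # [2, 0, 1, 0]

# One-hot: no implied order, but one column per distinct airport.
one_hot, levels = one_hot_encode(airports)
print(len(levels), one_hot[0])  # 3 [0, 0, 1]
```

With thousands of distinct airport codes, the one-hot version would have thousands of columns, while the ordinal version would stay at one column but carry an arbitrary ordering.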

Many other encodings exist, but why take a roundabout route when decision trees already lend themselves to categorical variables more naturally than to numerical ones?

@pommedeterresautee
Member

The policy of XGBoost is not to have special support for categorical variables. It's up to you to handle them before providing the features to the algorithm.

@lock lock bot locked as resolved and limited conversation to collaborators Oct 25, 2018