Factors #95

BlindApe opened this Issue Nov 11, 2014 · 6 comments



5 participants


How are factor variables treated?
Are they converted to numeric values according to the order of their levels?
One-hot encoded?
Or used as categorical variables, with categorical splitting?

tqchen commented Nov 11, 2014

I do not know what you mean by vector. xgboost treats every input feature as numerical, with support for missing values and sparsity. The decision is left to the user.

So if you want an ordered variable, you can transform it into numerical levels (say, age). Or if you prefer to treat it as a categorical variable, use one-hot encoding.
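To make the two options concrete, here is a minimal sketch in plain Python with no xgboost calls. The helper names `ordinal_encode` and `one_hot_encode` and the sample data are made up for illustration; the resulting numeric columns would then be handed to xgboost like any other numerical feature.

```python
def ordinal_encode(values, ordered_levels):
    """Map each level to its position in a user-supplied order."""
    position = {level: i for i, level in enumerate(ordered_levels)}
    return [position[v] for v in values]


def one_hot_encode(values):
    """Return (levels, rows): one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical data: one ordered factor and one unordered factor.
age_group = ["young", "old", "middle", "young"]
color = ["red", "green", "red", "blue"]

age_numeric = ordinal_encode(age_group, ["young", "middle", "old"])
color_levels, color_one_hot = one_hot_encode(color)
```

The ordinal version keeps a single column and lets a tree split on the order; the one-hot version makes no ordering assumption but adds one column per level.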

@tqchen tqchen closed this Nov 11, 2014
@tqchen tqchen reopened this Nov 11, 2014

tqchen (competing as crowwork) converted categorical variables to numeric variables in the Criteo competition by computing smoothed conditional probabilities of a click, given the level of the factor.

numeric variable = (number of clicks within a level + (mean click rate) * ballast)/
(number of records within the level + ballast)

This is similar to what R gbm does, except that gbm does not use ballast and limits the number of levels in a factor to 1024. I think this is a good approach for factors that have too many levels for one-hot encoding.
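The smoothing formula above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the code used in the competition: the function name `smoothed_rate_encoding`, the default ballast value, and the toy data are all assumptions.

```python
from collections import defaultdict


def smoothed_rate_encoding(levels, clicks, ballast=20.0):
    """Return {level: smoothed click rate} following the formula above."""
    n_records = defaultdict(int)
    n_clicks = defaultdict(int)
    for level, click in zip(levels, clicks):
        n_records[level] += 1
        n_clicks[level] += click

    mean_rate = sum(clicks) / len(clicks)  # overall mean click rate
    return {
        level: (n_clicks[level] + mean_rate * ballast) / (n_records[level] + ballast)
        for level in n_records
    }


# Hypothetical toy data: site "a" has 2 clicks in 3 records, site "b" has 0 in 1.
sites = ["a", "a", "a", "b"]
clicked = [1, 1, 0, 0]
encoding = smoothed_rate_encoding(sites, clicked, ballast=2.0)
site_numeric = [encoding[s] for s in sites]
```

A larger ballast pulls rarely seen levels more strongly toward the overall click rate. Because the encoding uses the target, it should be computed from training data only, otherwise label information leaks into the feature.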

Automating this would be a good future enhancement; I don't have time right now but at some point I plan to clone the repository and add some features from my wish list.

@tqchen tqchen added the question label Nov 12, 2014
@jakob-r jakob-r referenced this issue in mlr-org/mlr Nov 17, 2014

Integrate xgboost #182


This would be very interesting, but I think it would need substantial changes to tree growing, to how the model is saved, and of course to the prediction methods.
Support for factors isn't trivial.

tqchen commented Nov 24, 2014

All these things can be done as pre-processors, for example in a model that wraps xgboost: before train/predict, run the pre-processing and feed the processed data to xgboost. So it is not hard.

This is also the reason why I do not explicitly support factors in the tree construction algorithm. There are many ways of handling them, and in all of them, an algorithm optimized for sparse matrices can take the processed data efficiently.

Normal tree growing algorithms only support dense numerical features, and have to support one-hot encoded factors explicitly for reasons of computational efficiency.
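A wrapper of this kind could look roughly like the sketch below. Everything here is hypothetical: `PreprocessedModel`, `LevelIndexEncoder`, and `MeanPerCodeModel` are illustrative names, and the inner learner is a toy stand-in rather than xgboost, so that the example runs without any dependencies. Any learner exposing `fit` and `predict` could take its place.

```python
class LevelIndexEncoder:
    """Toy encoder: maps each level seen during fit to an integer code."""

    def fit(self, values, y=None):
        self.codes = {level: i for i, level in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Levels not seen during fit become None, i.e. a missing value.
        return [self.codes.get(v) for v in values]


class MeanPerCodeModel:
    """Stand-in learner (not xgboost): predicts the mean target per code."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, target in zip(X, y):
            sums[x] = sums.get(x, 0.0) + target
            counts[x] = counts.get(x, 0) + 1
        self.means = {x: sums[x] / counts[x] for x in sums}
        self.default = sum(y) / len(y)  # fallback for unseen or missing codes
        return self

    def predict(self, X):
        return [self.means.get(x, self.default) for x in X]


class PreprocessedModel:
    """Wrap a learner so the same encoding runs before fit and predict."""

    def __init__(self, encoder, model):
        self.encoder = encoder
        self.model = model

    def fit(self, values, y):
        self.encoder.fit(values, y)
        self.model.fit(self.encoder.transform(values), y)
        return self

    def predict(self, values):
        # Reuse the encoding learned in fit; it is not refitted on new data.
        return self.model.predict(self.encoder.transform(values))


wrapped = PreprocessedModel(LevelIndexEncoder(), MeanPerCodeModel())
wrapped.fit(["a", "b", "a"], [1.0, 0.0, 0.0])
predictions = wrapped.predict(["a", "b", "c"])
```

Keeping the encoder inside the wrapper means the caller passes raw factor values at both train and predict time, and the tree learner itself never needs to know about factors.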


Is it possible to get a pointer to the method described here for converting categorical data to numeric data?

@tqchen tqchen added the R-package label Jan 25, 2015
@tqchen tqchen closed this Jan 15, 2016

@tqchen LightGBM recently got support for categorical features. For columns with many categorical values (thousands), where one-hot encoding is hard, I got massive improvements. Without categorical feature support, xgboost is not even a possibility.
