Support numpy categorical in xgboost sklearn (categorical_features param)
#7817
Context
xgboost==1.6.0 has released support for categorical data.
Right now, with the scikit-learn interface, the idea is to ingest a pandas/cudf DataFrame with the categorical columns typed as category.
The problem with that is that a lot of functions in sklearn output or expect an input of numpy array type. For example, preprocessors in sklearn.preprocessing, calibration wrappers like CalibratedClassifierCV, multioutput wrappers like MultiOutputRegressor, etc.
And while you can pass a numpy array and feature types to xgb.DMatrix, you can't pass them to XGBRegressor (so that it hands them to the underlying DMatrix).
Solution 1 external encoding
We can assume that the encoding from pandas DataFrame to numpy array is handled externally by the user, who then just declares feature_types in xgb.XGBClassifier. So the calling code would be like:
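A minimal sketch of the external-encoding idea, assuming pandas and numpy are available. The helper names `encode_externally` and `to_dmatrix` are hypothetical. Because an estimator-level `feature_types` argument is what this issue asks for, the sketch stops at `xgb.DMatrix`, which already accepts `feature_types` together with `enable_categorical=True`.

```python
import numpy as np
import pandas as pd


def encode_externally(df):
    """Hypothetical helper: DataFrame with `category` columns -> (array, feature types)."""
    columns, feature_types = [], []
    for name in df.columns:
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # Category codes are integers; pandas uses -1 for missing values.
            codes = col.cat.codes.to_numpy().astype(np.float32)
            codes[codes < 0] = np.nan
            columns.append(codes)
            feature_types.append("c")  # "c" marks a categorical feature
        else:
            columns.append(col.to_numpy(dtype=np.float32))
            feature_types.append("q")  # "q" marks a quantitative feature
    return np.column_stack(columns), feature_types


def to_dmatrix(X, y, feature_types):
    """Hypothetical helper: hand the encoded array and its types to xgboost."""
    # Imported lazily so the encoding step above can be used on its own.
    import xgboost as xgb

    return xgb.DMatrix(X, label=y, feature_types=feature_types, enable_categorical=True)


df = pd.DataFrame(
    {
        "color": pd.Categorical(["red", "blue", "red", "green"]),
        "size": [1.0, 2.5, 3.0, 0.5],
    }
)
X, feature_types = encode_externally(df)
# X is now a plain numpy array that sklearn wrappers can pass around;
# feature_types travels separately, e.g. to_dmatrix(X, y, feature_types).
```

The drawback of this route is that the user has to keep `feature_types` in sync with the array by hand whenever a preprocessing step reorders or drops columns.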
Solution 2 internal encoding
Here the encoding from pandas DataFrame to numpy array is handled internally by xgboost. This would mean creating a category list at fit time and storing it in the classifier, which would make the code a little more complex.
With this, the calling code would be like:
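A rough sketch of the bookkeeping this would require, assuming pandas and numpy. The class `StoredCategoryEncoder` and its attributes are hypothetical and are not part of xgboost; they only illustrate storing the category list at fit time and reusing it later, so that codes stay consistent between fit and predict.

```python
import numpy as np
import pandas as pd


class StoredCategoryEncoder:
    """Hypothetical helper: remembers the categories seen at fit time."""

    def fit(self, df):
        # Store the category list of every `category` column, keyed by column name.
        self.categories_ = {
            name: df[name].cat.categories
            for name in df.columns
            if isinstance(df[name].dtype, pd.CategoricalDtype)
        }
        self.feature_types_ = [
            "c" if name in self.categories_ else "q" for name in df.columns
        ]
        return self

    def transform(self, df):
        columns = []
        for name in df.columns:
            if name in self.categories_:
                # Re-encode against the stored categories so the codes match fit time;
                # values that were not seen at fit time get code -1.
                stored = self.categories_[name]
                codes = pd.Categorical(df[name], categories=stored).codes
                codes = codes.astype(np.float32)
                codes[codes < 0] = np.nan
                columns.append(codes)
            else:
                columns.append(df[name].to_numpy(dtype=np.float32))
        return np.column_stack(columns)


train = pd.DataFrame(
    {"color": pd.Categorical(["red", "blue", "green"]), "size": [1.0, 2.0, 3.0]}
)
new = pd.DataFrame({"color": pd.Categorical(["red", "purple"]), "size": [4.0, 5.0]})

encoder = StoredCategoryEncoder().fit(train)
X_new = encoder.transform(new)
```

In `new`, "red" keeps the code it had in `train`, while "purple" was never seen at fit time and becomes NaN. Whether unseen categories should map to missing or raise an error is a design choice the sketch does not settle.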
Naming
HistGradientBoostingClassifier calls the parameter that marks each feature as categorical or numerical categorical_features; maybe in xgboost it should be named analogously.
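To illustrate how an analogous parameter could map onto xgboost's existing feature types, here is a small sketch. The function `to_feature_types` is hypothetical. It assumes the spec is either a boolean mask or a list of integer column indices, two of the forms that scikit-learn's `categorical_features` accepts.

```python
import numpy as np


def to_feature_types(categorical_features, n_features):
    """Hypothetical helper: sklearn-style spec -> list of "c" / "q" feature types."""
    spec = np.asarray(categorical_features)
    if spec.dtype == bool:
        # Boolean mask: one entry per feature.
        if spec.shape != (n_features,):
            raise ValueError("boolean mask must have one entry per feature")
        mask = spec
    else:
        # Integer indices of the categorical columns.
        mask = np.zeros(n_features, dtype=bool)
        mask[spec.astype(int)] = True
    return ["c" if is_categorical else "q" for is_categorical in mask]


types_from_indices = to_feature_types([0, 2], n_features=4)
types_from_mask = to_feature_types([True, False], n_features=2)
```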