
support numpy categorical in xgboost sklearn (categorical_features param) #7817

Closed
iuiu34 opened this issue Apr 19, 2022 · 4 comments · Fixed by #7821



iuiu34 commented Apr 19, 2022

Context
xgboost==1.6.0 has released support for categorical data.
Right now, using the scikit-learn interface, the idea is to ingest a pandas/cuDF DataFrame with the categorical columns cast to the category dtype:

X["cat_feature"].astype("category")

The problem with that is that a lot of functions in sklearn output, or expect as input, numpy arrays.
For example, the preprocessors in sklearn.preprocessing, calibration wrappers like CalibratedClassifierCV, multioutput wrappers like MultiOutputRegressor, etc.

And while you can pass a numpy array together with feature types to xgb.DMatrix, you can't pass them to XGBRegressor (so that it hands them on to the underlying DMatrix).
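
Concretely, the low-level API already accepts this (a sketch against xgboost 1.6; the arrays are made up):

import numpy as np
import xgboost as xgb

# the categorical column is already ordinal-encoded into floats
X = np.array([[0.0, 1.5], [1.0, 2.5], [0.0, 3.5]])
y = np.array([0, 1, 0])

# works: DMatrix accepts feature_types directly ("c" = categorical, "q" = numeric)
dtrain = xgb.DMatrix(X, label=y, feature_types=["c", "q"], enable_categorical=True)

# no equivalent today: the sklearn wrapper has no feature_types parameter,
# so the categorical information is lost once the data is a plain numpy array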

Solution 1: external encoding

We can assume that the encoding from pandas DataFrame to numpy array is handled externally by the user, and then just declare feature_types on xgb.XGBClassifier.

So the calling code would look like:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, PolynomialFeatures

X = pd.DataFrame({'c1': ['a', 'a', 'b'], 'c2': [1, 2, 3]})
y = np.array([0, 1, 0])  # one label per row of X

pipeline = []

transformers = [('str', OrdinalEncoder(), ['c1'])]
transformers += [('num', "passthrough", ['c2'])]
transformers += [('polynomial', PolynomialFeatures(include_bias=False), ['c2'])]  # include_bias=False keeps the output at four columns

preprocess = ColumnTransformer(transformers)
pipeline += [('preprocess', preprocess)]

# feature_types is the proposed new parameter: one entry per transformed column,
# in the order produced by the ColumnTransformer above
learner = xgb.XGBClassifier(
    eval_metric="auc",
    enable_categorical=True,
    max_cat_to_onehot=1,
    feature_types=['c', 'float', 'float', 'float']
)

pipeline += [('learner', learner)]
pipeline = Pipeline(pipeline)

pipeline.fit(X, y)

type(pipeline['preprocess'].transform(X))  # numpy

Solution 2: internal encoding

Here the encoding from pandas DataFrame to numpy array is handled internally by xgb. This would mean creating a category list at fit time and storing it in the classifier, which would make the code a little more complex.

With this, the calling code would look like:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({'c1': ['a', 'a', 'b'], 'c2': [1, 2, 3]})
y = np.array([0, 1, 0])  # one label per row of X

pipeline = []

transformers = [('str', "passthrough", ['c1'])]
transformers += [('num', "passthrough", ['c2'])]
transformers += [('polynomial', PolynomialFeatures(include_bias=False), ['c2'])]  # include_bias=False keeps the output at four columns

preprocess = ColumnTransformer(transformers)
pipeline += [('preprocess', preprocess)]

# feature_types is the proposed new parameter: one entry per transformed column,
# in the order produced by the ColumnTransformer above
learner = xgb.XGBClassifier(
    eval_metric="auc",
    enable_categorical=True,
    max_cat_to_onehot=1,
    feature_types=['c', 'float', 'float', 'float']
)

pipeline += [('learner', learner)]
pipeline = Pipeline(pipeline)

pipeline.fit(X, y)

type(pipeline['preprocess'].transform(X))  # numpy
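
A rough idea of the bookkeeping Solution 2 implies inside the wrapper (purely illustrative; the class and attribute names here are hypothetical, not part of xgboost):

import numpy as np
import pandas as pd

class CategoryEncoderSketch:
    """Hypothetical fit-time category list that internal encoding would need."""

    def fit(self, X: pd.DataFrame):
        # remember the category list per categorical column
        self.categories_ = {
            col: X[col].astype("category").cat.categories
            for col in X.select_dtypes(include=["object", "category"]).columns
        }
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        # map each stored column to stable integer codes; unseen values become -1
        X = X.copy()
        for col, cats in self.categories_.items():
            X[col] = pd.Categorical(X[col], categories=cats).codes
        return X.to_numpy(dtype=float)

Keeping this mapping stable between fit and predict, and across model (de)serialization, is exactly the maintenance cost weighed in the discussion below.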

Naming
scikit-learn's HistGradientBoostingClassifier calls the parameter that flags each feature as categorical or numerical categorical_features:

categorical_features array-like of {bool, int} of shape (n_features) or shape (n_categorical_features,), default=None
    Indicates the categorical features.
        None : no feature will be considered categorical.
        boolean array-like : boolean mask indicating categorical features.
        integer array-like : integer indices indicating categorical features.
    For each categorical feature, there must be at most max_bins unique categories, and each categorical value must be in [0, max_bins - 1].

Maybe in xgboost this should be defined analogously.
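
For reference, the sklearn usage looks like this (a sketch against scikit-learn >= 1.0; the data is made up):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[0, 1.5], [1, 2.5], [0, 3.5]])
y = np.array([0, 1, 0])

# boolean mask: first column is categorical, second is numeric
clf = HistGradientBoostingClassifier(categorical_features=[True, False])
clf.fit(X, y)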

@trivialfis trivialfis added this to 2.0 TODO in 2.0 Roadmap via automation Apr 19, 2022
@trivialfis
Member

Thank you for the detailed solutions. Exposing feature_types to the sklearn interface is the next logical step; I will work on it in this release.

@trivialfis
Member

I think we will go for solution 1 for the following reasons:

  • Having an encoder inside XGBoost might be more problematic than one might expect: we would have to keep that encoder stable as part of the model and handle both CPU and GPU data structures.
  • I would like to keep the amount of data preprocessing code in XGBoost minimal and make sure its scope doesn't go beyond machine learning.


iuiu34 commented Apr 21, 2022

Do you plan to support categorical_features: List[bool] for naming compatibility with sklearn? Or will you keep just feature_types?

class XGBClassifier:
    (...)
    def __init__(self, ..., categorical_features=None):
        (...)
        if categorical_features is not None:
            self.feature_types = [
                'c' if is_categorical else 'float'
                for is_categorical in categorical_features
            ]
(...)

cl = XGBClassifier(categorical_features=[True, False, False])
cl.feature_types
# ['c', 'float', 'float']

@trivialfis
Member

For now, we will just keep feature_types as the interface.

@trivialfis trivialfis moved this from 2.0 TODO to 2.0 Done in 2.0 Roadmap Apr 22, 2022
@trivialfis trivialfis removed this from 2.0 Done in 2.0 Roadmap Sep 28, 2022
@trivialfis trivialfis added this to To do in 1.7 Roadmap via automation Sep 28, 2022
@trivialfis trivialfis moved this from To do to Done in 1.7 Roadmap Sep 28, 2022