
support numpy categorical in xgboost sklearn (categorical_features param) #7817

Closed
iuiu34 opened this issue Apr 19, 2022 · 4 comments · Fixed by #7821



iuiu34 commented Apr 19, 2022

Context
xgboost==1.6.0 has released support for categorical data.
Right now, using the scikit-learn interface, the idea is to ingest a pandas/cuDF DataFrame with the categorical columns cast to the category dtype:

X["cat_feature"].astype("category")

The problem with that is that a lot of functions in sklearn output, or expect as input, numpy arrays.
For example, the preprocessors in sklearn.preprocessing, calibration wrappers like CalibratedClassifierCV, multioutput wrappers like MultiOutputRegressor, etc.

And while you can pass a numpy array together with feature types to xgb.DMatrix, you can't pass them to XGBRegressor (so that it hands them on to the underlying DMatrix).
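
Concretely, the low-level API already accepts this (a sketch against xgboost 1.6; the arrays are made up):

import numpy as np
import xgboost as xgb

# the categorical column is already ordinal-encoded into floats
X = np.array([[0.0, 1.5], [1.0, 2.5], [0.0, 3.5]])
y = np.array([0, 1, 0])

# works: DMatrix accepts feature_types directly ("c" = categorical, "q" = numeric)
dtrain = xgb.DMatrix(X, label=y, feature_types=["c", "q"], enable_categorical=True)

# no equivalent today: the sklearn wrapper has no feature_types parameter,
# so the categorical information is lost once the data is a plain numpy array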

Solution 1: external encoding

We can assume that the encoding from pandas DataFrame to numpy array is handled externally by the user, and then just declare feature_types on xgb.XGBClassifier.

So the calling code would look like:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, PolynomialFeatures

X = pd.DataFrame({'c1': ['a', 'a', 'b'], 'c2': [1, 2, 3]})
y = np.array([0, 1, 0])  # one label per row of X

pipeline = []

transformers = [('str', OrdinalEncoder(), ['c1'])]
transformers += [('num', "passthrough", ['c2'])]
transformers += [('polynomial', PolynomialFeatures(include_bias=False), ['c2'])]  # include_bias=False keeps the output at four columns

preprocess = ColumnTransformer(transformers)
pipeline += [('preprocess', preprocess)]

# feature_types is the proposed new parameter: one entry per transformed column,
# in the order produced by the ColumnTransformer above
learner = xgb.XGBClassifier(
    eval_metric="auc",
    enable_categorical=True,
    max_cat_to_onehot=1,
    feature_types=['c', 'float', 'float', 'float']
)

pipeline += [('learner', learner)]
pipeline = Pipeline(pipeline)

pipeline.fit(X, y)

type(pipeline['preprocess'].transform(X))  # numpy

Solution 2: internal encoding

Here the encoding from pandas DataFrame to numpy array is handled internally by xgb. This would mean creating a category list at fit time and storing it in the classifier, which would make the code a little more complex.

With this, the calling code would look like:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({'c1': ['a', 'a', 'b'], 'c2': [1, 2, 3]})
y = np.array([0, 1, 0])  # one label per row of X

pipeline = []

transformers = [('str', "passthrough", ['c1'])]
transformers += [('num', "passthrough", ['c2'])]
transformers += [('polynomial', PolynomialFeatures(include_bias=False), ['c2'])]  # include_bias=False keeps the output at four columns

preprocess = ColumnTransformer(transformers)
pipeline += [('preprocess', preprocess)]

# feature_types is the proposed new parameter: one entry per transformed column,
# in the order produced by the ColumnTransformer above
learner = xgb.XGBClassifier(
    eval_metric="auc",
    enable_categorical=True,
    max_cat_to_onehot=1,
    feature_types=['c', 'float', 'float', 'float']
)

pipeline += [('learner', learner)]
pipeline = Pipeline(pipeline)

pipeline.fit(X, y)

type(pipeline['preprocess'].transform(X))  # numpy
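
A rough idea of the bookkeeping Solution 2 implies inside the wrapper (purely illustrative; the class and attribute names here are hypothetical, not part of xgboost):

import numpy as np
import pandas as pd

class CategoryEncoderSketch:
    """Hypothetical fit-time category list that internal encoding would need."""

    def fit(self, X: pd.DataFrame):
        # remember the category list per categorical column
        self.categories_ = {
            col: X[col].astype("category").cat.categories
            for col in X.select_dtypes(include=["object", "category"]).columns
        }
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        # map each stored column to stable integer codes; unseen values become -1
        X = X.copy()
        for col, cats in self.categories_.items():
            X[col] = pd.Categorical(X[col], categories=cats).codes
        return X.to_numpy(dtype=float)

Keeping this mapping stable between fit and predict, and across model (de)serialization, is exactly the maintenance cost weighed in the discussion below.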

Naming
scikit-learn's HistGradientBoostingClassifier calls the parameter that flags each feature as categorical or numerical categorical_features:

categorical_features array-like of {bool, int} of shape (n_features) or shape (n_categorical_features,), default=None
    Indicates the categorical features.
        None : no feature will be considered categorical.
        boolean array-like : boolean mask indicating categorical features.
        integer array-like : integer indices indicating categorical features.
    For each categorical feature, there must be at most max_bins unique categories, and each categorical value must be in [0, max_bins - 1].

Maybe in xgboost this should be defined analogously.
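
For reference, the sklearn usage looks like this (a sketch against scikit-learn >= 1.0; the data is made up):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[0, 1.5], [1, 2.5], [0, 3.5]])
y = np.array([0, 1, 0])

# boolean mask: first column is categorical, second is numeric
clf = HistGradientBoostingClassifier(categorical_features=[True, False])
clf.fit(X, y)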

@trivialfis trivialfis added this to 2.0 TODO in 2.0 Roadmap via automation Apr 19, 2022
@trivialfis
Member

Thank you for the detailed solutions. Exposing feature_types to the sklearn interface is the next logical step; I will work on it in this release.

@trivialfis
Member

I think we will go for solution 1 for the following reasons:

  • Having an encoder inside XGBoost might be more problematic than one might expect: we would have to keep that encoder stable as part of the model and handle both CPU and GPU data structures.
  • I would like to keep the amount of data preprocessing code in XGBoost minimal and make sure its scope doesn't go beyond machine learning.


iuiu34 commented Apr 21, 2022

Do you plan to support categorical_features: List[bool] for naming compatibility with sklearn? Or will you keep just feature_types?

class XGBClassifier:
    (...)
    def __init__(self, ..., categorical_features=None):
        (...)
        if categorical_features is not None:
            self.feature_types = [
                'c' if is_categorical else 'float'
                for is_categorical in categorical_features
            ]
(...)

cl = XGBClassifier(categorical_features=[True, False, False])
cl.feature_types
# ['c', 'float', 'float']

@trivialfis
Member

For now, we will just keep feature_types as the interface.

@trivialfis trivialfis moved this from 2.0 TODO to 2.0 Done in 2.0 Roadmap Apr 22, 2022
@trivialfis trivialfis removed this from 2.0 Done in 2.0 Roadmap Sep 28, 2022
@trivialfis trivialfis added this to To do in 1.7 Roadmap via automation Sep 28, 2022
@trivialfis trivialfis moved this from To do to Done in 1.7 Roadmap Sep 28, 2022