-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow float categories to be passed into CatBoost estimators #3966
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3966 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 347 347
Lines 36801 36892 +91
=======================================
+ Hits 36681 36772 +91
Misses 120 120
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
evalml/pipelines/components/utils.py
Outdated
# --> there has to be a better way | ||
# - just string raises TypeError: Cannot convert StringArray to numpy.ndarray | ||
# - just object raises the original cat_features must be integer or string, real number values error | ||
X_float_cats = X[float_cols].astype("string").astype("object").astype("category") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of bugginess with making the float categories usable as strings, and based on the CatBoost FAQ on why they don't support float categories in the first place, I'm not fully convinced we should support this (note this is different from the case where we can just convert to Integer without truncating values, which is the main thing we're adding support for here).
- If we convert to just
string
, we getTypeError: Cannot convert StringArray to numpy.ndarray
for which there's an open catboost ticket - If we convert to just
object
, that doesn't actually change the underlying float values, so we get the original error. But the combination ofstring
andobject
conversions does indeed work (only in that order, though) - If you try and change type in one fell swoop by creating a
pd.CategoricalDtype(categories=col_categories.astype("string").astype("object"))
dtype that then gets passed into a singleastype
call, it just nulls out the whole column (pandas ticket for this bug: BUG: SettingCategoricalDtype
categories as string objects nulls out data pandas-dev/pandas#51074)
The result is this ugly series of chained astype
calls . Given the messiness introduced by handling the "true float" case and the fact that this is the exact thing CatBoost didn't want to handle, I'm going to suggest we raise our own error for now. I'll add comments with the tickets tracking the bugs described above so that if a need for these string floats arises, we may be able to implement it nicely if any of the bugs are fixed.
self._component_obj.fit(X, y, silent=True, cat_features=cat_cols) | ||
return self | ||
|
||
def predict(self, X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured if I was going to handle categories for the regressor (and I did double check it was necessary for predict
!) I'd have to add an explicit predict method -- let me know if there's a better way that I missed though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
evalml_pipeline = MulticlassClassificationPipeline([RandomForestClassifier]) | ||
evalml_pipeline = MulticlassClassificationPipeline( | ||
[RandomForestClassifier], | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be in a weird state in VSCode with the new ruff linting, so idk if this and the above change are expected or indicative of my own weird environment. Doesn't seem like I have any linting issues in this PR so I'm fine leaving as is.
640e7c7
to
e4311b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - nice work!
self._component_obj.fit(X, y, silent=True, cat_features=cat_cols) | ||
return self | ||
|
||
def predict(self, X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
e4311b1
to
d33ba2a
Compare
fixes #3965
Since catboost cannot handle float categories, converts categories that are really integers to be
int64