
The result is different with Pandas 'category' and 'object' types #814

Closed
khrisanfov opened this issue May 9, 2019 · 5 comments


@khrisanfov

khrisanfov commented May 9, 2019

When I create a Pool by passing a Pandas DataFrame and a list of categorical feature indices, the learning result differs depending on whether the categorical columns in the DataFrame have dtype 'category' or 'object'. In the second case the result is much better, and the importance of the categorical features is higher than in the first case.

catboost version: 0.14.2
Operating System: Ubuntu 18.04
CPU: Intel Xeon E3

@annaveronika
Contributor

The 'category' dtype is not supported yet, please use 'object' for now.
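As a sketch of that workaround, every 'category' column can be cast back to 'object' before training (the frame and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with one 'category'-typed column
df = pd.DataFrame({
    'f1': [1, 2, 3],
    'f2': pd.Categorical(['a', 'b', 'a']),
})

# Workaround: cast every 'category' column to 'object' before training
cat_cols = df.select_dtypes(include='category').columns
df[cat_cols] = df[cat_cols].astype('object')

print(df.dtypes['f2'])  # object
```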

@annaveronika
Contributor

From the duplicate issue:

Can the sklearn API estimators recognize the difference between
an ordinal feature (created with pandas.api.types.CategoricalDtype(data, ordered=True))
and
a nominal feature (created with pandas.api.types.CategoricalDtype(data, ordered=False))?
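For context, the distinction the question refers to can be sketched in plain pandas (toy categories, nothing CatBoost-specific):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordinal: the categories have a meaningful order
ordinal = pd.Series(['low', 'high', 'mid']).astype(
    CategoricalDtype(['low', 'mid', 'high'], ordered=True))

# Nominal: the categories are unordered labels
nominal = pd.Series(['red', 'blue']).astype(
    CategoricalDtype(['red', 'green', 'blue'], ordered=False))

# Only the ordered dtype supports order comparisons
assert ordinal.dtype.ordered
assert (ordinal > 'low').tolist() == [False, True, True]
assert not nominal.dtype.ordered
```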

@andrey-khropov
Member

I wrote a quick test and cannot reproduce the difference in results between 'object', 'category' and default column types:

#!/usr/bin/env python3

import sys

import numpy as np

import catboost
from catboost import datasets
from sklearn.model_selection import train_test_split


print('Python version', sys.version)
print('CatBoost version', catboost.__version__)

(train_df, test_df) = datasets.amazon()
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1)

# Fix the split so all three runs train and evaluate on the same data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: default column types
model = catboost.CatBoostClassifier(iterations=30)
model.fit(X_train, y_train, eval_set=(X_valid, y_valid))
preds = model.predict(X_valid)

print('with category')

model = catboost.CatBoostClassifier(iterations=30)
model.fit(X_train.astype('category'), y_train, eval_set=(X_valid.astype('category'), y_valid))
preds_from_df_with_category = model.predict(X_valid.astype('category'))

assert np.allclose(preds, preds_from_df_with_category)

print('with object')

model = catboost.CatBoostClassifier(iterations=30)
model.fit(X_train.astype('object'), y_train, eval_set=(X_valid.astype('object'), y_valid))
preds_from_df_with_object = model.predict(X_valid.astype('object'))

assert np.allclose(preds, preds_from_df_with_object)

Maybe you mean that the result is different when you pass a DataFrame with 'category'-typed columns but do not specify the 'cat_features' parameter of the Pool constructor? This is possible because at the moment CatBoost does not automatically treat pandas.DataFrame columns with dtype='category' as categorical features; it treats them as numeric features by default instead. You still have to specify the 'cat_features' parameter if you have any categorical features. We will add a special check for this case because it is usually an error.

@khrisanfov
Author

@andrey-khropov this is an old issue; the 'category' dtype is supported in the latest CatBoost version, but I haven't tested it yet.

@annaveronika
Copy link
Contributor

Closing the issue then; let us know if there are any problems with this type.
