
The result is different with Pandas 'category' and 'object' types #814

Closed
khrisanfov opened this issue May 9, 2019 · 5 comments


@khrisanfov

khrisanfov commented May 9, 2019

When I create a Pool by passing a Pandas DataFrame and a list of categorical feature indices, the learning result differs depending on whether the categorical columns in the DataFrame have dtype 'category' or 'object'. In the second case the result is much better, and the importance of the categorical features is higher than in the first case.

catboost version: 0.14.2
Operating System: Ubuntu 18.04
CPU: Intel Xeon E3

@annaveronika
Contributor

The 'category' dtype is not supported yet, please use 'object' for now.
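As a sketch of that workaround, every 'category' column can be cast back to 'object' before training (the frame and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with one 'category'-typed column
df = pd.DataFrame({
    'f1': [1, 2, 3],
    'f2': pd.Categorical(['a', 'b', 'a']),
})

# Workaround: cast every 'category' column to 'object' before training
cat_cols = df.select_dtypes(include='category').columns
df[cat_cols] = df[cat_cols].astype('object')

print(df.dtypes['f2'])  # object
```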

@annaveronika
Contributor

From the duplicate issue:

Can the sklearn API estimators recognize the difference between
an ordinal feature (created with pandas.api.types.CategoricalDtype(data, ordered=True))
and
a nominal feature (created with pandas.api.types.CategoricalDtype(data, ordered=False))?
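For context, the distinction the question refers to can be sketched in plain pandas (toy categories, nothing CatBoost-specific):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordinal: the categories have a meaningful order
ordinal = pd.Series(['low', 'high', 'mid']).astype(
    CategoricalDtype(['low', 'mid', 'high'], ordered=True))

# Nominal: the categories are unordered labels
nominal = pd.Series(['red', 'blue']).astype(
    CategoricalDtype(['red', 'green', 'blue'], ordered=False))

# Only the ordered dtype supports order comparisons
assert ordinal.dtype.ordered
assert (ordinal > 'low').tolist() == [False, True, True]
assert not nominal.dtype.ordered
```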

@andrey-khropov
Member

I wrote a quick test and cannot reproduce the difference in results between 'object', 'category' and default column types:

#!/usr/bin/env python3

import sys

import numpy as np

import catboost
from catboost import datasets
from sklearn.model_selection import train_test_split


print('Python version', sys.version)
print('CatBoost version', catboost.__version__)

(train_df, test_df) = datasets.amazon()
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1)

# Fix the split so all three runs train and evaluate on the same data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: default column types
model = catboost.CatBoostClassifier(iterations=30)
model.fit(X_train, y_train, eval_set=(X_valid, y_valid))
preds = model.predict(X_valid)

print('with category')

model = catboost.CatBoostClassifier(iterations=30)
model.fit(X_train.astype('category'), y_train, eval_set=(X_valid.astype('category'), y_valid))
preds_from_df_with_category = model.predict(X_valid.astype('category'))

assert np.allclose(preds, preds_from_df_with_category)

print('with object')

model = catboost.CatBoostClassifier(iterations=30)
model.fit(X_train.astype('object'), y_train, eval_set=(X_valid.astype('object'), y_valid))
preds_from_df_with_object = model.predict(X_valid.astype('object'))

assert np.allclose(preds, preds_from_df_with_object)

Maybe you mean that the result is different when you pass a DataFrame with 'category'-typed columns but do not specify the 'cat_features' parameter of the Pool constructor? This is possible because at the moment CatBoost does not automatically treat pandas.DataFrame columns with dtype='category' as categorical features; it treats them as numeric features by default instead. You still have to specify the 'cat_features' parameter if you have any categorical features. We will add a special check for this case because it is usually an error.

@khrisanfov
Author

@andrey-khropov this is an old issue; the 'category' dtype is supported in the latest CatBoost version, but I haven't tested it yet.

@annaveronika
Copy link
Contributor

Closing the issue then; let us know if there are any problems with this type.
