
Catboost for Imbalanced Data Sets #223

Closed · carlosr29 opened this issue Jan 13, 2018 · 12 comments

@carlosr29 commented Jan 13, 2018

Is there a parameter like "scale_pos_weight" in the catboost package, as there is in the xgboost package in Python, to handle imbalanced classes?

I know there is a parameter called "class_weights", but the official documentation (https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_parameters-list-docpage/#python-reference_parameters-list) does not explain clearly whether it helps with the imbalance problem, or how to set it.

Thanks in advance.

@annaveronika (Contributor)

Hi, sorry for the late reply. We'll update the documentation and add a scale_pos_weight parameter.

@annaveronika (Contributor)

For now you can use class_weights in the following way: set weight 1 for class 0 and weight scale_pos_weight for class 1. This is equivalent to having a scale_pos_weight parameter.
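
For example, a minimal sketch of that workaround (the weight value 10 and the names X_train, y_train are illustrative placeholders):

from catboost import CatBoostClassifier

# Weight 1 for class 0, weight 10 for class 1 -- mimics scale_pos_weight=10
model = CatBoostClassifier(class_weights=[1, 10], loss_function='Logloss')
model.fit(X_train, y_train)  # X_train, y_train stand in for your training data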

@annaveronika (Contributor)

The parameter was added in the latest release.

@annaveronika (Contributor)

@carlosr29 We are currently working on improving quality on imbalanced datasets for binary classification.
If you could share your dataset with us, it would be very helpful.

@elfwired

@annaveronika Is there a way to work with imbalanced datasets when solving regression problems?

@abhi070493

@annaveronika Is there a way to downscale predicted scores after using class_weights? (I have noticed that the model over-predicts when using class_weights, and I need point estimates for my problem.)
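
One known prior-correction sketch for this (not a CatBoost API; it assumes that upweighting the positive class by a factor w scales the model's predicted odds by w):

# Map a probability from a model trained with class_weights=[1, w]
# back to the unweighted scale by dividing the predicted odds by w
def downscale(p_weighted, w):
    return p_weighted / (p_weighted + w * (1.0 - p_weighted))

print(downscale(0.9, 10))  # ~0.47 for a raw prediction of 0.9 with w=10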

@The-Gupta

I'm still getting the error; I tried scale_pos_weight as well:

classifier = CatBoostRegressor(class_weights = [0.8, 0.2])
Traceback (most recent call last):

  File "<ipython-input-49-9098f24b0f97>", line 1, in <module>
    classifier = CatBoostRegressor(class_weights = [0.8, 0.2])

TypeError: __init__() got an unexpected keyword argument 'class_weights'

Could you check whether the version could be the problem?

@Evgueni-Petrov-aka-espetrov (Contributor)

Please use CatBoostClassifier.
The class_weights parameter is meaningful only for the Logloss, MultiClass, and MultiClassOneVsAll loss functions, while CatBoostRegressor implies that the loss function is RMSE, MAE, Quantile, LogLinQuantile, Poisson, or MAPE.
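
A minimal sketch of the corrected call from the traceback above (weights taken from that snippet):

from catboost import CatBoostClassifier

# class_weights belongs to the classifier, not the regressor
classifier = CatBoostClassifier(class_weights=[0.8, 0.2])

For an imbalanced regression problem, per-object weights can be supplied instead, for example via the weight argument of catboost.Pool.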

@Sandy4321

It is written above: "Hi, sorry for the late reply, we'll update the documentation and add a scale_pos_weight parameter."
But in the documentation I see scale_pos_weight (alias for: class_weights) (https://catboost.ai/docs/concepts/python-reference_parameters-list.html), so was what you added only a synonym?
It is also still not clear how to use it (https://stackoverflow.com/questions/54437646/catboost-precision-imbalanced-classes). Do you have a clear code example?
For example, in this video the data is imbalanced, but the model is trained as if for balanced data: https://www.youtube.com/watch?v=xl1fwCza9C8&t=44s (code: https://github.com/catboost/tutorials/blob/master/events/pydata_moscow_oct_13_2018.ipynb).

@annaveronika (Contributor)

scale_pos_weight sets the weight for objects of class 1 (the positive class). This is equivalent to setting class_weights to [1, {scale_pos_weight value}]. To deal properly with imbalanced data, you can experiment with either of these two parameters, or try oversampling.
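
A sketch of the equivalence (the weight value 5 is illustrative):

from catboost import CatBoostClassifier

# These two configurations weight the positive class identically
m1 = CatBoostClassifier(scale_pos_weight=5)
m2 = CatBoostClassifier(class_weights=[1, 5])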

@sskarkhanis commented Oct 16, 2019

Question about class_weights for multi-class problems.

In the CatBoost documentation here: https://catboost.ai/docs/concepts/python-reference_parameters-list.html

I see that class_weights is passed as a list; the documentation shows a binary-classification example, class_weights=[0.1, 4], which works fine for binary classification.

I know I can pass a list whose length equals the number of classes, but how does CatBoost assign these weights to the appropriate labels in a multi-class context?

I calculated the class weights using the sklearn utility as follows:

import numpy as np
from sklearn.utils import class_weight

# 'balanced' weights: n_samples / (n_classes * count(class)), one per class
cw = list(class_weight.compute_class_weight('balanced',
                                            classes=np.unique(df_train['Target']),
                                            y=df_train['Target']))

and get a list, e.g. [0.5, 4.5, 7.5, 3.4].
If I pass the list as-is, my model's performance is worse than without the class_weights option.

How do I address this? Would it be an option to allow class_weights to accept a dictionary?
e.g. class_weights = { "class_A": 3.5, "class_B": 4.5, "class_C": 0.5 }
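
For reference, a sketch of one way to make the mapping explicit, assuming CatBoost pairs class_weights with labels in sorted order (the order np.unique returns) and that the class_names parameter can pin that ordering (check the docs for your version):

import numpy as np
from catboost import CatBoostClassifier
from sklearn.utils import class_weight

labels = np.unique(df_train['Target'])  # sorted unique labels
cw = class_weight.compute_class_weight('balanced',
                                       classes=labels,
                                       y=df_train['Target'])

# Spell out the label-to-weight pairing instead of relying on implicit order
model = CatBoostClassifier(loss_function='MultiClass',
                           class_names=list(labels),
                           class_weights=list(cw))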

@annaveronika (Contributor)

@sskarkhanis Could you please create a separate issue about this?
