
@kizill released this Aug 5, 2021

R package

  • Supported text features in R package, thanks to @Glemhel!
  • Supported virtual ensembles in R, thanks to @Glemhel!

New features

  • Thanks to @gmrandazzo for adding multiregression with missing values on targets: the MultiRMSEWithMissingValues loss function
  • Supported multiclass prediction in C++ wrapper for model inference C API

Bugfixes

  • Renamed the keyword parameter in the predict_proba function from X to data; fixes #1785 (see the sketch after this list)
  • R feature importances: removed the pool argument; fixes #1438 and #1772
  • Fixed CUDA training on Windows (multiple issues; the main issue with details is #1735)
  • Issue #1728: don't dereference pointers when there are no features
  • Fixed empty tree processing in feature strength calculation
  • Fixed missing loss graph points in select_features, #1775
  • Sort CSR matrix indices; fixes #1749
  • Fixed the error "active CatBoost worker is already present in the current process" after a previous training interruption or failure. #1795.
  • Fixed erroneous warnings from model validation after training with a custom loss or custom error function. Fixes #873 and #1169
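
For reference, a minimal sketch of the renamed keyword (toy data, illustrative only):

```python
from catboost import CatBoostClassifier

# toy data just for illustration
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [0, 1, 1, 0]

model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(X, y)

proba = model.predict_proba(data=X)  # the keyword is now `data`, not `X`
```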

@andrey-khropov released this Jun 3, 2021

New features

  • #972. Add model evaluation on GPU. Thanks to @rakalexandra.
  • Support Langevin on GPU
  • Save class labels to models in cross validation
  • #1524. Return models after CV. Thanks to @vklyukin
  • [Python] #766. Add CatBoostRanker & pool.get_group_id_hash() for ranking (see the sketch after this list). Thanks to @AnnaAraslanova
  • #262. Make CatBoost widget work in jupyter lab. Thanks to @Dm17r1y
  • [GPU only] Allow adding an exponent to the score aggregation function
  • Allow specifying the decision threshold for binary classification models. Thanks to @Keksozavr.
  • [C Model API] #503. Allow to specify prediction type.
  • [C Model API] #1201. Get predictions for a specific class.
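
A minimal ranking sketch with the new CatBoostRanker (synthetic data; parameter values illustrative):

```python
import numpy as np
from catboost import CatBoostRanker, Pool

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 5, size=100)         # graded relevance labels
group_id = np.repeat(np.arange(10), 10)  # 10 queries, 10 documents each

train = Pool(X, label=y, group_id=group_id)
ranker = CatBoostRanker(iterations=50, loss_function='YetiRank', verbose=False)
ranker.fit(train)
scores = ranker.predict(train)
```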

Breaking changes

  • #1628. Use CUDA 11 by default. CatBoost GPU now requires driver version >= 450.51.06 on Linux x86_64 and >= 451.82 on Windows x86_64.

Losses and metrics

  • Add MRR and ERR metrics on CPU.
  • Add LambdaMart loss.
  • #1557. Add survivalAFT base logic. Thanks to @blatr.
  • #1286. Add Cox Proportional Hazards Loss. Thanks to @fibersel.
  • #1595. Provide an object-oriented interface for setting up metric parameters (see the sketch after this list). Thanks to @ks-korovina.
  • Change default YetiRank decay to 0.85 for better quality.
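
A short sketch of the object-oriented interface, assuming the catboost.metrics module introduced by #1595 (parameter values illustrative):

```python
from catboost import CatBoostClassifier, metrics

model = CatBoostClassifier(
    iterations=100,
    loss_function=metrics.Logloss(),  # metric/loss objects instead of strings
    eval_metric=metrics.AUC(),
    custom_metric=[metrics.Precision(), metrics.Recall()],
)
```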


Speedups

  • [Python] Speed up custom metrics and objectives with numba (if available)
  • [Python] #1710. Large speedup for cv dataset splitting by sklearn splitter

Other

  • Use the Exact leaf estimation method as the default on GPU
  • [Spark] #1632. Update version of Scala 2.11 for security reasons.
  • [Python] #1695. Explicitly specify WHEEL 'Root-Is-Purelib' value

Bugfixes

  • Fix default projection dimension for embeddings
  • Fix use_weights for some eval_metrics on GPU - use_weights=False is always respected now
  • [Spark] #1649. The earlyStoppingRounds parameter is not recognized
  • [Spark] #1650. Error when using the autoClassWeights parameter
  • [Spark] #1651. Error about "Auto-stop PValue" when using odType "Iter" and odWait
  • Fix usage of pairlogit weights for CPU fallback metrics when training on GPU

@kizill released this Apr 5, 2021

Speedup

  • CatBoost now uses non-owning NumPy arrays to pass C++ data to user-defined metric and loss functions in Python. This opens up many speedup possibilities: using these arrays in numba.jit-compiled code, in Cython code, or just with NumPy vector functions. Thanks @micyril!
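
For illustration, a minimal sketch of a numba-accelerated custom objective; it follows the documented calc_ders_range interface, and the derivative formulas are just the standard squared-error ones:

```python
import numpy as np
from numba import njit
from catboost import CatBoostRegressor

@njit
def rmse_ders(approxes, targets):
    # first/second derivatives of the squared-error objective w.r.t. the approx
    der1 = targets - approxes
    der2 = -np.ones_like(approxes)
    return der1, der2

class NumbaRmseObjective:
    def calc_ders_range(self, approxes, targets, weights):
        # approxes/targets arrive as non-owning NumPy arrays (zero copy),
        # so they can be handed to jitted code directly
        der1, der2 = rmse_ders(np.asarray(approxes), np.asarray(targets))
        return list(zip(der1, der2))

model = CatBoostRegressor(loss_function=NumbaRmseObjective(),
                          eval_metric='RMSE', iterations=10, verbose=False)
```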


@andrey-khropov released this Mar 24, 2021

CatBoost for Apache Spark

This release includes the CatBoost for Apache Spark package, which supports training, model application and feature evaluation on the Apache Spark platform. We've prepared two introductory videos, CatBoost for Apache Spark Introduction and CatBoost for Apache Spark Architecture. More details are available on the CatBoost for Apache Spark home page.

Feature selection

CatBoost now supports a recursive feature elimination procedure: when you have many candidate features and want to keep only the most influential ones, CatBoost trains models and selects the strongest features by feature importance. You can find the details in our tutorial, and a sketch of the Python interface below.
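
A sketch of that interface, assuming the select_features parameters and summary keys as we understand them from the docs (values here are illustrative):

```python
import numpy as np
from catboost import CatBoostRegressor, Pool, EFeaturesSelectionAlgorithm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = X[:, 0] * 2 + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=500)

model = CatBoostRegressor(iterations=100, verbose=False)
summary = model.select_features(
    Pool(X, y),
    features_for_select=list(range(50)),  # candidate features
    num_features_to_select=5,             # keep only the strongest ones
    steps=3,                              # eliminate over 3 rounds
    algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
    train_final_model=True,
)
print(summary['selected_features'])
```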

New features

  • Supported the Exact leaves estimation method for Quantile, MAE and MAPE losses on GPU (see the sketch after this list). You can enable it by setting leaf_estimation_method=Exact explicitly; in upcoming releases we plan to make it the default.
  • Supported uncertainty prediction for multiclassification models
  • #1568 Added support for SHAP values calculation for MultiRMSE models
  • #1520 Added support for pathlib.Path in python package
  • #1456 Added prehashed categorical features and text features to C API for model inference.
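
Enabling the Exact method explicitly, as described in the first item above (loss choice illustrative):

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    loss_function='Quantile:alpha=0.9',
    task_type='GPU',
    leaf_estimation_method='Exact',  # explicit opt-in for now
)
```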

Losses and metrics

  • Supported Huber and Tweedie losses in GPU training
  • QueryAUC metric implemented by @fibersel

Breaking changes

  • We changed the NDCG calculation principle for groups without relevant documents to make our NDCG score fully compatible with the XGBoost and LightGBM implementations. Now we set dcg==1 when there are no relevant objects in a group (when the ideal DCG equals zero); previously we used score==0 in that case.

Speedups

  • With the help of the Intel developer team we switched our threading model implementation to Intel Threading Building Blocks. This gives up to 20% speedup on 28 threads, around a 2x speedup when training on 120 threads, and largely improves scalability.
  • Sped up rendering of fstr plots.
  • Slightly speed up string casting in python package during pool creation.

R package

  • Added path expansion when saving/loading files in R by @david-cortes
  • Added functionality to restore R handle after deserializing model by @david-cortes
  • Retrieve R pointers outside loops to speed up scalar access by @david-cortes
  • Multiple R documentation edits from @david-cortes and @jameslamb
  • #1588 Added precision for converting params to json

Bugfixes

  • #1525 Problem with missing exported functions in Windows R package dll
  • #1315 Low CPU utilization in CPU cross-validation
  • #785 Predict on single item with iloc fixed by @feeeper
  • Segfaults due to null pointer in pool in R package fixed by @david-cortes
  • #1553 Added check for baseline dimensions count in apply
  • #1606 Allow using CatBoost in the AWS Lambda environment: fixed a bug with setting thread names
  • #1609 and #1309 Print proper error message if all params in grid were invalid
  • Ability to use docstrings in estimators added by @pawelopiela
  • Allow extra space at the end of line for libsvm format

Thanks!


@kizill released this Dec 27, 2020

Release 0.24.4

Speedup

  • Major speedup of asymmetric tree training time on CPU (2x speedup on Epsilon with 16 threads). We would like to recognize the Intel software engineering team's contributions to the CatBoost project.

New features

  • From now on we are releasing Python 3.9 wheels. Related issues: #1491, #1509, #1510
  • Allow boost_from_average for MultiRMSE loss. Issue #1515
  • Add tag pairwise=False for sklearn compatibility. Fixes issue #1518

Bugfixes:

  • Allow fstr calculation for datasets with embeddings
  • Fix feature_importances_ for fstr with texts
  • Virtual ensembles fix: use proper unshrinkage coefficients
  • Fixed constants in the RMSEWithUncertainty loss function calculation to match the values from the original paper
  • Allow SHAP values calculation for models with zero-weight leaves and non-zero leaf values. Now we use the sum of leaf weights on the train and current datasets to guarantee non-zero weights for leaves reachable on the current dataset. Fixes issues #1512, #1284

@kizill released this Nov 18, 2020

Release 0.24.3

New functionality

  • Support fstr for text features and embeddings. Issue #1293

Bugfixes:

  • Fix model apply speed regression introduced in 0.24.1
  • Different fixes in embeddings support: fixed apply and model serialization, fixed apply on texts and embeddings
  • Fixed virtual ensembles prediction - use proper scaling, fix apply (issue #1462)
  • Fix score() method for RMSEWithUncertainty issue #1482
  • Automatically use correct prediction_type in score()

@kizill released this Oct 7, 2020

Uncertainty prediction

  • Supported uncertainty prediction for classification models.
  • Fixed RMSEWithUncertainty data uncertainty prediction - now it predicts variance, not standard deviation.

New functionality

  • Allow categorical feature counters for MultiRMSE loss function.
  • group_weight parameter added to the catboost.utils.eval_metric method to allow passing weights for object groups (see the sketch after this list). This allows weighted ranking metrics computation to be matched correctly when group weights are present.
  • Faster non-owning deserialization from memory with less memory overhead: moved some dynamically computed data to the model file; other data is computed lazily, only when needed.
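
A minimal sketch of the new parameter (labels and approxes illustrative):

```python
from catboost.utils import eval_metric

labels = [1, 0, 1, 0]
approxes = [0.9, 0.1, 0.2, 0.7]
group_id = [0, 0, 1, 1]
group_weight = [2.0, 2.0, 0.5, 0.5]  # one weight per object, constant within a group

ndcg = eval_metric(labels, approxes, 'NDCG',
                   group_id=group_id, group_weight=group_weight)
```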

Experimental functionality

  • Supported embedding features as input and linear discriminant analysis for embeddings preprocessing. Try adding your embeddings as new columns with embedding value arrays in a pandas.DataFrame and passing the corresponding column names to the Pool constructor or fit function with the embedding_features=['EmbeddingFeaturesColumnName1', ...] parameter (see the sketch below). Another way to add your embedding vectors is the new column type NumVector in the Column Description file, with a semicolon-separated embeddings column in your XSV file: ClassLabel\t0.1;0.2;0.3\t....
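
A minimal sketch of the pandas route (column names illustrative):

```python
import numpy as np
import pandas as pd
from catboost import Pool

df = pd.DataFrame({
    'label': [0, 1, 0],
    'embedding': [np.array([0.1, 0.2, 0.3]),
                  np.array([0.4, 0.5, 0.6]),
                  np.array([0.0, 0.1, 0.9])],
})

pool = Pool(df[['embedding']], label=df['label'],
            embedding_features=['embedding'])
```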

Educational materials

  • Published new tutorial on uncertainty prediction.

Bugfixes:

  • Reduced GPU memory usage in multi-GPU training when there is no need to compute categorical feature counters.
  • CatBoost now allows specifying use_weights for metrics when the auto_class_weights parameter is set.
  • Correctly handle NaN values in the plot_predictions function.
  • Fixed bugs related to floating point precision drop during multiclass training with lots of objects; in our case, the bug was triggered while training on 25 mln objects on a single GPU card.
  • The average parameter is now passed to the TotalF1 metric while training on GPU.
  • Added class label checks
  • Disallow feature remapping in model predict when there are empty feature names in the model.

@kizill released this Aug 27, 2020

Uncertainty prediction

The main feature of this release is support for total uncertainty prediction via virtual ensembles.
You can read the theoretical background in the preprint Uncertainty in Gradient Boosting via Ensembles from our research team.
We introduced the new training parameter posterior_sampling, which allows estimating total uncertainty.
Setting posterior_sampling=True implies enabling Langevin boosting, setting model_shrink_rate to 1/(2*N) and setting diffusion_temperature to N, where N is the dataset size.
The CatBoost object method virtual_ensembles_predict splits the model into virtual_ensembles_count submodels.
Calling model.virtual_ensembles_predict(.., prediction_type='TotalUncertainty') returns the mean prediction and variance (and knowledge uncertainty for models trained with the RMSEWithUncertainty loss function).
Calling model.virtual_ensembles_predict(.., prediction_type='VirtEnsembles') returns virtual_ensembles_count predictions of the virtual submodels for each object.
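
Putting it together in a minimal sketch (synthetic data; parameter values illustrative):

```python
import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)

# posterior_sampling=True enables Langevin boosting with the
# model_shrink_rate / diffusion_temperature settings described above
model = CatBoostRegressor(iterations=200, posterior_sampling=True,
                          loss_function='RMSEWithUncertainty', verbose=False)
model.fit(X, y)

total = model.virtual_ensembles_predict(
    X, prediction_type='TotalUncertainty', virtual_ensembles_count=10)
per_model = model.virtual_ensembles_predict(
    X, prediction_type='VirtEnsembles', virtual_ensembles_count=10)
```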

New functionality

  • Supported non-owning model deserialization for models with categorical feature counters

Speedups

  • We've implemented many speedups for sparse data loading. For example, preprocessing of the Bosch sparse dataset got a 4.5x speedup when running with 28 threads.

Bugfixes:

  • Fixed target check for PairLogitPairwise on GPU. Issue #1217
  • Supported n_features_in_ attribute required for using CatBoost in sklearn pipelines. Issue #1363

@kizill released this Aug 5, 2020

New functionality

  • We've finally implemented MVS sampling for GPU training, and switched the default bootstrap algorithm to MVS for the RMSE loss function when training on GPU
  • Implemented near-zero-cost model deserialization from a memory blob. Currently, if your model doesn't use categorical feature CTR counters or text features, you can deserialize the model from, for example, a memory-mapped file.
  • Added the ability to load trained models from a binary string or a file-like stream. To load a model from a bytes string use load_model(blob=b'....'); to deserialize from a file-like stream use load_model(stream=gzip.open('model.cbm.gz', 'rb')) (see the sketch after this list)
  • Fixed auto-learning rate estimation params for GPU
  • Supported beta parameter for QuerySoftMax function on CPU and GPU
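
A quick sketch of the two loading paths described above (file names illustrative):

```python
import gzip
from catboost import CatBoostClassifier

model = CatBoostClassifier()

# from a bytes string
with open('model.cbm', 'rb') as f:
    model.load_model(blob=f.read())

# from a file-like stream, e.g. a gzip-compressed model
model.load_model(stream=gzip.open('model.cbm.gz', 'rb'))
```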

New losses and metrics

  • New loss function RMSEWithUncertainty: it allows estimating data uncertainty for trained regression models. The trained model gives you a two-element vector for each object, with the first element being the regression prediction and the second an estimate of the data uncertainty for that prediction (see the sketch below).
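
A minimal sketch of what predictions look like with this loss (synthetic data; the column layout follows the description above):

```python
import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.2, size=500)

model = CatBoostRegressor(loss_function='RMSEWithUncertainty',
                          iterations=100, verbose=False)
model.fit(X, y)

# two columns per object: [prediction, data uncertainty estimate]
preds = model.predict(X)
```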

Speedups

  • Major speedups for CPU training: kdd98 -9%, higgs -18%, msrank -28%. We would like to recognize the Intel software engineering team's contributions to the CatBoost project. This was a mutually beneficial activity, and we look forward to continuing our cooperation.

Bugfixes:

  • Fixed CatBoost model export as Python code
  • Fixed AUC metric creation
  • Add text features to model.feature_names_. Issue #1314
  • Allow models trained on datasets with NaN values (Min treatment) and without NaNs to be used in model_sum() or as the base model in init_model=. Issue #1271


@nikitxskv released this May 26, 2020

New functionality

  • Added the plot_partial_dependence method in the python-package (currently it works for models with symmetric trees trained on datasets with numerical features only). Implemented by @felixandrer.
  • Allowed using the boost_from_average option together with the model_shrink_rate option. In this case shrinkage is applied to the starting value.
  • Added the new auto_class_weights option in the python-package, R-package and CLI, with possible values Balanced and SqrtBalanced. For Balanced, every class is weighted maxSumWeightInClass / sumWeightInClass, where sumWeightInClass is the sum of the weights of all samples in this class (if no weights are present, each sample's weight is 1) and maxSumWeightInClass is the maximum such sum among all classes. For SqrtBalanced the formula is sqrt(maxSumWeightInClass / sumWeightInClass). This option is supported in binclass and multiclass tasks (see the sketch after this list). Implemented by @egiby.
  • Supported the model_size_reg option on GPU, set to 0.5 by default (same as on CPU). This regularization works slightly differently on GPU: feature combinations are regularized more aggressively than on CPU. On CPU the cost of a combination equals the number of different feature values in this combination that are present in the training dataset. On GPU the cost of a combination equals the number of all possible different values of this combination. For example, if the combination contains two categorical features c1 and c2, the cost will be #categories in c1 * #categories in c2, even though many of the values of this combination might not be present in the dataset.
  • Added calculation of Shapley values (see formula (2) from https://arxiv.org/pdf/1802.03888.pdf). By default the estimation from this paper (Algorithm 2) is calculated, which is much faster; to compute exact Shapley values, specify the shap_calc_type parameter of the CatBoost.get_feature_importance function as "Exact". Implemented by @LordProtoss.
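
A plain-NumPy restatement of the Balanced / SqrtBalanced formulas above, as a worked check (not the CatBoost implementation):

```python
import numpy as np

def auto_class_weights(labels, mode='Balanced'):
    # with unit sample weights, sumWeightInClass is just the class count
    classes, counts = np.unique(labels, return_counts=True)
    w = counts.max() / counts  # maxSumWeightInClass / sumWeightInClass
    if mode == 'SqrtBalanced':
        w = np.sqrt(w)
    return dict(zip(classes.tolist(), w.tolist()))

print(auto_class_weights([0, 0, 0, 1]))                  # {0: 1.0, 1: 3.0}
print(auto_class_weights([0, 0, 0, 1], 'SqrtBalanced'))  # {0: 1.0, 1: 1.732...}
```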

Bugfixes:

  • Fixed onnx converter for old onnx versions.