Releases: catboost/catboost
1.2.7
Bugfixes
- [R-package]: Restore basic functionality.
Build & testing
- [GPU] Return configuration for multi-node GPU training with CMake-based build. See documentation.
1.2.6
⚠️ R-package is broken in this release. Please use release 1.2.7+
Major changes
- CatBoost open source build, test and release infrastructure has been switched to GitHub Actions. It is also possible to run it in your own fork of the CatBoost repository. See the announcement for details.
Python package
- Adapt `numpy` dependency specification to prohibit `numpy >= 2.0` for now. #2671
New features
- User-defined metric GPU evaluation for task_type=GPU. Thanks to @pnsemyon.
- GPU Custom objective support. Thanks to @pnsemyon.
- [C/C++ applier]. `APT_MULTI_PROBABILITY` prediction type is now supported. #2639. Thanks to @aivarasbaranauskas.
- `GroupQuantile` metric
- Aggregated graph features
Build & testing
- [Windows]: Visual Studio 2022 with MSVC toolset 14.29.30133 is now supported. #2302
Speedups
- [GPU]: Increase block size in `QueryCrossEntropy` (~3x faster on A100 for 6M samples, 350 features, query size near 1).
Improvements
Bugfixes
- [C/C++ applier]. Add missing `PredictSpecificClassFlat` to calcer.exports. #2715
- [Linux]. Restore readable backtraces
- [GPU] Make CUDA_MAX_THREADS_PER_SM cuda arch-specific
- [JVM applier][Windows]: Fixed bloating temp directory with copies of native libraries on Windows. #2622. Thanks to @DKARAGODIN.
- Calculate F1, Precision, and Recall for all labels in multi-label classification
- Synchronize values of NCB::NModelEvaluation::EPredictionType and EApiPredictionType. #2643
- Fix sign of 2nd derivative for Tweedie loss
- Fix 'Can't find borders for feature ...' error when using text features on GPU. #2657
- Fix indexing of tokenized text features in model saver and dataset loader when some features are ignored
- Fix descent direction for Cox regression. Fixes #2701
- Fix GetTreeNodeToLeaf in multidimensional case (fixes plot_tree for multidimensional approx with non-oblivious trees). #2668
1.2.5
New features
Bugfixes
- [Python-package]: Check eval_period parameter validity for staged prediction. #2593
- [Python-package]: Fix _CustomLoggersStack.pop logic. #2620
- [R-package]: Fix Caret object: Inconsistent grid creation with documentation. #2609
- [JVM applier]: Fix issues with exposing undesired symbols in JNI shared libraries (including allocators) on macOS. #2606
- Fix training with embedding features on GPU. #2249, #2308, #2591
- Fix training with text features on GPU
- Use correct sample count in MultiRMSE on multiple GPUs. #2557
- Fix sign of 2nd order derivative in Huber loss
- Enable gradient walker for non-additive metrics
- Fixes for Cox objective: buffer overflow in derivatives calculation, derivatives summation, metric calculation, disable ordered boosting
- Fix text features data serialization in the model files
1.2.3
Python package
- Support Python 3.12. #2510
- [Performance]: Fix ineffective loops in Cython. Significant speedups (up to 3x) on dataset construction from data in C-order can be expected.
- [Performance]: Make features data initialization from C-order `numpy.ndarray`s with `float32` data type multithreaded. Significant speedups of 5x up to 10x (on CPUs with many cores) can be expected. #385, #2542
- Save training metrics into the model metadata. So `best_score_`, `evals_result_`, `best_iteration_` model attributes now work after model saving and loading. Can be removed by model metadata manipulation if needed. #1166
- [Breaking change]. Support a separate boolean target type, now `Class` predictions for models that have been trained with boolean targets will also be boolean instead of `True`, `False` strings as before. Such models will be incompatible with the previous versions of CatBoost appliers. If you want the old behavior convert your target to `False`, `True` strings before training. #1954
- Restrict `jupyterlab` version for setup to 3.x for now. Fixes #2530
- `utils.read_cd`: Support CD files with non-increasing column indices.
- Make `log_cout`, `log_cerr` specification consistent, avoid reset in recursive calls.
- Late-initialize default values for `log_cout`, `log_cerr`. #2195
- Add missing generated metrics: `Cox`, `PairLogitPairwise`, `UserPerObjMetric`, `SurvivalAft`.
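The breaking change above suggests converting boolean targets to `True` / `False` strings to keep the old `Class` prediction behavior. A minimal sketch of that conversion in plain Python, assuming the labels are held in an ordinary list of Python `bool` values (the helper name `bool_labels_to_strings` is hypothetical and not part of CatBoost):

```python
def bool_labels_to_strings(labels):
    """Map Python bool labels to the strings 'True' / 'False'.

    Non-bool values are passed through unchanged. This helper is an
    illustration only; it is not a CatBoost function.
    """
    return [str(v) if isinstance(v, bool) else v for v in labels]


y = [True, False, True]
y_as_strings = bool_labels_to_strings(y)
print(y_as_strings)
```

The converted list would then be passed as the training target in place of the boolean one.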
New features
- Support boolean target/labels type during training in Python and Spark (in the latter case only when using `fit` with `Pool` arguments) and `Class` prediction in Python. #1954
- [Spark]: Support Spark 3.5.x.
- [C/C++ applier]. Add functions for getting indices of features of different types to C and C++ API. #2568. Thanks to @nimusp.
- [C/C++ applier]. Add staged prediction functions to C API. #2584. Thanks to @Mb-NextTime.
- [JVM applier]. Add loading CatBoostModel from a byte array to API. #2539
- [Linux] Support CgroupsV2 when computing default number of threads used in parallel computations. #2519. Thanks to @elukey.
- [CLI] Support printing `Auxiliary` columns by name in evaluation result output. #1659
- Save training metrics into the model metadata. Can be removed by model metadata manipulation if needed. #1166
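For context on the CgroupsV2 item above: under cgroup v2 the CPU limit of a container is exposed in a `cpu.max` file containing `<quota> <period>`, or `max <period>` when unlimited. The sketch below shows one way a default thread count could be derived from it. It is not CatBoost's actual implementation; the function names, the default path and the round-up policy are assumptions made for illustration.

```python
import math
import os


def cpu_limit_from_cpu_max(text, fallback):
    """Parse cgroup v2 cpu.max content: '<quota> <period>' or 'max <period>'.

    Returns the CPU limit rounded up (an assumed policy), clamped to
    the range [1, fallback]. 'max' means no limit, so fallback is returned.
    """
    quota, period = text.split()
    if quota == "max":
        return fallback
    limit = math.ceil(int(quota) / int(period))
    return max(1, min(fallback, limit))


def default_thread_count(path="/sys/fs/cgroup/cpu.max"):
    """Use the cgroup limit when readable, otherwise the host CPU count."""
    host_cpus = os.cpu_count() or 1
    try:
        with open(path) as f:
            return cpu_limit_from_cpu_max(f.read(), host_cpus)
    except (OSError, ValueError):
        # File missing (not cgroup v2) or malformed: fall back to host CPUs.
        return host_cpus
```

For example, a container limited to two CPUs would typically show `200000 100000` in `cpu.max`, which this sketch maps to 2 threads on a host with more than two cores.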
Build & testing
- [Windows]: Use `clang-cl` compiler and tools from Visual Studio 2022 for the build without CUDA (build with CUDA still uses standard Microsoft toolchain from Visual Studio 2019).
- [macOS]: Pass `os.version` to `conan` host settings to ensure version consistency.
- [Linux aarch64]: Set `-mno-outline-atomics` for modern versions of Clang and GCC to avoid unresolved symbols linking errors. #2527
- Added missing `CMakeLists` for unit tests for `util`. #2525
Bugfixes
- [Performance]: Fix performance regression that could slow down training on GPU by 50% on some datasets that had been introduced in release 1.2. Thanks to @JeanPaulShapo.
- [Python-package]: Fix segfault on Pool(data=None). #2522
- [Python-package]: Fix Python exception in `Pool()` when `pairs_weight` is a numpy array. #1913
- [Python-package]: Fix segfault and other strange errors when specifying custom logger with `__call__` method. #2277
- [Python-package]: Fix returning complex params in hyperparameter search. #1741, #1833
- [Python-package]: Fix ignored exceptions for missed metrics descriptions on startup. This has not been visible to users but has been making debugging more difficult.
- [Python-package]: Fix misleading `Targets are required for YetiRank loss function.` error in Cross validation. #2083
- [Python-package]: Fix `Pool.get_label()` returning constant `True` for boolean labels. #2133
- [Python-package]: Copying models does not lose `best_score_`, `evals_result_`, `best_iteration_` attributes values anymore. #1793
- [Spark]: Fix hangs at the end of the training. #2151
- `Precision` metric default value in the absence of positive samples is changed to 0 and a warning is added (similar to the behavior of the `scikit-learn` implementation). #2422
- Fix ignoring embedding features
- Try to avoid hash collisions when computing group ids with datasets with a lot of groups (may occur in datasets with around 10^9 samples).
- Fix Multiclass models export to C++ and Python code. #2549
- Fix dataset_statistics mode when no `Target` data is available.
- Fix `Error: can't proceed some features` error on GPU. #1024
- Fix `allow_const_label=True` for classification. #1933
- Add checking of approx and target dimensions for `SurvivalAft` objective/metric.
- Fix Focal loss derivatives sign. #2563
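To illustrate the `Precision` item above: precision is TP / (TP + FP), which is undefined when the denominator is zero. The sketch below returns 0 and emits a warning in that case, which is also what scikit-learn's `precision_score` does with its default `zero_division="warn"`. It assumes that "absence of positive samples" refers to this zero-denominator case; the function is a plain-Python illustration, not CatBoost's code.

```python
import warnings


def precision(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); 0.0 with a warning when TP + FP == 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        # Ill-defined case: nothing was predicted as positive.
        warnings.warn("Precision is ill-defined (no positive predictions); returning 0.")
        return 0.0
    return tp / (tp + fp)


print(precision([1, 0, 1], [1, 1, 0]))  # one TP and one FP
```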
1.2.2
Bugfixes
- Fix Segmentation fault when using custom `eval_metric` in binary python packages of version 1.2.1 on PyPI. #2486
- Fix LossFunctionChange fstr with embedding features.
- Fix a segmentation fault in JVM applier when using embedding features on JVM 11+.
- Fix CTR data handling in model summation (especially for models with CTRs with multiple target quantizations).
1.2.1
New features
- Allow to optimize specific ranking loss functions with YetiRank and YetiRankPairwise by specifying `mode` parameter. See the "Which Tricks are Important for Learning to Rank?" paper for details (this family of losses is called `YetiLoss` there). CPU-only for now.
- Add Kernel Gradient Boosting support (use `catboost.sample_gaussian_process` function). #2408, thanks to @TakeOver. See the "Gradient Boosting Performs Gaussian Process Inference" paper for details.
- LambdaMart loss: support new target metrics MRR, ERR and MAP.
- StochasticRank loss: support new target metrics ERR and MRR.
- Support MultiRMSE on GPU. #2264, #2390
- Load JSON model format in Java Client. #1627, thanks to @timotta
- Implement exporting of Multiclass models to C++ and Python. #2283, thanks to @antoninkriz
Improvements
- Speedup BM25 feature calcers 3x
- Use `int` instead of deprecated `numpy.int`. #2378
- Add `ModelCalcerWrapper::CalcFlatTransposed`. #2413, thanks to @faucct
- Update dependencies to avoid known vulnerabilities
Bugfixes
- Fix __shfl_up_sync mask. #2339
- TFocalMetric negative values fix. #2386, thanks to @diditforlulz273
- Focal loss: Use user-defined alpha and gamma
- Fix exception propagation: Rethrow exceptions caused by user's python code as C++ exceptions
- CatBoost trained with user defined objective was incompatible with ShapValues calculation
- Avoid nan's in Newton step calculation for RMSEWithUncertainty
- Fix score method for y with shape (N, 1). #2405
- Fix scalePosWeight support for Spark. #2470
1.2
Release 1.2
Major changes
CatBoost's build system has been switched from Ya Make (Yandex's build system) to CMake. This means more transparency in the build process and more familiar tools for Open Source developers.
For now it is possible to build CatBoost for:
- Linux on x86-64 with or without CUDA
- Linux on aarch64 with or without CUDA
- macOS on x86-64 and arm64, including creating universal binaries
- Windows on x86-64 with or without CUDA
- Android (only model applier) on all supported ABIs.
This allowed us to prepare the Python package in the source distribution form (also known as `sdist`). #830
`msvs` subdirectory with the Microsoft Visual Studio solution has been removed. Visual Studio solutions can be generated using CMake instead.
`make` subdirectory with Makefiles has been removed. Use `CMake` + `ninja` (recommended) or `CMake` + `make` instead.
Python package
- Switch to the standard Python build and installation method that uses `setup.py` instead of the custom `mk_wheel.py` script. All common scenarios (`sdist`, `build`, `install`, editable `install`, `bdist_wheel`) are supported.
- Switch wheel platform tag on Linux from obsolete `manylinux1` to `manylinux2014`.
- The source distribution is now available on PyPI. #830
- Wheels for Linux aarch64 are now available on PyPI. #2091
- Support Python 3.11. #2213
- Drop support for obsolete Python 3.6.
- Make wheels PEP427-compliant. #2165
- Fix wrong checksums in wheels that caused problems with poetry. #2331
- Improved performance due to caching TBB local executors. #2203
- Add `fixed_binary_splits` to the regressor, classifier, and ranker.
- Compatibility with pandas 2.0. #2320
- CatBoost widget is now compatible with ipywidgets 8.x. #2266
Rust package
- Support CUDA applier. #1925, thanks to @getumen.
- Properly forward debug/release setting to native library build.
- Passing features: switch from `String` and `Vec` types for features to `AsRef` of slices to make code more generic
- Support text and embedding features.
- Support multidimensional output in predictions.
New features
- [JVM applier]: Support CUDA.
- [Spark]: Support Spark 3.4.x (if you want to use Spark with python 3.11 use this version).
- Static model applier library now works on Windows.
- Add `binary-classification-threshold` parameter to the CLI model applier.
- Support Multi-target regression with text features (but only Bag-of-Words features are generated for now). #2229
- Support `RMSEWithUncertainty` loss function on GPU.
- Support `MultiLogloss` and `MultiCrossEntropy` loss functions with numerical features on GPU.
- Support `MultiLogloss` loss function with text features on CPU and GPU. #1885
- Enable univariate metrics for models with uncertainty
- Add `Focal` loss (CPU-only for now). #1807, thanks to @diditforlulz273.
Improvements
- Removed legacy dependency on Python 2 interpreter in the build process. #2297
- Calc metrics: Throw catboost exception if column index exceeds column count.
- Speedup `MultiLogloss` on CPU by 8% per tree (110K samples, 20 targets, 480 float features, 3 cat features, 16 cores CPU).
- Update .NET projects from obsolete .NET Core 2.1 to .NET Core 3.1.
- Code generation for new CUDA Compute Architectures 8.6, 8.9 and 9.0 is enabled by default (requires CUDA 11.8 to build from source).
- Check that evaluator implementation is available in `TFullModel::SetEvaluatorType` (it was possible to get a Segmentation fault when calling it for a non-available implementation). Add `TFullModel::GetSupportedEvaluatorTypes`.
- Cross Validation on GPU no longer requires `allow_write_files=True`.
Bugfixes
- [Python-package]: Clear model params before load_model. Fixes #2225.
- [Python-package]: Fix CatBoostRanker score computation. #2231
- [Python-package]: Fix `_get_embedding_feature_indices`. #2273
- [Python-package]: Fix `set_feature_names` with text or embedding features. #2090
- [Python-package]: pandas.Categorical.categories is not necessarily a numpy.ndarray. #1965
- [Spark]: Pass classpath in a file to avoid hitting cmdline length limits. #1842
- [CUDA Applier]: Apply scale and bias.
- [CUDA Applier]: Fix that `libs/model_interface` applier always produced an error in CUDA mode.
- Fix CUDA error 700 in pairwise ranking.
- Fix kernel registration for distributed training on GPU.
- Fix 'floating point exception' on CPU for small datasets on GPU.
- Fix wrong log message 'There are invalid params and some of them will be ignored'. #2253
- Fix incorrect results and crashes for GPU applier on Nvidia Ampere-based GPUs.
- Fix 'CUDA error 9' in Multi-GPU training.
- Fix serialization of embedding features structures in the model.
- Fix GPU buffer overrun in distributed multi-classification training.
- Fix `catboost/cuda/cuda_util/sort.cpp:166: CUDA error 9` on Nvidia Ampere-based GPUs.
- Fix inf/nan parsing in dataset input files.
- Fix floating point exception for very small datasets on GPU.
- Fix: built static applier library lacked the part with 'global' objects. #2187
- Fix sum of models with categorical features with CTRs.
- Fix: model_interface/cmake_example failed build "‘runtime_error’ is not a member of ‘std’". #2324, thanks to @Mandelag.
- Fix Segmentation fault in Cross Validation and hyperparameter search functions that use it on GPU.
- Fix Segmentation fault in `utils.eval_metrics` for groupwise metrics when group data has not been specified. #2343
- Fix errors when running Cross Validation repeatedly on GPU. #2221
P.S. There's an issue with somewhat unexpected binary size increases. We're investigating it in #2369
1.1.1
Release 1.1.1
New features
- Support building for Linux on aarch64 from sources using CMake (no prebuilt binaries or PyPI packages yet). #1981
- [C/C++ applier] Support embedding features. #2172
- [C/C++ applier] Add `GetModelUsedFeaturesNames`. #2204
- [Python] Add text features to `utils.create_cd`. #2193
- [Spark] Full support for Apache Spark 3.3
- [Spark] Read/write PySpark's DataFrame-like API for Pool. #2030
- [Spark] Allow to specify trainingDriver and worker listening ports. #2181
Bugfixes
- Fix prediction dimension check for RMSEWithUncertainty and MultiQuantile. #2155
- [C/C++ applier] Fix segmentation fault in prediction for multiple objects for multiple dimension models.
- [JVM applier] Fix catboost-common dependency version in catboost-prediction (Fixes JVM applier on macOS). #2121
- [Python] Update for pandas 1.5.0: iteritems -> items (Fixes annoying deprecation warning). #2179
- [Python] Fix segmentation fault when target is `np.ndarray` with `dtype=object`. #2201
- [Python] Fix specifying `feature_names` in `utils.create_cd`. #2211
1.1
Release 1.1
New features
- Multiquantile regression
  Now it's possible to train models with shared tree structure and multiple predicted quantile values in each leaf. Currently this approach doesn't give a strong guarantee for predicted quantile values consistency, but it still provides more consistency than training multiple independent models for each quantile. You can read a short description in the documentation. Short example for Python: `loss_function='MultiQuantile:alpha=0.2,0.4'`. Supported only on CPU for now.
- Support text and embedding features for regression and ranking.
- Spark: Read/write Spark's Dataset-like API for Pool. #2030
- Support HashedCateg column type. This allows to use externally prehashed categorical features both in training and prediction.
- New option `plot_file` in Python functions with `plot` parameter allows to save plots to file. #758
- Add eval_fraction parameter. #1500
- Non-symmetric trees model summation.
- `init_model` parameter now works with non-symmetric trees.
- Partial support for Apache Spark 3.3 (only for Scala 2.12 and without PySpark).
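As background for the Multiquantile regression item above: a quantile model for level alpha is usually scored with the pinball (quantile) loss, and a multi-quantile model is scored once per level. The sketch below computes that loss in plain Python for the two levels from the `MultiQuantile:alpha=0.2,0.4` example. The function names and the toy data are assumptions for illustration; the exact aggregation CatBoost applies across quantiles is not shown here.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball loss for one quantile level alpha in (0, 1)."""
    total = 0.0
    for t, q in zip(y_true, y_pred):
        diff = t - q
        # Under-prediction is weighted by alpha, over-prediction by (1 - alpha).
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(y_true)


def per_quantile_losses(y_true, preds_by_alpha):
    """Pinball loss for each quantile level; preds_by_alpha maps alpha -> predictions."""
    return {a: pinball_loss(y_true, p, a) for a, p in preds_by_alpha.items()}


# Toy data: one prediction list per quantile level.
y = [1.0, 2.0, 3.0]
preds = {0.2: [0.5, 1.5, 2.5], 0.4: [1.0, 2.0, 3.0]}
print(per_quantile_losses(y, preds))
```

A lower value for a given level means the predicted quantile fits that level better; a perfect prediction gives a loss of 0.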
Speedups
- 2x speedup DCG, nDCG and FilteredDCG metrics calculation for groups with >= 50 objects and with top=-1 (all objects from each group, default value)
- Fixed 2x slowdown of PairLogit and other ranking losses on CPU introduced in release 0.23
Bugfixes
1.0.6
Release 1.0.6
New features
- Fixed splits for binary features on gpu for non-symmetric trees -- specify the set of splits to start each tree in the model with `--fixed-binary-splits` or `fixed_binary_splits` in Python package (by default, there are no fixed splits)
Documentation
- New sections on MultiRMSEWithMissingValues and LogCosh
- New section on get_embedding_feature_indices
- Add info on gpu support for metrics
Bugfixes
- Fix warning about resetting logger when logging to sys.stdout & sys.stderr from different threads #1855
- Fix model summation in CatBoost for Apache Spark
- Fix performance and scalability of query auc for ranking (1m samples, query size 2, 8 cpu cores 0.55s -> 0.04s)
- Fix support for text features and embeddings in Java applier #2043
- Fix nan/inf split scores with yeti rank pairwise loss
- Fix nan/inf feature strengths in pair logit on cpu