Releases: catboost/catboost

1.2.7

07 Sep 20:11

Bugfixes

  • [R-package]: Restore basic functionality.

Build & testing

  • [GPU] Return configuration for multi-node GPU training with CMake-based build. See documentation.

1.2.6

05 Sep 10:59

⚠️ R-package is broken in this release. Please use release 1.2.7+

Major changes

  • CatBoost's open source build, test, and release infrastructure has been switched to GitHub Actions. You can also run it in your own fork of the CatBoost repository. See the announcement for details.

Python package

  • Adapt numpy dependency specification to prohibit numpy >= 2.0 for now. #2671

New features

Build & testing

  • [Windows]: Visual Studio 2022 with MSVC toolset 14.29.30133 is now supported. #2302

Speedups

  • [GPU]: Increase block size in QueryCrossEntropy (~3x faster on A100 for 6M samples, 350 features, query size near 1).

Improvements

  • [datasets] Use mkstemp to replace deprecated mktemp. #2660. Thanks to @fatmo666
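
The mkstemp change above can be illustrated with a minimal sketch using only the Python standard library. This is not the CatBoost datasets code; the helper name write_temp_file is hypothetical. It shows why mkstemp is preferred: the file is created and opened in one step, so the name cannot be claimed by another process in between, which is the race that made mktemp deprecated.

```python
import os
import tempfile


def write_temp_file(data: bytes, suffix: str = ".tmp") -> str:
    """Create a temporary file securely, write data to it, and return its path.

    Hypothetical helper for illustration; not part of CatBoost.
    """
    # mkstemp creates AND opens the file, returning an OS-level descriptor
    # together with the path, so there is no gap between naming and opening.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:  # the file object now owns fd
            handle.write(data)
    except BaseException:
        os.remove(path)
        raise
    return path  # the caller is responsible for deleting the file


path = write_temp_file(b"example payload")
with open(path, "rb") as handle:
    content = handle.read()
os.remove(path)
```

Unlike mktemp, mkstemp never deletes the file for you, so the caller removes it when done.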

Bugfixes

  • [C/C++ applier]. Add missing PredictSpecificClassFlat to calcer.exports. #2715
  • [Linux]. Restore readable backtraces
  • [GPU] Make CUDA_MAX_THREADS_PER_SM cuda arch-specific
  • [JVM applier][Windows]: Fixed bloating temp directory with copies of native libraries on Windows. #2622. Thanks to @DKARAGODIN.
  • Calculate F1, Precision, and Recall for all labels in multi-label classification
  • Synchronize values of NCB::NModelEvaluation::EPredictionType and EApiPredictionType. #2643
  • Fix sign of 2nd derivative for Tweedie loss
  • Fix 'Can't find borders for feature ...' error when using text features on GPU. #2657
  • Fix indexing of tokenized text features in model saver and dataset loader when some features are ignored
  • Fix descent direction for Cox regression. #2701
  • Fix GetTreeNodeToLeaf in multidimensional case (fixes plot_tree for multidimensional approx with non-oblivious trees). #2668

1.2.5

18 Apr 20:19

New features

  • [Python-package]: Support custom eval metrics on GPU. #1792. Thanks to @pnsemyon.

Bugfixes

  • [Python-package]: Check eval_period parameter validity for staged prediction. #2593
  • [Python-package]: Fix _CustomLoggersStack.pop logic. #2620
  • [R-package]: Fix Caret object: Inconsistent grid creation with documentation. #2609
  • [JVM applier]: Fix issues with exposing undesired symbols in JNI shared libraries (including allocators) on macOS. #2606
  • Fix training with embedding features on GPU. #2249, #2308, #2591
  • Fix training with text features on GPU
  • Use correct sample count in MultiRMSE on multiple GPUs. #2557
  • Fix sign of 2nd order derivative in Huber loss
  • Enable gradient walker for non-additive metrics
  • Fixes for Cox objective: buffer overflow in derivatives calculation, derivatives summation, metric calculation, disable ordered boosting
  • Fix text features data serialization in the model files

1.2.3

23 Feb 14:10

Python package

  • Support Python 3.12. #2510
  • [Performance]: Fix ineffective loops in Cython. Significant speedups (up to 3x) on dataset construction from data in C-order can be expected.
  • [Performance]: Make features data initialization from C-order numpy.ndarrays with float32 data type multithreaded. Significant speedups of 5x up to 10x (on CPUs with many cores) can be expected. #385, #2542
  • Save training metrics into the model metadata, so the best_score_, evals_result_, and best_iteration_ model attributes now work after saving and loading a model. The metrics can be removed by manipulating the model metadata if needed. #1166
  • [Breaking change]. Support a separate boolean target type. Class predictions for models trained with boolean targets are now also boolean instead of the strings True and False as before. Such models are incompatible with previous versions of CatBoost appliers. To keep the old behavior, convert your target to the strings False and True before training. #1954
  • Restrict jupyterlab version for setup to 3.x for now. Fixes #2530
  • utils.read_cd: Support CD files with non-increasing column indices.
  • Make log_cout, log_cerr specification consistent, avoid reset in recursive calls.
  • Late-initialize default values for log_cout, log_cerr. #2195
  • Add missing generated metrics: Cox, PairLogitPairwise, UserPerObjMetric, SurvivalAft.
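
For the breaking change to boolean targets noted in this list, the suggested workaround (converting the target to strings before training) could look like the sketch below. The helper name bools_to_legacy_strings is hypothetical, and the example handles only plain Python bool values; numpy.bool_ values would need an additional check.

```python
def bools_to_legacy_strings(labels):
    """Map Python bool labels to the strings 'True'/'False'.

    Hypothetical helper for illustration; non-bool labels pass through unchanged.
    """
    return [str(value) if isinstance(value, bool) else value for value in labels]


y = [True, False, True]
y_legacy = bools_to_legacy_strings(y)
# y_legacy can then be passed as the label when building the training data,
# so that Class predictions are strings as in earlier releases.
```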

New features

  • Support boolean target/labels type during training in Python and Spark (in the latter case only when using fit with Pool arguments) and Class prediction in Python. #1954
  • [Spark]: Support Spark 3.5.x.
  • [C/C++ applier]. Add functions for getting indices of features of different types to C and C++ API. #2568. Thanks to @nimusp.
  • [C/C++ applier]. Add staged prediction functions to C API. #2584. Thanks to @Mb-NextTime.
  • [JVM applier]. Add loading CatBoostModel from a byte array to API. #2539
  • [Linux] Support CgroupsV2 when computing default number of threads used in parallel computations. #2519. Thanks to @elukey.
  • [CLI] Support printing Auxiliary columns by name in evaluation result output. #1659
  • Save training metrics into the model metadata. Can be removed by model metadata manipulation if needed. #1166
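
To illustrate the CgroupsV2 item in this list: under cgroup v2 the CPU limit of a container is exposed in the cpu.max file as "<quota> <period>" (or "max <period>" when unlimited). The sketch below shows one way a default thread count could be derived from it. This is not CatBoost's implementation; the function names and the round-up policy are assumptions made for the example.

```python
import math
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Return the CPU limit (in cores) from cgroup v2 cpu.max content, or None if unlimited."""
    fields = text.split()
    if not fields or fields[0] == "max":
        return None
    quota = int(fields[0])
    # The kernel's default period is 100000 microseconds.
    period = int(fields[1]) if len(fields) > 1 else 100000
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def default_thread_count(cpu_max_text: Optional[str], host_cpus: int) -> int:
    """Pick a thread count bounded by both the host CPU count and the cgroup limit.

    Rounding the fractional limit up is an assumption of this sketch.
    """
    limit = parse_cpu_max(cpu_max_text) if cpu_max_text is not None else None
    if limit is None:
        return host_cpus
    return max(1, min(host_cpus, math.ceil(limit)))


# A container limited to 1.5 CPUs on an 8-core host.
threads = default_thread_count("150000 100000", host_cpus=8)
```

In practice the text would be read from /sys/fs/cgroup/cpu.max, with None passed when the file does not exist.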

Build & testing

  • [Windows]: Use clang-cl compiler and tools from Visual Studio 2022 for the build without CUDA (build with CUDA still uses standard Microsoft toolchain from Visual Studio 2019).
  • [macOS]: Pass os.version to conan host settings to ensure version consistency.
  • [Linux aarch64]: Set -mno-outline-atomics for modern versions of CLang and GCC to avoid unresolved symbols linking errors. #2527
  • Added missing CMakeLists for unit tests for util. #2525

Bugfixes

  • [Performance]: Fix a performance regression introduced in release 1.2 that could slow down training on GPU by 50% on some datasets. Thanks to @JeanPaulShapo.
  • [Python-package]: Fix segfault on Pool(data=None). #2522
  • [Python-package]: Fix Python exception in Pool() when pairs_weight is a numpy array. #1913
  • [Python-package]: Fix segfault and other strange errors when specifying custom logger with __call__ method. #2277
  • [Python-package]: Fix returning complex params in hyperparameter search. #1741, #1833
  • [Python-package]: Fix ignored exceptions for missed metrics descriptions on startup. This has not been visible to users but has been making debugging more difficult.
  • [Python-package]: Fix misleading "Targets are required for YetiRank loss function." error in cross-validation. #2083
  • [Python-package]: Fix Pool.get_label() returning constant True for boolean labels. #2133
  • [Python-package]: Copying models does not lose best_score_, evals_result_, best_iteration_ attributes values anymore. #1793
  • [Spark]: Fix hangs at the end of the training. #2151
  • Precision metric default value in the absence of positive samples is changed to 0 and a warning is added (similar to the behavior of the scikit-learn implementation). #2422
  • Fix ignoring embedding features
  • Try to avoid hash collisions when computing group ids for datasets with a lot of groups (may occur in datasets with around 10^9 samples).
  • Fix Multiclass models export to C++ and Python code. #2549
  • Fix dataset_statistics mode when no Target data is available.
  • Fix "Error: can't proceed some features" error on GPU. #1024
  • Fix allow_const_label=True for classification. #1933
  • Add checking of approx and target dimensions for SurvivalAft objective/metric.
  • Fix Focal loss derivatives sign. #2563
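
The Precision default-value change in this list can be sketched in plain Python. The example assumes that "absence of positive samples" refers to the case where the denominator TP + FP is zero, which is the ill-defined case scikit-learn handles by returning 0 with a warning. It is an illustration, not CatBoost's metric code.

```python
import warnings


def precision(y_true, y_pred):
    """Binary precision TP / (TP + FP), returning 0.0 with a warning when undefined."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    if true_pos + false_pos == 0:
        # Assumed interpretation: nothing was predicted positive,
        # so the ratio is 0/0 and a default of 0.0 is reported.
        warnings.warn("Precision is ill-defined (no positive predictions); returning 0.0")
        return 0.0
    return true_pos / (true_pos + false_pos)


score = precision([1, 0, 1], [1, 1, 0])  # one true positive, one false positive
```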

1.2.2

19 Sep 20:01

Bugfixes

  • Fix Segmentation fault when using custom eval_metric in binary python packages of version 1.2.1 on PyPI. #2486
  • Fix LossFunctionChange fstr with embedding features.
  • Fix a segmentation fault in JVM applier when using embedding features on JVM 11+.
  • Fix CTR data handling in model summation (especially for models with CTRs with multiple target quantizations).

1.2.1

28 Aug 20:57

New features

Improvements

  • Speedup BM25 feature calcers 3x
  • Use int instead of deprecated numpy.int. #2378
  • Add ModelCalcerWrapper::CalcFlatTransposed. #2413, thanks to @faucct
  • Update dependencies to avoid known vulnerabilities

Bugfixes

  • Fix __shfl_up_sync mask. #2339
  • TFocalMetric negative values fix. #2386, thanks to @diditforlulz273
  • Focal loss: Use user-defined alpha and gamma
  • Fix exception propagation: Rethrow exceptions caused by user's python code as C++ exceptions
  • Fix incompatibility of models trained with a user-defined objective with ShapValues calculation
  • Avoid nan's in Newton step calculation for RMSEWithUncertainty
  • Fix score method for y with shape (N, 1). #2405
  • Fix scalePosWeight support for Spark. #2470

1.2

01 May 23:11

Release 1.2

Major changes

CatBoost's build system has been switched from Ya Make (Yandex's build system) to CMake. This means more transparency in the build process and more familiar tools for Open Source developers.
For now it is possible to build CatBoost for:

  • Linux on x86-64 with or without CUDA
  • Linux on aarch64 with or without CUDA
  • macOS on x86-64 and arm64, including creating universal binaries
  • Windows on x86-64 with or without CUDA
  • Android (only model applier) on all supported ABIs.

This allowed us to prepare the Python package in the source distribution form (also known as sdist). #830

  • msvs subdirectory with the Microsoft Visual Studio solution has been removed. Visual Studio solutions can be generated using CMake instead.
  • make subdirectory with Makefiles has been removed. Use CMake + ninja (recommended) or CMake + make instead.

Python package

  • Switch to the standard Python build and installation method that uses setup.py instead of the custom mk_wheel.py script. All common scenarios (sdist, build, install, editable install, bdist_wheel) are supported.
  • Switch wheel platform tag on Linux from obsolete manylinux1 to manylinux2014.
  • The source distribution is now available on PyPI. #830
  • Wheels for Linux aarch64 are now available on PyPI. #2091
  • Support Python 3.11. #2213
  • Drop support for obsolete Python 3.6.
  • Make wheels PEP427-compliant. #2165
  • Fix wrong checksums in wheels that caused problems with poetry. #2331
  • Improved performance due to caching TBB local executors. #2203
  • Add fixed_binary_splits to the regressor, classifier, and ranker.
  • Compatibility with pandas 2.0. #2320
  • CatBoost widget is now compatible with ipywidgets 8.x. #2266

Rust package

  • Support CUDA applier. #1925, thanks to @getumen.
  • Properly forward debug/release setting to native library build.
  • Passing features: switch from String and Vec types for features to AsRef of slices to make code more generic
  • Support text and embedding features.
  • Support multidimensional output in predictions.

New features

  • [JVM applier]: Support CUDA.
  • [Spark]: Support Spark 3.4.x (use this version if you want to use Spark with Python 3.11).
  • Static model applier library now works on Windows.
  • Add binary-classification-threshold parameter to the CLI model applier.
  • Support Multi-target regression with text features (but only Bag-of-Words features are generated for now). #2229
  • Support RMSEWithUncertainty loss function on GPU.
  • Support MultiLogloss and MultiCrossEntropy loss functions with numerical features on GPU.
  • Support MultiLogloss loss function with text features on CPU and GPU. #1885
  • Enable univariate metrics for models with uncertainty
  • Add Focal loss (CPU-only for now). #1807, thanks to @diditforlulz273.

Improvements

  • Removed legacy dependency on Python 2 interpreter in the build process. #2297
  • Calc metrics: Throw catboost exception if column index exceeds column count.
  • Speedup MultiLogloss on CPU by 8% per tree (110K samples, 20 targets, 480 float features, 3 cat features, 16 cores CPU).
  • Update .NET projects from obsolete .NET Core 2.1 to .NET Core 3.1.
  • Code generation for new CUDA Compute Architectures 8.6, 8.9 and 9.0 is enabled by default (requires CUDA 11.8 to build from source).
  • Check that the evaluator implementation is available in TFullModel::SetEvaluatorType (previously, calling it for an unavailable implementation could cause a Segmentation fault). Add TFullModel::GetSupportedEvaluatorTypes.
  • Cross Validation on GPU no longer requires allow_write_files=True.

Bugfixes

  • [Python-package]: Clear model params before load_model. Fixes #2225.
  • [Python-package]: Fix CatBoostRanker score computation. #2231
  • [Python-package]: Fix _get_embedding_feature_indices. #2273
  • [Python-package]: Fix set_feature_names with text or embedding features. #2090
  • [Python-package]: pandas.Categorical.categories is not necessarily a numpy.ndarray. #1965
  • [Spark]: Pass classpath in a file to avoid hitting cmdline length limits. #1842
  • [CUDA Applier]: Apply scale and bias.
  • [CUDA Applier]: Fix that libs/model_interface applier always produced an error in CUDA mode.
  • Fix CUDA error 700 in pairwise ranking.
  • Fix kernel registration for distributed training on GPU.
  • Fix 'floating point exception' on CPU for small datasets on GPU.
  • Fix wrong log message 'There are invalid params and some of them will be ignored'. #2253
  • Fix incorrect results and crashes for GPU applier on Nvidia Ampere-based GPUs.
  • Fix 'CUDA error 9' in Multi-GPU training.
  • Fix serialization of embedding features structures in the model.
  • Fix GPU buffer overrun in distributed multi-classification training.
  • Fix catboost/cuda/cuda_util/sort.cpp:166: CUDA error 9 on Nvidia Ampere-based GPUs.
  • Fix inf/nan parsing in dataset input files.
  • Fix floating point exception for very small datasets on GPU.
  • Fix: built static applier library lacked the part with 'global' objects. #2187
  • Fix sum of models with categorical features with CTRs.
  • Fix: model_interface/cmake_example failed build "‘runtime_error’ is not a member of ‘std’". #2324, thanks to @Mandelag.
  • Fix Segmentation fault in Cross Validation and hyperparameter search functions that use it on GPU.
  • Fix Segmentation fault in utils.eval_metrics for groupwise metrics when group data has not been specified. #2343
  • Fix errors when running Cross Validation repeatedly on GPU. #2221

P.S. There's an issue with somewhat unexpected binary size increases. We're investigating it in #2369.

1.1.1

01 Nov 19:55

Release 1.1.1

New features

  • Support building for Linux on aarch64 from sources using CMake (no prebuilt binaries or PyPI packages yet). #1981
  • [C/C++ applier] Support embedding features. #2172
  • [C/C++ applier] Add GetModelUsedFeaturesNames. #2204
  • [Python] Add text features to utils.create_cd. #2193
  • [Spark] Full support for Apache Spark 3.3
  • [Spark] Read/write PySpark's DataFrame-like API for Pool. #2030
  • [Spark] Allow specifying trainingDriver and worker listening ports. #2181

Bugfixes

  • Fix prediction dimension check for RMSEWithUncertainty and MultiQuantile. #2155
  • [C/C++ applier] Fix segmentation fault in prediction for multiple objects for multiple dimension models.
  • [JVM applier] Fix catboost-common dependency version in catboost-prediction (Fixes JVM applier on macOS). #2121
  • [Python] Update for pandas 1.5.0: iteritems -> items (Fixes annoying deprecation warning). #2179
  • [Python] Fix segmentation fault when target is np.ndarray with dtype=object. #2201
  • [Python] Fix specifying feature_names in utils.create_cd. #2211

1.1

26 Sep 19:30

Release 1.1

New features

  • Multiquantile regression

    Models can now be trained with a shared tree structure and multiple predicted quantile values in each leaf. This approach does not yet strongly guarantee that the predicted quantile values are consistent, but it still provides more consistency than training an independent model for each quantile. A short description is available in the documentation. Short example for Python: loss_function='MultiQuantile:alpha=0.2,0.4'. Supported only on CPU for now.

  • Support text and embedding features for regression and ranking.

  • Spark: Read/write Spark's Dataset-like API for Pool. #2030

  • Support HashedCateg column type. This allows using externally prehashed categorical features both in training and prediction.

  • New option plot_file in Python functions with plot parameter allows saving plots to a file. #758

  • Add eval_fraction parameter. #1500

  • Non-symmetric trees model summation.

  • init_model parameter now works with non-symmetric trees.

  • Partial support for Apache Spark 3.3 (only for Scala 2.12 and without PySpark).
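
For the Multiquantile regression item above, the sketch below builds the loss_function string in the format quoted in these notes and adds a simple check for "crossing" quantiles, since consistency of the predicted values is not strictly guaranteed. The helper names are hypothetical, the alphas are assumed to be listed in ascending order, and sorting each prediction row is one common post-processing choice rather than anything CatBoost does itself.

```python
def multiquantile_loss(alphas):
    """Build a loss_function string such as 'MultiQuantile:alpha=0.2,0.4'."""
    if not alphas:
        raise ValueError("at least one alpha is required")
    return "MultiQuantile:alpha=" + ",".join(str(alpha) for alpha in alphas)


def has_quantile_crossing(row):
    """True if a row of predictions (one value per ascending alpha) is not non-decreasing."""
    return any(lower > upper for lower, upper in zip(row, row[1:]))


def fix_quantile_crossing(row):
    """Restore monotonicity by sorting the predicted quantiles (assumed post-processing)."""
    return sorted(row)


loss = multiquantile_loss([0.2, 0.4])
# With the catboost package installed, the string would be passed as
# CatBoostRegressor(loss_function=loss); that call is not executed here.

row = [1.0, 0.5]  # hypothetical prediction where the 0.2 quantile exceeds the 0.4 quantile
fixed = fix_quantile_crossing(row) if has_quantile_crossing(row) else row
```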

Speedups

  • 2x speedup of DCG, nDCG, and FilteredDCG metric calculation for groups with >= 50 objects and with top=-1 (all objects from each group, default value)
  • Fixed 2x slowdown of PairLogit and other ranking losses on CPU introduced in release 0.23
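
As background for the DCG and nDCG speedup above, here is a minimal sketch of the textbook per-group computation with a top parameter, where top=-1 means all objects in the group are used. It uses linear gain and a log2 position discount. CatBoost's own metric options and its handling of edge cases may differ; the 0.0 fallback for an all-zero ideal DCG is an assumption of this sketch.

```python
import math


def dcg(relevances, scores, top=-1):
    """Discounted cumulative gain for one group, ranking objects by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top >= 0:
        order = order[:top]  # top=-1 keeps every object in the group
    # Position 1 is discounted by log2(2) = 1, position 2 by log2(3), and so on.
    return sum(relevances[i] / math.log2(rank + 2) for rank, i in enumerate(order))


def ndcg(relevances, scores, top=-1):
    """DCG normalized by the DCG of the ideal ordering (sorted by true relevance)."""
    ideal = dcg(relevances, relevances, top)
    return dcg(relevances, scores, top) / ideal if ideal > 0 else 0.0


group_relevances = [3, 2, 1]
perfect = ndcg(group_relevances, [0.9, 0.5, 0.1])
reversed_order = ndcg(group_relevances, [0.1, 0.5, 0.9])
```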

Bugfixes

  • Fix for pandas integer array. #2096
  • Save feature names to json format. #2102
  • Fix feature weights on CPU
  • Use feature weights on GPU
  • Fix gradient calculation for QueryRMSE on GPU
  • Fix ranking metrics with group weights in calc_metrics
  • Fix JVM applier on data with text features. #2132

1.0.6

19 May 07:31

Release 1.0.6

New features

  • Fixed splits for binary features on GPU for non-symmetric trees: specify the set of splits to start each tree in the model with --fixed-binary-splits or fixed_binary_splits in the Python package (by default, there are no fixed splits)

Documentation

Bugfixes

  • Fix warning about resetting logger when logging to sys.stdout & sys.stderr from different threads #1855
  • Fix model summation in CatBoost for Apache Spark
  • Fix performance and scalability of query AUC for ranking (1M samples, query size 2, 8 CPU cores: 0.55s -> 0.04s)
  • Fix support for text features and embeddings in Java applier #2043
  • Fix nan/inf split scores with YetiRankPairwise loss
  • Fix nan/inf feature strengths in PairLogit on CPU