cop1cat/catboost-utils

catboost_utils

A UX wrapper over CatBoost — readable errors, pre-flight data validation, ergonomic custom losses, sklearn pipeline compatibility, structured logging, lossless save/load, and exception-safe callbacks.

Not a fork. Not a replacement. A wrapper. Use it where it helps; mix freely with stock catboost.

Install

pip install catboost-utils
# or, with sklearn pipeline support:
pip install "catboost-utils[sklearn]"

Requires Python 3.10+ and CatBoost 1.2+.

Quick start

from catboost_utils import CBXClassifier

model = CBXClassifier(
    iterations=500,
    auto_cat_features=True,        # detects str/category/bool columns automatically
    nan_fill="__NA__",             # explicit handling of NaN in cat features (no magic)
    early_stopping="auto",         # enables sane defaults when eval_set is given
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))

isinstance(model, CatBoostClassifier) is still True. clone(), GridSearchCV, and pickle work out of the box.
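Those guarantees follow from CBXClassifier being a plain subclass of CatBoostClassifier. A toy sketch with stand-in classes (not the library's real ones) shows why isinstance and pickle then come for free:

```python
import pickle

class CatBoostClassifierStub:                      # stand-in for catboost.CatBoostClassifier
    def __init__(self, iterations=1000):
        self.iterations = iterations

class CBXClassifierStub(CatBoostClassifierStub):   # stand-in for catboost_utils.CBXClassifier
    def __init__(self, iterations=1000, auto_cat_features=False):
        super().__init__(iterations=iterations)
        self.auto_cat_features = auto_cat_features

m = CBXClassifierStub(iterations=500, auto_cat_features=True)
assert isinstance(m, CatBoostClassifierStub)       # base-class isinstance still holds
clone2 = pickle.loads(pickle.dumps(m))             # ordinary pickle round-trip
```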

What's in the box

Every module is independent. Use only what you need.

errors — readable error messages

from catboost import CatBoostClassifier
from catboost_utils import wrap, CBXError

m = wrap(CatBoostClassifier(iterations=10))
try:
    m.fit(X, y)   # X has a string column not declared in cat_features
except CBXError as e:
    print(e.human_message)  # e.g. "Feature 'city' (index 5) has invalid type ..."
    print(e.hint)           # e.g. "Convert float values and NaN to strings ..."

wrap() swaps the model's class to a CBX-enhanced subclass — isinstance checks keep working, and pickle round-trips correctly.
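The class swap can be pictured with a toy version (illustrative names only, not the library's internals): build and cache a subclass that mixes the enhancements into the model's own class, then reassign __class__ on the live instance. The real wrap() also keeps pickle working, which needs extra machinery not shown here.

```python
_subclass_cache = {}

class ReadableErrorsMixin:
    def human_fit(self, *args, **kwargs):   # hypothetical enhanced method
        return "enhanced"

def wrap_sketch(model):
    base = type(model)
    if base not in _subclass_cache:
        # Reuse the base class name so repr() stays unsurprising.
        _subclass_cache[base] = type(base.__name__, (ReadableErrorsMixin, base), {})
    model.__class__ = _subclass_cache[base]  # swap the live instance's class
    return model

class FakeModel:                             # stand-in for a CatBoost model
    pass

m = wrap_sketch(FakeModel())
assert isinstance(m, FakeModel)              # isinstance keeps working
```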

validation — pre-flight checks

from catboost_utils import validate

report = validate(X, y, cat_features=["city"])
report                  # in Jupyter: rich HTML table of issues + warnings
report.raise_if_failed()  # raises ValidationError if any blocking issue

Catches NaN-in-cat-features, inf, single-class targets, undeclared object columns, datetime columns, class-weights conflicts, and GPU/multi-thread non-determinism — before training crashes with a cryptic message.
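One of those checks, NaN in a declared categorical column, can be sketched in a few lines over plain columns (the real report object is richer than a list of strings):

```python
import math

def check_cat_nans(columns, cat_features):
    """Sketch of one pre-flight check: NaN inside declared cat features."""
    issues = []
    for name in cat_features:
        bad = sum(1 for v in columns[name]
                  if isinstance(v, float) and math.isnan(v))
        if bad:
            issues.append(f"cat feature '{name}' has {bad} NaN value(s); "
                          f"fill them explicitly (e.g. nan_fill='__NA__')")
    return issues

cols = {"city": ["NYC", float("nan"), "SF"], "age": [30, 41, 25]}
issues = check_cat_nans(cols, ["city"])
```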

objectives — custom losses, numba-jit'ed

import numpy as np
from catboost import CatBoostRegressor
from catboost_utils.objectives import objective, metric

@objective(task="regression")
def my_huber(y_true: np.ndarray, y_pred: np.ndarray):
    delta = 1.0
    err = y_pred - y_true
    is_small = np.abs(err) <= delta
    grad = np.where(is_small, err, delta * np.sign(err))
    hess = np.where(is_small, 1.0, 0.0)
    return grad, hess

@metric(task="regression", name="MAE", higher_is_better=False)
def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

model = CatBoostRegressor(loss_function=my_huber, eval_metric=mae)

The decorator handles all CatBoost-isms (list-of-list approxes, sign convention, weights, sigmoid/softmax internal transform). Functions are JIT-compiled with numba; multiclass works at C-speed despite CatBoost's per-object API.
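For context on the sign convention: stock CatBoost's Python custom objective is an object whose calc_ders_range(approxes, targets, weights) returns per-object (first, second) derivative pairs of the objective CatBoost maximizes. The decorator's core job is essentially the adapter below, a hedged sketch (not the actual implementation) that flips the sign of a minimized loss's grad/hess and applies weights:

```python
import numpy as np

class LossAdapter:
    """Sketch: adapt a (grad, hess)-of-a-minimized-loss function to
    CatBoost's custom-objective protocol. Names are illustrative."""
    def __init__(self, grad_hess_fn):
        self.fn = grad_hess_fn

    def calc_ders_range(self, approxes, targets, weights):
        y_pred = np.asarray(approxes, dtype=float)
        y_true = np.asarray(targets, dtype=float)
        grad, hess = self.fn(y_true, y_pred)
        if weights is not None:
            w = np.asarray(weights, dtype=float)
            grad, hess = grad * w, hess * w
        # CatBoost maximizes its objective; a minimized loss flips sign.
        return list(zip(-grad, -hess))

def squared(y_true, y_pred):
    return y_pred - y_true, np.ones_like(y_pred)

ders = LossAdapter(squared).calc_ders_range([2.0], [1.0], None)
# loss grad at (pred=2, true=1) is +1, so the maximized-objective der1 is -1
```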

pipeline — sklearn-friendly classes

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from catboost_utils import CBXRegressor

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", CBXRegressor(iterations=100)),
])
pipe.fit(X, y)

Works inside Pipeline, GridSearchCV, cross_val_score, and clone().

logging — structured training output

import logging
from catboost_utils.logging import setup_logging, attach

setup_logging(level=logging.INFO, structured=False)
attach(model)
model.fit(X, y)
# INFO catboost_utils.training - iteration=10 learn_loss=0.423 test_loss=0.451 ...

Use structured=True for JSON output. Each parsed line carries a cbx_iteration extra dict for downstream log processors.
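A downstream processor can pick that extra dict off the log record. Sketch of a JSON formatter consuming it (the field name cbx_iteration comes from the text above; its exact keys are an assumption here):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {"msg": record.getMessage()}
        extra = getattr(record, "cbx_iteration", None)  # attached via logging's extra=
        if extra:
            payload.update(extra)
        return json.dumps(payload)

logger = logging.getLogger("catboost_utils.training")
record = logger.makeRecord(
    logger.name, logging.INFO, "demo.py", 0,
    "iteration metrics", (), None,
    extra={"cbx_iteration": {"iteration": 10, "learn_loss": 0.423}},
)
line = JsonFormatter().format(record)
```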

explain — feature importance + SHAP, named DataFrames

from catboost_utils.explain import feature_importance, shap_values, check_early_stopping

fi = feature_importance(model, X)            # sorted DataFrame with feature names
sv = shap_values(model, X)                   # DataFrame: features + expected_value
check_early_stopping(model, eval_set=eval_set)  # raise CBXError if misconfigured

io — lossless save/load

from catboost_utils.io import save, load

save(model, "artifact.cbm")              # writes artifact.cbm + artifact.cbm.meta.json
restored = load("artifact.cbm")          # restores best_iteration, feature_names, etc.

The sidecar bundles best_iteration, feature_names, cat_features, class_names, training params, and version info. load() works without a sidecar (logs a warning) so external .cbm files keep loading.
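The sidecar pattern itself is simple enough to sketch: write the model artifact plus a .meta.json next to it, and tolerate the sidecar's absence on load. Keys below mirror the list above; the real on-disk format is the library's own.

```python
import json
import os
import pathlib
import tempfile

def save_sketch(model_bytes, path, meta):
    p = pathlib.Path(path)
    p.write_bytes(model_bytes)                           # the .cbm itself (stubbed as bytes)
    pathlib.Path(str(p) + ".meta.json").write_text(json.dumps(meta))

def load_sketch(path):
    p = pathlib.Path(path)
    sidecar = pathlib.Path(str(p) + ".meta.json")
    # Missing sidecar: the real load() logs a warning instead of failing.
    meta = json.loads(sidecar.read_text()) if sidecar.exists() else None
    return p.read_bytes(), meta

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "artifact.cbm")
    save_sketch(b"model-bytes", path, {"best_iteration": 42, "feature_names": ["city", "age"]})
    blob, meta = load_sketch(path)
```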

callbacks — exception-safe wrapper

from catboost_utils.callbacks import safe

cb = safe(my_callback)
model.fit(X, y, callbacks=[cb])
cb.raise_if_failed()   # surfaces any exception your callback raised, with original traceback
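The wrapping technique can be sketched against CatBoost's callback protocol, an object with after_iteration(info) returning a bool (True to continue). This is an illustrative version, not the library's implementation: catch the exception, stash it, and let training finish.

```python
class SafeCallbackSketch:
    def __init__(self, inner):
        self.inner = inner
        self.error = None

    def after_iteration(self, info):
        if self.error is not None:
            return True                      # already failed once; stay inert
        try:
            return self.inner.after_iteration(info)
        except Exception as exc:             # store instead of propagating mid-training
            self.error = exc
            return True

    def raise_if_failed(self):
        if self.error is not None:
            raise self.error

class FailingCallback:                       # a callback that always blows up
    def after_iteration(self, info):
        raise ValueError("boom")

cb = SafeCallbackSketch(FailingCallback())
assert cb.after_iteration(None) is True      # training would continue anyway
```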

Principles

  • Backwards compatible — anything that works in CatBoost works through catboost_utils.
  • Opt-in — every module is independent. Use what you need; ignore the rest.
  • No magic — no silent data transformations. Auto-fixes are always parameters the user passes explicitly (nan_fill="...", auto_cat_features=True).
  • Strict typing — every public function fully annotated; mypy --strict clean.

Compatibility

  • Python: 3.10, 3.11, 3.12
  • CatBoost: ≥ 1.2, < 2.0
  • sklearn: 1.3+ (optional, only for the pipeline module)

Versioning

Pre-1.0 (0.x.y). Any minor bump may include breaking changes — see CHANGELOG.md. 1.0.0 will be cut once the public API is frozen and CI is green across the matrix.

License

Apache 2.0 — see LICENSE.
