Initial support for Catboost #377

Open
4 tasks
hcho3 opened this issue Apr 19, 2022 · 1 comment


hcho3 commented Apr 19, 2022

We would like to add support for Catboost models. Users of Treelite should be able to load Catboost models and run prediction.

Overview

Catboost has a custom target encoding method to encode categorical data, and produces special kinds of decision trees called oblivious trees. See the Catboost paper for more details.

In general, a target encoder is a function that takes a categorical input and produces a numeric output. The function is an "encoding" in the sense that the categorical input is encoded as a real number. The advantage of target encoding is that every decision tree can use only the simple test of the form [feature] < [threshold], with no separate test type for categorical features.
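
To make the idea concrete, here is a minimal sketch of a target encoder as a plain lookup followed by a numerical test. The names ToyEncoder, Encode, and GoLeft are hypothetical and the encoded values are made up; this is not CatBoost's actual encoding scheme, only an illustration of "category in, real number out."

```cpp
#include <string>
#include <unordered_map>

// Hypothetical toy encoder: maps each category to one real number.
// Real target encoders derive these numbers from training labels.
using ToyEncoder = std::unordered_map<std::string, float>;

// Encode a category; unseen categories fall back to a caller-chosen default.
inline float Encode(const ToyEncoder& enc, const std::string& category,
                    float default_value) {
  auto it = enc.find(category);
  return it == enc.end() ? default_value : it->second;
}

// After encoding, a tree node needs only the numerical test
// [feature] < [threshold].
inline bool GoLeft(float encoded_feature, float threshold) {
  return encoded_feature < threshold;
}
```

With an encoder such as {"red" -> 0.25, "blue" -> 0.75} and a threshold of 0.5, "red" goes left and "blue" goes right, without the tree ever seeing a string.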

The challenge is that Catboost uses a custom flavor of target encoding. The goal, therefore, is to abstract away as much complexity as possible.

Proposed Design

The Treelite model spec

template <typename ThresholdType, typename LeafOutputType>
class ModelImpl : public Model {
 public:
  /*! \brief member trees */
  std::vector<Tree<ThresholdType, LeafOutputType>> trees;
  // ... (remaining members omitted)
};

should be updated to include an optional field that stores the target encoding function. The target encoding component should be a lookup table of the form

(categorical_feature_id, categorical_value) -> [ numerical vector ]
(categorical_feature_id, categorical_value) -> [ numerical vector ]
(categorical_feature_id, categorical_value) -> [ numerical vector ]
...

where each possible categorical value is mapped to a vector of length 1 or greater.
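
One possible shape for this table and the optional field is sketched below. Every name here (EncoderKey, TargetEncoderTable, ModelWithEncoderSketch) is an assumption made for illustration; none of them exists in Treelite, and the real field would live inside ModelImpl.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical sketch; none of these names exist in Treelite.
// Key: (categorical_feature_id, categorical_value).
using EncoderKey = std::pair<int, std::int64_t>;

class TargetEncoderTable {
 public:
  // Register the numerical vector for one (feature, value) pair.
  void Insert(int feature_id, std::int64_t value, std::vector<float> encoded) {
    assert(!encoded.empty());  // each entry has length 1 or greater
    table_[EncoderKey{feature_id, value}] = std::move(encoded);
  }

  // Returns nullptr when the pair is absent; the caller picks the fallback.
  const std::vector<float>* Lookup(int feature_id, std::int64_t value) const {
    auto it = table_.find(EncoderKey{feature_id, value});
    return it == table_.end() ? nullptr : &it->second;
  }

 private:
  std::map<EncoderKey, std::vector<float>> table_;
};

// The optional field: left empty for models that need no target encoding.
struct ModelWithEncoderSketch {
  std::optional<TargetEncoderTable> target_encoder;
};
```

Using std::optional keeps existing model types unchanged in behavior: a model without categorical target encoding simply leaves the field empty.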

Catboost uses CityHash to convert string categories into int64, so the target encoding field must allow both int64 and float32 types for the categorical input.
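
One way to admit both input types in a single table is a tagged union for the key. The sketch below uses std::variant (C++17); CategoricalValue, VariantKey, and VariantEncoderTable are hypothetical names, and the hashing step itself is left out because its exact parameters are not specified here.

```cpp
#include <cstdint>
#include <map>
#include <variant>
#include <vector>

// Hypothetical: a categorical input is either a hashed string (int64)
// or a numerically stored category (float32).
using CategoricalValue = std::variant<std::int64_t, float>;

struct VariantKey {
  int feature_id;
  CategoricalValue value;

  // Orders by feature id, then by the variant (alternative index first,
  // then the held value). NaN floats must not be used as keys, since they
  // would break the strict weak ordering that std::map requires.
  bool operator<(const VariantKey& other) const {
    if (feature_id != other.feature_id) return feature_id < other.feature_id;
    return value < other.value;
  }
};

using VariantEncoderTable = std::map<VariantKey, std::vector<float>>;
```

Because the variant compares the alternative index before the value, the int64 value 2 and the float value 2.0 are distinct keys, so the two input types cannot collide.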

Scope

  • Catboost allows users to save models in two formats: FlatBuffer and JSON. For the initial version, we'll only support the JSON format.
  • Initially, we'll convert oblivious trees into regular decision trees. We may add an ObliviousTree class to the Treelite model spec in the future.
  • We'll only support the simple_ctr configuration, where the target encoding function takes a single categorical feature at a time. We won't support the combination_ctr configuration, where multiple categorical features are fed into the target encoder.
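
The conversion of an oblivious tree into a regular decision tree can be sketched as follows. An oblivious tree applies the same split to every node on a given level, so a tree of depth d is fully described by d splits and 2^d leaf values. The types and the leaf ordering below are assumptions for illustration, not Treelite's or Catboost's actual layout; a real loader would have to match the ordering used in the model file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Split { int feature; float threshold; };

struct ObliviousTreeSketch {
  std::vector<Split> levels;       // one split shared by all nodes per level
  std::vector<float> leaf_values;  // size must be 2^levels.size()
};

struct Node {
  bool is_leaf = false;
  Split split{};       // valid when !is_leaf
  int left = -1;       // child taken when x[feature] < threshold
  int right = -1;      // child taken otherwise
  float value = 0.0f;  // valid when is_leaf
};

// Assumed leaf ordering: going right at level i sets bit i of the leaf index.
inline int Expand(const ObliviousTreeSketch& t, std::size_t level,
                  std::size_t leaf_index, std::vector<Node>& out) {
  int id = static_cast<int>(out.size());
  out.emplace_back();
  if (level == t.levels.size()) {
    out[id].is_leaf = true;
    out[id].value = t.leaf_values[leaf_index];
    return id;
  }
  out[id].split = t.levels[level];
  int l = Expand(t, level + 1, leaf_index, out);
  int r = Expand(t, level + 1, leaf_index | (std::size_t{1} << level), out);
  out[id].left = l;  // index by id: 'out' may have reallocated meanwhile
  out[id].right = r;
  return id;
}

// Expands the oblivious tree into an explicit binary tree rooted at node 0.
inline std::vector<Node> ToRegularTree(const ObliviousTreeSketch& t) {
  assert(t.leaf_values.size() == (std::size_t{1} << t.levels.size()));
  std::vector<Node> nodes;
  Expand(t, 0, 0, nodes);
  return nodes;
}

// Standard traversal of the expanded tree.
inline float Predict(const std::vector<Node>& nodes,
                     const std::vector<float>& x) {
  int id = 0;
  while (!nodes[id].is_leaf) {
    const Node& n = nodes[id];
    id = (x[n.split.feature] < n.split.threshold) ? n.left : n.right;
  }
  return nodes[id].value;
}
```

The expanded tree has 2^(d+1) - 1 nodes, so the conversion trades memory for the ability to reuse the existing tree representation. That cost is one reason a dedicated ObliviousTree class may be worth adding later.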

TODOs

  • Add the target encoder to the Treelite model spec
  • Implement the deserializer for the Catboost JSON model. The deserializer will be placed in src/frontend.
  • Update GTIL to support inference with Catboost models.
  • Update the C codegen to support text inputs and target encoding. I expect this step to be challenging, given the complexity of the C codegen.
hcho3 changed the title from "Catboost support" to "Initial support for Catboost" on Apr 19, 2022
hcho3 pinned this issue on Apr 19, 2022
hcho3 mentioned this issue on Apr 19, 2022
hcho3 unpinned this issue on Oct 21, 2022

hcho3 commented Aug 30, 2024

A prototype of a working inference engine is available here: https://github.com/hcho3/catboost_python_repro
