Is incremental training supported? #464

Problem: want to do incremental training.
catboost version: latest
Operating System: any

I know that in XGBoost the training params have a process_type option, which can be set to update, enabling incremental training. Does CatBoost have something similar?

Comments
For now you can use one of two options:
1. Train a separate model on each portion of the data and ensemble their predictions.
2. Use snapshotting to continue training from a saved state.

As for the initial request, I think it's a great idea for a contribution!
@annaveronika I have tried the second solution, where I have multiple batches of data. Fitting the first Pool of data works fine, but when trying to fit the second Pool of data, I get an error while loading from the snapshot file.

I have tried both initializing a CatBoost model once and then calling fit with different pools, and re-initializing the model every batch. Both methods caused the same error, so for now this does not suffice as a proper solution for incremental training. The first solution is less neat, as it would mean a large cascade of predictions using an ensemble of models. What is the reason the Pool has to be the same as the original Pool? Are there any updates on the implementation of incremental training?
Hi! Snapshots can only be used on the same pool. For some learning modes (ordered boosting, categorical feature support), CatBoost models rely heavily on dataset preprocessing (so that we can avoid overfitting on data with categorical features), and this preprocessing cannot be applied to a different dataset. About solution number 2: you can use the sum_models function to combine the separately trained models.
I could not find any documentation on the sum_models function. Do you have an example of its usage?
Hello! We will publish documentation soon. The signature is sum_models(models, weights=None, ctr_merge_policy='IntersectingCountersAverage').

Purpose: this method blends the trees and counters of two or more trained CatBoost models into a new model. For example, it is useful when you want to blend models trained on different cross-validation folds.

Parameters:
- models: the list of trained models to blend.
- weights: optional per-model weights; equal weights are used by default.
- ctr_merge_policy: how to merge the categorical feature counters (CTRs) of the models.
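A minimal sketch of sum_models usage based on the signature above; the models and data here are synthetic and purely illustrative:

```python
import numpy as np
from catboost import CatBoostRegressor, sum_models

rng = np.random.RandomState(0)
X_a, y_a = rng.rand(100, 5), rng.rand(100)  # first fold/batch
X_b, y_b = rng.rand(100, 5), rng.rand(100)  # second fold/batch

# Train one model per portion of the data.
model_a = CatBoostRegressor(iterations=50, verbose=False).fit(X_a, y_a)
model_b = CatBoostRegressor(iterations=50, verbose=False).fit(X_b, y_b)

# Blend the trees and counters of both models into a single model;
# with weights=None, all models are weighted equally.
blended = sum_models([model_a, model_b],
                     ctr_merge_policy='IntersectingCountersAverage')
print(blended.predict(X_a[:3]))
```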
Is this done?
This feature is currently under active development and is expected to be in the master branch within a week.
May I ask about the current progress of this feature?
This feature is already implemented. See the documentation here: https://catboost.ai/docs/concepts/python-reference_sum_models.html
Incremental training is supported on CPU starting from version 0.15. I've created a separate issue for GPU: #860
Could you please also link to the relevant documentation / parameter for future reference? I can't seem to find any mention of incremental training.
Hello @annaveronika,
https://catboost.ai/docs/concepts/python-reference_pool_set_baseline.html - here are the docs. set_baseline sets the initial values of the formula; the values of the next trees will be added to these baseline values. There is also a separate example in the documentation.
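A minimal sketch of this baseline approach with synthetic data; here the combined prediction is formed explicitly by adding the baseline back:

```python
import numpy as np
from catboost import CatBoostRegressor, Pool

rng = np.random.RandomState(0)
X1, y1 = rng.rand(100, 5), rng.rand(100)  # first batch
X2, y2 = rng.rand(100, 5), rng.rand(100)  # second batch

model1 = CatBoostRegressor(iterations=50, verbose=False)
model1.fit(X1, y1)

# Train the next model starting from model1's raw scores as the baseline.
baseline = model1.predict(X2, prediction_type='RawFormulaVal')
pool2 = Pool(X2, y2)
pool2.set_baseline(baseline)
model2 = CatBoostRegressor(iterations=50, verbose=False)
model2.fit(pool2)

# Combined prediction = baseline model's raw score + the new model's trees.
X_new = rng.rand(10, 5)
final = (model1.predict(X_new, prediction_type='RawFormulaVal')
         + model2.predict(X_new, prediction_type='RawFormulaVal'))
print(final[:3])
```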
Hello @annaveronika, thank you!
Hi, what is the difference between training continuation (via init_model) and sum_models?
@fingoldo I think it is more appropriate to use init_model. GBDT models are built by trying to predict the error of the combination of the previous trees, and the final prediction is the sum of the outputs of all the trees. Therefore it makes sense to continue the same sum: sum(trees[0:N]) + sum(trees[N:M]). With sum_models, the models are trained independently of each other, so the later trees never correct the errors of the earlier ones.
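A minimal sketch of training continuation via init_model (the data is synthetic and illustrative):

```python
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.RandomState(0)
X1, y1 = rng.rand(200, 5), rng.randint(0, 2, 200)  # batch 1
X2, y2 = rng.rand(200, 5), rng.randint(0, 2, 200)  # batch 2

model1 = CatBoostClassifier(iterations=50, verbose=False)
model1.fit(X1, y1)

# The new trees are fitted against the errors of model1's trees,
# and the resulting model contains both the old and the new trees.
model2 = CatBoostClassifier(iterations=50, verbose=False)
model2.fit(X2, y2, init_model=model1)
print(model2.tree_count_)  # inherited trees plus the 50 new ones
```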
@Quetzalcohuatl I trained a CatBoost model on 3 lakh (300,000) samples. They did not fit in RAM all at once, so I split them into batches/chunks of 15,000 samples per chunk. I used k-fold cross-validation, and for each fold I trained a single model using init_model = previous model (the previous model was stored in a file, and that file was updated in the next batch). So, say, batch_0 (no init_model) -> batch_1 (init_model = batch_0 model) -> batch_2 (init_model = batch_1 model) ... and so on (the parameters were kept the same for all batches). I finally get a single model trained on all 3 lakh samples for fold_0 (the same is repeated for the other folds). Some questions below:

Help and suggestions greatly appreciated.
Regarding snapshotting in CatBoost: so far I still don't understand how I would go about implementing incremental learning with the snapshot functionality. Thank you for any help! 🙂
Hi @felixmpaulus ✋ CatBoost saves snapshots to snapshot_file every snapshot_interval seconds. To save/load a model, you may use the save_model/load_model functions.
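A minimal sketch of both mechanisms mentioned above; the data, file names, and interval are illustrative:

```python
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(500, 5), rng.randint(0, 2, 500)

model = CatBoostClassifier(iterations=1000, verbose=False)
# Training state is written to 'training.snapshot' every 60 seconds.
# Re-running the same fit call with the same data and parameters
# resumes from the snapshot instead of starting over.
model.fit(X, y,
          save_snapshot=True,
          snapshot_file='training.snapshot',
          snapshot_interval=60)

# Saving/loading a finished model is a separate mechanism:
model.save_model('model.cbm')
restored = CatBoostClassifier()
restored.load_model('model.cbm')
```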
Hey @Evgueni-Petrov-aka-espetrov, thank you very much for the quick response. It helped a lot! I implemented the snapshot functionality and the initial run is successful. What is meant by the message in the debug output?

Thank you very much for your help! 🙂
To add to my previous comment: sometimes an additional error occurs as well. Isn't it expected to have differing train and learn datasets with every new training? Thank you again!