
Is incremental training supported? #464

Closed
Tjorriemorrie opened this issue Sep 13, 2018 · 22 comments

@Tjorriemorrie

Problem: How to do incremental training
catboost version: latest
Operating System: any

I know that in XGBoost the training params have a process_type which can be set to update, enabling incremental training. Does catboost have something similar?

params.update({'process_type': 'update',
               'updater'     : 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
@annaveronika
Contributor

For now you can use one of two options:

  1. Use baseline - https://tech.yandex.com/catboost/doc/dg/features/proceed-training-docpage/#proceed-training
    With this you first train a model, then apply it to your data, then pass the result as the baseline for the next training (see the sketch after this list).
  2. Use snapshotting - https://tech.yandex.com/catboost/doc/dg/features/snapshots-docpage/#snapshots
    During training you save snapshots. When starting the next training you can use the snapshot you got from the previous training, and it will be used for the first part of the ensemble.
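Roughly, option 1 looks like this in Python (a minimal sketch; X_batch1, y_batch1, X_batch2, y_batch2 and X_test are placeholder in-memory arrays, regression used for simplicity):

from catboost import CatBoostRegressor, Pool

# train the first model on the first chunk of data
model1 = CatBoostRegressor(iterations=200)
model1.fit(X_batch1, y_batch1)

# pass the first model's raw predictions as the baseline for the next chunk,
# so the new trees are fitted on top of them
train_pool2 = Pool(X_batch2, label=y_batch2, baseline=model1.predict(X_batch2))
model2 = CatBoostRegressor(iterations=200)
model2.fit(train_pool2)

# at prediction time the baseline has to be supplied the same way
test_pool = Pool(X_test, baseline=model1.predict(X_test))
predictions = model2.predict(test_pool)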

@annaveronika
Contributor

As for the initial request - I think it's a great idea for contribution!

@TomScheffers

@annaveronika I have tried the second solution, where I have multiple batches of data. Fitting the first Pool of data works fine, but when trying to fit to the second Pool of data, I get the following error while loading from the snapshot file:

Traceback (most recent call last):
  File "sink.py", line 54, in <module>
    model.fit(train_pool, eval_set=test_pool, early_stopping_rounds=100, save_snapshot=True, snapshot_file='models/snapshots/snapshot_'+file_name)
  File "C:\Users\TomSc\AppData\Local\Programs\Python\Python37\lib\site-packages\catboost\core.py", line 2526, in fit
    save_snapshot, snapshot_file, snapshot_interval)
  File "C:\Users\TomSc\AppData\Local\Programs\Python\Python37\lib\site-packages\catboost\core.py", line 1137, in _fit
    self._train(train_pool, eval_sets, params, allow_clear_pool)
  File "C:\Users\TomSc\AppData\Local\Programs\Python\Python37\lib\site-packages\catboost\core.py", line 839, in _train
    self._object._train(train_pool, test_pool, params, allow_clear_pool)
  File "_catboost.pyx", line 1446, in _catboost._CatBoost._train
  File "_catboost.pyx", line 1467, in _catboost._CatBoost._train
_catboost.CatboostError: c:/goagent/pipelines/buildmaster/catboost.git/catboost/libs/algo/learn_context.cpp:179: Current pool differs from the original pool

I have tried both initializing a CatBoost model once and then calling fit with different pools, and re-initializing the model every batch. Both methods caused the same error, so for now this does not suffice as a proper solution for incremental training.

The first solution is less neat, as this would mean there would be a large cascade of predictions using an ensemble of models.

What is the reason the Pool would have to be the same as the original Pool? Are there any updates on the implementation of incremental training?

@Noxoomo
Member

Noxoomo commented Nov 21, 2018

Hi!

Snapshots can only be used with the same pool. For some learning modes (ordered boosting, categorical feature support), CatBoost relies heavily on dataset preprocessing (so that we can avoid overfitting on data with categorical features), and this preprocessing cannot be applied to a different dataset.

About solution number 2:
There will be no problem with a cascade of predictions: CatBoost has a sum_models function that will combine the models into one automatically.

@TomScheffers

I could not find any documentation on the sum_models function. Do you have an example of its usage?

I would guess it would go something like:

import catboost
from catboost import Pool, CatBoostRegressor

all_models = []
trees = 5000
epochs = 2
batches = 10
for e in range(epochs):
    for s in range(batches):
        # get a batch of data
        X_train, X_test, y_train, y_test, w_train, w_test = dm.get_batch(batch=s)

        # if there are earlier models, use their combined predictions as the baseline
        if all_models:
            train_pool = Pool(data=X_train, label=y_train, cat_features=dm.cat_features, weight=w_train, baseline=sum_model.predict(X_train))
            test_pool = Pool(data=X_test, label=y_test, cat_features=dm.cat_features, weight=w_test, baseline=sum_model.predict(X_test))
        else:
            train_pool = Pool(data=X_train, label=y_train, cat_features=dm.cat_features, weight=w_train)
            test_pool = Pool(data=X_test, label=y_test, cat_features=dm.cat_features, weight=w_test)

        # fit a partial model on this batch (integer number of iterations)
        model = CatBoostRegressor(iterations=trees // (epochs * batches))
        model.fit(train_pool, eval_set=test_pool, early_stopping_rounds=100)

        # keep track of all models and the combined model
        all_models.append(model)
        sum_model = catboost.sum_models(all_models)

@kizill
Member

kizill commented Dec 9, 2018

Hello! We will publish documentation for sum_models soon; for now I'll paste part of it here:

sum_models(
    models,
    weights=None,
    ctr_merge_policy='IntersectingCountersAverage')

Purpose:

This method blends the trees and counters of two or more trained CatBoost models into a new model. For example, it's useful when you want to blend models trained on different cross-validation folds.
Leaf values can be weighted with user-provided weights (1 by default) for each model. For example, by providing weights of [1.0/N, ...] you can blend N models into one that gives the average prediction.

Parameters:

models
    Possible types: list of CatBoost models
    Description: list of model objects to blend
    Default value: required parameter

weights
    Possible types: None or list of numbers
    Description: if None, unit weights are used for every model; otherwise a list of numbers with the same length as models
    Default value: None

ctr_merge_policy
    Possible types: str
    Description: controls the counter merging policy. Possible values:
        * FailIfCtrsIntersects - ensure that the models have no intersecting counters
        * LeaveMostDiversifiedTable - use the most diversified table by count of unique hash values
        * IntersectingCountersAverage - average ctr counter values in intersecting bins
    Default value: IntersectingCountersAverage
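A short usage sketch (model1 and model2 are placeholders for already trained models on the same feature set):

from catboost import sum_models

# blend two models; weights of 1.0/N give the average prediction,
# the default (None) keeps a unit weight for every model
averaged = sum_models([model1, model2], weights=[0.5, 0.5])

# the counter merging policy can be made strict if the models must not share ctr tables
merged = sum_models([model1, model2], ctr_merge_policy='FailIfCtrsIntersects')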

@anuragreddygv323

Is this done?

@andrey-khropov
Member

Is this done?

This feature is currently under active development and is expected to be in the master branch within a week.

@chriskim92

This feature is currently under active development and is expected to be in the master branch within a week.

May I ask about the current progress of this feature?

@TomScheffers

TomScheffers commented Jun 5, 2019

This feature is already implemented. See the documentation here: https://catboost.ai/docs/concepts/python-reference_sum_models.html

@annaveronika
Contributor

Incremental training is supported on CPU starting from version 0.15. I've created a separate issue for GPU #860
Closing this one.

@proto-n

proto-n commented Jun 6, 2019

Incremental training is supported on CPU starting from version 0.15. I've created a separate issue for GPU #860
Closing this one.

Could you please also link to the relevant documentation / parameter for future reference? I can't seem to find any mention of incremental training.

@parisaazimaee

Hello @annaveronika,
I was looking for a parameter to initialize the predictions for binary classification, and I found this.
I am looking for the equivalent of the "set_base_margin" parameter in XGBoost. I was wondering if "set_baseline" in the Pool class does the same thing?

@annaveronika
Contributor

https://catboost.ai/docs/concepts/python-reference_pool_set_baseline.html - here are the docs. set_baseline sets the initial values of the formula; the values of the new trees will be added to these baseline values.

There is also a separate example here:
https://catboost.ai/docs/concepts/python-usages-examples.html#baseline
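A minimal sketch for the binary classification case (prev_model stands for a previously trained classifier; X_train and y_train are placeholders):

from catboost import Pool, CatBoostClassifier

pool = Pool(X_train, label=y_train)

# initial raw (log-odds) values that the new trees will be added to,
# e.g. the raw predictions of a previously trained model
pool.set_baseline(prev_model.predict(X_train, prediction_type='RawFormulaVal'))

model = CatBoostClassifier(iterations=300)
model.fit(pool)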

@BabakRezaei

Hello @annaveronika,
Is it possible to use sum_models for CatBoostRegressor?

Thank you

@fingoldo

fingoldo commented Jan 2, 2022

Hi, what is the difference between Training continuation (via model2.fit(train_data, train_labels, init_model=model1)) and Batch training (via batch2.set_baseline(model1.predict(batch1)) and later sum_models)?
Can Training continuation be applied to different data chunks? Can Batch training be used with different fit parameters in the intermediate models? When should each one be preferred? It's not clear from the docs.
I'm struggling with a huge dataset that does not fit into RAM. I was thinking of splitting it into (sparse) 200 GB chunks and training incrementally (from files). What would be your advice for my use case, Training continuation or Batch training?

@Quetzalcohuatl

@fingoldo I think it is more appropriate to use baseline instead of using init_model or batch training.

GBDT models are built by trying to predict the error of the combination of the previous trees; you then sum the outputs of all the trees. Therefore it makes sense to take sum(trees[0:N]) + sum(trees[N+1:M]).

With batch training they use sum_models, which just averages the predictions across the 2 models, instead of having the 2nd model get the benefit of learning from the residuals of the 1st model. In my opinion that is inferior, because the 2nd model can't be as powerful...
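For concreteness, a minimal sketch of the two approaches being compared (X1, y1, X2, y2 are placeholder data chunks; regression used for brevity):

from catboost import CatBoostRegressor, Pool, sum_models

model1 = CatBoostRegressor(iterations=100)
model1.fit(X1, y1)

# (a) training continuation: new trees are grown on top of model1's predictions,
#     and the resulting model should contain model1's trees plus the new ones
cont = CatBoostRegressor(iterations=100)
cont.fit(X2, y2, init_model=model1)

# (b) batch training: fit an independent model against model1's predictions
#     used as the baseline, then merge the two ensembles with sum_models
pool2 = Pool(X2, label=y2, baseline=model1.predict(X2))
model2 = CatBoostRegressor(iterations=100)
model2.fit(pool2)
combined = sum_models([model1, model2])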

@dk-github-acc

@Quetzalcohuatl I trained a CatBoost model on 3 lakh (300k) samples (they don't fit in RAM all at once, so I split them into batches/chunks of 15000 samples each). I used k-fold cross-validation, and for each fold I trained my single model using init_model=previous model (the previous model was stored in a file, and that file was updated with the next batch). So, say, batch_0 (no init_model) -> batch_1 (init_model = batch_0 model) -> batch_2 (init_model = batch_1 model) ... and so on (the parameters were kept the same for all batches). I finally get a single model trained on all 3 lakh samples for fold_0 (the same is repeated for the other folds).

Some questions below:

  1. Will sum_models(fold_0_model, fold_1_model, fold_2_model, fold_3_model) give accurate results?
  2. Are the residuals from the previous model not learned by the next models (batch after batch) when using init_model?
  3. In 'batch training' you said that the model doesn't leverage the residuals from the 1st model. But in the example given here, the baseline for batch_2 is set from the 1st model's predictions on batch_1. Does it still not learn the residuals from model_1's predictions on batch_1? For reference see the image below.

image

  4. I want to improve my MAE/MSE/RMSE scores on test data and reduce overfitting, so what would you suggest: (a) use cross-validation with init_model and hyperparameter tuning (init_model so that I can compare with my previous trainings), or (b) use baseline for a new training with cross-validation and hyperparameter tuning (I feel that I can't compare the new model's results with the previous model or state the improvements, if any)?

Help and suggestions greatly appreciated.
Thanks!

@felixmpaulus

For now you can use one of two options:

  1. Use baseline - https://tech.yandex.com/catboost/doc/dg/features/proceed-training-docpage/#proceed-training
    With this you first train a model, then apply it to your data, then pass the result as the baseline for the next training.
  2. Use snapshotting - https://tech.yandex.com/catboost/doc/dg/features/snapshots-docpage/#snapshots
    During training you save snapshots. When starting the next training you can use the snapshot you got from the previous training, and it will be used for the first part of the ensemble.

Regarding snapshotting in R:
How do I load a specific snapshot file when starting training?
How do I save the model at a specific location with a specific filename?
It sounds to me like snapshotting is handled automatically in the background and only comes into play when the script is interrupted, not when a training completes successfully.

So far I still don't understand how I would go about implementing incremental learning with the R package. A function like init_model would be great.

Thank you for any help! 🙂

@Evgueni-Petrov-aka-espetrov
Contributor

Hi @felixmpaulus,
many thanks for using catboost 😍

catboost saves snapshots to snapshot_file every snapshot_interval seconds;
these parameters are passed to the train function.

To save/load a model, you may use the save_model/load_model functions.

@felixmpaulus

felixmpaulus commented Jan 24, 2023

Hey @Evgueni-Petrov-aka-espetrov, thank you very much for the quick response. It helped a lot!

I implemented the snapshot functionality with

save_snapshot = TRUE, 
snapshot_file = "catboost_437_1", 
snapshot_interval = 1

and the initial run is successful.
But when I try to load the snapshot, the following error occurs:

Error in catboost.train(train_pool, NULL, params = list(loss_function = "Logloss", : 
catboost/private/libs/algo/learn_context.cpp:419: Can't load progress from snapshot file: catboost_info/catboost_437_1 : 
catboost/private/libs/algo/learn_context.cpp:398: Current training params differ from the params saved in snapshot

What is meant by training params?
The features of both datasets are identical!
I use the same dataset for the initial training and the incremental one (just different samples for training and testing).
The params I pass to the catboost.train function are also identical since it is the exact same code block that gets executed.

Here is the debug output:

> run_pipeline(ensemble_algorithm = 'catboost')
[1] "getting data"
[1] "getting all relevant features"
[1] "continuing with target: 437_1"
[1] "Features: Mileage, WERK, AUSLIEFERUNGSLAND, MARKE, FAHRZEUGMODELL, KRAFTSTOFF, LEISTUNG, Jahresfahrstrecke, climaZone"
[1] "preparing data"
[1] "spliting data"
[1] "training classifier 437_1, algorithm: catboost, weaklearn: decision_tree"
Custom logger is already specified. Specify more than one logger at same time is not thread safe.Learning rate set to 0.016632
Features checksum calculation time: 0.0001497905812
Create new LearnProgress
Fold: Use owned online single ctrs
Fold: Use owned online single ctrs
Fold: Use owned online single ctrs
Fold: Use owned online single ctrs
Mem usage: Before start train: 1800945664
learnProgressRestored->SerializedTrainParams = {"detailed_profile":false,"boosting_options":{"model_shrink_mode":"Constant","approx_on_full_history":false,"fold_len_multiplier":2,"fold_permutation_block":0,"posterior_sampling":false,"boosting_type":"Plain","iterations":1000,"model_shrink_rate":0,"od_config":{"wait_iterations":20,"type":"None","stop_pvalue":0},"boost_from_average":false,"permutation_count":4,"learning_rate":0.016720000654459},"pool_metainfo_options":{"tags":{}},"metrics":{"objective_metric":{"type":"Logloss","params":{}},"eval_metric":{"type":"Logloss","params":{}},"custom_metrics":[]},"metadata":{},"cat_feature_params":{"store_all_simple_ctr":false,"ctr_leaf_count_limit":18446744073709551615,"simple_ctrs":[{"ctr_binarization":{"border_count":15,"border_type":"Uniform"},"target_binarization":{"border_count":1,"border_type":"MinEntropy"},"prior_estimation":"No","priors":[[0,1],[0.5,1],[1,1]],"ctr_type":"Borders"},{"ctr_binarization":{"border_count":15,"border_type":"Uniform"},"prior_estimation":"No","priors":[[0,1]],"ctr_type":"Counter"}],"counter_calc_method":"SkipTest","one_hot_max_size":2,"max_ctr_complexity":4,"combinations_ctrs":[{"ctr_binarization":{"border_count":15,"border_type":"Uniform"},"target_binarization":{"border_count":1,"border_type":"MinEntropy"},"prior_estimation":"No","priors":[[0,1],[0.5,1],[1,1]],"ctr_type":"Borders"},{"ctr_binarization":{"border_count":15,"border_type":"Uniform"},"prior_estimation":"No","priors":[[0,1]],"ctr_type":"Counter"}],"target_binarization":{"border_count":1,"border_type":"MinEntropy"},"per_feature_ctrs":{}},"logging_level":"Verbose","data_processing_options":{"ignored_features":[],"float_features_binarization":{"border_count":254,"dev_max_subset_size_for_build_borders":200000,"nan_mode":"Min","border_type":"GreedyLogSum"},"has_time":false,"dev_sparse_array_indexing":"Indices","allow_const_label":false,"dev_default_value_fraction_for_sparse":0.8299999833106995,"class_names":[0,1],"embedding_processing_options":{"embedding_processing":{"default":["LDA","KNN"]}},"dev_group_features":false,"eval_fraction":0,"classes_count":0,"dev_leafwise_scoring":false,"auto_class_weights":"None","target_border":null,"force_unit_auto_pair_weights":false,"text_processing_options":{"feature_processing":{"default":[{"dictionaries_names":["BiGram","Word"],"feature_calcers":["BoW"],"tokenizers_names":["Space"]},{"dictionaries_names":["Word"],"feature_calcers":["NaiveBayes"],"tokenizers_names":["Space"]}]},"dictionaries":[{"start_token_id":"0","occurrence_lower_bound":"5","skip_step":"0","end_of_word_token_policy":"Insert","token_level_type":"Word","end_of_sentence_token_policy":"Skip","gram_order":"2","max_dictionary_size":"50000","dictionary_id":"BiGram"},{"start_token_id":"0","occurrence_lower_bound":"5","skip_step":"0","end_of_word_token_policy":"Insert","token_level_type":"Word","end_of_sentence_token_policy":"Skip","gram_order":"1","max_dictionary_size":"50000","dictionary_id":"Word"}],"tokenizers":[{"number_token":"🔢","skip_empty":"1","number_process_policy":"LeaveAsIs","tokenizer_id":"Space","token_types":["Number","Unknown","Word"],"delimiter":" 
","languages":[],"lemmatizing":"0","split_by_set":"0","lowercasing":"0","subtokens_policy":"SingleToken","separator_type":"ByDelimiter"}]},"class_weights":[],"per_float_feature_quantization":{}},"loss_function":{"type":"Logloss","params":{}},"tree_learner_options":{"model_size_reg":0.5,"sampling_frequency":"PerTree","bayesian_matrix_reg":0.10000000149011612,"score_function":"Cosine","monotone_constraints":{},"leaf_estimation_method":"Newton","dev_score_calc_obj_block_size":5000000,"grow_policy":"SymmetricTree","min_data_in_leaf":1,"random_strength":1,"dev_efb_max_buckets":1024,"l2_leaf_reg":3,"bootstrap":{"mvs_reg":null,"subsample":0.800000011920929,"type":"MVS"},"depth":6,"max_leaves":64,"leaf_estimation_backtracking":"AnyImprovement","rsm":1,"dev_leafwise_approxes":false,"penalties":{"per_object_feature_penalties":{},"first_feature_use_penalties":{},"feature_weights":{},"penalties_coefficient":1},"leaf_estimation_iterations":10,"sparse_features_conflict_fraction":0},"task_type":"CPU","flat_params":{"metric_period":100,"snapshot_interval":1,"save_snapshot":true,"iterations":1000,"loss_function":"Logloss","snapshot_file":"catboost_437_1","logging_level":"Verbose"},"random_seed":0,"system_options":{"thread_count":10,"file_with_hosts":"hosts.txt","node_type":"SingleHost","node_port":0,"used_ram_limit":""}} LearnProgress->SerializedTrainParams = {"detailed_profile":false,"boosting_options":{"model_shrink_mode":"Constant","approx_on_full_history":false,"fold_len_multiplier":2,"fold_permutation_block":0,"posterior_sampling":false,"boosting_type":"Plain","iterations":1000,"model_shrink_rate":0,"od_config":{"wait_iterations":20,"type":"None","stop_pvalue":0},"boost_from_average":false,"permutation_count":4,"learning_rate":0.016631999984383583},"pool_metainfo_options":{"tags":{}},"metrics":{"objective_metric":{"type":"Logloss","params":{}},"eval_metric":{"type":"Logloss","params":{}},"custom_metrics":[]},"metadata":{},"cat_feature_params":{"store_all_simple_ctr":false,"ctr_leaf_count_limit":18446744073709551615,"simple_ctrs":[{"ctr_binarization":{"border_count":15,"border_type":"Uniform"},"target_binarization":{"border_count":1,"border_type":"MinEntropy"},"prior_estimation":"No","priors":[[0,1],[0.5,1],[1,1]],"ctr_type":"Borders"},{"ctr_binarization":{"border_count":15,"border_type":"Uniform"},"prior_estimation":"No","priors":[[0,1]],"ctr_type":"Counter"}],"counter_calc_method":"SkipTest","one_hot_max_size":2,"max_ctr_complexity":4,"combinations_ctrs":[{"ctr_binarization":{"border_count":15,"border_type":"Uniform"},"target_binarization":{"border_count":1,"border_type":"MinEntropy"},"prior_estimation":"No","priors":[[0,1],[0.5,1],[1,1]],"ctr_type":"Borders"},{"ctr_binarization":{"border_count":15,"border_type":"Uniform"},"prior_estimation":"No","priors":[[0,1]],"ctr_type":"Counter"}],"target_binarization":{"border_count":1,"border_type":"MinEntropy"},"per_feature_ctrs":{}},"logging_level":"Debug","data_processing_options":{"ignored_features":[],"float_features_binarization":{"border_count":254,"dev_max_subset_size_for_build_borders":200000,"nan_mode":"Min","border_type":"GreedyLogSum"},"has_time":false,"dev_sparse_array_indexing":"Indices","allow_const_label":false,"dev_default_value_fraction_for_sparse":0.8299999833106995,"class_names":[0,1],"embedding_processing_options":{"embedding_processing":{"default":["LDA","KNN"]}},"dev_group_features":false,"eval_fraction":0,"classes_count":0,"dev_leafwise_scoring":false,"auto_class_weights":"None","target_border":null,"force_unit_auto_pair_weights":false,"
text_processing_options":{"feature_processing":{"default":[{"dictionaries_names":["BiGram","Word"],"feature_calcers":["BoW"],"tokenizers_names":["Space"]},{"dictionaries_names":["Word"],"feature_calcers":["NaiveBayes"],"tokenizers_names":["Space"]}]},"dictionaries":[{"start_token_id":"0","occurrence_lower_bound":"5","skip_step":"0","end_of_word_token_policy":"Insert","token_level_type":"Word","end_of_sentence_token_policy":"Skip","gram_order":"2","max_dictionary_size":"50000","dictionary_id":"BiGram"},{"start_token_id":"0","occurrence_lower_bound":"5","skip_step":"0","end_of_word_token_policy":"Insert","token_level_type":"Word","end_of_sentence_token_policy":"Skip","gram_order":"1","max_dictionary_size":"50000","dictionary_id":"Word"}],"tokenizers":[{"number_token":"🔢","skip_empty":"1","number_process_policy":"LeaveAsIs","tokenizer_id":"Space","token_types":["Number","Unknown","Word"],"delimiter":" ","languages":[],"lemmatizing":"0","split_by_set":"0","lowercasing":"0","subtokens_policy":"SingleToken","separator_type":"ByDelimiter"}]},"class_weights":[],"per_float_feature_quantization":{}},"loss_function":{"type":"Logloss","params":{}},"tree_learner_options":{"model_size_reg":0.5,"sampling_frequency":"PerTree","bayesian_matrix_reg":0.10000000149011612,"score_function":"Cosine","monotone_constraints":{},"leaf_estimation_method":"Newton","dev_score_calc_obj_block_size":5000000,"grow_policy":"SymmetricTree","min_data_in_leaf":1,"random_strength":1,"dev_efb_max_buckets":1024,"l2_leaf_reg":3,"bootstrap":{"mvs_reg":null,"subsample":0.800000011920929,"type":"MVS"},"depth":6,"max_leaves":64,"leaf_estimation_backtracking":"AnyImprovement","rsm":1,"dev_leafwise_approxes":false,"penalties":{"per_object_feature_penalties":{},"first_feature_use_penalties":{},"feature_weights":{},"penalties_coefficient":1},"leaf_estimation_iterations":10,"sparse_features_conflict_fraction":0},"task_type":"CPU","flat_params":{"metric_period":100,"snapshot_interval":1,"save_snapshot":true,"iterations":1000,"loss_function":"Logloss","snapshot_file":"catboost_437_1","logging_level":"Debug"},"random_seed":0,"system_options":{"thread_count":10,"file_with_hosts":"hosts.txt","node_type":"SingleHost","node_port":0,"used_ram_limit":""}}

Thank you very much for your help! 🙂

@felixmpaulus

To add to my previous comment: Sometimes the following error occurs as well:

Error in catboost.train(train_pool, NULL, params = list(loss_function = "Logloss", : 
catboost/private/libs/algo/learn_context.cpp:419: Can't load progress from snapshot file: catboost_info/experiment.cbsnapshot : catboost/private/libs/algo/learn_context.cpp:408: Current learn and test datasets differ from the datasets used for snapshot learnProgressRestored->LearnAndTestQuantizedFeaturesCheckSum = 3454709553 LearnProgress->LearnAndTestQuantizedFeaturesCheckSum = 804472809

Isn't it expected that the learn and test datasets differ with every new training?

Thank you again!
