
Fix #3579: Fix memory leak when learning rate callback is registered #3803

Closed

Conversation

@hcho3 (Collaborator) commented Oct 17, 2018

**Diagnosis** The learning rate callback function calls `XGBoosterSetParam()` to update the learning rate. The `XGBoosterSetParam()` function in turn calls `Learner::Configure()`, which resets and re-initializes each tree updater, calling `FastHistMaker::Init()`. The `FastHistMaker::Init()` function in turn re-allocates internal objects that were meant to be recycled across iterations. Thus memory usage increases over time.

**Fix** The learning rate callback should call a new function, `XGBoosterUpdateParamInPlace()`. The new function is designed so that no object is re-allocated.

Closes #3579.
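
To make the failure mode concrete, here is a minimal, self-contained sketch of the two code paths. Every name here (`ToyHistUpdater`, `UpdateEtaInPlace`, the workspace vector) is made up for illustration; the real state lives in `FastHistMaker` and is more involved, but the shape is the same: `SetParam` re-runs `Init`, and `Init` re-allocates.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Toy stand-in for a tree updater; all names are hypothetical.
class ToyHistUpdater {
 public:
  // Init() is meant to run once before the first iteration, building working
  // buffers that later iterations recycle. When SetParam -> Configure -> Init
  // re-runs it every iteration, that work (and memory) is repeated.
  void Init(std::size_t n_bins) {
    workspaces_.emplace_back(n_bins, 0.0);
  }
  // The path this PR introduces: mutate the scalar and touch nothing else.
  void UpdateEtaInPlace(double eta) { eta_ = eta; }
  std::size_t WorkspaceCount() const { return workspaces_.size(); }
  double Eta() const { return eta_; }

 private:
  double eta_ = 0.3;
  std::vector<std::vector<double>> workspaces_;
};

int main() {
  ToyHistUpdater leaky, fixed;
  leaky.Init(256);  // legitimate one-time setup
  fixed.Init(256);

  for (int iter = 0; iter < 100; ++iter) {
    double eta = 0.3 * std::pow(0.99, iter);      // a typical decay schedule
    leaky.Init(256);              // old behaviour: re-init on every SetParam
    fixed.UpdateEtaInPlace(eta);  // new behaviour: in-place update only
  }
  // leaky ends up with 101 workspaces; fixed still has 1.
  std::printf("workspaces: leaky=%zu fixed=%zu, final eta=%f\n",
              leaky.WorkspaceCount(), fixed.WorkspaceCount(), fixed.Eta());
  return 0;
}
```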

@hcho3 (Collaborator, Author) commented Oct 17, 2018

TODO: Make sure learning rate callbacks work correctly in distributed mode.

@hcho3 (Collaborator, Author) commented Oct 18, 2018

@RAMitchell @trivialfis @khotilov The fix ended up introducing quite a bit of boilerplate code, since the runtime parameter update needs to be passed all the way down to the individual parameter structures. So far I haven't been able to come up with a better alternative. Any suggestion is welcome.
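
To show what "passed all the way down" means in practice, here is a hedged sketch under invented names (`ToyLearner`, `ToyGBTree`, `ToyUpdater`); the real signatures differ, but each layer ends up needing a forwarding method like this:

```cpp
#include <map>
#include <string>
#include <vector>

using KwArgs = std::map<std::string, std::string>;

// Leaf level: the parameter struct that actually owns the value.
struct ToyTrainParam {
  double learning_rate = 0.3;
  void UpdateInPlace(const KwArgs& kwargs) {
    auto it = kwargs.find("learning_rate");
    if (it != kwargs.end()) learning_rate = std::stod(it->second);
  }
};

class ToyUpdater {
 public:
  void UpdateParamInPlace(const KwArgs& kwargs) {
    param_.UpdateInPlace(kwargs);  // the actual mutation happens here
  }
 private:
  ToyTrainParam param_;
};

class ToyGBTree {
 public:
  void UpdateParamInPlace(const KwArgs& kwargs) {
    for (auto& up : updaters_) up.UpdateParamInPlace(kwargs);  // fan out
  }
 private:
  std::vector<ToyUpdater> updaters_ = std::vector<ToyUpdater>(2);
};

class ToyLearner {
 public:
  // Unlike SetParam(), this never triggers Configure()/Init(), so no state
  // is torn down -- but every layer needs its own forwarding method.
  void UpdateParamInPlace(const KwArgs& kwargs) {
    gbm_.UpdateParamInPlace(kwargs);
  }
 private:
  ToyGBTree gbm_;
};
```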

@trivialfis (Member) commented Oct 18, 2018

@hcho3 It seems like a reasonably quick fix for this issue. There are some other parameters we will want to update at runtime; for example, better control over n_gpus. Issues like #3794 really come down to changing parameters. I am not yet familiar with many parts of the code base; let's see if we can come up with something better later.

@trivialfis (Member) commented Oct 20, 2018

@hcho3 Is it possible to detect which parameters actually need to be updated during Init(), so that we don't need to update everything? It's not clear to me how to do that yet.

@hcho3 (Collaborator, Author) commented Oct 20, 2018

Not really: Init() is supposed to be called only once, in the first iteration, and the C++ code isn't aware of the Python callback functions that may change training parameters in the middle of training.

Right now, callbacks call SetParam(), which calls Init() and causes the memory leak. The new approach has callbacks call UpdateParamInPlace() instead.

@RAMitchell (Member) commented Oct 20, 2018

@hcho3 Just had a quick glance; I can review properly in a few days. I am assuming the problem is the histogram matrix being regenerated every time? The correct solution is probably to remove the histogram matrix from the tree updater and instead associate it with the DMatrix object, which maintains its state across changes in parameters. I realise this is a lot more work, but it would address the fundamental problem. This must also be a problem in the gpu_hist updater, so we should do the same there.
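
A minimal sketch of that design, with invented names (`ToyDMatrix`, `ToyHistIndex`): the quantized histogram index is owned by the data object, so re-creating updaters cannot discard it.

```cpp
#include <memory>
#include <vector>

// Hypothetical quantized index; not the real xgboost structure.
struct ToyHistIndex {
  std::vector<unsigned> bin_index;  // one quantized bin id per feature value
};

class ToyDMatrix {
 public:
  // Built lazily on first use and cached with the data; later calls are free
  // even if the calling updater was just re-created by Learner::Configure().
  const ToyHistIndex& GetHistIndex(int max_bins) {
    if (!hist_index_ || max_bins != cached_max_bins_) {
      hist_index_ = std::make_unique<ToyHistIndex>();
      hist_index_->bin_index.assign(/*nnz placeholder=*/1024, 0u);
      // ... a real implementation would run the quantile sketch here ...
      cached_max_bins_ = max_bins;  // rebuild only if the binning changed
    }
    return *hist_index_;
  }

 private:
  std::unique_ptr<ToyHistIndex> hist_index_;
  int cached_max_bins_ = -1;
};
```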

@trivialfis (Member) commented:

@RAMitchell I was thinking about a more sophisticated parameter manager in dmlc-core, one that can accept optional functors (callbacks) and invoke them whenever a corresponding parameter changes.
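
Nothing like this exists in dmlc-core today; the following is an invented sketch of what such a parameter holder might look like:

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical parameter holder that fires registered functors on change.
class WatchedParams {
 public:
  using Callback = std::function<void(const std::string& new_value)>;

  void OnChange(const std::string& key, Callback cb) {
    watchers_[key].push_back(std::move(cb));
  }
  void Set(const std::string& key, const std::string& value) {
    auto it = values_.find(key);
    if (it != values_.end() && it->second == value) return;  // no-op set
    values_[key] = value;
    for (auto& cb : watchers_[key]) cb(value);  // notify interested parties
  }

 private:
  std::map<std::string, std::string> values_;
  std::map<std::string, std::vector<Callback>> watchers_;
};

// Usage: an updater subscribes once, and a learning rate callback becomes a
// plain Set() with no Configure()/Init() round trip:
//   params.OnChange("learning_rate",
//                   [&](const std::string& v) { eta = std::stod(v); });
//   params.Set("learning_rate", "0.05");
```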

@hcho3 (Collaborator, Author) commented Oct 20, 2018

@trivialfis I don't think dmlc-core currently supports callback functionality for parameters. It would be a useful addition.

@RAMitchell I agree that the histogram matrix should be merged into the DMatrix class. In fact, in the long run we should have the DMatrix class manage multiple representations (quantized? CSR / ELLPACK / CSC layouts?). Once this refactor is done, updaters would be "shallow" objects storing no data, and we could remove the UpdateParamInPlace() boilerplate as well. However, this kind of refactor is a lot of work, so I propose that we return to it after the 0.81 release.
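
A hedged sketch of that multi-representation idea, under invented names: each layout is built on demand and cached with the data, so updaters become shallow objects that borrow rather than own.

```cpp
#include <map>
#include <memory>
#include <string>

// All types here are hypothetical placeholders, not the real xgboost API.
struct ToyRepresentation {
  virtual ~ToyRepresentation() = default;
};
struct ToyCsrPage : ToyRepresentation { /* row offsets, column ids, values */ };
struct ToyEllpackPage : ToyRepresentation { /* fixed-width quantized rows */ };

class ToyMultiRepDMatrix {
 public:
  // First request builds the representation; later requests reuse the cache,
  // surviving any number of parameter changes or updater re-creations.
  template <typename Rep>
  const Rep& Get(const std::string& key) {
    auto& slot = cache_[key];
    if (!slot) slot = std::make_unique<Rep>();
    return static_cast<const Rep&>(*slot);
  }

 private:
  std::map<std::string, std::unique_ptr<ToyRepresentation>> cache_;
};

// Usage:
//   ToyMultiRepDMatrix dmat;
//   const auto& csr = dmat.Get<ToyCsrPage>("csr");         // built once
//   const auto& ell = dmat.Get<ToyEllpackPage>("ellpack"); // built once
```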

@RAMitchell (Member) commented:

@hcho3 You could list this as a known issue and simply defer the fix to a future version. I think the issue is not critical; e.g. you can use learning rate decay with the exact tree method if necessary.

I am happy to merge this, but the risk is that unforeseen issues arise with the temporary fix and we end up sinking time into the temporary fix rather than the long-term fix.

Just some thoughts; go with what you think is best.

@hcho3 (Collaborator, Author) commented Oct 22, 2018

@RAMitchell

> you can use the learning rate decay with the exact tree method if necessary.

In fact, only `hist` is affected by this issue, since it stores extra data internally. Neither `exact` nor `approx` is affected.

> the risk is that unforeseen issues arise with the temporary fix and we end up sinking time into the temporary fix and not the long-term fix.

I agree with this sentiment. In fact, I've been hesitant to merge this because of all the boilerplate required. I also agree that we should focus on the long-term, robust fix.

I will close this pull request now. Also:

1. The 0.81 release note will list this as a known issue. Users will be asked to use `exact` / `approx` to avoid the memory leak.
2. After 0.81, we should start working on adding multiple representations to the DMatrix class. This also has ramifications for #3810 (Call for contribution: improve multi-core CPU performance of `hist`).

hcho3 closed this Oct 22, 2018
lock bot locked as resolved and limited conversation to collaborators Jan 20, 2019
hcho3 deleted the eta_decay_memleak branch July 17, 2023