
[jvm-packages] cleaning checkpoint file after a successful training #4754

Merged: 2 commits into dmlc:master on Aug 14, 2019

Conversation

@CodingCat (Member) commented on Aug 8, 2019

No description provided.

@CodingCat (Member, Author) commented on Aug 8, 2019

@trams, would you please help review this?

@trams (Contributor) left a review comment

This is a good change. That said, we should also update the docs and the upcoming changelog to communicate the API change, however minor it is.

[Resolved review thread] ...ark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManager.scala (outdated)
(checkpointPath, checkpointInterval)

val skipCheckpointFile: Boolean = params.get("skip_clean_checkpoint") match {
  case None => false
  // remaining case elided in the review excerpt; presumably along the lines of:
  // case Some(v) => v.toString.toBoolean
}

@trams (Contributor) commented on Aug 8, 2019

I am a bit worried that this changes the "API". Before this change, xgboost-spark does not clean its checkpoint folder; after it, cleaning happens by default.

What do you think about creating a "clean_checkpoint" param instead of "skip_clean_checkpoint"? Those who want cleaning could then enable it explicitly.

One use case where cleaning the checkpoint folder may not be a good idea: we train N trees, optionally validate the model, and then continue training another M trees (possibly with different hyperparameters). See the sketch below.
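A minimal sketch of the opt-in variant suggested above; the "clean_checkpoint" key and its parsing are hypothetical, not code from this PR:

val cleanCheckpoint: Boolean = params.get("clean_checkpoint") match {
  case None => false                     // default: the checkpoint folder is preserved
  case Some(v) => v.toString.toBoolean   // opt in with "clean_checkpoint" -> "true"
}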

@CodingCat (Member, Author) commented on Aug 8, 2019

Actually, the previous implementation is buggy: even if you wanted N trees, the leftover checkpoint is the one produced after N-1 iterations.

The reason I want to make cleaning the default behavior is that I have been bitten by the leftover several times: if I didn't change the checkpoint path, my next training run silently started from the stale checkpoint instead of from scratch. See the illustration below.
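For illustration, a hedged sketch of that trap; the path and values here are hypothetical:

// Run 1 trains with checkpointing enabled and leaves its last checkpoint behind.
// Run 2, reusing the same path, silently resumes from that checkpoint instead of
// training from scratch.
val params = Map(
  "num_round" -> 100,
  "checkpoint_path" -> "hdfs:///tmp/xgb-checkpoints",  // same path as the previous run
  "checkpoint_interval" -> 10
)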

@trams (Contributor) commented on Aug 14, 2019

👍 Now I understand your motivation.

@@ -53,6 +53,12 @@ private[spark] class CheckpointManager(sc: SparkContext, checkpointPath: String)
}
}

def cleanPath(): Unit = {
  if (checkpointPath != "") {
    // recursively delete the whole checkpoint directory
    FileSystem.get(sc.hadoopConfiguration).delete(new Path(checkpointPath), true)
  }
}

@trams (Contributor) commented on Aug 8, 2019

This assumes that CheckpointManager owns the folder (i.e. every file in it was created by this or an earlier CheckpointManager), so it is safe to remove the whole folder.

That is true for our use case, but I am not sure it actually holds for everybody. At the very least we should update the docs (and the 1.0 changelog) to mention this.

One way to solve this problem would be to reuse cleanUpHigherVersions: call it here to clean all versions, then remove the folder non-recursively. That would remove the directory only if it is empty; see the sketch below.
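A rough sketch of that alternative, assuming cleanUpHigherVersions(0) removes every checkpoint version file (its exact signature here is an assumption):

// Conservative cleanup: delete only the files we own, then remove the directory
// without recursion. FileSystem.delete(path, recursive = false) refuses to delete
// a non-empty directory, so files this manager did not create are never wiped.
def cleanPathNonRecursive(): Unit = {
  if (checkpointPath != "") {
    cleanUpHigherVersions(0)  // assumed to delete all checkpoint version files
    FileSystem.get(sc.hadoopConfiguration).delete(new Path(checkpointPath), false)
  }
}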

@@ -473,6 +475,11 @@ object XGBoost extends Serializable {
tracker.stop()
}
}.last
// we should delete the checkpoint directory after a successful training
if (!skipCleanCheckpoint) {
  // body elided in the review excerpt; presumably it invokes the
  // CheckpointManager's cleanPath() shown above
}

@trams (Contributor) commented on Aug 8, 2019

[Really minor] I am not sure whether xgboost has a Java/Scala coding style guide. I generally prefer a cleanCheckpoint flag over skipCleanCheckpoint; that avoids the extra negation, which makes the code slightly harder to read.
P.S. This is really bikeshedding.

@CodingCat (Member, Author) commented on Aug 8, 2019

As explained above, cleaning the checkpoint is the desired behavior; skipCleanCheckpoint is there mainly for testing.

Regarding the use case where you want to continue training, I think that belongs to another feature: starting training from an existing model.

@trams (Contributor) commented on Aug 14, 2019

Agreed on your proposed feature: training that starts from an existing model.

[Resolved review threads] three outdated threads on ...rc/test/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManagerSuite.scala
@codecov-io commented on Aug 8, 2019

Codecov Report

Merging #4754 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #4754   +/-   ##
=======================================
  Coverage   79.59%   79.59%           
=======================================
  Files          11       11           
  Lines        1965     1965           
=======================================
  Hits         1564     1564           
  Misses        401      401


@trams approved these changes on Aug 14, 2019

@CodingCat merged commit 7b5cbcc into dmlc:master on Aug 14, 2019

10 checks passed

Jenkins Linux: Build Stage built successfully
Jenkins Linux: Formatting Check Stage built successfully
Jenkins Linux: Get sources Stage built successfully
Jenkins Linux: Test Stage built successfully
Jenkins Win64: Build Stage built successfully
Jenkins Win64: Get sources Stage built successfully
Jenkins Win64: Test Stage built successfully
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/jenkins/pr-merge This commit looks good
continuous-integration/travis-ci/pr The Travis CI build passed