feat: local clusters to store checkpoint data in home [DET-5154] #2170

Merged: 4 commits, Apr 6, 2021

Contributor

@ioga ioga commented Apr 5, 2021

Description

  • The default storage location for checkpoint data in local clusters deployed via det deploy local is now the OS-specific user data directory (e.g. $XDG_DATA_HOME/determined or ~/.local/share/determined on Linux, and ~/Library/Application Support/determined on macOS). Previously, /tmp was used.
  • This location can be changed with the --storage-host-path command-line flag of det deploy local.
  • If users provide their own custom master.yaml via --master-config-path, the checkpoint_storage configuration in that YAML file continues to take precedence.
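
The lookup order described in these bullets can be sketched roughly as follows. This is only an illustration, written in Go to match the master code quoted later in this thread; the deployment tool's actual implementation is not shown here. The function names are hypothetical, and the assumption that an empty value means "not provided" is mine.

```go
package main

import (
	"fmt"
	"path"
)

// defaultStorageHostPath mirrors the locations named in the description.
// goos, home and xdgDataHome are parameters (rather than read from the
// environment) only so the sketch is easy to exercise. Hypothetical helper.
func defaultStorageHostPath(goos, home, xdgDataHome string) string {
	if goos == "darwin" {
		return path.Join(home, "Library", "Application Support", "determined")
	}
	// Everything else is treated as Linux in this sketch.
	if xdgDataHome != "" {
		return path.Join(xdgDataHome, "determined")
	}
	return path.Join(home, ".local", "share", "determined")
}

// resolveHostPath applies the precedence from the description:
// checkpoint_storage from a custom master.yaml first, then
// --storage-host-path, then the per-user default.
// Empty strings mean "not provided" (an assumption of this sketch).
func resolveHostPath(yamlHostPath, flagHostPath, fallback string) string {
	if yamlHostPath != "" {
		return yamlHostPath
	}
	if flagHostPath != "" {
		return flagHostPath
	}
	return fallback
}

func main() {
	def := defaultStorageHostPath("linux", "/home/alice", "")
	fmt.Println(resolveHostPath("", "", def))                     // per-user default
	fmt.Println(resolveHostPath("", "/mnt/ckpt", def))            // flag wins over default
	fmt.Println(resolveHostPath("/srv/shared", "/mnt/ckpt", def)) // master.yaml wins over flag
}
```

How a custom master.yaml without a checkpoint_storage section is handled is not stated in the description; the sketch simply falls through to the flag and then the default.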

Test Plan

  • Check that the old experiment data in /tmp is not lost after upgrade.
  • Check that the new default directory, --storage-host-path, and --master-config-path all work.

Checklist

  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot cla-bot bot added the cla-signed label Apr 5, 2021
@ioga ioga force-pushed the checkpoint-storage branch 2 times, most recently from f31d26a to a307dee Compare April 5, 2021 19:26
Contributor

@stoksc stoksc left a comment

looks good, just some q's

@@ -129,4 +130,9 @@ func registerConfig() {
defaults.Telemetry.SegmentMasterKey, "the Segment write key for the master")
registerString(flags, name("telemetry", "segment-webui-key"),
defaults.Telemetry.SegmentWebUIKey, "the Segment write key for the WebUI")

registerString(flags, name("checkpoint-storage", "type"),
Contributor

this should work fine afaik, but, you know, test it, lol. i'm not 100% sure how this registerString will work with our union types.

Contributor Author

I've had to register this type (and not just host_path) because the union type validator would throw up on host_path without type=shared_fs.
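
To illustrate why host_path alone is rejected, here is a toy tagged-union check. It is not the project's validator; the struct, its fields, and the error messages are hypothetical and only show the general shape of the problem: without a type, there is nothing to tell the validator which union member host_path belongs to.

```go
package main

import (
	"errors"
	"fmt"
)

// checkpointStorage is a hypothetical, flattened stand-in for a union-typed
// config: Type selects the member, and the other fields belong to it.
type checkpointStorage struct {
	Type     string
	HostPath string
}

// validate rejects configs whose discriminator is missing, even when the
// member-specific fields are present.
func (c checkpointStorage) validate() error {
	switch c.Type {
	case "shared_fs":
		if c.HostPath == "" {
			return errors.New("shared_fs requires host_path")
		}
		return nil
	case "":
		return errors.New("type is required to select a union member")
	default:
		return fmt.Errorf("type %q is not handled in this sketch", c.Type)
	}
}

func main() {
	// host_path without type: rejected.
	fmt.Println(checkpointStorage{HostPath: "/tmp"}.validate())
	// type=shared_fs plus host_path: accepted.
	fmt.Println(checkpointStorage{Type: "shared_fs", HostPath: "/tmp"}.validate())
}
```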

// DefaultCheckpointStorageType is SharedFS.
DefaultCheckpointStorageType = "shared_fs"
// DefaultSharedFSHostPath is the default path on hosts for SharedFS storage mounts.
DefaultSharedFSHostPath = "/tmp"
Contributor

I think adding this default means users can now omit checkpoint_storage everywhere and still have a valid exp conf. That's new afaik; it always had to be set somewhere previously. Now that there is a valid default, I'd make sure the defaulting logic between the master config's checkpoint config and the template checkpoint configs doesn't get busted. I'm not sure offhand what would happen.

Contributor

If the flag is going to apply to the master config, and det deploy just needs to edit the master config defaults, do we need this code messing with the exp conf at all?

Contributor Author

  1. I assume previously it'd come from the pre-packaged master.yaml, which explicitly specifies /tmp. I suppose the change is that if you pass a master.yaml which doesn't have checkpoint_storage, we'd now store data in /tmp instead of exiting. That probably won't break existing users.
  2. There's a unit test TestUnmarshalMasterConfigurationViaViper, which effectively checks that the config values are the same as in internal.DefaultConfig. All of this helps to make that true.

Contributor

Yeah, but this change updates the default experiment config (model.DefaultExperimentConfig), not master config (internal.DefaultConfig).

Contributor Author

master's checkpoint storage config comes from experiment config: https://github.com/ioga/determined/blob/master/master/internal/config.go#L30
do you think default checkpoint storage config should be separate and different for master and experiment configs?

Contributor

yeah i think that makes sense. I think experiments should have no default but master config should have a default (or have no default and have det-deploy supply the default for local cases).
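
A rough sketch of the layering suggested here, where the experiment config carries no default of its own and falls back to whatever the master config provides. The types and the effectiveStorage helper are hypothetical and do not reflect the project's actual merge logic; template configs are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// checkpointStorage is a hypothetical stand-in for the real config type.
type checkpointStorage struct {
	Type     string
	HostPath string
}

// effectiveStorage picks the experiment's own setting when present and
// otherwise the master's. A nil pointer means "not configured". If neither
// layer sets it, the experiment config is invalid, which is what "no
// default for experiments" implies.
func effectiveStorage(exp, master *checkpointStorage) (*checkpointStorage, error) {
	if exp != nil {
		return exp, nil
	}
	if master != nil {
		return master, nil
	}
	return nil, errors.New("checkpoint_storage is not set in the experiment or master config")
}

func main() {
	// In the local-cluster case, the deploy tool would be the one filling in
	// the master-level value (the path below is made up for illustration).
	master := &checkpointStorage{Type: "shared_fs", HostPath: "/home/alice/.local/share/determined"}

	got, err := effectiveStorage(nil, master)
	fmt.Println(got.HostPath, err)

	_, err = effectiveStorage(nil, nil)
	fmt.Println(err)
}
```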

Contributor Author

went down the latter path of not changing the default, please see the updated code.

It doesn't use the registerString machinery and doesn't expose the same config as a flag: pflag methods require you to provide a default, so using BindPFlag (e.g. in registerString) ends up setting that default. It can't be an empty string either, because the parser rejects it as an invalid config.
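
The "a flag always carries a default" problem can be shown by analogy with Go's standard flag package. This sketch deliberately does not use pflag or viper, and the flag name is hypothetical. It only demonstrates that a registered string flag yields its default when the user passes nothing, so telling "unset" apart from "set" needs an extra step.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// hostPathFlag is a hypothetical flag name used only in this sketch.
const hostPathFlag = "checkpoint-storage-host-path"

// parseHostPath registers a string flag with a mandatory default and reports
// both the resulting value and whether the user actually passed the flag.
func parseHostPath(args []string) (string, bool, error) {
	fs := flag.NewFlagSet("master", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on parse errors

	// Registration forces us to choose a default value up front.
	hostPath := fs.String(hostPathFlag, "/tmp", "host path for shared_fs checkpoints")
	if err := fs.Parse(args); err != nil {
		return "", false, err
	}

	// Visit only walks flags that were set on the command line.
	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == hostPathFlag {
			explicit = true
		}
	})
	return *hostPath, explicit, nil
}

func main() {
	fmt.Println(parseHostPath(nil))
	fmt.Println(parseHostPath([]string{"--" + hostPathFlag + "=/data"}))
}
```

Once the value is merged into the config without the "explicit" bit, the default is indistinguishable from a user choice, which is the effect described above.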

Contributor

sweet, thanks for updating.

@ioga ioga requested a review from dzhu April 5, 2021 20:48
@ioga ioga marked this pull request as ready for review April 5, 2021 20:48
@ioga ioga requested a review from shiyuann as a code owner April 5, 2021 20:48
Contributor

@stoksc stoksc left a comment

lgtm

@ioga ioga merged commit d979092 into determined-ai:master Apr 6, 2021
@ioga ioga deleted the checkpoint-storage branch April 6, 2021 23:23
@dannysauer dannysauer added this to the 0.15.0 milestone Feb 6, 2024