-
Notifications
You must be signed in to change notification settings - Fork 356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: local clusters to store checkpoint data in home [DET-5154] #2170
Conversation
f31d26a
to
a307dee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just some q's
master/cmd/determined-master/init.go
Outdated
@@ -129,4 +130,9 @@ func registerConfig() { | |||
defaults.Telemetry.SegmentMasterKey, "the Segment write key for the master") | |||
registerString(flags, name("telemetry", "segment-webui-key"), | |||
defaults.Telemetry.SegmentWebUIKey, "the Segment write key for the WebUI") | |||
|
|||
registerString(flags, name("checkpoint-storage", "type"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should work fine afaik, but, you know, test it, lol. i'm not 100% sure how this registerString
will work with our union types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had to register this type (and not just host_path) because the union type validator would throw up on host_path without type=shared_fs.
master/pkg/model/defaults.go
Outdated
// DefaultCheckpointStorageType is SharedFS. | ||
DefaultCheckpointStorageType = "shared_fs" | ||
// DefaultSharedFSHostPath is the default path on hosts for SharedFS storage mounts. | ||
DefaultSharedFSHostPath = "/tmp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding this default will mean users can now not supply checkpoint_storage
anywhere and have a valid exp conf, and this is new afaik, it always had to be set somewhere previously. Now that there is a valid default i'd make sure the defaulting logic with the master config's checkpoint config and the template checkpoint configs doesn't get busted. I'm not sure right off what would happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the flag is going to apply to the master config, and det deploy just needs to edit the master config defaults, do we need this code messing with the exp conf at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I assume previously it'd come from the pre-packaged master.yaml, which explicitly specifies /tmp. I suppose the change is that if you pass a master.yaml which doesn't have checkpoint_storage, we'd now store data in /tmp instead of exiting. That probably won't break existing users.
- There's a unit test
TestUnmarshalMasterConfigurationViaViper
, which effectively checks that the config values are the same as ininternal.DefaultConfig
. All of this helps to make that true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but this change updates the default experiment config (model.DefaultExperimentConfig
), not master config (internal.DefaultConfig
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
master's checkpoint storage config comes from experiment config: https://github.com/ioga/determined/blob/master/master/internal/config.go#L30
do you think default checkpoint storage config should be separate and different for master and experiment configs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think that makes sense. I think experiments should have no default but master config should have a default (or have no default and have det-deploy supply the default for local cases).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
went down the latter path of not changing the default, please see the updated code.
It doesn't use registerString
machinery and doesn't expose the same config as a flag, because pflags
methods require you to provide the default, so using BindPFlag
(e.g. in registerString
) ends up setting the default. And it can't be empty string either, because the parser throws up because it's not a valid config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sweet, thanks for updating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Description
det deploy local
has been changed to OS-specific user data dir (e.g.$XDG_DATA_HOME/determined
or~/.local/share/determined
on Linux, and~/Library/Application Support/determined
on macOS). Previously,/tmp
was used.--storage-host-path
command line flag ofdet deploy local
.master.yaml
via--master-config-path
,checkpoint_storage
configuration in yaml will continue to take precedence.Test Plan
Commentary (optional)
Checklist
docs/release-notes/
.See Release Note for details.