-
Notifications
You must be signed in to change notification settings - Fork 356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: cluster wide startup hook for tasks #9124
Conversation
✅ Deploy Preview for determined-ui canceled.
|
@@ -170,6 +172,7 @@ var mergeCopier = copier.Option{IgnoreEmpty: true, DeepCopy: true} | |||
func (c TaskContainerDefaultsConfig) Merge( | |||
other TaskContainerDefaultsConfig, | |||
) (TaskContainerDefaultsConfig, error) { | |||
// Q: where is this used other than tests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
determined/master/pkg/schemas/merge.go
Line 66 in d5f807d
func Merge[T any](obj T, src T) T { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks is that a different strategy compared to agent and k8s? related to the thread here https://hpe-aiatscale.slack.com/archives/C06GMG83ZE0/p1712609136055039
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #9124 +/- ##
==========================================
+ Coverage 46.35% 46.90% +0.55%
==========================================
Files 754 754
Lines 104666 104666
Branches 2412 2412
==========================================
+ Hits 48515 49094 +579
+ Misses 55943 55364 -579
Partials 208 208
Flags with carried forward coverage won't be shown. Click here to find out more. |
92eb6f0
to
9555937
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one comment that I think should provide a simplification
master/static/srv/task-setup.sh
Outdated
@@ -53,3 +53,37 @@ if [ "$HOME" = "/" ]; then | |||
)" || HOME="$PWD" | |||
export HOME | |||
fi | |||
|
|||
cleaned_args=() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is two points here that I think are making this feels a little more complicated
-
Can we just hardcode the path? So we always know we just need to run
source /run/determined/tcd-startup-hook.sh
-
Can we assume the file always exists (if it is empty it is a noop).
If we make these assumptions could the only change to the startup files be adding a source /run/determined/tcd-startup-hook.sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's much simpler, I think I was just excited to write some bash arg parsing I'll update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it safe to assume task RunDir is fixed? we have a single constant we have other pieces that'd break if this was to change.RunDir
in our go code for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BOsterbuhr any comments based on your usage and requirements?
@@ -170,6 +172,7 @@ var mergeCopier = copier.Option{IgnoreEmpty: true, DeepCopy: true} | |||
func (c TaskContainerDefaultsConfig) Merge( | |||
other TaskContainerDefaultsConfig, | |||
) (TaskContainerDefaultsConfig, error) { | |||
// Q: where is this used other than tests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a new field for startup script in task container defaults add basic merge policy allow normal tcd merge. add test add validation and StartupScriptFilename wip run the startupscripts add to run archive remove the bogus exit0 task-setup tuning remove static srv dynamic-tcd.sh restore original args merge q remove startup_script_filename rename startup script to hook task-setup.sh: Rename tcd_filename to tcd_startup_hook_path clean up and only conditionally add the dynamic script add unit tests comment cleanup only err if arg is provided but file missing fmt
these should be nested ideally
030237e
to
170ffd1
Compare
@hamidzr we need to update |
170ffd1
to
6ef596a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice job
@pytest.mark.e2e_cpu_elastic | ||
@pytest.mark.e2e_cpu_cross_version | ||
@pytest.mark.e2e_gpu # Note, e2e_gpu and not gpu_required hits k8s cpu tests. | ||
@pytest.mark.e2e_slurm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if you have already but would check this passes in slurm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I manually tested the older version and I'm checking the new one. any good way of adding this to be run on slurm from this repo? would need a separate follow up pr on ee after we land this or setup a different ee pr and make modifications there.
@@ -10,6 +10,10 @@ set -e | |||
# to register the proxy with the Determined master. | |||
"$DET_PYTHON_EXECUTABLE" -m determined.exec.prep_container --proxy --download_context_directory | |||
|
|||
set -x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do commands not support start up hook?
maybe just make a ticket calling this out? Not sure if that is intentional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
``startup_hook`` | ||
================ | ||
|
||
An optional inline script that will be executed as part of task set up. This is defined under |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you had a seperate pr but helm / helm docs?
If you wanna do seperate that is fine but I think I prefer one PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that'll take me longer to test potentially not until tomorrow. I think it'd make sense to make sure the main changes are tested and landed asap?
Ticket
TODO
Description
determined/master/internal/rm/kubernetesrm/kubernetes_resource_manager.go
Line 631 in d5f807d
Test Plan
basic example
Checklist
docs/release-notes/
.See Release Note for details.
https://hpe-aiatscale.atlassian.net/browse/RM-105