feat: cluster wide startup hook for tasks #9124

Merged
merged 10 commits into from
Apr 9, 2024

@hamidzr (Member) commented Apr 8, 2024

Ticket

TODO

  • slurm test
  • docs and release notes
  • an e2e test that shows the hook running (via logs or a command?)

Description

Test Plan

  • set the startup hook in various task container defaults: master config, resource manager
    • check that the RM startup hook overrides the cluster-wide one
  • unit tests
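
The override check in the test plan can be pictured with a small shell sketch. The names `effective_hook`, `cluster_hook`, and `rm_hook` are hypothetical, and the real merge happens in the master's Go code rather than in shell. The sketch only illustrates the expected precedence, assuming that an empty resource-manager value falls back to the cluster-wide one.

```shell
# Hypothetical illustration of hook precedence; not Determined's actual merge code.
# Assumption: a non-empty resource-manager hook replaces the cluster-wide hook,
# and an unset or empty one falls back to it.
effective_hook() {
    cluster_hook="$1"
    rm_hook="$2"
    printf '%s\n' "${rm_hook:-$cluster_hook}"
}

# Resource manager defines its own hook: it wins.
with_override="$(effective_hook 'echo "from master config"' 'echo "from resource manager"')"
# Resource manager defines nothing: the cluster-wide hook is used.
without_override="$(effective_hook 'echo "from master config"' '')"

echo "$with_override"
echo "$without_override"
```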

basic example

        task_container_defaults:
          startup_hook: |
            echo "TCD startup thing from master config"
            echo "task type is $DET_TASK_TYPE"
hbp ➜ d/determined-ee git:(hz-ee-startuphook) ✗ det cmd run 'ls $HOME; date; ls /run/determined/'
Launched command (id: 3d1d922b-4edd-4556-b2df-a94806138a4a).
[2024-04-09T21:50:07.432659Z]          || INFO: Scheduling Command (singularly-enormous-ibex) (id: 3d1d922b-4edd-4556-b2df-a94806138a4a.1)
[2024-04-09T21:50:11.340487Z]          || INFO: HPC Job ID: 3
[2024-04-09T21:50:11.342877Z]          || INFO: Command (singularly-enormous-ibex) was assigned to an agent
[2024-04-09T21:50:16.739730Z] [cont=0] || WARNING: [10466] determined: falling back to naive proxy ip resolution (error=None)
[2024-04-09T21:50:17.238141Z] [cont=0] || + test -f /run/determined/dynamic-tcd-startup-hook.sh
[2024-04-09T21:50:17.238237Z] [cont=0] || + source /run/determined/dynamic-tcd-startup-hook.sh
[2024-04-09T21:50:17.238266Z] [cont=0] || ++ echo hi from master tcd slurm
[2024-04-09T21:50:17.238279Z] [cont=0] || + set +x
[2024-04-09T21:50:17.238379Z] [cont=0] || hi from master tcd slurm
[2024-04-09T21:50:17.246344Z] [cont=0] || Tue Apr  9 21:50:17 UTC 2024
[2024-04-09T21:50:17.248080Z] [cont=0] || command-entrypoint.sh
[2024-04-09T21:50:17.248151Z] [cont=0] || dynamic-tcd-startup-hook.sh
[2024-04-09T21:50:17.248172Z] [cont=0] || etc
[2024-04-09T21:50:17.248190Z] [cont=0] || info
[2024-04-09T21:50:17.248208Z] [cont=0] || pythonuserbase
[2024-04-09T21:50:17.248223Z] [cont=0] || ship-logs.sh
[2024-04-09T21:50:17.248234Z] [cont=0] || ship_logs.py
[2024-04-09T21:50:17.248260Z] [cont=0] || singularity-entrypoint-wrapper.sh
[2024-04-09T21:50:17.248267Z] [cont=0] || task-setup.sh
[2024-04-09T21:50:17.248273Z] [cont=0] || train
[2024-04-09T21:50:17.248279Z] [cont=0] || workdir
[2024-04-09T21:50:21.314010Z]          || INFO: resources exited successfully with a zero exit code
[2024-04-09T21:50:21.330339Z]          || INFO: Command (singularly-enormous-ibex) was terminated: allocation stopped early after all resources exited: resources exited successfully with a zero exit code
Task log stream ended. To reopen log stream, run: det task logs -f 3d1d922b-4edd-4556-b2df-a94806138a4a
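
The trace in the log above (`test -f`, `source`, `set +x`) suggests that the entrypoint sources the hook only when the file is present. A minimal sketch of that pattern follows. The function name `run_tcd_startup_hook` and the temporary directory standing in for `/run/determined` are assumptions for illustration, and the actual logic in the PR's entrypoint scripts may differ.

```shell
# Hypothetical helper mirroring the trace in the log; not the PR's actual script.
run_tcd_startup_hook() {
    hook_path="$1"
    set -x                   # trace the following commands into the task log
    if test -f "$hook_path"; then
        . "$hook_path"       # POSIX spelling of bash's `source`
    fi
    set +x
}

# Demo: a temporary directory stands in for /run/determined.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/dynamic-tcd-startup-hook.sh" <<'EOF'
echo "hi from master tcd"
HOOK_RAN=1
EOF

run_tcd_startup_hook "$demo_dir/dynamic-tcd-startup-hook.sh"  # runs the hook
run_tcd_startup_hook "$demo_dir/missing-hook.sh"              # skipped, no error
```

Because the hook is sourced rather than executed, anything it exports or defines stays visible to the rest of the entrypoint.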

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

https://hpe-aiatscale.atlassian.net/browse/RM-105

@cla-bot cla-bot bot added the cla-signed label Apr 8, 2024

netlify bot commented Apr 8, 2024

Deploy Preview for determined-ui canceled.

Name                    Link
🔨 Latest commit        713a59c
🔍 Latest deploy log    https://app.netlify.com/sites/determined-ui/deploys/6615b87565911500081eff96

@@ -170,6 +172,7 @@ var mergeCopier = copier.Option{IgnoreEmpty: true, DeepCopy: true}
func (c TaskContainerDefaultsConfig) Merge(
other TaskContainerDefaultsConfig,
) (TaskContainerDefaultsConfig, error) {
// Q: where is this used other than tests?
Member Author

func Merge[T any](obj T, src T) T {
?

Contributor

Member Author

Thanks. Is that a different strategy compared to agent and k8s? Related to the thread here: https://hpe-aiatscale.slack.com/archives/C06GMG83ZE0/p1712609136055039

master/pkg/tasks/task.go (outdated, resolved)

codecov bot commented Apr 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 46.90%. Comparing base (8b11e3a) to head (170ffd1).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9124      +/-   ##
==========================================
+ Coverage   46.35%   46.90%   +0.55%     
==========================================
  Files         754      754              
  Lines      104666   104666              
  Branches     2412     2412              
==========================================
+ Hits        48515    49094     +579     
+ Misses      55943    55364     -579     
  Partials      208      208              
Flag      Coverage Δ
backend   34.24% <ø> (+0.04%) ⬆️
harness   63.79% <ø> (+1.46%) ⬆️
web       36.82% <ø> (ø)

Flags with carried forward coverage won't be shown.

see 15 files with indirect coverage changes

@hamidzr hamidzr force-pushed the hz-global-startup branch 3 times, most recently from 92eb6f0 to 9555937 Compare April 8, 2024 21:03
@hamidzr hamidzr changed the title from "cluster wide tasks startup" to "feat: cluster wide tasks startup" Apr 8, 2024
@hamidzr hamidzr marked this pull request as ready for review April 8, 2024 22:30
@hamidzr hamidzr requested review from a team as code owners April 8, 2024 22:30
@hamidzr hamidzr requested a review from kkunapuli April 8, 2024 22:30
@hamidzr hamidzr changed the title from "feat: cluster wide tasks startup" to "feat: cluster wide startup hook for tasks" Apr 8, 2024
@hamidzr hamidzr requested a review from stoksc April 8, 2024 22:31
Contributor

@NicholasBlaskey NicholasBlaskey left a comment

one comment that I think should provide a simplification

master/pkg/check/check_string.go (outdated, resolved)
@@ -53,3 +53,37 @@ if [ "$HOME" = "/" ]; then
)" || HOME="$PWD"
export HOME
fi

cleaned_args=()
Contributor

There are two points here that I think make this feel a little more complicated:

  1. Can we just hardcode the path, so we always know we just need to run source /run/determined/tcd-startup-hook.sh?

  2. Can we assume the file always exists? (If it is empty, it is a no-op.)

If we make these assumptions, could the only change to the startup files be adding a source /run/determined/tcd-startup-hook.sh line?
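
The simplification proposed in this comment relies on sourcing an empty file being a no-op. A small sketch of that assumption follows, using a temporary directory in place of `/run/determined`. The file name comes from the comment; the variable names and the sample hook content are illustrative only.

```shell
# Illustrative only: a temporary directory stands in for /run/determined.
run_dir="$(mktemp -d)"
hook="$run_dir/tcd-startup-hook.sh"

# Case 1: the file exists but is empty, so sourcing it does nothing.
: > "$hook"
. "$hook"
empty_status=$?

# Case 2: the hook has content, which runs in the current shell.
printf 'TCD_HOOK_MSG="hello from hook"\n' > "$hook"
. "$hook"
echo "$TCD_HOOK_MSG"
```

The trade-off is that the file must always be written, even when no hook is configured; otherwise the unconditional `source` fails on the missing file.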

Member Author

That's much simpler. I think I was just excited to write some bash arg parsing. I'll update.

Member Author

@hamidzr hamidzr Apr 9, 2024

Is it safe to assume the task RunDir is fixed? We have a single constant RunDir in our Go code for it, and we have other pieces that would break if this were to change.

Member Author

@BOsterbuhr any comments based on your usage and requirements?

master/static/srv/notebook-entrypoint.sh (outdated, resolved)

@determined-ci determined-ci added the documentation Improvements or additions to documentation label Apr 9, 2024
@determined-ci determined-ci requested a review from a team April 9, 2024 15:55
add a new field for startup script in task container defaults

add basic merge policy

allow normal tcd merge. add test

add validation and StartupScriptFilename

wip run the startupscripts

add to run archive

remove the bogus exit0

task-setup tuning

remove static srv dynamic-tcd.sh

restore original args

merge q

remove startup_script_filename

rename startup script to hook

task-setup.sh: Rename tcd_filename to tcd_startup_hook_path

clean up and only conditionally add the dynamic script

add unit tests

comment cleanup

only err if arg is provided but file missing

fmt
these should be nested ideally
@BOsterbuhr

@hamidzr we need to update helm/charts/determined/templates/master-config.yaml as well as helm/charts/determined/values.yaml to allow for this to be configured via helm.

@hamidzr
Member Author

hamidzr commented Apr 9, 2024

> @hamidzr we need to update helm/charts/determined/templates/master-config.yaml as well as helm/charts/determined/values.yaml to allow for this to be configured via helm.

thanks for the snippet and the reminder. I started working on that here

Contributor

@NicholasBlaskey NicholasBlaskey left a comment

LGTM, nice job

@pytest.mark.e2e_cpu_elastic
@pytest.mark.e2e_cpu_cross_version
@pytest.mark.e2e_gpu # Note, e2e_gpu and not gpu_required hits k8s cpu tests.
@pytest.mark.e2e_slurm
Contributor

I don't know if you have already, but I would check that this passes on Slurm.

Member Author

I manually tested the older version and I'm checking the new one. Is there a good way of adding this to be run on Slurm from this repo? It would need a separate follow-up PR on EE after we land this, or we could set up a different EE PR and make modifications there.

@@ -10,6 +10,10 @@ set -e
# to register the proxy with the Determined master.
"$DET_PYTHON_EXECUTABLE" -m determined.exec.prep_container --proxy --download_context_directory

set -x
Contributor

Do commands not support the startup hook?

Maybe just make a ticket calling this out? Not sure if that is intentional.

Member Author

``startup_hook``
================

An optional inline script that will be executed as part of task set up. This is defined under
Contributor

I know you had a separate PR, but what about Helm / Helm docs?

If you want to do it separately that is fine, but I think I prefer one PR.

Member Author

That'll take me longer to test, potentially not until tomorrow. I think it'd make sense to make sure the main changes are tested and landed ASAP?

@hamidzr hamidzr merged commit 28b3aff into main Apr 9, 2024
69 of 85 checks passed
@hamidzr hamidzr deleted the hz-global-startup branch April 9, 2024 22:22
Labels
cla-signed documentation Improvements or additions to documentation

5 participants