Contributor

jswudi commented Sep 10, 2024

Issue #, if available:

The help message for auto-resume is incorrect.

Description of changes:

HyperPod resilience job auto-resume is supported in all namespaces.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

adheshgarg left a comment

Approved.

jswudi merged commit 3659d55 into aws:main Sep 10, 2024
xiaoxshe pushed a commit to xiaoxshe/sagemaker-hyperpod-cli that referenced this pull request Dec 4, 2024
* update the helm chart to create team level roles and bindings

* revert unrelated changes

* Rename quotaAllocationTarget to computeQuotaTarget

* remove kueue related resources from helm chart

* Remove parameters of kueue from chart

* flip the team role creation to false

* Revise readme to add instructions to create the role and binding
xiaoxshe added a commit that referenced this pull request Dec 4, 2024
* add recipes feature for distributed training

* improve unit test coverage for recipes feature

* add support recipes along with command line args

* add recipes

* Crescendo helm chart for role and rolebinding (#17)

* update the helm chart to create team level roles and bindings

* revert unrelated changes

* Rename quotaAllocationTarget to computeQuotaTarget

* remove kueue related resources from helm chart

* Remove parameters of kueue from chart

* flip the team role creation to false

* Revise readme to add instructions to create the role and binding

* add changelog for distributed training

* change to public submodules

* QuotaAllocation support for Hyperpod CLI (#12)

* QuotaAllocation support for Hyperpod CLI

---------

Co-authored-by: Amazon GitHub Automation <54958958+amazon-auto@users.noreply.github.com>
Co-authored-by: Song Jiang <jiangsongbz@gmail.com>
Co-authored-by: Baiyang Li <baiyanl@amazon.com>
Co-authored-by: baiyli <105086653+baiyli@users.noreply.github.com>

* Remove custom_launcher folder

* sync with mainline

---------

Co-authored-by: cansun <80425164+can-sun@users.noreply.github.com>
Co-authored-by: Amazon GitHub Automation <54958958+amazon-auto@users.noreply.github.com>
Co-authored-by: Song Jiang <jiangsongbz@gmail.com>
Co-authored-by: Baiyang Li <baiyanl@amazon.com>
Co-authored-by: baiyli <105086653+baiyli@users.noreply.github.com>
Co-authored-by: Can Sun <sucan@amazon.com>
3 participants