feat: Recipe Job Init Experience (hyp init hyp-recipe-job)#409

Merged
mollyheamazon merged 23 commits into aws:main from mollyheamazon:model-customization
Apr 16, 2026

@mollyheamazon
Collaborator

What's changing and why?

This PR introduces hyp-recipe-job as a new template for hyp init, enabling customers to initialize, configure, and submit fine-tuning and evaluation jobs backed by SageMaker JumpStart Hub recipes, without managing recipe files, GitHub submodules, or S3 URIs manually.

Previously, HyperPod CLI V3 had no support for recipe-based jobs. This gap is now addressed by fetching recipes directly from JumpStart Hub APIs at init time.

User Experience Flow

1. Initialize a recipe job

hyp init hyp-recipe-job ./my-job \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --technique SFT \
  --instance-type ml.p4d.24xlarge

This fetches the matching recipe from JumpStart Hub, downloads the override params and k8s template, and generates:

  • config.yaml — grouped, annotated parameter file (Job Identity → Data → Output → Hyperparameters → MLflow → Compute → Model)
  • k8s.jinja — Kubernetes job manifest template
  • .override_spec.json — local schema for validation and configure

If --instance-type is omitted, hyp init launches an interactive cluster selection: it lists your HyperPod clusters, filtered to those with instance types supported by the recipe, prompts you to select one, and automatically updates kubeconfig.
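
The filtering step of the interactive selection can be pictured roughly as follows. This is a minimal sketch, not the CLI's implementation: the function name `compatible_clusters` and the dictionary shape (cluster name mapped to the instance types present in it) are assumptions made for illustration, and the real command gathers this information from AWS APIs.

```python
from typing import Dict, List, Set


def compatible_clusters(
    clusters: Dict[str, Set[str]], supported: Set[str]
) -> List[str]:
    """Return names of clusters with at least one recipe-supported instance type."""
    # A cluster qualifies when its instance types overlap the supported set.
    return sorted(name for name, types in clusters.items() if types & supported)


if __name__ == "__main__":
    # Hypothetical inventory: cluster name -> instance types present in it.
    inventory = {
        "train-a": {"ml.p4d.24xlarge", "ml.m5.xlarge"},
        "cpu-only": {"ml.m5.xlarge"},
    }
    print(compatible_clusters(inventory, {"ml.p4d.24xlarge"}))
```

An empty result corresponds to the "No compatible clusters found" edge case, where the CLI reports an error listing the supported instance types.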

2a. Edit config.yaml

Users fill in required fields (data paths, output paths, job name) and optionally tune hyperparameters. The file is grouped with section headers and inline type/constraint comments.

2b. Configure individual fields

hyp configure --learning-rate 0.00005
hyp configure --global-batch-size 16

3. Validate config.yaml

hyp validate

Validates the current directory's config.yaml against the recipe's parameter schema (.override_spec.json). It checks that required fields are present, that types are correct, and that values satisfy their constraints (min/max/enum). Run this after editing to catch errors before submission.
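
The three kinds of checks (required, type, constraint) could look something like the sketch below. The spec layout used here (`type`, `required`, `min`, `max`, `enum` keys per parameter) is an assumption for illustration; the actual structure of .override_spec.json is not shown in this PR description, and `validate` is a hypothetical helper rather than the CLI's function.

```python
from typing import Any, Dict, List

# Hypothetical mapping from spec type names to Python types.
_TYPES = {"string": str, "integer": int, "float": (int, float), "boolean": bool}


def validate(config: Dict[str, Any], spec: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of error messages; an empty list means the config is valid."""
    errors: List[str] = []
    for name, rule in spec.items():
        value = config.get(name)
        if value is None:
            if rule.get("required"):
                errors.append(f"{name}: required field is missing")
            continue
        expected = _TYPES.get(rule.get("type", ""))
        # bool is a subclass of int in Python, so reject it for numeric fields.
        wrong_type = expected is not None and (
            not isinstance(value, expected)
            or (isinstance(value, bool) and rule.get("type") != "boolean")
        )
        if wrong_type:
            errors.append(f"{name}: expected {rule['type']}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: {value} is below min {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: {value} is above max {rule['max']}")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: {value!r} is not one of {rule['enum']}")
    return errors
```

Collecting every error before reporting, rather than stopping at the first, lets a user fix config.yaml in one pass.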

4. Submit the job

hyp create 

Validates config, warns if the instance type isn't present in the current cluster, renders the k8s manifest, and submits to Kubernetes. A timestamped snapshot is saved under run/<timestamp>/.

Debugging & Job Management

Once a recipe job is submitted, the full set of job management commands is available under hyp-recipe-job:

# List all recipe jobs in a namespace
hyp list hyp-recipe-job [-n <namespace>]

# Inspect job status and conditions (use this when pods are Pending/Failed)
hyp describe hyp-recipe-job --job-name <name> [-n <namespace>]

# List pods for a specific job (to identify which pod to fetch logs from)
hyp list-pods hyp-recipe-job --job-name <name> [-n <namespace>]

# Fetch logs from a specific pod
hyp get-logs hyp-recipe-job --job-name <name> --pod-name <pod> [-n <namespace>]

# Delete a job
hyp delete hyp-recipe-job --job-name <name> [-n <namespace>]

Features

Model ID formats supported

  • JumpStart model ID: meta-textgeneration-llama-3-1-8b-instruct
  • HuggingFace model ID: meta-llama/Llama-3.1-8B-Instruct (resolved via list_hub_contents + @recipe: keyword filter, reference doc)
  • Hub Content ARN: arn:aws:sagemaker:...:hub-content/MyHub/Model/my-model/1.0.0 (private hub support for internal team development)
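
One way to tell the three formats apart is by their surface shape, as in this sketch. The function name `classify_model_id` and the returned labels are hypothetical, and the heuristics (an ARN prefix, a slash for HuggingFace IDs) are assumptions; the CLI may distinguish the formats differently.

```python
def classify_model_id(model_id: str) -> str:
    """Guess which of the three supported ID formats a string uses."""
    # Hub Content ARNs start with the SageMaker ARN prefix and name a hub-content resource.
    if model_id.startswith("arn:aws:sagemaker:") and ":hub-content/" in model_id:
        return "hub-content-arn"
    # HuggingFace IDs have the form "<org>/<model>".
    if "/" in model_id:
        return "huggingface"
    # Anything else is treated as a JumpStart model ID.
    return "jumpstart"


if __name__ == "__main__":
    for example in (
        "meta-textgeneration-llama-3-1-8b-instruct",
        "meta-llama/Llama-3.1-8B-Instruct",
    ):
        print(example, "->", classify_model_id(example))
```

The ARN check has to run before the slash check, because a Hub Content ARN also contains slashes.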

Techniques supported

  • Fine-tuning: SFT, DPO, RLAIF, RLVR, CPT, PPO
  • Evaluation: deterministic, LLMAJ

Grouped config.yaml rendering

Reference doc
Parameters are ordered by user priority (not alphabetically): Job Identity → Data → Output → Core Hyperparameters → Advanced Hyperparameters → MLflow → Compute → Model. Unknown params from future recipes fall into an Other section automatically.
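
A rough sketch of this ordering with the automatic Other bucket is shown below. The section names come from the description above; the parameter-to-section table `PARAM_SECTION` and the function `group_params` are hypothetical stand-ins for whatever mapping the CLI actually uses.

```python
from typing import Dict, List

# Section order as listed in the PR description.
SECTION_ORDER = [
    "Job Identity", "Data", "Output", "Core Hyperparameters",
    "Advanced Hyperparameters", "MLflow", "Compute", "Model",
]

# Hypothetical parameter-to-section assignments, for illustration only.
PARAM_SECTION = {
    "name": "Job Identity",
    "data_path": "Data",
    "output_path": "Output",
    "learning_rate": "Core Hyperparameters",
    "instance_type": "Compute",
}


def group_params(params: List[str]) -> Dict[str, List[str]]:
    """Group parameter names by section, in priority order, dropping empty sections."""
    grouped: Dict[str, List[str]] = {s: [] for s in SECTION_ORDER + ["Other"]}
    for param in params:
        # Parameters with no known section land in "Other".
        grouped[PARAM_SECTION.get(param, "Other")].append(param)
    return {section: names for section, names in grouped.items() if names}
```

Because Python dictionaries preserve insertion order, seeding the result with the section list is enough to keep the rendered file in priority order regardless of the order in which the recipe lists its parameters.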

Instance type warning at submit time

If the instance_type in config.yaml doesn't match any node in the current cluster, a warning is shown before submission (non-blocking).
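
"Non-blocking" here means the check reports a mismatch but lets submission continue. A minimal sketch using Python's standard `warnings` module follows; `check_instance_type` is a hypothetical helper, and how the CLI actually collects node instance types and surfaces the message is not described in this PR.

```python
import warnings
from typing import Iterable


def check_instance_type(requested: str, node_types: Iterable[str]) -> bool:
    """Warn, without raising, when no node in the cluster has the requested type."""
    present = requested in set(node_types)
    if not present:
        warnings.warn(
            f"instance_type {requested!r} was not found on any node in the "
            "current cluster; submitting anyway",
            stacklevel=2,
        )
    return present
```

Returning a boolean instead of raising keeps the decision with the caller, which matches the behavior described above: the job is still submitted after the warning.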

Edge Cases

| Scenario | Behavior |
| --- | --- |
| HuggingFace ID that doesn't match search (e.g. Qwen/Qwen3-0.6B) | Static fallback table resolves it |
| Completely unknown HuggingFace ID | Clear error with link to list-hub-contents docs |
| --instance-type not provided | Interactive cluster selection; kubeconfig updated on selection |
| No compatible clusters found | Clear error listing supported instance types |

Considerations

  • The integration test for the end-to-end happy case is currently pending a downstream recipe fix.

  • FSx dependency: Recipe k8s templates reference a fsx-claim PVC. This is a cluster infrastructure prerequisite — not something the CLI can provision. Documented in the getting started guide.

  • HuggingFace ID resolution: Uses list_hub_contents + @recipe: keyword filter dynamically. Works for 7/9 open-weight models; 2 edge cases (DeepSeek, Qwen3-0.6B) use a small static fallback table. A long-term fix would be asking JumpStart to add @hf_model_id: as a searchable keyword.
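
The resolution order described in the last bullet (dynamic search first, static table second, error last) can be sketched as follows. Everything named here is an assumption for illustration: `search` stands in for the list_hub_contents call with the @recipe: keyword filter, `resolve_hf_id` is a hypothetical helper, and the fallback entry is a placeholder because the real table contents are not given in this PR.

```python
from typing import Callable, Dict, List

# Placeholder fallback table; the real entries are not shown in the PR description.
STATIC_FALLBACK: Dict[str, str] = {
    "Qwen/Qwen3-0.6B": "<jumpstart-model-id-placeholder>",
}


def resolve_hf_id(
    hf_id: str,
    search: Callable[[str], List[str]],
    fallback: Dict[str, str] = STATIC_FALLBACK,
) -> str:
    """Resolve a HuggingFace ID to a JumpStart model ID."""
    # 1. Dynamic lookup (stand-in for list_hub_contents + @recipe: filter).
    matches = search(hf_id)
    if matches:
        return matches[0]
    # 2. Static table for IDs the keyword search cannot find.
    if hf_id in fallback:
        return fallback[hf_id]
    # 3. Unknown ID: fail with an actionable message.
    raise ValueError(
        f"Unknown HuggingFace model ID {hf_id!r}; see the list-hub-contents docs"
    )
```

Passing the search function in as a parameter keeps the sketch testable without network access and makes the fallback easy to remove if a searchable @hf_model_id: keyword ever becomes available.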

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

mollyheamazon and others added 23 commits March 16, 2026 16:19
* model customization init/find model

* Adding direct create exp

* Model customization Init/Create/Find

* Latest model cust changes

* init migration done with template validation

* Init full experience migrated, CRUDL simple addition in hyp_cli.py, unit tests added, pending nova forge happy case for integ test

* remove argcomplete since it is not supported yet

* add reset command for dynamic template

* fix integ test error for init flow

* remove recipe finder and discovery changes

---------

Co-authored-by: Amarjeet LNU <jamjee@amazon.com>
…efactor code for modularization, unit test added (aws#292)
…l, remove direct create support (aws#297)

* bug fix for matching instance type for override params and delete command:

* add pre-training-job and evaluation-job, set instance-type to optional, remove direct create support

* update checkpointless flag to framework to support more modes
…eg test and example notebook, pending recipe update
@mollyheamazon mollyheamazon requested a review from a team as a code owner April 16, 2026 18:28
@mollyheamazon mollyheamazon merged commit 3edd643 into aws:main Apr 16, 2026
6 checks passed