v3.3.0
What's new
Added 🎉
- Added `--torchrun` flag to `gantry run`, which is a shortcut for configuring your experiment and `torchrun` to run your command with all GPUs across all replicas.
- Added `--text` option to `gantry list` command for filtering by name or description.
- Added `client` parameter to `api.update_workload_description()` for providing an existing Beaker client, which avoids creating one each time the function is called.
- Added support for configuring the GitHub token secret name in a `pyproject.toml` as the field `[tool.gantry.gh_token_secret]`.
- Added the top-level flag `--check-for-upgrades/--no-check-for-upgrades` with corresponding env var `GANTRY_CHECK_FOR_UPGRADES`.
- Added `--slack-webhook-url` option to `gantry run` command for getting updates on Slack. For now these webhooks are only sent if following the job locally, e.g. via `--show-logs` or `--timeout=-1`.
- Added `gantry.api.Recipe` class for programmatically configuring workloads.
- Added `--interconnect` option to `gantry run` command.
- Added AWS CLI, Google Cloud CLI, and InfiniBand drivers to the default Beaker image.
- Added options to `gantry run` for automatically installing AWS (`--aws-config-secret`, `--aws-credentials-secret`) and Google Cloud credentials (`--google-credentials-secret`) from Beaker secrets.
- Added `--start-timeout` option to `gantry run`.
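
The interplay between the `--check-for-upgrades/--no-check-for-upgrades` flag and the `GANTRY_CHECK_FOR_UPGRADES` env var can be pictured with a small sketch. This is not Gantry's actual code: the function name `should_check_for_upgrades`, the accepted truthy/falsy strings, the precedence of the flag over the env var, and the default of `True` are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Assumed spellings for boolean env var values (hypothetical, not from Gantry).
_TRUE = {"1", "true", "t", "yes", "y", "on"}
_FALSE = {"0", "false", "f", "no", "n", "off"}


def should_check_for_upgrades(
    flag: Optional[bool],
    environ: Mapping[str, str] = os.environ,
    default: bool = True,  # assumed default, not confirmed by the release notes
) -> bool:
    """Resolve the setting: explicit flag first, then env var, then default."""
    if flag is not None:
        # An explicit --check-for-upgrades / --no-check-for-upgrades wins.
        return flag
    raw = environ.get("GANTRY_CHECK_FOR_UPGRADES")
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    # Unrecognized values fall back to the default in this sketch.
    return default
```

Under these assumptions, `should_check_for_upgrades(None, {"GANTRY_CHECK_FOR_UPGRADES": "0"})` disables the check, while passing `flag=True` re-enables it regardless of the environment.
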
Changed ⚠️
- You can now specify the `--env` option as just `--env 'NAME'` instead of `--env 'NAME=VALUE'` to take the `VALUE` from a local environment variable of that name.
- You can now specify the `--env-secret` option as just `--env-secret 'NAME'` instead of `--env-secret 'NAME=SECRET_NAME'` to create a new secret from a local environment variable of that name.
- Gantry will now automatically configure NCCL for InfiniBand when appropriate.
- `--skip-tcpxo-setup` is now deprecated in favor of `--skip-nccl-setup`.
- Create `entrypoint.sh` dataset with the same budget as the workload.
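
To illustrate the two accepted forms of `--env`, here is a minimal sketch of how a `NAME` or `NAME=VALUE` spec could be resolved. The helper `resolve_env_spec` is hypothetical and is not taken from Gantry's code base; the error behavior for a missing local variable is likewise an assumption.

```python
import os
from typing import Mapping, Tuple


def resolve_env_spec(
    spec: str, environ: Mapping[str, str] = os.environ
) -> Tuple[str, str]:
    """Turn 'NAME=VALUE' or bare 'NAME' into a (name, value) pair."""
    # Split on the first '=' only, so values may themselves contain '='.
    name, sep, value = spec.partition("=")
    if not name:
        raise ValueError(f"invalid env spec: {spec!r}")
    if sep:
        # Explicit 'NAME=VALUE' form: use the value as given.
        return name, value
    # Bare 'NAME' form: take the value from the local environment.
    if name not in environ:
        raise KeyError(f"local environment variable {name!r} is not set")
    return name, environ[name]
```

With a local environment of `{"WANDB_PROJECT": "demo"}`, both `resolve_env_spec("WANDB_PROJECT", env)` and `resolve_env_spec("WANDB_PROJECT=demo", env)` yield the same pair. The variable name here is only an example.
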
Fixed ✅
- Made `api.update_workload_description()` more efficient.
- Shallow clone only a single commit at runtime for much faster clones, especially with large repos.
- Made checking for upgrades more robust.
- Improved error message hints when `gantry.api` is used directly.
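
For readers unfamiliar with single-commit shallow clones, the sketch below builds one common sequence of `git` commands that fetches exactly one commit at depth 1. It is an illustration of the general technique, not necessarily the sequence Gantry runs; the helper name `shallow_clone_commands` is hypothetical, and fetching by commit SHA assumes the remote permits it.

```python
from typing import List


def shallow_clone_commands(repo_url: str, commit: str, dest: str) -> List[List[str]]:
    """Return git commands that check out a single commit without full history."""
    return [
        # Start from an empty repository instead of a full clone.
        ["git", "init", dest],
        ["git", "-C", dest, "remote", "add", "origin", repo_url],
        # Fetch only the requested commit, with no ancestors.
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", commit],
        # FETCH_HEAD points at the commit that was just fetched.
        ["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"],
    ]
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. Because only one commit's objects are transferred, the cost no longer grows with the length of the repository's history.
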
Commits
4b1e893 (chore) bump version to v3.3.0 for release
c708c8d Attach the workload’s budget to entrypoint.sh dataset (#168)
34b1b95 fix typo in readme
c9be7e7 improve job waiting logic
d42dd0e Add --start-timeout option to gantry run
cbc6ef0 bump version for nighly release
f4aad11 Add get_global_config() function
eeb55aa warn when 'gcloud' missing while installing Google creds
582cdb7 Add AWS CLI and Gcloud CLI (#167)
b78f282 bump version for nightly release
fe400a3 Big QoL updates (#166)
323d852 improve err messages when API is used from Python
ec579a0 validate slack webhook url
bffc10a refactor slack notifications
2950a04 refactor
289a76e updates
98638de fix webhooks
aca01e2 Add --slack-webhook-url option
c323857 bump version for release
e020b90 add some missing docstrings
2f403a5 display commit message
a9bbfae add property for commit message
ad5330a don't show exit code on failure unless it's non-zero
3e9c6f4 Add top-level flag to skip/force check for upgrades
eb93b78 make checking for upgrades more robust
b50a428 display results even for failed experiments
301a0ca improve logging
84a2b3b shallow clone single commit (#164)
51df41d fix typo
abd9bf0 fix formatting color of results dataset URL
ea753d5 bump version for nightly release
ec73b37 improve update_workload_description(), --env, and --env-secret options (#163)
211fb1e handle long descriptions better
ca8b86f Add --text option to gantry list command
f275e7b fix
e4286ba make FAQ easier to read (#162)
ba99018 fix typo
e5aa827 Add a few retries to gh auth setup-git call
5f0a750 export new functions for docs
a5335a2 clean up imports