chore: add a dedicated exception for cli errors #5649

Merged
merged 20 commits into from Feb 22, 2023

@hamidzr hamidzr commented Dec 27, 2022

Description

  • move error reporting to a new exception type.
  • convert some existing call sites to the new exception as a demonstration
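
As a rough illustration of the idea in this description, a dedicated exception can carry a user-facing message, an optional underlying exception, and an exit code, so that commands raise it instead of exiting directly and a single handler decides how to report it. This is a minimal sketch, not the code merged in this PR: the attribute names `message`, `e_stack`, and `exit_code` follow the snippet reviewed in this conversation, while `remove_experiment` and `run_cli` are hypothetical names.

```python
import sys
from typing import Callable, Optional


class CliError(Exception):
    """User-facing CLI error (sketch; attribute names follow the reviewed snippet)."""

    def __init__(
        self,
        message: str,
        e_stack: Optional[Exception] = None,
        exit_code: int = 1,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.e_stack = e_stack  # underlying exception, if any
        self.exit_code = exit_code


def remove_experiment(experiment_id: int) -> None:
    # Hypothetical command: raise instead of calling sys.exit() directly.
    if experiment_id < 0:
        raise CliError(f"invalid experiment id: {experiment_id}", exit_code=2)


def run_cli(command: Callable[[], None]) -> int:
    # Hypothetical entry point: the one place that maps CliError to an exit code.
    try:
        command()
    except CliError as e:
        print(e.message, file=sys.stderr)
        return e.exit_code
    return 0
```

Keeping the exit in one place means individual commands stay testable: a test can assert on the raised exception or on the returned code without the process terminating.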

Test Plan

Commentary (optional)

slack convo https://determined-ai.slack.com/archives/C02PV33GSN5/p1672772383839719

DET-8930

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@cla-bot cla-bot bot added the cla-signed label Dec 27, 2022
netlify bot commented Dec 27, 2022

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit b2dbc3f
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/63f65bcd7eda370008e75e0f

except CliError as e:
    if e.e_stack:
        print(colored(f"Error: {e}", "yellow"), file=sys.stderr)
    die(f"{e.name}: {e.message}", exit_code=e.exit_code)
Contributor Author

maybe we don't want to always have this prefix.

Contributor Author

any thoughts on this? I took out the prefix since that'd be the least change in output. we can opt to add them back later

Contributor

Exception class name as a prefix? yeah I suppose we can remove it. if it's the same for all errors there's not much value in it.

Contributor Author

I had it be separate for argument errors. I guess it could be useful if you wanted to make the CLI output a bit more machine-readable, but it has its own tradeoffs; we don't need to worry about it here.
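
The tradeoff discussed in this thread, printing the exception class name as a prefix or only the message, can be sketched as a single formatting choice. This is an illustration under assumptions rather than the merged code: `ArgumentError` and `format_cli_error` are hypothetical names, and `name` is assumed to hold the exception class name, as the reviewer's question suggests.

```python
class CliError(Exception):
    """Sketch of a user-facing CLI error with a class-name label."""

    def __init__(self, message: str, exit_code: int = 1) -> None:
        super().__init__(message)
        self.message = message
        self.exit_code = exit_code
        self.name = type(self).__name__  # assumed meaning of `e.name`


class ArgumentError(CliError):
    """Hypothetical subclass for invalid command-line arguments."""


def format_cli_error(err: CliError, with_prefix: bool = False) -> str:
    # Without the prefix, the output stays identical to the old message text.
    if with_prefix:
        return f"{err.name}: {err.message}"
    return err.message
```

Defaulting to no prefix matches the "least change in output" choice made above; a distinct prefix per subclass only pays off if something downstream parses the output.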

@hamidzr hamidzr changed the title from "wip: cli error handling" to "chore: cli error handling" Feb 14, 2023
@hamidzr hamidzr changed the title from "chore: cli error handling" to "chore: add a specific exception for cli errors" Feb 14, 2023
@hamidzr hamidzr marked this pull request as ready for review February 14, 2023 17:51
@hamidzr hamidzr requested review from a team and carolinaecalderon February 14, 2023 19:10
@hamidzr hamidzr changed the title from "chore: add a specific exception for cli errors" to "chore: add a dedicated exception for cli errors" Feb 14, 2023
harness/determined/cli/errors.py (outdated, resolved)
Contributor

@ioga ioga left a comment

oops, didn't mean to approve

@hamidzr hamidzr merged commit 56cc437 into determined-ai:master Feb 22, 2023
@hamidzr hamidzr deleted the cli-err-handling branch February 22, 2023 19:14
tayritenour added a commit that referenced this pull request Mar 9, 2023
* docs: Improvements to HPC launcher docs (#6042)

* Provide inline info about agent-specific scheduling options that do not apply to HPC Launcher
  configurations.
* Identify enroot-specific differences from docker (like for Singularity)
* Provide reference to custom resource pools as an option to deal with non-homogenous
  Slurm/PBS partitions.

* chore: Allow newer Node versions 17-19 (#6038)

* fix: k8s rm gives wrong slot count in rendezvous (#6044)

* chore: bump version: 0.19.12-dev0 -> 0.20.0-dev0 (#6048)

* chore: remove `applicableRoutespace` (#6040)

* chore: warning fixes in web (#6041)

* chore: fix warnings

* chore: change eslint rules

* chore: fix gpu nightly errors (#6046)

* chore: missing nodev18 (#6050)

* chore: add a dedicated exception for cli errors (#5649)

replace sys.exit calls in the CLI with a new user-facing exception.

* fix: SSO layout (#6053)

* chore: clean up UI kit (#6039)

* fix: lopsided training with 2,1 gpus (#6054)

There was a guard to skip local zmq setup when local_size < 2, but that
became no longer valid when local_size varied from worker to worker.

The result is one extra global allgather in some cases, no big deal.

* docs: add rbac ntsc & mr release notes (#6049)

* chore: manual bump version (#6058)

* ci: retry downloading GKE auth plugin [DET-8956] (#6056)

We got a failure due to a timeout on this, so let it retry a few times.
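
The retry idea in this entry can be sketched generically as below. The actual change is in the CI configuration, so this Python helper is only an illustration; `retry`, `attempts`, and `delay_seconds` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay_seconds: float = 1.0) -> T:
    """Call `action` until it succeeds, re-raising the error from the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the final failure
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```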

* docs: update Singularity known issues. (#6047)

* fix: Add #rank to worker segment instead of timestamp segment of Pytorch Profiler files [MLG-326] (#6037)

* Add pytorch profiler specific handling logic for appending rank to file name

* Change to use f-string

* fix only file name being passed in

* remove print statement

* fix: handle agent shutdown msg (#6065)

* chore: manually bump vite version (#6066)

* feat: Display better x-axis ticks on charts with time axis [WEB-849] (#6051)

* remove xTickValues from props now that it can be calculated internally

* test: add logging to a flaky test (#6068)

Test is flaky but hard to pin down, so add some prints for next time.

* fix: Unrelated models are shown in a workspace model registry tab (#6067)

* feat: Added task-based historical allocation endpoint [DET-8537] (#6015)

* fix: show `not found` and `spinner` properly (#6070)

* fix: show `not found` and `spinner` properly

* chore: change home redirect path

* fix: projectDetail page

* fix: project.workspaceId

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto (#6061)

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto

Bumps [golang.org/x/text](https://github.com/golang/text) from 0.3.5 to 0.3.8.
- [Release notes](https://github.com/golang/text/releases)
- [Commits](golang/text@v0.3.5...v0.3.8)

---
updated-dependencies:
- dependency-name: golang.org/x/text
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: gpt-neox docker image and startup hook to work for non-privileged user (#6060)

* test: add locking around migration [DET-8957] (#6071)

In integration tests, multiple processes can attempt to run migrations
against the same database at once, which can lead to errors because
PostgreSQL's `CREATE TABLE IF NOT EXISTS` is not great with concurrency
(it allows for a time-of-check/time-of-use failure).

The specific errors we were seeing were conflicts in the pg_type table,
so the code now locks that table for the duration of the migration
transaction.

More information:
https://www.postgresql.org/message-id/CA+TgmoZAdYVtwBfp1FL2sMZbiHCWT4UPrzRLNnX1Nb30Ku3-gg@mail.gmail.com
https://stackoverflow.com/questions/29900845

* fix: logging inconsistent newlines in slurm (#6074)

* fix: checkpoint helper for points > 1000 and points > maxDatapoints (#6069)

* fix: replace migration table lock with advisory lock (#6077)

Taking a table lock sometimes runs into permissions issues; advisory
locking should avoid that.

Also, I realized the locking should probably be after the deferred
transaction close instead of before.
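
The advisory-lock approach described in this entry can be sketched as follows. This is only an illustration in Python under stated assumptions: the actual change lives in Determined's own migration code, `run_migrations_locked` and `MIGRATION_LOCK_KEY` are hypothetical names, and the cursor is assumed to be a DB-API cursor (with `%s` placeholders, as in psycopg2) inside an open transaction. `pg_advisory_xact_lock` is a PostgreSQL built-in that blocks until the lock is granted and releases it automatically when the transaction ends.

```python
from typing import Any, Callable

# Arbitrary application-chosen key; every process must use the same value.
MIGRATION_LOCK_KEY = 6579570


def run_migrations_locked(cursor: Any, migrate: Callable[[Any], None]) -> None:
    """Run `migrate` while holding a transaction-scoped advisory lock.

    Concurrent callers serialize on the lock, so only one of them runs
    statements such as CREATE TABLE IF NOT EXISTS at a time.
    """
    cursor.execute("SELECT pg_advisory_xact_lock(%s)", (MIGRATION_LOCK_KEY,))
    migrate(cursor)
```

Unlike a table lock, an advisory lock is keyed on an application-chosen integer rather than on a table, which is consistent with the entry's note that it should avoid the permissions issues seen with table locks.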

* build: check npm version on install (#6079)

This removes the `check-requirements` make target in the react folder
and replaces it with npm's native version check against the engines
property. This should make managing the node version slightly easier
because there's one less place to check.

* build: Apply webui lint fixes in precommit (#6078)

* allow linters to fix in precommit

This updates the web linters to automatically apply fixes when doing a
pre-commit check. This should ideally streamline the commit process to
reduce the amount of times the user needs to run prettier and eslint
before committing.

* tweak stylelint and eslint commands

* stage changed files

* type file_paths argument

* ci: adjust target accuracy for a test (#6085)

We got one failure [1] where the accuracy ended up just a hair below
0.83, so drop the target.

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/33883/workflows/4a5d3257-6061-4f4d-bd66-096a580a5959/jobs/1194282/steps

* chore: UserBadge moves into design kit (#6086)

* chore: Remove chart feature flag, remove unused code [WEB-930] (#6064)

* tooltipsPlugin and TrialDetailsOverview alternates go into place
* checkpoint helper for points > 1000 and points > maxDatapoints
* move former LearningCurveChart into TrialsComparison
* sync up with #6069 changes

* fix: replace defaultvalue with initialValue (#6076)

* fix: Dont suggest moving model into its current workspace (#6088)

* ci: delete database at beginning of det deploy tests [DET-8937] (#6089)

Previously, the database was being retained between tests, sometimes
causing tests to fail when extra agents appeared due to agent
reattach. The tests should generally be independent anyway, so reset
the database (by default, with an option to disable) each time the
cluster or master comes up.

* feat: add Facepile component (#6081)

* fix: pre-commit web bug fix (#6090)

* ci: make GKE test jobs run serially (#6096)

We keep hitting GKE GPU quotas; this will probably help with that.

* fix: GPU counting for k8s cluster info page (#6094)

* fix: test-e2e-gke-parallel use t4 (#6093)

* ci: retry protoc download (#6095)

We got an incorrect file downloaded one time [1], so retry this
download, like in 2906257 (#5996).

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/34074/workflows/e48681f8-8b75-4349-82eb-06e922d8bfcb/jobs/1202610

* refactor: add Card to UI Kit [WEB-818] (#5893)

* docs: Launcher doc improvements (#6099)

- Generalize journalctl command example --since option to work on Ubuntu.
- Clarify user_name/group_name account requirements.

* feat: Attend to TODOS across the code base (#6087)

* perf: tweak metrics series query. (#6105)

* chore: race could cause run container to return a different error than expected [DET-8870] (#6092)

* chore: add more metadata to slurm logs (#6030)

* chore: remove `ExpCompareMetricNames`, `ExpCompareTrialsSample` endpoints. (#6106)

* docs: fix reported DataPoint label doc (#6107)

* fix: tolerate additional non-CPU, non-GPU quotas in k8s (#6109)

* fix: stop filtering of valid options to reflect build issues (#6116)

* fix: modal theme color (#6117)

* fix: add bgColor in trial comparison table (#6119)

* fix: browser console warnings (#6122)

* fix: browser console warnings

* fix: remove spread operator

* chore: UIKit Pivot renaming (#6120)

* fix: correct minor JSX syntax (#6126)

* docs: add myst_parser extension (#6127)

We would like to support markdown-format documentation.  There are still
some kinks to be worked out with converting rst to myst files, but this
is a start.

* docs: fix some broken redirects (#6129)

* feat: add 3rd batch of TODO removals (#6115)

* feat: generic proxy configs [DET-8761] (#5978)

* build: [MLG-336] Limit the version of protobuf (#6134)

build: [MLG-336] Limit the version of protobuf

Installing the requirement `tensorflow-macos=2.8.0` pulls protobuf as a downstream dependency. Version 4.21 of the Python protobuf package had a breaking change that makes it incompatible with tensorflow-2.8.0 (see tensorflow/tensorflow#56077). Later patches to Tensorflow limit the version of protobuf to 3.20. We've got a work item to update the tensorflow we include, but until then this change gives the ceiling on tensorflow's protobuf dependency that its later versions enforce.

* chore: update detectron2 example to use v0.6 and reenable nightly test [MLG-301] (#6103)

* Run model in EventStorage context

* Use new Docker images

* Remove pytest.skip from test_detectron2_coco_pytorch_const

* Update README.md

* Minor code reduction

* Dockerfile (listed in .detignore)

* Use determinedai repo instead of a personal repo

* Makefile for building and publishing the Docker image

* docs: Bring content changes from docusaurus-ls beta (#6121)

* docs: Bring content changes from docusaurus-ls beta

Bring over content changes from the beta including reorganization changes.

* additional organizational edits

updating index pages, adding a top nav to welcome page

* added redirects

* revisions based on feedback

* rstfmt run

* feat: display workspace icon in ProjectCard (#6125)

* fix: checkpoint GC should set resource pool (#6136) [DET-9018]

* docs: bump rstfmt version (#6138)

* chore: add dev cli option to get auth token (#6008)

add a `curl` option to help with curling various endpoints

* build(deps): bump golang.org/x/net from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0 in /proto (#6130)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0.
- [Release notes](https://github.com/golang/net/releases)
- [Commits](https://github.com/golang/net/commits/v0.7.0)

* perf: Improved performance of historical allocation task endpoint, removed training/validation times (#6135)

* fix: FOUNDENG-438 Podman tests from the gate are breaking znodes again (#6146)

* chore: add Toggle component to UI Kit [WEB-841] (#6144)

* chore: Add tags to UI kit [WEB-816] (#6100)

* chore: Move SelectFilter into kit folder and update it [WEB-843] (#6102)

* fix: replace `InlineEditor` with UIKit input (#6082)

* fix: replace `InlineEditor`

* fix: add modal for experiment name

* fix: layout of settings page

* fix: setting page

* fix: minor changes

* feat: move experiment `description` and `tags` into edit modal

* chore: add `N/A` when description is empty in experiment detail

* fix: value bug

* fix: revert tag; remove tag from edit modal due to design inconsistency

* chore:  add/test pt-only images and bumpenvs (#6097)

* add pt images to some unit tests

* add pt-images to circleci config

* run bumpenvs procedure

* fix test function signatures and linting

* fix warnings linting

* fix docs

* expand unit tests coverage

* build(deps): bump github.com/prometheus/client_golang from 1.10.0 to 1.11.1 in /master (#6004)

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.10.0 to 1.11.1.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.10.0...v1.11.1)

* feat: add det.import_from_path (#5737)

import_from_path allows users to import arbitrary code from a
checkpoint, even if the modules in the checkpoint have the same name as
modules they already have imported, but contain different code.

This is common when importing, for example, an old model_def.py that has
been updated since the original checkpoint was saved.
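
The underlying problem, two different files that share a module name, can be illustrated with the standard library alone. The sketch below is not Determined's implementation of `import_from_path`; `load_module_from_file` is a hypothetical helper that loads a file as a fresh module object without registering it in `sys.modules`, so an already-imported module with the same name is left untouched.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_file(module_name: str, file_path: Path) -> ModuleType:
    """Execute `file_path` as a new module object named `module_name`."""
    spec = importlib.util.spec_from_file_location(module_name, str(file_path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {file_path}")
    module = importlib.util.module_from_spec(spec)
    # Deliberately not stored in sys.modules, so it cannot shadow or be
    # shadowed by another module that uses the same name.
    spec.loader.exec_module(module)
    return module
```

Because the module is never registered, code inside it that relies on being importable by name (for example, objects that are later pickled) may not behave as usual; this sketch only shows how name collisions can be avoided.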

* fix: post rank_id correctly for fluent-less logging (#6151) [DET-8999]

* ci: send Slack notification on GKE node pool creation failure (#6152)

In order to prevent quota failures from showing up as CI failures, this
makes node pool creation failure send a Slack notification and mark the
job as successful.

I couldn't figure out how to use the Slack orb while distinguishing this
particular situation from the general failure case, so I just slapped in
a direct request to the already-configured Slack webhook for sending
messages to #ci-bots.

`circleci-agent step halt` marks the job as successful, which is why we
want a notification at all. For some reason, CircleCI fails to provide
an equivalent for marking the job as canceled or some other state
besides success/failure; we could make a call to the CircleCI API to
cancel the current job, but that would rely on having a CircleCI token
available, which we're trying to get away from.

* chore: drop unused columns from `raw_steps`, `raw_validations`, and `raw_checkpoints`. (#6110)

* fix: render spinner while auth check pending (#6098)

* chore: update hpc-launching-architecture doc - add default slurm option --no-requeue (#6141)

* docs: Content updates (#6154)

formatted the setup cluster table to match the approved version in the docusaurus ls beta

* feat: display user id in `det user list`. (#6156)

* fix: Additional tables get experiment- / workspace-specific storagePath [WEB-962] (#6128)

* fix: Additional tables get experiment- and workspace-specific storagePath

* useMemo

* fix: selection width in `move experiment` modal (#6149)

* fix: selection width in `move experiment` modal

* fix: add form wrapping

* chore: remove change

* feat: show trial hyperparameters for custom searchers [MLG-343] (#6162)

* feat: show trial hyperparameters for custom searchers [MLG-343]

* fix: corrected timestamp handling to do an interval overlap instead of contains (#6164)
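
The distinction this fix names can be sketched as two predicates over time intervals. The function names are hypothetical and half-open intervals `[start, end)` are an assumption; the endpoint's actual query is not shown in this page.

```python
from datetime import datetime


def contains(q_start: datetime, q_end: datetime, start: datetime, end: datetime) -> bool:
    # True only if [start, end) lies entirely inside the query window.
    return q_start <= start and end <= q_end


def overlaps(q_start: datetime, q_end: datetime, start: datetime, end: datetime) -> bool:
    # True if [start, end) shares any instant with the query window.
    return start < q_end and q_start < end
```

An interval that begins before the query window and ends inside it fails `contains` but satisfies `overlaps`, which is the kind of record a containment check would drop.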

* chore:  add release notes (#6167)

* chore: add release notes

* format with rstfmt

* chore: suppress help output for det dev (#6145)

avoid showing the `dev` option in `det -h` output

* chore: lock api state for backward compatibility check

* chore: bump version: 0.20.0-dev0 -> 0.20.1-dev0

* fix: separate Router and authCheck (#6170)

* fix: useMemo does not depend on trial having been loaded (#6173)

* chore: pass Labels/project/workspace to TaskSpec (#6172)

* refactor: replace user store with observables [WEB-799] (#6140)

* fix: hide Foldable menu options when button is visible (#6178)

This fixes an issue where, when using a `PageHeaderFoldable` component,
options that appear in the header always appear in the overflow menu.

* feat: add labels to GCP instances created with det deploy gcp [MLG-170] (#6147)

* feat: add labels to GCP instances created with det deploy gcp

* Changes to mimic det deploy aws --add-tags

* Add labels to other resources as well

* revert: reflag new chart experience (#6181)

* build: eliminate java dependency for typescript swagger bindings (#6139)

* fix: Close experiment fork/continue trial modal properly (#6174)

* fix: Continue Trial flow does not take the new `max_length` (#6168)

* fix: pass workspace ID when creating tensor board from WebUI [WEB-1019] (#6186)

* fix: don't print ':' when err msg is empty (#6190)

* fix: exp move modal (#6183)

* fix: exp move modal

* fix: minor fixes

* fix: add `archived` param and simplify query (#6175)

* fix: add `archived` param and simplify query

* chore: indent

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Maksim <maksim.kouznetsov@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
tayritenour added a commit that referenced this pull request May 30, 2023
* Save model info; add Core API DeepSpeed example.

* extract ds profiler results

* Place activation mem into search metric.

* Obtain metrics from the model info file

* remove unused code

* adapted to passing dicts into report_completed

* cleanup and small changes

* refactoring and trial helper classes

* initial random search logic started

* minor changes

* minor cleanup

* use context manager, expanded base searcher

* remove ds_autotuning dir from examples

* minor edits

* readme updates and other cleanup

* bug fixes and a hack to avoid needing nested model dirs

* README updates

* Feat deepspeed autotune (#5875)

Added current Core API prototype.

* switched to triggering their autotuning in our trials

* remove checkpoint wrapper

* implementing checkpointing

* implementing checkpointing

* feat: allow includes in custom searcher experiment [MLG-338] (#6091)

* cleanups, bug fixing, and more examples

* adding native dsat tests

* cleanup

* minor edits

* fix missing is_chief

* Feat deepspeed autotune (#6159)

Trigger the native DS AT exit behavior for all trials.

* minor edits

* Feat deepspeed autotune git fixes (#6180)

deleted extraneous files

* readme fix and make cifar10 work off grenoble (#6187)

* Deepspeed Feature Branch - merge master (#6193)

* docs: Improvements to HPC launcher docs (#6042)

* Provide inline info about agent-specific scheduling options that do not apply to HPC Launcher
  configurations.
* Identify enroot-specific differences from docker (like for Singularity)
* Provide reference to custom resource pools as an option to deal with non-homogenous
  Slurm/PBS partitions.

* chore: Allow newer Node versions 17-19 (#6038)

* fix: k8s rm gives wrong slot count in rendezvous (#6044)

* chore: bump version: 0.19.12-dev0 -> 0.20.0-dev0 (#6048)

* chore: remove `applicableRoutespace` (#6040)

* chore: warning fixes in web (#6041)

* chore: fix warnings

* chore: change eslint rules

* chore: fix gpu nightly errors (#6046)

* chore: missing nodev18 (#6050)

* chore: add a dedicated exception for cli errors (#5649)

replace sys.exit calls in the CLI with a new user-facing exception.

* fix: SSO layout (#6053)

* chore: clean up UI kit (#6039)

* fix: lopsided training with 2,1 gpus (#6054)

There was a guard to skip local zmq setup when local_size < 2, but that
became no longer valid when local_size varied from worker to worker.

The result is one extra global allgather in some cases, no big deal.

* docs: add rbac ntsc & mr release notes (#6049)

* chore: manual bump version (#6058)

* ci: retry downloading GKE auth plugin [DET-8956] (#6056)

We got a failure due to a timeout on this, so let it retry a few times.

* docs: update Singularity known issues. (#6047)

* fix: Add #rank to worker segment instead of timestamp segment of Pytorch Profiler files [MLG-326] (#6037)

* Add pytorch profiler specific handling logic for appending rank to file name

* Change to use f-string

* fix only file name being passed in

* remove print statement

* fix: handle agent shutdown msg (#6065)

* chore: manually bump vite version (#6066)

* feat: Display better x-axis ticks on charts with time axis [WEB-849] (#6051)

* remove xTickValues from props now that it can be calculated internally

* test: add logging to a flaky test (#6068)

Test is flaky but hard to pin down, so add some prints for next time.

* fix: Unrelated models are shown in a workspace model registry tab (#6067)

* feat: Added task-based historical allocation endpoint [DET-8537] (#6015)

* fix: show `not found` and `spinner` properly (#6070)

* fix: show `not found` and `spinner` properly

* chore: change home redirect path

* fix: projectDetail page

* fix: project.workspaceId

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto (#6061)

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto

Bumps [golang.org/x/text](https://github.com/golang/text) from 0.3.5 to 0.3.8.
- [Release notes](https://github.com/golang/text/releases)
- [Commits](golang/text@v0.3.5...v0.3.8)

---
updated-dependencies:
- dependency-name: golang.org/x/text
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: gpt-neox docker image and startup hook to work for non-privileged user (#6060)

* test: add locking around migration [DET-8957] (#6071)

In integration tests, multiple processes can attempt to run migrations
against the same database at once, which can lead to errors because
PostgreSQL's `CREATE TABLE IF NOT EXISTS` is not great with concurrency
(it allows for a time-of-check/time-of-use failure).

The specific errors we were seeing were conflicts in the pg_type table,
so the code now locks that table for the duration of the migration
transaction.

More information:
https://www.postgresql.org/message-id/CA+TgmoZAdYVtwBfp1FL2sMZbiHCWT4UPrzRLNnX1Nb30Ku3-gg@mail.gmail.com
https://stackoverflow.com/questions/29900845

* fix: logging inconsistent newlines in slurm (#6074)

* fix: checkpoint helper for points > 1000 and points > maxDatapoints (#6069)

* fix: replace migration table lock with advisory lock (#6077)

Taking a table lock sometimes runs into permissions issues; advisory
locking should avoid that.

Also, I realized the locking should probably be after the deferred
transaction close instead of before.

* build: check npm version on install (#6079)

This removes the `check-requirements` make target in the react folder
and replaces it with npm's native version check against the engines
property. This should make managing the node version slightly easier
because there's one less place to check.

* build: Apply webui lint fixes in precommit (#6078)

* allow linters to fix in precommit

This updates the web linters to automatically apply fixes when doing a
pre-commit check. This should ideally streamline the commit process to
reduce the amount of times the user needs to run prettier and eslint
before committing.

* tweak stylelint and eslint commands

* stage changed files

* type file_paths argument

* ci: adjust target accuracy for a test (#6085)

We got one failure [1] where the accuracy ended up just a hair below
0.83, so drop the target.

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/33883/workflows/4a5d3257-6061-4f4d-bd66-096a580a5959/jobs/1194282/steps

* chore: UserBadge moves into design kit (#6086)

* chore: Remove chart feature flag, remove unused code [WEB-930] (#6064)

* tooltipsPlugin and TrialDetailsOverview alternates go into place
* checkpoint helper for points > 1000 and points > maxDatapoints
* move former LearningCurveChart into TrialsComparison
* sync up with #6069 changes

* fix: replace defaultvalue with initialValue (#6076)

* fix: Dont suggest moving model into its current workspace (#6088)

* ci: delete database at beginning of det deploy tests [DET-8937] (#6089)

Previously, the database was being retained between tests, sometimes
causing tests to fail when extra agents appeared due to agent
reattach. The tests should generally be independent anyway, so reset
the database (by default, with an option to disable) each time the
cluster or master comes up.

* feat: add Facepile component (#6081)

* fix: pre-commit web bug fix (#6090)

* ci: make GKE test jobs run serially (#6096)

We keep hitting GKE GPU quotas; this will probably help with that.

* fix: GPU counting for k8s cluster info page (#6094)

* fix: test-e2e-gke-parallel use t4 (#6093)

* ci: retry protoc download (#6095)

We got an incorrect file downloaded one time [1], so retry this
download, like in 2906257 (#5996).

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/34074/workflows/e48681f8-8b75-4349-82eb-06e922d8bfcb/jobs/1202610

* refactor: add Card to UI Kit [WEB-818] (#5893)

* docs: Launcher doc improvements (#6099)

- Generalize journalctl command example --since option to work on Ubuntu.
- Clarify user_name/group_name account requirements.

* feat: Attend to TODOS across the code base (#6087)

* perf: tweak metrics series query. (#6105)

* chore: race could cause run container to return a different error than expected [DET-8870] (#6092)

* chore: add more metadata to slurm logs (#6030)

* chore: remove `ExpCompareMetricNames`, `ExpCompareTrialsSample` endpoints. (#6106)

* docs: fix reported DataPoint label doc (#6107)

* fix: tolerate additional non-CPU, non-GPU quotas in k8s (#6109)

* fix: stop filtering of valid options to reflect build issues (#6116)

* fix: modal theme color (#6117)

* fix: add bgColor in trial comparison table (#6119)

* fix: browser console warnings (#6122)

* fix: browser console warnings

* fix: remove spread operator

* chore: UIKit Pivot renaming (#6120)

* fix: correct minor JSX syntax (#6126)

* docs: add myst_parser extension (#6127)

We would like to support markdown-format documentation.  There are still
some kinks to be worked out with converting rst to myst files, but this
is a start.

* docs: fix some broken redirects (#6129)

* feat: add 3rd batch of TODO removals (#6115)

* feat: generic proxy configs [DET-8761] (#5978)

* build: [MLG-336] Limit the version of protobuf (#6134)

build: [MLG-336] Limit the version of protobuf

Installing the requirement `tensorflow-macos=2.8.0` pulls protobuf as a downstream dependency. Version 4.21 of the Python protobuf package had a breaking change that makes it incompatible with tensorflow-2.8.0 (see tensorflow/tensorflow#56077). Later patches to Tensorflow limit the version of protobuf to 3.20. We've got a work item to update the tensorflow we include, but until then this change gives the ceiling on tensorflow's protobuf dependency that its later versions enforce.

* chore: update detectron2 example to use v0.6 and reenable nightly test [MLG-301] (#6103)

* Run model in EventStorage context

* Use new Docker images

* Remove pytest.skip from test_detectron2_coco_pytorch_const

* Update README.md

* Minor code reduction

* Dockerfile (listed in .detignore)

* Use determinedai repo instead of a personal repo

* Makefile for building and publishing the Docker image

* docs: Bring content changes from docusaurus-ls beta (#6121)

* docs: Bring content changes from docusaurus-ls beta

Bring over content changes from the beta including reorganization changes.

* additional organizational edits

updating index pages, adding a top nav to welcome page

* added redirects

* revisions based on feedback

* rstfmt run

* feat: display workspace icon in ProjectCard (#6125)

* fix: checkpoint GC should set resource pool (#6136) [DET-9018]

* docs: bump rstfmt version (#6138)

* chore: add dev cli option to get auth token (#6008)

add a `curl` option to help with curling various endpoints

* build(deps): bump golang.org/x/net from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0 in /proto (#6130)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0.
- [Release notes](https://github.com/golang/net/releases)
- [Commits](https://github.com/golang/net/commits/v0.7.0)

* perf: Improved performance of historical allocation task endpoint, removed training/validation times (#6135)

* fix: FOUNDENG-438 Podman tests from the gate are breaking znodes again (#6146)

* chore: add Toggle component to UI Kit [WEB-841] (#6144)

* chore: Add tags to UI kit [WEB-816] (#6100)

* chore: Move SelectFilter into kit folder and update it [WEB-843] (#6102)

* fix: replace `InlineEditor` with UIKit input (#6082)

* fix: replace `InlineEditor`

* fix: add modal for experiment name

* fix: layout of settings page

* fix: setting page

* fix: minor changes

* feat: move experiment `description` and `tags` into edit modal

* chore: add `N/A` when description is empty in experiment detail

* fix: value bug

* fix: revert tag; remove tag from edit modal due to design inconsistency

* chore: add/test pt-only images and bumpenvs (#6097)

* add pt images to some unit tests

* add pt-images to circleci config

* run bumpenvs procedure

* fix test function signatures and linting

* fix warnings linting

* fix docs

* expand unit tests coverage

* build(deps): bump github.com/prometheus/client_golang from 1.10.0 to 1.11.1 in /master (#6004)

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.10.0 to 1.11.1.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.10.0...v1.11.1)

* feat: add det.import_from_path (#5737)

import_from_path allows users to import arbitrary code from a
checkpoint, even if the modules in the checkpoint have the same name as
modules they already have imported, but contain different code.

This is common when importing, for example, an old model_def.py that has
been updated since the original checkpoint was saved.
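
To illustrate the name-collision problem described above, here is a minimal sketch built only on the standard library's `importlib.util`. It is not the implementation of `det.import_from_path`; the helper `load_module_from_file`, the aliases, and the temporary checkpoint directories are hypothetical names used for illustration. It loads two different files that are both called `model_def.py` without registering either in `sys.modules`, so neither shadows the other.

```python
import importlib.util
import pathlib
import tempfile


def load_module_from_file(file_path, alias):
    """Hypothetical helper: execute one source file as a module named `alias`.

    The module is not added to sys.modules, so an already-imported module
    with the same file name is neither reused nor replaced.
    """
    spec = importlib.util.spec_from_file_location(alias, str(file_path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {file_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo():
    """Load two same-named files with different contents side by side."""
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp)
        versions = {}
        for name, version in (("old_ckpt", 1), ("new_ckpt", 2)):
            ckpt_dir = root / name
            ckpt_dir.mkdir()
            # Both checkpoints ship a file called model_def.py.
            (ckpt_dir / "model_def.py").write_text(f"VERSION = {version}\n")
            module = load_module_from_file(
                ckpt_dir / "model_def.py", f"{name}_model_def"
            )
            versions[name] = module.VERSION
        return versions
```

This sketch only isolates a single file; imports made from inside the loaded file still go through the normal import cache, which is a limitation a real solution would need to address.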

* fix: post rank_id correctly for fluent-less logging (#6151) [DET-8999]

* ci: send Slack notification on GKE node pool creation failure (#6152)

In order to prevent quota failures from showing up as CI failures, this
makes node pool creation failure send a Slack notification and mark the
job as successful.

I couldn't figure out how to use the Slack orb while distinguishing this
particular situation from the general failure case, so I just slapped in
a direct request to the already-configured Slack webhook for sending
messages to #ci-bots.

`circleci-agent step halt` marks the job as successful, which is why we
want a notification at all. For some reason, CircleCI fails to provide
an equivalent for marking the job as canceled or some other state
besides success/failure; we could make a call to the CircleCI API to
cancel the current job, but that would rely on having a CircleCI token
available, which we're trying to get away from.

* chore: drop unused columns from `raw_steps`, `raw_validations`, and `raw_checkpoints`. (#6110)

* fix: render spinner while auth check pending (#6098)

* chore: update hpc-launching-architecture doc - add default slurm option --no-requeue (#6141)

* docs: Content updates (#6154)

formatted the setup cluster table to match the approved version in the docusaurus ls beta

* feat: display user id in `det user list`. (#6156)

* fix: Additional tables get experiment- / workspace-specific storagePath [WEB-962] (#6128)

* fix: Additional tables get experiment- and workspace-specific storagePath

* useMemo

* fix: selection width in `move experiment` modal (#6149)

* fix: selection width in `move experiment` modal

* fix: add form wrapping

* chore: remove change

* feat: show trial hyperparameters for custom searchers [MLG-343] (#6162)

* feat: show trial hyperparameters for custom searchers [MLG-343]

* fix: corrected timestamp handling to do an interval overlap instead of contains (#6164)
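
The difference between the two checks named in this commit title can be sketched as follows. The function names and the half-open `[start, end)` convention are assumptions for illustration; the commit does not state which boundaries the endpoint uses.

```python
from datetime import datetime


def contains(window_start, window_end, start, end):
    """True only if [start, end) lies entirely inside the window."""
    return window_start <= start and end <= window_end


def overlaps(window_start, window_end, start, end):
    """True if [start, end) shares any time with the window."""
    return start < window_end and window_start < end


window = (datetime(2023, 2, 1), datetime(2023, 2, 2))
# A task that started before the window opened and ended inside it.
task = (datetime(2023, 1, 31, 23), datetime(2023, 2, 1, 1))
```

With a containment check the task above is dropped from the window even though part of it ran inside it; an overlap check keeps it.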

* chore: add release notes (#6167)

* chore: add release notes

* format with rstfmt

* chore: suppress help output for det dev (#6145)

avoid showing the `dev` option in `det -h` output

* chore: lock api state for backward compatibility check

* chore: bump version: 0.20.0-dev0 -> 0.20.1-dev0

* fix: separate Router and authCheck (#6170)

* fix: useMemo does not depend on trial having been loaded (#6173)

* chore: pass Labels/project/workspace to TaskSpec (#6172)

* refactor: replace user store with observables [WEB-799] (#6140)

* fix: hide Foldable menu options when button is visible (#6178)

This fixes an issue where, when using a `PageHeaderFoldable` component,
options that appear in the header always appear in the overflow menu.

* feat: add labels to GCP instances created with det deploy gcp [MLG-170] (#6147)

* feat: add labels to GCP instances created with det deploy gcp

* Changes to mimic det deploy aws --add-tags

* Add labels to other resources as well

* revert: reflag new chart experience (#6181)

* build: eliminate java dependency for typescript swagger bindings (#6139)

* fix: Close experiment fork/continue trial modal properly (#6174)

* fix: Continue Trial flow does not take the new `max_length` (#6168)

* fix: pass workspace ID when creating tensor board from WebUI [WEB-1019] (#6186)

* fix: don't print ':' when err msg is empty (#6190)

* fix: exp move modal (#6183)

* fix: exp move modal

* fix: minor fixes

* fix: add `archived` param and simplify query (#6175)

* fix: add `archived` param and simplify query

* chore: indent

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Maksim <maksim.kouznetsov@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>

* handle add user

* Revert "Deepspeed Feature Branch - merge master (#6193)"

This reverts commit 05ee6a2.

* Move dsat into harness (#6254)

move dsat into harness

* fixed too-large dataset bug (#6262)

* MLG-337 (#6282)

 ds_config.json centered workflow and cleanup

* Feat deepspeed autotune fix merge conflict (#6289)

Merging with master, fixing a merge conflict, and minor cleanup.

* reset webui to latest master (#6291)

* linting fixes (#6296)

* Refactor DS AT for Trial Compatibility (#6307)

Refactored the searcher to use a json based config workflow with overwrite_deepspeed_args, as in (most of) the official examples.

* fix: remove all Close operations (#6383)

Refactored to remove all Close operations to avoid race condition errors. Also added quick no op example and fixed other bugs.

* restore util.py

* restoring more files to latest master version

* merge in util.py changes

* Remove OOM catcher.

* Add Close operations back in

* Standardize autotuning config names.

* General clean up

* Fix dsat reporting bug.

* Minor changes.

* Clean up.

* Try/except hack around dead agent due to exit

* Minor changes.

* feat: allow users to specify zero optim search space and runner config (#6452)

* fix: do not merge user zero search config with defaults (#6464)

* populate the custom searcher logs with the correct event (#6470)

* fix: search state accounting (#6473)

* feat: add simple linear batch test searcher (#6482)

* fix: searcher refactor (#6513)

* deprecate the zero_search_config functionality (#6517)

* fix: timing fix and cleanup with tests (#6555)

* remove single should_stop and add more granular state

* adding the beginning of some autotuning tests

* real unit tests for DS AT

* Clean up the trial tracker properties, base searcher, and names

* Do not close twice

---------

Co-authored-by: Taylor Ritenour <taylor.ritenour@hpe.com>

* fix: minor cleanup (#6558)

* feat: deepspeed autotune trial based methods (#6575)

* fix: minor bug broke Trial class DS AT (#6613)

* feat: optional steps completed in context manager (#6619)

Make steps_completed in reporting context optional

Update examples to use updated context manager

minor example changes

Minor model changes

Delete old trial class examples

Add torchvision model trial example

* MLG-369: Some initial tests for DS AT (#6551)

* real unit tests for DS AT

* clean up and standardize the tests more

* also send a Close operation

* touch up the tests

* clarifications and clean up

* feat: deepspeed autotune cli args (#6643)

* Add __main__ and reuse original args on-cluster.

* Cleanup

* Adding include

* Finish adding include

* more args

* add search runner exp_id to follow on exp description

* minor comments

* Different registering system and starting the queue

* more args and cleanup

* About to use a queue (kind of)

* Attempting to enforce trial constraints

* Add early stopping as arg

* Remove old autotuning args and get them instead from cli

* Corrected args bugs

* Fixed many refactoring bugs

* Cleanup and bug fixes

* Move exp_conf edits to _run_dsat

* Add zero stages arg

* Move exp conf changes out of searcher and use zero_stages arg

* Fix simple searcher bug

* minor cleanup

* cleanup

* Starting refactor away from searcher state

* Changed searcher state computation to trial tracker

* Revert to single configs everywhere

* Update readme

* Edit added description text

* Add closed attr to Trials and bug fixes

* clarifying comment

* Rebase onto latest feature branch

* Add actual deque, fix bugs

* Remove print test

* small trial example fix

* Clean up try/except hack

* config edits

* feat: visualize cli args (#6651)

* Add CLI args to config hparams for visualization

* remove pickle path arg

* fix up the dsat tests (#6664)

* fix up the dsat tests

* make sure to pass through the args parsing function

* touch up the tests so that they can issue a failure to the experiment (#6676)

* chore: update deepspeed to 0.8.3 [MLG-399] (#6685)

* feat: hf trainer examples (#6687)

* chore: refactor dsat module to be independent of deepspeed imports [MLG-499] (#6694)

* chore: refactor dsat module to be independent of deepspeed imports [MLG-499]

* also update the test and det_callback.py

* chore: move over the examples for DSAT [MLG-500] (#6717)

* feat: add deepspeed autotuning examples [MLG-500]

* some clean up and UX improvements

* remove double parens, make sure that orchestrator id is on the far left

* feat: add follow on exp option (#6720)

* feat: migrate the huggingface det_callback [MLG-487] (#6724)

* migrating the det_callback

* feat: migrate the huggingface det_callback [MLG-487]

* don't export DetCallback through the top level integrations

* feat: minor test updates (#6746)

* feat: use lock with json overwrite for hf (#6742)

* feat: use lock with json overwrite for hf

* handle merge with our previous DetCallback refactors

---------

Co-authored-by: Garrett Goon <garrett.goon@hpe.com>

* feat: basic trial tracker tests (#6754)

* fix: hf overwrite bug

* fix: remove old import (#6761)

* feat: adding e2e tests for DSAT, enabling searcher to shutdown client experiment [MLG-369] (#6781)

* migrating the det_callback

* feat: migrate the huggingface det_callback [MLG-487]

* wip working on the e2e tests

* getting the basics of the tests running. Still appear to be issues though

* fixing up the tests

* fixing up tests

* handle cases where explicit exceptions are raised in the dsat search runner

* clean up for merging

* revert restarts change

* feat: add search method tests (#6785)

* add search method tests

* update hf ex readme

* quick fix for the unit tests (#6796)

* feat: write best ds config json to checkpoint (#6787)

* feat: refactor argparse into subparsers (#6801)

* feat: add binary search (#6806)

* fix: move lock to fix hf overwrites (#6828)

* fix: small fixes (#6825)

* expand message for model profile info failure

* correct the progress calculation

* Remove autotuning section from checkpointed best configs

* also write the best metrics to the checkpoint

* fix: proper placement of start/end profile step (#6834)

* feat: more random search test (#6837)

* chore: stabilize static typed python [MLG-498] (#6846)

* flake8 fixes

* mypy issues

* updates according to comments for mypy changes (#6864)

* feat: asha (#6852)

* merged in prev asha code

* starting asha tests

* more tests and cleanup

* test cleanup

* asha params closer to current nomenclature

* refactor asha args

* import fix

* mypy

* add asha to __all__

* add search data to stage 3 test

* chore: move hf examples (#6871)

* replace old hf integrations examples with new ones

* fix no-op bug in hf helper function

* fix helper function imports

* update readme

* use the python module for `searcher` directly (#6883)

* use the python module for `searcher` directly rather than importing the individual elements

* additional fixes

* feat: clean up dsat examples (#6891)

* moving files

* moving more files

* more file movement

* stage 1 in config

* core api script cleanup

* deepspeed.yaml core api cleanup

* align deepspeed.yaml files

* align ds_config.json files

* remove lr scheduler

* shorten length to 100

* model_def.py cleanup

* Added checkpointing and better metrics reporting

* cleaned up readme

* change example dir name

* add torchvision examples to test_official.py

* starting e2e tests

* add all e2e tests

* remove accidental double test

* chore: to do cleanup (#6895)

* cli docs

* cli doc strings

* remove todo comment

* update doc strings in dsat _utils.py

* searcher class doc strings

* More search method doc strings for non-public classes

* remove searcher state tests

* remove many todos in _dsat_search_method.py

* remove todos elsewhere

* fix: do not schedule the same trial twice (#6896)

* attempting to fix tests (#6899)

* fix: move examples dir one level up and finish docs/Makefile changes (#6898)

* fix: remove improper test_official.py tests (#6900)

* fix docs formatting (#6902)

* fix docs formatting

* add deepspeed autotune directory to example builds

* support hf examples (#6910)

* feat: ds config from include (#6905)

* move overwrite_deepspeed_config back to det.pytorch.deepspeed

* allow for the ds json to be --include'd

* self.hparams -> hparams bug

* doc string edits

* move ds_config.json back inside of no_op

* chore: update the custom searcher docs [MLG-447] (#6934)

* updating the docs for custom searcher wip

* wip, fixing up the docs, making sure things are clean and link properly

* chore: update the custom searcher docs [MLG-447]

* updates according to comments

* fix docs build

* changes according to comments in dsat branch (#6943)

* changes according to comments.

- removing no_op
- removing cache_dir in hf examples
- removing erroneous release-notes

* revert the changes to the environments so we are in sync with bumpenvs

* adjust the huggingface versions to the current minor version

* update version

* bug: fully wrap hf JSON loading around a FileLock (#6950)

* feat: deepspeed autotune user guide (#6929)

* starting dsat user guide

* import cleanup in hf examples

* more editing

* starting to list cli options

* formatting

* git mv hf examples to make more descriptive dirs

* remove TODO

* incorporating feedback

* links to examples

* cleanup

* Update cli help

* general cli options cleanup

* incorporate taylor's second round of feedback

* incorporate tara's comments

* fix: no auto in hf cli (#6963)

* add int check to hf cli arg overwrite

* fix accidental trivial test

* new test ensuring auto not used in hf cli args

* add checks against copying "auto" to HF CLI entrypoint

* add link to the dsat user guide (#6964)

* add link to the dsat user guide

* updated wording

* add defaults to dsat cli help menu (#6970)

* feat: more tests and asha cleanup (#6966)

* account for asha early stopping in test

* clean up lineage_completed_rung

* more full mock experiment tests

* Only skip completed trials added to queue

* base rungs off latest, rather than root

* max trial computation cleanup

* only include curr_rung <= rung_idx trial in best computation

* promotion respects rung idx test

* test cleanup

* more test cleanup

* add test_get_best_trial_in_lineage

* doc string cleanup

* fix up test_full_experiment_reverse_ordered_results

* minor wording edit

* always pop off highest curr_rung asha trial next

* fix the doc builds, add a release-note (#6973)

* fix the doc builds, add a release-note

* update docs names

* make flake8 behave

* update by the pre-commit complaints

* fix: readme cleanup (#6974)

* touch up hf trainer readme

* shorten up and simplify torchvision readme

* feat: update defaults and small tweaks (#6975)

* trials_per_random_config 5

* max trials 64

* min binary search trials 3

* fix text by avoiding trivial search range

* remove should_discard function to avoid possible locking

* increase timeouts for mock tests

* mypy fix

* update ceiling computation for new random trials

* base the ceiling off the max mbs computation, not the midpoint

* schedule longer lineages first in asha

* update docs to reflect new defaults

* fmt examples

* make isort behave

* address some comments about logging levels and comments

* remove erroneous TODO

* small spelling mistake

* move the `determined.integrations.huggingface.DetCallback` to `determined.transformers.DetCallback`

* fix: comment and environment cleanup (#6988)

* explain try/except in search runner

* remove while true and comments in _deepspeed_trial.py

* fix all deepspeed yaml files

* remove step id comment

* more todo cleanup

* make sure the docs build again

* fix the names of the e2e tests and in README

* don't run so many e2e_tests for deepspeed

* reduce hf image class ds slots per trial (#6998)

* fix e2e tests

* fix the convergence tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Maksim Kouznetsov <maksim.kouznetsov@hpe.com>
Co-authored-by: Garrett Goon <garrett.goon@hpe.com>
Co-authored-by: Garrett Goon <44747910+garrett361@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
Co-authored-by: Emily Bonar <emily.bonar@hpe.com>
erikwilson added a commit that referenced this pull request Jun 1, 2023
* Save model info; add Core API DeepSpeed example.

* extract ds profiler results

* Place activation mem into search metric.

* Obtain metrics from the model info file

* remove unused code

* adapted to passing dicts into report_completed

* cleanup and small changes

* refactoring and trial helper classes

* initial random search logic started

* minor changes

* minor cleanup

* use context manager, expanded base searcher

* remove ds_autotuning dir from examples

* minor edits

* readme updates and other cleanup

* bug fixes and a hack to avoid needing nested model dirs

* README updates

* Feat deepspeed autotune (#5875)

Added current Core API prototype.

* switched to triggering their autotuning in our trials

* remove checkpoint wrapper

* implementing checkpointing

* implementing checkpointing

* feat: allow includes in custom searcher experiment [MLG-338] (#6091)

* cleanups, bug fixing, and more examples

* adding native dsat tests

* cleanup

* minor edits

* fix missing is_chief

* Feat deepspeed autotune (#6159)

Trigger the native DS AT exit behavior for all trials.

* minor edits

* Feat deepspeed autotune git fixes (#6180)

deleted extraneous files

* readme fix and make cifar10 work off grenoble (#6187)

* Deepspeed Feature Branch - merge master (#6193)

* docs: Improvements to HPC launcher docs (#6042)

* Provide inline info about agent-specific scheduling options that do not apply to HPC Launcher
  configurations.
* Identify enroot-specific differences from docker (like for Singularity)
* Provide reference to custom resource pools as an option to deal with non-homogenous
  Slurm/PBS partitions.

* chore: Allow newer Node versions 17-19 (#6038)

* fix: k8s rm gives wrong slot count in rendezvous (#6044)

* chore: bump version: 0.19.12-dev0 -> 0.20.0-dev0 (#6048)

* chore: remove `applicableRoutespace` (#6040)

* chore: warning fixes in web (#6041)

* chore: fix warnings

* chore: change eslint rules

* chore: fix gpu nightly errors (#6046)

* chore: missing nodev18 (#6050)

* chore: add a dedicated exception for cli errors (#5649)

switch sys.exit calls in cli with a new user-facing exception.
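
A minimal sketch of the pattern this commit describes, assuming a shape similar to the snippet quoted in the review thread (an exception carrying `message`, `e_stack`, and `exit_code`). The constructor signature, the `delete_user` command, and the `run` wrapper are hypothetical and are not taken from the Determined CLI source.

```python
import sys


class CliError(Exception):
    """User-facing CLI failure; carries the exit code instead of exiting."""

    def __init__(self, message, e_stack=None, exit_code=1):
        super().__init__(message)
        self.message = message
        self.e_stack = e_stack  # optional underlying exception
        self.exit_code = exit_code


def delete_user(name):
    """Hypothetical command: raises instead of calling sys.exit itself."""
    if not name:
        raise CliError("a username is required", exit_code=2)
    return f"deleted {name}"


def run(command, *args):
    """Single top-level handler that turns CliError into an exit code."""
    try:
        print(command(*args))
        return 0
    except CliError as e:
        if e.e_stack is not None:
            print(f"Error: {e.e_stack}", file=sys.stderr)
        print(e.message, file=sys.stderr)
        return e.exit_code
```

An entry point would then call something like `sys.exit(run(delete_user, "alice"))`. Keeping the exit in one place means individual commands can be unit-tested without catching `SystemExit`.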

* fix: SSO layout (#6053)

* chore: clean up UI kit (#6039)

* fix: lopsided training with 2,1 gpus (#6054)

There was a guard to skip local zmq setup when local_size < 2, but that
became no longer valid when local_size varied from worker to worker.

The result is one extra global allgather in some cases, no big deal.

* docs: add rbac ntsc & mr release notes (#6049)

* chore: manual bump version (#6058)

* ci: retry downloading GKE auth plugin [DET-8956] (#6056)

We got a failure due to a timeout on this, so let it retry a few times.

* docs: update Singularity known issues. (#6047)

* fix: Add #rank to worker segment instead of timestamp segment of Pytorch Profiler files [MLG-326] (#6037)

* Add pytorch profiler specific handling logic for appending rank to file name

* Change to use f-string

* fix only file name being passed in

* remove print statement

* fix: handle agent shutdown msg (#6065)

* chore: manually bump vite version (#6066)

* feat: Display better x-axis ticks on charts with time axis [WEB-849] (#6051)

* remove xTickValues from props now that it can be calculated internally

* test: add logging to a flaky test (#6068)

Test is flaky but hard to pin down, so add some prints for next time.

* fix: Unrelated models are shown in a workspace model registry tab (#6067)

* feat: Added task-based historical allocation endpoint [DET-8537] (#6015)

* fix: show `not found` and `spinner` properly (#6070)

* fix: show `not found` and `spinner` properly

* chore: change home redirect path

* fix: projectDetail page

* fix: project.workspaceId

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto (#6061)

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto

Bumps [golang.org/x/text](https://github.com/golang/text) from 0.3.5 to 0.3.8.
- [Release notes](https://github.com/golang/text/releases)
- [Commits](golang/text@v0.3.5...v0.3.8)

---
updated-dependencies:
- dependency-name: golang.org/x/text
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: gpt-neox docker image and startup hook to work for non-privileged user (#6060)

* test: add locking around migration [DET-8957] (#6071)

In integration tests, multiple processes can attempt to run migrations
against the same database at once, which can lead to errors because
PostgreSQL's `CREATE TABLE IF NOT EXISTS` is not great with concurrency
(it allows for a time-of-check/time-of-use failure).

The specific errors we were seeing were conflicts in the pg_type table,
so the code now locks that table for the duration of the migration
transaction.

More information:
https://www.postgresql.org/message-id/CA+TgmoZAdYVtwBfp1FL2sMZbiHCWT4UPrzRLNnX1Nb30Ku3-gg@mail.gmail.com
https://stackoverflow.com/questions/29900845

* fix: logging inconsistent newlines in slurm (#6074)

* fix: checkpoint helper for points > 1000 and points > maxDatapoints (#6069)

* fix: replace migration table lock with advisory lock (#6077)

Taking a table lock sometimes runs into permissions issues; advisory
locking should avoid that.

Also, I realized the locking should probably be after the deferred
transaction close instead of before.
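
As a rough illustration of the advisory-lock approach, the sketch below takes a transaction-level PostgreSQL advisory lock before running migrations. It is written in Python against a generic DB-API connection, not the Go code in the master; the function name, the lock key, the `%s` parameter style (as used by psycopg2), and the choice of `pg_advisory_xact_lock` over a session-level lock are all assumptions.

```python
# Arbitrary application-chosen key; every migrating process must use the same one.
MIGRATION_LOCK_KEY = 4242


def run_migrations_locked(conn, migrate, lock_key=MIGRATION_LOCK_KEY):
    """Run `migrate(cursor)` while holding a transaction-level advisory lock.

    pg_advisory_xact_lock blocks until the lock is free and is released
    automatically when the transaction commits or rolls back, so no
    table-level privileges are needed.
    """
    cur = conn.cursor()
    try:
        cur.execute("SELECT pg_advisory_xact_lock(%s)", (lock_key,))
        migrate(cur)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Because the lock is tied to the transaction, concurrent test processes serialize on the key rather than on any particular table.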

* build: check npm version on install (#6079)

This removes the `check-requirements` make target in the react folder
and replaces it with npm's native version check against the engines
property. This should make managing the node version slightly easier
because there's one less place to check.

* build: Apply webui lint fixes in precommit (#6078)

* allow linters to fix in precommit

This updates the web linters to automatically apply fixes when doing a
pre-commit check. This should ideally streamline the commit process by
reducing the number of times the user needs to run prettier and eslint
before committing.

* tweak stylelint and eslint commands

* stage changed files

* type file_paths argument

* ci: adjust target accuracy for a test (#6085)

We got one failure [1] where the accuracy ended up just a hair below
0.83, so drop the target.

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/33883/workflows/4a5d3257-6061-4f4d-bd66-096a580a5959/jobs/1194282/steps

* chore: UserBadge moves into design kit (#6086)

* chore: Remove chart feature flag, remove unused code [WEB-930] (#6064)

* tooltipsPlugin and TrialDetailsOverview alternates go into place
* checkpoint helper for points > 1000 and points > maxDatapoints
* move former LearningCurveChart into TrialsComparison
* sync up with #6069 changes

* fix: replace defaultvalue with initialValue (#6076)

* fix: Dont suggest moving model into its current workspace (#6088)

* ci: delete database at beginning of det deploy tests [DET-8937] (#6089)

Previously, the database was being retained between tests, sometimes
causing tests to fail when extra agents appeared due to agent
reattach. The tests should generally be independent anyway, so reset
the database (by default, with an option to disable) each time the
cluster or master comes up.

* feat: add Facepile component (#6081)

* fix: pre-commit web bug fix (#6090)

* ci: make GKE test jobs run serially (#6096)

We keep hitting GKE GPU quotas; this will probably help with that.

* fix: GPU counting for k8s cluster info page (#6094)

* fix: test-e2e-gke-parallel use t4 (#6093)

* ci: retry protoc download (#6095)

We got an incorrect file downloaded one time [1], so retry this
download, like in 2906257 (#5996).

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/34074/workflows/e48681f8-8b75-4349-82eb-06e922d8bfcb/jobs/1202610

* refactor: add Card to UI Kit [WEB-818] (#5893)

* docs: Launcher doc improvements (#6099)

- Generalize journalctl command example --since option to work on Ubuntu.
- Clarify user_name/group_name account requirements.

* feat: Attend to TODOS across the code base (#6087)

* perf: tweak metrics series query. (#6105)

* chore: race could cause run container to return a different error than expected [DET-8870] (#6092)

* chore: add more metadata to slurm logs (#6030)

* chore: remove `ExpCompareMetricNames`, `ExpCompareTrialsSample` endpoints. (#6106)

* docs: fix reported DataPoint label doc (#6107)

* fix: tolerate additional non-CPU, non-GPU quotas in k8s (#6109)

* fix: stop filtering of valid options to reflect build issues (#6116)

* fix: modal theme color (#6117)

* fix: add bgColor in trial comparison table (#6119)

* fix: browser console warnings (#6122)

* fix: browser console warnings

* fix: remove spread operator

* chore: UIKit Pivot renaming (#6120)

* fix: correct minor JSX syntax (#6126)

* docs: add myst_parser extension (#6127)

We would like to support markdown-format documentation.  There are still
some kinks to be worked out with converting rst to myst files, but this
is a start.

* docs: fix some broken redirects (#6129)

* feat: add 3rd batch of TODO removals (#6115)

* feat: generic proxy configs [DET-8761] (#5978)

* build: [MLG-336] Limit the version of protobuf (#6134)

build: [MLG-336] Limit the version of protobuf

Installing the requirement `tensorflow-macos=2.8.0` pulls protobuf as a downstream dependency. Version 4.21 of the Python protobuf package had a breaking change that makes it incompatible with tensorflow-2.8.0 (see tensorflow/tensorflow#56077). Later patches to Tensorflow limit the version of protobuf to 3.20. We've got a work item to update the tensorflow we include, but until then this change gives the ceiling on tensorflow's protobuf dependency that its later versions enforce.

* chore: update detectron2 example to use v0.6 and reenable nightly test [MLG-301] (#6103)

* Run model in EventStorage context

* Use new Docker images

* Remove pytest.skip from test_detectron2_coco_pytorch_const

* Update README.md

* Minor code reduction

* Dockerfile (listed in .detignore)

* Use determinedai repo instead of a personal repo

* Makefile for building and publishing the Docker image

* docs: Bring content changes from docusaurus-ls beta (#6121)

* docs: Bring content changes from docusaurus-ls beta

Bring over content changes from the beta including reorganization changes.

* additional organizational edits

updating index pages, adding a top nav to welcome page

* added redirects

* revisions based on feedback

* rstfmt run

* feat: display workspace icon in ProjectCard (#6125)

* fix: checkpoint GC should set resource pool (#6136) [DET-9018]

* docs: bump rstfmt version (#6138)

* chore: add dev cli option to get auth token (#6008)

add a `curl` option to help with curling various endpoints

* build(deps): bump golang.org/x/net from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0 in /proto (#6130)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0.
- [Release notes](https://github.com/golang/net/releases)
- [Commits](https://github.com/golang/net/commits/v0.7.0)

* perf: Improved performance of historical allocation task endpoint, removed training/validation times (#6135)

* fix: FOUNDENG-438 Podman tests from the gate are breaking znodes again (#6146)

* chore: add Toggle component to UI Kit [WEB-841] (#6144)

* chore: Add tags to UI kit [WEB-816] (#6100)

* chore: Move SelectFilter into kit folder and update it [WEB-843] (#6102)

* fix: replace `InlineEditor` with UIKit input (#6082)

* fix: replace `InlineEditor`

* fix: add modal for experiment name

* fix: layout of settings page

* fix: setting page

* fix: minor changes

* feat: move experiment `description` and `tags` into edit modal

* chore: add `N/A` when description is empty in experiment detail

* fix: value bug

* fix: revert tag; remove tag from edit modal due to design inconsistency

* chore:  add/test pt-only images and bumpenvs (#6097)

* add pt images to some unit tests

* add pt-images to circleci config

* run bumpenvs procedure

* fix test function signatures and linting

* fix warnings linting

* fix docs

* expand unit tests coverage

* build(deps): bump github.com/prometheus/client_golang from 1.10.0 to 1.11.1 in /master (#6004)

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.10.0 to 1.11.1.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.10.0...v1.11.1)

* feat: add det.import_from_path (#5737)

import_from_path allows users to import arbitrary code from a
checkpoint, even if the modules in the checkpoint have the same name as
modules they already have imported, but contain different code.

This is common when importing, for example, an old model_def.py that has
been updated since the original checkpoint was saved.
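
The name-collision problem described above can be illustrated with the standard library alone. This is a minimal sketch, not the implementation of `det.import_from_path`: the helper `load_module_from_file` and the alias names are hypothetical, and it loads a single file rather than a whole checkpoint directory. It shows that two files both named `model_def.py` can be loaded side by side when each is given its own module object instead of going through the shared `sys.modules` cache.

```python
import importlib.util
import pathlib
import tempfile
from types import ModuleType


def load_module_from_file(path: pathlib.Path, alias: str) -> ModuleType:
    """Load a Python file as a standalone module (hypothetical helper).

    The module is deliberately not registered in sys.modules, so it cannot
    shadow, or be shadowed by, an already-imported module of the same name.
    """
    spec = importlib.util.spec_from_file_location(alias, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    # Two different versions of a file with the same name, as happens when
    # model_def.py is edited after a checkpoint was saved.
    old_dir = pathlib.Path(tmp) / "old"
    new_dir = pathlib.Path(tmp) / "new"
    old_dir.mkdir()
    new_dir.mkdir()
    (old_dir / "model_def.py").write_text("VERSION = 1\n")
    (new_dir / "model_def.py").write_text("VERSION = 2\n")

    old_model_def = load_module_from_file(old_dir / "model_def.py", "old_model_def")
    new_model_def = load_module_from_file(new_dir / "model_def.py", "new_model_def")
```

Both module objects stay usable after loading, and each keeps the code it was loaded from.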

* fix: post rank_id correctly for fluent-less logging (#6151) [DET-8999]

* ci: send Slack notification on GKE node pool creation failure (#6152)

In order to prevent quota failures from showing up as CI failures, this
makes node pool creation failure send a Slack notification and mark the
job as successful.

I couldn't figure out how to use the Slack orb while distinguishing this
particular situation from the general failure case, so I just slapped in
a direct request to the already-configured Slack webhook for sending
messages to #ci-bots.

`circleci-agent step halt` marks the job as successful, which is why we
want a notification at all. For some reason, CircleCI fails to provide
an equivalent for marking the job as canceled or some other state
besides success/failure; we could make a call to the CircleCI API to
cancel the current job, but that would rely on having a CircleCI token
available, which we're trying to get away from.

* chore: drop unused columns from `raw_steps`, `raw_validations`, and `raw_checkpoints`. (#6110)

* fix: render spinner while auth check pending (#6098)

* chore: update hpc-launching-architecture doc - add default slurm option --no-requeue (#6141)

* docs: Content updates (#6154)

formatted the setup cluster table to match the approved version in the docusaurus ls beta

* feat: display user id in `det user list`. (#6156)

* fix: Additional tables get experiment- / workspace-specific storagePath [WEB-962] (#6128)

* fix: Additional tables get experiment- and workspace-specific storagePath

* useMemo

* fix: selection width in `move experiment` modal (#6149)

* fix: selection width in `move experiment` modal

* fix: add form wrapping

* chore: remove change

* feat: show trial hyperparameters for custom searchers [MLG-343] (#6162)

* feat: show trial hyperparameters for custom searchers [MLG-343]

* fix: corrected timestamp handling to do an interval overlap instead of contains (#6164)
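
The difference between the two checks named in this commit title can be sketched as follows. The function names and the half-open `[start, end)` convention are assumptions made for illustration; the actual query in the endpoint is not shown here.

```python
from datetime import datetime


def contains(q_start: datetime, q_end: datetime, start: datetime, end: datetime) -> bool:
    """True only if [start, end) lies entirely inside the query window."""
    return q_start <= start and end <= q_end


def overlaps(q_start: datetime, q_end: datetime, start: datetime, end: datetime) -> bool:
    """True if [start, end) shares any time with the query window."""
    return start < q_end and q_start < end


# Query window: January. The interval starts in December and ends in January.
window = (datetime(2023, 1, 1), datetime(2023, 2, 1))
interval = (datetime(2022, 12, 28), datetime(2023, 1, 3))

inside = contains(*window, *interval)
touching = overlaps(*window, *interval)
```

An interval that straddles the window boundary is dropped by a "contains" check but kept by an overlap check, which is the kind of case the title suggests the fix addresses.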

* chore:  add release notes (#6167)

* chore: add release notes

* format with rstfmt

* chore: suppress help output for det dev (#6145)

avoid showing the `dev` option in `det -h` output

* chore: lock api state for backward compatibility check

* chore: bump version: 0.20.0-dev0 -> 0.20.1-dev0

* fix: separate Router and authCheck (#6170)

* fix: useMemo does not depend on trial having been loaded (#6173)

* chore: pass Labels/project/workspace to TaskSpec (#6172)

* refactor: replace user store with observables [WEB-799] (#6140)

* fix: hide Foldable menu options when button is visible (#6178)

This fixes an issue where, when using a `PageHeaderFoldable` component,
options that appear in the header always appear in the overflow menu.

* feat: add labels to GCP instances created with det deploy gcp [MLG-170] (#6147)

* feat: add labels to GCP instances created with det deploy gcp

* Changes to mimic det deploy aws --add-tags

* Add labels to other resources as well

* revert: reflag new chart experience (#6181)

* build: eliminate java dependency for typescript swagger bindings (#6139)

* fix: Close experiment fork/continue trial modal properly (#6174)

* fix: Continue Trial flow does not take the new `max_length` (#6168)

* fix: pass workspace ID when creating tensor board from WebUI [WEB-1019] (#6186)

* fix: don't print ':' when err msg is empty (#6190)

* fix: exp move modal (#6183)

* fix: exp move modal

* fix: minor fixes

* fix: add `archived` param and simplify query (#6175)

* fix: add `archived` param and simplify query

* chore: indent

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Maksim <maksim.kouznetsov@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>

* handle add user

* Revert "Deepspeed Feature Branch - merge master (#6193)"

This reverts commit 05ee6a2.

* Move dsat into harness (#6254)

move dsat into harness

* fixed too-large dataset bug (#6262)

* MLG-337 (#6282)

 ds_config.json centered workflow and cleanup

* Feat deepspeed autotune fix merge conflict (#6289)

Merging with master, fixing a merge conflict, and minor cleanup.

* reset webui to latest master (#6291)

* linting fixes (#6296)

* Refactor DS AT for Trial Compatibility (#6307)

Refactored the searcher to use a json based config workflow with overwrite_deepspeed_args, as in (most of) the official examples.

* fix: remove all Close operations (#6383)

Refactored to remove all Close operations to avoid race condition errors. Also added quick no op example and fixed other bugs.

* restore util.py

* restoring more files to latest master version

* merge in util.py changes

* Remove OOM catcher.

* Add Close operations back in

* Standardize autotuning config names.

* General clean up

* Fix dsat reporting bug.

* Minor changes.

* Clean up.

* Try/except hack around dead agent due to exit

* Minor changes.

* feat: allow users to specify zero optim search space and runner config (#6452)

* fix: do not merge user zero search config with defaults (#6464)

* populate the custom searcher logs with the correct event (#6470)

* fix: search state accounting (#6473)

* feat: add simple linear batch test searcher (#6482)

* fix: searcher refactor (#6513)

* deprecate the zero_search_config functionality (#6517)

* fix: timing fix and cleanup with tests (#6555)

* remove single should_stop and add more granular state

* adding the beginning of some autotuning tests

* real unit tests for DS AT

* Clean up the trial tracker properties, base searcher, and names

* Do not close twice

---------

Co-authored-by: Taylor Ritenour <taylor.ritenour@hpe.com>

* fix: minor cleanup (#6558)

* feat: deepspeed autotune trial based methods (#6575)

* fix: minor bug broke Trial class DS AT (#6613)

* feat: optional steps completed in context manager (#6619)

Make steps_completed in reporting context optional

Update examples to use updated context manager

minor example changes

Minor model changes

Delete old trial class examples

Add torchvision model trial example

* MLG-369: Some initial tests for DS AT (#6551)

* real unit tests for DS AT

* clean up and standardize the tests more

* also send a Close operation

* touch up the tests

* clarifications and clean up

* feat: deepspeed autotune cli args (#6643)

* Add __main__ and reuse original args on-cluster.

* Cleanup

* Adding include

* Finish adding include

* more args

* add search runner exp_id to follow on exp description

* minor comments

* Different registering system and starting the queue

* more args and cleanup

* About to use a queue (kind of)

* Attempting to enforce trial constraints

* Add early stopping as arg

* Remove old autotuning args and get them instead from cli

* Corrected args bugs

* Fixed many refactoring bugs

* Cleanup and bug fixes

* Move exp_conf edits to _run_dsat

* Add zero stages arg

* Move exp conf changes out of searcher and use zero_stages arg

* Fix simple searcher bug

* minor cleanup

* cleanup

* Starting refactor away from searcher state

* Changed searcher state computation to trial tracker

* Revert to single configs everywhere

* Update readme

* Edit added description text

* Add closed attr to Trials and bug fixes

* clarifying comment

* Rebase onto latest feature branch

* Add actual deque, fix bugs

* Remove print test

* small trial example fix

* Clean up try/except hack

* config edits

* feat: visualize cli args (#6651)

* Add CLI args to config hparams for visualization

* remove pickle path arg

* fix up the dsat tests (#6664)

* fix up the dsat tests

* make sure to pass through the args parsing function

* touch up the tests so that they can issue a failure to the experiment (#6676)

* chore: update deepspeed to 0.8.3 [MLG-399] (#6685)

* feat: hf trainer examples (#6687)

* chore: refactor dsat module to be independent of deepspeed imports [MLG-499] (#6694)

* chore: refactor dsat module to be independent of deepspeed imports [MLG-499]

* also update the test and det_callback.py

* chore: move over the examples for DSAT [MLG-500] (#6717)

* feat: add deepspeed autotuning examples [MLG-500]

* some clean up and UX improvements

* remove double parens, make sure that orchestrator id is on the far left

* feat: add follow on exp option (#6720)

* feat: migrate the huggingface det_callback [MLG-487] (#6724)

* migrating the det_callback

* feat: migrate the huggingface det_callback [MLG-487]

* don't export DetCallback through the top level integrations

* feat: minor test updates (#6746)

* feat: use lock with json overwrite for hf (#6742)

* feat: use lock with json overwrite for hf

* handle merge with our previous DetCallback refactors

---------

Co-authored-by: Garrett Goon <garrett.goon@hpe.com>

* feat: basic trial tracker tests (#6754)

* fix: hf overwrite bug

* fix: remove old import (#6761)

* feat: adding e2e tests for DSAT, enabling searcher to shutdown client experiment [MLG-369] (#6781)

* migrating the det_callback

* feat: migrate the huggingface det_callback [MLG-487]

* wip working on the e2e tests

* getting the basics of the tests running. Still appear to be issues though

* fixing up the tests

* fixing up tests

* handle cases where explicit exceptions are raised in the dsat search runner

* clean up for merging

* revert restarts change

* feat: add search method tests (#6785)

* add search method tests

* update hf ex readme

* quick fix for the unit tests (#6796)

* feat: write best ds config json to checkpoint (#6787)

* feat: refactor argparse into subparsers (#6801)

* feat: add binary search (#6806)

* fix: move lock to fix hf overwrites (#6828)

* fix: small fixes (#6825)

* expand message for model profile info failure

* correct the progress calculation

* Remove autotuning section from checkpointed best configs

* also write the best metrics to the checkpoint

* fix: proper placement of start/end profile step (#6834)

* feat: more random search test (#6837)

* chore: stabilize static typed python [MLG-498] (#6846)

* flake8 fixes

* mypy issues

* updates according to comments for mypy changes (#6864)

* feat: asha (#6852)

* merged in prev asha code

* starting asha tests

* more tests and cleanup

* test cleanup

* asha params closer to current nomenclature

* refactor asha args

* import fix

* mypy

* add asha to __all__

* add search data to stage 3 test

* chore: move hf examples (#6871)

* replace old hf integrations examples with new ones

* fix no-op bug in hf helper function

* fix helper function imports

* update readme

* use the python module for `searcher` directly (#6883)

* use the python module for `searcher` directly rather than importing the individual elements

* additional fixes

* feat: clean up dsat examples (#6891)

* moving files

* moving more files

* more file movement

* stage 1 in config

* core api script cleanup

* deepspeed.yaml core api cleanup

* align deepspeed.yaml files

* align ds_config.json files

* remove lr scheduler

* shorten length to 100

* model_def.py cleanup

* Added checkpointing and better metrics reporting

* cleaned up readme

* change example dir name

* add torchvision examples to test_official.py

* starting e2e tests

* add all e2e tests

* remove accidental double test

* chore: to do cleanup (#6895)

* cli docs

* cli doc strings

* remove todo comment

* update doc strings in dsat _utils.py

* searcher class doc strings

* More search method doc strings for non-public classes

* remove searcher state tests

* remove many todos in _dsat_search_method.py

* remove todos elsewhere

* fix: do not schedule the same trial twice (#6896)

* attempting to fix tests (#6899)

* fix: move examples dir one level up and finish docs/Makefile changes (#6898)

* fix: remove improper test_official.py tests (#6900)

* fix docs formatting (#6902)

* fix docs formatting

* add deepspeed autotune directory to example builds

* support hf examples (#6910)

* feat: ds config from include (#6905)

* move overwrite_deepspeed_config back to det.pytorch.deepspeed

* allow for the ds json to be --include'd

* self.hparams -> hparams bug

* doc string edits

* move ds_config.json back inside of no_op

* chore: update the custom searcher docs [MLG-447] (#6934)

* updating the docs for custom searcher wip

* wip, fixing up the docs, making sure things are clean and link properly

* chore: update the custom searcher docs [MLG-447]

* updates according to comments

* fix docs build

* changes according to comments in dsat branch (#6943)

* changes according to comments.

- removing no_op
- removing cache_dir in hf examples
- removing erroneous release-notes

* revert the changes to the environments so we are in sync with bumpenvs

* adjust the huggingface versions to the current minor version

* update version

* bug: fully wrap hf JSON loading around a FileLock (#6950)

* feat: deepspeed autotune user guide (#6929)

* starting dsat user guide

* import cleanup in hf examples

* more editing

* starting to list cli options

* formatting

* git mv hf examples to make more descriptive dirs

* remove TODO

* incorporating feedback

* links to examples

* cleanup

* Update cli help

* general cli options cleanup

* incorporate taylor's second round of feedback

* incorporate tara's comments

* fix: no auto in hf cli (#6963)

* add int check to hf cli arg overwrite

* fix accidental trivial test

* new test ensuring auto not used in hf cli args

* add checks against copying "auto" to HF CLI entrypoint

* add link to the dsat user guide (#6964)

* add link to the dsat user guide

* updated wording

* add defaults to dsat cli help menu (#6970)

* feat: more tests and asha cleanup (#6966)

* account for asha early stopping in test

* clean up lineage_completed_rung

* more full mock experiment tests

* Only skip completed trials added to queue

* base rungs off latest, rather than root

* max trial computation cleanup

* only include curr_rung <= rung_idx trial in best computation

* promotion respects rung idx test

* test cleanup

* more test cleanup

* add test_get_best_trial_in_lineage

* doc string cleanup

* fix up test_full_experiment_reverse_ordered_results

* minor wording edit

* always pop off highest curr_rung asha trial next

* fix the doc builds, add a release-note (#6973)

* fix the doc builds, add a release-note

* update docs names

* make flake8 behave

* update by the pre-commit complaints

* fix: readme cleanup (#6974)

* touch up hf trainer readme

* shorten up and simplify torchvision readme

* feat: update defaults and small tweaks (#6975)

* trials_per_random_config 5

* max trials 64

* min binary search trials 3

* fix text by avoiding trivial search range

* remove should_discard function to avoid possible locking

* increase timeouts for mock tests

* mypy fix

* update ceiling computation for new random trials

* base the ceiling off the max mbs computation, not the midpoint

* schedule longer lineages first in asha

* update docs to reflect new defaults

* fmt examples

* make isort behave

* address some comments about logging levels and comments

* remove erroneous TODO

* small spelling mistake

* move the `determined.integrations.huggingface.DetCallback` to `determined.transformers.DetCallback`

* fix: comment and environment cleanup (#6988)

* explain try/except in search runner

* remove while true and comments in _deepspeed_trial.py

* fix all deepspeed yaml files

* remove step id comment

* more todo cleanup

* make sure the docs build again

* fix the names of the e2e tests and in README

* don't run so many e2e_tests for deepspeed

* reduce hf image class ds slots per trial (#6998)

* fix e2e tests

* fix the convergence tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Maksim Kouznetsov <maksim.kouznetsov@hpe.com>
Co-authored-by: Garrett Goon <garrett.goon@hpe.com>
Co-authored-by: Garrett Goon <44747910+garrett361@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
Co-authored-by: Emily Bonar <emily.bonar@hpe.com>
wes-turner added a commit that referenced this pull request Jun 5, 2023
* Save model info; add Core API DeepSpeed example.

* extract ds profiler results

* Place activation mem into search metric.

* Obtain metrics from the model info file

* remove unused code

* adapted to passing dicts into report_completed

* cleanup and small changes

* refactoring and trial helper classes

* initial random search logic started

* minor changes

* minor cleanup

* use context manager, expanded base searcher

* remove ds_autotuning dir from examples

* minor edits

* readme updates and other cleanup

* bug fixes and a hack to avoid needing nested model dirs

* README updates

* Feat deepspeed autotune (#5875)

Added current Core API prototype.

* switched to triggering their autotuning in our trials

* remove checkpoint wrapper

* implementing checkpointing

* implementing checkpointing

* feat: allow includes in custom searcher experiment [MLG-338] (#6091)

* cleanups, bug fixing, and more examples

* adding native dsat tests

* cleanup

* minor edits

* fix missing is_chief

* Feat deepspeed autotune (#6159)

Trigger the native DS AT exit behavior for all trials.

* minor edits

* Feat deepspeed autotune git fixes (#6180)

deleted extraneous files

* readme fix and make cifar10 work off grenoble (#6187)

* Deepspeed Feature Branch - merge master (#6193)

* docs: Improvements to HPC launcher docs (#6042)

* Provide inline info about agent-specific scheduling options that do not apply to HPC Launcher
  configurations.
* Identify enroot-specific differences from docker (like for Singularity)
* Provide reference to custom resource pools as an option to deal with non-homogenous
  Slurm/PBS partitions.

* chore: Allow newer Node versions 17-19 (#6038)

* fix: k8s rm gives wrong slot count in rendezvous (#6044)

* chore: bump version: 0.19.12-dev0 -> 0.20.0-dev0 (#6048)

* chore: remove `applicableRoutespace` (#6040)

* chore: warning fixes in web (#6041)

* chore: fix warnings

* chore: change eslint rules

* chore: fix gpu nightly errors (#6046)

* chore: missing nodev18 (#6050)

* chore: add a dedicated exception for cli errors (#5649)

switch sys.exit calls in cli with a new user-facing exception.
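
A minimal sketch of the pattern this commit describes, under stated assumptions: the attribute names `message`, `e_stack`, and `exit_code` follow the review snippet quoted earlier in this pull request, while the constructor signature, the `run` wrapper, and `failing_command` are hypothetical and only illustrate the idea. Command code raises a user-facing exception, and a single top-level handler turns it into a message and an exit code.

```python
import sys
from typing import Callable, Optional


class CliError(Exception):
    """User-facing CLI error (illustrative shape, not the real class)."""

    def __init__(
        self,
        message: str,
        e_stack: Optional[Exception] = None,
        exit_code: int = 1,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.e_stack = e_stack  # optional underlying exception, for context
        self.exit_code = exit_code


def run(command: Callable[[], None]) -> int:
    """Run a command and translate CliError into an exit code in one place."""
    try:
        command()
    except CliError as e:
        if e.e_stack is not None:
            print(f"Error: {e.e_stack}", file=sys.stderr)
        print(e.message, file=sys.stderr)
        return e.exit_code
    return 0


def failing_command() -> None:
    # Raise instead of calling sys.exit() deep inside command code.
    raise CliError("experiment not found", exit_code=2)
```

A real entry point would pass the result to `sys.exit`, for example `sys.exit(run(failing_command))`. Keeping the exit in one place leaves command functions easier to test, since they raise rather than terminate the process.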

* fix: SSO layout (#6053)

* chore: clean up UI kit (#6039)

* fix: lopsided training with 2,1 gpus (#6054)

There was a guard to skip local zmq setup when local_size < 2, but that
became no longer valid when local_size varied from worker to worker.

The result is one extra global allgather in some cases, no big deal.

* docs: add rbac ntsc & mr release notes (#6049)

* chore: manual bump version (#6058)

* ci: retry downloading GKE auth plugin [DET-8956] (#6056)

We got a failure due to a timeout on this, so let it retry a few times.
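
The retry idea can be sketched generically as below. This is an illustration only: the CI job itself is not written in Python, and the `retry` helper, its parameters, and the simulated `flaky_download` are all hypothetical.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Call action until it succeeds, re-raising the error from the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_s)  # fixed pause between attempts
    raise AssertionError("unreachable")


calls: List[int] = []


def flaky_download() -> str:
    # Simulated download that times out twice before succeeding.
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = retry(flaky_download, attempts=5)
```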

* docs: update Singularity known issues. (#6047)

* fix: Add #rank to worker segment instead of timestamp segment of Pytorch Profiler files [MLG-326] (#6037)

* Add pytorch profiler specific handling logic for appending rank to file name

* Change to use f-string

* fix only file name being passed in

* remove print statement

* fix: handle agent shutdown msg (#6065)

* chore: manually bump vite version (#6066)

* feat: Display better x-axis ticks on charts with time axis [WEB-849] (#6051)

* remove xTickValues from props now that it can be calculated internally

* test: add logging to a flaky test (#6068)

Test is flaky but hard to pin down, so add some prints for next time.

* fix: Unrelated models are shown in a workspace model registry tab (#6067)

* feat: Added task-based historical allocation endpoint [DET-8537] (#6015)

* fix: show `not found` and `spinner` properly (#6070)

* fix: show `not found` and `spinner` properly

* chore: change home redirect path

* fix: projectDetail page

* fix: project.workspaceId

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto (#6061)

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto

Bumps [golang.org/x/text](https://github.com/golang/text) from 0.3.5 to 0.3.8.
- [Release notes](https://github.com/golang/text/releases)
- [Commits](golang/text@v0.3.5...v0.3.8)

---
updated-dependencies:
- dependency-name: golang.org/x/text
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: gpt-neox docker image and startup hook to work for non-privileged user (#6060)

* test: add locking around migration [DET-8957] (#6071)

In integration tests, multiple processes can attempt to run migrations
against the same database at once, which can lead to errors because
PostgreSQL's `CREATE TABLE IF NOT EXISTS` is not great with concurrency
(it allows for a time-of-check/time-of-use failure).

The specific errors we were seeing were conflicts in the pg_type table,
so the code now locks that table for the duration of the migration
transaction.

More information:
https://www.postgresql.org/message-id/CA+TgmoZAdYVtwBfp1FL2sMZbiHCWT4UPrzRLNnX1Nb30Ku3-gg@mail.gmail.com
https://stackoverflow.com/questions/29900845

* fix: logging inconsistent newlines in slurm (#6074)

* fix: checkpoint helper for points > 1000 and points > maxDatapoints (#6069)

* fix: replace migration table lock with advisory lock (#6077)

Taking a table lock sometimes runs into permissions issues; advisory
locking should avoid that.

Also, I realized the locking should probably be after the deferred
transaction close instead of before.

* build: check npm version on install (#6079)

This removes the `check-requirements` make target in the react folder
and replaces it with npm's native version check against the engines
property. This should make managing the node version slightly easier
because there's one less place to check.

* build: Apply webui lint fixes in precommit (#6078)

* allow linters to fix in precommit

This updates the web linters to automatically apply fixes when doing a
pre-commit check. This should ideally streamline the commit process to
reduce the amount of times the user needs to run prettier and eslint
before committing.

* tweak stylelint and eslint commands

* stage changed files

* type file_paths argument

* ci: adjust target accuracy for a test (#6085)

We got one failure [1] where the accuracy ended up just a hair below
0.83, so drop the target.

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/33883/workflows/4a5d3257-6061-4f4d-bd66-096a580a5959/jobs/1194282/steps

* chore: UserBadge moves into design kit (#6086)

* chore: Remove chart feature flag, remove unused code [WEB-930] (#6064)

* tooltipsPlugin and TrialDetailsOverview alternates go into place
* checkpoint helper for points > 1000 and points > maxDatapoints
* move former LearningCurveChart into TrialsComparison
* sync up with #6069 changes

* fix: replace defaultvalue with initialValue (#6076)

* fix: Don't suggest moving model into its current workspace (#6088)

* ci: delete database at beginning of det deploy tests [DET-8937] (#6089)

Previously, the database was being retained between tests, sometimes
causing tests to fail when extra agents appeared due to agent
reattach. The tests should generally be independent anyway, so reset
the database (by default, with an option to disable) each time the
cluster or master comes up.

* feat: add Facepile component (#6081)

* fix: pre-commit web bug fix (#6090)

* ci: make GKE test jobs run serially (#6096)

We keep hitting GKE GPU quotas; this will probably help with that.

* fix: GPU counting for k8s cluster info page (#6094)

* fix: test-e2e-gke-parallel use t4 (#6093)

* ci: retry protoc download (#6095)

We got an incorrect file downloaded one time [1], so retry this
download, like in 2906257 (#5996).

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/34074/workflows/e48681f8-8b75-4349-82eb-06e922d8bfcb/jobs/1202610

* refactor: add Card to UI Kit [WEB-818] (#5893)

* docs: Launcher doc improvements (#6099)

- Generalize journalctl command example --since option to work on Ubuntu.
- Clarify user_name/group_name account requirements.

* feat: Attend to TODOS across the code base (#6087)

* perf: tweak metrics series query. (#6105)

* chore: race could cause run container to return a different error than expected [DET-8870] (#6092)

* chore: add more metadata to slurm logs (#6030)

* chore: remove `ExpCompareMetricNames`, `ExpCompareTrialsSample` endpoints. (#6106)

* docs: fix reported DataPoint label doc (#6107)

* fix: tolerate additional non-CPU, non-GPU quotas in k8s (#6109)

* fix: stop filtering of valid options to reflect build issues (#6116)

* fix: modal theme color (#6117)

* fix: add bgColor in trial comparison table (#6119)

* fix: browser console warnings (#6122)

* fix: browser console warnings

* fix: remove spread operator

* chore: UIKit Pivot renaming (#6120)

* fix: correct minor JSX syntax (#6126)

* docs: add myst_parser extension (#6127)

We would like to support markdown-format documentation.  There are still
some kinks to be worked out with converting rst to myst files, but this
is a start.

* docs: fix some broken redirects (#6129)

* feat: add 3rd batch of TODO removals (#6115)

* feat: generic proxy configs [DET-8761] (#5978)

* build: [MLG-336] Limit the version of protobuf (#6134)

build: [MLG-336] Limit the version of protobuf

Installing the requirement `tensorflow-macos=2.8.0` pulls protobuf as a downstream dependency. Version 4.21 of the Python protobuf package had a breaking change that makes it incompatible with tensorflow-2.8.0 (see tensorflow/tensorflow#56077). Later patches to Tensorflow limit the version of protobuf to 3.20. We've got a work item to update the tensorflow we include, but until then this change gives the ceiling on tensorflow's protobuf dependency that its later versions enforce.

* chore: update detectron2 example to use v0.6 and reenable nightly test [MLG-301] (#6103)

* Run model in EventStorage context

* Use new Docker images

* Remove pytest.skip from test_detectron2_coco_pytorch_const

* Update README.md

* Minor code reduction

* Dockerfile (listed in .detignore)

* Use determinedai repo instead of a personal repo

* Makefile for building and publishing the Docker image

* docs: Bring content changes from docusaurus-ls beta (#6121)

* docs: Bring content changes from docusaurus-ls beta

Bring over content changes from the beta including reorganization changes.

* additional organizational edits

updating index pages, adding a top nav to welcome page

* added redirects

* revisions based on feedback

* rstfmt run

* feat: display workspace icon in ProjectCard (#6125)

* fix: checkpoint GC should set resource pool (#6136) [DET-9018]

* docs: bump rstfmt version (#6138)

* chore: add dev cli option to get auth token (#6008)

add a `curl` option to help with curling various endpoints

* build(deps): bump golang.org/x/net from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0 in /proto (#6130)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0.
- [Release notes](https://github.com/golang/net/releases)
- [Commits](https://github.com/golang/net/commits/v0.7.0)

* perf: Improved performance of historical allocation task endpoint, removed training/validation times (#6135)

* fix: FOUNDENG-438 Podman tests from the gate are breaking znodes again (#6146)

* chore: add Toggle component to UI Kit [WEB-841] (#6144)

* chore: Add tags to UI kit [WEB-816] (#6100)

* chore: Move SelectFilter into kit folder and update it [WEB-843] (#6102)

* fix: replace `InlineEditor` with UIKit input (#6082)

* fix: replace `InlineEditor`

* fix: add modal for experiment name

* fix: layout of settings page

* fix: setting page

* fix: minor changes

* feat: move experiment `description` and `tags` into edit modal

* chore: add `N/A` when description is empty in experiment detail

* fix: value bug

* fix: revert tag; remove tag from edit modal due to design inconsistency

* chore:  add/test pt-only images and bumpenvs (#6097)

* add pt images to some unit tests

* add pt-images to circleci config

* run bumpenvs procedure

* fix test function signatures and linting

* fix warnings linting

* fix docs

* expand unit tests coverage

* build(deps): bump github.com/prometheus/client_golang from 1.10.0 to 1.11.1 in /master (#6004)

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.10.0 to 1.11.1.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.10.0...v1.11.1)

* feat: add det.import_from_path (#5737)

import_from_path allows users to import arbitrary code from a
checkpoint, even if the modules in the checkpoint have the same name as
modules they already have imported, but contain different code.

This is common when importing, for example, an old model_def.py that has
been updated since the original checkpoint was saved.
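
The idea described above can be illustrated with the standard library alone. This is a minimal sketch, not the implementation of `det.import_from_path`: the helper `load_module_from_file` and the alias names are hypothetical, and it only shows how two files that are both named `model_def.py` can be loaded side by side without one shadowing the other.

```python
import importlib.util
import os
import tempfile
from types import ModuleType
from typing import Tuple


def load_module_from_file(path: str, alias: str) -> ModuleType:
    """Load the Python file at `path` as a module named `alias` (hypothetical helper).

    The module is not registered in sys.modules, so it cannot collide with an
    already-imported module that has the same file name.
    """
    spec = importlib.util.spec_from_file_location(alias, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo() -> Tuple[int, int]:
    """Write two different model_def.py files and load both in one process."""
    with tempfile.TemporaryDirectory() as old_dir, tempfile.TemporaryDirectory() as new_dir:
        for directory, version in ((old_dir, 1), (new_dir, 2)):
            with open(os.path.join(directory, "model_def.py"), "w") as f:
                f.write(f"VERSION = {version}\n")
        old = load_module_from_file(os.path.join(old_dir, "model_def.py"), "model_def_old")
        new = load_module_from_file(os.path.join(new_dir, "model_def.py"), "model_def_new")
        return old.VERSION, new.VERSION
```

Calling `demo()` returns the `VERSION` value from each file, which shows that the older copy stays usable even though a newer module with the same file name exists.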

* fix: post rank_id correctly for fluent-less logging (#6151) [DET-8999]

* ci: send Slack notification on GKE node pool creation failure (#6152)

In order to prevent quota failures from showing up as CI failures, this
makes node pool creation failure send a Slack notification and mark the
job as successful.

I couldn't figure out how to use the Slack orb while distinguishing this
particular situation from the general failure case, so I just slapped in
a direct request to the already-configured Slack webhook for sending
messages to #ci-bots.

`circleci-agent step halt` marks the job as successful, which is why we
want a notification at all. For some reason, CircleCI fails to provide
an equivalent for marking the job as canceled or some other state
besides success/failure; we could make a call to the CircleCI API to
cancel the current job, but that would rely on having a CircleCI token
available, which we're trying to get away from.

* chore: drop unused columns from `raw_steps`, `raw_validations`, and `raw_checkpoints`. (#6110)

* fix: render spinner while auth check pending (#6098)

* chore: update hpc-launching-architecture doc - add default slurm option --no-requeue (#6141)

* docs: Content updates (#6154)

formatted the setup cluster table to match the approved version in the docusaurus ls beta

* feat: display user id in `det user list`. (#6156)

* fix: Additional tables get experiment- / workspace-specific storagePath [WEB-962] (#6128)

* fix: Additional tables get experiment- and workspace-specific storagePath

* useMemo

* fix: selection width in `move experiment` modal (#6149)

* fix: selection width in `move experiment` modal

* fix: add form wrapping

* chore: remove change

* feat: show trial hyperparameters for custom searchers [MLG-343] (#6162)

* feat: show trial hyperparameters for custom searchers [MLG-343]

* fix: corrected timestamp handling to do an interval overlap instead of contains (#6164)
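
The difference between the two checks named in this commit title can be sketched as follows. The function names and the example times are assumptions made for illustration; the actual fix lives in the project's query code, which is not shown here.

```python
from datetime import datetime


def contained_in(q_start: datetime, q_end: datetime, start: datetime, end: datetime) -> bool:
    """True only if [start, end] lies entirely inside the query window."""
    return q_start <= start and end <= q_end


def overlaps(q_start: datetime, q_end: datetime, start: datetime, end: datetime) -> bool:
    """True if [start, end] shares any time with the query window."""
    return start < q_end and q_start < end


# Hypothetical allocation that begins before the query window opens.
query_start = datetime(2023, 2, 1, 10, 0)
query_end = datetime(2023, 2, 1, 12, 0)
alloc_start = datetime(2023, 2, 1, 9, 0)
alloc_end = datetime(2023, 2, 1, 11, 0)

missed_by_contains = not contained_in(query_start, query_end, alloc_start, alloc_end)
found_by_overlap = overlaps(query_start, query_end, alloc_start, alloc_end)
```

A containment test drops any interval that straddles a boundary of the query window, while the overlap test keeps it.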

* chore:  add release notes (#6167)

* chore: add release notes

* format with rstfmt

* chore: suppress help output for det dev (#6145)

avoid showing the `dev` option in `det -h` output

* chore: lock api state for backward compatibility check

* chore: bump version: 0.20.0-dev0 -> 0.20.1-dev0

* fix: separate Router and authCheck (#6170)

* fix: useMemo does not depend on trial having been loaded (#6173)

* chore: pass Labels/project/workspace to TaskSpec (#6172)

* refactor: replace user store with observables [WEB-799] (#6140)

* fix: hide Foldable menu options when button is visible (#6178)

This fixes an issue where, when using a `PageHeaderFoldable` component,
options that appear in the header always appear in the overflow menu.

* feat: add labels to GCP instances created with det deploy gcp [MLG-170] (#6147)

* feat: add labels to GCP instances created with det deploy gcp

* Changes to mimic det deploy aws --add-tags

* Add labels to other resources as well

* revert: reflag new chart experience (#6181)

* build: eliminate java dependency for typescript swagger bindings (#6139)

* fix: Close experiment fork/continue trial modal properly (#6174)

* fix: Continue Trial flow does not take the new `max_length` (#6168)

* fix: pass workspace ID when creating tensor board from WebUI [WEB-1019] (#6186)

* fix: don't print ':' when err msg is empty (#6190)

* fix: exp move modal (#6183)

* fix: exp move modal

* fix: minor fixes

* fix: add `archived` param and simplify query (#6175)

* fix: add `archived` param and simplify query

* chore: indent

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Maksim <maksim.kouznetsov@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>

* handle add user

* Revert "Deepspeed Feature Branch - merge master (#6193)"

This reverts commit 05ee6a2.

* Move dsat into harness (#6254)

move dsat into harness

* fixed too-large dataset bug (#6262)

* MLG-337 (#6282)

 ds_config.json centered workflow and cleanup

* Feat deepspeed autotune fix merge conflict (#6289)

Merging with master, fixing a merge conflict, and minor cleanup.

* reset webui to latest master (#6291)

* linting fixes (#6296)

* Refactor DS AT for Trial Compatibility (#6307)

Refactored the searcher to use a json based config workflow with overwrite_deepspeed_args, as in (most of) the official examples.

* fix: remove all Close operations (#6383)

Refactored to remove all Close operations to avoid race condition errors. Also added quick no op example and fixed other bugs.

* restore util.py

* restoring more files to latest master version

* merge in util.py changes

* Remove OOM catcher.

* Add Close operations back in

* Standardize autotuning config names.

* General clean up

* Fix dsat reporting bug.

* Minor changes.

* Clean up.

* Try/except hack around dead agent due to exit

* Minor changes.

* feat: allow users to specify zero optim search space and runner config (#6452)

* fix: do not merge user zero search config with defaults (#6464)

* populate the custom searcher logs with the correct event (#6470)

* fix: search state accounting (#6473)

* feat: add simple linear batch test searcher (#6482)

* fix: searcher refactor (#6513)

* deprecate the zero_search_config functionality (#6517)

* fix: timing fix and cleanup with tests (#6555)

* remove single should_stop and add more granular state

* adding the beginning of some autotuning tests

* real unit tests for DS AT

* Clean up the trial tracker properties, base searcher, and names

* Do not close twice

---------

Co-authored-by: Taylor Ritenour <taylor.ritenour@hpe.com>

* fix: minor cleanup (#6558)

* feat: deepspeed autotune trial based methods (#6575)

* fix: minor bug broke Trial class DS AT (#6613)

* feat: optional steps completed in context manager (#6619)

Make steps_completed in reporting context optional

Update examples to use updated context manager

minor example changes

Minor model changes

Delete old trial class examples

Add torchvision model trial example

* MLG-369: Some initial tests for DS AT (#6551)

* real unit tests for DS AT

* clean up and standardize the tests more

* also send a Close operation

* touch up the tests

* clarifications and clean up

* feat: deepspeed autotune cli args (#6643)

* Add __main__ and reuse original args on-cluster.

* Cleanup

* Adding include

* Finish adding include

* more args

* add search runner exp_id to follow on exp description

* minor comments

* Different registering system and starting the queue

* more args and cleanup

* About to use a queue (kind of)

* Attempting to enforce trial constraints

* Add early stopping as arg

* Remove old autotuning args and get them instead from cli

* Corrected args bugs

* Fixed many refactoring bugs

* Cleanup and bug fixes

* Move exp_conf edits to _run_dsat

* Add zero stages arg

* Move exp conf changes out of searcher and use zero_stages arg

* Fix simple searcher bug

* minor cleanup

* cleanup

* Starting refactor away from searcher state

* Changed searcher state computation to trial tracker

* Revert to single configs everywhere

* Update readme

* Edit added description text

* Add closed attr to Trials and bug fixes

* clarifying comment

* Rebase onto latest feature branch

* Add actual deque, fix bugs

* Remove print test

* small trial example fix

* Clean up try/except hack

* config edits

* feat: visualize cli args (#6651)

* Add CLI args to config hparams for visualization

* remove pickle path arg

* fix up the dsat tests (#6664)

* fix up the dsat tests

* make sure to pass through the args parsing function

* touch up the tests so that they can issue a failure to the experiment (#6676)

* chore: update deepspeed to 0.8.3 [MLG-399] (#6685)

* feat: hf trainer examples (#6687)

* chore: refactor dsat module to be independent of deepspeed imports [MLG-499] (#6694)

* chore: refactor dsat module to be independent of deepspeed imports [MLG-499]

* also update the test and det_callback.py

* chore: move over the examples for DSAT [MLG-500] (#6717)

* feat: add deepspeed autotuning examples [MLG-500]

* some clean up and UX improvements

* remove double parens, make sure that orchestrator id is on the far left

* feat: add follow on exp option (#6720)

* feat: migrate the huggingface det_callback [MLG-487] (#6724)

* migrating the det_callback

* feat: migrate the huggingface det_callback [MLG-487]

* don't export DetCallback through the top level integrations

* feat: minor test updates (#6746)

* feat: use lock with json overwrite for hf (#6742)

* feat: use lock with json overwrite for hf

* handle merge with our previous DetCallback refactors

---------

Co-authored-by: Garrett Goon <garrett.goon@hpe.com>

* feat: basic trial tracker tests (#6754)

* fix: hf overwrite bug

* fix: remove old import (#6761)

* feat: adding e2e tests for DSAT, enabling searcher to shutdown client experiment [MLG-369] (#6781)

* migrating the det_callback

* feat: migrate the huggingface det_callback [MLG-487]

* wip working on the e2e tests

* getting the basics of the tests running. Still appear to be issues though

* fixing up the tests

* fixing up tests

* handle cases where explicit exceptions are raised in the dsat search runner

* clean up for merging

* revert restarts change

* feat: add search method tests (#6785)

* add search method tests

* update hf ex readme

* quick fix for the unit tests (#6796)

* feat: write best ds config json to checkpoint (#6787)

* feat: refactor argparse into subparsers (#6801)

* feat: add binary search (#6806)

* fix: move lock to fix hf overwrites (#6828)

* fix: small fixes (#6825)

* expand message for model profile info failure

* correct the progress calculation

* Remove autotuning section from checkpointed best configs

* also write the best metrics to the checkpoint

* fix: proper placement of start/end profile step (#6834)

* feat: more random search test (#6837)

* chore: stabilize static typed python [MLG-498] (#6846)

* flake8 fixes

* mypy issues

* updates according to comments for mypy changes (#6864)

* feat: asha (#6852)

* merged in prev asha code

* starting asha tests

* more tests and cleanup

* test cleanup

* asha params closer to current nomenclature

* refactor asha args

* import fix

* mypy

* add asha to __all__

* add search data to stage 3 test

* chore: move hf examples (#6871)

* replace old hf integrations examples with new ones

* fix no-op bug in hf helper function

* fix helper function imports

* update readme

* use the python module for `searcher` directly (#6883)

* use the python module for `searcher` directly rather than importing the individual elements

* additional fixes

* feat: clean up dsat examples (#6891)

* moving files

* moving more files

* more file movement

* stage 1 in config

* core api script cleanup

* deepspeed.yaml core api cleanup

* align deepspeed.yaml files

* align ds_config.json files

* remove lr scheduler

* shorten length to 100

* model_def.py cleanup

* Added checkpointing and better metrics reporting

* cleaned up readme

* change example dir name

* add torchvision examples to test_official.py

* starting e2e tests

* add all e2e tests

* remove accidental double test

* chore: to do cleanup (#6895)

* cli docs

* cli doc strings

* remove todo comment

* update doc strings in dsat _utils.py

* searcher class doc strings

* More search method doc strings for non-public classes

* remove searcher state tests

* remove many todos in _dsat_search_method.py

* remove todos elsewhere

* fix: do not schedule the same trial twice (#6896)

* attempting to fix tests (#6899)

* fix: move examples dir one level up and finish docs/Makefile changes (#6898)

* fix: remove improper test_official.py tests (#6900)

* fix docs formatting (#6902)

* fix docs formatting

* add deepspeed autotune directory to example builds

* support hf examples (#6910)

* feat: ds config from include (#6905)

* move overwrite_deepspeed_config back to det.pytorch.deepspeed

* allow for the ds json to be --include'd

* self.hparams -> hparams bug

* doc string edits

* move ds_config.json back inside of no_op

* chore: update the custom searcher docs [MLG-447] (#6934)

* updating the docs for custom searcher wip

* wip, fixing up the docs, making sure things are clean and link properly

* chore: update the custom searcher docs [MLG-447]

* updates according to comments

* fix docs build

* changes according to comments in dsat branch (#6943)

* changes according to comments.

- removing no_op
- removing cache_dir in hf examples
- removing erroneous release-notes

* revert the changes to the environments so we are in sync with bumpenvs

* adjust the huggingface versions to the current minor version

* update version

* bug: fully wrap hf JSON loading around a FileLock (#6950)

* feat: deepspeed autotune user guide (#6929)

* starting dsat user guide

* import cleanup in hf examples

* more editing

* starting to list cli options

* formatting

* git mv hf examples to make more descriptive dirs

* remove TODO

* incorporating feedback

* links to examples

* cleanup

* Update cli help

* general cli options cleanup

* incorporate taylor's second round of feedback

* incorporate tara's comments

* fix: no auto in hf cli (#6963)

* add int check to hf cli arg overwrite

* fix accidental trivial test

* new test ensuring auto not used in hf cli args

* add checks against copying "auto" to HF CLI entrypoint

* add link to the dsat user guide (#6964)

* add link to the dsat user guide

* updated wording

* add defaults to dsat cli help menu (#6970)

* feat: more tests and asha cleanup (#6966)

* account for asha early stopping in test

* clean up lineage_completed_rung

* more full mock experiment tests

* Only skip completed trials added to queue

* base rungs off latest, rather than root

* max trial computation cleanup

* only include curr_rung <= rung_idx trial in best computation

* promotion respects rung idx test

* test cleanup

* more test cleanup

* add test_get_best_trial_in_lineage

* doc string cleanup

* fix up test_full_experiment_reverse_ordered_results

* minor wording edit

* always pop off highest curr_rung asha trial next

* fix the doc builds, add a release-note (#6973)

* fix the doc builds, add a release-note

* update docs names

* make flake8 behave

* update by the pre-commit complaints

* fix: readme cleanup (#6974)

* touch up hf trainer readme

* shorten up and simplify torchvision readme

* feat: update defaults and small tweaks (#6975)

* trials_per_random_config 5

* max trials 64

* min binary search trials 3

* fix text by avoiding trivial search range

* remove should_discard function to avoid possible locking

* increase timeouts for mock tests

* mypy fix

* update ceiling computation for new random trials

* base the ceiling off the max mbs computation, not the midpoint

* schedule longer lineages first in asha

* update docs to reflect new defaults

* fmt examples

* make isort behave

* address some comments about logging levels and comments

* remove erroneous TODO

* small spelling mistake

* move the `determined.integrations.huggingface.DetCallback` to `determined.transformers.DetCallback`

* fix: comment and environment cleanup (#6988)

* explain try/except in search runner

* remove while true and comments in _deepspeed_trial.py

* fix all deepspeed yaml files

* remove step id comment

* more todo cleanup

* make sure the docs build again

* fix the names of the e2e tests and in README

* don't run so many e2e_tests for deepspeed

* reduce hf image class ds slots per trial (#6998)

* fix e2e tests

* fix the convergence tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Maksim Kouznetsov <maksim.kouznetsov@hpe.com>
Co-authored-by: Garrett Goon <garrett.goon@hpe.com>
Co-authored-by: Garrett Goon <44747910+garrett361@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
Co-authored-by: Emily Bonar <emily.bonar@hpe.com>
wes-turner added a commit that referenced this pull request Jun 5, 2023
* Save model info; add Core API DeepSpeed example.

* extract ds profiler results

* Place activation mem into search metric.

* Obtain metrics from the model info file

* remove unused code

* adapted to passing dicts into report_completed

* cleanup and small changes

* refactoring and trial helper classes

* initial random search logic started

* minor changes

* minor cleanup

* use context manager, expanded base searcher

* remove ds_autotuning dir from examples

* minor edits

* readme updates and other cleanup

* bug fixes and a hack to avoid needing nested model dirs

* README updates

* Feat deepspeed autotune (#5875)

Added current Core API prototype.

* switched to triggering their autotuning in our trials

* remove checkpoint wrapper

* implementing checkpointing

* implementing checkpointing

* feat: allow includes in custom searcher experiment [MLG-338] (#6091)

* cleanups, bug fixing, and more examples

* adding native dsat tests

* cleanup

* minor edits

* fix missing is_chief

* Feat deepspeed autotune (#6159)

Trigger the native DS AT exit behavior for all trials.

* minor edits

* Feat deepspeed autotune git fixes (#6180)

deleted extraneous files

* readme fix and make cifar10 work off grenoble (#6187)

* Deepspeed Feature Branch - merge master (#6193)

* docs: Improvements to HPC launcher docs (#6042)

* Provide inline info about agent-specific scheduling options that do not apply to HPC Launcher
  configurations.
* Identify enroot-specific differences from docker (like for Singularity)
* Provide reference to custom resource pools as an option to deal with non-homogenous
  Slurm/PBS partitions.

* chore: Allow newer Node versions 17-19 (#6038)

* fix: k8s rm gives wrong slot count in rendezvous (#6044)

* chore: bump version: 0.19.12-dev0 -> 0.20.0-dev0 (#6048)

* chore: remove `applicableRoutespace` (#6040)

* chore: warning fixes in web (#6041)

* chore: fix warnings

* chore: change eslint rules

* chore: fix gpu nightly errors (#6046)

* chore: missing nodev18 (#6050)

* chore: add a dedicated exception for cli errors (#5649)

switch sys.exit calls in cli with a new user-facing exception.
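
A rough sketch of the pattern this commit describes, assuming an exception that carries a message, an optional underlying exception, and an exit code. The attribute names follow the review snippet on this page (`e_stack`, `message`, `exit_code`); the class body, `run_cli`, and `failing_command` are hypothetical and are not the project's actual code.

```python
import sys
from typing import Callable, Optional


class CliError(Exception):
    """User-facing CLI error (illustrative sketch only)."""

    def __init__(
        self,
        message: str,
        e_stack: Optional[Exception] = None,
        exit_code: int = 1,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.e_stack = e_stack
        self.exit_code = exit_code


def run_cli(command: Callable[[], None]) -> int:
    """Run a command and translate CliError into an exit code in one place."""
    try:
        command()
    except CliError as e:
        if e.e_stack is not None:
            # Show the underlying exception first, if one was attached.
            print(f"Error: {e.e_stack}", file=sys.stderr)
        print(e.message, file=sys.stderr)
        return e.exit_code
    return 0


def failing_command() -> None:
    # Hypothetical command: raise instead of calling sys.exit() directly.
    raise CliError("experiment not found", exit_code=2)
```

With this shape, command code raises `CliError` and only the top-level entry point decides how to report the failure and which exit code to use, for example `sys.exit(run_cli(failing_command))`.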

* fix: SSO layout (#6053)

* chore: clean up UI kit (#6039)

* fix: lopsided training with 2,1 gpus (#6054)

There was a guard to skip local zmq setup when local_size < 2, but that
became no longer valid when local_size varied from worker to worker.

The result is one extra global allgather in some cases, no big deal.

* docs: add rbac ntsc & mr release notes (#6049)

* chore: manual bump version (#6058)

* ci: retry downloading GKE auth plugin [DET-8956] (#6056)

We got a failure due to a timeout on this, so let it retry a few times.

* docs: update Singularity known issues. (#6047)

* fix: Add #rank to worker segment instead of timestamp segment of Pytorch Profiler files [MLG-326] (#6037)

* Add pytorch profiler specific handling logic for appending rank to file name

* Change to use f-string

* fix only file name being passed in

* remove print statement

* fix: handle agent shutdown msg (#6065)

* chore: manually bump vite version (#6066)

* feat: Display better x-axis ticks on charts with time axis [WEB-849] (#6051)

* remove xTickValues from props now that it can be calculated internally

* test: add logging to a flaky test (#6068)

Test is flaky but hard to pin down, so add some prints for next time.

* fix: Unrelated models are shown in a workspace model registry tab (#6067)

* feat: Added task-based historical allocation endpoint [DET-8537] (#6015)

* fix: show `not found` and `spinner` properly (#6070)

* fix: show `not found` and `spinner` properly

* chore: change home redirect path

* fix: projectDetail page

* fix: project.workspaceId

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto (#6061)

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto

Bumps [golang.org/x/text](https://github.com/golang/text) from 0.3.5 to 0.3.8.
- [Release notes](https://github.com/golang/text/releases)
- [Commits](golang/text@v0.3.5...v0.3.8)

---
updated-dependencies:
- dependency-name: golang.org/x/text
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: gpt-neox docker image and startup hook to work for non-privileged user (#6060)

* test: add locking around migration [DET-8957] (#6071)

In integration tests, multiple processes can attempt to run migrations
against the same database at once, which can lead to errors because
PostgreSQL's `CREATE TABLE IF NOT EXISTS` is not great with concurrency
(it allows for a time-of-check/time-of-use failure).

The specific errors we were seeing were conflicts in the pg_type table,
so the code now locks that table for the duration of the migration
transaction.

More information:
https://www.postgresql.org/message-id/CA+TgmoZAdYVtwBfp1FL2sMZbiHCWT4UPrzRLNnX1Nb30Ku3-gg@mail.gmail.com
https://stackoverflow.com/questions/29900845

* fix: logging inconsistent newlines in slurm (#6074)

* fix: checkpoint helper for points > 1000 and points > maxDatapoints (#6069)

* fix: replace migration table lock with advisory lock (#6077)

Taking a table lock sometimes runs into permissions issues; advisory
locking should avoid that.

Also, I realized the locking should probably be after the deferred
transaction close instead of before.

* build: check npm version on install (#6079)

This removes the `check-requirements` make target in the react folder
and replaces it with npm's native version check against the engines
property. This should make managing the node version slightly easier
because there's one less place to check.

* build: Apply webui lint fixes in precommit (#6078)

* allow linters to fix in precommit

This updates the web linters to automatically apply fixes when doing a
pre-commit check. This should ideally streamline the commit process to
reduce the number of times the user needs to run prettier and eslint
before committing.

* tweak stylelint and eslint commands

* stage changed files

* type file_paths argument

* ci: adjust target accuracy for a test (#6085)

We got one failure [1] where the accuracy ended up just a hair below
0.83, so drop the target.

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/33883/workflows/4a5d3257-6061-4f4d-bd66-096a580a5959/jobs/1194282/steps

* chore: UserBadge moves into design kit (#6086)

* chore: Remove chart feature flag, remove unused code [WEB-930] (#6064)

* tooltipsPlugin and TrialDetailsOverview alternates go into place
* checkpoint helper for points > 1000 and points > maxDatapoints
* move former LearningCurveChart into TrialsComparison
* sync up with #6069 changes

* fix: replace defaultvalue with initialValue (#6076)

* fix: Don't suggest moving model into its current workspace (#6088)

* ci: delete database at beginning of det deploy tests [DET-8937] (#6089)

Previously, the database was being retained between tests, sometimes
causing tests to fail when extra agents appeared due to agent
reattach. The tests should generally be independent anyway, so reset
the database (by default, with an option to disable) each time the
cluster or master comes up.

* feat: add Facepile component (#6081)

* fix: pre-commit web bug fix (#6090)

* ci: make GKE test jobs run serially (#6096)

We keep hitting GKE GPU quotas; this will probably help with that.

* fix: GPU counting for k8s cluster info page (#6094)

* fix: test-e2e-gke-parallel use t4 (#6093)

* ci: retry protoc download (#6095)

We got an incorrect file downloaded one time [1], so retry this
download, like in 2906257 (#5996).

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/34074/workflows/e48681f8-8b75-4349-82eb-06e922d8bfcb/jobs/1202610

* refactor: add Card to UI Kit [WEB-818] (#5893)

* docs: Launcher doc improvements (#6099)

- Generalize journalctl command example --since option to work on Ubuntu.
- Clarify user_name/group_name account requirements.

* feat: Attend to TODOs across the code base (#6087)

* perf: tweak metrics series query. (#6105)

* chore: race could cause run container to return a different error than expected [DET-8870] (#6092)

* chore: add more metadata to slurm logs (#6030)

* chore: remove `ExpCompareMetricNames`, `ExpCompareTrialsSample` endpoints. (#6106)

* docs: fix reported DataPoint label doc (#6107)

* fix: tolerate additional non-CPU, non-GPU quotas in k8s (#6109)

* fix: stop filtering of valid options to reflect build issues (#6116)

* fix: modal theme color (#6117)

* fix: add bgColor in trial comparison table (#6119)

* fix: browser console warnings (#6122)

* fix: browser console warnings

* fix: remove spread operator

* chore: UIKit Pivot renaming (#6120)

* fix: correct minor JSX syntax (#6126)

* docs: add myst_parser extension (#6127)

We would like to support markdown-format documentation.  There are still
some kinks to be worked out with converting rst to myst files, but this
is a start.

* docs: fix some broken redirects (#6129)

* feat: add 3rd batch of TODO removals (#6115)

* feat: generic proxy configs [DET-8761] (#5978)

* build: [MLG-336] Limit the version of protobuf (#6134)

build: [MLG-336] Limit the version of protobuf

Installing the requirement `tensorflow-macos=2.8.0` pulls protobuf as a downstream dependency. Version 4.21 of the Python protobuf package had a breaking change that makes it incompatible with tensorflow-2.8.0 (see tensorflow/tensorflow#56077). Later patches to Tensorflow limit the version of protobuf to 3.20. We've got a work item to update the tensorflow we include, but until then this change gives the ceiling on tensorflow's protobuf dependency that its later versions enforce.

* chore: update detectron2 example to use v0.6 and reenable nightly test [MLG-301] (#6103)

* Run model in EventStorage context

* Use new Docker images

* Remove pytest.skip from test_detectron2_coco_pytorch_const

* Update README.md

* Minor code reduction

* Dockerfile (listed in .detignore)

* Use determinedai repo instead of a personal repo

* Makefile for building and publishing the Docker image

* docs: Bring content changes from docusaurus-ls beta (#6121)

* docs: Bring content changes from docusaurus-ls beta

Bring over content changes from the beta including reorganization changes.

* additional organizational edits

updating index pages, adding a top nav to welcome page

* added redirects

* revisions based on feedback

* rstfmt run

* feat: display workspace icon in ProjectCard (#6125)

* fix: checkpoint GC should set resource pool (#6136) [DET-9018]

* docs: bump rstfmt version (#6138)

* chore: add dev cli option to get auth token (#6008)

add a `curl` option to help with curling various endpoints

* build(deps): bump golang.org/x/net from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0 in /proto (#6130)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0.
- [Release notes](https://github.com/golang/net/releases)
- [Commits](https://github.com/golang/net/commits/v0.7.0)

* perf: Improved performance of historical allocation task endpoint, removed training/validation times (#6135)

* fix: FOUNDENG-438 Podman tests from the gate are breaking znodes again (#6146)

* chore: add Toggle component to UI Kit [WEB-841] (#6144)

* chore: Add tags to UI kit [WEB-816] (#6100)

* chore: Move SelectFilter into kit folder and update it [WEB-843] (#6102)

* fix: replace `InlineEditor` with UIKit input (#6082)

* fix: replace `InlineEditor`

* fix: add modal for experiment name

* fix: layout of settings page

* fix: setting page

* fix: minor changes

* feat: move experiment `description` and `tags` into edit modal

* chore: add `N/A` when description is empty in experiment detail

* fix: value bug

* fix: revert tag; remove tag from edit modal due to design inconsistency

* chore:  add/test pt-only images and bumpenvs (#6097)

* add pt images to some unit tests

* add pt-images to circleci config

* run bumpenvs procedure

* fix test function signatures and linting

* fix warnings linting

* fix docs

* expand unit tests coverage

* build(deps): bump github.com/prometheus/client_golang from 1.10.0 to 1.11.1 in /master (#6004)

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.10.0 to 1.11.1.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.10.0...v1.11.1)

* feat: add det.import_from_path (#5737)

import_from_path allows users to import arbitrary code from a
checkpoint, even if the modules in the checkpoint have the same name as
modules they already have imported, but contain different code.

This is common when importing, for example, an old model_def.py that has
been updated since the original checkpoint was saved.
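
The general idea of loading same-named code without disturbing what is already imported can be sketched with the standard library alone. This is not the implementation of `det.import_from_path`, whose signature is not shown in this commit message; `load_module_from_dir` is a hypothetical helper, and the sketch assumes the module is a single `.py` file with no sibling imports.

```python
import importlib.util
import pathlib
import types


def load_module_from_dir(directory: str, module_name: str) -> types.ModuleType:
    """Load `<directory>/<module_name>.py` as an isolated module object.

    Hypothetical helper: the module is executed under a private name and
    is never registered in sys.modules, so an already-imported module
    with the same name is left untouched.
    """
    path = pathlib.Path(directory) / f"{module_name}.py"
    spec = importlib.util.spec_from_file_location(f"_isolated_{module_name}", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {module_name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

With this, an old `model_def.py` saved in a checkpoint directory and the current `model_def.py` can both be loaded in one process as separate module objects.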

* fix: post rank_id correctly for fluent-less logging (#6151) [DET-8999]

* ci: send Slack notification on GKE node pool creation failure (#6152)

In order to prevent quota failures from showing up as CI failures, this
makes node pool creation failure send a Slack notification and mark the
job as successful.

I couldn't figure out how to use the Slack orb while distinguishing this
particular situation from the general failure case, so I just slapped in
a direct request to the already-configured Slack webhook for sending
messages to #ci-bots.

`circleci-agent step halt` marks the job as successful, which is why we
want a notification at all. For some reason, CircleCI fails to provide
an equivalent for marking the job as canceled or some other state
besides success/failure; we could make a call to the CircleCI API to
cancel the current job, but that would rely on having a CircleCI token
available, which we're trying to get away from.

* chore: drop unused columns from `raw_steps`, `raw_validations`, and `raw_checkpoints`. (#6110)

* fix: render spinner while auth check pending (#6098)

* chore: update hpc-launching-architecture doc - add default slurm option --no-requeue (#6141)

* docs: Content updates (#6154)

formatted the setup cluster table to match the approved version in the docusaurus ls beta

* feat: display user id in `det user list`. (#6156)

* fix: Additional tables get experiment- / workspace-specific storagePath [WEB-962] (#6128)

* fix: Additional tables get experiment- and workspace-specific storagePath

* useMemo

* fix: selection width in `move experiment` modal (#6149)

* fix: selection width in `move experiment` modal

* fix: add form wrapping

* chore: remove change

* feat: show trial hyperparameters for custom searchers [MLG-343] (#6162)

* feat: show trial hyperparameters for custom searchers [MLG-343]

* fix: corrected timestamp handling to do an interval overlap instead of contains (#6164)
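
The commit title does not include the query it changed, so the following is only an illustration of the distinction it names. The function names are hypothetical, and the sketch assumes half-open intervals `[start, end)`.

```python
from datetime import datetime


def contains(outer_start: datetime, outer_end: datetime,
             inner_start: datetime, inner_end: datetime) -> bool:
    """True only if the inner interval lies entirely inside the outer one."""
    return outer_start <= inner_start and inner_end <= outer_end


def overlaps(a_start: datetime, a_end: datetime,
             b_start: datetime, b_end: datetime) -> bool:
    """True if the two half-open intervals share any instant."""
    return a_start < b_end and b_start < a_end
```

A record that starts before the requested window and ends inside it fails the `contains` check but passes `overlaps`, which is the kind of row a containment test would drop.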

* chore:  add release notes (#6167)

* chore: add release notes

* format with rstfmt

* chore: suppress help output for det dev (#6145)

avoid showing the `dev` option in `det -h` output

* chore: lock api state for backward compatibility check

* chore: bump version: 0.20.0-dev0 -> 0.20.1-dev0

* fix: separate Router and authCheck (#6170)

* fix: useMemo does not depend on trial having been loaded (#6173)

* chore: pass Labels/project/workspace to TaskSpec (#6172)

* refactor: replace user store with observables [WEB-799] (#6140)

* fix: hide Foldable menu options when button is visible (#6178)

This fixes an issue where, when using a `PageHeaderFoldable` component,
options that appear in the header always appear in the overflow menu.

* feat: add labels to GCP instances created with det deploy gcp [MLG-170] (#6147)

* feat: add labels to GCP instances created with det deploy gcp

* Changes to mimic det deploy aws --add-tags

* Add labels to other resources as well

* revert: reflag new chart experience (#6181)

* build: eliminate java dependency for typescript swagger bindings (#6139)

* fix: Close experiment fork/continue trial modal properly (#6174)

* fix: Continue Trial flow does not take the new `max_length` (#6168)

* fix: pass workspace ID when creating tensor board from WebUI [WEB-1019] (#6186)

* fix: don't print ':' when err msg is empty (#6190)
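
The commit title does not say which code path it touches, so this is only a sketch of the behavior it describes. `format_error` is a hypothetical helper, not a function from the repository.

```python
def format_error(name: str, message: str = "") -> str:
    """Join an error name and message, omitting the ':' when the message is empty."""
    return f"{name}: {message}" if message else name
```

For example, `format_error("CliError", "bad flag")` gives `CliError: bad flag`, while `format_error("CliError")` gives just `CliError` with no trailing colon.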

* fix: exp move modal (#6183)

* fix: exp move modal

* fix: minor fixes

* fix: add `archived` param and simplify query (#6175)

* fix: add `archived` param and simplify query

* chore: indent

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Maksim <maksim.kouznetsov@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>

* handle add user

* Revert "Deepspeed Feature Branch - merge master (#6193)"

This reverts commit 05ee6a2.

* Move dsat into harness (#6254)

move dsat into harness

* fixed too-large dataset bug (#6262)

* MLG-337 (#6282)

 ds_config.json centered workflow and cleanup

* Feat deepspeed autotune fix merge conflict (#6289)

Merging with master, fixing a merge conflict, and minor cleanup.

* reset webui to latest master (#6291)

* linting fixes (#6296)

* Refactor DS AT for Trial Compatibility (#6307)

Refactored the searcher to use a json based config workflow with overwrite_deepspeed_args, as in (most of) the official examples.

* fix: remove all Close operations (#6383)

Refactored to remove all Close operations to avoid race condition errors. Also added quick no op example and fixed other bugs.

* restore util.py

* restoring more files to latest master version

* merge in util.py changes

* Remove OOM catcher.

* Add Close operations back in

* Standardize autotuning config names.

* General clean up

* Fix dsat reporting bug.

* Minor changes.

* Clean up.

* Try/except hack around dead agent due to exit

* Minor changes.

* feat: allow users to specify zero optim search space and runner config (#6452)

* fix: do not merge user zero search config with defaults (#6464)

* populate the custom searcher logs with the correct event (#6470)

* fix: search state accounting (#6473)

* feat: add simple linear batch test searcher (#6482)

* fix: searcher refactor (#6513)

* deprecate the zero_search_config functionality (#6517)

* fix: timing fix and cleanup with tests (#6555)

* remove single should_stop and add more granular state

* adding the beginning of some autotuning tests

* real unit tests for DS AT

* Clean up the trial tracker properties, base searcher, and names

* Do not close twice

---------

Co-authored-by: Taylor Ritenour <taylor.ritenour@hpe.com>

* fix: minor cleanup (#6558)

* feat: deepspeed autotune trial based methods (#6575)

* fix: minor bug broke Trial class DS AT (#6613)

* feat: optional steps completed in context manager (#6619)

Make steps_completed in reporting context optional

Update examples to use updated context manager

minor example changes

Minor model changes

Delete old trial class examples

Add torchvision model trial example

* MLG-369: Some initial tests for DS AT (#6551)

* real unit tests for DS AT

* clean up and standardize the tests more

* also send a Close operation

* touch up the tests

* clarifications and clean up

* feat: deepspeed autotune cli args (#6643)

* Add __main__ and reuse original args on-cluster.

* Cleanup

* Adding include

* Finish adding include

* more args

* add search runner exp_id to follow on exp description

* minor comments

* Different registering system and starting the queue

* more args and cleanup

* About to use a queue (kind of)

* Attempting to enforce trial constraints

* Add early stopping as arg

* Remove old autotuning args and get them instead from cli

* Corrected args bugs

* Fixed many refactoring bugs

* Cleanup and bug fixes

* Move exp_conf edits to _run_dsat

* Add zero stages arg

* Move exp conf changes out of searcher and use zero_stages arg

* Fix simple searcher bug

* minor cleanup

* cleanup

* Starting refactor away from searcher state

* Changed searcher state computation to trial tracker

* Revert to single configs everywhere

* Update readme

* Edit added description text

* Add closed attr to Trials and bug fixes

* clarifying comment

* Rebase onto latest feature branch

* Add actual deque, fix bugs

* Remove print test

* small trial example fix

* Clean up try/except hack

* config edits

* feat: visualize cli args (#6651)

* Add CLI args to config hparams for visualization

* remove pickle path arg

* fix up the dsat tests (#6664)

* fix up the dsat tests

* make sure to pass through the args parsing function

* touch up the tests so that they can issue a failure to the experiment (#6676)

* chore: update deepspeed to 0.8.3 [MLG-399] (#6685)

* feat: hf trainer examples (#6687)

* chore: refactor dsat module to be independent of deepspeed imports [MLG-499] (#6694)

* chore: refactor dsat module to be independent of deepspeed imports [MLG-499]

* also update the test and det_callback.py

* chore: move over the examples for DSAT [MLG-500] (#6717)

* feat: add deepspeed autotuning examples [MLG-500]

* some clean up and UX improvements

* remove double parens, make sure that orchestrator id is on the far left

* feat: add follow on exp option (#6720)

* feat: migrate the huggingface det_callback [MLG-487] (#6724)

* migrating the det_callback

* feat: migrate the huggingface det_callback [MLG-487]

* don't export DetCallback through the top level integrations

* feat: minor test updates (#6746)

* feat: use lock with json overwrite for hf (#6742)

* feat: use lock with json overwrite for hf

* handle merge with our previous DetCallback refactors

---------

Co-authored-by: Garrett Goon <garrett.goon@hpe.com>

* feat: basic trial tracker tests (#6754)

* fix: hf overwrite bug

* fix: remove old import (#6761)

* feat: adding e2e tests for DSAT, enabling searcher to shutdown client experiment [MLG-369] (#6781)

* migrating the det_callback

* feat: migrate the huggingface det_callback [MLG-487]

* wip working on the e2e tests

* getting the basics of the tests running. Still appear to be issues though

* fixing up the tests

* fixing up tests

* handle cases where explicit exceptions are raised in the dsat search runner

* clean up for merging

* revert restarts change

* feat: add search method tests (#6785)

* add search method tests

* update hf ex readme

* quick fix for the unit tests (#6796)

* feat: write best ds config json to checkpoint (#6787)

* feat: refactor argparse into subparsers (#6801)

* feat: add binary search (#6806)

* fix: move lock to fix hf overwrites (#6828)

* fix: small fixes (#6825)

* expand message for model profile info failure

* correct the progress calculation

* Remove autotuning section from checkpointed best configs

* also write the best metrics to the checkpoint

* fix: proper placement of start/end profile step (#6834)

* feat: more random search test (#6837)

* chore: stabilize static typed python [MLG-498] (#6846)

* flake8 fixes

* mypy issues

* updates according to comments for mypy changes (#6864)

* feat: asha (#6852)

* merged in prev asha code

* starting asha tests

* more tests and cleanup

* test cleanup

* asha params closer to current nomenclature

* refactor asha args

* import fix

* mypy

* add asha to __all__

* add search data to stage 3 test

* chore: move hf examples (#6871)

* replace old hf integrations examples with new ones

* fix no-op bug in hf helper function

* fix helper function imports

* update readme

* use the python module for `searcher` directly (#6883)

* use the python module for `searcher` directly rather than importing the individual elements

* additional fixes

* feat: clean up dsat examples (#6891)

* moving files

* moving more files

* more file movement

* stage 1 in config

* core api script cleanup

* deepspeed.yaml core api cleanup

* align deepspeed.yaml files

* align ds_config.json files

* remove lr scheduler

* shorten length to 100

* model_def.py cleanup

* Added checkpointing and better metrics reporting

* cleaned up readme

* change example dir name

* add torchvision examples to test_official.py

* starting e2e tests

* add all e2e tests

* remove accidental double test

* chore: to do cleanup (#6895)

* cli docs

* cli doc strings

* remove todo comment

* update doc strings in dsat _utils.py

* searcher class doc strings

* More search method doc strings for non-public classes

* remove searcher state tests

* remove many todos in _dsat_search_method.py

* remove todos elsewhere

* fix: do not schedule the same trial twice (#6896)

* attempting to fix tests (#6899)

* fix: move examples dir one level up and finish docs/Makefile changes (#6898)

* fix: remove improper test_official.py tests (#6900)

* fix docs formatting (#6902)

* fix docs formatting

* add deepspeed autotune directory to example builds

* support hf examples (#6910)

* feat: ds config from include (#6905)

* move overwrite_deepspeed_config back to det.pytorch.deepspeed

* allow for the ds json to be --include'd

* self.hparams -> hparams bug

* doc string edits

* move ds_config.json back inside of no_op

* chore: update the custom searcher docs [MLG-447] (#6934)

* updating the docs for custom searcher wip

* wip, fixing up the docs, making sure things are clean and link properly

* chore: update the custom searcher docs [MLG-447]

* updates according to comments

* fix docs build

* changes according to comments in dsat branch (#6943)

* changes according to comments.

- removing no_op
- removing cache_dir in hf examples
- removing erroneous release-notes

* revert the changes to the environments so we are in sync with bumpenvs

* adjust the huggingface versions to the current minor version

* update version

* bug: fully wrap hf JSON loading around a FileLock (#6950)

* feat: deepspeed autotune user guide (#6929)

* starting dsat user guide

* import cleanup in hf examples

* more editing

* starting to list cli options

* formatting

* git mv hf examples to make more descriptive dirs

* remove TODO

* incorporating feedback

* links to examples

* cleanup

* Update cli help

* general cli options cleanup

* incorporate taylor's second round of feedback

* incorporate tara's comments

* fix: no auto in hf cli (#6963)

* add int check to hf cli arg overwrite

* fix accidental trivial test

* new test ensuring auto not used in hf cli args

* add checks against copying "auto" to HF CLI entrypoint

* add link to the dsat user guide (#6964)

* add link to the dsat user guide

* updated wording

* add defaults to dsat cli help menu (#6970)

* feat: more tests and asha cleanup (#6966)

* account for asha early stopping in test

* clean up lineage_completed_rung

* more full mock experiment tests

* Only skip completed trials added to queue

* base rungs off latest, rather than root

* max trial computation cleanup

* only include curr_rung <= rung_idx trial in best computation

* promotion respects rung idx test

* test cleanup

* more test cleanup

* add test_get_best_trial_in_lineage

* doc string cleanup

* fix up test_full_experiment_reverse_ordered_results

* minor wording edit

* always pop off highest curr_rung asha trial next

* fix the doc builds, add a release-note (#6973)

* fix the doc builds, add a release-note

* update docs names

* make flake8 behave

* update by the pre-commit complaints

* fix: readme cleanup (#6974)

* touch up hf trainer readme

* shorten up and simplify torchvision readme

* feat: update defaults and small tweaks (#6975)

* trials_per_random_config 5

* max trials 64

* min binary search trials 3

* fix text by avoiding trivial search range

* remove should_discard function to avoid possible locking

* increase timeouts for mock tests

* mypy fix

* update ceiling computation for new random trials

* base the ceiling off the max mbs computation, not the midpoint

* schedule longer lineages first in asha

* update docs to reflect new defaults

* fmt examples

* make isort behave

* address some comments about logging levels and comments

* remove erroneous TODO

* small spelling mistake

* move the `determined.integrations.huggingface.DetCallback` to `determined.transformers.DetCallback`

* fix: comment and environment cleanup (#6988)

* explain try/except in search runner

* remove while true and comments in _deepspeed_trial.py

* fix all deepspeed yaml files

* remove step id comment

* more todo cleanup

* make sure the docs build again

* fix the names of the e2e tests and in README

* don't run so many e2e_tests for deepspeed

* reduce hf image class ds slots per trial (#6998)

* fix e2e tests

* fix the convergence tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Maksim Kouznetsov <maksim.kouznetsov@hpe.com>
Co-authored-by: Garrett Goon <garrett.goon@hpe.com>
Co-authored-by: Garrett Goon <44747910+garrett361@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
Co-authored-by: Emily Bonar <emily.bonar@hpe.com>
@dannysauer dannysauer added this to the 0.20.1 milestone Feb 6, 2024