Deepspeed Feature Branch - merge master (#6193)
* docs: Improvements to HPC launcher docs (#6042)

* Provide inline info about agent-specific scheduling options that do not apply to HPC Launcher
  configurations.
* Identify enroot-specific differences from docker (like for Singularity)
* Provide reference to custom resource pools as an option to deal with non-homogenous
  Slurm/PBS partitions.

* chore: Allow newer Node versions 17-19 (#6038)

* fix: k8s rm gives wrong slot count in rendezvous (#6044)

* chore: bump version: 0.19.12-dev0 -> 0.20.0-dev0 (#6048)

* chore: remove `applicableRoutespace` (#6040)

* chore: warning fixes in web (#6041)

* chore: fix warnings

* chore: change eslint rules

* chore: fix gpu nightly errors (#6046)

* chore: missing nodev18 (#6050)

* chore: add a dedicated exception for cli errors (#5649)

Replace sys.exit calls in the CLI with a new user-facing exception.

* fix: SSO layout (#6053)

* chore: clean up UI kit (#6039)

* fix: lopsided training with 2,1 gpus (#6054)

There was a guard to skip local zmq setup when local_size < 2, but it
stopped being valid once local_size could vary from worker to worker.

The result is one extra global allgather in some cases, no big deal.

* docs: add rbac ntsc & mr release notes (#6049)

* chore: manual bump version (#6058)

* ci: retry downloading GKE auth plugin [DET-8956] (#6056)

We got a failure due to a timeout on this, so let it retry a few times.
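
The retry described here can be sketched in Python as a bounded loop with a fixed pause between attempts. The helper name `retry` and its parameters are hypothetical illustrations; the actual CI change is a shell loop in `.circleci/config.yml`, not Python.

```python
import time
from typing import Callable


def retry(
    action: Callable[[], bool],
    tries: int = 5,
    delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `action` until it returns True or `tries` attempts are used up.

    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(1, tries + 1):
        if action():
            return True
        # Pause only if another attempt is still coming.
        if attempt < tries:
            sleep(delay_s)
    return False


# Hypothetical stand-in for a download that fails twice, then succeeds.
calls = []


def flaky_download() -> bool:
    calls.append(1)
    return len(calls) >= 3


succeeded = retry(flaky_download, sleep=lambda _seconds: None)
```

Bounding the number of attempts keeps a persistent outage from stalling the job indefinitely, while the pause gives a transient timeout a chance to clear.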

* docs: update Singularity known issues. (#6047)

* fix: Add #rank to worker segment instead of timestamp segment of Pytorch Profiler files [MLG-326] (#6037)

* Add pytorch profiler specific handling logic for appending rank to file name

* Change to use f-string

* fix only file name being passed in

* remove print statement

* fix: handle agent shutdown msg (#6065)

* chore: manually bump vite version (#6066)

* feat: Display better x-axis ticks on charts with time axis [WEB-849] (#6051)

* remove xTickValues from props now that it can be calculated internally

* test: add logging to a flaky test (#6068)

Test is flaky but hard to pin down, so add some prints for next time.

* fix: Unrelated models are shown in a workspace model registry tab (#6067)

* feat: Added task-based historical allocation endpoint [DET-8537] (#6015)

* fix: show `not found` and `spinner` properly (#6070)

* fix: show `not found` and `spinner` properly

* chore: change home redirect path

* fix: projectDetail page

* fix: project.workspaceId

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto (#6061)

* build(deps): bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto

Bumps [golang.org/x/text](https://github.com/golang/text) from 0.3.5 to 0.3.8.
- [Release notes](https://github.com/golang/text/releases)
- [Commits](golang/text@v0.3.5...v0.3.8)

---
updated-dependencies:
- dependency-name: golang.org/x/text
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: gpt-neox docker image and startup hook to work for non-privileged user (#6060)

* test: add locking around migration [DET-8957] (#6071)

In integration tests, multiple processes can attempt to run migrations
against the same database at once, which can lead to errors because
PostgreSQL's `CREATE TABLE IF NOT EXISTS` is not great with concurrency
(it allows for a time-of-check/time-of-use failure).

The specific errors we were seeing were conflicts in the pg_type table,
so the code now locks that table for the duration of the migration
transaction.

More information:
https://www.postgresql.org/message-id/CA+TgmoZAdYVtwBfp1FL2sMZbiHCWT4UPrzRLNnX1Nb30Ku3-gg@mail.gmail.com
https://stackoverflow.com/questions/29900845

* fix: logging inconsistent newlines in slurm (#6074)

* fix: checkpoint helper for points > 1000 and points > maxDatapoints (#6069)

* fix: replace migration table lock with advisory lock (#6077)

Taking a table lock sometimes runs into permissions issues; advisory
locking should avoid that.

Also, I realized the locking should probably be after the deferred
transaction close instead of before.
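
As a rough illustration of the advisory-lock approach, the sketch below derives a signed 64-bit key from a lock name and builds the statements a migration transaction might run. The lock name, the key derivation, and both helper names are assumptions made for this example; the commit does not show the actual Go code. `pg_advisory_xact_lock` is the PostgreSQL function that takes a transaction-scoped advisory lock and releases it automatically at commit or rollback.

```python
import hashlib
from typing import List


def advisory_lock_key(name: str) -> int:
    """Map a lock name to a signed 64-bit integer (PostgreSQL bigint)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def migration_statements(lock_name: str, migration_sql: str) -> List[str]:
    """Return the statements for one migration guarded by an advisory lock.

    Every process that uses the same lock name blocks on the SELECT until
    the current holder's transaction ends, so migrations run one at a time.
    """
    key = advisory_lock_key(lock_name)
    return [
        "BEGIN",
        f"SELECT pg_advisory_xact_lock({key})",
        migration_sql,
        "COMMIT",
    ]


# "determined-migrations" is a hypothetical lock name.
statements = migration_statements(
    "determined-migrations",
    "CREATE TABLE IF NOT EXISTS example (id int)",
)
```

Unlike a table lock, an advisory lock is not tied to any table, which fits the commit's stated goal of avoiding table-level permission problems.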

* build: check npm version on install (#6079)

This removes the `check-requirements` make target in the react folder
and replaces it with npm's native version check against the engines
property. This should make managing the node version slightly easier
because there's one less place to check.

* build: Apply webui lint fixes in precommit (#6078)

* allow linters to fix in precommit

This updates the web linters to automatically apply fixes when doing a
pre-commit check. This should ideally streamline the commit process to
reduce the number of times the user needs to run prettier and eslint
before committing.

* tweak stylelint and eslint commands

* stage changed files

* type file_paths argument

* ci: adjust target accuracy for a test (#6085)

We got one failure [1] where the accuracy ended up just a hair below
0.83, so drop the target.

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/33883/workflows/4a5d3257-6061-4f4d-bd66-096a580a5959/jobs/1194282/steps

* chore: UserBadge moves into design kit (#6086)

* chore: Remove chart feature flag, remove unused code [WEB-930] (#6064)

* tooltipsPlugin and TrialDetailsOverview alternates go into place
* checkpoint helper for points > 1000 and points > maxDatapoints
* move former LearningCurveChart into TrialsComparison
* sync up with #6069 changes

* fix: replace defaultvalue with initialValue (#6076)

* fix: Don't suggest moving model into its current workspace (#6088)

* ci: delete database at beginning of det deploy tests [DET-8937] (#6089)

Previously, the database was being retained between tests, sometimes
causing tests to fail when extra agents appeared due to agent
reattach. The tests should generally be independent anyway, so reset
the database (by default, with an option to disable) each time the
cluster or master comes up.

* feat: add Facepile component (#6081)

* fix: pre-commit web bug fix (#6090)

* ci: make GKE test jobs run serially (#6096)

We keep hitting GKE GPU quotas; this will probably help with that.

* fix: GPU counting for k8s cluster info page (#6094)

* fix: test-e2e-gke-parallel use t4 (#6093)

* ci: retry protoc download (#6095)

We got an incorrect file downloaded one time [1], so retry this
download, like in 2906257 (#5996).

[1] https://app.circleci.com/pipelines/github/determined-ai/determined/34074/workflows/e48681f8-8b75-4349-82eb-06e922d8bfcb/jobs/1202610

* refactor: add Card to UI Kit [WEB-818] (#5893)

* docs: Launcher doc improvements (#6099)

- Generalize journalctl command example --since option to work on Ubuntu.
- Clarify user_name/group_name account requirements.

* feat: Attend to TODOS across the code base (#6087)

* perf: tweak metrics series query. (#6105)

* chore: race could cause run container to return a different error than expected [DET-8870] (#6092)

* chore: add more metadata to slurm logs (#6030)

* chore: remove `ExpCompareMetricNames`, `ExpCompareTrialsSample` endpoints. (#6106)

* docs: fix reported DataPoint label doc (#6107)

* fix: tolerate additional non-CPU, non-GPU quotas in k8s (#6109)

* fix: stop filtering of valid options to reflect build issues (#6116)

* fix: modal theme color (#6117)

* fix: add bgColor in trial comparison table (#6119)

* fix: browser console warnings (#6122)

* fix: browser console warnings

* fix: remove spread operator

* chore: UIKit Pivot renaming (#6120)

* fix: correct minor JSX syntax (#6126)

* docs: add myst_parser extension (#6127)

We would like to support markdown-format documentation.  There are still
some kinks to be worked out with converting rst to myst files, but this
is a start.

* docs: fix some broken redirects (#6129)

* feat: add 3rd batch of TODO removals (#6115)

* feat: generic proxy configs [DET-8761] (#5978)

* build: [MLG-336] Limit the version of protobuf (#6134)

build: [MLG-336] Limit the version of protobuf

Installing the requirement `tensorflow-macos=2.8.0` pulls protobuf as a downstream dependency. Version 4.21 of the Python protobuf package had a breaking change that makes it incompatible with tensorflow-2.8.0 (see tensorflow/tensorflow#56077). Later patches to Tensorflow limit the version of protobuf to 3.20. We've got a work item to update the tensorflow we include, but until then this change gives the ceiling on tensorflow's protobuf dependency that its later versions enforce.

* chore: update detectron2 example to use v0.6 and reenable nightly test [MLG-301] (#6103)

* Run model in EventStorage context

* Use new Docker images

* Remove pytest.skip from test_detectron2_coco_pytorch_const

* Update README.md

* Minor code reduction

* Dockerfile (listed in .detignore)

* Use determinedai repo instead of a personal repo

* Makefile for building and publishing the Docker image

* docs: Bring content changes from docusaurus-ls beta (#6121)

* docs: Bring content changes from docusaurus-ls beta

Bring over content changes from the beta including reorganization changes.

* additional organizational edits

updating index pages, adding a top nav to welcome page

* added redirects

* revisions based on feedback

* rstfmt run

* feat: display workspace icon in ProjectCard (#6125)

* fix: checkpoint GC should set resource pool (#6136) [DET-9018]

* docs: bump rstfmt version (#6138)

* chore: add dev cli option to get auth token (#6008)

add a `curl` option to help with curling various endpoints

* build(deps): bump golang.org/x/net from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0 in /proto (#6130)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0.
- [Release notes](https://github.com/golang/net/releases)
- [Commits](https://github.com/golang/net/commits/v0.7.0)

* perf: Improved performance of historical allocation task endpoint, removed training/validation times (#6135)

* fix: FOUNDENG-438 Podman tests from the gate are breaking znodes again (#6146)

* chore: add Toggle component to UI Kit [WEB-841] (#6144)

* chore: Add tags to UI kit [WEB-816] (#6100)

* chore: Move SelectFilter into kit folder and update it [WEB-843] (#6102)

* fix: replace `InlineEditor` with UIKit input (#6082)

* fix: replace `InlineEditor`

* fix: add modal for experiment name

* fix: layout of settings page

* fix: setting page

* fix: minor changes

* feat: move experiment `description` and `tags` into edit modal

* chore: add `N/A` when description is empty in experiment detail

* fix: value bug

* fix: revert tag; remove tag from edit modal due to design inconsistency

* chore:  add/test pt-only images and bumpenvs (#6097)

* add pt images to some unit tests

* add pt-images to circleci config

* run bumpenvs procedure

* fix test function signatures and linting

* fix warnings linting

* fix docs

* expand unit tests coverage

* build(deps): bump github.com/prometheus/client_golang from 1.10.0 to 1.11.1 in /master (#6004)

Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.10.0 to 1.11.1.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.10.0...v1.11.1)

* feat: add det.import_from_path (#5737)

import_from_path allows users to import arbitrary code from a
checkpoint, even if the modules in the checkpoint have the same name as
modules they already have imported, but contain different code.

This is common when importing, for example, an old model_def.py that has
been updated since the original checkpoint was saved.
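
The name-collision problem described above can be illustrated with the standard library alone. The sketch below is not the implementation of `det.import_from_path`, whose signature is not shown in this commit; `load_module_from_file` is a hypothetical helper that handles only a single file, not a package with relative imports.

```python
import importlib.util
import tempfile
import uuid
from pathlib import Path
from types import ModuleType


def load_module_from_file(path: Path) -> ModuleType:
    """Load the file at `path` under a unique module name.

    The module is never registered in sys.modules under its file name, so
    it cannot shadow, or be shadowed by, another module with that name.
    """
    unique_name = f"_isolated_{path.stem}_{uuid.uuid4().hex}"
    spec = importlib.util.spec_from_file_location(unique_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Two files with the same name but different contents, standing in for an
# old checkpointed model_def.py and its updated version.
with tempfile.TemporaryDirectory() as old_dir, tempfile.TemporaryDirectory() as new_dir:
    old_path = Path(old_dir) / "model_def.py"
    new_path = Path(new_dir) / "model_def.py"
    old_path.write_text("VERSION = 1\n")
    new_path.write_text("VERSION = 2\n")
    old_module = load_module_from_file(old_path)
    new_module = load_module_from_file(new_path)
```

Both modules stay usable side by side because each was loaded under its own generated name.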

* fix: post rank_id correctly for fluent-less logging (#6151) [DET-8999]

* ci: send Slack notification on GKE node pool creation failure (#6152)

In order to prevent quota failures from showing up as CI failures, this
makes node pool creation failure send a Slack notification and mark the
job as successful.

I couldn't figure out how to use the Slack orb while distinguishing this
particular situation from the general failure case, so I just slapped in
a direct request to the already-configured Slack webhook for sending
messages to #ci-bots.

`circleci-agent step halt` marks the job as successful, which is why we
want a notification at all. For some reason, CircleCI fails to provide
an equivalent for marking the job as canceled or some other state
besides success/failure; we could make a call to the CircleCI API to
cancel the current job, but that would rely on having a CircleCI token
available, which we're trying to get away from.
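
For readers more comfortable with Python than shell, the direct webhook call can be sketched as below. The message text mirrors the `curl` payload in the `.circleci/config.yml` change in this commit; the function name and the example URLs are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, build_url: str) -> urllib.request.Request:
    """Build a POST request carrying a Slack incoming-webhook JSON payload."""
    payload = {"text": f"GKE node pool creation failed! {build_url}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URLs; in CI these would come from environment variables.
request = build_slack_request(
    "https://hooks.example.invalid/services/placeholder",
    "https://ci.example.invalid/build/1",
)
# Sending would be: urllib.request.urlopen(request)
```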

* chore: drop unused columns from `raw_steps`, `raw_validations`, and `raw_checkpoints`. (#6110)

* fix: render spinner while auth check pending (#6098)

* chore: update hpc-launching-architecture doc - add default slurm option --no-requeue (#6141)

* docs: Content updates (#6154)

formatted the setup cluster table to match the approved version in the docusaurus ls beta

* feat: display user id in `det user list`. (#6156)

* fix: Additional tables get experiment- / workspace-specific storagePath [WEB-962] (#6128)

* fix: Additional tables get experiment- and workspace-specific storagePath

* useMemo

* fix: selection width in `move experiment` modal (#6149)

* fix: selection width in `move experiment` modal

* fix: add form wrapping

* chore: remove change

* feat: show trial hyperparameters for custom searchers [MLG-343] (#6162)

* feat: show trial hyperparameters for custom searchers [MLG-343]

* fix: corrected timestamp handling to do an interval overlap instead of contains (#6164)
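
The difference between the two checks is easy to see in a small sketch. The function names, the closed-interval boundary convention, and the example times are assumptions for illustration; the commit title does not show the actual query.

```python
from datetime import datetime


def contains(q_start: datetime, q_end: datetime, start: datetime, end: datetime) -> bool:
    """True only if [start, end] lies entirely inside the query window."""
    return q_start <= start and end <= q_end


def overlaps(q_start: datetime, q_end: datetime, start: datetime, end: datetime) -> bool:
    """True if [start, end] shares at least one instant with the query window."""
    return start <= q_end and q_start <= end


# An interval that begins before the query window and ends inside it.
window = (datetime(2023, 3, 1, 10, 0), datetime(2023, 3, 1, 12, 0))
interval = (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 11, 0))

inside = contains(*window, *interval)
touching = overlaps(*window, *interval)
```

A containment check silently drops intervals that straddle a window boundary, while an overlap check keeps them.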

* chore:  add release notes (#6167)

* chore: add release notes

* format with rstfmt

* chore: suppress help output for det dev (#6145)

avoid showing the `dev` option in `det -h` output

* chore: lock api state for backward compatibility check

* chore: bump version: 0.20.0-dev0 -> 0.20.1-dev0

* fix: separate Router and authCheck (#6170)

* fix: useMemo does not depend on trial having been loaded (#6173)

* chore: pass Labels/project/workspace to TaskSpec (#6172)

* refactor: replace user store with observables [WEB-799] (#6140)

* fix: hide Foldable menu options when button is visible (#6178)

This fixes an issue where, when using a `PageHeaderFoldable` component,
options that appear in the header always appear in the overflow menu.

* feat: add labels to GCP instances created with det deploy gcp [MLG-170] (#6147)

* feat: add labels to GCP instances created with det deploy gcp

* Changes to mimic det deploy aws --add-tags

* Add labels to other resources as well

* revert: reflag new chart experience (#6181)

* build: eliminate java dependency for typescript swagger bindings (#6139)

* fix: Close experiment fork/continue trial modal properly (#6174)

* fix: Continue Trial flow does not take the new `max_length` (#6168)

* fix: pass workspace ID when creating tensor board from WebUI [WEB-1019] (#6186)

* fix: don't print ':' when err msg is empty (#6190)

* fix: exp move modal (#6183)

* fix: exp move modal

* fix: minor fixes

* fix: add `archived` param and simplify query (#6175)

* fix: add `archived` param and simplify query

* chore: indent

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Nick Doiron <nick.doiron@hpe.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Keita Nonaka <keita.nonaka@hpe.com>
Co-authored-by: Erik Wilson <erik.wilson@hpe.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: johnkim-det <97752292+johnkim-det@users.noreply.github.com>
Co-authored-by: Ryan <rb@hpe.com>
Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: szewaiyuen6 <sze-wai.yuen@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Corban Beaird <corban.beaird@hpe.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: liamcli <liam@determined.ai>
Co-authored-by: Ashton G <ashton.galloway@hpe.com>
Co-authored-by: thiagodallacqua-hpe <104855841+thiagodallacqua-hpe@users.noreply.github.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Maksim <maksim.kouznetsov@hpe.com>
Co-authored-by: Emily <15078396+EmilyBonar@users.noreply.github.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: Caleb Hoyoul Kang <caleb.kang@hpe.com>
Co-authored-by: Wes Turner <wesley.turner@hpe.com>
Co-authored-by: Daniel R. Hunter <103537968+drh-determined-ai@users.noreply.github.com>
Co-authored-by: Tara Charter <tara.charter@hpe.com>
Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Guangqing Tang <40620519+gt2345@users.noreply.github.com>
Co-authored-by: MikhailKardash <mikhail.kardash@hpe.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: gt2345 <gt2345@columbia.edu>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
Showing 478 changed files with 14,774 additions and 14,121 deletions.
4 changes: 3 additions & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.19.12-dev0
current_version = 0.20.1-dev0
commit = true
tag = true
tag_name = {new_version}
@@ -28,6 +28,8 @@ values =

[bumpversion:file:webui/react/craco.config.js]

[bumpversion:file:webui/react/vite.config.ts]

[bumpversion:file:helm/charts/determined/Chart.yaml]

[bumpversion:glob:model_hub/examples/huggingface/*/*.yaml]
79 changes: 55 additions & 24 deletions .circleci/config.yml
@@ -24,7 +24,7 @@ executors:
parameters:
det-version:
type: string
default: 0.19.12-dev0
default: 0.20.1-dev0
docker-image:
type: string
default: determinedai/cimg-base:latest
@@ -182,11 +182,11 @@ commands:
- when:
condition: <<parameters.tf1>>
steps:
- run: docker pull determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-ad0591c
- run: docker pull determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-7aa5364
- when:
condition: <<parameters.tf2>>
steps:
- run: docker pull determinedai/environments:py-3.8-pytorch-1.12-tf-2.8-cpu-ad0591c
- run: docker pull determinedai/environments:py-3.8-pytorch-1.12-tf-2.8-cpu-7aa5364

login-docker:
parameters:
@@ -214,7 +214,7 @@ commands:

install-protoc:
steps:
- run: curl -o /tmp/protoc.zip -L https://github.com/protocolbuffers/protobuf/releases/download/v3.17.1/protoc-3.17.1-linux-x86_64.zip
- run: curl --retry-connrefused --retry 10 -o /tmp/protoc.zip -L https://github.com/protocolbuffers/protobuf/releases/download/v3.17.1/protoc-3.17.1-linux-x86_64.zip
- run: unzip -o /tmp/protoc.zip -d $HOME/.local

setup-go-intg-deps:
@@ -531,7 +531,7 @@ commands:
default: "m5.large"
compute-agent-instance-type:
type: string
default: "p2.xlarge"
default: "g4dn.xlarge"
max-dynamic-agents:
type: integer
default: 1
@@ -629,7 +629,7 @@ commands:
type: string
compute-agent-instance-type:
type: string
default: "p2.xlarge"
default: "g4dn.xlarge"
aux-agent-instance-type:
type: string
default: "m5.large"
@@ -744,7 +744,14 @@ commands:
google-project-id: <<parameters.google-project-id>>
- run:
command: |
gcloud components install gke-gcloud-auth-plugin --quiet
tries=5
until gcloud components install gke-gcloud-auth-plugin --quiet; do
if [[ $((--tries)) -eq 0 ]]; then
exit 1
fi
sleep 15
done
echo "export USE_GKE_GCLOUD_AUTH_PLUGIN=True" >> $BASH_ENV
name: Install GKE auth plugin
- run:
@@ -794,7 +801,21 @@ commands:
condition: <<parameters.gpus-per-machine>>
steps:
- run:
command: gcloud container node-pools create accel --cluster ${CLUSTER_ID} --region <<parameters.region>> --num-nodes <<parameters.num-machines>> --accelerator type=<<parameters.gpu-type>>,count=<<parameters.gpus-per-machine>> --machine-type=<<parameters.machine-type>> --scopes cloud-platform --node-taints=<<parameters.accel-node-taints>>
command: |
gcloud container node-pools create accel \
--cluster ${CLUSTER_ID} \
--region '<<parameters.region>>' \
--num-nodes '<<parameters.num-machines>>' \
--accelerator 'type=<<parameters.gpu-type>>,count=<<parameters.gpus-per-machine>>' \
--machine-type='<<parameters.machine-type>>' \
--scopes cloud-platform \
--node-taints='<<parameters.accel-node-taints>>' \
|| (
curl "$SLACK_WEBHOOK" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"GKE node pool creation failed! $CIRCLE_BUILD_URL\"}"
circleci-agent step halt
)
name: Create GPU node pool
- unless:
condition: <<parameters.gpus-per-machine>>
@@ -1210,14 +1231,18 @@ jobs:

check-ts-bindings:
docker:
- image: cimg/openjdk:14.0.2
- image: <<pipeline.parameters.docker-image>>
steps:
- checkout
- skip-if-docs-only
- setup-python-venv:
install-python: false
extra-requirements-file: "bindings/requirements.txt"
executor: <<pipeline.parameters.docker-image>>
- attach_workspace:
at: .
- run: make -C bindings get-deps
- run: make -C bindings check/typescript-fetch
- run: make -C bindings force-gen
- run: make -C bindings check/typescript

check-py-bindings:
docker:
@@ -1228,6 +1253,7 @@ jobs:
- skip-if-webui-only
- setup-python-venv:
install-python: false
extra-requirements-file: "bindings/requirements.txt"
executor: <<pipeline.parameters.docker-image>>
- attach_workspace:
at: .
@@ -1775,7 +1801,7 @@ jobs:
parameters:
compute-agent-instance-type:
type: string
default: "p2.xlarge"
default: "g4dn.xlarge"
aux-agent-instance-type:
type: string
default: "m5.large"
@@ -1826,7 +1852,7 @@ jobs:
type: string
compute-agent-instance-type:
type: string
default: p2.xlarge
default: g4dn.xlarge
aux-agent-instance-type:
type: string
default: m5.large
@@ -1943,7 +1969,7 @@ jobs:
type: string
default: "1"
environment-image:
default: determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.8-gpu-0.19.12
default: determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.8-gpu-0.20.1
type: string
accel-node-taints:
type: string
@@ -1963,6 +1989,11 @@ jobs:
determined: true
extra-requirements-file: "e2e_tests/tests/requirements.txt"
executor: <<pipeline.parameters.docker-image>>
- queue/until_front_of_line:
only-on-branch: master
# Basically wait forever -- we would prefer not to fail tests, and
# we'll likely never be this backed up.
time: "10000"
- setup-gke-cluster:
cluster-id: <<parameters.cluster-id-prefix>>-$(git rev-parse --short HEAD)-${CIRCLE_BUILD_NUM}-${CIRCLE_NODE_INDEX}
labels: test-mark=<<parameters.mark>>
@@ -2076,7 +2107,7 @@ jobs:
type: string
compute-agent-instance-type:
type: string
default: p2.xlarge
default: g4dn.xlarge
aux-agent-instance-type:
type: string
default: m5.large
@@ -2363,7 +2394,7 @@ workflows:
matrix:
parameters:
parallelism: [2]
compute-agent-instance-type: ["p2.8xlarge"]
compute-agent-instance-type: ["g4dn.metal"]
aux-agent-instance-type: ["m5.large"]
cluster-id-prefix: ["parallel"]
mark: ["parallel"]
@@ -2415,6 +2446,7 @@ workflows:
parallelism: [1]
slack-mentions: ["${SLACK_USER_ID}"]
machine-type: ["n1-standard-32"]
gpu-type: ["nvidia-tesla-t4"]
gpus-per-machine: [4]
num-machines: [2]

@@ -2615,6 +2647,7 @@ workflows:
parallelism: [1]
slack-mentions: ["${SLACK_USER_ID}"]
machine-type: ["n1-standard-32"]
gpu-type: ["nvidia-tesla-t4"]
gpus-per-machine: [4]
num-machines: [2]

@@ -2735,7 +2768,6 @@ workflows:
parallelism: [2]
cluster-id-prefix: ["nightly"]
mark: ["nightly"]
compute-agent-instance-type: ["g4dn.4xlarge"]

# GPU tests
- request-gpu-tests:
@@ -2808,7 +2840,7 @@ workflows:
- package-and-push-system-dev
matrix:
parameters:
compute-agent-instance-type: ["p2.8xlarge"]
compute-agent-instance-type: ["g4dn.metal"]
aux-agent-instance-type: ["m5.large"]
cluster-id-prefix: ["transformers"]
mark: ["model_hub_transformers"]
@@ -2824,7 +2856,7 @@ workflows:
- package-and-push-system-dev
matrix:
parameters:
compute-agent-instance-type: ["p2.8xlarge"]
compute-agent-instance-type: ["g4dn.metal"]
aux-agent-instance-type: ["m5.large"]
cluster-id-prefix: ["transformers-amp"]
mark: ["model_hub_transformers_amp"]
@@ -2875,15 +2907,14 @@ workflows:
parallelism: [2]
cluster-id-prefix: ["nightly"]
mark: ["nightly"]
compute-agent-instance-type: ["g4dn.4xlarge"]
- test-e2e-aws:
name: test-e2e-gpu-distributed
context: aws
matrix:
parameters:
cluster-id-prefix: ["distributed"]
mark: ["distributed"]
compute-agent-instance-type: ["p2.8xlarge"]
compute-agent-instance-type: ["g4dn.metal"]
aux-agent-instance-type: ["m5.large"]
max-dynamic-agents: [2]
- test-e2e-aws:
@@ -2893,7 +2924,7 @@ workflows:
parameters:
cluster-id-prefix: ["transformers"]
mark: ["model_hub_transformers"]
compute-agent-instance-type: ["p2.8xlarge"]
compute-agent-instance-type: ["g4dn.metal"]
aux-agent-instance-type: ["m5.large"]
max-dynamic-agents: [2]
- test-e2e-aws:
@@ -2903,7 +2934,7 @@ workflows:
parameters:
cluster-id-prefix: ["transformers-amp"]
mark: ["model_hub_transformers_amp"]
compute-agent-instance-type: ["p2.8xlarge"]
compute-agent-instance-type: ["g4dn.metal"]
aux-agent-instance-type: ["m5.large"]
max-dynamic-agents: [2]
- test-e2e-aws:
@@ -2913,7 +2944,7 @@ workflows:
parameters:
cluster-id-prefix: ["mmdetection"]
mark: ["model_hub_mmdetection"]
compute-agent-instance-type: ["p2.8xlarge"]
compute-agent-instance-type: ["g4dn.metal"]
aux-agent-instance-type: ["m5.large"]
max-dynamic-agents: [2]
# DeepSpeed tests do not work on K80s, so we need V100 instances.
4 changes: 3 additions & 1 deletion .gitattributes
@@ -4,6 +4,8 @@
# ignore generated code for cli diffs and merges. treat as binary.
# ignore generated code for GitHub UIs.
proto/pkg/**/* -diff -merge linguist-generated=true
master/pkg/schemas/**/* -diff -merge linguist-generated=true
master/pkg/schemas/expconf/zgen_* -diff -merge linguist-generated=true
webui/react/src/services/api-ts-sdk/**/* -diff -merge linguist-generated=true
harness/determined/common/api/bindings.py -diff -merge linguist-generated=true
docs/swagger-ui/swagger-ui*js* -diff -merge
docs/swagger-ui/swagger-ui-main* diff merge
9 changes: 4 additions & 5 deletions CONTRIBUTING.md
@@ -51,12 +51,11 @@ git clone --recurse-submodules https://github.com/determined-ai/determined.git
- python3-venv
- python3-wheel
- python3-dev
- Node (>= 16.13, < 17)
- Node (>= 16.13, < 20)
- NPM (>= 8)
- Docker (>= 19.03)
- Helm (>= 3.0.0)
- protobuf-compiler (>= 3.15)
- Java (>= 7)
- cURL (>= 7)
- jq (>= 1.6)
- socat (>= 1.7)
@@ -96,13 +95,13 @@ eval "$($HOME/.linuxbrew/bin/brew shellenv)"
Install the Determined prerequisites:

```sh
brew install go@1.18 python@3.7 node@16 openjdk@11 protobuf docker helm curl jq socat
brew install go@1.18 python@3.7 node@16 protobuf docker helm curl jq socat
```

Add Python, Node, and openJDK to your PATH:
Add Python and Node to your PATH:

```sh
echo 'export PATH="$HOME/.linuxbrew/opt/python@3.7/bin:$HOME/.linuxbrew/opt/node@16/bin:$HOME/.linuxbrew/opt/openjdk@11/bin:$PATH"' >> $HOME/.profile
echo 'export PATH="$HOME/.linuxbrew/opt/python@3.7/bin:$HOME/.linuxbrew/opt/node@16/bin:$PATH"' >> $HOME/.profile
source $HOME/.profile
```

2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.19.12-dev0
0.20.1-dev0
12 changes: 6 additions & 6 deletions agent/go.mod
@@ -18,7 +18,7 @@ require (
github.com/sirupsen/logrus v1.8.1
github.com/spf13/cobra v1.6.1
github.com/spf13/pflag v1.0.5
golang.org/x/sys v0.2.0
golang.org/x/sys v0.5.0
gotest.tools v2.2.0+incompatible
)

@@ -106,9 +106,9 @@ require (
github.com/opencontainers/image-spec v1.0.2 // indirect
github.com/opentracing/opentracing-go v1.2.0 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/prometheus/client_golang v1.10.0 // indirect
github.com/prometheus/client_golang v1.11.1 // indirect
github.com/prometheus/client_model v0.2.0 // indirect
github.com/prometheus/common v0.25.0 // indirect
github.com/prometheus/common v0.26.0 // indirect
github.com/prometheus/procfs v0.6.0 // indirect
github.com/rogpeppe/go-internal v1.9.0 // indirect
github.com/santhosh-tekuri/jsonschema/v2 v2.2.0 // indirect
@@ -141,11 +141,11 @@ require (
go.opentelemetry.io/proto/otlp v0.12.1 // indirect
go.uber.org/atomic v1.10.0 // indirect
golang.org/x/crypto v0.0.0-20220829220503-c86fa9a7ed90 // indirect
golang.org/x/net v0.0.0-20211209124913-491a49abca63 // indirect
golang.org/x/net v0.7.0 // indirect
golang.org/x/oauth2 v0.0.0-20211104180415-d3ed0bb246c8 // indirect
golang.org/x/sync v0.1.0 // indirect
golang.org/x/term v0.0.0-20210615171337-6886f2dfbf5b // indirect
golang.org/x/text v0.3.7 // indirect
golang.org/x/term v0.5.0 // indirect
golang.org/x/text v0.7.0 // indirect
golang.org/x/time v0.0.0-20210723032227-1f47c861a9ac // indirect
google.golang.org/api v0.56.0 // indirect
google.golang.org/appengine v1.6.7 // indirect

0 comments on commit 05ee6a2