chore: clean webui root make targets#369

Merged
ybt195 merged 1 commit into
determined-ai:masterfrom
ybt195:clean-webui-root
May 5, 2020

@ybt195 ybt195 commented May 5, 2020

Test Plan

  • UI change:
    • add screenshots
    • React? build & check storybooks
  • user-facing api change: modify documentation and examples
  • user-facing api change: add the "User-facing API Change" label
  • bug fix: add regression test
  • bug fix: determine if there are other similar bugs in the codebase
  • new feature: add test coverage for any user-facing aspects
  • refactor: maintain existing code coverage

@ybt195 ybt195 requested a review from hkang1 May 5, 2020 19:20
@cla-bot cla-bot Bot added the cla-signed label May 5, 2020

@hkang1 hkang1 left a comment

Contributor

LGTM!

@hkang1 hkang1 assigned ybt195 and unassigned hkang1 May 5, 2020
@ybt195 ybt195 merged commit c088405 into determined-ai:master May 5, 2020
mackrorysd pushed a commit that referenced this pull request Aug 2, 2022
* chore: configuration item resources.devices
mackrorysd pushed a commit that referenced this pull request Aug 2, 2022
* chore: configuration item resources.devices
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
* chore: configuration item resources.devices
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
* chore: configuration item resources.devices
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
* chore: configuration item resources.devices
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Sep 26, 2022
* chore: configuration item resources.devices
nrajanee pushed a commit to nrajanee/determined that referenced this pull request Dec 1, 2022
* chore: configuration item resources.devices
nrajanee pushed a commit to nrajanee/determined that referenced this pull request Dec 1, 2022
… random failures (determined-ai#539)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now correctly
return 409 instead of 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
it may send back the meta field we
initially gave it. This is ok; it is in
spec to receive and ignore it.
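
A minimal sketch of the idea, not the code from this commit: drop the `meta` key from an incoming SCIM payload before applying it. The helper name `strip_scim_meta` and the sample payload are hypothetical.

```python
from typing import Any, Dict


def strip_scim_meta(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a SCIM payload without the server-managed meta field."""
    # Hypothetical helper: only the "meta" key is taken from the commit message.
    return {key: value for key, value in payload.items() if key != "meta"}


# Illustrative payload; the attribute values are made up.
incoming = {
    "userName": "alice@example.com",
    "active": True,
    "meta": {"resourceType": "User"},
}
cleaned = strip_scim_meta(incoming)
```

The input dictionary is left untouched, so the caller can still log the original request.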

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.
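
A rough sketch of the lookup described above, assuming the parsed config is a dictionary with a `resource_pools` list whose entries may carry a `provider` mapping with a `master_url` key. These key names and the helper `find_master_url` are assumptions for illustration, not the code from this commit.

```python
from typing import Any, Dict, Optional


def find_master_url(config: Dict[str, Any]) -> Optional[str]:
    """Return the first master URL found in any resource pool's provider."""
    # Key names below are assumed for illustration.
    for pool in config.get("resource_pools", []):
        provider = pool.get("provider") or {}
        url = provider.get("master_url")
        if url:
            return url
    return None


# Illustrative config; pool names and the URL are made up.
sample_config = {
    "resource_pools": [
        {"pool_name": "cpu-pool"},
        {"pool_name": "gpu-pool", "provider": {"master_url": "http://master:8080"}},
    ]
}
master_url = find_master_url(sample_config)
```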

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
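
As an illustration only (the master is not written in Python), appending relayState to a redirect URI's query string could look like the sketch below. The callback path and the helper name `with_relay_state` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_relay_state(redirect_uri: str, relay_state: str) -> str:
    """Append relayState to the query string, keeping existing parameters."""
    parts = urlsplit(redirect_uri)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("relayState", relay_state))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical callback URL, for illustration only.
uri = with_relay_state("https://master.example.com/oidc/callback?x=1", "cli")
```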

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
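
A small sketch of the matching rule described above, assuming claims and SCIM users are plain dictionaries. The function `match_user`, its parameter names, and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Optional


def match_user(
    claims: Dict[str, Any],
    users: List[Dict[str, Any]],
    claim_name: str = "email",
    attribute_name: str = "userName",
) -> Optional[Dict[str, Any]]:
    """Find the user whose SCIM attribute equals the configured OIDC claim."""
    value = claims.get(claim_name)
    if value is None:
        return None  # the token does not carry the configured claim
    for user in users:
        if user.get(attribute_name) == value:
            return user
    return None


# Illustrative data only.
users = [{"userName": "alice@example.com", "externalId": "a-1"}]
by_email = match_user({"email": "alice@example.com"}, users)
by_sub = match_user({"sub": "a-1"}, users, claim_name="sub", attribute_name="externalId")
```

The defaults reproduce the previous hard-coded behavior (the `email` claim matched against the username), while the two parameters allow any other pairing.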

* fix: update docs skip check remote address to EE

* chore: add launcher client (determined-ai#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Phillip Gaisford <phillip.gaisford@hpe.com>
Co-authored-by: phillip-gaisford <98362331+phillip-gaisford@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: Philip Norman <philipnrmn@users.noreply.github.com>

* chore: Provide Slurm job submission failure test cases (FOUNDENG-86) (determined-ai#321)

Wrote test cases for when the CircleCI integration with SLURM is implemented. Each test case launches an experiment, waits for the error, and verifies the log of the error. It also creates a new test category called e2e_slurm.

* chore: created new branch to merge with master instead of dispatcher

* chore: added .yaml test files

* fix: simplified test .yaml files and moved file location

* fix: revert devcluster-casablanca.yaml

* fix: compensate for breaking change determined-ai#4460 (determined-ai#326)

* fix: compensate for breaking change determined-ai#4460

* chore: FOUNDENG-102 Determined shows killed shells as still running (determined-ai#327)

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* chore: FOUNDENG-102 Determined shows killed shells as still running

* Added locking to the monitoredJobs

Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* chore: dispatcher RM supports slot type ROCM (determined-ai#329)

* chore: dispatcher RM supports slot type ROCM

* chore: allow launch using podman (determined-ai#334)

* fix: Cleanup CPU-only system error reporting (FOUNDENG-117) (determined-ai#335)

Ensure that the extended error messages are reported on submission failure by expanding the pattern.

Suppress environment cleanup on LevelDebug and greater, as LevelTrace is kind of unusable due to the amount of output logged.

* chore: take agent slot type from partition config (determined-ai#336)

* chore: take agent slot type from partition config

* test: add unit tests. (FOUNDENG-71) (determined-ai#339)

* FOUNDENG-71. Add unit tests.

* test: add unit tests. (FOUNDENG-71)

* test: add coverage for ROCM. (determined-ai#340)

* refactor: make sso a plugin [DET-7560] (determined-ai#341)

* test: add unit tests (FOUNDENG-70) (determined-ai#344)

* chore: Provide a working cache_dir for slurm devcluster (determined-ai#347)

The new cache_dir master.yaml attribute defaults to /var/cache/determined
which users do not normally have access to, so provide a different
default for the tools/slurmcluster.sh script so that it works without
hacking the system.

* chore: Enhance slurmcluster.sh to support authenticated launcher. (determined-ai#349)

* chore: Enhance slurmcluster.sh to support authenticated launcher.

Add new -a option which will attempt to pull the .launcher.token
from the cluster.   If a token file exists for the cluster, it
is used by the master.

* Update slurmcluster.sh

* fix: Exported functions (e.g. which) may break experiments (FOUNDENG-145) (determined-ai#351)

Bash-exported functions are set as environment variables and by default
are inherited into singularity containers.   On some systems the which
command is configured this way and injects arguments into the which
command.  When invoked inside of a determined environment image the
which command does not support these arguments and it breaks the check
for the python3 being on the path, thus breaking most experiments.

Clear all exported functions to avoid this potential collision.
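
For illustration, recent bash versions carry exported functions in environment variables whose names start with `BASH_FUNC_`. The sketch below filters those out of an environment mapping before it is handed to a container; it is not the change made in this commit, and the helper name is hypothetical.

```python
from typing import Dict


def drop_exported_functions(env: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of env without bash-exported function variables."""
    return {
        name: value
        for name, value in env.items()
        if not name.startswith("BASH_FUNC_")
    }


# Illustrative environment; the function body is made up.
host_env = {
    "PATH": "/usr/bin:/bin",
    "BASH_FUNC_which%%": "() {  alias | /usr/bin/which --read-alias \"$@\"\n}",
}
container_env = drop_exported_functions(host_env)
```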

* chore: Compile and document the OSS dependencies of the Launcher [FOUNDENG-105] (determined-ai#354)

Added 181 licenses for OSS dependencies for the slurm launcher. Also modified gen-attributions.py to include Slurm Launcher section in the documentation.

Co-authored-by: Cameron Quilici <cameron.quilici@hpe.com>

* chore: disable dependabot in EE repo (determined-ai#358)

Dependencies are generally updated in the OSS repo.  Disabling all
dependabot updates here until there's a mechanism to selectively do so.

* chore: Specify network mode for PodMan containers over Slurm (FOUNDENG-149) (determined-ai#359)

Unlike Singularity, PodMan behaves like docker.   Set --network=host
to enable dtrain processes on specific ports.

* feat: Slurm support with Singularity or PodMan (determined-ai#361)

Document PodMan support.

* chore: FOUNDENG-134 Handle both CUDA_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES. (determined-ai#360)

* ci: slurm ci (determined-ai#342)

This change adds basic CI for slurm and enables a few tests. After this, we'll work on enabling more tests and enabling GPU runners.

* chore: launch config for image cache & capabilities (determined-ai#363)

* Launch config for image cache & capabilities

* chore: configuration item resources.devices (determined-ai#369)

* chore: configuration item resources.devices

* feat: Enable det shell start with podman (FOUNDENG-152) (determined-ai#371)

* feat: Enable explicit port management with podman (FOUNDENG-150) (determined-ai#373)

* chore: FOUNDENG-120 convert SLURM ResourcePool request to goroutine to improve response time for determined master. (determined-ai#352)

* fix: e2e_test test_slurm.py test_node_not_available fails on CPU based cluster (Mosaic) due to different Error output (FOUNDENG-132)  (determined-ai#364)

[FOUNDENG-132]
Added extra error logs to account for differences with clusters with no GPUs. Also updated test_docker_login to include checks for error logs that are due to docker download rate limitations.

* chore: add flag to avoid overlapping resource pool requests (determined-ai#377)

* refactor: solidify rm interface (ee)

* chore: Update slurmcluster.sh to handle new casablanca-mgmt1 configuration (determined-ai#380)

The system casablanca has been updated with the name casablanca now pointing to casablanca-login instead of casablanca-mgmt1 as it used to.  Update the script to fix casablanca to fully work on casablanca-login.

* chore: temp fix for failing slurm tests (determined-ai#381)

* fix: skip DET special ports in config podman port mappings (FE-163) (determined-ai#379)

* chore: fix issues with locking and unlocking mutexes (determined-ai#382)

* test: Enable logging tests on slurm (determined-ai#367)

Enable some tests within test_logging.py for slurm.
Fix the cluster.utils agents url, and properly
reference the agents dict within the response.

* test: Add mosaic to slurmcluster to enable testing (determined-ai#384)

Add mosaic cluster as an option to slurmcluster.sh to enable
test validation there and debugging.

* test: Enable test case on slurm (FOUNDENG-171) (determined-ai#385)

Enable test_pytorch_const_warm_start.
Did not enable test_pytorch_load because it uses mnist_pytorch/const-pytorch11.yaml
which forces /tmp to be shared which causes permission problems across multiple
users.

* test: add test coverage for pkg/tasks/dispatcher_task.go. (determined-ai#386)

* test: Add slurmcluster.sh support for -d and cluster osprey (determined-ai#387)

Enable use of slurmcluster.sh with launchers started by the
new loadDevLaucher.sh script.

Add configuration for osprey & swan cluster.

* test: enable test_pytorch_lighting.py for e2e_slurm. (determined-ai#388)

* test: Add test cluster raptor config in slurmcluster.sh (determined-ai#390)

Add to the script to enable testing with raptor.

* chore: determine type of WLM on the cluster (determined-ai#392)

* chore: determine type of WLM on the cluster

* test: enable test_launch for e2e_slurm. (determined-ai#389)

* test: Add default /etc/hosts bind mount in slurmcluster.sh (determined-ai#395)

Podman V4.0 no longer maps /etc/hosts into the container, which means that neither the admin/login node names nor the compute node names can be resolved.  We are adding that to the default master.yaml configured on install with the launcher.  This adds the same /etc/hosts setup in slurmcluster.sh so that when using podman (-p) it maps /etc/hosts by default.

* test: update slurmcluster.sh: casablanca-login2 (determined-ai#396)

* feat: Add support for PBS (FOUNDENG-187) (determined-ai#397)

Add search list of both Slurm & PBS carriers to enable dynamic
selection of whichever is available on the target system.
Updated unit tests to expect two carriers.

* fix: provide better message when failed to fetch resource pool details. (determined-ai#398)

* chore: FOUNDENG-126 Enhance Determined container prep logic to work for PBS (determined-ai#399)

* test: Add support for Grenoble Slurm test system (determined-ai#401)

Add new option for test system o184i023.
Include option for non-default slot_type.
Correct the protocol when -d is specified to be http
even if the default for the system is https.

* chore: PBS awareness in resource pools. (determined-ai#402)

* chore: PBS awareness in resource pools.

* fix: Workaround rocm-smi python issue (FOUNDENG-127) (determined-ai#403)

rocm-smi does not work within a singularity container when the host is RHEL
and the container is Ubuntu.   This is a workaround to that incompatibility.

* chore: relax Slurm jobs do not require gres. (determined-ai#404)

* chore: set PBS resource pool properties (determined-ai#408)

* fix: Allow over-mounting of /tmp/work (FOUNDENG-205) (determined-ai#410)

With Singularity /tmp is removed and re-linked to a user directory
to avoid the default host-wide shared /tmp and provide more space than
the limited Singularity tmpfs space (10mb).  Make the removal of /tmp
handle sub-directories injected by bind mounts, by detecting the
error and reporting an ERROR message instead of failing.

Also add a FATAL error message giving context if the shell
script is terminated due to a non-zero exit (set -e).

* chore: fix bugs wrt backoff package in logging scripts (determined-ai#405)

* fix: remove slurm-resources-info file on job cleanup (determined-ai#411)

* fix: remove slurm-resource-info file on job cleanup

* fix: remove slurm-resources-info file on job cleanup

* fix: Properly pass along PBS queue to launcher (FE-202) (determined-ai#413)

Small fix to pass along the resource_pool name as the PBS/Slurm
queue (partition is only supported by Slurm, but PBS/Slurm both
support queue).

* chore: config pbs resource manager type. (determined-ai#416)

* fix: Drop unused resource tracking data (FOUNDENG-215) (determined-ai#419)

DispatcherRM has been maintaining resource mapping data that has been unused since it migrated into the DB.   Drop the fields we do not need.

* chore: support custom experiment config for PBS args (determined-ai#423)

* chore: reload auth token on authorization error  (determined-ai#418)

* FOUNDENG-209. Reload auth token on authorization error.

* chore: reload auth token on authorization error.

* chore: reload auth token on authorization error.

* chore: reload auth token on authorization error

* feat: add EE portion of RBAC (determined-ai#415)

Co-authored-by: Max Russell <max.russell@hpe.com>

* feat: permission summary API and permissions + precanned roles (determined-ai#426)

* chore: attempt calculation of RM SlotsPerAgent (determined-ai#425)

* chore: attempt calculation of RM SlotsPerAgent

* chore: Enhancement to slurmcluster.sh (determined-ai#430)

Usability tweaks to improve robustness.
- If tunnels are to be started terminate any existing non-interactive sshd processes
  for the user which should be from older hung tunnels.
- If -a is unable to retrieve a token (e.g. CTRL/C to abort it), leave the
  existing token intact, instead of destroying it with an empty value.
- Add a short sleep to enable tunnels to stabilize before starting devcluster.
  On occasion they are not ready and it causes a spurious failure.
- Remove conf for interns.

* chore: upgrade bad user agent messaging (determined-ai#431)

* chore: upgrade bad user agent messaging

* fix: Generalize launcher prefixes for PBS (determined-ai#432)

Some messages include Slurm/PBS and the carrier name.   Generalize the regex to allow either Slurm/PBS so that message processing will be handled similarly.
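
A hedged sketch of what a generalized prefix pattern could look like. The message format used here ("<WLM> <carrier>: <text>") is invented for the example; the actual launcher messages and the regex in this commit may differ.

```python
import re

# Accept either workload manager name at the start of the message.
# The "<WLM> ...: " layout is an assumption for illustration.
WLM_PREFIX = re.compile(r"^(?:Slurm|PBS)\b[^:]*:\s*")


def strip_wlm_prefix(message: str) -> str:
    """Remove a leading Slurm/PBS prefix so both are processed the same way."""
    return WLM_PREFIX.sub("", message, count=1)


slurm_msg = strip_wlm_prefix("Slurm carrier: job submission failed")
pbs_msg = strip_wlm_prefix("PBS carrier: job submission failed")
```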

* fix: Avoid using multiple carriers on failure we retry (FOUNDENG-232) (determined-ai#433)

We have different carriers for Slurm/PBS, but if we list them both
and the user job fails, it tries the next.   Use the dispatcherRM
wlmType to specify the carrier in use to avoid this fallback.

* chore: Ensure CUDA_VISIBLE_DEVICES is represented as a comma separated list of simple numbers (determined-ai#420)

* fix: nil ptr on protoing and workspace permission missing from viewer (determined-ai#438)

* fix: Initial prep_container should fail quickly (FOUNDENG-217) (determined-ai#435)

Fix the exception raised when master is not reachable (MasterNotFoundException).
APIHttpError only happens when a call completes without a successful status
response.

The initial prep_container was intended to fail reasonably quickly
to enable diagnostics of misconfiguration.   Recent changes
for re-using the common session (DET-8003) settings have caused the initial
communication to retry for more than 30 mins, thus defeating the
original intent of prep_container.trial_prep and the error message
it provides.

This change lowers the session retry in prep_container (6 retries with 0.5
backoff -> 64 seconds) to enable the diagnostic message to be posted reasonably quickly.

* feat: Reconnect to Slurm jobs on startup (FOUNDENG-215) (determined-ai#429)

We previously terminated running jobs upon a master restart.
Now that Determined core supports re-attaching to jobs, do the
same for DispatcherRM.

- Change configure to indicate that DispatcherRM supports reattach
- Handle allocation messages with Restore:true
- Fail any allocations on Restore:true if the Dispatch ID is missing.
- Handle the case where we no longer have the payload name
  which was lost in the restart.   Ask the launcher for it in the
  very rare case where the job was started but fails before the
  rendezvous and we need the payload name to retrieve the logs.
- When in debug mode, defer dispatch cleanup until the next restart.
  On restart terminate all dispatches.

* feat: rbac authz experiments api [DET-8207] (determined-ai#434)

* feat: add RBAC implementation of workspace authz (determined-ai#436)

* fix: Wait for termination before deleting dispatch (FOUNDENG-217) (determined-ai#442)

When killing a dispatch, we are not able to immediately delete the
dispatch because the files may still be in use by the running container.
Wait until we get to a terminal state before performing the delete.
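
The ordering above (poll until terminal, then delete) can be sketched as follows. The state names, the callbacks, and the function `delete_when_terminal` are hypothetical and stand in for the real dispatcher calls.

```python
import time
from typing import Callable

# Hypothetical terminal state names, for illustration only.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED"}


def delete_when_terminal(
    get_state: Callable[[], str],
    delete: Callable[[], None],
    poll_interval: float = 1.0,
    max_polls: int = 60,
) -> bool:
    """Delete the dispatch only after it reports a terminal state.

    Returns True if the delete ran, False if the poll budget ran out first.
    """
    for _ in range(max_polls):
        if get_state() in TERMINAL_STATES:
            delete()
            return True
        time.sleep(poll_interval)
    return False
```

Bounding the number of polls keeps a stuck job from blocking cleanup forever; the caller decides what to do when the function returns False.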

* feat: rbac implementation for user authorization [DET-8205] (determined-ai#445)

* feat: rbac project authz implementation (determined-ai#446)

* feat: auto assign workspace admin to workspace creator [DET-8212] (determined-ai#440)

* fix: Starting state now shows Running [FOUNDENG-242] (determined-ai#447)

Once the job was queued with Slurm/PBS we triggered the Starting
state.  Prior to 19.4 this used to show as QUEUED in the UI, but
now has changed to "RUNNING (PREPARING ENV)" which is not
accurate.   So map PENDING -> Assigned such that the UI
continues to show "QUEUED" until the job starts running.

* chore: Enhance slurmcluster.sh (determined-ai#448)

Enhance option parsing to relax arg order requirements.
Add -i arg to override the default logging level to info.

* fix: Improve PBS error reporting [FOUNDENG-248] (determined-ai#453)

Augment the error logs with the HTTP response value on failure.  The returned
error does not always have the underlying info (e.g. 404 Not Found).

Fix the patterns used for matching messages with PBS.   Add an entry for
Slurm (which was showing up previously because no messages matched).

Add filtering based upon the reporter to eliminate some noise that we never
want to see from the Dispatcher infrastructure.

* chore: update dispatcher-wrapper.sh to remove code that sets SLURM_PROCID from PBS_TASKNUM (determined-ai#454)

* feat: get workspace assigned users and groups [DET-8442] (determined-ai#444)

* chore: remove refs to pbs/slurm in environment (determined-ai#458)

* Revert "chore: remove refs to pbs/slurm in environment (determined-ai#458)" (determined-ai#461)

This reverts commit 81f64c80aea646f8c5edeb80164929d877783a80.

* chore: update err message suitable for slurm/pbs. (determined-ai#462)

* chore: generalize message for Slurm/PBS. (determined-ai#463)

* feat: ee support for agent user group settings per workspace. (determined-ai#460)

* chore: remove refs to pbs/slurm in environment (determined-ai#465)

* chore: consume experiment PBS & Slurm batch args (determined-ai#472)

* fix: Add export PATH for PBS Carrier [FOUNDENG-266] (determined-ai#474)

We minimally need the path to be inherited into the PBS
script job such that singularity run can successfully pull
an image.   It needs /usr/sbin/ on the path, but PBS
apparently doesn't inherit the system path or any such reasonable
path.  This change allows inheritance of all environment
variables to cover PATH, and anything else the launcher may
have added to their environ (PATH, LD_LIBRARY_PATH, etc).

* ci: auto-deploy `latest-ee-gke`. (determined-ai#467)

* feat: echo auth for ee (determined-ai#479)

* chore: assign cluster admin to 'admin' for new clusters (determined-ai#477)

* feat: Add CAN_EDIT_WEBHOOKS permission to pre-canned admin role [WEB-218] (determined-ai#471)

* feat: RBAC authz for user groups [DET-8477] (determined-ai#473)

* chore: Fix build break due to unused import (determined-ai#486)

Drop unused imports.

* fix: deal with some lint (determined-ai#491)

* fix: FOUNDENG-283 Determined UI Resource Pools page incorrectly shows CPU usage (determined-ai#490)

* fix: Correct quoting in error message (determined-ai#492)

We have a custom bash error handler that runs if any command returns
a non-zero status.  Fix the quoting and spelling so that it actually works.

* fix: Restore det shell on podman [FOUNDENG-280] (determined-ai#493)

When running rootless podman, inside the container we are
root/uid=0 and that maps to the user account outside the
container.   All is fine until we attempt to ssh into the
container which actually then uses the launching username/uid.
Under normal circumstance /run/determined/ssh has only 0600
permissions for only the owning user to read, but with podman
root maps to the user/uid, so the launching user is not seen
as having access to the files.

Until we find a better solution, dynamically relax the permissions
to be a+x on the /run/determined/ssh directory path such that
the user can read /run/determined/ssh/authorized_keys and enable
ssh into the container to work properly.

Additionally, drop use of the podman --hostuser arg, as it doesn't help
the situation and we already provide the launching user in a custom
passwd entry.
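
A sketch of relaxing permissions to a+x along a directory path, as described above. This is an illustration in Python, not the script change in this commit; the helper name and its `stop_at` parameter are hypothetical.

```python
import os
import stat

EXECUTE_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def relax_directory_path(path: str, stop_at: str) -> None:
    """Add a+x to path and to each parent directory up to and including stop_at."""
    current = os.path.abspath(path)
    stop = os.path.abspath(stop_at)
    if os.path.commonpath([current, stop]) != stop:
        raise ValueError(f"{stop} is not an ancestor of {current}")
    while True:
        # Keep the existing permission bits and only add execute (traverse).
        mode = stat.S_IMODE(os.stat(current).st_mode)
        os.chmod(current, mode | EXECUTE_BITS)
        if current == stop:
            break
        current = os.path.dirname(current)
```

Only the execute (traverse) bit is added, so directory listings stay restricted while a known file name inside can still be reached.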

* chore: ee lint fixes and implement added authz method (determined-ai#497)

* fix: remove =true from sso url querystring (determined-ai#494)

* chore: support slots per node (determined-ai#500)

* fix: add `ON DELETE CASCADE` for `role_assignments.group_id` column (determined-ai#501)

* feat: RBAC authz for RBAC [DET-8206] [DET-8368] (determined-ai#480)

* fix: searching roles results in 500 error (determined-ai#503)

* fix: PodMan map user to UID and GID to 0 in passwd [FOUNDENG-300] (determined-ai#504)

In rootless PodMan the user executes as uid/gid 0:0 inside the container
which maps to the actual launching user outside the container.  If
the entry point user is 'root' then map the agent user to 0:0 in
/run/determined/etc/passwd such that outside the container the access
is seen as the launching user.

/run/determined/etc/passwd contains a single line (written by Determined)
to represent the agent user.
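
For illustration, the single passwd line could be produced as in the sketch below. The field order follows the standard passwd format (name, password, UID, GID, GECOS, home, shell); the function name, defaults, and sample values are hypothetical.

```python
def agent_passwd_line(
    username: str,
    uid: int,
    gid: int,
    entrypoint_user: str,
    home: str = "/",
    shell: str = "/bin/bash",
) -> str:
    """Build one passwd entry, mapping the agent user to 0:0 under a root entry point."""
    if entrypoint_user == "root":
        # Rootless PodMan: 0:0 inside maps to the launching user outside.
        uid, gid = 0, 0
    return f"{username}:x:{uid}:{gid}::{home}:{shell}"


# Illustrative values only.
rootless_line = agent_passwd_line("alice", 1001, 1001, entrypoint_user="root")
regular_line = agent_passwd_line("alice", 1001, 1001, entrypoint_user="alice")
```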

* feat: make list groups roles and list users roles return assignment info (determined-ai#498)

* chore: Disable test_node_not_available [FOUNDENG-304] (determined-ai#510)

The test is queueing instead of getting the expected error message on the mosaic
slurm cluster.  Need to resolve before re-enabling.

* chore: Add sawmill test system to slurmcluster.sh (determined-ai#511)

Add config for sawmill and detect systems that do not have installed launchers, and indicate that -d is required.

* chore: Disable test_node_not_available [FOUNDENG-304] (determined-ai#512)

Additionally rename the disabled test_node_not_available, to avoid
warnings about a test without an annotation.

* chore: experiment log show Slurm/PBS job ID. (determined-ai#502)

* fix: add sso login routes to list of echo routes that don't require auth (determined-ai#509)

Co-authored-by: Addison Snelling <asnell@hpe.com>

* chore: Add node atlas to slurmcluster.sh (determined-ai#513)

Enable testing with another data center system.

* feat: Fully support apptainer fork of singularity [FOUNDENG-292] (determined-ai#507)

Apptainer 1.0 is a fork of Singularity 3.8.  Reduce use of SINGULARITY_* variables.
hpe-hpc-launcher 3.1.4 supports capabilities and cached bypass.   --no-mount=tmp
has been the default for a bit, so not explicitly needed.

We retain the use of the SINGULARITY_DOCKER* and add APPTAINER_DOCKER*
for creds as there is no CLI option alternative.   Adding the APPTAINER_* version
eliminates warnings.

* fix: get group 500 error for rbac can't access case [DET-8588, DET-8589] (determined-ai#506)

* chore: log error on insufficient launcher version. (determined-ai#508)

* fix: redirect to cli relay on det auth login (determined-ai#519)

* refactor: rbac: move from `is_global` to scope type masks [DET-8569] (determined-ai#515)

* fix: 500 error for workspace membership without perms (determined-ai#525)

* test: update expected error messages. (determined-ai#526)

* chore: rbac refactor authorization code (determined-ai#527)

* chore: add checkpoint storage permission (determined-ai#518)

* fix: allow workspace viewers to view roles in webui. (determined-ai#530)

* chore: Fix test_node_not_available test [FOUNDENG-304] (determined-ai#517)

When scheduling CPUs (unlike GPUs), test_node_not_available
will submit a job that will sit pending forever due to lack of resources.
This is happening on mosaic (our Slurm runner system today)
so skip the test if no GPUs are available.

Also limit the test's wait time for slurm failure test cases to 600s
(10 min) instead of the default wait of 30 mins, which avoids blocking up the gate
excessively on a test failure.

* test: disable restart on expected failure case. (determined-ai#528)

* chore: make authz_rbac workspaces return PermissionDeniedError (determined-ai#521)

* fix: FOUNDENG-303 Pausing, then resuming an experiment fails (determined-ai#533)

* ci: fix incorrect image name (determined-ai#535)

* ci: fix incorrect image names
* update a comment at the same time

* chore: Add additional configuration options in slurmcluster.sh (determined-ai#537)

Add the capability to set a default image, task_container_defaults, and
partition_overrides in the master.yaml.

Add configuration for sawmill to make grizzly nodes cuda
and provide a default image, and MPI settings.

Eliminate the need for the CLUSTERS list to be manually updated by
just checking for the cluster configuration directly.

Indicate the generated master.yaml file name to simplify debugging
when injecting multiline options.

* ci: autorebase PRs on master force push [INFENG-122] (determined-ai#532)

* fix: FOUNDENG-310 test_noop_pause_hpc needs timeout increase to avoid random failures

* Test still randomly fails. Passed first time, failed second time. Increased timeout to 420 seconds just to see what happens.

* Increased the overall timeout to 20 minutes

* fixed some merge issues

Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: Bradley Laney <bradlaney@determined.ai>
Co-authored-by: Neil Conway <neil@determined.ai>
Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: brian <brian@determined.ai>
Co-authored-by: Danny Zhu <dzhu@determined.ai>
Co-authored-by: Brian Friedenberg <12980763+brain-good@users.noreply.github.com>
Co-authored-by: Armand McQueen <armandmcqueen@users.noreply.github.com>
Co-authored-by: Armand McQueen <armandmcqueen@gmail.com>
Co-authored-by: Caleb Kang <caleb@determined.ai>
Co-authored-by: Ilia Glazkov <ilia@determined.ai>
Co-authored-by: Philip Norman <philipnrmn@users.noreply.github.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: Phillip Gaisford <phillip.gaisford@hpe.com>
Co-authored-by: phillip-gaisford <98362331+phillip-gaisford@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: CharlesTran1 <69864849+CharlesTran1@users.noreply.github.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Cameron Quilici <cameron.quilici@hpe.com>
Co-authored-by: Danny Sauer <danny.sauer@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
Co-authored-by: Addison Snelling <asnell@hpe.com>
nrajanee pushed a commit to nrajanee/determined that referenced this pull request Dec 13, 2022
* chore: configuration item resources.devices
nrajanee pushed a commit to nrajanee/determined that referenced this pull request Dec 13, 2022
… random failures (determined-ai#539)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now correctly
return 409 rather than 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may send back the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps, but
in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
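
The approach described above, appending relayState to the redirect URI's query string, might look roughly like the following. This is a minimal sketch, not the master's actual code: the helper name `with_relay_state`, the query key `relayState`, and the example callback URL are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_relay_state(redirect_uri: str, relay_state: str) -> str:
    """Return redirect_uri with relayState appended to its query string.

    Hypothetical helper: the key name "relayState" is an assumption.
    """
    parts = urlparse(redirect_uri)
    # Preserve any query parameters that are already present.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("relayState", relay_state))
    return urlunparse(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Hypothetical callback URL, used only for demonstration.
    print(with_relay_state("https://master.example/oidc/callback", "cli"))
```

Carrying the value in the query string means the identity provider does not need to know about it; it only has to return the user to the redirect URI unchanged.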

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
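
As an illustration of matching an arbitrary claim to an arbitrary SCIM attribute, here is a minimal sketch. The function name `find_user`, the dictionary shapes, and the sample attribute names are hypothetical and are not taken from the actual implementation.

```python
from typing import Optional


def find_user(claims: dict, scim_users: list, claim_name: str, scim_attr: str) -> Optional[dict]:
    """Return the SCIM user whose scim_attr equals the token's claim_name value.

    Hypothetical helper: both names are configurable rather than
    hard-coded to `email` and the username.
    """
    value = claims.get(claim_name)
    if value is None:
        # The token does not carry the configured claim, so nothing can match.
        return None
    for user in scim_users:
        if user.get(scim_attr) == value:
            return user
    return None


if __name__ == "__main__":
    users = [{"userName": "alice@example.com", "externalId": "a-1"}]
    token_claims = {"email": "alice@example.com", "sub": "a-1"}
    # The previous behavior corresponds to matching `email` against the username.
    print(find_user(token_claims, users, "email", "userName"))
    # With configurable names, any claim/attribute pair can be used instead.
    print(find_user(token_claims, users, "sub", "externalId"))
```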

* fix: update docs skip check remote address to EE

* chore: add launcher client (determined-ai#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Phillip Gaisford <phillip.gaisford@hpe.com>
Co-authored-by: phillip-gaisford <98362331+phillip-gaisford@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: Philip Norman <philipnrmn@users.noreply.github.com>

* chore: Provide Slurm job submission failure test cases (FOUNDENG-86) (determined-ai#321)

Wrote test cases for when the CircleCI integration with SLURM is implemented. Each test case launches an experiment, waits for the error, and verifies the log of the error. It also creates a new test category called e2e_slurm.

* chore: created new branch to merge with master instead of dispatcher

* chore: added .yaml test files

* fix: simplified test .yaml files and moved file location

* fix: revert devcluster-casablanca.yaml

* fix: compensate for breaking change determined-ai#4460 (determined-ai#326)

* fix: compensate for breaking change determined-ai#4460

* chore: FOUNDENG-102 Determined shows killed shells as still running (determined-ai#327)

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* chore: FOUNDENG-102 Determined shows killed shells as still running

* Added locking to the monitoredJobs

Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* chore: dispatcher RM supports slot type ROCM (determined-ai#329)

* chore: dispatcher RM supports slot type ROCM

* chore: allow launch using podman (determined-ai#334)

* fix: Cleanup CPU-only system error reporting (FOUNDENG-117) (determined-ai#335)

Ensure that the extended error messages are reported on submission failure by expanding the pattern.

Suppress environment cleanup on LevelDebug and greater as LevelTrace is kind of unusable due to the amount of output logged.

* chore: take agent slot type from partition config (determined-ai#336)

* chore: take agent slot type from partition config

* test: add unit tests. (FOUNDENG-71) (determined-ai#339)

* FOUNDENG-71. Add unit tests.

* test: add unit tests. (FOUNDENG-71)

* test: add coverage for ROCM. (determined-ai#340)

* refactor: make sso a plugin [DET-7560] (determined-ai#341)

* test: add unit tests (FOUNDENG-70) (determined-ai#344)

* chore: Provide a working cache_dir for slurm devcluster (determined-ai#347)

The new cache_dir master.yaml attribute defaults to /var/cache/determined
which users do not normally have access to, so provide a different
default for the tools/slurmcluster.sh script so that it works without
hacking the system.

* chore: Enhance slurmcluster.sh to support authenticated launcher. (determined-ai#349)

* chore: Enhance slurmcluster.sh to support authenticated launcher.

Add new -a option which will attempt to pull the .launcher.token
from the cluster.   If a token file exists for the cluster, it
is used by the master.

* Update slurmcluster.sh

* fix: Exported functions (e.g. which) may break experiments (FOUNDENG-145) (determined-ai#351)

Bash-exported functions are set as environment variables and by default
are inherited into singularity containers.   On some systems the which
command is configured this way and injects arguments into the which
command.  When invoked inside of a determined environment image the
which command does not support these arguments and it breaks the check
for the python3 being on the path, thus breaking most experiments.

Clear all exported functions to avoid this potential collision.
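
To show what "exported functions are set as environment variables" means in practice, the sketch below filters them out of an environment mapping. Current bash versions encode an exported function as a variable named `BASH_FUNC_<name>%%` whose value starts with `() {`; older versions used the bare function name with the same value prefix. The helper name `strip_exported_functions` is hypothetical, and the real fix may well live in a shell wrapper rather than in Python.

```python
def strip_exported_functions(env: dict) -> dict:
    """Return a copy of env without bash-exported function definitions.

    Hypothetical helper, written only to illustrate the encoding.
    """
    cleaned = {}
    for key, value in env.items():
        # Modern bash: BASH_FUNC_name%% ; legacy bash: name=() { ... }
        if key.startswith("BASH_FUNC_") or value.startswith("() {"):
            continue
        cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    sample = {
        "PATH": "/usr/bin:/bin",
        "BASH_FUNC_which%%": "() {  ( alias; declare -f ) | /usr/bin/which --read-alias \"$@\"\n}",
    }
    print(strip_exported_functions(sample))
```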

* chore: Compile and document the OSS dependencies of the Launcher [FOUNDENG-105] (determined-ai#354)

Added 181 licenses for OSS dependencies for the slurm launcher. Also modified gen-attributions.py to include Slurm Launcher section in the documentation.

Co-authored-by: Cameron Quilici <cameron.quilici@hpe.com>

* chore: disable dependabot in EE repo (determined-ai#358)

Dependencies are generally updated in the OSS repo.  Disabling all
dependabot updates here until there's a mechanism to selectively do so.

* chore: Specify network mode for PodMan containers over Slurm (FOUNDENG-149) (determined-ai#359)

Unlike Singularity, PodMan behaves like docker.   Set --network=host
to enable dtrain processes on specific ports.

* feat: Slurm support with Singularity or PodMan (determined-ai#361)

Document PodMan support.

* chore: FOUNDENG-134 Handle both CUDA_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES. (determined-ai#360)

* ci: slurm ci (determined-ai#342)

This change adds basic CI for slurm and enables a few tests. After this, we'll work on enabling more tests and enabling GPU runners.

* chore: launch config for image cache & capabilities (determined-ai#363)

* Launch config for image cache & capabilities

* chore: configuration item resources.devices (determined-ai#369)

* chore: configuration item resources.devices

* feat: Enable det shell start with podman (FOUNDENG-152) (determined-ai#371)

* feat: Enable explicit port management with podman (FOUNDENG-150) (determined-ai#373)

* chore: FOUNDENG-120 convert SLURM ResourcePool request to goroutine to improve response time for determined master. (determined-ai#352)

* fix: e2e_test test_slurm.py test_node_not_available fails on CPU based cluster (Mosaic) due to different Error output (FOUNDENG-132)  (determined-ai#364)

[FOUNDENG-132]
Added extra error logs to account for differences with clusters with no GPUs. Also updated test_docker_login to include checks for error logs that are due to docker download rate limitations.

* chore: add flag to avoid overlapping resource pool requests (determined-ai#377)

* refactor: solidify rm interface (ee)

* chore: Update slurmcluster.sh to handle new casablanca-mgmt1 configuration (determined-ai#380)

The system casablanca has been updated with the name casablanca now pointing to casablanca-login instead of casablanca-mgmt1 as it used to.  Update the script to fix casablanca to fully work on casablanca-login.

* chore: temp fix for failing slurm tests (determined-ai#381)

* fix: skip DET special ports in config podman port mappings (FE-163) (determined-ai#379)

* chore: fix issues with locking and unlocking mutexes (determined-ai#382)

* test: Enable logging tests on slurm (determined-ai#367)

Enable some tests within test_logging.py for slurm.
Fix the cluster.utils agents url, and properly
reference the agents dict within the response.

* test: Add mosaic to slurmcluster to enable testing (determined-ai#384)

Add mosaic cluster as an option to slurmcluster.sh to enable
test validation there and debugging.

* test: Enable test case on slurm (FOUNDENG-171) (determined-ai#385)

Enable test_pytorch_const_warm_start.
Did not enable test_pytorch_load because it uses mnist_pytorch/const-pytorch11.yaml
which forces /tmp to be shared which causes permission problems across multiple
users.

* test: add test coverage for pkg/tasks/dispatcher_task.go. (determined-ai#386)

* test: Add slurmcluster.sh support for -d and cluster osprey (determined-ai#387)

Enable use of slurmcluster.sh with launchers started by the
new loadDevLaucher.sh script.

Add configuration for osprey & swan cluster.

* test: enable test_pytorch_lighting.py for e2e_slurm. (determined-ai#388)

* test: Add test cluster raptor config in slurmcluster.sh (determined-ai#390)

Add to the script to enable testing with raptor.

* chore: determine type of WLM on the cluster (determined-ai#392)

* chore: determine type of WLM on the cluster

* test: enable test_launch for e2e_slurm. (determined-ai#389)

* test: Add default /etc/hosts bind mount in slurmcluster.sh (determined-ai#395)

Podman V4.0 no longer maps /etc/hosts into the container, which means that neither the admin/login node names nor the compute node names can be resolved.  We are adding that to the default master.yaml configured on install with the launcher.  This adds the same /etc/hosts setup in slurmcluster.sh so that when using podman (-p) it maps /etc/hosts by default.

* test: update slurmcluster.sh: casablanca-login2 (determined-ai#396)

* feat: Add support for PBS (FOUNDENG-187) (determined-ai#397)

Add search list of both Slurm & PBS carriers to enable dynamic
selection of whichever is available on the target system.
Updated unit tests to expect two carriers.

* fix: provide better message when failed to fetch resource pool details. (determined-ai#398)

* chore: FOUNDENG-126 Enhance Determined container prep logic to work for PBS (determined-ai#399)

* test: Add support for Grenoble Slurm test system (determined-ai#401)

Add new option for test system o184i023.
Include option for non-default slot_type.
Correct the protocol when -d is specified to be http
even if the default for the system is https.

* chore: PBS awareness in resource pools. (determined-ai#402)

* chore: PBS awareness in resource pools.

* fix: Workaround rocm-smi python issue (FOUNDENG-127) (determined-ai#403)

rocm-smi does not work within a singularity container when the host is RHEL
and the container is Ubuntu.   This is a workaround to that incompatibility.

* chore: relax Slurm jobs do not require gres. (determined-ai#404)

* chore: set PBS resource pool properties (determined-ai#408)

* fix: Allow over-mounting of /tmp/work (FOUNDENG-205) (determined-ai#410)

With Singularity /tmp is removed and re-linked to a user directory
to avoid the default host-wide shared /tmp and provide more space than
the limited Singularity tmpfs space (10mb).  Make the removal of /tmp
handle sub-directories injected by bind mounts, by detecting the
error and reporting an ERROR message instead of failing.

Also add a FATAL error message giving context if the shell
script is terminated due to a non-zero exit (set -e).

* chore: fix bugs wrt backoff package in logging scripts (determined-ai#405)

* fix: remove slurm-resources-info file on job cleanup (determined-ai#411)

* fix: remove slurm-resource-info file on job cleanup

* fix: remove slurm-resources-info file on job cleanup

* fix: Properly pass along PBS queue to launcher (FE-202) (determined-ai#413)

Small fix to pass along the resource_pool name as the PBS/Slurm
queue (partition is only supported by Slurm, but PBS/Slurm both
support queue).

* chore: config pbs resource manager type. (determined-ai#416)

* fix: Drop unused resource tracking data (FOUNDENG-215) (determined-ai#419)

DispatcherRM has been maintaining resource mapping data that has been unused since it migrated into the DB.   Drop the fields we do not need.

* chore: support custom experiment config for PBS args (determined-ai#423)

* chore: reload auth token on authorization error  (determined-ai#418)

* FOUNDENG-209. Reload auth token on authorization error.

* chore: reload auth token on authorization error.

* chore: reload auth token on authorization error.

* chore: reload auth token on authorization error

* feat: add EE portion of RBAC (determined-ai#415)

Co-authored-by: Max Russell <max.russell@hpe.com>

* feat: permission summary API and permissions + precanned roles (determined-ai#426)

* chore: attempt calculation of RM SlotsPerAgent (determined-ai#425)

* chore: attempt calculation of RM SlotsPerAgent

* chore: Enhancement to slurmcluster.sh (determined-ai#430)

Usability tweaks to improve robustness.
- If tunnels are to be started terminate any existing non-interactive sshd processes
  for the user which should be from older hung tunnels.
- If -a is unable to retrieve a token (e.g. CTRL/C to abort it), leave the
  existing token intact, instead of destroying it with an empty value.
- Add a short sleep to enable tunnels to stabilize before starting devcluster.
  On occasion they are not ready and it causes a spurious failure.
- Remove conf for interns.

* chore: upgrade bad user agent messaging (determined-ai#431)

* chore: upgrade bad user agent messaging

* fix: Generalize launcher prefixes for PBS (determined-ai#432)

Some messages include Slurm/PBS and the carrier name.   Generalize the regex to allow either Slurm/PBS so that message processing will be handled similarly.

* fix: Avoid using multiple carriers on failure we retry (FOUNDENG-232) (determined-ai#433)

We have different carriers for Slurm/PBS, but if we list them both
and the user job fails, it tries the next.   Use the dispatcherRM
wlmType to specify the carrier in use to avoid this fallback.

* chore: Ensure CUDA_VISIBLE_DEVICES is represented as a comma-separated list of simple numbers (determined-ai#420)

* fix: nil ptr on protoing and workspace permission missing from viewer (determined-ai#438)

* fix: Initial prep_container should fail quickly (FOUNDENG-217) (determined-ai#435)

Fix the exception raised when the master is not reachable (MasterNotFoundException).
APIHttpError only happens when a call completes without a successful status
response.

The initial prep_container was intended to fail reasonably quickly
to enable diagnostics of misconfiguration.   Recent changes
for re-using the common session (DET-8003) settings have caused the initial
communication to retry for more than 30 mins, thus defeating the
original intent of prep_container.trial_prep and the error message
it provides.

This change lowers the session retry in prep_container (6 retries with 0.5
backoff -> 64 seconds) to enable the diagnostic message to be posted reasonably quickly.

* feat: Reconnect to Slurm jobs on startup (FOUNDENG-215) (determined-ai#429)

We previously terminated running jobs upon a master restart.
Now that Determined core supports re-attaching to jobs, do the
same for DispatcherRM.

- Change configure to indicate that DispatcherRM supports reattach
- Handle allocation messages with Restore:true
- Fail any allocations on Restore:true if the Dispatch ID is missing.
- Handle the case where we no longer have the payload name
  which was lost in the restart.   Ask the launcher for it in the
  very rare case where the job was started but fails before the
  rendezvous and we need the payload name to retrieve the logs.
- When in debug mode, defer dispatch cleanup until the next restart.
  On restart terminate all dispatches.

* feat: rbac authz experiments api [DET-8207] (determined-ai#434)

* feat: add RBAC implementation of workspace authz (determined-ai#436)

* fix: Wait for termination before deleting dispatch (FOUNDENG-217) (determined-ai#442)

When killing a dispatch, we are not able to immediately delete the
dispatch because the files may still be in use by the running container.
Wait until we get to a terminal state before performing the delete.
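
The wait-then-delete ordering can be sketched as a simple polling loop. This is an illustrative sketch only: the state names in `TERMINAL_STATES`, the callback-based interface, and the timeout handling are assumptions, not the dispatcher's actual API.

```python
import time

# Hypothetical terminal state names; the real launcher states may differ.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED"}


def delete_when_terminal(get_state, delete, poll_interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll get_state() and call delete() only once a terminal state is seen.

    Returns True if the dispatch was deleted, False if the timeout expired
    while the job was still running (in which case nothing is deleted).
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_state() in TERMINAL_STATES:
            # The container has stopped, so its files are no longer in use.
            delete()
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)


if __name__ == "__main__":
    states = iter(["RUNNING", "RUNNING", "TERMINATED"])
    done = delete_when_terminal(lambda: next(states), lambda: print("deleted"), poll_interval=0.01)
    print(done)
```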

* feat: rbac implementation for user authorization [DET-8205] (determined-ai#445)

* feat: rbac project authz implementation (determined-ai#446)

* feat: auto assign workspace admin to workspace creator [DET-8212] (determined-ai#440)

* fix: Starting state now shows Running [FOUNDENG-242] (determined-ai#447)

Once the job was queued with Slurm/PBS we triggered the Starting
state.  Prior to 19.4 this used to show as QUEUED in the UI, but
now has changed to "RUNNING (PREPARING ENV)" which is not
accurate.   So map PENDING -> Assigned such that the UI
continues to show "QUEUED" until the job starts running.

* chore: Enhance slurmcluster.sh (determined-ai#448)

Enhance option parsing to relax arg order requirements.
Add -i arg to override the default logging level to info.

* fix: Improve PBS error reporting [FOUNDENG-248] (determined-ai#453)

Augment the error logs with the HTTP response value on failure.  The returned
error does not always have the underlying info (e.g. 404 Not Found).

Fix the patterns used for matching messages with PBS.   Add an entry for
Slurm (which was showing up previously because no messages matched).

Add filtering based upon the reporter to eliminate some noise that we never
want to see from the Dispatcher infrastructure.

* chore: update dispatcher-wrapper.sh to remove code that sets SLURM_PROCID from PBS_TASKNUM (determined-ai#454)

* feat: get workspace assigned users and groups [DET-8442] (determined-ai#444)

* chore: remove refs to pbs/slurm in environment (determined-ai#458)

* Revert "chore: remove refs to pbs/slurm in environment (determined-ai#458)" (determined-ai#461)

This reverts commit 81f64c80aea646f8c5edeb80164929d877783a80.

* chore: update err message suitable for slurm/pbs. (determined-ai#462)

* chore: generalize message for Slurm/PBS. (determined-ai#463)

* feat: ee support for agent user group settings per workspace. (determined-ai#460)

* chore: remove refs to pbs/slurm in environment (determined-ai#465)

* chore: consume experiment PBS & Slurm batch args (determined-ai#472)

* fix: Add export PATH for PBS Carrier [FOUNDENG-266] (determined-ai#474)

We minimally need the path to be inherited into the PBS
script job such that singularity run can successfully pull
an image.   It needs /usr/sbin/ on the path, but PBS
apparently doesn't inherit the system path or any such reasonable
path.  This change allows inheritance of all environment
variables to cover PATH, and anything else the launcher may
have added to their environ (PATH, LD_LIBRARY_PATH, etc).

* ci: auto-deploy `latest-ee-gke`. (determined-ai#467)

* feat: echo auth for ee (determined-ai#479)

* chore: assign cluster admin to 'admin' for new clusters (determined-ai#477)

* feat: Add CAN_EDIT_WEBHOOKS permission to pre-canned admin role [WEB-218] (determined-ai#471)

* feat: RBAC authz for user groups [DET-8477] (determined-ai#473)

* chore: Fix build break due to unused import (determined-ai#486)

Drop unused imports.

* fix: deal with some lint (determined-ai#491)

* fix: FOUNDENG-283 Determined UI Resource Pools page incorrectly shows CPU usage (determined-ai#490)

* fix: Correct quoting in error message (determined-ai#492)

We have a custom bash error handler if any command returns
a non-zero exit status.  Fix the quoting and spelling so that it actually works.

* fix: Restore det shell on podman [FOUNDENG-280] (determined-ai#493)

When running rootless podman, inside the container we are
root/uid=0 and that maps to the user account outside the
container.   All is fine until we attempt to ssh into the
container which actually then uses the launching username/uid.
Under normal circumstances /run/determined/ssh has only 0600
permissions for only the owning user to read, but with podman
root maps to the user/uid, so the launching user is not seen
as having access to the files.

Until we find a better solution, dynamically relax the permissions
to be a+x on the /run/determined/ssh directory path such that
the user can read /run/determined/ssh/authorized_keys and enable
ssh into the container to work properly.
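
A rough sketch of relaxing a directory path to a+x follows. It is not the actual entrypoint logic: the helper name `relax_dir_path` and the idea of walking each component below a root are assumptions, and the demonstration runs against a temporary directory rather than /run/determined.

```python
import os
import stat
import tempfile


def relax_dir_path(root: str, subpath: str) -> None:
    """Add execute (search) permission for user, group, and other on every
    directory component of subpath below root (equivalent to chmod a+x)."""
    current = root
    for part in subpath.strip("/").split("/"):
        current = os.path.join(current, part)
        mode = stat.S_IMODE(os.stat(current).st_mode)
        # OR in the three execute bits without touching the other bits.
        os.chmod(current, mode | 0o111)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "run", "determined", "ssh"), mode=0o700)
        relax_dir_path(root, "run/determined/ssh")
        ssh_dir = os.path.join(root, "run", "determined", "ssh")
        print(oct(stat.S_IMODE(os.stat(ssh_dir).st_mode)))
```

Execute permission on a directory only allows traversal into it, so the contents stay unlisted for other users while a known file path can still be reached, subject to that file's own permissions.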

Additionally, drop use of the podman --hostuser arg, as it doesn't help
the situation and we already provide the launching user in a custom
passwd entry.

* chore: ee lint fixes and implement added authz method (determined-ai#497)

* fix: remove =true from sso url querystring (determined-ai#494)

* chore: support slots per node (determined-ai#500)

* fix: add `ON DELETE CASCADE` for `role_assignments.group_id` column (determined-ai#501)

* feat: RBAC authz for RBAC [DET-8206] [DET-8368] (determined-ai#480)

* fix: searching roles results in 500 error (determined-ai#503)

* fix: PodMan map user to UID and GID to 0 in passwd [FOUNDENG-300] (determined-ai#504)

In rootless PodMan the user executes as uid/gid 0:0 inside the container
which maps to the actual launching user outside the container.  If
the entry point user is 'root' then map the agent user to 0:0 in
/run/determined/etc/passwd such that outside the container the access
is seen as the launching user.

/run/determined/etc/passwd contains a single line (written by Determined)
to represent the agent user.
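
The mapping rule can be illustrated by generating the single passwd line. This is a sketch under stated assumptions: the function name `agent_passwd_line` and the home directory and shell defaults are hypothetical; only the standard seven-field passwd layout (name:password:UID:GID:GECOS:directory:shell) and the 0:0 rule come from established convention and the description above.

```python
def agent_passwd_line(username: str, uid: int, gid: int, entrypoint_user: str,
                      home: str = "/run/determined/workdir", shell: str = "/bin/bash") -> str:
    """Build one passwd entry for the agent user.

    If the container entry point runs as root (rootless PodMan), map the
    agent user to 0:0 so that access outside the container is seen as the
    launching user. The home and shell defaults are hypothetical.
    """
    if entrypoint_user == "root":
        uid, gid = 0, 0
    return f"{username}:x:{uid}:{gid}::{home}:{shell}"


if __name__ == "__main__":
    print(agent_passwd_line("alice", 1001, 1001, entrypoint_user="root"))
    print(agent_passwd_line("alice", 1001, 1001, entrypoint_user="alice"))
```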

* feat: make list groups roles and list users roles return assignment info (determined-ai#498)

* chore: Disable test_node_not_available [FOUNDENG-304] (determined-ai#510)

The test is queueing instead of getting the expected error message on the mosaic
slurm cluster.  Need to resolve before re-enabling.

* chore: Add sawmill test system to slurmcluster.sh (determined-ai#511)

Add config for sawmill and detect systems that do not have installed launchers, and indicate that -d is required.

* chore: Disable test_node_not_available [FOUNDENG-304] (determined-ai#512)

Additionally rename the disabled test_node_not_available, to avoid
warnings about a test without an annotation.

* chore: experiment log show Slurm/PBS job ID. (determined-ai#502)

* fix: add sso login routes to list of echo routes that don't require auth (determined-ai#509)

Co-authored-by: Addison Snelling <asnell@hpe.com>

* chore: Add node atlas to slurmcluster.sh (determined-ai#513)

Enable testing with another data center system.

* feat: Fully support apptainer fork of singularity [FOUNDENG-292] (determined-ai#507)

Apptainer 1.0 is a fork of Singularity 3.8.  Reduce use of SINGULARITY_* variables.
hpe-hpc-launcher 3.1.4 supports capabilities and cached bypass.   --no-mount=tmp
has been the default for a bit, so not explicitly needed.

We retain the use of the SINGULARITY_DOCKER* and add APPTAINER_DOCKER*
for creds as there is no CLI option alternative.   Adding the APPTAINER_* version
eliminates warnings.

* fix: get group 500 error for rbac can't access case [DET-8588, DET-8589] (determined-ai#506)

* chore: log error on insufficient launcher version. (determined-ai#508)

* fix: redirect to cli relay on det auth login (determined-ai#519)

* refactor: rbac: move from `is_global` to scope type masks [DET-8569] (determined-ai#515)

* fix: 500 error for workspace membership without perms (determined-ai#525)

* test: update expected error messages. (determined-ai#526)

* chore: rbac refactor authorization code (determined-ai#527)

* chore: add checkpoint storage permission (determined-ai#518)

* fix: allow workspace viewers to view roles in webui. (determined-ai#530)

* chore: Fix test_node_not_available test [FOUNDENG-304] (determined-ai#517)

When scheduling CPUs (unlike GPUs), test_node_not_available
will submit a job that will sit pending forever due to lack of resources.
This is happening on mosaic (our Slurm runner system today)
so skip the test if no GPUs available.

Also limit the wait time for Slurm failure test cases to 600s
(10 min) instead of the default wait of 30 mins, to avoid blocking up the gate
excessively on a test failure.

* test: disable restart on expected failure case. (determined-ai#528)

* chore: make authz_rbac workspaces return PermissionDeniedError (determined-ai#521)

* fix: FOUNDENG-303 Pausing, then resuming an experiment fails (determined-ai#533)

* ci: fix incorrect image name (determined-ai#535)

* ci: fix incorrect image names
* update a comment at the same time

* chore: Add additional configuration options in slurmcluster.sh (determined-ai#537)

Add the capability to set a default image, task_container_defaults, and
partition_overrides in the master.yaml.

Add configuration for sawmill to make grizzly nodes cuda
and provide a default image, and MPI settings.

Eliminate the need for the CLUSTERS list to be manually updated by
just checking for the cluster configuration directly.

Indicate the generated master.yaml file name to simplify debugging
when injecting multiline options.

* ci: autorebase PRs on master force push [INFENG-122] (determined-ai#532)

* fix: FOUNDENG-310 test_noop_pause_hpc needs timeout increase to avoid random failures

* Test still randomly fails. Passed first time, failed second time. Increased timeout to 420 seconds just to see what happens.

* Increased the overall timeout to 20 minutes

* fixed some merge issues

Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: Bradley Laney <bradlaney@determined.ai>
Co-authored-by: Neil Conway <neil@determined.ai>
Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: brian <brian@determined.ai>
Co-authored-by: Danny Zhu <dzhu@determined.ai>
Co-authored-by: Brian Friedenberg <12980763+brain-good@users.noreply.github.com>
Co-authored-by: Armand McQueen <armandmcqueen@users.noreply.github.com>
Co-authored-by: Armand McQueen <armandmcqueen@gmail.com>
Co-authored-by: Caleb Kang <caleb@determined.ai>
Co-authored-by: Ilia Glazkov <ilia@determined.ai>
Co-authored-by: Philip Norman <philipnrmn@users.noreply.github.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: Phillip Gaisford <phillip.gaisford@hpe.com>
Co-authored-by: phillip-gaisford <98362331+phillip-gaisford@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: CharlesTran1 <69864849+CharlesTran1@users.noreply.github.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Cameron Quilici <cameron.quilici@hpe.com>
Co-authored-by: Danny Sauer <danny.sauer@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
Co-authored-by: Addison Snelling <asnell@hpe.com>
tayritenour pushed a commit to tayritenour/determined that referenced this pull request Apr 25, 2023
* chore: configuration item resources.devices
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Jun 23, 2023
* chore: configuration item resources.devices
stoksc pushed a commit that referenced this pull request Jun 26, 2023
* chore: configuration item resources.devices
eecsliu pushed a commit that referenced this pull request Jun 28, 2023
* chore: configuration item resources.devices
eecsliu pushed a commit that referenced this pull request Jun 28, 2023
* chore: configuration item resources.devices
stoksc pushed a commit that referenced this pull request Jul 20, 2023
* chore: configuration item resources.devices
eecsliu pushed a commit that referenced this pull request Jul 24, 2023
* chore: configuration item resources.devices
stoksc pushed a commit that referenced this pull request Oct 17, 2023
* chore: configuration item resources.devices
azhou-determined pushed a commit that referenced this pull request Dec 7, 2023
* chore: configuration item resources.devices
wes-turner pushed a commit that referenced this pull request Feb 2, 2024
* chore: configuration item resources.devices
rb-determined-ai pushed a commit that referenced this pull request Feb 29, 2024
* chore: configuration item resources.devices
amandavialva01 pushed a commit that referenced this pull request Mar 18, 2024
* chore: configuration item resources.devices
eecsliu pushed a commit that referenced this pull request Apr 18, 2024
* chore: configuration item resources.devices
eecsliu pushed a commit to determined-ai/determined-release-testing that referenced this pull request Apr 22, 2024
* chore: configuration item resources.devices

2 participants