
test clabot. #189

Closed
wants to merge 1 commit into from

Conversation

rb-determined-ai
Member

Test Plan

  • UI change:
    • add screenshots
    • React? build & check storybooks
  • user-facing api change: modify documentation and examples
  • user-facing api change: add the "User-facing API Change" label
  • bug fix: add regression test
  • bug fix: determine if there are other similar bugs in the codebase
  • new feature: add test coverage for any user-facing aspects
  • refactor: maintain existing code coverage

@cla-bot

cla-bot bot commented Apr 20, 2020

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the user @rb-determined-ai on file. In order for us to review and merge your code, please contact the project maintainers to get yourself added.

@rb-determined-ai
Member Author

@clabot check

@rb-determined-ai
Member Author

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Apr 20, 2020
@cla-bot

cla-bot bot commented Apr 20, 2020

The cla-bot has been summoned, and re-checked this pull request!

@rb-determined-ai
Member Author

@cla-bot test

@rb-determined-ai
Member Author

@cla-bot check

@cla-bot

cla-bot bot commented Apr 20, 2020

The cla-bot has been summoned, and re-checked this pull request!

@rb-determined-ai
Member Author

@cla-bot check

@cla-bot

cla-bot bot commented Apr 20, 2020

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the user @rb-determined-ai on file. In order for us to review and merge your code, please contact the project maintainers to get yourself added.

@cla-bot

cla-bot bot commented Apr 20, 2020

The cla-bot has been summoned, and re-checked this pull request!

@rb-determined-ai
Member Author

@cla-bot check

@cla-bot

cla-bot bot commented Apr 20, 2020

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the user @rb-determined-ai on file. In order for us to review and merge your code, please contact the project maintainers to get yourself added.

@cla-bot

cla-bot bot commented Apr 20, 2020

The cla-bot has been summoned, and re-checked this pull request!

@rb-determined-ai
Member Author

@cla-bot check

@cla-bot cla-bot bot added the CLA signed label Apr 20, 2020
@cla-bot

cla-bot bot commented Apr 20, 2020

The cla-bot has been summoned, and re-checked this pull request!

@rb-determined-ai
Member Author

Yeah, there's just a bug in cla-bot.

ColinEberhardt/cla-bot#9

mackrorysd pushed a commit that referenced this pull request Aug 2, 2022
This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary Determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per-partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value that
is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.
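
As a rough illustration of that regularization, here is a Go sketch using a hypothetical Archive type; the real cproto.RunArchive fields differ:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Archive is a stand-in for the launcher archive argument; the real
// cproto.RunArchive type is shaped differently.
type Archive struct {
	Root  string
	Items []string
}

// regularizeRoot rewrites an archive so its root is always "/", folding any
// non-root prefix (e.g. the checkpoint-gc archive's path) into the items.
func regularizeRoot(a Archive) Archive {
	if a.Root == "/" {
		return a
	}
	out := Archive{Root: "/"}
	for _, item := range a.Items {
		out.Items = append(out.Items, filepath.Join(a.Root, item))
	}
	return out
}

func main() {
	fmt.Println(regularizeRoot(Archive{Root: "/run/determined", Items: []string{"checkpoint_gc"}}))
}
```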

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage YAML configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible via the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected.
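
The real derivation happens in shell inside the task container, but the idea can be sketched in Go; the DET_SLOT_IDS format and the warning text here are assumptions:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// GPU_DEVICE_ORDINAL covers both CUDA and ROCm allocations, so prefer
	// it when generating DET_SLOT_IDS.
	ordinal := os.Getenv("GPU_DEVICE_ORDINAL")
	if ordinal == "" {
		return // no GPUs allocated; leave DET_SLOT_IDS unset
	}

	// Diagnostic only: if CUDA_VISIBLE_DEVICES is also set, warn when it
	// disagrees rather than failing the task.
	if cuda := os.Getenv("CUDA_VISIBLE_DEVICES"); cuda != "" && cuda != ordinal {
		fmt.Fprintf(os.Stderr,
			"WARN: CUDA_VISIBLE_DEVICES=%q does not match GPU_DEVICE_ORDINAL=%q\n",
			cuda, ordinal)
	}

	// Assumed DET_SLOT_IDS format, e.g. "[0,1]".
	slots := strings.Split(ordinal, ",")
	os.Setenv("DET_SLOT_IDS", "["+strings.Join(slots, ",")+"]")
}
```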

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.
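
A minimal in-memory sketch of the mapping being persisted, with hypothetical names; the real code issues DB queries rather than using a map:

```go
package main

import "fmt"

// DispatchMapping is a hypothetical row tying an allocation to a dispatch so
// logs can be queried later; the actual schema may differ.
type DispatchMapping struct {
	AllocationID string
	DispatchID   string
}

// store stands in for the basic insert/lookup queries the commit describes;
// the real code persists to the master's database.
type store struct {
	byAllocation map[string][]DispatchMapping
}

func (s *store) insert(m DispatchMapping) {
	s.byAllocation[m.AllocationID] = append(s.byAllocation[m.AllocationID], m)
}

func (s *store) listByAllocationID(id string) []DispatchMapping {
	return s.byAllocation[id]
}

func main() {
	s := &store{byAllocation: map[string][]DispatchMapping{}}
	s.insert(DispatchMapping{AllocationID: "alloc-1", DispatchID: "dispatch-42"})
	fmt.Println(s.listByAllocationID("alloc-1"))
}
```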

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real.

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).
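
A sketch of how such a slot_type option with a cuda default might look; the struct and field names are illustrative, not the actual config types:

```go
package main

import "fmt"

// SlotType mirrors the slot_type RM option described above: cuda (default),
// cpu, or rocm.
type SlotType string

const (
	SlotTypeCUDA SlotType = "cuda"
	SlotTypeCPU  SlotType = "cpu"
	SlotTypeROCm SlotType = "rocm"
)

// DispatcherConfig is a stand-in for the real RM config struct.
type DispatcherConfig struct {
	SlotType SlotType `json:"slot_type"`
}

// withDefaults applies the documented default: cuda unless the admin set
// slot_type (e.g. cpu for clusters without SelectType=select/cons_tres).
func (c DispatcherConfig) withDefaults() DispatcherConfig {
	if c.SlotType == "" {
		c.SlotType = SlotTypeCUDA
	}
	return c
}

func main() {
	fmt.Println(DispatcherConfig{}.withDefaults().SlotType) // cuda
}
```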

* HAL-2793: Add dispatcher portions of expconf for Slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml, so they
have a single file to edit to configure Determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the --writable-tmpfs
option). It does not impact /tmp or /dev/shm, which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.
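
The gist of that propagation, sketched with simplified stand-ins for the real dispatch and scheduling state types:

```go
package main

import "fmt"

// DispatchState and SchedulingState are simplified stand-ins for the real
// launcher and job types; the mapping below is the gist, not the exact code.
type DispatchState string

type SchedulingState string

const (
	SchedulingStateQueued    SchedulingState = "QUEUED"
	SchedulingStateScheduled SchedulingState = "SCHEDULED"
)

// schedulingStateOf propagates a dispatch's state so the experiment stops
// reporting QUEUED once the dispatch is actually RUNNING.
func schedulingStateOf(d DispatchState) SchedulingState {
	switch d {
	case "RUNNING":
		return SchedulingStateScheduled
	default:
		return SchedulingStateQueued
	}
}

func main() {
	fmt.Println(schedulingStateOf("RUNNING"))
}
```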

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with Slurm; however, we are running in a Singularity container and Slurm
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.
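
A Go sketch of the environment scrubbing described here; the real logic lives in the task startup scripts, and the mpirun arguments are placeholders:

```go
package main

import (
	"os"
	"os/exec"
	"strings"
)

// runMPI launches mpirun with SLURM_JOBID stripped from the environment so
// mpirun honors the host/topology args horovodrun passed instead of trying
// to self-integrate with Slurm. A sketch; the real invocation carries many flags.
func runMPI(args ...string) error {
	env := make([]string, 0, len(os.Environ()))
	for _, kv := range os.Environ() {
		if strings.HasPrefix(kv, "SLURM_JOBID=") {
			continue // drop it so mpirun does not detect Slurm
		}
		env = append(env, kv)
	}
	cmd := exec.Command("mpirun", args...)
	cmd.Env = env
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	_ = runMPI("-np", "2", "hostname")
}
```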

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by dispatcher-wrapper.sh.

Further enhance the blacklist of variables to include those with % symbols.
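
A sketch of the variable blacklist check, folding in the related exclusion of exported shell functions whose names contain "()" (see #174 below); purely illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// skipVariable reports whether an environment variable name should be left
// out of the propagated environment: exported shell functions contain "()"
// and break SSH escaping, and names with "%" cause similar trouble.
func skipVariable(name string) bool {
	return strings.Contains(name, "()") || strings.Contains(name, "%")
}

func main() {
	fmt.Println(skipVariable("BASH_FUNC_module%%")) // true
	fmt.Println(skipVariable("DET_SLOT_IDS"))       // false
}
```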

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some cleanup on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the Slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
Singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing, it will
  exit with a success status.
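
The container-side half of this flow might look roughly like the following; the endpoint path and the environment variable used for the allocation ID are assumptions, not the actual master API route:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

// Sketch: treat SIGTERM from Slurm as a pending preemption and tell the
// master, which then triggers ReleaseResources(ForcePreemption: true); the
// long poll in _preempt.py returns, and the job checkpoints and exits 0.
func main() {
	masterURL := "http://localhost:8080"
	allocationID := os.Getenv("DET_ALLOCATION_ID") // assumed env var

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	<-sigs // Slurm sends SIGTERM ahead of preempting the job
	url := fmt.Sprintf("%s/api/v1/allocations/%s/pending_preemption",
		masterURL, allocationID) // illustrative path
	if _, err := http.Post(url, "application/json", nil); err != nil {
		fmt.Fprintln(os.Stderr, "failed to report pending preemption:", err)
	}
}
```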

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

so you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches Determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.
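
The precedence chain can be sketched generically; the names and the example setting are illustrative:

```go
package main

import "fmt"

// resolve picks the first non-nil setting in precedence order: expconf,
// then the pool-level default, then the master-config default.
func resolve[T any](expconf, pool, master *T) *T {
	if expconf != nil {
		return expconf
	}
	if pool != nil {
		return pool
	}
	return master
}

func main() {
	// Hypothetical per-partition override of a rendezvous interface name.
	masterDefault := "eth0"
	poolDefault := "mlx5_0"
	fmt.Println(*resolve[string](nil, &poolDefault, &masterDefault)) // mlx5_0
}
```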

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the cpu slot type.

Also add a devcluster file for shuco for testing.
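
A sketch of the fallback rule described above, with hypothetical types:

```go
package main

import "fmt"

type partition struct {
	name      string
	gpus      int
	isDefault bool
}

// defaultComputePool mimics the fix: if no partition has GPUs, fall back to
// the default aux pool so requests for "" resolve to a real pool whose
// attributes (no GPUs) correctly select the cpu slot type.
func defaultComputePool(parts []partition, defaultAuxPool string) string {
	for _, p := range parts {
		if p.isDefault && p.gpus > 0 {
			return p.name
		}
	}
	for _, p := range parts {
		if p.gpus > 0 {
			return p.name
		}
	}
	return defaultAuxPool
}

func main() {
	parts := []partition{{name: "debug", gpus: 0, isDefault: true}}
	fmt.Println(defaultComputePool(parts, "debug")) // aux fallback
}
```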

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady, so to avoid
a race condition where the dispatcher RM discovers the running
state by polling, force the state to running upon reception
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in the OSS version of this fix to avoid a merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with Determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as Determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting to our previous YAML parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm.

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB
we have ensured that the DispatchIDs (launches & environment) have
been removed by the launcher.

Use a common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).
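
The 404-tolerant cleanup could be sketched like this; the error type and callbacks are stand-ins for the real launcher client and DB layer:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a stand-in for the launcher client's error type.
type apiError struct{ status int }

func (e *apiError) Error() string { return fmt.Sprintf("launcher returned %d", e.status) }

// removeDispatch deletes launcher-side state for a dispatch; a 404 means it
// is already gone, which counts as success. Any other failure defers DB
// removal so cleanup is retried on the next master startup.
func removeDispatch(terminate, deleteRow func(id string) error, id string) error {
	if err := terminate(id); err != nil {
		var apiErr *apiError
		if !(errors.As(err, &apiErr) && apiErr.status == http.StatusNotFound) {
			return fmt.Errorf("deferring cleanup of dispatch %s: %w", id, err)
		}
	}
	return deleteRow(id)
}

func main() {
	terminate := func(id string) error { return &apiError{status: http.StatusNotFound} }
	deleteRow := func(id string) error { fmt.Println("deleted row for", id); return nil }
	_ = removeDispatch(terminate, deleteRow, "dispatch-42")
}
```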

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o Small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043.

* fix: Fix parameter order in Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as a root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error on an attempt to impersonate root,
and report it explicitly.
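
A minimal sketch of the early guard, assuming hypothetical uid/username fields:

```go
package main

import (
	"errors"
	"fmt"
)

// checkImpersonatedUser rejects launches that would run as root, surfacing
// the "Cannot set UID=0" failure early in the dispatcher RM instead of deep
// in the launcher logs. Parameter names are illustrative.
func checkImpersonatedUser(uid int, username string) error {
	if uid == 0 || username == "root" {
		return errors.New("launched jobs cannot run as root; configure an agent user")
	}
	return nil
}

func main() {
	fmt.Println(checkImpersonatedUser(0, "root"))
}
```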

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes:
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
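
Roughly, the filtering might look like this; the level prefixes are examples rather than the launcher's actual event format:

```go
package main

import (
	"fmt"
	"strings"
)

// filterExitEvents keeps only WARNING/ERROR level launcher events and strips
// a noisy prefix from job-failure messages.
func filterExitEvents(events []string) []string {
	var out []string
	for _, e := range events {
		if strings.HasPrefix(e, "WARNING:") || strings.HasPrefix(e, "ERROR:") {
			out = append(out, strings.TrimPrefix(e, "ERROR: "))
		}
	}
	return out
}

func main() {
	events := []string{"INFO: state=PENDING", "ERROR: sbatch: invalid partition"}
	fmt.Println(filterExitEvents(events))
}
```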

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
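
A sketch of appending the relay state to the redirect URI; the query parameter name is an assumption:

```go
package main

import (
	"fmt"
	"net/url"
)

// withRelayState appends the relay state to the OIDC redirect URI's query
// string, since SAML-style redirect binding is not part of the OIDC standard.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState) // parameter name is illustrative
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	out, _ := withRelayState("https://master.example.com/oidc/callback", "/det/experiments/1")
	fmt.Println(out)
}
```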

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* chore: just warn on trying to set single resource as daemon, for now (#125)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also support the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable checkpoint GC tasks to complete successfully.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of Determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: With Slurm, detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.
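
The batch-level rule could be sketched as follows, with a channel standing in for the real container exit notifications:

```go
package main

import "fmt"

// onContainerExit is a sketch of the rule: the first non-zero exit fails the
// whole Slurm job so sibling containers don't wait forever.
func onContainerExit(exitCodes <-chan int, terminateAll func()) int {
	for code := range exitCodes {
		if code != 0 {
			terminateAll()
			return code
		}
	}
	return 0
}

func main() {
	codes := make(chan int, 2)
	codes <- 0
	codes <- 1
	close(codes)
	fmt.Println(onContainerExit(codes, func() { fmt.Println("terminating siblings") }))
}
```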

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because the variable names
include (), they break the SSH variable escaping
code. Avoid processing of these functions by adding any
variable name containing () to the blacklist.

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate the SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named
deps, but in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.
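
A toy sketch of the single-client policy; the registry shape is invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// clientRegistry enforces the single-application policy described above:
// only one OAuth client may be registered at a time.
type clientRegistry struct {
	client *string
}

func (r *clientRegistry) register(id string) error {
	if r.client != nil {
		return errors.New("an OAuth client application is already registered")
	}
	r.client = &id
	return nil
}

func main() {
	r := &clientRegistry{}
	fmt.Println(r.register("scim-idp"))
	fmt.Println(r.register("another")) // error: already registered
}
```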

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Onwer in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change udpates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has an cproto.RunArchive Path value that
that is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it not longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuraiton file as well.
Jobs do not currently succeed with mulitiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices.   If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic.   Generate a WARN if not
as expected.
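
As a rough illustration of the selection logic above, here is a minimal Go sketch (the real logic lives in the task entrypoint wrapper script; the names and the DET_SLOT_IDS format shown are illustrative assumptions):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// deriveSlotIDs prefers GPU_DEVICE_ORDINAL, which Slurm sets for both CUDA
// and ROCR devices, and warns when CUDA_VISIBLE_DEVICES disagrees with it.
func deriveSlotIDs() string {
	ordinal := os.Getenv("GPU_DEVICE_ORDINAL")
	if ordinal == "" {
		return "" // no GPUs allocated by Slurm
	}
	if cuda := os.Getenv("CUDA_VISIBLE_DEVICES"); cuda != "" && cuda != ordinal {
		fmt.Fprintf(os.Stderr,
			"WARN: CUDA_VISIBLE_DEVICES (%s) differs from GPU_DEVICE_ORDINAL (%s)\n",
			cuda, ordinal)
	}
	return strings.TrimSpace(ordinal) // e.g. "0,1,2,3" (format illustrative)
}

func main() {
	os.Setenv("DET_SLOT_IDS", deriveSlotIDs())
	fmt.Println("DET_SLOT_IDS =", os.Getenv("DET_SLOT_IDS"))
}
```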

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable checkpoint-gc tasks to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set into master.yaml, so they
have a single file to edit when configuring determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp as the container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires Singularity 3.7+).

* HAL-2892: Use / as the container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared.   This is not yet completely
generalized, so we need to update tasks.go to add the links
for now.  Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared.   This was previously
done using a static list of names.   Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined.   Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.
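
A minimal Go sketch of the dynamic-link idea, assuming a shared /run/determined and a container-private target directory; the entry names and the read-only exception list are illustrative, not the actual list:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkUnsharedEntries walks the entries unpacked under the shared directory
// and points each non-shared one at the container-private area via a
// softlink. Read-only entrypoint scripts are left in place.
func linkUnsharedEntries(sharedDir, privateDir string) error {
	readOnly := map[string]bool{ // illustrative exception list
		"command-entrypoint.sh": true,
		"shell-entrypoint.sh":   true,
	}
	entries, err := os.ReadDir(sharedDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if readOnly[e.Name()] {
			continue // safe to share: read-only
		}
		link := filepath.Join(sharedDir, e.Name())
		target := filepath.Join(privateDir, e.Name())
		if err := os.Symlink(target, link); err != nil && !os.IsExist(err) {
			return fmt.Errorf("linking %s: %w", e.Name(), err)
		}
	}
	return nil
}

func main() {
	if err := linkUnsharedEntries("/run/determined", "/"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```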

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.
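
A hedged sketch of the propagation, using stand-in state types rather than the real launcher DispatchState and job.SchedulingState enums:

```go
package main

import "fmt"

// Stand-in types; the real code maps the launcher's DispatchState onto
// Determined's job.SchedulingState.
type DispatchState string

type SchedulingState string

const (
	DispatchPending DispatchState = "PENDING"
	DispatchRunning DispatchState = "RUNNING"

	SchedulingQueued    SchedulingState = "QUEUED"
	SchedulingScheduled SchedulingState = "SCHEDULED"
)

// schedulingStateOf reports SCHEDULED once the underlying Slurm job runs, so
// the experiment no longer shows QUEUED forever.
func schedulingStateOf(d DispatchState) SchedulingState {
	if d == DispatchRunning {
		return SchedulingScheduled
	}
	return SchedulingQueued
}

func main() {
	fmt.Println(schedulingStateOf(DispatchRunning)) // SCHEDULED
}
```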

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI.   When mpirun is invoked, the SLURM_JOBID variable
triggers its integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have compatible configuration enabled.  We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun, describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the local node,
causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined.   This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.
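
A small Go sketch of the SLURM_JOBID workaround described above (the real change lives in the task wrapper; this only illustrates the environment scrubbing):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// runMPI launches mpirun with SLURM_JOBID stripped from its environment, so
// mpirun honors the host/process topology arguments from horovodrun instead
// of activating its own Slurm integration.
func runMPI(args ...string) error {
	cmd := exec.Command("mpirun", args...)
	for _, kv := range os.Environ() {
		if strings.HasPrefix(kv, "SLURM_JOBID=") {
			continue // drop it: it would trigger mpirun's Slurm mode
		}
		cmd.Env = append(cmd.Env, kv)
	}
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	if err := runMPI("-np", "2", "hostname"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```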

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules.  Because the variable
names include (), they break the SSH variable escaping
code.  Avoid processing of these functions by adding any
variable name containing () to the blacklist.
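
A minimal sketch of the blacklist rule, assuming the escaping code walks os.Environ(); the filter shown is only the name-contains-() check from this commit:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// sshSafeEnv drops environment entries whose names contain "()", i.e. the
// bash-exported functions that HPC module systems define, since those break
// the SSH variable-escaping code.
func sshSafeEnv() []string {
	var safe []string
	for _, kv := range os.Environ() {
		name := kv
		if i := strings.Index(kv, "="); i >= 0 {
			name = kv[:i]
		}
		if strings.Contains(name, "()") {
			continue // e.g. an exported "module" shell function
		}
		safe = append(safe, kv)
	}
	return safe
}

func main() {
	fmt.Println(len(sshSafeEnv()), "variables pass the filter")
}
```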

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the variable blacklist to names that include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
…
mackrorysd added a commit that referenced this pull request Aug 2, 2022
* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are: conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determined if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.
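
A hedged sketch of the single-client policy, with invented types (OAuthClient, ClientStore) standing in for the real storage layer:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// OAuthClient is an illustrative stand-in for a registered application.
type OAuthClient struct {
	ID, Secret, RedirectURI string
}

// ClientStore admits at most one registered client application.
type ClientStore struct {
	mu     sync.Mutex
	client *OAuthClient
}

// Register rejects a second registration, matching the described policy.
func (s *ClientStore) Register(c OAuthClient) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.client != nil {
		return errors.New("an OAuth client is already registered")
	}
	s.client = &c
	return nil
}

func main() {
	var store ClientStore
	fmt.Println(store.Register(OAuthClient{ID: "cli-1"})) // <nil>
	fmt.Println(store.Register(OAuthClient{ID: "cli-2"})) // rejected
}
```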

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
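
A minimal sketch of the relayState fix, assuming the value is carried in a query parameter (the parameter name here is illustrative):

```go
package main

import (
	"fmt"
	"net/url"
)

// withRelayState appends the relayState to the redirect URI's query string,
// since OIDC has no redirect-binding relayState of its own.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState) // parameter name is illustrative
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	out, _ := withRelayState("https://master.example.com/oidc/callback", "cli")
	fmt.Println(out)
}
```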

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
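
A rough sketch of the configurable claim-to-attribute match, using invented types; the claim and attribute names are assumed to come from configuration:

```go
package main

import "fmt"

// SCIMUser is an illustrative stand-in for a provisioned user record.
type SCIMUser struct {
	Attributes map[string]string // e.g. "userName", "externalId", "email"
}

// matchUser looks up the user whose configured SCIM attribute equals the
// configured OIDC claim, instead of hard-coding email.
func matchUser(users []SCIMUser, claims map[string]string, claimName, scimAttr string) *SCIMUser {
	want, ok := claims[claimName]
	if !ok {
		return nil
	}
	for i := range users {
		if users[i].Attributes[scimAttr] == want {
			return &users[i]
		}
	}
	return nil
}

func main() {
	users := []SCIMUser{{Attributes: map[string]string{"userName": "alice"}}}
	claims := map[string]string{"preferred_username": "alice"}
	fmt.Println(matchUser(users, claims, "preferred_username", "userName") != nil)
}
```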

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value that
is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices.   If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic.   Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set into master.yaml, so they
have a single file to edit when configuring determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp as the container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires Singularity 3.7+).

* HAL-2892: Use / as the container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI.   When mpirun is invoked, the SLURM_JOBID variable
triggers its integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have compatible configuration enabled.  We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun, describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the local node,
causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined.   This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the variable blacklist to names that include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing, it will
  exit with a success status.
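
A minimal Go sketch of the job-side half of this flow, with notifyPendingPreemption standing in for the real call to the master's pending_preemption API (the environment variable name is an assumption):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// notifyPendingPreemption stands in for the HTTP call to the master's
// pending_preemption endpoint.
func notifyPendingPreemption(allocationID string) {
	fmt.Println("notifying master: pending preemption for", allocationID)
}

func main() {
	// Treat SIGTERM from Slurm as notice of pending preemption.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		notifyPendingPreemption(os.Getenv("DET_ALLOCATION_ID"))
	}()
	time.Sleep(30 * time.Second) // stand-in for the actual training workload
}
```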

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents where expconf settings (if available) override pool level defaults override master config level defaults.
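
A tiny sketch of the described precedence (expconf over pool-level defaults over master-level defaults); the field layout is illustrative:

```go
package main

import "fmt"

// resolve returns the first non-nil value in precedence order:
// experiment config, then pool default, then master default.
func resolve(expconf, pool, master *string) string {
	for _, v := range []*string{expconf, pool, master} {
		if v != nil {
			return *v
		}
	}
	return ""
}

func main() {
	master := "default-iface"
	pool := "hsn0"
	fmt.Println(resolve(nil, &pool, &master)) // "hsn0": pool overrides master
}
```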

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the cpu as the slot type.

Also add a devcluster file for shuco for testing.
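
A simplified sketch of the fallback, assuming partition GPU counts are known; the real selection logic is more involved:

```go
package main

import "fmt"

type partition struct {
	name string
	gpus int
}

// defaultComputePool picks a GPU partition when one exists; otherwise it
// falls back to the default aux pool so requests for "" resolve to a pool
// whose (CPU-only) attributes are known.
func defaultComputePool(parts []partition, defaultAuxPool string) string {
	for _, p := range parts {
		if p.gpus > 0 {
			return p.name
		}
	}
	return defaultAuxPool
}

func main() {
	parts := []partition{{name: "cpuq", gpus: 0}}
	fmt.Println(defaultComputePool(parts, "cpuq")) // falls back to the aux pool
}
```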

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady, so to avoid
a race condition where the dispatcher RM only discovers the running
state by polling, force the state to running upon receipt
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in the OSS version of this fix to avoid a merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with determined jobs.   Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.
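
Taken together with the earlier FOUNDENG-41 fix, the cleanup rule can be sketched like this (a hedged illustration, not the actual code):

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// shouldDeleteDBEntry: success or a 404 from the launcher means the dispatch
// is gone, so the DB entry can be removed; any other error leaves the entry
// in place for a retry on restart.
func shouldDeleteDBEntry(status int, err error) bool {
	if err == nil || status == http.StatusNotFound {
		return true
	}
	return false // unknown state: keep for a later cleanup attempt
}

func main() {
	fmt.Println(shouldDeleteDBEntry(http.StatusNotFound, errors.New("not found"))) // true
	fmt.Println(shouldDeleteDBEntry(0, errors.New("connection refused")))          // false
}
```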

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB
we have ensured that the DispatchIDs (launches & environment) have
been removed by the launcher.

Use common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as a root user (the default if
no agent is configured).   Catch this error early in the dispatcher
RM by returning an error on any attempt to impersonate root
and reporting it explicitly.

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch.   These are mostly useless
status updates.  Fixes:
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
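
A minimal sketch of the event filter, with an invented event type:

```go
package main

import "fmt"

type event struct {
	Level, Message string
}

// reportable keeps only WARNING/ERROR level dispatch events when building an
// exit-status message, dropping routine status updates.
func reportable(events []event) []string {
	var out []string
	for _, e := range events {
		if e.Level != "WARNING" && e.Level != "ERROR" {
			continue // drop routine status updates
		}
		out = append(out, e.Message)
	}
	return out
}

func main() {
	evs := []event{{"INFO", "submitted"}, {"ERROR", "srun: launch failed"}}
	fmt.Println(reportable(evs)) // [srun: launch failed]
}
```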

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value that
is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices.   If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic.   Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable checkpoint-gc tasks to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set into master.yaml, so they
have a single file to edit when configuring determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp as the container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires Singularity 3.7+).

* HAL-2892: Use / as the container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared.   This is not yet completely
generalized, so we need to update tasks.go to add the links
for now.  Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared.   This was previously
done using a static list of names.   Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined.   Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI.   When mpirun is invoked, the SLURM_JOBID variable
triggers its integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have compatible configuration enabled.  We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun, describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the local node,
causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined.   This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the Slurm CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because those variables
include () in their names, they break the SSH variable-escaping
code. Avoid processing these functions by adding any
variable name containing () to the blacklist.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the Slurm-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by dispatcher-wrapper.sh.

Further extend the blacklist to variables whose names include % symbols (see the sketch below).
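
Taken together with the earlier () exclusion, the blacklist check amounts to something like the following sketch (the function name is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// blockedVarName reports whether an environment variable name should be
// skipped: exported shell functions carry "()" in their names and some
// module systems use "%", and both break the SSH variable-escaping code.
func blockedVarName(name string) bool {
	return strings.Contains(name, "()") || strings.Contains(name, "%")
}

func main() {
	for _, v := range []string{"PATH", "BASH_FUNC_module%%", "BASH_FUNC_ml()"} {
		fmt.Println(v, "blocked:", blockedVarName(v))
	}
}
```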

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some cleanup on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which can lead to
conflicts between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (a sketch of the container-side handling follows this list).
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption; the long poll that _preempt.py is performing
  returns, indicating that the job should preempt, and after checkpointing it
  exits with a success status.
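
A minimal sketch of the container-side half, assuming a helper that calls the new master API (notifyPendingPreemption is a stand-in name, and the self-signal here only simulates Slurm's SIGTERM):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// notifyPendingPreemption stands in for the call to the master's new
// pending_preemption API.
func notifyPendingPreemption(allocationID string) {
	fmt.Printf("notifying master: allocation %s has a pending preemption\n", allocationID)
}

func main() {
	// Intercept SIGTERM as Slurm's notification of pending preemption.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	// Simulate Slurm delivering SIGTERM ahead of preemption.
	syscall.Kill(os.Getpid(), syscall.SIGTERM)

	<-sigs
	notifyPendingPreemption("alloc-1234")
	// The long poll in _preempt.py then returns, the trial checkpoints,
	// and the job exits with a success status.
}
```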

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

so you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.
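
The precedence can be pictured with a small sketch (hypothetical helper; the real mechanism lives in the per-partition config code):

```go
package main

import "fmt"

// resolve returns the effective setting: expconf (if set) overrides the
// pool-level default, which overrides the master-config default.
func resolve(expconf, pool *string, master string) string {
	if expconf != nil {
		return *expconf
	}
	if pool != nil {
		return *pool
	}
	return master
}

func main() {
	poolVal := "gpu-partition-default"
	fmt.Println(resolve(nil, &poolVal, "master-default")) // pool wins
	fmt.Println(resolve(nil, nil, "master-default"))      // master default
}
```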

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute
pool (internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using CPU as the slot type.

Also add a devcluster file for shuco for testing.
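
A rough sketch of the fallback rule (names are illustrative, not the Dispatcher RM internals):

```go
package main

import "fmt"

type partition struct {
	name string
	gpus int
}

// defaultComputePool picks the first partition with GPUs; if no partition
// has any, it falls back to the default AUX pool so the pool name can be
// matched to real attributes and the slot type resolves to CPU.
func defaultComputePool(partitions []partition, defaultAuxPool string) string {
	for _, p := range partitions {
		if p.gpus > 0 {
			return p.name
		}
	}
	return defaultAuxPool
}

func main() {
	parts := []partition{{name: "debug", gpus: 0}, {name: "batch", gpus: 0}}
	fmt.Println(defaultComputePool(parts, "aux")) // "aux"
}
```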

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are: conflicts now correctly
return 409 instead of 500, password sync is supported, externalId is no
longer required, and the username/password that the IdP uses for basic
auth when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered; that's all we
need for this use case, and it simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
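
A minimal sketch of the idea, assuming the query parameter name (the actual parameter used by the master may differ):

```go
package main

import (
	"fmt"
	"net/url"
)

// addRelayState carries the relayState in the redirect URI's query string,
// since OIDC has no redirect-binding equivalent of SAML's relayState.
func addRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState) // parameter name is an assumption
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	uri, _ := addRelayState("https://master.example.com/oidc/callback", "cli")
	fmt.Println(uri)
}
```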

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might
want to publish EE images to NVCR, and revert a few diffs between the
config for EE vs OSS that don't seem intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
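
Conceptually the generalized match looks like this sketch (field names and configuration keys are assumptions):

```go
package main

import "fmt"

// matchUser compares a configurable OIDC claim against a configurable SCIM
// attribute, instead of hard-coding claims["email"] == scim["userName"].
func matchUser(claims, scimAttrs map[string]string, claimName, scimAttr string) bool {
	v, ok := claims[claimName]
	return ok && v == scimAttrs[scimAttr]
}

func main() {
	claims := map[string]string{"preferred_username": "alice"}
	scim := map[string]string{"userName": "alice"}
	fmt.Println(matchUser(claims, scim, "preferred_username", "userName")) // true
}
```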

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

mackrorysd pushed a commit that referenced this pull request Aug 2, 2022
This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other resource
states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state being
running immediately after AllocationReady. To avoid a race condition
where the dispatcher RM only discovers the running state by polling,
force the state to running upon receipt of the AllocationReady event.
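
A toy sketch of the forced transition (hypothetical types; the real handler lives in the dispatcher RM):

```go
package main

import "fmt"

type allocation struct{ state string }

// onAllocationReady forces the state to RUNNING as soon as the container
// reports readiness, instead of waiting for the next launcher poll to
// observe it, closing the race described above.
func (a *allocation) onAllocationReady() {
	if a.state != "RUNNING" {
		a.state = "RUNNING"
	}
	fmt.Println("allocation ready, state:", a.state)
}

func main() {
	a := &allocation{state: "STARTING"}
	a.onAllocationReady()
}
```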

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm (see the sketch after this list):

 - Always use the default partition for the aux partition.
 - Use the default partition for the GPU partition (unless it has no GPUs).

Updated unit tests.
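
A hypothetical rendering of the selection logic (not the actual unit-tested code):

```go
package main

import "fmt"

type partition struct {
	name      string
	isDefault bool
	totalGPUs int
}

// pickComputePartition prefers Slurm's default partition unless it has no
// GPUs, in which case it falls back to the first partition that does.
func pickComputePartition(parts []partition) string {
	var def *partition
	for i := range parts {
		if parts[i].isDefault {
			def = &parts[i]
		}
	}
	if def != nil && def.totalGPUs > 0 {
		return def.name
	}
	for _, p := range parts {
		if p.totalGPUs > 0 {
			return p.name
		}
	}
	if def != nil {
		return def.name
	}
	return ""
}

func main() {
	parts := []partition{{"defq", true, 0}, {"gpuq", false, 8}}
	fmt.Println(pickComputePartition(parts)) // "gpuq"
}
```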

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with Determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as Determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm.

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB
we have ensured that the DispatchIDs (launches & environment) have
been removed by the launcher.

Use common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).
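
The rule can be sketched as follows (function names are stand-ins for the launcher API call and the common cleanup method):

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// deleteDispatch stands in for the launcher's delete call; here it pretends
// the dispatch is already gone.
func deleteDispatch(id string) (int, error) {
	return http.StatusNotFound, errors.New("not found")
}

// cleanupDispatch treats a 404 as successful deletion; any other failure
// defers DB removal to a later pass (e.g. the reprocessing on startup).
func cleanupDispatch(id string) (removeFromDB bool) {
	code, err := deleteDispatch(id)
	if err == nil || code == http.StatusNotFound {
		return true
	}
	fmt.Printf("leaving dispatch %s in DB for retry: %v\n", id, err)
	return false
}

func main() {
	fmt.Println("remove from DB:", cleanupDispatch("dispatch-42"))
}
```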

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for the Dispatcher RM.
  - Small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https on port 8043 for development, so we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043.

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot run as the root user (the default if
no agent user is configured). Catch this error early in the dispatcher
RM by returning an error if an attempt is made to impersonate root,
and report it explicitly.
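
The early check amounts to something like this sketch (hypothetical function; the real validation sits in the dispatcher RM's launch path):

```go
package main

import (
	"errors"
	"fmt"
)

// checkImpersonatedUser fails fast when the job would run as root, instead
// of letting the launcher surface "ERROR: Cannot set UID=0" later.
func checkImpersonatedUser(username string) error {
	if username == "root" {
		return errors.New("slurm jobs cannot run as root; configure a non-root agent user")
	}
	return nil
}

func main() {
	fmt.Println(checkImpersonatedUser("root"))
	fmt.Println(checkImpersonatedUser("alice"))
}
```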

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes (see the sketch after this list):
- Include only WARNING/ERROR-level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
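
A minimal sketch of the filtering, with hypothetical event and prefix shapes:

```go
package main

import (
	"fmt"
	"strings"
)

type event struct {
	level   string
	message string
}

// reportableMessages keeps only WARNING/ERROR events and strips a noisy
// prefix from job-failure messages.
func reportableMessages(events []event, noisyPrefix string) []string {
	var out []string
	for _, e := range events {
		if e.level != "WARNING" && e.level != "ERROR" {
			continue
		}
		out = append(out, strings.TrimPrefix(e.message, noisyPrefix))
	}
	return out
}

func main() {
	evs := []event{
		{"INFO", "launcher: state change"},
		{"ERROR", "launcher: srun: error: task 0 failed"},
	}
	fmt.Println(reportableMessages(evs, "launcher: "))
}
```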

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* chore: change branding to MLDE (#114)

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* fix: use correct tags for EE release

* chore: just warn on trying to set single resource as daemon, for now (#125)

* Support environment configuration on singularity (#152)

Address TODOs:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also support the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable checkpoint GC tasks to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems withoug -G support (#159)

We still had a hack in place form our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scrtips as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892:  Use / as contianer-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to setup links for files that
are not intended to be shared.   This is not yet completely
generalized, so we need to update tasks.go to add the links
for now.  Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to setup softlinks for each element within /run/determined
directory that is not treated as shared.   This was previously
done using a static list of names.   Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined.   Also add exceptions for command and
shell entrypoints which are read-only so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a Singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the
local node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it affects the [rank=#] logging prefix, so it improves the user experience
and debugging.
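
A hedged Go sketch of the two fixes described above (the shipped change lives in the task wrapper scripts, so this is only an illustration of the idea):

```go
package launch

import (
	"os"
	"os/exec"
)

// runMPI clears SLURM_JOBID so mpirun honors the host/topology args passed by
// horovodrun, and falls back to the OpenMPI rank for log prefixes.
func runMPI(args []string) error {
	os.Unsetenv("SLURM_JOBID") // keep mpirun from auto-integrating with Slurm

	if os.Getenv("HOROVOD_RANK") == "" {
		// Only affects the [rank=#] logging prefix, not correctness.
		os.Setenv("HOROVOD_RANK", os.Getenv("OMPI_COMM_WORLD_RANK"))
	}

	cmd := exec.Command("mpirun", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}
```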

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.
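
A small Go sketch of the failure rule above, assuming a hypothetical channel of container exit codes:

```go
package supervise

import "fmt"

// superviseContainers terminates the remaining containers as soon as any one
// of them exits non-zero, instead of letting peers wait forever.
func superviseContainers(exitCodes <-chan int, killAll func()) error {
	for code := range exitCodes {
		if code != 0 {
			killAll()
			return fmt.Errorf("container exited with status %d", code)
		}
	}
	return nil // happy path: every container exited zero
}
```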

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because these variables
include () in their names, they break the SSH variable-escaping
code. Avoid processing of these functions by adding any
variable name containing () to the blacklist.
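
A minimal sketch of the blacklist rule, in Go for illustration (the real escaping code is in the shell/SSH plumbing); it also covers the %-name extension mentioned in a later commit:

```go
package env

import "strings"

// forwardableEnv drops any NAME=value pair whose NAME contains "()" (exported
// bash functions) or '%', since those break the SSH variable escaping.
func forwardableEnv(environ []string) []string {
	var out []string
	for _, kv := range environ {
		name := kv
		if i := strings.IndexByte(kv, '='); i >= 0 {
			name = kv[:i]
		}
		if strings.Contains(name, "()") || strings.ContainsRune(name, '%') {
			continue
		}
		out = append(out, kv)
	}
	return out
}
```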

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist to variables whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirected the stdin/stdout link via /tmp (which
Singularity maps to /tmp on the host, and which could lead to
conflicts between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (a sketch follows the list).
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption, and the long poll that _preempt.py is performing will
  return, indicating that the job should preempt; after checkpointing, it will
  exit with a success status.
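
A hedged Go sketch of the SIGTERM-to-master relay; the endpoint path and payload are assumptions based on the commit text, not the actual API:

```go
package preempt

import (
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

// watchForPreemption notifies the master when Slurm warns of preemption.
func watchForPreemption(masterURL, allocationID string) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGTERM) // Slurm's pending-preemption warning
	go func() {
		<-ch
		// Hypothetical endpoint; the master then triggers ReleaseResources
		// with ForcePreemption for this allocation.
		resp, err := http.Post(
			masterURL+"/api/v1/allocations/"+allocationID+"/pending_preemption",
			"application/json", nil)
		if err == nil {
			resp.Body.Close()
		}
	}()
}
```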

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents, where expconf settings (if available) override pool-level defaults, which override master-config-level defaults.
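
The precedence can be summarized with a tiny Go sketch (names illustrative):

```go
package confres

// resolve returns the first configured value in precedence order:
// expconf, then the pool-level default, then the master-config default.
func resolve(expconf, pool, master *string) string {
	for _, v := range []*string{expconf, pool, master} {
		if v != nil {
			return *v
		}
	}
	return ""
}
```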

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.
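
A rough Go sketch of the fallback (the partition type is a local stand-in, not the real dispatcher RM structures):

```go
package pools

type partition struct {
	name string
	gpus int
}

// defaultComputePool returns a GPU partition when one exists; otherwise it
// reuses the default aux pool so slot-type resolution correctly lands on cpu.
func defaultComputePool(parts []partition, defaultAux string) string {
	for _, p := range parts {
		if p.gpus > 0 {
			return p.name
		}
	}
	return defaultAux
}
```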

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are: conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.
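
A simplified Go sketch of the localhost hand-off (the query parameter name is an assumption):

```go
package authflow

import (
	"fmt"
	"net"
	"net/http"
)

// captureToken listens on a random loopback port; the browser is redirected
// here with the auth token, which the CLI then stores.
func captureToken() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	tokens := make(chan string, 1)
	srv := &http.Server{Handler: http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			tokens <- r.URL.Query().Get("token") // parameter name assumed
			fmt.Fprintln(w, "Login complete; you may close this tab.")
		})}
	go srv.Serve(ln)
	defer srv.Close()
	return <-tokens, nil
}
```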

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve the named deps of
it, but in CI we haven't run apt-get update, so we can't find it.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
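
In Go terms, the fix amounts to something like this sketch (the query parameter name is an assumption):

```go
package oidc

import "net/url"

// withRelayState appends the relay state to the redirect URI's query string.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relay_state", relayState) // parameter name assumed
	u.RawQuery = q.Encode()
	return u.String(), nil
}
```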

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify as a diagnostic that the specified CUDA devices are
visible from the nvidia-smi command. Generate a WARN if not
as expected.
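
A hedged Go sketch of the derivation and the diagnostic (the real logic is in the wrapper scripts):

```go
package wrapper

import (
	"log"
	"os"
)

// slotIDs derives DET_SLOT_IDS from GPU_DEVICE_ORDINAL, warning when
// CUDA_VISIBLE_DEVICES disagrees with it.
func slotIDs() string {
	ordinals := os.Getenv("GPU_DEVICE_ORDINAL") // e.g. "0,1,3"
	if cv := os.Getenv("CUDA_VISIBLE_DEVICES"); cv != "" && cv != ordinals {
		log.Printf("WARN: CUDA_VISIBLE_DEVICES=%q does not match GPU_DEVICE_ORDINAL=%q",
			cv, ordinals)
	}
	return ordinals
}
```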

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.
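
An illustrative persistence call (table and column names are guesses based on the commit text, not the actual schema):

```go
package db

import "database/sql"

// insertDispatch records the allocation -> dispatch mapping so logs can be
// queried later by allocation.
func insertDispatch(db *sql.DB, allocationID, dispatchID string) error {
	_, err := db.Exec(
		`INSERT INTO dispatches (allocation_id, dispatch_id) VALUES ($1, $2)`,
		allocationID, dispatchID)
	return err
}
```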

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since erroring out blocks testing); a sketch follows.
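
A self-contained Go sketch of the idea, with local stand-in types for the message kinds named above:

```go
package rm

type (
	setGroupWeight   struct{}
	setGroupPriority struct{}
	moveJob          struct{}
)

// handle swallows the unsupported scheduler messages instead of failing.
func handle(msg interface{}) {
	switch msg.(type) {
	case setGroupWeight, setGroupPriority, moveJob:
		// Not supported by the dispatcher RM; ignore so testing can proceed.
	default:
		// Route everything else to the normal handlers.
	}
}
```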

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also support the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in place instead of after the fact.
 - Enable /run/determined/workdir to be created in place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: the slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured with SelectType=select/cons_tres and GRES GPU
information, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the
--writable-tmpfs option), it does not impact /tmp or /dev/shm which
customers may be using, and it is, surprisingly, writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so for now we need to update tasks.go to add the
links. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a Singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the
local node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it affects the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because these variables
include () in their names, they break the SSH variable-escaping
code. Avoid processing of these functions by adding any
variable name containing () to the blacklist.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist to variables whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
…
mackrorysd added a commit that referenced this pull request Aug 2, 2022
* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are: conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve the named deps of
it, but in CI we haven't run apt-get update, so we can't find it.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify as a diagnostic that the specified CUDA devices are
visible from the nvidia-smi command. Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since erroring out blocks testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in place instead of after the fact.
 - Enable /run/determined/workdir to be created in place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: the slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured with SelectType=select/cons_tres and GRES GPU
information, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the
--writable-tmpfs option), it does not impact /tmp or /dev/shm which
customers may be using, and it is, surprisingly, writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a Singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the
local node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it affects the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist to variables whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirected the stdin/stdout link via /tmp (which
Singularity maps to /tmp on the host, and which could lead to
conflicts between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption, and the long poll that _preempt.py is performing will
  return, indicating that the job should preempt; after checkpointing, it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents, where expconf settings (if available) override pool-level defaults, which override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady. To avoid
a race condition where the dispatcher RM only discovers the running
state by polling, force the state to running upon receipt
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in the OSS version of this fix to avoid a merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests; a sketch of the selection rule follows.
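
A Go sketch of the selection rule above (partition is a local stand-in type, not the real dispatcher RM structures):

```go
package pools

type partition struct {
	name      string
	isDefault bool
	gpus      int
}

// pickComputePool favors the default partition whenever it has GPUs,
// otherwise falls back to any GPU partition, then to the default (aux).
func pickComputePool(parts []partition) string {
	var def, firstGPU string
	for _, p := range parts {
		if p.isDefault {
			def = p.name
			if p.gpus > 0 {
				return p.name
			}
		}
		if firstGPU == "" && p.gpus > 0 {
			firstGPU = p.name
		}
	}
	if firstGPU != "" {
		return firstGPU
	}
	return def
}
```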

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master, there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm.

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.
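
Taken together with the earlier 404 fix, the cleanup rule reduces to something like this Go sketch:

```go
package cleanup

import "net/http"

// cleanupDispatch deletes the DB record when the launcher confirms deletion
// (success or 404); on any other error the record is kept for a later retry.
func cleanupDispatch(status int, err error, deleteRow func() error) error {
	if err == nil || status == http.StatusNotFound {
		return deleteRow()
	}
	return err // unknown launcher state; retry cleanup on next startup
}
```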

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to ensure that when a Dispatch is dropped from the DB,
the DispatchIDs (launches & environment) have
been removed by the launcher.

Use common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update casablanca-login devcluster config to use :8043

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as the root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error if an attempt is made to impersonate root,
and report it explicitly.
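
A minimal Go sketch of the early check (local stand-in type):

```go
package security

import "errors"

type user struct {
	name string
	uid  int
}

// validateUser refuses to dispatch work that would run as root on the cluster.
func validateUser(u *user) error {
	if u == nil || u.uid == 0 {
		return errors.New("slurm jobs cannot run as root (UID 0); configure a non-root user")
	}
	return nil
}
```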

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes (a sketch follows the list):
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
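
A Go sketch of the filtering (the event type and the trimmed prefix here are assumptions, not the real launcher event schema):

```go
package report

import "strings"

type event struct {
	level string // e.g. INFO, WARNING, ERROR
	msg   string
}

// reportableEvents keeps only WARNING/ERROR launcher events and strips a
// noisy failure prefix before surfacing them in the exit status.
func reportableEvents(events []event) []string {
	var out []string
	for _, e := range events {
		if e.level != "WARNING" && e.level != "ERROR" {
			continue // routine status updates add noise
		}
		out = append(out, strings.TrimPrefix(e.msg, "Slurm job failed: "))
	}
	return out
}
```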

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package; this fixes an -ee-specific reference that was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices.   If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic.   Generate a WARN if not
as expected.
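
A minimal Go sketch of deriving DET_SLOT_IDS from GPU_DEVICE_ORDINAL; the exact DET_SLOT_IDS output format here is an assumption for illustration:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// slotIDsFromOrdinals converts Slurm's GPU_DEVICE_ORDINAL (e.g. "0,1,2")
// into a JSON-style list for DET_SLOT_IDS (e.g. "[0,1,2]"); the target
// format is assumed, not taken from the harness.
func slotIDsFromOrdinals(ordinals string) string {
	ids := strings.Split(ordinals, ",")
	return "[" + strings.Join(ids, ",") + "]"
}

func main() {
	if v, ok := os.LookupEnv("GPU_DEVICE_ORDINAL"); ok {
		fmt.Println("DET_SLOT_IDS=" + slotIDsFromOrdinals(v))
	}
}
```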

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.
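
A hedged sketch of what one such query might look like in Go; the table and column names are assumptions, not the real schema:

```go
package db

import (
	"context"
	"database/sql"
)

// DispatchMapping is an illustrative row tying a launcher DispatchID to a
// Determined allocation; the column names here are assumptions.
type DispatchMapping struct {
	DispatchID   string
	AllocationID string
	ResourceID   string
}

// ListDispatchesByAllocationID returns the dispatches recorded for one
// allocation, e.g. to locate its logs.
func ListDispatchesByAllocationID(
	ctx context.Context, db *sql.DB, allocationID string,
) ([]DispatchMapping, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT dispatch_id, allocation_id, resource_id
		   FROM dispatches WHERE allocation_id = $1`, allocationID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var out []DispatchMapping
	for rows.Next() {
		var d DispatchMapping
		if err := rows.Scan(&d.DispatchID, &d.AllocationID, &d.ResourceID); err != nil {
			return nil, err
		}
		out = append(out, d)
	}
	return out, rows.Err()
}
```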

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID call and return if there's an error, instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since it prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also support the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).
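
For illustration, a hedged sketch of the resulting master.yaml shape; the exact keys are inferred from this description rather than taken from the shipped config:

```yaml
resource_manager:
  type: dispatcher   # renamed to "slurm" later in this branch (FOUNDENG-24)
  slot_type: cpu     # defaults to cuda; use cpu when Slurm lacks
                     # SelectType=select/cons_tres with GRES GPU info
```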

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml, so they
have a single file to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892:  Use / as container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared.   This is not yet completely
generalized, so we need to update tasks.go to add the links
for now.  Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared.   This was previously
done using a static list of names.   Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined.   Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.
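
A minimal Go sketch of the propagation idea; the enum values and dispatch state strings are illustrative stand-ins:

```go
package main

import "fmt"

// SchedulingState mirrors Determined's job scheduling states (illustrative).
type SchedulingState int

const (
	SchedulingStateQueued SchedulingState = iota
	SchedulingStateScheduled
)

// schedulingStateFromDispatchState maps a launcher dispatch state onto the
// job queue state so experiments stop showing QUEUED once they run.
func schedulingStateFromDispatchState(dispatchState string) SchedulingState {
	switch dispatchState {
	case "RUNNING", "TERMINATING":
		return SchedulingStateScheduled
	default: // PENDING, UNKNOWN, ...
		return SchedulingStateQueued
	}
}

func main() {
	fmt.Println(schedulingStateFromDispatchState("RUNNING") == SchedulingStateScheduled)
}
```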

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI.   When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have compatible configuration enabled.  We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
host, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined.   This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so fixing it improves the user experience
and debugging.
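
The actual fix lives in the entrypoint scripts, but the same idea sketched in Go: exec mpirun with SLURM_JOBID removed from the child environment:

```go
package main

import (
	"os"
	"os/exec"
	"strings"
)

// runWithoutSlurmJobID runs a command with SLURM_JOBID stripped from the
// environment, so mpirun honors horovodrun's host/topology args instead of
// trying to integrate with Slurm from inside the container.
func runWithoutSlurmJobID(name string, args ...string) error {
	env := make([]string, 0, len(os.Environ()))
	for _, kv := range os.Environ() {
		if strings.HasPrefix(kv, "SLURM_JOBID=") {
			continue // drop the variable that triggers Slurm integration
		}
		env = append(env, kv)
	}
	cmd := exec.Command(name, args...)
	cmd.Env = env
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	_ = runWithoutSlurmJobID("mpirun", "--version")
}
```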

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this a failure on horovodrun args, or other such
errors would exit the one container, but others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the Slurm CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules.  Because the variable
names include (), they break the SSH variable escaping
code.  Avoid processing of these functions by adding any
variable name containing () to the blacklist.
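
A minimal Go sketch of the blacklist test described here (folding in the % exclusion added by a later commit, #175); purely illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// propagatable reports whether an environment variable name is safe for the
// SSH variable-escaping code: exported module functions carry "()" in their
// names, and names containing "%" are also excluded (per #175).
func propagatable(name string) bool {
	return !strings.Contains(name, "()") && !strings.Contains(name, "%")
}

func main() {
	for _, n := range []string{"PATH", "module()", "BASH_FUNC_ml%%"} {
		fmt.Printf("%-16s %v\n", n, propagatable(n))
	}
}
```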

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from Slurm-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further enhance the blacklist of variables to include those with % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
  (see the sketch after this list).
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption; the long poll _preempt.py is performing will
  return indicating that the job should preempt, and after checkpointing it will
  exit with a success status.
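
A hedged Go sketch of the task-side half of this flow; the endpoint path and wiring are assumptions, not the real pending_preemption API surface:

```go
package main

import (
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

// notifyPendingPreemption traps SIGTERM (Slurm's preemption warning) and
// tells the master, which releases resources with ForcePreemption so the
// trial checkpoints and exits cleanly. The URL shape is illustrative only.
func notifyPendingPreemption(masterURL, allocationID string) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		url := masterURL + "/api/v1/allocations/" + allocationID + "/pending_preemption"
		resp, err := http.Post(url, "application/json", nil)
		if err == nil {
			resp.Body.Close()
		}
	}()
}

func main() {
	notifyPendingPreemption("http://localhost:8080", "alloc-id")
	select {} // block; a real task would do its training work here
}
```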

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents, where expconf settings (if available) override pool-level defaults, which override master-config-level defaults.
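
A minimal Go sketch of the three-level override resolution described above (names are illustrative):

```go
package main

import "fmt"

// resolve returns the highest-priority setting: expconf overrides
// pool-level defaults, which override master-config defaults.
func resolve(expconf, pool, master *string) string {
	for _, v := range []*string{expconf, pool, master} {
		if v != nil {
			return *v
		}
	}
	return ""
}

func main() {
	poolVal := "gpuq-iface"
	masterVal := "eth0"
	// No expconf setting, so the pool-level default wins.
	fmt.Println(resolve(nil, &poolVal, &masterVal)) // "gpuq-iface"
}
```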

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the cpu as the slot type.

Also add a devcluster file for shuco for testing.
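
A minimal Go sketch of the fallback rule; the Partition shape is an assumption:

```go
package main

import "fmt"

// Partition is a minimal stand-in for a Slurm partition summary.
type Partition struct {
	Name string
	GPUs int
}

// defaultComputePool returns the pool to use for compute jobs: the first
// partition with GPUs, or the aux (CPU) pool when no partition has any,
// so requests for the default pool match real pool attributes instead of "".
func defaultComputePool(parts []Partition, auxPool string) string {
	for _, p := range parts {
		if p.GPUs > 0 {
			return p.Name
		}
	}
	return auxPool
}

func main() {
	parts := []Partition{{Name: "defq", GPUs: 0}}
	fmt.Println(defaultComputePool(parts, "defq")) // CPU-only cluster -> aux pool
}
```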

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are: conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.
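
The CLI itself is not written in Go, but the mechanism can be sketched as a one-shot localhost listener that receives the redirect and reads the token; the query parameter name is an assumption:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

// captureToken starts a one-shot listener on localhost; the master redirects
// the browser here after SSO with the auth token in the query string, so the
// CLI can store it without the user pasting anything.
func captureToken() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	tokenCh := make(chan string, 1)
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tokenCh <- r.URL.Query().Get("token") // parameter name is illustrative
		fmt.Fprintln(w, "You may close this tab.")
	})}
	go srv.Serve(ln)
	defer srv.Close()
	fmt.Println("waiting for redirect on http://" + ln.Addr().String())
	return <-tokenCh, nil
}

func main() {
	token, _ := captureToken()
	fmt.Println("token:", token)
}
```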

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered; that's
all we need for this use case and it simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices.   If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic.   Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's re…
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices.   If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic.   Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID call and return if there's an error, instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since it prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml, so they
have a single file to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892:  Use / as container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI.   When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have compatible configuration enabled.  We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
host, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined.   This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so fixing it improves the user experience
and debugging.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from Slurm-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further enhance the blacklist of variables to include those with % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption; the long poll _preempt.py is performing will
  return indicating that the job should preempt, and after checkpointing it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents, where expconf settings (if available) override pool-level defaults, which override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being RUNNING immediately after AllocationReady. To avoid
a race condition where the dispatcher RM would only discover the
running state later by polling, force the state to RUNNING upon receipt
of the AllocationReady event.
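
A minimal Go sketch of forcing the RUNNING state on AllocationReady; the types and state names are illustrative stand-ins:

```go
package main

import "fmt"

// State is a minimal allocation state enum for illustration.
type State string

const (
	Pulling State = "PULLING"
	Running State = "RUNNING"
)

// Allocation is a stand-in for the master's allocation record.
type Allocation struct{ State State }

// onAllocationReady forces the allocation to RUNNING as soon as the
// container reports ready, rather than waiting for the next launcher poll,
// closing the race that shells/notebooks hit right after AllocationReady.
func onAllocationReady(a *Allocation) {
	if a.State != Running {
		a.State = Running
	}
}

func main() {
	a := &Allocation{State: Pulling}
	onAllocationReady(a)
	fmt.Println(a.State)
}
```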

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.
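
A minimal Go sketch of this selection rule; the Partition shape is an assumption:

```go
package main

import "fmt"

// Partition is a minimal Slurm partition summary for this sketch.
type Partition struct {
	Name    string
	Default bool
	GPUs    int
}

// pickComputePartition applies the FOUNDENG-40 rule: always favor the
// cluster's default partition, unless compute needs GPUs and it has none.
func pickComputePartition(parts []Partition) string {
	var gpuFallback string
	for _, p := range parts {
		if p.Default && p.GPUs > 0 {
			return p.Name // default partition has GPUs: use it
		}
		if p.GPUs > 0 && gpuFallback == "" {
			gpuFallback = p.Name
		}
	}
	if gpuFallback != "" {
		return gpuFallback // default had no GPUs, another partition does
	}
	for _, p := range parts {
		if p.Default {
			return p.Name // no GPUs anywhere: fall back to the default
		}
	}
	return ""
}

func main() {
	fmt.Println(pickComputePartition([]Partition{
		{Name: "defq", Default: true, GPUs: 0},
		{Name: "gpuq", GPUs: 4},
	})) // "gpuq"
}
```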

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with determined jobs.   Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.
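
A hedged Go sketch of the startup cleanup loop; the launcher interface and row-deletion hook are assumptions for illustration:

```go
package rm

import (
	"context"
	"fmt"
)

// Dispatch ties a launcher DispatchID to a Determined allocation (sketch).
type Dispatch struct{ ID string }

// launcherClient is the subset of the launcher API assumed here.
type launcherClient interface {
	Terminate(ctx context.Context, dispatchID string) error
}

// cleanupOnStartup terminates every dispatch left over from a previous
// master instance and drops its DB row, so restarted experiments resume
// from their last checkpoint without duplicate Slurm jobs.
func cleanupOnStartup(
	ctx context.Context, l launcherClient, stale []Dispatch, deleteRow func(id string) error,
) error {
	for _, d := range stale {
		if err := l.Terminate(ctx, d.ID); err != nil {
			return fmt.Errorf("terminating dispatch %s: %w", d.ID, err)
		}
		if err := deleteRow(d.ID); err != nil {
			return fmt.Errorf("deleting dispatch %s: %w", d.ID, err)
		}
	}
	return nil
}
```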

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting back to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm.

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to ensure that when a Dispatch is dropped from the DB,
the DispatchIDs (launches & environment) have
been removed by the launcher.

Use a common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to use https:8043 in development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043.

* fix: Fix parameter order Dispatch cleanup (#230)

The parameters to the resourceQueryPostActions cleanup
method were reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as the root user (the default if
no agent is configured).   Catch this error early in the dispatcher
RM by returning an error on any attempt to impersonate root,
and report it explicitly.

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch.   These are mostly useless
status updates.  Fixes:
- Include only WARNING/ERROR level messages in the reporting, since
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
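
A sketch of the idea using net/url; the query-parameter name is illustrative:

```go
package oidc

import "net/url"

// appendRelayState adds the relay state to the OIDC redirect URI's query
// string, since OIDC (unlike SAML's redirect binding) has no standard
// relayState parameter.
func appendRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relay_state", relayState) // illustrative parameter name
	u.RawQuery = q.Encode()
	return u.String(), nil
}
```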

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is set,
verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected.
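
A rough shell sketch of that derivation as an entrypoint wrapper might do it (the exact DET_SLOT_IDS format and fallbacks are simplified):

```bash
#!/bin/bash
# Derive DET_SLOT_IDS from the Slurm-provided GPU list; GPU_DEVICE_ORDINAL
# covers both CUDA and ROCR devices.
if [ -n "$GPU_DEVICE_ORDINAL" ]; then
    export DET_SLOT_IDS="[$GPU_DEVICE_ORDINAL]"
else
    export DET_SLOT_IDS="[0]"
fi

# Diagnostic only: WARN if CUDA_VISIBLE_DEVICES disagrees with what
# nvidia-smi actually reports on this node.
if [ -n "$CUDA_VISIBLE_DEVICES" ] && command -v nvidia-smi >/dev/null 2>&1; then
    visible=$(nvidia-smi --query-gpu=index --format=csv,noheader | paste -sd, -)
    if [ "$visible" != "$CUDA_VISIBLE_DEVICES" ]; then
        echo "WARN: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES but nvidia-smi reports $visible" >&2
    fi
fi
```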

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID call and return if there's an error, instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.
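
The shape the review converged on, sketched with hypothetical types and names:

```go
package dispatcher

// dispatch and listDispatchesByAllocationID stand in for the real model
// type and DB query (hypothetical signatures).
type dispatch struct{ DispatchID string }

func listDispatchesByAllocationID(allocationID string) ([]dispatch, error) {
	return nil, nil // placeholder
}

// killAllocationDispatches checks the error immediately and returns it,
// keeping the happy path out of an else branch.
func killAllocationDispatches(allocationID string, kill func(dispatch)) error {
	dispatches, err := listDispatchesByAllocationID(allocationID)
	if err != nil {
		return err // early return, per review
	}
	for _, d := range dispatches {
		kill(d)
	}
	return nil
}
```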

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since it prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).
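
On such a cluster the master config might look roughly like the following sketch (field placement follows the commit text, not a verified schema):

```yaml
resource_manager:
  type: slurm
  # Defaults to cuda. Switch to cpu when Slurm lacks
  # SelectType=select/cons_tres and GRES GPU information.
  slot_type: cpu
```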

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793
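
In an experiment config this might look something like the sketch below (the keys and option values are illustrative, not the verified expconf schema):

```yaml
environment:
  # Slurm options passed through for this experiment, analogous in
  # purpose to pod_spec on Kubernetes.
  slurm:
    - --constraint=volta
    - --mem-per-cpu=4G
```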

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container, and SLURM
may or may not have compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the
local node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user
experience and debugging.
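
As a sketch, the container-side workaround amounts to:

```bash
# Hide Slurm from Open MPI inside the container so mpirun honors the
# host/topology args horovodrun passes, instead of trying to launch
# every -np process locally and oversubscribing.
unset SLURM_JOBID

# For the [rank=N] log prefix, fall back to the Open MPI rank when
# HOROVOD_RANK is not set.
export HOROVOD_RANK="${HOROVOD_RANK:-$OMPI_COMM_WORLD_RANK}"
```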

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this a failure on horovodrun args, or other such
errors would exit the one container, but others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOT_IDS from the Slurm CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because the variable names
include (), they break the SSH variable escaping
code. Avoid processing of these functions by adding any
variable name containing () to the blacklist.
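
A sketch of the blacklist idea; the real escaping code may structure this differently:

```bash
# Collect variable names from env(1) output, skipping exported shell
# functions (their names contain "()"), which break the ssh
# variable-escaping code.
forward_vars=""
for name in $(env | sed -n 's/^\([^ =][^=]*\)=.*/\1/p'); do
    case "$name" in
        *'()'*) continue ;;  # module-system functions: blacklist
    esac
    forward_vars="$forward_vars $name"
done
```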

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the Slurm-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist to variables whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which can lead to
conflicts between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (sketched below).
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption; the long poll that _preempt.py is performing
  will return indicating that the job should preempt, and after checkpointing it
  will exit with a success status.
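
A container-side sketch of the SIGTERM hook (the endpoint path is illustrative):

```bash
# On SIGTERM from Slurm, tell the master a preemption is pending; the
# dispatcher RM then releases the allocation with ForcePreemption, and
# the trial checkpoints and exits cleanly via its preemption long poll.
notify_pending_preemption() {
    curl -s -X POST \
        "$DET_MASTER/api/v1/allocations/$DET_ALLOCATION_ID/pending_preemption"
}
trap notify_pending_preemption TERM
```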

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

so you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents where expconf settings (if available) override pool level defaults override master config level defaults.
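
Conceptually the layering looks like this sketch (key names are illustrative, not the shipped schema):

```yaml
resource_manager:
  type: slurm
  # Cluster-wide default, lowest precedence...
  rendezvous_network_interface: eth0
  # ...overridden per Slurm partition, which in turn can be overridden
  # by expconf settings on the experiment itself.
  partition_overrides:
    gpu_volta:
      rendezvous_network_interface: hsn0
```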

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now
correctly return 409 not 500, password sync is supported, externalId is
not required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.
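
In master.yaml that configurability might look like this sketch (the exact keys and values are illustrative):

```yaml
scim:
  enabled: true
  auth:
    type: basic
    # Credentials the IdP presents when calling our SCIM endpoints.
    username: scim-client
    password: choose-a-strong-secret
```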

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so we can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

eecsliu added a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are conflicts now correctly
returns 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determined if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve the named deps of
it but, in CI we haven't run apt-get update so we can't find it.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Onwer in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change udpates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has an cproto.RunArchive Path value that
that is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it not longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuraiton file as well.
Jobs do not currently succeed with mulitiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured vi  checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate dispather RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUS:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS which should work
for both CUDA and ROCR devices.   If  CUDA_VISIBLE_DEVICES is
set verify that the specified CUDA devices are visible from
nvidia_smi command as a diagnositic.   Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon which doesn't support --gpus allocation.
This stil needs more work so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* cleanup the last bad code from when i scaffold-ed this out very quickly
* make some small stylistic go changes as i was reading along
* ticket anything that i noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

The dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since that prevents testing).
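
A sketch of the suppression, with local stand-in types for the scheduler messages named above (the real types live in the master's job package, and the real dispatch happens inside the actor's Receive):

```go
package main

import "fmt"

// Stand-ins for the scheduler messages named in the commit.
type (
	SetGroupWeight   struct{ Weight float64 }
	SetGroupPriority struct{ Priority int }
	MoveJob          struct{ Ahead bool }
)

// receive sketches the fix: unsupported weight/move messages are logged
// and ignored instead of being treated as fatal "unexpected message"
// errors that would kill the resource manager during testing.
func receive(msg interface{}) error {
	switch msg.(type) {
	case SetGroupWeight, SetGroupPriority, MoveJob:
		fmt.Printf("dispatcher RM: ignoring unsupported message %T\n", msg)
		return nil
	default:
		return fmt.Errorf("unexpected message %T", msg)
	}
}

func main() {
	_ = receive(SetGroupWeight{Weight: 1})
}
```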

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - enable /run/determined/ssh to be created in place instead of after the fact.
 - enable /run/determined/workdir to be created in place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured with SelectType=select/cons_tres and GRES GPU
information, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.
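
A sketch of the propagation described above, with hypothetical state names standing in for the launcher's DispatchState and Determined's job.SchedulingState; only the shape of the mapping is taken from the commit text:

```go
package main

import "fmt"

// Illustrative stand-ins for the two state spaces.
type DispatchState string
type SchedulingState string

const (
	DispatchPending DispatchState = "PENDING"
	DispatchRunning DispatchState = "RUNNING"

	SchedulingQueued    SchedulingState = "QUEUED"
	SchedulingScheduled SchedulingState = "SCHEDULED"
)

// toSchedulingState flips the allocation out of QUEUED once the
// dispatch reports RUNNING, so the experiment status stops lying.
func toSchedulingState(d DispatchState) SchedulingState {
	if d == DispatchRunning {
		return SchedulingScheduled
	}
	return SchedulingQueued
}

func main() {
	fmt.Println(toSchedulingState(DispatchRunning)) // SCHEDULED
}
```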

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When mpirun is invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a Singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it determines the [rank=#] logging prefix, so it improves the user experience
and debugging.
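
A rough Go rendering of the two tweaks, using the real variable names; the actual change lives in the task setup scripts, so treat this as a sketch:

```go
package main

import (
	"fmt"
	"os"
)

// prepareMPIEnv drops SLURM_JOBID so mpirun honors the host/topology
// args horovodrun passes instead of auto-integrating with Slurm, and
// falls back to OMPI_COMM_WORLD_RANK for the [rank=#] log prefix when
// HOROVOD_RANK is unset.
func prepareMPIEnv() string {
	os.Unsetenv("SLURM_JOBID")
	rank := os.Getenv("HOROVOD_RANK")
	if rank == "" {
		rank = os.Getenv("OMPI_COMM_WORLD_RANK")
	}
	return rank
}

func main() {
	os.Setenv("OMPI_COMM_WORLD_RANK", "3")
	fmt.Printf("[rank=%s]\n", prepareMPIEnv()) // [rank=3]
}
```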

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further enhance the blacklist of variables to include those whose names contain % symbols.
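
A sketch of the blacklist rule, combining this commit's % check with the earlier () check for exported shell functions; the predicate shape is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSkipVar reports whether an environment variable name should be
// excluded from propagation: names containing "%" or "()" (exported
// shell functions on HPC module systems) break the SSH escaping code.
func shouldSkipVar(name string) bool {
	return strings.Contains(name, "%") || strings.Contains(name, "()")
}

func main() {
	for _, n := range []string{"PATH", "BASH_FUNC_module()", "PROMPT%"} {
		fmt.Println(n, "skip:", shouldSkipVar(n))
	}
}
```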

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
Singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation, as sketched below.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing, it will
  exit with a success status.
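
A minimal sketch of the master-side flow, with paraphrased names; the real handler and message signatures are not shown here:

```go
package main

import "fmt"

// ReleaseResources mirrors the message the commit describes, not the
// actual struct in the master.
type ReleaseResources struct {
	AllocationID    string
	ForcePreemption bool
}

// pendingPreemption sketches the new API: the SIGTERM handler in the
// Slurm job reports its allocation, and the dispatcher RM responds by
// releasing that allocation's resources with ForcePreemption set.
func pendingPreemption(allocationID string, release func(ReleaseResources)) {
	release(ReleaseResources{AllocationID: allocationID, ForcePreemption: true})
}

func main() {
	pendingPreemption("alloc-123", func(r ReleaseResources) {
		fmt.Printf("releasing %s (force=%v)\n", r.AllocationID, r.ForcePreemption)
	})
}
```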

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.
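
A sketch of the three-level precedence under assumed field types; pointers model "unset", and the field names are illustrative rather than the real config structs:

```go
package main

import "fmt"

// resolve applies the override chain the commit describes:
// expconf (if set) wins over the per-partition/pool default,
// which wins over the master-config default.
func resolve(expconf, pool *string, master string) string {
	if expconf != nil {
		return *expconf
	}
	if pool != nil {
		return *pool
	}
	return master
}

func main() {
	poolDefault := "cpu"
	fmt.Println(resolve(nil, &poolDefault, "cuda")) // cpu
}
```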

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. that it has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.
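
A sketch of the fallback described above, under an assumed partition shape:

```go
package main

import "fmt"

// partition is an illustrative stand-in for the Slurm partition info
// the RM caches; only the GPU count matters here.
type partition struct {
	name string
	gpus int
}

// defaultComputePool falls back to the default aux (CPU) pool when no
// partition reports GPUs, so requests for pool "" resolve to a real
// pool with real attributes and a cpu slot type.
func defaultComputePool(parts []partition, defaultAux string) string {
	for _, p := range parts {
		if p.gpus > 0 {
			return p.name
		}
	}
	return defaultAux
}

func main() {
	parts := []partition{{name: "debug", gpus: 0}}
	fmt.Println(defaultComputePool(parts, "aux")) // aux
}
```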

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running, and other
resources states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady, so to avoid
a race condition where the dispatcher RM discovers the running
state by polling for dispatcher RM, force the state to running upon reception
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master, there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting back to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.
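
Taken together with #221 above, the decision reduces to something like this sketch (the signature is assumed, not the real method):

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldDeleteDispatch captures the FOUNDENG-41 rule: a 404 from the
// launcher means the dispatch is already gone, so the DB row can be
// dropped; any other error leaves the row for a later cleanup pass.
func shouldDeleteDispatch(status int, err error) bool {
	if err == nil {
		return true // terminated successfully
	}
	return status == http.StatusNotFound
}

func main() {
	fmt.Println(shouldDeleteDispatch(http.StatusNotFound, fmt.Errorf("not found"))) // true
	fmt.Println(shouldDeleteDispatch(0, fmt.Errorf("connection refused")))          // false
}
```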

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB
we have ensured that the DispatchIDs (launches & environment) have
been removed by the launcher.

Use common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 in development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as the root user (the default if
no agent user is configured). Catch this early in the dispatcher
RM by returning an error on any attempt to impersonate root
and reporting it explicitly.
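
A sketch of the early guard; the error text and signature are assumptions:

```go
package main

import (
	"errors"
	"fmt"
)

// checkImpersonatedUser rejects a launch as UID 0 up front with an
// explicit error, instead of letting it surface as a cryptic
// launcher-side failure.
func checkImpersonatedUser(uid int, username string) error {
	if uid == 0 {
		return errors.New("slurm jobs cannot run as root; configure an agent user for " + username)
	}
	return nil
}

func main() {
	fmt.Println(checkImpersonatedUser(0, "determined"))
}
```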

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes (the first is sketched below):
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
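
The filtering half, sketched with an assumed event shape:

```go
package main

import "fmt"

// event is an illustrative stand-in for a launcher event.
type event struct {
	level, msg string
}

// filterExitMessages keeps only WARNING and ERROR level events when
// building an exit-status message, so routine status updates don't
// drown out the real failure.
func filterExitMessages(events []event) []string {
	var out []string
	for _, e := range events {
		if e.level == "WARNING" || e.level == "ERROR" {
			out = append(out, e.msg)
		}
	}
	return out
}

func main() {
	evs := []event{{"INFO", "submitted"}, {"ERROR", "image pull failed"}}
	fmt.Println(filterExitMessages(evs)) // [image pull failed]
}
```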

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
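
A sketch of the query-string approach using only the standard library; the parameter name follows the commit text, and the helper itself is illustrative:

```go
package main

import (
	"fmt"
	"net/url"
)

// addRelayState carries relayState through the OIDC flow by appending
// it to the redirect URI's query string, since relayState is not part
// of the OIDC standard.
func addRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, _ := addRelayState("https://master.example.com/oidc/callback", "/experiments/5")
	fmt.Println(s)
}
```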

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable checkpoint volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia_smi command as a diagnostic, and generate a WARN if they are
not as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so the TODO is left in place.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

The dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since that prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also support the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - enable /run/determined/ssh to be created in place instead of after the fact.
 - enable /run/determined/workdir to be created in place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable checkpoint-gc jobs to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured with SelectType=select/cons_tres and GRES GPU
information, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared.   This is not yet completely
generalized, so we need to update tasks.go to add the links
for now.  Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for the command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.
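
A sketch of the dynamic-link pass under assumed paths and an assumed read-only exception set; the shipped logic lives in the wrapper scripts:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkUnshared replaces each /run/determined entry that is not a
// read-only shared entrypoint with a softlink into the
// container-private directory, so every container gets its own copy.
func linkUnshared(sharedDir, privateDir string, sharedReadOnly map[string]bool) error {
	entries, err := os.ReadDir(sharedDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if sharedReadOnly[e.Name()] {
			continue // e.g. command/shell entrypoints stay shared
		}
		link := filepath.Join(sharedDir, e.Name())
		target := filepath.Join(privateDir, e.Name())
		if err := os.RemoveAll(link); err != nil {
			return err
		}
		if err := os.Symlink(target, link); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Entrypoint names here are hypothetical illustrations.
	err := linkUnshared("/run/determined", "/private", map[string]bool{
		"command-entrypoint.sh": true,
		"shell-entrypoint.sh":   true,
	})
	fmt.Println(err)
}
```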

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When mpirun is invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a Singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it determines the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because those variables
include () in their names, they break the SSH variable-escaping
code. Avoid processing of these functions by adding any
variable name containing () to the blacklist.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further enhance the blacklist of variables to include those whose names contain % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
Singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing, it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. that it has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are: conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve the named deps of
it, but in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable checkpoint volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia_smi command as a diagnostic, and generate a WARN if they are
not as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so the TODO is left in place.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's re…
eecsliu added a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
…or PBS (determined-ai#399)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are: conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determined if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve the named deps of
it, but in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value that
is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is set, verify
that the specified CUDA devices are visible via the nvidia-smi command
as a diagnostic, and generate a WARN if they are not as expected.
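
The real logic lives in a shell wrapper, but the mapping can be sketched in Go as follows; the bracketed DET_SLOT_IDS format and the exact comparison are assumptions for illustration:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	ordinals := os.Getenv("GPU_DEVICE_ORDINAL") // e.g. "0,1"
	slotIDs := "[" + strings.Join(strings.Split(ordinals, ","), ",") + "]"
	os.Setenv("DET_SLOT_IDS", slotIDs)

	// Diagnostic only: warn when CUDA_VISIBLE_DEVICES disagrees with the
	// Slurm-provided ordinals.
	if cuda := os.Getenv("CUDA_VISIBLE_DEVICES"); cuda != "" && cuda != ordinals {
		fmt.Fprintf(os.Stderr, "WARN: CUDA_VISIBLE_DEVICES (%s) differs from GPU_DEVICE_ORDINAL (%s)\n",
			cuda, ordinals)
	}
}
```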

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so the TODO is left in place.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID call and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

The dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user
experience and debugging.
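
The environment handling described above amounts to something like this hedged Go sketch (the real code lives in the task setup scripts; the exec of mpirun is elided):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	rank := os.Getenv("HOROVOD_RANK")
	if rank == "" {
		rank = os.Getenv("OMPI_COMM_WORLD_RANK") // fallback keeps the [rank=#] prefix correct
	}
	// Clearing SLURM_JOBID stops mpirun from auto-integrating with Slurm inside
	// the container, so it honors the host/topology args from horovodrun.
	os.Unsetenv("SLURM_JOBID")
	fmt.Println("logging with rank:", rank)
}
```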

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further enhance the blacklist of variables to include those containing % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption; a sketch of the container-side signal handling follows below.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing, it will
  exit with a success status.
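
A rough, hypothetical sketch of that container-side handling: trap SIGTERM and report the pending preemption to the master. The endpoint path, env var names, and notifyPendingPreemption are illustrative assumptions, not the actual API surface:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

// notifyPendingPreemption posts to a hypothetical master endpoint; the real
// pending_preemption API shape may differ.
func notifyPendingPreemption(masterURL, allocationID string) error {
	url := fmt.Sprintf("%s/api/v1/allocations/%s/pending_preemption", masterURL, allocationID)
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM) // Slurm sends SIGTERM ahead of preemption
	<-sigs
	if err := notifyPendingPreemption(os.Getenv("DET_MASTER"), os.Getenv("DET_ALLOCATION_ID")); err != nil {
		fmt.Fprintln(os.Stderr, "failed to report pending preemption:", err)
	}
}
```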

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which in turn override master-config-level defaults.
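
The precedence described there can be sketched as a small resolver; the field names and the string-pointer representation are assumptions for illustration:

```go
package sketch

// resolveSetting returns the first configured value in precedence order:
// expconf, then the resource-pool (partition) override, then the master
// default; nil means "not set at this level".
func resolveSetting(expconf, pool, master *string, fallback string) string {
	for _, v := range []*string{expconf, pool, master} {
		if v != nil {
			return *v
		}
	}
	return fallback
}
```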

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the cpu as the slot type.

Also add a devcluster file for shuco for testing.
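
Sketched under assumed names (the partition struct below is illustrative, not the real dispatcher RM types), the fallback amounts to:

```go
package sketch

type partition struct {
	name      string
	isDefault bool
	totalGPUs int
}

// defaultComputePool prefers a default GPU partition, then any GPU partition;
// only when no partition has GPUs does it reuse the default aux pool so that
// pool attributes resolve and the slot type switches to cpu.
func defaultComputePool(partitions []partition, defaultAuxPool string) string {
	for _, p := range partitions {
		if p.isDefault && p.totalGPUs > 0 {
			return p.name
		}
	}
	for _, p := range partitions {
		if p.totalGPUs > 0 {
			return p.name
		}
	}
	return defaultAuxPool
}
```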

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resources states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady, so to avoid
a race condition where the dispatcher RM only discovers the running
state by polling, force the state to running upon reception
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting back to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm.

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.
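
Taken together, the two FOUNDENG-41 fixes amount to a rule like the following sketch, where terminateDispatch and dispatchStore are assumed stand-ins for the launcher client and the DB layer:

```go
package sketch

import "net/http"

type dispatchStore interface{ DeleteDispatch(id string) error }

// terminateDispatch stands in for the launcher termination call; only the
// status code matters here.
func terminateDispatch(id string) (*http.Response, error) {
	return &http.Response{StatusCode: http.StatusNotFound}, nil
}

// cleanupDispatch deletes the DB row only when the dispatch is confirmed gone
// (clean termination or 404); on any other error it keeps the row so a later
// restart retries the cleanup.
func cleanupDispatch(db dispatchStore, id string) error {
	resp, err := terminateDispatch(id)
	if err != nil && (resp == nil || resp.StatusCode != http.StatusNotFound) {
		return err // unknown state: try again on a later startup
	}
	return db.DeleteDispatch(id)
}
```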

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB
we have ensured that the DispatchIDs (launches & environment) have
been removed by the launcher.

Use common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043.

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as a root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error on any attempt to impersonate root,
and report it explicitly.

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes:
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
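
A sketch of that filtering, with an assumed event shape and a hypothetical noise prefix:

```go
package sketch

import "strings"

type launcherEvent struct {
	Level   string // e.g. "INFO", "WARNING", "ERROR"
	Message string
}

const noisyPrefix = "launcher:" // hypothetical prefix stripped from failures

// reportableMessages keeps only WARNING/ERROR events, which are the ones that
// potentially carry real error text, and trims the noisy prefix.
func reportableMessages(events []launcherEvent) []string {
	var msgs []string
	for _, e := range events {
		if e.Level == "WARNING" || e.Level == "ERROR" {
			msgs = append(msgs, strings.TrimPrefix(e.Message, noisyPrefix))
		}
	}
	return msgs
}
```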

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value that
is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is set, verify
that the specified CUDA devices are visible via the nvidia-smi command
as a diagnostic, and generate a WARN if they are not as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so the TODO is left in place.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID call and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

The dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also support the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable checkpoint-gc tasks to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names; replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.
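
The dynamic variant can be sketched as below; the shared set, paths, and error handling are assumptions based on the description above:

```go
package sketch

import (
	"os"
	"path/filepath"
)

// linkUnshared replaces each non-shared entry under /run/determined with a
// softlink into the container-private root, leaving read-only entrypoints in
// the shared directory.
func linkUnshared(entries []string, shared map[string]bool) error {
	for _, name := range entries {
		if shared[name] {
			continue
		}
		link := filepath.Join("/run/determined", name)
		target := filepath.Join("/", name) // container-private copy from the archive
		if err := os.RemoveAll(link); err != nil {
			return err
		}
		if err := os.Symlink(target, link); err != nil {
			return err
		}
	}
	return nil
}
```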

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user
experience and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because the variable names
include (), they break the SSH variable escaping
code. Avoid processing these functions by adding any
variable name containing () to the blacklist.
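
Together with the later commit that blacklists names containing %, the rule is roughly (a sketch; the surrounding escaping loop is hypothetical):

```go
package sketch

import "strings"

// safeToForward reports whether an environment variable name can go through
// the SSH variable escaping code; names with "()" (exported shell functions)
// or "%" break it.
func safeToForward(name string) bool {
	return !strings.ContainsAny(name, "()%")
}
```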

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further enhance the blacklist of variables to include those containing % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing, it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which in turn override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now
correctly return 409 instead of 500, password sync is supported,
externalId is not required, and the username/password that the IdP uses
for basic auth when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
it may reply to us with the meta field we
initially gave to it. This is OK; it is within
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate the SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.
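
A rough sketch of that localhost handoff, assuming the token arrives as a query parameter on a callback URL (the path, parameter name, and port are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
)

// waitForToken runs a short-lived local server and blocks until the browser
// redirect delivers the auth token.
func waitForToken(addr string) string {
	tokenCh := make(chan string, 1)
	mux := http.NewServeMux()
	mux.HandleFunc("/callback", func(w http.ResponseWriter, r *http.Request) {
		tokenCh <- r.URL.Query().Get("token")
		fmt.Fprintln(w, "Authentication complete; you may close this tab.")
	})
	srv := &http.Server{Addr: addr, Handler: mux}
	go srv.ListenAndServe()
	tok := <-tokenCh
	srv.Close()
	return tok
}

func main() {
	fmt.Println("stored token:", waitForToken("127.0.0.1:5049"))
}
```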

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered; that's
all we need for this use case, and it simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value that
is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is set, verify
that the specified CUDA devices are visible via the nvidia-smi command
as a diagnostic, and generate a WARN if they are not as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so the TODO is left in place.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err !=nil" af…
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is set, verify
that the specified CUDA devices are visible from the nvidia-smi command
as a diagnostic, and generate a WARN if they are not as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of !, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via the slurm options
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml, so they
have a single file to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the --writable-tmpfs
option). It does not impact /tmp or /dev/shm, which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.
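
A sketch of the environment scrubbing described above, assuming mpirun
is exec'd with the arguments horovodrun supplies; this is illustrative,
not the actual wrapper code:

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Drop SLURM_JOBID so mpirun does not try to integrate with Slurm
	// and instead honors the host/process topology passed by horovodrun.
	env := []string{}
	for _, kv := range os.Environ() {
		if strings.HasPrefix(kv, "SLURM_JOBID=") {
			continue
		}
		env = append(env, kv)
	}
	cmd := exec.Command("mpirun", os.Args[1:]...)
	cmd.Env = env
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```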

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist of variables to those that include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (see the sketch after this list).
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption; the long poll that _preempt.py is performing will
  return indicating that the job should preempt, and after checkpointing it will
  exit with a success status.
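
A sketch of the container-side half of this flow; the endpoint path and
the environment variable names are assumptions for illustration, not
the actual API:

```go
package main

import (
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Treat SIGTERM from Slurm as notice of pending preemption.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	<-sigs

	// Tell the master, so the dispatcher RM can trigger ReleaseResources
	// (ForcePreemption) for this allocation.
	master := os.Getenv("DET_MASTER")       // assumed to be set in the task
	alloc := os.Getenv("DET_ALLOCATION_ID") // assumed allocation identifier
	url := master + "/api/v1/allocations/" + alloc + "/pending_preemption"
	resp, err := http.Post(url, "", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```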

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master config level defaults.
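
The precedence can be summarized with a small helper; a sketch with
illustrative names, not the actual implementation:

```go
package config

// resolve implements the override chain: an experiment-config setting
// beats the partition/pool-level default, which beats the master-level
// default; the fallback applies when none are set.
func resolve(expconf, pool, master *string, fallback string) string {
	for _, v := range []*string{expconf, pool, master} {
		if v != nil {
			return *v
		}
	}
	return fallback
}
```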

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.
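
A sketch of the fallback rule described above, assuming partition
summaries have already been fetched from Slurm (type and field names
are illustrative):

```go
package rm

// partition summarizes what we learned from Slurm about one partition.
type partition struct {
	name      string
	isDefault bool
	gpus      int
}

// defaultComputePool mirrors the fix: prefer a GPU-bearing partition
// (the Slurm default first), and if no partition has GPUs, fall back to
// the default aux pool so pool attributes resolve and the slot type
// switches to cpu.
func defaultComputePool(parts []partition, defaultAuxPool string) string {
	for _, p := range parts {
		if p.gpus > 0 && p.isDefault {
			return p.name
		}
	}
	for _, p := range parts {
		if p.gpus > 0 {
			return p.name
		}
	}
	return defaultAuxPool
}
```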

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady. So, to avoid
a race condition where the dispatcher RM only discovers the running
state by polling, force the state to running upon reception
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On the startup of the master there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting back to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.
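
A hedged sketch of the decision rule, with the terminate and DB-delete
operations abstracted as function parameters:

```go
package rm

import "errors"

// errNotFound stands in for a launcher 404 response.
var errNotFound = errors.New("dispatch not found (404)")

// cleanupDispatch drops the DB row only when the launcher confirms the
// dispatch is gone (success or 404); any other error leaves the row in
// place for a retry on the next restart.
func cleanupDispatch(terminate, deleteRow func(id string) error, id string) error {
	if err := terminate(id); err != nil && !errors.Is(err, errNotFound) {
		return err // cannot be sure the dispatch is gone; keep the DB entry
	}
	return deleteRow(id)
}
```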

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB
we have ensured that the DispatchIDs (launches & environment) have
been removed by the launcher.

Use common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development to enable us to
connect to test machines from our desktops via Pulse VPN.
Also update casablanca-login devcluster config to use :8043

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as a root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error if an attempt is made to impersonate root,
and report it explicitly.

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes (see the sketch below):
- Include only WARNING/ERROR level messages in the reporting, as
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
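
A sketch of the filtering, assuming launcher events are strings whose
level appears as a prefix (an illustrative assumption):

```go
package rm

import "strings"

// exitStatusMessages keeps only WARNING/ERROR level launcher events,
// since those are the ones likely to carry a real failure reason.
func exitStatusMessages(events []string) []string {
	var out []string
	for _, e := range events {
		if strings.HasPrefix(e, "WARNING") || strings.HasPrefix(e, "ERROR") {
			out = append(out, strings.TrimSpace(e))
		}
	}
	return out
}
```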

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
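
A sketch of the redirect handling described here; the query parameter
name is an assumption:

```go
package oidc

import "net/url"

// withRelayState appends the relay state to the redirect URI's query
// string, since OIDC (unlike SAML) has no standard relayState binding.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState)
	u.RawQuery = q.Encode()
	return u.String(), nil
}
```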

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is set, verify
that the specified CUDA devices are visible from the nvidia-smi command
as a diagnostic, and generate a WARN if they are not as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of !, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via the slurm options
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml, so they
have a single file to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the --writable-tmpfs
option). It does not impact /tmp or /dev/shm, which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.
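
A sketch of the dynamic linking, with hypothetical paths and exception
names:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Link each entry of the container-private archive dir into the
	// shared /run/determined tree, skipping read-only entrypoints that
	// can safely stay shared.
	private, shared := "/determined-private", "/run/determined"
	except := map[string]bool{
		"command-entrypoint.sh": true, // hypothetical exception names
		"shell-entrypoint.sh":   true,
	}

	entries, err := os.ReadDir(private)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		if except[e.Name()] {
			continue
		}
		target := filepath.Join(private, e.Name())
		link := filepath.Join(shared, e.Name())
		if err := os.Symlink(target, link); err != nil {
			log.Fatal(err)
		}
	}
}
```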

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args, or other such
errors, would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because the variables
include () in the names, they break the SSH variable escaping
code. Avoid processing of these functions by adding any
variable name containing () to the blacklist.
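
A sketch of the blacklist predicate; the exact matching rules in the
real script may differ:

```go
package sshenv

import "strings"

// keepForSSH reports whether an environment entry is safe to pass
// through the SSH variable escaping code: exported shell functions
// carry "()" in the name, which breaks the escaping, so they are
// blacklisted.
func keepForSSH(kv string) bool {
	name := kv
	if i := strings.Index(kv, "="); i >= 0 {
		name = kv[:i]
	}
	return !strings.Contains(name, "()")
}
```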

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist of variables to those that include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption; the long poll that _preempt.py is performing will
  return indicating that the job should preempt, and after checkpointing it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master config level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are: conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate the SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info, for the
frontend to determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve its named
deps, but in CI we haven't run apt-get update, so we can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is set, verify
that the specified CUDA devices are visible from the nvidia-smi command
as a diagnostic, and generate a WARN if they are not as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of !, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".
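
For illustration, the early-return shape being requested looks roughly like this (stand-in types and bodies, not the real implementation):

```go
package dispatcher

import "fmt"

// Dispatch is a stand-in for the real dispatch record type.
type Dispatch struct{ ID string }

// listDispatchesByAllocationID is a hypothetical stand-in for the real query.
func listDispatchesByAllocationID(id string) ([]Dispatch, error) {
	return []Dispatch{{ID: "dispatch-1"}}, nil
}

// killDispatches shows the early-return style: handle err != nil immediately
// and return, rather than nesting the happy path in an else branch.
func killDispatches(allocationID string) error {
	dispatches, err := listDispatchesByAllocationID(allocationID)
	if err != nil {
		return fmt.Errorf("listing dispatches for %s: %w", allocationID, err)
	}
	for _, d := range dispatches {
		fmt.Println("terminating dispatch", d.ID)
	}
	return nil
}
```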

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore these messages instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address TODOs:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also support the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).
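
A small sketch of the resulting selection logic, assuming (this is not taken from the source) that an explicit configured value simply wins and cuda remains the default:

```go
package main

import "fmt"

// slotType resolves the scheduling slot type described above: an explicit
// slot_type from the resource manager config wins; otherwise default to cuda.
// On clusters without SelectType=select/cons_tres and GRES GPU information,
// the administrator sets slot_type to cpu.
func slotType(configured string) string {
	if configured != "" {
		return configured
	}
	return "cuda"
}

func main() {
	fmt.Println(slotType(""))    // cuda
	fmt.Println(slotType("cpu")) // cpu on non-cons_tres systems
}
```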

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the --writable-tmpfs
option). It does not impact /tmp or /dev/shm, which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.
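
Roughly, the propagation amounts to a mapping like this sketch (the state names are illustrative stand-ins for the launcher's dispatch states and Determined's job.SchedulingState):

```go
package main

import "fmt"

type DispatchState string
type SchedulingState string

const (
	DispatchRunning DispatchState = "RUNNING"

	SchedulingQueued    SchedulingState = "QUEUED"
	SchedulingScheduled SchedulingState = "SCHEDULED"
)

// schedulingStateOf mirrors the propagation described above: once the
// dispatcher reports RUNNING, stop showing the experiment as QUEUED.
func schedulingStateOf(d DispatchState) SchedulingState {
	if d == DispatchRunning {
		return SchedulingScheduled
	}
	return SchedulingQueued
}

func main() {
	fmt.Println(schedulingStateOf(DispatchRunning)) // SCHEDULED
}
```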

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
host, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.
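
A Go sketch of the two tweaks (the real change lives in the shell wrapper; the mpirun invocation here is illustrative):

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Clear SLURM_JOBID so mpirun honors the hosts/topology args passed by
	// horovodrun instead of activating its own Slurm integration.
	os.Unsetenv("SLURM_JOBID")

	// Fall back to OMPI_COMM_WORLD_RANK for the [rank=#] log prefix when
	// HOROVOD_RANK is unset; cosmetic, but it helps debugging.
	if os.Getenv("HOROVOD_RANK") == "" {
		os.Setenv("HOROVOD_RANK", os.Getenv("OMPI_COMM_WORLD_RANK"))
	}

	cmd := exec.Command("mpirun", os.Args[1:]...) // illustrative invocation
	cmd.Env = os.Environ()
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	_ = cmd.Run()
}
```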

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because the variable
names include (), they break the SSH variable escaping
code. Avoid processing these functions by adding any
variable name containing () to the blacklist.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the variable blacklist to include names containing % symbols.
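
Taken together with the () rule from the previous commit, the blacklist check amounts to something like this sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// skipEnvVar sketches the blacklist rule built up across these two commits:
// exported shell functions have "()" in their names, and names containing "%"
// also break the SSH variable-escaping code, so both are skipped.
func skipEnvVar(name string) bool {
	return strings.Contains(name, "()") || strings.Contains(name, "%")
}

func main() {
	fmt.Println(skipEnvVar("BASH_FUNC_module%%")) // true
	fmt.Println(skipEnvVar("DET_SLOT_IDS"))       // false
}
```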

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
…
eecsliu added a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for redirecting to localhost and automatically
storing the auth token when authenticating from the CLI.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
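
A sketch of that approach with net/url (the parameter name is illustrative):

```go
package main

import (
	"fmt"
	"net/url"
)

// withRelayState sketches the fix: since OIDC has no native relayState,
// carry it in the redirect URI's query string.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, _ := withRelayState("https://master.example.com/oidc/callback", "cli-login")
	fmt.Println(s)
}
```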

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible via the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore these messages instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the --writable-tmpfs
option). It does not impact /tmp or /dev/shm, which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
host, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the variable blacklist to include names containing % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing it will
  exit with a success status.
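
A minimal container-side sketch of the SIGTERM half, assuming a hypothetical helper for the new pending_preemption call and an illustrative allocation-ID env var:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// notifyPendingPreemption stands in for the call to the master's new
// pending_preemption API; the real client and endpoint path may differ.
func notifyPendingPreemption(allocationID string) {
	fmt.Println("notifying master of pending preemption for", allocationID)
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM) // Slurm sends SIGTERM before preempting

	go func() {
		<-sigs
		notifyPendingPreemption(os.Getenv("DET_ALLOCATION_ID")) // env var name illustrative
	}()

	select {} // stand-in for the real workload; on preemption it checkpoints and exits 0
}
```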

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.
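
The precedence reduces to a first-non-empty lookup, roughly as in this sketch (an empty string stands for "not set"):

```go
package main

import "fmt"

// overrideChain sketches the precedence described above: an expconf setting
// (if present) overrides the pool-level default, which overrides the
// master-config default.
func overrideChain(expconf, pool, master string) string {
	for _, v := range []string{expconf, pool, master} {
		if v != "" {
			return v
		}
	}
	return ""
}

func main() {
	fmt.Println(overrideChain("", "partition-a-iface", "eth0")) // pool wins
}
```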

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.
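
The fallback reduces to something like this sketch (pool names are illustrative):

```go
package main

import "fmt"

// defaultComputePool sketches the fallback above: if no partition reports
// GPUs, reuse the default aux pool as the default compute pool so requests
// for pool "" resolve to a pool whose attributes (cpu slots) are known.
func defaultComputePool(defaultGPUPool, defaultAuxPool string, anyGPUs bool) string {
	if !anyGPUs {
		return defaultAuxPool
	}
	return defaultGPUPool
}

func main() {
	fmt.Println(defaultComputePool("gpu-pool", "aux-pool", false)) // aux-pool
}
```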

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady, so to avoid
a race condition where the dispatcher RM discovers the running
state only by polling, force the state to running upon reception
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On master startup there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting back to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.
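
Combined, the two fixes amount to a rule like this sketch (the helper signature is hypothetical):

```go
package main

import (
	"fmt"
	"net/http"
)

// cleanupDispatch sketches the rule from the two fixes above: a 404 means the
// launcher no longer knows the dispatch, so deleting the DB row is safe; any
// other error keeps the row for a retry on the next startup.
func cleanupDispatch(status int, deleteRow func() error) error {
	if status == http.StatusOK || status == http.StatusNotFound {
		return deleteRow()
	}
	return fmt.Errorf("launcher returned %d; keeping dispatch for later cleanup", status)
}

func main() {
	_ = cleanupDispatch(http.StatusNotFound, func() error { return nil })
}
```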

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB,
the DispatchIDs (launches & environment) have already
been removed by the launcher.

Use common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as the root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error on any attempt to impersonate root,
and report it explicitly.

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes:
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
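
The level filter is essentially this sketch:

```go
package main

import "fmt"

// reportable sketches the filter above: only WARNING/ERROR launcher events
// make it into the exit-status report; INFO-level status updates are noise.
func reportable(level string) bool {
	return level == "WARNING" || level == "ERROR"
}

func main() {
	fmt.Println(reportable("INFO"), reportable("ERROR")) // false true
}
```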

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible via the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore these messages instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address TODOs:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also support the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the --writable-tmpfs
option). It does not impact /tmp or /dev/shm, which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
host, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because the variable
names include (), they break the SSH variable escaping
code. Avoid processing these functions by adding any
variable name containing () to the blacklist.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the variable blacklist to include names containing % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.
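
A minimal Go sketch of that precedence chain, with illustrative values:

```go
package main

import "fmt"

// resolve returns the first configured value in precedence order:
// expconf, then the pool-level default, then the master-config default.
func resolve(expconf, pool, master *string) string {
	for _, v := range []*string{expconf, pool, master} {
		if v != nil {
			return *v
		}
	}
	return ""
}

func main() {
	poolIface, masterIface := "ib0", "eth0"
	// No expconf override, so the pool-level value wins over the master's.
	fmt.Println(resolve(nil, &poolIface, &masterIface)) // ib0
}
```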

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using CPU as the slot type.

Also add a devcluster file for shuco for testing.
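
A minimal Go sketch of that fallback decision, with hypothetical types (the real logic lives in the dispatcher RM's pool resolution):

```go
package main

import "fmt"

type partition struct {
	name string
	gpus int
}

// defaultComputePool picks the first partition with GPUs; when none have
// any, it falls back to the default aux pool so the pool name still
// resolves to real attributes and the slot type becomes CPU.
func defaultComputePool(parts []partition, defaultAux string) string {
	for _, p := range parts {
		if p.gpus > 0 {
			return p.name
		}
	}
	return defaultAux
}

func main() {
	parts := []partition{{name: "debug", gpus: 0}, {name: "batch", gpus: 0}}
	fmt.Println(defaultComputePool(parts, "batch")) // batch (CPU slots)
}
```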

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features: conflicts now correctly
return 409 instead of 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate the SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info so the
frontend can determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.
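
A minimal Go sketch of that CLI-side flow, assuming a hypothetical port, callback path, and query-parameter name:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

func main() {
	tokens := make(chan string, 1)
	srv := &http.Server{Addr: "localhost:49152"} // arbitrary local port
	http.HandleFunc("/callback", func(w http.ResponseWriter, r *http.Request) {
		tokens <- r.URL.Query().Get("token") // hypothetical parameter name
		fmt.Fprintln(w, "Login complete; you may close this tab.")
	})
	go srv.ListenAndServe()

	// The CLI opens the browser to the master's login page with
	// redirect=http://localhost:49152/callback, then waits here.
	token := <-tokens
	fmt.Println("storing auth token:", token) // e.g. persist under ~/.determined
	srv.Shutdown(context.Background())
}
```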

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered; that's all we
need for this use case, and it simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
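
A minimal Go sketch of that fix; the query-parameter name is an assumption:

```go
package main

import (
	"fmt"
	"net/url"
)

// withRelayState appends the relay state to the redirect URI's query
// string, since OIDC has no SAML-style redirect binding to carry it.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState) // parameter name is illustrative
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, _ := withRelayState("https://master.example.com/oidc/callback", "/det/login")
	fmt.Println(s)
}
```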

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
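
A minimal Go sketch of that claim-to-attribute matching, with assumed field and claim names:

```go
package main

import "fmt"

// Configuration pairing an OIDC claim with the SCIM attribute it must
// match; both field names here are assumptions for illustration.
type oidcConfig struct {
	AuthenticationClaim string // e.g. "preferred_username"
	SCIMAttribute       string // e.g. "userName"
}

// matches checks the configured claim from the token against the
// configured attribute of the provisioned SCIM user.
func matches(cfg oidcConfig, claims, scimUser map[string]string) bool {
	v, ok := claims[cfg.AuthenticationClaim]
	return ok && v != "" && v == scimUser[cfg.SCIMAttribute]
}

func main() {
	cfg := oidcConfig{AuthenticationClaim: "preferred_username", SCIMAttribute: "userName"}
	claims := map[string]string{"preferred_username": "alice"}
	user := map[string]string{"userName": "alice"}
	fmt.Println(matches(cfg, claims, user)) // true
}
```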

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic, and generate a WARN if not
as expected.
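
A minimal Go sketch of deriving DET_SLOT_IDS from GPU_DEVICE_ORDINAL (the real logic is in dispatcher-wrapper.sh; the exact DET_SLOT_IDS format here is an assumption):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Slurm exposes the allocated GPUs as a comma-separated ordinal list.
	ordinals := os.Getenv("GPU_DEVICE_ORDINAL") // e.g. "0,1,2,3"
	if ordinals == "" {
		return // CPU-only task: nothing to derive
	}
	ids := strings.Split(ordinals, ",")
	// The bracketed formatting is assumed here; a real implementation
	// would also cross-check CUDA_VISIBLE_DEVICES against nvidia-smi
	// output and log a WARN on mismatch.
	os.Setenv("DET_SLOT_IDS", "["+strings.Join(ids, ",")+"]")
	fmt.Println(os.Getenv("DET_SLOT_IDS"))
}
```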

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's re…
eecsliu added a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
…or PBS (determined-ai#399)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features: conflicts now correctly
return 409 instead of 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate the SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info so the
frontend can determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered; that's all we
need for this use case, and it simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic, and generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err !=nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parenthesis around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since it prevents testing).
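
A minimal Go sketch of dropping those messages instead of erroring (message types are stand-ins, not the real sproto/job definitions):

```go
package main

import "fmt"

// Stand-ins for the job messages named above.
type SetGroupWeight struct{ Weight float64 }
type SetGroupPriority struct{ Priority int }
type MoveJob struct{ Ahead bool }

// receive drops scheduling messages the Slurm-backed RM cannot honor
// instead of treating them as fatal.
func receive(msg interface{}) {
	switch msg.(type) {
	case SetGroupWeight, SetGroupPriority, MoveJob:
		// Unsupported under Slurm scheduling; acknowledge and ignore.
	default:
		fmt.Printf("handling %T\n", msg)
	}
}

func main() {
	receive(SetGroupWeight{Weight: 1.0})
	receive("some other message")
}
```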

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in place instead of after the fact.
 - Enable /run/determined/workdir to be created in place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to put our container-private
directory holding links that point to non-shared directories instantiated
from archives. It is private to the container and writable (with the
--writable-tmpfs option). It does not impact /tmp or /dev/shm, which
customers may be using, and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and Slurm
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not
needed for correct operation, but it affects the [rank=#] logging prefix, so it improves
the user experience and debugging.
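
A minimal Go sketch of that SLURM_JOBID scrubbing (the real logic is in the shell wrapper; the mpirun args are placeholders):

```go
package main

import (
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Placeholder command; the real wrapper execs what horovodrun hands it.
	cmd := exec.Command("mpirun", "-np", "4", "python", "train.py")

	// Copy the environment, dropping SLURM_JOBID so mpirun honors the
	// host/topology args instead of attempting Slurm integration.
	var env []string
	for _, e := range os.Environ() {
		if strings.HasPrefix(e, "SLURM_JOBID=") {
			continue
		}
		env = append(env, e)
	}
	cmd.Env = env
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	_ = cmd.Run()
}
```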

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the Slurm-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by dispatcher-wrapper.sh.

Further extend the blacklist to variables whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption; the long poll _preempt.py is performing will
  return, indicating that the job should preempt, and after checkpointing it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

so you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using CPU as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady. So, to avoid
a race condition where the dispatcher RM discovers the running
state only by polling, force the state to running upon receipt
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.
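
A minimal Go sketch of that 404-aware cleanup, with stubbed launcher and DB calls:

```go
package main

import (
	"fmt"
	"net/http"
)

// terminate is a stub for the launcher call; it returns the HTTP status.
func terminate(dispatchID string) (int, error) {
	return http.StatusNotFound, nil // pretend the dispatch is already gone
}

// cleanup terminates a persisted dispatch and deletes its DB row when the
// launcher confirms it is gone (success or 404). Any other failure leaves
// the row in place for a retry on the next restart.
func cleanup(dispatchID string, deleteRow func(string) error) error {
	status, err := terminate(dispatchID)
	if err != nil && status != http.StatusNotFound {
		return fmt.Errorf("leaving %s in DB for retry: %w", dispatchID, err)
	}
	return deleteRow(dispatchID)
}

func main() {
	_ = cleanup("dispatch-123", func(id string) error {
		fmt.Println("deleted row for", id)
		return nil
	})
}
```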

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm.

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to ensure that when a Dispatch is dropped from the DB,
the DispatchIDs (launches & environment) have been removed by the
launcher.

Use common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043.

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as the root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error on any attempt to impersonate root,
and report it explicitly.

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happened on the dispatch. These are mostly useless
status updates. Fixes (see the sketch below):
- Include only WARNING/ERROR-level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
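
A minimal Go sketch of that filtering (the event shape and the trimmed prefix are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

type event struct{ Level, Message string }

// reportable keeps only WARNING/ERROR events, which are the ones likely
// to carry a real failure reason, and trims a noisy prefix.
func reportable(events []event) []string {
	var out []string
	for _, e := range events {
		if e.Level != "WARNING" && e.Level != "ERROR" {
			continue // drop routine status updates
		}
		out = append(out, strings.TrimPrefix(e.Message, "Slurm job failed: "))
	}
	return out
}

func main() {
	fmt.Println(reportable([]event{
		{"INFO", "state RUNNING"},
		{"ERROR", "Slurm job failed: node ran out of memory"},
	}))
}
```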

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic, and generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err !=nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parenthesis around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since it prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in place instead of after the fact.
 - Enable /run/determined/workdir to be created in place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (those will now come through the expconf directly via the slurm
options there).

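A small sketch of the default described above, assuming a SlotType field on the RM config (field and type are illustrative):

    // slotType returns the configured slot type, defaulting to cuda; set
    // slot_type: cpu on clusters without cons_tres/GRES GPU information.
    func (c DispatcherConfig) slotType() string {
        if c.SlotType != "" {
            return c.SlotType
        }
        return "cuda"
    }
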
* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml, so they
have a single file to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for the container-private directory,
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the
--writable-tmpfs option), it does not impact /tmp or /dev/shm which
customers may be using, and it is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so for now we need to update tasks.go to add the
links. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

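A sketch of the dynamic approach under stated assumptions: archive paths are walked once, and each top-level entry under /run/determined gets a link pointing into the container-private root unless it is a read-only exception. All names here are illustrative, not the real implementation:

    // linkCommands emits one "ln -s" per non-shared top-level entry that the
    // archives place under /run/determined, skipping read-only exceptions
    // such as the command and shell entrypoints (needs "fmt", "strings").
    func linkCommands(archivePaths []string, readOnly map[string]bool, privateRoot string) []string {
        seen := map[string]bool{}
        var cmds []string
        for _, p := range archivePaths {
            rest := strings.TrimPrefix(p, "/run/determined/")
            if rest == p || rest == "" {
                continue // not under /run/determined
            }
            top := strings.SplitN(rest, "/", 2)[0]
            if seen[top] || readOnly[top] {
                continue
            }
            seen[top] = true
            cmds = append(cmds, fmt.Sprintf("ln -s %s/%s /run/determined/%s", privateRoot, top, top))
        }
        return cmds
    }
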
* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

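A sketch of the mapping this implies; the string values stand in for the real launcher and determined state types:

    // schedulingState maps a launcher dispatch state onto the scheduling
    // state carried by the AllocateRequest.
    func schedulingState(dispatchState string) string {
        if dispatchState == "RUNNING" {
            return "SCHEDULED"
        }
        return "QUEUED" // not yet running: keep reporting the job as queued
    }
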
* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When present, the SLURM_JOBID
variable triggers mpirun's integration with SLURM; however, we are running in a
singularity container, and SLURM may or may not have a compatible configuration
enabled. We therefore clear the SLURM_JOBID variable before invoking mpi so that
mpirun will honor the args horovodrun passes to it describing the hosts and process
topology; otherwise mpi ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in
the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is
not needed for correct operation, but it affects the [rank=#] logging prefix, so
it improves the user experience and debugging.

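A sketch of the environment scrub, assuming mpirun is exec'd from Go (the real change may live in a wrapper script; needs "os" and "strings"):

    // mpirunEnv returns the current environment minus SLURM_JOBID so that
    // mpirun honors horovodrun's host/topology args instead of trying to
    // integrate with Slurm from inside the singularity container.
    func mpirunEnv() []string {
        var env []string
        for _, kv := range os.Environ() {
            if strings.HasPrefix(kv, "SLURM_JOBID=") {
                continue
            }
            env = append(env, kv)
        }
        return env
    }

Usage would be along the lines of cmd := exec.Command("mpirun", args...); cmd.Env = mpirunEnv().
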
* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because those variable
names include (), they break the SSH variable-escaping
code. Avoid processing these functions by adding any
variable name containing () to the blacklist.

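The blacklist test reduces to a name check; a sketch with an illustrative helper name (the % exclusion comes from a later commit below; needs "strings"):

    // forwardable reports whether an environment variable is safe to pass
    // through the SSH escaping code: exported shell functions carry () in
    // their names, and names containing % also break the escaping.
    func forwardable(name string) bool {
        return !strings.Contains(name, "()") && !strings.Contains(name, "%")
    }
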
* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by dispatcher-wrapper.sh.

Further extend the blacklist to variables whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (sketched below):
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption; the long poll _preempt.py is performing will
  return indicating that the job should preempt, and after checkpointing it will
  exit with a success status.

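A sketch of the master-side hand-off named in the list above; the message and field names follow the text, but the plumbing is illustrative:

    // pendingPreemption is invoked by the new pending_preemption API and
    // asks the allocation to release resources, forcing a preemption that
    // the trial's _preempt.py long poll will observe.
    func (m *dispatcherResourceManager) pendingPreemption(allocationID string) {
        m.tell(allocationID, ReleaseResources{ForcePreemption: true})
    }
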
* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.

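A sketch of that precedence chain, where nil means "not set at this level":

    // resolve walks expconf -> pool -> master config and returns the first
    // value that is set, falling back to a built-in default.
    func resolve(expconf, pool, master *string, fallback string) string {
        for _, v := range []*string{expconf, pool, master} {
            if v != nil {
                return *v
            }
        }
        return fallback
    }
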
* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e., it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.

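The fallback reduces to a small rule; a sketch with illustrative names:

    // defaultComputePool returns the default GPU partition, or the default
    // aux pool when no partition has any GPUs, so the pool name always
    // matches a known pool and the slot type resolves to cpu.
    func defaultComputePool(totalGPUs int, gpuDefault, auxDefault string) string {
        if totalGPUs == 0 {
            return auxDefault
        }
        return gpuDefault
    }
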
* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support from v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now
correctly return 409 rather than 500, password sync is supported,
externalId is not required, and the username/password that the IdP uses
for basic auth when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info so the
frontend can determine whether SAML is enabled, plus the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

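A sketch of appending relayState to the redirect URI's query string (the parameter name is an assumption; uses "net/url"):

    // withRelayState threads the relay state through the OIDC redirect by
    // adding it to the redirect URI's query string.
    func withRelayState(redirectURI, relayState string) (string, error) {
        u, err := url.Parse(redirectURI)
        if err != nil {
            return "", err
        }
        q := u.Query()
        q.Set("relay_state", relayState) // parameter name is illustrative
        u.RawQuery = q.Encode()
        return u.String(), nil
    }
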
* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

eecsliu pushed a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady. To avoid a race
condition where the dispatcher RM only discovers the running
state by polling, force the state to running upon reception of
the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

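Taken together with FOUNDENG-41 above, the cleanup rule sketches out as follows (types and helpers are illustrative):

    // cleanupDispatch deletes the DB row only once the launcher no longer
    // knows the dispatch: a 404 counts as already gone, while any other
    // error leaves the row for a retry on the next startup.
    func (m *dispatcherResourceManager) cleanupDispatch(id string) {
        status, err := m.terminateOnLauncher(id)
        if err != nil && status != 404 {
            return
        }
        m.db.DeleteDispatch(id)
    }
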
* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm.

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB,
the DispatchIDs (launches & environment) have been removed by the
launcher.

Use a common method for cleanup in both cases, and enhance the
method to verify deletion (404 is also considered a success). If we
hit any failure in cleaning up launcher data, we defer the removal
of the Dispatch until a later time (they are re-processed on
startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  - Small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043.

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as the root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error on any attempt to impersonate root, and
report it explicitly.

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happened on the dispatch. These are mostly useless
status updates. Fixes (sketched below):
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.

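A sketch of the filtering named in the list above; the event struct and failure prefix are illustrative (needs "strings"):

    type launcherEvent struct{ Level, Message string }

    // failureMessages keeps only WARNING/ERROR events and strips the noisy
    // job-failure prefix before surfacing them to the user.
    func failureMessages(events []launcherEvent, prefix string) []string {
        var msgs []string
        for _, e := range events {
            if e.Level == "WARNING" || e.Level == "ERROR" {
                msgs = append(msgs, strings.TrimPrefix(e.Message, prefix))
            }
        }
        return msgs
    }
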
* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Onwer in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change udpates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has an cproto.RunArchive Path value that
that is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it not longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuraiton file as well.
Jobs do not currently succeed with mulitiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured vi  checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate dispather RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUS:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS which should work
for both CUDA and ROCR devices.   If  CUDA_VISIBLE_DEVICES is
set verify that the specified CUDA devices are visible from
nvidia_smi command as a diagnositic.   Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon which doesn't support --gpus allocation.
This stil needs more work so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* cleanup the last bad code from when i scaffold-ed this out very quickly
* make some small stylistic go changes as i was reading along
* ticket anything that i noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Not got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err !=nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parenthesis around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress an dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since it prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in archive launch parameter to enable:
 - Enable `set -e` in the dispatcher-wrapper.sh  script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable checkpoint GC to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured with SelectType=select/cons_tres and GRES GPU
information, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (those will now come through the expconf directly via the slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the --writable-tmpfs
option). It does not impact /tmp or /dev/shm, which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of Determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.
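
A minimal Go sketch of the dynamic-list idea follows; the real logic lives in the launch wrapper, and the directory names and read-only exception set here are assumptions for illustration only.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical paths: the shared directory and the container-private root.
	shared := "/run/determined"
	private := "/determined-private"

	// Read-only entrypoints may stay in the shared directory (assumed names).
	keepShared := map[string]bool{
		"command-entrypoint.sh": true,
		"shell-entrypoint.sh":   true,
	}

	entries, err := os.ReadDir(shared)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		if keepShared[e.Name()] {
			continue
		}
		// Replace the shared name with a softlink to the private copy, so
		// concurrent containers never write to the same shared file.
		link := filepath.Join(shared, e.Name())
		if err := os.RemoveAll(link); err != nil {
			log.Fatal(err)
		}
		if err := os.Symlink(filepath.Join(private, e.Name()), link); err != nil {
			log.Fatal(err)
		}
	}
}
```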

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some cleanup on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.
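
A sketch of the two fixes in Go (the actual change lives in the wrapper scripts; this only illustrates the environment handling, under those assumptions):

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Clear SLURM_JOBID so mpirun honors the host/topology args passed to it
	// by horovodrun instead of activating its own SLURM integration.
	os.Unsetenv("SLURM_JOBID")

	// Fall back to the OpenMPI rank when HOROVOD_RANK is undefined; this only
	// affects the [rank=#] logging prefix, not correctness.
	if os.Getenv("HOROVOD_RANK") == "" {
		os.Setenv("HOROVOD_RANK", os.Getenv("OMPI_COMM_WORLD_RANK"))
	}

	cmd := exec.Command("mpirun", os.Args[1:]...) // pass the real args through
	cmd.Env = os.Environ()
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```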

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because these variables
include () in their names, they break the SSH variable-escaping
code. Avoid processing of these functions by adding any
variable name containing () to the blacklist (see the sketch below).
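
A Go sketch of the blacklist test; the escaping code itself lives elsewhere, so this only shows the name filter, with an assumed sample entry.

```go
package main

import (
	"fmt"
	"strings"
)

// filterEnv drops environment entries whose names contain "()", i.e. exported
// shell functions from HPC module systems, before SSH variable escaping.
func filterEnv(env []string) []string {
	var out []string
	for _, kv := range env {
		name := strings.SplitN(kv, "=", 2)[0]
		if strings.Contains(name, "()") {
			continue // exported function, not a plain variable
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{"PATH=/usr/bin", "BASH_FUNC_module()=() { eval ...; }"}
	fmt.Println(filterEnv(env)) // -> [PATH=/usr/bin]
}
```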

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by dispatcher-wrapper.sh.

Further extend the variable blacklist to those whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some cleanup on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation, as sketched below.
- The above triggers a preemption; the long poll that _preempt.py is performing will
  return, indicating that the job should preempt, and after checkpointing it will
  exit with a success status.
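
A minimal Go sketch of that flow; ReleaseResources and ForcePreemption come from the commit text above, while the rest of the scaffolding is assumed purely for illustration.

```go
package main

import "fmt"

// ReleaseResources/ForcePreemption come from the commit text above; the rest
// of this scaffolding is an assumption for illustration only.
type ReleaseResources struct {
	ForcePreemption bool
	Reason          string
}

// pendingPreemption models the master-side handler: when the SLURM job
// reports SIGTERM, force-preempt the allocation so the _preempt.py long poll
// returns, the trial checkpoints, and the process exits successfully.
func pendingPreemption(notify func(ReleaseResources)) {
	notify(ReleaseResources{ForcePreemption: true, Reason: "SLURM preemption pending"})
}

func main() {
	pendingPreemption(func(msg ReleaseResources) {
		fmt.Printf("release resources (force=%v): %s\n", msg.ForcePreemption, msg.Reason)
	})
}
```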

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.
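
A sketch of that precedence in Go; the specific option being resolved is hypothetical, and only the three-level ordering comes from the description above.

```go
package main

import "fmt"

// resolve returns the first value set, mirroring the precedence described
// above: expconf, then pool-level default, then master-config default.
func resolve(expconf, pool, master *string) string {
	switch {
	case expconf != nil:
		return *expconf
	case pool != nil:
		return *pool
	case master != nil:
		return *master
	}
	return ""
}

func main() {
	master := "cuda"
	pool := "cpu"
	fmt.Println(resolve(nil, &pool, &master)) // "cpu": pool default wins over master
}
```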

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.
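
A minimal Go sketch of the fallback, assuming hypothetical partition names and fields; only the rule itself (no GPUs anywhere means the default compute pool becomes the default aux pool) comes from the fix above.

```go
package main

import "fmt"

type partition struct {
	name      string
	totalGPUs int
}

func main() {
	// Two hypothetical partitions, neither with GPUs.
	partitions := []partition{{"login", 0}, {"compute", 0}}
	defaultAuxPool := "login"

	// If no partition has GPUs, fall back to the default aux pool so the
	// default compute pool resolves to real attributes (and a cpu slot type).
	defaultComputePool := defaultAuxPool
	for _, p := range partitions {
		if p.totalGPUs > 0 {
			defaultComputePool = p.name
			break
		}
	}
	fmt.Println("default compute pool:", defaultComputePool) // "login"
}
```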

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features: conflicts now correctly
return 409, not 500; password sync is supported; externalId is not
required; and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate the SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
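
A small Go sketch of appending relayState to the redirect URI's query string; the parameter name and URL here are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// withRelayState appends relayState to the redirect URI's query string, as
// described above.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	out, err := withRelayState("https://master.example.com/oidc/callback", "cli=1")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out) // https://master.example.com/oidc/callback?relayState=cli%3D1
}
```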

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
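
For illustration, a minimal Go sketch of the generalized matching, assuming hypothetical data shapes; only the idea (any OIDC claim matched against any SCIM attribute, rather than hard-coded email/username) comes from the commit above.

```go
package main

import "fmt"

// findUser matches a configured OIDC claim against a configured SCIM
// attribute; both names are configurable rather than hard-coded.
func findUser(claims map[string]string, claimName, scimAttr string,
	users []map[string]string) (map[string]string, bool) {
	want, ok := claims[claimName]
	if !ok {
		return nil, false
	}
	for _, u := range users {
		if u[scimAttr] == want {
			return u, true
		}
	}
	return nil, false
}

func main() {
	users := []map[string]string{{"userName": "alice", "email": "alice@example.com"}}
	claims := map[string]string{"preferred_username": "alice"}
	if u, ok := findUser(claims, "preferred_username", "userName", users); ok {
		fmt.Println("authenticated:", u["email"])
	}
}
```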

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value that
is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from
the nvidia_smi command as a diagnostic. Generate a WARN if not
as expected. A sketch of the idea follows.
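
A Go sketch of the idea; the real logic lives in the launch wrapper, and the exact DET_SLOT_IDS format shown here is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// GPU_DEVICE_ORDINAL covers both CUDA and ROCm devices, e.g. "0,1,2,3".
	ordinals := os.Getenv("GPU_DEVICE_ORDINAL")
	if ordinals != "" {
		// The exact DET_SLOT_IDS format is assumed here for illustration.
		os.Setenv("DET_SLOT_IDS", "["+ordinals+"]")
	}

	// Diagnostic only: warn when CUDA_VISIBLE_DEVICES disagrees.
	if cuda := os.Getenv("CUDA_VISIBLE_DEVICES"); cuda != "" && cuda != ordinals {
		fmt.Fprintf(os.Stderr,
			"WARN: CUDA_VISIBLE_DEVICES=%q differs from GPU_DEVICE_ORDINAL=%q\n",
			cuda, ordinals)
	}
	fmt.Println("DET_SLOT_IDS =", os.Getenv("DET_SLOT_IDS"))
}
```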

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the tmp directory generation.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set, to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so the TODO remains.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after ListDispatchesByAllocationID and return if there's an error, instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also add the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in place instead of after the fact.
 - Enable /run/determined/workdir to be created in place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable checkpoint GC to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured with SelectType=select/cons_tres and GRES GPU
information, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (those will now come through the expconf directly via the slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the --writable-tmpfs
option). It does not impact /tmp or /dev/shm, which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of Determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some cleanup on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because these variables
include () in their names, they break the SSH variable-escaping
code. Avoid processing of these functions by adding any
variable name containing () to the blacklist.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by dispatcher-wrapper.sh.

Further extend the variable blacklist to those whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some cleanup on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
…
eecsliu added a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features: conflicts now correctly
return 409, not 500; password sync is supported; externalId is not
required; and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate the SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value that
is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from
the nvidia_smi command as a diagnostic. Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the tmp directory generation.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set, to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so the TODO remains.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after ListDispatchesByAllocationID and return if there's an error, instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in place instead of after the fact.
 - Enable /run/determined/workdir to be created in place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured with SelectType=select/cons_tres and GRES GPU
information, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (those will now come through the expconf directly via the slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the --writable-tmpfs
option). It does not impact /tmp or /dev/shm, which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some cleanup on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by dispatcher-wrapper.sh.

Further extend the variable blacklist to those whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some cleanup on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption; the long poll that _preempt.py is performing will
  return, indicating that the job should preempt, and after checkpointing it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady. So, to avoid
a race condition where the dispatcher RM only discovers the running
state by polling, force the state to running upon receipt
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On the startup of the master there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.
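
A Go sketch of that rule; the function and parameter names are assumptions, and only the decision itself (404 means already gone, so forget the row; other errors keep it for retry) comes from the fixes above.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// shouldForgetDispatch: termination succeeded, or the launcher says the
// dispatch is already gone (404), so the DB row can be deleted; any other
// failure keeps the row for a retry on the next startup.
func shouldForgetDispatch(status int, err error) bool {
	if err == nil {
		return true
	}
	return status == http.StatusNotFound
}

func main() {
	fmt.Println(shouldForgetDispatch(http.StatusNotFound, errors.New("not found"))) // true
	fmt.Println(shouldForgetDispatch(0, errors.New("connection refused")))          // false
}
```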

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm.

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB
we have ensured that the DispatchIDs (launches & environment) have
been removed by the launcher.

Use common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development to enable us to
connect to test machines from our desktops via Pulse VPN.
Also update casablanca-login devcluster config to use :8043

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as a root user (the default if
no agent is configured).  Catch this error early in the dispatcher
RM by returning an explicit error on any attempt to impersonate
root.
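
The check itself is small; a sketch under the assumption that the impersonated user's uid/username are known at submission time:

```go
package main

import (
	"errors"
	"fmt"
)

// validateImpersonation rejects root launches in the dispatcher RM up front,
// rather than letting Slurm fail later with "Cannot set UID=0".
func validateImpersonation(uid int, username string) error {
	if uid == 0 || username == "root" {
		return errors.New("running tasks as root is not supported under Slurm; configure a non-root user")
	}
	return nil
}

func main() {
	fmt.Println(validateImpersonation(0, "root"))   // error
	fmt.Println(validateImpersonation(1001, "bob")) // <nil>
}
```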

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch.  These are mostly useless
status updates.  Fixes (sketched below):
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
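
A minimal sketch of that filtering; the event type and the stripped prefix are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

type event struct {
	Level   string // e.g. "DEBUG", "INFO", "WARNING", "ERROR"
	Message string
}

// exitStatusMessages keeps only WARNING/ERROR events, which are the ones
// likely to carry a real failure reason, and trims a noisy prefix.
func exitStatusMessages(events []event) []string {
	var out []string
	for _, e := range events {
		if e.Level == "WARNING" || e.Level == "ERROR" {
			out = append(out, strings.TrimPrefix(e.Message, "slurm: "))
		}
	}
	return out
}

func main() {
	fmt.Println(exitStatusMessages([]event{
		{"INFO", "state=PENDING"},
		{"ERROR", "slurm: launch failed on node n001"},
	}))
}
```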

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package; this fixes an -ee-specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
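
In Go this amounts to rewriting the redirect URI; the query parameter name `relayState` is illustrative:

```go
package main

import (
	"fmt"
	"net/url"
)

// withRelayState appends the relay state to the redirect URI's query string,
// since OIDC (unlike SAML's redirect binding) has no standard slot for it.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	uri, _ := withRelayState("https://master.example.com/oidc/callback", "/det/dashboard")
	fmt.Println(uri)
}
```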

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
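
A sketch of the matching step; the config field names and flat attribute maps are illustrative, not the real schema:

```go
package main

import "fmt"

type oidcConfig struct {
	AuthenticationClaim string // e.g. "preferred_username"
	SCIMAttribute       string // e.g. "userName"
}

// matchUser finds the SCIM user whose configured attribute equals the
// configured claim's value, instead of hard-coding email -> username.
func matchUser(cfg oidcConfig, claims map[string]string, users []map[string]string) (map[string]string, bool) {
	want, ok := claims[cfg.AuthenticationClaim]
	if !ok || want == "" {
		return nil, false
	}
	for _, u := range users {
		if u[cfg.SCIMAttribute] == want {
			return u, true
		}
	}
	return nil, false
}

func main() {
	cfg := oidcConfig{AuthenticationClaim: "preferred_username", SCIMAttribute: "userName"}
	users := []map[string]string{{"userName": "alice", "id": "42"}}
	u, ok := matchUser(cfg, map[string]string{"preferred_username": "alice"}, users)
	fmt.Println(ok, u["id"]) // true 42
}
```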

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary Determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others.  Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration.  Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices.  If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic, and generate a WARN if not
as expected (a sketch of this logic follows).
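
The real logic lives in the shell entrypoint wrappers; a Go sketch of the same derivation and cross-check:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// slotIDs derives the DET_SLOT_IDS list from Slurm's GPU_DEVICE_ORDINAL,
// which covers both CUDA and ROCm devices, and returns a warning when
// CUDA_VISIBLE_DEVICES disagrees (purely diagnostic).
func slotIDs() (ids []string, warn string) {
	ordinal := os.Getenv("GPU_DEVICE_ORDINAL")
	if ordinal == "" {
		return nil, ""
	}
	ids = strings.Split(ordinal, ",")
	if cuda := os.Getenv("CUDA_VISIBLE_DEVICES"); cuda != "" && cuda != ordinal {
		warn = fmt.Sprintf("WARN: CUDA_VISIBLE_DEVICES=%q differs from GPU_DEVICE_ORDINAL=%q", cuda, ordinal)
	}
	return ids, warn
}

func main() {
	os.Setenv("GPU_DEVICE_ORDINAL", "0,1")
	ids, warn := slotIDs()
	fmt.Println(ids, warn)
}
```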

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable checkpoint-gc tasks to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding to master.yaml the
launcher.conf variables that a user may need to set, so they have
a single file to edit to configure Determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for the container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as the container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared.   This is not yet completely
generalized, so we need to update tasks.go to add the links
for now.  Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared.  This was previously
done using a static list of names.  Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined.  Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.
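
A sketch of the dynamic pass (paths and the shared-exception set are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkUnshared creates, under the shared /run/determined directory, a symlink
// for every entry of the per-container copy, skipping read-only entries
// (e.g. command/shell entrypoints) that can safely stay shared.
func linkUnshared(privateDir, runDir string, sharedOK map[string]bool) error {
	entries, err := os.ReadDir(privateDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if sharedOK[e.Name()] {
			continue
		}
		link := filepath.Join(runDir, e.Name())
		if err := os.Symlink(filepath.Join(privateDir, e.Name()), link); err != nil && !os.IsExist(err) {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(linkUnshared("/container-private", "/run/determined",
		map[string]bool{"command-entrypoint.sh": true}))
}
```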

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.
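
A sketch of the mapping with hypothetical enum values:

```go
package main

import "fmt"

// Hypothetical stand-ins for the launcher's dispatch states and
// Determined's job.SchedulingState.
type DispatchState string

type SchedulingState string

const (
	SchedulingQueued    SchedulingState = "QUEUED"
	SchedulingScheduled SchedulingState = "SCHEDULED"
)

// schedulingState maps a dispatch state onto the job queue state so that a
// running dispatch stops reporting as QUEUED.
func schedulingState(ds DispatchState) SchedulingState {
	switch ds {
	case "RUNNING", "COMPLETED", "FAILED":
		return SchedulingScheduled
	default: // PENDING, unknown, etc.
		return SchedulingQueued
	}
}

func main() {
	fmt.Println(schedulingState("PENDING"), schedulingState("RUNNING"))
}
```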

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI.  When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and Slurm
may or may not have compatible configuration enabled.  We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the
local node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined.  This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.
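
In the wrapper this is done in shell, but the gist, as a Go sketch (command args illustrative):

```go
package main

import (
	"os"
	"os/exec"
)

// runMPI clears SLURM_JOBID before invoking mpirun so OpenMPI honors the
// host/process topology passed by horovodrun instead of trying to drive the
// (possibly incompatible) Slurm integration itself.
func runMPI(args ...string) error {
	os.Unsetenv("SLURM_JOBID")
	cmd := exec.Command("mpirun", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	_ = runMPI("--version") // illustrative invocation
}
```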

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the Slurm CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules.  Because the variables
include () in the names, they break the SSH variable escaping
code.  Avoid processing of these functions by adding any
variable name containing () to the blacklist.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from Slurm-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further enhance the blacklist of variables to cover those that include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (see the sketch after this list):
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing it will
  exit with a success status.
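
A sketch of the container-side half; the endpoint path and allocation ID are illustrative, not the real API:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

// watchForPreemption treats SIGTERM as Slurm's warning of pending preemption
// and notifies the master, which then releases the allocation with forced
// preemption so the trial checkpoints and exits successfully.
func watchForPreemption(masterURL, allocationID string) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		url := fmt.Sprintf("%s/api/v1/allocations/%s/pending_preemption", masterURL, allocationID)
		if _, err := http.Post(url, "application/json", nil); err != nil {
			fmt.Fprintf(os.Stderr, "notifying master: %v\n", err)
		}
	}()
}

func main() {
	watchForPreemption("http://master:8080", "alloc-1")
	select {} // keep the process alive for the demo
}
```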

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

so you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which in turn override master-config-level defaults.
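
The precedence chain can be expressed as a one-liner; a generic sketch (Go 1.18+), where nil means "unset":

```go
package main

import "fmt"

// resolve applies the override chain: expconf (if set) beats the pool-level
// default, which beats the master-level default.
func resolve[T any](expconf, pool *T, master T) T {
	if expconf != nil {
		return *expconf
	}
	if pool != nil {
		return *pool
	}
	return master
}

func main() {
	poolIface := "ib0"
	fmt.Println(resolve[string](nil, &poolIface, "eth0")) // ib0: pool override wins
	fmt.Println(resolve[string](nil, nil, "eth0"))        // eth0: master default
}
```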

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs) and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the CPU as the slot type (sketched below).

Also add a devcluster file for shuco for testing.
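
A condensed sketch of the fallback (partition shape illustrative):

```go
package main

import "fmt"

type partition struct {
	Name string
	GPUs int
}

// defaultComputePool returns the first partition with GPUs, falling back to
// the default aux pool when no partition has any, so the "default" pool name
// always resolves to real attributes (and hence CPU slots).
func defaultComputePool(parts []partition, defaultAux string) string {
	for _, p := range parts {
		if p.GPUs > 0 {
			return p.Name
		}
	}
	return defaultAux
}

func main() {
	fmt.Println(defaultComputePool([]partition{{"shared", 0}}, "shared")) // shared
}
```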

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary Determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others.  Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration.  Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices.  If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic, and generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's re…
eecsliu added a commit to eecsliu/determined that referenced this pull request Sep 7, 2022
…or PBS (determined-ai#399)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary Determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others.  Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration.  Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices.  If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic, and generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to:
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding to master.yaml the
launcher.conf variables that a user may need to set, so they have
a single file to edit to configure Determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for the container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as the container-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI.  When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and Slurm
may or may not have compatible configuration enabled.  We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the
local node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined.  This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from Slurm-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist of variables to those whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (sketched below).
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing, it will
  exit with a success status.
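
A hedged sketch of the container-side half of this flow; the endpoint path, payload shape, and the DET_ALLOCATION_ID variable are assumptions, not the actual API:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Slurm sends SIGTERM ahead of preemption; treat it as a notification
	// rather than exiting immediately.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	<-sigs

	// Tell the master the allocation is about to be preempted so it can
	// trigger ReleaseResources with ForcePreemption set.
	body := bytes.NewBufferString(
		`{"allocation_id": "` + os.Getenv("DET_ALLOCATION_ID") + `"}`)
	resp, err := http.Post(
		"http://master:8080/api/v1/allocations/pending_preemption",
		"application/json", body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to notify master:", err)
		return
	}
	resp.Body.Close()
}
```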

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

so you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults (see the sketch below).
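
That precedence can be summarized as a tiny helper; a sketch only, not the project's actual code:

```go
package config

// resolve applies the documented precedence: expconf (if set) overrides the
// pool-level default, which overrides the master-config-level default.
func resolve[T any](expconf, pool, master *T) *T {
	if expconf != nil {
		return expconf
	}
	if pool != nil {
		return pool
	}
	return master
}
```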

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type (sketched below).

Also add a devcluster file for shuco for testing.
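
A sketch of the fallback rule with illustrative types (the real pool bookkeeping differs):

```go
package dispatcher

// partition is an illustrative stand-in for cached Slurm partition info.
type partition struct {
	name string
	gpus int
}

// defaultComputePool returns the default aux pool when no partition has
// GPUs, so requests for pool "" resolve to a pool with real attributes
// and the slot type correctly falls back to cpu.
func defaultComputePool(partitions []partition, defaultAuxPool string) string {
	for _, p := range partitions {
		if p.gpus > 0 {
			return p.name // in this sketch, any GPU-bearing partition will do
		}
	}
	return defaultAuxPool
}
```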

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resources states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady, so to avoid
a race condition where the dispatcher RM only discovers the running
state by polling, force the state to running upon receipt
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting back to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.
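
Combined with the FOUNDENG-41 handling above, the cleanup decision might look like this sketch (names are illustrative):

```go
package dispatcher

import "net/http"

// shouldDeleteDispatch encodes the cleanup rule: a 404 proves the dispatch
// is already gone, so the DB row can be removed; any other error leaves the
// row in place for a retry on the next startup.
func shouldDeleteDispatch(resp *http.Response, err error) bool {
	if err != nil || resp == nil {
		return false // unknown state: keep the row and retry later
	}
	return resp.StatusCode == http.StatusOK ||
		resp.StatusCode == http.StatusNotFound
}
```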

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to ensure that when a Dispatch is dropped from the DB,
the DispatchIDs (launches & environment) have
been removed by the launcher.

Use a common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as a root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error on any attempt to impersonate root
and reporting it explicitly (sketched below).
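
A minimal sketch of such an early guard; the function and its inputs are assumptions:

```go
package dispatcher

import "fmt"

// checkImpersonatedUser rejects root up front, since launched Slurm jobs
// cannot run as the root user; names here are illustrative.
func checkImpersonatedUser(username string, uid int) error {
	if uid == 0 || username == "root" {
		return fmt.Errorf(
			"slurm jobs cannot run as root (user %q, uid %d); configure a non-root user",
			username, uid)
	}
	return nil
}
```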

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes (sketched below):
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
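
A sketch of the filtering, assuming a simple event shape and failure prefix:

```go
package dispatcher

import "strings"

// launcherEvent is an illustrative stand-in for a dispatch event.
type launcherEvent struct {
	Level   string // e.g. INFO, WARNING, ERROR
	Message string
}

// exitStatusMessages keeps only WARNING/ERROR events, which may carry real
// error text, and strips an assumed noisy prefix from failure messages.
func exitStatusMessages(events []launcherEvent) []string {
	var out []string
	for _, e := range events {
		if e.Level != "WARNING" && e.Level != "ERROR" {
			continue // drop routine status updates
		}
		out = append(out, strings.TrimPrefix(e.Message, "Job failure: "))
	}
	return out
}
```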

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package; this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
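
A sketch of that query-string approach; the parameter name is an assumption:

```go
package oidc

import "net/url"

// appendRelayState carries relayState in the redirect URI's query string,
// since OIDC lacks SAML's redirect binding for it; a sketch only.
func appendRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState)
	u.RawQuery = q.Encode()
	return u.String(), nil
}
```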

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
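
A sketch of the generalized matching, with claims and SCIM attributes reduced to string maps for illustration:

```go
package oidc

// matchUser finds the SCIM user whose configured attribute equals the
// configured claim from the OIDC token; field names are illustrative.
func matchUser(claims map[string]string, claimName string,
	users []map[string]string, attributeName string) (map[string]string, bool) {
	want, ok := claims[claimName]
	if !ok {
		return nil, false
	}
	for _, u := range users {
		if u[attributeName] == want {
			return u, true
		}
	}
	return nil, false
}
```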

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected. (A sketch follows.)
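
A sketch of the derivation; the DET_SLOT_IDS format shown is an assumption, and the real logic is shell in dispatcher-wrapper.sh:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// GPU_DEVICE_ORDINAL is a comma-separated list (e.g. "0,1") that works
	// for both CUDA and ROCm devices; derive DET_SLOT_IDS from it.
	ordinal := os.Getenv("GPU_DEVICE_ORDINAL")
	os.Setenv("DET_SLOT_IDS", "["+ordinal+"]")

	// As a diagnostic, warn when CUDA_VISIBLE_DEVICES disagrees; the real
	// wrapper cross-checks the devices against nvidia-smi output.
	if cv := os.Getenv("CUDA_VISIBLE_DEVICES"); cv != "" && cv != ordinal {
		fmt.Fprintln(os.Stderr,
			"WARN: CUDA_VISIBLE_DEVICES does not match GPU_DEVICE_ORDINAL")
	}
}
```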

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

The Dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since that prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address TODOs:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also add the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured with SelectType=select/cons_tres and GRES GPU
information, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set in master.yaml, so they
have a single file to edit to configure Determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable by the owning
user (with the --writable-tmpfs option), and it does not impact /tmp or
/dev/shm, which customers may be using.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links. (Sketched below.)
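
A sketch of the dynamic link setup; directory layout and link direction are illustrative assumptions:

```go
package dispatcher

import (
	"os"
	"path/filepath"
)

// linkUnshared creates a softlink for every entry of /run/determined that
// is not in the shared set, pointing it at the container-private copy;
// read-only entries (e.g. command/shell entrypoints) stay shared.
func linkUnshared(runDir, privateDir string, shared map[string]bool) error {
	entries, err := os.ReadDir(privateDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if shared[e.Name()] {
			continue // leave read-only entries in the shared directory
		}
		link := filepath.Join(runDir, e.Name())
		target := filepath.Join(privateDir, e.Name())
		if err := os.Symlink(target, link); err != nil && !os.IsExist(err) {
			return err
		}
	}
	return nil
}
```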

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container, and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the
local node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it affects the [rank=#] logging prefix, so it improves the user
experience and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because the variables
include () in their names, they break the SSH variable escaping
code. Avoid processing these functions by adding any
variable name containing () to the blacklist (sketched below).
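
The blacklist rule as a sketch, folding in the later extension to names containing % symbols:

```go
package dispatcher

import "strings"

// blacklistedEnvName reports whether a variable should be skipped when
// building the SSH environment: exported shell functions carry "()" in
// their names and break the variable-escaping code, and names containing
// "%" are excluded as well (per the follow-up change).
func blacklistedEnvName(name string) bool {
	return strings.Contains(name, "()") || strings.Contains(name, "%")
}
```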

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist of variables to those whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return indicating that the job should preempt; after checkpointing, it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

so you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are: conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate the SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err !=nil" af…
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Sep 26, 2022
This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Onwer in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change udpates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has an cproto.RunArchive Path value that
that is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it not longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuraiton file as well.
Jobs do not currently succeed with mulitiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured vi  checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate dispather RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUS:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS which should work
for both CUDA and ROCR devices.   If  CUDA_VISIBLE_DEVICES is
set verify that the specified CUDA devices are visible from
nvidia_smi command as a diagnositic.   Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of ! which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon which doesn't support --gpus allocation.
This stil needs more work so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* cleanup the last bad code from when i scaffold-ed this out very quickly
* make some small stylistic go changes as i was reading along
* ticket anything that i noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Not got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err !=nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parenthesis around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress an dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since it prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in archive launch parameter to enable:
 - Enable `set -e` in the dispatcher-wrapper.sh  script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems withoug -G support (#159)

We still had a hack in place form our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* Hal 2793:  Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892:  Use / as contianer-private directory (#164)

Amazingly / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives.   It is private to the container and writable (with --writable-tmpfs
option).  It does not impact /tmp or /dev/shm which customers may be using,
and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it affects the [rank=#] logging prefix, so it improves the user experience
and debugging.
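
In Go terms, the environment handling amounts to the sketch below; the real logic lives in the entrypoint scripts, so this is purely illustrative:

```go
package main

import (
	"fmt"
	"os"
)

// prepareMPIEnv hides SLURM from mpirun, so it honors the host and
// topology args horovodrun passes, and resolves a rank for the
// [rank=#] logging prefix.
func prepareMPIEnv() (rank string) {
	// Without SLURM_JOBID, OpenMPI skips its Slurm launch integration,
	// which may not be compatible inside the singularity container.
	os.Unsetenv("SLURM_JOBID")

	if r, ok := os.LookupEnv("HOROVOD_RANK"); ok {
		return r
	}
	// Fall back to OpenMPI's own rank variable.
	if r, ok := os.LookupEnv("OMPI_COMM_WORLD_RANK"); ok {
		return r
	}
	return "0"
}

func main() {
	fmt.Printf("[rank=%s] ready to exec mpirun\n", prepareMPIEnv())
}
```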

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist of variables to those whose names include % symbols.
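
A sketch of that filtering (Go for illustration; the real filtering happens in dispatcher-wrapper.sh):

```go
package main

import (
	"fmt"
	"strings"
)

// keepVar reports whether a NAME=VALUE environment entry is safe to
// propagate: exported shell functions carry "()" in their names, and
// '%' breaks escaping, so both land on the blacklist.
func keepVar(entry string) bool {
	name := entry
	if i := strings.IndexByte(entry, '='); i >= 0 {
		name = entry[:i]
	}
	return !strings.Contains(name, "()") && !strings.ContainsRune(name, '%')
}

func main() {
	fmt.Println(keepVar("DET_SLOT_IDS=0,1"))              // true
	fmt.Println(keepVar("BASH_FUNC_module%%()=() { :;}")) // false
}
```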

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some cleanup on the dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (container side sketched below):
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return, indicating that the job should preempt; after checkpointing, it will
  exit with a success status.
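
A compressed sketch of the container-side half; the endpoint path and the DET_ALLOCATION_ID variable are assumptions for illustration, not the exact API:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

// notifyPendingPreemption tells the master that Slurm is about to
// preempt this allocation (endpoint path is hypothetical).
func notifyPendingPreemption(master, allocationID string) error {
	url := fmt.Sprintf("%s/api/v1/allocations/%s/pending_preemption", master, allocationID)
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	sigs := make(chan os.Signal, 1)
	// Slurm delivers SIGTERM ahead of preemption; treat it as "start
	// checkpointing and exit cleanly", not as an immediate kill.
	signal.Notify(sigs, syscall.SIGTERM)
	<-sigs
	if err := notifyPendingPreemption("http://master:8080", os.Getenv("DET_ALLOCATION_ID")); err != nil {
		fmt.Fprintln(os.Stderr, "failed to report pending preemption:", err)
	}
}
```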

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.
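
The precedence reduces to a first-non-nil lookup, roughly as below (pointer-to-string is a simplification of the real config types):

```go
package main

import "fmt"

// resolve applies the override order described above: expconf wins,
// then the pool-level default, then the master-level default.
func resolve(expconf, pool, master *string) string {
	for _, v := range []*string{expconf, pool, master} {
		if v != nil {
			return *v
		}
	}
	return ""
}

func main() {
	poolDefault, masterDefault := "gpu-partition-default", "master-default"
	// No expconf override here, so the pool-level value wins.
	fmt.Println(resolve(nil, &poolDefault, &masterDefault)) // gpu-partition-default
}
```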

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.
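
A minimal sketch of the fallback, with a simplified partition type standing in for the Slurm partition info the RM caches:

```go
package main

import "fmt"

type partition struct {
	name string
	gpus int
}

// defaultComputePool falls back to the aux pool when no partition has
// GPUs, so the default pool name always matches real pool attributes
// and the slot type can resolve to cpu.
func defaultComputePool(parts []partition, auxPool string) string {
	for _, p := range parts {
		if p.gpus > 0 {
			return p.name
		}
	}
	return auxPool
}

func main() {
	parts := []partition{{name: "defq", gpus: 0}}
	fmt.Println(defaultComputePool(parts, "defq")) // defq
}
```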

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resources states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady. So, to avoid
a race condition where the dispatcher RM only discovers the running
state by polling, force the state to running upon receipt
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Always use the default partition for the aux partition.
 - Use the default partition for the GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On master startup there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting back to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404), then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm.

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.
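
Taken together with the startup cleanup above, the decision logic is roughly the sketch below; the function signatures are hypothetical stand-ins for the real launcher client and DB layer:

```go
package main

import (
	"fmt"
	"net/http"
)

// cleanupDispatch terminates a dispatch recorded in the DB. The row is
// removed only when the launcher confirms the dispatch is gone: either
// the terminate succeeded or the launcher answered 404. Any other
// error leaves the row so cleanup is retried on the next startup.
func cleanupDispatch(
	terminate func(id string) (status int, err error),
	deleteRow func(id string) error,
	id string,
) error {
	status, err := terminate(id)
	if err != nil && status != http.StatusNotFound {
		return fmt.Errorf("deferring cleanup of %s: %w", id, err)
	}
	return deleteRow(id)
}

func main() {
	terminate := func(id string) (int, error) { return http.StatusNotFound, fmt.Errorf("dispatch not found") }
	deleteRow := func(id string) error { fmt.Println("deleted DB row for", id); return nil }
	_ = cleanupDispatch(terminate, deleteRow, "dispatch-123") // a 404 still deletes the row
}
```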

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB
we have ensured that the DispatchIDs (launches & environment) have
been removed by the launcher.

Use a common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleaning up launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
 - Small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 in development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043.

* fix: Fix parameter order in Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as the root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error if an attempt is made to impersonate root,
and report it explicitly.
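
A minimal sketch of the early check (function name invented for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// validateImpersonatedUser rejects root up front: Slurm will refuse
// UID 0 later anyway, and failing early yields a clear message
// instead of a launcher-side "Cannot set UID=0" error.
func validateImpersonatedUser(uid int) error {
	if uid == 0 {
		return errors.New("slurm jobs cannot run as root; link the Determined user to a non-root agent user")
	}
	return nil
}

func main() {
	fmt.Println(validateImpersonatedUser(0))
}
```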

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes (sketched below):
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
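
Roughly, the filtering looks like this sketch; the event shape and noisy prefix are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

type event struct {
	level   string // e.g. INFO, WARNING, ERROR
	message string
}

// exitStatusMessages keeps only WARNING/ERROR events and strips a
// noisy fixed prefix, so the surfaced failure reason stays short.
func exitStatusMessages(events []event, noisyPrefix string) []string {
	var out []string
	for _, e := range events {
		if e.level == "WARNING" || e.level == "ERROR" {
			out = append(out, strings.TrimPrefix(e.message, noisyPrefix))
		}
	}
	return out
}

func main() {
	events := []event{
		{"INFO", "launcher: state changed to RUNNING"},
		{"ERROR", "launcher: sbatch: error: invalid partition specified"},
	}
	fmt.Println(exitStatusMessages(events, "launcher: "))
}
```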

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package; this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
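
A sketch of that query-string approach using net/url; the relayState parameter name here is an assumption:

```go
package main

import (
	"fmt"
	"net/url"
)

// withRelayState appends the relay state to the OIDC redirect URI's
// query string, since OIDC has no SAML-style relay-state binding.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, _ := withRelayState("https://master.example.com/oidc/callback", "cli=true")
	fmt.Println(s) // https://master.example.com/oidc/callback?relayState=cli%3Dtrue
}
```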

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
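
A sketch of the claim-to-attribute matching, with maps standing in for the real token claims and SCIM records:

```go
package main

import "fmt"

// findUser matches a configured OIDC claim against a configured SCIM
// attribute instead of hard-coding email-to-username.
func findUser(
	claims map[string]string, claimName string,
	users []map[string]string, attrName string,
) (map[string]string, bool) {
	want, ok := claims[claimName]
	if !ok {
		return nil, false
	}
	for _, u := range users {
		if u[attrName] == want {
			return u, true
		}
	}
	return nil, false
}

func main() {
	claims := map[string]string{"preferred_username": "jdoe"}
	users := []map[string]string{{"userName": "jdoe", "displayName": "J. Doe"}}
	u, ok := findUser(claims, "preferred_username", users, "userName")
	fmt.Println(ok, u["displayName"]) // true J. Doe
}
```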

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuration file as well.
Jobs do not currently succeed with multiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable checkpoint volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible via the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected.
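
For illustration, the derivation amounts to the Go sketch below; the real logic lives in the wrapper script, and the output format is invented:

```go
package main

import (
	"fmt"
	"os"
)

// slotIDs derives a DET_SLOT_IDS value from GPU_DEVICE_ORDINAL, which
// Slurm sets for both CUDA and ROCR devices.
func slotIDs() string {
	if ordinals := os.Getenv("GPU_DEVICE_ORDINAL"); ordinals != "" {
		return "[" + ordinals + "]"
	}
	return "[0]" // CPU-only fallback; format is illustrative
}

func main() {
	os.Setenv("GPU_DEVICE_ORDINAL", "0,1,3")
	fmt.Println(slotIDs()) // [0,1,3]
}
```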

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of !, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed a lint complaint because the indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address TODOs:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also support the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: the slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured with SelectType=select/cons_tres and GRES GPU
information, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (those will now come through the expconf directly via the slurm
option there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml, so they
have a single file to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the
--writable-tmpfs option). It does not impact /tmp or /dev/shm, which
customers may be using, and it is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for the command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some cleanup on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a singularity container and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking mpi so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise mpi ends up wanting to launch all -np# processes on the local
node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it affects the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.
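
The batch-level behavior can be sketched with a cancellable context standing in for "terminate the other containers"; the real mechanism is in the generated batch script, so this is illustrative only:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runAll launches every per-node task and, as soon as one fails,
// cancels the rest instead of letting them wait forever.
func runAll(tasks []func(ctx context.Context) error) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	errs := make(chan error, len(tasks))
	for _, t := range tasks {
		t := t
		go func() { errs <- t(ctx) }()
	}
	var firstErr error
	for range tasks {
		if err := <-errs; err != nil && firstErr == nil {
			firstErr = err
			cancel() // terminate the survivors with an error
		}
	}
	return firstErr
}

func main() {
	waits := func(ctx context.Context) error {
		select {
		case <-ctx.Done():
			return nil // cancelled by a peer's failure
		case <-time.After(time.Hour):
			return nil
		}
	}
	fails := func(ctx context.Context) error { return fmt.Errorf("bad horovodrun args") }
	fmt.Println(runAll([]func(context.Context) error{waits, fails})) // bad horovodrun args
}
```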

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because these variables
include () in their names, they break the SSH variable escaping
code. Avoid processing these functions by adding any
variable name containing () to the blacklist.

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist of variables to those whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some cleanup on the dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption: true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return, indicating that the job should preempt; after checkpointing, it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on the dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now
correctly return 409 not 500, password sync is supported, externalId is
not required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determined if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve its named deps,
but in CI we haven't run apt-get update, so it can't find them.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

eecsliu added a commit to eecsliu/determined that referenced this pull request Sep 26, 2022
* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are conflicts now correctly
returns 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determined if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve the named deps of
it but, in CI we haven't run apt-get update so we can't find it.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
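
A minimal Go sketch of that query-string approach, using only the standard library; the `relayState` parameter name comes from the text above, while the URLs are illustrative:

```go
package main

import (
	"fmt"
	"net/url"
)

// withRelayState appends relayState to a redirect URI's query string,
// since OIDC has no redirect-binding equivalent to carry it.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, _ := withRelayState("https://master.example.com/oidc/callback", "/det/experiments/1")
	fmt.Println(s) // https://master.example.com/oidc/callback?relayState=%2Fdet%2Fexperiments%2F1
}
```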

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per-partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage YAML configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible to the
nvidia-smi command as a diagnostic. Generate a WARN if they are not
as expected.
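
For illustration, a minimal Go sketch of that derivation and cross-check, assuming Slurm's comma-separated ordinals; the real logic lives in the shell entrypoint wrapper, so this only models the behavior, and the DET_SLOT_IDS format shown is an assumption:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	ordinals := os.Getenv("GPU_DEVICE_ORDINAL") // e.g. "0,1,2,3"
	ids := strings.Split(ordinals, ",")
	os.Setenv("DET_SLOT_IDS", "["+strings.Join(ids, ",")+"]") // format assumed

	// Diagnostic only: warn if CUDA_VISIBLE_DEVICES disagrees with the
	// ordinals we used.
	if cuda := os.Getenv("CUDA_VISIBLE_DEVICES"); cuda != "" && cuda != ordinals {
		fmt.Fprintf(os.Stderr,
			"WARN: CUDA_VISIBLE_DEVICES=%q differs from GPU_DEVICE_ORDINAL=%q\n",
			cuda, ordinals)
	}
	fmt.Println("DET_SLOT_IDS =", os.Getenv("DET_SLOT_IDS"))
}
```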

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of !, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8.
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via the slurm
option there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding to master.yaml the
launcher.conf variables that a user may need to set, so they have
a single file to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the
--writable-tmpfs option). It does not impact /tmp or /dev/shm, which
customers may be using, and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some cleanup on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a Singularity container, and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the
local node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user
experience and debugging.
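
A minimal Go model of those two environment tweaks, assuming a hypothetical wrapper that runs before mpirun/horovodrun is invoked (the real code is elsewhere; this only demonstrates the env handling):

```go
package main

import "os"

func prepareMPIEnv() {
	// Clear SLURM_JOBID so mpirun honors the host/topology args from
	// horovodrun instead of enabling its own Slurm integration.
	os.Unsetenv("SLURM_JOBID")

	// Fall back to OMPI_COMM_WORLD_RANK for the [rank=N] log prefix when
	// HOROVOD_RANK is not defined.
	if _, ok := os.LookupEnv("HOROVOD_RANK"); !ok {
		if r, ok := os.LookupEnv("OMPI_COMM_WORLD_RANK"); ok {
			os.Setenv("HOROVOD_RANK", r)
		}
	}
}

func main() { prepareMPIEnv() }
```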

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by dispatcher-wrapper.sh.

Further enhance the blacklist of variables to include those with % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some cleanup on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which could lead to jobs
conflicting over the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (see the sketch after this list).
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll that _preempt.py is performing
  will return, indicating that the job should preempt; after checkpointing, it will
  exit with a success status.
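
A sketch of the container-side half of this flow; the master URL and endpoint path here are assumptions, not the actual API shape:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM) // Slurm's pending-preemption warning

	<-sigs
	// Tell the master the allocation is about to be preempted; the long
	// poll in _preempt.py then tells the trial to checkpoint and exit.
	url := "http://master.example.com:8080/api/v1/allocations/1234/pending_preemption"
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to notify master:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("notified master of pending preemption")
}
```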

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches Determined agents: expconf settings (if available) override pool-level defaults, which in turn override master-config-level defaults.
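
To make the precedence concrete, a minimal Go sketch of that three-level resolution; the values and types are illustrative, not the actual dispatcher RM code:

```go
package main

import "fmt"

// resolve returns the first configured value in priority order:
// expconf, then pool default, then master default.
func resolve(expconf, pool, master *string) string {
	for _, v := range []*string{expconf, pool, master} {
		if v != nil {
			return *v
		}
	}
	return ""
}

func main() {
	masterDefault := "cuda"
	poolDefault := "cpu"
	fmt.Println(resolve(nil, &poolDefault, &masterDefault)) // "cpu": pool overrides master
	fmt.Println(resolve(nil, nil, &masterDefault))          // "cuda": master default applies
}
```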

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e., it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the CPU as the slot type; see the sketch below.

Also add a devcluster file for shuco for testing.
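
A small Go sketch of that fallback rule, with invented names and a simplified notion of "default":

```go
package main

import "fmt"

type partition struct {
	name string
	gpus int
}

// defaultComputePool picks a GPU partition when one exists; otherwise it
// falls back to the default aux pool and schedules on CPU slots.
func defaultComputePool(parts []partition, defaultAux string) (pool, slotType string) {
	for _, p := range parts {
		if p.gpus > 0 {
			return p.name, "cuda"
		}
	}
	return defaultAux, "cpu" // no GPUs anywhere: reuse the aux pool
}

func main() {
	pool, slot := defaultComputePool([]partition{{"debug", 0}, {"batch", 0}}, "batch")
	fmt.Println(pool, slot) // batch cpu
}
```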

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady, so to avoid
a race condition where the dispatcher RM only discovers the running
state by polling, force the state to running upon reception
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with Determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as Determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting to our previous YAML parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to ensure that when a Dispatch is dropped from the DB,
the DispatchIDs (launches & environment) have been removed by
the launcher.

Use a common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).
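
A sketch of the "404 is also a success" rule, using standard net/http status codes; the launcher client call itself is elided:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// finishCleanup treats a 404 from the launcher as a successful deletion;
// any other failure leaves the Dispatch in the DB so it is re-processed
// on the next startup.
func finishCleanup(status int, callErr error) error {
	if callErr != nil && status != http.StatusNotFound {
		// Unsure whether the dispatch still exists: defer DB removal.
		return fmt.Errorf("deferring dispatch removal: %w", callErr)
	}
	return nil // deleted, or already gone (404): safe to drop the DB row
}

func main() {
	fmt.Println(finishCleanup(http.StatusNotFound, errors.New("not found"))) // <nil>
	fmt.Println(finishCleanup(http.StatusBadGateway, errors.New("proxy error")))
}
```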

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update the casablanca-login devcluster config to use :8043

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as a root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error if there is an attempt to impersonate root,
and report it explicitly.

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes (see the sketch after this list):
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
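
A minimal Go sketch of that filtering; the event type, level names, and prefix are assumed for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

type launcherEvent struct {
	Level   string // e.g. "INFO", "WARNING", "ERROR"
	Message string
}

// reportableMessages keeps only WARNING/ERROR events and trims a noisy
// prefix so failure messages read cleanly.
func reportableMessages(events []launcherEvent) []string {
	var out []string
	for _, e := range events {
		if e.Level != "WARNING" && e.Level != "ERROR" {
			continue // drop routine status updates
		}
		out = append(out, strings.TrimPrefix(e.Message, "launcher: "))
	}
	return out
}

func main() {
	evs := []launcherEvent{
		{"INFO", "launcher: dispatch queued"},
		{"ERROR", "launcher: sbatch: error: invalid partition"},
	}
	fmt.Println(reportableMessages(evs)) // [sbatch: error: invalid partition]
}
```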

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per-partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage YAML configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible to the
nvidia-smi command as a diagnostic. Generate a WARN if they are not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of !, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generate tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8.
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

Dispatcher RM does not support:
job.SetGroupWeight, job.SetGroupPriority, job.MoveJob
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable them to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: The slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If SLURM is not configured for SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via the slurm
option there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding to master.yaml the
launcher.conf variables that a user may need to set, so they have
a single file to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the
--writable-tmpfs option). It does not impact /tmp or /dev/shm, which
customers may be using, and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space, we
need to set up softlinks for each element within the /run/determined
directory that is not treated as shared. This was previously
done using a static list of names. Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined. Also add exceptions for command and
shell entrypoints, which are read-only, so we can leave them in the
shared /run/determined directory without setting up links.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some cleanup on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When invoked, the SLURM_JOBID variable
triggers integration with SLURM; however, we are running in a Singularity container, and SLURM
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args passed to it via horovodrun describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the
local node, causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it impacts the [rank=#] logging prefix, so it improves the user
experience and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the SLURM CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because those variable
names include (), they break the SSH variable escaping
code. Avoid processing these functions by adding any
variable name containing () to the blacklist; see the sketch below.
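
A minimal Go sketch of that blacklist, also folding in the %-symbol exclusion mentioned in the later entrypoint work; the NAME=value entry format matches os.Environ(), and the sample entry is illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// sshSafeEnv drops exported shell functions (names containing "()") and
// names with % before the environment is passed over SSH, since those
// break the variable escaping code.
func sshSafeEnv(env []string) []string {
	var out []string
	for _, kv := range env {
		name := strings.SplitN(kv, "=", 2)[0]
		if strings.Contains(name, "()") || strings.Contains(name, "%") {
			continue // module-system functions and the like
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{"PATH=/usr/bin", "module()=() { eval ...; }"}
	fmt.Println(sshSafeEnv(env)) // [PATH=/usr/bin]
}
```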

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from the SLURM-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by dispatcher-wrapper.sh.

Further enhance the blacklist of variables to include those with % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some cleanup on the
dispatcher side when an experiment is deleted.

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job-specific task
setup code that redirects the stdin/stdout link via /tmp (which
singularity maps to /tmp on the host, and which could lead to jobs
conflicting over the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API.
- Add the pending_preemption API on the master.
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll that _preempt.py is performing
  will return, indicating that the job should preempt; after checkpointing, it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches Determined agents: expconf settings (if available) override pool-level defaults, which in turn override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e., it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when no partition has GPUs, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using the CPU as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are conflicts now correctly
returns 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determined if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve the named deps of
it but, in CI we haven't run apt-get update so we can't find it.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Onwer in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change udpates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has an cproto.RunArchive Path value that
that is not / like the others.   Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it not longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it out of the configuraiton file as well.
Jobs do not currently succeed with mulitiple slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI Checkpointing uses a shared mounted file system for storage as
configured vi  checkpoint_storage yaml configuration.   Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected.
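
A minimal sketch of the derivation described above, assuming DET_SLOT_IDS takes a bracketed comma-separated form; the helper is illustrative, and the real logic lives in the task wrapper scripts rather than in Go:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// deriveSlotIDs turns the Slurm-provided GPU_DEVICE_ORDINAL list
// (e.g. "0,1,2,3") into a DET_SLOT_IDS value, assumed here to be a
// bracketed list like "[0,1,2,3]".
func deriveSlotIDs() (string, bool) {
	ordinals := os.Getenv("GPU_DEVICE_ORDINAL")
	if ordinals == "" {
		return "", false
	}
	// Diagnostic only: warn (do not fail) if CUDA_VISIBLE_DEVICES disagrees.
	if cuda := os.Getenv("CUDA_VISIBLE_DEVICES"); cuda != "" && cuda != ordinals {
		fmt.Fprintf(os.Stderr,
			"WARN: CUDA_VISIBLE_DEVICES=%q differs from GPU_DEVICE_ORDINAL=%q\n",
			cuda, ordinals)
	}
	ids := strings.Split(ordinals, ",")
	return "[" + strings.Join(ids, ",") + "]", true
}

func main() {
	if ids, ok := deriveSlotIDs(); ok {
		os.Setenv("DET_SLOT_IDS", ids)
		fmt.Println("DET_SLOT_IDS =", ids)
	}
}
```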

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.
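
A hedged sketch of what such persistence queries might look like; the table name, columns, and helper functions here are assumptions for illustration, not the actual schema (driver registration and connection setup are omitted):

```go
package dispatches

import (
	"context"
	"database/sql"
)

// dispatch is an illustrative row type mapping an allocation to a dispatch.
type dispatch struct {
	DispatchID   string
	AllocationID string
}

// insertDispatch records the allocation -> dispatch mapping so that logs can
// later be located by allocation ID.
func insertDispatch(ctx context.Context, db *sql.DB, d dispatch) error {
	_, err := db.ExecContext(ctx,
		`INSERT INTO dispatches (dispatch_id, allocation_id) VALUES ($1, $2)`,
		d.DispatchID, d.AllocationID)
	return err
}

// listDispatchesByAllocationID mirrors the lookup direction used when
// fetching logs for an allocation.
func listDispatchesByAllocationID(ctx context.Context, db *sql.DB, allocID string) ([]string, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT dispatch_id FROM dispatches WHERE allocation_id = $1`, allocID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```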

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of `!`, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's re…
eecsliu added a commit to eecsliu/determined that referenced this pull request Sep 26, 2022
…or PBS (determined-ai#399)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support from v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now correctly
return 409 instead of 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determined if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named
deps, but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Feature branch commit history:

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of `!`, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

The dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: the slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If Slurm is not configured with SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the
--writable-tmpfs option). It does not impact /tmp or /dev/shm, which
customers may be using, and is surprisingly writable by the owning user.

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When mpirun is invoked, the SLURM_JOBID variable
triggers its Slurm integration; however, we are running in a Singularity container, and Slurm
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args horovodrun passes to it describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the local node,
causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it affects the [rank=#] logging prefix, so it improves the user experience
and debugging.
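
A sketch of this env-scrubbing idea in Go (the real fix lives in the task wrapper scripts, not in Go): drop SLURM_JOBID before exec'ing mpirun, and prefer HOROVOD_RANK with an OMPI_COMM_WORLD_RANK fallback for the logging prefix:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// runMPI launches mpirun with SLURM_JOBID scrubbed from the environment so
// that mpirun honors the host/topology args horovodrun passed to it instead
// of detecting a (possibly incompatible) Slurm integration inside the
// Singularity container.
func runMPI(args []string) error {
	var env []string
	for _, kv := range os.Environ() {
		if strings.HasPrefix(kv, "SLURM_JOBID=") {
			continue // its mere presence triggers OpenMPI's Slurm integration
		}
		env = append(env, kv)
	}
	cmd := exec.Command("mpirun", args...)
	cmd.Env = env
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// rank prefers HOROVOD_RANK and falls back to OMPI_COMM_WORLD_RANK, which
// only affects the [rank=#] logging prefix, not correctness.
func rank() string {
	if r := os.Getenv("HOROVOD_RANK"); r != "" {
		return r
	}
	return os.Getenv("OMPI_COMM_WORLD_RANK")
}

func main() {
	fmt.Println("rank:", rank())
	_ = runMPI([]string{"--version"})
}
```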

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from Slurm-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist of variables to those whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job specific task
setup code that redirects the stdin/stdout link via /tmp (which
Singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption (a sketch of the container side follows below).
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return, indicating that the job should preempt; after checkpointing, it will
  exit with a success status.
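
A minimal sketch of the container-side half of this flow; the exact pending_preemption route shape and the DET_MASTER / DET_ALLOCATION_ID variables are assumptions for illustration:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM) // Slurm sends SIGTERM before preempting

	go func() {
		<-sigs
		// Report pending preemption to the master; the route and env var
		// names here are assumptions, not the actual API surface.
		url := os.Getenv("DET_MASTER") + "/allocations/" +
			os.Getenv("DET_ALLOCATION_ID") + "/pending_preemption"
		if resp, err := http.Post(url, "application/json", nil); err == nil {
			resp.Body.Close()
		}
		fmt.Println("pending preemption reported; trial should checkpoint and exit 0")
	}()

	select {} // stand-in for the actual workload
}
```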

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults (see the sketch below).
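
The precedence chain can be pictured with a small helper; the function is illustrative, not the actual resolution code:

```go
package main

import "fmt"

// resolve applies the documented precedence: expconf beats the partition
// (pool) default, which beats the master-config default. nil means "unset".
func resolve(expconf, pool, master *string, fallback string) string {
	for _, v := range []*string{expconf, pool, master} {
		if v != nil {
			return *v
		}
	}
	return fallback
}

func main() {
	poolVal := "gpuimage:latest"
	// expconf unset, pool set: the pool-level override wins.
	fmt.Println(resolve(nil, &poolVal, nil, "defaultimage"))
}
```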

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.
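
A sketch of the fallback rule above, under the assumption that per-partition GPU counts and a default aux pool name are already known; the types are illustrative:

```go
package main

import "fmt"

// partition is a stand-in for the dispatcher RM's per-partition info.
type partition struct {
	name      string
	gpus      int
	isDefault bool
}

// defaultComputePool prefers a GPU-bearing partition, but when no partition
// has GPUs it falls back to the default aux pool and switches the slot type
// to cpu, so requests for the default pool resolve to real attributes.
func defaultComputePool(parts []partition, defaultAuxPool string) (string, string) {
	for _, p := range parts {
		if p.gpus > 0 {
			return p.name, "cuda"
		}
	}
	return defaultAuxPool, "cpu"
}

func main() {
	pools := []partition{{"login", 0, false}, {"compute", 0, true}}
	name, slot := defaultComputePool(pools, "compute")
	fmt.Println(name, slot) // compute cpu
}
```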

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: Refinements to Dispatcher/SLURM logs implementations (#191)

* fix: FOUNDENG-14 generated ResourcesStateChanged events (#202)

Generate ResourcesStateChanged to cover running and other
resource states from the dispatcher RM.

AllocationReady comes from the container for shells, notebooks,
and tensorboards, but the clients also depend upon the state
being running immediately after AllocationReady, so to avoid
a race condition where the dispatcher RM only discovers the running
state later by polling, force the state to running upon reception
of the AllocationReady event.

* fix: avoid potential race between AllocationReady and Running state (#206)

Pull in oss version of this fix to avoid merge conflict.

* chore: Add devcluster files for casablanca-login and horizon (#207)

Custom devcluster files to enable simple testing for
two new test machines.

* chore: fix tests (#204)

* chore: cleanup stray stuff (#208)

* chore: rename iface configs for consistency (#210)

* fix: FOUNDENG-40 Favor default partition in all cases unless no GPUs (#213)

It would be preferable to favor the default partition explicitly.

New algorithm:

 - Use default partition always for aux partition
 - Use default partition for GPU partition (unless it has no GPUs).

Updated unit tests.

* chore: Improve branding/naming of dispatcher/launcher (#212)

For customer-visible purposes, use "Slurm" when referring to the scheduling
system that is managed by the dispatcher RM internally.

* fix: FOUNDENG-41 Cancel running SLURM jobs on restart (#215)

On startup of the master there may be ongoing jobs on the cluster
associated with determined jobs. Terminate them all and remove them
from the DB to prevent duplicate jobs as determined starts up new jobs,
which will continue from the last checkpoint.

* fix: FOUNDENG-43 Change in yaml parser breaks default partitions (#217)

Unblock testing by reverting to our previous yaml parser
to restore default partition identification.

* fix: correctly parse resource pool info with ghodss/yaml package [FOUNDENG-45] (#218)

* fix: handle nil response from dispatch terminate (#219)

* fix: FOUNDENG-41 properly handle missing dispatches (#221)

When cleaning up dispatches from the DB on startup, if the
dispatch does not exist (404) then still delete the DB entry
so we don't attempt to clean it up again on restart.
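
Taken together with the follow-up fix later in this history (keep the row when the error is not a 404), the cleanup rule can be sketched like this; the launcher call and DB delete are hypothetical stand-ins:

```go
package main

import (
	"fmt"
	"net/http"
)

// cleanupDispatch applies the startup cleanup rule: a 404 from the launcher
// means the dispatch is already gone, so the DB row can be deleted; any
// other failure leaves the row so a later restart pass retries the cleanup.
func cleanupDispatch(id string,
	terminateDispatch func(id string) (int, error), // hypothetical launcher call
	deleteRow func(id string) error, // hypothetical DB delete
) error {
	code, err := terminateDispatch(id)
	if err == nil || code == http.StatusNotFound {
		// Terminated, or already gone (404): safe to drop our record so we
		// don't attempt to clean it up again on every restart.
		return deleteRow(id)
	}
	// Unknown state (e.g. launcher unreachable): keep the row for later.
	return fmt.Errorf("dispatch %s not cleaned up: %w", id, err)
}

func main() {
	_ = cleanupDispatch("abc123",
		func(string) (int, error) { return http.StatusNotFound, fmt.Errorf("not found") },
		func(id string) error { fmt.Println("deleted", id); return nil },
	)
}
```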

* chore: fill in DET_UNIQUE_PORT_OFFSET (#220)

* feat: FOUNDENG-24 Improve branding/naming of dispatcher/launcher (#216)

Change the resource_manager.type value that customers provide
from dispatcher to slurm

* Revert "chore: fill in DET_UNIQUE_PORT_OFFSET (#220)"

This reverts commit d3de26b7d652bf15eb28ee0d9a6b2c5113efcaa3.

* chore: fill in DET_UNIQUE_PORT_OFFSET, with defaults (#225)

* fix: FOUNDENG-41 If we get an error not 404, don't delete from DB (#222)

If we get a nil response on the error path, we do not know for sure
that the dispatch does not exist, so leave it in the DB for a later
attempt at cleanup.

* chore: Saturate Determined's `GET /api/v1/agents` (#226)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM

* fix: FOUNDENG-49 Slurm Logs and scripts are cleaned up (#228)

Changes to make sure that when a Dispatch is dropped from the DB
we have ensured that the DispatchIDs (launches & environment) have
been removed by the launcher.

Use a common method for cleanup in both cases, and enhance
the method to verify deletion (404 is also considered a success).
If we hit any failure in cleanup of launcher data, we defer the
removal of the Dispatch until a later time (they are re-processed
on startup).

* chore: Saturate Determined's `GET /api/v1/agents` API (#229)

* Saturate Determined's `GET /api/v1/agents` API

for Dispatcher RM.
  o  small tweak to mark the agents as not available if Slurm has them as allocated.

* chore: Enable security.tls settings for slurm RM (#200)

We want to enable use of https:8043 for development so that we can
connect to test machines from our desktops via Pulse VPN.
Also update casablanca-login devcluster config to use :8043

* fix: Fix parameter order Dispatch cleanup (#230)

The parameter order to the resourceQueryPostActions cleanup
method was reversed, so the Dispatch was not properly removed.

* fix: FOUNDENG-48 `ERROR: Cannot set UID=0` (#231)

Launched Slurm jobs cannot be run as the root user (the default if
no agent is configured). Catch this error early in the dispatcher
RM by returning an error on any attempt to impersonate root,
and report it explicitly.

* fix: FOUNDENG-25 Dispatcher job error reporting is sub-standard (#232)

When reporting exit status messages, the dispatcher RM includes
all events that happen on the dispatch. These are mostly useless
status updates. Fixes (see the sketch below):
- Include only WARNING/ERROR level messages in the reporting;
  those potentially have real error messages in them.
- Filter the prefix for job failure messages to make them cleaner.
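
A sketch of this filtering rule; the event type and the trimmed prefix string are illustrative, not the launcher's actual records:

```go
package main

import (
	"fmt"
	"strings"
)

// event is a stand-in for the launcher's dispatch event records.
type event struct {
	Level   string // "INFO", "WARNING", "ERROR", ...
	Message string
}

// exitStatusMessages keeps only WARNING/ERROR events and trims a noisy
// failure prefix, so exit reporting surfaces real error text.
func exitStatusMessages(events []event) []string {
	var out []string
	for _, e := range events {
		if e.Level != "WARNING" && e.Level != "ERROR" {
			continue // drop routine status updates
		}
		out = append(out, strings.TrimPrefix(e.Message, "Job failed: "))
	}
	return out
}

func main() {
	fmt.Println(exitStatusMessages([]event{
		{"INFO", "launched"},
		{"ERROR", "Job failed: image not found"},
	}))
}
```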

* chore: New 'job_storage_root' field in master.yaml needs to be added to DAI (#233)

* chore: FOUNDENG-26 Implement batch processing for Dispatcher/SLURM logs (#211)

* chore: FOUNDENG-51 update dispatcher monitor to handle 404 status for cancelled experiments (#236)

* fix: update master GoReleaser config to account for new binary

* chore: enable hpe branding (#79)

* chore: publish EE docs to hpecaide.determined.ai

* chore: rebrand docs content and theme for HPE before publishing

* chore: push EE Docker images with `hpecaide` names

* fix: update reference to config package (#80)

* fix: update reference to config package

https://github.com/determined-ai/determined/pull/3175 moved config into
its own package, this fixes an -ee specific reference which was not
updated, fixing `make all`.

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: oidc support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: change branding to MLDE (#114)

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also update configuration to avoid giving the impression that we might want to
publish EE images to NVCR, and revert a few diffs between the config for EE vs
OSS that don't seem intentional/warranted.

* chore: finish CAIDE -> MLDE, change hpemlde -> hpe-mlde

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.

* chore: more CAIDE -> MLDE changes (#153)

* chore: update web initial title and meta description (#155)

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: use correct tags for EE release

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per partition stats.

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value
that is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: just warn on trying to set single resource as daemon, for now (#125)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container got dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage yaml configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is
set, verify that the specified CUDA devices are visible from the
nvidia-smi command as a diagnostic. Generate a WARN if not
as expected.

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict if concurrent jobs are writing to them.
  The directory is generated by mktemp and owned by the user and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of `!`, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop the generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" after the ListDispatchesByAllocationID and return if there's an error instead of doing it in the "else".

* Update dispatcher_resource_manager.go

Changed "return" to "return nil"

* Update dispatcher_resource_manager.go

I think lint was complaining about the parentheses around the "if (err != nil)", so I removed them.

* Update dispatcher_resource_manager.go

Removed the empty line above the "if err != nil", as lint seems to be complaining about it.

* HAL-2855: Initial nonshared archive folders (#146)

Fully working on casablanca with slots_per_trial: 8
Will be addressing additional TODOs in the code in this story.

* Suppress a dispatcher RM error on weight/move messages (#149)

The dispatcher RM does not support
job.SetGroupWeight, job.SetGroupPriority, or job.MoveJob,
so ignore them instead of blowing up (since blowing up prevents testing).

* HAL-2876 Improve UX/Reporting of job submission failures (#150)

* Support environment configuration on singularity (#152)

Address todos:
- TODO(HAL-2868): Support adding and dropping linux caps for containers.
- TODO(HAL-2869): Support communicating with authenticated container registries.
- Also add the force_pull_image option.

Design Note: https://connect.us.cray.com/confluence/pages/viewpage.action?pageId=230482663

* HAL-2878 Make non-shared archive volumes more robust (#151)

Make use of dispatcher support for softlinks in the archive launch parameter to:
 - Enable `set -e` in the dispatcher-wrapper.sh script to avoid missed errors.
 - Enable /run/determined/ssh to be created in-place instead of after the fact.
 - Enable /run/determined/workdir to be created in-place instead of after the fact.

* Add missing checkpoint-gc link supporting unshared data volumes (#156)

Add a missing link from /run/determined/checkpoint_gc to the
unshared data volume to enable these tasks to complete successfully.

* Restore resource pool information retrieval from Slurm (HAL-2887) (#158)

* Restore resource pool information from Slurm

* HAL-2884 Properly handle SLURM systems without -G support (#159)

We still had a hack in place from our initial demo to enable
testing on horizon even though it does not have
SelectType=select/cons_tres configured.

Add new RM configuration to cover this case (consistent with k8s):
slot_type: the slot type (cuda/cpu/rocm) that will be used for scheduling (defaults to cuda).
If Slurm is not configured with SelectType=select/cons_tres and with GRES GPU
information configured, change this to cpu.

Also drop the image_path variable we were using until we have full docker support
and constraints (as those will now come through the expconf directly via slurm option
there).

* HAL-2793: Add dispatcher portions of expconf for slurm (#157)

* chore: Add expconf environment.slurm

Add slurm configuration in the expconf similar in purpose
to the pod_spec for k8s.

* Dispatcher RM portions of HAL-2793

* HAL-2856 Add launcher configuration variables (#161)

To simplify the user experience, we are adding the launcher.conf
variables that a user may need to set to master.yaml so they
have a single file they need to edit to configure determined.

* HAL-2788 Handle task script archives as shared. (#162)

Simplify our bind mounts by treating the archive
with logging configuration scripts as shared.

* fix: semantic rebase conflicts

* HAL-2892: Singularity mounts /dev/shm from host by default leading to file conflicts (#163)

Change back to using /tmp for container-private directory
with SINGULARITY_NO_MOUNT to prevent sharing (requires 3.7+).

* HAL-2892: Use / as container-private directory (#164)

Amazingly, / seems to be the best place to land our container-private
directory to hold links pointing to non-shared directories instantiated
from archives. It is private to the container and writable (with the
--writable-tmpfs option). It does not impact /tmp or /dev/shm, which
customers may be using, and is surprisingly writable by the owning user.

* chore: Add command and shell entry script support. (#166)

The shared volume support needs to set up links for files that
are not intended to be shared. This is not yet completely
generalized, so we need to update tasks.go to add the links
for now. Add some missing links.

* chore: Dynamically handle links for unshared archive volumes (#168)

When archives are mapped into the per-container space we
need to setup softlinks for each element within /run/determined
directory that is not treated as shared.   This was previously
done using a static list of names.   Replace that with a dynamic
list so that we don't miss files that are specified in other
parts of determined.   Also add exceptions for command and
shell entrypoints which are read-only so we can leave them in the
shared /run/determined directory without setting up links.
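
A sketch of the dynamic-link pass; the paths, the private-directory name, and the read-only skip set (entrypoint names) are illustrative assumptions:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// linkUnshared creates, for every entry in the container-private copy of
// /run/determined, a softlink from the shared directory, skipping read-only
// entries that may safely stay shared.
func linkUnshared(sharedDir, privateDir string, skip map[string]bool) error {
	entries, err := os.ReadDir(privateDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if skip[e.Name()] {
			continue // e.g. read-only command/shell entrypoints
		}
		link := filepath.Join(sharedDir, e.Name())
		_ = os.Remove(link) // replace any stale entry
		if err := os.Symlink(filepath.Join(privateDir, e.Name()), link); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Hypothetical names for the read-only entrypoints left shared.
	skip := map[string]bool{"command-entrypoint.sh": true, "shell-entrypoint.sh": true}
	if err := linkUnshared("/run/determined", "/determined-private", skip); err != nil {
		log.Println(err)
	}
}
```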

* chore: Propagate DispatchState to job.SchedulingState (#169)

A first increment of HAL-2863 to expose job queue information.
This change updates the State field of the AllocateRequest
for the task when it changes to RUNNING to avoid the experiment
status always saying QUEUED.

* chore: clean up after launcher Slurm query completion (HAL-2872) (#170)

* DispatchIDs are cleaned up

This change does some clean up on the dispatcher
side following Slurm query completion.

* chore: Make MPIRUN work under horovodrun (#171)

We now have environment images with built-in OpenMPI. When mpirun is invoked, the SLURM_JOBID variable
triggers its Slurm integration; however, we are running in a Singularity container, and Slurm
may or may not have a compatible configuration enabled. We therefore clear the SLURM_JOBID variable
before invoking MPI so that mpirun will honor the args horovodrun passes to it describing
the hosts and process topology; otherwise MPI ends up wanting to launch all -np# processes on the local node,
causing an oversubscription error ("There are not enough slots available in the system").

Also fall back to OMPI_COMM_WORLD_RANK when HOROVOD_RANK is not defined. This is not needed
for correct operation, but it affects the [rank=#] logging prefix, so it improves the user experience
and debugging.

* chore: With slurm detect non-zero exit as job failure (#172)

If any container launched by the SLURM batch job exits with
a non-zero status, terminate the others with an error.
Without this, a failure on horovodrun args or other such
errors would exit the one container, but the others would wait
forever.

Happy-path exits need a different fix.
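
One way to picture this fail-fast behavior in Go is with errgroup, where the first non-zero exit cancels a shared context and kills the siblings; this is a sketch of the idea, not the actual batch-script mechanism:

```go
package main

import (
	"context"
	"os/exec"

	"golang.org/x/sync/errgroup"
)

// runAll runs one command per task; if any exits non-zero, the shared
// context is canceled so the rest are terminated instead of waiting
// forever on a lost peer.
func runAll(ctx context.Context, cmds [][]string) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, argv := range cmds {
		argv := argv
		g.Go(func() error {
			// exec.CommandContext kills the process when ctx is canceled.
			return exec.CommandContext(ctx, argv[0], argv[1:]...).Run()
		})
	}
	return g.Wait() // first non-nil error wins; the others are canceled
}

func main() {
	_ = runAll(context.Background(), [][]string{{"true"}, {"false"}})
}
```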

* chore: Make shell_entrypoint work with SLURM (#174)

The shell command entry point needs the fix to properly
inherit DET_SLOTS from the Slurm-provided CUDA configuration.
Additionally, HPC systems include functions in their
environment to support modules. Because these variables
include () in their names, they break the SSH variable-escaping
code. Avoid processing of these functions by adding any
variable name containing () to the blacklist.
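
A sketch of the combined blacklist rule (the % exclusion comes from the follow-up commit just below); the predicate and the sample names are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// skipVar excludes exported shell functions (module support), whose names
// carry "()", and names containing "%", both of which break the SSH
// variable-escaping code when the environment is forwarded.
func skipVar(name string) bool {
	return strings.Contains(name, "()") || strings.Contains(name, "%")
}

func main() {
	for _, name := range []string{"PATH", "BASH_FUNC_module%%", "ml()"} {
		fmt.Printf("%s -> skip=%v\n", name, skipVar(name))
	}
}
```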

* chore: Provide DET_SLOT_IDS for other entrypoint scripts. (#175)

Inherit DET_SLOT_IDS from Slurm-provided CUDA configuration.
This should cover additional entrypoint scripts not yet
addressed individually (notebooks, and tensorboards).

Also turn off debug logging from dispatcher-wrapper.sh now that
the support appears robust.

Drop the explicit handling of DET_SLOT_IDS in the shell entry,
since it is covered by the dispatcher-wrapper.sh

Further extend the blacklist of variables to those whose names include % symbols.

* chore: clean up dispatcher on experiment deletion (#173)

* DispatchIDs are cleaned up

This change initiates some clean up on the
dispatcher side when an experiment is deleted

* chore: HAL-2900 Always send proper responses to requests (#178)

Fixes to always send a response, and return nil in response to
Dispatcher RM messages.

* chore: HAL-2893 Synchronize reading of Slurm resource log file (#179)

* fix: HAL-2903 Drop special handling of stdin/stdout for SLURM (#180)

The /run/determined/train directory is no longer shared between
all containers in a run, so drop the slurm-job specific task
setup code that redirects the stdin/stdout link via /tmp (which
Singularity maps to /tmp on the host, and which can lead to conflicts
between jobs referencing the same /tmp/stdin.log link).

* feat: FOUNDENG-21 Enable SLURM preemption (#181)

Initial work to enable SLURM job preemption.
- For SLURM jobs, intercept SIGTERM as a notification of pending preemption;
  when triggered, notify the master via the new pending_preemption API
- Add the pending_preemption API on the master
- Upon a call to pending_preemption, the dispatcher RM triggers ReleaseResources
  (with ForcePreemption:true) for the allocation.
- The above triggers a preemption, and the long poll _preempt.py is performing will
  return, indicating that the job should preempt; after checkpointing, it will
  exit with a success status.

* chore: put some life into the 'job queue' UI page (#189)

* chore: Version of Bradley's wrap_rank.py fix (#188)

This is semantically equivalent to

https://github.com/determined-ai/determined/pull/4109

So you can apply it to your tree to get things working on dispatcher branch.

* chore: HAL-2788 Process and send logs from Dispatcher(SLURM) to Default Log Backend (#182)

Co-authored-by: Bradley Laney <bradlaney@determined.ai>

* fix: correct regex group name (#193)

* fix: HAL-2907 Simplify SLURM job naming (#190)

This is just a slight increment to reduce the verbosity; to clean up
the username and dispatch ID we will likely need a launcher
enhancement.

* chore: fix tb collisions (#194)

* chore: HAL-2894 Multiple distributed jobs collide on horovod run sshd port (#195)

* chore: per-partition configs (#196)

This change enables us to configure per-partition overrides for certain configurations and adds a few overrides for now. The mechanism matches determined agents: expconf settings (if available) override pool-level defaults, which override master-config-level defaults.

* chore: avoid API blockage & overpolling (#198)

* chore: avoid API blockage & overpolling

when following launched jobs to completion

* chore: provide per-pool job stats (#199)

* chore: provide per-pool job stats

on receipt of GetJobQueueStatsRequest

* fix: FOUNDENG-13 Properly set default pool when no GPUs (#201)

When no partitions have GPUs, we ended up without a default
compute pool, so jobs would get submitted to pool "".
That would be fine, except we couldn't then match the pool
named "" to its attributes (i.e. it has no GPUs), and thus we
would try to allocate GPUs from a pool without them.
So when there are no GPUs in any partition, set the default compute pool
(internal to the Dispatcher RM) to the default AUX pool, so
that when a request for the default pool comes in we properly
match it up with the pool attributes and correctly switch to
using cpu as the slot type.

Also add a devcluster file for shuco for testing.

* chore: fix up slot type resolution (#197)

* chore: fix default scheme for dispatcher config (#203)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support from v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now correctly
return 409 instead of 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because Okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install EE via a deb, apt-get tries to resolve its named
deps, but in CI we haven't run apt-get update, so they can't be found.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* ci: avoid implication that the password for SCIM tests is secure (#83)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
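
A minimal sketch of the fix, assuming a relayState query-parameter name (the actual parameter name is an assumption): append the value to the redirect URI with net/url rather than relying on a SAML-style redirect binding.

```go
package main

import (
	"fmt"
	"net/url"
)

// withRelayState adds the relay state to the redirect URI's query string.
func withRelayState(redirectURI, relayState string) (string, error) {
	u, err := url.Parse(redirectURI)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("relayState", relayState) // parameter name is illustrative
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, _ := withRelayState("https://master.example.com/oidc/callback", "cli")
	fmt.Println(s) // https://master.example.com/oidc/callback?relayState=cli
}
```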

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fix up the release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
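
A sketch under assumed names (claimName, attrName, and the user shape are illustrative) of matching an arbitrary OIDC claim against an arbitrary stored attribute instead of hard-coding email-to-username:

```go
package main

import "fmt"

type user struct{ attrs map[string]string }

// authenticate finds the user whose configured attribute matches the
// configured claim from the verified OIDC token.
func authenticate(claims map[string]any, users []user, claimName, attrName string) (*user, bool) {
	want, ok := claims[claimName].(string)
	if !ok {
		return nil, false
	}
	for i := range users {
		if users[i].attrs[attrName] == want {
			return &users[i], true
		}
	}
	return nil, false
}

func main() {
	users := []user{{attrs: map[string]string{"userName": "jane"}}}
	claims := map[string]any{"preferred_username": "jane"}
	u, ok := authenticate(claims, users, "preferred_username", "userName")
	fmt.Println(u, ok)
}
```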

* chore: refactor for semantic conflict in service.go:Service.extractToken (#186)

* fix: update docs skip check remote address to EE

* chore: add launcher client (#209)

* scaffolding

temp commit

temp commit

add go client

Initial progress on dispatcher

Simple API call to unauthenticated InfoAPI
endpoint working inside the dispatcher
resource manager.

* readd resource fulfillment code

* HAL-2781 Modify launcher resource manager to send manifest to launcher (#87)

* Watch for job terminations

* Revert "Watch for job terminations"

This reverts commit b185fb9301720e85eff3a3dd705fa0d580f0596d.

* HAL-2787 Modify launcher resource manager to pass environment variables required by task container (#90)

* Watch for job terminations for Dispatcher RM (HAL-2792) (#89)

* Watch for job terminations

* Fix require statement to build without HPE github access (HAL-2800) (#94)

* Fix require statement to build without HPE github access (HAL-2800)
* Add launcher to agent go.mod (indirectly pulled in via master)

* Restore proper version in generated files to avoid test failures (#95)

* ci: miscellaneous fixes (#96)

* fix: resolve semantic conflicts (#97)

* HAL-2790 Saturate Determined's GET /api/v1/resource-pools API for the Dispatcher RM. (#91)

* HAL-2789: Add initial proto authorization model (#99)

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* HAL-2789: Add initial proto authorization model.
Generalize devcluster.yaml configuration to enable development.

* Delete launcher.token

VSCode is helping again and adding files.

* Fix singularity image path, and avoid an NPE on restart due to nil Owner in task (#100)

* rebase: semantic conflicts

* chore: send job terminations to determined internal components (#101)

This change updates the dispatcher RM to notify the necessary Determined internal components when a dispatch fails.

* Add devcluster configuration for testing on casablanca (#102)

* Update generated mock files (version change) so that link checks pass. (#103)

* HAL-2755 Figure out how to convert Determined AI resource allocation to Capsules resource allocation (#104)

* chore: post-rebase tidy

* feat: add slurm rendezvous

* chore: remove remnants of k8s copy paste (#110)

* chore: remove remnants of k8s copy paste (#110) (#112)

Co-authored-by: Bradley Laney <bradley.laney@hpe.com>

* HAL-2842 DAI launcher job status check is not requesting a refresh (#113)

* Cache & reuse resource pool query results (#111)

* Slurm resource pools

Cache & reuse query results; provide per-partition stats.
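
The caching idea, sketched with an assumed TTL and a stand-in query function (neither matches the real implementation): remember the last partition snapshot briefly so repeated resource-pool requests don't re-query the launcher.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type partitionStats struct{ Nodes, GPUs int }

type poolCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetched time.Time
	stats   map[string]partitionStats
	query   func() map[string]partitionStats // the expensive launcher call
}

// get returns the cached snapshot, refreshing it once the TTL expires.
func (c *poolCache) get() map[string]partitionStats {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.stats == nil || time.Since(c.fetched) > c.ttl {
		c.stats = c.query()
		c.fetched = time.Now()
	}
	return c.stats
}

func main() {
	c := &poolCache{ttl: 30 * time.Second, query: func() map[string]partitionStats {
		return map[string]partitionStats{"defq": {Nodes: 4, GPUs: 16}}
	}}
	fmt.Println(c.get()) // first call queries; later calls within the TTL reuse it
}
```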

* HAL-2814 Provide data archive volumes to the dispatcher (#116)

* HAL-2814 Provide data archive volumes to the dispatcher

* Address feedback, and fix checkpoint_gc path support.

The checkpoint-gc archive has a cproto.RunArchive Path value that
is not / like the others. Regularize the root in the
generated launcher Archive argument so that we can treat them
consistently.
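
A sketch of the regularization with a stand-in for cproto.RunArchive (field names and the example paths are illustrative): rebase every member onto the archive's own root so a non-"/" Path is handled like the rest.

```go
package main

import (
	"fmt"
	"path"
)

// RunArchive is a stand-in for cproto.RunArchive.
type RunArchive struct {
	Path  string   // archive root; "/" for most archives
	Items []string // member paths within the archive
}

// regularize rebases each member onto the archive root, so archives
// rooted somewhere other than "/" can be treated uniformly.
func regularize(a RunArchive) []string {
	out := make([]string, 0, len(a.Items))
	for _, item := range a.Items {
		out = append(out, path.Join(a.Path, item))
	}
	return out
}

func main() {
	gc := RunArchive{Path: "/run/determined", Items: []string{"checkpoint_gc/entrypoint.sh"}}
	fmt.Println(regularize(gc)) // [/run/determined/checkpoint_gc/entrypoint.sh]
}
```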

* CPU-only system/partitions work automatically by allocating CPUs (HAL-2845) (#117)

* CPU-only system/partitions

...work automatically by allocating CPUs

* stop using 'container' object since we cleaned the interface up to not need it upstream (#121)

* HAL-2857 The DAI launcher resource manager job monitoring code is not passing the authentication token to the launcher (#124)

* chore: various clean ups and semantic conflicts to get builds passing (#127)

* chore: various clean ups and semantic conflicts to get builds passing

* py fmt

* chore: configure default rendezvous iface on per-cluster basis, for now (#118)

* Drop max_slots_per_container as it is no longer supported (#128)

max_slots_per_container was dropped from the dispatcher RM
configuration, so drop it from the configuration file as well.
Jobs do not currently succeed with multiple Slurm tasks, but
hopefully will soon.

* Allow launcher to determine docker vs local images, and enable check-point volumes (#129)

* Switch to allowing launcher to determine docker vs local images.

Drop the explicit configuration of image paths in the master
to enable use of docker images from a registry if the system is so configured.

* HAL-2858 Add DAI configured mounts as data volumes on dispatcher.

DAI checkpointing uses a shared mounted file system for storage, as
configured via the checkpoint_storage YAML configuration. Add
support in the Dispatcher RM to pass any configured mounts to the
dispatcher to be added to the launched container.

Update the devcluster & devcluster-cassablanca configuration to
point to the shared filesystem /lus/scratch/foundation_engineering/determined-cp
to enable checkpointing to work during development.

* Switch to new device type to simplify code (#130)

Migrate the dispatcher RM from proto/pkg/devicev1 to
pkg/device.

* HAL-2860 Improve authentication token handling when monitoring job status (#131)

* Correct the DET_SLOT_IDS list to utilize values provided by SLURM (#133)

Slurm provides variables that identify the allocated GPUs:
GPU_DEVICE_ORDINAL
CUDA_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES

Use GPU_DEVICE_ORDINAL for generating DET_SLOT_IDS, which should work
for both CUDA and ROCR devices. If CUDA_VISIBLE_DEVICES is set,
verify as a diagnostic that the specified CUDA devices are visible
via the nvidia-smi command, and generate a WARN if they are not as
expected.
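
A sketch of the derivation (the real check shells out to nvidia-smi; here we only compare the environment variables, and the DET_SLOT_IDS encoding shown is illustrative):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// Slurm exposes the allocated devices via GPU_DEVICE_ORDINAL, e.g. "0,1".
	ordinal := os.Getenv("GPU_DEVICE_ORDINAL")
	ids := strings.Split(ordinal, ",")
	// Derive DET_SLOT_IDS from the ordinals (exact format is illustrative).
	os.Setenv("DET_SLOT_IDS", fmt.Sprintf("[%s]", strings.Join(ids, ",")))

	// Diagnostic only: warn if CUDA_VISIBLE_DEVICES disagrees.
	if cuda := os.Getenv("CUDA_VISIBLE_DEVICES"); cuda != "" && cuda != ordinal {
		log.Printf("WARN: CUDA_VISIBLE_DEVICES=%q does not match GPU_DEVICE_ORDINAL=%q",
			cuda, ordinal)
	}
	fmt.Println(os.Getenv("DET_SLOT_IDS"))
}
```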

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL (#134)

* Checkpoint slots (#135)

* Always use CUDA_VISIBLE_DEVICES instead of GPU_DEVICE_ORDINAL

* Define DET_SLOT_IDS for gc-checkpoints-entrypoint.sh

Enable cleanup to succeed by inheriting the proper DET_SLOT_IDS
when running in a SLURM environment.

* Enable resource pool validation for launcher (HAL-2854) (#137)

* Enable resource pool validation

...for launcher.

Bradley will follow up on handling the return of any error encountered.

* chore: persist dispatcher <-> resources <-> alloc mapping (#142)

This just adds some basic queries to persist mappings from allocations to dispatches so that we can query them to get logs.
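
A rough sketch of the mapping queries with an assumed table name and columns (the real schema lives in the master's DB layer, and a Postgres driver must be registered before sql.Open):

```go
package dispatcherdb

import "database/sql"

// insertDispatch records which allocation a launcher dispatch belongs to,
// so logs can later be looked up by allocation.
func insertDispatch(db *sql.DB, dispatchID, allocationID string) error {
	_, err := db.Exec(
		`INSERT INTO dispatches (dispatch_id, allocation_id) VALUES ($1, $2)`,
		dispatchID, allocationID,
	)
	return err
}

// dispatchesForAllocation returns the dispatch IDs persisted for an allocation.
func dispatchesForAllocation(db *sql.DB, allocationID string) ([]string, error) {
	rows, err := db.Query(
		`SELECT dispatch_id FROM dispatches WHERE allocation_id = $1`, allocationID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```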

* chore: store original user with dispatch (#143)

This change just updates the dispatcher persistence code to store the original launch ID for dispatches.

* HAL-2855 Enable multi-job runs with SLURM (#141)

* Enable multi-job runs with SLURM

Changes (some temporary) to enable experiments with multiple jobs to be launched by SLURM successfully.

* Enhance debug mode output for horovod initialization.
* Add per-SLURM-task /tmp directory creation so that work files do not conflict when concurrent jobs write to them.
  The directory is generated by mktemp, owned by the user, and writable only by them.
* Configure an explicit (unknown) password hash in the shadow file instead of !, which disables the account.
* Add SLURM-specific log file initialization using /tmp to enable per-container/process logs to work with the shared file system.

* Drop generated tmp directory.

With SINGULARITY_CONTAIN=true, there is actually a local /tmp in the container that we can use.

Generalize the node/total args when a constraint is set to enable multi-job experiments on horizon, which doesn't support --gpus allocation.
This still needs more work, so leaving the TODO.

* cleanup original scaffolding, minor go style stuff (#132)

This change attempts to
* clean up the last bad code from when I scaffolded this out very quickly
* make some small stylistic Go changes as I was reading along
* ticket anything that I noticed right off that was missing

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the l… (#145)

* HAL-2859 Cancelling a DAI experiment does not cancel the job on the launcher side

* Update dispatcher_resource_manager.go

Removed duplicate declaration of StartDispatcherResources, KillDispatcherResources, and DispatchExited.

* Update dispatcher_resource_manager.go

Got rid of extra receiveRequestMsg

* Update dispatcher_resource_manager.go

Now got rid of receiveRequestMsg for real

* Update dispatcher_resource_manager.go

Fixed lint complaint because indentation of "sproto.ResourcesID" did not match that of "model.AllocationID".

* Update dispatcher_resource_manager.go

As per Bradley's request, I check for "err != nil" af…