Allow for NodeSelector as Admin Defined and User Option? #238

Open
trisongz opened this issue Apr 12, 2022 · 3 comments

@trisongz

I'm currently evaluating Coder and so far it's great! It definitely beats manually provisioning workspaces.

I have a few questions and some minor issues.

Environment

  • Provider: aws-eks
  • K8s Version: 1.21
  • Coder Helm Version: 1.29.1

In our cluster, we use ASGs, and specifically for GPUs, we separate them by the instance-type size as well as the GPU type.

Example

ASG 1: T4-XL
- g4dn.xlarge
  - Node Labels: compute-role:gpu, compute-size:xlarge, gpu-type:t4

ASG 2: A10G-XL
- g5.xlarge
  - Node Labels: compute-role:gpu, compute-size:xlarge, gpu-type:a10g

ASG 3: Mixed-XL
- g4dn.xlarge
- g5.xlarge
  - Node Labels: compute-role:gpu, compute-size:xlarge, gpu-type:mixed
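
For reference, Kubernetes nodeSelector matching is a plain AND over exact label matches: a node qualifies only if it carries every key/value pair in the selector. Below is a minimal Python sketch of that rule applied to the three ASGs above. The `matches` and `eligible_asgs` helpers and the dictionary layout are illustrative only; they are not part of Coder or of any Kubernetes client.

```python
# Node labels for each ASG, copied from the example above.
ASG_LABELS = {
    "T4-XL": {"compute-role": "gpu", "compute-size": "xlarge", "gpu-type": "t4"},
    "A10G-XL": {"compute-role": "gpu", "compute-size": "xlarge", "gpu-type": "a10g"},
    "Mixed-XL": {"compute-role": "gpu", "compute-size": "xlarge", "gpu-type": "mixed"},
}


def matches(node_labels: dict, selector: dict) -> bool:
    """Return True if the node carries every key/value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def eligible_asgs(selector: dict) -> list:
    """List the ASGs whose nodes would satisfy the given nodeSelector."""
    return [name for name, labels in ASG_LABELS.items() if matches(labels, selector)]


if __name__ == "__main__":
    print(eligible_asgs({"compute-role": "gpu", "gpu-type": "t4"}))
    print(eligible_asgs({"compute-role": "gpu", "compute-size": "xlarge"}))
```

A selector on gpu-type:t4 pins a workspace to ASG 1 only. Nodes in the Mixed-XL group are labeled gpu-type:mixed, so they would not match that selector even when the underlying instance is a g4dn.xlarge.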

Questions:

  1. Are there any future plans to allow the admin to specify node-selectors/taints based on images? For CUDA enabled images, we would pre-select the node-selectors and taints to ensure that the image gets properly provisioned with a GPU node, rather than a CPU node.

  2. Follow-on, would it be possible to allow users to specify the node-selectors/taints when creating workspaces without using a template? (if option is enabled by admin)

  3. Is there a way to adjust/specify the session-timeout for OIDC? Currently it seems like the limit is 60 mins before refresh kicks in and requires reauth.
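
To make question 1 concrete, here is a rough Python sketch of what an admin-defined mapping from image to scheduling constraints could look like. Everything in it is hypothetical: Coder does not expose this, and the image name `example.com/cuda-workspace` and the taint key `gpu-only` are made-up placeholders. The output is an ordinary Kubernetes pod spec fragment with a nodeSelector, tolerations, and an `nvidia.com/gpu` limit.

```python
import json

# Hypothetical admin-defined table: image name -> scheduling constraints.
# The image name and the taint key below are placeholders, not real values.
IMAGE_SCHEDULING = {
    "example.com/cuda-workspace": {
        "nodeSelector": {"compute-role": "gpu", "gpu-type": "t4"},
        "tolerations": [
            {"key": "gpu-only", "operator": "Exists", "effect": "NoSchedule"}
        ],
        "gpus": 1,
    },
}


def pod_spec_for(image: str) -> dict:
    """Build a pod spec fragment, applying any constraints registered for the image."""
    container = {"name": "workspace", "image": image}
    spec = {"containers": [container]}
    rule = IMAGE_SCHEDULING.get(image)
    if rule is None:
        return spec  # no rule registered: leave placement to the scheduler
    spec["nodeSelector"] = dict(rule["nodeSelector"])
    spec["tolerations"] = [dict(t) for t in rule["tolerations"]]
    # GPUs are requested through the extended resource name "nvidia.com/gpu".
    container["resources"] = {"limits": {"nvidia.com/gpu": rule["gpus"]}}
    return spec


if __name__ == "__main__":
    print(json.dumps(pod_spec_for("example.com/cuda-workspace"), indent=2))
```

The point of keying on the image is that a CUDA image would always land on a GPU node, while images without a rule keep the default placement.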


Issues:

I tried to modify the node selector with a template that specified compute-type:gpu and compute-role:coder, whereas the provider settings define only compute-role:coder.

image

However, after testing the template, and subsequently deleting it, several workspaces that were provisioned afterwards retained the nodeSelectors that were defined only in the template itself, rather than sticking strictly with the provider specified one.

image

In Template Policy, I do have write enabled for node-selector so I wonder if that's what's causing the issue.

Thanks!

@bpmct
Member

bpmct commented Apr 29, 2022

Hi @trisongz. Would encourage you to join us on Slack so we can discuss these in more detail.

Are there any future plans to allow the admin to specify node-selectors/taints based on images? For CUDA enabled images, we would pre-select the node-selectors and taints to ensure that the image gets properly provisioned with a GPU node, rather than a CPU node.

Follow-on, would it be possible to allow users to specify the node-selectors/taints when creating workspaces without using a template? (if option is enabled by admin)

Unfortunately, the answer is no on both counts. As you may have noticed, a workspace template plus a template policy lets you set NodeSelectors at the workspace level. However, the workspace must be created "from template", not "from image."

Is there a way to adjust/specify the session-timeout for OIDC? Currently it seems like the limit is 60 mins before refresh kicks in and requires reauth.

Checking with the team now on this one.

However, after testing the template, and subsequently deleting it, several workspaces that were provisioned afterwards retained the nodeSelectors that were defined only in the template itself, rather than sticking strictly with the provider specified one.

Follow up question: How were you provisioning these additional workspaces? Directly from the image or are these workspaces created from the old template? This may be a bug, but I want to make sure I'm understanding correctly.


On a slightly different note, we are working on Coder v2 which allows an admin to define templates for workspaces using an entirely custom pod spec, including NodeSelectors. It uses Terraform to define templates. It's not ready for production use, but let me know if you're interested in giving feedback and shaping the roadmap. Here's how it would work for the developer:

$ coder workspace create ben1

Choose a template
> Data science 1
  Data science 2
  Frontend development
  Backend development

Creating workspace...

SSH with `coder ssh ben1`

These parameters are admin-defined and have an underlying specification in Terraform.

@coadler
Member

coadler commented Apr 29, 2022

Is there a way to adjust/specify the session-timeout for OIDC? Currently it seems like the limit is 60 mins before refresh kicks in and requires reauth.

When coderd.oidc.enableRefresh is set to true, refresh and expiration intervals are defined by the upstream provider. When the access token expires, we use the refresh token to ensure the user still has access. If your provider doesn't return refresh tokens, this could be the cause of the 60m timeout. You may need to add redirect options so that your provider returns refresh tokens. For example, Google requires the following:

coderd:
  oidc:
    enableRefresh: true
    redirectOptions:
      access_type: offline
      prompt: consent
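
For context, `access_type=offline` and `prompt=consent` are extra query parameters on Google's OAuth 2.0 authorization request; with them, Google issues a refresh token. Below is a small Python sketch of the resulting URL. It assumes (this is not confirmed in this thread) that coderd appends `redirectOptions` to the authorization redirect as query parameters. The client ID, redirect URI, scope list, and the `build_auth_url` helper are placeholders for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse

GOOGLE_AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"


def build_auth_url(client_id: str, redirect_uri: str, redirect_options: dict) -> str:
    """Build an authorization-code request URL with extra provider options."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": "openid email profile",
    }
    # Extra options (e.g. access_type, prompt) are merged in as query parameters.
    params.update(redirect_options)
    return GOOGLE_AUTH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_auth_url(
        "placeholder-client-id",
        "https://coder.example.com/placeholder-callback",
        {"access_type": "offline", "prompt": "consent"},
    )
    print(url)
    print(parse_qs(urlparse(url).query))
```

Without a refresh token from the provider, the session can only last as long as the access token does, which would be consistent with the 60-minute limit described above.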

@trisongz
Author

trisongz commented May 3, 2022

Would encourage you to join us on Slack so we can discuss these in more detail.

Requested to join!

Follow up question: How were you provisioning these additional workspaces? Directly from the image or are these workspaces created from the old template? This may be a bug, but I want to make sure I'm understanding correctly.

These new workspaces were provisioned from images only, not templates, as there's currently no option in the dashboard to select from predefined templates (I believe). My theory is that the changes had not fully persisted in the backend/database before the new workspaces were created.

On a slightly different note, we are working on Coder v2 which allows an admin to define templates for workspaces using an entirely custom pod spec, including NodeSelectors. It uses Terraform to define templates. It's not ready for production use, but let me know if you're interested in giving feedback and shaping the roadmap. Here's how it would work for the developer

Would be more than happy to!

When coder.oidc.enableRefresh is set to true, refresh and expiration intervals are defined by the upstream provider. When the access token expires, we use the refresh token to ensure the user still has access. If your provider doesn't return refresh tokens, this could be the cause of the 60m timeout. It may be necessary to add additional redirect options for your provider to return refresh tokens.

Will update our Helm values with this and follow up if the behavior persists.


We also found another bug related to GPU nodes.

When the admin options Enable Caching and Enable auto loading of 'shiftfs' kernel module are both enabled, workspaces on GPU nodes cannot access the GPU.

The pod gets the proper resource allocation and is scheduled correctly, but when the user opens the workspace and tries to access the GPU via nvidia-smi, no GPU is present. This behavior is present in 1.29-1.30. Disabling these options resolves the issue (something I found out the hard way, unfortunately).
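
A quick way to confirm the symptom from inside a workspace, sketched in Python. It shells out to `nvidia-smi -L`, which lists the GPUs the driver can see, and treats a missing binary or a failing call as "no GPU visible". The `visible_gpus` helper is illustrative and not part of Coder.

```python
import shutil
import subprocess


def visible_gpus(timeout: float = 10.0) -> list:
    """Return the GPU lines reported by `nvidia-smi -L`, or [] if none are visible."""
    binary = shutil.which("nvidia-smi")
    if binary is None:
        return []  # driver tooling is not installed or not mounted into the container
    try:
        result = subprocess.run(
            [binary, "-L"], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []  # nvidia-smi ran but could not talk to a device
    # Each visible device is printed on a line that starts with "GPU <index>:".
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) visible")
    for line in gpus:
        print(line)
```

Running this once with the two options enabled and once with them disabled, on the same GPU node, would show whether the GPU disappears only in the first case.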
