
Configuration settings

This document provides an overview of configuration settings for increasing hub performance.

  1. Culler settings
    1. Frequency
    2. Concurrency limit
    3. Timeout
    4. Notebook culler
  2. Activity intervals
    1. activity_resolution
    2. last_activity_interval
    3. JUPYTERHUB_ACTIVITY_INTERVAL
  3. Startup time
    1. init_spawners_timeout
  4. Other settings
    1. k8s_api_threadpool_workers
    2. Disable user events
    3. Disable consecutiveFailureLimit
    4. Increase http_timeout
  5. References

Culler settings

There are two mechanisms for controlling the culling of servers and users. One is a process managed by the hub which periodically culls users and servers. The other is a setting that allows servers to shut themselves down after a period of inactivity.

Frequency

By default the culler runs every 10 minutes. With a more aggressive notebook idle timeout, the hub-managed culler can run less frequently.

Concurrency limit

By default the culler has a concurrency limit of 10, meaning it will make up to 10 concurrent API calls. When deleting a large number of users, these calls can generate a high load on the hub. Setting this to 1 helps to reduce that load.

Timeout

The timeout controls how long a server can be idle before being deleted. Because the servers will aggressively cull themselves, this value can be set very high.

These can all be configured in the cull section of values.yaml:

cull:
  timeout: 432000 # 5 days
  every: 3600 # Run once an hour instead of every 10 minutes
  concurrency: 1

Notebook culler

There are two settings which control how the notebooks cull themselves. The first is c.NotebookApp.shutdown_no_activity_timeout, which specifies the period of inactivity (in seconds) before a server shuts itself down. The second is c.MappingKernelManager.cull_idle_timeout, which determines when idle kernels are shut down. Both are notebook server settings, so they are configured in the single-user server rather than on the hub.
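For example, these could be set in a jupyter_notebook_config.py baked into the single-user image (a sketch; the 10-minute values are illustrative, not recommendations):

```python
# jupyter_notebook_config.py inside the single-user image (illustrative values)
# Shut the whole notebook server down after 10 minutes with no activity.
c.NotebookApp.shutdown_no_activity_timeout = 600
# Cull kernels that have been idle for 10 minutes.
c.MappingKernelManager.cull_idle_timeout = 600
# How often (in seconds) to check for idle kernels to cull.
c.MappingKernelManager.cull_interval = 120
```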

Activity intervals

These settings control how spawner and user activity is tracked. These settings have a large impact on the performance of the hub.

c.JupyterHub.activity_resolution

Activity resolution controls how often activity updates are written to the database. Many API calls record activity for a user, and this setting determines whether that update is committed. If the user's last recorded activity is less than activity_resolution seconds old, the new update is ignored. Increasing this value reduces commits to the database.

extraConfig:
  myConfig: |
    c.JupyterHub.activity_resolution = 6000

c.JupyterHub.last_activity_interval

This setting controls how often a periodic task in the hub named update_last_activity runs. This task updates user activity using information from the proxy; it makes a large number of database calls and can put a fairly significant load on the hub. Zero to JupyterHub sets this to 1 minute by default. The upstream default of 5 minutes is a better setting.

extraConfig:
  myConfig: |
    c.JupyterHub.last_activity_interval = 300

JUPYTERHUB_ACTIVITY_INTERVAL

This controls how often each server reports its activity back to the hub. The default is 5 minutes and with hundreds or thousands of users posting activity updates it puts a heavy load on the hub and the hub's database. Increasing this to one hour or more reduces the load placed on the hub by these activity updates.

singleuser:
  extraEnv:
    JUPYTERHUB_ACTIVITY_INTERVAL: "3600"

Startup time

init_spawners_timeout

c.JupyterHub.init_spawners_timeout controls how long the hub will wait for spawners to initialize. When this timeout is reached, the spawner check moves into the background and hub startup continues. With many hundreds or thousands of spawners this will always exceed any reasonable timeout, so there is no reason to wait at all. Setting it to 1 (the minimum value) lets the hub start faster and begin servicing other requests.

In values.yaml:

extraConfig:
  myConfig: |
     c.JupyterHub.init_spawners_timeout = 1

Other settings

Other settings which are helpful for tuning performance.

c.KubeSpawner.k8s_api_threadpool_workers

This value controls the number of threads kubespawner creates to make API calls to Kubernetes. The default is 5 * num_cpus. With a large enough number of users logging in and spawning servers at the same time, this may not be enough threads. A more sensible value for this setting is c.JupyterHub.concurrent_spawn_limit, which controls how many users can spawn servers at the same time. Creating that many threadpool workers ensures there is always a thread available to service a user's spawn request. The upstream default for concurrent_spawn_limit is 100, while the Zero to JupyterHub default is 64.

In values.yaml:

extraConfig:
  perfConfig: |
     c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit

Disable user events

With this enabled, kubespawner processes events from the Kubernetes API, which are then used to show progress on the user spawn page. Disabling this reduces the load on kubespawner.

To disable user events update the events key in the values.yaml file. This value ultimately sets c.KubeSpawner.events_enabled.

singleuser:
  events: false

Disable consecutiveFailureLimit

JupyterHub itself defaults c.Spawner.consecutive_failure_limit to 0 (disabled), but zero-to-jupyterhub-k8s defaults it to 5. This can be problematic during a large user event when many users are starting server pods at the same time: if user node capacity is exhausted, spawns can time out while waiting for the node auto-scaler to add capacity. When the consecutive failure limit is reached, the hub restarts, which is unlikely to help when spawn timeouts are caused by capacity issues.

To disable the consecutive failure limit update the consecutiveFailureLimit key in the values.yaml file.

hub:
  consecutiveFailureLimit: 0

Increase http_timeout

c.KubeSpawner.http_timeout defaults to 30 seconds. During scale and load testing we have seen the hub hit this timeout and delete the server pod when waiting just a few seconds more would have been enough. If you have node capacity and pods are being created but are slow to come up, consider increasing this to something like 60 seconds. Startup time also varies depending on whether you are using notebook or jupyterlab / jupyter-server, the type of backing storage for the user pods (e.g. s3fs shared object storage is known to be slower), and how many and what kinds of extensions you have in the user image.
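Following the extraConfig pattern used above, this could be set in values.yaml (a sketch; the key name httpTimeoutConfig and the 60-second value are illustrative):

```yaml
extraConfig:
  httpTimeoutConfig: |
    # Give slow-starting user pods more time before the hub gives up
    # and deletes them (default is 30 seconds; tune to your environment).
    c.KubeSpawner.http_timeout = 60
```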

References