This document gives an overview of the hub-stress-test script and how it can be used.
You will need two things to run the script, an admin token and a target hub API endpoint URL.
The admin token can be provided to the script on the command line but it's recommended to create a
file from which you can source and export the JUPYTERHUB_API_TOKEN
environment variable.
For the hub API endpoint URL, you can probably use the same value as the JUPYTERHUB_API_URL
environment variable in your user notebooks, e.g. https://myhub-testing.acme.com/hub/api
.
Putting these together, you can have a script like the following to prepare your environment:
#!/bin/bash -e
export JUPYTERHUB_API_TOKEN=abcdef123456
export JUPYTERHUB_ENDPOINT=https://myhub-testing.acme.com/hub/api
By default the stress-test
command of the hub-stress-test
script will scale up to
100 users and notebook servers (pods) in batches, wait for them to be "ready" and then
stop and delete them.
The number of pods that can be created in any given run depends on the number of
user-placeholder
pods already in the cluster and the number of user
nodes. The
user-placeholder
pods are pre-emptible pods which are part of a StatefulSet:
$ kubectl get statefulset/user-placeholder -n jhub
NAME READY AGE
user-placeholder 300/300 118d
We normally have very few of these in our testing cluster but need to
scale them up when doing stress testing otherwise the hub-stress-test
script has to wait
for the auto-scaler to add more nodes to the user
worker pool. The number of available
workers can be found like so:
$ kubectl get nodes -l workerPurpose=users | grep -c "Ready\s"
13
The number of user
nodes needed for a scale test will depend on the resource requirements
of the user notebook pods, reserved space on the nodes, other system pods running on the nodes,
e.g. logging daemon, pod limits per node, etc.
If there are not enough nodes available and the auto-scaler has to create them
as the stress test is running, we can hit the consecutive failure limit which will cause the hub container to crash and restart.
One way to avoid this is run the script with a --count
that is not higher than 500 which
gives time between runs for the auto-scaler to add more user
nodes.
As an example, the kubelet
default maxPods
limit is 110 per node and on IBM Cloud there are about
25 system pods per node. Our testing cluster user notebooks are using a micro
profile so their resource usage is not an issue, they are just limited to the 110 pod-per-node limit.
As a reference, to scale up to 3000 users/pods we need to have at least 35 user nodes.
The --keep
option can be used to scale up the number of pods in the cluser and retain them
so that you can perform tests or profiling on the hub with a high load. When the script runs
it will first check for the number of existing hub-stress-test
users and start creating new
users based on an index so you can run the script with a --count
value of 200-500 if you need
to let the auto-scaler add user
nodes after each run.
Note that the c.NotebookApp.shutdown_no_activity_timeout
value in the user notebook image (in the
testing cluster) should either be left at the default (0) or set to some larger window so that while
you are scaling up the notebook pods do not shut themselves down.
The activity-stress-test
command can be used to simulate --count
users POSTing activity
updates. This command only creates users, not servers. It takes a number of users to simulate
specified by --count
and a number of worker threads, --workers
, to perform the actual
requests. If --keep
isn't specified then the users will be deleted after the test.
If you used the --keep
option to scale up and retain pods for steady state testing, when you are
done you can scale down the pods and users by using the purge
command. The users created by the
script all have a specific naming convention so it knows which notebook servers to stop and users
to remove.
Depending on the number of pods being created or deleted the script can take awhile. During a run there are some dashboards you should be watching and also the hub logs. The logging and monitoring platform is deployment-specific but the following are some examples of dashboards we monitor:
-
Jupyter Notebook Health (Testing)
This dashboard shows the active user notebook pods, nodes in the cluster anduser-placeholder
pods. This is mostly interesting to watch the active user notebook pods go up or down when scaling up or down with the script. The placeholders and user nodes may also fluctuate as placeholder pods are pre-empted and as the auto-scaler is adding or removing user nodes. -
Jupyter Hub Golden Signals (testing)
This is where you can monitor the response time and request rate on the hub. As user notebook pods are scaled up each of those pods will "check in" with the hub to report their activity. By default each pod checks in with the hub every 5 minutes. So we expect that the more active user notebook pods in the cluster will increase the request rate and increase the response time in this dashboard. The error rates may also increase as we get 429 responses from the hub while scaling up due to the concurrentSpawnLimit. Those 429 responses are expected and thehub-stress-test
script is built to retry on them. Here is an example of a 3000 user load run:That run started around 2:30 and then the purge started around 9 which is why response times track the increase in request rates. As the purge runs the number of pods reporting activity is going down so the request rate also goes down. One thing to note on the purge is that the slow_stop_timeout defaults to 10 seconds so as we are stopping user notebook servers (deleting pods) the response times spike up because of that arbitrary 10 second delay in the hub API.
Other useful panels on this dashboard are for tracking CPU and memory usage of the hub. From the same 3000 user run as above:
CPU, memory and network I/O increase as the number of user notebook pods increases and are reporting activity to the hub. The decrease in CPU and network I/O are when the purge starts running. Note that memory usage remains high even after the purge starts because the hub aggressively caches DB state in memory and is apparently not cleaning up the cache references even after spawners and users are deleted from the database.