Install bokeh and numpy in the Helm chart's scheduler and worker sample image #561
Conversation
From the head of that file:

> The current dockerfile is, on purpose, a very limited one for demonstration and testing only. It would be rather difficult to find a "best" image to suit all needs.
I agree that we should keep this image minimal, but I do think having a working dask array (and probably dask dataframe?) would be useful for demos and testing things out. A new user is likely to try the default images to start, and having a functional demo without requiring further changes is something I think users can expect.
Yes, in fact this is related to my tutorial about deploying Dask Gateway on Jetstream 2: https://zonca.dev/2022/04/dask-gateway-jupyterhub.html

Related to this, it would be nice to have
I agree with the value of providing an image with slightly more than our test suite requires, if it means that an end user with a fresh Helm chart installation can do a basic test involving:
I think it's crucial that we define this kind of test very specifically. Without it, there is no clear line about what we are to provide in the image, and the image will grow without a clear boundary and become vague in its purpose. I fear this is problematic both for maintainers and for anyone relying on this image. I'm not an active user of dask workers etc.; if we can couple this change with a very primitive example, I'd be happy to merge this change as well.

**End-to-end test with JupyterHub + dask-gateway**

Note that if someone wants to make an end-to-end test of a dask-gateway + JupyterHub setup, I would recommend instead configuring the user image for JupyterHub, and then relying on configuration of the dask-gateway client to pass that image to be used by the scheduler and worker pods.

Example configuration for re-using the JupyterHub image:

```yaml
jupyterhub:
  singleuser:
    image:
      name: pangeo/pangeo-notebook
      tag: 2022.04.20
    extraEnv:
      # About DASK_ prefixed variables we set:
      #
      # 1. k8s native variable expansion is applied with $(MY_ENV) syntax. The
      #    order variables are defined in matters though, and we are at the
      #    mercy of how KubeSpawner renders our passed dictionaries.
      #
      # 2. Dask loads local YAML config.
      #
      # 3. Dask loads environment variables prefixed DASK_.
      #    - DASK_ is stripped
      #    - Capitalization is ignored
      #    - Double underscore means a nested configuration
      #    - `ast.literal_eval` is used to parse values
      #
      # 4. dask-gateway and dask-distributed look at their config and expand
      #    expressions in {} again, sometimes only with the environment
      #    variables as context but sometimes also with additional variables.
      #
      # References:
      # - K8s expansion: https://kubernetes.io/docs/tasks/inject-data-application/define-interdependent-environment-variables/
      # - KubeSpawner issue: https://github.com/jupyterhub/kubespawner/issues/491
      # - Dask config: https://docs.dask.org/en/latest/configuration.html
      # - Exploration issue: https://github.com/2i2c-org/infrastructure/issues/442
      #
      # DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE makes the default worker image
      # match the singleuser image.
      DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: "{JUPYTER_IMAGE_SPEC}"
      # DASK_DISTRIBUTED__DASHBOARD_LINK makes the suggested link to the
      # dashboard account for the /user/<username> prefix in the path. Note
      # that this still misbehaves if you have a named server, but now it is
      # at least functional for non-named servers.
      DASK_DISTRIBUTED__DASHBOARD_LINK: "/user/{JUPYTERHUB_USER}/proxy/{port}/status"

dask-gateway:
  gateway:
    extraConfig:
      optionHandler: |
        from dask_gateway_server.options import Options, Integer, Float, String, Mapping

        def cluster_options(user):
            def option_handler(options):
                if ":" not in options.image:
                    raise ValueError("When specifying an image you must also provide a tag")
                return {
                    "worker_cores_limit": options.worker_cores,
                    "worker_cores": min(options.worker_cores / 2, 1),
                    "worker_memory": "%fG" % options.worker_memory,
                    "image": options.image,
                    "environment": options.environment,
                }

            return Options(
                Integer("worker_cores", 2, min=1, label="Worker Cores"),
                Float("worker_memory", 4, min=1, label="Worker Memory (GiB)"),
                # The default image is set via the
                # DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE env variable
                String("image", label="Image"),
                Mapping("environment", {}, label="Environment Variables"),
                handler=option_handler,
            )

        c.Backend.cluster_options = cluster_options
```
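The `DASK_`-prefixed variable handling described in the comments above can be illustrated with a small sketch. This is a simplified re-implementation for illustration only, not dask's actual config loader: the `DASK_` prefix is stripped, the name is lowercased, double underscores become nesting, and values go through `ast.literal_eval`.

```python
# Simplified illustration of how Dask maps DASK_-prefixed environment
# variables into nested configuration. Not dask's actual implementation.
import ast


def collect_dask_env(env):
    """Build a nested config dict from DASK_-prefixed environment variables."""
    config = {}
    for name, value in env.items():
        if not name.startswith("DASK_"):
            continue  # only DASK_-prefixed variables are considered
        key = name[len("DASK_"):].lower()  # strip prefix, ignore capitalization
        try:
            parsed = ast.literal_eval(value)  # e.g. "4" -> 4, "True" -> True
        except (SyntaxError, ValueError):
            parsed = value  # plain strings stay strings
        node = config
        *parents, leaf = key.split("__")  # double underscore -> nesting
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parsed
    return config


print(collect_dask_env({
    "DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE": "pangeo/pangeo-notebook:2022.04.20",
    "DASK_DISTRIBUTED__DASHBOARD_LINK": "/user/me/proxy/8787/status",
    "HOME": "/home/jovyan",  # ignored: no DASK_ prefix
}))
```

This is why setting `DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE` in the singleuser environment ends up as the `gateway.cluster.options.image` default seen by the dask-gateway client.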
I think the answer is "no", but it's still probably worth asking: is there any benefit to having both the barebones image as an example and for testing and also a more fully-fledged version that builds off it, containing the most commonly used extra dependencies? There are a couple of examples in https://github.com/dask/dask-docker and even bigger ones in e.g., https://github.com/pangeo-data/pangeo-docker-images . It would be (slightly) nice to have some kind of consistency. |
My 2¢ as another user: my biggest fear with increasing the scope of the default is if tests/features start leaking into the package that would break in environments without numpy/pandas/etc. For example, we've used dask-gateway to run Octave models by building a minimal Python environment that can support a dask worker, installing additional modules like Octave and oct2py, then using `client.map` to run the models on the workers. So that might point to your idea @martindurant? If numpy/pandas became an explicit dependency I don't think this would be the end of the world either, but it's definitely helpful to be able to install something extremely lightweight and then have the flexibility to install radically different things (e.g. with different compiler flags etc. than those used by, say, geopandas or other elements of the normal pangeo stack).
@delgadom note that the choices here are not related to the `dask-gateway` package itself. This change solely concerns what is installed in a Docker image referenced by the Helm chart by default for the scheduler/worker containers. The image is supposed to be configured explicitly by the user of the Helm chart, so what it includes relates mainly to testing of the Helm chart, and with this PR it would also be something that can be used for a primitive demo of some kind. I'd like such a primitive demo example to be provided along with this PR, for example one executed from within a container based on the same image, as that would ensure matching versions across the dask-gateway client, scheduler, and workers.
@consideRatio what about the same test shown in the "Usage" docs? https://gateway.dask.org/usage.html#run-computations-on-the-cluster Should we have another test with
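As a concrete sketch of the kind of primitive test being discussed, the computation from the Usage docs can be run with only the packages this PR adds, using dask's local threaded scheduler instead of a real gateway cluster so it works inside a single container. The shape and chunk sizes below are illustrative choices, not from the docs.

```python
# Hedged sketch of a primitive smoke test for the sample image, adapted from
# the computation in the dask-gateway "Usage" docs. It only needs dask and
# numpy, and uses the local threaded scheduler rather than a gateway cluster.
import dask.array as da

x = da.random.random((1000, 1000), chunks=(100, 100))
result = x.mean().compute()
assert 0.0 < result < 1.0  # mean of uniform [0, 1) samples
print(f"mean of 1,000,000 uniform samples: {result:.4f}")
```

Running the same snippet through `cluster.get_client()` against a gateway-launched cluster would then exercise the scheduler and worker containers built from the sample image.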
Numpy is not a requirement of either `dask` or `dask-gateway`; however, `dask.array` won't work without `numpy`.
bokeh for the dashboard, pandas for dask.dataframe
Co-authored-by: Erik Sundell <erik.i.sundell@gmail.com>
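One minimal way to turn the additions in the commits above into a test is an import check for the sample image. A sketch follows; `check_imports` is an illustrative helper name, not part of dask-gateway, and the module list mirrors this PR's additions (numpy for `dask.array`, pandas for `dask.dataframe`, bokeh for the dashboard).

```python
# Sketch of a minimal import check for the sample image. find_spec reports
# whether a module is installed without actually importing it, so this runs
# quickly and cannot fail with a partial import.
import importlib.util


def check_imports(modules):
    """Return the subset of `modules` that cannot be found."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


missing = check_imports(["numpy", "pandas", "bokeh"])
print("missing:", ", ".join(missing) if missing else "none")
```

Exiting non-zero when `missing` is non-empty would make this usable as a Helm chart test container command.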
rebased on current
Agreed, thanks!
Numpy is not a requirement of either `dask` or `dask-gateway`; however, `dask.array` won't work without `numpy`, so it would be really convenient to have that available in the default `dask-gateway` image.