This repository demonstrates a solution to the problem of distributing local Python packages to a Dask cluster.
When working with a Dask cluster, it's often necessary to use custom modules and packages within your distributed tasks. Dask's default mechanism for uploading modules only supports single-file modules, not entire packages. This makes it challenging to manage and distribute more complex codebases to the Dask scheduler and workers.
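For context, the sketch below shows the single-file case that does work out of the box; the scheduler address and file names are placeholders, not values from this repository.

```python
# Hedged sketch: the default mechanism handles a single-file module, but a
# multi-module package has no equivalent here. Address and names are placeholders.
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# Shipping one .py file to every worker works out of the box.
client.upload_file("utils.py")

# A package such as my_package/ (with __init__.py and several sub-modules) has
# no single-file counterpart, so tasks that `import my_package` fail on the
# workers with ModuleNotFoundError.
```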
This limitation is discussed in the Dask community here:
This demo showcases a Dask plugin that allows you to upload entire Python packages to your Dask cluster. This simplifies dependency management and makes it easier to work with your own libraries in a distributed environment.
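The sketch below illustrates the general idea behind such a plugin; it is not the implementation proposed in the pull request, and all names in it are illustrative.

```python
# Simplified sketch (not the PR's implementation): a WorkerPlugin that zips a
# local package on the client and unpacks it into each worker's local
# directory so the package becomes importable there.
import io
import os
import sys
import zipfile

from distributed import WorkerPlugin


class UploadPackageSketch(WorkerPlugin):
    def __init__(self, package_path):
        # Archive the package directory once, on the client side.
        buffer = io.BytesIO()
        parent = os.path.dirname(os.path.abspath(package_path))
        with zipfile.ZipFile(buffer, "w") as archive:
            for root, _, files in os.walk(package_path):
                for filename in files:
                    full_path = os.path.join(root, filename)
                    # Keep paths relative to the package's parent directory so
                    # that `import <package>` works after extraction.
                    archive.write(full_path, os.path.relpath(full_path, parent))
        self.archive_bytes = buffer.getvalue()

    def setup(self, worker):
        # Runs on every worker: unpack the archive and make it importable.
        with zipfile.ZipFile(io.BytesIO(self.archive_bytes)) as archive:
            archive.extractall(worker.local_directory)
        if worker.local_directory not in sys.path:
            sys.path.insert(0, worker.local_directory)
```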
The implementation of this solution is proposed in this pull request:
The `demo.ipynb` notebook in this repository provides a hands-on example of the plugin in action. It walks you through the following steps:
- Setting up a Dask cluster: A Dask cluster is created on Google Cloud Platform.
- Demonstrating the problem: It shows that trying to use a local package (`dask_module_upload_plugin_demo`) in a distributed task fails with a `ModuleNotFoundError`.
- Using the plugin: The `UploadModule` and `SchedulerUploadModule` plugins are registered with the Dask client (see the registration sketch after this list).
- Success! The same distributed task is executed again, and this time it succeeds because the plugin has uploaded the necessary package to the cluster.
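Registration might look roughly like the sketch below. The import path and constructor argument for `UploadModule` and `SchedulerUploadModule` are assumptions here, since their actual API is defined in the proposed pull request; see the notebook for the exact calls.

```python
# Hedged sketch of registering the plugins; import path and constructor
# arguments are assumptions, not confirmed by this README.
from dask.distributed import Client

client = Client(cluster)  # `cluster` is the GCP cluster created earlier

# Assumed import location and signature for the plugins from the PR.
from distributed.diagnostics.plugin import UploadModule, SchedulerUploadModule  # assumed

client.register_worker_plugin(UploadModule("dask_module_upload_plugin_demo"))
client.register_scheduler_plugin(SchedulerUploadModule("dask_module_upload_plugin_demo"))


def uses_local_package():
    # After registration, this import also succeeds on the workers.
    import dask_module_upload_plugin_demo
    return dask_module_upload_plugin_demo.__name__


assert client.submit(uses_local_package).result() == "dask_module_upload_plugin_demo"
```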
To run the demo, you will need Python and Jupyter Notebook installed, as well as the dependencies listed in `pyproject.toml`.
- Clone this repository.
- Install the dependencies:

  ```bash
  poetry install
  ```

- Set up your GCP credentials. You can do this by creating a `.env` file with the following content (a sketch of how these values can be consumed follows this list):

  ```
  GCP_PROJECT_ID=...
  GCP_ZONE=...
  ```

- Open and run the `demo.ipynb` notebook.
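The README does not show how the notebook turns these values into a cluster, so the sketch below is only one plausible way to do it; `python-dotenv` and `dask_cloudprovider`'s `GCPCluster` are assumptions about the implementation.

```python
# Hedged sketch: read the .env values and start a Dask cluster on GCP.
# python-dotenv and dask_cloudprovider are assumed dependencies here.
import os

from dotenv import load_dotenv
from dask_cloudprovider.gcp import GCPCluster
from dask.distributed import Client

load_dotenv()  # reads GCP_PROJECT_ID and GCP_ZONE from the .env file

cluster = GCPCluster(
    projectid=os.environ["GCP_PROJECT_ID"],
    zone=os.environ["GCP_ZONE"],
    n_workers=1,
)
client = Client(cluster)
```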