
Dask Installation

You can install Dask with conda, with pip, or from source.

Conda

If you use the Anaconda distribution, Dask will be installed by default.

You can also install or upgrade Dask using the conda install command:

conda install dask

This installs Dask and all common dependencies, including pandas and NumPy. Dask packages are maintained both on the defaults channel and on conda-forge. You can select the channel with the -c flag:

conda install dask -c conda-forge

Optionally, you can obtain a minimal Dask installation using the following command:

conda install dask-core

This installs a minimal set of dependencies required to run Dask, similar to (but not exactly the same as) python -m pip install dask.
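
However you install it, a quick way to confirm that Dask is importable is to print its version from Python:

import dask

# Print the installed Dask version to confirm the import works
print(dask.__version__)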

Pip

To install Dask with pip run the following:

python -m pip install "dask[complete]"    # Install everything

This installs Dask, the distributed scheduler, and common dependencies like pandas, NumPy, and others.

You can also install only the Dask library and no optional dependencies:

python -m pip install dask                # Install only core parts of dask

Dask modules like dask.array, dask.dataframe, or dask.distributed won't work until you also install NumPy, pandas, or Tornado, respectively. Installing Dask this way is uncommon for end users but more common for downstream library maintainers.
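
For example, dask.array becomes usable only once NumPy is present; a minimal sketch, assuming NumPy is installed:

import dask.array as da  # raises ImportError if NumPy is not installed

# Build a chunked 1000x1000 array of ones and sum it lazily
x = da.ones((1000, 1000), chunks=(100, 100))
print(x.sum().compute())  # 1000000.0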

We also maintain other dependency sets for different subsets of functionality:

python -m pip install "dask[array]"       # Install requirements for dask array
python -m pip install "dask[dataframe]"   # Install requirements for dask dataframe
python -m pip install "dask[diagnostics]" # Install requirements for dask diagnostics
python -m pip install "dask[distributed]" # Install requirements for distributed dask

We have these options so that users of the lightweight core Dask scheduler aren't required to download the more exotic dependencies of the collections (NumPy, pandas, Tornado, etc.).
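
By contrast, the core scheduler alone can already execute task graphs built with dask.delayed; a minimal sketch that runs on a bare python -m pip install dask:

from dask import delayed  # works with the core install; no NumPy or pandas needed

@delayed
def inc(x):
    return x + 1

# Build a small task graph and execute it with the default scheduler
total = delayed(sum)([inc(i) for i in range(5)])
print(total.compute())  # 15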

Source

To install Dask from source, clone the repository from GitHub:

git clone https://github.com/dask/dask.git
cd dask
python -m pip install .

You can also install all optional dependencies:

python -m pip install ".[complete]"

You can view the list of all dependencies within the project.optional-dependencies field of pyproject.toml.

Or do a developer install by using the -e flag (see the Install section in the Development Guidelines):

python -m pip install -e .

Distributed Deployment

To run Dask on a distributed cluster, you will also want to install the Dask cluster manager that matches your resource manager, like Kubernetes, SLURM, PBS, LSF, AWS, GCP, Azure, or similar technology.

Read more on this topic in the Deploy Documentation (deploying.html).
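
As a quick illustration of the distributed scheduler, installing dask[distributed] lets you start a local cluster and connect a client to it (pass a scheduler address instead to connect to an existing cluster):

from dask.distributed import Client

# With no arguments, Client() starts a local cluster on this machine
client = Client()
print(client.dashboard_link)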

Optional dependencies

Specific functionality in Dask may require additional optional dependencies. For example, reading from Amazon S3 requires s3fs. These optional dependencies and their minimum supported versions are listed below.

Dependency | Version | Description
bokeh | >=2.4.2 | Generate profiles of Dask execution (required for dask.diagnostics)
cachey | >=0.1.1 | Use caching for computation
cityhash | >=0.2.4 | Use CityHash and FarmHash hash functions for array hashing (~2x faster than MurmurHash)
crick | >=0.0.3 | Use tdigest internal method for dataframe statistics computation
cytoolz | >=0.11.0 | Faster cythonized implementation of internal iterators, functions, and dictionaries
dask-expr | | Required for dask.dataframe; pins to a specific Dask version
dask-ml | >=1.4.0 | Common machine learning functions scaled with Dask
fastavro | >=1.1.0 | Storing and reading data from Apache Avro files
gcsfs | >=2021.9.0 | Storing and reading data located in Google Cloud Storage
graphviz | >=0.8.4 | Graph visualization using the graphviz engine
h5py | >=2.10.0 | Storing array data in HDF5 files
ipycytoscape | >=1.0.1 | Graph visualization using the cytoscape engine
IPython | >=7.16.1 | Write graph visualizations made with the graphviz engine to file
jinja2 | >=2.10.3 | HTML representations of Dask objects in Jupyter notebooks (required for dask.diagnostics)
lz4 | >=4.3.2 | Transparent use of the lz4 compression algorithm
matplotlib | >=3.4.1 | Color map support for graph visualization
mimesis | >=5.3.0 | Random bag data generation with dask.datasets.make_people
mmh3 | >=2.5.1 | Use MurmurHash hash functions for array hashing (~8x faster than SHA1)
numpy | >=1.21 | Required for dask.array
pandas | >=1.3 | Required for dask.dataframe
psutil | >=5.7.2 | Factor CPU affinity into CPU count; intelligently infer blocksize when reading CSV files
pyarrow | >=7.0 | Support for Apache Arrow datatypes and engine when storing/reading Apache ORC or Parquet files
python-snappy | >=0.5.4 | Snappy compression to be used when storing/reading Avro or Parquet files
s3fs | >=2021.9.0 | Storing and reading data located in Amazon S3
scipy | >=1.5.2 | Required for dask.array.stats, dask.array.fft, and dask.array.linalg.lu
sparse | >=0.12.0 | Use sparse arrays as a backend for dask arrays
sqlalchemy | >=1.4.16 | Writing and reading from SQL databases
tblib | >=1.6.0 | Serialization of worker traceback objects
tiledb | >=0.8.1 | Storing and reading data from TileDB files
xxhash | >=2.0.0 | Use xxHash hash functions for array hashing (~2x faster than MurmurHash, slightly slower than CityHash)
zarr | >=2.12.0 | Storing and reading data from Zarr files
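
To pick up the S3 example from above: with s3fs installed, Dask DataFrame can read CSV files straight from a bucket. A minimal sketch; the bucket name and path are hypothetical:

import dask.dataframe as dd

# Requires the s3fs optional dependency; "my-bucket" is a placeholder
df = dd.read_csv("s3://my-bucket/data/*.csv")
print(df.head())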

Test

Test Dask with py.test:

cd dask
py.test dask

Installing Dask naively may not install all requirements by default (see the Pip section above). You may choose to install the dask[complete] version, which includes all dependencies for all collections:

pip install "dask[complete]"

Alternatively, you may choose to test only certain submodules depending on the libraries within your environment. For example, to test only Dask core and Dask array we would run tests as follows:

py.test dask/tests dask/array/tests
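
The same selection can also be driven from Python through pytest's programmatic entry point; a sketch equivalent to the command above:

import pytest

# Run only the core and array test suites
exit_code = pytest.main(["dask/tests", "dask/array/tests"])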

See the section on testing in the Development Guidelines for more details.