Add content on dealing with large arrays? #8

Closed
DamienIrving opened this issue Jul 3, 2018 · 3 comments
Labels: enhancement, help wanted

Comments

DamienIrving (Collaborator) commented Jul 3, 2018

People dealing with ocean data (which has an extra depth dimension) or high temporal frequency data (e.g. hourly data) tend to run into issues such as memory errors because of the large size of their data arrays.

Some lesson content on Dask would be helpful here.

DamienIrving added the enhancement label on Jul 3, 2018
DamienIrving changed the title from "Add content on dealing with large data arrays" to "Add content on dealing with large arrays?" on Jul 3, 2018
DamienIrving (Collaborator, Author) commented:

Some introductory notes can be found in this post on Speeding Up Your Code.

DamienIrving (Collaborator, Author) commented:

One option might be to have people log in to http://pangeo.pydata.org and then work through one of the examples from https://github.com/pangeo-data/pangeo-example-notebooks by cloning that repo in the Jupyter terminal.

(To get a classic notebook rather than the JupyterLab environment, replace lab with tree in the URL, e.g. http://pangeo.pydata.org/user/damienirving/tree)

DamienIrving (Collaborator, Author) commented Jun 7, 2019

Resources:
This NCI notebook from Kate Snow introduces chunking.
This tutorial from Scott Wales (see recording) introduces more advanced dask usage.

Possible outline:

0. Simple things you can do

Lazy loading, subsetting, intermediate files, looping over depth slices (for instance).
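
A minimal sketch of these simple strategies, assuming a single netCDF file with an ocean variable thetao and a depth dimension lev (the filename and variable/dimension names are placeholders):

import xarray as xr

# open_dataset is lazy: values are only read from disk when needed.
ds = xr.open_dataset('ocean_temperature.nc')  # placeholder filename

# Subset before computing so only the slab of interest is read.
surface_mean = ds['thetao'].isel(lev=0).mean('time')

# Or loop over depth slices and write intermediate files
# to keep memory use bounded.
for k in range(ds.sizes['lev']):
    ds['thetao'].isel(lev=k).mean('time').to_netcdf(f'mean_lev{k}.nc')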

1. Introduction to chunking

Dask chunking

The metadata of an xarray DataArray loaded with open_mfdataset includes the dask chunk size.
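
For example (a sketch only; the file pattern and chunk sizes are placeholders):

import xarray as xr

# Passing chunks= gives dask-backed arrays; the chunk sizes then show up
# in the DataArray repr and in the .chunks attribute.
ds = xr.open_mfdataset('thetao_*.nc', chunks={'time': 12})
print(ds['thetao'])         # repr includes the dask chunk sizes
print(ds['thetao'].chunks)  # tuple of chunk sizes for each dimension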

File chunking

The file itself may also be chunked. On-disk chunking is available for netCDF-4 and HDF5 files. CMIP6 data should all be netCDF-4 and include some form of chunking within the file.

You can look at the .encoding attribute of an xarray variable to see information about the file storage.
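
For example (the filename and the printed output below are hypothetical):

import xarray as xr

ds = xr.open_dataset('thetao_Omon_ACCESS-CM2_historical.nc')  # placeholder filename

# .encoding records how the variable is stored on disk, including
# the file chunk sizes for netCDF-4/HDF5 data.
print(ds['thetao'].encoding)
# e.g. {'chunksizes': (1, 10, 300, 360), 'zlib': True, 'dtype': dtype('float32'), ...}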

2. Chunking best practices

Accessing data across chunks is slower than along chunks.

Optimal chunk sizes:
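
One possible illustration of this point (a sketch only; filenames and chunk values are placeholders): choose dask chunks that line up with the file's own chunking, so each dask task reads whole file chunks rather than slicing across them.

import xarray as xr

# Check the on-disk chunking first ...
ds = xr.open_dataset('thetao_Omon.nc')          # placeholder filename
print(ds['thetao'].encoding.get('chunksizes'))  # e.g. (1, 10, 300, 360)

# ... then pick dask chunks that are whole multiples of the file chunks.
ds = xr.open_mfdataset('thetao_Omon_*.nc',
                       chunks={'time': 12, 'lev': 10})  # illustrative values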

3. Parallelising your code

In the notebook:

from dask.distributed import Client

# Start a local cluster of workers; displaying the client object
# shows a summary of the cluster (including the dashboard link).
c = Client()
c

From within a script:

import tempfile

import dask.distributed

if __name__ == '__main__':
    # The __main__ guard stops worker processes re-executing the script.
    client = dask.distributed.Client(
        n_workers=8, threads_per_worker=1,
        memory_limit='4gb',
        local_directory=tempfile.mkdtemp())  # scratch space for dask workers

4. Rolling your own dask aware functions

Check if a function is dask aware by watching the progress bar:

import dask.diagnostics

# Display a progress bar for every dask computation.
dask.diagnostics.ProgressBar().register()

Use dask's map_blocks and map_overlap functions to make your own functions dask aware.
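
A minimal sketch of map_blocks and map_overlap (the per-block logic here is a toy placeholder, not a real analysis step):

import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))

def remove_block_mean(block):
    # Toy per-block operation (placeholder logic).
    return block - block.mean()

# map_blocks applies the function to each chunk independently.
result = da.map_blocks(remove_block_mean, x)
result.mean().compute()

# map_overlap is similar but passes each chunk with a halo of
# neighbouring values, for stencil-style operations.
smoothed = da.map_overlap(lambda b: b, x, depth=1)  # identity placeholder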
