Add content on dealing with large arrays? #8

Closed
DamienIrving opened this issue Jul 3, 2018 · 3 comments
Labels: enhancement, help wanted

Comments

DamienIrving (Collaborator) commented Jul 3, 2018

People dealing with ocean data (which has an extra depth dimension) or high temporal frequency data (e.g. hourly data) tend to run into issues such as memory errors because of the large size of their data arrays.

Some lesson content on Dask would be helpful here.

DamienIrving added the enhancement label on Jul 3, 2018
DamienIrving changed the title from "Add content on dealing with large data arrays" to "Add content on dealing with large arrays?" on Jul 3, 2018
DamienIrving (Collaborator, Author) commented:

Some introductory notes can be found in this post on Speeding Up Your Code.

DamienIrving (Collaborator, Author) commented:

One option might be to have people log in to http://pangeo.pydata.org and then work through one of the examples from https://github.com/pangeo-data/pangeo-example-notebooks by cloning that repo in the Jupyter terminal.

(To get a classic notebook rather than the JupyterLab environment, replace lab with tree in the URL, e.g. http://pangeo.pydata.org/user/damienirving/tree)

DamienIrving (Collaborator, Author) commented Jun 7, 2019

Resources:
This NCI notebook from Kate Snow introduces chunking.
This tutorial from Scott Wales (see recording) introduces more advanced dask usage.

Possible outline:

0. Simple things you can do

Lazy loading, subsetting, intermediate files, looping over depth slices (for instance).
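
A minimal sketch of these simple strategies, assuming a single netCDF file with an ocean variable thetao and a depth dimension lev (the filename and variable/dimension names are placeholders):

import xarray as xr

# open_dataset is lazy: values are only read from disk when needed.
ds = xr.open_dataset('ocean_temperature.nc')  # placeholder filename

# Subset before computing so only the slab of interest is read.
surface_mean = ds['thetao'].isel(lev=0).mean('time')

# Or loop over depth slices and write intermediate files
# to keep memory use bounded.
for k in range(ds.sizes['lev']):
    ds['thetao'].isel(lev=k).mean('time').to_netcdf(f'mean_lev{k}.nc')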

1. Introduction to chunking

Dask chunking

The metadata of an xarray DataArray loaded with open_mfdataset includes the dask chunk size.
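
For example (a sketch only; the file pattern and chunk sizes are placeholders):

import xarray as xr

# Passing chunks= gives dask-backed arrays; the chunk sizes then show up
# in the DataArray repr and in the .chunks attribute.
ds = xr.open_mfdataset('thetao_*.nc', chunks={'time': 12})
print(ds['thetao'])         # repr includes the dask chunk sizes
print(ds['thetao'].chunks)  # tuple of chunk sizes for each dimension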

File chunking

The file itself may also be chunked. On-disk chunking is available for netCDF-4 and HDF5 files. CMIP6 data should all be netCDF-4 and include some form of chunking within the file.

You can look at the .encoding attribute of an xarray variable to see information about the file storage.
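
For example (the filename and the printed output below are hypothetical):

import xarray as xr

ds = xr.open_dataset('thetao_Omon_ACCESS-CM2_historical.nc')  # placeholder filename

# .encoding records how the variable is stored on disk, including
# the file chunk sizes for netCDF-4/HDF5 data.
print(ds['thetao'].encoding)
# e.g. {'chunksizes': (1, 10, 300, 360), 'zlib': True, 'dtype': dtype('float32'), ...}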

2. Chunking best practices

Accessing data across chunks is slower than along chunks.

Optimal chunk sizes:
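
One possible illustration of this point (a sketch only; filenames and chunk values are placeholders): choose dask chunks that line up with the file's own chunking, so each dask task reads whole file chunks rather than slicing across them.

import xarray as xr

# Check the on-disk chunking first ...
ds = xr.open_dataset('thetao_Omon.nc')          # placeholder filename
print(ds['thetao'].encoding.get('chunksizes'))  # e.g. (1, 10, 300, 360)

# ... then pick dask chunks that are whole multiples of the file chunks.
ds = xr.open_mfdataset('thetao_Omon_*.nc',
                       chunks={'time': 12, 'lev': 10})  # illustrative values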

3. Parallelising your code

In the notebook:

from dask.distributed import Client

# Start a local cluster of workers; displaying the client object
# shows a summary of the cluster (including the dashboard link).
c = Client()
c

From within a script:

import tempfile

import dask.distributed

if __name__ == '__main__':
    # The __main__ guard stops worker processes re-executing the script.
    client = dask.distributed.Client(
        n_workers=8, threads_per_worker=1,
        memory_limit='4gb',
        local_directory=tempfile.mkdtemp())  # scratch space for dask workers

4. Rolling your own dask aware functions

Check if a function is dask aware by watching the progress bar:

import dask.diagnostics

# Display a progress bar for every dask computation.
dask.diagnostics.ProgressBar().register()

Use dask's map_blocks and map_overlap functions to make your own functions dask aware.
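
A minimal sketch of map_blocks and map_overlap (the per-block logic here is a toy placeholder, not a real analysis step):

import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))

def remove_block_mean(block):
    # Toy per-block operation (placeholder logic).
    return block - block.mean()

# map_blocks applies the function to each chunk independently.
result = da.map_blocks(remove_block_mean, x)
result.mean().compute()

# map_overlap is similar but passes each chunk with a halo of
# neighbouring values, for stencil-style operations.
smoothed = da.map_overlap(lambda b: b, x, depth=1)  # identity placeholder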
