Add content on dealing with large arrays? #8
Some introductory notes can be found in this post on Speeding Up Your Code.
One option might be to have people log in to http://pangeo.pydata.org and then work through one of the examples from https://github.com/pangeo-data/pangeo-example-notebooks by cloning that repo in the Jupyter terminal. (To get a notebook rather than the JupyterLab environment you need to replace …)
Resources: …

Possible outline:

0. Simple things you can do
   Lazy loading, subsetting, intermediate files, looping over depth slices (for instance). (See the first sketch after this outline.)
1. Introduction to chunking
   - Dask chunking: the metadata of an xarray DataArray loaded with … (first sketch below)
   - File chunking: the file itself may also be chunked. Filesystem chunking is available in netCDF-4 and HDF5 datasets. CMIP6 data should all be netCDF-4 and include some form of chunking on the file. You can look at the …
2. Chunking best practices
   Accessing data across chunks is slower than along chunks. Optimal chunk sizes: … (second sketch below)
3. Parallelising your code
   - In the notebook: …
   - From within a script: …
   (Third sketch below.)
4. Rolling your own dask aware functions
   - Check if a function is dask aware by watching the progress bar: …
   - Use the dask …
   (Fourth sketch below.)
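A minimal sketch of the lazy-loading and Dask-chunking steps (items 0 and 1), assuming a hypothetical CMIP6-style file `ocean_temp.nc` with a variable `temp` and dimensions `(time, depth, lat, lon)`; the chunk sizes are illustrative only. On the file side, one way to inspect the on-disk netCDF-4/HDF5 chunking is `ncdump -hs file.nc`, which lists special attributes such as `_ChunkSizes`.

```python
import xarray as xr

# Lazy load: nothing is read into memory yet, and the chunks
# argument makes each variable a dask array (sizes illustrative;
# ideally align them with the file's own chunking).
ds = xr.open_dataset("ocean_temp.nc", chunks={"time": 12})

# The DataArray's metadata shows the dask chunking.
print(ds["temp"].chunks)

# Subset before computing so only the needed chunks are read,
# e.g. loop over depth slices instead of loading the full array.
for depth in range(ds.sizes["depth"]):
    slice_mean = ds["temp"].isel(depth=depth).mean("time").compute()
```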
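For the best-practices point that access across chunks is slower than along them, a hedged illustration with the same hypothetical file, chunked only along `time` so each chunk holds a full spatial field:

```python
import xarray as xr

ds = xr.open_dataset("ocean_temp.nc", chunks={"time": 12})  # hypothetical file

# Fast: a single time step lies within one chunk, so only
# that chunk has to be read.
snapshot = ds["temp"].isel(time=0).compute()

# Slow: a single-point time series cuts across every time chunk,
# so each chunk must be read just to extract one value from it.
series = ds["temp"].isel(depth=0, lat=0, lon=0).compute()
```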
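For the parallelisation step, a sketch of starting a dask.distributed cluster (the worker counts are assumptions). In a notebook the `Client` can be created directly in a cell; in a script the same code needs the `__main__` guard because workers are started as separate processes:

```python
from dask.distributed import Client
import xarray as xr

if __name__ == "__main__":
    # Local cluster; in a notebook you would just run
    # `client = Client()` in a cell and open the dashboard link.
    client = Client(n_workers=4, threads_per_worker=1)
    print(client.dashboard_link)

    ds = xr.open_dataset("ocean_temp.nc", chunks={"time": 12})
    # Computations on dask-backed arrays now run on the cluster.
    print(ds["temp"].mean().compute())

    client.close()
```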
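For the final step, one way (among several) to check dask-awareness and to wrap a plain NumPy-level function so it runs chunk by chunk is `dask.diagnostics.ProgressBar` together with `xr.apply_ufunc(..., dask="parallelized")`; the conversion function here is a made-up example:

```python
import xarray as xr
from dask.diagnostics import ProgressBar

def degc_to_degk(arr):
    # Plain array function with no knowledge of dask.
    return arr + 273.15

ds = xr.open_dataset("ocean_temp.nc", chunks={"time": 12})  # hypothetical file

# dask="parallelized" applies the function one chunk at a time
# instead of forcing the whole array into memory.
temp_k = xr.apply_ufunc(
    degc_to_degk,
    ds["temp"],
    dask="parallelized",
    output_dtypes=[ds["temp"].dtype],
)

# With the default (local) scheduler, the progress bar ticks up as
# chunks complete; a single long stall followed by an instant 100%
# suggests the function pulled everything into memory at once.
with ProgressBar():
    temp_k.mean().compute()
```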
People dealing with ocean data (with its extra depth dimension) or high time frequency data (e.g. hourly data) tend to run into issues like memory errors because of the sheer size of their data arrays.
Some lesson content on Dask would be helpful here.