Pangeo Cloud Storage Benchmarks

Big Arrays, Fast: Profiling Cloud Storage Read Throughput

Ryan Abernathey

As the size of geoscience datasets grows, scientists are eager to move away from a download-based workflow, where data files are downloaded to a local computer for analysis, towards a more cloud-native workflow, where data is loaded on demand over the network. On-demand data loading offers several advantages, including increased reproducibility, provenance tracking, and, potentially, scalability using distributed cloud computing.

In this notebook, we demonstrate how to load data on-demand using three different remote data access protocols:

OPeNDAP, the most common, well-established protocol
NetCDF over HTTP, enabled by the h5py library
Zarr over HTTP, a new format optimized for cloud object storage (e.g. Amazon S3)
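
To make the three access paths concrete, here is a minimal sketch of how each one might be opened with xarray. It is not taken from the notebook: the helper names and the `OPENERS` table are hypothetical, the use of fsspec to supply the HTTP file object and store mapper is an assumption, and any URL passed in is a placeholder to be supplied by the reader. Imports are deferred into each function so the sketch can be loaded even where only some of the libraries are installed.

```python
# Illustrative helpers only; names are hypothetical and URLs must be
# supplied by the caller. Each call returns a lazily loaded xarray.Dataset.


def open_opendap(url):
    """Open a dataset served over OPeNDAP using xarray's netCDF4 backend."""
    import xarray as xr

    return xr.open_dataset(url, engine="netcdf4")


def open_netcdf_over_http(url):
    """Open a NetCDF4/HDF5 file over HTTP.

    Assumption: fsspec provides the file-like object, and the h5netcdf
    engine (built on h5py) reads from it.
    """
    import fsspec
    import xarray as xr

    file_obj = fsspec.open(url, mode="rb").open()
    return xr.open_dataset(file_obj, engine="h5netcdf")


def open_zarr_over_http(url):
    """Open a Zarr store over HTTP through an fsspec key-value mapper."""
    import fsspec
    import xarray as xr

    return xr.open_zarr(fsspec.get_mapper(url))


# Hypothetical lookup table from protocol name to opener.
OPENERS = {
    "opendap": open_opendap,
    "netcdf_http": open_netcdf_over_http,
    "zarr_http": open_zarr_over_http,
}
```

In all three cases no array data is transferred until values are actually requested, which is what makes per-read throughput worth measuring. Exact behavior may vary with the installed versions of xarray, zarr, and fsspec.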

We then conduct a simple benchmarking exercise to explore the throughput and scalability of each service. We use Dask to parallelize reads from each access protocol and calculate the throughput as a function of number of parallel reads. One conclusion is that Zarr over HTTP, coupled with cloud object storage, shows favorable scaling up to hundreds of parallel processes.
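
The measurement idea (bytes read divided by wall-clock time, repeated for an increasing number of parallel readers) can be sketched as follows. This is not the notebook's benchmark: it swaps in the standard library's `ThreadPoolExecutor` for Dask so it runs without a cluster, and `fake_read_chunk`, with its fixed 10 ms delay and 1 MB payload, is a hypothetical stand-in for a real remote read.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_read_chunk(index, nbytes=1_000_000):
    """Hypothetical stand-in for one remote chunk read.

    Sleeps briefly to imitate network latency and returns the number of
    bytes that would have been read.
    """
    time.sleep(0.01)
    return nbytes


def measure_throughput(read_chunk, n_chunks, n_workers):
    """Return throughput in MB/s for n_chunks reads spread over n_workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        total_bytes = sum(pool.map(read_chunk, range(n_chunks)))
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6


if __name__ == "__main__":
    # Throughput as a function of the number of parallel readers.
    for n_workers in (1, 2, 4, 8):
        mb_per_s = measure_throughput(fake_read_chunk, n_chunks=16, n_workers=n_workers)
        print(f"{n_workers:>2} workers: {mb_per_s:8.1f} MB/s")
```

With a latency-bound stand-in like this one, throughput rises roughly in proportion to the worker count; a real service would eventually be limited by bandwidth or by the server, which is the scaling behavior the benchmark is designed to expose.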

Finally, we compare the throughput of Zarr over HTTP on a few different clouds, including Google Cloud Storage, Jetstream Cloud, Wasabi Cloud, and Open Storage Network.


Investigation of the throughput of various cloud storage formats and services. Prepared for the 2020 EarthCube Meeting by Ryan Abernathey.

This repository is configured for Pangeo Gallery and builds itself automatically using GitHub Actions and binderbot.

A statically rendered version is available here:

http://gallery.pangeo.io/repos/earthcube2020/ec20_abernathey_etal/cloud_storage.html

An interactive Binder is here:

The code is licensed via the open-source MIT License.

