This is a set of tools for working with the "kerchunk" library:
https://fsspec.github.io/kerchunk/
Kerchunk provides cloud-friendly indexing of data files without needing to move the data itself.
The tools included here allow:
- indexing of existing NetCDF files to kerchunk files
- aggregation of existing NetCDF files to a single kerchunk file
- tools to write to either POSIX file systems or S3-compatible object-store
- a wrapper around `xarray` to ensure that the data can be read by Python
- integration with access control to limit read/write operations as desired
An example notebook can be run using binder:
https://mybinder.org/v2/gh/cedadev/kerchunk-tools.git/main?filepath=notebooks
From scratch, you can conda install with:
```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -p ~/miniconda -b
source ~/miniconda/bin/activate
conda create --name kerchunk-tools --file spec-file.txt
conda activate kerchunk-tools
pip install -e . --no-deps
```
Assuming you have Python 3 installed, you can also install with Pip:
```
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e . --no-deps
```
NOTE: this installation method generated a lot of HDF5 library warnings when reading data, which were not seen with the Conda install.
Here is an example of using kerchunk_tools with authentication to the S3 service:
```python
import kerchunk_tools.xarray_wrapper as wrap_xr

s3_config = {
    "token": "TOKEN",
    "secret": "SECRET",
    "endpoint_url": "ENDPOINT_URL"
}

# Load a Kerchunk file
index_uri = "s3://kc-indexes-cci-cloud-v2/BICEP-OC-L3S-PP-MERGED-1M_MONTHLY_9km_mapped-1998-2020-fv4.2.zstd"
ds = wrap_xr.wrap_xr_open(index_uri, s3_config=s3_config)

# Look at the metadata
print(ds)
pp = ds.pp
print(pp.shape, pp.dims)

# Look at the data
mx = ds.pp.sel(time=slice("1998-03-01", "2000-02-01"), lat=slice(34, 40), lon=slice(20, 23)).max()
mx = float(mx)
print(mx)
assert 2137 < mx < 2139
```
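For context, Kerchunk indexes are normally opened through fsspec's "reference" filesystem and xarray's Zarr engine, and the sketch below shows that route without the wrapper. It is only a sketch: it assumes an uncompressed JSON index with a hypothetical bucket/object name, whereas the wrapper above also handles details such as the zstd-compressed index used in the example.

```python
import fsspec
import xarray as xr

# Hypothetical, uncompressed Kerchunk index in S3 (adjust to your own bucket/object)
index_uri = "s3://my-bucket/my-kerchunk-index.json"

# Credentials for an S3-compatible endpoint (same fields as s3_config above)
s3_options = {
    "key": "TOKEN",
    "secret": "SECRET",
    "client_kwargs": {"endpoint_url": "ENDPOINT_URL"},
}

# Build a "reference" filesystem: the index itself is fetched from S3, and the
# chunk references it contains are also resolved against S3 with the same credentials.
fs = fsspec.filesystem(
    "reference",
    fo=index_uri,
    target_protocol="s3",
    target_options=s3_options,
    remote_protocol="s3",
    remote_options=s3_options,
)

# Present the references as a Zarr store and open them lazily with xarray
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False})
print(ds)
```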
If you are connecting to a secured endpoint, then you will need three items for your S3 configuration:
- S3_TOKEN
- S3_SECRET
- S3_ENDPOINT_URL
Then you can run a full workflow that:
- creates a bucket in S3
- uploads some NetCDF files to S3
- creates a kerchunk file in S3 (for a single NetCDF file)
- creates a kerchunk file in S3 (for an aggregation of multiple NetCDF files)
- reads from the kerchunk files and extracts/processes a subset of data
```
S3_TOKEN=s3_token S3_SECRET=s3_secret S3_ENDPOINT_URL=s3_endpoint pytest tests/test_workflows/test_workflow_s3_quobyte_single.py -v
```
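For reference, the single-file indexing and multi-file aggregation steps of that workflow can be sketched with the underlying kerchunk library itself. This is not the kerchunk_tools API: the input file names and output paths below are hypothetical, and the real workflow writes its outputs to S3 rather than to local JSON files.

```python
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Hypothetical input NetCDF (HDF5-based) files
nc_files = ["data/file_199709.nc", "data/file_199710.nc"]

# 1. Index a single NetCDF file to a Kerchunk reference set
with fsspec.open(nc_files[0], "rb") as f:
    single_refs = SingleHdf5ToZarr(f, nc_files[0]).translate()
with open("single_index.json", "w") as out:
    json.dump(single_refs, out)

# 2. Aggregate per-file reference sets into one index along the time dimension
all_refs = []
for path in nc_files:
    with fsspec.open(path, "rb") as f:
        all_refs.append(SingleHdf5ToZarr(f, path).translate())

combined = MultiZarrToZarr(
    all_refs,
    concat_dims=["time"],
    identical_dims=["lat", "lon"],  # assumed identical across files for this gridded data
).translate()
with open("aggregated_index.json", "w") as out:
    json.dump(combined, out)
```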
Our initial timings, from a first pass of testing, came out as follows. The table shows test timings in seconds; where multiple values appear, the test was run multiple times.
| Test type | Read/process small subset | Read/process larger subset |
|---|---|---|
| POSIX Kerchunk | 1.0, 0.7 | 15.2, 37.9 |
| S3-Quobyte Kerchunk | 1.1, 4.7, 1.3 | 8.5, 9.1, 5.7 |
| S3-DataCore Zarr | 3.9, 3.8 | 99.8, 99.2 |
| POSIX Xarray | 0.6, 0.9 | 86.0, 91.4 |
We need to run these repeatedly to validate them.
The test types are:
- POSIX Kerchunk:
  - This uses a Kerchunk index file on the POSIX file system
  - It references NetCDF files on the POSIX file system
  - There is no use of object-store
  - This test depends on having pre-generated the Kerchunk index file
- S3-Quobyte Kerchunk:
  - This uses a Kerchunk index file in the JASMIN S3-Quobyte object-store
  - It references NetCDF files in the S3-Quobyte object-store
  - The files are actually part of the CEDA Archive and are exposed via an S3 interface
  - There is no use of the POSIX file systems
  - This test depends on having pre-generated the Kerchunk index file
- S3-DataCore Zarr:
  - This reads a Zarr file that we have copied into the JASMIN DataCore (formerly Caringo) object-store
  - The data is the same content as used for the other tests, converted from NetCDF to Zarr
  - There is no use of Kerchunk
  - This test depends on having pre-generated the Zarr file from NetCDF
- POSIX Xarray:
  - This reads all the NetCDF files directly into Xarray (as a list of files)
  - The files are read directly from the POSIX file system
  - There is no pre-generation step for this test
  - This is slower because the aggregation of the NetCDF content is done on-the-fly (see the sketch below)
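For comparison, the POSIX Xarray case corresponds to the standard multi-file open in xarray, where the aggregation along time happens at open time. A minimal sketch, assuming the per-year subdirectory layout of the test data described just below, might look like this:

```python
import glob
import xarray as xr

# Collect the monthly NetCDF files directly from the POSIX file system
# (per-year subdirectories, e.g. 1997/ ... 2020/, are picked up by the */ glob)
base = "/neodc/esacci/ocean_colour/data/v5.0-release/geographic/netcdf/chlor_a/monthly/v5.0"
nc_files = sorted(glob.glob(f"{base}/*/*.nc"))

# Aggregation along the time dimension is done on-the-fly at open time,
# which is why this approach is slower for the larger subset
ds = xr.open_mfdataset(nc_files, combine="by_coords")
print(ds)
```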
The test data used is a set of 279 data files from the CCI archive, under the directory:
/neodc/esacci/ocean_colour/data/v5.0-release/geographic/netcdf/chlor_a/monthly/v5.0/
The first and last files are:
First: .../1997/ESACCI-OC-L3S-CHLOR_A-MERGED-1M_MONTHLY_4km_GEO_PML_OCx-199709-fv5.0.nc
Last: .../2020/ESACCI-OC-L3S-CHLOR_A-MERGED-1M_MONTHLY_4km_GEO_PML_OCx-202011-fv5.0.nc
In all cases the test is run as follows.
Test 1 - Read/process small subset:
- Load the data as an `xarray.Dataset` object.
- Create a small time/lat/lon slice of shape `(2, 144, 72)` (only 2 time steps == 2 files).
- Calculate the maximum value and assert it equals the expected value.
Test 2 - Read/process larger subset:
- Load the data as an `xarray.Dataset` object.
- Create a larger time/lat/lon slice of shape `(279, 12, 24)` (279 time steps == 279 files).
- Calculate the maximum value and assert it equals the expected value.
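Both tests follow the same pattern: open the dataset, select a time/lat/lon subset, compute the maximum, and time the whole operation. A minimal sketch of that pattern is below; the variable name, expected value, and tolerance are assumptions for illustration, not the values used in the real tests.

```python
import time

def run_subset_test(ds, time_slice, lat_slice, lon_slice, expected_max, tolerance=1.0):
    """Time a subset-and-reduce step and check the result against an expected value."""
    start = time.time()

    # Select a time/lat/lon subset and compute its maximum value
    # ("chlor_a" is an assumed variable name for the CCI ocean-colour files)
    subset = ds["chlor_a"].sel(time=time_slice, lat=lat_slice, lon=lon_slice)
    mx = float(subset.max())

    duration = time.time() - start
    print(f"max={mx:.3f}, took {duration:.1f} s")
    assert abs(mx - expected_max) < tolerance
    return duration
```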
These resources may be useful for understanding why we wanted to look at Kerchunk and how it fits into our bigger picture plans at CEDA:
- JASMIN Notebook service intro: https://www.youtube.com/watch?v=nle9teGLAb0&list=PLhF74YhqhjqmZgbQLu_PXZmA27q7vHygg
- JASMIN Notebooks workshop tutorial: https://www.youtube.com/watch?v=7UWjhIKq2x0&list=PLhF74Yhqhjqn8NDgU7xfKGLGP8h-FQ1lt&index=16
- Notebook that I showed (demonstrating intake access to CMIP6): https://github.com/cedadev/cmip6-object-store/blob/master/notebooks/cmip6-zarr-jasmin.ipynb
- Intake library documentation: https://intake.readthedocs.io/en/latest/?badge=latest
- Intake ESM (for Earth System Model data) docs: https://intake-esm.readthedocs.io/en/stable/
- Kerchunk docs: https://fsspec.github.io/kerchunk/
- Useful intro talk on Kerchunk (when it was called ReferenceFileSystem - I think): https://www.youtube.com/watch?v=AWJzDk6M6NM&t=628s