# Kerchunk

This notebook is a companion to [Kerchunk and Pangeo-Forge](notebooks/kerchunk-pangeo-forge.ipynb). It demonstrates how to recreate the same processing pipeline using `Kerchunk` without `Pangeo-Forge`. The dataset and the processing steps are described in more detail in that notebook.

More detailed guides to `Kerchunk` can be found in the [Docs](https://fsspec.github.io/kerchunk/) and the [Project Pythia Cookbook](https://projectpythia.org/kerchunk-cookbook/README.html).

In [1]:
import dask
import fsspec
import glob
import logging
import ujson
import warnings
import xarray as xr

from tempfile import TemporaryDirectory
from distributed import Client
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

## Examine a Single File

In [2]:
import xarray as xr

ds = xr.open_dataset(
    "http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_2021.nc"
)

In [3]:
ds

### Start a Dask Client for Parallel Processing

In [4]:
client = Client(n_workers=8, silence_logs=logging.ERROR)
client

Cluster type: distributed.LocalCluster (connection method: Cluster object)
Dashboard: http://127.0.0.1:8787/status
Workers: 8, Total threads: 8, Total memory: 16.00 GiB
Status: running, Using processes: True
Each worker: 1 thread, 2.00 GiB memory


### Create List of URLs

In [28]:
start_year = 1979
end_year = 1980  # Data is available until 2022. This example only processes 2 years

# range() excludes its upper bound, so add 1 to include end_year
years = list(range(start_year, end_year + 1))

url_list = [
    f"http://www.northwestknowledge.net/metdata/data/bi_{year}.nc" for year in years
]
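
Because `range` excludes its upper bound, the `+ 1` is what makes the final year inclusive. A minimal sketch of the same URL construction wrapped in a helper makes it easy to check how many files a given span of years will request. The name `build_urls` is hypothetical and not part of the original notebook.

```python
from typing import List


def build_urls(start_year: int, end_year: int) -> List[str]:
    """Return one gridMET URL per year, with both endpoints included."""
    base = "http://www.northwestknowledge.net/metdata/data"
    # range() stops before its second argument, hence end_year + 1
    return [f"{base}/bi_{year}.nc" for year in range(start_year, end_year + 1)]


urls = build_urls(1979, 1980)
print(len(urls))
print(urls[-1])
```

Widening the span only requires changing the two arguments, since one file is requested per year.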

### Create a Temp Directory to Write

In [29]:
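import os
from tempfile import TemporaryDirectory

td = TemporaryDirectory()
target_root = td.name
# target_root is overridden here, so the references are written to a local
# "esip" directory rather than to the temporary directory created above.
target_root = "esip"
store_name = "kerchunk"
target_store = os.path.join(target_root, store_name)

# Make sure the output directory exists before any reference file is written
os.makedirs(target_store, exist_ok=True)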

### Use `SingleHdf5ToZarr` to Create Kerchunk References


In [30]:
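def generate_json_reference(fil, output_dir: str):
    # Scan the HDF5/NetCDF4 file and build its chunk references;
    # chunks smaller than inline_threshold bytes are stored inline in the JSON.
    h5chunks = SingleHdf5ToZarr(fil, inline_threshold=300)
    # Drop the extension with splitext: str.strip(".nc") removes characters,
    # not a suffix, and can eat into the file name itself.
    fname = os.path.splitext(fil.split("/")[-1])[0]
    outf = f"{output_dir}/{fname}.json"
    with open(outf, "wb") as f:
        f.write(ujson.dumps(h5chunks.translate()).encode())
    return outf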


# Generate Dask Delayed objects
tasks = [dask.delayed(generate_json_reference)(fil, target_store) for fil in url_list]
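
A note on deriving the output name from a URL: `str.strip` treats its argument as a set of characters to remove from both ends, not as a suffix, so it only gives the expected result when the file name does not begin or end with `.`, `n`, or `c`. The sketch below contrasts the two approaches. The helper name `reference_name` and the file name `nc_canyon.nc` are made up for illustration.

```python
import os


def reference_name(url: str) -> str:
    """Return the file name at the end of `url` without its extension."""
    return os.path.splitext(os.path.basename(url))[0]


# str.strip removes any of the characters '.', 'n', 'c' from both ends
stripped = "nc_canyon.nc".strip(".nc")
print(stripped)

# splitext removes only the final extension
print(reference_name("http://example.org/data/nc_canyon.nc"))
print(reference_name("http://www.northwestknowledge.net/metdata/data/bi_1979.nc"))
```

For the `bi_<year>.nc` files used here both approaches return the same name; the difference only shows up with other naming schemes.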

In [31]:

%%time 
# Start parallel processing

warnings.filterwarnings("ignore")
dask.compute(tasks)

(['esip/kerchunk/bi_1979.json', 'esip/kerchunk/bi_1980.json'],)
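
Each JSON file written above holds a Kerchunk reference set. In the version 1 reference format, the top-level `refs` mapping associates Zarr-style keys with either an inline string (metadata, or chunks smaller than `inline_threshold`) or a `[url, offset, length]` list pointing into the original file. The sketch below counts both kinds. The helper `summarize_refs` and the toy dictionary are hypothetical; the keys, URL, and byte offsets are invented and do not come from the gridMET files.

```python
from collections import Counter


def summarize_refs(reference: dict) -> Counter:
    """Count inline values versus byte-range references in a reference set."""
    counts = Counter()
    for value in reference["refs"].values():
        if isinstance(value, str):
            counts["inline"] += 1   # stored directly in the JSON
        else:
            counts["remote"] += 1   # [url, offset, length] into the source file
    return counts


# Toy reference set with made-up keys and offsets, for illustration only
toy_reference = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "day/0": "base64:AAAA",
        "var/0.0.0": ["http://example.org/bi_1979.nc", 4096, 1024],
    },
}

counts = summarize_refs(toy_reference)
print(counts)
```

The same function could be applied to one of the generated files after loading it with `ujson.load`, assuming the file follows the layout described above.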

### Combine Kerchunk Reference Files into Combined Reference

In [36]:
%%time

# Create a list of reference json files
output_files = glob.glob(f"{target_store}/*.json")

# combine individual references into single consolidated reference
mzz = MultiZarrToZarr(
    output_files,
    concat_dims=["day"],
    identical_dims=["lat", "lon", "crs"],
)

multi_kerchunk = mzz.translate()

# Write in-memory Kerchunk combined reference to json
output_fname = "Combined_Reference.json"
with open(output_fname, "wb") as f:
    f.write(ujson.dumps(multi_kerchunk).encode())

CPU times: user 590 ms, sys: 103 ms, total: 693 ms
Wall time: 2.59 s


## Load Kerchunk Dataset

In [34]:
fs = fsspec.filesystem(
    "reference",
    fo="Combined_Reference.json",
    remote_protocol="http",
    remote_options={"anon": True},
    skip_instance_cache=True,
)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr")

In [35]:
ds