# Generate metadata reference files from MERRA-2 netCDFs in S3 storage

## This notebook provides:

1) S3 location of hourly MERRA-2 files
2) Method and code for generating metadata reference files (json) using the kerchunk library
3) Recommended storage location for generated metadata reference jsons
4) Code for combining many metadata reference jsons into a single json file

### Import modules

If you're in the Openscapes 2i2c Hub, you may need to install a few packages in your Terminal first:

- ```conda install -c conda-forge kerchunk```
- ```conda install -c anaconda h5py```

**Note:** If conda installs version 2.10.0 of h5py, run ```conda update h5py```. Version 2.10.0 has a known issue with a method that kerchunk relies on.
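
To confirm which h5py version is installed before running the rest of the notebook, a minimal check along these lines may help. The helper name `h5py_needs_update` and the "anything below 3.0 should be updated" threshold are assumptions for illustration; the note above only calls out 2.10.0.

```python
from importlib.metadata import PackageNotFoundError, version


def h5py_needs_update(version_string, minimum_major=3):
    """Return True if the major version is below minimum_major (an assumed threshold)."""
    major = int(version_string.split(".")[0])
    return major < minimum_major


# Look up the installed version without importing h5py itself.
try:
    installed = version("h5py")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("h5py is not installed")
elif h5py_needs_update(installed):
    print(f"h5py {installed} found; consider running: conda update h5py")
else:
    print(f"h5py {installed} found")
```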

In [10]:
import requests
import xarray as xr
import s3fs
import pathlib
import ujson
import h5py
import fsspec
import hvplot.xarray

from kerchunk.hdf import SingleHdf5ToZarr 
from kerchunk.combine import MultiZarrToZarr

# The xarray Dataset opened from the reference file raises a SerializationWarning
# for each variable (cause not yet investigated), so warnings are suppressed here
import warnings
warnings.simplefilter("ignore")

### Get authentication and set up file system

In [2]:
gesdisc_s3 = "https://data.gesdisc.earthdata.nasa.gov/s3credentials"

response = requests.get(gesdisc_s3).json()
fs = s3fs.S3FileSystem(key=response['accessKeyId'],
                    secret=response['secretAccessKey'],
                    token=response['sessionToken'],
                    client_kwargs={'region_name':'us-west-2'})
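
If the credentials request does not succeed, the returned JSON may not contain the expected keys, and the `S3FileSystem` call above would then fail with a bare `KeyError`. The sketch below is one possible way to validate the response first; `s3fs_kwargs` and `REQUIRED_KEYS` are hypothetical names, and the credential values are placeholders.

```python
REQUIRED_KEYS = ("accessKeyId", "secretAccessKey", "sessionToken")


def s3fs_kwargs(creds, region="us-west-2"):
    """Map the credentials JSON to keyword arguments for s3fs.S3FileSystem."""
    missing = [k for k in REQUIRED_KEYS if k not in creds]
    if missing:
        raise KeyError(f"credentials response is missing: {missing}")
    return {
        "key": creds["accessKeyId"],
        "secret": creds["secretAccessKey"],
        "token": creds["sessionToken"],
        "client_kwargs": {"region_name": region},
    }


# Placeholder values for illustration only; real values come from the endpoint.
example_response = {
    "accessKeyId": "example-key",
    "secretAccessKey": "example-secret",
    "sessionToken": "example-token",
}
kwargs = s3fs_kwargs(example_response)
print(sorted(kwargs))
```

With a real response, the file system could then be created as `fs = s3fs.S3FileSystem(**s3fs_kwargs(response))`.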

### Get a list of URLs for a particular month of MERRA-2 data (March 2019)

In [3]:
urls = fs.ls("s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/03/")
#urls
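
`fs.ls` returns every object under the prefix. If the listing ever contains anything other than `.nc4` granules (an assumption, not confirmed here), it may be worth filtering before generating references. A minimal sketch, where `select_granules` is a hypothetical helper and the `.xml` entry is an invented example:

```python
def select_granules(paths, suffix=".nc4"):
    """Return the paths ending in suffix, sorted by name (date order for these file names)."""
    return sorted(p for p in paths if p.endswith(suffix))


prefix = "gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/03/"
listing = [
    prefix + "MERRA2_400.tavg1_2d_slv_Nx.20190326.nc4",
    prefix + "MERRA2_400.tavg1_2d_slv_Nx.20190314.nc4",
    prefix + "MERRA2_400.tavg1_2d_slv_Nx.20190314.nc4.xml",  # hypothetical sidecar file
]

granules = select_granules(listing)
print(len(granules), "granules selected")
```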

### Create Dask client to process the json files in parallel

We recommend taking advantage of Dask parallelization to speed up the generation of these metadata json files. Each file is processed independently, so the work does not need to run in sequence.

In [4]:
import dask
from dask.distributed import Client
client = Client(n_workers=4)
client

Client: LocalCluster (distributed.LocalCluster), status running, using processes
Dashboard: http://127.0.0.1:8787/status
Workers: 4, each with 1 thread and 1.89 GiB memory (4 threads and 7.57 GiB in total)


### Define a function for making metadata json files

This uses methods from the kerchunk library (originally fsspec-reference-maker) to extract key metadata from the netCDF files stored in S3.

In [15]:
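def gen_json(u, output_path):
    """Write a kerchunk reference json for the netCDF file at S3 path `u` into `output_path`."""
    # Options forwarded to fs.open()
    so = dict(
        mode="rb",
        anon=True,
        default_fill_cache=False,
        default_cache_type="none"
    )
    with fs.open(u, **so) as infile:
        # Chunks smaller than inline_threshold bytes are embedded directly in the references
        h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
        # Name the output after the source file: <file name>.json
        with open(f"{output_path}/{u.split('/')[-1]}.json", 'wb') as outf:
            outf.write(ujson.dumps(h5chunks.translate()).encode())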
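
To make the output of `gen_json` more concrete: a kerchunk reference set maps Zarr keys either to inline content or to a `[url, offset, length]` triple pointing into the original file. The dictionary below is a hand-written toy example that assumes the version 1 layout with a top-level `refs` key; the bucket name, offsets, and lengths are invented, and `summarize_refs` is a hypothetical helper.

```python
import json

# Toy reference set for illustration; not produced from a real MERRA-2 file.
toy_reference = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',                                # inline metadata
        "T2M/0.0.0": ["s3://example-bucket/example.nc4", 1024, 2048],  # byte range in the source file
        "T2M/0.0.1": ["s3://example-bucket/example.nc4", 3072, 2048],
    },
}


def summarize_refs(reference):
    """Count inline entries (strings) versus byte-range entries (lists)."""
    refs = reference["refs"]
    inline = sum(1 for value in refs.values() if isinstance(value, str))
    ranges = sum(1 for value in refs.values() if isinstance(value, list))
    return {"inline": inline, "byte_ranges": ranges}


# Round-trip through JSON to mimic reading a reference file back from disk.
loaded = json.loads(json.dumps(toy_reference))
summary = summarize_refs(loaded)
print(summary)
```

Raising `inline_threshold` would move more small chunks into the inline group, at the cost of a larger json file.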

### Create a directory for output jsons if one doesn't exist already

We recommend you create this directory OUTSIDE of this repository, in your own workspace.

In [16]:
output_path = '/home/jovyan/data/jsons' # Change this to your preferred location
pathlib.Path(output_path).mkdir(parents=True, exist_ok=True)  # parents=True also creates missing parent directories

### Use the Dask Delayed function to create the Kerchunk reference file (json) for each URL in the list of URLs in parallel

In [17]:
%%time
dask.compute(*[dask.delayed(gen_json)(u,output_path) for u in urls])

Key:       gen_json-3afa0fba-dc7e-4cc4-9736-4283674e88ea
Function:  gen_json
args:      ('gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/03/MERRA2_400.tavg1_2d_slv_Nx.20190326.nc4', '/home/jovyan/data/jsons')
kwargs:    {}
Exception: 'AttributeError("\'h5py.h5d.DatasetID\' object has no attribute \'get_num_chunks\'")'



AttributeError: 'h5py.h5d.DatasetID' object has no attribute 'get_num_chunks'

Key:       gen_json-a1366358-64f1-49e1-ac39-3dfe563ffdcd
Function:  gen_json
args:      ('gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/03/MERRA2_400.tavg1_2d_slv_Nx.20190314.nc4', '/home/jovyan/data/jsons')
kwargs:    {}
Exception: 'AttributeError("\'h5py.h5d.DatasetID\' object has no attribute \'get_num_chunks\'")'

Key:       gen_json-66032a0f-bf6f-471d-8735-016ebf8c824a
Function:  gen_json
args:      ('gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/03/MERRA2_400.tavg1_2d_slv_Nx.20190324.nc4', '/home/jovyan/data/jsons')
kwargs:    {}
Exception: 'AttributeError("\'h5py.h5d.DatasetID\' object has no attribute \'get_num_chunks\'")'

Key:       gen_json-0646fe06-b78b-411e-a900-0f1d8ed722d1
Function:  gen_json
args:      ('gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/03/MERRA2_400.tavg1_2d_slv_Nx.20190308.nc4', '/home/jovyan/data/jsons')
kwargs:    {}
Exception: 'AttributeError("\'h5py.h5d.DatasetID\' object has no attribute \'get_num_chunks\'")'
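
The tasks shown above all fail with the same `AttributeError`, which is likely the h5py issue mentioned in the import section. More generally, when only some files fail it can be easier to collect the failures per file than to let an exception interrupt the whole run. The sketch below wraps a callable so that errors are returned rather than raised; `run_safely` is a hypothetical helper, and `fake_gen_json` is a stand-in for `gen_json` so the example runs without S3 access.

```python
def run_safely(func, item, *args):
    """Call func(item, *args); return (item, None) on success or (item, error text) on failure."""
    try:
        func(item, *args)
        return item, None
    except Exception as exc:
        return item, f"{type(exc).__name__}: {exc}"


def fake_gen_json(u, output_path):
    """Stand-in for gen_json that fails for one made-up file name."""
    if u.endswith("bad.nc4"):
        raise AttributeError("simulated failure")


results = [run_safely(fake_gen_json, u, "/tmp/jsons") for u in ["a.nc4", "bad.nc4"]]
failed = [(u, err) for u, err in results if err is not None]
print(failed)
```

With Dask, the same wrapper could be applied as `dask.compute(*[dask.delayed(run_safely)(gen_json, u, output_path) for u in urls])`, and the returned tuples inspected afterwards.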