# Generate metadata reference files from MERRA-2 netCDFs in S3 storage

## This notebook provides:

1) S3 location of select hourly MERRA-2 files (March 2019)
2) Method and code for generating metadata reference files (json) using the kerchunk library
3) Recommended storage location for generated metadata reference jsons
4) Code for combining many metadata reference jsons into a single json file

### Import modules

In [1]:
import requests
import xarray as xr
import s3fs
import pathlib
import ujson
import h5py
import fsspec
from glob import glob
from tqdm import tqdm

from kerchunk.hdf import SingleHdf5ToZarr 
from kerchunk.combine import MultiZarrToZarr
#from fsspec_reference_maker.combine import MultiZarrToZarr

# The xarray produced from the reference file throws a SerializationWarning for each variable. Will need to explore why
import warnings
warnings.simplefilter("ignore")

### Get authentication and set up file system

In [2]:
gesdisc_s3 = "https://data.gesdisc.earthdata.nasa.gov/s3credentials"

response = requests.get(gesdisc_s3).json()
fs = s3fs.S3FileSystem(key=response['accessKeyId'],
                    secret=response['secretAccessKey'],
                    token=response['sessionToken'],
                    client_kwargs={'region_name':'us-west-2'})
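
The cell above reads three fields from the credential response and passes them to `s3fs.S3FileSystem`. As a minimal sketch, assuming the response is a plain dictionary with those three keys, a small helper can check that they are present before the file system is built. The helper name `s3_kwargs_from_credentials` is hypothetical and not part of the original notebook, and the example values are placeholders rather than real credentials.

```python
def s3_kwargs_from_credentials(creds, region="us-west-2"):
    """Map a credential response onto S3FileSystem keyword arguments.

    Hypothetical helper: raises KeyError naming any missing field
    instead of failing later inside s3fs.
    """
    required = ("accessKeyId", "secretAccessKey", "sessionToken")
    missing = [k for k in required if k not in creds]
    if missing:
        raise KeyError(f"credential response is missing: {missing}")
    return {
        "key": creds["accessKeyId"],
        "secret": creds["secretAccessKey"],
        "token": creds["sessionToken"],
        "client_kwargs": {"region_name": region},
    }


# Example with placeholder values (not real credentials)
example = s3_kwargs_from_credentials(
    {"accessKeyId": "EXAMPLE-ID", "secretAccessKey": "secret", "sessionToken": "token"}
)
```

With this helper, the file system could be created as `fs = s3fs.S3FileSystem(**s3_kwargs_from_credentials(response))`.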

### Get a list of URLs for a particular month of MERRA-2 data (March 2019)

In [3]:
urls = fs.ls("s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/03/")
#urls
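
`fs.ls` returns every object under the prefix. In the run shown in this notebook that was one netCDF file per day, but other directories may contain additional objects. A minimal sketch of a guard that keeps only `.nc4` paths and sorts them by name follows; the helper name `select_granules` and the `.xml` entry in the sample listing are hypothetical, used only for illustration.

```python
def select_granules(paths, suffix=".nc4"):
    """Return the paths ending with `suffix`, sorted by name (and so by date)."""
    return sorted(p for p in paths if p.endswith(suffix))


# Hand-made sample listing; the .xml entry is a made-up example of a non-data object
prefix = "gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/03/"
sample_listing = [
    prefix + "MERRA2_400.tavg1_2d_slv_Nx.20190302.nc4",
    prefix + "MERRA2_400.tavg1_2d_slv_Nx.20190301.nc4",
    prefix + "MERRA2_400.tavg1_2d_slv_Nx.20190301.nc4.xml",
]
selected = select_granules(sample_listing)
```

In the notebook this would be applied as `urls = select_granules(fs.ls(...))`.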

### Create Dask client to process the json files in parallel

We recommend using Dask to generate these metadata json files in parallel. Each file is processed independently of the others, so the work does not need to run in sequence.

In [4]:
import dask
from dask.distributed import Client
client = Client(n_workers=4)
client

Client
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Status: running, Using processes: True
Workers: 4, Total threads: 4, Total memory: 7.57 GiB
Each worker: Total threads: 1, Memory: 1.89 GiB

### Define a function for making metadata json files

This function uses the kerchunk library (originally named fsspec-reference-maker) to extract the metadata and chunk locations from a netCDF file stored in S3 and write them to a json reference file.

In [5]:
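def gen_json(u, output_path):
    """Write a kerchunk reference json for the netCDF file at S3 path `u` into `output_path`."""
    # Options passed to fs.open() when reading the remote file
    so = dict(
        mode="rb",
        anon=True,
        default_fill_cache=False,
        default_cache_type="none"
    )
    with fs.open(u, **so) as infile:
        # Scan the HDF5 structure; chunks smaller than 300 bytes are stored inline in the json
        h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
        # Name the output after the source file, e.g. <file name>.nc4.json
        with open(f"{output_path}/{u.split('/')[-1]}.json", 'wb') as outf:
            outf.write(ujson.dumps(h5chunks.translate()).encode())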
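
To make the contents of a reference file more concrete, the sketch below summarizes a reference dictionary. It assumes the version 1 kerchunk layout, in which a top-level `refs` mapping holds either inline strings (metadata and small chunks) or `[url, offset, length]` lists pointing at byte ranges in the original file. The helper name `summarize_refs` and the sample dictionary are hypothetical; the sample is hand-made and not taken from a real MERRA-2 reference.

```python
def summarize_refs(reference):
    """Count inline entries and byte-range entries in a kerchunk-style reference dict."""
    refs = reference.get("refs", reference)
    inline = sum(1 for v in refs.values() if isinstance(v, str))
    remote = sum(1 for v in refs.values() if isinstance(v, list))
    return {"inline": inline, "remote": remote}


# Hand-made sample with made-up bucket, offsets and lengths
sample = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "T2M/.zattrs": '{"units": "K"}',
        "T2M/0.0.0": ["s3://example-bucket/example.nc4", 1000, 2048],
    },
}
summary = summarize_refs(sample)
```

For a file written by `gen_json`, the dictionary could be loaded with `ujson.load(open(path))` and passed to the same function.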

### Create a directory for output jsons if one doesn't exist already

We recommend you create this directory OUTSIDE of this repository, in your own workspace.

You can create these directories from the Terminal:

```cd ~```

```mkdir data; cd data; mkdir jsons; cd jsons; pwd```

In [7]:
output_path = '/home/jovyan/data/jsons/' # Change this to your preferred location or the directory you just created
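
As an alternative to the Terminal commands, the directory can also be created from Python. This is a minimal sketch using `pathlib` from the standard library; the helper name `ensure_dir` is hypothetical, and the demonstration runs inside a temporary directory so that it does not touch your workspace.

```python
import pathlib
import tempfile


def ensure_dir(path):
    """Create `path` (and any missing parents) if needed and return it as a Path."""
    p = pathlib.Path(path).expanduser()
    p.mkdir(parents=True, exist_ok=True)
    return p


# Demonstration in a throwaway location
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_dir(pathlib.Path(tmp) / "data" / "jsons")
    created_ok = created.is_dir()
    # Calling it again is harmless because exist_ok=True
    ensure_dir(created)
```

In the notebook, `ensure_dir(output_path)` would create the output directory before the json files are written.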

### Use the Dask Delayed function to create the Kerchunk reference file (json) for each URL in the list of URLs in parallel

In [8]:
%%time
dask.compute(*[dask.delayed(gen_json)(u,output_path) for u in urls])

CPU times: user 19.3 s, sys: 10.5 s, total: 29.8 s
Wall time: 7min 39s


(None, None, ..., None)

The result is a tuple of 31 `None` values, one per input file, because `gen_json` writes its output to disk and returns nothing.
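
If reading one file raises an error, `dask.compute` raises it as well. A minimal sketch of a wrapper that records failures instead, so the remaining files are still processed, is shown below. The names `run_safely`, `failures` and `fake_gen_json` are hypothetical; `fake_gen_json` is a stand-in that lets the example run without S3 access.

```python
def run_safely(func, *args):
    """Call func(*args); return (args, None) on success or (args, message) on failure."""
    try:
        func(*args)
        return args, None
    except Exception as exc:
        return args, f"{type(exc).__name__}: {exc}"


def failures(results):
    """Keep only the entries whose call failed."""
    return [(args, err) for args, err in results if err is not None]


# Stand-in for gen_json that fails for one input
def fake_gen_json(u, output_path):
    if u.endswith("bad.nc4"):
        raise OSError("could not read file")


results = [run_safely(fake_gen_json, u, "/tmp/out") for u in ["a.nc4", "bad.nc4"]]
failed = failures(results)
```

With Dask, the same wrapper could be used as `dask.compute(*[dask.delayed(run_safely)(gen_json, u, output_path) for u in urls])`, followed by `failures(...)` on the returned tuple.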

### If successful, you should be able to see 31 metadata reference json files in your specified output directory

Check by using the Terminal command:

```ls ~/data/jsons/```
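
The same check can be sketched in Python by comparing the directory contents with the file names expected for the month. The naming pattern is taken from the listing in this notebook; the helper names `expected_names` and `missing_jsons` are hypothetical, and the demonstration uses a temporary directory with empty placeholder files.

```python
import calendar
import pathlib
import tempfile


def expected_names(year, month, prefix="MERRA2_400.tavg1_2d_slv_Nx"):
    """Build the reference file name expected for each day of the month."""
    ndays = calendar.monthrange(year, month)[1]
    return [f"{prefix}.{year}{month:02d}{day:02d}.nc4.json" for day in range(1, ndays + 1)]


def missing_jsons(directory, year, month):
    """Return the expected file names that are not present in `directory`."""
    present = {p.name for p in pathlib.Path(directory).expanduser().glob("*.json")}
    return [name for name in expected_names(year, month) if name not in present]


# Demonstration: create placeholders for every day except the last one
with tempfile.TemporaryDirectory() as tmp:
    for name in expected_names(2019, 3)[:-1]:
        (pathlib.Path(tmp) / name).write_text("{}")
    missing = missing_jsons(tmp, 2019, 3)
```

An empty list from `missing_jsons('~/data/jsons/', 2019, 3)` would indicate that all 31 files are present.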

In [9]:
! ls ~/data/jsons/

MERRA2_400.tavg1_2d_slv_Nx.20190301.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190302.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190303.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190304.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190305.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190306.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190307.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190308.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190309.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190310.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190311.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190312.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190313.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190314.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190315.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190316.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190317.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190318.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190319.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190320.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190321.nc4.json
MERRA2_400.tavg1_2d_slv_Nx.20190322.nc4.json
MERRA2_400

