# Generate metadata reference files from MERRA-2 netCDFs in S3 storage

## This notebook provides:

1) S3 location of hourly MERRA-2 files
2) Method and code for generating metadata reference files (json) using the kerchunk library
3) Recommended storage location for generated metadata reference jsons
4) Code for combining many metadata reference jsons into a single json file

### Import modules

If you're in the Openscapes 2i2c Hub, you may need to install and update a few packages in your Terminal first:


- ```conda install -c conda-forge kerchunk```
- ```conda update h5py```

**Note:** There are known issues with a method in h5py version 2.10.0, so update to a version later than 2.10.0.



In [1]:
import requests
import xarray as xr
import s3fs
import pathlib
import ujson
import h5py
import fsspec
from glob import glob
from tqdm import tqdm

from kerchunk.hdf import SingleHdf5ToZarr 
from kerchunk.combine import MultiZarrToZarr
#from fsspec_reference_maker.combine import MultiZarrToZarr

# The xarray produced from the reference file throws a SerializationWarning for each variable. Will need to explore why
import warnings
warnings.simplefilter("ignore")

### Get authentication and set up file system

In [2]:
gesdisc_s3 = "https://data.gesdisc.earthdata.nasa.gov/s3credentials"

response = requests.get(gesdisc_s3).json()
fs = s3fs.S3FileSystem(key=response['accessKeyId'],
                    secret=response['secretAccessKey'],
                    token=response['sessionToken'],
                    client_kwargs={'region_name':'us-west-2'})
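
The credentials returned by the endpoint are temporary, and a malformed response only surfaces later as a confusing `KeyError` inside the `S3FileSystem` call. The following is a minimal sketch, assuming the response carries the three fields used above; the helper name `s3_kwargs_from_credentials` is hypothetical and not part of s3fs.

```python
# Hypothetical helper (not part of s3fs): validate the credentials response
# before building the keyword arguments for s3fs.S3FileSystem.
REQUIRED_FIELDS = ("accessKeyId", "secretAccessKey", "sessionToken")


def s3_kwargs_from_credentials(creds, region="us-west-2"):
    """Map a credentials response dict onto S3FileSystem keyword arguments."""
    missing = [name for name in REQUIRED_FIELDS if name not in creds]
    if missing:
        # An error payload would lack these keys; fail early with a clear message.
        raise KeyError(f"credentials response is missing: {', '.join(missing)}")
    return {
        "key": creds["accessKeyId"],
        "secret": creds["secretAccessKey"],
        "token": creds["sessionToken"],
        "client_kwargs": {"region_name": region},
    }
```

With this in place, the file system could be created as `fs = s3fs.S3FileSystem(**s3_kwargs_from_credentials(response))`.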

### Get a list of URLs for a particular month of MERRA-2 data (March 2019)

In [None]:
urls = fs.ls("s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/03/")
#urls

### Create Dask client to process the json files in parallel

We recommend taking advantage of Dask parallelization to speed up the generation of these metadata json files. Each file is processed independently of the others, so the work does not need to be done in sequence.

In [None]:
import dask
from dask.distributed import Client
client = Client(n_workers=4)
client

### Define a function for making metadata json files

This uses methods from the kerchunk library (originally fsspec-reference-maker) to extract important pieces of metadata from the netCDF files stored in S3.

In [None]:
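def gen_json(u, output_path):
    """Write a kerchunk reference json for the netCDF file at S3 path `u`."""
    # Options passed to fs.open; caching is disabled for these reads
    so = dict(
        mode="rb",
        anon=True,
        default_fill_cache=False,
        default_cache_type="none"
    )
    with fs.open(u, **so) as infile:
        # Chunks smaller than inline_threshold bytes are stored directly in the json
        h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
        # The output file is named after the source file, with .json appended
        with open(f"{output_path}/{u.split('/')[-1]}.json", 'wb') as outf:
            outf.write(ujson.dumps(h5chunks.translate()).encode())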
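
To make the output of `gen_json` less of a black box, here is a small sketch that summarizes one reference file. It assumes the file follows the kerchunk version 1 layout, a top-level `"refs"` mapping whose values are either inline strings or `[url, offset, length]` lists; the function name `summarize_refs` is hypothetical.

```python
import json


def summarize_refs(path):
    """Count inline entries and byte-range chunk references in a reference json."""
    with open(path) as f:
        refs = json.load(f)["refs"]
    # List-valued entries point at a byte range inside the original file;
    # everything else (zarr metadata, small chunks) is stored inline.
    chunk_refs = [v for v in refs.values() if isinstance(v, list)]
    return {
        "entries": len(refs),
        "inline": len(refs) - len(chunk_refs),
        "chunk_refs": len(chunk_refs),
        "targets": sorted({v[0] for v in chunk_refs}),
    }
```

For a single MERRA-2 file, `targets` would be expected to contain exactly one path: the netCDF file the references were generated from.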

### Create a directory for output jsons if one doesn't exist already

We recommend you create this directory OUTSIDE of this repository, in your own workspace.

In [3]:
output_path = '/home/jovyan/data/jsons' # Change this to your preferred location
pathlib.Path(output_path).mkdir(parents=True, exist_ok=True) # parents=True also creates missing parent directories

### Use the Dask Delayed function to create the Kerchunk reference file (json) for each URL in the list of URLs in parallel

In [None]:
%%time
dask.compute(*[dask.delayed(gen_json)(u,output_path) for u in urls])
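
If one file fails (for example because the temporary credentials expired), the whole `dask.compute` call raises and it is hard to tell which URL was responsible. One option, sketched here with the hypothetical helpers `try_gen_json` and `failed`, is to wrap the per-file function so that each task reports its own outcome.

```python
def try_gen_json(gen, u, output_path):
    """Run gen(u, output_path); return (u, None) on success or (u, error text) on failure."""
    try:
        gen(u, output_path)
        return u, None
    except Exception as exc:  # broad on purpose: report any per-file failure
        return u, f"{type(exc).__name__}: {exc}"


def failed(results):
    """Keep only the (url, error) pairs for tasks that did not succeed."""
    return [(u, err) for u, err in results if err is not None]
```

It could be used as `results = dask.compute(*[dask.delayed(try_gen_json)(gen_json, u, output_path) for u in urls])`, followed by `failed(results)` to list the URLs that need to be retried.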

### If successful, you should be able to see 31 metadata reference json files in your specified output directory
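
A quick way to confirm the count is sketched below. It assumes one netCDF file per day, as in the March 2019 listing, and that the output directory holds only the reference files generated here; `check_reference_count` is a hypothetical helper name.

```python
import calendar
from pathlib import Path


def check_reference_count(output_path, year, month):
    """Return (expected number of daily files, sorted json file names found)."""
    # monthrange returns (weekday of the first day, number of days in the month)
    expected = calendar.monthrange(year, month)[1]
    found = sorted(p.name for p in Path(output_path).glob("*.json"))
    return expected, found
```

For example, `expected, found = check_reference_count(output_path, 2019, 3)` followed by comparing `len(found)` with `expected` shows whether any days are missing.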

### Define parameters for fsspec that include us-west-2 credentials

In [None]:
# s_opts = {'skip_instance_cache':True}
# r_opts = {'anon':False,
#           'key':response['accessKeyId'],
#           'secret':response['secretAccessKey'],
#           'token':response['sessionToken']}

### Now combine the metadata reference files you just generated into a single zarr store

Make sure to define the correct 'dim' concat arguments or else you will receive a very long, complicated error.

In [4]:
json_list = sorted(glob(output_path+'/*.json'))
mzz = MultiZarrToZarr(json_list,
                      remote_protocol="s3",
                      remote_options={'anon':True}, # authentication parameters
                      xarray_open_kwargs={
                          "decode_cf" : False,
                          "mask_and_scale" : False,
                          "decode_times" : False,
                          "decode_timedelta" : False,
                          "use_cftime" : False,
                          "decode_coords" : False
                        },
                        xarray_concat_args={
                            'data_vars' : 'minimal',
                            'coords' : 'minimal',
                            'compat' : 'override',
                            'join' : 'override', 
                            'combine_attrs' : 'override',
                            'dim' : 'time'
                        }
                    )

TypeError: __init__() got an unexpected keyword argument 'xarray_open_kwargs'
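
The `TypeError` above suggests that the installed kerchunk release no longer accepts the xarray-based keyword arguments. In more recent kerchunk releases, `MultiZarrToZarr` takes `concat_dims` and `identical_dims` instead. The following is a sketch under those assumptions, not a verified fix for this notebook: the helper names are hypothetical, and `lat` and `lon` are assumed to be the coordinates that are identical across the daily files.

```python
def combine_kwargs(concat_dim="time", identical_dims=("lat", "lon")):
    """Keyword arguments for MultiZarrToZarr in recent kerchunk releases."""
    return {
        "remote_protocol": "s3",
        "remote_options": {"anon": True},  # copied from the cell above
        "concat_dims": [concat_dim],       # dimension to concatenate along
        "identical_dims": list(identical_dims),  # assumed identical in every file
    }


def combine_references(json_list, **overrides):
    """Combine per-file reference jsons into one reference dict."""
    # Imported lazily so combine_kwargs can be inspected without kerchunk installed.
    from kerchunk.combine import MultiZarrToZarr

    kwargs = {**combine_kwargs(), **overrides}
    return MultiZarrToZarr(json_list, **kwargs).translate()
```

Calling `combined = combine_references(json_list)` would return a single dictionary of references, which can be written to disk with `ujson.dumps(combined)` in the same way as the per-file jsons. If anonymous access is rejected, the temporary credentials can be passed through the `remote_options` override.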

In [None]:
json_list