# MUR SST benchmark tests using consolidated metadata versus individual netCDF files

NASA JPL PODAAC has put the entire [MUR SST](https://podaac.jpl.nasa.gov/dataset/MUR-JPL-L4-GLOB-v4.1) dataset on AWS cloud as individual netCDF files, **but there are ~7000 of them.** Accessing one file works well, but accessing multiple files is **very slow** because the metadata for each file has to be queried. Here, we create **fast access** by consolidating the metadata into a single file and then accessing the entire dataset rapidly via zarr. More background on this project is in this
[medium article](https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685) and in this [repo](https://github.com/lsterzinger/fsspec-reference-maker-tutorial). We need help developing documentation and more test datasets. If you want to help, we are working in the [Pangeo Gitter](https://gitter.im/pangeo-data/cloud-performant-netcdf4).

To run this code:
- you need to set your AWS credentials up using `aws configure --profile esip-qhub`
- you need to set up your `.netrc` file in your home directory with your earthdata login info
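
The `.netrc` entry that the login code below looks up can be sketched as follows. This is a minimal example, assuming the standard `machine ... login ... password ...` layout; `my_username` and `my_password` are placeholders, and the file is written to a temporary directory here rather than to your home directory.

```python
import netrc
import os
import tempfile

# Placeholder credentials (hypothetical); replace with your own Earthdata login.
entry = "machine urs.earthdata.nasa.gov login my_username password my_password\n"

# Write the entry to a temporary file so the real ~/.netrc is left untouched.
tmp_dir = tempfile.mkdtemp()
netrc_path = os.path.join(tmp_dir, ".netrc")
with open(netrc_path, "w") as f:
    f.write(entry)
os.chmod(netrc_path, 0o600)  # keep the file readable by the owner only

# Parse it the same way the notebook does, but from an explicit path.
login, _, password = netrc.netrc(netrc_path).authenticators("urs.earthdata.nasa.gov")
print(login)
```

If the lookup returns `None` instead of a tuple, the machine name in the file does not match the endpoint being queried.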

Authors:
- [Chelle Gentemann](https://github.com/cgentemann)
- [Rich Signell](https://github.com/rsignell-usgs)
- [Lucas Sterzinger](https://github.com/lsterzinger/)
- [Martin Durant](https://github.com/martindurant)

Credit:
- Funding: Interagency Implementation and Advanced Concepts Team [IMPACT](https://earthdata.nasa.gov/esds/impact) for the Earth Science Data Systems (ESDS) program and AWS Public Dataset Program
- AWS Credit Program
- ESIP Hub

## Summary of results

| Test | Consolidated-metadata | Single netCDF | Improvement |
| :---- | :----: |:----: |:----:|
| Access entire dataset |  1:41 min | 60 min*  | 36x |
| Plot 1 year at point  | 14 sec |  4:12 min | 18x |
| Plot 1 day            |  50 sec  |  1:07 min    | 1.4x |

*Extrapolated from 3:16 min for 1 year to 18 years of data
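
The extrapolated time and the improvement column follow from simple arithmetic on the timings in the table. A small sketch (the helper `seconds` is introduced here only for illustration; because the tabulated timings are rounded, recomputed ratios can differ slightly from the table):

```python
def seconds(minutes=0, secs=0):
    """Convert a minutes:seconds timing to seconds."""
    return 60 * minutes + secs

# Opening 1 year of netCDF files took 3:16 min; scale linearly to 18 years.
extrapolated_minutes = seconds(3, 16) * 18 / 60

# Improvement = individual-netCDF time / consolidated-metadata time.
improvement = {
    "Access entire dataset": seconds(60) / seconds(1, 41),
    "Plot 1 year at point": seconds(4, 12) / seconds(0, 14),
    "Plot 1 day": seconds(1, 7) / seconds(0, 50),
}

print(f"extrapolated: {extrapolated_minutes:.1f} min")
for test, ratio in improvement.items():
    print(f"{test}: {ratio:.1f}x")
```

The linear scaling is itself an assumption: it treats the cost of querying each file's metadata as constant across years.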

Accessing the entire dataset is substantially faster because we have already consolidated the metadata into a single file.\
Accessing 1 year of data is also substantially faster because our single metadata file can point to exactly where the data is rather than first accessing the metadata for each day, then finding the data.\
Accessing 1 day is roughly the same amount of time because both are just calling metadata for a single file.
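
The idea behind the single metadata file can be illustrated without any cloud access. The sketch below is not the actual reference file used in this notebook: it builds a small local binary file and a dictionary that maps hypothetical chunk keys to `[path, offset, length]` entries, following the byte-range layout that fsspec reference files use for each chunk. Reading a chunk then needs only one seek and one read, with no parsing of the file's own header.

```python
import os
import tempfile

# Build a stand-in "data file": a header followed by two chunks of bytes.
header = b"HEADER-METADATA-"
chunk_a = b"chunk for day 1"
chunk_b = b"chunk for day 2"
path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(path, "wb") as f:
    f.write(header + chunk_a + chunk_b)

# Hypothetical reference table: chunk key -> [path, byte offset, byte length].
refs = {
    "sst/0.0.0": [path, len(header), len(chunk_a)],
    "sst/1.0.0": [path, len(header) + len(chunk_a), len(chunk_b)],
}


def read_chunk(key):
    """Fetch one chunk using only the reference table (no header parsing)."""
    target, offset, length = refs[key]
    with open(target, "rb") as f:
        f.seek(offset)
        return f.read(length)


print(read_chunk("sst/1.0.0"))
```

On object storage the same lookup would, in principle, become one ranged read per chunk instead of a header query for every file.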


| Test | Consolidated-metadata | Zarr-v1 | Zarr |
| :---- | :----: |:----: |:----:|
| Access entire dataset |  1:41 min | 5 sec  | 1 sec |
| Plot 1 year at point  | 14 sec |  6 sec | 3 sec |
| Plot 1 day            |  50 sec  |  55 sec    | a long time |

This compares the consolidated metadata access with two Zarr versions of MUR SST currently stored on AWS.
Zarr-v1 was rechunked for general use, balancing time/space. Access is faster for the time series analysis because there are fewer chunks to read in the Zarr store than in the netCDF files. Access is slightly slower to image the globe for a single day because each day is stored in 1 netCDF file, while Zarr-v1 has to access 360x180 files.
Zarr was rechunked for time series analysis. Access is faster for the time series analysis because there are fewer chunks to read in the Zarr store than in the netCDF files. Access is substantially slower to image the globe for a single day because, with the chunking set this way, accessing a single day requires reading in the entire 16 TB dataset.
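
The effect of chunking on the time series test can be estimated by counting how many chunks a selection overlaps. This is a rough sketch: the chunk shapes are the ones quoted for the two stores later in this notebook, the lat/lon grid size of 17999 x 36000 is an assumption, and the index ranges are illustrative rather than read from the stores.

```python
import math


def chunks_spanned(start, stop, chunk):
    """Chunks of length `chunk` overlapped by the half-open index range [start, stop)."""
    return (stop - 1) // chunk - start // chunk + 1


def chunks_touched(selection, chunk_shape):
    """Total chunks overlapped by a (time, lat, lon) selection of index ranges."""
    return math.prod(
        chunks_spanned(start, stop, chunk)
        for (start, stop), chunk in zip(selection, chunk_shape)
    )


ZARR_V1 = (5, 1799, 3600)    # chunked for general use
ZARR_TS = (6443, 100, 100)   # chunked for time series
N_LAT, N_LON = 17999, 36000  # assumed grid size

# One year (365 daily steps) at a single grid point.
point_year = ((0, 365), (9000, 9001), (18000, 18001))
# One day over the whole globe.
global_day = ((0, 1), (0, N_LAT), (0, N_LON))

print(chunks_touched(point_year, ZARR_V1))  # time series, zarr-v1
print(chunks_touched(point_year, ZARR_TS))  # time series, zarr
print(chunks_touched(global_day, ZARR_TS))  # single-day map, zarr
```

For comparison, the individual netCDF files hold one day each, so the same one-year time series touches 365 files. In the time series store every chunk spans the full time axis, so the single-day map has to read all of them.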

In [None]:
import netrc
from datetime import datetime
from http.cookiejar import CookieJar
from urllib import request

import fsspec
import hvplot.xarray
import requests
import s3fs
import xarray as xr

In [None]:
# Set up Earthdata Login credentials.
# This code is from https://github.com/podaac/tutorials/blob/master/notebooks/cloudwebinar/cloud_direct_access_s3.py
def setup_earthdata_login_auth(endpoint):
    """
    Set up the request library so that it authenticates against the given Earthdata Login
    endpoint and is able to track cookies between requests.  This looks in the .netrc file
    for credentials; if none are found, it prints a message and returns without
    installing an opener.
    Valid endpoints:
        urs.earthdata.nasa.gov - Earthdata Login production
    """
    try:
        username, _, password = netrc.netrc().authenticators(endpoint)
    except (FileNotFoundError, TypeError):
        # FileNotFoundError = there's no .netrc file
        # TypeError = the endpoint isn't in the .netrc file, causing the above to try unpacking None
        print("There's no .netrc file or the endpoint isn't in the .netrc file")
        # username and password are undefined at this point, so stop here
        return

    manager = request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, endpoint, username, password)
    auth = request.HTTPBasicAuthHandler(manager)

    jar = CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(auth, processor)
    request.install_opener(opener)


###############################################################################
edl = "urs.earthdata.nasa.gov"
setup_earthdata_login_auth(edl)


def begin_s3_direct_access():
    url = "https://archive.podaac.earthdata.nasa.gov/s3credentials"
    response = requests.get(url).json()
    return s3fs.S3FileSystem(
        key=response["accessKeyId"],
        secret=response["secretAccessKey"],
        token=response["sessionToken"],
        client_kwargs={"region_name": "us-west-2"},
    )

In [None]:
url = "https://archive.podaac.earthdata.nasa.gov/s3credentials"
response = requests.get(url).json()

- Consolidated metadata test (with `simple_templates=True`)

In [None]:
%%time
json_consolidated = "s3://esip-qhub-public/nasa/mur/murv41_consolidated_20211011.json"

s_opts = {"requester_pays": True, "skip_instance_cache": True}
r_opts = {
    "key": response["accessKeyId"],
    "secret": response["secretAccessKey"],
    "token": response["sessionToken"],
    "client_kwargs": {"region_name": "us-west-2"},
}

fs = fsspec.filesystem(
    "reference",
    fo=json_consolidated,
    ref_storage_args=s_opts,
    remote_protocol="s3",
    remote_options=r_opts,
    simple_templates=True,
)
m = fs.get_mapper("")
ds = xr.open_dataset(m, decode_times=False, engine="zarr", consolidated=False)
ds.close()
ds

In [None]:
%%time
# test getting a random value
ds["analysed_sst"].sel(time="2005-12-20", lat=0, lon=0, method="nearest")

In [None]:
%%time
# test extracting and plotting a time series at a single point
ts = ds["analysed_sst"].sel(
    lat=0.01, lon=0.01, time=slice("2005-01-01T09", "2006-01-01T09")
)
ts.plot()

In [None]:
%%time
now = datetime.now()
dy = ds["analysed_sst"].sel(time="2005-01-01", method="nearest")
dy.hvplot.quadmesh(x="lon", y="lat", geo=True, rasterize=True, cmap="turbo")

In [None]:
then = datetime.now()
print(then - now)

## Benchmark tests for the PODAAC cloud MUR SST, individual netcdf files

In [None]:
%%time
fs = begin_s3_direct_access()
# List the 2005 daily files. S3 keys always use "/", so the pattern is a plain string.
files = fs.glob("podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/2005*.nc")
ds2 = xr.open_mfdataset(
    paths=[fs.open(f) for f in files],
    combine="by_coords",
    mask_and_scale=True,
    decode_cf=True,
    chunks={"time": 1},  # one dask chunk per daily time step
)
ds2.close()

In [None]:
%%time
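# test extracting a 1-year time series at a single point
# Selecting the full year in one call ran into memory issues and timed out, so it is loaded one month at a time:
# ts = ds2['analysed_sst'].sel(lat=0.01,lon=0.01,time=slice('2005-01-01T09','2006-01-01T09'))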
tem2 = []
for imon in range(12):
    tem = (
        ds2["analysed_sst"]
        .sel(lat=0.01, lon=0.01, time="2005-" + str(imon + 1).zfill(2))
        .load()
    )
    tem2.append(tem)
    # print(imon+1)
ts = xr.concat(tem2, dim="time")

In [None]:
%%time
ts.plot()

In [None]:
%%time
now = datetime.now()
dy = ds2["analysed_sst"].sel(time="2005-01-01")
dy.hvplot.quadmesh(x="lon", y="lat", geo=True, rasterize=True, cmap="turbo")

In [None]:
then = datetime.now()
print(then - now)

## Benchmark tests for the AWS cloud [MUR SST](https://registry.opendata.aws/mur/), chunked for general use
- MUR Level 4 SST dataset in Zarr format. The zarr-v1/ directory contains a zarr store chunked (5, 1799, 3600) along the dimensions (time, lat, lon).

In [None]:
import warnings

import fsspec
import hvplot.xarray
import numpy as np
import pandas as pd
import xarray as xr

warnings.simplefilter("ignore")  # filter some warning messages
xr.set_options(display_style="html")  # display dataset nicely

In [None]:
%%time
ds_sst = xr.open_zarr(
    "https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1", consolidated=True
)
ds_sst

In [None]:
%%time
ts = ds_sst["analysed_sst"].sel(
    lat=0.01, lon=0.01, time=slice("2005-01-01T09", "2006-01-01T09")
)
ts.plot()

In [None]:
%%time
now = datetime.now()
dy = ds_sst["analysed_sst"].sel(time="2005-01-01")
dy.hvplot.quadmesh(x="lon", y="lat", geo=True, rasterize=True, cmap="turbo")

In [None]:
then = datetime.now()
print(then - now)

## Benchmark tests for the AWS cloud [MUR SST](https://registry.opendata.aws/mur/), chunked for time series
- MUR Level 4 SST dataset in Zarr format. The zarr/ directory contains a zarr store chunked (6443, 100, 100) along the dimensions (time, lat, lon).
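
The chunk shapes of the two stores translate into uncompressed chunk sizes as sketched below. The 2-byte element size is an assumption (it presumes the SST values are stored as 16-bit integers); inspect `ds_sst["analysed_sst"].encoding` after opening the store to check the actual dtype and compression.

```python
import math

BYTES_PER_VALUE = 2  # assumption: 16-bit integers

chunk_shapes = {
    "zarr-v1 (general use)": (5, 1799, 3600),
    "zarr (time series)": (6443, 100, 100),
}

# Uncompressed size of one chunk, in megabytes (1 MB = 1e6 bytes).
chunk_megabytes = {
    name: math.prod(shape) * BYTES_PER_VALUE / 1e6
    for name, shape in chunk_shapes.items()
}

for name, size in chunk_megabytes.items():
    print(f"{name}: {size:.1f} MB per uncompressed chunk")
```

Under that assumption, a point time series from this store pulls one chunk containing every time step for a 100 x 100 patch, which is why it is fast for time series and costly for maps.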

In [None]:
%%time
ds_sst = xr.open_zarr(
    "https://mur-sst.s3.us-west-2.amazonaws.com/zarr", consolidated=True
)
ds_sst

In [None]:
%%time
ts = ds_sst["analysed_sst"].sel(
    lat=0.01, lon=0.01, time=slice("2005-01-01T09", "2006-01-01T09")
)
ts.plot()

In [None]:
# Disabled: with this chunking, plotting a single day requires reading every chunk in the store.
#%%time
# now = datetime.now()
# dy = ds_sst["analysed_sst"].sel(time="2005-01-01")
# dy.hvplot.quadmesh(x='lon', y='lat', geo=True, rasterize=True, cmap='turbo' )

In [None]:
# then = datetime.now()
# print(then-now)