# Using CMR to View Cloud-Hosted Datasets
### Author: Chris Battisto
### Date Authored: 1-31-22

### Timing

Exercise: 15 minutes

<p></p>

<div style="background:#fc9090;border:1px solid #cccccc;padding:5px 10px;"><big><b>Note:  </b>This notebook <em><strong>will only run in an environment with <a href="https://disc.gsfc.nasa.gov/information/glossary?keywords=%22earthdata%20cloud%22&amp;title=AWS%20region">us-west-2 AWS access</a></strong></em>.</big></div>

### Overview

This notebook demonstrates how to access cloud-hosted GES DISC granules using the [Common Metadata Repository (CMR) API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html).

### Prerequisites

This notebook was written using Python 3.8 and requires the following libraries, files, and permissions:
- requests
- xarray
- s3fs
- A netrc file with valid Earthdata Login credentials
- Approval to access the GES DISC archives with your Earthdata credentials (https://disc.gsfc.nasa.gov/earthdata-login)


### Import Libraries

In [None]:
import requests
import xarray as xr
import s3fs
import pprint

### Create a Function for CMR Catalog Requests

In [None]:
#
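
A minimal sketch of such a helper follows. The function name `cmr_request`, its parameters, and the 30-second timeout are illustrative assumptions, not the notebook's original code. The optional `get` argument exists only so the helper can be exercised without network access.

```python
CMR_SEARCH_URL = "https://cmr.earthdata.nasa.gov/search"

def cmr_request(params, endpoint="collections", get=None):
    """Send a GET request to a CMR search endpoint and return the response.

    params   -- dict of CMR search parameters
    endpoint -- "collections" or "granules"
    get      -- callable with the same signature as requests.get (assumption:
                added here so a stand-in can be injected for offline testing)
    """
    if get is None:
        # Imported lazily so the helper can be tested without the library installed
        import requests
        get = requests.get
    # Ask for the JSON representation of the search results
    url = f"{CMR_SEARCH_URL}/{endpoint}.json"
    response = get(url, params=params, timeout=30)
    # Raise an exception on 4xx/5xx responses instead of failing later
    response.raise_for_status()
    return response

# Example use (needs network access):
# response = cmr_request({"provider": "GES_DISC", "cloud_hosted": "true"})
```

Returning the full response object, not only the parsed JSON, keeps the status code and headers available to later cells.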

### Search CMR Catalogs and Obtain Data URLs

First, check that the CMR catalog can be accessed:

In [None]:
#

Let's see how many cloud-hosted data collections are currently in the GES DISC CMR catalog:

In [None]:
#
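
CMR reports the total number of matching records in the `CMR-Hits` response header, so the count can be read without paging through every collection. The sketch below assumes a `requests`-style response object. The function name `count_hits` and the provider ID `GES_DISC` are assumptions to confirm against the CMR documentation.

```python
# Search parameters: cloud-hosted collections from the GES DISC provider
# (the provider ID is an assumption; confirm it against the CMR provider list)
cloud_params = {"provider": "GES_DISC", "cloud_hosted": "true"}

def count_hits(response):
    """Return the total number of records matching a CMR search."""
    # The CMR-Hits header holds the total match count as a string
    return int(response.headers["CMR-Hits"])

# Example use with the requests library imported above (needs network access):
# response = requests.get("https://cmr.earthdata.nasa.gov/search/collections.json",
#                         params=cloud_params)
# print(count_hits(response))
```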

Here are the current GES DISC datasets available in the cloud as of March 2022:

In [None]:
#
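
One way to list the datasets is to pull the short name and version out of each entry in the JSON response. This is a sketch under stated assumptions: `list_collections` is a hypothetical helper, and the sample response is a hand-written, heavily trimmed stand-in for real CMR output. CMR returns only 10 records per page by default, so a real query would likely also set `page_size`.

```python
def list_collections(cmr_json):
    """Return sorted (short_name, version_id) pairs from a CMR collections JSON response."""
    # Search results live under feed -> entry; fall back to an empty list if absent
    entries = cmr_json.get("feed", {}).get("entry", [])
    return sorted((entry["short_name"], entry["version_id"]) for entry in entries)

# Hypothetical trimmed response used only for illustration
sample = {"feed": {"entry": [
    {"short_name": "M2T1NXSLV", "version_id": "5.12.4"},
    {"short_name": "GPM_3IMERGHH", "version_id": "06"},
]}}

for short_name, version in list_collections(sample):
    print(f"{short_name} v{version}")
```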

Once we know which datasets are cloud-hosted, we can obtain the S3 URLs of individual granules by querying https://cmr.earthdata.nasa.gov/search/granules. The JSON response for a granule contains both its new OPeNDAP link and its S3 links. Here, we will parse out an S3 link for the AQUA AIRS IR + MW Level 2 CLIMCAPS dataset:

In [None]:
#
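
The granule query can be sketched as follows. The short name and version in `granule_params` are assumptions for the CLIMCAPS collection and should be checked against the catalog listing. `granule_hrefs` is a hypothetical helper, and the sample response contains placeholder links, not real ones.

```python
import pprint

# Granule search parameters (short name and version are assumptions; verify them
# against the collection listing before use)
granule_params = {"short_name": "SNDRAQIML2CCPRET", "version": "2", "page_size": 1}

def granule_hrefs(cmr_json):
    """Return every link href attached to the first granule in a CMR granules JSON response."""
    entries = cmr_json.get("feed", {}).get("entry", [])
    if not entries:
        return []
    # Each granule entry carries a list of link objects; keep only their URLs
    return [link["href"] for link in entries[0].get("links", []) if "href" in link]

# Hypothetical trimmed response; both hrefs are placeholders
sample = {"feed": {"entry": [{"links": [
    {"href": "https://example.invalid/opendap/granule.nc"},
    {"href": "s3://example-bucket/granule.nc"},
]}]}}

pprint.pprint(granule_hrefs(sample))
```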

Now we can parse out that link and assign it to a variable:

In [None]:
#
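
A simple way to pick out the S3 link is to filter the granule's links by URL scheme. This sketch assumes the links are already collected in a list of strings. `first_s3_link` is a hypothetical helper and the URLs are placeholders. A real granule may list several `s3://` entries, in which case a stricter filter (for example on file extension) may be needed.

```python
def first_s3_link(hrefs):
    """Return the first href that uses the s3:// scheme, or None if there is none."""
    return next((href for href in hrefs if href.startswith("s3://")), None)

# Placeholder links standing in for the hrefs of one granule
hrefs = [
    "https://example.invalid/opendap/granule.nc",
    "s3://example-bucket/granule.nc",
]

s3_link = first_s3_link(hrefs)
print("S3 Link:", s3_link)
```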

### ***Alternate Link Generation Method:***

For datasets that do not have their S3 links posted, the on-prem parent URL can be manually switched to its S3 equivalent using Python's <code>str.replace</code> method (for example, change <code>https://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/</code> to <code>s3://gesdisc-cumulus-prod-protected/MERRA2/</code>). Remember that datasets like GPM IMERG may have different file organization structures, so it is recommended to use the GES DISC subsetting tool, CMR, or Earthdata Search to generate their links.

In [None]:
# Paste the link generated by the GES DISC subsetter

merra_opendap_link = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/2013/05/MERRA2_400.tavg1_2d_slv_Nx.20130531.nc4'
print('OPeNDAP Link:', merra_opendap_link)

# Manually replace the on-prem server prefix with the S3 bucket prefix
merra_s3_link = merra_opendap_link.replace('https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/',
                                           's3://gesdisc-cumulus-prod-protected/')

print('S3 Link:', merra_s3_link)


#### Now that our S3 link has been obtained, we can generate our token, mount the GES DISC S3 bucket, and open our granule.

### Obtain S3 credentials and Open Bucket Granules

Remember that obtaining the credential token requires a previously generated netrc file, and that the token lasts only one hour before it must be regenerated.

In [None]:
gesdisc_s3 = "https://data.gesdisc.earthdata.nasa.gov/s3credentials"

# Define a function that requests temporary S3 credentials and returns a file system object

def begin_s3_direct_access(url: str = gesdisc_s3):
    # requests reads the Earthdata Login credentials from the netrc file
    response = requests.get(url)
    # Stop with an HTTP error if the credentials request returned an error status
    response.raise_for_status()
    credentials = response.json()
    return s3fs.S3FileSystem(key=credentials['accessKeyId'],
                             secret=credentials['secretAccessKey'],
                             token=credentials['sessionToken'],
                             client_kwargs={'region_name': 'us-west-2'})

fs = begin_s3_direct_access()

# Check that fs is an S3FileSystem object, which means the credentials request succeeded
# Common causes of rejected S3 access tokens include an incorrect password stored in the netrc file, or a missing netrc file
type(fs)
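
Because the credentials expire after one hour, a longer session needs to rebuild the file system object from time to time. The following is a minimal sketch of one way to do that. The class name `RefreshingFileSystem`, the 55-minute refresh window, and the injectable `clock` are all assumptions for illustration. The `factory` argument would be `begin_s3_direct_access` in this notebook.

```python
import time

class RefreshingFileSystem:
    """Cache a file system object and rebuild it once its credentials are too old."""

    def __init__(self, factory, max_age_seconds=55 * 60, clock=time.monotonic):
        self._factory = factory          # callable that returns a fresh file system
        self._max_age = max_age_seconds  # refresh a little before the one-hour limit
        self._clock = clock              # injectable time source, useful for testing
        self._fs = None
        self._created = None

    def get(self):
        """Return the cached file system, rebuilding it first if it has expired."""
        now = self._clock()
        if self._fs is None or now - self._created >= self._max_age:
            self._fs = self._factory()
            self._created = now
        return self._fs

# Example use in this notebook (needs network access and a netrc file):
# cached = RefreshingFileSystem(begin_s3_direct_access)
# fs = cached.get()
```

Calling `get()` each time a granule is opened keeps the credentials fresh without requesting a new token on every access.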

Finally, we can open the CLIMCAPS granule in Xarray:

In [None]:
#
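
Opening the granule can be sketched as below, following the same `xr.open_dataset(fs.open(...))` pattern this notebook uses for MERRA-2. The helper name `open_s3_granule` and the variable `s3_link` are hypothetical. The optional `open_dataset` argument exists only so the helper can be exercised without xarray or S3 access.

```python
def open_s3_granule(fs, s3_url, open_dataset=None, **kwargs):
    """Open one S3-hosted granule as an xarray Dataset.

    fs           -- an S3FileSystem-like object with an open() method
    s3_url       -- the s3:// URL of the granule
    open_dataset -- callable like xarray.open_dataset (assumption: added so a
                    stand-in can be injected for offline testing)
    kwargs       -- extra keyword arguments passed through to open_dataset
    """
    if open_dataset is None:
        # Imported lazily so the helper can be tested without xarray installed
        import xarray as xr
        open_dataset = xr.open_dataset
    # fs.open returns a file-like object that is handed straight to xarray
    return open_dataset(fs.open(s3_url), **kwargs)

# Example use, assuming fs comes from begin_s3_direct_access() and s3_link
# holds the CLIMCAPS S3 URL (needs us-west-2 access):
# ds = open_s3_granule(fs, s3_link)
# ds
```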

### Additional Exercise: Compare On-Prem and S3 Granules

Xarray's <code>equals()</code> method compares any two Xarray data objects. Here, it checks whether the on-prem and S3 granules contain identical data:

In [None]:
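# Requires merra_opendap_link and merra_s3_link from the Alternate Link Generation Method cell
ds_merra_on_prem = xr.open_dataset(merra_opendap_link)
ds_merra_s3 = xr.open_dataset(fs.open(merra_s3_link))

# Always use equals() for checking if Xarray datasets are identical
if ds_merra_s3.equals(ds_merra_on_prem):
    print('The on-prem and S3 datasets are equal and intact')
else:
    print('The on-prem and S3 datasets differ')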