# Use Case: Perform in-cloud analysis with MERRA-2 and GPM IMERG
### Originator: Brian Mapes
### Use Case Table: https://docs.google.com/document/d/1K4N_qJs2ru2zpaqiBB4k64LJhFzrSTXZqpwF1s5IcWY/edit#
### GitHub Repository (Mapes): https://github.com/brianmapes/VaporLakes/blob/main/TrackLakesBack_GeoPandas.py

  
  
  
### Author: Alexis Hunzinger
### Date Modified: 4/22/22

## Case 1: Using a JupyterHub in AWS us-west-2
Follow these steps if you are starting from a JupyterHub running in AWS us-west-2 (e.g. Openscapes 2i2c, GES DISC SMCE).

1. Earthdata Login authentication
2. Temporary S3 credential
3. Identify S3 bucket link
4. Direct S3 access of found files
5. Extract desired variables

### 0. Import libraries

In [1]:
from netrc import netrc
from subprocess import Popen
from platform import system
from getpass import getpass
from pprint import pprint
from glob import glob
import os
import requests
import xarray as xr
import s3fs
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt
%matplotlib inline

### 1. Earthdata Login authentication (**SKIP THIS IF NETRC ALREADY EXISTS**)

In [2]:
urs = 'urs.earthdata.nasa.gov'    # Earthdata URL endpoint for authentication
prompts = ['Enter NASA Earthdata Login Username: ',
           'Enter NASA Earthdata Login Password: ']

netrc_name = ".netrc"

# Determine if netrc file exists, and if so, if it includes NASA Earthdata Login Credentials
try:
    netrcDir = os.path.expanduser(f"~/{netrc_name}")
    netrc(netrcDir).authenticators(urs)[0]

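# If the netrc file is missing (FileNotFoundError) or has no Earthdata entry (TypeError),
# create or extend it and prompt the user for the NASA Earthdata Login username and password
except (FileNotFoundError, TypeError):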
    homeDir = os.path.expanduser("~")
    Popen('touch {0}{2} | echo machine {1} >> {0}{2}'.format(homeDir + os.sep, urs, netrc_name), shell=True)
    Popen('echo login {} >> {}{}'.format(getpass(prompt=prompts[0]), homeDir + os.sep, netrc_name), shell=True)
    Popen('echo \'password {} \'>> {}{}'.format(getpass(prompt=prompts[1]), homeDir + os.sep, netrc_name), shell=True)
    # Set restrictive permissions
    Popen('chmod 0600 {0}{1}'.format(homeDir + os.sep, netrc_name), shell=True)
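
The shell commands above can also be written in plain Python, which avoids launching several shell processes that are not waited on. The following is a minimal sketch, not the notebook's original method: `append_netrc_entry` is a hypothetical helper, and the credentials are placeholders written to a temporary file rather than to `~/.netrc`.

```python
import os
import stat
import tempfile
from netrc import netrc


def append_netrc_entry(path, machine, login, password):
    """Append one machine entry to a netrc file and restrict it to the owner."""
    # Append mode preserves any entries that are already in the file.
    with open(path, "a") as handle:
        handle.write(f"machine {machine} login {login} password {password}\n")
    # 0600: read/write for the owner only.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# Demonstrate on a throwaway file with placeholder credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".netrc")
    append_netrc_entry(demo_path, "urs.earthdata.nasa.gov", "demo_user", "demo_pass")
    login, _, password = netrc(demo_path).authenticators("urs.earthdata.nasa.gov")
```

In the notebook, the same helper could be called with `os.path.expanduser("~/.netrc")` and the values returned by `getpass`.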

### 2. Temporary S3 credential

In [3]:
gesdisc_s3 = "https://data.gesdisc.earthdata.nasa.gov/s3credentials"

# Define a function for S3 access credentials

def begin_s3_direct_access(url: str=gesdisc_s3):
    response = requests.get(url).json()
    return s3fs.S3FileSystem(key=response['accessKeyId'],
                             secret=response['secretAccessKey'],
                             token=response['sessionToken'],
                             client_kwargs={'region_name':'us-west-2'})

fs = begin_s3_direct_access()

# Check that the file system is intact as an S3FileSystem object, which means that token is valid
# Common causes of rejected S3 access tokens include incorrect passwords stored in the netrc file, or a non-existent netrc file
type(fs)

s3fs.core.S3FileSystem
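
When the login fails, the error usually appears inside `begin_s3_direct_access` as a JSON decoding error or a `KeyError`, which can be hard to read. A minimal sketch of an explicit check on the response follows; `missing_credential_keys` is a hypothetical helper, and the payloads are placeholders rather than real credentials.

```python
REQUIRED_KEYS = ("accessKeyId", "secretAccessKey", "sessionToken")


def missing_credential_keys(response):
    """Return the required credential fields that are absent or empty in a response dict."""
    return [key for key in REQUIRED_KEYS if not response.get(key)]


# Placeholder payloads, not real credentials.
complete = {"accessKeyId": "AKIA...", "secretAccessKey": "abc", "sessionToken": "xyz"}
incomplete = {"accessKeyId": "AKIA..."}

print(missing_credential_keys(complete))    # []
print(missing_credential_keys(incomplete))  # ['secretAccessKey', 'sessionToken']
```

Calling this on the parsed response before building the `S3FileSystem` would let the notebook raise a clearer message that points back to the netrc file.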

### 3. Identify S3 bucket links
You can find the S3 URL through a filtered Earthdata Search: https://search.earthdata.nasa.gov/search/granules/collection-details?p=C1276812863-GES_DISC&pg[0][v]=f&pg[0][gsk]=-start_date&ff=Available%20from%20AWS%20Cloud&fdc=Goddard%20Earth%20Sciences%20Data%20and%20Information%20Services%20Center%20(GES%20DISC)&tl=1648764097.138!3!!&long=0.0703125

In [4]:
s3_merra2 = "s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/"
s3_imerg = "s3://gesdisc-cumulus-prod-protected/GPM_L3/GPM_3IMERGHH.06/"

### 4. Identify time period and region (bounding box) of interest
Files in the S3 bucket are organized in folders as <code>YEAR/MONTH/DAILY-FILES.ext</code>.

For example, daily MERRA-2 files from May 2013: <code>s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2013/05/*.nc4</code>
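
The folder layout described above can be turned into a glob pattern with a small helper. This is a sketch only; `monthly_glob` is a hypothetical name, and the `nc4` default extension is taken from the MERRA-2 example.

```python
def monthly_glob(collection_prefix, year, month, extension="nc4"):
    """Build a glob pattern for one month of files under YEAR/MONTH/ folders."""
    # Zero-pad the month so that 5 and "05" both map to the "05" folder.
    base = collection_prefix.rstrip("/")
    return f"{base}/{int(year):04d}/{int(month):02d}/*.{extension}"


pattern = monthly_glob("s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/", 2013, 5)
print(pattern)
# s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2013/05/*.nc4
```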

In [5]:
year = "2019"
month = "05"
#day_of_year = "015"

lat = -30,30
lon = 30,90
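
The `lat` and `lon` tuples are not used again in this notebook. One way they might be applied later is as label-based slices, sketched below. `bbox_to_slices` is a hypothetical helper, and the sketch assumes the dataset has ascending `lat` and `lon` coordinates with longitudes in the -180 to 180 range.

```python
def bbox_to_slices(lat_bounds, lon_bounds):
    """Turn (min, max) latitude and longitude tuples into label-based slices."""
    south, north = lat_bounds
    west, east = lon_bounds
    if not (-90 <= south <= north <= 90):
        raise ValueError(f"invalid latitude bounds: {lat_bounds}")
    if not (-180 <= west <= east <= 180):
        raise ValueError(f"invalid longitude bounds: {lon_bounds}")
    return {"lat": slice(south, north), "lon": slice(west, east)}


lat = -30, 30
lon = 30, 90
selection = bbox_to_slices(lat, lon)
print(selection)
# {'lat': slice(-30, 30, None), 'lon': slice(30, 90, None)}
```

Once a dataset has been opened with xarray, the subset could then be taken with `merraDataset.sel(**selection)`.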

### 5. List files from S3

In [35]:
s3files = fs.glob(s3_merra2+
                    year+
                    "/"+
                    month+
                    "/"+
                    "*")
len(s3files), s3files

(31,
 ['gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190501.nc4',
  'gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190502.nc4',
  'gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190503.nc4',
  'gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190504.nc4',
  'gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190505.nc4',
  'gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190506.nc4',
  'gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190507.nc4',
  'gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190508.nc4',
  'gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190509.nc4',
  'gesdisc-cumulus-prod
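
Each path in the listing carries a `YYYYMMDD` date stamp, so the list can be narrowed to a date range before any file is opened. The sketch below assumes the daily MERRA-2 naming pattern shown in the output above; `granule_date` and `select_date_range` are hypothetical helpers, and `listing` is a synthetic stand-in for `s3files`.

```python
import re
from datetime import date, datetime

# Matches the date stamp just before the .nc4 extension.
DATE_PATTERN = re.compile(r"\.(\d{8})\.nc4$")


def granule_date(path):
    """Extract the YYYYMMDD date stamp from a daily granule path."""
    match = DATE_PATTERN.search(path)
    if match is None:
        raise ValueError(f"no date stamp found in {path!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def select_date_range(paths, start, end):
    """Keep the paths whose date stamp falls within [start, end], inclusive."""
    return [p for p in paths if start <= granule_date(p) <= end]


prefix = "gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/"
listing = [f"{prefix}MERRA2_400.tavg1_2d_slv_Nx.201905{day:02d}.nc4" for day in range(1, 10)]
first_week = select_date_range(listing, date(2019, 5, 1), date(2019, 5, 7))
print(len(first_week))  # 7
```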

In [36]:
# # IMERG
# s3files = fs.glob(s3_imerg+
#                     year+
#                     "/"+
#                     month+
#                     "/"+
#                     "*")
# len(s3files), s3files

(0, [])
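
One possible reason for the empty IMERG listing is that the half-hourly collection is not organized by month. The commented-out `day_of_year` variable in step 4 hints at a `YEAR/DAY-OF-YEAR` layout; that layout is an assumption here and should be confirmed by listing `s3_imerg + year + "/"`. Under that assumption, a folder name can be derived from a calendar date as sketched below (`day_of_year_folder` is a hypothetical helper).

```python
from datetime import date


def day_of_year_folder(year, month, day):
    """Return a 'YYYY/DDD' folder name (assumed layout, not confirmed by the listing above)."""
    doy = date(year, month, day).timetuple().tm_yday
    return f"{year:04d}/{doy:03d}"


print(day_of_year_folder(2019, 5, 1))   # 2019/121
print(day_of_year_folder(2019, 1, 15))  # 2019/015
```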

# ***Begin testing access methods and documenting issues/failures/speed test results***

### 6a. Open list of S3 files with xarray

- combine='by_coords'
- mask_and_scale=True
- decode_cf=True
- ***parallel=False***

In [9]:
%%time
merraDataset = xr.open_mfdataset(
    paths=[fs.open(f) for f in s3files],
    combine='by_coords',
    mask_and_scale=True,
    decode_cf=True,
    parallel=False,
#     chunks={'lat': 60,   # These were chosen arbitrarily. You must specify 
#             'lon': 120, # chunking that is suitable to the data and target
#             'time': 100}      # analysis.
)


print("Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (parallel=False)")

Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (parallel=False)
CPU times: user 31.1 s, sys: 1.03 s, total: 32.2 s
Wall time: 41.1 s


### 6b. Open list of S3 files with xarray

- combine='by_coords'
- mask_and_scale=True
- decode_cf=True
- ***parallel=True***

In [14]:
%%time
merraDataset = xr.open_mfdataset(
    paths=[fs.open(f) for f in s3files],
    combine='by_coords',
    mask_and_scale=True,
    decode_cf=True,
    parallel=True,
#     chunks={'lat': 60,   # These were chosen arbitrarily. You must specify 
#             'lon': 120, # chunking that is suitable to the data and target
#             'time': 100}      # analysis.
)


print("Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (parallel=True)")

Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (parallel=True)
CPU times: user 31.6 s, sys: 1.01 s, total: 32.6 s
Wall time: 40 s


### 6c. Open list of S3 files with xarray

- combine='by_coords'
- mask_and_scale=True
- decode_cf=True
- parallel=False
- ***decode_times=True***

In [16]:
%%time
merraDataset = xr.open_mfdataset(
    paths=[fs.open(f) for f in s3files],
    combine='by_coords',
    mask_and_scale=True,
    decode_cf=True,
    parallel=False,
    decode_times=True
#     chunks={'lat': 60,   # These were chosen arbitrarily. You must specify 
#             'lon': 120, # chunking that is suitable to the data and target
#             'time': 100}      # analysis.
)


print("Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (decode_times=True)")

Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (decode_times=True)
CPU times: user 31.7 s, sys: 1.2 s, total: 32.9 s
Wall time: 39.7 s


### 6d. Open list of S3 files with xarray

- combine='by_coords'
- mask_and_scale=True
- decode_cf=True
- parallel=False
- ***decode_times=False***

In [17]:
%%time
merraDataset = xr.open_mfdataset(
    paths=[fs.open(f) for f in s3files],
    combine='by_coords',
    mask_and_scale=True,
    decode_cf=True,
    parallel=False,
    decode_times=False,
#     chunks={'lat': 60,   # These were chosen arbitrarily. You must specify 
#             'lon': 120, # chunking that is suitable to the data and target
#             'time': 100}      # analysis.
)


print("Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (decode_times=False)")

ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation
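
A plausible explanation for this error, not verified against the files, is that each daily file stores its time coordinate as offsets from its own reference time. With `decode_times=False` the raw offsets would then be identical in every file, leaving `combine='by_coords'` nothing to order the files by. The sketch below illustrates the idea with plain Python; the offsets and reference times are assumed values for illustration, not values read from the data.

```python
from datetime import datetime, timedelta

# Assumed per-file metadata: each daily file stores the same raw offsets
# (in minutes), relative to a reference time that differs from file to file.
raw_minutes = [0, 60, 120]
file_epochs = {
    "20190501": datetime(2019, 5, 1, 0, 30),
    "20190502": datetime(2019, 5, 2, 0, 30),
}

# Without decoding, every file exposes an identical time coordinate.
raw_coords = {name: raw_minutes for name in file_epochs}
print(raw_coords["20190501"] == raw_coords["20190502"])  # True

# With decoding, the offsets become distinct timestamps that can be ordered.
decoded = {
    name: [epoch + timedelta(minutes=m) for m in raw_minutes]
    for name, epoch in file_epochs.items()
}
print(decoded["20190501"][-1] < decoded["20190502"][0])  # True
```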

### 6e. Open list of S3 files with xarray

- combine='by_coords'
- mask_and_scale=True
- decode_cf=True
- parallel=False
- decode_times=True
- ***chunks={'lat':10,'lon':10,'time':100}***

In [18]:
%%time
merraDataset1 = xr.open_mfdataset(
    paths=[fs.open(f) for f in s3files],
    combine='by_coords',
    mask_and_scale=True,
    decode_cf=True,
    parallel=False,
    decode_times=True,
    chunks={'lat': 10,   # These were chosen arbitrarily. You must specify 
            'lon': 10, # chunking that is suitable to the data and target
            'time': 100}      # analysis.
)


print("Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (chunks specified)")

Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (chunks specified)
CPU times: user 45.5 s, sys: 3.83 s, total: 49.4 s
Wall time: 1min 19s
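
The longer wall time is consistent with the much larger number of chunks that a 10 x 10 spatial chunking produces. The sketch below counts chunks per variable for one daily file. The file dimensions (24 hourly steps on a 361 x 576 grid) are an assumption about this MERRA-2 collection, and `chunks_per_variable` is a hypothetical helper.

```python
from math import ceil, prod


def chunks_per_variable(dim_sizes, chunk_sizes):
    """Count the chunks produced when splitting each dimension into fixed-size pieces."""
    return prod(ceil(dim_sizes[name] / chunk_sizes[name]) for name in dim_sizes)


# Assumed dimensions of one daily file: 24 hourly steps on a 361 x 576 grid.
daily_file = {"time": 24, "lat": 361, "lon": 576}

small = chunks_per_variable(daily_file, {"time": 100, "lat": 10, "lon": 10})
large = chunks_per_variable(daily_file, {"time": 50, "lat": 30, "lon": 30})
print(small, large)  # 2146 260
```

Under these assumptions, the 10 x 10 chunking creates roughly eight times as many chunks per variable as a 30 x 30 chunking, and every chunk adds scheduling overhead.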


### 6f. Open list of S3 files with xarray

- combine='by_coords'
- mask_and_scale=True
- decode_cf=True
- parallel=False
- decode_times=True
- ***chunks={'lat':30,'lon':30,'time':50}***

In [20]:
%%time
merraDataset1 = xr.open_mfdataset(
    paths=[fs.open(f) for f in s3files],
    combine='by_coords',
    mask_and_scale=True,
    decode_cf=True,
    parallel=False,
    decode_times=True,
    chunks={'lat': 30,   # These were chosen arbitrarily. You must specify 
            'lon': 30, # chunking that is suitable to the data and target
            'time': 50}      # analysis.
)


print("Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (see chunks specified)")

Using xarray's open_mfdataset to open 31 netCDF files from S3 storage (see chunks specified)
CPU times: user 32 s, sys: 848 ms, total: 32.8 s
Wall time: 41.2 s
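
The wall times recorded so far can be compared side by side. The sketch below copies the values from the `%%time` outputs above and expresses each as a multiple of the 6a baseline. These are single runs, so small differences may not be meaningful.

```python
# Wall times in seconds, copied from the %%time outputs above.
wall_times = {
    "6a parallel=False": 41.1,
    "6b parallel=True": 40.0,
    "6c decode_times=True": 39.7,
    "6e chunks 10x10": 79.0,  # reported as 1min 19s
    "6f chunks 30x30": 41.2,
}

baseline = wall_times["6a parallel=False"]

# Print the runs from fastest to slowest, relative to the 6a baseline.
for label, seconds in sorted(wall_times.items(), key=lambda item: item[1]):
    print(f"{label:<24} {seconds:6.1f} s  ({seconds / baseline:.2f}x baseline)")

slowest = max(wall_times, key=wall_times.get)
```

Only the 10 x 10 chunking stands out; the other options are within about two seconds of each other.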


### 6f. Open a single S3 file (.nc4) with xarray's open_zarr()

- Method attempt pulled from: https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314

In [22]:
import fsspec
f = s3files[0]

In [30]:
%%time
# Note: f has no "s3://" scheme, so fsspec resolves it as a local path (see the output below)
ncfile = fsspec.open(f)
print(ncfile)

<OpenFile '/home/jovyan/gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190501.nc4'>
CPU times: user 406 µs, sys: 0 ns, total: 406 µs
Wall time: 350 µs


In [29]:
%%time
#skipping a chunk store step
ds = xr.open_zarr(ncfile)

FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/MERRA2_400.tavg1_2d_slv_Nx.20190501.nc4'
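
The `/home/jovyan/...` path in the error suggests that the scheme-less key returned by `fs.glob` was treated as a local file. A minimal sketch of restoring the scheme is shown below; `with_s3_scheme` is a hypothetical helper. The temporary credentials would still need to be supplied when opening the file, for example by calling `fs.open(f)` as in the earlier sections. Because `xr.open_zarr` expects a Zarr store, the skipped chunk store step would likely still be needed for a netCDF-4 file.

```python
def with_s3_scheme(path):
    """Prefix a bucket/key path with 's3://' unless it already has it."""
    return path if path.startswith("s3://") else f"s3://{path}"


key = (
    "gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2019/05/"
    "MERRA2_400.tavg1_2d_slv_Nx.20190501.nc4"
)
print(with_s3_scheme(key))
```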

### 6g. Open a single S3 file (.nc4) with zarr-eosdis-store

- Method attempt pulled from: https://github.com/nasa/zarr-eosdis-store

In [33]:
from eosdis_store import EosdisStore

### 6h. Open a single S3 file (.nc4) using kerchunk json to mimic zarr

- Method adapted for NASA data by Aaron Friesz: https://github.com/NASA-Openscapes/earthdata-cloud-cookbook/blob/ornl_daymet_access/examples/GESDISC/GESDISC_MERRA2_tavg1_2d_flx_Nx__Kerchunk.ipynb