# example-load-podaac-S3-dataset

This notebook shows how to load a dataset from the PO.DAAC S3 data bucket on AWS West-2 while working from a JupyterHub environment that is also located on West-2. In our case, West-2 is the same cloud center that hosts the SMCE JupyterHub, where the EIS Sea Level project does a lot of its work. By reading data directly from the S3 bucket, we do not have to maintain our own copy and should not need to pay extra to work with that data. This notebook shows one way of working with such a dataset. It also includes nominal scripts for computing and plotting global fit functions to the example dataset, the JPL RL06 V2 GRACE(-FO) Mascon Solution.

As we improve our knowledge and integration with PO.DAAC S3 data, we will attempt to keep this notebook up to date to maintain a "best practice" example.

### Configuring Earthdata Authentication

You will need to set up Earthdata authentication to use the PO.DAAC S3 bucket. A set of functions is included below to facilitate this (compiled from multiple sources, listed below). You can automate authentication by creating a `.netrc` file in your home directory and writing the following:

```
machine urs.earthdata.nasa.gov
    login <earthdata username>
    password <earthdata password>
```

On the SMCE JupyterHub, it is recommended that you do this from a terminal. First make sure you are in your home directory (`cd ~`), then create a new file and include the following:

```
cat >> .netrc
machine urs.earthdata.nasa.gov
    login <username>
    password <password>

```

Press `Enter` and then type `Ctrl+C` to save and close the prompt.

> **⚠️ Warning:** After writing the file, we _**strongly**_ recommend making the new `.netrc` file readable only by your user with `chmod 0400 .netrc`. If you later need to edit this file, you can temporarily allow read/write access for your user only with `chmod 0600 .netrc`. **NOTE:** Some SMCE users have found that they must reset the 0400 permissions every time they start a new SMCE server. If you find this to be the case, you can add the `chmod` command to your bash profile or run the first cell in this notebook. Alternatively, you may wish to forgo the `.netrc` file altogether and instead use the login prompt below to authenticate each time you use this notebook. _However, that prompt appears to be broken at this time..._

If configured successfully, you should see the following output from the second notebook cell.

```
# Your URS credentials were securely retrieved from your .netrc file.
Earthdata login credentials configured. Ready.
```

Otherwise, you will see a message saying it could not use the .netrc file and it will ask you to input your username and password.

```
There's no .netrc file or the The endpoint isn't in the netrc file. Please provide...
# Your info will only be passed to urs.earthdata.nasa.gov and will not be exposed in Jupyter.
Username: 
```
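
To tell in advance which of the two messages above to expect, you can inspect the `.netrc` file directly. The following is a minimal sketch using only the Python standard library; the helper name `check_netrc` is hypothetical and not part of the notebook, and it assumes a POSIX-style file system where permission bits are meaningful.

```python
import netrc
import os
import stat

def check_netrc(path, machine="urs.earthdata.nasa.gov"):
    """Return (has_entry, owner_only) for the netrc file at `path`.

    has_entry  : True if the file parses and contains an entry for `machine`.
    owner_only : True if group and others have no permissions on the file.
    `check_netrc` is a hypothetical helper written for this example.
    """
    if not os.path.exists(path):
        return False, False
    mode = stat.S_IMODE(os.stat(path).st_mode)
    owner_only = (mode & 0o077) == 0
    try:
        entry = netrc.netrc(path).authenticators(machine)
    except netrc.NetrcParseError:
        return False, owner_only
    return entry is not None, owner_only
```

Calling `check_netrc(os.path.expanduser("~/.netrc"))` should return `(True, True)` when the file holds an Earthdata entry and has been restricted with `chmod 0400` or `chmod 0600`.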

---

*Note: There is a pip package called "earthdata" that is supposed to help with this process, primarily in reducing code that we must write. However, I (Mike Croteau) have not been able to get it to install on the SMCE. I've tried cloning the default environment and installing this extra package with conda, as suggested by the earthdata devs, but I get unresolvable package inconsistencies. If anyone can get this working, please share.*

---

Sources:

- [Use Case: Study Amazon Estuaries with Data from the EOSDIS Cloud](https://github.com/podaac/tutorials/blob/master/notebooks/SWOT-EA-2021/Estuary_explore_inCloud_zarr.ipynb)
- [SWOT Oceanography with PO.DAAC](https://git.mysmce.com/eis-sealevel/swot/-/blob/main/tutorials/.ipynb_checkpoints/SWOT_simulated_L2_SSH_introduction-checkpoint.ipynb)
- ["Update cloud_direct_access_s3.py" - podaac tutorials commit 4da70c7cf079ddd7a6de4c4345749f580ba66d71](https://github.com/podaac/tutorials/commit/4da70c7cf079ddd7a6de4c4345749f580ba66d71#)

In [1]:
!chmod 0400 ~/.netrc

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import xarray as xr
import progressbar

In [3]:
from urllib import request
from http.cookiejar import CookieJar
import netrc
import requests
import s3fs
import getpass

def setup_earthdata_login_auth(endpoint):
    """
    Set up the request library so that it authenticates against the given Earthdata Login
    endpoint and is able to track cookies between requests.  This looks in the .netrc file 
    first and if no credentials are found, it prompts for them.
    Valid endpoints:
        urs.earthdata.nasa.gov - Earthdata Login production
    """
    try:
        username, _, password = netrc.netrc().authenticators(endpoint)
        print('# Your URS credentials were securely retrieved from your .netrc file.')
    except (FileNotFoundError, TypeError):
        # FileNotFound = There's no .netrc file
        # TypeError = The endpoint isn't in the netrc file, causing the above to try unpacking None
        print("There's no .netrc file or the The endpoint isn't in the netrc file. Please provide...")
        print('# Your info will only be passed to %s and will not be exposed in Jupyter.' % (endpoint))
        username = input('Username: ')
        password = getpass.getpass('Password: ')

    manager = request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, endpoint, username, password)
    auth = request.HTTPBasicAuthHandler(manager)

    jar = CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(auth, processor)
    request.install_opener(opener)
    
def begin_s3_direct_access():
    """Request temporary S3 credentials from PO.DAAC and return an s3fs filesystem."""
    url = "https://archive.podaac.earthdata.nasa.gov/s3credentials"
    r = requests.get(url)
    r.raise_for_status()  # fail early on an HTTP error
    response = r.json()
    return s3fs.S3FileSystem(key=response['accessKeyId'],
                             secret=response['secretAccessKey'],
                             token=response['sessionToken'],
                             client_kwargs={'region_name': 'us-west-2'})

edl = "urs.earthdata.nasa.gov"
setup_earthdata_login_auth(edl)
print('Earthdata login credentials configured. Ready.')

# Your URS credentials were securely retrieved from your .netrc file.
Earthdata login credentials configured. Ready.


In [4]:
# Disko Bay
# UL: 69°17'41.9"N 53°05'39.0"W OR 69.294972 -53.094167
# LR: 68°45'42.8"N 51°23'32.7"W OR 68.761889 -51.392417
ul = [-53.4, 69.3]
lr = [-51.4, 68.7]


## Get data information from S3, then load the file you need.

Here, we initiate S3 access, then use s3fs to tell us what netcdf files are available in the given S3 bucket (printing out the last 5 for good measure). Then we select the last file and load it using xarray. Alternatively, you could attempt to use "harmony" to convert it to zarr format and load things from there (see the first source document above for more details).
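
The listing step described above can be sketched as a small helper. This is an illustration only: `latest_files` is a hypothetical name, it works with any object that has a `glob` method (such as the filesystem returned by `begin_s3_direct_access()`), and it assumes the file names embed the date so that sorting by name is also chronological.

```python
def latest_files(fs, pattern, n=5):
    """Return the last `n` paths matching `pattern`, sorted by name.

    `fs` is any filesystem-like object with a `glob(pattern)` method.
    `latest_files` is a hypothetical helper written for this example.
    """
    paths = sorted(fs.glob(pattern))
    return paths[-n:]

# Usage inside this notebook (requires S3 access, so it is left commented out):
# fs = begin_s3_direct_access()
# pattern = 's3://podaac-ops-cumulus-protected/ECCO_L4_TEMP_SALINITY_05DEG_MONTHLY_V4R4/*nc'
# recent = latest_files(fs, pattern)
# print(recent)
# ds_last = xr.open_dataset(fs.open(recent[-1]))
```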

In [98]:
import gsw

def thermal_forcing(lat, depth, theta, salt):
    """Thermal forcing: potential temperature minus the local freezing point."""
    # From Xu et al. (2012): https://doi.org/10.3189/2012AoG60A139
    
    # Local freezing point calculation
    a = -0.0575 # degC psu^-1
    b = 0.0901 # degC
    c = -7.61e-4 # degC dbar^-1
    
    # Not used below; pressure comes from gsw.p_from_z instead
    rho_seawater = 1026 # kg m^-3
    g = 9.81 # m s^-2
    
    p = gsw.p_from_z(depth, lat) # dbar (depth is negative below the surface)
    
    T_freeze = a*salt + b + c*p
    TF = theta - T_freeze
    
    return TF
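
As a quick sanity check of the freezing-point formula, the sketch below evaluates it for a single point. The input values (salinity 34.5, pressure 300 dbar, potential temperature 2.0 degC) are made up for illustration, and the pressure is passed in directly instead of being computed with `gsw.p_from_z`, so the example needs no extra packages. `freezing_point` is a hypothetical helper name.

```python
def freezing_point(salt, pressure_dbar, a=-0.0575, b=0.0901, c=-7.61e-4):
    """Local freezing point (degC), using the same coefficients as the cell above."""
    return a * salt + b + c * pressure_dbar

# Illustrative inputs only, not taken from the ECCO data
salt = 34.5        # psu
pressure = 300.0   # dbar
theta = 2.0        # degC

t_freeze = freezing_point(salt, pressure)
tf = theta - t_freeze
print(t_freeze, tf)
```

With these inputs the freezing point comes out at about -2.12 degC and the thermal forcing at about 4.12 degC, which is the same order of magnitude as the averages printed further down.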


In [103]:
# Initiate PO.DAAC S3 connection
fs = begin_s3_direct_access()

years = range(1992,2018)
TF_avg = list()

for year in progressbar.progressbar(years[:2]):
    # Get list of ECCO files
    s3_bucket = 's3://podaac-ops-cumulus-protected/ECCO_L4_TEMP_SALINITY_05DEG_MONTHLY_V4R4/*{:d}*nc'.format(year)
    s3_files = fs.glob(s3_bucket)

    fileset = [fs.open(file) for file in s3_files]

    ds = xr.open_mfdataset(fileset,
                               combine='by_coords',
                               mask_and_scale=True,
                               decode_cf=True,
                               chunks='auto')

    depth = ds.Z.values
    lat = ds.latitude.values
    lon = ds.longitude.values

    # Subset by depth, lat, lon
    z_idx = np.where( (depth <= -200) & (depth >= -400) )[0]
    lon_idx = np.where( (lon >= ul[0]) & (lon <= lr[0]) )[0]
    lon_min = lon_idx[0]; lon_max = lon_idx[-1]+1
    lat_idx = np.where( (lat <= ul[1]) & (lat >= lr[1]) )[0]
    lat_min = lat_idx[0]; lat_max = lat_idx[-1]+1

    # Expand depth and lat to (depth, lat, lon) so they line up with theta and salt
    depth_s = np.tile(depth[z_idx], (len(lat[lat_idx]), len(lon[lon_idx]), 1)).transpose([2,0,1])
    lat_s   = np.tile(lat[lat_idx], (len(lon[lon_idx]), len(depth[z_idx]), 1)).transpose([1,2,0])
    
    for month in range(12):
        theta = ds.THETA[month, z_idx, lat_min:lat_max, lon_min:lon_max].values
        salt  = ds.SALT [month, z_idx, lat_min:lat_max, lon_min:lon_max].values
        
        TF = thermal_forcing(lat_s, depth_s, theta, salt)
        TF_avg.append(np.nanmean(TF, axis=(0,1,2)))
        
    ds.close()


100% (2 of 2) |##########################| Elapsed Time: 0:00:46 Time:  0:00:46
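
The `np.tile` calls in the loop above expand the 1-D depth and latitude vectors to the (depth, lat, lon) shape of the data subset. A possible alternative, sketched below with made-up coordinate values, is to add length-1 axes and let NumPy broadcasting do the expansion. The array names and sizes here are hypothetical and chosen only to make the shapes easy to follow.

```python
import numpy as np

# Hypothetical coordinates: 3 depth levels, 2 latitudes, 4 longitudes
depth_sub = np.array([-200.0, -250.0, -300.0])
lat_sub = np.array([68.75, 69.25])
nlon = 4
full_shape = (depth_sub.size, lat_sub.size, nlon)

# Tiled depth array, built the same way as depth_s in the loop above
depth_tiled = np.tile(depth_sub, (lat_sub.size, nlon, 1)).transpose([2, 0, 1])

# Broadcasting version: depth varies along axis 0, latitude along axis 1
depth_b = depth_sub[:, None, None]   # shape (3, 1, 1)
lat_b = lat_sub[None, :, None]       # shape (1, 2, 1)

# Both approaches describe the same full-size arrays
same_depth = np.array_equal(np.broadcast_to(depth_b, full_shape), depth_tiled)
lat_full = np.broadcast_to(lat_b, full_shape)
print(same_depth, lat_full.shape)
```

Arrays shaped like `depth_b` and `lat_b` combine with a (depth, lat, lon) array in ordinary arithmetic without being expanded by hand. Whether `gsw.p_from_z` accepts such inputs directly has not been checked here, so treat that part as an assumption to verify.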




In [104]:
print(TF_avg)

[4.453350764067848, 4.377813869269568, 4.310500388892371, 4.2670297200996465, 4.2346790130455085, 4.20793772295304, 4.185539465697486, 4.165246563704688, 4.1569648082573005, 4.1502922113258425, 4.135242729933936, 4.111061650069434, 4.091312127860267, 4.080246907027442, 4.065527897628028, 4.040402584822852, 4.024995809348304, 4.019834404738623, 4.011505966933448, 3.9979178006965705, 3.9757149989921636, 3.921005373748023, 3.8484749849159305, 3.8466065938789433]
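
The list above holds one value per month, appended year by year. One way to summarize it is to group the values into annual means, as in the sketch below. `annual_means` is a hypothetical helper, and it assumes the values are in chronological order starting in January of the first year, with exactly 12 values per year.

```python
from statistics import mean

def annual_means(monthly_values, first_year):
    """Group a flat list of monthly values into {year: mean of its 12 months}.

    `annual_means` is a hypothetical helper written for this example.
    """
    if len(monthly_values) % 12 != 0:
        raise ValueError("expected a whole number of years of monthly values")
    return {
        first_year + i // 12: mean(monthly_values[i:i + 12])
        for i in range(0, len(monthly_values), 12)
    }

# Usage with the list computed above (first year of the loop is 1992):
# print(annual_means(TF_avg, 1992))
```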


In [None]:
# Plot the subset of potential temperature left in memory by the loop above
# (theta only holds the last month and the depth/lat/lon subset)
fig = plt.figure(figsize=(7,3), dpi=150)
ax = plt.axes(projection=ccrs.Robinson(), extent=[-58, -50, 68, 72])

z_level = 0  # index into the depth subset z_idx, not into the full depth axis
lon_sub = lon[lon_min:lon_max]
lat_sub = lat[lat_min:lat_max]
pc = ax.pcolormesh(lon_sub, lat_sub, theta[z_level,:,:], shading='auto', cmap=plt.cm.Reds, transform=ccrs.PlateCarree(), vmin=-1, vmax=+4)
# Outline of the Disko Bay box
ax.plot([ul[0], lr[0], lr[0], ul[0], ul[0]], [ul[1], ul[1], lr[1], lr[1], ul[1]], 'k--', zorder=100, transform=ccrs.PlateCarree())

# Mark every grid point of the full lat/lon grid
lonm, latm = np.meshgrid(lon,lat)
ax.plot(lonm, latm, 'k.', markersize=2, transform=ccrs.PlateCarree())

c = plt.colorbar(pc)
c.ax.tick_params(labelsize=7)
c.set_label('deg. C', size=7)

ax.coastlines(resolution='10m', zorder=7, linewidth=0.25)

ax.set_title('ocean potential temp. at depth {:.0f} m'.format(depth[z_idx[z_level]]))