# Discover, Customize and Access NSIDC DAAC Data

This notebook is based on the NSIDC-Data-Access-Notebook, available here: 
https://github.com/nsidc/NSIDC-Data-Access-Notebook

Now that we've visualized our study areas, we will explore data coverage, size, and the availability of customization services (subsetting, reformatting, reprojection), using ATL07 as an example. We will then access a file using NSIDC DAAC's Data Access and Service API. The "Data access for all datasets" notebook provides the steps needed to subset and download all the data we'll be using in the final "Visualize and Analyze Data" notebook.

***A note on data access options:***
We will be pursuing data discovery and access "programmatically" using Application Programming Interfaces, or APIs.

What is an API? You can think of it as a middleman between an application or end user (in this case, us) and a data provider. Here there are two providers: the Common Metadata Repository (CMR), which houses ICESat-2 metadata, and NSIDC, which provides the data. These APIs are essentially structured as a URL with a base plus individual key-value pairs (KVPs) separated by ‘&’.
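
As a minimal sketch of that URL structure, the snippet below assembles a request URL from a base and a dictionary of key-value pairs using only the standard library. The base URL and the `short_name` and `version` parameters are the ones used later in this notebook; the `build_url` helper is a hypothetical name introduced for illustration.

```python
from urllib.parse import urlencode

def build_url(base, params):
    """Join a base URL and key-value pairs into a single request URL."""
    # urlencode joins the pairs with '&' and percent-encodes unsafe characters
    return f"{base}?{urlencode(params)}"

cmr_base = 'https://cmr.earthdata.nasa.gov/search/collections.json'
example_url = build_url(cmr_base, {'short_name': 'ATL07', 'version': '002'})
print(example_url)
```

Pasting the printed URL into a browser should issue the same kind of query that the requests calls below send.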

There are other discovery and access methods available from NSIDC, as you can see from the data set home pages under the 'Download Data' tab, including OpenAltimetry and NASA Earthdata Search.



## Import packages


In [1]:
%run functions.py

In [2]:
import requests
import getpass
import socket
import json
import zipfile
import io
import math
import os
import shutil
import pprint
import re
import time
import functions
#from statistics import mean
from requests.auth import HTTPBasicAuth

## Explore data availability using the Common Metadata Repository 

The Common Metadata Repository (CMR) is a high-performance, high-quality, continuously evolving metadata system that catalogs Earth Science data and associated service metadata records. These metadata records are registered, modified, discovered, and accessed through programmatic interfaces leveraging standard protocols and APIs.

https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

### Select data set and determine version number

Data sets are selected by data set ID (e.g. ATL07). In the CMR API documentation, a data set ID is referred to as a "short name". We use the Python requests package to access the CMR. This search does not require a password.

In [3]:
CMR_COLLECTIONS_URL = 'https://cmr.earthdata.nasa.gov/search/collections.json'
response = requests.get(CMR_COLLECTIONS_URL, params={'short_name': 'ATL07'})

The response content is then parsed from [JSON](https://en.wikipedia.org/wiki/JSON), a language-independent, human-readable, open-standard format.

In [4]:
results = json.loads(response.content)
results

{'feed': {'updated': '2019-11-20T20:19:41.808Z',
  'id': 'https://cmr.earthdata.nasa.gov:443/search/collections.json?short_name=ATL07',
  'title': 'ECHO dataset metadata',
  'entry': [{'processing_level_id': 'Level 3',
    'boxes': ['-90 -180 90 180'],
    'time_start': '2018-10-14T00:00:00.000Z',
    'version_id': '001',
    'dataset_id': 'ATLAS/ICESat-2 L3A Sea Ice Height V001',
    'has_spatial_subsetting': True,
    'has_transforms': False,
    'associations': {'services': ['S1568899363-NSIDC_ECS',
      'S1613689509-NSIDC_ECS',
      'S1613669681-NSIDC_ECS']},
    'has_variables': True,
    'data_center': 'NSIDC_ECS',
    'short_name': 'ATL07',
    'organizations': ['NASA NSIDC DAAC', 'NASA/GSFC/EOS/ESDIS'],
    'title': 'ATLAS/ICESat-2 L3A Sea Ice Height V001',
    'coordinate_system': 'CARTESIAN',
    'summary': 'The data set (ATL07) contains along-track heights for sea ice and open water leads (at varying length scales) relative to the WGS84 ellipsoid (ITRF2014 reference frame)

More than one version can exist for a given data set:

In [5]:
for entry in results['feed']['entry']:
    functions.print_cmr_metadata(entry)

dataset_id: ATLAS/ICESat-2 L3A Sea Ice Height V001, version_id: 001
dataset_id: ATLAS/ICESat-2 L3A Sea Ice Height V002, version_id: 002


We will specify the most recent version, `002`, for our remaining ATL07 queries.
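
If you would rather pick the latest version programmatically than read it off the printed list, a sketch along these lines may help. It assumes every `version_id` is a zero-padded numeric string, as in the output above; `latest_version` is a hypothetical helper, and the entries are trimmed to the two fields printed above.

```python
# Entries trimmed to the two fields printed above
entries = [
    {'dataset_id': 'ATLAS/ICESat-2 L3A Sea Ice Height V001', 'version_id': '001'},
    {'dataset_id': 'ATLAS/ICESat-2 L3A Sea Ice Height V002', 'version_id': '002'},
]

def latest_version(entries):
    """Return the highest version_id, compared numerically but kept zero-padded."""
    return max((e['version_id'] for e in entries), key=int)

print(latest_version(entries))
```

In the notebook itself, `results['feed']['entry']` could be passed in place of the trimmed `entries` list.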

### Select time and area of interest

We will create a simple dictionary with our short name, version, time, and area of interest. We'll continue to add to this dictionary as we discover more information about our data set.

In [6]:
# Bounding Box spatial parameter in 'W,S,E,N' format

bounding_box = '140,72,153,80'
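
Because the bounding box is passed as a plain string, a swapped latitude pair is easy to miss. The sketch below parses the 'W,S,E,N' string and checks the value ranges. `parse_bounding_box` is a hypothetical helper, not part of `functions.py`, and it deliberately does not require W < E so that boxes crossing the antimeridian are not rejected.

```python
def parse_bounding_box(bbox):
    """Split a 'W,S,E,N' string into floats and check the value ranges."""
    west, south, east, north = (float(x) for x in bbox.split(','))
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError('longitudes must fall within [-180, 180]')
    if not (-90 <= south <= north <= 90):
        raise ValueError('latitudes must satisfy -90 <= S <= N <= 90')
    return west, south, east, north

print(parse_bounding_box('140,72,153,80'))
```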

In [7]:
# Each date in yyyy-MM-ddTHH:mm:ssZ format
# Date range in start,end format

temporal = '2019-03-23T00:00:00Z,2019-03-23T23:59:59Z'
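
The temporal string can also be generated rather than typed by hand. This sketch builds a 'start,end' range covering a single UTC day in the format described in the comments above; `day_range` is a hypothetical helper introduced for illustration.

```python
from datetime import datetime, timedelta

def day_range(date_string):
    """Return 'start,end' covering one UTC day given as 'YYYY-MM-DD'."""
    start = datetime.strptime(date_string, '%Y-%m-%d')
    # The range ends one second before the next day begins
    end = start + timedelta(days=1) - timedelta(seconds=1)
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return f'{start.strftime(fmt)},{end.strftime(fmt)}'

print(day_range('2019-03-23'))
```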

In [8]:
data_dict = {'short_name': 'ATL07', 
             'version': '002',
             'bounding_box': bounding_box, 
             'temporal': temporal }

### Determine how many files exist over this time and area of interest, as well as the average size and total volume of those granules

We will use the `granule_info` function to query the CMR granule API. The function prints the number of granules, their average size, and their total volume. It returns the number of granules, which we will add to our data dictionary.

In [9]:
gran_num = functions.granule_info(data_dict)
data_dict['gran_num'] = gran_num

There are 3 granules of ATL07 version 002 over my area and time of interest.
The average size of each granule is 260.65 MB and the total size of all 3 granules is 781.94 MB
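
The body of `granule_info` lives in `functions.py` and is not shown here. As a rough sketch of the arithmetic behind the printed summary, the snippet below computes the count, mean, and total from a list of granule records. It assumes each record carries a `granule_size` field in MB stored as a string; the `summarize_granules` helper and the sample sizes are hypothetical.

```python
def summarize_granules(granules):
    """Return (count, mean size, total size) for a list of granule records."""
    sizes = [float(g['granule_size']) for g in granules]
    total = sum(sizes)
    mean = total / len(sizes) if sizes else 0.0
    return len(sizes), mean, total

# Hypothetical records; sizes are illustrative, not real ATL07 values
sample = [{'granule_size': '250.0'}, {'granule_size': '300.0'}, {'granule_size': '200.0'}]
count, mean_mb, total_mb = summarize_granules(sample)
print(f'{count} granules, mean {mean_mb:.2f} MB, total {total_mb:.2f} MB')
```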


Note that subsetting, reformatting, or reprojecting can alter the size of the granules if those services are applied to your request.

### Input Earthdata Login credentials

An Earthdata Login account is required to query data services and to access data from the NSIDC DAAC. If you do not already have an Earthdata Login account, visit http://urs.earthdata.nasa.gov to register.

In [11]:
#Store Earthdata Login user name
uid = 'amy.steiker'

#Input and store Earthdata Login password
pswd = getpass.getpass('Earthdata Login password: ')

#Store email associated with Earthdata Login account
email = 'amy.steiker@nsidc.org'

Earthdata Login password:  ·········


### Select the subsetting, reformatting, and reprojection services enabled for your data set of interest.

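The NSIDC DAAC supports customization services on many of our NASA Earthdata mission collections. Let's discover whether or not our data set has these services available. If services are available, we will also determine the specific service options supported for this data set and select which of these services we want to request. Note that the remaining cells treat `data_dict` as a nested dictionary keyed by a product label (for example `sea_ice_height` for ATL07), where each value is a parameter dictionary like the one built above.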

In [None]:
from xml.etree import ElementTree as ET

# The cells from here on expect data_dict to be nested by product label.
# Wrap the flat ATL07 dictionary built above if that has not been done yet.
if 'short_name' in data_dict:
    data_dict = {'sea_ice_height': data_dict}

for k, v in data_dict.items():
    sn = v['short_name']
    ve = v['version']
    capability_url = f'https://n5eil02u.ecs.nsidc.org/egi/capabilities/{sn}.{ve}.xml'
    
    # Create session to store cookie and pass credentials to capabilities url
    session = requests.session()
    s = session.get(capability_url)
    response = session.get(s.url, auth=(uid, pswd))
    response.raise_for_status()
    root = ET.fromstring(response.content)

    # Collect lists with each service option
    subagent = [subset_agent.attrib for subset_agent in root.iter('SubsetAgent')]

    # Variable subsetting: normalize names to '/group/variable' paths
    variables_raw = [var.attrib['value'] for var in root.iter('SubsetVariable')]
    variables_join = [name if name.startswith('/') else '/' + name for name in variables_raw]
    variable_vals = [name.replace(':', '/') for name in variables_join]

    # Reformatting: drop the empty entry if present
    format_vals = [fmt.attrib['value'] for fmt in root.iter('Format') if fmt.attrib['value']]

    # Reprojection options
    proj_vals = [proj.attrib['value'] for proj in root.iter('Projection')
                 if proj.attrib['value'] != 'NO_CHANGE']

    # Print service information depending on service availability and select service options
    print(sn, 'service selection:')
    if len(subagent) < 1:
        print('No services exist for', sn)
        meta = input('Would you like to receive XML metadata files along with the science files? (y/n)')
        if meta == 'n': data_dict[k]['INCLUDE_META'] = 'N'
        print('')
    else:
        subdict = subagent[0]
        if subdict['spatialSubsetting'] == 'true':
            ss = input('Subsetting by bounding box, based on the area of interest inputted above, is available. Would you like to request this service? (y/n)')
            if ss == 'y': data_dict[k]['bbox'] = bounding_box
        if subdict['temporalSubsetting'] == 'true':
            ts = input('Subsetting by time, based on the temporal range inputted above, is available. Would you like to request this service? (y/n)')
            # Reuse the temporal range defined above, without the trailing 'Z' of the CMR format
            if ts == 'y': data_dict[k]['time'] = temporal.replace('Z', '')
        if len(format_vals) > 0:
            print('These reformatting options are available:', format_vals)
            reformat = input('If you would like to reformat, copy and paste the reformatting option you would like (make sure to omit quotes, e.g. GeoTIFF), otherwise leave blank.')
            # Store the choice only when one was entered
            if reformat: data_dict[k]['format'] = reformat
        if len(proj_vals) > 0:
            print('These reprojection options are available with your requested format:', proj_vals)
            projection = input('If you would like to reproject, copy and paste the reprojection option you would like (make sure to omit quotes, e.g. GEOGRAPHIC), otherwise leave blank.')
            if projection: data_dict[k]['projection'] = projection
            # Enter required parameters for UTM North and South
            if projection in ('UTM NORTHERN HEMISPHERE', 'UTM SOUTHERN HEMISPHERE'):
                NZone = input('Please enter a UTM zone (1 to 60 for Northern Hemisphere; -60 to -1 for Southern Hemisphere):')
                data_dict[k]['projection_parameters'] = 'NZone:' + NZone
        else:
            print('No reprojection options are supported with your requested format')
    print()
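
To see what the parsing steps in this cell produce without logging in, the sketch below runs the same kind of extraction on a small inline document. The element and attribute names (`SubsetAgent`, `SubsetVariable`, `Format`, `Projection`, `value`) come from the cell above; the sample XML itself, including the `capabilities` root element, is a hypothetical and heavily trimmed stand-in for a real capabilities response.

```python
from xml.etree import ElementTree as ET

# Hypothetical, heavily trimmed capabilities document for illustration only
sample_xml = """
<capabilities>
  <SubsetAgent spatialSubsetting="true" temporalSubsetting="true">
    <SubsetVariable value="gt1l:sea_ice_segments:latitude"/>
    <SubsetVariable value="/gt1l/sea_ice_segments/longitude"/>
    <Format value=""/>
    <Format value="NetCDF4-CF"/>
    <Projection value="NO_CHANGE"/>
    <Projection value="GEOGRAPHIC"/>
  </SubsetAgent>
</capabilities>
"""

root = ET.fromstring(sample_xml.strip())
agents = [a.attrib for a in root.iter('SubsetAgent')]
raw_vars = [v.attrib['value'] for v in root.iter('SubsetVariable')]
# Normalize to '/group/variable' paths: colons become slashes, one leading slash
variable_vals = ['/' + v.replace(':', '/').lstrip('/') for v in raw_vars]
# Skip the empty format entry and the NO_CHANGE projection
format_vals = [f.attrib['value'] for f in root.iter('Format') if f.attrib['value']]
proj_vals = [p.attrib['value'] for p in root.iter('Projection')
             if p.attrib['value'] != 'NO_CHANGE']

print(agents[0]['spatialSubsetting'], variable_vals, format_vals, proj_vals)
```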

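Now let's select a subset of variables. We'll use these primary variables of interest for the ICESat-2 sea ice and photon height products. The ATL03 assignment applies only when `data_dict` also contains an ATL03 entry keyed `photon_height`: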


In [None]:
#ATL07
#Use only strong beams
# Each segment aggregates 150 returns; strong beams collect them over a shorter distance, giving denser coverage

data_dict['sea_ice_height']['coverage'] = '/gt1l/sea_ice_segments/delta_time,\
/gt1l/sea_ice_segments/latitude,\
/gt1l/sea_ice_segments/longitude,\
/gt1l/sea_ice_segments/heights/height_segment_confidence,\
/gt1l/sea_ice_segments/heights/height_segment_height,\
/gt1l/sea_ice_segments/heights/height_segment_quality,\
/gt1l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt2l/sea_ice_segments/delta_time,\
/gt2l/sea_ice_segments/latitude,\
/gt2l/sea_ice_segments/longitude,\
/gt2l/sea_ice_segments/heights/height_segment_confidence,\
/gt2l/sea_ice_segments/heights/height_segment_height,\
/gt2l/sea_ice_segments/heights/height_segment_quality,\
/gt2l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt3l/sea_ice_segments/delta_time,\
/gt3l/sea_ice_segments/latitude,\
/gt3l/sea_ice_segments/longitude,\
/gt3l/sea_ice_segments/heights/height_segment_confidence,\
/gt3l/sea_ice_segments/heights/height_segment_height,\
/gt3l/sea_ice_segments/heights/height_segment_quality,\
/gt3l/sea_ice_segments/heights/height_segment_surface_error_est'

#ATL03
#Use only strong beams

data_dict['photon_height']['coverage'] = '/ds_surf_type,\
/gt1l/heights/delta_time,\
/gt1l/heights/h_ph,\
/gt1l/heights/lat_ph,\
/gt1l/heights/lon_ph,\
/gt1l/heights/signal_conf_ph,\
/gt2l/heights/delta_time,\
/gt2l/heights/h_ph,\
/gt2l/heights/lat_ph,\
/gt2l/heights/lon_ph,\
/gt2l/heights/signal_conf_ph,\
/gt3l/heights/delta_time,\
/gt3l/heights/h_ph,\
/gt3l/heights/lat_ph,\
/gt3l/heights/lon_ph,\
/gt3l/heights/signal_conf_ph'
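
The hand-written strings above repeat the same variable names for each beam, so they can also be generated. This is a minimal sketch for the ATL07 list, using the beam and variable names from the cell above; `build_coverage` is a hypothetical helper.

```python
from itertools import product

beams = ['gt1l', 'gt2l', 'gt3l']
atl07_vars = [
    'delta_time', 'latitude', 'longitude',
    'heights/height_segment_confidence',
    'heights/height_segment_height',
    'heights/height_segment_quality',
    'heights/height_segment_surface_error_est',
]

def build_coverage(beams, variables, group):
    """Return a comma-separated list of '/beam/group/variable' paths."""
    # product() varies the variable fastest, so paths are grouped by beam
    return ','.join(f'/{b}/{group}/{v}' for b, v in product(beams, variables))

atl07_coverage = build_coverage(beams, atl07_vars, 'sea_ice_segments')
print(atl07_coverage.split(',')[:2])
```

The result could then be assigned to `data_dict['sea_ice_height']['coverage']` in place of the literal string.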

### Select data access configurations

The data request can be accessed asynchronously or synchronously. The asynchronous option will allow concurrent requests to be queued and processed without the need for a continuous connection. Those requested orders will be delivered to the specified email address, or they can be accessed programmatically as shown below. Synchronous requests will automatically download the data as soon as processing is complete. For this tutorial, we will be selecting the asynchronous method. 

In [None]:
#Set NSIDC data access base URL
base_url = 'https://n5eil02u.ecs.nsidc.org/egi/request'

for k, v in data_dict.items():
    #Add email address
    data_dict[k]['email'] = email
    
    #Set the request mode to asynchronous
    data_dict[k]['request_mode'] = 'async'

    #Set the page size to the maximum for asynchronous requests 
    page_size = 2000
    data_dict[k]['page_size'] = page_size

    #Determine number of orders needed for requests over 2000 granules. 
    page_num = math.ceil(data_dict[k]['gran_num']/page_size)
    data_dict[k]['page_num'] = page_num
    del data_dict[k]['gran_num']
    print('There will be', page_num, 'total order(s) processed for our', v['short_name'], 'request.')
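
The order count is a ceiling division of the granule count by the page size. The sketch below works through a few granule counts with the 2000-granule page size used above; `pages_needed` is a hypothetical helper.

```python
import math

def pages_needed(gran_num, page_size=2000):
    """Number of orders required to cover gran_num granules."""
    return math.ceil(gran_num / page_size)

for n in (3, 2000, 2001, 4500):
    print(n, 'granules ->', pages_needed(n), 'order(s)')
```

Our 3-granule request therefore fits in a single order, and a second order is only needed from the 2001st granule onward.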

### Create the API endpoint 

Programmatic API requests are formatted as HTTPS URLs that contain key-value pairs specifying the service operations we selected above. The resulting URLs can be submitted from the command line, in a web browser, or from Python as shown below. 

In [None]:
endpoint_list = [] 
for k, v in data_dict.items():
    # Build one endpoint per page; v['page_num'] holds the total number of pages
    for page_val in range(1, v['page_num'] + 1):
        # Copy the parameters so the stored page count is left unchanged
        params = dict(v, page_num=page_val)
        param_string = '&'.join(f'{key}={value}' for key, value in params.items())

        # API base URL + request parameters
        API_request = f'{base_url}?{param_string}'
        endpoint_list.append(API_request)

print("\n".join("\n"+s for s in endpoint_list))

### Request data

We will now download data using the Python requests library. The data will be downloaded directly to this notebook directory in a new Outputs folder. The progress of each order will be reported.

In [None]:
# Create an output folder if the folder does not already exist.
path = os.path.join(os.getcwd(), 'Outputs')
os.makedirs(path, exist_ok=True)

# Request data service for each page number, and unzip outputs
for k, v in data_dict.items():
    for i in range(v['page_num']):
        page_val = i + 1
        print(v['short_name'], 'Order: ', page_val)

        # Request the current page rather than the total page count.
        # For all requests other than spatial file upload, use get function
        params = dict(v, page_num=page_val)
        request = session.get(base_url, params=params)
        print('Request HTTP response: ', request.status_code)

        # Raise bad request: Loop will stop for bad response code.
        request.raise_for_status()
        #print('Order request URL: ', request.url)
        esir_root = ET.fromstring(request.content)
        #print('Order request response XML content: ', request.content)

        # Look up order ID
        orderlist = [order.text for order in esir_root.findall("./order/")]
        orderID = orderlist[0]
        print('order ID: ', orderID)

        # Create status URL
        statusURL = base_url + '/' + orderID
        print('status URL: ', statusURL)

        # Find order status
        request_response = session.get(statusURL)
        print('HTTP response from order response URL: ', request_response.status_code)

        # Raise bad request: Loop will stop for bad response code.
        request_response.raise_for_status()
        status_root = ET.fromstring(request_response.content)
        status = [s.text for s in status_root.findall("./requestStatus/")][0]
        print('Data request ', page_val, ' is submitting...')
        print('Initial request status is ', status)

        # Continue loop while request is still processing
        while status == 'pending' or status == 'processing':
            print('Status is not complete. Trying again.')
            time.sleep(10)
            loop_response = session.get(statusURL)

            # Raise bad request: Loop will stop for bad response code.
            loop_response.raise_for_status()
            status_root = ET.fromstring(loop_response.content)

            # Find status
            status = [s.text for s in status_root.findall("./requestStatus/")][0]
            print('Retry request status is: ', status)

        # Order can either complete, complete_with_errors, or fail.
        # Print error messages from the most recent status response:
        if status == 'complete_with_errors' or status == 'failed':
            messagelist = [message.text for message in status_root.findall("./processInfo/")]
            print('error messages:')
            pprint.pprint(messagelist)

        # Download zipped order if status is complete or complete_with_errors
        if status == 'complete' or status == 'complete_with_errors':
            downloadURL = 'https://n5eil02u.ecs.nsidc.org/esir/' + orderID + '.zip'
            print('Zip download URL: ', downloadURL)
            print('Beginning download of zipped output...')
            zip_response = session.get(downloadURL)
            # Raise bad request: Loop will stop for bad response code.
            zip_response.raise_for_status()
            with zipfile.ZipFile(io.BytesIO(zip_response.content)) as z:
                z.extractall(path)
            print('Data request', page_val, 'is complete.')
        else:
            print('Request failed.')
    print()
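
The status loop in this cell keeps polling for as long as the order reports `pending` or `processing`. One possible refinement, sketched here under assumptions, is to cap the number of checks so a stalled order cannot block the notebook indefinitely. `poll_until_done` and its parameters are hypothetical, and the status sequence is simulated rather than fetched from the status URL.

```python
import time

def poll_until_done(get_status, interval=10, max_attempts=60, sleep=time.sleep):
    """Call get_status() until it leaves 'pending'/'processing' or attempts run out."""
    status = get_status()
    attempts = 1
    while status in ('pending', 'processing'):
        if attempts >= max_attempts:
            raise TimeoutError(f'order still {status} after {attempts} checks')
        sleep(interval)
        status = get_status()
        attempts += 1
    return status

# Simulated status sequence standing in for repeated GETs of the status URL
fake_statuses = iter(['pending', 'processing', 'processing', 'complete'])
final = poll_until_done(lambda: next(fake_statuses), interval=0)
print(final)
```

In the notebook, `get_status` would wrap the `session.get(statusURL)` call and the `requestStatus` lookup shown in the cell.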

In [None]:
print(request.content)

### Finally, we will clean up the Outputs folder by removing individual order folders:

In [None]:
# Clean up Outputs folder by removing individual granule folders 

for root, dirs, files in os.walk(path, topdown=False):
    for file in files:
        try:
            shutil.move(os.path.join(root, file), path)
        except OSError:
            pass
    for name in dirs:
        os.rmdir(os.path.join(root, name))    
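
To try the clean-up logic safely before running it on real downloads, the sketch below applies the same bottom-up walk to a throwaway directory tree. The `flatten` helper and the folder and file names are hypothetical; the temporary tree only mimics nested order folders.

```python
import os
import shutil
import tempfile

def flatten(path):
    """Move every file under path into path itself, then remove empty subfolders."""
    # topdown=False visits the deepest folders first, so they are empty when removed
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            if root != path:
                shutil.move(os.path.join(root, name), os.path.join(path, name))
        for name in dirs:
            os.rmdir(os.path.join(root, name))

# Build a throwaway tree that mimics per-order subfolders (hypothetical names)
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, 'order_1', 'granule_a'))
with open(os.path.join(demo, 'order_1', 'granule_a', 'file_a.h5'), 'w') as f:
    f.write('placeholder')

flatten(demo)
flattened = sorted(os.listdir(demo))
print(flattened)
shutil.rmtree(demo)
```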

### To review, we have explored data availability and volume over a region and time of interest, discovered and selected data customization options, constructed API endpoints for our requests, and downloaded data. Let's move on to the analysis portion of the tutorial.