# How to: Find Data From Two Intercalibration Instrument Targets using NASA's CMR API

**Summary**  

This notebook will show an efficient method to locate and access data that fall within LASICS-identified intercalibration opportunities using NASA's CMR API. The Common Metadata Repository (CMR) is a metadata system that catalogs Earth Science data and associated metadata records. The CMR Application Programming Interface (API) provides programmatic search capabilities through CMR's metadata using various parameters and keywords. When querying NASA's CMR, there is a limit of 1 million granules matched and only 2000 granules are returned per page. 

**Requirements:**
+ A NASA [Earthdata Login](https://urs.earthdata.nasa.gov/) account is required to download NASA mission data   

**Learning Objectives**  
- How to find NASA data using NASA's CMR API
- How to download programmatically 


*Thank you to the LP DAAC for their tutorial on EMIT data CMR API Search & Download Tutorial. These instructions were derived from that tutorial, which can be found [on github](https://github.com/nasa/EMIT-Data-Resources).*

---

**Example: LASICS-identified EMIT on ISS vs MODIS/CERES on Aqua or Terra; or VIIRS/CERES on SNPP or NOAA-20 Intercalibration Events}**

Import the required packages

In [1]:
import requests
import pandas as pd
import datetime as dt
import numpy as np

---

## Searching multiple dates/times using CMR API

Because the LASICS tool identifies concurrent measurements opportunities within a user-specified period of time, only the  date-time ranges need to be specified when searching the CMR API.  

Specify multiple date-time ranges and format to the structure necessary for searching CMR.

### Inputs Needed 
* The XML path/filename
* DOI for Target 1
* DOI for Target 2

Note that if your LASICS search only included one instrument target (e.g. you searched for intercal events over a particular pseudo-invariant land site), then only one DOI is needed, and you only need to run the cells to locate the files pertaining to the data set that you're interested in.

NASA EarthData's unique ID for this dataset (called Concept ID) is needed for searching the dataset. The dataset Digital Object Identifier or DOI can be used to obtain the Concept ID.

#### Obtaining the Concept ID for CERES

CERES on NOAA-20 (CER_SSF_NOAA20-FM6-VIIRS_Edition1B) DOI found on [ASDC DAAC's page](https://asdc.larc.nasa.gov/project/CERES/CER_SSF_NOAA20-FM6-VIIRS_Edition1B). - DOIs for CERES on other platforms can be found here as well. 

#### Obtaining the Concept ID for MODIS or VIIRS

MODIS or VIIRS DOIs can be found on [LAADS DAAC's Page](https://ladsweb.modaps.eosdis.nasa.gov/search/order/)

#### Obtaining the Concept ID for EMIT on ISS
Starting with EMIT on ISS, DOIs can be found by clicking the `Citation` link on the LP DAAC's [EMIT Product Pages](https://lpdaac.usgs.gov/product_search/?query=emit&view=cards&sort=title).

---
I ran this example for CERES, but this will work for MODIS or VIIRS the same way by inserting the DOI for the data set on appropriate platform (Terra, Aqua, SNPP, or NOAA-20) - see the name of the XML to ensure you're selecting the right platform; then pick the data set of interest for MODIS or VIIRS.

In [2]:
# # Inputs needed for this notebook
# In this example EMIT is always one of the targets
doi_target1 = '10.5067/EMIT/EMITL1BRAD.001'# EMIT L1B TOA Radiance


In [3]:
# # Inputs needed for this notebook
# # EMIT-Terra
# xml_fname = 'LASICS-SPS_20230227T162925_SNPP-Aug22-Sept22.xml'
# doi_target2 = '10.5067/NOAA20/CERES/SSF-FM6_L2.001B'# CERES FM6 on NOAA-20 SSF 

In [4]:
# # Inputs needed for this notebook
# # EMIT-Aqua
# xml_fname = 'LASICS-SPS_20230227T162925_SNPP-Aug22-Sept22.xml'
# doi_target2 = '10.5067/NOAA20/CERES/SSF-FM6_L2.001B'# CERES FM6 on NOAA-20 SSF 

In [5]:
# Inputs needed for this notebook
# EMIT-SNPP
# xml_fname = 'LASICS-SPS_20230227T162925_SNPP-Aug22-Sept22.xml'
# doi_target2 = '10.5067/NOAA20/CERES/SSF-FM6_L2.001B'# CERES FM6 on NOAA-20 SSF 

In [14]:
# # Inputs needed for this notebook
# # EMIT-NOAA-20
xml_fname = 'LASICS-SPS_20230223T013121_NOAA20-Aug22-Sept22.xml'
doi_target2 = '10.5067/NOAA20/CERES/SSF-FM6_L2.001B'# CERES FM6 on NOAA-20 SSF 

---
## Get intercal event information from LASICS XML output file

In [15]:
# xml_fname = 'LASICS-SPS_20230223T013121_NOAA20-Aug22-Sept22.xml'
new_cols = ['StartTime', 'EndTime']
df = pd.read_xml(xml_fname, xpath='/SPS_Plan/ScienceOpportunities/ScienceOpportunity')

exclude_columns = ['TargetName', 'TargetStartTime', 'TargetEndTime', 'ReferenceName']
target1 = (df.loc[:, ~df.columns.isin(exclude_columns)]).copy()
target1.columns = new_cols
target1.StartTime = [x[0:-2]+'Z' for x in target1.StartTime]
target1.EndTime = [x[0:-2]+'Z' for x in target1.EndTime]

exclude_columns2 = ['TargetName', 'ReferenceStartTime', 'ReferenceEndTime', 'ReferenceName']
target2 = (df.loc[:, ~df.columns.isin(exclude_columns2)]).copy()
target2.columns = new_cols
target2.StartTime = [x[0:-2]+'Z' for x in target2.StartTime]
target2.EndTime = [x[0:-2]+'Z' for x in target2.EndTime]

In [16]:
# What's specified as "Target" and "Reference" is irrelevant. 
# In the tool that produces these (LASICS: The Langley Automated Sensor Intercalibration System), 
# It needs to specify the target at the lower altitude as the "reference" - just an FYI
df

Unnamed: 0,TargetName,TargetStartTime,TargetEndTime,ReferenceName,ReferenceStartTime,ReferenceEndTime
0,NOAA 20,2022-08-10T05:23:00.0,2022-08-10T05:23:20.0,ISS,2022-08-10T05:26:59.0,2022-08-10T05:29:49.0
1,NOAA 20,2022-08-10T07:04:30.0,2022-08-10T07:04:50.0,ISS,2022-08-10T06:59:31.0,2022-08-10T07:02:21.0
2,NOAA 20,2022-08-11T00:00:00.0,2022-08-11T00:00:00.0,ISS,2022-08-11T00:01:35.0,2022-08-11T00:01:59.0
3,NOAA 20,2022-08-11T01:41:00.0,2022-08-11T01:41:30.0,ISS,2022-08-11T01:34:07.0,2022-08-11T01:37:03.0
4,NOAA 20,2022-08-11T16:54:50.0,2022-08-11T16:55:10.0,ISS,2022-08-11T17:03:38.0,2022-08-11T17:04:50.0
...,...,...,...,...,...,...
113,NOAA 20,2022-09-11T15:03:20.0,2022-09-11T15:03:40.0,ISS,2022-09-11T15:02:02.0,2022-09-11T15:04:54.0
114,NOAA 20,2022-09-11T16:44:50.0,2022-09-11T16:45:00.0,ISS,2022-09-11T16:34:50.0,2022-09-11T16:37:26.0
115,NOAA 20,2022-09-12T07:58:40.0,2022-09-12T07:59:10.0,ISS,2022-09-12T08:04:13.0,2022-09-12T08:07:01.0
116,NOAA 20,2022-09-12T09:40:00.0,2022-09-12T09:40:30.0,ISS,2022-09-12T09:36:44.0,2022-09-12T09:39:32.0


In [17]:
target1

Unnamed: 0,StartTime,EndTime
0,2022-08-10T05:26:59Z,2022-08-10T05:29:49Z
1,2022-08-10T06:59:31Z,2022-08-10T07:02:21Z
2,2022-08-11T00:01:35Z,2022-08-11T00:01:59Z
3,2022-08-11T01:34:07Z,2022-08-11T01:37:03Z
4,2022-08-11T17:03:38Z,2022-08-11T17:04:50Z
...,...,...
113,2022-09-11T15:02:02Z,2022-09-11T15:04:54Z
114,2022-09-11T16:34:50Z,2022-09-11T16:37:26Z
115,2022-09-12T08:04:13Z,2022-09-12T08:07:01Z
116,2022-09-12T09:36:44Z,2022-09-12T09:39:32Z


In [18]:
target2

Unnamed: 0,StartTime,EndTime
0,2022-08-10T05:23:00Z,2022-08-10T05:23:20Z
1,2022-08-10T07:04:30Z,2022-08-10T07:04:50Z
2,2022-08-11T00:00:00Z,2022-08-11T00:00:00Z
3,2022-08-11T01:41:00Z,2022-08-11T01:41:30Z
4,2022-08-11T16:54:50Z,2022-08-11T16:55:10Z
...,...,...
113,2022-09-11T15:03:20Z,2022-09-11T15:03:40Z
114,2022-09-11T16:44:50Z,2022-09-11T16:45:00Z
115,2022-09-12T07:58:40Z,2022-09-12T07:59:10Z
116,2022-09-12T09:40:00Z,2022-09-12T09:40:30Z


---

## Search for Target 1 (EMIT) Files

In [19]:
# CMR API base url
cmrurl='https://cmr.earthdata.nasa.gov/search/' 

doisearch = cmrurl + 'collections.json?doi=' + doi_target1
concept_id = requests.get(doisearch).json()['feed']['entry'][0]['id']
print(concept_id)

C2408009906-LPCLOUD


This is the unique NASA-given concept ID for the EMIT L1B TOA Radiance dataset, which can be used to retrieve relevant files (or granules).

In [20]:
temporal_str = []
temporal_str = [s + ','+ e for s,e in zip(target1.StartTime, target1.EndTime)]

In [21]:
temporal_str

['2022-08-10T05:26:59Z,2022-08-10T05:29:49Z',
 '2022-08-10T06:59:31Z,2022-08-10T07:02:21Z',
 '2022-08-11T00:01:35Z,2022-08-11T00:01:59Z',
 '2022-08-11T01:34:07Z,2022-08-11T01:37:03Z',
 '2022-08-11T17:03:38Z,2022-08-11T17:04:50Z',
 '2022-08-11T18:36:11Z,2022-08-11T18:39:09Z',
 '2022-08-11T20:08:43Z,2022-08-11T20:11:41Z',
 '2022-08-12T11:38:23Z,2022-08-12T11:40:50Z',
 '2022-08-12T13:10:56Z,2022-08-12T13:13:49Z',
 '2022-08-12T14:44:20Z,2022-08-12T14:46:20Z',
 '2022-08-13T06:12:55Z,2022-08-13T06:16:09Z',
 '2022-08-13T07:45:27Z,2022-08-13T07:48:39Z',
 '2022-08-13T09:20:20Z,2022-08-13T09:21:09Z',
 '2022-08-15T12:24:26Z,2022-08-15T12:25:19Z',
 '2022-08-15T13:56:56Z,2022-08-15T14:00:32Z',
 '2022-08-15T15:29:26Z,2022-08-15T15:32:58Z',
 '2022-08-16T06:59:19Z,2022-08-16T07:00:57Z',
 '2022-08-16T08:31:48Z,2022-08-16T08:35:18Z',
 '2022-08-16T10:04:50Z,2022-08-16T10:07:42Z',
 '2022-08-17T01:34:03Z,2022-08-17T01:36:30Z',
 '2022-08-17T03:06:30Z,2022-08-17T03:10:19Z',
 '2022-08-17T04:40:30Z,2022-08-17T

---
My current concern with these results is that even for a ~30-day run, only one day's worth of files are identified. I'm skeptical about whether it's actually returning all the files that exist between the pairs of EMIT acquisition dates. 

In [22]:
page_num = 1
page_size = 2000 # CMR page size limit

granule_arr = []

for idx, opptime in enumerate(temporal_str):
    while True:

         # defining parameters
        cmr_param = {
            "collection_concept_id": concept_id, 
            "page_size": page_size,
            "page_num": page_num,
            "temporal": opptime,
            "pretty": "TRUE"
        }

        granulesearch = cmrurl + 'granules.json'
        response = requests.post(granulesearch, data=cmr_param)
        granules = response.json()['feed']['entry']

        if granules:
            for g in granules:
                granule_urls = ''

                # read cloud cover - relevant to EMIT available metadata
                cloud_cover = g['cloud_cover']

                # Get https URLs to .nc files and exclude .dmrpp files
                granule_urls = [x['href'] for x in g['links'] if 'https' in x['href'] and '.nc' in x['href'] and '.dmrpp' not in x['href']]

                # Add to list
                granule_arr.append([target1.index[idx], opptime, granule_urls, cloud_cover])

            page_num += 1
        else: 
            break
        
    # print(granule_arr)


In [23]:
granule_arr

[[15,
  '2022-08-15T06:12:06Z,2022-08-15T06:15:29Z',
  ['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL1BRAD.001/EMIT_L1B_RAD_001_20220815T061309_2222704_019/EMIT_L1B_RAD_001_20220815T061309_2222704_019.nc',
   'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL1BRAD.001/EMIT_L1B_RAD_001_20220815T061309_2222704_019/EMIT_L1B_OBS_001_20220815T061309_2222704_019.nc'],
  '62'],
 [15,
  '2022-08-15T06:12:06Z,2022-08-15T06:15:29Z',
  ['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL1BRAD.001/EMIT_L1B_RAD_001_20220815T061321_2222704_020/EMIT_L1B_RAD_001_20220815T061321_2222704_020.nc',
   'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL1BRAD.001/EMIT_L1B_RAD_001_20220815T061321_2222704_020/EMIT_L1B_OBS_001_20220815T061321_2222704_020.nc'],
  '62'],
 [15,
  '2022-08-15T06:12:06Z,2022-08-15T06:15:29Z',
  ['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL1BRAD.001/EMIT_L1B_RAD_001_20220815T061333_222

In [24]:
# creating a pandas dataframe
target1_results_df = pd.DataFrame(granule_arr, columns=['Event_index', 'Times',"asset_url", "cloud_cover"])
# Expand so each row contains a single url 
target1_results_df = target1_results_df.explode('asset_url')
# Name each asset based on filename
target1_results_df.insert(2,'asset_name', target1_results_df.asset_url.str.split('/',n=-1).str.get(-1))
target1_results_df.insert(1, 'StartTime', target1_results_df.Times.str.split(',').str.get(0))
target1_results_df.insert(2, 'EndTime', target1_results_df.Times.str.split(',').str.get(-1))
target1_results_df.drop(columns='Times', inplace= True)

---

## Search for Target 2: (e.g. CERES Files)

In [25]:
# Only need the files for target 2 that are tied to and event for which there
# where files for target1 (especially relevant if an instrument doesn't take obs 100% of the time)
unique_events = target1_results_df.Event_index.unique()
target2 = target2.iloc[unique_events].copy()

In [26]:
temporal_str = []
temporal_str = [s + ','+ e for s,e in zip(target2.StartTime, target2.EndTime)]

In [27]:
temporal_str

['2022-08-15T06:20:00Z,2022-08-15T06:21:40Z']

In [28]:
# doi = '10.5067/NOAA20/CERES/SSF-FM6_L2.001B'# CERES FM6 on NOAA-20 SSF 

# CMR API base url
cmrurl='https://cmr.earthdata.nasa.gov/search/' 

doisearch = cmrurl + 'collections.json?doi=' + doi_target2
concept_id = requests.get(doisearch).json()['feed']['entry'][0]['id']
print(concept_id)

C2246001744-LARC_ASDC


In [29]:
page_num = 1
page_size = 2000 # CMR page size limit

granule_arr = []

for idx, opptime in enumerate(temporal_str):
    while True:

         # defining parameters
        cmr_param = {
            "collection_concept_id": concept_id, 
            "page_size": page_size,
            "page_num": page_num,
            "temporal": opptime,
            "pretty": "TRUE"
        }

        granulesearch = cmrurl + 'granules.json'
        response = requests.post(granulesearch, data=cmr_param)
        granules = response.json()['feed']['entry']

        if granules:
            for g in granules:
                granule_urls = ''

                # read cloud cover - relevant to EMIT available metadata
                #cloud_cover = g['cloud_cover']

                # Get https URLs to .nc files and exclude .dmrpp files
                granule_urls = [x['href'] for x in g['links'] if 'https' in x['href'] and '.nc' in x['href'] and '.dmrpp' not in x['href']]

                # Add to list
                granule_arr.append([target2.index[idx], opptime, granule_urls])

            page_num += 1
        else: 
            break
    # print(granule_arr)


In [30]:
# creating a pandas dataframe
target2_results_df = pd.DataFrame(granule_arr, columns=['Event_index', 'Times',"asset_url"])
# Expand so each row contains a single url 
target2_results_df = target2_results_df.explode('asset_url')
# Name each asset based on filename
target2_results_df.insert(2,'asset_name', target2_results_df.asset_url.str.split('/',n=-1).str.get(-1))
target2_results_df.insert(1, 'StartTime', target2_results_df.Times.str.split(',').str.get(0))
target2_results_df.insert(2, 'EndTime', target2_results_df.Times.str.split(',').str.get(-1))
target2_results_df.drop(columns='Times', inplace= True)

In [31]:
# note that there is an *OBS* and a *RAD* file for each acquisition (it looks like there's a duplication, but there isn't)
target1_results_df

Unnamed: 0,Event_index,StartTime,EndTime,asset_name,asset_url,cloud_cover
0,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_RAD_001_20220815T061309_2222704_019.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,62
0,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_OBS_001_20220815T061309_2222704_019.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,62
1,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_RAD_001_20220815T061321_2222704_020.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,62
1,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_OBS_001_20220815T061321_2222704_020.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,62
2,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_RAD_001_20220815T061333_2222704_021.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,25
2,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_OBS_001_20220815T061333_2222704_021.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,25
3,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_RAD_001_20220815T061345_2222704_022.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,55
3,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_OBS_001_20220815T061345_2222704_022.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,55
4,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_RAD_001_20220815T061356_2222704_023.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,90
4,15,2022-08-15T06:12:06Z,2022-08-15T06:15:29Z,EMIT_L1B_OBS_001_20220815T061356_2222704_023.nc,https://data.lpdaac.earthdatacloud.nasa.gov/lp...,90


In [32]:
target2_results_df

Unnamed: 0,Event_index,StartTime,EndTime,asset_name,asset_url
0,15,2022-08-15T06:20:00Z,2022-08-15T06:21:40Z,CER_SSF_NOAA20-FM6-VIIRS_Edition1B_100102.2022...,https://asdc.larc.nasa.gov/data/CERES/SSF/NOAA...
