<a href="https://colab.research.google.com/github/coraldelmarvr/literate-goggles/blob/main/search_data/ESS%20PI%20Meeting%202025%20Using%20Data%20-%20Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to ESS-DIVE's Using Data Tutorial with Jupyter Notebook
This Jupyter Notebook is a workflow to help data users find and access ESS-DIVE datasets, particularly those that employ file-level metadata and csv reporting formats. The workflow includes: <br>
**[Part 1: Searching for Data](#-Part-1-Searching-on-ESS-DIVE)**

    Use the ESS-DIVE Dataset API to search for dataset files
**[Part 2: Exploring Inside Datasets](#-Part-2-Exploring-Inside-Datasets)**
    
    Basic searching inside datasets - look at individual files
    Use API tools and dataset details to explore within a dataset - using File-level Metadata (flmd) and Data Dictionaries (DD)
    Import data from csv files into python pandas dataframes
**[Part 3: Starting Analysis](#-Part-3-Starting-Analysis)**
    
    Create simple visualizations with the data
**[Part 4: Download Files and Log](#-Part-4-Download-Files-and-Save-the-Download-Log)**

    Download files to local storage and log access details
**[Part 5: Workflow Using Deep Dive API](#-Part-5-Workflow-Using-Deep-Dive-API)**

    Try using the Fusion database and the Deep Dive API as an alternative for limited Search and deep Exploration
**[EXTRA: Extra Resources](#-Part-5-Extra-Resources-and-Examples)**
    
    Explore Sample Metadata to explore datasets with sample-based data
    And more!

This was created as a resource to the PI Meeting 2025 ESS-DIVE Using Data Tutorial.

Written By: Emily Nagamoto (she/her, LBNL), Danielle S Christianson (she/her, LBNL)

Acknowledgements: This notebook builds from the 2024 ESS-DIVE Workshop [Using Data tutorial](https://github.com/ess-dive/essdive-tutorials/blob/main/search_data/Using_Data_with_Dataset_DeepDiveAPI_Python.ipynb), Danielle Christianson's [Finding and Accessing Data notebook](https://github.com/ess-dive/essdive-tutorials/blob/main/search_data/Tutorial_FindingAccessingData.ipynb), and Madison Burrus and Valerie Hendrix's Search & Download notebook.

Last updated: 04/14/2025

## README: How to use this notebook
Run the cells in sequential order. The notebook is designed so that you can run every cell without changing anything, or you can enter your own inputs into cells marked with <strong><span style="color:blue">Enter INPUT</span></strong>. If a cell is not marked with <strong><span style="color:blue">Enter INPUT</span></strong> or is marked with <strong><span style="color:green">Run Cell</span></strong>, then just run the cell without making changes.

Optional view cells are marked with "Optional" in the first line. These do not need to be run, but are included for additional visualization or guidance.

Any downloaded files are logged with the date/time of access. See Part 4 to save the log.

Workflows:
* Cells in **Part 1-4** are sequential and depend on variables entered in prior cells.
* To use **Part 5**: *Section A* replicates **Part 1** and *Section B* replicates **Part 2-3** using Deep Dive. To save the data, you can modify **Part 4**.
* **EXTRA** requires a different notebook - [Finding and Accessing Data notebook](https://github.com/ess-dive/essdive-tutorials/blob/main/search_data/Tutorial_FindingAccessingData.ipynb).


# SET-UP - Run before any other cells.

### 1. Load packages that will be used later.

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# This notebook requires Python 3.
import csv
import datetime as dt
import io
import json
import os
import pandas as pd
import requests
import urllib
import matplotlib.pyplot as plt
%matplotlib inline

from pathlib import Path
from urllib.request import Request, urlopen, urlretrieve
from zipfile import ZipFile


### 2. Configure authentication

<strong><span style="color:green">Run Cell</span></strong> <br>
1. Go to ESS-DIVE (https://data.ess-dive.lbl.gov/data), login with your ORCID, and copy your authentication token from your account settings page.
2. Run the following code cell.
3. Paste your authentication token into the prompt as requested. Hit `Enter` key.

   _Always re-run this code cell when you update your token. Tokens expire every 24 hours._

In [3]:
token = input('Token: ')

essdive_api_url = 'https://api.ess-dive.lbl.gov'

essdive_direct_url = 'https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/'

essdive_deepdive_url = 'https://fusion.ess-dive.lbl.gov'

print('Success! Token is loaded.')

Token: <authentication token redacted>
Success! Token is loaded.
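
Because tokens expire, it can help to check when a token will stop working before running the rest of the notebook. The sketch below assumes the ESS-DIVE token is a JSON Web Token (JWT) that carries a standard `exp` claim; that is an assumption, not something the ESS-DIVE documentation is quoted on here. The helper names `token_expiry` and `_make_demo_token` are hypothetical, and the demo token is made up and unsigned.

```python
import base64
import datetime as dt
import json


def token_expiry(jwt_token):
    """Return the expiry time (UTC) stored in a JWT's `exp` claim, or None."""
    try:
        payload_b64 = jwt_token.split('.')[1]
        # base64url payloads drop their padding; restore it before decoding
        payload_b64 += '=' * (-len(payload_b64) % 4)
        payload = json.loads(base64.urlsafe_b64decode(payload_b64))
        return dt.datetime.fromtimestamp(payload['exp'], tz=dt.timezone.utc)
    except (IndexError, KeyError, TypeError, ValueError):
        # Not a three-part token, not valid base64/JSON, or no `exp` claim
        return None


def _make_demo_token(exp_timestamp):
    """Build an unsigned, made-up token for demonstration only."""
    def b64(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b'=').decode()
    return f"{b64({'alg': 'none'})}.{b64({'exp': exp_timestamp})}.signature"


demo_token = _make_demo_token(1700000000)
expires_at = token_expiry(demo_token)
print(f'Demo token expires at {expires_at.isoformat()}')
```

With a real token, `token_expiry(token)` could be compared against `dt.datetime.now(dt.timezone.utc)` to decide whether to fetch a fresh token first.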


### 3. Configure local storage for downloads

This cell creates a new folder, `ESS-DIVE_Tutorial_Downloads`, in the current working directory and uses it as the location for any downloaded files.

<strong><span style="color:green">Run Cell</span></strong>

In [4]:
# make new folder in current local directory
new_dir = 'ESS-DIVE_Tutorial_Downloads'
parent_dir = os.getcwd()
download_dir_path = Path(os.path.join(parent_dir, new_dir))
try:
    os.mkdir(download_dir_path)
    print("Directory '% s' created" % new_dir)
except FileExistsError:
    # Only an already-existing folder is expected here; other errors should surface
    print("This directory already exists.")

if download_dir_path.exists():
    print(f'Success! Local directory {download_dir_path} configured for downloads')
    print('===================================')
    current_files = [x for x in os.listdir(download_dir_path) if x != '.DS_Store']
    if current_files:
        print(f'Local directory contains {len(current_files)} file(s).')
    else:
        print('Local directory is currently empty.')
else:
    print(f'Cannot find local directory {download_dir_path}. Please try again.')

# create the file download log
download_file_log = {}
print('===================================')
print('Downloaded files will be logged in the dictionary object "download_file_log".\n'
      'You can save this dictionary as a file later in the notebook.\n'
      'The filename, file url, and datetime accessed are recorded as a tuple in the "downloaded_files" element.')


Directory 'ESS-DIVE_Tutorial_Downloads' created
Success! Local directory /content/ESS-DIVE_Tutorial_Downloads configured for downloads
Local directory is currently empty.
Downloaded files will be logged in the dictionary object "download_file_log".
You can save this dictionary as a file later in the notebook.
The filename, file url, and datetime accessed are recorded as a tuple in the "downloaded_files" element.
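
As an alternative to the `try`/`except` around `os.mkdir`, the same folder set-up can be sketched with `pathlib` alone. The function name `ensure_download_dir` is hypothetical, and the example runs inside a temporary directory so that it does not touch your real working directory.

```python
import tempfile
from pathlib import Path


def ensure_download_dir(parent_dir, folder_name='ESS-DIVE_Tutorial_Downloads'):
    """Create the download folder if needed; return (path, was_created)."""
    download_dir = Path(parent_dir) / folder_name
    already_there = download_dir.is_dir()
    # exist_ok=True makes the call safe to repeat on every notebook run
    download_dir.mkdir(parents=True, exist_ok=True)
    return download_dir, not already_there


with tempfile.TemporaryDirectory() as scratch:
    first_path, first_created = ensure_download_dir(scratch)
    second_path, second_created = ensure_download_dir(scratch)
    print(f'First call created the folder: {first_created}')
    print(f'Second call created the folder: {second_created}')
```

The second call finds the folder already in place, so re-running the set-up cell does not raise an error or overwrite existing downloads.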


### 4. Load general functions

These are helper functions that we made to make printing information, creating pandas dataframes, and calling the API easier. Feel free to copy these functions to other notebooks as needed. Once you run the following cell, the functions can be used at any point in the workflow.

<strong><span style="color:green">Run Cell</span></strong>

In [5]:

def get_request(filename, f_url, stream=True):
    """
    Get request for file, and stream the content back
    """

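    # 'User-Agent' is the standard HTTP header name ('user_agent' is sent as an unrelated custom header)
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
               'content-type': 'application/json'}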
    try:
        r = requests.get(f_url, headers=headers, verify=True, stream=stream)
        status_code = r.status_code
        if status_code == 200:
            return r
        else:
            print(f"{filename} request returned {status_code}")
            return None
    except Exception as e:
        print(f"{filename} request unsuccessful: {e}")
        return None


def make_store(file_request, use_idx=True, print_headers=True):
    """
    Read response and make store
    """
    file_store = {}
    csv_reader = csv.DictReader(file_request.iter_lines(decode_unicode=True))

    for idx, row in enumerate(csv_reader):
        if use_idx:
            file_store.update({f'Index {idx}': row})
            continue
        fn = row.get('File_Name')
        file_store.update({fn: row})

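    # Use the reader's fieldnames so an empty file does not raise a NameError on `row`
    headers = list(csv_reader.fieldnames or [])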
    if print_headers:
        print(f"File headers: {headers}")
    return headers, file_store


def inspect_dataset_distribution(dataset_detail, file_type='all'):

    print(dataset_detail.get('name'))
    print('========================================')

    count = 0
    dist = dataset_detail.get('distribution')

    for idx, file_info in enumerate(dist):
        fn = file_info.get('name')
        fn_url = file_info.get('contentUrl')
        f_encoding = file_info.get('encodingFormat')
        if file_type != 'all' and file_type not in f_encoding:
            continue
        print(f'Index {idx}: {fn}\n  encoding: {f_encoding}\n  url: {fn_url}')
        count += 1

    if count == 0:
        print(f'No files found that match the file_type: "{file_type}" criteria.')


def retrieve_file_from_essdive(file_url, file_path):
    """ Retrieve the data file
        file_path includes file name.
    """
    error_messages = []
    try:
        urlretrieve(file_url, file_path)
        return True, None
    except Exception as e:
        error_messages.append(f'Attempt 1 (no auth) failed: {e}')
    try:
        req = Request(file_url)
        req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0')
        with urllib.request.urlopen(req) as response:
            with open(file_path, 'wb') as out_file:
                out_file.write(response.read())
        return True, None
    except Exception as e:
        error_messages.append(f'Attempt 2 (no auth) failed: {e}')
    try:
        headers={"Authorization": f"Bearer {token}"}
        request = urllib.request.Request(file_url, headers=headers)

        with urllib.request.urlopen(request) as response:
            with open(file_path, 'wb') as out_file:
                out_file.write(response.read())
        return True, None
    except urllib.error.HTTPError as e:
        error_messages.append(f'Attempt 3 (with token) failed: HTTP Error {e.code}: {e.reason}')
    except Exception as e:
        error_messages.append(f'Attempt 3 (with token) failed: {str(e)}')
    # All attempts failed: always return a (status, message) tuple so callers can unpack it
    return False, ' | '.join(error_messages)


def download_selected_files(dataset_detail, file_indices, file_dir=download_dir_path, log_store=download_file_log, citation=None,
                            is_csv_zipped=False, zip_download=None, zip_member_fn=None):
    dist = dataset_detail.get('distribution')
    ds_id = dataset_detail.get('@id')
    #citation = dataset_detail.get('citation') << grabs related references but not the citation of the downloaded file
    citation = citation
    ds_name = dataset_detail.get('name')

    if log_store is None:
        log_store = {}

    log_store.setdefault(ds_id, {'@id': ds_id, 'name': ds_name, 'citation': citation, 'downloaded_files': []})
    ds_file_log = log_store.get(ds_id).get('downloaded_files')

    print(f'Saving files in {file_dir}')
    print("-------------------------------------")

    for idx, file_info in enumerate(dist):
        msg = None
        is_downloaded = None

        if idx not in file_indices:
            continue

        fn = file_info.get('name')
        # Honor the file_dir argument instead of the global download_dir_path
        file_path = Path(file_dir) / fn
        fn_url = file_info.get('contentUrl')

        if not is_csv_zipped:

            download_ts = dt.datetime.now().isoformat()
            is_downloaded, msg = retrieve_file_from_essdive(fn_url, file_path)

        else:
            if not zip_download or not zip_member_fn:
                print('ZipFile object and zipped member file name are required. Try again.')
                return None
            try:
                zip_download.extract(zip_member_fn, path=file_path)
                if Path.exists(file_path / zip_member_fn):
                    is_downloaded = True
                    download_ts = dt.datetime.now().isoformat()
                else:
                    msg = f'Extraction of {zip_member_fn} from {fn} was not successful.'
            except Exception as e:
                msg = f'ERROR attempting to extract {zip_member_fn} from {fn}: {e}'

        if is_downloaded:
            print(f'--- {fn} downloaded')
            ds_file_log.append((fn, fn_url, download_ts))
        else:
            print(msg)

    print("-------------------------------------")
    print(f'Remember to cite these files! Dataset DOI {ds_id}, \nDataset citation: {citation}')
    return ds_id


def inspect_zip_file_contents(dataset_detail, file_idx):
    dist = dataset_detail.get('distribution')
    file_info = dist[file_idx]

    if not file_info:
        print('File index not found. Please try again.')
        return None, None

    fn = file_info.get('name')
    if 'zip' not in file_info.get('encodingFormat'):
        print(f'{fn} is not encoded as a zip file. Please select a different file.')
        return None, None

    fn_url = file_info.get('contentUrl')

    try:
        # Create a request with headers
        req = Request(fn_url)
        req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0')
        # Open the URL with the added headers
        resp = urlopen(req)
        zip_download = ZipFile(io.BytesIO(resp.read()))
        print('Success!')
    except urllib.error.HTTPError as e:
        print(f'HTTPError: {e.code} - {e.reason}')
        # Without a download there is nothing to list, so stop here
        return None, None

    print(f'{fn} contents:')
    print('=================================')
    for idx, file_member in enumerate(zip_download.namelist()):
        print(f'Index {idx}: {file_member}')

    return fn, zip_download


def read_zipped_csv(zip_file_obj, csv_file_name, header_rows=1):
    # with open(zip_file_obj, mode='r') as z:
    #     csv_df = pd.read_csv(io.BytesIO(z.read(csv_file_name)))
    # Read from the ZipFile passed in, not from a global variable
    csv_df = pd.read_csv(zip_file_obj.open(csv_file_name), skiprows=header_rows)
    return csv_df


def grab_metadata(r_json): # for fusiondb
    df = pd.DataFrame()
    records = []

    for dataset in r_json:
        field_name = dataset['field_name']
        unit = dataset['unit']
        definition = dataset['definition']
        data_type = dataset['data_type']
        total_record_count = dataset['total_record_count']
        values_summary = dataset['values_summary']
        doi = dataset['doi']
        url = dataset['data_file_url']
        data_file = dataset['data_file']
        report={'Field_name':field_name, 'Unit':unit, 'Definition':definition, 'Data_type':data_type,
                'Total_records':total_record_count,'Values':values_summary,'DOI':doi,
                'URL':url,'File':data_file }
        records.append(report)

    df = pd.DataFrame(records)
    return df

# Change dataframe display options to better visualize the results
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.colheader_justify', 'left')

print('Functions loaded.')

Functions loaded.
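
To see what `make_store` builds without calling the API, the sketch below runs the same `csv.DictReader` pattern on a small in-memory file. The file contents are made up for illustration; only the `File_Name` column name comes from the helper function above.

```python
import csv
import io

# Made-up stand-in for the text of a File Level Metadata csv file
sample_flmd = io.StringIO(
    "File_Name,File_Description\n"
    "site_data.csv,Hypothetical measurements file\n"
    "site_dd.csv,Hypothetical data dictionary\n"
)

reader = csv.DictReader(sample_flmd)
by_index = {}  # keyed like make_store(..., use_idx=True)
by_name = {}   # keyed like make_store(..., use_idx=False)
for idx, row in enumerate(reader):
    by_index[f'Index {idx}'] = row
    by_name[row['File_Name']] = row

headers = list(reader.fieldnames)
print(f'File headers: {headers}')
print(by_index['Index 1']['File_Name'])
```

Keying by index is convenient when choosing files interactively, while keying by `File_Name` is convenient when you already know which file you want.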


---

# Part 1: Searching on ESS-DIVE

## (A) Use the Dataset API tool
Run this section to find datasets with the Dataset API tool. This section results in a list of candidate datasets, which are later classified by whether they contain structured data.

Use the ESS-DIVE Dataset API to search for datasets of interest.

You can search for datasets using any of the following parameters:
- Dataset Creator (**creator**): The creator/submitter of datasets
- Date Published (**datePublished**): This is the date range of the publication of a package.
- Project Name (**providerName**): The dataset project/provider that is set in the metadata.
- Any text (**text**): Searches any metadata field that contains the passed text
- Keywords (**keywords**): Search for datasets that have an exact match for all the given keywords.
- Public datasets only (**isPublic**): If set with true, would only return public packages.

**See additional details for dataset search in the ESS-DIVE package API technical documentation:** https://api.ess-dive.lbl.gov/#/Data%20Package/listPackages.

Use the [ESS-DIVE's project list](https://docs.google.com/spreadsheets/d/179SOyv42wXbP4owWZtUg3RqhW9dPOyENYcVYuUCcqwg/edit?usp=sharing) to find the options for project names.
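
Search values such as a date range contain spaces and brackets, so it can be safer to percent-encode them when building the query string. The sketch below uses the standard library's `urlencode` with the parameter names listed above; the helper name `build_packages_url` is hypothetical, and the snippet only builds the URL without sending a request.

```python
from urllib.parse import quote, urlencode

ESSDIVE_API_URL = 'https://api.ess-dive.lbl.gov'


def build_packages_url(**search_terms):
    """Build a packages search URL, skipping parameters that are None."""
    query = {key: value for key, value in search_terms.items() if value is not None}
    # quote_via=quote encodes spaces as %20 rather than '+'
    return f"{ESSDIVE_API_URL}/packages?{urlencode(query, quote_via=quote)}"


example_url = build_packages_url(
    creator='Serbin',
    text='G-LiHT',
    datePublished='[2019 TO 2021]',
    providerName=None,  # unused parameters are dropped from the query
    isPublic='true',
)
print(example_url)
```

The `requests` library can do the same encoding if the parameters are passed as a dictionary through its `params` argument.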

### 1. Enter Search Parameters and make API call
<strong><span style="color:blue">Enter INPUT</span></strong>

In [8]:
# Enter search terms: "\"Leaf\"" is an exact match, "Leaf" is any match
# creator is the last name of the author or submitter
creator="Serbin"
text= "G-LiHT"
datePublished = "[2019 TO 2021]"  # "<[YYYY TO YYYY-MM-DD]>" # Not the same as data coverage

<strong><span style="color:green">Run Cell</span></strong>

In [9]:
# Construct URL query to send to the ESS-DIVE packages API
get_packages_response = f"{essdive_api_url}/packages?creator={creator}&text={text}&datePublished={datePublished}&isPublic=true"

# Send request to API
response = requests.get(get_packages_response, headers={"Authorization": f"Bearer {token}"})

# Review the response and debug if needed
if response.status_code == 200:
    # Success
    response_json = response.json()
    print("Success! Continue to look at the search results")
else:
    # There was an error
    print("There was an error. Stop here and debug the issue. Email ess-dive-support@lbl.gov if you need assistance. \n")
    print(response.text)

Success! Continue to look at the search results


### 2. Inspect the search results

<strong><span style="color:green">Run Cell</span></strong>

In [10]:
# Here is a formatted version of what the response returns
search_record_total = response_json['total']
print(f"Datasets found: {search_record_total}")

if search_record_total > 100:
    print("The search API cannot return more than 100 results at a time. See documentation for how to paginate.")

candidate_datasets = response_json['result']

for idx, dataset in enumerate(candidate_datasets):
    print('-------------------')
    print(f'Index: {idx}')
    print(dataset.get('dataset').get('name'))
    print(dataset.get('url'))
    print(dataset.get('viewUrl'))
    print(dataset.get('citation'))


Datasets found: 17
-------------------
Index: 0
Evaluation of the One-point Method for Estimating Carboxylation Capacity, Utqiagvik (Barrow), Alaska and Upton, New York, 2018
https://api.ess-dive.lbl.gov/packages/ess-dive-92b6d7de4f6722c-20250320T171720679
https://data.ess-dive.lbl.gov/view/doi:10.5440/1506965
Burnett A; Ely K; Davidson K; Serbin S; Rogers A (2019): Evaluation of the One-point Method for Estimating Carboxylation Capacity, Utqiagvik (Barrow), Alaska and Upton, New York, 2018. Next-Generation Ecosystem Experiments (NGEE) Arctic. Dataset. doi:10.5440/1506965
-------------------
Index: 1
A Multi-Sensor Unoccupied Aerial System Improves Characterization of Vegetation Composition and Canopy Properties in the Arctic Tundra: Supporting Data
https://api.ess-dive.lbl.gov/packages/ess-dive-3a53e1b91b26596-20250318T221944548
https://data.ess-dive.lbl.gov/view/doi:10.5440/1647365
Serbin S; Yang D; McMahon A (2020): A Multi-Sensor Unoccupied Aerial System Improves Characterization o

#### ***Optional***: Want to see what the JSON response looks like? Run the cell below.
This cell will be available for most calls that we make.

In [11]:
# Optional: display entire response
# ===================================
display(response_json)

{'total': 17,
 'user': 'http://orcid.org/0009-0003-8440-583X',
 'query': {'isPublic': True,
  'creator': 'Serbin',
  'providerName': None,
  'text': 'G-LiHT',
  'datePublished': '[2019 TO 2021]',
  'keywords': None},
 'pageSize': 25,
 'rowStart': 1,
 'result': [{'id': 'ess-dive-92b6d7de4f6722c-20250320T171720679',
   'viewUrl': 'https://data.ess-dive.lbl.gov/view/doi:10.5440/1506965',
   'url': 'https://api.ess-dive.lbl.gov/packages/ess-dive-92b6d7de4f6722c-20250320T171720679',
   'next': None,
   'previous': 'https://api.ess-dive.lbl.gov/packages/ess-dive-246e76462718539-20250319T191515216',
   'dateUploaded': '2025-03-20T17:17:22.626Z',
   'dateModified': '2025-03-21T23:05:28.191Z',
   'isPublic': True,
   'citation': 'Burnett A; Ely K; Davidson K; Serbin S; Rogers A (2019): Evaluation of the One-point Method for Estimating Carboxylation Capacity, Utqiagvik (Barrow), Alaska and Upton, New York, 2018. Next-Generation Ecosystem Experiments (NGEE) Arctic. Dataset. doi:10.5440/1506965',
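
The response above reports `pageSize` and `rowStart`, which suggests how larger result sets could be paged through. The sketch below only computes the starting row for each page. It assumes that rows are numbered from 1 (as `'rowStart': 1` suggests) and that the API accepts query parameters with the same names as the response fields; check the API documentation before relying on either assumption. The function name `page_starts` is hypothetical.

```python
def page_starts(total, page_size=25, first_row=1):
    """Return the first row number of every page needed to cover `total` results."""
    return list(range(first_row, total + first_row, page_size))


# One page covers the 17 results from the search above
print(page_starts(17))

# A hypothetical search with 60 results would need three requests
page_params = [{'rowStart': start, 'pageSize': 25} for start in page_starts(60)]
for params in page_params:
    print(params)
```

Each dictionary in `page_params` could be merged into the search parameters of one request, with the `result` lists concatenated afterwards.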


### 3. Subset search results - Which datasets do we want to explore further?

<strong><span style="color:blue">Enter INPUT</span></strong>

In [12]:
# pick any that you are interested in
record_indices = [2,4,8,9]

<strong><span style="color:green">Run Cell</span></strong>

In [14]:
datasets = [candidate_datasets[x] for x in record_indices]
citations_list = {}
for idx, dataset in enumerate(datasets):
    print(f"{idx}: {dataset.get('dataset').get('name')}")
    # grab the citations of the datasets to store for future use - Remember to always cite data sources you use!
    citations_list.update({dataset.get('dataset').get('@id') : dataset.get('citation')})

0: G-LiHT Campaign Leaf Spectral Reflectance and Transmittance, Mar2017: Puerto Rico
1: G-LiHT Campaign Leaf Carbon and Nitrogen Content, Mar2017: Puerto Rico
2: G-LiHT Campaign Leaf Mass Area and Water Content, Mar2017: Puerto Rico
3: G-LiHT Campaign Leaf Sample details & photos, March 2017: Puerto Rico


### Let's also grab the DOI for each dataset

<strong><span style="color:green">Run Cell</span></strong>

In [15]:
# Grab the DOIs for our selected datasets

total_doi_array = []
for idx, dataset in enumerate(datasets):
    print(f"{idx}: {dataset.get('dataset').get('@id')}, {dataset.get('dataset').get('name')[:25]}...")
    total_doi_array.append(dataset.get('dataset').get('@id'))

0: doi:10.15486/NGT/1495204, G-LiHT Campaign Leaf Spec...
1: doi:10.15486/NGT/1905770, G-LiHT Campaign Leaf Carb...
2: doi:10.15486/NGT/1495202, G-LiHT Campaign Leaf Mass...
3: doi:10.15486/NGT/1781005, G-LiHT Campaign Leaf Samp...
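
The `@id` values use a `doi:` prefix. If you want a link you can open in a browser, one option is to convert each identifier into a URL for the public doi.org resolver. The helper name `doi_to_url` is hypothetical.

```python
def doi_to_url(doi_id):
    """Turn an identifier like 'doi:10.x/y' into a https://doi.org/ link."""
    prefix = 'doi:'
    bare_doi = doi_id[len(prefix):] if doi_id.lower().startswith(prefix) else doi_id
    return f'https://doi.org/{bare_doi}'


example_dois = ['doi:10.15486/NGT/1495204', 'doi:10.15486/NGT/1905770']
doi_links = [doi_to_url(d) for d in example_dois]
for link in doi_links:
    print(link)
```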


### Great! We selected 4 datasets that may be relevant to our science interest - let's move on from Searching for Data.


---
# Part 2: Exploring Inside Datasets
Let's look inside the datasets we are interested in. <br>
Some datasets follow the File Level Metadata Reporting Format and are structured with File Level Metadata (FLMDs) while some are not. Depending on the file structure, we can approach further exploration differently. <br>
First, we'll grab the **dataset details**, then we'll see whether the data has **FLMDs** readily available. Then we can try to **explore within the dataset** to see if it is useful for our science interests.

### 1. Get dataset details using ESS-DIVE Dataset API

Use the ESS-DIVE individual dataset search to get the details of each dataset, including its list of files. The results of the above search contain the URLs to retrieve the dataset details in the field: `url`.

The `get_dataset_details` method is a helper function that uses the same _requests.get_ from 'Step 1: Enter Search Parameters and make API call'.

**See more details for the individual dataset search in the ESS-DIVE package API technical documentation:** https://api.ess-dive.lbl.gov/#/Dataset/getDataset.

<strong><span style="color:green">Run Cell - Helper Function</span></strong>

In [16]:
# load this helper function that makes the same GET call to the API, but for a single dataset
def get_dataset_details(dataset_url):
    """Request one dataset's details; return its 'dataset' JSON, or None on failure."""
    response_status = None
    try:
        dataset_response = requests.get(dataset_url, headers={"Authorization": f"Bearer {token}"})
        response_status = dataset_response.status_code
    except Exception as e:
        # Report the URL that was requested rather than relying on a global variable
        print(f"Request to {dataset_url} did not have a successful return: {e}")
        return None

    # If successful response, return the dataset details
    if response_status == 200:
        dataset_json = dataset_response.json()['dataset']
        print(f"--- Acquired details for {dataset_json.get('name')}")
        return dataset_json
    elif response_status:
        print(f"Response status {response_status}: {dataset_response.text}")
    else:
        print("Response status unavailable. Response cannot be interpreted. Debug required.")
    return None
print('Function loaded.')

Function loaded.


<strong><span style="color:green">Run Cell</span></strong>

In [17]:
# Store the dataset details in a list
dataset_details = []

for dataset in datasets:
    dataset_url = dataset.get('url')
    # see details for the get_dataset_details helper method in the cell above
    dataset_detail_json = get_dataset_details(dataset_url)
    if dataset_detail_json:
        dataset_details.append(dataset_detail_json)

print("=====================================")
print(f"Details acquired for {len(dataset_details)} datasets.")

--- Acquired details for G-LiHT Campaign Leaf Spectral Reflectance and Transmittance, Mar2017: Puerto Rico
--- Acquired details for G-LiHT Campaign Leaf Carbon and Nitrogen Content, Mar2017: Puerto Rico
--- Acquired details for G-LiHT Campaign Leaf Mass Area and Water Content, Mar2017: Puerto Rico
--- Acquired details for G-LiHT Campaign Leaf Sample details & photos, March 2017: Puerto Rico
Details acquired for 4 datasets.
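
Each dataset detail carries a `distribution` list describing its files. A quick way to get a feel for a dataset is to count its files by `encodingFormat`. The sketch below runs on a made-up dataset detail that mirrors only the fields used in this notebook; the function name `count_encodings` is hypothetical.

```python
from collections import Counter

# Made-up dataset detail with the same field names as the API response
mock_dataset_detail = {
    'name': 'Hypothetical dataset',
    'distribution': [
        {'name': 'flmd.csv', 'encodingFormat': 'text/csv',
         'contentUrl': 'https://example.org/flmd.csv'},
        {'name': 'leaf_data.csv', 'encodingFormat': 'text/csv',
         'contentUrl': 'https://example.org/leaf_data.csv'},
        {'name': 'photos.zip', 'encodingFormat': 'application/zip',
         'contentUrl': 'https://example.org/photos.zip'},
    ],
}


def count_encodings(dataset_detail):
    """Count the files in a dataset's distribution by encoding format."""
    dist = dataset_detail.get('distribution') or []
    return Counter(f.get('encodingFormat') or 'unknown' for f in dist)


encoding_counts = count_encodings(mock_dataset_detail)
for encoding, n_files in encoding_counts.most_common():
    print(f'{encoding}: {n_files}')
```

With real results, the same function could be applied to each item in `dataset_details`.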


#### ***Optional***: Want to see what the dataset details look like? Enter the index of the dataset you want to see in the brackets and run the cell.

In [23]:
# Optional: Run to display the details for one of the datasets you chose - change the number in the brackets to select a dataset
# The index is based on the number of selected datasets - in this example, 0-3
# ===================================
display(dataset_details[3])

{'@context': 'http://schema.org/',
 '@type': 'Dataset',
 '@id': 'doi:10.15486/NGT/1781005',
 'name': 'G-LiHT Campaign Leaf Sample details & photos, March 2017: Puerto Rico',
 'description': ['This data package includes details of leaves sampled for leaf spectra and chemistry from 5 sites in Puerto Rico, in March of 2017. Sunlit canopy and shaded leaves of 66 species were collected. Data for each sample includes species, leaf age, type of analysis (spectroscopy, gas exchange, chemistry), sample number and sample photographs. The data package includes a spreadsheet with sample information and a zip file of photographs (1.6 GB). This data was collected as part of the 2017 BNL–G-LiHT leaf spectra campaign. See related datasets for leaf spectral reflectance and transmittance, leaf mass area (LMA), and leaf chemistry. Note that leaf sample details are also included in related datasets.',
  'This dataset was originally published on the NGEE Tropics Archive and is being mirrored on ESS-DIVE fo

### 2. Which datasets have File Level Metadata (FLMD)?
Some datasets are structured with FLMDs and some are not. Depending on the file structure, we can approach further exploration differently.

#### Here is a helper function `assess_datasets_flmd_dd_csv_files` that inspects a list of datasets and searches each dataset's files for `flmd` files. It returns two lists of datasets: one for datasets that have a readily accessible FLMD (not in a zip file) and one for datasets that do not (either no FLMD, or the FLMD is in a zip file).
This function gives us a sense of which tools may be the most helpful in determining whether a dataset will be useful.

<strong><span style="color:green">Run Cell - Helper Function</span></strong>

In [24]:
def assess_datasets_flmd_dd_csv_files(dataset_details_list):
    """
    Find the datasets with flmd files.
    Sort each dataset's csv files into flmd files and other (potential data) files,
    then add both groups to the dataset details dictionary.
    """
    flmd_datasets_indices = set()
    flmd_dataset_details = []

    for idx, dataset in enumerate(dataset_details_list):
        file_list = dataset.get('distribution')
        flmd_url = {}
        csv_files = {}
        for f in file_list:
            encoding_format = f.get('encodingFormat')
            filename = f.get('name')
            url = f.get('contentUrl')

            if 'csv' not in encoding_format or url is None:
                continue
            if 'flmd' in filename:
                flmd_datasets_indices.add(idx)
                flmd_url.update({filename: url})
            else:
                csv_files.update({filename: url})

        dataset.update({
            'flmd_url': flmd_url,
            'csv_files': csv_files
        })

        if not flmd_url:
            dataset_name = dataset.get('name')
            print(f"No flmd found for dataset: {dataset_name}")

    print("=====================================")
    if len(flmd_datasets_indices) > 0:
        print(f'flmd found in {len(flmd_datasets_indices)} datasets')
        flmd_dataset_details = [dataset_details_list[x] for x in flmd_datasets_indices]
    else:
        print(f'No datasets in the search results have flmds.')

    no_flmd_dataset_details = [dataset_detail for idx, dataset_detail in enumerate(dataset_details_list) if idx not in flmd_datasets_indices]
    return flmd_dataset_details, no_flmd_dataset_details
print('Function loaded.')

Function loaded.


<strong><span style="color:green">Run Cell</span></strong>

In [27]:
# use the helper function assess_datasets_flmd_dd_csv_files to determine which files have readily accessible flmd
flmd_datasets, no_flmd_datasets = assess_datasets_flmd_dd_csv_files(dataset_details)

No flmd found for dataset: G-LiHT Campaign Leaf Spectral Reflectance and Transmittance, Mar2017: Puerto Rico
No flmd found for dataset: G-LiHT Campaign Leaf Carbon and Nitrogen Content, Mar2017: Puerto Rico
No flmd found for dataset: G-LiHT Campaign Leaf Mass Area and Water Content, Mar2017: Puerto Rico
No flmd found for dataset: G-LiHT Campaign Leaf Sample details & photos, March 2017: Puerto Rico
No datasets in the search results have flmds.
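
Note that the helper matches the lowercase text `flmd` in the file name, so a file named with different capitalization would be missed. The sketch below shows a case-insensitive variant of the same sorting step, run on made-up file entries. The function name `split_flmd_files` and the file names are hypothetical, and whether any real dataset names its FLMD this way is an assumption.

```python
def split_flmd_files(file_list):
    """Sort csv files into flmd files and other csv files, ignoring case."""
    flmd_files, other_csv_files = {}, {}
    for f in file_list:
        name = f.get('name') or ''
        encoding_format = f.get('encodingFormat') or ''
        url = f.get('contentUrl')
        # Skip anything that is not a csv or has no download URL
        if 'csv' not in encoding_format.lower() or url is None:
            continue
        target = flmd_files if 'flmd' in name.lower() else other_csv_files
        target[name] = url
    return flmd_files, other_csv_files


mock_files = [
    {'name': 'Site_FLMD.csv', 'encodingFormat': 'text/csv', 'contentUrl': 'https://example.org/1'},
    {'name': 'site_data.csv', 'encodingFormat': 'text/csv', 'contentUrl': 'https://example.org/2'},
    {'name': 'archive.zip', 'encodingFormat': 'application/zip', 'contentUrl': 'https://example.org/3'},
]
flmd_found, csv_found = split_flmd_files(mock_files)
print(f'flmd files: {list(flmd_found)}')
print(f'other csv files: {list(csv_found)}')
```

Files inside a zip archive are still not inspected here, which matches the behavior described for the helper above.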


## A. Manually inspect the FLMD and Data Dictionary (DD)
This section manually examines structured datasets through their FLMD and DD files, which may be useful for a variety of purposes. An alternative approach would be to use the DeepDive API, but here we can look at files that are not in the Fusion database (and therefore not parseable by DeepDive).  <br>

### 3. Choose dataset to inspect - from datasets with accessible FLMD

<strong><span style="color:blue">Enter INPUT</span></strong>

In [28]:
# Write in the index of the FLMD dataset you want to investigate
ds_idx = 0

<strong><span style="color:green">Run Cell - Helper Function and Print</span></strong>

In [29]:
dataset = flmd_datasets[ds_idx]

# helper function to print the dataset information
def print_dataset_info(d, info_fields=['@id', 'name', 'description', 'citation'], line_space=False):
    """
    Display basic dataset info for evaluation
    """
    for f in info_fields:
        value = d.get(f)
        if value is None:
            dataset_value = d.get('dataset')
            if dataset_value:
                value = dataset_value.get(f)
        if value:
            if f in ['flmd_url', 'csv_files']:
                print(f"--- {f}:")
                for filename, url in value.items():
                    print(f"    - {filename}")
                continue
            print(f"--- {f}: {value}")
            if line_space:
                print(" ")

print_dataset_info(dataset, info_fields=['@id', 'name', 'flmd_url'], line_space=True)

IndexError: list index out of range
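
The `IndexError` above is raised because `flmd_datasets` is empty for this search (no dataset had an FLMD), so index 0 does not exist. A minimal sketch of a guard that could be run before indexing; `pick_dataset` is a hypothetical helper and is not part of the notebook's own functions:

```python
def pick_dataset(datasets, idx):
    """Return datasets[idx], or None with a message when the index is unavailable.

    Hypothetical helper for illustration only.
    """
    if not datasets:
        print("No datasets with FLMD were found; skip to Section B.")
        return None
    if not 0 <= idx < len(datasets):
        print(f"Index {idx} is out of range; choose 0 to {len(datasets) - 1}.")
        return None
    return datasets[idx]


# Example with an empty list, as in the search above
dataset = pick_dataset([], 0)
```

With a guard like this, the cell prints a hint instead of stopping with a traceback.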

### 4. Select and read flmd

_If multiple flmd files exist in the dataset, run the cell below as many times as needed changing the index._

<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
# Select index of the FLMD you want to use
flmd_file_idx = 0

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# read the flmd
flmd_name, flmd_url = list(dataset.get('flmd_url').items())[flmd_file_idx]
print(f"{flmd_name}: {flmd_url}")
print('-------------------------')

flmd_response = get_request(flmd_name, flmd_url)

flmd_headers, flmd_store = make_store(flmd_response)
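
The helper `make_store` is defined earlier in the notebook and takes the response object. Judging from how its output is used in the following steps (for example `flmd_store["Index 1"].get('File_Name')`), it appears to return the column headers plus a dictionary keyed by `"Index N"`. A rough sketch of that shape, assuming plain CSV text as input; `make_store_sketch` and the sample rows are hypothetical, and the real helper may differ:

```python
import csv
import io


def make_store_sketch(csv_text):
    """Parse CSV text into (headers, store), with store keyed by 'Index N'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = list(reader.fieldnames or [])
    store = {f"Index {i}": row for i, row in enumerate(reader)}
    return headers, store


# Hypothetical FLMD content for illustration
sample_flmd = (
    "File_Name,File_Description\n"
    "flmd.csv,File-level metadata\n"
    "dd.csv,Data dictionary\n"
)
headers, store = make_store_sketch(sample_flmd)
print(headers)
print(store["Index 1"].get("File_Name"))
```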

### 5. View dataset files listed in flmd

<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
# Enter flmd fields to view (File name automatically included):
flmd_header_indices = [1, -2]

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# print dataset files in flmd
for idx, flmd_info in flmd_store.items():
    print(f"{idx}: {flmd_info.get(flmd_headers[0])}")
    for flmd_idx in flmd_header_indices:
        print(f"-- {flmd_headers[flmd_idx]}: {flmd_info.get(flmd_headers[flmd_idx])}")
    print(f"---------------------------")

### 6. Inspect dataset file contents using Data Dictionary


<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
# Enter data file index
data_file_index = 10

# Enter Data Dictionary file index
dd_file_index = 1


<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Grab the DD
dd_file_name = flmd_store[f"Index {dd_file_index}"].get('File_Name')
data_file_name = flmd_store[f"Index {data_file_index}"].get('File_Name')
print(f'Data File: {data_file_name}\n'
      f'Data Dictionary File: {dd_file_name}')

### 7. Check if the DD is zipped

<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
# choose which files you want to print out that are included in the dataset
file_type = 'all'  # 'all' or 'csv' or 'pdf' or 'zip'

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# helper function that lists the files included
inspect_dataset_distribution(dataset, file_type)

### 8A) IF DD in zip: search in zip for DD

#### 1. Show zip contents to select DD

<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
# file from file distribution - choose the zip where you think the DD may be
zip_file_idx = 1

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# helper function that prints zipped file content
fn, zip_download = inspect_zip_file_contents(dataset, zip_file_idx)

#### 2. Display DD within zip file to inspect

<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
# Index of the DD csv file inside the zip (from the zip contents listed above)
dd_csv = 1

# If needed adjust the number of rows to skip.
header_rows = 0

<strong><span style="color:green">Run Cell</span></strong>

In [None]:

csv_file_name = zip_download.namelist()[dd_csv]
print(f'Attempting to read: {csv_file_name} from zip file {fn}')

metadata_df = read_zipped_csv(zip_download, csv_file_name, header_rows)
zip_download_dd = zip_download
fn_dd = fn

if metadata_df is not None:
    is_csv_zipped = True
    headers = list(metadata_df.columns)
    display(metadata_df)
else:
    print('ERROR: Sample metadata file was not successfully loaded.')

### 8B) If DD not in zip: Inspect data dictionary

In [None]:
# ===================================
data_files = dataset.get('csv_files')

if dd_file_name not in data_files.keys():
    print(f"Cannot find {dd_file_name} in dataset distribution.")
else:
    dd_url = data_files[dd_file_name]
    print(f"{dd_file_name}")
    print(f"{dd_url}")
    print('-------------------------')

    dd_request = get_request(dd_file_name, dd_url)
    dd_headers, dd_store = make_store(dd_request)
    print('-------------------------')

    for idx, dd_info in dd_store.items():
        print(f"{dd_info.get(dd_headers[0])} -- Units: {dd_info.get(dd_headers[1])} -- Desc: {dd_info.get(dd_headers[2])}")



## B. No FLMD or DD? No problem! We can look inside the datasets manually with the Dataset Details
_Inspect dataset using Dataset Details Distribution_ <br>
Useful for a preliminary search into files without readily accessible FLMDs. You may find FLMDs and DDs stored within the zip, but let's start without them.

### 9. Choose dataset to inspect using index above from the non-FLMD list.

<strong><span style="color:blue">Enter INPUT</span></strong>

In [30]:
no_flmd_datasets[3]

{'@context': 'http://schema.org/',
 '@type': 'Dataset',
 '@id': 'doi:10.15486/NGT/1781005',
 'name': 'G-LiHT Campaign Leaf Sample details & photos, March 2017: Puerto Rico',
 'description': ['This data package includes details of leaves sampled for leaf spectra and chemistry from 5 sites in Puerto Rico, in March of 2017. Sunlit canopy and shaded leaves of 66 species were collected. Data for each sample includes species, leaf age, type of analysis (spectroscopy, gas exchange, chemistry), sample number and sample photographs. The data package includes a spreadsheet with sample information and a zip file of photographs (1.6 GB). This data was collected as part of the 2017 BNL–G-LiHT leaf spectra campaign. See related datasets for leaf spectral reflectance and transmittance, leaf mass area (LMA), and leaf chemistry. Note that leaf sample details are also included in related datasets.',
  'This dataset was originally published on the NGEE Tropics Archive and is being mirrored on ESS-DIVE fo

In [43]:
# Select the dataset you want to look at and decide which files you want to print out
ds_idx_no_flmd = 3
file_type = 'all'  # 'all' or 'csv' or 'pdf' or 'zip'

<strong><span style="color:green">Run Cell</span></strong>

In [44]:
# use this helper function to print the names of the files in the dataset you chose
inspect_dataset_distribution(no_flmd_datasets[ds_idx_no_flmd], file_type)

G-LiHT Campaign Leaf Sample details & photos, March 2017: Puerto Rico
Index 0: NGT0077_locations.csv
  encoding: text/csv
  url: https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-1e92402c98110a3-20240913T175522805088
Index 1: G_LiHT_Campaign_Leaf_Sample_details_photos_March_2017_Puerto_Rico.xml
  encoding: https://eml.ecoinformatics.org/eml-2.2.0
  url: https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-bf58a2ab1d40f9f-20241028T151709835342
Index 2: NGT0077_PR2017samples_20200923222634_20200923222634.zip
  encoding: application/zip
  url: https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-a421d695ce279f7-20220608T192417400244
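
`inspect_dataset_distribution` is a notebook helper whose body is not shown in this section. As a rough illustration of what filtering by `file_type` could look like, the sketch below filters distribution entries by file extension. It assumes each entry is a dictionary with a `name` key; `filter_distribution` is a hypothetical name, and the entries reuse the file names printed above:

```python
def filter_distribution(distribution, file_type="all"):
    """Return (index, name) pairs whose file name ends with .<file_type>."""
    matches = []
    for idx, entry in enumerate(distribution):
        name = entry.get("name", "")
        if file_type == "all" or name.lower().endswith(f".{file_type.lower()}"):
            matches.append((idx, name))
    return matches


distribution = [
    {"name": "NGT0077_locations.csv"},
    {"name": "G_LiHT_Campaign_Leaf_Sample_details_photos_March_2017_Puerto_Rico.xml"},
    {"name": "NGT0077_PR2017samples_20200923222634_20200923222634.zip"},
]
for idx, name in filter_distribution(distribution, "zip"):
    print(f"Index {idx}: {name}")
```

The original indices are kept in the result so they still match the full listing when you pick a file.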


### 10. Select zip file to inspect

<strong><span style="color:blue">Enter INPUT</span></strong>

In [47]:
# Grab the specific dataset details from the dataset we chose from the list of datasets we selected originally:
dataset_detail = dataset_details[2]

# Index of zip file from file distribution
zip_file_index = 3


<strong><span style="color:green">Run Cell</span></strong>

In [48]:
# use this helper function to list the files in the zip file
fn, zip_download = inspect_zip_file_contents(dataset_detail, zip_file_index)

Success!
PR2017_LMA_20190218212240.zip contents:
Index 0: NGEE-Tropics_Puerto_Rico_March2017_LMA.xlsx
Index 1: __MACOSX/
Index 2: __MACOSX/._NGEE-Tropics_Puerto_Rico_March2017_LMA.xlsx
Index 3: File_Submission_Metadata_2017_PR_LMA.xlsx
Index 4: __MACOSX/._File_Submission_Metadata_2017_PR_LMA.xlsx
Index 5: NGEE-Tropics_Puerto_Rico_March2017_Leaf_Sample_Detail.xlsx
Index 6: __MACOSX/._NGEE-Tropics_Puerto_Rico_March2017_Leaf_Sample_Detail.xlsx
Index 7: E-Field_Log_2017_BNL_PuertoRico.xlsx
Index 8: __MACOSX/._E-Field_Log_2017_BNL_PuertoRico.xlsx
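
The `__MACOSX/` entries in the listing above are macOS resource files, not data. A minimal sketch of hiding them while keeping the original indices, so the numbers still work with `zip_download.namelist()`; `list_data_members` is a hypothetical helper and the in-memory zip only mimics the real one:

```python
import io
import zipfile


def list_data_members(zf):
    """Return (index, name) pairs, skipping directories and __MACOSX entries."""
    return [
        (idx, name)
        for idx, name in enumerate(zf.namelist())
        if not name.endswith("/") and not name.startswith("__MACOSX/")
    ]


# Build a small in-memory zip that mimics the listing above
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data.xlsx", b"placeholder")
    zf.writestr("__MACOSX/._data.xlsx", b"placeholder")
    zf.writestr("log.xlsx", b"placeholder")

with zipfile.ZipFile(buffer) as zf:
    members = list_data_members(zf)
    for idx, name in members:
        print(f"Index {idx}: {name}")
```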


### 11. Select csv file within zip file that you want to inspect

<strong><span style="color:blue">Enter INPUT</span></strong>

In [53]:
# Select the index for the file you want to look at
csv_file_idx = 5

#### Before you can view the file, let's take a look at the file structure to understand how to parse it.
For this tutorial, we know this dataset has structured CSV files and it may have multiple rows of metadata. Let's look at the first line to see where the header rows start.   

<strong><span style="color:green">Run Cell</span></strong>

In [55]:
# Print out the first line of the file and extract header row number
# ===================================
csv_file_name = zip_download.namelist()[csv_file_idx]

header_row = 0
with zip_download.open(csv_file_name) as f:
    line = f.readline().decode('latin-1')  # Decode the bytes to string
    print(line)
    if "# HeaderRows_" in line:
        header_row = int(line.split("# HeaderRows_")[1])  # Extract the number part
        print(f"Extracted header row number: {header_row}")

PK ... [Content_Types].xml ... (binary output truncated; the selected file is an .xlsx workbook, not a plain-text CSV)
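
The output above starts with `PK`, the signature of a zip container, which is also how `.xlsx` workbooks begin. The selected member is therefore a spreadsheet rather than plain-text CSV, and it has no `# HeaderRows_` line to parse. A minimal sketch of a check that could run before parsing; `read_header_rows` is a hypothetical helper and the in-memory zip is made up for the example:

```python
import io
import re
import zipfile


def read_header_rows(zf, member):
    """Return the HeaderRows count, 0 if the marker is absent, or None for zip-based files such as .xlsx."""
    with zf.open(member) as f:
        first = f.readline()
    if first.startswith(b"PK"):
        return None
    match = re.match(r"#\s*HeaderRows_(\d+)", first.decode("latin-1"))
    return int(match.group(1)) if match else 0


# Made-up archive: one CSV that follows the guidelines, one fake workbook
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("good.csv", "# HeaderRows_3\n# note\nDateTime,Temperature\n")
    zf.writestr("plain.csv", "DateTime,Temperature\n")
    zf.writestr("book.xlsx", b"PK\x03\x04binary-content")

with zipfile.ZipFile(buffer) as zf:
    csv_rows = read_header_rows(zf, "good.csv")
    plain_rows = read_header_rows(zf, "plain.csv")
    xlsx_rows = read_header_rows(zf, "book.xlsx")
print(csv_rows, plain_rows, xlsx_rows)
```

A `None` result is a signal to pick a different file or to read the member with a spreadsheet reader instead of `pd.read_csv`.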



#### This CSV happens to follow the CSV Guidelines and we can easily print out the number of header rows. To verify that this is true, we'll print this number of rows first.

<strong><span style="color:green">Run Cell</span></strong>

In [56]:
# Print out rows up to header row number
if header_row > 0:
    with zip_download.open(csv_file_name) as f:
        for i in range(header_row):
            print(f.readline())

#### Look at the last line that is printed - that should be the column names!

#### To load a csv file into a pandas dataframe correctly, take the header row number (7 in this example) and subtract 1, so that the row with the data column names is kept. In this example we want to skip 6 rows.
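
To make the arithmetic concrete, here is a small sketch using a made-up file whose header row is row 3 (not the tutorial's file). The CSV text and column names are hypothetical:

```python
import io

import pandas as pd

# Made-up CSV text: marker line, one metadata line, then the column names
csv_text = (
    "# HeaderRows_3\n"
    "# Site: example (hypothetical metadata line)\n"
    "DateTime,Temperature\n"
    "2021-01-01 00:00:00,1.5\n"
    "2021-01-01 00:15:00,1.7\n"
)

header_row = int(csv_text.splitlines()[0].split("# HeaderRows_")[1])
rows_to_skip = header_row - 1  # skip everything before the column-name row

df = pd.read_csv(io.StringIO(csv_text), skiprows=rows_to_skip)
print(list(df.columns))
```

Skipping 2 rows leaves the third line as the first line pandas sees, so it becomes the column names.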

<strong><span style="color:green">Run Cell</span></strong>

In [63]:
rows_to_skip = max(header_row - 1, 0)  # never skip a negative number of rows
print(f'The header row is row {header_row}, so we will skip {rows_to_skip} rows of the file')

print(f'Attempting to read: {csv_file_name} from zip file {fn}')

metadata_df = pd.read_csv(zip_download.open(csv_file_name), skiprows=rows_to_skip)
zip_download_1_datasetapi = zip_download
fn_datasetapi = fn
csv_file_name_datasetapi = csv_file_name

if metadata_df is not None:
    is_csv_zipped = True
    headers = list(metadata_df.columns)
    data_df_datasetapi = metadata_df
    display(metadata_df)
else:
    print('ERROR: Sample metadata file was not successfully loaded.')

The header row is row 1, so we will skip 0 rows of the file
Attempting to read: NGEE-Tropics_Puerto_Rico_March2017_Leaf_Sample_Detail.xlsx from zip file PR2017_LMA_20190218212240.zip


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 16: invalid start byte

### This allows you to view the datasets that we looked through manually
### Now, let's analyze!

---
# Part 3: Analysis

## A. Begin Simple Analysis
Now that we have identified files of interest, let's start using them and begin our investigation!

### 1. Load the two selected csv data files into pandas dataframes

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Data identified from Basic Search
# ===================================
# grab and print identifying information from the dataset details
index_dataset_api_dataset = total_doi_array.index(no_flmd_datasets[ds_idx_no_flmd].get('@id'))
print(datasets[index_dataset_api_dataset].get('dataset').get('@id'))
print(datasets[index_dataset_api_dataset].get('dataset').get('name'))
data_df_datasetapi_name = datasets[index_dataset_api_dataset].get('dataset').get('name')
print(datasets[index_dataset_api_dataset].get('viewUrl'))

# display the pandas dataframe containing the datafile
display(data_df_datasetapi)

In [None]:
# Otherwise: can load any data that you downloaded previously.

### 2. Look at basic statistics and data coverage

Print out the basic statistics of the variables, as well as the date range for both dataset files. <br>
By gleaning more information - we can begin to determine which dataset may be useful for our science question.

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
for data_df in ['data_df_datasetapi']:
    print(vars()[str(data_df)+'_name'])
    date_range = (vars()[data_df]['DateTime'].min(), vars()[data_df]['DateTime'].max())
    print(f"Date range: {date_range[0]} to {date_range[1]}")
    display(vars()[data_df].describe())
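
The loop above looks dataframes up by variable name through `vars()`. A dictionary of dataframes is one alternative that avoids name lookups, and converting `DateTime` to real timestamps first ensures that `min()` and `max()` compare dates rather than strings. A minimal sketch with a made-up dataframe (the values and the dataset label are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for a loaded data file
df = pd.DataFrame(
    {
        "DateTime": ["2021-01-02 00:00:00", "2021-01-01 12:00:00"],
        "Temperature": [1.5, 1.7],
    }
)

frames = {"example dataset": df}
for name, frame in frames.items():
    timestamps = pd.to_datetime(frame["DateTime"])
    print(name)
    print(f"Date range: {timestamps.min()} to {timestamps.max()}")
    print(frame.describe())
```

In the notebook, `display(...)` can replace the last `print(...)` for a formatted table.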


### 3. Plot the data to visualize basic patterns

<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
## DATASET API RESULT

# Select the dataset you want to plot
dataframe = data_df_datasetapi

# Select the variables that you are interested in plotting
variables_of_interest = ['Temperature','Specific_Conductance']

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Plot the data
# ===================================
# Convert the 'DateTime' column to datetime
dataframe['DateTime'] = pd.to_datetime(dataframe['DateTime'])

num_plots = len(variables_of_interest)

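# Create a figure with one subplot per variable
# (squeeze=False keeps axs as a 2-D array even when there is only one variable)
fig, axs = plt.subplots(num_plots, 1, figsize=(10, 8), squeeze=False)

for i, ax in enumerate(axs[:, 0]):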
    # Plot VARIABLE over time
    ax.plot(dataframe['DateTime'], dataframe[variables_of_interest[i]], label=variables_of_interest[i])
    ax.set_title(variables_of_interest[i] + ' over Time')
    ax.set_xlabel('DateTime')
    ax.set_ylabel(variables_of_interest[i])
    ax.grid(True)

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()


### Using this plot, we can visually see the data coverage, and start to think about patterns in the data.
### Visualizing the data can help you determine if this data file may work for your science question. You can keep going with analysis by inserting your custom analysis code here! Or, you can move on to the next section and download the data for future use.
RESOURCE: [Python pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)

---
# Part 4: Download Files and Save the Download Log


## A. Download file(s) to local directory
If desired, change save location and file location.
Otherwise the path configured at the beginning of the notebook will be used.

### 1. Ensure you have the right file to download.

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Run cell to view dataset to check if this is the one you want to download
datafile_to_download = "data_df_datasetapi" # "data_df_deep_dive" >> this is if you use the file exploration in Part 5!

### 2. Download the file and update the file download log.
#### This example will download the whole zip file

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Download the zip file to the chosen directory
if datafile_to_download == "data_df_datasetapi":
    file_indices = [zip_file_index]
    dataset_details_chosen =  no_flmd_datasets[ds_idx_no_flmd]
    dataset_citation = citations_list.get(dataset_details_chosen.get('@id'))

    ds_doi = download_selected_files(dataset_details_chosen, file_indices, download_dir_path,citation=dataset_citation)

# if datafile_to_download == "data_df_deep_dive":
#     dataset_details_chosen =  dataset_details[total_doi_array.index(current_response_json.get('doi'))]

#     files_deep_dive = dataset_details[total_doi_array.index(current_response_json.get('doi'))].get('distribution')
#     zipfile_to_download = current_response_json.get('data_file').split('/', 1)[0]
#     index = next((i for i, item in enumerate(files_deep_dive) if item['name'] == zipfile_to_download), None)
#     file_indices = [index]

#     file_to_download = current_response_json.get('data_file').rsplit('/', 1)[-1]

#     dataset_citation = citations_list.get(current_response_json.get('doi'))

#     ds_doi = download_selected_files(dataset_details_chosen, file_indices, download_dir_path, citation=dataset_citation)


#### You can view the Download log file to see a list of the files that we downloaded

In [None]:
# Optional: display the whole download file log
# ===================================
display(download_file_log)

### 3. Download the Download File Log to get a list of citations of data that we downloaded

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
log_filename = 'essdive_downloaded_files_log.csv'
log_fn_path = download_dir_path / log_filename

with open(log_fn_path, mode='w', newline='') as f:  # newline='' avoids blank rows on Windows
    csv_writer = csv.writer(f)
    csv_writer.writerow(['dataset_id', 'file_name', 'access_datetime', 'access_url', 'dataset_name', 'citation'])

    for ds_id, log_info in download_file_log.items():
        ds_name = log_info.get('name')
        ds_citation = log_info.get('citation')

        accessed_file_list = log_info.get('downloaded_files')
        for accessed_file in accessed_file_list:
            fn, fn_url, access_ts = accessed_file

            csv_writer.writerow([ds_id, fn, access_ts, fn_url, ds_name, ds_citation])

print(f'Check {str(download_dir_path)} for the log file: {log_filename}')
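
The loop above implies a particular shape for `download_file_log`: one entry per dataset ID, each holding a name, a citation, and a list of `(file name, URL, access timestamp)` tuples. A small sketch of that shape, followed by reading the log back to pull out citations. The DOI, file name, URL, and timestamp are placeholders, and the temporary directory stands in for `download_dir_path`:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical log entry; every value below is a placeholder
download_file_log = {
    "doi:10.0000/EXAMPLE": {
        "name": "Example dataset",
        "citation": "Example citation",
        "downloaded_files": [
            ("example.zip", "https://example.org/example.zip", "2025-01-01T00:00:00"),
        ],
    }
}

log_path = Path(tempfile.mkdtemp()) / "essdive_downloaded_files_log.csv"
fields = ["dataset_id", "file_name", "access_datetime", "access_url", "dataset_name", "citation"]

# Write one row per downloaded file
with open(log_path, mode="w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    for ds_id, info in download_file_log.items():
        for file_name, url, accessed in info.get("downloaded_files", []):
            writer.writerow([ds_id, file_name, accessed, url, info.get("name"), info.get("citation")])

# Read the log back as dictionaries keyed by column name
with open(log_path, newline="") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["citation"])
```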

# That's a wrap!

-----
<br>
<br>


# Part 5. Workflow Using Deep Dive API

## A. Searching for Data using Deep Dive API with the Fusion Database
### (Alternative to Part 1)

The Fusion Database allows you to search within files and across datasets that contain structured data. Sometimes, datasets don't include all of the information in the metadata and thus may not come up in a Dataset API search alone. You can search across all datasets available in the Fusion DB for specific field names.

**See additional details for Deep Dive search in API technical documentation:** https://fusion.ess-dive.lbl.gov/#/

### Search within datasets for certain measured data
The Fusion Database only searches structured data, meaning that the total list of potential datasets is limited. However, if you find datasets of interest, you will be able to explore inside them much more deeply. <br>
You can search for datasets using any of the following parameters:
- **rowStart** (integer, query): The row number to start on. Use this for paging results, minimum: 1
- **pageSize** (integer, query): The number of datasets to return, maximum: 100
- **doi** (string array, query): The digital object identifier (doi) representing a dataset
- **fieldName** (string, query): The field name to search for, minLength: 1, maxLength: 100
- **fieldDefinition** (string, query): Search the field definition, minLength: 1, maxLength: 100
- **recordCountMin** (integer, query): Filter by record count greater than or equal to.
- **recordCountMax** (integer, query): Filter by record count less than or equal to.
- **fieldValueText** (string, query): Filter by a text field value. Search is case insensitive
- **fieldValueNumeric** (integer, query): Filter by a numeric value that is between min and max summary values.
- **fieldValueDate** (string($date), query): Filter by a date/datetime value that is between min and max summary values. Date format: (yyyy-mm-dd), Datetime format: (yyyy-mm-ddTHH:MM:SS)
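
The parameters listed above can be combined into one query string. A minimal sketch using the standard library's `urlencode`, which also handles special characters and repeated `doi` values. The base URL is an assumption taken from the documentation link above (in the notebook it comes from `essdive_deepdive_url`), and the DOI is reused from earlier in this notebook purely as a sample value:

```python
from urllib.parse import urlencode

# Assumed base URL; the notebook defines this as essdive_deepdive_url
base_url = "https://fusion.ess-dive.lbl.gov"

params = {
    "rowStart": 1,
    "pageSize": 100,
    "fieldName": "conductance",
    "doi": ["doi:10.15486/NGT/1781005"],  # a list produces one doi= entry per item
}
query_url = f"{base_url}/api/v1/deepdive?{urlencode(params, doseq=True)}"
print(query_url)
```

No request is sent here; the sketch only builds the URL that would be passed to `requests.get`.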


### General Search
You can search within individual DOIs, multiple DOIs, or across all datasets available in the Fusion Database. Here, we will search without specifying a DOI, to explore whether there are other datasets of interest. In the next section, we will search a couple of DOIs to see if they have specific files we are interested in.

### 1. Enter Search Parameters and make API call
<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
# Enter search terms
# For an exact match, wrap the term in escaped quotes, e.g. "\"Leaf\"" is an exact match, "Leaf" is any match
fieldName="conductance"
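
For the exact-match form, single-quoted Python strings avoid the backslash escaping, and percent-encoding keeps the quotes intact in the URL. A small sketch of both; it only prepares the strings and makes no claim about how the API ranks the two kinds of match:

```python
from urllib.parse import quote

any_match = "Leaf"
exact_match = '"Leaf"'  # the double quotes are part of the search term

print(quote(any_match))
print(quote(exact_match))
```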

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Construct URL query to send to the Deep Dive API
get_deepdive_response = f"{essdive_deepdive_url}/api/v1/deepdive?rowStart=1&pageSize=100&fieldName={fieldName}"

# Send request to API
response_deepdive = requests.get(get_deepdive_response)

# Review the response and debug if needed
if response_deepdive.status_code == 200:
    # Success
    response_json_deepdive = response_deepdive.json()
    results_deepdive = response_deepdive.json()['results']
    print("Success! Continue to look at the search results")
else:
    # There was an error
    print("There was an error. Stop here and debug the issue. Email ess-dive-support@lbl.gov if you need assistance. \n")
    print(response_deepdive.text)
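
Because `pageSize` is capped at 100, a search with more matches needs several requests. The sketch below pages through results by advancing `rowStart`. It assumes that `rowStart` counts results from 1 and that a short page marks the end, which should be checked against the API documentation. `fetch_all_results` and `fake_fetch` are hypothetical; in the notebook, `fetch_page` would wrap `requests.get(...).json()['results']`:

```python
def fetch_all_results(fetch_page, page_size=100):
    """Collect results page by page until a page comes back short.

    fetch_page(row_start, page_size) must return a list of results.
    """
    results = []
    row_start = 1  # rowStart has a minimum of 1
    while True:
        page = fetch_page(row_start, page_size)
        results.extend(page)
        if len(page) < page_size:
            break
        row_start += page_size
    return results


# Stand-in for the API: 250 fake results served in slices
fake_rows = list(range(250))


def fake_fetch(row_start, page_size):
    return fake_rows[row_start - 1 : row_start - 1 + page_size]


all_results = fetch_all_results(fake_fetch)
print(len(all_results))
```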


In [None]:
# OPTIONAL: View the JSON response
# ===================================
display(response_json_deepdive)

### 2. Inspect the search results - as a Pandas Dataframe

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Create and display a pandas dataframe for the report
project_report_deepdive = grab_metadata(results_deepdive)
display(project_report_deepdive.style.set_properties(**{'text-align': 'left'}))

### This example for "conductance" headers returns 98 files that match this search. How do we narrow the search down further?

---
## B. Exploring Data using Deep Dive
### (Alternative to Part 2)

_Inspect datasets with structured data (FLMD)_ <br>
This section picks up where **[Part 1: Searching for Data](#-Part-1-Searching-on-ESS-DIVE)** leaves off, with the list of identified datasets stored in the variable _datasets_.

### 1. Use the Deep Dive API (Query-Data) to look in specific datasets
Using the datasets that **do** have FLMD, we will explore inside these files to find ones we are interested in for analysis.

In Section A above, we used the Deep Dive API (Query-Data) to look for files with certain terms across any public dataset that is in the Fusion DB.

Now, we will specify which datasets we want to look at to see (a) if they are available on Deep Dive and (b) what specific files may be of interest. We will need their DOIs to do so.

***The DOIs that we will use in this example come from our Dataset API search results (Part 1, Section A, Step 3: Subset Search Results).***

<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
# Enter search terms
fieldName="conductance"

# Select the datasets that you would like to check.
# Change the indices in the bracket for the indices of the datasets from the cell above
doi_array = total_doi_array[0:]

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Construct URL query to send to the Deep Dive API
doi_information=""
for d in doi_array:
    doi_information=doi_information + "&doi="+d

get_deepdive_response = f"{essdive_deepdive_url}/api/v1/deepdive?rowStart=1&pageSize=100&fieldName={fieldName}{doi_information}"

# Send request to API
response_deep_dive = requests.get(get_deepdive_response)

# Review the response and debug if needed
if response_deep_dive.status_code == 200:
    # Success
    response_json_deep_dive = response_deep_dive.json()
    results_deep_dive = response_deep_dive.json()['results']
    print("Success! Continue to look at the search results")
else:
    # There was an error
    print("There was an error. Stop here and debug the issue. Email ess-dive-support@lbl.gov if you need assistance. \n")
    print(response_deep_dive.text)


### 2. View the results

In [None]:
# OPTIONAL: View the JSON response
# ===================================
display(response_json_deep_dive)

### In this example, I'm interested in the results with the most data records, so I sorted the table by record count to make the index easy to reference.
There is also the option to view the table unsorted.

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Create pandas dataframe for the report
project_report_deep_dive = grab_metadata(results_deep_dive)

# This code sorts the dataframe by total records
columns_to_sort = ['Total_records']
ascending = [False]
project_report_sorted_deep_dive = project_report_deep_dive.sort_values(by=columns_to_sort,ascending=ascending)


<strong><span style="color:green">Run Cell</span></strong>

In [None]:
## Choose the dataframe to display - Sorted or Non-Sorted
## =================
## Display Sorted dataframe
display(project_report_sorted_deep_dive.style.set_properties(**{'text-align': 'left'}))

## Uncomment to display Non-Sorted dataframe
#display(project_report_deep_dive.style.set_properties(**{'text-align': 'left'}))

### Let's grab the file(s) that we are interested in

### 3. Use Get-Dataset-File to identify specific files
Aside from identifying specific files in datasets, the Deep Dive API can also retrieve a dataset file by its file path, using a different request (called an endpoint). <br>
Learn more at Fusion docs: [Get-Dataset-File](https://fusion.ess-dive.lbl.gov/#/default/get_dataset_file_api_v1_deepdive__doi___file_path__get)

### From the previous list of files, we will use the index to then grab the DOI and file name to query the Deep Dive API.
<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
# Select an index from the pandas dataframe to choose a file to investigate
i_of_interest= 69

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# The format for the URL Deep Dive call is - DOI:file_name
doi_file_information = project_report_sorted_deep_dive.loc[i_of_interest]['DOI'] + ':' + project_report_sorted_deep_dive.loc[i_of_interest]['File']
# using the DOI, grab the index from doi_array
index_for_datasets = doi_array.index(project_report_sorted_deep_dive.loc[i_of_interest]['DOI'])

# Construct URL query to send to the Deep Dive API
get_deepdive_response_file = f"{essdive_deepdive_url}/api/v1/deepdive/{doi_file_information}"

# Send request to API
response_deepdive_file = requests.get(get_deepdive_response_file)

# Review the response and debug if needed
if response_deepdive_file.status_code == 200:
    # Success
    response_deepdive_file_json = response_deepdive_file.json()
    print(f"Success for file {doi_file_information}! Continue to look at the search results")
else:
    # There was an error
    print("There was an error. Stop here and debug the issue. Email ess-dive-support@lbl.gov if you need assistance. \n")
    print(response_deepdive_file.text)

In [None]:
# Optional: display entire json response
# ===================================
display(response_deepdive_file_json)

### Great! Now we have identified the file we want through the Deep Dive API. Next, we'll look into the file itself to see if we want to download it.

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Use the file information to grab the file and then visualize it
current_response_json = response_deepdive_file_json

fn_url = current_response_json['data_download']['contentUrl']

try:
    # Create a request with headers
    req = Request(fn_url)
    req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0')
    # Open the URL with the added headers
    resp = urlopen(req)
    zip_download = ZipFile(io.BytesIO(resp.read()))
    print('Success!')
except urllib.error.HTTPError as e:
    print(f'HTTPError: {e.code} - {e.reason}')
# try:
#     request = urllib.request.Request(fn_url, headers=headers)

#     with urllib.request.urlopen(request) as response:
#         with open(file_path, 'wb') as out_file:
#             out_file.write(response.read())

# except urllib.error.HTTPError as e:
#     print(f'HTTPError: {e.code} - {e.reason}')


#### We want to visualize this file in a pandas dataframe - thus we need to identify what the header row is.
Let's find out by printing the first line of the file. It should contain a string like ` b'# HeaderRows_10\n' `, where the number is the line of the file that holds the header row.  <br>
The cell below extracts that number.

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Print out the first line of the file and extract header row number
# ===================================
csv_file_name_deep_dive = current_response_json['data_file']
csv_file_name_deep_dive = csv_file_name_deep_dive.split('.zip/', 1)[1]
header_row = 0
with zip_download.open(csv_file_name_deep_dive) as f:
    line = f.readline().decode('utf-8')  # Decode the bytes to string
    print(line)
    if "# HeaderRows_" in line:
        header_row = int(line.split("# HeaderRows_")[1])  # Extract the number part
        print(f"Extracted header row number: {header_row}")
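
The `split('.zip/', 1)[1]` step above assumes the `data_file` path always contains a zip archive; for a path without `.zip/` it raises an `IndexError`. A minimal sketch of a more tolerant split; `split_zip_path` is a hypothetical helper and the example paths are made up:

```python
def split_zip_path(data_file):
    """Split 'archive.zip/inner/file.csv' into ('archive.zip', 'inner/file.csv')."""
    marker = ".zip/"
    if marker not in data_file:
        # Not inside a zip archive: no archive name, path returned unchanged
        return None, data_file
    zip_name, _, member = data_file.partition(marker)
    return zip_name + ".zip", member


print(split_zip_path("example_data.zip/example_data/site_1.csv"))
print(split_zip_path("site_1.csv"))
```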

#### You can then verify this by printing that number of lines to see if you reach a header row. Look at the last line that is printed - that should be the column names!

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Print out rows up to header row number
if header_row > 0:
    with zip_download.open(csv_file_name_deep_dive) as f:
        for i in range(header_row):
            print(f.readline())

#### To load a csv file into a pandas dataframe correctly, take the header row number (7 in this example) and subtract 1, so that the row with the data column names is kept. In this example we want to skip 6 rows.

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
rows_to_skip = max(header_row - 1, 0)  # never skip a negative number of rows
print(f'The header row is row {header_row}, so we will skip {rows_to_skip} rows of the file')

fn = current_response_json['data_download']['name']
print(f'Attempting to read: {csv_file_name_deep_dive} from zip file {fn}')

metadata_df = read_zipped_csv(zip_download, csv_file_name_deep_dive, rows_to_skip)
zip_download_2_deep_dive = zip_download
fn_deep_dive = fn

if metadata_df is not None:
    is_csv_zipped = True
    headers = list(metadata_df.columns)
    data_df_deep_dive = metadata_df
    display(metadata_df)
else:
    print('ERROR: Sample metadata file was not successfully loaded.')

### Success! We have identified a number of files that could be relevant and we have opened one file for this example. Let's move on to visualizing this example file.
### ( Modified from  **[Part 3: Starting Analysis](#-Part-3-Starting-Analysis)** )

### 1. Load the two selected csv data files into pandas dataframes

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Data identified from Part 2
# ===================================
# grab and print identifying information from the dataset details
index_deep_dive_dataset = total_doi_array.index(current_response_json.get('doi'))
print(datasets[index_deep_dive_dataset].get('dataset').get('@id'))
print(datasets[index_deep_dive_dataset].get('dataset').get('name'))
data_df_deep_dive_name = datasets[index_deep_dive_dataset].get('dataset').get('name')
print(datasets[index_deep_dive_dataset].get('viewUrl'))

# display the pandas dataframe containing the datafile
display(data_df_deep_dive)

In [None]:
# Otherwise: can load any data that you downloaded previously.

### 2. Look at basic statistics and data coverage

Print out the basic statistics of the variables, as well as the date range for both dataset files. <br>
By gleaning more information - we can begin to determine which dataset may be useful for our science question.

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
for data_df in ['data_df_deep_dive']:
    print(vars()[str(data_df)+'_name'])
    date_range = (vars()[data_df]['DateTime'].min(), vars()[data_df]['DateTime'].max())
    print(f"Date range: {date_range[0]} to {date_range[1]}")
    display(vars()[data_df].describe())


#### Looks interesting, let's plot!

### 3. Plot the data to visualize basic patterns

<strong><span style="color:blue">Enter INPUT</span></strong>

In [None]:
## DEEP DIVE API RESULT

# Select the dataset you want to plot
dataframe = data_df_deep_dive

# Select the variables that you are interested in plotting
variables_of_interest = ['Temperature','Specific_Conductance']

<strong><span style="color:green">Run Cell</span></strong>

In [None]:
# Plot the data
# ===================================
# Convert 'DateTime' to datetime using:
dataframe['DateTime'] = pd.to_datetime(dataframe['DateTime'])

num_plots = len(variables_of_interest)

# Create a figure with one subplot per variable
# (squeeze=False keeps axs a 2D array even when only one variable is selected)
fig, axs = plt.subplots(num_plots, 1, figsize=(10, 8), squeeze=False)

for i, ax in enumerate(axs[:, 0]):
    # Plot VARIABLE over time
    ax.plot(dataframe['DateTime'], dataframe[variables_of_interest[i]], label=variables_of_interest[i])
    ax.set_title(variables_of_interest[i] + ' over Time')
    ax.set_xlabel('DateTime')
    ax.set_ylabel(variables_of_interest[i])
    ax.grid(True)

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()


### Using this plot, we can visually see the data coverage, and start to think about patterns in the data.
### Visualizing the data can help you determine if this data file may work for your science question. You can keep going with analysis by inserting your custom analysis code here! Or, you can move on to the next section and download the data for future use.
RESOURCE: [Python pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)
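
As one possible starting point for custom analysis, the sketch below computes daily means and the correlation between two variables. It is a minimal example on made-up values, not part of the original workflow: `example_df` is a hypothetical stand-in for `dataframe`, and `daily_means` and `correlation` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for `dataframe`; values are made up for illustration
example_df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2021-06-01 00:00', '2021-06-01 12:00',
                                '2021-06-02 00:00', '2021-06-02 12:00']),
    'Temperature': [14.0, 16.0, 15.0, 17.0],
    'Specific_Conductance': [210.0, 214.0, 212.0, 216.0],
})
variables_of_interest = ['Temperature', 'Specific_Conductance']

# Resample to daily means; resample() needs a datetime index
daily_means = (
    example_df.set_index('DateTime')[variables_of_interest]
    .resample('D')
    .mean()
)

# Pearson correlation between the two variables
correlation = example_df['Temperature'].corr(example_df['Specific_Conductance'])

print(daily_means)
print(f"Correlation: {correlation:.2f}")
```

To try this on the tutorial data, replace `example_df` with `dataframe` after the `DateTime` column has been converted with `pd.to_datetime`.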

### Move on to Part 4 if you wish to download the data.
You will need to change the variable names in Part 4 to download the correct files from this workflow.

-----

# EXTRA. Finding data using Sample ID and Metadata Reporting Formats - workflow

### Tutorial_FindingAccessingData.ipynb - 2023 ESS-DIVE Community Workshop
The notebook [Tutorial_FindingAccessingData.ipynb](https://github.com/ess-dive/essdive-tutorials/blob/main/search_data/Tutorial_FindingAccessingData.ipynb) is from the 2023 Finding and Accessing Data Tutorial. Its workflow is similar to this notebook's (albeit without the Deep Dive API), and it contains additional information and code, including:

1. (Step 6 of Danielle Christianson's notebook) Using Sample ID and Metadata Reporting Formats
   - The example uses data that follow the Sample ID and Metadata reporting format.
   - It relies on the same basic tools (the Dataset API, inspecting reporting format files, etc.) to provide another way to use ESS-DIVE data.
   - You will want to run Step 1: Set Up before running Step 6: Sample ID and Metadata Reporting Formats.