# Welcome to ESS-DIVE's Finding & Accessing Data Jupyter Notebook

This Jupyter Notebook helps data users find and access ESS-DIVE datasets that employ the file-level metadata and csv reporting formats. It shows how to:

* Use the ESS-DIVE Dataset API to access dataset files
* Use the xml file to explore / access a dataset
* Use the File-level Metadata (flmd) to explore the dataset
* Use Data Dictionaries to understand data content
* Use Sample Metadata to explore datasets with sample-based data
* Import data from csv files into python pandas dataframes
* Download files to local storage and log access details

This notebook was created as a resource for the ESS-DIVE 2023 Open Data Workshop.

Written By: Danielle S Christianson (she/her, LBNL)

Acknowledgements: This notebook builds from Madison Burrus and Valerie Hendrix's Search & Download notebook.

## README: How to use this notebook

Enter input and/or read information above the "# ===============================" in python cells. 
Otherwise, just run the cell.

Optional view cells are marked with "Optional" in the first line. These do not need to be run.

Any downloaded files are logged with the date/time of access. See Section 7 to save the log.

Workflows:
* Cells in Sections 1-5 are sequential and depend on variables entered in prior cells.
* To use Section 6: Sample ID and Metadata, run Section 1: Setup, then proceed to Section 6.
* Section 7: Saving the download log file can be run after Section 5 or after Section 6.

# 1. Setup

In [None]:
# This notebook requires Python 3.
# ===================================

import csv
import datetime as dt
import io
import json
import os
import pandas as pd
import requests

from ipywidgets import widgets, interact
from IPython.display import display, display_html
from pathlib import Path
from urllib.request import Request, urlopen, urlretrieve
from zipfile import ZipFile

### Configure authentication

1. Go to ESS-DIVE (https://data.ess-dive.lbl.gov/data), login with your ORCID, and copy your authentication token from your account settings page.
2. Enter your authentication token as the value of `my_token` in the code cell below
3. Run the following code cell

   _Always re-run this code cell when you update your token. Tokens expire every 24 hours._

In [None]:
my_token = "<put_your_token_here>"

# ===================================
token_text = widgets.Text(my_token, description="Token:")
display(token_text)

token = token_text.value
essdive_api_url = 'https://api.ess-dive.lbl.gov'

essdive_direct_url = 'https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/'

### Configure local storage for downloads (if desired)

Enter the local directory path in which you want to save downloaded files.
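
If the directory does not exist yet, it can be created before running the cell below. The following is a minimal sketch, assuming you are happy for missing parent folders to be created as well; the helper name `ensure_local_dir` is hypothetical and not one of this notebook's functions.

```python
from pathlib import Path
import tempfile

def ensure_local_dir(dir_path):
    """Create dir_path (and any missing parents) if needed; return it as a Path."""
    p = Path(dir_path).expanduser()
    p.mkdir(parents=True, exist_ok=True)
    return p

# Demonstration in a throwaway temporary folder
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = ensure_local_dir(Path(tmp) / 'essdive_downloads')
    demo_dir_exists = demo_dir.is_dir()
    print(demo_dir_exists)
```

In practice you would call the helper on your own path and assign the result to `local_dir`.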

In [None]:
local_dir = Path('<put_local_directory_here>')

# ===================================
if local_dir.exists():
    print(f'Success! Local directory {local_dir} configured for downloads')
    print('===================================')
    current_files = [x for x in os.listdir(local_dir) if x != '.DS_Store']
    if current_files:
        print(f'Local directory contains: {current_files}')
    else:
        print(f'Local directory is currently empty.')
else:
    print(f'Cannot find local directory {local_dir}. Please reenter valid directory path.')
    
download_file_log = {}
print('===================================')
print('Downloaded files will be logged in the dictionary object "download_file_log".\n'
      'You can save this dictionary as a file later in the notebook.\n'
      'The filename, file url, and datetime accessed are recorded as a tuple in the "downloaded_files" element.')

### Load general functions

In [None]:
# Run these general functions
# ===================================

def print_dataset_info(d, info_fields=['@id', 'name', 'description', 'citation'], line_space=False):
    """ 
    Display basic dataset info for evaluation 
    """
    for f in info_fields:
        value = d.get(f)
        
        if value is None:
            dataset_value = d.get('dataset')
            if dataset_value:
                value = dataset_value.get(f)
                    
        if value:
            if f in ['flmd_url', 'csv_files']:
                print(f"--- {f}:")
                for filename, url in value.items():
                    print(f"    - {filename}")
                continue
                          
            print(f"--- {f}: {value}")
            if line_space:
                print(" ")


def print_datasets_info(dataset_list, info_fields=['@id', 'name', 'description', 'citation'], line_space=False):
    """ 
    Display basic dataset info for evaluation 
    """
    print(f'=========== Info for {len(dataset_list)} datasets ===========')
    for a_dataset in dataset_list:
        print_dataset_info(a_dataset, info_fields, line_space)
                
        print("----------------------------------------------------------")
        
        
        
def assess_datasets_flmd_dd_csv_files(dataset_details_list):
    """
    Find the datasets with flmd files
    Sort the csv files into flmd files and other csv (data) files; add both to the dataset details dictionary
    """
    
    flmd_datasets_indices = set()
    flmd_dataset_details = []
    
    for idx, dataset in enumerate(dataset_details_list):
        file_list = dataset.get('distribution')
    
        flmd_url = {}
        csv_files = {}
        for f in file_list:
            encoding_format = f.get('encodingFormat')
            filename = f.get('name')
            url = f.get('contentUrl')
        
            # Skip files that lack an encoding format or url, or that are not csv
            if not encoding_format or 'csv' not in encoding_format or url is None:
                continue
        
            if 'flmd' in filename:
                flmd_datasets_indices.add(idx)
                flmd_url.update({filename: url})
        
            else:
                csv_files.update({filename: url})

        dataset.update({
            'flmd_url': flmd_url,
            'csv_files': csv_files
        })
    
        if not flmd_url:      
            dataset_name = dataset.get('name')
            print(f"No flmd found for dataset: {dataset_name}")
        
    print("=====================================")
    
    if len(flmd_datasets_indices) > 0:
        print(f'flmd found in {len(flmd_datasets_indices)} datasets')
        # Sort the indices so the flmd datasets keep their original order
        flmd_dataset_details = [dataset_details_list[x] for x in sorted(flmd_datasets_indices)]
    else:
        print(f'No datasets in the search results have flmds.')
        
    no_flmd_dataset_details = [dataset_detail for idx, dataset_detail in enumerate(dataset_details_list) if idx not in flmd_datasets_indices]
    
    return flmd_dataset_details, no_flmd_dataset_details


def get_dataset_details(dataset_url):
    
    response_status = None
    try:
        dataset_response = requests.get(dataset_url, headers={"Authorization": f"Bearer {token}"})
        response_status = dataset_response.status_code
    except Exception as e:
        print(f"Request to {dataset_url} did not have a successful return: {e}")
        return None

    # If successful response, add to dataset_store
    if response_status == 200:
        dataset_json = dataset_response.json()['dataset']
        print(f"--- Acquired details for {dataset_json.get('name')}")
        return dataset_json
    elif response_status:  
        print(f"Response status {response_status}: {dataset_response.text}")
    else:
        print(f"Response status unavailable. Response cannot be interpreted. Debug required.")
    return None


def get_request(filename, f_url, stream=True):
    """
    Get request for file, and stream the content back
    """

    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
               'Content-Type': 'application/json'}
    try:
        r = requests.get(f_url, headers=headers, verify=True, stream=stream)
        status_code = r.status_code
        if status_code == 200:
            return r
        else:
            print(f"{filename} request returned {status_code}")
            return None
    except Exception as e:
        print(f"{filename} request unsuccessful: {e}")
        return None
    
    
def make_store(file_request, use_idx=True, print_headers=True):
    """
    Read response and make store
    """
    file_store = {}
    csv_reader = csv.DictReader(file_request.iter_lines(decode_unicode=True))

    for idx, row in enumerate(csv_reader):
        if use_idx:
            file_store.update({f'Index {idx}': row})
            continue
        fn = row.get('File_Name')
        file_store.update({fn: row})
    
    # Take headers from the reader so an empty file does not raise an error
    headers = list(csv_reader.fieldnames or [])
    if print_headers:
        print(f"File headers: {headers}")
    return headers, file_store


def make_pandas_df(file_url, header_rows=1, print_headers=True):
    """
    Read an online csv file into a pandas dataframe
    Designed for ESS-DIVE Sample ID and Metadata RF sample_metadata.csv files that have one header row.
    """
    p_df = pd.read_csv(file_url, skiprows=header_rows)
    
    headers = list(p_df.columns)
    if print_headers:
        print(f"File headers: {headers}")
    return headers, p_df


def inspect_dataset_distribution(dataset_detail, file_type='all'):

    print(dataset_detail.get('name'))
    print('========================================')

    count = 0
    dist = dataset_detail.get('distribution')
    
    for idx, file_info in enumerate(dist):
        fn = file_info.get('name')
        fn_url = file_info.get('contentUrl')
        f_encoding = file_info.get('encodingFormat')
        if file_type != 'all' and file_type not in f_encoding:
            continue
        print(f'Index {idx}: {fn}\n  encoding: {f_encoding}\n  url: {fn_url}')
        count += 1
        
    if count == 0:
        print(f'No files found that match the file_type: "{file_type}" criteria.')
            
            
def retrieve_file_from_essdive(file_url, file_path):
    """ Retrieve the data file 
        file_path includes file name.
    """     
    try:
        urlretrieve(file_url, file_path)
        return True, None
    except Exception as e:
        return False, f'File at url: {file_url} was not saved: {e}'
    

def download_selected_files(dataset_detail, file_indices, file_dir=local_dir, log_store=download_file_log, 
                            is_csv_zipped=False, zip_download=None, zip_member_fn=None):
    dist = dataset_detail.get('distribution')
    ds_id = dataset_detail.get('@id')
    citation = dataset_detail.get('citation')
    ds_name = dataset_detail.get('name')
    
    if log_store is None:
        log_store = {}
    
    log_store.setdefault(ds_id, {'@id': ds_id, 'name': ds_name, 'citation': citation, 'downloaded_files': []})
    ds_file_log = log_store.get(ds_id).get('downloaded_files')
    
    # Use the file_dir argument rather than the global local_dir
    print(f'Saving files in {file_dir}')
    print("-------------------------------------")

    for idx, file_info in enumerate(dist):
        msg = None
        is_downloaded = None
        
        if idx not in file_indices:
            continue
            
        fn = file_info.get('name')
        file_path = file_dir / fn
        fn_url = file_info.get('contentUrl')
        
        if not is_csv_zipped:
    
            download_ts = dt.datetime.now().isoformat()
            is_downloaded, msg = retrieve_file_from_essdive(fn_url, file_path)
    
        else:
            if not zip_download or not zip_member_fn:
                print('ZipFile object and zipped member file name are required. Try again.')
                return None
            try:
                zip_download.extract(zip_member_fn, path=file_path)
                if Path.exists(file_path / zip_member_fn):
                    is_downloaded = True
                    download_ts = dt.datetime.now().isoformat()
                else:
                    msg = f'Extraction of {zip_member_fn} from {fn} was not successful.'
            except Exception as e:
                msg = f'ERROR attempting to extract {zip_member_fn} from {fn}: {e}'
        
        if is_downloaded:
            print(f'--- {fn} downloaded')
            ds_file_log.append((fn, fn_url, download_ts))
        else:
            print(msg)
            
    print("-------------------------------------")
    print(f'Remember to cite these files! Dataset DOI {ds_id}')
    return ds_id    


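def inspect_zip_file_contents(dataset_detail, file_idx):
    """ List the members of the zip file at file_idx in the dataset distribution.
        Returns (zip file name, ZipFile object), or (None, None) if the file cannot be used.
    """
    dist = dataset_detail.get('distribution')
    
    # Guard against a missing distribution or an index outside the file list
    if not dist or not 0 <= file_idx < len(dist):
        print('File index not found. Please try again.')
        return None, None
    file_info = dist[file_idx]
    
    fn = file_info.get('name')
    if 'zip' not in (file_info.get('encodingFormat') or ''):
        print(f'{fn} is not encoded as a zip file. Please select a different file.')
        return None, None
    
    fn_url = file_info.get('contentUrl')
    resp = urlopen(fn_url)
    
    # Hold the downloaded archive in memory
    zip_download = ZipFile(io.BytesIO(resp.read()))
    
    print(f'{fn} contents:')
    print('=================================')
    for idx, file_member in enumerate(zip_download.namelist()):
        print(f'Index {idx}: {file_member}')
        
    return fn, zip_download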


def read_zipped_csv(zip_file_obj, csv_file_name, header_rows=1):
    """ Read a csv member of an open ZipFile object into a pandas dataframe """
    with zip_file_obj.open(csv_file_name) as csv_member:
        csv_df = pd.read_csv(csv_member, skiprows=header_rows)
    return csv_df
    
    
print('Functions loaded.')

# 2. Search ESS-DIVE using Dataset API

Use the ESS-DIVE Dataset API to search for datasets of interest.

You can search for datasets using any of the following parameters:
- Dataset Creator (creator)
- Date Published (datePublished)
- Project Name (providerName)
- Any text (text)
- Keywords (keywords)
- Public datasets only (isPublic)

**See additional details for dataset search in the ESS-DIVE package API technical documentation:** https://api.ess-dive.lbl.gov/#/Data%20Package/listPackages.

Use the [ESS-DIVE's project list](https://docs.google.com/spreadsheets/d/179SOyv42wXbP4owWZtUg3RqhW9dPOyENYcVYuUCcqwg/edit?usp=sharing) to find the options for project names.
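
Search criteria you do not need can be left out of the query entirely. The following is a minimal sketch of one way to build the search URL while skipping empty criteria, assuming standard URL encoding of the values is acceptable to the API; the helper name `build_packages_url` is hypothetical and not part of this notebook's functions.

```python
from urllib.parse import urlencode

def build_packages_url(api_url, **criteria):
    """Build a /packages search URL, skipping criteria that are None or empty."""
    params = {k: v for k, v in criteria.items() if v not in (None, '')}
    query = urlencode(params)
    return f"{api_url}/packages?{query}" if query else f"{api_url}/packages"

example_url = build_packages_url(
    'https://api.ess-dive.lbl.gov',
    creator='Ely',
    text='"Leaf"',        # quoted string requests an exact match
    providerName='',      # empty, so it is left out of the query
    isPublic='true',
)
print(example_url)
```

The resulting string could be passed to `requests.get` in place of the hand-built f-string used in the cell below.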

In [None]:
# Enter search terms
# For an exact match, put the string in quotes, e.g. "\"Leaf\"" is an exact match, "Leaf" is any match

providerName="\"Next-Generation Ecosystem Experiments (NGEE) Arctic\""  # "\"<project name>\""
creator="Ely"
text= "\"Leaf\""
datePublished = "[2016 TO 2023]"  # "<[YYYY TO YYYY-MM-DD]>" # Not the same as data coverage

# ===================================
# ToDo: make construction of URL a function to handle empty search criteria

# Construct URL query to send to the ESS-DIVE packages API
get_packages_response = f"{essdive_api_url}/packages?providerName={providerName}&creator={creator}&text={text}&datePublished={datePublished}&isPublic=true"

# Send request to API
response = requests.get(get_packages_response, headers={"Authorization": f"Bearer {token}"})

# Review the response and debug if needed
if response.status_code == 200:
    # Success
    response_json = response.json()
    print("Success! Continue to look at the search results")  
else:
    # There was an error
    print("There was an error. Stop here and debug the issue. Email ess-dive-support@lbl.gov if you need assistance. \n")
    print(response.text)

### Inspect the search results
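
The search API returns at most 100 results per request, so larger searches must be fetched page by page. The following sketch shows only the bookkeeping, assuming the API pages by a 1-based starting row and a page size. The query parameter names (possibly `rowStart` and `pageSize`) are an assumption to confirm in the API documentation, and `fetch_page` is a hypothetical callable you would write around `requests.get`.

```python
def iter_row_starts(total, page_size=100, first_row=1):
    """Yield the starting row of each page needed to cover `total` records."""
    for row_start in range(first_row, first_row + total, page_size):
        yield row_start

def fetch_all(fetch_page, total, page_size=100):
    """Call fetch_page(row_start, page_size) for each page and combine the results."""
    results = []
    for row_start in iter_row_starts(total, page_size):
        results.extend(fetch_page(row_start, page_size))
    return results

# Stand-in for a real API call: returns fake record numbers for one page
def fake_fetch_page(row_start, page_size, total=250):
    last = min(row_start + page_size - 1, total)
    return list(range(row_start, last + 1))

row_starts = list(iter_row_starts(250))
all_records = fetch_all(fake_fetch_page, total=250)
print(row_starts, len(all_records))
```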

In [None]:
# ===================================
search_record_total = response_json['total']
print(f"Datasets found: {search_record_total}")

if search_record_total > 100:
    print("The search API cannot return more than 100 results at a time. See documentation for how to paginate.")

candidate_datasets = response_json['result']

for idx, dataset in enumerate(candidate_datasets):
    print('-------------------')
    print(f'Index: {idx}')
    print(dataset.get('dataset').get('name'))
    print(dataset.get('url'))
    print(dataset.get('viewUrl'))


In [None]:
# Optional: display entire response
# ===================================
display(response_json)

### Subset search results

In [None]:
record_indices = [6, 4, 5, 7]

# ===================================
datasets = [candidate_datasets[x] for x in record_indices]

for idx, dataset in enumerate(datasets):
    print(f"{idx}: {dataset.get('dataset').get('name')}")

### Get dataset details using ESS-DIVE Dataset API

Use the ESS-DIVE individual dataset search to get the details of each dataset, including its list of files.

The results of the above search contain the URLs to retrieve the dataset details in the field: `url`.

**See more details for the individual dataset search in the ESS-DIVE package API technical documentation:** https://api.ess-dive.lbl.gov/#/Dataset/getDataset.
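
The file list sits in the `distribution` field of the dataset details, where this notebook reads each file's `name`, `encodingFormat`, and `contentUrl`. The following sketch walks that structure on a mock record; the mock values and the helper name `list_distribution_files` are hypothetical, and real responses contain many more fields.

```python
def list_distribution_files(dataset_json):
    """Return (name, encodingFormat, contentUrl) for each file in a dataset's distribution."""
    files = []
    for file_info in dataset_json.get('distribution') or []:
        files.append((
            file_info.get('name'),
            file_info.get('encodingFormat'),
            file_info.get('contentUrl'),
        ))
    return files

# Mock record for illustration only
mock_dataset = {
    'name': 'Example dataset',
    'distribution': [
        {'name': 'flmd.csv', 'encodingFormat': 'text/csv',
         'contentUrl': 'https://example.org/flmd.csv'},
        {'name': 'readme.pdf', 'encodingFormat': 'application/pdf',
         'contentUrl': 'https://example.org/readme.pdf'},
    ],
}
file_summary = list_distribution_files(mock_dataset)
csv_names = [name for name, fmt, _ in file_summary if fmt and 'csv' in fmt]
print(csv_names)
```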

In [None]:
# ===================================
# Store the dataset details in a list
dataset_details = []

for dataset in datasets:
    dataset_url = dataset.get('url')
    dataset_detail_json = get_dataset_details(dataset_url)
    if dataset_detail_json:
        dataset_details.append(dataset_detail_json) 

print("=====================================")
print(f"Details acquired for {len(dataset_details)} datasets.")

In [None]:
# Optional: display dataset information
# (Un)comment options below

# print_datasets_info(dataset_details)
display(dataset_details[1])

# ===================================

### Determine which datasets have flmd

In [None]:
# ===================================
flmd_datasets, no_flmd_datasets = assess_datasets_flmd_dd_csv_files(dataset_details)

# 3. Inspect dataset using Dataset Details Distribution (without flmd)

List the datasets that do not have flmd files.

In [None]:
# ===================================
for idx, fd in enumerate(no_flmd_datasets):
    print(f"--- Index {idx}: {fd.get('name')}")

### Choose dataset to inspect using index above.

In [None]:
ds_idx = 0
file_type = 'all'  # 'all' or 'csv' or 'pdf' or 'zip'

# ===================================
inspect_dataset_distribution(no_flmd_datasets[ds_idx], file_type)

### Inspect the contents of a zip file

In [None]:
# Coming soon!! See Section 6 for example code and functions.

### Download file(s) to local directory

In [None]:
file_indices = [1]

# ===================================
ds_doi = download_selected_files(no_flmd_datasets[ds_idx], file_indices, local_dir)

In [None]:
# Optional: display the download file log for this DOI
# ===================================
print(f'Downloaded file information for {ds_doi}:')
display(download_file_log[ds_doi])

# 4. Inspect dataset contents using File-level Metadata (flmd)

### View flmd datasets

In [None]:
# ===================================
for idx, fd in enumerate(flmd_datasets):
    print(f"--- Index {idx}: {fd.get('name')}")

### Choose dataset to inspect

In [None]:
ds_idx = 2

# ===================================
dataset = flmd_datasets[ds_idx]
print_dataset_info(dataset, info_fields=['@id', 'name', 'flmd_url'], line_space=True)

### Select and read flmd

_If multiple flmd files exist in the dataset, run the cell below as many times as needed changing the index._
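
An flmd file is a csv with one row per dataset file. The following sketch parses such content into a dictionary keyed by file name, similar in spirit to calling `make_store` with `use_idx=False`. The sample text is hypothetical: only the `File_Name` column is taken from this notebook, and the other column and values are made up for illustration.

```python
import csv
import io

# Hypothetical flmd content for illustration
sample_flmd_text = (
    "File_Name,File_Description\n"
    "leaf_traits.csv,Leaf trait measurements\n"
    "leaf_traits_dd.csv,Data dictionary for leaf_traits.csv\n"
)

def index_flmd_by_filename(flmd_text):
    """Parse flmd csv text and return {File_Name: row}."""
    reader = csv.DictReader(io.StringIO(flmd_text))
    return {row['File_Name']: row for row in reader}

flmd_by_name = index_flmd_by_filename(sample_flmd_text)
print(list(flmd_by_name))
print(flmd_by_name['leaf_traits.csv']['File_Description'])
```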

In [None]:
flmd_file_idx = 0

# ===================================
flmd_name, flmd_url = list(dataset.get('flmd_url').items())[flmd_file_idx]
print(f"{flmd_name}: {flmd_url}")
print('-------------------------')

flmd_response = get_request(flmd_name, flmd_url)

flmd_headers, flmd_store = make_store(flmd_response)

### View dataset files listed in flmd

In [None]:
# File name automatically included. Enter additional flmd fields:
flmd_header_indices = [1, -2]

# ===================================
for idx, flmd_info in flmd_store.items():
    print(f"{idx}: {flmd_info.get(flmd_headers[0])}")
    for flmd_idx in flmd_header_indices:
        print(f"-- {flmd_headers[flmd_idx]}: {flmd_info.get(flmd_headers[flmd_idx])}")
    print(f"---------------------------")

# 5. Inspect dataset file contents using Data Dictionary

### Choose indices of file of interest and its corresponding Data Dictionary file to inspect below.

In [None]:
# Enter data file index
data_file_index = 0

# Enter Data Dictionary file index
dd_file_index = 1

# ===================================
dd_file_name = flmd_store[f"Index {dd_file_index}"].get('File_Name')
data_file_name = flmd_store[f"Index {data_file_index}"].get('File_Name')
print(f'Data File: {data_file_name}\n'
      f'Data Dictionary File: {dd_file_name}')

### Inspect data dictionary
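
The cell below treats the first three data dictionary columns as the column name, its units, and its description. The following sketch builds a name-to-units lookup from that layout, assuming the same column order; the header names and values in the sample text are hypothetical.

```python
import csv
import io

# Hypothetical data dictionary content for illustration
sample_dd_text = (
    "Column_or_Row_Name,Unit,Definition\n"
    "leaf_area,cm2,One-sided leaf area\n"
    "sample_date,YYYY-MM-DD,Date the leaf was collected\n"
)

def units_by_column(dd_text):
    """Map each described column name (first field) to its units (second field)."""
    reader = csv.reader(io.StringIO(dd_text))
    next(reader, None)  # skip the header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}

dd_units = units_by_column(sample_dd_text)
print(dd_units)
```

A lookup like this can help label axes or check units once the data file is loaded into a dataframe.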

In [None]:
# ===================================
data_files = dataset.get('csv_files')

if dd_file_name not in data_files.keys():
    print(f"Cannot find {dd_file_name} in dataset distribution.")
else:
    dd_url = data_files[dd_file_name]
    print(f"{dd_file_name}")
    print(f"{dd_url}")
    print('-------------------------')

    dd_request = get_request(dd_file_name, dd_url)
    dd_headers, dd_store = make_store(dd_request)
    print('-------------------------')

    for idx, dd_info in dd_store.items():
        print(f"{dd_info.get(dd_headers[0])} -- Units: {dd_info.get(dd_headers[1])} -- Desc: {dd_info.get(dd_headers[2])}")

### Load selected csv data file into pandas dataframe

In [None]:
# ===================================
if data_file_name not in data_files.keys():
    print(f"Cannot find {data_file_name} in dataset distribution.")
else:
    data_url = data_files[data_file_name]
    print(f"{data_file_name}")
    print(f"{data_url}")
    print('-------------------------')

    data_request = get_request(data_file_name, data_url, stream=False)
    
    data_df = pd.read_csv(io.StringIO(data_request.text))
    
    display(data_df)

In [None]:
# INSERT your custom analysis code here
# Python pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html#user-guide 

### Download selected csv data file

In [None]:
# ===================================
ds_doi = download_selected_files(dataset, [data_file_index], local_dir)

In [None]:
# Optional: display the download file log for this DOI
# ===================================
print(f'Downloaded file information for {ds_doi}:')
display(download_file_log[ds_doi])

# 6. Use Sample ID and Metadata Reporting Format

The example below starts with a search on the ESS-DIVE main search webpage: https://data.ess-dive.lbl.gov/

The dataset version/identifier of the desired dataset is entered below as the dataset_id.
* Find the dataset version in the upper left corner of the dataset's webpage next to the DOI.
* Or find the dataset identifier as the first field in the General metadata section (below the dataset files).

The search feature of the Dataset API illustrated in Section 2 above can also be used to find a dataset_id of interest. The dataset_id is the last part of the API URL shown in the results.

Example:
For the dataset detail url: https://api.ess-dive.lbl.gov/packages/ess-dive-f0861161a6bd3bf-20231109T125444193, the dataset_id is ess-dive-f0861161a6bd3bf-20231109T125444193.
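
As a small sketch of the rule above, the dataset_id can be taken as the last path segment of the detail URL. The helper name `dataset_id_from_url` is hypothetical.

```python
def dataset_id_from_url(detail_url):
    """Return the last path segment of a dataset detail URL."""
    return detail_url.rstrip('/').rsplit('/', 1)[-1]

detail_url = 'https://api.ess-dive.lbl.gov/packages/ess-dive-f0861161a6bd3bf-20231109T125444193'
example_dataset_id = dataset_id_from_url(detail_url)
print(example_dataset_id)
```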

## README

The code below performs **minimal** Sample ID and Metadata Reporting Format validation. Not all features may work if files do not adhere to the Reporting Format.

*Note: The sample_metadata.csv column names are not validated, so files that deviate from the Reporting Format can still be inspected.*

### Enter the dataset ID of interest

Example datasets:
* CSV files at top-level: ess-dive-2569191b32b447d-20230809T173212651
* Zipped files at top-level: ess-dive-120a44f1c8a626c-20230914T183544541

In [None]:
dataset_id = 'ess-dive-2569191b32b447d-20230809T173212651'

# ===================================
# Find dataset identifier from search above or via Search Webpage
dataset_details_url = f'https://api.ess-dive.lbl.gov/packages/{dataset_id}'

dataset_detail = get_dataset_details(dataset_details_url)

In [None]:
# Assess the dataset for flmd
# ===================================
flmd_datasets, no_flmd_datasets = assess_datasets_flmd_dd_csv_files([dataset_detail])

# Additional setup
# Set the default assumptions
is_csv_zipped = False
metadata_df = None
igsn_col_idx = None

### View the dataset csv files

Look for the sample metadata file.
It is a csv file that should have "sample_metadata" in the filename.
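
When a dataset has many csv files, a simple name filter can narrow down the candidates. This is a minimal sketch assuming the file follows the naming convention described above; the helper name and example file names are hypothetical.

```python
def find_sample_metadata_files(filenames):
    """Return (index, filename) pairs whose name contains 'sample_metadata' and ends in .csv."""
    return [
        (idx, fn) for idx, fn in enumerate(filenames)
        if 'sample_metadata' in fn.lower() and fn.lower().endswith('.csv')
    ]

# Hypothetical file names for illustration
example_files = ['readme.pdf', 'site_sample_metadata.csv', 'data.csv']
matches = find_sample_metadata_files(example_files)
print(matches)
```

The index in each pair corresponds to the position in the list you pass in, so applying it to `csv_index` would give values usable as `metadata_file_idx`.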

In [None]:
# ===================================
csv_files = dataset_detail.get('csv_files')

if not csv_files:
    print('No csv files. Try Zip File Option below.')

csv_index = []
idx = 0
for fn, url in csv_files.items():
    print(f'Index {idx}: {fn}\n{url}')
    csv_index.append(fn)
    idx += 1

### Select and load the sample metadata csv file

In [None]:
metadata_file_idx = 0

# ===================================
# get file_url
fn = csv_index[metadata_file_idx]
fn_url = csv_files.get(fn)

if not fn_url:
    print('Something is amiss! Could not find file_url. Try again.')
else:
    try:
        headers, metadata_df = make_pandas_df(fn_url, print_headers=False)
        print(f'{fn} was loaded as a pandas dataframe.')
        display(metadata_df)
    except Exception as e:
        print(f'Error while attempting to read {fn_url} into a pandas dataframe. Try again.\nError: {e}')


=========================================================================

## Zip File Option: Inspect zipped dataset contents

Use this option if no csv sample metadata file was found at the top level of the dataset. If one was found and loaded above, skip to the end of the Zip File section.
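
The zip option works on the archive in memory rather than saving it first. The following sketch shows that idea on a small archive built on the fly to stand in for a downloaded file; the member names, the `Sample_Name` column, and the helper name `list_zip_csv_members` are hypothetical.

```python
import io
from zipfile import ZipFile

def list_zip_csv_members(zip_bytes):
    """Return the names of .csv members inside a zip archive held in memory."""
    with ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist() if name.lower().endswith('.csv')]

# Build a small archive in memory to stand in for a downloaded zip file
buffer = io.BytesIO()
with ZipFile(buffer, mode='w') as zf:
    zf.writestr('docs/readme.txt', 'hello')
    zf.writestr('data/sample_metadata.csv', 'IGSN,Sample_Name\nIEXYZ0001,example\n')

zip_csv_members = list_zip_csv_members(buffer.getvalue())
print(zip_csv_members)
```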

### Inspect all dataset files if sample_metadata.csv is not found

In [None]:
# Run if sample_metadata csv file is not found

# ===================================
inspect_dataset_distribution(dataset_detail, 'all')

### Select zip file to inspect

In [None]:
# Run if sample_metadata is not found at the top-level of the dataset contents.
zip_file_idx = 7

# ===================================   
fn, zip_download = inspect_zip_file_contents(dataset_detail, zip_file_idx)

### Select csv file within zip file to inspect

In [None]:
# Run if csv file is zipped up
csv_file_idx = 2

# If needed adjust the number of rows to skip. The Sample ID and Metadata RF specifies 1 header row.
header_rows = 1

# ===================================
csv_file_name = zip_download.namelist()[csv_file_idx]
print(f'Attempting to read: {csv_file_name} from zip file {fn}')

metadata_df = read_zipped_csv(zip_download, csv_file_name, header_rows)

if metadata_df is not None:
    is_csv_zipped = True
    headers = list(metadata_df.columns)
    display(metadata_df)
else:
    print('ERROR: Sample metadata file was not successfully loaded.')

### End zip file section
=========================================================================

## Review sample metadata

### View sample metadata columns

In [None]:
# ===================================
if metadata_df is not None:
    print(f'Success! {fn} loaded as a pandas dataframe with the following column names:\n')
            
    for idx, header in enumerate(headers):
        print(f'Index {idx} --- {header}')
        if header == 'IGSN':
            igsn_col_idx = idx
            
    if igsn_col_idx is None:
        print('\nRequired column name "IGSN" was not found. The following code may not work.')
    else:
        print(f'\nRequired "IGSN" column {igsn_col_idx} was detected.')
else:
    print('A valid dataframe was not created. Please try again.')

### Select metadata columns to view

In [None]:
# Enter column indices from above
metadata_columns_idxs = [igsn_col_idx, 3, 19, 21, 30, 33, 32, 1, 5, 6]

# ===================================

display(metadata_df.iloc[:, metadata_columns_idxs])
print('==============================')
for col_idx in metadata_columns_idxs:
    print(f'Index {col_idx} --- {headers[col_idx]}')

### Inspect unique values in a specified column

In [None]:
metadata_column_idx = 20

# ===================================

metadata_col = headers[metadata_column_idx]

unique_df = metadata_df.iloc[:, [igsn_col_idx, metadata_column_idx]].groupby(metadata_col)

display(unique_df.count())
unique_values = list(unique_df.groups.keys())

print('========================================')
print(f'Unique values of metadata column {metadata_col}:')
for idx, val in enumerate(unique_values):
    print(f'Index {idx} -- {val}')

### Select a subset of the metadata based on a unique value

In [None]:
# Enter values of interest
value_idxs = [0]

# ===================================
subset_values = [x for idx, x in enumerate(unique_values) if idx in value_idxs]

subset_df = metadata_df[metadata_df[metadata_col].isin(subset_values)]

display(subset_df)

### Download sample_metadata file

In [None]:
# ===================================

if not is_csv_zipped:
    fn = csv_index[metadata_file_idx]
    all_file_idx = None

    for idx, filename in enumerate(dataset_detail.get('distribution')):
        if filename.get('name') == fn:
            all_file_idx = idx
            break
    # Compare against None so that a match at index 0 is not treated as missing
    if all_file_idx is not None:
        ds_doi = download_selected_files(dataset_detail, [all_file_idx], local_dir)
    else:
        print('Could not find requested file.')
else:
    ds_doi = download_selected_files(dataset_detail, [zip_file_idx], local_dir, is_csv_zipped=is_csv_zipped, 
                                     zip_download=zip_download, zip_member_fn=csv_file_name)

In [None]:
# Optional: display the download file log for this DOI
# ===================================
print(f'Downloaded file information for {ds_doi}:')
display(download_file_log[ds_doi])

# 7. Save Download File Log

If desired, change the save location and the log file name.
Otherwise the local_dir configured at the beginning of the notebook will be used.
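
The log is a dictionary keyed by dataset id, and each entry holds a list of `(filename, url, datetime)` tuples under `downloaded_files`. The following sketch flattens that structure into one row per downloaded file, in the same column order the cell below writes. The example entry and the helper name `log_to_rows` are hypothetical.

```python
import csv
import io

# Hypothetical log entry following the structure built by download_selected_files
example_log = {
    'doi:10.0000/example': {
        '@id': 'doi:10.0000/example',
        'name': 'Example dataset',
        'citation': ['First citation', 'Second citation'],
        'downloaded_files': [
            ('data.csv', 'https://example.org/data.csv', '2023-11-09T12:00:00'),
        ],
    }
}

def log_to_rows(log):
    """Flatten the log into one row per downloaded file."""
    rows = []
    for ds_id, info in log.items():
        # Join multiple citations into one string, or use 'None' if there are none
        citation = ' -AND- '.join(info.get('citation') or []) or 'None'
        for fn, fn_url, access_ts in info.get('downloaded_files', []):
            rows.append([ds_id, fn, access_ts, fn_url, info.get('name'), citation])
    return rows

log_rows = log_to_rows(example_log)
out = io.StringIO()
csv.writer(out).writerows(log_rows)
print(out.getvalue())
```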

In [None]:
# Optional: display the download file log
display(download_file_log)

In [None]:
# Optional: change the directory location to save the file

save_dir = local_dir  # Path('<enter_alternative_dir_path_here>')
log_filename = 'essdive_downloaded_files_log.csv'

# ===================================

log_fn_path = save_dir / log_filename

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(log_fn_path, mode='w', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['dataset_id', 'file_name', 'access_datetime', 'access_url', 'dataset_name', 'citation'])
    
    for ds_id, log_info in download_file_log.items():
        ds_name = log_info.get('name')
        ds_citation_ls = log_info.get('citation')
        
        # deal with the list of citations
        ds_citation = None
        if ds_citation_ls:
            for iref, ref in enumerate(ds_citation_ls):
                if iref == 0:
                    ds_citation = ref
                    continue
                ds_citation = f'{ds_citation} -AND- {ref}'
        if ds_citation is None:
            ds_citation = 'None'
        
        accessed_file_list = log_info.get('downloaded_files')
        for accessed_file in accessed_file_list:
            fn, fn_url, access_ts = accessed_file
            
            csv_writer.writerow([ds_id, fn, access_ts, fn_url, ds_name, ds_citation])
            
print(f'Check {str(save_dir)} for the log file: {log_filename}')

That's a wrap!