# Data Repository Service (DRS) URI Access Examination


In this notebook, we get all the DRS URIs from the data table and check if we can access the file each DRS URI points to. 

This notebook will:
1. Grab all DRS URIs from the data table
2. Resolve each DRS URI and check our access to the file
3. Create a report with all the DRS URIs and our access level to each of them

This notebook runs successfully in the default Terra Cloud Environment with 4 CPUs.
For better performance when testing a large number of DRS URIs in a workspace (> 10,000), we recommend increasing the number of Cloud Environment CPUs to 16.

## Background


The Data Repository Service (DRS) API is a standardized set of access methods that are agnostic to cloud infrastructure. Developed by the Global Alliance for Genomics and Health (GA4GH), DRS enables researchers to access data regardless of the underlying architecture of the repository (e.g., Google Cloud, Azure, AWS) in which it is stored. Terra supports accessing data through the GA4GH DRS standard. To learn more, see: https://support.terra.bio/hc/en-us/articles/360039330211-Data-Access-with-the-GA4GH-Data-Repository-Service-DRS-
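
To make the shape of a DRS URI concrete, here is a minimal sketch that splits a hostname-based DRS URI into its server and object ID using only the standard library. The helper name `split_drs_uri` and the example URI are hypothetical and are not part of terra-notebook-utils. Compact-identifier style URIs are not handled here.

```python
from urllib.parse import urlparse


def split_drs_uri(uri):
    """Split a hostname-based DRS URI into (host, object_id)."""
    parsed = urlparse(uri)
    if parsed.scheme != "drs":
        raise ValueError(f"Not a DRS URI: {uri!r}")
    # netloc is the DRS server; the path without its leading slash is the object ID
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, for illustration only
host, object_id = split_drs_uri("drs://example.org/abc-123")
print(host, object_id)
```

The notebook itself never needs to parse the URI this way; resolution is delegated to the DRS tooling described in the following sections.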

## Setup 

To run code from the terra-notebook-utils, tabulate, and tqdm libraries, we first have to install the packages and restart the kernel.

1. Run the cell below.
2. Go to the Kernel tab in the menu at the top and select Restart.
3. Click Restart in the pop-up menu.

In [17]:
# Installing the terra-notebook-utils, tabulate, and tqdm libraries
!pip install --upgrade --no-cache-dir git+https://github.com/DataBiosphere/terra-notebook-utils
!pip install tabulate
!pip install tqdm -U

# Configuring the notebook log level so that the access result for each DRS URI is logged
import logging
%config Application.log_level = "INFO"  
log = logging.getLogger()

## Set DRS URI Column Location 

The column that holds the DRS URIs varies from data table to data table. Based on your data, you can choose from biodata-catalyst, anvil, and tcga; the cell below is currently set to tcga. For biodata-catalyst and anvil, only the object_id column is searched. For any other value, including tcga or a custom dataset name, every column of every data table is searched for DRS URIs.

In [2]:
# set variable to the dataset name ("biodata-catalyst", "anvil", or a custom dataset name) to determine which columns will be parsed for DRS URIs
# custom dataset values = all columns of all data tables are parsed for DRS URIs
dataset_name = "tcga"


## Set Head Variable

The head variable is an optional variable that lets you run the notebook on the first 10 DRS URIs instead of all the DRS URIs in the data table.

In [3]:
# Change to True if you want to run the notebook on only the first 10 DRS URIs in the data table
head = False
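
As a sketch of what "the first 10" means for a dictionary keyed by DRS URI, the snippet below truncates a mapping to its first `n` entries in insertion order. The helper name `first_n_items` and the sample URIs are hypothetical.

```python
from itertools import islice


def first_n_items(mapping, n=10):
    """Return a new dict holding only the first n items, in insertion order."""
    return dict(islice(mapping.items(), n))


# Hypothetical data: 25 made-up DRS URIs
all_uris = {f"drs://example.org/{i}": {"row_id": i} for i in range(25)}
subset = first_n_items(all_uris, 10)
print(len(subset))
```

Because Python dictionaries preserve insertion order, the subset holds the first 10 URIs that were added.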

# Grab all DRS URIs from the data table

For BioData Catalyst and AnVIL data, the DRS URI is stored in the object_id column of a table.

Using the firecloud library, we look through the columns selected for the chosen dataset in every table of the workspace. For each value in those columns, we check whether it is a string that starts with 'drs://'.

In [4]:
# set imports and environment variables
import json
import os
import requests
import sys
import threading
from pytz import timezone
from datetime import datetime
from firecloud import api as fapi
from subprocess import Popen, PIPE
from tabulate import tabulate
from terra_notebook_utils import drs, table
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from tqdm.notebook import tqdm
import pandas as pd


# Getting the namespace/project and name of the workspace
namespace = os.environ['WORKSPACE_NAMESPACE']
workspace = os.environ['WORKSPACE_NAME']

In [16]:
# for each of the listed datasets, define names of columns that should be parsed for DRS URIs
# datasets may be known to contain DRS URIs only in a single column (such as anvil)
# create + assign dataset names and assoc. column names to search for DRS URIs.
# TODO: Ask Michael about "ga4gh_drs_uri"
dataset_column_map = {"biodata-catalyst":["object_id"], "anvil":["object_id"]}

# API call: get every entity (row) and its associated attributes (columns) for every etype (table)
# returns: list of dictionaries - each item = 1 entity and its attrs, across all tables in the workspace
all_entities_dict = fapi.get_entities_with_type(namespace, workspace).json()

# if dataset is not in pre-defined list, get all unique col names across workspace data model
# add dataset_name:[all_unique_cols] to the dataset_column_map dictionary
if dataset_name not in ["biodata-catalyst", "anvil"]:
    all_unique_cols = set()
    for entity in all_entities_dict:
        all_unique_cols.update(entity["attributes"].keys())
    dataset_column_map[dataset_name] = list(all_unique_cols)
    

# creating the master DRS URIs dictionary 
drs_uris_dict = {}

# for each item in response (one entity/row of all rows from all tables)
for entity in all_entities_dict:
    # get entity's attributes - dictionary of {col_name: value}
    entity_columns = entity["attributes"]
    # keep only the columns listed in the dataset map for the specified dataset name
    # dictionary of {col_name: value}
    matched_col_dict = {col: val for col, val in entity_columns.items() if col in dataset_column_map[dataset_name]}
    # if any column matches, check each value of column is a string and starts with "drs://"
    if matched_col_dict:
        for col, val in matched_col_dict.items():
            if isinstance(val, str) and val.startswith('drs://'):
                drs_uris_dict[val] = {'table_name': entity['entityType'], 
                                       'row_id' : entity['name'], 
                                       'drs_uri' : val}

# total number of DRS URIs found in the workspace
print(f"Found {len(drs_uris_dict)} DRS URIs in this workspace")
# If head is True, keep only the first 10 DRS URIs found in the data table
if head:
    drs_uris_dict = dict(list(drs_uris_dict.items())[:10])
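
The column scan above can be tried without a workspace. The following is a minimal sketch that runs the same check (string value starting with `drs://`) over made-up rows shaped like the entities the notebook reads. The helper name `collect_drs_uris` and all sample values are hypothetical.

```python
def collect_drs_uris(entities, columns=None):
    """Return {drs_uri: info} for string attribute values starting with 'drs://'.

    If `columns` is None, every column is searched.
    """
    found = {}
    for entity in entities:
        for col, val in entity["attributes"].items():
            if columns is not None and col not in columns:
                continue
            if isinstance(val, str) and val.startswith("drs://"):
                found[val] = {
                    "table_name": entity["entityType"],
                    "row_id": entity["name"],
                    "drs_uri": val,
                }
    return found


# Hypothetical rows, for illustration only
sample_entities = [
    {"entityType": "sample", "name": "s1",
     "attributes": {"object_id": "drs://example.org/aaa", "age": 42}},
    {"entityType": "sample", "name": "s2",
     "attributes": {"cram": "drs://example.org/bbb", "note": "n/a"}},
]

print(len(collect_drs_uris(sample_entities, columns=["object_id"])))
print(len(collect_drs_uris(sample_entities)))
```

Restricting the search to `object_id` finds only the first row's URI, while searching every column also picks up the URI stored under `cram`.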

# Resolve DRS URIs and Check File Access

DRS creates a unique ID mapping that allows for flexible file retrieval. The unique mapping is the DRS Uniform Resource Identifier (URI) - a string of characters that uniquely identifies a particular cloud-based resource (similar to a URL) and is agnostic to the cloud infrastructure where it physically exists. To learn where the file physically exists on the cloud, we must resolve the DRS URI through a backend service called Martha, which maps the DRS URI to the file's Google bucket path.

In this step, we check whether each DRS URI can be resolved and whether we can access the file it points to in the Google bucket. If a DRS URI can't be resolved, we record the error, which is shown later in the report in the last step.

One of the most common reasons a DRS URI cannot be resolved is "Fence is not linked". If you see this error, it is probably because the data is controlled-access. To use controlled-access data on Terra, you will need to link your Terra user ID to your authorization account (such as a dbGaP account). Linking to external servers allows Terra to automatically determine whether you can access controlled datasets hosted in Terra (e.g., TCGA, TOPMed) based on your valid dbGaP applications. Go to this link to learn more: https://support.terra.bio/hc/en-us/articles/360038086332

## Creating Resolve DRS URI Function 

The `resolve_drs_uri` function resolves a given DRS URI and checks access to the file it points to.


In [6]:
def resolve_drs_uri(drs_url):
    try:
        # try to check access to the file by reading its first bytes
        drs.head(drs_url)
        
        # Logging if the call succeeded
        log.info(f"SUCCESS  : {drs_url}")
        
        # Recording that there was no error in the DRS URI information dictionary
        drs_uris_dict[drs_url]['drs_unresolved_error'] = "None"
    except drs.GSBlobInaccessible as e:
        # Logging if the call failed
        log.warning(f"NO ACCESS: {drs_url}")
        # Saving the error message as a string if the call failed
        error = str(e)
        if " Error:" in error:
            error = error.split(" Error:")[1] 
            if "ProviderUser" in error:
                error = error.split("ProviderUser")[0]
        
        # Outputting the error to console
        print(error)
        
        # Outputting the extra error info for the "Fence account not linked" error to console
        if "Fence account not linked" in error:
            print('''One of the most common reasons the DRS URIs can not resolve is "Fence is not linked". 
If you have this error, it is probably because the data is controlled-access. 
To use controlled-access data on Terra, you will need to link your Terra user ID to your authorization account (such as a dbGaP account). 
Linking to external servers will allow Terra to automatically determine if you can access controlled datasets hosted in Terra (ex. TCGA, TOPMed, etc.) based on your valid dbGaP applications. 
Go to this link to learn more: https://support.terra.bio/hc/en-us/articles/360038086332''')
        
        # Saving error to the DRS URI information dictionary
        drs_uris_dict[drs_url]['drs_unresolved_error'] = error
    
    # Returning the DRS URI information dictionary 
    return {drs_url : drs_uris_dict[drs_url]}
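
The message trimming inside `resolve_drs_uri` can be checked in isolation. Below is a minimal sketch of that step as a standalone helper; the name `trim_error_message` and the sample message are hypothetical, and the real text raised by terra-notebook-utils may be shaped differently. Unlike the cell above, this sketch also strips surrounding whitespace.

```python
def trim_error_message(message):
    """Keep the text after ' Error:' and drop anything from 'ProviderUser' onward."""
    if " Error:" in message:
        message = message.split(" Error:", 1)[1]
        if "ProviderUser" in message:
            message = message.split("ProviderUser", 1)[0]
    return message.strip()


# Hypothetical message, for illustration only
raw = "Request failed. Error: Fence account not linked ProviderUser: someone"
print(trim_error_message(raw))
```

A message without the ` Error:` marker is returned unchanged apart from the whitespace strip, so the helper is safe to apply to every failure.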


## Running the Resolve DRS URI Function

Here we run every DRS URI found through the `resolve_drs_uri` function in parallel batches.

In [15]:
# Outputting the start of resolving DRS URIs
print("Resolving "+ str(len(drs_uris_dict)) + " DRS uris.")

# Suppress DRS resolution info messages
drs.logger.disabled = True 

# Setting the number of processes and the number of threads per process
NUMBER_OF_PROCESSES = 20
THREADS_PER_PROCESS = 10

# creating threads per batch of drs uris
def resolve_single_drs_batch(batch_drs_list):
    print(f"Attempting to resolve the following DRS URIs in this batch: {batch_drs_list}")
    # Creating temp DRS URI information dictionary 
    temp_drs_uris_dict = {}
    
    # Creating threads for each DRS URI in the batch
    with ThreadPoolExecutor(max_workers=THREADS_PER_PROCESS) as e:
        for temp_dict in e.map(resolve_drs_uri, batch_drs_list):
            # Updating DRS URI information dictionary 
            temp_drs_uris_dict.update(temp_dict)
    
    # Returning the DRS URI information dictionary 
    return temp_drs_uris_dict


# creating batches of DRS URIs, each holding at most THREADS_PER_PROCESS URIs
# [[batch1_drs1, batch1_drs2, batch1_drs3], [batch2_drs1, batch2_drs2]...]
batches = [list(drs_uris_dict.keys())[i:i+THREADS_PER_PROCESS] for i in range(0, len(drs_uris_dict.keys()), THREADS_PER_PROCESS)]

# Creating a pool of processes; each process resolves one batch at a time
with ProcessPoolExecutor(max_workers=NUMBER_OF_PROCESSES) as e:
    for batch_result in tqdm(e.map(resolve_single_drs_batch, batches), total=len(batches)):
        # Updating DRS URI information dictionary 
        drs_uris_dict.update(batch_result)
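
The batching pattern used above can be sketched without any network calls. In the example below, `make_batches`, `check_batch`, and `fake_check` are hypothetical names, and `fake_check` is a stand-in for the real resolution step. Only the thread level is shown; the notebook additionally spreads the batches across processes.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def check_batch(batch, check_one, max_threads):
    """Run check_one on every item of a batch using a thread pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map yields results in the same order as the input batch
        for item, outcome in zip(batch, pool.map(check_one, batch)):
            results[item] = outcome
    return results


def fake_check(uri):
    # Stand-in for the real access check: no network access happens here
    return "None" if uri.endswith("0") else "no access"


uris = [f"drs://example.org/{i}" for i in range(25)]
batches = make_batches(uris, 10)
results = {}
for batch in batches:
    results.update(check_batch(batch, fake_check, max_threads=10))
print(len(batches), len(results))  # prints: 3 25
```

Threads suit this workload because each check spends most of its time waiting on a remote service rather than computing.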

# Create a report

In this report, we first print statistics about the DRS URIs: how many were found in the workspace, how many were resolved, and how many were not. The statistics also list the distinct errors from the DRS URIs that could not be resolved or whose files we cannot access.

The second part of the report is a table of the DRS URI data with the columns table_name, row_id, drs_uri, and drs_unresolved_error.



In [13]:
# collecting the errors from unresolved DRS URIs, if there are any to report
unresolved_drs_uris_errors_list = []
for value in drs_uris_dict.values():
    if value['drs_unresolved_error'] != 'None':
        unresolved_drs_uris_errors_list.append(value['drs_unresolved_error'])
    
if unresolved_drs_uris_errors_list:
    unresolved_drs_uris_errors = ", ".join(set(unresolved_drs_uris_errors_list))
    if any("Fence account not linked" in errors for errors in unresolved_drs_uris_errors_list):
        unresolved_drs_uris_errors =  unresolved_drs_uris_errors + ''' ~ One of the most common reasons the DRS URIs can not resolve is "Fence is not linked". 
If you have this error, it is probably because the data is controlled-access. 
To use controlled-access data on Terra, you will need to link your Terra user ID to your authorization account (such as a dbGaP account). 
Linking to external servers will allow Terra to automatically determine if you can access controlled datasets hosted in Terra (ex. TCGA, TOPMed, etc.) based on your valid dbGaP applications. 
Go to this link to learn more: https://support.terra.bio/hc/en-us/articles/360038086332'''
else:
    unresolved_drs_uris_errors = "None"
    
    
# Setting the title for the report
title = "Data Repository Service (DRS) URI Access Examination Report"

# Creating the statistics for the report
stats = f'''

Number of DRS URIs found in the workspace: {len(drs_uris_dict)} \n      

Number of DRS URIs successfully resolved: {len(drs_uris_dict)-len(unresolved_drs_uris_errors_list)} 

Number of DRS URIs not resolved: {len(unresolved_drs_uris_errors_list)}       
Errors from DRS URIs that were not successfully resolved: {len(unresolved_drs_uris_errors_list)}
        
Distinct errors from DRS URIs that are not resolved: {unresolved_drs_uris_errors}
'''

# Outputting statistics
print(f'''
_______________________________________________________________________

:: {title} ::
_______________________________________________________________________

{stats}''')



# Outputting the table of the master DRS URIs dictionary
pd.DataFrame.from_dict(drs_uris_dict).transpose()
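
The counts in the report can also be derived with a small helper. This is a minimal sketch on made-up results; the name `summarize` and the sample entries are hypothetical, and the `"None"` string marks a resolved URI, as in the cells above.

```python
from collections import Counter


def summarize(results):
    """Count resolved and unresolved entries and tally each distinct error."""
    errors = [
        info["drs_unresolved_error"]
        for info in results.values()
        if info["drs_unresolved_error"] != "None"
    ]
    return {
        "found": len(results),
        "resolved": len(results) - len(errors),
        "unresolved": len(errors),
        "by_error": Counter(errors),
    }


# Hypothetical results, for illustration only
sample_results = {
    "drs://example.org/a": {"drs_unresolved_error": "None"},
    "drs://example.org/b": {"drs_unresolved_error": "Fence account not linked"},
    "drs://example.org/c": {"drs_unresolved_error": "Fence account not linked"},
}
summary = summarize(sample_results)
print(summary["found"], summary["resolved"], summary["unresolved"])
```

Tallying errors with `Counter` shows how often each distinct error occurs, which the plain set of distinct errors in the report does not.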

## (Optional) Save results to the Google Bucket

To save the results as an HTML file in the workspace's Google bucket, please run the cell below.

In [14]:
# Converting Report to HTML format 
html_report = f'''<html lang="en">
<head>
<title>{title}</title>''' + '''
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
* {
  box-sizing: border-box;
}

body {
  font-family: Arial, Helvetica, sans-serif;
}

/* Style the header */
header {
  background-color: #73af42;
  padding: 30px;
  text-align: center;
  font-size: 35px;
  color: white;
}

/* Create boxes that floats next to each other */
article {
  float: left;
  padding: 20px;
  width: 100%;
  background-color: #f7f7f7;
  height: 300px; /* only for demonstration, should be removed */
}

/* Style the list inside the menu */
article ul {
  list-style-type: none;
  padding: 0;
}

/* Clear floats after the columns */
section:after {
  content: "";
  display: table;
  clear: both;
}

/* Style the footer */
footer {
  background-color: white;
  padding: 10px;
  text-align: center;
  color: white;
}

/* Responsive layout - makes the two columns/boxes stack on top of each other instead of next to each other, on small screens */
@media (max-width: 600px) {
  nav, article {
    width: 100%;
    height: auto;
  }
}
</style>
</head>
<body>
''' + f'''
<header>
  <h2>{title}</h2>
</header>
<section>
''' + f'''
  <article>
    <h2>
        <ul>
          <li>DRS URIs found in the workspace: {len(drs_uris_dict)}</li>
          <li>DRS URIs resolved: {len(drs_uris_dict)-len(unresolved_drs_uris_errors_list)}</li>
          <li>DRS URIs not resolved: {len(unresolved_drs_uris_errors_list)}</li>
          <li>Errors found from DRS URIs that are not resolved: {len(unresolved_drs_uris_errors_list)}</li>
          <li>Distinct errors from DRS URIs that are not resolved: {unresolved_drs_uris_errors}</li>
        </ul>
      </h2>
  </article>
</section>
<footer>
  <p>{tabulate(pd.DataFrame.from_dict(drs_uris_dict).transpose(), headers='keys', tablefmt="html")}</p>
</footer>

</body>
</html>
'''

# Getting Timezone
tz = timezone('EST')

# Getting Today Date and Time
today = datetime.now(tz).strftime('%Y_%m_%d_%H_%M_%S')

# HTML File Name
html_file_name = f"DRS_URI_Access_Examination_{today}.html"

# Creating the DRS_URI_Access_Examination.html file
with open(html_file_name, "w") as f:
    f.write(html_report)

# Getting the bucket path of the workspace
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

# Getting the bucket id of the workspace
bucket_id = str(WORKSPACE_BUCKET).replace("gs://", "")

# Copying the DRS_URI_Access_Examination.html file to the google bucket
!gsutil cp {html_file_name} {WORKSPACE_BUCKET}/ 2>&1

# Printing path to the report file in the google bucket
print("\n To view results, go to https://console.cloud.google.com/storage/browser/_details/" + bucket_id + "/" + html_file_name)
