# Data Repository Service (DRS) URI Access Examination


In this notebook, we get all the DRS URIs from the data table and check if we can access the file each DRS URI points to.

This notebook will:
1. Grab all DRS URIs from the data table
2. Resolve each DRS URI and check our access to the file
3. Create a report with all the DRS URIs and our access level to each of them

## Background


The Data Repository Service (DRS) API is a standardized set of access methods that are agnostic to cloud infrastructure. Developed by the Global Alliance for Genomics and Health (GA4GH), DRS enables researchers to access data regardless of the underlying architecture of the repository (e.g. Google Cloud, Azure, or AWS) in which it is stored. Terra supports accessing data through DRS. To learn more, see: https://support.terra.bio/hc/en-us/articles/360039330211-Data-Access-with-the-GA4GH-Data-Repository-Service-DRS-

## Setup 

To run code from the terra-notebook-utils, tabulate, and tqdm libraries, we first have to install the packages and restart the kernel.

1. Run the cell below
2. Go to the Kernel tab in the menu at the top and select Restart
3. Click Restart in the pop-up menu

In [None]:
# Installing the terra-notebook-utils, tabulate, and tqdm libraries
!pip install terra-notebook-utils
!pip install tabulate
!pip install tqdm -U

## Set DRS URI Column Location 

The column that holds the DRS URIs varies from data table to data table. The default for this notebook is BioData Catalyst data. Based on your data, you can choose from biodata-catalyst, anvil, and tcga. If your data does not come from one of these three sources, please type the name of the column where the DRS URIs are stored.
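
As a rough sketch of how this setting is interpreted, the snippet below maps a preset name to its column and falls back to treating any other value as a column name. The helper name `pick_search_key` and the custom column `my_drs_column` are hypothetical; the preset-to-column pairs are the ones this notebook uses later on.

```python
# Preset-to-column pairs taken from this notebook; the helper itself is hypothetical.
PRESET_COLUMNS = {
    "biodata-catalyst": "object_id",
    "anvil": "object_id",
    "tcga": "uuid_and_filename",
}


def pick_search_key(setting: str) -> str:
    """Return the preset's column name, or the setting itself for custom columns."""
    return PRESET_COLUMNS.get(setting, setting)


print(pick_search_key("tcga"))           # a preset resolves to its column
print(pick_search_key("my_drs_column"))  # anything else is used as-is
```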

In [None]:
# Set variable to either "biodata-catalyst", "anvil", "tcga", or "<INSERT DRS URIs' column name>"
columns_name = "biodata-catalyst"

## Set Head Variable

The head variable is optional. Setting it to True runs the notebook on the first 10 DRS URIs instead of all the DRS URIs in the data table.
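
One possible way to keep only the first entries of a dictionary is `itertools.islice`, which relies on dictionaries preserving insertion order (Python 3.7+). This is a sketch with made-up sample data; `first_n_items` is a hypothetical helper and not part of this notebook.

```python
from itertools import islice


def first_n_items(mapping: dict, n: int = 10) -> dict:
    """Return a new dict holding at most the first n items of mapping."""
    return dict(islice(mapping.items(), n))


# Made-up DRS URIs, for illustration only
sample = {f"drs://example.org/file-{i}": {"row_id": i} for i in range(25)}
subset = first_n_items(sample)
print(len(subset))
```

Unlike removing items in place, this leaves the original dictionary untouched and also works when it holds fewer than `n` entries.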

In [None]:
# Change to True if you want to run the notebook on only the first 10 DRS URIs in the data table
head = False

# Grab all DRS URIs from the data table

By default, the DRS URI is stored under the object_id column in a table.

Using the firecloud library, we search every table in the workspace for the object_id column (or the column selected above). If a table contains that column, we check that its content is a string that begins with 'drs://'.
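
The check described here can be tried offline on a few mock entities. The sketch below assumes each entity is a dictionary with `entityType`, `name`, and `attributes` keys, which is the shape this notebook reads; the helper `find_drs_uris` and the sample values are hypothetical.

```python
def find_drs_uris(entities, search_key="object_id"):
    """Collect DRS URIs from the first column whose name contains search_key."""
    found = {}
    for entity in entities:
        matching = [value for name, value in entity["attributes"].items()
                    if search_key in name]
        # Keep the value only if it is a string that starts with 'drs://'
        if matching and isinstance(matching[0], str) and matching[0].startswith("drs://"):
            found[matching[0]] = {
                "table_name": entity["entityType"],
                "row_id": entity["name"],
                "drs_uri": matching[0],
            }
    return found


# Mock entities, for illustration only
entities = [
    {"entityType": "sample", "name": "s1", "attributes": {"object_id": "drs://example.org/abc"}},
    {"entityType": "sample", "name": "s2", "attributes": {"object_id": "gs://bucket/file"}},
    {"entityType": "participant", "name": "p1", "attributes": {"age": 42}},
]
found = find_drs_uris(entities)
print(list(found))
```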

In [None]:
import json
import os
import requests
import sys
import threading
from datetime import datetime
from firecloud import api as fapi
from subprocess import Popen, PIPE
from tabulate import tabulate
from terra_notebook_utils import drs
import concurrent.futures
from tqdm.notebook import tqdm
import pandas as pd


# Getting the namespace/project and name of the workspace
namespace = os.environ['WORKSPACE_NAMESPACE']
workspace = os.environ['WORKSPACE_NAME']

# Getting every entity (row) of every table in the workspace
data_tables_list = fapi.get_entities_with_type(namespace, workspace).json()

# Mapping each dataset preset to the column that holds its DRS URIs; any other
# value of columns_name is used as the column name itself
# TODO: follow up on "ga4gh_drs_uri"
data_column_map = {"biodata-catalyst": "object_id", "anvil": "object_id", "tcga": "uuid_and_filename"}
column_name_search_key = data_column_map.get(columns_name, columns_name)

# Creating the master DRS URIs dictionary 
DRS_uris_dict = {}

# Going through each row of each table and pulling out the contents of the DRS URI column
for table in data_tables_list:
    table_columns = table["attributes"]
    object_id_column = {name: value for name, value in table_columns.items()
                        if column_name_search_key in name}

    # Checking that the column's content is a string that starts with 'drs://'
    if object_id_column:
        DRS_uri = next(iter(object_id_column.values()))
        if isinstance(DRS_uri, str) and DRS_uri.startswith('drs://'):
            DRS_uris_dict[DRS_uri] = {'table_name': table['entityType'],
                                      'row_id': table['name'],
                                      'drs_uri': DRS_uri}

# If head is True, keeping only the first 10 DRS URIs found in the data table
if head:
    for _ in range(len(DRS_uris_dict) - 10):
        DRS_uris_dict.popitem()

# Outputting the number of DRS URIs found
print(f"Found {len(DRS_uris_dict)} DRS URIs in this workspace")

# Resolve DRS URIs and Checking File Access

DRS creates a unique ID mapping that allows for flexible file retrieval. The unique mapping is the DRS Uniform Resource Identifier (URI): a string of characters that uniquely identifies a particular cloud-based resource (similar to a URL) and is agnostic to the cloud infrastructure where the resource physically exists. To learn where the file physically lives in the cloud, we must resolve the DRS URI through a backend service called Martha, which maps the DRS URI to the file's Google bucket path.
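
To make the shape of a DRS URI concrete, here is a minimal sketch that splits a hostname-based URI of the form `drs://<hostname>/<id>` with the standard library. The helper `split_drs_uri` and the example URI are hypothetical, and other DRS URI forms are not handled here.

```python
from urllib.parse import urlsplit


def split_drs_uri(uri: str):
    """Split a hostname-based DRS URI into (hostname, object id)."""
    parts = urlsplit(uri)
    if parts.scheme != "drs":
        raise ValueError(f"Not a DRS URI: {uri!r}")
    return parts.netloc, parts.path.lstrip("/")


# Made-up URI, for illustration only
host, object_id = split_drs_uri("drs://drs.example.org/314159")
print(host, object_id)
```

Note that nothing in the URI says which cloud holds the file; that information only comes back from the resolver.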

In this step, we check whether each DRS URI can be resolved and whether we can access the file it points to in the Google bucket. If a DRS URI can't be resolved, we record the error, which is shown in the report in the last step.

One of the most common reasons a DRS URI cannot be resolved is "Fence is not linked". If you see this error, it is probably because the data is controlled-access. To use controlled-access data on Terra, you need to link your Terra user ID to your authorization account (such as a dbGaP account). Linking to external servers allows Terra to automatically determine whether you can access controlled datasets hosted in Terra (e.g. TCGA, TOPMed) based on your valid dbGaP applications. To learn more, see: https://support.terra.bio/hc/en-us/articles/360038086332

## Creating Resolve DRS URI Function 

The `resolve_drs_uri` function resolves a given DRS URI and checks access to the file it points to.
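
The step that picks the Google bucket path out of the resolver's answer can be exercised without network access. The sketch below assumes the response nests its URLs under `dos` → `data_object` → `urls`, which is the layout this notebook reads; the helper `pick_gs_url` and the mock response are hypothetical, and the helper returns the first `gs://` entry it finds.

```python
def pick_gs_url(martha_response: dict):
    """Return the first gs:// URL in a Martha-style response, or None."""
    urls = martha_response.get("dos", {}).get("data_object", {}).get("urls", [])
    for entry in urls:
        url = entry.get("url", "")
        if url.startswith("gs://"):
            return url
    return None


# Mock response, for illustration only
mock_response = {
    "dos": {
        "data_object": {
            "urls": [
                {"url": "s3://example-bucket/sample.cram"},
                {"url": "gs://example-bucket/sample.cram"},
            ]
        }
    }
}
print(pick_gs_url(mock_response))
```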


In [None]:
def resolve_drs_uri(DRS_uri):
    
    # Getting the access_token
    access_token = !gcloud auth print-access-token
    
    # Getting DRS URI dictionary information
    DRS_uri_information = DRS_uris_dict[DRS_uri]
    
    # Creating the drs_unresolved_error element in the DRS URI information dictionary
    DRS_uri_information['drs_unresolved_error'] = None
    
    # Creating the file_access_error element in the DRS URI information dictionary
    DRS_uri_information['file_access_error'] = None
    
    # Calling Martha to resolve the DRS URI
    martha_request = requests.post("https://us-central1-broad-dsde-prod.cloudfunctions.net/martha_v2",
                                    data = {'url': DRS_uri},
                                    headers={"authorization": "Bearer " + access_token[0]} )

    # Getting the response from martha in json format
    martha_response = json.loads(martha_request.text)
    
    # Assigning the file path to None (It will be overwritten if a file path is found)
    DRS_uri_information['file_path'] = None
    
    # Checking if the resolving failed
    if martha_request.status_code != 200:
        # Saving the error if resolving failed
        error_json = json.loads(martha_response['response']['text'])
        error = '"' + error_json['error']['message'] + '"'
        
        # Setting is_resolved to False for the DRS URI
        DRS_uri_information['is_resolved'] = False
        
        # Saving error to the DRS URI information dictionary
        DRS_uri_information['drs_unresolved_error'] = [error]
        
        # Saving error to the master errors list
        unresolved_DRS_uris_errors_list.append(error)
        
    else:
        # Saving the google bucket file path if resolving passed
        DRS_uri_information['is_resolved'] = True
        
        # Get the file path that start with 'gs://'
        for values in martha_response['dos']['data_object']['urls']:
            if 'gs://' in values['url']:
                # Saving file path to the DRS URI information dictionary
                DRS_uri_information['file_path'] = values['url']
    
    # Declaring the counters as global so the increments below update the
    # notebook-level variables instead of raising UnboundLocalError
    global files_with_access, files_without_access
    
    # Checking access to file by reading the first 2 bytes
    try:
        # `%capture` only exists as a cell magic (`%%capture`), so drs.head is called directly
        drs.head(DRS_uri, num_bytes=2)
        
        # Setting access_to_file to True if no errors arise
        access_to_file = True
        
        # Increasing global variable `files_with_access`
        files_with_access = files_with_access + 1
    except:
        # Saving the error if failed: the text after "Error:" when present,
        # otherwise the whole message
        error = '"' + str(sys.exc_info()[1]).split("Error:")[-1] + '"'
        
        # Saving error to the DRS URI information dictionary
        DRS_uri_information['file_access_error'] = [error]
        
        # Saving error to the master errors list
        files_without_access_errors_list.append(error)
        
        # Setting access_to_file to False for the DRS URI
        access_to_file = False
        
        # Increasing global variable `files_without_access`
        files_without_access = files_without_access + 1
    
    # Recording whether we got access to the file
    DRS_uri_information['file_access'] = access_to_file


## Running the Resolve DRS URI Function

Here we run every DRS URI found through the `resolve_drs_uri` function, in parallel threads.
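
One way to avoid sharing counters between threads is to have each task return its result and to tally in the main thread as futures complete. The sketch below shows that pattern with a hypothetical stand-in, `fake_check`, in place of a real resolver; the URIs and the access rule are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_check(uri):
    """Hypothetical stand-in for a resolver: reports access for URIs ending in '0'."""
    return {"drs_uri": uri, "file_access": uri.endswith("0")}


uris = [f"drs://example.org/file-{i}" for i in range(20)]
results = {}

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(fake_check, uri): uri for uri in uris}
    for future in as_completed(futures):
        uri = futures[future]
        try:
            # result() re-raises any exception from the worker thread
            results[uri] = future.result()
        except Exception as exc:
            results[uri] = {"drs_uri": uri, "file_access": False, "error": str(exc)}

# Tallying happens in the main thread only, so no lock is needed
files_with_access_sketch = sum(r["file_access"] for r in results.values())
print(files_with_access_sketch, "of", len(results))
```

Calling `future.result()` also surfaces exceptions that would otherwise stay hidden inside the future.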

In [None]:
# Announcing the start of resolving the DRS URIs
print("Resolving " + str(len(DRS_uris_dict)) + " DRS URIs")

# Assigning the count variable to keep track of resolved DRS URIs
resolved_DRS_uris_count = 0

# Assigning the unresolved DRS URIs errors list variable
unresolved_DRS_uris_errors_list = []

# Assigning the files without access errors list variable
files_without_access_errors_list = []

# Assigning the count variable to keep track of files with access
files_with_access = 0

# Assigning the count variable to keep track of files without access
files_without_access = 0

# Setting drs.logger.propagate to False to avoid unnecessary logging
drs.logger.propagate = False

# Running the resolve_drs_uri function in threads for efficiency
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Submitting one task per DRS URI
    future_resolve_DRS_uri = {executor.submit(resolve_drs_uri, DRS_uri): DRS_uri for DRS_uri in DRS_uris_dict.keys()}

    # Creating a progress bar that updates each time a DRS URI has been processed
    for future in tqdm(concurrent.futures.as_completed(future_resolve_DRS_uri), total=len(DRS_uris_dict)):
        # Increasing the processed DRS URIs count
        resolved_DRS_uris_count = resolved_DRS_uris_count + 1 
        
# Outputting the processed DRS URIs count after every DRS URI has been tried
print(str(resolved_DRS_uris_count) + "/" + str(len(DRS_uris_dict)) + " completed... Done")

# Create a report

The report has two parts. The first part prints summary statistics: the number of DRS URIs found in the workspace, how many were resolved and how many were not, the number of files with and without access, and the distinct errors recorded for unresolved DRS URIs and for files without access.

The second part is a table of the DRS URI data with the columns table_name, row_id, drs_uri, drs_unresolved_error, file_access_error, file_path, is_resolved, and file_access.
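
If it is useful to see how often each distinct error occurred, `collections.Counter` can group the messages. This is a sketch only: `summarize_errors` is a hypothetical helper and the sample messages are made up.

```python
from collections import Counter


def summarize_errors(errors):
    """Join distinct error messages with their counts, most frequent first."""
    counts = Counter(errors)
    return ", ".join(f"{message} (x{n})" for message, n in counts.most_common()) or "None"


# Made-up error messages, for illustration only
sample_errors = ['"Fence is not linked"', '"Fence is not linked"', '"Not found"']
print(summarize_errors(sample_errors))
print(summarize_errors([]))
```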



In [None]:
# Checking if there is unresolved_DRS_URIs_errors to report
if unresolved_DRS_uris_errors_list:
    unresolved_DRS_uris_errors = ", ".join(set(unresolved_DRS_uris_errors_list))
else:
    unresolved_DRS_uris_errors  = "None"
    
# Checking if there is files_without_access_errors to report
if files_without_access_errors_list:
    files_without_access_errors = ", ".join(set(files_without_access_errors_list))
else:
    files_without_access_errors  = "None"

# Setting the title for the report
title = "Data Repository Service (DRS) URI Access Examination Report"

# Creating the statistics for the report
stats = f'''

DRS URIs found in the workspace: {len(DRS_uris_dict)} \n
        
DRS URIs resolved: {len(DRS_uris_dict)-len(unresolved_DRS_uris_errors_list)}
        
Files with access: {files_with_access}
 
DRS URIs not resolved: {len(unresolved_DRS_uris_errors_list)}
        
Files without access: {files_without_access}
        

Errors found from DRS URIs that are not resolved: {len(unresolved_DRS_uris_errors_list)}
        
Distinct errors from DRS URIs that are not resolved: 
        
{unresolved_DRS_uris_errors}


Errors found from files without access: {len(files_without_access_errors_list)}
        
Distinct errors from files without access: 
        
{files_without_access_errors}
'''

# Outputting statistics
print(f'''
_______________________________________________________________________

:: {title} ::
_______________________________________________________________________

{stats}''')



# Outputting the table of the master DRS URIs dictionary
pd.DataFrame.from_dict(DRS_uris_dict).transpose()


## (Optional) Save results to the Google Bucket

To save the results as an HTML file in the workspace's Google bucket, please run the cell below.
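
The file name and the console link follow a simple pattern: a timestamped file name, and the bucket ID (the bucket path without its `gs://` prefix) appended to the Cloud Console browser URL. The sketch below reproduces that pattern with a hypothetical helper, `report_locations`, and a made-up bucket name.

```python
from datetime import datetime

CONSOLE_PREFIX = "https://console.cloud.google.com/storage/browser/_details/"


def report_locations(workspace_bucket: str, now: datetime):
    """Return (file name, console URL) for a report written at time `now`."""
    stamp = now.strftime("%Y_%m_%d_%H_%M_%S")
    file_name = f"DRS_URI_Access_Examination_{stamp}.html"
    # Dropping only a leading gs:// so the rest of the bucket path is untouched
    bucket_id = workspace_bucket[len("gs://"):] if workspace_bucket.startswith("gs://") else workspace_bucket
    return file_name, f"{CONSOLE_PREFIX}{bucket_id}/{file_name}"


# Made-up bucket name and a fixed time, for illustration only
name, url = report_locations("gs://fc-example-bucket", datetime(2021, 1, 2, 3, 4, 5))
print(name)
print(url)
```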

In [None]:
# Converting Report to HTML format 
html_report = f'''<html lang="en">
<head>
<title>{title}</title>''' + '''
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
* {
  box-sizing: border-box;
}

body {
  font-family: Arial, Helvetica, sans-serif;
}

/* Style the header */
header {
  background-color: #73af42;
  padding: 30px;
  text-align: center;
  font-size: 35px;
  color: white;
}

/* Create boxes that floats next to each other */
article {
  float: left;
  padding: 20px;
  width: 100%;
  background-color: #f7f7f7;
  height: 300px; /* only for demonstration, should be removed */
}

/* Style the list inside the menu */
article ul {
  list-style-type: none;
  padding: 0;
}

/* Clear floats after the columns */
section:after {
  content: "";
  display: table;
  clear: both;
}

/* Style the footer */
footer {
  background-color: white;
  padding: 10px;
  text-align: center;
  color: black;
}

/* Responsive layout - makes the two columns/boxes stack on top of each other instead of next to each other, on small screens */
@media (max-width: 600px) {
  nav, article {
    width: 100%;
    height: auto;
  }
}
</style>
</head>
<body>
''' + f'''
<header>
  <h2>{title}</h2>
</header>
<section>
''' + f'''
  <article>
    <h2>
        <ul>
          <li>DRS URIs found in the workspace: {len(DRS_uris_dict)}</li>
          <li>DRS URIs resolved: {len(DRS_uris_dict)-len(unresolved_DRS_uris_errors_list)}</li>
          <li>Files with access: {files_with_access}</li>
          <li>DRS URIs not resolved: {len(unresolved_DRS_uris_errors_list)}</li>
          <li>Files without access: {files_without_access}</li>
          <li>Errors found from DRS URIs that are not resolved: {len(unresolved_DRS_uris_errors_list)}</li>
          <li>Distinct errors from DRS URIs that are not resolved: {unresolved_DRS_uris_errors}</li>
          <li>Errors found from files without access: {len(files_without_access_errors_list)}</li>
          <li>Distinct errors from files without access: {files_without_access_errors}</li>
        </ul>
      </h2>
  </article>
</section>
<footer>
  <p>{tabulate(pd.DataFrame.from_dict(DRS_uris_dict).transpose(), headers='keys', tablefmt="html")}</p>
</footer>

</body>
</html>
'''

# Getting today's date and time
today = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')

# HTML File Name
html_file_name = f"DRS_URI_Access_Examination_{today}.html"

# Creating the DRS_URI_Access_Examination.html file
with open(html_file_name, "w") as f:
    f.write(html_report)

# Getting the bucket path of the workspace
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

# Getting the bucket id of the workspace
bucket_id = str(WORKSPACE_BUCKET).replace("gs://", "")

# Copying the DRS_URI_Access_Examination.html file to the google bucket
!gsutil cp {html_file_name} {WORKSPACE_BUCKET}/ 2>&1

# Printing the path to the report file in the Google bucket
print("\nTo view the results, go to https://console.cloud.google.com/storage/browser/_details/" + bucket_id + "/" + html_file_name)
