### <center>Update TCGA Workspace Data Model with Compact DRS URLs/Identifier</center>

##### Description
    This notebook updates single-entity data model tables (e.g., participant, sample, and pair) by converting file paths in the older "drs://dataguids.org" format to the newer compact DRS URL format, drs://dg.4DFC:UUID. The notebook identifies the columns whose values point to a file location through the dataguids.org pointer, writes an updated .tsv file, and, when dry_run is False, uploads that file to update the data model.
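
As a minimal sketch of the conversion described above, the helper below rewrites a single cell value. The function name `to_compact_drs` and the constant names are hypothetical and chosen only for illustration. The GUID pattern and the two prefixes are the ones used in the notebook's update cell.

```python
import re

# GUID pattern and prefixes as used in the notebook's update cell
GUID_PATTERN = re.compile(r'^[\da-f]{8}-([\da-f]{4}-){3}[\da-f]{12}$', re.IGNORECASE)
OLD_PREFIX = "drs://dataguids.org/"
COMPACT_PREFIX = "drs://dg.4DFC:"


def to_compact_drs(value):
    """Return the compact DRS form of a dataguids.org URL, or the value unchanged."""
    # leave empty cells and non-DRS values alone
    if not isinstance(value, str) or not value.startswith(OLD_PREFIX):
        return value
    # the GUID is the first path segment after the host
    guid = value[len(OLD_PREFIX):].split("/")[0]
    if GUID_PATTERN.match(guid):
        return COMPACT_PREFIX + guid
    # malformed GUIDs are left untouched rather than guessed at
    return value


example = "drs://dataguids.org/01234567-89ab-cdef-0123-456789abcdef"
print(to_compact_drs(example))  # drs://dg.4DFC:01234567-89ab-cdef-0123-456789abcdef
```

Values that do not start with the dataguids.org prefix, or whose GUID fails validation, pass through unchanged, which mirrors how the update cell only touches eligible columns.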


##### Options
    The dry_run option (default = True) prints the changes that would be made to each table before any data table is modified. The output shows the data table name, the columns that would change, and the workspace bucket location of the updated .tsv. Users can examine and verify the changes, then set dry_run = False and re-run the cells to apply the updates to the data model tables.
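
To illustrate the kind of report a dry run produces, the sketch below lists the columns in a TSV that contain dataguids.org URLs without modifying anything. The function name `columns_with_old_urls` and the sample table are hypothetical. Unlike the notebook's update cell, this simplified check does not validate the GUID.

```python
import csv
from io import StringIO


def columns_with_old_urls(tsv_text, prefix="drs://dataguids.org/"):
    """Return the sorted names of columns holding at least one old-style DRS URL."""
    reader = csv.DictReader(StringIO(tsv_text), delimiter="\t")
    found = set()
    for row in reader:
        for col, value in row.items():
            # short rows are padded with None, so check the type first
            if isinstance(value, str) and value.startswith(prefix):
                found.add(col)
    return sorted(found)


# made-up table with one eligible column
sample_tsv = (
    "entity:sample_id\tbam\tnotes\n"
    "s1\tdrs://dataguids.org/01234567-89ab-cdef-0123-456789abcdef\tok\n"
    "s2\tgs://some-bucket/s2.bam\tok\n"
)
print(columns_with_old_urls(sample_tsv))  # ['bam']
```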
    

##### Execution
    1. Set dry_run = True or dry_run = False and execute cell (Shift + Enter).
    2. Run each following cell once the preceding cell has completed.
        [*] denotes a cell that is not finished executing.

##### Notes
    The set entity (participant_set, sample_set, and pair_set) data model tables are not modified by this notebook. Set tables reference the unique IDs of their members, and those IDs are not changed, so the set tables need no updates.
    

In [1]:
# variable that allows user to run script and look at the updated .tsv files before updating data model
# set dry_run to "False" and re-run script to perform actual update of data model with DRS URLs

# DEFAULT: dry_run is set to True and will list the columns in each table that will be updated
#          it will also provide the location of the .tsv files with the DRS url updates for inspection

dry_run = True

In [2]:
# Imports relevant packages. (Shift + Enter) to execute.

import os
import json
import re
from firecloud import api as fapi
import pandas as pd
from io import StringIO
import csv
import pprint
from collections import OrderedDict

In [4]:
# Sets up workspace environment variables. (Shift + Enter) to execute.

ws_project = os.environ['WORKSPACE_NAMESPACE']
ws_name = os.environ['WORKSPACE_NAME']
ws_bucket = os.environ['WORKSPACE_BUCKET']

# print(ws_project + "\n" + ws_name + "\n" + "bucket: " + ws_bucket)

In [2]:
# Gets list of single entity types in workspace that need DRS URL updates. (Shift + Enter) to execute.
    
# API call to get all entity types in workspace
res_etypes = fapi.list_entity_types(ws_project, ws_name)
dict_all_etypes = json.loads(res_etypes.text)

# get non-set entities and add to list
# "set" entities do not need to be updated because they only reference the unique ID of each single entity
# the unique ID of any single entity is not modified so sets should remain the same
single_etypes_list = [key for key in dict_all_etypes.keys() if not key.endswith("_set")]

print("List of entity types that will be updated, if applicable:")
print('\n'.join(['\t' * 7 + c for c in single_etypes_list]))

In [3]:
# Updates the data model, for single entity types, with DRS URLs. (Shift + Enter) to execute.

# set guid pattern for guid validation
guid_pattern = re.compile(r'^[\da-f]{8}-([\da-f]{4}-){3}[\da-f]{12}$', re.IGNORECASE)

for etype in single_etypes_list:
    print(f'Starting TCGA DRS updates for entity: {etype}')
    
    # get entity table response for API call
    res_etype = fapi.get_entities_tsv(ws_project, ws_name, etype, model="flexible")
    
    # Save current/original data model tsv files to the bucket for provenance
    print(f'Saving original {etype} TSV to {ws_bucket}')
    df = pd.read_csv(StringIO(res_etype.text), sep="\t")
    original_tsv_name = "original_" + etype + "_table.tsv"
    # write data frames to .tsv files
    df.to_csv(original_tsv_name, sep="\t", index=False)
    !gsutil cp $original_tsv_name $ws_bucket 2> stdout
    
    # read entity table response into dictionary to perform DRS URL updates 
    dict_etype = list(csv.DictReader(StringIO(res_etype.text), delimiter='\t'))

    # create empty list to add updated rows and list to capture list of columns that were modified
    drs_dict_table = []
    modified_cols = []
    # for "row" (each row is [list] of column:values)
    for row in dict_etype:
        drs_row = row.copy()
        # for each column in row
        for col, value in row.items():
            # skip empty cells (csv.DictReader pads short rows with None)
            if not isinstance(value, str):
                continue
            # check if the col values are dataguids.org URLs and parse out guid
            if value.startswith("drs://dataguids.org/"):
                guid = value.split("/")[3]
                # only modify col if guid is valid and exists
                if guid_pattern.match(guid):
                    drs_row[col] = "drs://dg.4DFC:" + guid
                    modified_cols.append(col)

        # append new "row" with updated drs values to new list
        drs_dict_table.append(drs_row)

    # nothing to write or upload if the table has no rows
    if not drs_dict_table:
        print(f'No rows found in the {etype} table; skipping.' + "\n")
        continue

    # set output file name and write the tsv file once, after all rows are processed
    updated_tsv_name = "updated_" + etype + "_table.tsv"
    tsv_headers = drs_dict_table[0].keys()

    with open(updated_tsv_name, 'w', newline='') as outfile:
        # get keys from the first row and write rows, separate with tabs
        writer = csv.DictWriter(outfile, tsv_headers, delimiter="\t")
        writer.writeheader()
        writer.writerows(drs_dict_table)
    
    print(f'Saving DRS URL updated {etype} TSV to {ws_bucket}')
    !gsutil cp $updated_tsv_name $ws_bucket 2> stdout
    
    modified_cols = list(set(modified_cols))
    if dry_run:
        print(f'The following columns in the {etype} table *will be* updated:')
        if not modified_cols:
            print('\t' * 4 + f"No columns to update in the {etype} table." + "\n\n")
        else:
            print('\n'.join(['\t' * 4 + c for c in modified_cols]))
            print(f'To view what will be updated, inspect the {updated_tsv_name} file in the workspace bucket, {ws_bucket}.' + "\n\n")
    else:
        # upload newly created tsv file containing drs urls
        response = fapi.upload_entities_tsv(ws_project, ws_name, updated_tsv_name, model="flexible")
        if response.status_code != 200:
            print(f"Could not update existing {etype} table. Error message: {response.text}")
        else:
            print(f'Finished uploading TCGA DRS updated .tsv for entity: {etype}' + "\n")
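
When inspecting the original and updated .tsv files side by side, a per-column count of changed cells can be a quick sanity check. The sketch below is one possible way to do that. The function name `count_changed_cells` and the sample tables are hypothetical, and it assumes both files have the same rows in the same order, as the files written by the cell above do.

```python
import csv
from collections import Counter
from io import StringIO


def count_changed_cells(original_text, updated_text):
    """Count, per column, how many cells differ between two TSV texts."""
    original = list(csv.DictReader(StringIO(original_text), delimiter="\t"))
    updated = list(csv.DictReader(StringIO(updated_text), delimiter="\t"))
    changes = Counter()
    # rows are compared pairwise, so both tables must share the same row order
    for before, after in zip(original, updated):
        for col in before:
            if before[col] != after.get(col):
                changes[col] += 1
    return dict(changes)


# made-up tables: only the first row's bam value was converted
original_tsv = (
    "entity:sample_id\tbam\n"
    "s1\tdrs://dataguids.org/01234567-89ab-cdef-0123-456789abcdef\n"
    "s2\tgs://some-bucket/s2.bam\n"
)
updated_tsv = (
    "entity:sample_id\tbam\n"
    "s1\tdrs://dg.4DFC:01234567-89ab-cdef-0123-456789abcdef\n"
    "s2\tgs://some-bucket/s2.bam\n"
)
print(count_changed_cells(original_tsv, updated_tsv))  # {'bam': 1}
```

The ID column should never appear in the result, since entity IDs are not modified.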