### Data that DRS might provide about a file

Use case: Provide ~~meta~~data* about a file identified by a given DRS id. See the [GitHub issue](https://github.com/ga4gh/data-repository-service-schemas/issues/336).

The realistic scenarios for this use case still need to be identified. In particular, when does it arise, as opposed to the scenario where file ids are obtained from a data-driven query that reflects the user's interest?

Key points are that

* Data Connect and the custom APIs of various platforms already provide data relevant to this use case
* Rather than create new data for this use case, we would expect to use the same schemas that those APIs use
* Reusing those schemas provides rich functionality

That is, rather than DRS determining what metadata should be provided about a file, we should tap into the schemas that already exist.

*Note: the term metadata is deliberately avoided here. For one thing, we need to be more specific about which use cases are in scope and what data would be useful.

## Query to get some DRS ids
The following example queries a copy of the Institute for Systems Biology Cancer Genomics Cloud (ISB-CGC) BigQuery tables to get a DRS id. The specific tables queried use the release 24 data from the Genomic Data Commons.

In [32]:
from fasp.search import DataConnectClient
from fasp.loc import crdcDRSClient

import json
# TCGA Query - CRDC
searchClient = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com/')

query = """
    SELECT  file_id drs_id
    FROM search_cloud.cshcodeathon.gdc_rel24_filedata_active 
    where data_format = 'BAM'
    and project_disease_type = 'Breast Invasive Carcinoma'
    limit 1"""
res = searchClient.runQuery(query)

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________


### Accessing the ISB-CGC file data via the Data Connect API
One way of retrieving the data is to use the same table as the first query. The query is simply reversed: it filters on the DRS id and returns the data held about that file.
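
A minimal sketch of building that reversed query, assuming the file ids in this table are UUIDs (as the example id is). The helper name is hypothetical and not part of fasp; it only builds the query text, which would then be passed to searchClient.runQuery as in the cells of this notebook.

```python
import uuid

# Table name taken from the queries in this notebook.
FILE_TABLE = "search_cloud.cshcodeathon.gdc_rel24_filedata_active"


def build_file_lookup_query(drs_id: str, table: str = FILE_TABLE) -> str:
    """Return SQL that selects every column for a single file id.

    The id is parsed as a UUID first (an assumption about the id format),
    so that arbitrary text is never interpolated into the SQL string.
    """
    checked_id = str(uuid.UUID(drs_id))  # raises ValueError if not a UUID
    return f"SELECT * FROM {table} WHERE file_id = '{checked_id}'"


lookup_query = build_file_lookup_query("030e5e74-6461-4f05-a399-de8e470bc056")
print(lookup_query)
```

Validating the id before formatting is a design choice: a DRS id often arrives from user input or another service, and a malformed value then fails early instead of reaching the query engine.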

In [38]:
query = """
    SELECT * 
    FROM search_cloud.cshcodeathon.gdc_rel24_filedata_active 
    where file_id = '030e5e74-6461-4f05-a399-de8e470bc056' """
res2 = searchClient.runQuery(query, returnType='json')
print(json.dumps(res2,indent=3))

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
[
   {
      "dbname": "active",
      "file_gdc_id": "030e5e74-6461-4f05-a399-de8e470bc056",
      "access": "controlled",
      "acl": "phs000178",
      "analysis_input_file_gdc_ids": null,
      "analysis_workflow_link": "https://github.com/NCI-GDC/cocleaning-cwl",
      "analysis_workflow_type": "BWA with Mark Duplicates and Cocleaning",
      "archive_gdc_id": null,
      "archive_revision": null,
      "archive_state": null,
      "archive_submitter_id": null,
      "associated_entities__case_gdc_id": "1b703058-e596-45bc-80fe-8b98d545c2e2",
      "associated_entities__entity_gdc_id": "7b0b60c7-5fa0-440e-937f-8d82119330d6",
      "associated_entities__entity_submitter_id": "TCGA-AR-A2LK-01A-11D-A17W-09",
      "associated_entities__entity_type": "aliquot",
      "case_gdc_id": "1b70305

### Metadata via the Gen3 API - An example of basic data about a file.

For the purposes of this exercise a getFileData method was added to crdcDRSClient (the Gen3 client). Given a DRS id, it uses the Gen3 API to retrieve data.

The data returned is the same data used in the original search. For example, the data_format 'BAM' returned below is exactly the value used in the query.

In [33]:
drsClient = crdcDRSClient()
for id in res:
    print(json.dumps(drsClient.getFileData(id[0]), indent=3))
    print('_'*80)

    

{
   "id": "030e5e74-6461-4f05-a399-de8e470bc056",
   "data_format": "BAM",
   "access": "controlled",
   "file_name": "46db33a7f2003837e88d0a81b8ebec2c_gdc_realn.bam",
   "data_type": "Aligned Reads",
   "data_category": "Sequencing Reads",
   "type": "aligned_reads",
   "experimental_strategy": "WXS",
   "platform": "Illumina",
   "created_datetime": "2016-05-03T00:35:52.946132-05:00",
   "file_size": 23894757370
}
________________________________________________________________________________


### Gen3 Example - Getting linked objects - Case, Indexes, Downstream Analyses, ...

More detailed information is also available. The getFileData method can be asked to retrieve the ids of related objects. The Case, for example, would provide clinical and other data about the person from whose sample the data in the file was derived.

Making use of existing schemas means that a DRS id can be a jumping-off point into the graph. A user could fulfill any use case that the graph allows. Tapping into the existing schemas provides rich functionality with little additional effort.
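
As a sketch of what "jumping off into the graph" could look like in code, the hypothetical helper below walks a response shaped like the expanded getFileData output in this notebook and gathers the ids of the linked objects. The sample values are copied from that output; treating every key ending in _id as a link is an assumption made for illustration.

```python
def collect_related_ids(file_data: dict) -> dict:
    """Map each relation name to the list of object ids it links to."""
    related = {}
    for relation, value in file_data.items():
        if isinstance(value, dict):
            linked = [value]          # a single linked object
        elif isinstance(value, list):
            linked = value            # several linked objects
        else:
            continue                  # plain attribute such as data_format
        ids = [v for obj in linked if isinstance(obj, dict)
               for k, v in obj.items() if k.endswith("_id")]
        if ids:
            related[relation] = ids
    return related


# Values copied from the expanded output shown in this notebook.
expanded_sample = {
    "id": "030e5e74-6461-4f05-a399-de8e470bc056",
    "data_format": "BAM",
    "cases": [{"case_id": "1b703058-e596-45bc-80fe-8b98d545c2e2"}],
    "associated_entities": [
        {"entity_type": "aliquot",
         "entity_id": "7b0b60c7-5fa0-440e-937f-8d82119330d6"}
    ],
    "downstream_analyses": [
        {"analysis_id": "e05199a1-d6ad-4d31-a616-dcc6fb057216"},
        {"analysis_id": "d0cc7c6d-675c-401e-b751-60c68b4436e5"},
        {"analysis_id": "90e83350-f2cf-49bc-910e-5331a2cea795"},
        {"analysis_id": "702c782f-ce48-41c4-b7f2-c97ba26bc8a0"},
    ],
    "analysis": {"analysis_id": "35eb6a6c-76d7-4568-ae0a-45734676c43e"},
}

related = collect_related_ids(expanded_sample)
for relation, ids in related.items():
    print(relation, len(ids))
```

Each id collected this way could be the starting point for a further query against whichever schema describes that object type.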

In [4]:
for id in res:
    fileData = drsClient.getFileData(id[0], expanded=True)
    print(json.dumps(fileData, indent=3))
    print('_'*80)



{
   "id": "030e5e74-6461-4f05-a399-de8e470bc056",
   "data_format": "BAM",
   "access": "controlled",
   "cases": [
      {
         "case_id": "1b703058-e596-45bc-80fe-8b98d545c2e2"
      }
   ],
   "associated_entities": [
      {
         "entity_type": "aliquot",
         "entity_id": "7b0b60c7-5fa0-440e-937f-8d82119330d6"
      }
   ],
   "file_name": "46db33a7f2003837e88d0a81b8ebec2c_gdc_realn.bam",
   "data_category": "Sequencing Reads",
   "downstream_analyses": [
      {
         "analysis_id": "e05199a1-d6ad-4d31-a616-dcc6fb057216"
      },
      {
         "analysis_id": "d0cc7c6d-675c-401e-b751-60c68b4436e5"
      },
      {
         "analysis_id": "90e83350-f2cf-49bc-910e-5331a2cea795"
      },
      {
         "analysis_id": "702c782f-ce48-41c4-b7f2-c97ba26bc8a0"
      }
   ],
   "type": "aligned_reads",
   "analysis": {
      "analysis_id": "35eb6a6c-76d7-4568-ae0a-45734676c43e"
   },
   "platform": "Illumina",
   "created_datetime": "2016-05-03T00:35:52.946132-05:00",


### Gen3 example - Getting attributes of related objects
The getFileData method was also written to allow the attributes of the related objects to be returned.

In [5]:
for id in res:
    fileData = drsClient.getFileData(id[0], linked=True)
    print(json.dumps(fileData, indent=3))
    print('_'*80)

{
   "id": "030e5e74-6461-4f05-a399-de8e470bc056",
   "data_format": "BAM",
   "access": "controlled",
   "cases": [
      {
         "case_id": "1b703058-e596-45bc-80fe-8b98d545c2e2",
         "project": {
            "disease_type": "Breast Invasive Carcinoma"
         },
         "diagnoses": [
            {
               "days_to_recurrence": null,
               "morphology": "8520/3",
               "tumor_stage": "stage iii",
               "created_datetime": null,
               "tissue_or_organ_of_origin": "Breast, NOS",
               "primary_diagnosis": "Lobular carcinoma, NOS",
               "age_at_diagnosis": 22800,
               "classification_of_tumor": "not reported",
               "prior_malignancy": "no",
               "site_of_resection_or_biopsy": "Breast, NOS",
               "days_to_last_known_disease_status": null,
               "tumor_grade": "not reported",
               "progression_or_recurrence": "not reported"
            }
         ],
         

### Using the related data

In this example the file id for the related Index file can be retrieved through DRS.

In [6]:
drsClient.getObject("18fb79bb-7259-41da-bd76-9dd9f8f84bfc")

{'access_methods': [{'access_id': 'gs',
   'access_url': {'url': 'gs://gdc-tcga-phs000178-controlled/18fb79bb-7259-41da-bd76-9dd9f8f84bfc/TCGA-A1-A0SD-01A-11D-A10Y-09_IlluminaGA-DNASeq_exome_gdc_realn.bai'},
   'region': '',
   'type': 'gs'},
  {'access_id': 's3',
   'access_url': {'url': 's3://tcga-2-controlled/18fb79bb-7259-41da-bd76-9dd9f8f84bfc/TCGA-A1-A0SD-01A-11D-A10Y-09_IlluminaGA-DNASeq_exome_gdc_realn.bai'},
   'region': '',
   'type': 's3'}],
 'aliases': [],
 'checksums': [{'checksum': 'dd3b9e4fa8a85cc18c413e8b5b58e252',
   'type': 'md5'}],
 'contents': [],
 'created_time': '2018-08-08T17:11:24.583780',
 'description': None,
 'form': 'object',
 'id': '18fb79bb-7259-41da-bd76-9dd9f8f84bfc',
 'mime_type': 'application/json',
 'name': None,
 'self_uri': 'drs://nci-crdc.datacommons.io/18fb79bb-7259-41da-bd76-9dd9f8f84bfc',
 'size': 6700896,
 'updated_time': '2018-08-08T17:11:24.583791',
 'version': '3c6bf46f'}
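
A DRS object like the one above lists one access method per storage location. A minimal sketch of selecting one by type follows; the helper name is hypothetical, and the sample dictionary is a trimmed copy of the output above.

```python
from typing import Optional


def find_access_url(drs_object: dict, access_type: str) -> Optional[str]:
    """Return the URL of the first access method of the given type, if any."""
    for method in drs_object.get("access_methods", []):
        if method.get("type") == access_type:
            # Some servers return only an access_id and no access_url,
            # in which case there is no URL to return here.
            return (method.get("access_url") or {}).get("url")
    return None


# Trimmed copy of the DRS object returned for the index file above.
index_object = {
    "id": "18fb79bb-7259-41da-bd76-9dd9f8f84bfc",
    "access_methods": [
        {"access_id": "gs",
         "access_url": {"url": "gs://gdc-tcga-phs000178-controlled/18fb79bb-7259-41da-bd76-9dd9f8f84bfc/TCGA-A1-A0SD-01A-11D-A10Y-09_IlluminaGA-DNASeq_exome_gdc_realn.bai"},
         "region": "",
         "type": "gs"},
        {"access_id": "s3",
         "access_url": {"url": "s3://tcga-2-controlled/18fb79bb-7259-41da-bd76-9dd9f8f84bfc/TCGA-A1-A0SD-01A-11D-A10Y-09_IlluminaGA-DNASeq_exome_gdc_realn.bai"},
         "region": "",
         "type": "s3"},
    ],
}

s3_url = find_access_url(index_object, "s3")
print(s3_url)
```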

## Seven Bridges Example
The same file as above was added to a Seven Bridges project in the CGC, where it has the id 60a29b1831bc812422ab47bd.


In [4]:
sbDRSid = '60240c5383a3d61deae04202'

In [36]:
from fasp.loc import sbcgcDRSClient
sbcgcDRS = sbcgcDRSClient('~/.keys/sevenbridges_keys.json', 's3')
sbcgcDRS.getObject(sbDRSid)

{'id': '60a29b1831bc812422ab47bd',
 'name': '46db33a7f2003837e88d0a81b8ebec2c_gdc_realn.bam',
 'size': 23894757370,
 'checksums': [{'type': 'etag',
   'checksum': 'b361ca214dafbf2a3c64491dd4b6be6f-2849'}],
 'self_uri': 'drs://cgc-ga4gh-api.sbgenomics.com/60a29b1831bc812422ab47bd',
 'created_time': '2021-05-17T16:34:32Z',
 'updated_time': '2021-05-17T16:34:32Z',
 'mime_type': 'application/json',
 'access_methods': [{'type': 's3',
   'region': 'us-east-1',
   'access_id': 'aws-us-east-1'}]}

In [27]:
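import sevenbridges as sbg
from sevenbridges.errors import SbgError
import json

class sbFileData:

    def __init__(self, profile):
        # Use the credentials stored under the given profile name
        config = sbg.Config(profile=profile)
        self.api = sbg.Api(config=config)


    def getFileData(self, file_id):
        try:
            file = self.api.files.get(id=file_id)
            print(json.dumps(file.metadata, indent=3))
        except SbgError as e:
            # SbgError is the base class for errors raised by the sevenbridges client
            print(e.message)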



In [37]:
sbFile = sbFileData('cgc')
sbFile.getFileData(sbDRSid)

{
   "race": "white",
   "gender": "female",
   "case_id": "TCGA-AR-A2LK",
   "platform": "Illumina",
   "case_uuid": "1b703058-e596-45bc-80fe-8b98d545c2e2",
   "ethnicity": "not hispanic or latino",
   "sample_id": "TCGA-AR-A2LK-01A",
   "aliquot_id": "TCGA-AR-A2LK-01A-11D-A17W-09",
   "sample_type": "Primary Tumor",
   "sample_uuid": "d6f5c34a-0f5c-4aed-977a-74a1e5d50915",
   "aliquot_uuid": "7b0b60c7-5fa0-440e-937f-8d82119330d6",
   "disease_type": "Ductal and Lobular Neoplasms",
   "primary_site": "Breast",
   "vital_status": "Dead",
   "investigation": "TCGA-BRCA",
   "reference_genome": "GRCh38.d1.vd1",
   "experimental_strategy": "WXS"
}


In [29]:
drsClient.getFileData("030e5e74-6461-4f05-a399-de8e470bc056", linked=True)

{'id': '030e5e74-6461-4f05-a399-de8e470bc056',
 'data_format': 'BAM',
 'access': 'controlled',
 'cases': [{'case_id': '1b703058-e596-45bc-80fe-8b98d545c2e2',
   'project': {'disease_type': 'Breast Invasive Carcinoma'},
   'diagnoses': [{'days_to_recurrence': None,
     'morphology': '8520/3',
     'tumor_stage': 'stage iii',
     'created_datetime': None,
     'tissue_or_organ_of_origin': 'Breast, NOS',
     'primary_diagnosis': 'Lobular carcinoma, NOS',
     'age_at_diagnosis': 22800,
     'classification_of_tumor': 'not reported',
     'prior_malignancy': 'no',
     'site_of_resection_or_biopsy': 'Breast, NOS',
     'days_to_last_known_disease_status': None,
     'tumor_grade': 'not reported',
     'progression_or_recurrence': 'not reported'}],
   'demographic': {'race': 'white',
    'updated_datetime': '2019-07-31T21:33:13.355468-05:00',
    'submitter_id': 'TCGA-AR-A2LK_demographic',
    'state': 'released',
    'year_of_death': None,
    'year_of_birth': 1945}}],
 'associated_enti
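
The two outputs above describe the same file under two schemas. As a rough sketch, the shared identifiers can be used to line the records up. The helper name and the choice of fields are illustrative assumptions; the values are copied from the Seven Bridges and Gen3 outputs above.

```python
# Subset of the Seven Bridges metadata shown above.
sb_metadata = {
    "case_uuid": "1b703058-e596-45bc-80fe-8b98d545c2e2",
    "race": "white",
    "disease_type": "Ductal and Lobular Neoplasms",
}

# Subset of the Gen3 response shown above.
gen3_file = {
    "cases": [{
        "case_id": "1b703058-e596-45bc-80fe-8b98d545c2e2",
        "project": {"disease_type": "Breast Invasive Carcinoma"},
        "demographic": {"race": "white"},
    }]
}


def compare_sources(sb: dict, gen3: dict) -> dict:
    """Pair up corresponding values as (Seven Bridges value, Gen3 value)."""
    case = gen3["cases"][0]
    return {
        "case_id": (sb["case_uuid"], case["case_id"]),
        "race": (sb["race"], case["demographic"]["race"]),
        "disease_type": (sb["disease_type"], case["project"]["disease_type"]),
    }


comparison = compare_sources(sb_metadata, gen3_file)
for field, (sb_value, gen3_value) in comparison.items():
    status = "same" if sb_value == gen3_value else "differs"
    print(f"{field}: {status} ({sb_value!r} / {gen3_value!r})")
```

With these values the case id and race agree, while disease_type holds 'Ductal and Lobular Neoplasms' in one source and 'Breast Invasive Carcinoma' in the other, so a field with the same name does not necessarily carry the same value across schemas.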

### Data Connect - Using table info to find out about Age at Diagnosis


In [8]:
schema = searchClient.listTableInfo('search_cloud.cshcodeathon.tcga_clinical_gdc_current')
schema.getCol('diag__age_at_diagnosis')

{
   "format": "bigint",
   "type": "int",
   "$comment": "Age at the time of diagnosis expressed in number of days since birth."
}
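
Because the unit here is stated only in the free-text comment, a client has to know it in advance to interpret a value such as the 22800 returned earlier for age_at_diagnosis. A minimal sketch of a unit-aware conversion follows; the function name and the 365.25 days-per-year factor are assumptions made for illustration.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years


def age_in_years(value: float, unit: str) -> float:
    """Convert an age to years given an explicitly stated unit."""
    normalized = unit.strip().lower()
    if normalized == "days":
        return value / DAYS_PER_YEAR
    if normalized == "years":
        return float(value)
    raise ValueError(f"unsupported unit: {unit!r}")


# 22800 is the age_at_diagnosis value from the Gen3 output in this notebook.
diagnosis_age = age_in_years(22800, "Days")
print(round(diagnosis_age, 1))
```

The conversion is only safe because the unit is passed in explicitly; with the unit buried in a description, the caller would have to hard-code it per column.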


### Better
It would be an improvement if the unit were identified explicitly, ideally in a distinct field. The following column, from another source described in Data Connect, at least labels the unit within $comment.

In [9]:
schema2 = searchClient.listTableInfo('dbgap_demo.scr_gecco_susceptibility.subject_phenotypes_multi')
schema2.getCol('age')

{
   "type": "number",
   "$comment": "UNIT 'Years'",
   "maximum": 98.0,
   "minimum": 37.0,
   "description": "Participant reference age"
}


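The example above comes from dbGaP, which provides machine-readable data dictionaries that are transformed to Data Connect schemas.

Applying the same idea to age at diagnosis, with the unit in its own field, would give a definition like the following.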

In [10]:
{'type': 'number',
    'unit' : 'Days',
    'description': 'Age at the time of diagnosis expressed in number of days since birth.'}

{'type': 'number',
 'unit': 'Days',
 'description': 'Age at the time of diagnosis expressed in number of days since birth.'}

### Best?

The definition can also use the Data Connect schema capability to reference a resource of semantic standards.

For the age at diagnosis in this example a full definition already exists in a Metadata Repository (MDR), in this case the Cancer Data Standards Repository - caDSR. Note that no new curation is required to establish this connection. The link to the data element id exists in the source data and just needs to be passed through as part of the schema.

A better column definition using the semantic resource would be as follows:

In [11]:
ageColDefinition = {'type': 'number',
    'unit': 'Days',
    'description': 'Age at the time of diagnosis expressed in number of days since birth.',
    '$ref': 'cadsr:2006657'}


In [12]:
print (ageColDefinition['$ref'])

cadsr:2006657


The ID above is a CURIE and can be resolved to a machine readable definition as follows:
http://identifiers.org/cadsr:2006657 or http://n2t.net/cadsr:2006657
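
A small sketch of that expansion, using only the two resolver base URLs given above. The function name is hypothetical, and no network request is made; the code only builds the URLs.

```python
# Resolver base URLs taken from the text above.
RESOLVERS = ("http://identifiers.org/", "http://n2t.net/")


def expand_curie(curie: str) -> list:
    """Return the resolver URLs for a CURIE of the form prefix:local_id."""
    prefix, separator, local_id = curie.partition(":")
    if not (prefix and separator and local_id):
        raise ValueError(f"not a CURIE: {curie!r}")
    return [f"{base}{prefix}:{local_id}" for base in RESOLVERS]


for url in expand_curie("cadsr:2006657"):
    print(url)
```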

As an example of how the linked semantic resource can be used, the following uses an example function to retrieve the details of the Common Data Element.

The values contain linked resources such as the range of values the element may take (Value Domain) or the formal semantic Concept applied during curation of the semantic element. 


In [17]:
cdeDefinition = schema.getcaDSRDefinition(ageColDefinition['$ref'])
print(json.dumps(cdeDefinition, indent=3))

{
   "dateCreated": "03-03-2003",
   "dateModified": "01-05-2021",
   "longName": "Diagnosis Age",
   "preferredDefinition": "Age at which a condition or disease was first diagnosed.",
   "preferredName": "DX_AGE",
   "publicID": "2006657",
   "registrationStatus": "Qualified",
   "version": "1.0",
   "valueDomain": "https://cadsrapi.nci.nih.gov/cadsrapi41/GetXML?query=ValueDomain&DataElement[@id=B7FF5F20-0AE9-3159-E034-0003BA12F5E7]&roleName=valueDomain",
   "dataElementConcept": "https://cadsrapi.nci.nih.gov/cadsrapi41/GetXML?query=DataElementConcept&DataElement[@id=B7FF5F20-0AE9-3159-E034-0003BA12F5E7]&roleName=dataElementConcept"
}


### Possible changes to Data Connect

Embedding the unit in $comment does not provide optimal functionality. See the following issue for a suggested improvement.
https://github.com/ga4gh-discovery/ga4gh-search/issues/105
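
To illustrate the cost of embedding the unit in $comment, the sketch below shows the kind of text parsing a client would need today. The regular expression assumes the UNIT 'Years' convention seen in the dbGaP example in this notebook, and the 'unit' field name follows the definition proposed here; neither is part of the Data Connect specification.

```python
import re

# Assumed convention, taken from the dbGaP example: UNIT 'Years'
UNIT_PATTERN = re.compile(r"UNIT\s+'([^']+)'")


def with_explicit_unit(column: dict) -> dict:
    """Return a copy of a column definition with the unit in its own field.

    The column is returned unchanged (as a copy) when no unit can be
    recognised in $comment.
    """
    promoted = dict(column)
    match = UNIT_PATTERN.search(column.get("$comment", ""))
    if match is not None:
        promoted["unit"] = match.group(1)
    return promoted


# Column definitions copied from the two listTableInfo outputs in this notebook.
dbgap_age = {"type": "number", "$comment": "UNIT 'Years'",
             "maximum": 98.0, "minimum": 37.0,
             "description": "Participant reference age"}
gdc_age = {"format": "bigint", "type": "int",
           "$comment": "Age at the time of diagnosis expressed in number of days since birth."}

print(with_explicit_unit(dbgap_age).get("unit"))
print(with_explicit_unit(gdc_age).get("unit"))
```

The first column yields a unit because it follows the convention; the second does not, even though a human reader can see that the unit is days. A dedicated field would remove the need for this kind of guesswork.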

Similar consideration might be given to the use of the $ref attribute in JSON-Schema. We suggest that this be worked through in practice, in notebooks such as this one or elsewhere, using real code and real data examples.