### Data that DRS might provide about a file

Use case: Provide data* about a file identified by a given DRS id.

The realistic scenarios for this use case still need to be identified. In particular, when does it arise, as opposed to the scenario where file ids are obtained from some other data-driven query that reflects the user's interest?

Key points:

* Search and other APIs already provide data relevant to answering that question
* We would expect to use the same schemas that those APIs use
* Reusing those schemas provides rich functionality

That is, rather than DRS determining what metadata should be provided about a file, we should tap into the existing schemas that are available.

*Note: the term "metadata" is deliberately avoided here. For one thing, we need to be more specific about what would be useful.

In [21]:
from fasp.search import DiscoverySearchClient
from fasp.loc import crdcDRSClient

import json
# TCGA Query - CRDC
searchClient = DiscoverySearchClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com/')

query = """
    SELECT  file_id drs_id
    FROM search_cloud.cshcodeathon.gdc_rel24_filedata_active 
    where data_format = 'BAM'
    and project_disease_type = 'Breast Invasive Carcinoma'
    limit 1"""
res = searchClient.runQuery(query)

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________


### An example of basic data about a file

The data returned is the same data used in the original search. For example, the data_format 'BAM' returned below is exactly the value used in the query.
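
One way to picture this reuse is as a projection of a Search result row onto the handful of fields that describe the file itself. The following is a minimal sketch under stated assumptions, not how `crdcDRSClient.getFileData` is actually implemented: `BASIC_FIELDS` and `basic_file_data` are hypothetical names, and the sample row contains only values that appear in this notebook.

```python
# Hypothetical helper; none of these names come from the fasp package.
BASIC_FIELDS = ("file_gdc_id", "data_format", "access")


def basic_file_data(row, fields=BASIC_FIELDS):
    """Return the subset of a Search result row that describes the file itself."""
    data = {name: row[name] for name in fields if name in row}
    # Expose the Search column file_gdc_id under the key "id".
    if "file_gdc_id" in data:
        data["id"] = data.pop("file_gdc_id")
    return data


# Sample row built from values shown elsewhere in this notebook.
row = {
    "dbname": "active",
    "file_gdc_id": "030e5e74-6461-4f05-a399-de8e470bc056",
    "access": "controlled",
    "acl": "phs000178",
    "data_format": "BAM",
}

print(basic_file_data(row))
```

Because the field names come straight from the Search schema, no separate DRS-specific vocabulary has to be defined for this view.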

In [22]:
drsClient = crdcDRSClient()
for id in res:
    print(json.dumps(drsClient.getFileData(id[0]), indent=3))
    print('_'*80)

    

{
   "id": "030e5e74-6461-4f05-a399-de8e470bc056",
   "data_format": "BAM",
   "access": "controlled",
   "file_name": "46db33a7f2003837e88d0a81b8ebec2c_gdc_realn.bam",
   "data_type": "Aligned Reads",
   "data_category": "Sequencing Reads",
   "type": "aligned_reads",
   "experimental_strategy": "WXS",
   "platform": "Illumina",
   "created_datetime": "2016-05-03T00:35:52.946132-05:00",
   "file_size": 23894757370
}
________________________________________________________________________________


### The data can be retrieved through the Search API itself



In [23]:
query = """
    SELECT * 
    FROM search_cloud.cshcodeathon.gdc_rel24_filedata_active 
    where file_id = '030e5e74-6461-4f05-a399-de8e470bc056' """
res2 = searchClient.runQuery(query, returnType='json')
print(json.dumps(res2,indent=3))

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
[
   {
      "dbname": "active",
      "file_gdc_id": "030e5e74-6461-4f05-a399-de8e470bc056",
      "access": "controlled",
      "acl": "phs000178",
      "analysis_input_file_gdc_ids": null,
      "analysis_workflow_link": "https://github.com/NCI-GDC/cocleaning-cwl",
      "analysis_workflow_type": "BWA with Mark Duplicates and Cocleaning",
      "archive_gdc_id": null,
      "archive_revision": null,
      "archive_state": null,
      "archive_submitter_id": null,
      "associated_entities__case_gdc_id": "1b703058-e596-45bc-80fe-8b98d545c2e2",
      "associated_entities__entity_gdc_id": "7b0b60c7-5fa0-440e-937f-8d82119330d6",
      "associated_entities__entity_submitter_id": "TCGA-AR-A2LK-01A-11D-A17W-09",
      "associated_entities__entity_type": "aliquot",
      "case_gdc_id": "1b703058-e596-45bc-80fe-8b98d545c2e2",
      "project_dbg

### Getting linked objects - Case, Indexes, Downstream Analyses, ...

More detailed information is also available.

Making use of existing schemas means that a DRS id can be a jumping-off point into the graph. A user could then fulfill any use case that the graph allows. Tapping into the schemas provided by Search provides rich functionality with little additional effort.
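
To make the "jumping-off point" idea concrete, the sketch below collects the identifiers of linked objects from a record shaped like the expanded output of `getFileData` in this section. It is an illustration only: `linked_ids` is a hypothetical helper, and the key names (`cases`, `associated_entities`, `downstream_analyses`, `analysis`) are assumed from that output rather than from a published schema.

```python
def linked_ids(file_data):
    """Collect identifiers of the objects linked from an expanded file record."""
    links = {
        "cases": [c["case_id"] for c in file_data.get("cases", [])],
        "associated_entities": [
            e["entity_id"] for e in file_data.get("associated_entities", [])
        ],
        "downstream_analyses": [
            a["analysis_id"] for a in file_data.get("downstream_analyses", [])
        ],
    }
    # "analysis" is a single object rather than a list in the output shown here.
    analysis = file_data.get("analysis")
    if analysis:
        links["analysis"] = [analysis["analysis_id"]]
    return links


# Trimmed sample using ids that appear in this notebook's output.
expanded = {
    "id": "030e5e74-6461-4f05-a399-de8e470bc056",
    "cases": [{"case_id": "1b703058-e596-45bc-80fe-8b98d545c2e2"}],
    "associated_entities": [
        {"entity_type": "aliquot",
         "entity_id": "7b0b60c7-5fa0-440e-937f-8d82119330d6"}
    ],
    "downstream_analyses": [
        {"analysis_id": "e05199a1-d6ad-4d31-a616-dcc6fb057216"},
        {"analysis_id": "d0cc7c6d-675c-401e-b751-60c68b4436e5"},
    ],
    "analysis": {"analysis_id": "35eb6a6c-76d7-4568-ae0a-45734676c43e"},
}

for kind, ids in linked_ids(expanded).items():
    print(kind, ids)
```

Each id returned this way could then be used as the starting point of a further query against the corresponding table.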

In [24]:
for id in res:
    fileData = drsClient.getFileData(id[0], expanded=True)
    print(json.dumps(fileData, indent=3))
    print('_'*80)



{
   "id": "030e5e74-6461-4f05-a399-de8e470bc056",
   "data_format": "BAM",
   "access": "controlled",
   "cases": [
      {
         "case_id": "1b703058-e596-45bc-80fe-8b98d545c2e2"
      }
   ],
   "associated_entities": [
      {
         "entity_type": "aliquot",
         "entity_id": "7b0b60c7-5fa0-440e-937f-8d82119330d6"
      }
   ],
   "file_name": "46db33a7f2003837e88d0a81b8ebec2c_gdc_realn.bam",
   "data_category": "Sequencing Reads",
   "downstream_analyses": [
      {
         "analysis_id": "e05199a1-d6ad-4d31-a616-dcc6fb057216"
      },
      {
         "analysis_id": "d0cc7c6d-675c-401e-b751-60c68b4436e5"
      },
      {
         "analysis_id": "90e83350-f2cf-49bc-910e-5331a2cea795"
      },
      {
         "analysis_id": "702c782f-ce48-41c4-b7f2-c97ba26bc8a0"
      }
   ],
   "type": "aligned_reads",
   "analysis": {
      "analysis_id": "35eb6a6c-76d7-4568-ae0a-45734676c43e"
   },
   "platform": "Illumina",
   "created_datetime": "2016-05-03T00:35:52.946132-05:00",


### Getting attributes of related objects



In [25]:
for id in res:
    fileData = drsClient.getFileData(id[0], linked=True)
    print(json.dumps(fileData, indent=3))
    print('_'*80)

{
   "id": "030e5e74-6461-4f05-a399-de8e470bc056",
   "data_format": "BAM",
   "access": "controlled",
   "cases": [
      {
         "case_id": "1b703058-e596-45bc-80fe-8b98d545c2e2",
         "project": {
            "disease_type": "Breast Invasive Carcinoma"
         },
         "diagnoses": [
            {
               "days_to_recurrence": null,
               "morphology": "8520/3",
               "tumor_stage": "stage iii",
               "created_datetime": null,
               "tissue_or_organ_of_origin": "Breast, NOS",
               "primary_diagnosis": "Lobular carcinoma, NOS",
               "age_at_diagnosis": 22800,
               "classification_of_tumor": "not reported",
               "prior_malignancy": "no",
               "site_of_resection_or_biopsy": "Breast, NOS",
               "days_to_last_known_disease_status": null,
               "tumor_grade": "not reported",
               "progression_or_recurrence": "not reported"
            }
         ],
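
The `age_at_diagnosis` value above is clearly not an age in years. The sketch below converts it to approximate years on the assumption, not confirmed by the output shown here, that the value is a number of days. `age_in_years` and `DAYS_PER_YEAR` are hypothetical names introduced for illustration.

```python
DAYS_PER_YEAR = 365.25  # average year length, allowing for leap years


def age_in_years(age_in_days):
    """Convert an age in days to years; pass through missing values as None."""
    if age_in_days is None:
        return None
    return age_in_days / DAYS_PER_YEAR


# Values copied from the diagnoses record above.
diagnosis = {"age_at_diagnosis": 22800, "days_to_recurrence": None}

print(round(age_in_years(diagnosis["age_at_diagnosis"]), 1))  # 62.4
```

The unit is exactly the kind of detail that a schema description should settle, which is why the table info is consulted later in this notebook.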
         

### Using the related data

In this example, the id of the related index file is used to retrieve that file's object through DRS.

In [26]:
drsClient.getObject("18fb79bb-7259-41da-bd76-9dd9f8f84bfc")

{'access_methods': [{'access_id': 'gs',
   'access_url': {'url': 'gs://gdc-tcga-phs000178-controlled/18fb79bb-7259-41da-bd76-9dd9f8f84bfc/TCGA-A1-A0SD-01A-11D-A10Y-09_IlluminaGA-DNASeq_exome_gdc_realn.bai'},
   'region': '',
   'type': 'gs'},
  {'access_id': 's3',
   'access_url': {'url': 's3://tcga-2-controlled/18fb79bb-7259-41da-bd76-9dd9f8f84bfc/TCGA-A1-A0SD-01A-11D-A10Y-09_IlluminaGA-DNASeq_exome_gdc_realn.bai'},
   'region': '',
   'type': 's3'}],
 'aliases': [],
 'checksums': [{'checksum': 'dd3b9e4fa8a85cc18c413e8b5b58e252',
   'type': 'md5'}],
 'contents': [],
 'created_time': '2018-08-08T17:11:24.583780',
 'description': None,
 'form': 'object',
 'id': '18fb79bb-7259-41da-bd76-9dd9f8f84bfc',
 'mime_type': 'application/json',
 'name': None,
 'self_uri': 'drs://nci-crdc.datacommons.io/18fb79bb-7259-41da-bd76-9dd9f8f84bfc',
 'size': 6700896,
 'updated_time': '2018-08-08T17:11:24.583791',
 'version': '3c6bf46f'}
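
A client will usually want just one of these URLs. The sketch below picks the URL for a given access type from a dictionary shaped like the `getObject` result above. `access_url` is a hypothetical helper, not part of `crdcDRSClient`, and the sample is trimmed to the fields it needs.

```python
def access_url(drs_object, access_type):
    """Return the URL of the first access method of the given type, or None."""
    for method in drs_object.get("access_methods", []):
        if method.get("type") == access_type:
            return method["access_url"]["url"]
    return None


# Trimmed copy of the object returned above.
drs_object = {
    "id": "18fb79bb-7259-41da-bd76-9dd9f8f84bfc",
    "access_methods": [
        {"access_id": "gs",
         "access_url": {"url": "gs://gdc-tcga-phs000178-controlled/18fb79bb-7259-41da-bd76-9dd9f8f84bfc/TCGA-A1-A0SD-01A-11D-A10Y-09_IlluminaGA-DNASeq_exome_gdc_realn.bai"},
         "region": "",
         "type": "gs"},
        {"access_id": "s3",
         "access_url": {"url": "s3://tcga-2-controlled/18fb79bb-7259-41da-bd76-9dd9f8f84bfc/TCGA-A1-A0SD-01A-11D-A10Y-09_IlluminaGA-DNASeq_exome_gdc_realn.bai"},
         "region": "",
         "type": "s3"},
    ],
}

print(access_url(drs_object, "s3"))
```

Returning `None` for an unknown type leaves the caller to decide whether a missing access method is an error.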

In [27]:
#fileData = drsClient.getFileData("18fb79bb-7259-41da-bd76-9dd9f8f84bfc")
#print(json.dumps(fileData, indent=3))


### Using table info to find out about Age at Diagnosis


In [28]:
searchClient.listTableInfo('search_cloud.cshcodeathon.tcga_clinical_gdc_current')
#searchClient.listTables('search_cloud')

{'name': 'search_cloud.cshcodeathon.tcga_clinical_gdc_current',
 'description': 'Automatically generated schema',
 'data_model': {'$id': 'https://ga4gh-search-adapter-presto-public.prod.dnastack.com/table/search_cloud.cshcodeathon.tcga_clinical_gdc_current/info',
  'description': 'Automatically generated schema',
  '$schema': 'http://json-schema.org/draft-07/schema#',
  'properties': {'submitter_id': {'format': 'varchar',
    'type': 'string',
    '$comment': 'The submitter_id of a case entity corresponds to the submitted_subject_id of the study participant in dbGaP records for the project.'},
   'case_id': {'format': 'varchar',
    'type': 'string',
    '$comment': 'GDC unique identifier for this case (corresponds to the case_barcode) -- this can be used to access more information from the GDC data portal in the following way: https://portal.gdc.cancer.gov/files/c21b332c-06c6-4403-9032-f91c8f407ba9.'},
   'diag__treat__count': {'format': 'bigint',
    'type': 'int',
    '$comment': 'T
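
Reading one column's description out of this structure can be sketched as follows. The sketch assumes the table info is available as a dictionary with the shape shown above (`data_model` → `properties` → column name); `describe_column` is a hypothetical helper, and the sample contains only the `submitter_id` entry copied from that output.

```python
def describe_column(table_info, column):
    """Return the type, format and comment recorded for one column, or None."""
    prop = table_info["data_model"]["properties"].get(column)
    if prop is None:
        return None
    return {
        "type": prop.get("type"),
        "format": prop.get("format"),
        "comment": prop.get("$comment"),
    }


# Trimmed sample shaped like the table info shown above.
table_info = {
    "name": "search_cloud.cshcodeathon.tcga_clinical_gdc_current",
    "data_model": {
        "properties": {
            "submitter_id": {
                "format": "varchar",
                "type": "string",
                "$comment": "The submitter_id of a case entity corresponds to the submitted_subject_id of the study participant in dbGaP records for the project.",
            },
        },
    },
}

print(describe_column(table_info, "submitter_id")["comment"])
```

The same lookup applied to an age-related column would surface whatever unit and definition the schema records for it.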