### Explore DRS bundle contents for different dbGaP projects

This notebook shows that, because experimental design is study-specific, different studies inevitably have different file content. This speaks to the need for an external model that describes each study's design.

In [1]:
from fasp.search import DataConnectClient
tst = "~/Downloads/task-specific-token.txt"
cl = DataConnectClient('http://localhost:8089/', 100000, passport=tst)


import json
from fasp.loc import SRADRSClient
drsClient = SRADRSClient('https://locate.be-md.ncbi.nlm.nih.gov', debug=False, passp=tst)

### bigquery.UDN



In [13]:
def summarize_bundle(bundle_drs_id):
    bundle = drsClient.get_object(bundle_drs_id)
    print(f"Bundle name: {bundle['name']}")
    for i in bundle['contents']:
        #print(i['name'])
        print(i)

In [3]:
cl.list_table_info('bigquery.UDN.run_file2',verbose=True)

_Schema for tablebigquery.UDN.run_file2_
{
   "errors": [
      {
         "status": 500,
         "title": "Encountered an unexpected error",
         "details": "bb3b3a338b0abb2a: java.lang.IllegalStateException"
      }
   ]
}


<fasp.search.data_connect_client.SearchSchema at 0x11c4c3bb0>

### Get the file data for the UDN study


In [5]:
study_files = cl.run_query("select * from bigquery.UDN.run_file2 order by sra_run", return_type='json')

Retrieving the query
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________
____Page9_______________
____Page10_______________
____Page11_______________
____Page12_______________


In [23]:
from collections import Counter
import pandas as pd

def getBundleTypeStr(bundle_type):
    type_str = ''
    sorted_exts = sorted(bundle_type.keys())
    for k in sorted_exts:
        type_str =f'{type_str}{k}[{bundle_type[k]}]'
    return type_str

def getBundleTypes(study_file_json):
    bundle_types = Counter()
    bundle_type = Counter()
    first_iteration = True
    for f in study_file_json:
        # if we have a new run - wrap up the last run
        if not first_iteration and f['sra_run'] != last_run:
            btype_str = getBundleTypeStr(bundle_type)
            bundle_types[btype_str] +=1
            bundle_type = Counter()
        first_iteration = False
        
        # Process the current file
        file_name = f['file_name']
        if '.' in file_name:
            parts = f['file_name'].split('.')
            #parts.pop(0)
            #file_ext = '.'.join(parts)
            N = min([len(parts)-1, 2])
            file_ext = '.'.join(parts[-N:])
        else:
            file_ext = 'unknown'
        bundle_type[file_ext] +=1


        last_run = f['sra_run']
    # complete the last bundle
    btype_str = getBundleTypeStr(bundle_type)
    bundle_types[btype_str] +=1
    
    # print the details
    #for t, v in bundle_types.items():
        #print (t, v)
        
    df = pd.DataFrame(bundle_types.items(), columns=['bundle type','no of bundles'])
    #print (df)
        
    return df

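For each SRA run (bundle), the function above counts the number of files of each type, where the type is taken from the file name extension (up to the last two dot-separated parts).

The bundle type indicates the composition of the run. For example, bam[1] indicates that a run has one bam file. There are 6619 runs like this in the UDN study. unknown[2] means a run had two files whose type could not be determined because their names have no extension. There were 48 runs like this, and so on.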
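The labelling can be illustrated with a minimal standalone sketch. The helper names `extension_of` and `bundle_signature` are hypothetical and not part of the notebook; they only mirror the extension logic inside `getBundleTypes`. The file names are taken from bundles shown elsewhere in this notebook.

```python
from collections import Counter

def extension_of(file_name):
    """Return up to the last two dot-separated parts, or 'unknown' if there is no dot."""
    if '.' not in file_name:
        return 'unknown'
    parts = file_name.split('.')
    # Keep at most two trailing parts, but never the leading base name.
    n = min(len(parts) - 1, 2)
    return '.'.join(parts[-n:])

def bundle_signature(file_names):
    """Build a label such as 'bam[1]' from the files of a single run."""
    counts = Counter(extension_of(name) for name in file_names)
    return ''.join(f'{ext}[{counts[ext]}]' for ext in sorted(counts))

# A cram file and its index yield two distinct "extensions".
print(bundle_signature(['93598.recal.cram', '93598.recal.cram.crai']))

# Names without any dot fall into the 'unknown' category.
print(bundle_signature(['8a32fcfc-1099-4f6a-ac7d-c3be03da59e7',
                        '7bd91d7e-b1c6-4eaa-b5d0-313b5e2aaa58']))
```

Because two trailing parts are kept, a name such as `93598.recal.cram` is labelled `recal.cram` rather than `cram`, which is why some labels look longer than a plain file extension.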

In [24]:
getBundleTypes(study_files)

Unnamed: 0,bundle type,no of bundles
0,bam[1],6619
1,unknown[2],48
2,unknown[1],119
3,unknown[3],4


### What does one of the unknown[2] bundles look like?

In [25]:
summarize_bundle('b490da7405b83cb242198f3b392fcbc1')

Bundle name: SRR5031422
{'id': '439f0297f35d47fdc5dba9cf34109112', 'name': '8a32fcfc-1099-4f6a-ac7d-c3be03da59e7'}
{'id': '4474b747618a3c598e3d6c391f5b4d82', 'name': '7bd91d7e-b1c6-4eaa-b5d0-313b5e2aaa58'}


## GECCO phs001554
Looking at a different study.

In [28]:
study_files2 = cl.run_query("select * from bigquery.GECCO_CRC_Susceptibility.run_file order by sra_run", return_type='json')

Retrieving the query
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________
____Page9_______________
____Page10_______________
____Page11_______________
____Page12_______________


The runs in this study are more homogeneous. Every run contains one cram file and one index (crai) file.

In [29]:
getBundleTypes(study_files2)

Unnamed: 0,bundle type,no of bundles
0,cram.crai[1]recal.cram[1],2892


In [30]:
summarize_bundle('e2ee588c6ae07189e7ab89f405ff00cc')

Bundle name: SRR7274346
{'id': '0a6478a586b2a5378bc3b69bbfed9b52', 'name': '93598.recal.cram'}
{'id': 'e4575cbf92732aadb6ace55bb25ffcc4', 'name': '93598.recal.cram.crai'}


## LCCC-1108 phs001713

In [32]:
study_files3 = cl.run_query("select * from bigquery.LCCC_1108.run_file order by sra_run", return_type='json')

Retrieving the query
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________
____Page9_______________
____Page10_______________
____Page11_______________
Client row limit of 10000 was reached. Reset limit with care!


In [83]:
cl.run_query("select count(*) from bigquery.LCCC_1108.run_file")

Retrieving the query
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________


[[11230]]

In [84]:
len(study_files3)

11230

In this study only compressed fastq files were found, but runs contained variable numbers of files. 

In [33]:
getBundleTypes(study_files3)

Unnamed: 0,bundle type,no of bundles
0,fastq.gz[4],2111
1,fastq.gz[2],1339
2,fastq.gz[1],108


### Example bundles

Looking at an example bundle...

In [34]:
summarize_bundle('832cbe2df4027776cabb4270e26be4ba')

Bundle name: SRR10327764
{'id': '0f764ed7ef1cd6075a0224a372a2702d', 'name': '130814_UNC17-D00216_0066_AH725CADXX_ACCTCCAA_L001_1.fastq.gz'}
{'id': '3d14ec827be54f4c46bd04731a630b94', 'name': '130814_UNC17-D00216_0066_AH725CADXX_ACCTCCAA_L002_1.fastq.gz'}
{'id': 'c4d73ba900b6fa50c544c65694f4e97c', 'name': '130814_UNC17-D00216_0066_AH725CADXX_ACCTCCAA_L001_2.fastq.gz'}
{'id': 'f4ec3088f0345ce5db58361655d6405c', 'name': '130814_UNC17-D00216_0066_AH725CADXX_ACCTCCAA_L002_2.fastq.gz'}


In [106]:
summarize_bundle('aea12cd67902679f816aa2bcc2108a7c')

Bundle name: SRR10329167
151116_NS500270_0070_AHHJH2BGXX_HHJH2BGXX_CCGTGAGA_S1_L001_R2.fastq.gz
151116_NS500270_0070_AHHJH2BGXX_HHJH2BGXX_CCGTGAGA_S1_L001_R1.fastq.gz


There seems to be a pattern to these files for SRR10327392, but working it out takes special knowledge and inference. The SRA database can provide some insight into this, but that information is not accessible through DRS.
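To make the kind of inference involved concrete, here is a rough sketch that guesses a lane and read number from the fastq file names shown above. It assumes that `L001`/`L002` denote sequencing lanes and that a trailing `1`/`2` (or `R1`/`R2`) denotes the read of a pair. That interpretation is a guess based on common sequencer naming conventions and is not confirmed by DRS or by the study. The regular expression and the function `group_by_lane` are hypothetical.

```python
import re

# Assumed pattern: ..._L<lane>_<read>.fastq.gz or ..._L<lane>_R<read>.fastq.gz
LANE_READ = re.compile(r'_L(?P<lane>\d{3})_R?(?P<read>[12])\.fastq\.gz$')

def group_by_lane(file_names):
    """Return ({lane: {read: file_name}}, [names that did not match the pattern])."""
    lanes = {}
    unparsed = []
    for name in file_names:
        match = LANE_READ.search(name)
        if match is None:
            unparsed.append(name)
            continue
        lanes.setdefault(match['lane'], {})[int(match['read'])] = name
    return lanes, unparsed

names = [
    '130814_UNC17-D00216_0066_AH725CADXX_ACCTCCAA_L001_1.fastq.gz',
    '130814_UNC17-D00216_0066_AH725CADXX_ACCTCCAA_L002_1.fastq.gz',
    '130814_UNC17-D00216_0066_AH725CADXX_ACCTCCAA_L001_2.fastq.gz',
    '130814_UNC17-D00216_0066_AH725CADXX_ACCTCCAA_L002_2.fastq.gz',
]
lanes, unparsed = group_by_lane(names)
for lane in sorted(lanes):
    print(lane, sorted(lanes[lane]))
```

Under these assumptions a four-file run would be two lanes of paired reads, but nothing in the DRS bundle itself states that, which is the point made above.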

### PPTC

In [38]:
study_files4 = cl.run_query("select * from bigquery.PPTC.run_file order by sra_run", return_type='json')

Retrieving the query
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________
____Page9_______________
____Page10_______________
____Page11_______________
____Page12_______________


In [39]:
getBundleTypes(study_files4)

Unnamed: 0,bundle type,no of bundles
0,bam[1],643
1,recal.bam[1],7
2,human.bam[1],40
3,PDX.bam[1],163
4,bam[2],116


### Example bundles

In [113]:
summarize_bundle('52af143ffd75208d311476abc505e5b9')

Bundle name: SRR6226625
PPTCOS44-D-human.bam
PPTCOS44-D-PDX.bam


In [114]:
summarize_bundle('d06e6463579189d7aba7990d85baff56')

Bundle name: SRR6226387
PPTCALL42-D-human.bam
PPTCALL42-D-PDX.bam


The rationale (study design) is not evident from these bundles.