 ### Combining GA4GH standards to perform an end-to-end workflow
 
#### Learning Objectives
Combine Data Connect, WES and DRS services  

What will participants do as part of the exercise?

 - Search for files with Data Connect
 - Obtain links to access files
 - Submit the files to a WES workflow
 - Retrieve the results of the analysis
 
 
 #### Icons in this Guide

 🖐 A hands-on section where you will code something or interact with the server
 
 
 ## Obtain Thousand Genomes files from SRA DRS and submit to Seven Bridges WES

#### Step 1: Set up options
🖐 Set your project name, the location of your key file, and the download location.

In [1]:
SB_PROJECT = 'ianfore/ian-tutorial'
SB_API_KEY_PATH = '~/tutkeys/sbcgc_key.json'
DOWNLOAD_LOCATION = '~/Downloads'

#### Step 2: Use Data Connect to retrieve file details for specified population

In [2]:
from fasp.search import DataConnectClient

searchClient = DataConnectClient('https://data.publisher.dnastack.com/data-connect/')

query = '''SELECT f.sample_name, drs_id bam_drs_id, acc,
population, annotated_sex
FROM collections.public_datasets.ssd_drs s 
join collections.public_datasets.sra_drs_files f on f.sample_name = s.su_submitter_id 
where filetype = 'bam' and mapped = 'mapped' 
and sequencing_type ='exome' and  population = 'PUR' LIMIT 3'''

json_result = searchClient.run_query(query, returnType='json')
json_result

Retrieving the query
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________


[{'sample_name': 'HG00640',
  'bam_drs_id': '58e2964f2a0adbf41ab0e8c7a95e7d0c',
  'acc': 'SRR1596923',
  'population': 'PUR',
  'annotated_sex': 'male'},
 {'sample_name': 'HG00637',
  'bam_drs_id': '475dfc02f643c368036df6816d05afe4',
  'acc': 'SRR1596919',
  'population': 'PUR',
  'annotated_sex': 'male'},
 {'sample_name': 'HG00731',
  'bam_drs_id': '515ae091f29ac699a4d2e272812cea47',
  'acc': 'SRR1606560',
  'population': 'PUR',
  'annotated_sex': 'male'}]
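
The `____PageN____` lines printed above suggest that the client followed several result pages before the three rows arrived. As a minimal sketch of that pattern, assuming the response shape described in the GA4GH Data Connect specification (a `data` list plus an optional `pagination.next_page_url`), the loop below gathers rows across pages. `collect_rows` and `fetch_page` are hypothetical names used for illustration, not part of the `fasp` package, and the pages are in-memory stand-ins rather than real server responses.

```python
from typing import Callable, Dict, List, Optional


def collect_rows(first_url: str, fetch_page: Callable[[str], Dict]) -> List[Dict]:
    """Follow Data Connect style pagination and return all rows (hypothetical helper)."""
    rows: List[Dict] = []
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        # Pages can be empty; keep following next_page_url until it is absent
        rows.extend(page.get("data", []))
        pagination = page.get("pagination") or {}
        url = pagination.get("next_page_url")
    return rows


# In-memory stand-in for a server: early pages are empty, the last holds the data
FAKE_PAGES = {
    "page1": {"data": [], "pagination": {"next_page_url": "page2"}},
    "page2": {"data": [], "pagination": {"next_page_url": "page3"}},
    "page3": {"data": [{"sample_name": "HG00640"}, {"sample_name": "HG00637"}]},
}

all_rows = collect_rows("page1", FAKE_PAGES.__getitem__)
print(len(all_rows))
```

Passing the page fetcher in as a callable keeps the loop independent of any HTTP library, so the same logic can be exercised without network access.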

#### Step 3: Convert the result into a Dataframe
And display it.

In [3]:
import pandas as pd
first_df = pd.DataFrame(json_result)
first_df

| | sample_name | bam_drs_id | acc | population | annotated_sex |
|---|---|---|---|---|---|
| 0 | HG00640 | 58e2964f2a0adbf41ab0e8c7a95e7d0c | SRR1596923 | PUR | male |
| 1 | HG00637 | 475dfc02f643c368036df6816d05afe4 | SRR1596919 | PUR | male |
| 2 | HG00731 | 515ae091f29ac699a4d2e272812cea47 | SRR1606560 | PUR | male |


#### Step 4: Set up DRS and WES clients

In [4]:
from fasp.loc import DRSClient
from fasp.workflow import sbcgcWESClient
from fasp.loc import sbcgcDRSClient

drsClient = DRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True, debug=False)
wesClient = sbcgcWESClient(SB_PROJECT, api_key_path=SB_API_KEY_PATH)
results_DRS_client = sbcgcDRSClient(SB_API_KEY_PATH, 's3')

#### Step 5: Define a function to submit the workflow

🖐 In the following cell you may need to point to your own copy of the application.
Instructions will be provided.

In [7]:
import json
import requests  # used later when downloading the result file

def runWorkflow(wesClient, fileurl, outfile):
    """Submit one samtools view run for the file at fileurl and return the run id."""

    # Replace with your copy of the app, for example:
    # sam_view_app = 'sbg://forei/ismb-tutorial/samtools-view-drsurl-1-8-url'
    sam_view_app = 'sbg://ianfore/ian-tutorial/samtools-view-drsurl-1-8-url'

    # DRS URI of the reference file named in the inputs below
    ref_drs_id = 'drs://cgc-ga4gh-api.sbgenomics.com/62b07ea84e3edb6b1c23c8d5'

    params = {
        "project": SB_PROJECT,
        "inputs": {
            "alignment_file_url": fileurl,
            "count_alignments": True,
            "reference_file": {
                "path": ref_drs_id,
                "name": "references-hs37d5-hs37d5.fasta",
                "class": "File"
            },
            "output_file_path": outfile
        }
    }

    # The client takes the workflow parameters as a JSON string
    run_id = wesClient.run_generic_workflow(
        workflow_url=sam_view_app,
        workflow_params=json.dumps(params),
        workflow_type="CWL",
        workflow_type_version="sbg:draft-2",
        verbose=False
    )
    return run_id
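
For context, the GA4GH WES specification describes run submission as a `multipart/form-data` POST to a `/runs` endpoint, with `workflow_params` sent as a JSON string. The sketch below only assembles those form fields and makes no network call. `build_wes_form` is a hypothetical helper, the input URL is a placeholder, and whether `run_generic_workflow` builds its request exactly this way is an assumption.

```python
import json
from typing import Any, Dict


def build_wes_form(workflow_url: str, params: Dict[str, Any],
                   workflow_type: str = "CWL",
                   workflow_type_version: str = "sbg:draft-2") -> Dict[str, str]:
    """Assemble the form fields for a WES run request (hypothetical helper)."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        # WES carries the parameters as a JSON-encoded string, not a nested object
        "workflow_params": json.dumps(params),
    }


# Placeholder inputs for illustration; the URL is not a real file location
example_params = {
    "project": "ianfore/ian-tutorial",
    "inputs": {
        "alignment_file_url": "https://example.org/sample.bam",
        "count_alignments": True,
        "output_file_path": "HG00640.txt",
    },
}
form = build_wes_form("sbg://ianfore/ian-tutorial/samtools-view-drsurl-1-8-url",
                      example_params)
print(sorted(form))
```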

#### Step 6:  For each result of the query above submit a task to the Cancer Genomics Cloud

In [8]:
# Set the region we want to access data from
region = 's3.us-east-1'
my_runs = []

for row in json_result:

    print("subject={}, drsID={}".format(row['sample_name'], row['bam_drs_id']))
    drs_id = row['bam_drs_id']

    # Look up the DRS object, then ask for a URL in the chosen region
    objInfo = drsClient.get_object(drs_id)
    url = drsClient.get_url_for_region(drs_id, region)

    # Run the workflow on the file at the DRS URL
    if url is not None:
        outfile = "{}.txt".format(row['sample_name'])
        run_id = runWorkflow(wesClient, url, outfile)
        print('Submitted run {} to {}'.format(run_id, wesClient.__class__.__name__))
        my_runs.append(run_id)
        row['run_id'] = run_id
    print('_________________________________________________________________________')

subject=HG00640, drsID=58e2964f2a0adbf41ab0e8c7a95e7d0c
Submitted run 9db0f8b6-b82d-44ed-8af4-20672ca4a38a to sbcgcWESClient
_________________________________________________________________________
subject=HG00637, drsID=475dfc02f643c368036df6816d05afe4
Submitted run 902f949e-8cdf-4f90-b5e3-23ffcc1e14f7 to sbcgcWESClient
_________________________________________________________________________
subject=HG00731, drsID=515ae091f29ac699a4d2e272812cea47
Submitted run 23a52f8d-3dae-4657-94ec-e0df626f75d6 to sbcgcWESClient
_________________________________________________________________________
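
`get_url_for_region` hides the step of picking one of the access methods listed on a DRS object. As a rough sketch of that selection, assuming the object layout from the GA4GH DRS specification (`access_methods`, each with an optional `region` and either an `access_url` or an `access_id`), the helper below returns the first method matching a region. `pick_access_method` and the sample object are hypothetical; the real `fasp` client may choose differently.

```python
from typing import Dict, Optional


def pick_access_method(drs_object: Dict, region: str) -> Optional[Dict]:
    """Return the first access method whose region matches, or None."""
    for method in drs_object.get("access_methods", []):
        if method.get("region") == region:
            return method
    return None


# Made-up DRS object for illustration only; not a real server response
sample_object = {
    "id": "example-object",
    "access_methods": [
        {"type": "https", "region": "gs.US", "access_id": "abc"},
        {"type": "https", "region": "s3.us-east-1",
         "access_url": {"url": "https://example.org/example.bam"}},
    ],
}

method = pick_access_method(sample_object, "s3.us-east-1")
# With an inline access_url no further call is needed; with only an access_id,
# a client would ask the server's access endpoint for the URL
print(method["access_url"]["url"] if method else None)
```

Returning `None` when nothing matches mirrors the loop above, which skips submission when no URL is available.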


#### Step 7: Check status of each task

In [19]:
for run in json_result:
    status = wesClient.get_task_status(run['run_id'])
    print("Run {} {}".format(run['run_id'], status))

Run 9db0f8b6-b82d-44ed-8af4-20672ca4a38a COMPLETE
Run 902f949e-8cdf-4f90-b5e3-23ffcc1e14f7 COMPLETE
Run 23a52f8d-3dae-4657-94ec-e0df626f75d6 COMPLETE


### Check status above until completion
Rerun the status cell above until every run reports COMPLETE. Expect these runs to take 7-10 minutes.

## Getting the results

Use the Seven Bridges CGC DRS service to retrieve the output file

The next cell defines a function that retrieves a result using the WES and DRS services. It will:

* Retrieve the run log, which includes the outputs, from the WES server
* Download the result file via DRS
* Extract the count from the file
* Return the count


#### Step 8: Define a function to get a task result

In [20]:
import requests

def get_sam_view_result(run_id):
    # WES API call to retrieve the log of the run, including its outputs
    log = wesClient.get_run_log(run_id)
    # The last segment of the output path is used as the DRS id of the results file
    resultsDRSID = log['outputs']['counts']['path'].split('/')[-1]

    # DRS API call to get a URL for the results file
    url = results_DRS_client.get_access_url(resultsDRSID, 's3')

    # Download the file; its content is the alignment count
    response = requests.get(url)
    response.raise_for_status()
    return response.text.strip()
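
The function above takes the last path segment of the output path as the DRS object id. For hostname-based `drs://` URIs, such as the reference file id used earlier, the same split can be done with the standard library. `parse_drs_uri` and `drs_object_endpoint` are hypothetical helpers; the HTTPS form follows the hostname-based URI convention in the GA4GH DRS specification and is an assumption about any particular server.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_drs_uri(uri: str) -> Tuple[str, str]:
    """Split a hostname-based DRS URI into (host, object_id)."""
    parts = urlsplit(uri)
    if parts.scheme != "drs" or not parts.netloc:
        raise ValueError("not a hostname-based DRS URI: {}".format(uri))
    return parts.netloc, parts.path.lstrip("/")


def drs_object_endpoint(uri: str) -> str:
    """Build the object URL implied by the DRS v1 hostname-based convention."""
    host, object_id = parse_drs_uri(uri)
    return "https://{}/ga4gh/drs/v1/objects/{}".format(host, object_id)


host, object_id = parse_drs_uri(
    "drs://cgc-ga4gh-api.sbgenomics.com/62b07ea84e3edb6b1c23c8d5")
print(host, object_id)
```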
 

#### Step 9: Update the dataframe with the results

In [24]:
import pandas as pd

for run in json_result:
    status = wesClient.get_task_status(run['run_id'])
    if status == 'COMPLETE':
        count_result = get_sam_view_result(run['run_id'])
        run['count_result'] = count_result
    else:
        run['count_result'] = status

df = pd.DataFrame(json_result)
df

| | sample_name | bam_drs_id | acc | population | annotated_sex | run_id | count_result |
|---|---|---|---|---|---|---|---|
| 0 | HG00640 | 58e2964f2a0adbf41ab0e8c7a95e7d0c | SRR1596923 | PUR | male | 9db0f8b6-b82d-44ed-8af4-20672ca4a38a | 102424655 |
| 1 | HG00637 | 475dfc02f643c368036df6816d05afe4 | SRR1596919 | PUR | male | 902f949e-8cdf-4f90-b5e3-23ffcc1e14f7 | 102554431 |
| 2 | HG00731 | 515ae091f29ac699a4d2e272812cea47 | SRR1606560 | PUR | male | 23a52f8d-3dae-4657-94ec-e0df626f75d6 | 432255472 |


#### Key point:
The complete sequence above shows how:
* Data can be obtained via a search or query
* Computation can be run as workflows via WES
* The results of the workflows can be retrieved and merged with the query data