### Finding files and data using Data Connect

#### Learning Objectives
Workshop attendees will learn how to use the GA4GH Data Connect Service.

What will participants do as part of the exercise?

 - Understand how to query data via Data Connect
 - Use Data Connect to find files that can be accessed via DRS
 - Learn how to obtain and use data descriptions (schema)
 - Discover the meaning of codes used in data
 

#### Icons in this Guide

 🖐 A hands-on section where you will code something or interact with the server
 
### Query files
The approach below finds files by using the mappings in the subject and specimen data available through the Data Connect API.

Queries are submitted as SQL queries to one or more tables on the Data Connect server.
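
To make the mechanics concrete, here is a minimal sketch of what a client does with a SQL query. It assumes, based on our reading of the Data Connect specification, that the query is sent as a JSON body with a `query` field and that each response page carries `data` rows plus an optional `pagination.next_page_url`. The helper names (`build_search_request`, `iterate_rows`) and the in-memory pages are hypothetical and are not part of `fasp`.

```python
import json
from typing import Callable, Dict, Iterator, Optional


def build_search_request(sql: str) -> str:
    """Serialize a SQL statement as the JSON body of a search request."""
    return json.dumps({"query": sql})


def iterate_rows(first_page: Dict, fetch_page: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield rows from the first response page and every page linked after it."""
    page: Optional[Dict] = first_page
    while page is not None:
        for row in page.get("data", []):
            yield row
        # A missing or null pagination block means there are no more pages.
        next_url = (page.get("pagination") or {}).get("next_page_url")
        page = fetch_page(next_url) if next_url else None


# Hypothetical in-memory pages standing in for server responses.
fake_pages = {
    "page2": {"data": [{"sample_name": "S3"}], "pagination": None},
}
first = {
    "data": [{"sample_name": "S1"}, {"sample_name": "S2"}],
    "pagination": {"next_page_url": "page2"},
}

body = build_search_request("SELECT sample_name FROM some_table")
rows = list(iterate_rows(first, fake_pages.__getitem__))
print(body)
print([row["sample_name"] for row in rows])
```

Passing the page fetcher in as a function keeps the sketch runnable without a network connection; a real client would make an HTTP request there instead.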

As with other examples, we first set up a client to use the API. The following examples use the server at DNAstack.

#### Step 1: Set up a Data Connect Client and run a predefined query 

In [None]:
from fasp.search import DataConnectClient
searchClient = DataConnectClient('https://data.publisher.dnastack.com/data-connect/')

In [None]:
query = '''SELECT f.sample_name, drs_id bam_drs_id, acc
FROM collections.public_datasets.ssd_drs s 
join collections.public_datasets.sra_drs_files f on f.sample_name = s.su_submitter_id 
where filetype = 'bam' and mapped = 'mapped' 
and sequencing_type ='exome' and  population = 'JPT' '''

resultRows = searchClient.run_query(query, return_type='dataframe')
resultRows

#### Step 2: Run a second query to find bam files from members of a given family

In [None]:
family_query = '''SELECT f.sample_name, relationship, drs_id bam_drs_id, acc
FROM collections.public_datasets.thousand_genomes_meta s 
join collections.public_datasets.sra_drs_files f on f.sample_name = s.sample 
where filetype = 'bam' and mapped = 'mapped' 
and sequencing_type ='exome' and  family_id = '1447' '''

family_results = searchClient.run_query(family_query, return_type='dataframe')
family_results

### List table details


#### Step 3: List the available tables
We can list the tables available in this set as follows.

In [None]:
table_list = searchClient.list_catalog('thousand_genomes')

#### Step 4: List schema for sra_drs_files table
Run the following cells to list the columns of the tables used in the queries above.

In [None]:
schema1 = searchClient.list_table_info('thousand_genomes.onek_genomes.sra_drs_files', verbose=True)

In [None]:
schema2 = searchClient.list_table_info('thousand_genomes.onek_genomes.ssd_drs', verbose=True)
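
The schema that Data Connect returns for a table is, as we understand the specification, a JSON Schema document (the `data_model`) whose `properties` describe each column. The sketch below shows one way to turn such a document into a simple column listing. The `describe_columns` helper and the sample model are hypothetical illustrations; the descriptions are made up and the output format of `list_table_info` may differ.

```python
from typing import Dict, List, Tuple


def describe_columns(data_model: Dict) -> List[Tuple[str, str, str]]:
    """Return (column, type, description) for each property in a JSON Schema data model."""
    columns = []
    for name, spec in data_model.get("properties", {}).items():
        column_type = str(spec.get("type", "unknown"))
        description = spec.get("description", "")
        columns.append((name, column_type, description))
    return columns


# Hypothetical data model; the real schema comes from the server.
sample_model = {
    "description": "Example table",
    "properties": {
        "sample_name": {"type": "string", "description": "Sample identifier"},
        "filetype": {"type": "string"},
    },
}

for name, column_type, description in describe_columns(sample_model):
    print(f"{name:12} {column_type:8} {description}")
```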

#### Step 6: Search for a different population group 
🖐 Using the information above about the tables, modify the query to use
a) a population code that represents Gujarati Indians living in a city in Texas.
b) bam files for reads that have not been mapped to a reference genome.

So that you don't have to modify the SQL query itself, you can assign the values you identified to the variables in the next cell of this notebook.

In [None]:
# replace these values
population_code = 'XYZ'
mapping_type = 'your_value_here'

In [None]:
query = f'''SELECT f.sample_name, drs_id bam_drs_id, acc, filename
FROM collections.public_datasets.ssd_drs s 
join collections.public_datasets.sra_drs_files f on f.sample_name = s.su_submitter_id 

where filetype = 'bam' and mapped = '{mapping_type}' 
and sequencing_type ='exome' and  population = '{population_code}'
'''

resultRows = searchClient.run_query(query, return_type='dataframe')
resultRows
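
Substituting values directly into SQL text, as the cell above does, is convenient in a workshop but worth guarding in real code. The following sketch checks each value against a simple pattern before building the query. The helper names (`safe_value`, `build_bam_query`) are hypothetical, and the pattern assumes that the codes in these tables are short alphanumeric tokens.

```python
import re

# Assumption: valid codes contain only letters, digits and underscores.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_]+")


def safe_value(value: str) -> str:
    """Return the value unchanged, or raise if it could alter the SQL statement."""
    if not _SAFE_VALUE.fullmatch(value):
        raise ValueError(f"unexpected characters in query value: {value!r}")
    return value


def build_bam_query(population_code: str, mapping_type: str) -> str:
    """Build the bam file query for a population code and mapping type."""
    return (
        "SELECT f.sample_name, drs_id bam_drs_id, acc, filename\n"
        "FROM collections.public_datasets.ssd_drs s\n"
        "JOIN collections.public_datasets.sra_drs_files f "
        "ON f.sample_name = s.su_submitter_id\n"
        "WHERE filetype = 'bam' AND mapped = '{}'\n"
        "AND sequencing_type = 'exome' AND population = '{}'"
    ).format(safe_value(mapping_type), safe_value(population_code))


# 'JPT' and 'mapped' are the values used in Step 1.
print(build_bam_query("JPT", "mapped"))
```

The resulting string can be passed to `searchClient.run_query` in the same way as the queries above.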

##### Important note
Looking up a data dictionary to discover codes in this way is not what we would typically expect a user to do. Our aim today is to focus on the API and what it is capable of and what it can enable.

Given the information the data schema provides about the data, developers can create interfaces in their systems that allow new data sources to be integrated as they appear.

In the next notebook, we'll look at an example of how a more user-friendly interface can be provided using the information that the API provides.

#### Step 7: Return the whole of a table
The Data Connect standard provides a function to return the whole of a table.

Use this with care, because a table can be large. Here is how to do it using DataConnectClient, which has a built-in row limit to prevent problems.

In [None]:
searchClient.get_data('collections.public_datasets.ssd_drs', return_type='dataframe')

By default, the returned data is limited to 10,000 rows. The limit can be set on the client as follows:
* When the client is created

 `searchClient = DataConnectClient(host_url, row_limit=50000)`
 
 
* At a later stage

 `searchClient.set_row_limit(50000)`
 
Note that the default return type for the client can also be set. This saves having to specify the return type on every query.

* When the client is created

 `searchClient = DataConnectClient(return_type='dataframe')`
 
 
* At a later stage

 `searchClient.set_return_type('dataframe')`
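
To illustrate why a row limit protects against very large tables, here is a minimal sketch of a client-side cap over paged results. It is not the `fasp` implementation; the `limited_rows` helper and the fake page generator are hypothetical. Because pages are consumed lazily, no further pages are requested once the limit is reached.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def rows_from_pages(pages: Iterable[List[Dict]]) -> Iterator[Dict]:
    """Flatten an iterable of pages into a stream of rows."""
    for page in pages:
        yield from page


def limited_rows(pages: Iterable[List[Dict]], row_limit: int) -> List[Dict]:
    """Collect at most row_limit rows, stopping as soon as the limit is reached."""
    return list(islice(rows_from_pages(pages), row_limit))


fetched = []  # records which fake pages were actually requested


def fake_pages() -> Iterator[List[Dict]]:
    """Hypothetical stand-in for a server returning five pages of two rows each."""
    for number in range(1, 6):
        fetched.append(number)
        yield [{"page": number, "row": row} for row in range(2)]


capped = limited_rows(fake_pages(), row_limit=3)
print(len(capped), "rows kept; pages requested:", fetched)
```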



#### Step 8: Combine with DRS Server

The following shows how to use the SRA DRS server from workbook 2-1 to determine where the files we discovered can be obtained.

🖐 Using the results from one of the queries that you ran above, take a DRS id from the query results and use it in the following calls to the NCBI DRS server.

In [None]:
from fasp.loc import DRSClient

drsClient = DRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True, debug=True)
test_id = 'add_id_here'
objInfo = drsClient.get_object(test_id)
objInfo

A second DRS call can be used to obtain a URL to access the file from one of the above locations.

In [None]:
access_id = objInfo['access_methods'][0]['access_id']
print('access_id:{}'.format(access_id))
url = drsClient.get_access_url(test_id, access_id=access_id)
print('url:{}'.format(url))
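
The cell above takes the first access method and assumes it has an `access_id`. In the DRS specification, an access method may instead carry a direct `access_url`, and an object can list several methods of different types. The sketch below shows one way to pick a method by preferred type and handle both cases. The `choose_access_method` helper, the preference order and the sample object are hypothetical.

```python
from typing import Dict, Optional, Sequence


def choose_access_method(
    drs_object: Dict,
    preferred_types: Sequence[str] = ("https", "s3", "gs"),
) -> Optional[Dict]:
    """Return the first access method matching the preferred types, in order."""
    methods = drs_object.get("access_methods", [])
    for wanted in preferred_types:
        for method in methods:
            if method.get("type") == wanted:
                return method
    # Fall back to whatever is available, or None if there are no methods.
    return methods[0] if methods else None


# Hypothetical DRS object; a real one comes from drsClient.get_object().
sample_object = {
    "id": "example-id",
    "access_methods": [
        {"type": "s3", "access_id": "abc", "region": "us-east-1"},
        {"type": "https", "access_url": {"url": "https://example.org/file.bam"}},
    ],
}

method = choose_access_method(sample_object)
if method is None:
    print("no access methods listed")
elif "access_url" in method:
    # The URL is given directly, so no second call is needed.
    print("direct url:", method["access_url"]["url"])
else:
    # An access_id must be exchanged for a URL with a second DRS call.
    print("access_id to resolve:", method["access_id"])
```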