## Use of rich table schema to understand and work with data

The aim of this notebook is to illustrate the following:
* A self-sufficient data scientist can use basic capabilities of their day-to-day toolset (pandas dataframes) to easily transform data retrieved via GA4GH Search
* The GA4GH Search schema provides sufficient information in machine-readable form to do so
* This can be done with no additional curation of data sources
* This endorses the value of requiring data submitters to provide a data dictionary

First, set up a Search client.

In [1]:
from fasp.search import DiscoverySearchClient
cl = DiscoverySearchClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com')

#### Specify the table to search and run the query, specifying the columns to retrieve

The table being searched is a de-identified (scrambled) version of the data for dbGaP study phs001611. This study is represented in the NCI Genomic Data Commons.

A data frame is returned.

In [2]:
table_name = 'search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru'
res = cl.runOneTableQuery(column_list=['dbgap_subject_id', 'age', 'race', 'sex'], table=table_name, limit=10)
res

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________


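| | dbgap_subject_id | age | race | sex |
|---|---|---|---|---|
| 0 | 2675511 | 24 | W | F |
| 1 | 2675537 | 43 | AA | F |
| 2 | 2675497 | 52 | W | F |
| 3 | 2675520 | 55 | W | F |
| 4 | 2675517 | 57 | AA | F |
| 5 | 2675504 | 57 | W | F |
| 6 | 2675552 | 61 | W | F |
| 7 | 2675502 | 61 | W | F |
| 8 | 2675550 | 62 | W | F |
| 9 | 2675494 | 62 | | F |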


The result shows coded values for race and sex. Relying on column names alone gives minimal information about what those codes mean.

In this case, reasonable guesses might be made at what those codes mean. The age column too might reasonably be guessed at. However, that won't always be the case.

dbGaP also provides a data dictionary. These dictionaries are imported by GA4GH, and the information is provided in JSON Schema, the standard form for describing data in GA4GH Search.

### ToDo
Update the table on ga4gh-search-adapter-presto-public, and other tables from the same dataset, to use the dbGaP dictionary representation of the schema.

In [4]:
po_schema = cl.listTableInfo(table_name)

_Schema for tablesearch_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru_
{'data_model': {'$id': 'https://ga4gh-search-adapter-presto-public.prod.dnastack.com/table/search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru/info',
                '$schema': 'http://json-schema.org/draft-07/schema#',
                'description': 'Automatically generated schema',
                'properties': {'age': {'$comment': "Subject's age (Years)",
                                       'format': 'bigint',
                                       'type': 'int'},
                               'dbgap_subject_id': {'$comment': 'Unique '
                                                                'Subject ID in '
                                                                'dbGap',
                                                    'format': 'varchar',
                                                    'type': 'string'},
                               'race': {'$com

The information from the data dictionary makes the data sufficiently well described for a data scientist. That information is supplied by the investigators who conducted the study, so the data is well enough described for useful analysis without any additional harmonization or curation. Note that the existing dbGaP submission process plays a significant role in enabling that.
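As a minimal sketch of how those descriptions could be read programmatically, the snippet below pulls the `$comment` text for each column out of a schema dictionary. It assumes the schema is available as a plain Python dictionary shaped like the output above; whether `listTableInfo` returns exactly this structure is an assumption, and `describe_columns` is a hypothetical helper, not part of the `fasp` library. Only the two columns fully visible in the truncated output are included.

```python
# Hand-copied fragment of the schema output shown above. The output is
# truncated there, so only the two fully visible columns are reproduced.
schema = {
    "data_model": {
        "properties": {
            "age": {
                "$comment": "Subject's age (Years)",
                "format": "bigint",
                "type": "int",
            },
            "dbgap_subject_id": {
                "$comment": "Unique Subject ID in dbGap",
                "format": "varchar",
                "type": "string",
            },
        }
    }
}


def describe_columns(schema):
    """Return {column name: description} using each property's '$comment'.

    Hypothetical helper for illustration; columns without a '$comment'
    get an empty description.
    """
    properties = schema.get("data_model", {}).get("properties", {})
    return {name: spec.get("$comment", "") for name, spec in properties.items()}


column_descriptions = describe_columns(schema)
for name, description in column_descriptions.items():
    print(f"{name}: {description}")
```

A lookup like this lets the column descriptions travel with the dataframe, for example as labels in a report or plot.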

#### Workaround
Until the Search server description is updated from the data dictionary, we'll look at the dictionary directly.

In [7]:
import xml.dom.minidom

# Local copy of Subject_Phenotypes.data_dict.xml for phs001611
with open('../../fasp/data/dbgap/phs001611.v1.pht009160.v1.Organoid_Profiling_PC_Subject_Phenotypes.data_dict.xml') as xmldata:
    # Bind the parsed document to its own name so the xml module is not shadowed on re-run
    dom = xml.dom.minidom.parseString(xmldata.read())
xml_pretty_str = dom.toprettyxml()
print(xml_pretty_str)

<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="./datadict_v2.xsl"?>
<data_table date_created="Thu Jun 13 14:30:34 2019" id="pht009160.v1" participant_set="1" study_id="phs001611.v1">
	<description/>
	<variable id="phv00409134.v1">
		<name>SUBJECT_ID</name>
		<description>De-identified Subject ID</description>
		<type>string</type>
	</variable>
	<variable id="phv00409135.v1">
		<name>sex</name>
		<description>Sex of participant</description>
		<type>encoded value</type>
		<value code="F">Female</value>
		<value code="N/A">Not Applicable</value>
		<value code="M">Male</value>
	</variable>
	<variable id="phv00409136.v1">
		<name>age</name>
		<description>Subject's age</description>
		<type>integer, encoded value</type>
		<unit>Years</unit>
		<logical_min>24</logical_min>
		<logical_max>92</logical_max>
		<value code="N/A">Not vailable</value>
	</variable>
	<variable id="phv00409137.v1">
		<name>race</name>
		<description>Race of participant</description>
		<type>encoded val

Given what the dictionary tells us, we can transform the coding using a built-in capability of a pandas dataframe.

In [8]:
transforms = {
    'sex': {'F': 'Female', 'N/A': 'N/A', 'M': 'Male'},
    'race': {'AA': 'African American',
             'A': 'Asian',
             'W': 'White',
             'H': 'Hispanic',
             'N/A': 'N/A'},
}
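The mapping above was typed by hand from the dictionary. As a rough sketch of how it might instead be derived from the XML, the snippet below collects the `<value code="...">` entries for each `<variable>`. It uses the standard library's `xml.etree.ElementTree` rather than `minidom`, works on an inline fragment copied from the output above rather than the full file, and `build_transforms` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the data dictionary output above: one variable with
# coded values (sex) and one without (SUBJECT_ID).
DATA_DICT_XML = """<data_table id="pht009160.v1" study_id="phs001611.v1">
  <variable id="phv00409134.v1">
    <name>SUBJECT_ID</name>
    <description>De-identified Subject ID</description>
    <type>string</type>
  </variable>
  <variable id="phv00409135.v1">
    <name>sex</name>
    <description>Sex of participant</description>
    <type>encoded value</type>
    <value code="F">Female</value>
    <value code="N/A">Not Applicable</value>
    <value code="M">Male</value>
  </variable>
</data_table>"""


def build_transforms(xml_text):
    """Return {variable name: {code: label}} for variables that define codes."""
    root = ET.fromstring(xml_text)
    transforms = {}
    for variable in root.iter("variable"):
        name = variable.findtext("name")
        codes = {
            value.get("code"): value.text
            for value in variable.findall("value")
            if value.get("code") is not None
        }
        if codes:  # skip variables with no coded values
            transforms[name] = codes
    return transforms


derived = build_transforms(DATA_DICT_XML)
print(derived)
```

Note that a mapping derived this way carries the dictionary's own labels, so `N/A` becomes `Not Applicable` rather than staying `N/A` as in the hand-written version.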

#### Transform the column
Use the dataframe's replace function to apply each mapping to its column.

In [12]:
for col, mapping in transforms.items():
    # replace accepts a {code: label} dict directly
    res[col] = res[col].replace(mapping)

In [13]:
res

| | dbgap_subject_id | age | race | sex |
|---|---|---|---|---|
| 0 | 2675511 | 24 | White | Female |
| 1 | 2675537 | 43 | African American | Female |
| 2 | 2675497 | 52 | White | Female |
| 3 | 2675520 | 55 | White | Female |
| 4 | 2675517 | 57 | African American | Female |
| 5 | 2675504 | 57 | White | Female |
| 6 | 2675552 | 61 | White | Female |
| 7 | 2675502 | 61 | White | Female |
| 8 | 2675550 | 62 | White | Female |
| 9 | 2675494 | 62 | | Female |
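The replacement step can also be tried without access to the Search server. The sketch below rebuilds three of the rows shown above by hand (including the row with a missing race value) and applies the same mappings, passing each dictionary directly to `Series.replace`. The `demo` and `decoded` names are illustrative only.

```python
import pandas as pd

# Same code-to-label mappings as in the notebook.
transforms = {
    "sex": {"F": "Female", "N/A": "N/A", "M": "Male"},
    "race": {
        "AA": "African American",
        "A": "Asian",
        "W": "White",
        "H": "Hispanic",
        "N/A": "N/A",
    },
}

# Three rows copied from the query result above; the last has no race value.
demo = pd.DataFrame(
    {
        "dbgap_subject_id": [2675511, 2675537, 2675494],
        "age": [24, 43, 62],
        "race": ["W", "AA", None],
        "sex": ["F", "F", "F"],
    }
)

# Work on a copy so the coded values stay available for comparison.
decoded = demo.copy()
for col, mapping in transforms.items():
    decoded[col] = decoded[col].replace(mapping)

print(decoded)
```

Values that do not appear in a mapping, including missing values, pass through `replace` unchanged, which is why the empty race cell stays empty in the output above.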


This illustrates the points listed at the beginning of this notebook.

This is still too complex for regular use. A second notebook illustrates how such mappings might be stored and retrieved as needed for more convenient use.

