## Use of rich table schema to understand and work with data

This notebook aims to illustrate the following:
* A self-sufficient data scientist can use basic capabilities of their day-to-day toolset (pandas dataframes) to easily transform data retrieved via GA4GH Search
* The GA4GH Search schema provides sufficient information in machine-readable form to do so
* This can be done with no additional curation of data sources
* The approach demonstrates the value of requiring data submitters to provide a data dictionary

First, set up a Search client.

In [1]:
from fasp.search import DiscoverySearchClient
cl = DiscoverySearchClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com')

#### Specify the table to search and run the query, specifying the columns to retrieve

The table being searched is a de-identified (scrambled) version of the data for dbGaP study phs001611. This study is represented in the NCI Genomic Data Commons.

A data frame is returned.

In [2]:
table_name = 'search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru'
res = cl.runOneTableQuery(column_list=['dbgap_subject_id', 'age', 'race', 'sex'], table=table_name, limit=10)
res

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________


Unnamed: 0,dbgap_subject_id,age,race,sex
0,2675511,24,W,F
1,2675537,43,AA,F
2,2675497,52,W,F
3,2675520,55,W,F
4,2675517,57,AA,F
5,2675504,57,W,F
6,2675552,61,W,F
7,2675502,61,W,F
8,2675550,62,W,F
9,2675494,62,,F


The race and sex columns contain coded values. Column names alone give minimal information about what those codes mean.

In this case, reasonable guesses can be made at what the codes mean, and the meaning of the age column can reasonably be guessed too. However, that won't always be the case.

GA4GH Search provides information about the tables via JSON Schema. In this case that schema was easily populated via the dbGaP data dictionary. 

In [3]:
po_schema = cl.listTableInfo(table_name, verbose=True)

_Schema for tablesearch_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru_
{'data_model': {'$id': 'phs001611.v1.pht009160.v1.Organoid_Profiling_PC_Subject_Phenotypes',
                '$schema': 'http://json-schema.org/draft-07/schema',
                'properties': {'age': {'$comment': "UNIT 'Years'",
                                       'description': "Subject's age",
                                       'maximum': 92.0,
                                       'minimum': 24.0,
                                       'oneOf': [{'const': 'N/A',
                                                  'title': 'Not vailable'}],
                                       'type': 'integer, encoded value'},
                               'race': {'description': 'Race of participant',
                                        'oneOf': [{'const': 'AA',
                                                   'title': 'African American'},
                                                  {'const

The information from the data dictionary helps describe the data for a data scientist. That information was supplied by the investigators who conducted the study. This makes the data available for analysis without any additional harmonization or curation. Note that the existing dbGaP submission process plays a significant role in enabling that.

Given what the dictionary tells us, we can transform the coding using a built-in capability of a pandas dataframe.

In [5]:
transforms = { 'sex':{'F': 'Female', 'N/A': 'N/A', 'M': 'Male'},
               'race' : {'AA': 'African American',
                   'A': 'Asian',
                   'W': 'White',
                   'H': 'Hispanic',
                   'N/A': 'N/A'}
             }
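
The mapping above was typed by hand from the data dictionary. Because the schema carries a `const` and a `title` for each code, a dictionary of the same shape could in principle be derived programmatically. The following is a minimal sketch, assuming the `data_model` has the `properties` / `oneOf` layout printed earlier. `sample_data_model` and `mapping_from_schema` are hypothetical names introduced here, and the `W` entry is assumed from the hand-written mapping, since the printed schema is truncated before it.

```python
# Hypothetical, hand-written fragment shaped like the 'data_model' printed above.
# It is not the full schema; the 'W' title is assumed from the hand-written mapping.
sample_data_model = {
    "properties": {
        "race": {
            "description": "Race of participant",
            "oneOf": [
                {"const": "AA", "title": "African American"},
                {"const": "W", "title": "White"},
            ],
        },
    }
}


def mapping_from_schema(data_model, columns):
    """Build {column: {code: title}} from the 'oneOf' entries of each column."""
    mapping = {}
    for column in columns:
        column_schema = data_model.get("properties", {}).get(column, {})
        codes = {}
        for option in column_schema.get("oneOf", []):
            # Skip entries that lack either the code or its human-readable title
            if "const" in option and "title" in option:
                codes[option["const"]] = option["title"]
        mapping[column] = codes
    return mapping


derived = mapping_from_schema(sample_data_model, ["race"])
print(derived)
```

A column with no `oneOf` entries yields an empty dictionary, so it is left unchanged by a later replace step.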

#### Transform the column
Use the dataframe's replace function to apply the mapping to each column.

In [6]:
for col, mapping in transforms.items():
    # Passing the dict directly replaces each code (key) with its label (value)
    res[col] = res[col].replace(mapping)
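
The replace step can also be tried in isolation. The sketch below uses a small made-up frame (`sample` and `sample_transforms` are illustrative, not data from the study) to show two things: `DataFrame.replace` accepts the nested `{column: {code: label}}` dictionary in a single call, and it leaves missing values and unmapped codes untouched, whereas `Series.map` turns them into NaN.

```python
import pandas as pd

# Made-up rows using the kinds of codes seen in the query result. 'X' is a code
# with no entry in the mapping, and None stands in for a missing value.
sample = pd.DataFrame(
    {"race": ["W", "AA", "X", None], "sex": ["F", "F", "M", "F"]}
)
sample_transforms = {
    "sex": {"F": "Female", "M": "Male"},
    "race": {"AA": "African American", "W": "White"},
}

# A nested dict {column: {old: new}} lets DataFrame.replace handle every column at once
decoded = sample.replace(sample_transforms)

# Series.map is stricter: anything absent from the mapping becomes NaN
strict_race = sample["race"].map(sample_transforms["race"])

print(decoded)
print(strict_race)
```

Which behaviour is preferable depends on the analysis: `replace` preserves unexpected codes for later inspection, while `map` makes them stand out as missing.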

In [7]:
res

Unnamed: 0,dbgap_subject_id,age,race,sex
0,2675511,24,White,Female
1,2675537,43,African American,Female
2,2675497,52,White,Female
3,2675520,55,White,Female
4,2675517,57,African American,Female
5,2675504,57,White,Female
6,2675552,61,White,Female
7,2675502,61,White,Female
8,2675550,62,White,Female
9,2675494,62,,Female


This illustrates the points listed at the beginning of this notebook.

To help generate a mapping like the one above, the DiscoverySearchClient class provides the following function. It creates a prepopulated template that can be edited to supply the values to which the codes should be mapped.

In [8]:
template = cl.getMappingTemplate(table_name,['sex','race'])
template

{'sex': {'F': 'replaceThis', 'N/A': 'replaceThis', 'M': 'replaceThis'},
 'race': {'AA': 'replaceThis',
  'A': 'replaceThis',
  'W': 'replaceThis',
  'H': 'replaceThis',
  'N/A': 'replaceThis'}}
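
After editing a template by hand, it is easy to leave a placeholder behind. The following is a small sketch of a check for that, assuming the placeholder string is exactly `replaceThis` as in the output above. `unfilled_codes` and `edited` are hypothetical names and are not part of DiscoverySearchClient.

```python
PLACEHOLDER = "replaceThis"  # placeholder string seen in the template output


def unfilled_codes(template):
    """Return {column: [codes]} for codes whose value is still the placeholder."""
    pending = {}
    for column, codes in template.items():
        missing = [code for code, label in codes.items() if label == PLACEHOLDER]
        if missing:
            pending[column] = missing
    return pending


# Hand-edited copy of the template with one value deliberately left unfilled
edited = {
    "sex": {"F": "Female", "N/A": "N/A", "M": "Male"},
    "race": {
        "AA": "African American",
        "A": "Asian",
        "W": "White",
        "H": "replaceThis",
        "N/A": "N/A",
    },
}

print(unfilled_codes(edited))
```

An empty result means every code has been given a label and the template is ready to use as a mapping.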

Note that this is base-level functionality with regard to mappings; more convenient options are needed.

A second notebook illustrates how such mappings might be stored and retrieved as needed for more convenient use.
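
As a rough idea of what storing and retrieving a mapping could look like, the sketch below round-trips one through a JSON file using only the standard library. This is an assumption for illustration and not necessarily the approach taken in the second notebook; the file name and the `save_mapping` / `load_mapping` helpers are hypothetical.

```python
import json
import tempfile
from pathlib import Path

mappings = {"sex": {"F": "Female", "N/A": "N/A", "M": "Male"}}


def save_mapping(mapping, path):
    """Write the mapping as human-readable JSON."""
    Path(path).write_text(json.dumps(mapping, indent=2, sort_keys=True))


def load_mapping(path):
    """Read a mapping previously written by save_mapping."""
    return json.loads(Path(path).read_text())


# A temporary directory keeps the example free of side effects
with tempfile.TemporaryDirectory() as tmp_dir:
    mapping_file = Path(tmp_dir) / "phenotype_mappings.json"
    save_mapping(mappings, mapping_file)
    restored = load_mapping(mapping_file)

print(restored == mappings)
```

JSON object keys are always strings, so this round trip is lossless for string codes such as those used here; integer codes would come back as strings and would need converting.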

The Search specification is open to more sophisticated examples being provided by third parties. Crowd-sourcing of mappings is encouraged, with information available to allow users to determine whether mappings, from any source, are fit for their purpose.