## GA4GH Data Connect Level 2 entry point

This notebook illustrates what a Data Connect implementation can offer when you have a data dictionary, model, or schema for your data.

The following schema lists a subset of the attributes for the Cineca H3 Africa synthetic dataset.



In [1]:
from fasp.search import DataConnectClient
cl = DataConnectClient('http://localhost:8089/')
#cl.listTables(verbose=True)



### Listing the table information

The following function call requests the table's `info` endpoint:
http://localhost:8089/table/bigquery.cineca.syn_europe_ch_sib/info

In [14]:
cl.listTableInfo('bigquery.cineca.syn_europe_ch_sib',verbose=True)

_Schema for tablebigquery.cineca.syn_europe_ch_sib_
{
   "name": "bigquery.cineca.syn_europe_ch_sib",
   "description": "Scrambled version of sample data for phs001554 Colorectal cancer susceptibility study.",
   "data_model": {
      "$id": "phs001554.v1.pht007610.v1.GECCO_CRC_Susceptibility_Sample_Attributes",
      "description": "Scrambled version of sample data for phs001554 Colorectal cancer susceptibility study.",
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {
         "pt": {
            "type": "string",
            "description": "Patient identifier"
         },
         "phyact": {
            "type": "string",
            "oneOf": [
               {
                  "const": ">3WK"
               },
               {
                  "const": "1WK"
               },
               {
                  "const": "2WK"
               },
               {
                  "const": "K"
               },
               {
                  "const"

<fasp.search.data_connect_client.SearchSchema at 0x131145910>
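
For orientation, the request behind `listTableInfo` can be sketched with the standard library alone. This is a minimal sketch, not the `fasp` implementation: the helper names `table_info_url` and `fetch_table_info` are hypothetical, and it assumes the server answers with JSON containing the `name`, `description` and `data_model` keys seen in the output above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def table_info_url(base_url, table_name):
    """Build the table-info URL for a Data Connect server (hypothetical helper)."""
    # Strip any trailing slash so the path is joined with exactly one "/".
    return f"{base_url.rstrip('/')}/table/{quote(table_name, safe='')}/info"


def fetch_table_info(base_url, table_name):
    """GET the table info and decode the JSON body (needs a running server)."""
    with urlopen(table_info_url(base_url, table_name)) as response:
        return json.load(response)


url = table_info_url("http://localhost:8089/", "bigquery.cineca.syn_europe_ch_sib")
print(url)
```

Only `table_info_url` runs here; `fetch_table_info` would need the local server from the first cell to be up.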

In [11]:
cl.runQuery('select pt, phyact from bigquery.cineca.syn_europe_ch_sib where wt > 90.0')

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________


[['FAKE594', 'N'],
 ['FAKE2564', 'N'],
 ['FAKE2291', 'K'],
 ['FAKE3366', '>3WK'],
 ['FAKE6472', '>3WK'],
 ['FAKE3286', 'N'],
 ['FAKE6716', '>3WK'],
 ['FAKE956', '1WK'],
 ['FAKE3227', 'N'],
 ['FAKE3451', 'N'],
 ['FAKE4486', 'N'],
 ['FAKE1790', 'N'],
 ['FAKE6317', '2WK'],
 ['FAKE4874', '2WK'],
 ['FAKE6595', 'N'],
 ['FAKE1902', '1WK'],
 ['FAKE2482', 'K'],
 ['FAKE5017', '>3WK'],
 ['FAKE6481', '1WK'],
 ['FAKE608', '1WK'],
 ['FAKE4555', 'N'],
 ['FAKE6043', 'N'],
 ['FAKE285', '2WK'],
 ['FAKE712', '2WK'],
 ['FAKE1932', 'N'],
 ['FAKE2192', 'K'],
 ['FAKE4158', '1WK'],
 ['FAKE4685', 'N'],
 ['FAKE5884', '2WK'],
 ['FAKE273', '2WK'],
 ['FAKE2628', '1WK'],
 ['FAKE4400', 'N'],
 ['FAKE6210', 'N'],
 ['FAKE5544', 'N'],
 ['FAKE2289', '1WK'],
 ['FAKE3154', 'N'],
 ['FAKE4356', 'K'],
 ['FAKE2222', 'N'],
 ['FAKE4262', '>3WK'],
 ['FAKE1290', 'K'],
 ['FAKE2723', '2WK'],
 ['FAKE703', '>3WK'],
 ['FAKE1895', 'K'],
 ['FAKE3199', 'N'],
 ['FAKE4216', '1WK'],
 ['FAKE4293', '1WK'],
 ['FAKE1130', '1WK'],
 ['FAKE1470', '
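
The `Page1` to `Page8` lines show that the result arrived in several pages. In the Data Connect specification, each page carries a `data` array and, while more rows remain, a `pagination.next_page_url` pointing to the next page. The loop below is a minimal sketch of that pattern, not the `fasp` client code: `collect_rows` and the injected `fetch_page` callable are hypothetical names, and the in-memory `pages` dictionary stands in for a server so the example runs offline.

```python
def collect_rows(first_url, fetch_page):
    """Follow next_page_url links and gather the rows from every page."""
    rows = []
    url = first_url
    while url:
        page = fetch_page(url)
        # Pages without a "data" key contribute no rows.
        rows.extend(page.get("data", []))
        # Missing pagination, or a missing next_page_url, marks the last page.
        url = (page.get("pagination") or {}).get("next_page_url")
    return rows


# Stand-in for a server: two pages keyed by made-up URLs.
pages = {
    "page1": {
        "data": [{"pt": "FAKE594", "phyact": "N"}],
        "pagination": {"next_page_url": "page2"},
    },
    "page2": {"data": [{"pt": "FAKE2564", "phyact": "N"}], "pagination": None},
}

rows = collect_rows("page1", pages.__getitem__)
print([[row["pt"], row["phyact"]] for row in rows])
```

Passing the fetch function in as an argument keeps the paging logic testable without a network connection.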

## Add a new table

In [2]:
cl.listTableInfo('bigquery.cineca.syn_Africa_H3ABioNet_v1',verbose=True)

_Schema for tablebigquery.cineca.syn_Africa_H3ABioNet_v1_
{
   "name": "bigquery.cineca.syn_Africa_H3ABioNet_v1",
   "description": "CINECA synthetic cohort Africa H3ABioNet v1",
   "data_model": {
      "description": "CINECA synthetic cohort Africa H3ABioNet v1",
      "name": "bigquery.cineca.syn_Africa_H3ABioNet_v1",
      "data_model": {
         "$id": "",
         "description": "CINECA synthetic cohort Africa H3ABioNet v1",
         "$schema": "http://json-schema.org/draft-07/schema",
         "properties": {
            "pid": {
               "description": "Participant ID:",
               "type": "text"
            },
            "height_avg": {
               "description": "Average height:",
               "type": "number",
               "$unit": "cm"
            },
            "weight_avg": {
               "description": "Calculated average weight (kg):",
               "type": "number",
               "$unit": "kg"
            },
            "hypertension": {
        

<fasp.search.data_connect_client.SearchSchema at 0x1047a7c70>
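
Because the data model is expressed as JSON Schema, a client can read column types and units from it directly. The following is a minimal sketch under stated assumptions: the `data_model` dictionary is hand-copied from the three complete properties in the output above, `$unit` is treated as an optional key, and `summarize_properties` is a hypothetical helper name.

```python
def summarize_properties(data_model):
    """Return (column, type, unit) tuples for each property in a data model."""
    summary = []
    for column, spec in data_model.get("properties", {}).items():
        # "$unit" is not present on every property, so default to None.
        summary.append((column, spec.get("type"), spec.get("$unit")))
    return summary


data_model = {
    "properties": {
        "pid": {"description": "Participant ID:", "type": "text"},
        "height_avg": {"description": "Average height:", "type": "number", "$unit": "cm"},
        "weight_avg": {
            "description": "Calculated average weight (kg):",
            "type": "number",
            "$unit": "kg",
        },
    }
}

for column, column_type, unit in summarize_properties(data_model):
    print(column, column_type, unit)
```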

In [4]:
cl.runQuery('select sex, count(*) from bigquery.cineca.syn_Africa_H3ABioNet_v1 group by sex',returnType='dataframe')

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________
____Page9_______________


|   | sex | _col1 |
|---|---|---|
| 0 | Male | 19 |
| 1 | Other | 26 |
| 2 | Refused | 27 |
| 3 | Female | 28 |


## Looking forward to level 3 - Using existing lists of countries

### Load the caDSR list of countries

In [4]:

import json

# Read the caDSR country list; the file is closed when the with-block exits
with open("./schema generation/cadsr_countries.json") as f:
    data = json.load(f)

# Each entry is an object whose 'const' key holds the country name
cadsr_countries = [i['const'] for i in data]
print(cadsr_countries)

['Falkland Islands', 'Faroe Islands', 'France', 'France, Metropolitan', 'Gabon', 'United Kingdom', 'Grenada', 'Georgia', 'French Guiana', 'Guernsey', 'Ghana', 'Gibraltar', 'Greenland', 'Gambia, The', 'Guinea', 'Guadeloupe', 'Equatorial Guinea', 'Greece', 'South Georgia and the Islands', 'Guatemala', 'Guam', 'Guinea-Bissau', 'Guyana', 'Hong Kong', 'Heard Island and McDonald Islands', 'Honduras', 'Croatia', 'Haiti', 'Hungary', 'Indonesia', 'Ireland', 'Israel', 'Isle of Man', 'India', 'British Indian Ocean Territory', 'Iraq', 'Iran', 'Iceland', 'Italy', 'Jersey', 'Jamaica', 'Jordan', 'Japan', 'Kenya', 'Kyrgyzstan', 'Cambodia', 'Kiribati', 'Comoros', 'Saint Kitts and Nevis', 'Korea, North', 'Korea, South', 'Kuwait', 'Cayman Islands', 'Kazakhstan', 'Laos', 'Lebanon', 'Saint Lucia', 'Liechtenstein', 'Sri Lanka', 'Liberia', 'Lesotho', 'Lithuania', 'Luxembourg', 'Latvia', 'Libya', 'Morocco', 'Monaco', 'Moldova', 'Madagascar', 'Marshall Islands', 'Micronesia, Federated States of', 'Macedonia', 

### Load the Cineca country list

In [5]:
# Read the Cineca schema; the file is closed when the with-block exits
with open("./schema generation/bigquery.cineca.syn_Africa_H3ABioNet_v3.json") as f:
    data = json.load(f)

# The allowed country names are the 'const' values under def -> country -> oneOf
cineca_countries = [i['const'] for i in data["def"]["country"]["oneOf"]]
print(cineca_countries)

['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Terr', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo', 'Cook Islands', 'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands', 'Faroe Islands', 'Fiji Islands', 'Finland', 'France', 'F

In [6]:
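# Names that appear only in the Cineca list
difference_1 = set(cineca_countries).difference(set(cadsr_countries))
# Names that appear only in the caDSR list
difference_2 = set(cadsr_countries).difference(set(cineca_countries))

# Names present in exactly one of the two lists
list_difference = list(difference_1.union(difference_2))
print(list_difference)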

['Glorioso Islands', 'Western Samoa', 'Navassa Island', 'United States Minor Outly', 'Pitcairn Islands', 'Holy See (Vatican City St', 'Guernsey', 'Micronesia', 'French Southern territori', 'Congo, Republic of the', 'Fiji Islands', 'Holy See (Vatican City)', 'Clipperton Island', 'Bassas da India', 'Saint Vincent and the Grenadines', 'Gaza Strip', 'Virgin Islands (UK)', 'Micronesia, Federated States of', 'Spratly Islands', 'Virgin Islands (US)', 'Macau', 'Libyan Arab Jamahiriya', 'Timor-Leste', 'Paracel Islands', 'South Sudan', 'Kazakstan', 'Jan Mayen', 'Svalbard', 'Libya', 'United States Minor Outlying Islands', 'Gambia, The', 'Zaire', 'Saint Barthelemy', 'Netherland Antilles', 'Russian Federation', 'Isle of Man', 'Serbia', 'Bahamas', 'Congo', 'Heard Island and McDonald Islands', 'Pitcairn', 'Russia', 'British Indian Ocean Terr', 'South Georgia and the Sou', 'Svalbard and Jan Mayen', 'Europa Island', 'Netherlands Antilles', 'Saint Vincent and the Gre', 'Congo, Democratic Republic of the
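
The union of the two one-way differences is the symmetric difference of the two sets, which Python exposes directly. A small sketch follows, using a few names taken from the lists above; the two short lists are illustrative subsets, not the full data.

```python
cineca_sample = ["France", "Kazakstan", "Russian Federation"]
cadsr_sample = ["France", "Kazakhstan", "Russia"]

only_cineca = set(cineca_sample) - set(cadsr_sample)
only_cadsr = set(cadsr_sample) - set(cineca_sample)

# The ^ operator (set.symmetric_difference) gives the same result as the union.
assert only_cineca | only_cadsr == set(cineca_sample) ^ set(cadsr_sample)

# Sorting makes the printed order reproducible; set iteration order is not.
print(sorted(set(cineca_sample) ^ set(cadsr_sample)))
```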

In [7]:
difference_1

{'Bahamas',
 'British Indian Ocean Terr',
 'Congo',
 'East Timor',
 'Fiji Islands',
 'French Southern territori',
 'Gambia',
 'Heard Island and McDonald',
 'Holy See (Vatican City St',
 'Kazakstan',
 'Libyan Arab Jamahiriya',
 'Macao',
 'Micronesia',
 'Netherlands Antilles',
 'North Korea',
 'Palestine',
 'Pitcairn',
 'Russian Federation',
 'Saint Vincent and the Gre',
 'South Georgia and the Sou',
 'South Korea',
 'South Sudan',
 'Svalbard and Jan Mayen',
 'United States Minor Outly',
 'Yugoslavia'}

In [8]:
difference_2

{'Ascension Island',
 'Ashmore and Cartier Islands',
 'Bahamas, The',
 'Baker Island',
 'Bassas da India',
 'British Indian Ocean Territory',
 'Burma',
 'Clipperton Island',
 'Congo, Democratic Republic of the',
 'Congo, Republic of the',
 'Coral Sea Islands',
 'Europa Island',
 'Fiji',
 'France, Metropolitan',
 'French Southern and Antarctic Lands',
 'Gambia, The',
 'Gaza Strip',
 'Glorioso Islands',
 'Guernsey',
 'Heard Island and McDonald Islands',
 'Holy See (Vatican City)',
 'Howland Island',
 'Isle of Man',
 'Jan Mayen',
 'Jarvis Island',
 'Jersey',
 'Juan de Nova Island',
 'Kazakhstan',
 'Kingman Reef',
 'Korea, North',
 'Korea, South',
 'Kosovo',
 'Libya',
 'Macau',
 'Micronesia, Federated States of',
 'Midway Islands',
 'Navassa Island',
 'Netherland Antilles',
 'Palmyra Atoll',
 'Paracel Islands',
 'Pitcairn Islands',
 'Russia',
 'Saint Barthelemy',
 'Saint Martin',
 'Saint Vincent and the Grenadines',
 'Serbia',
 'South Georgia and the Islands',
 'Spratly Islands',
 'Svalbar
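
Several names unique to the Cineca list, such as 'British Indian Ocean Terr' and 'Saint Vincent and the Gre', look like caDSR names cut off at 25 characters. One way to reconcile such pairs is to match on prefix. This is a hedged sketch rather than part of the original notebook: `match_truncated` is a hypothetical helper, the 25-character limit is inferred from the names above, and the inputs are small subsets of `difference_1` and `difference_2`.

```python
def match_truncated(short_names, full_names, limit=25):
    """Pair names that hit the length limit with full names sharing that prefix."""
    matches = {}
    for short in short_names:
        # Only names exactly at the limit are candidates for truncation.
        if len(short) != limit:
            continue
        candidates = [full for full in full_names if full.startswith(short)]
        # Keep the pair only when the prefix identifies a single full name.
        if len(candidates) == 1:
            matches[short] = candidates[0]
    return matches


cineca_only = {"British Indian Ocean Terr", "Saint Vincent and the Gre", "Kazakstan"}
cadsr_only = {
    "British Indian Ocean Territory",
    "Saint Vincent and the Grenadines",
    "Kazakhstan",
}

for short, full in sorted(match_truncated(cineca_only, cadsr_only).items()):
    print(f"{short!r} -> {full!r}")
```

Spelling variants such as 'Kazakstan' and 'Kazakhstan' are not caught by prefix matching and would need a separate mapping.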