### Schema Registry - retrieving schemas for multiple dbGaP studies

See https://github.com/ianfore/ga4gh-starter-schema-repository for details of the implementation.


In [1]:
import json
import xml.etree.ElementTree as ET

import requests

def prettyprint(a_dict):
    print(json.dumps(a_dict, indent=3))

def printline(char="_"):
    print(char*80)

In [4]:
base = "http://localhost:8080"

### Get schemas for a dbGaP study
The request below uses the Schema Registry's ability to filter schemas by named query parameters, here `study` and `study_version`.

dbGaP contains schemas for multiple versions of over 2,500 studies. That makes it a good use case for retrieving a specific subset of schemas rather than every schema in a namespace.

**For consideration/discussion:** What should a namespace be? Instead of using dbGaP as the namespace, each study could be its own namespace.

In [5]:
namespace = "dbGaP"
endpoint = f"/schemas/{namespace}/?study=phs002921&study_version=v2.p1"
print(endpoint)
response = requests.get(f"{base}{endpoint}")
schemas = response.json()['schemas']
for schema in schemas:
    print(schema['schema_name'])

/schemas/dbGaP/?study=phs002921&study_version=v2.p1
phs002921.v2.pht012614.v1.ICAC_Subject_Phenotypes
phs002921.v2.pht012612.v1.ICAC_Subject
phs002921.v2.pht012613.v1.ICAC_Sample
phs002921.v2.pht012615.v1.ICAC_Sample_Attributes
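
The query string can also be assembled from named values instead of being written into the f-string by hand. The following is a minimal sketch: `build_schema_endpoint` is a hypothetical helper, not part of the Schema Registry, and it assumes only the `study` and `study_version` parameters used in the request above.

```python
from urllib.parse import urlencode


def build_schema_endpoint(namespace, **filters):
    """Return the listing path for a namespace, filtered by named parameters.

    Hypothetical helper for illustration; the filter names passed in
    (for example study, study_version) are those used in the request above.
    """
    path = f"/schemas/{namespace}/"
    # Drop filters whose value is None so optional parameters can be omitted
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{path}?{query}" if query else path


endpoint = build_schema_endpoint("dbGaP", study="phs002921", study_version="v2.p1")
print(endpoint)
```

When calling the registry with `requests`, the same values can instead be passed as a dictionary through the `params=` argument of `requests.get`, which builds the query string for you.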


In [6]:
prettyprint(response.json())

{
   "namespace": "dbGaP",
   "schemas": [
      {
         "schema_name": "phs002921.v2.pht012614.v1.ICAC_Subject_Phenotypes",
         "latest_version": "v1",
         "maintainer": [
            "dbGaP"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "phs002921.v2.pht012612.v1.ICAC_Subject",
         "latest_version": "v1",
         "maintainer": [
            "dbGaP"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "phs002921.v2.pht012613.v1.ICAC_Sample",
         "latest_version": "v1",
         "maintainer": [
            "dbGaP"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "phs002921.v2.pht012615.v1.ICAC_Sample_Attributes",
         "latest_version": "v1",
         "maintainer": [
            "dbGaP"
         ],
         "lifecycle_stage": "released"
      }
   ]
}
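
The schema names in this response appear to follow a dotted pattern: study accession, study version, table accession, table version, and table name. The sketch below splits a name on that assumption. The pattern is inferred from the four names above rather than from a documented specification, and `parse_schema_name` and its field names are hypothetical.

```python
from typing import NamedTuple


class SchemaName(NamedTuple):
    study: str
    study_version: str
    table: str
    table_version: str
    table_name: str


def parse_schema_name(schema_name):
    """Split a dotted dbGaP schema name into its five assumed components."""
    # Split on the first four dots only, so any dots in the table name survive
    parts = schema_name.split(".", 4)
    if len(parts) != 5:
        raise ValueError(f"Unexpected schema name format: {schema_name!r}")
    return SchemaName(*parts)


parsed = parse_schema_name("phs002921.v2.pht012614.v1.ICAC_Subject_Phenotypes")
print(parsed.study, parsed.table, parsed.table_name)
```

Note that the schema name carries only `v2` for the study version, whereas the query parameter above was `v2.p1`.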


### Finding all dbGaP study versions
No implementation currently provides a list of all dbGaP studies, so the list has to be obtained some other way, and there are several options. For demo purposes, we load a list of dbGaP studies from a file.

In [11]:
with open("data/dbgap_studies_with_schema.json") as f:
    studies = json.load(f)
print(f"{len(studies)} studies loaded")

2352 studies loaded


### An example study from the list

In [12]:
studies[10]

{'study': 'phs000145', 'study_version': 'v4.p2'}
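
Each entry pairs a study accession with one study version. If the file lists several versions of the same study (an assumption; the notebook does not show whether it does), the entries can be grouped by accession. This is a minimal sketch: `group_versions` is a hypothetical helper, and only the first sample entry comes from the output above. The `phs000000` entries are made up for illustration.

```python
from collections import defaultdict


def group_versions(studies):
    """Map each study accession to the list of versions listed for it."""
    versions = defaultdict(list)
    for entry in studies:
        versions[entry["study"]].append(entry["study_version"])
    return dict(versions)


sample = [
    {"study": "phs000145", "study_version": "v4.p2"},  # from the output above
    {"study": "phs000000", "study_version": "v1.p1"},  # made-up entry
    {"study": "phs000000", "study_version": "v2.p1"},  # made-up entry
]
grouped = group_versions(sample)
print(grouped)
```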

### List schemas for a subset of studies

In [20]:
import time

printSchemas = True

for sv in studies[2000:2010]:
    study = sv['study']
    study_version = sv['study_version']
    endpoint = f"/schemas/{namespace}/?study={study}&study_version={study_version}"
    print(f"Retrieving schemas at: {endpoint}")
    response = requests.get(f"{base}{endpoint}")
    listing = response.json()  # parse the response body once
    if 'schemas' in listing:
        for schema in listing['schemas']:
            print(f"Schema: {schema['schema_name']}")
            if printSchemas:
                print("Downloading Schema:")
                url = f"{base}/schemas/{namespace}/{schema['schema_name']}/versions/latest"
                # Use a separate name so the listing response is not overwritten
                schema_response = requests.get(url)
                prettyprint(schema_response.json())
                printline()
    else:
        print('No schemas')
        prettyprint(listing)
    printline('=')
    # time.sleep(1)  # uncomment to pause between studies

Retrieving schemas at: /schemas/dbGaP/?study=phs001956&study_version=v1.p1
Schema: phs001956.v1.pht009785.v1.sc_Liver_Subject
Downloading Schema:
{
   "description": "",
   "$id": "dbgap:pht009785.v1",
   "properties": {
      "SUBJECT_ID": {
         "$id": "dbgap:phv00424150.v1",
         "description": "Subject ID",
         "type": "string",
         "$unit": null
      },
      "CONSENT": {
         "$id": "dbgap:phv00424151.v1",
         "description": "Consent group as determined by DAC",
         "type": "encoded value",
         "$unit": null,
         "oneOf": [
            {
               "const": "2",
               "title": "Health/Medical/Biomedical (PUB, NPU) (HMB-PUB-NPU)"
            }
         ]
      }
   }
}
________________________________________________________________________________
Schema: phs001956.v1.pht009786.v1.sc_Liver_Sample
Downloading Schema:
{
   "description": "",
   "$id": "dbgap:pht009786.v1",
   "properties": {
      "SUBJECT_ID": {
         "$id

{
   "description": "",
   "$id": "dbgap:pht009829.v2",
   "properties": {
      "SEQUENCING_CENTER": {
         "$id": "dbgap:phv00426524.v2",
         "description": "Name of the center which conducted sequencing or genotyping",
         "type": "string",
         "$unit": null
      },
      "ANALYTE_TYPE": {
         "$id": "dbgap:phv00426522.v2",
         "description": "Analyte Type",
         "type": "string",
         "$unit": null
      },
      "IS_TUMOR": {
         "$id": "dbgap:phv00426523.v2",
         "description": "Tumor status",
         "type": "encoded value",
         "$unit": null
      },
      "BODY_SITE": {
         "$id": "dbgap:phv00426521.v2",
         "description": "Body site where sample was collected",
         "type": "encoded value",
         "$unit": null
      },
      "SAMPLE_ID": {
         "$id": "dbgap:phv00426520.v2",
         "description": "Sample ID",
         "type": "string",
         "$unit": null
      }
   }
}
___________________________

{
   "description": "Subject ID, sample ID, sample source, source sample ID, and sample use variable obtained from participants with or without colorectal cancer and involved in the \"Uncovering the Genetic Architecture of Colorectal Cancer with Focus of Rare and Less Frequent Variant\" project.",
   "$id": "dbgap:pht006685.v1",
   "properties": {
      "SOURCE_SAMPLE_ID": {
         "$id": "dbgap:phv00308269.v1",
         "description": "Sample ID used in the Source Repository",
         "type": "string",
         "$unit": null
      },
      "SUBJECT_ID": {
         "$id": "dbgap:phv00308266.v1",
         "description": "Subject ID",
         "type": "string",
         "$unit": null
      },
      "SAMPLE_USE": {
         "$id": "dbgap:phv00308270.v1",
         "description": "Sample Use",
         "type": "encoded values",
         "$unit": null
      },
      "SAMPLE_SOURCE": {
         "$id": "dbgap:phv00308268.v1",
         "description": "Source repository (study) where samples 

{
   "description": "The subject consent file includes subject IDs, consent information, and biological sex.",
   "$id": "dbgap:pht007131.v3",
   "properties": {
      "SEX": {
         "$id": "dbgap:phv00528567.v1",
         "description": "Biological sex of subject",
         "type": "encoded value",
         "$unit": null,
         "oneOf": [
            {
               "const": "1",
               "title": "Male"
            },
            {
               "const": "2",
               "title": "Female"
            }
         ]
      },
      "CONSENT": {
         "$id": "dbgap:phv00328540.v2",
         "description": "Consent group as determined by DAC",
         "type": "encoded value",
         "$unit": null,
         "oneOf": [
            {
               "const": "1",
               "title": "General Research Use (GRU)"
            }
         ]
      },
      "SUBJID": {
         "$id": "dbgap:phv00328539.v2",
         "description": "De-identified subject ID",
         "type

{
   "description": "Sample ID, analyte type, body site where sample was obtained, histological type of sample, tumor status of sample, sequencing center, genotyping center, and methylation center of participants involved in the \"National Cancer Institute (NCI) Primary Human Melanocyte QTL Study\" project.",
   "$id": "dbgap:pht007247.v2",
   "properties": {
      "HISTOLOGICAL_TYPE": {
         "$id": "dbgap:phv00334428.v1",
         "description": "Cell or tissue type or subtype of sample",
         "type": "string",
         "$unit": null
      },
      "SEQUENCING_CENTER": {
         "$id": "dbgap:phv00334432.v1",
         "description": "Name of the center which conducted sequencing",
         "type": "string",
         "$unit": null
      },
      "ANALYTE_TYPE": {
         "$id": "dbgap:phv00334430.v1",
         "description": "Analyte type",
         "type": "string",
         "$unit": null
      },
      "IS_TUMOR": {
         "$id": "dbgap:phv00334429.v1",
         "descript
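
The loop above assumes that every request succeeds and returns JSON. One way to separate the traversal from the HTTP details is to pass in the function that fetches a URL, which also lets the logic run without a live registry. This is a sketch under assumptions: `schema_names_for_studies` and `fake_fetch` are hypothetical, and the canned payload reuses two schema names from the output above.

```python
def schema_names_for_studies(studies, fetch_json, base, namespace):
    """Yield (study, study_version, [schema names]) for each study entry.

    fetch_json is any callable that takes a URL and returns parsed JSON.
    """
    for entry in studies:
        study, version = entry["study"], entry["study_version"]
        url = f"{base}/schemas/{namespace}/?study={study}&study_version={version}"
        payload = fetch_json(url)
        # A payload without a 'schemas' key is treated as "no schemas"
        names = [s["schema_name"] for s in payload.get("schemas", [])]
        yield study, version, names


def fake_fetch(url):
    # Stand-in for a real HTTP call; returns a canned listing for one study
    if "study=phs001956" in url:
        return {"namespace": "dbGaP", "schemas": [
            {"schema_name": "phs001956.v1.pht009785.v1.sc_Liver_Subject"},
            {"schema_name": "phs001956.v1.pht009786.v1.sc_Liver_Sample"},
        ]}
    return {"namespace": "dbGaP", "schemas": []}


results = list(schema_names_for_studies(
    [{"study": "phs001956", "study_version": "v1.p1"}],
    fake_fetch, "http://localhost:8080", "dbGaP"))
print(results)
```

Against a running registry, `fetch_json` could be something like `lambda url: requests.get(url, timeout=10).json()`; the timeout value here is arbitrary.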

### Listing all dbGaP schemas via schemas/namespace is not currently supported

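There are currently over 13,000 dbGaP schemas. A request to `/schemas/dbGaP/` with no query parameters returns an empty list, as shown below.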

In [334]:
endpoint = "/schemas/dbGaP/"
print(endpoint)
response = requests.get(f"{base}{endpoint}")
prettyprint(response.json())

/schemas/dbGaP/
{
   "namespace": "dbGaP",
   "schemas": []
}
