### Demonstration of Schema Registry exploratory implementation

See https://github.com/ianfore/ga4gh-starter-schema-repository for the implementation code.

The implementation is based on the concept of a SchemaProvider interface. This follows a similar pattern to the DataModelSupplier interface implemented in data-connect-trino. 

The implementation uses the following SchemaProviders for different sources.

* GitHubSchemaProvider - serves JSON Schema from a folder in a GitHub repository. Uses the GitHub API to provide versioning and to retrieve schemas.
* FileDataModelSupplier - serves schemas from a folder in a file system available to the server.
* DbGaPSchemaProvider - serves schemas from dbGaP's public ftp site.
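
To illustrate the pattern, here is a minimal Python sketch of a provider interface, one in-memory provider, and a registry that dispatches by namespace. The actual implementation is in the linked repository; the method names (`list_schemas`, `get_schema`) and the `InMemorySchemaProvider` and `SchemaRegistry` classes are assumptions made for illustration, not the repository's API.

```python
from abc import ABC, abstractmethod


class SchemaProvider(ABC):
    """Hypothetical provider interface: one provider per schema source."""

    @abstractmethod
    def list_schemas(self) -> list[str]:
        """Return the names of the schemas this provider can serve."""

    @abstractmethod
    def get_schema(self, schema_name: str) -> dict:
        """Return the schema document for schema_name."""


class InMemorySchemaProvider(SchemaProvider):
    """Illustrative provider backed by a dict rather than GitHub, files or ftp."""

    def __init__(self, schemas: dict[str, dict]):
        self._schemas = dict(schemas)

    def list_schemas(self) -> list[str]:
        return sorted(self._schemas)

    def get_schema(self, schema_name: str) -> dict:
        return self._schemas[schema_name]


class SchemaRegistry:
    """Maps each namespace to the provider that serves it."""

    def __init__(self):
        self._providers: dict[str, SchemaProvider] = {}

    def register(self, namespace: str, provider: SchemaProvider) -> None:
        self._providers[namespace] = provider

    def list_schemas(self, namespace: str) -> list[str]:
        return self._providers[namespace].list_schemas()

    def get_schema(self, namespace: str, schema_name: str) -> dict:
        return self._providers[namespace].get_schema(schema_name)


registry = SchemaRegistry()
registry.register(
    "expt-metadata",
    InMemorySchemaProvider({"sra_scr_icac": {"type": "object"}}),
)
print(registry.list_schemas("expt-metadata"))
```

The point of the design is that the REST layer only talks to the registry, so a new source can be added by writing another provider.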


In [1]:
import json
import xml.etree.ElementTree as ET

import requests

def prettyprint(a_dict):
    print(json.dumps(a_dict, indent=3))

def printline(char="_"):
    print(char*80)

### Docker build and deployment to follow
Once available, this will allow the server to be deployed locally and addressed as follows.

In [21]:
base = "http://localhost:8080"

### Get namespaces

In [22]:
url = f"{base}/namespaces"
response = requests.get(url)
namespaces = response.json()['namespaces']
for namespace in namespaces:
    print( namespace['namespace_name'])

dbGaP
gks-core
dataconnect-demo
expt-metadata
vrs


In [23]:
prettyprint(response.json())

{
   "server": "http://localCatalog",
   "namespaces": [
      {
         "namespace_name": "dbGaP",
         "contact_url": "https://ncbi.nlm.gov"
      },
      {
         "namespace_name": "gks-core",
         "contact_url": "https://github.com/ga4gh/gks-core"
      },
      {
         "namespace_name": "dataconnect-demo",
         "contact_url": "https://localhost"
      },
      {
         "namespace_name": "expt-metadata",
         "contact_url": "https://localhost"
      },
      {
         "namespace_name": "vrs",
         "contact_url": "https://github.com/ga4gh/vrs"
      }
   ]
}
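
Selecting a namespace by its position in the list is brittle if the server changes the order. A small sketch of a lookup by name follows; the helper name `index_namespaces` is hypothetical, and the sample payload is a subset of the response shown above.

```python
def index_namespaces(payload: dict) -> dict:
    """Map each namespace_name in a /namespaces response to its contact_url."""
    return {
        entry["namespace_name"]: entry.get("contact_url", "")
        for entry in payload.get("namespaces", [])
    }


# Subset of the /namespaces response shown above.
sample_payload = {
    "server": "http://localCatalog",
    "namespaces": [
        {"namespace_name": "expt-metadata", "contact_url": "https://localhost"},
        {"namespace_name": "vrs", "contact_url": "https://github.com/ga4gh/vrs"},
    ],
}

contacts = index_namespaces(sample_payload)
print(contacts["vrs"])
```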


In [24]:
namespace = namespaces[3]['namespace_name']
namespace

'expt-metadata'

In [25]:
endpoint = f"/schemas/{namespace}"
print(endpoint)
response = requests.get(f"{base}{endpoint}")
schemas = response.json()['schemas']
for schema in schemas:
    print( schema['schema_name'])

/schemas/expt-metadata
sra_PRJEB10573
sra_PRJEB1985
sra_PRJEB37886
sra_phs001554
sra_scr_icac
sra_scr_udn_v5


In [26]:
prettyprint(response.json())

{
   "namespace": "expt-metadata",
   "schemas": [
      {
         "schema_name": "sra_PRJEB10573",
         "latest_version": "v1",
         "maintainer": [
            "local"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "sra_PRJEB1985",
         "latest_version": "v1",
         "maintainer": [
            "local"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "sra_PRJEB37886",
         "latest_version": "v1",
         "maintainer": [
            "local"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "sra_phs001554",
         "latest_version": "v1",
         "maintainer": [
            "local"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "sra_scr_icac",
         "latest_version": "v1",
         "maintainer": [
            "local"
         ],
         "lifecycle_stage": "released"
      },
      {
     

### List schemas from two namespaces that contain experimental metadata

In [27]:
namespaces = ["dataconnect-demo","expt-metadata"]
for namespace in namespaces:
    printline("=")
    print(f"getting schema for namespace {namespace}")
    endpoint = f"/schemas/{namespace}"
    print(endpoint)
    response = requests.get(f"{base}{endpoint}")
    schemas = response.json()['schemas']
    print("get the schemas")
    printline()
    for schema in schemas:
        print( f"schema:{schema['schema_name']}")
        endpoint = f"/schemas/{namespace}/{schema['schema_name']}/versions/v2"
        print(f"endpoint: {endpoint}")
        print("fetching schema")
        # fetch the schema document from the endpoint built above
        response = requests.get(f"{base}{endpoint}")
        schema_doc = response.json()
        #prettyprint(schema_doc)
        printline()

getting schema for namespace dataconnect-demo
/schemas/dataconnect-demo
get the schemas
________________________________________________________________________________
schema:bigquery_public.covid19_genome_sequence.metadata
endpoint: /schemas/dataconnect-demo/bigquery_public.covid19_genome_sequence.metadata/versions/v2
fetching schema
________________________________________________________________________________
schema:nih_sra_datastore.sra.metadata
endpoint: /schemas/dataconnect-demo/nih_sra_datastore.sra.metadata/versions/v2
fetching schema
________________________________________________________________________________
schema:sra.sra.metadata
endpoint: /schemas/dataconnect-demo/sra.sra.metadata/versions/v2
fetching schema
________________________________________________________________________________
schema:tutorial.phs002409.CAMP_CData
endpoint: /schemas/dataconnect-demo/tutorial.phs002409.CAMP_CData/versions/v2
fetching schema
__________________________________________________
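
The loop above requests version v2 for every schema, while the expt-metadata listing reports v1 as `latest_version`. One option, sketched here on the assumption that the versions endpoint accepts the value reported in `latest_version`, is to build each endpoint from the listing entry itself. The helper name `schema_version_endpoint` is hypothetical.

```python
def schema_version_endpoint(namespace: str, schema_entry: dict) -> str:
    """Build the versions endpoint for one entry of a /schemas/{namespace} listing."""
    name = schema_entry["schema_name"]
    version = schema_entry["latest_version"]
    return f"/schemas/{namespace}/{name}/versions/{version}"


# Two entries copied from the expt-metadata listing shown earlier.
listing = {
    "namespace": "expt-metadata",
    "schemas": [
        {"schema_name": "sra_PRJEB10573", "latest_version": "v1"},
        {"schema_name": "sra_scr_icac", "latest_version": "v1"},
    ],
}

endpoints = [
    schema_version_endpoint(listing["namespace"], entry)
    for entry in listing["schemas"]
]
for endpoint in endpoints:
    print(endpoint)
```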

## dbGaP

### Get schemas for a study

In [28]:
namespace = "dbGaP"
endpoint = f"/schemas/{namespace}?study=phs002921&study_version=v2.p1"
print(endpoint)
response = requests.get(f"{base}{endpoint}")
schemas = response.json()['schemas']
for schema in schemas:
    print( schema['schema_name'])

/schemas/dbGaP?study=phs002921&study_version=v2.p1
phs002921.v2.pht012614.v1.ICAC_Subject_Phenotypes
phs002921.v2.pht012612.v1.ICAC_Subject
phs002921.v2.pht012613.v1.ICAC_Sample
phs002921.v2.pht012615.v1.ICAC_Sample_Attributes


In [29]:
prettyprint(response.json())

{
   "namespace": "dbGaP",
   "schemas": [
      {
         "schema_name": "phs002921.v2.pht012614.v1.ICAC_Subject_Phenotypes",
         "latest_version": "v1",
         "maintainer": [
            "dbGaP"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "phs002921.v2.pht012612.v1.ICAC_Subject",
         "latest_version": "v1",
         "maintainer": [
            "dbGaP"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "phs002921.v2.pht012613.v1.ICAC_Sample",
         "latest_version": "v1",
         "maintainer": [
            "dbGaP"
         ],
         "lifecycle_stage": "released"
      },
      {
         "schema_name": "phs002921.v2.pht012615.v1.ICAC_Sample_Attributes",
         "latest_version": "v1",
         "maintainer": [
            "dbGaP"
         ],
         "lifecycle_stage": "released"
      }
   ]
}
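
The dbGaP schema names above appear to combine the study accession and version, the table accession and version, and the table name. The following sketch splits a name into those parts. The pattern is inferred from the names in this output and is an assumption, not a documented rule of dbGaP or of the registry; `DbGaPSchemaName` and `parse_dbgap_schema_name` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class DbGaPSchemaName(NamedTuple):
    study: str
    study_version: str
    table: str
    table_version: str
    table_name: str


# Assumed layout: phs<digits>.v<n>.pht<digits>.v<n>.<table name>
_NAME_PATTERN = re.compile(r"^(phs\d+)\.(v\d+)\.(pht\d+)\.(v\d+)\.(.+)$")


def parse_dbgap_schema_name(schema_name: str) -> Optional[DbGaPSchemaName]:
    """Return the parts of a dbGaP schema name, or None if it does not match."""
    match = _NAME_PATTERN.match(schema_name)
    return DbGaPSchemaName(*match.groups()) if match else None


parsed = parse_dbgap_schema_name("phs002921.v2.pht012614.v1.ICAC_Subject_Phenotypes")
print(parsed.table, parsed.table_name)
```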


### Get schema as JSON Schema

In [30]:
schema_name = "phs001554.v2.pht007609.v1.GECCO_CRC_Susceptibility_Subject_Phenotypes"
url = f"{base}/schemas/{namespace}/{schema_name}/versions/v2"
response = requests.get(url)

In [33]:
prettyprint(response.json())

{
   "description": "This subject phenotype table contains subject IDs, case control status of the subject for colorectal cancer, sex, age, race, ethnicity, and study acronym.",
   "$id": "dbgap:pht007609.v1",
   "properties": {
      "STUDY": {
         "$id": "dbgap:phv00357188.v1",
         "description": "Study acronym",
         "type": "string",
         "$unit": null,
         "oneOf": [
            {
               "const": "CPS-II",
               "title": "Cancer Prevention Study II"
            },
            {
               "const": "DACHS",
               "title": "Darmkrebs: Chancen der Verhutung durch Screening"
            },
            {
               "const": "HPFS",
               "title": "Health Professionals Follow-up Study"
            },
            {
               "const": "NHS",
               "title": "Nurses Health Study"
            },
            {
               "const": "PLCO",
               "title": "Prostate, Lung, Colorectal and Ovarian Cancer Sc

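In the JSON Schema above, the permitted values of a variable are expressed as a `oneOf` list of `const`/`title` pairs. A short sketch of turning one property into a code-to-label lookup follows. The helper name `permissible_values` is hypothetical, and the sample is limited to the entries that appear complete in the output above.

```python
def permissible_values(prop: dict) -> dict:
    """Map each const in a property's oneOf list to its title."""
    return {
        option["const"]: option.get("title", "")
        for option in prop.get("oneOf", [])
        if "const" in option
    }


# Fragment of the STUDY property from the schema shown above.
study_property = {
    "$id": "dbgap:phv00357188.v1",
    "description": "Study acronym",
    "type": "string",
    "oneOf": [
        {"const": "CPS-II", "title": "Cancer Prevention Study II"},
        {"const": "DACHS", "title": "Darmkrebs: Chancen der Verhutung durch Screening"},
        {"const": "HPFS", "title": "Health Professionals Follow-up Study"},
        {"const": "NHS", "title": "Nurses Health Study"},
    ],
}

codes = permissible_values(study_property)
for code, title in codes.items():
    print(f"{code}: {title}")
```
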
### Get dbGaP schema as XML data dictionary
**Relation to spec:** <span style="color:red">demo endpoint - to be revised to fit spec</span>

This demo uses a separate /dicts endpoint. It is the current implementation of what should likely be achieved through a query parameter on the schema endpoint, such as ?format=json.

It was also discussed that the return type could instead be specified using the standard HTTP Accept request header, which is commonly used for this purpose.
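
A sketch of the header-based alternative: the client requests the same schema URL and names the representation it wants in the `Accept` header. Whether the server honours `Accept` is an assumption here, since the demo server uses the /dicts endpoint instead. The helper name `schema_request` is hypothetical, and the media type is one of those reported by the schematypes endpoint later in this notebook.

```python
def schema_request(base: str, namespace: str, schema_name: str,
                   version: str, media_type: str):
    """Return (url, headers) for a content-negotiated schema request."""
    url = f"{base}/schemas/{namespace}/{schema_name}/versions/{version}"
    headers = {"Accept": media_type}
    return url, headers


url, headers = schema_request(
    "http://localhost:8080",
    "dbGaP",
    "phs001554.v2.pht007609.v1.GECCO_CRC_Susceptibility_Subject_Phenotypes",
    "v2",
    "application/xml+data_dict",
)
print(url)
print(headers)
# To send the request: requests.get(url, headers=headers)
```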

In [34]:
schema_name = "phs001554.v2.pht007609.v1.GECCO_CRC_Susceptibility_Subject_Phenotypes"
url = f"{base}/dicts/{namespace}/{schema_name}/versions/v2"
response = requests.get(url)

In [35]:
tree = ET.fromstring(response.text)

ET.indent(tree, space='   ', level=0)
ET.dump(tree)

<data_table id="pht007609.v1" study_id="phs001554.v2" participant_set="1" date_created="Thu Sep 15 11:39:07 2022">
   <description>This subject phenotype table contains subject IDs, case control status of the subject for colorectal cancer, sex, age, race, ethnicity, and study acronym.</description>
   <variable id="phv00357182.v1">
      <name>SUBJECT_ID</name>
      <description>De-identified subject ID</description>
      <type>string</type>
   </variable>
   <variable id="phv00357183.v1">
      <name>AFFECTION_STATUS</name>
      <description>Case control status of the subject for colorectal cancer</description>
      <type>string</type>
      <value>Case</value>
      <value>Control</value>
   </variable>
   <variable id="phv00357184.v1">
      <name>SEX</name>
      <description>Sex of participant</description>
      <type>string</type>
      <value>Female</value>
      <value>Male</value>
   </variable>
   <variable id="phv00357185.v1">
      <name>AGE</name>
      <description>P

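The XML data dictionary can be flattened into one record per variable with the same `xml.etree.ElementTree` module used above. This is a sketch: the helper name `variables_from_data_dict` is hypothetical, and the sample is an abbreviated copy of the output above.

```python
import xml.etree.ElementTree as ET


def variables_from_data_dict(xml_text: str) -> list:
    """Return one dict per <variable> element of a data dictionary."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": variable.get("id"),
            "name": variable.findtext("name"),
            "description": variable.findtext("description"),
            "type": variable.findtext("type"),
            # <value> elements list the permitted values, if any
            "values": [value.text for value in variable.findall("value")],
        }
        for variable in root.findall("variable")
    ]


sample_xml = """
<data_table id="pht007609.v1" study_id="phs001554.v2">
   <variable id="phv00357182.v1">
      <name>SUBJECT_ID</name>
      <description>De-identified subject ID</description>
      <type>string</type>
   </variable>
   <variable id="phv00357183.v1">
      <name>AFFECTION_STATUS</name>
      <description>Case control status of the subject for colorectal cancer</description>
      <type>string</type>
      <value>Case</value>
      <value>Control</value>
   </variable>
</data_table>
"""

variables = variables_from_data_dict(sample_xml)
for variable in variables:
    print(variable["name"], variable["values"])
```
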
### Return types available for a namespace

**Relation to spec:** <span style="color:red">additional endpoint</span>

Given the functionality above, being able to obtain a list of the available formats seems a useful function. Is namespace the right level at which to apply this?

#### For the dbGaP namespace

In [36]:
response = requests.get(f"{base}/schematypes/dbGaP")
prettyprint(response.json())

{
   "namespace": "dbGaP",
   "return_types": [
      "application/schema+json",
      "application/xml+data_dict"
   ]
}


#### For the VRS namespace

In [37]:
url = f"{base}/schematypes/vrs"
response = requests.get(url)
prettyprint(response.json())

{
   "namespace": "vrs",
   "return_types": [
      "application/schema+json"
   ]
}
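
A client could use these listings to pick a format before requesting a schema. The sketch below takes the client's formats in order of preference and returns the first one the namespace offers. The helper name `choose_return_type` is hypothetical; the two payloads are copied from the responses above.

```python
def choose_return_type(available: list, preferred: list):
    """Return the first preferred media type that the namespace offers, else None."""
    for media_type in preferred:
        if media_type in available:
            return media_type
    return None


dbgap_types = {
    "namespace": "dbGaP",
    "return_types": ["application/schema+json", "application/xml+data_dict"],
}
vrs_types = {"namespace": "vrs", "return_types": ["application/schema+json"]}

# Client preference: XML data dictionary first, JSON Schema as a fallback.
preference = ["application/xml+data_dict", "application/schema+json"]

dbgap_choice = choose_return_type(dbgap_types["return_types"], preference)
vrs_choice = choose_return_type(vrs_types["return_types"], preference)
print(dbgap_choice)
print(vrs_choice)
```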


### Service Info
**Relation to spec:** <span style="color:red">additional endpoint</span>

This adds the standard service-info endpoint used by GA4GH services.

In [38]:
url = f"{base}/service-info"
response = requests.get(url)
prettyprint(response.json())

{
   "id": "org.ga4gh.schema-registry",
   "name": "Experimental GA4GH Schema Registry",
   "description": "An open source, experimental, community-driven implementation of the GA4GH Schema Registry API specification for the purpose of revising the API specification.",
   "contactUrl": "mailto:info@ga4gh.org",
   "documentationUrl": "https://github.com/ga4gh/schema-registry",
   "createdAt": "2025-03-19T12:00:00Z",
   "updatedAt": "2025-03-19T12:00:00Z",
   "environment": "test",
   "version": "0.0.1",
   "type": {
      "group": "org.ga4gh",
      "artifact": "schema-registry",
      "version": "0.0.1.experimental"
   },
   "organization": {
      "name": "Global Alliance for Genomics and Health",
      "url": "https://ga4gh.org"
   }
}
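
A client can sanity-check a service-info response before relying on it. The sketch below reports which expected fields are absent. The field list reflects my reading of the required fields in the GA4GH Service Info specification and should be checked against the spec; the helper name `missing_service_info_fields` is hypothetical.

```python
# Assumed required fields: top-level keys, plus keys nested under type and organization.
REQUIRED_TOP_LEVEL = ["id", "name", "type", "organization", "version"]
REQUIRED_NESTED = {
    "type": ["group", "artifact", "version"],
    "organization": ["name", "url"],
}


def missing_service_info_fields(info: dict) -> list:
    """Return dotted names of expected fields that are absent from info."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in info]
    for parent, children in REQUIRED_NESTED.items():
        nested = info.get(parent)
        if isinstance(nested, dict):
            missing += [f"{parent}.{child}" for child in children if child not in nested]
    return missing


# Abbreviated copy of the response shown above.
service_info = {
    "id": "org.ga4gh.schema-registry",
    "name": "Experimental GA4GH Schema Registry",
    "version": "0.0.1",
    "type": {
        "group": "org.ga4gh",
        "artifact": "schema-registry",
        "version": "0.0.1.experimental",
    },
    "organization": {
        "name": "Global Alliance for Genomics and Health",
        "url": "https://ga4gh.org",
    },
}

print(missing_service_info_fields(service_info))
```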
