# GREGoR JSONSchema to LinkML

Author: [Sierra Moxon](https://github.com/sierra-moxon)

This notebook attempts to convert the existing "raw" version of the GREGoR JSONSchema to LinkML YAML syntax.  

### Relevant LinkML tooling

* **schema-automator**: https://linkml.io/schema-automator/ (transform JSONSchema or TSV schema definitions to LinkML)

* **linkml-map**: https://linkml.io/linkml-map/ (map one version of a schema to another in a computable fashion)

* **linkml-convert**: https://linkml.io/linkml/data/csvs.html (convert data instances that conform to a given LinkML schema between serialization formats, e.g. dump the data as JSON or TSV. This is essentially a wrapper around the LinkML loader/dumper functionality: https://linkml.io/linkml/developers/loaders-and-dumpers.html)

* **SchemaBuilder**: https://linkml.io/linkml/developers/schemabuilder.html (helps build up a LinkML SchemaDefinition programmatically)

* **SchemaView**: https://linkml.io/linkml/developers/schemaview.html#schemaview (introspect elements of a LinkML Schema)

### GREGoR source schema

* GREGoR data model spreadsheets: https://docs.google.com/spreadsheets/d/1p_0nhKMvKBueSrUAQMCe9cHv16WyhKSX_jnxNCuGFWg/edit?gid=431973559#gid=431973559

* GREGoR JSONSchema: https://raw.githubusercontent.com/UW-GAC/gregor_data_models/refs/heads/main/GREGoR_data_model.json
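
The conversion code in this notebook reads only a handful of keys from the GREGoR JSON: a top-level `tables` list, a `table` name and `columns` list per table, and per column the keys `column`, `data_type`, `description`, `required`, `primary_key`, `multi_value_delimiter` and `enumerations`. The sketch below uses a small invented sample (not the real model) to show that shape and summarise it. The sample values and the `summarize_model` helper are hypothetical.

```python
import json

# Hypothetical, heavily trimmed sample in the shape the converter expects.
# Key names come from the conversion code in this notebook; the values are invented.
SAMPLE_MODEL = json.loads("""
{
  "tables": [
    {
      "table": "participant",
      "columns": [
        {"column": "participant_id", "data_type": "string", "required": true, "primary_key": true},
        {"column": "sex", "data_type": "enumeration", "enumerations": ["Female", "Male", "Unknown"]}
      ]
    },
    {
      "table": "family",
      "columns": [
        {"column": "family_id", "data_type": "string", "primary_key": true}
      ]
    }
  ]
}
""")


def summarize_model(model):
    """Return {table name: [(column name, data type), ...]} for a quick overview."""
    summary = {}
    for table in model["tables"]:
        summary[table["table"]] = [
            (col.get("column"), col.get("data_type")) for col in table["columns"]
        ]
    return summary


for table_name, columns in summarize_model(SAMPLE_MODEL).items():
    print(table_name, columns)
```

The same helper can be pointed at the downloaded model once it has been loaded with `json.load`.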

In [None]:
!pip install linkml_runtime
!pip install linkml
!pip install schema-automator



In [None]:
import json
import yaml
import requests
from pprint import pprint
from linkml_runtime.linkml_model.meta import SchemaDefinition
from linkml_runtime.linkml_model.meta import ClassDefinition, SlotDefinition, EnumDefinition, PermissibleValue
from linkml.utils.schema_builder import SchemaBuilder
from linkml_runtime.utils.schemaview import SchemaView
from linkml_runtime.utils.formatutils import camelcase, underscore
from google.colab import files

In [None]:
def download_file_from_github(url, local_filename="GREGoR_data_model.json"):
    """
    Downloads a file from a GitHub URL and saves it to the local Colab environment.

    Args:
        url (str): The URL of the raw file on GitHub.
        local_filename (str): The name to save the file as in the local environment.

    Returns:
        str: Path to the downloaded file.
    """
    response = requests.get(url)
    response.raise_for_status()  # Ensure the request was successful

    # Save the file locally
    with open(local_filename, "w") as file:
        file.write(response.text)

    print(f"File downloaded and saved as {local_filename}")
    return local_filename

# Example usage
url = "https://raw.githubusercontent.com/UW-GAC/gregor_data_models/refs/heads/main/GREGoR_data_model.json"
local_file_path = download_file_from_github(url)

# The file is now accessible in the Colab environment
print(f"Local file path: {local_file_path}")

File downloaded and saved as GREGoR_data_model.json
Local file path: GREGoR_data_model.json


In [None]:
def load_json_from_file(file_path):
    with open(file_path, 'r') as file:
        json_data = json.load(file)
    return json_data

In [None]:
json_schema = load_json_from_file(local_file_path)

In [None]:
# Function to build enumeration definitions
def build_enum_definition(column, slot_name, sb):
    """
    Builds an enumeration definition and adds it to the schema builder.

    Args:
        column (dict): A dictionary containing column details.
        slot_name (str): The name of the slot associated with the enumeration.
        sb (SchemaBuilder): The schema builder instance.

    Returns:
        str: The name of the created enumeration.
    """
    enum_name = camelcase(slot_name)+"Enum"
    ed = EnumDefinition(name=enum_name)

    # Add permissible values
    for value in column.get("enumerations", []):
        if value:  # Check if the value is not empty or None
            ed.permissible_values[value] = PermissibleValue(text=value)

    # Add the enumeration to the schema builder
    sb.add_enum(name=enum_name, enum_def=ed)

    return enum_name
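
To make the naming and filtering in `build_enum_definition` easier to see without installing LinkML, here is a dependency-free sketch that returns roughly the plain-dict form of the enum. `to_camel_case` is a hypothetical stand-in that only approximates `camelcase` from `linkml_runtime` for simple snake_case names, and the sample column is invented.

```python
def to_camel_case(name):
    """Hypothetical stand-in for linkml_runtime's camelcase(), for snake_case input only."""
    return "".join(part.capitalize() for part in name.split("_") if part)


def sketch_enum(column, slot_name):
    """Return (enum name, plain-dict enum body), skipping empty or None values."""
    enum_name = to_camel_case(slot_name) + "Enum"
    values = [v for v in column.get("enumerations", []) if v]
    return enum_name, {"permissible_values": {v: {"text": v} for v in values}}


# Invented column: the empty string and None are dropped, mirroring the `if value:` check.
column = {"column": "affected_status", "enumerations": ["Affected", "", None, "Unaffected"]}
name, body = sketch_enum(column, "affected_status")
print(name)
print(list(body["permissible_values"]))
```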


In [None]:
# Convert JSON to LinkML schema (YAML format)
def convert_to_linkml(json_model):
    """
    Converts a GREGoR-formatted JSON model into a LinkML schema dictionary (ready to dump as YAML).

    Args:
        json_model (dict): The input JSON model.

    Returns:
        dict: The LinkML schema as a dictionary.
    """
    sb = SchemaBuilder("GREGoRLinkMLExampleConversionSchema")
    sb.add_defaults()
    unique_slots = set()
    unique_enums = set()

    for table in json_model["tables"]:
        class_name = camelcase(table["table"])
        cd = ClassDefinition(name=class_name)

        for column in table["columns"]:
            # Check if 'column' key exists before creating SlotDefinition
            if "column" in column:
                if column["column"] in unique_slots:
                    cd.slots.append(underscore(column["column"]))
                    continue
                unique_slots.add(column["column"])
                sd = SlotDefinition(
                    name=underscore(column["column"])
                )

                # Set additional properties
                if column.get("data_type"):
                    sd.range = column["data_type"]
                if column.get("description"):
                    sd.description = column["description"]
                if column.get("required"):
                    sd.required = True
                if column.get("primary_key"):
                    sd.identifier = True
                if column.get("multi_value_delimiter"):
                    sd.multivalued = True
                if column.get("data_type") == "enumeration":
                    if column["column"] in unique_enums:
                        # Reuse the enum built earlier; its name follows build_enum_definition
                        sd.range = camelcase(sd.name) + "Enum"
                    else:
                        unique_enums.add(column["column"])
                        enum_name = build_enum_definition(column, sd.name, sb)
                        sd.range = enum_name
                sb.add_slot(sd)
                cd.slots.append(sd.name)
            else:
                print(f"Warning: Column definition missing 'column' key: {column}")

        sb.add_class(cd)

    # Convert the schema to a dictionary
    schema = sb.as_dict()
    return schema
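
`convert_to_linkml` copies each column's `data_type` straight into the slot's `range`, which only yields a valid schema if every GREGoR type name is also a LinkML type name (or is replaced, as `enumeration` is). One way to check that assumption is to collect the distinct type names first. In this sketch, `KNOWN_LINKML_TYPES` is a deliberately partial list of LinkML built-in types, `unmapped_data_types` is a hypothetical helper, and the sample model is invented.

```python
# Partial list of LinkML built-in type names (an assumption for this sketch, not exhaustive).
KNOWN_LINKML_TYPES = {
    "string", "integer", "boolean", "float", "double", "decimal",
    "date", "datetime", "time", "uri", "uriorcurie",
}


def unmapped_data_types(model, known_types=KNOWN_LINKML_TYPES):
    """Return the sorted data_type values that are neither known types nor 'enumeration'."""
    found = set()
    for table in model["tables"]:
        for column in table["columns"]:
            data_type = column.get("data_type")
            if data_type and data_type != "enumeration":
                found.add(data_type)
    return sorted(found - set(known_types))


# Invented sample: "timestamp" is not in the partial list above, so it is reported.
sample = {"tables": [{"table": "t", "columns": [
    {"column": "a", "data_type": "string"},
    {"column": "b", "data_type": "enumeration"},
    {"column": "c", "data_type": "timestamp"},
]}]}
print(unmapped_data_types(sample))
```

Calling `unmapped_data_types(json_model)` on the downloaded model would list any names that may need an explicit mapping before they are used as a `range`.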

In [None]:
def save_to_yaml_file(data, file_path):
    linkml_yaml = yaml.dump(data, sort_keys=False, default_flow_style=False)
    # Cosmetic: write empty values as "key:" rather than "key: null"
    linkml_yaml = linkml_yaml.replace(': null', ':')
    with open(file_path, 'w') as file:
        file.write(linkml_yaml)
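
The string replacement in `save_to_yaml_file` also rewrites any `: null` that happens to occur inside a description or other text value. One alternative, sketched here and not what the notebook does, is to have PyYAML emit an empty scalar for `None` through a custom representer. The `BlankNoneDumper` name and the sample data are made up for illustration, and whether the dictionary returned by `sb.as_dict()` is fully compatible with `SafeDumper` is not checked here.

```python
import yaml


class BlankNoneDumper(yaml.SafeDumper):
    """SafeDumper subclass that writes None as an empty scalar instead of 'null'."""


def _represent_none(dumper, _value):
    # An empty plain scalar with the null tag is read back as None by YAML loaders.
    return dumper.represent_scalar("tag:yaml.org,2002:null", "")


BlankNoneDumper.add_representer(type(None), _represent_none)

# Invented data: one empty value, plus a text value that itself contains ": null".
data = {"slots": {"family_id": None}, "description": "use 'x: null' when unknown"}
text = yaml.dump(data, Dumper=BlankNoneDumper, sort_keys=False, default_flow_style=False)
print(text)
```

Because the representer works on values rather than on the serialized text, the description string is left untouched while empty values are still written without `null`.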

In [None]:
# Example JSON model URL
url = "https://raw.githubusercontent.com/UW-GAC/gregor_data_models/refs/heads/main/GREGoR_data_model.json"
response = requests.get(url)
response.raise_for_status()  # Ensure successful request
json_model = response.json()

# Convert to LinkML schema
schema = convert_to_linkml(json_model)

# Save the schema as YAML and download it from Colab
yaml_file_path = 'GREGoR_linkml_data_model.yaml'
save_to_yaml_file(schema, yaml_file_path)
files.download(yaml_file_path)


In [None]:
sv = SchemaView(yaml_file_path)

print(sv.get_class("Participant"))
print(sv.get_slot("internal_project_id"))
print(sv.get_slot("internal_project_id").range)

for e in sv.all_enums():
    print(sv.get_enum(e))



ClassDefinition({
  'name': 'Participant',
  'from_schema': 'http://example.org/GREGoRLinkMLExampleConversionSchema',
  'slots': ['participant_id', 'internal_project_id', 'gregor_center', 'consent_code',
    'recontactable', 'prior_testing', 'pmid_id', 'family_id', 'paternal_id',
    'maternal_id', 'twin_id', 'proband_relationship', 'proband_relationship_detail',
    'sex', 'sex_detail', 'reported_race', 'reported_ethnicity', 'ancestry_detail',
    'age_at_last_observation', 'affected_status', 'phenotype_description',
    'age_at_enrollment', 'solve_status', 'missing_variant_case',
    'missing_variant_details']
})
SlotDefinition({
  'name': 'internal_project_id',
  'description': ('An identifier used by GREGoR research centers to identify a set of '
     'participants for their internal tracking'),
  'from_schema': 'http://example.org/GREGoRLinkMLExampleConversionSchema',
  'range': 'string',
  'multivalued': True
})
string
EnumDefinition({
  'name': 'GregorCenterEnum',
  'from_schema