Managing schema.yaml

Damion Dooley edited this page Jun 20, 2024 · 14 revisions

Building schema.yaml via field and picklist tables

A LinkML schema.yaml file can be maintained directly in a text editor or other editing tool, but this page describes an approach based on team-editable Google Sheets, which makes the schema easier to maintain. It still requires some programmer-level setup. Three files are used to generate schema.yaml:

  • schema_core.yaml, for specifying the necessary parts of a LinkML schema as a whole.
  • schema_slots.tsv, a tab-delimited text file for specifying templates and their fields (LinkML classes and their slots).
  • schema_enums.tsv, a tab-delimited text file for specifying the categorical pick lists that a field might require.

To generate or refresh schema.yaml from the above, run this in the template's directory:

> python3 ../../../script/tabular_to_schema.py

NOTE: to support code generation functionality, LinkML advises a standard naming convention:

  • Class (template) names and Enum list names should be in upper CamelCase, e.g. "MyTemplate".
  • Slot (field) names should be in all-lowercase snake_case, which allows underscores, e.g. "specimen_collector_sample_id".
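
The convention above can be checked mechanically before a schema is generated. The following is a minimal sketch, not part of tabular_to_schema.py: the two regular expressions and the follows_convention helper are assumptions made for illustration, and LinkML's own tooling may apply different rules.

```python
import re

# Assumed patterns for illustration; LinkML's own linter may be stricter or looser.
CLASS_OR_ENUM_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*$")        # e.g. MyTemplate
SLOT_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")       # e.g. specimen_collector_sample_id


def follows_convention(name: str, kind: str) -> bool:
    """Return True if name matches the convention for kind ('class', 'enum' or 'slot')."""
    pattern = SLOT_NAME if kind == "slot" else CLASS_OR_ENUM_NAME
    return bool(pattern.match(name))


print(follows_convention("MyTemplate", "class"))                   # True
print(follows_convention("specimen_collector_sample_id", "slot"))  # True
print(follows_convention("Sample_ID", "slot"))                     # False
```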

Also, although UTF-8 characters are generally acceptable in .tsv and .yaml file content, it helps to normalize any typographic quotes or dashes in column headers or field text to the basic hyphen (-) and straight quote (") characters.
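
One way to do this normalization is a small character translation pass over the text before it is saved. This is a sketch only: the list of characters is illustrative rather than exhaustive, mapping curly single quotes to a straight apostrophe is an added assumption, and normalize_punctuation is a hypothetical helper name.

```python
# Map common typographic characters to ASCII equivalents (illustrative, not exhaustive).
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def normalize_punctuation(text: str) -> str:
    """Replace curly quotes and long dashes with straight quotes and hyphens."""
    return text.translate(PUNCTUATION_MAP)


print(normalize_punctuation("\u201cCOVID\u201319\u201d"))  # "COVID-19"
```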

schema_core.yaml specification:

As mentioned in the template intro, the schema_core.yaml contains a list of all possible templates for a folder, and supporting information. It populates schema.yaml's top level entities, including:

  • A generic resolvable URI for the schema.
  • A name and description for the schema.
  • A list of one or more languages the schema has translations in.
  • An imports section, which in the example below includes the LinkML built-in data types such as decimal and date.
  • A list of prefixes that may occur in ontology or other term IRI references.
  • Objects containing dictionaries of classes (templates), slots (fields), enumerations (pick lists), types (datatypes), and settings (search & replace key / values).

From the example schema_core.yaml below, the "CanCOGeN Covid-19" schema will be built, with one "CanCOGeN Covid-19" class (template) which DataHarmonizer will show in its menu system.

    id: https://example.com/CanCOGeN_Covid-19
    name: CanCOGeN_Covid-19
    description: ""
    version: "1.0.0"
    in_language:
      - en
    imports:
      - "linkml:types"
    prefixes:
      linkml: "https://w3id.org/linkml/"
      GENEPIO: "http://purl.obolibrary.org/obo/GENEPIO_"
    classes:
      dh_interface:
        name: dh_interface
        description: "A DataHarmonizer interface"
        from_schema: https://example.com/CanCOGeN_Covid-19
      "CanCOGeN Covid-19":
        name: "CanCOGeN Covid-19"
        description: Canadian specification for Covid-19 clinical virus biosample data gathering
        is_a: dh_interface
    slots: {}
    enums: {}
    types:
      WhitespaceMinimizedString:
        name: "WhitespaceMinimizedString"
        typeof: string
        description: "A string that has all whitespace trimmed off of beginning and end, and all internal whitespace segments reduced to single spaces. Whitespace includes #x9 (tab), #xA (linefeed), and #xD (carriage return)."
        base: str
        uri: xsd:token
      Provenance:
        name: "Provenance"
        typeof: string
        description: "A field containing a DataHarmonizer versioning marker. It is issued by DataHarmonizer when validation is applied to a given row of data."
        base: str
        uri: xsd:token
    settings:
      Title_Case: "(((?<=\\b)[^a-z\\W]\\w*?|[\\W])+)"
      UPPER_CASE: "[A-Z\\W\\d_]*"
      lower_case: "[a-z\\W\\d_]*"

As described below, the "slots: {}" and "enums: {}" dictionaries are filled in by the tabular_to_schema.py script, which processes the content of schema_slots.tsv and schema_enums.tsv. These files can be managed as tabs in a Google spreadsheet. For example, in the viral pathogen data collection standards, the CanCOGeN-slots and CanCOGeN-enums tabs are copied in their entirety into the tab-delimited schema_slots.tsv and schema_enums.tsv files in the /template/canada_covid19/ folder. These are then processed along with schema_core.yaml to create schema.yaml.

IMPORTANT: DataHarmonizer will add each schema class as a template to its menu system if it finds that the class has an "is_a" relationship to the special "dh_interface" class.
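
The rule can be illustrated with a short sketch that lists which classes would qualify as templates. It assumes the schema has already been loaded into a plain dictionary, and it checks only a direct "is_a" link; whether DataHarmonizer also follows indirect inheritance is not stated here. The template_names helper is hypothetical.

```python
def template_names(schema: dict) -> list[str]:
    """Return the names of classes whose is_a points directly at dh_interface."""
    classes = schema.get("classes", {})
    return [
        name
        for name, spec in classes.items()
        if (spec or {}).get("is_a") == "dh_interface"
    ]


# A cut-down version of the schema_core.yaml example, written as a dictionary.
schema = {
    "classes": {
        "dh_interface": {"name": "dh_interface"},
        "CanCOGeN Covid-19": {"name": "CanCOGeN Covid-19", "is_a": "dh_interface"},
    }
}

print(template_names(schema))  # ['CanCOGeN Covid-19']
```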

Template slot (field) specification:

A slot specification lists the slot name, description, range of possible values, mappings, required or recommended status, etc. If the slot offers a menu of choices, that menu is contained in the enums dictionary, specified by schema_enums.tsv.

The first row of the schema_slots.tsv and schema_enums.tsv files contains LinkML slot names or DataHarmonizer-friendly variants, as described below. Ensure that field content in these files has no extra carriage returns or line feeds (or spaces in place of tabs), as these will likely cause errors when the files are read and compiled into schema.yaml. To detect an erroneous line feed, view the .tsv file in a text editor with "word wrap" turned off: if text from one row continues on the next row, a carriage return was copied over from within a spreadsheet cell's text.
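
The same check can be automated by comparing each row's column count with the header's. The sketch below assumes an unquoted tab-delimited file and uses a hypothetical helper name, suspicious_rows; a stray line feed inside a cell shows up as two consecutive short rows.

```python
def suspicious_rows(tsv_text: str) -> list[int]:
    """Return 1-based line numbers whose tab-separated field count differs from the header's."""
    lines = tsv_text.rstrip("\n").split("\n")
    expected = len(lines[0].split("\t"))
    return [
        number
        for number, line in enumerate(lines, start=1)
        if len(line.split("\t")) != expected
    ]


good = "name\ttitle\trange\nsample_id\tSample ID\tstring\n"
bad = "name\ttitle\trange\nsample_id\tSample\nID\tstring\n"  # line feed inside the title cell

print(suspicious_rows(good))  # []
print(suspicious_rows(bad))   # [2, 3]
```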

schema_slots.tsv

| property | description |
|---|---|
| class_name | A semicolon-delimited list of classes (templates) that the current row applies to. (It is reused row by row until a different value is set on a subsequent row.) |
| slot_group | A user-friendly section label that this slot is listed under in the two-row header of the DataHarmonizer user interface. |
| slot_uri | An ontology ID or URI that provides a unique semantic web identifier for this slot. (Was Ontology ID in DH <= v0.15.5) |
| title | A user-friendly label that is displayed in the second row of the spreadsheet column header. |
| name | Optional. May supply the database field name, if different from the title. (In the template code, title is copied into empty name entries.) |
| range | A data type that a slot value validates to, which can be a date, decimal, a picklist menu name of categorical choices, etc. |
| range_2 | An additional data type that a slot value validates to, which can include a semicolon-delimited list of other picklist menus. This and the range field are converted into the "any_of" structure of range specifications. This enables a menu of metadata values like "Missing", "Not Collected" etc. |
| identifier | If true, this field's value should be unique within the column (of tabular data). |
| multivalued | If true, more than one value is allowed for this field. Multiple values are usually delimited by semicolons. |
| required | If true, this field requires a data value. |
| recommended | If true, this field is suggested for data entry but is not required. |
| minimum_value | The minimum value that a decimal or integer can take. Can also hold a date to test against, including the special value "{today}". See todos reference above. |
| maximum_value | The maximum value that a decimal or integer can take. Can also hold a date to test against, including the special value "{today}". See todos reference above. |
| pattern | A regular expression used to validate a field's textual content. Include the ^ and $ start and end of line qualifiers for a full string match. Example of simple email validation: ^\S+@\S+\.\S+$ |
| structured_pattern | A LinkML system for specifying strings containing regular expression pattern names, which are compiled into a pattern. Takes advantage of search and replace operations on names stored in a schema's settings dictionary. |
| description | A helpful description of what the field is about. Available in column help info. |
| comments | Data entry guidance for a field. Available in column help info. |
| examples | An array of values which are displayed as a bulleted list in column help info. |
| EXPORT_... | A list of 0 or more export target columns that provide instructions for mapping to external database fields. Each becomes an export template option. |
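
To make the range and range_2 conversion concrete, the sketch below shows one way the two columns could be combined into a LinkML "any_of" list. It is an illustration under assumptions, not the code in tabular_to_schema.py: build_range_spec is a hypothetical helper, and "NullValueMenu" is a made-up picklist name.

```python
def build_range_spec(range_value: str, range_2_value: str = "") -> dict:
    """Combine range and a semicolon-delimited range_2 into a slot range specification."""
    ranges = [range_value.strip()] if range_value.strip() else []
    ranges += [item.strip() for item in range_2_value.split(";") if item.strip()]
    if not ranges:
        return {}
    if len(ranges) == 1:
        # A single range needs no any_of wrapper.
        return {"range": ranges[0]}
    return {"any_of": [{"range": name} for name in ranges]}


print(build_range_spec("decimal"))
# {'range': 'decimal'}
print(build_range_spec("date", "NullValueMenu"))
# {'any_of': [{'range': 'date'}, {'range': 'NullValueMenu'}]}
```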

The class_name field's ability to list several classes allows two things:

  • Two classes in the same row can share an identical slot specification, which is placed in the schema's "slots" dictionary.
  • A slot (field) and its specification can be listed for one class on one row, and the same slot can be listed for a different class on the next row. The tabular_to_schema.py script works out what the two specifications have in common and stores that in the generic slot definition, while placing the properties unique to each class in that class's slot_usage entry for the slot.
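
The separation of shared and class-specific properties can be sketched as follows. This is a simplified illustration of the idea, not the script's actual logic; split_slot_specs and the class names "TemplateA" and "TemplateB" are hypothetical.

```python
def split_slot_specs(per_class: dict[str, dict]) -> tuple[dict, dict[str, dict]]:
    """Split per-class specifications of one slot into a common part and per-class slot_usage."""
    specs = list(per_class.values())
    # A property is common only if every class has it with the same value.
    common = {
        key: value
        for key, value in specs[0].items()
        if all(key in spec and spec[key] == value for spec in specs)
    }
    slot_usage = {
        class_name: {key: value for key, value in spec.items() if key not in common}
        for class_name, spec in per_class.items()
    }
    return common, slot_usage


per_class = {
    "TemplateA": {"title": "Sample ID", "range": "string", "required": True},
    "TemplateB": {"title": "Sample ID", "range": "string", "required": False},
}
common, slot_usage = split_slot_specs(per_class)
print(common)      # {'title': 'Sample ID', 'range': 'string'}
print(slot_usage)  # {'TemplateA': {'required': True}, 'TemplateB': {'required': False}}
```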

schema_enums.tsv:

Enumerations are flat or hierarchic lists of categorical (nominal) choices a slot can have in its range (value).  

| property | description |
|---|---|
| title | Title of enumeration (menu) of categorical choices. |
| meaning | A CURIE or IRI of an ontology term that clarifies this item's semantic meaning. |
| menu_1 | An enumeration item's label. This is at the top level of a hierarchic menu. |
| menu_2 | A 2nd tier menu item label. |
| menu_3 | A 3rd tier menu item label. |
| menu_4 | A 4th tier menu item label. |
| menu_5 | A 5th tier menu item label. |
| description | A description of a menu choice. |
| EXPORT_... | A possible export database field and value that this choice can be transformed to. These get transformed into an enumeration item's exact_mappings. |
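
The menu_1 to menu_5 columns encode hierarchy by position: the column a label sits in gives its depth. A minimal sketch of turning such rows into a nested set of choices follows. It assumes that each row carries one label and that a child's parent is recorded with an "is_a" key on the choice; both are assumptions for illustration, and build_permissible_values and the sample labels are hypothetical.

```python
def build_permissible_values(rows: list[dict]) -> dict:
    """Build a dictionary of menu choices, linking each to the nearest label one tier up."""
    values = {}
    path = []  # path[i] holds the most recent label seen at tier i + 1
    for row in rows:
        for depth in range(5):
            label = (row.get(f"menu_{depth + 1}") or "").strip()
            if not label:
                continue
            if depth > len(path):
                raise ValueError(f"{label!r} skips a menu tier")
            entry = {"text": label}
            if row.get("meaning"):
                entry["meaning"] = row["meaning"]
            if depth > 0:
                entry["is_a"] = path[depth - 1]
            path = path[:depth] + [label]
            values[label] = entry
            break  # one label per row
    return values


rows = [
    {"menu_1": "Animal"},
    {"menu_2": "Bird"},
    {"menu_3": "Chicken"},
    {"menu_2": "Mammal"},
    {"menu_1": "Environment"},
]
menu = build_permissible_values(rows)
print(menu["Chicken"])  # {'text': 'Chicken', 'is_a': 'Bird'}
print(menu["Mammal"])   # {'text': 'Mammal', 'is_a': 'Animal'}
```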

In the future, DataHarmonizer will work with LinkML to extend the functionality of this so that menus can be compiled by dynamically fetching branches of ontologies during schema generation.

EXPORT_ fields

If this functionality is used, DataHarmonizer's "Export To ..." menu lists one or more data formats that a template's current spreadsheet data can be exported to.

If an EXPORT_XYZ column appears in the header row of the schema_slots.tsv file, the XYZ part becomes an export format option under the DataHarmonizer "Export To ..." menu, and the values in this column are added to the respective slot's exact_mappings dictionary to determine which target columns the source template column values end up in.

If an EXPORT_XYZ column appears in the header row of the schema_enums.tsv file, then an enumeration item choice gets an exact_mappings entry prefixed by "XYZ:" pointing to the target database and field.
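
The column-to-mapping step can be sketched as below. The "XYZ:" prefix is described above for enumeration items; applying the same prefixed form to every row is an assumption made here for illustration, as are the export_mappings helper and the sample column values.

```python
def export_mappings(row: dict) -> list[str]:
    """Collect exact_mappings entries from the EXPORT_ columns of one table row."""
    mappings = []
    for column, value in row.items():
        if column.startswith("EXPORT_") and value and value.strip():
            prefix = column[len("EXPORT_"):]  # "EXPORT_XYZ" -> "XYZ"
            mappings.append(f"{prefix}:{value.strip()}")
    return mappings


row = {"title": "Host", "EXPORT_XYZ": "host_name", "EXPORT_ABC": ""}
print(export_mappings(row))  # ['XYZ:host_name']
```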

Two additional features enable many common transform tasks:

  • A semicolon (";") in an EXPORT_ field value sends the source template's field value to each of the export fields separated by the semicolon. If an export field target is mentioned on multiple field specification rows of the template, then the values of the source fields, if any, are concatenated into the export target field, in order (with delimiters kept as placeholders for any empty component values).
  • In addition, if a targeted export field is specified as a key:value pair, i.e. in "[export field name]:[string]" format, then the input field value is transformed to the given export string value. This allows conversion of values from source to export, and is pertinent to selection list choices that vary across systems but are semantically equivalent.
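
The two rules above can be sketched together in a few lines. This is an illustration under stated assumptions, not DataHarmonizer's export code: export_row is a hypothetical helper, the concatenation delimiter is a guessed parameter, the field names are invented, and the sketch assumes that a fixed "[string]" replaces only non-empty source values.

```python
def export_row(source_row: dict, slot_targets: dict, delimiter: str = ";") -> dict:
    """Route source field values to export fields according to EXPORT_ column values."""
    collected: dict[str, list[str]] = {}
    for field, targets in slot_targets.items():
        value = source_row.get(field, "")
        for target in targets.split(";"):
            target = target.strip()
            if not target:
                continue
            # "[export field name]:[string]" swaps the source value for a fixed string.
            name, separator, fixed = target.partition(":")
            output = fixed if (separator and value) else value
            collected.setdefault(name, []).append(output)
    # Empty components are kept so that positions stay aligned.
    return {name: delimiter.join(parts) for name, parts in collected.items()}


source_row = {"country": "Canada", "province": "", "city": "Vancouver", "host": "Human"}
slot_targets = {
    "country": "location",
    "province": "location",
    "city": "location;locality",
    "host": "host_species:Homo sapiens",
}
print(export_row(source_row, slot_targets))
# {'location': 'Canada;;Vancouver', 'locality': 'Vancouver', 'host_species': 'Homo sapiens'}
```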

There are inevitably export data transformation cases that the above functionality can't handle. For these, custom code in an export.js file is required.

Multilingual functionality

Coming soon! A preparatory step is already possible, however: language variants of a schema can be generated by listing more than one language in the schema's in_language attribute, as shown in the schema_core.yaml example above. Translated text is held directly in the schema_slots.tsv and schema_enums.tsv files: textual fields like title, description, comments, guidance, and examples are accompanied by the same fields in other languages, with the IETF language code appended to them, e.g. title_fr, description_fr, comments_fr, guidance_fr and examples_fr.
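
As a rough illustration of the column naming scheme, the sketch below groups a row's translatable fields by language code. Since the feature is still in preparation, this is only a guess at how such columns might be read: split_translations is a hypothetical helper, and treating unsuffixed columns as English is an assumption.

```python
import re

TRANSLATABLE = ("title", "description", "comments", "guidance", "examples")
# Matches e.g. "title_fr"; assumes short lowercase language codes.
LANGUAGE_COLUMN = re.compile(r"^(%s)_([a-z]{2,3})$" % "|".join(TRANSLATABLE))


def split_translations(row: dict, default_language: str = "en") -> dict:
    """Group translatable field values by language code; other columns are ignored."""
    by_language = {default_language: {}}
    for column, value in row.items():
        match = LANGUAGE_COLUMN.match(column)
        if match:
            field, language = match.groups()
            by_language.setdefault(language, {})[field] = value
        elif column in TRANSLATABLE:
            by_language[default_language][column] = value
    return by_language


row = {"title": "Host", "title_fr": "Hôte", "range": "string"}
print(split_translations(row))
# {'en': {'title': 'Host'}, 'fr': {'title': 'Hôte'}}
```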