New command-line dh-validator.py tool for validating csv, tsv, xls, xlsx data files against a schema.yaml file #445

Open
ddooley opened this issue Sep 6, 2024 · 0 comments
ddooley commented Sep 6, 2024

A new command-line dh-validate.py script simplifies validation of DataHarmonizer-generated csv, tsv, xls, and xlsx files. We look forward to feedback on using it; please comment below.

The linkml-validate command works well for data in .json or .yaml format, but tabular csv, tsv, xls, and xlsx input often fails to validate, for two main reasons. dh-validator.py resolves both by generating a temporary .yaml version of the tabular input, adjusted according to the given schema, and then sending that file to linkml-validate for processing. The following adjustments are made:

  • Column labels in DataHarmonizer data files are usually the slot/field titles rather than the LinkML standard (codewriter compatible) slot names. The script maps each column label to the appropriate slot name in the temporary .yaml file.
  • Multivalued slots/fields in such data files (from multi-select menus, or a combination of menus and a string or other input element) have their values converted into an array of values in the temporary .yaml file. Both the semicolon and vertical bar delimiters (";" and "|") are recognized.
  • One "gotcha" that takes some explaining: dh-validate.py requires that picklist enumerations (enums) in the given schema are named according to LinkML standard naming practice. linkml-validate renames any schema slot and enumeration names that don't follow LinkML standard naming into its own version of standard names. We have added a conversion to ensure that the temporary .yaml file contains a linkml-validate compatible rename of each field. However, if a field mentions an enum in its range, linkml-validate also renames that enum reference into standard form, but it does not rename the enum everywhere it occurs in the schema itself. As a result, linkml-validate fails with a long error beginning: "jsonschema.exceptions._WrappedReferencingError: PointerToNowhere: '/$defs/GeoLocName(state/province/territory)Menu' does not exist within {'$schema' ... " Since we don't want to revise the given schema.yaml, we have to insist that the schema's enums are already standard-named.
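
The first two adjustments can be illustrated with a minimal Python sketch. This is not the dh-validate.py code itself: the `SLOTS` dictionary is a hand-written, hypothetical stand-in for the slot definitions of a schema.yaml, `convert_row` is an illustrative helper name, and skipping empty cells is an assumption made for the example.

```python
import re

# Hypothetical stand-in for the slot definitions found in a schema.yaml.
SLOTS = {
    "sample_id": {"title": "Sample ID"},
    "geo_loc_name": {"title": "geo_loc_name (state/province/territory)"},
    "symptoms": {"title": "Symptoms", "multivalued": True},
}

# Multivalued cells are split on either ";" or "|".
DELIMITERS = re.compile(r"[;|]")


def convert_row(row, slots):
    """Map column titles to slot names and split multivalued cells into lists."""
    title_to_name = {slot.get("title", name): name for name, slot in slots.items()}
    record = {}
    for column, value in row.items():
        # Fall back to the column label itself if it is not a known title.
        name = title_to_name.get(column, column)
        if value is None or value.strip() == "":
            continue  # assumption: empty cells are left out of the record
        if slots.get(name, {}).get("multivalued"):
            parts = (part.strip() for part in DELIMITERS.split(value))
            record[name] = [part for part in parts if part]
        else:
            record[name] = value.strip()
    return record
```

A row such as `{"Sample ID": "S1", "Symptoms": "fever; cough|chills"}` would come out keyed by `sample_id` and `symptoms`, with the latter holding a three-item list, which is the shape a .yaml record needs for a multivalued slot.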

We will be evolving this script to report any mismatched columns/fields, which would help, for example, when validating older tabular data against a newer LinkML schema version.
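
As a rough idea of what such a report could contain, here is a small sketch. The feature is not implemented yet, so everything here is hypothetical: the function name `report_column_mismatches`, the shape of the `slots` dictionary (slot name mapped to a definition with an optional `title`), and the two report keys are all assumptions for illustration.

```python
def report_column_mismatches(columns, slots):
    """Compare data file column labels with schema slots (by name or title)."""
    known_labels = set()
    for name, slot in slots.items():
        known_labels.add(name)
        known_labels.add(slot.get("title", name))

    present = set(columns)
    # Columns in the data file that match no slot name or title.
    unknown_columns = [c for c in columns if c not in known_labels]
    # Schema slots that appear in the data file under neither name nor title.
    missing_slots = [
        name
        for name, slot in slots.items()
        if name not in present and slot.get("title", name) not in present
    ]
    return {"unknown_columns": unknown_columns, "missing_slots": missing_slots}
```

Run against the header row of an older data file, the `unknown_columns` list would point at fields that were renamed or dropped in the newer schema, and `missing_slots` at fields added since.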

@ddooley ddooley self-assigned this Sep 6, 2024
@ddooley ddooley added the documentation Improvements or additions to documentation label Sep 6, 2024