[WIP] Markdown schema description for json metadata files. #5

Open: wants to merge 1 commit into base: main

@uermel uermel commented Feb 6, 2024

This is a first pass at a markdown version of the data portal json metadata schema. It is a direct translation of the google doc. It does contain several errors, which I'll point out in comments below.

This should not be merged in its current state because it is erroneous.

@uermel uermel requested a review from jgadling February 6, 2024 17:42
| Dataset identifier | dataset_identifier | N/A | | STRING | -- | -- | MUST | length 6-64, no spaces, special char. | An identifier for a CryoET dataset, assigned by the Data Portal. Used to identify the dataset as the directory name in data tree |
| Dataset title | dataset_title | N/A | | STRING | -- | -- | MUST | N/A | Title of a CryoET dataset. A good title is concise and descriptive, e.g. “S. pombe cryo-FIB lamellae acquired with defocus-only (DEF), S. pombe cryo-FIB lamellae acquired with Volta Phase Plate (VPP)” |
| Dataset Description | dataset_description | N/A | | STRING | -- | -- | MUST | N/A | A short description of a CryoET dataset, similar to an abstract for a journal article or dataset |
| Dataset authors | authors | ORCID | ORCID | STRING | -- | -- | RECOMMENDED | ORCID | A unique, persistent identifier for researchers, provided by ORCID |
Contributor Author

'ORCID' is the only metadata key that is capitalized. Should probably be lower case in the future.

Comment on lines +33 to +40
| Dataset authors | authors | ORCID | ORCID | STRING | -- | -- | RECOMMENDED | ORCID | A unique, persistent identifier for researchers, provided by ORCID |
| | | Full name | full_name | STRING | -- | -- | MUST | N/A | Full name of a dataset author (e.g. Jane Doe). Since not everyone has a ‘first’ or ‘last’ name, we chose to use full names instead |
| | | Corresponding author status | corresponding_author_status | BOOLEAN | -- | -- | OPTIONAL | N/A | Indicating whether an author is the corresponding author (YES or NO) |
| | | Email | email | STRING | -- | -- | MUST or OPTIONAL<sup>**[1](#author_email)**</sup> | N/A | Email address for each author |
| | | Affiliation name | affiliation_name | STRING | -- | -- | RECOMMENDED | ROR | Name of the institution an author is affiliated with. Sometimes, one author may have multiple affiliations |
| | | Affiliation address | affiliation_address | STRING | -- | -- | OPTIONAL | N/A | Address of the institution an author is affiliated with |
| | | Affiliation identifier | affiliation_identifier | STRING | -- | -- | RECOMMENDED | ROR | A unique identifier assigned to the affiliated institution by The Research Organization Registry (ROR) |
| | | Author list order | order | INTEGER | -- | -- | MUST | N/A | The order in which the author appears in the publication |
Contributor Author

The entire Author section is actually not a dictionary, but a list of Dictionaries of type "DatasetAuthor". We should reflect this in the documentation.
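
To make the list-of-dictionaries point concrete, here is a minimal sketch of how the `authors` value might look in a dataset metadata file. The key names are taken from the table rows in this diff; every value (names, email, ORCID) is a placeholder, and the real files may differ.

```python
import json

# Hypothetical excerpt of a dataset metadata file. Key names follow the table
# rows in this diff; every value is a placeholder.
raw = """
{
  "authors": [
    {
      "full_name": "Jane Doe",
      "ORCID": "0000-0000-0000-0000",
      "corresponding_author_status": true,
      "email": "jane.doe@example.org"
    },
    {
      "full_name": "John Roe",
      "corresponding_author_status": false
    }
  ]
}
"""

metadata = json.loads(raw)
authors = metadata["authors"]

# The value is a list, and each entry is one author dictionary.
assert isinstance(authors, list)
assert all(isinstance(author, dict) for author in authors)

names = [author["full_name"] for author in authors]
print(names)
```

Documenting the section as "list of DatasetAuthor" would tell readers to iterate over entries rather than look up author keys directly under `authors`.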

Comment on lines +41 to +42
| Funding | funding | Funding agency name | funding_agency_name | STRING | -- | -- | RECOMMENDED | ROR or Crossref Funder Registry | Name of the funding agency. There could be multiple funding agencies |
| | | Grant ID | grant_id | STRING | -- | -- | RECOMMENDED | N/A | Grant identifier provided by the funding agency |
Contributor Author

The entire Funding section is actually not a dictionary, but a list of Dictionaries of type "DatasetFunding". We should reflect this in the documentation.
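
One way the documentation could express this is with a typed entry plus a check that `funding` is a list. The sketch below assumes the two keys shown in the diff (`funding_agency_name`, `grant_id`); the class name mirrors the "DatasetFunding" type mentioned above, while `check_funding` and the sample values are hypothetical.

```python
from typing import List, TypedDict


class DatasetFunding(TypedDict, total=False):
    """One funding entry. Both keys are RECOMMENDED, so neither is required."""

    funding_agency_name: str
    grant_id: str


def check_funding(value: object) -> List[DatasetFunding]:
    """Return the funding list, or raise if it is not a list of dictionaries."""
    if not isinstance(value, list):
        raise TypeError("funding must be a list, not a single dictionary")
    for entry in value:
        if not isinstance(entry, dict):
            raise TypeError("each funding entry must be a dictionary")
    return value


# Placeholder values, for illustration only.
funding = check_funding([
    {"funding_agency_name": "Example Agency", "grant_id": "EX-0001"},
    {"funding_agency_name": "Another Example Agency"},
])
print(len(funding))
```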

| | | Last modified date | last_modified_date | TIME | -- | -- | OPTIONAL | N/A | Date when a released dataset is last modified |
| Cross references | cross_references | Related database entries | related_database_entries | STRING<sup>**[1](#dataset_list)**</sup> | -- | -- | OPTIONAL | N/A | If a CryoET dataset is also deposited into another database, enter the database identifier here (e.g. EMPIAR-11445). Use a comma to separate multiple identifiers |
| | | Related publication DOIs | dataset_publications | STRING<sup>**[1](#dataset_list)**</sup> | -- | -- | OPTIONAL | DOI | DOIs of publications related to this dataset. |
| | | Related database links | related_database_links | STRING<sup>**[1](#dataset_list)**</sup> | -- | -- | OPTIONAL | DOI | Links to related database entries. |
Contributor Author

@uermel uermel Feb 6, 2024

This actually turned out to be difficult in case of the strain ID due to security concerns, as each host would need to be validated. I think it would be better to create a list of allowable database prefixes in the related_database_entries field and rules for how to build URLs to them.
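
A rough sketch of the allowlist idea: map each permitted database prefix to a URL template and build links only from identifiers that match. The prefixes, the `example.org` templates, and the function names are assumptions for illustration, not the portal's actual rules; only the identifier `EMPIAR-11445` and the comma-separated format come from the table.

```python
from typing import List

# Hypothetical allowlist: prefix -> URL template. The templates are
# placeholders, not verified links to the real databases.
ALLOWED_DATABASES = {
    "EMPIAR": "https://example.org/empiar/{identifier}",
    "EMD": "https://example.org/emdb/{identifier}",
}


def build_database_url(identifier: str) -> str:
    """Build a URL for an identifier such as 'EMPIAR-11445'."""
    prefix, separator, accession = identifier.partition("-")
    if not separator or not accession.isdigit():
        raise ValueError(f"malformed identifier: {identifier!r}")
    if prefix not in ALLOWED_DATABASES:
        raise ValueError(f"database prefix not allowed: {prefix!r}")
    return ALLOWED_DATABASES[prefix].format(identifier=identifier)


def build_related_links(related_database_entries: str) -> List[str]:
    """Split the comma-separated field and build one URL per identifier."""
    identifiers = [part.strip() for part in related_database_entries.split(",")]
    return [build_database_url(identifier) for identifier in identifiers if identifier]


links = build_related_links("EMPIAR-11445, EMD-0001")
print(links)
```

Because URLs are derived from a fixed set of templates, no user-supplied host ever needs to be validated.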

Comment on lines +50 to +51
| Key Image | key_photos | Snapshot URL | snapshot | STRING | -- | -- | OPTIONAL | N/A | path to the snapshot image in the dataset directory |
| | | Thumbnail URL | thumbnail | STRING | -- | -- | OPTIONAL | N/A | path to the thumbnail image in the dataset directory |
Contributor Author

These are not actually URLs in the JSON files, but relative paths. This should be reflected here, and we should think about including actual URLs.
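
If actual URLs were wanted, they could be derived from the relative paths rather than stored. The sketch below assumes a hypothetical base URL for the dataset directory and placeholder file names; only the `snapshot` and `thumbnail` keys come from the table.

```python
# Hypothetical base location of the dataset directory; not a real endpoint.
DATASET_BASE_URL = "https://example.org/data/10000"

# Placeholder relative paths, as they might appear under 'key_photos'.
key_photos = {
    "snapshot": "Images/snapshot.png",
    "thumbnail": "Images/thumbnail.png",
}


def to_url(base_url: str, relative_path: str) -> str:
    """Join a base URL and a path relative to the dataset directory."""
    if relative_path.startswith("/") or "://" in relative_path:
        raise ValueError(f"expected a relative path, got {relative_path!r}")
    return f"{base_url.rstrip('/')}/{relative_path}"


urls = {key: to_url(DATASET_BASE_URL, path) for key, path in key_photos.items()}
print(urls["snapshot"])
```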

|---------------------|--------------------|--------------------------|----------------|-----------------|--------|----------|-------------|--------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Sample type | sample_type | N/A | N/A | ENUMERATION[^1] | N/A | N/A | MUST | Cell, tissue, organism, Intact organelle, In-vitro mixture of macromolecules or their complex, In-silico synthetic data, Other | Type of samples used in a CryoET study. |
| Organism | organism | Organism name | name | STRING[^2] | N/A | N/A | RECOMMENDED | | Name of the organism from which a biological sample used in a CryoET study is derived from, e.g. homo sapiens (reference) |
| | | Taxonomy ID | taxonomy_id | STRING | N/A | N/A | RECOMMENDED | NCBI taxonomy ID | NCBI taxonomy identifier for the organism, e.g. 9606 (reference) |
Contributor Author

It would be good to use the NCBITaxon ontology instead of bare IDs, for consistency across the database.
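
As an illustration, a bare taxonomy ID such as 9606 could be normalized to an ontology-style identifier like `NCBITaxon:9606`. This is a sketch that assumes the prefixed CURIE form is the one the portal would adopt; the helper name is hypothetical.

```python
def to_ncbitaxon_curie(taxonomy_id) -> str:
    """Convert a bare NCBI taxonomy ID (9606 or "9606") to 'NCBITaxon:9606'."""
    text = str(taxonomy_id).strip()
    prefix = "NCBITaxon:"
    # Accept values that already carry the prefix, so the call is idempotent.
    if text.startswith(prefix):
        text = text[len(prefix):]
    if not text.isdigit():
        raise ValueError(f"not a numeric taxonomy ID: {taxonomy_id!r}")
    return f"{prefix}{text}"


print(to_ncbitaxon_curie(9606))
```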


| Data elements | metadata key 1 | Sub-elements | metadata key 2 | Value type | Unit | Default | Requirement | Controlled vocabularies | Description |
|---------------------|--------------------|--------------------------|----------------|-----------------|--------|----------|-------------|--------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Sample type | sample_type | N/A | N/A | ENUMERATION[^1] | N/A | N/A | MUST | Cell, tissue, organism, Intact organelle, In-vitro mixture of macromolecules or their complex, In-silico synthetic data, Other | Type of samples used in a CryoET study. |
Contributor Author

This ENUM should be clearly defined separately.
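
A separate definition could be as simple as one named enumeration. The sketch below copies the values from the "Controlled vocabularies" column as written, including their mixed capitalization; the member names are made up, and the exact strings used in the files may differ.

```python
from enum import Enum


class SampleType(Enum):
    """Allowed values for 'sample_type', copied from the table as written."""

    CELL = "Cell"
    TISSUE = "tissue"
    ORGANISM = "organism"
    INTACT_ORGANELLE = "Intact organelle"
    IN_VITRO = "In-vitro mixture of macromolecules or their complex"
    IN_SILICO = "In-silico synthetic data"
    OTHER = "Other"


# Lookup by value is case-sensitive and raises ValueError for unknown strings.
sample_type = SampleType("Intact organelle")
print(sample_type.name)
```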

Point files are newline-delimited json files ([ndjson](https://clue.engineering/2018/introducing-reactphp-ndjson)). Each
line contains a nested dictionary.

| Data elements | metadata key 1 | Sub-elements | metadata key 2 | Value type | Unit | Default | Requirement | Controlled vocabularies | Description |
Contributor Author

Is lacking the 'type' field that is actually present in the files.
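
For reference, a minimal sketch of reading such a file and checking for the `type` key. The sample records, the `location` key, and the `"point"` value are placeholders; only the ndjson format and the existence of a `type` field come from this thread.

```python
import json

# Hypothetical point file content: one JSON document per line.
ndjson_text = (
    '{"type": "point", "location": {"x": 1.0, "y": 2.0, "z": 3.0}}\n'
    '{"type": "point", "location": {"x": 4.0, "y": 5.0, "z": 6.0}}\n'
)


def read_ndjson(text: str) -> list:
    """Parse newline-delimited JSON and require a 'type' key on every record."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if "type" not in record:
            raise ValueError(f"line {line_number}: missing 'type' field")
        records.append(record)
    return records


points = read_ndjson(ndjson_text)
print(len(points))
```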

line contains a nested dictionary.

| Data elements | metadata key 1 | Sub-elements | metadata key 2 | Value type | Unit | Default | Requirement | Controlled vocabularies | Description |
|---------------|----------------------|--------------|----------------|-----------------------|:----:|:-------:|---------------|-------------------------|----------------------------------------------------------------------------------|
Contributor Author

Is lacking the 'type' field that is actually present in the files.

Contributor

Can we add that to this spec?

Comment on lines +173 to +176
| Annotation object | ??? | N/A | | STRING | -- | -- | MUST | | Name of the object being annotated (e.g. cytosolic ribosome, nuclear pore complex, actin filament, membrane) |
| Annotation object ID | ??? | N/A | | STRING | -- | -- | MUST | Gene Ontology | Gene Ontology Cellular Component identifier for the annotation object, e.g. http://purl.obolibrary.org/obo/GO_0022626 for ‘cytosolic ribosome’. |
| Annotation object description | ??? | N/A | | TEXT | -- | -- | OPTIONAL | | A textual description of the annotation object, can be a longer description to include additional information not covered by the Annotation object name and state. |
| Annotation object state | ??? | N/A | | STRING | -- | -- | OPTIONAL | | Molecule state annotated (e.g. open, closed) |
Contributor Author

In the google schema these are top level keys, but in the files they are nested.
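
To show the difference between the two layouts, here is a small sketch that moves prefixed top-level keys under one parent dictionary. All key names (`object_name`, `object_id`, `annotation_object`, `other_key`) are hypothetical, since the diff leaves the actual keys as `???`; the example object and its GO identifier are taken from the table.

```python
def nest(flat: dict, parent: str, prefix: str) -> dict:
    """Move keys that start with `prefix` under a single `parent` dictionary."""
    nested = {key: value for key, value in flat.items() if not key.startswith(prefix)}
    nested[parent] = {
        key[len(prefix):]: value
        for key, value in flat.items()
        if key.startswith(prefix)
    }
    return nested


# Hypothetical top-level layout, as in the Google schema.
flat_layout = {
    "object_name": "cytosolic ribosome",
    "object_id": "GO:0022626",
    "other_key": "unchanged",
}

# Hypothetical nested layout, as found in the files.
nested_layout = nest(flat_layout, parent="annotation_object", prefix="object_")
print(nested_layout["annotation_object"]["name"])
```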

Comment on lines +120 to +121
| Tomogram file name | ??? | N/A | | STRING | -- | -- | MUST | =Tilt_experiment.name+Processing+Processing_software+\[no\]Ctf | Tomogram file name |
| Tilt series file name | ??? | N/A | | reference | -- | -- | MUST | | The name of the tilt series used to construct the tomograms |
Contributor Author

These exist in the google schema, but not in the implementation.

Contributor

This field seems useful - seems we should add it to our implementation?

Comment on lines +139 to +146
| Authors | authors | ORCID | ORCID | STRING | -- | -- | RECOMMENDED | ORCID | A unique, persistent identifier for the tomogram’s author, provided by ORCID |
| | | Full name | name | STRING | -- | -- | MUST | N/A | Full name of an annotator. Since not everyone has a ‘first’ or ‘last’ name, we chose to use full names instead. An annotator can also be an organization name such as ‘CZII CryoET Data Portal’. |
| | | Corresponding author status | corresponding_author_status | BOOLEAN | -- | -- | OPTIONAL | YES, NO | Indicating whether an annotator is the corresponding author (YES or NO) |
| | | Email | email | STRING | -- | -- | OPTIONAL for non-corresponding author; MUST for corresponding author | N/A | Email address for each annotator |
| | | Affiliation name | affiliation_name | STRING | -- | -- | RECOMMENDED | ROR | Name of the institution an annotator is affiliated with. Sometimes, one annotator may have multiple affiliations. |
| | | Affiliation address | affiliation_address | STRING | -- | -- | OPTIONAL | N/A | Address of the institution an annotator is affiliated with. |
| | | Affiliation identifier | affiliation_identifier | STRING | -- | -- | RECOMMENDED | ROR | A unique identifier assigned to the affiliated institution by The Research Organization Registry (ROR). |
| | | Author list order | author_list_order | INTEGER | -- | -- | MUST | | The order in which the author appears in the publication |
Contributor Author

The entire Author section is actually not a dictionary, but a list of Dictionaries of type "TomogramAuthor". We should reflect this in the documentation.
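
Once the section is treated as a list, per-entry rules such as "email is MUST for the corresponding author, OPTIONAL otherwise" can be checked entry by entry. The following is a sketch under that reading of the table; the function name and sample values are hypothetical, while the keys `name`, `corresponding_author_status`, and `email` come from the rows above.

```python
def check_author_emails(authors: list) -> list:
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    for position, author in enumerate(authors):
        is_corresponding = author.get("corresponding_author_status", False)
        if is_corresponding and not author.get("email"):
            name = author.get("name", "?")
            problems.append(
                f"authors[{position}] ({name}): corresponding author has no email"
            )
    return problems


# Placeholder entries for illustration.
tomogram_authors = [
    {"name": "Jane Doe", "corresponding_author_status": True,
     "email": "jane.doe@example.org"},
    {"name": "John Roe", "corresponding_author_status": True},
    {"name": "CZII CryoET Data Portal"},
]

problems = check_author_emails(tomogram_authors)
print(problems)
```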

Comment on lines +156 to +164
| Authors | authors | ORCID | ORCID | STRING | -- | -- | RECOMMENDED | ORCID | A unique, persistent identifier for the annotator, provided by ORCID |
| | | Full name | name | STRING | -- | -- | MUST | N/A | Full name of an annotator. Since not everyone has a ‘first’ or ‘last’ name, we chose to use full names instead. An annotator can also be an organization name such as ‘CZII CryoET Data Portal’. |
| | | Corresponding author status | corresponding_author_status | BOOLEAN | -- | -- | OPTIONAL | | Indicating whether an annotator is the corresponding author (YES or NO) |
| | | Primary annotator status | primary_annotator_status | BOOLEAN | -- | -- | OPTIONAL | | Indicating whether an annotator is the main person executing the annotation, especially on manual annotation (YES or NO) |
| | | Email | email | STRING | -- | -- | OPTIONAL for non-corresponding author; MUST for corresponding author | N/A | Email address for each annotator |
| | | Affiliation name | affiliation_name | STRING | -- | -- | RECOMMENDED | ROR | Name of the institution an annotator is affiliated with. Sometimes, one annotator may have multiple affiliations. |
| | | Affiliation address | affiliation_address | STRING | -- | -- | OPTIONAL | N/A | Address of the institution an annotator is affiliated with. |
| | | Affiliation identifier | affiliation_identifier | STRING | -- | -- | RECOMMENDED | ROR | A unique identifier assigned to the affiliated institution by The Research Organization Registry (ROR). |
| | | Author list order | author_list_order | INTEGER | -- | -- | MUST | | The order in which the author appears in the publication |
Contributor Author

The entire Author section is actually not a dictionary, but a list of Dictionaries of type "AnnotationAuthor". We should reflect this in the documentation.

| Data elements | metadata key 1 | Sub-elements | metadata key 2 | Value type | Unit | Default | Requirement | Controlled vocabularies | Description |
|-------------------------------|--------------------------|-----------------------------------|-----------------------------|------------|:----:|:-------:|-----------------------------------------------------------------------|-------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Annotation file name | ??? | N/A | | STRING | -- | -- | MUST | | Name of an annotation file. We adopt the following naming convention: tilt_experiment.name+object.name+annotation_instance.name+version.number. There could be multiple annotations by multiple annotators for the same CryoET datasets (tilt series and tomograms), each annotation file is considered an ‘annotation instance’). |
| Authors | authors | ORCID | ORCID | STRING | -- | -- | RECOMMENDED | ORCID | A unique, persistent identifier for the annotator, provided by ORCID |
Contributor Author

'ORCID' is the only metadata key that is capitalized. Should probably be lower case in the future.

| Affine transformation matrix | affine_transformation_matrix | N/A | | 4x4 affine matrix of float | -- | -- | Not by data submitter | Default = identity matrix | Portal curator defines a reconstruction workflow from tilt series that results in a particular tomogram orientation. The flip or rotation transformation of this author submitted tomogram is indicated here. |
| Key Image | key_image | Preview URL | key_photo_url | STRING | -- | -- | OPTIONAL | | URL for the tomogram preview image. |
| | | Thumbnail URL | key_photo_thumbnail_url | STRING | -- | -- | OPTIONAL | | URL for the thumbnail of the tomogram preview image. |
| Authors | authors | ORCID | ORCID | STRING | -- | -- | RECOMMENDED | ORCID | A unique, persistent identifier for the tomogram’s author, provided by ORCID |
Contributor Author

'ORCID' is the only metadata key that is capitalized. Should probably be lower case in the future.

| | | Affiliation name | affiliation_name | STRING | -- | -- | RECOMMENDED | ROR | Name of the institution an author is affiliated with. Sometimes, one author may have multiple affiliations |
| | | Affiliation address | affiliation_address | STRING | -- | -- | OPTIONAL | N/A | Address of the institution an author is affiliated with |
| | | Affiliation identifier | affiliation_identifier | STRING | -- | -- | RECOMMENDED | ROR | A unique identifier assigned to the affiliated institution by The Research Organization Registry (ROR) |
| | | Author list order | order | INTEGER | -- | -- | MUST | N/A | The order in which the author appears in the publication |
Contributor Author

The 'Author list order' is not actually a field in the JSON files; the order is given by each author's position in the list (JSON arrays are ordered).

Contributor

Yeah, I think we shouldn't include the field in the metadata spec and just say that the ordering is significant.
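
A short sketch of what "the ordering is significant" means in practice: the author order can be recovered from list positions, and it survives a JSON round trip. The names are placeholders.

```python
import json

# Placeholder author list; no explicit order field is stored.
authors = [
    {"name": "Jane Doe"},
    {"name": "John Roe"},
    {"name": "CZII CryoET Data Portal"},
]

# Serializing and parsing keeps array elements in the same order.
round_tripped = json.loads(json.dumps(authors))

# Derive a 1-based author order from the position in the list.
ordered = [(position, author["name"])
           for position, author in enumerate(round_tripped, start=1)]
print(ordered)
```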

| | | Affiliation name | affiliation_name | STRING | -- | -- | RECOMMENDED | ROR | Name of the institution an annotator is affiliated with. Sometimes, one annotator may have multiple affiliations. |
| | | Affiliation address | affiliation_address | STRING | -- | -- | OPTIONAL | N/A | Address of the institution an annotator is affiliated with. |
| | | Affiliation identifier | affiliation_identifier | STRING | -- | -- | RECOMMENDED | ROR | A unique identifier assigned to the affiliated institution by The Research Organization Registry (ROR). |
| | | Author list order | author_list_order | INTEGER | -- | -- | MUST | | The order in which the author appears in the publication |
Contributor Author

The 'Author list order' is not actually a field in the JSON files; the order is given by each author's position in the list (JSON arrays are ordered).

Comment on lines +156 to +164
| Authors | authors | ORCID | ORCID | STRING | -- | -- | RECOMMENDED | ORCID | A unique, persistent identifier for the annotator, provided by ORCID |
| | | Full name | name | STRING | -- | -- | MUST | N/A | Full name of an annotator. Since not everyone has a ‘first’ or ‘last’ name, we chose to use full names instead. An annotator can also be an organization name such as ‘CZII CryoET Data Portal’. |
| | | Corresponding author status | corresponding_author_status | BOOLEAN | -- | -- | OPTIONAL | | Indicating whether an annotator is the corresponding author (YES or NO) |
| | | Primary annotator status | primary_annotator_status | BOOLEAN | -- | -- | OPTIONAL | | Indicating whether an annotator is the main person executing the annotation, especially on manual annotation (YES or NO) |
| | | Email | email | STRING | -- | -- | OPTIONAL for non-corresponding author; MUST for corresponding author | N/A | Email address for each annotator |
| | | Affiliation name | affiliation_name | STRING | -- | -- | RECOMMENDED | ROR | Name of the institution an annotator is affiliated with. Sometimes, one annotator may have multiple affiliations. |
| | | Affiliation address | affiliation_address | STRING | -- | -- | OPTIONAL | N/A | Address of the institution an annotator is affiliated with. |
| | | Affiliation identifier | affiliation_identifier | STRING | -- | -- | RECOMMENDED | ROR | A unique identifier assigned to the affiliated institution by The Research Organization Registry (ROR). |
| | | Author list order | author_list_order | INTEGER | -- | -- | MUST | | The order in which the author appears in the publication |
Contributor Author

The 'Author list order' is not actually a field in the JSON files; the order is given by each author's position in the list (JSON arrays are ordered).


| Data elements | metadata key 1 | Sub-elements | metadata key 2 | Value type | Unit | Default | Requirement | Controlled vocabularies | Description |
|-------------------------------|------------------------------|-----------------------------|-----------------------------|----------------------------|:-----:|:-------:|-----------------------------------------------------------------------|----------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Tomogram file name | ??? | N/A | | STRING | -- | -- | MUST | =Tilt_experiment.name+Processing+Processing_software+\[no\]Ctf | Tomogram file name |
Contributor Author

Not actually inside the json files we are generating.

Contributor

Should we drop this field? It seems moot.

Contributor

@jgadling jgadling left a comment

I'd like to see the author list order and tiltseries.tomogram fields dropped if that's ok with you.
