I reviewed the documentation and I have some questions, comments and suggestions on various sections of the document.
Random comments
Who is the target audience for the document? I am not a computer scientist and I read the document with the thought, 'How can I make our existing data MiAIRR compliant? How can I ensure that I gather the proper information in the future?' Here, I sometimes fall short. Not because I need to invest some time to understand, but because some parts simply seem inaccessible.
Maybe we should standardize the level names in the data model. The section 'MiAIRR-to-NCBI Implementation' uses slightly different terms for the levels. For instance, 'diagnosis & intervention' appears in the bullet list in that section, but the only other mention is in the table in 'MiAIRR Data Elements', where it is written 'diagnosis and intervention'. 'MiAIRR-to-NCBI Implementation' has 'processed sequences with basic analysis results', which is more detailed than the 'processed AIRR sequences' used elsewhere (although 'basic analysis results' is non-descriptive). In the Nat. Comm schematic, there is no 'intervention' and the 6th level is called 'Processed Sequences with Annotations'.
The Repertoire Schema is UTF8, while the Rearrangement Schema is ASCII or UTF-8.
Sometimes we say OpenAPI V2, sometimes OpenAPI V2 and V3. (Actually, I think it's 1 all)
My understanding of the statement "The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI V2 specification. If provided, version in Info should reference the version of the AIRR schema for the file." in 'Repertoire Schema > File Structure' is that we may not know the schema ID. Is this a problem? What is the purpose of the optional INFO field if it does not carry relevant information? If I understand the API correctly - and there is no guarantee that I am anywhere close - the schema is always returned, so the version number may be important.
study_description and study_contact in the Study-schema are missing in AIRR_Minimal_Standard_Data_Elements.
genotype in the Subject-schema is not really explained (and does not exist in AIRR_Minimal_Standard_Data_Elements).
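To make the question about the optional Info object concrete, here is a minimal sketch of what a consumer of a Repertoire file would have to do. It assumes the file parses to a mapping with an optional top-level `Info` key and a `Repertoire` key, as the quoted sentence suggests; the helper name `schema_version` and the version string `"1.3"` are hypothetical and only serve the illustration.

```python
import json
from typing import Optional


def schema_version(document: dict) -> Optional[str]:
    """Return Info.version if the optional Info object is present, else None."""
    info = document.get("Info")
    if isinstance(info, dict):
        return info.get("version")
    return None


# Two hypothetical files: one with the optional Info object, one without.
with_info = json.loads(
    '{"Info": {"title": "example", "version": "1.3"}, "Repertoire": []}'
)
without_info = json.loads('{"Repertoire": []}')

for name, doc in (("with Info", with_info), ("without Info", without_info)):
    version = schema_version(doc)
    if version is None:
        # This is the situation the comment asks about: no way to tell
        # which schema version the file was written against.
        print(f"{name}: schema version unknown")
    else:
        print(f"{name}: schema version {version}")
```

If Info stays optional, every reader needs a fallback branch like the one above, which is the practical cost behind the question.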
Section specific comments
Section: MiAIRR Data Elements
We say we have 6 high levels: study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences.
The table has [1-5]/[Level name] because 'processed AIRR sequences' is missing
Suggestion
Separate the table into sections, similar to 'Repertoire Fields' in 'Repertoire Schema'.
Take inspiration from 'MiAIRR-to-NCBI Implementation' and use bullet points for the levels and description
Explain why 'processed AIRR sequences' is missing - at least acknowledge its absence
Section: MiAIRR-to-NCBI Implementation
"The current version (1.0) of the standard has been recently published [Rubelt_2017] and was passed by the general assembly at the annual AIRR Community meeting in December 2017." - Is this still true? The current version is 1.0?
This section seems to be the only place where diagnosis & intervention is mentioned
Section: MiAIRR-to-NCBI Specification
This sentence is nonsensical: "In terms of standard compliance, it is currently REQUIRED [1] to deposit information for MiAIRR data sets 5 and 6 in general-purpose sequence repositories for which an AIRR-accepted specification on information mapping MUST exist."
Suggestion: Start with the ubiquitous 'we have six levels' and list them. It is dull reading, but ensures full understanding as a reference document
The document generally speaks of data sets (which may be true, because of physical distribution) but elsewhere we talk of levels
Table in 'Element mapping'
'diagnosis & treatment' should be corrected to 'diagnosis and intervention'
Document is generally difficult to read, but maybe it works extremely well as a reference
It seems a little out of place in 'Study Reporting', and totally misplaced between 'MiAIRR Data Elements' and 'Requirement Levels of AIRR Schema Fields'. Maybe it should just live in 'Data Submission and Query'?
Section: Requirement Levels of AIRR Schema Fields
I don't have finer details of RFC2119 present in my mind - I like the glossary in 'MiAIRR-to-NCBI Specification'
The sentence 'Importantly, fields are not elevated to this level based on' lacks a counterpart. When are fields elevated to essential?
This sentence makes no sense to me: "However, IF information matching the semantic definition of the field is provided, this field MUST be used for reporting."
Subsection: Compliance with the MiAIRR Data Standard
This should be the first point: Data sets are considered MiAIRR-compliant ONLY IF all essential and important fields are reported.
This is not super important and should be last: Compliance to the MiAIRR Data Standard is currently a binary state, i.e., data either is or is not compliant, there are not “grades” of compliance. However, additional requirements for specific use cases might be defined in the future.
Who is this sentence for: Note that important fields with NULL-LIKE values MUST NOT be dropped from a data set.
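For readers like me, a small sketch of how I read the compliance rules might help. It assumes that essential fields must carry a real value, while important fields must be present but may hold a NULL-LIKE value; that split is my interpretation, not something the quoted sentences state outright. The field names, the `NULL_LIKE` set and the function name are hypothetical.

```python
# Assumption: None and the empty string stand in for NULL-LIKE values.
NULL_LIKE = (None, "")


def is_miairr_compliant(record: dict, essential: list, important: list) -> bool:
    """Binary compliance check: a record either is or is not compliant."""
    for field in essential:
        # Assumed rule: essential fields must be reported with a real value.
        if field not in record or record[field] in NULL_LIKE:
            return False
    for field in important:
        # Important fields may be NULL-LIKE, but the key must not be dropped.
        if field not in record:
            return False
    return True


# Hypothetical field names, used only for illustration.
essential = ["essential_field"]
important = ["important_field"]

kept_null = {"essential_field": "value", "important_field": None}
dropped = {"essential_field": "value"}

print(is_miairr_compliant(kept_null, essential, important))
print(is_miairr_compliant(dropped, essential, important))
```

Under this reading, the sentence about NULL-LIKE values is aimed at whoever writes the export code: an important field with no information is reported as empty rather than omitted.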
Section: Metadata Annotation Guidelines
Where are we in the six levels? How does this section connect to the rest?
'Clarification of Terms' - As for Requirement Levels of AIRR Schema Fields, a repetition of the definition might be appropriate
Section: AIRR Data Representations
FAIR Principles: I have no idea of grammar in this case, but I like to start each entry with a capital letter
AIRR Data Model: There are some inconsistencies with 'MiAIRR Data Elements' and 'MiAIRR-to-NCBI Implementation'
I somehow need a connection to the six levels. I imagine that CellProcessing and NucleicAcidProcessing data model objects belong to 'Sample Processing and Sequencing'?
I think this is the first place where 'Processed Sequences with Annotations' is actually filled with information
A positive note: I like the discussion/explanation in 'Relationship between Schema Objects'
Section: Repertoire Schema
A positive note: I like the explanation of Repertoire.
What exactly are the types 'SubjectGenotype' and 'SequencingData'?
The subsection 'Raw Sequence Data Fields' has no content.
Section: Rearrangement Schema
'A Rearrangement is a sequence which describes' - should it be 'A Rearrangement is a sequence that describes'? (Add backticks)
The description of the category 'Alignment Annotations' could point to the CIGAR section.
Related: I notice that the software standard page references "The AIRR Data Representation Working Group," which was decommissioned/folded into Standards ...pre-pandemic??
For that matter, we are tentatively planning to close up shop on the Software WG post-Porto, though the details have yet to be worked out...