Review of AIRR-Standards documentation #745

Open
ustervbo opened this issue Feb 4, 2024 · 2 comments
@ustervbo
Contributor

ustervbo commented Feb 4, 2024

I reviewed the documentation and I have some questions, comments and suggestions on various sections of the document.

Random comments

Who is the target audience for the document? I am not a computer scientist and I read the document with the thought, 'How can I make our existing data MiAIRR compliant? How can I ensure that I gather the proper information in the future?' Here, I sometimes fall short. Not because I need to invest some time to understand, but because some parts simply seem inaccessible.

Maybe we should standardize the level names in the data model. The section 'MiAIRR-to-NCBI Implementation' uses slightly different terms for the levels. For instance, 'diagnosis & intervention' appears in the bullet list of that section, while elsewhere it appears only in the table in 'MiAIRR Data Elements', where it is 'diagnosis and intervention'. 'MiAIRR-to-NCBI Implementation' has 'processed sequences with basic analysis results', which is more detailed than the 'processed AIRR sequences' used elsewhere (although 'basic analysis results' is non-descriptive). In the Nat. Comm schematic, there is no 'intervention' and the 6th level is called 'Processed Sequences with Annotations'.

The Repertoire Schema is UTF-8, while the Rearrangement Schema is ASCII or UTF-8.

Sometimes we say OpenAPI V2, sometimes OpenAPI V2 and V3. (Actually, I think it's 1 all)

My understanding of the statement "The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI V2 specification. If provided, version in Info should reference the version of the AIRR schema for the file." in 'Repertoire Schema > File Structure' is that we may not know the schema ID. Is this a problem? What is the purpose of the optional Info object if it does not carry relevant information? If I understand the API correctly (and there is no guarantee that I am anywhere close), the schema is always returned, so the version number may be important.
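To make the concern concrete, here is a minimal sketch of what a reader of such a file has to handle. It assumes the file is JSON and that the optional object sits under a top-level key named `Info`; the key name, the sample documents and the version string `1.4` are illustrative placeholders, not taken from the specification.

```python
import json


def declared_schema_version(text):
    """Return the schema version declared in the file, or None if it is absent."""
    doc = json.loads(text)
    info = doc.get("Info")  # assumed key name; the Info object is optional
    if not isinstance(info, dict):
        return None  # no Info object: the schema version is unknown
    return info.get("version")


# Hypothetical sample files, one with and one without the optional Info object.
with_info = '{"Info": {"title": "example", "version": "1.4"}, "Repertoire": []}'
without_info = '{"Repertoire": []}'

print(declared_schema_version(with_info))
print(declared_schema_version(without_info))
```

In the second case the consumer cannot tell which schema version the file follows, which is the situation the question above is about.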

study_description and study_contact in the Study-schema are missing in AIRR_Minimal_Standard_Data_Elements.

genotype in the Subject-schema is not really explained (and does not exist in AIRR_Minimal_Standard_Data_Elements).

Section specific comments

  • Section: MiAIRR Data Elements

    • We say we have 6 high levels
      • Study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences.
      • The table has [1-5]/[Level name] because 'processed AIRR sequences' is missing
    • Suggestion
      • Separate the table into sections, similar to 'Repertoire Fields' in 'Repertoire Schema'.
      • Take inspiration from 'MiAIRR-to-NCBI Implementation' and use bullet points for the levels and description
      • Explain why 'processed AIRR sequences' is missing - at least acknowledge its absence
  • Section: MiAIRR-to-NCBI Implementation

    • "The current version (1.0) of the standard has been recently published [Rubelt_2017] and was passed by the general assembly at the annual AIRR Community meeting in December 2017." - Is this still true? Is the current version still 1.0?
    • This section seems to be the only place where diagnosis & intervention is mentioned
  • Section: MiAIRR-to-NCBI Specification

    • This sentence is nonsensical: "In terms of standard compliance, it is currently REQUIRED [1] to deposit information for MiAIRR data sets 5 and 6 in general-purpose sequence repositories for which an AIRR-accepted specification on information mapping MUST exist."
      • Suggestion: Start with the ubiquitous 'we have six levels' and list them. It is dull reading, but ensures full understanding as a reference document
      • The document generally speaks of data sets (which may be true, because of physical distribution) but elsewhere we talk of levels
    • Table in 'Element mapping'
      • 'diagnosis & treatment' should be corrected to 'diagnosis and intervention'
      • Document is generally difficult to read, but maybe it works extremely well as a reference
    • It seems a little out of place in 'Study Reporting', and totally misplaced between 'MiAIRR Data Elements' and 'Requirement Levels of AIRR Schema Fields'. Maybe it should just live in 'Data Submission and Query'?
  • Section: Requirement Levels of AIRR Schema Fields

    • I don't have the finer details of RFC 2119 in mind - I like the glossary in 'MiAIRR-to-NCBI Specification'
    • The sentence 'Importantly, fields are not elevated to this level based on' lacks a counterpart. When are fields elevated to essential?
    • This sentence makes no sense to me: "However, IF information matching the semantic definition of the field is provided, this field MUST be used for reporting."
    • Subsection: Compliance with the MiAIRR Data Standard
      • This should be the first point: Data sets are considered MiAIRR-compliant ONLY IF all essential and important fields are reported.
      • This is not super important and should be last: Compliance to the MiAIRR Data Standard is currently a binary state, i.e., data either is or is not compliant, there are not “grades” of compliance. However, additional requirements for specific use cases might be defined in the future.
      • Who is this sentence for: Note that important fields with NULL-LIKE values MUST NOT be dropped from a data set.
  • Section: Metadata Annotation Guidelines

    • Where are we in the six levels? How does this section connect to the rest?
    • 'Clarification of Terms' - As for Requirement Levels of AIRR Schema Fields, a repetition of the definition might be appropriate
  • Section: AIRR Data Representations

    • FAIR Principles: I am not sure about the grammar in this case, but I would start each entry with a capital letter
    • AIRR Data Model: There are some inconsistencies with 'MiAIRR Data Elements' and 'MiAIRR-to-NCBI Implementation'
    • I somehow need a connection to the six levels. I imagine that CellProcessing and NucleicAcidProcessing data model objects belong to 'Sample Processing and Sequencing'?
    • I think this is the first place where 'Processed Sequences with Annotations' is actually filled with information
    • A positive note: I like the discussion/explanation in 'Relationship between Schema Objects'
  • Section: Repertoire Schema

    • A positive note: I like the explanation of Repertoire.
    • What exactly are the types 'SubjectGenotype' and 'SequencingData'?
    • The subsection 'Raw Sequence Data Fields' has no content.
  • Section: Rearrangement Schema

    • 'A Rearrangement is a sequence which describes' - should it be 'A Rearrangement is a sequence that describes'? (Add backticks)
    • The description of the category 'Alignment Annotations' could point to the CIGAR section.
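To illustrate the compliance rule quoted under 'Compliance with the MiAIRR Data Standard', here is a minimal sketch of how I read it. It assumes that an essential field needs a non-null value, while an important field must be present but may carry a NULL-LIKE value (represented here as `None`). The field names `field_a`, `field_b` and `field_c` are hypothetical placeholders, not real MiAIRR fields.

```python
# Hypothetical requirement levels; real assignments come from the AIRR schema.
ESSENTIAL = {"field_a"}
IMPORTANT = {"field_b", "field_c"}


def is_miairr_compliant(record):
    """Binary check: every essential and important field must be reported."""
    # Assumption: essential fields must be present and must not be null.
    essential_ok = all(record.get(name) is not None for name in ESSENTIAL)
    # Important fields may be null-like, but the key must not be dropped.
    important_ok = all(name in record for name in IMPORTANT)
    return essential_ok and important_ok


reported_with_null = {"field_a": "x", "field_b": None, "field_c": "y"}
dropped_important = {"field_a": "x", "field_c": "y"}

print(is_miairr_compliant(reported_with_null))
print(is_miairr_compliant(dropped_important))
```

The result is a single boolean, matching the statement that compliance is a binary state with no grades.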
@javh
Contributor

javh commented Feb 5, 2024

From the call:

@scharch
Contributor

scharch commented Feb 7, 2024

Related: I notice that the software standard page references "The AIRR Data Representation Working Group," which was decommissioned/folded into Standards ...pre-pandemic??
For that matter, we are tentatively planning to close up shop on the Software WG post-Porto, though the details have yet to be worked out...

@bcorrie bcorrie added this to the AIRR 2.0 milestone Feb 7, 2024
@javh javh added this to To do in Documentation via automation Mar 4, 2024