
Update SRA library_* mapping

bussec committed May 16, 2018
1 parent ff7bb03 commit 93a3fdc55233377b2b10fed8153dc9b12d3d5d54
@@ -3,10 +3,10 @@ study_id bioproject_accession MAPPING TRUE
sample_id sample_name MAPPING TRUE
nucleic_acid_processing_id library_ID MAPPING TRUE
NULL title DATABASE_SPECIFIC TRUE
NULL library_strategy DATABASE_SPECIFIC TRUE
NULL library_source DATABASE_SPECIFIC TRUE
NULL library_selection DATABASE_SPECIFIC TRUE
NULL library_layout DATABASE_SPECIFIC TRUE
NULL library_strategy MAPPED_NODE TRUE see description in mapping_MiAIRR_SRA_library_terms.rst
NULL library_source MAPPED_NODE TRUE see description in mapping_MiAIRR_SRA_library_terms.rst
NULL library_selection MAPPED_NODE TRUE see description in mapping_MiAIRR_SRA_library_terms.rst
NULL library_layout MAPPED_NODE TRUE see description in mapping_MiAIRR_SRA_library_terms.rst
NULL platform DATABASE_SPECIFIC TRUE
sequencing_platform instrument_model MAPPING TRUE SRA splits this information into `platform` and `instrument_model`, however the controlled vocabulary of the latter one also often contains the `platform` information. Therefore preference was given to a 1:1 mapping using `instrument_model`
library_generation_protocol design_description MAPPING TRUE
@@ -1,81 +1,107 @@
Mapping NCBI SRA library terms
==============================
Mapping MiAIRR to NCBI SRA library_* terms
==========================================

SRA requires that each record is tagged with three ``library_*`` terms
upon submission. There is currently no direct mapping between the
MiAIRR ``template_class`` and ``library_generation_method`` keywords
and the SRA terms, but this files should provide some guidance on how
to perform reasonable mapping:
SRA requires that each record is tagged with four ``library_*`` terms
upon submission. There is no direct mapping between the MiAIRR
``library_generation_method`` and ``template_class`` keywords and the
SRA terms. This document therefore provides basic guidance on
how to perform a reasonable MiAIRR-to-SRA mapping:

* Strategy: Will be ``AMPLICON`` for the majority of current
applications, ``RNA-Seq`` for whole transcriptome.
* ``library_strategy``: Will be ``AMPLICON`` for the majority of
current applications, ``RNA-Seq`` for whole transcriptome.

* Source: ``GENOMIC`` for DNA, ``TRANSCRIPTOMIC`` for RNA,
``SYNTHETIC`` for synthetic libraries.
* ``library_source``: ``GENOMIC`` for DNA, ``TRANSCRIPTOMIC`` for
RNA, ``SYNTHETIC`` for synthetic libraries.

* Selection: Should typically be one of ``PCR``, ``RT-PCR``, ``cDNA``
or ``RACE``.
* ``library_selection``: Should typically be one of ``PCR``,
``RT-PCR``, ``cDNA`` or ``RACE``.

* ``library_layout``: This term primarily describes the
sequencing process, rather than library construction. See `Mapping
to library_layout`_.

Current MiAIRR terms
--------------------

While ``template_class`` can only be ``DNA`` or ``RNA``, the possible
values for ``library_generation_method`` are more complex:
Mapping from library_generation_method
--------------------------------------

The following list enumerates the currently accepted values for the
``library_generation_method`` keyword. The values given in "SRA mapping"
are listed in the order ``library_strategy``, ``library_source``,
``library_selection``. If there are multiple mappings for a given
method, the distinction can be made based on the criteria listed in
"requires". Note that while ``template_class`` is currently redundant as
its information is fully contained in ``library_generation_method``, it
is nevertheless a REQUIRED keyword in the MiAIRR data standard.

* PCR: Conventional PCR on genomic DNA.
- REQUIRES: ``template_class`` = ``DNA`` AND NOT ``synthetic``
- requires: ``template_class`` = ``DNA`` AND NOT ``synthetic``
- SRA mapping: ``AMPLICON``, ``GENOMIC``, ``PCR``

* PCR: Conventional PCR on synthetic DNA.
- REQUIRES: ``template_class`` = ``DNA`` AND ``synthetic``
- requires: ``template_class`` = ``DNA`` AND ``synthetic``
- SRA mapping: ``AMPLICON``, ``SYNTHETIC``, ``PCR``

* RT(RHP)+PCR: RT-PCR using random hexamer primers.
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``AMPLICON``, ``TRANSCRIPTOMIC``, ``RT-PCR``

* RT(oligo-dT)+PCR: RT-PCR using oligo-dT primers.
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``AMPLICON``, ``TRANSCRIPTOMIC``, ``RT-PCR``

* RT(oligo-dT)+TS+PCR: 5'-RACE PCR (i.e. RT is followed by a template
switch (TS) step) using oligo-dT primers.
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``AMPLICON``, ``TRANSCRIPTOMIC``, ``RACE``

* RT(oligo-dT)+TS(UMI)+PCR: 5'-RACE PCR using oligo-dT primers and
template switch primers containing unique molecular identifiers
(UMI), i.e. the 5' end is UMI-coded.
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``AMPLICON``, ``TRANSCRIPTOMIC``, ``RACE``

* RT(specific)+PCR: RT-PCR using transcript-specific primers.
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``AMPLICON``, ``TRANSCRIPTOMIC``, ``RT-PCR``

* RT(specific)+TS+PCR: 5'-RACE PCR using transcript-specific primers.
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``AMPLICON``, ``TRANSCRIPTOMIC``, ``RACE``

* RT(specific)+TS(UMI)+PCR: 5'-RACE PCR using transcript-specific
primers and template switch primers containing UMIs.
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``AMPLICON``, ``TRANSCRIPTOMIC``, ``RACE``

* RT(specific+UMI)+PCR: RT-PCR using transcript-specific primers
containing UMIs (i.e. the 3' end is UMI-coded).
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``AMPLICON``, ``TRANSCRIPTOMIC``, ``RT-PCR``

* RT(specific+UMI)+TS+PCR: 5'-RACE PCR using transcript-specific
primers containing UMIs (i.e. the 3' end is UMI-coded).
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``AMPLICON``, ``TRANSCRIPTOMIC``, ``RACE``

* RT(specific)+TS: RT-based generation of dsDNA **without**
subsequent PCR. This is used by RNA-seq kits.
- REQUIRES: ``template_class`` = ``RNA``
- requires: ``template_class`` = ``RNA``
- SRA mapping: ``RNA-Seq``, ``TRANSCRIPTOMIC``, ``RACE``

* other: Any methodology not covered above.
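
The list above can be sketched as a small lookup in Python. This is only an illustration, not part of the MiAIRR standard or any NCBI tooling: the names ``SraLibraryTerms`` and ``map_library_terms`` are hypothetical, the ``synthetic`` argument is assumed to carry the flag referenced in the "requires" criteria, and returning ``None`` for ``other`` is an assumption, since no mapping is defined for it.

```python
from typing import NamedTuple, Optional


class SraLibraryTerms(NamedTuple):
    """Hypothetical container, ordered as in the "SRA mapping" entries."""
    library_strategy: str
    library_source: str
    library_selection: str


# RNA-based methods: (library_strategy, library_selection).
# library_source is TRANSCRIPTOMIC for all of them.
_RNA_METHODS = {
    "RT(RHP)+PCR": ("AMPLICON", "RT-PCR"),
    "RT(oligo-dT)+PCR": ("AMPLICON", "RT-PCR"),
    "RT(oligo-dT)+TS+PCR": ("AMPLICON", "RACE"),
    "RT(oligo-dT)+TS(UMI)+PCR": ("AMPLICON", "RACE"),
    "RT(specific)+PCR": ("AMPLICON", "RT-PCR"),
    "RT(specific)+TS+PCR": ("AMPLICON", "RACE"),
    "RT(specific)+TS(UMI)+PCR": ("AMPLICON", "RACE"),
    "RT(specific+UMI)+PCR": ("AMPLICON", "RT-PCR"),
    "RT(specific+UMI)+TS+PCR": ("AMPLICON", "RACE"),
    "RT(specific)+TS": ("RNA-Seq", "RACE"),
}


def map_library_terms(
    method: str, template_class: str, synthetic: bool = False
) -> Optional[SraLibraryTerms]:
    """Map a library_generation_method value to the three SRA terms.

    Returns None for "other" or unlisted methods (an assumption; the
    list above defines no SRA mapping for them).
    """
    if method == "PCR":
        if template_class != "DNA":
            raise ValueError("PCR requires template_class = DNA")
        # The two PCR entries differ only in the synthetic flag.
        source = "SYNTHETIC" if synthetic else "GENOMIC"
        return SraLibraryTerms("AMPLICON", source, "PCR")
    if method in _RNA_METHODS:
        if template_class != "RNA":
            raise ValueError(f"{method} requires template_class = RNA")
        strategy, selection = _RNA_METHODS[method]
        return SraLibraryTerms(strategy, "TRANSCRIPTOMIC", selection)
    return None
```

For example, ``map_library_terms("PCR", "DNA", synthetic=True)`` yields the ``AMPLICON``, ``SYNTHETIC``, ``PCR`` triple from the second PCR entry.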


Mapping to library_layout
-------------------------

NCBI currently defines two possible values for the ``library_layout``
term (``single`` and ``paired``) to describe whether a library has been
subjected to paired-end sequencing or not. MiAIRR does not specify a
dedicated field for this information as it was considered more important
to annotate whether a sequence is a ``complete_sequence`` or not.

Therefore the information for ``library_layout`` has to be derived from
the ``read_length`` keyword, which contains a JSON array with one or two
positive integer values, providing the maximum length of each read
direction. The existence of a non-zero second value SHOULD be
interpreted as an indication of ``paired``-end sequencing.
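
The rule described above could be implemented along these lines. This is a minimal sketch, assuming ``read_length`` arrives as a JSON string; the function name ``derive_library_layout`` is hypothetical, and the validation details (rejecting non-integers and negative values) are assumptions that go beyond what the text specifies.

```python
import json


def derive_library_layout(read_length_json: str) -> str:
    """Derive the SRA library_layout value from a MiAIRR read_length array.

    A non-zero second value is interpreted as paired-end sequencing;
    everything else is treated as single-end.
    """
    values = json.loads(read_length_json)
    if not isinstance(values, list) or not 1 <= len(values) <= 2:
        raise ValueError("read_length must be an array of one or two values")
    for value in values:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError("read_length values must be non-negative integers")
    if len(values) == 2 and values[1] > 0:
        return "paired"
    return "single"
```

With this sketch, ``[300]`` and ``[300, 0]`` both map to ``single``, while ``[300, 300]`` maps to ``paired``.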
@@ -19,22 +19,27 @@ The tables have five columns:

3. The relation between MiAIRR field name and NCBI attribute:

- `IDENTICAL`: The identical keyword exists in MiAIRR and the
- ``IDENTICAL``: The identical keyword exists in MiAIRR and the
NCBI repository; it defines similar content.

- `MAPPED`: Non-identical keywords are used by MiAIRR and NCBI
- ``MAPPED``: Non-identical keywords are used by MiAIRR and NCBI
to define similar content; a 1:1 mapping of the keywords is
required.

- `MAPPED_NODE`: Non-identical keywords are used by MiAIRR and NCBI
to define similar content. In addition, NCBI splits the content
into several sub-keys, which requires some string manipulation for
a 1:n mapping.
- ``MAPPED_NODE``: Non-identical keywords are used by MiAIRR and
NCBI to define similar content. In addition, NCBI splits the
content into several sub-keys, which requires some string
manipulation for a 1:n mapping.

- `AIRR_CUSTOM`: The NCBI repositories does *not* specify an
- ``AIRR_CUSTOM``: The NCBI repositories do *not* specify an
attribute for this content, so the MiAIRR field name is directly
used as custom keyword.

- ``DATABASE_SPECIFIC``: This is an NCBI-specific term that has
no correlate in MiAIRR. This should only occur for NCBI-specific
references or information only required during submission (e.g.
file names).

4. Whether an attribute is required by the NCBI repository (Note that
*all* data elements are **required** by MiAIRR)

Binary file not shown.
