Commit

Merge pull request #173 from enasequence/lilim-ebi-tag-rules
Rules descriptions tag_querying.rst
suranjayathilaka committed Feb 19, 2024
2 parents f1e8c45 + 5cd36b5 commit 444f209
Showing 1 changed file with 60 additions and 41 deletions.
101 changes: 60 additions & 41 deletions retrieval/programmatic-access/tag_querying.rst
@@ -52,14 +52,14 @@ Table of Object High Level Tags
:header: "high level tag", "description", "object type"
:widths: 20, 300, 50

"pathogen", "The sample has been automatically determined to belong to the Pathogens Portal", "assembly; sample; sequence; study; secondary_study; taxonomy"
"coastal_brackish", "The sample has been automatically determined by evaluation of GPS and other parameters to have some evidence of being collected from either a coastal or brackish environment.", "read_run; sample; taxonomy"
"freshwater", "The sample has been automatically determined by evaluation of GPS and other parameters to have some evidence of being collected from a freshwater environment.", "read_run; sample; taxonomy"
"marine", "The sample has been automatically determined by evaluation of GPS and other parameters to have some evidence of being collected from a marine environment.", "read_run; sample; taxonomy"
"terrestrial", "The sample has been automatically determined by evaluation of GPS and other parameters to have some evidence of being collected from a terrestrial environment.", "read_run; sample; taxonomy"
"datahub", "The sample has been automatically determined to belong to a datahub. Currently tags have been generated for `FAANG <https://data.faang.org/home>`_ and `Pathogen <https://www.pathogensportal.org/datahubs.>`_", "analysis; read_run; sample; secondary_study"
"xref", "The sample has been referenced in an external to the EMBL-EBI repository. Currently tags have been generated for WORMS and UniEUK.", "Depends on how the user submitted"
"covid19", "The sample has been automatically determined to belong to the COVID19 portal.", "analysis; read_run; sample; sequence; study"
"pathogen", "The record has been determined to be from a pathogenic source", "assembly; sample; sequence; study; secondary_study; taxonomy"
"coastal_brackish", "The record has been determined by evaluation of GPS and other parameters to have some evidence of being collected from either a coastal or brackish environment.", "read_run; sample; taxonomy"
"freshwater", "The record has been determined by evaluation of GPS and other parameters to have some evidence of being collected from a freshwater environment.", "read_run; sample; taxonomy"
"marine", "The record has been determined by evaluation of GPS and other parameters to have some evidence of being collected from a marine environment.", "read_run; sample; taxonomy"
"terrestrial", "The record has been determined by evaluation of GPS and other parameters to have some evidence of being collected from a terrestrial environment.", "read_run; sample; taxonomy"
"datahub", "The record has been determined to have been shared with an ENA Data Hub.", "analysis; read_run; sample; secondary_study"
"xref", "The record has been referenced in a repository external to EMBL-EBI. Currently tags have been generated for WoRMS, UniEuk, PubMed, Europe PMC and ArrayExpress.", "Depends on how the user submitted"
"covid19", "The record has been determined to be COVID-19 related based on taxonomy or belonging to COVID-19 specific umbrella studies, or one of their child studies.", "analysis; read_run; sample; sequence; study"



@@ -87,7 +87,7 @@ table, to see what they apply to.
* - pathogen:priority
- pathogen
- priority
-
- A pathogen that has been identified by WHO to pose a serious threat to humans.
-
* - pathogen:bacterium
- pathogen
@@ -122,17 +122,17 @@ table, to see what they apply to.
* - coastal_brackish:high_confidence
- coastal_brackish
- high_confidence
- strong evidence that the object is “coastal or brackish” environment associated.
- Strong evidence that the object is “coastal or brackish” environment associated.
-
* - coastal_brackish:medium_confidence
- coastal_brackish
- medium_confidence
- moderate evidence that the object is “coastal or brackish” environment associated.
- Moderate evidence that the object is “coastal or brackish” environment associated.
-
* - coastal_brackish:low_confidence
- coastal_brackish
- low_confidence
- weak evidence that the object is “coastal or brackish” environment associated.
- Weak evidence that the object is “coastal or brackish” environment associated.
-
* - freshwater
- freshwater
@@ -147,62 +147,52 @@ table, to see what they apply to.
* - freshwater:medium_confidence
- freshwater
- medium_confidence
- moderate evidence that the object is freshwater environment associated.
- Moderate evidence that the object is freshwater environment associated.
-
* - freshwater:low_confidence
- freshwater
- low_confidence
- weak evidence that the object is freshwater environment associated.
- Weak evidence that the object is freshwater environment associated.
-
* - marine
- marine
-
- Some evidence that it is “marine” environment associated
- There will likely be other low level tags to provide context.
* - marine:high_confidence
* - marine:high_confidence
- marine
- high_confidence
- Strong evidence that the object is marine environment associated.
-
* - marine:medium_confidence
- marine
- medium_confidence
- moderate evidence that the object is marine environment associated.
- Moderate evidence that the object is marine environment associated.
-
* - marine:low_confidence
* - marine:low_confidence
- marine
- low_confidence
- weak evidence that the object is marine environment associated.
- Weak evidence that the object is marine environment associated.
-
* - terrestrial
- terrestrial
-
- Some evidence that it is terrestrial (land) environment associated.
- There will likely be other low level tags to provide context.
* - terrestrial:high_confidence
* - terrestrial:high_confidence
- terrestrial
- high_confidence
- Strong evidence that the object is terrestrial (land) environment associated.
-
* - terrestrial:medium_confidence
* - terrestrial:medium_confidence
- terrestrial
- medium_confidence
- moderate evidence that the object is terrestrial(land) environment associated.
- Moderate evidence that the object is terrestrial (land) environment associated.
-
* - terrestrial:low_confidence
- terrestrial
- low_confidence
- weak evidence that the object is terrestrial(land) environment associated.
-
* - datahub:faang
- datahub
- Faang
- Is a `Functional Annotation of ANimal Genomes project (FAANG) <https://data.faang.org/home>`_ sample and present in that datahub
-
* - datahub:metagenome
- datahub
- metagenome
- Is a metagenome and present in that datahub
- Weak evidence that the object is terrestrial (land) environment associated.
-
* - xref:arrayexpress
- xref
@@ -212,13 +202,13 @@ table, to see what they apply to.
* - xref:europepmc
- xref
- europepmc
- Object associated with a `European PubmedCentral <https://europepmc.org>`_ record
- A xref is available that links to European PubmedCentral
- Object associated with a `Europe PMC <https://europepmc.org>`_ record
- A xref is available that links to Europe PMC
* - xref:pubmed
- xref
- pubmed
- Object associated with an `NCBI Pubmed <https://pubmed.ncbi.nlm.nih.gov>`_ record
- A xref is available that links to NCBI Pubmed
- Object associated with a `PubMed <https://pubmed.ncbi.nlm.nih.gov>`_ record
- A xref is available that links to PubMed
* - xref:worms
- xref
- worms
@@ -227,8 +217,8 @@ table, to see what they apply to.
* - xref:unieuk
- xref
- unieuk
- Object associated with a `UNIEUK /(Universal taxonomic framework and integrated reference gene databases for Eukaryotic biology, ecology, and evolution ) <https://unieuk.net>`_ record
- A xref is available that links to UNIEUK
- Object associated with a `UniEuk (Universal taxonomic framework and integrated reference gene databases for Eukaryotic biology, ecology, and evolution) <https://unieuk.net>`_ record
- A xref is available that links to UniEuk
* - covid19
-
- covid19
@@ -244,12 +234,41 @@ table, to see what they apply to.
How are the Tags Created?
-------------------------

The tags are typically assigned by automatic processes analysing the user supplied metadata around an object.

For example, the identification of “marine” sample records is systematically assessed by a combination of geo-coordinates and taxonomic evidence. We can further qualify such identification by a level of confidence which is dictated by a combination of the evidence available on the record to support said assertion.
The tags are typically assigned by automated processes that analyse the user-supplied metadata around an object.

This is an evolving and continuously improving process: the algorithms and rule-sets used for classification can be updated as new insights are obtained, so the assigned tags are regularly refreshed. The flexibility of this system allows new classifications to be created easily, which makes it possible to define new, high-level contextual groupings for ENA data and makes discovery more intuitive for certain user communities.

^^^^^^^^
pathogen
^^^^^^^^
The pathogen tags are assigned based on NCBI taxonomy IDs. A list of taxonomy IDs is maintained for each type of pathogen; all records associated with a listed taxonomy ID, or with a taxonomy ID in its lineage, get the pathogen tag.
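
The lineage rule above can be sketched as follows. This is a minimal illustration, not the ENA implementation: the taxonomy IDs, the ``PARENT_OF`` links and the contents of ``PATHOGEN_LISTS`` are all made up, and the function names are hypothetical.

```python
# Hypothetical parent links: child taxonomy ID -> parent taxonomy ID.
# These IDs are placeholders, not real NCBI taxonomy IDs.
PARENT_OF = {
    9000003: 9000002,
    9000002: 9000001,
    9000011: 9000010,
}

# Hypothetical maintained lists, one per pathogen type.
PATHOGEN_LISTS = {
    "bacterium": {9000001},
    "priority": {9000020},
}


def lineage(tax_id):
    """Return the taxonomy ID followed by all of its ancestors."""
    chain = [tax_id]
    while chain[-1] in PARENT_OF:
        chain.append(PARENT_OF[chain[-1]])
    return chain


def pathogen_tags(tax_id):
    """Return the pathogen tags for a record with the given taxonomy ID."""
    ids = set(lineage(tax_id))
    # A record matches a list if its own ID or any ancestor is listed.
    low_level = [
        f"pathogen:{kind}"
        for kind, listed in sorted(PATHOGEN_LISTS.items())
        if ids & listed
    ]
    return ["pathogen"] + low_level if low_level else []


print(pathogen_tags(9000003))
print(pathogen_tags(9000011))
```

In this sketch a record inherits the tag when any ancestor of its taxonomy ID appears on a maintained list, which is one reading of "a taxonomy ID in that lineage".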

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
coastal_brackish, freshwater, marine, terrestrial
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The identification of coastal_brackish, freshwater, marine and terrestrial sample records is systematically assessed by a combination of geo-coordinates and/or taxonomic evidence. Taxonomic information is taken from `WoRMS <https://www.marinespecies.org/>`_, and four shapefiles are used for the coordinates:

* coastal_brackish: Longhurst shapefile downloaded from https://www.marineregions.org/
* freshwater: WWF’s Global 200 g200_fw_category shapefile: https://www.worldwildlife.org/publications/global-200
* marine: OpenStreetMap’s water polygons shapefile: https://osmdata.openstreetmap.de/data/water-polygons.html
* terrestrial: OpenStreetMap’s land polygons shapefile: https://osmdata.openstreetmap.de/data/land-polygons.html

We further qualify each identification with a level of confidence, which is determined by the combination of evidence available on the record to support the assertion.
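
As a rough sketch of how coordinates and taxonomic evidence might be combined, the example below tests a point against a polygon with a plain ray-casting check and then picks a confidence level. Everything here is an assumption made for illustration: ``MARINE_POLYGON`` is a made-up square rather than real shapefile data, ``taxon_is_marine`` stands in for a WoRMS lookup, and the mapping from evidence to high, medium or low confidence is hypothetical, since the actual rules are not described here.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices; the last vertex
    connects back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical square standing in for one polygon from a marine shapefile.
MARINE_POLYGON = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]


def marine_tags(lon, lat, taxon_is_marine):
    """Return marine tags; the confidence rules below are assumptions."""
    geo = point_in_polygon(lon, lat, MARINE_POLYGON)
    if geo and taxon_is_marine:
        level = "high_confidence"    # both kinds of evidence agree
    elif geo:
        level = "medium_confidence"  # coordinates only
    elif taxon_is_marine:
        level = "low_confidence"     # taxonomy only
    else:
        return []
    return ["marine", f"marine:{level}"]


print(marine_tags(5.0, 5.0, True))
print(marine_tags(20.0, 5.0, True))
```

A real pipeline would read the polygons from the shapefiles listed above with a geospatial library; the hand-written test is used here only to keep the sketch free of dependencies.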

^^^^
xref
^^^^
xref (Cross Reference) tags are based on external data resources that have provided mappings between their records and ENA records. A tag for a specific external resource can be enabled on request. Currently xref:worms tags are available on taxa; xref:arrayexpress, xref:europepmc and xref:pubmed on studies; and xref:unieuk on sequences.

^^^^^^^
datahub
^^^^^^^
The datahub tag is assigned based on whether the record has been shared with an ENA Data Hub.

^^^^^^^
covid19
^^^^^^^
The covid19 tag marks records related to COVID-19, as indicated by their inclusion under the COVID-19 specific umbrella studies PRJEB39908, PRJEB40349 and PRJEB40770, or one of their child studies.
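
The umbrella-study rule can be illustrated with a small membership check. The three umbrella accessions come from the text above; the ``PARENT_STUDY`` mapping, its placeholder accessions and the assumption that each study has at most one parent are hypothetical and only serve the example.

```python
# Umbrella study accessions named in the rule above.
COVID19_UMBRELLAS = {"PRJEB39908", "PRJEB40349", "PRJEB40770"}

# Hypothetical child -> parent study links; these accessions are placeholders.
PARENT_STUDY = {
    "PRJ_CHILD_A": "PRJEB39908",
    "PRJ_CHILD_B": "PRJ_OTHER_UMBRELLA",
}


def is_covid19_study(accession):
    """True if the study is a COVID-19 umbrella study or a direct child of one."""
    if accession in COVID19_UMBRELLAS:
        return True
    # Unknown studies have no parent, so .get() returns None and the test fails.
    return PARENT_STUDY.get(accession) in COVID19_UMBRELLAS


print(is_covid19_study("PRJEB40349"))
print(is_covid19_study("PRJ_CHILD_A"))
print(is_covid19_study("PRJ_CHILD_B"))
```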

-------------
Miscellaneous