Some organizational changes #32

Merged
merged 1 commit into from
Nov 17, 2022
14 changes: 10 additions & 4 deletions JOSS/paper.md
@@ -37,11 +37,15 @@ With the development of affordable sequencing platforms, many genetic and genomi

`Taxonomy.jl` is a Julia package to work with the NCBI Taxonomy database. Julia is a language suitable for scientific purposes as it is high-performance with good scalability (similar to C or Fortran), yet highly flexible and readable. Like Python and R, Julia also has a REPL (read-eval-print loop) environment for interactive use. Julia is a relatively young programming language, but it has a growing ecosystem, including `DataFrames.jl` for general data analysis, as well as communities focused on biological data such as BioJulia and EcoJulia. `Taxonomy.jl` provides efficient native access to the NCBI Taxonomy database in Julia and interfaces for storing and manipulating the information, such as lineages. These features are suitable for integration with the Julia ecosystem and for interactive analysis, for example, in the Jupyter ecosystem.

The design of `Taxonomy.jl` has been highly inspired by the command-line interface (CLI) tool `Taxonkit` [@shen2021taxonkit].
Community composition analysis, including metagenomic analysis, uses tables that represent the relative abundance of each taxon. In the NCBI Taxonomy, superkingdom, kingdom, phylum, class, order, family, genus, species, subspecies, and strain are used as canonical ranks. However, there are many exceptions to this. For example, kingdom applies only to eukaryotes; lineages, including those of viruses and environmental samples, often lack some ranks; there is a mixture of subspecies and strains with ranks below species; there are many taxa that do not have canonical ranks. Therefore, the NCBI Taxonomy lineages often cannot be used as-is. However, standardization of lineage is supported by only a few tools, such as `Taxonkit` [@shen2021taxonkit]. In `Taxonomy.jl`, this standardization is provided by the `Lineage` type.

# Features

Taxon data is manipulated by querying a database of parent-child relationships and their annotations by taxon identifiers or names. This can be accomplished in two ways: by accessing the database via a web application programming interface (API), or by directly parsing the dump files provided by NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/). Some tools, including the CLI tool `E-utilities` [@sayers2010general] and the R package `Taxize` [@chamberlain2013taxize], access data through a web API. This is convenient for processing a small number of queries, but is not suitable for large queries due to the limited speed of Internet connections. Moreover, NCBI requests that users submit no more than three query requests per second through the Entrez API, and cautions that IP blocking may be used to protect access community resources. Therefore, `Taxonomy.jl` employs an approach similar to the Python package `Taxopy` [@antonio_camargo_2022_7010602] and the CLI tool `Taxonkit` [@shen2021taxonkit], which parse the dump files directly and load them into random access memory (RAM). The dump files are small enough for modern computers (about 400MB total) to be entirely loaded in RAM, allowing queries to run in real-time. This approach also has a speed advantage over the file-based approach used by the `NCBITaxa` module of the Python package `ETE` [@huerta2016ete], which creates an SQLite database from the dump files.
## In-memory offline queries with `Taxonomy.DB`

Taxon data is manipulated by querying a database of parent-child relationships and their annotations by taxon identifiers or names. This can be accomplished in two ways: by accessing the database via a web application programming interface (API), or by directly parsing the dump files provided by NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/). Some tools, including the CLI tool `E-utilities` [@sayers2010general] and the R package `Taxize` [@chamberlain2013taxize], access data through a web API. This is convenient for processing a small number of queries, but is not suitable for large queries due to the limited speed of Internet connections. Moreover, NCBI requests that users submit no more than three query requests per second through the Entrez API, and cautions that IP blocking may be used to protect access to community resources. Therefore, `Taxonomy.jl` employs an approach similar to the Python package `Taxopy` [@antonio_camargo_2022_7010602] and the CLI tool `Taxonkit` [@shen2021taxonkit], which parse the dump files directly and load them into random access memory (RAM). The dump files are small enough for modern computers (about 400MB total) to be entirely loaded in RAM, allowing queries to run in real-time. This approach also has a speed advantage over the file-based approach used by the `NCBITaxa` module of the Python package `ETE` [@huerta2016ete], which creates an SQLite database from the dump files.
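
The in-memory approach described above can be illustrated with a minimal sketch in plain Julia. It is not the parser used by `Taxonomy.jl`: the function name `parse_nodes`, the variable names, and the four-record sample with made-up identifiers are all assumptions for illustration. The sketch assumes the `nodes.dmp` layout in which fields are separated by a tab, a vertical bar, and a tab, and reads only the first three fields (taxon identifier, parent identifier, rank).

```julia
# Toy excerpt in the assumed nodes.dmp layout (made-up identifiers).
# The root record lists itself as its own parent.
sample_nodes = """
1\t|\t1\t|\tno rank\t|
10\t|\t1\t|\tfamily\t|
20\t|\t10\t|\tgenus\t|
30\t|\t20\t|\tspecies\t|
"""

# Read every record once and keep two lookup tables in memory:
# taxid => parent taxid, and taxid => rank name.
function parse_nodes(io::IO)
    parent_of = Dict{Int,Int}()
    rank_of = Dict{Int,String}()
    for line in eachline(io)
        isempty(line) && continue
        # Split on the "tab + bar" delimiter, then trim the leftover tabs.
        fields = strip.(split(line, "\t|"))
        id = parse(Int, fields[1])
        parent_of[id] = parse(Int, fields[2])
        rank_of[id] = String(fields[3])
    end
    return parent_of, rank_of
end

parent_of, rank_of = parse_nodes(IOBuffer(sample_nodes))
```

Once both dictionaries are built, a lookup such as `parent_of[30]` is a single hash access with no network or disk round trip, which is the property the paragraph above relies on.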

## Accessing taxonomies

`Taxonomy.jl` provides a convenient set of types and functions to query the database and store the obtained information. Two types are provided, `Taxonomy.DB` and `Taxon`. The `Taxonomy.DB` type represents the taxonomy database and stores all data parsed from the dump files in RAM. The `Taxon` type represents a single taxon in the database. It stores a taxonomic identifier (Taxid) and a reference to the database.

@@ -70,6 +74,8 @@ The following operations are defined as functions with `Taxon` or `Taxonomy.DB`:
- `isancestor` and `isdescendant`: Evaluate ancestor-descendant relationships between two taxa
- `isless` (`<`) with `CanonicalRank` type: Filter taxa by a rank range
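
To make the two operations above concrete, here is a hedged sketch in plain Julia that does not call `Taxonomy.jl`. The names `toy_isancestor`, `is_below`, `parent_of`, and `canonical`, as well as the toy identifiers, are hypothetical; the package's own `isancestor` and `isless` may be implemented differently.

```julia
# Hypothetical toy data: child taxid => parent taxid; the root is its own parent.
parent_of = Dict(1 => 1, 10 => 1, 20 => 10, 30 => 20, 40 => 10)

# Walk parent pointers from `descendant` towards the root and report
# whether `ancestor` is met on the way. A taxon is not its own ancestor here.
function toy_isancestor(parent_of, ancestor, descendant)
    current = descendant
    while parent_of[current] != current
        current = parent_of[current]
        current == ancestor && return true
    end
    return false
end

# Canonical ranks from highest to lowest, as listed in the paper
# (strain is left out of this toy ordering).
canonical = [:superkingdom, :kingdom, :phylum, :class, :order,
             :family, :genus, :species, :subspecies]

# `a` is below `b` when it appears later in the ordering.
is_below(a::Symbol, b::Symbol) =
    findfirst(==(a), canonical) > findfirst(==(b), canonical)

# Keep only the ranks strictly below family.
below_family = filter(r -> is_below(r, :family), [:order, :genus, :species])
```

The ancestor check costs one dictionary lookup per level of the lineage, so it stays cheap even for deep lineages.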

## Displaying taxonomies

The hierarchical structure of the NCBI Taxonomy is organized as a rooted tree with each taxon as a node. Therefore, the `Taxonomy.DB` type can also be viewed as a rooted tree with the `Taxon` type as a node. We implemented an interface to handle the tree structures using `AbstractTrees.jl`. This allows users to use the functions defined in `AbstractTrees.jl`, as in the example below, and to traverse the tree in a user-defined way.

```julia
@@ -102,7 +108,7 @@ julia> AbstractTrees.print_tree(Taxon(207598))
└─ 1159185 [subspecies] Gorilla beringei beringei
```
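
As an illustration of what a user-defined traversal of such a rooted tree can look like, the following sketch walks a toy tree in pre-order using only base Julia. It deliberately does not use `AbstractTrees.jl` or `Taxonomy.jl`; the `children_of` table, the identifiers, and the function name `preorder` are assumptions made for this example.

```julia
# Hypothetical toy tree: taxid => child taxids (leaves have no children).
children_of = Dict(1 => [10, 40], 10 => [20], 20 => [30],
                   30 => Int[], 40 => Int[])

# Collect identifiers in pre-order: each node before its children, depth first.
function preorder(children_of, root)
    visited = Int[]
    stack = [root]
    while !isempty(stack)
        node = pop!(stack)
        push!(visited, node)
        # Push children in reverse so that the first child is visited first.
        append!(stack, reverse(children_of[node]))
    end
    return visited
end

order = preorder(children_of, 1)
```

An explicit stack is used instead of recursion so that very deep trees cannot exhaust the call stack.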

Community composition analysis, including metagenomic analysis, uses tables that represent the relative abundance of each taxon. In NCBI Taxonomy, superkingdom, kingdom, phylum, class, order, family, genus, species, subspecies, and strain are used as canonical ranks. However, there are many exceptions to this. For example, kingdom applies only to eukaryotes; lineages, including those of viruses and environmental samples, often lack some ranks; there is a mixture of subspecies and strains with ranks below species; there are many taxa that do not have canonical ranks. Therefore, the NCBI Taxonomy lineages cannot be used as is and must be standardized. However, standardization of lineage is supported by only a few tools, such as `Taxonkit` [@shen2021taxonkit].
## Resolving rank clashes

`Taxonomy.jl` provides a `Lineage` type, an interface to lineage information. The `Lineage` type is a subtype of the `AbstractVector` type and can be treated as a Vector with `Taxon` elements. The `getindex` methods of the `Lineage` type are extended so that a `Taxon` can also be accessed by its rank symbol. Subspecies and strain are treated internally as the same rank, so that users can ignore ambiguities in each lineage. This makes it possible to handle lineage information consistently.
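
The idea of a vector that can also be indexed by rank can be sketched in a few lines of plain Julia. This is a simplified stand-in, not the package's `Lineage` implementation: `ToyLineage` and `normalize_rank` are hypothetical names, and the elements are plain strings rather than `Taxon` values.

```julia
# A vector of taxon names that also remembers the rank of each entry.
struct ToyLineage <: AbstractVector{String}
    names::Vector{String}
    ranks::Vector{Symbol}
end

# Minimal AbstractVector interface: size plus integer indexing.
Base.size(l::ToyLineage) = size(l.names)
Base.IndexStyle(::Type{ToyLineage}) = IndexLinear()
Base.getindex(l::ToyLineage, i::Int) = l.names[i]

# Treat :strain as an alias of :subspecies so both address the same slot.
normalize_rank(r::Symbol) = r === :strain ? :subspecies : r

# Extra getindex method: look an entry up by its rank symbol.
function Base.getindex(l::ToyLineage, r::Symbol)
    i = findfirst(==(normalize_rank(r)), normalize_rank.(l.ranks))
    i === nothing && throw(KeyError(r))
    return l.names[i]
end

lineage = ToyLineage(
    ["Hominidae", "Gorilla", "Gorilla beringei", "Gorilla beringei beringei"],
    [:family, :genus, :species, :subspecies],
)

by_position = lineage[2]        # ordinary vector indexing
by_rank = lineage[:genus]       # indexing by rank symbol
by_alias = lineage[:strain]     # resolves to the subspecies entry
```

Because `ToyLineage` subtypes `AbstractVector`, generic functions such as `length`, iteration, and `collect` work on it without further definitions.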

@@ -112,6 +118,6 @@ The `Lineage` type can be converted to the `NamedTuple` type with rank as the ke

# Acknowledgments

This work was supported by JST, the establishment of university fellowships towards the creation of science technology innovation, Grant Number JPMJFS2123.
This work was supported by JST, the establishment of university fellowships towards the creation of science technology innovation, Grant Number JPMJFS2123. The design of `Taxonomy.jl` has been highly inspired by the command-line interface (CLI) tool `Taxonkit`.

# References