Edits #31

Merged
merged 1 commit into from
Nov 16, 2022
18 changes: 18 additions & 0 deletions JOSS/paper.bib
@@ -63,3 +63,21 @@ @article{shen2021taxonkit
year={2021},
publisher={Elsevier}
}

@misc{dataframes,
title = {{DataFrames.jl}: In-memory tabular data in {J}ulia},
author = {Harris, Harlan and {EPRI (Tom Short's code)} and DuBois, Chris and White, John Myles and Bouchet-Valat, Milan and Kamiński, Bogumił and {other DataFrames.jl contributors}},
year = {2022},
howpublished = {\url{https://github.com/JuliaData/DataFrames.jl}}
}

@article{huerta2016ete,
title={ETE 3: reconstruction, analysis, and visualization of phylogenomic data},
author={Huerta-Cepas, Jaime and Serra, Fran{\c{c}}ois and Bork, Peer},
journal={Molecular biology and evolution},
volume={33},
number={6},
pages={1635--1638},
year={2016},
publisher={Society for Molecular Biology and Evolution}
}
24 changes: 11 additions & 13 deletions JOSS/paper.md
@@ -25,34 +25,33 @@ bibliography: paper.bib

# Summary

`Taxonomy.jl` is a Julia [@bezanson2017julia] package to handle the National Center for Biotechnology Information (NCBI) Taxonomy database. `Taxonomy.jl` provides a rich set of comprehensive and essential manipulation of NCBI Taxonomy data. This package is designed not only for efficient data manipulation, but also for flexibility in interactive analysis (e.g. on Jupyter notebook [@kluyver2016jupyter]), and for integration with other Julia ecosystems such as `DataFrames.jl`.
`Taxonomy.jl` is a Julia [@bezanson2017julia] package for working with the National Center for Biotechnology Information (NCBI) Taxonomy database. `Taxonomy.jl` provides a comprehensive set of tools for manipulating NCBI Taxonomy data. The package is designed not only for efficient data manipulation, but also for flexible interactive analysis (e.g. in Jupyter notebooks [@kluyver2016jupyter]) and for integration with other packages in the Julia ecosystem such as `DataFrames.jl` [@dataframes].

`Taxonomy.jl` is an open-source project hosted on GitHub and distributed under the MIT license.

# Statement of need

The National Center for Biotechnology Information (NCBI) Taxonomy is a nomenclature and classification database for the International Nucleotide Sequence Database Collaboration (INSDC) [@schoch2020ncbi]. It provides organism names and classifications for every entry in the nucleotide and protein sequence databases of the INSDC and allows linking between different resources. Linking taxa and sequence data is foundational for various fields from biomedical to ecological studies.

With the development and affordability of sequencing platforms, many genetic and genomic sequences are being produced. The amount of data handled in a single study, including metagenome analysis, is exploding, and there is a need for tools that can handle the taxonomy database with lightweight performance and scalability.
With the development of affordable sequencing platforms, many genetic and genomic sequences are being produced. The amount of data generated by a single study, particularly in metagenome analysis, is growing rapidly, creating a need for lightweight, scalable tools for handling the taxonomy database.

`Taxonomy.jl` is a Julia package to handle the NCBI Taxonomy database. Julia is a desirable language suitable for scientific purposes as it is high-performance with good scalability (like C/Fortran), yet highly flexible and readable with supports of interactive execution (like Python/R). Julia is a relatively young programming language, but it has a growing ecosystem such as `DataFrames.jl` for general data analysis, as well as communities aiming for biological data such as BioJulia and EcoJulia. `Taxonomy.jl` bridges the NCBI Taxonomy database and Julia's ecosystem, enabling efficient downstream computation and interactive analysis (e.g. on Jupyter notebook).
`Taxonomy.jl` is a Julia package for working with the NCBI Taxonomy database. Julia is well suited to scientific computing: it is high-performance with good scalability (similar to C or Fortran), yet highly flexible and readable. Like Python and R, Julia also has a REPL (read-eval-print loop) environment for interactive use. Julia is a relatively young programming language, but it has a growing ecosystem that includes packages such as `DataFrames.jl` for general data analysis, as well as communities focused on biological data such as BioJulia and EcoJulia. `Taxonomy.jl` provides efficient native access to the NCBI Taxonomy database in Julia, suitable for interactive analysis in, for example, the Jupyter ecosystem.

# Features

Manipulation of taxon data is basically done by querying a database of various information and parent-child relationships by taxon identifiers or names. This can be accomplished in two major ways: by accessing the database via a web application programming interface (API), or by directly parsing the dump files provided by NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/). Some tools, including the CLI tool `E-utilities` [@sayers2010general] and the R package `Taxize` [@chamberlain2013taxize], access data through a web API, but this way is not suitable for large queries due to the limited speed of Internet connections. Therefore, `Taxonomy.jl` employs a way similar to the Python package `Taxopy` [@antonio_camargo_2022_7010602] and the CLI tool `Taxonkit` [@shen2021taxonkit], which parses the dump files directly and load it all into random access memory (RAM). The dump files are small enough for modern computers (about 400MB total) to be entirely loaded in RAM, allowing real-time query operations to be performed much faster. This way also has a speed advantage over the way employed by the `NCBITaxa` module of the Python package `ETE`, which creates SQLite database from the dump files and accesses the data with queries.
Taxon data is manipulated by querying a database of parent-child relationships and their annotations by taxon identifiers or names. This can be accomplished in two ways: by accessing the database via a web application programming interface (API), or by directly parsing the dump files provided by NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/). Some tools, including the CLI tool `E-utilities` [@sayers2010general] and the R package `Taxize` [@chamberlain2013taxize], access data through a web API. This is convenient for processing a small number of queries, but is not suitable for large numbers of queries due to the limited speed of Internet connections. Moreover, NCBI requests that users submit no more than three query requests per second through the Entrez API, and cautions that IP blocking may be used to protect access to community resources. Therefore, `Taxonomy.jl` employs an approach similar to the Python package `Taxopy` [@antonio_camargo_2022_7010602] and the CLI tool `Taxonkit` [@shen2021taxonkit], which parse the dump files directly and load them into random access memory (RAM). The dump files are small enough (about 400 MB in total) to be loaded entirely into the RAM of modern computers, allowing queries to run in real time. This approach also has a speed advantage over the file-based approach used by the `NCBITaxa` module of the Python package `ETE` [@huerta2016ete], which creates an SQLite database from the dump files.
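
To make the in-memory approach concrete, here is a minimal sketch of loading `nodes.dmp`-style records into dictionaries. It is not the parser used by `Taxonomy.jl`; the function name `read_nodes` and the three sample records are hypothetical, and only the NCBI dump layout (fields separated by `\t|\t`, records ending in `\t|`) is taken as given.

```julia
# Toy parser for NCBI-style nodes.dmp content (illustrative only, not Taxonomy.jl's code).
# Each record uses "\t|\t" between fields and ends with "\t|".
function read_nodes(io::IO)
    parents = Dict{Int,Int}()    # taxid => parent taxid
    ranks = Dict{Int,String}()   # taxid => rank
    for line in eachline(io)
        isempty(line) && continue
        record = replace(line, r"\t\|$" => "")   # drop the record terminator
        fields = split(record, "\t|\t")
        taxid = parse(Int, fields[1])
        parents[taxid] = parse(Int, fields[2])
        ranks[taxid] = String(fields[3])
    end
    return parents, ranks
end

# Three hand-written records standing in for a real dump file (truncated lineage).
sample = IOBuffer("1\t|\t1\t|\tno rank\t|\n9605\t|\t1\t|\tgenus\t|\n9606\t|\t9605\t|\tspecies\t|\n")
parents, ranks = read_nodes(sample)
println(ranks[9606], " with parent ", parents[9606])
```

Once the dictionaries are built, every lookup is a hash-table access in RAM, which is what makes this approach faster than issuing a web request or an SQL query per taxon.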

`Taxonomy.jl` provides a convenient set of types and functions to query the database and store the obtained information. The core of the system is of two types, `Taxonomy.DB` and `Taxon`. `Taxonomy.DB` type, as the name implies, is the type that represents the taxonomy database and stores all data parsed from the dump files in RAM. The `Taxon` type represents a single taxon in the database. It stores a taxonomic identifier (Taxid) and a reference to the database.
`Taxonomy.jl` provides a convenient set of types and functions to query the database and store the obtained information. Two types are provided, `Taxonomy.DB` and `Taxon`. The `Taxonomy.DB` type represents the taxonomy database and stores all data parsed from the dump files in RAM. The `Taxon` type represents a single taxon in the database. It stores a taxonomic identifier (Taxid) and a reference to the database.

The `Taxonomy.DB` object is created as follows by specifying the paths to `nodes.dmp` (links the Taxids to taxonomic ranks and parent Taxids) and `names.dmp` (links the Taxids to taxonomy names) in the file downloaded from the NCBI FTP site:

In Julia REPL
```julia
julia> using Taxonomy

julia> db = Taxonomy.DB("./db/nodes.dmp", "./db/names.dmp");
```

One feature of `Taxonomy.jl` is that once a database object is created, it can be called without explicitly specifying it. Since most analyses use only one database, this approach is effective and allows users to write simple, readable code. For example, users can omit the database argument when constructing the `Taxon` object as follows:
Once a database object is created, functions that need a database use it by default, so it does not have to be passed explicitly. Most analyses use only one database, and so this approach allows users to write simple, readable code. For example, users can omit the database argument when constructing the `Taxon` object as follows:

```julia
julia> Taxon(9606, db)
@@ -63,15 +62,14 @@ julia> Taxon(9606)
```
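
One way such an implicit default database could work is sketched below. This is an assumption for illustration, not the mechanism `Taxonomy.jl` actually uses: the names `ToyDB`, `ToyTaxon`, `CURRENT_DB` and `taxon_name` are all hypothetical.

```julia
# Hypothetical sketch of a "default database" pattern (not Taxonomy.jl's implementation).
const CURRENT_DB = Ref{Any}(nothing)   # holds the most recently created database

struct ToyDB
    names::Dict{Int,String}   # taxid => scientific name
    function ToyDB(names::Dict{Int,String})
        db = new(names)
        CURRENT_DB[] = db     # remember this database as the default
        return db
    end
end

struct ToyTaxon
    taxid::Int
    db::ToyDB
end

# One-argument constructor: fall back to the default database.
function ToyTaxon(taxid::Int)
    db = CURRENT_DB[]
    db === nothing && error("no database has been created yet")
    return ToyTaxon(taxid, db)
end

taxon_name(t::ToyTaxon) = t.db.names[t.taxid]

db = ToyDB(Dict(9606 => "Homo sapiens"))
println(taxon_name(ToyTaxon(9606, db)))   # explicit database
println(taxon_name(ToyTaxon(9606)))       # implicit default database
```

The trade-off of this pattern is hidden global state: it keeps single-database scripts short, while the two-argument form remains available when several databases are in use.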

The following operations are defined as functions on `Taxon` or `Taxonomy.DB` objects:
- Get various information on a given taxon (name, rank and parent-child relationships, etc.)
- Convert a name to Taxids
- Compute the lowest common ancestor (LCA) of given taxa
- Evaluate ancestor-descendant relationships between two taxa
- Filter taxa by a rank range
- **FIXME : add function name ** Get various information on a given taxon (name, rank and parent-child relationships, etc.)
- **FIXME : add function name ** Convert a name to Taxids
- **FIXME : add function name ** Compute the lowest common ancestor (LCA) of given taxa
- **FIXME : add function name ** Evaluate ancestor-descendant relationships between two taxa
- **FIXME : add function name ** Filter taxa by a rank range
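
As a rough illustration of two of the operations listed above (the LCA and the ancestor-descendant test), the sketch below works on a plain parent map. It does not use or name any `Taxonomy.jl` function: `lineage`, `lowest_common_ancestor` and `is_ancestor_or_self` are hypothetical helpers, and the toy tree attaches Homininae directly to the root, which is a simplification of the real lineage.

```julia
# Toy parent map: taxid => parent taxid; the root (1) is its own parent.
# Simplified on purpose: in NCBI Taxonomy, 207598 is not a direct child of the root.
toy_parents = Dict(
    1 => 1,
    207598 => 1,       # Homininae (subfamily)
    9605 => 207598,    # Homo (genus)
    9606 => 9605,      # Homo sapiens (species)
    9596 => 207598,    # Pan (genus)
    9598 => 9596,      # Pan troglodytes (species)
)

# Walk from a taxid up to the root, collecting every node on the way.
function lineage(parents::Dict{Int,Int}, taxid::Int)
    path = [taxid]
    while parents[taxid] != taxid
        taxid = parents[taxid]
        push!(path, taxid)
    end
    return path
end

# LCA: the first node on a's lineage that also lies on b's lineage.
function lowest_common_ancestor(parents::Dict{Int,Int}, a::Int, b::Int)
    ancestors_b = Set(lineage(parents, b))
    for t in lineage(parents, a)
        t in ancestors_b && return t
    end
    error("taxa do not share a root")
end

# True if `anc` is `desc` itself or lies on the path from `desc` to the root.
is_ancestor_or_self(parents, anc, desc) = anc in lineage(parents, desc)

println(lowest_common_ancestor(toy_parents, 9606, 9598))
println(is_ancestor_or_self(toy_parents, 9605, 9606))
```

Because each lineage is at most as long as the depth of the tree, both helpers cost only a few dictionary lookups per query.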

The hierarchical structure of the NCBI Taxonomy is organized as a rooted tree with each taxon as a node. Therefore, the `Taxonomy.DB` type can also be viewed as a rooted tree with the `Taxon` type as a node. We implemented an interface to handle the tree structures using `AbstractTrees.jl`. This allows users to use the functions defined in `AbstractTrees.jl`, as in the example below, and to traverse the tree in a user-defined way.

Example:
```julia
julia> AbstractTrees.print_tree(Taxon(207598))
207598 [subfamily] Homininae