From 907271ee8918deb86e8b67c5c26c6ab7cb5eddcf Mon Sep 17 00:00:00 2001 From: Russell Neches Date: Wed, 16 Nov 2022 14:58:35 +0900 Subject: [PATCH] Edits --- JOSS/paper.bib | 18 ++++++++++++++++++ JOSS/paper.md | 24 +++++++++++------------- 2 files changed, 29 insertions(+), 13 deletions(-) diff --git a/JOSS/paper.bib b/JOSS/paper.bib index 5ac1c27..c8aff58 100644 --- a/JOSS/paper.bib +++ b/JOSS/paper.bib @@ -63,3 +63,21 @@ @article{shen2021taxonkit year={2021}, publisher={Elsevier} } + +@misc{dataframes, + title = {{DataFrames.jl}: {I}n-memory tabular data in {J}ulia}, + author = {Harris, Harlan and EPRI (Tom Short's code) and DuBois, Chris and Myles White, John and Bouchet-Valat, Milan and Kamiński, Bogumił and other `DataFrames.jl` contributors.}, + year = {2022}, + howpublished = {\url{https://github.com/JuliaData/DataFrames.jl}} +} + +@article{huerta2016ete, + title={ETE 3: reconstruction, analysis, and visualization of phylogenomic data}, + author={Huerta-Cepas, Jaime and Serra, Fran{\c{c}}ois and Bork, Peer}, + journal={Molecular biology and evolution}, + volume={33}, + number={6}, + pages={1635--1638}, + year={2016}, + publisher={Society for Molecular Biology and Evolution} +} diff --git a/JOSS/paper.md b/JOSS/paper.md index c403b5b..0e95384 100644 --- a/JOSS/paper.md +++ b/JOSS/paper.md @@ -25,7 +25,7 @@ bibliography: paper.bib # Summary -`Taxonomy.jl` is a Julia [@bezanson2017julia] package to handle the National Center for Biotechnology Information (NCBI) Taxonomy database. `Taxonomy.jl` provides a rich set of comprehensive and essential manupliation of NCBI Taxonomy data. This package is designed not only for efficient data manipulation, but also for flexibility in interactive analysis (e.g. on Jupyter notebook [@kluyver2016jupyter]), and for integration with other Julia ecosystems such as `DataFrames.jl`. +`Taxonomy.jl` is a Julia [@bezanson2017julia] package to handle the National Center for Biotechnology Information (NCBI) Taxonomy database. 
`Taxonomy.jl` provides a rich set of comprehensive and essential tools for the manipulation of NCBI Taxonomy data. This package is designed not only for efficient data manipulation, but also for flexibility in interactive analysis (e.g. on Jupyter notebook [@kluyver2016jupyter]), and for integration with other Julia ecosystems such as `DataFrames.jl` [@dataframes]. `Taxonomy.jl` is an open-source project hosted on Github and distributed under the MIT license. @@ -33,26 +33,25 @@ bibliography: paper.bib The National Center for Biotechnology Information (NCBI) Taxonomy is a nomenclature and classification database for the International Nucleotide Sequence Database Collaboration (INSDC) [@schoch2020ncbi]. It provides organism names and classifications for every entry in the nucleotide and protein sequence databases of the INSDC and allow linking between different resources. Linking taxa and sequence data is foundational for various fields from biomedical to ecological studies. -With the development and affordability of sequencing platforms, many genetic and genomic sequences are being produced. The amount of data handled in a single study, including metagenome analysis, is exploding, and there is a need for tools that can handle the taxonomy database with lightweight performance and scalability. +With the development of affordable sequencing platforms, many genetic and genomic sequences are being produced. The amount of data generated by a single study, particularly metagenome analysis, is exploding, creating a need for tools that can handle the taxonomy database with lightweight performance and scalability. -`Taxonomy.jl` is a Julia package to handle the NCBI Taxonomy database. Julia is a desirable language suitable for scientific purposes as it is high-performance with good scalability (like C/Fortran), yet highly flexible and readable with supports of interactive execution (like Python/R). 
Julia is a relatively young programming language, but it has a growing ecosystem such as `DataFrames.jl` for general data analysis, as well as communities aiming for biological data such as BioJulia and EcoJulia. `Taxonomy.jl` bridges the NCBI Taxonomy database and Julia's ecosystem, enabling efficient downstream computation and interactive analysis (e.g. on Jupyter notebook). +`Taxonomy.jl` is a Julia package to work with the NCBI Taxonomy database. Julia is a language suitable for scientific purposes as it is high-performance with good scalability (similar to C or Fortran), yet highly flexible and readable. Like Python and R, Julia also has a REPL (read-eval-print loop) environment for interactive use. Julia is a relatively young programming language, but it has a growing ecosystem, including `DataFrames.jl` for general data analysis, as well as communities focused on biological data such as BioJulia and EcoJulia. `Taxonomy.jl` provides efficient native access to the NCBI Taxonomy database in Julia, suitable for interactive analysis in, for example, the Jupyter ecosystem. # Features -Manipulation of taxon data is basically done by querying a database of various information and parent-child relationships by taxon identifiers or names. This can be accomplished in two major ways: by accessing the database via a web application programming interface (API), or by directly parsing the dump files provided by NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/). Some tools, including the CLI tool `E-utilities` [@sayers2010general] and the R package `Taxize` [@chamberlain2013taxize], access data through a web API, but this way is not suitable for large queries due to the limited speed of Internet connections. Therefore, `Taxonomy.jl` employs a way similar to the Python package `Taxopy` [@antonio_camargo_2022_7010602] and the CLI tool `Taxonkit` [@shen2021taxonkit], which parses the dump files directly and load it all into random access memory (RAM). 
The dump files are small enough for modern computers (about 400MB total) to be entirely loaded in RAM, allowing real-time query operations to be performed much faster. This way also has a speed advantage over the way employed by the `NCBITaxa` module of the Python package `ETE`, which creates SQLite database from the dump files and accesses the data with queries. +Taxon data is manipulated by querying a database of parent-child relationships and their annotations by taxon identifiers or names. This can be accomplished in two ways: by accessing the database via a web application programming interface (API), or by directly parsing the dump files provided by NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/). Some tools, including the CLI tool `E-utilities` [@sayers2010general] and the R package `Taxize` [@chamberlain2013taxize], access data through a web API. This is convenient for processing a small number of queries, but is not suitable for large queries due to the limited speed of Internet connections. Moreover, NCBI requests that users submit no more than three query requests per second through the Entrez API, and cautions that IP blocking may be used to protect access to community resources. Therefore, `Taxonomy.jl` employs an approach similar to that of the Python package `Taxopy` [@antonio_camargo_2022_7010602] and the CLI tool `Taxonkit` [@shen2021taxonkit], which parse the dump files directly and load them into random access memory (RAM). The dump files are small enough for modern computers (about 400 MB total) to be entirely loaded in RAM, allowing queries to run in real time. This approach also has a speed advantage over the file-based approach used by the `NCBITaxa` module of the Python package `ETE` [@huerta2016ete], which creates an SQLite database from the dump files. -`Taxonomy.jl` provides a convenient set of types and functions to query the database and store the obtained information. The core of the system is of two types, `Taxonomy.DB` and `Taxon`. 
`Taxonomy.DB` type, as the name implies, is the type that represents the taxonomy database and stores all data parsed from the dump files in RAM. The `Taxon` type represents a single taxon in the database. It stores a taxonomic identifier (Taxid) and a reference to the database. +`Taxonomy.jl` provides a convenient set of types and functions to query the database and store the obtained information. Two types are provided, `Taxonomy.DB` and `Taxon`. The `Taxonomy.DB` type represents the taxonomy database and stores all data parsed from the dump files in RAM. The `Taxon` type represents a single taxon in the database. It stores a taxonomic identifier (Taxid) and a reference to the database. The `Taxonomy.DB` object is created as follows by specifying the paths to `nodes.dmp` (links the Taxids to taxonomic ranks and parent Taxids) and `names.dmp` (links the Taxids to taxonomy names) in the file downloaded from the NCBI FTP site: -In Julia REPL ```julia julia> using Taxonomy julia> db = Taxonomy.DB("./db/nodes.dmp", "./db/names.dmp"); ``` -One feature of `Taxonomy.jl` is that once a database object is created, it can be called without explicitly specifying it. Since most analyses use only one database, this approach is effective and allows users to write simple, readable code. For example, users can omit the database argument when constructing the `Taxon` object as follows: +Once a database object is created, it can be called without explicitly specifying it. Most analyses use only one database, and so this approach allows users to write simple, readable code. For example, users can omit the database argument when constructing the `Taxon` object as follows: ```julia julia> Taxon(9606, db) @@ -63,15 +62,14 @@ julia> Taxon(9606) ``` The following operations are defined as functions with `Taxon` or `Taxonomy.DB`: -- Get various information on a given taxon (name, rank and parent-child relationships, etc.) 
-- Convert a name to Taxids -- Compute the lowest common ancestor (LCA) of given taxa -- Evaluate ancestor-descendant relationships between two taxa -- Filter taxa by a rank range +- **FIXME : add function name ** Get various information on a given taxon (name, rank and parent-child relationships, etc.) +- **FIXME : add function name ** Convert a name to Taxids +- **FIXME : add function name ** Compute the lowest common ancestor (LCA) of given taxa +- **FIXME : add function name ** Evaluate ancestor-descendant relationships between two taxa +- **FIXME : add function name ** Filter taxa by a rank range The hierarchical structure of the NCBI Taxonomy is organized as a rooted tree with each taxon as a node. Therefore, the `Taxonomy.DB` type can also be viewed as a rooted tree with the `Taxon` type as a node. We implemented an interface to handle the tree structures using `AbstractTrees.jl`. This allows users to use the functions defined in `AbstractTrees.jl`, as in the example below, and to traverse the tree in a user-defined way. -Example: ```julia julia> AbstractTrees.print_tree(Taxon(207598)) 207598 [subfamily] Homininae