banhbio · Nov 17, 2022
diff --git a/JOSS/paper.md b/JOSS/paper.md
@@ -37,11 +37,15 @@ With the development of affordable sequencing platforms, many genetic and genomi
 
 `Taxonomy.jl` is a Julia package to work with the NCBI Taxonomy database. Julia is a language suitable for scientific purposes as it is high-performance with good scalability (similar to C or Fortran), yet highly flexible and readable. Like Python and R, Julia also has a REPL (read-eval-print loop) environment for interactive use. Julia is a relatively young programming language, but it has a growing ecosystem such as `DataFrames.jl` for general data analysis, as well as communities focused on biological data such as BioJulia and EcoJulia. `Taxonomy.jl` provides efficient native access to the NCBI Taxonomy database in Julia and interfaces for storing and manipulating the information, such as lineages. These features are suitable for integration with the Julia ecosystem and for interactive analysis, for example, within the Jupyter ecosystem.
 
-The design of `Taxonomy.jl` has been highly inspired by the command-line interface (CLI) tool `Taxonkit` [@shen2021taxonkit].
+Community composition analysis, including metagenomic analysis, uses tables that represent the relative abundance of each taxon. In the NCBI Taxonomy, superkingdom, kingdom, phylum, class, order, family, genus, species, subspecies, and strain are used as canonical ranks. However, there are many exceptions to this. For example, kingdom applies only to eukaryotes; lineages, including those of viruses and environmental samples, often lack some ranks; subspecies and strains are mixed at ranks below species; and many taxa have no canonical rank. Therefore, the NCBI Taxonomy lineages often cannot be used as-is. Nevertheless, standardization of lineages is supported by only a few tools, such as `Taxonkit` [@shen2021taxonkit]. In `Taxonomy.jl`, this standardization is provided by the `Lineage` type.
 
 # Features
 
-Taxon data is manipulated by querying a database of parent-child relationships and their annotations by taxon identifiers or names. This can be accomplished in two ways: by accessing the database via a web application programming interface (API), or by directly parsing the dump files provided by NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/). Some tools, including the CLI tool `E-utilities` [@sayers2010general] and the R package `Taxize` [@chamberlain2013taxize], access data through a web API. This is convenient for processing a small number of queries, but is not suitable for large queries due to the limited speed of Internet connections. Moreover, NCBI requests that users submit no more than three query requests per second throuh the Entrez API, and cautions that IP blocking may be used to protect access community resources. Therefore, `Taxonomy.jl` employs an approach similar to the Python package `Taxopy` [@antonio_camargo_2022_7010602] and the CLI tool `Taxonkit` [@shen2021taxonkit], which parse the dump files directly and load them into random access memory (RAM). The dump files are small enough for modern computers (about 400MB total) to be entirely loaded in RAM, allowing queries to run in real-time. This approach also has a speed advantage over the file-based approach used by the `NCBITaxa` module of the Python package `ETE` [@huerta2016ete], which creates SQLite database from the dump files.
+## In-memory offline queries with `Taxonomy.DB`
+
+Taxon data is manipulated by querying a database of parent-child relationships and their annotations by taxon identifiers or names. This can be accomplished in two ways: by accessing the database via a web application programming interface (API), or by directly parsing the dump files provided by NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/). Some tools, including the CLI tool `E-utilities` [@sayers2010general] and the R package `Taxize` [@chamberlain2013taxize], access data through a web API. This is convenient for processing a small number of queries, but is not suitable for large queries due to the limited speed of Internet connections. Moreover, NCBI requests that users submit no more than three query requests per second through the Entrez API, and cautions that IP blocking may be used to protect access to community resources. Therefore, `Taxonomy.jl` employs an approach similar to the Python package `Taxopy` [@antonio_camargo_2022_7010602] and the CLI tool `Taxonkit` [@shen2021taxonkit], which parse the dump files directly and load them into random access memory (RAM). The dump files are small enough (about 400 MB in total) to be loaded entirely into RAM on modern computers, allowing queries to run in real time. This approach also has a speed advantage over the file-based approach used by the `NCBITaxa` module of the Python package `ETE` [@huerta2016ete], which creates an SQLite database from the dump files.
+
+## Accessing taxonomies
 
 `Taxonomy.jl` provides a convenient set of types and functions to query the database and store the obtained information. Two types are provided, `Taxonomy.DB` and `Taxon`. The `Taxonomy.DB` type represents the taxonomy database and stores all data parsed from the dump files in RAM. The `Taxon` type represents a single taxon in the database. It stores a taxonomic identifier (Taxid) and a reference to the database.
 
@@ -70,6 +74,8 @@ The following operations are defined as functions with `Taxon` or `Taxonomy.DB`:
 - `isancestor` and `isdescendant`: Evaluate ancestor-descendant relationships between two taxa
 - `isless` (`<`) with `CanonicalRank` type: Filter taxa by a rank range
 
+## Displaying taxonomies
+
 The hierarchical structure of the NCBI Taxonomy is organized as a rooted tree with each taxon as a node. Therefore, the `Taxonomy.DB` type can also be viewed as a rooted tree with the `Taxon` type as a node. We implemented an interface to handle the tree structures using `AbstractTrees.jl`. This allows users to use the functions defined in `AbstractTrees.jl`, as in the example below, and to traverse the tree in a user-defined way.
 
 ```julia
@@ -102,7 +108,7 @@ julia> AbstractTrees.print_tree(Taxon(207598))
       └─ 1159185 [subspecies] Gorilla beringei beringei
 ```
 
-Community composition analysis, including metagenomic analysis, uses tables that represent the relative abundance of each taxon. In NCBI Taxonomy, superkingdom, kingdom, phylum, class, order, family, genus, species, subspecies, and strain are used as canonical ranks. However, there are many exceptions to this. For example, kingdom applies only to eukaryotes; lineages, including those of viruses and environmental samples, often lack some ranks; there is a mixture of subspecies and strains with ranks below species; there are many taxa that do not have canonical ranks. Therefore, the NCBI Taxonomy lineages cannot be used as is and must be standardized. However, standardization of lineage is supported by only a few tools, such as `Taxonkit` [@shen2021taxonkit].
+## Resolving rank clashes
 
 `Taxonomy.jl` provides a `Lineage` type, an interface to lineage information. The `Lineage` type is a subtype of the `AbstractVector` type and can be treated as a Vector with `Taxon` elements. The `getindex` methods of the `Lineage` type are extended to also access `Taxon` using the rank symbol. The subspecies/strain are internally treated as the same rank, so that users can ignore ambiguities in each lineage. This makes it possible to consistently handle lineage information.
 
@@ -112,6 +118,6 @@ The `Lineage` type can be converted to the `NamedTuple` type with rank as the ke
 
 # Acknowledgments
 
-This work was supported by JST, the establishment of university fellowships towards the creation of science technology innovation, Grant Number JPMJFS2123.
+This work was supported by JST, the establishment of university fellowships towards the creation of science technology innovation, Grant Number JPMJFS2123. The design of `Taxonomy.jl` was strongly inspired by the command-line interface (CLI) tool `Taxonkit`.
 
 # References