Update Documentation (concept and input_output) (#2)

* Update concept.rst
* Update input_output.rst
* Update usage.rst (file paths and descriptions)
* Delete .buildinfo
* Describe ks.tsv file format
* Add (E)LMM file format description; internal links
* Add (E)LMM file figures
* Rename header in ELMM TSV file
* Restructure mixture model section and add table
* Reduce paralog analyses description in docs
* Add Note block in input_output.rst about filenames
* Rename header in main.nf
* Add preprint DOI in "How to cite us" doc page
* Replace <focal species> with species (filename)
VIB-PSB · Mar 25, 2021 · a6e4062
1 parent 907afe3
commit a6e4062
Showing 19 changed files with 266 additions and 2,143 deletions.
diff --git a/doc/source/_images/elmm.png b/doc/source/_images/elmm.png
diff --git a/doc/source/_images/ks_tsv.png b/doc/source/_images/ks_tsv.png
diff --git a/doc/source/_images/lmm.png b/doc/source/_images/lmm.png
diff --git a/doc/source/_images/ortholog_distribution_peak.svg b/doc/source/_images/ortholog_distribution_peak.svg
diff --git a/doc/source/_images/orthologs_distribution_trio.svg b/doc/source/_images/orthologs_distribution_trio.svg
diff --git a/doc/source/_images/tree.svg b/doc/source/_images/tree.svg
diff --git a/doc/source/citation_acknowledgement.rst b/doc/source/citation_acknowledgement.rst
@@ -1,3 +1,7 @@
 
 How to cite us
 ==============
+
+If you publish results generated using *ksrates*, please cite:
+
+Sensalari C., Maere S. and Lohaus R. (2021) *ksrates*: positioning whole-genome duplications relative to speciation events using rate-adjusted mixed paralog--ortholog  *K*:sub:`S` distributions. *bioRxiv* 2021.02.28.433234 `doi: https://doi.org/10.1101/2021.02.28.433234 <https://doi.org/10.1101/2021.02.28.433234>`__ 
diff --git a/doc/source/concept.rst b/doc/source/concept.rst
@@ -4,42 +4,48 @@
 Substitution rate-adjustment strategy in a nutshell
 ===================================================
 
-``ksrates`` is a package for substitution rate-adjustment in mixed ortholog and paralog *K*:sub:`S` distributions.
+To position ancient whole-genome duplication (WGD) events with respect to speciation events in a phylogeny, it is common practice to superimpose a paralog *K*:sub:`S` distribution for a species of interest with ortholog *K*:sub:`S` distributions between this species and other species, resulting in a mixed paralog--ortholog *K*:sub:`S` plot. 
+However, when the lineages involved exhibit different substitution rates, the various *K*:sub:`S` distributions are built on different *K*:sub:`S` scales and a direct comparison among them is likely to mislead the phylogenetic interpretation of WGD signatures or order of divergences.
 
-Mixed *K*:sub:`S` distributions are one of the approaches applied to detect whole-genome duplications (WGDs) and to locate them in a phylogeny. A mixed plot is composed of ortholog *K*:sub:`S` distributions - representing divergence events - overlapped onto paralog *K*:sub:`S` distributions - representing the duplication history of a species genome. The relative positions of the ortholog peaks and the WGDs peaks are informative about the order of the depicted evolutionary events, allowing to place the occurrence of a WGDs in a specific branch of the evolutionary history of the species.
+*ksrates* is an open-source tool offering a rate-adjustment strategy that brings all the distributions to a common *K*:sub:`S` scale by compensating for the differences in synonymous substitution rates relative to the species of interest, the focal species. The final mixed plot produced by *ksrates* features rate-adjusted positions of the ortholog *K*:sub:`S` estimates of species divergence times that help to clarify the phylogenetic placement of WGDs inferred in the focal species.
 
-The reliability of a mixed plot can be jeopardized in case of (remarkable) substitution rate differences between the involved species. In fact, since the *K*:sub:`S` value of a homolog pair depends on the substitution rate of the species, different distributions end up to be built on different *K*:sub:`S` scales. A direct overlap of distributions is therefore likely to lead to unreliable interpretations.
-
-The *K*:sub:`S` rate-adjustment package offers an adjustment procedure that brings all the distributions to a common *K*:sub:`S` scale by compensating for the substitution rate differences relatively to one "main" species. 
-The rate-adjusted mixed plot obtained through ``ksrates`` is composed of a) a single paralog distribution coming from the focal species and b) one or more ortholog distributions between the focal species and the another species. The analysis is thus focused on the genome duplication history of the focal species in the context of its evolutionary history with the other species. 
-
-The rate-adjustment is applied to all the ortholog distributions. For each ortholog distribution, principles from the relative rate test (RRT) are used to detect the branch-specific *K*:sub:`S` contributions of the focal species and the other species to their overall ortholog *K*:sub:`S` distance. During the rate-adjustment, the ortholog *K*:sub:`S` peak is re-encoded as twice the branch contribution of the focal species, so that the age of the ortholog distribution is adapted to the *K*:sub:`S` scale of the paralog distribution. At the end, all ortholog distributions are seen from the perspective of the focal species rate.
-The rate-adjustment generates horizontal shifts of the ortholog distribution peak towards left if the focal species is slower than the other species, or towards right if it is faster. The new disposition of the divergence events can lead to a different and more reliable interpretation of WGD placement or of the order of the divergences themselves.
-For more details about the rate-adjustment strategy, see [...].
+For more detail about the methodology, please see our `preprint <https://www.biorxiv.org/content/10.1101/2021.02.28.433234v1>`__.
 
 
 .. _`explained_example`:
 
 Explained example
 =================
 
-This example studies the phylogenetic placement of WGD signatures present in oil palm (*Elaeis guineensis*) paralog distribution. The rate-adjustment pipeline needs a input phylogenetic tree and the sequence data of all involved species. The minimum input tree is composed by the focal species (palm), another species (rice) and their outgroup (asparagus): ``((palm, rice)), asparagus)``. 
-..  The mixed plot will show the palm paralog distribution overlapped with the rate-adjusted ortholog distributions involving palm and the other species in the input tree.
+This example studies the phylogenetic placement of the WGD signatures present in the paralog *K*:sub:`S` distribution of oil palm (*Elaeis guineensis*). This is done in the context of a small monocot phylogeny composed of the focal species (oil palm), *Oryza sativa* (rice) and *Asparagus officinalis* (asparagus) as their outgroup. The input tree in Newick format for this phylogeny is: ``((palm, rice), asparagus)``. From the evolutionary perspective of oil palm there are two species divergence nodes: palm--rice and palm--asparagus.
+
+.. figure:: _images/tree.svg
+    :align: center
+    :width: 250
+    :alt: Input phylogenetic tree composed of oil palm, rice and asparagus as the outgroup.
+
+The detection of substitution rate differences among lineages and the decomposition of ortholog *K*:sub:`S` mode estimates into branch-specific contributions use methodology similar to relative rate testing and require an outgroup species.
+Therefore, *ksrates* breaks down the input tree into *trios* composed of the focal species, a diverged species and an outgroup species. The input tree in this example defines only one such trio, ``palm, rice, asparagus``. Here the ortholog *K*:sub:`S` distribution of the palm--rice species divergence (or more specifically, its mode) will be rate-adjusted using asparagus as an outgroup.
 
-From the perspective of palm history there are two divergence events (i.e. ortholog distributions) in this tree, namely palm-rice and palm-asparagus. The pipeline breaks down the tree into *trios* composed by the species pair of a ortholog distribution and an outgroup used for its rate-adjustment. The example tree gives only one trio, "palm, rice, asparagus", where palm-rice divergence is rate-adjusted with outgroup asparagus. Palm-asparagus divergence has instead no outgroup in this tree and will be ignored; to avoid this, add another outgroup to the phylogeny, e.g. ``(((palm, rice), asparagus), spirodela)``. The user can also decide to perform multiple rate-adjustments for a divergence if the tree structure allows it: for example in this latter tree palm-rice can be rate-adjusted both with asparagus and spirodela (*Spirodela polyrhiza*).
+.. note ::
+    The palm--asparagus divergence has no outgroup in this tree and thus cannot be rate-adjusted; to make this possible, one would need to extend the phylogeny with one additional species that can function as their outgroup, e.g. *Spirodela polyrhiza*: ``(((palm, rice), asparagus), spirodela)``. 
+    By default, if more than one outgroup is available for a species pair, multiple rate-adjustments are performed and the mean among them is taken as consensus. For example, in the extended tree palm--rice would be adjusted both with ``asparagus`` and ``spirodela`` as the outgroup.
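
The trio breakdown described above can be sketched in Python. This is a minimal illustration, not the *ksrates* implementation: the nested-tuple tree encoding and the function names ``leaves``, ``sister_groups`` and ``trios`` are hypothetical, and the sketch assumes that any species branching off deeper in the tree than a focal--sister pair can serve as an outgroup for that pair.

```python
def leaves(tree):
    """Return all leaf names of a tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def sister_groups(tree, focal):
    """List the leaf sets that branch off along the path from the focal leaf
    to the root, innermost divergence first. Return None if focal is absent."""
    if isinstance(tree, str):
        return [] if tree == focal else None
    left, right = tree
    for child, other in ((left, right), (right, left)):
        path = sister_groups(child, focal)
        if path is not None:
            return path + [leaves(other)]
    return None


def trios(tree, focal):
    """Enumerate (focal, diverged species, outgroup) trios for a focal species."""
    groups = sister_groups(tree, focal)
    if groups is None:
        raise ValueError(f"focal species {focal!r} not found in tree")
    result = []
    for i, diverged in enumerate(groups):
        # Assumption: every group branching off deeper can act as outgroup.
        for outgroup_set in groups[i + 1:]:
            for sister in diverged:
                for outgroup in outgroup_set:
                    result.append((focal, sister, outgroup))
    return result


small_tree = (("palm", "rice"), "asparagus")
extended_tree = ((("palm", "rice"), "asparagus"), "spirodela")

print(trios(small_tree, "palm"))
# [('palm', 'rice', 'asparagus')]
print(trios(extended_tree, "palm"))
# [('palm', 'rice', 'asparagus'), ('palm', 'rice', 'spirodela'),
#  ('palm', 'asparagus', 'spirodela')]
```

In the small tree no trio has asparagus as the diverged species, which mirrors the statement that the palm--asparagus divergence cannot be adjusted without an additional outgroup.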
 
-Further on, the pipeline breaks down the trios into the three possible species pairs they are composed of, which in this case are palm-rice, palm-asparagus and rice-asparagus. ``wgd`` package then estimates the ortholog *K*:sub:`S` distribution for each of them. The ortholog distributions are simplified to a vertical line centered on their peak value (Figure 1).
+The three ortholog *K*:sub:`S` distributions obtained from the ``palm, rice, asparagus`` trio are approximated to their estimated mode with associated standard deviation (Figure 1; for more details please refer to the `Supplementary Materials <https://www.biorxiv.org/content/10.1101/2021.02.28.433234v1.supplementary-material>`__ of our preprint).
 
-.. figure:: _images/ortholog_distribution_peak.svg
+.. figure:: _images/orthologs_distribution_trio.svg
     :align: center
-    :width: 350
+    :width: 800
+
+    Figure 1: The three ortholog *K*:sub:`S` distributions for the ``palm, rice, asparagus`` trio. Their estimated mean mode is indicated by a black vertical line. A thin colored box ranges from one standard deviation (sd) below to one sd above the mean mode estimate.
+
+Using methodology similar to relative rate testing, the ortholog *K*:sub:`S` mode estimate between palm and rice (*K*:sub:`S`\=\1.53) is decomposed into the two branch-specific *K*:sub:`S` contributions: the palm branch contributes a low *K*:sub:`S` of 0.365 while the rice branch contributes a *K*:sub:`S` of 1.17. The considerable difference between them suggests that palm has a much lower synonymous substitution rate than rice.
 
-    The ortholog distribution for palm and rice is approximated to its mode (1.53 *K*:sub:`S`).
-
-The  *K*:sub:`S`decomposition uses the *K*:sub:`S` values of the three ortholog peaks to compute the branch-specific *K*:sub:`S` contributions of the divergent pair: palm has a branch contribution of about 0.36 while rice of 1.17, therefore palm accumulates substitution much more slowly than rice. Lastly, the rate-adjustment reinterprets the ortholog *K*:sub:`S` peak of palm-rice by encoding it as twice the branch contribution of palm (*K*:sub:`S`' = 0.73). The ortholog peak has therefore been largely shifted to the left from 1.53 to 0.73 *K*:sub:`S` (Figure 2), and it is now adapted to the slow scale of palm paralog distribution. The shift has important consequences in the interpretation of the mixed plot concerning the older WGD signal around 0.9 *K*:sub:`S`.
+The ortholog *K*:sub:`S` mode estimate of palm--rice is then rate adjusted by rescaling it to twice the contribution of the palm branch (*K*:sub:`S` --> 2 * 0.365 = 0.73). The position of the (mode) divergence line thus largely shifts towards the left from *K*:sub:`S`\=\1.53 to *K*:sub:`S`\=\0.73 (Figure 2)---it is now rate-adjusted to the *K*:sub:`S` scale of the paralog *K*:sub:`S` distribution of oil palm and shifted to the other side of the second visible WGD peak.
+The rate-adjusted mixed plot offers a different interpretation for the phylogenetic placement of the older WGD signature (located at a *K*:sub:`S` of around 0.9) than a naive mixed plot would: instead of suggesting the WGD to be a palm-specific event it is now suggested to be an event shared by both rice and palm. This would be consistent with the previously proposed monocot *tau* WGD event.
 
 .. figure:: _images/mixed_palm_corrected.svg
     :align: center
     :width: 800
 
-    The ortholog distribution peak (red line) has been shifted towards left after rate-adjustment, as highlighted by the red arrows starting from the original position and pointing at the new rate-adjusted position. 
+    Figure 2: Rate-adjusted mixed paralog--ortholog *K*:sub:`S` plot. The rate-adjusted ortholog *K*:sub:`S` estimate for oil palm and rice (red vertical line) is superimposed on the paralog *K*:sub:`S` distribution of oil palm. The vertical line has been shifted towards the left to the other side of the second WGD peak, as indicated by the red arrow below the plot.
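
The arithmetic of the explained example can be written out as a short Python sketch. It uses the standard three-taxon decomposition known from relative rate tests; that *ksrates* uses exactly this form is an assumption here, and the function names ``branch_contributions`` and ``rate_adjust`` are hypothetical. The palm--asparagus and rice--asparagus mode estimates are not reported in the text above, so the decomposition is demonstrated on a toy tree with known branch lengths, and only the final rescaling step uses the palm value given above.

```python
def branch_contributions(ks_focal_sister, ks_focal_out, ks_sister_out):
    """Split the focal--sister Ks distance into its two branch-specific parts
    using the three pairwise distances of a trio (three-taxon decomposition)."""
    focal_branch = (ks_focal_sister + ks_focal_out - ks_sister_out) / 2
    sister_branch = (ks_focal_sister + ks_sister_out - ks_focal_out) / 2
    return focal_branch, sister_branch


def rate_adjust(focal_branch):
    """Rescale a divergence to the focal species' Ks scale: twice its branch."""
    return 2 * focal_branch


# Toy tree (not palm data): focal branch 0.2, sister branch 0.5, and 0.9 from
# their common ancestor to the outgroup, giving pairwise distances
# 0.7 (focal--sister), 1.1 (focal--outgroup) and 1.4 (sister--outgroup).
focal, sister = branch_contributions(0.7, 1.1, 1.4)
print(round(focal, 3), round(sister, 3))  # 0.2 0.5

# Palm branch contribution reported in the text above.
print(round(rate_adjust(0.365), 2))  # 0.73
```

Because the two branch contributions always sum to the unadjusted focal--sister distance, the reported values can be cross-checked: 0.365 + 1.17 = 1.535, which agrees with the palm--rice mode of 1.53 up to rounding.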
diff --git a/doc/source/configuration.rst b/doc/source/configuration.rst
@@ -82,7 +82,7 @@ The [PARAMETERS] section includes:
 
 * For ortholog divergence *K*:sub:`S`
 
-    * **num_bootstrap_iterations**: number of bootstrap iterations for mode/median estimation. [Default: 200]
+    * **num_bootstrap_iterations**: number of bootstrap iterations for mode estimation. [Default: 200]
     * **divergence_colors**: list of colors assigned to the divergence nodes: all divergence lines coming from the same divergence node share the same color. [Default: 8 colors]
 
 * For the ortholog *K*:sub:`S` distribution plots
@@ -198,12 +198,12 @@ This is an optional configuration file that contains several \"expert\" paramete
     extra_paralogs_analyses_methods = no
 
 * **logging_level**: the lowest logging/verbosity level of messages printed to the console/logs (increasing severity levels: *notset*, *debug*, *info*, *warning*, *error*, *critical*). Messages less severe than *level* will be ignored; *notset* causes all messages to be processed. [Default: "info"]
-* **max_gene_family_size**: maximum number of members that any paralog gene family can have to be included in *K*:sub:`S` estimation. Large gene families increase the run time and are often composed of unrelated sequences grouped together by shared protein domains or repetitive sequences. But this is not always the case, so one may want to check manually the gene families in file ``paralog_distributions/wgd_<focal species>/<focal species>.mcl.tsv`` and increase (or even decrease) this number. [Default: 200]
+* **max_gene_family_size**: maximum number of members that any paralog gene family can have to be included in *K*:sub:`S` estimation. Large gene families increase the run time and are often composed of unrelated sequences grouped together by shared protein domains or repetitive sequences. But this is not always the case, so one may want to check manually the gene families in file ``paralog_distributions/wgd_species/species.mcl.tsv`` and increase (or even decrease) this number. [Default: 200]
 * **distribution_peak_estimate**: the statistical method used to obtain a single ortholog *K*:sub:`S` estimate for the divergence time of a species pair from its ortholog distribution or to obtain a single paralog *K*:sub:`S` estimate from an anchor *K*:sub:`S` cluster or from lognormal components in mixture models (options: "mode" or "median"). [Default: "mode"]
 * **kde_bandwidth_modifier**: modifier to adjust the fitting of the KDE curve on the underlying whole-paranome or anchor *K*:sub:`S` distribution. The KDE Scott's factor internally computed by SciPy tends to produce an overly smooth KDE curve, especially with steep WGD peaks, and therefore it is reduced by multiplying it by a modifier. Decreasing the modifier leads to tighter fits, increasing it leads to smoother fits, and setting it to 1 gives the default KDE factor. Note that a too small factor is likely to take into account data noise. [Default: 0.4]
 * **plot_adjustment_arrows**: flag to toggle the plotting of rate-adjustment arrows below the adjusted mixed paralog--ortholog *K*:sub:`S` plot. These arrows start from the original unadjusted ortholog divergence *K*:sub:`S` estimate and end on the rate-adjusted estimate (options: "yes" and "no"). [Default: "no"]
 * **num_mixture_model_initializations**: number of times the EM algorithm is initialized (either for the random initialization in the exponential-lognormal mixture model or for k-means in the lognormal mixture model). [Default: 10]
 * **max_mixture_model_iterations**: maximum number of EM iterations for mixture modeling. [Default: 300]
 * **max_mixture_model_components**: maximum number of components considered during execution of the mixture models. [Default: 5]
 * **max_mixture_model_ks**: upper limit for the *K*:sub:`S` range in which the exponential-lognormal and lognormal-only mixture models are performed. [Default: 5]
-* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods (see section :ref:`paralogs_analyses` and Supplementary Materials) [Default: "no"]
+* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods (see section :ref:`paralogs_analyses` and Supplementary Materials) [Default: "no"]
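
To make the effect of **kde_bandwidth_modifier** concrete, the following pure-Python sketch shows what multiplying Scott's factor by a modifier means for a one-dimensional Gaussian KDE. It is an illustration under stated assumptions, not *ksrates* or SciPy code: the function names are hypothetical, the data are made-up numbers, and the bandwidth is taken as Scott's factor ``n ** (-1/5)`` times the sample standard deviation, which is the usual convention for one-dimensional data.

```python
import math
import statistics


def scott_factor(n):
    """Scott's factor for one-dimensional data with n observations."""
    return n ** (-1 / 5)


def kde_bandwidth(data, modifier=0.4):
    """Kernel standard deviation after scaling Scott's factor by a modifier."""
    return scott_factor(len(data)) * modifier * statistics.stdev(data)


def kde_density(data, x, modifier=0.4):
    """Evaluate a Gaussian KDE of the data at point x."""
    h = kde_bandwidth(data, modifier)
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm


# Made-up Ks values with two clusters, for illustration only.
ks_values = [0.10, 0.20, 0.25, 0.30, 0.90, 1.00, 1.10]

for modifier in (1.0, 0.4):
    h = kde_bandwidth(ks_values, modifier)
    print(f"modifier={modifier}: bandwidth={h:.3f}")
```

A modifier of 1 leaves the default bandwidth unchanged, while 0.4 shrinks each kernel to 40% of that width, so the curve follows narrow peaks more closely at the cost of also following more noise.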
diff --git a/doc/source/faqs.rst b/doc/source/faqs.rst
@@ -1,6 +1,6 @@
-******************
-FAQs about ksrates
-******************
+********************
+FAQs about *ksrates*
+********************
 
 Nextflow
 ========