Update Documentation (concept and input_output) #2

Merged
merged 35 commits into from
Mar 25, 2021
Changes from 32 commits
98daa45
Update documentation (concept.rst)
Mar 9, 2021
68212ac
Update Documentation (input_output.rst)
Mar 9, 2021
bd5d3ef
Update usage.rst (file paths and descriptions)
Mar 9, 2021
1a53355
Apply suggestions for concept.rst
Cecilia-Sensalari Mar 17, 2021
1b797d4
Apply suggestions in input_output.txt
Cecilia-Sensalari Mar 17, 2021
a37c907
Apply suggestions in usage.rst
Cecilia-Sensalari Mar 17, 2021
b44ab6b
Merge master into docs
Mar 18, 2021
69f9fcb
Update concept.rst
Cecilia-Sensalari Mar 18, 2021
51ea14b
Merge branch 'master' into docs
Cecilia-Sensalari Mar 18, 2021
9f784ab
Delete .buildinfo
Cecilia-Sensalari Mar 18, 2021
8895d4c
Describe ks.tsv file format
Mar 19, 2021
72de507
Remove "qsub -b y"
Mar 19, 2021
8ac5d92
Add (E)LMM file format description; internal links
Mar 20, 2021
6a71e34
Add (E)LMM file figures
Mar 20, 2021
78bbc84
Rename header in ELMM TSV file
Mar 20, 2021
cac9706
Restructure mixture model section and add table
Mar 20, 2021
e8999dc
Reduce paralog analyses description in docs
Mar 21, 2021
cf82d99
Add Note block in input_output.rst about filenames
Mar 22, 2021
f76c6eb
Minor change in Note block
Mar 22, 2021
9fa0dc6
Rename headers in main.nf
Cecilia-Sensalari Mar 23, 2021
9b14e55
Apply suggestions to concept.rst
Cecilia-Sensalari Mar 24, 2021
6aca23c
Apply suggestions to input_output.rst
Cecilia-Sensalari Mar 24, 2021
5506420
Apply suggestions to usage.rst
Cecilia-Sensalari Mar 24, 2021
b48a844
Apply suggestions to paralogs_analyses.rst
Cecilia-Sensalari Mar 24, 2021
ba4a9d4
Nextflow header with "ksrates"
Mar 24, 2021
3f48511
Italics "ksrates" in FAQs header
Mar 24, 2021
8ef4f00
Add preprint DOI in "How to cite us" doc page
Mar 24, 2021
42b4075
Remove "distribution_peak_estimate"
Mar 24, 2021
26e5d04
Update input_output.rst with other suggestions
Mar 24, 2021
0343c1d
Update usage.rst with other suggestions
Mar 24, 2021
9e0000e
Update par_analys.rst with other suggestions
Mar 24, 2021
5f67a59
Update installation.rst with other suggestions
Mar 24, 2021
96e9069
Apply suggestions to How to cite us
Cecilia-Sensalari Mar 25, 2021
2534128
Reintroduce "distribution_peak_estimate" parameter
Mar 25, 2021
106deb9
Replace <focal species> with species (filename)
Cecilia-Sensalari Mar 25, 2021
Binary file added doc/source/_images/elmm.png
Binary file added doc/source/_images/ks_tsv.png
Binary file added doc/source/_images/lmm.png
1,966 changes: 0 additions & 1,966 deletions doc/source/_images/ortholog_distribution_peak.svg

This file was deleted.

1 change: 1 addition & 0 deletions doc/source/_images/orthologs_distribution_trio.svg
1 change: 1 addition & 0 deletions doc/source/_images/tree.svg
4 changes: 4 additions & 0 deletions doc/source/citation_acknowledgement.rst
@@ -1,3 +1,7 @@

How to cite us
==============

Preprint deposited in bioRxiv:

Sensalari C., Lohaus R. and Maere S. 2021. *ksrates: positioning whole-genome duplications relative to speciation events using rate-adjusted mixed paralog–ortholog KS distributions*. bioRxiv doi: https://doi.org/10.1101/2021.02.28.433234
46 changes: 26 additions & 20 deletions doc/source/concept.rst
@@ -4,42 +4,48 @@
Substitution rate-adjustment strategy in a nutshell
===================================================

``ksrates`` is a package for substitution rate-adjustment in mixed ortholog and paralog *K*:sub:`S` distributions.
To position ancient whole-genome duplication (WGD) events with respect to speciation events in a phylogeny, it is common practice to superimpose a paralog *K*:sub:`S` distribution for a species of interest with ortholog *K*:sub:`S` distributions between this species and other species, resulting in a mixed paralog--ortholog *K*:sub:`S` plot.
However, when the lineages involved exhibit different substitution rates, the various *K*:sub:`S` distributions are built on different *K*:sub:`S` scales and a direct comparison among them is likely to mislead the phylogenetic interpretation of WGD signatures or order of divergences.

Mixed *K*:sub:`S` distributions are one of the approaches applied to detect whole-genome duplications (WGDs) and to locate them in a phylogeny. A mixed plot is composed of ortholog *K*:sub:`S` distributions - representing divergence events - overlapped onto paralog *K*:sub:`S` distributions - representing the duplication history of a species genome. The relative positions of the ortholog peaks and the WGD peaks are informative about the order of the depicted evolutionary events, making it possible to place the occurrence of a WGD on a specific branch of the evolutionary history of the species.
*ksrates* is an open-source tool offering a rate-adjustment strategy that brings all the distributions to a common *K*:sub:`S` scale by compensating for the differences in synonymous substitution rates relative to the species of interest, the focal species. The final mixed plot produced by *ksrates* features rate-adjusted positions of the ortholog *K*:sub:`S` estimates of species divergence times that help to clarify the phylogenetic placement of WGDs inferred in the focal species.

The reliability of a mixed plot can be jeopardized in case of (remarkable) substitution rate differences between the involved species. In fact, since the *K*:sub:`S` value of a homolog pair depends on the substitution rate of the species, different distributions end up to be built on different *K*:sub:`S` scales. A direct overlap of distributions is therefore likely to lead to unreliable interpretations.

The *K*:sub:`S` rate-adjustment package offers an adjustment procedure that brings all the distributions to a common *K*:sub:`S` scale by compensating for the substitution rate differences relatively to one "main" species.
The rate-adjusted mixed plot obtained through ``ksrates`` is composed of a) a single paralog distribution coming from the focal species and b) one or more ortholog distributions between the focal species and another species. The analysis is thus focused on the genome duplication history of the focal species in the context of its evolutionary history with the other species.

The rate-adjustment is applied to all the ortholog distributions. For each ortholog distribution, principles from the relative rate test (RRT) are used to detect the branch-specific *K*:sub:`S` contributions of the focal species and the other species to their overall ortholog *K*:sub:`S` distance. During the rate-adjustment, the ortholog *K*:sub:`S` peak is re-encoded as twice the branch contribution of the focal species, so that the age of the ortholog distribution is adapted to the *K*:sub:`S` scale of the paralog distribution. At the end, all ortholog distributions are seen from the perspective of the focal species rate.
The rate-adjustment generates horizontal shifts of the ortholog distribution peak towards left if the focal species is slower than the other species, or towards right if it is faster. The new disposition of the divergence events can lead to a different and more reliable interpretation of WGD placement or of the order of the divergences themselves.
For more details about the rate-adjustment strategy, see [...].
For more detail about the methodology, please see our `preprint <https://www.biorxiv.org/content/10.1101/2021.02.28.433234v1>`__.


.. _`explained_example`:

Explained example
=================

This example studies the phylogenetic placement of WGD signatures present in the oil palm (*Elaeis guineensis*) paralog distribution. The rate-adjustment pipeline needs an input phylogenetic tree and the sequence data of all involved species. The minimum input tree is composed of the focal species (palm), another species (rice) and their outgroup (asparagus): ``((palm, rice), asparagus)``.
.. The mixed plot will show the palm paralog distribution overlapped with the rate-adjusted ortholog distributions involving palm and the other species in the input tree.
This example use case studies the phylogenetic placement of WGD signatures present in the paralog *K*:sub:`S` distribution of oil palm (*Elaeis guineensis*). This is done in the context of a small monocot phylogeny composed of the focal species (oil palm), *Oryza sativa* (rice) and *Asparagus officinalis* (asparagus) as their outgroup. The input tree in Newick format for this phylogeny is: ``((palm, rice), asparagus)``. From the evolutionary perspective of oil palm there are two species divergence nodes: palm--rice and palm--asparagus.

.. figure:: _images/tree.svg
:align: center
:width: 250
:alt: Input phylogenetic tree composed by oil palm, rice and asparagus as the outgroup.

The detection of substitution rate differences among lineages and the decomposition of ortholog *K*:sub:`S` mode estimates into branch-specific contributions use methodology similar to relative rate testing and require the help of an outgroup species.
Therefore, *ksrates* breaks down the input tree into *trios* composed of the focal species, a diverged species and an outgroup species. The input tree in this example defines only one such trio, ``palm, rice, asparagus``. Here the ortholog *K*:sub:`S` distribution of the palm--rice species divergence (or more specifically, its mode) will be rate-adjusted using asparagus as an outgroup.

From the perspective of palm history there are two divergence events (i.e. ortholog distributions) in this tree, namely palm-rice and palm-asparagus. The pipeline breaks down the tree into *trios* composed of the species pair of an ortholog distribution and an outgroup used for its rate-adjustment. The example tree gives only one trio, "palm, rice, asparagus", where the palm-rice divergence is rate-adjusted with outgroup asparagus. The palm-asparagus divergence has instead no outgroup in this tree and will be ignored; to avoid this, add another outgroup to the phylogeny, e.g. ``(((palm, rice), asparagus), spirodela)``. The user can also decide to perform multiple rate-adjustments for a divergence if the tree structure allows it: for example in this latter tree palm-rice can be rate-adjusted both with asparagus and spirodela (*Spirodela polyrhiza*).
.. note ::
The palm--asparagus divergence has no outgroup in this tree and thus cannot be rate-adjusted; to adjust it, one would need to extend the phylogeny with an additional species that can function as the outgroup, e.g. *Spirodela polyrhiza*: ``(((palm, rice), asparagus), spirodela)``.
By default, if more than one outgroup is available for a species pair, multiple rate-adjustments are performed and their mean is taken as the consensus. For example, in the extended tree palm--rice would be adjusted with both ``asparagus`` and ``spirodela`` as the outgroup.
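The breakdown of a tree into trios can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the *ksrates* implementation: the tree is written by hand as nested tuples instead of being parsed from Newick, and the function names ``leaves``, ``sister_groups`` and ``trios`` are hypothetical.

```python
def leaves(node):
    """Return the species names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def sister_groups(tree, focal):
    """List the sister clades of the focal species, most recent divergence first."""
    if isinstance(tree, str):
        return []
    for i, child in enumerate(tree):
        if focal in leaves(child):
            # Every species in the other children diverged from the focal
            # species at this node.
            sisters = [name for j, other in enumerate(tree) if j != i
                       for name in leaves(other)]
            return sister_groups(child, focal) + [sisters]
    raise ValueError(f"{focal!r} is not in the tree")


def trios(tree, focal):
    """Enumerate (focal, diverged species, outgroup) trios for the focal species."""
    groups = sister_groups(tree, focal)
    result = []
    for i, group in enumerate(groups):
        for diverged in group:
            # Any species that split off at a deeper node can serve as outgroup.
            for outer in groups[i + 1:]:
                for outgroup in outer:
                    result.append((focal, diverged, outgroup))
    return result
```

Called on ``(("palm", "rice"), "asparagus")`` with ``"palm"`` as focal species, the sketch returns the single trio ``palm, rice, asparagus``. On the extended tree it additionally returns ``palm, rice, spirodela`` and ``palm, asparagus, spirodela``, so palm--rice has two outgroups available and palm--asparagus has one.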

Further on, the pipeline breaks down the trios into the three possible species pairs they are composed of, which in this case are palm-rice, palm-asparagus and rice-asparagus. ``wgd`` package then estimates the ortholog *K*:sub:`S` distribution for each of them. The ortholog distributions are simplified to a vertical line centered on their peak value (Figure 1).
The three ortholog *K*:sub:`S` distributions obtained from the ``palm, rice, asparagus`` trio are approximated to their estimated mode with associated standard deviation (Figure 1; for more details please refer to the `Supplementary Materials <https://www.biorxiv.org/content/10.1101/2021.02.28.433234v1.supplementary-material>`__ of our preprint).

.. figure:: _images/ortholog_distribution_peak.svg
.. figure:: _images/orthologs_distribution_trio.svg
Contributor review comment: We should remove the medians in the figure once we submit to Bioinformatics.

:align: center
:width: 350
:width: 800

Figure 1: The three ortholog *K*:sub:`S` distributions for the ``palm, rice, asparagus`` trio. Their estimated mean mode is indicated by a black vertical line. A thin colored box ranges from one standard deviation (sd) below to one sd above the mean mode estimate.
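How a mean mode and its standard deviation could be obtained by bootstrapping is sketched below. This is a minimal illustration under stated assumptions: the mode of each bootstrap sample is taken as the maximum of a Gaussian KDE evaluated on a grid, which may differ from what *ksrates* does internally, and the function names ``kde_mode`` and ``bootstrap_mode`` are hypothetical. Only the default of 200 iterations comes from the ``num_bootstrap_iterations`` parameter.

```python
import math
import random
import statistics


def kde_mode(values, grid_size=60):
    """Return the grid point where a Gaussian KDE of the values is highest."""
    n = len(values)
    # Kernel width from Scott's rule (an assumption made for this sketch).
    bandwidth = statistics.stdev(values) * n ** (-1 / 5)
    low, high = min(values), max(values)
    grid = [low + (high - low) * i / (grid_size - 1) for i in range(grid_size)]

    def density(x):
        # Unnormalized density; the normalization does not move the maximum.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return max(grid, key=density)


def bootstrap_mode(values, iterations=200, seed=0):
    """Return the mean and standard deviation of the mode over bootstrap samples."""
    rng = random.Random(seed)
    modes = [
        kde_mode(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(iterations)
    ]
    return statistics.mean(modes), statistics.stdev(modes)
```

Given a list of ortholog *K*:sub:`S` values, ``bootstrap_mode(ks_values)`` returns the two numbers that Figure 1 draws as the black vertical line and the width of the colored box.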

Using methodology similar to relative rate testing, the ortholog *K*:sub:`S` mode estimate between palm and rice (*K*:sub:`S`\=\1.53) is decomposed into the two branch-specific *K*:sub:`S` contributions: the palm branch contributes a low *K*:sub:`S` of 0.365 while the rice branch contributes a *K*:sub:`S` of 1.17. The considerable difference between them suggests that palm has a much lower synonymous substitution rate than rice.

The ortholog distribution for palm and rice is approximated to its mode (1.53 *K*:sub:`S`).

The *K*:sub:`S` decomposition uses the *K*:sub:`S` values of the three ortholog peaks to compute the branch-specific *K*:sub:`S` contributions of the divergent pair: palm has a branch contribution of about 0.36 while rice has one of 1.17, therefore palm accumulates substitutions much more slowly than rice. Lastly, the rate-adjustment reinterprets the ortholog *K*:sub:`S` peak of palm-rice by encoding it as twice the branch contribution of palm (*K*:sub:`S`' = 0.73). The ortholog peak has therefore been largely shifted to the left from 1.53 to 0.73 *K*:sub:`S` (Figure 2), and it is now adapted to the slow scale of the palm paralog distribution. The shift has important consequences in the interpretation of the mixed plot concerning the older WGD signal around 0.9 *K*:sub:`S`.
The ortholog *K*:sub:`S` mode estimate of palm--rice is then rate-adjusted by rescaling it to twice the contribution of the palm branch (*K*:sub:`S` --> 2 * 0.365 = 0.73). The position of the (mode) divergence line thus shifts considerably to the left, from *K*:sub:`S`\=\1.53 to *K*:sub:`S`\=\0.73 (Figure 2). It is now rate-adjusted to the *K*:sub:`S` scale of the paralog *K*:sub:`S` distribution of oil palm and lies on the other side of the second visible WGD peak.
The rate-adjusted mixed plot offers a different interpretation for the phylogenetic placement of the older WGD signature (located at a *K*:sub:`S` of around 0.9) than a naive mixed plot would: instead of suggesting the WGD to be a palm-specific event it is now suggested to be an event shared by both rice and palm. This would be consistent with the previously proposed monocot *tau* WGD event.
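The arithmetic of the decomposition and rescaling can be sketched as follows. The sketch assumes the standard three-taxon relation used in relative rate tests, in which the branch leading to one species equals half of (its distance to the sister species plus its distance to the outgroup minus the sister-outgroup distance); whether *ksrates* computes it in exactly this form is an assumption. The function names are hypothetical, and the two outgroup distances in the usage note are placeholders, not values from the preprint.

```python
def branch_contributions(ks_focal_sister, ks_focal_out, ks_sister_out):
    """Split the focal-sister KS distance into its two branch-specific parts.

    Uses the three-taxon relation of relative rate tests (assumed here).
    """
    focal = (ks_focal_sister + ks_focal_out - ks_sister_out) / 2
    sister = (ks_focal_sister + ks_sister_out - ks_focal_out) / 2
    return focal, sister


def rate_adjusted_ks(ks_focal_sister, ks_focal_out, ks_sister_out):
    """Rescale the divergence to twice the focal branch contribution."""
    focal, _ = branch_contributions(ks_focal_sister, ks_focal_out, ks_sister_out)
    return 2 * focal
```

With the palm--rice mode of 1.53 and placeholder outgroup modes of 1.00 (palm--asparagus) and 1.80 (rice--asparagus), chosen only so that the arithmetic reproduces the contributions quoted above, ``branch_contributions(1.53, 1.00, 1.80)`` gives approximately 0.365 for palm and 1.165 for rice, and ``rate_adjusted_ks`` gives approximately 0.73.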

.. figure:: _images/mixed_palm_corrected.svg
:align: center
:width: 800

The ortholog distribution peak (red line) has been shifted towards left after rate-adjustment, as highlighted by the red arrows starting from the original position and pointing at the new rate-adjusted position.
Figure 2: Rate-adjusted mixed paralog--ortholog *K*:sub:`S` plot. The rate-adjusted ortholog *K*:sub:`S` estimate for oil palm and rice (red vertical line) is superimposed on the paralog *K*:sub:`S` distribution of oil palm. The vertical line has been shifted towards the left to the other side of the second WGD peak, as indicated by the red arrow below the plot.
6 changes: 2 additions & 4 deletions doc/source/configuration.rst
@@ -82,7 +82,7 @@ The [PARAMETERS] section includes:

* For ortholog divergence *K*:sub:`S`

* **num_bootstrap_iterations**: number of bootstrap iterations for mode/median estimation. [Default: 200]
* **num_bootstrap_iterations**: number of bootstrap iterations for mode estimation. [Default: 200]
* **divergence_colors**: list of colors assigned to the divergence nodes: all divergence lines coming from the same divergence node share the same color. [Default: 8 colors]

* For the ortholog *K*:sub:`S` distribution plots
@@ -188,7 +188,6 @@ This is an optional configuration file that contains several \"expert\" paramete

logging_level = info
max_gene_family_size = 200
distribution_peak_estimate = mode
kde_bandwidth_modifier = 0.4
plot_adjustment_arrows = no
num_mixture_model_initializations = 10
@@ -199,11 +198,10 @@ This is an optional configuration file that contains several \"expert\" paramete

* **logging_level**: the lowest logging/verbosity level of messages printed to the console/logs (increasing severity levels: *notset*, *debug*, *info*, *warning*, *error*, *critical*). Messages less severe than *level* will be ignored; *notset* causes all messages to be processed. [Default: "info"]
* **max_gene_family_size**: maximum number of members that any paralog gene family can have to be included in *K*:sub:`S` estimation. Large gene families increase the run time and are often composed of unrelated sequences grouped together by shared protein domains or repetitive sequences. But this is not always the case, so one may want to check manually the gene families in file ``paralog_distributions/wgd_<focal species>/<focal species>.mcl.tsv`` and increase (or even decrease) this number. [Default: 200]
* **distribution_peak_estimate**: the statistical method used to obtain a single ortholog *K*:sub:`S` estimate for the divergence time of a species pair from its ortholog distribution or to obtain a single paralog *K*:sub:`S` estimate from an anchor *K*:sub:`S` cluster or from lognormal components in mixture models (options: "mode" or "median"). [Default: "mode"]
Contributor review comment: Is this already consistently removed everywhere? Also in the code? Might be a bit early, it's also still in the preprint.

* **kde_bandwidth_modifier**: modifier to adjust the fitting of the KDE curve on the underlying whole-paranome or anchor *K*:sub:`S` distribution. The KDE Scott's factor internally computed by SciPy tends to produce an overly smooth KDE curve, especially with steep WGD peaks, and therefore it is reduced by multiplying it by a modifier. Decreasing the modifier leads to tighter fits, increasing it leads to smoother fits, and setting it to 1 gives the default KDE factor. Note that too small a modifier is likely to make the curve fit noise in the data. [Default: 0.4]
* **plot_adjustment_arrows**: flag to toggle the plotting of rate-adjustment arrows below the adjusted mixed paralog--ortholog *K*:sub:`S` plot. These arrows start from the original unadjusted ortholog divergence *K*:sub:`S` estimate and end on the rate-adjusted estimate (options: "yes" and "no"). [Default: "no"]
* **num_mixture_model_initializations**: number of times the EM algorithm is initialized (either for the random initialization in the exponential-lognormal mixture model or for k-means in the lognormal mixture model). [Default: 10]
* **max_mixture_model_iterations**: maximum number of EM iterations for mixture modeling. [Default: 300]
* **max_mixture_model_components**: maximum number of components considered during execution of the mixture models. [Default: 5]
* **max_mixture_model_ks**: upper limit for the *K*:sub:`S` range in which the exponential-lognormal and lognormal-only mixture models are performed. [Default: 5]
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non-default mixture model methods (see section :ref:`paralogs_analyses` and Supplementary Materials). [Default: "no"]
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non-default mixture model methods (see section :ref:`paralogs_analyses` and Supplementary Materials). [Default: "no"]
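The effect of ``kde_bandwidth_modifier`` can be illustrated with a small standard-library sketch. It assumes a one-dimensional Gaussian KDE whose kernel width is the sample standard deviation times Scott's factor ``n ** (-1/5)``, scaled by the modifier. The function names are hypothetical and this is not the *ksrates* code, which relies on SciPy.

```python
import math
import statistics


def kde_bandwidth(values, modifier=0.4):
    """Kernel width: sample standard deviation x Scott's factor x modifier."""
    scott_factor = len(values) ** (-1 / 5)  # Scott's rule for 1-D data
    return statistics.stdev(values) * scott_factor * modifier


def kde_density(x, values, modifier=0.4):
    """Evaluate the Gaussian KDE of the values at position x."""
    h = kde_bandwidth(values, modifier)
    norm = len(values) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values) / norm
```

A modifier of 1 reproduces the unmodified Scott's factor, while 0.4 narrows the kernels so that steep peaks are followed more tightly. With SciPy, a comparable result should be obtainable by passing a scalar or a callable as ``bw_method`` to ``scipy.stats.gaussian_kde``.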
6 changes: 3 additions & 3 deletions doc/source/faqs.rst
@@ -1,6 +1,6 @@
******************
FAQs about ksrates
******************
********************
FAQs about *ksrates*
********************

Nextflow
========