Update Documentation (concept and input_output) #2

Merged
merged 35 commits into from
Mar 25, 2021
Changes from 17 commits
Commits
98daa45
Update documentation (concept.rst)
Mar 9, 2021
68212ac
Update Documentation (input_output.rst)
Mar 9, 2021
bd5d3ef
Update usage.rst (file paths and descriptions)
Mar 9, 2021
1a53355
Apply suggestions for concept.rst
Cecilia-Sensalari Mar 17, 2021
1b797d4
Apply suggestions in input_output.txt
Cecilia-Sensalari Mar 17, 2021
a37c907
Apply suggestions in usage.rst
Cecilia-Sensalari Mar 17, 2021
b44ab6b
Merge master into docs
Mar 18, 2021
69f9fcb
Update concept.rst
Cecilia-Sensalari Mar 18, 2021
51ea14b
Merge branch 'master' into docs
Cecilia-Sensalari Mar 18, 2021
9f784ab
Delete .buildinfo
Cecilia-Sensalari Mar 18, 2021
8895d4c
Describe ks.tsv file format
Mar 19, 2021
72de507
Remove "qsub -b y"
Mar 19, 2021
8ac5d92
Add (E)LMM file format description; internal links
Mar 20, 2021
6a71e34
Add (E)LMM file figures
Mar 20, 2021
78bbc84
Rename header in ELMM TSV file
Mar 20, 2021
cac9706
Restructure mixture model section and add table
Mar 20, 2021
e8999dc
Reduce paralog analyses description in docs
Mar 21, 2021
cf82d99
Add Note block in input_output.rst about filenames
Mar 22, 2021
f76c6eb
Minor change in Note block
Mar 22, 2021
9fa0dc6
Rename headers in main.nf
Cecilia-Sensalari Mar 23, 2021
9b14e55
Apply suggestions to concept.rst
Cecilia-Sensalari Mar 24, 2021
6aca23c
Apply suggestions to input_output.rst
Cecilia-Sensalari Mar 24, 2021
5506420
Apply suggestions to usage.rst
Cecilia-Sensalari Mar 24, 2021
b48a844
Apply suggestions to paralogs_analyses.rst
Cecilia-Sensalari Mar 24, 2021
ba4a9d4
Nextflow header with "ksrates"
Mar 24, 2021
3f48511
Italics "ksrates" in FAQs header
Mar 24, 2021
8ef4f00
Add preprint DOI in "How to cite us" doc page
Mar 24, 2021
42b4075
Remove "distribution_peak_estimate"
Mar 24, 2021
26e5d04
Update input_output.rst with other suggestions
Mar 24, 2021
0343c1d
Update usage.rst with other suggestions
Mar 24, 2021
9e0000e
Update par_analys.rst with other suggestions
Mar 24, 2021
5f67a59
Update installation.rst with other suggestions
Mar 24, 2021
96e9069
Apply suggestions to How to cite us
Cecilia-Sensalari Mar 25, 2021
2534128
Reintroduce "distribution_peak_estimate" parameter
Mar 25, 2021
106deb9
Replace <focal species> with species (filename)
Cecilia-Sensalari Mar 25, 2021
Binary file added doc/source/_images/elmm.png
Binary file added doc/source/_images/ks_tsv.png
Binary file added doc/source/_images/lmm.png
1,966 changes: 0 additions & 1,966 deletions doc/source/_images/ortholog_distribution_peak.svg

This file was deleted.

1 change: 1 addition & 0 deletions doc/source/_images/orthologs_distribution_trio.svg
1 change: 1 addition & 0 deletions doc/source/_images/tree.svg
42 changes: 22 additions & 20 deletions doc/source/concept.rst
@@ -4,46 +4,48 @@
Substitution rate-adjustment strategy in a nutshell
===================================================

To position ancient WGD events with respect to speciation events in a phylogeny, it is common practice to superimpose a paralog *K*:sub:`S` distribution for a species of interest with ortholog *K*:sub:`S` distributions between this species and other species to obtain a mixed plot.
However, when the lineages involved exhibit different substitution rates, the *K*:sub:`S` distributions are built on different *K*:sub:`S` scales and a direct comparison among them is likely to mislead the phylogenetic interpretation of WGD signatures or the divergence order.
To position ancient whole-genome duplication (WGD) events with respect to speciation events in a phylogeny, it is common practice to superimpose a paralog *K*:sub:`S` distribution for a species of interest with ortholog *K*:sub:`S` distributions between this species and other species, resulting in a mixed paralog--ortholog *K*:sub:`S` plot.
However, when the lineages involved exhibit different substitution rates, the various *K*:sub:`S` distributions are built on different *K*:sub:`S` scales and a direct comparison among them is likely to mislead the phylogenetic interpretation of WGD signatures or order of divergences.

``ksrates`` is an open-source tool offering a rate-adjustment strategy that brings all the distributions to a common *K*:sub:`S` scale by compensating for the synonymous substitution rate differences relative to one species. The final mixed plot produced by ``ksrates`` features adjustments in the position of the ortholog *K*:sub:`S` distributions that help in the clarification of WGD placement in the context of the provided phylogenetic tree.
*ksrates* is an open-source tool offering a rate-adjustment strategy that brings all the distributions to a common *K*:sub:`S` scale by compensating for the differences in synonymous substitution rates relative to one focal species. The final mixed plot produced by *ksrates* features rate-adjusted positions of the ortholog *K*:sub:`S` distributions that help clarify the placement of WGDs in the context of the provided phylogenetic tree.

For more details about the rate-adjustment strategy, see our `preprint <https://www.biorxiv.org/content/10.1101/2021.02.28.433234v1>`__.
For more detail about the methodology, please see our `preprint <https://www.biorxiv.org/content/10.1101/2021.02.28.433234v1>`__.


.. _`explained_example`:

Explained example
=================

This explained example studies the phylogenetic placement of WGD signatures present in oil palm (*Elaeis guineensis*) paralog *K*:sub:`S` distribution in the context of a small monocot phylogeny composed of the species of interest (oil palm), *Oryza sativa* (rice) and their outgroup *Asparagus officinalis* (asparagus). This input tree is provided in Newick format: ``((palm, rice), asparagus)``.
From the perspective of oil palm history there are two divergence nodes (i.e. ortholog *K*:sub:`S` distributions) to be rate-adjusted, namely palm-rice and palm-asparagus.
This explained example studies the phylogenetic placement of WGD signatures present in the paralog *K*:sub:`S` distribution of oil palm (*Elaeis guineensis*). This is done in the context of a small monocot phylogeny composed of the focal species (oil palm), *Oryza sativa* (rice) and their outgroup *Asparagus officinalis* (asparagus). The input tree in Newick format for this phylogeny is: ``((palm, rice), asparagus)``. From the evolutionary perspective of oil palm there are two species divergence nodes: palm--rice and palm--asparagus.

The detection of substitution rate differences makes use of principles of the relative rate test (REF) and therefore requires an outgroup species.
The pipeline breaks down the tree into *trios* composed of the species pair of an ortholog distribution and the outgroup used for its rate-adjustment. The example tree gives only one trio, "palm, rice, asparagus", where the palm-rice divergence will be rate-adjusted with outgroup asparagus.
.. figure:: _images/tree.svg
:align: center
:width: 250
:alt: Input phylogenetic tree composed of oil palm, rice and asparagus as the outgroup.

The detection of substitution rate differences among lineages and the decomposition of ortholog *K*:sub:`S` mode estimates into branch-specific contributions use methodology similar to relative rate testing and require the help of an outgroup species.
Therefore, *ksrates* breaks down the input tree into *trios* composed of the focal species, a diverged species and an outgroup species. The input tree in this example defines only one such trio, ``palm, rice, asparagus``. Here the ortholog *K*:sub:`S` distribution of the palm--rice species divergence (or more specifically, its mode) will be rate-adjusted using asparagus as an outgroup.
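
The trio decomposition described above can be sketched in a few lines of Python. This is only an illustration, not the *ksrates* implementation: the tree is written as nested tuples instead of being parsed from Newick, and the helper names (``leaves``, ``sister_groups``, ``trios``) are hypothetical.

```python
def leaves(node):
    """Return the species names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def sister_groups(tree, focal):
    """Walk from the root down to the focal species.

    At every node on the path, collect the species that branch off there.
    The result is ordered from the oldest divergence to the most recent one.
    """
    groups = []
    node = tree
    while not isinstance(node, str):
        on_path = [child for child in node if focal in leaves(child)]
        if len(on_path) != 1:
            raise ValueError(f"focal species {focal!r} not found exactly once")
        groups.append(
            [leaf for child in node if child is not on_path[0] for leaf in leaves(child)]
        )
        node = on_path[0]
    return groups


def trios(tree, focal):
    """List (focal, diverged species, outgroup) trios.

    A species can serve as outgroup for a divergence only if it branched
    off at an older node on the path to the focal species.
    """
    groups = sister_groups(tree, focal)
    result = []
    for i, sisters in enumerate(groups):
        outgroups = [species for older in groups[:i] for species in older]
        for sister in sisters:
            for outgroup in outgroups:
                result.append((focal, sister, outgroup))
    return result


# Hypothetical nested-tuple encoding of the example tree.
example_tree = (("palm", "rice"), "asparagus")
print(trios(example_tree, "palm"))
```

With this encoding the palm--asparagus divergence yields no trio, because no species branches off earlier than asparagus in the example tree.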

.. note ::
Palm-asparagus divergence has no outgroup in this tree and can't be adjusted; to be able to take it into account one should extend the phylogeny with one extra species that can function as their outgroup, e.g. *Spirodela polyrhiza*: ``(((palm, rice), asparagus), spirodela)``.
By default, if more than one outgroup is available for a species pair, multiple rate-adjustments are performed and the mean among them is taken as consensus. For example, in the extended tree palm-rice would be adjusted both with ``asparagus`` and ``spirodela`` outgroups.
The palm--asparagus divergence has no outgroup in this tree and thus can't be rate adjusted; to be able to do so one would need to extend the phylogeny with one additional species that can function as their outgroup, e.g. *Spirodela polyrhiza*: ``(((palm, rice), asparagus), spirodela)``.
By default, if more than one outgroup is available for a species pair, multiple rate-adjustments are performed and the mean among them is taken as consensus. For example, in the extended tree palm--rice would be adjusted both with ``asparagus`` and ``spirodela`` as the outgroup.
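
As a minimal sketch of the default consensus described in the note, the following Python snippet averages the rate-adjusted values obtained with different outgroups. The function name is hypothetical, and the ``spirodela`` value is an invented number used only to show the arithmetic; only the ``asparagus`` value (0.73) comes from this example.

```python
from statistics import mean


def consensus_adjustment(adjusted_ks_by_outgroup):
    """Return the mean of the rate-adjusted K_S values, one per outgroup."""
    if not adjusted_ks_by_outgroup:
        raise ValueError("at least one outgroup is needed for rate-adjustment")
    return mean(adjusted_ks_by_outgroup.values())


# Rate-adjusted palm--rice estimates per outgroup.
# 0.73 is the value from the explained example; 0.77 is made up.
palm_rice_adjusted = {"asparagus": 0.73, "spirodela": 0.77}
print(consensus_adjustment(palm_rice_adjusted))
```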

The three ortholog *K*:sub:`S` distributions obtained from palm-rice-asparagus trio are approximated to their estimated mode (1.53 *K*:sub:`S`) with associated standard deviation (Figure 1; for more details please refer to Supplementary Materials, currently in preprint).
The three ortholog *K*:sub:`S` distributions obtained from the ``palm, rice, asparagus`` trio are approximated to their estimated mode with associated standard deviation (Figure 1; for more details please refer to `Supplementary Materials <https://www.biorxiv.org/content/10.1101/2021.02.28.433234v1.supplementary-material>`__, currently in preprint).

.. figure:: _images/ortholog_distribution_peak.svg
.. figure:: _images/orthologs_distribution_trio.svg
Review comment (Contributor): We should remove the medians in the figure once we submit to Bioinformatics.

:align: center
:width: 350
:alt: The figure shows the bell-shaped ortholog KS distribution obtained for palm and rice approximated to a vertical line passing through the estimated mode (1.53 KS). A thin colored rectangular box behind this line highlights the associated standard deviation (0.01 KS).
:width: 800

The ortholog distribution for palm and rice is approximated to a vertical line passing through its estimated mode (1.53 *K*:sub:`S`).
Figure 1: The three ortholog *K*:sub:`S` distributions for the ``palm, rice, asparagus`` trio. Their estimated mean mode is indicated by a black vertical line. A thin colored box ranges from one standard deviation (sd) below to one sd above the mean mode estimate.

Through principles of the relative rate test the ortholog *K*:sub:`S` estimate between palm and rice (1.53 *K*:sub:`S`) is decomposed into the two branch-specific *K*:sub:`S` contributions: palm contributes with 0.365 while rice with 1.17. The difference between them suggests that palm has a much lower substitution rate than rice.
Using methodology similar to relative rate testing the ortholog *K*:sub:`S` mode estimate between palm and rice (*K*:sub:`S`\=\1.53) is decomposed into the two branch-specific *K*:sub:`S` contributions: the palm branch contributes a low *K*:sub:`S` of 0.365 while the rice branch contributes a *K*:sub:`S` of 1.17. The considerable difference between them suggests that palm has a much lower synonymous substitution rate than rice.
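
The decomposition can be illustrated with the standard three-taxon branch-length formula used in relative rate testing; that *ksrates* uses exactly this form is an assumption here. The function name is hypothetical, and the two outgroup distances in the example are invented values chosen only so that the arithmetic reproduces the branch contributions quoted above (0.365 and 1.17, which sum to 1.535, reported as 1.53).

```python
def branch_contributions(ks_focal_sister, ks_focal_outgroup, ks_sister_outgroup):
    """Split the focal--sister K_S into the two branch-specific contributions.

    The outgroup distances tell how much of the focal--sister distance
    accumulated on each of the two branches since their divergence.
    """
    k_focal = (ks_focal_sister + ks_focal_outgroup - ks_sister_outgroup) / 2
    k_sister = (ks_focal_sister + ks_sister_outgroup - ks_focal_outgroup) / 2
    return k_focal, k_sister


# palm--rice mode estimate from the example (0.365 + 1.17).
ks_palm_rice = 1.535
# Made-up outgroup mode estimates, NOT taken from the ksrates analysis.
ks_palm_asparagus = 1.365
ks_rice_asparagus = 2.17

k_palm, k_rice = branch_contributions(ks_palm_rice, ks_palm_asparagus, ks_rice_asparagus)
print(f"palm branch: {k_palm:.3f}, rice branch: {k_rice:.3f}")
```

By construction the two contributions always add up to the focal--sister estimate, so the outgroup only determines how that total is split between the two branches.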

The ortholog *K*:sub:`S` estimate of palm-rice is then adjusted by rescaling it as twice the branch contribution of palm (*K*:sub:`S` --> 0.365 + 0.365 = 0.73). The position of the divergence line is thus largely shifted towards the left, from 1.53 to 0.73 *K*:sub:`S` (Figure 2), and it is now adapted to the slower scale of the palm paralog distribution. Interestingly, the rate-adjusted mixed plot offers a new interpretation for the placement of the older WGD signature located around 0.9 *K*:sub:`S`, from being palm-specific to being shared with rice and potentially other monocots.
The ortholog *K*:sub:`S` mode estimate of palm--rice is then rate-adjusted by rescaling it to twice the contribution of the palm branch (*K*:sub:`S` --> 2 * 0.365 = 0.73). The position of the (mode) divergence line thus shifts substantially towards the left, to the other side of the second WGD peak, from *K*:sub:`S`\=\1.53 to *K*:sub:`S`\=\0.73 (Figure 2); it is now rate-adjusted to the *K*:sub:`S` scale of the paralog *K*:sub:`S` distribution of oil palm.
The rate-adjusted mixed plot offers a different interpretation for the phylogenetic placement of the older WGD signature (located at a *K*:sub:`S` of around 0.9) than a naive mixed plot would: instead of appearing to be a palm-specific event, the WGD is now suggested to be an event shared by both rice and palm. This new finding likely matches the proposed monocot *tau* WGD event.
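
The rescaling step and its effect on the interpretation can be sketched as follows. The function names are hypothetical and the placement rule is a deliberately simplified reading of a mixed plot (a WGD peak at a higher *K*:sub:`S` than the divergence line is taken as older than the divergence, hence shared); the numbers are those of the explained example.

```python
def rate_adjust(focal_branch_ks):
    """Rescale a divergence estimate to the focal species' K_S scale.

    Both branches are treated as if they had evolved at the focal rate,
    so the adjusted estimate is twice the focal branch contribution.
    """
    return 2 * focal_branch_ks


def wgd_placement(wgd_peak_ks, divergence_ks):
    """Classify a WGD peak relative to a species divergence line."""
    if wgd_peak_ks > divergence_ks:
        return "shared"  # WGD predates the divergence
    return "lineage-specific"  # WGD postdates the divergence


palm_branch_ks = 0.365   # palm contribution from the example
unadjusted_ks = 1.53     # original palm--rice mode estimate
older_wgd_peak_ks = 0.9  # approximate position of the older WGD signature

adjusted_ks = rate_adjust(palm_branch_ks)
print("naive plot:   ", wgd_placement(older_wgd_peak_ks, unadjusted_ks))
print("adjusted plot:", wgd_placement(older_wgd_peak_ks, adjusted_ks))
```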

.. figure:: _images/mixed_palm_corrected.svg
:align: center
:width: 800
:alt: In this figure the mixed plot shows the rate-adjusted ortholog KS distribution for oil palm and rice as a vertical line superimposed on the paralog KS distribution of oil palm. The vertical line has been shifted towards the left and has crossed a WGD peak from its right side to its left side, as highlighted by an arrow.

The ortholog *K*:sub:`S` estimate (red vertical line) has been shifted towards left after rate-adjustment, as highlighted by the red arrows starting from the original position and pointing at the new rate-adjusted position.

Figure 2: Rate-adjusted mixed *K*:sub:`S` distribution plot. The rate-adjustment has shifted the ortholog *K*:sub:`S` estimate (red vertical line) towards the left, as indicated by the red arrow at the bottom starting from the original position and pointing to the new rate-adjusted position.
4 changes: 2 additions & 2 deletions doc/source/configuration.rst
@@ -82,7 +82,7 @@ The [PARAMETERS] section includes:

* For ortholog divergence *K*:sub:`S`

* **num_bootstrap_iterations**: number of bootstrap iterations for mode/median estimation. [Default: 200]
* **num_bootstrap_iterations**: number of bootstrap iterations for mode estimation. [Default: 200]
* **divergence_colors**: list of colors assigned to the divergence nodes: all divergence lines coming from the same divergence node share the same color. [Default: 8 colors]

* For the ortholog *K*:sub:`S` distribution plots
@@ -206,4 +206,4 @@ This is an optional configuration file that contains several \"expert\" paramete
* **max_mixture_model_iterations**: maximum number of EM iterations for mixture modeling. [Default: 300]
* **max_mixture_model_components**: maximum number of components considered during execution of the mixture models. [Default: 5]
* **max_mixture_model_ks**: upper limit for the *K*:sub:`S` range in which the exponential-lognormal and lognormal-only mixture models are performed. [Default: 5]
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods (see section :ref:`paralogs_analyses` and Supplementary Materials) [Default: "no"]
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods (see section :ref:`paralogs_analyses` and Supplementary Materials) [Default: "no"]
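
To make the defaults listed above concrete, here is a minimal sketch that reads these expert parameters with Python's ``configparser`` and falls back to the documented defaults when a key is missing. The section name ``EXPERT PARAMETERS`` and the function name are assumptions made for illustration; the diff does not show how the expert file is laid out or how *ksrates* parses it.

```python
import configparser

# Hypothetical expert configuration; the section name is an assumption.
EXAMPLE_CONFIG = """
[EXPERT PARAMETERS]
max_mixture_model_iterations = 500
extra_paralogs_analyses_methods = no
"""


def load_expert_parameters(text, section="EXPERT PARAMETERS"):
    """Parse the expert parameters, using the documented defaults as fallbacks."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "max_mixture_model_iterations": parser.getint(
            section, "max_mixture_model_iterations", fallback=300),
        "max_mixture_model_components": parser.getint(
            section, "max_mixture_model_components", fallback=5),
        "max_mixture_model_ks": parser.getfloat(
            section, "max_mixture_model_ks", fallback=5.0),
        "extra_paralogs_analyses_methods": parser.getboolean(
            section, "extra_paralogs_analyses_methods", fallback=False),
    }


print(load_expert_parameters(EXAMPLE_CONFIG))
```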