Addressing #19 and #21

evogytis committed Nov 7, 2017
1 parent a5bbf12 commit 4a11c8dc2fcef5ce6a80862b1edc1c220cce7a6f
Showing with 62 additions and 4 deletions.
  1. BIN figures/mers_ML.png
  2. BIN figures/mers_es_mcc.png
  3. +21 −1 mers-structure.bib
  4. +41 −3 mers-structure.tex
BIN +273 KB figures/mers_ML.png
Binary file not shown.
Binary file not shown.
@@ -1,13 +1,33 @@
%% This BibTeX bibliography file was created using BibDesk.
%% Created for Gytis Dudas at 2017-10-30 15:37:11 -0700
%% Created for Gytis Dudas at 2017-11-07 11:35:51 -0800
%% Saved with string encoding Unicode (UTF-8)
Abstract = {The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbc L sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page:},
Author = {Guindon, St{\'e}phane and Gascuel, Olivier and Rannala, Bruce},
Date-Added = {2017-11-07 19:35:47 +0000},
Date-Modified = {2017-11-07 19:35:47 +0000},
Doi = {10.1080/10635150390235520},
File = {Full Text PDF:/Users/evogytis/Library/Application Support/Zotero/Profiles/1sfxzhn6.default/zotero/storage/NH3WGPPR/Guindon et al. - 2003 - A Simple, Fast, and Accurate Algorithm to Estimate.pdf:application/pdf;Snapshot:/Users/evogytis/Library/Application Support/Zotero/Profiles/1sfxzhn6.default/zotero/storage/WICB2ERW/1681984.html:text/html},
Issn = {1063-5157},
Journal = {Systematic Biology},
Month = oct,
Number = {5},
Pages = {696--704},
Title = {A {Simple}, {Fast}, and {Accurate} {Algorithm} to {Estimate} {Large} {Phylogenies} by {Maximum} {Likelihood}},
Url = {},
Urldate = {2017-11-07},
Volume = {52},
Year = {2003},
Bdsk-Url-1 = {},
Bdsk-Url-2 = {}}
Abstract = {Bayesian phylogenetic analysis generates a set of trees which are often condensed into a single tree representing the whole set. Many methods exist for selecting a representative topology for a set of unrooted trees, few exist for assigning branch lengths to a fixed topology, and even fewer for simultaneously setting the topology and branch lengths. However, there is very little research into locating a good representative for a set of rooted time trees like the ones obtained from a BEAST analysis.},
Annote = {Pages 221 in PDF},
@@ -180,6 +180,7 @@ \subsection*{MERS-CoV is predominantly a camel virus}
Discrete trait analysis reconstruction identifies both camels and humans as important hosts for MERS-CoV persistence, but with humans as the ultimate source of camel infections.
A similar approach has been attempted previously \citep{zhang_evolutionary_2016}, but this interpretation of MERS-CoV evolution disagrees with the lack of continuing human transmission chains outside of the Arabian peninsula, low seroprevalence in humans and very high seroprevalence in camels across Saudi Arabia.
We suspect that this particular discrete trait analysis reconstruction is false due to biased data, \textit{i.e.} having nearly twice as many MERS-CoV sequences from humans (N=174) as from camels (N=100), and the inability of the model to account for and quantify vastly different rates of coalescence in the phylogenetic vicinity of both types of sequences.
We can replicate these results by either applying the same model to current data (Figure \ref{dta}) or by enforcing equal coalescence rates within each deme in the structured coalescent model (Figure \ref{}).
\subsection*{MERS-CoV shows seasonal introductions}
@@ -387,7 +388,10 @@ \subsection*{MERS-CoV epidemiology}
In this study we aimed to understand the drivers of MERS coronavirus transmission in humans and what role the camel reservoir plays in perpetuating the epidemic in the Arabian peninsula by using sequence data collected from both hosts (174 from humans and 100 from camels).
We showed that currently existing models of population structure \citep{vaughan_efficient_2014} can identify distinct demographic modes in MERS-CoV genomic data, where viruses continuously circulating in camels repeatedly jump into humans and cause small outbreaks doomed to extinction (Figures \ref{mcc}, \ref{exploded}).
This inference succeeds under different choices of priors for unknown demographic parameters (Figure \ref{prior}) and in the presence of strong biases in sequence sampling schemes (Figure \ref{mers_epi}).
When rapid coalescence in the human deme is not allowed (Figure \ref{equal_sizes}) structured coalescent inference loses power and ancestral state reconstruction is nearly identical to that of discrete trait analysis (Figure \ref{dta}).
From sequence data we identify at least 50 zoonotic introductions of MERS-CoV into humans from the reservoir (Figure \ref{mcc}), from which we extrapolate that hundreds more such introductions must have taken place (Figure \ref{mers_epi}).
Although we recover migration rates from our model (Figure \ref{prior}), these only pertain to sequences and in no way reflect the epidemiologically relevant \textit{per capita} rates of zoonotic spillover events.
We also looked at potential seasonality in MERS-CoV spillover into humans.
Our analyses indicated a period of three months where the odds of a sequenced spillover event are increased, with timing consistent with an enzootic amongst camel calves (Figure \ref{seasonality}).
As a result of our identification of large and asymmetric flow of viral lineages into humans we also find that the basic reproduction number for MERS-CoV in humans is well below the epidemic threshold (Figure \ref{mers_epi}).
@@ -470,7 +474,9 @@ \subsection*{Sequence data}
Protein coding sequences were extracted and concatenated, reducing alignment length from 30130 down to 29364, which allowed for codon-partitioned substitution models to be used.
The final dataset consisted of 174 genomes from human infections and 100 genomes from camel infections (Table \ref{sequences}).
\subsection*{Structured coalescent analyses}
\subsection*{Phylogenetic analyses}
\subsubsection*{Primary analysis, structured coalescent}
For our primary analysis, the MultiTypeTree module \citep{vaughan_efficient_2014} of BEAST v2.4.3 \citep{bouckaert_beast_2014} was used to specify a structured coalescent model with two demes -- humans and camels.
At the time of writing, structured population models are available in BEAST v2 \citep{bouckaert_beast_2014} but not in BEAST v1 \citep{drummond_bayesian_2012}.
@@ -484,13 +490,25 @@ \subsection*{Structured coalescent analyses}
Three chains out of ten did not converge and were discarded altogether.
This left $70\,000$ states on which to base posterior inference.
Posterior sets of typed (where migrating branches from structured coalescent are collapsed, and migration information is left as a switch in state between parent and descendant nodes) trees were summarised using TreeAnnotator v2.4.3 with the common ancestor heights option \citep{heled_looking_2013}.
A maximum likelihood phylogeny showing just the genetic relationships between MERS-CoV genomes from camels and humans was recovered using PhyML \citep{guindon_simple_2003} under an HKY+$\Gamma_{4}$ \citep{hky_1985,yang_1994} nucleotide substitution model and is shown in Figure \ref{ml}.
\subsubsection*{Control, structured coalescent with different prior}
As a secondary analysis to test robustness to choice of prior, we set up an analysis where we increased the mean of the exponential distribution prior for migration rate to 10.0.
All other parameters were identical to the primary analysis and as before 10 independent MCMC chains were run.
In this case, two chains did not converge.
This left $80\,000$ states on which to base posterior inference.
Posterior sets of typed trees were summarised using TreeAnnotator v2.4.3 with the common ancestor heights option \citep{heled_looking_2013}.
\subsubsection*{Control, structured coalescent with equal deme sizes}
To better understand where the statistical power of the structured coalescent model lies, we set up a tertiary analysis identical to the first structured coalescent analysis, except that deme population sizes were constrained to be equal.
This analysis allowed us to differentiate whether statistical power in our analysis comes from effective population size contrasts between demes or from the backwards-in-time migration rate estimation.
Five replicate chains were set up, two of which failed to converge after 200 million states.
Combining the three converging runs left us with $15\,000$ trees sampled from the posterior distribution, which were summarised in TreeAnnotator v2.4.3 with the common ancestor heights option \citep{heled_looking_2013}.
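The role of deme size in this control can be illustrated with the standard structured coalescent result that pairs of lineages in the same deme coalesce at a rate inversely proportional to that deme's effective population size. The following is a minimal sketch, not the MultiTypeTree implementation; the function name, lineage counts and deme sizes are hypothetical values chosen for illustration only.

```python
from math import comb


def coalescence_rate(n_lineages, deme_size):
    """Total coalescence rate among lineages currently in one deme.

    Assumes the usual structured coalescent form: each of the
    C(n, 2) lineage pairs coalesces at rate 1 / deme_size, with time
    measured in the same units as deme_size.
    """
    if deme_size <= 0:
        raise ValueError("deme_size must be positive")
    return comb(n_lineages, 2) / deme_size


# Hypothetical values: ten lineages in each deme.
human_rate = coalescence_rate(10, 0.5)    # small deme, rapid coalescence
camel_rate = coalescence_rate(10, 5.0)    # large deme, slow coalescence
equal_rate = coalescence_rate(10, 5.0)    # human deme forced to camel size
```

With unequal sizes the model can explain tight human clusters through rapid coalescence in a small human deme; once the sizes are tied together, that contrast is unavailable and only the migration rates remain to explain the tree.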
\subsubsection*{Control, structured coalescent with more than one tree per genome}
Due to concerns that recombination might affect our conclusions \citep{dudas_mers-cov_2016}, as an additional secondary analysis, we also considered a scenario where alignments were split into two fragments (fragment 1 comprised of positions 1-21000, fragment 2 of positions 21000-29364), with independent clocks, trees and migration rates, but shared substitution models and deme population sizes.
Fragment positions were chosen based on consistent identification of the region around nucleotide 21000 as a probable breakpoint by GARD \citep{pond_gard:_2006} in previous studies of SARS and MERS coronaviruses \citep{hon_evidence_2008,dudas_mers-cov_2016}.
All analyses were set to run for 200 million states, subsampling every $20\,000$ states.
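The bookkeeping behind the chain length and sampling interval stated above can be sketched as follows. This is an illustrative helper, not part of any BEAST tooling: the function name is hypothetical, and the burn-in fraction is left as a parameter because the lines shown here do not state it.

```python
def retained_samples(chain_length, sample_every, n_converged, burnin_fraction=0.0):
    """Number of posterior samples left after combining converged chains.

    chain_length    -- MCMC states per chain (200 million in the text)
    sample_every    -- subsampling interval (20 000 in the text)
    n_converged     -- chains kept after discarding non-converged ones
    burnin_fraction -- assumed fraction of each chain dropped as burn-in
    """
    if not 0.0 <= burnin_fraction < 1.0:
        raise ValueError("burnin_fraction must be in [0, 1)")
    per_chain = chain_length // sample_every
    kept_per_chain = per_chain - int(per_chain * burnin_fraction)
    return n_converged * kept_per_chain


# One chain of 200 million states sampled every 20 000 states.
samples_per_chain = retained_samples(200_000_000, 20_000, 1)
```

Under these settings each chain yields 10 000 sampled states before any burn-in is removed, so the combined total scales directly with the number of chains that converged.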
@@ -500,7 +518,7 @@ \subsection*{Structured coalescent analyses}
This left $70\,000$ states on which to base posterior inference.
Posterior sets of typed trees were summarised using TreeAnnotator v2.4.3 with the common ancestor heights option \citep{heled_looking_2013}.
\subsubsection*{Discrete trait analysis}
\subsubsection*{Control, discrete trait analysis}
A currently widely used approach to infer ancestral states in phylogenies relies on treating traits of interest (such as geography, host, \textit{etc.}) as features evolving along a phylogeny as continuous time Markov chains with an arbitrary number of states \citep{lemey_bayesian_2009}.
Unlike structured coalescent methods, such discrete trait approaches are independent from the tree (\textit{i.e.} demographic) prior and thus unable to influence coalescence rates under different trait states.
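To make the continuous time Markov chain idea concrete, the sketch below computes transition probabilities along a branch for a two-state host trait (camel, human). It uses the closed-form solution for a two-state chain and is an assumption-laden illustration rather than the BEAST implementation; the function name and rate values are hypothetical.

```python
import math


def two_state_transition_probs(rate_camel_to_human, rate_human_to_camel, t):
    """Transition matrix P(t) for a two-state CTMC.

    Row 0 is a branch starting in camel, row 1 one starting in human;
    columns give the state at the end of a branch of length t.
    """
    total = rate_camel_to_human + rate_human_to_camel
    if total == 0:
        return [[1.0, 0.0], [0.0, 1.0]]  # no switching is possible
    decay = math.exp(-total * t)
    # Stationary probabilities of each state.
    pi_camel = rate_human_to_camel / total
    pi_human = rate_camel_to_human / total
    return [
        [pi_camel + pi_human * decay, pi_human * (1.0 - decay)],
        [pi_camel * (1.0 - decay), pi_human + pi_camel * decay],
    ]


# Hypothetical rates (switches per year) and a branch of half a year.
P = two_state_transition_probs(1.0, 0.5, 0.5)
```

Note that nothing in this calculation depends on how quickly lineages coalesce in either host, which is the limitation of the discrete trait approach described above.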
@@ -511,6 +529,7 @@ \subsubsection*{Discrete trait analysis}
The converging chains were combined after removing 20 million states as burn-in to give a total of $27\,000$ trees drawn from the posterior distribution.
These trees were then summarised using TreeAnnotator v2.4.5 with the common ancestor heights option \citep{heled_looking_2013}.
\subsubsection*{Introduction seasonality}
We extracted the times of camel-to-human introductions from the posterior distribution of multitype trees.
@@ -951,12 +970,31 @@ \section*{Acknowledgements}
\caption{\textbf{Maximum clade credibility (MCC) tree with ancestral state reconstruction according to a discrete trait model.}
MCC tree is presented the same as Figure \ref{mcc}, with colours indicating the most probable state reconstruction at internal nodes.
MCC tree is presented the same as Figure \ref{mcc} and Figure \ref{equal_sizes}, with colours indicating the most probable state reconstruction at internal nodes.
Unlike the structured coalescent summary shown in Figure \ref{mcc} where camels are reconstructed as the main host where MERS-CoV persists, the discrete trait approach identifies both camels and humans as major hosts with humans being the source of MERS-CoV infection in camels.
\caption{\textbf{Maximum clade credibility (MCC) tree of structured coalescent model with enforced equal coalescence rates.}
MCC tree is presented the same as Figures \ref{mcc} and \ref{dta}, with colours indicating the most probable state reconstruction at internal nodes.
Similar to Figure \ref{dta}, enforcing equal coalescence rates between demes in a structured coalescent model identifies humans as a major MERS-CoV host and the source of viruses in camels.
\caption{\textbf{Maximum likelihood (ML) tree of MERS-CoV genomes coloured by origin of sequence.}
Maximum likelihood tree shows genetic divergence between MERS-CoV genomes collected from camels (orange tips) and humans (blue tips).
