(Manuscript) Move final paragraph of Introduction into Results
This whole paragraph outlines the overall achievements of RD-MCL,
 setting the stage for the remaining results.
biologyguy committed Mar 6, 2018
1 parent fd89f8d commit 0b1226c
Showing 2 changed files with 53 additions and 11 deletions.
21 changes: 10 additions & 11 deletions manuscript/genome_biology/bmc_article.tex
@@ -263,27 +263,26 @@ \section{Background}\label{sec:background}
The accuracy of tree-based orthology prediction methods is tied closely to the accuracy of the species trees they rely on;
this can lead to considerable uncertainty or error, especially for less well-studied taxonomic groups~\cite{Xu:2016ek}.

Finally, pairwise similarity graph clustering methods leverage graph theory to rapidly identify groups of related sequences.
Finally, pairwise similarity methods leverage graph theory to rapidly identify groups of related sequences.
InParanoid~\cite{OBrien:2005cy}, EggNOG~\cite{Jensen:2007cc}, and OMA~\cite{Roth:2009iu} are popular tools for assigning sequences to orthogroups using a `best-hit clique' approach, where closed best-hit sub-graphs are identified in the dataset.
These methods can be fast and accurate for detecting one-to-one orthologs, but they suffer diminishing recall rates when in-paralogs are present among the species under study~\cite{Dalquen:2013fz} (`in-paralog' describes homologs derived from genetic duplication \textit{after} speciation~\cite{Sonnhammer:2002vm,Tekaia:2016ga}).
In contrast, Markov clustering (MCL) can efficiently isolate more inclusive sub-graphs~\cite{VanDongen:kJZ890qx,Enright:2002uq}, although the trade-off is generally a reduction in precision;
it is often difficult to separate closely related orthogroups within a given gene family.
Indeed, the major challenge facing those wishing to analyze a defined gene family is \textit{resolution}, because the popular MCL-based ortholog prediction methods, such as OrthoMCL~\cite{Li:2003en}, OrthoFinder~\cite{Emms:2015ig}, and ProteinOrtho~\cite{Lechner:2011jk}, are targeted towards coarse-grained clustering of all protein models derived from whole-genome data.
While computationally efficient, these resources are not well suited for fine-grained processing of individual gene families, where all input sequences are homologous.
While computationally efficient, these resources are not well suited for fine-grained processing of individual gene families where all input sequences are homologous.
This leaves a gap in our ability to easily discern evolutionary patterns at this scale and has inevitably exacerbated the propagation of annotation errors in our public databases~\cite{Schnoes:2009gb}, which still rely heavily on annotation transfer following inference of homology against a limited number of reference species~\cite{Aken:2016dl, Mi:2016bw, OLeary:2016cm}.
Here we present a method that overcomes these challenges, along with a suite of tools to assist with detailed downstream analysis.

Here, we introduce Recursive Dynamic Markov Clustering (RD-MCL), a new method that refines the precision of pairwise similarity graph orthogroup prediction to high resolution, making it appropriate for the analysis of individual gene families.

\section{Results and Discussion}\label{sec:resultsAndDiscussion}
Recursive Dynamic Markov Clustering (RD-MCL) is a new pairwise similarity graph-based orthogroup prediction method that achieves high precision within the context of a gene family.
Specifically, RD-MCL significantly improves upon the BLASTP-based pairwise similarity metrics currently in popular use, applies an optimization algorithm to dynamically select MCL parameters, recursively subdivides overly inclusive orthogroups, and implements final polishing steps to maximize overall accuracy.
Given sufficient taxonomic coverage, RD-MCL is robust against missing or poorly assembled sequence data, and we illustrate how the protein sequences currently available in the RefSeq database, curated by NCBI, allow for the complete analyses of chordate gene families.
Such analyses often reveal mis-assembled or mis-annotated sequences, as well as orthogroups that have been previously undescribed.
We also demonstrate how a starting set of high quality orthogroups from one phylum can be leveraged to analyze homologous sequences from clades that are less well sampled, allowing for detailed phylogenetic placement of new homologs within a gene family in a way that is not possible with simple best-hit database queries.
Given sufficient taxonomic coverage, RD-MCL clearly reveals mis-assembled or mis-annotated sequences, as well as orthogroups that have been previously undescribed.
Furthermore, a starting set of high quality orthogroups from a well-sampled taxonomic group can be leveraged to analyze homologous sequences from clades that are less well sampled, allowing for detailed phylogenetic placement of new sequences into a gene family with greater precision than is possible with simple best-hit database queries.
The software is open-source (https://research.nhgri.nih.gov/software/RD-MCL/) and distributed as part of a suite of tools to facilitate all of the downstream analyses reported.
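
For orientation, the plain MCL step referred to above can be sketched in a few lines. This is a generic, minimal illustration of Markov clustering (alternating expansion and inflation on a column-stochastic matrix), not the RD-MCL implementation; the names `mcl`, `normalize_columns`, and `two_triangles`, the default parameter values, and the toy graph are all assumptions made for this example.

```python
def normalize_columns(matrix):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(similarity, inflation=2.0, max_iterations=100, tol=1e-9):
    """Generic Markov clustering on a symmetric, non-negative similarity matrix.

    Hypothetical helper for illustration only; returns a sorted list of clusters,
    each a sorted list of node indices.
    """
    n = len(similarity)
    # Add self-loops so every node keeps some probability mass on itself.
    flow = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iterations):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(flow[i][k] * flow[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: element-wise power, then renormalize each column.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - flow[i][j])
                     for i in range(n) for j in range(n))
        flow = inflated
        if change < tol:
            break
    # Each row that retains mass is an attractor; the columns it draws from
    # form one cluster. Identical clusters collapse in the set.
    clusters = {frozenset(j for j, v in enumerate(row) if v > 1e-6)
                for row in flow}
    clusters.discard(frozenset())
    return sorted(sorted(cluster) for cluster in clusters)


def two_triangles(bridge):
    """Toy graph (assumed data): two 3-node cliques joined by one edge."""
    sim = [[0.0] * 6 for _ in range(6)]
    for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
        sim[a][b] = sim[b][a] = 1.0
    sim[2][3] = sim[3][2] = bridge
    return sim


if __name__ == "__main__":
    print(mcl(two_triangles(0.1), inflation=2.0))
```

In standard MCL, a larger `inflation` value tends to produce smaller, tighter clusters, which is why it is usually the first parameter considered when tuning granularity. How RD-MCL selects its MCL parameters is not specified in this excerpt, so the sketch fixes them by hand.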



\section{Results and Discussion}\label{sec:resultsAndDiscussion}
\subsection{Description of the RD-MCL algorithm and software}\label{subsec:descriptionOfTheRd-mclAlgorithmAndSoftware}
The impetus for developing RD-MCL was to predict high-quality, fine-grained orthogroups among any collection of homologous protein sequences.

\subsection{Implementation}\label{subsec:implementation}
\lipsum[3]

\subsection{BLASTP scores reduce overall resolving power of MCL when sequences are too similar or too dissimilar}\label{subsec:blastpScoresReduceOverallResolvingPowerOfMclWhenSequencesAreTooSimilarOrTooDissimilar}
43 changes: 43 additions & 0 deletions manuscript/references/refs.bib
@@ -7,6 +7,16 @@
% @settings{label, options="nameyear"}

% Article within a journal
@article{Aken:2016dl,
author = {Aken, Bronwen L and Ayling, Sarah and Barrell, Daniel and Clarke, Laura and Curwen, Valery and Fairley, Susan and Fernandez Banet, Julio and Billis, Konstantinos and Garc{\'\i}a Gir{\'o}n, Carlos and Hourlier, Thibaut and Howe, Kevin and K{\"a}h{\"a}ri, Andreas and Kokocinski, Felix and Martin, Fergal J and Murphy, Daniel N and Nag, Rishi and Ruffier, Magali and Schuster, Michael and Tang, Y Amy and Vogel, Jan-Hinnerk and White, Simon and Zadissa, Amonida and Flicek, Paul and Searle, Stephen M J},
title = {{The Ensembl gene annotation system.}},
journal = {Database : the journal of biological databases and curation},
year = {2016},
volume = {2016},
pages = {baw093}
}


@article{Altenhoff:2012ea,
author = {Altenhoff, Adrian M and Studer, Romain A and Robinson-Rechavi, Marc and Dessimoz, Christophe},
title = {{Resolving the Ortholog Conjecture: Orthologs Tend to Be Weakly, but Significantly, More Similar in Function than Paralogs}},
@@ -184,6 +194,17 @@ @article{Li:2003en
month = sep
}

@article{Mi:2016bw,
author = {Mi, Huaiyu and Poudel, Sagar and Muruganujan, Anushya and Casagrande, John T and Thomas, Paul D},
title = {{PANTHER version 10: expanded protein families and functions, and analysis tools.}},
journal = {Nucleic acids research},
year = {2016},
volume = {44},
number = {D1},
pages = {D336--42},
month = jan
}

@article{Nakaya:2013gg,
author = {Nakaya, Akihiro and Katayama, Toshiaki and Itoh, Masumi and Hiranuka, Kazushi and Kawashima, Shuichi and Moriya, Yuki and Okuda, Shujiro and Tanaka, Michihiro and Tokimatsu, Toshiaki and Yamanishi, Yoshihiro and Yoshizawa, Akiyasu C and Kanehisa, Minoru and Goto, Susumu},
title = {{KEGG OC: a large-scale automatic construction of taxonomy-based ortholog clusters.}},
@@ -206,6 +227,17 @@ @article{OBrien:2005cy
month = jan
}

@article{OLeary:2016cm,
author = {O'Leary, Nuala A and Wright, Mathew W and Brister, J Rodney and Ciufo, Stacy and Haddad, Diana and McVeigh, Rich and Rajput, Bhanu and Robbertse, Barbara and Smith-White, Brian and Ako-Adjei, Danso and Astashyn, Alexander and Badretdin, Azat and Bao, Yiming and Blinkova, Olga and Brover, Vyacheslav and Chetvernin, Vyacheslav and Choi, Jinna and Cox, Eric and Ermolaeva, Olga and Farrell, Catherine M and Goldfarb, Tamara and Gupta, Tripti and Haft, Daniel and Hatcher, Eneida and Hlavina, Wratko and Joardar, Vinita S and Kodali, Vamsi K and Li, Wenjun and Maglott, Donna and Masterson, Patrick and McGarvey, Kelly M and Murphy, Michael R and O'Neill, Kathleen and Pujar, Shashikant and Rangwala, Sanjida H and Rausch, Daniel and Riddick, Lillian D and Schoch, Conrad and Shkeda, Andrei and Storz, Susan S and Sun, Hanzhen and Thibaud-Nissen, Francoise and Tolstoy, Igor and Tully, Raymond E and Vatsan, Anjana R and Wallin, Craig and Webb, David and Wu, Wendy and Landrum, Melissa J and Kimchi, Avi and Tatusova, Tatiana and DiCuccio, Michael and Kitts, Paul and Murphy, Terence D and Pruitt, Kim D},
title = {{Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.}},
journal = {Nucleic acids research},
year = {2016},
volume = {44},
number = {D1},
pages = {D733--45},
month = jan
}

@article{Pan:2016jg,
author = {Pan, Shu-Ting and Xue, Danfeng and Li, Zhi-Ling and Zhou, Zhi-Wei and He, Zhi-Xu and Yang, Yinxue and Yang, Tianxin and Qiu, Jia-Xuan and Zhou, Shu-Feng},
title = {{Computational Identification of the Paralogs and Orthologs of Human Cytochrome P450 Superfamily and the Implication in Drug Discovery.}},
@@ -238,6 +270,17 @@ @article{Roth:2009iu
pages = {220}
}

@article{Schnoes:2009gb,
author = {Schnoes, Alexandra M and Brown, Shoshana D and Dodevski, Igor and Babbitt, Patricia C},
title = {{Annotation error in public databases: misannotation of molecular function in enzyme superfamilies.}},
journal = {PLoS computational biology},
year = {2009},
volume = {5},
number = {12},
pages = {e1000605},
month = dec
}

@article{Spielman:2015kv,
author = {Spielman, Stephanie J and Wilke, Claus O},
title = {{Pyvolve: A Flexible Python Module for Simulating Sequences along Phylogenies}},