(Manuscript) Add new Pipeline figure

biologyguy · Mar 7, 2018 · 72c489b
1 parent 0b1226c
commit 72c489b

Showing 3 changed files with 35 additions and 6 deletions.
diff --git a/manuscript/figures/pipeline.eps b/manuscript/figures/pipeline.eps
diff --git a/manuscript/genome_biology/bmc_article.tex b/manuscript/genome_biology/bmc_article.tex
@@ -1,3 +1,4 @@
+\documentclass[twocolumn]{bmcart}% uncomment this for twocolumn layout and comment line below
 %% BioMed_Central_Tex_Template_v1.06
 %%                                      %
 %  bmc_article.tex            ver: 1.06 %
@@ -43,7 +44,7 @@
 
 %%% loading packages, author definitions
 
-\documentclass[twocolumn]{bmcart}% uncomment this for twocolumn layout and comment line below
+%\documentclass[twocolumn]{bmcart}% uncomment this for twocolumn layout and comment line below
 %\documentclass{bmcart}
 
 %%% Load packages
@@ -275,15 +276,32 @@ \section{Background}\label{sec:background}
 
 
 \section{Results and Discussion}\label{sec:resultsAndDiscussion}
-Recursive Dynamic Markov Clustering (RD-MCL) is a new pairwise similarity graph-based orthogroup prediction method that achieves high precision within the context of a gene family.
+Recursive Dynamic Markov Clustering (RD-MCL) is a new pairwise similarity graph-based orthogroup prediction heuristic that achieves high precision within the context of a gene family.
 Specifically, RD-MCL significantly improves upon the BLASTP-based pairwise similarity metrics currently in popular use, applies an optimization algorithm to dynamically select MCL parameters, recursively subdivides overly inclusive orthogroups, and implements final polishing steps to maximize overall accuracy.
 Given sufficient taxonomic coverage, RD-MCL clearly reveals mis-assembled or mis-annotated sequences, as well as orthogroups that have been previously undescribed.
-Furthermore, a starting set of high quality orthogroups from a well-sampled taxonomic group can be leveraged to analyze homologous sequences from clades that are less well sampled, allowing for detailed phylogenetic placement of new sequences into a gene family with greater precision than is possible with simple best-hit database queries.
-The software is open-source (https://research.nhgri.nih.gov/software/RD-MCL/) and distributed as part of a suite of tools to facilitate all of the downstream analyses reported.
+Furthermore, a set of high-quality orthogroups from a well-sampled taxonomic group can be leveraged to analyze sequences from clades that are less well sampled, allowing for detailed phylogenetic placement of new sequences into a gene family with greater precision than is possible with simple best-hit database queries.
+The software is open-source (https://research.nhgri.nih.gov/software/RD-MCL/) and distributed as part of a suite of tools to facilitate all of the downstream analyses reported here.
 
 
-\subsection{Implementation}\label{subsec:implementation}
-\lipsum[3]
+\subsection{Architecture}\label{subsec:implementation}
+RD-MCL has been implemented in Python and  including flexible sequence format
+
+Python is not generally well adapted for distributing complex operations across a cluster environment, but creating all-by-all similarity graphs scales as $O(n^2)$ in the number of sequences, which makes RD-MCL run times prohibitive when analyzing more than a few hundred sequences on a single machine.
+To overcome this challenge, similarity graph creation can be passed off to worker processes that control other nodes on a cluster.
+This is achieved by writing job information to a SQLite database that is monitored by the worker nodes, which further subdivide the work if a graph is large and return the results to the same SQLite database for retrieval by the master process.
+By distributing the work in this fashion, an arbitrary number of RD-MCL runs can all access the same pool of worker nodes, and the size of the worker pool can be modified on the fly.
+
+Sequence handling is achieved with BuddySuite~\cite{Bond:2017bj}, so the user
+
+The pipeline can be separated into five distinct components\textemdash{}
+
+\begin{figure*}[t]
+  \begin{center}
+  \includegraphics[height=0.6\textheight]{../figures/pipeline.eps}
+\end{center}
+\caption{Pipeline.}
+\label{fig:pipeline}
+\end{figure*}
 
 \subsection{BLASTP scores reduce overall resolving power of MCL when sequences are too similar or too dissimilar}\label{subsec:blastpScoresReduceOverallResolvingPowerOfMclWhenSequencesAreTooSimilarOrTooDissimilar}
 BLAST scores (bit or e-value) have a strong length bias when calculating orthogroups, but this can be corrected for by creating a linear model from the top 5\% of matches between two species and scaling all other matches according to that model~\cite{Emms:2015ig}.
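
Note on the Architecture subsection added in the hunk above: the worker-pool design it describes (a master process writes job information to a SQLite database, workers monitor that database, and results are returned to the same database) could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not RD-MCL's actual code. The table layout, the function names (`connect`, `submit_job`, `claim_job`, `run_worker_once`, `fetch_result`), and the `toy_similarity` scoring function are all hypothetical, and the further subdivision of large graphs mentioned in the text is omitted.

```python
import json
import sqlite3
from itertools import combinations

# Hypothetical schema: one row per similarity-graph job.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'queued',
    result  TEXT
)
"""


def connect(path):
    """Open the shared job database (autocommit mode, explicit transactions)."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def submit_job(conn, sequences):
    """Master side: queue a dict of {sequence_id: sequence} and return the job id."""
    cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)",
                       (json.dumps(sequences),))
    return cur.lastrowid


def claim_job(conn):
    """Worker side: atomically claim the oldest queued job, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    row = conn.execute(
        "SELECT job_id, payload FROM jobs "
        "WHERE status = 'queued' ORDER BY job_id LIMIT 1").fetchone()
    if row is None:
        conn.execute("COMMIT")
        return None
    conn.execute("UPDATE jobs SET status = 'running' WHERE job_id = ?",
                 (row[0],))
    conn.execute("COMMIT")
    return row[0], json.loads(row[1])


def toy_similarity(seq_a, seq_b):
    """Placeholder score (Jaccard index of residue sets); NOT RD-MCL's metric."""
    union = set(seq_a) | set(seq_b)
    return len(set(seq_a) & set(seq_b)) / len(union) if union else 0.0


def run_worker_once(conn, score=toy_similarity):
    """Worker side: process one job if available; return True if work was done."""
    job = claim_job(conn)
    if job is None:
        return False
    job_id, sequences = job
    # All-by-all comparison: n * (n - 1) / 2 pairs for n sequences.
    edges = [[a, b, score(sequences[a], sequences[b])]
             for a, b in combinations(sorted(sequences), 2)]
    conn.execute("UPDATE jobs SET status = 'done', result = ? WHERE job_id = ?",
                 (json.dumps(edges), job_id))
    return True


def fetch_result(conn, job_id):
    """Master side: return the edge list once the job is done, else None."""
    row = conn.execute("SELECT status, result FROM jobs WHERE job_id = ?",
                       (job_id,)).fetchone()
    if row is None or row[0] != "done":
        return None
    return json.loads(row[1])
```

In this sketch a master would call `submit_job` and poll `fetch_result`, while each worker loops over `run_worker_once` against the same database file. Because the only coupling is the database itself, any number of masters can share one worker pool and workers can be added or removed at any time, which matches the behaviour the paragraph describes.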

diff --git a/manuscript/references/refs.bib b/manuscript/references/refs.bib
@@ -28,6 +28,17 @@ @article{Altenhoff:2012ea
 month = may
 }
 
+@article{Bond:2017bj,
+author = {Bond, Stephen R and Keat, Karl E and Barreira, Sofia N and Baxevanis, Andreas D},
+title = {{BuddySuite: Command-Line Toolkits for Manipulating Sequences, Alignments, and Phylogenetic Trees.}},
+journal = {Molecular biology and evolution},
+year = {2017},
+volume = {34},
+number = {6},
+pages = {1543--1546},
+month = jun
+}
+
 @article{Chiba:2015ed,
 author = {Chiba, Hirokazu and Nishide, Hiroyo and Uchiyama, Ikuo},
 title = {{Construction of an ortholog database using the semantic web technology for integrative analysis of genomic data.}},