(Manuscript) Add new Pipeline figure
biologyguy committed Mar 7, 2018
1 parent 0b1226c commit 72c489b
Showing 3 changed files with 35 additions and 6 deletions.
Binary file added manuscript/figures/pipeline.eps
Binary file not shown.
30 changes: 24 additions & 6 deletions manuscript/genome_biology/bmc_article.tex
@@ -1,3 +1,4 @@
\documentclass[twocolumn]{bmcart}% uncomment this for twocolumn layout and comment line below
%% BioMed_Central_Tex_Template_v1.06
%% %
% bmc_article.tex ver: 1.06 %
@@ -43,7 +44,7 @@

%%% loading packages, author definitions

\documentclass[twocolumn]{bmcart}% uncomment this for twocolumn layout and comment line below
%\documentclass[twocolumn]{bmcart}% uncomment this for twocolumn layout and comment line below
%\documentclass{bmcart}

%%% Load packages
@@ -275,15 +276,32 @@ \section{Background}\label{sec:background}


\section{Results and Discussion}\label{sec:resultsAndDiscussion}
Recursive Dynamic Markov Clustering (RD-MCL) is a new pairwise similarity graph-based orthogroup prediction method that achieves high precision within the context of a gene family.
Recursive Dynamic Markov Clustering (RD-MCL) is a new pairwise similarity graph-based orthogroup prediction heuristic that achieves high precision within the context of a gene family.
Specifically, RD-MCL significantly improves upon the BLASTP-based pairwise similarity metrics currently in popular use, applies an optimization algorithm to dynamically select MCL parameters, recursively subdivides overly inclusive orthogroups, and implements final polishing steps to maximize overall accuracy.
Given sufficient taxonomic coverage, RD-MCL clearly reveals mis-assembled or mis-annotated sequences, as well as orthogroups that have been previously undescribed.
Furthermore, a starting set of high quality orthogroups from a well-sampled taxonomic group can be leveraged to analyze homologous sequences from clades that are less well sampled, allowing for detailed phylogenetic placement of new sequences into a gene family with greater precision than is possible with simple best-hit database queries.
The software is open-source (https://research.nhgri.nih.gov/software/RD-MCL/) and distributed as part of a suite of tools to facilitate all of the downstream analyses reported.
Furthermore, a set of high quality orthogroups from a well-sampled taxonomic group can be leveraged to analyze sequences from clades that are less well sampled, allowing for detailed phylogenetic placement of new sequences into a gene family with greater precision than is possible with simple best-hit database queries.
The software is open-source (https://research.nhgri.nih.gov/software/RD-MCL/) and distributed as part of a suite of tools to facilitate all of the downstream analyses reported here.


\subsection{Implementation}\label{subsec:implementation}
\lipsum[3]
\subsection{Architecture}\label{subsec:implementation}
RD-MCL has been implemented in Python and includes flexible sequence format

Python is not generally well adapted for distributing complex operations across a cluster environment, but creating all-by-all similarity graphs is an $O(n^2)$ problem in the number of sequences $n$, which makes RD-MCL run times prohibitive when analyzing more than a few hundred sequences on a single machine.
To overcome this challenge, similarity graph creation can be passed off to worker processes which control other nodes on a cluster.
This dynamic is achieved by writing job information to a SQLite database that is monitored by the worker nodes, which then further subdivide the work if a graph is large, and return the results to the same SQLite database for retrieval by the master process.
By distributing the work in this fashion, an arbitrary number of RD-MCL runs can all access the same pool of worker nodes, and the size of the worker pool can be modified on the fly.
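
The job hand-off described above can be illustrated with a minimal sketch. It is not RD-MCL's actual code: the table layout, the function names, and the \texttt{score\_pair} callback are all hypothetical, and the further subdivision of large graphs by the workers is omitted. It only shows a master process queuing pairwise comparisons in SQLite and workers claiming them and writing results back.

```python
import sqlite3
from itertools import combinations

# Hypothetical schema: one row per pairwise comparison.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id     INTEGER PRIMARY KEY,
    seq_a  TEXT NOT NULL,
    seq_b  TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued',  -- queued -> running -> done
    score  REAL
)
"""


def connect(path):
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # transactions below are controlled explicitly with BEGIN/COMMIT.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def submit_all_by_all(conn, seq_ids):
    """Master: queue one job per unordered pair, n(n-1)/2 in total."""
    pairs = list(combinations(sorted(seq_ids), 2))
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO jobs (seq_a, seq_b) VALUES (?, ?)", pairs)
    conn.execute("COMMIT")
    return len(pairs)


def claim_job(conn):
    """Worker: claim the oldest queued job, or return None if none remain."""
    # BEGIN IMMEDIATE takes the write lock up front, so two workers
    # sharing the database file cannot claim the same row.
    conn.execute("BEGIN IMMEDIATE")
    row = conn.execute(
        "SELECT id, seq_a, seq_b FROM jobs "
        "WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is not None:
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    conn.execute("COMMIT")
    return row


def run_worker(conn, score_pair):
    """Worker loop: score jobs until the queue is empty; return the count."""
    finished = 0
    while (job := claim_job(conn)) is not None:
        job_id, seq_a, seq_b = job
        conn.execute(
            "UPDATE jobs SET status = 'done', score = ? WHERE id = ?",
            (score_pair(seq_a, seq_b), job_id),
        )
        finished += 1
    return finished


def collect_results(conn):
    """Master: retrieve every finished comparison."""
    return conn.execute(
        "SELECT seq_a, seq_b, score FROM jobs WHERE status = 'done' ORDER BY id"
    ).fetchall()
```

Because the queue lives in a single database file, any number of master processes could insert jobs and any number of workers could call \texttt{run\_worker} against it, which mirrors the shared, resizable worker pool described above.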

Sequence handling is achieved with BuddySuite~\cite{Bond:2017bj}, so the user

The pipeline can be separated into five distinct components\textemdash{}

\begin{figure*}[t]
\begin{center}
\includegraphics[height=0.6\textheight]{../figures/pipeline.eps}
\end{center}
\caption{Pipeline.}
\label{fig:pipeline}
\end{figure*}

\subsection{BLASTP scores reduce overall resolving power of MCL when sequences are too similar or too dissimilar}\label{subsec:blastpScoresReduceOverallResolvingPowerOfMclWhenSequencesAreTooSimilarOrTooDissimilar}
BLAST scores (bit score or E-value) have a strong length bias when calculating orthogroups, but this can be corrected for by creating a linear model from the top 5\% of matches between two species and scaling all other matches according to that model~\cite{Emms:2015ig}.
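
One plausible reading of this length correction is sketched below. It is not the cited implementation: the equal-sized length bins, the fit in log-log space, and every function name are assumptions made for illustration. Each hit is a pair of (product of query and hit lengths, bit score), and each score is divided by the score the fitted line predicts for its length.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the selected hits span more than one length
    return slope, mean_y - slope * mean_x


def top_hits_per_bin(hits, n_bins=10, top_fraction=0.05):
    """Keep the best-scoring fraction of hits within each length bin."""
    ordered = sorted(hits, key=lambda h: h[0])
    size = max(1, math.ceil(len(ordered) / n_bins))
    top = []
    for start in range(0, len(ordered), size):
        chunk = sorted(ordered[start:start + size], key=lambda h: h[1], reverse=True)
        keep = max(1, math.ceil(len(chunk) * top_fraction))
        top.extend(chunk[:keep])
    return top


def normalize_scores(hits, n_bins=10, top_fraction=0.05):
    """Scale every bit score by the score the fitted model predicts for its length."""
    top = top_hits_per_bin(hits, n_bins, top_fraction)
    slope, intercept = fit_line(
        [math.log10(length) for length, _ in top],
        [math.log10(score) for _, score in top],
    )
    return [
        score / 10 ** (slope * math.log10(length) + intercept)
        for length, score in hits
    ]
```

Under this sketch, a hit that scores as well as the best hits of similar length is scaled to roughly 1, and weaker hits fall proportionally below that, regardless of sequence length.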
11 changes: 11 additions & 0 deletions manuscript/references/refs.bib
@@ -28,6 +28,17 @@ @article{Altenhoff:2012ea
month = may
}

@article{Bond:2017bj,
author = {Bond, Stephen R and Keat, Karl E and Barreira, Sofia N and Baxevanis, Andreas D},
title = {{BuddySuite: Command-Line Toolkits for Manipulating Sequences, Alignments, and Phylogenetic Trees.}},
journal = {Molecular Biology and Evolution},
year = {2017},
volume = {34},
number = {6},
pages = {1543--1546},
month = jun
}

@article{Chiba:2015ed,
author = {Chiba, Hirokazu and Nishide, Hiroyo and Uchiyama, Ikuo},
title = {{Construction of an ortholog database using the semantic web technology for integrative analysis of genomic data.}},