A simple tool for calculating the distance between genomes.

PanCake

An implementation of the methods described in Meier-Kolthoff et al. (2013); the full citation is given under "Citing this work".

While the authors of that publication rightly believe that the BLAST algorithm gives more sensitivity for these questions, it comes at an extreme cost, making application to large datasets computationally intractable.

These programs provide a minimal implementation of their described methods so that the nucmer program can be used instead of BLAST, with greedy trimming and the equation d4, to calculate a distance between two genome assemblies.

End output will give you something like this:

Big Heatmap

Of course, people will want to know whether we calculate distances similar to those of other implementations. The short answer: yes. We just tend to underestimate diversity.

Big Heatmap

Big Heatmap

The really important part is the speed increase:

| Organism | GBDP | PanCake |
|---|---|---|
| Salmonella | 06:28:11 | 00:09:47 |
| Campylobacter | 03:19:25 | 00:02:24 |

## Citing this work

If you use this program or any of its outputs in your research please cite the following:

Meier-Kolthoff, Jan and Auch, Alexander and Klenk, Hans-Peter and Goker, Markus. (2013). Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics.

Dylan Storey and Bart Weimer. (2015). PanCake: Narya. Zenodo. 10.5281/zenodo.35916

## Installation

The only non-core Perl packages required come from Inline.

To install:

sudo cpan install Inline
sudo cpan install Inline::CPP

## Synopsis of usage

Note: The first time you run these programs, a folder (_Inline or .inline) will appear. Don't delete it, as it holds libraries for portions of the program.

Create our manifests:

./MakeChunks --files Genomes/*.fa --chunk_size 200

Run them from a single node:

parallel '../CalculateDistances --manifest {}' ::: *.man

Run them from slurm:

sbatch --array=0-max Array_Submit.slrm

Combine and Plot:

ls Genomes/*.fa | wc -l
# 28
../CombineChunks --max_elements 28 --files *.mtx --out join.mtx

## Outputs of the run

pairs files: These contain the calculated distances in a tab-delimited format.

join.mtx: This contains all of the pairs files as a true matrix with header.

test_join.mtx.png: This is the heatmap generated from join.mtx using hclust and ggplot2.

A big run will look a little something like this:

# Scripts

CalculateDistances

The core program. It currently runs MUMmer and retrieves MUMs, filters overlapping MUMs by keeping the longest alignment between any two overlapping MUMs, then calculates the distance as the average of the distances calculated from reciprocal MUMmer runs, where a single distance metric is:

2 * (Total Identical Nucleotides) / ((Length of MUMs from Reference) + (Length of MUMs from Query))
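As a rough worked example of the formula above, the Python sketch below computes the metric for one run and averages it over the two reciprocal runs. It is not the Perl code in CalculateDistances. It assumes the denominator is the sum of the two MUM lengths, and the function names and input counts are hypothetical.

```python
def single_metric(identical, ref_len, qry_len):
    """2 * identical / (ref_len + qry_len), following the README formula.

    Assumption: the denominator is the sum of both MUM lengths.
    """
    total = ref_len + qry_len
    if total == 0:
        raise ValueError("no MUMs were retained for this pair")
    return 2 * identical / total


def reciprocal_average(run_ab, run_ba):
    """Average the metric over both directions (A vs B and B vs A)."""
    return (single_metric(*run_ab) + single_metric(*run_ba)) / 2


# Hypothetical counts:
# (identical nucleotides, MUM length in reference, MUM length in query)
forward = (750, 1000, 1000)
reverse = (500, 1000, 1000)
print(reciprocal_average(forward, reverse))
```

Averaging the two directions makes the result independent of which genome is treated as the reference.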

Usage:

  ./CalculateDistances --manifest <manifest file>

Options:

--manifest : The manifest file to operate on
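The overlap filtering described for CalculateDistances can be sketched roughly as follows. This is an illustration in Python, not the program's own code. It assumes that overlap is judged on reference coordinates only and that a shorter MUM is discarded whenever it overlaps a longer one already kept. The `Mum` type and `keep_longest` function are hypothetical names.

```python
from typing import List, NamedTuple


class Mum(NamedTuple):
    start: int   # 1-based start on the reference (assumed coordinate system)
    length: int  # length of the match

    @property
    def end(self) -> int:
        return self.start + self.length - 1


def overlaps(a: Mum, b: Mum) -> bool:
    """True if the two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def keep_longest(mums: List[Mum]) -> List[Mum]:
    """Greedy filter: visit MUMs longest first, keep those that do not
    overlap anything already kept, and return them sorted by start."""
    kept: List[Mum] = []
    for mum in sorted(mums, key=lambda m: m.length, reverse=True):
        if not any(overlaps(mum, other) for other in kept):
            kept.append(mum)
    return sorted(kept)


mums = [Mum(1, 50), Mum(40, 30), Mum(100, 20), Mum(65, 10)]
for mum in keep_longest(mums):
    print(mum.start, mum.end, mum.length)
```

Here `Mum(40, 30)` is dropped because it overlaps the longer `Mum(1, 50)`, while `Mum(65, 10)` survives because it only overlapped the MUM that was dropped.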

MakeChunks

Given a large number of files splits the work into many manifests so that analysis can be spread across many processes.

Usage:

./MakeChunks --files *.fa --chunk 10000

Options:

--files : the files you wish to split into sub-manifests

--chunk : how big of a chunk you want each manifest to take up.
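To give a feel for how the work might be divided, the Python sketch below splits all unordered genome pairs into chunks of a fixed size. This is only a guess at the idea behind MakeChunks: the README does not state that a chunk counts pairs rather than files, and it does not describe the manifest file format, so both are assumptions here. `make_chunks` is a hypothetical name.

```python
from itertools import combinations, islice


def make_chunks(files, chunk_size):
    """Yield lists of at most chunk_size unordered genome pairs.

    Assumption: each manifest covers a chunk of pairwise comparisons.
    """
    pairs = combinations(sorted(files), 2)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical input: 28 genome files, as in the synopsis above.
files = [f"genome_{i:02d}.fa" for i in range(28)]
chunks = list(make_chunks(files, 200))

# 28 genomes give 28 * 27 / 2 = 378 pairs, so two chunks (200 and 178).
print(len(chunks), [len(chunk) for chunk in chunks])
```

Under these assumptions, the number of chunks is also the number of array tasks needed when submitting to SLURM.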

CombineChunks

./CombineChunks --max_elements <int> --files <files> --out join.mtx

Options:

--max_elements : The number of samples you have

--files : files to join

--out : name of the file for your final matrix

Combines mtx files in the pair format to a singular matrix , performs clustering, and outputs a heatmap.
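The pair-to-matrix step can be pictured with the short Python sketch below. It is not the CombineChunks code. It assumes each line of a pairs file holds three tab-separated fields (first sample, second sample, distance), that the values are symmetric, and that the diagonal is zero. None of these layout details are confirmed by this README, and `pairs_to_matrix` is a hypothetical name.

```python
import csv


def pairs_to_matrix(lines):
    """Turn 'a<TAB>b<TAB>distance' lines into (names, square matrix).

    Pairs that never appear in the input are left as None.
    """
    dist = {}
    names = set()
    for a, b, value in csv.reader(lines, delimiter="\t"):
        names.update((a, b))
        dist[(a, b)] = dist[(b, a)] = float(value)
    order = sorted(names)
    matrix = [
        [0.0 if row == col else dist.get((row, col)) for col in order]
        for row in order
    ]
    return order, matrix


# Hypothetical pairs for three samples.
rows = ["A\tB\t0.1", "A\tC\t0.2", "B\tC\t0.3"]
names, matrix = pairs_to_matrix(rows)
print("\t".join([""] + names))
for name, row in zip(names, matrix):
    print("\t".join([name] + [str(v) for v in row]))
```

With complete input, the number of names recovered should equal the value passed to `--max_elements`.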

## Array_Submit.slrm

SLURM job file for array jobs. You'll want to know the number of the last manifest in a split to use this.

Usage:

sbatch --array=1-last_manifest Array_Submit.slrm

Options: Anything you can edit in sbatch you can edit here.