What is ShoRAH?
ShoRAH is an open source project for the analysis of next generation sequencing data. It is designed to analyse genetically heterogeneous samples. Its tools are written in different programming languages and provide error correction, haplotype reconstruction and estimation of the frequency of the different genetic variants present in a mixed sample.
More information here.
The software suite ShoRAH (Short Reads Assembly into Haplotypes) consists of several programs, the most important of which are:
|Tool|What it does|
| |amplicon-based analysis|
| |local error correction based on diri_sampler|
| |Gibbs sampling for error correction via Dirichlet process mixture|
| |removal of redundant reads|
| |maximum matching haplotype construction|
| |EM algorithm for haplotype frequency|
| |detects single nucleotide variants, taking strand bias into account|
| |wrapper for everything|
If you use shorah, please cite the application note by Zagordi et al. in BMC Bioinformatics.
shorah requires the following pieces of software:
Python 2, which is generally available on most Unix-like systems. The required dependencies are:
a) Biopython, which can be downloaded using pip or anaconda
Perl, for some scripts
zlib, which is used by the bundled samtools for compressing bam files
pkg-config, for discovering dependencies, which most Unix-like systems include
GNU scientific library, for random number generation
In addition, if you want to bootstrap the git version of shorah instead of using the provided tarballs, you will need the GNU Autotools:
m4, which most Unix-like systems include
We strongly recommend you use one of the versioned tarballs from the releases page. ShoRAH uses Autoconf and Automake, and these tarballs include all the scripts and files required for installation, whereas the git tree contains none of these pre-generated files.
Further, we strongly recommend you use a virtualenv for the python installation that shares the same directory root as where you'd like to install shorah to. Not using a virtualenv means that the python dependencies will not be located in the installation root, which will likely require you to specify PYTHONPATH, making the installation more brittle.
Say, for instance, you would like to install shorah to /usr/local/shorah. The first step consists of installing the required python dependencies. Create a virtualenv:
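A minimal sketch of this step, assuming the MacPorts virtualenv command and the /usr/local/shorah prefix used in this example; on other systems the virtualenv command will have a different name or path:

```shell
# Installation prefix used throughout this example.
PREFIX=/usr/local/shorah
# Assumed location of the Python 2.7 virtualenv command (MacPorts naming).
VIRTUALENV=/opt/local/bin/virtualenv-2.7

# Create the virtualenv directly in the installation prefix.
if [ -x "$VIRTUALENV" ]; then
    "$VIRTUALENV" "$PREFIX"
else
    echo "virtualenv not found at $VIRTUALENV; adjust VIRTUALENV for your system" >&2
fi
```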
/opt/local/bin/virtualenv-2.7 is the virtualenv command for python 2.7 on MacPorts. Now install the python dependencies:
/usr/local/shorah/bin/pip install Biopython
Now call the configure script from the shorah tarball, taking care to specify the absolute path of the python interpreter, as this gets inserted into the shebang line of all python scripts:
./configure --prefix=/usr/local/shorah PYTHON=/usr/local/shorah/bin/python2.7
The configure script finds the dependencies using pkg-config. Once it completes, run:
where the number specifies the number of compilation threads to use. Finally, after compilation, install using:
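With a standard Autoconf/Automake setup, the build and install steps are typically make followed by make install. A minimal sketch, assuming these default targets and using 4 parallel jobs purely as an illustrative value:

```shell
# Number of parallel compilation jobs; 4 is only an illustrative value.
JOBS=4

# Run from the top of the unpacked tarball, after ./configure has produced a Makefile.
if [ -f Makefile ]; then
    make -j"$JOBS" && make install
else
    echo "No Makefile found; run ./configure first" >&2
fi
```

Installing into a prefix such as /usr/local/shorah may require write permission on that directory.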
All the programs should now be located in
Bootstrapping from git
If you opted to clone the git repository instead of downloading a prepared tarball, you will need to bootstrap the configure script:
After this, you can run the configure script as described previously.
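For Autotools projects, bootstrapping is commonly done with autoreconf. The following is a sketch under that assumption; if the repository provides its own bootstrap script, use that instead. It also assumes the Autoconf input file is named configure.ac:

```shell
# Regenerate configure and the Makefile.in files from the Autotools sources.
# -v: verbose, -i: install missing auxiliary files, -f: force regeneration.
if [ -f configure.ac ]; then
    autoreconf -vif
else
    echo "configure.ac not found; run this from the top of the git checkout" >&2
fi
```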
The input is a sorted bam file. Analysis can be performed in local or global mode.
The local analysis alone can be run invoking
for the amplicon mode). They work by cutting windows from the multiple sequence alignment, running diri_sampler on the windows and calling snv.py for SNV calling. See the
file in directory
The whole global reconstruction consists of the following steps:
- error correction (i.e. local haplotype reconstruction);
- SNV calling;
- removal of redundant reads;
- global haplotype reconstruction;
- frequency estimation.
These can be run one after the other, or one can invoke shorah.py, which runs the whole process from the bam file to frequency estimation and SNV calling.