Skip to content

Commit

Permalink
First commit.
Browse files Browse the repository at this point in the history
  • Loading branch information
ozagordi committed Jan 16, 2012
1 parent e2a540b commit 9e33e8e
Show file tree
Hide file tree
Showing 10 changed files with 1,996 additions and 0 deletions.
195 changes: 195 additions & 0 deletions README
Original file line number Diff line number Diff line change
@@ -0,0 +1,195 @@
Full online documentation at:
https://wiki-bsse.ethz.ch/display/ShoRAH/Documentation

ShoRAH consists of several programs:

dec.py - local error correction based on diri_sampler
diri_sampler - Gibbs sampling for Dirichlet process mixture
contain.cc - eliminate redundant reads
mm.py - maximum matching haplotype construction
freqEst.cc - EM algorithm for haplotype frequency
fas2read.pl - translates between two formats for read data files
s2f.py - multiple sequence alignment of NGS reads
bam2msa.py - extracts a multiple sequence alignment from a BAM file

Copyright 2007, 2008, 2009, 2010
Niko Beerenwinkel,
Arnab Bhattacharya,
Nicholas Eriksson,
Moritz Gerstung,
Lukas Geyrhofer,
Osvaldo Zagordi,
ETH Zurich
under GPL. See file LICENSE for details.
The program also includes other software written by third parties.
This has been distributed according to the licenses provided by the
respective authors.


GENERAL USAGE:
--------------------------------------------------

Install:
type 'make' to build the C++ programs, then run. See INSTALL
for additional information.

Run:
The whole process can be run one step after the other, or one can
invoke shorah.py, that runs the whole process from read file to
frequency estimation. The local analysis alone can be run invoking
dec.py or directly diri_sampler. The input must be a fasta file with reads or a multiple alignment of the reads in fasta format
(.far extension). Shorah uses the program s2f to produce the
alignment, or the user copy its alignment to a file with the .far
extension and run shorah on it. We also provide a program to
extract a multiple sequence alignment from a BAM file.



S2F.PY
--------------------------------------------------
Options:
-h, --help show this help message and exit
-f INPUT, --readfile=INPUT
-r REF, --ref=REF
-o O, --output=O output suffix must be '.far' or none for stdout
-t THRESHOLD, --threshold=THRESHOLD
if similarity is less, throw reads away...
<default=0.7>
-d, --delete_files delete temporary files <default=False>
-n, --no_pad_insert do not insert padding gaps <default=insert>

Output:
The program s2f.py takes as input the file of reads (in fasta
format) and the reference genome. It produces a set of pairwise
alignments, then merge them to obtain a multiple one in a file
with far extension (file_name.far). This is the input to the
subsequent analysis performed with dec.py.

Examples:
A sample file has been included for testing purposes,
sample_454.fasta and the reference ref_genome.fasta.
sample_454.fasta consists of 1000 reads of mean length equal
to 310 bp, derived from two haplotypes mixed in 80:20
proportion.
We recommend moving both files to a different working directory
and follow the rule of thumb: "one experiment in one directory".

We assume shorah was downloaded in the directory
path_to_sho_bin/. The far file can be created from the fasta file
and the reference by invoking
[user@host]$ path_to_sho_bin/shorah-0.4/s2f.py -f sample_454.fasta -r ref_genome.fasta -o sample_454.far







DEC.PY:
--------------------------------------------------
Options:
-h, --help show this help message and exit
-f F, --readsfile=F file with reads <.far format>
-j J, --iterations=J iterations in dpm sampling <2000>
-a A, --alpha=A alpha in dpm sampling <0.01>
-w W, --windowsize=W window size <201>
-s S, --winshifts=S number of window shifts <3>
-k, --keep_files keep all intermediate files of diri_sampler
<default=False>
-r REF, --ref=REF reference genome
-t THRESHOLD, --threshold=THRESHOLD
if similarity is less, throw reads away...
<default=0.7>
-d, --delete_s2f_files
delete temporary align files of s2f.py <default=False>
-n, --no_pad_insert do not insert padding gaps in .far
file<default=insert>


Other options
-------------
These parameters are controlled by the global variables declared at the beginning of file dec.py. See the relative comments for an explanation.
IMPORTANT: Since version 0.3 dec.py runs in parallel on multi-processor
computers. By default it uses all available CPUs, if you want to limit the
number of parallel processes, edit the variable max_proc in dec.py

Tips:
Rerunning on selected windows:
When running with option -k, dec.py produces and retains several
files in the subdirectory "corrected/" of the form
w1-99.reads-cor.fas.gz, where 1 and 99 are, respectively, the
beginning and the end of the window under consideration. If the
user wishes to rerun dec.py on selected windows only, one should
simply delete the corresponding file in the directory "corrected/"
NOTE: The error correction implemented in diri_sampler will not run for any window whose corresponding file is present in the directory
"corrected/".

Changing parameters on the fly:
If the user wishes to change the number of iterations of
diri_sampler (because the burn-in phase is shorter or longer
than expected) one can write a file w1-99.iter (again 1 and 99
are the boundaries of the window) with a single line that
contains the desired number of iterations. Similarly, one can
change alpha by writing into a file w1-99.alpha a float with the
new value of alpha. While alpha can be increased and decreased by
editing the .alpha file without adversely affecting the
functioning of the program (but obviously its result), we
recommend changing the number of iterations at most once.

Maximum weight of read objects:
The new implementation of diri_sampler clubs identical reads into
objects defined by the read sequence and the number of reads in
the object. This allows for faster computation, but might
introduce a bias into the sampling procedure. One can limit the
maximum number of identical reads in a single object by editing
the variable LIMIT in dpm_src/dpm_sampler.h (default is 10000)

Output:
The program dec.py outputs several files. The file sample_cor.fas, which contains the corrected reads, is passed by shorah (or by the user) to the following step of the global analysis, fas2read.pl. dec.py also outputs a file called proposed.dat. It contains the
number of proposed new clusters in each window. One should always make sure that a sufficient number of new cluster has been proposed (rule of thumb, at least 0.1 per step or more). When the program is run with the -k option, several directories are created. They
contain zipped files that report information on the error
correction procedure. For a full explanation, see
http://wiki-bsse.ethz.ch/display/ShoRAH/Local+analysis



SHORAH.PY:
--------------------------------------------------


Options:
-h, --help show this help message and exit
-f F, --readsfile=F file with reads <.fas or .far format>
-j J, --iterations=J iterations in dpm sampling <1000>
-a A, --alpha=A alpha in dpm sampling <0.01>
-w W, --windowsize=W window size in <201>
-t THRESHOLD, --threshold=THRESHOLD
if similarity is less, throw reads away...
<default=0.7>
-n, --no_pad_insert do not insert padding gaps <default=insert>
-r REF, --ref=REF
-s S, --winshifts=S number of window shifts <3>
-k, --keep_files keep intermediate files (Gibbs sampling)
<default=False>


Examples:
How to run shorah on the example file described above.
[user@host]$ path_to_sho_bin/shorah-0.4/shorah.py -f sample_454.fasta -r ref_genome.fasta -j 1000 -w 90 -a 0.1 -k &> global.log



Output:
The final result of the analysis is in the file
sample_454_cor.popl, the first lines of which are reported.
>HAP0_0.708656
TC-AA--A--TCACTCTTTGGCAACGACCCCTTGTC-AC.........[line truncated]
>HAP1_0.194
TC-TG--A--TCACTCTTTGGCAGCGACCCCTCGTC-AC.........[line truncated]

This file contains the sequences of the reconstructed haplotypes
(reconstructed by the program mm.py) and their frequencies
(estimated by the program freqEst. The name of each sentence is in
the format HAPn_freq where n is an ordinal number and freq is the
frequency. Haplotypes are sorted in descending order according to
their frequencies.
110 changes: 110 additions & 0 deletions plot_sampling.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
#!/usr/bin/env python

# Copyright 2007, 2008, 2009
# Niko Beerenwinkel,
# Nicholas Eriksson,
# Moritz Gerstung,
# Lukas Geyrhofer,
# Osvaldo Zagordi,
# ETH Zurich

# This file is part of ShoRAH.
# ShoRAH is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

# ShoRAH is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with ShoRAH. If not, see <http://www.gnu.org/licenses/>.

def make_plot(filename, quantity1, quantity2=None):
''' Plot quantity for ds-out file
'''
import matplotlib
matplotlib.use('pdf')
import matplotlib.pyplot as plt
import numpy as np

with open(filename) as f:
r = f.readline().split()
dtype = [('', np.int32)]*3 + [('',np.float32)]*2
y = np.loadtxt(f, dtype=dtype)
y = y.view(np.dtype([('iter', np.int32), ('K', np.int32), ('untouched', np.int32),
('theta', np.float32), ('gamma', np.float32)]))
'''
iter = y['iter']
K = y['K']
untouch = y['untouch']
theta = y['theta']
gamma = y['gamma']
'''
# Set figure scale, font and similia
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(y[quantity1], 'bv-')
ax1.set_xlabel('iterations')
# Make the y-axis label and tick labels match the line color.
q1l = quantity1
if quantity1 == 'gamma':
q1l = r'$\Gamma$'
if quantity1 == 'theta':
q1l = r'$\theta$'
ax1.set_ylabel(q1l, color='b')
for tl in ax1.get_yticklabels():
tl.set_color('b')

if quantity2:
ax2 = ax1.twinx()
ax2.plot(y[quantity2], 'rx-')
q2l = quantity2
if quantity2 == 'gamma':
q2l = r'$\Gamma$'
if quantity2 == 'theta':
q2l = r'$\theta$'
ax2.set_ylabel(q2l, color='r')
for tl in ax2.get_yticklabels():
tl.set_color('r')

# save the figure
plt.title('Analysis of the run, file %s' % filename)
imtype = 'pdf'
if quantity2:
plt.savefig('run_%s_%s_%s.%s' % (filename, quantity1, quantity2, imtype), dpi=None, facecolor='w', edgecolor='w',\
orientation='portrait', papertype=None, format=imtype,\
transparent=False)
else:
plt.savefig('run_%s_%s.%s' % (filename, quantity1, imtype), dpi=None, facecolor='w', edgecolor='w',\
orientation='portrait', papertype=None, format=imtype,\
transparent=False)

def main():
''' What does a main do?
'''
import sys

args = sys.argv
poss_quant = ['theta', 'gamma', 'K', 'untouched']
try:
filename = args[1]
quantity1 = args[2]
except:
sys.exit('usage: plot_sampling.py file quantity1 [optional: quantity2] [%s]' % '|'.join(poss_quant))

try:
quantity2 = args[3]
if quantity2 not in poss_quant:
sys.exit('usage: plot_sampling.py file quantity1 [optional: quantity2] [%s]' % '|'.join(poss_quant))
except:
quantity2 = None

if quantity1 not in poss_quant:
sys.exit('usage: plot_sampling.py file quantity1 [optional: quantity2] [%s]' % '|'.join(poss_quant))

make_plot(filename, quantity1, quantity2)
if __name__ == "__main__":
main()
Loading

0 comments on commit 9e33e8e

Please sign in to comment.