First commit.

cbg-ethz · Jan 16, 2012 · 9e33e8e · 9e33e8e
1 parent e2a540b
commit 9e33e8e
Show file tree

Hide file tree

Showing 10 changed files with 1,996 additions and 0 deletions.
diff --git a/README b/README
@@ -0,0 +1,195 @@
+Full online documentation at:
+https://wiki-bsse.ethz.ch/display/ShoRAH/Documentation
+
+ShoRAH consists of several programs:
+
+dec.py       - local error correction based on diri_sampler
+diri_sampler - Gibbs sampling for Dirichlet process mixture
+contain.cc   - eliminate redundant reads
+mm.py        - maximum matching haplotype construction
+freqEst.cc   - EM algorithm for haplotype frequency
+fas2read.pl  - translates between two formats for read data files
+s2f.py       - multiple sequence alignment of NGS reads
+bam2msa.py   - extracts a multiple sequence alignment from a BAM file
+
+Copyright 2007, 2008, 2009, 2010
+Niko Beerenwinkel,
+Arnab Bhattacharya,
+Nicholas Eriksson,
+Moritz Gerstung,
+Lukas Geyrhofer,
+Osvaldo Zagordi,
+ETH Zurich
+under GPL. See file LICENSE for details.
+The program also includes other software written by third parties.
+This has been distributed according to the licenses provided by the
+respective authors.
+
+
+GENERAL USAGE:
+--------------------------------------------------
+
+Install:
+	type 'make' to build the C++ programs, then run. See INSTALL
+	for additional information.
+
+Run:
+	The whole process can be run one step after the other, or one can
+	invoke shorah.py, that runs the whole process from read	file to
+	frequency estimation. The local analysis alone can be run invoking
+	dec.py or directly diri_sampler. The input must be a fasta file 	with reads or a multiple alignment of the reads in fasta format
+	(.far extension). Shorah uses the program s2f to produce the
+	alignment, or the user copy its alignment to a file with the .far
+	extension and run shorah on it. We also provide a program to
+	extract a multiple sequence alignment from a BAM file.
+
+
+
+S2F.PY
+--------------------------------------------------
+Options:
+  -h, --help            show this help message and exit
+  -f INPUT, --readfile=INPUT
+  -r REF, --ref=REF     
+  -o O, --output=O      output suffix must be '.far' or none for stdout
+  -t THRESHOLD, --threshold=THRESHOLD
+                        if similarity is less, throw reads away...
+                        <default=0.7>
+  -d, --delete_files    delete temporary files <default=False>
+  -n, --no_pad_insert   do not insert padding gaps <default=insert>
+
+Output:
+	The program s2f.py takes as input the file of reads (in fasta
+	format) and the	reference genome. It produces a set of pairwise
+	alignments, then merge them to obtain a multiple one in a file
+	with far extension (file_name.far). This is the input to the
+	subsequent analysis performed with dec.py.
+
+Examples:
+	A sample file has been included for testing purposes,
+	sample_454.fasta and the reference ref_genome.fasta.
+	sample_454.fasta consists of 1000 reads of mean length equal
+	to 310 bp, derived from two haplotypes mixed in 80:20
+	proportion.
+	We recommend moving both files to a different working directory
+	and follow the rule of thumb: "one experiment in one directory".
+
+	We assume shorah was downloaded in the directory
+	path_to_sho_bin/. The far file can be created  from the fasta file
+	and the reference by invoking
+	[user@host]$ path_to_sho_bin/shorah-0.4/s2f.py -f sample_454.fasta -r ref_genome.fasta -o sample_454.far
+
+
+
+
+
+
+
+DEC.PY:
+--------------------------------------------------
+Options:
+  -h, --help            show this help message and exit
+  -f F, --readsfile=F   file with reads <.far format>
+  -j J, --iterations=J  iterations in dpm sampling <2000>
+  -a A, --alpha=A       alpha in dpm sampling <0.01>
+  -w W, --windowsize=W  window size <201>
+  -s S, --winshifts=S   number of window shifts <3>
+  -k, --keep_files      keep all intermediate files of diri_sampler
+                        <default=False>
+  -r REF, --ref=REF     reference genome
+  -t THRESHOLD, --threshold=THRESHOLD
+                        if similarity is less, throw reads away...
+                        <default=0.7>
+  -d, --delete_s2f_files
+                        delete temporary align files of s2f.py <default=False>
+  -n, --no_pad_insert   do not insert padding gaps in .far
+                        file<default=insert>
+
+
+Other options
+-------------
+These parameters are controlled by the global variables declared at the beginning of file dec.py. See the relative comments for an explanation.
+IMPORTANT: Since version 0.3 dec.py runs in parallel on multi-processor
+computers. By default it uses all available CPUs, if you want to limit the
+number of parallel processes, edit the variable max_proc in dec.py
+
+Tips:
+	Rerunning on selected windows:
+	When running with option -k, dec.py produces and retains several
+	files in the subdirectory "corrected/" of the form
+	w1-99.reads-cor.fas.gz, where 1 and 99 are, respectively, the
+	beginning and the end of the window under consideration. If the
+	user wishes to rerun dec.py on selected windows only, one should
+	simply delete the corresponding file in the directory "corrected/"
+	NOTE: The error correction implemented in diri_sampler will not run 	for any window whose corresponding file is present in the directory
+	"corrected/".
+
+	Changing parameters on the fly:
+	If the user wishes to change the number of iterations of
+	diri_sampler (because the burn-in phase is shorter or longer
+	than expected) one can write a file w1-99.iter (again 1 and 99
+	are the boundaries of the window) with a single line that
+	contains the desired number of iterations. Similarly, one can
+	change alpha by writing into a file w1-99.alpha a float with the
+	new value of alpha. While alpha can be increased and decreased by
+	editing the .alpha file without adversely affecting the
+	functioning of the program (but obviously its result), we
+	recommend changing the number of iterations at most once.
+
+	Maximum weight of read objects:
+	The new implementation of diri_sampler clubs identical reads into
+	objects defined by the read sequence and the number of reads in
+	the object. This allows for faster computation, but might
+	introduce a bias into the sampling procedure. One can limit the
+	maximum number of identical reads in a single object by	editing
+	the variable LIMIT in dpm_src/dpm_sampler.h (default is 10000)
+
+Output:
+	The program dec.py outputs several files. The file sample_cor.fas,	which contains the corrected reads, is passed by shorah (or by the	user) to the following step of the global analysis, fas2read.pl.	dec.py also outputs a file called proposed.dat. It contains the
+	number of proposed new clusters in each window. One should always	make sure that a sufficient number of new cluster has been proposed 	(rule of thumb, at least 0.1 per step or more). When the program is 	run with the -k option, several directories are created. They
+	contain zipped files that report information on the error
+	correction procedure. For a full explanation, see
+	http://wiki-bsse.ethz.ch/display/ShoRAH/Local+analysis
+
+
+
+SHORAH.PY:
+--------------------------------------------------
+
+
+Options:
+  -h, --help            show this help message and exit
+  -f F, --readsfile=F   file with reads <.fas or .far format>
+  -j J, --iterations=J  iterations in dpm sampling <1000>
+  -a A, --alpha=A       alpha in dpm sampling <0.01>
+  -w W, --windowsize=W  window size in <201>
+  -t THRESHOLD, --threshold=THRESHOLD
+                        if similarity is less, throw reads away...
+                        <default=0.7>
+  -n, --no_pad_insert   do not insert padding gaps <default=insert>
+  -r REF, --ref=REF     
+  -s S, --winshifts=S   number of window shifts <3>
+  -k, --keep_files      keep intermediate files (Gibbs sampling)
+                        <default=False>
+
+
+Examples:
+	How to run shorah on the example file described above.
+	[user@host]$ path_to_sho_bin/shorah-0.4/shorah.py -f sample_454.fasta -r ref_genome.fasta -j 1000 -w 90 -a 0.1 -k &> global.log
+
+
+
+Output:
+	The final result of the analysis is in the file
+	sample_454_cor.popl, the first lines of which are reported.
+	>HAP0_0.708656
+	TC-AA--A--TCACTCTTTGGCAACGACCCCTTGTC-AC.........[line truncated]
+	>HAP1_0.194
+	TC-TG--A--TCACTCTTTGGCAGCGACCCCTCGTC-AC.........[line truncated]
+
+	This file contains the sequences of the reconstructed haplotypes
+	(reconstructed by the program mm.py) and their frequencies
+	(estimated by the program freqEst. The name of each sentence is in
+	the format HAPn_freq where n is an ordinal number and freq is the
+	frequency. Haplotypes are sorted in descending order according to
+	their frequencies.
diff --git a/plot_sampling.py b/plot_sampling.py
@@ -0,0 +1,110 @@
+#!/usr/bin/env python
+
+# Copyright 2007, 2008, 2009
+# Niko Beerenwinkel,
+# Nicholas Eriksson,
+# Moritz Gerstung,
+# Lukas Geyrhofer,
+# Osvaldo Zagordi,
+# ETH Zurich
+
+# This file is part of ShoRAH.
+# ShoRAH is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# ShoRAH is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with ShoRAH.  If not, see <http://www.gnu.org/licenses/>.
+
+def make_plot(filename, quantity1, quantity2=None):
+    ''' Plot quantity for ds-out file
+    '''
+    import matplotlib
+    matplotlib.use('pdf')
+    import matplotlib.pyplot as plt
+    import numpy as np
+
+    with open(filename) as f:
+        r = f.readline().split()
+        dtype = [('', np.int32)]*3 + [('',np.float32)]*2
+        y = np.loadtxt(f, dtype=dtype)
+        y = y.view(np.dtype([('iter', np.int32), ('K', np.int32), ('untouched', np.int32),
+                             ('theta', np.float32), ('gamma', np.float32)]))
+        '''
+        iter = y['iter']
+        K = y['K']
+        untouch = y['untouch']
+        theta = y['theta']
+        gamma = y['gamma']
+        ''' 
+    # Set figure scale, font and similia
+    fig = plt.figure()
+    ax1 = fig.add_subplot(111)
+    ax1.plot(y[quantity1], 'bv-')
+    ax1.set_xlabel('iterations')
+    # Make the y-axis label and tick labels match the line color.
+    q1l = quantity1
+    if quantity1 == 'gamma':
+        q1l = r'$\Gamma$'
+    if quantity1 == 'theta':
+        q1l = r'$\theta$'
+    ax1.set_ylabel(q1l, color='b')
+    for tl in ax1.get_yticklabels():
+        tl.set_color('b')
+
+    if quantity2:
+        ax2 = ax1.twinx()
+        ax2.plot(y[quantity2], 'rx-')
+        q2l = quantity2
+        if quantity2 == 'gamma':
+            q2l = r'$\Gamma$'
+        if quantity2 == 'theta':
+            q2l = r'$\theta$'
+        ax2.set_ylabel(q2l, color='r')
+        for tl in ax2.get_yticklabels():
+            tl.set_color('r')
+
+    # save the figure
+    plt.title('Analysis of the run, file %s' % filename)
+    imtype = 'pdf'
+    if quantity2:
+        plt.savefig('run_%s_%s_%s.%s' % (filename, quantity1, quantity2, imtype), dpi=None, facecolor='w', edgecolor='w',\
+                        orientation='portrait', papertype=None, format=imtype,\
+                        transparent=False)
+    else:
+        plt.savefig('run_%s_%s.%s' % (filename, quantity1, imtype), dpi=None, facecolor='w', edgecolor='w',\
+                        orientation='portrait', papertype=None, format=imtype,\
+                        transparent=False)
+
+def main():
+    ''' What does a main do?
+    '''
+    import sys
+
+    args = sys.argv
+    poss_quant = ['theta', 'gamma', 'K', 'untouched']
+    try:
+        filename = args[1]
+        quantity1 = args[2]
+    except:
+        sys.exit('usage: plot_sampling.py file quantity1 [optional: quantity2] [%s]' % '|'.join(poss_quant))
+
+    try:
+        quantity2 = args[3]
+        if quantity2 not in poss_quant:
+            sys.exit('usage: plot_sampling.py file quantity1 [optional: quantity2] [%s]' % '|'.join(poss_quant))
+    except:
+        quantity2 = None
+
+    if quantity1 not in poss_quant:
+        sys.exit('usage: plot_sampling.py file quantity1 [optional: quantity2] [%s]' % '|'.join(poss_quant))
+
+    make_plot(filename, quantity1, quantity2)
+if __name__ == "__main__":
+    main()