Permalink
Browse files

First commit.

  • Loading branch information...
1 parent e2a540b commit 9e33e8e5a4d0808079acf1f8cf53b316d85329ea @ozagordi ozagordi committed Jan 16, 2012
Showing with 1,996 additions and 0 deletions.
  1. +195 −0 README
  2. +110 −0 plot_sampling.py
  3. +215 −0 plot_stat.py
  4. +400 −0 pythonlib/MarkxIO.py
  5. +361 −0 pythonlib/SFFParser.py
  6. 0 pythonlib/__init__.py
  7. BIN pythonlib/__init__.pyc
  8. +79 −0 pythonlib/matching.py
  9. +7 −0 ref_genome.fasta
  10. +629 −0 s2f.py
View
195 README
@@ -0,0 +1,195 @@
+Full online documentation at:
+https://wiki-bsse.ethz.ch/display/ShoRAH/Documentation
+
+ShoRAH consists of several programs:
+
+dec.py - local error correction based on diri_sampler
+diri_sampler - Gibbs sampling for Dirichlet process mixture
+contain.cc - eliminate redundant reads
+mm.py - maximum matching haplotype construction
+freqEst.cc - EM algorithm for haplotype frequency
+fas2read.pl - translates between two formats for read data files
+s2f.py - multiple sequence alignment of NGS reads
+bam2msa.py - extracts a multiple sequence alignment from a BAM file
+
+Copyright 2007, 2008, 2009, 2010
+Niko Beerenwinkel,
+Arnab Bhattacharya,
+Nicholas Eriksson,
+Moritz Gerstung,
+Lukas Geyrhofer,
+Osvaldo Zagordi,
+ETH Zurich
+under GPL. See file LICENSE for details.
+The program also includes other software written by third parties.
+This has been distributed according to the licenses provided by the
+respective authors.
+
+
+GENERAL USAGE:
+--------------------------------------------------
+
+Install:
+ type 'make' to build the C++ programs, then run. See INSTALL
+ for additional information.
+
+Run:
+ The whole process can be run one step after the other, or one can
+ invoke shorah.py, that runs the whole process from read file to
+ frequency estimation. The local analysis alone can be run invoking
+ dec.py or directly diri_sampler. The input must be a fasta file with reads or a multiple alignment of the reads in fasta format
+ (.far extension). Shorah uses the program s2f to produce the
+ alignment, or the user copy its alignment to a file with the .far
+ extension and run shorah on it. We also provide a program to
+ extract a multiple sequence alignment from a BAM file.
+
+
+
+S2F.PY
+--------------------------------------------------
+Options:
+ -h, --help show this help message and exit
+ -f INPUT, --readfile=INPUT
+ -r REF, --ref=REF
+ -o O, --output=O output suffix must be '.far' or none for stdout
+ -t THRESHOLD, --threshold=THRESHOLD
+ if similarity is less, throw reads away...
+ <default=0.7>
+ -d, --delete_files delete temporary files <default=False>
+ -n, --no_pad_insert do not insert padding gaps <default=insert>
+
+Output:
+ The program s2f.py takes as input the file of reads (in fasta
+ format) and the reference genome. It produces a set of pairwise
+ alignments, then merge them to obtain a multiple one in a file
+ with far extension (file_name.far). This is the input to the
+ subsequent analysis performed with dec.py.
+
+Examples:
+ A sample file has been included for testing purposes,
+ sample_454.fasta and the reference ref_genome.fasta.
+ sample_454.fasta consists of 1000 reads of mean length equal
+ to 310 bp, derived from two haplotypes mixed in 80:20
+ proportion.
+ We recommend moving both files to a different working directory
+ and follow the rule of thumb: "one experiment in one directory".
+
+ We assume shorah was downloaded in the directory
+ path_to_sho_bin/. The far file can be created from the fasta file
+ and the reference by invoking
+ [user@host]$ path_to_sho_bin/shorah-0.4/s2f.py -f sample_454.fasta -r ref_genome.fasta -o sample_454.far
+
+
+
+
+
+
+
+DEC.PY:
+--------------------------------------------------
+Options:
+ -h, --help show this help message and exit
+ -f F, --readsfile=F file with reads <.far format>
+ -j J, --iterations=J iterations in dpm sampling <2000>
+ -a A, --alpha=A alpha in dpm sampling <0.01>
+ -w W, --windowsize=W window size <201>
+ -s S, --winshifts=S number of window shifts <3>
+ -k, --keep_files keep all intermediate files of diri_sampler
+ <default=False>
+ -r REF, --ref=REF reference genome
+ -t THRESHOLD, --threshold=THRESHOLD
+ if similarity is less, throw reads away...
+ <default=0.7>
+ -d, --delete_s2f_files
+ delete temporary align files of s2f.py <default=False>
+ -n, --no_pad_insert do not insert padding gaps in .far
+ file<default=insert>
+
+
+Other options
+-------------
+These parameters are controlled by the global variables declared at the beginning of file dec.py. See the relative comments for an explanation.
+IMPORTANT: Since version 0.3 dec.py runs in parallel on multi-processor
+computers. By default it uses all available CPUs, if you want to limit the
+number of parallel processes, edit the variable max_proc in dec.py
+
+Tips:
+ Rerunning on selected windows:
+ When running with option -k, dec.py produces and retains several
+ files in the subdirectory "corrected/" of the form
+ w1-99.reads-cor.fas.gz, where 1 and 99 are, respectively, the
+ beginning and the end of the window under consideration. If the
+ user wishes to rerun dec.py on selected windows only, one should
+ simply delete the corresponding file in the directory "corrected/"
+ NOTE: The error correction implemented in diri_sampler will not run for any window whose corresponding file is present in the directory
+ "corrected/".
+
+ Changing parameters on the fly:
+ If the user wishes to change the number of iterations of
+ diri_sampler (because the burn-in phase is shorter or longer
+ than expected) one can write a file w1-99.iter (again 1 and 99
+ are the boundaries of the window) with a single line that
+ contains the desired number of iterations. Similarly, one can
+ change alpha by writing into a file w1-99.alpha a float with the
+ new value of alpha. While alpha can be increased and decreased by
+ editing the .alpha file without adversely affecting the
+ functioning of the program (but obviously its result), we
+ recommend changing the number of iterations at most once.
+
+ Maximum weight of read objects:
+ The new implementation of diri_sampler clubs identical reads into
+ objects defined by the read sequence and the number of reads in
+ the object. This allows for faster computation, but might
+ introduce a bias into the sampling procedure. One can limit the
+ maximum number of identical reads in a single object by editing
+ the variable LIMIT in dpm_src/dpm_sampler.h (default is 10000)
+
+Output:
+ The program dec.py outputs several files. The file sample_cor.fas, which contains the corrected reads, is passed by shorah (or by the user) to the following step of the global analysis, fas2read.pl. dec.py also outputs a file called proposed.dat. It contains the
+ number of proposed new clusters in each window. One should always make sure that a sufficient number of new cluster has been proposed (rule of thumb, at least 0.1 per step or more). When the program is run with the -k option, several directories are created. They
+ contain zipped files that report information on the error
+ correction procedure. For a full explanation, see
+ http://wiki-bsse.ethz.ch/display/ShoRAH/Local+analysis
+
+
+
+SHORAH.PY:
+--------------------------------------------------
+
+
+Options:
+ -h, --help show this help message and exit
+ -f F, --readsfile=F file with reads <.fas or .far format>
+ -j J, --iterations=J iterations in dpm sampling <1000>
+ -a A, --alpha=A alpha in dpm sampling <0.01>
+ -w W, --windowsize=W window size in <201>
+ -t THRESHOLD, --threshold=THRESHOLD
+ if similarity is less, throw reads away...
+ <default=0.7>
+ -n, --no_pad_insert do not insert padding gaps <default=insert>
+ -r REF, --ref=REF
+ -s S, --winshifts=S number of window shifts <3>
+ -k, --keep_files keep intermediate files (Gibbs sampling)
+ <default=False>
+
+
+Examples:
+ How to run shorah on the example file described above.
+ [user@host]$ path_to_sho_bin/shorah-0.4/shorah.py -f sample_454.fasta -r ref_genome.fasta -j 1000 -w 90 -a 0.1 -k &> global.log
+
+
+
+Output:
+ The final result of the analysis is in the file
+ sample_454_cor.popl, the first lines of which are reported.
+ >HAP0_0.708656
+ TC-AA--A--TCACTCTTTGGCAACGACCCCTTGTC-AC.........[line truncated]
+ >HAP1_0.194
+ TC-TG--A--TCACTCTTTGGCAGCGACCCCTCGTC-AC.........[line truncated]
+
+ This file contains the sequences of the reconstructed haplotypes
+ (reconstructed by the program mm.py) and their frequencies
+ (estimated by the program freqEst. The name of each sentence is in
+ the format HAPn_freq where n is an ordinal number and freq is the
+ frequency. Haplotypes are sorted in descending order according to
+ their frequencies.
View
@@ -0,0 +1,110 @@
+#!/usr/bin/env python
+
+# Copyright 2007, 2008, 2009
+# Niko Beerenwinkel,
+# Nicholas Eriksson,
+# Moritz Gerstung,
+# Lukas Geyrhofer,
+# Osvaldo Zagordi,
+# ETH Zurich
+
+# This file is part of ShoRAH.
+# ShoRAH is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# ShoRAH is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with ShoRAH. If not, see <http://www.gnu.org/licenses/>.
+
+def make_plot(filename, quantity1, quantity2=None):
+ ''' Plot quantity for ds-out file
+ '''
+ import matplotlib
+ matplotlib.use('pdf')
+ import matplotlib.pyplot as plt
+ import numpy as np
+
+ with open(filename) as f:
+ r = f.readline().split()
+ dtype = [('', np.int32)]*3 + [('',np.float32)]*2
+ y = np.loadtxt(f, dtype=dtype)
+ y = y.view(np.dtype([('iter', np.int32), ('K', np.int32), ('untouched', np.int32),
+ ('theta', np.float32), ('gamma', np.float32)]))
+ '''
+ iter = y['iter']
+ K = y['K']
+ untouch = y['untouch']
+ theta = y['theta']
+ gamma = y['gamma']
+ '''
+ # Set figure scale, font and similia
+ fig = plt.figure()
+ ax1 = fig.add_subplot(111)
+ ax1.plot(y[quantity1], 'bv-')
+ ax1.set_xlabel('iterations')
+ # Make the y-axis label and tick labels match the line color.
+ q1l = quantity1
+ if quantity1 == 'gamma':
+ q1l = r'$\Gamma$'
+ if quantity1 == 'theta':
+ q1l = r'$\theta$'
+ ax1.set_ylabel(q1l, color='b')
+ for tl in ax1.get_yticklabels():
+ tl.set_color('b')
+
+ if quantity2:
+ ax2 = ax1.twinx()
+ ax2.plot(y[quantity2], 'rx-')
+ q2l = quantity2
+ if quantity2 == 'gamma':
+ q2l = r'$\Gamma$'
+ if quantity2 == 'theta':
+ q2l = r'$\theta$'
+ ax2.set_ylabel(q2l, color='r')
+ for tl in ax2.get_yticklabels():
+ tl.set_color('r')
+
+ # save the figure
+ plt.title('Analysis of the run, file %s' % filename)
+ imtype = 'pdf'
+ if quantity2:
+ plt.savefig('run_%s_%s_%s.%s' % (filename, quantity1, quantity2, imtype), dpi=None, facecolor='w', edgecolor='w',\
+ orientation='portrait', papertype=None, format=imtype,\
+ transparent=False)
+ else:
+ plt.savefig('run_%s_%s.%s' % (filename, quantity1, imtype), dpi=None, facecolor='w', edgecolor='w',\
+ orientation='portrait', papertype=None, format=imtype,\
+ transparent=False)
+
+def main():
+ ''' What does a main do?
+ '''
+ import sys
+
+ args = sys.argv
+ poss_quant = ['theta', 'gamma', 'K', 'untouched']
+ try:
+ filename = args[1]
+ quantity1 = args[2]
+ except:
+ sys.exit('usage: plot_sampling.py file quantity1 [optional: quantity2] [%s]' % '|'.join(poss_quant))
+
+ try:
+ quantity2 = args[3]
+ if quantity2 not in poss_quant:
+ sys.exit('usage: plot_sampling.py file quantity1 [optional: quantity2] [%s]' % '|'.join(poss_quant))
+ except:
+ quantity2 = None
+
+ if quantity1 not in poss_quant:
+ sys.exit('usage: plot_sampling.py file quantity1 [optional: quantity2] [%s]' % '|'.join(poss_quant))
+
+ make_plot(filename, quantity1, quantity2)
+if __name__ == "__main__":
+ main()
Oops, something went wrong.

0 comments on commit 9e33e8e

Please sign in to comment.