dnanexus / samtools

This is a fork of samtools maintained (without warranty) by DNAnexus. See COPYING.

The original samtools package has been split into three separate but tightly coordinated projects:

  • htslib: C-library for handling high-throughput sequencing data
  • samtools: mpileup and other tools for handling SAM, BAM, CRAM
  • bcftools: calling and other tools for handling VCF, BCF

This version of samtools incorporates RocksDB as a git submodule. RocksDB has several build dependencies; see facebook/rocksdb/INSTALL.md. Additionally, a development installation of jemalloc is required (available from most Linux package managers).

Then clone both repositories side by side and build:

git clone https://github.com/dnanexus/htslib.git
git clone https://github.com/dnanexus/samtools.git
make -C samtools

This will produce the executable samtools/samtools, which will probably depend on several shared libraries (check with ldd samtools/samtools).

Using rocksort

This section has specific instructions for using the rocksort subcommand; for a more general introduction, see our blog post. The basic invocation of samtools rocksort is similar to samtools sort:

Usage:   samtools rocksort [options] <in.bam> <out.prefix>

Options: -@ INT    number of sorting and compression threads [1]
         -m INT    max memory per thread; suffix K/M/G/T recognized [768M]

We recommend using at least four threads, with at least as many compute cores available to run them. The memory requirements should be similar to those of samtools sort with the same -@ and -m settings.

To enable background compactions, supply the option -s.

         -s INT    plan background compactions assuming this uncompressed
                   BAM data size; suffix K/M/G/T recognized [off]

See below for guidance on setting the value of this option.

Additional options:

         -o        final output to stdout
         -f        use <out.prefix> as full file name instead of prefix
         -n        sort by read name
         -k        keep RocksDB instead of deleting it when done
         -l INT    compression level, from 0 to 9 [-1]
         -u INT    unsort: shuffle the BAM using given random seed

As with samtools sort, the input BAM can be streamed through standard input by supplying - instead of <in.bam>. In fact, if the input is a compressed BAM file, it's usually faster to decompress it with pigz, like this:

cat <in.bam> | pigz -dc | samtools rocksort [options] - <out.prefix>
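
As a concrete illustration of the streaming invocation above, the pipeline can be wrapped in a small shell function. This is only a sketch: the function name rocksort_stream and the file names are hypothetical, and the defaults simply mirror the four-thread recommendation and the 768M default from the usage text.

```shell
# Hypothetical wrapper (not part of samtools): decompress a BAM with pigz and
# stream it into rocksort through standard input.
#   $1 = input BAM, $2 = output prefix,
#   $3 = threads (default 4), $4 = memory per thread (default 768M)
rocksort_stream() {
    pigz -dc "$1" | samtools rocksort -@ "${3:-4}" -m "${4:-768M}" - "$2"
}

# Example call (placeholder file names):
#   rocksort_stream unsorted.bam sorted 8 1G
```

Passing the input file directly to pigz is equivalent to the cat pipeline shown above; the wrapper only saves retyping the options.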

Estimating data size (-s) for background compactions

Note: background compactions are generally useful only when the dataset to be sorted is many times larger than provisioned memory. See our blog post introducing rocksort for more explanation.

To plan efficient background compactions, rocksort needs a rough estimate of the total uncompressed size of the BAM data. A simple rule of thumb is to quadruple the expected size of the final BAM file. For example, if you expect to produce a 125 GiB final BAM (roughly a deep human WGS), a size estimate of 500 GiB would work well. Note that sorted BAMs are substantially smaller than unsorted BAMs, since they're more compressible.
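
The quadrupling rule is simple enough to script. The sketch below assumes the expected final BAM size is given in whole GiB, and that the G suffix accepted by -s is close enough to GiB for a rough hint; the helper name is made up for illustration.

```shell
# Hypothetical helper: turn an expected final BAM size (in GiB) into a value
# for -s by applying the 4x rule of thumb.
size_hint_from_final_bam() {
    echo "$(( $1 * 4 ))G"
}

size_hint_from_final_bam 125   # prints 500G
```

The result could then be passed along as, for example, -s "$(size_hint_from_final_bam 125)".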

Alternatively, here's a bottom-up formula for the size of a single BAM alignment block (source), which you can multiply by the expected number of read alignments:

Block Size = 8*4 + ReadNameLength(including null) + CigarLength*4 + (ReadLength+1)/2 + ReadLength + TagLength

The estimate need not be fantastically accurate; +/- 20% or so is fine. If in doubt, overestimate. Once all the data are loaded, samtools rocksort will log some feedback about the accuracy of the hint to standard error.
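
Here is a worked example of the bottom-up formula as a shell sketch. The helper names and the input numbers (150 bp reads, 24-byte read names, one CIGAR operation, 60 bytes of tags, one billion alignments) are invented for illustration. CigarLength is interpreted here as the number of CIGAR operations, each of which occupies 4 bytes in BAM. The total is rounded up to whole GiB, in keeping with the advice to overestimate.

```shell
# Bytes in one BAM alignment block, following the formula above.
#   $1 = read name length (including the trailing null)
#   $2 = number of CIGAR operations
#   $3 = read length
#   $4 = total length of the optional tags, in bytes
bam_block_size() {
    echo $(( 8*4 + $1 + $2*4 + ($3 + 1)/2 + $3 + $4 ))
}

# Total size hint for -s: block size times alignment count, rounded up to GiB.
#   $1 = block size in bytes, $2 = expected number of alignments
total_size_hint() {
    gib=1073741824
    echo "$(( ($1 * $2 + gib - 1) / gib ))G"
}

bam_block_size 24 1 150 60        # prints 345
total_size_hint 345 1000000000    # prints 322G
```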

Setting the scratch directory

By default, temporary files are written to the same directory as the final output BAM, as with samtools sort. To use a different location, set the TMPDIR environment variable. The scratch directory should be on high-performance storage and should have free space of at least twice the expected size of the final BAM (4-5X if using background compactions).
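
A quick pre-flight check of the scratch directory might look like the sketch below. Both helper names are hypothetical, the path in the usage comment is a placeholder, and the 5X upper bound is used when background compactions are enabled so that the check errs on the safe side.

```shell
# Hypothetical helper: free scratch space needed, in GiB.
#   $1 = expected final BAM size in GiB
#   $2 = "yes" if background compactions (-s) are enabled (default: no)
scratch_needed_gib() {
    if [ "${2:-no}" = "yes" ]; then
        echo $(( $1 * 5 ))
    else
        echo $(( $1 * 2 ))
    fi
}

# Free space, in whole GiB, on the filesystem holding directory $1.
# df -Pk reports available space in 1024-byte blocks in the fourth column.
free_gib() {
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1048576) }'
}

# Example use before sorting (placeholder path):
#   export TMPDIR=/mnt/scratch
#   [ "$(free_gib "$TMPDIR")" -ge "$(scratch_needed_gib 125 yes)" ] \
#       || echo "not enough free space in $TMPDIR" >&2
```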