dnanexus / samtools

This is a fork of samtools maintained (without warranty) by DNAnexus. See COPYING.

The original samtools package has been split into three separate but tightly coordinated projects:

htslib: C-library for handling high-throughput sequencing data
samtools: mpileup and other tools for handling SAM, BAM, CRAM
bcftools: calling and other tools for handling VCF, BCF

This version of samtools incorporates RocksDB as a git submodule. RocksDB has several build dependencies; see facebook/rocksdb/INSTALL.md. Additionally, a development installation of jemalloc is required (available in most Linux package managers).

Once the dependencies are installed, clone and build:

git clone https://github.com/dnanexus/htslib.git
git clone https://github.com/dnanexus/samtools.git
make -C samtools

This will produce the executable samtools/samtools, which will probably depend on several shared libraries (try ldd samtools/samtools to list them).

Using rocksort

This section has specific instructions for using the rocksort subcommand; for a more general introduction, see our blog post. The basic invocation of samtools rocksort is similar to samtools sort:

Usage:   samtools rocksort [options] <in.bam> <out.prefix>

Options: -@ INT    number of sorting and compression threads [1]
         -m INT    max memory per thread; suffix K/M/G/T recognized [768M]

We recommend using at least four threads, on a machine with at least as many compute cores to run them. The memory requirements should be similar to those of samtools sort with the same -@ and -m settings.

To enable background compactions, supply the option -s.

         -s INT    plan background compactions assuming this uncompressed
                   BAM data size; suffix K/M/G/T recognized [off]

See below for guidance on setting the value of this option.

Additional options:

         -o        final output to stdout
         -f        use <out.prefix> as full file name instead of prefix
         -n        sort by read name
         -k        keep RocksDB instead of deleting it when done
         -l INT    compression level, from 0 to 9 [-1]
         -u INT    unsort: shuffle the BAM using given random seed

As with samtools sort, the input BAM can be streamed through standard input by supplying - instead of <in.bam>. In fact, if the input is a compressed BAM file, it's usually faster to decompress it with pigz and stream the result, like this:

cat <in.bam> | pigz -dc | samtools rocksort [options] - <out.prefix>
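For scripted runs, the same pipeline can be assembled programmatically. The following is a minimal Python sketch, not part of samtools: the helper name rocksort_pipeline and its parameters are hypothetical, and it uses only the flags documented above. It feeds pigz through shell redirection instead of cat, which is equivalent.

```python
import shlex


def rocksort_pipeline(in_bam, out_prefix, threads=4, mem="768M", size_hint=None):
    """Build a shell command line that decompresses in_bam with pigz and
    streams it into samtools rocksort on standard input ("-").

    Hypothetical helper: it only builds the string, it does not run anything.
    """
    cmd = ["samtools", "rocksort", "-@", str(threads), "-m", mem]
    if size_hint is not None:
        # -s enables background compactions (see the section on estimating data size).
        cmd += ["-s", size_hint]
    cmd += ["-", out_prefix]
    # shlex.quote / shlex.join protect paths that contain spaces or shell metacharacters.
    return f"pigz -dc < {shlex.quote(in_bam)} | {shlex.join(cmd)}"


print(rocksort_pipeline("in.bam", "out", threads=4, size_hint="500G"))
```

The resulting string can be passed to a shell, for example with subprocess.run(command, shell=True, check=True).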

Estimating data size (`-s`) for background compactions

Note: background compactions are generally useful only when the dataset to be sorted is many times larger than provisioned memory. See our blog post introducing rocksort for more explanation.

To plan efficient background compactions, rocksort needs a rough estimate of the total uncompressed size of the BAM data. A simple rule of thumb is to quadruple the expected size of the final BAM file. For example, if you expect to produce a 125 GiB final BAM (roughly a deep human WGS), a size estimate of 500 GiB works well. Note that sorted BAMs are substantially smaller than unsorted BAMs, since they're more compressible.
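The rule of thumb is simple arithmetic; here is a minimal Python sketch of it. The function names are hypothetical, and the sketch assumes the G suffix accepted by -s denotes GiB (2^30 bytes), which this README does not state explicitly.

```python
GIB = 1024 ** 3  # assumption: the "G" suffix means GiB


def estimate_uncompressed_size(final_bam_bytes, factor=4):
    """Rule of thumb from the text: uncompressed size is about 4x the final BAM."""
    return final_bam_bytes * factor


def to_size_hint(num_bytes):
    """Format a byte count as a whole number of G, rounding up (overestimate if in doubt)."""
    gib = -(-num_bytes // GIB)  # ceiling division on integers
    return f"{gib}G"


# The example from the text: a 125 GiB final BAM.
print(to_size_hint(estimate_uncompressed_size(125 * GIB)))  # 500G
```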

Alternatively, here's a bottom-up formula for the size of a single BAM alignment block (source), which you can multiply by the expected number of read alignments:

Block Size = 8*4 + ReadNameLength(including null) + CigarLength*4 + (ReadLength+1)/2 + ReadLength + TagLength

The estimate need not be fantastically accurate; +/- 20% or so is fine. If in doubt, overestimate. Once all the data are loaded, samtools rocksort will log some feedback about the accuracy of the hint to standard error.
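As an illustration of the bottom-up formula, here is a minimal Python sketch. The function name and the example numbers (read name length, CIGAR operations, read length, tag bytes, alignment count) are made up for illustration, and the division is treated as integer division.

```python
def bam_block_size(read_name_len, cigar_ops, read_len, tag_bytes):
    """Approximate uncompressed size in bytes of one BAM alignment block.

    read_name_len excludes the trailing NUL; one byte is added for it here.
    """
    return (
        8 * 4                   # fixed-size fields
        + (read_name_len + 1)   # read name including NUL terminator
        + cigar_ops * 4         # 4 bytes per CIGAR operation
        + (read_len + 1) // 2   # bases packed two per byte
        + read_len              # one quality byte per base
        + tag_bytes             # auxiliary tags
    )


# Hypothetical example: 30-character names, 1 CIGAR op, 150 bp reads, 40 tag bytes.
per_alignment = bam_block_size(30, 1, 150, 40)
total_bytes = per_alignment * 1_000_000_000  # assumed one billion alignments
print(per_alignment, total_bytes)
```

With these made-up inputs each block is 332 bytes, so one billion alignments come to about 332 GB of uncompressed data; rounding the hint up is consistent with the advice to overestimate.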

Setting the scratch directory

By default, temporary files are written into the same directory as the final output BAM, similar to samtools sort. You can override this by setting the TMPDIR environment variable to something else. The scratch directory should be high-performance storage, and should have at least twice the expected size of the final BAM (4-5X if using background compactions) in free space.
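Before launching a long sort, it can be worth checking that the scratch directory has enough free space. The following is a minimal Python sketch under the guidance above; the function names are hypothetical, the factor of 5 is the conservative end of the 4-5X range, and "." stands in for the output directory when TMPDIR is unset.

```python
import os
import shutil

GIB = 1024 ** 3


def required_scratch_bytes(final_bam_bytes, background_compactions=False):
    """Free space suggested by the text: 2x the final BAM, or up to 5x with -s."""
    factor = 5 if background_compactions else 2
    return final_bam_bytes * factor


def scratch_has_room(scratch_dir, final_bam_bytes, background_compactions=False):
    """Return True if scratch_dir currently has at least the suggested free space."""
    free = shutil.disk_usage(scratch_dir).free
    return free >= required_scratch_bytes(final_bam_bytes, background_compactions)


# rocksort uses TMPDIR if set; otherwise the output directory ("." here as a stand-in).
scratch = os.environ.get("TMPDIR", ".")
print(scratch, scratch_has_room(scratch, 125 * GIB, background_compactions=True))
```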
