Sandbox scripts

Scripts in this directory, 'sandbox', are various utility or trial scripts that we have not fully tested. They are also not under semantic versioning, so their functionality and command line arguments may change without notice.

We are still triaging and documenting the various scripts.

Awaiting promotion to scripts:

calc-error-profile.py - calculate a per-base "error profile" for shotgun sequencing data, w/o a reference. (Used/tested in 2014 paper on semi-streaming algorithms)
count-kmers.py - output k-mer counts for multiple input files.
count-kmers-single.py - output k-mer counts for a single k-mer file.
correct-errors.py - streaming error correction.
unique-kmers.py - estimate the number of k-mers present in a file with the HyperLogLog low-memory probabilistic cardinality estimation algorithm.
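For readers unfamiliar with the terminology, the following is a minimal plain-Python sketch of what "k-mer counts" and "number of distinct k-mers" mean for a single sequence. It does not use the khmer API and is not how the scripts above are implemented; the function name count_kmers and the choice to skip k-mers containing N are assumptions made for illustration. Counting here is exact and keeps every k-mer in memory, whereas unique-kmers.py estimates the distinct count with HyperLogLog precisely to avoid that memory cost.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Return a Counter of every overlapping k-mer in `sequence`.

    Hypothetical helper for illustration only; not part of khmer.
    """
    sequence = sequence.upper()
    counts = Counter()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if "N" in kmer:
            # Assumption: skip k-mers that contain an ambiguous base.
            continue
        counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGTAC", 4)
distinct = len(counts)           # exact number of distinct k-mers
total = sum(counts.values())     # total number of k-mers counted
```

For the example sequence, counts holds four distinct 4-mers out of seven in total. On real shotgun data an exact dictionary like this grows with the number of distinct k-mers, which is the motivation for the low-memory estimate mentioned above.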

Scripts with recipes:

calc-median-distribution.py - plot coverage distribution; see khmer-recipes #1
collect-reads.py - subsample reads until a particular average coverage; see khmer-recipes #2
saturate-by-median.py - calculate collector's curve on shotgun sequencing; see khmer-recipes #4
slice-reads-by-coverage.py - extract reads based on coverage; see khmer-recipes #1

To keep, document, and build recipes for:

make-coverage.py - RPKM calculation script
assemstats3.py - print out assembly statistics
build-sparse-graph.py - code for building a sparse graph (by Camille Scott)
calc-best-assembly.py - calculate the "best assembly" - used in metagenome protocol
collect-variants.py - used in a gist
extract-single-partition.py - extract all the sequences that belong to a specific partition, from a file with multiple partitions
filter-below-abund.py - like filter-abund, but trim off high-abundance k-mers
filter-median-and-pct.py - see blog post on Trinity in silico norm (http://ivory.idyll.org/blog/trinity-in-silico-normalize.html)
filter-median.py - see blog post on Trinity in silico norm (http://ivory.idyll.org/blog/trinity-in-silico-normalize.html)
graph-size.py - filter reads based on size of connected graph
memusg - memory usage analysis
multi-rename.py - rename sequences from multiple files with a common prefix
normalize-by-median-pct.py - see blog post on Trinity in silico norm (http://ivory.idyll.org/blog/trinity-in-silico-normalize.html)
print-stoptags.py - print out the stoptag k-mers
print-tagset.py - print out the tagset k-mers
renumber-partitions.py - systematically renumber partitions
shuffle-reverse-rotary.py - FASTA file shuffler for larger FASTA files
split-fasta.py - break a FASTA file up into smaller chunks
stoptags-by-position.py - print out where stoptags tend to occur
strip-partition.py - clear off partition information
subset-report.py - report stats on pmap files
sweep-files.py - various ways to extract reads based on k-mer overlap
sweep-out-reads-with-contigs.py - various ways to extract reads based on k-mer overlap
sweep-reads.py - various ways to extract reads based on k-mer overlap
sweep-reads2.py - various ways to extract reads based on k-mer overlap
sweep-reads3.py - various ways to extract reads based on k-mer overlap
write-trimmomatic.py - used to build Trimmomatic command lines in khmer-protocols

Good ideas to rewrite using newer tools/approaches:

assembly-diff.py - find sequences that differ between two assemblies
assembly-diff-2.py - find subsequences that differ between two assemblies
bloom-count.py - count # of unique k-mers; should be reimplemented with HyperLogLog. Renamed from bloom_count.py in commit 4788c31.
split-sequences-by-length.py - break up short reads by length

Present in commit ff7f047b5b0d9acb6c1eb73d54cfd39c9e3d1393 but removed thereafter:

abundance-hist-by-position.py - look at abundance of k-mers by position within read; use with fasta-to-abundance-hist.py
fasta-to-abundance-hist.py - generate abundance of k-mers by position within reads; use with abundance-hist-by-position.py
find-high-abund-kmers.py - extract high-abundance k-mers into a list
hi-lo-abundance-by-position.py - look at high and low-abundance k-mers by position within read
stoptag-abundance-hist.py - print out abundance histogram of stoptags

Present in commit 19b0a09353cddc45070edcf1283cae2c83c13b0e but removed thereafter:

bloom-count-intersection.py - look at unique and disjoint #s of k-mers. Renamed from bloom_count_intersection.py in commit 4788c31.

Present in commit d295bc847 but removed thereafter:

combine-pe.py - combine partitions based on shared PE reads.
compare-partitions.py - compare read membership in partitions.
count-within-radius.py - calculating graph density by position within seq
degree-by-position.py - calculating graph degree by position in seq
dn-identify-errors.py - prototype script to identify errors in reads based on diginorm principles
ec.py - new error correction foo
error-correct-pass2.py - new error correction foo
find-unpart.py - something to do with finding unpartitioned sequences
normalize-by-align.py - new error correction foo
read_aligner.py - new error correction foo
shuffle-fasta.py - FASTA file shuffler for small FASTA files
to-casava-1.8-fastq.py - convert reads to different Casava format
uniqify-sequences.py - print out paths that are unique in the graph
write-interleave.py - is this used by any protocol etc?

Present in commit 691b0b3ae but removed thereafter:

annotate-with-median-count.py - replaced by count-median.py
assemble-individual-partitions.py - better done with parallel
assemstats.py - statistics gathering; see assemstats3.py.
assemstats2.py - statistics gathering; see assemstats3.py.
abund-ablate-reads.py - trim reads of high abundance k-mers.
bench-graphsize-orig.py - benchmarking script for graphsize elimination
bench-graphsize-th.py - benchmarking script for graphsize elimination
bin-reads-by-abundance.py - see slice-reads-by-coverage.py
bowtie-parser.py - parse bowtie map file
calc-degree.py - various k-mer statistics
calc-kmer-partition-counts.py - various k-mer statistics
calc-kmer-read-abunds.py - various k-mer statistics
calc-kmer-read-stats.py - various k-mer statistics
calc-kmer-to-partition-ratio.py - various k-mer statistics
calc-sequence-entropy.py - calculate per-sequence entropy
choose-largest-assembly.py - see calc-best-assembly.py
consume-and-traverse.py - replaced by load-graph.py
contig-coverage.py - calculate coverage of contigs by k-mers
count-circum-by-position.py - k-mer graph statistics by position within read
count-density-by-position.py - k-mer graph stats by position within read
count-distance-to-volume.py - k-mer stats from graph
count-median-abund-by-partition.py - count median k-mer abundance by partition
count-shared-kmers-btw-assemblies.py - count shared k-mers between assemblies
ctb-iterative-bench-2-old.py - old benchmarking code
ctb-iterative-bench.py - old benchmarking code
discard-high-abund.py - discard reads by coverage; see slice-reads-by-coverage.py
discard-pre-high-abund.py - discard reads by coverage; see slice-reads-by-coverage.py
do-intertable-part.py - unused partitioning method
do-partition-2.py - replaced by scripts/do-partition.py
do-partition-stop.py - replaced by scripts/do-partition.py
do-partition.py - moved to scripts/
do-subset-merge.py - replaced by scripts/merge-partitions.py
do-th-subset-calc.py - unused benchmarking scripts
do-th-subset-load.py - unused benchmarking scripts
do-th-subset-save.py - unused benchmarking scripts
extract-surrender.py - no longer used partitioning feature
extract-with-median-count.py - see slice-reads-by-coverage.py
fasta-to-fastq.py - just a bad idea
filter-above-median.py - replaced by filter-below-abund.py
filter-abund-output-by-length.py - replaced by filter-abund/filter-below-abund
filter-area.py - trim highly connected k-mers
filter-degree.py - trim highly connected k-mers
filter-density-explosion.py - trim highly connected k-mers
filter-if-present.py - replaced by filter-abund and others
filter-max255.py - remove reads w/high-abundance k-mers.
filter-min2-multi.py - remove reads w/low-abundance k-mers
filter-sodd.py - no longer used partitioning feature
filter-subsets-by-partsize.py - deprecated way to filter out partitions by size
get-occupancy.py - utility script no longer needed
get-occupancy2.py - utility script no longer needed
graph-partition-separate.py - deprecated graph partitioning stuff
graph-size-circum-trim.py - experimental mods to graph-size.py
graph-size-degree-trim.py - experimental mods to graph-size.py
graph-size-py.py - experimental mods to graph-size.py
join_pe.py - silly attempts to deal with PE interleaving?
keep-stoptags.py - trim at stoptags
label-pairs.py - deprecated PE fixing script
length-dist.py - deprecated length distribution calc script
load-ht-and-tags.py - load and examine hashtable & tags
multi-abyss.py - better done with parallel
make-coverage-by-position-for-node.py - deprecated coverage calculation
make-coverage-histogram.py - build coverage histograms
make-random.py - make random DNA; see dbg-graph-null project.
make-read-stats.py - see readstats.py
multi-stats.py - see readstats.py
multi-velvet.py - better done with parallel
normalize-by-min.py - normalize by min k-mer abundance in seq; just a bad idea
occupy.py - no longer needed utility script
parse-bowtie-pe.py - no longer needed utility script
parse-stats.py - partition stats
partition-by-contig.py - various approaches to partitioning
partition-by-contig2.py - various approaches to partitioning
partition-size-dist-running.py - various approaches to partitioning
partition-size-dist.py - various approaches to partitioning
path-compare-to-vectors.py - ??
print-exact-abund-kmer.py - ??
print-high-density-kmers.py - display high abundance k-mers
quality-trim-pe.py - no longer needed utility script
quality-trim.py - no longer needed utility script
reformat.py - FASTA sequence description line reformatter for partitioned files
remove-N.py - eliminate sequences that have Ns in them
softmask-high-abund.py - softmask high abundance sequences (convert ACGT to acgt)
split-fasta-on-circum.py - various ways of breaking sequences on graph properties
split-fasta-on-circum2.py - various ways of breaking sequences on graph properties
split-fasta-on-circum3.py - various ways of breaking sequences on graph properties
split-fasta-on-circum4.py - various ways of breaking sequences on graph properties
split-fasta-on-degree-th.py - various ways of breaking sequences on graph properties
split-fasta-on-degree.py - various ways of breaking sequences on graph properties
split-fasta-on-density.py - various ways of breaking sequences on graph properties
split-N.py - truncate sequences on N
split-reads-on-median-diff.py - various ways of breaking sequences on graph properties
summarize.py - sequence stats calculator
sweep_perf.py - benchmarking tool
test_scripts.py - old test file
traverse-contigs.py - deprecated graph traversal stuff
traverse-from-reads.py - deprecated graph traversal stuff
validate-partitioning.py - unneeded test
