Sandbox scripts

Scripts in this directory, 'sandbox', are utility or trial scripts that we have not fully tested. They are also not under semantic versioning, so their functionality and command-line arguments may change without notice.

We are still triaging and documenting the various scripts.


Awaiting promotion to scripts:

  • calc-error-profile.py - calculate a per-base "error profile" for shotgun sequencing data, without a reference. (Used/tested in 2014 paper on semi-streaming algorithms)
  • count-kmers.py - output k-mer counts for multiple input files.
  • count-kmers-single.py - output k-mer counts for a single k-mer file.
  • correct-errors.py - streaming error correction.
  • unique-kmers.py - estimate the number of k-mers present in a file with the HyperLogLog low-memory probabilistic cardinality estimation algorithm.
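
As a rough illustration of what the k-mer counting scripts above report, here is a minimal pure-Python sketch. It does not use khmer's API or data structures; `count_kmers` is a hypothetical helper name, and the sketch counts k-mers exactly in memory and ignores reverse complements as a simplifying assumption.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-length substring across the given sequences.

    Hypothetical helper for illustration only; not part of khmer.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A sequence of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


counts = count_kmers(["ATGGCATG", "GGCAT"], k=3)
unique_kmers = len(counts)  # exact cardinality; HyperLogLog would estimate this
```

An exact counter like this keeps every distinct k-mer in memory, which is why a probabilistic estimator such as HyperLogLog is attractive for large files.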

Scripts with recipes:

  • calc-median-distribution.py - plot coverage distribution; see khmer-recipes #1
  • collect-reads.py - subsample reads until a particular average coverage is reached; see khmer-recipes #2
  • saturate-by-median.py - calculate a collector's curve for shotgun sequencing data; see khmer-recipes #4
  • slice-reads-by-coverage.py - extract reads based on coverage; see khmer-recipes #1
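
Several of the scripts above work with per-read coverage. One common way to estimate it is the median k-mer count within a read. The sketch below illustrates that idea in plain Python; it is not the scripts' actual implementation, the function names are hypothetical, and it assumes all reads fit in memory.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return the overlapping k-mers of a sequence (hypothetical helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def median_kmer_count(seq, counts, k):
    """Estimate a read's coverage as the median count of its k-mers."""
    return median(counts[km] for km in kmers(seq, k))


def slice_by_coverage(reads, k, min_cov, max_cov):
    """Keep reads whose estimated coverage lies in [min_cov, max_cov]."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = []
    for read in reads:
        if len(read) < k:
            continue  # too short to contain any k-mer
        if min_cov <= median_kmer_count(read, counts, k) <= max_cov:
            kept.append(read)
    return kept


reads = ["ACGTAC", "ACGTAC", "ACGTAC", "TTTTGG"]
high = slice_by_coverage(reads, k=4, min_cov=2, max_cov=10)
low = slice_by_coverage(reads, k=4, min_cov=1, max_cov=1)
```

The median is used rather than the mean so that a few erroneous, low-count k-mers in a read do not drag the estimate down.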

To keep, document, and build recipes for:

  • make-coverage.py - RPKM calculation script
  • assemstats3.py - print out assembly statistics
  • build-sparse-graph.py - code for building a sparse graph (by Camille Scott)
  • calc-best-assembly.py - calculate the "best assembly" - used in metagenome protocol
  • collect-variants.py - used in a gist
  • extract-single-partition.py - extract all the sequences that belong to a specific partition, from a file with multiple partitions
  • filter-below-abund.py - like filter-abund, but trim off high-abundance k-mers
  • filter-median-and-pct.py - see blog post on Trinity in silico norm (http://ivory.idyll.org/blog/trinity-in-silico-normalize.html)
  • filter-median.py - see blog post on Trinity in silico norm (http://ivory.idyll.org/blog/trinity-in-silico-normalize.html)
  • graph-size.py - filter reads based on size of connected graph
  • memusg - memory usage analysis
  • multi-rename.py - rename sequences from multiple files with a common prefix
  • normalize-by-median-pct.py - see blog post on Trinity in silico norm (http://ivory.idyll.org/blog/trinity-in-silico-normalize.html)
  • print-stoptags.py - print out the stoptag k-mers
  • print-tagset.py - print out the tagset k-mers
  • renumber-partitions.py - systematically renumber partitions
  • shuffle-reverse-rotary.py - FASTA file shuffler for larger FASTA files
  • split-fasta.py - break a FASTA file up into smaller chunks
  • stoptags-by-position.py - print out where stoptags tend to occur
  • strip-partition.py - clear off partition information
  • subset-report.py - report stats on pmap files
  • sweep-files.py - various ways to extract reads based on k-mer overlap
  • sweep-out-reads-with-contigs.py - various ways to extract reads based on k-mer overlap
  • sweep-reads.py - various ways to extract reads based on k-mer overlap
  • sweep-reads2.py - various ways to extract reads based on k-mer overlap
  • sweep-reads3.py - various ways to extract reads based on k-mer overlap
  • write-trimmomatic.py - used to build Trimmomatic command lines in khmer-protocols
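
For reference, the RPKM value mentioned under make-coverage.py is conventionally defined as reads per kilobase of sequence per million mapped reads. A minimal sketch of that standard formula follows; the function name and arguments are hypothetical and are not taken from make-coverage.py.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads Per Kilobase of sequence per Million mapped reads.

    Hypothetical helper showing the standard formula only:
    RPKM = reads * 1e9 / (length in bp * total mapped reads)
    """
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# 500 reads on a 2 kb sequence, out of 10 million mapped reads
value = rpkm(500, 2000, 10_000_000)
```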

Good ideas to rewrite using newer tools/approaches:

  • assembly-diff.py - find sequences that differ between two assemblies
  • assembly-diff-2.py - find subsequences that differ between two assemblies
  • bloom-count.py - count the number of unique k-mers; should be reimplemented with HyperLogLog. Renamed from bloom_count.py in commit 4788c31.
  • split-sequences-by-length.py - break up short reads by length

Present in commit ff7f047b5b0d9acb6c1eb73d54cfd39c9e3d1393 but removed thereafter:

  • abundance-hist-by-position.py - look at abundance of k-mers by position within read; use with fasta-to-abundance-hist.py
  • fasta-to-abundance-hist.py - generate abundance of k-mers by position within reads; use with abundance-hist-by-position.py
  • find-high-abund-kmers.py - extract high-abundance k-mers into a list
  • hi-lo-abundance-by-position.py - look at high and low-abundance k-mers by position within read
  • stoptag-abundance-hist.py - print out abundance histogram of stoptags

Present in commit 19b0a09353cddc45070edcf1283cae2c83c13b0e but removed thereafter:

Present in commit d295bc847 but removed thereafter:

Present in commit 691b0b3ae but removed thereafter: