sourmash plugin to filter hashes/k-mers by presence across many sketches

ctb/sourmash_plugin_commonhash

sourmash_plugin_commonhash

If you have sketched many samples and want to remove "rare" k-mers (those present in only one or a few samples), this plugin is for you! Filtering out these hashes helps reduce noise in Jaccard comparisons between samples.

See sourmash#2383 for an extended discussion!

Thanks to Taylor Reiter and Jessica Lumian for all their work on this!

Installation

pip install sourmash_plugin_commonhash

Usage

sourmash scripts commonhash <multiple sketches> -o commonhashes.zip

commonhash will output one filtered sketch for each input sketch. You can then use the various sourmash sig commands to union these sketches, extract individual ones, etc.

Example

sourmash scripts commonhash examples/*.sig.gz -o commonhash.zip

should yield:

...

Selecting k=31, DNA
Loaded 10587 hashes from 3 sketches in 3 files.
Of 10587 hashes, keeping 2529 that are in 2 or more samples.
Saved 3 signatures to 'commonhash.zip'
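
To make the filtering step concrete, here is a minimal pure-Python sketch of the idea behind the "keeping ... that are in 2 or more samples" line above. It is an illustration only, not the plugin's actual implementation: sketches are modeled as plain sets of integer hashes, and the names `filter_common_hashes`, `jaccard`, and `min_samples` are hypothetical and not part of sourmash or this plugin.

```python
from collections import Counter


def filter_common_hashes(sketches, min_samples=2):
    """Keep only hashes seen in at least `min_samples` sketches.

    `sketches` maps a sample name to a set of integer hashes.
    (Hypothetical helper; real sourmash sketches are richer objects.)
    """
    counts = Counter()
    for hashes in sketches.values():
        # Count each hash at most once per sample.
        counts.update(set(hashes))

    keep = {h for h, n in counts.items() if n >= min_samples}

    # One filtered sketch per input sketch, mirroring the plugin's output.
    return {name: set(hashes) & keep for name, hashes in sketches.items()}


def jaccard(a, b):
    """Jaccard similarity of two hash sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data: hashes 1, 4, 5 and 6 each occur in a single sample.
toy = {
    "a": {1, 2, 3, 4},
    "b": {2, 3, 5},
    "c": {3, 6},
}

filtered = filter_common_hashes(toy, min_samples=2)
before = jaccard(toy["a"], toy["b"])
after = jaccard(filtered["a"], filtered["b"])
print(filtered)
print(before, after)
```

In this toy example, the hashes unique to one sample are dropped, so the comparison between `a` and `b` rests only on hashes shared across the collection. The threshold of 2 matches the example output above; whether and how the plugin lets you change it is not covered here.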

Support

We suggest filing issues in the main sourmash issue tracker as that receives more attention!

Dev docs

commonhash is developed at https://github.com/ctb/sourmash_plugin_commonhash.

Generating a release

Bump version number in pyproject.toml and push.

Make a new release on github.

Then pull, and:

python -m build

followed by twine upload dist/....