[MRG] gather optimizations #615

Merged
merged 3 commits into from Jan 11, 2019

2 participants

luizirber commented Jan 9, 2019

A couple of fixes for gather performance problems:

  • don't recalculate the scaled query minhash every time we reach a leaf in the SBT
  • add a remove_many method to remove hashes from a minhash
  • use the remove_many method to remove matches from the query (instead of building a new one). This makes a huge difference for large queries.
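To illustrate the difference between the last two items, here is a minimal sketch using a toy, set-backed stand-in for a minhash. Only the name `remove_many` comes from this PR; `ToyMinHash`, `get_hashes`, and both helper functions are hypothetical and are not the sourmash implementation.

```python
class ToyMinHash:
    """Toy stand-in for a minhash sketch: just a set of integer hashes.

    Hypothetical class for illustration only, not the sourmash API.
    """

    def __init__(self, hashes=()):
        self._hashes = set(hashes)

    def remove_many(self, hashes):
        # Drop every listed hash in place; hashes that are absent are ignored.
        self._hashes.difference_update(hashes)

    def get_hashes(self):
        return sorted(self._hashes)

    def __len__(self):
        return len(self._hashes)


def subtract_by_rebuilding(query, match):
    """Build a brand-new sketch from the leftover hashes on every match."""
    leftover = set(query.get_hashes()) - set(match.get_hashes())
    return ToyMinHash(leftover)


def subtract_in_place(query, match):
    """Mutate the existing query instead of constructing a new object."""
    query.remove_many(match.get_hashes())
    return query


query = ToyMinHash(range(10))
match = ToyMinHash([2, 3, 5, 7, 11])

rebuilt = subtract_by_rebuilding(query, match)  # new object, query untouched
same = subtract_in_place(query, match)          # query itself shrinks
```

With a large query, rebuilding copies every remaining hash after each match, while the in-place version only touches the hashes being removed.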

I ran master with a large metagenome query and got 4 matches in 17h. This PR takes 17 minutes to find the same matches.
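The first item in the list (not recalculating the scaled query at every leaf) can be sketched as simple memoization. This is an assumption about one way to avoid the repeated work, not a description of how the PR does it; `downsample`, `best_leaf`, and the `(name, max_hash, hashes)` leaf layout are hypothetical.

```python
def downsample(hashes, max_hash):
    """Toy version of scaled downsampling: keep hashes below max_hash."""
    return frozenset(h for h in hashes if h < max_hash)


def best_leaf(query_hashes, leaves):
    """Return (name, overlap, downsample_calls) for the best-matching leaf.

    The downsampled query is cached per max_hash, so it is computed once
    per distinct value instead of once per leaf visited.
    """
    cache = {}
    downsample_calls = 0
    best_name, best_overlap = None, -1
    for name, max_hash, leaf_hashes in leaves:
        if max_hash not in cache:
            cache[max_hash] = downsample(query_hashes, max_hash)
            downsample_calls += 1
        overlap = len(cache[max_hash] & leaf_hashes)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name, best_overlap, downsample_calls


query_hashes = {1, 5, 9, 40, 80, 120}
leaves = [
    ("a", 100, frozenset({1, 5, 200})),
    ("b", 100, frozenset({1, 5, 9, 40})),
    ("c", 50, frozenset({80, 120})),
]
result = best_leaf(query_hashes, leaves)
```

Here three leaves are visited but the query is downsampled only twice, once for each distinct `max_hash`.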

Checklist

  • Is it mergeable?
  • make test: Did it pass the tests?
  • make coverage: Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@luizirber luizirber force-pushed the perf/gather branch from d63e4f2 to a1a085b Jan 9, 2019

codecov bot commented Jan 9, 2019

Codecov Report

Merging #615 into master will decrease coverage by 0.06%.
The diff coverage is 96.55%.

@@            Coverage Diff             @@
##           master     #615      +/-   ##
==========================================
- Coverage   89.45%   89.38%   -0.07%     
==========================================
  Files          27       27              
  Lines        4191     4203      +12     
  Branches       37       39       +2     
==========================================
+ Hits         3749     3757       +8     
- Misses        442      446       +4
Impacted Files              Coverage Δ
sourmash/kmer_min_hash.hh   100% <100%> (ø) ⬆️
sourmash/search.py          94.19% <100%> (-0.04%) ⬇️
sourmash/sbtmh.py           85% <90.9%> (-0.3%) ⬇️
sourmash/commands.py        89.08% <0%> (-0.35%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 314bd8e...ca7b79b.

ctb commented Jan 10, 2019

LGTM!

luizirber commented Jan 10, 2019

I'll add some tests for remove_many before merging

@luizirber luizirber changed the title from "[WIP] gather optimizations" to "[MRG] gather optimizations" Jan 10, 2019

taylorreiter added a commit that referenced this pull request Jan 11, 2019

ctb approved these changes Jan 11, 2019

ctb left a comment

Nice work!

@ctb ctb merged commit 64b3017 into master Jan 11, 2019

2 checks passed

codecov/patch 96.55% of diff hit (target 89.45%)
codecov/project Absolute coverage decreased by -0.06% but relative coverage increased by +7.09% compared to 314bd8e

@ctb ctb deleted the perf/gather branch Jan 11, 2019