Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sketching performance with lookup tables for complement and DNA validation #861

Merged
merged 6 commits into from Jan 25, 2020

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Jan 24, 2020

As discussed in #860

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Jan 24, 2020

Codecov Report

Merging #861 into master will decrease coverage by 0.65%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #861      +/-   ##
==========================================
- Coverage   78.94%   78.29%   -0.66%     
==========================================
  Files          82       94      +12     
  Lines        6973     7287     +314     
  Branches      473        0     -473     
==========================================
+ Hits         5505     5705     +200     
- Misses       1168     1582     +414     
+ Partials      300        0     -300
Flag Coverage Δ
#pytests ?
#rusttests 50.64% <100%> (+3.47%) ⬆️
Impacted Files Coverage Δ
src/core/tests/signature.rs 100% <100%> (ø)
src/core/src/index/linear.rs 7.57% <0%> (-84.1%) ⬇️
sourmash/signature.py 87.19% <0%> (-0.49%) ⬇️
sourmash/sbt_storage.py 85.39% <0%> (-0.33%) ⬇️
sourmash/command_compute.py 96.96% <0%> (-0.24%) ⬇️
sourmash/sbtmh.py 84.28% <0%> (-0.23%) ⬇️
sourmash/_minhash.py 97.08% <0%> (-0.13%) ⬇️
sourmash/sbt.py 86.81% <0%> (-0.08%) ⬇️
sourmash/index.py 98.38% <0%> (-0.08%) ⬇️
sourmash/lca/lca_utils.py 96.6% <0%> (-0.06%) ⬇️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e649d7d...d4a4c74. Read the comment docs.

@luizirber
Copy link
Member Author

luizirber commented Jan 24, 2020

Ended up using lookup tables, which are slightly faster, and have the additional benefit of supporting a larger alphabet if needed (compared to the subtract approach).

Final comparison, with RC and checkdna lookup tables:

$ cargo bench -- add_sequence --baseline master     
   Compiling sourmash v0.3.0 (/home_hd/luizirber/work/sourmash-bio/sourmash/src/core)
    Finished bench [optimized] target(s) in 33.82s
     Running target/release/deps/compute-5a1e1fd812a12533
add_sequence/valid
                        time:   [5.2672 ms 5.2829 ms 5.3032 ms]                              
                        change: [-13.618% -13.222% -12.729%] (p = 0.00 < 0.05)
                        Performance has improved.
add_sequence/lowercase
                        time:   [5.2471 ms 5.2697 ms 5.2964 ms]                                  
                        change: [-14.642% -14.078% -13.564%] (p = 0.00 < 0.05)
                        Performance has improved.
add_sequence/invalid kmers                                                                            
                        time:   [4.7763 ms 4.7890 ms 4.8020 ms]
                        change: [-27.975% -27.610% -27.196%] (p = 0.00 < 0.05)
                        Performance has improved.
add_sequence/force with valid kmers                                                                           
                        time:   [5.2602 ms 5.2730 ms 5.2883 ms]
                        change: [-13.439% -13.007% -12.584%] (p = 0.00 < 0.05)
                        Performance has improved.

For completeness, with and without checkdna lookup table (it made the invalid kmers case better):

$ cargo bench -- add_sequence --baseline lookup
    Finished bench [optimized] target(s) in 0.08s
     Running target/release/deps/compute-5a1e1fd812a12533
add_sequence/valid
                        time:   [5.2839 ms 5.3321 ms 5.4334 ms]                              
                        change: [-0.8344% +0.5284% +2.2051%] (p = 0.59 > 0.05)
                        No change in performance detected.
add_sequence/lowercase
                        time:   [5.2703 ms 5.2877 ms 5.3105 ms]                                  
                        change: [-0.8144% -0.3387% +0.1306%] (p = 0.20 > 0.05)
                        No change in performance detected.
add_sequence/invalid kmers                                                                            
                        time:   [4.8371 ms 4.8682 ms 4.8876 ms]
                        change: [-7.0321% -6.3051% -5.4531%] (p = 0.00 < 0.05)
                        Performance has improved.
add_sequence/force with valid kmers                                                                           
                        time:   [5.2742 ms 5.2961 ms 5.3178 ms]
                        change: [-1.0363% -0.5324% -0.0268%] (p = 0.07 > 0.05)
                        No change in performance detected.

@luizirber luizirber requested a review from ctb January 25, 2020 01:13
@luizirber luizirber changed the title Improve sketching performance Improve sketching performance with lookup tables for complement and DNA validation Jan 25, 2020
@luizirber
Copy link
Member Author

Ready for review and merge @ctb

@luizirber luizirber merged commit a07dd85 into master Jan 25, 2020
@luizirber luizirber deleted the rust_revcomp branch January 25, 2020 18:26
@luizirber luizirber added this to the 3.2 milestone Jan 25, 2020
@kloetzl
Copy link
Contributor

kloetzl commented Jan 26, 2020

The tables do not support lower case letters. Is that an intended change? Also, I think there is a bug hidden in the k-mer skipping logic. I will try to fix that myself, this time. ☺

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants