Some speedup for `*_inner_join()` #101

etiennebacher · 2024-02-09T17:25:29Z

Hi @beniaminogreen, here's a small PR that slightly improves the performance of *_inner_join() functions. For those functions, we don't need to get the indices of rows that don't match since they are not used in the output anyway. Small benchmark below (about 15% time decrease):

library(bench) 
library(dplyr)

out <- cross::run(
  pkgs = c("beniaminogreen/zoomerjoin", "etiennebacher/zoomerjoin@speedup-inner-join"), 
  ~{
    library(zoomerjoin)
    
    ### Setup -----
    
    corpus_1 <- data.table::rbindlist(rep(list(dime_data), 300))
    names(corpus_1) <- c("a", "field")
    print(nrow(corpus_1))
    
    corpus_2 <- data.table::rbindlist(rep(list(dime_data), 300))
    names(corpus_2) <- c("b", "field")
    
    ### Benchmark -----
    
    bench::mark(
      jaccard_inner_join(corpus_1, corpus_2, 
                        by = "field", n_gram_width=6,
                        n_bands=20, band_width=6, threshold = .8)
    )
  }
)

tidyr::unnest(out, result) |> 
  select(pkg, median, mem_alloc)
#> # A tibble: 2 × 3
#>   pkg                                           median mem_alloc
#>   <chr>                                       <bch:tm> <bch:byt>
#> 1 beniaminogreen/zoomerjoin                       3.7m    15.7GB
#> 2 etiennebacher/zoomerjoin@speedup-inner-join    3.21m    10.3GB

(the cross allows to run the same code on different branches, which is very useful for benchmarking)

Also worth noting that when getting the indices of unmatched rows is necessary, using collapse::%!iin%`` instead of !%in% would be more efficient (but would also add a dependency to the package)

beniaminogreen · 2024-02-10T03:30:55Z

Thanks for this! This is a great catch! Merging in now.

The cross function is a great tip as well - I have always struggled with testing across branches.

Some speedup for *_inner_join()

523f030

beniaminogreen merged commit 8392cd4 into beniaminogreen:main Feb 10, 2024
7 checks passed

etiennebacher deleted the speedup-inner-join branch February 10, 2024 05:34

etiennebacher mentioned this pull request Feb 12, 2024

Import collapse to reduce the time taken by %in% #104

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some speedup for `*_inner_join()` #101

Some speedup for `*_inner_join()` #101

etiennebacher commented Feb 9, 2024 •

edited

beniaminogreen commented Feb 10, 2024

Some speedup for *_inner_join() #101

Some speedup for *_inner_join() #101

Conversation

etiennebacher commented Feb 9, 2024 • edited

beniaminogreen commented Feb 10, 2024

Some speedup for `*_inner_join()` #101

Some speedup for `*_inner_join()` #101

etiennebacher commented Feb 9, 2024 •

edited