Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some speedup for *_inner_join() #101

Merged
merged 1 commit into from
Feb 10, 2024
Merged

Some speedup for *_inner_join() #101

merged 1 commit into from
Feb 10, 2024

Conversation

etiennebacher
Copy link
Collaborator

@etiennebacher etiennebacher commented Feb 9, 2024

Hi @beniaminogreen, here's a small PR that slightly improves the performance of *_inner_join() functions. For those functions, we don't need to get the indices of rows that don't match since they are not used in the output anyway. Small benchmark below (about 15% time decrease):

library(bench) 
library(dplyr)

out <- cross::run(
  pkgs = c("beniaminogreen/zoomerjoin", "etiennebacher/zoomerjoin@speedup-inner-join"), 
  ~{
    library(zoomerjoin)
    
    ### Setup -----
    
    corpus_1 <- data.table::rbindlist(rep(list(dime_data), 300))
    names(corpus_1) <- c("a", "field")
    print(nrow(corpus_1))
    
    corpus_2 <- data.table::rbindlist(rep(list(dime_data), 300))
    names(corpus_2) <- c("b", "field")
    
    ### Benchmark -----
    
    bench::mark(
      jaccard_inner_join(corpus_1, corpus_2, 
                        by = "field", n_gram_width=6,
                        n_bands=20, band_width=6, threshold = .8)
    )
  }
)

tidyr::unnest(out, result) |> 
  select(pkg, median, mem_alloc)
#> # A tibble: 2 × 3
#>   pkg                                           median mem_alloc
#>   <chr>                                       <bch:tm> <bch:byt>
#> 1 beniaminogreen/zoomerjoin                       3.7m    15.7GB
#> 2 etiennebacher/zoomerjoin@speedup-inner-join    3.21m    10.3GB

(the cross allows to run the same code on different branches, which is very useful for benchmarking)


Also worth noting that when getting the indices of unmatched rows is necessary, using collapse::%!iin%`` instead of !%in% would be more efficient (but would also add a dependency to the package)

@beniaminogreen beniaminogreen merged commit 8392cd4 into beniaminogreen:main Feb 10, 2024
7 checks passed
@beniaminogreen
Copy link
Owner

Thanks for this! This is a great catch! Merging in now.

The cross function is a great tip as well - I have always struggled with testing across branches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants