Blocking using approximate nearest neighbours algorithms #22

Open
BERENZ opened this issue Nov 5, 2023 · 3 comments

BERENZ commented Nov 5, 2023

I am writing to let you know that I have developed a small package called [blocking](https://github.com/ncn-foreigners/blocking) that blocks records using approximate nearest neighbours algorithms (RcppAnnoy, RcppHNSW and mlpack) and graphs (igraph). The package includes the function pair_ann, which is modelled on pair_blocking and pair_minsim so that it integrates directly into your package.

Here is the code using the reclin2 sample data:

library(blocking)
library(reclin2)

data("linkexample1", "linkexample2", package = "reclin2")

linkexample1$txt <- with(linkexample1, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample1$txt <- gsub("\\s+", "", linkexample1$txt)
linkexample2$txt <- with(linkexample2, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample2$txt <- gsub("\\s+", "", linkexample2$txt)

# pairing records from linkexample2 to linkexample1 based on the txt column

pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.75) |>
  link(selection = "threshold")

Feel free to test and comment. I plan to submit this package to CRAN in December.

Owner

djvanderlaan commented Feb 1, 2024

Hi @BERENZ ,

Thanks for letting me know. This is really nice! I don't see it on CRAN yet. Still working on it? Let me know when it is on CRAN; I will then try to get a reference to your package somewhere in the documentation.

One remark: in pair_ann there is a line:

block_result <- blocking::blocking(
    x = x[, on],
    y = if (deduplication) NULL else y[, on],
    deduplication = deduplication, ...)

This fails if x and/or y are already a data.table. I think this is easiest to solve by placing the data.table::as.data.table lines before this line and using x[, on, with = FALSE]. So, currently:

setDT(linkexample1)
setDT(linkexample2)
pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) 

gives an error.
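
The difference in column selection can be reproduced in isolation. The following is a minimal sketch, assuming only that data.table is installed; the toy data and the names df, dt, from_df, from_dt and from_dt_fixed are made up for illustration, and this is not the actual pair_ann source:

```r
library(data.table)

# Toy data standing in for linkexample1 (hypothetical values)
df <- data.frame(id = 1:3, txt = c("anna", "bob", "carl"))
on <- "txt"

# data.frame: the value of `on` is used as a column name
from_df <- df[, on]

# data.table: j is evaluated as an expression, so the bare symbol `on`
# is looked up in the calling scope and its value is returned as-is
dt <- as.data.table(df)
from_dt <- dt[, on]

# with = FALSE tells data.table to treat `on` as a vector of column names
from_dt_fixed <- dt[, on, with = FALSE]
```

In this sketch from_dt is just the string stored in on rather than the column contents, which would explain why passing x[, on] along breaks for data.table input, while from_dt_fixed is a one-column data.table.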

Sorry for not getting back earlier.

Author

BERENZ commented Mar 9, 2024

Hi @djvanderlaan,

Thanks for the bug report; as always, I forgot about with = FALSE :) If you have any other comments, please let me know. I have been focused on other projects, but I think I will be able to submit the package in April.

In addition, we use your package in mecRecordLinkage, an experimental package that implements: Lee, D., Zhang, L-C., and Kim, J.K. (2022). "Maximum entropy classification for record linkage," Survey Methodology, 48, 1-23.

Author

BERENZ commented Apr 29, 2024

Hi @djvanderlaan,

I finally had some time and fixed several issues with the blocking package. Feel free to test it and check whether it is useful in connection with the reclin2 package.
