Add options to support pairs within a single data set #2
Comments
Hi @jkeirstead, there is a vignette on this: https://cran.r-project.org/web/packages/reclin/vignettes/deduplication.html. You can use the function
I had seen that vignette and noticed that it filters after pair creation. In my case, the bottleneck is actually in the comparison scoring, so that method would work, but I've just been using a tidyverse solution: The package is really helpful though, thanks!
Congratulations on the package, an excellent initiative. Error in Do you have any suggestions for avoiding the error, other than splitting the data?
@jeanbarrado The package (/ But with around 2^31 pairs, computation time is going to be quite substantial. A usual method of reducing the number of pairs is to apply some sort of blocking: only generate pairs when the records agree on some variable, e.g. city, province, or the first letter of the name. My Portuguese is not really well developed; in this case I was able to follow the question with the help of Google Translate, but please use English if you can.
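To make the blocking idea concrete, here is a minimal sketch in Python (chosen only for illustration; it does not use reclin's API). The helper `blocked_pairs`, the sample records, and the `city` blocking variable are all hypothetical. Pairs are generated only between records that agree on the blocking variable, and each unordered pair appears once.

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(records, key):
    """Return index pairs (i, j) with i < j for records that share records[i][key].

    Hypothetical helper for illustration; not part of reclin.
    """
    # Group record indices by the value of the blocking variable.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[key]].append(idx)

    # Only generate pairs inside each block, never across blocks.
    pairs = []
    for indices in blocks.values():
        pairs.extend(combinations(indices, 2))
    return pairs


# Made-up sample data: three records in one city, one in another.
records = [
    {"name": "Anna", "city": "Utrecht"},
    {"name": "Ana", "city": "Utrecht"},
    {"name": "Bob", "city": "Leiden"},
    {"name": "Anne", "city": "Utrecht"},
]

pairs = blocked_pairs(records, "city")
print(pairs)
```

Without blocking, these four records give 6 unordered pairs; with blocking on `city`, only the 3 pairs inside the Utrecht block are compared. The trade-off is that true matches whose blocking variable disagrees (for example, a misspelled city) are never generated.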
Hi! Continuing to have this issue
I'm using this package not for linking two different data sets, but for analysing pairs within a single data set. In this context, `reclin` creates more pairs than necessary.

For example, if input data sets `x` and `y` both refer to data set `df`, then I don't want to compare x[1] with y[1] because they are the same record. Similarly, I often don't want to compare both x[1]/y[2] and x[2]/y[1] because the comparison relationships are symmetric, e.g. f(x, y) = f(y, x).
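The size of the saving can be sketched as follows, in Python for illustration only (this is not reclin code, and the function names `full_pairs` and `dedup_pairs` are hypothetical). Comparing a data set of n records against itself yields n * n ordered pairs, whereas dropping self-pairs and keeping one of each symmetric pair leaves n * (n - 1) / 2.

```python
from itertools import combinations


def full_pairs(n):
    """All ordered pairs (i, j), including self-pairs: what a naive x-vs-y join produces."""
    return [(i, j) for i in range(n) for j in range(n)]


def dedup_pairs(n):
    """Unordered pairs with i < j: no self-pairs, and no (j, i) mirror of (i, j)."""
    return list(combinations(range(n), 2))


n = 4
print(len(full_pairs(n)))   # n * n ordered pairs
print(len(dedup_pairs(n)))  # n * (n - 1) / 2 unordered pairs
print(dedup_pairs(n))
```

For large n this is slightly less than half the work, which matters when the comparison scoring, rather than pair generation, is the bottleneck.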