Add options to support pairs within a single data set #2

Open
jkeirstead opened this issue Feb 28, 2020 · 5 comments

Comments

@jkeirstead

I'm using this package not for linking two different data sets, but for analysing pairs within a single data set. In this context, reclin creates more pairs than necessary.

For example, if input data sets x and y both refer to data set df, then I don't want to compare x[1] with y[1] because they are the same record. Similarly, I often don't want to compare both x[1]/y[2] and x[2]/y[1] because the comparison is symmetric, i.e. f(x, y) = f(y, x).

@djvanderlaan
Owner

Hi @jkeirstead ,

There is a vignette on this: https://cran.r-project.org/web/packages/reclin/vignettes/deduplication.html. You can use the function filter_pairs_for_deduplication to remove the duplicate pairs. This still creates all the pairs initially, which is not nice for large data sets. I am working on reimplementing some parts of reclin and this will probably be one of the things I tackle. I don't know yet when this will be finished.

@jkeirstead
Author

I had seen that vignette and noticed that it filters after pair creation. In my case, the bottleneck is actually in the comparison scoring, so that method would work, but I've just been using a tidyverse solution: pairs %>% filter(x < y)
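To illustrate what the x < y filter does, here is a minimal base R sketch. It does not use reclin; the toy data frame df and the names all_pairs and unique_pairs are hypothetical, and it assumes the pairs table holds integer record indices in columns x and y, as the filter above implies.

```r
# Hypothetical toy data set; `df` and its column are illustrative only.
df <- data.frame(name = c("anna", "anne", "bob", "rob"))
n <- nrow(df)

# All n^2 ordered index pairs, including self-pairs and mirrored pairs.
all_pairs <- expand.grid(x = seq_len(n), y = seq_len(n))

# Keep each unordered pair once: this drops x == y (a record paired with
# itself) and one of each mirrored pair (x, y) / (y, x).
unique_pairs <- all_pairs[all_pairs$x < all_pairs$y, ]

nrow(all_pairs)     # n^2 rows
nrow(unique_pairs)  # n * (n - 1) / 2 rows
```

Because the filter runs before any comparison scoring, only the n * (n - 1) / 2 remaining pairs need to be scored, even though all n^2 pairs are still created first.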

The package is really helpful though - thanks!

@jeanbarrado

(Translated from Portuguese.) Congratulations on the package, an excellent initiative.
I also used your vignette to remove the duplicate records from my Covid-19 database for Belo Horizonte-MG (Brazil), but it exceeded the limits.
The errors were as follows:

Error in length<-.lvec(*tmp*, value = lx + length(y)) :
std::bad_alloc
In addition: Warning message:
In lx + length(y) : NAs produced by integer overflow

Do you have any suggestion for avoiding the error, other than splitting the data?

@djvanderlaan
Owner

@jeanbarrado The package (or rather lvec) doesn't handle more than 2^31 pairs, and judging from the error message that limit seems to have been exceeded here. Please first check the expected number of pairs: for deduplication without blocking you have (n^2 - n)*0.5 pairs. If this number is less than 2^31, it should in principle be possible to do the deduplication. Unfortunately, my package currently generates all n^2 pairs first. If your final number of pairs is less than 2^31, I think I can cook up a workaround.

But with around 2^31 pairs, computation time is going to be quite substantial.
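The check described above can be sketched in base R as follows. The function name n_dedup_pairs and the record count n are hypothetical; the formulas and the 2^31 limit are the ones quoted in the comment.

```r
# Number of pairs needed to deduplicate n records without blocking.
# Works in doubles throughout, so it avoids R's integer overflow.
n_dedup_pairs <- function(n) {
  n <- as.numeric(n)
  (n^2 - n) * 0.5
}

limit <- 2^31  # the limit quoted in the comment above

n <- 50000  # hypothetical number of records

final_fits   <- n_dedup_pairs(n) < limit  # do the final pairs fit?
initial_fits <- n^2 < limit               # do the initial n^2 pairs fit?
```

For this n the final pairs fit but the initial n^2 pairs do not, which is the situation where a workaround would help. Under these formulas, n^2 stays below 2^31 up to 46,340 records, while (n^2 - n)*0.5 stays below it up to 65,536 records.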

A common way to reduce the number of pairs is to apply some form of blocking: only generate pairs when the records agree on some variable, e.g. city, province, or the first letter of the name.
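As a rough illustration of how much blocking can reduce the pair count, the sketch below counts deduplication pairs with and without a blocking variable. It is plain base R arithmetic, not reclin's own pair generation, and the city vector is made-up example data.

```r
# Hypothetical blocking variable: one city label per record.
city <- c("A", "A", "A", "B", "B", "C")

# Without blocking: every unordered pair of records.
n <- length(city)
pairs_unblocked <- n * (n - 1) / 2

# With blocking: only pairs of records that share the same city.
block_sizes <- as.numeric(table(city))
pairs_blocked <- sum(block_sizes * (block_sizes - 1) / 2)

pairs_unblocked  # pairs to compare without blocking
pairs_blocked    # pairs to compare when blocking on city
```

The saving depends on how evenly the blocking variable splits the records: one dominant block contributes nearly as many pairs as the unblocked case.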

My Portuguese is not very good. In this case I was able to follow the question with the help of Google Translate, but please use English if you can.

@abbylsmith

Hi! Continuing to have this issue
