Add options to support pairs within a single data set #2

jkeirstead · 2020-02-28T11:19:10Z

I'm using this package not for linking two different data sets, but analysing pairs within a single data set. In this context, reclin creates more pairs than necessary.

For example, if input data sets x and y both refer to data set df, then I don't want to compare x[1] with y[1] because they are the same record. Similarly I often don't want to compare both x[1]/y[2] and x[2]/y[1] because the comparison relationships are symmetric, e.g. f(x, y) = f(y, x).

djvanderlaan · 2020-03-11T13:55:01Z

Hi @jkeirstead ,

There is a vignette on this: https://cran.r-project.org/web/packages/reclin/vignettes/deduplication.html. You can use the function filter_pairs_for_deduplication to remove the duplicate pairs. This still creates all the pairs initially, which is not nice for large data sets. I am working on reimplementing some parts of reclin, and this will probably be one of the things I tackle. I don't know yet when this will be finished.

jkeirstead · 2020-03-12T10:43:43Z

I had seen that vignette and noticed that it filters after pair creation. In my case, the bottleneck is actually in the comparison scoring, so that method would work, but I've just been using a tidyverse solution: pairs %>% filter(x < y)
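For readers following along, here is a minimal base R sketch of what the x < y filter does. It assumes a toy data set of four records and that the pairs table holds record indices in columns named x and y; it does not use reclin itself, and the object names are only illustrative.

```r
# Hypothetical toy case: one data set with 4 records, paired with itself.
n <- 4L

# All n^2 ordered index pairs (x, y), including self-pairs such as (1, 1)
# and mirrored pairs such as (1, 2) and (2, 1).
pairs <- expand.grid(x = seq_len(n), y = seq_len(n))

# Keeping only x < y drops the self-pairs and one of each mirrored pair,
# which is enough when the comparison is symmetric: f(x, y) = f(y, x).
dedup <- pairs[pairs$x < pairs$y, ]

nrow(pairs)  # 16 ordered pairs
nrow(dedup)  # 6 unordered pairs, i.e. (n^2 - n) / 2
```

Filtering before the comparison step means the expensive scoring only runs on the reduced set of pairs.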

The package is really helpful though - thanks!

jeanbarrado · 2020-06-02T13:49:55Z

Congratulations on the package, it is an excellent initiative.
I also used your vignette to clean the duplicate records from my Covid-19 database for Belo Horizonte-MG (Brazil), but it exceeded the limits.
The errors reported were the following:

Error in length<-.lvec(*tmp*, value = lx + length(y)) :
std::bad_alloc
In addition: Warning message:
In lx + length(y) : NAs produced by integer overflow

Do you have any suggestion for avoiding the error, other than splitting the data?

djvanderlaan · 2020-06-02T14:07:35Z

@jeanbarrado The package (/lvec) doesn't handle more than 2^31 pairs, which judging from the error message seems to be the case here. Please first check the expected number of pairs: for deduplication without blocking you have (n^2 - n)*0.5 pairs. If this number is less than 2^31 it should in principle be possible to do deduplication. However, unfortunately my package currently first generates n^2 pairs. If you should have a final number of pairs less than 2^31, I think I can cook up a workaround.

But with around 2^31 pairs, computation time is going to be quite substantial.
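The pair-count check described above can be sketched in base R as follows. This is only an illustration: the function names are hypothetical and not part of reclin or lvec, and the 2^31 limit is taken from the comment above.

```r
# Hypothetical helpers, not part of reclin or lvec.

# Number of unordered pairs among n records: (n^2 - n) * 0.5.
# as.numeric() avoids the integer overflow seen in the warning above.
n_dedup_pairs <- function(n) {
  n <- as.numeric(n)
  (n^2 - n) * 0.5
}

# Number of pairs generated initially when all n^2 pairs are created.
n_full_pairs <- function(n) as.numeric(n)^2

pair_limit <- 2^31

# With 50000 records the final number of pairs is below the limit,
# but the n^2 pairs generated first are already above it.
n <- 50000
n_dedup_pairs(n) < pair_limit  # TRUE  (1249975000 pairs)
n_full_pairs(n) < pair_limit   # FALSE (2.5e9 pairs)
```

A data set in that range is the case where a workaround that avoids generating all n^2 pairs would make a difference.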

A usual method of reducing the number of pairs is to apply some form of blocking: only generate pairs when the records agree on some variable, e.g. city/province/first letter of the name.
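As a rough illustration of the idea in base R (not the package's own implementation), the sketch below blocks on a hypothetical city column and then keeps each unordered pair once. The data and column names are made up for the example.

```r
# Hypothetical toy data; column names are assumptions for illustration.
df <- data.frame(
  id   = 1:6,
  city = c("A", "A", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Self-join on the blocking variable: pairs are only generated for
# records that agree on city.
blocked <- merge(df, df, by = "city", suffixes = c(".x", ".y"))

# Keep each unordered pair once and drop self-pairs.
blocked <- blocked[blocked$id.x < blocked$id.y, ]

nrow(blocked)                   # 4 pairs: 3 within A, 1 within B, 0 within C
(nrow(df)^2 - nrow(df)) * 0.5   # 15 pairs without blocking
```

The trade-off is that records that disagree on the blocking variable, for instance because of a typo in the city, are never compared.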

My Portuguese is not very good. In this case I was able to follow the question with the help of Google Translate, but please use English if you can.

abbylsmith · 2022-02-08T20:18:21Z

Hi! Continuing to have this issue

zlkrvsm mentioned this issue Oct 29, 2020

Link() generates more pairs than the original samples #4

Open
