Link() generates more pairs than the original samples #4

zlkrvsm · 2020-10-29T15:57:15Z

This issue seems similar to #2, but I'm not encountering problems during pair generation.

Hello @djvanderlaan, thanks for the package. It's really easy to use, but I'm having an issue when trying to link two subdatasets.

In my case, I'm using a dataset of disease reports where some reports are for returning patients. I want to link the returning patients to their original visit, but there is no unique id, hence probabilistic linkage. Below is a minimal example of the data:

id	sex	age	return_visit	date
1	M	25	TRUE	Aug 01
2	F	19	TRUE	Sep 29
3	M	25	FALSE	Sep 15
4	F	19	FALSE	Jul 19

I have extra variables to "identify" my patients, but the basic idea is that records 3 and 4 are the same people as records 1 and 2, respectively. So I created two subdatasets based on the return_visit variable and used simple blocking plus some filtering to reduce the number of pairs.

So far so good.

I then created a dummy variable with value TRUE so it would capture all pairs, and used a selection of date variables to narrow these down to possible cases, leaving me with just 83,000 pairs. The problem arises when I call link(): it returns 3.3 million records.

Surely my pairs are in there, but the result looks like a full_join of both datasets, and I cannot for the life of me understand why link() doesn't respect my selection variable or why it includes every record. Is it an issue with subdatasets? Have I made a mistake somewhere between using select_greedy and link()? Is it an issue with using datasets that have the same number of variables, all with the same names?

Unfortunately, for confidentiality reasons, I cannot provide a reprex, but if you can point me in the right direction I'll do my own research. Thanks.

Stage	Number of records
return visits	550 thousand patients
first visits	2.8 million patients
after blocking	1.07 million pairs
after filtering	83 thousand pairs
after linking	3.3 million records
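
One quick way to test the full_join suspicion is to compare the row counts of an inner join and a full outer join on a toy version of the example. The sketch below uses only base R `merge()`, not the package itself. Record 5 is a hypothetical unmatched first visit added for illustration, and matching on sex and age merely stands in for the real selection step. Whether link() actually behaves like the outer join here (for example, by keeping unmatched records from both inputs by default) is an assumption to check against `?link`.

```r
# Toy stand-ins for the two subdatasets, based on the example table.
# Record 5 is hypothetical: a first visit with no matching return visit.
returns <- data.frame(id = c(1, 2), sex = c("M", "F"), age = c(25, 19))
firsts  <- data.frame(id = c(3, 4, 5), sex = c("M", "F", "F"), age = c(25, 19, 40))

# Inner join: only the matched pairs (1-3 and 2-4).
inner <- merge(returns, firsts, by = c("sex", "age"),
               suffixes = c("_return", "_first"))

# Full outer join: matched pairs plus every unmatched record from either side.
outer <- merge(returns, firsts, by = c("sex", "age"),
               suffixes = c("_return", "_first"), all = TRUE)

n_inner <- nrow(inner)  # number of matched pairs
n_outer <- nrow(outer)  # nrow(returns) + nrow(firsts) - matched pairs

# Same arithmetic with the rounded counts reported in the table above,
# assuming each selected pair removes one row from the outer-join total.
expected_outer <- 550e3 + 2.8e6 - 83e3
```

With the rounded counts from the table, a full outer join would give roughly 550 thousand + 2.8 million minus the number of matched pairs. That lands between about 3.27 million (all 83 thousand pairs kept) and 3.35 million (no pairs kept), which is consistent with the 3.3 million records observed.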
