Link() generates more pairs than the original samples #4

Open
zlkrvsm opened this issue Oct 29, 2020 · 0 comments

Comments

zlkrvsm commented Oct 29, 2020

This issue seems similar to #2, but I'm not encountering problems during pair generation.

Hello @djvanderlaan, thanks for the package, it's really easy to use, but I'm having an issue when trying to link two subdatasets.

In my case, I'm using a dataset of disease reports where some reports are for returning patients. I want to link the returning patients to their original visit, but there is no unique id, hence probabilistic linking. Below is a minimal example of the data:

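| id | sex | age | return_visit | date |
|----|-----|-----|--------------|--------|
| 1 | M | 25 | TRUE | Aug 01 |
| 2 | F | 19 | TRUE | Sep 29 |
| 3 | M | 25 | FALSE | Sep 15 |
| 4 | F | 19 | FALSE | Jul 19 |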

I have extra variables to "identify" my patients, but the basic idea is that records 3 and 4 belong to the same people as records 1 and 2. So I created two subdatasets based on the return_visit variable and used simple blocking plus some filtering to reduce the number of pairs.

So far so good.

I then created a dummy variable with value TRUE so it would capture all pairs and used a selection of date variables to select for possible cases, leaving me with just 83,000 pairs. The problem arises when I use link(): it returns 3.3 million records.

My pairs are surely in there, but the result looks like a full_join of both datasets, and I cannot for the life of me understand why link() doesn't respect my selection variable or why it includes every record. Is it an issue with subdatasets? Have I made a mistake somewhere between select_greedy() and link()? Is it an issue with using datasets that have the same number of variables, all with the same names?

Unfortunately, for confidentiality reasons, I cannot provide a reprex, but if you can point me in the right direction I'll do my own research. Thanks.

| Stage | Number of records |
|-----------------|----------------------|
| return visits | 550 thousand patients |
| first visits | 2.8 million patients |
| after blocking | 1.07 million pairs |
| after filtering | 83 thousand pairs |
| after linking | 3.3 million records |
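
One way to sanity-check the "full_join" suspicion is to compare the reported output size with what a full outer join and an inner join would each produce. The sketch below is plain arithmetic on the rounded counts from the table. It does not call the package, and it assumes the selection is one-to-one (each selected pair joins exactly one return-visit record to one first-visit record). The function names are hypothetical and exist only for this illustration.

```python
# Rounded counts as reported in the table above.
return_visits = 550_000     # records in the return-visit subset
first_visits = 2_800_000    # records in the first-visit subset
selected_pairs = 83_000     # pairs remaining after filtering


def outer_join_rows(n_x, n_y, n_matched):
    """Rows in a full outer join under a one-to-one match.

    Every matched pair merges one x row and one y row into a single
    output row; all unmatched rows from both sides are kept.
    """
    return n_x + n_y - n_matched


def inner_join_rows(n_matched):
    """Rows in an inner join under a one-to-one match: matched pairs only."""
    return n_matched


full = outer_join_rows(return_visits, first_visits, selected_pairs)
inner = inner_join_rows(selected_pairs)

print(f"full outer join: {full:,} rows")
print(f"inner join:      {inner:,} rows")
```

Under these assumptions the outer-join count comes to about 3.27 million, which is close to the 3.3 million records reported, while an inner join would give 83 thousand. That is consistent with link() keeping unmatched records from both inputs, though nothing in this thread confirms that behavior.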