Change default sample size or change sampling scheme based on Tahamont et al.'s findings #980

Open
fgregg opened this issue Mar 10, 2022 · 8 comments

fgregg commented Mar 10, 2022

@zjelveh, @mmcneill, and their co-authors wrote up a nice paper on dedupe with a very interesting finding: increasing the size of the training sample significantly improved dedupe's recall, even holding the number of labeled pairs constant.

Zubin's twitter thread on the paper: https://twitter.com/zubinjelveh/status/1501978665839734790

unfortunately, the paper suggests you only get strongly better results if the training sample is 100× larger than the default of 1,500 (for dedupe). a sample size of 150,000 records is going to make the active learning routine very slow.

given that the labeling budget is the same, the improved performance must come from the larger sample containing more informative record pairs for the active learner to take advantage of.

i've been thinking about overhauling our pretty unprincipled sampling scheme for a while (#845). it seems possible that a better sampling scheme could achieve Zubin and Melissa's results with a much smaller sample size than 150,000 records.
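
for anyone following along, the knob in question is the `sample_size` argument to `prepare_training` in dedupe 2.x. a minimal sketch, with made-up fields and toy data, just to show where the setting lives:

```python
import dedupe

# made-up field definition for illustration
fields = [
    {"field": "first_name", "type": "String"},
    {"field": "last_name", "type": "String"},
]

# toy data: dedupe expects a dict mapping record ids to record dicts
data = {
    0: {"first_name": "jane", "last_name": "doe"},
    1: {"first_name": "jane", "last_name": "dough"},
}

deduper = dedupe.Dedupe(fields)

# sample_size defaults to 1,500; the paper's gains showed up
# at roughly 100x that, i.e. around 150,000
deduper.prepare_training(data, sample_size=150_000)
```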

fgregg commented Mar 10, 2022

@zjelveh, @mmcneill would y'all be able to make the test harness that you used for your paper available? i'd love to explore how changes to the sampling scheme affect the results you saw.

zjelveh commented Mar 10, 2022

Here is the "replication" code for the paper. I think this should have what you need. @mmcneill developed the code here and will know more details.
https://github.com/zjelveh/dedupe-paper

(I put "replication" in quotes because the underlying data can't be released publicly due to PII, but hopefully this will still be helpful.)

We're happy to answer any clarifying questions about it.

fgregg commented Mar 10, 2022

@zjelveh, @mmcneill that's too bad. test datasets of this size are unusual in record linkage, and i was hoping this would be a new one.

if we develop some different sampling schemes, would y'all possibly be interested in trying them out on your end with your data?

@fgregg fgregg changed the title from "Change default sample size or change sampling scheme based Zubin Jelveh's finding" to "Change default sample size or change sampling scheme based Tahamont's, et. al.'s findings" Mar 10, 2022

zjelveh commented Mar 10, 2022

Definitely.

And yes, finding public data to do record linkage experiments is a huge challenge.

We came across this active learning paper by @tedenamorado, which mentions a Brazilian election dataset with ground truth that could be harnessed for experiments.
https://www.dropbox.com/s/ds8o0q3mt4llb8c/jmp_TedE.pdf?raw=1

I believe this is the link to the data, but it's been a while since I looked closely:
https://dadosabertos.tse.jus.br/dataset/

fgregg commented Mar 10, 2022

@zjelveh, @mmcneill, could you give me a sense of how much things slowed down when you went from 1,500 to 150,000 samples?

@mmcneill commented

For the dataset with 200k rows, changing sample_size from 1,500 to 150,000 caused the total runtime of the entire linking process to go from 4 minutes to 134 minutes on average (when we were labeling 1,000 pairs). And we had set the parallelization parameter to 8 cores.
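
(For anyone who wants a rough version of this on their own data, here's a minimal timing sketch. It only times the sampling/prep step, not the labeling or clustering that our full runs included, and `data` and `fields` are placeholders as in the dedupe docs; the paper's actual harness is in the linked repo.)

```python
import time

import dedupe

def time_prep(data, fields, sample_size, cores=8):
    # num_cores controls dedupe's parallelization
    deduper = dedupe.Dedupe(fields, num_cores=cores)
    start = time.perf_counter()
    # the candidate sampling happens inside prepare_training
    deduper.prepare_training(data, sample_size=sample_size)
    return time.perf_counter() - start
```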

fgregg commented Mar 10, 2022

yikes. that's kind of what i would expect.

@fgregg fgregg mentioned this issue Mar 15, 2022

fgregg commented Apr 20, 2022

hi @zjelveh, @mmcneill, i improved the sampling in dedupe. could you check out what's on master?
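
(installing straight from master should work with something like `pip install git+https://github.com/dedupeio/dedupe.git`, assuming you're pulling from the main dedupe repo.)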
