Change default sample size or change sampling scheme based on Tahamont et al.'s findings #980

Open
fgregg opened this issue Mar 10, 2022 · 8 comments

fgregg commented Mar 10, 2022

@zjelveh, @mmcneill, and their co-authors wrote up a nice paper on dedupe with a very interesting finding: increasing the size of the training sample significantly improved dedupe's recall, even holding the number of labeled pairs constant.

Zubin's twitter thread on the paper: https://twitter.com/zubinjelveh/status/1501978665839734790

unfortunately, the paper suggests you only get strongly better results if the training sample is 100× larger than the default of 1,500 (for dedupe). a sample size of 150,000 records is going to make the active learning routine very slow.

given that the labeling budget is the same, the improved performance must come from the larger sample containing more informative record pairs for the active learner to take advantage of.

i've been thinking about overhauling our pretty unprincipled sampling scheme for a while (#845). it seems possible that a better sampling scheme could achieve Zubin and Melissa's results with a much smaller sample size than 150,000 records.
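
for anyone following along, the knob in question is the `sample_size` argument to `prepare_training` in dedupe 2.x. a minimal sketch, with made-up fields and toy data, just to show where the setting lives:

```python
import dedupe

# made-up field definition for illustration
fields = [
    {"field": "first_name", "type": "String"},
    {"field": "last_name", "type": "String"},
]

# toy data: dedupe expects a dict mapping record ids to record dicts
data = {
    0: {"first_name": "jane", "last_name": "doe"},
    1: {"first_name": "jane", "last_name": "dough"},
}

deduper = dedupe.Dedupe(fields)

# sample_size defaults to 1,500; the paper's gains showed up
# at roughly 100x that, i.e. around 150,000
deduper.prepare_training(data, sample_size=150_000)
```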

fgregg commented Mar 10, 2022

@zjelveh, @mmcneill would y'all be able to make the test harness that you used for your paper available? i'd love to explore how changes to the sampling scheme affect the results you saw.

zjelveh commented Mar 10, 2022

Here is the "replication" code for the paper. I think this should have what you need. @mmcneill developed the code here and will know more details.
https://github.com/zjelveh/dedupe-paper

(I put "replication" in quotes because the underlying data can't be released publicly due to PII, but hopefully this will still be helpful.)

We're happy to answer any clarifying questions about it.

fgregg commented Mar 10, 2022

@zjelveh, @mmcneill that's too bad. test datasets of this size are unusual in record linkage, and i was hoping this would be a new one.

if we develop some different sampling schemes, would y'all possibly be interested in trying them out on your end with your data?

@fgregg fgregg changed the title from "Change default sample size or change sampling scheme based Zubin Jelveh's finding" to "Change default sample size or change sampling scheme based Tahamont's, et. al.'s findings" Mar 10, 2022

zjelveh commented Mar 10, 2022

Definitely.

And yes, finding public data to do record linkage experiments is a huge challenge.

We came across this active learning paper by @tedenamorado, which mentions a Brazilian election dataset with ground truth that could be harnessed for experiments.
https://www.dropbox.com/s/ds8o0q3mt4llb8c/jmp_TedE.pdf?raw=1

I believe this is the link to the data, but it's been a while since I looked closely:
https://dadosabertos.tse.jus.br/dataset/

fgregg commented Mar 10, 2022

@zjelveh, @mmcneill, could you give me a sense of how much things slowed down when you went from 1,500 to 150,000 samples?

@mmcneill commented

For the dataset with 200k rows, changing sample_size from 1,500 to 150,000 caused the total runtime of the entire linking process to go from 4 minutes to 134 minutes on average (when we were labeling 1,000 pairs). And we had set the parallelization parameter to 8 cores.
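
(For anyone who wants a rough version of this on their own data, here's a minimal timing sketch. It only times the sampling/prep step, not the labeling or clustering that our full runs included, and `data` and `fields` are placeholders as in the dedupe docs; the paper's actual harness is in the linked repo.)

```python
import time

import dedupe

def time_prep(data, fields, sample_size, cores=8):
    # num_cores controls dedupe's parallelization
    deduper = dedupe.Dedupe(fields, num_cores=cores)
    start = time.perf_counter()
    # the candidate sampling happens inside prepare_training
    deduper.prepare_training(data, sample_size=sample_size)
    return time.perf_counter() - start
```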

fgregg commented Mar 10, 2022

yikes. that's kind of what i would expect.

@fgregg fgregg mentioned this issue Mar 15, 2022

fgregg commented Apr 20, 2022

hi @zjelveh, @mmcneill, i improved the sampling in dedupe. could you check out what's on master?
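
(installing straight from master should work with something like `pip install git+https://github.com/dedupeio/dedupe.git`, assuming you're pulling from the main dedupe repo.)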
