Query related to console_label and training_data_dedupe methods #1166

Closed
Abhishek-thetechie opened this issue Aug 18, 2023 · 3 comments

Comments


Abhishek-thetechie commented Aug 18, 2023

Hi,
I'd like a clarification from @fgregg, @derekeder, or anyone who knows this package well. As I understand it, Dedupe uses Active Learning: the user is shown pairs of records in the CLI and labels each as match / no match by running dedupe.console_label(deduper). This process is manual, and we don't know in advance when dedupe will stop presenting examples. To skip this step, am I right that we can use training_data_dedupe(data, common_key, training_size=50000) if our input data is already labelled? Can we then store that training data in a training.json file and pass it as the training_file argument of prepare_training(data, training_file=None, sample_size=1500, blocked_proportion=0.9)? Please correct me if I am wrong, and help me understand how to skip the manual Active Learning step.

@Abhishek-thetechie
Author

Hi,
I'm still looking for an answer to my query. Could anybody help?


pybergonz commented Aug 25, 2023

Hi @Abhishek-thetechie,
In my experience, you only need to train with Active Learning once: save the settings file and training.json, then reuse them for every subsequent run.
To reuse only the saved model (the settings file), load it with StaticDedupe:

    with open(sc_settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)

Or re-train at every execution using labelled examples:

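    import dedupe

    fields = [
        {'field': 'name', 'type': 'String'},
        {'field': 'name', 'type': 'Exact'},
        {'field': 'address', 'type': 'String'},
    ]
    deduper = dedupe.Dedupe(fields, num_cores=5)

    # Sample candidate pairs and load the labels saved in training_file
    with open(training_file) as tf:
        deduper.prepare_training(data_d, training_file=tf, sample_size=10000)

    # Add further labelled pairs: {'match': [...], 'distinct': [...]}
    # (mark_pairs is the dedupe 2.x spelling of markPairs)
    deduper.mark_pairs(training_pairs)

    # Index predicates can be slow and memory-hungry, so skip them here
    deduper.train(index_predicates=False)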
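
For anyone wondering where training_pairs can come from when the data is already labelled, here is a rough sketch. It assumes dedupe 2.x and that each record carries a column identifying its true entity (called "cluster" below, a made-up name). labeled_pairs_from_key and train_from_labels are hypothetical helpers written for this comment, not part of dedupe. The first one is a simplified stand-in for what dedupe.training_data_dedupe from the question is meant to do, and it enumerates all pairs, so it only suits small data sets.

```python
import itertools
import random
from collections import defaultdict


def labeled_pairs_from_key(data, common_key, max_distinct=1000, seed=0):
    """Build {'match': [...], 'distinct': [...]} from already-labelled records.

    data is {record_id: record_dict}; records whose record[common_key]
    values are equal are treated as duplicates of each other.
    """
    groups = defaultdict(list)
    for record_id, record in data.items():
        groups[record[common_key]].append(record_id)

    # Every pair of ids inside one group is a known match.
    matched = set()
    for ids in groups.values():
        matched.update(itertools.combinations(sorted(ids), 2))

    # All remaining pairs are known non-matches; keep a random sample of them.
    candidates = [
        pair for pair in itertools.combinations(sorted(data), 2)
        if pair not in matched
    ]
    rng = random.Random(seed)
    distinct = rng.sample(candidates, min(max_distinct, len(candidates)))

    return {
        "match": [(data[a], data[b]) for a, b in sorted(matched)],
        "distinct": [(data[a], data[b]) for a, b in distinct],
    }


def train_from_labels(data, fields, labeled_pairs, settings_path, training_path):
    """Train without console_label, then save the results for later runs."""
    import dedupe  # imported lazily; the helper above does not need it

    deduper = dedupe.Dedupe(fields)
    deduper.prepare_training(data)
    deduper.mark_pairs(labeled_pairs)
    deduper.train()

    # Persist the labels (text file) and the learned model (binary file).
    with open(training_path, "w") as tf:
        deduper.write_training(tf)
    with open(settings_path, "wb") as sf:
        deduper.write_settings(sf)
    return deduper
```

With the files written by train_from_labels, later runs can either load the settings through StaticDedupe or pass the saved training file to prepare_training, as described above.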

Contributor

fgregg commented Dec 16, 2023

you got it.

@fgregg fgregg closed this as completed Dec 16, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 1, 2024