Hi,
I would like a clarification from @fgregg, @derekeder, or anyone who knows this package well. As I understand it, Dedupe uses active learning: the user is shown pairs of records to label as match or no match on the command line by running dedupe.console_label(deduper). This process is manual, and we don't know in advance when dedupe will stop presenting examples. To skip this step, am I right that we can use training_data_dedupe(data, common_key, training_size=50000) if our input data is already labelled? Can we then store this training data in a training.json file and pass it as the training_file argument of prepare_training(data, training_file=None, sample_size=1500, blocked_proportion=0.9)? Please correct me if I am wrong, and let me know how to skip the manual active learning step.
Hi @Abhishek-thetechie,
In my experience, you only need to train with active learning once: save the settings file and training.json, then reuse them for every subsequent run.
You can reuse only the data model by employing a StaticDedupe:
with open(sc_settings_file, 'rb') as f:
    deduper = dedupe.StaticDedupe(f)
Or re-train at every execution using labelled examples:
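import dedupe

fields = [{'field': 'name', 'type': 'String'},
          {'field': 'name', 'type': 'Exact'},
          {'field': 'address', 'type': 'String'}]
deduper = dedupe.Dedupe(fields, num_cores=5)
# Assuming dedupe 2.x: passing the saved training file loads its labelled pairs
with open(training_file) as tf:
    deduper.prepare_training(data_d, training_file=tf, sample_size=10000)
# Extra labelled pairs, shaped as {'match': [...], 'distinct': [...]}
# (mark_pairs is the 2.x name; 1.x called it markPairs)
deduper.mark_pairs(training_pairs)
deduper.train(index_predicates=False)
# Save the learned settings so a later run can load them with StaticDedupe
with open(settings_file, 'wb') as sf:
    deduper.write_settings(sf)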
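For the case in the original question, where the input records already carry a ground-truth label, the whole flow might look roughly like the sketch below. It assumes dedupe 2.x, a data dict that maps record ids to record dicts, and a common_key column holding the true cluster id. The names check_training_pairs and train_without_console_label, along with the default file paths, are made up for illustration and are not part of dedupe.

```python
def check_training_pairs(pairs):
    """Return (n_match, n_distinct) after a basic layout check.

    Hypothetical helper, not part of dedupe. It verifies the layout that
    training.json and mark_pairs use: a dict with 'match' and 'distinct'
    keys, each holding a list of two-record pairs.
    """
    if not {"match", "distinct"} <= set(pairs):
        raise ValueError("training data needs both 'match' and 'distinct'")
    for label in ("match", "distinct"):
        for pair in pairs[label]:
            if len(pair) != 2:
                raise ValueError(f"each {label!r} entry must hold two records")
    return len(pairs["match"]), len(pairs["distinct"])


def train_without_console_label(data, common_key, fields,
                                training_path="training.json",
                                settings_path="learned_settings"):
    """Train from already-labelled records instead of console_label.

    Assumes dedupe 2.x and that every record in `data` stores its true
    cluster id under `common_key`.
    """
    import dedupe  # imported lazily so the checker above runs without dedupe

    # Build labelled pairs from the ground-truth key.
    pairs = dedupe.training_data_dedupe(data, common_key, training_size=50000)
    check_training_pairs(pairs)

    deduper = dedupe.Dedupe(fields)
    deduper.prepare_training(data, sample_size=1500)
    deduper.mark_pairs(pairs)  # takes the place of interactive labelling
    deduper.train()

    # Persist both files so later runs can reuse them.
    with open(training_path, "w") as tf:
        deduper.write_training(tf)
    with open(settings_path, "wb") as sf:
        deduper.write_settings(sf)
    return deduper
```

On later runs, the saved training.json can be passed back through the training_file argument of prepare_training, or the settings file can be loaded with StaticDedupe to skip training entirely.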