dedupe.Dedupe
import dedupe

# initialize from a defined set of fields
variables = [
    {'field': 'Site name', 'type': 'String'},
    {'field': 'Address', 'type': 'String'},
    {'field': 'Zip', 'type': 'String', 'has missing': True},
    {'field': 'Phone', 'type': 'String', 'has missing': True},
]
deduper = dedupe.Dedupe(variables)
prepare_training
uncertain_pairs
mark_pairs
train
write_training
write_settings
cleanup_training
partition
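During training, labeled examples are passed to mark_pairs as a dictionary with 'match' and 'distinct' keys, each holding a list of record pairs. The following is a minimal sketch of building that structure by hand. The records and the helper build_labeled_pairs are hypothetical, and the dedupe call is left in a comment so the snippet runs without the library installed.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]
RecordPair = Tuple[Record, Record]


def build_labeled_pairs(
    judgments: List[Tuple[Record, Record, bool]]
) -> Dict[str, List[RecordPair]]:
    """Hypothetical helper: sort (record_a, record_b, is_match) judgments
    into the {'match': [...], 'distinct': [...]} shape used for labeling."""
    labeled: Dict[str, List[RecordPair]] = {"match": [], "distinct": []}
    for record_a, record_b, is_match in judgments:
        key = "match" if is_match else "distinct"
        labeled[key].append((record_a, record_b))
    return labeled


# Made-up records using the field names from the variable definition above.
# Fields declared with 'has missing' use None when the value is absent.
rec_a = {"Site name": "Lakeview Center", "Address": "12 Main St",
         "Zip": "60601", "Phone": None}
rec_b = {"Site name": "Lakeview Ctr", "Address": "12 Main Street",
         "Zip": "60601", "Phone": "555-0100"}
rec_c = {"Site name": "Northside School", "Address": "400 Oak Ave",
         "Zip": None, "Phone": "555-0199"}

labeled_pairs = build_labeled_pairs([
    (rec_a, rec_b, True),    # judged to be the same site
    (rec_a, rec_c, False),   # judged to be different sites
])

# With a deduper that has already run prepare_training, the labels
# would then be handed over like this:
# deduper.mark_pairs(labeled_pairs)
```

Building the dictionary separately from the deduper keeps labeling logic easy to test, and the same structure can come from a review tool or a spreadsheet export instead of interactive console labeling.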
dedupe.StaticDedupe
with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticDedupe(f)
partition
dedupe.RecordLink
# initialize from a defined set of fields
variables = [
    {'field': 'Site name', 'type': 'String'},
    {'field': 'Address', 'type': 'String'},
    {'field': 'Zip', 'type': 'String', 'has missing': True},
    {'field': 'Phone', 'type': 'String', 'has missing': True},
]
linker = dedupe.RecordLink(variables)
prepare_training
uncertain_pairs
mark_pairs
train
write_training
write_settings
cleanup_training
join
dedupe.StaticRecordLink
with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticRecordLink(f)
join
dedupe.Gazetteer
# initialize from a defined set of fields
variables = [
    {'field': 'Site name', 'type': 'String'},
    {'field': 'Address', 'type': 'String'},
    {'field': 'Zip', 'type': 'String', 'has missing': True},
    {'field': 'Phone', 'type': 'String', 'has missing': True},
]
matcher = dedupe.Gazetteer(variables)
prepare_training
uncertain_pairs
mark_pairs
train
write_training
write_settings
cleanup_training
index
unindex
search
dedupe.StaticGazetteer
with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticGazetteer(f)
index
unindex
search
blocks
score
many_to_n
With the methods documented above, you can work with datasets of up to millions of records. However, if you are working with larger datasets, you may not be able to load all of your data into memory. In that case, you'll need to interact with some of the lower level classes and methods.
The PostgreSQL and MySQL examples use these lower level classes and methods.
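The lower level methods listed below split the work into stages: fingerprint records into blocks, generate candidate pairs within blocks, score the pairs, and then cluster or link them. The following toy sketch shows why blocking keeps the number of comparisons manageable. It is not dedupe's implementation: block_key is a hypothetical stand-in for the blocking rules that dedupe learns during training, and the records are made up.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Iterator, Tuple

Record = Dict[str, str]


def block_key(record: Record) -> str:
    """Hypothetical blocking rule: first three characters of the zip code."""
    return record["Zip"][:3]


def candidate_pairs(
    records: Iterable[Tuple[int, Record]]
) -> Iterator[Tuple[int, int]]:
    """Yield ID pairs for records that share a block key.

    Only record IDs and block keys are kept in memory here; in a
    database-backed setup the grouping would typically be done in SQL.
    """
    blocks = defaultdict(list)
    for record_id, record in records:
        blocks[block_key(record)].append(record_id)
    for ids in blocks.values():
        # Compare records only within the same block.
        yield from combinations(sorted(ids), 2)


data = {
    1: {"Zip": "60601"},
    2: {"Zip": "60605"},
    3: {"Zip": "94110"},
    4: {"Zip": "60611"},
}

pairs = list(candidate_pairs(data.items()))
all_pairs = list(combinations(sorted(data), 2))
print(len(pairs), "candidate pairs instead of", len(all_pairs))
```

Because candidate_pairs is a generator, a downstream scoring step can consume pairs one at a time instead of materializing every comparison up front.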
dedupe.Dedupe
fingerprinter
Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.
pairs
score
cluster
dedupe.StaticDedupe
fingerprinter
Instance of the dedupe.blocking.Fingerprinter class.
pairs(data)
Same as dedupe.Dedupe.pairs
score(pairs)
Same as dedupe.Dedupe.score
cluster(scores, threshold=0.5)
Same as dedupe.Dedupe.cluster
dedupe.RecordLink
fingerprinter
Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.
pairs
score
one_to_one
many_to_one
dedupe.StaticRecordLink
fingerprinter
Instance of the dedupe.blocking.Fingerprinter class.
pairs(data_1, data_2)
Same as dedupe.RecordLink.pairs
score(pairs)
Same as dedupe.RecordLink.score
one_to_one(scores, threshold=0.0)
Same as dedupe.RecordLink.one_to_one
many_to_one(scores, threshold=0.0)
Same as dedupe.RecordLink.many_to_one
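The one_to_one and many_to_one methods differ in the constraint they place on links between the two datasets. The sketch below illustrates that difference with a simple greedy strategy on plain tuples. It is an assumption-laden toy, not dedupe's algorithm or data format: the function names, the (id_1, id_2, score) tuples, and the scores are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

ScoredPair = Tuple[str, str, float]  # (ID in data_1, ID in data_2, score)


def greedy_one_to_one(
    scores: Iterable[ScoredPair], threshold: float = 0.0
) -> List[ScoredPair]:
    """Toy one-to-one linking: each record on either side is used at most once."""
    used_1, used_2 = set(), set()
    links = []
    # Visit the highest-scoring pairs first.
    for id_1, id_2, score in sorted(scores, key=lambda s: s[2], reverse=True):
        if score <= threshold:
            break
        if id_1 in used_1 or id_2 in used_2:
            continue
        used_1.add(id_1)
        used_2.add(id_2)
        links.append((id_1, id_2, score))
    return links


def best_many_to_one(
    scores: Iterable[ScoredPair], threshold: float = 0.0
) -> Dict[str, Tuple[str, float]]:
    """Toy many-to-one linking: each data_1 record keeps its single best
    data_2 match, and a data_2 record may be matched many times."""
    best: Dict[str, Tuple[str, float]] = {}
    for id_1, id_2, score in scores:
        if score <= threshold:
            continue
        if id_1 not in best or score > best[id_1][1]:
            best[id_1] = (id_2, score)
    return best


scored = [
    ("a1", "b1", 0.9),
    ("a2", "b1", 0.8),
    ("a2", "b2", 0.4),
    ("a3", "b2", 0.05),
]

one_to_one_links = greedy_one_to_one(scored, threshold=0.1)
many_to_one_links = best_many_to_one(scored, threshold=0.1)
```

In this toy data, the one-to-one result pushes a2 to its second choice b2 because b1 is already taken by a1, while the many-to-one result lets both a1 and a2 link to b1.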
dedupe.Gazetteer
fingerprinter
Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.
blocks
score
many_to_n
dedupe.StaticGazetteer
fingerprinter
Instance of the dedupe.blocking.Fingerprinter class.
blocks(data)
Same as dedupe.Gazetteer.blocks
score(blocks)
Same as dedupe.Gazetteer.score
many_to_n(score_blocks, threshold=0.0, n_matches=1)
Same as dedupe.Gazetteer.many_to_n
dedupe.blocking.Fingerprinter
__call__
index_fields
index
unindex
reset_indices
dedupe.console_label
dedupe.training_data_dedupe
dedupe.training_data_link
dedupe.canonicalize
dedupe.read_training
dedupe.write_training