Library Documentation

Dedupe Objects

dedupe.Dedupe

# initialize from a defined set of fields
variables = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'String', 'has missing':True},
    {'field' : 'Phone', 'type': 'String', 'has missing':True},
]
deduper = dedupe.Dedupe(variables)
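The field names in the variable definition need to line up with the keys of the records you later pass in. The following is a minimal sketch, assuming records are held as a dict that maps a record ID to a dict of field values, with None marking a missing value. `check_records` is a hypothetical helper written for this illustration and is not part of dedupe.

```python
variables = [
    {'field': 'Site name', 'type': 'String'},
    {'field': 'Address', 'type': 'String'},
    {'field': 'Zip', 'type': 'String', 'has missing': True},
    {'field': 'Phone', 'type': 'String', 'has missing': True},
]

# Assumed data layout: record ID -> record dict, None for a missing value.
data = {
    1: {'Site name': 'Acme Daycare', 'Address': '12 Main St',
        'Zip': '60601', 'Phone': None},
    2: {'Site name': 'ACME Day Care', 'Address': '12 Main Street',
        'Zip': None, 'Phone': '555-0100'},
}


def check_records(data, variables):
    """Hypothetical helper: report fields that are absent, or that are
    None even though the variable is not flagged with 'has missing'."""
    problems = []
    for record_id, record in data.items():
        for variable in variables:
            field = variable['field']
            if field not in record:
                problems.append((record_id, field, 'absent'))
            elif record[field] is None and not variable.get('has missing', False):
                problems.append((record_id, field, 'missing value not allowed'))
    return problems


problems = check_records(data, variables)
```

Running a check like this before training can surface misspelled field names early, when they are cheap to fix.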

prepare_training

uncertain_pairs

mark_pairs

train

write_training

write_settings

cleanup_training

partition

StaticDedupe Objects

dedupe.StaticDedupe

with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticDedupe(f)

partition

RecordLink Objects

dedupe.RecordLink

# initialize from a defined set of fields
variables = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'String', 'has missing':True},
    {'field' : 'Phone', 'type': 'String', 'has missing':True},
]
linker = dedupe.RecordLink(variables)

prepare_training

uncertain_pairs

mark_pairs

train

write_training

write_settings

cleanup_training

join

StaticRecordLink Objects

dedupe.StaticRecordLink

with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticRecordLink(f)

join

Gazetteer Objects

dedupe.Gazetteer

# initialize from a defined set of fields
variables = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'String', 'has missing':True},
    {'field' : 'Phone', 'type': 'String', 'has missing':True},
]
matcher = dedupe.Gazetteer(variables)

prepare_training

uncertain_pairs

mark_pairs

train

write_training

write_settings

cleanup_training

index

unindex

search

StaticGazetteer Objects

dedupe.StaticGazetteer

with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticGazetteer(f)

index

unindex

search

blocks

score

many_to_n

Lower Level Classes and Methods

With the methods documented above, you can work with data into the millions of records. However, if you are working with larger data, you may not be able to load all of it into memory. In that case you'll need to interact with some of the lower level classes and methods.

The PostgreSQL and MySQL examples use these lower level classes and methods.
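The general idea behind working at this lower level is to compare only records that share a blocking key, and to stream candidate pairs rather than materialize every comparison. The sketch below illustrates that idea in plain Python. It is not dedupe's implementation: `fingerprint` and `candidate_id_pairs` are hypothetical names, and the blocking rule (first three characters of the zip code) is an arbitrary assumption.

```python
from collections import defaultdict
from itertools import combinations


def fingerprint(record):
    """Hypothetical blocking key: first three characters of the zip code."""
    zip_code = record.get('Zip') or ''
    return zip_code[:3]


def candidate_id_pairs(records):
    """Yield pairs of record IDs that share a blocking key.

    `records` is any iterable of (record_id, record) tuples, so it can be
    a database cursor. Only IDs are kept per block, not whole records.
    """
    blocks = defaultdict(list)
    for record_id, record in records:
        key = fingerprint(record)
        if key:  # records without a usable key are not blocked
            blocks[key].append(record_id)
    for member_ids in blocks.values():
        # combinations() is lazy, so pairs are produced one at a time
        yield from combinations(member_ids, 2)


records = [
    (1, {'Zip': '60601'}),
    (2, {'Zip': '60602'}),
    (3, {'Zip': '10001'}),
    (4, {'Zip': None}),
]
pairs = list(candidate_id_pairs(records))
```

In a database-backed workflow the block-to-ID mapping would typically live in a table rather than in a dict, which is the part that keeps memory use bounded.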

Dedupe and StaticDedupe

dedupe.Dedupe

fingerprinter

Instance of dedupe.blocking.Fingerprinter class if train has been run, else None.

pairs

score

cluster

dedupe.StaticDedupe

fingerprinter

Instance of dedupe.blocking.Fingerprinter class

pairs(data)

Same as dedupe.Dedupe.pairs

score(pairs)

Same as dedupe.Dedupe.score

cluster(scores, threshold=0.5)

Same as dedupe.Dedupe.cluster
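To make the role of the threshold concrete, here is a simplified sketch that groups records by treating every pair scoring above the threshold as a link and collecting connected components with a union-find structure. This is an assumption-laden illustration, not the clustering that dedupe performs: `cluster_by_threshold` is a hypothetical function, and the `((id_a, id_b), score)` input shape is chosen for the example.

```python
def cluster_by_threshold(scored_pairs, threshold=0.5):
    """Group record IDs connected by pairs scoring above `threshold`.

    Returns a sorted list of sorted clusters; singletons are omitted.
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk to the root
        # with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (id_a, id_b), score in scored_pairs:
        root_a, root_b = find(id_a), find(id_b)
        if score > threshold and root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    clusters = {}
    for record_id in parent:
        clusters.setdefault(find(record_id), []).append(record_id)
    return sorted(sorted(members) for members in clusters.values()
                  if len(members) > 1)


scored = [((1, 2), 0.9), ((2, 3), 0.8), ((3, 4), 0.2), ((5, 6), 0.7)]
clusters = cluster_by_threshold(scored, threshold=0.5)
```

Raising the threshold in this sketch splits groups apart, and lowering it merges them, which is the trade-off between precision and recall that the parameter controls.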

RecordLink and StaticRecordLink

dedupe.RecordLink

fingerprinter

Instance of dedupe.blocking.Fingerprinter class if train has been run, else None.

pairs

score

one_to_one

many_to_one

dedupe.StaticRecordLink

fingerprinter

Instance of dedupe.blocking.Fingerprinter class

pairs(data_1, data_2)

Same as dedupe.RecordLink.pairs

score(pairs)

Same as dedupe.RecordLink.score

one_to_one(scores, threshold=0.0)

Same as dedupe.RecordLink.one_to_one

many_to_one(scores, threshold=0.0)

Same as dedupe.RecordLink.many_to_one
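The difference between the two linking constraints can be sketched in plain Python. The functions below are hypothetical illustrations, not dedupe's code. They assume scored pairs shaped as `((id_1, id_2), score)`, a simple greedy strategy for the one-to-one case, and a reading of many-to-one in which each record from the first dataset links to at most one record in the second, while a record in the second may be linked more than once.

```python
def greedy_one_to_one(scored_pairs, threshold=0.0):
    """Take the best-scoring pairs first; use each record at most once."""
    used_1, used_2 = set(), set()
    links = []
    ranked = sorted(scored_pairs, key=lambda pair: pair[1], reverse=True)
    for (id_1, id_2), score in ranked:
        if score > threshold and id_1 not in used_1 and id_2 not in used_2:
            links.append(((id_1, id_2), score))
            used_1.add(id_1)
            used_2.add(id_2)
    return links


def best_match_per_record(scored_pairs, threshold=0.0):
    """Keep the best link per first-dataset record; targets may repeat."""
    best = {}
    for (id_1, id_2), score in scored_pairs:
        if score > threshold and (id_1 not in best or score > best[id_1][1]):
            best[id_1] = ((id_1, id_2), score)
    return list(best.values())


scored = [
    (('a1', 'b1'), 0.9),
    (('a2', 'b1'), 0.8),
    (('a2', 'b2'), 0.6),
    (('a3', 'b2'), 0.1),
]
one_to_one_links = greedy_one_to_one(scored, threshold=0.5)
many_to_one_links = best_match_per_record(scored, threshold=0.5)
```

With this input, the one-to-one sketch links `a2` to `b2` because `b1` is already taken, while the many-to-one sketch lets both `a1` and `a2` link to `b1`.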

Gazetteer and StaticGazetteer

dedupe.Gazetteer

fingerprinter

Instance of dedupe.blocking.Fingerprinter class if train has been run, else None.

blocks

score

many_to_n

dedupe.StaticGazetteer

fingerprinter

Instance of dedupe.blocking.Fingerprinter class

blocks(data)

Same as dedupe.Gazetteer.blocks

score(blocks)

Same as dedupe.Gazetteer.score

many_to_n(score_blocks, threshold=0.0, n_matches=1)

Same as dedupe.Gazetteer.many_to_n
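The interplay of `threshold` and `n_matches` can be illustrated with a small stand-alone sketch: for each messy record, keep at most `n_matches` canonical candidates whose score exceeds the threshold. `top_n_matches` is a hypothetical function written for this example, and the flat `((messy_id, canonical_id), score)` input is an assumption that differs from the score blocks the library method consumes.

```python
from collections import defaultdict


def top_n_matches(scored_pairs, threshold=0.0, n_matches=1):
    """Return {messy_id: [(canonical_id, score), ...]} with the best
    `n_matches` candidates per messy record, highest score first."""
    by_record = defaultdict(list)
    for (messy_id, canonical_id), score in scored_pairs:
        if score > threshold:
            by_record[messy_id].append((canonical_id, score))
    return {
        messy_id: sorted(candidates, key=lambda m: m[1], reverse=True)[:n_matches]
        for messy_id, candidates in by_record.items()
    }


scored = [
    (('m1', 'c1'), 0.4),
    (('m1', 'c2'), 0.9),
    (('m1', 'c3'), 0.7),
    (('m2', 'c1'), 0.2),
]
matches = top_n_matches(scored, threshold=0.3, n_matches=2)
```

Records with no candidate above the threshold, such as `m2` here, simply do not appear in the result.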

Fingerprinter Objects

dedupe.blocking.Fingerprinter

__call__

index_fields

index

unindex

reset_indices

Convenience Functions

dedupe.console_label

dedupe.training_data_dedupe

dedupe.training_data_link

dedupe.canonicalize

dedupe.read_training

dedupe.write_training