Architecture Infelicities #483

Open · fgregg opened this Issue Sep 6, 2016 · 1 comment

fgregg commented Sep 6, 2016

If we ever do a dedupe 2.0, these are some things I'd like to change:

The data model is a list of variables:

Right now we are asking the user to basically pass in a list of class init method signatures. We should just take the next little step and have them pass in the constructed variable objects themselves. This would let us simplify the code in many places, including getting rid of the current namespace plugin architecture, and would make dedupe more easily extensible.
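
For contrast, here is roughly what the equivalent definition looks like in the current dict-based format (a sketch of the 1.x style from memory; the exact keys and field names are only illustrative):

fields = [{'field': 'name', 'type': 'String', 'has missing': True},
          {'field': 'address', 'type': 'String'},
          {'field': 'address', 'type': 'Address'},
          {'type': 'Interaction', 'interaction variables': ['name', 'address']},
          {'field': 'description', 'type': 'Text', 'corpus': docs}]

Each dict is essentially the kwargs for one variable class's init method, which is why accepting the constructed objects directly is such a small step.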

Dedupe Flavor

[String('name', missing=True),
 String('address'),
 Address('address'),
 Interaction((String('name'), String('address'))),
 Text('description', corpus=docs)]

Record Link Flavor

[String('name', 'Name'),
 String('address'),
 Address('address', 'Address'),
 Interaction((String('name', 'Name'), String('address', 'Address'))),
 Text('description', 'product info', corpus=docs)]

It doesn't seem right that there are methods I can't call unless other methods have already been called.

I'd like something like:

labeler = deduper.labeler(data)

next_pair = labeler.uncertainPairs()
...
labeler.markPairs(labeled_pair)

But I'm not sure of the right way to get the labeled data back into a deduper instance.
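
One possible shape for this, as a rough sketch only (this Labeler class and its attributes are hypothetical, not existing dedupe API): the labeler accumulates labeled pairs and exposes them as plain data that a deduper can take as an argument.

class Labeler(object):
    """Hypothetical standalone labeler that accumulates labeled pairs."""

    def __init__(self, data_model, data):
        self.data_model = data_model
        # labeled pairs collect here in the {'match': [...], 'distinct': [...]} shape
        self.labeled_data = {'match': [], 'distinct': []}
        self._candidates = self._sample(data)

    def uncertainPairs(self):
        # return the candidate pairs the active learner is least certain about
        ...

    def markPairs(self, labeled_pairs):
        # labeled_pairs: {'match': [...], 'distinct': [...]}
        for key in self.labeled_data:
            self.labeled_data[key].extend(labeled_pairs.get(key, []))

    def _sample(self, data):
        # draw candidate pairs to present for labeling
        ...

The deduper then never needs to know about the labeler at all; it just takes labeler.labeled_data, which is basically the Dedupe(data_model, labeled_data, data) call sketched below.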

Here's roughly the data flow:

data = ...
linkage_type = (Dedupe, RecordLink, Gazetteer)
data_model = ...

# Classifier
classifier = ...
labeling_sample = sampler(data, data_model, linkage_type)
labeled_data = labeler(labeling_sample, data_model, classifier)
trained_classifier = train_classifier(labeled_data, data_model, classifier)

# Blocker
random_sample = sampler(data, linkage_type)
trained_blocker = train_blocker(labeled_data, random_sample, data_model)

# All Together
clusterer = ...
clusters = clusterer(trained_classifier(trained_blocker(data)), linkage_type)

I think we could do it with basically this API:

data_model = DataModel(...)

# Active Labeling
labeler = Labeler(data_model, data)
labeled_data = labeler.labeled_data

# Training
deduper = Dedupe(data_model, labeled_data, data)
deduper.match(data)
deduper.writeSettings(file_obj)

# From settings file
deduper = Dedupe(settings_file)
deduper.match(data)

# We would need to pull the methods for reading and writing training data into top-level functions.
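
For example (write_training and read_training are hypothetical top-level functions, sketched only to illustrate the idea):

import json

def write_training(labeled_data, file_obj):
    # labeled_data: {'match': [...], 'distinct': [...]}
    json.dump(labeled_data, file_obj)

def read_training(file_obj):
    return json.load(file_obj)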

It would also be interesting to look at the scikit-learn API.
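
A scikit-learn-flavored version might look something like this (purely a hypothetical sketch following their fit/predict convention, not an existing or proposed dedupe interface):

# hypothetical estimator-style interface
deduper = Dedupe(data_model)
deduper.fit(labeled_data)          # train the classifier and blocker from labeled pairs
clusters = deduper.predict(data)   # block, score, and cluster new records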

Do we really need active and static instances of the different classes? It seems like we could just do this through multiple dispatch.

Python 3 has some support for this: https://docs.python.org/3/library/functools.html#functools.singledispatch
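
For example, a single factory function could dispatch on what it is handed (SettingsFile and TrainingData are hypothetical wrapper types here, used only to show the functools.singledispatch mechanics):

from functools import singledispatch

class SettingsFile(object):
    def __init__(self, file_obj):
        self.file_obj = file_obj

class TrainingData(object):
    def __init__(self, labeled_pairs):
        self.labeled_pairs = labeled_pairs

@singledispatch
def build_matcher(source, data_model):
    raise NotImplementedError("can't build a matcher from %r" % type(source))

@build_matcher.register(SettingsFile)
def _(source, data_model):
    # "static" path: load a previously trained matcher from settings
    ...

@build_matcher.register(TrainingData)
def _(source, data_model):
    # "active" path: train a matcher from labeled pairs
    ...

The caller then gets one entry point instead of parallel active and static classes.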

fgregg commented Sep 11, 2016

Some of these architectural issues might be of interest to you, @Yashg19. Particularly, separating out the labeler from the matching class would make it easier to use different active learning algorithms, like vowpal wabbit.
