Architecture Infelicities #483
If we ever do a dedupe 2.0, these are some things I'd like to change:
The data model is a list of variables:
Right now we are basically asking the user to pass in a list of class init method signatures. We should just take the next little step and have them pass in the variable objects themselves. This would let us simplify the code in many places, including getting rid of the current namespace plugin architecture, and would make dedupe easier to extend.
    [String('name', missing=True),
     String('address'),
     Address('address'),
     Interaction((String('name'), String('address'))),
     Text('description', corpus=docs)]
Record Link Flavor
    [String('name', 'Name'),
     String('address'),
     Address('address', 'Address'),
     Interaction((String('name', 'Name'), String('address', 'Address'))),
     Text('description', 'product info', corpus=docs)]
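To make the idea concrete, here is a minimal sketch of what "variables as objects" could look like. The class layout, the `other_field` and `missing` parameter names, and the `fields()` helper are all assumptions for illustration; none of this is dedupe's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Variable:
    """Hypothetical base class: one field to compare."""
    field: str
    # Record-link flavor: name of the matching field in the second dataset.
    other_field: Optional[str] = None
    missing: bool = False

    def fields(self) -> Tuple[str, str]:
        """Field names to read from the first and second record."""
        return (self.field, self.other_field or self.field)


class String(Variable):
    """Hypothetical string variable; comparison logic would live here."""


@dataclass(frozen=True)
class Interaction:
    """Hypothetical interaction between already-constructed variables."""
    variables: Tuple[Variable, ...]


# Dedupe flavor: one field name per variable.
data_model = [
    String('name', missing=True),
    String('address'),
    Interaction((String('name'), String('address'))),
]

# Record link flavor: a field name for each dataset.
link_model = [String('name', 'Name')]
```

Because each entry is already an instance, a new variable type would only need to be a class the user can import or define, with no plugin lookup by type name.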
It doesn't seem right that there are methods that I can't call unless other methods have been already called.
I'd like something like:
    labeler = deduper.labeler(data)
    next_pair = labeler.uncertainPairs()
    ...
    labeler.markPairs(labeled_pair)
But I'm not sure what the right way is to get the labeled data back into a deduper instance.
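One possible answer, sketched below under heavy assumptions: the labeler owns the labeled pairs and exposes them as plain data, which is then handed explicitly to whatever trains the classifier. The `Labeler` class, its `labeled` attribute, and the `(pair, label)` argument shape are hypothetical, and the pair selection is a placeholder rather than real uncertainty sampling.

```python
class Labeler:
    """Hypothetical stand-alone labeler; not part of dedupe's API."""

    def __init__(self, candidate_pairs):
        self._candidates = list(candidate_pairs)
        # Labeled pairs are kept as plain data so they can be passed on.
        self.labeled = {"match": [], "distinct": []}

    def uncertainPairs(self):
        """Return the next pair to label, or None when none remain.

        Placeholder: a real labeler would rank pairs by classifier
        uncertainty instead of popping from the end of a list.
        """
        return self._candidates.pop() if self._candidates else None

    def markPairs(self, labeled_pair):
        """Record one (pair, label) decision made by the user."""
        pair, label = labeled_pair
        self.labeled[label].append(pair)


labeler = Labeler([("a", "b"), ("a", "c")])
next_pair = labeler.uncertainPairs()
labeler.markPairs((next_pair, "match"))
```

With this shape no method on the deduper depends on another having been called first: training would take `labeler.labeled` as an ordinary argument.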
Here's roughly the data flow:

    data = ...
    linkage_type = (Dedupe, RecordLink, Gazetteer)
    data_model = ...

    # Classifier
    classifier = ...
    labeling_sample = sampler(data, data_model, linkage_type)
    labeled_data = labeler(labeling_sample, data_model, classifier)
    trained_classifier = train_classifier(labeled_data, data_model, classifier)

    # Blocker
    random_sample = sampler(data, linkage_type)
    trained_blocker = train_blocker(labeled_data, random_sample, data_model)

    # All Together
    clusterer = ...
    clusters = clusterer(trained_classifier(trained_blocker(data)), linkage_type)
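The last line of that flow, where trained pieces are composed as plain callables, can be tried out with toy stand-ins. Everything below is a simplified assumption: the `toy_*` names, the first-letter blocking key, and exact-equality scoring replace the real training steps, and `linkage_type` is left out entirely.

```python
from itertools import combinations


def toy_blocker(key):
    """Return a blocker that groups record ids by a key function."""
    def blocker(data):
        blocks = {}
        for record_id, record in data.items():
            blocks.setdefault(key(record), []).append(record_id)
        # Blocks with a single record produce no pairs, so drop them.
        return [ids for ids in blocks.values() if len(ids) > 1]
    return blocker


def toy_classifier(data, threshold):
    """Return a classifier that scores every pair within each block."""
    def classifier(blocks):
        scored = []
        for ids in blocks:
            for a, b in combinations(ids, 2):
                score = 1.0 if data[a] == data[b] else 0.0
                if score >= threshold:
                    scored.append((a, b, score))
        return scored
    return classifier


def toy_clusterer(scored_pairs):
    """Merge matched pairs into clusters of connected record ids."""
    clusters = []
    for a, b, _ in scored_pairs:
        touching = [c for c in clusters if a in c or b in c]
        merged = {a, b}.union(*touching)
        clusters = [c for c in clusters if c not in touching] + [merged]
    return clusters


data = {1: "ann", 2: "ann", 3: "bob", 4: "al"}
trained_blocker = toy_blocker(key=lambda record: record[0])
trained_classifier = toy_classifier(data, threshold=0.5)
clusters = toy_clusterer(trained_classifier(trained_blocker(data)))
```

Each stage only sees the output of the previous one, which is what would make the pieces swappable.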
I think we could do it with basically this API
It would also be interesting to look at the scikit-learn API
Do we really need active and static instances of the different classes? It seems like we could just do this through multiple dispatch.
Python 3 has some support for this, though only for dispatch on the type of the first argument: https://docs.python.org/3/library/functools.html#functools.singledispatch
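As a rough illustration of how `functools.singledispatch` could select behavior by linkage type, here is a sketch. The marker classes `Dedupe` and `RecordLink` and the `sample_pairs` function are hypothetical stand-ins defined for this example. Only the type of the first argument is considered, so this is single dispatch rather than full multiple dispatch.

```python
from functools import singledispatch
from itertools import combinations, product


class Dedupe:
    """Hypothetical marker for linking records within one dataset."""


class RecordLink:
    """Hypothetical marker for linking records across two datasets."""


@singledispatch
def sample_pairs(linkage_type, data):
    """Fallback for linkage types with no registered implementation."""
    raise NotImplementedError(type(linkage_type).__name__)


@sample_pairs.register(Dedupe)
def _(linkage_type, data):
    # All unordered pairs drawn from a single dataset.
    return list(combinations(data, 2))


@sample_pairs.register(RecordLink)
def _(linkage_type, data):
    # All pairs with one record from each of two datasets.
    first, second = data
    return list(product(first, second))


dedupe_pairs = sample_pairs(Dedupe(), ["a", "b", "c"])
link_pairs = sample_pairs(RecordLink(), (["a"], ["x", "y"]))
```

Dispatching on more than one argument, say linkage type and an active or static mode together, would need either nested dispatch or a third-party multiple-dispatch library.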