Refactor labeler.py #1065

The meaning of these methods was not consistent with how sklearn uses them. In sklearn, fit_transform(X, y) means fit(X, y).transform(X), but we were using it as fit(transform(X), y). So rename fit_transform() to fit(), which gives it the correct meaning. What's more, the transform method isn't used outside the class, and I don't think it should be. The public API should only deal with TrainingPairs; it shouldn't deal with the calculated distances. Rename it to _distances(). The old fit() method is now just an internal helper that deals with already-calculated distance data.
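A minimal sketch of the two conventions, assuming toy stand-in classes (`MeanCenterer` and `PairLearner` are hypothetical names for illustration, not classes from labeler.py or sklearn):

```python
class MeanCenterer:
    """Hypothetical transformer that follows the sklearn convention."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self  # returning self allows fit(...).transform(...)

    def transform(self, X):
        return [x - self.mean_ for x in X]

    def fit_transform(self, X, y=None):
        # sklearn meaning: learn from X, then transform that same X.
        return self.fit(X, y).transform(X)


class PairLearner:
    """Hypothetical learner shaped like the renamed API described above."""

    def fit(self, pairs, y):
        # Public entry point: takes raw pairs, computes distances internally.
        self._fit(self._distances(pairs), y)
        return self

    def _distances(self, pairs):
        # Internal helper (the old public transform()).
        return [abs(a - b) for a, b in pairs]

    def _fit(self, distances, y):
        # Internal helper (the old fit()): works on precomputed distances.
        # Toy rule: put the threshold halfway between the two classes.
        matches = [d for d, label in zip(distances, y) if label]
        distinct = [d for d, label in zip(distances, y) if not label]
        self.threshold_ = (max(matches) + min(distinct)) / 2


centered = MeanCenterer().fit_transform([1, 2, 3])
learner = PairLearner().fit([(1, 1), (1, 5)], [1, 0])
```

Callers of `PairLearner` only ever pass pairs and labels; the distance representation never leaks out of the class.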

It is already done in self._fit()

Now it prints something like `[arg-type]` when it errors, so you can add `# type: ignore[arg-type]` and be specific about the error you are silencing.
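For illustration, a small sketch of a targeted ignore, assuming the type checker is mypy with error codes shown (the `double` function is a made-up example, not code from this PR):

```python
def double(n: int) -> int:
    return n * 2


# Without the comment, mypy would report something like:
#   error: Argument 1 to "double" has incompatible type "str"; expected "int"  [arg-type]
# The bracketed code silences only that error class on this line, so any
# other kind of type error on the same line would still be reported.
result = double("ab")  # type: ignore[arg-type]
```

A bare `# type: ignore` would hide every error on the line, which is why the specific form is preferable.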

It's not actually used anywhere besides creating the initial set of candidate predicates.

It's only used internally, so make that obvious.

It is already part of the Learner base class.

DisagreementLearner is really reaching far down into the contained objects. The lower classes themselves should be responsible for this sort of thing. Sure, it makes the code more complicated, but the way it is now fools us into thinking it is simple, and makes it prone to breaking in the future.
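A rough sketch of the idea, using hypothetical stand-ins (`SubLearner` and `OwningLearner` are illustrative names and do not reproduce the real classes): the owner asks each contained object to update itself instead of editing its internals directly.

```python
class SubLearner:
    """Hypothetical contained learner that manages its own state."""

    def __init__(self, candidates):
        self._candidates = list(candidates)
        self._cached_scores = None

    def remove(self, index):
        # The sub-learner is responsible for keeping its own data consistent.
        del self._candidates[index]
        self._cached_scores = None

    def __len__(self):
        return len(self._candidates)


class OwningLearner:
    """Hypothetical owner: it delegates instead of reaching into internals."""

    def __init__(self, candidates):
        self._learners = [SubLearner(candidates), SubLearner(candidates)]

    def remove(self, index):
        for learner in self._learners:
            learner.remove(index)  # no access to learner._candidates here


owner = OwningLearner([("a", "b"), ("c", "d"), ("e", "f")])
owner.remove(1)
```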

It's not actually used anywhere. See dedupeio#1065 (comment)

We were only testing a very small component of labeler. Now we actually go through most lines of code. Check out coverage.

- Remove many unused methods of MatchLearner like mark() and pop(). I think these were used when this was the only learner class, but now they aren't used anywhere.
- Just set MatchLearner.candidates in the constructor. It makes it way easier to reason about.
- Adjust the inheritance. Now DisagreementLearner is out of the hierarchy that MatchLearner and BlockLearner are in. This is good, because DisagreementLearner OWNS these other two; it is not an "is a" relationship.
- Remove the mark() function from the sub-learners. They just have the fit() method now, but they don't actually persist this training data, which is in line with the naming of fit().
- Make `candidates` a read-only attribute. This makes it easier to reason about: we don't have to worry about someone outside of the class coming in and changing it.
- Fix a bug in BlockLearner where `remove` never actually removed the entry from `candidates`, so if you broke the cache of _cached_scores and ended up calling `self._predict(self.candidates)`, you would get the result from all of the original candidates.
- In the test, actually check for the values of the candidates, not just the number of them.
- Rename `_remove` to `remove` in the sub-learners, since they are publicly used in DisagreementLearner.
- Remove the unused candidate_scores from the DisagreementLearner public API.
- Always make a copy in sample_records() to avoid footguns.
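Two of the points above, the read-only `candidates` attribute and the defensive copy, can be sketched as follows. This is only an illustration under assumptions: `CandidateHolder` is a hypothetical class, and this `sample_records` has a made-up signature that need not match the real function.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


class CandidateHolder:
    """Hypothetical class exposing candidates as a read-only attribute."""

    def __init__(self, candidates: Sequence[Pair]) -> None:
        self._candidates: List[Pair] = list(candidates)

    @property
    def candidates(self) -> Tuple[Pair, ...]:
        # No setter is defined, so `holder.candidates = ...` raises
        # AttributeError. Returning a tuple also prevents in-place edits.
        return tuple(self._candidates)

    def remove(self, index: int) -> None:
        # Actually drop the entry, so later reads no longer see it.
        del self._candidates[index]


def sample_records(records: Sequence[str], n: int) -> List[str]:
    """Hypothetical sampler that always returns a new list."""
    if n >= len(records):
        return list(records)  # copy, even when nothing is dropped
    return random.sample(list(records), n)


holder = CandidateHolder([("a", "b"), ("c", "d")])
holder.remove(0)

records = ["x", "y"]
sampled = sample_records(records, 5)
sampled.append("z")  # mutating the result leaves `records` untouched
```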

The config in setup.cfg is ignored.

Makes it consistent between the MatchLearner and BlockLearner. Also fixes a bug where self._fitted was never set to True.
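As a hedged illustration of why the flag matters (`FlaggedLearner` and its toy model are hypothetical and do not mirror the real implementation): if `_fitted` is never set, any guard that depends on it treats the learner as untrained forever.

```python
class FlaggedLearner:
    """Hypothetical learner that tracks whether fit() has run."""

    def __init__(self):
        self._fitted = False
        self._mean = 0.0

    def fit(self, values):
        self._mean = sum(values) / len(values)
        self._fitted = True  # the step the bug fix above adds

    def predict(self, value):
        if not self._fitted:
            raise RuntimeError("call fit() before predict()")
        return value > self._mean


flagged = FlaggedLearner()
flagged.fit([1.0, 3.0])
```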

Commits on Sep 1, 2022

Merge branch 'main' into labeler-rename

NickCrews committed Sep 1, 2022

Commit 1212a7b

Commits on Jun 17, 2022

Commits on Jun 18, 2022

Commits on Jul 9, 2022
