Refactor labeler.py #1065

Merged: 29 commits from NickCrews:labeler-rename into dedupeio:main, Sep 2, 2022
Conversation

@NickCrews (Contributor) commented Jun 12, 2022:

See each commit individually; there are no functional changes.

EDIT: there still shouldn't be any functional changes, though this has gotten much more substantial.

dedupe/labeler.py
        return self.data_model.distances(pairs)

-    def fit(self, X: numpy.typing.NDArray[numpy.float_], y: LabelsLike) -> None:
+    def _fit(self, X: numpy.typing.NDArray[numpy.float_], y: LabelsLike) -> None:
        self.y: numpy.typing.NDArray[numpy.int_] = numpy.array(y)
        self.X = X
Contributor:

I know this was not your design, but it's a bit weird to have the fit method be responsible for setting these instance attributes. mark would be a more natural place, imo.

Contributor Author:

I agree. I'm still trying to figure out how to deal with the fact that the learners are stateful with their data. I almost want to split them up, so that one component is stateless like sklearn models and has a fit(), and that component is owned by a learner that is stateful, remembers the old training data, and has the mark().
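A minimal sketch of that split, with hypothetical class names (StatelessMatcher and StatefulLearner are illustrations, not dedupe's API):

import numpy
import sklearn.linear_model


class StatelessMatcher:
    """Stateless, sklearn-style: fit() retrains from scratch on whatever it is given."""

    def __init__(self) -> None:
        self._model = sklearn.linear_model.LogisticRegression()

    def fit(self, X: numpy.ndarray, y: numpy.ndarray) -> None:
        self._model.fit(X, y)


class StatefulLearner:
    """Stateful: mark() accumulates labeled examples and refits the matcher."""

    def __init__(self, matcher: StatelessMatcher) -> None:
        self.matcher = matcher
        self._X: list = []
        self._y: list = []

    def mark(self, X_new, y_new) -> None:
        # Remember the old training data, then retrain on all of it.
        self._X.extend(X_new)
        self._y.extend(y_new)
        self.matcher.fit(numpy.array(self._X), numpy.array(self._y))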

Contributor Author:

OK, in this new version all the sub-learners are stateless (with regard to saving training data) and don't have a mark(), just a stateless fit(). Only DisagreementLearner has a mark() that stores training data.

        self.y: numpy.typing.NDArray[numpy.int_] = numpy.array(y)
        self.X = X
        sklearn.linear_model.LogisticRegression.fit(self, self.X, self.y)
Contributor:

I think it's good to have an explicit call here, but if we do that, we shouldn't also have the class inherit from sklearn.linear_model.LogisticRegression.

Contributor Author:

If I change this to super().fit(self.X, self.y), then mypy complains, because it needs both parents in the inheritance to have the same signature for fit:

    dedupe/labeler.py:84: error: Argument 1 to "fit" of "Learner" has incompatible type "ndarray[Any, dtype[floating[_64Bit]]]"; expected "List[Tuple[Mapping[str, Any], Mapping[str, Any]]]"  [arg-type]

I think we should rename RLRLearner to MatchLearner (it compares records using distances, which is a nice opposition to BlockLearner) and make it wrap LogisticRegression (or RandomForest, or whatever) using composition, not inheritance.
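A minimal sketch of that composition, assuming a data_model object with a distances() method as used elsewhere in this PR (the class body is illustrative, not the PR's final code):

import numpy
import sklearn.linear_model


class MatchLearner:
    def __init__(self, data_model) -> None:
        self.data_model = data_model
        # Owned, not inherited: mypy never has to reconcile two fit() signatures.
        self._classifier = sklearn.linear_model.LogisticRegression()

    def fit(self, pairs, y) -> None:
        # The public API deals in record pairs; distances stay internal.
        distances = self.data_model.distances(pairs)
        self._classifier.fit(distances, numpy.array(y))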

Contributor Author:

See if you like what I did.

        for learner in self.learners:
-            learner.fit_transform(pairs, y)
+            learner.fit(pairs, y)
Contributor:

fit is a bad verb for something that both sets the data on a learner and retrains the learner. I think maybe we want to get rid of fit from the public API and have mark be responsible for all of this.

Contributor Author:

See new version and comment above.

The meaning of these methods is not consistent with how sklearn uses them.

For sklearn, fit_transform(X, y) means fit(X, y).transform(X), but we were using it as fit(transform(X), y). So just rename fit_transform() to fit(), and now it has the correct meaning.
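A small illustration of the mismatch, using a stock sklearn transformer (illustrative only, not dedupe code):

import numpy
from sklearn.preprocessing import StandardScaler

X = numpy.array([[0.0], [1.0], [2.0]])

# sklearn's contract: fit_transform(X) == fit(X).transform(X)
scaler = StandardScaler()
assert (scaler.fit_transform(X) == scaler.fit(X).transform(X)).all()

# The old labeler usage was the opposite order -- transform the raw
# pairs into distances first, then fit on those distances, i.e.
# fit(transform(X), y) -- so the method was really just a fit().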

What's more, the transform method isn't used outside the class, and I don't think it should be. The public API should only deal with TrainingPairs; it shouldn't deal with the calculated distances. Rename it to _distances().

The old fit() method is now just an internal helper method that deals with already-calculated distance data.

It is already done in self._fit().

Now it prints something like `[arg-type]` when it errors, so you can add `# type: ignore[arg-type]` and be specific about the error you are silencing.
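For example (illustrative only), an error-code-specific ignore silences just the one mypy complaint:

# Only the [assignment] error is suppressed; any other mypy error on
# this line would still be reported.
x: int = "not an int"  # type: ignore[assignment]
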
It's not actually used anywhere besides creating the initial set of candidate predicates.

It's only used internally, so make that obvious.

Already part of the Learner base class.

DisagreementLearner is really reaching far down into the contained objects. The lower-level classes themselves should be responsible for this sort of thing. Sure, it makes the code more complicated, but the way it is now fools us into thinking it is simple, and makes it prone to breaking in the future.

We were only testing a very small component of labeler.

Now we actually go through most lines of code. Check out coverage.
- Remove many unused methods of MatchLearner like mark() and pop().
  I think these used to get used when this was the only learner class,
  but now they aren't used anywhere.
- Just set MatchLearner.candidates in the constructor. It makes it way
  easier to reason about.
- Adjust the inheritance. Now DisagreementLearner is out of the
  hierarchy that MatchLearner and BlockLearner are in. This is good,
  because DisagreementLearner OWNS these other two; it is not an "is a"
  relationship.
- Remove the mark() function from the sub-learners. They just have the
  fit() method now, but they don't actually persist this training
  data, which is in line with the naming of fit().
- Make `candidates` a read-only attribute (see the sketch after this
  list). It makes the class easier to reason about; we don't have to
  worry about someone outside the class coming in and changing it.
- Fix a bug in BlockLearner where `remove` never actually removed the
  entry from `candidates`, so if you broke the cache of _cached_scores
  and ended up calling `self._predict(self.candidates)`, you would get
  the result from all of the original candidates.
- In the test, actually check the values of the candidates, not just
  the number of them.
- Rename `_remove` to `remove` in the sub-learners, since they are
  used publicly in DisagreementLearner.
- Remove the unused candidate_scores from the DisagreementLearner
  public API.
- Always make a copy in sample_records() to avoid footguns.
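
A minimal sketch of the read-only `candidates` attribute mentioned above (illustrative shape, not the PR's exact code):

class BlockLearner:
    def __init__(self, candidates: list) -> None:
        self._candidates = candidates

    @property
    def candidates(self) -> list:
        # No setter is defined, so `learner.candidates = [...]` raises
        # AttributeError; callers can read the list but cannot rebind it.
        return self._candidates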
@coveralls commented Jun 18, 2022:

Coverage Status

Coverage increased (+9.5%) to 73.629% when pulling 1212a7b on NickCrews:labeler-rename into c595052 on dedupeio:main.

@NickCrews (Contributor Author) commented Jun 20, 2022:

@fgregg this has gotten way bigger, and it was very much done in an experimental/haphazard way, so some commits undo/modify what previous commits did. I can rebase this in a more sensible way if the large-scale idea looks right. Maybe, though, take a look at the final result and see if this is generally heading in a good direction before I do all that cleanup work. Or I can split this into a PR with the less contentious changes, get that merged, and then take a look at the more structural changes afterwards. Thank you for looking!

@fgregg (Contributor) commented Jun 28, 2022:

Generally, I really like this direction. Sorry for the slow response, I've been out sick.

@fgregg left a comment:

Really liking where this is going. A few comments inline.

dedupe/labeler.py
@@ -138,9 +139,15 @@ def score(predicate: Predicate) -> float:

         return candidates

-    def cover(self, pairs: TrainingExamples) -> Cover:
+    def cover(self, pairs: TrainingExamples, index_predicates: bool = True) -> Cover:
Contributor:

I don't love this change. Is it really ugly to avoid this?

@NickCrews (Contributor Author) commented Jul 9, 2022:

Have you seen the other changes from this commit, and the corresponding commit message? I think the current way that we deal with this, where in DisagreementLearner.learn_predicates() we do self.blocker.block_learner.blocker.predicates.copy(), is a terrible code smell, reaching 3 levels deep. In addition, blocker, block_learner, and blocker are all the same thing. So I wanted to actually create public APIs at each level.

The other problem there is the pattern of "temporarily delete some predicates, do some stuff, and then restore them" which I didn't like.

I agree that this commit adds more lines of code, which I don't like, but what I liked less was the unofficial and haphazard methods we used.

When you say "this change", do you mean "adding index_predicates as an arg to this function"? We could probably adjust this. Or do you mean the larger-scale changes I mention above?

Contributor:

Actually, I'm good with this!

Contributor:

I'm really struggling with this. I think this is better than the code it replaces, but it seems like this filtering should not be the responsibility of the BlockLearner class.

Contributor:

Hmm... what about this design:

class BlockLearner(ABC):
    def learn(self, matches, recall, candidate_types='simple', selected_predicates=None):
        comparison_cover = self.comparison_cover
        if selected_predicates:
            candidate_preds = comparison_cover & selected_predicates
        else:
            candidate_preds = comparison_cover.keys()

        match_cover = self.cover(matches, candidate_preds)
        ...

    def cover(self, pairs, candidate_preds) -> Cover:
        predicate_cover = {}
        for predicate in candidate_preds:
            ...

then this labeler code could just be

    def learn_predicates(
        self, recall: float, selected_predicates
    ) -> tuple[Predicate, ...]:

        learned_preds: tuple[Predicate, ...]
        dupes = [pair for label, pair in zip(self.y, self.pairs) if label]
        learned_preds = self.blocker.block_learner.learn(
            dupes,
            recall=recall,
            candidate_types="random forest",
            selected_predicates=selected_predicates,
        )

and so then it would ultimately be the responsibility of the train method in api.py to filter the predicates and pass them on appropriately.

@NickCrews (Contributor Author) commented Aug 11, 2022:

I'm trying to wrap my head around the general pattern of what we're doing, correct me if I'm wrong. Right now there are three sets of predicates:

  1. all the available predicates from the data model
  2. A subset of 1, all the predicates that the BlockLearners use as candidates.
  3. A subset of 2, the final predicates chosen by the BlockLearners that we use to actually do the blocking in Dedupe.pairs().

Right now the BlockLearners are handed 1, and they filter that down to 2 using the index_predicates flag. Then they do the filtering from 2 to 3.

I agree that this is bad, these concerns should be separated. The filtering from 1 to 2 should happen external to the BlockLearners. The block learners should just get handed set 2, and they select set 3 from that. That is the direction that #1052 is trying to go in.

I'm not sure the 1->2 filtering should happen in train() though. Can't we do this at the very beginning, when the data model is first given to Dedupe? I don't like the comparison_cover & selected_predicates bit; this filtering is happening later than it should. I think we can make it so that training.BlockLearner is only ever handed the selected_predicates at the beginning, because in training.DedupeBlockLearner.__init__ we call

self.blocker = blocking.Fingerprinter(predicates)
self.blocker.index_all(data)

which makes us index ALL the predicates, even if in subsequent calls to learn() or cover() we then filter out the index predicates and don't use them.
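
A rough sketch of that alternative (make_block_learner() and is_index_predicate() are hypothetical helpers, not dedupe's API; blocking.Fingerprinter and index_all are as used above): filter once, up front, so only the predicates that will actually be used get indexed.

def make_block_learner(data, all_predicates, index_predicates: bool):
    if index_predicates:
        selected = list(all_predicates)
    else:
        # Drop index predicates before anything gets indexed.
        selected = [p for p in all_predicates if not is_index_predicate(p)]
    blocker = blocking.Fingerprinter(selected)
    blocker.index_all(data)  # indexes only what learn()/cover() will use
    return blocker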

If you think this is at least an improvement, how about you merge it, and then we can write PRs that go in this even better direction, so we have something concrete to talk about? I'm finding it hard without actually testing things out, but I don't want to have merge conflicts later, so I want main to be stable.

Or, I rebase and remove this commit from this PR and you merge everything else, and we work on this separately.

Contributor Author:

But yes I think your suggestion is better than what I have currently.

Contributor Author:

This conversation is getting too big; should it be its own issue?

Contributor Author:

@fgregg in case you didn't see this

@fgregg (Contributor) commented Jun 28, 2022:

With your refactor, it's pretty apparent that the weighted sampling methods on these classes are a bit odd. We can save that for a future PR though.

Probably, the right thing to do is to create a new object, or honestly, maybe just some new functions that take in a blocker coverage object and return a weighted sample of keys.
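
A rough sketch of such a free function (the name and the coverage shape are hypothetical; this assumes the coverage object maps each blocking key to the pairs it covers):

import random

def weighted_sample_keys(coverage: dict, k: int) -> list:
    """Sample k keys, weighting each key by how many pairs it covers."""
    keys = list(coverage)
    weights = [len(coverage[key]) for key in keys]
    return random.choices(keys, weights=weights, k=k)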

@NickCrews (Contributor Author):

@fgregg I fixed two of your requests, take a look at the final one that I didn't change because I wanted clarification.

@fgregg (Contributor) commented Aug 11, 2022:

Hi @NickCrews, sorry for the slow response!

I responded to you on the cover change.

@fgregg (Contributor) commented Aug 11, 2022:

@benchmark

@github-actions:

All benchmarks (diff):

before after ratio benchmark
530M 529M 1.00 canonical.Canonical.peakmem_run
16.8±0.2s 16.1±0.2s 0.96 canonical.Canonical.time_run
0.971 0.927 0.95 canonical.Canonical.track_precision
0.902 0.911 1.01 canonical.Canonical.track_recall
229M 229M 1.00 canonical_gazetteer.Gazetteer.peakmem_run(None)
14.4±0.2s 14.1±0.05s 0.98 canonical_gazetteer.Gazetteer.time_run(None)
0.964 0.982 1.02 canonical_gazetteer.Gazetteer.track_precision(None)
0.982 0.982 1.00 canonical_gazetteer.Gazetteer.track_recall(None)
229M 229M 1.00 canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'})
229M 229M 1.00 canonical_matching.Matching.peakmem_run({'threshold': 0.5})
12.9±0.05s 12.8±0.02s 0.99 canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'})
13.1±0.1s 13.2±0.07s 1.00 canonical_matching.Matching.time_run({'threshold': 0.5})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'})
0.92 0.911 0.99 canonical_matching.Matching.track_recall({'threshold': 0.5})

(logs)

@fgregg (Contributor) commented Sep 2, 2022:

I'm going to go ahead and bring this in, and we can keep working on improving how we interact with predicates in future PRs.

@fgregg fgregg merged commit 5742efc into dedupeio:main Sep 2, 2022
@NickCrews (Contributor Author):
Sweet, thanks for keeping momentum going @fgregg !
