Refactor labeler.py #1065

Merged: 29 commits from NickCrews:labeler-rename into dedupeio:main, Sep 2, 2022
Conversation

@NickCrews (Contributor) commented Jun 12, 2022:

See each commit individually; there are no functional changes.

EDIT: there still shouldn't be any functional changes, though this has gotten much more substantial.

dedupe/labeler.py
        return self.data_model.distances(pairs)

-    def fit(self, X: numpy.typing.NDArray[numpy.float_], y: LabelsLike) -> None:
+    def _fit(self, X: numpy.typing.NDArray[numpy.float_], y: LabelsLike) -> None:
        self.y: numpy.typing.NDArray[numpy.int_] = numpy.array(y)
        self.X = X
Contributor:

I know this was not your design, but it's a bit weird to have the fit method be responsible for setting these instance attributes. mark would be a more natural place, imo.

Contributor Author:

I agree. I'm still trying to figure out how to deal with the fact that the learners are stateful with their data. I almost want to split them up, so that one component is stateless like sklearn models and has a fit(), and that component is owned by a learner that is stateful, remembers the old training data, and has the mark().
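A minimal sketch of that split, with hypothetical class names (StatelessMatcher and StatefulLearner are illustrations, not dedupe's API):

import numpy
import sklearn.linear_model


class StatelessMatcher:
    """Stateless, sklearn-style: fit() retrains from scratch on whatever it is given."""

    def __init__(self) -> None:
        self._model = sklearn.linear_model.LogisticRegression()

    def fit(self, X: numpy.ndarray, y: numpy.ndarray) -> None:
        self._model.fit(X, y)


class StatefulLearner:
    """Stateful: mark() accumulates labeled examples and refits the matcher."""

    def __init__(self, matcher: StatelessMatcher) -> None:
        self.matcher = matcher
        self._X: list = []
        self._y: list = []

    def mark(self, X_new, y_new) -> None:
        # Remember the old training data, then retrain on all of it.
        self._X.extend(X_new)
        self._y.extend(y_new)
        self.matcher.fit(numpy.array(self._X), numpy.array(self._y))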

Contributor Author:

OK, in this new version all the sub-learners are stateless (with regard to saving training data) and don't have a mark(), just a stateless fit(). Only DisagreementLearner has a mark() that stores training data.

        self.y: numpy.typing.NDArray[numpy.int_] = numpy.array(y)
        self.X = X
        sklearn.linear_model.LogisticRegression.fit(self, self.X, self.y)
Contributor:

I think it's good to have an explicit call here, but if we do that, we shouldn't also have the class inherit from sklearn.linear_model.LogisticRegression.

Contributor Author:

If I change this to super().fit(self.X, self.y), then mypy complains, because it needs both parents in the inheritance to have the same signature for fit:

    dedupe/labeler.py:84: error: Argument 1 to "fit" of "Learner" has incompatible type "ndarray[Any, dtype[floating[_64Bit]]]"; expected "List[Tuple[Mapping[str, Any], Mapping[str, Any]]]"  [arg-type]

I think we should rename RLRLearner to MatchLearner (it compares records using distances, which is a nice opposition to BlockLearner) and make it wrap LogisticRegression (or RandomForest, or whatever) using composition, not inheritance.
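A minimal sketch of that composition, assuming a data_model object with a distances() method as used elsewhere in this PR (the class body is illustrative, not the PR's final code):

import numpy
import sklearn.linear_model


class MatchLearner:
    def __init__(self, data_model) -> None:
        self.data_model = data_model
        # Owned, not inherited: mypy never has to reconcile two fit() signatures.
        self._classifier = sklearn.linear_model.LogisticRegression()

    def fit(self, pairs, y) -> None:
        # The public API deals in record pairs; distances stay internal.
        distances = self.data_model.distances(pairs)
        self._classifier.fit(distances, numpy.array(y))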

Contributor Author:

See if you like what I did.

        for learner in self.learners:
-            learner.fit_transform(pairs, y)
+            learner.fit(pairs, y)
Contributor:

fit is a bad verb for something that both sets the data on a learner and retrains the learner. I think maybe we want to get rid of fit from the public API and have mark be responsible for all of this.

Contributor Author:

See new version and comment above.

The meaning of these methods is not consistent with how sklearn uses them.

For sklearn, fit_transform(X, y) means fit(X, y).transform(X), but we were using it as fit(transform(X), y). So just rename fit_transform() to fit(), and now it has the correct meaning.
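A small illustration of the mismatch, using a stock sklearn transformer (illustrative only, not dedupe code):

import numpy
from sklearn.preprocessing import StandardScaler

X = numpy.array([[0.0], [1.0], [2.0]])

# sklearn's contract: fit_transform(X) == fit(X).transform(X)
scaler = StandardScaler()
assert (scaler.fit_transform(X) == scaler.fit(X).transform(X)).all()

# The old labeler usage was the opposite order -- transform the raw
# pairs into distances first, then fit on those distances, i.e.
# fit(transform(X), y) -- so the method was really just a fit().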

What's more, the transform method isn't used outside the class, and I don't think it should be. The public API should only deal with TrainingPairs; it shouldn't deal with the calculated distances. Rename it to _distances().

The old fit() method is now just an internal helper method that deals with already-calculated distance data.

It is already done in self._fit().

Now it prints something like `[arg-type]` when it errors, so you can add `# type: ignore[arg-type]` and be specific about the error you are silencing.
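For example (illustrative only), an error-code-specific ignore silences just the one mypy complaint:

# Only the [assignment] error is suppressed; any other mypy error on
# this line would still be reported.
x: int = "not an int"  # type: ignore[assignment]
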
It's not actually used anywhere besides creating the initial set of candidate predicates.

It's only used internally, so make that obvious.

Already part of the Learner base class.

DisagreementLearner is really reaching far down into the contained objects. The lower-level classes themselves should be responsible for this sort of thing. Sure, it makes the code more complicated, but the way it is now fools us into thinking it is simple, and makes it prone to breaking in the future.

We were only testing a very small component of labeler.

Now we actually go through most lines of code. Check out coverage.
- Remove many unused methods of MatchLearner like mark() and pop().
  I think these used to get used when this was the only learner class,
  but now they aren't used anywhere.
- Just set MatchLearner.candidates in the constructor. It makes it way
  easier to reason about.
- Adjust the inheritance. Now DisagreementLearner is out of the
  hierarchy that MatchLearner and BlockLearner are in. This is good,
  because DisagreementLearner OWNS these other two; it is not an "is a"
  relationship.
- Remove the mark() function from the sub-learners. They just have the
  fit() method now, but they don't actually persist this training
  data, which is in line with the naming of fit().
- Make `candidates` a read-only attribute (see the sketch after this
  list). It makes the class easier to reason about; we don't have to
  worry about someone outside the class coming in and changing it.
- Fix a bug in BlockLearner where `remove` never actually removed the
  entry from `candidates`, so if you broke the cache of _cached_scores
  and ended up calling `self._predict(self.candidates)`, you would get
  the result from all of the original candidates.
- In the test, actually check the values of the candidates, not just
  the number of them.
- Rename `_remove` to `remove` in the sub-learners, since they are
  used publicly in DisagreementLearner.
- Remove the unused candidate_scores from the DisagreementLearner
  public API.
- Always make a copy in sample_records() to avoid footguns.
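
A minimal sketch of the read-only `candidates` attribute mentioned above (illustrative shape, not the PR's exact code):

class BlockLearner:
    def __init__(self, candidates: list) -> None:
        self._candidates = candidates

    @property
    def candidates(self) -> list:
        # No setter is defined, so `learner.candidates = [...]` raises
        # AttributeError; callers can read the list but cannot rebind it.
        return self._candidates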
@coveralls commented Jun 18, 2022:

Coverage Status

Coverage increased (+9.5%) to 73.629% when pulling 1212a7b on NickCrews:labeler-rename into c595052 on dedupeio:main.

@NickCrews (Contributor Author) commented Jun 20, 2022:

@fgregg this has gotten way bigger, and it was very much done in an experimental/haphazard way, so some commits undo/modify what previous commits did. I can rebase this in a more sensible way if the large-scale idea looks right. Maybe, though, take a look at the final result and see if this is generally heading in a good direction before I do all that cleanup work. Or I can split this into a PR with the less contentious changes, get that merged, and then take a look at the more structural changes afterwards. Thank you for looking!

@fgregg (Contributor) commented Jun 28, 2022:

Generally, I really like this direction. Sorry for the slow response, I've been out sick.

@fgregg left a comment:

Really liking where this is going. A few comments inline.

dedupe/labeler.py
@@ -138,9 +139,15 @@ def score(predicate: Predicate) -> float:

         return candidates

-    def cover(self, pairs: TrainingExamples) -> Cover:
+    def cover(self, pairs: TrainingExamples, index_predicates: bool = True) -> Cover:
Contributor:

I don't love this change. Is it really ugly to avoid this?

@NickCrews (Contributor Author) commented Jul 9, 2022:

Have you seen the other changes from this commit, and the corresponding commit message? I think the current way that we deal with this, where in DisagreementLearner.learn_predicates() we do self.blocker.block_learner.blocker.predicates.copy(), is a terrible code smell, reaching 3 levels deep. In addition, blocker, block_learner, and blocker are all the same thing. So I wanted to actually create public APIs at each level.

The other problem there is the pattern of "temporarily delete some predicates, do some stuff, and then restore them" which I didn't like.

I agree that this commit adds more lines of code, which I don't like, but what I liked less was the unofficial and haphazard methods we used.

When you say "this change", do you mean "adding index_predicates as an arg to this function"? We could probably adjust this. Or do you mean the larger-scale changes I mention above?

Contributor:

Actually, I'm good with this!

Contributor:

I'm really struggling with this. I think this is better than the code it replaces, but it seems like this filtering should not be the responsibility of the BlockLearner class.

Contributor:

Hmm... what about this design:

class BlockLearner(ABC):
    def learn(self, matches, recall, candidate_types='simple', selected_predicates=None):
        comparison_cover = self.comparison_cover
        if selected_predicates:
            candidate_preds = comparison_cover & selected_predicates
        else:
            candidate_preds = comparison_cover.keys()

        match_cover = self.cover(matches, candidate_preds)
        ...

    def cover(self, pairs, candidate_preds) -> Cover:
        predicate_cover = {}
        for predicate in candidate_preds:
            ...

then this labeler code could just be

    def learn_predicates(
        self, recall: float, selected_predicates
    ) -> tuple[Predicate, ...]:

        learned_preds: tuple[Predicate, ...]
        dupes = [pair for label, pair in zip(self.y, self.pairs) if label]
        learned_preds = self.blocker.block_learner.learn(
            dupes,
            recall=recall,
            candidate_types="random forest",
            selected_predicates=selected_predicates,
        )

and so then it would ultimately be the responsibility of the train method in api.py to filter the predicates and pass them on appropriately.

@NickCrews (Contributor Author) commented Aug 11, 2022:

I'm trying to wrap my head around the general pattern of what we're doing, correct me if I'm wrong. Right now there are three sets of predicates:

  1. all the available predicates from the data model
  2. A subset of 1, all the predicates that the BlockLearners use as candidates.
  3. A subset of 2, the final predicates chosen by the BlockLearners that we use to actually do the blocking in Dedupe.pairs().

Right now the BlockLearners are handed 1, and they filter that down to 2 using the index_predicates flag. Then they do the filtering from 2 to 3.

I agree that this is bad, these concerns should be separated. The filtering from 1 to 2 should happen external to the BlockLearners. The block learners should just get handed set 2, and they select set 3 from that. That is the direction that #1052 is trying to go in.

I'm not sure the 1->2 filtering should happen in train() though. Can't we do this at the very beginning, when the data model is first given to Dedupe? I don't like the comparison_cover & selected_predicates bit; this filtering is happening later than it should. I think we can make it so that training.BlockLearner is only ever handed the selected_predicates at the beginning, because in training.DedupeBlockLearner.__init__ we call

self.blocker = blocking.Fingerprinter(predicates)
self.blocker.index_all(data)

which makes us index ALL the predicates, even if in subsequent calls to learn() or cover() we then filter out the index predicates and don't use them.
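
A rough sketch of that alternative (make_block_learner() and is_index_predicate() are hypothetical helpers, not dedupe's API; blocking.Fingerprinter and index_all are as used above): filter once, up front, so only the predicates that will actually be used get indexed.

def make_block_learner(data, all_predicates, index_predicates: bool):
    if index_predicates:
        selected = list(all_predicates)
    else:
        # Drop index predicates before anything gets indexed.
        selected = [p for p in all_predicates if not is_index_predicate(p)]
    blocker = blocking.Fingerprinter(selected)
    blocker.index_all(data)  # indexes only what learn()/cover() will use
    return blocker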

If you think this is at least an improvement, how about you merge it, and then we can write PRs that go in this even better direction, so we have something concrete to talk about? I'm finding it hard without actually testing things out, but I don't want to have merge conflicts later, so I want main to be stable.

Or, I rebase and remove this commit from this PR and you merge everything else, and we work on this separately.

Contributor Author:

But yes I think your suggestion is better than what I have currently.

Contributor Author:

This conversation is getting too big; should it be its own issue?

Contributor Author:

@fgregg in case you didn't see this

@fgregg (Contributor) commented Jun 28, 2022:

With your refactor, it's pretty apparent that the weighted sampling methods on these classes are a bit odd. We can save that for a future PR though.

Probably, the right thing to do is to create a new object, or honestly, maybe just some new functions that take in a blocker coverage object and return a weighted sample of keys.
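
A rough sketch of such a free function (the name and the coverage shape are hypothetical; this assumes the coverage object maps each blocking key to the pairs it covers):

import random

def weighted_sample_keys(coverage: dict, k: int) -> list:
    """Sample k keys, weighting each key by how many pairs it covers."""
    keys = list(coverage)
    weights = [len(coverage[key]) for key in keys]
    return random.choices(keys, weights=weights, k=k)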

@NickCrews (Contributor Author):

@fgregg I fixed two of your requests, take a look at the final one that I didn't change because I wanted clarification.

@fgregg (Contributor) commented Aug 11, 2022:

Hi @NickCrews, sorry for the slow response!

I responded to you on the cover change.

@fgregg (Contributor) commented Aug 11, 2022:

@benchmark

@github-actions:

All benchmarks (diff):

before after ratio benchmark
530M 529M 1.00 canonical.Canonical.peakmem_run
16.8±0.2s 16.1±0.2s 0.96 canonical.Canonical.time_run
0.971 0.927 0.95 canonical.Canonical.track_precision
0.902 0.911 1.01 canonical.Canonical.track_recall
229M 229M 1.00 canonical_gazetteer.Gazetteer.peakmem_run(None)
14.4±0.2s 14.1±0.05s 0.98 canonical_gazetteer.Gazetteer.time_run(None)
0.964 0.982 1.02 canonical_gazetteer.Gazetteer.track_precision(None)
0.982 0.982 1.00 canonical_gazetteer.Gazetteer.track_recall(None)
229M 229M 1.00 canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'})
229M 229M 1.00 canonical_matching.Matching.peakmem_run({'threshold': 0.5})
12.9±0.05s 12.8±0.02s 0.99 canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'})
13.1±0.1s 13.2±0.07s 1.00 canonical_matching.Matching.time_run({'threshold': 0.5})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'})
0.92 0.911 0.99 canonical_matching.Matching.track_recall({'threshold': 0.5})

(logs)

@fgregg (Contributor) commented Sep 2, 2022:

I'm going to go ahead and bring this in, and we can keep working on improving how we interact with predicates in future PRs.

@fgregg fgregg merged commit 5742efc into dedupeio:main Sep 2, 2022
@NickCrews (Contributor Author):
Sweet, thanks for keeping momentum going @fgregg !
