Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor labeler.py #1065

Merged
merged 29 commits into from Sep 2, 2022
Merged

Refactor labeler.py #1065

merged 29 commits into from Sep 2, 2022

Commits on Jun 17, 2022

  1. Use conventions for fit(), fit_transform(), etc.

    The meaning of these methods is not consistent with how sklearn uses
    them.
    
    For them, fit_transform(X, y) means
    fit(X, y).transform(X), but we were using it as
    fit(transform(X), y). So just rename fit_transform() to fit(), and now
    it has the correct meaning.
    
    What's more, the transform method isn't used external to the class, and
    I don't think it should be. The public API should only deal with
    TrainingPairs, it shouldn't deal with the
    calculated distances. Rename it to _distances()
    
    The old fit() method now is just an internal helper method that deals
    with already calculated distance data.
    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    70f8f31 View commit details
    Browse the repository at this point in the history
  2. Remove redundant setting of self.X, self.y

    It is already done in self._fit()
    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    7f3842b View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    e4e8697 View commit details
    Browse the repository at this point in the history
  4. Fix mypy

    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    56e2685 View commit details
    Browse the repository at this point in the history
  5. Show error codes from mypy runs

    Now it prints something like `[arg-type]` when it errors, so you can add
    `# type: ignore[arg-type]` and be specific
    about the error you are silencing
    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    685a0db View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    cd8782e View commit details
    Browse the repository at this point in the history
  7. Don't store data_model in BlockLearner

    It's not actually used anywhere besides from creating the initial set
    of candidate predicates
    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    d59d538 View commit details
    Browse the repository at this point in the history
  8. Make BlockLearner.predict private

    it's only used internally, so make that obvious
    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    5bad271 View commit details
    Browse the repository at this point in the history
  9. Remove unneeded type hint BlockLearner.candidates

    Already is part of the Learner base class
    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    0b0a117 View commit details
    Browse the repository at this point in the history
  10. Delegate DisagreementLearner.learn_predicates()

    DisagreementLearner is really reaching far down into the contained
    objects. The lower classes themselves should be responsible for this
    sort of thing. Sure, it makes the code more complicated, but the way it
    is is fooling ourselves that it is simple,
    and making it prone to breaking int he future.
    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    3ded0bd View commit details
    Browse the repository at this point in the history
  11. Configuration menu
    Copy the full SHA
    925030b View commit details
    Browse the repository at this point in the history
  12. Configuration menu
    Copy the full SHA
    ced6f4e View commit details
    Browse the repository at this point in the history
  13. Remove fit() from DisagreementLearner API

    It's not actually used anywhere.
    
    See dedupeio#1065 (comment)
    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    e3442ef View commit details
    Browse the repository at this point in the history
  14. Configuration menu
    Copy the full SHA
    5386d6f View commit details
    Browse the repository at this point in the history
  15. Configuration menu
    Copy the full SHA
    62ad7ca View commit details
    Browse the repository at this point in the history
  16. Configuration menu
    Copy the full SHA
    24a2609 View commit details
    Browse the repository at this point in the history
  17. Test more-public interface of labeler

    We were only testing the very small component of labeler.
    
    Now we actually go through most lines of code. Check out coverage.
    NickCrews committed Jun 17, 2022
    Configuration menu
    Copy the full SHA
    d77c118 View commit details
    Browse the repository at this point in the history

Commits on Jun 18, 2022

  1. Configuration menu
    Copy the full SHA
    3e36957 View commit details
    Browse the repository at this point in the history
  2. Overhaul labeler inheritance

    - Remove many unused methods of MatchLearner like mark() and pop().
      I think these used to get used when this was the only learner class,
      but now they aren;t used anywhere.
    - Just set MatchLeaner.candidates in constructor. It makes it way
      easier to reason about.
    - Adjust the inheritance. Now DisagreementLearner is out of the
    heirarchy that MatchLEarner and BlockLearner are in. This is good,
    because DisagreementLEarner OWNS these other two it is not a "is a"
      relationship
    - Remove the mark() function from the sub-learners. They just have the
    fit() method now, but they don't actually persist
      this training data, which is in line with the
      naming of fit().
    - Make `candidates` a RO attribute, makes it easier to reason about, we
      don't have to worry about someone outside of the calss coming in
      and changing it.
    - Fix a bug in BlockLearner where `remove` never actually removed entry from
      `candidates`, so if you broke the cahce of _cached_scores and ended up
      calling `self._predict(self.candidates)`, you would get the result
      from all of the original candidates.
    - In the test, actually check for the values of the candidates, not
      just the number of them.
    - Rename `_remove` to `remove` in the sublearners, since they are
      publicly used in DisagreementLEarner
    - Remove the unused candidate_scores from the DisagreementLearner
      public API
    - Always make a copy in sample_records() to avoid footguns
    NickCrews committed Jun 18, 2022
    Configuration menu
    Copy the full SHA
    1b230ac View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    2605b29 View commit details
    Browse the repository at this point in the history
  4. Fixup: linting

    NickCrews committed Jun 18, 2022
    Configuration menu
    Copy the full SHA
    f9d88ec View commit details
    Browse the repository at this point in the history

Commits on Jul 9, 2022

  1. Configuration menu
    Copy the full SHA
    c21398f View commit details
    Browse the repository at this point in the history
  2. Only have pytest config in pyproject.toml

    The config in setup.cfg is ignored.
    NickCrews committed Jul 9, 2022
    Configuration menu
    Copy the full SHA
    3dc3a96 View commit details
    Browse the repository at this point in the history
  3. Remove unused .coveragerc

    NickCrews committed Jul 9, 2022
    Configuration menu
    Copy the full SHA
    5b330cf View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    81bf3f5 View commit details
    Browse the repository at this point in the history
  5. ValueError if candidate_scores() used before fit()

    Makes it consistent between the MatchLEarner and BlockLearner.
    
    Also fixes a bug where self._fitted was never set to True
    NickCrews committed Jul 9, 2022
    Configuration menu
    Copy the full SHA
    5d3a100 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    83b5293 View commit details
    Browse the repository at this point in the history
  7. Fixup mypy error

    NickCrews committed Jul 9, 2022
    Configuration menu
    Copy the full SHA
    918b824 View commit details
    Browse the repository at this point in the history

Commits on Sep 1, 2022

  1. Configuration menu
    Copy the full SHA
    1212a7b View commit details
    Browse the repository at this point in the history