Reduce responsibilities of DataModel #1098

NickCrews · 2022-09-16T23:39:27Z

No description provided.

coveralls · 2022-09-16T23:44:53Z

Coverage decreased (-0.6%) to 72.76% when pulling 83a2e89 on NickCrews:split-datamodel into 664aa67 on dedupeio:main.

NickCrews · 2022-09-16T23:50:27Z

This is the start of fixing #1088

first, remove responsibilities of DataModel from labeler and core

second, simplify how DataModel works internally

third, prep for making the next changes backwards compatible by versioning the pickling schema better in api.py

@fgregg there's definitely some good stuff in here (the first and second parts), but I'm not sure if the third step is going to be wasted effort. I thought I'd get a review from you at this point, before I sink more time into this, just to make sure I'm headed in the right direction. I can split this PR into three separate PRs too if you want. Thanks for the review!

It's not needed in there, so should just pass what is needed. This is prep for further refactoring of removing datamodel more

We don't use it after the initial construction, so don't store it

We only need the abstract requirement of a Featurizer function.
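As a rough illustration of depending on "a Featurizer function" rather than a whole DataModel, here is a minimal sketch. The `Featurizer` alias, `MatchLearnerSketch`, and `name_mismatch` are hypothetical names made up for this example; they are not dedupe's actual types or signatures.

```python
from typing import Callable, Sequence

# Hypothetical aliases: a featurizer maps record pairs to one feature row per pair.
RecordPair = tuple[dict, dict]
Featurizer = Callable[[Sequence[RecordPair]], list[list[float]]]


class MatchLearnerSketch:
    """Toy learner that only needs a featurizer, not a whole DataModel."""

    def __init__(self, featurizer: Featurizer) -> None:
        self._featurizer = featurizer

    def features(self, pairs: Sequence[RecordPair]) -> list[list[float]]:
        return self._featurizer(pairs)


def name_mismatch(pairs: Sequence[RecordPair]) -> list[list[float]]:
    # 0.0 when the names agree, 1.0 otherwise.
    return [[0.0 if a["name"] == b["name"] else 1.0] for a, b in pairs]


learner = MatchLearnerSketch(name_mismatch)
rows = learner.features(
    [
        ({"name": "ann"}, {"name": "ann"}),
        ({"name": "ann"}, {"name": "bob"}),
    ]
)
```

Any callable with the right shape can be passed in, which also makes the learner easy to exercise in tests without building a data model first.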

It is copied in DedupeDisagreementLearner, so I'm assuming that's a good idea. There's surely no downside that I can see, so might as well DEFINITELY avoid a foot gun

It isn't used publicly anywhere, and it adds more API surface area to worry about. This is still backwards compatible, as self._len is still saved/restored by pickle the same way; now we just access it directly instead of going via __len__().

We iterate through the input more than once, so an Iterable is insufficient.
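A small sketch of why an `Iterable` annotation is too weak when the input is traversed more than once. The function names are hypothetical, and using `Sequence` as the stricter type is an assumption for this example; the commit message does not say which type was chosen.

```python
from typing import Iterable, Sequence


def count_twice_iterable(names: Iterable[str]) -> tuple[int, int]:
    # Two passes over the same argument; a generator is exhausted after the first.
    first = sum(1 for _ in names)
    second = sum(1 for _ in names)
    return first, second


def count_twice_sequence(names: Sequence[str]) -> tuple[int, int]:
    # A Sequence (list, tuple) can safely be iterated repeatedly.
    return len(names), sum(1 for _ in names)


gen_result = count_twice_iterable(n for n in ["name", "address"])
list_result = count_twice_sequence(["name", "address"])
```

With the generator, the second pass silently sees nothing, which is exactly the kind of bug the tighter annotation lets a type checker flag.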

Before, we never checked for this, so when we called list.index() we just got back the first instance. I swapped us over to a lookup table because that makes more sense, but I also wanted to add this check because if someone had duplicate names, the switch would quietly change the behavior underneath them: before it gave the first index, now it gives the last.
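The difference between the two lookups, and the guard against it, can be sketched like this. `index_first` and `build_lookup` are hypothetical helpers for illustration, not the functions in datamodel.py.

```python
def index_first(names: list[str], target: str) -> int:
    # list.index() returns the position of the first match.
    return names.index(target)


def build_lookup(names: list[str]) -> dict[str, int]:
    # A plain dict comprehension would keep the LAST index for a repeated name,
    # so reject duplicates up front instead of silently changing behavior.
    seen: set[str] = set()
    for name in names:
        if name in seen:
            raise ValueError(f"Variable name used more than once: {name!r}")
        seen.add(name)
    return {name: i for i, name in enumerate(names)}


names = ["name", "address", "name"]
first = index_first(names, "name")
last = {n: i for i, n in enumerate(names)}["name"]

try:
    build_lookup(names)
    rejected = False
except ValueError:
    rejected = True
```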

These variables:

1. Can't currently be created, because they inherit from Variable, not FieldVariable, so they don't appear in the datamodel.VARIABLE_CLASSES lookup table.
2. SHOULDN'T be instantiable directly from a variable definition.

Before, the flow was:

1. Create all the primary variables EXCEPT for the interactions.
2. Expand those.
3. Go back through and NOW create the InteractionVariables.

I found this confusing. We parse the var definitions twice in separate places, and InteractionVariables are a special case in the first pass.

The other point of this is to make it so that there is one list of Variable instances which is the single source of truth. Once we do this, we can turn all the other private instance variables of DataModel into @functools.cached_property attributes, which will make it much easier to ensure pickle compatibility in the future, because only one variable needs to get saved and restored.
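A minimal sketch of the "single source of truth plus cached properties" idea. `DataModelSketch`, `variable_names`, and `index_of` are hypothetical stand-ins, not the real DataModel attributes. One detail worth noting: `functools.cached_property` stores its result in the instance `__dict__`, so this sketch filters the pickled state explicitly in `__getstate__`.

```python
import functools
import pickle


class DataModelSketch:
    """Hypothetical stand-in: one stored list, everything else derived."""

    def __init__(self, variable_names: list[str]) -> None:
        self.variable_names = list(variable_names)

    @functools.cached_property
    def index_of(self) -> dict[str, int]:
        # Derived lazily from the single source of truth.
        return {name: i for i, name in enumerate(self.variable_names)}

    def __getstate__(self) -> dict:
        # Persist only the source of truth; cached values are rebuilt on demand.
        return {"variable_names": self.variable_names}

    def __setstate__(self, state: dict) -> None:
        self.variable_names = state["variable_names"]


model = DataModelSketch(["name", "address"])
_ = model.index_of  # populate the cache before pickling
restored = pickle.loads(pickle.dumps(model))
```

Because only `variable_names` is written out, derived attributes can be added, renamed, or removed later without changing what old pickles contain.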

This will make the init method cleaner and allow us to more easily add versioning. _load_settings will read the settings file, and for future versions of the class, will parse and convert the loaded data structures to the canonical in-memory representation that the class expects. See following commits
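One way the versioned loading described above might look, as a sketch only. The payload layout, the `version` key, `CURRENT_VERSION`, and `_upgrade_v1` are all assumptions invented for this example; they do not describe dedupe's real settings file format.

```python
import io
import pickle
from typing import Any, BinaryIO

CURRENT_VERSION = 2  # hypothetical schema version


def _upgrade_v1(payload: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical migration: version 1 stored names under "fields".
    return {"version": 2, "variables": payload["fields"]}


def load_settings(settings_file: BinaryIO) -> dict[str, Any]:
    """Read a settings file and return the canonical in-memory form."""
    payload = pickle.load(settings_file)
    version = payload.get("version", 1)  # assume unversioned files are v1
    if version == 1:
        payload = _upgrade_v1(payload)
    elif version != CURRENT_VERSION:
        raise ValueError(f"Unsupported settings version: {version}")
    return payload


old_file = io.BytesIO(pickle.dumps({"fields": ["name", "address"]}))
settings = load_settings(old_file)
```

Keeping the conversion inside the loader means the constructor only ever sees the current representation.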

Deduplicate the one exception we make. Also now instead of just eating the catchall exception, we raise from it so that you have slightly more insight into its cause
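For reference, "raise from" chaining looks roughly like this. `SettingsLoadError` and `load` are hypothetical names for the sketch, not the exception class or function used in api.py.

```python
import io
import pickle
from typing import Any, BinaryIO


class SettingsLoadError(Exception):
    """Hypothetical error raised when a settings file cannot be read."""


def load(settings_file: BinaryIO) -> Any:
    try:
        return pickle.load(settings_file)
    except Exception as exc:
        # "raise ... from exc" keeps the original error attached as __cause__,
        # so the traceback shows both the wrapper and the underlying failure.
        raise SettingsLoadError("Could not load settings file") from exc


try:
    load(io.BytesIO(b""))  # an empty file cannot be unpickled
    cause = None
except SettingsLoadError as err:
    cause = err.__cause__
```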

fgregg

generally, these all look like good code cleanups. I think it would be good to split into three PRs!

fgregg · 2022-09-20T19:19:11Z

dedupe/datamodel.py


+    only_custom = all(isinstance(v, (CustomType, InteractionType)) for v in variables)


variable name should change. maybe no_blocking_variables

Responded in followup pr #1102

fgregg · 2022-09-20T19:20:26Z

dedupe/datamodel.py

+    return variables
+
+
+def _expand_higher_variables(variables: Iterable[Variable]) -> list[Variable]:


we could probably do away with this method if we had the base variable type have a higher_vars attribute set to an empty list or tuple

no, that won't quite work. it still feels like some of this should maybe be the responsibility of the variable class.

It definitely feels gross, there needs to be some more official API. Trying not to deal with that here.

NickCrews · 2022-09-23T21:40:07Z

Split into #1101 and #1102. I just dropped the 3 commits that adjusted the unpickling, because I wasn't as confident those were going in the right direction.

NickCrews force-pushed the split-datamodel branch from b865bed to c46ffb1 on September 16, 2022 23:47

NickCrews force-pushed the split-datamodel branch from c46ffb1 to 3da6afe on September 16, 2022 23:52

NickCrews added 20 commits September 16, 2022 15:59

ref: don't pass DataModel into BlockLearners

5ff8d22

It's not needed in there, so should just pass what is needed. This is prep for further refactoring of removing datamodel more

ref: don't store DataModel in DisagreementLearner

3f75472

We don't use it after the initial construction, so don't store it

ref: Don't require DataModel in MatchLearner

7aa69ea

We only need the abstract requirement of a Featurizer function.

bugfix:? Copy candidate preds in RecordLinkDisagreementLearner

7e5633a

It is copied in DedupeDisagreementLearner, so I'm assuming that's a good idea. There's surely no downside that I can see, so might as well DEFINITELY avoid a foot gun

ref: Remove use of DataModel from active learners

1d1b871

ref: Don't use DataModel in core

7350aed

ref: remove __len__ from DataModel

3c20417

It isn't used publicly anywhere, and it adds more API surface area to worry about. This is still backwards compatible, as self._len is still saved/restored by pickle the same way; now we just access it directly instead of going via __len__().

ref: make typifying variables more clear

25f91a5

ref: tweak error raising in typify_variables()

b2234e8

ref: Clarify only_custom logic in typify_variables

13b9f50

typ: Fix typing of typify_variables()

78ef7ff

We iterate through the input more than once, so an Iterable is insufficient.

ref: Further rename field to variable in datamodel

77b4192

test: Add more tests for interaction variables

497badc

ref: Move check for empty var def into typify_variables

e169cc2

ref: Add versioning to _load_settings()

241c0a6

ref: improve settings loading exception reporting

83a2e89

Deduplicate the one exception we make. Also now instead of just eating the catchall exception, we raise from it so that you have slightly more insight into its cause

NickCrews force-pushed the split-datamodel branch from 3da6afe to 83a2e89 on September 17, 2022 00:00

fgregg requested changes Sep 20, 2022

This was referenced Sep 23, 2022

Remove usage of DataModel from core.py and labeler.py #1101

Merged

Prep DataModel for removal #1102

Open

NickCrews closed this Sep 23, 2022
