Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce responsibilities of DataModel #1098

Closed
wants to merge 20 commits into from

Conversation

NickCrews
Copy link
Contributor

No description provided.

@coveralls
Copy link

coveralls commented Sep 16, 2022

Coverage Status

Coverage decreased (-0.6%) to 72.76% when pulling 83a2e89 on NickCrews:split-datamodel into 664aa67 on dedupeio:main.

@NickCrews
Copy link
Contributor Author

NickCrews commented Sep 16, 2022

This is the start of fixing #1088

first, remove responsibilities of DataModel from labeler and core

second, simplify how DataModel works internally

third, prep for making it easier to make next changes backwards compatible by making the pickling shcema better versioned in api.py

@fgregg there's some definitely good stuff in here (the first and second stuff), but I'm not sure if the third step is gonna be wasted effort. I thought I'd get a review from you at this point before I sink more time into this just o make sure I'm headed in the right direction. I can split this PR into three separete PRs too if you want. Thanks for the review!

It's not needed in there, so should just pass what is needed.
This is prep for further refactoring of removing datamodel more
We don't use it after the initial construction, so don't store it
We only need the abstract requirement of a Featurizer function.
It is copied in DedupeDisagreementLearner, so I'm assuming that's a
good idea. There's surely no downside that I can see,
so might as well DEFINITELY avoid
a foot gun
It isn't used publicly anywhere,
and it adds more API surface area to worry about. This is still
backwards compatible, as self._len
is still saved/restored
byt pickle the same way, now we just access
it directly instead of going via __len__().
We iterate through the input more than once, so an Iterable
is insufficient.
Before, we never checked for this,
so when we called list.index() we
just got back the first instance.

I swapped us over to a lookup table because that makes more
sense, but I wanted to also add this check because if someone had
duplicate names, this would quitely change the behavior underneath
someone: before it gave the first index, now it gives the last.
These variables
1. you can't currently create them because they inherit from Variable,
not FieldVariable, so they don't appear in the
datamodel.VARIABLE_CLASSES
lookup table
2. You SHOULDN'T be able to instantiate these variables directly
from a variable definition
Before the flow was:
1. create all the primary variables EXCEPT for the interactions
2. Expand those.
3. go back through and NOW create the InteractionVariables

I found this confusing. We parse the var definitions twice in separate
places, and InteractionVariables
are a special case in the first pass.

The other point of this is to make it so that there
is one list of Variable instances
which is the single source of truth. Once we do this
then we can turn all the other private instance variables of DataModel
into @functools.cached_property's,
which will make it much easier to
ensure pickle compatibility in
the future, because
only one variable needs to get
saved and restored
This will make the init method cleaner and allow us to more easily add
versioning.

_load_settings will read the settings file, and for future versions
of the class, will parse and convert
the loaded data structures to
the canonical in-memory
representation that the class expects. See following commits
Deduplicate the one exception we make.

Also now instead of just eating the catchall exception, we raise from it
so that you have slightly more insight into its cause
Copy link
Contributor

@fgregg fgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally, these all look like good code cleanups. I think it would be good to split into three PRs!


only_custom = all(isinstance(v, (CustomType, InteractionType)) for v in variables)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variable name should change. maybe no_blocking_variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded in followup pr #1102

return variables


def _expand_higher_variables(variables: Iterable[Variable]) -> list[Variable]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could probably do away with this method if we had the base variable type have a higher_vars attribute set to an empty list or tuple

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no that won't quite work. it still feels like some of this should maybe be the responsible for the variable class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely feels gross, there needs to be some more official API. Trying not to deal with that here.

@NickCrews
Copy link
Contributor Author

Split into #1101 and #1102. I just dropped the 3 commits that adjusted the unpickling, I wasn't as confident those were going in the right direction.

@NickCrews NickCrews closed this Sep 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants