Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pushdown params to disable time range constraint pushdown #1216

Merged

Conversation

tlento
Copy link
Contributor

@tlento tlento commented May 15, 2024

The predicate pushdown operations for time range constriants currently
rely on None/not-None state to determine whether or not to push the
ConstrainTimeRange filter node to the source. With the switch to
pushdown params this None/not-None state gets tenuous, as future
updates will need to do things like manage time range constraints,
other time dimension filters, and categorical dimension (and entity)
filters, and relying on externalizing None/not-None combinations for
these various filter types will be challenging.

This change encapsulates the enabling and disabling of time range
constraints inside of PredicatePushdownParams. At the moment, in
order to maintain the existing behavior, we simply internalize the
None/not-None behavior for time range constraints. This will also
allow us to easily retain pushdown processing for categorical dimensions
even when time constraint filters should not be applied, and
gradually centralize those controls as we streamline the callsites.

There is added complexity to this change because of two things.

  1. Time constraint updating is scattered around in the DataflowPlanBuilder
  2. Conversion metrics currently do predicate pushdown for time constraints

The first of these will be addressed later.
Since we cannot reliably support general predicate pushdown for conversion
metrics, we need to allow for a time-range-only pushdown operation. Today
this is equivalent to pushdown enabled, but that will change shortly.

Copy link
Contributor Author

tlento commented May 15, 2024

Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@tlento tlento force-pushed the encapsulate-time-range-constraint-for-pushdown branch from 10a9d01 to bfb75ba Compare May 15, 2024 22:47
@tlento tlento force-pushed the use-pushdown-params-for-disabling-time-constraint-pushdown branch from 26dd729 to 9435008 Compare May 15, 2024 22:47
Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good! Left some in-line suggestions to improve readability, but feel free to address those at the top of the stack.

@@ -238,21 +242,30 @@ def _build_aggregated_conversion_node(
constant_properties: Optional[Sequence[ConstantPropertyInput]] = None,
) -> BaseOutput:
"""Builds a node that contains aggregated values of conversions and opportunities."""
# Pushdown parameters vary with conversion metrics due to the way the time joins are applied.
# Due to other outstanding issues with conversion metric fitlers, we disable predicate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fitlers

time_range_constraint: Optional[TimeRangeConstraint]
_PREDICATE_METADATA_KEY = "is_predicate_property"

time_range_constraint: Optional[TimeRangeConstraint] = dataclasses.field(metadata={_PREDICATE_METADATA_KEY: True})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last was pretty hard to understand without diving into the dataclasses docs a bit & running the code to see what the value actually is. Might be helpful to leave a comment to explain that this is not a typical default arg (i.e., you still need to pass a value in for time_range_constraint)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, to be honest I can probably get rid of this and just keep a list of predicate key names. That's what I had originally and I think it's a little less magical overall, but it's still a lot of magic string handling.

if self.pushdown_state is PredicatePushdownState.FULLY_ENABLED:
return

invalid_predicate_property_names = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is this essentially just checking that time_range_constraint is always None when pushdown_state is DISABLED?
If so, this seems pretty complicated & hard to read for just that functionality, and it might be helpful to simplify by just checking that one scenario. But maybe I'm missing something or this gets used more generically down the stack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I definitely should've explained that in the commit message. There will be other parameters added which will also need to be nulled out for disabled. Will update things accordingly.

I'm essentially trying to enforce that when the state is disabled there's no possibility of somebody doing something like if pushdown_params.time_range_constraint or pushdown_params.where_filter_specs and then triggering a pushdown operation. If the disabled state was binary I could just make it a derived property but because of our conversion metric and time window handling I have to deal with things like time range only and categorical only pushdown states.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

)

@property
def is_pushdown_enabled(self) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be helpful to call this is_pushdown_enabled_for_time_range or something more specific so that other devs don't mistake this as an indication that pushdown is FULLY enabled!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That isn't exactly what this does, but to your point inverting it to is_pushdown_disabled might be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a ton of updates and now this is EXACTLY what that does. I also put in the disabled thing but I'll probably remove it. I think I should stop tinkering here and move ahead, though.

@tlento tlento force-pushed the encapsulate-time-range-constraint-for-pushdown branch from bfb75ba to 55ed28e Compare May 16, 2024 21:45
@tlento tlento force-pushed the use-pushdown-params-for-disabling-time-constraint-pushdown branch from 9435008 to 8d77d33 Compare May 16, 2024 21:45
Copy link
Contributor Author

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @courtneyholcomb ! I'm still playing around with the expanded pushdown mechanics so I'll come back to this before merge and either put up another PR or revise this one and request a follow up, depending on what I want to do.

time_range_constraint: Optional[TimeRangeConstraint]
_PREDICATE_METADATA_KEY = "is_predicate_property"

time_range_constraint: Optional[TimeRangeConstraint] = dataclasses.field(metadata={_PREDICATE_METADATA_KEY: True})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, to be honest I can probably get rid of this and just keep a list of predicate key names. That's what I had originally and I think it's a little less magical overall, but it's still a lot of magic string handling.

if self.pushdown_state is PredicatePushdownState.FULLY_ENABLED:
return

invalid_predicate_property_names = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I definitely should've explained that in the commit message. There will be other parameters added which will also need to be nulled out for disabled. Will update things accordingly.

I'm essentially trying to enforce that when the state is disabled there's no possibility of somebody doing something like if pushdown_params.time_range_constraint or pushdown_params.where_filter_specs and then triggering a pushdown operation. If the disabled state was binary I could just make it a derived property but because of our conversion metric and time window handling I have to deal with things like time range only and categorical only pushdown states.

)

@property
def is_pushdown_enabled(self) -> bool:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That isn't exactly what this does, but to your point inverting it to is_pushdown_disabled might be better.

Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -61,7 +61,7 @@ class MultiHopJoinCandidate:
lineage: MultiHopJoinCandidateLineage


class PredicatePushdownState(Enum):
class PushdownPredicateInputType(Enum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change!!

Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good!

@tlento tlento force-pushed the encapsulate-time-range-constraint-for-pushdown branch from 88723e2 to ece1e07 Compare May 23, 2024 23:41
@tlento tlento force-pushed the use-pushdown-params-for-disabling-time-constraint-pushdown branch from 31ee2df to 3a02ea7 Compare May 23, 2024 23:41
Copy link
Contributor Author

tlento commented May 24, 2024

Merge activity

  • May 23, 5:33 PM PDT: @tlento started a stack merge that includes this pull request via Graphite.
  • May 23, 5:41 PM PDT: Graphite rebased this pull request as part of a merge.
  • May 23, 5:46 PM PDT: @tlento merged this pull request with Graphite.

@tlento tlento force-pushed the encapsulate-time-range-constraint-for-pushdown branch from ece1e07 to 7e723bc Compare May 24, 2024 00:34
Base automatically changed from encapsulate-time-range-constraint-for-pushdown to main May 24, 2024 00:40
The predicate pushdown operations for time range constriants currently
rely on None/not-None state to determine whether or not to push the
ConstrainTimeRange filter node to the source. With the switch to
pushdown params this None/not-None state gets tenuous, as future
updates will need to do things like manage time range constraints,
other time dimension filters, and categorical dimension (and entity)
filters, and relying on externalizing None/not-None combinations for
these various filter types will be challenging.

This change encapsulates the enabling and disabling of time range
constraints inside of PredicatePushdownParams. At the moment, in
order to maintain the existing behavior, we simply internalize the
None/not-None behavior for time range constraints. This will also
allow us to easily retain pushdown processing for categorical dimensions
even when time constraint filters should not be applied, and
gradually centralize those controls as we streamline the callsites.

There is added complexity to this change because of two things.

1. Time constraint updating is scattered around in the DataflowPlanBuilder
2. Conversion metrics currently do predicate pushdown for time constraints

The first of these will be addressed later.
Since we cannot reliably support general predicate pushdown for conversion
metrics, we need to allow for a time-range-only pushdown operation. Today
this is equivalent to pushdown enabled, but that will change shortly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants