Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable different start/end partitions defs in TimeWindowPartitionMapping #14449

Conversation

clairelin135
Copy link
Contributor

@clairelin135 clairelin135 commented May 24, 2023

Requested by a user: https://dagster.slack.com/archives/C01U954MEER/p1682972599494859

Sometimes, a downstream time-partitioned asset Z might be dependent on an upstream time-partitioned asset Y that starts later than B. For example:

  • Assets Z and X are partitioned daily, starting 2022-01-01
  • Logic was changed recently to make Z dependent on X and Y, but Y only exists starting on 2022-06-01
  • Z should be able to execute on partitions before 2022-06-01, with Y being passed in as a null value

Selecting one of these partitions with a nonexistent upstream errors within a call to get_upstream_partitions_for_partitions.

This PR enables this behavior by adding an additional raise_error_on_nonexistent_upstream_partition arg to TimeWindowPartitionsDefinition, which when set to True (the default) raises an error when upstream partitions fall outside of the start-end time range of their partitions def.

For the users above in this special case, they can set this bool to False. In this case, no error will be raised and the method will just filter for existent partitions. There are no changes to behavior for when get_downstream_partitions_for_partitions is called.

@clairelin135 clairelin135 force-pushed the claire-05-24/enable-time-window-partition-mapping-different-start-end-times branch from 2dc33af to 6defb07 Compare May 24, 2023 23:35
@@ -298,17 +298,6 @@ def __str__(self) -> str:
)
return partition_def_str

def __eq__(self, other):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests in this PR exposed a bug with this function. We recently added an end param to time window partitions defs, and since end is not included in this function, this function falsely evaluated defs with different end values as equivalent.

I don't think this function is necessary though, since this class is a named tuple so it should already have an __eq__ function that we don't need to maintain like this one.

@clairelin135 clairelin135 marked this pull request as ready for review May 26, 2023 17:56
@sryza
Copy link
Contributor

sryza commented May 26, 2023

This reminds me of the discussion we've been having about a partitioned asset that snapshots an unpartitioned asset.

As I mentioned on #14305 (comment), I think we should try to come up with some vocabulary to talk about the situation where a partition isn't materializable because the upstream data that would be required to compute it isn't present.

In the future, instead of just raising an error, the UI could actually give an indication that this is the case. E.g. if you tried to select those partitions in the backfill dialog, we could prevent you and pop up a tooltip explaining the situation.

I don't think we need to do that now, but I think we should try to find a name that would still make sense if/when we do do it.

@clairelin135 clairelin135 force-pushed the claire-05-24/enable-time-window-partition-mapping-different-start-end-times branch from 6defb07 to dc9318b Compare June 7, 2023 19:45
@sryza
Copy link
Contributor

sryza commented Jun 8, 2023

Btw should we mark this parameter as experimental just to be safe?

@clairelin135 clairelin135 force-pushed the claire-05-24/enable-time-window-partition-mapping-different-start-end-times branch from dc9318b to d06581a Compare June 9, 2023 17:10
@clairelin135 clairelin135 force-pushed the claire-05-24/enable-time-window-partition-mapping-different-start-end-times branch from d06581a to 4f698dd Compare June 9, 2023 17:10
@clairelin135
Copy link
Contributor Author

clairelin135 commented Jun 9, 2023

Removed support for the following edge case:

There is one edge case where this PR does not respect the bool setting. This is when a partitions def of smaller granularity is downstream of a partitions def with greater granularity, i.e. if a hourly partitions def is downstream of a daily partitions def. If current time = 2023-01-01-12:00, hourly partitions exist up until 2023-01-01-12:00, but no daily partition exists yet for 2023-01-01. It would be surprising if one of those hourly partitions could run when the upstream partition does not yet exist, so we still raise an error in this case, since the upstream partition will exist in the future. This case should be pretty unlikely since I can't think of a good reason why someone would want a smaller time partition granularity downstream of a time partitions def with larger granularity.

@sryza would appreciate a second look at this.

Previously this PR raised an error when encountering the edge case above regardless of if allow_nonexistent_upstream_partitions was set, but in the situation described in #14693 this error would occur as some upstream partitions do not yet exist. I think the situation detailed in the discussion is a much more common situation where you wouldn't want an error to be yielded, so I've updated the PR to fully respect the allow_nonexistent_upstream_partitions bool and not raise an error.

If you are in the weird edge case above where an hourly is downstream of a daily, if allow_nonexistent_upstream_partitions is set, you'll just have to set custom logic in your asset to prevent execution for "hanging partitions" if desired.

I also made the new arg experimental.

@clairelin135 clairelin135 requested a review from sryza June 9, 2023 17:16
@clairelin135 clairelin135 merged commit 38fc547 into master Jun 9, 2023
1 check passed
@clairelin135 clairelin135 deleted the claire-05-24/enable-time-window-partition-mapping-different-start-end-times branch June 9, 2023 18:57
clairelin135 added a commit that referenced this pull request Jun 12, 2023
Background here:
https://www.notion.so/dagster/Handle-invalid-parents-in-auto-materialize-5fe6dbdfba0f43a6beae88fb61ccc07d?pvs=4

This PR builds upon #14449 and
introduces graceful handling for auto-materialize cases where an invalid
parent exists. Errors are currently raised in
`PartitionMapping.get_upstream_partitions_for_partitions` when mapped
upstream partitions were nonexistent. It is good that an error occurs
when launching backfills and runs in these cases, but undesirable for
the error to occur in auto-materialization. This PR introduces logic to
gracefully handle invalid parents without erroring in auto-materialize.

**Modifies the auto-materialize logic:**
- Only considers partitions that can be materialized. If a partition has
invalid upstreams, it should not be possible to auto-materialize it.
- Modifies `is_reconciled` to recursively check only valid upstream
partitions. If all of a partition's parents are invalid, then returns
`True` if the partition is materialized, `False` if not. This enables
reconciling downstream assets.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants