Partitioned jobs with partitioned source assets as input #13357
Comments
Thanks for filing this @ryanmeekins. I think we have two options for how to implement this:
The advantage of option 1 is that it doesn't require any new API surface area and is simpler for users. The disadvantage is that it involves some fancy logic under the covers, which users might find unexpected.
This would be an awesome feature. Is it on the near-term roadmap?
👍 this feature request.

Related to this, I am seeing this in the documentation: https://docs.dagster.io/concepts/ops-jobs-graphs/graphs#loading-an-asset-as-an-input

However, the described behaviour is not respected (as of …):

```python
from dagster import (
    DefaultSensorStatus,
    EventLogEntry,
    OpExecutionContext,
    Output,
    RunRequest,
    SensorEvaluationContext,
    asset_sensor,
    job,
    op,
)
from marketing_exporter import assets, configurations, partitions

# Below never gets to execute, because `contacts_to_export` is actually a
# `dict[str, list[dict]]` -- it contains all the partitions
@op
def export_sendgrid_contacts_from_asset(
    context: OpExecutionContext,
    contacts_to_export: list[dict],
    sendgrid_api: SendgridApiResource,
) -> Output[str]:
    print(f"{contacts_to_export=}")
    # Omitted because it crashes before...

@job(
    partitions_def=partitions.partners_with_marketing,
)
def export_to_sendgrid():
    """Export the owners with active pre-offer as SendGrid contacts"""
    export_sendgrid_contacts_from_asset(assets.contacts_to_export.to_source_asset())

@asset_sensor(
    asset_key=assets.contacts_to_export.key,
    job=export_to_sendgrid,
    default_status=DefaultSensorStatus.RUNNING,
)
def observe_for_changes_in_owners_for_marketing(
    context: SensorEvaluationContext, asset_event: EventLogEntry
):
    assert asset_event.dagster_event and asset_event.dagster_event.asset_key
    context.log.info(
        "Detected changes in the asset=%s for partition=%s",
        asset_event.dagster_event.asset_key,
        asset_event.dagster_event.partition,
    )
    for partition_key in partitions.partners_with_marketing.get_partition_keys():
        yield RunRequest(
            partition_key=partition_key,
            run_key=context.cursor,
        )
```
Had to do a git-blame on the documentation to get more context around the implementation and the actual meaning of the doc: #12597. From https://docs.dagster.io/concepts/ops-jobs-graphs/graphs#loading-an-asset-as-an-input:
But by reading this #12597 (comment), it is evident that the documentation is misleading and misrepresents what is actually happening, @sryza. Today's implementation (loading all the partitions into memory at the same time) is not memory-friendly when large assets are returned. In other words, if we have a static partitions definition that is composed … Any feedback on where this issue is? Do you need help implementing it? It looks like …
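The memory concern raised above can be illustrated with a plain-Python sketch (illustrative names, not Dagster API): loading every partition eagerly keeps all payloads resident at once, while iterating yields one partition at a time.

```python
# Illustrative sketch of the memory concern; not Dagster API.

def load_all(partition_keys, load_one):
    """Eager: every partition's payload is in memory at the same time."""
    return {key: load_one(key) for key in partition_keys}

def iter_partitions(partition_keys, load_one):
    """Lazy: yields one (key, payload) pair at a time, so peak memory is
    one partition's worth instead of all of them."""
    for key in partition_keys:
        yield key, load_one(key)

keys = [f"partner_{i}" for i in range(3)]
fake_load = lambda key: [{"partner": key}]  # stand-in for real I/O

eager = load_all(keys, fake_load)
lazy = dict(iter_partitions(keys, fake_load))
print(eager == lazy)  # True -- same data, different peak memory profile
```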
Any update on this? It's not always practical to load all asset partitions into memory.
What's the use case?
I want to use a partitioned source asset as input to a partitioned job, where each job run would execute for one partition and use the corresponding partition of the upstream source asset. You can currently define a `partitions_def` for both a `SourceAsset` and a `@job`; however, there isn't a mapping between the two, so every partition of the `SourceAsset` is read during job execution.

Ideas of implementation

Ideally, the partition mapping would happen automatically when using the same `PartitionsDefinition`, like for partitioned asset jobs defined using `define_asset_job`.

Additional information
No response
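The implementation idea above (automatic mapping when the job and the source asset share the same `PartitionsDefinition`) amounts to an identity partition mapping: a run for partition key K should read only upstream partition K. A minimal plain-Python sketch, with hypothetical names rather than Dagster API:

```python
# Hypothetical sketch of an identity partition mapping; names are
# illustrative, not Dagster API.

PARTITION_KEYS = ["2024-01-01", "2024-01-02", "2024-01-03"]

def upstream_partitions_for_run(run_partition_key, upstream_keys=PARTITION_KEYS):
    """Identity mapping: a run for partition K reads only upstream
    partition K, instead of every partition of the source asset."""
    if run_partition_key not in upstream_keys:
        raise ValueError(f"unknown partition key: {run_partition_key!r}")
    return [run_partition_key]

print(upstream_partitions_for_run("2024-01-02"))  # ['2024-01-02']
```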
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.