RFC: default_asset_attrs #21387

Closed · wants to merge 1 commit

Conversation

@sryza (Contributor) commented Apr 24, 2024

Summary & Motivation

It's a pain to apply the same attribute to a bunch of different assets. As we make auto-materialize policies / scheduling conditions more central, this pain will show up more often.

This PR proposes using a context manager to set asset attributes in bulk. I'm not yet entirely convinced this is a good idea, but I wanted to float it and get reactions.

with default_asset_attrs(
    group_name="marketing", auto_materialize_policy=AutoMaterializePolicy.eager()
):

    @asset
    def asset1(): ...

    @asset
    def asset2(): ...

I just wanted to get the idea across, so, out of laziness, I only implemented a couple of asset attributes. But the idea would be to make this work for at least the following asset attributes (a rough sketch of how the context manager itself could be implemented follows the list):

  • partitions_def
  • tags
  • metadata
  • io_manager_key
  • compute_kind
  • owners
  • group_name
  • auto_materialize_policy
  • backfill_policy
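
To illustrate the shape of the mechanism, here is a minimal sketch (not what this PR actually implements) of how such a context manager could be built on contextvars. The names _DEFAULT_ASSET_ATTRS and current_default_asset_attrs are hypothetical:

from contextlib import contextmanager
from contextvars import ContextVar
from typing import Any, Iterator, Mapping

_DEFAULT_ASSET_ATTRS: ContextVar[Mapping[str, Any]] = ContextVar("default_asset_attrs", default={})

@contextmanager
def default_asset_attrs(**attrs: Any) -> Iterator[None]:
    # Merge with any enclosing defaults so nested context managers compose.
    token = _DEFAULT_ASSET_ATTRS.set({**_DEFAULT_ASSET_ATTRS.get(), **attrs})
    try:
        yield
    finally:
        _DEFAULT_ASSET_ATTRS.reset(token)

def current_default_asset_attrs() -> Mapping[str, Any]:
    # The @asset decorator would read this for any attribute not passed explicitly.
    return _DEFAULT_ASSET_ATTRS.get()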

How I Tested These Changes

@sryza force-pushed the auto-materialize-policy-in-pydantic branch 2 times, most recently from e4debde to 2309594 on April 24, 2024 05:39
@sryza force-pushed the asset-attributes-context-manager branch from e9993ad to 9beb263 on April 24, 2024 05:40
@alangenfeld (Member) left a comment

Mechanically I think this is the best way to do this type of feature. I think this is a nice power feature for power users.

@sryza force-pushed the auto-materialize-policy-in-pydantic branch 3 times, most recently from df8a0a2 to c231010 on April 25, 2024 16:48
Base automatically changed from auto-materialize-policy-in-pydantic to master on April 25, 2024 17:21
@sryza force-pushed the asset-attributes-context-manager branch from 9beb263 to 773314a on April 25, 2024 17:22
@petehunt (Contributor) commented

What does this get us over splatting or functools.partial()?

@sryza (Contributor, Author) commented Apr 25, 2024

What does this get us over splatting or functools.partial()?

Writing these out so we can compare:

splatting:

my_asset_args = dict(group_name="abc", auto_materialize_policy=AutoMaterializePolicy.eager())

@asset(**my_asset_args)
def asset1(): ...

@asset(owners=["bar@foo.com"], **my_asset_args)
def asset2(): ...

Some downsides here:

  • The user needs to name their args dict
  • You lose typing
  • Doesn't work with asset factories
  • If you already have a set of assets and want to do this, you need to go through them one by one to add the args dict
  • Can't merge metadata entries and tags [1]

functools.partial:

import functools

my_asset = functools.partial(asset, group_name="abc", auto_materialize_policy=AutoMaterializePolicy.eager())

@my_asset
def asset1(): ...

@my_asset
def asset2(): ...

Some downsides here:

  • The user needs to name their partial
  • functools.partial + decorators is likely scary for users not familiar with functional programming
  • Doesn't work with asset factories
  • Can't merge metadata entries and tags [1]

[1] merging metadata entries and tags

It would be useful to be able to do this:

with default_asset_attrs(tags={"domain": "marketing"}):

    @asset(tags={"is_pii": "true"})
    def asset1(): ...

    @asset
    def asset2(): ...

assert asset1.tags == {"domain": "marketing", "is_pii": "true"}
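
The merge rule the example implies: tags passed directly to @asset win on key collisions, while defaults from the context fill in the rest. A tiny illustrative sketch (merge_default_tags is a hypothetical helper, not part of this PR):

def merge_default_tags(default_tags, explicit_tags):
    # Context-manager defaults fill in missing keys; decorator-level tags win on collisions.
    return {**(default_tags or {}), **(explicit_tags or {})}

assert merge_default_tags({"domain": "marketing"}, {"is_pii": "true"}) == {
    "domain": "marketing",
    "is_pii": "true",
}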

@petehunt (Contributor) commented Apr 26, 2024

I think this is kinda masking a problem, in that we can't pass partial args around as a first-class value to the @asset decorator and merge the args together in a meaningful way.

My initial reaction is to do something like this:

# We have AssetSpec, which is a bag of all the params to the asset decorator,
# and PartialAssetSpec, which has some subset of them set (i.e. all are optional).
# (We could bikeshed the PartialAssetSpec name a lot, btw. Perhaps it should just be
# AssetSpec)
from dagster import PartialAssetSpec

default_spec = PartialAssetSpec(
    group_name="marketing",
    auto_materialize_policy=AutoMaterializePolicy.eager(),
)

# Create 2 assets that use this spec
@asset(default_spec)
def asset1(): ...

@asset(default_spec)
def asset2(): ...

# Create an asset that merges two PartialAssetSpecs
default_spec = PartialAssetSpec(tags={"domain": "marketing"})

# this `extends` thing does a smart merge of the tags
@asset(PartialAssetSpec(tags={"is_pii": "true"}, extends=[default_spec])) # open to something other than `extends`
def asset1(): ...

@asset(default_spec)
def asset2(): ...

# We can also have a sugared version of this if we treat the args to `@asset` as identical to those of `AssetSpec` or `PartialAssetSpec`

@asset(tags={"is_pii": "true"}, extends=[default_spec])
def asset1(): ...
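
To make the extends merge concrete, here is a hypothetical sketch of how resolution could work (PartialAssetSpec and resolve as written here are illustrative, not an existing Dagster API):

from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Optional, Tuple

@dataclass(frozen=True)
class PartialAssetSpec:
    group_name: Optional[str] = None
    auto_materialize_policy: Optional[Any] = None
    tags: Mapping[str, str] = field(default_factory=dict)
    extends: Tuple["PartialAssetSpec", ...] = ()

    def resolve(self) -> Dict[str, Any]:
        # Resolve parents first (later parents win), then apply this spec's own
        # values on top; dict-valued fields like tags merge instead of replacing.
        resolved: Dict[str, Any] = {"tags": {}}
        for parent in self.extends:
            parent_resolved = parent.resolve()
            resolved["tags"].update(parent_resolved.pop("tags"))
            resolved.update(parent_resolved)
        resolved["tags"] = {**resolved["tags"], **self.tags}
        if self.group_name is not None:
            resolved["group_name"] = self.group_name
        if self.auto_materialize_policy is not None:
            resolved["auto_materialize_policy"] = self.auto_materialize_policy
        return resolved

# The is_pii example above would then resolve as:
marketing_spec = PartialAssetSpec(tags={"domain": "marketing"})
pii_spec = PartialAssetSpec(tags={"is_pii": "true"}, extends=(marketing_spec,))
assert pii_spec.resolve()["tags"] == {"domain": "marketing", "is_pii": "true"}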

I also have some concerns about the nonlocal behavior of this context manager. I think that it could lead to really confusing bugs where an integration author assumes that an asset is getting the default behavior, when it is overridden higher up in the stack. It would be really hard for a user to debug the interaction between the integration code (which they don't see normally) and the contextmanager, which could be far removed from that integration depending on how many layers of abstraction the user has put in place.

Explicitly threading through the AssetSpec removes the nonlocal behavior which I think is pretty important. Less indentation too :)

@sryza (Contributor, Author) commented Apr 27, 2024

I also have some concerns about the nonlocal behavior of this context manager.

Yeah this is definitely a downside of the context manager approach and one of the main things that makes me not fully confident in it.

I think this is kinda masking a problem, in that we can't pass partial args around as a first-class value to the @asset decorator and merge the args together in a meaningful way.

To be a little pedantic, that sounds like a solution more than a problem. Though I agree that it's a solution worth having in play. It addresses the partial metadata issue I raised above. As a lover of functional programming and as someone who's very suspicious of any mutations to global state, I personally find it very appealing.

Main cons that come to mind:

  • As someone who has read countless GitHub discussions and Reddit threads from users who complain that Dagster has too many similar-sounding nouns, I can easily imagine it being a prime target of user confusion.
  • Imagine that someone wants to put all the assets in the below snippet on the same schedule. With the context manager approach, they can just indent that entire block under default_asset_attrs (a sketch of the wrapped version follows the snippet). With a partial asset spec-based approach, they need to visit each definition individually and figure out the right way to pass it through:
@asset
def asset1(): ...

@asset(tags={"foo": "bar"})
def asset2(): ...

@multi_asset(specs=[AssetSpec("asset3"), AssetSpec("asset4")])
def asset3_and_4(): ...

asset5 = my_utils.make_s3_ingest_asset("asset5", ...)
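
For contrast, with the context-manager approach the same block would just be indented (scheduling_policy is used hypothetically here, since this PR only implements a couple of attributes):

with default_asset_attrs(scheduling_policy=AssetSchedulingPolicy.cron("@daily")):

    @asset
    def asset1(): ...

    @asset(tags={"foo": "bar"})
    def asset2(): ...

    @multi_asset(specs=[AssetSpec("asset3"), AssetSpec("asset4")])
    def asset3_and_4(): ...

    asset5 = my_utils.make_s3_ingest_asset("asset5", ...)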

@sryza (Contributor, Author) commented Apr 27, 2024

Backing up a bit, one of the things I've been thinking about while approaching this problem is how this would look in a YAML. No idea on the specifics, but something like one of these seems perhaps the simplest:

assets:
  shared_attrs:
    group: marketing
    schedule: daily    
  list:
  - type: python
    module: a.b
  - type: shell
    command: a.sh

or

context:
  asset_attrs:
    group: marketing
    schedule: daily

assets:
- type: python
  module: a.b
- type: shell
  command: a.sh

The default_asset_attrs approach came out of thinking: what would be the most direct translation of this to Python?

@petehunt (Contributor) commented Apr 27, 2024

Yeah this is definitely a downside of the context manager approach and one of the main things that makes me not fully confident in it.

My initial reaction is that it's a bad idea tbh. These implicit context things tend to only work if they are used sparingly. The more code that depends on this implicit context the harder it gets to reason about what's going on.

To be a little pedantic, that sounds like a solution more than a problem.

I disagree on this one. The asset factory pattern shows up everywhere. We will definitely want a way to pass around subsets of these parameters without copy/pasting long arg lists up and down the callstack.

As someone who has read countless Github discussions and Reddit threads from users who complain that Dagster has too many similar-sounding nouns, I can easily imagine it being a prime target of user confusion.

Agreed but I think that 99% of the time users will be using the last example in my comment, where the AssetSpec class is not actually used. I think users will only need it once they need to customize the behavior or implement their own asset factory, which, to me, is reasonable gradual disclosure of complexity.

With a partial asset spec-based approach, they need to visit each definition individually and figure out the right way to pass it through

I think this is a feature, not a bug. The trade-off is, fundamentally, implicit-with-nonlocal-behavior vs explicit-with-verbosity. I think avoiding nonlocal behavior is more important than avoiding a bit of (expressive IMO) verbosity.

@petehunt (Contributor) commented

The default_asset_attrs approach came out of thinking: what would be the most direct translation of this to Python?

I think it is nice in YAML because it (depending on how you implement it) wouldn't have nonlocal semantics. In Python that's a different story.

Let me show you what I mean. I would imagine the YAML example you provided:

context:
  asset_attrs:
    group: marketing
    schedule: daily

assets:
- type: python
  module: a.b
- type: shell
  command: a.sh

would desugar to something like this:

asset_attrs = PartialAssetSpec(group_name='marketing', ...)

defs = Definitions(assets=[
  *run_plugin(type='python', asset_attrs=asset_attrs, config={'module': 'a.b'}),
  *run_plugin(type='shell', asset_attrs=asset_attrs, config={'command': 'a.sh'}),
])

You could imagine the run_plugin function being defined in some way like this:

from importlib import import_module
from typing import Any, Sequence

def run_plugin(type: str, asset_attrs: PartialAssetSpec, config: dict[str, Any]) -> Sequence[AssetsDefinition]:
  # plugins install themselves in a well-known location
  plugin_module = import_module(f'dagster.yaml.plugins.{type}')
  # plugins follow a protocol where they expose a create_assets() function that returns Sequence[AssetsDefinition]
  return plugin_module.create_assets(asset_attrs, config)

If we did it this way, the user at the YAML layer gets their nice concise config file, and the plugin authors don't need to worry about spooky action at a distance.

@schrockn (Member) commented

I'm pretty skeptical of this. default_asset_attrs is exactly the type of thing I was talking about at the all-hands yesterday: designing for a use case without thinking about primitives. This will be a technical contagion for a bunch of reasons:

  • Most obviously, it only applies to assets. We have many types of definitions, and any such API affordances should apply to them as well, especially as we know we need to improve our information architecture to organize all of our definition types.

  • I also agree with everything that @petehunt said about this being difficult to reason about. For example I think we would very quickly want to maintain a stack of these things for lineage/debugging, rather than just merging them as is implemented here.

  • At the code level, the way this is proposed to interact with AssetsDefinition is super gross, as it interacts with our cursed parallel set of dictionaries very poorly. Now if a default value is in scope it will apply during a copy, which we do all the time. This will be a source of maddening bugs.

  • It is not composable: this adds yet another place we have to update when we add a new property to an asset. There are already too many, and we screw this up all the time. Adding owners has been a long tail of "whoops, I forgot to add it here" style oversights. (An external contributor had to catch an instance of that in Enable "owners" parameter for graph assets #21358!)

@sryza (Contributor, Author) commented Apr 27, 2024

To motivate this a little more, one of the things I'm thinking about is what the Dagster starter example looks like. What we're heading towards is something like the following. While this isn't a mountain of code, a message I get from reading it is "man, Dagster's asset-based approach means I need to copy/paste my schedule in a lot more places than I would with orchestrator X".

daily_schedule = AssetSchedulingPolicy.cron("@daily")

@asset(scheduling_policy=daily_schedule)
def asset1(): ...

@asset(deps=[asset1], scheduling_policy=daily_schedule)
def asset2(): ...

@asset(deps=[asset1, asset2], scheduling_policy=daily_schedule)
def asset3(): ...

(Btw this is another potential solution to this problem: https://github.com/dagster-io/internal/discussions/8216)

@petehunt (Contributor) commented Apr 27, 2024

While this isn't a mountain of code, a message I get from reading it is "man, Dagster's asset-based approach means I need to copy/paste my schedule in a lot more places than I would with orchestrator X".

I am willing to be convinced that this is a problem, but I am not there yet. When React came out, it didn't have any of the dynamic scoping or two-way binding stuff that competitors had. This made it much more verbose than alternatives (and it still is), since everything had to be passed and handled explicitly. (Side note: at the time I felt like this was a big problem and shipped a userland feature to address it that no one ended up using.)

The explicitness had the nice side effect of making everything introspectable, explainable and user-customizable.

With that said, I think some with_* function - perhaps assets_with_defaults()? - would be a nice solution here as you have referenced.

daily_schedule = AssetSchedulingPolicy.cron("@daily")

@asset(scheduling_policy=daily_schedule)
def asset1(): ...

@asset(deps=[asset1], scheduling_policy=daily_schedule)
def asset2(): ...

@asset(deps=[asset1, asset2], scheduling_policy=daily_schedule)
def asset3(): ...

defs = Definitions(assets=[asset1, asset2, asset3])

Could be expressed as:

default_spec = PartialAssetSpec(scheduling_policy=AssetSchedulingPolicy.cron("@daily"))

defs = Definitions(
  assets=assets_with_defaults([asset1, asset2, asset3], default_spec)
)

And then we'd offer a more sugared version:

defs = Definitions(
  assets=assets_with_defaults(
    [asset1, asset2, asset3],
    scheduling_policy=AssetSchedulingPolicy.cron('@daily'),
  ),
)

There are probably other ways we could do this too but I think this is a nice conservative version that is more of a two-way door.
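
A hypothetical sketch of what assets_with_defaults could do internally (copy_with stands in for whatever mechanism would rebuild an AssetsDefinition with new attribute values; none of these names are an existing Dagster API):

from typing import Any, Optional, Sequence

def assets_with_defaults(
    assets: Sequence[Any],
    default_spec: Optional[Any] = None,
    **default_attrs: Any,
) -> Sequence[Any]:
    # Combine defaults from the spec object (if given) with any keyword overrides.
    defaults = dict(default_spec.resolve()) if default_spec is not None else {}
    defaults.update(default_attrs)
    updated = []
    for assets_def in assets:
        # Only fill in attributes the asset did not set explicitly.
        overrides = {
            name: value
            for name, value in defaults.items()
            if getattr(assets_def, name, None) is None
        }
        updated.append(assets_def.copy_with(**overrides) if overrides else assets_def)
    return updated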

@schrockn (Member) commented

FWIW I think the default scheduling behavior should be for an asset to be included in an explicit cron schedule upstream but that is a separate topic and discussion.

@sryza (Contributor, Author) commented Apr 27, 2024

Going to close this for now on account of:

  • generally negative reception
  • not personally positive enough about it to champion it

I think the next step here is to write up a fuller problem statement and a slate of options.

@sryza sryza closed this Apr 27, 2024
@sryza sryza mentioned this pull request May 22, 2024
sryza added a commit that referenced this pull request May 23, 2024
## Summary & Motivation

There are diverse situations where it's useful to be able to pass around
collections of definitions. Doing this in Dagster right now is quite
awkward.

Relevant situations:
- Writing a factory that returns a few assets, a few checks, and a job
that executes them together.
- A submodule defines a few assets, a job, and a schedule. A different
submodule does the same. They should all be in the same code location.
- #21387 (comment)
- https://github.com/dagster-io/internal/discussions/8216

Unlike the [prior version of this
PR](#21746), this PR does
_not_ turn `Definitions` into a "dumb" data class. I.e. it still
validates that all assets and jobs can be bound, at construction time.

## How I Tested These Changes
danielgafni pushed a commit to danielgafni/dagster that referenced this pull request Jun 18, 2024