
Enable efficient loading of AssetsDefinitions from slow or unreliable systems #9761

Merged
@OwenKephart merged 38 commits into master from owen/explore-lazy2 on Oct 4, 2022

Conversation

@OwenKephart (Contributor) commented Sep 21, 2022

Summary & Motivation

There are many cases in which we would like to load AssetsDefinition objects from external services, rather than from Python code or other files (like a manifest.json) checked into the git repo.

In these cases, loading the definitions may take a significant amount of time, or involve a certain amount of risk (e.g., an external API may fail to respond). Incurring this startup time and risk on every single step launched from a repository containing such definitions is not acceptable in most scenarios.

This PR adds a new way to create AssetsDefinitions (CacheableAssetsDefinition), which works as follows:

  1. When at least one CacheableAssetsDefinition object is placed inside an @repository-decorated function, a PendingRepositoryDefinition object will be created when the module containing this repository is loaded, rather than a RepositoryDefinition.

  2. The CacheableAssetsDefinition object contains two methods. First, a method for generating AssetsDefinitionCacheableData; this method can contact external services, kick off long-running processes, and so on. AssetsDefinitionCacheableData is a serializable object which packages up this data in a convenient format. Second, a method for taking that AssetsDefinitionCacheableData and returning AssetsDefinition objects; this method should generally not do any heavy processing, as it will need to be called in every subprocess (see the sketch after this list).

  3. When loading a repository from a code pointer, you may optionally supply a RepositoryLoadData object. This object will be passed into the PendingRepositoryDefinition's reconstruct_repository_definition function, which will skip calling any compute_cacheable_data methods, since we already have access to all the cacheable data we need. If you do not supply a RepositoryLoadData object, we will call the compute_repository_definition function, which calls compute_cacheable_data for each CacheableAssetsDefinition, then passes the resulting data to the build_definitions function to get AssetsDefinitions. This collection of AssetsDefinitions (plus whatever regular definitions existed in the @repository-decorated function) is used to create the actual RepositoryDefinition object.

  4. When spinning up the gRPC server for the first time, we will have no persistent data to fall back on, so all of the cacheable data will need to be refreshed. However, in processes that need to launch runs from this server, we serialize the already-computed RepositoryLoadData and attach it to the ExecutionPlanSnapshot. When creating the DagsterRun from this ExecutionPlanSnapshot, we add a flag on that object indicating whether any repository load data exists (to avoid having to fetch that information when it doesn't, a fairly small optimization).

  5. Inside core_execute_run, when we first load the ReconstructablePipeline's definition, we check whether any repository load data is available to us; if there is, we grab the execution plan snapshot and pull the repository load data off of it. Then we generate a new ReconstructablePipeline with this data baked into it, such that calling recon_pipeline.get_definition() will in turn call recon_repository.get_definition(), which will pass in that RepositoryLoadData object when calling PendingRepositoryDefinition.resolve().
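To make the shape of this API concrete, here is a minimal sketch of a CacheableAssetsDefinition subclass. This is an illustration, not code from this PR: the external service call is faked, the class and helper names (MyExternalAssets, _build_table_asset) are made up, and the private import path reflects where these classes eventually landed, so signatures may differ across versions.

from typing import Sequence

from dagster import AssetsDefinition, asset
from dagster._core.definitions.cacheable_assets import (
    AssetsDefinitionCacheableData,
    CacheableAssetsDefinition,
)


class MyExternalAssets(CacheableAssetsDefinition):
    """Hypothetical: builds one asset per table reported by an external service."""

    def __init__(self):
        super().__init__(unique_id="my_external_assets")

    def compute_cacheable_data(self) -> Sequence[AssetsDefinitionCacheableData]:
        # Expensive/risky work lives here: it runs once when the repository is
        # first loaded, and its serializable result is reused in subprocesses.
        table_names = ["users", "orders"]  # stand-in for an external API call
        return [
            AssetsDefinitionCacheableData(extra_metadata={"table_names": table_names})
        ]

    def build_definitions(
        self, data: Sequence[AssetsDefinitionCacheableData]
    ) -> Sequence[AssetsDefinition]:
        # Cheap work only: this runs in every subprocess, from the cached data.
        return [
            self._build_table_asset(name)
            for entry in data
            for name in entry.extra_metadata["table_names"]
        ]

    def _build_table_asset(self, table_name: str) -> AssetsDefinition:
        @asset(name=table_name)
        def _table_asset():
            ...  # placeholder: materialize the external table

        return _table_asset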

How I Tested These Changes

Unit tests in several places, as well as manually testing this workflow in dagit.

Notes

This system could be useful for things other than AssetsDefinitions. Some users load definitions and configuration from internal databases, which could probably benefit from the same treatment. This is not a current priority.


@OwenKephart marked this pull request as ready for review September 27, 2022 21:13
@alangenfeld (Member) left a comment

Did you test cloud yet as well?

Should we be concerned about someone who has a Python test that interacts with a @repository -> RepositoryDefinition directly, which ends up switching to PendingRepositoryDefinition when a definition that's pulled in changes? I guess they'd just call the "build from scratch" method.

I think this is getting close. By its nature, this deserves a lot of tests if we want to ensure the optimized load path is correctly triggered across the diversity of execution paths we support.

I think it is worth dialing in the naming of this stuff a bit. "metadata" especially is very vague. "cacheable" is OK, but I'm hopeful we can do better. Given the very specific intent of this feature, I think more specific naming would be ideal. One idea is hydrate/rehydrate, borrowing from React's use of that language, though it is only vaguely like that.

@OwenKephart (Contributor, Author)

@alangenfeld yeah definitely interested in improving the naming. I made a couple of minor adjustments, so we now have:

  • CachedAssetsData
  • CacheableAssetsDefinition
  • RepositoryLoadData

I feel pretty good about RepositoryLoadData (it's data that is used to load the repository, and we may add other sorts of things to this object in the future that are not just related to cached assets, so it's fairly flexible).

As for the asset stuff, I do think "caching" is a pretty accurate description of what's going on here (although I'm open to other options). Rather than having to construct this data in every subprocess, we can just refer to the cached information we have. While hydrate/rehydrate is accurate(ish) as well, imo the point of this feature is its caching properties.

I still don't really like the first two names, though. It's not really a "CacheableAssetsDefinition"; it's really a "cache-compatible AssetsDefinition generator". Maybe "CacheableAssetsDefinitionsGenerator" is OK in the short term (we don't have to serialize this thing).

"CachedAssetsData" could also be:

  • CachedAssetsDefinitionData
  • CachedAssetsDefinition
  • AssetsDefinitionData

@alangenfeld (Member) left a comment

"intermediate representation" comes to mind but is quite a mouthful and AssetsIR is probably too terse.

I don't like "cached" since the tense isn't quite right to me. Changing the data piece to "cachable" i think would be better. CachableData CachableForm CachableRepresntation IntermediateRepresentation IR

Comment on lines 12 to 27
@whitelist_for_serdes
class CachedAssetsData(
    NamedTuple(
        "_AssetsDefinitionMetadata",
        [
            ("keys_by_input_name", Optional[Mapping[str, AssetKey]]),
            ("keys_by_output_name", Optional[Mapping[str, AssetKey]]),
            ("internal_asset_deps", Optional[Mapping[str, AbstractSet[AssetKey]]]),
            ("group_name", Optional[str]),
            ("metadata_by_output_name", Optional[Mapping[str, MetadataUserInput]]),
            ("key_prefix", Optional[CoercibleToAssetKeyPrefix]),
            ("can_subset", bool),
            ("extra_metadata", Optional[Mapping[Any, Any]]),
        ],
    )
):
Member commented:

Do we expect users to produce these, or just us / library authors? Are you confident this data structure is right for this? Changing this can be a pain since we are persisting it. Having it in the execution plan snapshots means we will be loading this any time someone looks at the associated run in dagit.

@OwenKephart (Contributor, Author) replied:
I think there will be some advanced users that will want to use this behavior (essentially for home-grown libraries), but for the most part we'll be the ones creating these things.

I'm confident that this is a reasonably accurate structure (it mirrors the arguments of the non-experimental AssetsDefinition.from_op), but I'm definitely not 100% confident in it.

@alangenfeld (Member) left a comment

I remain mildly concerned about how this complexity will age, but the utility is pretty high here, so hopefully it all works out.

🙏

@OwenKephart merged commit 91007ba into master Oct 4, 2022
@OwenKephart deleted the owen/explore-lazy2 branch October 4, 2022 21:42
@OwenKephart (Contributor, Author)

@alangenfeld 🙏

@simonvanderveldt (Contributor)

@OwenKephart @alangenfeld Small question about this change: I was wondering why this is enabled by default, when most people probably just have their repository/Python modules locally and don't need any remote/network actions to instantiate their jobs/assets. This PR seems to add useful functionality, but I'd expect only for a minority of use cases. Or were there performance issues before this PR, even with only local Python modules?

I'm asking because we automatically generate our assets (also only from local code/data), and this caching change prevented changes from being picked up automatically (we've found a workaround in the meantime).

@OwenKephart (Contributor, Author)

Hi @simonvanderveldt!

This is a purely additive feature; it shouldn't interfere with any of your custom code. If you don't manually create a CacheableAssetsDefinition, the repository loading behavior is unchanged. Even if you do add a CacheableAssetsDefinition to your repository (e.g. if you are using the Airbyte integration), the loading behavior for the non-cacheable assets is unchanged.

Functionally, loading a repository in a separate process (i.e. to execute your code) still requires running all of the relevant code to generate that repository, so any changes you make should be reflected in executed steps. Is that not the behavior you're seeing?

The one place where definitions stick around is the gRPC server, which serves Dagit representations of the assets in your repository (among other things). It will persist that representation until you hit the "Reload definitions" button, or otherwise bring the server down and back up. However, this works the same regardless of the cacheable/non-cacheable asset distinction.
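To illustrate, here is a minimal sketch of such a mixed repository. It reuses the hypothetical MyExternalAssets class from the sketch in the PR description above; none of this is code from the PR itself.

from dagster import asset, repository


@asset
def regular_asset():
    ...


@repository
def my_repo():
    # regular_asset is rebuilt from code on every load, exactly as before this
    # PR; only MyExternalAssets participates in the caching flow. Because a
    # CacheableAssetsDefinition is present, this evaluates to a
    # PendingRepositoryDefinition rather than a RepositoryDefinition.
    return [regular_asset, MyExternalAssets()]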

@simonvanderveldt (Contributor) commented Jan 19, 2023

@OwenKephart Thanks for the additional explanation! The reason I posted the question was that we were using @repository before, and that no longer worked: we no longer saw our changes reflected in Dagit. We managed to work around this by subclassing RepositoryData, which isn't cached. I'm not sure whether that was the intention? Semantically we'd prefer to keep using @repository; having to implement all the implementation details of RepositoryData ourselves is quite annoying and means more work for every Dagster upgrade we do, especially compared to the situation before, where everything just worked.
