
Enable efficient loading of AssetsDefinitions from slow or unreliable systems #9761

Merged
@OwenKephart merged 38 commits into master from owen/explore-lazy2 on Oct 4, 2022

Conversation

@OwenKephart (Contributor) commented Sep 21, 2022

Summary & Motivation

There are many cases in which we would like to load AssetsDefinition objects from external services, rather than from Python code or other files (like a manifest.json) checked into the git repo.

In these cases, loading the definitions may take a significant amount of time, or involve a certain amount of risk (e.g., an external API may fail to respond). Incurring this startup time and risk on every single step launched from a repository containing such definitions is not acceptable in most scenarios.

This PR adds a new way to create AssetsDefinitions (CacheableAssetsDefinition), which works as follows:

  1. When at least one CacheableAssetsDefinition object is placed inside an @repository-decorated function, a PendingRepositoryDefinition object will be created when the module containing this repository is loaded, rather than a RepositoryDefinition.

  2. The CacheableAssetsDefinition object contains two methods. First, a method for generating AssetsDefinitionCacheableData; this method can contact external services, kick off long-running processes, and so on. AssetsDefinitionCacheableData is a serializable object which packages up this data in a convenient format. Second, a method for taking that AssetsDefinitionCacheableData and returning AssetsDefinition objects; this method should generally not do any heavy processing, as it will need to be called in every subprocess (see the sketch after this list).

  3. When loading a repository from a code pointer, you may optionally supply a RepositoryLoadData object. This object will be passed into the PendingRepositoryDefinition's reconstruct_repository_definition function, which will skip calling any compute_cacheable_data methods, since we already have access to all the cacheable data we need. If you do not supply a RepositoryLoadData object, we will call the compute_repository_definition function, which calls compute_cacheable_data for each CacheableAssetsDefinition, then passes the resulting data to the build_definitions function to get AssetsDefinitions. This collection of AssetsDefinitions (plus whatever regular definitions existed in the @repository-decorated function) is used to create the actual RepositoryDefinition object.

  4. When spinning up the gRPC server for the first time, we will have no persistent data to fall back on, so all of the cacheable data will need to be refreshed. However, in processes that need to launch runs from this server, we serialize the already-computed RepositoryLoadData and attach it to the ExecutionPlanSnapshot. When creating the DagsterRun from this ExecutionPlanSnapshot, we add a flag on that object indicating whether any repository load data exists (to avoid having to fetch that information when it doesn't, a fairly small optimization).

  5. Inside core_execute_run, when we first load the ReconstructablePipeline's definition, we check whether any repository load data is available to us; if there is, we grab the execution plan snapshot and pull the repository load data off of it. Then we generate a new ReconstructablePipeline with this data baked into it, such that calling recon_pipeline.get_definition() will in turn call recon_repository.get_definition(), which will pass in that RepositoryLoadData object when calling PendingRepositoryDefinition.resolve().
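To make the shape of this API concrete, here is a minimal sketch of a CacheableAssetsDefinition subclass. This is an illustration, not code from this PR: the external service call is faked, the class and helper names (MyExternalAssets, _build_table_asset) are made up, and the private import path reflects where these classes eventually landed, so signatures may differ across versions.

from typing import Sequence

from dagster import AssetsDefinition, asset
from dagster._core.definitions.cacheable_assets import (
    AssetsDefinitionCacheableData,
    CacheableAssetsDefinition,
)


class MyExternalAssets(CacheableAssetsDefinition):
    """Hypothetical: builds one asset per table reported by an external service."""

    def __init__(self):
        super().__init__(unique_id="my_external_assets")

    def compute_cacheable_data(self) -> Sequence[AssetsDefinitionCacheableData]:
        # Expensive/risky work lives here: it runs once when the repository is
        # first loaded, and its serializable result is reused in subprocesses.
        table_names = ["users", "orders"]  # stand-in for an external API call
        return [
            AssetsDefinitionCacheableData(extra_metadata={"table_names": table_names})
        ]

    def build_definitions(
        self, data: Sequence[AssetsDefinitionCacheableData]
    ) -> Sequence[AssetsDefinition]:
        # Cheap work only: this runs in every subprocess, from the cached data.
        return [
            self._build_table_asset(name)
            for entry in data
            for name in entry.extra_metadata["table_names"]
        ]

    def _build_table_asset(self, table_name: str) -> AssetsDefinition:
        @asset(name=table_name)
        def _table_asset():
            ...  # placeholder: materialize the external table

        return _table_asset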

How I Tested These Changes

Unit tests in several places, as well as manually testing this workflow in dagit.

Notes

This system could be useful for things other than AssetsDefinitions. Some users load definitions and configuration from internal databases, which could probably benefit from the same treatment. This is not a current priority.


@OwenKephart marked this pull request as ready for review September 27, 2022 21:13
@alangenfeld (Member) left a comment

Did you test cloud yet as well?

Should we be concerned about someone who has a Python test that interacts with a @repository -> RepositoryDefinition directly, which ends up switching to PendingRepositoryDefinition when a definition that's pulled in changes? I guess they'd just call the "build from scratch" method.

I think this is getting close. By its nature, this deserves a lot of tests if we want to ensure the optimized load path is correctly triggered across the diversity of execution paths we support.

I think it is worth dialing in the naming of this stuff a bit. "metadata" especially is very vague. "cacheable" is OK, but I'm hopeful we can do better. Given the very specific intent of this feature, I think more specific naming would be ideal. One idea is hydrate/rehydrate, borrowing from React's use of that language, though it is only vaguely like that.

@OwenKephart (Contributor, Author)

@alangenfeld yeah definitely interested in improving the naming. I made a couple of minor adjustments, so we now have:

  • CachedAssetsData
  • CacheableAssetsDefinition
  • RepositoryLoadData

I feel pretty good about RepositoryLoadData (it's data that is used to load the repository, and we may add other sorts of things to this object in the future that are not just related to cached assets, so it's fairly flexible).

As for the asset stuff, I do think "caching" is a pretty accurate description of what's going on here (although I'm open to other options). Rather than having to construct this data in every subprocess, we can just refer to the cached information we have. While hydrate/rehydrate is accurate(ish) as well, imo the point of this feature is its caching properties.

I still don't really like the first two names, though. It's not really a "CacheableAssetsDefinition"; it's really a "cache-compatible AssetsDefinition generator". Maybe "CacheableAssetsDefinitionsGenerator" is OK in the short term (we don't have to serialize this thing).

"CachedAssetsData" could also be:

  • CachedAssetsDefinitionData
  • CachedAssetsDefinition
  • AssetsDefinitionData

@alangenfeld (Member) left a comment

"intermediate representation" comes to mind but is quite a mouthful and AssetsIR is probably too terse.

I don't like "cached" since the tense isn't quite right to me. Changing the data piece to "cachable" i think would be better. CachableData CachableForm CachableRepresntation IntermediateRepresentation IR

Comment on lines 12 to 27
@whitelist_for_serdes
class CachedAssetsData(
    NamedTuple(
        "_AssetsDefinitionMetadata",
        [
            ("keys_by_input_name", Optional[Mapping[str, AssetKey]]),
            ("keys_by_output_name", Optional[Mapping[str, AssetKey]]),
            ("internal_asset_deps", Optional[Mapping[str, AbstractSet[AssetKey]]]),
            ("group_name", Optional[str]),
            ("metadata_by_output_name", Optional[Mapping[str, MetadataUserInput]]),
            ("key_prefix", Optional[CoercibleToAssetKeyPrefix]),
            ("can_subset", bool),
            ("extra_metadata", Optional[Mapping[Any, Any]]),
        ],
    )
):
Member commented:

Do we expect users to produce these, or just us / library authors? Are you confident this data structure is right for this? Changing this can be a pain since we are persisting it. Having it in the execution plan snapshots means we will be loading this any time someone looks at the associated run in dagit.

@OwenKephart (Contributor, Author) replied:
I think there will be some advanced users that will want to use this behavior (essentially for home-grown libraries), but for the most part we'll be the ones creating these things.

I'm confident that this is a reasonably accurate structure (it mirrors the arguments of the non-experimental AssetsDefinition.from_op), but I'm definitely not 100% confident in it.

@alangenfeld (Member) left a comment

I remain mildly concerned about how this complexity will age, but the utility is pretty high here, so hopefully it all works out.

🙏

@OwenKephart merged commit 91007ba into master Oct 4, 2022
@OwenKephart deleted the owen/explore-lazy2 branch October 4, 2022 21:42
@OwenKephart (Contributor, Author)

@alangenfeld 🙏

@simonvanderveldt (Contributor)

@OwenKephart @alangenfeld Small question about this change: I was wondering why this is enabled by default, when most people probably just have their repository/Python modules locally and don't need any remote/network actions to instantiate their jobs/assets. This PR seems to add useful functionality, but I'd expect only for a minority of use cases. Or were there performance issues before this PR, even with only local Python modules?

I'm asking because we automatically generate our assets (also only from local code/data), and this caching change prevented changes from being picked up automatically (we've found a workaround in the meantime).

@OwenKephart (Contributor, Author)

Hi @simonvanderveldt!

This is a purely additive feature; it shouldn't interfere with any of your custom code. If you don't manually create a CacheableAssetsDefinition, the repository loading behavior is unchanged. Even if you do add a CacheableAssetsDefinition to your repository (e.g. if you are using the Airbyte integration), the loading behavior for the non-cacheable assets is unchanged.

Functionally, loading a repository in a separate process (i.e. to execute your code) still requires running all of the relevant code to generate that repository, so any changes you make should be reflected in executed steps. Is that not the behavior you're seeing?

The one place where definitions stick around is the gRPC server, which serves Dagit representations of the assets in your repository (among other things). It will persist that representation until you hit the "Reload definitions" button, or otherwise bring the server down and back up. However, this works the same regardless of the cacheable/non-cacheable asset distinction.
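To illustrate, here is a minimal sketch of such a mixed repository. It reuses the hypothetical MyExternalAssets class from the sketch in the PR description above; none of this is code from the PR itself.

from dagster import asset, repository


@asset
def regular_asset():
    ...


@repository
def my_repo():
    # regular_asset is rebuilt from code on every load, exactly as before this
    # PR; only MyExternalAssets participates in the caching flow. Because a
    # CacheableAssetsDefinition is present, this evaluates to a
    # PendingRepositoryDefinition rather than a RepositoryDefinition.
    return [regular_asset, MyExternalAssets()]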

@simonvanderveldt (Contributor) commented Jan 19, 2023

@OwenKephart Thanks for the additional explanation! The reason I posted the question was that we were using @repository before, and that no longer worked: we no longer saw our changes reflected in Dagit. We managed to work around this by subclassing RepositoryData, which isn't cached. I'm not sure whether that was the intention? Semantically we'd prefer to keep using @repository; having to implement all the implementation details of RepositoryData ourselves is quite annoying and means more work for every Dagster upgrade we do, especially compared to the situation before, where everything just worked.
