
[CT-1905] [Spike] Get model "catalog" info after building, and fire in an event #6732

Closed
Tracked by #8316
jtcohen6 opened this issue Jan 25, 2023 · 2 comments
Labels
Impact: CA, Impact: Orch, logging, performance, Refinement (Maintainer input needed), Team:Adapters (Issues designated for the adapter area of the code)

Comments


jtcohen6 commented Jan 25, 2023

copying from #5325 (comment)

Let's gather catalog info about the relations produced by the materialization, as soon as it finishes building them (here). Materializations already return the set of relations they create or update, for the purposes of updating dbt's cache. Why not share the wealth with programmatic consumers of dbt metadata?

(Will serializing Relation objects be an absolute nightmare? Relation objects can be reimplemented per adapter, of course, though they all inherit from BaseRelation, which should be serializable. Even so, we may not want all of the object's attributes included in the logging event; probably just a subset.)
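For illustration, here's a minimal sketch of what that subset might look like, picking off just the location and type fields discussed below. The names RelationBuiltInfo and relation_built_payload are hypothetical, not existing dbt-core APIs:

```python
# Hypothetical sketch: pull only the loggable subset of fields off a
# BaseRelation-like object, rather than serializing the whole thing.
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RelationBuiltInfo:
    database: Optional[str]
    schema: str
    identifier: str
    relation_type: Optional[str]  # "table", "view", etc.


def relation_built_payload(relation) -> dict:
    """Extract the subset of relation attributes worth putting in an event."""
    return asdict(
        RelationBuiltInfo(
            database=relation.database,
            schema=relation.schema,
            identifier=relation.identifier,
            relation_type=str(relation.type) if relation.type else None,
        )
    )
```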

For now, the only really valuable information on the relation object is its database location (database.schema.identifier) and relation type (view, table, etc.). However, I could see two additions that would make this very valuable, and for which this logging lays the necessary groundwork:

  • Additional fields like columns (with data types) and table statistics (maybe even column-level stats, too), i.e. the same basic contract as CatalogInfo (see the sketch after this list)
  • Steps at the end of each materialization that describe the just-built relation, to populate those fields, which will then be logged out once the materialization completes
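To make that first bullet concrete, here's a hedged sketch of how the payload above might grow toward parity with the per-node entries that docs generate collects today: columns with data types, plus a bag of table statistics. ColumnInfo and RelationCatalogInfo are illustrative names, not an existing dbt-core contract:

```python
# Hypothetical extension of the payload above, roughly mirroring the
# information catalog.json carries per node: columns and table statistics.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnInfo:
    name: str
    data_type: str
    index: Optional[int] = None


@dataclass
class RelationCatalogInfo:
    database: Optional[str]
    schema: str
    identifier: str
    relation_type: Optional[str]
    columns: List[ColumnInfo] = field(default_factory=list)
    stats: Dict[str, object] = field(default_factory=dict)  # e.g. {"row_count": 123, "bytes": 456}
```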

Put it all together, and we'll be able to provide closer-to-real-time access to catalog info, rather than trying to grab it all in one big memory-intensive batch during docs generate.

jtcohen6 added the Team:Execution, Team:Adapters (Issues designated for the adapter area of the code), and logging labels on Jan 25, 2023
github-actions bot changed the title from Get model "catalog" info after building, and fire in an event to [CT-1905] Get model "catalog" info after building, and fire in an event on Jan 25, 2023
jtcohen6 commented

I'm going to queue this up for estimation by the Core-Execution team. I'm leaving the Team:Adapters label on as well, because this will require follow-up work (& testing) in adapter plugins.


jtcohen6 commented Feb 10, 2023

We talked a lot about this during BLG! Thanks @peterallenwebb @colin-rogers-dbt for chatting through it.

Implementation:

  • We need some SQL to get this information, which we should wrap in macros (querying information_schema or running describe statements)
  • Option A: Run those macros within the materialization, return that information along with the relations to update in the cache, and fire an event in our task code. Downside here: while those queries are running, downstream models are still blocked from starting, which isn't technically necessary.
  • Option B: "Sidecar" thread. After the model materialization finishes, kick off downstream nodes, and spin up this side-thread. Since these are "read-only" queries, there isn't (shouldn't be) any need to block other queries/nodes while they're running. The SQL should still be defined in macros (user space), but we can call them from Python via adapter.execute_macro.

Option B sounds compelling!
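A rough, purely illustrative sketch of the sidecar idea follows. The macro name and the fire_event callback are hypothetical; adapter.execute_macro and adapter.connection_named are the existing adapter hooks I'd expect to lean on, but the exact wiring would still need to be worked out:

```python
# Rough sketch of the "sidecar" approach (Option B): run the read-only
# catalog query on a side thread so downstream nodes aren't blocked.
import threading


def fire_catalog_event_async(adapter, relation, fire_event):
    def _describe_and_fire():
        # Keep the SQL in a macro so adapters/users can override it, then
        # call it from Python.
        with adapter.connection_named(f"catalog: {relation}"):
            catalog_info = adapter.execute_macro(
                "get_relation_catalog_info",  # hypothetical macro
                kwargs={"relation": relation},
            )
        fire_event(catalog_info)

    # Daemon thread: don't hold up the rest of the run on metadata collection.
    thread = threading.Thread(target=_describe_and_fire, daemon=True)
    thread.start()
    return thread
```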

As far as UX goes, the per-node configuration options could be (see the sketch after this list):

  • (default) Grab metadata fields from system tables / info schema at parity with current "catalog" queries used by docs generate
  • Do less (skip those queries for this node)
  • Do more: profiling queries (min/max for numeric fields, top 5 distinct values for string fields, ...) à la dbt_profiler
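To make those options concrete, a small sketch of how the task code might branch on such a setting. The config name, its values, and the macro names are all hypothetical, not an existing dbt config:

```python
# Hypothetical per-node setting controlling how much catalog info to collect.
from enum import Enum
from typing import Optional


class CatalogMode(str, Enum):
    DEFAULT = "default"  # info-schema queries at parity with `docs generate`
    SKIP = "skip"        # do less: no catalog queries for this node
    PROFILE = "profile"  # do more: profiling queries a la dbt_profiler


def catalog_macro_for(mode: CatalogMode) -> Optional[str]:
    """Pick which (hypothetical) macro the sidecar thread should run."""
    if mode is CatalogMode.SKIP:
        return None
    if mode is CatalogMode.PROFILE:
        return "get_relation_profile_info"
    return "get_relation_catalog_info"
```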

If/when this is enabled, we might still consider writing catalog.json at the end of a run. (It would only include the selected nodes... implicitly similar to #6014)

@Fleid @nathaniel-may Given that this goes across our teams, let's chat about the right next steps for continuing to scope this work & solidify the proposed implementation.

jtcohen6 added the Refinement (Maintainer input needed) label on Feb 10, 2023
peterallenwebb changed the title from [CT-1905] Get model "catalog" info after building, and fire in an event to [CT-1905] [Spike] Get model "catalog" info after building, and fire in an event on Aug 15, 2023
graciegoheen closed this as not planned on Nov 16, 2023