
[CT-1905] [Spike] Get model "catalog" info after building, and fire in an event #6732

Closed
Tracked by #8316
jtcohen6 opened this issue Jan 25, 2023 · 2 comments
Labels
Impact: CA, Impact: Orch, logging, performance, Refinement (Maintainer input needed), Team:Adapters (Issues designated for the adapter area of the code)

Comments


jtcohen6 commented Jan 25, 2023

copying from #5325 (comment)

Let's gather catalog info about the relations produced by the materialization, as soon as it finishes building them (here). Materializations already return the set of relations they create or update, for the purposes of updating dbt's cache. Why not share the wealth with programmatic consumers of dbt metadata?

(Will serializing Relation objects be an absolute nightmare? Relation objects can be reimplemented per adapter, of course, though they all inherit from BaseRelation, which should be serializable. Even so, we may not want all of the object's attributes included in the logging event; probably just a subset.)
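For illustration, here's a minimal sketch of what that subset might look like, picking off just the location and type fields discussed below. The names RelationBuiltInfo and relation_built_payload are hypothetical, not existing dbt-core APIs:

```python
# Hypothetical sketch: pull only the loggable subset of fields off a
# BaseRelation-like object, rather than serializing the whole thing.
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RelationBuiltInfo:
    database: Optional[str]
    schema: str
    identifier: str
    relation_type: Optional[str]  # "table", "view", etc.


def relation_built_payload(relation) -> dict:
    """Extract the subset of relation attributes worth putting in an event."""
    return asdict(
        RelationBuiltInfo(
            database=relation.database,
            schema=relation.schema,
            identifier=relation.identifier,
            relation_type=str(relation.type) if relation.type else None,
        )
    )
```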

For now, the only really valuable information on the relation object is its database location (database.schema.identifier) and relation type (view, table, etc.). However, I could see two additions that would make this very valuable, and for which this logging lays the necessary groundwork:

  • Additional fields like columns (with data types) and table statistics (maybe even column-level stats, too), i.e. the same basic contract as CatalogInfo (see the sketch after this list)
  • Steps at the end of each materialization that describe the just-built relation, to populate those fields, which will then be logged out once the materialization completes
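To make that first bullet concrete, here's a hedged sketch of how the payload above might grow toward parity with the per-node entries that docs generate collects today: columns with data types, plus a bag of table statistics. ColumnInfo and RelationCatalogInfo are illustrative names, not an existing dbt-core contract:

```python
# Hypothetical extension of the payload above, roughly mirroring the
# information catalog.json carries per node: columns and table statistics.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnInfo:
    name: str
    data_type: str
    index: Optional[int] = None


@dataclass
class RelationCatalogInfo:
    database: Optional[str]
    schema: str
    identifier: str
    relation_type: Optional[str]
    columns: List[ColumnInfo] = field(default_factory=list)
    stats: Dict[str, object] = field(default_factory=dict)  # e.g. {"row_count": 123, "bytes": 456}
```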

Put it all together, and we'll be able to provide closer-to-real-time access to catalog info, rather than trying to grab it all in one big memory-intensive batch during docs generate.

jtcohen6 added the Team:Execution, Team:Adapters (Issues designated for the adapter area of the code), and logging labels on Jan 25, 2023
github-actions bot changed the title from Get model "catalog" info after building, and fire in an event to [CT-1905] Get model "catalog" info after building, and fire in an event on Jan 25, 2023
jtcohen6 commented

I'm going to queue this up for estimation by the Core-Execution team. I'm leaving the Team:Adapters label on as well, because this will require follow-up work (& testing) in adapter plugins.


jtcohen6 commented Feb 10, 2023

We talked a lot about this during BLG! Thanks @peterallenwebb @colin-rogers-dbt for chatting through it.

Implementation:

  • We need some SQL to get this information, which we should wrap in macros (querying information_schema or running describe statements)
  • Option A: Run those macros within the materialization, return that information along with the relations to update in the cache, and fire an event in our task code. Downside here: while those queries are running, downstream models are still blocked from starting, which isn't technically necessary.
  • Option B: "Sidecar" thread. After the model materialization finishes, kick off downstream nodes, and spin up this side-thread. Since these are "read-only" queries, there isn't (shouldn't be) any need to block other queries/nodes while they're running. The SQL should still be defined in macros (user space), but we can call them from Python via adapter.execute_macro.

Option B sounds compelling!
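A rough, purely illustrative sketch of the sidecar idea follows. The macro name and the fire_event callback are hypothetical; adapter.execute_macro and adapter.connection_named are the existing adapter hooks I'd expect to lean on, but the exact wiring would still need to be worked out:

```python
# Rough sketch of the "sidecar" approach (Option B): run the read-only
# catalog query on a side thread so downstream nodes aren't blocked.
import threading


def fire_catalog_event_async(adapter, relation, fire_event):
    def _describe_and_fire():
        # Keep the SQL in a macro so adapters/users can override it, then
        # call it from Python.
        with adapter.connection_named(f"catalog: {relation}"):
            catalog_info = adapter.execute_macro(
                "get_relation_catalog_info",  # hypothetical macro
                kwargs={"relation": relation},
            )
        fire_event(catalog_info)

    # Daemon thread: don't hold up the rest of the run on metadata collection.
    thread = threading.Thread(target=_describe_and_fire, daemon=True)
    thread.start()
    return thread
```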

As far as UX goes, the per-node configuration options could be (see the sketch after this list):

  • (default) Grab metadata fields from system tables / info schema at parity with current "catalog" queries used by docs generate
  • Do less (skip those queries for this node)
  • Do more: profiling queries (min/max for numeric fields, top 5 distinct values for string fields, ...) à la dbt_profiler
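To make those options concrete, a small sketch of how the task code might branch on such a setting. The config name, its values, and the macro names are all hypothetical, not an existing dbt config:

```python
# Hypothetical per-node setting controlling how much catalog info to collect.
from enum import Enum
from typing import Optional


class CatalogMode(str, Enum):
    DEFAULT = "default"  # info-schema queries at parity with `docs generate`
    SKIP = "skip"        # do less: no catalog queries for this node
    PROFILE = "profile"  # do more: profiling queries a la dbt_profiler


def catalog_macro_for(mode: CatalogMode) -> Optional[str]:
    """Pick which (hypothetical) macro the sidecar thread should run."""
    if mode is CatalogMode.SKIP:
        return None
    if mode is CatalogMode.PROFILE:
        return "get_relation_profile_info"
    return "get_relation_catalog_info"
```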

If/when this is enabled, we might still consider writing catalog.json at the end of a run. (It would only include the selected nodes... implicitly similar to #6014)

@Fleid @nathaniel-may Given that this goes across our teams, let's chat about the right next steps for continuing to scope this work & solidify the proposed implementation.

jtcohen6 added the Refinement (Maintainer input needed) label on Feb 10, 2023
peterallenwebb changed the title from [CT-1905] Get model "catalog" info after building, and fire in an event to [CT-1905] [Spike] Get model "catalog" info after building, and fire in an event on Aug 15, 2023
graciegoheen closed this as not planned on Nov 16, 2023