Feature: Configurable, arbitrary key in `schema.yml` to enhance documentation #1362

maggiehays · 2019-03-21T18:36:29Z

Feature: Configurable, arbitrary key in `schema.yml` to enhance documentation

Feature description

Following up to this slack conversation with @drewbanin.

Request is to add additional functionality in column-level documentation, specifically to identify upstream dependencies and transformation logic.

The example I used follows:

We’re building out our warehouse on top of Ruby-application generated data, so we’re frequently coalescing across polymorphic objects to summarize a type of event. a fun (...?) challenge is dealing with inconsistent naming conventions across the objects and normalizing them in our ETL.

Here’s a made-up example:

select
  orders.order_date
  , orders.item_type
  , coalesce(jackets.jacket_name, hats.name, shoes.shoe_style) as item_name
  , coalesce(jackets.color, hats.product_colorway, shoes.material_color) as item_color
  , count(*) as order_count
from orders
left join jackets
  on orders.item_fk = jackets.pk
  and orders.item_type = 'jacket'
left join hats
  on orders.item_fk = hats.pk
  and orders.item_type = 'hat'
left join shoes
  on orders.item_fk = shoes.pk
  and orders.item_type = 'shoe'
group by 1,2,3,4

We need to be able to communicate to end-users that the columns item_name and item_color are extracted from one of 3 objects, depending on the value of item_type

An additional use case of the column-level source/target lineage is communicating with our engineering partners which exact columns our tables depend on so they’re aware of downstream impact of altering/dropping those columns in the object upstream

I wonder if there’s any way to capture it in the sheama yaml? following on that made-up example earlier:

- name: agg_orders
  description: daily count of orders by item type, item name, and item color
  columns:
      - name: order_date
      - name: item_type
      - name: item_name
        depends:
            - name: jackets.jacket_name
            - name: hats.name
            - name: shoes.shoe_style
        transformation: coalesce(jackets.jacket_name, hats.name, shoes.shoe_style)
      - name: item_color
        depends:
            - name: jackets.color
            - name: hats.product_colorway
            - name: shoes.material_color
        transformation: coalesce(jackets.color, hats.product_colorway, shoes.material_color)
      - name: order_count

Drew suggested "some sort of data: key that can contain arbitrary configs, then some sort of macro that’s responsible for translating it into documentation".

Who will this benefit?

Owners of dbt warehouse

single location to document table- and column-level details, attributes, and lineage instead of referencing docs blocks
streamlined process to identify upstream/downstream dependencies (tough to enforce consistent formatatting using markdown/docs)

Consumers of dbt warehouse

visibility into logical transformation associated with business logic
versioning to track historical changes in business logic/transformation logic

Uptream of dbt warehouse; owners of production data

visibility into downstream dependencies

The text was updated successfully, but these errors were encountered:

drewbanin · 2019-03-26T14:57:26Z

Hey @maggiehays - thanks for the feature request! I'd need to think about this some more to understand what our options even are here, but happy to leave this open for reference.

This isn't currently prioritized, and my recommendation would be to encode these dependencies in text as the descriptions of the columns until we figure out if this is possible / how it would work.

This might become easier in our Wilt Chamberlain release when we support macros inside of docs blocks. You could conceivably make a macro that returns a templated documentation string indicating the column-level provenance of your models.

I'll also say, lines like:

transformation: coalesce(jackets.jacket_name, hats.name, shoes.shoe_style)

tend to "drift" from the source code, and it might be worth leveraging the "source SQL" functionality in dbt docs to understand the transformation logic in your models. Thanks again!

clrcrl · 2019-05-21T23:33:52Z

I think this issue could be useful for slightly different use cases.
For example, we know that in larger orgs, model ownership becomes important, so it would be great to add keys like owner: to model schemas.

brandfocus · 2019-06-24T07:36:19Z

We would use this to add the Slack handle for table owners

arkady-emelyanov · 2019-07-29T19:42:59Z

Why not broad the idea to both schema.yml and sources.yml, but in a more granular level?
Labels could be rendered in documenation and could be used by external scripts.

Something like this:
schema.yml

models:
  - name: my_model
    description: model descriptions
+   labels:
+      - "my_model custom tag"

    columns:
       - name: column_name
         description: colum description
+        labels:
+          - "column custom tag"
+          - "{{ my_column_helper_value('column_name') }}"

sources.yml

sources:
   - name: source_name
     description: source description
+    labels:
+      - "source associated tag"
     tables:
       - name: source_table
         description: source table description
+        labels:
+          - "source table tag"

drewbanin · 2019-07-30T00:31:55Z

Thanks @arkady-emelyanov! Do you think a data: dictionary with arbitrary key/values would serve your needs? Something like:

models:
  - name: my_model
    description: "abc123"
    data:
      labels:
        - abc 123
        - def 456

The same would apply to model columns & source tables/columns. I imagine we could include these data dicts in the auto-generated documentation (maybe as a table? unsure). Finally, we could render them out in compiled artifacts like manifest.json.

I don't know that you'd be able to select models/sources using these fields -- does that sound acceptable given your use case?

arkady-emelyanov · 2019-07-30T09:11:45Z

@drewbanin Yes, data: fits our case definitely.

Our case is simple, build some automation (external scripts) around sources and schema definitions. We can easily parse definitions, but there are no way to provide some metadata/context to script about entity (model, model column, source, source table, etc..). And most important part: maintain those metadata fields as part of model/source development flow.

Representation of those data: fields in auto-generated documentation: I would agree, just render them as a table in extra tab, for example.

tayloramurphy · 2019-12-03T00:47:56Z

Is the scope of this issue to allow for a data: key in schema.yml and source.yml that would allow for arbitrary data? I assume since it's marked as a good first issue it's a fairly small change? I might try to take a crack at it. I'll update in coming weeks if I'm able to 🙂

drewbanin · 2019-12-10T19:11:01Z

@tayloramurphy yep! that's exactly it :)

As long as that data: dict gets written out in the manifest.json file, we can render the included values out in the auto-generated dbt documentation.

tayloramurphy · 2019-12-17T23:12:25Z

I was able to pick this up at work - I should have a PR in tomorrow. I'm so excited!

clrcrl · 2019-12-18T18:09:38Z

I have some reservations here about the key being named data:. I think it lends itself to people thinking "oh that's the table name that I get the data from", idk.

Does attributes make sense here instead? Maybe properties? Do we want to convey that these are additional_properties? I'll let @drewbanin decide :)

tayloramurphy · 2019-12-18T19:28:53Z

@clrcrl fair point. I could see people putting in actual column data as if it were a seed file! 👀

extra is the only other one I like. I like keeping it succinct since we'll be adding this everywhere. data would still be my first choice though alongside good documentation 😄

clrcrl · 2019-12-18T20:07:58Z

I prefer readability over succinctness! My vote is currently for additional_properties! (You can always set up a snippet to type fewer letters, right?)

drewbanin · 2019-12-18T21:00:45Z

@clrcrl that's a good thought - I can imagine adding other fields to the schema.yml spec in the future. I think we do want to pick a name that's broad enough to represent the catch-all nature of this field, while still specific enough that it's sensible alongside other configs. What do you all think about meta? I think that's a good compromise between succinctness and specificity. Metadata implies that the provided information isn't used directly IMO.

tayloramurphy · 2019-12-18T21:51:25Z

meta works for me! Just let me know if that's what y'all want and I'll update the PR.

drewbanin · 2019-12-19T02:15:37Z

Yep, let's roll with meta - thanks :)

drewbanin · 2020-01-07T15:49:58Z

closed by #2015

drewbanin added enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors! labels Jun 23, 2019

drewbanin mentioned this issue Jul 29, 2019

Allow metadata/labels key for entities in schema.yml and sources.yml #1635

Closed

drewbanin mentioned this issue Nov 17, 2019

Enable tagging of sources, not just models #1906

Closed

tayloramurphy mentioned this issue Dec 18, 2019

Handle "meta:" key in schema/sources.yml #2015

Merged

drewbanin added this to the Barbara Gittings milestone Jan 7, 2020

drewbanin closed this as completed Jan 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Configurable, arbitrary key in `schema.yml` to enhance documentation #1362

Feature: Configurable, arbitrary key in `schema.yml` to enhance documentation #1362

maggiehays commented Mar 21, 2019

drewbanin commented Mar 26, 2019

clrcrl commented May 21, 2019

brandfocus commented Jun 24, 2019

arkady-emelyanov commented Jul 29, 2019 •

edited

drewbanin commented Jul 30, 2019

arkady-emelyanov commented Jul 30, 2019

tayloramurphy commented Dec 3, 2019

drewbanin commented Dec 10, 2019

tayloramurphy commented Dec 17, 2019

clrcrl commented Dec 18, 2019 •

edited

tayloramurphy commented Dec 18, 2019

clrcrl commented Dec 18, 2019 •

edited

drewbanin commented Dec 18, 2019

tayloramurphy commented Dec 18, 2019

drewbanin commented Dec 19, 2019

drewbanin commented Jan 7, 2020

Feature: Configurable, arbitrary key in schema.yml to enhance documentation #1362

Feature: Configurable, arbitrary key in schema.yml to enhance documentation #1362

Comments

maggiehays commented Mar 21, 2019