[CT-2959] [Feature] when --output=json for dbt list and dbt show, have additional structured data available inside the ListCmdOut data object #8358

Open
3 tasks done
graciegoheen opened this issue Aug 10, 2023 · 4 comments
Labels
enhancement New feature or request Impact: Exp Impact: Orch list related to the dbt list command show related to the dbt show command

Comments

@graciegoheen
Contributor

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Originally from @davidharting via slack

Howdy core team! :waveboi:
I wanted to provide some feedback about the ListCmdOutMsg event.
When you run dbt ls with --output=json, the JSON output is stringified and then placed in both info.msg and data.msg.
Both of those fields are strings, so that makes sense.
However, when --output=json, I would love to have additional structured data available inside the ListCmdOut data object.
Right now, when we pull information out of these events we first parse the message itself to JSON, then we pull msg string out and parse that as JSON as well.

@peterallenwebb noted some challenges with this

It definitely would be more convenient to have it be unstringified JSON. Unfortunately, that convenience is in tension with our structured event architecture, which requires the JSON to match a rigorously defined schema defined in protobuf.

Describe alternatives you've considered

Parse the message itself as JSON, then pull the msg string out and parse that as JSON as well.

Who will this benefit?

People who want a well-defined schema for preview table json and list output json

Are you interested in contributing this feature?

No response

Anything else?

No response

@graciegoheen graciegoheen added enhancement New feature or request triage labels Aug 10, 2023
@github-actions github-actions bot changed the title [Feature] when --output=json for dbt list and dbt show, have additional structured data available inside the ListCmdOut data object [CT-2959] [Feature] when --output=json for dbt list and dbt show, have additional structured data available inside the ListCmdOut data object Aug 10, 2023
@davidharting

It's also worth mentioning that we use the same approach for dbt-core events for dbt show with --output=json. We stringify json and use that to represent the table of structured data.

It is working for us! But structured data is nice where we can have it.

@dbeatty10
Contributor

I don't yet understand the nuances of either the end goal or the intermediate steps here, but my curiosity is piqued about both.

Doing some experimentation:

rm -f logs/dbt.log
dbt --quiet --log-format-file json list --output json > dbt-list.json  

This will create JSON output in two different files:

  1. dbt-list.json
  2. logs/dbt.log

Would having separate files each containing valid non-stringified JSON be useful for you @davidharting?
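To make the redirect-to-file idea concrete, here is a minimal sketch of reading such a file back in Python. It assumes the raw (not pretty-printed) output holds one JSON object per line; `parse_json_lines` and the sample string are hypothetical names and data for illustration, not part of dbt.

```python
import json


def parse_json_lines(lines):
    """Decode an iterable of text lines, each holding one JSON object."""
    nodes = []
    for line in lines:
        line = line.strip()
        if line:  # ignore blank lines
            nodes.append(json.loads(line))
    return nodes


# Hypothetical, heavily trimmed stand-in for the contents of dbt-list.json.
sample = (
    '{"unique_id": "model.my_project.abc.v1", "resource_type": "model"}\n'
    '{"unique_id": "seed.my_project.my_seed", "resource_type": "seed"}\n'
)
nodes = parse_json_lines(sample.splitlines())
print([n["unique_id"] for n in nodes])
```

With a real file, an open file handle can be passed in place of `sample.splitlines()`, since iterating a text file yields its lines.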


For human-readability, I like using jq to pretty-print the resulting file output:

cat dbt-list.json  | jq . > dbt-list.pp.json
cat logs/dbt.log | jq . > dbt.pp.log

dbt-list.pp.json

{
  "name": "abc",
  "resource_type": "model",
  "package_name": "my_project",
  "original_file_path": "models/abc_v1.sql",
  "unique_id": "model.my_project.abc.v1",
  "alias": "abc_v1",
  "config": {
    "enabled": true,
    "alias": null,
    "schema": null,
    "database": null,
    "tags": [],
    "meta": {},
    "group": null,
    "materialized": "view",
    "incremental_strategy": null,
    "persist_docs": {},
    "quoting": {},
    "column_types": {},
    "full_refresh": null,
    "unique_key": null,
    "on_schema_change": "ignore",
    "grants": {},
    "packages": [],
    "docs": {
      "show": true,
      "node_color": null
    },
    "contract": {
      "enforced": true
    },
    "post-hook": [],
    "pre-hook": []
  },
  "tags": [],
  "depends_on": {
    "macros": [],
    "nodes": []
  }
}
{
  "name": "abc",
  "resource_type": "model",
  "package_name": "my_project",
  "original_file_path": "models/abc_v2.sql",
  "unique_id": "model.my_project.abc.v2",
  "alias": "abc_v2",
  "config": {
    "enabled": true,
    "alias": null,
    "schema": null,
    "database": null,
    "tags": [],
    "meta": {},
    "group": null,
    "materialized": "view",
    "incremental_strategy": null,
    "persist_docs": {},
    "quoting": {},
    "column_types": {},
    "full_refresh": null,
    "unique_key": null,
    "on_schema_change": "ignore",
    "grants": {},
    "packages": [],
    "docs": {
      "show": true,
      "node_color": null
    },
    "contract": {
      "enforced": true
    },
    "post-hook": [],
    "pre-hook": []
  },
  "tags": [],
  "depends_on": {
    "macros": [],
    "nodes": [
      "model.my_project.abc.v1"
    ]
  }
}
{
  "name": "my_seed",
  "resource_type": "seed",
  "package_name": "my_project",
  "original_file_path": "seeds/my_seed.csv",
  "unique_id": "seed.my_project.my_seed",
  "alias": "my_seed",
  "config": {
    "enabled": true,
    "alias": null,
    "schema": null,
    "database": null,
    "tags": [],
    "meta": {},
    "group": null,
    "materialized": "seed",
    "incremental_strategy": null,
    "persist_docs": {},
    "quoting": {},
    "column_types": {},
    "full_refresh": null,
    "unique_key": null,
    "on_schema_change": "ignore",
    "grants": {},
    "packages": [],
    "docs": {
      "show": true,
      "node_color": "purple"
    },
    "contract": {
      "enforced": false
    },
    "quote_columns": null,
    "post-hook": [],
    "pre-hook": []
  },
  "tags": [],
  "depends_on": {
    "macros": []
  }
}

dbt.pp.log

{
  "data": {
    "log_version": 3,
    "version": "=1.5.1"
  },
  "info": {
    "category": "",
    "code": "A001",
    "extra": {},
    "invocation_id": "5a461268-907b-4f7d-84ca-0f17d333ebc0",
    "level": "info",
    "msg": "Running with dbt=1.5.1",
    "name": "MainReportVersion",
    "pid": 58578,
    "thread": "MainThread",
    "ts": "2023-08-11T15:17:40.092948Z"
  }
}
{
  "data": {
    "args": {
      "cache_selected_only": "False",
      "debug": "False",
      "fail_fast": "False",
      "indirect_selection": "eager",
      "introspect": "True",
      "log_cache_events": "False",
      "log_format": "default",
      "log_path": "/Users/dbeatty/projects/copier-templates/duckdb-docs-440/logs",
      "no_print": "None",
      "partial_parse": "True",
      "printer_width": "80",
      "profiles_dir": "/Users/dbeatty/projects/copier-templates/duckdb-docs-440",
      "quiet": "True",
      "send_anonymous_usage_stats": "False",
      "static_parser": "True",
      "target_path": "None",
      "use_colors": "True",
      "use_experimental_parser": "False",
      "version_check": "True",
      "warn_error": "None",
      "warn_error_options": "WarnErrorOptions(include=[], exclude=[])",
      "write_json": "True"
    }
  },
  "info": {
    "category": "",
    "code": "A002",
    "extra": {},
    "invocation_id": "5a461268-907b-4f7d-84ca-0f17d333ebc0",
    "level": "debug",
    "msg": "running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'True', 'cache_selected_only': 'False', 'profiles_dir': '/Users/dbeatty/projects/copier-templates/duckdb-docs-440', 'version_check': 'True', 'debug': 'False', 'log_path': '/Users/dbeatty/projects/copier-templates/duckdb-docs-440/logs', 'fail_fast': 'False', 'warn_error': 'None', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'True', 'log_format': 'default', 'static_parser': 'True', 'introspect': 'True', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'target_path': 'None', 'send_anonymous_usage_stats': 'False'}",
    "name": "MainReportArgs",
    "pid": 58578,
    "thread": "MainThread",
    "ts": "2023-08-11T15:17:40.097418Z"
  }
}
{
  "data": {
    "checksum": "51f6b581eba8f8101bc020bf9faf8f96af641e2da86d581f66e2bdf0ff384b1c",
    "profile": "",
    "target": "",
    "vars": "{}",
    "version": "1.5.1"
  },
  "info": {
    "category": "",
    "code": "I025",
    "extra": {},
    "invocation_id": "5a461268-907b-4f7d-84ca-0f17d333ebc0",
    "level": "debug",
    "msg": "checksum: 51f6b581eba8f8101bc020bf9faf8f96af641e2da86d581f66e2bdf0ff384b1c, vars: {}, profile: , target: , version: 1.5.1",
    "name": "StateCheckVarsHash",
    "pid": 58578,
    "thread": "MainThread",
    "ts": "2023-08-11T15:17:41.073512Z"
  }
}
{
  "data": {
    "added": 0,
    "changed": 0,
    "deleted": 0
  },
  "info": {
    "category": "",
    "code": "I040",
    "extra": {},
    "invocation_id": "5a461268-907b-4f7d-84ca-0f17d333ebc0",
    "level": "debug",
    "msg": "Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.",
    "name": "PartialParsingEnabled",
    "pid": 58578,
    "thread": "MainThread",
    "ts": "2023-08-11T15:17:41.104266Z"
  }
}
{
  "data": {},
  "info": {
    "category": "",
    "code": "I017",
    "extra": {},
    "invocation_id": "5a461268-907b-4f7d-84ca-0f17d333ebc0",
    "level": "debug",
    "msg": "Partial parsing enabled, no changes found, skipping parsing",
    "name": "PartialParsingSkipParsing",
    "pid": 58578,
    "thread": "MainThread",
    "ts": "2023-08-11T15:17:41.104759Z"
  }
}
{
  "data": {
    "stat_line": "2 models, 0 tests, 0 snapshots, 0 analyses, 313 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics, 0 groups"
  },
  "info": {
    "category": "",
    "code": "W006",
    "extra": {},
    "invocation_id": "5a461268-907b-4f7d-84ca-0f17d333ebc0",
    "level": "info",
    "msg": "Found 2 models, 0 tests, 0 snapshots, 0 analyses, 313 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics, 0 groups",
    "name": "FoundStats",
    "pid": 58578,
    "thread": "MainThread",
    "ts": "2023-08-11T15:17:41.119910Z"
  }
}
{
  "data": {
    "command": "dbt list",
    "completed_at": "2023-08-11T15:17:41.121310Z",
    "elapsed": 1.0645814,
    "success": true
  },
  "info": {
    "category": "",
    "code": "Q039",
    "extra": {},
    "invocation_id": "5a461268-907b-4f7d-84ca-0f17d333ebc0",
    "level": "debug",
    "msg": "Command `dbt list` succeeded at 09:17:41.121310 after 1.06 seconds",
    "name": "CommandCompleted",
    "pid": 58578,
    "thread": "MainThread",
    "ts": "2023-08-11T15:17:41.121495Z"
  }
}
{
  "data": {},
  "info": {
    "category": "",
    "code": "Z042",
    "extra": {},
    "invocation_id": "5a461268-907b-4f7d-84ca-0f17d333ebc0",
    "level": "debug",
    "msg": "Flushing usage events",
    "name": "FlushEvents",
    "pid": 58578,
    "thread": "MainThread",
    "ts": "2023-08-11T15:17:41.122033Z"
  }
}

@dbeatty10
Contributor

Doing further exploring, I discovered some key differences between dbt list --output json and dbt show --output json:

  • the former can be redirected to a file or piped to another command by using dbt --quiet
  • the latter cannot use dbt --quiet as mentioned above

As a result, the example above is something that only works with the former, and not the latter.

❔ Maybe I should open a separate issue for this? ☝️

@davidharting

davidharting commented Aug 11, 2023

@dbeatty10 Thanks for digging into this!

My goal is to build a lineage graph out of structured information emitted by dbt ls --output=json.
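As a rough illustration of that goal, the sketch below turns already-decoded node dictionaries into a parent-to-children mapping using the `unique_id` and `depends_on.nodes` fields visible in the sample output earlier in this thread. `build_lineage` is a hypothetical helper, and the node dictionaries are trimmed copies of that sample.

```python
from collections import defaultdict


def build_lineage(nodes):
    """Map each upstream unique_id to the unique_ids that depend on it."""
    children = defaultdict(list)
    for node in nodes:
        # The seed in the sample output has no "nodes" key under depends_on,
        # so fall back to an empty list.
        for parent in node.get("depends_on", {}).get("nodes", []):
            children[parent].append(node["unique_id"])
    return dict(children)


nodes = [
    {"unique_id": "model.my_project.abc.v1",
     "depends_on": {"macros": [], "nodes": []}},
    {"unique_id": "model.my_project.abc.v2",
     "depends_on": {"macros": [], "nodes": ["model.my_project.abc.v1"]}},
    {"unique_id": "seed.my_project.my_seed",
     "depends_on": {"macros": []}},
]
lineage = build_lineage(nodes)
print(lineage)
# {'model.my_project.abc.v1': ['model.my_project.abc.v2']}
```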

Today, I am using the events emitted into dbt.log. Each event is JSON. So I parse each line into JSON and see if it is a ListCmdOut event. If it is, I then pull out the data.msg string and deserialize that from a JSON string into an arbitrary dictionary.
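The double parse described here might look roughly like the following. This is a sketch under assumptions: it takes the event name to be available as `info.name` with the value `ListCmdOut`, and the log lines are hypothetical, heavily trimmed stand-ins for real events. `iter_list_nodes` is an illustrative name, not a dbt API.

```python
import json


def iter_list_nodes(log_lines):
    """Yield decoded payloads from ListCmdOut events in JSON log lines."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)  # first parse: the log event itself
        if event.get("info", {}).get("name") != "ListCmdOut":
            continue
        # Second parse: data.msg is a string that itself contains JSON.
        yield json.loads(event["data"]["msg"])


# Hypothetical trimmed log lines; real events carry many more fields.
payload = json.dumps({"unique_id": "seed.my_project.my_seed",
                      "resource_type": "seed"})
log_lines = [
    json.dumps({"info": {"name": "MainReportVersion",
                         "msg": "Running with dbt=1.5.1"}, "data": {}}),
    json.dumps({"info": {"name": "ListCmdOut", "msg": payload},
                "data": {"msg": payload}}),
]
listed = list(iter_list_nodes(log_lines))
print(listed[0]["unique_id"])
```

Nothing in this approach checks the shape of the inner payload, which is the gap a well-defined schema would close.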

What I would love is for dbt to provide a well-defined schema for the structured data provided by dbt ls --output=json, and give me a reliable way to retrieve data of that well-defined type.

I hadn't considered redirecting standard out to a file and using that. I am using the dbtRunner interface instead of the CLI interface, so I need to investigate how stdout is handled.

One downside of using stdout and using data.msg is that in both cases, we do not have a stable data structure to code against. Because in both cases it is just raw text that we are turning into Python dictionaries, dbt-core could change the structure in any patch release.

Fortunately, that has never been a problem in the past. But having a data contract for this structured data could be beneficial.
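Until such a contract exists upstream, a consumer can approximate one on its own side by validating each decoded dictionary before using it. The sketch below is one possible shape for that check. `ListedNode` and `to_listed_node` are hypothetical names, and the chosen fields are only the ones seen in the sample output in this thread, not a schema published by dbt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListedNode:
    """Consumer-side view of the fields this code relies on."""
    unique_id: str
    resource_type: str
    depends_on_nodes: tuple = ()


def to_listed_node(raw):
    """Validate a decoded dict; raise ValueError if it does not fit."""
    for key in ("unique_id", "resource_type"):
        if not isinstance(raw.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    depends_on = raw.get("depends_on", {})
    if not isinstance(depends_on, dict):
        raise ValueError("depends_on must be an object")
    parents = depends_on.get("nodes", [])
    if not isinstance(parents, list):
        raise ValueError("depends_on.nodes must be a list")
    return ListedNode(raw["unique_id"], raw["resource_type"], tuple(parents))


node = to_listed_node({
    "unique_id": "model.my_project.abc.v2",
    "resource_type": "model",
    "depends_on": {"macros": [], "nodes": ["model.my_project.abc.v1"]},
})
print(node.depends_on_nodes)
```

Failing loudly at this boundary means an upstream change in structure surfaces as a clear error rather than as a missing key deep inside the lineage code.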


This whole line of reasoning applies to dbt show --output=json output as well. We are parsing JSON strings into arbitrary dictionaries, but we don't have a contract for the expected output here, so it could theoretically break on us in the future without warning.
