
Support BigQuery Job Tags and Labels #2483

Closed
boxysean opened this issue May 22, 2020 · 12 comments · Fixed by #3145
Labels: bigquery, enhancement (New feature or request), good_first_issue (Straightforward + self-contained changes, good for new contributors!)
Comments

@boxysean (Contributor)

Describe the feature

I would like to be able to control tagging and labeling of BigQuery Jobs as I run dbt on BigQuery.

A similar (but not the same) issue is #1947, for labeling BigQuery Tables and Datasets. This issue focuses on BigQuery Jobs (such as Insert Jobs or Query Jobs).

Describe alternatives you've considered

It's not possible to label or tag jobs after they have started. From the docs:

You cannot add labels to or update labels on pending, running, or completed jobs.

Additional context

The main reason to tag and label BigQuery Jobs is to analyze BigQuery spend. For example, if one could link a BigQuery Job to a certain Airflow operator run (or similar -- in my case a python script run by a cron! :-D), then a real dollar value could be put on running that operator over time.

I think it's important to give the developer control over which tags and labels are added, so the feature supports their data processing setup. Tags and labels should therefore be settable at launch time. (In my case, I run a python script that calls dbt run -- I would want my python script to set the BigQuery Job tags and labels, while the Jobs are ultimately launched by dbt run.)
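
For concreteness, a minimal sketch of the kind of setup I mean; the environment variable and its wiring into job labels are hypothetical, since that wiring is exactly what I'm requesting:

    import os
    import subprocess

    # Hypothetical wrapper run by cron: the caller decides the label values.
    env = os.environ.copy()
    env["DBT_JOB_LABELS"] = '{"name": "nightly_run"}'  # hypothetical variable

    # dbt launches the BigQuery Jobs; with this feature, it would pick up the
    # labels set above and attach them to every job it starts.
    subprocess.run(["dbt", "run"], env=env, check=True)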

Who will this benefit?

Folks who are responsible for their BigQuery spend should benefit from using relevant Job tags and labels.

Thanks!

@boxysean added the enhancement and triage labels on May 22, 2020
@jtcohen6 (Contributor)

This is interesting! Thanks for the detailed writeup @boxysean.

More so than labels on tables and columns, this idea feels like a close relative of query comments, which we introduced in v0.15. Parsing out and aggregating query comments is the best way to calculate usage on Postgres/Redshift.

Clarifying questions

A single dbt run may execute dozens or hundreds of query jobs. Is your thought that all query jobs executed by the same dbt command should get the same label(s)? Should each query job's label include information about the specific node (model/seed/snapshot/test/etc) it's running?

You mention that you'd want to have control of the label when invoking dbt from, say, Airflow or a cron script. Is this something you expect as a CLI arg, e.g.

dbt run --job-label '{"name": "nightly_run"}'

Or could it be version-controlled code that, depending on situational variables (target, var, env_var) outputs a different label?

My proposal

Here's what makes sense to me:

  • We could add a query-comment attribute, job-label: true. On BigQuery, in addition to prepending the query comment as a SQL comment (/* */), dbt would add the comment's key-value components as labels to each query job it runs.
  • Using the existing query-comment functionality, you could write a macro that returns a different comment (or comment dict attributes) depending on the values of target/var/env_var

@mescanne (Contributor) commented Oct 3, 2020

The proposal is reasonable and I experimented with it. However, I found that GCP imposes a maximum length of 128 bytes on a label value, and the default comment exceeded it. I'm not sure it makes sense to reuse the same query-comment functionality.

Alternative:

  • On every dbt BigQuery job, pass in the label dbt_invocation_id, matching the invocation_id in dbt.
  • For cases where more detailed profiling is needed, a dbt step can insert into a tracking table an association between the invocation_id and whatever other information is needed. That can come from whatever macros you need/want (e.g. from Airflow, or Argo).
  • The INFORMATION_SCHEMA job information can then be used to extract performance stats and left joined with the tracking table for more detailed analysis.

The code changes in dbt are tiny -- I'm putting together a pull request for it.

@jtcohen6 (Contributor) commented Oct 6, 2020

@mescanne Thanks for checking out this issue, and for opening up the PR. I'll take a look there in a moment.

I had envisioned splitting out the arguments currently passed in the query comment to each be their own label, so one for each of app, dbt_version, profile_name, target_name, and node_id. That would enable us to avoid the 128-byte limit. A few of those values would require more plumbing than what's currently available to raw_execute, however, so the work required is a bit more involved.

I quite like your suggestion to query INFORMATION_SCHEMA.JOBS_BY_* and filter / join on the invocation_id label as a way to get metadata about a given run, e.g. total spend.
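
For illustration, here's a rough sketch of that kind of analysis using the google-cloud-bigquery client. The region qualifier, lookback window, and output columns are assumptions to adapt, not anything dbt ships:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Aggregate job statistics per dbt invocation, via the dbt_invocation_id
    # label discussed above. `region-us` and the 7-day window are assumptions.
    sql = """
    SELECT
      (SELECT value FROM UNNEST(labels) WHERE key = 'dbt_invocation_id') AS invocation_id,
      SUM(total_bytes_billed) AS total_bytes_billed,
      SUM(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS total_runtime_ms
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY invocation_id
    ORDER BY total_bytes_billed DESC
    """

    for row in client.query(sql).result():
        print(row.invocation_id, row.total_bytes_billed, row.total_runtime_ms)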

@boxysean (Contributor, issue author) commented Oct 7, 2020

Hi @jtcohen6, coming back to this. I think it's helpful to add as much information as possible to the query jobs, so your query-comment suggestion makes sense.

I would still prefer to specify the BigQuery Job labels at dbt run-time, as you suggested with dbt run --job-label '{"name": "nightly_run"}'. That way they would be controlled by the executor (e.g., Airflow, cron) and I could customize how I slice and dice the executions, which is indeed what INFORMATION_SCHEMA.JOBS_BY_* could be used for!

@hui-zheng

I would also like to upvote this feature request.

In our use case, we would like to pass other environment variables to the BQ job labels for ease of monitoring and benchmarking. For example, we would like to pass the environment (prod, stage), the pipeline release version, and our orchestration workflow run id (Prefect run id). Currently we set this info in environment variables, and it's quite easy to add them to the query-comment dict.

I think @jtcohen6 's proposal below would make it work for us if implemented.

I had envisioned splitting out the arguments currently passed in the query comment to each be their own label, so one for each of app, dbt_version, profile_name, target_name, and node_id. That would enable us to avoid the 128-byte limit. A few of those values would require more plumbing than what's currently available to raw_execute, however, so the work required is a bit more involved.

@ncolomer commented Feb 22, 2021

We are in the exact same situation as the ones described by @boxysean and @hui-zheng: we'd like to stamp our dbt BigQuery jobs with labels/tags so that we can get better visibility into how we spend our BQ budget/quota.

Although per-dbt-run custom labels/tags would be enough in our case (e.g. customer, environment, etc.), having additional (generic?) context per query job would be great too (e.g. the dbt model/node the query job is running for, as suggested by @jtcohen6, or the model/node's own dbt tags).

This looks to be supported and achievable through BigQuery's job-labeling feature. It is also documented in the JobConfiguration API doc.

@ncolomer

Currently we set this info in environment variables, and it's quite easy to add them to the query-comment dict.

@hui-zheng, could you describe how you do it?

@jmcarp (Contributor) commented Feb 27, 2021

My team was looking for exactly this feature. Are you planning to work on it, or would you review a PR that adds a --job-label flag to dbt run?

@jtcohen6 (Contributor) commented Mar 1, 2021

This isn't something we're prioritizing now. FYI #2809 did add invocation_id as a label to all dbt-bigquery jobs, starting in v0.19.0. That invocation_id can be used to query INFORMATION_SCHEMA.JOBS_BY_* and calculate total time/spend per invocation; it can also be used to associate BigQuery query history with dbt run artifacts (docs), namely run_results.json. Those run artifacts will contain lots of useful metadata, including (e.g.) any environment variables prefixed with DBT_ENV_CUSTOM_ENV_.

I agree that dbt should be able to pass more information than just the invocation_id, though I think that's a strong start. A --job-label flag that allows the user / orchestration tool to set one value for all nodes in an invocation should be straightforward to implement. The more I think about it, though, the more it feels functionally limiting, and also one-off as an implementation, not well integrated with the existing dbt constructs that seek to accomplish the same goal.

I do think the best version of this would make the full query comment context available as per-node job labels. That context, available to the query_comment macro, is defined in query_headers.py. I agree with @hui-zheng: it's quite easy to pass environment variables or --vars into the query-comment config or query_comment macro, so this approach would solve both use cases we've been discussing.

The string version of this comment—the default value, the string passed to the config, or the value returned by the custom macro—is available to the connection manager, via set_query_header and _add_query_comment. The execute method already calls _add_query_comment to prepend the comment to SQL before execution:

https://github.com/fishtown-analytics/dbt/blob/344a14416d22f0cfbeb56b9904092c8a4f38b1fc/plugins/bigquery/dbt/adapters/bigquery/connections.py#L333-L336

So here's what I'm thinking about:

  • How would we enable dbt-bigquery users to turn this on? I'm thinking an additional option, nested under query_comment in dbt_project.yml, called job_label: true | false.
  • If query_comment.job_label is turned on, and the query comment config/macro returns a dict / JSON string (such as the advanced usage example in the docs), should dbt try to parse the returned value into a python dict, and pass each key-value pair as a separate label? I think yes; this should even work for the default query comment value.
  • If query_comment.job_label is turned on, and the query-comment returns an unstructured string, should dbt still try to pass the first 128 bytes (truncated if needed) as the value to a single label called query_comment? I still think yes, but I'm open to your thoughts on this point (and every point above).

Having written all that out, acknowledging that there are a few tricky pieces, I do think the requisite changes would be relatively self-contained in the codebase. Would anyone be interested in giving it a go?

@jtcohen6 added the good_first_issue label on Mar 1, 2021
@jmcarp (Contributor) commented Mar 3, 2021

I'm not very familiar with the dbt internals, so it would probably take me some time to figure out, but I'd be happy to give this a try if nobody picks it up first.

@jtcohen6 (Contributor) commented Mar 3, 2021

Happy to help along the way @jmcarp! I think the methods I linked above would be the right places to start.

In particular, raw_execute already passes job labels here:

https://github.com/fishtown-analytics/dbt/blob/344a14416d22f0cfbeb56b9904092c8a4f38b1fc/plugins/bigquery/dbt/adapters/bigquery/connections.py#L302-L313

And _add_query_comment is already a method available to the connection manager. So this may be as simple as adding, within raw_execute:

query_comment = self._add_query_comment('')

Then, try to parse query_comment as a dict. If it's a dict, pass each key/value pair as a separate item to job_params['labels']. If it's a non-empty string, pass the entire query_comment string (truncated to 128 characters) as the value of one label. If it's empty, do nothing. (Rough sketch below.)
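
A rough sketch of that logic, as a hypothetical helper on the connection manager (the helper name and the comment-unwrapping details are assumptions, not the final implementation):

    import json

    def _query_comment_labels(self) -> dict:
        # Hypothetical helper: raw_execute would merge the returned dict
        # into job_params['labels'].
        comment = self._add_query_comment('').strip()
        # _add_query_comment wraps the comment in /* ... */ (see above);
        # unwrap it before attempting to parse.
        if comment.startswith('/*') and comment.endswith('*/'):
            comment = comment[2:-2].strip()
        try:
            parsed = json.loads(comment)
        except (TypeError, ValueError):
            parsed = None
        if isinstance(parsed, dict):
            # Dict-shaped comment: one label per key/value pair.
            return {key: str(value) for key, value in parsed.items()}
        if comment:
            # Unstructured comment: a single truncated label. Real code would
            # also sanitize keys/values to BigQuery's allowed label characters.
            return {'query_comment': comment[:128]}
        return {}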

There would be some additional work to make this an optional config, and to turn it on/off based on query_comment.job_label, but that feels eminently doable. It probably looks like creating a special dbt-bigquery subclass of MacroQueryStringSetter and having the BigQueryConnectionManager override set_query_header, unless we can think of comparable functionality on other adapters. (Snowflake has query tags, but those are accessible in SQL, so there's no need to wedge them into the connection.)

@ncolomer commented Mar 3, 2021

Thanks for the detailed explanations @jtcohen6, I can give it a try too 🖐️
