Parse query comment and use as bigquery job labels. #3145

jmcarp · 2021-03-04T05:09:25Z

resolves #2483

Description

Try to parse query comment and pass values to bigquery job labels. Just a draft to see if I'm heading in the right direction. @jtcohen6

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

jtcohen6

This is really cool @jmcarp!

Thinking about how to test this:

A unit test could inspect job_params[labels] and confirms that they contain the keys from the default query comment (app, profile_name, target_name, dbt_version, node_id) and the dbt_invocation_id. Maybe the query comment-to-job label conversion should be a standalone function?
An integration test could run a model and check that the labels appropriately registered, by querying INFORMATION_SCHEMA.JOBS_BY_USER

jtcohen6 · 2021-03-04T11:36:21Z

core/dbt/contracts/connection.py

@@ -215,6 +215,7 @@ def to_target_dict(self):
 class QueryComment(dbtClassMixin):
    comment: str = DEFAULT_QUERY_COMMENT
    append: bool = False
+    job_label: bool = False


The config is query-comment, rather than query_comment (docs). Not exactly sure why we did this, but I guess kebab casing is common in dbt_project.yml / project-level configs. I think this should be job-label (instead of job_label) for consistency

OK, I switched the base class to HyphenatedDbtClassMixin.

On second thought, I may not understand what this base class does, since that change broke tests that I thought were unrelated. Reverted for now, but let me know if you have thoughts about the right way to do this.

I took another look at this, and I think I found a subtle bug in the casing logic. Should be fixed now.

jtcohen6 · 2021-03-04T11:37:56Z

plugins/bigquery/dbt/adapters/bigquery/connections.py

+            except (TypeError, ValueError):
+                pass


If the comment isn't valid json, should we still include it as a single job label (called query_comment), truncated to 128 characters if need be?

Makes sense, I changed it.

…rp/bigquery-job-labels

jmcarp · 2021-03-05T14:41:12Z

core/dbt/dataclass_schema.py

-        if cls._hyphenated:
+        # `data` might not be a dict, e.g. for `query_comment`, which accepts
+        # a dict or a string; only snake-case for dict values.
+        if cls._hyphenated and isinstance(data, dict):


Does this make sense @gshank?

Looking at all the existing calls to this class method, it seems to be true.

Honestly, I'm not sure exactly how this code gets reached, but I dropped into a debugger at this point on the unit test suite and found non-dictionary values--specifically, there's a test that sets query-comment to "". You should be able to reproduce the issue by dropping the isinstance check that I added and running something like pytest -s test/unit/test_query_headers.py -k test_disable_query_comment.

You can also run that test with the following added to pre_deserialize:

if cls._hyphenated and not isinstance(data, dict): import ipdb; ipdb.set_trace()

Looking at the stack, I'm inside a function called __unpack_union_Project_query_comment__9f5a8d0e89384cc286d2b99acef5623d that I'm guessing is dynamically generated and eval-ed by mashumaro 😱.

jtcohen6

This is looking good from my vantage! One small comment from my testing. I think it'd be good to add a few unit tests of the various inputs/outputs of query comment sanitizing and structuring. Beyond that, I think this is ready for a changelog entry (v0.20.0 Feature).

I also tagged a member of the Core team for python code review.

jtcohen6 · 2021-03-08T16:15:50Z

plugins/bigquery/dbt/adapters/bigquery/connections.py

+_SANITIZE_LABEL_PATTERN = re.compile(r"[^a-z0-9_-]")
+
+
+def _sanitize_label(value: str, max_length: int = 63) -> str:
+    """Return a legal value for a BigQuery label."""
+    value = value.lower()
+    value = _SANITIZE_LABEL_PATTERN.sub("_", value)
+    value = value[: max_length - 1]
+    return value


Anecdotally, this has the effect of appending an extra underscore at the end of my non-dict comment, e.g. comment: whatever in dbt_project.yml becomes query_comment: whatever_ in BigQuery, and comment: something else becomes query_comment: something_else_. Is that intended?

I'm not sure why this is happening--there may be something upstream of this code adding trailing whitespace, because I don't get that behavior testing the function directly. I tried adding a strip(). Does that help?

Looks like the strip() fixed it!

(The numerous _ in the top example are all special characters.)

jtcohen6

This looks awesome @jmcarp! Thanks for adding those unit tests.

Functional review is good by me: I've confirmed this works as expected when job-label: True (with default, custom JSON, and custom string comments), and just as crucially that it works as expected when job-label: False or not supplied (the only label is dbt_invocation_id = status quo).

I'll leave final code/style review to @nathaniel-may. After that, this is good to go.

…rp/bigquery-job-labels

jmcarp · 2021-03-16T03:11:11Z

No rush @jtcohen6, but it would be awesome to get this into the next release.

nathaniel-may

This was waiting on me @jmcarp. Thank you for your patience. Once we address my relatively minor questions here, this looks good to merge.

I do want an integration test so that we can detect regressions on this behavior from both dbt and bigquery, but I'm ok ticketing that in a separate PR if you are too @jtcohen6.

nathaniel-may · 2021-03-19T20:24:25Z

core/dbt/dataclass_schema.py

-        if cls._hyphenated:
+        # `data` might not be a dict, e.g. for `query_comment`, which accepts
+        # a dict or a string; only snake-case for dict values.
+        if cls._hyphenated and isinstance(data, dict):


Looking at all the existing calls to this class method, it seems to be true.

nathaniel-may · 2021-03-19T20:25:19Z

test/unit/test_bigquery_adapter.py

@@ -931,3 +941,16 @@ def test_convert_time_type(self):
        expected = ['time', 'time', 'time']
        for col_idx, expect in enumerate(expected):
            assert BigQueryAdapter.convert_time_type(agate_table, col_idx) == expect
+
+
+@pytest.mark.parametrize(


We don't seem to use this style of test very often, but I actually really like it so I'm happy to introduce it to this file too.

nathaniel-may · 2021-03-19T20:29:13Z

test/unit/test_bigquery_adapter.py

 from dbt.adapters.base.query_headers import MacroQueryStringSetter
 from dbt.clients import agate_helper
 import dbt.exceptions
 from dbt.logger import GLOBAL_LOGGER as logger  # noqa
 from dbt.context.providers import RuntimeConfigObject

+import pytest


I won't block the merge on this, but I'd rather put this import up just above import unittest.

nathaniel-may · 2021-03-19T20:32:35Z

plugins/bigquery/dbt/adapters/bigquery/connections.py

+    """Return a legal value for a BigQuery label."""
+    value = value.strip().lower()
+    value = _SANITIZE_LABEL_PATTERN.sub("_", value)
+    value = value[: max_length]


Is this the behavior we want for long labels? Alternatives include raising our own error, or sending something we know is too long to big query and raising that db exception. As much as this will not often, if ever, be attempted on purpose, a typo could cause this which might be a slightly annoying user experience.

I think passing through the input and letting bigquery crash makes sense.

jtcohen6 · 2021-03-22T11:43:07Z

Thanks for the review @nathaniel-may, and for the quick cleanup @jmcarp! I'll open a separate issue for the integration test (#3184).

Parse query comment and use as bigquery job labels.

2fdc113

cla-bot bot added the cla:yes label Mar 4, 2021

jmcarp added 2 commits March 4, 2021 00:14

Sanitize bigquery labels.

62be9f9

Fix lint.

7b0c74c

jtcohen6 reviewed Mar 4, 2021

View reviewed changes

Address comments from code review.

8d73ae2

jmcarp force-pushed the jmcarp/bigquery-job-labels branch from 7b4b728 to 8d73ae2 Compare March 4, 2021 15:20

jmcarp added 2 commits March 5, 2021 00:09

Hyphenate query comment fields and fix deserialization bug.

c71a18c

Merge branch 'develop' of github.com:fishtown-analytics/dbt into jmca…

82cca95

…rp/bigquery-job-labels

jmcarp commented Mar 5, 2021

View reviewed changes

jmcarp marked this pull request as ready for review March 5, 2021 14:41

jtcohen6 requested review from a team, nathaniel-may and jtcohen6 and removed request for a team March 8, 2021 16:05

jtcohen6 reviewed Mar 8, 2021

View reviewed changes

Add tests for bigquery label sanitize helper.

af3c3f4

jmcarp force-pushed the jmcarp/bigquery-job-labels branch from f3f2f72 to af3c3f4 Compare March 8, 2021 20:44

Add BigQuery job labels to changelog.

8566a46

jtcohen6 approved these changes Mar 9, 2021

View reviewed changes

Merge branch 'develop' of github.com:fishtown-analytics/dbt into jmca…

9fe2b65

…rp/bigquery-job-labels

nathaniel-may approved these changes Mar 19, 2021

View reviewed changes

Cleanups from code review.

044a6c6

jmcarp force-pushed the jmcarp/bigquery-job-labels branch from b0016ee to 044a6c6 Compare March 20, 2021 05:00

jtcohen6 merged commit 934c23b into dbt-labs:develop Mar 22, 2021

jtcohen6 mentioned this pull request May 5, 2021

'NoneType' object has no attribute 'job_label' #3314

Closed

jtcohen6 mentioned this pull request Jul 27, 2021

Raise a more helpful error if BigQuery job label is too long #3612

Closed

5 tasks

sungchun12 mentioned this pull request Aug 6, 2021

Fix/bigquery job label length #3703

Merged

4 tasks

jtcohen6 mentioned this pull request Oct 12, 2021

Integration test for query-comment as BigQuery job label dbt-labs/dbt-bigquery#28

Closed

matt-winkler mentioned this pull request Apr 27, 2023

[ADAP-501] [Feature] Leverage Bigquery Comments at Model level dbt-labs/dbt-bigquery#685

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parse query comment and use as bigquery job labels. #3145

Parse query comment and use as bigquery job labels. #3145

jmcarp commented Mar 4, 2021 •

edited

jtcohen6 left a comment

jtcohen6 Mar 4, 2021

jmcarp Mar 4, 2021

jmcarp Mar 4, 2021

jmcarp Mar 6, 2021

jtcohen6 Mar 4, 2021

jmcarp Mar 4, 2021

jmcarp Mar 5, 2021

nathaniel-may Mar 19, 2021

jmcarp Mar 20, 2021

jmcarp Mar 20, 2021

jtcohen6 left a comment

jtcohen6 Mar 8, 2021

jmcarp Mar 8, 2021

jtcohen6 Mar 9, 2021

jtcohen6 left a comment

jmcarp commented Mar 16, 2021

nathaniel-may left a comment

nathaniel-may Mar 19, 2021

nathaniel-may Mar 19, 2021

nathaniel-may Mar 19, 2021

nathaniel-may Mar 19, 2021

jmcarp Mar 20, 2021

jtcohen6 commented Mar 22, 2021 •

edited

Parse query comment and use as bigquery job labels. #3145

Parse query comment and use as bigquery job labels. #3145

Conversation

jmcarp commented Mar 4, 2021 • edited

Description

Checklist

jtcohen6 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtcohen6 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtcohen6 left a comment

Choose a reason for hiding this comment

jmcarp commented Mar 16, 2021

nathaniel-may left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtcohen6 commented Mar 22, 2021 • edited

jmcarp commented Mar 4, 2021 •

edited

jtcohen6 commented Mar 22, 2021 •

edited