Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update external table columns #252

Merged
merged 26 commits into from
Apr 11, 2024

Conversation

thomas-vl
Copy link
Contributor

@thomas-vl thomas-vl commented Feb 7, 2024

Description & motivation

resolves: #263
Descriptions and policyTags on columns where not propagated to the table schema of an external table on BigQuery.

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added an integration test for my fix/feature (if applicable)

@thomas-vl
Copy link
Contributor Author

@jeremyyeo do you have any idea why CI fails on snowflake and redshift?

@thomas-vl
Copy link
Contributor Author

@jeremyyeo CI is broken see the PR I created without any changes:
#253

@thomas-vl
Copy link
Contributor Author

@dataders I saw you merge another pr, could you also take a look at this one?

Copy link
Collaborator

@dataders dataders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first, from-the-hip thoughts

  1. this builds off of a niche BigQuery feature which allows persisting column-level metadata beyond the standard description field such as policy tags.
  2. I'm fuzzy on (and skeptical of) Core's implementation (Added support for setting policy tags for columns in BigQuery dbt-core#2589)
  3. Hard for me to judge this PR's implementation because it is so closely tied to that of dbt-bigquery.

questions for @thomas-vl

  1. what does this PR do? I cannot tell the problem it is solving. please open an issue to give context then link it to this PR.
  2. additionally, can you describe exactly how this PR works? Am I correct that the operation happens not within the creation of a single external table, but all have been already staged?
  3. would another warehouse ever want/need to make use of your proposed update_external_table_columns? still don't understand the use case, as adapter.update_columns is not used by other adapters. should it be?
  4. is there info missing from docs.getdbt.com's Big Query configuration page section on Tags about how/when these tags get persisted?
  5. extra credit: rebase this PR to be a single commit

Comment on lines +69 to +71
{% set update_columns = dbt_external_tables.update_external_table_columns(node) %}
{{ update_columns }}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. is my understanding that this operation should happen after (and separated from) the actual staging of the tables?
  2. If update_external_table_columns returns an empty string (as is the case with the default__ version) then it is a no-op?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, it just updates the schema with operations that can only be done via the API, see:
    https://github.com/dbt-labs/dbt-bigquery/blob/6c0afe4cfb69761dada5d16150fe632b8f72bf39/dbt/adapters/bigquery/impl.py#L609
    It adds descriptions and policyTags effectively creating the same behaviour as when you create a normal model.

  2. So I added the default__update_external_table_columns because I thought that this was how you should implement a macro that is BigQuery specific.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps i was looking for something more conventional (for this repo/ dbt macros) like

  1. fetching config,
  2. if config is not empty, then do the thingy
Suggested change
{% set update_columns = dbt_external_tables.update_external_table_columns(node) %}
{{ update_columns }}
{% set update_columns = dbt_external_tables.update_external_table_columns(node) %}
{%- if update_columns -%}
{{ update_columns }}

Copy link
Collaborator

@dataders dataders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @thomas-vl did some more thinking at wrote up an answer on #263. The feature is hard for me to rationalize at this point. Fortunately your implementation was a clue that eventually clued me into the bigger picture. For that reason, I'm going to close this PR and instead welcome more discussion on the issue itself and the larger discussion about the future of external tables

@dataders
Copy link
Collaborator

dataders commented Apr 11, 2024

@thomas-vl thinking about this again. Ultimately, when External Tables are in Core, implementing your proposed change will be both easier and more robust. However that is still a ways away.

I'm comfortable accepting this PR given that:

  1. this shouldn't affect non-BQ users, and
  2. it's a stop-gap solution

Copy link
Collaborator

@dataders dataders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@dataders dataders merged commit c7ae743 into dbt-labs:main Apr 11, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable setting policy tags on BigQuery external tables
2 participants