
Support for ingestion time partition table on BigQuery as incremental materialization #136

Merged (6 commits) on Oct 20, 2022

Conversation


@Kayrnt Kayrnt commented Mar 13, 2022

resolves #75

Description

Support for ingestion time partition table on BigQuery as incremental materialization

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-bigquery next" section.

{{ log('table ' ~ table) }}
{%- set columns = result.columns -%}
{{ log('columns ' ~ columns) }}
{{ return(load_result('get_columns_with_types_in_query').table.columns | list) }}
Kayrnt (Contributor) commented on this diff:

Hello @jtcohen6 @McKnight-42, I could use some help over here:
I want to create a table such as

create or replace table `project`.`cou`.`test_ingestion_dbt` (`x` INT64)

Yet, as it apparently goes through the agate lib, I'm getting

create or replace table `project`.`cou`.`test_ingestion_dbt` (`x` <dbt.clients.agate_helper.Number object at 0x104688ee0>)

That code is mostly the same as https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/include/global_project/macros/adapters/columns.sql#L24-L34

I want a return similar to get_columns_in_relation from the adapter, so that I get the real BQ type and not a Number, which can't be used to create the actual BQ table with the correct columns.
My intent is to use the data_type here https://github.com/dbt-labs/dbt-bigquery/pull/136/files#diff-167e3557df7f18f1520c5db0045dfac9923e38a617c909d703be844192b28ebeR76

However, if I need to do so, I'll have to store the "dry run request" in a table so that I can use get_columns_in_relation, making the command even longer/more complex to run.

I honestly don't think that's required. Indeed, if I run

WITH base AS (
select 1 as f1, 1.3 as f2, "test" as f3
) 
select * from base where false limit 0

here is the result from https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/getQueryResults

{
  "kind": "bigquery#getQueryResultsResponse",
  "etag": "XXX",
  "schema": {
    "fields": [
      {
        "name": "f1",
        "type": "INTEGER",
        "mode": "NULLABLE"
      },
      {
        "name": "f2",
        "type": "FLOAT",
        "mode": "NULLABLE"
      },
      {
        "name": "f3",
        "type": "STRING",
        "mode": "NULLABLE"
      }
    ]
  },
  "jobReference": {
    "projectId": "XXX",
    "jobId": "XXX",
    "location": "US"
  },
  "totalRows": "0",
  "totalBytesProcessed": "0",
  "jobComplete": true,
  "cacheHit": false
}

As you can see, the types are reported correctly, so the issue is with agate and the way the data is parsed.

Do you think the original types are accessible somewhere?
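
For illustration, a minimal Python sketch, assuming the google-cloud-bigquery client is used directly rather than dbt's connection wrapper: the job result exposes the schema with the real BQ types even for a zero-row dry run.

from google.cloud import bigquery

client = bigquery.Client()
job = client.query('select 1 as f1, 1.3 as f2, "test" as f3 limit 0')
rows = job.result()  # completes the job; returns zero rows

for field in rows.schema:
    # field.field_type is the raw BigQuery type: INTEGER / FLOAT / STRING
    print(field.name, field.field_type, field.mode)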

My guess is that I have to create a custom version of load_result to access the returned schema; is that the best way to do it?

An unrelated downside of that approach (and maybe something to deal with in another change) is that all those columns are going to be created as "NULLABLE", but we have the same issue when the schema is inferred (in the usual incremental case).

Except for that, I'm making progress, but we're not there yet (once I have everything working, I'll still need to write tests).
I'm losing a lot of time adding logging of fields for development, as I don't think I can work with a step-by-step debugger (in Python code and in Jinja), can I?

Thanks!

@Kayrnt Kayrnt force-pushed the incremental-ingestion-time branch from 93caabc to 9e1c480 on March 29, 2022 at 16:27
@dataders dataders commented Apr 1, 2022

@Kayrnt you're very close here. Correct me if I'm wrong, but my understanding is that you'd like to:

for a given relation (table/view), give me the column type for each column.

The way to do this is with the column_type column of the table returned by the get_catalog() (bigquery__get_catalog()) macro.

Not only do I totally understand the approach you've taken, but I've also made the same mistake a few times. Your call statement('get_columns_with_types_in_query'.... construction was sound, and it will return an empty table with column names.

The problem is with load_result, which takes the JSON returned by the call statement and converts it into an Agate table. When Agate is given data to turn into an Agate table, there's a specific way it guesses the type of each column provided, by looking at the values in each column. Because there are no example values for each column, Agate defaults to assuming they are Number type. If you're interested, you can read more on Agate Data Types, Agate's type inference, and how dbt-core modifies Agate's type inference and even adds a new type in agate_helper.py.

You are correct that directly querying BQ via an API will return the column types; however, AFAICT, there's nothing in dbt-core or dbt-bigquery that uses the metadata in every query response and passes the needed types to Agate before loading. Perhaps it could be done, though, starting with BigQueryConnectionManager.get_table_from_response(). Agree or disagree: @VersusFacit @McKnight-42 @jtcohen6?
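
A minimal sketch of that idea (the mapping and helper name are illustrative assumptions, not dbt-bigquery's actual implementation): pass explicit column_types built from the BigQuery response schema, so Agate never has to infer from zero rows.

import agate

# Illustrative subset of a BigQuery-to-agate type mapping (an assumption,
# not dbt's actual mapping).
BQ_TO_AGATE = {
    "INTEGER": agate.Number(),
    "FLOAT": agate.Number(),
    "STRING": agate.Text(),
    "BOOLEAN": agate.Boolean(),
}

def table_from_bq_schema(fields, rows):
    # Explicit column_types make agate skip type inference entirely.
    names = [f["name"] for f in fields]
    types = [BQ_TO_AGATE.get(f["type"], agate.Text()) for f in fields]
    return agate.Table(rows, column_names=names, column_types=types)

# With the dry-run schema from the getQueryResults response quoted above:
fields = [
    {"name": "f1", "type": "INTEGER"},
    {"name": "f2", "type": "FLOAT"},
    {"name": "f3", "type": "STRING"},
]
table = table_from_bq_schema(fields, rows=[])
print([(c.name, type(c.data_type).__name__) for c in table.columns])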

@Kayrnt Kayrnt force-pushed the incremental-ingestion-time branch 10 times, most recently from e7dfb33 to a8b32a2 on April 10, 2022 at 19:44
@Kayrnt Kayrnt force-pushed the incremental-ingestion-time branch 2 times, most recently from f8e8739 to b8b0d3a on April 11, 2022 at 10:18
@Kayrnt Kayrnt marked this pull request as ready for review April 11, 2022 10:31
@Kayrnt Kayrnt force-pushed the incremental-ingestion-time branch 2 times, most recently from 1b1783c to 926e1dc on April 11, 2022 at 22:27
@Kayrnt Kayrnt marked this pull request as draft April 11, 2022 22:35
Comment on lines 481 to 486

response = BigQueryAdapterResponse(  # type: ignore[call-arg]
-    _message=message, rows_affected=num_rows, code=code, bytes_processed=bytes_processed
+    _message=message,
+    rows_affected=num_rows,
+    code=code,
+    bytes_processed=bytes_processed,
+    fields=fields,
A reviewer (Contributor) commented:

Looks like you're introducing non-black-formatted code. Any chance you want to take on formatting impl.py with black, to make your actual changes easier to diff here?

Kayrnt replied:

Indeed, I ran black after my changes, but I can definitely run it in a dedicated commit to push before mine.

@dataders dataders left a review comment:

I think you're almost there!

@Kayrnt Kayrnt force-pushed the incremental-ingestion-time branch from 926e1dc to 3314332 on April 12, 2022 at 06:48
@Kayrnt Kayrnt force-pushed the incremental-ingestion-time branch 2 times, most recently from 88d550b to 42d527d on April 13, 2022 at 16:46
@github-christophe-oudar commented:

I've rebased the change to work with the Python support.
The code should be properly tested, and the tests appear to pass.
I'll run a few tests locally to confirm it works as intended, but you can re-review the code.

@Kayrnt Kayrnt force-pushed the incremental-ingestion-time branch from f532c54 to b0a97c5 on August 8, 2022 at 15:44
@McKnight-42 McKnight-42 self-assigned this Aug 9, 2022
@Kayrnt Kayrnt (Author) commented Aug 29, 2022

What's the status on your end @jtcohen6 @McKnight-42? It would be sad to miss the 1.3 release 😉
@dataders I think you'll have to dismiss your review because it looks like I can't (and I think I've made the changes since then, but feel free to review 🙌). Thanks!

@@ -69,6 +73,13 @@ def render(self, alias: Optional[str] = None):
        else:
            return f"{self.data_type}_trunc({column}, {self.granularity})"

    def render_wrapped(self, alias: Optional[str] = None):
        """Wrap the partitioning column when time involved to ensure it is properly casted to matching time."""
        if self.data_type in ("date", "timestamp", "datetime"):
A reviewer (Contributor) commented:

Do we need to include the "time" type here as well?

Kayrnt replied:

As far as I know, it's not possible to partition by "time", see https://cloud.google.com/bigquery/docs/partitioned-tables?hl=fr

The reviewer replied:

ahh makes sense, thanks for the clarification!
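
To make the exchange above concrete, here is a standalone sketch (a simplification, not the PR's exact render_wrapped) of why "time" never appears in that check:

def render_wrapped(column: str, data_type: str) -> str:
    # BigQuery supports time partitioning on DATE / TIMESTAMP / DATETIME
    # columns but not on a bare TIME column, so "time" is never checked here.
    if data_type in ("date", "timestamp", "datetime"):
        # Wrap so partition comparisons use a consistent temporal type,
        # e.g. timestamp(`created_at`).
        return f"{data_type}({column})"
    return column  # e.g. integer-range partitioning passes through unchanged

print(render_wrapped("`created_at`", "timestamp"))  # timestamp(`created_at`)
print(render_wrapped("`bucket_id`", "int64"))       # `bucket_id`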

@McKnight-42 commented:

@github-christophe-oudar shed light on this one while testing locally as part of the review. Looking good, hoping to get it merged soon!

@github-christophe-oudar commented:

It looks like the stars are about to align then! ⭐ ⭐ ⭐
After that I'll rebase #167
Let's be ambitious and try to merge both in September 🙌

@github-christophe-oudar commented:

👋
Any chance we can merge in the next few weeks?
It will become a blocker on my end, so I'm wondering if I need to fork for my organisation.
Thanks 🙇

@colin-rogers-dbt colin-rogers-dbt left a review comment:

LGTM

@colin-rogers-dbt commented:

@cla-bot check

@cla-bot cla-bot bot added the cla:yes label Oct 20, 2022
cla-bot bot commented Oct 20, 2022

The cla-bot has been summoned, and re-checked this pull request!

Labels: cla:yes, ok to test, ready_for_review