Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1934] [Feature] Avoid billed bytes on empty model resulting of limit 0 usage with time_ingestion_partitioning option #487

Open
3 tasks done
github-christophe-oudar opened this issue Jan 27, 2023 · 3 comments
Labels
enhancement New feature or request help_wanted Extra attention is needed

Comments

@github-christophe-oudar
Copy link
Contributor

github-christophe-oudar commented Jan 27, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

As we've introduced time_ingestion_partitioning flag in 1.4.0, I figured out that using INSERT DML statement for that option appears not to work as expected.

Let's consider following example:

SELECT *
FROM a

Let's say that it's being billed 1 TB (because a table data is 1 TB)
Then

Select *
FROM a
LIMIT 0

would be detected as a 0 bytes billed request.
But if we do

INSERT INTO b (...)
SELECT * 
FROM a
LIMIT 0

Then we're going to be billed 1 TB just like the same query with LIMIT 0.
It looks like it's a DML issue but it can be a problem for dbt user till it's fixed as it's the approach that is used by dbt-bigquery.

The use case for that LIMIT 0 is to do some "dry run" for smart slim CI.
It's a blocker to migrate some model that have large inputs as CI would be expensive as even for flat price usage, you'll still pay for some (useless) slot time.

Describe alternatives you've considered

As discussed with @jtcohen6, there are some ways to fix it:

  • Use a staging table
    • It can be a great workaround but it creates another temp table to clean up, it creates a bit more latency as it's adding yet another statement to be executed and finally, without the limit 0, it means that we would billed the output of the intermediate table as a result of splitting the query in 2.
  • Use a staging copy only for copy_partitions option
    • As I would suggest to use copy_partitions most of the time and it would use bq copy to copy from the staging table (that would use column type partitioning) to the time ingestion table, it would work well with a bit more latency (and still I'm not even sure as we would replace the DML INSERT by a bq copy which tend to be faster).

Who will this benefit?

Any user that runs queries with limit 0 and don't want to waste resources

Are you interested in contributing this feature?

Yes, since I wrote the time_ingestion_partitioning & copy_partitions features, it would likely be more efficient that I work that one too

Anything else?

No response

@github-actions github-actions bot changed the title [Feature] Avoid billed bytes on empty model resulting of limit 0 usage with time_ingestion_partitioning option [CT-1934] [Feature] Avoid billed bytes on empty model resulting of limit 0 usage with time_ingestion_partitioning option Jan 27, 2023
@Fleid Fleid added help_wanted Extra attention is needed and removed triage labels Feb 14, 2023
@github-christophe-oudar
Copy link
Contributor Author

Here a workaround I'm using:

I'm writing my main model in a model my_model_ephemeral

{{
    config(
        materialized=time_partition_source_workaround('ephemeral'),
    )
}}
SELECT ... from `prod_table` LIMIT 0 (only in CI)

And then I have a second model my_model:

SELECT * from {{ ref('my_model_ephemeral') }}

The macro is the following:

{% macro time_partition_source_workaround(original_materialization) -%}
{% if target.name == 'ci' -%}
table
{%- else -%}
{{ original_materialization }}
{%- endif -%}
{%- endmacro %}

In practice, it works like this:
In CI, I output the data as "empty" to a table (like a temp table), the table materialization works as expected, the query isn't billing/reading data. Then the insert of this empty table in the time ingestion partitioned table works as intended.
In production, it's just used a CTE in a single query (like it would be intended in the first place).

@github-christophe-oudar
Copy link
Contributor Author

Apparently it was fixed in BQ so we can close that issue

@github-christophe-oudar
Copy link
Contributor Author

Apparently it was fixed in BQ so we can close that issue

Well it's not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help_wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants