BQ clustering can improve merge performance #2196

Closed
jtcohen6 opened this issue Mar 11, 2020 · 6 comments
Labels: bigquery, enhancement (New feature or request)

Comments


jtcohen6 commented Mar 11, 2020

Description

On BigQuery, running a simple merge statement—as the incremental materialization does on versions <=0.15.2—appears to scan significantly less data if the target table is clustered.

This makes some intuitive sense when the cluster key and the unique_key for merge equality are identical, and even some slight sense when they're correlated (e.g. the merge key is event_id, the cluster key is session_id, and the former is contained within the latter). I'm seeing the benefit for all clustered tables, however, no matter which column(s) the table is clustered by.
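For concreteness, a minimal sketch of the shape involved (table and column names are made up, and this isn't the exact SQL dbt generates): a plain merge whose target table is clustered by the same column used as the unique_key.

```sql
-- Hypothetical target table, clustered by the same column used as the merge key
create table if not exists analytics.events_clustered
partition by date(event_at)
cluster by event_id
as select * from analytics.events_staging where false;

-- A simple merge, roughly what the incremental materialization issues.
-- With the destination clustered, BigQuery appears to scan far less of it
-- than the dry-run estimate suggests.
merge into analytics.events_clustered dest
using analytics.events_staging src
on dest.event_id = src.event_id
when matched then update set
  session_id = src.session_id,
  event_at = src.event_at
when not matched then insert (event_id, session_id, event_at)
values (src.event_id, src.session_id, src.event_at);
```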

This feels like a relatively recent BQ performance improvement, and—wow. While it throws a small wrench into our 0.16.0 rework of the BQ incremental materialization, it's also a very exciting discovery. Big thanks to @clausherther for his help on this!

Benchmarking

I've spent a lot of time yesterday and today benchmarking query runtime and cost along three variables: modeling approach, data volume, and incremental strategy. I'll have more to say about this on Discourse, and I hope to present some of my findings in tomorrow's office hours. Broadly:

At small data volumes, a simple merge into a clustered table is faster and scans less data than the multi-step, scripted, partition-based approach implemented in #2140.

As the target table increases in size (> 50 GB), the partition-based approach is still slower, but it becomes increasingly cost-effective compared to any simple merge, clustered or not.
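For anyone following along, here's a rough sketch of the multi-step, scripted, partition-based shape being compared against (BigQuery scripting; names are made up and this isn't the exact SQL from #2140): stage the new rows, work out which date partitions they touch, then replace only those partitions in the destination.

```sql
declare partitions_to_replace array<date>;

-- 1. Stage the incremental rows in a temp table
create temp table events_tmp as
select * from analytics.events_staging
where event_at >= timestamp_sub(current_timestamp(), interval 3 day);

-- 2. Work out which date partitions the new rows touch
set partitions_to_replace = (
  select array_agg(distinct date(event_at)) from events_tmp
);

-- 3. Replace only those partitions: delete existing rows in them, insert the new ones
merge into analytics.events_clustered dest
using events_tmp src
on false
when not matched by source
  and date(dest.event_at) in unnest(partitions_to_replace) then delete
when not matched then insert row;
```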

Next steps

I think we should reimplement the simple merge as the default BigQuery incremental strategy and document the finding around clustered tables' improved performance.

We should also allow users to turn on the new partition-based scripting approach using a partition_merge strategy.

That would give us three strategies total: merge (simple, default), partition_merge, and insert_overwrite.
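Purely to illustrate what that configuration surface might look like from a model's point of view (the partition_merge value is only the name proposed above, not an existing config; model and column names are hypothetical):

```sql
-- incremental_strategy would be one of: 'merge' (simple, default),
-- 'partition_merge' (the proposal here), or 'insert_overwrite'
{{
  config(
    materialized = 'incremental',
    unique_key = 'event_id',
    partition_by = {'field': 'event_at', 'data_type': 'timestamp'},
    incremental_strategy = 'partition_merge'
  )
}}

select * from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_at > (select max(event_at) from {{ this }})
{% endif %}
```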

@jtcohen6 jtcohen6 added enhancement New feature or request bigquery triage labels Mar 11, 2020
@drewbanin drewbanin removed the triage label Mar 11, 2020
@drewbanin drewbanin added this to the Barbara Gittings milestone Mar 11, 2020
@jtcohen6
Contributor Author

@drewbanin and I had a chance to talk about this and game-plan for 0.16.0.

To simplify the release, we are going to:

  • Keep simple merge as the default incremental strategy. Users can improve model performance by adding a cluster_by config (see the sketch below this list).
  • Include the new insert_overwrite incremental strategy.
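A minimal sketch of that plan from the model side, assuming made-up model and column names: cluster the target by the same column used as unique_key so the default merge prunes well.

```sql
-- Default merge strategy; clustering the target by the merge key
-- is what keeps the merge scan (and cost) down
{{
  config(
    materialized = 'incremental',
    unique_key = 'event_id',
    partition_by = {'field': 'event_at', 'data_type': 'timestamp'},
    cluster_by = 'event_id'
  )
}}

select * from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_at > (select max(event_at) from {{ this }})
{% endif %}
```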

PR and documentation to come!


fhoffa commented Mar 12, 2020

This is also important for merge costs in BigQuery:

#2136 (set_sql_header works well with 'table', but not 'incremental')
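For context, a hedged sketch of the kind of usage #2136 describes (the UDF and model names here are made up): a SQL header that has to be emitted ahead of the model's DML, which currently survives table runs but is dropped on incremental runs.

```sql
{% call set_sql_header(config) %}
  create temp function cents_to_dollars(cents int64)
  returns float64 as (cents / 100.0);
{% endcall %}

select
  payment_id,
  cents_to_dollars(amount_cents) as amount_usd
from {{ ref('stg_payments') }}
```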


fhoffa commented Mar 12, 2020

related: fhoffa/code_snippets#2 (review)

@jtcohen6
Contributor Author

@fhoffa Good point, the fix for the missing set_sql_header interpolation in incremental runs would touch a few of the same lines of code that I'm changing in #2198.

I don't know if @drewbanin is up for trying to sneak that fix into the next minor release (0.16.0). If not, it could come in the next patch release.

@jtcohen6
Contributor Author

@fhoffa It seems that the behavior we're seeing in merge queries is of a piece with the phenomenon you noted for aggregate queries in Aug 2018:

The query estimator doesn’t show any benefits for clustering
BigQuery provides an estimate for how much data each query will query before running the query. Without clustering, said estimate is exact. With clustering the estimate is an upper bound, and the query might end up querying way less, as shown above.
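In other words (a hypothetical illustration rather than a measured result): on a table clustered by event_id, a query like the one below gets a validator / dry-run estimate equal to the full size of the referenced columns, but the bytes actually processed at run time can come in much lower because BigQuery prunes clustered blocks.

```sql
select count(*)
from analytics.events_clustered
where event_id = 'abc123';
```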

@drewbanin
Contributor

@jtcohen6 can we close this out now that #2198 has been merged?
