BQ clustering can improve merge performance #2196

Closed
jtcohen6 opened this issue Mar 11, 2020 · 6 comments
Labels: bigquery, enhancement (New feature or request)

Comments


jtcohen6 commented Mar 11, 2020

Description

On BigQuery, running a simple merge statement—as the incremental materialization does on versions <=0.15.2—appears to scan significantly less data if the target table is clustered.

This makes some intuitive sense when the cluster key and the unique_key for merge equality are identical, and even some slight sense when they're correlated (e.g. the merge key is event_id, the cluster key is session_id, and the former is contained within the latter). I'm seeing the benefit for all clustered tables, however, no matter which column(s) the table is clustered by.
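For concreteness, a minimal sketch of the shape involved (table and column names are made up, and this isn't the exact SQL dbt generates): a plain merge whose target table is clustered by the same column used as the unique_key.

```sql
-- Hypothetical target table, clustered by the same column used as the merge key
create table if not exists analytics.events_clustered
partition by date(event_at)
cluster by event_id
as select * from analytics.events_staging where false;

-- A simple merge, roughly what the incremental materialization issues.
-- With the destination clustered, BigQuery appears to scan far less of it
-- than the dry-run estimate suggests.
merge into analytics.events_clustered dest
using analytics.events_staging src
on dest.event_id = src.event_id
when matched then update set
  session_id = src.session_id,
  event_at = src.event_at
when not matched then insert (event_id, session_id, event_at)
values (src.event_id, src.session_id, src.event_at);
```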

This feels like a relatively recent BQ performance improvement, and—wow. While it throws a small wrench into our 0.16.0 rework of the BQ incremental materialization, it's also a very exciting discovery. Big thanks to @clausherther for his help on this!

Benchmarking

I've spent a lot of time yesterday and today benchmarking query runtime and cost along three variables: modeling approach, data volume, and incremental strategy. I'll have more to say about this on Discourse, and I hope to present some of my findings in tomorrow's office hours. Broadly:

At small data volumes, a simple merge into a clustered table is faster and scans less data than the multi-step, scripted, partition-based approach implemented in #2140.

As the target table increases in size (> 50 GB), the partition-based approach is still slower, but it becomes increasingly cost-effective compared to any simple merge, clustered or not.
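For anyone following along, here's a rough sketch of the multi-step, scripted, partition-based shape being compared against (BigQuery scripting; names are made up and this isn't the exact SQL from #2140): stage the new rows, work out which date partitions they touch, then replace only those partitions in the destination.

```sql
declare partitions_to_replace array<date>;

-- 1. Stage the incremental rows in a temp table
create temp table events_tmp as
select * from analytics.events_staging
where event_at >= timestamp_sub(current_timestamp(), interval 3 day);

-- 2. Work out which date partitions the new rows touch
set partitions_to_replace = (
  select array_agg(distinct date(event_at)) from events_tmp
);

-- 3. Replace only those partitions: delete existing rows in them, insert the new ones
merge into analytics.events_clustered dest
using events_tmp src
on false
when not matched by source
  and date(dest.event_at) in unnest(partitions_to_replace) then delete
when not matched then insert row;
```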

Next steps

I think we should reimplement the simple merge as the default BigQuery incremental strategy and document the finding around clustered tables' improved performance.

We should also allow users to turn on the new partition-based scripting approach using a partition_merge strategy.

That would give us three strategies total: merge (simple, default), partition_merge, and insert_overwrite.
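Purely to illustrate what that configuration surface might look like from a model's point of view (the partition_merge value is only the name proposed above, not an existing config; model and column names are hypothetical):

```sql
-- incremental_strategy would be one of: 'merge' (simple, default),
-- 'partition_merge' (the proposal here), or 'insert_overwrite'
{{
  config(
    materialized = 'incremental',
    unique_key = 'event_id',
    partition_by = {'field': 'event_at', 'data_type': 'timestamp'},
    incremental_strategy = 'partition_merge'
  )
}}

select * from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_at > (select max(event_at) from {{ this }})
{% endif %}
```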

@jtcohen6 jtcohen6 added enhancement New feature or request bigquery triage labels Mar 11, 2020
@drewbanin drewbanin removed the triage label Mar 11, 2020
@drewbanin drewbanin added this to the Barbara Gittings milestone Mar 11, 2020
@jtcohen6
Contributor Author

@drewbanin and I had a chance to talk about this and game-plan for 0.16.0.

To simplify the release, we are going to:

  • Keep simple merge as the default incremental strategy. Users can improve model performance by adding a cluster_by config (see the sketch below this list).
  • Include the new insert_overwrite incremental strategy.
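A minimal sketch of that plan from the model side, assuming made-up model and column names: cluster the target by the same column used as unique_key so the default merge prunes well.

```sql
-- Default merge strategy; clustering the target by the merge key
-- is what keeps the merge scan (and cost) down
{{
  config(
    materialized = 'incremental',
    unique_key = 'event_id',
    partition_by = {'field': 'event_at', 'data_type': 'timestamp'},
    cluster_by = 'event_id'
  )
}}

select * from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_at > (select max(event_at) from {{ this }})
{% endif %}
```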

PR and documentation to come!


fhoffa commented Mar 12, 2020

This is also important for merge costs in BigQuery:

#2136 (set_sql_header works well with 'table', but not 'incremental')
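For context, a hedged sketch of the kind of usage #2136 describes (the UDF and model names here are made up): a SQL header that has to be emitted ahead of the model's DML, which currently survives table runs but is dropped on incremental runs.

```sql
{% call set_sql_header(config) %}
  create temp function cents_to_dollars(cents int64)
  returns float64 as (cents / 100.0);
{% endcall %}

select
  payment_id,
  cents_to_dollars(amount_cents) as amount_usd
from {{ ref('stg_payments') }}
```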


fhoffa commented Mar 12, 2020

related: fhoffa/code_snippets#2 (review)

@jtcohen6
Contributor Author

@fhoffa Good point, the fix for the missing set_sql_header interpolation in incremental runs would touch a few of the same lines of code that I'm changing in #2198.

I don't know if @drewbanin is up for trying to sneak that fix into the next minor release (0.16.0). If not, it could come in the next patch release.

@jtcohen6
Contributor Author

@fhoffa It seems that the behavior we're seeing in merge queries is of a piece with the phenomenon you noted for aggregate queries in Aug 2018:

The query estimator doesn’t show any benefits for clustering
BigQuery provides an estimate for how much data each query will query before running the query. Without clustering, said estimate is exact. With clustering the estimate is an upper bound, and the query might end up querying way less, as shown above.
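In other words (a hypothetical illustration rather than a measured result): on a table clustered by event_id, a query like the one below gets a validator / dry-run estimate equal to the full size of the referenced columns, but the bytes actually processed at run time can come in much lower because BigQuery prunes clustered blocks.

```sql
select count(*)
from analytics.events_clustered
where event_id = 'abc123';
```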

@drewbanin
Contributor

@jtcohen6 can we close this out now that #2198 has been merged?
