Implement BigQuery Driver #168

Open
judahrand opened this issue Nov 8, 2022 · 13 comments

Comments

@judahrand
Contributor

This could be another interesting one as the API can return Arrow formatted data. Perhaps implemented in Go as I believe that's the 1st class SDK?

@lidavidm
Member

BigQuery Storage sends Arrow over gRPC so it could be done natively for all of C++, Go, and Java. It would be interesting. Though, BQS can't evaluate SQL. So we might want to add the 'inverse' of the ADBC ingest API, for scanning a table without issuing an explicit query (or, specifying that drivers can translate a Substrait read request to such a scan).

I'm less familiar with the 'standard' BigQuery API. The REST API gives row-oriented JSON which isn't as great.
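
A minimal sketch of the kind of SQL-free table scan described above, written against the Python `google-cloud-bigquery-storage` client purely for illustration (an ADBC driver would more likely use C++, Go, or Java). The function names `table_path` and `scan_table` are hypothetical, and the sketch assumes the billing project is the same as the table's project and that `pyarrow` is installed.

```python
from typing import Iterator, Optional, Sequence


def table_path(project: str, dataset: str, table: str) -> str:
    """Build the fully qualified table name used by the Storage Read API."""
    return f"projects/{project}/datasets/{dataset}/tables/{table}"


def scan_table(
    project: str,
    dataset: str,
    table: str,
    columns: Optional[Sequence[str]] = None,
    row_filter: str = "",
) -> Iterator["object"]:
    """Yield Arrow record batches for a table scan; no query job is issued."""
    # Imported lazily so table_path() works without the client library installed.
    from google.cloud import bigquery_storage
    from google.cloud.bigquery_storage import types

    client = bigquery_storage.BigQueryReadClient()
    session = client.create_read_session(
        parent=f"projects/{project}",  # assumption: bill the table's own project
        read_session=types.ReadSession(
            table=table_path(project, dataset, table),
            data_format=types.DataFormat.ARROW,
            read_options=types.ReadSession.TableReadOptions(
                selected_fields=list(columns or []),  # column projection
                row_restriction=row_filter,  # simple filter, not general SQL
            ),
        ),
        max_stream_count=1,  # a single stream keeps the sketch simple
    )
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        for page in reader.rows(session).pages:
            yield page.to_arrow()
```

The only inputs are a table, an optional column list, and an optional row restriction, which is roughly the shape an 'inverse of ingest' call would need.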

@judahrand
Contributor Author

judahrand commented Nov 10, 2022

Yeah, the Go BQ/BQS SDK has docs on how to do this: https://github.com/GoogleCloudPlatform/golang-samples/blob/f2c65eb0ee3118298a5c8b84ca22067fe84eb5db/bigquery/bigquery_storage_quickstart/main.go#L333-L369

I appreciate that doing it in C++ first might be preferable?

@lidavidm
Member

(Sorry for the delay.) Go might be interesting just to prove it out quickly. It may also be interesting to see Go implement the C interface and build an embeddable shared/static library to reduce the maintenance costs. (Right now Go can bind to the C interface, but not yet the other way.)

@paleolimbot
Member

Obviously the Arrow interface is preferable, but I thought I'd post the C++ that the bigrquery R package uses to parse the JSON since the output data structure is pretty similar and I happen to know where it lives: https://github.com/r-dbi/bigrquery/blob/main/src/BqField.cpp

@lidavidm
Member

lidavidm commented Nov 25, 2022

Absolutely, but the Arrow interface only applies to (effectively) full table scans with some filters ("BigQuery Storage" != "BigQuery"), so we will need to parse one of the alternative outputs for general queries. Thanks for the reference though!

@judahrand
Contributor Author

> Absolutely, but the Arrow interface only applies to (effectively) full table scans with some filters ("BigQuery Storage" != "BigQuery"), so we will need to parse one of the alternative outputs for general queries. Thanks for the reference though!

The Python BigQuery SDK uses a trick to push any query into a table and then use BigQuery Storage API to fetch the result. We could use that here too in order to simplify things.

@lidavidm
Member

lidavidm commented Jan 5, 2023

Ah, interesting. That would be great, then. Thanks for pointing that out.

I assume that would have cost/pricing implications though, and would require you to materialize the result before reading it?

@judahrand
Contributor Author

judahrand commented Jan 5, 2023

This works because of this:
https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJob

BigQuery just 'handles' it in the backend through the caching mechanism. Effectively, you run the query twice: once with the BigQuery API and once with the BigQuery Storage API, but the second time it hits the cache. So I was a bit wrong about the Python SDK manually pushing the result into a table: it doesn't. So this shouldn't have cost implications.

@judahrand
Contributor Author

Other docs on the fact that BigQuery actually writes ALL queries to a table: https://cloud.google.com/bigquery/docs/cached-results
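
As a rough illustration of the point about results landing in a table, the sketch below runs a query with the Python `google-cloud-bigquery` client and reports where the results were written, in the `projects/.../datasets/.../tables/...` form that the Storage Read API takes. `query_result_table` is a hypothetical helper name; `client` is assumed to be a `google.cloud.bigquery.Client`, and the sketch relies on `QueryJob.destination` being populated once the job has finished.

```python
def query_result_table(client, sql: str) -> str:
    """Run a query and return the path of the table that holds its results.

    `client` is assumed to be a google.cloud.bigquery.Client (or anything
    with a compatible query() method).
    """
    job = client.query(sql)
    job.result()  # block until the query has finished
    dest = job.destination  # reference to the table the results were written to
    return (
        f"projects/{dest.project}"
        f"/datasets/{dest.dataset_id}"
        f"/tables/{dest.table_id}"
    )
```

With real credentials this would be called as `query_result_table(bigquery.Client(), "SELECT 1")`. The returned path could then be handed to a Storage Read API session to fetch the rows as Arrow.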

@lidavidm
Member

lidavidm commented Jan 5, 2023

Oops, clearly I don't understand BigQuery well enough. Thanks (again) for digging into this!

@judahrand
Contributor Author

Another useful source of inspiration - the Go SDK is in the process of implementing the same fast path as the Python SDK: googleapis/google-cloud-go#6822

To make the Go implementation straightforward, the interfaces built on top of the changes made there would need a method analogous to Python's RowIterator.to_arrow_iterable. That doesn't seem like a stretch, however.
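
For reference, a minimal sketch of how the Python method is used, so the shape of a Go analogue is easier to picture. `stream_query_as_arrow` is a hypothetical wrapper; `client` is assumed to be a `google.cloud.bigquery.Client`, and `bqstorage_client` an optional `BigQueryReadClient`. My understanding is that without a Storage client the library falls back to the slower REST download, but treat that as an assumption.

```python
from typing import Any, Iterator


def stream_query_as_arrow(
    client: Any, sql: str, bqstorage_client: Any = None
) -> Iterator[Any]:
    """Run a query and lazily yield its result as Arrow record batches."""
    rows = client.query(sql).result()  # RowIterator over the finished query
    # to_arrow_iterable() streams batches instead of building one big table.
    yield from rows.to_arrow_iterable(bqstorage_client=bqstorage_client)
```

Because it is a generator, nothing is sent to BigQuery until the caller starts iterating, and batches can be handed on one at a time.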

@zeroshade
Member

A nice aspect is that the Go BigQuery quickstart example (https://github.com/GoogleCloudPlatform/golang-samples/blob/main/bigquery/bigquery_storage_quickstart/main.go) actually uses the latest released version of Arrow (as opposed to Snowflake, which vendored a two-year-old version of Go Arrow right into their module).

@josevalim

I believe the Go client now exposes the Arrow iterator and data: googleapis/google-cloud-go#8506 (which is likely using the RPC API to read the data).
