feat(go/adbc/driver): add support for Google BigQuery #1722

cocoa-xu · 2024-04-15T16:56:58Z

Hi, this PR is a preliminary Go implementation of a Google BigQuery driver, as the preferred alternative to the approach in PR #1717.

Currently it supports query functionality as a proof of concept. Users can:

- set most supported options for statements
- send queries and read the result table in Arrow format

Used from Elixir via elixir-explorer/adbc, this driver gives the same results as in #1717.

Mix.install([{:adbc, "~> 0.3.2-dev", github: "elixir-explorer/adbc"}])

defmodule BigqueryTest do
  def test do
    children = [
      {Adbc.Database,
       "adbc.bigquery.sql.project_id": "bigquery-poc-418913",
       driver: "libadbc_driver_bigquery.dylib",
       process_options: [name: MyApp.DB]},
      {Adbc.Connection, database: MyApp.DB, process_options: [name: MyApp.Conn]}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)

    dbg(
      Adbc.Connection.query(MyApp.Conn, "SELECT * FROM google_trends.small_top_terms LIMIT 7", [],
        "adbc.bigquery.sql.query.write_disposition": "WRITE_TRUNCATE"
      )
    )
  end
end

BigqueryTest.test()

[bigquery.exs:16: BigqueryTest.test/0]
Adbc.Connection.query(MyApp.Conn, "SELECT * FROM google_trends.small_top_terms LIMIT 7", [],
  "adbc.bigquery.sql.query.write_disposition": "WRITE_TRUNCATE"
) #=> {:ok,
 %Adbc.Result{
   num_rows: nil,
   data: %{
     "dma_id" => [546, 546, 546, 546, 546, 546, 546],
     "dma_name" => ["Columbia SC", "Columbia SC", "Columbia SC", "Columbia SC",
      "Columbia SC", "Columbia SC", "Columbia SC"],
     "rank" => [15, 15, 15, 15, 15, 15, 15],
     "refresh_date" => [~D[2024-03-14], ~D[2024-03-14], ~D[2024-03-14],
      ~D[2024-03-14], ~D[2024-03-14], ~D[2024-03-14], ~D[2024-03-14]],
     "score" => [nil, nil, nil, nil, nil, nil, nil],
     "term" => ["Nex Benedict", "Nex Benedict", "Nex Benedict", "Nex Benedict",
      "Nex Benedict", "Nex Benedict", "Nex Benedict"],
     "week" => [~D[2020-12-13], ~D[2020-12-20], ~D[2021-02-21], ~D[2021-02-28],
      ~D[2021-03-07], ~D[2021-03-14], ~D[2021-04-04]]
   }
 }}

There are still a few things to be done:

- set credentials when initialising the database; currently the Google Cloud SDK automatically finds and uses credentials saved in local storage (generated by gcloud auth application-default login)
- ~~implement GetInfo, GetTableSchema and other functions for BigQuery's AdbcConnection and AdbcStatement~~
- ~~get table constraints and return them in corresponding info objects~~ (currently impossible to do so)
- implement Bind and BindStream
- implement ExecuteSchema?
- implement ReadPartition and ExecutePartitions?
- implement Substrait execution?
- add tests for this driver

lidavidm · 2024-04-18T07:35:10Z

@zeroshade do you think you could give this a brief scan and make sure things are on the right track?

zeroshade

This is a great start!! Thanks!

Left a ton of comments for you

go/adbc/driver/bigquery/bigquery_database.go

go/adbc/driver/bigquery/statement.go

cocoa-xu · 2024-04-20T07:55:03Z

> This is a great start!! Thanks!
>
> Left a ton of comments for you

Hi @zeroshade, thank you so much for the code review! Sorry, I only picked up some Go in the past week, working from the snowflake implementation. The issues you mentioned should now be fixed, and I'll implement the remaining APIs and try to stick to these standards in Go :)

cocoa-xu · 2024-04-21T20:40:55Z

Hi, I've updated the PR and implemented a bit more, although I'm not 100% sure this is the right/best way to implement some of the functions... I'll be happy to make any changes.

Besides that, I also updated the todo list in the top comment. While I'd like to implement these functions as much as I can, please do let me know if we can put off any of them and address them in another PR. :)

lidavidm · 2024-04-22T01:58:17Z

all those TODOs are fine to split into later PRs

cocoa-xu · 2024-04-22T12:46:43Z

> all those TODOs are fine to split into later PRs

Got it! Then we can probably merge this first once we're happy with it. I'll do separate PRs for the remaining bits. :)

And once again, thank you all for the great help and your time for the code review. @lidavidm @zeroshade ❤️

zeroshade · 2024-04-23T14:53:17Z

I agree with @lidavidm that the TODOs are fine to split into later PRs. Thanks for your work here! I'll give this a new review pass tomorrow. For now I approved the CI to run, looks like there's some pre-commit formatting/linting issues you have to resolve among other failures.

cocoa-xu · 2024-04-23T18:11:47Z

> I agree with @lidavidm that the TODOs are fine to split into later PRs. Thanks for your work here! I'll give this a new review pass tomorrow. For now I approved the CI to run, looks like there's some pre-commit formatting/linting issues you have to resolve among other failures.

Thank you very much @zeroshade!! I'll resolve these issues along with any issues you may point out in the code review 😃

lidavidm · 2024-04-24T08:37:39Z

I think you'll want to try that rebase again 😅

cocoa-xu · 2024-04-24T08:56:03Z

> I think you'll want to try that rebase again 😅

git is hard... now it should work I guess

go/adbc/driver/bigquery/connection.go

zeroshade

I did a first pass reviewing this. I'll do another pass tomorrow to get the rest

go/adbc/driver/bigquery/connection.go

zeroshade · 2024-04-24T20:52:55Z

go/adbc/driver/bigquery/connection.go

+	err := c.getDatasetsInProject(ctx, catalog, dbSchema, func(dataset *bigquery.Dataset) error {
+		val, ok := result[dataset.ProjectID]
+		if !ok {
+			result[dataset.ProjectID] = make([]string, 0)
+		}
+		result[dataset.ProjectID] = append(val, dataset.DatasetID)
+		return nil
+	})


This function is apparently deprecated so we shouldn't be using it.

Got it. I'll remove this function.

go/adbc/driver/bigquery/connection.go

zeroshade · 2024-04-24T21:04:08Z

go/adbc/driver/bigquery/connection.go

+func patternToRegexp(pattern *string) (*regexp.Regexp, error) {
+	patternString := ""
+	if pattern != nil {
+		patternString = *pattern
+	}
+	patternString = strings.TrimSpace(patternString)
+
+	convertedPattern := ".*"
+	if patternString != "" {
+		convertedPattern = fmt.Sprintf("(?i)^%s$", strings.ReplaceAll(strings.ReplaceAll(patternString, "_", "."), "%", ".*"))
+	}
+	r, err := regexp.Compile(convertedPattern)
+	if err != nil {
+		return nil, adbc.Error{
+			Code: adbc.StatusInvalidArgument,
+			Msg:  fmt.Sprintf("Cannot parse pattern `%s`: %s", patternString, err.Error()),
+		}
+	}
+	return r, nil
+}


Does bigquery not support SQL pattern syntax?

Sorry, I found this in the C# codebase and assumed it was there as the result of careful thought, so I copied the implementation. I'll remove this function.

arrow-adbc/csharp/src/Drivers/BigQuery/BigQueryConnection.cs

Lines 735 to 746 in 282819d

private string PatternToRegEx(string pattern)
{
    if (pattern == null)
        return ".*";

    StringBuilder builder = new StringBuilder("(?i)^");
    string convertedPattern = pattern.Replace("_", ".").Replace("%", ".*");
    builder.Append(convertedPattern);
    builder.Append("$");

    return builder.ToString();
}
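For illustration only: both the Go and C# versions above substitute `_` and `%` directly, so any regex metacharacter in the user's pattern (such as `.`) keeps its regex meaning. A minimal sketch of a conversion that escapes everything else might look like the following. The name `likeToRegexp` is hypothetical and not part of this PR, and the sketch ignores SQL escape sequences such as `\%`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// likeToRegexp is a hypothetical helper (not from the PR) that converts a
// SQL LIKE-style pattern into an anchored, case-insensitive regexp.
// '%' becomes ".*", '_' becomes ".", and every other rune is escaped so
// it is matched literally.
func likeToRegexp(pattern string) (*regexp.Regexp, error) {
	pattern = strings.TrimSpace(pattern)
	if pattern == "" {
		// An empty pattern matches everything, as in the code under review.
		return regexp.Compile(".*")
	}

	var sb strings.Builder
	sb.WriteString("(?i)^")
	for _, r := range pattern {
		switch r {
		case '%':
			sb.WriteString(".*")
		case '_':
			sb.WriteString(".")
		default:
			sb.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	sb.WriteString("$")
	return regexp.Compile(sb.String())
}

func main() {
	re, err := likeToRegexp("my.data%")
	if err != nil {
		panic(err)
	}
	fmt.Println(re.MatchString("my.dataset")) // the literal dot matches
	fmt.Println(re.MatchString("myXdataset")) // 'X' is not a literal dot
}
```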

zeroshade · 2024-04-24T21:05:01Z

go/adbc/driver/bigquery/connection.go

+	pattern, err := patternToRegexp(projectIDPattern)
+	if err != nil {
+		return err
+	}
+	if !pattern.MatchString(c.client.Project()) {
+		return iterator.Done
+	}
+
+	pattern, err = patternToRegexp(datasetsIDPattern)
+	if err != nil {
+		return err
+	}
+
+	it := c.client.Datasets(ctx)
+	for {
+		dataset, err := it.Next()
+		if err != nil {
+			if errors.Is(err, iterator.Done) {
+				break
+			}
+			return err
+		}
+
+		if pattern.MatchString(dataset.DatasetID) {
+			err = cb(dataset)
+			if err != nil {
+				return err
+			}
+		}
+	}
+	return nil


is this really the only way we can do this? we can't just make a sql query or otherwise get this information from bigquery, we have to iterate and perform the pattern matching ourselves?

It's possible if the user moves all datasets to a single location; otherwise we have to run multiple SQL queries to get all datasets. There are currently 13 regions in the Americas, 11 in Asia Pacific, 12 in Europe, 3 in the Middle East and 1 in Africa, for a total of 40 regions worldwide (reference). That would mean 40 SQL queries to list all datasets in one project because, sadly, BigQuery doesn't support cross-region queries even when we're only interested in metadata. So we cannot do something like

SELECT * FROM region-us-central1.INFORMATION_SCHEMA.SCHEMATA UNION ALL SELECT * FROM region-us-west1.INFORMATION_SCHEMA.SCHEMATA;

because it would result in errors

Also, their example says you can do

SELECT * FROM region-us.INFORMATION_SCHEMA.SCHEMATA;

That makes it look as though it magically retrieves all datasets in the US region, but that's not the case: it only counts the datasets stored in the multi-regional location.

zeroshade · 2024-04-24T21:06:54Z

go/adbc/driver/bigquery/connection.go

+	queryString := fmt.Sprintf("SELECT * FROM `%s`.`%s`.INFORMATION_SCHEMA.COLUMNS WHERE table_name = @tableName", sanitizedCatalog, sanitizedDbSchema)
+	query := c.client.Query(queryString)
+	query.Parameters = []bigquery.QueryParameter{
+		{
+			Name:  "tableName",
+			Value: tableName,
+		},
+	}


can't we do this for the other cases (datasets and projects) too? rather than having to manually filter/process them?

I'm sorry, I might be completely wrong about the following SQL query, but it seems that BigQuery doesn't support this.

As for INFORMATION_SCHEMA.SCHEMATA, it should return this information per the docs here, but I had no luck with it either through the API or in their Cloud console.

I'm not quite sure why; maybe there are some options I have to set somewhere that the docs don't mention (and that might be totally apparent to SQL experts).

zeroshade · 2024-04-24T21:07:52Z

go/adbc/driver/bigquery/connection.go

+	return schema, nil
+}
+
+func (c *connectionImpl) Token() (*oauth2.Token, error) {


does this need to be exported? Can we keep this internal?

Sorry I might be wrong, but I was thinking that this is the required interface function for oauth2.TokenSource...?

// A TokenSource is anything that can return a token.
type TokenSource interface {
	// Token returns a token or an error.
	// Token must be safe for concurrent use by multiple goroutines.
	// The returned Token must not be modified.
	Token() (*Token, error)
}
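To make the point about exporting concrete: a Go type satisfies an interface only if its method names match exactly, so a lowercase `token` method would not satisfy an interface that declares `Token`. The sketch below uses local stand-ins for `oauth2.Token` and `oauth2.TokenSource` so that it compiles without external modules; the `connectionImpl` fields are assumptions, not the PR's actual struct.

```go
package main

import (
	"errors"
	"fmt"
)

// Token and TokenSource are local stand-ins mirroring the shape of the
// oauth2 types quoted above; they are not the real oauth2 package.
type Token struct {
	AccessToken string
}

type TokenSource interface {
	Token() (*Token, error)
}

// connectionImpl is a stripped-down, hypothetical connection type.
type connectionImpl struct {
	accessToken string
}

// Token has to be exported because the interface method is exported.
func (c *connectionImpl) Token() (*Token, error) {
	if c.accessToken == "" {
		return nil, errors.New("no access token configured")
	}
	return &Token{AccessToken: c.accessToken}, nil
}

// Compile-time assertion that *connectionImpl satisfies TokenSource.
var _ TokenSource = (*connectionImpl)(nil)

func main() {
	var ts TokenSource = &connectionImpl{accessToken: "example-token"}
	tok, err := ts.Token()
	fmt.Println(tok.AccessToken, err)
}
```

If keeping the method off the connection type matters, one option would be a small unexported adapter type that holds the connection and implements `Token` itself.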

cocoa-xu · 2024-04-25T18:50:01Z

> is this really the only way we can do this? we can't just make a sql query or otherwise get this information from bigquery, we have to iterate and perform the pattern matching ourselves?

tl;dr: yes.

According to the replies in googleapis/google-cloud-go#10044 and the BigQuery docs, the answer is yes: we have to enumerate projects and datasets ourselves (and, as said in the reply, enumerating projects requires ResourceManager, since the current bigquery client isn't designed for that). The only alternative leaves us effectively limited to a single region of a project, because it's impossible to query all datasets stored in multiple regions with a single query.

Short of that, we would have to wait for Google to implement a way to get all datasets in a project, regardless of their location, with a single SQL query.

cocoa-xu · 2024-05-22T14:25:02Z

Hi @zeroshade, sorry for the ping here. I was wondering if we're waiting for a solution to #1841 first before continuing on this driver; or if we're not really happy with the limitations in Google's Cloud SDK (i.e., have to use their APIs to retrieve datasets and schemas instead of doing these queries in SQL)?

zeroshade · 2024-05-22T14:38:02Z

@cocoa-xu I was waiting for the unit tests to get updated and passing, and then I completely forgot about this. I'll take a new look through this now.

> or if we're not really happy with the limitations in Google's Cloud SDK (i.e., have to use their APIs to retrieve datasets and schemas instead of doing these queries in SQL)?

If that's the standard way to do it, then that's the way we do it 😄

Can you resolve the conflicts while I give this another lookover?

cocoa-xu · 2024-05-22T14:42:28Z

> Can you resolve the conflicts while I give this another lookover?

Sure thing! I'm not very familiar with resolving version conflicts in go.mod and go.sum, though: I'm not sure which entries should be updated (if they're shared between projects) and which should stay as they are (if they're pinned by some projects). I'll try my best and hope it looks alright.

zeroshade

This is really shaping up and looking good. I've got a bunch of small nitpicks. But more importantly, the thing that is missing here is testing. Please add some tests, possibly using the validation test suite that already exists (have a look at the snowflake driver_test.go file for examples)

go/adbc/driver/bigquery/connection.go

zeroshade · 2024-05-22T16:07:08Z

go/adbc/driver/bigquery/connection.go

+		if columns == nil {
+			columns = make(map[string]int)
+			for i, f := range reader.Schema().Fields() {
+				columns[strings.ToUpper(f.Name)] = i
+			}
+		}


the schema already has functionality for retrieving fields by name (it internally maintains a map index) so you shouldn't need to manually construct this mapping.

If the issue is that the casing is unknown, then at worst you could construct this mapping outside of the loop as record readers have a Schema method
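A minimal sketch of the second suggestion, building the case-insensitive name index once before the loop, is shown below. It uses a plain slice of field names as a stand-in for an Arrow schema so that it runs without the Arrow module; `buildColumnIndex` is a hypothetical name. With a real `arrow.Schema`, name-based lookup such as `FieldIndices` may remove the need for the map altogether when the casing is known.

```go
package main

import (
	"fmt"
	"strings"
)

// buildColumnIndex is a hypothetical helper that maps upper-cased column
// names to their positions. It is meant to be called once, before
// iterating over record batches, rather than lazily inside the loop.
func buildColumnIndex(fieldNames []string) map[string]int {
	index := make(map[string]int, len(fieldNames))
	for i, name := range fieldNames {
		index[strings.ToUpper(name)] = i
	}
	return index
}

func main() {
	// Stand-in for the field names reported by reader.Schema().
	fieldNames := []string{"table_name", "Column_Name", "DATA_TYPE"}
	columns := buildColumnIndex(fieldNames)

	// Stand-in for the rows read from successive record batches.
	rows := [][]string{
		{"t1", "id", "INT64"},
		{"t1", "name", "STRING"},
	}
	for _, row := range rows {
		fmt.Println(row[columns["COLUMN_NAME"]], row[columns["DATA_TYPE"]])
	}
}
```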

go/adbc/driver/bigquery/connection.go

go/adbc/driver/bigquery/statement.go

cocoa-xu · 2024-05-22T21:59:30Z

Hi @zeroshade, I've addressed most of the issues mentioned in the code review. I'll let you know once it's ready for another review. :)

github-actions bot added this to the ADBC Libraries 1.0.0 milestone Apr 15, 2024

lidavidm changed the title ~~feat(go/driver/bigquery): add support for Google BigQuery~~ feat(go/adbc/driver/bigquery): add support for Google BigQuery Apr 15, 2024

cocoa-xu force-pushed the feat/go-google-bigquery-support branch from 525d7c2 to 568d677 Compare April 16, 2024 06:06

zeroshade reviewed Apr 19, 2024

cocoa-xu marked this pull request as ready for review April 22, 2024 13:01

cocoa-xu requested a review from lidavidm as a code owner April 22, 2024 13:01

cocoa-xu force-pushed the feat/go-google-bigquery-support branch from 841d2d7 to afd1e77 Compare April 24, 2024 08:31

cocoa-xu requested review from kou, wjones127 and CurtHagenlocher as code owners April 24, 2024 08:31

cocoa-xu force-pushed the feat/go-google-bigquery-support branch from afd1e77 to 72f998a Compare April 24, 2024 08:34

cocoa-xu force-pushed the feat/go-google-bigquery-support branch from 72f998a to e68ed26 Compare April 24, 2024 08:55

zeroshade reviewed Apr 24, 2024

zeroshade reviewed Apr 24, 2024

This was referenced Apr 25, 2024

fix(go/adbc/driver/snowflake): comment format #1768

Merged

fix(go/adbc/driver/flightsql): should use ctx.Err().Error() #1769

Merged

zeroshade pushed a commit that referenced this pull request Apr 25, 2024

fix(go/adbc/driver/snowflake): comment format (#1768)

96e05a0

As pointed out by @zeroshade [here](#1722 (comment)), we should fix the formatting of the comment.

zeroshade pushed a commit that referenced this pull request Apr 26, 2024

fix(go/adbc/driver/flightsql): should use ctx.Err().Error() (#1769)

59eede4

As pointed out by @zeroshade [here](#1722 (comment)), this should be handled by doing `adbc.Error{Msg: ctx.Err(), Code:....}`.

cocoa-xu added 2 commits April 29, 2024 04:05

feat(go/driver/bigquery): basic skeletons

27d2f00

feat(go/driver/bigquery): handled most used options in `bigquery.QueryConfig`

6775bb0

cocoa-xu added 12 commits April 29, 2024 04:05

SetAutocommit(true) should be valid

b63a7f5

fix: close bigquery client in connectionImpl.Close

ffcd5a4

return adbc.StatusNotImplemented for GetObjectsCatalogs

ac92627

use map[string]arrow.DataType for simple data types

2690210

always store original DATA_TYPE value in metadata

0468d78

remove implementation for driverbase.DbObjectsEnumerator

e8a31e1

fix: call arrow.ListOf in buildField for ARRAY data type

cc7bf5a

fix: get table types from bigquery directly

f892fde

fix: use regexp for parsePrecisionAndScale

0d98b4e

fix: do not sanitize user inputs

fc375c8

minor fix for comments

e06a12f

sanitize dataset name

a2e47a2

cocoa-xu force-pushed the feat/go-google-bigquery-support branch from 3508188 to a2e47a2 Compare April 28, 2024 20:06

lidavidm removed this from the ADBC Libraries 1.0.0 milestone May 3, 2024

cocoa-xu added 4 commits May 8, 2024 15:45

added basic support for Bind and BindStream

9246f1d

Merge branch 'main' into feat/go-google-bigquery-support

04b5a07

updated to use github.com/apache/arrow/go/v17/*

816e3a1

go mod tidy

a58803b

cocoa-xu added a commit to meowcraft-dev/arrow-adbc that referenced this pull request May 8, 2024

fix(go/adbc/driver/snowflake): comment format (apache#1768)

47755b3

As pointed out by @zeroshade [here](apache#1722 (comment)), we should fix the formatting of the comment.

Merge branch 'main' into feat/go-google-bigquery-support

c88b37f

github-actions bot added this to the ADBC Libraries 13 milestone May 22, 2024

cocoa-xu changed the title ~~feat(go/adbc/driver/bigquery): add support for Google BigQuery~~ feat(go/adbc/driver): add support for Google BigQuery May 22, 2024

zeroshade requested changes May 22, 2024

cocoa-xu added 2 commits May 22, 2024 22:46

addressed issues mentioned in code review

0d50822

allow user to configure result record buffer size

fb5b5d7
