feat(go/adbc/driver): add support for Google BigQuery #1722

Open
wants to merge 38 commits into
base: main

cocoa-xu
Contributor

@cocoa-xu cocoa-xu commented Apr 15, 2024

Hi this PR is a preliminary Go implementation for Google BigQuery as the preferred approach to PR #1717.

Currently it supports query functionality as a proof of concept. Users can:

  • set most supported options for statements
  • send queries and read the result table in Arrow format

It gives the same results as #1717 when this driver is used from Elixir via elixir-explorer/adbc.

Mix.install([{:adbc, "~> 0.3.2-dev", github: "elixir-explorer/adbc"}])

defmodule BigqueryTest do
  def test do
    children = [
      {Adbc.Database,
       "adbc.bigquery.sql.project_id": "bigquery-poc-418913",
       driver: "libadbc_driver_bigquery.dylib",
       process_options: [name: MyApp.DB]},
      {Adbc.Connection, database: MyApp.DB, process_options: [name: MyApp.Conn]}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)

    dbg(
      Adbc.Connection.query(MyApp.Conn, "SELECT * FROM google_trends.small_top_terms LIMIT 7", [],
        "adbc.bigquery.sql.query.write_disposition": "WRITE_TRUNCATE"
      )
    )
  end
end

BigqueryTest.test()
[bigquery.exs:16: BigqueryTest.test/0]
Adbc.Connection.query(MyApp.Conn, "SELECT * FROM google_trends.small_top_terms LIMIT 7", [],
  "adbc.bigquery.sql.query.write_disposition": "WRITE_TRUNCATE"
) #=> {:ok,
 %Adbc.Result{
   num_rows: nil,
   data: %{
     "dma_id" => [546, 546, 546, 546, 546, 546, 546],
     "dma_name" => ["Columbia SC", "Columbia SC", "Columbia SC", "Columbia SC",
      "Columbia SC", "Columbia SC", "Columbia SC"],
     "rank" => [15, 15, 15, 15, 15, 15, 15],
     "refresh_date" => [~D[2024-03-14], ~D[2024-03-14], ~D[2024-03-14],
      ~D[2024-03-14], ~D[2024-03-14], ~D[2024-03-14], ~D[2024-03-14]],
     "score" => [nil, nil, nil, nil, nil, nil, nil],
     "term" => ["Nex Benedict", "Nex Benedict", "Nex Benedict", "Nex Benedict",
      "Nex Benedict", "Nex Benedict", "Nex Benedict"],
     "week" => [~D[2020-12-13], ~D[2020-12-20], ~D[2021-02-21], ~D[2021-02-28],
      ~D[2021-03-07], ~D[2021-03-14], ~D[2021-04-04]]
   }
 }}

There are still a few things to be done:

  • set credentials when initialising the database; currently Google Cloud SDK will automatically find and use credentials saved on local storage (generated by gcloud auth application-default login)
  • implement GetInfo, GetTableSchema and other functions for BigQuery's AdbcConnection and AdbcStatement
    • get table constraints and return them in corresponding info objects (currently impossible to do so)
    • implement Bind and BindStream
    • implement ExecuteSchema?
    • implement ReadPartition and ExecutePartitions?
    • implement Substrait execution?
  • add tests for this driver

@github-actions github-actions bot added this to the ADBC Libraries 1.0.0 milestone Apr 15, 2024
@lidavidm lidavidm changed the title feat(go/driver/bigquery): add support for Google BigQuery feat(go/adbc/driver/bigquery): add support for Google BigQuery Apr 15, 2024
@cocoa-xu cocoa-xu force-pushed the feat/go-google-bigquery-support branch from 525d7c2 to 568d677 Compare April 16, 2024 06:06
@lidavidm
Member

@zeroshade do you think you could give this a brief scan and make sure things are on the right track?

Member

@zeroshade zeroshade left a comment

This is a great start!! Thanks!

Left a ton of comments for you

go/adbc/driver/bigquery/bigquery_database.go (outdated, resolved)
go/adbc/driver/bigquery/bigquery_database.go (outdated, resolved)
go/adbc/driver/bigquery/bigquery_database.go (outdated, resolved)
go/adbc/driver/bigquery/bigquery_database.go (outdated, resolved)
go/adbc/driver/bigquery/bigquery_database.go (outdated, resolved)
go/adbc/driver/bigquery/statement.go (outdated, resolved)
go/adbc/driver/bigquery/statement.go (outdated, resolved)
go/adbc/driver/bigquery/statement.go (outdated, resolved)
go/adbc/driver/bigquery/statement.go (outdated, resolved)
go/adbc/driver/bigquery/statement.go (outdated, resolved)
@cocoa-xu
Contributor Author

This is a great start!! Thanks!

Left a ton of comments for you

Hi @zeroshade, thank you so much for the code review! Sorry, I only picked up some Go skills in the past week, based on the snowflake implementation. The issues you mentioned should be fixed now, and I'll implement the rest of the APIs and try to stick to these Go standards :)

@cocoa-xu
Contributor Author

Hi, I've updated the PR and implemented a bit more, although I'm not 100% sure this is the right/best way to implement some of the functions... I'll be happy to make any changes.

Besides that, I also updated the todo list in the top comment. While I'd like to implement these functions as much as I can, please do let me know if we can put off any of them and address them in another PR. :)

@lidavidm
Member

all those TODOs are fine to split into later PRs

@cocoa-xu
Contributor Author

all those TODOs are fine to split into later PRs

Got it! Then we can probably merge this first once we're happy with it. I'll do separate PRs for the remaining bits. :)

And once again, thank you all for the great help and your time for the code review. @lidavidm @zeroshade ❤️

@cocoa-xu cocoa-xu marked this pull request as ready for review April 22, 2024 13:01
@cocoa-xu cocoa-xu requested a review from lidavidm as a code owner April 22, 2024 13:01
@zeroshade
Member

I agree with @lidavidm that the TODOs are fine to split into later PRs. Thanks for your work here! I'll give this a new review pass tomorrow. For now I approved the CI to run, looks like there's some pre-commit formatting/linting issues you have to resolve among other failures.

@cocoa-xu
Contributor Author

I agree with @lidavidm that the TODOs are fine to split into later PRs. Thanks for your work here! I'll give this a new review pass tomorrow. For now I approved the CI to run, looks like there's some pre-commit formatting/linting issues you have to resolve among other failures.

Thank you very much @zeroshade!! I'll resolve these issues along with any issues you may point out in the code review 😃

@lidavidm
Member

I think you'll want to try that rebase again 😅

@cocoa-xu cocoa-xu force-pushed the feat/go-google-bigquery-support branch from 72f998a to e68ed26 Compare April 24, 2024 08:55
@cocoa-xu
Contributor Author

I think you'll want to try that rebase again 😅

git is hard... now it should work I guess

Member

@zeroshade zeroshade left a comment

I did a first pass reviewing this. I'll do another pass tomorrow to get the rest

go/adbc/driver/bigquery/connection.go (outdated, resolved)
go/adbc/driver/bigquery/connection.go (outdated, resolved)
go/adbc/driver/bigquery/connection.go (outdated, resolved)
go/adbc/driver/bigquery/connection.go (outdated, resolved)
Comment on lines 275 to 282
err := c.getDatasetsInProject(ctx, catalog, dbSchema, func(dataset *bigquery.Dataset) error {
	val, ok := result[dataset.ProjectID]
	if !ok {
		result[dataset.ProjectID] = make([]string, 0)
	}
	result[dataset.ProjectID] = append(val, dataset.DatasetID)
	return nil
})
Member

This function is apparently deprecated so we shouldn't be using it.

Contributor Author

Got it. I'll remove this function.

go/adbc/driver/bigquery/connection.go (outdated, resolved)
Comment on lines 615 to 634
func patternToRegexp(pattern *string) (*regexp.Regexp, error) {
	patternString := ""
	if pattern != nil {
		patternString = *pattern
	}
	patternString = strings.TrimSpace(patternString)

	convertedPattern := ".*"
	if patternString != "" {
		convertedPattern = fmt.Sprintf("(?i)^%s$", strings.ReplaceAll(strings.ReplaceAll(patternString, "_", "."), "%", ".*"))
	}
	r, err := regexp.Compile(convertedPattern)
	if err != nil {
		return nil, adbc.Error{
			Code: adbc.StatusInvalidArgument,
			Msg:  fmt.Sprintf("Cannot parse pattern `%s`: %s", patternString, err.Error()),
		}
	}
	return r, nil
}
Member

Does bigquery not support SQL pattern syntax?

Contributor Author

Sorry, this was found in the csharp codebase; I assumed it was the result of careful thought and design decisions, so I copied the implementation. I'll remove this function.

private string PatternToRegEx(string pattern)
{
    if (pattern == null)
        return ".*";
    StringBuilder builder = new StringBuilder("(?i)^");
    string convertedPattern = pattern.Replace("_", ".").Replace("%", ".*");
    builder.Append(convertedPattern);
    builder.Append("$");
    return builder.ToString();
}
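
For readers following the pattern discussion: both conversions above pass the rest of the pattern to the regex engine unescaped, so a literal `.` or `(` in a name would be treated as regex syntax. A minimal Go sketch of a stricter conversion is shown below. It is not code from this PR; `likeToRegexp` is a hypothetical name, and it assumes only `%` and `_` are wildcards (no escape character support).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// likeToRegexp (hypothetical helper) converts a SQL LIKE-style pattern into an
// anchored, case-insensitive regular expression. "%" matches any run of
// characters, "_" matches exactly one character, and every other character is
// matched literally. A nil or blank pattern matches everything.
func likeToRegexp(pattern *string) (*regexp.Regexp, error) {
	if pattern == nil || strings.TrimSpace(*pattern) == "" {
		return regexp.Compile(".*")
	}
	var b strings.Builder
	b.WriteString("(?i)^")
	for _, r := range strings.TrimSpace(*pattern) {
		switch r {
		case '%':
			b.WriteString(".*")
		case '_':
			b.WriteString(".")
		default:
			// Escape regex metacharacters so they match themselves.
			b.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	b.WriteString("$")
	return regexp.Compile(b.String())
}

func main() {
	p := "my.data_%"
	re, err := likeToRegexp(&p)
	if err != nil {
		panic(err)
	}
	fmt.Println(re.MatchString("my.dataX_prod")) // true
	fmt.Println(re.MatchString("myXdata__prod")) // false: "." is literal here
}
```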

Comment on lines 637 to 667
pattern, err := patternToRegexp(projectIDPattern)
if err != nil {
	return err
}
if !pattern.MatchString(c.client.Project()) {
	return iterator.Done
}

pattern, err = patternToRegexp(datasetsIDPattern)
if err != nil {
	return err
}

it := c.client.Datasets(ctx)
for {
	dataset, err := it.Next()
	if err != nil {
		if errors.Is(err, iterator.Done) {
			break
		}
		return err
	}

	if pattern.MatchString(dataset.DatasetID) {
		err = cb(dataset)
		if err != nil {
			return err
		}
	}
}
return nil
Member

is this really the only way we can do this? we can't just make a sql query or otherwise get this information from bigquery, we have to iterate and perform the pattern matching ourselves?

Contributor Author

It's possible only if the user keeps all datasets in a single location; otherwise we have to run multiple SQL queries to get all datasets. Currently there are 13 regions in the Americas, 11 in Asia Pacific, 12 in Europe, 3 in the Middle East and 1 in Africa, a total of 40 regions worldwide (reference). That would mean 40 SQL queries to get all datasets in one project, because BigQuery unfortunately doesn't support cross-region queries even when we're only interested in metadata. This means we cannot do something like

SELECT
  * 
FROM 
  region-us-central1.INFORMATION_SCHEMA.SCHEMATA
UNION ALL 
SELECT 
  *
FROM
  region-us-west1.INFORMATION_SCHEMA.SCHEMATA;

because it would result in errors

Screenshot 2024-04-25 at 06 29 20

Also, their example says you can do

SELECT * FROM region-us.INFORMATION_SCHEMA.SCHEMATA;

This looks as if it retrieves all datasets in the US, but that's not the case: it only covers the datasets in the multi-regional location.
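
To make the cost described above concrete, here is a rough Go sketch of what the SQL-only route would involve: one `INFORMATION_SCHEMA.SCHEMATA` query per region, each of which would then be submitted separately. This is not code from this PR; `schemataQueries` is a hypothetical helper, the region list is illustrative rather than complete, the project-qualified form of the table path is an assumption, and the identifiers are not sanitized.

```go
package main

import "fmt"

// schemataQueries (hypothetical helper) returns one SCHEMATA query per region,
// because a single statement cannot read metadata from several regions.
// NOTE: projectID and regions are interpolated as-is; real code would have to
// validate or sanitize them first.
func schemataQueries(projectID string, regions []string) []string {
	queries := make([]string, 0, len(regions))
	for _, region := range regions {
		queries = append(queries, fmt.Sprintf(
			"SELECT schema_name FROM `%s`.`region-%s`.INFORMATION_SCHEMA.SCHEMATA",
			projectID, region))
	}
	return queries
}

func main() {
	// Illustrative subset; the full list discussed above has 40 entries.
	regions := []string{"us-central1", "us-west1", "europe-west1"}
	for _, q := range schemataQueries("my-project", regions) {
		fmt.Println(q)
	}
}
```

Each returned string would still need its own round trip, which is why enumerating datasets through the client API looks cheaper in practice.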

Comment on lines 719 to 569
queryString := fmt.Sprintf("SELECT * FROM `%s`.`%s`.INFORMATION_SCHEMA.COLUMNS WHERE table_name = @tableName", sanitizedCatalog, sanitizedDbSchema)
query := c.client.Query(queryString)
query.Parameters = []bigquery.QueryParameter{
	{
		Name:  "tableName",
		Value: tableName,
	},
}
Member

can't we do this for the other cases (datasets and projects) too? rather than having to manually filter/process them?

Contributor Author

I'm so sorry I might be completely wrong with the following SQL query but it seems that BigQuery doesn't support this.

Screenshot 2024-04-25 at 05 38 06

Contributor Author

As for INFORMATION_SCHEMA.SCHEMATA, it should return this information per the docs here, but I had no luck with it either through the API or in the cloud console.

I'm not quite sure why; maybe there are some options I have to set somewhere that the docs don't mention (and that might be totally obvious to SQL experts).

Screenshot 2024-04-25 at 05 45 55 Screenshot 2024-04-25 at 05 48 36

return schema, nil
}

func (c *connectionImpl) Token() (*oauth2.Token, error) {
Member

does this need to be exported? Can we keep this internal?

Contributor Author

Sorry I might be wrong, but I was thinking that this is the required interface function for oauth2.TokenSource...?

// A TokenSource is anything that can return a token.
type TokenSource interface {
	// Token returns a token or an error.
	// Token must be safe for concurrent use by multiple goroutines.
	// The returned Token must not be modified.
	Token() (*Token, error)
}
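
As a side note on the export question: the method name has to be the exported `Token` for the interface to be satisfied from another package, but the type carrying it can stay unexported. One possible arrangement, sketched below, moves the method onto a small unexported adapter so it does not sit on the connection type at all. This is only a sketch and not code from this PR: `Token` and `TokenSource` are local stand-ins for the `oauth2` types so the example compiles on its own, and `connection`, `tokenSource` and `accessToken` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// Token and TokenSource are local stand-ins for oauth2.Token and
// oauth2.TokenSource, declared here only to keep the sketch self-contained.
type Token struct {
	AccessToken string
	Expiry      time.Time
}

type TokenSource interface {
	Token() (*Token, error)
}

// connection is a hypothetical stand-in for the driver's connection type.
type connection struct {
	accessToken string
}

// tokenSource is an unexported adapter: only this type carries the Token
// method, and it is never visible outside the package.
type tokenSource struct {
	conn *connection
}

// Token returns the connection's current token, or an error if none is set.
func (t tokenSource) Token() (*Token, error) {
	if t.conn.accessToken == "" {
		return nil, fmt.Errorf("no access token configured")
	}
	return &Token{
		AccessToken: t.conn.accessToken,
		Expiry:      time.Now().Add(time.Hour),
	}, nil
}

// Compile-time check that the adapter satisfies the interface.
var _ TokenSource = tokenSource{}

func main() {
	var src TokenSource = tokenSource{conn: &connection{accessToken: "abc"}}
	tok, err := src.Token()
	if err != nil {
		panic(err)
	}
	fmt.Println(tok.AccessToken)
}
```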

zeroshade pushed a commit that referenced this pull request Apr 25, 2024
As pointed out by @zeroshade
[here](#1722 (comment)),
we should fix the formatting of the comment.
@cocoa-xu
Contributor Author

cocoa-xu commented Apr 25, 2024

is this really the only way we can do this? we can't just make a sql query or otherwise get this information from bigquery, we have to iterate and perform the pattern matching ourselves?

tl;dr: yes.

According to the replies in googleapis/google-cloud-go#10044 and the BigQuery docs, the answer is yes: we have to enumerate projects and datasets ourselves (and, as said in the reply, enumerating projects requires ResourceManager, since the current bigquery client is not designed for this). Otherwise we're effectively limited to a single region of a project, because it's impossible to query datasets stored in multiple regions with a single query.

The alternative would be to wait for Google to support fetching all datasets in a project, regardless of their location, with a single SQL query.

zeroshade pushed a commit that referenced this pull request Apr 26, 2024
As pointed out by @zeroshade
[here](#1722 (comment)),
this should be handled by doing `adbc.Error{Msg: ctx.Err(), Code:....}`.
@cocoa-xu cocoa-xu force-pushed the feat/go-google-bigquery-support branch from 3508188 to a2e47a2 Compare April 28, 2024 20:06
@lidavidm lidavidm removed this from the ADBC Libraries 1.0.0 milestone May 3, 2024
cocoa-xu added a commit to meowcraft-dev/arrow-adbc that referenced this pull request May 8, 2024
As pointed out by @zeroshade
[here](apache#1722 (comment)),
we should fix the formatting of the comment.
cocoa-xu added a commit to meowcraft-dev/arrow-adbc that referenced this pull request May 8, 2024
…#1769)

As pointed out by @zeroshade
[here](apache#1722 (comment)),
this should be handled by doing `adbc.Error{Msg: ctx.Err(), Code:....}`.
@cocoa-xu
Contributor Author

Hi @zeroshade, sorry for the ping here. I was wondering if we're waiting for a solution to #1841 first before continuing on this driver; or if we're not really happy with the limitations in Google's Cloud SDK (i.e., have to use their APIs to retrieve datasets and schemas instead of doing these queries in SQL)?

@zeroshade
Member

@cocoa-xu I was waiting for the unit tests to get updated and passing, and then I completely forgot about this. I'll take a new look through this now.

or if we're not really happy with the limitations in Google's Cloud SDK (i.e., have to use their APIs to retrieve datasets and schemas instead of doing these queries in SQL)?

If that's the standard way to do it, then that's the way we do it 😄

Can you resolve the conflicts while I give this another lookover?

@cocoa-xu
Contributor Author

cocoa-xu commented May 22, 2024

Can you resolve the conflicts while I give this another lookover?

Sure thing! Although I'm not very familiar with resolving version conflicts in go.mod and go.sum: I'm not sure which entries should be updated (if they're shared between projects) and which should stay as they are (if they're pinned by some projects). I'll try my best and hope it looks alright.

@github-actions github-actions bot added this to the ADBC Libraries 13 milestone May 22, 2024
@cocoa-xu cocoa-xu changed the title feat(go/adbc/driver/bigquery): add support for Google BigQuery feat(go/adbc/driver): add support for Google BigQuery May 22, 2024
Member

@zeroshade zeroshade left a comment

This is really shaping up and looking good. I've got a bunch of small nitpicks. But more importantly, the thing that is missing here is testing. Please add some tests, possibly using the validation test suite that already exists (have a look at the snowflake driver_test.go file for examples)

go/adbc/driver/bigquery/connection.go (outdated, resolved)
go/adbc/driver/bigquery/connection.go (outdated, resolved)
Comment on lines 581 to 586
if columns == nil {
	columns = make(map[string]int)
	for i, f := range reader.Schema().Fields() {
		columns[strings.ToUpper(f.Name)] = i
	}
}
Member

the schema already has functionality for retrieving fields by name (it internally maintains a map index) so you shouldn't need to manually construct this mapping.

If the issue is that the casing is unknown, then at worst you could construct this mapping outside of the loop as record readers have a Schema method
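
The fallback described in the second paragraph of this comment might look something like the sketch below: build the case-insensitive name-to-position map once, before the record loop, and only do lookups inside it. This is not code from this PR; `buildColumnIndex` is a hypothetical helper, and a plain slice of names stands in for the fields returned by the reader's schema.

```go
package main

import (
	"fmt"
	"strings"
)

// buildColumnIndex (hypothetical helper) maps upper-cased field names to their
// positions so that later lookups do not depend on the casing of the source.
func buildColumnIndex(fieldNames []string) map[string]int {
	index := make(map[string]int, len(fieldNames))
	for i, name := range fieldNames {
		index[strings.ToUpper(name)] = i
	}
	return index
}

func main() {
	// Stand-in for the field names taken from the reader's schema.
	fields := []string{"table_name", "column_name", "data_type"}

	// Built once, before iterating over record batches.
	columns := buildColumnIndex(fields)

	for batch := 0; batch < 2; batch++ {
		// Inside the loop only cheap map lookups remain.
		fmt.Println(batch, columns["COLUMN_NAME"])
	}
}
```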

go/adbc/driver/bigquery/connection.go (outdated, resolved)
go/adbc/driver/bigquery/connection.go (outdated, resolved)
go/adbc/driver/bigquery/statement.go (outdated, resolved)
go/adbc/driver/bigquery/statement.go (resolved)
go/adbc/driver/bigquery/statement.go (resolved)
go/adbc/driver/bigquery/statement.go (outdated, resolved)
go/adbc/driver/bigquery/statement.go (outdated, resolved)
@cocoa-xu
Contributor Author

Hi @zeroshade, I've addressed most of the issues mentioned in the code review. I'll let you know once it's ready for another review. :)

3 participants