feat(go/adbc/driver): add support for Google BigQuery #1722
base: main
Conversation
force-pushed from 525d7c2 to 568d677
@zeroshade do you think you could give this a brief scan and make sure things are on the right track?
This is a great start!! Thanks!
Left a ton of comments for you
Hi @zeroshade, thank you so much for the code review! And sorry, I only picked up some Go skills in the past week based on the snowflake implementation. The issues you mentioned should now be fixed, and I'll implement the rest of the APIs and try to stick to these standards in Go :)
Hi, I've updated and implemented a bit more, although I'm not 100% sure if this is the right/best way to do some of these functions... I'll be happy to make any changes. Besides that, I also updated the todo list in the top comment. While I'd like to implement these functions as much as I can, please do let me know if we can put off any of them and address them in another PR. :)
all those TODOs are fine to split into later PRs
Got it! Then we can probably merge this first once we're happy with it. I'll do separate PRs for the remaining bits. :) And once again, thank you all for the great help and your time on the code review. @lidavidm @zeroshade ❤️
I agree with @lidavidm that the TODOs are fine to split into later PRs. Thanks for your work here! I'll give this a new review pass tomorrow. For now I approved the CI to run; it looks like there are some pre-commit formatting/linting issues you have to resolve, among other failures.
Thank you very much @zeroshade!! I'll resolve these issues along with any issues you may point out in the code review 😃 |
force-pushed from 841d2d7 to afd1e77
force-pushed from afd1e77 to 72f998a
I think you'll want to try that rebase again 😅
force-pushed from 72f998a to e68ed26
git is hard... now it should work I guess
I did a first pass reviewing this. I'll do another pass tomorrow to get the rest
err := c.getDatasetsInProject(ctx, catalog, dbSchema, func(dataset *bigquery.Dataset) error {
	val, ok := result[dataset.ProjectID]
	if !ok {
		result[dataset.ProjectID] = make([]string, 0)
	}
	result[dataset.ProjectID] = append(val, dataset.DatasetID)
	return nil
})
This function is apparently deprecated so we shouldn't be using it.
Got it. I'll remove this function.
func patternToRegexp(pattern *string) (*regexp.Regexp, error) {
	patternString := ""
	if pattern != nil {
		patternString = *pattern
	}
	patternString = strings.TrimSpace(patternString)

	convertedPattern := ".*"
	if patternString != "" {
		convertedPattern = fmt.Sprintf("(?i)^%s$", strings.ReplaceAll(strings.ReplaceAll(patternString, "_", "."), "%", ".*"))
	}
	r, err := regexp.Compile(convertedPattern)
	if err != nil {
		return nil, adbc.Error{
			Code: adbc.StatusInvalidArgument,
			Msg:  fmt.Sprintf("Cannot parse pattern `%s`: %s", patternString, err.Error()),
		}
	}
	return r, nil
}
Does bigquery not support SQL pattern syntax?
Sorry, this was found in the C# codebase, and I assumed it was there after some careful thought and decisions, so I copied the implementation. I'll remove this function.
arrow-adbc/csharp/src/Drivers/BigQuery/BigQueryConnection.cs
Lines 735 to 746 in 282819d
private string PatternToRegEx(string pattern)
{
    if (pattern == null)
        return ".*";

    StringBuilder builder = new StringBuilder("(?i)^");
    string convertedPattern = pattern.Replace("_", ".").Replace("%", ".*");
    builder.Append(convertedPattern);
    builder.Append("$");
    return builder.ToString();
}
pattern, err := patternToRegexp(projectIDPattern)
if err != nil {
	return err
}
if !pattern.MatchString(c.client.Project()) {
	return iterator.Done
}

pattern, err = patternToRegexp(datasetsIDPattern)
if err != nil {
	return err
}

it := c.client.Datasets(ctx)
for {
	dataset, err := it.Next()
	if err != nil {
		if errors.Is(err, iterator.Done) {
			break
		}
		return err
	}

	if pattern.MatchString(dataset.DatasetID) {
		err = cb(dataset)
		if err != nil {
			return err
		}
	}
}
return nil
is this really the only way we can do this? we can't just make a sql query or otherwise get this information from bigquery, we have to iterate and perform the pattern matching ourselves?
It's possible if the user moves all datasets to a single location; otherwise we have to do multiple SQL queries to actually get all datasets. Currently there are 13 regions in the Americas, 11 in Asia Pacific, 12 in Europe, 3 in the Middle East, and 1 in Africa, a total of 40 regions in the world (reference). That would mean 40 SQL queries to get all datasets in one project, because sadly BigQuery doesn't support cross-region queries even if we're only interested in some metadata, which means we cannot do something like
SELECT
*
FROM
region-us-central1.INFORMATION_SCHEMA.SCHEMATA
UNION ALL
SELECT
*
FROM
region-us-west1.INFORMATION_SCHEMA.SCHEMATA;
because it would result in errors
Also, their example says you can do
SELECT * FROM region-us.INFORMATION_SCHEMA.SCHEMATA;
And it appears that it can magically retrieve all datasets in the US region, but it's not the case: that would only count the datasets in the multi-regional locations.
queryString := fmt.Sprintf("SELECT * FROM `%s`.`%s`.INFORMATION_SCHEMA.COLUMNS WHERE table_name = @tableName", sanitizedCatalog, sanitizedDbSchema)
query := c.client.Query(queryString)
query.Parameters = []bigquery.QueryParameter{
	{
		Name:  "tableName",
		Value: tableName,
	},
}
can't we do this for the other cases (datasets and projects) too? rather than having to manually filter/process them?
As for using INFORMATION_SCHEMA.SCHEMATA: it should return this information per the docs here, but I had no luck with it either via the API or in their cloud console. I'm not quite sure about the reason; maybe there are some options somewhere I have to set that they didn't mention in the docs (options that might be totally apparent to SQL experts).
	return schema, nil
}

func (c *connectionImpl) Token() (*oauth2.Token, error) {
does this need to be exported? Can we keep this internal?
Sorry, I might be wrong, but I was thinking that this is the required interface function for oauth2.TokenSource?
// A TokenSource is anything that can return a token.
type TokenSource interface {
// Token returns a token or an error.
// Token must be safe for concurrent use by multiple goroutines.
// The returned Token must not be modified.
Token() (*Token, error)
}
As pointed out by @zeroshade [here](#1722 (comment)), we should fix the formatting of the comment.
tl;dr: yes. According to the replies in googleapis/google-cloud-go#10044 and the docs for BigQuery, we have to enumerate projects and datasets ourselves (and, as said in the reply, to enumerate projects we have to use the Resource Manager API, since the current BigQuery client isn't designed to do this); otherwise we're effectively limited to a single region of a project (it's impossible to query all datasets stored across multiple regions with a single query). The alternative would be to wait for Google to support getting all datasets in a project with a single SQL query, regardless of their location.
As pointed out by @zeroshade [here](#1722 (comment)), this should be handled by doing `adbc.Error{Msg: ctx.Err(), Code:....}`.
sanitize dataset name
force-pushed from 3508188 to a2e47a2
Hi @zeroshade, sorry for the ping here. I was wondering if we're waiting for a solution to #1841 first before continuing on this driver, or if we're not really happy with the limitations in Google's Cloud SDK (i.e., having to use their APIs to retrieve datasets and schemas instead of doing these queries in SQL)?
@cocoa-xu I was waiting for the unit tests to get updated and passing, and then I completely forgot about this. I'll take a new look through this now.
If that's the standard way to do it, then that's the way we do it 😄 Can you resolve the conflicts while I give this another lookover?
Sure thing! Although I'm not quite familiar with fixing different versions in …
This is really shaping up and looking good. I've got a bunch of small nitpicks, but more importantly, the thing that is missing here is testing. Please add some tests, possibly using the validation test suite that already exists (have a look at the snowflake driver_test.go file for examples).
if columns == nil {
	columns = make(map[string]int)
	for i, f := range reader.Schema().Fields() {
		columns[strings.ToUpper(f.Name)] = i
	}
}
the schema already has functionality for retrieving fields by name (it internally maintains a map index), so you shouldn't need to manually construct this mapping. If the issue is that the casing is unknown, then at worst you could construct this mapping outside of the loop, as record readers have a Schema method.
Hi @zeroshade, I've addressed most of the issues mentioned in the code review, I'll let you know once it's ready for another review. :)
Hi, this PR is a preliminary Go implementation for Google BigQuery, as the preferred approach to PR #1717.
Currently it supports query functionality as a proof of concept, users can
It gives the same results as #1717 when using this driver from Elixir via elixir-explorer/adbc.
There are still a few things to be done:

- implement GetInfo, GetTableSchema and other functions for BigQuery's AdbcConnection and AdbcStatement
- get table constraints and return them in corresponding info objects (currently impossible to do so)
- Bind and BindStream
- ExecuteSchema?
- ReadPartition and ExecutePartitions?