[BEAM-10587] Support Maps in BigQuery #12389
Conversation
R: @lukecwik
R: @apilloud
This needs tests!
BigQuery doesn't fundamentally support maps. This change modifies the IO to implicitly run unnest() on maps when writing to BigQuery. It seems like this could create confusion for users. What is the advantage of this over the user calling unnest() in their SQL? The BigQuery IO is bidirectional, users will expect to be able to read their maps back. Is that something you are considering adding?
Indeed, I will add appropriate tests in
The motivation for this change is to allow Avro maps (or anything that can become a Beam Schema Map type) to flow directly into BigQuery. I agree that a map becoming an array of key/value records is not completely intuitive, but it seemed the best solution that I was able to figure out within the current features of BigQuery schemas.
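To illustrate the mapping being discussed, here is a minimal sketch (not Beam's actual code; the class and method names are hypothetical) of how a Beam map type would be rendered as a repeated BigQuery record with "key" and "value" subfields:

```java
// Hypothetical sketch of the schema mapping discussed in this PR:
// a Beam MAP<K, V> becomes a BigQuery REPEATED RECORD with exactly
// two subfields, "key" and "value". The names MapSchemaSketch and
// mapEntrySchema are illustrative only.
public class MapSchemaSketch {

    /**
     * Given the BigQuery types of the map's key and value, describe the
     * repeated record that stands in for the map, e.g.
     * MAP<STRING, INT64> -> REPEATED RECORD<key STRING, value INT64>.
     */
    public static String mapEntrySchema(String keyType, String valueType) {
        return "REPEATED RECORD<key " + keyType + ", value " + valueType + ">";
    }

    public static void main(String[] args) {
        System.out.println(mapEntrySchema("STRING", "INT64"));
    }
}
```

In this representation, a map with N entries is written as an array of N {key, value} rows, which is why reading it back requires recognizing that shape (see the discussion below).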
Yes, that makes sense. It appears this would be done in
Hey @rworley-monster - any progress on this? Do you need help?
Yes, it's in progress, but temporarily preempted by a higher priority task. I hope to get back to it and add the follow-up commit next week.
I have added support for reading BigQuery map records back into Beam Schema and Row. Can you please confirm that I am on the right track and let me know if there are any other locations that would need this new support for maps? Tests still need to be added and I hope to get to that next week.
@@ -263,6 +267,18 @@ private static FieldType fromTableFieldSchemaType(
          "SqlTimestampWithLocalTzType", FieldType.STRING, "", FieldType.DATETIME) {});
      case "STRUCT":
      case "RECORD":
        // check if record represents a map entry
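The check flagged in the diff above could look roughly like the following. This is a hypothetical sketch for discussion, not the PR's actual implementation: a STRUCT/RECORD is treated as a map entry only when it has exactly two subfields named "key" and "value".

```java
import java.util.List;

// Hypothetical heuristic for the "is this record a map entry?" check:
// exactly two subfields, named "key" and "value". All names here are
// illustrative; the real PR operates on Beam/BigQuery schema objects.
public class MapEntryCheck {

    /** fieldNames: the names of the record's subfields. */
    public static boolean looksLikeMapEntry(List<String> fieldNames) {
        return fieldNames.size() == 2
                && fieldNames.contains("key")
                && fieldNames.contains("value");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeMapEntry(List.of("key", "value")));   // true
        System.out.println(looksLikeMapEntry(List.of("key", "payload"))); // false
    }
}
```

The weakness of a purely name-based check is exactly the reviewers' second concern below: a user's ordinary {key, value} struct is indistinguishable from a converted map.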
This is something that is going to need more discussion. I have two concerns:
- The ZetaSQL dialect doesn't support map types, so this will break that use case.
- The user might have a row matching this struct that isn't supposed to be a map.
Regarding the second concern, I wonder if a user with a { "key", "value" } struct would always be satisfied to work with the field as a Map in Beam. If not, then I suppose we would need to somehow tag the field schema with a marker that a user would not reasonably collide with. The options I see are:
- Field names (could be something like "beam_key" or "map_key"). Personally, I'm not a fan of affecting schema names like this.
- Extra field (in addition to "key" and "value"), could be something like "beam_map" with all true or null values. Similarly messy like above.
- Tag in field description(s) with a warning for the user to not delete it. Might be something like #beam-map-do-not-delete#.
Though these options require that the fields are created via Beam and not by the user beforehand. Or the user would need to know to use these special markers to enable the Beam map functionality.
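The third option above (a marker in the field description) could be sketched like this. The marker string comes from the comment above; everything else is an illustrative assumption, not code from this PR:

```java
// Sketch of the description-marker option: only treat a {key, value}
// record as a map when its field description carries a known tag.
// The marker string is from the PR discussion; the class and method
// names are hypothetical.
public class DescriptionMarker {
    static final String MAP_MARKER = "#beam-map-do-not-delete#";

    /** Returns true only if the field's description carries the map tag. */
    public static boolean isTaggedAsMap(String fieldDescription) {
        return fieldDescription != null && fieldDescription.contains(MAP_MARKER);
    }
}
```

The trade-off noted above still applies: the tag only exists if Beam created the table, so tables created by hand would need the marker added manually to opt in.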
I left some comments, I think you have everything needed to make this work for the Beam SQL Calcite dialect.
I have added tests for the conversion of Beam schema maps to BigQuery and back. Can you please let me know if there is anything else that I can do to help prepare this for approval?
Still looks good, here are the remaining items:
@rworley-monster - What is the next step on this PR?
I have written to the dev list and we will see how the conversation goes about the implicit conversion. And I will attempt to add the BigQuery integration test. |
/cc relevant folks: @chamikaramj @pabloem |
Based on feedback from the dev list, I have added a
LGTM
Before I can merge, the commit history needs a little cleanup. I can squash everything into a single commit if that is fine with you. If you want to clean up yourself, the merge commits need to be removed (git fetch origin && git rebase origin/master will do that) and any fixup commits squashed (git rebase -i origin/master).
Thanks, feel free to squash, I think all of the changes are a logical unit.
Maps are currently not supported when converting a Beam Schema to a BigQuery TableFieldSchema. This improvement will convert Maps to repeated records with 'key' and 'value' fields.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Choose reviewer(s) and mention them in a comment (R: @username).
- Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.