Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-10587] Support Maps in BigQuery #12389

Merged
merged 10 commits into from Oct 15, 2020
Merged

[BEAM-10587] Support Maps in BigQuery #12389

merged 10 commits into from Oct 15, 2020

Conversation

rworley-monster
Copy link
Contributor

@rworley-monster rworley-monster commented Jul 28, 2020

Maps are currently not supported when converting a Beam Schema to a BigQuery TableFieldSchema. This improvement will convert Maps to repeated records with 'key' and 'value' fields.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status --- Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@rworley-monster
Copy link
Contributor Author

R: @lukecwik

@robertwb
Copy link
Contributor

R: @apilloud

Copy link
Member

@apilloud apilloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs tests!

BigQuery doesn't fundamentally support maps. This change modifies the IO to implicitly run unnest() on maps when writing to BigQuery. It seems like this could create confusion for users. What is the advantage of this over the user calling unnest() in their SQL? The BigQuery IO is bidirectional, users will expect to be able to read their maps back. Is that something you are considering adding?

cc: @robinyqiu @kennknowles @chamikaramj

@rworley-monster
Copy link
Contributor Author

This needs tests!

Indeed, I will add appropriate tests in BigQueryUtilsTest.

BigQuery doesn't fundamentally support maps. This change modifies the IO to implicitly run unnest() on maps when writing to BigQuery. It seems like this could create confusion for users.

The motivation for this change is to allow Avro maps (or anything that can become a Beam Schema Map type) to flow directly into BigQuery. I agree that a map becoming an array of key/value records is not completely intuitive, but it seemed the best solution that I was able to figure out within the current features of BigQuery schemas.

The BigQuery IO is bidirectional, users will expect to be able to read their maps back. Is that something you are considering adding?

Yes, that makes sense. It appears this would be done in fromTableFieldSchemaType, we would detect if a record has only 'key' and 'value' fields, and in that case return a Map FieldType. A possible concern is that this would intercept key/value records that did not necessarily originate as Beam Maps. Perhaps this would be a desired feature in many cases, but it can be debated.

@aaltay
Copy link
Member

aaltay commented Aug 6, 2020

Hey @rworley-monster - any progress on this? Do you need help?

@rworley-monster
Copy link
Contributor Author

Yes, it's in progress, but temporarily preempted by a higher priority task. I hope to get back to it and add the follow-up commit next week.

@rworley-monster
Copy link
Contributor Author

I have added support for reading BigQuery map records back into Beam Schema and Row. Can you please confirm that I am on the right track and let me know if there are any other locations that would need this new support for maps? Tests still need to be added and I hope to get to that next week.

@@ -263,6 +267,18 @@ private static FieldType fromTableFieldSchemaType(
"SqlTimestampWithLocalTzType", FieldType.STRING, "", FieldType.DATETIME) {});
case "STRUCT":
case "RECORD":
// check if record represents a map entry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that is going to need more discussion. I have two concerns:

  1. The ZetaSQL dialect doesn't support map types, so this will break that use case.
  2. The user might have a row matching this struct that isn't suppose to be a map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the second concern, I wonder if a user with a { "key", "value" } struct would always be satisfied to work with the field as a Map in Beam. If not, then I suppose we would need to somehow tag the field schema with a marker that a user would not reasonably collide with. The options I see are:

  1. Field names (could be something like "beam_key" or "map_key"). Personally, I'm not a fan of affecting schema names like this.
  2. Extra field (in addition to "key" and "value"), could be something like "beam_map" with all true or null values. Similarly messy like above.
  3. Tag in field description(s) with a warning for the user to not delete it. Might be something like #beam-map-do-not-delete#.

Though these options require that the fields are created via Beam and not by the user beforehand. Or the user would need to know to use these special markers to enable the Beam map functionality.

@apilloud
Copy link
Member

I left some comments, I think you have everything needed to make this work for the Beam SQL calcite dialect.

@rworley-monster
Copy link
Contributor Author

I have added tests for the conversion of Beam schema maps to BigQuery and back. Can you please let me know if there is anything else that I can do to help prepare this for approval?

@apilloud
Copy link
Member

Still looks good, here are the remaining items:

  1. We need to figure out implicitly converting RECORD/STRUCT type to MAP. This has some far reaching impacts, so about sending a email to dev@beam.apache.org explaining what you are doing?
  2. Add this to an integration test with BigQuery. This file would probably be an easy place to add tests (it also includes Beam SQL):
    https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryReadWriteIT.java

@aaltay
Copy link
Member

aaltay commented Oct 8, 2020

@rworley-monster - What is the next step on this PR?

@rworley-monster
Copy link
Contributor Author

I have written to the dev list and we will see how the conversation goes about the implicit conversion. And I will attempt to add the BigQuery integration test.

@aaltay
Copy link
Member

aaltay commented Oct 12, 2020

/cc relevant folks: @chamikaramj @pabloem

@rworley-monster
Copy link
Contributor Author

Based on feedback from the dev list, I have added a SchemaConversionOptions class with the option to infer maps when converting a BigQuery TableSchema to a Beam Schema. The existing and default behaviour is unchanged, users will continue to get an array of rows. If they enable this new option, they will get a map when the TableSchema contains a field with the expected array<struct{key, value}> schema.

Copy link
Member

@apilloud apilloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Before I can merge, the commit history needs a little cleanup. I can squash everything into a single commit if that is fine with you. If you want to clean up, the merge commits need to be removed (git fetch origin && git rebase origin/master will do that) and squash any fixup commits (git rebase -i origin/master).

@rworley-monster
Copy link
Contributor Author

Thanks, feel free to squash, I think all of the changes are a logical unit.

@apilloud apilloud merged commit 0e0ffac into apache:master Oct 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants