Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-7829] Add schema names when converting with AvroUtils.toAvroSchema #9247

Merged
merged 2 commits into from
Aug 29, 2019

Conversation

RyanSkraba
Copy link
Contributor

@RyanSkraba RyanSkraba commented Aug 5, 2019

There's a subtle problem with unnamed records in Avro and Java -- they can be used to correctly serialize/deserialize binary data as long as nobody counts on serializing/deserializing the actual Avro schema as a JSON string.

This can work if both reader and writer can infer the schema from another source (like the Beam Schema), but is certain to break when passing the schema or record to a transform that expects the schema to meet the Avro specification.

More discussion at: https://issues.apache.org/jira/browse/AVRO-2492

This change ensures that no anonymous Avro records will be created as part of translating a Beam schema.

N.B. Credit goes to @iemejia for this implementation, I just cleaned up the tests.

N.B.2 I get some minor credit :D There were some bugs with potential namespace collisions. The internal implementation here should be very close to what Spark does with auto-generated Avro schemas.

R: @reuvenlax @kanterov

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@iemejia iemejia requested a review from reuvenlax August 5, 2019 19:23
@iemejia iemejia assigned kanterov and unassigned kanterov Aug 5, 2019
@iemejia iemejia requested a review from kanterov August 5, 2019 19:23
@RyanSkraba
Copy link
Contributor Author

Please hold off on reviewing -- taking a look at the Spark strategy for naming schemas, I think this implementation introduces a bug. I'm creating a unit test to check and will push a fix.

@RyanSkraba
Copy link
Contributor Author

R: @kanterov

@iemejia: I updated the implementation to use exactly the same naming strategy as Spark for namespaces, and added a test for common collision cased.

@RyanSkraba
Copy link
Contributor Author

Run Java PreCommit

@RyanSkraba RyanSkraba force-pushed the BEAM-7829-name-all-records branch 3 times, most recently from e7ddf44 to 7f57470 Compare August 28, 2019 07:36
@kanterov
Copy link
Member

I will wait for master build becoming green, and then we can merge it.

@RyanSkraba
Copy link
Contributor Author

Run Java PreCommit

1 similar comment
@kanterov
Copy link
Member

Run Java PreCommit

@kanterov kanterov merged commit 9e152b7 into apache:master Aug 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants