Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-12464] Change ProtoSchemaTranslator beam schema creation to match the order for protobufs containing Oneof fields #14974

Merged
merged 11 commits into from
Jan 13, 2022

Conversation

reubenvanammers
Copy link
Contributor

@reubenvanammers reubenvanammers commented Jun 9, 2021

Currently, when ProtoSchemaTranslator creates the beam schema from a protobuf definition it always puts the Oneofs at the start of the beam schema due to oneofs always being created first. This means that the order of the fields doesn't match the order of the protobuf definition. As the schema generation is used when converting from beam rows to protobufs, it additionally means that it is impossible to convert from a beam row where the oneof fields are not the first fields in the beamrow.

This change orders the resultant beam schema to match the order in the protobuf definition.

Additionally, this PR updates the test protos so that the beam schema in the oneof tests matches the protobuf definition, as opposed to the beam row having the Oneofs first.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-12464] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

ValidatesRunner compliance status (on master branch)

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- Build Status Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Python --- Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Examples testing status on various runners

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- --- --- --- --- --- ---
Java --- Build Status
Build Status
Build Status
--- --- --- --- ---
Python --- --- --- --- --- --- ---
XLang --- --- --- --- --- --- ---

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Go Java Python
Build Status Build Status Build Status
Build Status
Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@reubenvanammers reubenvanammers force-pushed the feature/oneof_proto_ordering branch 2 times, most recently from 13001ad to 07ac88e Compare June 9, 2021 03:58
@reubenvanammers reubenvanammers changed the title [BEAM-12464] Allow oneoffields to be placed in the correct positionChange ProtoSchemaTranslator beam schema creation to match the order for protobufs containing Oneof fields [BEAM-12464] Change ProtoSchemaTranslator beam schema creation to match the order for protobufs containing Oneof fields Jun 9, 2021
…chema in accordance with their location in the protobuf definition
@reubenvanammers
Copy link
Contributor Author

Do you have any thoughts about this @reuvenlax ?

@pabloem
Copy link
Member

pabloem commented Jun 10, 2021

@alexvanboxel would you have time to take a look at this?

@reuvenlax
Copy link
Contributor

My previous comment seems to have been lost. Can you explain the use case where this is necessary? Since oneof fields in Beam are nested rows (unlike in proto), they still won't quite match the proto structure.

@reubenvanammers
Copy link
Contributor Author

reubenvanammers commented Jun 11, 2021

Can you explain the use case where this is necessary?

How I noticed this was when I was trying to convert a Beam Row into a protobuf, with the OneOf not in the first position. During the conversion, the it converts in the order of the converters, which in turn is the order of the generated Schema. Because the OneOf field is first in the generated schema, the first field in the row is attempted to be converted to a proto OneOf, which then fails. Basically this means that Rows containing OneOfs can't be converted to protos, except if all of the OneOfs are at the start of the row.

Obviously there are workarounds (I.e. reordering the beam row before converting to protobuf) , but I believe that the behaviour is surprising enough to warrant changing one way or the other.

@aaltay
Copy link
Member

aaltay commented Jul 1, 2021

Folks, what do we what do next? Does this require additional discussions?

@reubenvanammers
Copy link
Contributor Author

Folks, what do we what do next? Does this require additional discussions?

I still believe this change to be worthwhile, considering that currently not all beam schemas containing unions can be converted to protobuf. What would be useful to know whether or not anyone foresees issues arising from this change of behaviour/ anything needs to be added or changed. It'd be useful to get some feedback.

@aaltay
Copy link
Member

aaltay commented Jul 16, 2021

Folks, what do we what do next? Does this require additional discussions?

I still believe this change to be worthwhile, considering that currently not all beam schemas containing unions can be converted to protobuf. What would be useful to know whether or not anyone foresees issues arising from this change of behaviour/ anything needs to be added or changed. It'd be useful to get some feedback.

You could either add more reviewers or ask for additional comments on the dev@ list.

@pabloem
Copy link
Member

pabloem commented Aug 10, 2021

@reuvenlax do you have any concerns about this PR?

@ibzib are you able to review a change like this?

Copy link
Contributor

@ibzib ibzib left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the problem here is that OneOfType.create(List, Map) lets the user set arbitrary positions for each of the component types of the OneOfType. So there's no obvious place to put the oneof type in the schema. In the example in the test, the oneof component types are all consecutive (numbers 2-5), so it makes sense to put the oneof at position 2 in the schema. But if they're nonconsecutive, say 2 and 4, it seems kind of arbitrary to put it at position 2 instead of position 4. So we may want to just keep the current behavior and document it, including helpful error messages if we can.

@reubenvanammers
Copy link
Contributor Author

My understanding of the problem here is that OneOfType.create(List, Map) lets the user set arbitrary positions for each of the component types of the OneOfType. So there's no obvious place to put the oneof type in the schema. In the example in the test, the oneof component types are all consecutive (numbers 2-5), so it makes sense to put the oneof at position 2 in the schema. But if they're nonconsecutive, say 2 and 4, it seems kind of arbitrary to put it at position 2 instead of position 4. So we may want to just keep the current behavior and document it, including helpful error messages if we can.

Hi @ibzib , thanks for having a look!

My understanding is that descriptor.getFields() returns the fields in order that they are in the .proto file, not in order of the field number. This means that in the previous as well as the proposed change, the order of the resultant schema is based on the ordering of the fields in the .proto file, and not the field descriptor number. Because of this, all of the oneof fields are necessarily going to be "clumped", as oneof blocks are contiguous. Therefore, the position of the first field in the oneofblock is a useful and natural location for the entirety of the oneof fields for that particular oneof.

Does that make sense? Happy to write some unit tests for non monotonically increasing proto field descriptor numbers when I've got time if thats the main objection.

Copy link
Contributor

@ibzib ibzib left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense now. It's definitely worth adding a unit test where the oneof fields have noncontiguous numbers, and where the oneof field numbers don't line up with the position of the oneof.

@reubenvanammers
Copy link
Contributor Author

When you have some time @ibzib, mind having a look at this PR again? Added some unit tests with some more protobufs as well as adding some comments.

@ibzib
Copy link
Contributor

ibzib commented Aug 24, 2021

Hi Reuben, thanks for updating this PR. I don't have a lot of time at the moment, but I'll be sure to review this when I get the chance.

Copy link
Contributor

@ibzib ibzib left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing tests! LGTM

@reuvenlax
Copy link
Contributor

reuvenlax commented Aug 27, 2021 via email

@reubenvanammers
Copy link
Contributor Author

Hey @reuvenlax ,
That'd be great to get another set of eyes on it, thanks.

@aaltay
Copy link
Member

aaltay commented Sep 16, 2021

What is the next step on this PR?

@reubenvanammers
Copy link
Contributor Author

What is the next step on this PR?

@ibzib appears happy to enough with the PR as it stands. @reuvenlax, do you still want to have a look at the PR, or is it ok to merge? Happy to update the PR if needed.

@aaltay
Copy link
Member

aaltay commented Sep 30, 2021

@ibzib - if this looks good to you, I would suggest merging it.

@maduris
Copy link

maduris commented Oct 10, 2021

hi @reubenvanammers,

Glad you are already looking to the same issue we are facing.

I believe the convert function needs to be also updated to use the identifier number instead of the enum index on line
oneOfConvert has indices using the identifiers from proto definition where oneOf only has enum value.

https://github.com/apache/beam/blob/master/sdks/java/extensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoDynamicMessageSchema.java#L731

Something like?

        @Override
        void setOnProtoMessage(Message.Builder message, OneOfType.Value oneOf) {
            int caseIndex = oneOf.getCaseType().getValue();
            int convertIndex =
                logicalType.getCaseEnumType().valueOf(logicalType.getCaseEnumType().getValues().get(caseIndex)).getValue();
            oneOfConvert.get(convertIndex).setOnProtoMessage(message, oneOf.getValue());
        }

@aaltay
Copy link
Member

aaltay commented Oct 22, 2021

What is the next step on this PR?

@kennknowles
Copy link
Member

Pinging @ibzib; do you want a second opinion from someone or is this ready to merge?

@aaltay
Copy link
Member

aaltay commented Nov 30, 2021

Should this be merged? Closed?

@reubenvanammers
Copy link
Contributor Author

I feel like this should be merged, unless someone has an objection to it. It's been in approved status limbo for four months now.

@aaltay
Copy link
Member

aaltay commented Dec 2, 2021

I feel like this should be merged, unless someone has an objection to it. It's been in approved status limbo for four months now.

Got it. I do not have an objection but I also don't know why it was not merged.

@ibzib - since you did the review. Could this be merged? Are there any open discussion points?

@ibzib
Copy link
Contributor

ibzib commented Dec 4, 2021

@reubenvanammers can you address this comment? There may not be an issue but I wanted to make sure. Let's add your new test cases to ProtoDynamicMessageSchemaTest as well and see what happens.

hi @reubenvanammers,

Glad you are already looking to the same issue we are facing.

I believe the convert function needs to be also updated to use the identifier number instead of the enum index on line oneOfConvert has indices using the identifiers from proto definition where oneOf only has enum value.

https://github.com/apache/beam/blob/master/sdks/java/extensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoDynamicMessageSchema.java#L731

Something like?

        @Override
        void setOnProtoMessage(Message.Builder message, OneOfType.Value oneOf) {
            int caseIndex = oneOf.getCaseType().getValue();
            int convertIndex =
                logicalType.getCaseEnumType().valueOf(logicalType.getCaseEnumType().getValues().get(caseIndex)).getValue();
            oneOfConvert.get(convertIndex).setOnProtoMessage(message, oneOf.getValue());
        }

@aaltay
Copy link
Member

aaltay commented Dec 11, 2021

@reubenvanammers - Could you respond to the open comment?

@aaltay
Copy link
Member

aaltay commented Jan 13, 2022

@reubenvanammers - is this ready for another round of review?

Copy link
Contributor

@ibzib ibzib left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I'll merge this once I can get java precommit to pass

@ibzib
Copy link
Contributor

ibzib commented Jan 13, 2022

Run Java PreCommit

@ibzib ibzib merged commit 5ab52a3 into apache:master Jan 13, 2022
kennknowles added a commit to kennknowles/beam that referenced this pull request Jan 21, 2022
* github/master: (198 commits)
  Merge pull request apache#16369 from [BEAM-13558] [Playground] Hide the Graph tab and SCIO from SDK options
  Merge pull request apache#16546 from [BEAM-13661] [BEAM-13704] [Playground] Update tags for examples/katas/unit-tests
  Merge pull request apache#16540 from [BEAM-13678][Playground]Update Github Action To Deploy Examples
  [BEAM-13430] Re-add provided configuration (apache#16552)
  [BEAM-10206] Remove Fatalf calls in non-test goroutines for tests/benchmarks (apache#16575)
  [BEAM-13693] Bump beam_PostCommit_Java_ValidatesRunner_Dataflow_Streaming timeout to 12 hours (apache#16576)
  [BEAM-13699] Replace fnv with maphash. (apache#16573)
  Update Java FnAPI beam master (apache#16572)
  Merge pull request apache#16371 from [BEAM-13518][Playground] Beam Playground quickstart page on the Beam website
  Merge pull request apache#16373 from [BEAM-13515] [Playground] Hiding lines in an example that are not necessary
  Merge pull request apache#16472: [BEAM-13697] Add SchemaFieldNumber annotation
  [BEAM-13689] Output token elements when BQ batch writes complete.
  Disable logging for memoization test. (apache#16556)
  BEAM-13611 reactivating jdbcio xlang test
  Revert "Merge pull request apache#15863 from [BEAM-13184] Autosharding for JdbcIO.write* transforms"
  Remove obsolete commands from Inventory job. (apache#16564)
  [BEAM-13688] fixed type in BPG 4.5.3 window section (apache#16560)
  [BEAM-13683] Make cross-language SQL example pipeline (apache#16567)
  Merge pull request apache#16243 from darshan-sj/feature/support-priority-spannerio - Making rpcPriority a ValueProvider in SpannerConfig
  edited README and comments in Python multi-lang pipes examples
  Merge pull request apache#16518 from [BEAM-13619] [Playground] Add loading animation to the catalog
  Merge pull request apache#16519 from [BEAM-13639] [Playground] Add notification to Log/Output tabs about cached example
  Merge pull request apache#16533 from [BEAM-13548] [Playground] Add example description popover
  Merge pull request apache#16531 from [BEAM-13567] [playground] Handle run code validation and preparation errors
  Merge pull request apache#16370 from [BEAM-13556] playground - color and scroll tabs with new content
  [BEAM-13611] Skip test_xlang_jdbc_write (apache#16554)
  [BEAM-13015] Provide caching statistics in the status client. (apache#16495)
  Merge pull request apache#16309: [BEAM-13503] Set a default value to throwWriteErrors in BulkIO constructor
  [BEAM-13015] Add state caching capability to be used as hint for runners to not duplicate cached data if the SDK can do it for user state and side inputs. (apache#16525)
  [BEAM-13665] Make SpannerIO projectID optional again (apache#16547)
  Merge pull request apache#16322 from [BEAM-13407] [Playground] Preload fonts for the web application
  Merge pull request apache#16506 from [BEAM-13652][Playground] Send examples' links to the frontend
  [BEAM-11808][BEAM-9879] Support aggregate functions with two arguments (apache#16200)
  Update walkthrough.md (apache#16512)
  [BEAM-13683] Correct SQL transform schema, fix expansion address override bug (apache#16551)
  Merge pull request apache#16486 from [BEAM-13544][Playground] Add logs to examples CI/CD to see the progress
  [BEAM-13616][BEAM-13646] Upgrade vendored calcite to 1.28.0:0.2 (apache#16544)
  [BEAM-13616][BEAM-13645] Switch to vendored grpc 1.43.2 (apache#16543)
  Also bump FnAPI container.
  [BEAM-13680] Fixed code_repository (added pipelineUuid to RunCodeResult when status is "Finished")
  [BEAM-13616] Update com.google.cloud:libraries-bom to 24.2.0 (apache#16509)
  [BEAM-13430] Remove jcenter which will no longer contain any updates. (apache#16536)
  Update GH Actions to use proper variables names and proper triggers
  Remove jcenter repositories from gradle configuration. (apache#16532)
  Merge pull request apache#16507: [BEAM-13137] Fixes ES utest size flakiness with _flush api and index.store.stats_refresh_interval=0
  [BEAM-13664] Fix Primitives hashing benchmark (apache#16523)
  Bump beam container version.
  [BEAM-12621] - Update Jenkins VMs to modern Ubuntu version (apache#16457)
  doc tweaks (apache#16498)
  Redirecting cross-language transforms content (apache#16504)
  Remove tab from source.
  Remove unnecessary fmt call in universal.go
  Clean up string cast of bytes in vet.go and corresponding tests
  fix capitalized error strings in expansionx
  Remove unnecessary blank identifier assignment in harness.go
  Replace string(buf.Bytes()) with buf.String() in coder_test.go
  Replace bytes.Compare() with bytes.Equal() in test cases
  Remove unnecessary fmt.Sprintf() in partition.go
  Fix staticcheck errors in transforms directory
  [BEAM-13590] Fix  abc imports from collections (apache#15850)
  Merge pull request apache#16482 from [BEAM-13429][Playground] Add builder for preparers
  [BEAM-10206] Resolve go vet errors in protox package
  [BEAM-12572] Run java examples on multiple runners (apache#16450)
  [BEAM-13400] JDBC IO does not support UUID and JSONB PostgreSQL types and OTHER JDBC types in general
  [BEAM-13577] Beam Select's uniquifyNames function loses nullability of Complex types while inferring schema
  [BEAM-12164]: Add SDF for reading change stream records
  [BEAM-13455] Remove duplicated artifacts when using multiple environments with Dataflow Java
  Merge pull request apache#16485 from [BEAM-13486] [Playground] For unit tests (java) if one of tests fails the output goes to stdOutput
  Merge pull request apache#16385 from [BEAM-13535] [Playground] add cancel execution button
  [BEAM-12558] Fix doc typo.
  Merge pull request apache#16467 from [BEAM-12164]: SpannerIO DetectNewPartitions SDF
  Introduce the notion of a JoinIndex for fewer shuffles. (apache#16101)
  [BEAM-12464] Change ProtoSchemaTranslator beam schema creation to match the order for protobufs containing Oneof fields (apache#14974)
  Stronger typing inference for CoGBK. (apache#16465)
  [BEAM-13480] Increase pipeline timeout for PubSubIntegrationTest.test_streaming_data_only (apache#16496)
  Provide API to check whether a hint is known.
  [BEAM-8806] Integration test for SqsIO using Localstack
  Split builder into several builder for each step of pipeline execution
  [BEAM-13399] Move service liveness polling to Runner type (apache#16487)
  Merge pull request apache#16325 from [BEAM-13471] [Playground] Tag existing unit-tests
  Adds several example multi-language Python pipelines
  [BEAM-13616][BEAM-13646] Update vendored calcite 1.28.0 with protobuf 3.19.2 (apache#16473)
  Merge pull request apache#16374 from [BEAM-13398][Playground] Split LifeCycle to DTO and business logic
  Merge pull request apache#16363 from [BEAM-13557] [Playground] show code execution time
  Merge pull request apache#16149 from [BEAM-13113] [Playground] playground frontend documentation
  Merge pull request apache#16469 from [BEAM-13623][Playground] [Bugfix] During unit tests failing there is no any output
  [BEAM-13641][Playground] Add SCIO SDK support on the CI/CD step
  [BEAM-13638] Datatype of timestamp fields in SqsMessage for AWS IOs for SDK v2 was changed from String to long, visibility of all fields was fixed from package private to public
  [BEAM-13616] Initial files for vendored gRPC 1.43.2 (apache#16460)
  [BEAM-13432] Skip ExpansionService creation in Job Server (apache#16222)
  [BEAM-13628] Update SideInputCache to use full Transform and SideInputIDs as token information (apache#16483)
  [BEAM-13631] Add deterministic SQS message coder to fix reading from SQS in batch mode
  [BEAM-8806] Integration test for SqsIO
  [adhoc] Run spotlessApply on java examples to fix master
  Merge pull request apache#16156 from [BEAM-13391] Fix temporary file format in WriteToBigQuery
  Optional args and kwargs for named external transforms.
  [BEAM-13614] Add OnWindowExpiration support to the Java SDK harness and proto translation. (apache#16458)
  [BEAM-3221] Improve documentation in model pipeline protos (apache#16474)
  Merge pull request apache#16147 from [BEAM-13359] [Playground] Tag existing examples
  [BEAM-13626] Remap expanded outputs after merging. (apache#16471)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants