Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade confluent libraries to 7.2.6 to fix some errors related to optional proto fields #11753

Merged
merged 3 commits into from Oct 22, 2023

Conversation

gortiz
Copy link
Contributor

@gortiz gortiz commented Oct 6, 2023

Recently we had some issues reading nulls from optional proto fields.

After several tests, we found that the actual issue was using a very old confluent version (5.5.3). Before 7.1.x, kafka-protobuf-serializer was not able to read optional fields in proto3 and instead relay on the producer rewriting the proto descriptor in order to change all optional fields to oneof. That is a requirement in older versions of proto but it is not a requirement in modern ones. This transformation is not required in kafka-protobuf-serializer in 7.1.x and higher and therefore if the producer uses one of these modern versions, the rewrite is not applied.

That means that if the consumer used kafka-protobuf-serializer >= 7.1.x, Pinot is not able to recognize optional fields, which had the lateral effect of either detecting default values (like 0 for ints) as null (which happened before #11723) or transforming actual nulls to default values (which happened after #11723)

// older versions of confluent connectors (7.1.x and lower) used to rewrite proto3 optional as oneof at descriptor
// level. Newer versions of confluent consumers support both alternatives.
Descriptors.FieldDescriptor optionalField = message.getDescriptorForType().findFieldByName("optionalField");
Assert.assertNull(optionalField.getRealContainingOneof(), "Received protobuf have been rewritten");
Copy link
Contributor Author

@gortiz gortiz Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we know, optional fields are transformed into oneof.

Older versions of confluent always applied this change in the producer and it was never applied in the consumer. From 7.1.x, this change is not applied in the producer and it is transparently applied in the consumer. That means that:

  • The registry stores the original descriptor, which uses optional
  • The consumer receives the original descriptor and applies the transformation internally

As explained in [implementing_proto3_presence.md] (https://github.com/protocolbuffers/protobuf/blob/main/docs/implementing_proto3_presence.md) (in protobuf repo), proto3 apis provide a way to verify whether this oneof is synthetic or real. If it is synthetic it means that it was added at runtime. Otherwise it means it was explicitly defined as oneof in the descriptor file.

This is the only way I've found to test whether the received message was read using a modified protobuf descriptor or not. If we use confluent 5.5.3 (as we used to do), optionalField.getRealCOntainingOneOf is different than null (because what is stored in the schema is the modified descriptor). When using 7.4.0 (or any version >= 7.1.0) this returns null.

This test is not great, but it is the best I was able to write. What I would like to do is to have different producers and different consumers using different confluent versions and verify that consumer can always read the message, independently of the confluent version used by the producer. But I don't think that can be done in a surefire test given that only a single version of a library can be in the classpath.

I've made that test manually and I can verify that the compatibility table is something like:

producer version consumer version stored in registry correct null read
5.5.3 5.5.3 oneof yes
5.5.3 7.0.x oneof yes
5.5.3 7.1.x oneof yes
5.5.3 7.4.x oneof yes
7.0.x 5.5.3 oneof yes
7.0.x 7.0.x oneof yes
7.0.x 7.1.x oneof yes
7.0.x 7.4.x oneof yes
7.1.x 5.5.3 optional no
7.1.x 7.0.x optional no
7.1.x 7.1.x optional yes
7.1.x 7.4.x optional yes
7.4.x 5.5.3 optional no
7.4.x 7.0.x optional no
7.4.x 7.1.x optional yes
7.4.x 7.4.x optional yes

So the TL;DR is:

  • if we use confluent <= 7.0.x we can read message emitted with [5.5.3, 7.1.x] but not when they are emitted with >= 7.1.x
  • if we use confluent >= 7.1.x we can read message emitted with >= 5.5.3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be added as a check during table creation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are using 7.4, can we skip this test ? Can we assume that 7.4 will cover all cases of producers.
if we use confluent >= 7.1.x we can read message emitted with >= 5.5.3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great find. Since we are using 7.4, can we skip this test ? Can we assume that 7.4 will cover all cases of producers. if we use confluent >= 7.1.x we can read message emitted with >= 5.5.3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be added as a check during table creation?

We cannot know this during table creation. Only when the message is produced, this can be checked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that this test, right now, mostly tests that using the same version, confluent can produce and read messages, which is quite useless.

But the test does not take long to run and having it here can be useful to verify future inconsistencies. Imagine that a future confluent v8 producer has some incompatibilities with a v7 reader. By using this code we can manually run the test with different versions of producers and readers.

If you think it is not useful, we can just remove it.

pom.xml Show resolved Hide resolved
pom.xml Outdated
<dependency>
<groupId>com.google.api.grpc</groupId>
<artifactId>proto-google-common-protos</artifactId>
<version>2.9.6</version>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and other) versions were required to avoid dependency convergence errors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not an expert, but can we simply shade the one used in the 2 confluent input-format pom only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be an option, yes. We need to do an extensive study on how we are using shading. But I think that would be something we should do in another PR given shanding changes are difficult to verify and we already verified the proposal written here.

@codecov-commenter
Copy link

codecov-commenter commented Oct 6, 2023

Codecov Report

Merging #11753 (42c7e29) into master (86d3c44) will increase coverage by 0.03%.
The diff coverage is 50.00%.

@@             Coverage Diff              @@
##             master   #11753      +/-   ##
============================================
+ Coverage     66.35%   66.39%   +0.03%     
  Complexity      207      207              
============================================
  Files          2350     2350              
  Lines        127281   127288       +7     
  Branches      19603    19604       +1     
============================================
+ Hits          84463    84507      +44     
+ Misses        36924    36883      -41     
- Partials       5894     5898       +4     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (ø)
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 49.97% <0.00%> (-16.36%) ⬇️
java-21 66.27% <50.00%> (+0.05%) ⬆️
skip-bytebuffers-false 66.35% <50.00%> (+<0.01%) ⬆️
skip-bytebuffers-true 66.22% <50.00%> (+48.41%) ⬆️
temurin 66.39% <50.00%> (+0.03%) ⬆️
unittests 66.39% <50.00%> (+0.03%) ⬆️
unittests1 67.16% <ø> (+0.04%) ⬆️
unittests2 17.83% <50.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...ConfluentSchemaRegistryProtoBufMessageDecoder.java 54.54% <50.00%> (-1.71%) ⬇️

... and 23 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

// older versions of confluent connectors (7.1.x and lower) used to rewrite proto3 optional as oneof at descriptor
// level. Newer versions of confluent consumers support both alternatives.
Descriptors.FieldDescriptor optionalField = message.getDescriptorForType().findFieldByName("optionalField");
Assert.assertNull(optionalField.getRealContainingOneof(), "Received protobuf have been rewritten");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be added as a check during table creation?

pom.xml Outdated
@@ -168,7 +168,7 @@
<argLine>-Xms4g -Xmx4g</argLine>
<protobuf.version>3.24.3</protobuf.version>
<grpc.version>1.53.0</grpc.version>
<confluent.version>5.5.3</confluent.version>
<confluent.version>7.4.0</confluent.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a huge jump in version, do we know what else could it impact?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only use pinot-protobuf and pinot-confluent-arvo. as long as we make sure the 2 input-format plugin works normally we should be good

pom.xml Outdated
@@ -168,7 +168,7 @@
<argLine>-Xms4g -Xmx4g</argLine>
<protobuf.version>3.24.3</protobuf.version>
<grpc.version>1.53.0</grpc.version>
<confluent.version>5.5.3</confluent.version>
<confluent.version>7.4.0</confluent.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only use pinot-protobuf and pinot-confluent-arvo. as long as we make sure the 2 input-format plugin works normally we should be good

pom.xml Outdated
<dependency>
<groupId>com.google.api.grpc</groupId>
<artifactId>proto-google-common-protos</artifactId>
<version>2.9.6</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not an expert, but can we simply shade the one used in the 2 confluent input-format pom only?

@@ -82,7 +82,6 @@
<dependency>
<groupId>com.google.api.grpc</groupId>
<artifactId>proto-google-common-protos</artifactId>
<version>1.18.1</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one also means the pinot-gcs file system has upgrade to 2.9.6 from 1.18.1? (this is a major version bump, worth more testing)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one also means the pinot-gcs file system has upgrade to 2.9.6 from 1.18.1? (this is a major version bump, worth more testing)

What tests are recommended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems strange. I removed that line without actually reading it because the actual error didn't point to an issue there but in:

[ERROR] Dependency convergence error for com.google.api.grpc:proto-google-common-protos:jar:2.9.0 paths to dependency are:
[ERROR] +-org.apache.pinot:pinot-tools:jar:1.1.0-SNAPSHOT
[ERROR]   +-org.apache.pinot:pinot-common:jar:1.1.0-SNAPSHOT:compile
[ERROR]     +-io.grpc:grpc-protobuf:jar:1.53.0:compile
[ERROR]       +-com.google.api.grpc:proto-google-common-protos:jar:2.9.0:compile
[ERROR] and
[ERROR] +-org.apache.pinot:pinot-tools:jar:1.1.0-SNAPSHOT
[ERROR]   +-org.apache.pinot:pinot-protobuf:jar:1.1.0-SNAPSHOT:compile
[ERROR]     +-io.confluent:kafka-protobuf-serializer:jar:7.4.0:compile
[ERROR]       +-io.confluent:kafka-protobuf-provider:jar:7.4.0:compile
[ERROR]         +-com.google.api.grpc:proto-google-common-protos:jar:2.5.1:compile
[ERROR] and
[ERROR] +-org.apache.pinot:pinot-tools:jar:1.1.0-SNAPSHOT
[ERROR]   +-org.apache.pinot:pinot-protobuf:jar:1.1.0-SNAPSHOT:compile
[ERROR]     +-io.confluent:kafka-protobuf-serializer:jar:7.4.0:compile
[ERROR]       +-io.confluent:kafka-protobuf-types:jar:7.4.0:compile
[ERROR]         +-com.google.api.grpc:proto-google-common-protos:jar:2.5.1:compile

So I upgraded the version thinking I only changed minor versions. But now I'm checking again, without removing the version here, and you are right, the error is:

[ERROR] Dependency convergence error for com.google.api.grpc:proto-google-common-protos:jar:2.9.6 paths to dependency are:
[ERROR] +-org.apache.pinot:pinot-gcs:jar:1.1.0-SNAPSHOT
[ERROR]   +-com.google.cloud:google-cloud-storage:jar:1.113.1:compile
[ERROR]     +-com.google.api.grpc:proto-google-common-protos:jar:2.9.6:compile
[ERROR] and
[ERROR] +-org.apache.pinot:pinot-gcs:jar:1.1.0-SNAPSHOT
[ERROR]   +-com.google.api.grpc:proto-google-common-protos:jar:1.18.1:compile

It seems that kafka-protobuf-types:5.5.3 did not depend on proto-google-common-protos, but kafka-protobuf-types:7.0.0 and newer depend on proto-google-common-protos:2.3 and newer.

pom.xml Outdated
<dependency>
<groupId>com.google.api.grpc</groupId>
<artifactId>proto-google-common-protos</artifactId>
<version>2.9.6</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to other dependencies, can we define versions as variables? (this one and the following 2 versions)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this feedback. @gortiz Let's define the version as variables

Copy link
Contributor Author

@gortiz gortiz Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the advantage of adding variables for these kind of things. That variable should not be read anywhere else, so there is no DRY. It just make the code more difficult to read given you have to move one thousand (!!!) lines above/below to actually know what the variable does. Also, it increase the chance of having a (honestly simple to resolve) conflict in PRs, given that we usually add these variables at the end of the properties section.

It would make sense if other projects would use that variable, but that is not the case given we use the value to define a dependencymanagement, whose objective is precisely that subprojects do not write the version by their own.

Anyway, it is not the point of this PR, so I'm changing it.

@Jackie-Jiang Jackie-Jiang added dependencies Pull requests that update a dependency file bugfix labels Oct 7, 2023
@gortiz gortiz changed the title Upgrade confluent libraries to 7.4.0 to fix some errors related to optional proto fields Upgrade confluent libraries to 7.2.6 to fix some errors related to optional proto fields Oct 13, 2023
@gortiz
Copy link
Contributor Author

gortiz commented Oct 13, 2023

We don't feel confident using 7.4.x yet. There are several changes in these versions that may (or may not) produce errors in runtime. Therefore I've modified this PR to use 7.2.6 instead, which is closer to the previously used 5.5.3 (I'm not talking about the numbers but the actual code).

We have run several manual tests with this version and it seems to be working fine.

@gortiz
Copy link
Contributor Author

gortiz commented Oct 19, 2023

Can we merge this?

Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. Let's merge it once the version variable is fixed.

pom.xml Outdated
<dependency>
<groupId>com.google.api.grpc</groupId>
<artifactId>proto-google-common-protos</artifactId>
<version>2.9.6</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this feedback. @gortiz Let's define the version as variables

@xiangfu0 xiangfu0 merged commit 9f0d594 into apache:master Oct 22, 2023
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix dependencies Pull requests that update a dependency file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants