Upgrade confluent libraries to 7.2.6 to fix some errors related to optional proto fields #11753
Conversation
```java
// Older versions of confluent connectors (7.1.x and lower) used to rewrite proto3 optional as oneof at the
// descriptor level. Newer versions of confluent consumers support both alternatives.
Descriptors.FieldDescriptor optionalField = message.getDescriptorForType().findFieldByName("optionalField");
Assert.assertNull(optionalField.getRealContainingOneof(), "Received protobuf has been rewritten");
```
As we know, `optional` fields are transformed into `oneof`.
Older versions of confluent always applied this change in the producer and it was never applied in the consumer. From 7.1.x, this change is not applied in the producer and it is transparently applied in the consumer. That means that:
- The registry stores the original descriptor, which uses `optional`
- The consumer receives the original descriptor and applies the transformation internally
As explained in [implementing_proto3_presence.md](https://github.com/protocolbuffers/protobuf/blob/main/docs/implementing_proto3_presence.md) (in the protobuf repo), the proto3 APIs provide a way to verify whether this `oneof` is synthetic or real. If it is synthetic, it was added at runtime; otherwise it was explicitly defined as `oneof` in the descriptor file.
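For illustration, here is a minimal sketch of that check using the protobuf-java descriptor API (the class and method names are illustrative, not code from this PR; it assumes a protobuf-java version with proto3 `optional` support, such as the 3.24.x used here):

```java
import com.google.protobuf.Descriptors;
import com.google.protobuf.Message;

public final class SyntheticOneofCheck {

  // Returns true when the field kept its original proto3 `optional` form:
  // protobuf generates a *synthetic* oneof for such fields, which
  // getContainingOneof() reports but getRealContainingOneof() does not.
  static boolean isInSyntheticOneof(Message message, String fieldName) {
    Descriptors.FieldDescriptor field =
        message.getDescriptorForType().findFieldByName(fieldName);
    Descriptors.OneofDescriptor anyOneof = field.getContainingOneof();
    Descriptors.OneofDescriptor realOneof = field.getRealContainingOneof();
    // A real oneof (declared in the .proto file, or produced by the old
    // optional -> oneof rewrite) is returned by both calls; a synthetic
    // oneof is only returned by getContainingOneof().
    return anyOneof != null && realOneof == null;
  }
}
```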
This is the only way I've found to test whether the received message was read using a modified protobuf descriptor or not. If we use confluent 5.5.3 (as we used to do), `optionalField.getRealContainingOneof()` returns a non-null value (because what is stored in the schema registry is the modified descriptor). When using 7.4.0 (or any version >= 7.1.0) it returns null.
This test is not great, but it is the best I was able to write. What I would like to do is have different producers and consumers using different confluent versions and verify that the consumer can always read the message, independently of the confluent version used by the producer. But I don't think that can be done in a Surefire test, given that only a single version of a library can be on the classpath.
I've run that test manually and can verify that the compatibility table looks like this:
| producer version | consumer version | stored in registry | correct null read |
|---|---|---|---|
| 5.5.3 | 5.5.3 | oneof | yes |
| 5.5.3 | 7.0.x | oneof | yes |
| 5.5.3 | 7.1.x | oneof | yes |
| 5.5.3 | 7.4.x | oneof | yes |
| 7.0.x | 5.5.3 | oneof | yes |
| 7.0.x | 7.0.x | oneof | yes |
| 7.0.x | 7.1.x | oneof | yes |
| 7.0.x | 7.4.x | oneof | yes |
| 7.1.x | 5.5.3 | optional | no |
| 7.1.x | 7.0.x | optional | no |
| 7.1.x | 7.1.x | optional | yes |
| 7.1.x | 7.4.x | optional | yes |
| 7.4.x | 5.5.3 | optional | no |
| 7.4.x | 7.0.x | optional | no |
| 7.4.x | 7.1.x | optional | yes |
| 7.4.x | 7.4.x | optional | yes |
So the TL;DR is:
- if we use confluent <= 7.0.x we can read messages emitted with [5.5.3, 7.0.x], but not messages emitted with >= 7.1.x
- if we use confluent >= 7.1.x we can read messages emitted with any version >= 5.5.3
Should this be added as a check during table creation?
This is a great find. Since we are using 7.4, can we skip this test? Can we assume that 7.4 will cover all cases of producers?

> if we use confluent >= 7.1.x we can read messages emitted with any version >= 5.5.3
> Should this be added as a check during table creation?

We cannot know this during table creation; it can only be checked when the message is produced.
It is true that, right now, this test mostly verifies that confluent can produce and read messages using the same version, which is not very useful.
But the test does not take long to run, and having it here can help catch future inconsistencies. Imagine that a future confluent v8 producer has some incompatibility with a v7 reader: with this code we can manually run the test with different producer and reader versions.
If you think it is not useful, we can just remove it.
.../src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufConfluentSchemaTest.java
pom.xml (Outdated)
```xml
<dependency>
  <groupId>com.google.api.grpc</groupId>
  <artifactId>proto-google-common-protos</artifactId>
  <version>2.9.6</version>
```
This version (and the others) was required to avoid dependency convergence errors.
Not an expert, but can we simply shade the one used in the two confluent input-format poms only?
It could be an option, yes. We would need to do an extensive study of how we are using shading. But I think that is something we should do in another PR, given that shading changes are difficult to verify and we have already verified the proposal written here.
Codecov Report
```
@@             Coverage Diff              @@
##             master   #11753      +/-   ##
============================================
+ Coverage     66.35%   66.39%    +0.03%
  Complexity      207      207
============================================
  Files          2350     2350
  Lines        127281   127288        +7
  Branches      19603    19604        +1
============================================
+ Hits          84463    84507       +44
+ Misses        36924    36883       -41
- Partials       5894     5898        +4
```
Flags with carried forward coverage won't be shown.
... and 23 files with indirect coverage changes.
pom.xml (Outdated)
```diff
@@ -168,7 +168,7 @@
     <argLine>-Xms4g -Xmx4g</argLine>
     <protobuf.version>3.24.3</protobuf.version>
     <grpc.version>1.53.0</grpc.version>
-    <confluent.version>5.5.3</confluent.version>
+    <confluent.version>7.4.0</confluent.version>
```
This is a huge jump in version; do we know what else it could impact?
Only pinot-protobuf and pinot-confluent-avro use it. As long as we make sure the two input-format plugins work normally, we should be good.
```diff
@@ -82,7 +82,6 @@
     <dependency>
       <groupId>com.google.api.grpc</groupId>
       <artifactId>proto-google-common-protos</artifactId>
-      <version>1.18.1</version>
```
This one also means the pinot-gcs file system has to upgrade from 1.18.1 to 2.9.6? (This is a major version bump, worth more testing.)
> This one also means the pinot-gcs file system has to upgrade from 1.18.1 to 2.9.6? (This is a major version bump, worth more testing.)

What tests are recommended?
That seems strange. I removed that line without actually reading it, because the actual error didn't point to an issue there but to:
```
[ERROR] Dependency convergence error for com.google.api.grpc:proto-google-common-protos:jar:2.9.0 paths to dependency are:
[ERROR] +-org.apache.pinot:pinot-tools:jar:1.1.0-SNAPSHOT
[ERROR]   +-org.apache.pinot:pinot-common:jar:1.1.0-SNAPSHOT:compile
[ERROR]     +-io.grpc:grpc-protobuf:jar:1.53.0:compile
[ERROR]       +-com.google.api.grpc:proto-google-common-protos:jar:2.9.0:compile
[ERROR] and
[ERROR] +-org.apache.pinot:pinot-tools:jar:1.1.0-SNAPSHOT
[ERROR]   +-org.apache.pinot:pinot-protobuf:jar:1.1.0-SNAPSHOT:compile
[ERROR]     +-io.confluent:kafka-protobuf-serializer:jar:7.4.0:compile
[ERROR]       +-io.confluent:kafka-protobuf-provider:jar:7.4.0:compile
[ERROR]         +-com.google.api.grpc:proto-google-common-protos:jar:2.5.1:compile
[ERROR] and
[ERROR] +-org.apache.pinot:pinot-tools:jar:1.1.0-SNAPSHOT
[ERROR]   +-org.apache.pinot:pinot-protobuf:jar:1.1.0-SNAPSHOT:compile
[ERROR]     +-io.confluent:kafka-protobuf-serializer:jar:7.4.0:compile
[ERROR]       +-io.confluent:kafka-protobuf-types:jar:7.4.0:compile
[ERROR]         +-com.google.api.grpc:proto-google-common-protos:jar:2.5.1:compile
```
So I upgraded the version, thinking I had only changed minor versions. But now I'm checking again without removing the version here, and you are right; the error is:
```
[ERROR] Dependency convergence error for com.google.api.grpc:proto-google-common-protos:jar:2.9.6 paths to dependency are:
[ERROR] +-org.apache.pinot:pinot-gcs:jar:1.1.0-SNAPSHOT
[ERROR]   +-com.google.cloud:google-cloud-storage:jar:1.113.1:compile
[ERROR]     +-com.google.api.grpc:proto-google-common-protos:jar:2.9.6:compile
[ERROR] and
[ERROR] +-org.apache.pinot:pinot-gcs:jar:1.1.0-SNAPSHOT
[ERROR]   +-com.google.api.grpc:proto-google-common-protos:jar:1.18.1:compile
```
It seems that `kafka-protobuf-types:5.5.3` did not depend on `proto-google-common-protos`, but `kafka-protobuf-types:7.0.0` and newer depend on `proto-google-common-protos:2.3` and newer.
pom.xml (Outdated)
```xml
<dependency>
  <groupId>com.google.api.grpc</groupId>
  <artifactId>proto-google-common-protos</artifactId>
  <version>2.9.6</version>
```
Similar to other dependencies, can we define versions as variables? (this one and the following 2 versions)
+1 on this feedback. @gortiz Let's define the version as variables
I don't understand the advantage of adding variables for this kind of thing. That variable would not be read anywhere else, so there is no DRY benefit. It just makes the code more difficult to read, given that you have to move a thousand (!!!) lines up or down to actually know what the variable means. It also increases the chance of a (honestly simple to resolve) conflict in PRs, given that we usually add these variables at the end of the properties section.
It would make sense if other projects used that variable, but that is not the case, given that we use the value to define a dependencyManagement entry, whose objective is precisely that subprojects do not write the version on their own.
Anyway, it is not the point of this PR, so I'm changing it.
We don't feel confident using 7.4.x yet. There are several changes in those versions that may (or may not) produce errors at runtime. Therefore I've modified this PR to use 7.2.6 instead, which is closer to the previously used 5.5.3 (I'm not talking about the version numbers but the actual code). We have run several manual tests with this version and it seems to be working fine.
Can we merge this?
LGTM otherwise. Let's merge it once the version variable is fixed.
Recently we had some issues reading nulls from optional proto fields.
After several tests, we found that the actual issue was using a very old confluent version (5.5.3). Before 7.1.x, `kafka-protobuf-serializer` was not able to read `optional` fields in proto3 and instead relied on the producer rewriting the proto descriptor to change all `optional` fields to `oneof`. That was a requirement in older versions of proto, but it is not in modern ones. This transformation is no longer required by `kafka-protobuf-serializer` 7.1.x and higher, so if the producer uses one of these modern versions, the rewrite is not applied. That means that when the producer used `kafka-protobuf-serializer` >= 7.1.x, Pinot was not able to recognize optional fields, which had the side effect of either detecting default values (like 0 for ints) as null (which happened before #11723) or transforming actual nulls into default values (which happened after #11723).
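To make that concrete, here is a hedged sketch (the class and helper names are mine, not code from this PR) of how a consumer that sees the unmodified descriptor can tell "field not set" apart from "field set to its default value":

```java
import com.google.protobuf.Descriptors;
import com.google.protobuf.Message;

public final class NullableReader {

  // Reads fieldName as a nullable value: null when the field tracks presence
  // and was not set, otherwise the stored value (even if it equals the default).
  static Object readNullable(Message message, String fieldName) {
    Descriptors.FieldDescriptor fd =
        message.getDescriptorForType().findFieldByName(fieldName);
    if (tracksPresence(fd) && !message.hasField(fd)) {
      return null; // genuinely absent, not just 0 / "" / false
    }
    return message.getField(fd);
  }

  // Simplified proto3 presence check: a singular field tracks presence when it
  // is a message or belongs to a (possibly synthetic) oneof -- which is exactly
  // what proto3 `optional` provides. hasField() is only valid for such fields.
  private static boolean tracksPresence(Descriptors.FieldDescriptor fd) {
    return !fd.isRepeated()
        && (fd.getJavaType() == Descriptors.FieldDescriptor.JavaType.MESSAGE
            || fd.getContainingOneof() != null);
  }
}
```

With the old rewritten descriptors, the field still sat in a (real) `oneof`, so presence survived end to end; the failure mode described above appears when one side understands `optional` and the other does not.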