[SPARK-44001][Protobuf] spark protobuf: handle well known wrapper types #41498
justaparth wants to merge 1 commit into apache:master
Conversation
Force-pushed from 1dc4818 to b262c68
Force-pushed from b262c68 to 7279a91
after SPARK-43921, the
ah, thanks! i didn't realize that had landed, let me rebase and remove the descriptor file
Force-pushed from 7279a91 to 1288b48
cc @rangadi @SandishKumarHN would y'all be willing to take a look? thanks!
Handle parsing well known types as primitives rather than as structs.
Force-pushed from 1288b48 to 622cb2b
```scala
updater.setBoolean(
  ordinal,
  dm.getField(
    dm.getDescriptorForType.findFieldByName("value")
```
Can use BoolValue.getDescriptor, right?
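A minimal sketch of that suggestion (illustrative only, not code from this PR; it assumes the dynamic message was built against the compiled BoolValue descriptor):

```scala
import com.google.protobuf.{BoolValue, DynamicMessage}

// Illustrative: look the "value" field up on the generated BoolValue descriptor
// instead of dm.getDescriptorForType.findFieldByName("value"). Caveat:
// DynamicMessage.getField requires the FieldDescriptor to belong to the
// message's own descriptor, so this only works when the dynamic message really
// uses the compiled google.protobuf.BoolValue descriptor.
def unwrapBool(dm: DynamicMessage): Boolean = {
  val valueField = BoolValue.getDescriptor.findFieldByName("value")
  dm.getField(valueField).asInstanceOf[Boolean]
}
```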
```scala
val jsonStr = jsonPrinter.print(value.asInstanceOf[DynamicMessage])
updater.set(ordinal, UTF8String.fromString(jsonStr))
```
```scala
// Handle well known wrapper types. We unpack the value field instead of keeping
```
> Under com.google.protobuf, there are some well known wrapper types, useful for distinguishing between absence of primitive fields and their default values
This says the purpose is to find if the value is set or just the default. How does this PR provide that functionality? With this PR, the user can't distinguish between `int32 int_value` and `IntValue int_value`.
It will be good to have a Spark example where this helps.
I am not yet sure about including a lot of customization like this.
Also, how do we handle backward compatibility?
> This says the purpose is to find if the value is set or just the default
This is the purpose of the wrapper types, not of this PR.
This PR is trying to respect how wrapper types should be interpreted when deserialized into other formats. The documentation for the types indicates this behavior explicitly for JSON, e.g. for BytesValue https://protobuf.dev/reference/protobuf/google.protobuf/#bytes-value
> Wrapper message for double.
> The JSON representation for DoubleValue is JSON number.
The source definition for the messages also says this https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/wrappers.proto
And if you look at JsonFormat in the Java protobuf utils or Go's protobuf JSON library, they also make sure to special-case these so that they're not expanded as messages, but rather treated like primitives.
In this PR I do the same thing with respect to deserializing to a Spark struct, as these wrapper types are intended to be treated like primitives.
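For what it's worth, a small sketch (not from this PR; assumes protobuf-java-util on the classpath) of JsonFormat rendering a wrapper type as a bare primitive:

```scala
import com.google.protobuf.Int32Value
import com.google.protobuf.util.JsonFormat

object WrapperJsonDemo extends App {
  val printer = JsonFormat.printer()
  // The wrapper is rendered as its underlying JSON value, not as an object
  // with a "value" field.
  println(printer.print(Int32Value.of(5)))  // 5
  println(printer.print(Int32Value.of(0)))  // 0
}
```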
> Also, how do we handle backward compatibility?
Currently, this PR represents a change in the behavior. Open to discussion of course, but I think that the behavior currently is inconsistent with the intention of the types / other libraries, so we should make it the default behavior rather than hide it behind an option. We could introduce an option to allow expanding these as messages if users want to preserve that kind of behavior.
> This is the purpose of the wrapper types, not of this PR.
Better for Spark to preserve the same information, right?
There is a reason why it was used in the user's Protobuf message. It might be simpler for Spark to preserve that.
If the justification is just convenience, I am not sure we need to do that.
> I think that the behavior currently is inconsistent with the intention of the types /
I don't quite see inconsistency. The original purpose in Protobuf is to distinguish unset vs default value (as you mentioned). Spark handling preserves that.
Just because the JSON serializer does this does not imply Spark should.
What does the Java API for this look like?
Essentially, what is a compelling case other than convenience? The Protobuf documentation says this is mainly for use with Any fields, and for the other use case, we can use `optional int32`.
Not sure I follow. This is a serde between Protobuf and Spark structs. Consumers and producers are expected to know the schema.
Can we have a concrete example where this makes a difference? What problem are we solving? Wrapper types are used because the wrapper is important, otherwise no need to use it. I don't see how stripping the wrapper is the right thing.
These are just utilities, not a Protobuf spec.
Did you check the generated Java code? It treats it just as another Protobuf message. There is no special treatment. Why should Spark be different?
Can we have a fully spelled out example in Spark that shows the benefits?
> Can we have a concrete example where this makes a difference? What problem are we solving?
> Can we have a fully spelled out example in Spark that shows the benefits?
Did you check the PR description? There's an example of what the deserialization would look like before and after. Beyond that, I'm not sure what kind of example would help... Could you be specific about what you're looking for here?
> Wrapper types are used because the wrapper is important, otherwise no need to use it. I don't see how stripping the wrapper is the right thing.
> Can we have a fully spelled out example in Spark that shows the benefits?
Let me give you a concrete example from the existing code. There are well known Timestamp and Duration types, and the spark protobuf library handles them in a custom way, converting them to TimestampType / DayTimeIntervalType in Spark.
proto message in tests: https://github.com/apache/spark/blob/master/connector/protobuf/src/test/resources/protobuf/functions_suite.proto#L167-L175
schema conversion logic: https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala#L79-L90
deserialization logic: https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala#L229-L247
By your argument above, we should NOT be doing this; we should treat them as messages and unpack as a struct.
However, the library is choosing to make a decision based on the semantic meaning of those well known types. I.e., the library is saying "I know what these message types are meant to be and I will deserialize them differently."
In the same way, the rest of the wrapper types are not simply opaque messages; they are intended to feel like primitives, and I think that we should handle the rest of the well known types just like we're handling Timestamp / Duration.
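As an illustration of what extending that pattern could look like (this mirrors the shape of the existing Timestamp/Duration special-casing; it is not the PR's actual code, and the unsigned wrappers are omitted for brevity):

```scala
import org.apache.spark.sql.types._

// Illustrative only: map well known message types to Catalyst types by their
// full protobuf name, the way the existing code already does for Timestamp
// and Duration.
def wellKnownCatalystType(fullName: String): Option[DataType] = fullName match {
  case "google.protobuf.Timestamp"   => Some(TimestampType)
  case "google.protobuf.Duration"    => Some(DayTimeIntervalType.DEFAULT)
  case "google.protobuf.BoolValue"   => Some(BooleanType)
  case "google.protobuf.Int32Value"  => Some(IntegerType)
  case "google.protobuf.Int64Value"  => Some(LongType)
  case "google.protobuf.FloatValue"  => Some(FloatType)
  case "google.protobuf.DoubleValue" => Some(DoubleType)
  case "google.protobuf.StringValue" => Some(StringType)
  case "google.protobuf.BytesValue"  => Some(BinaryType)
  case _                             => None
}
```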
(As an aside: the code for Timestamp and Duration seems overly broad, it'll catch any class that just happens to be named Timestamp or Duration, which is likely not what we want. And the tests aren't referencing the google protobuf classes, but an exact copy of them committed to the repo. I can send a separate PR for consideration of how to fix that)
> These are just utilities, not a Protobuf spec.
Protobuf actually defines how the JSON output should look, so that is part of the spec. But I agree they never said the words "spark struct should look like XXX". I just think that a spark struct is way closer to JSON (it's a data format, not code) than Java classes or something, and so JSON provides a good guide. And also, like I said above, the existing code already handles some well known types in a special way.
Also, one point that may not have been clear: these wrapper types for primitives are not inherently useful as "messages"; rather, they exist to distinguish between unset and 0 in proto3, before the existence of `optional`.
As an example:

```proto
syntax = "proto3";

message A {
  int64 val = 1;
}
```

and then `A()` and `A(0)` are not distinguishable from one another, because default values are not serialized to the wire for primitives in proto3.
But if you had

```proto
syntax = "proto3";

import "google/protobuf/wrappers.proto";

message A {
  google.protobuf.Int64Value val = 1;
}
```

then `A()` and `A(Int64Value(0))` become distinguishable from one another.
Basically, it's not that nesting primitives inside a message wrapper is useful for its own sake; the purpose is simply to disambiguate the case above (and you're right that `optional` in newer versions of protobuf basically solves this problem).
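A runnable sketch of the wire-level point, using the real com.google.protobuf.Int64Value (the nesting behavior described in the comments is standard proto3 encoding, not code from this PR):

```scala
import com.google.protobuf.Int64Value

object WrapperPresenceDemo extends App {
  // A bare int64 set to 0 contributes nothing to the wire in proto3, which is
  // why A() and A(0) above are indistinguishable. The wrapper's own inner
  // field behaves the same way:
  println(Int64Value.of(0L).toByteArray.length)  // 0 bytes
  println(Int64Value.of(7L).toByteArray.length)  // 2 bytes (tag + varint)
  // But when an Int64Value is nested as a field of A, the parent message still
  // writes that field's tag and length even when the payload is empty, so the
  // presence of the wrapper survives a round trip.
}
```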
> Did you check the PR description? There's an example of what the deserialization would look like before and after. Beyond that, I'm not sure what kind of example would help... Could you be specific about what you're looking for here?
I know what the PR does. I'm not looking for example usage. I'm looking for a motivating example that shows the need for this feature.
So the motivating example is: if someone wants to convert the struct generated by Spark to JSON and compare (or maintain the compatibility of) that JSON with JSON generated from the same protobuf message using Go or Java JsonFormat, they are not able to assert equality, because Spark doesn't follow the spec for translating well known types.
A team that wants to do such a comparison has to write a custom converter, and writing such a converter is painful (if not impossible) because 1) the type information is lost, and 2) even with the type info, there is no easy way to know at what level we should get rid of the struct and replace it with a scalar -- this is a real use case we are running into.
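A hedged sketch of that scenario (the DataFrame, column name, descriptor path, and message name "Example" are illustrative assumptions, not names from this PR):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_json}
import org.apache.spark.sql.protobuf.functions.from_protobuf

// Deserialize protobuf payloads with Spark and render the result as JSON.
def protoAsJson(binaryDf: DataFrame, descriptorFile: String): DataFrame =
  binaryDf
    .select(from_protobuf(col("payload"), "Example", descriptorFile).as("event"))
    .select(to_json(col("event")).as("event_json"))

// With today's behavior, a google.protobuf.Int32Value field comes out roughly as
//   {"int_value":{"value":5}}
// while JsonFormat (Java) / protojson (Go) emit
//   {"intValue":5}
// so the two JSON documents cannot be compared without a custom converter.
```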
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Under com.google.protobuf, there are some well known wrapper types, useful for distinguishing between absence of primitive fields and their default values, as well as for use within google.protobuf.Any types. These types are: BoolValue, BytesValue, DoubleValue, FloatValue, Int32Value, Int64Value, StringValue, UInt32Value, and UInt64Value.
Currently, when we deserialize these from a serialized protobuf into a Spark struct, we expand them as if they were normal messages: a wrapper-typed field becomes a nested struct with a single value field.
This is quite difficult to work with and not in the spirit of the wrapper types, so it would be nice to deserialize the wrapped value directly as its underlying primitive.
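To make the before/after concrete (an illustrative reconstruction, not the PR description's original example), for a field declared as `google.protobuf.Int64Value age = 1;` the Spark schema would change roughly as follows:

```scala
import org.apache.spark.sql.types._

// Schema produced today: the wrapper is expanded like any other message.
val before = StructType(Seq(
  StructField("age", StructType(Seq(StructField("value", LongType))))))

// Schema with this change: the wrapper collapses to its underlying primitive.
val after = StructType(Seq(StructField("age", LongType)))
```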
This is also the behavior of other popular deserialization libraries, including the Java protobuf util JsonFormat and golang's jsonpb.
So for consistency with other libraries and improved usability, I propose we deserialize well known types in this way.
Why are the changes needed?
Improved usability and consistency with other deserialization libraries.
Does this PR introduce any user-facing change?
Yes, deserialization of well known types will change from the struct format to the inline format.
How was this patch tested?
Added unit tests that explicitly test deserialization of every well known wrapper type, as well as round-trip tests.