
[SPARK-44001][Protobuf] spark protobuf: handle well known wrapper types #41498

Closed
justaparth wants to merge 1 commit into apache:master from justaparth:parth/improve-handling-of-well-known-wrapper-types

Conversation


justaparth (Contributor) commented Jun 7, 2023

Under com.google.protobuf, there are some well-known wrapper types, useful for distinguishing between the absence of primitive fields and their default values, as well as for use within google.protobuf.Any types. These types are:

DoubleValue
FloatValue
Int64Value
UInt64Value
Int32Value
UInt32Value
BoolValue
StringValue
BytesValue

Currently, when we deserialize these from a serialized protobuf into a Spark struct, we expand them as if they were normal messages. Concretely, if we have:

syntax = "proto3";

import "google/protobuf/wrappers.proto"

message WktExample {
  google.protobuf.BoolValue bool_val = 1;
  google.protobuf.Int32Value int32_val = 2;
}

And a message like

WktExample(true, 100)

Then the behavior today is to deserialize this as:

{"bool_val": {"value": true}, "int32_val": {"value": 100}}

This is quite difficult to work with and not in the spirit of the wrapper types, so it would be nice to deserialize it as:

{"bool_val": true, "int32_val": 100}

This is also the behavior of other popular deserialization libraries, including Java protobuf-util's JsonFormat and Golang's jsonpb.

So, for consistency with other libraries and improved usability, I propose we deserialize well-known wrapper types in this way.
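For illustration (not part of this PR), a minimal sketch of how this would look from the Spark side, assuming a DataFrame with a binary value column holding serialized WktExample messages; the helper name and descriptor path are hypothetical:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

// Hypothetical helper: `df` is assumed to carry serialized WktExample bytes in
// a binary `value` column, and the .desc path is a placeholder.
def parseWkt(df: DataFrame): DataFrame =
  df.select(
    from_protobuf(col("value"), "WktExample", "/path/to/wkt_example.desc").as("msg"))

// Today:    msg.bool_val has type STRUCT<value: BOOLEAN>
// Proposed: msg.bool_val has type BOOLEAN and msg.int32_val has type INT,
//           surfacing as NULL when the wrapper field is unset on the wire.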

Why are the changes needed?

Improved usability and consistency with other deserialization libraries.

Does this PR introduce any user-facing change?

Yes, deserialization of well-known wrapper types will change from the struct format to the inline format.

How was this patch tested?

Added unit tests exercising deserialization of every well-known wrapper type explicitly, as well as a roundtrip test.

justaparth force-pushed the parth/improve-handling-of-well-known-wrapper-types branch from 1dc4818 to b262c68 on June 7, 2023 15:39
justaparth force-pushed the parth/improve-handling-of-well-known-wrapper-types branch from b262c68 to 7279a91 on June 7, 2023 15:41
LuciferYang (Contributor) commented:

after SPARK-43921, the .desc file should be generated at build time

justaparth (Contributor, Author) replied:

> after SPARK-43921, the .desc file should be generated at build time

Ah, thanks! I didn't realize that had landed; let me rebase and remove the descriptor file.

justaparth changed the title from "spark protobuf: handle well known wrapper types" to "[SPARK-44001][Protobuf] spark protobuf: handle well known wrapper types" on Jun 7, 2023
justaparth force-pushed the parth/improve-handling-of-well-known-wrapper-types branch from 7279a91 to 1288b48 on June 7, 2023 15:54
justaparth (Contributor, Author) commented:

cc @rangadi @SandishKumarHN would y'all be willing to take a look? thanks!

Handle parsing well known types as primitives rather than as structs.
justaparth force-pushed the parth/improve-handling-of-well-known-wrapper-types branch from 1288b48 to 622cb2b on June 7, 2023 15:58
updater.setBoolean(
  ordinal,
  dm.getField(
    dm.getDescriptorForType.findFieldByName("value")).asInstanceOf[Boolean])

Can use BoolValue.getDescriptor, right?
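For reference, a hedged sketch of what that suggestion could look like (the object and method names here are hypothetical, not from the PR's diff):

import com.google.protobuf.{BoolValue, DynamicMessage}

object BoolValueUnwrap {
  // Resolve the "value" field once via the generated BoolValue class.
  private val valueField = BoolValue.getDescriptor.findFieldByName("value")

  // Caveat: DynamicMessage.getField expects a FieldDescriptor belonging to the
  // message's own descriptor type, so this works only if `dm` was built against
  // BoolValue's linked-in descriptor rather than one loaded from a .desc file.
  def unwrap(dm: DynamicMessage): Boolean =
    dm.getField(valueField).asInstanceOf[Boolean]
}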

val jsonStr = jsonPrinter.print(value.asInstanceOf[DynamicMessage])
updater.set(ordinal, UTF8String.fromString(jsonStr))

// Handle well known wrapper types. We unpack the value field instead of keeping it as a struct.

> Under com.google.protobuf, there are some well-known wrapper types, useful for distinguishing between the absence of primitive fields and their default values

This says the purpose is to find out whether the value is set or just the default. How does this PR provide that functionality? With this PR, a user can't distinguish between an int32 int_value and an Int32Value int_value.

It will be good to have a Spark example where this helps.

I am not yet sure about including a lot of customization like this.


Also, how do we handle backward compatibility?

justaparth (Contributor, Author) replied:

> This says the purpose is to find out whether the value is set or just the default

This is the purpose of the wrapper types, not of this PR.

This PR is trying to respect how wrapper types should be interpreted when deserialized into other formats. The documentation for the types indicates this behavior explicitly for JSON, e.g. for BytesValue https://protobuf.dev/reference/protobuf/google.protobuf/#bytes-value

> Wrapper message for double.
>
> The JSON representation for DoubleValue is JSON number.

The source definition for the messages also says this: https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/wrappers.proto

And if you look at JsonFormat in the Java protobuf utils or Golang's JSON library, they also make sure to special-case these so that they're not expanded as messages, but rather treated like primitives.

In this PR I do the same thing when deserializing to a Spark struct, as these wrapper types are intended to be treated like primitives.

> Also, how do we handle backward compatibility?

Currently, this PR represents a change in behavior. Open to discussion of course, but I think the current behavior is inconsistent with the intention of the types / other libraries, so we should make this the default behavior rather than hide it behind an option. We could introduce an option to allow expanding these as messages if users want to preserve that kind of behavior.


> This is the purpose of the wrapper types, not of this PR.

Better for Spark to preserve the same information, right?
There is a reason why it was used in the user's Protobuf message; it might be simpler for Spark to preserve that.

If the justification is just convenience, i am not sure we need to do that.

> I think that the behavior currently is inconsistent with the intention of the types /

I don't quite see the inconsistency. The original purpose in Protobuf is to distinguish unset vs default value (as you mentioned). Spark's handling preserves that.
Just because the JSON serializer does this does not imply Spark should do the same.
What does the Java API for this look like?


Essentially, what is a compelling case other than convenience? The Protobuf documentation says this is mainly for use with Any fields, and for the other use case, we can use optional int32.


Not sure I follow. This is a serde between Protobuf and Spark structs. Consumers and producers are expected to know the schema.

Can we have a concrete example where this makes a difference? What problem are we solving? Wrapper types are used because the wrapper is important; otherwise there is no need to use it. I don't see how stripping the wrapper is the right thing.
These are just utilities, not a Protobuf spec.

Did you check the generated Java code? It treats these just as other Protobuf messages. There is no special treatment. Why should Spark be different?

Can we have a fully spelled out example in Spark that shows the benefits?

justaparth (Contributor, Author) commented Jun 8, 2023:

> Can we have a concrete example where this makes a difference? What problem are we solving?
> Can we have a fully spelled out example in Spark that shows the benefits?

Did you check the PR description? There's an example of what the deserialization would look like before and after. Beyond that, I'm not sure what kind of example would help... Could you be specific about what you're looking for here?

> Wrapper types are used because the wrapper is important, otherwise no need to use it. I don't see how stripping the wrapper is the right thing.
> Can we have a fully spelled out example in Spark that shows the benefits?

Let me give you a concrete example from the existing code. There are well-known Timestamp and Duration types, and the spark-protobuf library handles them in a custom way, converting them to TimestampType / DayTimeIntervalType in Spark.

proto message in tests: https://github.com/apache/spark/blob/master/connector/protobuf/src/test/resources/protobuf/functions_suite.proto#L167-L175

schema conversion logic: https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala#L79-L90

deserialization logic: https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala#L229-L247

By your argument above, we should NOT be doing this; we should treat them as messages and unpack them as structs.

However, the library is choosing to make a decision based on the semantic meaning of those well-known types. I.e. the library is saying "I know what these message types are meant to be and I will deserialize them differently."

In the same way, the rest of the wrapper types are not simply opaque messages; they are intended to feel like primitives, and I think we should handle the rest of the well-known types just like we're handling Timestamp / Duration.

(As an aside: the matching for Timestamp and Duration seems overly broad; it'll catch any message type that just happens to be named Timestamp or Duration, which is likely not what we want. And the tests aren't referencing the google protobuf classes, but an exact copy of them committed to the repo. I can send a separate PR for consideration of how to fix that.)

> These are just utilities, not a Protobuf spec.

Protobuf actually defines how the JSON output should look, so that is part of the spec. But I agree they never said the words "a Spark struct should look like XXX". I just think that a Spark struct is way closer to JSON (it's a data format, not code) than Java classes or something, and so JSON provides a good guide. And also, like I said above, the existing code is already handling some well-known types in a special way.
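To make the analogy concrete, here is a hedged sketch (not this PR's actual diff; the function name is hypothetical) of what such special-casing could look like on the schema-conversion side, mirroring the Timestamp/Duration handling linked above:

import org.apache.spark.sql.types._

// Map a wrapper message's full name to a Catalyst primitive; None falls
// through to the normal struct expansion.
def wrapperTypeFor(fullName: String): Option[DataType] = fullName match {
  case "google.protobuf.BoolValue"   => Some(BooleanType)
  case "google.protobuf.Int32Value"  => Some(IntegerType)
  case "google.protobuf.UInt32Value" => Some(IntegerType) // unsigned handling is a separate design choice
  case "google.protobuf.Int64Value"  => Some(LongType)
  case "google.protobuf.UInt64Value" => Some(LongType)    // same caveat for unsigned
  case "google.protobuf.FloatValue"  => Some(FloatType)
  case "google.protobuf.DoubleValue" => Some(DoubleType)
  case "google.protobuf.StringValue" => Some(StringType)
  case "google.protobuf.BytesValue"  => Some(BinaryType)
  case _                             => None
}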

justaparth (Contributor, Author) commented Jun 8, 2023:

Also, one point that may not have been clear: these wrapper types for primitives are not inherently useful as "messages"; rather, they exist for the purpose of distinguishing between unset and 0 in proto3, before the existence of optional.

As an example

syntax = "proto3";

message A {
  int64 val = 1;
}

and then

A()
A(0)

are not distinguishable from one another because default values are not serialized to the wire for primitives in proto3.

But if you had

syntax = "proto3";

message A {
  google.protobuf.Int64Value val = 1;
}

then

A()
A(Int64Value(0))

now become distinguishable from one another.

Basically, it's not that nesting primitives inside a message wrapper is useful for its own sake; the purpose is simply to disambiguate the case above (and you're right that optional in newer versions of protobuf basically solves this problem).
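(For what it's worth, Spark struct fields are nullable, so the proposed unwrapping can still preserve this distinction at the column level; a sketch of the expected rows, not output from this PR:)

// Hypothetical expected column values for message A { google.protobuf.Int64Value val = 1; }
// under the proposed unwrapping:
//   A()              -> Row(val = null)  // wrapper absent on the wire
//   A(Int64Value(0)) -> Row(val = 0L)    // wrapper present, explicit zero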


> Did you check the PR description? There's an example of what the deserialization would look like before and after. Beyond that, I'm not sure what kind of example would help... Could you be specific about what you're looking for here?

I know what the PR does. I'm not looking for example usage; I'm looking for a motivating example that shows the need for this feature.

pang-wu commented Jun 9, 2023:

So the motivating example is: if someone wants to convert the struct generated by Spark to JSON and compare (or maintain the compatibility of) that JSON with another JSON generated from the same protobuf message using Go or Java's JsonFormat, they are not able to assert equality, because Spark doesn't follow the spec when translating well-known types.
A team that wants to do such a comparison has to write a custom converter, and writing such a converter is painful (if not impossible) because 1) the type information is lost, and 2) even with the type info, there is no easy way to know at what level we should get rid of the struct and replace it with a scalar -- this is a real use case we are running into.
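A hedged sketch of the mismatch being described, reusing the WktExample message from the PR description (the to_json call is standard Spark; the parsed DataFrame and its msg column are assumptions):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_json}

// JsonFormat (Java) and jsonpb (Go) render WktExample(true, 100) as:
//   {"boolVal": true, "int32Val": 100}
// while to_json over the struct Spark produces today yields:
//   {"bool_val":{"value":true},"int32_val":{"value":100}}
def structAsJson(parsed: DataFrame): DataFrame =
  parsed.select(to_json(col("msg")).as("json"))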

github-actions (bot) commented:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label on Sep 18, 2023
github-actions bot closed this on Sep 19, 2023
