New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-7274] Infer a Beam Schema from a protocol buffer class. #10356
[BEAM-7274] Infer a Beam Schema from a protocol buffer class. #10356
Conversation
Run Python PreCommit |
Friendly ping. @alexvanboxel do you have any thoughts on this PR? |
@@ -87,4 +87,56 @@ message LogicalType { | |||
string urn = 1; | |||
bytes payload = 2; | |||
FieldType representation = 3; | |||
FieldType argument_type = 4; | |||
FieldValue argument = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why you extended the proto schema with a complete value type system... but this feels complex. I think options (although in very early draft) would be a better.
If you look at my draft (not ready): https://github.com/apache/beam/pull/10413/files#diff-9a1bd3d0a6a4b8228be12b5cab80bf60R92
I would think that the option message has a binary payload. This payload would be the same binary representation of a field, so this would be the same Row based type system, instead of another encoding for the Logical Payload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was done as a continuation of the previous PR that was already merged - the changes made in that PR were not reflected in the portability protos get. Since the proto is used as the arg for the OneOf, we needed it to be reflected in the proto for completeness.
This is also done to enable your option work. I think making the payload binary there is the wrong approach - part of the point of the proto is to be easily introspectible (i.e. you should be able to see the option value by simply examining the proto, with no need to parse a binary option value).
@@ -35,85 +42,99 @@ | |||
private static final String URN_BEAM_LOGICAL_DECIMAL = "beam:logical_type:decimal:v1"; | |||
private static final String URN_BEAM_LOGICAL_JAVASDK = "beam:logical_type:javasdk:v1"; | |||
|
|||
public static SchemaApi.Schema schemaToProto(Schema schema) { | |||
public static SchemaApi.Schema schemaToProto(Schema schema, boolean serializeLogicalType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand when you do NOT want to serializeLogicalTypes. Is this for backward compatibility ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a temporary hack to make OneOf comparison work, since right now the logical type serialization uses Java Serialization which makes the protos not binary comparable. Once we have a better serialization of logical types this can go away. Also once we support recursive schemas (in progress), then OneOf no longer needs to use the proto as its arg.
-I$PROTO_INCLUDE \ | ||
--descriptor_set_out=sdks/java/extensions/protobuf/src/test/resources/org/apache/beam/sdk/extensions/protobuf/test_option_v1.pb \ | ||
--include_imports \ | ||
sdks/java/extensions/protobuf/src/test/resources/test/option/v1/simple.proto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this file leaked from my commit. The simpe.proto is not included in this commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe remove this file from the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
@@ -240,4 +261,94 @@ private static FieldType fieldTypeFromProtoWithoutNullable(SchemaApi.FieldType p | |||
"Unexpected type_info: " + protoFieldType.getTypeInfoCase()); | |||
} | |||
} | |||
|
|||
public static SchemaApi.Row rowToProto(Row row) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark as before. The is another way of decoding a row to embed in a schema. I'm wondering if we can reuse the code that (de)serializes the row as an element? That was my first though in my Option prototype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - I don't think we want to use SchemaCoder here. The advantage of serializing directly into the proto is that it makes the proto introspectable. Dumping binary blobs into the proto makes the proto much harder to use and to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll as soon as the PR is merged in, I'll rebase the the option work and use .
After a good night sleep I can understand a schema specific encoding for Row: if you disconnect it from the element implementation you leave room for element specific optimizations (example: process elements in bundles with Arrow). But we need to realize it adds another row encoding that the other languages need to implement. |
You are correct, that this requires other language to implement this parsing as well. However I think the visibility advantage of having a fully-represented proto (v.s. just embedding a bytes field in a proto) is worth that tax - and it shouldn't be a huge tax on Beam SDKs (it only took me about 30-40 minutes to write the code here) |
@alexvanboxel let me know if you have more thoughts here or if this looks good. One more comment - once your options work is in, we should switch my use of field metadata over to the structured options approach. |
I've done a second pass, to verify use of logicaltypes and schema. I've added a note in my todo list for the serializeLogicalType boolean for later so we don't forget. LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
… a protocol buffer class.
… a protocol buffer class.
This utility infers a Beam schema from a protocol buffer type. LogicalType is used for types that not natively represented in Beam schemas.