Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-7274] Infer a Beam Schema from a protocol buffer class. #10356

Merged
merged 3 commits into from Dec 21, 2019

Conversation

reuvenlax
Copy link
Contributor

This utility infers a Beam schema from a protocol buffer type. LogicalType is used for types that not natively represented in Beam schemas.

@reuvenlax
Copy link
Contributor Author

R: @alexvanboxel

@reuvenlax
Copy link
Contributor Author

Run Python PreCommit

@reuvenlax
Copy link
Contributor Author

Friendly ping. @alexvanboxel do you have any thoughts on this PR?

@@ -87,4 +87,56 @@ message LogicalType {
string urn = 1;
bytes payload = 2;
FieldType representation = 3;
FieldType argument_type = 4;
FieldValue argument = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why you extended the proto schema with a complete value type system... but this feels complex. I think options (although in very early draft) would be a better.

If you look at my draft (not ready): https://github.com/apache/beam/pull/10413/files#diff-9a1bd3d0a6a4b8228be12b5cab80bf60R92

I would think that the option message has a binary payload. This payload would be the same binary representation of a field, so this would be the same Row based type system, instead of another encoding for the Logical Payload.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done as a continuation of the previous PR that was already merged - the changes made in that PR were not reflected in the portability protos get. Since the proto is used as the arg for the OneOf, we needed it to be reflected in the proto for completeness.

This is also done to enable your option work. I think making the payload binary there is the wrong approach - part of the point of the proto is to be easily introspectible (i.e. you should be able to see the option value by simply examining the proto, with no need to parse a binary option value).

@@ -35,85 +42,99 @@
private static final String URN_BEAM_LOGICAL_DECIMAL = "beam:logical_type:decimal:v1";
private static final String URN_BEAM_LOGICAL_JAVASDK = "beam:logical_type:javasdk:v1";

public static SchemaApi.Schema schemaToProto(Schema schema) {
public static SchemaApi.Schema schemaToProto(Schema schema, boolean serializeLogicalType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand when you do NOT want to serializeLogicalTypes. Is this for backward compatibility ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a temporary hack to make OneOf comparison work, since right now the logical type serialization uses Java Serialization which makes the protos not binary comparable. Once we have a better serialization of logical types this can go away. Also once we support recursive schemas (in progress), then OneOf no longer needs to use the proto as its arg.

-I$PROTO_INCLUDE \
--descriptor_set_out=sdks/java/extensions/protobuf/src/test/resources/org/apache/beam/sdk/extensions/protobuf/test_option_v1.pb \
--include_imports \
sdks/java/extensions/protobuf/src/test/resources/test/option/v1/simple.proto
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this file leaked from my commit. The simpe.proto is not included in this commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove this file from the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@@ -240,4 +261,94 @@ private static FieldType fieldTypeFromProtoWithoutNullable(SchemaApi.FieldType p
"Unexpected type_info: " + protoFieldType.getTypeInfoCase());
}
}

public static SchemaApi.Row rowToProto(Row row) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark as before. The is another way of decoding a row to embed in a schema. I'm wondering if we can reuse the code that (de)serializes the row as an element? That was my first though in my Option prototype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - I don't think we want to use SchemaCoder here. The advantage of serializing directly into the proto is that it makes the proto introspectable. Dumping binary blobs into the proto makes the proto much harder to use and to debug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll as soon as the PR is merged in, I'll rebase the the option work and use .

@alexvanboxel
Copy link
Contributor

After a good night sleep I can understand a schema specific encoding for Row: if you disconnect it from the element implementation you leave room for element specific optimizations (example: process elements in bundles with Arrow). But we need to realize it adds another row encoding that the other languages need to implement.

@reuvenlax
Copy link
Contributor Author

You are correct, that this requires other language to implement this parsing as well. However I think the visibility advantage of having a fully-represented proto (v.s. just embedding a bytes field in a proto) is worth that tax - and it shouldn't be a huge tax on Beam SDKs (it only took me about 30-40 minutes to write the code here)

@reuvenlax
Copy link
Contributor Author

@alexvanboxel let me know if you have more thoughts here or if this looks good.

One more comment - once your options work is in, we should switch my use of field metadata over to the structured options approach.

@alexvanboxel
Copy link
Contributor

@alexvanboxel let me know if you have more thoughts here or if this looks good.

One more comment - once your options work is in, we should switch my use of field metadata over to the structured options approach.

I've done a second pass, to verify use of logicaltypes and schema. I've added a note in my todo list for the serializeLogicalType boolean for later so we don't forget.

LGTM

Copy link
Contributor

@alexvanboxel alexvanboxel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@reuvenlax reuvenlax merged commit a90336b into apache:master Dec 21, 2019
JozoVilcek pushed a commit to JozoVilcek/beam that referenced this pull request Feb 21, 2020
vmarquez pushed a commit to vmarquez/beam that referenced this pull request Apr 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants