[BEAM-7274] In preparation for protocol-buffer schemas, add OneOf and Enumeration logical types #10247

Merged
merged 2 commits into from Dec 5, 2019

reuvenlax
Contributor

Add new LogicalTypes to represent enumerations and OneOf types. In order to implement this, we extended the LogicalType argument to be able to have a Schema so that we don't have to encode complicated arguments inside of strings.
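For readers unfamiliar with the idea, here is a minimal, library-free sketch of what an enumeration logical type does: the user-facing value is a name, while the stored base value is an integer. The class `EnumerationSketch` and its methods `of`, `toOrdinal`, and `toName` are hypothetical names chosen for illustration; this is not the `EnumerationType` added in this PR, and it assumes values are numbered in declaration order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an enumeration logical type; not Beam's EnumerationType. */
final class EnumerationSketch {
  private final List<String> names; // ordinal -> name
  private final Map<String, Integer> ordinals = new HashMap<>(); // name -> ordinal

  private EnumerationSketch(List<String> names) {
    this.names = Collections.unmodifiableList(new ArrayList<>(names));
    for (int i = 0; i < this.names.size(); i++) {
      // Assumption: ordinals follow declaration order, starting at 0.
      if (ordinals.put(this.names.get(i), i) != null) {
        throw new IllegalArgumentException("Duplicate enum name: " + this.names.get(i));
      }
    }
  }

  static EnumerationSketch of(String... names) {
    return new EnumerationSketch(Arrays.asList(names));
  }

  /** Converts a name to the integer that would be stored in the row. */
  int toOrdinal(String name) {
    Integer ordinal = ordinals.get(name);
    if (ordinal == null) {
      throw new IllegalArgumentException("Unknown enum name: " + name);
    }
    return ordinal;
  }

  /** Converts a stored integer back to its name. */
  String toName(int ordinal) {
    if (ordinal < 0 || ordinal >= names.size()) {
      throw new IllegalArgumentException("Unknown enum ordinal: " + ordinal);
    }
    return names.get(ordinal);
  }
}
```

With `EnumerationSketch.of("RED", "GREEN", "BLUE")`, `toOrdinal("GREEN")` yields 1 and `toName(2)` yields "BLUE".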

R:@alexvanboxel

@reuvenlax
Contributor Author

run sql postcommit

@reuvenlax
Contributor Author

run java precommit

}

/** This {@link LogicalType} represents an enumeration over a fixed set of values. */
public static class EnumerationType implements LogicalType<EnumerationValue, Integer> {
Contributor

I see the collection of LogicalTypes growing. Wouldn't it start to make sense to have them in their own package, e.g. org.apache.beam.sdk.schema.logicaltypes?

Contributor Author

Good idea. Done.


oneOfSchema = Schema.builder().addFields(nullableFields).build();
schemaProtoRepresentation = SchemaTranslation.schemaToProto(oneOfSchema).toByteArray();
this.oneOfCaseFieldId = oneOfSchema.indexOf(enumField);
Contributor

I'm wondering whether the oneOfCase needs to be stored in the row, since it can be inferred from the rest of the fields:

  • the non-null value is the selected oneOf value
  • if all fields are null, the oneOf is NULL

This is how proto does it on the wire. It depends on what matters more: data size (the extra field) or case-lookup performance.

Contributor Author

This makes sense. Removed from the row representation.
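The inference suggested in the review comment above can be sketched as follows. This is only an illustration under stated assumptions: the class `OneOfCaseInference` and the method `inferCase` are hypothetical, the oneOf's nullable fields are modelled as a plain list of values, and nothing here is taken from the merged code.

```java
import java.util.List;
import java.util.OptionalInt;

/** Hypothetical helper; models the nullable fields of a oneOf as a list of values. */
final class OneOfCaseInference {
  /**
   * Returns the index of the single non-null field, or an empty result when
   * every field is null (the oneOf itself is unset).
   */
  static OptionalInt inferCase(List<?> fieldValues) {
    OptionalInt found = OptionalInt.empty();
    for (int i = 0; i < fieldValues.size(); i++) {
      if (fieldValues.get(i) != null) {
        if (found.isPresent()) {
          // A oneOf may have at most one field set.
          throw new IllegalArgumentException(
              "More than one oneOf field is set: " + found.getAsInt() + " and " + i);
        }
        found = OptionalInt.of(i);
      }
    }
    return found;
  }
}
```

The scan is linear in the number of fields, which is the performance cost the comment weighs against storing an extra case field in every row.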

@@ -181,6 +183,14 @@ public Builder addBooleanField(String name) {
return this;
}

public Builder addEnumerationField(String name, EnumerationType enumerationType) {
Contributor

Does it make sense to add these methods as first-class adders to Schema? This feels like it breaks consistency, since they are logical types.

Contributor Author

Originally I left them out, but then decided that these are fairly "basic" logical types (they exist in proto, Avro, etc.), so it was worth adding this method as syntactic sugar. I don't feel very strongly about it either way, though.

@@ -208,6 +210,26 @@ public Boolean getBoolean(String fieldName) {
return getMap(getSchema().indexOf(fieldName));
}

/** Helper method to get an enum field value. */
public EnumerationValue getEnumValue(String fieldValue) {
Contributor

Same remark. Are they first class citizens to merit extra methods?

Contributor Author

See above (also keep in mind that we want to make DATETIME a logical type as well). However, I don't feel strongly one way or the other about this.

Contributor

Hmm, yes. But then we should keep it to the basic, well-known types. Enum and DateTime make sense, but OneOf is not so common.

Contributor Author

IMO unions are pretty basic. However, I removed these methods for now, as it's not difficult to simply access the value as a logical type.

@reuvenlax
Contributor Author

run java precommit

@reuvenlax
Contributor Author

Also added support for enums in the java type mapping. This means that POJOs or JavaBeans (or AutoValues, or any similar type) with Java enums will map to the new EnumerationType.
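As a rough illustration of what such a mapping involves, the sketch below derives a name-to-number table from a Java enum using its declaration-order ordinals. The class `JavaEnumMapping`, the method `enumValues`, and the sample enum `Color` are hypothetical. Whether the merged code uses `ordinal()` in exactly this way is an assumption; the later discussion of ordinals in this thread suggests it, but this page does not show the implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not the type-mapping code from this PR. */
final class JavaEnumMapping {
  /** Sample enum used only for illustration. */
  enum Color {
    RED,
    GREEN,
    BLUE
  }

  /** Builds a name -> ordinal table for any Java enum, in declaration order. */
  static <E extends Enum<E>> Map<String, Integer> enumValues(Class<E> enumClass) {
    Map<String, Integer> values = new LinkedHashMap<>();
    for (E constant : enumClass.getEnumConstants()) {
      // Assumption: the stored integer is the enum's declaration-order ordinal.
      values.put(constant.name(), constant.ordinal());
    }
    return values;
  }
}
```

For `Color`, the resulting table maps RED to 0, GREEN to 1, and BLUE to 2.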

@reuvenlax
Contributor Author

Run Python2_PVR_Flink PreCommit

@reuvenlax
Contributor Author

Note: fixing the Avro issue requires a bit of refactoring. I already have this refactor in another PR, since protos require it, so I'll send that out now.

@alexvanboxel
Contributor

> Note: fixing the Avro issue requires a bit of refactoring. I already have this refactor in another PR, since protos require it, so I'll send that out now.

I was indeed wondering how Avro would be handled. It all looks good to me.

@reuvenlax
Contributor Author

Added Enum support to Avro. Adding support for generic Avro unions proved complicated (mostly because the Avro compiler represents them using an Object type), so that is best left for another PR. Also, it appears that most usages of Avro unions represent nullable types, which we already handle.

@reuvenlax
Contributor Author

@alexvanboxel I added Avro support for enums, and have now fixed CheckStyle errors. Let me know how this looks to you.

@alexvanboxel
Contributor

OK, it took a while, but I wanted to verify something. I've been experimenting with how Avro compiles to Java code and how the ordinal value is created. I wanted to check the following scenario:

{ "name": "foo", "type": { "name": "FooEnum", "type": "enum", "symbols": ["x","y"] }},

Then the schema evolves to:

{ "name": "foo", "type": { "name": "FooEnum", "type": "enum", "symbols": ["a", "x", "y"] }},

We upgrade the pipeline, so its current state is preserved. Because the ordinal value is stored in the intermediate state and the schema evolved by prepending an enum value, the ordinal values change: they are assigned sequentially in the order defined in the source file, so x=0, y=1 becomes a=0, x=1, y=2.

I notice that the same holds for the Java Enum type, but not for proto, which assigns an explicit number to each enum name.

I don't know if it's an issue... but I wanted to raise it so you're aware of the side effect. Is a pipeline even upgradable when the schema changes?
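The scenario above can be reproduced without Avro at all. The sketch below is a plain-Java illustration, assuming (as the comment describes) that ordinals are list positions in declaration order; the class `OrdinalShiftDemo` and the method `decode` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo of ordinals shifting when a symbol is prepended. */
final class OrdinalShiftDemo {
  /** Looks up the symbol at a stored ordinal; ordinals are list positions. */
  static String decode(List<String> symbols, int ordinal) {
    return symbols.get(ordinal);
  }

  public static void main(String[] args) {
    List<String> oldSymbols = Arrays.asList("x", "y");
    List<String> newSymbols = Arrays.asList("a", "x", "y");

    // State written by the old pipeline stores "x" as ordinal 0.
    int stored = oldSymbols.indexOf("x");

    System.out.println(decode(oldSymbols, stored)); // prints "x"
    System.out.println(decode(newSymbols, stored)); // prints "a"
  }
}
```

The stored integer is unchanged, but the upgraded pipeline reads it back as a different symbol.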

@reuvenlax
Contributor Author

Right now we don't support evolving schemas. There is a plan for this (it still needs to be discussed on the dev list), but essentially it would require the new pipeline to preserve the old job's ordinal values as much as possible, instead of simply assigning them in order.
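One possible reading of "preserve the old job's ordinal values as much as possible" is sketched below. The plan itself is not described in this thread, so everything here is an assumption made for illustration: the class `OrdinalPreservingAssignment`, the method `assign`, and the policy of giving new symbols the next number above the old maximum.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not a design taken from this PR or from the dev list. */
final class OrdinalPreservingAssignment {
  /**
   * Keeps the old ordinal for every symbol that already existed and gives each
   * new symbol the next number above the largest old ordinal.
   */
  static Map<String, Integer> assign(Map<String, Integer> oldOrdinals, List<String> newSymbols) {
    int next = 0;
    for (int used : oldOrdinals.values()) {
      next = Math.max(next, used + 1);
    }
    Map<String, Integer> result = new LinkedHashMap<>();
    for (String symbol : newSymbols) {
      Integer old = oldOrdinals.get(symbol);
      if (old != null) {
        result.put(symbol, old); // existing symbol keeps its stored number
      } else {
        result.put(symbol, next++); // new symbol gets a fresh number
      }
    }
    return result;
  }
}
```

With old ordinals x=0, y=1 and new symbols a, x, y, this yields a=2, x=0, y=1, so previously stored values still decode to the same symbols.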

@alexvanboxel
Contributor

> Right now we don't support evolving schemas. There is a plan for this (it still needs to be discussed on the dev list), but essentially it would require the new pipeline to preserve the old job's ordinal values as much as possible, instead of simply assigning them in order.

OK, then it's not an issue for now.

@reuvenlax
Contributor Author

Run Java PreCommit

@reuvenlax reuvenlax merged commit 5124385 into apache:master Dec 5, 2019
11moon11 pushed a commit to 11moon11/beam that referenced this pull request Dec 12, 2019
…col-buffer schemas, add OneOf and Enumeration logical types
dpcollins-google pushed a commit to dpcollins-google/beam that referenced this pull request Dec 20, 2019
…col-buffer schemas, add OneOf and Enumeration logical types
JozoVilcek pushed a commit to JozoVilcek/beam that referenced this pull request Feb 21, 2020
…col-buffer schemas, add OneOf and Enumeration logical types