New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-7274] In preparation for protocol-buffer schemas, add OneOf and Enumeration logical types #10247
Conversation
8e45427
to
2a5eb79
Compare
run sql postcommit |
run java precommit |
} | ||
|
||
/** This {@link LogicalType} represent an enumeration over a fixed set of values. */ | ||
public static class EnumerationType implements LogicalType<EnumerationValue, Integer> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the collection of LogicalTypes growing. Wouldn't it start to make sense to have them in there own package, eg org.apache.beam.sdk.schema.logicaltypes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea - done
|
||
oneOfSchema = Schema.builder().addFields(nullableFields).build(); | ||
schemaProtoRepresentation = SchemaTranslation.schemaToProto(oneOfSchema).toByteArray(); | ||
this.oneOfCaseFieldId = oneOfSchema.indexOf(enumField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if the oneOfCase needs to be stored in the row as it can be inferred by the rest of the fields:
- the non-null value is the selected oneOf value
- if all fields are null the oneOf is NULL
this is how proto does it on the wire. Depends on what's important. Data size (extra field) or Case performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. Removed from the row representation
@@ -181,6 +183,14 @@ public Builder addBooleanField(String name) { | |||
return this; | |||
} | |||
|
|||
public Builder addEnumerationField(String name, EnumerationType enumerationType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to add these methods as first class add'ers to Schema? This feels like breaking consistency... as they are logical types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I left them out. But then decided that these are fairly "basic" logical types (e.g. they exist in proto, avro, etc.) so it was worth adding this method as syntactic sugar. I don't feel very strongly about it either way though.
@@ -208,6 +210,26 @@ public Boolean getBoolean(String fieldName) { | |||
return getMap(getSchema().indexOf(fieldName)); | |||
} | |||
|
|||
/** Helper method to get an enum field value. */ | |||
public EnumerationValue getEnumValue(String fieldValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark. Are they first class citizens to merit extra methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above (also keep in mind that we want to make DATETIME a logical type as well). However I don't feel strongly one way or the other about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm.. yes. But we should keep it to the basic well known types then. Enum, DateTime they make sense. But the OneOf, is not so common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO unions are pretty basic. However removed these methods for now, as it's not difficult to simply access as a logical type.
run java precommit |
Also added support for enums in the java type mapping. This means that POJOs or JavaBeans (or AutoValues, or any similar type) with Java enums will map to the new EnumerationType. |
Run Python2_PVR_Flink PreCommit |
Note: fixing the Avro issue requires refactoring a bit. This refactor I already have in another PR as protos require it, so I'll go send that out now. |
I was indeed wondering how this would Avro would be handled. For me it looks all good. |
Added Enum support to Avro. Adding support for generic Avro unions proved complicated (mostly because Avro compiler represents them using an Object type), so best left for another PR; also, it appears that most usages of Avro unions are to represent nullable types, which we already handle. |
@alexvanboxel I added Avro support for enums, and have now fixed CheckStyle errors. Let me know how this looks to you. |
OK, it took a while... but I've wanted to verify something. I've been experimenting with how Avro compiles to Java code and how the ordinal value is created. I just wanted to check out the following scenario:
we have a schema evolution to:
we upgrade the pipeline, so the current state is stored. As the ordinal value is stored in the intermediate state and the schema is evolved by pre-pending a enum value, that means the ordinal values changed (as it's sequential as defined in the source file: from I do notice the same holds through to the Java Enum type. Not to Proto as it has an explicit number value assigned to an Enum string. I don't know if it's an issue... but I wanted to raise it so you're aware of the side effect. Is a pipeline even upgradable when the schema changes? |
Right now we don't support evolving schemas. There is a plan (that needs to be discussed on the dev list for this): but essentially it would require the new pipeline to preserve the old job's ordinal values as much as possible, instead of simply taking them in order. |
1805e59
to
809e3c8
Compare
OK, then it's not an issue for now. |
Run Java PreCommit |
4 similar comments
Run Java PreCommit |
Run Java PreCommit |
Run Java PreCommit |
Run Java PreCommit |
…col-buffer schemas, add OneOf and Enumeration logical types
…col-buffer schemas, add OneOf and Enumeration logical types
…col-buffer schemas, add OneOf and Enumeration logical types
Add new LogicalTypes to represent enumerations and OneOf types. In order to implement this, we extended the LogicalType argument to be able to have a Schema so that we don't have to encode complicated arguments inside of strings.
R:@alexvanboxel