-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avro union support #10505
Avro union support #10505
Conversation
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions. |
This pull request/issue is no longer marked as stale. |
9dd87ad
to
d144d8e
Compare
This seems like a reasonable addition 👍, could you also add support to the avro streaming input format that was added by #11040? It does look like it might be hard to make the coverage bot happy without testing a lot of different union types, but maybe that is a good thing in this case if not too tedious. |
@clintropolis sure, I will get around to it sometime this week. |
1135107
to
a51214a
Compare
@clintropolis I rebased this and added support in the new InputFormat and extended the tests to cover all the union member types, ready for another review. |
Looks like only the spell check failed. Added to the spelling file and also tweaked the docs a little bit to correct a missing type and be more specific about how named types are handled. |
a5b5c04
to
d0031f4
Compare
@clintropolis would you be able to take another look at this? I think it's ready to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good to me, the only thing I'm concerned about is whether or not explodeUnions
is the best name for this, mainly the explode
part of the name, which in my mind i associate with something like #8698, which was an attempt to add a feature that could take an input row that had an array and produce multiple rows with the scalar values of the input array.
Perhaps convertUnions
or extractUnions
would be a better name? Idk, naming is hard 😓
I don't feel strongly about the name. I'm happy with |
d0031f4
to
ab1b27d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I missed this before, but we probably should also update https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md#avro-stream and the other tables that describe the input formats/parsers to include this new parameter
The newer mode can be enabled by setting `extractUnions` on the Avro parser in which case unions will be expanded according to the following rules: | ||
* Primitive types and unnamed complex types are keyed their type name. i.e `int`, `string` | ||
* Complex named types are keyed by their names, this includes `record`, `fixed` and `enum`. | ||
* The Avro null type is elided as it's value can only ever be null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: i think it's -> its
@@ -195,4 +210,46 @@ public Object unwrap(final Object o) | |||
{ | |||
return o; | |||
} | |||
|
|||
private boolean isExplodableUnion(final Schema.Field field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: still using explode terminology
`union` types which aren't of the form `[null, otherType]` aren't supported at this time. | ||
#### Unions | ||
Druid has two modes for supporting `union` types. The original legacy mode can only support unions of the form `[null, otherType]`. | ||
The newer mode can be enabled by setting `extractUnions` on the Avro parser in which case unions will be expanded according to the following rules: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I looked a lot closer at this than I did on a previous pass, and I think the old docs were sort of wrong. Using the example someMultiMemberUnion
type in this PR, I can still use a flatten spec to extract the values of any type with the existing code, apparently even including for record types, where the extraction path is $.someMultiMemberUnion.subString
(instead of $.someMultiMemberUnion.UnionSubRecord.subString
as in the mode added in this PR).
As such, with my better understanding I think it makes sense to instead call this new property extractUnionsByType
or something similar, and clarify that this new mode requires using a flatten spec to extract the values, but with the benefit that you can selectively extract values of only a certain type so that they can be mapped to separate Druid columns or whatever. I also don't think it necessarily makes sense to refer to the other mode as legacy, since I guess it still has a use if the union is composed mainly of primitive types and all are able to be coerced into a common Druid type, or if it is a simple union type of the legacy form, which the new mode does not effect (since the isUnion code checks for more than 1 non-null type).
Sorry I didn't look closer into this previously and for the review churn, my bad.
Also because the new mode needs to be link to the flatten spec docs, maybe it makes sense to just move the unions description entirely into the complex types section, which also seems to mirror the Avro specification docs https://avro.apache.org/docs/current/spec.html#schema_complex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, I think you are right and I will make these changes next week.
ab1b27d
to
3ebd657
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a spelling error causing a CI failure, https://travis-ci.com/github/apache/druid/jobs/512959568#L759, but other than that, lgtm 👍
* Avro union support * Document new union support * Add support for AvroStreamInputFormat and fix checkstyle * Extend multi-member union test schema and format * Some additional docs and add Enums to spelling * Rename explodeUnions -> extractUnions * explode -> extract * ByType * Correct spelling error
Adds back test coverage for Avro flattener that was mistakenly removed in apache#10505. Recfactored the tests a bit too.
* Add back missing unit test coverage in AvroFlattenerMakerTest Adds back test coverage for Avro flattener that was mistakenly removed in #10505. Recfactored the tests a bit too. * resolve checkstyle warnings
Fixes #10493
Description
Implements better support for Avro unions in the Avro extensions.
Currently when ingesting data that contains unions unless the union is always the same value it will return mixed type results.
This PR addresses the problem by exploding union fields into maps keyed by the union member type or type name in the case of named types (enums, fixed, records).
The method was chosen as it's similar to what is done on other systems that support ingesting Avro data, such as Google BigQuery which details their Avro compatibility here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions
This PR has:
Key changed/added classes in this PR
AvroFlattenerMaker
GenericAvroJsonProvider