Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avro union support #10505

Merged
merged 9 commits into from
Jul 7, 2021
Merged

Conversation

josephglanville
Copy link
Contributor

@josephglanville josephglanville commented Oct 10, 2020

Fixes #10493

Description

Implements better support for Avro unions in the Avro extensions.
Currently when ingesting data that contains unions unless the union is always the same value it will return mixed type results.
This PR addresses the problem by exploding union fields into maps keyed by the union member type or type name in the case of named types (enums, fixed, records).
The method was chosen as it's similar to what is done on other systems that support ingesting Avro data, such as Google BigQuery which details their Avro compatibility here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • AvroFlattenerMaker
  • GenericAvroJsonProvider

@josephglanville josephglanville marked this pull request as draft October 10, 2020 23:45
@stale
Copy link

stale bot commented Dec 12, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Dec 12, 2020
@josephglanville josephglanville changed the title WIP: Avro union support Avro union support Mar 22, 2021
@stale
Copy link

stale bot commented Mar 22, 2021

This pull request/issue is no longer marked as stale.

@stale stale bot removed the stale label Mar 22, 2021
@josephglanville josephglanville marked this pull request as ready for review March 22, 2021 10:54
@clintropolis
Copy link
Member

This seems like a reasonable addition 👍, could you also add support to the avro streaming input format that was added by #11040?

It does look like it might be hard to make the coverage bot happy without testing a lot of different union types, but maybe that is a good thing in this case if not too tedious.

@josephglanville
Copy link
Contributor Author

@clintropolis sure, I will get around to it sometime this week.

@josephglanville
Copy link
Contributor Author

@clintropolis I rebased this and added support in the new InputFormat and extended the tests to cover all the union member types, ready for another review.

@josephglanville
Copy link
Contributor Author

Looks like only the spell check failed. Added to the spelling file and also tweaked the docs a little bit to correct a missing type and be more specific about how named types are handled.

@josephglanville
Copy link
Contributor Author

@clintropolis would you be able to take another look at this? I think it's ready to merge.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me, the only thing I'm concerned about is whether or not explodeUnions is the best name for this, mainly the explode part of the name, which in my mind i associate with something like #8698, which was an attempt to add a feature that could take an input row that had an array and produce multiple rows with the scalar values of the input array.

Perhaps convertUnions or extractUnions would be a better name? Idk, naming is hard 😓

@josephglanville
Copy link
Contributor Author

I don't feel strongly about the name. I'm happy with extractUnions. Will make the changes later today.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I missed this before, but we probably should also update https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md#avro-stream and the other tables that describe the input formats/parsers to include this new parameter

The newer mode can be enabled by setting `extractUnions` on the Avro parser in which case unions will be expanded according to the following rules:
* Primitive types and unnamed complex types are keyed their type name. i.e `int`, `string`
* Complex named types are keyed by their names, this includes `record`, `fixed` and `enum`.
* The Avro null type is elided as it's value can only ever be null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: i think it's -> its

@@ -195,4 +210,46 @@ public Object unwrap(final Object o)
{
return o;
}

private boolean isExplodableUnion(final Schema.Field field)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: still using explode terminology

`union` types which aren't of the form `[null, otherType]` aren't supported at this time.
#### Unions
Druid has two modes for supporting `union` types. The original legacy mode can only support unions of the form `[null, otherType]`.
The newer mode can be enabled by setting `extractUnions` on the Avro parser in which case unions will be expanded according to the following rules:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I looked a lot closer at this than I did on a previous pass, and I think the old docs were sort of wrong. Using the example someMultiMemberUnion type in this PR, I can still use a flatten spec to extract the values of any type with the existing code, apparently even including for record types, where the extraction path is $.someMultiMemberUnion.subString (instead of $.someMultiMemberUnion.UnionSubRecord.subString as in the mode added in this PR).

As such, with my better understanding I think it makes sense to instead call this new property extractUnionsByType or something similar, and clarify that this new mode requires using a flatten spec to extract the values, but with the benefit that you can selectively extract values of only a certain type so that they can be mapped to separate Druid columns or whatever. I also don't think it necessarily makes sense to refer to the other mode as legacy, since I guess it still has a use if the union is composed mainly of primitive types and all are able to be coerced into a common Druid type, or if it is a simple union type of the legacy form, which the new mode does not effect (since the isUnion code checks for more than 1 non-null type).

Sorry I didn't look closer into this previously and for the review churn, my bad.

Also because the new mode needs to be link to the flatten spec docs, maybe it makes sense to just move the unions description entirely into the complex types section, which also seems to mirror the Avro specification docs https://avro.apache.org/docs/current/spec.html#schema_complex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I think you are right and I will make these changes next week.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a spelling error causing a CI failure, https://travis-ci.com/github/apache/druid/jobs/512959568#L759, but other than that, lgtm 👍

@clintropolis clintropolis merged commit d5e8d4d into apache:master Jul 7, 2021
jihoonson pushed a commit to jihoonson/druid that referenced this pull request Jul 12, 2021
* Avro union support

* Document new union support

* Add support for AvroStreamInputFormat and fix checkstyle

* Extend multi-member union test schema and format

* Some additional docs and add Enums to spelling

* Rename explodeUnions -> extractUnions

* explode -> extract

* ByType

* Correct spelling error
@josephglanville josephglanville deleted the jpg/avro-union-support branch July 15, 2021 07:19
zachjsh added a commit to zachjsh/druid that referenced this pull request Jul 15, 2021
Adds back test coverage for Avro flattener that was mistakenly removed in apache#10505. Recfactored the tests a bit too.
suneet-s pushed a commit that referenced this pull request Jul 21, 2021
* Add back missing unit test coverage in AvroFlattenerMakerTest

Adds back test coverage for Avro flattener that was mistakenly removed in #10505. Recfactored the tests a bit too.

* resolve checkstyle warnings
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avro Union type support
2 participants