overhaul 'druid-parquet-extensions' module, promoting from 'contrib' to 'core' #6360
Conversation
Force-pushed 11cac0d to d7bc3f0
Force-pushed d7bc3f0 to 8078e8c
FYI I'm working on refactoring/parameterizing tests to cut down on the amount of dupe and json, but haven't had the chance to finish yet.
Force-pushed 035cb7c to f80eaa7
public class ParquetGroupFlattenerMaker implements ObjectFlatteners.FlattenerMaker<Group>
{
  private static final MappingProvider DEFAULT_MAPPING_PROVIDER = new MappingProvider()
This looks the same as GenericAvroMappingProvider, can these be merged?
Oops, forgot about this. I was planning to make a default do nothing implementation since that is what these are both doing 👍
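For illustration, a sketch of what a shared do-nothing provider could look like, assuming json-path's MappingProvider interface; the class name here is hypothetical:

import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.TypeRef;
import com.jayway.jsonpath.spi.mapper.MappingProvider;

// Hypothetical shared default that both the avro and parquet flatteners could reuse.
// Mapping is never exercised by the flatteners' json-path usage, so it simply does nothing.
public class DoNothingMappingProvider implements MappingProvider
{
  @Override
  public <T> T map(Object source, Class<T> targetType, Configuration configuration)
  {
    return null;
  }

  @Override
  public <T> T map(Object source, TypeRef<T> targetType, Configuration configuration)
  {
    return null;
  }
}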
@Override
public Function<Group, Object> makeJsonQueryExtractor(String expr)
{
  return null;
I think it'd be better to throw an exception here
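One way to act on that suggestion, reworking the method quoted above (the exception message is illustrative, not the PR's exact wording):

@Override
public Function<Group, Object> makeJsonQueryExtractor(String expr)
{
  // Fail loudly instead of handing back a null extractor.
  throw new UnsupportedOperationException("Parquet flattener does not support JQ expressions");
}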
{
  return schema.getType().equals(Schema.Type.UNION) &&
         schema.getTypes().size() == 2 &&
         schema.getTypes().get(0).getType().equals(Schema.Type.NULL) &&
in the union types, is the NULL type guaranteed to appear before the actual type?
Hmm, I'm actually not certain it's guaranteed; from the spec:
Unions
Unions, as mentioned above, are represented using JSON arrays. For example, ["null", "string"] declares a schema which may be either a null or string.
(Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing "null", the "null" is usually listed first, since the default value of such unions is typically null.)
Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum. For example, unions containing two array types or two map types are not permitted, but two types with different names are permitted. (Names permit efficient resolution when reading and writing unions.)
Unions may not immediately contain other unions.
I'm actually not certain what would be best to do in this case, but since we didn't support nullable fields at all previously afaict maybe it's ok.
hm, from that it sounds like [primitive, null] is valid if the default is non-null then, can you make this check non-order dependent?
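A sketch of an order-independent version of that check, using the Avro Schema API (the method name is hypothetical):

import org.apache.avro.Schema;

// Accept a two-branch union containing a NULL type in either position,
// rather than requiring NULL to be listed first.
static boolean isNullableUnion(Schema schema)
{
  return schema.getType().equals(Schema.Type.UNION) &&
         schema.getTypes().size() == 2 &&
         schema.getTypes().stream().anyMatch(s -> s.getType().equals(Schema.Type.NULL));
}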
HadoopDruidIndexerConfig config = HadoopDruidIndexerConfig.fromConfiguration(context.getConfiguration());
ParseSpec parseSpec = config.getParser().getParseSpec();

// todo: this is kind of lame, maybe we can still trim what we read if we
hm, rather than parsing the flattenspec, maybe this could be supported with a "requiredFields" method on flatten specs, but I would remove the "todo" part for now
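Roughly, the suggestion amounts to something like the following hypothetical addition to the flatten spec API (names are illustrative):

import java.util.Set;

// Flatten specs would report the top-level fields they reference, so the input
// format could trim what it reads without parsing the spec itself.
public interface RequiredFieldsProvider
{
  Set<String> getRequiredFields();
}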
Finished initial review, the tests are somewhat unwieldy right now so I'll wait for your update there and do another review pass.
@Override
public boolean isArray(final Object o)
{
  if (o instanceof List) {
this can just be return (o instanceof List)
@Override
public boolean isMap(final Object o)
{
  if (o instanceof Map) {
could be (o instanceof Map) || (o instanceof Group)
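Condensing both suggestions above, the two methods could read:

@Override
public boolean isArray(final Object o)
{
  // Lists are the only array representation handled here.
  return o instanceof List;
}

@Override
public boolean isMap(final Object o)
{
  // Parquet groups are treated as maps for flattening purposes.
  return o instanceof Map || o instanceof Group;
}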
  return converter.unwrapListPrimitive(o);
} else if (o instanceof List) {
  List<Object> asList = (List<Object>) o;
  if (asList.stream().allMatch(ParquetGroupConverter::isWrappedListPrimitive)) {
Are there cases where such lists have some items "wrapped" but some items "unwrapped"?
I'm wondering if the allMatch isWrappedListPrimitive check is necessary, is it possible and correct to remove that pass and just unwrap anything that's wrapped in the list?
I don't know what is allowed to happen, but I think if the list is not homogeneous it is safer to not do this conversion rather than do a partial conversion. Leaving this as is for now
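A sketch of that conservative behavior, reusing the helper names from the snippet above (the surrounding converter field and the method shape are assumed):

import java.util.List;
import java.util.stream.Collectors;

// Only unwrap when every element is a wrapped primitive; a mixed list is
// returned untouched rather than partially converted.
private Object unwrapIfHomogeneous(final List<Object> asList)
{
  if (asList.stream().allMatch(ParquetGroupConverter::isWrappedListPrimitive)) {
    return asList.stream().map(converter::unwrapListPrimitive).collect(Collectors.toList());
  }
  return asList;
}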
Force-pushed 740cfa6 to 9879f34
Force-pushed b8ea789 to 9cdb68a
… parser that does not convert to avro first and supports flattenSpec and int96 columns, add support for flattenSpec for parquet-avro conversion parser, much test with a bunch of files lifted from spark-sql
…and now only supports primitive arrays instead of all arrays
Force-pushed 9cdb68a to 2cb3dee
Is this PR going to be merged soon? Would love to try it out.
LGTM
…to 'core' (apache#6360)
* move parquet-extensions from contrib to core, adds new hadoop parquet parser that does not convert to avro first and supports flattenSpec and int96 columns, add support for flattenSpec for parquet-avro conversion parser, much test with a bunch of files lifted from spark-sql
* fix avro flattener to support nullable primitives for auto discovery and now only supports primitive arrays instead of all arrays
* remove leftover print
* convert micro timestamp to millis
* checkstyle
* add ignore for .parquet and .parq to rat exclude
* fix legit test failure from avro flattern behavior change
* fix rebase
* add exclusions to pom to cut down on redundant jars
* refactor tests, add support for unwrapping lists for parquet-avro, review comments
* more comment
* fix oops
* tweak parquet-avro list handling
* more docs
* fix style
* grr styles
Change to index spec inputFormat (org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat) should be called out in release notes
This PR promotes the druid-parquet-extensions module from 'contrib' to 'core' and introduces a new hadoop parser that is not based on converting to avro first, instead using the SimpleGroup based reference implementation of the parquet-column package of parquet-mr. This is likely not the best or most efficient way to parse and convert parquet files... but its raw structure suited my needs of supporting converting int96 timestamp columns into longs (for #5150) and additionally provides the ability to support a flattenSpec.

changes:
* druid-parquet-extensions now provides 2 types of hadoop parsers, parquet and parquet-avro, which use the org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat and org.apache.druid.data.input.parquet.avro.DruidParquetAvroInputFormat hadoop input formats respectively.
* parquet and parquet-avro parsers now both support flattenSpec by specifying parquet and avro in the parseSpec respectively. parquet-avro re-uses the druid-avro-extensions spec and flattener. There may be minor behavior differences in how parquet logical types are handled.
* adds NestedDataParseSpec<TFlattenSpec> for ParseSpecs which support a flattenSpec property, used by JSONParseSpec, AvroParseSpec, and ParquetParseSpec (also introduced in this PR).
* changes avro flattener auto field discovery to be more discerning about arrays (only primitive arrays are now considered) and to allow nullable primitive fields to be picked up. The array thing might need to be called out, since previously it would have the toString of array contents of complex types, which I don't think is correct behavior, but could trip up anyone relying on that to happen.

On top of all of the added tests, I've lightly tested both parsers on a local druid/hadoop cluster on my laptop.

Fixes #5150 with the parquet parser (parquet-avro still does not support INT96).
Fixes #5433 by defaulting "parquet.avro.add-list-element-records": "false" for parquet-avro.
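A sketch of how that default could be applied to the job configuration while still letting users override it, assuming the standard Hadoop Configuration API (the method name is illustrative):

import org.apache.hadoop.conf.Configuration;

// Default parquet-avro list handling unless the user has already set the property.
static void applyParquetAvroDefaults(Configuration conf)
{
  conf.setIfUnset("parquet.avro.add-list-element-records", "false");
}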