overhaul 'druid-parquet-extensions' module, promoting from 'contrib' to 'core' #6360

Merged: 16 commits merged into apache:master from core-parquet on Nov 6, 2018

Conversation

@clintropolis (Member) commented Sep 20, 2018

This PR promotes the druid-parquet-extensions module from 'contrib' to 'core' and introduces a new hadoop parser that is not based on converting to avro first, instead using the SimpleGroup based reference implementation from the parquet-column package of parquet-mr. This is likely not the best or most efficient way to parse and convert parquet files... but its raw structure suited my needs of converting int96 timestamp columns into longs (for #5150) and additionally providing the ability to support a flattenSpec.

changes:

  • druid-parquet-extensions now provides 2 types of hadoop parsers, parquet and parquet-avro, which use org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat and org.apache.druid.data.input.parquet.avro.DruidParquetAvroInputFormat hadoop input formats respectively.
  • parquet and parquet-avro parsers now both support flattenSpec by specifying parquet and avro in the parseSpec respectively. parquet-avro re-uses the druid-avro-extensions spec and flattener. There may be minor behavior differences in how parquet logical types are handled.
  • extracted abstract type NestedDataParseSpec<TFlattenSpec> for ParseSpecs which support a flattenSpec property, used by JSONParseSpec, AvroParseSpec, and ParquetParseSpec (also introduced in this PR)
  • lightly modified behavior of avro flattener auto field discovery to be more discerning about arrays (only primitive arrays are now considered) and to allow nullable primitive fields to be picked up. The array change might need to be called out, since previously it would have included the toString of array contents of complex types, which I don't think is correct behavior, but it could trip up anyone relying on that to happen.
  • adds many tests and parquet test files ("donated" from spark-sql tests here) to ensure conversion correctness (though probably still not enough)

On top of all of the added tests, I've lightly tested both parsers on a local druid/hadoop cluster on my laptop.

Fixes #5150 with the parquet parser (parquet-avro still does not support INT96)
Fixes #5433 by defaulting "parquet.avro.add-list-element-records":"false" for parquet-avro

@clintropolis force-pushed the core-parquet branch 3 times, most recently from 11cac0d to d7bc3f0 on September 25, 2018 19:57
@amalakar (Contributor):

The INT96 bug has been one of the more annoying issues we have had; it caused us to convert to CSV first before loading into Druid. Looking forward to this fix.

@clintropolis (Member, Author):

FYI, I'm working on refactoring/parameterizing the tests to cut down on the amount of duplication and JSON, but haven't had the chance to finish yet.


public class ParquetGroupFlattenerMaker implements ObjectFlatteners.FlattenerMaker<Group>
{
  private static final MappingProvider DEFAULT_MAPPING_PROVIDER = new MappingProvider()
Contributor:

This looks the same as GenericAvroMappingProvider, can these be merged?

Member Author:

Oops, forgot about this. I was planning to make a default do-nothing implementation, since that is what these are both doing 👍
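
For reference, a minimal sketch of the kind of shared do-nothing MappingProvider being discussed here (the class name and exception messages are illustrative, not necessarily what was committed):

import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.TypeRef;
import com.jayway.jsonpath.spi.mapper.MappingProvider;

// Illustrative shared provider: neither the Avro nor the Parquet flattener
// uses JsonPath's type-mapping facilities, so one implementation that simply
// refuses to map could be shared by both.
public class NoopMappingProvider implements MappingProvider
{
  @Override
  public <T> T map(Object source, Class<T> targetType, Configuration configuration)
  {
    throw new UnsupportedOperationException("Type mapping is not supported");
  }

  @Override
  public <T> T map(Object source, TypeRef<T> targetType, Configuration configuration)
  {
    throw new UnsupportedOperationException("Type mapping is not supported");
  }
}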

@Override
public Function<Group, Object> makeJsonQueryExtractor(String expr)
{
  return null;
Contributor:

I think it'd be better to throw an exception here
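
A minimal sketch of the suggested change, mirroring the fragment above (the exception type and message are just one possibility):

@Override
public Function<Group, Object> makeJsonQueryExtractor(String expr)
{
  // Fail loudly instead of silently returning null, since JQ-style
  // expressions are not supported for Parquet groups.
  throw new UnsupportedOperationException("Parquet flattener does not support JQ expressions");
}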

{
  return schema.getType().equals(Schema.Type.UNION) &&
         schema.getTypes().size() == 2 &&
         schema.getTypes().get(0).getType().equals(Schema.Type.NULL) &&
Contributor:

in the union types, is the NULL type guaranteed to appear before the actual type?

Member Author:

Hmm, I'm actually not certain it's guaranteed. From the spec:

Unions
Unions, as mentioned above, are represented using JSON arrays. For example, ["null", "string"] declares a schema which may be either a null or string.

(Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing "null", the "null" is usually listed first, since the default value of such unions is typically null.)

Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum. For example, unions containing two array types or two map types are not permitted, but two types with different names are permitted. (Names permit efficient resolution when reading and writing unions.)

Unions may not immediately contain other unions.

I'm actually not certain what would be best to do in this case, but since we didn't support nullable fields at all previously afaict maybe it's ok.

Contributor:

Hm, from that it sounds like [primitive, null] is valid if the default is non-null, then. Can you make this check non-order-dependent?
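
One way to make the check order-independent, as a rough sketch (the isPrimitive helper and method name are assumed for illustration, not taken from this diff):

// Treat a 2-branch union as a nullable primitive regardless of whether the
// NULL branch is listed first or second.
private static boolean isNullablePrimitive(Schema schema)
{
  if (!schema.getType().equals(Schema.Type.UNION) || schema.getTypes().size() != 2) {
    return false;
  }
  Schema first = schema.getTypes().get(0);
  Schema second = schema.getTypes().get(1);
  if (first.getType().equals(Schema.Type.NULL)) {
    return isPrimitive(second);
  }
  if (second.getType().equals(Schema.Type.NULL)) {
    return isPrimitive(first);
  }
  return false;
}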

HadoopDruidIndexerConfig config = HadoopDruidIndexerConfig.fromConfiguration(context.getConfiguration());
ParseSpec parseSpec = config.getParser().getParseSpec();

// todo: this is kind of lame, maybe we can still trim what we read if we
Contributor:

Hm, rather than parsing the flattenSpec, maybe this could be supported with a "requiredFields" method on flatten specs, but I would remove the "todo" part for now.
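
A rough sketch of the kind of flattenSpec hook being suggested here (hypothetical, not part of this PR):

import java.util.Set;

// Hypothetical addition: let a flatten spec report the top-level input fields
// it needs, so a reader could trim what it reads instead of inspecting the
// flattenSpec expressions directly.
public interface RequiredFieldsAware
{
  Set<String> getRequiredFields();
}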

@jon-wei (Contributor) left a comment:

Finished the initial review; the tests are somewhat unwieldy right now, so I'll wait for your update there and do another review pass.

@Override
public boolean isArray(final Object o)
{
  if (o instanceof List) {
Contributor:

this can just be return (o instanceof List)

@Override
public boolean isMap(final Object o)
{
  if (o instanceof Map) {
Contributor:

could be (o instanceof Map) || (o instanceof Group)

  return converter.unwrapListPrimitive(o);
} else if (o instanceof List) {
  List<Object> asList = (List<Object>) o;
  if (asList.stream().allMatch(ParquetGroupConverter::isWrappedListPrimitive)) {
Contributor:

Are there cases where such lists have some items "wrapped" but some items "unwrapped"?

I'm wondering if the allMatch isWrappedListPrimitive check is necessary; is it possible and correct to remove that pass and just unwrap anything that's wrapped in the list?

Member Author:

I don't know what is allowed to happen, but I think if the list is not homogeneous it is safer not to do this conversion rather than do a partial conversion. Leaving this as-is for now.

@amalakar (Contributor) commented Nov 1, 2018

Is this PR going to be merged soon? Would love to try it out.

@jon-wei (Contributor) left a comment:

LGTM

@jon-wei merged commit 1224d8b into apache:master on Nov 6, 2018
@clintropolis deleted the core-parquet branch on November 6, 2018 21:35
@jon-wei removed their assignment on Nov 16, 2018
gianm pushed a commit to implydata/druid-public that referenced this pull request Nov 16, 2018
…to 'core' (apache#6360)

* move parquet-extensions from contrib to core, adds new hadoop parquet parser that does not convert to avro first and supports flattenSpec and int96 columns, add support for flattenSpec for parquet-avro conversion parser, much test with a bunch of files lifted from spark-sql

* fix avro flattener to support nullable primitives for auto discovery and now only supports primitive arrays instead of all arrays

* remove leftover print

* convert micro timestamp to millis

* checkstyle

* add ignore for .parquet and .parq to rat exclude

* fix legit test failure from avro flattern behavior change

* fix rebase

* add exclusions to pom to cut down on redundant jars

* refactor tests, add support for unwrapping lists for parquet-avro, review comments

* more comment

* fix oops

* tweak parquet-avro list handling

* more docs

* fix style

* grr styles
@dclim (Contributor) commented Nov 27, 2018

The change to the index spec inputFormat (org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat) should be called out in the release notes.
