
'core' ORC extension #7138

Merged: 15 commits merged from orc-core into apache:master on Apr 9, 2019

Conversation

@clintropolis (Member) commented Feb 25, 2019

Adds a new druid-orc-extension based on the Apache ORC orc-mapreduce library, replacing the 'contrib' extension, to implement #7134. Includes docs and a handful of tests with files from the ORC examples.

@clintropolis (Member, author):

Apologies for a PR following so closely on the heels of the proposal. I did this work reactively, on a whim, over a couple of nights last week after watching users repeatedly hit the same errors trying to use the ORC extension. I realized when it was nearly finished that it probably deserved a proposal, so we can treat this as a reference implementation until there is buy-in for the proposal.

@drcrallen (Contributor):

If you `git config diff.renameLimit 0`, will it pick up all the renames?

@clintropolis (Member, author):

> If you `git config diff.renameLimit 0`, will it pick up all the renames?

I will play around with it. I think the problem may be that I rewrote it while the old files all still existed, and then removed them in a second commit.

@gianm (Contributor) commented Feb 28, 2019

@clintropolis Could you please squash the commits and see if that helps GitHub draw a proper diff?

@clintropolis (Member, author):

> @clintropolis Could you please squash the commits and see if that helps GitHub draw a proper diff?

I can try, but if it still doesn't pick up much, it's probably because this is a rewrite; there isn't really any shared lineage. Some of the files, such as OrcExtensionsModule, were picked up as renames because they are pretty similar, but other than that there is very little in common between the two versions.

Commit: orc extension reworked to use apache orc map-reduce lib, moved to core extensions, support for flattenSpec, tests, docs
@clintropolis (Member, author):

Yeah, didn't seem to change anything.

// todo: is the the best way to go from java.util.Date to DateTime?
return new DateTime(((DateWritable) field).get());
case BINARY:
// todo: hmm, i think we need a standard way of handling binary blobs this is a placeholder.
Review comment (Contributor):

Please don't commit TODOs; either implement it or make it a less TODO-y comment, like // Base64-encode binary blobs; may want to provide other options in the future.

Review comment (Member Author):

Oops, I left the TODO because I considered this PR still sort of WIP pending feedback on the proposal, and wanted to have a discussion about what to do with binary fields.

Avro handles binary by optionally converting to a UTF-8 string or just returning a byte array:

...
if (field instanceof ByteBuffer) {
  if (binaryAsString) {
    return StringUtils.fromUtf8(((ByteBuffer) field).array());
  } else {
    return ((ByteBuffer) field).array();
  }
}
...

With the Parquet extension, I decided to mimic this behavior without thinking very hard about it: I wanted to maintain compatibility with the Avro conversion that extension was previously using. So it also does:

...
case FIXED_LEN_BYTE_ARRAY:
case BINARY:
  Binary bin = g.getBinary(fieldIndex, index);
  byte[] bytes = bin.getBytes();
  if (binaryAsString) {
    return StringUtils.fromUtf8(bytes);
  } else {
    return bytes;
  }
...

The previous version of this extension looks like it had no special handling for binary fields, so it would eventually just call toString on the byte array.

My thought is that we should preserve the binaryAsString functionality of Avro and Parquet and apply it here too, making this extension return byte[] when it shouldn't treat the value as a UTF-8 string, rather than always translating to a base64-encoded string as is done here currently. I do think we should consider converting these byte[] objects into base64 strings down the line if they end up in a string dimension; I'm just not certain where that should happen.
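
For reference, a minimal sketch of what that could look like on the ORC side, mirroring the Avro/Parquet snippets above (the BytesWritable cast, the helper name, and the wiring of the binaryAsString flag are placeholders, not the actual code):

import org.apache.druid.java.util.common.StringUtils;
import org.apache.hadoop.io.BytesWritable;

// hypothetical helper mirroring the Avro/Parquet behavior discussed above
static Object convertBinary(BytesWritable field, boolean binaryAsString)
{
  // copyBytes() trims the backing buffer to the actual value length
  final byte[] bytes = field.copyBytes();
  if (binaryAsString) {
    // interpret the bytes as a UTF-8 string, like Avro and Parquet do
    return StringUtils.fromUtf8(bytes);
  } else {
    // hand back the raw bytes; any base64 conversion can happen downstream
    return bytes;
  }
}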

Review comment (Member Author):

It looks to me like the correct place to handle conversion of byte[] to a base64 string, if it ends up as a string dimension, would be Rows.objectToStrings. I think conversion to base64 would at least make the dimension more usable than the result of calling byte[].toString, which yields something like [B@73809e7.
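
As a rough sketch of the idea, using java.util.Base64 (the surrounding method shape is a placeholder; the real change would live inside Rows.objectToStrings):

import java.util.Base64;

// hypothetical special case for byte[] values headed into a string dimension
static String objectToString(Object inputValue)
{
  if (inputValue instanceof byte[]) {
    // base64 yields a stable, usable value instead of "[B@73809e7"
    return Base64.getEncoder().encodeToString((byte[]) inputValue);
  }
  return String.valueOf(inputValue);
}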

Review comment (Member Author):

I have made this change to Rows.objectToStrings: it now special-cases byte[] and converts it to base64.

case TIMESTAMP:
return ((OrcTimestamp) field).getTime();
case DATE:
// todo: is the the best way to go from java.util.Date to DateTime?
Review comment (Contributor):

Please don't commit TODOs. Also, the best way is probably DateTimes.utc(theJavaUtilDate.getTime()).
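
Roughly, assuming field is the DateWritable from the snippet above (the helper name is illustrative):

import org.apache.druid.java.util.common.DateTimes;
import org.apache.hadoop.hive.serde2.io.DateWritable;
import org.joda.time.DateTime;

// illustrative sketch: java.util.Date.getTime() is epoch millis, which
// DateTimes.utc() interprets in UTC rather than the JVM default time zone
static DateTime convertDate(DateWritable field)
{
  return DateTimes.utc(field.get().getTime());
}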

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should say `orc`|yes|
|parseSpec|JSON Object|Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec is possible but only their TimestampSpec and DimensionsSpec are used.|yes|
|typeString|String|String representation of ORC struct type info. If not specified, auto constructed from parseSpec but all metric columns are dropped|no|
|mapFieldNameFormat|String|String format for resolving the flatten map fields. Default is `<PARENT>_<CHILD>`.|no|
Review comment (Contributor):

It looks like the mapFieldNameFormat functionality is no longer supported. That should be called out in the docs and release notes. (Or perhaps re-added, if feasible.)

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should say `orc`|yes|
|parseSpec|JSON Object|Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec is possible but only their TimestampSpec and DimensionsSpec are used.|yes|
|typeString|String|String representation of ORC struct type info. If not specified, auto constructed from parseSpec but all metric columns are dropped|no|
Review comment (Contributor):

Looks like typeString is no longer a thing, because types are now being detected better, and automatically. But could this make anybody sad who may have been manually specifying an override typeString for some purpose?

The removal of typeString should be called out in the docs & release notes, or the option perhaps re-added, if feasible and useful. (I am not sure about either of those two!)

Review comment (Member Author):

Added a section to the extension docs about "migrating from the 'contrib' extension", describing what needs to change and how to change it to maintain the ingested schema, along with some better examples and an improved explanation.

@fjy added this to the 0.15.0 milestone Mar 11, 2019
Commit: change binary handling to be compatible with avro and parquet, Rows.objectToStrings now converts byte[] to base64, change date handling
@gianm self-assigned this Mar 15, 2019
@gianm (Contributor) left a review:

Three minor suggestions - everything else LGTM. I don't think any of the minor points are merge blockers so I am 👍 (but it would be nice to adjust them in a follow-on patch).

} else {
flattenSpec = JSONPathSpec.DEFAULT;
}
this.groupFlattener = ObjectFlatteners.create(flattenSpec, new OrcStructFlattenerMaker(false));
Review comment (Contributor):

There isn't a "group" here (copy/pasted from Parquet)?

Review comment (Member Author):

oof, yeah, I referenced the parquet extension for this 👍

this.parseSpec = parseSpec;
this.binaryAsString = binaryAsString == null ? false : binaryAsString;
final JSONPathSpec flattenSpec;
if ((parseSpec instanceof OrcParseSpec)) {
Review comment (Contributor):

Unnecessary parentheses.

Object convertField(OrcStruct struct, String fieldName)
{
TypeDescription schema = struct.getSchema();
int fieldIndex = schema.getFieldNames().indexOf(fieldName);
Review comment (Contributor):

Might be nice to cache this in a map or something. Unless the List is specially optimized somehow, indexOf is O(N) in the number of fields. Shouldn't be a big deal for small numbers of fields but would get expensive for wide ORCs.
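
A rough sketch of such a cache, built lazily from the root schema (the names here are illustrative, not the eventual implementation):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.orc.TypeDescription;

// hypothetical name-to-index cache so repeated lookups avoid the O(N)
// List.indexOf scan on every row for wide schemas
private final Map<String, Integer> fieldIndexCache = new HashMap<>();

private int getFieldIndex(TypeDescription schema, String fieldName)
{
  if (fieldIndexCache.isEmpty()) {
    final List<String> fields = schema.getFieldNames();
    for (int i = 0; i < fields.size(); i++) {
      fieldIndexCache.put(fields.get(i), i);
    }
  }
  // -1 for unknown fields, matching List.indexOf semantics
  return fieldIndexCache.getOrDefault(fieldName, -1);
}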

Review comment (Member Author):

I've added a cache from field name to index for the root OrcStruct, but it can't easily be used for JSON path expressions because the JsonProvider machinery doesn't have enough information to know whether an OrcStruct it is treating as a Map is the root level or not, and comparing columns against the cache seems to come full circle to where it started. Better than nothing?

Review comment (Contributor):

Sure, it is better than nothing.

@@ -139,18 +155,46 @@ private static Object convertPrimitive(TypeDescription fieldDescription, Writabl
* primitive types will be extracted into an ingestion friendly state (e.g. 'int' and 'long'). Finally,
* if a field is not present, this method will return null.
*
* Note: "Union" types are not currently supported and will be returned as null
* Note: "Union" types are not currently supported and will be returned as null. Additionally, this method
* has a cache of field names to field index that is ONLY valid for the root level {@link OrcStruct}, and should
Review comment (Contributor):

This design is not foolproof enough: it's risky/error-prone, and a bit obtuse to read, because two methods called convertField with slightly different arguments have very different semantics. This one should be renamed to convertRootField and the javadoc should call out the restriction prominently, rather than in a side note.
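
Something like this hypothetical split (only the name convertRootField comes from this review; the bodies are illustrative):

import org.apache.orc.mapred.OrcStruct;

// root-only variant: may consult the field index cache, which is valid
// solely for the root-level struct's schema
Object convertRootField(OrcStruct root, String fieldName)
{
  // cache lookup would happen here before delegating
  return convertField(root, fieldName);
}

// general variant: safe at any nesting level, pays the O(N) name lookup
Object convertField(OrcStruct struct, String fieldName)
{
  int fieldIndex = struct.getSchema().getFieldNames().indexOf(fieldName);
  return fieldIndex < 0 ? null : struct.getFieldValue(fieldIndex);
}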

Review comment (Member Author):

Agree, will fix 👍

@gianm (Contributor) left a review:

LGTM 👍

@gianm merged commit 89bb43f into apache:master Apr 9, 2019
gianm pushed a commit to implydata/druid-public that referenced this pull request Apr 9, 2019
* orc extension reworked to use apache orc map-reduce lib, moved to core extensions, support for flattenSpec, tests, docs

* change binary handling to be compatible with avro and parquet, Rows.objectToStrings now converts byte[] to base64, change date handling

* better docs and tests

* fix it

* formatting

* doc fix

* fix it

* exclude redundant dependencies

* use latest orc-mapreduce, add hadoop jobProperties recommendations to docs

* doc fix

* review stuff and fix binaryAsString

* cache for root level fields

* more better
@clintropolis deleted the orc-core branch April 9, 2019 22:01
fieldIndexCache.put(fields.get(i), i);
}
}
WritableComparable wc = struct.getFieldValue(fieldName);
Review comment (Member):

Variable is not used. I assume that calling struct.getFieldValue() doesn't have desirable side effects. I'll delete this line.

Review comment (Member):

Deleted here: #7738

Review comment (Member Author):

Ouch, yeah, this is not supposed to be there. I missed it, and worse, it sort of defeats the purpose of the field index cache, since it still causes indexOf to get called.

Thanks for fixing this. I'm going to open a PR against the 0.15 branch to effectively backport this part of #7738, because this is a performance issue for data with lots of fields.
