
'core' ORC extension #7138

Merged: 15 commits merged from orc-core into apache:master on Apr 9, 2019

Conversation

@clintropolis (Member) commented Feb 25, 2019

Adds a new druid-orc-extension based on the Apache ORC orc-mapreduce library, replacing the 'contrib' extension, to implement #7134. Includes docs and a handful of tests with files from the ORC examples.

@clintropolis (Member, author):

Apologies for a PR following so closely on the heels of the proposal. I did this work reactively, on a whim, over a couple of nights last week after watching users repeatedly hit the same errors trying to use the ORC extension. I realized when it was nearly finished that it probably deserved a proposal, so we can treat this as a reference implementation until there is buy-in for the proposal.

@drcrallen (Contributor):

If you `git config diff.renameLimit 0`, will it pick up all the renames?

@clintropolis (Member, author):

> If you `git config diff.renameLimit 0`, will it pick up all the renames?

I will play around with it. I think the problem may be that I rewrote it while the old files all still existed, and then removed them in a second commit.

@gianm (Contributor) commented Feb 28, 2019

@clintropolis Could you please squash the commits and see if that helps GitHub draw a proper diff?

@clintropolis (Member, author):

> @clintropolis Could you please squash the commits and see if that helps GitHub draw a proper diff?

I can try, but if it still doesn't pick up much, it's probably because this is a rewrite; there isn't really any shared lineage. Some of the files, such as OrcExtensionsModule, were picked up as renames because they are pretty similar, but other than that there is very little in common between the two versions.

Commit: orc extension reworked to use apache orc map-reduce lib, moved to core extensions, support for flattenSpec, tests, docs
@clintropolis (Member, author):

Yeah, didn't seem to change anything.

// todo: is the the best way to go from java.util.Date to DateTime?
return new DateTime(((DateWritable) field).get());
case BINARY:
// todo: hmm, i think we need a standard way of handling binary blobs this is a placeholder.
Review comment (Contributor):

Please don't commit TODOs; either implement it or make it a less TODO-y comment, like // Base64-encode binary blobs; may want to provide other options in the future.

Review comment (Member Author):

Oops, I left the TODO because I considered this PR still sort of WIP pending feedback on the proposal, and wanted to have a discussion about what to do with binary fields.

Avro handles binary by optionally converting to a UTF-8 string or just returning a byte array:

...
if (field instanceof ByteBuffer) {
  if (binaryAsString) {
    return StringUtils.fromUtf8(((ByteBuffer) field).array());
  } else {
    return ((ByteBuffer) field).array();
  }
}
...

With the Parquet extension, I decided to mimic this behavior without thinking very hard about it: I wanted to maintain compatibility with the Avro conversion that extension was previously using. So it also does:

...
case FIXED_LEN_BYTE_ARRAY:
case BINARY:
  Binary bin = g.getBinary(fieldIndex, index);
  byte[] bytes = bin.getBytes();
  if (binaryAsString) {
    return StringUtils.fromUtf8(bytes);
  } else {
    return bytes;
  }
...

The previous version of this extension looks like it had no special handling for binary fields, so it would eventually just call toString on the byte array.

My thought is that we should preserve the binaryAsString functionality of Avro and Parquet and apply it here too, making this extension return byte[] when it shouldn't treat the value as a UTF-8 string, rather than always translating to a base64-encoded string as is done here currently. I do think we should consider converting these byte[] objects into base64 strings down the line if they end up in a string dimension; I'm just not certain where that should happen.
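
For reference, a minimal sketch of what that could look like on the ORC side, mirroring the Avro/Parquet snippets above (the BytesWritable cast, the helper name, and the wiring of the binaryAsString flag are placeholders, not the actual code):

import org.apache.druid.java.util.common.StringUtils;
import org.apache.hadoop.io.BytesWritable;

// hypothetical helper mirroring the Avro/Parquet behavior discussed above
static Object convertBinary(BytesWritable field, boolean binaryAsString)
{
  // copyBytes() trims the backing buffer to the actual value length
  final byte[] bytes = field.copyBytes();
  if (binaryAsString) {
    // interpret the bytes as a UTF-8 string, like Avro and Parquet do
    return StringUtils.fromUtf8(bytes);
  } else {
    // hand back the raw bytes; any base64 conversion can happen downstream
    return bytes;
  }
}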

Review comment (Member Author):

It looks to me like the correct place to handle conversion of byte[] to a base64 string, if it ends up as a string dimension, would be Rows.objectToStrings. I think conversion to base64 would at least make the dimension more usable than the result of calling byte[].toString, which yields something like [B@73809e7.
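
As a rough sketch of the idea, using java.util.Base64 (the surrounding method shape is a placeholder; the real change would live inside Rows.objectToStrings):

import java.util.Base64;

// hypothetical special case for byte[] values headed into a string dimension
static String objectToString(Object inputValue)
{
  if (inputValue instanceof byte[]) {
    // base64 yields a stable, usable value instead of "[B@73809e7"
    return Base64.getEncoder().encodeToString((byte[]) inputValue);
  }
  return String.valueOf(inputValue);
}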

Review comment (Member Author):

I have made this change to Rows.objectToStrings: it now special-cases byte[] and converts it to base64.

case TIMESTAMP:
return ((OrcTimestamp) field).getTime();
case DATE:
// todo: is the the best way to go from java.util.Date to DateTime?
Review comment (Contributor):

Please don't commit TODOs. Also, the best way is probably DateTimes.utc(theJavaUtilDate.getTime()).
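
Roughly, assuming field is the DateWritable from the snippet above (the helper name is illustrative):

import org.apache.druid.java.util.common.DateTimes;
import org.apache.hadoop.hive.serde2.io.DateWritable;
import org.joda.time.DateTime;

// illustrative sketch: java.util.Date.getTime() is epoch millis, which
// DateTimes.utc() interprets in UTC rather than the JVM default time zone
static DateTime convertDate(DateWritable field)
{
  return DateTimes.utc(field.get().getTime());
}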

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should say `orc`|yes|
|parseSpec|JSON Object|Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec is possible but only their TimestampSpec and DimensionsSpec are used.|yes|
|typeString|String|String representation of ORC struct type info. If not specified, auto constructed from parseSpec but all metric columns are dropped|no|
|mapFieldNameFormat|String|String format for resolving the flatten map fields. Default is `<PARENT>_<CHILD>`.|no|
Review comment (Contributor):

It looks like the mapFieldNameFormat functionality is no longer supported. That should be called out in the docs and release notes. (Or perhaps re-added, if feasible.)

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should say `orc`|yes|
|parseSpec|JSON Object|Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec is possible but only their TimestampSpec and DimensionsSpec are used.|yes|
|typeString|String|String representation of ORC struct type info. If not specified, auto constructed from parseSpec but all metric columns are dropped|no|
Review comment (Contributor):

Looks like typeString is no longer a thing, because types are now being detected better, and automatically. But could this make anybody sad who may have been manually specifying an override typeString for some purpose?

The removal of typeString should be called out in the docs & release notes, or the option perhaps re-added, if feasible and useful. (I am not sure about either of those two!)

Review comment (Member Author):

Added a section to the extension docs about "migrating from the 'contrib' extension", describing what needs to change and how to change it to maintain the ingested schema, along with some better examples and an improved explanation.

@fjy added this to the 0.15.0 milestone Mar 11, 2019
Commit: change binary handling to be compatible with avro and parquet, Rows.objectToStrings now converts byte[] to base64, change date handling
@gianm self-assigned this Mar 15, 2019
@gianm (Contributor) left a review:

Three minor suggestions - everything else LGTM. I don't think any of the minor points are merge blockers so I am 👍 (but it would be nice to adjust them in a follow-on patch).

} else {
flattenSpec = JSONPathSpec.DEFAULT;
}
this.groupFlattener = ObjectFlatteners.create(flattenSpec, new OrcStructFlattenerMaker(false));
Review comment (Contributor):

There isn't a "group" here (copy/pasted from Parquet)?

Review comment (Member Author):

oof, yeah, I referenced the parquet extension for this 👍

this.parseSpec = parseSpec;
this.binaryAsString = binaryAsString == null ? false : binaryAsString;
final JSONPathSpec flattenSpec;
if ((parseSpec instanceof OrcParseSpec)) {
Review comment (Contributor):

Unnecessary parentheses.

Object convertField(OrcStruct struct, String fieldName)
{
TypeDescription schema = struct.getSchema();
int fieldIndex = schema.getFieldNames().indexOf(fieldName);
Review comment (Contributor):

Might be nice to cache this in a map or something. Unless the List is specially optimized somehow, indexOf is O(N) in the number of fields. Shouldn't be a big deal for small numbers of fields but would get expensive for wide ORCs.
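
A rough sketch of such a cache, built lazily from the root schema (the names here are illustrative, not the eventual implementation):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.orc.TypeDescription;

// hypothetical name-to-index cache so repeated lookups avoid the O(N)
// List.indexOf scan on every row for wide schemas
private final Map<String, Integer> fieldIndexCache = new HashMap<>();

private int getFieldIndex(TypeDescription schema, String fieldName)
{
  if (fieldIndexCache.isEmpty()) {
    final List<String> fields = schema.getFieldNames();
    for (int i = 0; i < fields.size(); i++) {
      fieldIndexCache.put(fields.get(i), i);
    }
  }
  // -1 for unknown fields, matching List.indexOf semantics
  return fieldIndexCache.getOrDefault(fieldName, -1);
}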

Review comment (Member Author):

I've added a cache from field name to index for the root OrcStruct, but it can't easily be used for JSON path expressions because the JsonProvider machinery doesn't have enough information to know whether an OrcStruct it is treating as a Map is the root level or not, and comparing columns against the cache seems to come full circle to where it started. Better than nothing?

Review comment (Contributor):

Sure, it is better than nothing.

@@ -139,18 +155,46 @@ private static Object convertPrimitive(TypeDescription fieldDescription, Writabl
* primitive types will be extracted into an ingestion friendly state (e.g. 'int' and 'long'). Finally,
* if a field is not present, this method will return null.
*
* Note: "Union" types are not currently supported and will be returned as null
* Note: "Union" types are not currently supported and will be returned as null. Additionally, this method
* has a cache of field names to field index that is ONLY valid for the root level {@link OrcStruct}, and should
Review comment (Contributor):

This design is not foolproof enough: it's risky/error-prone, and a bit obtuse to read, because two methods called convertField with slightly different arguments have very different semantics. This one should be renamed to convertRootField and the javadoc should call out the restriction prominently, rather than in a side note.
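
Something like this hypothetical split (only the name convertRootField comes from this review; the bodies are illustrative):

import org.apache.orc.mapred.OrcStruct;

// root-only variant: may consult the field index cache, which is valid
// solely for the root-level struct's schema
Object convertRootField(OrcStruct root, String fieldName)
{
  // cache lookup would happen here before delegating
  return convertField(root, fieldName);
}

// general variant: safe at any nesting level, pays the O(N) name lookup
Object convertField(OrcStruct struct, String fieldName)
{
  int fieldIndex = struct.getSchema().getFieldNames().indexOf(fieldName);
  return fieldIndex < 0 ? null : struct.getFieldValue(fieldIndex);
}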

Review comment (Member Author):

Agree, will fix 👍

@gianm (Contributor) left a review:

LGTM 👍

@gianm merged commit 89bb43f into apache:master Apr 9, 2019
gianm pushed a commit to implydata/druid-public that referenced this pull request Apr 9, 2019
* orc extension reworked to use apache orc map-reduce lib, moved to core extensions, support for flattenSpec, tests, docs

* change binary handling to be compatible with avro and parquet, Rows.objectToStrings now converts byte[] to base64, change date handling

* better docs and tests

* fix it

* formatting

* doc fix

* fix it

* exclude redundant dependencies

* use latest orc-mapreduce, add hadoop jobProperties recommendations to docs

* doc fix

* review stuff and fix binaryAsString

* cache for root level fields

* more better
@clintropolis deleted the orc-core branch April 9, 2019 22:01
fieldIndexCache.put(fields.get(i), i);
}
}
WritableComparable wc = struct.getFieldValue(fieldName);
Review comment (Member):

Variable is not used. I assume that calling struct.getFieldValue() doesn't have desirable side effects. I'll delete this line.

Review comment (Member):

Deleted here: #7738

Review comment (Member Author):

Ouch, yeah, this is not supposed to be there. I missed it, and worse, it sort of defeats the purpose of the field index cache, since it still causes indexOf to get called.

Thanks for fixing this. I'm going to open a PR against the 0.15 branch to effectively backport this part of #7738, because this is a performance issue for data with lots of fields.
