'core' ORC extension #7138
Conversation
Apologies for a PR following so closely on the proposal. I did this work reactively, on a whim, over a couple of nights last week after witnessing users continually encountering the same errors trying to use the ORC extension. I realized when it was nearly finished that this probably deserves a proposal, and we can treat this as a reference implementation until there is buy-in for the proposal.
I will play around with it; I think the problem may be that I rewrote it while the old files all still existed, and then removed them in a second commit.
@clintropolis Could you please squash the commits and see if that helps GitHub draw a proper diff?
I can try, but if it still doesn't pick up much, it's probably because this is a rewrite; there isn't really any shared lineage. Some of the files, such as …
Force-pushed from bbfb74c to b899190 (…e extensions, support for flattenSpec, tests, docs).
Yeah, didn't seem to change anything.
// todo: is the the best way to go from java.util.Date to DateTime?
return new DateTime(((DateWritable) field).get());
case BINARY:
// todo: hmm, i think we need a standard way of handling binary blobs this is a placeholder.
Please don't commit TODOs; either implement it or make it a less todo-y comment, like `// Base64-encode binary blobs; may want to provide other options in the future.`
Oops, I left the todo because I considered this PR still sort of WIP pending feedback on the proposal, and wanted to have a discussion about what to do with binary.
Avro handles binary by optionally converting to utf8 string or just returning a byte array:
...
if (field instanceof ByteBuffer) {
if (binaryAsString) {
return StringUtils.fromUtf8(((ByteBuffer) field).array());
} else {
return ((ByteBuffer) field).array();
}
}
...
With the parquet extension, I decided to mimic this behavior without thinking very hard about it, because I wanted to maintain compatibility with the avro conversion (since that's what the extension was previously using), so it also does:
...
case FIXED_LEN_BYTE_ARRAY:
case BINARY:
Binary bin = g.getBinary(fieldIndex, index);
byte[] bytes = bin.getBytes();
if (binaryAsString) {
return StringUtils.fromUtf8(bytes);
} else {
return bytes;
}
...
The previous version of this extension looks like it had no special handling for binary fields, so it would eventually just call toString on the byte array.
My thoughts are that we should maybe preserve the `binaryAsString` functionality of avro and parquet and also apply it here, making this extension also return `byte[]` when it shouldn't treat the value as a UTF-8 string, rather than always translating to a base64-encoded string as is currently done here. I do think we should consider converting these `byte[]` objects into base64 strings down the line if they end up in a string dimension; I'm not certain where that should happen, though.
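A sketch of what applying that here might look like, mirroring the avro/parquet handling above (assuming a binaryAsString flag on the converter, and that the ORC map-reduce library hands binary columns over as a Hadoop BytesWritable):

case BINARY:
  byte[] bytes = ((BytesWritable) field).copyBytes();
  if (binaryAsString) {
    // the caller opted in to treating binary columns as UTF-8 strings
    return StringUtils.fromUtf8(bytes);
  } else {
    // otherwise hand back the raw bytes and let downstream decide what to do
    return bytes;
  }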
It looks to me like the correct place to handle conversion of `byte[]` to a base64 string, if it ends up as a string dimension, would be `Rows.objectToStrings`. I think conversion to base64 would at least make the dimension more usable than the result of calling `byte[].toString`, which is something like `[B@73809e7`.
I have made this change to `Rows.objectToStrings`, to special-case `byte[]` and convert to base64.
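The special case amounts to something like this sketch (using the JDK's Base64 encoder for illustration; the actual change may use a project utility instead):

// inside Rows.objectToStrings: byte[].toString() yields output like "[B@73809e7",
// so encode binary values as a base64 string instead
if (inputValue instanceof byte[]) {
  return Collections.singletonList(Base64.getEncoder().encodeToString((byte[]) inputValue));
}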
case TIMESTAMP:
return ((OrcTimestamp) field).getTime();
case DATE:
// todo: is the the best way to go from java.util.Date to DateTime?
Please don't commit TODOs. Also, the best way is probably `DateTimes.utc(theJavaUtilDate.getTime())`.
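Applied to the DATE case above, that suggestion would look roughly like this (a sketch; DateWritable.get() returns a java.util.Date subclass, so the epoch millis are directly available):

case DATE:
  // convert java.util.Date -> DateTime via epoch millis, per the suggestion above
  return DateTimes.utc(((DateWritable) field).get().getTime());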
|type | String | This should say `orc`. | yes|
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec is possible, but only its TimestampSpec and DimensionsSpec are used. | yes|
|typeString| String | String representation of ORC struct type info. If not specified, it is auto-constructed from the parseSpec, but all metric columns are dropped. | no|
|mapFieldNameFormat| String | String format for resolving flattened map fields. Default is `<PARENT>_<CHILD>`. | no|
It looks like the mapFieldNameFormat functionality is no longer supported. That should be called out in the docs and release notes. (Or perhaps re-added, if feasible.)
|----------|-------------|----------------------------------------------------------------------------------------|---------|
|type | String | This should say `orc`. | yes|
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec is possible, but only its TimestampSpec and DimensionsSpec are used. | yes|
|typeString| String | String representation of ORC struct type info. If not specified, it is auto-constructed from the parseSpec, but all metric columns are dropped. | no|
Looks like `typeString` is no longer a thing, because things are being detected better, and automatically, now. But could this make anybody sad who may have been manually specifying an override typeString for some purpose? The removal of typeString should be called out in the docs & release notes, or it should perhaps be re-added, if feasible and useful. (I am not sure about either of those two!)
Added a section to the extension docs about "migrating from the 'contrib' extension", covering what needs to change and how to change it to maintain the ingested schema, along with some better examples and an improved explanation.
Pushed commit: …bjectToStrings now converts byte[] to base64, change date handling
Three minor suggestions; everything else LGTM. I don't think any of the minor points are merge blockers, so I am 👍 (but it would be nice to adjust them in a follow-on patch).
} else {
  flattenSpec = JSONPathSpec.DEFAULT;
}
this.groupFlattener = ObjectFlatteners.create(flattenSpec, new OrcStructFlattenerMaker(false));
There isn't a "group" here (c/p from Parquet)?
oof yeah, I referenced parquet ext for this 👍
this.parseSpec = parseSpec;
this.binaryAsString = binaryAsString == null ? false : binaryAsString;
final JSONPathSpec flattenSpec;
if ((parseSpec instanceof OrcParseSpec)) {
Unnecessary parentheses.
Object convertField(OrcStruct struct, String fieldName)
{
  TypeDescription schema = struct.getSchema();
  int fieldIndex = schema.getFieldNames().indexOf(fieldName);
Might be nice to cache this in a map or something. Unless the List is specially optimized somehow, indexOf is O(N) in the number of fields. It shouldn't be a big deal for small numbers of fields, but it would get expensive for wide ORC files.
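A minimal sketch of that idea (names here are illustrative, not from the PR): build the name-to-index map once per schema, then resolve fields in O(1).

// populate once per schema instead of calling indexOf for every field access
Map<String, Integer> fieldIndexCache = new HashMap<>();
List<String> fieldNames = schema.getFieldNames();
for (int i = 0; i < fieldNames.size(); i++) {
  fieldIndexCache.put(fieldNames.get(i), i);
}
// later, an O(1) lookup replaces the O(N) indexOf scan
Integer fieldIndex = fieldIndexCache.get(fieldName);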
I've added a cache of field names to index for the root `OrcStruct`, but it can't easily be used for JSON path expressions, because the `JsonProvider` stuff doesn't have enough information to know whether an `OrcStruct` it is treating as a `Map` for the transformation is the root level or not, and comparing columns to the cache seems to go full circle back to where it started. Better than nothing?
Sure, it is better than nothing.
@@ -139,18 +155,46 @@ private static Object convertPrimitive(TypeDescription fieldDescription, Writabl
  * primitive types will be extracted into an ingestion friendly state (e.g. 'int' and 'long'). Finally,
  * if a field is not present, this method will return null.
  *
- * Note: "Union" types are not currently supported and will be returned as null
+ * Note: "Union" types are not currently supported and will be returned as null. Additionally, this method
+ * has a cache of field names to field index that is ONLY valid for the root level {@link OrcStruct}, and should
This design is not foolproof enough: it's risky/error-prone, and a bit obtuse to read, because two methods called `convertField` with slightly different arguments have very different semantics. This one should be renamed to `convertRootField`, and the javadoc should call out the restriction prominently, rather than in a side note.
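Something along these lines (a sketch of the suggested rename; the javadoc wording is illustrative):

/**
 * Converts a field of the ROOT-LEVEL OrcStruct ONLY. This method relies on a cache
 * of field name -> index mappings that is valid solely for the root struct; do not
 * call it for nested structs.
 */
Object convertRootField(OrcStruct root, String fieldName)
{
  // ... cache-backed lookup as in convertField above ...
}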
Agree, will fix 👍
LGTM 👍
Squashed commit messages:
* orc extension reworked to use apache orc map-reduce lib, moved to core extensions, support for flattenSpec, tests, docs
* change binary handling to be compatible with avro and parquet, Rows.objectToStrings now converts byte[] to base64, change date handling
* better docs and tests
* fix it
* formatting
* doc fix
* fix it
* exclude redundant dependencies
* use latest orc-mapreduce, add hadoop jobProperties recommendations to docs
* doc fix
* review stuff and fix binaryAsString
* cache for root level fields
* more better
      fieldIndexCache.put(fields.get(i), i);
    }
  }
  WritableComparable wc = struct.getFieldValue(fieldName);
Variable is not used. I assume that calling `struct.getFieldValue()` doesn't have desirable side effects. I'll delete this line.
Deleted here: #7738
Ouch, yeah, this is not supposed to be there. I missed it and, worse, it is sort of defeating the purpose of the field index cache, since it's still causing `indexOf` to get called.
Thanks for fixing this. I'm going to open a PR to the 0.15 branch to effectively backport this part of #7738, because this is a performance issue for data with lots of fields.
Adds a new `druid-orc-extension` based on the Apache ORC `orc-mapreduce` library, replacing the 'contrib' extension, to implement #7134. Includes docs and a handful of tests with files from the ORC examples.