overhaul 'druid-orc-extensions' and move from 'contrib' to 'core' #7134
Comments
This looks good to me. Avoiding the error-prone `typeString` sounds great, and I agree with using the more minimal `orc-mapreduce` lib instead of `hive-exec` (I've seen issues where `hive-exec` bundling Guava classes has caused incompatibilities).
I agree here; in my experience writing the `typeString` is tough for people, and making it optional is a good thing. It sounds like `flattenSpec`s are going to be able to do all the stuff that `typeString`s used to be able to do, which is nice because it's better aligned with how JSON/Avro/Parquet work.
I feel similarly about this as about the `typeString` thing. The move to 'core' seems like a good time to better align ORC with how other input formats work.
For the reasons mentioned in my last comment I am 👍 on the proposal. Will do another pass through the code too.
To clarify, my current PR eliminates
No, I don't. By 'making it optional is good' I really meant 'making it not required is good'. I think removing it is OK, especially given that we're moving this from contrib to core, and it's a good time to make changes like this that improve maintainability. And nobody has spoken up recently as needing it. If it turns out a compat layer is needed for some reason, then it could be built.
Completed in #7138 |
Motivation
The existing ORC hadoop indexing extension, while functional, is difficult to use and hard to troubleshoot, and does not provide a consistent experience compared to indexing other nested data formats such as JSON, Avro, and Parquet.
I think it would also be nice if this were a 'core' extension to help Druid better tie into the rest of the Apache ecosystem, and since we provide 'core' extensions for Avro and Parquet, this would help round out our 'core' support for popular Hadoop file formats.
Proposed changes
I propose replacing the existing 'contrib' extension with a new 'core' extension that is modelled after the Parquet and Avro Hadoop indexing facilities, with support for `flattenSpec`, and built on Apache ORC's `orc-mapreduce` library instead of Hive's `hive-exec`. Instead of requiring the user to specify the schema in the form of the `typeString` parameter that the existing extension uses, the new extension will auto-detect the schema using the schema encoded in the `TypeDescription` of the `OrcStruct`.

Structurally it will look a lot like the Parquet extension, with an implementation of `JsonProvider` and `ObjectFlatteners.FlattenerMaker<OrcStruct>` to provide flattening for the `InputRowParser` implementation, and a 'converter' utility class to translate the `OrcStruct` and its fields into a map that is ultimately fed to the `InputRowParser` implementation's internal `MapInputRowParser`.

Of note is that this new extension will not be exactly compatible with the current extension, so ingest specs will likely need to be changed, but it should be able to continue to produce data that matches any existing Druid schemas produced by the current extension.
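To make the proposal concrete, a spec for the new extension might look something like the following sketch (the parser `type` value, field names, and path expressions here are hypothetical placeholders, not settled API). Note the absence of any `typeString`, since the schema comes from the file itself:

```json
{
  "type": "orc",
  "parseSpec": {
    "format": "orc",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "nestedValue", "expr": "$.someStruct.someField" }
      ]
    },
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```

With `useFieldDiscovery` enabled, top-level columns would be picked up automatically, and `flattenSpec` path fields would only be needed for nested structures.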
The primary incompatibility is that the `typeString` of the current extension allows arbitrary renaming of columns in the ORC file, as only position and type seem to be significant. I'm not certain, but I presume the reason this is allowed is detailed in the Hive issues HIVE-7189 and HIVE-4243, which chronicle how Hive would write ORC files without their real column names, instead just using `_col0`, `_col1`, etc. However, `flattenSpec` expressions would be a way to handle this with the new extension, as the fields could be manually extracted from the generic name `_col0` (or whatever) into the desired column name. If we feel that we really need to continue to support the old way the extension worked, I could investigate possible mechanisms to retain this functionality of providing a `typeString` schema and extracting column names from it, but I don't feel that it is necessary.

Another incompatibility relates to how the current extension handles `OrcMap` types. It provides a kind of flattening 'magic' for maps of primitives, which appear in the row with a name controlled by `mapFieldNameFormat`. Since the new extension would use `flattenSpec`, these names could be replicated with field extraction expressions to preserve existing Druid schemas.

Alternatively, we could just rebrand the existing 'contrib' extension as `druid-hive-orc-extension` and otherwise leave it as is, so that existing users who prefer the existing behavior can continue unchanged, but having multiple ORC extensions might be confusing.

Rationale
The ORC format supports nested data, so it should use the `flattenSpec` approach to provide a consistent experience for users. By implementing this extension in the same pattern that is used elsewhere, it should also help reduce maintenance costs on our end. Additionally, using the Apache ORC libraries instead of the Hive libraries "feels" more correct, even if many of these files may be coming from Hive.

Operational impact
The ORC extension will now be a 'core' extension instead of 'contrib', so it will be bundled in the binary packaging. Ingestion specs will need to be updated to use the new extension. At a minimum this requires adjusting the `inputFormat` to use `org.apache.orc.mapreduce.OrcInputFormat` instead of `org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat`, with the remaining work centered on migrating away from `typeString` auto column renaming (if we don't provide a replacement) or, if users were relying on the current extension's `OrcMap` handling facilities, on migrating to a `flattenSpec` to extract those properties.

If we choose instead to retain the existing 'contrib' extension and rebrand it, the operational impact would be limited to changing the extension load list configuration.
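As a rough sketch of the `typeString` migration path described above, a `flattenSpec` path expression could pull a generically named column out into the desired dimension name (the column and dimension names below are hypothetical examples):

```json
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "path", "name": "userId", "expr": "$._col0" },
      { "type": "path", "name": "country", "expr": "$._col1" }
    ]
  }
}
```

This replicates the renaming that `typeString` previously performed by position, but makes the mapping explicit per field.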
Test plan
A variety of example ORC files will be sourced and used to comprehensively unit test the extension, ensuring that flattening and field discovery work as expected.
Additionally, testing in a live cluster will be performed to ensure that everything is fully operational.
Future work
I think there is potentially room to refactor these flattening input row parsers and extract one or more base types, since the `parseBatch` method implementation always has the same general form. Future work is looking into ways to simplify and re-use code where possible to reduce maintenance overhead on our end.
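To illustrate the duplication (with hypothetical class and method names, not Druid's actual API), each parser's `parseBatch` boils down to a "convert the raw record to a map, then delegate" shape that could live in a shared base type:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Druid's real classes: the repeated shape of
// parseBatch across the flattening parsers. Each format-specific parser
// only needs to supply the conversion from its raw record type T to a Map.
abstract class FlatteningParserSketch<T> {
    // Format-specific step, e.g. flattening an OrcStruct into a Map.
    protected abstract Map<String, Object> toMap(T record);

    // The boilerplate every parser currently re-implements: convert the
    // record, then hand the resulting map off as a single-row batch.
    public List<Map<String, Object>> parseBatch(T record) {
        return Collections.singletonList(toMap(record));
    }
}
```

A base type along these lines would let each format-specific extension implement only `toMap`, rather than repeating the surrounding `parseBatch` scaffolding.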