
overhaul 'druid-orc-extensions' and move from 'contrib' to 'core' #7134

Closed
clintropolis opened this issue Feb 24, 2019 · 6 comments

@clintropolis (Member)

Motivation

The existing ORC hadoop indexing extension, while functional, is difficult to use and hard to troubleshoot, and does not provide a consistent experience compared to indexing other nested data formats such as JSON, Avro, and Parquet.

I think it would also be nice if this were a 'core' extension to help Druid tie in better with the rest of the Apache ecosystem; since we already provide 'core' extensions for Avro and Parquet, this would round out our 'core' support for popular Hadoop file formats.

Proposed changes

I propose replacing the existing 'contrib' extension with a new 'core' extension that is modelled after the Parquet and Avro hadoop indexing facilities, with support for flattenSpec, and built on Apache ORC's orc-mapreduce library instead of Hive's hive-exec. Instead of requiring the user to specify the schema in the form of the typeString parameter that the existing extension uses, the new extension will auto-detect the schema using the schema encoded in the TypeDescription of OrcStruct.
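As a sketch of what this could mean for ingestion specs (property and column names here are hypothetical, and the exact shape is subject to the final implementation), a parser using schema auto-detection might look like this, with no typeString needed since the schema is read from the file's TypeDescription:

```json
{
  "type": "orc",
  "parseSpec": {
    "format": "orc",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```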

Structurally it will look a lot like the Parquet extension, with implementations of JsonProvider and ObjectFlatteners.FlattenerMaker<OrcStruct> to provide flattening for the InputRowParser implementation, and a 'converter' utility class to translate the OrcStruct and its fields into a map to ultimately feed the InputRowParser implementation's internal MapInputRowParser.

Of note is that this new extension will not be exactly compatible with the current extension, so ingest specs will likely need to be changed, but the output should continue to match any existing Druid schemas produced by the current extension.

The primary incompatibility is that the typeString of the current extension allows arbitrary renaming of columns in the ORC file, as only position and type seem to be significant. I'm not certain but I presume the reason this is allowed is detailed in these Hive issues HIVE-7189 and HIVE-4243 which chronicle how Hive would write ORC files without their real column names, instead just using _col0, _col1, etc. However, flattenSpec expressions would be a way to handle this with the new extension, as the fields could be extracted from the generic name _col0 or whatever into the desired column name manually. If we feel that we really need to continue to support the old way the extension worked, I could investigate possible mechanisms to retain this functionality of providing a typeString schema and extracting column names from it, but I don't feel that it is necessary.
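For example, for a file written with generic Hive column names, a flattenSpec along these lines could restore meaningful names (the target names and JSONPath expressions here are hypothetical, for illustration only):

```json
{
  "useFieldDiscovery": false,
  "fields": [
    { "type": "path", "name": "userId", "expr": "$._col0" },
    { "type": "path", "name": "country", "expr": "$._col1" }
  ]
}
```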

Another incompatibility would be related to how the current extension handles OrcMap types. It provides a type of flattening 'magic' for maps of primitives that appear in the row with a name controlled by mapFieldNameFormat. Since the new extension would use flattenSpec, these names could be replicated to preserve existing Druid schemas with field extraction expressions.
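For instance, if the old extension flattened a map column named tags into fields like tags_environment via mapFieldNameFormat, a flattenSpec entry could reproduce the same field name (the map and key names here are hypothetical):

```json
{
  "fields": [
    { "type": "path", "name": "tags_environment", "expr": "$.tags.environment" }
  ]
}
```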

Alternatively, we could always just rebrand the existing 'contrib' extension as druid-hive-orc-extension and otherwise leave it as is, so existing users that prefer the existing behavior can continue as is, but having multiple ORC extensions might be confusing.

Rationale

The ORC format supports nested data, so it should use the flattenSpec approach to provide a consistent experience for users. By implementing this extension in the same pattern that is used elsewhere, it should also help reduce maintenance costs on our end. Additionally, using the Apache ORC libraries instead of Hive libraries "feels" more correct, even if many of these files may be coming from Hive.

Operational impact

The ORC extension will now be a 'core' extension instead of 'contrib', so it will be bundled in the binary packaging. Ingestion specs will need to be updated to use the new extension. At minimum this requires adjusting the inputFormat to org.apache.orc.mapreduce.OrcInputFormat instead of org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat, with the remaining work centered on migrating away from the typeString automatic column renaming (if we don't provide a replacement), or, for users relying on the current extension's OrcMap handling facilities, migrating to a flattenSpec to extract those properties.
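As a sketch, the inputFormat change in a hadoop ioConfig would look something like this (the path is a placeholder):

```json
{
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
    "paths": "/path/to/data.orc"
  }
}
```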

If we choose instead to retain the existing 'contrib' extension and rebrand it, the operational impact would be limited to changing the extension load list configuration.

Test plan

A variety of example ORC files will be sourced and used to comprehensively unit test the extension and ensure that flattening and field discovery work as expected.

Additionally, testing in a live cluster will be performed to ensure that everything is fully operational.

Future work

I think there is potentially room to refactor these flattening input row parsers and extract out one or more base types, since the parseBatch method implementation is always of the form:

mapParser.parseBatch(flattener.flatten(objectToFlatten))

so future work includes looking into ways to simplify and re-use code where possible, to reduce maintenance overhead on our end.
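A minimal sketch of what such a shared base type could look like; the class and method names here are hypothetical and illustrative only, not Druid's actual API:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical shared base for the flattening input row parsers:
// each format-specific parser supplies only a flattener, and the
// common flatten-then-delegate pattern lives in one place.
abstract class FlatteningParserBase<T> {
    // format-specific flattening of the raw object into a map
    private final Function<T, Map<String, Object>> flattener;
    // shared map-based parsing step (stands in for MapInputRowParser)
    private final Function<Map<String, Object>, List<String>> mapParser;

    FlatteningParserBase(Function<T, Map<String, Object>> flattener,
                         Function<Map<String, Object>, List<String>> mapParser) {
        this.flattener = flattener;
        this.mapParser = mapParser;
    }

    // The pattern every format-specific parser currently repeats:
    // flatten the raw object, then delegate to the map parser.
    public List<String> parseBatch(T row) {
        return mapParser.apply(flattener.apply(row));
    }
}
```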

@jon-wei (Contributor) commented Mar 22, 2019

This looks good to me. Avoiding the error-prone typeString sounds great, and I agree with using the more minimal orc-mapreduce lib instead of hive-exec (I've seen issues where hive-exec bundling Guava classes has caused incompatibilities).

@gianm (Contributor) commented Apr 8, 2019

@clintropolis,

> The primary incompatibility is that the typeString of the current extension allows arbitrary renaming of columns in the ORC file, as only position and type seem to be significant. I'm not certain but I presume the reason this is allowed is detailed in these Hive issues HIVE-7189 and HIVE-4243 which chronicle how Hive would write ORC files without their real column names, instead just using _col0, _col1, etc. However, flattenSpec expressions would be a way to handle this with the new extension, as the fields could be extracted from the generic name _col0 or whatever into the desired column name manually. If we feel that we really need to continue to support the old way the extension worked, I could investigate possible mechanisms to retain this functionality of providing a typeString schema and extracting column names from it, but I don't feel that it is necessary.

I agree here - in my experience writing the typeString is tough for people, and making it optional is a good thing. It sounds like flattenSpecs are going to be able to do all the stuff that typeStrings used to be able to do, which is nice because it's better aligned with how JSON/Avro/Parquet work.

> Another incompatibility would be related to how the current extension handles OrcMap types. It provides a type of flattening 'magic' for maps of primitives that appear in the row with a name controlled by mapFieldNameFormat. Since the new extension would use flattenSpec, these names could be replicated to preserve existing Druid schemas with field extraction expressions.

I feel similarly about this as the typeString thing. The move to 'core' seems like a good time to better align ORC with how other input formats work.

@gianm (Contributor) commented Apr 8, 2019

For the reasons mentioned in my last comment I am 👍 on the proposal. Will do another pass through the code too.

@clintropolis (Member, Author)

> I agree here - in my experience writing the typeString is tough for people, and making it optional is a good thing.

To clarify, my current PR eliminates typeString, so 'optional' isn't really correct since it would always be ignored. Do you think it necessary to have a compatibility layer of some sort to collect the column names from a typeString and map them to the actual column names defined in the OrcStruct?

@gianm (Contributor) commented Apr 9, 2019

> To clarify, my current PR eliminates typeString, so 'optional' isn't really correct since it would always be ignored. Do you think it necessary to have a compatibility layer of some sort to collect the column names from a typeString and map them to the actual column names defined in the OrcStruct?

No, I don't. By 'making it optional is good' I really meant 'making it not required is good'. I think removing it is ok, especially given that we're moving this from contrib to core, and it's a good time to make changes like this that improve maintainability. And nobody has spoken up recently as needing it. If it turns out a compat layer is needed for some reason then it could be built.

@clintropolis (Member, Author)

Completed in #7138
