Webhdfs support for orc-extension #6104

Closed
a2l007 opened this issue Aug 3, 2018 · 4 comments

@a2l007
Contributor

a2l007 commented Aug 3, 2018

The druid-orc-extensions extension does not appear to work with WebHDFS. I got it working with HDFS after removing some conflicting jars, but with WebHDFS the ingestion task fails with the following error:


ERROR [ORC_GET_SPLITS #0] org.apache.hadoop.hive.ql.io.AcidUtils - Failed to get files with ID; using regular API
java.lang.UnsupportedOperationException: Only supported for DFS; got class org.apache.hadoop.hdfs.web.WebHdfsFileSystem
	at org.apache.hadoop.hive.shims.Hadoop23Shims.ensureDfs(Hadoop23Shims.java:813) ~[hive-exec-2.0.0.jar:2.0.0]
	at org.apache.hadoop.hive.shims.Hadoop23Shims.listLocatedHdfsStatus(Hadoop23Shims.java:784) ~[hive-exec-2.0.0.jar:2.0.0]
	at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:477) [hive-exec-2.0.0.jar:2.0.0]
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:890) [hive-exec-2.0.0.jar:2.0.0]
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:875) [hive-exec-2.0.0.jar:2.0.0]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]

From the trace, it looks like it is attempting an operation that is not supported for WebHDFS. I'm raising this issue to check whether any users of the extension have run into this problem before.
@gianm @sirpkt, any input here?
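
To make the shape of the failure easier to follow, here is a minimal, self-contained sketch of the guard-then-fallback pattern that the log line ("Failed to get files with ID; using regular API") and the `ensureDfs` frame seem to describe. Every type here (`FileSystemLike`, `DistributedFs`, `WebHdfsFs`, `listForSplits`) is a hypothetical stand-in, not the real Hadoop or Hive API, and the sketch does not establish whether the fallback path is what ultimately fails the ingestion task.

```java
import java.util.Arrays;
import java.util.List;

public class DfsGuardSketch {

    /** Hypothetical stand-in for a Hadoop FileSystem; not the real API. */
    interface FileSystemLike {
        List<String> listFiles(String dir);
    }

    /** Models a native HDFS client that also offers a listing with file IDs. */
    static class DistributedFs implements FileSystemLike {
        @Override
        public List<String> listFiles(String dir) {
            return Arrays.asList(dir + "/part-0.orc");
        }

        List<String> listFilesWithIds(String dir) {
            return Arrays.asList(dir + "/part-0.orc#fileId=1");
        }
    }

    /** Models a WebHDFS client, which only offers the regular listing. */
    static class WebHdfsFs implements FileSystemLike {
        @Override
        public List<String> listFiles(String dir) {
            return Arrays.asList(dir + "/part-0.orc");
        }
    }

    /** Rejects anything that is not the native DFS client, as the trace suggests. */
    static DistributedFs ensureDfs(FileSystemLike fs) {
        if (!(fs instanceof DistributedFs)) {
            throw new UnsupportedOperationException(
                "Only supported for DFS; got " + fs.getClass());
        }
        return (DistributedFs) fs;
    }

    /** Tries the DFS-only listing first, then falls back to the regular one. */
    static List<String> listForSplits(FileSystemLike fs, String dir) {
        try {
            return ensureDfs(fs).listFilesWithIds(dir);
        } catch (UnsupportedOperationException e) {
            System.err.println(
                "Failed to get files with ID; using regular API: " + e.getMessage());
            return fs.listFiles(dir);
        }
    }

    public static void main(String[] args) {
        System.out.println(listForSplits(new DistributedFs(), "/data"));
        System.out.println(listForSplits(new WebHdfsFs(), "/data"));
    }
}
```

In this model the WebHDFS case logs the error and still returns a listing. If the real code behaves the same way, the logged exception may not be the root cause, and the actual task failure would have to be found later in the log.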

@gianm
Contributor

gianm commented Aug 5, 2018

Hi @a2l007, I've used this extension with regular HDFS, but not with WebHDFS. It looks like this method is called during getSplits, which is essential for M/R jobs to work. It also looks like the stack trace is entirely within Hive and Hadoop code. I wonder if there's some incompatibility between ORC and WebHDFS?

@a2l007
Contributor Author

a2l007 commented Aug 6, 2018

Yeah, ORC does a lot of seeks, and I guess the check could be there because that may not work well with WebHDFS. I wonder if it would make sense to switch from Hive's OrcInputFormat to the orc-core API (https://orc.apache.org/api/orc-core/index.html?org/apache/orc/OrcFile.html), but I guess that would be a question for the owner of this extension.

@gianm
Contributor

gianm commented Aug 6, 2018

@a2l007 I'm not much of an ORC expert, so I can't comment on that solution, but in terms of owners, contrib extensions are open to contributions from anyone. Please feel free to contribute something that you think is a meaningful step forward!

@a2l007
Contributor Author

a2l007 commented Aug 7, 2018

Sure, I'll investigate whether it is indeed a better solution and then work towards contributing it.

a2l007 closed this as completed on Jan 22, 2019