index_hadoop tasks fail on wrong file format when run inside indexer #8840

Closed
sixtus opened this issue Nov 7, 2019 · 18 comments · Fixed by #9059
Comments

sixtus commented Nov 7, 2019

We have seen this before: the `:` characters are not being replaced, so HDFS refuses the file because the name is illegal.

I guess this is yet another thing that was fixed in the peon but not in the indexer?

java.lang.IllegalArgumentException: Pathname /druid/indexer/foo/2019-10-29T00:00:00.000Z_2019-10-30T00:00:00.000Z/2019-11-06T19:57:56.216Z/29/index.zip.0 from hdfs://us2/druid/indexer/foo/2019-10-29T00:00:00.000Z_2019-10-30T00:00:00.000Z/2019-11-06T19:57:56.216Z/29/index.zip.0 is not a valid DFS filename.
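
For context, a minimal sketch of the sanitization that is apparently being skipped: HDFS forbids `:` in path components, so the interval and version timestamps in a segment path need their colons replaced with `_` before the segment is pushed. The helper below is hypothetical, not Druid's actual code:

```java
public class SegmentPathExample
{
  // Hypothetical helper: HDFS rejects ':' in path components, so the
  // ISO-8601 timestamps in a segment path must be rewritten before pushing.
  static String hdfsSafe(String segmentPath)
  {
    return segmentPath.replace(':', '_');
  }

  public static void main(String[] args)
  {
    String raw = "/druid/indexer/foo/2019-10-29T00:00:00.000Z_2019-10-30T00:00:00.000Z/29/index.zip";
    System.out.println(hdfsSafe(raw));
    // -> /druid/indexer/foo/2019-10-29T00_00_00.000Z_2019-10-30T00_00_00.000Z/29/index.zip
  }
}
```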

sixtus commented Nov 10, 2019

ok, I think I found something:

druid.storage.storageDirectory=/druid/indexer works for index_kafka and compact tasks, but not for index_hadoop. For index_hadoop it has to be prefixed with hdfs://, otherwise the segments end up in HDFS but are recorded as local files in the metadata database. Maybe that triggered the above, too?
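
To spell that out, here are the two variants (a sketch using the values from this thread):

```properties
# Works for index_kafka and compact tasks, but index_hadoop mis-records
# the segments in the metadata database as local files:
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/indexer

# Works for index_hadoop as well, since the scheme and authority are explicit:
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://us2/druid/indexer
```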

sixtus commented Nov 12, 2019

OK, so I fixed the config problem and reverted to mainline. The problem is back: index_hadoop fails when trying to write a file containing a `:`.

gianm commented Nov 12, 2019

Hi @sixtus,

Is this in Druid 0.16.0?

Am I hearing you right that when you run 0.16.0 using MM + peons, everything works fine, but when you run 0.16.0 with an Indexer, you get this behavior where `:` is not replaced with `_` in segment names?

sixtus commented Nov 12, 2019

Yes, this is Druid 0.16.0. I just fixed the config bug and reverted, and the bug is back. I am not sure whether it works when running on peons (I didn't validate that), but it sure isn't working when running on the indexer.

My PR has two commits. I expected it to work after the first one, and it didn't, but I think that was just a jar caching issue. I am about to test it again.

gianm commented Nov 12, 2019

What are your druid.storage.* properties?

I am looking at the code that handles colon replacement in HDFS paths; it hasn't changed in a while, and I don't see a reason for it to behave differently on MM/peon vs. Indexer. Also, we do have other users on 0.16.0 + HDFS deep storage, so it should still work…

But if it doesn't, please feed us some more clues and we can get to the bottom of it.

sixtus changed the title from "hadoop_index tasks fail on wrong file format when run inside indexer" to "index_hadoop tasks fail on wrong file format when run inside indexer" on Nov 12, 2019
sixtus commented Nov 12, 2019

It's working for compact and index_kafka. However, with index_hadoop the segments are written by the Hadoop job itself; I think that's the problem. I just removed the broken commit from my PR.

druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://us2/druid/indexer

And yes, I have been doing this for a while. I remember filing a bug about it back in the day, and now it's back.

sixtus commented Nov 12, 2019

To be more precise, it's the Hadoop reduce task that fails.

sixtus commented Nov 12, 2019

And there is no alternative to index_hadoop for raw files in HDFS, at least none that is obvious from the documentation.

gianm commented Nov 12, 2019

Hmm, I tried the Hadoop tutorial, which uses HDFS deep storage, and it worked ok for me. I wonder if something weird is going on with your setup.

Do you have a stack trace from the reduce task that fails?

sixtus commented Nov 12, 2019

11:33:28.944 [main] ERROR org.apache.druid.indexer.JobHelper - Exception in retry loop
java.lang.IllegalArgumentException: Pathname /druid/indexer/foo/2019-11-12T04:00:00.000Z_2019-11-12T05:00:00.000Z/2019-11-12T10:04:03.778Z/0/index.zip.0 from hdfs://us2/druid/indexer/foo/2019-11-12T04:00:00.000Z_2019-11-12T05:00:00.000Z/2019-11-12T10:04:03.778Z/0/index.zip.0 is not a valid DFS filename.
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:217) ~[hadoop-hdfs-client-2.8.5.jar:?]
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:476) ~[hadoop-hdfs-client-2.8.5.jar:?]
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:473) ~[hadoop-hdfs-client-2.8.5.jar:?]
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:473) ~[hadoop-hdfs-client-2.8.5.jar:?]
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:414) ~[hadoop-hdfs-client-2.8.5.jar:?]
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.druid.indexer.JobHelper$2.push(JobHelper.java:452) [druid-indexing-hadoop-0.16.0-lqm1.jar:0.16.0-lqm1]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_222]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_222]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_222]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_222]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409) [hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) [hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) [hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) [hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346) [hadoop-common-2.8.5.jar:?]
	at com.sun.proxy.$Proxy78.push(Unknown Source) [?:?]
	at org.apache.druid.indexer.JobHelper.serializeOutIndex(JobHelper.java:469) [druid-indexing-hadoop-0.16.0-lqm1.jar:0.16.0-lqm1]
	at org.apache.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:827) [druid-indexing-hadoop-0.16.0-lqm1.jar:0.16.0-lqm1]
	at org.apache.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:579) [druid-indexing-hadoop-0.16.0-lqm1.jar:0.16.0-lqm1]
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) [hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627) [hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389) [hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) [hadoop-mapreduce-client-app-2.8.5.jar:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_222]
	at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_222]
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) [hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169) [hadoop-mapreduce-client-app-2.8.5.jar:?]

We upgraded to Hadoop 2.8.5 right around the same time as Druid 0.16.

sixtus commented Nov 12, 2019

The tutorial example is druid.storage.type=local, btw.

sixtus commented Nov 12, 2019

I just tried my patch; it's not working.

sixtus commented Nov 18, 2019

I just noticed the kill task is also throwing an exception:

Error: java.lang.IllegalArgumentException: Pathname /druid/indexer/foo/2019-11-17T17:00:00.000Z_2019-11-17T18:00:00.000Z/2019-11-18T07:08:21.067Z/0/index.zip.1 from hdfs://us2/druid/indexer/foo/2019-11-17T17:00:00.000Z_2019-11-17T18:00:00.000Z/2019-11-18T07:08:21.067Z/0/index.zip.1 is not a valid DFS filename.
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:217)
        at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:476)
        at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:473)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:473)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:414)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
        at org.apache.druid.indexer.JobHelper$2.push(JobHelper.java:452)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
        at com.sun.proxy.$Proxy78.push(Unknown Source)
        at org.apache.druid.indexer.JobHelper.serializeOutIndex(JobHelper.java:469)
        at org.apache.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:828)
        at org.apache.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:579)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)

I verified that there is no broken name in the metadata storage, i.e. the kill task must have generated the path itself (rather than using the path from the metadata store) and then run into the same trap.

From my limited understanding, it looks like it's not instantiating HdfsDataSegmentPusher but rather LocalDataSegmentPusher. Still puzzled, though.
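
To illustrate the suspected failure mode (Druid actually selects the pusher via Guice keyed on druid.storage.type, so this factory is a simplification, not the real wiring):

```java
import java.util.Properties;

public class PusherSelectionSketch
{
  // Simplified illustration: if the process never sees
  // druid.storage.type=hdfs, the default ("local") wins and the segment
  // path is never made HDFS-safe.
  static String selectPusher(Properties props)
  {
    String type = props.getProperty("druid.storage.type", "local");
    return "hdfs".equals(type) ? "HdfsDataSegmentPusher" : "LocalDataSegmentPusher";
  }

  public static void main(String[] args)
  {
    System.out.println(selectPusher(new Properties())); // LocalDataSegmentPusher
  }
}
```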

jon-wei commented Dec 6, 2019

It sounds like the config is specifying druid.storage.type=local somewhere; is it possible that there's a stray/stale entry for that in your runtime properties?

When the Druid processes start up, they'll log their configuration properties; for the indexer process, do you see it using druid.storage.type=hdfs at that point?

sixtus commented Dec 9, 2019 via email

sixtus commented Dec 9, 2019 via email

jon-wei added this to the 0.17.0 milestone on Dec 10, 2019
jon-wei commented Dec 10, 2019

@sixtus I was able to reproduce this. It appears to be an issue where the Hadoop mapper or reducer is not picking up the druid.storage.type property correctly; I will look into fixing this for 0.17.0.
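
The gist is that properties set on the indexer JVM are not automatically visible inside the YARN containers that run the mappers and reducers; they have to be carried into the Hadoop job Configuration. A sketch of that idea (an assumption about the mechanism, not the actual change in #9059):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobPropertyPropagationSketch
{
  // Copy the Druid storage properties into the Hadoop job Configuration so
  // the remote map/reduce tasks can read them; otherwise they fall back to
  // defaults such as druid.storage.type=local.
  public static Job configureJob() throws IOException
  {
    Configuration conf = new Configuration();
    conf.set("druid.storage.type", "hdfs");
    conf.set("druid.storage.storageDirectory", "hdfs://us2/druid/indexer");
    return Job.getInstance(conf, "druid-index-generator");
  }
}
```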

jon-wei commented Dec 17, 2019

@sixtus I've opened a fix PR for this: #9059
