Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Middlemanager fails startup due to corrupt task files #7886

Closed
xvrl opened this issue Jun 13, 2019 · 7 comments · Fixed by #7917
Closed

Middlemanager fails startup due to corrupt task files #7886

xvrl opened this issue Jun 13, 2019 · 7 comments · Fixed by #7917
Assignees
Labels
Milestone

Comments

@xvrl
Copy link
Member

xvrl commented Jun 13, 2019

Affected Version

0.15.0-SNAPSHOT (git sha d99f77a)

Description

A middle-manager shutdown may leave empty task files in
${druid.indexer.task.baseTaskDir}/completedTasks/. This may be an issue on it's own, but it could also happen for reasons beyond our control.

Those empty (corrupt) files cause the middlemanager to fail on a subsequent startup, due to https://github.com/apache/incubator-druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/worker/WorkerTaskManager.java#L430 re-throwingh a JsonProcessingException

The exception message also looks incorrect, saying the files would be ignored, but instead it causes the entire startup sequence to interrupt, requiring user intervention to remove corrupt files in order to resume startup.

java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_212]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_212]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_212]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_212]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:443) ~[druid-core-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:339) ~[druid-core-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	at org.apache.druid.guice.LifecycleModule$2.start(LifecycleModule.java:140) ~[druid-core-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:106) [druid-services-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	at org.apache.druid.cli.ServerRunnable.run(ServerRunnable.java:57) [druid-services-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	at org.apache.druid.cli.Main.main(Main.java:118) [druid-services-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
Caused by: org.apache.druid.java.util.common.ISE: Failed to read completed task from disk at [/var/druid/task/completedTasks/index_kafka_metrics_opencensus_ebffe52caf33afb_lcgaddpn]. Ignored.
	at org.apache.druid.indexing.worker.WorkerTaskManager.initCompletedTasks(WorkerTaskManager.java:430) ~[druid-indexing-service-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	at org.apache.druid.indexing.worker.WorkerTaskManager.start(WorkerTaskManager.java:135) ~[druid-indexing-service-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	at org.apache.druid.indexing.worker.WorkerTaskMonitor.start(WorkerTaskMonitor.java:94) ~[druid-indexing-service-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	... 10 more
Caused by: com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input
 at [Source: var/druid/task/completedTasks/index_kafka_<datasource>_ebffe52caf33afb_lcgaddpn; line: 1, column: 1]
	at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148) ~[jackson-databind-2.6.7.jar:2.6.7]
	at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3781) ~[jackson-databind-2.6.7.jar:2.6.7]
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3721) ~[jackson-databind-2.6.7.jar:2.6.7]
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2620) ~[jackson-databind-2.6.7.jar:2.6.7]
	at org.apache.druid.indexing.worker.WorkerTaskManager.initCompletedTasks(WorkerTaskManager.java:421) ~[druid-indexing-service-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	at org.apache.druid.indexing.worker.WorkerTaskManager.start(WorkerTaskManager.java:135) ~[druid-indexing-service-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
	at org.apache.druid.indexing.worker.WorkerTaskMonitor.start(WorkerTaskMonitor.java:94) ~[druid-indexing-service-0.15.0-incubating-SNAPSHOT.jar:0.15.0-incubating-SNAPSHOT]
@xvrl xvrl added the Bug label Jun 13, 2019
@vogievetsky
Copy link
Contributor

Hi @xvrl 👋
Do you know if this a 0.15.0 regression?

@xvrl
Copy link
Member Author

xvrl commented Jun 13, 2019

@vogievetsky looks like this startup behavior has been around since https://github.com/apache/incubator-druid/pull/5104/files . I'll have to look if the creation of empty completed task files is a regression or not

@gianm
Copy link
Contributor

gianm commented Jun 13, 2019

Yeah the 'Ignored.' is wrong at least.

@himanshug
Copy link
Contributor

yeah startup should be more forgiving and also the code writing those files should try to write them atomically as much possible to reduce the likelihood of empty/corrupted files.
I will fix it.

@david-z-johnson
Copy link

@himanshug , I have the same issue in version 0.14.2 after upgrading from 0.12.3. which release contains that fix?

@himanshug
Copy link
Contributor

himanshug commented Jul 5, 2019

@david-z-johnson this will be made available in next Druid release , for now please use the workaround of deleting corrupted files.

@david-z-johnson
Copy link

@himanshug , thanks, the work around works for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants