Potential bug in HttpPostEmitter causing high CPU usage #5338

Closed
pjain1 opened this issue Feb 2, 2018 · 17 comments · Fixed by #5386
Comments

pjain1 (Member) commented Feb 2, 2018

After upgrading to 0.11.0, some of our deployments are facing a high CPU usage issue.

After taking a thread dump, we suspected that the emitter thread might be causing it.

Thread 65688: (state = IN_JAVA)
 - com.metamx.emitter.core.HttpPostEmitter.emitAndReturnBatch(com.metamx.emitter.core.Event) @bci=111, line=249 (Compiled frame; information may be imprecise)
 - com.metamx.emitter.core.HttpPostEmitter.emit(com.metamx.emitter.core.Event) @bci=2, line=214 (Compiled frame)
 - com.metamx.emitter.core.ComposingEmitter.emit(com.metamx.emitter.core.Event) @bci=31, line=57 (Compiled frame)
 - com.metamx.emitter.service.ServiceEmitter.emit(com.metamx.emitter.core.Event) @bci=5, line=72 (Compiled frame)
 - com.metamx.emitter.service.ServiceEmitter.emit(com.metamx.emitter.service.ServiceEventBuilder) @bci=9, line=77 (Compiled frame)

To verify the issue, we added an executor in the DruidCoordinator class that just keeps emitting events in a while(true) loop like this -

while (true) {
   emitter.emit(ServiceMetricEvent.builder().setDimension("dataSource", "try").build("test", 10));
}

We found that after some time, the batch.tryAddEvent method always returns false, the reference in concurrentBatch never changes, and the while(true) loop just keeps spinning without sending anything or creating a new batch.

Still not sure why it is happening, as it is not happening for all deployments; it might be some concurrency issue.
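
For illustration, here is a minimal, self-contained model of the spin described above (this is not the actual HttpPostEmitter code; class and method names are simplified):

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of the busy loop: once the current batch is sealed and nothing
// ever installs a fresh batch into concurrentBatch, every tryAddEvent() call fails
// and emit() spins forever at 100% CPU.
class EmitterSpinSketch
{
  static class Batch
  {
    volatile boolean sealed;

    boolean tryAddEvent(byte[] eventBytes)
    {
      return !sealed; // a sealed batch rejects every event
    }
  }

  private final AtomicReference<Batch> concurrentBatch = new AtomicReference<>(new Batch());

  void emit(byte[] eventBytes)
  {
    while (true) {
      Batch batch = concurrentBatch.get();
      if (batch.tryAddEvent(eventBytes)) {
        return; // event buffered in the current batch
      }
      // The batch is sealed. Normally a new batch would replace it in
      // concurrentBatch; if that never happens, this loop never exits.
    }
  }
}
```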

@leventov

pjain1 added the Bug label Feb 2, 2018
leventov (Member) commented Feb 2, 2018

I reviewed quickly but couldn't find a bug. Could you add some logging code? In

https://github.com/metamx/java-util/blob/f33e0b0f275fa96bfe79a1d68b4ea70d09cd496e/src/main/java/com/metamx/emitter/core/HttpPostEmitter.java#L244-L246

add

else { log.info("failed to emit in batch [%s]", batch); }

Also, add the batchNumber field to the Batch.toString() implementation.

Also, add logging after successful compareAndSetState() calls in Batch.tryReleaseShared().
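
Put together, the suggested instrumentation would look roughly like this (a sketch only; the surrounding code, accessors, and CAS variable names are assumptions about the linked emitter source):

```java
// In HttpPostEmitter.emitAndReturnBatch(): log when an event cannot be added to the batch.
if (batch.tryAddEvent(eventBytes)) {
  return batch;
} else {
  log.info("failed to emit in batch [%s]", batch);
}

// In Batch.toString(): include batchNumber so log lines can be correlated.
@Override
public String toString()
{
  // bufferWatermark/parties/isSealed accessors are assumptions for illustration
  return "Batch{batchNumber=" + batchNumber +
         ", bufferWatermark=" + getBufferWatermark() +
         ", parties=" + getParties() +
         ", isSealed=" + isSealed() + '}';
}

// In Batch.tryReleaseShared(): log after each successful state transition.
if (compareAndSetState(expectedState, newState)) {
  log.info("old state [%s]", expectedState);
  log.info("new state [%s]", newState);
}
```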

niketh (Contributor) commented Feb 7, 2018

We are facing the same issue; we had an outage when a lot of the historicals started spinning at 100% CPU utilization and queries couldn't return. A lot of threads are stuck here:

at com.metamx.emitter.core.HttpPostEmitter.emitAndReturnBatch(HttpPostEmitter.java:244)
	at com.metamx.emitter.core.HttpPostEmitter.emit(HttpPostEmitter.java:214)
	at com.metamx.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:72)
	at com.metamx.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:77)
	at io.druid.query.DefaultQueryMetrics.emit(DefaultQueryMetrics.java:290)
	at io.druid.query.MetricsEmittingQueryRunner$1.after(MetricsEmittingQueryRunner.java:120)
	at io.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:66)
	at io.druid.java.util.common.guava.LazySequence.accumulate(LazySequence.java:40)
	at io.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
	at io.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
	at io.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
	at io.druid.query.spec.SpecificSegmentQueryRunner$2.accumulate(SpecificSegmentQueryRunner.java:86)
	at io.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
	at io.druid.query.spec.SpecificSegmentQueryRunner.doNamed(SpecificSegmentQueryRunner.java:172)
	at io.druid.query.spec.SpecificSegmentQueryRunner.access$200(SpecificSegmentQueryRunner.java:45)
	at io.druid.query.spec.SpecificSegmentQueryRunner$3.wrap(SpecificSegmentQueryRunner.java:152)
	at io.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
	at io.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
	at io.druid.query.CPUTimeMetricQueryRunner$1.wrap(CPUTimeMetricQueryRunner.java:74)
	at io.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
	at io.druid.java.util.common.guava.Sequences.toList(Sequences.java:150)
	at io.druid.query.ChainedExecutionQueryRunner$1$1$1.call(ChainedExecutionQueryRunner.java:130)
	at io.druid.query.ChainedExecutionQueryRunner$1$1$1.call(ChainedExecutionQueryRunner.java:120)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at io.druid.query.PrioritizedListenableFutureTask.run(PrioritizedExecutorService.java:271)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

leventov (Member) commented Feb 7, 2018

@niketh could you do what I asked in the comment above?

himanshug added this to the 0.12.0 milestone Feb 7, 2018
himanshug (Contributor) commented
Added it to the 0.12.0 milestone. Technically it appears to be a regression in 0.11.0, but it would only be patched in 0.12.0.

pjain1 (Member, Author) commented Feb 7, 2018

@leventov I can try doing what you suggested; however, it is easily reproducible locally. Just add the following code in some executor -

while (true) {
   emitter.emit(ServiceMetricEvent.builder().setDimension("dataSource", "try").build("test", 10));
}

Please see if you can reproduce it at your end. I'll let you know the outcome of the extra logging.

leventov (Member) commented Feb 7, 2018

@pjain1 see #5365.

If it's easily reproducible, could you create a test?

pjain1 (Member, Author) commented Feb 7, 2018

So far I have been unable to reproduce this issue with the latest master or with a unit test for emitter 0.6.0. So I took the log statements from PR #5365 and added them to emitter 0.6.0, which Druid 0.11.0 uses, and I can see the problem happening. Apart from what you suggested, I also added the statement System.out.println("No new batches, current batch " + concurrentBatch.get()); at https://github.com/metamx/emitter/blob/emitter-0.6.0/src/main/java/com/metamx/emitter/core/HttpPostEmitter.java#L211

Here's the output -

Unlocked and sealed batch [674]
old state 4295052796
new state -9223372036854690308
Unlocked and sealed batch [675]
old state 4295052796
new state -9223372036854690308
Unlocked and sealed batch [676]
old state 4295052796
new state -9223372036854690308
Unlocked and sealed batch [677]
old state 4295052796
new state -9223372036854690308
Unlocked and sealed batch [678]
old state 4295052796
new state -9223372036854690308
Unlocked and sealed batch [679]
old state 4295052796
new state -9223372036854690308
...
2018-02-07T22:21:21,363 INFO [HttpPostEmitter-1] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://<host>:<port>
2018-02-07T22:21:21,492 INFO [HttpPostEmitter-1] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://<host>:<port>
2018-02-07T22:21:21,618 INFO [HttpPostEmitter-1] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://<host>:<port>
...
2018-02-07T22:21:22,468 INFO [HttpPostEmitter-1] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://<host>:<port>
No new batches, current batch Batch{batchNumber=679, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=679, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=679, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=679, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=679, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=679, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=679, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=679, bufferWatermark=85500, parties=0, isSealed=true}
....
....

pjain1 (Member, Author) commented Feb 7, 2018

Here's the output with more information, with Druid 0.11.0 and emitter 0.6.0 -

Unlocked and sealed batch [684]
old state[bufferWatermark=85500, parties=1, isSealed=false]
new state[bufferWatermark=85500, parties=0, isSealed=true]
Unlocked and sealed batch [685]
old state[bufferWatermark=85500, parties=1, isSealed=false]
new state[bufferWatermark=85500, parties=0, isSealed=true]
Unlocked and sealed batch [686]
old state[bufferWatermark=85500, parties=1, isSealed=false]
new state[bufferWatermark=85500, parties=0, isSealed=true]
Unlocked and sealed batch [687]
old state[bufferWatermark=85500, parties=1, isSealed=false]
new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-07T23:08:28,138 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
...
...
No new batches, current batch Batch{batchNumber=687, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=687, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=687, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=687, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=687, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=687, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=687, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=687, bufferWatermark=85500, parties=0, isSealed=true}
No new batches, current batch Batch{batchNumber=687, bufferWatermark=85500, parties=0, isSealed=true}
...
...

leventov (Member) commented Feb 7, 2018

@pjain1 thanks, is it possible to add thread and time information to the output?

leventov (Member) commented Feb 7, 2018

Like this: 2018-02-07T23:08:28,138 INFO [Coordinator-Exec--0] prefix before each line.

pjain1 (Member, Author) commented Feb 7, 2018

OK, I will do that and post the updated output.

pjain1 (Member, Author) commented Feb 8, 2018

@leventov

2018-02-08T17:05:37,432 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [0]
2018-02-08T17:05:37,432 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:37,432 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-08T17:05:37,460 INFO [HttpPostEmitter-1] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://<host>:<port>
2018-02-08T17:05:37,473 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [1]
2018-02-08T17:05:37,474 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:37,474 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-08T17:05:37,501 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [2]
2018-02-08T17:05:37,501 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:37,502 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-08T17:05:37,524 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [3]
2018-02-08T17:05:37,524 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:37,524 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-08T17:05:37,537 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [4]
2018-02-08T17:05:37,538 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:37,538 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-08T17:05:37,547 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [5]
2018-02-08T17:05:37,547 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:37,547 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
...
...
2018-02-08T17:05:43,055 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [645]
2018-02-08T17:05:43,055 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:43,055 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-08T17:05:43,056 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [646]
2018-02-08T17:05:43,057 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:43,057 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-08T17:05:43,622 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [647]
2018-02-08T17:05:43,622 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:43,622 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-08T17:05:43,624 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [648]
2018-02-08T17:05:43,624 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-08T17:05:43,624 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-08T17:05:44,474 INFO [HttpPostEmitter-1] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://<host>:<port>
2018-02-08T17:05:44,479 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2018-02-08T17:05:44,480 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [0] available segments.
2018-02-08T17:05:44,485 WARN [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorRuleRunner - Uh... I have no servers. Not assigning anything...
2018-02-08T17:05:44,485 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorCleanupUnneeded - Found 0 availableSegments, skipping the cleanup of segments from historicals. This is done to prevent a race condition in which the coordinator would drop all segments if it started running cleanup before it finished polling the metadata storage for available segments for the first time.
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-08T17:05:44,491 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=648, bufferWatermark=85500, parties=0, isSealed=true}]
...
...

s is the name of the thread that is running -

while (true) {
   emitter.emit(ServiceMetricEvent.builder().setDimension("dataSource", "try").build("test", 10));
}

pjain1 (Member, Author) commented Feb 8, 2018

Here are multiple thread dumps from multiple runs when this situation happens - https://gist.github.com/pjain1/534c5941390037663b94c7ddcf6a0a3c

pjain1 (Member, Author) commented Feb 8, 2018

I have been trying to reproduce this issue on master but have succeeded only once across multiple tries.

leventov (Member) commented
@pjain1 "s" thread must complete onSealExclusive() successfully, but for some reason it didn't do this. Please continue to debug this by adding logging in onSealExclusive(), and also in the "s" thread, wrap the while loop with try-catch (Throwable), to print anything that might be thrown from there (I suspect IllegalStateException)

pjain1 (Member, Author) commented Feb 12, 2018

After wrapping the while loop in try-catch, I found that the acquireBuffer call in the onSealExclusive method of HttpPostEmitter was throwing java.lang.OutOfMemoryError: Java heap space. After this, the HttpPostEmitter gets into a situation where it cannot emit anymore and keeps spinning indefinitely in the while(true) loop in the emitAndReturnBatch method, because concurrentBatch never gets updated with a new batch and the old batch has already been sealed. Here's the stack trace -

2018-02-12T20:29:09,946 INFO [s] com.metamx.emitter.core.Batch - Unlocked and sealed batch [692]
2018-02-12T20:29:09,946 INFO [s] com.metamx.emitter.core.Batch - old state[bufferWatermark=85500, parties=1, isSealed=false]
2018-02-12T20:29:09,946 INFO [s] com.metamx.emitter.core.Batch - new state[bufferWatermark=85500, parties=0, isSealed=true]
2018-02-12T20:29:09,946 INFO [s] com.metamx.emitter.core.HttpPostEmitter - Added batch Batch{batchNumber=692, bufferWatermark=85500, parties=0, isSealed=true} for emitting
2018-02-12T20:29:09,946 INFO [s] com.metamx.emitter.core.HttpPostEmitter - Next batch number is 693
2018-02-12T20:29:10,852 ERROR [s] io.druid.server.coordinator.DruidCoordinator - Error in emit thread
java.lang.OutOfMemoryError: Java heap space
	at com.metamx.emitter.core.HttpPostEmitter.acquireBuffer(HttpPostEmitter.java:578) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at com.metamx.emitter.core.HttpPostEmitter.onSealExclusive(HttpPostEmitter.java:245) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at com.metamx.emitter.core.Batch.tryReleaseShared(Batch.java:296) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at java.util.concurrent.locks.AbstractQueuedLongSynchronizer.releaseShared(AbstractQueuedLongSynchronizer.java:1119) ~[?:1.8.0_65]
	at com.metamx.emitter.core.Batch.unlockAndSeal(Batch.java:248) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at com.metamx.emitter.core.Batch.unlockAndSealIfNeeded(Batch.java:216) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at com.metamx.emitter.core.Batch.tryAddNonFirstEvent(Batch.java:184) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at com.metamx.emitter.core.Batch.tryAddEvent(Batch.java:135) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at com.metamx.emitter.core.HttpPostEmitter.emitAndReturnBatch(HttpPostEmitter.java:208) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at com.metamx.emitter.core.HttpPostEmitter.emit(HttpPostEmitter.java:178) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at com.metamx.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:72) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at com.metamx.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:77) ~[emitter-0.6.0-SNAPSHOT.jar:?]
	at io.druid.server.coordinator.DruidCoordinator.lambda$new$2(DruidCoordinator.java:222) [classes/:?]
	at io.druid.server.coordinator.DruidCoordinator$$Lambda$10/1983283775.run(Unknown Source) [classes/:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_65]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_65]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_65]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_65]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]
2018-02-12T20:29:10,858 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
2018-02-12T20:29:10,859 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [0] available segments.
java.lang.OutOfMemoryError: Java heap space
	at com.metamx.emitter.core.HttpPostEmitter.acquireBuffer(HttpPostEmitter.java:578)
	at com.metamx.emitter.core.HttpPostEmitter.onSealExclusive(HttpPostEmitter.java:245)
	at com.metamx.emitter.core.Batch.tryReleaseShared(Batch.java:296)
	at java.util.concurrent.locks.AbstractQueuedLongSynchronizer.releaseShared(AbstractQueuedLongSynchronizer.java:1119)
	at com.metamx.emitter.core.Batch.unlockAndSeal(Batch.java:248)
	at com.metamx.emitter.core.Batch.unlockAndSealIfNeeded(Batch.java:216)
	at com.metamx.emitter.core.Batch.tryAddNonFirstEvent(Batch.java:184)
	at com.metamx.emitter.core.Batch.tryAddEvent(Batch.java:135)
	at com.metamx.emitter.core.HttpPostEmitter.emitAndReturnBatch(HttpPostEmitter.java:208)
	at com.metamx.emitter.core.HttpPostEmitter.emit(HttpPostEmitter.java:178)
	at com.metamx.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:72)
	at com.metamx.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:77)
	at io.druid.server.coordinator.DruidCoordinator.lambda$new$2(DruidCoordinator.java:222)
	at io.druid.server.coordinator.DruidCoordinator$$Lambda$10/1983283775.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2018-02-12T20:29:10,865 WARN [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorRuleRunner - Uh... I have no servers. Not assigning anything...
2018-02-12T20:29:10,865 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorCleanupUnneeded - Found 0 availableSegments, skipping the cleanup of segments from historicals. This is done to prevent a race condition in which the coordinator would drop all segments if it started running cleanup before it finished polling the metadata storage for available segments for the first time.
2018-02-12T20:29:10,875 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=692, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-12T20:29:10,875 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=692, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-12T20:29:10,875 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=692, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-12T20:29:10,875 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=692, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-12T20:29:10,875 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=692, bufferWatermark=85500, parties=0, isSealed=true}]
2018-02-12T20:29:10,875 INFO [Coordinator-Exec--0] com.metamx.emitter.core.HttpPostEmitter - Failed to emit an event in batch [Batch{batchNumber=692, bufferWatermark=85500, parties=0, isSealed=true}]

Looks like, because of #5300, the Java heap space problem does not happen with the master code, which is why it is not easily reproducible on master. Nonetheless, there is a bug: if any exception happens while creating a new batch, concurrentBatch will not be updated, causing future emit calls to loop infinitely.
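
For illustration, one defensive shape for the batch rollover (a sketch only, not necessarily what #5386 does; createNextBatch() and the null sentinel are hypothetical):

```java
// Sketch: make sure concurrentBatch cannot be left pointing at a sealed batch
// forever if allocating the next buffer throws (e.g. OutOfMemoryError).
try {
  Batch next = createNextBatch();   // hypothetical helper; buffer allocation may throw
  concurrentBatch.set(next);        // emitters can make progress again
}
catch (Throwable t) {
  concurrentBatch.set(null);        // hypothetical sentinel that emit() would check and bail out on
  throw t;                          // still surface the original error
}
```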

@leventov

leventov (Member) commented
@pjain1 please try #5386.
