"One trace segment has been abandoned, cause by buffer is full" bug fix.#3048
"One trace segment has been abandoned, cause by buffer is full" bug fix.#3048coki230 wants to merge 4 commits intoapache:masterfrom
Conversation
Merge the latest code to local.
Fix the bug that trace info can't be reported.
.setBulkActions(bulkActions)
.setBulkSize(new ByteSizeValue(bulkSize, ByteSizeUnit.MB))
.setFlushInterval(TimeValue.timeValueSeconds(flushInterval))
// modified by zhaoze, if set auto flush, there will cause deadlock with bulk commit.
You can't put this comment in the repo; please remove it.
Make the comment without your name, just explain why.
@peng-yongsheng I am not familiar with this, please review.
Could you put the Elasticsearch discussion issue here? I think you mentioned that before.
@coki230 I don't think removing the flush-interval setting can fix the problem you describe. These settings all have the same capability, such as setBulkActions, setBulkSize, setFlushInterval. Please take a look at the official document: setBulkActions(10000), setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)), setFlushInterval(TimeValue.timeValueSeconds(5)). If any of these conditions is reached, the bulk will be executed. OAP servers are a cluster; all of them will execute the bulk operation to insert or update the metrics into Elasticsearch at any time. So it is reasonable to add locks, and lock contention is normal.
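For context, here is a minimal sketch of a BulkProcessor configured with those three triggers. The builder wiring, the class name, and the RestHighLevelClient usage are illustrative assumptions, not SkyWalking's actual ElasticSearchClient code; only the three setters come from the comment above.

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class BulkProcessorConfigSketch {
    // Build a BulkProcessor whose bulk request is executed when ANY of the
    // three conditions is reached: action count, payload size, or flush interval.
    static BulkProcessor create(RestHighLevelClient client) {
        return BulkProcessor.builder(
                client::bulkAsync, // client wiring is an assumption for this sketch
                new BulkProcessor.Listener() {
                    @Override public void beforeBulk(long executionId, BulkRequest request) { }
                    @Override public void afterBulk(long executionId, BulkRequest request, BulkResponse response) { }
                    @Override public void afterBulk(long executionId, BulkRequest request, Throwable failure) { }
                })
            .setBulkActions(10000)                              // trigger 1: every 10000 actions
            .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // trigger 2: every 5 MB of queued requests
            .setFlushInterval(TimeValue.timeValueSeconds(5))    // trigger 3: the scheduled auto flush discussed in this PR
            .build();
    }
}
```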
Please reply to the review. @peng-yongsheng leads this part; I think we need a full explanation if you think we should merge this.
@peng-yongsheng there is a manual flush in the code, in the class "BatchProcessEsDAO" at line 72.
@coki230 do you have proof of this deadlock, and why does it not happen always, only in a limited-resources env? Is this a bug or something else? All this information is very important. This change is important; we need to make sure.
@wu-sheng OK, I will take some time to work out how to reproduce it. In my env it always happens, but I'm not sure about limited resources.
@coki230 I guess the real reason is that your Elasticsearch servers are hung up by the automatic data deletion. This is a design flaw before 6.2.0; it has been fixed. When the Elasticsearch servers are very busy deleting history data, there are no more idle resources to process the data from the bulk processors, so the OAP servers begin to scramble for bulk resources, and the ones that can't get them end up waiting on the lock. Suggest upgrading to 6.2.0.
The symptom looks like the Elasticsearch servers are hung up, but I updated SW to 6.2.0 and use the new feature of deleting whole indexes to remove data. I can't get any error info in my OAP server when it can't submit bulk requests. But it might be that my ES has poor performance, so it hangs on auto flush or data deletion.
6.2.0 doesn't delete data from the index anymore, it simply removes the whole indexes. Is your issue solved in 6.2.0? Or has something changed?
My English expression was ambiguous: this bug occurred in 6.1, and after I updated to 6.2.0 it still exists.
Then we still need to figure out why you are blocked while we and others are not.
@coki230 I have received at least two confirmations that the ElasticSearch hang-up issue is caused by ElasticSearch settings, ref to https://github.com/apache/skywalking/blob/master/docs/en/FAQ/ES-Server-FAQ.md#elasticsearch I don't have time to dig more, but from their feedback, this solved the issue.
OK, I will test the config in my env, thanks. So just ignore the PR; I will close it.
Please answer these questions before submitting pull request
Why submit this pull request?
Bug fix
New feature provided
Improve performance
Related issues
Bug fix
I lost my agent trace segment data, but my server is OK and can receive JVM metrics. When I debugged it, I found it didn't call the "batchPersistence" method, which bulk updates or saves the data.
The bug is caused by the auto flush together with the "batchPersistence" method, because ES needs to take the same lock in flush and in internalAdd (which is called by batchPersistence). So I deleted the auto flush, and the results so far are normal. Below is a brief of my thread dump; a simplified sketch of the lock contention follows it:
"DataCarrier.RECORD_PERSISTENT.BulkConsumePool.0.Thread" #24 daemon prio=5 os_prio=0 tid=0x000000002585a800 nid=0xc1c waiting for monitor entry [0x000000002715f000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:286)
- waiting to lock <0x00000006c2c52eb0> (a org.elasticsearch.action.bulk.BulkProcessor)
at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:271)
at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:267)
at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:253)
at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.lambda$batchPersistence$0(BatchProcessEsDAO.java:64)
at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO$$Lambda$285/523676665.accept(Unknown Source)
at java.lang.Iterable.forEach(Iterable.java:75)
at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.batchPersistence(BatchProcessEsDAO.java:62)
at org.apache.skywalking.oap.server.core.analysis.worker.PersistenceWorker.onWork(PersistenceWorker.java:51)
at org.apache.skywalking.oap.server.core.analysis.worker.RecordPersistentWorker$PersistentConsumer.consume(RecordPersistentWorker.java:103)
at org.apache.skywalking.apm.commons.datacarrier.consumer.MultipleChannelsConsumer.consume(MultipleChannelsConsumer.java:80)
at org.apache.skywalking.apm.commons.datacarrier.consumer.MultipleChannelsConsumer.run(MultipleChannelsConsumer.java:49)
"elasticsearch[scheduler][T#1]" #72 daemon prio=5 os_prio=0 tid=0x00000000202c7000 nid=0x850 waiting for monitor entry [0x000000002a85e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.elasticsearch.action.bulk.BulkProcessor$Flush.run(BulkProcessor.java:367)
- waiting to lock <0x00000006c2c52eb0> (a org.elasticsearch.action.bulk.BulkProcessor)
at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:182)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
at java.util.concurrent.FutureTask.run(FutureTask.java)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"pool-14-thread-1" #68 prio=5 os_prio=0 tid=0x00000000202c4000 nid=0xfa8 waiting on condition [0x0000000029b4e000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006c2c53160> (a java.util.concurrent.Semaphore$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
at org.elasticsearch.action.bulk.BulkRequestHandler.execute(BulkRequestHandler.java:60)
at org.elasticsearch.action.bulk.BulkProcessor.execute(BulkProcessor.java:339)
at org.elasticsearch.action.bulk.BulkProcessor.flush(BulkProcessor.java:358)
- locked <0x00000006c2c52eb0> (a org.elasticsearch.action.bulk.BulkProcessor)
at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.batchPersistence(BatchProcessEsDAO.java:72)
at org.apache.skywalking.oap.server.core.storage.PersistenceTimer.extractDataAndSave(PersistenceTimer.java:113)
at org.apache.skywalking.oap.server.core.storage.PersistenceTimer.lambda$start$0(PersistenceTimer.java:65)
at org.apache.skywalking.oap.server.core.storage.PersistenceTimer$$Lambda$201/296552796.run(Unknown Source)
at org.apache.skywalking.apm.util.RunnableWithExceptionProtection.run(RunnableWithExceptionProtection.java:36)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset$$$capture(FutureTask.java:308)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
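To make the pattern in the dump easier to read, here is a deliberately simplified, self-contained illustration of the contention. It is not Elasticsearch's BulkProcessor code; the class name, thread names, and the never-released semaphore are assumptions chosen only to mimic the dump, where flush() holds the processor's monitor while waiting and add() blocks on that same monitor.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BulkLockContentionSketch {
    private final Semaphore inFlightBulks = new Semaphore(0); // pretend the cluster never answers

    public synchronized void add(String doc) {
        System.out.println("added " + doc);
    }

    public synchronized void flush() throws InterruptedException {
        // Holds 'this' monitor while waiting for bulk capacity,
        // like BulkRequestHandler.execute in the dump above.
        inFlightBulks.acquire();
    }

    public static void main(String[] args) throws Exception {
        BulkLockContentionSketch processor = new BulkLockContentionSketch();

        Thread flusher = new Thread(() -> {
            try { processor.flush(); } catch (InterruptedException ignored) { }
        }, "PersistenceTimer-flush");
        Thread writer = new Thread(() -> processor.add("segment"), "RecordPersistentWorker-add");

        flusher.start();
        TimeUnit.MILLISECONDS.sleep(200); // let flush() grab the monitor first
        writer.start();
        TimeUnit.SECONDS.sleep(2);

        // The writer cannot enter add(): it is BLOCKED on the monitor held by flush().
        System.out.println("writer state: " + writer.getState()); // typically BLOCKED
        flusher.interrupt();                                      // release the monitor so the demo can exit
        writer.join();
    }
}
```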
How to fix?
Just disable the auto flush, because there is already a manual flush to commit the data.
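A minimal sketch of that approach, assuming the BulkProcessor is built without setFlushInterval. The class, field, and parameter names are illustrative, not the actual BatchProcessEsDAO code; only the add-then-flush shape matches the stack trace above.

```java
import java.util.List;

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.index.IndexRequest;

public class ManualFlushSketch {
    private final BulkProcessor bulkProcessor;

    public ManualFlushSketch(BulkProcessor bulkProcessor) {
        this.bulkProcessor = bulkProcessor; // built WITHOUT setFlushInterval(...)
    }

    public void batchPersistence(List<IndexRequest> prepareRequests) {
        if (prepareRequests == null || prepareRequests.isEmpty()) {
            return;
        }
        prepareRequests.forEach(bulkProcessor::add); // corresponds to BatchProcessEsDAO.java:64 in the dump
        bulkProcessor.flush();                       // the manual flush mentioned above (BatchProcessEsDAO.java:72)
    }
}
```

With a single caller driving both add() and flush(), nothing else is scheduled to compete for the processor's monitor, which is the effect this PR is after.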
New feature or improvement