
"One trace segment has been abandoned, cause by buffer is full" bug fix.#3048

Closed
coki230 wants to merge 4 commits into apache:master from coki230:master

Conversation

Contributor

@coki230 coki230 commented Jul 11, 2019

Please answer these questions before submitting the pull request

  • Why submit this pull request?

  • Bug fix

  • New feature provided

  • Improve performance

  • Related issues


Bug fix

  • Bug description.

I lost my agent trace segment data, but my server is OK: it can still receive JVM metrics. When I debugged it, I found it never called the batchPersistence method, which bulk-updates or saves the data.
The bug is caused by the auto flush together with batchPersistence: Elasticsearch needs to take the same lock for flush and for internalAdd (which is called by batchPersistence). So I removed the auto flush, and so far the results are normal. Below is a brief excerpt of my thread dump:

"DataCarrier.RECORD_PERSISTENT.BulkConsumePool.0.Thread" #24 daemon prio=5 os_prio=0 tid=0x000000002585a800 nid=0xc1c waiting for monitor entry [0x000000002715f000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:286)

  • waiting to lock <0x00000006c2c52eb0> (a org.elasticsearch.action.bulk.BulkProcessor)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:271)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:267)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:253)
    at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.lambda$batchPersistence$0(BatchProcessEsDAO.java:64)
    at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO$$Lambda$285/523676665.accept(Unknown Source)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.batchPersistence(BatchProcessEsDAO.java:62)
    at org.apache.skywalking.oap.server.core.analysis.worker.PersistenceWorker.onWork(PersistenceWorker.java:51)
    at org.apache.skywalking.oap.server.core.analysis.worker.RecordPersistentWorker$PersistentConsumer.consume(RecordPersistentWorker.java:103)
    at org.apache.skywalking.apm.commons.datacarrier.consumer.MultipleChannelsConsumer.consume(MultipleChannelsConsumer.java:80)
    at org.apache.skywalking.apm.commons.datacarrier.consumer.MultipleChannelsConsumer.run(MultipleChannelsConsumer.java:49)

"elasticsearch[scheduler][T#1]" #72 daemon prio=5 os_prio=0 tid=0x00000000202c7000 nid=0x850 waiting for monitor entry [0x000000002a85e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.elasticsearch.action.bulk.BulkProcessor$Flush.run(BulkProcessor.java:367)

  • waiting to lock <0x00000006c2c52eb0> (a org.elasticsearch.action.bulk.BulkProcessor)
    at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:182)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
    at java.util.concurrent.FutureTask.run(FutureTask.java)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"pool-14-thread-1" #68 prio=5 os_prio=0 tid=0x00000000202c4000 nid=0xfa8 waiting on condition [0x0000000029b4e000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)

  • parking to wait for <0x00000006c2c53160> (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
    at org.elasticsearch.action.bulk.BulkRequestHandler.execute(BulkRequestHandler.java:60)
    at org.elasticsearch.action.bulk.BulkProcessor.execute(BulkProcessor.java:339)
    at org.elasticsearch.action.bulk.BulkProcessor.flush(BulkProcessor.java:358)

  • locked <0x00000006c2c52eb0> (a org.elasticsearch.action.bulk.BulkProcessor)
    at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.batchPersistence(BatchProcessEsDAO.java:72)
    at org.apache.skywalking.oap.server.core.storage.PersistenceTimer.extractDataAndSave(PersistenceTimer.java:113)
    at org.apache.skywalking.oap.server.core.storage.PersistenceTimer.lambda$start$0(PersistenceTimer.java:65)
    at org.apache.skywalking.oap.server.core.storage.PersistenceTimer$$Lambda$201/296552796.run(Unknown Source)
    at org.apache.skywalking.apm.util.RunnableWithExceptionProtection.run(RunnableWithExceptionProtection.java:36)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset$$$capture(FutureTask.java:308)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
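
To make the dump easier to follow, here is a minimal Java sketch of the locking pattern the three threads show (illustrative class and field names, not the real Elasticsearch code):

```java
import java.util.concurrent.Semaphore;

// Minimal sketch of the locking shape shown in the dump above; these are
// NOT the real Elasticsearch classes, just the same pattern for illustration.
class MiniBulkProcessor {
    // Permits model in-flight bulk requests; a permit is returned only when
    // Elasticsearch answers a previously submitted bulk.
    private final Semaphore inFlight = new Semaphore(1);

    // Models internalAdd(): thread 1 in the dump is parked here, waiting for
    // the monitor that the manual flush() is holding.
    synchronized void add(Object doc) {
        // buffer the document ...
    }

    // Models flush(): thread 3 holds the monitor here and is parked on the
    // semaphore; the scheduled auto-flush (thread 2) is meanwhile waiting for
    // the same monitor. If Elasticsearch never answers the in-flight bulk,
    // no permit is released and everything stays blocked.
    synchronized void flush() throws InterruptedException {
        inFlight.acquire(); // blocks while still holding the monitor
        // send the bulk; the permit is released in the response callback
    }
}
```

Strictly speaking this is a lock-hold hang rather than a two-lock deadlock: while Elasticsearch is unresponsive nothing releases the semaphore permit, so the monitor is never released either.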

  • How to fix?
    Just disable the auto flush; the manual flush already commits the data (see the sketch after this list).
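
A sketch of what the proposed fix amounts to in the bulk-processor setup, assuming the builder fields quoted in the diff later in this thread (the client and listener wiring are elided and assumed to exist elsewhere):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

class BulkSetupSketch {
    // Sketch only: field names follow the diff quoted later in this thread.
    static BulkProcessor create(Client client, BulkProcessor.Listener listener,
                                int bulkActions, int bulkSize) {
        return BulkProcessor.builder(client, listener)
                .setBulkActions(bulkActions)                               // flush every N requests
                .setBulkSize(new ByteSizeValue(bulkSize, ByteSizeUnit.MB)) // or every N MB
                // .setFlushInterval(...) removed: per the discussion, the periodic
                // persistence path already calls bulkProcessor.flush() itself
                // (BatchProcessEsDAO line 72), so data is still committed.
                .build();
    }
}
```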


New feature or improvement

  • Describe the details and related test reports.

Merge the latest code into my local branch.
Fix the bug of being unable to report trace info.
.setBulkActions(bulkActions)
.setBulkSize(new ByteSizeValue(bulkSize, ByteSizeUnit.MB))
.setFlushInterval(TimeValue.timeValueSeconds(flushInterval))
// modified by zhaoze, if set auto flush, there will cause deadlock with bulk commit.
Member

You can't put a comment like this in the repo; remove it.

Member

Make the comment without your name, just the why.

Contributor (Author)

OK, I got it.

@wu-sheng wu-sheng requested a review from peng-yongsheng July 11, 2019 08:48
@wu-sheng
Member

@peng-yongsheng I am not familiar with this; please review.

@wu-sheng wu-sheng added the TBD To be decided later, need more discussion or input. label Jul 11, 2019
@wu-sheng
Member

Could you put the Elasticsearch discussion issue here? I think you mentioned it before.

@peng-yongsheng
Member

@coki230 I don't think removing the flush-interval setting can fix the problem you describe. These settings all have the same capability: setBulkActions, setBulkSize, setFlushInterval. Please take a look at the official documentation.

setBulkActions(10000)
We want to execute the bulk every 10 000 requests

setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
We want to flush the bulk every 5 MB

setFlushInterval(TimeValue.timeValueSeconds(5))
We want to flush the bulk every 5 seconds whatever the number of requests

If any of these conditions is reached, the bulk will be executed. The OAP servers form a cluster, and all of them may execute the bulk operation to insert or update metrics into Elasticsearch at any time. So it is reasonable to take locks, and lock contention is normal.
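
For reference, a minimal sketch of those three thresholds wired together, following the official example quoted above (the no-op listener is only for illustration):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

class ThresholdSketch {
    static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                    @Override public void beforeBulk(long id, BulkRequest req) { }
                    @Override public void afterBulk(long id, BulkRequest req, BulkResponse resp) { }
                    @Override public void afterBulk(long id, BulkRequest req, Throwable failure) { }
                })
                .setBulkActions(10000)                              // every 10 000 requests
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // or every 5 MB
                .setFlushInterval(TimeValue.timeValueSeconds(5))    // or every 5 seconds
                .build();                                           // whichever is reached first
    }
}
```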

@peng-yongsheng peng-yongsheng left a comment (Member)

disagree

@wu-sheng
Member

Please reply to the review. @peng-yongsheng leads this part; I think we need a full explanation if you think we should merge this.

@coki230
Contributor Author

coki230 commented Jul 11, 2019

@peng-yongsheng there is a manual flush in the code, in the BatchProcessEsDAO class at line 72, so in my opinion it doesn't need a flush interval.
The lock will stay held until the OAP server is restarted, because the auto flush first takes the BulkProcessor's add lock and then waits for another lock that never becomes available.

@wu-sheng
Member

@coki230 do you have proof of this deadlock, and why does it not always happen, only in limited-resource environments? Is this a bug or something else? All of this information is very important.

This change is important; we need to make sure.

@coki230
Contributor Author

coki230 commented Jul 12, 2019

@wu-sheng OK, I will take some time to work out how to reproduce it. In my environment it always happens, but I'm not sure about limited resources.

@peng-yongsheng
Member

peng-yongsheng commented Jul 12, 2019

> @peng-yongsheng there is a manual flush in the code, in the BatchProcessEsDAO class at line 72, so in my opinion it doesn't need a flush interval.
> The lock will stay held until the OAP server is restarted, because the auto flush first takes the BulkProcessor's add lock and then waits for another lock that never becomes available.

@coki230 I guess the real reason is that your Elasticsearch servers are hung up by the automatic data deletion. This was a design flaw before 6.2.0, and it has been fixed. When the Elasticsearch servers are very busy deleting history data, there are no idle resources left to process the data from the bulk processors, so the OAP servers begin to scramble for bulk resources, and the OAP servers that lose the scramble start blocking.

Suggest upgrading to 6.2.0.
There are also some settings for the Elasticsearch server that improve write performance; take a look at this document: ES-Server-FAQ.

@coki230
Contributor Author

coki230 commented Jul 15, 2019

> @peng-yongsheng there is a manual flush in the code, in the BatchProcessEsDAO class at line 72, so in my opinion it doesn't need a flush interval.
> The lock will stay held until the OAP server is restarted, because the auto flush first takes the BulkProcessor's add lock and then waits for another lock that never becomes available.

> @coki230 I guess the real reason is that your Elasticsearch servers are hung up by the automatic data deletion. This was a design flaw before 6.2.0, and it has been fixed. When the Elasticsearch servers are very busy deleting history data, there are no idle resources left to process the data from the bulk processors, so the OAP servers begin to scramble for bulk resources, and the OAP servers that lose the scramble start blocking.
>
> Suggest upgrading to 6.2.0.
> There are also some settings for the Elasticsearch server that improve write performance; take a look at this document: ES-Server-FAQ.

The symptoms make it look like the Elasticsearch servers are hung up, but I updated to SkyWalking 6.2.0 and use its new feature of deleting whole indexes to remove data, and I can't get any error info in my OAP server when it fails to submit bulk requests. It might be that my ES has poor performance, so it hangs on auto flush or data deletion.

@wu-sheng
Member

> The symptoms make it look like the Elasticsearch servers are hung up, but I updated to SkyWalking 6.2.0 and use its new feature of deleting whole indexes to remove data, and I can't get any error info in my OAP server when it fails to submit bulk requests. It might be that my ES has poor performance, so it hangs on auto flush or data deletion.

6.2.0 doesn't delete data from inside an index anymore; it simply removes the whole index. Is your issue solved in 6.2.0? Or has something changed?
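
For context, removing a whole index is a single cluster-metadata operation, far cheaper than deleting documents inside an index. A hedged sketch using the 6.x transport-client API (the index name is hypothetical):

```java
import org.elasticsearch.client.Client;

class DropIndexSketch {
    static void dropExpired(Client client) {
        // Hypothetical daily index name; dropping the index removes all its
        // documents at once instead of issuing per-document deletes.
        client.admin().indices().prepareDelete("segment-20190701").get();
    }
}
```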

@coki230
Contributor Author

coki230 commented Jul 15, 2019

> The symptoms make it look like the Elasticsearch servers are hung up, but I updated to SkyWalking 6.2.0 and use its new feature of deleting whole indexes to remove data, and I can't get any error info in my OAP server when it fails to submit bulk requests. It might be that my ES has poor performance, so it hangs on auto flush or data deletion.

> 6.2.0 doesn't delete data from inside an index anymore; it simply removes the whole index. Is your issue solved in 6.2.0? Or has something changed?

My English was ambiguous: this bug occurred on 6.1, and after I updated to 6.2.0 it still exists.

@wu-sheng
Member

> My English was ambiguous: this bug occurred on 6.1, and after I updated to 6.2.0 it still exists.

Then we still need to figure out why you are blocked while we and others are not.

@wu-sheng
Member

@coki230 I have received at least two confirmations that this ElasticSearch hang-up issue is caused by ElasticSearch settings; refer to https://github.com/apache/skywalking/blob/master/docs/en/FAQ/ES-Server-FAQ.md#elasticsearch

I don't have time to dig further, but from their feedback, this solved the issue.

@coki230
Contributor Author

coki230 commented Jul 19, 2019

> @coki230 I have received at least two confirmations that this ElasticSearch hang-up issue is caused by ElasticSearch settings; refer to https://github.com/apache/skywalking/blob/master/docs/en/FAQ/ES-Server-FAQ.md#elasticsearch
>
> I don't have time to dig further, but from their feedback, this solved the issue.

OK, I will test that config in my environment, thanks. So just ignore this PR; I will close it.

@coki230 coki230 closed this Jul 19, 2019

Labels

TBD To be decided later, need more discussion or input.
