
"One trace segment has been abandoned, cause by buffer is full" bug fix.#3048

Closed
coki230 wants to merge 4 commits into apache:master from coki230:master

Conversation

Contributor

@coki230 coki230 commented Jul 11, 2019

Please answer these questions before submitting the pull request

  • Why submit this pull request?

  • Bug fix

  • New feature provided

  • Improve performance

  • Related issues


Bug fix

  • Bug description.

I lost my agent trace segment data, but my server is OK: it can still receive JVM metrics. When I debugged it, I found it never called the batchPersistence method, which bulk-updates or saves the data.
The bug is caused by the auto flush together with batchPersistence: Elasticsearch needs to take the same lock for flush and for internalAdd (which is called by batchPersistence). So I removed the auto flush, and so far the results are normal. Below is a brief excerpt of my thread dump:

"DataCarrier.RECORD_PERSISTENT.BulkConsumePool.0.Thread" #24 daemon prio=5 os_prio=0 tid=0x000000002585a800 nid=0xc1c waiting for monitor entry [0x000000002715f000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:286)

  • waiting to lock <0x00000006c2c52eb0> (a org.elasticsearch.action.bulk.BulkProcessor)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:271)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:267)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:253)
    at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.lambda$batchPersistence$0(BatchProcessEsDAO.java:64)
    at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO$$Lambda$285/523676665.accept(Unknown Source)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.batchPersistence(BatchProcessEsDAO.java:62)
    at org.apache.skywalking.oap.server.core.analysis.worker.PersistenceWorker.onWork(PersistenceWorker.java:51)
    at org.apache.skywalking.oap.server.core.analysis.worker.RecordPersistentWorker$PersistentConsumer.consume(RecordPersistentWorker.java:103)
    at org.apache.skywalking.apm.commons.datacarrier.consumer.MultipleChannelsConsumer.consume(MultipleChannelsConsumer.java:80)
    at org.apache.skywalking.apm.commons.datacarrier.consumer.MultipleChannelsConsumer.run(MultipleChannelsConsumer.java:49)

"elasticsearch[scheduler][T#1]" #72 daemon prio=5 os_prio=0 tid=0x00000000202c7000 nid=0x850 waiting for monitor entry [0x000000002a85e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.elasticsearch.action.bulk.BulkProcessor$Flush.run(BulkProcessor.java:367)

  • waiting to lock <0x00000006c2c52eb0> (a org.elasticsearch.action.bulk.BulkProcessor)
    at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:182)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
    at java.util.concurrent.FutureTask.run(FutureTask.java)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"pool-14-thread-1" #68 prio=5 os_prio=0 tid=0x00000000202c4000 nid=0xfa8 waiting on condition [0x0000000029b4e000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)

  • parking to wait for <0x00000006c2c53160> (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
    at org.elasticsearch.action.bulk.BulkRequestHandler.execute(BulkRequestHandler.java:60)
    at org.elasticsearch.action.bulk.BulkProcessor.execute(BulkProcessor.java:339)
    at org.elasticsearch.action.bulk.BulkProcessor.flush(BulkProcessor.java:358)

  • locked <0x00000006c2c52eb0> (a org.elasticsearch.action.bulk.BulkProcessor)
    at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.batchPersistence(BatchProcessEsDAO.java:72)
    at org.apache.skywalking.oap.server.core.storage.PersistenceTimer.extractDataAndSave(PersistenceTimer.java:113)
    at org.apache.skywalking.oap.server.core.storage.PersistenceTimer.lambda$start$0(PersistenceTimer.java:65)
    at org.apache.skywalking.oap.server.core.storage.PersistenceTimer$$Lambda$201/296552796.run(Unknown Source)
    at org.apache.skywalking.apm.util.RunnableWithExceptionProtection.run(RunnableWithExceptionProtection.java:36)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset$$$capture(FutureTask.java:308)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
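
To make the dump easier to follow, here is a minimal Java sketch of the locking pattern the three threads show (illustrative class and field names, not the real Elasticsearch code):

```java
import java.util.concurrent.Semaphore;

// Minimal sketch of the locking shape shown in the dump above; these are
// NOT the real Elasticsearch classes, just the same pattern for illustration.
class MiniBulkProcessor {
    // Permits model in-flight bulk requests; a permit is returned only when
    // Elasticsearch answers a previously submitted bulk.
    private final Semaphore inFlight = new Semaphore(1);

    // Models internalAdd(): thread 1 in the dump is parked here, waiting for
    // the monitor that the manual flush() is holding.
    synchronized void add(Object doc) {
        // buffer the document ...
    }

    // Models flush(): thread 3 holds the monitor here and is parked on the
    // semaphore; the scheduled auto-flush (thread 2) is meanwhile waiting for
    // the same monitor. If Elasticsearch never answers the in-flight bulk,
    // no permit is released and everything stays blocked.
    synchronized void flush() throws InterruptedException {
        inFlight.acquire(); // blocks while still holding the monitor
        // send the bulk; the permit is released in the response callback
    }
}
```

Strictly speaking this is a lock-hold hang rather than a two-lock deadlock: while Elasticsearch is unresponsive nothing releases the semaphore permit, so the monitor is never released either.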

  • How to fix?
    Just disable the auto flush; the manual flush already commits the data (see the sketch after this list).
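
A sketch of what the proposed fix amounts to in the bulk-processor setup, assuming the builder fields quoted in the diff later in this thread (the client and listener wiring are elided and assumed to exist elsewhere):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

class BulkSetupSketch {
    // Sketch only: field names follow the diff quoted later in this thread.
    static BulkProcessor create(Client client, BulkProcessor.Listener listener,
                                int bulkActions, int bulkSize) {
        return BulkProcessor.builder(client, listener)
                .setBulkActions(bulkActions)                               // flush every N requests
                .setBulkSize(new ByteSizeValue(bulkSize, ByteSizeUnit.MB)) // or every N MB
                // .setFlushInterval(...) removed: per the discussion, the periodic
                // persistence path already calls bulkProcessor.flush() itself
                // (BatchProcessEsDAO line 72), so data is still committed.
                .build();
    }
}
```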


New feature or improvement

  • Describe the details and related test reports.

Merge the latest code into my local branch.
Fix the bug of being unable to report trace info.
.setBulkActions(bulkActions)
.setBulkSize(new ByteSizeValue(bulkSize, ByteSizeUnit.MB))
.setFlushInterval(TimeValue.timeValueSeconds(flushInterval))
// modified by zhaoze, if set auto flush, there will cause deadlock with bulk commit.
Member

You can't put a comment like this in the repo; remove it.

Member

Make the comment without your name, just the why.

Contributor (Author)

OK, I got it.

@wu-sheng wu-sheng requested a review from peng-yongsheng July 11, 2019 08:48
@wu-sheng
Member

@peng-yongsheng I am not familiar with this; please review.

@wu-sheng wu-sheng added the TBD To be decided later, need more discussion or input. label Jul 11, 2019
@wu-sheng
Member

Could you put the Elasticsearch discussion issue here? I think you mentioned it before.

@peng-yongsheng
Member

@coki230 I don't think removing the flush-interval setting can fix the problem you describe. These settings all have the same capability: setBulkActions, setBulkSize, setFlushInterval. Please take a look at the official documentation.

setBulkActions(10000)
We want to execute the bulk every 10 000 requests

setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
We want to flush the bulk every 5 MB

setFlushInterval(TimeValue.timeValueSeconds(5))
We want to flush the bulk every 5 seconds whatever the number of requests

If any of these conditions is reached, the bulk will be executed. The OAP servers form a cluster, and all of them may execute the bulk operation to insert or update metrics into Elasticsearch at any time. So it is reasonable to take locks, and lock contention is normal.
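
For reference, a minimal sketch of those three thresholds wired together, following the official example quoted above (the no-op listener is only for illustration):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

class ThresholdSketch {
    static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                    @Override public void beforeBulk(long id, BulkRequest req) { }
                    @Override public void afterBulk(long id, BulkRequest req, BulkResponse resp) { }
                    @Override public void afterBulk(long id, BulkRequest req, Throwable failure) { }
                })
                .setBulkActions(10000)                              // every 10 000 requests
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // or every 5 MB
                .setFlushInterval(TimeValue.timeValueSeconds(5))    // or every 5 seconds
                .build();                                           // whichever is reached first
    }
}
```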

@peng-yongsheng peng-yongsheng left a comment (Member)

disagree

@wu-sheng
Member

Please reply to the review. @peng-yongsheng leads this part; I think we need a full explanation if you think we should merge this.

@coki230
Contributor Author

coki230 commented Jul 11, 2019

@peng-yongsheng there is a manual flush in the code, in the BatchProcessEsDAO class at line 72, so in my opinion it doesn't need a flush interval.
The lock will stay held until the OAP server is restarted, because the auto flush first takes the BulkProcessor's add lock and then waits for another lock that never becomes available.

@wu-sheng
Member

@coki230 do you have proof of this deadlock, and why does it not always happen, only in limited-resource environments? Is this a bug or something else? All of this information is very important.

This change is important; we need to make sure.

@coki230
Contributor Author

coki230 commented Jul 12, 2019

@wu-sheng OK, I will take some time to work out how to reproduce it. In my environment it always happens, but I'm not sure about limited resources.

@peng-yongsheng
Member

peng-yongsheng commented Jul 12, 2019

> @peng-yongsheng there is a manual flush in the code, in the BatchProcessEsDAO class at line 72, so in my opinion it doesn't need a flush interval.
> The lock will stay held until the OAP server is restarted, because the auto flush first takes the BulkProcessor's add lock and then waits for another lock that never becomes available.

@coki230 I guess the real reason is that your Elasticsearch servers are hung up by the automatic data deletion. This was a design flaw before 6.2.0, and it has been fixed. When the Elasticsearch servers are very busy deleting history data, there are no idle resources left to process the data from the bulk processors, so the OAP servers begin to scramble for bulk resources, and the OAP servers that lose the scramble start blocking.

Suggest upgrading to 6.2.0.
There are also some settings for the Elasticsearch server that improve write performance; take a look at this document: ES-Server-FAQ.

@coki230
Contributor Author

coki230 commented Jul 15, 2019

> @peng-yongsheng there is a manual flush in the code, in the BatchProcessEsDAO class at line 72, so in my opinion it doesn't need a flush interval.
> The lock will stay held until the OAP server is restarted, because the auto flush first takes the BulkProcessor's add lock and then waits for another lock that never becomes available.

> @coki230 I guess the real reason is that your Elasticsearch servers are hung up by the automatic data deletion. This was a design flaw before 6.2.0, and it has been fixed. When the Elasticsearch servers are very busy deleting history data, there are no idle resources left to process the data from the bulk processors, so the OAP servers begin to scramble for bulk resources, and the OAP servers that lose the scramble start blocking.
>
> Suggest upgrading to 6.2.0.
> There are also some settings for the Elasticsearch server that improve write performance; take a look at this document: ES-Server-FAQ.

The symptoms make it look like the Elasticsearch servers are hung up, but I updated to SkyWalking 6.2.0 and use its new feature of deleting whole indexes to remove data, and I can't get any error info in my OAP server when it fails to submit bulk requests. It might be that my ES has poor performance, so it hangs on auto flush or data deletion.

@wu-sheng
Member

> The symptoms make it look like the Elasticsearch servers are hung up, but I updated to SkyWalking 6.2.0 and use its new feature of deleting whole indexes to remove data, and I can't get any error info in my OAP server when it fails to submit bulk requests. It might be that my ES has poor performance, so it hangs on auto flush or data deletion.

6.2.0 doesn't delete data from inside an index anymore; it simply removes the whole index. Is your issue solved in 6.2.0? Or has something changed?
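
For context, removing a whole index is a single cluster-metadata operation, far cheaper than deleting documents inside an index. A hedged sketch using the 6.x transport-client API (the index name is hypothetical):

```java
import org.elasticsearch.client.Client;

class DropIndexSketch {
    static void dropExpired(Client client) {
        // Hypothetical daily index name; dropping the index removes all its
        // documents at once instead of issuing per-document deletes.
        client.admin().indices().prepareDelete("segment-20190701").get();
    }
}
```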

@coki230
Contributor Author

coki230 commented Jul 15, 2019

> The symptoms make it look like the Elasticsearch servers are hung up, but I updated to SkyWalking 6.2.0 and use its new feature of deleting whole indexes to remove data, and I can't get any error info in my OAP server when it fails to submit bulk requests. It might be that my ES has poor performance, so it hangs on auto flush or data deletion.

> 6.2.0 doesn't delete data from inside an index anymore; it simply removes the whole index. Is your issue solved in 6.2.0? Or has something changed?

My English was ambiguous: this bug occurred on 6.1, and after I updated to 6.2.0 it still exists.

@wu-sheng
Member

> My English was ambiguous: this bug occurred on 6.1, and after I updated to 6.2.0 it still exists.

Then we still need to figure out why you are blocked while we and others are not.

@wu-sheng
Member

@coki230 I have received at least two confirmations that this ElasticSearch hang-up issue is caused by ElasticSearch settings; refer to https://github.com/apache/skywalking/blob/master/docs/en/FAQ/ES-Server-FAQ.md#elasticsearch

I don't have time to dig further, but from their feedback, this solved the issue.

@coki230
Contributor Author

coki230 commented Jul 19, 2019

> @coki230 I have received at least two confirmations that this ElasticSearch hang-up issue is caused by ElasticSearch settings; refer to https://github.com/apache/skywalking/blob/master/docs/en/FAQ/ES-Server-FAQ.md#elasticsearch
>
> I don't have time to dig further, but from their feedback, this solved the issue.

OK, I will test that config in my environment, thanks. So just ignore this PR; I will close it.

@coki230 coki230 closed this Jul 19, 2019

Labels

TBD To be decided later, need more discussion or input.
