
[broker] Cursor status has always been SwitchingLedger and pendingMarkDeleteOps has accumulated tens of thousands of requests #16859

Closed
poorbarcode opened this issue Jul 29, 2022 · 20 comments
Labels
Stale, type/bug

Comments

@poorbarcode (Contributor) commented Jul 29, 2022

Describe the bug
pendingMarkDeleteOps has accumulated tens of thousands of requests and the cursor state is stuck in SwitchingLedger. As a result, retention cannot run and disk space cannot be reclaimed.
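To make the accumulation easier to follow, here is a minimal sketch of the buffering pattern involved (a simplified illustration with hypothetical names, not the actual ManagedCursorImpl code): while the cursor ledger is being switched, new mark-delete requests are only queued and are flushed after the switch completes, so a cursor that never leaves SwitchingLedger keeps growing its queue.

import java.util.ArrayDeque;
import java.util.Queue;

// Simplified illustration with hypothetical names; this is not the real
// org.apache.bookkeeper.mledger.impl.ManagedCursorImpl code.
class CursorSketch {
    enum State { Open, SwitchingLedger }

    private State state = State.Open;
    // Requests buffered while the cursor ledger is being switched.
    private final Queue<Long> pendingMarkDeleteOps = new ArrayDeque<>();

    synchronized void asyncMarkDelete(long position) {
        if (state == State.SwitchingLedger) {
            // The cursor ledger is being rolled over, so the request is only
            // buffered. If the switch never completes, this queue keeps growing
            // and the mark-delete position (and therefore retention) never advances.
            pendingMarkDeleteOps.add(position);
            return;
        }
        persistMarkDelete(position);
    }

    synchronized void startLedgerSwitch() {
        state = State.SwitchingLedger;
    }

    synchronized void ledgerSwitchCompleted() {
        state = State.Open;
        // Flush everything that was buffered during the switch.
        while (!pendingMarkDeleteOps.isEmpty()) {
            persistMarkDelete(pendingMarkDeleteOps.poll());
        }
    }

    private void persistMarkDelete(long position) {
        // Writing the mark-delete position to the cursor ledger is omitted here.
        System.out.println("persisted mark-delete at " + position);
    }
}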

Screenshots
(screenshot attached)

poorbarcode added the type/bug label on Jul 29, 2022
@poorbarcode (Contributor, Author) commented Jul 29, 2022

Hi @keyboardbobo

Could you please tell us which version you are using? If possible, please upload the dump file (note: it may contain sensitive company data, so be careful about what you share).

Could you also show the ManagedCursorMXBean properties of this cursor?
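(For reference, the cursor state can also be inspected through the topic's internal stats; a minimal sketch, assuming a hypothetical admin endpoint at http://localhost:8080 and a hypothetical topic name:)

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistentTopicInternalStats;

public class CursorStateCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical admin URL and topic name; adjust to your environment.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            PersistentTopicInternalStats stats =
                    admin.topics().getInternalStats("persistent://public/default/my-topic");
            // Each cursor reports its state (for example Open or SwitchingLedger)
            // and the timestamp of its last cursor-ledger switch.
            stats.cursors.forEach((name, cursor) ->
                    System.out.printf("%s state=%s lastLedgerSwitchTimestamp=%s%n",
                            name, cursor.state, cursor.lastLedgerSwitchTimestamp));
        }
    }
}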

@keyboardbobo commented Jul 29, 2022

@poorbarcode My version is 2.9.2. The dump file is very large, and it is difficult to upload it from our intranet for security reasons. The ManagedCursorMXBean values are as follows:

persistLedgeSucceed:cellsBusy=0,base=63308,cells:null
persistLedgeFailed:cellsBusy=0,base=1,cells:null
persistZookeeperSucceed:cellsBusy=0,base=1,cells:null
persistZookeeperFailed:cellsBusy=0,base=0,cells:null
writeCursorLedgerSize:cellsBusy=0,base=3437666,cells:null
writeCursorLedgerLogicalSize:cellsBusy=0,base=1718833,cells:null
readCursorLedgerSize:cellsBusy=0,base=9,cells:null

@poorbarcode (Contributor, Author) commented:

Hi @keyboardbobo

I now suspect that something is wrong with pendingMarkDeletedSubmittedCount. Could you show the value of ManagedCursorImpl.pendingMarkDeletedSubmittedCount?

@keyboardbobo commented:

> I now suspect that something is wrong with pendingMarkDeletedSubmittedCount. Could you show the value of ManagedCursorImpl.pendingMarkDeletedSubmittedCount?

pendingMarkDeletedSubmittedCount = 0

@poorbarcode (Contributor, Author) commented:

Hi @keyboardbobo

> pendingMarkDeletedSubmittedCount = 0

Thanks, then the problem is not with pendingMarkDeletedSubmittedCount. Are there any other error logs?

@keyboardbobo commented:

@poorbarcode
At that time we were simulating an abnormal scenario with high network latency between the broker and the bookies, and there were many errors similar to the following:

Write of ledger entry to quorum failed
NotEnoughBookiesException: Not enough non-faulty bookies available
UpdateLoop(ledgerId=2780774,loopId=54e667fe) Exception updating

After that, disk usage kept increasing and some partitions could not be reclaimed.

@poorbarcode (Contributor, Author) commented:

Hi @keyboardbobo

> there were many errors similar to the following:

Could you show more details of the error log?

@keyboardbobo commented:

@poorbarcode
(screenshot of the error log attached)

@keyboardbobo commented Aug 1, 2022

@poorbarcode I found that there are thousands of pending tasks in the "BookKeeperClientWorker-OrderedExecutor-59-%d" thread group, and I don't know what caused it. SafeRunnable should catch any exception, so I don't understand why this happens.
(screenshot of the thread-pool backlog attached)

@poorbarcode (Contributor, Author) commented Aug 1, 2022

Hi @keyboardbobo

> I found that there are thousands of pending tasks in the "BookKeeperClientWorker-OrderedExecutor-59-%d" thread group, and I don't know what caused it. SafeRunnable should catch any exception, so I don't understand why this happens.

This OrderedExecutor is used in the following three scenarios (a sketch of the ordering behaviour follows below):

  • Write Bookie
  • Read Bookie
  • Send messages to the client

@hangc0276 Is it normal to have so many tasks in the queue?
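To illustrate why only one queue can back up while the others stay at zero, here is a minimal sketch of the ordered-routing idea behind these executors (a simplified illustration, not the actual org.apache.bookkeeper.common.util.OrderedExecutor implementation): every task carries an ordering key such as the ledger id, and all tasks with the same key are routed to the same single-threaded queue.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified illustration of ordered routing; this is not the actual
// org.apache.bookkeeper.common.util.OrderedExecutor implementation.
class OrderedExecutorSketch {
    private final ExecutorService[] threads;

    OrderedExecutorSketch(int numThreads) {
        threads = new ExecutorService[numThreads];
        for (int i = 0; i < numThreads; i++) {
            threads[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Every task with the same ordering key (for example a ledger id) lands on
    // the same single-threaded queue, which preserves per-ledger ordering. If
    // writes for that ledger stall, only that one queue accumulates a backlog
    // while the other queues stay empty.
    void executeOrdered(long orderingKey, Runnable task) {
        int idx = (int) Math.floorMod(orderingKey, (long) threads.length);
        threads[idx].execute(task);
    }

    void shutdown() {
        for (ExecutorService e : threads) {
            e.shutdown();
        }
    }
}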

@keyboardbobo commented:

@poorbarcode
I took a look at the other thread groups; they are all at 0. Only this one has a backlog of more than 7,000 tasks.

@keyboardbobo commented Aug 1, 2022

@poorbarcode The stacks of the three threads in this thread group are as follows:
(screenshot of the thread stacks attached)

@poorbarcode (Contributor, Author) commented:

Hi @keyboardbobo

> The stacks of the three threads in this thread group are as follows:

It seems that one topic is busy while the others are not, but this does not seem to be related to the cursor state being stuck in SwitchingLedger; or maybe I do not understand the problem well enough to see the connection.

@keyboardbobo commented Aug 1, 2022

> It seems that one topic is busy while the others are not, but this does not seem to be related to the cursor state being stuck in SwitchingLedger; or maybe I do not understand the problem well enough to see the connection.

@poorbarcode I suspect that while the cursor is in SwitchingLedger, some tasks are blocked in the queue, which makes it impossible for them to complete.

@poorbarcode (Contributor, Author) commented Aug 2, 2022

Hi @keyboardbobo

Could you provide the BK configuration?

  • Ensemble size
  • Write quorum size
  • Ack quorum size

E.g. with Ensemble size = 3 and Write quorum size = 3, when any bookie is down, the first write request will time out and the subsequent requests will back up.
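(For readers less familiar with these settings, here is a minimal sketch of how the three values map onto BookKeeper's client API, assuming a hypothetical metadata endpoint at zk1:2181; the broker creates its managed-ledger ledgers with the configured defaults in a similar way:)

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

public class QuorumExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper connect string; adjust to your environment.
        BookKeeper bk = new BookKeeper("zk1:2181");

        // Ensemble = 3, Write quorum = 3, Ack quorum = 2: every entry is sent to
        // all 3 bookies of the ensemble, and addEntry() completes as soon as any
        // 2 of them acknowledge. With Ack quorum = 3 instead, a single slow or
        // dead bookie would stall every add until a timeout or ensemble change.
        LedgerHandle lh = bk.createLedger(3, 3, 2, DigestType.CRC32, "pwd".getBytes());
        lh.addEntry("hello".getBytes());

        lh.close();
        bk.close();
    }
}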

@keyboardbobo commented Aug 2, 2022

@poorbarcode

managedLedgerDefaultEnsembleSize=3
managedLedgerDefaultWriteQuorum=3
managedLedgerDefaultAckQuorum=2

If a bookie is not working, shouldn't all the thread pools be backlogged, not just that one?

@github-actions bot commented Sep 2, 2022

The issue had no activity for 30 days, mark with Stale label.

@keyboardbobo commented:

@poorbarcode There is a new development on this issue, but I'm not sure whether it is the same problem: #17967

github-actions bot removed the Stale label on Oct 9, 2022
@github-actions bot commented Nov 9, 2022

The issue had no activity for 30 days, mark with Stale label.

github-actions bot added the Stale label on Nov 9, 2022
@poorbarcode (Contributor, Author) commented:

This issue might be fixed by #17971
