Unhealthy partition in long running benchmark #8302
At first glance this has the same root cause as #7862. There is still the question of why it does not recover. From the logs we can see that it is stuck and unable to transition to follower. Further investigation is needed here.
Filtered log:
From looking at the logs and the heap dump (and thanks to the help of @deepthidevaki) we were able to establish the following course of events:
Haven't looked at the code, but thinking about your analysis, off the top of my head a closed actor makes sense to me as an explanation for the stuck scenario. The logAppender is opened when we open a logstream writer. If we close the streamProcessor we want to close the writer, and we can't do that since the actor is already in the closed state. So the solution would be to handle the already-closed actor, I think. But it feels like a general issue with how we currently do failure/error handling with actors and dependent actors. Ideally, if an actor fails and there exists a dependency chain, all dependent actors should be closed or should handle that failure as well (see the sketch below). This is a topic we have discussed several times already and is nothing new. I really would like to see an improvement in the current situation, or some progress in regards to replacing the actor scheduler.
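To illustrate the dependency-chain idea, here is a minimal sketch; `DependencyChain` and its shape are invented for illustration and are not the actual actor scheduler API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a registry that closes dependent resources in
// reverse registration order when the owning actor fails or closes.
final class DependencyChain implements AutoCloseable {
  private final Deque<AutoCloseable> dependents = new ArrayDeque<>();

  // Register a dependent; it is closed when this chain is closed.
  void register(final AutoCloseable dependent) {
    dependents.push(dependent);
  }

  @Override
  public void close() {
    // Close in reverse order so later resources never outlive earlier ones;
    // catch individual failures so one bad dependent cannot block the rest.
    while (!dependents.isEmpty()) {
      try {
        dependents.pop().close();
      } catch (final Exception e) {
        // a real scheduler would report this to a failure listener
        System.err.println("failed to close dependent: " + e);
      }
    }
  }
}
```

The key property is that a failure in one dependent is contained, so the chain always finishes closing.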
I would like to tackle this sooner rather than later, as in this case, for example, if partition 1 is unavailable for a long time then no deployments can occur, no timer start events can trigger, etc. You mentioned that "we ran out of time" - what do you mean? Did we exhaust all the leads/hypotheses we had?
No no, we were just cut off because I had to go to another meeting. And we decided to put this up for prioritization before digging deeper.
Looking at the code, it seems we already check whether the actor is closed and return the close future.
I had another look at it, but honestly without Deepthi I am just lost. I looked through the shutdown logic of the `StreamProcessor`. The issue is not the `StreamProcessor` itself; the actor that is stopped is the `LogStream` actor. This is also the only piece which looks a bit fishy, to be honest, because as part of the close of the `StreamProcessor` we call:

```java
private void tearDown() {
  processingContext.getLogStreamReader().close();
  logStream.removeRecordAvailableListener(this);
  replayStateMachine.close();
}
```

which in turn calls:

```java
@Override
public void removeRecordAvailableListener(final LogRecordAwaiter recordAwaiter) {
  actor.call(() -> recordAwaiters.remove(recordAwaiter));
}
```

So we try to schedule a task on an actor that is already closed. I simply don't know what the behavior is in this case. From the Java side none of that looks like blocking code. If anything, the task should not get executed. Anyway, I think I cannot contribute much here.
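To make the suspected behavior concrete, here is a toy model (my illustration, not the real `ActorControl`): scheduling a task on a closed actor neither blocks nor throws; the job is simply never enqueued, so anything waiting on its completion waits forever.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy model of an actor's task queue; not the real scheduler code.
final class MiniActor {
  private enum State { STARTED, CLOSED }

  private volatile State state = State.STARTED;
  private final Queue<Runnable> jobs = new ConcurrentLinkedQueue<>();

  void call(final Runnable job) {
    if (state == State.CLOSED) {
      // nothing blocks here: the job is silently dropped, so a future
      // tied to its execution would never complete
      return;
    }
    jobs.offer(job);
  }

  void close() {
    state = State.CLOSED;
    jobs.clear(); // pending jobs are discarded together with the actor
  }
}
```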
Hint for troubleshooting: in the heap dump, search for
To investigate further, do we need the release-1.3.0-alpha1 benchmark, or have we extracted everything from it by now? I know I already asked, but I just want to double-check before I delete it 🙂
To the best of my knowledge, we got everything we need. But if another engineer wants to take thread dumps or something like that, it might make sense to keep it running. We do have an excerpt of the log and a heap dump.
Makes sense. If the logAppender is closed, the dispatcher is not cleared, which means the distributor can't write to the dispatcher and retries forever.
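The retry-forever behavior can be sketched like this (all names below are made up for illustration; the real dispatcher API is different):

```java
// Sketch of the stuck retry loop: a closed dispatcher never hands out a
// fragment, so the writer keeps retrying without ever making progress.
final class StubDispatcher {
  private volatile boolean closed = true; // closed, as in the stuck partition

  // returns a negative value when no fragment could be claimed
  long claimFragment() {
    return closed ? -1 : 0;
  }
}

final class StubDistributor {
  void write(final StubDispatcher dispatcher) throws InterruptedException {
    // with a closed dispatcher this loop never terminates: the
    // "retries forever" observation above
    while (dispatcher.claimFragment() < 0) {
      Thread.sleep(10); // backoff between attempts
    }
  }
}
```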
Yes, exactly! I had the same finding 😄 Maybe an (unrelated) comment on the previous observation (mentioned above): this is actually not a problem, because the actor does not close immediately, i.e., the actor transitions from
@Zelldon, actually, the dispatcher is closed. Since the dispatcher is closed, it will do nothing:
Just as a side note: the LogStream is closed because the LogStorageAppender has "failed". Maybe it is not necessary to close the LogStream, but only to mark it as "unhealthy". The LogStorageAppender will be closed anyway, so no harm (data-consistency wise) will be done even if new records are written to the dispatcher.
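A minimal sketch of the mark-unhealthy idea (types and names here are hypothetical, not the actual health-monitoring API):

```java
// Instead of closing the LogStream when the appender fails, only flag it;
// the appender is closed separately, so writes stay consistent either way.
final class LogStreamHealth {
  enum Status { HEALTHY, UNHEALTHY }

  private volatile Status status = Status.HEALTHY;

  void onAppenderFailure() {
    status = Status.UNHEALTHY; // report instead of tearing the stream down
  }

  Status status() {
    return status;
  }
}
```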
To summarize, if I understand correctly, the issue is with the
I'm wondering if requesting an actor to close shouldn't short-circuit this. Is there any reason why we want to ignore the close request in favor of running until done? I'm also wondering if the

Let's say we do allow short-circuiting, what should the behavior be here? Right now, requesting close means the request is a job itself, and it's submitted at the end of the task queue for this actor. So if I had the following:

Then the job queue is:

Intuitively, I would expect that requesting the close discards all pending jobs (and lets the current one finish), and that it prevents an ongoing non-auto-completing job from being resubmitted. Jobs could still be enqueued in the closing/close phase, but

That said, my preference would be to just get rid of `runUntilDone` (see the sketch below).
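The proposed short-circuit semantics could look roughly like this (a sketch with my own naming, not the actual scheduler implementation):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch: once close is requested, pending jobs are discarded, new jobs
// are refused, and a non-auto-completing job can no longer resubmit itself.
final class ShortCircuitingActor {
  private final Queue<Runnable> queue = new ArrayDeque<>();
  private boolean closeRequested;

  void submit(final Runnable job) {
    if (!closeRequested) {
      queue.add(job);
    }
  }

  void requestClose() {
    closeRequested = true;
    queue.clear(); // discard pending jobs; the currently running one may finish
  }

  // called by a non-auto-completing job that wants another run
  void resubmit(final Runnable job) {
    if (!closeRequested) {
      queue.add(job);
    }
    // after a close request, resubmission is refused, which breaks the loop
  }
}
```

The key design choice is that `requestClose` clears the queue instead of appending itself at the end of it, so a job that never completes can no longer starve the close.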
👍 for getting rid of `runUntilDone`
Some notes from our discussion:
I think my notes may be undecipherable if you weren't there, so to sum up:
Hopefully this brings us closer to guaranteeing that closing always closes and we don't end up in loops.
9734: Complete start/close futures when an actor fails r=npepinpe a=npepinpe

## Description

When explicitly failing an actor, we want to ensure that anyone waiting on its start and/or close futures is notified that these will never complete. We complete the start future exceptionally on failure, and the close future successfully. This is because a caller may want to react differently if the actor did not start correctly, but most likely does not care if it did not close correctly. Hopefully this helps reduce the number of deadlocks where an actor is waiting on another actor to start/close and it never does.

## Related issues

related to #8302

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
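The idea of the fix boils down to the following sketch (illustrative only; the real change lives inside the actor implementation, not in a standalone class like this):

```java
import java.util.concurrent.CompletableFuture;

// Sketch: on failure, the start future completes exceptionally and the
// close future completes successfully, so no caller waits forever.
final class FailableActor {
  final CompletableFuture<Void> startFuture = new CompletableFuture<>();
  final CompletableFuture<Void> closeFuture = new CompletableFuture<>();

  void fail(final Throwable error) {
    // callers awaiting startup should learn it will never happen
    startFuture.completeExceptionally(error);
    // callers awaiting close usually only care that the actor is gone
    closeFuture.complete(null);
  }
}
```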
Not closed yet! Sorry about that.
9755: [Backports stable/8.0] Complete start/close futures when an actor fails r=deepthidevaki a=npepinpe

## Description

Backports #9734 to 8.0 in order to fix #8302 for this branch as well.

## Related issues

backports #9734

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>

9754: [Backports stable/1.3] Complete start/close futures when an actor fails r=deepthidevaki a=npepinpe

## Description

Backports #9734 to 1.3 in order to fix #8302 for this branch as well.

## Related issues

backports #9734

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
9850: Remove ActorControl#runUntilDone r=npepinpe a=npepinpe

## Description

This PR removes the dangerous `ActorControl#runUntilDone` and replaces its usage with simple re-scheduling via `run`, `submit`, or `runDelayed` where appropriate. At the same time, since we don't need it anymore, it also removes `ActorControl#done`. We keep `ActorControl#yieldThread` for jobs triggered by subscriptions/conditions (e.g. dispatcher subscriptions). This means only such jobs are not auto-completing anymore (but they never had to call done).

## Related issues

related to #8302

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
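The replacement pattern described in the PR can be sketched as follows (the `Control` interface is a stand-in for illustration; the real API is `ActorControl#run`/`#submit`/`#runDelayed`):

```java
import java.util.function.BooleanSupplier;

// Stand-in for the scheduling API used below.
interface Control {
  void submit(Runnable job); // enqueue a job at the end of the task queue
}

final class RetryingWriter {
  // Instead of spinning until done, each failed attempt re-submits itself,
  // so close requests and other jobs get a chance to run in between.
  static void writeWithRetry(final Control actor, final BooleanSupplier tryWrite) {
    actor.submit(() -> {
      if (!tryWrite.getAsBoolean()) {
        writeWithRetry(actor, tryWrite); // re-schedule instead of blocking
      }
    });
  }
}
```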
9875: [Backport stable/8.0] Remove ActorControl#runUntilDone r=npepinpe a=npepinpe

## Description

This PR backports #9850 to stable/8.0. There was one additional commit not in the PR which had to be backported/adapted, b052712. This is the last commit of the PR.

## Related issues

backports #9850
related to #8302

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
9876: [Backport stable/1.3] Remove ActorControl#runUntilDone r=npepinpe a=npepinpe

## Description

This PR backports #9850 to stable/1.3. There was one additional commit not in the PR which had to be backported/adapted, b052712. This is the last commit of the PR.

## Related issues

backports #9850
related to #8302

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Regarding the last point, I tried different approaches, but they all required changing existing behavior more than we'd like. I'd rather make progress on the holistic improvements we discussed before than improve the existing design, as mentioned in our team meeting. So I'll close this issue as done for now.
Describe the bug
Partition 1 is unhealthy.
Observed in a benchmark on non-preemptible nodes: http://35.189.240.202/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&var-DS_PROMETHEUS=Prometheus&var-namespace=release-1-3-alpha1&var-pod=All&var-partition=All
Environment: