Fix leaking store ref on peer recovery connection drop at shutdown #78067
base: main
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
I agree with the problem but I think the solution might be more annoying than what we have here, 🤞 I missed something :)
        }
    }
} else {
    onFailure(e);
I think it's impossible to get here right? Generic only ever rejects when shut down right?
        }
    }
}

@Override
public void onRejection(Exception e) {
The problem I see here is that we will often just not get this far when the pool shuts down with this thing scheduled to run at some later point in time, will we?
Maybe we do need some lifecycle on `PeerRecoveryTargetService` after all, then track the queued-up retries and fail those on stop/close like we have for the source service?
> The problem I see here is that we will often just not get this far when the pool shuts down with this thing scheduled to run at some later point in time, will we?
Yes, I think you're right 🤦🏻 I somehow managed to not see this obvious point when I created this PR.
> Maybe we do need some lifecycle on `PeerRecoveryTargetService` after all, then track the queued-up retries and fail those on stop/close like we have for the source service?
I also agree - I think this is the only way to ensure that resources are correctly terminated at shutdown. I'll take a look at this and will close this PR if I manage to implement what you suggested.
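The lifecycle approach suggested above could take roughly the following shape. This is only a minimal sketch with hypothetical names (`RetryTracker`, `track`, `failAndRelease` are not Elasticsearch APIs): the service keeps a registry of queued retries, and on stop/close it fails each one so its store ref is released instead of leaking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: track queued recovery retries so they can be
// failed when the service stops, releasing any held store refs.
class RetryTracker {
    private final List<Runnable> queued = new ArrayList<>();
    private boolean closed = false;

    // Register a retry's failure callback; returns false if the tracker is
    // already closed, in which case the caller must release its resources
    // itself instead of scheduling the retry.
    synchronized boolean track(Runnable failAndRelease) {
        if (closed) {
            return false;
        }
        queued.add(failAndRelease);
        return true;
    }

    // A retry that actually ran deregisters itself.
    synchronized void untrack(Runnable failAndRelease) {
        queued.remove(failAndRelease);
    }

    // On stop/close: fail every still-queued retry exactly once.
    synchronized void close() {
        closed = true;
        for (Runnable failAndRelease : queued) {
            failAndRelease.run();
        }
        queued.clear();
    }
}
```

The `closed` check in `track` matters: without it, a retry scheduled concurrently with shutdown could slip in after `close()` has already drained the list, recreating the leak.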
Hi @tlrx, I've created a changelog YAML for you.
In #77178 a searchable snapshot test failed because a file was still open when the data node started and tried to clean up that specific file. Investigation shows that the file was a cache file, the filesystem was `WindowFS`, and that the node tried to delete the file while repopulating the searchable snapshot persistent cache.

The cache file was open because a searchable snapshot shard was being verified for corruption (`index.shard.check_on_startup: true`) right before a full cluster restart. The index check is done during peer recovery: it increments a ref on the store and releases it once the check is done. The recovery process also increments a ref on the same store and releases it when the recovery is done or fails. But having the cache file open during the node restart is an indication that the `Store` and `SearchableSnapshotDirectory` might not have been fully released (a closed directory would have stopped the index check pretty quickly by throwing `AlreadyClosedException`).

Looking at the logs from the test we can see:
where we can see that the recovery source node `node_s1` is stopped, and that this is detected by the recovery target `node_s2`, which will retry the recovery. But the retry can't be executed as `node_s2` is being shut down, and since a change in #60586 the RecoveryRunner is no longer failed when the executor is shut down, leaking a store reference that is never released (the one incremented when the `RecoveryTarget` is instantiated).

This pull request allows the RecoveryRunner to be notified of the rejection of the recovery retry during shutdown and therefore release the store, without bringing back the annoying logs that were "muted" by #60586.
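The shape of the fix can be sketched as follows. Names here are illustrative, not the actual Elasticsearch classes (`StoreRef` and `RecoveryRetry` are stand-ins): the key point is that the runnable gets an `onRejection` hook that releases the store ref taken at creation, so a rejection from a shutting-down executor no longer leaks the reference.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for the store's ref counting: one ref is taken at creation.
class StoreRef {
    final AtomicInteger refCount = new AtomicInteger(1);

    void decRef() {
        refCount.decrementAndGet();
    }
}

class RecoveryRetry implements Runnable {
    private final StoreRef store;

    RecoveryRetry(StoreRef store) {
        this.store = store;
    }

    @Override
    public void run() {
        // ... perform the recovery retry, then release the ref ...
        store.decRef();
    }

    // Invoked when the executor rejects the task, e.g. because it is
    // shutting down: release the ref instead of leaking it.
    void onRejection(Exception e) {
        store.decRef();
    }

    void submitTo(ExecutorService executor) {
        try {
            executor.execute(this);
        } catch (RejectedExecutionException e) {
            onRejection(e);
        }
    }
}
```

As the discussion above notes, this only covers retries that are actually submitted and rejected; a retry scheduled for a later point in time may never reach the executor at all, which is why the lifecycle-based tracking was suggested instead.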
The test in #77178 uses the internal test cluster, whose nodes are all restarted in memory. I tried to find a way to wait for the index check to be processed/fully released during shutdown, but there is no easy way to do that for searchable snapshot shards. I expect the bug fix here to be enough to close #77178; if the test fails again with cache files still open, then we can change it to wait for the cache fetch thread pool to go idle.
Note to reviewers: this is somewhat similar to #77783. Sorry for the long description, I wanted to be sure to capture everything in case the test failed again.