Fix leaking store ref on peer recovery connection drop at shutdown #78067

Open · tlrx wants to merge 2 commits into main

Conversation

@tlrx tlrx (Member) commented Sep 21, 2021

In #77178 a searchable snapshot test failed because a file was still open when the data node started and tried to clean up that specific file. Investigation showed that the file was a cache file, that the filesystem was the WindowsFS test filesystem, and that the node tried to delete the file while repopulating the searchable snapshot persistent cache.

The cache file was open because a searchable snapshot shard was being verified for corruption (index.shard.check_on_startup: true) right before a full cluster restart. The index check runs during peer recovery: it increments a ref on the store and releases it once the check is done. The recovery process also increments a ref on the same store and releases it when the recovery completes or fails.
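As background, the ref-counting pattern described here looks roughly like the following minimal sketch (illustrative names only, not the actual org.elasticsearch.index.store.Store code): every operation that needs the store takes its own reference and releases it in a finally block, and the underlying directory is only closed once the last reference is dropped.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal, illustrative sketch of the incRef/decRef pattern; not the real Store class.
class RefCountedStore {
    private final AtomicInteger refCount = new AtomicInteger(1); // one ref held by the owning shard

    void incRef() {
        if (refCount.getAndIncrement() <= 0) {
            refCount.getAndDecrement();
            throw new IllegalStateException("store is already closed");
        }
    }

    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            closeInternal(); // last reference released: the directory can now be closed
        }
    }

    private void closeInternal() {
        // close the underlying directory (a SearchableSnapshotDirectory in this scenario)
    }

    // e.g. the index check: hold a ref only for the duration of the operation
    void checkIndex() {
        incRef();
        try {
            // ... verify segments for corruption ...
        } finally {
            decRef(); // released even if the check throws
        }
    }
}
```

If any caller forgets to release its reference, the count never reaches zero and the directory, along with the cache file it holds open, stays open across the restart.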

But having the cache file open during the node restart is an indication that the Store and SearchableSnapshotDirectory might not have been fully released (a closed directory would have stopped the index check pretty quickly by throwing an AlreadyClosedException).

Looking at the logs from the test we can see:

[2021-09-02T12:07:59,820][INFO ][o.e.t.InternalTestCluster] [testCreateAndRestorePartialSearchableSnapshot] Stopping and resetting node [node_s1] 
[2021-09-02T12:08:00,005][INFO ][o.e.n.Node               ] [testCreateAndRestorePartialSearchableSnapshot] closed
...
[2021-09-02T12:08:00,036][INFO ][o.e.t.InternalTestCluster] [testCreateAndRestorePartialSearchableSnapshot] Stopping and resetting node [node_s2] 
[2021-09-02T12:08:00,221][INFO ][o.e.n.Node               ] [testCreateAndRestorePartialSearchableSnapshot] closed
...
[2021-09-02T12:08:10,243][INFO ][o.e.i.r.PeerRecoveryTargetService] [node_s2] recovery of [xovxqrptfc][4] from [{node_s1}{OI5ThPFhSnqgtQNYGYJOlA}{urJj-osHSomzDony-dfbcg}{127.0.0.1}{127.0.0.1:35748}{cdfhilmrstw}{xpack.installed=true}] interrupted by network disconnect, will retry in [5s]; cause: [[node_s1][127.0.0.1:35748] Node not connected]
[2021-09-02T12:08:10,244][INFO ][o.e.i.r.PeerRecoveryTargetService] [node_s2] recovery of [xovxqrptfc][5] from [{node_s1}{OI5ThPFhSnqgtQNYGYJOlA}{urJj-osHSomzDony-dfbcg}{127.0.0.1}{127.0.0.1:35748}{cdfhilmrstw}{xpack.installed=true}] interrupted by network disconnect, will retry in [5s]; cause: [[node_s1][127.0.0.1:35748] Node not connected]
...
[2021-09-02T12:08:10,505][INFO ][o.e.t.InternalTestCluster] [testCreateAndRestorePartialSearchableSnapshot] recreating node [node_s2] 
...
org.elasticsearch.ElasticsearchException: java.io.IOException: access denied: /var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob+platform-support-unix/os/centos-7&&immutable/x-pack/plugin/searchable-snapshots/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests_56B5E0A9C949E1D2-001/tempDir-002/node_s2/indices/jvnmfpSxTDG5sGhI07FVcg/4/snapshot_cache/-XXiEyMOQrCuPfojK-0R0Q/eOf3tXJeShS38UStFPglOA

Here we can see that the recovery source node node_s1 is stopped, that the recovery target node_s2 detects this, and that it schedules a retry of the recovery. But the retry cannot be executed because node_s2 is itself shutting down, and since the change in #60586 the RecoveryRunner is no longer failed when the executor is shut down, leaking a store reference that is never released (the one incremented when the RecoveryTarget is instantiated).

This pull request allows the RecoveryRunner to be notified of the rejection of the recovery retry during shutdown and therefore to release the store, without reintroducing the noisy logs that were "muted" by #60586.
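A rough, self-contained sketch of that idea, loosely modeled on the onRejection() hook of Elasticsearch's AbstractRunnable (the class and field names below are invented for illustration and are not the actual RecoveryRunner code): the retry task releases the store reference both when it fails and when the executor rejects it at shutdown, rather than silently dropping the task.

```java
// Illustrative sketch only: a retry task that always gives back its store reference,
// even when the executor refuses to run it because the node is shutting down.
abstract class RejectionAwareRunnable implements Runnable {
    protected abstract void doRun() throws Exception;
    protected abstract void onFailure(Exception e);

    // invoked by the submitting code when the executor rejects the task
    public void onRejection(Exception e) {
        onFailure(e); // by default, treat a rejection like any other failure
    }

    @Override
    public final void run() {
        try {
            doRun();
        } catch (Exception e) {
            onFailure(e);
        }
    }
}

class RecoveryRetrySketch extends RejectionAwareRunnable {
    private final Runnable releaseStoreRef; // releases the ref taken when the RecoveryTarget was created

    RecoveryRetrySketch(Runnable releaseStoreRef) {
        this.releaseStoreRef = releaseStoreRef;
    }

    @Override
    protected void doRun() {
        // ... retry the peer recovery ...
    }

    @Override
    protected void onFailure(Exception e) {
        releaseStoreRef.run();
    }

    @Override
    public void onRejection(Exception e) {
        // node is shutting down: skip the noisy failure logging, but still release the store
        releaseStoreRef.run();
    }
}
```

The key point is that the rejection path, which #60586 made quiet, still has to perform the same cleanup as the failure path.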

The test in #77178 uses the internal test cluster, where nodes are all restarted in memory. I looked for a way to wait for the index check to be processed and fully released during shutdown, but there is no easy way to do that for searchable snapshot shards. I expect the bug fix here to be enough to close #77178; if the test fails again with cache files still open, we can change it to wait for the cache fetch thread pool to go idle.

Note to reviewers: this is somewhat similar to #77783. Sorry for the long description, I wanted to be sure to capture everything in case the test failed again.

@tlrx tlrx added the >bug, :Distributed/Recovery, v8.0.0, and v7.16.0 labels Sep 21, 2021
@elasticmachine elasticmachine added the Team:Distributed label Sep 21, 2021
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear original-brownbear (Member) left a comment:

I agree with the problem but I think the solution might be more annoying than what we have here, 🤞 I missed something :)

}
}
} else {
onFailure(e);
Member comment on the code above:

I think it's impossible to get here right? Generic only ever rejects when shut down right?

}
}
}

@Override
public void onRejection(Exception e) {
Member comment on the code above:

The problem I see here is that we will often just not get this far when the pool shuts down while this thing is scheduled to run at some later point in time, will we?

Maybe we do need some lifecycle on PeerRecoveryTargetService after all, then track the queued-up retries and fail those on stop/close like we have for the source service?
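(A rough sketch of what such lifecycle tracking could look like; the names below are invented for illustration and this is not the actual PeerRecoveryTargetService code. Queued retries are registered while pending and are failed, releasing their store refs, when the service stops.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the suggested lifecycle approach; invented names.
class RecoveryRetryTracker {
    private final Map<Long, Runnable> pendingRetries = new ConcurrentHashMap<>();
    private volatile boolean stopped = false;

    // register a queued retry; if the service is already stopped, fail it right away
    void trackRetry(long recoveryId, Runnable failAndReleaseStore) {
        pendingRetries.put(recoveryId, failAndReleaseStore);
        if (stopped) {
            failPending();
        }
    }

    void retryStarted(long recoveryId) {
        pendingRetries.remove(recoveryId); // the running retry now owns the cleanup
    }

    // called from the service's stop/close lifecycle hook
    void stop() {
        stopped = true;
        failPending();
    }

    private void failPending() {
        pendingRetries.forEach((id, failAndReleaseStore) -> {
            if (pendingRetries.remove(id, failAndReleaseStore)) {
                failAndReleaseStore.run(); // releases the store ref held by the queued retry
            }
        });
    }
}
```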

Member Author reply:


> The problem I see here is that we will often just not get this far when the pool shuts down while this thing is scheduled to run at some later point in time, will we?

Yes, I think you're right 🤦🏻 I somehow managed to miss this obvious point when I created this PR.

> Maybe we do need some lifecycle on PeerRecoveryTargetService after all, then track the queued-up retries and fail those on stop/close like we have for the source service?

I also agree: I think this is the only way to ensure that resources are correctly released at shutdown. I'll take a look at this and will close this PR if I manage to implement what you suggested.

@danhermann danhermann added v8.1.0 and removed v7.16.0 labels Oct 27, 2021
@arteam arteam removed the v8.0.0 label Jan 12, 2022
@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@elasticsearchmachine (Collaborator) commented:

Hi @tlrx, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:10
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@kingherc kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022
@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
Labels: >bug, :Distributed/Recovery, Team:Distributed, v8.15.0

Successfully merging this pull request may close these issues:

[CI] Failure in FrozenSearchableSnapshotsIntegTests.testCreateAndRestorePartialSearchableSnapshot