Fix leaking store ref on peer recovery connection drop at shutdown #78067

Open · tlrx wants to merge 2 commits into main

Conversation

@tlrx tlrx (Member) commented Sep 21, 2021

In #77178 a searchable snapshot test failed because a file was still open when the data node started and tried to clean up that specific file. Investigation showed that the file was a cache file, that the filesystem was the WindowsFS test filesystem, and that the node tried to delete the file while repopulating the searchable snapshot persistent cache.

The cache file was open because a searchable snapshot shard was being verified for corruption (index.shard.check_on_startup: true) right before a full cluster restart. The index check runs during peer recovery: it increments a ref on the store and releases it once the check is done. The recovery process also increments a ref on the same store and releases it when the recovery completes or fails.
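As background, the ref-counting pattern described here looks roughly like the following minimal sketch (illustrative names only, not the actual org.elasticsearch.index.store.Store code): every operation that needs the store takes its own reference and releases it in a finally block, and the underlying directory is only closed once the last reference is dropped.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal, illustrative sketch of the incRef/decRef pattern; not the real Store class.
class RefCountedStore {
    private final AtomicInteger refCount = new AtomicInteger(1); // one ref held by the owning shard

    void incRef() {
        if (refCount.getAndIncrement() <= 0) {
            refCount.getAndDecrement();
            throw new IllegalStateException("store is already closed");
        }
    }

    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            closeInternal(); // last reference released: the directory can now be closed
        }
    }

    private void closeInternal() {
        // close the underlying directory (a SearchableSnapshotDirectory in this scenario)
    }

    // e.g. the index check: hold a ref only for the duration of the operation
    void checkIndex() {
        incRef();
        try {
            // ... verify segments for corruption ...
        } finally {
            decRef(); // released even if the check throws
        }
    }
}
```

If any caller forgets to release its reference, the count never reaches zero and the directory, along with the cache file it holds open, stays open across the restart.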

But having the cache file open during the node restart is an indication that the Store and SearchableSnapshotDirectory might not have been fully released (a closed directory would have stopped the index check pretty quickly by throwing an AlreadyClosedException).

Looking at the logs from the test we can see:

[2021-09-02T12:07:59,820][INFO ][o.e.t.InternalTestCluster] [testCreateAndRestorePartialSearchableSnapshot] Stopping and resetting node [node_s1] 
[2021-09-02T12:08:00,005][INFO ][o.e.n.Node               ] [testCreateAndRestorePartialSearchableSnapshot] closed
...
[2021-09-02T12:08:00,036][INFO ][o.e.t.InternalTestCluster] [testCreateAndRestorePartialSearchableSnapshot] Stopping and resetting node [node_s2] 
[2021-09-02T12:08:00,221][INFO ][o.e.n.Node               ] [testCreateAndRestorePartialSearchableSnapshot] closed
...
[2021-09-02T12:08:10,243][INFO ][o.e.i.r.PeerRecoveryTargetService] [node_s2] recovery of [xovxqrptfc][4] from [{node_s1}{OI5ThPFhSnqgtQNYGYJOlA}{urJj-osHSomzDony-dfbcg}{127.0.0.1}{127.0.0.1:35748}{cdfhilmrstw}{xpack.installed=true}] interrupted by network disconnect, will retry in [5s]; cause: [[node_s1][127.0.0.1:35748] Node not connected]
[2021-09-02T12:08:10,244][INFO ][o.e.i.r.PeerRecoveryTargetService] [node_s2] recovery of [xovxqrptfc][5] from [{node_s1}{OI5ThPFhSnqgtQNYGYJOlA}{urJj-osHSomzDony-dfbcg}{127.0.0.1}{127.0.0.1:35748}{cdfhilmrstw}{xpack.installed=true}] interrupted by network disconnect, will retry in [5s]; cause: [[node_s1][127.0.0.1:35748] Node not connected]
...
[2021-09-02T12:08:10,505][INFO ][o.e.t.InternalTestCluster] [testCreateAndRestorePartialSearchableSnapshot] recreating node [node_s2] 
...
org.elasticsearch.ElasticsearchException: java.io.IOException: access denied: /var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob+platform-support-unix/os/centos-7&&immutable/x-pack/plugin/searchable-snapshots/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests_56B5E0A9C949E1D2-001/tempDir-002/node_s2/indices/jvnmfpSxTDG5sGhI07FVcg/4/snapshot_cache/-XXiEyMOQrCuPfojK-0R0Q/eOf3tXJeShS38UStFPglOA

Here we can see that the recovery source node node_s1 is stopped, that the recovery target node_s2 detects this, and that it schedules a retry of the recovery. But the retry cannot be executed because node_s2 is itself shutting down, and since the change in #60586 the RecoveryRunner is no longer failed when the executor is shut down, leaking a store reference that is never released (the one incremented when the RecoveryTarget is instantiated).

This pull request allows the RecoveryRunner to be notified of the rejection of the recovery retry during shutdown and therefore to release the store, without reintroducing the noisy logs that were "muted" by #60586.
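A rough, self-contained sketch of that idea, loosely modeled on the onRejection() hook of Elasticsearch's AbstractRunnable (the class and field names below are invented for illustration and are not the actual RecoveryRunner code): the retry task releases the store reference both when it fails and when the executor rejects it at shutdown, rather than silently dropping the task.

```java
// Illustrative sketch only: a retry task that always gives back its store reference,
// even when the executor refuses to run it because the node is shutting down.
abstract class RejectionAwareRunnable implements Runnable {
    protected abstract void doRun() throws Exception;
    protected abstract void onFailure(Exception e);

    // invoked by the submitting code when the executor rejects the task
    public void onRejection(Exception e) {
        onFailure(e); // by default, treat a rejection like any other failure
    }

    @Override
    public final void run() {
        try {
            doRun();
        } catch (Exception e) {
            onFailure(e);
        }
    }
}

class RecoveryRetrySketch extends RejectionAwareRunnable {
    private final Runnable releaseStoreRef; // releases the ref taken when the RecoveryTarget was created

    RecoveryRetrySketch(Runnable releaseStoreRef) {
        this.releaseStoreRef = releaseStoreRef;
    }

    @Override
    protected void doRun() {
        // ... retry the peer recovery ...
    }

    @Override
    protected void onFailure(Exception e) {
        releaseStoreRef.run();
    }

    @Override
    public void onRejection(Exception e) {
        // node is shutting down: skip the noisy failure logging, but still release the store
        releaseStoreRef.run();
    }
}
```

The key point is that the rejection path, which #60586 made quiet, still has to perform the same cleanup as the failure path.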

The test in #77178 uses the internal test cluster, where nodes are all restarted in memory. I looked for a way to wait for the index check to be processed and fully released during shutdown, but there is no easy way to do that for searchable snapshot shards. I expect the bug fix here to be enough to close #77178; if the test fails again with cache files still open, we can change it to wait for the cache fetch thread pool to go idle.

Note to reviewers: this is somewhat similar to #77783. Sorry for the long description, I wanted to be sure to capture everything in case the test failed again.

@tlrx tlrx added the >bug, :Distributed/Recovery, v8.0.0, and v7.16.0 labels Sep 21, 2021
@elasticmachine elasticmachine added the Team:Distributed label Sep 21, 2021
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear original-brownbear (Member) left a comment:

I agree with the problem but I think the solution might be more annoying than what we have here, 🤞 I missed something :)

}
}
} else {
onFailure(e);
Member comment on the code above:

I think it's impossible to get here right? Generic only ever rejects when shut down right?

}
}
}

@Override
public void onRejection(Exception e) {
Member comment on the code above:

The problem I see here is that we will often just not get this far when the pool shuts down while this thing is scheduled to run at some later point in time, will we?

Maybe we do need some lifecycle on PeerRecoveryTargetService after all, then track the queued-up retries and fail those on stop/close like we have for the source service?
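(A rough sketch of what such lifecycle tracking could look like; the names below are invented for illustration and this is not the actual PeerRecoveryTargetService code. Queued retries are registered while pending and are failed, releasing their store refs, when the service stops.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the suggested lifecycle approach; invented names.
class RecoveryRetryTracker {
    private final Map<Long, Runnable> pendingRetries = new ConcurrentHashMap<>();
    private volatile boolean stopped = false;

    // register a queued retry; if the service is already stopped, fail it right away
    void trackRetry(long recoveryId, Runnable failAndReleaseStore) {
        pendingRetries.put(recoveryId, failAndReleaseStore);
        if (stopped) {
            failPending();
        }
    }

    void retryStarted(long recoveryId) {
        pendingRetries.remove(recoveryId); // the running retry now owns the cleanup
    }

    // called from the service's stop/close lifecycle hook
    void stop() {
        stopped = true;
        failPending();
    }

    private void failPending() {
        pendingRetries.forEach((id, failAndReleaseStore) -> {
            if (pendingRetries.remove(id, failAndReleaseStore)) {
                failAndReleaseStore.run(); // releases the store ref held by the queued retry
            }
        });
    }
}
```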

Member Author reply:


> The problem I see here is that we will often just not get this far when the pool shuts down while this thing is scheduled to run at some later point in time, will we?

Yes, I think you're right 🤦🏻 I somehow managed to miss this obvious point when I created this PR.

> Maybe we do need some lifecycle on PeerRecoveryTargetService after all, then track the queued-up retries and fail those on stop/close like we have for the source service?

I also agree: I think this is the only way to ensure that resources are correctly released at shutdown. I'll take a look at this and will close this PR if I manage to implement what you suggested.

@danhermann danhermann added v8.1.0 and removed v7.16.0 labels Oct 27, 2021
@arteam arteam removed the v8.0.0 label Jan 12, 2022
@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@elasticsearchmachine (Collaborator) commented:

Hi @tlrx, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:10
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@kingherc kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022
@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
Labels: >bug, :Distributed/Recovery, Team:Distributed, v8.15.0

Successfully merging this pull request may close these issues:

[CI] Failure in FrozenSearchableSnapshotsIntegTests.testCreateAndRestorePartialSearchableSnapshot