
Safer Wait for Snapshot Success in ClusterPrivilegeTests #40943

Conversation

original-brownbear
Member

@original-brownbear original-brownbear commented Apr 8, 2019

* The snapshot state returned by the API might become SUCCESS before it's fully removed from the cluster state.
  * We should fix this race in the transport API, but it's not trivial and will be part of the incoming big round of refactoring the repository interaction; this added check fixes the test for now
* closes elastic#38030
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC v8.0.0 v7.2.0 labels Apr 8, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-security

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/2

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/1

assertBusy(() -> {
    // The status API may report SUCCESS before the entry is removed from the
    // cluster state, so also wait until no SnapshotsInProgress entries remain.
    SnapshotsInProgress response =
        client().admin().cluster().state(new ClusterStateRequest()).get().getState().custom(SnapshotsInProgress.TYPE);
    assertThat(response.entries(), Matchers.empty());
});
Contributor

@original-brownbear you've got my rubber-stamp LGTM, but shouldn't we fix the prepareSnapshotStatus(repo).setSnapshots(snapshot) API to not return SUCCESS if there is still a SnapshotsInProgress.TYPE entry?

Contributor

@albertzaharovits albertzaharovits left a comment

LGTM, but I think the snapshot status API needs a fix so that it returns SUCCESS consistently with respect to the SnapshotsInProgress.TYPE metadata entry. If, for some obscure reason, this is not possible, I believe a comment in the main code (similar to the one here) is required. If/when the change in main lands, I would remove this extra condition here.

@original-brownbear
Member Author

@albertzaharovits thanks for taking a look :) You know ... I just realized these tests run against a single node => we shouldn't have any races here with the state like that => I'm double checking the solution, sorry for the noise!

@original-brownbear
Member Author

Ok, this is a mess :) I was able to fix the API (at least for the single node case) via f7c4afb

We have two APIs for getting the current list of snapshots, org.elasticsearch.snapshots.SnapshotsService#currentSnapshots(java.lang.String) and org.elasticsearch.snapshots.SnapshotsService#currentSnapshots(java.lang.String, java.util.List<java.lang.String>).

  • One of them returns org.elasticsearch.snapshots.SnapshotInfo and correctly translates a SUCCESS state in the cluster state as still technically IN_PROGRESS and sets that state on the SnapshotInfo.
  • The other one (used by org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction, which is what this test uses) returns the raw SnapshotsInProgress.Entry list, so SUCCESS stays SUCCESS, when in actuality the existence of a SUCCESS entry means we may still have to finalize the snapshot in the repository. Because the calls to SnapshotsService are used in multiple places, I decided for now to just hack the fix of translating SUCCESS for ongoing snapshots to "in progress/started" into the transport API that isn't used by the snapshot process itself. This fixed the test broken here for me and also makes the REST API a little more useful :)
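The SUCCESS-to-STARTED translation described above can be modeled in isolation. Here is a minimal, self-contained sketch; the enum constants mirror `SnapshotsInProgress.State`, but the class and method names are illustrative, not the actual Elasticsearch types:

```java
import java.util.List;
import java.util.stream.Collectors;

public class SnapshotStateTranslation {

    // Mirrors SnapshotsInProgress.State: an entry still present in the
    // cluster state may already be marked SUCCESS even though the snapshot
    // has not yet been finalized in the repository.
    enum State { INIT, STARTED, SUCCESS, FAILED, ABORTED }

    // The fix sketched in the discussion: any entry still in the cluster
    // state is, in effect, in progress, so report SUCCESS as STARTED.
    static State translateInProgress(State clusterStateEntry) {
        return clusterStateEntry == State.SUCCESS ? State.STARTED : clusterStateEntry;
    }

    // Applied over the equivalent of the raw SnapshotsInProgress.Entry list.
    static List<State> translateAll(List<State> entries) {
        return entries.stream()
            .map(SnapshotStateTranslation::translateInProgress)
            .collect(Collectors.toList());
    }
}
```

The point of the sketch is that the translation is purely state-local: only SUCCESS changes, every other in-progress state passes through untouched.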

@albertzaharovits
Contributor

@original-brownbear If I understand it correctly, every SnapshotsInProgress.Entry is a sign that the snapshot is "in-progress", i.e. SnapshotState.IN_PROGRESS, SnapshotsInProgress.State.STARTED, and SnapshotsInProgress.State.SUCCESS all mean the same thing from the "in-progress" perspective.

I would defer to others to lean on the API change. If you don't mind, I'll shirk away from the reviewers group.
If I were you, I would've done the state == SnapshotsInProgress.State.SUCCESS ? SnapshotsInProgress.State.STARTED : state hack on the client side, even though it's apparently harder. For example, if we go with this proposed API change I think we need an accompanying test of its own. And the PR label needs to be changed as well, and probably further backporting the change. Again, I will leave to others for authoritative suggestions.

But, can't the test(s) use TransportGetSnapshotsAction, for "in-progress" polling, which internally uses the "correctly" functioning SnapshotService#currentSnapshots ?

@albertzaharovits albertzaharovits removed their request for review April 16, 2019 12:05
@original-brownbear
Member Author

@albertzaharovits

I would defer to others to lean on the API change. If you don't mind, I'll shirk away from the reviewers group.

Sure no worries, thanks for the input so far!

But, can't the test(s) use TransportGetSnapshotsAction, for "in-progress" polling, which internally uses the "correctly" functioning SnapshotService#currentSnapshots ?

Yeah, but I figured we're seeing an actual bug in the API here, so why not fix it. We don't have other code dependent on this API, it seems, and it's just a three-line change to get rid of a mean inconsistency here :)

hack on the client side, even though it's apparently harder.

I don't think we can really do anything on the client side here. To the client, the reason for SUCCESS is not visible (it doesn't know if the snapshot state was read from the repo or cluster state) -> it can't translate SUCCESS conditionally.

@ywelsch
Contributor

ywelsch commented May 2, 2019

Reusing the same enum for snapshots that are in progress and those that have completed has created this odd behavior where users can't distinguish between an in-progress snapshot that's in its finalization step and one where finalization was successfully completed. The same applies to the FAILED state. Even worse, we reuse the same enum across shard snapshot status and global snapshot status, which has caused even more confusion. I think now is a good time to start separating these instead of adding a workaround. We should fix the test first, without making changes to ES, so that we can get test coverage back. Waiting for a successful snapshot can be done by having the test first check whether GET /_snapshot/my_backup/_current returns a SnapshotMissingException and then do the snapshot status check.
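The two-step wait proposed above (poll until the snapshot is gone from the cluster state, then check its status) can be sketched with a generic polling helper. This is a self-contained illustration under the assumption that the caller wraps the GET /_snapshot/my_backup/_current call so that a SnapshotMissingException maps to `true`; the names, attempt limit, and back-off are hypothetical, not Elasticsearch APIs:

```java
import java.util.function.BooleanSupplier;

public class WaitForSnapshotGone {

    // Polls until the probe reports the snapshot missing from the cluster
    // state. `isSnapshotMissing` stands in for a check such as "GET
    // /_snapshot/my_backup/_current threw SnapshotMissingException"; only
    // after it returns true is the snapshot status check meaningful.
    static boolean waitUntilMissing(BooleanSupplier isSnapshotMissing, int maxAttempts, long pollMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (isSnapshotMissing.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMillis); // back off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

In the actual test this role is played by `assertBusy`-style retry utilities; the sketch only shows the ordering: cluster-state check first, status check second.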

@original-brownbear
Member Author

@ywelsch alright, no hack it is :) I just reverted to my initial check here and simply made the test verify that the snapshot is gone from the CS like we do in other tests.

Can you take another look?

@original-brownbear
Member Author

@ywelsch ping :)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 8, 2019
* Added separate enum for the state of each shard, it was really
confusing that we used the same enum for the state of the snapshot
overall and the state of each individual shard
   * relates elastic#40943 (comment)
* Moved the contents of the class around a little so fields,
constructors and nested classes/enums aren't all over the place
especially now that we have yet another nested enum here
* Shortened some obvious spots in equals method and saved a few lines
via `computeIfAbsent` to make up for adding 50 new lines to this class
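The `computeIfAbsent` shortening mentioned in the commit message is the standard `java.util.Map` idiom: one call replaces the check-then-put pattern when building map-of-list structures. A generic example (the keys and values here are illustrative, not the actual class fields):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ComputeIfAbsentExample {

    // Before: explicit get / null-check / put for each key.
    static void addVerbose(Map<String, List<String>> byIndex, String index, String shard) {
        List<String> shards = byIndex.get(index);
        if (shards == null) {
            shards = new ArrayList<>();
            byIndex.put(index, shards);
        }
        shards.add(shard);
    }

    // After: computeIfAbsent creates and inserts the list only when absent.
    static void addCompact(Map<String, List<String>> byIndex, String index, String shard) {
        byIndex.computeIfAbsent(index, k -> new ArrayList<>()).add(shard);
    }
}
```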
Contributor

@ywelsch ywelsch left a comment

LGTM

@original-brownbear
Member Author

thanks @ywelsch !

@original-brownbear original-brownbear merged commit 753726c into elastic:master May 21, 2019
@original-brownbear original-brownbear deleted the more-aggressive-cs-handle-snapshot-service branch May 21, 2019 17:22
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 21, 2019
* master: (176 commits)
  Avoid unnecessary persistence of retention leases (elastic#42299)
  [ML][TEST] Fix limits in AutodetectMemoryLimitIT (elastic#42279)
  [ML Data Frame] Persist and restore checkpoint and position (elastic#41942)
  mute failing filerealm hash caching tests (elastic#42304)
  Safer Wait for Snapshot Success in ClusterPrivilegeTests (elastic#40943)
  Remove 7.0.2 (elastic#42282)
  Revert "Remove 7.0.2 (elastic#42282)"
  [DOCS] Copied note on slicing support to Slicing section. Closes 26114 (elastic#40426)
  Remove 7.0.2 (elastic#42282)
  Mute all ml_datafeed_crud rolling upgrade tests
  Move the FIPS configuration back to the build plugin (elastic#41989)
  Remove stray back tick that's messing up table format (elastic#41705)
  Add missing comma in code section (elastic#41678)
  add 7.1.1 and 6.8.1 versions (elastic#42253)
  Use spearate testkit dir for each run (elastic#42013)
  Add experimental and warnings to vector functions (elastic#42205)
  Fix version in tests since elastic#41906 was merged
  Bump version in BWC check after backport
  Prevent in-place downgrades and invalid upgrades (elastic#41731)
  Mute date_histo interval bwc test
  ...
original-brownbear added a commit that referenced this pull request May 22, 2019
* Added separate enum for the state of each shard, it was really
confusing that we used the same enum for the state of the snapshot
overall and the state of each individual shard
   * relates #40943 (comment)
* Shortened some obvious spots in equals method and saved a few lines
via `computeIfAbsent` to make up for adding 50 new lines to this class
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 22, 2019
* master: (82 commits)
  Fix off-by-one error in an index shard test
  Cleanup Redundant BlobStoreFormat Class (elastic#42195)
  remove backcompat handling of 6.2.x versions (elastic#42044)
  Mute testDelayedOperationsBeforeAndAfterRelocated
  Execute actions under permit in primary mode only (elastic#42241)
  Mute another transforms_stats yaml test
  Deprecate support for chained multi-fields. (elastic#41926)
  Mute transforms_stats yaml test
  Make unwrapCorrupt Check Suppressed Ex. (elastic#41889)
  Remove Dead Code from Azure Repo Plugin (elastic#42178)
  Reorganize Painless doc structure (elastic#42303)
  Avoid unnecessary persistence of retention leases (elastic#42299)
  [ML][TEST] Fix limits in AutodetectMemoryLimitIT (elastic#42279)
  [ML Data Frame] Persist and restore checkpoint and position (elastic#41942)
  mute failing filerealm hash caching tests (elastic#42304)
  Safer Wait for Snapshot Success in ClusterPrivilegeTests (elastic#40943)
  Remove 7.0.2 (elastic#42282)
  Revert "Remove 7.0.2 (elastic#42282)"
  [DOCS] Copied note on slicing support to Slicing section. Closes 26114 (elastic#40426)
  Remove 7.0.2 (elastic#42282)
  ...
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 27, 2019
…ic#41940)

* Added separate enum for the state of each shard, it was really
confusing that we used the same enum for the state of the snapshot
overall and the state of each individual shard
   * relates elastic#40943 (comment)
* Shortened some obvious spots in equals method and saved a few lines
via `computeIfAbsent` to make up for adding 50 new lines to this class
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 27, 2019
* Safer Wait for Snapshot Success in ClusterPrivilegeTests

* The snapshot state returned by the API might become SUCCESS before it's fully removed from the cluster state.
  * We should fix this race in the transport API but it's not trivial and will be part of the incoming big round of refactoring the repository interaction, this added check fixes the test for now
* closes elastic#38030
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* Safer Wait for Snapshot Success in ClusterPrivilegeTests

* The snapshot state returned by the API might become SUCCESS before it's fully removed from the cluster state.
  * We should fix this race in the transport API but it's not trivial and will be part of the incoming big round of refactoring the repository interaction, this added check fixes the test for now
* closes elastic#38030
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
…ic#41940)

* Added separate enum for the state of each shard, it was really
confusing that we used the same enum for the state of the snapshot
overall and the state of each individual shard
   * relates elastic#40943 (comment)
* Shortened some obvious spots in equals method and saved a few lines
via `computeIfAbsent` to make up for adding 50 new lines to this class
original-brownbear added a commit that referenced this pull request May 27, 2019
…2575)

* Safer Wait for Snapshot Success in ClusterPrivilegeTests

* The snapshot state returned by the API might become SUCCESS before it's fully removed from the cluster state.
  * We should fix this race in the transport API but it's not trivial and will be part of the incoming big round of refactoring the repository interaction, this added check fixes the test for now
* closes #38030
original-brownbear added a commit that referenced this pull request May 27, 2019
… (#42573)

* Added separate enum for the state of each shard, it was really
confusing that we used the same enum for the state of the snapshot
overall and the state of each individual shard
   * relates #40943 (comment)
* Shortened some obvious spots in equals method and saved a few lines
via `computeIfAbsent` to make up for adding 50 new lines to this class
Labels
:Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC >test Issues or PRs that are addressing/adding tests v7.3.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

ClusterPrivilegeTests.testThatSnapshotAndRestore failing in master
6 participants