
Add Restore Operation to SnapshotResiliencyTests #40634

Conversation

original-brownbear
Member

  • Expand the successful snapshot test case to also include restoring the snapshot
    • Add indexing of documents as well to be able to meaningfully verify the restore
  • This is part of the larger effort to test eventually consistent blob stores in More Snapshot Resiliency Testing #39504

Note: I had to put a mapping in place before indexing documents here, since dynamic mapping updates trigger a blocking operation that hangs the tests. A fix for that is incoming in #39793.
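For readers who have not opened the diff, the shape of the extended test is roughly the following. This is a deliberately simplified, self-contained sketch, not the actual SnapshotResiliencyTests code: the TestCluster interface and its trivial in-memory implementation are hypothetical stand-ins for the transport-action calls the real test makes, and only the ordering (explicit mapping first, then indexing, snapshot, restore, document-count verification) mirrors the change described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the cluster operations the real test performs
// via transport actions; names are illustrative only.
interface TestCluster {
    void createIndexWithMapping(String index, String mapping); // explicit mapping avoids a blocking dynamic-mapping update
    void indexDoc(String index, String id, String source);
    void createSnapshot(String repo, String snapshot, String index);
    void restoreSnapshot(String repo, String snapshot, String restoredIndex);
    long countDocs(String index);
}

public class SuccessfulSnapshotAndRestoreSketch {

    public static void main(String[] args) {
        TestCluster cluster = new InMemoryTestCluster();
        int documents = 100;

        // 1. Put the mapping up front so indexing never triggers a dynamic mapping update.
        cluster.createIndexWithMapping("test", "{\"properties\":{\"foo\":{\"type\":\"keyword\"}}}");

        // 2. Index a known number of documents so the restore can be verified meaningfully.
        for (int i = 0; i < documents; i++) {
            cluster.indexDoc("test", Integer.toString(i), "{\"foo\":\"bar" + i + "\"}");
        }

        // 3. Snapshot, then restore into a new index.
        cluster.createSnapshot("repo", "snapshot-1", "test");
        cluster.restoreSnapshot("repo", "snapshot-1", "restored");

        // 4. Verify: the restored index must contain exactly the indexed documents.
        if (cluster.countDocs("restored") != documents) {
            throw new AssertionError("restore lost documents");
        }
        System.out.println("restored " + documents + " documents");
    }

    // Trivial in-memory implementation, just enough to make the sketch runnable.
    static final class InMemoryTestCluster implements TestCluster {
        private final Map<String, List<String>> indices = new HashMap<>();
        private final Map<String, List<String>> snapshots = new HashMap<>();

        @Override public void createIndexWithMapping(String index, String mapping) {
            indices.put(index, new ArrayList<>());
        }
        @Override public void indexDoc(String index, String id, String source) {
            indices.get(index).add(source);
        }
        @Override public void createSnapshot(String repo, String snapshot, String index) {
            snapshots.put(repo + "/" + snapshot, new ArrayList<>(indices.get(index)));
        }
        @Override public void restoreSnapshot(String repo, String snapshot, String restoredIndex) {
            indices.put(restoredIndex, new ArrayList<>(snapshots.get(repo + "/" + snapshot)));
        }
        @Override public long countDocs(String index) {
            return indices.get(index).size();
        }
    }
}
```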

@original-brownbear added the >test (Issues or PRs that are addressing/adding tests), :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), v8.0.0 and v7.2.0 labels on Mar 29, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/2

Contributor

@andrershov left a comment


I've left two questions.
I also wonder if we want to add a "restore" operation to other tests in this class as well, for example when emulating cluster disruption?
And I'd like to better understand the motivation for this PR: is it primarily here to enable more kinds of actions in future tests?

(inline diff context elided; the comment below refers to the code around the runUntil(...) call in SnapshotResiliencyTests)
Contributor


Why do we need three different atomic variables? Isn't it enough to just have documentCountVerified?

Member Author


Right ... not sure what I was thinking :) My bad, fixed in c064db3.
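For context, the simplification being discussed boils down to tracking test completion with a single flag. The sketch below is a minimal, self-contained model of that idea, assuming a runUntil-style polling loop; it is not the actual test code, and the step names and the runUntil implementation here are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class SingleFlagSketch {

    // A single flag is enough: it is only ever flipped at the very end of the
    // snapshot -> restore -> search chain, so intermediate flags add nothing.
    private static final AtomicBoolean documentCountVerified = new AtomicBoolean();

    public static void main(String[] args) {
        // Stand-in for the asynchronous chain in the test: each step runs as a
        // callback of the previous one, and only the last step sets the flag.
        Runnable searchAndVerify = () -> documentCountVerified.set(true);
        Runnable restoreSnapshot = searchAndVerify::run;
        Runnable createSnapshot = restoreSnapshot::run;
        createSnapshot.run();

        // Simplified stand-in for the test's runUntil(...) loop: poll the
        // condition until it holds or a deadline passes.
        runUntil(documentCountVerified::get, 1_000L);
        System.out.println("document count verified: " + documentCountVerified.get());
    }

    private static void runUntil(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return;
            }
            Thread.onSpinWait(); // placeholder for the deterministic task-queue stepping in the real test
        }
        throw new AssertionError("condition not satisfied within timeout");
    }
}
```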

@original-brownbear
Member Author

@andrershov The motivation for this is basically to get tests that verify we're functional with an eventually consistent blob store (the code for that is already written in #39504), and for that we really want/need to verify that the snapshots can actually be restored. Especially in light of the fact that we currently suspect failures from eventually consistent list operations that silently corrupt the repository (#38941).
The best way to detect this kind of silent corruption, IMO, is to simply try to restore the snapshots and check the results. With this change, all we need to port is the eventually consistent mock blob store out of #39504 and we have tests that can reproduce the corruption case :)

And I wonder if we want to add "restore" operation to other tests in this class as well? For example, when emulating cluster disruption?

Well, the other ones kind of reproduce broken snapshots, and back when we added them we mainly cared that those snapshots finished at all, since the bugs fixed at the time led to various scenarios where snapshots just got stuck forever.
We could actually verify the restore in the next test too, org.elasticsearch.snapshots.SnapshotResiliencyTests#testConcurrentSnapshotCreateAndDelete, but maybe I'd do that in a subsequent step?

@andrershov
Contributor

@original-brownbear so the idea is to extend testSuccessfulSnapshot with restore and call it testSuccessfulSnapshotAndRestore. And the next PR will add eventually consistent BlobStore emulation. So we can just run testSuccessfulSnapshotAndRestore for long enough to make sure it does not fail with the eventually consistent store. Do I understand correctly that, currently, we expect it to fail, or do we already have logic to work around eventual consistency?

@original-brownbear
Member Author

@andrershov

so the idea is to extend testSuccessfulSnapshot with restore and call it testSuccessfulSnapshotAndRestore.

Yea, exactly.

And the next PR will add eventually consistent BlobStore emulation. So we can just run testSuccessfulSnapshotAndRestore for long enough to make sure it does not fail with the eventually consistent store.

Correct as well.

Do I understand correctly that, currently, we expect it to fail, or do we already have logic to work around eventual consistency?

Correct here too. Currently the mock eventually consistent store does fail, yes, but I have a fix for all the cases the test reproduces (the tests are done too in the other PR I linked) in https://github.com/original-brownbear/elasticsearch/tree/delete-lock-via-cs (95% done; I would open a PR for that once we have the tests in place).
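To make the eventual-consistency scenario concrete, here is a minimal sketch of what an eventually consistent mock blob store could look like: writes and reads by exact name succeed immediately, but listings only see a blob after the harness explicitly propagates it. This is purely illustrative and is not the mock from #39504; the class and method names are invented for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a blob store whose list() lags behind its writes, which is
// the kind of behaviour that can make snapshot operations see a stale view of
// the repository and corrupt it silently.
public class EventuallyConsistentBlobStoreSketch {

    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();
    private final Set<String> visibleToList = ConcurrentHashMap.newKeySet();

    public void writeBlob(String name, byte[] data) {
        blobs.put(name, data); // reads by exact name see the blob right away
    }

    public byte[] readBlob(String name) {
        return blobs.get(name);
    }

    public Set<String> listBlobs() {
        return Set.copyOf(visibleToList); // may be arbitrarily stale
    }

    // The test harness decides when the "store" catches up, e.g. randomly
    // between simulated cluster steps, to exercise stale-listing code paths.
    public void propagate() {
        visibleToList.addAll(blobs.keySet());
    }

    public static void main(String[] args) {
        EventuallyConsistentBlobStoreSketch store = new EventuallyConsistentBlobStoreSketch();
        store.writeBlob("index-0", new byte[] {1});
        System.out.println("listed before propagation: " + store.listBlobs()); // []
        store.propagate();
        System.out.println("listed after propagation:  " + store.listBlobs()); // [index-0]
    }
}
```

Randomizing when propagate() is called between simulated cluster steps is what would let such a test hit the stale-listing code paths suspected in #38941.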

Contributor

@andrershov left a comment


LGTM

@original-brownbear
Member Author

Thanks @andrershov!

@original-brownbear original-brownbear merged commit 78a590f into elastic:master Apr 3, 2019
@original-brownbear original-brownbear deleted the more-snapshot-resiliency-testing-2 branch April 3, 2019 12:50
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 3, 2019
…leniency

* elastic/master:
  SQL: Fix deserialisation issue of TimeProcessor (elastic#40776)
  Improve GCS docs for using keystore (elastic#40605)
  Add Restore Operation to SnapshotResiliencyTests (elastic#40634)
  Small refactorings to analysis components (elastic#40745)
  SQL: Fix display size for DATE/DATETIME (elastic#40669)
  add HLRC protocol tests for transform state and stats (elastic#40766)
  Inline TransportReplAction#registerRequestHandlers (elastic#40762)
  remove experimental label from search_as_you_type documentation (elastic#40744)
  Remove some abstractions from `TransportReplicationAction` (elastic#40706)
  Upgrade to latest build scan plugin (elastic#40702)
  Use default memory lock setting in testing (elastic#40730)
  Add Bulk Delete Api to BlobStore (elastic#40322)
  Remove yaml skips older than 7.0 (elastic#40183)
  Docs: Move id in the java-api (elastic#40748)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 25, 2019
* Add Restore Operation to SnapshotResiliencyTests

* Expand the successful snapshot test case to also include restoring the snapshot
  * Add indexing of documents as well to be able to meaningfully verify the restore
* This is part of the larger effort to test eventually consistent blob stores in elastic#39504
original-brownbear added a commit that referenced this pull request Apr 26, 2019
* Add Restore Operation to SnapshotResiliencyTests

* Expand the successful snapshot test case to also include restoring the snapshot
  * Add indexing of documents as well to be able to meaningfully verify the restore
* This is part of the larger effort to test eventually consistent blob stores in #39504
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* Add Restore Operation to SnapshotResiliencyTests

* Expand the successful snapshot test case to also include restoring the snapshot
  * Add indexing of documents as well to be able to meaningfully verify the restore
* This is part of the larger effort to test eventually consistent blob stores in elastic#39504