Don't create snapshot repositories on the cluster state update thread #9488

Closed · bleskes opened this issue Jan 29, 2015 · 12 comments
Labels: >bug · :Distributed/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs) · help wanted · adoptme

bleskes (Contributor) commented Jan 29, 2015

Currently, when the master indicates that a new repository needs to be created, the Snapshot/Restore code creates it while on the cluster state update thread. This is problematic because repository creation typically involves network calls, which may slow down cluster state processing. We should do it asynchronously.
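
To illustrate the shape of the problem and the proposed fix, here is a minimal, self-contained sketch (all names hypothetical; this is not Elasticsearch's actual threading API): a single-threaded executor stands in for the cluster state update thread, and the slow repository creation is handed off to a generic pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ClusterStateThreadSketch {
    // Stands in for the single cluster state update thread.
    static final ExecutorService CS_THREAD = Executors.newSingleThreadExecutor();
    // Stands in for a generic pool where slow work is allowed.
    static final ExecutorService GENERIC = Executors.newFixedThreadPool(4);

    static void createRepositoryWithNetworkCalls() {
        try {
            TimeUnit.SECONDS.sleep(2); // stands in for S3/HDFS/etc. network I/O
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Anti-pattern: blocking creation stalls every later state update.
        CS_THREAD.submit(ClusterStateThreadSketch::createRepositoryWithNetworkCalls);
        CS_THREAD.submit(() -> System.out.println("this update waited ~2s"));

        // Proposed shape: the state update only records the change and
        // hands the slow creation off to another pool.
        CS_THREAD.submit(() -> GENERIC.submit(
                ClusterStateThreadSketch::createRepositoryWithNetworkCalls));
        CS_THREAD.submit(() -> System.out.println("this update ran immediately"));

        CS_THREAD.shutdown();
        GENERIC.shutdown();
    }
}
```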

@bleskes bleskes added the :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Jan 29, 2015
@imotov imotov self-assigned this Jan 29, 2015
@imotov imotov added the >bug label Jan 29, 2015
imotov (Contributor) commented Jan 29, 2015

It sounds like a good idea, but it might be really tricky to implement. Imagine somebody creates and deletes a repository with the same name several times in quick succession while repository creation takes a long time. We would need some sort of repository creation/destruction pipeline to handle this properly (sketched below). Even with such a pipeline, if somebody starts a snapshot right after creating the repository, we would have to hold the snapshot start until the repository is fully created.
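
For illustration, a rough sketch (hypothetical and heavily simplified) of such a pipeline: operations on the same repository name are chained so they execute strictly one after another, even if each operation is slow.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RepositoryPipeline {
    private final ExecutorService pool = Executors.newCachedThreadPool();
    // Tail of the pending operation chain, per repository name.
    private final Map<String, CompletableFuture<Void>> tails = new HashMap<>();

    /** Enqueue an async create/delete step behind whatever is already pending. */
    public synchronized CompletableFuture<Void> enqueue(String repoName, Runnable op) {
        CompletableFuture<Void> tail =
                tails.getOrDefault(repoName, CompletableFuture.completedFuture(null));
        // Note: a failed step short-circuits later ones; real code would handle errors.
        CompletableFuture<Void> next = tail.thenRunAsync(op, pool);
        tails.put(repoName, next);
        return next;
    }

    public static void main(String[] args) {
        RepositoryPipeline pipeline = new RepositoryPipeline();
        // Rapid create/delete/create on the same name still executes in order.
        pipeline.enqueue("backups", () -> System.out.println("create #1"));
        pipeline.enqueue("backups", () -> System.out.println("delete #1"));
        pipeline.enqueue("backups", () -> System.out.println("create #2")).join();
        pipeline.pool.shutdown();
    }
}
```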

I think it might be more prudent to ensure that repository creation doesn't block. Now that we have added repository verification, there is really no good reason for a repository to open a network connection during its initialization - it's possible that such a connection will never even be used if a node doesn't hold any primary shards. @dadoonet, @tlrx thoughts?

tlrx (Member) commented Jan 30, 2015

I agree with Igor: this sounds like a good idea, but I'm not sure we can implement it correctly right now. In the near future, maybe we could use a task management API like the one described in #6914 to pipeline the repository creation/destruction and snapshot/restore tasks?

Also, I agree that repository verification can take some time. It is usually a quick process, but I have experienced latency/network problems while using it.

I'm wondering if we could keep the repository creation process synchronous (i.e. on the cluster state update thread) and set a repository property like `verification_status: unverified`, then make the verification process asynchronous, going from `unverified` to `executing` and finally `verified`. We could block snapshot requests on unverified repositories, as well as block repository creation/deletion.
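
A minimal sketch of what that state machine could look like (names are hypothetical; the `FAILED` state is an assumption beyond the three states named above):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class VerifiedRepository {
    enum VerificationStatus { UNVERIFIED, EXECUTING, VERIFIED, FAILED }

    private final AtomicReference<VerificationStatus> status =
            new AtomicReference<>(VerificationStatus.UNVERIFIED);

    /** Registration stays cheap and synchronous (cluster state update thread). */
    public VerificationStatus status() { return status.get(); }

    /** Kicks off asynchronous verification: unverified -> executing -> verified. */
    public CompletableFuture<Void> verifyAsync(Runnable networkCheck) {
        if (!status.compareAndSet(VerificationStatus.UNVERIFIED, VerificationStatus.EXECUTING)) {
            return CompletableFuture.completedFuture(null); // already running or done
        }
        return CompletableFuture.runAsync(networkCheck)
                .whenComplete((ignored, err) -> status.set(
                        err == null ? VerificationStatus.VERIFIED : VerificationStatus.FAILED));
    }

    /** Snapshot requests are rejected until verification has completed. */
    public void ensureVerifiedForSnapshot() {
        if (status.get() != VerificationStatus.VERIFIED) {
            throw new IllegalStateException("repository is " + status.get() + ", not VERIFIED");
        }
    }
}
```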

bleskes (Contributor, Author) commented Jan 30, 2015

> I think it might be more prudent to ensure that repository creation doesn't block.

+1 to that. But we may not always control it, especially if people extend it.

> It is usually a quick process, but I have experienced latency/network problems while using it.

For what it's worth, I once saw a 4-minute block, causing all kinds of secondary issues in the cluster.

> I'm wondering if we could keep the repository creation process synchronous (i.e. on the cluster state update thread) and set a repository property like `verification_status: unverified`, then make the verification process asynchronous, going from `unverified` to `executing` and finally `verified`. We could block snapshot requests on unverified repositories, as well as block repository creation/deletion.

That would be good, but I think it's not too far away from having a repository wrapper created by the framework upon cluster state updates and started asynchronously. If the repo is deleted while initializing, we can mark the wrapper with a deleted flag, which will cause it to apply the delete logic immediately once initialization completes.
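
A rough sketch (hypothetical, simplified) of that wrapper: the cluster state update only creates the cheap wrapper, initialization runs asynchronously, and a delete racing with initialization is applied exactly once as soon as initialization finishes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class RepositoryWrapper {
    private final AtomicBoolean deleted = new AtomicBoolean(false);
    private final AtomicBoolean deleteApplied = new AtomicBoolean(false);
    private final Runnable deleteLogic;
    private final CompletableFuture<Void> initialized;

    RepositoryWrapper(Runnable slowInit, Runnable deleteLogic) {
        this.deleteLogic = deleteLogic;
        // Initialization runs off the cluster state update thread.
        this.initialized = CompletableFuture.runAsync(slowInit);
        // If a delete arrived while initializing, apply it as soon as init completes.
        this.initialized.thenRun(() -> {
            if (deleted.get()) applyDeleteOnce();
        });
    }

    /** Called from a later cluster state update; cheap and non-blocking. */
    void markDeleted() {
        deleted.set(true);
        // If initialization already finished, run the delete logic right away;
        // otherwise the completion callback above will pick it up.
        if (initialized.isDone()) applyDeleteOnce();
    }

    private void applyDeleteOnce() {
        // Both paths may race here; the CAS guarantees a single execution.
        if (deleteApplied.compareAndSet(false, true)) {
            deleteLogic.run();
        }
    }
}
```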

clintongormley commented
@imotov does this still need doing?

imotov (Contributor) commented Nov 28, 2016

@clintongormley I don't think anything changed in the last year. @tlrx, @abeyad what do you think?

tlrx (Member) commented Nov 28, 2016

I checked the code again and I don't think anything has changed there... so it would be nice to implement any of the suggested solutions.

abeyad commented Nov 28, 2016

The S3 and GCE repositories definitely make network calls during initialization (which occurs on the cluster state update thread). I don't see the same for the Azure repository, though I could have missed it. In any case, I like @tlrx's proposed solution of handling repository verification asynchronously, but unless we move repository initialization itself out of the cluster state update task, we will have to require that all plugin developers know not to perform blocking operations, like network calls, in repository construction.

abeyad commented Nov 28, 2016

We could have a custom method like onInit that each repository implementation must override, which would contain any blocking logic and be called outside of the cluster state update thread. I'll have to think about it some more.
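
A hypothetical sketch of that onInit idea (names and signatures are assumptions, not the actual plugin API): the framework constructs and registers the repository cheaply, then invokes the hook on a pool where blocking is allowed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OnInitSketch {
    /** Hypothetical plugin-facing contract; not the actual Repository interface. */
    interface Repository {
        /** Must be cheap; the constructor runs on the cluster state update thread. */
        String name();
        /** May block (network calls etc.); invoked off the cluster state thread. */
        void onInit() throws Exception;
    }

    private static final ExecutorService GENERIC_POOL = Executors.newFixedThreadPool(4);

    static void register(Repository repo) {
        // On the cluster state update thread: record the repository only.
        System.out.println("registered " + repo.name());
        // The potentially blocking part is deferred to a generic pool.
        GENERIC_POOL.submit(() -> {
            repo.onInit();
            return null; // Callable form, so onInit may throw checked exceptions
        });
    }
}
```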

tlrx (Member) commented Mar 21, 2018

We talked about this today with @ywelsch and this is still something we want to do. We think the repository could be registered as it is today, but the check that the filesystem/bucket/whatever exists could be delayed until the first access to the repository.
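
Sketched out (hypothetical names, simplified), that would mean registration does no I/O at all, and the existence check moves to the first real use:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class LazilyCheckedRepository {
    /** Stands in for a filesystem/S3/GCS blob store client. */
    interface BlobStore { boolean containerExists(String path); }

    private final BlobStore store;
    private final String basePath;
    private final AtomicBoolean checked = new AtomicBoolean(false);

    LazilyCheckedRepository(BlobStore store, String basePath) {
        this.store = store;     // registration: no I/O at all
        this.basePath = basePath;
    }

    /** The existence check runs on the first real access, off the CS thread. */
    private void ensureContainerExists() {
        if (checked.get()) return;
        if (!store.containerExists(basePath)) { // first access pays the I/O cost
            throw new IllegalStateException("container [" + basePath + "] does not exist");
        }
        checked.set(true);
    }

    void readSnapshotMetadata() {
        ensureContainerExists(); // called from snapshot threads
        // ... read metadata from the blob store
    }
}
```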

tlrx added a commit to tlrx/elasticsearch that referenced this issue Jun 4, 2018
@original-brownbear original-brownbear self-assigned this Nov 26, 2018
original-brownbear (Member) commented Dec 3, 2018

@ywelsch @tlrx I looked into this a bit and it looks like this has already been done in #31606?

@original-brownbear original-brownbear removed their assignment Dec 3, 2018
ywelsch (Contributor) commented Dec 3, 2018

@original-brownbear A follow-up to this is to make sure the master does not manipulate the list of repositories in the cluster state update task (it calls `registerRepository`). If that cluster state update fails to be published, the master will end up with a dangling repo. The cleaner way is to have the master just do `createRepository`, and this validation can even be done before submitting the cluster state update task. The master then updates the list of repositories in the same way as the other nodes, by applying the cluster state update task. Finally, this prevents concurrent writes to the list of repositories (the `repositories` field in `RepositoriesService`), as only the single cluster applier service thread will be updating the list of repos.
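
A rough sketch of that flow (heavily simplified; the types are hypothetical, but `registerRepository`, `createRepository`, and `applyClusterState` are the names used above):

```java
import java.util.HashMap;
import java.util.Map;

public class RepositoriesServiceSketch {
    // Only the single cluster applier thread replaces this map.
    private volatile Map<String, Object> repositories = new HashMap<>();

    /** Master-side API: validate by creating the repository *before* publishing. */
    void registerRepository(String name, String settings) {
        createRepository(settings);               // fails fast, nothing published yet
        submitClusterStateUpdate(name, settings); // no mutation of `repositories` here
    }

    /** Runs on every node (master included) on the cluster applier thread. */
    void applyClusterState(Map<String, String> reposInNewState) {
        Map<String, Object> updated = new HashMap<>();
        for (Map.Entry<String, String> e : reposInNewState.entrySet()) {
            // Reuse the existing instance, or create one for entries new in this state.
            Object existing = repositories.get(e.getKey());
            updated.put(e.getKey(), existing != null ? existing : createRepository(e.getValue()));
        }
        repositories = updated; // single writer: the applier thread
    }

    private Object createRepository(String settings) { return new Object(); }
    private void submitClusterStateUpdate(String name, String settings) { /* publish */ }
}
```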

@original-brownbear original-brownbear self-assigned this Dec 3, 2018
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 3, 2018
* Move `createRepository` call out of cluster state tasks
    * Now only `RepositoriesService#applyClusterState` manipulates `this.repositories`
* Closes elastic#9488
original-brownbear (Member) commented
Done in #36157, I think.

original-brownbear added a commit that referenced this issue Dec 4, 2018
* Move `createRepository` call out of cluster state tasks
    * Now only `RepositoriesService#applyClusterState` manipulates `this.repositories`
* Closes #9488