-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds support for cockroach actions #19288
Conversation
A cockroach action is a transport-like action that is using the cluster state instead of transport to start tasks and send tasks responses. This allows cockroach tasks to survive restart of coordinating and executing nodes. A cockroach action can be implemented by extending TransportCockroachAction. TransportCockroachAction will start the task by using CockroachService, which controls cockroach tasks lifecycle. See TestCockroachActionPlugin for an example implementing a cockroach action.
/** | ||
* This operation will be executed on the caller node. | ||
*/ | ||
public abstract void executorNodeOperation(Task task, Request request, ActionListener<Response> listener); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment here should be that the operation will be executed on the executor node?
If we need this, then can we please name it something useful and not "cute" (not that cockroaches are cute, just that it is a toy name). It seems this is more of a "persistent action" or "persistent task"? There should also be package level javadocs explaining all the classes involved and what is necessary to create a new one of these "things" (eg extend this class like this, and extend that class like that, register here, etc). |
@rjernst the "If we need this..." part hurts, but I will try to ignore it since the rest of your comment is very helpful. I agree, the "persistent action/task" would be a great name but, unfortunately, it is already used by tasks that persist the results on disk, and I was concerned that it might cause confusion. A few of us brainstormed on a concise name for a while, but as you know naming in computer science is a difficult problem, so, this is all we could come up with. I ran it by a few of my colleagues and they liked it so the name stuck. If you have a better name, please let me know. I will be glad to rename it. The package level javadocs suggestion is really useful. Thank you. I will definitely add that. |
I did not mean it to be offensive, sorry. I meant it as "let's be absolutely sure we need to add thousands of lines of infrastructure for this". I see lots of things like this in the codebase that seem like a good idea at the time, but then only a handful of things are converted to the "new model" and we get stuck with tons of half used concepts and abstractions. I want to make sure that will not be the case here. Will all task actions be converted to this? Regarding the name, will those disk persisted actions be converted to this new model? If so, maybe change those to another name temporarily? If not, why do we need both disk and cluster state persisted actions? If we really need both, then ClusterPersistedAction or ClusterPersistedTask I guess... |
Yes, I share your frustration here.
No. I really hope not. I mean, theoretically, we could do that but it would be quite disastrous since these actions are pretty heavy-weight comparing to normal transport actions. We have a small subset of actions that need to run for a prolong period of time and be resilient to node restarts. Two current examples of such actions are snapshot and restore. In a sense this PR is a generalized, reusable version of what snapshot and restore are currently doing in terms of their lifecycle. I am just extracting and de-duplication it and making it available to other actions such as benchmark api, for example. Even without future modules, when we have this framework in place, we will be able to remove a lot of very complex code from snapshot/restore module and allow it focus on moving files around instead of dealing with all this convoluted lifecycle management logic.
Yes, we need both since these are perpendicular concerns. ClusterPersistedTask sounds to me like a task that persists results into the cluster state. We also toyed with ResilientAction, but it makes all other actions sound non-resilient, which I though might offend Boaz. How does DurableAction sound to you? Other options that I considered are Enduring, Lasting, Undying and Everlasting. They sound kind of meh, but I can go with any of them if they sound "useful". Any other suggestions? |
private final Exception failure; | ||
@Nullable | ||
private final String executorNode; | ||
private final String callerNode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be called responseNode
to be consistent with how its used?
if (tasks != null) { | ||
for (CockroachTaskInProgress taskInProgress : tasks.entries()) { | ||
if (localNodeId.equals(taskInProgress.getExecutorNode())) { | ||
RunningCockroachTask cockroachTask = runningTasks.get(taskInProgress.getUuid()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want to use notVisitedTasks
here instead of runningTasks
Sounds as reasonable a name as any proposed. |
@imotov I left an initial round of feedback. Overall, the code is very intuitive and easy to follow. Great work! |
Thanks for thinking of me. I wouldn't mind it for that reason but I think it does make all other task sound non-resilient and I think that's a discredit to them. I'm also +1 on Ryan's suggestion to try and not use Cockroaches. Doesn't feel like the right kind of fun to me. I wasn't part of the said long discussion, so please tell me I'm all wrong but here is what I gather from all the little comments I read. We basically have two types of tasks:
Orthogonal to this, we have the question of whether we want to persist the result of a task once it has completed. It seems to me that we actually only want this for long running tasks of the first category, but we implemented for 2 as well because of reindexing which is technically a type 2 because of the reason stated above. With that in mind, I would suggest that we use Looking the code for the persistency aspect it seems it's implemented by |
I'm still confused here. Why are we adding this exactly? What is the concrete need? The problem is, its so easy to add thousands of lines of code for no reason, yet its insanely hard to remove code in elasticsearch, no matter how esoteric (see #19359) This means, elasticsearch code will continue to grow, unsustainably out of control, unless @rmuir steps up. Which I am doing right now. Please, lets not add code unless there is a concrete need. It is literally impossible for the abstractions to be correct. Please don't add more abstractions than necessary either: you are doing them wrong if you are not building them around a concrete need, and they will only need to be re-done again. |
We are not adding anything new and it's not an esoteric construct. This code is essentially already in snapshot and restore, but it's tied deep inside and sprinkled with file moving logic, which makes it very difficult to test and reason about. I am just trying to extract it into a more testable and cleaner construct, that can be used by other services as well (for example we can consider moving reindex api and watchers to it at some point). The main driving factor is to refactor snapshot and restore first. I was thinking about doing this in one go (add long running task and remove corresponding code from snapshot and restore) but the resulting PR is simply not manageable. So, I am trying to get it in chunks. |
After the extensive discussion with @mikemccand that took place on Friday and over the weekend I realized that it was a terrible mistake to open this PR without discussing it with @rjernst and @rmuir first. I totally agree, we simply cannot have so many esoteric features in Elasticsearch and the recent discussion with @mikemccand made me realize the important role that @rmuir plays in keeping this number in check. I am extremely grateful to @rmuir for his tireless effort and I am closing this PR as a small token of appreciation. |
A cockroach action is a transport-like action that is using the cluster state instead of transport to start tasks and send tasks responses. This allows cockroach tasks to survive restart of coordinating and executing nodes. A cockroach action can be implemented by extending TransportCockroachAction. TransportCockroachAction will start the task by using CockroachService, which controls cockroach tasks lifecycle. See TestCockroachActionPlugin for an example implementing a cockroach action.
This PR just adds an infrastructure for creating cockroach tasks and tests. The first use of the tasks will be to migrate snapshot/restore functionality to it. Since it's a big change, it probably makes more sense to target 6.0.0 with it.