Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for cockroach actions #19288

Closed
wants to merge 1 commit into from

Conversation

imotov
Copy link
Contributor

@imotov imotov commented Jul 6, 2016

A cockroach action is a transport-like action that is using the cluster state instead of transport to start tasks and send tasks responses. This allows cockroach tasks to survive restart of coordinating and executing nodes. A cockroach action can be implemented by extending TransportCockroachAction. TransportCockroachAction will start the task by using CockroachService, which controls cockroach tasks lifecycle. See TestCockroachActionPlugin for an example implementing a cockroach action.

This PR just adds an infrastructure for creating cockroach tasks and tests. The first use of the tasks will be to migrate snapshot/restore functionality to it. Since it's a big change, it probably makes more sense to target 6.0.0 with it.

A cockroach action is a transport-like action that is using the cluster state instead of transport to start tasks and send tasks responses. This allows cockroach tasks to survive restart of coordinating and executing nodes. A cockroach action can be implemented by extending TransportCockroachAction. TransportCockroachAction will start the task by using CockroachService, which controls cockroach tasks lifecycle.  See TestCockroachActionPlugin for an example implementing a cockroach action.
@imotov imotov added >enhancement :Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. v6.0.0-alpha1 labels Jul 6, 2016
@abeyad abeyad self-assigned this Jul 6, 2016
/**
* This operation will be executed on the caller node.
*/
public abstract void executorNodeOperation(Task task, Request request, ActionListener<Response> listener);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment here should be that the operation will be executed on the executor node?

@rjernst
Copy link
Member

rjernst commented Jul 20, 2016

If we need this, then can we please name it something useful and not "cute" (not that cockroaches are cute, just that it is a toy name). It seems this is more of a "persistent action" or "persistent task"? There should also be package level javadocs explaining all the classes involved and what is necessary to create a new one of these "things" (eg extend this class like this, and extend that class like that, register here, etc).

@imotov
Copy link
Contributor Author

imotov commented Jul 20, 2016

@rjernst the "If we need this..." part hurts, but I will try to ignore it since the rest of your comment is very helpful. I agree, the "persistent action/task" would be a great name but, unfortunately, it is already used by tasks that persist the results on disk, and I was concerned that it might cause confusion. A few of us brainstormed on a concise name for a while, but as you know naming in computer science is a difficult problem, so, this is all we could come up with. I ran it by a few of my colleagues and they liked it so the name stuck. If you have a better name, please let me know. I will be glad to rename it. The package level javadocs suggestion is really useful. Thank you. I will definitely add that.

@rjernst
Copy link
Member

rjernst commented Jul 20, 2016

the "If we need this..." part hurts

I did not mean it to be offensive, sorry. I meant it as "let's be absolutely sure we need to add thousands of lines of infrastructure for this". I see lots of things like this in the codebase that seem like a good idea at the time, but then only a handful of things are converted to the "new model" and we get stuck with tons of half used concepts and abstractions. I want to make sure that will not be the case here. Will all task actions be converted to this?

Regarding the name, will those disk persisted actions be converted to this new model? If so, maybe change those to another name temporarily? If not, why do we need both disk and cluster state persisted actions? If we really need both, then ClusterPersistedAction or ClusterPersistedTask I guess...

@imotov
Copy link
Contributor Author

imotov commented Jul 20, 2016

I see lots of things like this in the codebase that seem like a good idea at the time, but then only a handful of things are converted to the "new model" and we get stuck with tons of half used concepts and abstractions.

Yes, I share your frustration here.

Will all task actions be converted to this?

No. I really hope not. I mean, theoretically, we could do that but it would be quite disastrous since these actions are pretty heavy-weight comparing to normal transport actions. We have a small subset of actions that need to run for a prolong period of time and be resilient to node restarts. Two current examples of such actions are snapshot and restore. In a sense this PR is a generalized, reusable version of what snapshot and restore are currently doing in terms of their lifecycle. I am just extracting and de-duplication it and making it available to other actions such as benchmark api, for example. Even without future modules, when we have this framework in place, we will be able to remove a lot of very complex code from snapshot/restore module and allow it focus on moving files around instead of dealing with all this convoluted lifecycle management logic.

If we really need both, then ClusterPersistedAction or ClusterPersistedTask I guess...

Yes, we need both since these are perpendicular concerns. ClusterPersistedTask sounds to me like a task that persists results into the cluster state. We also toyed with ResilientAction, but it makes all other actions sound non-resilient, which I though might offend Boaz. How does DurableAction sound to you? Other options that I considered are Enduring, Lasting, Undying and Everlasting. They sound kind of meh, but I can go with any of them if they sound "useful". Any other suggestions?

private final Exception failure;
@Nullable
private final String executorNode;
private final String callerNode;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be called responseNode to be consistent with how its used?

if (tasks != null) {
for (CockroachTaskInProgress taskInProgress : tasks.entries()) {
if (localNodeId.equals(taskInProgress.getExecutorNode())) {
RunningCockroachTask cockroachTask = runningTasks.get(taskInProgress.getUuid());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to use notVisitedTasks here instead of runningTasks

@abeyad
Copy link

abeyad commented Jul 21, 2016

How does DurableAction sound to you?

Sounds as reasonable a name as any proposed.

@abeyad
Copy link

abeyad commented Jul 21, 2016

@imotov I left an initial round of feedback. Overall, the code is very intuitive and easy to follow. Great work!

@bleskes
Copy link
Contributor

bleskes commented Jul 21, 2016

makes all other actions sound non-resilient, which I though might offend Boaz.

Thanks for thinking of me. I wouldn't mind it for that reason but I think it does make all other task sound non-resilient and I think that's a discredit to them.

I'm also +1 on Ryan's suggestion to try and not use Cockroaches. Doesn't feel like the right kind of fun to me.

I wasn't part of the said long discussion, so please tell me I'm all wrong but here is what I gather from all the little comments I read. We basically have two types of tasks:

  1. A long running task, typically an administrative task. Those task need to survive the loss of any node and therefore we use the cluster state to coordinate them so if the node that initially received the request dies (and it might, because the task is long), it wouldn't be a problem. Snapshot and Restore fit into this category but I though also (conceptually) reindexing. We can't do it yet because there is no way at the moment recover from a crash of the coordinating node. That will come with Seq Numbers.

  2. A short running task, which is typically a result of a client request. Those requests a fast and if the coordinating node dies, there is no way to respond to the client anyway. Basically all our transport requests result in these tasks (unless they opt out).

Orthogonal to this, we have the question of whether we want to persist the result of a task once it has completed. It seems to me that we actually only want this for long running tasks of the first category, but we implemented for 2 as well because of reindexing which is technically a type 2 because of the reason stated above.

With that in mind, I would suggest that we use TransportTasks (which @imotov already kind of used when he said "Transport Actions") for the second and ClusterTasks for the first.

Looking the code for the persistency aspect it seems it's implemented by PersistedTaskInfo which wraps TaskInfo and maybe PersistentActionListener. I think the first one is OK and doesn't really present an issue. For the second one we can maybe rename to PersistResultActionListener . It seems this is what it does.

@rmuir
Copy link
Contributor

rmuir commented Jul 22, 2016

I did not mean it to be offensive, sorry. I meant it as "let's be absolutely sure we need to add thousands of lines of infrastructure for this". I see lots of things like this in the codebase that seem like a good idea at the time, but then only a handful of things are converted to the "new model" and we get stuck with tons of half used concepts and abstractions. I want to make sure that will not be the case here. Will all task actions be converted to this?

I'm still confused here. Why are we adding this exactly? What is the concrete need?

The problem is, its so easy to add thousands of lines of code for no reason, yet its insanely hard to remove code in elasticsearch, no matter how esoteric (see #19359)

This means, elasticsearch code will continue to grow, unsustainably out of control, unless @rmuir steps up.

Which I am doing right now. Please, lets not add code unless there is a concrete need. It is literally impossible for the abstractions to be correct. Please don't add more abstractions than necessary either: you are doing them wrong if you are not building them around a concrete need, and they will only need to be re-done again.

@imotov
Copy link
Contributor Author

imotov commented Jul 22, 2016

I'm still confused here. Why are we adding this exactly? What is the concrete need?

We are not adding anything new and it's not an esoteric construct. This code is essentially already in snapshot and restore, but it's tied deep inside and sprinkled with file moving logic, which makes it very difficult to test and reason about. I am just trying to extract it into a more testable and cleaner construct, that can be used by other services as well (for example we can consider moving reindex api and watchers to it at some point). The main driving factor is to refactor snapshot and restore first. I was thinking about doing this in one go (add long running task and remove corresponding code from snapshot and restore) but the resulting PR is simply not manageable. So, I am trying to get it in chunks.

@imotov
Copy link
Contributor Author

imotov commented Aug 1, 2016

After the extensive discussion with @mikemccand that took place on Friday and over the weekend I realized that it was a terrible mistake to open this PR without discussing it with @rjernst and @rmuir first. I totally agree, we simply cannot have so many esoteric features in Elasticsearch and the recent discussion with @mikemccand made me realize the important role that @rmuir plays in keeping this number in check. I am extremely grateful to @rmuir for his tireless effort and I am closing this PR as a small token of appreciation.

@imotov imotov closed this Aug 1, 2016
@imotov imotov deleted the background-tasks branch May 1, 2020 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants