
Background task services #18854

Closed
epixa opened this issue May 6, 2018 · 19 comments
Labels
enhancement, Feature:New Platform, Team:Core

Comments


epixa commented May 6, 2018

After application handlers and http route handlers, the next most common place for business logic to happen in a plugin is in some sort of startup or interval-based behavior. We lack a consistent way to handle this sort of business logic, so plugins instead resort to managing their own interval behaviors or worse - blocking plugin setup or start routines on some sort of asynchronous event (e.g. creating an index in Elasticsearch). Ultimately, these things are all "tasks", and we should have a service that can handle them consistently.

There are two types of tasks we're concerned with:

A: Ephemeral tasks that run on each Kibana instance, make no attempt to coordinate with other instances, and don't care about anything beyond the life of the current process. These are the types of tasks we have no current solution for.

B: Distributed tasks that coordinate across all instances in a single Kibana deployment, where only one instance will execute a given task at any time, and all instances will share the burden of processing the task execution queue. We currently have the task_manager plugin in x-pack for this.

Examples of these two types of tasks:

Ephemeral:

  • Elasticsearch healthcheck (recurring)
  • License sync (recurring)
  • Custom index creation (retry until success)

Distributed:

  • Reporting job processing
  • Rolling up usage data for telemetry
  • Alert processing

I propose we create a TaskService in core which will handle the basics of task processing, like registering tasks with contextual handlers, optional duration-based intervals, and "retry" behaviors.
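To make this concrete, here is a rough sketch of what registering an ephemeral task could look like. Every name below (the service interface, `register` signature, option names, context shape) is illustrative only, not an existing API:

```ts
// Hypothetical sketch of the proposed core TaskService; none of these names exist yet.
interface TaskContext {
  core: {
    elasticsearch: { adminClient: { ping(): Promise<unknown> } };
  };
}

interface EphemeralTaskOptions {
  interval?: string; // e.g. '30s' for a recurring task; omit for a one-shot task
  retry?: { maxAttempts?: number; backoff?: string }; // "retry until success" style behavior
}

interface TaskServiceSetup {
  register(
    id: string,
    options: EphemeralTaskOptions,
    handler: (context: TaskContext) => Promise<void>,
  ): void;
}

// Example usage from a plugin's setup():
function setupMyPlugin(tasks: TaskServiceSetup) {
  tasks.register('es-healthcheck', { interval: '30s' }, async (context) => {
    // The Elasticsearch client arrives via the task's context handler, so a
    // slow or unavailable cluster never blocks Kibana startup.
    await context.core.elasticsearch.adminClient.ping();
  });
}
```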

We'll repurpose the existing task_manager to extend the core task service with its necessary distributed task functionality. If task_manager were ever to move to OSS, then we'd merge the distributed task functionality into the core task service.

With the core task service in place, I propose we stop passing the Elasticsearch client and Saved Objects client to plugin lifecycle functions in favor of passing them via contextual handlers to applications, http routes, and tasks. We already do or plan to do this for the first two, and I can't imagine an existing situation that isn't addressed by one of those three. The benefit of doing this is that plugins are less likely to block Kibana startup on business logic that depends on Elasticsearch.
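For illustration, this is roughly the handler-context shape being described; the interface and route names here are hypothetical placeholders, not the real http service API:

```ts
// Illustrative only: route handlers receive the clients through a per-request
// context instead of the plugin lifecycle. All names are hypothetical.
interface RequestHandlerContext {
  core: {
    elasticsearch: { adminClient: { search(params: object): Promise<unknown> } };
    savedObjects: { client: { get(type: string, id: string): Promise<unknown> } };
  };
}

interface Router {
  get(
    route: { path: string },
    handler: (context: RequestHandlerContext) => Promise<unknown>,
  ): void;
}

function registerRoutes(router: Router) {
  router.get({ path: '/api/my_plugin/status' }, async (context) => {
    // Both clients are resolved when the request arrives, not at plugin setup,
    // so nothing here can delay Kibana from starting to serve traffic.
    return context.core.elasticsearch.adminClient.search({ index: '.my-plugin-status' });
  });
}
```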

Original description: In the new platform, we need a service on both the server and the browser that is responsible for the lifecycle of background tasks. Background tasks are long-running tasks (e.g. polling Elasticsearch) that follow the standard start/stop lifecycle. Like request handlers registered with the HTTP service, background tasks will be given access to capabilities that are relevant to their own context.

The ability to register background tasks should be exposed to plugins.

It's the contextual capabilities along with controlled lifecycle that make background tasks a necessary abstraction. Otherwise, we'd just start throwing random setIntervals around start functions throughout the codebase, and then we can't easily reason about, extend, or influence the tasks that are happening all over.

epixa added the Team:Core, Feature:New Platform, and enhancement labels May 6, 2018
epixa changed the title from Background task service to Background task services May 7, 2018

epixa commented Sep 3, 2019

@elastic/kibana-platform I updated the description here to include a high-level proposal of how TaskService would fit alongside the existing task_manager plugin and the impact it would have on the plugin lifecycle. I'd appreciate your thoughts; with consensus we can move on to the next step of designing the service API.


epixa commented Sep 4, 2019

@elastic/kibana-stack-services since this is related to task_manager


pmuellr commented Sep 4, 2019

This makes sense to me. I wonder if we can fit the ephemeral tasks into the eventual "task viewer" ui - I think this means that we would write a persistent log entry for those tasks as well as the distributed ones - seems doable. And would be quite nice to see all the bits that are running in some kind of UI!


pmuellr commented Sep 4, 2019

I propose we stop passing the Elasticsearch client and Saved Object client to plugin lifecycle functions in favor of passing them via contextual handlers to applications, http, and tasks.

I'm not (yet) up-to-speed on context handlers, but hopefully plain-ole plugins could get access to ES and SOs, even if they aren't an app, server, or task? One thing I'm also wondering is whether the "distributed" tasks will end up having to call an ephemeral task to do their ES work - that might work out great, but perhaps adding additional layers will complicate things. I guess we'll see.


pmuellr commented Sep 4, 2019

I especially like the built-in retry logic - we have some slightly interesting flavor in actions already, where an action can return "this failed but seems retry-able in the future" and "this failed, but the service said we can retry at 4:31:02" (e.g. a 429 rate-limiting response from Slack). That's all we've needed thus far.

Just having retry logic at all will be nice, as currently our "execute an action now" http endpoint doesn't bother with retries - actions are currently only retried when executed as part of an alert firing. We'd basically get some retry logic for free by having the action run in an ephemeral task instead.

Kinda similar topic - I'm not sure we have task cancellation right now, but we should. And it should be good enough that if I cancel a task, the "task code" can try to cancel the underlying thing it's trying to do, e.g. an http request that's hanging: cancel a task that's "running" that http request, and hopefully the task code will be able to cancel the http request as well. I haven't gotten to the point of figuring out that shape yet.
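A rough sketch of how a run result could carry that retry information, with a cancellation hook threaded through to the underlying work. Every type and name here is hypothetical, and the Slack URL is a placeholder:

```ts
// Hypothetical shapes only: a task run result that can ask for a retry, and a
// handler that receives an abort signal so cancellation can reach the in-flight work.
type TaskRunResult =
  | { status: 'ok' }
  | { status: 'failed'; retry: false }
  | { status: 'failed'; retry: true; retryAt?: Date }; // e.g. honoring a 429 Retry-After

type TaskHandler = (context: unknown, abortSignal: AbortSignal) => Promise<TaskRunResult>;

const sendToSlack: TaskHandler = async (_context, abortSignal) => {
  const response = await fetch('https://hooks.slack.example/notify', {
    method: 'POST',
    signal: abortSignal, // cancelling the task aborts the in-flight HTTP request
  });
  if (response.status === 429) {
    const retryAfterSeconds = Number(response.headers.get('retry-after') ?? 60);
    return {
      status: 'failed',
      retry: true,
      retryAt: new Date(Date.now() + retryAfterSeconds * 1000),
    };
  }
  return response.ok ? { status: 'ok' } : { status: 'failed', retry: true };
};
```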

@tsullivan

stop passing the Elasticsearch client and Saved Object client to plugin lifecycle functions

I'm not familiar with this, and it's not clear if it refers to a part of the new platform or something changed in legacy that I didn't know about. Can you link to what this refers to?

@joshdover

@tsullivan

stop passing the Elasticsearch client and Saved Object client to plugin lifecycle functions

I'm not familiar with this, and it's not clear if it refers to a part of the new platform or something changed in legacy that I didn't know about. Can you link to what this refers to?

This refers to using the Core APIs provided in "handler functions" as proposed by this RFC.

Court is arguing that once we have a TaskService in OSS/Core, we could remove most of the Core APIs from the "core" object plugins receive in their setup and start lifecycle methods.

I tend to agree; once we have this in place, there is a much smaller set of APIs that I think plugins will need access to in order for Kibana to begin running and serving traffic.


pmuellr commented Sep 4, 2019

Remember that we need to be a little careful with something like task retries when the task is invoked primarily as an http request (or maybe any time with an http request). We don't want a request to be hanging out for minutes waiting for a retry on something. E.g., execute a Slack action via http request, but it's rate-limited and says try again in 1 minute: should the http request fail, or should it sit there for 1 minute and try again? But it would sure be nice to have some basic retry support to handle transient 50x responses from flaky load balancers (a frequent pain in my side).


epixa commented Sep 4, 2019

I wonder if we can fit the ephemeral tasks into the eventual "task viewer" ui - I think this means that we would write a persistent log entry for those tasks as well as the distributed ones - seems doable

Yes! Whether it ultimately would be part of a tasks ui, an audit log, or stack monitoring, that's exactly the sort of behavior I think this service would enable. We can't make any sense of those features when they implement their own setInterval-like behavior, but we can if they all happen through a controlled service.

I'm not (yet) up-to-speed on context handlers, but hopefully plain-ole plugins could get access to es and SO's, even if they aren't an app, server, or task?

Can you provide a real-world example? I am unfamiliar with a situation that wouldn't be addressed by either routes, applications, or tasks.

One thing I'm also wondering is whether the "distributed" tasks will end up having to call an ephemeral task to do their ES work

This may come out in the service design, but I hadn't intended it at least. I assumed distributed tasks would behave similarly to ephemeral tasks up until the moment that wasn't sufficient for us.

We'd basically get some retry logic for free by having the action run in an ephemeral task instead.

Fair warning though, ephemeral tasks wouldn't be tracked anywhere, so they all permanently "die" when the process exits. It sounds like that is no worse than the current situation with actions, but in the future we might want to make the retry logic itself extensible so we could add some durability in the form of persistence.


rudolf commented Sep 9, 2019

The benefit of doing this is that it is less likely that plugins will block Kibana startup on business logic that depends on Elasticsearch.

A plugin's start() function can block the whole platform because we run plugin start functions in series, so that each plugin can expose an API to the plugins that depend on it.

If a plugin doesn't get access to a task's result, it can't await that task or block on it. But it also can't expose the business logic contained in the task as an API to other plugins - unless we limit plugins to sharing their APIs as registered context, i.e. it's not possible to return a setup/start contract.

Although this ensures Kibana startup cannot be blocked, a task service which doesn't return results might be too limited to be generally useful. Perhaps we should split these into two separate problems.
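A small illustration of the blocking concern, with hypothetical plugin code (the class, contract, and method names are made up for the example):

```ts
// Illustrative only: plugins start in series so that each can hand its contract
// to the plugins that depend on it, which is why one slow start() delays them all.
interface MyPluginStart {
  getLicenseStatus(): string;
}

class MyPlugin {
  private licenseStatus = 'unknown'; // kept fresh by a background task in this sketch

  public async start(): Promise<MyPluginStart> {
    // Anti-pattern: awaiting Elasticsearch here (e.g. index creation) would hold
    // up every plugin that starts after this one.
    // await waitForIndexToExist();

    // Instead, expose state that the task keeps up to date and return the
    // contract immediately.
    return { getLicenseStatus: () => this.licenseStatus };
  }
}
```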

@joshdover

I've been thinking about this more, and I'm skeptical it's a hard blocker for plugin migrations. I think it's something we should do, probably even during 7.x. But I'm not convinced it (1) provides much value to migrating plugins right now compared to other things, or (2) is so difficult that we need to get a head start on it.

The other argument for doing it now would be so we can ease plugins off of the lifecycle APIs sooner than later. I don't think this is a great argument because moving from lifecycle APIs to a background task handler is trivial.

Thoughts?


rudolf commented Sep 10, 2019

I agree. API churn is frustrating and we should aim to avoid it, but with a rather large list of hard technical blockers, I think we should aim for unblocking early adopter teams first. It might be worth continuing to flesh out the design and creating an RFC during 7.5 to reduce any risks and validate our assumption that moving to a task handler would be trivial.


rudolf commented Sep 10, 2019

Something to keep in mind with exposing APIs from context handlers is that the context will have to be "updated" when the API is consumed. So if Plugin A exposes an API which uses Elasticsearch exposed through a context, we need to make sure that Plugin A's API uses the latest value of the Elasticsearch client at the time when Plugin B consumes Plugin A's API.

Our context/handler pattern caters for this, but it does place an additional burden on API creators in order to simplify things for consumers. Creating an API that isn't side-effect free can lead to subtle bugs.


epixa commented Sep 10, 2019

+1 to prioritizing this below other migration-blocking issues. The only real value in getting this done soon is minimizing "bad habits" that will need to be refactored away later, but that's less urgent than the overall migration effort.

If a plugin doesn't get access to a task's result, it can't await that task or block on it. But it also can't expose the business logic contained in the task as an API to other plugins - unless we limit plugins to sharing their APIs as registered context, i.e. it's not possible to return a setup/start contract.

Even registering an API to context wouldn't address this problem entirely. It is an extra burden for sure, but it's a burden plugins already have today - plugins must not block the startup of Kibana. Since our APIs don't enforce that, developers accidentally block the startup of Kibana sometimes and we treat these things as bugs (sometimes blocking) today.

The most common example of this is when a plugin wants to block startup on some Elasticsearch operation (e.g. index creation) because their entire UI and API depends on that and it is a lot easier if they can assume that setup is done. Unfortunately, Elasticsearch availability is never guaranteed including during Kibana startup, and it's important that Kibana functions otherwise.

So we must build our APIs to be resilient to these sorts of errors, such as "API failed due to Elasticsearch being unavailable".

Something to keep in mind with exposing APIs from context handlers is that the context will have to be "updated" when the API is consumed. So if Plugin A exposes an API which uses Elasticsearch exposed through a context, we need to make sure that Plugin A's API uses the latest value of the Elasticsearch client at the time when Plugin B consumes Plugin A's API.

While this might be true under certain conditions, I think it'll be limited in practice. The context is recreated every time it is injected, so if you were exposing an API through the context that relied on context.core.elasticsearch.adminClient, then that would be the most recent adminClient at the time the handler was executed. Plugin developers would only need to manually manage updating state if they were storing references to old clients in their plugin or for their own state, which they'll need to manage in a reactive way similarly to how core does.
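A small sketch of the difference, using illustrative names around context.core.elasticsearch.adminClient (the provider registration function and plugin names are hypothetical):

```ts
// Illustrative only: a context provider runs every time a handler context is
// built, so the client it reads is always the current one. Only code that
// stashes a client reference for later has to manage staleness itself.
interface CoreContext {
  core: { elasticsearch: { adminClient: { count(params: object): Promise<unknown> } } };
}

type ContextProvider = (context: CoreContext) => unknown;

function registerMyPluginContext(
  registerProvider: (name: string, provider: ContextProvider) => void,
) {
  registerProvider('myPlugin', (context) => ({
    // Resolved at handler execution time, so this is the latest adminClient.
    countDocs: () => context.core.elasticsearch.adminClient.count({ index: 'my-plugin-index' }),
  }));
}
```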


alexk307 commented Apr 1, 2020

Wanted to +1 this issue, as it applies to a use case we're working on for the Endpoint team.

We have a notion of whitelists and blacklists that users can add to in order to allow or block execution of processes that match specified criteria. The current planned architecture will have our hosts poll an API in Kibana (using the ingest plugin) to determine whether the artifacts have changed (i.e. a user in Kibana added an item to one of the lists). When a host sees that the lists have changed (it compares an md5 hash of its current list to what's available), it hits another API to download the list.

Server side, we need to compress each list in order to save bandwidth. Since compression and calls to Elasticsearch take time and resources, we decided to cache the lists in memory: on add/removal of items from the lists, we fetch the entire list of items, compress it, and store it in memory. This works well with one Kibana, but the idea doesn't hold when we need to deploy multiple Kibanas behind a load-balancing proxy - each Kibana would serve one request and only that instance's cache would be up to date.

Having something as discussed in this issue could help us keep caches in sync across multiple Kibana instances. If we could schedule ephemeral tasks to all other Kibanas (a la a pub/sub type model), we could alert them that their cache is stale and force them to re-fetch and re-compress the data.

Bamieh assigned and then unassigned Bamieh Apr 2, 2020

rudolf commented Nov 5, 2020

@alexk307 It sounds like what you want is to be able to schedule a "cache invalidation task" that should run on all Kibana instances. This still means that the cache is stale for an unbounded amount of time. These kinds of distributed problems are almost always better solved by a design that doesn't require coordination between distributed nodes. Is this still a problem you're actively working on or investigating?
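For illustration only, one coordination-free shape for this is a version-stamped lazy rebuild on each instance: every Kibana compares a stored version stamp against its cached copy and rebuilds locally when they differ, so no instance has to notify the others. The interfaces below are hypothetical:

```ts
// Purely illustrative sketch of a coordination-free cache: every instance
// rebuilds its local copy lazily when the stored version stamp changes.
interface ArtifactStore {
  getVersion(): Promise<string>;            // e.g. md5 of the current list, kept in ES
  getCompressedList(): Promise<Uint8Array>; // recompute/compress on demand
}

class ArtifactCache {
  private version = '';
  private compressed: Uint8Array | null = null;

  constructor(private readonly store: ArtifactStore) {}

  async get(): Promise<Uint8Array> {
    const latest = await this.store.getVersion();
    if (this.compressed === null || latest !== this.version) {
      // Stale or empty: rebuild this instance's copy; other instances do the
      // same on their own schedule, with no cross-instance messaging.
      this.compressed = await this.store.getCompressedList();
      this.version = latest;
    }
    return this.compressed;
  }
}
```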

@pgayvallet

A lot has changed regarding what we may want/need and what the constraints of a background task service could be.

I wonder if the content in this issue would still be useful as context, or if everything is kinda obsolete and we'd be better off closing the issue.

@rudolf WDYT?


rudolf commented Apr 17, 2024

++ let's close it 🕺

rudolf closed this as completed Apr 17, 2024
@alexfrancoeur

end of the kibana-as-middleware era 🫡
