
[Task manager][discuss] force implementation of cancel()? Or don't reschedule non-cancellable tasks until complete #95985

Open
pmuellr opened this issue Mar 31, 2021 · 4 comments
Labels
discuss
estimate:needs-research (Estimated as too large and requires research to break down into workable issues)
Feature:Task Manager
impact:high (Addressing this issue will have a high level of impact on the quality/strength of our product.)
resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility)
Team:ResponseOps (Label for the ResponseOps team (formerly the Cases and Alerting teams))

Comments

@pmuellr (Member)

pmuellr commented Mar 31, 2021

Task manager allows tasks to be "cancelled" when they run over their timeout: it calls the cancel() method on their runner, if one is implemented. Tasks are not required to implement this method, though.

We're seeing a situation with a customer involving a task manager task that appears to have overlapping executions - in theory, they should not overlap.

The task invocations overlap because the task exceeds its timeout but does not implement the cancel() method. Whether or not the task implements the method, task manager makes the task schedule-able again once the timeout expires. And because task manager can't actually STOP the task from running (it's just a function invocation, which you have no control over, unlike in Java with thread.kill() or whatever), the task keeps running. And so, overlapping task executions.
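
As a minimal illustration of the underlying behavior (this is not task manager's actual code), a caller can stop waiting on an async function, but it can't stop the function itself:

// Minimal illustration of why a timed-out task keeps running: the caller can
// stop *waiting* for an async function, but can't stop the function itself.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function longRunningTask(): Promise<void> {
  await sleep(10_000); // pretend this is 10s of real work
  console.log('task finished anyway, even though the caller gave up on it');
}

async function runWithTimeout(task: () => Promise<void>, timeoutMs: number): Promise<void> {
  const timedOut = Symbol('timedOut');
  const winner = await Promise.race([task(), sleep(timeoutMs).then(() => timedOut)]);
  if (winner === timedOut) {
    console.log('caller timed out and is free to schedule the next run');
    // ...but longRunningTask() is still running in the background
  }
}

runWithTimeout(longRunningTask, 2_000);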

I repro'd this by hacking / instrumenting x-pack/plugins/alerting/server/health/task.ts like so:

hacked/instrumented alerting health task, changes marked with `/*!*/`
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

import { CoreStart, Logger } from 'kibana/server';
import {
  RunContext,
  TaskManagerSetupContract,
  TaskManagerStartContract,
} from '../../../task_manager/server';
import { AlertsConfig } from '../config';
import { AlertingPluginsStart } from '../plugin';
import { HealthStatus } from '../types';
import { getHealth } from './get_health';

export const HEALTH_TASK_TYPE = 'alerting_health_check';

export const HEALTH_TASK_ID = `Alerting-${HEALTH_TASK_TYPE}`;

export function initializeAlertingHealth(
  logger: Logger,
  taskManager: TaskManagerSetupContract,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
  registerAlertingHealthCheckTask(logger, taskManager, coreStartServices);
}

export async function scheduleAlertingHealthCheck(
  logger: Logger,
  config: Promise<AlertsConfig>,
  taskManager: TaskManagerStartContract
) {
  try {
    const interval = (await config).healthCheck.interval;
    await taskManager.ensureScheduled({
      id: HEALTH_TASK_ID,
      taskType: HEALTH_TASK_TYPE,
      schedule: {
/*!*/ interval: '5s',
      },
      state: {},
      params: {},
    });
  } catch (e) {
    logger.debug(`Error scheduling task, received ${e.message}`);
  }
}

function registerAlertingHealthCheckTask(
  logger: Logger,
  taskManager: TaskManagerSetupContract,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
  taskManager.registerTaskDefinitions({
    [HEALTH_TASK_TYPE]: {
      title: 'Alerting framework health check task',
      createTaskRunner: healthCheckTaskRunner(logger, coreStartServices),
/*!*/ timeout: '2s',
    },
  });
}

export function healthCheckTaskRunner(
  logger: Logger,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
/*!*/ let taskRunCount: number = 0;
  return ({ taskInstance }: RunContext) => {
    const { state } = taskInstance;
    return {
/*!*/ async cancel() {
/*!*/   console.log(`${new Date().toISOString()} - cancel() called`)
/*!*/ },
      async run() {
        taskRunCount++;
        const localTaskRunCount = taskRunCount;
/*!*/   console.log(`${localTaskRunCount} - ${new Date().toISOString()} - starting healthCheck task`)
        try {
          const alertingHealthStatus = await getHealth(
            (await coreStartServices)[0].savedObjects.createInternalRepository(['alert'])
          );
/*!*/     console.log(`${localTaskRunCount} - ${new Date().toISOString()} - ending healthChecktask: success`)
/*!*/     console.log(`${localTaskRunCount} - ${new Date().toISOString()} - but waiting another 10 seconds anyway`)
/*!*/     await new Promise((resolve) => setTimeout(resolve, 10 * 1000));
/*!*/     console.log(`${localTaskRunCount} - ${new Date().toISOString()} - ending healthChecktask for reals`)
          return {
            state: {
              runs: (state.runs || 0) + 1,
              health_status: alertingHealthStatus.decryptionHealth.status,
            },
          };
        } catch (errMsg) {
          logger.warn(`Error executing alerting health check task: ${errMsg}`);
/*!*/     console.log(`${localTaskRunCount} - ending healthCheck task: error`)
          return {
            state: {
              runs: (state.runs || 0) + 1,
              health_status: HealthStatus.Error,
            },
          };
        }
      },
    };
  };
}

This changes the task so the interval is 5s, the timeout is 2s, and the task will run for at least 10s.

You can see the overlap in the log messages:
1 - 2021-03-31T19:02:23.119Z - starting healthCheck task
1 - 2021-03-31T19:02:23.136Z - ending healthChecktask: success
1 - 2021-03-31T19:02:23.136Z - but waiting another 10 seconds anyway
server    log   [15:02:25.897] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:25.008Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:25.898Z - cancel() called
2 - 2021-03-31T19:02:29.123Z - starting healthCheck task
2 - 2021-03-31T19:02:29.135Z - ending healthChecktask: success
2 - 2021-03-31T19:02:29.135Z - but waiting another 10 seconds anyway
server    log   [15:02:31.122] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:31.025Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:31.122Z - cancel() called
1 - 2021-03-31T19:02:33.129Z - ending healthChecktask for reals
3 - 2021-03-31T19:02:34.117Z - starting healthCheck task
3 - 2021-03-31T19:02:34.126Z - ending healthChecktask: success
3 - 2021-03-31T19:02:34.126Z - but waiting another 10 seconds anyway
server    log   [15:02:36.019] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:36.017Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:36.019Z - cancel() called
2 - 2021-03-31T19:02:39.129Z - ending healthChecktask for reals
4 - 2021-03-31T19:02:40.127Z - starting healthCheck task
4 - 2021-03-31T19:02:40.137Z - ending healthChecktask: success
4 - 2021-03-31T19:02:40.137Z - but waiting another 10 seconds anyway
server    log   [15:02:42.905] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:42.029Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:42.905Z - cancel() called
3 - 2021-03-31T19:02:44.126Z - ending healthChecktask for reals
5 - 2021-03-31T19:02:46.136Z - starting healthCheck task
5 - 2021-03-31T19:02:46.147Z - ending healthChecktask: success
5 - 2021-03-31T19:02:46.147Z - but waiting another 10 seconds anyway
server    log   [15:02:48.906] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:48.036Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:48.907Z - cancel() called
4 - 2021-03-31T19:02:50.137Z - ending healthChecktask for reals
6 - 2021-03-31T19:02:52.116Z - starting healthCheck task
6 - 2021-03-31T19:02:52.127Z - ending healthChecktask: success
6 - 2021-03-31T19:02:52.127Z - but waiting another 10 seconds anyway
server    log   [15:02:54.914] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:54.018Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:54.914Z - cancel() called
5 - 2021-03-31T19:02:56.148Z - ending healthChecktask for reals

It seems to stabilize pretty quickly - after an hour or so of running, it never had more than 3 of these tasks running at once. I'm not sure whether it might slowly grow, or whether other combinations of timeouts, intervals, and actual execution times would end up with a non-constant number of overlapping executions.

What can we do?

One thing we could do is FORCE task implementors to implement a cancel() method, by making it required. We should obviously provide a nice sample for this. Also note: we need to implement cancel() on alerts and actions: issue #64148 (I suspect that won't end up being a "nice sample").
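
For illustration, a rough sketch of what such a sample might look like - a runner where run() and cancel() share an AbortController, so cancelling actually stops the in-flight work. The doWork helper here is a hypothetical stand-in, not the real alerting/actions code:

// Rough sketch of a cancellable task runner; doWork() is a hypothetical
// stand-in for the task's real work.
import { RunContext } from '../../../task_manager/server';

async function doWork(signal: AbortSignal): Promise<void> {
  // Anything that accepts an AbortSignal (fetch, the ES client, etc.) could
  // be wired up the same way.
  return new Promise((resolve, reject) => {
    const timer = setTimeout(resolve, 10 * 1000);
    signal.addEventListener('abort', () => {
      clearTimeout(timer);
      reject(new Error('work aborted'));
    });
  });
}

export function cancellableTaskRunner() {
  return ({ taskInstance }: RunContext) => {
    const abortController = new AbortController();
    return {
      async cancel() {
        // Tell the in-flight run() to stop as soon as it can.
        abortController.abort();
      },
      async run() {
        const { state } = taskInstance;
        try {
          await doWork(abortController.signal);
        } catch (e) {
          // An aborted run still completes promptly, so task manager can
          // safely schedule the next one.
        }
        return { state: { runs: (state.runs || 0) + 1 } };
      },
    };
  };
}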

Another option would be to continue allowing non-cancellable tasks, but not reschedule them until the original async task actually completes - I think that could be possible. We might want to do this even if we require a cancel() method.
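
Very roughly, that would mean task manager holding on to the in-flight run() promise and refusing to start another invocation of the same task until it settles. A hand-wavy sketch (not the actual task manager claiming logic, and per-process only, so it wouldn't help across multiple Kibana nodes on its own):

// Hand-wavy sketch of "don't start another run of a non-cancellable task
// until the previous run() promise settles".
const inFlightRuns = new Map<string, Promise<unknown>>();

async function claimAndRun(taskId: string, run: () => Promise<unknown>): Promise<void> {
  if (inFlightRuns.has(taskId)) {
    // A previous invocation is still running; skip this cycle rather than
    // starting an overlapping execution.
    return;
  }
  const promise = run().finally(() => inFlightRuns.delete(taskId));
  inFlightRuns.set(taskId, promise);
  await promise;
}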

@mikecote pointed out this related PR: #83682

@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member, Author)

pmuellr commented Mar 31, 2021

Realizing there may be some other related issues here, so I'm just going to append them to this issue; we can split them out into separate issues later if needed.

Issue: task manager health status API does not indicate cancelled tasks

There's a place for it, I think, under stats.runtime.value.execution.result_frequency_percent_as_number, where you see the following object per task type:

{
  "Success": 100,
  "RetryScheduled": 0,
  "Failed": 0,
  "status": "OK"
}

For the tasks that were cancelled, I can still see execution times in the health stats that indicate they were very likely cancelled (the execution time just exceeds the timeout), but the frequency numbers in the object show Success: 100. Seems like we could add Cancelled here.
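
For example (hypothetical shape - Cancelled isn't an existing field today), the per-task-type object could look like:

{
  "Success": 90,
  "RetryScheduled": 0,
  "Failed": 0,
  "Cancelled": 10,
  "status": "OK"
}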

@tsullivan (Member)

++ on making a cancel method required in the task definition

@gmmorris (Contributor)

gmmorris commented Apr 6, 2021

+1.

It seems too easy for a task to overrun its timeout, at which point we call cancel if it has been implemented, after which the task can be executed again.
If cancel doesn't do anything (or isn't implemented) then that task can continue to run even after the timeout.
This means that if the next scheduled execution is reached and the task runs again, then you can end up with more than one instance of that task running in parallel.

This seems too brittle considering the complexity of the tasks we're now writing (async search, reporting, etc.).

We should consider making this required and adding some kind of requirement/guideline as to what cancel should do, which we can then validate either via review or ideally, via automated testing of some kind. 🤔

cc @lizozom for async search, @tsullivan for reporting
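
For what an automated check might look like, here's a rough jest-style sketch against the hypothetical cancellableTaskRunner sketched earlier in this issue (not an existing test):

// Rough jest-style sketch: assert that once cancel() is called, the run()
// promise settles well before the task's configured timeout.
test('run() settles shortly after cancel() is called', async () => {
  const runner = cancellableTaskRunner()({ taskInstance: { state: {} } } as any);
  const runPromise = runner.run();
  await runner.cancel();
  const stillRunning = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('run() did not settle after cancel()')), 1000)
  );
  // If cancel() really aborts the work, run() wins this race easily.
  await expect(Promise.race([runPromise, stillRunning])).resolves.toBeDefined();
});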
