
[Task manager][discuss] force implementation of cancel()? Or don't reschedule non-cancellable tasks until complete #95985

Open
pmuellr opened this issue Mar 31, 2021 · 4 comments
Labels
discuss
estimate:needs-research (Estimated as too large and requires research to break down into workable issues)
Feature:Task Manager
impact:high (Addressing this issue will have a high level of impact on the quality/strength of our product.)
resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility)
Team:ResponseOps (Label for the ResponseOps team (formerly the Cases and Alerting teams))

Comments

@pmuellr (Member)

pmuellr commented Mar 31, 2021

Task manager allows tasks to be "cancelled" when they run over their timeout: it calls the cancel() method on their runner, if one is implemented. Tasks are not required to implement this method, though.

We're seeing a situation with a customer involving a task manager task that appears to have overlapping executions - in theory, they should not overlap.

The task invocations overlap because the task exceeds its timeout but does not implement the cancel() method. Whether or not the task implements the method, task manager makes the task schedule-able again once the timeout expires. And because task manager can't actually STOP the task from running (it's just a function invocation, which you have no control over, unlike in Java with thread.kill() or whatever), the task keeps running. And so, overlapping task executions.
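
As a minimal illustration of the underlying behavior (this is not task manager's actual code), a caller can stop waiting on an async function, but it can't stop the function itself:

// Minimal illustration of why a timed-out task keeps running: the caller can
// stop *waiting* for an async function, but can't stop the function itself.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function longRunningTask(): Promise<void> {
  await sleep(10_000); // pretend this is 10s of real work
  console.log('task finished anyway, even though the caller gave up on it');
}

async function runWithTimeout(task: () => Promise<void>, timeoutMs: number): Promise<void> {
  const timedOut = Symbol('timedOut');
  const winner = await Promise.race([task(), sleep(timeoutMs).then(() => timedOut)]);
  if (winner === timedOut) {
    console.log('caller timed out and is free to schedule the next run');
    // ...but longRunningTask() is still running in the background
  }
}

runWithTimeout(longRunningTask, 2_000);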

I repro'd this by hacking / instrumenting x-pack/plugins/alerting/server/health/task.ts like so:

hacked/instrumented alerting health task, changes marked with `/*!*/`
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

import { CoreStart, Logger } from 'kibana/server';
import {
  RunContext,
  TaskManagerSetupContract,
  TaskManagerStartContract,
} from '../../../task_manager/server';
import { AlertsConfig } from '../config';
import { AlertingPluginsStart } from '../plugin';
import { HealthStatus } from '../types';
import { getHealth } from './get_health';

export const HEALTH_TASK_TYPE = 'alerting_health_check';

export const HEALTH_TASK_ID = `Alerting-${HEALTH_TASK_TYPE}`;

export function initializeAlertingHealth(
  logger: Logger,
  taskManager: TaskManagerSetupContract,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
  registerAlertingHealthCheckTask(logger, taskManager, coreStartServices);
}

export async function scheduleAlertingHealthCheck(
  logger: Logger,
  config: Promise<AlertsConfig>,
  taskManager: TaskManagerStartContract
) {
  try {
    const interval = (await config).healthCheck.interval;
    await taskManager.ensureScheduled({
      id: HEALTH_TASK_ID,
      taskType: HEALTH_TASK_TYPE,
      schedule: {
/*!*/ interval: '5s',
      },
      state: {},
      params: {},
    });
  } catch (e) {
    logger.debug(`Error scheduling task, received ${e.message}`);
  }
}

function registerAlertingHealthCheckTask(
  logger: Logger,
  taskManager: TaskManagerSetupContract,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
  taskManager.registerTaskDefinitions({
    [HEALTH_TASK_TYPE]: {
      title: 'Alerting framework health check task',
      createTaskRunner: healthCheckTaskRunner(logger, coreStartServices),
/*!*/ timeout: '2s',
    },
  });
}

export function healthCheckTaskRunner(
  logger: Logger,
  coreStartServices: Promise<[CoreStart, AlertingPluginsStart, unknown]>
) {
/*!*/ let taskRunCount: number = 0;
  return ({ taskInstance }: RunContext) => {
    const { state } = taskInstance;
    return {
/*!*/ async cancel() {
/*!*/   console.log(`${new Date().toISOString()} - cancel() called`)
/*!*/ },
      async run() {
        taskRunCount++;
        const localTaskRunCount = taskRunCount;
/*!*/   console.log(`${localTaskRunCount} - ${new Date().toISOString()} - starting healthCheck task`)
        try {
          const alertingHealthStatus = await getHealth(
            (await coreStartServices)[0].savedObjects.createInternalRepository(['alert'])
          );
/*!*/     console.log(`${localTaskRunCount} - ${new Date().toISOString()} - ending healthChecktask: success`)
/*!*/     console.log(`${localTaskRunCount} - ${new Date().toISOString()} - but waiting another 10 seconds anyway`)
/*!*/     await new Promise((resolve) => setTimeout(resolve, 10 * 1000));
/*!*/     console.log(`${localTaskRunCount} - ${new Date().toISOString()} - ending healthChecktask for reals`)
          return {
            state: {
              runs: (state.runs || 0) + 1,
              health_status: alertingHealthStatus.decryptionHealth.status,
            },
          };
        } catch (errMsg) {
          logger.warn(`Error executing alerting health check task: ${errMsg}`);
/*!*/     console.log(`${localTaskRunCount} - ending healthCheck task: error`)
          return {
            state: {
              runs: (state.runs || 0) + 1,
              health_status: HealthStatus.Error,
            },
          };
        }
      },
    };
  };
}

This changes the task so the interval is 5s, the timeout is 2s, and the task will run for at least 10s.

You can see the overlap in the log messages:
1 - 2021-03-31T19:02:23.119Z - starting healthCheck task
1 - 2021-03-31T19:02:23.136Z - ending healthChecktask: success
1 - 2021-03-31T19:02:23.136Z - but waiting another 10 seconds anyway
server    log   [15:02:25.897] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:25.008Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:25.898Z - cancel() called
2 - 2021-03-31T19:02:29.123Z - starting healthCheck task
2 - 2021-03-31T19:02:29.135Z - ending healthChecktask: success
2 - 2021-03-31T19:02:29.135Z - but waiting another 10 seconds anyway
server    log   [15:02:31.122] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:31.025Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:31.122Z - cancel() called
1 - 2021-03-31T19:02:33.129Z - ending healthChecktask for reals
3 - 2021-03-31T19:02:34.117Z - starting healthCheck task
3 - 2021-03-31T19:02:34.126Z - ending healthChecktask: success
3 - 2021-03-31T19:02:34.126Z - but waiting another 10 seconds anyway
server    log   [15:02:36.019] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:36.017Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:36.019Z - cancel() called
2 - 2021-03-31T19:02:39.129Z - ending healthChecktask for reals
4 - 2021-03-31T19:02:40.127Z - starting healthCheck task
4 - 2021-03-31T19:02:40.137Z - ending healthChecktask: success
4 - 2021-03-31T19:02:40.137Z - but waiting another 10 seconds anyway
server    log   [15:02:42.905] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:42.029Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:42.905Z - cancel() called
3 - 2021-03-31T19:02:44.126Z - ending healthChecktask for reals
5 - 2021-03-31T19:02:46.136Z - starting healthCheck task
5 - 2021-03-31T19:02:46.147Z - ending healthChecktask: success
5 - 2021-03-31T19:02:46.147Z - but waiting another 10 seconds anyway
server    log   [15:02:48.906] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:48.036Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:48.907Z - cancel() called
4 - 2021-03-31T19:02:50.137Z - ending healthChecktask for reals
6 - 2021-03-31T19:02:52.116Z - starting healthCheck task
6 - 2021-03-31T19:02:52.127Z - ending healthChecktask: success
6 - 2021-03-31T19:02:52.127Z - but waiting another 10 seconds anyway
server    log   [15:02:54.914] [warning][plugins][taskManager] Cancelling task alerting_health_check "Alerting-alerting_health_check" as it expired at 2021-03-31T19:02:54.018Z after running for 00m 02s (with timeout set at 2s).
2021-03-31T19:02:54.914Z - cancel() called
5 - 2021-03-31T19:02:56.148Z - ending healthChecktask for reals

It seems to stabilize pretty quickly - after an hour or so of running, it never had more than 3 of these tasks running at once. I'm not sure whether it might slowly grow, or whether other combinations of timeouts, intervals, and actual execution times would end up with a non-constant number of overlapping executions.

What can we do?

One thing we could do is FORCE task implementors to implement a cancel() method, by making it required. We should obviously provide a nice sample for this. Also note: we need to implement cancel() on alerts and actions: issue #64148 (I suspect that won't end up being a "nice sample").
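
For illustration, a rough sketch of what such a sample might look like - a runner where run() and cancel() share an AbortController, so cancelling actually stops the in-flight work. The doWork helper here is a hypothetical stand-in, not the real alerting/actions code:

// Rough sketch of a cancellable task runner; doWork() is a hypothetical
// stand-in for the task's real work.
import { RunContext } from '../../../task_manager/server';

async function doWork(signal: AbortSignal): Promise<void> {
  // Anything that accepts an AbortSignal (fetch, the ES client, etc.) could
  // be wired up the same way.
  return new Promise((resolve, reject) => {
    const timer = setTimeout(resolve, 10 * 1000);
    signal.addEventListener('abort', () => {
      clearTimeout(timer);
      reject(new Error('work aborted'));
    });
  });
}

export function cancellableTaskRunner() {
  return ({ taskInstance }: RunContext) => {
    const abortController = new AbortController();
    return {
      async cancel() {
        // Tell the in-flight run() to stop as soon as it can.
        abortController.abort();
      },
      async run() {
        const { state } = taskInstance;
        try {
          await doWork(abortController.signal);
        } catch (e) {
          // An aborted run still completes promptly, so task manager can
          // safely schedule the next one.
        }
        return { state: { runs: (state.runs || 0) + 1 } };
      },
    };
  };
}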

Another option would be to continue allowing non-cancellable tasks, but not reschedule them until the original async task actually completes - I think that could be possible. We might want to do this even if we require a cancel() method.
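
Very roughly, that would mean task manager holding on to the in-flight run() promise and refusing to start another invocation of the same task until it settles. A hand-wavy sketch (not the actual task manager claiming logic, and per-process only, so it wouldn't help across multiple Kibana nodes on its own):

// Hand-wavy sketch of "don't start another run of a non-cancellable task
// until the previous run() promise settles".
const inFlightRuns = new Map<string, Promise<unknown>>();

async function claimAndRun(taskId: string, run: () => Promise<unknown>): Promise<void> {
  if (inFlightRuns.has(taskId)) {
    // A previous invocation is still running; skip this cycle rather than
    // starting an overlapping execution.
    return;
  }
  const promise = run().finally(() => inFlightRuns.delete(taskId));
  inFlightRuns.set(taskId, promise);
  await promise;
}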

@mikecote pointed out this related PR: #83682

@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member, Author)

pmuellr commented Mar 31, 2021

Realizing there may be some other related issues here, so I'm just going to append them to this issue; we can split them out into separate issues later if needed.

Issue: task manager health status API does not indicate cancelled tasks

There's a place for it, I think, under stats.runtime.value.execution.result_frequency_percent_as_number, where you see the following object per task type:

{
  "Success": 100,
  "RetryScheduled": 0,
  "Failed": 0,
  "status": "OK"
}

For the tasks that were cancelled, I can still see execution times in the health stats that indicate they were very likely cancelled (the execution time just exceeds the timeout), but the frequency numbers in the object show Success: 100. Seems like we could add Cancelled here.
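
For example (hypothetical shape - Cancelled isn't an existing field today), the per-task-type object could look like:

{
  "Success": 90,
  "RetryScheduled": 0,
  "Failed": 0,
  "Cancelled": 10,
  "status": "OK"
}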

@tsullivan (Member)

++ on making a cancel method required in the task definition

@gmmorris (Contributor)

gmmorris commented Apr 6, 2021

+1.

It seems too easy for a task to overrun its timeout, at which point we call cancel if it has been implemented, after which the task can be executed again.
If cancel doesn't do anything (or isn't implemented) then that task can continue to run even after the timeout.
This means that if the next scheduled execution is reached and the task runs again, then you can end up with more than one instance of that task running in parallel.

This seems too brittle considering the complexity of the tasks we're now writing (async search, reporting, etc.).

We should consider making this required and adding some kind of requirement/guideline as to what cancel should do, which we can then validate either via review or ideally, via automated testing of some kind. 🤔

cc @lizozom for async search, @tsullivan for reporting
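
For what an automated check might look like, here's a rough jest-style sketch against the hypothetical cancellableTaskRunner sketched earlier in this issue (not an existing test):

// Rough jest-style sketch: assert that once cancel() is called, the run()
// promise settles well before the task's configured timeout.
test('run() settles shortly after cancel() is called', async () => {
  const runner = cancellableTaskRunner()({ taskInstance: { state: {} } } as any);
  const runPromise = runner.run();
  await runner.cancel();
  const stillRunning = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('run() did not settle after cancel()')), 1000)
  );
  // If cancel() really aborts the work, run() wins this race easily.
  await expect(Promise.race([runPromise, stillRunning])).resolves.toBeDefined();
});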
