
[DSIP-104][Alert] Support Absolute Time SLA Monitoring (Start/End Time) #17836

@victorsheng

Search before asking

  • I have searched the existing DSIPs and found no similar proposal.

Motivation

Apache DolphinScheduler currently provides "Timeout Alarms" based on relative duration (e.g., alerting if a task runs longer than 30 minutes). However, production SLAs are typically defined by absolute wall-clock time.

Problem Statement:

  • Business Deadline: Many pipelines must complete by a specific time (e.g., 08:00 AM) to meet downstream business reports.
  • Delayed Start: Critical tasks must start by a certain time (e.g., 02:00 AM). If they are stuck in the queue or delayed by upstream dependencies, the system should alert before the "end-time" is even reached.
  • Observability Gap: There is currently no persistent record of SLA violations, making it difficult to generate SLA compliance reports (e.g., "What percentage of tasks finished by 09:00 AM last month?").

Introducing absolute time SLA monitoring and a dedicated violation record table will provide better governance and auditability for critical data pipelines.

Design Detail

1. Metadata Configuration:
Add the following fields to t_ds_workflow_definition and t_ds_task_definition:

  • expected_start_time: Wall-clock time by which the instance must have started (e.g., 02:00).
  • expected_end_time: Wall-clock time by which the instance must have finished (e.g., 08:00).
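
To make the two fields concrete, here is a minimal sketch of how wall-clock values could be resolved into absolute deadlines for one run date. It is written in Python for brevity, not in the project's Java, and is not the proposed implementation. `SlaConfig`, `parse_hhmm` and `resolve_deadlines` are hypothetical names. The "HH:MM" storage format, the use of naive (timezone-less) datetimes, and the rule that an end time not later than the start time rolls over to the next day are all assumptions that the proposal does not specify.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class SlaConfig:
    """Hypothetical in-memory view of the two proposed columns ("HH:MM" or None)."""
    expected_start_time: Optional[str] = None
    expected_end_time: Optional[str] = None


def parse_hhmm(value: str) -> time:
    """Parse an "HH:MM" wall-clock string such as "02:00"."""
    return datetime.strptime(value, "%H:%M").time()


def resolve_deadlines(
    cfg: SlaConfig, run_date: date
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Turn the configured wall-clock times into absolute deadlines for one run date.

    Assumption: if the end time is not later than the start time, the end
    deadline belongs to the following calendar day.
    """
    start = (
        datetime.combine(run_date, parse_hhmm(cfg.expected_start_time))
        if cfg.expected_start_time
        else None
    )
    end = (
        datetime.combine(run_date, parse_hhmm(cfg.expected_end_time))
        if cfg.expected_end_time
        else None
    )
    if start is not None and end is not None and end <= start:
        end += timedelta(days=1)
    return start, end
```

A definition with neither field set resolves to `(None, None)`, which a monitor could treat as "no SLA configured".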

2. SLA Record Table:
Create a new table t_ds_sla_violation to persist every breach event.
Suggested schema:

  • id: Primary Key.
  • workflow_definition_code: The code of the workflow.
  • instance_id: ID of the workflow/task instance (if created).
  • violation_type: Enum (START_TIME_BREACH, END_TIME_BREACH).
  • expected_time: The configured SLA time.
  • actual_time: The time when the violation was detected.
  • creation_time: Audit timestamp.
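
The suggested schema can be prototyped with an in-memory SQLite database, as in the sketch below. It is an illustration only: the column types, the `CHECK` constraint, the default for `creation_time`, and the helper names `open_store` and `record_violation` are assumptions, and the real DDL would need to target the databases DolphinScheduler supports.

```python
import sqlite3
from typing import Optional

# Columns follow the suggested schema; types and constraints are assumptions.
DDL = """
CREATE TABLE t_ds_sla_violation (
    id                       INTEGER PRIMARY KEY AUTOINCREMENT,
    workflow_definition_code INTEGER NOT NULL,
    instance_id              INTEGER,  -- NULL when no instance was created
    violation_type           TEXT NOT NULL
        CHECK (violation_type IN ('START_TIME_BREACH', 'END_TIME_BREACH')),
    expected_time            TEXT NOT NULL,
    actual_time              TEXT NOT NULL,
    creation_time            TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory database containing only the violation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    return conn


def record_violation(
    conn: sqlite3.Connection,
    workflow_definition_code: int,
    instance_id: Optional[int],
    violation_type: str,
    expected_time: str,
    actual_time: str,
) -> int:
    """Insert one breach event and return its generated primary key."""
    cur = conn.execute(
        "INSERT INTO t_ds_sla_violation "
        "(workflow_definition_code, instance_id, violation_type, expected_time, actual_time) "
        "VALUES (?, ?, ?, ?, ?)",
        (workflow_definition_code, instance_id, violation_type, expected_time, actual_time),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping `instance_id` nullable matches the "(if created)" note: a start-time breach can be recorded before any instance exists.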

3. Monitoring Logic (SLA Monitor Thread):
The Master Server will run a background thread that periodically:

  • Scans Definitions: Identifies workflows/tasks with active SLA configurations.

  • Evaluation:

      • Start-Time: A START_TIME_BREACH is raised if Current Time > expected_start_time AND (no instance exists OR the instance is still SUBMITTED/WAITING).

      • End-Time: An END_TIME_BREACH is raised if Current Time > expected_end_time AND the instance status is not SUCCESS.

  • Action:

      • Trigger an SLA_ALARM via the Alert Server.

      • Insert a record into t_ds_sla_violation for persistence and UI display.
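
The two evaluation rules can be expressed as a small pure function, sketched below under stated assumptions. `evaluate` and `ViolationType` are hypothetical names. The status strings are copied from the rule text and may not match the actual DolphinScheduler status enum. The sketch also treats "no instance exists" as "not SUCCESS" for the end-time rule, which the proposal leaves open.

```python
from datetime import datetime
from enum import Enum
from typing import List, Optional


class ViolationType(Enum):
    START_TIME_BREACH = "START_TIME_BREACH"
    END_TIME_BREACH = "END_TIME_BREACH"


# Status names taken from the rule text; the real enum values may differ.
NOT_STARTED = {"SUBMITTED", "WAITING"}


def evaluate(
    now: datetime,
    expected_start: Optional[datetime],
    expected_end: Optional[datetime],
    instance_status: Optional[str],
) -> List[ViolationType]:
    """Return the breaches detected at `now`.

    `instance_status` is None when no instance exists yet. A deadline of
    None means that side of the SLA is not configured and is skipped.
    """
    violations: List[ViolationType] = []
    if expected_start is not None and now > expected_start:
        if instance_status is None or instance_status in NOT_STARTED:
            violations.append(ViolationType.START_TIME_BREACH)
    if expected_end is not None and now > expected_end:
        if instance_status != "SUCCESS":
            violations.append(ViolationType.END_TIME_BREACH)
    return violations
```

Keeping the check free of I/O means the monitor thread only has to supply the current time and instance state, and the alerting and persistence steps can consume the returned list.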

Compatibility, Deprecation, and Migration Plan

  • Compatibility: Fully backward compatible. Workflows without these fields defined will skip the SLA check.
  • Database Migration:
      • Add expected_start_time and expected_end_time columns to the definition tables (t_ds_workflow_definition and t_ds_task_definition).
      • New DDL for table t_ds_sla_violation.

Test Plan

  • Functional Testing:

      • Verify that if a task remains in the DELAY or SERIAL_WAIT state past its expected_start_time, a violation record is created and an alert is sent.

      • Verify that if a task is still RUNNING past its expected_end_time, the system detects the breach.

  • Persistence Testing:

      • Check if the t_ds_sla_violation table correctly records the instance_id (if applicable) and the type of breach.

  • Edge Case Testing:

      • Cross-day monitoring: Test a workflow with a start time of 23:30 and an end time of 01:30 (next day).

      • Frequency: Ensure the monitor thread doesn't create duplicate violation records for the same instance in a single cycle.
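
One way to satisfy the duplicate-record check is to key each breach on the fields that identify it and skip keys already seen. The sketch below is a hypothetical in-memory illustration, not the proposed mechanism: `ViolationDeduplicator` and its key layout are assumptions, and a real deployment would more likely need a persistent guard (for example a unique index on the violation table), because an in-memory set would not survive a Master restart or be shared across Masters.

```python
from typing import Optional, Set, Tuple

# (workflow_definition_code, instance_id, violation_type, expected_time)
ViolationKey = Tuple[int, Optional[int], str, str]


class ViolationDeduplicator:
    """Remembers which breaches were already reported (in memory only)."""

    def __init__(self) -> None:
        self._seen: Set[ViolationKey] = set()

    def should_record(
        self,
        workflow_definition_code: int,
        instance_id: Optional[int],
        violation_type: str,
        expected_time: str,
    ) -> bool:
        """Return True the first time a breach is seen, False on repeats."""
        key = (workflow_definition_code, instance_id, violation_type, expected_time)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Including `expected_time` in the key lets the same workflow breach again on the next day without being suppressed.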
