Description
Search before asking
- I have searched the existing DSIPs and found no similar DSIP.
Motivation
Apache DolphinScheduler currently provides "Timeout Alarms" based on relative duration (e.g., alerting if a task runs longer than 30 minutes). However, production SLAs are typically defined by absolute wall-clock time.
Problem Statement:
- Business Deadline: Many pipelines must complete by a specific time (e.g., 08:00 AM) to meet downstream business reports.
- Delayed Start: Critical tasks must start by a certain time (e.g., 02:00 AM). If they are stuck in the queue or delayed by upstream dependencies, the system should alert before the "end-time" is even reached.
- Observability Gap: There is currently no persistent record of SLA violations, making it difficult to generate SLA compliance reports (e.g., "What percentage of tasks finished by 09:00 AM last month?").
Introducing absolute time SLA monitoring and a dedicated violation record table will provide better governance and auditability for critical data pipelines.
Design Detail
1. Metadata Configuration:
Add the following fields to t_ds_workflow_definition and t_ds_task_definition:
- `expected_start_time`: Absolute time the instance must start (e.g., 02:00).
- `expected_end_time`: Absolute time the instance must finish (e.g., 08:00).
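As a rough illustration of how such a field could be interpreted, the sketch below resolves a configured wall-clock value into an absolute deadline for one scheduled run. It assumes the value is stored as an `HH:mm` string and is evaluated against the run's calendar date in a single time zone; the proposal does not specify either, and `resolve_deadline` is a hypothetical helper, not an existing DolphinScheduler API.

```python
from datetime import date, datetime, timezone, tzinfo


def resolve_deadline(run_date: date, hhmm: str, tz: tzinfo = timezone.utc) -> datetime:
    """Turn a configured wall-clock time such as "08:00" into an absolute,
    timezone-aware deadline for the run scheduled on run_date.

    Hypothetical helper: the HH:mm storage format and the tz parameter
    are assumptions made for this sketch.
    """
    parsed = datetime.strptime(hhmm, "%H:%M").time()
    return datetime.combine(run_date, parsed, tzinfo=tz)


# Example: deadlines for the run scheduled on 2024-05-01.
expected_start = resolve_deadline(date(2024, 5, 1), "02:00")
expected_end = resolve_deadline(date(2024, 5, 1), "08:00")
```

Resolving against the run date rather than "today" keeps the deadline stable if the monitor evaluates a run after midnight.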
2. SLA Record Table:
Create a new table t_ds_sla_violation to persist every breach event.
Suggested schema:
- `id`: Primary Key.
- `workflow_definition_code`: The code of the workflow.
- `instance_id`: ID of the workflow/task instance (if created).
- `violation_type`: Enum (`START_TIME_BREACH`, `END_TIME_BREACH`).
- `expected_time`: The configured SLA time.
- `actual_time`: The time when the violation was detected.
- `creation_time`: Audit timestamp.
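To make the shape of one row concrete, here is a minimal in-memory model of the suggested schema. The field names follow the list above; the Python types, and the choice to make `id` and `instance_id` optional, are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class ViolationType(Enum):
    START_TIME_BREACH = "START_TIME_BREACH"
    END_TIME_BREACH = "END_TIME_BREACH"


@dataclass(frozen=True)
class SlaViolation:
    workflow_definition_code: int
    violation_type: ViolationType
    expected_time: datetime            # the configured SLA time
    actual_time: datetime              # when the breach was detected
    creation_time: datetime            # audit timestamp
    instance_id: Optional[int] = None  # None when no instance was created yet
    id: Optional[int] = None           # assumed to be assigned by the database


# Example: a workflow that had not started by 02:00, detected at 02:05.
violation = SlaViolation(
    workflow_definition_code=1001,
    violation_type=ViolationType.START_TIME_BREACH,
    expected_time=datetime(2024, 5, 1, 2, 0),
    actual_time=datetime(2024, 5, 1, 2, 5),
    creation_time=datetime(2024, 5, 1, 2, 5),
)
```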
3. Monitoring Logic (SLA Monitor Thread):
The Master Server will run a background thread that periodically:
- Scans Definitions: Identifies workflows/tasks with active SLA configurations.
- Evaluation:
  - Start-Time: If `Current Time > expected_start_time` AND (no instance exists OR instance is still `SUBMITTED`/`WAITING`).
  - End-Time: If `Current Time > expected_end_time` AND instance status is not `SUCCESS`.
- Action:
  - Trigger an `SLA_ALARM` via the Alert Server.
  - Insert a record into `t_ds_sla_violation` for persistence and UI display.
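The two evaluation rules can be sketched as a pure function, which also makes them easy to unit-test. This is an illustration under stated assumptions, not the proposed Master Server code: `evaluate` is a hypothetical name, statuses are plain strings, and a missing instance is represented by `None`.

```python
from datetime import datetime
from typing import Optional

# States that the proposal treats as "not started yet".
NOT_STARTED = {"SUBMITTED", "WAITING"}


def evaluate(
    now: datetime,
    expected_start: Optional[datetime],
    expected_end: Optional[datetime],
    instance_status: Optional[str],
) -> list[str]:
    """Return the breach types that apply at time `now`.

    instance_status is None when no instance exists. A deadline of None
    means that side of the SLA is not configured and is skipped.
    """
    breaches = []
    if expected_start is not None and now > expected_start:
        if instance_status is None or instance_status in NOT_STARTED:
            breaches.append("START_TIME_BREACH")
    if expected_end is not None and now > expected_end:
        if instance_status != "SUCCESS":
            breaches.append("END_TIME_BREACH")
    return breaches
```

Read literally, the end-time rule also fires when no instance exists at all, so a workflow that never started reports both breaches once the end time passes. Whether that is desired is a question for the design discussion.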
Compatibility, Deprecation, and Migration Plan
- Compatibility: Fully backward compatible. Workflows without these fields defined will skip the SLA check.
- Database Migration:
  - Add `expected_start_time` and `expected_end_time` columns to the definition tables.
  - New DDL for table `t_ds_sla_violation`.
Test Plan
- Functional Testing:
  - Verify that if a task remains in the `DELAY` or `SERIAL_WAIT` state past its `expected_start_time`, a violation record is created and an alert is sent.
  - Verify that if a task is still `RUNNING` past its `expected_end_time`, the system detects the breach.
- Persistence Testing:
  - Check if the `t_ds_sla_violation` table correctly records the `instance_id` (if applicable) and the type of breach.
- Edge Case Testing:
  - Cross-day monitoring: Test a workflow with a start time of 23:30 and an end time of 01:30 (next day).
  - Frequency: Ensure the monitor thread doesn't create duplicate violation records for the same instance in a single cycle.
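The two edge cases could be handled along the following lines. This is only a sketch: `resolve_window` and `ViolationLog` are hypothetical names, the rule "an end time at or before the start time belongs to the next day" is an assumption, and a real implementation would more likely enforce de-duplication with a unique constraint on the violation table than with an in-memory set.

```python
from datetime import date, datetime, time, timedelta


def resolve_window(run_date: date, start: time, end: time) -> tuple[datetime, datetime]:
    """Resolve wall-clock start/end times into absolute datetimes.

    Assumption: an end time at or before the start time means the
    window crosses midnight and ends on the following day.
    """
    start_dt = datetime.combine(run_date, start)
    end_dt = datetime.combine(run_date, end)
    if end_dt <= start_dt:
        end_dt += timedelta(days=1)
    return start_dt, end_dt


class ViolationLog:
    """Records each (instance, breach type, expected time) at most once."""

    def __init__(self) -> None:
        self._seen = set()
        self.records = []

    def record(self, instance_key, violation_type: str, expected_time: datetime) -> bool:
        """Return True if a new record was stored, False if it was a duplicate."""
        key = (instance_key, violation_type, expected_time)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.records.append(key)
        return True


# Example: the 23:30 -> 01:30 workflow from the test plan.
start_dt, end_dt = resolve_window(date(2024, 5, 1), time(23, 30), time(1, 30))
log = ViolationLog()
first = log.record(42, "END_TIME_BREACH", end_dt)
second = log.record(42, "END_TIME_BREACH", end_dt)  # repeated scan of the same breach
```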
Code of Conduct
- I agree to follow this project's Code of Conduct