Does Dagster support strategies like FIFO, FAIR, CAPACITY, etc.? #22801
Labels
area: concurrency
Related to controlling concurrent execution
area: execution
Related to Execution
type: feature-request
What's the use case?
I have periodic tasks with different configurations scheduled at minute-level intervals, which generates a large number of queued runs. Some of these tasks operate on data that changes over time, such as a web crawler fetching real-time trading information; others perform operations or computations on static data (such as T+1 data calculations). When Dagster manages a large number of tasks, runs queue up waiting for sufficient compute resources. This can cause problems: in the web-crawler scenario, a run that was scheduled to fetch data for 11:00 may not actually execute until 14:00, so it ends up fetching 14:00 data instead, significantly hurting data accuracy.
Therefore, my question is: does Dagster support scheduling strategies similar to Hadoop YARN's, or those supported by https://github.com/volcano-sh/volcano, to address such scenarios?
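For what it's worth, Dagster's QueuedRunCoordinator already honors a `dagster/priority` run tag (higher values are dequeued first), which can partially mitigate the stale-crawl problem even without full YARN-style policies. A minimal pure-Python sketch of that dequeue rule (an illustration, not Dagster internals):

```python
def dequeue_order(queued_runs):
    """Return run ids in dequeue order: higher priority first,
    FIFO among equal priorities (Python's sort is stable)."""
    # queued_runs: list of (run_id, priority) in submission order.
    return [run_id for run_id, _ in sorted(queued_runs, key=lambda t: -t[1])]

# Real-time crawls tagged with a high dagster/priority jump ahead of batch runs.
order = dequeue_order([("t1_batch", 0), ("crawl_11", 5), ("t1_batch2", 0), ("crawl_12", 5)])
```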
Ideas of implementation
FIFO (First In, First Out):
Description: FIFO is the simplest scheduling policy where jobs are executed in the order they are submitted to the cluster. Each application waits in a queue until all earlier applications have been completed.
Use Case: Suitable for environments where fairness isn't a primary concern and maximizing throughput is more important.
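As a rough illustration, FIFO reduces to a single submission-ordered queue. A minimal sketch (the `FIFOScheduler` name is hypothetical, not a Dagster API):

```python
from collections import deque

class FIFOScheduler:
    """Runs jobs strictly in the order they were submitted."""
    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)

    def next_job(self):
        # The earliest-submitted job always runs first; later jobs wait.
        return self.queue.popleft() if self.queue else None

fifo = FIFOScheduler()
for name in ["crawl_11_00", "crawl_12_00", "t_plus_1_calc"]:
    fifo.submit(name)
```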
Capacity Scheduler:
Description: The Capacity Scheduler allows dividing the cluster's resources among multiple organizations or queues. Each queue can have its own FIFO, Fair, or DRF (Dominant Resource Fairness) policy.
Use Case: Ideal for multi-tenant environments or organizations that require isolation and guaranteed capacity allocation for different groups of users or applications.
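A sketch of the core idea: a fixed slot pool is divided among named queues by a capacity fraction, with FIFO inside each queue (names and numbers here are illustrative assumptions, not Dagster behavior):

```python
from collections import deque

class CapacityScheduler:
    """Splits a fixed pool of slots among named queues by capacity share,
    FIFO within each queue."""
    def __init__(self, total_slots, shares):
        # shares: queue name -> fraction of the pool, e.g. {"realtime": 0.5}
        self.limits = {q: max(1, int(total_slots * f)) for q, f in shares.items()}
        self.running = {q: 0 for q in shares}
        self.queues = {q: deque() for q in shares}

    def submit(self, queue, job):
        self.queues[queue].append(job)

    def next_job(self, queue):
        # Dequeue FIFO within the queue, but never beyond its guaranteed capacity.
        if self.queues[queue] and self.running[queue] < self.limits[queue]:
            self.running[queue] += 1
            return self.queues[queue].popleft()
        return None

    def finish(self, queue):
        self.running[queue] -= 1

cap = CapacityScheduler(4, {"realtime": 0.5, "batch": 0.5})
for j in ["crawl_a", "crawl_b", "crawl_c"]:
    cap.submit("realtime", j)
```

With a 50% share of 4 slots, the "realtime" queue can run at most 2 jobs at once; the third waits even if "batch" slots sit idle (strict isolation, no elasticity in this sketch).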
Fair Scheduler:
Description: The Fair Scheduler dynamically balances resources between all running jobs in the cluster, regardless of the order in which jobs are submitted. Jobs are allocated resources in a fair manner based on demand and priority.
Use Case: Useful in environments where multiple users or applications share the cluster equally, promoting fairness and preventing any single job from monopolizing resources for an extended period.
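The "fair" allocation described above is usually max-min fairness: slots are split evenly among jobs that still want more, and leftover capacity from small jobs is redistributed. A self-contained sketch:

```python
def fair_shares(total_slots, demands):
    """Max-min fair allocation: repeatedly split remaining slots evenly
    among jobs whose demand is not yet satisfied."""
    alloc = {j: 0 for j in demands}
    remaining = total_slots
    active = {j for j, d in demands.items() if d > 0}
    while remaining > 0 and active:
        share = max(1, remaining // len(active))
        for j in sorted(active):  # sorted for deterministic iteration
            give = min(share, demands[j] - alloc[j], remaining)
            alloc[j] += give
            remaining -= give
            if alloc[j] >= demands[j]:
                active.discard(j)
            if remaining == 0:
                break
    return alloc
```

For example, with 10 slots and demands a=2, b=8, c=8, job a gets its full 2 and the remaining 8 slots are split 4/4 between b and c, so no single job monopolizes the pool.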
DRF (Dominant Resource Fairness):
Description: DRF ensures that each user or application gets a fair share of the cluster's resources based on their dominant resource requirement (CPU, memory, etc.). It aims to provide fairness while taking into account the different resource requirements of jobs.
Use Case: Effective in clusters where jobs have varying resource demands, ensuring that no single job can dominate resources at the expense of others.
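A greedy sketch of DRF: at each step, launch one task for the user whose dominant share (their largest usage fraction across resources) is currently lowest, as long as the task still fits:

```python
def drf_schedule(capacity, demands, max_tasks=100):
    """Greedy Dominant Resource Fairness: repeatedly launch one task for the
    user with the smallest dominant share whose next task still fits."""
    # capacity: {"cpu": 9, "mem": 18}; demands: user -> per-task needs.
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}
    for _ in range(max_tasks):
        def dominant_share(u):
            # Largest fraction of any one resource this user consumes.
            return max(tasks[u] * demands[u][r] / capacity[r] for r in capacity)
        for u in sorted(demands, key=dominant_share):
            if all(used[r] + demands[u][r] <= capacity[r] for r in capacity):
                tasks[u] += 1
                for r in capacity:
                    used[r] += demands[u][r]
                break
        else:
            break  # no user's next task fits anywhere
    return tasks
```

With 9 CPUs and 18 GB, a memory-heavy user A (1 CPU, 4 GB per task) and a CPU-heavy user B (3 CPUs, 1 GB per task) end up with 3 and 2 tasks respectively, equalizing their dominant shares rather than their raw slot counts.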
Deadline Scheduler:
Description: The Deadline Scheduler allows jobs to specify deadlines by which they need to complete. It attempts to schedule jobs in such a way that all deadlines are met while maximizing cluster utilization.
Use Case: Suitable for environments where meeting strict job deadlines is critical, such as real-time processing or time-sensitive data analysis.
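The simplest form of this is Earliest-Deadline-First: of all queued jobs, the one whose deadline is soonest runs next. A minimal sketch using a heap:

```python
import heapq

class DeadlineScheduler:
    """Earliest-Deadline-First: the queued job with the soonest deadline runs next."""
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so the heap never compares job payloads

    def submit(self, job, deadline):
        heapq.heappush(self._heap, (deadline, self._counter, job))
        self._counter += 1

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

edf = DeadlineScheduler()
edf.submit("batch_report", deadline=300)
edf.submit("crawl_11_00", deadline=100)  # tight deadline: must run first
edf.submit("t_plus_1", deadline=200)
```

This is exactly the property the crawler use case above wants: a fetch scheduled for 11:00 should never be displaced past its window by less time-sensitive work.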
Capacity Scheduler with Delay Scheduling:
Description: Enhances the Capacity Scheduler by allowing jobs to wait for a short period before starting execution. This delay helps improve data locality and reduces resource fragmentation.
Use Case: Useful in scenarios where data locality and cluster efficiency are paramount, allowing jobs to start when resources are available nearby.
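The delay-scheduling trick above can be sketched as a skip counter: a job passes on non-local slots a bounded number of times before giving up on locality (the class and parameter names here are illustrative assumptions):

```python
class DelaySchedulingQueue:
    """Skip a job up to `max_skips` times while waiting for a preferred
    (data-local) node; after that, run it wherever a slot is free."""
    def __init__(self, max_skips=3):
        self.max_skips = max_skips
        self.skips = {}  # job -> times it has been passed over

    def assign(self, job, preferred_nodes, free_node):
        """Return True if `job` should run on `free_node` now."""
        if free_node in preferred_nodes:
            self.skips.pop(job, None)
            return True  # data-local slot: run immediately
        self.skips[job] = self.skips.get(job, 0) + 1
        if self.skips[job] > self.max_skips:
            self.skips.pop(job, None)
            return True  # waited long enough; accept a remote slot
        return False  # hold out for a local slot a little longer

q = DelaySchedulingQueue(max_skips=2)
```

The small bounded delay trades a little latency for better locality; setting `max_skips=0` degenerates to plain capacity scheduling.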
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.