The requester node attempts to schedule work on disconnected nodes resulting in the job never running #3784

frrist · 2024-04-11T17:42:13Z

Due to changes here:

Current proposal is to:

make auto-approve work via a fix for: Compute Nodes broadcast NodeInfo with unknown approval status which overrides previous approvals/rejections #3783
modify the aforementioned code to only schedule on connected nodes.

cc @rossjones & @wdbaruni to weigh in on how the new event system introduced in #3772 can be used to force scheduling of executions when offline compute nodes come online again.


wdbaruni · 2024-04-16T09:50:40Z

What is the proposal here? I believe the default option now is to auto-approve nodes, and only schedule on approved and connected nodes. Is any of that still missing?

frrist · 2024-04-16T17:44:51Z

Yeah, scheduling only on connected and approved nodes is what's missing. Currently we schedule on disconnected nodes for some job types and ignore approval state for other job types. Frankly, it's a bit of a mess:

we need to address this TODO: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/orchestrator/scheduler/batch_service_job.go#L158
modify this section: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/orchestrator/scheduler/daemon_job.go#L93
modify this section: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/orchestrator/scheduler/ops_job.go#L110

Or rather than modify, allow these aspects of scheduling to be configured.
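
A minimal sketch of what configurable selection constraints could look like. All types and names here (`NodeState`, `SelectionConstraints`, `FilterNodes`) are hypothetical and chosen for illustration; they are not Bacalhau's actual scheduler types.

```go
package main

import "fmt"

// Hypothetical approval and connection states, for illustration only.
type Approval int

const (
	ApprovalPending Approval = iota
	ApprovalApproved
	ApprovalRejected
)

type Connection int

const (
	Disconnected Connection = iota
	Connected
)

// NodeState is an assumed, simplified view of what the requester knows about a node.
type NodeState struct {
	ID         string
	Approval   Approval
	Connection Connection
}

// SelectionConstraints lets each job type decide which requirements apply.
type SelectionConstraints struct {
	RequireApproval  bool
	RequireConnected bool
}

// FilterNodes returns the nodes that satisfy the constraints, preserving input order.
func FilterNodes(nodes []NodeState, c SelectionConstraints) []NodeState {
	eligible := make([]NodeState, 0, len(nodes))
	for _, n := range nodes {
		if c.RequireApproval && n.Approval != ApprovalApproved {
			continue // pending and rejected nodes are skipped
		}
		if c.RequireConnected && n.Connection != Connected {
			continue // disconnected nodes are skipped
		}
		eligible = append(eligible, n)
	}
	return eligible
}

func main() {
	nodes := []NodeState{
		{ID: "n1", Approval: ApprovalApproved, Connection: Connected},
		{ID: "n2", Approval: ApprovalApproved, Connection: Disconnected},
		{ID: "n3", Approval: ApprovalPending, Connection: Connected},
	}
	strict := SelectionConstraints{RequireApproval: true, RequireConnected: true}
	for _, n := range FilterNodes(nodes, strict) {
		fmt.Println(n.ID)
	}
}
```

With both requirements enabled, only `n1` is eligible. A job type that tolerates offline nodes could set `RequireConnected: false` and reuse the same filter.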

Further, we need to ensure that work scheduled on an offline node runs when the node comes back online, for which we will need #3772. E.g. the orchestrator could listen for connected events and create an evaluation to execute the work.
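
A rough sketch of the "listen for connected events and create an evaluation" idea. The event and evaluation shapes below (`NodeEvent`, `Evaluation`, `Reevaluator`) are assumptions made for illustration; the real event system from #3772 may look quite different.

```go
package main

import "fmt"

// NodeEvent is a hypothetical connection-state change notification.
type NodeEvent struct {
	NodeID    string
	Connected bool
}

// Evaluation is a hypothetical request for the scheduler to revisit a job.
type Evaluation struct {
	JobID  string
	Reason string
}

// Reevaluator remembers which jobs have work waiting on which node.
type Reevaluator struct {
	pending map[string][]string // nodeID -> jobIDs waiting on that node
}

func NewReevaluator() *Reevaluator {
	return &Reevaluator{pending: make(map[string][]string)}
}

// Track records that jobID has work waiting for nodeID to come online.
func (r *Reevaluator) Track(nodeID, jobID string) {
	r.pending[nodeID] = append(r.pending[nodeID], jobID)
}

// HandleEvent returns one evaluation per job waiting on a node that just
// connected, then forgets those jobs. Disconnect events produce nothing.
func (r *Reevaluator) HandleEvent(ev NodeEvent) []Evaluation {
	if !ev.Connected {
		return nil
	}
	jobs := r.pending[ev.NodeID]
	delete(r.pending, ev.NodeID)
	evals := make([]Evaluation, 0, len(jobs))
	for _, jobID := range jobs {
		evals = append(evals, Evaluation{JobID: jobID, Reason: "node-connected:" + ev.NodeID})
	}
	return evals
}

func main() {
	r := NewReevaluator()
	r.Track("node-a", "job-1")
	r.Track("node-a", "job-2")
	for _, e := range r.HandleEvent(NodeEvent{NodeID: "node-a", Connected: true}) {
		fmt.Println(e.JobID, e.Reason)
	}
}
```

The sketch is single-threaded; a real handler fed from an event stream would need to guard `pending` with a mutex or own it from a single goroutine.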

frrist · 2024-04-22T15:47:08Z

Another point to consider:
How can we allow users to define different scheduling heuristics for compute nodes? E.g. nodes in a data center ought to have a stricter requirement on connectedness than nodes that are expected to go offline for longer periods of time (e.g. submarine compute nodes).

rossjones · 2024-04-22T17:57:08Z

@wdbaruni previously suggested adding another (a third) timeout in future which allows nodes to be offline for that long before being considered dead.
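
One way to picture a per-node offline tolerance is sketched below. The names (`Tolerance`, `Classify`, `DataCenter`, `Intermittent`) and the durations are all made up for illustration; nothing in this thread specifies the existing timeouts or what values a third one would take.

```go
package main

import (
	"fmt"
	"time"
)

// Liveness is a hypothetical three-way classification of a node.
type Liveness string

const (
	Live    Liveness = "live"
	Offline Liveness = "offline"
	Dead    Liveness = "dead"
)

// Tolerance holds assumed thresholds on how long a node may stay silent.
type Tolerance struct {
	DisconnectAfter time.Duration // silence before the node counts as offline
	DeadAfter       time.Duration // silence before the node counts as dead
}

// Example profiles with illustrative values only.
var (
	DataCenter   = Tolerance{DisconnectAfter: 30 * time.Second, DeadAfter: 5 * time.Minute}
	Intermittent = Tolerance{DisconnectAfter: 10 * time.Minute, DeadAfter: 72 * time.Hour}
)

// Classify maps the time since a node was last heard from onto a liveness state.
func Classify(silent time.Duration, t Tolerance) Liveness {
	switch {
	case silent >= t.DeadAfter:
		return Dead
	case silent >= t.DisconnectAfter:
		return Offline
	default:
		return Live
	}
}

func main() {
	silent := 5 * time.Minute
	fmt.Println("data center node:", Classify(silent, DataCenter))
	fmt.Println("intermittent node:", Classify(silent, Intermittent))
}
```

The same period of silence is fatal under the strict profile but still counts as live under the lenient one, which is the distinction the data center versus submarine example is asking for.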

- This change modifies the Requester node's scheduling constraints so that jobs will only be scheduled on nodes that are online and approved. Disconnected nodes and nodes that are rejected or pending will not be eligible to run jobs.
- Additionally, this change cleans up some code by making constraints a parameter to the node selector, which simplifies various parts of dependency construction.
- Lastly, this change removes some *Param types to avoid the possibility of NPD.
- fixes #3784

Co-authored-by: frrist <forrest@expanso.io>

frrist added the type/bug Type: Something is not working as expected label Apr 11, 2024

frrist added this to the Release v1.3.1 milestone Apr 11, 2024

frrist self-assigned this Apr 15, 2024

frrist pushed a commit that referenced this issue Apr 25, 2024

fix: require connected and approved nodes for scheduling

cc27cf9

- fixes #3784

frrist mentioned this issue Apr 25, 2024

fix: require connected and approved nodes for scheduling #3957

Merged

frrist pushed a commit that referenced this issue Apr 25, 2024

fix: require connected and approved nodes for scheduling

ce7e6c5

- fixes #3784

frrist pushed a commit that referenced this issue May 6, 2024

fix: require connected and approved nodes for scheduling

858f868

- fixes #3784

frrist pushed a commit that referenced this issue May 8, 2024

fix: require connected and approved nodes for scheduling

68866fc

- fixes #3784

frrist closed this as completed in #3957 May 8, 2024
