Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: require connected and approved nodes for scheduling #3957

Merged
merged 4 commits into from
May 8, 2024

Conversation

frrist
Copy link
Member

@frrist frrist commented Apr 25, 2024

  • This change modifies the Requester nodes scheduling constraints s.t. jobs will only be scheduled on nodes that are online and approved. Disconnected nodes and nodes that are rejected or pending will not be eligible to run jobs.
  • Additionally, this change cleans up some code by making constraints an parameter to the node selector - which simplify various parts of dependency construction.
  • Lastly, this change removes some *Param types to avoid the possibility of NPD.
  • fixes The requester node attempts to schedule work on disconnected nodes resulting in the job never running #3784

Summary by CodeRabbit

  • Refactor

    • Improved the initialization and configuration process for job schedulers and node selectors for better performance and maintainability.
    • Streamlined node selection logic by removing redundant parameters and simplifying method signatures.
    • Renamed several methods and fields to enhance clarity and consistency across the application.
  • New Features

    • Enhanced node selection capabilities with additional constraints to better tailor node selection to specific job requirements.

@frrist frrist self-assigned this Apr 25, 2024
Copy link

coderabbitai bot commented Apr 25, 2024

Important

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This set of changes focuses on restructuring the node selection and job scheduling logic across various components. Key modifications include simplifying interfaces by removing constraints parameters, updating method signatures, and enhancing code structure by passing parameters directly to constructors, thereby streamlining the setup and execution processes.

Changes

File Path Change Summary
pkg/node/requester.go Reorganized initialization of nodeSelector and batchServiceJobScheduler; enhanced selector constraints.
pkg/orchestrator/interfaces.go
pkg/orchestrator/mocks.go
pkg/orchestrator/selection/selector/node_selector.go
Modified NodeSelector interface and related mocks; updated method signatures and internal logic.
pkg/orchestrator/scheduler/... Refactored scheduler constructors and methods across multiple files; updated field and method usage.

Possibly related issues

  • bacalhau-project/expanso-planning#410: The changes in this PR could streamline the scheduling process, potentially addressing the inefficiency issues highlighted in the proposal for a more efficient scheduling approach.
  • bacalhau-project/expanso-planning#405: Improvements in node selection and job execution handling could enhance the reliability and accuracy of job status updates, which is a concern in this issue.

🐇🌟
A hop and a skip, the code takes a leap,
With selectors so sleek, the nodes they will peep.
Scheduler's new dance, with parameters neat,
A whirl and a prance, efficiency we greet!
🥕🍃


Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@frrist frrist force-pushed the frrist/requester/fix-scheduling branch 2 times, most recently from fc65c1b to 3de4a9f Compare April 25, 2024 19:34
@frrist
Copy link
Member Author

frrist commented Apr 25, 2024

@coderabbitai review

@frrist frrist requested a review from wdbaruni April 25, 2024 19:37
Comment on lines -66 to -71
requesterConfig, err := node.NewRequesterConfigWith(
node.RequesterConfigParams{
NodeRankRandomnessRange: 0,
OverAskForBidsFactor: 1,
},
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No ides how this ever worked before NodeRankRandomnessRange cannot be 0 or the code that uses it panics, somehow it never panicked, meaning the setting of these values were likely ineffectual

Comment on lines -135 to -140
requesterConfig, err := node.NewRequesterConfigWith(
node.RequesterConfigParams{
NodeRankRandomnessRange: 0,
OverAskForBidsFactor: 1,
},
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No ides how this ever worked before NodeRankRandomnessRange cannot be 0 or the code that uses it panics, somehow it never panicked, meaning the setting of these values were likely ineffectual

Copy link
Member

@wdbaruni wdbaruni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make reviewing the PRs easier, avoid coupling refactoring with actual changes the PR is mainly about, such as fixing node selection in this case

NodeSelector: nodeSelector,
RetryStrategy: retryStrategy,
})
batchServiceJobScheduler := scheduler.NewBatchServiceJobScheduler(jobStore, planners, nodeSelector, retryStrategy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping params in the constructor helps maintain code stability and makes the overall code more maintainable. This ensures that constructor signatures and tests do not require changes when new dependencies are introduced. This also makes the code more readable even if we fail to provide clear field names

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally get where you are coming from - maintainable code is great! But please hear me out here:

Keeping params in the constructor helps maintain code stability and makes the overall code more maintainable.

I'd argue it does the opposite: the params pattern doesn't provide compile-time safety. Additionally, whenever a new dependency is introduced, we need to modify the the constructor, the call sites, and the params type.

Comparing this with declaring dependencies in the function signature: It provides compile-time safety, and reduces the number of modifications we need to make when introducing new dependencies. We only need to update the constructor and the call sites. (And the compiler will complain very loudly in places we don't, instead of panicking later when the application runs - this is what soured my additional refactor here.)

This ensures that constructor signatures and tests do not require changes when new dependencies are introduced

I don't follow. Constructors and tests still need to change when new dependencies are introduce - all fields are required and therefore must be provided.

This also makes the code more readable even if we fail to provide clear field names

I'll acknowledge we have a different in opinions here. Personally I want to know the type of something rather than its name in a structure.


Given the points regarding of compile time safety and reduced maintenance overhead, would you be alright moving forward with this code as is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having methods or constructors with long parameters is by standard a bad code design in the industry. First time I hear a discussion favouring long parameters :)

I hear your point about wanting to fail early at compile-time. It is all about tradeoffs. An example, you have picked fx as our dependency injection framework, and that will move a lot of compile-time failures to runtime. You have prioritized code readability, maintainability and testability in favour of compile-time checks, and that is okay as the gains are worth the cost.

Cons of long params include:

  1. Increased error potential as more params increase the likelihood of passing arguments in the wrong order
  2. Poor readability as methods with many params are harder to read and understand at a glance, without having to open the method implementation to understand what param does. The IDEs might help, but not always
  3. Optional fields still need to be passed with their zero values
  4. Harder to maintain as adding or modifying parameters requires changes in all your method calls. This is where you have a different opinion that you prefer the compiler to fail and force the developer to update all calls. Yes, initialization oversight is a true problem with using a single param object. Though in practice, it still makes the code more maintainable for the follow reasons:
    1. Compile time are not sufficient to validate the correct initialization of types. They are great to fail fast, but we shouldn't assume if the compiler is not complaining that we don't need to do runtime validation or that we won't get NPE. This is an example where param object can help with runtime validation. I am still saying that failing at compile-time is great, but it is not the only criteria we should evaluate tradeoffs at. Runtime validation during node startup can be as good.
    2. Usually constructors/methods are called once or very few times in production code, but multiple times in tests to test different branches and combinations. In practice, and with good tests, I see the parameter object is initialized once with default testing values, and then different tests override certain values that they want to test. This makes tests more maintainable and readable compared to having to modify tens of method signatures and that is before even adding any new tests
    3. A lot of fields and configurations are exposed in the constructor mainly for testing purposes with fallback default values if not configured. e.g. allow to inject a clock to have more predictable tests. Having a param object enables this without complicating the type initialization

A bonus is fx seem to work well with param objects out of the box https://uber-go.github.io/fx/parameter-objects.html#using-parameter-objects

Overall I feel silly blocking the PR because of this code style, specially that the number of parameters (4) is still manageable. My concerns are I feel you are planning to change this code design in more places, I expect the scheduler to have more params soon as I work on the queueing, and our code is still in its early stages and more parameters might be added in places other than the scheduler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is certainly a limit to consider when declaring function parameters, and when encapsulating them starts to make sense. I completely agree that this isn't a black and white decision - nuance is required depending on the specific context.

A lot of fields and configurations are exposed in the constructor mainly for testing purposes with fallback default values if not configured

In this case I'd prefer to use functional options (we do a good job with this in some places already) as it makes it clearer when parameters are optional vs required. It's also easy to extend later with more options.

My concerns are I feel you are planning to change this code design in more places

I'll probably make a change like this again in the future if it makes sense while refactoring code. But I don't intend to remove param arguments everywhere! In this specific case 4 felt manageable enough to make the change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it seems we are in agreement about 4 arguments being manageable here, can this get a ✅? 🙏

@frrist frrist force-pushed the frrist/requester/fix-scheduling branch from dab5242 to aace34c Compare May 6, 2024 15:18
@frrist frrist enabled auto-merge (squash) May 6, 2024 15:19
frrist added 4 commits May 8, 2024 10:07
- previously the test set NodeRankRandomnessRange to 0 which ought to
  have paniced. But somehow it didn't? Addtionally the test produced a
  requester config that had an 'unknown' membership state, now the state
  is the default value: `Approved`. This change addess both the potential
  for a panic, and sets the correct membership status.

- More generlly, using mergo.Merge to reconcile config files is a shitty
  fucking pattern and needs to be purged from the codebase. It is far
  simpler to start with a default config and modify the fields that need
  to change for the test, instead of this overly complex circus fluster
  cluck that I just wasted 20mins of my life debugging. /rant
- Ensures a request config is produced for testing that will not cause
  panics and has a default node membership state of 'Approved'.

- ffs same bug as previous commit, wonder how many more instances of
  this there are. *sobs into keyboard*
@frrist frrist force-pushed the frrist/requester/fix-scheduling branch from aace34c to 87f1126 Compare May 8, 2024 17:07
@frrist frrist merged commit 693c6d3 into main May 8, 2024
12 checks passed
@frrist frrist deleted the frrist/requester/fix-scheduling branch May 8, 2024 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The requester node attempts to schedule work on disconnected nodes resulting in the job never running
2 participants