fix: remove Liveness and Acceptance from NodeInfo #3785

Merged
merged 6 commits into from
Apr 18, 2024

Conversation

frrist
Member

@frrist frrist commented Apr 11, 2024

coderabbitai bot commented Apr 11, 2024

Important

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@frrist
Member Author

frrist commented Apr 11, 2024

@coderabbitai /review

coderabbitai bot commented Apr 11, 2024

@frrist, I have reviewed the changes made in the PR as per your request. Here's a summary of the key modifications relevant to the PR's objectives:

Changes in nodeinfo.go:

  • NodeInfo Structure: The NodeInfo struct now includes fields like BacalhauVersion, PeerInfo, NodeType, Labels, and ComputeNodeInfo. This struct is designed to provide comprehensive information about a node, distinguishing between requester and compute nodes.
  • ComputeNodeInfo Structure: This new struct includes details specific to compute nodes such as execution engines, publishers, storage sources, and capacities. It also tracks the number of running and enqueued executions, which is crucial for managing compute resources.

Changes in types.go:

  • HealthInfo and FreeSpace Structures: These structures are used to report system health and disk space availability. This is part of ongoing efforts to improve monitoring and diagnostics.

Observations:

  • The PR effectively removes the direct influence of compute nodes on their Liveness and Acceptance by segregating node-specific information into NodeInfo and ComputeNodeInfo.
  • The introduction of NodeType helps in clearly distinguishing the roles within the network, which aligns with the PR's goal to refine how node roles and statuses are handled.

Recommendations:

  • Ensure that the removal of Liveness and Acceptance from the direct control of compute nodes is accompanied by adequate checks and balances in the parts of the system where these properties are now handled.
  • Consider adding more detailed tests, especially integration tests, to ensure that these changes interact correctly with the existing functionalities and that there are no regressions.

If you need further detailed analysis or have specific areas of concern, please let me know!

Analysis chain



Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Fetch the changes made in the PR
git fetch origin pull/3785/head:pr-3785
git diff pr-3785

Length of output: 162



Script:

#!/bin/bash
# List files changed in the PR and show their contents
git diff --name-only pr-3785 | xargs git show pr-3785:

Length of output: 690



Script:

#!/bin/bash
# Fetch the contents of files in directories likely to contain the relevant changes
git show pr-3785:pkg/
git show pr-3785:cmd/

Length of output: 491



Script:

#!/bin/bash
# Fetch the contents of specific files within directories that are likely to contain the relevant changes
git show pr-3785:pkg/node/
git show pr-3785:pkg/model/
git show pr-3785:pkg/types/

Length of output: 1133



Script:

#!/bin/bash
# Fetch the contents of the nodeinfo.go file and other relevant files
git show pr-3785:pkg/model/nodeinfo.go
git show pr-3785:pkg/types/types.go

Length of output: 2170

Contributor

@rossjones rossjones left a comment

Looks great. Definitely an improvement to break into different types rather than jamming more fields into the existing one.

Couple of comments about the word liveness, as it feels a bit weird to use it in CLI output, but it might just be me.

Slight concern about the deletions and whether they affect libp2p usage. As we no longer test with the libp2p transport, it might be worth testing manually (if you haven't already) to ensure it continues to work.

@@ -102,7 +102,7 @@ func (o *DescribeOptions) printHeaderData(cmd *cobra.Command, job *models.Job) {
 {Left: "Name", Right: job.Name},
 {Left: "Namespace", Right: job.Namespace},
 {Left: "Type", Right: job.Type},
-{Left: "State", Right: job.State.StateType},
+{Left: "Liveness", Right: job.State.StateType},
Contributor

This is the state of the job rather than the node. Is Liveness the correct word here?

Member Author

whoops, my editor was a little overzealous with the name change, will revert this.

@@ -97,7 +97,7 @@ var (
 Value: func(e *models.Execution) string { return strconv.FormatUint(e.Revision, 10) },
 }
 executionColumnState = output.TableColumn[*models.Execution]{
-ColumnConfig: table.ColumnConfig{Name: "State", WidthMax: 10, WidthMaxEnforcer: text.WrapText},
+ColumnConfig: table.ColumnConfig{Name: "Liveness", WidthMax: 10, WidthMaxEnforcer: text.WrapText},
Contributor

Same question, is Liveness the word we want to use for the execution?

Member Author

@frrist frrist Apr 12, 2024

no, this was a mistake - continuing to blame the enthusiasm of my editor

@@ -108,7 +108,7 @@ var historyColumns = []output.TableColumn[*models.JobHistory]{
 },
 },
 {
-ColumnConfig: table.ColumnConfig{Name: "New State", WidthMax: 20, WidthMaxEnforcer: text.WrapText},
+ColumnConfig: table.ColumnConfig{Name: "New Liveness", WidthMax: 20, WidthMaxEnforcer: text.WrapText},
Contributor

New Liveness sounds a bit weird.

Member Author

will revert

@@ -101,7 +102,7 @@ func (m *ManagementClient) RegisterNode(ctx context.Context) error {
 func (m *ManagementClient) deliverInfo(ctx context.Context) {
 // We _could_ avoid attempting an update if we are not registered, but
 // by doing so we will get frequent errors that the node is not
-// registered.
+// registered.c
Contributor

Suggested change
-// registered.c
+// registered.

if err := n.store.Add(ctx, models.NodeState{
Info: request.Info,
Approval: n.defaultApprovalState,
// NB(forrest): by virtue of a compute node calling this endpoint we can consider it connected
Contributor

This should be okay, but using other signals than the heartbeat might lead to flapping, so if that happens, this might be a reasonable place to look.

Info: request.Info,
// the nodes approval state is assumed to be approved here, but re-use existing state
Approval: existing.Approval,
// TODO can we assume the node is connected here?
Contributor

Probably. It might be best to inject a heartbeat as the liveness is held in memory rather than the store and we probably want to avoid situations where the node is not in the liveness map and we accidentally send Connected rather than the default of Disconnected.
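
The concern above can be illustrated with a small sketch. It assumes an in-memory liveness map that reports Disconnected for any node it has never heard from, and a registration path that injects a heartbeat so the map and the reported state stay in agreement. The names `livenessTracker`, `Heartbeat`, and `State` are hypothetical and are not taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// ConnState is a hypothetical stand-in for the requester-side connection state.
type ConnState string

const (
	Connected    ConnState = "CONNECTED"
	Disconnected ConnState = "DISCONNECTED"
)

// livenessTracker keeps connection state in memory only (assumption: nothing
// here is persisted to the node store).
type livenessTracker struct {
	mu     sync.RWMutex
	states map[string]ConnState
}

func newLivenessTracker() *livenessTracker {
	return &livenessTracker{states: make(map[string]ConnState)}
}

// Heartbeat records that the node was heard from and marks it Connected.
func (t *livenessTracker) Heartbeat(nodeID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.states[nodeID] = Connected
}

// State returns the tracked state, defaulting to Disconnected for unknown nodes.
func (t *livenessTracker) State(nodeID string) ConnState {
	t.mu.RLock()
	defer t.mu.RUnlock()
	if s, ok := t.states[nodeID]; ok {
		return s
	}
	return Disconnected
}

func main() {
	t := newLivenessTracker()
	fmt.Println(t.State("node-1")) // unknown node: DISCONNECTED

	// On registration or info update, inject a heartbeat instead of
	// asserting Connected separately from the map.
	t.Heartbeat("node-1")
	fmt.Println(t.State("node-1")) // CONNECTED
}
```

Routing every "this node is connected" signal through the same heartbeat entry point means the map remains the single source of truth, which is one way to avoid reporting Connected for a node the map does not know about.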

Contributor

These aren't still needed for libp2p are they?

@frrist
Member Author

frrist commented Apr 15, 2024

Based on a conversation with @wdbaruni and @seanmtracey, we need to evaluate whether this change breaks backwards compatibility. If it does, we should handle it gracefully, either by adjusting marshaling or by adjusting the topic we exchange these messages on to include a version.
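
As a rough illustration of the marshaling side of that question, the sketch below assumes the messages are JSON encoded (an assumption, not confirmed in this thread) and uses two hypothetical structs, `legacyNodeInfo` and `currentNodeInfo`. Go's `encoding/json` ignores unknown fields by default and leaves missing fields at their zero value, so dropping a field does not by itself make either side fail to decode.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyNodeInfo is a hypothetical stand-in for the older message shape,
// which still carried a requester-owned field.
type legacyNodeInfo struct {
	NodeID   string `json:"NodeID"`
	Approval string `json:"Approval,omitempty"`
}

// currentNodeInfo is a hypothetical stand-in for the newer, slimmer shape.
type currentNodeInfo struct {
	NodeID string `json:"NodeID"`
}

// decodeCurrent reads a payload the way a newer receiver would.
func decodeCurrent(payload []byte) (currentNodeInfo, error) {
	var info currentNodeInfo
	err := json.Unmarshal(payload, &info)
	return info, err
}

// decodeLegacy reads a payload the way an older receiver would.
func decodeLegacy(payload []byte) (legacyNodeInfo, error) {
	var info legacyNodeInfo
	err := json.Unmarshal(payload, &info)
	return info, err
}

func main() {
	oldPayload, _ := json.Marshal(legacyNodeInfo{NodeID: "node-1", Approval: "APPROVED"})
	newPayload, _ := json.Marshal(currentNodeInfo{NodeID: "node-2"})

	// The newer receiver ignores the unknown "Approval" field.
	cur, err := decodeCurrent(oldPayload)
	fmt.Println(cur.NodeID, err)

	// The older receiver leaves the missing "Approval" field empty.
	old, err := decodeLegacy(newPayload)
	fmt.Printf("%s %q %v\n", old.NodeID, old.Approval, err)
}
```

Decoding without errors is only part of compatibility: an older receiver still has to behave sensibly when the field arrives empty, which is why testing mixed versions end to end remains worthwhile.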

frrist added 5 commits April 17, 2024 15:29
- fixes #3783
- Introduces NodeState type used to track NodeInfo, Liveness, and
  Acceptance. Removes the idea of Liveness and Acceptance from data sent
  by compute nodes to the Requester(s) since compute nodes should not
  influence their Liveness or Acceptance. Those are values related to
  the Requester's view of the network.
- it's not needed and is unnecessary obfuscation
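
A minimal sketch of the separation these commits describe, under simplifying assumptions: the field names, constant names, and the `ApplyInfo` helper are hypothetical and much smaller than the real types. The point is that self-reported info is replaced on every update while the requester-owned fields are carried over untouched.

```go
package main

import "fmt"

// NodeInfo is what a compute node reports about itself (heavily abbreviated).
type NodeInfo struct {
	NodeID string
	Labels map[string]string
}

// Membership and Connection are the requester's view of a node.
// The constant names here are illustrative only.
type Membership int

const (
	MembershipPending Membership = iota
	MembershipApproved
	MembershipRejected
)

type Connection int

const (
	Disconnected Connection = iota
	Connected
)

// NodeState wraps the self-reported info with requester-owned state.
type NodeState struct {
	Info       NodeInfo
	Membership Membership
	Connection Connection
}

// ApplyInfo replaces the self-reported info and keeps the requester-owned
// fields, so a broadcast from a compute node cannot reset an earlier decision.
func ApplyInfo(existing NodeState, reported NodeInfo) NodeState {
	existing.Info = reported
	return existing
}

func main() {
	state := NodeState{
		Info:       NodeInfo{NodeID: "node-1"},
		Membership: MembershipRejected,
		Connection: Connected,
	}
	updated := ApplyInfo(state, NodeInfo{NodeID: "node-1", Labels: map[string]string{"zone": "a"}})
	fmt.Println(updated.Membership == MembershipRejected, updated.Info.Labels["zone"])
}
```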
@frrist frrist force-pushed the frrist/fix/node-info-state branch from 4e640d9 to e9641c3 Compare April 18, 2024 01:00
@frrist frrist force-pushed the frrist/fix/node-info-state branch from e9641c3 to 2c3a92b Compare April 18, 2024 01:16
@frrist frrist enabled auto-merge (squash) April 18, 2024 02:00
@frrist
Member Author

frrist commented Apr 18, 2024

This change has been validated in four combinations: a v1.3.0 Requester communicating with a v1.3.1-rc Compute node and with a v1.3.0 Compute node, and a v1.3.1-rc Requester communicating with a v1.3.1-rc Compute node and with a v1.3.0 Compute node. It is therefore backwards and forwards compatible.

Member

@wdbaruni wdbaruni left a comment

Few minor comments

Comment on lines +7 to +8
// TODO if we ever pass a pointer to this type and use `==` comparison on it we're gonna have a bad time
// implement an `Equal()` method for this type and default to it.
Member

this sounds bad. can you open an issue to track this TODO?
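
The pitfall behind that TODO can be shown in a few lines. The `connectionState` struct below is a hypothetical stand-in, since the actual type is not visible in this thread. Comparing two pointers with `==` compares addresses, so two values with identical contents compare as unequal, whereas an `Equal()` method compares the contents.

```go
package main

import "fmt"

// connectionState is a hypothetical struct-based enum value.
type connectionState struct {
	ordinal int
	name    string
}

// Equal compares the pointed-to values rather than the addresses.
// Two nil pointers are considered equal; nil never equals non-nil.
func (c *connectionState) Equal(other *connectionState) bool {
	if c == nil || other == nil {
		return c == other
	}
	return *c == *other
}

func main() {
	a := &connectionState{ordinal: 1, name: "CONNECTED"}
	b := &connectionState{ordinal: 1, name: "CONNECTED"}

	fmt.Println(a == b)     // false: different addresses
	fmt.Println(*a == *b)   // true: same field values
	fmt.Println(a.Equal(b)) // true: value comparison via the method
}
```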

Comment on lines +71 to +74
type livenessContainer struct {
CONNECTED NodeConnectionState
DISCONNECTED NodeConnectionState
HEALTHY NodeConnectionState
Member

What is HEALTHY?

@@ -85,7 +86,7 @@ type NATSTransport struct {
 natsClient *nats_helper.ClientManager
 computeProxy compute.Endpoint
 callbackProxy compute.Callback
-nodeInfoPubSub pubsub.PubSub[models.NodeInfo]
+nodeInfoPubSub pubsub.PubSub[models.NodeState]
Member

shouldn't PubSub be about NodeInfo and not NodeState?

 resourceMap *concurrency.StripedMap[models.Resources]
 heartbeats *heartbeat.HeartbeatServer
-defaultApprovalState models.NodeApproval
+defaultApprovalState models.NodeMembershipState
 }

 type NodeManagerParams struct {
 NodeInfo routing.NodeInfoStore
Member

should be Store

Comment on lines -14 to -17
type Chain struct {
discoverers []orchestrator.NodeDiscoverer
ignoreErrors bool
}
Member

I remember you saying at one point that you are removing all chain implementations and replacing them with simple arrays. While the Chain looks like a simple array, it implements NodeDiscoverer interface which allows different components to just accept the interface rather than an array of interfaces. This simplifies testing, error handling (e.g. ignoreErrors), and can allow chain implementations with AND or OR logic.

@frrist frrist merged commit 3b3d8a8 into main Apr 18, 2024
12 checks passed
@frrist frrist deleted the frrist/fix/node-info-state branch April 18, 2024 05:31
@wdbaruni
Member

wdbaruni commented Apr 18, 2024

Ahh, I didn't see that auto-merge was enabled. Anyway, there are a few comments that can be addressed in a follow-up PR if they are valid.

aronchick pushed a commit that referenced this pull request Apr 27, 2024
- fixes #3783
- Introduces NodeState type used to track NodeInfo, Connection, and Membership. Removes the idea of Connection and Membership from data sent by compute nodes to the Requester(s) since compute nodes should not
influence their Connection state or Membership. Those are values related to the
Requester's view of the network.

---------

Co-authored-by: frrist <forrest@expanso.io>
Development

Successfully merging this pull request may close these issues.

Compute Nodes broadcast NodeInfo with unknown approval status which overrides previous approvals/rejections
3 participants