bacalhau node list returns error failed request: invalid node type: nodeTypeUndefined #4024

Closed
frrist opened this issue May 22, 2024 · 4 comments · Fixed by #4029
Labels
type/bug Type: Something is not working as expected

frrist commented May 22, 2024

Bug Description

See title

Expected Behavior

It lists the nodes

Steps to Reproduce

  1. install main
  2. run a server
  3. list nodes
  4. see error

Bacalhau Versions

  • Agent Version: v1.3.1-rc1
  • CLI Client Version: v1.3.1-rc1

Host Environment

Provide details about the environment where the bug occurred:

  • Operating System: linux
  • CPU Architecture: x86
frrist added the type/bug and request/new labels and removed the request/new label May 22, 2024
frrist self-assigned this May 22, 2024

frrist commented May 22, 2024

I am working from the staging cluster, which contains 4 nodes. It appears there is an extra (5th) node somewhere in the store with an undefined type, which causes these errors:

curl http://bootstrap.staging.bacalhau.org:1234/api/v1/orchestrator/nodes
{
  "NextToken": "",
  "Nodes": [
    {
      "Info": {
        "NodeID": "",
        "NodeType": "nodeTypeUndefined",
        "Labels": null,
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "",
      "Connection": "DISCONNECTED"
    },
    {
      "Info": {
        "NodeID": "QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        },
        "ComputeNodeInfo": {
          "ExecutionEngines": [
            "docker",
            "wasm"
          ],
          "Publishers": [
            "s3",
            "local",
            "noop",
            "ipfs"
          ],
          "StorageSources": [
            "inline",
            "repoclone",
            "repoclonelfs",
            "s3",
            "ipfs",
            "urldownload"
          ],
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83047314227
          },
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83047314227
          },
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83047314227
          },
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        },
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "",
      "Connection": "CONNECTED"
    },
    {
      "Info": {
        "NodeID": "QmVHCeiLzhFJPCyCj5S1RTAk1vBEvxd8r5A6E4HyJGQtbJ",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        },
        "ComputeNodeInfo": {
          "ExecutionEngines": [
            "docker",
            "wasm"
          ],
          "Publishers": [
            "noop",
            "ipfs",
            "s3",
            "local"
          ],
          "StorageSources": [
            "inline",
            "repoclone",
            "repoclonelfs",
            "s3",
            "ipfs",
            "urldownload"
          ],
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83046763724
          },
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83046763724
          },
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 13406204723,
            "Disk": 83046763724
          },
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        },
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "",
      "Connection": "CONNECTED"
    },
    {
      "Info": {
        "NodeID": "Qma5yQAkEDWKBUZd3G4YRpvv5qBMpKvFywR7sqB34LB2Aw",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "GPU-0": "Tesla-T4",
          "GPU-0-Memory": "15360-MiB",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        },
        "ComputeNodeInfo": {
          "ExecutionEngines": [
            "docker",
            "wasm"
          ],
          "Publishers": [
            "local",
            "noop",
            "ipfs",
            "s3"
          ],
          "StorageSources": [
            "repoclonelfs",
            "s3",
            "ipfs",
            "urldownload",
            "inline",
            "repoclone"
          ],
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 12560636313,
            "Disk": 32934753075,
            "GPU": 1,
            "GPUs": [
              {
                "Index": 0,
                "Name": "Tesla T4",
                "Vendor": "NVIDIA",
                "Memory": 15360,
                "PCIAddress": ""
              }
            ]
          },
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 12560636313,
            "Disk": 32934753075,
            "GPU": 1,
            "GPUs": [
              {
                "Index": 0,
                "Name": "Tesla T4",
                "Vendor": "NVIDIA",
                "Memory": 15360,
                "PCIAddress": ""
              }
            ]
          },
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 12560636313,
            "Disk": 32934753075,
            "GPU": 1,
            "GPUs": [
              {
                "Index": 0,
                "Name": "Tesla T4",
                "Vendor": "NVIDIA",
                "Memory": 15360,
                "PCIAddress": ""
              }
            ]
          },
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        },
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "",
      "Connection": "CONNECTED"
    },
    {
      "Info": {
        "NodeID": "QmafZ9oCXCJZX9Wt1nhrGS9FVVq41qhcBRSNWCkVhz3Nvv",
        "NodeType": "Compute",
        "Labels": {
          "Architecture": "amd64",
          "Operating-System": "linux",
          "git-lfs": "false",
          "owner": "bacalhau"
        },
        "ComputeNodeInfo": {
          "ExecutionEngines": [
            "docker",
            "wasm"
          ],
          "Publishers": [
            "s3",
            "local",
            "noop",
            "ipfs"
          ],
          "StorageSources": [
            "ipfs",
            "urldownload",
            "inline",
            "repoclone",
            "repoclonelfs",
            "s3"
          ],
          "MaxCapacity": {
            "CPU": 3.2,
            "Memory": 13406208000,
            "Disk": 79441883955
          },
          "QueueCapacity": {},
          "AvailableCapacity": {
            "CPU": 3.2,
            "Memory": 13406208000,
            "Disk": 79441883955
          },
          "MaxJobRequirements": {
            "CPU": 3.2,
            "Memory": 13406208000,
            "Disk": 79441883955
          },
          "RunningExecutions": 0,
          "EnqueuedExecutions": 0
        },
        "BacalhauVersion": {
          "GitVersion": "",
          "GitCommit": "",
          "BuildDate": "0001-01-01T00:00:00Z",
          "GOOS": "",
          "GOARCH": ""
        }
      },
      "Membership": "APPROVED",
      "Connection": "CONNECTED"
    }
  ]
}
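
A minimal Go sketch for spotting such entries in a response like the one above. It decodes only the fields it needs and flags any entry with an empty NodeID or the nodeTypeUndefined type. The nodeList type and invalidNodeIndexes function are hypothetical names for illustration and are not part of the Bacalhau client:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeList mirrors only the response fields this check needs.
type nodeList struct {
	Nodes []struct {
		Info struct {
			NodeID   string `json:"NodeID"`
			NodeType string `json:"NodeType"`
		} `json:"Info"`
		Connection string `json:"Connection"`
	} `json:"Nodes"`
}

// invalidNodeIndexes (hypothetical helper) returns the positions of entries
// that have no NodeID or that report the undefined node type.
func invalidNodeIndexes(body []byte) ([]int, error) {
	var list nodeList
	if err := json.Unmarshal(body, &list); err != nil {
		return nil, fmt.Errorf("decoding node list: %w", err)
	}
	var bad []int
	for i, n := range list.Nodes {
		if n.Info.NodeID == "" || n.Info.NodeType == "nodeTypeUndefined" {
			bad = append(bad, i)
		}
	}
	return bad, nil
}

func main() {
	// Trimmed-down copy of the first two entries from the response above.
	body := []byte(`{"Nodes":[
		{"Info":{"NodeID":"","NodeType":"nodeTypeUndefined"},"Connection":"DISCONNECTED"},
		{"Info":{"NodeID":"QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K","NodeType":"Compute"},"Connection":"CONNECTED"}
	]}`)
	bad, err := invalidNodeIndexes(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(bad) // prints [0]
}
```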

frrist commented May 22, 2024

I believe this may be related to previous state in the cluster's node store: filtering for connected nodes works, but filtering for disconnected nodes does not:

export BACALHAU_API_HOST=bootstrap.staging.bacalhau.org
frrist@cypress ~> bacalhau node list --filter-status=connected
 ID        TYPE     APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
 QmRr9qPT  Compute            CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 QmVHCeiL  Compute            CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
 Qma5yQAk  Compute            CONNECTED  Architecture=amd64 GPU-0-Memory=15360-MiB           3.2 /   11.7 GB /   30.7 GB /    1 /  
                                         GPU-0=Tesla-T4 Operating-System=linux               3.2     11.7 GB     30.7 GB      1    
                                         git-lfs=false owner=bacalhau                                                              
 QmafZ9oC  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   74.0 GB /    0 /  
                                         git-lfs=false owner=bacalhau                        3.2     12.5 GB     74.0 GB      0    

frrist@cypress ~> bacalhau node list --filter-status=disconnected
Error: failed request: invalid node type: nodeTypeUndefined
Usage:
  bacalhau node list [flags]

Flags:
      --filter-approval string   Filter nodes by approval. One of: ["approved" "pending" "rejected"]
      --filter-status string     Filter nodes by status. One of: ["connected" "disconnected"]
  -h, --help                     help for list
      --hide-header              do not print the column headers.
      --labels string            Filter nodes by labels. See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for more information.
      --limit uint32             Limit the number of results returned
      --next-token string        Next token to use for pagination
      --no-style                 remove all styling from table output.
      --order-by string          Order results by a field. Valid fields are: id, type, available_cpu, available_memory, available_disk, available_gpu, status
      --order-reversed           Reverse the order of the results
      --output format            The output format for the command (one of ["table" "csv" "json" "yaml"]) (default table)
      --pretty                   Pretty print the output. Only applies to json and yaml output formats.
      --show strings             What column groups to show. Zero or more of: ["labels" "version" "features" "capacity"] (default [labels,capacity])
      --wide                     Print full values in the table results

Global Flags:
      --api-host string         The host for the client and server to communicate on (via REST).
                                Ignored if BACALHAU_API_HOST environment variable is set. (default "bootstrap.production.bacalhau.org")
      --api-port int            The port for the client and server to communicate on (via REST).
                                Ignored if BACALHAU_API_PORT environment variable is set. (default 1234)
      --cacert string           The location of a CA certificate file when self-signed certificates
                                	are used by the server
      --insecure                Enables TLS but does not verify certificates
      --log-mode logging-mode   Log format: 'default','station','json','combined','event' (default default)
      --repo string             path to bacalhau repo (default "/home/frrist/.bacalhau")
      --tls                     Instructs the client to use TLS

failed request: invalid node type: nodeTypeUndefined

frrist commented May 22, 2024

I have performed the following operations on each node in the cluster to remove the invalid node from the state of the requester node:

bacalhau-vm-stage-0 (requester+compute)

systemctl stop bacalhau
rm /data/compute_store/QmafZ9oCXCJZX9Wt1nhrGS9FVVq41qhcBRSNWCkVhz3Nvv.registration.lock
rm /data/orchestrator_store/nats-store
systemctl start bacalhau

bacalhau-vm-stage-1 (compute)

systemctl stop bacalhau
rm compute_store/QmVHCeiLzhFJPCyCj5S1RTAk1vBEvxd8r5A6E4HyJGQtbJ.registration.lock
systemctl start bacalhau

bacalhau-vm-stage-2 (compute)

systemctl stop bacalhau
rm compute_store/QmRr9qPTe4mU7aS9faKnWgvn1NtXt36FT8YUULRPCn2f3K.registration.lock
systemctl start bacalhau

bacalhau-vm-stage-3 (compute)

systemctl stop bacalhau
rm compute_store/Qma5yQAkEDWKBUZd3G4YRpvv5qBMpKvFywR7sqB34LB2Aw.registration.lock
systemctl start bacalhau

The node list command is now working as expected:

frrist@cypress ~> bacalhau node list
ID        TYPE     APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
QmRr9qPT  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                        git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
QmVHCeiL  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                        git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
Qma5yQAk  Compute  APPROVED  CONNECTED  Architecture=amd64 GPU-0-Memory=15360-MiB           3.2 /   11.7 GB /   30.7 GB /    1 /  
                                        GPU-0=Tesla-T4 Operating-System=linux               3.2     11.7 GB     30.7 GB      1    
                                        git-lfs=false owner=bacalhau                                                              
QmafZ9oC  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   74.0 GB /    0 /  
                                        git-lfs=false owner=bacalhau                        3.2     12.5 GB     74.0 GB      0    

frrist@cypress ~> bacalhau node list --filter-status=connected
ID        TYPE     APPROVAL  STATUS     LABELS                                              CPU     MEMORY      DISK         GPU  
QmRr9qPT  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                        git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
QmVHCeiL  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   77.3 GB /    0 /  
                                        git-lfs=false owner=bacalhau                        3.2     12.5 GB     77.3 GB      0    
Qma5yQAk  Compute  APPROVED  CONNECTED  Architecture=amd64 GPU-0-Memory=15360-MiB           3.2 /   11.7 GB /   30.7 GB /    1 /  
                                        GPU-0=Tesla-T4 Operating-System=linux               3.2     11.7 GB     30.7 GB      1    
                                        git-lfs=false owner=bacalhau                                                              
QmafZ9oC  Compute  APPROVED  CONNECTED  Architecture=amd64 Operating-System=linux           3.2 /   12.5 GB /   74.0 GB /    0 /  
                                        git-lfs=false owner=bacalhau                        3.2     12.5 GB     74.0 GB      0    

frrist@cypress ~> bacalhau node list --filter-status=disconnected
ID  TYPE  APPROVAL  STATUS  LABELS  CPU  MEMORY  DISK  GPU 

frrist commented May 22, 2024

The cause of this issue relates to changes in the state contained within the NodeStore (NATS KV store) between v1.3.0 and v1.3.1-rc1.

In v1.3.0 the NodeStore operates over, and contains, NodeInfo:

func (n *NodeStore) Add(ctx context.Context, nodeInfo models.NodeInfo) error {
	data, err := json.Marshal(nodeInfo)
	if err != nil {
		return errors.Wrap(err, "failed to marshal node info adding to node store")
	}
	_, err = n.kv.Put(ctx, nodeInfo.ID(), data)
	if err != nil {
		return errors.Wrap(err, "failed to write node info to node store")
	}
	return nil
}

type NodeInfo struct {
	BacalhauVersion BuildVersionInfo  `json:"BacalhauVersion"`
	PeerInfo        peer.AddrInfo     `json:"PeerInfo"`
	NodeType        NodeType          `json:"NodeType"`
	Labels          map[string]string `json:"Labels"`
	ComputeNodeInfo *ComputeNodeInfo  `json:"ComputeNodeInfo"`
}

In v1.3.1-rc1 the NodeStore operates over, and contains, NodeState:

func (n *NodeStore) Add(ctx context.Context, state models.NodeState) error {
	data, err := json.Marshal(state)
	if err != nil {
		return pkgerrors.Wrap(err, "failed to marshal node state adding to node store")
	}
	_, err = n.kv.Put(ctx, state.Info.ID(), data)
	if err != nil {
		return pkgerrors.Wrap(err, "failed to write node state to node store")
	}
	return nil
}

type NodeState struct {
	Info       NodeInfo            `json:"Info"`
	Membership NodeMembershipState `json:"Membership"`
	Connection NodeConnectionState `json:"Connection"`
}

NodeInfo cannot be meaningfully unmarshaled into a NodeState type, which is why the list shows a node with undefined fields. It is data from v1.3.0, still contained in the store, that no longer meets the requirements of v1.3.1-rc1.
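
The mismatch can be reproduced in isolation. The sketch below uses simplified stand-ins for the two types (assumption: plain string fields rather than the project's enum types, and only a few fields), and decodeAsState is a hypothetical helper. Go's encoding/json ignores object keys that have no matching struct field, so decoding the old layout into the new type reports no error and leaves every field at its zero value:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the real models types (assumption: strings
// instead of the project's enum types, and only a few fields).
type NodeInfo struct {
	NodeID   string            `json:"NodeID"`
	NodeType string            `json:"NodeType"`
	Labels   map[string]string `json:"Labels"`
}

type NodeState struct {
	Info       NodeInfo `json:"Info"`
	Membership string   `json:"Membership"`
	Connection string   `json:"Connection"`
}

// decodeAsState (hypothetical helper) decodes a stored value the way a
// reader that expects the NodeState layout would.
func decodeAsState(data []byte) (NodeState, error) {
	var s NodeState
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	// A value written in the old layout: NodeInfo at the top level.
	old, err := json.Marshal(NodeInfo{NodeID: "node-a", NodeType: "Compute"})
	if err != nil {
		panic(err)
	}
	state, err := decodeAsState(old)
	fmt.Println(err)                                              // prints <nil>
	fmt.Printf("%q %q\n", state.Info.NodeID, state.Info.NodeType) // prints "" ""
}
```

Because no error is raised at decode time, the empty entry only surfaces later, when something validates the node type.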

How did we get here?
After v1.3.0 was released several changes were made to the NodeInfo type:

  1. An Approval field was added to track node membership
  2. A State field was added to track a node's connection state
  3. Shortly after, a bug was discovered in the logic of the changes mentioned in 1. and 2.
  4. A fix was created and merged to address the bug: fix: remove Liveness and Acceptance from NodeInfo #3785
  • The fix was validated for compatibility at the protocol level, meaning:
    • v1.3.0 Requester communicating with a v1.3.0 Compute
    • v1.3.0 Requester communicating with a v1.3.1-rc1 Compute
    • v1.3.1-rc1 Requester communicating with a v1.3.0 Compute
    • v1.3.1-rc1 Requester communicating with a v1.3.1-rc1 Compute

The problem is that nothing validated that a v1.3.1-rc1 Requester could open a store written by a v1.3.0 Requester. The fix appears to be one of:

  1. Implement a repo migration that deletes the kv store from the requester, and also removes the sentinel file compute nodes use to track registration. This will force the requester to create a new node store, and ensure compute nodes previously connected to it re-register. Implement repo version 4 with migration #4030
  2. Write a migration for the requester node's NodeInfo store. Given the requirement that a NATS transport be available to access the store, this solution is more complicated and ends up being pretty ugly in practice. fix: implement NodeStore migration #4029
  3. Tell users to manually delete their requester NodeStore, remove their compute node registration file, and then restart their compute nodes (no one is going to love this)
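
As a rough illustration of what the per-entry part of option 2 involves, the sketch below converts a stored value from the old layout to the new one. It is not the code from #4029: the types are simplified stand-ins, upgradeEntry is a hypothetical helper, and the default Membership and Connection values are assumptions chosen for the example. It also leaves out reading from and writing back to the NATS kv store, which is the part described above as complicated:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the real types (assumption: plain strings
// instead of the project's enum types, and only a few fields).
type NodeInfo struct {
	NodeID   string            `json:"NodeID"`
	NodeType string            `json:"NodeType"`
	Labels   map[string]string `json:"Labels"`
}

type NodeState struct {
	Info       NodeInfo `json:"Info"`
	Membership string   `json:"Membership"`
	Connection string   `json:"Connection"`
}

// upgradeEntry is a hypothetical helper. It returns the entry as a
// NodeState and reports whether the stored value used the old layout.
func upgradeEntry(data []byte) (NodeState, bool, error) {
	// Inspect the top-level keys to tell the two layouts apart.
	var keys map[string]json.RawMessage
	if err := json.Unmarshal(data, &keys); err != nil {
		return NodeState{}, false, fmt.Errorf("decoding stored entry: %w", err)
	}
	if _, ok := keys["Info"]; ok {
		var state NodeState
		if err := json.Unmarshal(data, &state); err != nil {
			return NodeState{}, false, fmt.Errorf("decoding node state: %w", err)
		}
		return state, false, nil
	}
	var info NodeInfo
	if err := json.Unmarshal(data, &info); err != nil {
		return NodeState{}, false, fmt.Errorf("decoding node info: %w", err)
	}
	// Assumed defaults for the fields the old layout never stored.
	return NodeState{Info: info, Membership: "PENDING", Connection: "DISCONNECTED"}, true, nil
}

func main() {
	old := []byte(`{"NodeID":"node-a","NodeType":"Compute","Labels":null}`)
	state, migrated, err := upgradeEntry(old)
	if err != nil {
		panic(err)
	}
	fmt.Println(migrated, state.Info.NodeType, state.Connection) // prints true Compute DISCONNECTED
}
```

Checking for the top-level Info key, rather than decoding and testing for zero values, keeps an entry that is already in the new layout from being misread as an old one.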

frrist pushed a commit that referenced this issue May 22, 2024
frrist pushed a commit that referenced this issue May 23, 2024
frrist pushed a commit that referenced this issue May 23, 2024
- fixes #4024
- this ensures that each time a compute node is started it attempts to
  register itself with the requester. This is important since in the
  event a requester loses its state, compute nodes will re-register
  themselves with it. If they have already registered with a requester
  node, registering again is idempotent.
- pairing this with the parent commit regarding the V3 Migration is
  required.
frrist added a commit that referenced this issue May 23, 2024
- fixes #4024

---------

Co-authored-by: frrist <forrest@expanso.io>