Cannot start instances after adding second host #8810

Closed
Hector-Work opened this issue Mar 20, 2024 · 15 comments

@Hector-Work

ISSUE TYPE
  • Other
COMPONENT NAME
Networking
CLOUDSTACK VERSION
4.19.0.0
CONFIGURATION

Zone created with advanced networking, 1 physical network.

OS / ENVIRONMENT

Ubuntu 22.04, KVM

SUMMARY

I've had a CloudStack installation working well on a single host for a couple of weeks now. Yesterday I added another host to the cluster. When I went back to look at the instances all of them had stopped. Attempting to start them again resulted in an error. The same happened with the VRs, though the system VMs are fine.

This is the error in the management server logs whenever I try to start an instance:

2024-03-20 11:41:20,508 INFO  [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-36:ctx-7b172848 job-868/job-869 ctx-6d899296) (logid:698ffc95) Unable to start VM on Host {"id":1,"name":"comp1","type":"Routing","uuid":"2fca77e2-a9d5-4ca3-977f-9fc18e78bd0a"} due to Failed to create vnet 11:
Cannot find device "br-11"
/usr/share/cloudstack-common/scripts/vm/network/vnet/modifyvlan.sh: line 37: /proc/sys/net/ipv6/conf/br-11.11/disable_ipv6: No such file or directory
Cannot find device "br-11.11"
Failed to create vlan 11 on pif: br-11.
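For context, here is a rough sketch (not the actual CloudStack script) of what modifyvlan.sh is trying to do: create a VLAN sub-interface on the physical interface (pif) named by the traffic label, and enslave it to a per-VLAN bridge. The naming pattern matches what later shows up in ip a (enp1s0.15 enslaved to brenp1s0-15); the function below only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the device naming CloudStack uses when plugging a guest VLAN.
# This only PRINTS the commands; the real script also disables IPv6 on the
# new devices (the step that failed at line 37 in the log above).
print_vlan_setup() {
  pif="$1"; vid="$2"
  vdev="${pif}.${vid}"      # VLAN sub-interface, e.g. enp1s0.15
  vbr="br${pif}-${vid}"     # per-VLAN bridge,    e.g. brenp1s0-15
  echo "ip link add link ${pif} name ${vdev} type vlan id ${vid}"
  echo "ip link add name ${vbr} type bridge"
  echo "ip link set ${vdev} master ${vbr}"
  echo "ip link set ${vdev} up"
  echo "ip link set ${vbr} up"
}

print_vlan_setup enp1s0 15
```

The error above shows the same flow failing because the pif it resolved ("br-11") does not exist on the host.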

Any help to understand what's going wrong would be much appreciated.

STEPS TO REPRODUCE

EXPECTED RESULTS
New host is added, and instances continue running.
ACTUAL RESULTS
Error when attempting to start instances.

@weizhouapache
Member

@Hector-Work
It looks like the second host is misconfigured. Please check the network configuration on both hosts.

@Hector-Work
Author

Hector-Work commented Mar 20, 2024

Thanks @weizhouapache. Here are the network configs for each host. Host 1 was working fine until I added Host 2, yet the configs are the same.
Host 1:

network:
   version: 2
   renderer: NetworkManager
   ethernets:
     enp1s0:
       dhcp4: false
       dhcp6: false
       optional: true
   bridges:
     cloudbr0:
       addresses: [192.168.100.1/24]
       routes:
       - to: default
         via: 192.168.100.254
       nameservers:
         addresses: [1.1.1.1,8.8.8.8]
       interfaces: [enp1s0]
       dhcp4: false
       dhcp6: false
       parameters:
         stp: false
         forward-delay: 0

Host 2:

network:
   version: 2
   renderer: NetworkManager
   ethernets:
      enp1s0:
         dhcp4: false
         dhcp6: false
         optional: true
   bridges:
      cloudbr0:
         addresses: [192.168.100.2/24]
         routes:
         - to: default
           via: 192.168.100.254
         nameservers:
           addresses: [1.1.1.1,8.8.8.8]
         interfaces: [enp1s0]
         dhcp4: false
         dhcp6: false
         parameters:
            stp: false
            forward-delay: 0

@Hector-Work
Author

I may be misunderstanding this, but based on the log it looks like it's trying to create a new vnet interface from a bridge interface that doesn't exist, instead of using the cloudbr0 interface as it used to. I'm still no closer to working out why, though.

@weizhouapache
Member

weizhouapache commented Mar 20, 2024

enp1s0

@Hector-Work
Is the physical network interface on Host 2 also named enp1s0?

Can you share the output of ip a on both hosts?

@Hector-Work
Author

@weizhouapache , yes, both hosts have a physical interface called enp1s0.

I have actually just managed to solve this issue by renaming the cloudbr0 interface on Host 2 to cloudbr1. I updated the /etc/cloudstack/agent/agent.properties so that all the network device properties are pointing at cloudbr1 as well. Finally I restarted the agents on both hosts.

I don't know why this worked, but maybe it will help someone else in the future. Here is the output of ip a now that it's working for reference:
Host 1:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master cloudbr0 state UP group default qlen 1000
    link/ether e0:4f:43:e6:ad:b1 brd ff:ff:ff:ff:ff:ff
3: cloudbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3e:4a:81:c4:28:5b brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.1/24 brd 192.168.100.255 scope global noprefixroute cloudbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::3c4a:81ff:fec4:285b/64 scope link 
       valid_lft forever preferred_lft forever
4: cloud0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 56:f1:92:d7:a5:f2 brd ff:ff:ff:ff:ff:ff
    inet 169.254.0.1/16 scope global cloud0
       valid_lft forever preferred_lft forever
    inet6 fe80::54f1:92ff:fed7:a5f2/64 scope link 
       valid_lft forever preferred_lft forever
5: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cloud0 state UNKNOWN group default qlen 1000
    link/ether fe:00:a9:fe:76:10 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc00:a9ff:fefe:7610/64 scope link 
       valid_lft forever preferred_lft forever
6: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cloud0 state UNKNOWN group default qlen 1000
    link/ether fe:00:a9:fe:a0:9a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc00:a9ff:fefe:a09a/64 scope link 
       valid_lft forever preferred_lft forever
7: vnet3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cloudbr0 state UNKNOWN group default qlen 1000
    link/ether fe:00:a0:00:00:08 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc00:a0ff:fe00:8/64 scope link 
       valid_lft forever preferred_lft forever
8: vnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cloudbr0 state UNKNOWN group default qlen 1000
    link/ether fe:00:52:00:00:09 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc00:52ff:fe00:9/64 scope link 
       valid_lft forever preferred_lft forever
9: vnet4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cloudbr0 state UNKNOWN group default qlen 1000
    link/ether fe:00:e5:00:00:0a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc00:e5ff:fe00:a/64 scope link 
       valid_lft forever preferred_lft forever
10: vnet5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cloudbr0 state UNKNOWN group default qlen 1000
    link/ether fe:00:6f:00:00:0b brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc00:6fff:fe00:b/64 scope link 
       valid_lft forever preferred_lft forever
11: vnet6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cloudbr0 state UNKNOWN group default qlen 1000
    link/ether fe:00:ed:00:00:13 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc00:edff:fe00:13/64 scope link 
       valid_lft forever preferred_lft forever
12: enp1s0.15@enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master brenp1s0-15 state UP group default qlen 1000
    link/ether e0:4f:43:e6:ad:b1 brd ff:ff:ff:ff:ff:ff
13: brenp1s0-15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:43:fa:5a:86:92 brd ff:ff:ff:ff:ff:ff
14: vnet7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb master brenp1s0-15 state UNKNOWN group default qlen 1000
    link/ether fe:01:00:dc:00:11 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc01:ff:fedc:11/64 scope link 
       valid_lft forever preferred_lft forever
15: vnet8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cloud0 state UNKNOWN group default qlen 1000
    link/ether fe:00:a9:fe:34:c2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc00:a9ff:fefe:34c2/64 scope link 
       valid_lft forever preferred_lft forever
16: vnet9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb master cloudbr0 state UNKNOWN group default qlen 1000
    link/ether fe:00:de:00:00:17 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc00:deff:fe00:17/64 scope link 
       valid_lft forever preferred_lft forever
17: vnet10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb master brenp1s0-15 state UNKNOWN group default qlen 1000
    link/ether fe:01:00:dc:00:0e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc01:ff:fedc:e/64 scope link 
       valid_lft forever preferred_lft forever
18: vnet11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb master brenp1s0-15 state UNKNOWN group default qlen 1000
    link/ether fe:01:00:dc:00:10 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc01:ff:fedc:10/64 scope link 
       valid_lft forever preferred_lft forever
19: vnet12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb master brenp1s0-15 state UNKNOWN group default qlen 1000
    link/ether fe:01:00:dc:00:01 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc01:ff:fedc:1/64 scope link 
       valid_lft forever preferred_lft forever
20: vnet13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb master brenp1s0-15 state UNKNOWN group default qlen 1000
    link/ether fe:01:00:dc:00:0c brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc01:ff:fedc:c/64 scope link 
       valid_lft forever preferred_lft forever
21: vnet14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb master brenp1s0-15 state UNKNOWN group default qlen 1000
    link/ether fe:01:00:dc:00:05 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc01:ff:fedc:5/64 scope link 
       valid_lft forever preferred_lft forever

Host 2:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master cloudbr1 state UP group default qlen 1000
    link/ether e0:4f:43:e6:bf:8b brd ff:ff:ff:ff:ff:ff
4: cloud0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 46:be:6f:11:da:4a brd ff:ff:ff:ff:ff:ff
    inet 169.254.0.1/16 scope global cloud0
       valid_lft forever preferred_lft forever
    inet6 fe80::44be:6fff:fe11:da4a/64 scope link 
       valid_lft forever preferred_lft forever
6: cloudbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3e:78:5d:32:2c:f8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.2/24 brd 192.168.100.255 scope global noprefixroute cloudbr1
       valid_lft forever preferred_lft forever
    inet6 fe80::3c78:5dff:fe32:2cf8/64 scope link 
       valid_lft forever preferred_lft forever
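For reference, the agent.properties change described above amounts to roughly this (the three keys are the standard KVM agent network-device properties; the value reflects the renamed bridge):

```properties
# /etc/cloudstack/agent/agent.properties on Host 2
guest.network.device=cloudbr1
private.network.device=cloudbr1
public.network.device=cloudbr1
```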

@weizhouapache
Member

@Hector-Work
This may lead to unexpected errors later, for example:

  • connectivity errors
  • VM migration errors due to mismatched bridge names

I suggest using the same bridge name on both hosts. You could try:

  • adding macaddress: <mac of enp1s0> to cloudbr0 in the netplan config
  • and/or restarting Host 2
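The first suggestion would look roughly like this in Host 2's netplan file (a sketch; the MAC shown is Host 2's enp1s0 address taken from the ip a output above):

```yaml
    bridges:
      cloudbr0:
        interfaces: [enp1s0]
        macaddress: e0:4f:43:e6:bf:8b   # pin the bridge's MAC to enp1s0's
```

Pinning the bridge MAC avoids the bridge address changing as ports are added and removed, which can disrupt connectivity on the management network.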


@mlsorensen
Contributor

mlsorensen commented Mar 20, 2024

What is your KVM traffic label set to for the zone's physical network? It should match the bridge name you want to use on the hypervisor host. Specifically, the label for the guest traffic tells CloudStack which bridge to use when creating VLANs.

It will also have the effect of rejecting agent startup/join if this bridge cannot be resolved on the host.
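A quick way to check the second point on a host (a sketch; it only tests that a network device of that name exists, via sysfs):

```shell
#!/bin/sh
# Check whether a network device (e.g. the bridge named by the traffic
# label) exists on this host by looking for it in sysfs.
dev_exists() {
  [ -e "/sys/class/net/$1" ]
}

if dev_exists cloudbr0; then
  echo "cloudbr0: present"
else
  echo "cloudbr0: missing"
fi
```

If the device named by the traffic label is missing, the agent will fail to join as described above.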

@mlsorensen
Contributor

As to why moving to cloudbr1 fixed it, my guess is that there was something broken about cloudbr0 on Host 2, and the newly created cloudbr1 didn't have the same problem, whatever it was. I agree that running inconsistent configs across hosts will cause other problems down the road.

@Hector-Work
Author

Thank you both @mlsorensen @weizhouapache . I see what you're saying about the problems the inconsistent naming could cause. I will try setting it back to cloudbr0 and see if it breaks again.

To answer your question @mlsorensen , the kvmnetworklabel is currently set to "Use default gateway". Maybe my default interface was incorrect on Host 2 to begin with?

@weizhouapache
Member

Thank you both @mlsorensen @weizhouapache . I see what you're saying about the problems it could cause with the inconsistent naming, I will try setting it back to cloudbr0 and see if it breaks again.

To answer your question @mlsorensen , the kvmnetworklabel is currently set to Use default gateway. Maybe my default interface was incorrect on Host 2 to begin with?

@Hector-Work
"Use default gateway" means the kvmnetworklabel is not set, which should be OK (cloudbr0 will be used).

@Hector-Work
Author

I just tried setting Host 2 back to cloudbr0. It didn't take down the existing instances this time, but I can't run any instances on that host. I saw this error in the agent log:

2024-03-21 08:57:13,026 ERROR [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Connection with libvirtd is broken: invalid connection pointer in virConnectGetVersion
2024-03-21 08:57:13,143 ERROR [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) (logid:) Failed to get libvirt connection for domain event lifecycle

@weizhouapache
Member

I just tried setting Host 2 back to cloudbr0. It didn't take down the existing instances this time, but I can't run any instances on that host. I saw this error in the agent log:

2024-03-21 08:57:13,026 ERROR [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Connection with libvirtd is broken: invalid connection pointer in virConnectGetVersion
2024-03-21 08:57:13,143 ERROR [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) (logid:) Failed to get libvirt connection for domain event lifecycle

@Hector-Work
is the libvirtd service running well?

@Hector-Work
Author

@weizhouapache , I think that is the problem. It is running, but libvirtd is logging these errors:

Mar 21 08:57:27 comp2 libvirtd[303317]: Unable to read from monitor: Connection reset by peer
Mar 21 08:57:27 comp2 libvirtd[303317]: internal error: qemu unexpectedly closed the monitor: 2024-03-21T08:57:27.135774Z qemu-system-x86_64: -machine pc-i440fx-6.2,usb=off,dump-guest-core=o>
                                        Use -machine help to list supported machines
Mar 21 08:57:27 comp2 libvirtd[303317]: Failed to connect socket to '/var/run/libvirt/virtlxcd-sock': No such file or directory
Mar 21 08:57:27 comp2 libvirtd[303317]: End of file while reading data: Input/output error

@Hector-Work
Author

Hector-Work commented Mar 21, 2024

It seems I made a very simple mistake when setting up Host 2: I somehow forgot to install qemu-kvm, so libvirtd was of course having problems! It's installed now and instances are running on both hosts.
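A small check that would have caught this before adding the host (a sketch; the qemu-kvm package name assumes Ubuntu, as above):

```shell
#!/bin/sh
# Report whether a required binary is on PATH. On Ubuntu, qemu-system-x86_64
# comes from the qemu-kvm package, and libvirtd cannot start guests without it.
check_bin() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: present"
  else
    echo "$1: missing"
  fi
}

check_bin qemu-system-x86_64
```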

Thank you for your help. I will close the issue.
