
Netlink: Received message overrun (No buffer space available) #392

Closed
jslocinski opened this issue Aug 8, 2016 · 12 comments

@jslocinski

jslocinski commented Aug 8, 2016

Opening a new ticket for clarity; I mentioned this issue in #390.
How can I tune the buffer size? I tried increasing the buffers system-wide via sysctl, but it didn't help.
The message appears when running dozens or hundreds of instances on different interfaces, with or without the VMAC option enabled. It starts at around 15 instances (with the vmac setup) or ~200 instances (without vmac). I start seeing the error at the point of becoming MASTER and sending GARPs.
How can I tune this properly?

PS.
Tested on 1.2.22 and 1.2.23. Tried rmem_max/rmem_default buffer sizes from dozens of kB up to hundreds of MB; same result.

Best regards!

@pqarmitage
Collaborator

Could you please post both your with-VMAC and without-VMAC configs (or the scripts that generate them), so that we can try to reproduce the problem. If you want to anonymise IP addresses etc., please just replace the first octet with 10; if addresses are replaced with XXXs, we can't see the relationships between them.

It is interesting that you only seem to need a relatively small number of VMACs to generate the problem, compared to the number of non-VMAC interfaces.

I have a number of thoughts about what might be worth trying, but it would be really helpful to see your configs first to make sure that what I suggest makes sense.

@jslocinski
Author

jslocinski commented Aug 8, 2016

Sure,
simple scripts:

test_keepalived.sh

#!/bin/bash
if [ $# -ne 2 ]; then
  echo "USAGE: $0 (up|down) NBR_INSTANCES"
  exit 1
fi

N=$2

if [ "$1" == "up" ]; then
  cat <<EOF >/run/conf/keepalived.conf
global_defs {
}
EOF
  ip link add bondZ type dummy
  ip link set up dev bondZ
elif [ "$1" == "down" ]; then
  service keepalived stop
  unlink /run/conf/keepalived.conf
fi

for i in $(seq 1 $N); do
  ifi=bondZ.$i
  if [ "$1" == "up" ]; then
    ip link add link bondZ name $ifi type vlan id $i
    ip link set up dev $ifi
    cat /root/keepalived.template.test | sed s"/__ID__/$i/g" >> /run/conf/keepalived.conf
  fi
done

if [ "$1" == "up" ]; then
  sleep 1
  service keepalived start
elif [ "$1" == "down" ]; then
  ip link del bondZ
fi

/root/keepalived.template.test:

vrrp_instance 4__ID__ {
    state BACKUP
    interface bondZ.__ID__
    use_vmac vrrp.__ID__.1
    virtual_router_id 1
    priority 120
    advert_int 2
    preempt_delay 60
    virtual_ipaddress {
        169.254.__ID__.254/24
    }
    mcast_src_ip 169.254.__ID__.254
    unicast_src_ip 169.254.__ID__.254
}

testing:

./test_keepalived.sh up 20
./test_keepalived.sh down 20

comment out "vmac" option in the template, then:

./test_keepalived.sh up 210
./test_keepalived.sh down 210

tail -f /var/log/syslog | grep 'No buffer space'

PS. Testing on dummy interfaces to simplify and isolate, but I get the same results on a physical bond or ethX.

Regards

@pqarmitage
Collaborator

From looking at your config, the global vrrp_garp_master_delay and vrrp_garp_master_repeat are both set to the default, which is 5. This means that every time an interface comes up, 5 gratuitous ARP messages are sent immediately on that interface, and 5 seconds later another 5 messages are sent. With 200 interfaces, all coming up together, this means that 1000 GARP messages are sent near simultaneously, and a further 1000 GARP messages are sent near simultaneously 5 seconds later.

I would recommend setting vrrp_garp_master_repeat to 1 (so only 1 GARP message at a time is sent when an interface comes up), and set vrrp_garp_master_delay to 0 (so a second message isn't sent 5 seconds later for each interface).

If you are still getting problems, then the GARP messages can be rate limited. Add a garp_group block, as follows:
garp_group {
    garp_interval 0.001
    gna_interval 0.001
    interfaces {
        bondZ.1
        bondZ.2
        .
        .
        bondZ.200    # include all your bondZ.n interfaces in the list
    }
}

This will rate limit GARP messages to one every 0.001 seconds, aggregated across all the bondZ.n interfaces, so sending one GARP message for each of 200 interfaces will take 0.2 seconds.

You can adjust the garp_interval setting to see what (if any) value helps resolve your problem.

It would be most helpful to have feedback for what you find.

@jslocinski
Author

jslocinski commented Aug 10, 2016

From what I can see now, the "No buffer space available" error appears just after entering BACKUP state and the creation of the vrrp interfaces (for 250 instances); the GARPs are only sent a while later.
The changes you suggested don't solve the issue; I tried even with intervals of 1s, with the same result.

It seems this is not really related to GARPs, but to the number of interfaces and the amount of change happening at once.

Regards

@pqarmitage
Collaborator

The "Keepalived_vrrp[]: Netlink: Received message overrun (No buffer space available)" message isn't a kernel buffers problem. It relates to the netlink broadcast messages sent from the kernel to keepalived for LINK and ADDR events (see kernel_netlink_init() in vrrp_netlink.c), which is the kernel notifying keepalived of links and addresses being added and deleted. keepalived isn't reading the messages from the socket quickly enough, so the receive buffers on the socket become exhausted, which causes recvmsg() in netlink_parse_info() to return ENOBUFS.

If I add the following after the call to netlink_socket(&nl_kernel, ...) in kernel_netlink_init():

    int rcv_size = 2 * 1024 * 1024;
    if (setsockopt(nl_kernel.fd, SOL_SOCKET, SO_RCVBUF, &rcv_size, sizeof rcv_size))
        log_message(LOG_INFO, "Set receive socket buffers failed - %d (%m)", errno);

then I stop getting the overrun message with 50 instances using vmacs, although it doesn't stop the error with 200 non-vmac instances. Nevertheless, the fact that the above stops the error with 50 vmac instances shows that this is where the problem lies.

Clearly the above simplistic approach isn't a solution to the problem, but at least we've found the right area. I suspect the solution is to call netlink_parse_info() while the vmacs and addresses are being added, but this needs further investigation.

Many thanks for reporting this and providing the detailed info. I'll continue to work on this to find a solution.

@pqarmitage
Collaborator

The commit in the above pull request resolves the issue by polling for netlink messages after each interface is added, thereby stopping the large queue of messages from building up, which was what was causing the message overrun.

I have tested this with 250 vmac interfaces, and also with 250 non-vmac interfaces, and no longer receive the Received message overrun message.

@acassen acassen closed this as completed Aug 14, 2016
@jslocinski
Author

jslocinski commented Aug 16, 2016

I couldn't compile the latest git version, but I ported everything you merged relating to kernel_netlink_poll() to the latest stable 1.2.23. The result is BETTER :)
I ended up with:

  • 200 instances w. vmac: OK
  • 250 instances w. vmac: sometimes I still get an error

But the weirdest part is when I add "notify_master" (almost empty); then:

  • 200 instances w. vmac, calling the notify_master script (no matter what the script does): errors in 95% of cases

the notify line:
notify_master "/tmp/null.sh"

the script itself (/tmp/null.sh, executable):

#!/bin/bash
echo 1 > /dev/null
exit 0

I don't know if this helps, but here is the strace output near the "No buffer space available" error:

sendto(4, "<143>Aug 16 16:55:25 Keepalived_"..., 78, MSG_NOSIGNAL, NULL, 0) = 78
open("/tmp/null.sh", O_RDONLY)          = 411
close(411)                              = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f37e99db9d0) = 122184
sendmsg(406, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("224.0.0.18")}, msg_iov(1)=[{"E\300\0(\0\2\0\0\377pi\224\251\376\306\376\340\0\0\22!\1x\1\0\5\365\372\251\376\306\376"..., 40}], msg_controllen=0, msg_flags=0}, MSG_DONTROUTE) = 40
sendto(4, "<142>Aug 16 16:55:25 Keepalived_"..., 87, MSG_NOSIGNAL, NULL, 0) = 87
sendmsg(8, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{" \0\0\0\24\0\5\0\223+\263W\0\0\0\0\2\30\0\0\332\213\0\0\10\0\2\0\251\376\270\376", 32}], msg_controllen=0, msg_flags=0}, 0) = 32
--- SIGCHLD (Child exited) @ 0 (0) ---
write(6, "\21\0\0\0", 4)                = 4
rt_sigreturn(0x6)                       = 32
fcntl(8, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(8, F_SETFL, O_RDWR)               = 0
recvmsg(8, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"$\0\0\0\2\0\0\0\223+\263W\207\210\377\377\0\0\0\0 \0\0\0\24\0\5\0\223+\263W"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 36
fcntl(8, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
sendto(9, "\377\377\377\377\377\377\0\0^\0\1\1\10\6\0\1\10\0\6\4\0\1\0\0^\0\1\1\251\376\270\376"..., 42, 0, {sa_family=AF_PACKET, proto=0000, if35802, pkttype=PACKET_HOST, addr(6)={0, 00005e000101}, 20) = 42
sendto(4, "<143>Aug 16 16:55:25 Keepalived_"..., 78, MSG_NOSIGNAL, NULL, 0) = 78
--- SIGCHLD (Child exited) @ 0 (0) ---
write(6, "\21\0\0\0", 4)                = 4
rt_sigreturn(0x6)                       = 78
open("/tmp/null.sh", O_RDONLY)          = 411
close(411)                              = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f37e99db9d0) = 122186
sendmsg(378, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("224.0.0.18")}, msg_iov(1)=[{"E\300\0(\0\2\0\0\377pw\224\251\376\270\376\340\0\0\22!\1x\1\0\5\3\373\251\376\270\376"..., 40}], msg_controllen=0, msg_flags=0}, MSG_DONTROUTE) = 40
recvmsg(7, 0x7ffe72239380, 0)           = -1 ENOBUFS (No buffer space available)
sendto(4, "<142>Aug 16 16:55:25 Keepalived_"..., 109, MSG_NOSIGNAL, NULL, 0) = 109
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\314*\263W\207\210\377\377\2\30\200\0)\213\0\0\10\0\1\0\251\376\7\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\315*\263W\207\210\377\377\2\30\200\0(\213\0\0\10\0\1\0\251\376\6\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\316*\263W\207\210\377\377\2\30\200\0'\213\0\0\10\0\1\0\251\376\5\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\317*\263W\207\210\377\377\2\30\200\0%\213\0\0\10\0\1\0\251\376\3\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\320*\263W\207\210\377\377\2\30\200\0#\213\0\0\10\0\1\0\251\376\1\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\321*\263W\207\210\377\377\2\30\200\0005\213\0\0\10\0\1\0\251\376\23\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\322*\263W\207\210\377\377\2\30\200\0004\213\0\0\10\0\1\0\251\376\22\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\323*\263W\207\210\377\377\2\30\200\0003\213\0\0\10\0\1\0\251\376\21\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\324*\263W\207\210\377\377\2\30\200\0001\213\0\0\10\0\1\0\251\376\17\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\325*\263W\207\210\377\377\2\30\200\0000\213\0\0\10\0\1\0\251\376\16\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\326*\263W\207\210\377\377\2\30\200\0/\213\0\0\10\0\1\0\251\376\r\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\327*\263W\207\210\377\377\2\30\200\0-\213\0\0\10\0\1\0\251\376\v\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\330*\263W\207\210\377\377\2\30\200\0,\213\0\0\10\0\1\0\251\376\n\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\331*\263W\207\210\377\377\2\30\200\0+\213\0\0\10\0\1\0\251\376\t\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\332*\263W\207\210\377\377\2\30\200\0&\213\0\0\10\0\1\0\251\376\4\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\333*\263W\207\210\377\377\2\30\200\0$\213\0\0\10\0\1\0\251\376\2\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
--- SIGCHLD (Child exited) @ 0 (0) ---
write(6, "\21\0\0\0", 4)                = 4
rt_sigreturn(0x6)                       = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\334*\263W\207\210\377\377\2\30\200\0D\213\0\0\10\0\1\0\251\376\"\376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84
recvmsg(7, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000010}, msg_iov(1)=[{"T\0\0\0\24\0\0\0\335*\263W\207\210\377\377\2\30\200\0B\213\0\0\10\0\1\0\251\376 \376"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 84

I don't know how to test this further. I've simplified all the cases (at the beginning I had the error with notify scripts calling sysctl, but as you can see above, that's not the cause).

Regards!

@pqarmitage
Collaborator

Do you have automake installed? If so, running automake --add-missing may resolve your build problem with the latest git version. If that doesn't resolve it, could you post details of the error here and I'll have a look.

I'll have a further look at the "No buffer space available" problem later.

@pqarmitage
Collaborator

Pull request #403 solves the particular cause of the netlink buffer overrun you were experiencing when transitioning to master. The problem was that adding a notify_master script means that, as each vrrp instance transitions to master, keepalived executes open(), close() and fork() system calls; the first and last of these are slow system calls, so the complexity of the script isn't the issue, since the script itself runs asynchronously. This meant that netlink RTM_NEWADDR messages were being buffered, and eventually the buffer overran.

I think there is the possibility of further such problems, although I haven't tested them yet. Perhaps you can. One possibility is with a notify_fault script if all vrrp instances simultaneously go to fault state, e.g. by executing ip link set down bondZ. Adding a notify script as well (so that two scripts are executed on a state transition) can only make the situation worse. Another possibility to cause the problem might be to have a very large number of VIPs/eVIPs on a vrrp instance, since netlink messages are generated every time an address is added or deleted.

Could you please reopen this issue report, since there are some scenarios that at the very least still need further testing.

@jslocinski
Author

jslocinski commented Aug 17, 2016

I don't think I can reopen the case; @acassen, would you?

@acassen acassen reopened this Aug 17, 2016
@pqarmitage
Collaborator

Commit 6255538 adds configuration options

    vrrp_netlink_cmd_rcv_bufs
    vrrp_netlink_cmd_rcv_bufs_force
    vrrp_netlink_monitor_rcv_bufs
    vrrp_netlink_monitor_rcv_bufs_force
    lvs_netlink_cmd_rcv_bufs
    lvs_netlink_cmd_rcv_bufs_force
    lvs_netlink_monitor_rcv_bufs
    lvs_netlink_monitor_rcv_bufs_force

to allow the netlink receive buffers to be increased in size in order to stop the ENOBUFS error.
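As an illustration of how these might be used (the values below are arbitrary examples, not recommendations, and the exact syntax should be checked against keepalived.conf(5)), the options go in the global_defs block; for this issue the monitor socket of the vrrp process is the relevant one:

```
global_defs {
    # bytes for the netlink socket receiving kernel LINK/ADDR broadcasts
    vrrp_netlink_monitor_rcv_bufs 1048576
    # use SO_RCVBUFFORCE so the size is not capped by net.core.rmem_max
    vrrp_netlink_monitor_rcv_bufs_force true
}
```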

@pqarmitage
Collaborator

Issue #803 indicates that this issue has been resolved.

Closing this issue, and if there are any further occurrences please log them against issue #803.
