netlink plugin not working on CentOS, RHEL #2510

Closed
mrunge opened this issue Oct 27, 2017 · 17 comments

@mrunge
Member

mrunge commented Oct 27, 2017

  • Version of collectd: 5.5.0, 5.7.1
    it is reproducible with collectd-5.7.1-2.el7.x86_64 from EPEL

  • Operating system / distribution:
    CentOS 7

Expected behavior

(Description of the behavior / output that you expected)

Actual behavior

Oct 27 08:40:02 localhost collectd[9317]: Initialization complete, entering read-loop.
Oct 27 08:40:02 localhost collectd[9317]: netlink plugin: link_filter_cb: IFLA_STATS64 mnl_attr_validate2 failed.
Oct 27 08:40:02 localhost collectd[9317]: netlink plugin: ir_read: mnl_socket_recvfrom failed.
Oct 27 08:40:02 localhost collectd[9317]: read-function of plugin `netlink' failed. Will suspend it for 20.000 seconds.

The config is pretty minimal:

LoadPlugin "netlink"
<Plugin "netlink">
 Interface "eth0"
</Plugin>

(Description of the behavior / output that you observed)

Steps to reproduce

  • set up a fresh centos 7
  • yum install epel-release
  • yum install collectd-netlink
  • systemctl start collectd
@smojtabai

We are having this issue as well.

@octo added the "Bug" (a genuine bug) and "linux only" labels Nov 4, 2017
@octo
Member

octo commented Nov 4, 2017

The docs have this to say:

  • mnl_attr_validate2()
    This function allows to perform a more accurate validation for attributes whose size is variable. If the size of the attribute is not what we expect, this functions returns -1 and errno is explicitly set.
  • mnl_socket_recvfrom()
    On error, it returns -1 and errno is appropriately set. If errno is set to ENOSPC, it means that the buffer that you have passed to store the netlink message is too small, so you have received a truncated message. To avoid this, you have to allocate a buffer of MNL_SOCKET_BUFFER_SIZE (which is 8KB, see linux/netlink.h for more information). Using this buffer size ensures that your buffer is big enough to store the netlink message without truncating it.

octo added a commit to octo/collectd that referenced this issue Nov 4, 2017
octo added a commit to octo/collectd that referenced this issue Nov 5, 2017
@hontvari

hontvari commented Feb 6, 2018

The same on Ubuntu 16.04, Collectd 5.5.1.

netlink plugin: link_filter_cb: IFLA_STATS64 mnl_attr_validate2 failed.
netlink plugin: ir_read: mnl_socket_recvfrom failed.

@AlexZzz
Contributor

AlexZzz commented May 4, 2018

Hi! I am facing the same problem:

netlink plugin: link_filter_cb: IFLA_STATS64 mnl_attr_validate2 failed: Numerical result out of range
netlink plugin: ir_read: mnl_socket_recvfrom failed: Numerical result out of range
read-function of plugin `netlink' failed. Will suspend it for 120.000 seconds.

For me it only affects Ubuntu 16.04 with the 4.13 kernel, which is part of the linux-generic-hwe-16.04 package (the default is 4.4, where the netlink plugin works). As far as I can see, the additional counter was added in the 4.6 kernel.

@nvtkaszpir

nvtkaszpir commented May 6, 2018

I am also seeing this error on Ubuntu 16.04 with kernel linux-generic-hwe-16.04-edge (4.15.0-15-generic), using the collectd 5.8.0.74.g0c85475 packages from the CI repo.

May  6 20:46:17 derp-011 collectd[12651]: tcpconns plugin: Reading from netlink succeeded. Will use the netlink method from now on.
May  6 20:46:17 derp-011 collectd[12651]: netlink plugin: link_filter_cb: IFLA_STATS64 mnl_attr_validate2 failed: Numerical result out of range
May  6 20:46:17 derp-011 collectd[12651]: netlink plugin: ir_read: mnl_socket_recvfrom failed: Numerical result out of range
May  6 20:46:17 derp-011 collectd[12651]: read-function of plugin `netlink' failed. Will suspend it for 20.000 seconds.

Yet CentOS 7.4 with kernel-ml.x86_64 4.16.7-1.el7.elrepo from elrepo-kernel and collectd-netlink.x86_64 5.8.0-3.el7 from EPEL works without any issues.

@rpv-tomsk
Contributor

rpv-tomsk commented May 7, 2018

As far as I can see, the additional counter was added in the 4.6 kernel.

It should also be added to /usr/include/linux/if_link.h then. Otherwise, the kernel and userspace structures are out of sync, and that leads to the reported error. Unfortunately, that is outside of Collectd and can't be fixed in its code.
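
To make the mismatch concrete, here is a minimal, self-contained sketch in C. The two structs are hypothetical, shortened stand-ins for the older and newer layouts of the link statistics structure (the real one has many more counters), and strict_validate() is an assumed stand-in for an exact-size check, not libmnl's actual code. It only illustrates how a binary built against old headers would reject a larger payload from a newer kernel and set errno to ERANGE, which glibc renders as "Numerical result out of range", the message seen in the logs above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, shortened stand-in for the layout in older headers. */
struct stats_old {
  uint64_t rx_packets;
  uint64_t tx_packets;
};

/* Hypothetical stand-in for the newer layout: one counter appended. */
struct stats_new {
  uint64_t rx_packets;
  uint64_t tx_packets;
  uint64_t rx_nohandler;
};

/* Assumed stand-in for an exact-size validation: the payload length must
 * equal the size this binary was compiled with, otherwise fail with ERANGE.
 * Returns 0 on success and -1 on a size mismatch. */
static int strict_validate(size_t payload_len, size_t expected_len) {
  assert(expected_len > 0);
  if (payload_len != expected_len) {
    errno = ERANGE;
    return -1;
  }
  return 0;
}
```

Calling strict_validate(sizeof(struct stats_new), sizeof(struct stats_old)) models a newer kernel talking to a binary built against older headers. It returns -1 even though every field the old binary knows about is present at the start of the payload.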

@rpv-tomsk
Contributor

rpv-tomsk commented May 7, 2018

For cases where Collectd is built against 'old' headers and the kernel returns the 'new' structure, the following patch may help.

diff --git a/src/netlink.c b/src/netlink.c
index b5ae3bd..29cd383 100644
--- a/src/netlink.c
+++ b/src/netlink.c
@@ -357,11 +357,10 @@ static int link_filter_cb(const struct nlmsghdr *nlh,
     if (mnl_attr_get_type(attr) != IFLA_STATS64)
       continue;

-    if (mnl_attr_validate2(attr, MNL_TYPE_UNSPEC, sizeof(*stats.stats64)) < 0) {
-      char errbuf[1024];
-      ERROR("netlink plugin: link_filter_cb: IFLA_STATS64 mnl_attr_validate2 "
-            "failed: %s",
-            sstrerror(errno, errbuf, sizeof(errbuf)));
+    uint16_t attr_len = mnl_attr_get_payload_len(attr);
+    if (attr_len < sizeof(*stats.stats64)) {
+      ERROR("netlink plugin: link_filter_cb: IFLA_STATS64 attribute has "
+            "insufficient data.");
       return MNL_CB_ERROR;
     }
     stats.stats64 = mnl_attr_get_payload(attr);
@@ -375,11 +374,10 @@ static int link_filter_cb(const struct nlmsghdr *nlh,
     if (mnl_attr_get_type(attr) != IFLA_STATS)
       continue;

-    if (mnl_attr_validate2(attr, MNL_TYPE_UNSPEC, sizeof(*stats.stats32)) < 0) {
-      char errbuf[1024];
-      ERROR("netlink plugin: link_filter_cb: IFLA_STATS mnl_attr_validate2 "
-            "failed: %s",
-            sstrerror(errno, errbuf, sizeof(errbuf)));
+    uint16_t attr_len = mnl_attr_get_payload_len(attr);
+    if (attr_len < sizeof(*stats.stats32)) {
+      ERROR("netlink plugin: link_filter_cb: IFLA_STATS attribute has "
+            "insufficient data.");
       return MNL_CB_ERROR;
     }
     stats.stats32 = mnl_attr_get_payload(attr);

This patch will not help in the reverse case (Collectd built against 'new' headers and the kernel returning the 'old' structure), but that is no different from the behaviour without the patch.

Can anybody check it? When checking, please state whether /usr/include/linux/if_link.h contains the rx_nohandler field.

Thanks.
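
The idea behind the patch can also be sketched in isolation. This is a hedged illustration, not the collectd code: struct stats_known and read_known_prefix() are hypothetical names, and copying with memcpy() is a choice made for this example (the diff uses the payload pointer directly). It shows both directions: a payload larger than the compiled-in struct is accepted and its trailing bytes are ignored, while a shorter payload, the reverse case, is still rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the struct this binary was compiled with. */
struct stats_known {
  uint64_t rx_packets;
  uint64_t tx_packets;
};

/* Relaxed validation: succeed when the payload holds at least the bytes of
 * the compiled-in struct, and copy only that known prefix. Any trailing
 * bytes sent by a newer kernel are ignored.
 * Returns 0 on success, -1 if the payload is too short. */
static int read_known_prefix(const void *payload, uint16_t payload_len,
                             struct stats_known *out) {
  assert(payload != NULL && out != NULL);
  if (payload_len < sizeof(*out))
    return -1; /* reverse case: the kernel sent less than we expect */
  memcpy(out, payload, sizeof(*out));
  return 0;
}
```

With a three-counter payload the function succeeds and fills the two known counters; with a one-counter payload it returns -1, matching the "insufficient data" branch in the diff.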

@rpv-tomsk
Contributor

@rubenk, what do you think about this patch?

@AlexZzz
Contributor

AlexZzz commented May 7, 2018

Can anybody check it? When checking, please state whether /usr/include/linux/if_link.h contains the rx_nohandler field.

Built on a system with 4.15, it works now! Thanks. But what about parsing rx_nohandler?
There's no rx_nohandler in the userspace if_link.h, even on my Ubuntu 16.04 with 4.15.
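
One way to approach the parsing question, sketched under assumptions: if the code knows the offset of the appended counter, it can read the counter only when the payload is long enough and otherwise report it as absent. The struct below is a hypothetical, shortened stand-in for a layout with rx_nohandler appended, and read_rx_nohandler() is an illustrative name. This is a runtime check and is not presented as what collectd does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, shortened stand-in for a layout with the new counter
 * appended at the end. */
struct stats_new {
  uint64_t rx_packets;
  uint64_t tx_packets;
  uint64_t rx_nohandler;
};

/* Reads the trailing counter only if the payload is long enough to contain
 * it. Returns true and stores the value on success; returns false and
 * leaves *value untouched when the kernel did not send the field. */
static bool read_rx_nohandler(const void *payload, size_t payload_len,
                              uint64_t *value) {
  assert(payload != NULL && value != NULL);
  const size_t off = offsetof(struct stats_new, rx_nohandler);
  if (payload_len < off + sizeof(*value))
    return false;
  memcpy(value, (const unsigned char *)payload + off, sizeof(*value));
  return true;
}
```

When building against headers that lack the field, the offset cannot come from the struct definition, so it would have to be supplied some other way, for example by a build-time check.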

@AlexZzz
Contributor

AlexZzz commented May 7, 2018

linux-libc-dev:amd64: /usr/include/linux/if_link.h
In Ubuntu 16.04 this package contains the 4.4 headers. I've checked Ubuntu 18.04: its default kernel is 4.15, and linux-libc-dev is up to date for 4.15, so rx_nohandler is there.

@rpv-tomsk
Contributor

So, you confirm that this solution works when Collectd is built with old headers and running on a new kernel with the rx_nohandler field, right? It is good to receive such a response, so this may be accepted as the solution.

@rpv-tomsk
Contributor

I will soon prepare a new PR with support for a new metric based on the rx_nohandler value.
It would be good if someone could test it.

@AlexZzz
Contributor

AlexZzz commented May 7, 2018

So, you confirm that this solution works when Collectd is built with old headers and running on a new kernel with the rx_nohandler field, right?

Right.

@rpv-tomsk
Contributor

Related link: https://lkml.org/lkml/2016/2/9/886

rpv-tomsk added a commit to rpv-tomsk/collectd that referenced this issue May 7, 2018
The mnl_attr_validate2() function implements a strict equality check of kernel and
userspace structure sizes. Additional counters were added in the 4.6 Linux kernel,
so the sizes changed and a mismatch can occur.

This patch weakens the validation.
Now Collectd just checks whether the structures received from kernel space have
enough data.

Closes: collectd#2510
rpv-tomsk added a commit to rpv-tomsk/collectd that referenced this issue May 7, 2018
Added metric for new counter from Linux kernel version 4.6+.

Issue: collectd#2510
rpv-tomsk added a commit to rpv-tomsk/collectd that referenced this issue May 7, 2018
Added metric for new counter from Linux kernel version 4.6+.

Issue: collectd#2510
rpv-tomsk added a commit to rpv-tomsk/collectd that referenced this issue May 7, 2018
Added metric for new counter from Linux kernel version 4.6+.

Issue: collectd#2510
@rpv-tomsk
Contributor

rpv-tomsk commented May 7, 2018

I will soon prepare a new PR with support for a new metric based on the rx_nohandler value.
It would be good if someone could test it.

The new PR has been added. I haven't tested it on a system with an updated kernel/headers, so please test/review it thoroughly. Don't forget to run build.sh, because configure.ac was changed.

Thanks.

@nvtkaszpir

Hm, but it sounds as if the CI servers which build packages for Ubuntu need to be updated to have the new header files as well.

@rpv-tomsk
Contributor

rpv-tomsk commented May 7, 2018

I think they have the headers from the relevant release, don't they?

As stated before,

There's no rx_nohandler in the userspace if_link.h, even on my Ubuntu 16.04 with 4.15.

At the moment, Collectd CI does not have any Ubuntu release newer than 16.04, so it needs to be updated to support a newer release, not newer headers.

@rubenk closed this as completed in 48a9888 May 7, 2018
AlexZzz pushed a commit to AlexZzz/collectd that referenced this issue Jan 17, 2019
The mnl_attr_validate2() function implements a strict equality check of kernel and
userspace structure sizes. Additional counters were added in the 4.6 Linux kernel,
so the sizes changed and a mismatch can occur.

This patch weakens the validation.
Now Collectd just checks whether the structures received from kernel space have
enough data.

Closes: collectd#2510

# Conflicts:
#	src/netlink.c
nmdayton pushed a commit to Stackdriver/collectd that referenced this issue Jan 17, 2019
The mnl_attr_validate2() function implements a strict equality check of kernel and
userspace structure sizes. Additional counters were added in the 4.6 Linux kernel,
so the sizes changed and a mismatch can occur.

This patch weakens the validation.
Now Collectd just checks whether the structures received from kernel space have
enough data.

Closes: collectd#2510
nmdayton pushed a commit to Stackdriver/collectd that referenced this issue Jan 17, 2019
Added metric for new counter from Linux kernel version 4.6+.

Issue: collectd#2510
nmdayton pushed a commit to Stackdriver/collectd that referenced this issue Jan 18, 2019
The mnl_attr_validate2() function implements a strict equality check of kernel and
userspace structure sizes. Additional counters were added in the 4.6 Linux kernel,
so the sizes changed and a mismatch can occur.

This patch weakens the validation.
Now Collectd just checks whether the structures received from kernel space have
enough data.

Closes: collectd#2510
nmdayton pushed a commit to Stackdriver/collectd that referenced this issue Jan 18, 2019
Added metric for new counter from Linux kernel version 4.6+.

Issue: collectd#2510