syslog fills with "Error 1 sending the modular data", gmond keeps using socket after EINVAL #65

Open
dpocock opened this issue Oct 28, 2012 · 2 comments


@dpocock
Member

dpocock commented Oct 28, 2012

Suggested action/solution: if write returns EINVAL, gmond should try to recreate or re-bind the sending socket, rather than continuing to send on a bad socket (and filling the logs with errors).

Google reveals this has been discussed several times in the past, and
none of the discussions ended with a solution, so I'm presenting some
analysis below.

Here is what I did and what I found:

I discovered my gmond PID = 21015 and I checked it with strace:

strace -p 21015 -o /tmp/gmond.errs -v

After about a minute, I looked inside /tmp/gmond.errs and found lots of this:

write(7, "\0\0\0\205\0\0\0\4srv1\0\0\0\fmachine_type\0\0\0\0"..., 52) = 52
write(8, "\0\0\0\205\0\0\0\4srv1\0\0\0\fmachine_type\0\0\0\0"..., 52) = -1 EINVAL (Invalid argument)
write(7, "\0\0\0\200\0\0\0\4srv1\0\0\0\7os_name\0\0\0\0\0\0\0\0\6"..., 164) = 164
write(8, "\0\0\0\200\0\0\0\4srv1\0\0\0\7os_name\0\0\0\0\0\0\0\0\6"..., 164) = -1 EINVAL (Invalid argument)
time([1351418592]) = 1351418592
sendto(9, "<30>Oct 28 11:03:12 /usr/sbin/gm"..., 90, MSG_NOSIGNAL, NULL, 0) = 90

Notice that the `sendto' is actually sending the error to syslog, not
sending a metric packet.

OK, the `write' calls show me two file descriptors, 7 and 8. Writes to
FD 8 are failing with EINVAL:

write(8, .... ) = -1 EINVAL (Invalid argument)

The file descriptors correspond to two different udp_send_channels in
gmond.conf - but which is which? Fortunately, lsof tells me:

lsof -p 21015 -n

gmond 21015 ganglia 7u IPv4 2747622 0t0 UDP 192.168.1.2:44778->239.2.11.71:8649
gmond 21015 ganglia 8u IPv4 2747628 0t0 UDP (VPN address):53976->(remote server address):8649

Notice that FD 7 corresponds to a very standard multicast channel, while
FD 8 corresponds to a UDP unicast channel. I have redacted the IP
addresses here, but the lsof output immediately revealed the problem (in
my case anyway): the local address (VPN address) existed when gmond
started, but no longer exists on this machine (because the VPN is not
always up).

I can imagine similar problems would occur for hosts that get an IP by
means of DHCP, or hosts that have an IPsec tunnel, PPP or some other
transient interface.
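
For anyone who wants to check whether the local address shown by lsof is still assigned to the machine, here is a minimal sketch in Python. It is only a diagnostic aid, not part of gmond. The function name `local_address_exists` is made up for this example, and it assumes the default Linux behaviour where binding to an address that is not configured locally fails with EADDRNOTAVAIL (this does not hold if `net.ipv4.ip_nonlocal_bind` is enabled).

```python
import errno
import socket


def local_address_exists(ip):
    """Return True if `ip` can be bound locally, False if it is not assigned.

    Hypothetical helper: it probes by binding a throwaway UDP socket to
    (ip, 0) and treats EADDRNOTAVAIL as "address not on this host".
    """
    family = socket.AF_INET6 if ":" in ip else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as probe:
        try:
            probe.bind((ip, 0))  # port 0 lets the kernel pick a free port
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                return False
            raise  # any other error is unexpected, so surface it
    return True
```

Calling it with the local address from the lsof line for the failing FD (the VPN address in my case) should return False while the VPN is down, if the assumption above holds.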

If anyone else sees the problem, I would be interested to see your
strace and lsof output. I believe gmond could be tweaked, for example,
to recreate (or re-bind) the socket with FD 8 after such an EINVAL error.
Doing so might log a more specific error or might successfully bind on a
new local IP.
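
To make the suggestion concrete, here is a rough sketch of the recreate-on-EINVAL idea. It is written in Python for brevity and is not gmond code (gmond is C); the class name `UdpSendChannel` and the `socket_factory` parameter are invented for illustration. Whether a fresh socket actually clears the EINVAL depends on the cause, so treat the single retry as an assumption rather than a guaranteed fix.

```python
import errno
import socket


class UdpSendChannel:
    """Connected UDP sender that rebuilds its socket once after EINVAL."""

    def __init__(self, host, port, socket_factory=socket.socket):
        self.remote = (host, port)
        self._factory = socket_factory  # injectable so the logic can be tested
        self.recreations = 0
        self.sock = self._open()

    def _open(self):
        sock = self._factory(socket.AF_INET, socket.SOCK_DGRAM)
        # connect() on a UDP socket fixes the peer and lets the kernel
        # choose a local address that is valid right now.
        sock.connect(self.remote)
        return sock

    def send(self, payload):
        try:
            return self.sock.send(payload)
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise  # other errors (e.g. ECONNREFUSED) are not handled here
        # EINVAL: discard the stale socket, open a new one, retry exactly once.
        self.sock.close()
        self.sock = self._open()
        self.recreations += 1
        return self.sock.send(payload)
```

If the retry fails as well, the exception propagates, which would at least give a more specific error to log than repeating the same EINVAL on every metric.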

@creisor

creisor commented Jun 19, 2013

I'm seeing similar errors, but it mostly seems to be due to connection refused:

creisor@hostname ~/Documents/troubleshooting/gmond $ grep "= -1" gmond.errs  | sort | uniq -c
  14 connect(27, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
 122 lseek(25, 0, SEEK_CUR)                  = -1 ESPIPE (Illegal seek)
  61 lseek(28, 0, SEEK_CUR)                  = -1 ESPIPE (Illegal seek)
   9 write(7, "\0\0\0\200\0\0\0\16hostname\0\0\0\0\0\17net_"..., 184) = -1 ECONNREFUSED (Connection refused)
   5 write(7, "\0\0\0\200\0\0\0\16hostname\0\0\0\0\0\21proc"..., 200) = -1 ECONNREFUSED (Connection refused)
   1 write(7, "\0\0\0\200\0\0\0\16hostname\0\0\0\0\0\fredi"..., 168) = -1 ECONNREFUSED (Connection refused)
   9 write(7, "\0\0\0\200\0\0\0\16hostname\0\0\0\0\0\ncpu_"..., 168) = -1 ECONNREFUSED (Connection refused)
   2 write(7, "\0\0\0\200\0\0\0\16hostname\0\0\0\0\0\thear"..., 168) = -1 ECONNREFUSED (Connection refused)
   2 write(7, "\0\0\0\203\0\0\0\16hostname\0\0\0\0\0\17net_"..., 60) = -1 ECONNREFUSED (Connection refused)
   4 write(7, "\0\0\0\203\0\0\0\16hostname\0\0\0\0\0\21proc"..., 64) = -1 ECONNREFUSED (Connection refused)
   2 write(7, "\0\0\0\203\0\0\0\16hostname\0\0\0\0\0\ncpu_"..., 56) = -1 ECONNREFUSED (Connection refused)
   1 write(7, "\0\0\0\203\0\0\0\16hostname\0\0\0\0\0\nntp_"..., 56) = -1 ECONNREFUSED (Connection refused)
   7 write(7, "\0\0\0\204\0\0\0\16hostname\0\0\0\0\0\thear"..., 56) = -1 ECONNREFUSED (Connection refused)
   1 write(7, "\0\0\0\205\0\0\0\16hostname\0\0\0\0\0\17net_"..., 272) = -1 ECONNREFUSED (Connection refused)
   2 write(7, "\0\0\0\205\0\0\0\16hostname\0\0\0\0\0\21proc"..., 152) = -1 ECONNREFUSED (Connection refused)
   1 write(7, "\0\0\0\205\0\0\0\16hostname\0\0\0\0\0\5gexe"..., 56) = -1 ECONNREFUSED (Connection refused)
   1 write(7, "\0\0\0\205\0\0\0\16hostname\0\0\0\0\0\ncpu_"..., 172) = -1 ECONNREFUSED (Connection refused)
   1 write(7, "\0\0\0\205\0\0\0\16hostname\0\0\0\0\0\nntp_"..., 128) = -1 ECONNREFUSED (Connection refused)

FD 7 is a UDP channel to our central ganglia server:

gmond   4519 ganglia    7u  IPv4          348817372      0t0        UDP 10.20.40.104:38381->10.20.43.22:8706

man 2 connect says:

ECONNREFUSED
No one listening on the remote address.
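
If it helps others compare traces, the grep/sort/uniq summary above can also be grouped by syscall, file descriptor and errno, which makes it easier to match failures against lsof output. This is a small sketch; the function name `summarize_failures` and the regular expression are my own, and it assumes single-process strace output with one call per line (no PID prefix from `-f`, no wrapped lines).

```python
import re
from collections import Counter

# Matches lines such as:
#   write(8, "..."..., 52) = -1 EINVAL (Invalid argument)
FAILED_CALL = re.compile(
    r"^(?P<call>\w+)\((?P<fd>\d+),.*=\s*-1\s+(?P<errno>E[A-Z0-9]+)"
)


def summarize_failures(lines):
    """Count failed syscalls keyed by (syscall name, fd, errno name)."""
    counts = Counter()
    for line in lines:
        match = FAILED_CALL.match(line.strip())
        if match:
            key = (match["call"], int(match["fd"]), match["errno"])
            counts[key] += 1
    return counts
```

For example, `summarize_failures(open("/tmp/gmond.errs"))` returns a Counter whose `most_common()` lists the noisiest (call, fd, errno) combinations first.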

@pmigrala

Late reply, but may help some poor soul ...

get pid of gmond:
systemctl status gmond.service

"lsof -p 20606" revealed:
gmond 20606 ganglia 7u IPv6 385361 0t0 UDP localhost:50101->localhost:8649

Notice type IPv6.
Disabling ipv6 and restarting gmond corrected my ECONNREFUSED errors.
The ganglia graphs are now being rendered.

Disabled ipv6 like this:
vi /etc/sysctl.conf

add the following:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

restart network:
systemctl restart network

restart gmond:
systemctl restart gmond.service
