Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed connections and timeouts on TcpTransport (was: Incomplete write: Only 0 of 358 written) #78

Closed
lashnag opened this issue Jan 11, 2017 · 56 comments

Comments

@lashnag
Copy link

lashnag commented Jan 11, 2017

Hello! I have bug when i write log to GrayLog on the large load project. There is the bug:
Incomplete write: Only 0 of 358 written in /data/home/projects/payprocessing/classes/vendor/graylog2/gelf-php/src/Gelf/Transport/StreamSocketClient.php:212

Here is the message send to socket:

{
"version": "1.0",
"host": "pay-1.reserve.lan",
"short_message": "Redirect to https://www.platron.ru/payment_params.",
"full_message": "Redirect to https://www.platron.ru/payment_params.php?customer=5d44643437990b1774efb742ed1fb9a031005685\r\n(Process number: 84073)",
"level": 6,
"timestamp": 1484144247.0146,
"facility": "paypocessing",
"file": "Platron::payment"
}

I try to send id manual - and it's worked! So the problem in fwrite function. There is a notice in php documentation about fwrite function:

http://php.net/manual/en/function.fwrite.php

Some people say that when writing to a socket not all of the bytes requested to be written may be written. You may have to call fwrite again to write bytes that were not written the first time. (At least this is how the write() system call in UNIX works.)

I think you need to try fwrite_retry several times, before throw error, or, maybe, use some other function.

@lashnag
Copy link
Author

lashnag commented Jan 12, 2017

One more comment. Log from GrayLog about this error:

2017-01-12T11:45:15.267+03:00 ERROR [NettyTransport] Error in Input
[GELF TCP/585a6742565c11041d194d7c] (channel [id: 0xb735c24e,
/10.1.1.10:45950 => /10.1.2.21:12201])
java.lang.IllegalStateException: GELF message is too short. Not even the
type header would fit.
at
org.graylog2.inputs.codecs.gelf.GELFMessage.getGELFType(GELFMessage.java:46)
~[graylog.jar:?]
at
org.graylog2.inputs.codecs.GelfChunkAggregator.addChunk(GelfChunkAggregator.java:95)
~[graylog.jar:?]
at
org.graylog2.plugin.inputs.transports.NettyTransport$MessageAggregationHandler.messageReceived(NettyTransport.java:303)
~[graylog.jar:?]
at
org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
~[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
[graylog.jar:?]
at
org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
[graylog.jar:?]
at
org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
[graylog.jar:?]
at
org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
[graylog.jar:?]
at
org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
[graylog.jar:?]
at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
[graylog.jar:?]
at
org.jboss.netty.channel.SimpleChannelHandler.messageReceived(SimpleChannelHandler.java:142)
[graylog.jar:?]
at
org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
[graylog.jar:?]
at
org.jboss.netty.handler.traffic.AbstractTrafficShapingHandler.messageReceived(AbstractTrafficShapingHandler.java:718)
[graylog.jar:?]
at
org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
[graylog.jar:?]
at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.messageReceived(SimpleChannelUpstreamHandler.java:124)
[graylog.jar:?]
at
org.graylog2.plugin.inputs.util.PacketInformationDumper.messageReceived(PacketInformationDumper.java:51)
[graylog.jar:?]
at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
[graylog.jar:?]
at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.messageReceived(SimpleChannelUpstreamHandler.java:124)
[graylog.jar:?]
at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
[graylog.jar:?]
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
[graylog.jar:?]
at
org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
[graylog.jar:?]
at
org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
[graylog.jar:?]
at
org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
[graylog.jar:?]
at
org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
[graylog.jar:?]
at
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
[graylog.jar:?]
at
org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
[graylog.jar:?]
at
org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
[graylog.jar:?]
at
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
[graylog.jar:?]
at
org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
[graylog.jar:?]
at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
[graylog.jar:?]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[?:1.8.0_111]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

@lashnag
Copy link
Author

lashnag commented Jan 12, 2017

I try 3 times to retry fwrite and it didn't help. Maybe you have some other suggestions about this bug? Everithing is normal if i try to write this message manual (use php gelf but send this message one time). And it's work if i try to work with graylog on reserve server (copy of the main server).
So when i turn on logging into graylog on main server - it didn't work not all the time, but from time to time.

@bzikarsky
Copy link
Owner

Some questions:

  • You added a manual retry around fwrite and it didnt help?
  • Does every message return 0 bytes written on fwrite, or only some?
  • Do some messages fail every time?
  • Does is happen with bothe the UDP and TCP transport layers?

@lashnag
Copy link
Author

lashnag commented Jan 16, 2017

  1. I added manually retry - and it didn't help
  2. Not every message return 0 bytes
  3. No. I try to send this messege manually later - and the result was ok!
  4. I tried only TCP transport

@Seb35
Copy link

Seb35 commented Jan 16, 2017

I also have this message on my developement computer, so in case it can help: I’m playing with Graylog before (perhaps) put it in production, and I got this message when the Graylog server is not running, and the UDP port is closed. To be sure I monitored the traffic with Wireshark and I see the UDP packet sent then an ICMP response "Destination unreachable (Port unreachable)"; there 4 (unsuccessful) tries.

@lashnag perhaps you could try to monitor the traffic to hopefully obtain some interesting informations like ICMP responses, perhaps there are intermittent network errors.

@lashnag
Copy link
Author

lashnag commented Jan 17, 2017

@Seb35 I set the tcpdump. And there is no errors in the log file.

bzikarsky pushed a commit that referenced this issue Jan 18, 2017
@bzikarsky
Copy link
Owner

What do you get with the improved method?

@lashnag
Copy link
Author

lashnag commented Jan 18, 2017

@bzikarsky sorry, what do you mean?

@bzikarsky
Copy link
Owner

I pushed a change to master in d98632d -- Can you run your tests again and see what happens?

@lashnag
Copy link
Author

lashnag commented Jan 19, 2017

@bzikarsky I have error from this:

if ($failed || $byteCount === false) {
                throw new \RuntimeException();
            }

RuntimeException in /data/home/projects/payprocessing/classes/vendor/graylog2/gelf-php/src/Gelf/Transport/StreamSocketClient.php:218

@bzikarsky
Copy link
Owner

And what does the error say? I assume you get an exception-message

@lashnag
Copy link
Author

lashnag commented Jan 19, 2017

@bzikarsky I can send to you stack trace:

2017-01-19 17:44:11 [176.65.120.6][-][-][error][RuntimeException]
RuntimeException in /data/home/projects/payprocessing/classes/vendor/graylog2/gelf-php/src/Gelf/Transport/StreamSocketClient.php:218
Stack trace:
#0 /data/home/projects/payprocessing/classes/vendor/graylog2/gelf-php/src/Gelf/Transport/TcpTransport.php(66): Gelf\Transport\StreamSocketClient->write('{"version":"1.0...')
#1 /data/home/projects/payprocessing/classes/vendor/graylog2/gelf-php/src/Gelf/Publisher.php(76): Gelf\Transport\TcpTransport->send(Object(Gelf\Message))
#2 /data/home/projects/payprocessing/classes/log/GraylogTarget.php(41): Gelf\Publisher->publish(Object(Gelf\Message))
#3 /data/home/projects/payprocessing/classes/log/SysLog.php(120): common\log\GraylogTarget->write('Redirect to htt...', 20820, 'Platron::paymen...')
#4 /data/home/projects/payprocessing/www/payment.php(91): SysLog->write('Redirect to htt...')
#5 {main}

@bzikarsky
Copy link
Owner

The exception does not have a message at all? Or don't you log the message?

Maybe dump it explicitly?
try { ...} catch (\RuntimeException $e) { var_dump($e->getMessage()); throw $e; }

@Nikoms
Copy link

Nikoms commented Jan 19, 2017

We have the same issue here... Is it related to the config of graylog you think?

@lashnag
Copy link
Author

lashnag commented Jan 20, 2017

@bzikarsky in commit https://github.com/bzikarsky/gelf-php/commit/d98632dfbf47a0084ae935653f34bf5d40bfa04f you write this code:

if ($failed || $byteCount === false) {
                throw new \RuntimeException();
            }

So exception was created from here and you don't send any message when you throw it

throw new \RuntimeException();

So, what message do you wan't to get? From log on the graylog?

@lashnag
Copy link
Author

lashnag commented Jan 20, 2017

@Nikoms I don't know. But the exception don't say that there is some problems with graylog or socket was closed when i try to write. Also, i asked about this problem in google community of graylog. Here is the link https://groups.google.com/forum/#!topic/graylog2/QPaMCi3fQHA and have got the answer:
Since the error message originates from the PHP library, it's a problem with either the library, how you integrate the library in your PHP application, or with your PHP setup.
So, probably, the problem is in php, or library

@Nikoms
Copy link

Nikoms commented Jan 20, 2017

@lashnag It's strange that it's only on one server (reserve vs main)

@lashnag
Copy link
Author

lashnag commented Jan 20, 2017

@Nikoms I think it could depends on hight load or not. Both of servers have the same architecture, software and hardware.

@Nikoms
Copy link

Nikoms commented Jan 20, 2017

We also have a "random" issue during our "long cron jobs". Is it possible that the connection is closed like mysql for example?

@bzikarsky
Copy link
Owner

@lashnag: I'm sorry, that one just sneaked by. I readded the error-message to the exception.

@Nikoms: I think it's possible the TCP connection will drop if you don't log anything over the connection for a certain amount of time. I consider adding socket_set_option($socket, SOL_SOCKET, SO_KEEPALIVE, 1); to fix this.

@Nikoms
Copy link

Nikoms commented Jan 20, 2017

@bzikarsky I checked our config and it our default socket timeout is set to 60s. Monday, I'll check if I still have this problem by changing it with ini_set. I will also check with a timeout to 1 second and adding a sleep...

I'll keep you in touch

@Nikoms
Copy link

Nikoms commented Jan 23, 2017

Mhh changing "default socket timeout" to 1 sec does not change anything. Even with a sleep 5, the connection is not closed. So it may not be that

@lashnag
Copy link
Author

lashnag commented Jan 23, 2017

@Nikoms A can say more! I try to start big script on the reserve server, which write something in db. The was 100 000 rows. And every write in db = write in gray log. It worked hard 15 minutes. And the is no errors...

@bzikarsky
Copy link
Owner

@Nikoms default-socket-timeout is only slightly related to SO_KEEPALIVE. SO_KEEPALIVE enables keepalive-packets on the TCP connection, so it does not get shut down (neither host nor client). The timeout only relates to PHP.

@Nikoms
Copy link

Nikoms commented Jan 23, 2017

ok thx

@lashnag
Copy link
Author

lashnag commented Jan 23, 2017

@Nikoms It worked?

@Nikoms
Copy link

Nikoms commented Jan 23, 2017

@lashnag no ... I think I will contact our devops to see if they have an idea.

Right now, I just made a dirty "try/catch" and I log the record in a json file...

@carstenwindler
Copy link

Maybe that helps: I also got the error message using tag 1.5.3.

It showed that it happens (reproducable) when the socket resource is used more than once. Placing
$this->close();
before
$socket = $this->getSocket();
in
StreamSocketClient:write()
solves the issue.

Not really sure what it would mean performance-wise to create the socket with every write though.

@bzikarsky I tried your modified master as well, and it didn't seem to solve the issue

@lashnag
Copy link
Author

lashnag commented Jan 24, 2017

@Nikoms I'll be waiting

@psaxde
Copy link

psaxde commented Jan 24, 2017

I'm getting the same error if graylog host isn't reachable.

For me the modified master from @bzikarsky doesn't work. The fix @carstenwindler mentioned works for me.

@lashnag
Copy link
Author

lashnag commented Jan 25, 2017

@carstenwindler Could you send pull request and @bzikarsky could you get it and push to master. And i will try it?
Or @carstenwindler could say me what i need to set in what file?

@carstenwindler
Copy link

@bzikarsky
Copy link
Owner

Just so we are on the same page: The 2 commits in master don't try to fix the problem. They should only give a better error message.

@carstenwindler Forcing the socket to close down (and therefore reopening it) for every write seems like a bad idea. This will probably work, when the connection timed out (or was closed another way), but doesn't address the problem.

As mentioned earlier, one thing we can try is adding SO_KEEPALIVE to the tcp connection, to see what happens. This should help in a situations of long running jobs which log very infrequently.

If the random disconnects still occur, the affected people should check their network setup. I can't think of a reason a TCP connection should get dropped if both client and server are happily keeping up their end of the connection. Maybe (but I'm quite sure if I like it) we can handle a failed write with a reconnect-retry. But ... this is messy.

Since I'm quite busy currently I'll happily discuss/accept PRs on this matter. Thanks for your patience.

@bzikarsky bzikarsky changed the title Incomplete write: Only 0 of 358 written Closed connections and timeouts on TcpTransport (was: Incomplete write: Only 0 of 358 written) Jan 25, 2017
@carstenwindler
Copy link

@bzikarsky yes I agree. I just wanted to share what I found out during debugging.

@lashnag
Copy link
Author

lashnag commented Jan 26, 2017

@carstenwindler I have the problem with new commit:

2017-01-26 13:36:05 [89.219.35.77][-][-][error][RuntimeException]
RuntimeException: Failed to create socket-client for tcp://graylog:12201: Operation timed out (60) in /data/home/projects/payprocessing/classes/vendor/graylog2/gelf-php/src/Gelf/Transport/StreamSocketClient.php:136

So, i think, that the problem in the network or in GrayLog. What do you think?

@lashnag
Copy link
Author

lashnag commented Jan 26, 2017

@carstenwindler @bzikarsky https://moodle.org/plugins/view.php?id=1756 - here is the same problem. Do you think i need to write in official graylog support?

@lashnag
Copy link
Author

lashnag commented Jan 26, 2017

And i see that UDP transport use fwrite function too. But there is no same exception. I've tried. But i'm not sure, that fwrite coult return 0 bytes, if it use to write using UDP protocol.

@bzikarsky
Copy link
Owner

Regarding timeout: Yes, it looks like a network/configuration problem. Somehow the server with gelf-php can't open a TCP connection to your graylog2-server

Regarding moodle: The moddle plugin uses this library. if gelf-php has a bug, moodle is also affected

Regarding UDP: UDP works differently than TCP. There is no connection in UDP, just packets. You don't get notified when those are dropped or they never reach their destination. (On the other hand it's faster and in many usecases it's reliable enough for logging)

@lashnag
Copy link
Author

lashnag commented Jan 27, 2017

@bzikarsky Do you know how fwrite function work in low level? How it implemented in C++? I logged returned value from fwrite function, when i try to use UDP, and there is not 0 and not NULL. So what it's it mean? It's mean some that fwrite function return some default value to compliance phpdoc?

@usmonster
Copy link

usmonster commented Jan 27, 2017

I'm having the same issue in 1.5.3 (haven't tried 1.5.1 or 1.5.2), using UDP.
Downgrading to 1.5.0 was my temporary solution, since I think the issue was introduced in the fix for #66 (90ed711).
Perhaps a retry can be added, if that makes sense?

@bzikarsky
Copy link
Owner

@lashnag Sorry, I can't follow you. fwrite's behavious is documented at http://php.net/fwrite. Quote: "Return Values: fwrite() returns the number of bytes written, or FALSE on error.". This does not change with the type of stream. Since I also turn off blocking for UDP this should always correspond to the number of bytes in the string passed to fwrite. (And just as a heads up: PHP is written in C not C++)

@usmonster Interesting. The fix itself should not trigger the error. It just makes it visible that some log-messages did not get sent/written properly before. Can you confirm this?

@usmonster
Copy link

@bzikarsky, you're absolutely right. I just had some time to properly debug, and I see that the source of the error is not gelf--sorry for the noise! 😶

@lashnag
Copy link
Author

lashnag commented Jan 30, 2017

@bzikarsky So, do you think, if i have the same error in UDP i should get exception messeges? I mean, it there will be the same problem fwrite return 0 like when i use TCP connection?

@carstenwindler
Copy link

@bzikarsky I tested another workaround, this time by closing and re-building the socket only if write is incomplete

https://github.com/bzikarsky/gelf-php/compare/master...carstenwindler:issues/78-try-to-reconnect-on-write-error?expand=0

(just a quick hack, no final solution)

works fine for our project (we use UDP), probably it's worth considering?

@bzikarsky
Copy link
Owner

@lashnag There cannot be the same error on UDP. UDP is stateless, there is no connection. Therefore you will not get any feedback if your messages cannot reach their destination.

@carstenwindler I'll look into this.

@lashnag
Copy link
Author

lashnag commented Jan 31, 2017

@bzikarsky I understand this, but i don't understand than what does it mean "return value of fwrite function" if i can't reach destination sended packages

@razvanphp
Copy link

razvanphp commented Feb 14, 2017

Hey, I will jump in, as we get this error too since some time. We use UDP transport, release 1.5.3, so definitely happens there too. In our case, it's a long running cron job that at some point can't reach graylog (usually happens for a few minutes when the indices rotate).

I think nobody saw this before because it was not throwing an exception before 90ed711, so I think the best way is that everybody catches this exception in their code.

LE: given that Monolog & graylog should never throw exceptions in production, and that we already have a file transport that should never fail, we now use this elegant trick:

$gelfTransport = new IgnoreErrorTransportWrapper(new UdpTransport(
    $graylogHost,
    UdpTransport::DEFAULT_PORT,
    UdpTransport::CHUNK_SIZE_LAN
));

@bzikarsky
Copy link
Owner

I checked the Graylog GUI for TCP inputs today: It allows setting TCP Keepalive. Is this enabled when you come across those connection timeouts?

@jeanpralo
Copy link

I have the same problem in UDP when logging to graylog2 and the service is down, so easy enough to reproduce I guess.

@jdeniau
Copy link

jdeniau commented May 15, 2017

Just to help, I updated to 1.5.4 and I figured a problem problably related to this issue:

Failed to write to socket: fwrite(): send of 338 bytes failed with errno=111 Connection refused (8)

in

221.            if ($failed || $byteCount === false) {
222.                throw new \RuntimeException($errorMessage);
223.            }

@bzikarsky
Copy link
Owner

Error 111 (or ECONNREFUSED) can occur if the client (the gelf-library) cannot open or write to a remote server (the description Connection refused pretty much gives it away).

This is in almost all cases related to network- or serverside problems.

@gbourgeat
Copy link

I have the same issue so that I use UDP transport just for avoid these errors :/

Sometimes it work but sometimes I have this error message.

@bzikarsky
Copy link
Owner

I'll recommend https://stackoverflow.com/questions/41231270/econnrefused-errors-on-udp-sendto (especially the answer) for reading why ECONNREFUSED can also occur on UDP.

If you want to silence those errors, please use the IgnoreErrorTransportWrapper.

@bzikarsky
Copy link
Owner

I'll add in docs in #90 for this and similiar issues and will close this now

@lashnag
Copy link
Author

lashnag commented Dec 13, 2017

Do you mean that i need to use UDP and mute all problems? A cant'be sure that sended package will reache the log server? Do you have other decisions?

@bzikarsky
Copy link
Owner

If you use TCP, you get a guarantee of delivery. But this requires a working TCP connection. If it does not work, you'll definitly get an exception.

If you use UDP, you get no guarantees. You may get an exception if there is the host cannot be resolved or there is no service listening on the destination post.

If you do not want exceptions, wrap the transport in the IgnoreErrorTransportWrapper.

I really cant do anything else. You can make sure that

a) your destination host and port is configured correctly and is reachable
b) your network is working correctly (adequate TCP connection timeouts, proper routing and firewalling, none or only few drop packages, etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests