
Failure connecting to [ip]: dial tcp [ip]:[port]: connection timed out #443

Closed · mateodelnorte opened this issue Apr 22, 2015 · 10 comments

@mateodelnorte

After upgrading to v1.5.0.rc2, all of my logstash-forwarders (on ~12 machines) are doing the following. They previously ran well for months.

2015/04/23 10:36:18.374883 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:36:33.375726 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:36:34.376591 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:36:50.377974 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:36:51.378818 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:37:06.379219 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:37:07.380198 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:37:23.477705 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:37:24.478600 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:37:39.479644 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:37:40.480616 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:37:54.565698 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:37:55.567624 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:38:09.653686 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:38:10.654522 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:38:23.749670 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:38:24.750607 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:38:38.837719 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:38:39.838559 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:38:55.841870 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:38:56.843241 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:39:12.845933 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:39:13.847824 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:39:28.848058 Failure connecting to my.ip.address: dial tcp my.ip.address:5000: i/o timeout
2015/04/23 10:39:29.849085 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:39:45.850017 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:39:46.851291 Connecting to [my.ip.address]:5000 (my.domain.com)

Any ideas?

@mateodelnorte (Author)

After restarting my entire system, I'm able to get individual forwarders to connect, but they time out regardless of how large a timeout I set:

2015/05/04 12:34:09.272485 Read error looking for ack: read tcp [ip:port]: i/o timeout
2015/05/04 12:34:09.272630 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder.crt
2015/05/04 12:34:09.273893 Connecting to [ip]:[port] (domain.name.com)
2015/05/04 12:34:09.345193 Connected to [ip]

When this happens, my entire logstash system appears to silently stop processing logs, and individual forwarders begin failing, one by one, until all of them are giving these log messages.

Any info on this at all?

@jordansissel (Contributor)

This error, "Read error looking for ack: read tcp [ip:port]: i/o timeout", means that lsf had published a payload to Logstash and was waiting for an acknowledgement, and it never received one.

This is usually caused by something in Logstash getting stuck (a slow/stalled filter or output) which prevents the inputs from processing more data and prevents the lumberjack input from acknowledging events.

My guess would be that something in Logstash (or downstream of it, like an output) is stuck for you. Can you attach your config?
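
A minimal way to isolate this, assuming a standard Logstash 1.x pipeline (nothing specific to this setup), is to run Logstash briefly with only a stdout output; if events keep flowing to the console, the filters are fine and the real output is the one applying backpressure:

output {
  # temporary diagnostic output: if events appear here but stall once the
  # normal output is re-enabled, the output side is applying the backpressure
  stdout { codec => rubydebug }
}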

@mateodelnorte (Author)

filter {
  if [type] == "x-message" {
    grok {
      match => { "message" => "(?<datetime>%{YEAR}%{MONTHNUM}%{MONTHDAY}-%{HOUR}:%{MINUTE}:(?:[0-9][0-9]).(?:[0-9][0-9][0-9])) : (?<message>\S+)+" }
      overwrite => [ "message" ]
      add_field => [ "datetime", "%{@timestamp}" ]
      add_field => [ "data.msg", "%{message}" ]
    }
    mutate {
      split => [ "message", "" ]
    }
    ruby {
      code => "
        pairs = event['message']
        for token in pairs
          pair = token.split('=')
          if pair[0] == '49'
            event['data.senderCompId'] = pair[1]
          end
          if pair[0] == '56'
            event['data.targetCompId'] = pair[1]
          end
          event['data.fix_tag_' + pair[0]] = pair[1]
        end
        if event['data.targetCompId'] == 'name' && ['D','F','G'].include?(event['data.fix_tag_35'])
          event['unique_fix_message_id'] = event['data.targetCompId'] + event['data.fix_tag_11']
          event['unique_fix_message_status'] = 'started'
        elsif event['data.senderCompId'] == 'name' && ['8','9'].include?(event['data.fix_tag_35'])
          event['unique_fix_message_id'] = event['data.senderCompId'] + event['data.fix_tag_11']
          event['unique_fix_message_status'] = 'completed'
        end
      "
    }
    if [unique_fix_message_status] == "started" {
      mutate {
        add_tag => [ "fix_message_transaction_started" ]
      }
    } else if [unique_fix_message_status] == "completed" {
      mutate {
        add_tag => [ "fix_message_transaction_completed" ]
      }
    }
  } else if [type] == "fix-event" {
    grok {
      match => { "message" => "(?<datetime>%{YEAR}%{MONTHNUM}%{MONTHDAY}-%{HOUR}:%{MINUTE}:(?:[0-9][0-9]).(?:[0-9][0-9][0-9])) : (?<message>\S+)+" }
      overwrite => [ "message" ]
      add_field => [ "datetime", "%{@timestamp}" ]
      add_field => [ "data.msg", "%{message}" ]
    }
    ruby {
      code => "
        event['data.senderCompId'] = event['file'][/FIX.4.4-name-(.*).event.current.log/, 1]
      "
    }
  } else {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      add_field => [ "datetime", "%{@timestamp}" ]
    }
    if [syslog_message] =~ /^\{.*/ {
      json {
        source => "syslog_message"
        target => "data"
      }
    }
  }
}

I'm building up the 'x-message' flow to include one more filter, the elapsed filter, so that I can track time diffs between messages going in and out of my system.
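
For reference, a rough sketch of what that elapsed stanza might look like, using the tags and the unique_fix_message_id field already set above (the timeout value here is an assumption, not something from this config):

filter {
  if [type] == "x-message" {
    elapsed {
      # pair each "completed" event with its "started" event via unique_fix_message_id;
      # the matched end event gets an elapsed_time field (in seconds)
      start_tag => "fix_message_transaction_started"
      end_tag => "fix_message_transaction_completed"
      unique_id_field => "unique_fix_message_id"
      timeout => 600  # assumed; how long to wait for the matching end event
    }
  }
}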

@mateodelnorte (Author)

Here's the forwarder conf:

{
  "network": {
    "servers": [ "logs.domain.com:5000" ],
    "timeout": 900,
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt"
  },
  "files": [
    {
      "paths": [
        "/var/log/messages",
        "/var/log/secure"
      ],
      "fields": { "type": "syslog" }
    },
    {
      "paths": [
        "/var/fix/log/FIX.4.4-name-*.event.current.log"
      ],
      "fields": { "type": "fix-event", "data.name": "fix-event" }
    },
    {
      "paths": [
        "/var/fix/log/FIX.4.4-name-*.messages.current.log"
      ],
      "fields": { "type": "fix-message", "data.name": "fix-message" }
    }
  ]
}

@jordansissel (Contributor)

Let's see if Logstash is stuck. Are you seeing events flowing out of Logstash into your outputs? If so, how far behind real-time are they?

Also, your config comment above omitted the output { } section.

@mateodelnorte (Author)

Oops, yep. I've got those in separate files:
01 - input file

input {
  lumberjack {
    port => 5000
    type => "logs"
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
  }
}

30 - output file

output {
  elasticsearch { host => localhost }
  #stdout { codec => rubydebug }
}

@mateodelnorte (Author)

I actually just saw this message for the first time:

net[/10.2.3.41:9300]],}, reason: zen-disco-receive(from master [[Karen Page][GgpQMDAnQkWQ0xCw7yDICg][nj1slog01][inet[/10.2.3.41:9300]]])
May 04, 2015 12:21:32 PM org.elasticsearch.node.internal.InternalNode start
INFO: [logstash-nj1slog01-1182-7634] started
Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

Clearly something I need to solve, but I'm a little afraid it's a new problem since I hadn't seen it before.
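
In case it's useful, one way to raise that cap, assuming the stock 1.x startup wrapper (which reads LS_HEAP_SIZE); the paths and the 1g value here are illustrative, not from this install:

# raise the JVM heap before starting Logstash; verify this wrapper honors LS_HEAP_SIZE on this install
LS_HEAP_SIZE=1g /opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/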

Thanks a bunch for the help.

@jordansissel (Contributor)

Not sure if the OOM is a symptom or a cause right now.

There's a known issue with lumberjack/logstash if an output is stuck, somewhat mentioned in this ticket and this ticket. In your case, if Elasticsearch is down or not accepting new documents for some reason (network or other fault), then Logstash will block until that resolves, lsf will keep retrying on timeouts, and Logstash will eventually OOM due to a bug mentioned in the above tickets. In this case, the OOM would be a symptom, not a cause.

Is Elasticsearch healthy?
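
As a quick sketch of a check, assuming Elasticsearch is listening on its default HTTP port on the Logstash host:

# cluster status (green/yellow/red) plus unassigned shards give a quick read on health
curl -s 'http://localhost:9200/_cluster/health?pretty'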

@mateodelnorte (Author)

I'm not quite sure how to interpret these logs, but perhaps they're suggesting a node was restarted?

[2015-02-04 11:27:19,433][INFO ][node                     ] [Cody Mushumanski Gun Man] stopping ...
[2015-02-04 11:27:19,507][INFO ][node                     ] [Cody Mushumanski Gun Man] stopped
[2015-02-04 11:27:19,507][INFO ][node                     ] [Cody Mushumanski Gun Man] closing ...
[2015-02-04 11:27:19,538][INFO ][node                     ] [Cody Mushumanski Gun Man] closed
[2015-02-04 11:27:20,219][INFO ][node                     ] [May Parker] version[1.1.1], pid[25381], build[f1585f0/2014-04-16T14:27:12Z]
[2015-02-04 11:27:20,219][INFO ][node                     ] [May Parker] initializing ...
[2015-02-04 11:27:20,224][INFO ][plugins                  ] [May Parker] loaded [], sites []
[2015-02-04 11:27:23,264][INFO ][node                     ] [May Parker] initialized
[2015-02-04 11:27:23,264][INFO ][node                     ] [May Parker] starting ...
[2015-02-04 11:27:23,421][INFO ][transport                ] [May Parker] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/ip:9300]}
[2015-02-04 11:27:26,462][INFO ][cluster.service          ] [May Parker] new_master [May Parker][HxBMHAsrREi42knx5SheAg][nj1slog01][inet[/ip:9300]], reason: zen-disco-join (elected_as_master)
[2015-02-04 11:27:26,482][INFO ][discovery                ] [May Parker] elasticsearch/HxBMHAsrREi42knx5SheAg
[2015-02-04 11:27:26,509][INFO ][http                     ] [May Parker] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/ip:9200]}
[2015-02-04 11:27:27,434][INFO ][gateway                  ] [May Parker] recovered [2] indices into cluster_state
[2015-02-04 11:27:27,436][INFO ][node                     ] [May Parker] started
[2015-02-04 11:27:28,548][INFO ][cluster.service          ] [May Parker] added {[logstash-nj1slog01-19617-2016][BfQE6VnZQWiYOskDm2daNw][nj1slog01][inet[/ip:9301]]{client=true, data=false},}, reason: zen-disco-receive(join from node[[logstash-nj1slog01-19617-2016][BfQE6VnZQWiYOskDm2daNw][nj1slog01][inet[/ip:9301]]{client=true, data=false}])
[2015-02-04 11:30:36,577][INFO ][node                     ] [May Parker] stopping ...
[2015-02-04 11:30:36,647][INFO ][node                     ] [May Parker] stopped
[2015-02-04 11:30:36,647][INFO ][node                     ] [May Parker] closing ...
[2015-02-04 11:30:36,654][INFO ][node                     ] [May Parker] closed

@suyograo

#323 will provide more info to debug such backpressure situations. Meanwhile, for more help you can create a topic at https://discuss.elastic.co/c/logstash/logstash-forwarder

GitHub is reserved for issues/enhancements.
