
Failure connecting to [ip]: dial tcp [ip]:[port]: connection timed out #443

Closed · mateodelnorte opened this issue Apr 22, 2015 · 10 comments

@mateodelnorte

After upgrading to v1.5.0.rc2, all of my logstash-forwarders (on ~12 machines) are doing the following. They previously ran well for months.

2015/04/23 10:36:18.374883 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:36:33.375726 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:36:34.376591 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:36:50.377974 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:36:51.378818 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:37:06.379219 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:37:07.380198 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:37:23.477705 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:37:24.478600 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:37:39.479644 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:37:40.480616 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:37:54.565698 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:37:55.567624 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:38:09.653686 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:38:10.654522 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:38:23.749670 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:38:24.750607 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:38:38.837719 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: connection reset by peer
2015/04/23 10:38:39.838559 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:38:55.841870 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:38:56.843241 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:39:12.845933 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:39:13.847824 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:39:28.848058 Failure connecting to my.ip.address: dial tcp my.ip.address:5000: i/o timeout
2015/04/23 10:39:29.849085 Connecting to [my.ip.address]:5000 (my.domain.com)
2015/04/23 10:39:45.850017 Failed to tls handshake with my.ip.address read tcp my.ip.address:5000: i/o timeout
2015/04/23 10:39:46.851291 Connecting to [my.ip.address]:5000 (my.domain.com)

Any ideas?

@mateodelnorte (Author)

After restarting my entire system, I'm able to get individual forwarders to connect, but they time out regardless of how large a timeout I set:

2015/05/04 12:34:09.272485 Read error looking for ack: read tcp [ip:port]: i/o timeout
2015/05/04 12:34:09.272630 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder.crt
2015/05/04 12:34:09.273893 Connecting to [ip]:[port] (domain.name.com)
2015/05/04 12:34:09.345193 Connected to [ip]

When this happens, my entire logstash system appears to silently stop processing logs, and individual forwarders begin failing, one by one, until all of them are giving these log messages.

Any info on this at all?

@jordansissel (Contributor)

This error, "Read error looking for ack: read tcp [ip:port]: i/o timeout", means that lsf had published a payload to Logstash and was waiting for an acknowledgement, and it never received one.

This is usually caused by something in Logstash getting stuck (a slow/stalled filter or output) which prevents the inputs from processing more data and prevents the lumberjack input from acknowledging events.

My guess would be that something in Logstash (or downstream of it, like an output) is stuck for you. Can you attach your config?
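
A minimal way to isolate this, assuming a standard Logstash 1.x pipeline (nothing specific to this setup), is to run Logstash briefly with only a stdout output; if events keep flowing to the console, the filters are fine and the real output is the one applying backpressure:

output {
  # temporary diagnostic output: if events appear here but stall once the
  # normal output is re-enabled, the output side is applying the backpressure
  stdout { codec => rubydebug }
}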

@mateodelnorte (Author)

filter {
  if [type] == "x-message" {
    grok {
      match => { "message" => "(?<datetime>%{YEAR}%{MONTHNUM}%{MONTHDAY}-%{HOUR}:%{MINUTE}:(?:[0-9][0-9]).(?:[0-9][0-9][0-9])) : (?<message>\S+)+" }
      overwrite => [ "message" ]
      add_field => [ "datetime", "%{@timestamp}" ]
      add_field => [ "data.msg", "%{message}" ]
    }
    mutate {
      split => [ "message", "" ]
    }
    ruby {
      code => "
        pairs = event['message']
        for token in pairs
          pair = token.split('=')
          if pair[0] == '49'
            event['data.senderCompId'] = pair[1]
          end
          if pair[0] == '56'
            event['data.targetCompId'] = pair[1]
          end
          event['data.fix_tag_' + pair[0]] = pair[1]
        end
        if event['data.targetCompId'] == 'name' && ['D','F','G'].include?(event['data.fix_tag_35'])
          event['unique_fix_message_id'] = event['data.targetCompId'] + event['data.fix_tag_11']
          event['unique_fix_message_status'] = 'started'
        elsif event['data.senderCompId'] == 'name' && ['8','9'].include?(event['data.fix_tag_35'])
          event['unique_fix_message_id'] = event['data.senderCompId'] + event['data.fix_tag_11']
          event['unique_fix_message_status'] = 'completed'
        end
      "
    }
    if [unique_fix_message_status] == "started" {
      mutate {
        add_tag => [ "fix_message_transaction_started" ]
      }
    } else if [unique_fix_message_status] == "completed" {
      mutate {
        add_tag => [ "fix_message_transaction_completed" ]
      }
    }
  } else if [type] == "fix-event" {
    grok {
      match => { "message" => "(?<datetime>%{YEAR}%{MONTHNUM}%{MONTHDAY}-%{HOUR}:%{MINUTE}:(?:[0-9][0-9]).(?:[0-9][0-9][0-9])) : (?<message>\S+)+" }
      overwrite => [ "message" ]
      add_field => [ "datetime", "%{@timestamp}" ]
      add_field => [ "data.msg", "%{message}" ]
    }
    ruby {
      code => "
        event['data.senderCompId'] = event['file'][/FIX.4.4-name-(.*).event.current.log/, 1]
      "
    }
  } else {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      add_field => [ "datetime", "%{@timestamp}" ]
    }
    if [syslog_message] =~ /^\{.*/ {
      json {
        source => "syslog_message"
        target => "data"
      }
    }
  }
}

I'm building up the 'x-message' flow to include one more filter, the elapsed filter, so that I can track time diffs between messages going in and out of my system.
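
For reference, a rough sketch of what that elapsed stanza might look like, using the tags and the unique_fix_message_id field already set above (the timeout value here is an assumption, not something from this config):

filter {
  if [type] == "x-message" {
    elapsed {
      # pair each "completed" event with its "started" event via unique_fix_message_id;
      # the matched end event gets an elapsed_time field (in seconds)
      start_tag => "fix_message_transaction_started"
      end_tag => "fix_message_transaction_completed"
      unique_id_field => "unique_fix_message_id"
      timeout => 600  # assumed; how long to wait for the matching end event
    }
  }
}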

@mateodelnorte (Author)

Here's the forwarder conf:

{
  "network": {
    "servers": [ "logs.domain.com:5000" ],
    "timeout": 900,
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt"
  },
  "files": [
    {
      "paths": [
        "/var/log/messages",
        "/var/log/secure"
      ],
      "fields": { "type": "syslog" }
    },
    {
      "paths": [
        "/var/fix/log/FIX.4.4-name-*.event.current.log"
      ],
      "fields": { "type": "fix-event", "data.name": "fix-event" }
    },
    {
      "paths": [
        "/var/fix/log/FIX.4.4-name-*.messages.current.log"
      ],
      "fields": { "type": "fix-message", "data.name": "fix-message" }
    }
  ]
}

@jordansissel (Contributor)

Let's see if Logstash is stuck. Are you seeing events flowing out of Logstash into your outputs? If so, how far behind real-time are they?

Also, your config comment above omitted the output { } section.

@mateodelnorte (Author)

Oops, yep. I've got those in separate files:
01 - input file

input {
  lumberjack {
    port => 5000
    type => "logs"
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
  }
}

30 - output file

output {
  elasticsearch { host => localhost }
  #stdout { codec => rubydebug }
}

@mateodelnorte (Author)

I actually just saw this message for the first time:

net[/10.2.3.41:9300]],}, reason: zen-disco-receive(from master [[Karen Page][GgpQMDAnQkWQ0xCw7yDICg][nj1slog01][inet[/10.2.3.41:9300]]])
May 04, 2015 12:21:32 PM org.elasticsearch.node.internal.InternalNode start
INFO: [logstash-nj1slog01-1182-7634] started
Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

Clearly something I need to solve, but I'm a little afraid it's a new problem since I hadn't seen it before.
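
In case it's useful, one way to raise that cap, assuming the stock 1.x startup wrapper (which reads LS_HEAP_SIZE); the paths and the 1g value here are illustrative, not from this install:

# raise the JVM heap before starting Logstash; verify this wrapper honors LS_HEAP_SIZE on this install
LS_HEAP_SIZE=1g /opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/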

Thanks a bunch for the help.

@jordansissel (Contributor)

Not sure if the OOM is a symptom or a cause right now.

There's a known issue with lumberjack/logstash if an output is stuck, somewhat mentioned in this ticket and this ticket. In your case, if Elasticsearch is down or not accepting new documents for some reason (network or other fault), then Logstash will block until that resolves, lsf will keep retrying on timeouts, and Logstash will eventually OOM due to a bug mentioned in the above tickets. In this case, the OOM would be a symptom, not a cause.

Is Elasticsearch healthy?
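
As a quick sketch of a check, assuming Elasticsearch is listening on its default HTTP port on the Logstash host:

# cluster status (green/yellow/red) plus unassigned shards give a quick read on health
curl -s 'http://localhost:9200/_cluster/health?pretty'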

@mateodelnorte (Author)

I'm not quite sure how to interpret these logs, but perhaps they're suggesting a node was restarted?

[2015-02-04 11:27:19,433][INFO ][node                     ] [Cody Mushumanski Gun Man] stopping ...
[2015-02-04 11:27:19,507][INFO ][node                     ] [Cody Mushumanski Gun Man] stopped
[2015-02-04 11:27:19,507][INFO ][node                     ] [Cody Mushumanski Gun Man] closing ...
[2015-02-04 11:27:19,538][INFO ][node                     ] [Cody Mushumanski Gun Man] closed
[2015-02-04 11:27:20,219][INFO ][node                     ] [May Parker] version[1.1.1], pid[25381], build[f1585f0/2014-04-16T14:27:12Z]
[2015-02-04 11:27:20,219][INFO ][node                     ] [May Parker] initializing ...
[2015-02-04 11:27:20,224][INFO ][plugins                  ] [May Parker] loaded [], sites []
[2015-02-04 11:27:23,264][INFO ][node                     ] [May Parker] initialized
[2015-02-04 11:27:23,264][INFO ][node                     ] [May Parker] starting ...
[2015-02-04 11:27:23,421][INFO ][transport                ] [May Parker] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/ip:9300]}
[2015-02-04 11:27:26,462][INFO ][cluster.service          ] [May Parker] new_master [May Parker][HxBMHAsrREi42knx5SheAg][nj1slog01][inet[/ip:9300]], reason: zen-disco-join (elected_as_master)
[2015-02-04 11:27:26,482][INFO ][discovery                ] [May Parker] elasticsearch/HxBMHAsrREi42knx5SheAg
[2015-02-04 11:27:26,509][INFO ][http                     ] [May Parker] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/ip:9200]}
[2015-02-04 11:27:27,434][INFO ][gateway                  ] [May Parker] recovered [2] indices into cluster_state
[2015-02-04 11:27:27,436][INFO ][node                     ] [May Parker] started
[2015-02-04 11:27:28,548][INFO ][cluster.service          ] [May Parker] added {[logstash-nj1slog01-19617-2016][BfQE6VnZQWiYOskDm2daNw][nj1slog01][inet[/ip:9301]]{client=true, data=false},}, reason: zen-disco-receive(join from node[[logstash-nj1slog01-19617-2016][BfQE6VnZQWiYOskDm2daNw][nj1slog01][inet[/ip:9301]]{client=true, data=false}])
[2015-02-04 11:30:36,577][INFO ][node                     ] [May Parker] stopping ...
[2015-02-04 11:30:36,647][INFO ][node                     ] [May Parker] stopped
[2015-02-04 11:30:36,647][INFO ][node                     ] [May Parker] closing ...
[2015-02-04 11:30:36,654][INFO ][node                     ] [May Parker] closed

@suyograo

#323 will provide more info to debug such backpressure situations. Meanwhile, for more help you can create a topic at https://discuss.elastic.co/c/logstash/logstash-forwarder

GitHub is reserved for issues/enhancements.
