
[SLES15]: Fleet-server Agent gets into offline state on machine reboot. #2431

Closed
dikshachauhan-qasource opened this issue Mar 15, 2023 · 29 comments · Fixed by #2697
Labels: bug (Something isn't working), impact:critical (Immediate priority; high value or cost to the product), QA:Validated (Validated by the QA Team), Team:Fleet (Label for the Fleet team)

Comments

@dikshachauhan-qasource

Kibana version: 8.7.0 Prod Kibana cloud environment

Host OS: SLES

Build details:
VERSION: 8.7.0 BC5
BUILD: 61017
COMMIT: e70452f9697a18a8458f3d4618e001b7e3681bed
Artifact: https://staging.elastic.co/8.7.0-69b36bf6/summary-8.7.0.html

Preconditions:

  • 8.7.0 Prod Kibana cloud environment should be available.
  • Agent Policy should be available with System and Fleet-server integrations.

Steps to reproduce:

  • Install the agent using the above-mentioned agent policy.
  • Wait for the agent to generate logs and data streams properly in the Fleet UI.
  • Reboot the machine after 5 minutes.
  • Observe that the Fleet-server Agent goes into an offline state and never returns to Healthy status.

Screenshots:
(screenshot)

  • Though the agent is shown as running on the host VM
    (screenshot)

Testcase Reference:
https://elastic.testrail.io/index.php?/tests/view/2056682

Logs:
elastic-agent-diagnostics-2023-03-15T08-11-08Z-00.zip

@dikshachauhan-qasource added the bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), and impact:medium labels on Mar 15, 2023
@dikshachauhan-qasource (author)

@manishgupta-qasource Please review

@dikshachauhan-qasource added the Team:Fleet label and removed the Team:Elastic-Agent-Control-Plane label on Mar 15, 2023
@manishgupta-qasource manishgupta-qasource transferred this issue from elastic/elastic-agent Mar 15, 2023
@manishgupta-qasource (Collaborator)

Secondary review for this ticket is Done

@cmacknz (Member) commented Mar 21, 2023

The fleet state is starting and Fleet server reports itself as reconfiguring:

fleet_state: 0
fleet_message: ""
#...
  state:
    state: 2
    message: 'Healthy: communicating with pid ''1022'''
    units:
      ? unittype: 0
        unitid: fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024
      : state: 1
        message: Re-configuring
      ? unittype: 1
        unitid: fleet-server-default
      : state: 1
        message: Re-configuring
    version_info:
      name: fleet-server
      version: 8.7.0
      meta:
        build_time: 2023-03-07 15:37:52 +0000 UTC
        commit: aca39518

@cmacknz (Member) commented Mar 21, 2023

These are the Fleet Server logs; it seems like it can't connect to Elasticsearch: dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:42353->[::1]:53: read: connection refused

{"log.level":"info","@timestamp":"2023-03-14T13:20:43.941Z","message":"Setting cache config options","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","cfg":{"APIKeyJitter":300000000000,"APIKeyTTL":900000000000,"ActionTTL":300000000000,"ArtifactTTL":86400000000000,"EnrollKeyTTL":60000000000,"MaxCost":52428800,"NumCounters":500000},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.947Z","message":"Starting","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"STARTING","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.948Z","message":"initial server configuration","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","new":{"Fleet":{"Agent":{"ID":"a2f4b285-57ed-42f0-9cde-fedd137b6474","Logging":{"Level":"info"},"Version":"8.7.0"},"Host":{"ID":"","Name":""}},"HTTP":{"Enabled":true,"Host":"unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock","Port":5066,"SecurityDescriptor":"","User":""},"Inputs":[{"Cache":{"APIKeyJitter":0,"APIKeyTTL":0,"ActionTTL":0,"ArtifactTTL":0,"EnrollKeyTTL":0,"MaxCost":0,"NumCounters":0},"Monitor":{"FetchSize":0,"PollTimeout":0},"Policy":{"ID":""},"Server":{"Bulk":{"FlushInterval":250000000,"FlushMaxPending":8,"FlushThresholdCount":2048,"FlushThresholdSize":1048576},"CompressionLevel":1,"CompressionThresh":1024,"GC":{"CleanupAfterExpiredInterval":"30d","ScheduleInterval":3600000000000},"Host":"0.0.0.0","Instrumentation":{"APIKey":"","Enabled":false,"Environment":"","GlobalLabels":"","Hosts":null,"SecretToken":"","TLS":{"ServerCA":"","ServerCertificate":"","SkipVerify":false},"TransactionSampleRate":""},"InternalPort":8221,"Limits":{"AckLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":2097152},"ArtifactLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"CheckinLimit":{"Burst":1000,"Interval":1000000,"Max":0,"MaxBody":1048576},"EnrollLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":524288},"MaxAgents":0,"MaxConnections":0,"MaxHeaderByteSize":8192,"PolicyThrottle":5000000,"StatusLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"UploadChunkLimit":{"Burst":10,"Interval":3000000,"Max":5,"MaxBody":4194304},"UploadEndLimit":{"Burst":5,"Interval":2000000000,"Max":2,"MaxBody":1024},"UploadStartLimit":{"Burst":8,"Interval":3000000000,"Max":3,"MaxBody":5242880}},"Port":8220,"Profiler":{"Bind":"localhost:6060","Enabled":false},"Runtime":{"GCPercent":0},"TLS":{"CASha256":null,"CAs":null,"Certificate":{"Certificate":"-----BEGIN CERTIFICATE-----\nMIIDSjCCAjKgAwIBAgICBnowDQYJKoZIhvcNAQELBQAwLDEWMBQGA1UEChMNZWxh\nc3RpYy1mbGVldDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTIzMDMxNDEzMTQzOFoX\nDTMzMDMxNDEzMTQzOFowMzEWMBQGA1UEChMNZWxhc3RpYy1mbGVldDEZMBcGA1UE\nAxMQaXAtMTcyLTMxLTU3LTEwOTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoC\nggEBAL2xkV5OKCL9ZqFgoH7/y/xvlM6EABykXOpZjP3/+ugx/nAMA0dFaeUHU7MN\nqvMw91kgrVGkJ8nb6uRXa6PbBVY/U5QVPVC1QHxNXpo9Ye9vDaf7C336UEr12+IF\nCC12i3SyAOD5gWmckgeRDJsVIkJ+W1r5JWUkWjIB6QjQzZlX3Jz6DH92VmMKEEfe\nKaWS28O73pNjVVxh0TePL+1+ADogIiGpEIOxv0wq9q6bJcyDhBFKHhYBgCFH30W+\ns31RIt2npK9BvD5/PbpcKSKqhz1lsJOTO0YMY1KGcZWiyKQH/xkEpBaN6SpfTn7a\n/QH7J6sXed7uzWLfOlHpXKqcQ1sCAwEAAaNvMG0wDgYDVR0PAQH/BAQDAgeAMB0G\nA1UdJQQWMBQGCCsGAQUFBwMCBggrBgEFBQcDATAfBgNVHSMEGDAWgBQbcyE1Fo5b\nrmOtwGsfEN8DZkgLNDAbBgNVHREEFDASghBpcC0xNzItMzEtNTctMTA5MA0GCSqG\nSIb3DQEBCwUAA4IBAQANYSIoOoxi1epv/6u4bvNHSNEIUSBkL/+uTAAQwcrvc2/R\nA/xXphu9/DGBx17SOOz+L07NBkCsZdZFPYQ2czb58PW9uiqMIi62D3RmuMAl8clw\ng4U96cnNRB1e2ZyQHHnpOgiMeu1Mp/aUSnJJd+C5b044sMLreeukJVVGsS/Asb+w\nuzBYVcA0EHQueTwnCD9ZcqTLtMJ6RGn+eJs2qSOrDML+xcnSpdWTlF8g+wK2pBds\nVa004Y+fq0lG3eENpU3JAH5O/krIvaf2TYZvs6z0wQK5jqvyfOiSQwTe0/djlyql\nzkJV6b5n/SjRb6F9YhW+jOSrCgjrwMHYLDA4Nc+8\n-----END 
CERTIFICATE-----\n","Key":"[redacted]","Passphrase":""},"CipherSuites":null,"ClientAuth":0,"CurveTypes":null,"Enabled":null,"VerificationMode":"full","Versions":null},"Timeouts":{"CheckinJitter":30000000000,"CheckinLongPoll":300000000000,"CheckinTimestamp":30000000000,"Idle":30000000000,"Read":60000000000,"ReadHeader":5000000000,"Write":600000000000}},"Type":""}],"Logging":{"Files":null,"Level":"info","Pretty":false,"ToFiles":true,"ToStderr":true},"Output":{"Elasticsearch":{"APIKey":"","Headers":null,"Hosts":["ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443"],"MaxConnPerHost":128,"MaxContentLength":104857600,"MaxRetries":3,"Path":"","Protocol":"https","ProxyDisable":false,"ProxyHeaders":{},"ProxyURL":"","ServiceToken":"[redacted]","TLS":null,"Timeout":90000000000},"Extra":null}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.948Z","message":"starting server on configuration change","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.950Z","message":"Starting stats endpoint","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","@timestamp":"2023-03-14T13:20:43.95Z","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.003Z","message":"Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/fleet-server-default.sock (configured: unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.209Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","cluster.addr":["ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443"],"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.209Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"log.logger":"fleet-metrics.api","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.209Z","message":"Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"state":"FAILED","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","message":"Fleet Server failed","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","error.message":"dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","@timestamp":"2023-03-14T13:20:44.21Z","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":857},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024 (STARTING->FAILED): Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":857},"message":"Unit state changed fleet-server-default (STARTING->FAILED): Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}

I believe this is a DNS lookup failure; it can also be seen in the error messages from the Beats started by the agent:

{"log.level":"warn","@timestamp":"2023-03-14T13:20:47.401Z","message":"DNS lookup failure \"ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io\": lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:57757->[::1]:53: read: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.logger":"transport","log.origin":{"file.line":52,"file.name":"transport/tcp.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:47.401Z","message":"Error dialing lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:57757->[::1]:53: read: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"address":"ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443","service.name":"metricbeat","network":"tcp","ecs.version":"1.6.0","log.logger":"esclientleg","log.origin":{"file.line":38,"file.name":"transport/logging.go"},"ecs.version":"1.6.0"}

@cmacknz (Member) commented Mar 21, 2023

Interestingly, the same failure happens later, but Fleet Server reports itself as configuring instead of failed:

{"log.level":"info","@timestamp":"2023-03-14T13:20:43.941Z","message":"Setting cache config options","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","cfg":{"APIKeyJitter":300000000000,"APIKeyTTL":900000000000,"ActionTTL":300000000000,"ArtifactTTL":86400000000000,"EnrollKeyTTL":60000000000,"MaxCost":52428800,"NumCounters":500000},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.947Z","message":"Starting","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"STARTING","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.948Z","message":"initial server configuration","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","new":{"Fleet":{"Agent":{"ID":"a2f4b285-57ed-42f0-9cde-fedd137b6474","Logging":{"Level":"info"},"Version":"8.7.0"},"Host":{"ID":"","Name":""}},"HTTP":{"Enabled":true,"Host":"unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock","Port":5066,"SecurityDescriptor":"","User":""},"Inputs":[{"Cache":{"APIKeyJitter":0,"APIKeyTTL":0,"ActionTTL":0,"ArtifactTTL":0,"EnrollKeyTTL":0,"MaxCost":0,"NumCounters":0},"Monitor":{"FetchSize":0,"PollTimeout":0},"Policy":{"ID":""},"Server":{"Bulk":{"FlushInterval":250000000,"FlushMaxPending":8,"FlushThresholdCount":2048,"FlushThresholdSize":1048576},"CompressionLevel":1,"CompressionThresh":1024,"GC":{"CleanupAfterExpiredInterval":"30d","ScheduleInterval":3600000000000},"Host":"0.0.0.0","Instrumentation":{"APIKey":"","Enabled":false,"Environment":"","GlobalLabels":"","Hosts":null,"SecretToken":"","TLS":{"ServerCA":"","ServerCertificate":"","SkipVerify":false},"TransactionSampleRate":""},"InternalPort":8221,"Limits":{"AckLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":2097152},"ArtifactLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"CheckinLimit":{"Burst":1000,"Interval":1000000,"Max":0,"MaxBody":1048576},"EnrollLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":524288},"MaxAgents":0,"MaxConnections":0,"MaxHeaderByteSize":8192,"PolicyThrottle":5000000,"StatusLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"UploadChunkLimit":{"Burst":10,"Interval":3000000,"Max":5,"MaxBody":4194304},"UploadEndLimit":{"Burst":5,"Interval":2000000000,"Max":2,"MaxBody":1024},"UploadStartLimit":{"Burst":8,"Interval":3000000000,"Max":3,"MaxBody":5242880}},"Port":8220,"Profiler":{"Bind":"localhost:6060","Enabled":false},"Runtime":{"GCPercent":0},"TLS":{"CASha256":null,"CAs":null,"Certificate":{"Certificate":"-----BEGIN CERTIFICATE-----\nMIIDSjCCAjKgAwIBAgICBnowDQYJKoZIhvcNAQELBQAwLDEWMBQGA1UEChMNZWxh\nc3RpYy1mbGVldDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTIzMDMxNDEzMTQzOFoX\nDTMzMDMxNDEzMTQzOFowMzEWMBQGA1UEChMNZWxhc3RpYy1mbGVldDEZMBcGA1UE\nAxMQaXAtMTcyLTMxLTU3LTEwOTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoC\nggEBAL2xkV5OKCL9ZqFgoH7/y/xvlM6EABykXOpZjP3/+ugx/nAMA0dFaeUHU7MN\nqvMw91kgrVGkJ8nb6uRXa6PbBVY/U5QVPVC1QHxNXpo9Ye9vDaf7C336UEr12+IF\nCC12i3SyAOD5gWmckgeRDJsVIkJ+W1r5JWUkWjIB6QjQzZlX3Jz6DH92VmMKEEfe\nKaWS28O73pNjVVxh0TePL+1+ADogIiGpEIOxv0wq9q6bJcyDhBFKHhYBgCFH30W+\ns31RIt2npK9BvD5/PbpcKSKqhz1lsJOTO0YMY1KGcZWiyKQH/xkEpBaN6SpfTn7a\n/QH7J6sXed7uzWLfOlHpXKqcQ1sCAwEAAaNvMG0wDgYDVR0PAQH/BAQDAgeAMB0G\nA1UdJQQWMBQGCCsGAQUFBwMCBggrBgEFBQcDATAfBgNVHSMEGDAWgBQbcyE1Fo5b\nrmOtwGsfEN8DZkgLNDAbBgNVHREEFDASghBpcC0xNzItMzEtNTctMTA5MA0GCSqG\nSIb3DQEBCwUAA4IBAQANYSIoOoxi1epv/6u4bvNHSNEIUSBkL/+uTAAQwcrvc2/R\nA/xXphu9/DGBx17SOOz+L07NBkCsZdZFPYQ2czb58PW9uiqMIi62D3RmuMAl8clw\ng4U96cnNRB1e2ZyQHHnpOgiMeu1Mp/aUSnJJd+C5b044sMLreeukJVVGsS/Asb+w\nuzBYVcA0EHQueTwnCD9ZcqTLtMJ6RGn+eJs2qSOrDML+xcnSpdWTlF8g+wK2pBds\nVa004Y+fq0lG3eENpU3JAH5O/krIvaf2TYZvs6z0wQK5jqvyfOiSQwTe0/djlyql\nzkJV6b5n/SjRb6F9YhW+jOSrCgjrwMHYLDA4Nc+8\n-----END 
CERTIFICATE-----\n","Key":"[redacted]","Passphrase":""},"CipherSuites":null,"ClientAuth":0,"CurveTypes":null,"Enabled":null,"VerificationMode":"full","Versions":null},"Timeouts":{"CheckinJitter":30000000000,"CheckinLongPoll":300000000000,"CheckinTimestamp":30000000000,"Idle":30000000000,"Read":60000000000,"ReadHeader":5000000000,"Write":600000000000}},"Type":""}],"Logging":{"Files":null,"Level":"info","Pretty":false,"ToFiles":true,"ToStderr":true},"Output":{"Elasticsearch":{"APIKey":"","Headers":null,"Hosts":["ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443"],"MaxConnPerHost":128,"MaxContentLength":104857600,"MaxRetries":3,"Path":"","Protocol":"https","ProxyDisable":false,"ProxyHeaders":{},"ProxyURL":"","ServiceToken":"[redacted]","TLS":null,"Timeout":90000000000},"Extra":null}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.948Z","message":"starting server on configuration change","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.950Z","message":"Starting stats endpoint","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","@timestamp":"2023-03-14T13:20:43.95Z","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.003Z","message":"Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/fleet-server-default.sock (configured: unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.209Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","cluster.addr":["ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443"],"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.209Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"log.logger":"fleet-metrics.api","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.209Z","message":"Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"state":"FAILED","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","message":"Fleet Server failed","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","error.message":"dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","@timestamp":"2023-03-14T13:20:44.21Z","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":857},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024 (STARTING->FAILED): Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":857},"message":"Unit state changed fleet-server-default (STARTING->FAILED): Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}

I suspect Fleet server isn't reliably reporting the failed state here. @michel-laterman is this something we can improve?

@cmacknz (Member) commented Mar 21, 2023

@dikshachauhan-qasource per my comment in #2431 (comment), this doesn't seem like a bug in the agent; DNS lookups for Elasticsearch from Fleet Server are failing.

Is this reproducible? If it is, was there any special network configuration that had to be set up?

@dikshachauhan-qasource (author) commented Mar 22, 2023

Hi @cmacknz

We have reproduced this issue on 8.7.0 BC6 and found it still reproducible.

> If it is, was there any special network configuration that had to be set up?

Here, I have used AWS "ami-0c544bda9765444c2" with default network settings.

Build details:

VERSION: 8.7.0 BC6
BUILD: 61051
COMMIT: 04ef24287f26854ad99a46ae983854c6184717cb
Artifact: https://staging.elastic.co/8.7.0-a7fb3750/summary-8.7.0.html

Screenshot:

  • Before reboot
    (screenshots)

  • After reboot
    (screenshots)

Please let us know if we are missing anything you require.

Thanks
QAS

@cmacknz (Member) commented Mar 23, 2023

Thank you, so the machine can still resolve the hostname but the agent can't. That is very strange.

@harshitgupta-qasource

Hi @cmacknz

We have revalidated this issue on the latest 8.8.0 BC6 Kibana cloud build and found it still reproducible.

Observations:

  • Fleet-server Agent gets into the offline state permanently on machine reboot.

Build details:
VERSION: 8.8.0 BC6
BUILD: 63115
COMMIT: a4c256b39f7d1ee34abe61109a817ec7f5329009
Artifacts: https://staging.elastic.co/8.8.0-375abdf7/summary-8.8.0.html

Screenshot:
(screenshot)

Agent Diagnostic logs:
elastic-agent-diagnostics-2023-05-23T07-20-06Z-00.zip

Please let us know if anything else is required from our end.

Thanks!

@joshdover (Contributor)

If we hadn't already released with this bug, I would consider it a blocker. Moving this to the Fleet Server area and adding it to the current sprint as a P1. FYI @jlind23

@joshdover added the impact:critical label (Immediate priority; high value or cost to the product) and removed the impact:medium label on May 23, 2023
@juliaElastic (Contributor) commented May 23, 2023

I'm seeing the same ES connection and DNS lookup failure errors in the latest diagnostics.
@harshitgupta-qasource Is this an on-prem installation on the SLES host, or are you enrolling a Fleet Server into an Elastic Cloud instance? I am wondering where this GCP hostname is coming from.

{"log.level":"warn","@timestamp":"2023-05-23T07:05:56.499Z","message":"DNS lookup failure \"34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io\": lookup 34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:38795->[::1]:53: read: connection refused","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"transport","log.origin":{"file.line":52,"file.name":"transport/tcp.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-23T07:05:56.499Z","message":"Error dialing lookup 34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:38795->[::1]:53: read: connection refused","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"esclientleg","log.origin":{"file.line":38,"file.name":"transport/logging.go"},"service.name":"filebeat","network":"tcp","address":"34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io:443","ecs.version":"1.6.0","ecs.version":"1.6.0"}

EDIT: I found that the DNS lookup error is temporary, and the connection to ES is established after that.

{"log.level":"info","@timestamp":"2023-05-23T07:06:16.639Z","message":"Connection to backoff(elasticsearch(https://34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io:443)) established",

I could reproduce the issue with a SLES VM enrolled into a staging cloud instance. After the VM reboot, I still see agent logs and metrics showing up in Kibana, but the status is Offline, so it looks like the last_checkin fields are not being updated.

I'm seeing these errors as well, which indicate that fleet-server never started listening on port 8221.

[elastic_agent][error] ack retrier: commit failed with error: acknowledge 1 actions '[action_id: policy:fleet-server-policy:1:1, type: POLICY_CHANGE]' for elastic-agent 'a4376429-b50d-4323-8a23-31650ccafd25' failed: fail to ack to fleet: all hosts failed: 1 error occurred:
	* requester 0/1 to host https://localhost:8221/ errored: Post "https://localhost:8221/api/fleet/agents/a4376429-b50d-4323-8a23-31650ccafd25/acks?": dial tcp [::1]:8221: connect: connection refused
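
As a quick sanity check of that, one can verify from the host whether anything is listening on the internal port; a hypothetical stdlib snippet (not part of the agent):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Fleet Server's internal API port from the configuration above.
	conn, err := net.DialTimeout("tcp", "localhost:8221", 2*time.Second)
	if err != nil {
		// "connection refused" here matches the failing ack requests above.
		fmt.Println("nothing listening on 8221:", err)
		return
	}
	defer conn.Close()
	fmt.Println("port 8221 is open")
}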

I tried doing an agent restart, and the agent then went back to HEALTHY status.

sudo elastic-agent restart

I think what happens is that after the reboot, the fleet-server process tries to start, hits the temporary DNS lookup issue, and remains in a stopped state. Is there any retry logic that should kick in?

{"log.level":"info","@timestamp":"2023-05-23T11:32:27.031Z","message":"Boot fleet-server","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","pid":1036,"ecs.version":"1.6.0","ppid":823,"exe":"/opt/Elastic/Agent/data/elastic-agent-80e3a2/components/fleet-server","args":["--agent-mode","-E","logging.level=debug","-E","logging.to_stderr=true","-E","http.enabled=true","-E","http.host=unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock","-E","path.data=/opt/Elastic/Agent/data/elastic-agent-80e3a2/run/fleet-server-default"],"version":"8.8.0","commit":"86832af2","ecs.version":"1.6.0"}

{"log.level":"info","@timestamp":"2023-05-23T11:32:28.028Z","message":"Setting cache config options","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","cfg":{"APIKeyJitter":300000000000,"APIKeyTTL":900000000000,"ActionTTL":300000000000,"ArtifactTTL":86400000000000,"EnrollKeyTTL":60000000000,"MaxCost":52428800,"NumCounters":500000},"ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.043Z","message":"Starting","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"STARTING","@timestamp":"2023-05-23T11:32:28.03Z","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.031Z","message":"initial server configuration","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","new":{"Fleet":{"Agent":{"ID":"a4376429-b50d-4323-8a23-31650ccafd25","Logging":{"Level":"info"},"Version":"8.8.0"},"Host":{"ID":"","Name":""}},"HTTP":{"Enabled":true,"Host":"unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock","Port":5066,"SecurityDescriptor":"","User":""},"Inputs":[{"Cache":{"APIKeyJitter":0,"APIKeyTTL":0,"ActionTTL":0,"ArtifactTTL":0,"EnrollKeyTTL":0,"MaxCost":0,"NumCounters":0},"Monitor":{"FetchSize":0,"PollTimeout":0},"Policy":{"ID":""},"Server":{"Bulk":{"FlushInterval":250000000,"FlushMaxPending":8,"FlushThresholdCount":2048,"FlushThresholdSize":1048576},"CompressionLevel":1,"CompressionThresh":1024,"GC":{"CleanupAfterExpiredInterval":"30d","ScheduleInterval":3600000000000},"Host":"0.0.0.0","Instrumentation":{"APIKey":"","APIKeyPath":"","Enabled":false,"Environment":"","GlobalLabels":"","Hosts":null,"SecretToken":"","SecretTokenPath":"","TLS":{"ServerCA":"","ServerCertificate":"","SkipVerify":false},"TransactionSampleRate":""},"InternalPort":8221,"Limits":{"AckLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":2097152},"ArtifactLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"CheckinLimit":{"Burst":1000,"Interval":1000000,"Max":0,"MaxBody":1048576},"EnrollLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":524288},"MaxAgents":0,"MaxConnections":0,"MaxHeaderByteSize":8192,"PolicyThrottle":5000000,"StatusLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"UploadChunkLimit":{"Burst":10,"Interval":3000000,"Max":5,"MaxBody":4194304},"UploadEndLimit":{"Burst":5,"Interval":2000000000,"Max":2,"MaxBody":1024},"UploadStartLimit":{"Burst":8,"Interval":3000000000,"Max":3,"MaxBody":5242880}},"Port":8220,"Profiler":{"Bind":"localhost:6060","Enabled":false},"Runtime":{"GCPercent":0,"MemoryLimit":0},"TLS":{"CASha256":null,"CAs":null,"Certificate":{"Certificate":"-----BEGIN CERTIFICATE-----\nMIIDSjCCAjKgAwIBAgICBnowDQYJKoZIhvcNAQELBQAwLDEWMBQGA1UEChMNZWxh\nc3RpYy1mbGVldDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTIzMDUyMzExMzAzOVoX\nDTMzMDUyMzExMzAzOVowMzEWMBQGA1UEChMNZWxhc3RpYy1mbGVldDEZMBcGA1UE\nAxMQaXAtMTcyLTMxLTkwLTE0MjCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoC\nggEBAMCZv8208a6woigkmRN52E4MoXufXzabV9ahUl1lwqhrXCB69OY57kdLjsVq\nCdvXEu89MBA7ULn3K/s+Lw+xEVVAljrtu3hJKfzjjvKi6L5VnR3G5uHBAly0Dvr7\nxFBPW6yCfVDp4rrwcs46eAPVBcdBb1mBl4iFCfbMITsmhgwx0jSQpD7qhr4yoJdn\nWQMMvJvAgYvhgiJ1qZAudwE/HUoSdLtr8Vk8G9ht4Uc7SrFDwFBU2zDSGnarkbYz\n5Bymwtkxl6rij01IlHricaZIA0ZL5FfPpobvnjp9XmFX3jHPAdKNWwItiXtOdCNt\nJ/MuDFzIHA+g293UiRzQShGNpUMCAwEAAaNvMG0wDgYDVR0PAQH/BAQDAgeAMB0G\nA1UdJQQWMBQGCCsGAQUFBwMCBggrBgEFBQcDATAfBgNVHSMEGDAWgBTVbtqR64jJ\n+FVw4+fuuSqEy4JEVjAbBgNVHREEFDASghBpcC0xNzItMzEtOTAtMTQyMA0GCSqG\nSIb3DQEBCwUAA4IBAQCRhD8zSG3bJydkKmRjmpA7l59u67Dtf9o3B3SBtuqgVqqx\ncLwGjJrCa6Rn7/zAHAIfQtw0lwpq9TQx/+GSpl15e/6wX6Q9HdnblV0dklB8NduO\n89X+bNyLuKJ3MgFKgxDmuZTbpy2nmZnYx77NzB15AnOJ6Fx3TwdsrMGRcCZxmSW7\n+NKHS2eKPckmDFJEImaPEfwA0oaoFBt2MO447g+pibP+ORuQ2qtUW387IYqggwBR\nFdPontJi1Jk+1NSLqp1wWTwg366qMweK60eM7GGujwnfmTii7DXQEzw9wGlcVQiH\nWS2B8QU0Z2tEE5HCfwGXww4i5p9NTnaeqWjsdZpf\n-----END 
CERTIFICATE-----\n","Key":"[redacted]","Passphrase":"","PassphrasePath":""},"CipherSuites":null,"ClientAuth":0,"CurveTypes":null,"Enabled":null,"VerificationMode":"full","Versions":null},"Timeouts":{"CheckinJitter":30000000000,"CheckinLongPoll":300000000000,"CheckinMaxPoll":3600000000000,"CheckinTimestamp":30000000000,"Idle":30000000000,"Read":60000000000,"ReadHeader":5000000000,"Write":600000000000}},"Type":""}],"Logging":{"Files":null,"Level":"info","Pretty":false,"ToFiles":true,"ToStderr":true},"Output":{"Elasticsearch":{"Headers":null,"Hosts":["787b58036cb94920952d2b39315ad26b.eastus2.staging.azure.foundit.no:443"],"MaxConnPerHost":128,"MaxContentLength":104857600,"MaxRetries":3,"Path":"","Protocol":"https","ProxyDisable":false,"ProxyHeaders":{},"ProxyURL":"","ServiceToken":"[redacted]","ServiceTokenPath":"","TLS":null,"Timeout":90000000000},"Extra":null}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.031Z","message":"starting server on configuration change","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.031Z","message":"Starting stats endpoint","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.032Z","message":"Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/fleet-server-default.sock (configured: unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.037Z","message":"Server configuration update","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.037Z","message":"Re-configuring","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"CONFIGURING","ecs.version":"1.6.0"}

{"log.level":"error","@timestamp":"2023-05-23T11:32:28.054Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","cluster.addr":["787b58036cb94920952d2b39315ad26b.eastus2.staging.azure.foundit.no:443"],"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup 787b58036cb94920952d2b39315ad26b.eastus2.staging.azure.foundit.no on [::1]:53: read udp [::1]:49216->[::1]:53: read: connection refused","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.054Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"log.logger":"fleet-metrics.api","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}

{"log.level":"info","@timestamp":"2023-05-23T11:32:52.887Z","message":"Connection to backoff(elasticsearch(https://787b58036cb94920952d2b39315ad26b.eastus2.staging.azure.foundit.no:443)) established","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","log.origin":{"file.line":145,"file.name":"pipeline/client_worker.go"},"service.name":"metricbeat","ecs.version":"1.6.0"}

Any ideas @michel-laterman @cmacknz ?

@harshitgupta-qasource

Hi @juliaElastic

Thank you for looking into this issue.

We have installed the Fleet Server on a SLES 15 host against the 8.8.0 BC6 Kibana cloud build.

Impacted Testcase:
#C149942

Please let us know if anything else is required from our end.

Thanks!

@juliaElastic (Contributor)

@harshitgupta-qasource Do we have the same testcase on a different OS where the agent comes back healthy after reboot?

@cmacknz (Member) commented May 23, 2023

@juliaElastic Fleet server might not be handling the error properly if it can't reach ES at init time. The fail elasticsearch info log message comes from:

// Validate connection
resp, err := info(ctx, es)
if err != nil {
	zlog.Error().Err(err).Msg("fail elasticsearch info")
	return nil, err
}

This is called by runServer via initBulker:

// Create the bulker subsystem
bulker, err := f.initBulker(bulkCtx, tracer, cfg)
if err != nil {
	return err
}

That error eventually gets handled in the Run function for Fleet Server:

case err := <-ech:
	f.reporter.UpdateState(client.UnitStateFailed, fmt.Sprintf("Error - %s", err), nil) //nolint:errcheck // unclear on what should we do if updating the status fails?
	log.Error().Err(err).Msg("Fleet Server failed")
	return err

I can see the Fleet Server failed log message triggered in the logs:

{"log.level":"error","@timestamp":"2023-05-23T07:05:50.409Z","message":"Fleet Server failed","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","error.message":"dial tcp: lookup 34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:38455->[::1]:53: read: connection refused","ecs.version":"1.6.0"}

The function should just be called again 2s later, and I can see Fleet Server transition back to configuring 2s later in the logs, but nothing else happens afterwards. It's possible some of the state wasn't cleaned up properly to actually let it recover.

{"log.level":"info","@timestamp":"2023-05-23T07:05:52.415Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":972},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-93fbe5b0-f937-11ed-af24-a79b52bd536f (FAILED->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-93fbe5b0-f937-11ed-af24-a79b52bd536f","type":"input","state":"CONFIGURING","old_state":"FAILED"},"ecs.version":"1.6.0"}

go func() {
	defer close(srvDone)
	for {
		err := srv.Run(srvCtx, cfg)
		if err == nil || errors.Is(err, context.Canceled) {
			return
		}
		// sleep some before calling Run again
		_ = sleep.WithContext(srvCtx, kAgentModeRestartLoopDelay)
	}
}()

@cmacknz (Member) commented May 23, 2023

I wonder if Fleet server is deadlocked. You can look at the goroutine profile with go tool pprof -http :8081 ./components/fleet-server-default/fleet-server-fleet_server-93fbe5b0-f937-11ed-af24-a79b52bd536f/goroutine.pprof.gz

@harshitgupta-qasource commented May 24, 2023

Hi @juliaElastic

We have test cases for setting up the Fleet Server on various OSs against the Kibana cloud build, and we reboot the Fleet Server as part of exploratory testing.

We have re-validated this issue on the 8.8 BC8 kibana cloud build with Windows and Ubuntu fleet server agents.

We have observed that after restarting these fleet servers, the fleet server agent came back Healthy.

So the issue is only reproducible for the SLES15 Fleet server.

Testcases for setting up the Fleet Server:

Please let us know if anything else is required from our end.

Thanks!

@joshdover (Contributor)

I'm skeptical that this is truly a SLES-specific issue, though maybe with the default configuration it's more likely to happen with SLES OOTB. It seems likely that this can happen on other systems if they hit a DNS lookup failure on startup.

We definitely need to be resilient to such errors, so I'm inclined to leave this as a high priority issue that should be fixed relatively soon.

@jlind23 (Contributor) commented May 31, 2023

@michel-laterman will you have time to take a look at this as soon as you can?

@jlind23 (Contributor) commented Jun 8, 2023

@michel-laterman assigning this to you in this sprint.

@michel-laterman (Contributor)

I'm digging into this. I can recreate it on an AWS SLES15 instance with the latest 8.9.0-SNAPSHOT and see all the same symptoms as reported.
It does look like an issue where, if Elasticsearch (or DNS) is not available when fleet-server is starting, there is some state drift.

Here it fails on the initial startup and restarts, but fleet-server does not report the failed state on that attempt:

{"log.level":"error","@timestamp":"2023-06-12T22:16:40.128Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"cluster.addr":["200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co:443"],"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:40.128Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:40.128Z","message":"Error - dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"FAILED","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-06-12T22:16:40.128Z","message":"Fleet Server failed","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","error.message":"dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-06-12T22:16:40.128Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":991},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66 (STARTING->FAILED): Error - dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-06-12T22:16:40.128Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":991},"message":"Unit state changed fleet-server-default (STARTING->FAILED): Error - dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
...
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.131Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":985},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66 (STARTING->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66","type":"input","state":"CONFIGURING","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.131Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":985},"message":"Unit state changed fleet-server-default (STARTING->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"CONFIGURING","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-06-12T22:16:42.132Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:38166->[::1]:53: read: connection refused","ecs.version":"1.6.0","service.name":"fleet-server","cluster.addr":["200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co:443"],"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.132Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}

I think we need more robust handling around how Elasticsearch readiness is detected by fleet-server (which is the same issue as #2683 that @jsoriano is assigned to).
I'll try to look into why the component was not reporting a failed state after it re-configures.
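
For illustration, a rough stdlib-only sketch of the kind of readiness handling being suggested (hypothetical; not fleet-server's actual implementation or the eventual fix): keep polling the Elasticsearch endpoint with backoff instead of failing the whole startup on the first DNS or connection error.

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitForElasticsearch retries a GET against the cluster root endpoint until it
// succeeds or the context is cancelled, backing off between attempts.
func waitForElasticsearch(ctx context.Context, url string) error {
	backoff := time.Second
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			return nil // reachable; continue with the rest of startup
		}
		fmt.Println("elasticsearch not ready yet:", err)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	// Hypothetical URL; the real output host comes from the agent policy.
	if err := waitForElasticsearch(ctx, "https://example.cloud.es.io:443"); err != nil {
		fmt.Println("gave up waiting for elasticsearch:", err)
	}
}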

@michel-laterman (Contributor)

Actually, I think there is a deadlock occurring, but I'm not sure why.
Block/mutex pprofs need to be enabled with calls to runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction.

We can see on the instance where this occurs that fleet-server is running; however, the ports are not in use (internal/external API and metrics).
The error logs indicate that starting the bulker fails, which should cause server.Fleet to exit with an error and lead the server.Agent wrapper to restart it, but neither that exit message nor the one logged when the error channel is selected appears.
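
For reference, a minimal sketch of how those profiles get enabled in a Go binary (a hypothetical standalone example; fleet-server would need equivalent calls for block/mutex data to show up in its pprof output):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	runtime.SetBlockProfileRate(1)     // sample every blocking event
	runtime.SetMutexProfileFraction(1) // sample every mutex contention event

	// Profiles then become available at /debug/pprof/block and /debug/pprof/mutex.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}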

@cmacknz (Member) commented Jun 13, 2023

> Actually, I think there is a deadlock occurring, but I'm not sure why.
> Block/mutex pprofs need to be enabled with calls to runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction.

For a deadlock you should be able to spot it with just the goroutine profile if you have that already. That will give you the stack trace of every goroutine. The diagnostics collect it by default.

@michel-laterman (Contributor) commented Jun 14, 2023

I think I found the issue: our error channel is being created on each iteration of the run loop: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L136

However, the runServer call actually occurs in a different goroutine: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L117.

This can lead to a config change being read before the runServer method is run, leaking a channel; the error passed to the old channel is lost, and thus fleet-server is stuck: it should signal that it has failed, but never does.
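
A small, self-contained sketch of the pattern just described (heavily simplified and hypothetical; not the actual fleet-server source): the loop re-creates the error channel on every iteration, so an error written to a channel from an earlier iteration is never received.

package main

import (
	"errors"
	"fmt"
	"time"
)

// startServer stands in for runServer: it runs asynchronously and reports a
// startup failure (e.g. the DNS lookup error) on the channel it was given.
func startServer(ech chan error) {
	go func() {
		time.Sleep(10 * time.Millisecond)
		ech <- errors.New("dial tcp: lookup ...: connection refused")
		fmt.Printf("error written to %p, but nothing reads this channel anymore\n", ech)
	}()
}

func main() {
	cfgCh := make(chan string, 2)
	cfgCh <- "initial config"
	cfgCh <- "config update" // arrives quickly after reboot

	for i := 0; i < 2; i++ {
		// BUG: a fresh error channel is created on every iteration of the loop.
		ech := make(chan error, 1)
		startServer(ech)

		select {
		case cfg := <-cfgCh:
			// A config change can be read before the server's error arrives; the
			// next iteration then creates a new channel, and the error sent on
			// this one is lost, so a FAILED state is never reported upstream.
			fmt.Println("re-configuring with:", cfg)
		case err := <-ech:
			fmt.Println("server failed:", err)
		}
	}
	time.Sleep(50 * time.Millisecond) // let the leaked writes print
}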

I've added some logging around the error channel to try to show what's going on:

# Initial config load fails and the failure is communicated
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.534Z","message":"CREATE CHAN 0xc0000a66c0"
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.534Z","message":"CHANGE DETECTED"
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.535Z","message":"SELECT WAIT (0xc0000a66c0)"
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.535Z","message":"RUN SERVER (0xc0000a66c0)"
{"log.level":"error","@timestamp":"2023-06-14T21:43:20.545Z","message":"START FAIL. CHAN (0xc0000a66c0)"
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.545Z","message":"CHANNEL WRITE DONE?(0xc0000a66c0)"
{"log.level":"error","@timestamp":"2023-06-14T21:43:20.545Z","message":"Fleet Server failed"
{"file.name":"coordinator/coordinator.go","file.line":991},"message":"Unit state changed fleet-server-default (STARTING->FAILED)"
...

# new attempt in the loop, a new error channel is created
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.545Z","message":"CREATE CHAN 0xc0002b81e0"
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"CHANGE DETECTED","component":
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"SELECT WAIT (0xc0002b81e0)"

# New config is received, the error channel is not read
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"Server configuration update"

# New channel created
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"CREATE CHAN 0xc0002b8240","component":

# previous channel is being used
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"RUN SERVER (0xc0002b81e0)"

# new channel is in the select block
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"SELECT WAIT (0xc0002b8240)"

# error sent to old channel
{"log.level":"error","@timestamp":"2023-06-14T21:43:22.551Z","message":"START FAIL. CHAN (0xc0002b81e0)",
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.551Z","message":"CHANNEL WRITE DONE (0xc0002b81e0)"

@juliaElastic (Contributor)

> I've added some logging around the error channel to try to show what's going on

I think we should commit these as debug logs.

@amolnater-qasource added the QA:Ready For Testing label (Code is merged and ready for QA to validate) on Jun 21, 2023
@J4ckFr05t

Upon reviewing these discussions, I recently executed a "systemctl restart elastic-agent" command on the Fleet Server host, and I find I need to do this after every machine reboot. Consequently, I have created a cron job to handle this automatically until a permanent fix is available.

@michel-laterman (Contributor)

The fix should be in 8.9.0

@harshitgupta-qasource

Hi Team,

We have re-validated this issue on the latest 8.9.0 BC3 Kibana cloud environment and found it fixed now.

Observations:

  • Fleet-server installed on SLES15 gets back online after machine reboot

Build details:
VERSION: 8.9.0
BUILD: 64584
COMMIT: fc463b96275c55dc44524f79f617b0026b7f8667

Screenshot:
(screenshot)

Hence, we are marking this issue as QA: Validated.

Thanks

@harshitgupta-qasource added the QA:Validated label (Validated by the QA Team) and removed the QA:Ready For Testing label on Jul 7, 2023
@SINOX22 commented Jul 13, 2023

Hi,

I reported this issue months ago (on RHEL 8) and I solved it by adding After=syslog.target network.target in /etc/systemd/system/elastic-agent.service

Hope it helps!
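
For anyone wanting to apply that without editing the packaged unit file, the same dependency can be expressed as a systemd drop-in; an illustrative sketch of the workaround described above, not an official fix (systemd's recommended target for waiting on a usable network is network-online.target, which may also be worth trying):

# /etc/systemd/system/elastic-agent.service.d/override.conf (illustrative)
[Unit]
After=syslog.target network.target

Run systemctl daemon-reload afterwards for the drop-in to take effect.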

@cmachango commented Jul 21, 2023

Hi @SINOX22, I tried your solution on my RHEL 8 and it is still showing Offline status.
