
[SLES15]: Fleet-server Agent gets into offline state on machine reboot. #2431

Closed
dikshachauhan-qasource opened this issue Mar 15, 2023 · 29 comments · Fixed by #2697
Labels: bug (Something isn't working), impact:critical (Immediate priority; high value or cost to the product), QA:Validated (Validated by the QA Team), Team:Fleet (Label for the Fleet team)

Comments

@dikshachauhan-qasource

Kibana version: 8.7.0 Prod Kibana cloud environment

Host OS: SLES

Build details:
VERSION: 8.7.0 BC5
BUILD: 61017
COMMIT: e70452f9697a18a8458f3d4618e001b7e3681bed
Artifact: https://staging.elastic.co/8.7.0-69b36bf6/summary-8.7.0.html

Preconditions:

  • 8.7.0 Prod Kibana cloud environment should be available.
  • Agent Policy should be available with System and Fleet-server integrations.

Steps to reproduce:

  • Install the agent using the above-mentioned agent policy.
  • Wait for the agent to generate logs and data streams properly in the Fleet UI.
  • Reboot the machine after 5 minutes.
  • Observe that the Fleet-server Agent goes into an offline state and never returns to Healthy status.

Screenshots:
(screenshot)

  • Though the agent is shown as running on the host VM
    (screenshot)

Testcase Reference:
https://elastic.testrail.io/index.php?/tests/view/2056682

Logs:
elastic-agent-diagnostics-2023-03-15T08-11-08Z-00.zip

@dikshachauhan-qasource added the bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), and impact:medium labels on Mar 15, 2023
@dikshachauhan-qasource (author)

@manishgupta-qasource Please review

@dikshachauhan-qasource added the Team:Fleet label and removed the Team:Elastic-Agent-Control-Plane label on Mar 15, 2023
@manishgupta-qasource manishgupta-qasource transferred this issue from elastic/elastic-agent Mar 15, 2023
@manishgupta-qasource (Collaborator)

Secondary review for this ticket is Done

@cmacknz (Member) commented Mar 21, 2023

The fleet state is starting and Fleet server reports itself as reconfiguring:

fleet_state: 0
fleet_message: ""
#...
  state:
    state: 2
    message: 'Healthy: communicating with pid ''1022'''
    units:
      ? unittype: 0
        unitid: fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024
      : state: 1
        message: Re-configuring
      ? unittype: 1
        unitid: fleet-server-default
      : state: 1
        message: Re-configuring
    version_info:
      name: fleet-server
      version: 8.7.0
      meta:
        build_time: 2023-03-07 15:37:52 +0000 UTC
        commit: aca39518

@cmacknz (Member) commented Mar 21, 2023

These are the Fleet Server logs; it seems like it can't connect to Elasticsearch: dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:42353->[::1]:53: read: connection refused

{"log.level":"info","@timestamp":"2023-03-14T13:20:43.941Z","message":"Setting cache config options","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","cfg":{"APIKeyJitter":300000000000,"APIKeyTTL":900000000000,"ActionTTL":300000000000,"ArtifactTTL":86400000000000,"EnrollKeyTTL":60000000000,"MaxCost":52428800,"NumCounters":500000},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.947Z","message":"Starting","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"STARTING","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.948Z","message":"initial server configuration","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","new":{"Fleet":{"Agent":{"ID":"a2f4b285-57ed-42f0-9cde-fedd137b6474","Logging":{"Level":"info"},"Version":"8.7.0"},"Host":{"ID":"","Name":""}},"HTTP":{"Enabled":true,"Host":"unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock","Port":5066,"SecurityDescriptor":"","User":""},"Inputs":[{"Cache":{"APIKeyJitter":0,"APIKeyTTL":0,"ActionTTL":0,"ArtifactTTL":0,"EnrollKeyTTL":0,"MaxCost":0,"NumCounters":0},"Monitor":{"FetchSize":0,"PollTimeout":0},"Policy":{"ID":""},"Server":{"Bulk":{"FlushInterval":250000000,"FlushMaxPending":8,"FlushThresholdCount":2048,"FlushThresholdSize":1048576},"CompressionLevel":1,"CompressionThresh":1024,"GC":{"CleanupAfterExpiredInterval":"30d","ScheduleInterval":3600000000000},"Host":"0.0.0.0","Instrumentation":{"APIKey":"","Enabled":false,"Environment":"","GlobalLabels":"","Hosts":null,"SecretToken":"","TLS":{"ServerCA":"","ServerCertificate":"","SkipVerify":false},"TransactionSampleRate":""},"InternalPort":8221,"Limits":{"AckLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":2097152},"ArtifactLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"CheckinLimit":{"Burst":1000,"Interval":1000000,"Max":0,"MaxBody":1048576},"EnrollLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":524288},"MaxAgents":0,"MaxConnections":0,"MaxHeaderByteSize":8192,"PolicyThrottle":5000000,"StatusLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"UploadChunkLimit":{"Burst":10,"Interval":3000000,"Max":5,"MaxBody":4194304},"UploadEndLimit":{"Burst":5,"Interval":2000000000,"Max":2,"MaxBody":1024},"UploadStartLimit":{"Burst":8,"Interval":3000000000,"Max":3,"MaxBody":5242880}},"Port":8220,"Profiler":{"Bind":"localhost:6060","Enabled":false},"Runtime":{"GCPercent":0},"TLS":{"CASha256":null,"CAs":null,"Certificate":{"Certificate":"-----BEGIN CERTIFICATE-----\nMIIDSjCCAjKgAwIBAgICBnowDQYJKoZIhvcNAQELBQAwLDEWMBQGA1UEChMNZWxh\nc3RpYy1mbGVldDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTIzMDMxNDEzMTQzOFoX\nDTMzMDMxNDEzMTQzOFowMzEWMBQGA1UEChMNZWxhc3RpYy1mbGVldDEZMBcGA1UE\nAxMQaXAtMTcyLTMxLTU3LTEwOTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoC\nggEBAL2xkV5OKCL9ZqFgoH7/y/xvlM6EABykXOpZjP3/+ugx/nAMA0dFaeUHU7MN\nqvMw91kgrVGkJ8nb6uRXa6PbBVY/U5QVPVC1QHxNXpo9Ye9vDaf7C336UEr12+IF\nCC12i3SyAOD5gWmckgeRDJsVIkJ+W1r5JWUkWjIB6QjQzZlX3Jz6DH92VmMKEEfe\nKaWS28O73pNjVVxh0TePL+1+ADogIiGpEIOxv0wq9q6bJcyDhBFKHhYBgCFH30W+\ns31RIt2npK9BvD5/PbpcKSKqhz1lsJOTO0YMY1KGcZWiyKQH/xkEpBaN6SpfTn7a\n/QH7J6sXed7uzWLfOlHpXKqcQ1sCAwEAAaNvMG0wDgYDVR0PAQH/BAQDAgeAMB0G\nA1UdJQQWMBQGCCsGAQUFBwMCBggrBgEFBQcDATAfBgNVHSMEGDAWgBQbcyE1Fo5b\nrmOtwGsfEN8DZkgLNDAbBgNVHREEFDASghBpcC0xNzItMzEtNTctMTA5MA0GCSqG\nSIb3DQEBCwUAA4IBAQANYSIoOoxi1epv/6u4bvNHSNEIUSBkL/+uTAAQwcrvc2/R\nA/xXphu9/DGBx17SOOz+L07NBkCsZdZFPYQ2czb58PW9uiqMIi62D3RmuMAl8clw\ng4U96cnNRB1e2ZyQHHnpOgiMeu1Mp/aUSnJJd+C5b044sMLreeukJVVGsS/Asb+w\nuzBYVcA0EHQueTwnCD9ZcqTLtMJ6RGn+eJs2qSOrDML+xcnSpdWTlF8g+wK2pBds\nVa004Y+fq0lG3eENpU3JAH5O/krIvaf2TYZvs6z0wQK5jqvyfOiSQwTe0/djlyql\nzkJV6b5n/SjRb6F9YhW+jOSrCgjrwMHYLDA4Nc+8\n-----END 
CERTIFICATE-----\n","Key":"[redacted]","Passphrase":""},"CipherSuites":null,"ClientAuth":0,"CurveTypes":null,"Enabled":null,"VerificationMode":"full","Versions":null},"Timeouts":{"CheckinJitter":30000000000,"CheckinLongPoll":300000000000,"CheckinTimestamp":30000000000,"Idle":30000000000,"Read":60000000000,"ReadHeader":5000000000,"Write":600000000000}},"Type":""}],"Logging":{"Files":null,"Level":"info","Pretty":false,"ToFiles":true,"ToStderr":true},"Output":{"Elasticsearch":{"APIKey":"","Headers":null,"Hosts":["ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443"],"MaxConnPerHost":128,"MaxContentLength":104857600,"MaxRetries":3,"Path":"","Protocol":"https","ProxyDisable":false,"ProxyHeaders":{},"ProxyURL":"","ServiceToken":"[redacted]","TLS":null,"Timeout":90000000000},"Extra":null}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.948Z","message":"starting server on configuration change","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.950Z","message":"Starting stats endpoint","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","@timestamp":"2023-03-14T13:20:43.95Z","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.003Z","message":"Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/fleet-server-default.sock (configured: unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.209Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","cluster.addr":["ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443"],"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.209Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"log.logger":"fleet-metrics.api","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.209Z","message":"Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"state":"FAILED","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","message":"Fleet Server failed","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","error.message":"dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","@timestamp":"2023-03-14T13:20:44.21Z","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":857},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024 (STARTING->FAILED): Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":857},"message":"Unit state changed fleet-server-default (STARTING->FAILED): Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}

I believe this is a DNS lookup failure; it can also be seen in the error messages from the Beats started by the agent:

{"log.level":"warn","@timestamp":"2023-03-14T13:20:47.401Z","message":"DNS lookup failure \"ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io\": lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:57757->[::1]:53: read: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.logger":"transport","log.origin":{"file.line":52,"file.name":"transport/tcp.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:47.401Z","message":"Error dialing lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:57757->[::1]:53: read: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"address":"ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443","service.name":"metricbeat","network":"tcp","ecs.version":"1.6.0","log.logger":"esclientleg","log.origin":{"file.line":38,"file.name":"transport/logging.go"},"ecs.version":"1.6.0"}

@cmacknz (Member) commented Mar 21, 2023

Interestingly, the same failure happens later, but Fleet Server reports itself as configuring instead of failed:

{"log.level":"info","@timestamp":"2023-03-14T13:20:43.941Z","message":"Setting cache config options","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","cfg":{"APIKeyJitter":300000000000,"APIKeyTTL":900000000000,"ActionTTL":300000000000,"ArtifactTTL":86400000000000,"EnrollKeyTTL":60000000000,"MaxCost":52428800,"NumCounters":500000},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.947Z","message":"Starting","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"STARTING","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.948Z","message":"initial server configuration","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","new":{"Fleet":{"Agent":{"ID":"a2f4b285-57ed-42f0-9cde-fedd137b6474","Logging":{"Level":"info"},"Version":"8.7.0"},"Host":{"ID":"","Name":""}},"HTTP":{"Enabled":true,"Host":"unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock","Port":5066,"SecurityDescriptor":"","User":""},"Inputs":[{"Cache":{"APIKeyJitter":0,"APIKeyTTL":0,"ActionTTL":0,"ArtifactTTL":0,"EnrollKeyTTL":0,"MaxCost":0,"NumCounters":0},"Monitor":{"FetchSize":0,"PollTimeout":0},"Policy":{"ID":""},"Server":{"Bulk":{"FlushInterval":250000000,"FlushMaxPending":8,"FlushThresholdCount":2048,"FlushThresholdSize":1048576},"CompressionLevel":1,"CompressionThresh":1024,"GC":{"CleanupAfterExpiredInterval":"30d","ScheduleInterval":3600000000000},"Host":"0.0.0.0","Instrumentation":{"APIKey":"","Enabled":false,"Environment":"","GlobalLabels":"","Hosts":null,"SecretToken":"","TLS":{"ServerCA":"","ServerCertificate":"","SkipVerify":false},"TransactionSampleRate":""},"InternalPort":8221,"Limits":{"AckLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":2097152},"ArtifactLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"CheckinLimit":{"Burst":1000,"Interval":1000000,"Max":0,"MaxBody":1048576},"EnrollLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":524288},"MaxAgents":0,"MaxConnections":0,"MaxHeaderByteSize":8192,"PolicyThrottle":5000000,"StatusLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"UploadChunkLimit":{"Burst":10,"Interval":3000000,"Max":5,"MaxBody":4194304},"UploadEndLimit":{"Burst":5,"Interval":2000000000,"Max":2,"MaxBody":1024},"UploadStartLimit":{"Burst":8,"Interval":3000000000,"Max":3,"MaxBody":5242880}},"Port":8220,"Profiler":{"Bind":"localhost:6060","Enabled":false},"Runtime":{"GCPercent":0},"TLS":{"CASha256":null,"CAs":null,"Certificate":{"Certificate":"-----BEGIN CERTIFICATE-----\nMIIDSjCCAjKgAwIBAgICBnowDQYJKoZIhvcNAQELBQAwLDEWMBQGA1UEChMNZWxh\nc3RpYy1mbGVldDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTIzMDMxNDEzMTQzOFoX\nDTMzMDMxNDEzMTQzOFowMzEWMBQGA1UEChMNZWxhc3RpYy1mbGVldDEZMBcGA1UE\nAxMQaXAtMTcyLTMxLTU3LTEwOTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoC\nggEBAL2xkV5OKCL9ZqFgoH7/y/xvlM6EABykXOpZjP3/+ugx/nAMA0dFaeUHU7MN\nqvMw91kgrVGkJ8nb6uRXa6PbBVY/U5QVPVC1QHxNXpo9Ye9vDaf7C336UEr12+IF\nCC12i3SyAOD5gWmckgeRDJsVIkJ+W1r5JWUkWjIB6QjQzZlX3Jz6DH92VmMKEEfe\nKaWS28O73pNjVVxh0TePL+1+ADogIiGpEIOxv0wq9q6bJcyDhBFKHhYBgCFH30W+\ns31RIt2npK9BvD5/PbpcKSKqhz1lsJOTO0YMY1KGcZWiyKQH/xkEpBaN6SpfTn7a\n/QH7J6sXed7uzWLfOlHpXKqcQ1sCAwEAAaNvMG0wDgYDVR0PAQH/BAQDAgeAMB0G\nA1UdJQQWMBQGCCsGAQUFBwMCBggrBgEFBQcDATAfBgNVHSMEGDAWgBQbcyE1Fo5b\nrmOtwGsfEN8DZkgLNDAbBgNVHREEFDASghBpcC0xNzItMzEtNTctMTA5MA0GCSqG\nSIb3DQEBCwUAA4IBAQANYSIoOoxi1epv/6u4bvNHSNEIUSBkL/+uTAAQwcrvc2/R\nA/xXphu9/DGBx17SOOz+L07NBkCsZdZFPYQ2czb58PW9uiqMIi62D3RmuMAl8clw\ng4U96cnNRB1e2ZyQHHnpOgiMeu1Mp/aUSnJJd+C5b044sMLreeukJVVGsS/Asb+w\nuzBYVcA0EHQueTwnCD9ZcqTLtMJ6RGn+eJs2qSOrDML+xcnSpdWTlF8g+wK2pBds\nVa004Y+fq0lG3eENpU3JAH5O/krIvaf2TYZvs6z0wQK5jqvyfOiSQwTe0/djlyql\nzkJV6b5n/SjRb6F9YhW+jOSrCgjrwMHYLDA4Nc+8\n-----END 
CERTIFICATE-----\n","Key":"[redacted]","Passphrase":""},"CipherSuites":null,"ClientAuth":0,"CurveTypes":null,"Enabled":null,"VerificationMode":"full","Versions":null},"Timeouts":{"CheckinJitter":30000000000,"CheckinLongPoll":300000000000,"CheckinTimestamp":30000000000,"Idle":30000000000,"Read":60000000000,"ReadHeader":5000000000,"Write":600000000000}},"Type":""}],"Logging":{"Files":null,"Level":"info","Pretty":false,"ToFiles":true,"ToStderr":true},"Output":{"Elasticsearch":{"APIKey":"","Headers":null,"Hosts":["ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443"],"MaxConnPerHost":128,"MaxContentLength":104857600,"MaxRetries":3,"Path":"","Protocol":"https","ProxyDisable":false,"ProxyHeaders":{},"ProxyURL":"","ServiceToken":"[redacted]","TLS":null,"Timeout":90000000000},"Extra":null}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.948Z","message":"starting server on configuration change","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:43.950Z","message":"Starting stats endpoint","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","@timestamp":"2023-03-14T13:20:43.95Z","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.003Z","message":"Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/fleet-server-default.sock (configured: unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.209Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","cluster.addr":["ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io:443"],"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.209Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"log.logger":"fleet-metrics.api","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-14T13:20:44.209Z","message":"Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"state":"FAILED","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","message":"Fleet Server failed","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","error.message":"dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","@timestamp":"2023-03-14T13:20:44.21Z","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":857},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024 (STARTING->FAILED): Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-5cc1bc7e-20d8-4174-9af3-4023ec1cd024","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-14T13:20:44.210Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":857},"message":"Unit state changed fleet-server-default (STARTING->FAILED): Error - dial tcp: lookup ca3df1e600684321ac547810f0fa3235.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:53506->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}

I suspect Fleet server isn't reliably reporting the failed state here. @michel-laterman is this something we can improve?

@cmacknz (Member) commented Mar 21, 2023

@dikshachauhan-qasource per my comment in #2431 (comment), this doesn't seem like a bug in the agent; DNS lookups for Elasticsearch from Fleet Server are failing.

Is this reproducible? If it is, was there any special network configuration that had to be set up?

@dikshachauhan-qasource (author) commented Mar 22, 2023

Hi @cmacknz

We have reproduced this issue on 8.7.0 BC6 and found it still reproducible.

> If it is, was there any special network configuration that had to be set up?

Here, I have used AWS "ami-0c544bda9765444c2" with default network settings.

Build details:

VERSION: 8.7.0 BC6
BUILD: 61051
COMMIT: 04ef24287f26854ad99a46ae983854c6184717cb
Artifact: https://staging.elastic.co/8.7.0-a7fb3750/summary-8.7.0.html

Screenshot:

  • Before reboot
    (screenshots)

  • After reboot
    (screenshots)

Please let us know if we are missing anything you require.

Thanks
QAS

@cmacknz (Member) commented Mar 23, 2023

Thank you, so the machine can still resolve the hostname but the agent can't. That is very strange.

@harshitgupta-qasource

Hi @cmacknz

We have revalidated this issue on the latest 8.8.0 BC6 Kibana cloud build and found it still reproducible.

Observations:

  • Fleet-server Agent gets into the offline state permanently on machine reboot.

Build details:
VERSION: 8.8.0 BC6
BUILD: 63115
COMMIT: a4c256b39f7d1ee34abe61109a817ec7f5329009
Artifacts: https://staging.elastic.co/8.8.0-375abdf7/summary-8.8.0.html

Screenshot:
(screenshot)

Agent Diagnostic logs:
elastic-agent-diagnostics-2023-05-23T07-20-06Z-00.zip

Please let us know if anything else is required from our end.

Thanks!

@joshdover (Contributor)

If we hadn't already released with this bug, I would consider it a blocker. Moving this to the Fleet Server area and adding it to the current sprint as a P1. FYI @jlind23

@joshdover added the impact:critical label (Immediate priority; high value or cost to the product) and removed the impact:medium label on May 23, 2023
@juliaElastic (Contributor) commented May 23, 2023

I'm seeing the same ES connection and DNS lookup failure errors in the latest diagnostics.
@harshitgupta-qasource Is this an on-prem installation on the SLES host, or are you enrolling a Fleet Server into an Elastic Cloud instance? I am wondering where this GCP hostname is coming from.

{"log.level":"warn","@timestamp":"2023-05-23T07:05:56.499Z","message":"DNS lookup failure \"34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io\": lookup 34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:38795->[::1]:53: read: connection refused","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"transport","log.origin":{"file.line":52,"file.name":"transport/tcp.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-05-23T07:05:56.499Z","message":"Error dialing lookup 34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:38795->[::1]:53: read: connection refused","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"esclientleg","log.origin":{"file.line":38,"file.name":"transport/logging.go"},"service.name":"filebeat","network":"tcp","address":"34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io:443","ecs.version":"1.6.0","ecs.version":"1.6.0"}

EDIT: I found that the DNS lookup error is temporary, and the connection to ES is established after that.

{"log.level":"info","@timestamp":"2023-05-23T07:06:16.639Z","message":"Connection to backoff(elasticsearch(https://34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io:443)) established",

I could reproduce the issue with a SLES VM enrolled into a staging cloud instance. After the VM reboot, I still see agent logs and metrics showing up in Kibana, but the status is Offline, so it looks like the last_checkin fields are not being updated.

I'm seeing these errors as well, which indicate that fleet-server never started listening on port 8221.

[elastic_agent][error] ack retrier: commit failed with error: acknowledge 1 actions '[action_id: policy:fleet-server-policy:1:1, type: POLICY_CHANGE]' for elastic-agent 'a4376429-b50d-4323-8a23-31650ccafd25' failed: fail to ack to fleet: all hosts failed: 1 error occurred:
	* requester 0/1 to host https://localhost:8221/ errored: Post "https://localhost:8221/api/fleet/agents/a4376429-b50d-4323-8a23-31650ccafd25/acks?": dial tcp [::1]:8221: connect: connection refused
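
As a quick sanity check of that, one can verify from the host whether anything is listening on the internal port; a hypothetical stdlib snippet (not part of the agent):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Fleet Server's internal API port from the configuration above.
	conn, err := net.DialTimeout("tcp", "localhost:8221", 2*time.Second)
	if err != nil {
		// "connection refused" here matches the failing ack requests above.
		fmt.Println("nothing listening on 8221:", err)
		return
	}
	defer conn.Close()
	fmt.Println("port 8221 is open")
}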

I tried doing an agent restart, and the agent then went back to HEALTHY status.

sudo elastic-agent restart

I think what happens is that after the reboot, the fleet-server process tries to start, hits the temporary DNS lookup issue, and remains in a stopped state. Is there any retry logic that should kick in?

{"log.level":"info","@timestamp":"2023-05-23T11:32:27.031Z","message":"Boot fleet-server","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","pid":1036,"ecs.version":"1.6.0","ppid":823,"exe":"/opt/Elastic/Agent/data/elastic-agent-80e3a2/components/fleet-server","args":["--agent-mode","-E","logging.level=debug","-E","logging.to_stderr=true","-E","http.enabled=true","-E","http.host=unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock","-E","path.data=/opt/Elastic/Agent/data/elastic-agent-80e3a2/run/fleet-server-default"],"version":"8.8.0","commit":"86832af2","ecs.version":"1.6.0"}

{"log.level":"info","@timestamp":"2023-05-23T11:32:28.028Z","message":"Setting cache config options","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","cfg":{"APIKeyJitter":300000000000,"APIKeyTTL":900000000000,"ActionTTL":300000000000,"ArtifactTTL":86400000000000,"EnrollKeyTTL":60000000000,"MaxCost":52428800,"NumCounters":500000},"ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.043Z","message":"Starting","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"STARTING","@timestamp":"2023-05-23T11:32:28.03Z","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.031Z","message":"initial server configuration","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","new":{"Fleet":{"Agent":{"ID":"a4376429-b50d-4323-8a23-31650ccafd25","Logging":{"Level":"info"},"Version":"8.8.0"},"Host":{"ID":"","Name":""}},"HTTP":{"Enabled":true,"Host":"unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock","Port":5066,"SecurityDescriptor":"","User":""},"Inputs":[{"Cache":{"APIKeyJitter":0,"APIKeyTTL":0,"ActionTTL":0,"ArtifactTTL":0,"EnrollKeyTTL":0,"MaxCost":0,"NumCounters":0},"Monitor":{"FetchSize":0,"PollTimeout":0},"Policy":{"ID":""},"Server":{"Bulk":{"FlushInterval":250000000,"FlushMaxPending":8,"FlushThresholdCount":2048,"FlushThresholdSize":1048576},"CompressionLevel":1,"CompressionThresh":1024,"GC":{"CleanupAfterExpiredInterval":"30d","ScheduleInterval":3600000000000},"Host":"0.0.0.0","Instrumentation":{"APIKey":"","APIKeyPath":"","Enabled":false,"Environment":"","GlobalLabels":"","Hosts":null,"SecretToken":"","SecretTokenPath":"","TLS":{"ServerCA":"","ServerCertificate":"","SkipVerify":false},"TransactionSampleRate":""},"InternalPort":8221,"Limits":{"AckLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":2097152},"ArtifactLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"CheckinLimit":{"Burst":1000,"Interval":1000000,"Max":0,"MaxBody":1048576},"EnrollLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":524288},"MaxAgents":0,"MaxConnections":0,"MaxHeaderByteSize":8192,"PolicyThrottle":5000000,"StatusLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"UploadChunkLimit":{"Burst":10,"Interval":3000000,"Max":5,"MaxBody":4194304},"UploadEndLimit":{"Burst":5,"Interval":2000000000,"Max":2,"MaxBody":1024},"UploadStartLimit":{"Burst":8,"Interval":3000000000,"Max":3,"MaxBody":5242880}},"Port":8220,"Profiler":{"Bind":"localhost:6060","Enabled":false},"Runtime":{"GCPercent":0,"MemoryLimit":0},"TLS":{"CASha256":null,"CAs":null,"Certificate":{"Certificate":"-----BEGIN CERTIFICATE-----\nMIIDSjCCAjKgAwIBAgICBnowDQYJKoZIhvcNAQELBQAwLDEWMBQGA1UEChMNZWxh\nc3RpYy1mbGVldDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTIzMDUyMzExMzAzOVoX\nDTMzMDUyMzExMzAzOVowMzEWMBQGA1UEChMNZWxhc3RpYy1mbGVldDEZMBcGA1UE\nAxMQaXAtMTcyLTMxLTkwLTE0MjCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoC\nggEBAMCZv8208a6woigkmRN52E4MoXufXzabV9ahUl1lwqhrXCB69OY57kdLjsVq\nCdvXEu89MBA7ULn3K/s+Lw+xEVVAljrtu3hJKfzjjvKi6L5VnR3G5uHBAly0Dvr7\nxFBPW6yCfVDp4rrwcs46eAPVBcdBb1mBl4iFCfbMITsmhgwx0jSQpD7qhr4yoJdn\nWQMMvJvAgYvhgiJ1qZAudwE/HUoSdLtr8Vk8G9ht4Uc7SrFDwFBU2zDSGnarkbYz\n5Bymwtkxl6rij01IlHricaZIA0ZL5FfPpobvnjp9XmFX3jHPAdKNWwItiXtOdCNt\nJ/MuDFzIHA+g293UiRzQShGNpUMCAwEAAaNvMG0wDgYDVR0PAQH/BAQDAgeAMB0G\nA1UdJQQWMBQGCCsGAQUFBwMCBggrBgEFBQcDATAfBgNVHSMEGDAWgBTVbtqR64jJ\n+FVw4+fuuSqEy4JEVjAbBgNVHREEFDASghBpcC0xNzItMzEtOTAtMTQyMA0GCSqG\nSIb3DQEBCwUAA4IBAQCRhD8zSG3bJydkKmRjmpA7l59u67Dtf9o3B3SBtuqgVqqx\ncLwGjJrCa6Rn7/zAHAIfQtw0lwpq9TQx/+GSpl15e/6wX6Q9HdnblV0dklB8NduO\n89X+bNyLuKJ3MgFKgxDmuZTbpy2nmZnYx77NzB15AnOJ6Fx3TwdsrMGRcCZxmSW7\n+NKHS2eKPckmDFJEImaPEfwA0oaoFBt2MO447g+pibP+ORuQ2qtUW387IYqggwBR\nFdPontJi1Jk+1NSLqp1wWTwg366qMweK60eM7GGujwnfmTii7DXQEzw9wGlcVQiH\nWS2B8QU0Z2tEE5HCfwGXww4i5p9NTnaeqWjsdZpf\n-----END 
CERTIFICATE-----\n","Key":"[redacted]","Passphrase":"","PassphrasePath":""},"CipherSuites":null,"ClientAuth":0,"CurveTypes":null,"Enabled":null,"VerificationMode":"full","Versions":null},"Timeouts":{"CheckinJitter":30000000000,"CheckinLongPoll":300000000000,"CheckinMaxPoll":3600000000000,"CheckinTimestamp":30000000000,"Idle":30000000000,"Read":60000000000,"ReadHeader":5000000000,"Write":600000000000}},"Type":""}],"Logging":{"Files":null,"Level":"info","Pretty":false,"ToFiles":true,"ToStderr":true},"Output":{"Elasticsearch":{"Headers":null,"Hosts":["787b58036cb94920952d2b39315ad26b.eastus2.staging.azure.foundit.no:443"],"MaxConnPerHost":128,"MaxContentLength":104857600,"MaxRetries":3,"Path":"","Protocol":"https","ProxyDisable":false,"ProxyHeaders":{},"ProxyURL":"","ServiceToken":"[redacted]","ServiceTokenPath":"","TLS":null,"Timeout":90000000000},"Extra":null}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.031Z","message":"starting server on configuration change","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.031Z","message":"Starting stats endpoint","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.032Z","message":"Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/fleet-server-default.sock (configured: unix:///opt/Elastic/Agent/data/tmp/fleet-server-default.sock)","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.037Z","message":"Server configuration update","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.037Z","message":"Re-configuring","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"CONFIGURING","ecs.version":"1.6.0"}

{"log.level":"error","@timestamp":"2023-05-23T11:32:28.054Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","cluster.addr":["787b58036cb94920952d2b39315ad26b.eastus2.staging.azure.foundit.no:443"],"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup 787b58036cb94920952d2b39315ad26b.eastus2.staging.azure.foundit.no on [::1]:53: read udp [::1]:49216->[::1]:53: read: connection refused","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-23T11:32:28.054Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"log.logger":"fleet-metrics.api","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}

{"log.level":"info","@timestamp":"2023-05-23T11:32:52.887Z","message":"Connection to backoff(elasticsearch(https://787b58036cb94920952d2b39315ad26b.eastus2.staging.azure.foundit.no:443)) established","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","log.origin":{"file.line":145,"file.name":"pipeline/client_worker.go"},"service.name":"metricbeat","ecs.version":"1.6.0"}

Any ideas @michel-laterman @cmacknz ?

@harshitgupta-qasource

Hi @juliaElastic

Thank you for looking into this issue.

We have installed the Fleet Server on a SLES 15 host against the 8.8.0 BC6 Kibana cloud build.

Impacted Testcase:
#C149942

Please let us know if anything else is required from our end.

Thanks!

@juliaElastic (Contributor)

@harshitgupta-qasource Do we have the same testcase on a different OS where the agent comes back healthy after reboot?

@cmacknz (Member) commented May 23, 2023

@juliaElastic Fleet server might not be handling the error properly if it can't reach ES at init time. The fail elasticsearch info log message comes from:

// Validate connection
resp, err := info(ctx, es)
if err != nil {
	zlog.Error().Err(err).Msg("fail elasticsearch info")
	return nil, err
}

This is called by runServer via initBulker:

// Create the bulker subsystem
bulker, err := f.initBulker(bulkCtx, tracer, cfg)
if err != nil {
	return err
}

That error eventually gets handled in the Run function for Fleet Server:

case err := <-ech:
	f.reporter.UpdateState(client.UnitStateFailed, fmt.Sprintf("Error - %s", err), nil) //nolint:errcheck // unclear on what should we do if updating the status fails?
	log.Error().Err(err).Msg("Fleet Server failed")
	return err

I can see the Fleet Server failed log message triggered in the logs:

{"log.level":"error","@timestamp":"2023-05-23T07:05:50.409Z","message":"Fleet Server failed","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","error.message":"dial tcp: lookup 34e1583f2a2a41b09e9d33d9d390def6.europe-west1.gcp.cloud.es.io on [::1]:53: read udp [::1]:38455->[::1]:53: read: connection refused","ecs.version":"1.6.0"}

The function should just be called again 2s later, and I can see Fleet Server transition back to configuring 2s later in the logs, but nothing else happens afterwards. It's possible some of the state wasn't cleaned up properly to actually let it recover.

{"log.level":"info","@timestamp":"2023-05-23T07:05:52.415Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":972},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-93fbe5b0-f937-11ed-af24-a79b52bd536f (FAILED->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-93fbe5b0-f937-11ed-af24-a79b52bd536f","type":"input","state":"CONFIGURING","old_state":"FAILED"},"ecs.version":"1.6.0"}

go func() {
	defer close(srvDone)
	for {
		err := srv.Run(srvCtx, cfg)
		if err == nil || errors.Is(err, context.Canceled) {
			return
		}
		// sleep some before calling Run again
		_ = sleep.WithContext(srvCtx, kAgentModeRestartLoopDelay)
	}
}()

@cmacknz (Member) commented May 23, 2023

I wonder if Fleet server is deadlocked. You can look at the goroutine profile with go tool pprof -http :8081 ./components/fleet-server-default/fleet-server-fleet_server-93fbe5b0-f937-11ed-af24-a79b52bd536f/goroutine.pprof.gz

@harshitgupta-qasource commented May 24, 2023

Hi @juliaElastic

We have test cases for setting up the Fleet Server on various OSs against the Kibana cloud build, and we reboot the Fleet Server as part of exploratory testing.

We have re-validated this issue on the 8.8 BC8 kibana cloud build with Windows and Ubuntu fleet server agents.

We have observed that after restarting these fleet servers, the fleet server agent came back Healthy.

So the issue is only reproducible for the SLES15 Fleet server.

Testcases for setting up the Fleet Server:

Please let us know if anything else is required from our end.

Thanks!

@joshdover (Contributor)

I'm skeptical that this is truly a SLES-specific issue, though maybe with the default configuration it's more likely to happen with SLES OOTB. It seems likely that this can happen on other systems if they hit a DNS lookup failure on startup.

We definitely need to be resilient to such errors, so I'm inclined to leave this as a high priority issue that should be fixed relatively soon.

@jlind23 (Contributor) commented May 31, 2023

@michel-laterman will you have time to take a look at this as soon as you can?

@jlind23 (Contributor) commented Jun 8, 2023

@michel-laterman assigning this to you in this sprint.

@michel-laterman (Contributor)

I'm digging into this. I can recreate it on an AWS SLES15 instance with the latest 8.9.0-SNAPSHOT and see all the same symptoms as reported.
It does look like an issue where, if Elasticsearch (or DNS) is not available when fleet-server is starting, there is some state drift.

Here it fails on the initial startup and restarts, but fleet-server does not report the failed state on that attempt:

{"log.level":"error","@timestamp":"2023-06-12T22:16:40.128Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"cluster.addr":["200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co:443"],"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:40.128Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:40.128Z","message":"Error - dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"FAILED","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-06-12T22:16:40.128Z","message":"Fleet Server failed","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","error.message":"dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-06-12T22:16:40.128Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":991},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66 (STARTING->FAILED): Error - dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-06-12T22:16:40.128Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":991},"message":"Unit state changed fleet-server-default (STARTING->FAILED): Error - dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:43982->[::1]:53: read: connection refused","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
...
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.131Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":985},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66 (STARTING->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66","type":"input","state":"CONFIGURING","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.131Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":985},"message":"Unit state changed fleet-server-default (STARTING->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"CONFIGURING","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-06-12T22:16:42.132Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:38166->[::1]:53: read: connection refused","ecs.version":"1.6.0","service.name":"fleet-server","cluster.addr":["200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co:443"],"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.132Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}

I think we need more robust handling around how Elasticsearch readiness is detected by fleet-server (which is the same issue as #2683 that @jsoriano is assigned to).
I'll try to look into why the component was not reporting a failed state after it re-configures.
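
For illustration, a rough stdlib-only sketch of the kind of readiness handling being suggested (hypothetical; not fleet-server's actual implementation or the eventual fix): keep polling the Elasticsearch endpoint with backoff instead of failing the whole startup on the first DNS or connection error.

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitForElasticsearch retries a GET against the cluster root endpoint until it
// succeeds or the context is cancelled, backing off between attempts.
func waitForElasticsearch(ctx context.Context, url string) error {
	backoff := time.Second
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			return nil // reachable; continue with the rest of startup
		}
		fmt.Println("elasticsearch not ready yet:", err)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	// Hypothetical URL; the real output host comes from the agent policy.
	if err := waitForElasticsearch(ctx, "https://example.cloud.es.io:443"); err != nil {
		fmt.Println("gave up waiting for elasticsearch:", err)
	}
}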

@michel-laterman (Contributor)

Actually, I think there is a deadlock occurring, but I'm not sure why.
Block/mutex pprofs need to be enabled with calls to runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction.

We can see on the instance where this occurs that fleet-server is running; however, the ports are not in use (internal/external API and metrics).
The error logs indicate that starting the bulker fails, which should cause server.Fleet to exit with an error and lead the server.Agent wrapper to restart it, but neither that exit message nor the one logged when the error channel is selected appears.
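
For reference, a minimal sketch of how those profiles get enabled in a Go binary (a hypothetical standalone example; fleet-server would need equivalent calls for block/mutex data to show up in its pprof output):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	runtime.SetBlockProfileRate(1)     // sample every blocking event
	runtime.SetMutexProfileFraction(1) // sample every mutex contention event

	// Profiles then become available at /debug/pprof/block and /debug/pprof/mutex.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}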

@cmacknz (Member) commented Jun 13, 2023

> Actually, I think there is a deadlock occurring, but I'm not sure why.
> Block/mutex pprofs need to be enabled with calls to runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction.

For a deadlock you should be able to spot it with just the goroutine profile if you have that already. That will give you the stack trace of every goroutine. The diagnostics collect it by default.

@michel-laterman (Contributor) commented Jun 14, 2023

I think I found the issue: our error channel is being created on each iteration of the run loop: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L136

However, the runServer call actually occurs in a different goroutine: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L117.

This can lead to a config change being read before the runServer method is run, leaking a channel; the error passed to the old channel is lost, and thus fleet-server is stuck: it should signal that it has failed, but never does.
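
A small, self-contained sketch of the pattern just described (heavily simplified and hypothetical; not the actual fleet-server source): the loop re-creates the error channel on every iteration, so an error written to a channel from an earlier iteration is never received.

package main

import (
	"errors"
	"fmt"
	"time"
)

// startServer stands in for runServer: it runs asynchronously and reports a
// startup failure (e.g. the DNS lookup error) on the channel it was given.
func startServer(ech chan error) {
	go func() {
		time.Sleep(10 * time.Millisecond)
		ech <- errors.New("dial tcp: lookup ...: connection refused")
		fmt.Printf("error written to %p, but nothing reads this channel anymore\n", ech)
	}()
}

func main() {
	cfgCh := make(chan string, 2)
	cfgCh <- "initial config"
	cfgCh <- "config update" // arrives quickly after reboot

	for i := 0; i < 2; i++ {
		// BUG: a fresh error channel is created on every iteration of the loop.
		ech := make(chan error, 1)
		startServer(ech)

		select {
		case cfg := <-cfgCh:
			// A config change can be read before the server's error arrives; the
			// next iteration then creates a new channel, and the error sent on
			// this one is lost, so a FAILED state is never reported upstream.
			fmt.Println("re-configuring with:", cfg)
		case err := <-ech:
			fmt.Println("server failed:", err)
		}
	}
	time.Sleep(50 * time.Millisecond) // let the leaked writes print
}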

I've added some logging around the error channel to try to show what's going on:

# Initial config load fails and the failure is communicated
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.534Z","message":"CREATE CHAN 0xc0000a66c0"
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.534Z","message":"CHANGE DETECTED"
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.535Z","message":"SELECT WAIT (0xc0000a66c0)"
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.535Z","message":"RUN SERVER (0xc0000a66c0)"
{"log.level":"error","@timestamp":"2023-06-14T21:43:20.545Z","message":"START FAIL. CHAN (0xc0000a66c0)"
{"log.level":"info","@timestamp":"2023-06-14T21:43:20.545Z","message":"CHANNEL WRITE DONE?(0xc0000a66c0)"
{"log.level":"error","@timestamp":"2023-06-14T21:43:20.545Z","message":"Fleet Server failed"
{"file.name":"coordinator/coordinator.go","file.line":991},"message":"Unit state changed fleet-server-default (STARTING->FAILED)"
...

# new attempt in the loop, a new error channel is created
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.545Z","message":"CREATE CHAN 0xc0002b81e0"
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"CHANGE DETECTED","component":
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"SELECT WAIT (0xc0002b81e0)"

# New config is received, the error channel is not read
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"Server configuration update"

# New channel created
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"CREATE CHAN 0xc0002b8240","component":

# previous channel is being used
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"RUN SERVER (0xc0002b81e0)"

# new channel is in the select block
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.546Z","message":"SELECT WAIT (0xc0002b8240)"

# error sent to old channel
{"log.level":"error","@timestamp":"2023-06-14T21:43:22.551Z","message":"START FAIL. CHAN (0xc0002b81e0)",
{"log.level":"info","@timestamp":"2023-06-14T21:43:22.551Z","message":"CHANNEL WRITE DONE (0xc0002b81e0)"

@juliaElastic (Contributor)

> I've added some logging around the error channel to try to show what's going on

I think we should commit these as debug logs.

@amolnater-qasource added the QA:Ready For Testing label (Code is merged and ready for QA to validate) on Jun 21, 2023
@J4ckFr05t

Upon reviewing these discussions, I recently executed a "systemctl restart elastic-agent" command on the Fleet Server host, and I find I need to do this after every machine reboot. Consequently, I have created a cron job to handle this automatically until a permanent fix is available.

@michel-laterman (Contributor)

The fix should be in 8.9.0

@harshitgupta-qasource

Hi Team,

We have re-validated this issue on the latest 8.9.0 BC3 Kibana cloud environment and found it fixed now.

Observations:

  • Fleet-server installed on SLES15 gets back online after machine reboot

Build details:
VERSION: 8.9.0
BUILD: 64584
COMMIT: fc463b96275c55dc44524f79f617b0026b7f8667

Screenshot:
(screenshot)

Hence, we are marking this issue as QA: Validated.

Thanks

@harshitgupta-qasource added the QA:Validated label (Validated by the QA Team) and removed the QA:Ready For Testing label on Jul 7, 2023
@SINOX22 commented Jul 13, 2023

Hi,

I reported this issue months ago (on RHEL 8) and I solved it by adding After=syslog.target network.target in /etc/systemd/system/elastic-agent.service

Hope it helps!
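
For anyone wanting to apply that without editing the packaged unit file, the same dependency can be expressed as a systemd drop-in; an illustrative sketch of the workaround described above, not an official fix (systemd's recommended target for waiting on a usable network is network-online.target, which may also be worth trying):

# /etc/systemd/system/elastic-agent.service.d/override.conf (illustrative)
[Unit]
After=syslog.target network.target

Run systemctl daemon-reload afterwards for the drop-in to take effect.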

@cmachango commented Jul 21, 2023

Hi @SINOX22, I tried your solution on my RHEL 8 and it is still showing Offline status.
