Agent with Elastic Defend remains unhealthy for a long time when an invalid configuration is applied for some time and then corrected. #3721

Closed
amolnater-qasource opened this issue Nov 7, 2023 · 12 comments · Fixed by #3747
Labels: bug, impact:medium, QA:Validated, Team:Elastic-Agent-Control-Plane

Comments

@amolnater-qasource

amolnater-qasource commented Nov 7, 2023

Kibana Build details:

VERSION: 8.11.0 BC9
Build : 68160
Commit : f2ea0c43ec0d854259d63d926b97e5c556b5f6b2

Host OS: Windows

Preconditions:

  1. An 8.11.0 BC9 Kibana cloud environment is available.
  2. A Windows agent is installed with the System and Defend integrations.
  3. Invalid Logstash outputs or an invalid Fleet Server have been created.
  4. [Update] Agent tamper protection settings are enabled.

Steps to reproduce:

  1. Set the invalid output or invalid Fleet Server in the agent policy.
  2. Observe that the agent becomes unhealthy and shows an error under the Defend integration.
  3. Wait 5-10 minutes, then update the policy to the correct configuration.
  4. Wait 10-15 minutes and observe that the agent still remains unhealthy.

NOTE:

  • Issue is consistently reproducible.

Screenshot:

image

Expected Result:
The agent with Endpoint should return to Healthy after an invalid configuration has been applied for some time and is then replaced with the correct one.

Agent.json:
ec2amaz-8ag00sp-agent-details.zip

Logs:
elastic-agent-diagnostics-2023-11-07T11-28-06Z-00.zip

amolnater-qasource added the bug, Team:Elastic-Agent-Control-Plane, and impact:medium labels Nov 7, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@amolnater-qasource
Author

@manishgupta-qasource Please review.

@manishgupta-qasource

Secondary review for this ticket is Done

@cmacknz
Member

cmacknz commented Nov 8, 2023

The degraded input I see is:

    {
      "id": "beat/metrics-monitoring",
      "type": "beat/metrics",
      "status": "DEGRADED",
      "message": "Degraded: pid '4652' missed 1 check-in",
      "units": [
        {
          "id": "beat/metrics-monitoring-metrics-monitoring-beats",
          "type": "input",
          "status": "STARTING",
          "message": "Starting: spawned pid '4652'"
        },
        {
          "id": "beat/metrics-monitoring",
          "type": "output",
          "status": "STARTING",
          "message": "Starting: spawned pid '4652'"
        }
      ]
    },
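A minimal sketch for picking out entries like the one above from a list of component states. It assumes the components are available as a JSON array of objects shaped like the fragment quoted in this comment, and that `HEALTHY` is the status string of a healthy component (the thread itself only shows `STARTING`, `DEGRADED`, and `FAILED`). The function name is hypothetical.

```python
import json

# Assumption: "HEALTHY" is the status string of a healthy component; the
# thread only shows STARTING, DEGRADED and FAILED in uppercase.
HEALTHY = "HEALTHY"


def unhealthy_components(components):
    """Return (id, status, message, [(unit id, unit status), ...]) for every
    component whose status is not HEALTHY."""
    found = []
    for comp in components:
        if comp.get("status") != HEALTHY:
            units = [(u.get("id"), u.get("status")) for u in comp.get("units", [])]
            found.append((comp.get("id"), comp.get("status"), comp.get("message"), units))
    return found


# The component quoted in the comment above, wrapped in a list.
SAMPLE = json.loads("""
[
  {
    "id": "beat/metrics-monitoring",
    "type": "beat/metrics",
    "status": "DEGRADED",
    "message": "Degraded: pid '4652' missed 1 check-in",
    "units": [
      {
        "id": "beat/metrics-monitoring-metrics-monitoring-beats",
        "type": "input",
        "status": "STARTING",
        "message": "Starting: spawned pid '4652'"
      },
      {
        "id": "beat/metrics-monitoring",
        "type": "output",
        "status": "STARTING",
        "message": "Starting: spawned pid '4652'"
      }
    ]
  }
]
""")

for comp_id, status, message, units in unhealthy_components(SAMPLE):
    print(f"{comp_id}: {status} ({message})")
    for unit_id, unit_status in units:
        print(f"  {unit_id}: {unit_status}")
```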

@cmacknz
Member

cmacknz commented Nov 9, 2023

I suspect this is the same symptom as #3654.

I do see a period where endpoint transitioned from Healthy to Failed which is different:

{"log.level":"warn","@timestamp":"2023-11-07T11:16:27.995Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed endpoint-1438a4a0-7d56-11ee-b010-9371de552274 (STARTING->DEGRADED): Degraded: endpoint service missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"endpoint-1438a4a0-7d56-11ee-b010-9371de552274","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:17:28.004Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed endpoint-1438a4a0-7d56-11ee-b010-9371de552274 (DEGRADED->FAILED): Failed: endpoint service missed 3 check-ins","log":{"source":"elastic-agent"},"component":{"id":"endpoint-1438a4a0-7d56-11ee-b010-9371de552274","state":"FAILED","old_state":"DEGRADED"},"ecs.version":"1.6.0"}

I also see a lot of the following message, but this might be due to frequent process restarts:

{"log.level":"debug","@timestamp":"2023-11-07T11:16:23.019Z","log.origin":{"file.name":"runtime/manager.go","file.line":653},"message":"actions stream sent an invalid token; closing connection","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
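The state-change entries quoted above carry a `component` object with `id`, `state`, and `old_state`, so the transitions can be pulled out of a log file mechanically. The following is a rough sketch under the assumption that the log is newline-delimited JSON like the lines in this comment; the function name is hypothetical and the sample lines are trimmed copies of the two entries above.

```python
import json


def state_transitions(lines):
    """Return (timestamp, component id, old state, new state) for every log
    line that records a component state change. Other lines are skipped."""
    transitions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        comp = record.get("component")
        if isinstance(comp, dict) and "state" in comp and "old_state" in comp:
            transitions.append(
                (record.get("@timestamp"), comp.get("id"), comp["old_state"], comp["state"])
            )
    return transitions


# Trimmed copies of the two entries quoted above, plus one unrelated line.
SAMPLE_LINES = [
    '{"log.level":"warn","@timestamp":"2023-11-07T11:16:27.995Z","component":{"id":"endpoint-1438a4a0-7d56-11ee-b010-9371de552274","state":"DEGRADED","old_state":"STARTING"}}',
    '{"log.level":"error","@timestamp":"2023-11-07T11:17:28.004Z","component":{"id":"endpoint-1438a4a0-7d56-11ee-b010-9371de552274","state":"FAILED","old_state":"DEGRADED"}}',
    '{"log.level":"debug","@timestamp":"2023-11-07T11:16:23.019Z","message":"actions stream sent an invalid token; closing connection"}',
]

for ts, comp_id, old, new in state_transitions(SAMPLE_LINES):
    print(f"{ts} {comp_id}: {old} -> {new}")
```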

@cmacknz
Member

cmacknz commented Nov 9, 2023

Slightly before endpoint started missing check-ins, we got a policy change to remove endpoint from the policy:

{"log.level":"debug","@timestamp":"2023-11-07T11:15:52.547Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":205},"message":"got action teardown for endpoint service","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:52.547Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":184},"message":"got teardown for endpoint service, tearingDown: false, tamperProtection: true","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:52.547Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":165},"message":"start teardown for endpoint service","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:52.547Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":172},"message":"set teardown timer 30s for endpoint service","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:52.547Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":178},"message":"process new comp config for endpoint service","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:52.547Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":408},"message":"update component configuration for endpoint service","ecs.version":"1.6.0"}

@cmacknz
Member

cmacknz commented Nov 9, 2023

OK I now see an attempted uninstall of endpoint with an invalid uninstall token:

{"log.level":"debug","@timestamp":"2023-11-07T11:15:54.598Z","log.origin":{"file.name":"handlers/handler_action_policy_change.go","file.line":83},"message":"handlerPolicyChange: action 'action_id: policy:3f7ffbb0-7c64-11ee-a559-87f695dbf4e9:24:1, type: POLICY_CHANGE' received","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:54.598Z","log.origin":{"file.name":"handlers/handler_action_policy_change.go","file.line":107},"message":"handlerPolicyChange: emit configuration for action action_id: policy:3f7ffbb0-7c64-11ee-a559-87f695dbf4e9:24:1, type: POLICY_CHANGE","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:54.598Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":153},"message":"Successfully dispatched action: 'action_id: policy:3f7ffbb0-7c64-11ee-a559-87f695dbf4e9:24:1, type: POLICY_CHANGE'","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:55.068Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":244},"message":"got check-in for endpoint service, tearingDown: true, ignoreCheckins: false","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:55.068Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":150},"message":"stop check-in timer for endpoint service","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:55.068Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":154},"message":"stop connection info for endpoint service","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:55.068Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/conn_info_server.go","file.line":45},"message":"failed accept conn info connection: accept tcp 127.0.0.1:6788: use of closed network connection","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-11-07T11:15:55.068Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":332},"message":"stopping endpoint service runtime","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-11-07T11:15:55.068Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":348},"message":"endpoint service has checked in, send stopping state to service","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-11-07T11:15:55.068Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":356},"message":"uninstall endpoint service","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:55.068Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":660},"message":"uninstall endpoint-security service","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:55.789Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":156},"message":"FleetGateway calling Checkin API","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:55.789Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":186},"message":"Checking started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:56.012Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":323},"message":"using previously saved ack token: 2c3382cb-5f4e-423a-909b-b84f85b2373b","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:56.012Z","log.origin":{"file.name":"remote/client.go","file.line":172},"message":"Request method: POST, path: /api/fleet/agents/1181c62e-7fb3-4aad-af1b-9b67fab6c3ee/checkin, reqID: 01HEMQY1FCWTA893APNH8THE9Q","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:56.012Z","log.origin":{"file.name":"remote/client.go","file.line":186},"message":"Creating new request to request URL https://07f9f186c0c94405a4900deb91931e25.fleet.europe-west1.gcp.cloud.es.io:443/api/fleet/agents/1181c62e-7fb3-4aad-af1b-9b67fab6c3ee/checkin?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.745Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: info: Main.cpp:299 Executing uninstall","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.745Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: info: Internal.cpp:51 Found config path [C:\\Program Files\\Elastic\\Endpoint\\elastic-endpoint.yaml]","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.761Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: debug: Utilities.cpp:420 Tamper protection enabled","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.761Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: info: InstallLib.cpp:953 Checking installed uninstall protection artifacts","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.766Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: info: Internal.cpp:51 Found config path [C:\\Program Files\\Elastic\\Endpoint\\elastic-endpoint.yaml]","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.768Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: info: InstallLib.cpp:712 No custom public key detected in Endpoint config","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.770Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: info: Crypto.cpp:1067 RSA signature verified","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.777Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: info: InstallLib.cpp:885 Failed to read os section of tamper-protection-config, continuing","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.777Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: info: InstallLib.cpp:975 Finished checking installed uninstall protection artifacts with result deny","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.777Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: info: InstallLib.cpp:1047 Finished checking command line provided uninstall resource result deny","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.777Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-11-07 11:15:56: error: InstallLib.cpp:1237 Invalid uninstall token","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-11-07T11:15:56.809Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":359},"message":"failed endpoint service uninstall, err: 2023-11-07 11:15:56: error: InstallLib.cpp:1237 Invalid uninstall token: exit status 284","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:56.809Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":364},"message":"set endpoint service runtime to stopped state","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:56.809Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":244},"message":"got check-in for endpoint service, tearingDown: false, ignoreCheckins: true","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-11-07T11:15:56.809Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":244},"message":"got check-in for endpoint service, tearingDown: false, ignoreCheckins: true","ecs.version":"1.6.0"}
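In the excerpt above, every line with `"context":"command output"` is logged at `error` level by the agent, while the embedded Endpoint message carries its own level (`info`, `debug`, or `error`). A small sketch, assuming newline-delimited JSON and the `<date> <time>: <level>: <file>:<line> <text>` message layout seen in these lines, can separate the real error from the informational output. The function name and regular expression are illustrative, not part of the agent.

```python
import json
import re

# Layout of the embedded Endpoint output as seen in the excerpt above:
# "2023-11-07 11:15:56: info: Main.cpp:299 Executing uninstall"
INNER = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}: (\w+): (\S+) (.*)$")


def command_output(lines):
    """Return (inner level, source location, text) for each log line whose
    context is "command output" and whose message matches INNER."""
    entries = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if record.get("context") != "command output":
            continue
        match = INNER.match(record.get("message", ""))
        if match:
            entries.append(match.groups())
    return entries


# Trimmed copies of entries from the excerpt above.
SAMPLE_LINES = [
    '{"log.level":"info","message":"uninstall endpoint service"}',
    '{"log.level":"error","message":"2023-11-07 11:15:56: info: Main.cpp:299 Executing uninstall","context":"command output"}',
    '{"log.level":"error","message":"2023-11-07 11:15:56: error: InstallLib.cpp:1237 Invalid uninstall token","context":"command output"}',
]

for level, source, text in command_output(SAMPLE_LINES):
    if level == "error":
        print(f"{source}: {text}")
```

Filtering on the inner level leaves only the `Invalid uninstall token` line from this sample.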

@cmacknz
Member

cmacknz commented Nov 9, 2023

@amolnater-qasource it looks to me like there was a policy reassignment or that Elastic Defend was installed and then uninstalled, which is missing from the steps in the description. Did either of these two things happen?

@amolnater-qasource
Author

amolnater-qasource commented Nov 10, 2023

Hi @cmacknz

Thank you for looking into this.

> there was a policy reassignment or that Elastic Defend was installed and then uninstalled,

No, we didn't do either of these. However, when the invalid output was set, we waited for endpoint to become Unhealthy.
image

For confirmation, we have revalidated this issue on the released 8.11.0 Kibana cloud environment and found it is still reproducible.

  • There was no policy reassignment, and Defend was not installed or uninstalled.
  • To add: agent tamper protection settings are enabled. [Updated the report above]

We can reproduce this issue more frequently when tamper protection settings are enabled.

Sharing the latest logs and json below:
ec2amaz-8ag00sp-agent-details (1).zip
elastic-agent-diagnostics-2023-11-10T07-16-25Z-00.zip

Please let us know if anything else is required from our end.
Thanks!!

amolnater-qasource added the QA:Ready For Testing label Nov 19, 2023
@amolnater-qasource
Author

Hi Team,

We have revalidated this issue on the latest 8.12.0 BC5 Kibana cloud environment and found it is still reproducible.

Observations:

  • The agent with Elastic Defend remains unhealthy for a long time after an invalid configuration is applied for some time and then corrected.

Build details:
Kibana Details:

VERSION: 8.12.0
BUILD: 70053
COMMIT: db9b8921b37139cbb1e11d23f6381f655edeb72b

Artifact Link: https://staging.elastic.co/8.12.0-9f05a310/downloads/beats/elastic-agent/elastic-agent-8.12.0-windows-x86_64.zip

Screenshot:
image

Agent Logs:
elastic-agent-diagnostics-2024-01-08T13-07-13Z-00.zip

Agent.json:
ec2amaz-9edrhhs-agent-details.zip

Hence, we are reopening this issue.

Please let us know if anything else is required from our end.

Thanks!

amolnater-qasource removed the QA:Ready For Testing label Jan 15, 2024
@pierrehilbert
Contributor

@amolnater-qasource is it still happening in 8.14?

@amolnater-qasource
Author

Hi @pierrehilbert

We have revalidated this issue on the 8.14.0 BC7 Kibana cloud environment and found it is now fixed.

Observations:

  • The agent with Endpoint returns to Healthy after an invalid configuration is applied for some time and then corrected.

Agent Diagnostics:
elastic-agent-diagnostics-2024-06-04T09-29-00Z-00.zip

Screenshot:
image

Build details:
VERSION: 8.14.0 BC7
BUILD: 73988
COMMIT: 3bc2979d1d65982aee7d13ebd65434c3470dc808
Artifact Link: https://staging.elastic.co/8.14.0-fe696c51/downloads/beats/elastic-agent/elastic-agent-8.14.0-windows-x86_64.zip

Hence, we are closing and marking this issue as QA:Validated.

Thanks!

amolnater-qasource added the QA:Validated label Jun 4, 2024