Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Inconsistent] Endpoint fails to connect to Agent after Stop-Start agent from Services on Windows #2547

Open
harshitgupta-qasource opened this issue Apr 25, 2023 · 14 comments
Labels
bug Something isn't working impact:medium Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@harshitgupta-qasource
Copy link

Kibana version: 8.7.1 BC3 Kibana cloud environment

Host OS: Windows

Build details:
VERSION: 8.7.1
BUILD: 61211
COMMIT: fde13a9709d0191b3eccee3aa20af50be8ef7e7e
Artifact Link: https://staging.elastic.co/8.7.1-14a43a6a/summary-8.7.1.html

Preconditions:

  1. 8.7.1 BC3 Kibana cloud environment should be available.
  2. New Policy should be created with System integration.

Steps to reproduce:

  1. Create a new agent policy with System Integration.
  2. Deploy the agent on the Windows Endpoint.
  3. Now, add the elastic defend integration to the agent policy.
  4. Now navigate to the Windows endpoint and stop the elastic-agent service through the service manager.
  5. Wait until the Agent goes to the offline state.
  6. Start the elastic-agent service through the service manager.
  7. Wait for 10 minutes.
  8. Observe that the agent goes to unhealthy state and under Elastic defend Integration policy response shows Failed to connect to Agent

Agent Diagnostics:
elastic-agent-diagnostics-2023-04-25T09-30-56Z-00.zip

Screenshot
image

Expected Result:
Endpoint should connect to Agent after Stop-Start agent from Services on Windows

Impacted Testcases
#150205
#35378

@harshitgupta-qasource harshitgupta-qasource added bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. labels Apr 25, 2023
@harshitgupta-qasource
Copy link
Author

@amolnater-qasource Please review.

@harshitgupta-qasource harshitgupta-qasource added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 25, 2023
@amolnater-qasource
Copy link

Secondary review for this ticket is Done.

@amolnater-qasource
Copy link

FYI @cmacknz

@cmacknz
Copy link
Member

cmacknz commented Apr 25, 2023

Endpoint is indeed degraded in the attached diagnostics:

- id: endpoint-default
  state:
    state: 2
    message: 'Healthy: communicating with endpoint service'
    units:
      ? unittype: 0
        unitid: endpoint-default-f525194d-bfe6-4106-9047-e9ec8ef568cf
      : state: 3
        message: Applied policy {f525194d-bfe6-4106-9047-e9ec8ef568cf}
        payload:
          error:
            code: 0
            message: Success
      ? unittype: 1
        unitid: endpoint-default
      : state: 3
        message: Applied policy {f525194d-bfe6-4106-9047-e9ec8ef568cf}
        payload:
          error:
            code: 0
            message: Success
    features_idx: 0
    version_info:
      name: Endpoint
      version: 8.7.1

When the agent restarts it appears to discover that endpoint is already running correctly according to the most recent logs:

{"log.level":"info","@timestamp":"2023-04-25T09:06:57.751Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":983},"message":"Spawned new component endpoint-default: Starting: endpoint service runtime","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-25T09:06:57.751Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":983},"message":"Spawned new unit endpoint-default: Starting: endpoint service runtime","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"STARTING"},"unit":{"id":"endpoint-default","type":"output","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-25T09:06:57.751Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":983},"message":"Spawned new unit endpoint-default-f525194d-bfe6-4106-9047-e9ec8ef568cf: Starting: endpoint service runtime","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"STARTING"},"unit":{"id":"endpoint-default-f525194d-bfe6-4106-9047-e9ec8ef568cf","type":"input","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-25T09:06:59.381Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-04-25 09:06:59: info: Main.cpp:390 Verifying existing installation","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-25T09:06:59.381Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-04-25 09:06:59: info: InstallLib.cpp:630 Running [C:\\Program Files\\Elastic\\Endpoint\\elastic-endpoint.exe] [version --log stdout]","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-25T09:07:00.318Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-04-25 09:07:00: info: InstallLib.cpp:669 Installed endpoint is expected version (version: 8.7.1, compiled: Fri Apr 21 14:00:00 2023, branch: 8.7, commit: 4addadb6f43eac8edccceba3b9f87a1e188d03b9)","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-25T09:07:00.318Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service_command.go","file.line":69},"message":"2023-04-25 09:07:00: info: Util.cpp:2005 Endpoint Service is running.","context":"command output","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-25T09:07:22.432Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":989},"message":"Component state changed endpoint-default (STARTING->HEALTHY): Healthy: communicating with endpoint service","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-04-25T09:07:22.432Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":987},"message":"Unit state changed endpoint-default (STARTING->DEGRADED): Applied policy {f525194d-bfe6-4106-9047-e9ec8ef568cf}","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"HEALTHY"},"unit":{"id":"endpoint-default","type":"output","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-04-25T09:07:42.436Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":987},"message":"Unit state changed endpoint-default-f525194d-bfe6-4106-9047-e9ec8ef568cf (STARTING->DEGRADED): Applied policy {f525194d-bfe6-4106-9047-e9ec8ef568cf}","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"HEALTHY"},"unit":{"id":"endpoint-default-f525194d-bfe6-4106-9047-e9ec8ef568cf","type":"input","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}

This appears to line up with this sequence from the endpoint logs:

{"@timestamp":"2023-04-25T09:06:55.2715926Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":123,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:123 Agent process is not root/admin or validation failed, disconnecting","process":{"pid":7680,"thread":{"id":7604}}}
{"@timestamp":"2023-04-25T09:06:55.2715926Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"warning","origin":{"file":{"line":180,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:180 Failed to established stage 1 connection to agent","process":{"pid":7680,"thread":{"id":7604}}}
{"@timestamp":"2023-04-25T09:06:55.2715926Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":646,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:646 Unable to retrieve connection info from Agent(Agent is not running as root)","process":{"pid":7680,"thread":{"id":7604}}}
{"@timestamp":"2023-04-25T09:07:01.3065017Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":128,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:128 Validated agent (6460) is root/admin","process":{"pid":7680,"thread":{"id":7604}}}
{"@timestamp":"2023-04-25T09:07:01.3065017Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":136,"name":"AgentConnectionInfo.cpp"}}},"message":"AgentConnectionInfo.cpp:136 Established stage 1 connection to agent","process":{"pid":7680,"thread":{"id":7604}}}
{"@timestamp":"2023-04-25T09:07:02.3156873Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":700,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:700 Connecting to Agent.","process":{"pid":7680,"thread":{"id":7604}}}
{"@timestamp":"2023-04-25T09:07:17.2392486Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":236,"name":"BulkQueueConsumer.cpp"}}},"message":"BulkQueueConsumer.cpp:236 Sent 257 documents to Elasticsearch","process":{"pid":7680,"thread":{"id":4892}}}
{"@timestamp":"2023-04-25T09:07:22.431162Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":728,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:728 Agent connection established.","process":{"pid":7680,"thread":{"id":7604}}}
{"@timestamp":"2023-04-25T09:07:22.431162Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":55,"name":"BigChief.cpp"}}},"message":"BigChief.cpp:55 Agent process notification for protection: 6460","process":{"pid":7680,"thread":{"id":7604}}}
{"@timestamp":"2023-04-25T09:07:22.431162Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":76,"name":"BigChief.cpp"}}},"message":"BigChief.cpp:76 Agent process [6460] protection result: Success","process":{"pid":7680,"thread":{"id":7604}}}
{"@timestamp":"2023-04-25T09:07:22.431162Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":485,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:485 Successfully created reader/writer for CheckinV2Api","process":{"pid":7680,"thread":{"id":9132}}}
{"@timestamp":"2023-04-25T09:07:22.4338289Z","agent":{"id":"a24dc70b-9757-4b9d-92f3-1664a3113b42","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":550,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:550 Not applying config from agent as it has not changed","process":{"pid":7680,"thread":{"id":9132}}}

To me it appears like endpoint correctly marks itself as DEGRADED when the agent process is detected as stopped but then never clears it.

@brian-mckinney any insight into why endpoint never clears its DEGRADED status after it re-establishes a connection with the agent? How would the DEGRADED state here be fixed? It's not clear to me that this is an issue with agent itself.

@brian-mckinney
Copy link

@cmacknz I think there is a logic bug.

  • Endpoint has applied policy and was HEALTHY
  • Then lost connection with agent and set itself as DEGRADED
  • (BUG) When Endpoint reconnects, it does not respond to agent CheckinExpected because it has already applied the policy that is being pushed down. Endpoint is not checking if it was degraded for another reason and then clearing that status. I will put in a bug ticket for this and fix it.

cc: @nfritts

@nfritts
Copy link

nfritts commented May 5, 2023

@harshitgupta-qasource This might be resolved. Can we verify if this bug exists in 8.8 BC2?

@amolnater-qasource amolnater-qasource added the QA:Ready For Testing Code is merged and ready for QA to validate label May 7, 2023
@harshitgupta-qasource
Copy link
Author

Hi Team,

We have revalidated this issue on the latest 8.8.0 BC3 Kibana cloud environment and found it fixed now.

Observations:

  • Endpoint is connected back to Agent after Stop-Start agent from Services on Windows

Build details:
VERSION: 8.8 BC3
BUILD: 62994
COMMIT: 85b22d307ab93fca95c1698ede4cb61d85f3d314

Screen-Shot:
image

Hence, we are closing this issue and marking as QA: Validated.

Thanks

@harshitgupta-qasource harshitgupta-qasource added QA:Validated Validated by the QA Team and removed QA:Ready For Testing Code is merged and ready for QA to validate labels May 12, 2023
@harshitgupta-qasource
Copy link
Author

Hi Team

While testing the 8.8.1 BC2 Kibana, we have observed this issue reproducible.

Observations:

  • Endpoint fails to connect to Agent after Stop-Start agent from Services on Windows.

Build details:
VERSION: 8.8.1 BC2
BUILD: 63240
COMMIT: 0fda51d5cd9f9b724fd0ed4356221d49f2c7af27
Artifacts: https://staging.elastic.co/8.8.1-9ac7eb02/summary-8.8.1.html

Screenshot:
image

Agent Diagnostic logs:

elastic-agent-diagnostics-2023-06-06T06-54-58Z-00.zip

Hence, we are re-opening this issue.

Please let us know if anything else is required from our end.

Thanks!

@harshitgupta-qasource harshitgupta-qasource removed the QA:Validated Validated by the QA Team label Jun 6, 2023
@jlind23
Copy link
Contributor

jlind23 commented Jun 6, 2023

@nfritts @pierrehilbert can you please take a look at this again?

@harshitgupta-qasource
Copy link
Author

Hi @jlind23

While testing the 8.8.2 BC1 Kibana, we have observed this issue is still reproducible.

Observations:

  • Endpoint fails to connect to Agent after Stop-Start agent from Services on Windows.

Build details:
VERSION: 8.8.2
BUILD: 63328
COMMIT: db8521641e9f4c545859a80e19f34066c507a137
Artifacts: https://staging.elastic.co/8.8.2-7d6121e6/summary-8.8.2.html

Screenshot:
image

Agent Diagnostic logs:
elastic-agent-diagnostics-2023-06-23T05-22-42Z-00.zip

Please let us know if anything else is required from our end.

Thanks!

@harshitgupta-qasource
Copy link
Author

Hi Team,

We have revalidated this issue on the latest 8.8.2 BC2 and 8.9.0 BC1 Kibana cloud environment and found it fixed now.

Observations:

  • Endpoint is connected back to Agent after Stop-Start agent from Services on Windows

Build details:
VERSION: 8.9.0 BC1
BUILD: 64385
COMMIT: 313dac73d8d3bc5930447f732e3ae163fb1b7f70

VERSION: 8.8.2
BUILD: 63337
COMMIT: c7c074756dcb999f9fdf4c153c5445a00653fa44

Screen-Shot:
image

Hence, we are closing this issue and marking as QA: Validated.

Thanks

@harshitgupta-qasource harshitgupta-qasource added the QA:Validated Validated by the QA Team label Jun 28, 2023
@harshitgupta-qasource harshitgupta-qasource changed the title Endpoint fails to connect to Agent after Stop-Start agent from Services on Windows [Inconsistent] Endpoint fails to connect to Agent after Stop-Start agent from Services on Windows Jul 14, 2023
@harshitgupta-qasource
Copy link
Author

Hi Team,

While testing the 8.9.0 BC4 Kibana, we have observed this issue is again reproducible inconsistently.

Observations:

  • Endpoint fails to connect to Agent after Stop-Start agent from Services on Windows.

Build details:
VERSION: 8.9.0
BUILD: 64661
COMMIT: ddf0c1972e43898b6890ddb38f4c016e96538239

Screen-shot
image

Agent Diagnostic logs:
elastic-agent-diagnostics-2023-07-14T07-57-22Z-00.zip

Hence, we are reopening this issue and marking it as Inconsistent.

Please let us know if anything else is required from our end.

Thanks!

@amolnater-qasource amolnater-qasource added impact:medium and removed QA:Validated Validated by the QA Team impact:high Short-term priority; add to current release, or definitely next. labels Jul 14, 2023
@jlind23 jlind23 added the Team:Elastic-Agent Label for the Agent team label Jul 17, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@harshitgupta-qasource
Copy link
Author

Bug Conversion

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working impact:medium Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

No branches or pull requests

7 participants