[Agent]: On restarting Agent host, Metricbeat is orphaned & stuck in crash loop (relates to policy changes + possibly w Fleet server usage) #25829

amolnater-qasource · 2021-05-24T14:07:49Z

Steps followed:

1. Log in to self-managed Kibana environment.
2. Install Agent with Default Fleet Server Policy.
3. Assign "Agent" to Default policy having System, Endpoint and Fleet Server Integration.
4. Restart elastic-agent from services.
5. Navigate to agent "Logs" tab and observe Metricbeat service stuck in "Starting-Restarting-Crashed" loop.

Logs:
logs.zip

elasticmachine · 2021-05-24T14:08:50Z

Pinging @elastic/fleet (Team:Fleet)

elasticmachine · 2021-05-24T15:53:23Z

Pinging @elastic/agent (Team:Agent)

EricDavisX · 2021-05-24T16:29:27Z

Thanks, Amol. With the specifics provided, I am making the following assumptions regarding step 3:
> Assign "Agent" to Default policy having System, Endpoint and Fleet Server Integration.

... the 'Default Policy' does not have Fleet Server or Endpoint in it, so you modified it, I presume? That is OK.
... the 'Agent' was set up (I believe) as the sole Fleet Server instance, and when changing it, you kept the Fleet Server integration but switched to a new policy after it was enrolled and started.

This isn't a crazy use case for self-managed, and if the assumptions are correct then it is a high urgency issue as the Agent/Fleet-Server were running fine until say, a laptop reboot.

note: I am suspicious that 'Endpoint' has much to do with it, we can try to repro this apart from that to narrow it down if needed. I expect it is just the FS vs policy switch and the reboot that are in play.

@blakerouse do you have any thoughts? want to look at it / check the logs?

EricDavisX · 2021-05-24T16:37:59Z

I'd also like to confirm that this doesn't happen when Fleet Server is out of the scenario, and which OS this was seen on (we can test others too for more data).

amolnater-qasource · 2021-05-25T07:19:28Z

Hi @EricDavisX
Thanks for the feedback. We have validated this issue on a Windows 10 x64 self-managed 7.13 BC-9 Kibana environment.

> the 'Default Policy' does not have Fleet Server or Endpoint in it, so you modified it I presume? That is ok.

> this doesn't happen when Fleet Server is out of the scenario

Yes
We added the Fleet Server integration because, when we reassign the Fleet Server agent to another policy that does not have the Fleet Server integration, it starts showing the error below:
elastic_agent [elastic_agent][error] failed to dispatch actions, error: fail to generate program configuration: expecting Dict and received *transpiler.Key for '0'

> the 'Agent' was set up (I believe) intended as the sole Fleet Server instance, and when changing it, you kept the Fleet Server integration, but switched to a new policy after it was enrolled and started.

Yes
Default Fleet Server policy [only Fleet Server integration] -> Default Policy [System, Endpoint Security and Fleet Server integrations]

> I am suspicious that 'Endpoint' has much to do with it.

Yes, this issue is not reproducible without Endpoint Security.
This issue is only reproducible with the exact steps shared above.

Thanks
QAS

EricDavisX · 2021-05-25T15:39:18Z

Interesting, seems something special going on.

@ruflin @ph do you think it needs urgent follow up for the 'tinkerer' all-on-one-host download type setup? If not, we can remove the 7.13.1 label.

ruflin · 2021-05-26T10:10:15Z

We should further investigate this. On the testing side we should switch over to use 7.13.1-SNAPSHOT instead of the BC as the release is out and some fixes already went into 7.13.1.

@michalpristas I think we have seen this transpiler issue in the past?

EricDavisX · 2021-05-27T19:43:35Z

Michal offered to look at any burning issues and so I assigned this to him after chatting in slack.

michalpristas · 2021-05-31T11:42:27Z

@EricDavisX wasn't this already fixed a long time ago?
{"log.level":"error","@timestamp":"2021-05-19T07:23:39.723-0400","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":99},"message":"Error creating runner from config: 1 error: metricset 'system/load' not found","ecs.version":"1.6.0"}

michalpristas · 2021-06-01T05:26:44Z

I can reproduce this. If I keep refreshing the agent, I eventually hit this issue. I don't know why this is happening so far; I suspect a race in shutdown, but I might be wrong.

Another problem I see is that the orphaned Beat cannot connect to the agent and keeps logging this:

{"log.level":"error","@timestamp":"2021-05-31T09:18:45.909-0400","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"fleet/manager.go","file.line":202},"message":"elastic-agent-client got error: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:6789: connectex: No connection could be made because the target machine actively refused it.\"","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-05-31T09:18:45.909-0400","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"fleet/manager.go","file.line":202},"message":"elastic-agent-client got error: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:6789: connectex: No connection could be made because the target machine actively refused it.\"","ecs.version":"1.6.0"}

Together with the rapid logging, it also eats up memory. There is probably a leak in elastic-agent-client, but this does not always happen. Maybe it is some strange coincidence when uninstalling the agent.
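
To get a rough sense of how often the orphaned Beat repeats the same error, the JSON log lines can be grouped by logger and message. This is only a minimal sketch, assuming one JSON object per line with the flat dotted keys seen in the excerpts above (log.level, log.logger, message). The function name summarize_errors and the shortened sample messages are illustrative and not part of any Elastic tooling.

```python
import json
from collections import Counter


def summarize_errors(lines):
    """Count error-level entries per (logger, message) pair."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines mixed into the log
        if isinstance(entry, dict) and entry.get("log.level") == "error":
            counts[(entry.get("log.logger"), entry.get("message"))] += 1
    return counts


# Shortened, illustrative stand-ins for the log lines quoted above.
sample = [
    '{"log.level":"error","log.logger":"centralmgmt.fleet","message":"elastic-agent-client got error: connection refused"}',
    '{"log.level":"error","log.logger":"centralmgmt.fleet","message":"elastic-agent-client got error: connection refused"}',
    '{"log.level":"info","log.logger":"centralmgmt","message":"started"}',
    'not json',
]

# Print the most frequent error messages first.
for (logger, message), n in summarize_errors(sample).most_common():
    print(f"{n}x [{logger}] {message}")
```

Running this over a real Metricbeat log file (for example by passing an open file object as lines) would show whether the connection error dominates the output, which is one way to confirm the rapid logging described above.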

EricDavisX · 2021-06-01T16:30:39Z

@michalpristas thanks for raising it - @amolnater-qasource can you confirm versions of Kibana and System Integration you are testing with please?

@fearful-symmetry I know there was some Win7 specific case we couldn't fix, not sure if this is that use case or not. Anything you can check on your side?

fearful-symmetry · 2021-06-01T16:36:21Z

> "message":"Error creating runner from config: 1 error: metricset 'system/load' not found",

That should have been fixed a long time ago. If you're still seeing that error @michalpristas , can you tell me what version of the system integration is currently running? It should be in the view config setting inside the policy menu.

> I know there was some Win7 specific case we couldn't fix

I don't recall that? The guard that prevents this from running on Windows should be fairly blunt.

EricDavisX · 2021-06-01T16:38:58Z

It may have been win2019 that we couldn't figure out why, upon more thinking. Anyhow, let us wait for host and Integration version info. However, I don't know how it would be 'old enough' at this point with routine testing that it should ever come up.

michalpristas · 2021-06-01T18:59:51Z

@fearful-symmetry

    meta:
      package:
        name: system
        version: 0.12.6

fearful-symmetry · 2021-06-01T19:32:25Z

> It may have been win2019 that we couldn't figure out why, upon more thinking. Anyhow, let us wait for host and Integration version info.

Yah, I remember that.

So, nothing in the config has changed, as far as I can tell. This is kind of an obvious one, @michalpristas, but can you disable load in the policy and see if the problem goes away? load is pretty fundamental, and I kind of wonder if someone added some component somewhere that's enabling it separately. Also, if you have Go installed on this host, can you give me the output of go env?

amolnater-qasource · 2021-06-02T10:52:19Z

Hi @EricDavisX

> @amolnater-qasource can you confirm versions of Kibana and System Integration you are testing with please?

We are using a 7.13.0 [released] self-managed Kibana environment.

Versions of Integrations:
System Integration version: 0.12.6
Endpoint Security version: 0.19.1
Fleet Integration version: 0.9.1

Please let us know if anything else is required.
Thanks
QAS

EricDavisX · 2021-06-02T18:55:18Z

@fearful-symmetry do you have bandwidth to try to reproduce since it is with recent code? I'm hoping it isn't 'special' to see it.

fearful-symmetry · 2021-06-02T22:02:54Z

Me getting agent set up on windows always takes a while, but I can try and take a crack at it tomorrow @EricDavisX

fearful-symmetry · 2021-06-03T19:49:26Z

So, I'm still trying to test this properly, but so far I'm not seeing anything in 7.13 of elastic-agent itself.

fearful-symmetry · 2021-06-03T20:02:13Z

Alright @EricDavisX / @michalpristas, I tested this out on a fresh install of 7.13.1 with Windows Server 2012. I can't seem to reproduce it. Either it's been fixed, or it's something more subtle.

EricDavisX · 2021-06-07T22:28:28Z

Thank you, @fearful-symmetry. Let's sync with Michal offline via email, as he had a working reproduction; maybe he can share the environment / VM.

EricDavisX · 2021-06-15T13:46:51Z

@amolnater-qasource can you re-test this please?

amolnater-qasource · 2021-06-16T08:11:48Z

Hi @EricDavisX
We have revalidated this issue on 7.14.0 self-managed Kibana and found this issue fixed.

Observations on assigning to new policy:

Agent remains Healthy.
Metricbeat and other services come back in RUNNING state after reboot.

Build details:

Build: 41559
Commit: 9838db392e7fcfc12f004b68fb1b09739f131148
Artifact Link: https://snapshots.elastic.co/7.14.0-28665d9b/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip

Thanks
QAS

amolnater-qasource assigned EricDavisX May 24, 2021

amolnater-qasource mentioned this issue May 24, 2021

[Agent] "Metricbeat" service crashed on restarting elastic-agent, want improved beat side logging to inform user #25785

Closed

botelastic bot added the needs_team label May 24, 2021

amolnater-qasource added the Team:Fleet label May 24, 2021

botelastic bot removed the needs_team label May 24, 2021

EricDavisX removed their assignment May 24, 2021

EricDavisX added the Team:Elastic-Agent label May 24, 2021

EricDavisX removed the Team:Fleet label May 24, 2021

EricDavisX added the impact:high and v7.13.1 labels May 24, 2021

EricDavisX assigned michalpristas May 27, 2021

EricDavisX added v7.13.2 and removed v7.13.1 labels May 27, 2021

michalpristas mentioned this issue Jun 1, 2021

Fix startup with failing configuration #26057

Closed

6 tasks

michalpristas mentioned this issue Jun 3, 2021

Fix startup with failing configuration #26126

Merged

6 tasks

michalpristas closed this as completed in #26126 Jun 8, 2021

This was referenced Jun 10, 2021

Investigate cummulating memory when process cannot reach agent #26242

Closed

Cherry-pick #26126 to 7.13: Fix startup with failing configuration #26261

Merged

Cherry-pick #26126 to 7.x: Fix startup with failing configuration #26262

Merged

EricDavisX mentioned this issue Jun 15, 2021

[Self managed]: elastic_agent.metricbeat/filebeat datastreams generated on installing fleet-server agent. elastic/fleet-server#376

Closed
