[Agent]: On restarting Agent host, Metricbeat is orphaned & stuck in crash loop (relates to policy changes + possibly w Fleet server usage) #25829

Closed
amolnater-qasource opened this issue May 24, 2021 · 23 comments · Fixed by #26126
Assignees
Labels: impact:high, Team:Elastic-Agent, v7.13.2

Comments

@amolnater-qasource

Steps followed:

  1. Login to self-managed Kibana environment.
  2. Install Agent with Default Fleet Server Policy.
  3. Assign "Agent" to Default policy having System, Endpoint and Fleet Server Integration.
  4. Restart elastic-agent from services.
  5. Navigate to agent "Logs" tab and observe Metricbeat service stuck in "Starting-Restarting-Crashed" loop.

Logs:
logs.zip

@elasticmachine
Collaborator

Pinging @elastic/fleet (Team:Fleet)

@botelastic bot removed the needs_team label May 24, 2021
@EricDavisX removed their assignment May 24, 2021
@EricDavisX added the Team:Elastic-Agent label May 24, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@EricDavisX removed the Team:Fleet label May 24, 2021
@EricDavisX
Contributor

Thanks, Amol. Based on the specifics provided, I am making the following assumptions about step 3:
Assign "Agent" to Default policy having System, Endpoint and Fleet Server Integration.
... the 'Default Policy' does not have Fleet Server or Endpoint in it, so you modified it I presume? That is ok.
... the 'Agent' was set up (I believe) intended as the sole Fleet Server instance, and when changing it, you kept the Fleet Server integration, but switched to a new policy after it was enrolled and started.

This isn't a crazy use case for self-managed, and if the assumptions are correct then it is a high urgency issue as the Agent/Fleet-Server were running fine until say, a laptop reboot.

note: I am suspicious that 'Endpoint' has much to do with it, we can try to repro this apart from that to narrow it down if needed. I expect it is just the FS vs policy switch and the reboot that are in play.

@blakerouse do you have any thoughts? want to look at it / check the logs?

@EricDavisX added the impact:high and v7.13.1 labels May 24, 2021
@EricDavisX changed the title from "[Self Managed Kibana]: On restarting Agent installed under Default policy having Endpoint, creates orphaned metricbeat service at host and showing consistent error logs under Agent tab." to "[Agent]: On restarting Agent host, Metricbeat is orphaned & stuck in crash loop (relates to policy changes + possibly w Fleet server usage)" May 24, 2021
@EricDavisX
Contributor

I'd also like to confirm that this doesn't happen when Fleet Server is out of the scenario. Also, which OS was this seen on? (We can test others too, for more data.)

@amolnater-qasource
Author

Hi @EricDavisX
Thanks for the feedback, we have validated this issue on Windows 10 x64 self managed 7.13 BC-9 Kibana environment.

the 'Default Policy' does not have Fleet Server or Endpoint in it, so you modified it I presume? That is ok.

this doesn't happen when Fleet Server is out of the scenario

Yes.
We added the Fleet Server integration because, when we reassign the Fleet Server agent to a policy without the Fleet Server integration, it starts showing the error below:
elastic_agent [elastic_agent][error] failed to dispatch actions, error: fail to generate program configuration: expecting Dict and received *transpiler.Key for '0'

the 'Agent' was set up (I believe) intended as the sole Fleet Server instance, and when changing it, you kept the Fleet Server integration, but switched to a new policy after it was enrolled and started.

Yes
Default fleet server policy[only Fleet Server Integration]->Default Policy [System, Endpoint Security and Fleet Server integrations]

I am suspicious that 'Endpoint' has much to do with it.

Yes, this issue is not reproducible without Endpoint Security.
This issue is only reproducible with the exact steps shared above.

Thanks
QAS

@EricDavisX
Contributor

Interesting, seems something special going on.

@ruflin @ph do you think it needs urgent follow up for the 'tinkerer' all-on-one-host download type setup? If not, we can remove the 7.13.1 label.

@ruflin
Member

ruflin commented May 26, 2021

We should further investigate this. On the testing side we should switch over to use 7.13.1-SNAPSHOT instead of the BC as the release is out and some fixes already went into 7.13.1.

@michalpristas I think we have seen this transpiler issue in the past?

@EricDavisX
Contributor

Michal offered to look at any burning issues, so I assigned this to him after chatting in Slack.

@michalpristas
Contributor

@EricDavisX wasn't this already fixed a long time ago?
{"log.level":"error","@timestamp":"2021-05-19T07:23:39.723-0400","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":99},"message":"Error creating runner from config: 1 error: metricset 'system/load' not found","ecs.version":"1.6.0"}
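For triaging lines like the one above, here is a minimal sketch. It assumes the Beat logs are newline-delimited JSON carrying the fields seen in this thread (`log.level`, `log.logger`, `message`). The function name `summarize_errors` is hypothetical and not part of any Elastic tooling.

```python
import json
from collections import Counter


def summarize_errors(lines):
    """Count error-level records per (logger, message) pair.

    Assumes one JSON object per line; anything else is skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line, ignore it
        if not isinstance(record, dict):
            continue
        if record.get("log.level") == "error":
            counts[(record.get("log.logger"), record.get("message"))] += 1
    return counts


# Sample record rebuilt from the log line quoted above.
sample = json.dumps({
    "log.level": "error",
    "log.logger": "centralmgmt",
    "message": "Error creating runner from config: 1 error: "
               "metricset 'system/load' not found",
})

for key, n in summarize_errors([sample, sample]).items():
    print(n, key)
```

Run over the attached logs, a tally like this would show whether the `system/load` error dominates the crash loop or is only one of several recurring errors.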

@michalpristas
Contributor

michalpristas commented Jun 1, 2021

I can reproduce this. If I keep refreshing the agent, I eventually hit this issue. I don't know why it happens yet; I suspect a race in shutdown, but I might be wrong.

Another problem I see is that the orphaned Beat cannot connect to the agent and keeps logging this:

{"log.level":"error","@timestamp":"2021-05-31T09:18:45.909-0400","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"fleet/manager.go","file.line":202},"message":"elastic-agent-client got error: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:6789: connectex: No connection could be made because the target machine actively refused it.\"","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-05-31T09:18:45.909-0400","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"fleet/manager.go","file.line":202},"message":"elastic-agent-client got error: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:6789: connectex: No connection could be made because the target machine actively refused it.\"","ecs.version":"1.6.0"}

Together with the rapid logging, it also eats up memory. There is probably a leak in elastic-agent-client, but it does not always happen; maybe it is some strange coincidence when uninstalling the agent.
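To check whether anything is accepting connections on the address from those log lines, a small probe like the following can help. It is only a sketch: the host `::1` and port `6789` are copied from the error message, and the helper name `is_listening` is hypothetical. A refused connection here would be consistent with the agent no longer running while the orphaned Beat keeps retrying.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unsupported address families.
        return False


if __name__ == "__main__":
    # Address taken from the "dial tcp [::1]:6789" log lines above.
    if is_listening("::1", 6789):
        print("something is listening on [::1]:6789")
    else:
        print("nothing is listening on [::1]:6789")
```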

@EricDavisX
Contributor

@michalpristas thanks for raising it - @amolnater-qasource can you confirm versions of Kibana and System Integration you are testing with please?

@fearful-symmetry I know there was some Win7 specific case we couldn't fix, not sure if this is that use case or not. Anything you can check on your side?

@fearful-symmetry
Contributor

"message":"Error creating runner from config: 1 error: metricset 'system/load' not found",

That should have been fixed a long time ago. If you're still seeing that error, @michalpristas, can you tell me what version of the system integration is currently running? It should be in the view config setting inside the policy menu.

I know there was some Win7 specific case we couldn't fix

I don't recall that? The guard that prevents this from running on windows should be fairly blunt.

@EricDavisX
Contributor

It may have been win2019 that we couldn't figure out why, upon more thinking. Anyhow, let us wait for host and Integration version info. However, I don't know how it would be 'old enough' at this point with routine testing that it should ever come up.

@michalpristas
Contributor

@fearful-symmetry

    meta:
      package:
        name: system
        version: 0.12.6

@fearful-symmetry
Contributor

It may have been win2019 that we couldn't figure out why, upon more thinking. Anyhow, let us wait for host and Integration version info.

Yah, I remember that.

So, nothing in the config has changed, as far as I can tell. Kinda obvious, but @michalpristas, can you disable load in the policy and see if the problem goes away? load is pretty fundamental, and I kinda wonder if someone added some component somewhere that's enabling it separately. Also, if you have Go installed on this host, can you give me the output of go env?

@amolnater-qasource
Author

Hi @EricDavisX

  • @amolnater-qasource can you confirm versions of Kibana and System Integration you are testing with please?

We are using 7.13.0 [released] Kibana self managed environment.

Versions of Integrations:
System Integration version: 0.12.6
Endpoint Security version: 0.19.1
Fleet Integration version: 0.9.1

Please let us know if anything else is required.
Thanks
QAS

@EricDavisX
Contributor

@fearful-symmetry do you have bandwidth to try to reproduce since it is with recent code? I'm hoping it isn't 'special' to see it.

@fearful-symmetry
Contributor

Me getting agent set up on windows always takes a while, but I can try and take a crack at it tomorrow @EricDavisX

@fearful-symmetry
Contributor

So, I'm still trying to test this properly, but so far I'm not seeing anything in 7.13 of elastic-agent itself.

@fearful-symmetry
Contributor

Alright @EricDavisX / @michalpristas, I tested this out on a fresh install of 7.13.1 with Windows Server 2012. I can't seem to reproduce it. Either it's been fixed, or it's something more subtle.

@EricDavisX
Contributor

Thank you, @fearful-symmetry. Let's sync with Michal offline via email, as he had a working reproduction; maybe he can share the environment / VM.

@EricDavisX
Contributor

@amolnater-qasource can you re-test this please?

@amolnater-qasource
Author

Hi @EricDavisX
We have revalidated this issue on 7.14.0 self-managed Kibana and found this issue fixed.

Observations on assigning to new policy:

  • Agent remains Healthy.
  • Metricbeat and other services come back in RUNNING state after reboot.

Build details:

Build: 41559
Commit: 9838db392e7fcfc12f004b68fb1b09739f131148
Artifact Link: https://snapshots.elastic.co/7.14.0-28665d9b/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip

Thanks
QAS

6 participants