"error: Metricbeat FAILED" errors under Logs tab on enrolling 7.11 Windows agent with 7.10 default policy on upgraded kibana(7.10.2-7.11.0). #23812

Closed
amolnater-qasource opened this issue Feb 2, 2021 · 57 comments
Assignees
Labels
impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent Label for the Agent team v7.11.0 v7.11.1

Comments

@amolnater-qasource

Kibana version: Kibana: 7.11.0 BC-5 Cloud environment

Host OS and Browser version: Windows 10, All

Preconditions:

  1. 7.10.2 Cloud environment must be upgraded to 7.11.0.

Build Details:

    Artifact link used: https://staging.elastic.co/7.11.0-903dc0b6/downloads/beats/filebeat/filebeat-7.11.0-windows-x86_64.zip
    BUILD: 37827
    COMMIT: d801a7bb08e17368584d00fbd97a5d0006285b51

Steps to reproduce:

  1. Login to Kibana cloud environment.
  2. Install 7.11 agent using policy existing earlier on upgraded Kibana.
  3. Navigate to Fleet>agent "Logs" tab.
  4. Notice, agent is in "unhealthy" status.
  5. Observe, "error: Metricbeat FAILED" is displayed in Logs tab.

Expected Result:
7.11 agent must be deployed successfully with 7.10 default policy on upgraded kibana cloud environment(7.10.2-7.11.0).

Screenshots:
Upgrade

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 2, 2021
@amolnater-qasource amolnater-qasource added impact:high Short-term priority; add to current release, or definitely next. Team:Ingest Management v7.11.0 and removed needs_team Indicates that the issue/PR needs a Team:* label labels Feb 2, 2021
@elasticmachine
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@amolnater-qasource
Author

@manishgupta-qasource Please review. Thanks

@manishgupta-qasource

Reviewed & Assigned to @EricDavisX

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 2, 2021
@manishgupta-qasource manishgupta-qasource changed the title [Ingest Manager]: "error: Metricbeat FAILED" errors under Logs tab on enrolling 7.11 Windows agent with 7.10 default policy on upgraded kibana(7.10.2-7.11.0). "error: Metricbeat FAILED" errors under Logs tab on enrolling 7.11 Windows agent with 7.10 default policy on upgraded kibana(7.10.2-7.11.0). Feb 2, 2021
@EricDavisX
Contributor

@amolnater-qasource can you please confirm

  • which (specific) windows OS this is on?
  • And which version of the package is on the system you are testing - is it 0.10.7 or 0.10.8 of System?

It may be specific to the upgrade scenario; that remains to be seen. If you can also compare the versions and tell us if you see the error on a non-upgrade stack scenario, it would help us triage faster.

I acknowledge that in this particular case the bug relates to packages, which makes it harder. We are aware of the system metricset load failure bug. I can confirm that it may still be seen on some older systems, but we have not been able to find a cause, and so far we had all decided not to fix it or investigate further, so we can file a Docs bug. But if you are using Windows 10, it should probably work, and we need to follow up.

That may be causing the unhealthy agent, but maybe not. So, are the two issues related? We aren't sure.

About the ‘unhealthy’ agent, we are still seeing that in 7.11 and 8.0 - I think it is still a real bug, but is it critical? The hard part is that the state of the Agent as 'unhealthy' may be ‘correct’ initially - but after 5+ minutes we think it is likely a bug if things are working but it stays as such.

@andresrc andresrc added the Team:Elastic-Agent Label for the Agent team label Feb 2, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 2, 2021
@amolnater-qasource
Author

Hi, @EricDavisX
We have revalidated this issue on 7.11.0 Kibana cloud environment(upgraded from 7.10.2).

which (specific) windows OS this is on?

Tested on Windows 10 x64

And which version of the package is on the system you are testing - is it 0.10.7 or 0.10.8 of System?

After upgrading the Kibana cloud environment, System integration version: 0.10.7

if you see the error on a non-upgrade stack scenario

Scenario 1:
There is no "Metricbeat error" on non-upgrade stack scenario. This issue is only specific to Kibana upgrade.

Scenario 2:
When a 7.11 BC-5 agent is installed with pre-existing default policy( 7.10.2) after kibana upgrade.

What's not working:

  • "error: Metricbeat FAILED" errors under Logs tab.
  • No data for system integration under Data streams tab.
  • Permanent "Unhealthy" status.

Screenshots:
Upgrade

What's working:
When 7.11 BC-5 agent is installed with newly created policy after kibana upgrade.

Observations:

  • Data for system integration under datastreams tab.
  • No "Metricbeat" errors observed under Logs tab.
  • Agent status is "Healthy" throughout.

Please let us know if anything else is required.
Thanks
QAS

@EricDavisX
Contributor

Thank you @amolnater-qasource - that is very helpful.
@ph this is therefore going to be found by anyone using 7.10.x who upgrades to 7.11 and keeps their deployments / data and Agents. The work-around is to create a new policy and re-deploy Agent.

Every release we do I expect more users to try it and to keep it around and upgrade it - while we are still Beta it may be worth fixing for 7.11.1 if not the .0 release right now. And it would be good to know the technical problem we've run into, before we plan where the fix would go. Can we assign someone?

@ruflin FYI, too. And it is another Fleet impacting stack upgrade failure, justifying spending time on the upgrade test we started tracking recently.

@EricDavisX
Contributor

I confirmed this was seen as passing in 7.11 BC3 testing a few weeks ago - link: https://elastic.testrail.io/index.php?/runs/view/1005&group_by=cases:section_id&group_order=asc&group_id=9194

  • Not sure what has changed in roughly the last 4 weeks that relates to this.

@ph
Contributor

ph commented Feb 3, 2021

Let me try to understand; I think @ruflin could confirm this.

The scenario described here is the following.

  • Start with 7.10
  • Have the Agent enrolled into a Policy with the System integration.
  • Agents are healthy.
  • Upgrade to 7.11
  • Upgrade agent to 7.11
  • Agents become unhealthy.

What we need to identify here is:

  1. Is it a problem with Elastic Agent after the upgrade? Most likely.
  2. Is it a problem with the Integration itself? Less likely, because the content of the agent should be the same as before.
  3. Is it a problem in Fleet on upgrade of the package? Less likely.

@EricDavisX or @amolnater-qasource, can you add the Metricbeat log to this issue? In the Elastic Agent we see that Metricbeat is restarted; this means that the error is fatal and the agent tries to recover.

@michalpristas I've put this in investigate from triage and we can do it in the 7.13 iteration.
Anything you could think of here?

On impact, I agree with your evaluation for 7.11.1; I think it's too late for 7.11.0, and there is a workaround.

@ph
Contributor

ph commented Feb 3, 2021

@EricDavisX as bad as it looks, I would not consider this a blocker. Do we know if reassigning an agent to another Policy fixes the issue? I am trying to understand where the problem could be located.

@EricDavisX
Contributor

It looks to me like assigning to a new policy fixes it, indeed - I don't know about re-assigning to an existing policy. No data on that yet, but if the problem is related to the Integration version, then re-assigning to an existing policy wouldn't fix it.

@EricDavisX EricDavisX added v7.11.1 and removed v7.11.0 labels Feb 3, 2021
@ph
Contributor

ph commented Feb 3, 2021

We can try to reproduce, but to really speed up the investigation we need the following:

  1. Metricbeat logs; if it restarts every minute, the log should be noisy.
  2. The Agent Policy YAML from the Elastic Agent side (action_store.yml) and from the Kibana YAML export.
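
To pull the relevant lines out of a noisy Metricbeat log, a small filter along these lines may help. This is a minimal sketch, assuming plain-text log lines in which markers such as ERROR or FAILED appear literally; the function name `error_lines` and the default markers are hypothetical and not part of any Elastic tooling.

```python
from typing import Iterable, List, Sequence


def error_lines(lines: Iterable[str],
                markers: Sequence[str] = ("ERROR", "FAILED")) -> List[str]:
    """Return the lines (without trailing newline) that contain any marker.

    The markers are an assumption about the log format; adjust them to
    whatever the actual Metricbeat log uses.
    """
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in markers)
    ]
```

It can be applied to an open log file, for example `error_lines(open(path))`, where `path` is wherever the agent writes its Metricbeat log on the host under test.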

@amolnater-qasource
Author

amolnater-qasource commented Feb 5, 2021

Hi @ph
As per your feedback on #23812 (comment) , we have revalidated this issue.

System integration policy in the agent will still be the version 1

Yes, this is correct, as per our observations:

  • Default policy System integration remains v0.9.1 after Kibana upgrade(7.10.2->7.11.0)
  • Newly created policy is created with Latest system integration and gives no error with agent enrollment.

Screenshots:
Upgrade Issue

Kibana doesn't make it clear what version of an Integration is actually in a policy, and there isn't a way to upgrade it.

@fearful-symmetry, to check the version of the installed integrations in a policy in the Kibana UI, navigate to the policy's Actions button and select "View Policy". Please refer to the screenshot above.

Policies>default policy>"View Policy".

Further as per:

WORKAROUND: If you make any edit to the System integration policy it should update it to the latest version and fix the problem.

@ph we tried editing the System integration by renaming it under the default policy. However, we didn't observe any version change for the System integration, and there was no change in agent behaviour.

cc @EricDavisX
Please let us know if we have missed anything.

Thanks
QAS

ph added a commit that referenced this issue Feb 5, 2021
…sed directly in log (#23861)

* management.Status constants could not be used directly in log (#23849)

* management.Status constants could not be used directly in log

This add stringer generator to the Status const to allow them to be
understood by a human in log.

ref: #23812
(cherry picked from commit a52c744)
@fearful-symmetry
Contributor

@ph just realized that we don't seem to have any system for pushing to the equivalent of 9.x. Can we just push to package-storage directly?

@ph
Contributor

ph commented Feb 8, 2021

Not sure; better to check with @mtojek or @ycombinator?

@ycombinator
Contributor

just realized that we don't seem to have any system for pushing to the equivalent of 9.x.

I caught up on the thread but I'm a bit lost by the 9.x? @fearful-symmetry do you mind elaborating a bit on what you'd like to do?

@fearful-symmetry
Contributor

@ycombinator as in, if I wanted to push an update to 0.9.3 (as 0.9.4) as opposed to 0.10.9, how would I do that?

@ycombinator
Contributor

Chatted with @fearful-symmetry off-issue to understand what we want to achieve here. Long story short, we tried putting up a PR to create a 0.9.3 version of the system package with the condition change in the integrations repo but ran into some CI issues (which are to be expected, TBH): elastic/integrations#674.

An alternative that would work is to make a PR for the 0.9.3 version of the system package directly to the package-storage repo's snapshot branch but this has a couple of downsides:

  • we'd bypass all the checks that would normally happen on package changes in the integrations repo, and
  • we wouldn't have any record of the 0.9.3 changes in the integrations repo.

We were trying to avoid these downsides with elastic/integrations#674 but, as mentioned above, we're running into some issues with CI there.

@EricDavisX @ph how urgently is this change needed? If it could wait a day or so, I'd like to get @mtojek's thoughts on it here: elastic/integrations#674 (comment).

@EricDavisX
Contributor

My opinion - it can wait until tomorrow (Tuesday) for better discussion with Marcin.

@ph
Contributor

ph commented Feb 8, 2021

@ycombinator It can wait, the real deadline is when we ship 7.11.

@ycombinator
Contributor

Thanks @EricDavisX @ph. We might have a way forward (see elastic/integrations#674 (comment)) but I'd still like @mtojek to weigh in on that as well.

@ycombinator
Contributor

system-0.9.3 has been released. It is currently available in the snapshot package registry. I will leave it up to @fearful-symmetry or someone else on this thread to promote it to staging and production package registries as and when it's appropriate to do so.

Like its predecessor, system-0.9.2, the system-0.9.3 package is intended to work with Kibana ^7.10.0. But unlike system-0.9.2, the load dataset in system-0.9.3 will only work when ${host.platform} != 'windows'.
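
The effect of that condition can be illustrated with a short sketch. It only models the outcome (the load dataset is skipped on Windows hosts) and is not how Elastic Agent evaluates conditions; the `DATASETS` table, the choice of cpu and memory as unconditional examples, and the `enabled_datasets` helper are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

# A condition takes the host platform and says whether the dataset may run.
Condition = Callable[[str], bool]

# Hypothetical subset of System datasets; only `load` carries a condition,
# mirroring ${host.platform} != 'windows'.
DATASETS: Dict[str, Optional[Condition]] = {
    "cpu": None,
    "memory": None,
    "load": lambda platform: platform != "windows",
}


def enabled_datasets(platform: str) -> List[str]:
    """Return the dataset names whose condition (if any) holds for `platform`."""
    return [
        name
        for name, condition in DATASETS.items()
        if condition is None or condition(platform)
    ]
```

With this model, a Windows host never receives the load dataset, so Metricbeat is not asked to start a metricset it cannot run there.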

@fearful-symmetry
Contributor

0.9.3 is currently in the staging registry.

@ph
Contributor

ph commented Feb 9, 2021

thanks @fearful-symmetry and @ycombinator,

@EricDavisX we should be able to test the change in the upgrade scenario.

@EricDavisX
Contributor

@ph acknowledged. I'm not sure how to run a self-managed stack upgrade, but we can look for docs, and we can test the package in the snapshot repo. If it is good, we can then promote it to prod and test via cloud (the exact same scenario). We'll get it done.

@fearful-symmetry
Contributor

Yah, not sure how exactly we should test this either...

@EricDavisX
Contributor

The test team knows how to (theoretically) run a self-managed upgrade, but none of us are practiced at it. They attempted it and ran into errors, so while we triage and track that, we can just test the package and, if it passes muster, run the upgrade test via cloud as we did before, after merging the integration up.

I'll broadcast test progress as we get it and we can coordinate here.

@EricDavisX
Contributor

I believe the team has tested; allow me to validate, and then we can promote the package (and test again on 7.10).

@amolnater-qasource
Author

amolnater-qasource commented Feb 16, 2021

Hi @EricDavisX
We have revalidated this issue on self managed stack upgraded to 7.11.1 BC-2 Kibana environment(from 7.10.2 Snapshot) and found it still reproducible.

Steps followed:

  1. Deploy self managed 7.10.2 Snapshot Kibana environment with below xpack entry in kibana.yml
    xpack.ingestManager.registryUrl: "http://epr-snapshot.elastic.co"
  2. Observe System integration version 0.9.2 is shown up on accessing Fleet UI.
  3. Upgrade 7.10.2 Snapshot Kibana to 7.11.1 by replacing config & data folders with older folders present under extracted elasticsearch & kibana folders.
  4. Re-run elasticsearch.bat and kibana.bat files.

Observations:

  • Error message "error: Metricbeat Failed" is shown in Agent logs in Fleet UI when 7.11.1 agent is installed with pre-existing default policy.
  • Pre-existing Default policy still has system integration version 0.9.1 (no impact from the xpack entry).

Further, on updating the xpack entry to xpack.fleet.registryUrl: "http://epr-snapshot.elastic.co" for the upgraded 7.11.1 Kibana,
we observed security errors under the Fleet>Agents tab.

Screenshot:
4

Build details:

Artifact link used: https://staging.elastic.co/7.11.1-bc1b1b1c/summary-7.11.1.html
Build: 37941
Commit: 348233f89825946d65101f6d9082567353459b8e

Note:
This issue is not occurring with Linux and mac agents on upgraded 7.11.1 Kibana cloud environment using previously existing policy (from 7.10.2).

Thanks
QAS

@EricDavisX
Contributor

The above description is helpful - it highlights a few things we can follow up on, and a significant failure to provide you with all of the steps required and working code to prove out the changes. There is a lot going on, actually, but I'll try to itemize clearly what I see and who can help follow up.

  1. It is reported that the 0.9.1 System package was baked in to the Default policy, this is most likely explained by the Fleet 'setup' call being done before the 'snapshot' xpack kibana.yml value was changed, and so the production EPR storage was still in use and it points to 0.9.1 for now. We need to run a test and ensure that we do NOT start Kibana before changing the value (on an entirely fresh system).

... And so in this case, if it still fails to bake in 0.9.2 (or 0.9.3) to the Default policy, this is a bug in Kibana Fleet side... which would seem to indicate Kibana is querying the prod storage to set up policy even though there is a custom registry url set. Custom Registry is not a feature we support now but it is in the future plans so would be good to document (and indeed we are using it during dev to test it out). Now that we know about this we can modify our testing by creating a new policy to use as our 'base' test before we upgrade, we should ensure the new policy has the new package we want to test. Let me log this separately as a follow up, here:
elastic/kibana#91502

  2. Package storage 'snapshot' branch of the repo is in use, and yet the 0.9.2 System package is still showing in the Kibana UI instead of 0.9.3. The 0.9.3 version is available in the actual registry API call:
    https://epr-snapshot.elastic.co/search?package=system&all=true

when I look in the storage 'snapshot' branch...
https://github.com/elastic/package-storage/tree/snapshot/packages/system
For me this shows only 0.9.2, but a change was just merged around 8:30 AM today removing 0.9.3 in order to promote it to 'staging' storage, so I'm not sure where our process is breaking down here... the snapshot repo should have

the snapshot manifest shows the 0.9.3 package is tied to 7.10, as it should be:
https://github.com/elastic/package-storage/blob/snapshot/packages/system/0.9.3/manifest.yml#L13

and I confirmed we rolled out of the package storage cluster with 'snapshot' before testing on Feb 15th as:
https://beats-ci.elastic.co/job/Ingest-manager/job/release-distribution/47/

perhaps this is a caching issue? Or the environment was not
I have split this off as a separate concern here:
elastic/kibana#91505
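
The registry check described above could be scripted roughly as follows, to compare what the registry serves with what Kibana shows. This is a sketch only: it assumes the search endpoint returns a JSON array of objects with `name` and `version` fields and that versions are purely numeric dotted strings; `package_versions` and `fetch_packages` are hypothetical helper names.

```python
import json
from typing import List
from urllib.request import urlopen


def package_versions(packages: List[dict], name: str) -> List[str]:
    """Return the versions of package `name`, sorted numerically.

    Sorting on integer tuples keeps 0.10.7 after 0.9.3, which a plain
    string sort would get wrong.
    """
    versions = [p["version"] for p in packages if p.get("name") == name]
    return sorted(versions, key=lambda v: tuple(int(part) for part in v.split(".")))


def fetch_packages(url: str) -> List[dict]:
    """Fetch and decode the registry search response (assumed to be a JSON array)."""
    with urlopen(url) as response:
        return json.load(response)
```

For example, `package_versions(fetch_packages("https://epr-snapshot.elastic.co/search?package=system&all=true"), "system")` would list the System versions the snapshot registry serves.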

Since the 0.9.3 System package didn't show up in Kibana per your test, even though we thought it should have, we'll need to re-run from the very beginning once the 0.9.3 'showing up in Kibana' problem is solved. @amolnater-qasource @dikshachauhan-qasource FYI - we want to get the new 0.9.3 System package set up in a policy so we can run the detailed fields test with it in 7.10.x against Linux, Windows, and macOS to validate the package. We can report back separately, outside of this ticket please (to keep scope contained). This ticket can remain open for the usage of the conditionals in the System package, which we hope will avoid the 'Metricbeat FAILED' errors in the logs and any problems with the running processes. Progress is blocked until we can get # 2 above resolved.

@ph do we have anyone available to help with # 2 first, and then # 1 above, perhaps @skh knows this area fairly well?

@EricDavisX
Contributor

@amolnater-qasource if you have the disable.TLS setting in place in your kibana.yml and you are seeing the error
image

  • Please log it as a separate bug blocking our progress here. Thank you for noting it, but it is helpful not to make the comment log any longer than it has to be.

@ph
Contributor

ph commented Feb 16, 2021

@EricDavisX thanks for the update. I also checked on my side that the system package on the snapshot is indeed 0.9.3, so maybe it's a Fleet EPM issue; @skh or @jfsiii should be able to help with that area of code.

@amolnater-qasource
Author

Hi @EricDavisX
We have revalidated this issue on self-managed stack upgraded to 7.11.1 BC-3 Kibana environment(from 7.10.3 Snapshot) and found it fixed for self-managed stack upgrade.

Steps followed:

  1. Deploy self managed 7.10.3 Snapshot Kibana environment with below xpack entry in kibana.yml
    xpack.ingestManager.registryUrl: "http://epr-snapshot.elastic.co"
  2. Observe System integration version 0.9.3 is shown up on accessing Fleet UI.
  3. Create a new policy having system integration version 0.9.3
  4. Upgrade 7.10.3 Snapshot Kibana to 7.11.1 by replacing config & data folders with older folders present under extracted elasticsearch & kibana folders.
  5. Re-run elasticsearch.bat and kibana.bat files.
  6. Install 7.11.1 BC-3 agent with new policy(system version 0.9.3)

Observations:

  • No "error: Metricbeat Failed" observed under agent Logs tab.
  • Agent status is "Healthy".

Screenshots:
Upgrade

We will close this issue as soon as it is available for cloud build.
Please let us know if anything else is required.

Thanks
QAS

@EricDavisX
Contributor

ok great, so we will pr and push system 0.9.3 to prod

@EricDavisX
Contributor

it is ready for test on 7.10.x latest shipped version!

@amolnater-qasource
Author

Hi @EricDavisX
We have revalidated this issue on upgraded 7.11.1 Kibana cloud environment(from 7.10.2 release build) and found this issue fixed. Hence closing this out.

Screenshot:
6

Please let us know if anything else is required.
Thanks
QAS


10 participants