"error: Metricbeat FAILED" errors under Logs tab on enrolling 7.11 Windows agent with 7.10 default policy on upgraded kibana(7.10.2-7.11.0). #23812

amolnater-qasource · 2021-02-02T14:00:46Z

Kibana version: 7.11.0 BC-5 Cloud environment

Host OS and Browser version: Windows 10, All

Preconditions:

7.10.2 Cloud environment must be upgraded to 7.11.0.

Build Details:

    Artifact link used: https://staging.elastic.co/7.11.0-903dc0b6/downloads/beats/filebeat/filebeat-7.11.0-windows-x86_64.zip
    BUILD: 37827
    COMMIT: d801a7bb08e17368584d00fbd97a5d0006285b51

Steps to reproduce:

Login to Kibana cloud environment.
Install 7.11 agent using policy existing earlier on upgraded Kibana.
Navigate to Fleet>agent "Logs" tab.
Notice, agent is in "unhealthy" status.
Observe, "error: Metricbeat FAILED" is displayed in Logs tab.

Expected Result:
7.11 agent must be deployed successfully with 7.10 default policy on upgraded kibana cloud environment(7.10.2-7.11.0).

Screenshots:

elasticmachine · 2021-02-02T14:01:37Z

Pinging @elastic/ingest-management (Team:Ingest Management)

amolnater-qasource · 2021-02-02T14:02:33Z

@manishgupta-qasource Please review. Thanks

manishgupta-qasource · 2021-02-02T14:35:24Z

Reviewed & Assigned to @EricDavisX

EricDavisX · 2021-02-02T14:59:38Z

@amolnater-qasource can you please confirm

which (specific) windows OS this is on?
And which version of the package is on the system you are testing - is it 0.10.7 or 0.10.8 of System?

It may be specific to the upgrade scenario. That remains to be seen. If you can also compare the versions and if you see the error on a non-upgrade stack scenario it would be helpful for us to triage faster.

I acknowledge that in this particular case the bug relates to packages, which makes it harder. We are aware of the system metricset load failure bug. I can confirm it may still be seen on some older systems; we have not been able to find a cause, and had all decided thus far not to fix it or investigate further, so we can put a Docs bug in. But if you are using Windows 10, it should probably work, and we need to follow up.

That may be causing the unhealthy agent, but maybe not. So, are the two issues related? We aren't sure.

About the 'unhealthy' agent: we are still seeing that in 7.11 and 8.0. I think it is still a real bug, but is it critical? The hard part is that the Agent's 'unhealthy' state may be correct initially, but if it stays that way after 5+ minutes while things are working, we think it is likely a bug.
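
For illustration only, the "unhealthy past a grace period" rule described above can be written as a small check. This is a minimal sketch, assuming the 5-minute threshold mentioned in this comment; the function and constant names are hypothetical and are not part of Fleet or Elastic Agent.

```python
from datetime import datetime, timedelta

# Assumption: the 5-minute threshold is taken from the comment above.
# GRACE_PERIOD and unhealthy_is_suspicious are hypothetical names.
GRACE_PERIOD = timedelta(minutes=5)


def unhealthy_is_suspicious(status: str, enrolled_at: datetime, now: datetime) -> bool:
    """Return True when an agent is still 'unhealthy' once the grace period has passed."""
    if status != "unhealthy":
        return False
    return now - enrolled_at >= GRACE_PERIOD
```

An agent that reports "unhealthy" two minutes after enrollment would not be flagged by this check; one that still reports it six minutes later would.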

elasticmachine · 2021-02-02T19:03:40Z

Pinging @elastic/agent (Team:Agent)

amolnater-qasource · 2021-02-03T13:09:56Z

Hi, @EricDavisX
We have revalidated this issue on 7.11.0 Kibana cloud environment(upgraded from 7.10.2).

which (specific) windows OS this is on?

Tested on Windows 10 x64

And which version of the package is on the system you are testing - is it 0.10.7 or 0.10.8 of System?

After upgrading the Kibana cloud environment, System integration version: 0.10.7

if you see the error on a non-upgrade stack scenario

Scenario 1:
There is no "Metricbeat error" on non-upgrade stack scenario. This issue is only specific to Kibana upgrade.

Scenario 2:
When a 7.11 BC-5 agent is installed with pre-existing default policy( 7.10.2) after kibana upgrade.

What's not working:

"error: Metricbeat FAILED" errors under Logs tab.
No data for system integration under Data streams tab.
Permanent "Unhealthy" status.

Screenshots:

What's working:
When 7.11 BC-5 agent is installed with newly created policy after kibana upgrade.

Observations:

Data for system integration under datastreams tab.
No "Metricbeat" errors observed under Logs tab.
Agent status is "Healthy" throughout.

Please let us know if anything else is required.
Thanks
QAS

EricDavisX · 2021-02-03T15:44:31Z

Thank you @amolnater-qasource - that is very helpful.
@ph this is therefore going to be found by anyone using 7.10.x who upgrades to 7.11 and keeps their deployments / data and Agents. The work-around is to create a new policy and re-deploy Agent.

Every release we do I expect more users to try it and to keep it around and upgrade it - while we are still Beta it may be worth fixing for 7.11.1 if not the .0 release right now. And it would be good to know the technical problem we've run into, before we plan where the fix would go. Can we assign someone?

@ruflin FYI, too. And it is another Fleet impacting stack upgrade failure, justifying spending time on the upgrade test we started tracking recently.

EricDavisX · 2021-02-03T16:18:37Z

I confirmed this was seen as passing in 7.11 BC3 testing a few weeks ago. Link: https://elastic.testrail.io/index.php?/runs/view/1005&group_by=cases:section_id&group_order=asc&group_id=9194

Not sure what changed in approximately the last 4 weeks that relates?

ph · 2021-02-03T16:28:04Z

Let me try to understand; I think @ruflin could confirm this.

The scenario described here is the following.

Start with 7.10
Have the Agent enrolled into a Policy with the System integration.
Agents are healthy.
Upgrade to 7.11
Upgrade agent to 7.11
Agents become unhealthy.

What we need to identify here is:

Is it a problem with Elastic Agent after the upgrade? Most likely?
Is it a problem with the Integration itself? Less likely, because the content of the agent should be the same as before.
Is it a problem in Fleet on upgrade of the package? Less likely.

@EricDavisX or @amolnater-qasource, can you add the Metricbeat log to this issue? In the Elastic Agent we see that Metricbeat is restarted; this means that the error is fatal and the agent tries to recover.

@michalpristas I've put this in investigate from triage and we can do it in the 7.13 iteration.
Anything you could think of here?

Impact: I agree with your evaluation for 7.11.1. I think it's too late for 7.11.0, and there is a workaround.

ph · 2021-02-03T16:36:44Z

@EricDavisX as bad as it looks, I would not consider this a blocker. Do we know if reassigning an agent to another Policy fixes the issue? I am trying to understand where the problem could be located.

EricDavisX · 2021-02-03T16:38:09Z

it looks to me like assigning to a new policy fixes it, indeed - I don't know about re-assigning to an existing policy. No data on that yet, but if the problem is related to the Integration version then an existing policy wouldn't fix it.

ph · 2021-02-03T16:40:27Z

We can try to reproduce, but to really speed up the investigation we need the following:

Metricbeat logs, if it restarts every minute the log should be noisy.
Would be great to see the Agent Policy YAML from the Elastic Agent side (action_store.yml) and from Kibana export yaml.

amolnater-qasource · 2021-02-05T12:36:10Z

Hi @ph
As per your feedback on #23812 (comment), we have revalidated this issue.

System integration policy in the agent will still be the version 1

Yes, this is correct, as per our observations:

Default policy System integration remains v0.9.1 after Kibana upgrade(7.10.2->7.11.0)
Newly created policy is created with Latest system integration and gives no error with agent enrollment.

Screenshots:

Kibana doesn't make it clear what version of an Integration is actually in a policy, and there isn't a way to upgrade it.

For this, @fearful-symmetry, to check the version of installed integrations in a policy in the Kibana UI:
We can navigate to the policy's Actions button and select "View Policy". Please refer to the screenshot above.

Policies>default policy>"View Policy".

Further as per:

WORKAROUND: If you make any edit to the System integration policy it should update it to the latest version and fix the problem.

@ph we have attempted to update the name of System integration by renaming it under default policy. However, we didn't observe any version change for System integration. Also, no change in agent behaviour.

cc @EricDavisX
Please let us know if we have missed anything.

Thanks
QAS

…sed directly in log (#23861) * management.Status constants could not be used directly in log (#23849) * management.Status constants could not be used directly in log This add stringer generator to the Status const to allow them to be understood by a human in log. ref: #23812 (cherry picked from commit a52c744)

fearful-symmetry · 2021-02-08T17:27:55Z

@ph just realized that we don't seem to have any system for pushing to the equivalent of 9.x. Can we just push to package-storage directly?

ph · 2021-02-08T18:10:16Z

Not sure; better to check with @mtojek or @ycombinator?

ycombinator · 2021-02-08T18:24:58Z

just realized that we don't seem to have any system for pushing to the equivalent of 9.x.

I caught up on the thread but I'm a bit lost by the 9.x? @fearful-symmetry do you mind elaborating a bit on what you'd like to do?

fearful-symmetry · 2021-02-08T18:48:09Z

@ycombinator as in, if I wanted to push an update to 0.9.3 (as 0.9.4) as opposed to 0.10.9, how would I do that?

ycombinator · 2021-02-08T19:49:04Z

Chatted with @fearful-symmetry off-issue to understand what we want to achieve here. Long story short, we tried putting up a PR to create a 0.9.3 version of the system package with the condition change in the integrations repo but ran into some CI issues (which are to be expected, TBH): elastic/integrations#674.

An alternative that would work is to make a PR for the 0.9.3 version of the system package directly to the package-storage repo's snapshot branch but this has a couple of downsides:

we'd bypass all the checks that would normally happen on package changes in the integrations repo, and
we wouldn't have any record of the 0.9.3 changes in the integrations repo.

We were trying to avoid these downsides with elastic/integrations#674 but, as mentioned above, we're running into some issues with CI there.

@EricDavisX @ph how urgently is this change needed? If it could wait a day or so, I'd like to get @mtojek's thoughts on it here: elastic/integrations#674 (comment).

EricDavisX · 2021-02-08T20:41:47Z

My opinion - it can wait until tomorrow (Tuesday) for better discussion with Marcin.

ph · 2021-02-08T20:46:02Z

@ycombinator It can wait, the real deadline is when we ship 7.11.

ycombinator · 2021-02-08T21:20:31Z

Thanks @EricDavisX @ph. We might have a way forward (see elastic/integrations#674 (comment)) but I'd still like @mtojek to weigh in on that as well.

ycombinator · 2021-02-09T14:39:57Z

system-0.9.3 has been released. It is currently available in the snapshot package registry. I will leave it up to @fearful-symmetry or someone else on this thread to promote it to staging and production package registries as and when it's appropriate to do so.

Like its predecessor, system-0.9.2, the system-0.9.3 package is intended to work with Kibana ^7.10.0. But unlike system-0.9.2, the load dataset in system-0.9.3 will only work when ${host.platform} != 'windows'.
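
To illustrate what the new condition does, here is a minimal sketch of gating a dataset on the host platform. The dataset table and helper names are hypothetical, and this is not how Elastic Agent evaluates conditions internally; it only models the `${host.platform} != 'windows'` rule described above.

```python
from typing import Callable, Dict, List, Optional


def not_windows(platform: str) -> bool:
    """Model of the condition ${host.platform} != 'windows'."""
    return platform != "windows"


# Hypothetical dataset table: a value of None means "no condition, always enabled".
DATASETS: Dict[str, Optional[Callable[[str], bool]]] = {
    "cpu": None,
    "memory": None,
    "load": not_windows,  # only the load dataset carries the platform condition
}


def enabled_datasets(platform: str) -> List[str]:
    """Return the dataset names whose condition holds for the given platform."""
    return [
        name
        for name, condition in DATASETS.items()
        if condition is None or condition(platform)
    ]
```

Under this model a Windows host simply never receives the load dataset, while Linux and macOS hosts keep it.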

fearful-symmetry · 2021-02-09T16:57:39Z

0.9.3 is currently in the staging registry.

ph · 2021-02-09T20:10:32Z

thanks @fearful-symmetry and @ycombinator,

@EricDavisX we should be able to test the change in the upgrade scenario.

EricDavisX · 2021-02-10T00:48:44Z

@ph acknowledged. I'm not sure how to run a self-managed stack upgrade, but we can look for docs, and we can test the package in the snapshot repo. If it is good, we can then promote it to prod and test via cloud (the exact same scenario). We'll get it done.

fearful-symmetry · 2021-02-10T16:49:57Z

Yah, not sure how exactly we should test this either...

EricDavisX · 2021-02-10T18:14:36Z

The test team knows how to (theoretically) run a self-managed upgrade, but none of us are practiced at it. They attempted it and ran into errors, so while we triage and track that, we can just test the package and, if it passes muster, run the upgrade test via cloud as we did before, after merging the integration up.

I'll broadcast test progress as we get it and we can coordinate here.

EricDavisX · 2021-02-11T17:33:29Z

I believe the team has tested; allow me to validate, and then we can promote the package (and test again on 7.10).

amolnater-qasource · 2021-02-16T07:21:24Z

Hi @EricDavisX
We have revalidated this issue on self managed stack upgraded to 7.11.1 BC-2 Kibana environment(from 7.10.2 Snapshot) and found it still reproducible.

Steps followed:

Deploy self managed 7.10.2 Snapshot Kibana environment with below xpack entry in kibana.yml
xpack.ingestManager.registryUrl: "http://epr-snapshot.elastic.co"
Observe System integration version 0.9.2 is shown up on accessing Fleet UI.
Upgrade 7.10.2 Snapshot Kibana to 7.11.1 by replacing config & data folders with older folders present under extracted elasticsearch & kibana folders.
Re-run elasticsearch.bat and kibana.bat files.

Observations:

Error message "error: Metricbeat Failed" is shown in Agent logs in Fleet UI when 7.11.1 agent is installed with pre-existing default policy.
Pre-existing Default policy is still having system integration version as 0.9.1 (No impact of xpack entry).

Further, on updating the xpack entry to xpack.fleet.registryUrl: "http://epr-snapshot.elastic.co" for the upgraded 7.11.1 Kibana, we observed security errors under the Fleet>Agents tab.
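
As a rough aid for the two kibana.yml entries used in these steps, the sketch below picks the registry URL setting by Kibana version. It assumes, based only on the steps in this comment, that 7.10.x reads xpack.ingestManager.registryUrl and 7.11.x reads xpack.fleet.registryUrl; the helper name is hypothetical.

```python
def registry_url_key(kibana_version: str) -> str:
    """Return the kibana.yml key for the package registry URL.

    Assumption (inferred from this thread, not from Kibana docs): the key
    changed from xpack.ingestManager.registryUrl to xpack.fleet.registryUrl
    starting with 7.11.
    """
    major, minor = (int(part) for part in kibana_version.split(".")[:2])
    if (major, minor) >= (7, 11):
        return "xpack.fleet.registryUrl"
    return "xpack.ingestManager.registryUrl"


def registry_url_line(kibana_version: str, url: str) -> str:
    """Render the single kibana.yml line for the given version and URL."""
    return f'{registry_url_key(kibana_version)}: "{url}"'
```

If that assumption holds, an upgraded stack that keeps its old config folder may still carry the 7.10 key, which could explain the "No impact of xpack entry" observation.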

Screenshot:

Build details:

Artifact link used: https://staging.elastic.co/7.11.1-bc1b1b1c/summary-7.11.1.html
Build: 37941
Commit: 348233f89825946d65101f6d9082567353459b8e

Note:
This issue is not occurring with Linux and mac agents on upgraded 7.11.1 Kibana cloud environment using previously existing policy (from 7.10.2).

Thanks
QAS

EricDavisX · 2021-02-16T15:42:40Z

The above description is helpful. It highlights a few things we can follow up on, and a significant failure to provide you with all of the required steps and working code to prove out the changes. There is a lot going on, actually, but I'll try to itemize clearly what I see and who can help follow up.

1. It is reported that the 0.9.1 System package was baked into the Default policy. This is most likely explained by the Fleet 'setup' call being made before the 'snapshot' xpack value in kibana.yml was changed, so the production EPR storage was still in use, and it points to 0.9.1 for now. We need to run a test and ensure that we do NOT start Kibana before changing the value (on an entirely fresh system).

... And so in this case, if it still fails to bake 0.9.2 (or 0.9.3) into the Default policy, this is a bug on the Kibana Fleet side, which would seem to indicate Kibana is querying the prod storage to set up the policy even though a custom registry URL is set. Custom Registry is not a feature we support now, but it is in the future plans, so it would be good to document (and indeed we are using it during dev to test it out). Now that we know about this, we can modify our testing by creating a new policy to use as our 'base' test before we upgrade; we should ensure the new policy has the new package we want to test. Let me log this separately as a follow-up, here:
elastic/kibana#91502

2. The package storage 'snapshot' branch of the repo is in use, and yet the 0.9.2 System package is still showing in the Kibana UI instead of 0.9.3. The 0.9.3 version is available in the actual registry API call:
https://epr-snapshot.elastic.co/search?package=system&all=true
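
For anyone who wants to script this check instead of opening the URL by hand, here is a minimal sketch. It assumes the search endpoint returns a JSON array of objects with `name` and `version` fields, which is an assumption about the response shape; the helper names and the sample entries are hypothetical.

```python
from urllib.parse import urlencode


def search_url(base_url: str, package: str) -> str:
    """Build the registry search URL used above (package=<name>&all=true)."""
    query = urlencode({"package": package, "all": "true"})
    return f"{base_url}/search?{query}"


def version_key(version: str) -> tuple:
    """Compare versions numerically so that 0.10.7 sorts after 0.9.3."""
    return tuple(int(part) for part in version.split("."))


def available_versions(entries: list, package: str) -> list:
    """Return the sorted, de-duplicated versions listed for one package."""
    versions = {e["version"] for e in entries if e.get("name") == package}
    return sorted(versions, key=version_key)


# Hypothetical sample of a parsed response, for illustration only.
sample = [
    {"name": "system", "version": "0.10.7"},
    {"name": "system", "version": "0.9.2"},
    {"name": "system", "version": "0.9.3"},
    {"name": "example", "version": "1.0.0"},
]
```

In practice the entries would come from fetching `search_url(...)` with `urllib.request.urlopen` and parsing the body with `json.load`; checking `"0.9.3" in available_versions(entries, "system")` then answers whether the registry is serving the new package.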

when I look in the storage 'snapshot' branch...
https://github.com/elastic/package-storage/tree/snapshot/packages/system
this shows me only 0.9.2, but it was just merged around 8:30 AM today that 0.9.3 would be removed to promote it to 'staging' storage. So I'm not sure where our process is breaking down here... the snapshot repo should have

the snapshot manifest shows the 0.9.3 package is tied to 7.10, as it should be:
https://github.com/elastic/package-storage/blob/snapshot/packages/system/0.9.3/manifest.yml#L13
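
As a reminder of what a caret constraint such as ^7.10.0 (mentioned earlier in this thread) admits, here is a minimal sketch of the usual semver caret rule for a major version of 1 or higher: same major, and not older than the base. It handles plain x.y.z strings only, and the function names are hypothetical rather than anything Kibana exposes.

```python
def parse_version(version: str) -> tuple:
    """Parse a plain 'x.y.z' string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies_caret(version: str, base: str) -> bool:
    """Check `version` against `^base` for a base with major >= 1.

    Caret ranges allow changes that do not modify the left-most non-zero
    component, so ^7.10.0 means >=7.10.0 and <8.0.0.
    """
    v, b = parse_version(version), parse_version(base)
    return v[0] == b[0] and v >= b
```

Under this rule a package constrained to ^7.10.0 remains eligible on 7.11.x as well as 7.10.x, but not on 7.9.x or 8.x.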

and I confirmed we rolled out the package storage cluster with 'snapshot' before testing on Feb 15th, as:
https://beats-ci.elastic.co/job/Ingest-manager/job/release-distribution/47/

perhaps this is a caching issue? Or the environment was not
I have split this off as a separate concern here:
elastic/kibana#91505

Since the 0.9.3 System package didn't show up in Kibana per your test, even though we thought it should have, we'll need to re-run from the very beginning once the '0.9.3 showing up in Kibana' problem is solved. @amolnater-qasource @dikshachauhan-qasource FYI: we want to get the new 0.9.3 System package set up in a policy. We can then run the detailed fields test with it in 7.10.x against Linux, Windows, and macOS to validate the package; please report back separately, outside of this ticket (to keep scope contained). This ticket can remain open for the usage of the conditionals in the System package, which we hope will avoid the 'Metricbeat FAILED' errors in the logs and any problems with the running processes. Progress is blocked until we can get # 2 above resolved.

@ph do we have anyone available to help with # 2 first, and then # 1 above, perhaps @skh knows this area fairly well?

EricDavisX · 2021-02-16T15:52:17Z

@amolnater-qasource if you have the disable.TLS setting in place in your kibana.yml and you are still seeing the error, please log it as a separate bug that blocks our progress here. Thank you for noting it, but it helps not to make the comment log any longer than it has to be.

ph · 2021-02-16T16:01:18Z

@EricDavisX thanks for the update. I also checked on my side that the System package on the snapshot is indeed 0.9.3, so maybe it's a Fleet EPM issue. @skh or @jfsiii should be able to help with this area of the code.

amolnater-qasource · 2021-02-17T09:49:19Z

Hi @EricDavisX
We have revalidated this issue on self-managed stack upgraded to 7.11.1 BC-3 Kibana environment(from 7.10.3 Snapshot) and found it fixed for self-managed stack upgrade.

Steps followed:

Deploy self managed 7.10.3 Snapshot Kibana environment with below xpack entry in kibana.yml
xpack.ingestManager.registryUrl: "http://epr-snapshot.elastic.co"
Observe System integration version 0.9.3 is shown up on accessing Fleet UI.
Create a new policy having system integration version 0.9.3
Upgrade 7.10.3 Snapshot Kibana to 7.11.1 by replacing config & data folders with older folders present under extracted elasticsearch & kibana folders.
Re-run elasticsearch.bat and kibana.bat files.
Install 7.11.1 BC-3 agent with new policy(system version 0.9.3)

Observations:

No "error: Metricbeat Failed" observed under agent Logs tab.
Agent status is "Healthy".

Screenshots:

We will close this issue as soon as it is available for cloud build.
Please let us know if anything else is required.

Thanks
QAS

EricDavisX · 2021-02-17T17:43:58Z

ok great, so we will pr and push system 0.9.3 to prod

EricDavisX · 2021-02-18T00:05:06Z

it is ready for test on 7.10.x latest shipped version!

amolnater-qasource · 2021-02-18T13:46:33Z

Hi @EricDavisX
We have revalidated this issue on upgraded 7.11.1 Kibana cloud environment(from 7.10.2 release build) and found this issue fixed. Hence closing this out.

Screenshot:

Please let us know if anything else is required.
Thanks
QAS

amolnater-qasource assigned manishgupta-qasource Feb 2, 2021

botelastic bot added the needs_team label Feb 2, 2021

amolnater-qasource added the impact:high, Team:Ingest Management, and v7.11.0 labels and removed the needs_team label Feb 2, 2021

manishgupta-qasource assigned EricDavisX Feb 2, 2021

manishgupta-qasource removed the Team:Ingest Management label Feb 2, 2021

botelastic bot added the needs_team label Feb 2, 2021

andresrc added the Team:Elastic-Agent label Feb 2, 2021

botelastic bot removed the needs_team label Feb 2, 2021

EricDavisX assigned ph and unassigned EricDavisX and manishgupta-qasource Feb 3, 2021

ph assigned michalpristas Feb 3, 2021

EricDavisX added v7.11.1 and removed v7.11.0 labels Feb 3, 2021

EricDavisX added the v7.11.0 label Feb 10, 2021

amolnater-qasource mentioned this issue Feb 11, 2021

[Fleet] Upgrade to 7.11 failed elastic/kibana#90984

Closed

2 tasks

amolnater-qasource closed this as completed Feb 18, 2021