"error: Metricbeat FAILED" errors under Logs tab on enrolling 7.11 Windows agent with 7.10 default policy on upgraded kibana(7.10.2-7.11.0). #23812
Comments
Pinging @elastic/ingest-management (Team:Ingest Management) |
@manishgupta-qasource Please review. Thanks |
Reviewed & Assigned to @EricDavisX |
@amolnater-qasource can you please confirm
It may be specific to the upgrade scenario; that remains to be seen. If you can also compare the versions and check whether the error appears on a non-upgrade stack, it would help us triage faster. I acknowledge that in this particular case the bug relates to packages, which makes it harder. We are aware of the system metricset load failure bug. I can confirm it may still be seen on some older systems, but we have not been able to find a cause and had decided thus far not to fix it or investigate further, so we can file a Docs bug. On Windows 10, though, it should probably work, so we need to follow up. That may be causing the unhealthy agent, but maybe not. Are the two issues related? We aren't sure. As for the 'unhealthy' agent, we are still seeing that in 7.11 and 8.0. I think it is still a real bug, but is it critical? The hard part is that the Agent state of 'unhealthy' may be correct initially, but if things are working and it stays that way after 5+ minutes, we think it is likely a bug. |
Pinging @elastic/agent (Team:Agent) |
Hi, @EricDavisX
Tested on Windows 10 x64
After upgrading the Kibana cloud environment, System integration version: 0.10.7
Scenario 1: Scenario 2: What's not working:
What's working: Observations:
Please let us know if anything else is required. |
Thank you @amolnater-qasource - that is very helpful. Every release we do I expect more users to try it and to keep it around and upgrade it - while we are still Beta it may be worth fixing for 7.11.1 if not the .0 release right now. And it would be good to know the technical problem we've run into, before we plan where the fix would go. Can we assign someone? @ruflin FYI, too. And it is another Fleet impacting stack upgrade failure, justifying spending time on the upgrade test we started tracking recently. |
I confirmed this was seen as passing in 7.11 BC3 testing a few weeks ago. Link: https://elastic.testrail.io/index.php?/runs/view/1005&group_by=cases:section_id&group_order=asc&group_id=9194
|
Let me try to understand; I think @ruflin could confirm this. The scenario described here is the following.
What we need to identify here is:
@EricDavisX or @amolnater-qasource, can you add the Metricbeat log to this issue? In the Elastic Agent we see that Metricbeat is restarted, which means that the error is fatal and the agent tries to recover. @michalpristas I've moved this from triage to investigate and we can do it in the 7.13 iteration. On impact, I agree with your evaluation for 7.11.1; I think it's too late for 7.11.0 and there is a workaround. |
@EricDavisX as bad as it looks, I would not consider this a blocker. Do we know if reassigning an agent to another policy fixes the issue? I am trying to understand where the problem could be located. |
It looks to me like assigning to a new policy fixes it, indeed. I don't know about re-assigning to an existing policy. No data on that yet, but if the problem is related to the Integration version then an existing policy wouldn't fix it. |
We can try to reproduce but to really speed up the investigation we need this.
|
Hi @ph
Yes, this is correct, as per our observations:
For this, @fearful-symmetry to check the version of installed integrations in a policy on Kibana UI : Policies>default policy>"View Policy". Further as per:
@ph we have attempted to update the name of System integration by renaming it under default policy. However, we didn't observe any version change for System integration. Also, no change in agent behaviour. cc @EricDavisX Thanks |
…sed directly in log (#23861) * management.Status constants could not be used directly in log (#23849) * management.Status constants could not be used directly in log This add stringer generator to the Status const to allow them to be understood by a human in log. ref: #23812 (cherry picked from commit a52c744)
@ph just realized that we don't seem to have any system for pushing to the equivalent of |
Not sure better to check with @mtojek or @ycombinator ? |
I caught up on the thread but I'm a bit lost by the |
@ycombinator as in, if I wanted to push an update to |
Chatted with @fearful-symmetry off-issue to understand what we want to achieve here. Long story short, we tried putting up a PR to create a An alternative that would work is to make a PR for the
We were trying to avoid these downsides with elastic/integrations#674 but, as mentioned above, we're running into some issues with CI there. @EricDavisX @ph how urgently is this change needed? If it could wait a day or so, I'd like to get @mtojek's thoughts on it here: elastic/integrations#674 (comment). |
My opinion - it can wait until tomorrow (Tuesday) for better discussion with Marcin. |
@ycombinator It can wait, the real deadline is when we ship 7.11. |
Thanks @EricDavisX @ph. We might have a way forward (see elastic/integrations#674 (comment)) but I'd still like @mtojek to weigh in on that as well. |
Like its predecessor, |
|
thanks @fearful-symmetry and @ycombinator, @EricDavisX we should be able to test the change in the upgrade scenario. |
@ph acknowledged. I'm not sure how to run a self-managed stack upgrade, but we can look for docs, and we can test the package in the snapshot repo. If it is good, we can then promote it to prod and test via cloud (the exact same scenario). We'll get it done. |
Yah, not sure how exactly we should test this either... |
The test team knows how to (theoretically) run a self-managed upgrade, but none of us are practiced at it. They attempted it and ran into errors, so while we triage and track that, we can just test the package and, if it passes muster, run the upgrade test via cloud as we did before, after merging the integration up. I'll broadcast test progress as we get it and we can coordinate here. |
I believe the team has tested, allow me to validate and we can promote the package (and test again 7.10) |
Hi @EricDavisX Steps followed:
Observations:
Further, on updating xpack entry to Build details:
Note: Thanks |
The above description is helpful. It highlights a few things we can follow up on, and a significant failure on our part to provide you with all of the required steps and working code to prove out the changes. There is a lot going on, but I'll try to itemize clearly what I see and who can help follow up.
... And so in this case, if it still fails to bake 0.9.2 (or 0.9.3) into the Default policy, this is a bug on the Kibana Fleet side, which would seem to indicate that Kibana is querying the prod storage to set up the policy even though a custom registry URL is set. Custom Registry is not a feature we support now, but it is in the future plans, so it would be good to document (and indeed we are using it during dev to test it out). Now that we know about this, we can modify our testing by creating a new policy to use as our 'base' test before we upgrade; we should ensure the new policy has the new package we want to test. Let me log this separately as a follow up, here:
When I look in the storage 'snapshot' branch... the snapshot manifest shows the 0.9.3 package is tied to 7.10, as it should be: and I confirmed we rolled out of the package storage cluster with 'snapshot' before testing on Feb 15th as: perhaps this is a caching issue? Or the environment was not Since the 0.9.3 System package didn't show up in Kibana per your test, even though we thought it should have, we'll need to re-run from the very beginning once the 0.9.3 'showing up in Kibana' problem is solved. @amolnater-qasource @dikshachauhan-qasource FYI: we want to get the new 0.9.3 System package set up in a policy, then run the detailed fields test with it in 7.10.x against Linux, Windows, and macOS to validate the package. Please report back separately, outside of this ticket, to keep scope contained. This ticket can remain open for the usage of the conditionals in the System package, which we hope will avoid the 'Metricbeat FAILED' errors in the logs and any problems with the running processes. Progress is blocked until we can get # 2 above resolved. @ph do we have anyone available to help with # 2 first, and then # 1 above? Perhaps @skh knows this area fairly well? |
@amolnater-qasource if you have the disable.TLS setting in place in your kibana.yml and you are seeing the error
|
@EricDavisX thanks for the update. I also checked on my side that the system package on the snapshot is indeed 0.9.3. So maybe it's a Fleet EPM issue; @skh or @jfsiii should be able to help with this area of code. |
Hi @EricDavisX Steps followed:
Observations:
We will close this issue as soon as it is available for cloud build. Thanks |
OK great, so we will open a PR and push system 0.9.3 to prod. |
It is ready for test on the latest shipped 7.10.x version! |
Hi @EricDavisX Please let us know if anything else is required. |
Kibana version: Kibana: 7.11.0 BC-5 Cloud environment
Host OS and Browser version: Windows 10, All
Preconditions:
Build Details:
Steps to reproduce:
Expected Result:
7.11 agent must be deployed successfully with 7.10 default policy on upgraded kibana cloud environment(7.10.2-7.11.0).
Screenshots: